
Run ONNX Models in .NET: HuggingFace Embeddings & Phi-3 Without Azure

Verified Apr 2026 · Intermediate · .NET 10 · Microsoft.ML.OnnxRuntime 1.20.0 · Microsoft.ML.OnnxRuntimeGenAI 0.6.0
By Rajesh Mishra · Mar 12, 2026 · 11 min read
In 30 Seconds

Guide to running ONNX models in .NET for offline AI inference. Covers: loading HuggingFace embedding models, running Phi-3 locally with ONNX Runtime GenAI, integrating ONNX embeddings with Semantic Kernel, and performance benchmarks for CPU vs GPU inference.

Why ONNX Matters for .NET Developers

The .NET AI stack overwhelmingly points toward cloud APIs — Azure OpenAI, Azure AI Search, Cognitive Services. But there are scenarios where cloud inference isn’t viable:

  • Regulated industries that prohibit sending data to third-party APIs
  • Edge devices with intermittent or no internet connectivity
  • High-volume inference where API costs become unsustainable
  • Latency-sensitive paths where network round-trips are unacceptable

ONNX Runtime lets you run the same models locally in C#. No Python environment. No Docker containers running Flask APIs. Pure .NET inference.

[Figure] ONNX local inference pipeline — the model loads once into an InferenceSession; each request tokenizes input, runs inference, and extracts the output tensor.

Running an Embedding Model

Embedding models convert text to numeric vectors. They’re essential for semantic search, RAG, and similarity matching. Here’s how to run one locally.

Step 1: Get the Model

Download an ONNX-format embedding model. The all-MiniLM-L6-v2 model from Sentence Transformers is a good starting point — 80MB, 384 dimensions, fast inference.

# Using the HuggingFace CLI (install with pip install huggingface-hub)
huggingface-cli download sentence-transformers/all-MiniLM-L6-v2 --local-dir models/minilm

Or download the ONNX version directly from https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2 (Files tab → onnx folder).

You need two files:

  • model.onnx — The model weights
  • vocab.txt — The WordPiece vocabulary (Microsoft.ML.Tokenizers' BertTokenizer loads this file, not the HuggingFace tokenizer.json)

Step 2: Project Setup

dotnet new console -n OnnxEmbeddings
cd OnnxEmbeddings
dotnet add package Microsoft.ML.OnnxRuntime
dotnet add package Microsoft.ML.Tokenizers

Step 3: Tokenize and Embed

using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;
using Microsoft.ML.Tokenizers;

// Load the tokenizer (BertTokenizer consumes the WordPiece vocab.txt)
var tokenizer = BertTokenizer.Create("models/minilm/vocab.txt");

// Load the ONNX model — reuse this across requests
using var session = new InferenceSession("models/minilm/model.onnx");

string[] texts =
[
    "Semantic Kernel is a .NET AI orchestration framework",
    "The weather in Seattle is rainy in November",
    "Azure OpenAI provides GPT models as a service"
];

foreach (var text in texts)
{
    var embedding = GenerateEmbedding(text, tokenizer, session);
    Console.WriteLine($"'{text[..Math.Min(40, text.Length)]}...' → [{embedding[0]:F4}, {embedding[1]:F4}, ... ] ({embedding.Length} dims)");
}

static float[] GenerateEmbedding(string text, BertTokenizer tokenizer, InferenceSession session)
{
    // Tokenize: EncodeToIds returns WordPiece token ids (with [CLS]/[SEP] added)
    var inputIds = tokenizer.EncodeToIds(text).Select(id => (long)id).ToArray();
    var attentionMask = Enumerable.Repeat(1L, inputIds.Length).ToArray();
    var tokenTypeIds = new long[inputIds.Length]; // All zeros for single-sentence input

    // Create tensors: shape is [batch = 1, sequence_length]
    var shape = new[] { 1, inputIds.Length };
    var inputIdsTensor = new DenseTensor<long>(inputIds, shape);
    var attentionTensor = new DenseTensor<long>(attentionMask, shape);
    var tokenTypeTensor = new DenseTensor<long>(tokenTypeIds, shape);

    // Run inference
    var inputs = new List<NamedOnnxValue>
    {
        NamedOnnxValue.CreateFromTensor("input_ids", inputIdsTensor),
        NamedOnnxValue.CreateFromTensor("attention_mask", attentionTensor),
        NamedOnnxValue.CreateFromTensor("token_type_ids", tokenTypeTensor)
    };

    using var results = session.Run(inputs);

    // Extract embeddings — mean pooling over token embeddings
    var lastHiddenState = results.First().AsTensor<float>();
    var embeddingDim = lastHiddenState.Dimensions[2];
    var seqLength = lastHiddenState.Dimensions[1];

    var pooled = new float[embeddingDim];
    for (var d = 0; d < embeddingDim; d++)
    {
        float sum = 0;
        for (var t = 0; t < seqLength; t++)
            sum += lastHiddenState[0, t, d];
        pooled[d] = sum / seqLength;
    }

    // L2 normalize
    var norm = MathF.Sqrt(pooled.Sum(x => x * x));
    for (var i = 0; i < pooled.Length; i++)
        pooled[i] /= norm;

    return pooled;
}
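The mean-pooling and normalization arithmetic above is language-agnostic, so it is easy to sanity-check outside .NET. A minimal numeric sketch (plain Python; the 2×4 hidden states are made-up values standing in for the real seq_len × 384 tensor):

```python
# Illustrative: 2 tokens, 4-dim hidden states (a real run is seq_len x 384)
last_hidden = [
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 2.0, 1.0, 0.0],
]
seq_len = len(last_hidden)
dim = len(last_hidden[0])

# Mean pooling: average each dimension across the token axis
pooled = [sum(tok[d] for tok in last_hidden) / seq_len for d in range(dim)]

# L2 normalize: scale to unit length so dot products become cosine similarities
norm = sum(x * x for x in pooled) ** 0.5
embedding = [x / norm for x in pooled]

print(pooled)     # [2.0, 2.0, 2.0, 2.0]
print(embedding)  # [0.5, 0.5, 0.5, 0.5]
```

This mirrors the two loops in GenerateEmbedding: the pooled vector averages over tokens, then the whole vector is divided by its Euclidean norm.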

Step 4: Semantic Similarity

With embeddings, you can compute similarity between any two pieces of text:

static float CosineSimilarity(float[] a, float[] b)
{
    float dot = 0, normA = 0, normB = 0;
    for (var i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (MathF.Sqrt(normA) * MathF.Sqrt(normB));
}

var query = GenerateEmbedding("AI framework for C# developers", tokenizer, session);
var doc1 = GenerateEmbedding("Semantic Kernel orchestrates AI tasks in .NET", tokenizer, session);
var doc2 = GenerateEmbedding("Recipe for chocolate chip cookies", tokenizer, session);

Console.WriteLine($"Query ↔ AI doc:     {CosineSimilarity(query, doc1):F4}");  // ~0.75
Console.WriteLine($"Query ↔ Cookie doc: {CosineSimilarity(query, doc2):F4}");  // ~0.15
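One consequence of the L2 normalization in GenerateEmbedding: for unit-length vectors the denominator in CosineSimilarity is 1, so cosine similarity collapses to a plain dot product. A quick check with illustrative unit vectors (plain Python):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Illustrative unit-length vectors (0.6^2 + 0.8^2 = 1)
a = [0.6, 0.8]
b = [0.8, 0.6]

dot = sum(x * y for x, y in zip(a, b))
print(dot)           # ~0.96
print(cosine(a, b))  # identical, because both norms are 1
```

In practice this means you can skip the two norm accumulators entirely for normalized embeddings, which matters when scoring a query against thousands of stored vectors.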

Running Phi-3 Locally (Text Generation)

ONNX Runtime GenAI enables running small language models locally for text generation.

Setup

dotnet new console -n LocalLLM
cd LocalLLM
dotnet add package Microsoft.ML.OnnxRuntimeGenAI

Download the Phi-3 mini ONNX model:

huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx \
  --local-dir models/phi3 \
  --include cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/*

Generate Text

using Microsoft.ML.OnnxRuntimeGenAI;

var modelPath = "models/phi3/cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4";

using var model = new Model(modelPath);
using var tokenizer = new Tokenizer(model);

var prompt = "<|user|>\nExplain dependency injection in C# in 3 sentences.<|end|>\n<|assistant|>\n";

var sequences = tokenizer.Encode(prompt);
using var tokenizerStream = tokenizer.CreateStream();

using var generatorParams = new GeneratorParams(model);
generatorParams.SetSearchOption("max_length", 256);
generatorParams.SetSearchOption("do_sample", true); // temperature is ignored unless sampling is on
generatorParams.SetSearchOption("temperature", 0.7);

using var generator = new Generator(model, generatorParams);
generator.AppendTokenSequences(sequences);

Console.Write("Phi-3: ");
while (!generator.IsDone())
{
    // GenerateNextToken runs the forward pass and samples the next token
    generator.GenerateNextToken();

    // TokenizerStream decodes incrementally, so multi-token characters print correctly
    Console.Write(tokenizerStream.Decode(generator.GetSequence(0)[^1]));
}
Console.WriteLine();

Performance Expectations

| Model | Size | CPU Speed | GPU Speed | RAM |
|---|---|---|---|---|
| all-MiniLM-L6-v2 (embedding) | 80 MB | ~5 ms/embed | ~1 ms/embed | 200 MB |
| Phi-3 mini int4 (generation) | 2.3 GB | ~15 tok/sec | ~60 tok/sec | 3.5 GB |
| Phi-3 small int4 (generation) | 4.2 GB | ~8 tok/sec | ~40 tok/sec | 6 GB |

CPU inference is practical for embeddings and classification. For text generation, you’ll want a GPU for interactive use cases.
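Translating those per-item figures into throughput makes the recommendation concrete (plain Python, using the table's illustrative numbers):

```python
# Embeddings: ~5 ms each on CPU -> hundreds per second per worker
cpu_ms_per_embed = 5
embeds_per_sec = 1000 / cpu_ms_per_embed
print(embeds_per_sec)  # 200.0

# Generation: ~15 tok/sec on CPU -> a 200-token answer takes ~13 s,
# fine for batch jobs, too slow for interactive chat
cpu_tok_per_sec = 15
seconds_per_answer = 200 / cpu_tok_per_sec
print(round(seconds_per_answer, 1))  # 13.3
```

A couple hundred embeddings per second per core is plenty for most ingestion pipelines; a thirteen-second chat reply is not, which is why the GPU column matters only for generation.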

Integrating ONNX with Semantic Kernel

Wrap an ONNX embedding model as a Semantic Kernel embedding service:

using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.Tokenizers;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.Embeddings;

public class OnnxEmbeddingService : ITextEmbeddingGenerationService
{
    private readonly InferenceSession _session;
    private readonly BertTokenizer _tokenizer;

    public IReadOnlyDictionary<string, object?> Attributes { get; } =
        new Dictionary<string, object?>();

    public OnnxEmbeddingService(string modelPath, string tokenizerPath)
    {
        _session = new InferenceSession(modelPath);
        _tokenizer = BertTokenizer.Create(tokenizerPath);
    }

    public Task<IList<ReadOnlyMemory<float>>> GenerateEmbeddingsAsync(
        IList<string> data,
        Kernel? kernel = null,
        CancellationToken cancellationToken = default)
    {
        IList<ReadOnlyMemory<float>> embeddings = data
            .Select(text => new ReadOnlyMemory<float>(
                GenerateEmbedding(text, _tokenizer, _session)))
            .ToList();

        return Task.FromResult(embeddings);
    }

    // GenerateEmbedding method from earlier example
}

Register it with Semantic Kernel:

var builder = Kernel.CreateBuilder();
builder.AddAzureOpenAIChatCompletion("chat-deployment", endpoint, credential);

// Use local ONNX for embeddings, Azure OpenAI for chat.
// Register before Build(): once the kernel is built, its Services are read-only.
builder.Services.AddSingleton<ITextEmbeddingGenerationService>(
    new OnnxEmbeddingService("models/minilm/model.onnx", "models/minilm/vocab.txt"));

var kernel = builder.Build();

This hybrid approach gives you the best of both worlds: cloud LLM for reasoning, local model for embeddings (no per-embedding API cost).

When to Use Each Approach

| Scenario | Recommended | Why |
|---|---|---|
| Chat/reasoning | Azure OpenAI | Frontier models far exceed local model quality |
| Embeddings (high volume) | ONNX local | Save $0.00002/embed × millions = significant savings |
| Classification | ONNX or ML.NET | Fast, cheap, offline capable |
| Regulated data | ONNX local | Data never leaves your server |
| Prototyping | Azure OpenAI | Faster to iterate, no model management |
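The "high volume" row is easy to sanity-check with back-of-the-envelope arithmetic (plain Python; the per-embedding price is the table's own illustrative figure, and the monthly volume is a made-up workload):

```python
price_per_embedding = 0.00002      # illustrative per-embed price from the table above
embeddings_per_month = 50_000_000  # hypothetical high-volume workload

api_cost = price_per_embedding * embeddings_per_month
print(f"${api_cost:,.0f}/month")   # API charges vs ~$0 marginal cost for local inference
```

At that scale the local model pays for its operational overhead quickly; below a few hundred thousand embeddings per month, the API bill is usually noise.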


⚠ Production Considerations

  • ONNX model files can be large (100MB-4GB). Don't include them in your Docker image build. Download them at startup or mount them as volumes. A 2GB model in your image makes every deployment painfully slow.
  • ONNX Runtime loads the full model into memory on InferenceSession creation. A 1GB model needs ~1.2GB RAM. Plan your container memory limits accordingly, and reuse the InferenceSession across requests — don't create a new one per inference call.


🧠 Architect’s Note

ONNX is the escape hatch from cloud dependency. When your Azure OpenAI budget runs out, when regulations ban sending data to external APIs, or when you need sub-millisecond inference — ONNX models running locally in .NET give you an alternative that no cloud provider can take away.

AI-Friendly Summary


Key Takeaways

  • ONNX Runtime runs models from PyTorch/TensorFlow in .NET without Python
  • Use Microsoft.ML.OnnxRuntime for classification and embedding models
  • Use Microsoft.ML.OnnxRuntimeGenAI for text generation (Phi-3, Mistral)
  • ONNX embedding models can implement SK's ITextEmbeddingGenerationService
  • CPU inference is viable for embeddings; GPU accelerates generation models

Implementation Checklist

  • Install Microsoft.ML.OnnxRuntime or OnnxRuntimeGenAI NuGet package
  • Download ONNX model files (model.onnx, tokenizer.json)
  • Create InferenceSession with model path
  • Prepare input tensors matching model's expected shapes
  • Run inference and extract output tensors

Frequently Asked Questions

What is ONNX?

ONNX (Open Neural Network Exchange) is an open format for representing machine learning models. A model trained in PyTorch or TensorFlow can be exported to ONNX and then run in any language with an ONNX Runtime — including C# and .NET. It's the universal model format.

When should I use ONNX instead of Azure OpenAI?

Use ONNX when you need: offline capability (no internet required), zero inference costs (no API charges), data privacy (data never leaves your machine), low latency (no network round-trip), or deterministic behavior (same input always produces same output). Use cloud APIs when you need frontier model quality or language generation.

Can I use ONNX models with Semantic Kernel?

Yes. You can wrap ONNX inference in a Semantic Kernel plugin or implement ITextEmbeddingGenerationService with an ONNX embedding model. Semantic Kernel doesn't care where embeddings come from — it works with any implementation of the embedding interface.



#ONNX #ONNX Runtime #.NET AI #Local AI #HuggingFace