Why ONNX Matters for .NET Developers
The .NET AI stack overwhelmingly points toward cloud APIs — Azure OpenAI, Azure AI Search, Cognitive Services. But there are scenarios where cloud inference isn’t viable:
- Regulated industries that prohibit sending data to third-party APIs
- Edge devices with intermittent or no internet connectivity
- High-volume inference where API costs become unsustainable
- Latency-sensitive paths where network round-trips are unacceptable
ONNX Runtime lets you run the same models locally in C#. No Python environment. No Docker containers running Flask APIs. Pure .NET inference.
Running an Embedding Model
Embedding models convert text to numeric vectors. They’re essential for semantic search, RAG, and similarity matching. Here’s how to run one locally.
Step 1: Get the Model
Download an ONNX-format embedding model. The all-MiniLM-L6-v2 model from Sentence Transformers is a good starting point — 80MB, 384 dimensions, fast inference.
# Using the HuggingFace CLI (install with pip install huggingface-hub)
huggingface-cli download sentence-transformers/all-MiniLM-L6-v2 --local-dir models/minilm
Or download the ONNX version directly from https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2 (Files tab → onnx folder).
You need three files:
- model.onnx — The model weights
- vocab.txt — The WordPiece vocabulary (this is the file BertTokenizer loads)
- special_tokens_map.json — Special token mappings
Step 2: Project Setup
dotnet new console -n OnnxEmbeddings
cd OnnxEmbeddings
dotnet add package Microsoft.ML.OnnxRuntime
dotnet add package Microsoft.ML.Tokenizers
Step 3: Tokenize and Embed
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;
using Microsoft.ML.Tokenizers;
// Load the tokenizer (BertTokenizer reads a WordPiece vocab file, not tokenizer.json)
var tokenizer = BertTokenizer.Create("models/minilm/vocab.txt");
// Load the ONNX model — reuse this across requests
using var session = new InferenceSession("models/minilm/model.onnx");
string[] texts =
[
"Semantic Kernel is a .NET AI orchestration framework",
"The weather in Seattle is rainy in November",
"Azure OpenAI provides GPT models as a service"
];
foreach (var text in texts)
{
var embedding = GenerateEmbedding(text, tokenizer, session);
Console.WriteLine($"'{text[..Math.Min(40, text.Length)]}...' → [{embedding[0]:F4}, {embedding[1]:F4}, ...] ({embedding.Length} dims)");
}
static float[] GenerateEmbedding(string text, BertTokenizer tokenizer, InferenceSession session)
{
// Tokenize — EncodeToIds adds the [CLS]/[SEP] special tokens automatically
var inputIds = tokenizer.EncodeToIds(text).Select(id => (long)id).ToArray();
var attentionMask = Enumerable.Repeat(1L, inputIds.Length).ToArray();
var tokenTypeIds = new long[inputIds.Length]; // All zeros for single-sentence input
// Create tensors with shape [batch = 1, sequence length]
var shape = new[] { 1, inputIds.Length };
var inputIdsTensor = new DenseTensor<long>(inputIds, shape);
var attentionTensor = new DenseTensor<long>(attentionMask, shape);
var tokenTypeTensor = new DenseTensor<long>(tokenTypeIds, shape);
// Run inference
var inputs = new List<NamedOnnxValue>
{
NamedOnnxValue.CreateFromTensor("input_ids", inputIdsTensor),
NamedOnnxValue.CreateFromTensor("attention_mask", attentionTensor),
NamedOnnxValue.CreateFromTensor("token_type_ids", tokenTypeTensor)
};
using var results = session.Run(inputs);
// Extract embeddings — mean pooling over token embeddings
var lastHiddenState = results.First().AsTensor<float>();
var embeddingDim = lastHiddenState.Dimensions[2];
var seqLength = lastHiddenState.Dimensions[1];
var pooled = new float[embeddingDim];
for (var d = 0; d < embeddingDim; d++)
{
float sum = 0;
for (var t = 0; t < seqLength; t++)
sum += lastHiddenState[0, t, d];
pooled[d] = sum / seqLength;
}
// L2 normalize
var norm = MathF.Sqrt(pooled.Sum(x => x * x));
for (var i = 0; i < pooled.Length; i++)
pooled[i] /= norm;
return pooled;
}
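The simple average in GenerateEmbedding is correct here because every position is a real token (the attention mask is all ones). If you batch several sentences and pad them to a common length, the pooling must skip padding positions. A minimal sketch of mask-aware mean pooling, assuming the hidden state has been flattened to a row-major [seqLength, embeddingDim] array; the helper name MaskedMeanPool is illustrative, not part of any library:

```csharp
// Mask-aware mean pooling: average only positions where attentionMask == 1.
// hiddenState is assumed flattened [seqLength, embeddingDim], row-major.
static float[] MaskedMeanPool(float[] hiddenState, long[] attentionMask, int seqLength, int embeddingDim)
{
    var pooled = new float[embeddingDim];
    float realTokens = 0;
    for (var t = 0; t < seqLength; t++)
    {
        if (attentionMask[t] == 0) continue; // skip padding positions
        realTokens++;
        for (var d = 0; d < embeddingDim; d++)
            pooled[d] += hiddenState[t * embeddingDim + d];
    }
    for (var d = 0; d < embeddingDim; d++)
        pooled[d] /= realTokens; // divide by the real-token count, not seqLength
    return pooled;
}
```

With an all-ones mask this reduces exactly to the simple mean used above, so it is a safe drop-in if you later add batching.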
Step 4: Semantic Similarity
With embeddings, you can compute similarity between any two pieces of text:
static float CosineSimilarity(float[] a, float[] b)
{
float dot = 0, normA = 0, normB = 0;
for (var i = 0; i < a.Length; i++)
{
dot += a[i] * b[i];
normA += a[i] * a[i];
normB += b[i] * b[i];
}
return dot / (MathF.Sqrt(normA) * MathF.Sqrt(normB));
}
var query = GenerateEmbedding("AI framework for C# developers", tokenizer, session);
var doc1 = GenerateEmbedding("Semantic Kernel orchestrates AI tasks in .NET", tokenizer, session);
var doc2 = GenerateEmbedding("Recipe for chocolate chip cookies", tokenizer, session);
Console.WriteLine($"Query ↔ AI doc: {CosineSimilarity(query, doc1):F4}"); // ~0.75
Console.WriteLine($"Query ↔ Cookie doc: {CosineSimilarity(query, doc2):F4}"); // ~0.15
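Because GenerateEmbedding L2-normalizes its output, both norm terms in the cosine formula are 1, and the similarity reduces to a plain dot product. A minimal sketch:

```csharp
// For unit-length vectors, cosine similarity equals the dot product
static float DotProduct(float[] a, float[] b)
{
    float dot = 0;
    for (var i = 0; i < a.Length; i++)
        dot += a[i] * b[i];
    return dot;
}
```

Skipping the two square roots per comparison is a small win per call, but it adds up when ranking one query against thousands of stored vectors.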
Running Phi-3 Locally (Text Generation)
ONNX Runtime GenAI enables running small language models locally for text generation.
Setup
dotnet new console -n LocalLLM
cd LocalLLM
dotnet add package Microsoft.ML.OnnxRuntimeGenAI
Download the Phi-3 mini ONNX model:
huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx \
--local-dir models/phi3 \
--include cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/*
Generate Text
using Microsoft.ML.OnnxRuntimeGenAI;
var modelPath = "models/phi3/cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4";
using var model = new Model(modelPath);
using var tokenizer = new Tokenizer(model);
var prompt = "<|user|>\nExplain dependency injection in C# in 3 sentences.<|end|>\n<|assistant|>\n";
var sequences = tokenizer.Encode(prompt);
using var generatorParams = new GeneratorParams(model);
generatorParams.SetSearchOption("max_length", 256);
generatorParams.SetSearchOption("temperature", 0.7);
generatorParams.SetInputSequences(sequences);
using var generator = new Generator(model, generatorParams);
// TokenizerStream decodes one token at a time with correct spacing
using var tokenizerStream = tokenizer.CreateStream();
Console.Write("Phi-3: ");
while (!generator.IsDone())
{
generator.ComputeLogits(); // in GenAI 0.4.0+ this call was removed; GenerateNextToken() does both steps
generator.GenerateNextToken();
var token = generator.GetSequence(0)[^1];
Console.Write(tokenizerStream.Decode(token));
}
Console.WriteLine();
Performance Expectations
| Model | Size | CPU Speed | GPU Speed | RAM |
|---|---|---|---|---|
| all-MiniLM-L6-v2 (embedding) | 80 MB | ~5ms/embed | ~1ms/embed | 200 MB |
| Phi-3 mini int4 (generation) | 2.3 GB | ~15 tok/sec | ~60 tok/sec | 3.5 GB |
| Phi-3 small int4 (generation) | 4.2 GB | ~8 tok/sec | ~40 tok/sec | 6 GB |
CPU inference is practical for embeddings and classification. For text generation, you’ll want a GPU for interactive use cases.
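To put an ONNX Runtime session on the GPU, attach an execution provider through SessionOptions when you create it. A sketch assuming the Microsoft.ML.OnnxRuntime.Gpu package (which replaces the CPU package) and a machine with CUDA installed:

```csharp
using Microsoft.ML.OnnxRuntime;

// Requires the Microsoft.ML.OnnxRuntime.Gpu package instead of Microsoft.ML.OnnxRuntime
var options = new SessionOptions();
// Throws at session creation if the CUDA runtime is missing; there is no silent
// fallback, so wrap in try/catch if GPU availability is uncertain
options.AppendExecutionProvider_CUDA(0);
using var session = new InferenceSession("models/minilm/model.onnx", options);
```

Everything downstream of session creation is unchanged; the same Run() calls execute on the GPU.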
Integrating ONNX with Semantic Kernel
Wrap an ONNX embedding model as a Semantic Kernel embedding service:
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.Embeddings;
public class OnnxEmbeddingService : ITextEmbeddingGenerationService
{
private readonly InferenceSession _session;
private readonly BertTokenizer _tokenizer;
public IReadOnlyDictionary<string, object?> Attributes { get; } =
new Dictionary<string, object?>();
public OnnxEmbeddingService(string modelPath, string tokenizerPath)
{
_session = new InferenceSession(modelPath);
_tokenizer = BertTokenizer.Create(tokenizerPath);
}
public Task<IList<ReadOnlyMemory<float>>> GenerateEmbeddingsAsync(
IList<string> data,
Kernel? kernel = null,
CancellationToken cancellationToken = default)
{
IList<ReadOnlyMemory<float>> embeddings = data
.Select(text => new ReadOnlyMemory<float>(
GenerateEmbedding(text, _tokenizer, _session)))
.ToList();
return Task.FromResult(embeddings);
}
// GenerateEmbedding method from earlier example
}
Register it with Semantic Kernel:
// Use local ONNX for embeddings, Azure OpenAI for chat.
// Services must be registered on the builder before Build();
// kernel.Services is a read-only IServiceProvider afterwards.
// (AddSingleton needs: using Microsoft.Extensions.DependencyInjection;)
var builder = Kernel.CreateBuilder();
builder.AddAzureOpenAIChatCompletion("chat-deployment", endpoint, credential);
builder.Services.AddSingleton<ITextEmbeddingGenerationService>(
    new OnnxEmbeddingService("models/minilm/model.onnx", "models/minilm/vocab.txt"));
var kernel = builder.Build();
This hybrid approach gives you the best of both worlds: cloud LLM for reasoning, local model for embeddings (no per-embedding API cost).
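Once registered, the embedding service resolves like any other Semantic Kernel service. A hedged usage sketch, assuming the Microsoft.SemanticKernel.Embeddings abstractions and the model files from the earlier steps:

```csharp
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.Embeddings;

// Resolve the locally backed embedding service from the kernel
var embeddingService = kernel.GetRequiredService<ITextEmbeddingGenerationService>();

// GenerateEmbeddingAsync is the single-input convenience extension
// over GenerateEmbeddingsAsync
ReadOnlyMemory<float> vector = await embeddingService.GenerateEmbeddingAsync(
    "How do I run models locally in .NET?");

Console.WriteLine($"{vector.Length} dimensions, computed locally at no API cost");
```

Any SK component that consumes ITextEmbeddingGenerationService, such as a vector-store connector, now routes its embedding calls through the local ONNX model.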
When to Use Each Approach
| Scenario | Recommended | Why |
|---|---|---|
| Chat/reasoning | Azure OpenAI | Frontier models far exceed local model quality |
| Embeddings (high volume) | ONNX local | Save $0.00002/embed × millions = significant savings |
| Classification | ONNX or ML.NET | Fast, cheap, offline capable |
| Regulated data | ONNX local | Data never leaves your server |
| Prototyping | Azure OpenAI | Faster to iterate, no model management |
Next Steps
- What is ML.NET? — ML.NET fundamentals for .NET developers
- ML.NET Sentiment Analysis Tutorial — Train and deploy a classifier
- Semantic Kernel Memory and Vector Stores — Use ONNX embeddings in a RAG pipeline