
Run ONNX Models in .NET: HuggingFace Embeddings & Phi-3 Without Azure

Verified Apr 2026 · Intermediate · .NET 10 · Microsoft.ML.OnnxRuntime 1.20.0 · Microsoft.ML.OnnxRuntimeGenAI 0.6.0
By Rajesh Mishra · Mar 12, 2026 · 11 min read
In 30 Seconds

Guide to running ONNX models in .NET for offline AI inference. Covers: loading HuggingFace embedding models, running Phi-3 locally with ONNX Runtime GenAI, integrating ONNX embeddings with Semantic Kernel, and performance benchmarks for CPU vs GPU inference.

Why ONNX Matters for .NET Developers

The .NET AI stack overwhelmingly points toward cloud APIs — Azure OpenAI, Azure AI Search, Cognitive Services. But there are scenarios where cloud inference isn’t viable:

  • Regulated industries that prohibit sending data to third-party APIs
  • Edge devices with intermittent or no internet connectivity
  • High-volume inference where API costs become unsustainable
  • Latency-sensitive paths where network round-trips are unacceptable

ONNX Runtime lets you run the same models locally in C#. No Python environment. No Docker containers running Flask APIs. Pure .NET inference.

[Figure] ONNX local inference pipeline — the model loads once into an InferenceSession; each request tokenizes input, runs inference, and extracts the output tensor.

Running an Embedding Model

Embedding models convert text to numeric vectors. They’re essential for semantic search, RAG, and similarity matching. Here’s how to run one locally.

Step 1: Get the Model

Download an ONNX-format embedding model. The all-MiniLM-L6-v2 model from Sentence Transformers is a good starting point — 80MB, 384 dimensions, fast inference.

# Using the HuggingFace CLI (install with pip install huggingface-hub)
huggingface-cli download sentence-transformers/all-MiniLM-L6-v2 --local-dir models/minilm

Or download the ONNX version directly from https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2 (Files tab → onnx folder).

You need two files:

  • model.onnx — The model weights
  • vocab.txt — The WordPiece vocabulary (Microsoft.ML.Tokenizers' BertTokenizer loads this file, not the HuggingFace tokenizer.json)

Step 2: Project Setup

dotnet new console -n OnnxEmbeddings
cd OnnxEmbeddings
dotnet add package Microsoft.ML.OnnxRuntime
dotnet add package Microsoft.ML.Tokenizers

Step 3: Tokenize and Embed

using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;
using Microsoft.ML.Tokenizers;

// Load the tokenizer (BertTokenizer consumes the WordPiece vocab.txt)
var tokenizer = BertTokenizer.Create("models/minilm/vocab.txt");

// Load the ONNX model — reuse this across requests
using var session = new InferenceSession("models/minilm/model.onnx");

string[] texts =
[
    "Semantic Kernel is a .NET AI orchestration framework",
    "The weather in Seattle is rainy in November",
    "Azure OpenAI provides GPT models as a service"
];

foreach (var text in texts)
{
    var embedding = GenerateEmbedding(text, tokenizer, session);
    Console.WriteLine($"'{text[..Math.Min(40, text.Length)]}...' → [{embedding[0]:F4}, {embedding[1]:F4}, ... ] ({embedding.Length} dims)");
}

static float[] GenerateEmbedding(string text, BertTokenizer tokenizer, InferenceSession session)
{
    // Tokenize: EncodeToIds returns WordPiece token ids (with [CLS]/[SEP] added)
    var inputIds = tokenizer.EncodeToIds(text).Select(id => (long)id).ToArray();
    var attentionMask = Enumerable.Repeat(1L, inputIds.Length).ToArray();
    var tokenTypeIds = new long[inputIds.Length]; // All zeros for single-sentence input

    // Create tensors: shape is [batch = 1, sequence_length]
    var shape = new[] { 1, inputIds.Length };
    var inputIdsTensor = new DenseTensor<long>(inputIds, shape);
    var attentionTensor = new DenseTensor<long>(attentionMask, shape);
    var tokenTypeTensor = new DenseTensor<long>(tokenTypeIds, shape);

    // Run inference
    var inputs = new List<NamedOnnxValue>
    {
        NamedOnnxValue.CreateFromTensor("input_ids", inputIdsTensor),
        NamedOnnxValue.CreateFromTensor("attention_mask", attentionTensor),
        NamedOnnxValue.CreateFromTensor("token_type_ids", tokenTypeTensor)
    };

    using var results = session.Run(inputs);

    // Extract embeddings — mean pooling over token embeddings
    var lastHiddenState = results.First().AsTensor<float>();
    var embeddingDim = lastHiddenState.Dimensions[2];
    var seqLength = lastHiddenState.Dimensions[1];

    var pooled = new float[embeddingDim];
    for (var d = 0; d < embeddingDim; d++)
    {
        float sum = 0;
        for (var t = 0; t < seqLength; t++)
            sum += lastHiddenState[0, t, d];
        pooled[d] = sum / seqLength;
    }

    // L2 normalize
    var norm = MathF.Sqrt(pooled.Sum(x => x * x));
    for (var i = 0; i < pooled.Length; i++)
        pooled[i] /= norm;

    return pooled;
}
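The mean-pooling and normalization arithmetic above is language-agnostic, so it is easy to sanity-check outside .NET. A minimal numeric sketch (plain Python; the 2×4 hidden states are made-up values standing in for the real seq_len × 384 tensor):

```python
# Illustrative: 2 tokens, 4-dim hidden states (a real run is seq_len x 384)
last_hidden = [
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 2.0, 1.0, 0.0],
]
seq_len = len(last_hidden)
dim = len(last_hidden[0])

# Mean pooling: average each dimension across the token axis
pooled = [sum(tok[d] for tok in last_hidden) / seq_len for d in range(dim)]

# L2 normalize: scale to unit length so dot products become cosine similarities
norm = sum(x * x for x in pooled) ** 0.5
embedding = [x / norm for x in pooled]

print(pooled)     # [2.0, 2.0, 2.0, 2.0]
print(embedding)  # [0.5, 0.5, 0.5, 0.5]
```

This mirrors the two loops in GenerateEmbedding: the pooled vector averages over tokens, then the whole vector is divided by its Euclidean norm.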

Step 4: Semantic Similarity

With embeddings, you can compute similarity between any two pieces of text:

static float CosineSimilarity(float[] a, float[] b)
{
    float dot = 0, normA = 0, normB = 0;
    for (var i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (MathF.Sqrt(normA) * MathF.Sqrt(normB));
}

var query = GenerateEmbedding("AI framework for C# developers", tokenizer, session);
var doc1 = GenerateEmbedding("Semantic Kernel orchestrates AI tasks in .NET", tokenizer, session);
var doc2 = GenerateEmbedding("Recipe for chocolate chip cookies", tokenizer, session);

Console.WriteLine($"Query ↔ AI doc:     {CosineSimilarity(query, doc1):F4}");  // ~0.75
Console.WriteLine($"Query ↔ Cookie doc: {CosineSimilarity(query, doc2):F4}");  // ~0.15
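One consequence of the L2 normalization in GenerateEmbedding: for unit-length vectors the denominator in CosineSimilarity is 1, so cosine similarity collapses to a plain dot product. A quick check with illustrative unit vectors (plain Python):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Illustrative unit-length vectors (0.6^2 + 0.8^2 = 1)
a = [0.6, 0.8]
b = [0.8, 0.6]

dot = sum(x * y for x, y in zip(a, b))
print(dot)           # ~0.96
print(cosine(a, b))  # identical, because both norms are 1
```

In practice this means you can skip the two norm accumulators entirely for normalized embeddings, which matters when scoring a query against thousands of stored vectors.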

Running Phi-3 Locally (Text Generation)

ONNX Runtime GenAI enables running small language models locally for text generation.

Setup

dotnet new console -n LocalLLM
cd LocalLLM
dotnet add package Microsoft.ML.OnnxRuntimeGenAI

Download the Phi-3 mini ONNX model:

huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx \
  --local-dir models/phi3 \
  --include cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/*

Generate Text

using Microsoft.ML.OnnxRuntimeGenAI;

var modelPath = "models/phi3/cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4";

using var model = new Model(modelPath);
using var tokenizer = new Tokenizer(model);

var prompt = "<|user|>\nExplain dependency injection in C# in 3 sentences.<|end|>\n<|assistant|>\n";

var sequences = tokenizer.Encode(prompt);
using var tokenizerStream = tokenizer.CreateStream();

using var generatorParams = new GeneratorParams(model);
generatorParams.SetSearchOption("max_length", 256);
generatorParams.SetSearchOption("do_sample", true); // temperature is ignored unless sampling is on
generatorParams.SetSearchOption("temperature", 0.7);

using var generator = new Generator(model, generatorParams);
generator.AppendTokenSequences(sequences);

Console.Write("Phi-3: ");
while (!generator.IsDone())
{
    // GenerateNextToken runs the forward pass and samples the next token
    generator.GenerateNextToken();

    // TokenizerStream decodes incrementally, so multi-token characters print correctly
    Console.Write(tokenizerStream.Decode(generator.GetSequence(0)[^1]));
}
Console.WriteLine();

Performance Expectations

| Model | Size | CPU Speed | GPU Speed | RAM |
|---|---|---|---|---|
| all-MiniLM-L6-v2 (embedding) | 80 MB | ~5 ms/embed | ~1 ms/embed | 200 MB |
| Phi-3 mini int4 (generation) | 2.3 GB | ~15 tok/sec | ~60 tok/sec | 3.5 GB |
| Phi-3 small int4 (generation) | 4.2 GB | ~8 tok/sec | ~40 tok/sec | 6 GB |

CPU inference is practical for embeddings and classification. For text generation, you’ll want a GPU for interactive use cases.
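Translating those per-item figures into throughput makes the recommendation concrete (plain Python, using the table's illustrative numbers):

```python
# Embeddings: ~5 ms each on CPU -> hundreds per second per worker
cpu_ms_per_embed = 5
embeds_per_sec = 1000 / cpu_ms_per_embed
print(embeds_per_sec)  # 200.0

# Generation: ~15 tok/sec on CPU -> a 200-token answer takes ~13 s,
# fine for batch jobs, too slow for interactive chat
cpu_tok_per_sec = 15
seconds_per_answer = 200 / cpu_tok_per_sec
print(round(seconds_per_answer, 1))  # 13.3
```

A couple hundred embeddings per second per core is plenty for most ingestion pipelines; a thirteen-second chat reply is not, which is why the GPU column matters only for generation.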

Integrating ONNX with Semantic Kernel

Wrap an ONNX embedding model as a Semantic Kernel embedding service:

using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.Tokenizers;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.Embeddings;

public class OnnxEmbeddingService : ITextEmbeddingGenerationService
{
    private readonly InferenceSession _session;
    private readonly BertTokenizer _tokenizer;

    public IReadOnlyDictionary<string, object?> Attributes { get; } =
        new Dictionary<string, object?>();

    public OnnxEmbeddingService(string modelPath, string tokenizerPath)
    {
        _session = new InferenceSession(modelPath);
        _tokenizer = BertTokenizer.Create(tokenizerPath);
    }

    public Task<IList<ReadOnlyMemory<float>>> GenerateEmbeddingsAsync(
        IList<string> data,
        Kernel? kernel = null,
        CancellationToken cancellationToken = default)
    {
        IList<ReadOnlyMemory<float>> embeddings = data
            .Select(text => new ReadOnlyMemory<float>(
                GenerateEmbedding(text, _tokenizer, _session)))
            .ToList();

        return Task.FromResult(embeddings);
    }

    // GenerateEmbedding method from earlier example
}

Register it with Semantic Kernel:

var builder = Kernel.CreateBuilder();
builder.AddAzureOpenAIChatCompletion("chat-deployment", endpoint, credential);

// Use local ONNX for embeddings, Azure OpenAI for chat.
// Register before Build(): once the kernel is built, its Services are read-only.
builder.Services.AddSingleton<ITextEmbeddingGenerationService>(
    new OnnxEmbeddingService("models/minilm/model.onnx", "models/minilm/vocab.txt"));

var kernel = builder.Build();

This hybrid approach gives you the best of both worlds: cloud LLM for reasoning, local model for embeddings (no per-embedding API cost).

When to Use Each Approach

| Scenario | Recommended | Why |
|---|---|---|
| Chat/reasoning | Azure OpenAI | Frontier models far exceed local model quality |
| Embeddings (high volume) | ONNX local | Save $0.00002/embed × millions = significant savings |
| Classification | ONNX or ML.NET | Fast, cheap, offline capable |
| Regulated data | ONNX local | Data never leaves your server |
| Prototyping | Azure OpenAI | Faster to iterate, no model management |
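The "high volume" row is easy to sanity-check with back-of-the-envelope arithmetic (plain Python; the per-embedding price is the table's own illustrative figure, and the monthly volume is a made-up workload):

```python
price_per_embedding = 0.00002      # illustrative per-embed price from the table above
embeddings_per_month = 50_000_000  # hypothetical high-volume workload

api_cost = price_per_embedding * embeddings_per_month
print(f"${api_cost:,.0f}/month")   # API charges vs ~$0 marginal cost for local inference
```

At that scale the local model pays for its operational overhead quickly; below a few hundred thousand embeddings per month, the API bill is usually noise.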


⚠ Production Considerations

  • ONNX model files can be large (100MB-4GB). Don't include them in your Docker image build. Download them at startup or mount them as volumes. A 2GB model in your image makes every deployment painfully slow.
  • ONNX Runtime loads the full model into memory on InferenceSession creation. A 1GB model needs ~1.2GB RAM. Plan your container memory limits accordingly, and reuse the InferenceSession across requests — don't create a new one per inference call.


🧠 Architect’s Note

ONNX is the escape hatch from cloud dependency. When your Azure OpenAI budget runs out, when regulations ban sending data to external APIs, or when you need sub-millisecond inference — ONNX models running locally in .NET give you an alternative that no cloud provider can take away.

AI-Friendly Summary


Key Takeaways

  • ONNX Runtime runs models from PyTorch/TensorFlow in .NET without Python
  • Use Microsoft.ML.OnnxRuntime for classification and embedding models
  • Use Microsoft.ML.OnnxRuntimeGenAI for text generation (Phi-3, Mistral)
  • ONNX embedding models can implement SK's ITextEmbeddingGenerationService
  • CPU inference is viable for embeddings; GPU accelerates generation models

Implementation Checklist

  • Install Microsoft.ML.OnnxRuntime or OnnxRuntimeGenAI NuGet package
  • Download ONNX model files (model.onnx, tokenizer.json)
  • Create InferenceSession with model path
  • Prepare input tensors matching model's expected shapes
  • Run inference and extract output tensors

Frequently Asked Questions

What is ONNX?

ONNX (Open Neural Network Exchange) is an open format for representing machine learning models. A model trained in PyTorch or TensorFlow can be exported to ONNX and then run in any language with an ONNX Runtime — including C# and .NET. It's the universal model format.

When should I use ONNX instead of Azure OpenAI?

Use ONNX when you need: offline capability (no internet required), zero inference costs (no API charges), data privacy (data never leaves your machine), low latency (no network round-trip), or deterministic behavior (same input always produces same output). Use cloud APIs when you need frontier model quality or language generation.

Can I use ONNX models with Semantic Kernel?

Yes. You can wrap ONNX inference in a Semantic Kernel plugin or implement ITextEmbeddingGenerationService with an ONNX embedding model. Semantic Kernel doesn't care where embeddings come from — it works with any implementation of the embedding interface.



#ONNX #ONNX Runtime #.NET AI #Local AI #HuggingFace