Microsoft’s Phi-4 family delivers competitive reasoning and coding performance in a model small enough to run on a developer laptop. With four mature integration paths for .NET — Ollama, ONNX Runtime GenAI, Azure AI Foundry Local, and LLamaSharp — there is no longer a good reason to send every development-time request to the cloud.
This guide walks through all four approaches with working C# code, compares their performance characteristics, and shows how to switch between local and cloud with a single configuration value.
## Why Run Phi-4 Locally
The financial case is straightforward. Developers running intensive AI-assisted workflows against Azure OpenAI or OpenAI direct report monthly bills in the $200–400 range. Switching development and testing traffic to a local model brings that figure down dramatically — one pattern that works well is local for iteration, cloud only for staging and production, which can reduce spend to under $50/month for the same development throughput. If you want to work through the numbers systematically, see AI Cost Optimization for .NET Developers.
Beyond cost, local inference solves problems that cloud inference cannot:
**Privacy and compliance.** HIPAA and GDPR require knowing where data is processed. Local inference means patient records, PII, and confidential business data never leave your network. No BAA negotiation, no data processing addendum — the data simply does not move.
**Offline capability.** Laptops lose connectivity. CI environments sometimes firewall external APIs. A local model works identically on a plane at 35,000 feet and in an air-gapped staging environment.
**Latency.** A well-configured local model on modern consumer GPU hardware produces responses in under 100ms for short prompts. Cloud API roundtrips typically add 300–800ms depending on region and load. For interactive applications where the user is watching a loading indicator, this difference is perceptible.
**Experimentation without quota pressure.** Exploring new prompting techniques, testing edge cases, generating synthetic training data — these activities consume tokens faster than production workloads. Running locally eliminates the mental overhead of rate limit budgets.
## Phi-4 Model Family
Microsoft publishes three models in the Phi-4 family relevant to local deployment:
| Model | Parameters | VRAM Required | Best For |
|---|---|---|---|
| Phi-4 | 14B | 16GB GPU | Complex reasoning, code generation |
| Phi-4-mini | 3.8B | 4GB GPU / CPU | Simple tasks, fast iteration |
| Phi-4-multimodal | 5.6B | 8GB GPU | Text + image understanding |
Quantized variants trade a small accuracy reduction for substantially lower memory requirements. The Q4_K_M quantization of Phi-4-mini runs comfortably in 3GB VRAM (at roughly 4.5–5 bits per weight, its 3.8B parameters take about 2.2–2.4GB, leaving headroom for the KV cache) and is the recommended starting point for developer laptops with 8GB unified memory. Q5_K_M offers better accuracy at slightly higher memory cost. Both are available directly through Ollama's model library.
Most .NET developers should start with Phi-4-mini Q4_K_M. It responds quickly enough for interactive use, fits on almost any modern GPU, and handles the majority of typical development tasks — code explanation, LINQ generation, unit test scaffolding, and documentation drafting.
## Approach 1: Ollama + OllamaSharp
Ollama is the fastest path from zero to running inference. Install the desktop app or CLI for your operating system, pull a model, and you have a running OpenAI-compatible API server.
Setup:
```bash
# Install Ollama (download from ollama.com for your OS)
ollama pull phi4-mini
ollama run phi4-mini "Hello, can you help me write C# code?"
```
The `ollama run` command confirms the model is working. After that, Ollama serves requests at `http://localhost:11434`. The `/v1` path provides OpenAI-compatible endpoints.
C# integration with OllamaSharp:
```csharp
using Microsoft.Extensions.AI;
using OllamaSharp;

// Option A: register OllamaSharp's client directly
builder.Services.AddSingleton(new OllamaApiClient(new Uri("http://localhost:11434")));
builder.Services.AddSingleton<IChatClient>(sp =>
    sp.GetRequiredService<OllamaApiClient>().AsChatClient("phi4-mini"));

// Option B: OpenAI-compatible endpoint (simpler, works with MEAI directly)
builder.Services.AddOpenAIChatClient(
    modelId: "phi4-mini",
    endpoint: new Uri("http://localhost:11434/v1"),
    apiKey: "ollama"); // any non-empty string; Ollama ignores it
```
Option B is often preferable because it uses the same IChatClient registration pattern as Azure OpenAI, making the environment-switching approach in the final section work cleanly.
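Either way, your application code depends only on `IChatClient`. A sketch of a consuming service (the class name and prompt are illustrative, and `GetResponseAsync`/`Text` assume a recent Microsoft.Extensions.AI release):

```csharp
using Microsoft.Extensions.AI;

// Illustrative service: depends only on IChatClient, so it runs unchanged
// against Ollama, Foundry Local, or Azure OpenAI
public sealed class CodeExplainService(IChatClient chat)
{
    public async Task<string> ExplainAsync(string snippet)
    {
        // Send a single user message and aggregate the assistant reply
        var response = await chat.GetResponseAsync(
            $"Explain this C# code in two sentences:\n{snippet}");
        return response.Text;
    }
}
```

Because the service never names a provider, swapping the registered `IChatClient` requires no changes here.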
Semantic Kernel integration:
```csharp
using Microsoft.SemanticKernel;

var kernelBuilder = Kernel.CreateBuilder();
kernelBuilder.AddOpenAIChatCompletion(
    modelId: "phi4-mini",
    endpoint: new Uri("http://localhost:11434/v1"),
    apiKey: "ollama");

var kernel = kernelBuilder.Build();
var result = await kernel.InvokePromptAsync("Explain async/await in C# in 3 sentences.");
Console.WriteLine(result);
```
Semantic Kernel treats Ollama as an OpenAI-compatible provider. All standard SK features — prompt templates, function calling, plugins — work against the local endpoint. The `apiKey` value can be any non-empty string; Ollama does not validate it.
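For instance, a prompt-template function runs against the local model exactly as it would against a cloud deployment (a sketch continuing with the `kernel` built above; the template text is illustrative):

```csharp
// A reusable prompt-template function, invoked against the local endpoint
var summarize = kernel.CreateFunctionFromPrompt(
    "Summarize the following in one sentence: {{$input}}");

var summary = await kernel.InvokeAsync(summarize, new KernelArguments
{
    ["input"] = "Async streams let you consume IAsyncEnumerable<T> with await foreach."
});
Console.WriteLine(summary);
```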
Ollama must be running (`ollama serve` or the Ollama desktop app) before starting your .NET app. Add a health check to your startup code if you want to fail fast with a clear error message rather than a connection-refused exception at first inference.
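A minimal startup probe might look like this (a sketch: `/api/tags` is Ollama's model-listing endpoint, and the two-second timeout is an arbitrary choice):

```csharp
// Fail fast at startup if the Ollama server is unreachable
using var http = new HttpClient { Timeout = TimeSpan.FromSeconds(2) };
try
{
    var response = await http.GetAsync("http://localhost:11434/api/tags");
    response.EnsureSuccessStatusCode();
}
catch (Exception ex) when (ex is HttpRequestException or TaskCanceledException)
{
    throw new InvalidOperationException(
        "Ollama is not reachable at http://localhost:11434. " +
        "Start it with 'ollama serve' or the desktop app.", ex);
}
```

Run this before registering your `IChatClient` so the failure surfaces at startup, not at the first user request.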
## Approach 2: ONNX Runtime GenAI
For maximum performance on local hardware, ONNX Runtime GenAI compiles models to native code targeting your specific hardware: DirectML for Windows GPU, CUDA for NVIDIA, or optimized CPU kernels. This eliminates the server-client overhead of Ollama and produces the highest token throughput of any local approach.
The tradeoff is setup complexity. ONNX models must be downloaded separately from Hugging Face, and the model format is different from Ollama’s. Microsoft’s ONNX model repository includes pre-converted Phi-4 models ready for use with ONNX Runtime GenAI.
```csharp
using Microsoft.ML.OnnxRuntimeGenAI;

// Download the Phi-4-mini ONNX model from Hugging Face first
// Model path: ./phi4-mini-onnx
using var model = new Model("./phi4-mini-onnx");
using var tokenizer = new Tokenizer(model);

var prompt = "<|system|>You are a helpful assistant.<|end|><|user|>Write a C# hello world.<|end|><|assistant|>";
var sequences = tokenizer.Encode(prompt);

using var generatorParams = new GeneratorParams(model);
generatorParams.SetInputSequences(sequences);
generatorParams.SetSearchOption("max_length", 512);

using var generator = new Generator(model, generatorParams);
while (!generator.IsDone())
{
    generator.ComputeLogits();
    generator.GenerateNextToken();

    // Decode and print only the newest token for streaming output
    var outputTokens = generator.GetSequence(0);
    var newToken = outputTokens[^1];
    Console.Write(tokenizer.Decode([newToken]));
}
```
The generation loop gives you token-by-token streaming output. For integration with higher-level abstractions, you can wrap this in an IChatClient implementation. For the full picture of what ONNX Runtime enables beyond just language models — including embeddings and classification without any cloud dependency — see ONNX Models in .NET — Run AI Without Azure.
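The first step of such a wrapper is translating chat messages into the `<|role|>…<|end|>` template used in the prompt above. A minimal sketch (the helper name is illustrative):

```csharp
using System.Text;

// Build a Phi-4-style prompt from (role, content) pairs, ending with the
// assistant marker so the model continues from there
static string BuildPrompt(IEnumerable<(string Role, string Content)> messages)
{
    var sb = new StringBuilder();
    foreach (var (role, content) in messages)
        sb.Append($"<|{role}|>{content}<|end|>");
    return sb.Append("<|assistant|>").ToString();
}

// BuildPrompt([("system", "You are helpful."), ("user", "Hi")])
// → "<|system|>You are helpful.<|end|><|user|>Hi<|end|><|assistant|>"
```

From there, the wrapper's streaming method yields each decoded token from the generation loop shown above.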
The NuGet package `Microsoft.ML.OnnxRuntimeGenAI` has hardware-specific variants: `Microsoft.ML.OnnxRuntimeGenAI.Cuda` for NVIDIA GPU and `Microsoft.ML.OnnxRuntimeGenAI.DirectML` for Windows GPU via DirectML. Use the plain package for CPU-only.
## Approach 3: Azure AI Foundry Local
Azure AI Foundry Local is Microsoft’s managed local runtime. It handles model downloading, hardware detection, and serving automatically — simpler than managing ONNX models manually, with better tooling integration than Ollama for teams already in the Azure ecosystem.
```csharp
// Azure AI Foundry Local requires the Foundry CLI:
//   foundry model run microsoft/phi-4-mini
// Then connect via its local endpoint
builder.Services.AddOpenAIChatClient(
    modelId: "microsoft/phi-4-mini",
    endpoint: new Uri("http://localhost:5272/v1"), // Foundry Local default port
    apiKey: "foundry"); // any non-empty string
```
Foundry Local exposes the same OpenAI-compatible REST API as Ollama but on port 5272 by default. The connection code is nearly identical, which means you can switch between Ollama and Foundry Local by changing only the endpoint URI and model ID — useful if your team has mixed preferences or if you want to benchmark them side by side.
The Foundry CLI (`foundry model run`) manages model lifecycle. It downloads, caches, and serves models from a central location, which avoids the problem of multiple developers on a team each downloading multi-gigabyte model files independently.
## Approach 4: LLamaSharp
LLamaSharp runs inference in-process using GGUF-format models. No separate server process is required — the model loads directly into your .NET application’s memory space. This makes it the right choice for desktop applications, single-user tools, or scenarios where managing an external process is undesirable.
```csharp
using LLama;
using LLama.Common;

// Download a Phi-4-mini GGUF (e.g. the Q4_K_M quantization) from Hugging Face first
var parameters = new ModelParams("./phi4-mini.Q4_K_M.gguf")
{
    ContextSize = 4096,
    GpuLayerCount = 35 // layers to offload to GPU (0 = CPU only)
};

using var model = LLamaWeights.LoadFromFile(parameters);
using var context = model.CreateContext(parameters);
var executor = new InstructExecutor(context);

await foreach (var text in executor.InferAsync("Write a hello world in C#:"))
{
    Console.Write(text);
}
```
The `GpuLayerCount` parameter controls GPU offloading. Set it to 0 for CPU-only inference (slower, but works on any machine), or to a value at or above the model's total layer count for full GPU offload; llama.cpp caps the value at the actual layer count, so overshooting is harmless. For Phi-4-mini Q4_K_M, a setting of 35 offloads the entire model, which fits in 4GB of VRAM.
LLamaSharp’s raw token throughput is lower than ONNX Runtime GenAI because it uses llama.cpp bindings rather than DirectML or CUDA-native kernels. For most interactive use cases, the difference is acceptable.
## Performance Comparison
These figures are approximate and hardware-dependent. All measured on an NVIDIA RTX 4070 (12GB VRAM) running Phi-4-mini Q4_K_M.
| Approach | Tokens/sec (RTX 4070) | RAM/VRAM | Setup Steps | GPU Required | SK Integration |
|---|---|---|---|---|---|
| Ollama + phi4-mini | 45–60 tok/s | 4GB VRAM | 2 | Optional | Simple (OpenAI-compat) |
| ONNX Runtime GenAI | 80–120 tok/s | 4GB VRAM | 5+ | Strongly recommended | Manual wrapper |
| Foundry Local | 40–70 tok/s | 4GB VRAM | 3 | Optional | Simple (OpenAI-compat) |
| LLamaSharp | 30–50 tok/s | 4GB VRAM | 3 | Optional | Manual wrapper |
For most .NET development workflows, Ollama’s token throughput is more than sufficient. You only need to reach for ONNX Runtime GenAI if you are building a production service where throughput directly affects user experience or operating cost.
GPU is technically optional for all four approaches — they all support CPU-only inference. In practice, CPU inference with a 3.8B model produces 3–8 tokens per second: at 5 tok/s, a 150-token answer takes half a minute, versus about three seconds at 50 tok/s. Treat a GPU as effectively required for a good interactive experience.
## Environment-Based Provider Switching
The most practical local AI setup treats local inference as the development default and cloud inference as the production default, with a single configuration switch controlling which is active.
```csharp
// Program.cs — swap local ↔ cloud with one env variable
var useLocalAI = builder.Configuration.GetValue<bool>("UseLocalAI");

if (useLocalAI)
{
    // Development: free, no rate limits
    builder.Services.AddOpenAIChatClient(
        modelId: "phi4-mini",
        endpoint: new Uri("http://localhost:11434/v1"),
        apiKey: "ollama");
}
else
{
    // Production: managed service with content filtering.
    // Requires `using Azure;` for AzureKeyCredential.
    builder.Services.AddAzureOpenAIChatClient(
        builder.Configuration["AzureOpenAI:Deployment"]!, // deployment (model) name
        new Uri(builder.Configuration["AzureOpenAI:Endpoint"]!),
        new AzureKeyCredential(builder.Configuration["AzureOpenAI:ApiKey"]!));
}
```
In appsettings.Development.json:
```json
{ "UseLocalAI": true }
```
In production, omit `UseLocalAI` entirely (defaulting to false) or set it explicitly in your environment variables. Your service classes inject `IChatClient` and contain no provider-specific code — the switch is entirely at the composition root.
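The production `AzureOpenAI` section referenced above might look like this (values are placeholders; the `Deployment` key is included on the assumption that your registration helper needs the deployment name):

```json
{
  "UseLocalAI": false,
  "AzureOpenAI": {
    "Endpoint": "https://your-resource.openai.azure.com/",
    "ApiKey": "<from-key-vault-or-environment>",
    "Deployment": "<your-deployment-name>"
  }
}
```

In practice, keep the key out of source control entirely and supply it via environment variables or a secrets store.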
This pattern works because both Ollama (via its /v1 OpenAI-compatible endpoint) and Azure OpenAI register as IChatClient implementations. From the perspective of your business logic, they are identical.
One important caveat before deploying this pattern: test your specific prompt patterns and features against both backends. Ollama’s OpenAI-compatible endpoint does not implement the full API surface. Structured output (response_format with JSON schema), logprobs, and certain streaming behaviors may differ. Discover these gaps in development, not in production.
## Further Reading
- Ollama model library — Phi-4
- Microsoft Phi-4 on Hugging Face
- ONNX Runtime GenAI on NuGet
- LLamaSharp on GitHub
For a complete application with .NET Aspire orchestration, see Build a Local AI App with Ollama, Semantic Kernel, and .NET Aspire.