
Run Phi-4 Locally in C#: Ollama vs ONNX vs Foundry Local

Intermediate · Trending AI · .NET 9 · OllamaSharp 5.1.0 · Microsoft.ML.OnnxRuntimeGenAI 0.6.0 · Microsoft.SemanticKernel 1.54.0
By Rajesh Mishra · Mar 21, 2026 · 14 min read
Verified Mar 2026 · .NET 9 · OllamaSharp 5.1.0
In 30 Seconds

Phi-4 runs locally in .NET via four approaches: Ollama (simplest — one command, OpenAI-compatible API), ONNX Runtime GenAI (maximum performance via DirectML/CUDA), Azure AI Foundry Local (managed runtime with Azure tooling), or LLamaSharp (in-process for GGUF models). All four integrate with Semantic Kernel. Phi-4 (14B) requires 16GB VRAM; Phi-4-mini (3.8B) runs on 4GB VRAM or CPU. Switch between local and Azure OpenAI by changing only the DI registration.

Microsoft’s Phi-4 family delivers competitive reasoning and coding performance in a model small enough to run on a developer laptop. With four mature integration paths for .NET — Ollama, ONNX Runtime GenAI, Azure AI Foundry Local, and LLamaSharp — there is no longer a good reason to send every development-time request to the cloud.

This guide walks through all four approaches with working C# code, compares their performance characteristics, and shows how to switch between local and cloud with a single configuration value.

Why Run Phi-4 Locally

The financial case is straightforward. Developers running intensive AI-assisted workflows against Azure OpenAI or OpenAI direct report monthly bills in the $200–400 range. Switching development and testing traffic to a local model brings that figure down dramatically — one pattern that works well is local for iteration, cloud only for staging and production, which can reduce spend to under $50/month for the same development throughput. If you want to work through the numbers systematically, see AI Cost Optimization for .NET Developers.

Beyond cost, local inference solves problems that cloud inference cannot:

Privacy and compliance. HIPAA and GDPR require knowing where data is processed. Local inference means patient records, PII, and confidential business data never leave your network. No BAA negotiation, no data processing addendum — the data simply does not move.

Offline capability. Laptops lose connectivity. CI environments sometimes firewall external APIs. A local model works identically on a plane at 35,000 feet and in an air-gapped staging environment.

Latency. A well-configured local model on modern consumer GPU hardware produces responses in under 100ms for short prompts. Cloud API roundtrips typically add 300–800ms depending on region and load. For interactive applications where the user is watching a loading indicator, this difference is perceptible.

Experimentation without quota pressure. Exploring new prompting techniques, testing edge cases, generating synthetic training data — these activities consume tokens faster than production workloads. Running locally eliminates the mental overhead of rate limit budgets.

Phi-4 Model Family

Microsoft publishes three models in the Phi-4 family relevant to local deployment:

Model               Parameters   VRAM Required    Best For
Phi-4               14B          16GB GPU         Complex reasoning, code generation
Phi-4-mini          3.8B         4GB GPU / CPU    Simple tasks, fast iteration
Phi-4-multimodal    5.6B         8GB GPU          Text + image understanding

Quantized variants trade a small accuracy reduction for substantially lower memory requirements. The Q4_K_M quantization of Phi-4-mini runs comfortably in 3GB VRAM and is the recommended starting point for developer laptops with 8GB unified memory. Q5_K_M offers better accuracy at slightly higher memory cost. Both are available directly through Ollama’s model library.

Most .NET developers should start with Phi-4-mini Q4_K_M. It responds quickly enough for interactive use, fits on almost any modern GPU, and handles the majority of typical development tasks — code explanation, LINQ generation, unit test scaffolding, and documentation drafting.
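The VRAM figures above can be sanity-checked with simple arithmetic: weight memory is roughly parameter count × bits per weight ÷ 8, plus headroom for the KV cache and runtime buffers. A quick illustrative sketch (the 4.5 bits-per-weight figure for Q4_K_M is an approximation for its mixed 4/6-bit blocks, not an exact spec):

```csharp
// Rough weight-memory estimate: params × bits/weight ÷ 8 bits/byte.
static double EstimateWeightGb(double billionParams, double bitsPerWeight) =>
    billionParams * 1e9 * bitsPerWeight / 8 / 1e9;

Console.WriteLine($"Phi-4-mini Q4_K_M: ~{EstimateWeightGb(3.8, 4.5):F1} GB"); // ~2.1 GB
Console.WriteLine($"Phi-4 Q4_K_M:      ~{EstimateWeightGb(14, 4.5):F1} GB");  // ~7.9 GB
Console.WriteLine($"Phi-4 FP16:        ~{EstimateWeightGb(14, 16):F1} GB");   // ~28 GB
```

The gap between the ~2.1GB weight estimate and the 3GB working figure quoted above is the KV cache and runtime overhead, which grows with context length.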

Approach 1: Ollama + OllamaSharp

Ollama is the fastest path from zero to running inference. Install the desktop app or CLI for your operating system, pull a model, and you have a running OpenAI-compatible API server.

Setup:

# Install Ollama (download from ollama.com for your OS)
ollama pull phi4-mini
ollama run phi4-mini "Hello, can you help me write C# code?"

The ollama run command confirms the model is working. After that, Ollama serves requests at http://localhost:11434. The /v1 path provides OpenAI-compatible endpoints.

C# integration with OllamaSharp:

using OllamaSharp;
using Microsoft.Extensions.AI;

// Option A: OllamaApiClient implements IChatClient directly (OllamaSharp 4+)
builder.Services.AddSingleton<IChatClient>(
    new OllamaApiClient(new Uri("http://localhost:11434"), "phi4-mini"));

// Option B: OpenAI-compatible endpoint (simpler, works with MEAI directly)
builder.Services.AddOpenAIChatClient(
    modelId: "phi4-mini",
    endpoint: new Uri("http://localhost:11434/v1"),
    apiKey: "ollama"); // Any non-empty string

Option B is often preferable because it uses the same IChatClient registration pattern as Azure OpenAI, making the environment-switching approach in the final section work cleanly.

Semantic Kernel integration:

using Microsoft.SemanticKernel;

var kernelBuilder = Kernel.CreateBuilder();
kernelBuilder.AddOpenAIChatCompletion(
    modelId: "phi4-mini",
    endpoint: new Uri("http://localhost:11434/v1"),
    apiKey: "ollama");

var kernel = kernelBuilder.Build();
var result = await kernel.InvokePromptAsync("Explain async/await in C# in 3 sentences.");
Console.WriteLine(result);

Semantic Kernel treats Ollama as an OpenAI-compatible provider. All standard SK features — prompt templates, function calling, plugins — work against the local endpoint. The apiKey value can be any non-empty string; Ollama does not validate it.

Ollama must be running (ollama serve or the Ollama desktop app) before starting your .NET app. Add a health check to your startup code if you want to fail fast with a clear error message rather than a connection refused exception at first inference.
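That health check can be as small as one HTTP call against Ollama's GET /api/version endpoint. A sketch, assuming the default port; the Describe helper is just for a readable message:

```csharp
using System.Net.Http;

// Fail fast with a clear message instead of a raw connection-refused
// exception at first inference.
static string Describe(int? status) => status is int s
    ? $"Ollama endpoint reachable (HTTP {s})."
    : "Ollama endpoint not reachable; run 'ollama serve' or start the desktop app.";

using var http = new HttpClient { Timeout = TimeSpan.FromSeconds(2) };
try
{
    var resp = await http.GetAsync("http://localhost:11434/api/version");
    Console.WriteLine(Describe((int)resp.StatusCode));
}
catch (Exception ex) when (ex is HttpRequestException or TaskCanceledException)
{
    Console.WriteLine(Describe(null));
}
```

Run this before registering IChatClient so startup fails with an actionable message rather than a stack trace.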

Approach 2: ONNX Runtime GenAI

For maximum performance on local hardware, ONNX Runtime GenAI runs models through hardware-specific execution providers: DirectML for Windows GPU, CUDA for NVIDIA, or optimized CPU kernels. This eliminates the client-server overhead of Ollama and produces the highest token throughput of any local approach.

The tradeoff is setup complexity. ONNX models must be downloaded separately from Hugging Face, and the model format is different from Ollama’s. Microsoft’s ONNX model repository includes pre-converted Phi-4 models ready for use with ONNX Runtime GenAI.

using Microsoft.ML.OnnxRuntimeGenAI;

// Download the Phi-4-mini ONNX model from Hugging Face
// Model path: ./phi4-mini-onnx
using var model = new Model("./phi4-mini-onnx");
using var tokenizer = new Tokenizer(model);

var prompt = "<|system|>You are a helpful assistant.<|end|><|user|>Write a C# hello world.<|end|><|assistant|>";
var sequences = tokenizer.Encode(prompt);

using var generatorParams = new GeneratorParams(model);
generatorParams.SetSearchOption("max_length", 512);

using var generator = new Generator(model, generatorParams);
generator.AppendTokenSequences(sequences); // replaces SetInputSequences from pre-0.5 versions
using var tokenizerStream = tokenizer.CreateStream();
while (!generator.IsDone())
{
    // GenerateNextToken computes logits and samples in one step
    // (the separate ComputeLogits call was removed in 0.5.0+)
    generator.GenerateNextToken();
    Console.Write(tokenizerStream.Decode(generator.GetSequence(0)[^1]));
}

The generation loop gives you token-by-token streaming output. For integration with higher-level abstractions, you can wrap this in an IChatClient implementation. For the full picture of what ONNX Runtime enables beyond just language models — including embeddings and classification without any cloud dependency — see ONNX Models in .NET — Run AI Without Azure.

The NuGet package Microsoft.ML.OnnxRuntimeGenAI has hardware-specific variants: Microsoft.ML.OnnxRuntimeGenAI.Cuda for NVIDIA GPU and Microsoft.ML.OnnxRuntimeGenAI.DirectML for Windows GPU via DirectML. Use the plain package for CPU-only.
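One quality-of-life note on the raw ONNX path: the chat template is assembled by hand, so keeping it in a helper avoids drift between call sites. The tag layout below follows the prompt string in the example above; verify it against the model's tokenizer configuration before relying on it:

```csharp
// Phi-4 chat-template tags, matching the hand-built prompt in the
// ONNX example above.
static string BuildPhi4Prompt(string system, string user) =>
    $"<|system|>{system}<|end|><|user|>{user}<|end|><|assistant|>";

var prompt = BuildPhi4Prompt("You are a helpful assistant.", "Write a C# hello world.");
Console.WriteLine(prompt);
```

The OpenAI-compatible servers (Ollama, Foundry Local) apply this template for you; only the in-process approaches make it your problem.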

Approach 3: Azure AI Foundry Local

Azure AI Foundry Local is Microsoft’s managed local runtime. It handles model downloading, hardware detection, and serving automatically — simpler than managing ONNX models manually, with better tooling integration than Ollama for teams already in the Azure ecosystem.

// Azure AI Foundry Local requires the Foundry CLI installed
// foundry model run microsoft/phi-4-mini

// Then connect via its local endpoint
builder.Services.AddOpenAIChatClient(
    modelId: "microsoft/phi-4-mini",
    endpoint: new Uri("http://localhost:5272/v1"),  // Foundry Local default port
    apiKey: "foundry");

Foundry Local exposes the same OpenAI-compatible REST API as Ollama but on port 5272 by default. The connection code is nearly identical, which means you can switch between Ollama and Foundry Local by changing only the endpoint URI and model ID — useful if your team has mixed preferences or if you want to benchmark them side by side.

The Foundry CLI (foundry model run) manages model lifecycle. It downloads, caches, and serves models from a central location, which avoids the problem of multiple developers on a team each downloading multi-gigabyte model files independently.

Approach 4: LLamaSharp

LLamaSharp runs inference in-process using GGUF-format models. No separate server process is required — the model loads directly into your .NET application’s memory space. This makes it the right choice for desktop applications, single-user tools, or scenarios where managing an external process is undesirable.

using LLama;
using LLama.Common;

// Download Phi-4-mini.gguf from Hugging Face
var parameters = new ModelParams("./phi4-mini.Q4_K_M.gguf")
{
    ContextSize = 4096,
    GpuLayerCount = 35 // Layers to offload to GPU (0 = CPU only)
};

using var model = LLamaWeights.LoadFromFile(parameters);
using var context = model.CreateContext(parameters);
var executor = new InstructExecutor(context);

await foreach (var text in executor.InferAsync("Write a hello world in C#:"))
{
    Console.Write(text);
}

The GpuLayerCount parameter controls GPU offloading. Set it to 0 for CPU-only inference (slower but works on any machine), or to the maximum layer count of your model for full GPU offload. For Phi-4-mini Q4_K_M, 35 layers covers the full model on a 4GB GPU.

LLamaSharp’s raw token throughput is lower than ONNX Runtime GenAI because it uses llama.cpp bindings rather than DirectML or CUDA-native kernels. For most interactive use cases, the difference is acceptable.
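If your application ships to machines with varying GPUs, GpuLayerCount can be derived at startup instead of hard-coded. A rough illustrative heuristic; the 20% KV-cache headroom factor is an assumption, not a LLamaSharp API fact:

```csharp
// Offload as many layers as fit in free VRAM, keeping ~20% headroom
// for the KV cache and scratch buffers. Returns 0 (CPU-only) when no
// VRAM is available.
static int ChooseGpuLayers(double freeVramGb, int totalLayers, double modelSizeGb)
{
    if (freeVramGb <= 0) return 0;
    var fraction = Math.Min(1.0, freeVramGb / (modelSizeGb * 1.2));
    return Math.Clamp((int)(totalLayers * fraction), 0, totalLayers);
}

Console.WriteLine(ChooseGpuLayers(4.0, 35, 2.5)); // full offload: 35
Console.WriteLine(ChooseGpuLayers(1.5, 35, 2.5)); // partial offload: 17
Console.WriteLine(ChooseGpuLayers(0.0, 35, 2.5)); // CPU only: 0
```

Partial offload degrades gracefully: llama.cpp runs the remaining layers on CPU, so a too-small GPU still works, just more slowly.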

Performance Comparison

These figures are approximate and hardware-dependent. All measured on an NVIDIA RTX 4070 (12GB VRAM) running Phi-4-mini Q4_K_M.

Approach               Tokens/sec (RTX 4070)   RAM/VRAM    Setup Steps   GPU Required           SK Integration
Ollama + phi4-mini     45–60 tok/s             4GB VRAM    2             Optional               Simple (OpenAI-compat)
ONNX Runtime GenAI     80–120 tok/s            4GB VRAM    5+            Strongly recommended   Manual wrapper
Foundry Local          40–70 tok/s             4GB VRAM    3             Optional               Simple (OpenAI-compat)
LLamaSharp             30–50 tok/s             4GB VRAM    3             Optional               Manual wrapper

For most .NET development workflows, Ollama’s token throughput is more than sufficient. You only need to reach for ONNX Runtime GenAI if you are building a production service where throughput directly affects user experience or operating cost.

GPU is technically optional for all four approaches — they all support CPU-only inference. In practice, CPU inference with a 3.8B model produces 3–8 tokens per second, which is too slow for interactive use. Treat GPU as effectively required for a good experience.
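To make those throughput numbers concrete, here is the wall-clock time for a typical answer (the 300-token answer length is just an illustrative assumption):

```csharp
// Wall-clock time to generate a completion at a given throughput.
static double SecondsFor(int tokens, double tokensPerSec) => tokens / tokensPerSec;

Console.WriteLine($"CPU, 5 tok/s:         {SecondsFor(300, 5):F0} s");  // 60 s
Console.WriteLine($"Ollama GPU, 50 tok/s:  {SecondsFor(300, 50):F0} s"); // 6 s
Console.WriteLine($"ONNX GPU, 100 tok/s:   {SecondsFor(300, 100):F0} s"); // 3 s
```

A minute per answer is where interactive use breaks down, which is why CPU-only inference is best reserved for CI checks and batch jobs.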

Environment-Based Provider Switching

The most practical local AI setup treats local inference as the development default and cloud inference as the production default, with a single configuration switch controlling which is active.

// Program.cs — swap local ↔ cloud with one env variable
using Azure; // AzureKeyCredential lives in the Azure namespace

var useLocalAI = builder.Configuration.GetValue<bool>("UseLocalAI");

if (useLocalAI)
{
    // Development: free, no rate limits
    builder.Services.AddOpenAIChatClient(
        modelId: "phi4-mini",
        endpoint: new Uri("http://localhost:11434/v1"),
        apiKey: "ollama");
}
else
{
    // Production: managed service with content filtering
    builder.Services.AddAzureOpenAIChatClient(
        new Uri(builder.Configuration["AzureOpenAI:Endpoint"]!),
        new AzureKeyCredential(builder.Configuration["AzureOpenAI:ApiKey"]!));
}

In appsettings.Development.json:

{ "UseLocalAI": true }

In production, omit UseLocalAI entirely (defaulting to false) or set it explicitly in your environment variables. Your service classes inject IChatClient and contain no provider-specific code — the switch is entirely at the composition root.
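The production side of the configuration is the mirror image. A sketch of the shape the Program.cs snippet above reads; the values are placeholders:

```json
{
  "UseLocalAI": false,
  "AzureOpenAI": {
    "Endpoint": "https://your-resource.openai.azure.com/",
    "ApiKey": "<set-via-environment-or-key-vault>"
  }
}
```

In practice, keep the API key out of appsettings.json entirely and supply it through environment variables or Key Vault; the fragment only documents the key names.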

This pattern works because both Ollama (via its /v1 OpenAI-compatible endpoint) and Azure OpenAI register as IChatClient implementations. From the perspective of your business logic, they are identical.

One important caveat before deploying this pattern: test your specific prompt patterns and features against both backends. Ollama’s OpenAI-compatible endpoint does not implement the full API surface. Structured output (response_format with JSON schema), logprobs, and certain streaming behaviors may differ. Discover these gaps in development, not in production.

Further Reading

For a complete application with .NET Aspire orchestration, see Build a Local AI App with Ollama, Semantic Kernel, and .NET Aspire.

⚠ Production Considerations

  • Ollama's OpenAI-compatible endpoint at /v1 does not support all OpenAI API features. Structured outputs (response_format JSON schema), logprobs, and some streaming options may not work identically. Test your specific features against Ollama before assuming full compatibility.
  • Local models have no content filtering. Phi-4 may respond to prompts that Azure OpenAI's content filter would block. If your application processes user-generated content, implement your own content safety layer when using local models in production.


🧠 Architect’s Note

Use local Phi-4 for two distinct purposes: (1) development and testing — free, fast, no account needed; (2) privacy-sensitive workloads in production — where data cannot leave your infrastructure. For most customer-facing features, Azure OpenAI's managed service, SLA, and content filtering make it the better production choice.

AI-Friendly Summary

Phi-4 runs locally in .NET via four approaches: Ollama (simplest — one command, OpenAI-compatible API), ONNX Runtime GenAI (maximum performance via DirectML/CUDA), Azure AI Foundry Local (managed runtime with Azure tooling), or LLamaSharp (in-process for GGUF models). All four integrate with Semantic Kernel. Phi-4 (14B) requires 16GB VRAM; Phi-4-mini (3.8B) runs on 4GB VRAM or CPU. Switch between local and Azure OpenAI by changing only the DI registration.

Key Takeaways

  • Phi-4 (14B) needs 16GB VRAM; Phi-4-mini (3.8B) needs 4GB VRAM or CPU-only
  • Ollama is the fastest setup — one command to pull, one to run, OpenAI-compatible endpoint
  • Integrate with SK via AddOpenAIChatCompletion pointing to http://localhost:11434/v1
  • ONNX Runtime GenAI gives best raw performance on local hardware via DirectML/CUDA
  • Use IChatClient + environment-based DI to switch local ↔ cloud without code changes

Implementation Checklist

  • Install Ollama from ollama.com and run ollama pull phi4-mini
  • Verify Ollama is running with ollama list and ollama run phi4-mini 'Hello'
  • Add OllamaSharp NuGet package or use the OpenAI-compat endpoint directly
  • Register in SK with AddOpenAIChatCompletion pointing to http://localhost:11434/v1
  • Use environment-based DI registration to switch between local and Azure OpenAI
  • For production, test that your code works with both backends before deploying

Frequently Asked Questions

What is Phi-4 and why should .NET developers use it locally?

Phi-4 is Microsoft's 14-billion parameter small language model, outperforming many larger models on reasoning and coding tasks. Running it locally eliminates API costs ($0 after hardware), ensures data privacy (no data leaves your machine), works offline, and provides sub-100ms latency on modern GPUs. The 3.8B parameter Phi-4-mini runs on CPU with 8GB RAM.

What hardware do I need to run Phi-4 locally?

Phi-4 (14B) requires a GPU with 16GB VRAM (for example, an NVIDIA RTX 4080 or an Apple M-series Mac with 16GB+ unified memory). Phi-4-mini (3.8B) runs on CPU with 8GB RAM (slow) or a GPU with 4GB VRAM (fast). On Apple Silicon, the unified memory architecture lets an 8GB M1/M2 run Phi-4-mini comfortably. Check Ollama's model library for quantized variants that require less VRAM.

How do I integrate Ollama with Semantic Kernel in C#?

Ollama exposes an OpenAI-compatible REST endpoint at http://localhost:11434/v1. Register it in SK with: kernelBuilder.AddOpenAIChatCompletion(modelId: "phi4-mini", endpoint: new Uri("http://localhost:11434/v1"), apiKey: "ollama"). The apiKey value can be any non-empty string; Ollama doesn't validate it.

What is the difference between Ollama, ONNX Runtime GenAI, and Azure AI Foundry Local?

Ollama is the simplest: pull and run any model with one command, OpenAI-compatible API, no code required. ONNX Runtime GenAI offers maximum performance through hardware-specific execution providers (DirectML for Windows GPU, CUDA for NVIDIA). Azure AI Foundry Local is Microsoft's managed local runtime, integrated with Azure tooling. LLamaSharp runs GGUF-format models in-process without a separate server.

Can I switch between local Phi-4 and Azure OpenAI without changing my application code?

Yes. Use Microsoft.Extensions.AI's IChatClient abstraction and register different implementations per environment. In development, register OllamaSharp's IChatClient. In production, register Azure OpenAI's IChatClient. Your service classes inject IChatClient and never know which backend they're using.

How does Phi-4 perform compared to GPT-4o for .NET coding tasks?

Phi-4 scores competitively with GPT-4o on many coding benchmarks and outperforms it on some mathematical reasoning tasks. For typical .NET development queries — explaining code, generating LINQ, writing unit tests — Phi-4 produces good results. For complex multi-file refactoring or unfamiliar framework APIs, GPT-4o still has an edge due to its larger training corpus.

What is the Ollama model ID for Phi-4 and Phi-4-mini?

Use 'phi4' for Phi-4 (14B) and 'phi4-mini' for Phi-4-mini (3.8B). Pull them with: ollama pull phi4 or ollama pull phi4-mini. Verify with ollama list. The full model identifier including tag is phi4:latest and phi4-mini:latest.


#Phi-4 #Local AI #Ollama #ONNX Runtime #Small Language Models #.NET AI