Running local AI inference costs nothing after hardware and keeps sensitive data off the network. This workshop builds a complete application from scratch: Ollama serving Phi-4-mini locally, Semantic Kernel orchestrating it with function calling, .NET Aspire handling container orchestration and service discovery, and Azure OpenAI as a cloud fallback when local inference is unavailable.
By the end you have a working application with a clean architecture that supports both local-first development and cloud production deployment without changing your business logic.
Why Local AI for .NET Development
The economic case is direct. Developers running AI-assisted workflows against cloud APIs report monthly bills in the $200-400 range for intensive use. Routing development and test traffic to a local model reduces that to near zero — you pay only for hardware you already own.
Beyond cost, local inference solves problems cloud inference cannot:
Privacy and compliance. HIPAA, GDPR, and similar regulations require knowing exactly where data is processed. Local inference means patient records, source code, and confidential business data never traverse a network. No data processing addendum, no BAA negotiation required.
Offline capability. Laptops lose connectivity. CI environments may firewall external APIs. A local model works identically on a plane, in an air-gapped lab, and on a developer workstation with a spotty VPN connection.
Fast iteration without quota pressure. Prompt engineering, edge case testing, and synthetic data generation consume tokens faster than production workloads. Running locally eliminates the mental overhead of rate limits and token budgets.
Latency. A modern GPU running Phi-4-mini starts returning tokens in under 100ms for short prompts. Cloud API roundtrips typically add 300-800ms before the first token arrives. For interactive applications, this difference is visible to users.
For a deeper comparison of model options — including ONNX Runtime, Foundry Local, and LLamaSharp — see Running Phi-4 Locally in C# — Ollama, ONNX Runtime, and Foundry Local Compared.
Step 1: Set Up Ollama and Pull Phi-4-mini
Install Ollama from ollama.com for your operating system (Windows, macOS, or Linux). Then pull and verify the model:
# Pull Phi-4-mini (3.8B — runs on 4GB VRAM or CPU)
ollama pull phi4-mini
# Verify the model responds correctly
ollama run phi4-mini "Explain dependency injection in .NET in one sentence."
# List installed models to confirm
ollama list
The ollama run command confirms the model is working and shows you the response quality. After this, Ollama serves requests at http://localhost:11434. The /v1 path provides an OpenAI-compatible API endpoint.
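You can also sanity-check the OpenAI-compatible endpoint directly from the command line. This is a usage sketch: it assumes Ollama is running on the default port with the phi4-mini tag pulled as above.

```shell
# Call the OpenAI-compatible chat endpoint directly. The trailing guard
# keeps the command from failing hard if the server is not up yet.
curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi4-mini",
    "messages": [{"role": "user", "content": "Reply with one word: pong"}]
  }' || echo "Ollama is not reachable on localhost:11434"
```

If this returns a JSON chat completion, any OpenAI-compatible client library can talk to the same endpoint.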
Phi-4-mini (3.8B parameters) is the practical choice for this workshop. It runs on 4GB VRAM for good interactive performance, or on CPU with 8GB RAM for slower but functional inference. For hardware guidance, see the Running Phi-4 Locally guide, which benchmarks Phi-4-mini against the full 14B Phi-4 model.
If your .NET application fails with a Connection refused error when it first calls Ollama, the most common causes are Ollama not running, a wrong endpoint URL format, or Docker networking issues. See Fix Ollama Connection Refused Error in .NET and Semantic Kernel for a complete diagnostic checklist.
Step 2: Register OllamaSharp in Dependency Injection
Create a new ASP.NET Core Web API project and add the required packages:
dotnet new webapi -n LocalAiApp.Api
cd LocalAiApp.Api
dotnet add package OllamaSharp --version 5.1.0
dotnet add package Microsoft.SemanticKernel --version 1.54.0
Register OllamaSharp in Program.cs. OllamaSharp 5.x uses OllamaApiClient — the older OllamaClient type from deprecated preview packages will not compile:
using OllamaSharp;
using Microsoft.Extensions.AI;
var builder = WebApplication.CreateBuilder(args);
// Read Ollama endpoint from configuration (defaults to localhost for local dev)
var ollamaEndpoint = new Uri(
builder.Configuration.GetConnectionString("ollama")
?? "http://localhost:11434");
// Register OllamaApiClient as a singleton, with phi4-mini as its default model
builder.Services.AddSingleton(new OllamaApiClient(ollamaEndpoint, "phi4-mini"));
// OllamaApiClient implements Microsoft.Extensions.AI.IChatClient directly
builder.Services.AddSingleton<IChatClient>(sp =>
    sp.GetRequiredService<OllamaApiClient>());
The GetConnectionString("ollama") call is intentional — when running under .NET Aspire, the framework injects the Ollama container’s dynamic address as a connection string. For plain local development, the null-coalescing fallback to http://localhost:11434 handles the case where no Aspire orchestration is present.
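The resolution order can be sketched as a pure function. ResolveOllamaEndpoint is a hypothetical helper for illustration, not part of any package:

```csharp
using System;

// Hypothetical helper illustrating the resolution order: an
// Aspire-injected connection string wins; otherwise use the local default.
static Uri ResolveOllamaEndpoint(string? injectedConnectionString) =>
    new Uri(injectedConnectionString ?? "http://localhost:11434");

Console.WriteLine(ResolveOllamaEndpoint("http://ollama:53017")); // Aspire-style dynamic address
Console.WriteLine(ResolveOllamaEndpoint(null));                  // plain local development
```

Because the fallback lives in one place, switching between Aspire-orchestrated and standalone runs requires no code change.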
You can now inject IChatClient into any service:
public class ChatService(IChatClient chatClient)
{
public async Task<string> AskAsync(string question)
{
var response = await chatClient.GetResponseAsync(question);
return response.Text;
}
}
This service has no knowledge of Ollama, Azure OpenAI, or any specific backend. The injection point is where the backend decision lives.
Step 3: Add Semantic Kernel Over Ollama
OllamaSharp’s IChatClient works for straightforward chat completions. For Semantic Kernel features — prompt templates, plugins, planners — register SK pointing at Ollama’s OpenAI-compatible endpoint:
using Microsoft.SemanticKernel;
// After the OllamaApiClient registration above, add SK:
builder.Services.AddKernel()
.AddOpenAIChatCompletion(
modelId: "phi4-mini",
endpoint: new Uri($"{ollamaEndpoint.ToString().TrimEnd('/')}/v1"),
apiKey: "ollama"); // Required parameter — Ollama ignores its value
The apiKey parameter is required by the OpenAI client library but Ollama does not validate it. Any non-empty string works. The convention of passing "ollama" makes it obvious in code reviews that this is a local endpoint.
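A quick check of how the /v1 endpoint string is composed — the TrimEnd guards against a trailing slash on the configured base address producing a double slash:

```csharp
using System;

// Base endpoint as it might arrive from configuration, with a trailing slash.
var ollamaEndpoint = new Uri("http://localhost:11434/");

// Same composition as the AddOpenAIChatCompletion call above.
var openAiCompatible = new Uri($"{ollamaEndpoint.ToString().TrimEnd('/')}/v1");

Console.WriteLine(openAiCompatible); // http://localhost:11434/v1
```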
Verify SK is wired correctly with a minimal test endpoint:
app.MapGet("/test", async (Kernel kernel) =>
{
var result = await kernel.InvokePromptAsync(
"In one sentence, what is the capital of France?");
return Results.Ok(result.ToString());
});
Run the app (dotnet run) and hit GET /test. You should see Phi-4-mini’s response via SK with no cloud API calls.
Step 4: Function Calling with a Local Model
Semantic Kernel’s plugin system works with Phi-4-mini via its tool calling support. Define a plugin class with [KernelFunction] attributes:
using System.ComponentModel; // for [Description]
using Microsoft.SemanticKernel;
public class WeatherPlugin
{
[KernelFunction("get_current_weather")]
[Description("Gets the current weather for a specified location")]
public string GetCurrentWeather(
[Description("The city and country, e.g., 'London, UK'")] string location)
{
// In a real app, call a weather API here
return $"The weather in {location} is 18°C, partly cloudy.";
}
[KernelFunction("get_weather_forecast")]
[Description("Gets a 3-day weather forecast for a specified location")]
public string GetWeatherForecast(
[Description("The city and country")] string location,
[Description("Number of days, 1-3")] int days = 3)
{
return $"Forecast for {location}: Day 1: 18°C, Day 2: 22°C, Day 3: 15°C (rain).";
}
}
Register the plugin and configure function calling in your chat endpoint:
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.Connectors.OpenAI;
app.MapPost("/chat", async (Kernel kernel, ChatRequest request) =>
{
// Add the plugin to the kernel
kernel.Plugins.AddFromObject(new WeatherPlugin());
// Configure automatic function calling
var executionSettings = new OpenAIPromptExecutionSettings
{
FunctionChoiceBehavior = FunctionChoiceBehavior.Auto()
};
var result = await kernel.InvokePromptAsync(
request.Message,
new KernelArguments(executionSettings));
return Results.Ok(new { reply = result.ToString() });
});
record ChatRequest(string Message);
FunctionChoiceBehavior.Auto() tells SK to let the model decide when to call functions. With Phi-4-mini, this works for clear, specific queries like “What is the weather in London?” but is less reliable for ambiguous prompts. Test every function invocation path explicitly against the local model — do not assume that a plugin working against GPT-4o will work identically against Phi-4-mini.
Streaming Responses
For interactive UIs, stream the response token by token:
app.MapGet("/chat/stream", async (string message, Kernel kernel, HttpContext ctx) =>
{
ctx.Response.ContentType = "text/event-stream";
var executionSettings = new OpenAIPromptExecutionSettings
{
FunctionChoiceBehavior = FunctionChoiceBehavior.Auto()
};
await foreach (var update in kernel.InvokePromptStreamingAsync(
message, new KernelArguments(executionSettings)))
{
var text = update.ToString();
if (!string.IsNullOrEmpty(text))
{
await ctx.Response.WriteAsync($"data: {text}\n\n");
await ctx.Response.Body.FlushAsync();
}
}
});
Server-Sent Events (SSE) work well for streaming local model output to browser clients. The text/event-stream content type enables native browser EventSource support without a library.
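On the consuming side, the wire format is just lines prefixed with data:. A minimal parser sketch — ParseSseData is illustrative, not from any client library:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative parser for the "data: ..." lines emitted by the endpoint above.
static IEnumerable<string> ParseSseData(string body) =>
    body.Split('\n')
        .Where(line => line.StartsWith("data: "))
        .Select(line => line.Substring("data: ".Length));

var stream = "data: Dependency\n\ndata:  injection\n\n";
Console.WriteLine(string.Concat(ParseSseData(stream))); // Dependency injection
```

In a browser, the built-in EventSource API performs the same framing automatically.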
Step 5: .NET Aspire Integration
.NET Aspire orchestrates the Ollama container, handles service discovery, and provides the metrics dashboard — eliminating hardcoded URLs and manual container management.
Create an Aspire AppHost project:
dotnet new aspire-apphost -n LocalAiApp.AppHost
cd LocalAiApp.AppHost
dotnet add package Aspire.Hosting.Ollama
Configure the AppHost in Program.cs to provision Ollama and wire it to your API service:
// LocalAiApp.AppHost/Program.cs
using Aspire.Hosting;
using Aspire.Hosting.Ollama;
var builder = DistributedApplication.CreateBuilder(args);
// Provision Ollama container and pull phi4-mini automatically
var ollama = builder.AddOllama("ollama")
.WithModel("phi4-mini")
.WithDataVolume(); // Persist downloaded models between runs
// Register the API project and give it a reference to Ollama
var api = builder.AddProject<Projects.LocalAiApp_Api>("api")
.WithReference(ollama)
.WaitFor(ollama); // Don't start the API until Ollama is ready
builder.Build().Run();
The WithReference(ollama) call injects the Ollama container’s dynamic endpoint as a connection string named "ollama" in the API project’s configuration — which is exactly what builder.Configuration.GetConnectionString("ollama") reads in Step 2.
WaitFor(ollama) prevents the API from starting until Ollama’s health check passes, avoiding the connection refused errors that occur when the API starts before Ollama finishes loading the model.
In the API project, reference an Aspire service defaults project and call AddServiceDefaults(). Service defaults are generated from a project template rather than installed as a NuGet package:
cd ..
dotnet new aspire-servicedefaults -n LocalAiApp.ServiceDefaults
dotnet add LocalAiApp.Api reference LocalAiApp.ServiceDefaults
// LocalAiApp.Api/Program.cs — complete version
using OllamaSharp;
using Microsoft.Extensions.AI;
using Microsoft.SemanticKernel;
var builder = WebApplication.CreateBuilder(args);
// Aspire service defaults — adds health checks, telemetry, and service discovery
builder.AddServiceDefaults();
// Read the Ollama endpoint injected by Aspire (or fall back for local dev)
var ollamaEndpoint = new Uri(
builder.Configuration.GetConnectionString("ollama")
?? "http://localhost:11434");
// Register OllamaSharp (phi4-mini as the default model)
builder.Services.AddSingleton(new OllamaApiClient(ollamaEndpoint, "phi4-mini"));
builder.Services.AddSingleton<IChatClient>(sp =>
    sp.GetRequiredService<OllamaApiClient>());
// Register Semantic Kernel
builder.Services.AddKernel()
.AddOpenAIChatCompletion(
modelId: "phi4-mini",
endpoint: new Uri($"{ollamaEndpoint.ToString().TrimEnd('/')}/v1"),
apiKey: "ollama");
builder.Services.AddControllers();
var app = builder.Build();
app.MapDefaultEndpoints(); // /health and /alive from Aspire service defaults
app.MapControllers();
app.Run();
With Aspire orchestration, the AppHost owns the topology: it provisions the Ollama container, starts the API project, and connects the two through service discovery, with no URLs hardcoded in application code.
Start the entire stack with:
cd LocalAiApp.AppHost
dotnet run
Aspire pulls the Ollama Docker image, downloads phi4-mini (this takes time on first run; subsequent runs use the volume cache), starts the Ollama container alongside the API project, and opens the Aspire dashboard (the exact URL, including a login token, is printed in the dotnet run output). The dashboard shows resource status, structured logs, and distributed traces across both the API and Ollama.
Step 6: Cloud Fallback to Azure OpenAI
The cloud fallback pattern handles two scenarios: Ollama is unavailable (container crashed, VRAM exhausted), or a specific request exceeds the local model’s capability. The implementation catches HttpRequestException from the local AI call and retries with an Azure OpenAI client:
// LocalAiApp.Api/Services/ResilientChatService.cs
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.Connectors.OpenAI;
using Azure;
using Azure.AI.OpenAI;
public class ResilientChatService
{
private readonly Kernel _localKernel;
private readonly Kernel _cloudKernel;
private readonly ILogger<ResilientChatService> _logger;
public ResilientChatService(
Kernel localKernel,
IConfiguration config,
ILogger<ResilientChatService> logger)
{
_localKernel = localKernel;
_logger = logger;
// Build a separate cloud kernel for fallback
var cloudBuilder = Kernel.CreateBuilder();
cloudBuilder.AddAzureOpenAIChatCompletion(
deploymentName: config["AzureOpenAI:DeploymentName"] ?? "gpt-4o-mini",
endpoint: config["AzureOpenAI:Endpoint"]!,
apiKey: config["AzureOpenAI:ApiKey"]!);
_cloudKernel = cloudBuilder.Build();
}
public async Task<string> InvokeAsync(
string prompt,
OpenAIPromptExecutionSettings? settings = null,
CancellationToken ct = default)
{
settings ??= new OpenAIPromptExecutionSettings
{
FunctionChoiceBehavior = FunctionChoiceBehavior.Auto()
};
try
{
// Attempt local inference first
var result = await _localKernel.InvokePromptAsync(
prompt, new KernelArguments(settings), cancellationToken: ct);
return result.ToString();
}
catch (HttpRequestException ex)
{
// Local Ollama is unreachable — fall back to Azure OpenAI
_logger.LogWarning(ex,
"Ollama unavailable, falling back to Azure OpenAI");
var fallback = await _cloudKernel.InvokePromptAsync(
prompt, new KernelArguments(settings), cancellationToken: ct);
return fallback.ToString();
}
catch (TaskCanceledException ex) when (!ct.IsCancellationRequested)
{
// Timeout from Ollama (e.g., model loading) — fall back
_logger.LogWarning(ex,
"Ollama timed out, falling back to Azure OpenAI");
var fallback = await _cloudKernel.InvokePromptAsync(
prompt, new KernelArguments(settings), cancellationToken: ct);
return fallback.ToString();
}
}
}
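The control flow of ResilientChatService reduces to a small local-first/cloud-fallback combinator that is easy to test in isolation. This is a simplified sketch with hypothetical delegates standing in for the two kernels; the real service additionally checks that the caller has not cancelled before falling back on timeout:

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical combinator: try the local path first, fall back to cloud
// on the same exception types ResilientChatService catches.
static async Task<string> WithFallbackAsync(
    Func<Task<string>> local, Func<Task<string>> cloud)
{
    try { return await local(); }
    catch (HttpRequestException) { return await cloud(); }
    catch (TaskCanceledException) { return await cloud(); }
}

// Simulate Ollama being down: the local delegate throws, cloud answers.
var reply = await WithFallbackAsync(
    () => throw new HttpRequestException("connection refused"),
    () => Task.FromResult("cloud reply"));

Console.WriteLine(reply); // cloud reply
```

Factoring the pattern out this way lets you unit test the fallback decision without a running Ollama instance or an Azure subscription.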
Register ResilientChatService in Program.cs:
builder.Services.AddSingleton<ResilientChatService>();
And update the chat endpoint to use it:
app.MapPost("/chat/resilient", async (
ChatRequest request,
ResilientChatService chatService) =>
{
var reply = await chatService.InvokeAsync(request.Message);
return Results.Ok(new { reply });
});
Monitor how often the fallback activates by adding a metric counter:
// Inject IMeterFactory or use System.Diagnostics.Metrics
private static readonly Counter<int> FallbackCounter =
new Meter("LocalAiApp.Api").CreateCounter<int>("ai_fallback_total");
// Inside the catch block:
FallbackCounter.Add(1, new TagList { { "reason", "http_error" } });
The Aspire dashboard picks up this custom metric automatically via OpenTelemetry. Set an alert if ai_fallback_total exceeds your acceptable threshold per hour.
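You can verify the counter in-process with a MeterListener before any exporter is configured — a sketch of what the OpenTelemetry pipeline observes, using only System.Diagnostics.Metrics:

```csharp
using System;
using System.Diagnostics;
using System.Diagnostics.Metrics;

var meter = new Meter("LocalAiApp.Api");
var fallbackCounter = meter.CreateCounter<int>("ai_fallback_total");

// In-process listener standing in for the OpenTelemetry exporter.
long total = 0;
using var listener = new MeterListener();
listener.InstrumentPublished = (instrument, l) =>
{
    if (instrument.Name == "ai_fallback_total")
        l.EnableMeasurementEvents(instrument);
};
listener.SetMeasurementEventCallback<int>(
    (instrument, value, tags, state) => total += value);
listener.Start();

// Two simulated fallbacks, tagged by reason as in the catch blocks above.
fallbackCounter.Add(1, new TagList { { "reason", "http_error" } });
fallbackCounter.Add(1, new TagList { { "reason", "timeout" } });

Console.WriteLine(total); // 2
```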
Step 7: Docker Compose Alternative
If your team does not use Aspire, Docker Compose provides the same Ollama container orchestration. This is also useful for CI pipelines where the Aspire AppHost model is impractical.
# docker-compose.yml
services:
ollama:
image: ollama/ollama:latest
ports:
- "11434:11434"
volumes:
- ollama-models:/root/.ollama # Persist downloaded models
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu] # Remove if no GPU
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"]
interval: 10s
timeout: 5s
retries: 10
start_period: 60s # Allow time for model loading
ollama-init:
image: ollama/ollama:latest
depends_on:
ollama:
condition: service_healthy
entrypoint: >
sh -c "ollama pull phi4-mini"
environment:
- OLLAMA_HOST=http://ollama:11434
restart: "no"
api:
build: ./LocalAiApp.Api
ports:
- "8080:8080"
environment:
- ConnectionStrings__ollama=http://ollama:11434
depends_on:
ollama-init:
condition: service_completed_successfully
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
interval: 10s
timeout: 5s
retries: 5
volumes:
ollama-models:
Key points in this Compose configuration:
- ollama-init is a one-shot container that pulls the phi4-mini model after Ollama is healthy. This separates the model download from the Ollama server startup.
- The api service uses service_completed_successfully on ollama-init to ensure the model is downloaded before the API starts accepting traffic.
- The ConnectionStrings__ollama environment variable uses ASP.NET Core's double-underscore convention to set ConnectionStrings:ollama in configuration — the same key that GetConnectionString("ollama") reads.
- The GPU reservation block uses NVIDIA Container Toolkit. Remove it entirely for CPU-only inference.
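The double-underscore mapping can be shown in isolation. ToConfigKey is an illustrative one-liner; the real translation is performed by ASP.NET Core's environment variable configuration provider:

```csharp
using System;

// Environment variable names cannot contain ':', so the configuration
// provider treats "__" as the ':' hierarchy separator.
static string ToConfigKey(string envVarName) => envVarName.Replace("__", ":");

Console.WriteLine(ToConfigKey("ConnectionStrings__ollama")); // ConnectionStrings:ollama
```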
Start the stack:
docker compose up --build
The first run downloads the Ollama image and phi4-mini model. Subsequent starts use the ollama-models volume cache and are much faster.
End-to-End Smoke Test
With either the Aspire or Docker Compose stack running, verify the full flow:
# Basic chat via IChatClient
curl -X POST http://localhost:8080/chat/resilient \
-H "Content-Type: application/json" \
-d '{"message": "What is async/await in C#?"}'
# Function calling test (should invoke WeatherPlugin)
curl -X POST http://localhost:8080/chat \
-H "Content-Type: application/json" \
-d '{"message": "What is the weather in London?"}'
# Health check
curl http://localhost:8080/health
The health endpoint returns the composite health report wired up by AddServiceDefaults. If you register an Ollama connectivity check and it fails, the endpoint returns 503 and load balancers stop routing traffic to the instance.
Summary
You now have a local-first AI application with these characteristics:
- Zero API cost in development — Phi-4-mini runs entirely on local hardware
- No hardcoded URLs — .NET Aspire injects the Ollama endpoint via service discovery
- Production-ready fallback — Azure OpenAI handles requests when local inference fails
- Observable — custom metrics track fallback activation; Aspire dashboard shows traces
- Portable — Docker Compose alternative works in CI and on teams without Aspire
The architecture cleanly separates the AI backend (Ollama vs Azure OpenAI) from business logic. Your services inject IChatClient or Kernel — neither knows which backend is active. The routing decision lives exclusively in Program.cs and ResilientChatService.