
Build a Local AI App with Ollama, Semantic Kernel, and Aspire

Intermediate · .NET 9 · OllamaSharp 5.1.0 · Microsoft.SemanticKernel 1.54.0 · Aspire.Hosting 9.0.0
By Rajesh Mishra · Mar 21, 2026 · 18 min read · Verified Mar 2026
In 30 Seconds

This workshop builds a fully local AI application using Ollama (phi4-mini), Semantic Kernel, and .NET Aspire. It covers OllamaSharp 5.x DI registration, connecting SK to Ollama's OpenAI-compatible endpoint, implementing KernelFunction plugins with FunctionChoiceBehavior.Auto(), orchestrating Ollama as a container via .NET Aspire's community Ollama integration, and adding a cloud fallback to Azure OpenAI on HttpRequestException. The result is a production-ready architecture where local AI handles development and privacy-sensitive workloads while Azure OpenAI handles the cloud path.

What You'll Build

Build a fully local AI app in C# with Ollama, Semantic Kernel, and .NET Aspire. Function calling, streaming, health checks, and Azure OpenAI cloud fallback.


Running local AI inference costs nothing after hardware and keeps sensitive data off the network. This workshop builds a complete application from scratch: Ollama serving Phi-4-mini locally, Semantic Kernel orchestrating it with function calling, .NET Aspire handling container orchestration and service discovery, and Azure OpenAI as a cloud fallback when local inference is unavailable.

By the end you have a working application with a clean architecture that supports both local-first development and cloud production deployment without changing your business logic.

Why Local AI for .NET Development

The economic case is direct. Developers running AI-assisted workflows against cloud APIs report monthly bills in the $200-400 range for intensive use. Routing development and test traffic to a local model reduces that to near zero — you pay only for hardware you already own.

Beyond cost, local inference solves problems cloud inference cannot:

Privacy and compliance. HIPAA, GDPR, and similar regulations require knowing exactly where data is processed. Local inference means patient records, source code, and confidential business data never traverse a network. No data processing addendum, no BAA negotiation required.

Offline capability. Laptops lose connectivity. CI environments may firewall external APIs. A local model works identically on a plane, in an air-gapped lab, and on a developer workstation with a spotty VPN connection.

Fast iteration without quota pressure. Prompt engineering, edge case testing, and synthetic data generation consume tokens faster than production workloads. Running locally eliminates the mental overhead of rate limits and token budgets.

Latency. A modern GPU running Phi-4-mini produces responses in under 100ms for short prompts. Cloud API roundtrips typically add 300-800ms. For interactive applications, this difference is visible to users.

For a deeper comparison of model options — including ONNX Runtime, Foundry Local, and LLamaSharp — see Running Phi-4 Locally in C# — Ollama, ONNX Runtime, and Foundry Local Compared.

Step 1: Set Up Ollama and Pull Phi-4-mini

Install Ollama from ollama.com for your operating system (Windows, macOS, or Linux). Then pull and verify the model:

# Pull Phi-4-mini (3.8B — runs on 4GB VRAM or CPU)
ollama pull phi4-mini

# Verify the model responds correctly
ollama run phi4-mini "Explain dependency injection in .NET in one sentence."

# List installed models to confirm
ollama list

The ollama run command confirms the model is working and shows you the response quality. After this, Ollama serves requests at http://localhost:11434. The /v1 path provides an OpenAI-compatible API endpoint.
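Before wiring up any SDK, you can confirm the OpenAI-compatible surface from .NET by posting directly to /v1/chat/completions with a bare HttpClient. A minimal sketch as a top-level program; it assumes Ollama is running locally with phi4-mini already pulled:

```csharp
using System.Net.Http;
using System.Text;
using System.Text.Json;

// Sketch: exercise Ollama's OpenAI-compatible endpoint with no SDK involved.
// Assumes a local Ollama instance with phi4-mini already pulled.
using var http = new HttpClient { BaseAddress = new Uri("http://localhost:11434") };

var payload = JsonSerializer.Serialize(new
{
    model = "phi4-mini",
    messages = new[] { new { role = "user", content = "Reply with one word: ready?" } }
});

var response = await http.PostAsync(
    "/v1/chat/completions",
    new StringContent(payload, Encoding.UTF8, "application/json"));

response.EnsureSuccessStatusCode();

// The response follows the OpenAI chat completion schema:
// choices[0].message.content holds the model's reply
using var doc = JsonDocument.Parse(await response.Content.ReadAsStringAsync());
Console.WriteLine(doc.RootElement
    .GetProperty("choices")[0]
    .GetProperty("message")
    .GetProperty("content")
    .GetString());
```

If this prints a reply, any OpenAI-compatible client library should work against the same endpoint.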

Phi-4-mini (3.8B parameters) is the practical choice for this workshop. It runs on 4GB VRAM for good interactive performance, or on CPU with 8GB RAM for slower but functional inference. For a hardware comparison, see the Running Phi-4 Locally guide, which benchmarks Phi-4-mini against the full Phi-4 14B model.

If Ollama throws a Connection refused error when you start your .NET application, the most common causes are Ollama not running, wrong endpoint URL format, or Docker networking issues. See Fix Ollama Connection Refused Error in .NET and Semantic Kernel for a complete diagnostic checklist.

Step 2: Register OllamaSharp in Dependency Injection

Create a new ASP.NET Core Web API project and add the required packages:

dotnet new webapi -n LocalAiApp.Api
cd LocalAiApp.Api
dotnet add package OllamaSharp --version 5.1.0
dotnet add package Microsoft.SemanticKernel --version 1.54.0

Register OllamaSharp in Program.cs. OllamaSharp 5.x uses OllamaApiClient — the older OllamaClient type from deprecated preview packages will not compile:

using OllamaSharp;
using Microsoft.Extensions.AI;

var builder = WebApplication.CreateBuilder(args);

// Read Ollama endpoint from configuration (defaults to localhost for local dev)
var ollamaEndpoint = new Uri(
    builder.Configuration.GetConnectionString("ollama")
    ?? "http://localhost:11434");

// Register OllamaApiClient as a singleton; the second constructor argument
// sets the default model used for chat requests
builder.Services.AddSingleton(new OllamaApiClient(ollamaEndpoint, "phi4-mini"));

// Expose it as IChatClient for Microsoft.Extensions.AI consumers
// (OllamaApiClient implements the interface directly in 5.x)
builder.Services.AddSingleton<IChatClient>(sp =>
    sp.GetRequiredService<OllamaApiClient>());

The GetConnectionString("ollama") call is intentional — when running under .NET Aspire, the framework injects the Ollama container’s dynamic address as a connection string. For plain local development, the null-coalescing fallback to http://localhost:11434 handles the case where no Aspire orchestration is present.

You can now inject IChatClient into any service:

public class ChatService(IChatClient chatClient)
{
    public async Task<string> AskAsync(string question)
    {
        var response = await chatClient.GetResponseAsync(question);
        return response.Text;
    }
}

This service has no knowledge of Ollama, Azure OpenAI, or any specific backend. The injection point is where the backend decision lives.
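That separation pays off in tests: you can hand ChatService a stub IChatClient and exercise it with no model running at all. A sketch of such a stub; the member signatures follow Microsoft.Extensions.AI 9.x and should be checked against your installed version:

```csharp
using Microsoft.Extensions.AI;

// Hypothetical stub for unit-testing ChatService without Ollama running.
public sealed class StubChatClient : IChatClient
{
    public Task<ChatResponse> GetResponseAsync(
        IEnumerable<ChatMessage> messages,
        ChatOptions? options = null,
        CancellationToken cancellationToken = default) =>
        Task.FromResult(new ChatResponse(
            new ChatMessage(ChatRole.Assistant, "stubbed reply")));

    public IAsyncEnumerable<ChatResponseUpdate> GetStreamingResponseAsync(
        IEnumerable<ChatMessage> messages,
        ChatOptions? options = null,
        CancellationToken cancellationToken = default) =>
        throw new NotSupportedException("Streaming is not needed for these tests.");

    public object? GetService(Type serviceType, object? serviceKey = null) => null;

    public void Dispose() { }
}

// Usage in a test:
// var service = new ChatService(new StubChatClient());
// var reply = await service.AskAsync("anything");  // "stubbed reply"
```

The same trick swaps in an Azure OpenAI-backed IChatClient in production without touching ChatService.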

Step 3: Add Semantic Kernel Over Ollama

OllamaSharp’s IChatClient works for straightforward chat completions. For Semantic Kernel features — prompt templates, plugins, planners — register SK pointing at Ollama’s OpenAI-compatible endpoint:

using Microsoft.SemanticKernel;

// After the OllamaApiClient registration above, add SK:
builder.Services.AddKernel()
    .AddOpenAIChatCompletion(
        modelId: "phi4-mini",
        endpoint: new Uri($"{ollamaEndpoint.ToString().TrimEnd('/')}/v1"),
        apiKey: "ollama"); // Required parameter — Ollama ignores its value

The apiKey parameter is required by the OpenAI client library but Ollama does not validate it. Any non-empty string works. The convention of passing "ollama" makes it obvious in code reviews that this is a local endpoint.

Verify SK is wired correctly with a minimal test endpoint:

app.MapGet("/test", async (Kernel kernel) =>
{
    var result = await kernel.InvokePromptAsync(
        "In one sentence, what is the capital of France?");
    return Results.Ok(result.ToString());
});

Run the app (dotnet run) and hit GET /test. You should see Phi-4-mini’s response via SK with no cloud API calls.

Step 4: Function Calling with a Local Model

Semantic Kernel’s plugin system works with Phi-4-mini via its tool calling support. Define a plugin class with [KernelFunction] attributes:

using System.ComponentModel; // for [Description]
using Microsoft.SemanticKernel;

public class WeatherPlugin
{
    [KernelFunction("get_current_weather")]
    [Description("Gets the current weather for a specified location")]
    public string GetCurrentWeather(
        [Description("The city and country, e.g., 'London, UK'")] string location)
    {
        // In a real app, call a weather API here
        return $"The weather in {location} is 18°C, partly cloudy.";
    }

    [KernelFunction("get_weather_forecast")]
    [Description("Gets a 3-day weather forecast for a specified location")]
    public string GetWeatherForecast(
        [Description("The city and country")] string location,
        [Description("Number of days, 1-3")] int days = 3)
    {
        return $"{days}-day forecast for {location}: Day 1: 18°C, Day 2: 22°C, Day 3: 15°C (rain).";
    }
}

Register the plugin and configure function calling in your chat endpoint:

using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.Connectors.OpenAI;

app.MapPost("/chat", async (Kernel kernel, ChatRequest request) =>
{
    // Add the plugin to the kernel
    kernel.Plugins.AddFromObject(new WeatherPlugin());

    // Configure automatic function calling
    var executionSettings = new OpenAIPromptExecutionSettings
    {
        FunctionChoiceBehavior = FunctionChoiceBehavior.Auto()
    };

    var result = await kernel.InvokePromptAsync(
        request.Message,
        new KernelArguments(executionSettings));

    return Results.Ok(new { reply = result.ToString() });
});

record ChatRequest(string Message);

FunctionChoiceBehavior.Auto() tells SK to let the model decide when to call functions. With Phi-4-mini, this works for clear, specific queries like “What is the weather in London?” but is less reliable for ambiguous prompts. Test every function invocation path explicitly against the local model — do not assume that a plugin working against GPT-4o will work identically against Phi-4-mini.
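One way to make that testing concrete is to invoke the plugin function directly through the kernel, bypassing the model entirely. This verifies the plugin metadata and argument binding deterministically, so any remaining failures can be attributed to Phi-4-mini's tool-calling. A sketch (the "weather" plugin name is illustrative):

```csharp
using Microsoft.SemanticKernel;

// Sketch: verify plugin wiring deterministically, with no LLM involved.
var kernel = Kernel.CreateBuilder().Build();
kernel.Plugins.AddFromObject(new WeatherPlugin(), "weather");

// Invoke the function directly by plugin and function name
var result = await kernel.InvokeAsync(
    "weather", "get_current_weather",
    new KernelArguments { ["location"] = "London, UK" });

Console.WriteLine(result.ToString());
// Should contain "London, UK" if metadata and binding are correct
```

Once direct invocation passes, failures in the /chat endpoint narrow down to the model's function-selection behavior rather than the plugin code.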

Streaming Responses

For interactive UIs, stream the response token by token:

app.MapGet("/chat/stream", async (string message, Kernel kernel, HttpContext ctx) =>
{
    ctx.Response.ContentType = "text/event-stream";

    var executionSettings = new OpenAIPromptExecutionSettings
    {
        FunctionChoiceBehavior = FunctionChoiceBehavior.Auto()
    };

    await foreach (var update in kernel.InvokePromptStreamingAsync(
        message, new KernelArguments(executionSettings)))
    {
        var text = update.ToString();
        if (!string.IsNullOrEmpty(text))
        {
            await ctx.Response.WriteAsync($"data: {text}\n\n");
            await ctx.Response.Body.FlushAsync();
        }
    }
});

Server-Sent Events (SSE) work well for streaming local model output to browser clients. The text/event-stream content type enables native browser EventSource support without a library.

Step 5: .NET Aspire Integration

.NET Aspire orchestrates the Ollama container, handles service discovery, and provides the metrics dashboard — eliminating hardcoded URLs and manual container management.

Create an Aspire AppHost project:

dotnet new aspire-apphost -n LocalAiApp.AppHost
cd LocalAiApp.AppHost
dotnet add package CommunityToolkit.Aspire.Hosting.Ollama

Configure the AppHost in Program.cs to provision Ollama and wire it to your API service:

// LocalAiApp.AppHost/Program.cs
var builder = DistributedApplication.CreateBuilder(args);

// Provision the Ollama container; the data volume persists downloaded
// models between runs
var ollama = builder.AddOllama("ollama")
    .WithDataVolume();

// Pull phi4-mini automatically when the container starts
ollama.AddModel("phi4-mini");

// Register the API project and give it a reference to Ollama
var api = builder.AddProject<Projects.LocalAiApp_Api>("api")
    .WithReference(ollama)
    .WaitFor(ollama); // Don't start the API until Ollama is ready

builder.Build().Run();

The WithReference(ollama) call injects the Ollama container’s dynamic endpoint as a connection string named "ollama" in the API project’s configuration — which is exactly what builder.Configuration.GetConnectionString("ollama") reads in Step 2.

WaitFor(ollama) prevents the API from starting until Ollama’s health check passes, avoiding the connection refused errors that occur when the API starts before Ollama finishes loading the model.

In the API project, wire up the Aspire service defaults. These come from a shared project generated by the aspire-servicedefaults template rather than a standalone NuGet package; create it once and reference it from the API:

cd ..
dotnet new aspire-servicedefaults -n LocalAiApp.ServiceDefaults
cd LocalAiApp.Api
dotnet add reference ../LocalAiApp.ServiceDefaults

Then call AddServiceDefaults() in Program.cs:
// LocalAiApp.Api/Program.cs — complete version
using OllamaSharp;
using Microsoft.Extensions.AI;
using Microsoft.SemanticKernel;

var builder = WebApplication.CreateBuilder(args);

// Aspire service defaults — adds health checks, telemetry, and service discovery
builder.AddServiceDefaults();

// Read the Ollama endpoint injected by Aspire (or fall back for local dev)
var ollamaEndpoint = new Uri(
    builder.Configuration.GetConnectionString("ollama")
    ?? "http://localhost:11434");

// Register OllamaSharp (OllamaApiClient implements IChatClient directly)
builder.Services.AddSingleton(new OllamaApiClient(ollamaEndpoint, "phi4-mini"));
builder.Services.AddSingleton<IChatClient>(sp =>
    sp.GetRequiredService<OllamaApiClient>());

// Register Semantic Kernel
builder.Services.AddKernel()
    .AddOpenAIChatCompletion(
        modelId: "phi4-mini",
        endpoint: new Uri($"{ollamaEndpoint.ToString().TrimEnd('/')}/v1"),
        apiKey: "ollama");

builder.Services.AddControllers();

var app = builder.Build();
app.MapDefaultEndpoints(); // /health and /alive (mapped in Development by default)
app.MapControllers();
app.Run();

The architecture with Aspire orchestration looks like this:

Architecture overview: the Aspire AppHost orchestrates the AI API service and the Ollama (phi4-mini) container. The API calls Ollama as its primary backend and falls back to Azure OpenAI on error, while telemetry from every component flows to the Aspire dashboard.

Start the entire stack with:

cd LocalAiApp.AppHost
dotnet run

Aspire pulls the Ollama Docker image, downloads phi4-mini (this takes time on first run; subsequent runs use the volume cache), starts Ollama and the API, and launches the Aspire dashboard (its URL is printed in the dotnet run output). The dashboard shows container status, structured logs, and distributed traces across both the API and Ollama.

Step 6: Cloud Fallback to Azure OpenAI

The cloud fallback pattern handles two scenarios: Ollama is unavailable (container crashed, VRAM exhausted), or a specific request exceeds the local model’s capability. The implementation catches HttpRequestException from the local AI call and retries with an Azure OpenAI client:

// LocalAiApp.Api/Services/ResilientChatService.cs
// Requires the Microsoft.SemanticKernel.Connectors.AzureOpenAI package
// for AddAzureOpenAIChatCompletion
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.Connectors.OpenAI;
public class ResilientChatService
{
    private readonly Kernel _localKernel;
    private readonly Kernel _cloudKernel;
    private readonly ILogger<ResilientChatService> _logger;

    public ResilientChatService(
        Kernel localKernel,
        IConfiguration config,
        ILogger<ResilientChatService> logger)
    {
        _localKernel = localKernel;
        _logger = logger;

        // Build a separate cloud kernel for fallback
        var cloudBuilder = Kernel.CreateBuilder();
        cloudBuilder.AddAzureOpenAIChatCompletion(
            deploymentName: config["AzureOpenAI:DeploymentName"] ?? "gpt-4o-mini",
            endpoint: config["AzureOpenAI:Endpoint"]!,
            apiKey: config["AzureOpenAI:ApiKey"]!);
        _cloudKernel = cloudBuilder.Build();
    }

    public async Task<string> InvokeAsync(
        string prompt,
        OpenAIPromptExecutionSettings? settings = null,
        CancellationToken ct = default)
    {
        settings ??= new OpenAIPromptExecutionSettings
        {
            FunctionChoiceBehavior = FunctionChoiceBehavior.Auto()
        };

        try
        {
            // Attempt local inference first
            var result = await _localKernel.InvokePromptAsync(
                prompt, new KernelArguments(settings), cancellationToken: ct);
            return result.ToString();
        }
        catch (HttpRequestException ex)
        {
            // Local Ollama is unreachable — fall back to Azure OpenAI
            _logger.LogWarning(ex,
                "Ollama unavailable, falling back to Azure OpenAI");
            var fallback = await _cloudKernel.InvokePromptAsync(
                prompt, new KernelArguments(settings), cancellationToken: ct);
            return fallback.ToString();
        }
        catch (TaskCanceledException ex) when (!ct.IsCancellationRequested)
        {
            // Timeout from Ollama (e.g., model loading) — fall back
            _logger.LogWarning(ex,
                "Ollama timed out, falling back to Azure OpenAI");
            var fallback = await _cloudKernel.InvokePromptAsync(
                prompt, new KernelArguments(settings), cancellationToken: ct);
            return fallback.ToString();
        }
    }
}

Register ResilientChatService in Program.cs:

builder.Services.AddSingleton<ResilientChatService>();

And update the chat endpoint to use it:

app.MapPost("/chat/resilient", async (
    ChatRequest request,
    ResilientChatService chatService) =>
{
    var reply = await chatService.InvokeAsync(request.Message);
    return Results.Ok(new { reply });
});

Monitor how often the fallback activates by adding a metric counter:

// Uses System.Diagnostics.Metrics; add 'using System.Diagnostics;' for TagList
private static readonly Counter<int> FallbackCounter =
    new Meter("LocalAiApp.Api").CreateCounter<int>("ai_fallback_total");

// Inside the catch block:
FallbackCounter.Add(1, new TagList { { "reason", "http_error" } });

The Aspire dashboard collects this custom metric via OpenTelemetry once the meter is registered — add .AddMeter("LocalAiApp.Api") to the OpenTelemetry metrics configuration in the service defaults project. Set an alert if ai_fallback_total exceeds your acceptable threshold per hour.

Step 7: Docker Compose Alternative

If your team does not use Aspire, Docker Compose provides the same Ollama container orchestration. This is also useful for CI pipelines where the Aspire AppHost model is impractical.

# docker-compose.yml
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama-models:/root/.ollama  # Persist downloaded models
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]  # Remove if no GPU
    healthcheck:
      test: ["CMD", "ollama", "list"]  # the ollama image does not include curl
      interval: 10s
      timeout: 5s
      retries: 10
      start_period: 60s  # Allow time for model loading

  ollama-init:
    image: ollama/ollama:latest
    depends_on:
      ollama:
        condition: service_healthy
    entrypoint: >
      sh -c "ollama pull phi4-mini"
    environment:
      - OLLAMA_HOST=http://ollama:11434
    restart: "no"

  api:
    build: ./LocalAiApp.Api
    ports:
      - "8080:8080"
    environment:
      - ConnectionStrings__ollama=http://ollama:11434
    depends_on:
      ollama-init:
        condition: service_completed_successfully
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]  # requires curl in the API image
      interval: 10s
      timeout: 5s
      retries: 5

volumes:
  ollama-models:

Key points in this Compose configuration:

  • ollama-init is a one-shot container that pulls the phi4-mini model after Ollama is healthy. This separates the model download from the Ollama server startup.
  • The api service uses service_completed_successfully on ollama-init to ensure the model is downloaded before the API starts accepting traffic.
  • The ConnectionStrings__ollama environment variable uses ASP.NET Core’s double-underscore convention to set ConnectionStrings:ollama in configuration — the same key that GetConnectionString("ollama") reads.
  • The GPU reservation block uses NVIDIA Container Toolkit. Remove it entirely for CPU-only inference.
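The double-underscore mapping is easy to verify in isolation with the configuration builder. A small sketch, assuming the Microsoft.Extensions.Configuration packages (pulled in transitively by ASP.NET Core):

```csharp
using Microsoft.Extensions.Configuration;

// Simulate the variable Compose sets, then read it the way the API does.
Environment.SetEnvironmentVariable(
    "ConnectionStrings__ollama", "http://ollama:11434");

var config = new ConfigurationBuilder()
    .AddEnvironmentVariables()
    .Build();

// "__" maps to the ':' hierarchy separator, so this resolves
// ConnectionStrings:ollama — the key GetConnectionString("ollama") reads
Console.WriteLine(config.GetConnectionString("ollama"));
// → http://ollama:11434
```

The same convention works for any nested configuration key, which is why the Compose file and the Aspire-injected connection string can feed the identical code path in Program.cs.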

Start the stack:

docker compose up --build

The first run downloads the Ollama image and phi4-mini model. Subsequent starts use the ollama-models volume cache and are much faster.

End-to-End Smoke Test

With either the Aspire or Docker Compose stack running, verify the full flow:

# Basic chat via IChatClient
curl -X POST http://localhost:8080/chat/resilient \
  -H "Content-Type: application/json" \
  -d '{"message": "What is async/await in C#?"}'

# Function calling test (should invoke WeatherPlugin)
curl -X POST http://localhost:8080/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "What is the weather in London?"}'

# Health check
curl http://localhost:8080/health

The health endpoint returns the composite health report from Aspire's service defaults. If a registered Ollama connectivity check fails, the endpoint returns 503 and load balancers stop routing traffic.

Summary

You now have a local-first AI application with these characteristics:

  • Zero API cost in development — Phi-4-mini runs entirely on local hardware
  • No hardcoded URLs — .NET Aspire injects the Ollama endpoint via service discovery
  • Production-ready fallback — Azure OpenAI handles requests when local inference fails
  • Observable — custom metrics track fallback activation; Aspire dashboard shows traces
  • Portable — Docker Compose alternative works in CI and on teams without Aspire

The architecture cleanly separates the AI backend (Ollama vs Azure OpenAI) from business logic. Your services inject IChatClient or Kernel — neither knows which backend is active. The routing decision lives exclusively in Program.cs and ResilientChatService.

⚠ Production Considerations

  • Ollama loads the model into VRAM on the first request, which can take 5-30 seconds. The first inference call after container startup will time out with default HttpClient timeouts. Pre-warm the model by sending a probe request during application startup, or configure a longer timeout for the initial connection. In Aspire, the health check endpoint delays traffic until Ollama is ready.
  • Phi-4-mini's function calling reliability drops significantly on ambiguous prompts. When building plugins, write narrow, unambiguous function descriptions and test every tool invocation path against the local model. A plugin that works perfectly with GPT-4o may silently fail to trigger — or trigger the wrong function — with Phi-4-mini.
  • The Azure OpenAI cloud fallback introduces latency and cost on every failure. If Ollama is consistently unhealthy (container crashed, VRAM exhausted), every request will fall through to Azure and your costs will spike. Monitor the fallback activation rate via a counter metric and alert when it exceeds your threshold.
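The pre-warm suggestion from the first bullet can be implemented as a background service that fires one tiny prompt at startup, forcing Ollama to load the model before real traffic arrives. A sketch, assuming the IChatClient registration from Step 2 (the class name and timeout are illustrative):

```csharp
using Microsoft.Extensions.AI;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;

// Hypothetical warm-up service: sends one small prompt at startup so the
// first real request does not pay the model-load cost.
public sealed class ModelWarmupService(
    IChatClient chatClient,
    ILogger<ModelWarmupService> logger) : BackgroundService
{
    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        try
        {
            using var cts = CancellationTokenSource
                .CreateLinkedTokenSource(stoppingToken);
            cts.CancelAfter(TimeSpan.FromSeconds(60)); // model load can be slow

            await chatClient.GetResponseAsync("ping", cancellationToken: cts.Token);
            logger.LogInformation("Model warm-up complete");
        }
        catch (Exception ex)
        {
            // Warm-up is best-effort; the fallback path still covers failures
            logger.LogWarning(ex, "Model warm-up failed");
        }
    }
}

// Registration in Program.cs:
// builder.Services.AddHostedService<ModelWarmupService>();
```

Because BackgroundService runs after the host starts, this does not delay the health endpoint; it simply races the first user request to the model load.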


🧠 Architect’s Note

Treat the local/cloud split as an architectural decision, not just a config toggle. Local Ollama is ideal for development, CI, and privacy-sensitive production workloads. Azure OpenAI is the right choice for customer-facing features that need content filtering, SLAs, and managed scaling. Designing both paths from the start — as this workshop does — lets you switch the active backend per environment without touching business logic.


Key Takeaways

  • OllamaSharp 5.x uses OllamaApiClient(new Uri("http://localhost:11434")) — not OllamaClient()
  • Semantic Kernel connects to Ollama via AddOpenAIChatCompletion pointing at http://localhost:11434/v1
  • KernelFunction plugins and FunctionChoiceBehavior.Auto() work with Phi-4-mini but need testing
  • .NET Aspire's Ollama integration eliminates hardcoded URLs via service discovery
  • Catch HttpRequestException around local AI calls and retry with Azure OpenAI as the cloud fallback
  • Docker Compose is a viable alternative when Aspire orchestration is not required

Implementation Checklist

  • Install Ollama and run ollama pull phi4-mini to download the model
  • Create an Aspire AppHost project and add the Aspire.Hosting.Ollama package
  • Register Ollama in AppHost with builder.AddOllama("ollama").AddModel("phi4-mini")
  • In the API project, call builder.AddServiceDefaults() and read the Ollama connection string
  • Register OllamaApiClient in DI and expose it as IChatClient (it implements the interface directly in 5.x)
  • Register Semantic Kernel with AddOpenAIChatCompletion pointing to the Ollama /v1 endpoint
  • Implement a KernelFunction plugin and set FunctionChoiceBehavior.Auto() in execution settings
  • Add HttpRequestException catch around local AI calls to retry with Azure OpenAI fallback

Frequently Asked Questions

What is the correct OllamaSharp 5.x constructor for registering the client in DI?

Use new OllamaApiClient(new Uri("http://localhost:11434"), "phi4-mini") — the constructor takes a Uri (plus an optional default model name), not a plain string. Register the instance as a singleton with builder.Services.AddSingleton(...), then expose it as IChatClient; in OllamaSharp 5.x, OllamaApiClient implements Microsoft.Extensions.AI.IChatClient directly.

How do I connect Semantic Kernel to a local Ollama model?

Ollama exposes an OpenAI-compatible REST endpoint at /v1. Register it in SK with: kernelBuilder.AddOpenAIChatCompletion(modelId: "phi4-mini", endpoint: new Uri("http://localhost:11434/v1"), apiKey: "ollama"). The apiKey can be any non-empty string — Ollama does not validate it.

Does function calling work with local models like Phi-4-mini via Ollama?

Yes, with caveats. Phi-4-mini supports tool calling, so KernelFunction plugins and FunctionChoiceBehavior.Auto() work. However, smaller local models are less reliable at triggering the correct function on ambiguous prompts compared to GPT-4o. Test your specific plugins against Phi-4-mini before relying on them in production paths.

How does .NET Aspire integrate with Ollama?

Using the CommunityToolkit.Aspire.Hosting.Ollama community integration package. In your AppHost, call builder.AddOllama("ollama").AddModel("phi4-mini"). Aspire provisions Ollama as a container, pulls the model, and injects the connection string into dependent services via service discovery — no hardcoded URLs required.

What NuGet package provides .NET Aspire's Ollama integration?

The CommunityToolkit.Aspire.Hosting.Ollama community package, available on NuGet. Add it to your AppHost project with: dotnet add package CommunityToolkit.Aspire.Hosting.Ollama. In consuming services, read the injected connection string via builder.Configuration.GetConnectionString("ollama") to get the dynamically assigned endpoint URL.

How do I implement Azure OpenAI as a cloud fallback when Ollama is unavailable?

Catch HttpRequestException or TaskCanceledException around your local AI call and retry with an Azure OpenAI client. For cleaner separation, implement IFunctionInvocationFilter in Semantic Kernel to intercept failures. Store both clients in DI and resolve the fallback client only when the primary fails.

Can I run this workshop app without GPU hardware?

Yes, but expect slow inference. Phi-4-mini runs on CPU with 8GB RAM, producing roughly 3-8 tokens per second — workable for testing but not interactive use. For a better experience on CPU, try a smaller quantized model. With a 4GB GPU, Phi-4-mini Q4_K_M produces 45-60 tokens per second, which is comfortable for interactive development.


#Ollama #Semantic Kernel #.NET Aspire #Local AI #OllamaSharp #.NET AI