
Azure OpenAI Cost Optimization for .NET Developers

Intermediate · .NET 9 · Azure.AI.OpenAI 2.1.0 · Microsoft.SemanticKernel 1.54.0 · Microsoft.ML.Tokenizers 0.22.0
By Rajesh Mishra · Mar 21, 2026 · 15 min read
Verified Mar 2026 .NET 9 Azure.AI.OpenAI 2.1.0
In 30 Seconds

Azure OpenAI costs in C# are controlled through six strategies: model routing (GPT-4o-mini is 33x cheaper than GPT-4o for simple tasks), semantic caching (40-60% call reduction for repeated queries), token budgets enforced per-user, prompt compression, Ollama for local dev (zero cost), and the Batch API for non-real-time workloads (50% discount). Implement IFunctionInvocationFilter for cross-cutting cost controls in Semantic Kernel.

Azure OpenAI billing is opaque until your first invoice arrives and surprises you. The cost structure is simple — you pay per token — but the levers for controlling that cost are not obvious until you understand what drives consumption. This guide covers six concrete strategies with C# code you can apply immediately.

Understanding Azure OpenAI Pricing

The most important pricing insight for .NET developers is the price ratio between models. As of early 2026, the gap between GPT-4o and GPT-4o-mini is dramatic:

Model                    Input (per 1M tokens)  Output (per 1M tokens)  Ratio vs GPT-4o
GPT-4o                   $5.00                  $15.00                  1x
GPT-4o-mini              $0.15                  $0.60                   ~33x cheaper
text-embedding-3-small   $0.02                  n/a                     n/a
text-embedding-3-large   $0.13                  n/a                     n/a

See the Azure OpenAI pricing page for current rates — these change as models mature.

The 33x price difference between GPT-4o and GPT-4o-mini is the single most actionable number in AI cost optimization. If 40% of your queries are simple enough for GPT-4o-mini to handle, that 40% now costs 33x less. No architectural changes required — just routing.
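To make that concrete, here is the blended-cost arithmetic for a hypothetical workload; the volumes are illustrative and the rates come from the table above:

```csharp
using System;

// Illustrative monthly volume: 10M input + 2M output tokens; rates from the table above
const double inputM = 10.0, outputM = 2.0;          // millions of tokens per month
const double fullIn = 5.00, fullOut = 15.00;        // GPT-4o, $ per 1M tokens
const double miniIn = 0.15, miniOut = 0.60;         // GPT-4o-mini, $ per 1M tokens

double costAllGpt4o = inputM * fullIn + outputM * fullOut;    // $80.00

// Route 40% of traffic to GPT-4o-mini; 60% stays on GPT-4o
const double routedShare = 0.40;
double costWithRouting =
    (1 - routedShare) * (inputM * fullIn + outputM * fullOut) +
    routedShare * (inputM * miniIn + outputM * miniOut);      // $49.08

Console.WriteLine($"All GPT-4o: ${costAllGpt4o:F2}, with routing: ${costWithRouting:F2}");
// Roughly 39% off the monthly bill from routing alone
```

The output-token ratio is 25x rather than 33x, which is why the blended saving lands near 39% rather than a naive 40%-of-33x figure.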

Pay-as-you-go vs Provisioned Throughput (PTU). Pay-as-you-go charges per token with no upfront commitment. PTU is a monthly capacity reservation that gives guaranteed throughput at a fixed monthly price. PTU becomes cost-effective at high, consistent volumes — typically 50M+ tokens per month and above 50% utilization of purchased capacity. Below that threshold, pay-as-you-go is cheaper and simpler.
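A rough break-even check makes the PTU decision concrete. The reservation price below is a placeholder, not a real quote; substitute the number from your Azure agreement:

```csharp
using System;

// Break-even sketch. The PTU price below is a PLACEHOLDER; use your Azure quote.
const double ptuMonthlyPrice = 10_000.00;            // hypothetical reservation, $/month
const double paygInPerM = 5.00, paygOutPerM = 15.00; // GPT-4o pay-as-you-go, $ per 1M tokens

static double PaygMonthlyCost(double inputTokensM, double outputTokensM) =>
    inputTokensM * paygInPerM + outputTokensM * paygOutPerM;

// At 1.5B input + 200M output tokens/month, pay-as-you-go costs $10,500,
// above the placeholder reservation, so PTU would win (if utilization stays high)
double payg = PaygMonthlyCost(inputTokensM: 1_500, outputTokensM: 200);
Console.WriteLine(payg > ptuMonthlyPrice
    ? "PTU is cheaper at this volume"
    : "Stay on pay-as-you-go");
```

Remember the second condition from the paragraph above: even past the token break-even, PTU only pays off if you sustain utilization of the reserved capacity.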

Before you optimize cost, you need to know what you are spending. See the token counting guide for how to measure token consumption per request so you have a baseline to optimize against.

Strategy 1: Token Budgeting

In multi-tenant applications, a single user can exhaust shared Azure OpenAI quota and degrade the experience for all other users. Per-user token budgets prevent this and give you predictable cost scaling.

The pattern is straightforward: maintain a per-user daily token counter in a distributed cache, check it before each AI call, and deduct actual usage after. Use Redis (via IDistributedCache) to share state across multiple application instances:

using Microsoft.Extensions.Caching.Distributed;
using System.Text.Json;

public class UserTokenBudgetService
{
    private readonly IDistributedCache _cache;
    private const int DailyTokenBudget = 50_000; // per user per day

    public UserTokenBudgetService(IDistributedCache cache) => _cache = cache;

    public async Task<bool> TryConsumeTokensAsync(
        string userId, int tokensToConsume, CancellationToken ct = default)
    {
        var key = $"token-budget:{userId}:{DateTime.UtcNow:yyyy-MM-dd}";

        // NOTE: this read-modify-write is not atomic; two concurrent requests can
        // both pass the check. For strict enforcement, use an atomic Redis INCRBY
        // (e.g. via StackExchange.Redis) instead of get-then-set.
        var currentBytes = await _cache.GetAsync(key, ct);
        var current = currentBytes != null
            ? JsonSerializer.Deserialize<int>(currentBytes)
            : 0;

        if (current + tokensToConsume > DailyTokenBudget)
            return false;

        var newValue = current + tokensToConsume;
        await _cache.SetAsync(
            key,
            JsonSerializer.SerializeToUtf8Bytes(newValue),
            new DistributedCacheEntryOptions
            {
                AbsoluteExpiration = DateTimeOffset.UtcNow.Date.AddDays(1)
            }, ct);

        return true;
    }
}

In your API layer, check the budget before forwarding to Azure OpenAI and return HTTP 429 when the budget is exhausted. After the AI call completes, deduct completion.Usage.TotalTokenCount from the user’s remaining quota.

Setting the right budget limit requires telemetry. Start by logging actual token usage per user per day for two weeks, then set the budget at the 90th percentile plus 20%. This accommodates normal heavy users while protecting against runaway consumption.
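The percentile rule above reduces to a few lines once you have the usage samples; the numbers here are hypothetical:

```csharp
using System;
using System.Linq;

// Daily token usage per user from the telemetry window (hypothetical samples)
int[] dailyUsage = { 800, 1_200, 3_500, 4_000, 5_200, 7_400, 9_000, 12_000, 18_000, 26_000 };

// Nearest-rank 90th percentile, then 20% headroom
var sorted = dailyUsage.OrderBy(x => x).ToArray();
int p90 = sorted[(int)Math.Ceiling(0.90 * sorted.Length) - 1];
int dailyBudget = (int)Math.Round(p90 * 1.2);

Console.WriteLine($"p90 = {p90:N0} tokens, daily budget = {dailyBudget:N0} tokens");
```

Recompute the budget periodically: usage distributions shift as features ship, and a budget set once from stale telemetry will either throttle legitimate users or stop protecting you.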

Strategy 2: Model Routing

Model routing is the highest-return optimization available. The idea is to classify each incoming query before routing it — simple queries go to GPT-4o-mini, complex queries go to GPT-4o.

A classifier that runs on GPT-4o-mini costs fewer than 20 tokens per classification. If it correctly identifies even 30% of queries as simple and routes them away from GPT-4o, the classifier overhead is paid back many times over.

using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;

public class ModelRoutingService
{
    private readonly IChatCompletionService _miniService;  // GPT-4o-mini
    private readonly IChatCompletionService _fullService;  // GPT-4o

    public ModelRoutingService(
        [FromKeyedServices("gpt-4o-mini")] IChatCompletionService mini,
        [FromKeyedServices("gpt-4o")] IChatCompletionService full)
    {
        _miniService = mini;
        _fullService = full;
    }

    public async Task<string> CompleteAsync(string userQuery, CancellationToken ct = default)
    {
        // Classify complexity using the cheap model
        var complexity = await ClassifyComplexityAsync(userQuery, ct);

        var service = complexity == "simple" ? _miniService : _fullService;
        var history = new ChatHistory("You are a helpful assistant.");
        history.AddUserMessage(userQuery);

        var response = await service.GetChatMessageContentAsync(history, cancellationToken: ct);
        return response.Content ?? string.Empty;
    }

    private async Task<string> ClassifyComplexityAsync(string query, CancellationToken ct)
    {
        var history = new ChatHistory(
            "Classify this query as 'simple' (factual, short answer) or 'complex' (reasoning, analysis, code). " +
            "Reply with only the single word: simple or complex.");
        history.AddUserMessage(query);

        var result = await _miniService.GetChatMessageContentAsync(history, cancellationToken: ct);
        return result.Content?.Trim().ToLowerInvariant() == "complex" ? "complex" : "simple";
    }
}

Register keyed services in your DI container so each service gets the correct underlying model deployment:

builder.Services.AddAzureOpenAIChatCompletion(
    deploymentName: "gpt-4o-mini",
    endpoint: builder.Configuration["AzureOpenAI:Endpoint"]!,
    apiKey: builder.Configuration["AzureOpenAI:ApiKey"]!,
    serviceId: "gpt-4o-mini"); // serviceId registers a keyed IChatCompletionService
// Similar registration with deploymentName/serviceId "gpt-4o" for your GPT-4o deployment

Track the routing distribution in production. If your classifier routes fewer than 20% of queries to GPT-4o-mini, either your query mix is genuinely complex or the classifier needs tuning. Logging the routing decision alongside the query lets you spot misclassifications quickly.
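Tracking the routing distribution can start as small as a counter pair; this sketch prints to the console where production code would record a metric:

```csharp
using System;

// Minimal routing-distribution tracker: count each routing decision and
// compute the share of traffic that avoided GPT-4o.
long miniCount = 0, fullCount = 0;

void RecordDecision(bool routedToMini)
{
    if (routedToMini) miniCount++;
    else fullCount++;
}

double MiniShare() =>
    miniCount + fullCount == 0 ? 0.0 : (double)miniCount / (miniCount + fullCount);

// Simulated classifier decisions
RecordDecision(true); RecordDecision(true); RecordDecision(true); RecordDecision(false);
Console.WriteLine($"GPT-4o-mini share: {MiniShare():P0}");
```

If the share sits below 20% in production, revisit the classifier prompt before concluding that your traffic is genuinely complex.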

Strategy 3: Semantic Caching as IFunctionInvocationFilter

Many production AI workloads receive the same or nearly the same queries repeatedly. A support chatbot receives variations of “How do I reset my password?” hundreds of times per day. Computing a fresh AI response each time wastes money.

Semantic caching intercepts AI function calls in Semantic Kernel using the IFunctionInvocationFilter interface. Before the real AI call executes, the filter checks whether a cached response exists for this input. If it does, the filter short-circuits the pipeline and returns the cached result:

using Microsoft.SemanticKernel;
using Microsoft.Extensions.Caching.Memory;
using System.Security.Cryptography;
using System.Text;

public class SemanticCacheFilter : IFunctionInvocationFilter
{
    private readonly IMemoryCache _cache;
    private readonly TimeSpan _cacheDuration;

    public SemanticCacheFilter(IMemoryCache cache, TimeSpan? cacheDuration = null)
    {
        _cache = cache;
        _cacheDuration = cacheDuration ?? TimeSpan.FromHours(1);
    }

    public async Task OnFunctionInvocationAsync(
        FunctionInvocationContext context,
        Func<FunctionInvocationContext, Task> next)
    {
        // Only cache prompt functions, not tool calls (PluginName is null for
        // functions created outside a plugin, so compare null-safely)
        if (context.Function.PluginName?.Contains("Prompt") != true)
        {
            await next(context);
            return;
        }

        var cacheKey = ComputeCacheKey(context);

        if (_cache.TryGetValue(cacheKey, out string? cachedResult))
        {
            context.Result = new FunctionResult(context.Function, cachedResult);
            return; // Skip the actual AI call
        }

        await next(context);

        if (context.Result?.GetValue<string>() is string result)
        {
            _cache.Set(cacheKey, result, _cacheDuration);
        }
    }

    private static string ComputeCacheKey(FunctionInvocationContext context)
    {
        var keyInput = $"{context.Function.PluginName}:{context.Function.Name}:" +
                       string.Join(",", context.Arguments.Select(a => $"{a.Key}={a.Value}"));

        var hash = SHA256.HashData(Encoding.UTF8.GetBytes(keyInput));
        return $"sk-cache:{Convert.ToHexString(hash)[..16]}";
    }
}

Register the filter with your kernel:

builder.Services.AddMemoryCache();
builder.Services.AddSingleton<IFunctionInvocationFilter, SemanticCacheFilter>();
// A Kernel built from this service collection picks up registered filters automatically.
// With a manually constructed Kernel, add the filter to kernel.FunctionInvocationFilters instead.

The implementation above uses exact-match caching by hashing function arguments. For true semantic caching — matching near-identical queries — you would compute embeddings of the user query, store them in a vector cache, and return a hit when cosine similarity exceeds a threshold (typically 0.92). The exact-match version shown here handles the common case where repeated queries are truly identical (same FAQ, same document reference) without the added complexity of embedding lookup.
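For the embedding-based variant, the similarity gate itself is only a few lines. This is a sketch: the vectors would come from your embedding model (e.g. text-embedding-3-small), and 0.92 is the threshold mentioned above:

```csharp
using System;

// Cosine similarity between two embedding vectors; a cache hit when >= threshold
double CosineSimilarity(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
{
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}

bool IsCacheHit(float[] queryEmbedding, float[] cachedEmbedding, double threshold = 0.92) =>
    CosineSimilarity(queryEmbedding, cachedEmbedding) >= threshold;

// Identical vectors score 1.0; orthogonal vectors score 0.0
Console.WriteLine(CosineSimilarity(new float[] { 1, 0 }, new float[] { 1, 0 }));
```

In practice you would not scan the cache linearly; store the embeddings in a vector index (Redis vector search, Azure AI Search) and let it do the nearest-neighbor lookup.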

Cache hit rates of 40-60% are achievable in support-chat and documentation-assistant scenarios. Log cache hit/miss ratios alongside estimated cost savings so you can demonstrate the value and tune the TTL.

Strategy 4: Prompt Compression

Every token in your prompt costs money. System prompts that grew incrementally over months often contain redundancy, outdated examples, and verbose phrasing that the model does not need.

Set MaxOutputTokenCount (the SDK property backing max_tokens) on every request. Without an output token cap, a model can generate an arbitrarily long response. For most use cases, 1,024 or 2,048 tokens is more than sufficient:

var options = new ChatCompletionOptions
{
    MaxOutputTokenCount = 1024, // Cap output tokens
};
var completion = await chatClient.CompleteChatAsync(messages, options);

Audit your system prompt for compression opportunities. Common sources of excess tokens:

  • Lengthy few-shot examples that the model no longer needs after fine-tuning
  • Verbose role definitions (“You are an extremely helpful assistant who always…”) that can be shortened to a single sentence
  • Repeated instructions that appear in both the system prompt and user messages
  • Full function signatures repeated every turn in tool-use conversations — use abbreviated references in multi-turn history instead

A prompt compression pass on a mature system prompt often reduces token count by 20-40% with no measurable quality degradation. Measure quality before and after using a fixed test set of 50-100 representative queries.

In multi-turn conversations, trim older messages from the context window once the conversation exceeds a threshold. Keep the system prompt and the most recent N turns. Summarizing older turns rather than dropping them preserves context while reducing token count.
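The keep-recent trimming pattern can be sketched with plain (role, content) pairs; with Semantic Kernel you would apply the same slicing to a ChatHistory (summarizing dropped turns is left out here):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Keep the system prompt (index 0) plus the most recent `keepTurns` messages
List<(string Role, string Content)> Trim(
    List<(string Role, string Content)> history, int keepTurns)
{
    if (history.Count <= keepTurns + 1) return history;
    var trimmed = new List<(string, string)> { history[0] };  // system prompt survives
    trimmed.AddRange(history.Skip(history.Count - keepTurns)); // most recent turns
    return trimmed;
}

var convo = new List<(string, string)>
{
    ("system", "You are a helpful assistant."),
    ("user", "turn 1"), ("assistant", "reply 1"),
    ("user", "turn 2"), ("assistant", "reply 2"),
    ("user", "turn 3"),
};
var recent = Trim(convo, keepTurns: 4); // system + last 4 messages
Console.WriteLine(recent.Count);
```

A production version should also avoid starting the trimmed window on an orphaned assistant or tool message, and replace the dropped turns with a one-message summary when earlier context still matters.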

Strategy 5: Ollama for Local Development

Every API call made during development, debugging, and testing costs real money. A developer running the application locally for a few hours can easily consume thousands of tokens before any production traffic arrives.

Ollama eliminates this entirely. Register it as your IChatClient for non-production environments, and your application code changes nothing — only the DI registration differs:

// Program.cs
if (builder.Environment.IsDevelopment())
{
    // Zero-cost local AI: Ollama exposes an OpenAI-compatible endpoint
    builder.Services.AddChatClient(
        new OpenAIClient(
            new ApiKeyCredential("ollama"), // Ollama ignores the key; any non-empty value works
            new OpenAIClientOptions { Endpoint = new Uri("http://localhost:11434/v1") })
        .GetChatClient("phi4-mini")
        .AsIChatClient());
}
else
{
    // Production Azure OpenAI
    builder.Services.AddChatClient(
        new AzureOpenAIClient(
            new Uri(builder.Configuration["AzureOpenAI:Endpoint"]!),
            new AzureKeyCredential(builder.Configuration["AzureOpenAI:ApiKey"]!))
        .GetChatClient(builder.Configuration["AzureOpenAI:Deployment"]!) // your deployment name
        .AsIChatClient());
}

IChatClient is the same interface in both cases. Services that depend on IChatClient receive the correct implementation for their environment without any conditional logic in business code.

Phi-4-mini running locally is capable enough for most development scenarios — testing prompt logic, validating JSON output parsing, exercising the RAG pipeline. For setup instructions including model download and the Ollama service configuration, see the full Phi-4 local development guide.

Strategy 6: Azure OpenAI Batch API

Real-time API calls are expensive because they require reserved capacity and low-latency routing. For workloads where the response is not needed immediately, the Azure OpenAI Batch API processes requests at 50% lower cost than the real-time API.

When to use it:

  • Document classification pipelines (classify 10,000 support tickets overnight)
  • Bulk embedding generation for vector database population
  • Automated evaluation of AI output quality (run nightly, not during user sessions)
  • Offline content enrichment (generate summaries, extract entities from a corpus)

The Batch API accepts a JSONL file where each line is a self-contained request:

using System.Text.Json;

// Batch API: submit a JSONL file, retrieve results asynchronously.
// For Azure OpenAI the per-line "url" is "/chat/completions" and "model"
// is your deployment name, not the base model name.
var batchRequest = new
{
    custom_id = "request-001",
    method = "POST",
    url = "/chat/completions",
    body = new
    {
        model = "gpt-4o-mini", // deployment name
        messages = new[] { new { role = "user", content = "Classify this document: ..." } },
        max_tokens = 10
    }
};
string jsonlLine = JsonSerializer.Serialize(batchRequest); // one line per request
// Upload the assembled JSONL via the Files API with purpose "batch",
// create the batch job, then poll until it completes and download the output file.
The Batch API returns results within 24 hours. This latency is the trade-off for the cost reduction — only use the Batch API for workloads that can tolerate asynchronous processing. For document classification, embedding generation, and offline evaluation, 24-hour latency is entirely acceptable.

See the Azure OpenAI Batch API documentation for request format details and polling patterns.

Monitoring Cost

You cannot optimize what you do not measure. The architect's note below says the same: instrument before you optimize. A simple .NET Meter tracks estimated cost per request, per model, and per feature:

using System.Diagnostics.Metrics;

private static readonly Meter _meter = new("MyApp.AI.Cost", "1.0.0");
private static readonly Histogram<double> _costHistogram =
    _meter.CreateHistogram<double>("ai.request.cost.usd", "USD", "Estimated cost per request");

// After each call; the rates below are GPT-4o pay-as-you-go ($5/M input, $15/M output)
double estimatedCost = (completion.Usage.InputTokenCount / 1_000_000.0 * 5.0) +
                       (completion.Usage.OutputTokenCount / 1_000_000.0 * 15.0);
_costHistogram.Record(estimatedCost, new TagList { { "model", "gpt-4o" }, { "feature", "support-chat" } });

Tag every cost metric with the model name and the application feature. After a week of production data, you will see clearly which features consume the most tokens — often it is one or two features driving 80% of cost. That is where to focus optimization effort.

Export this metric to Azure Monitor or Prometheus. Alert when daily cost exceeds a threshold. Use the feature tag to attribute cost back to product features, which turns AI cost into a conversation about feature value rather than a pure engineering expense.
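The feature-level attribution the tags enable reduces to a group-by over recorded (feature, cost) pairs; the numbers below are hypothetical:

```csharp
using System;
using System.Linq;

// Hypothetical per-request cost records exported from your metrics backend
var records = new[]
{
    (Feature: "support-chat", CostUsd: 41.20),
    (Feature: "support-chat", CostUsd: 38.80),
    (Feature: "doc-summarizer", CostUsd: 12.00),
    (Feature: "search-rerank", CostUsd: 8.00),
};

double total = records.Sum(r => r.CostUsd);
var byFeature = records
    .GroupBy(r => r.Feature)
    .Select(g => (Feature: g.Key, Share: g.Sum(r => r.CostUsd) / total))
    .OrderByDescending(x => x.Share)
    .ToList();

foreach (var (feature, share) in byFeature)
    Console.WriteLine($"{feature}: {share:P0} of spend");
// Here support-chat dominates: that is where optimization effort pays off first
```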

Putting It Together

The strategies are not mutually exclusive; the most cost-effective production systems run instrumentation plus several of the six simultaneously:

  1. Instrument with cost metrics to establish baseline and identify expensive features
  2. Route by model complexity to redirect simple queries to GPT-4o-mini (33x savings on routed queries)
  3. Cache repeated queries with SemanticCacheFilter (40-60% call reduction in high-repetition scenarios)
  4. Budget per user to protect shared quota in multi-tenant apps
  5. Compress prompts to remove redundancy (20-40% prompt size reduction)
  6. Switch to Ollama in development to eliminate dev/test API costs entirely
  7. Batch non-real-time workloads at 50% discount

Applied together, these strategies routinely reduce Azure OpenAI spend by 60-80% from an unoptimized baseline. The model routing optimization alone — ensuring simple queries go to GPT-4o-mini — typically delivers the largest single reduction.


⚠ Production Considerations

  • Semantic caching is only safe for deterministic or factual queries. Caching responses to subjective or time-sensitive questions ('What's the best approach today?' 'What happened this week?') returns stale answers. Add a cache TTL and exclude question types that require freshness.
  • The complexity classifier itself consumes tokens. If the classifier averages 20 input tokens and your queries average 100 input tokens, the routing overhead is 20%. Only implement model routing when the percentage of queries correctly routed to cheaper models exceeds 30% — below that, the overhead cost equals the savings.


🧠 Architect’s Note

Build cost observability before implementing cost optimization. Without metrics on per-feature token consumption, you cannot identify which features are expensive or validate that your optimizations are working. Instrument first, optimize second — the data will show you where 80% of your cost comes from.


Key Takeaways

  • GPT-4o-mini vs GPT-4o: 33x price difference — route all simple queries to mini
  • Semantic caching as IFunctionInvocationFilter: 40-60% cost reduction for repeated queries
  • Per-user token budgets prevent individual users from monopolizing shared quota
  • Batch API: 50% cost reduction for non-real-time workloads (document processing, evaluation)
  • Ollama for local dev eliminates all API costs during development and testing

Implementation Checklist

  • Classify request complexity before routing to expensive models
  • Implement semantic cache using vector similarity on embeddings
  • Set max_tokens on every request to cap unexpected output costs
  • Track completion.Usage.TotalTokenCount per user for budget enforcement
  • Register Ollama as IChatClient in development environments
  • Evaluate Batch API for any non-real-time AI workloads

Frequently Asked Questions

What is the biggest driver of high Azure OpenAI costs for .NET developers?

The single biggest cost driver is using GPT-4o for all requests when many queries could be handled by GPT-4o-mini at 33x lower cost. Classify request complexity before routing and send simple queries to the cheaper model.

How does semantic caching reduce Azure OpenAI costs?

Semantic caching stores previous AI responses indexed by embedding. When a new request arrives, compute its embedding and search the cache for similar past queries. If similarity exceeds 0.92, return the cached response without calling the API. This can reduce repeated or near-identical queries by 40-60%.

How do I implement model routing in C# with Semantic Kernel?

Create a classifier function that calls GPT-4o-mini to assess query complexity as simple or complex. Route simple queries to GPT-4o-mini and complex queries to GPT-4o. The classifier call costs roughly 20 input tokens and pays for itself once a meaningful share of queries, around 30% or more, is correctly routed to the cheaper model.

What is Provisioned Throughput (PTU) and when should .NET teams use it?

PTU is a monthly capacity reservation for Azure OpenAI that gives guaranteed throughput in exchange for a fixed cost. It is cost-effective when your usage exceeds roughly 50% of the PTU capacity consistently — typically at 50M+ tokens per month. Below that, pay-as-you-go is cheaper. Use the Azure OpenAI PTU calculator to find your break-even point.

How can I use Ollama to reduce costs in development?

Register Ollama as your IChatClient for non-production environments. Your application code doesn't change — only the DI registration differs between development (Ollama endpoint) and production (Azure OpenAI endpoint). This eliminates API costs entirely during development and testing.

What is the Azure OpenAI Batch API and how much does it save?

The Batch API processes asynchronous requests at 50% lower cost than the real-time API. Submit a JSONL file of requests, and Azure processes them within 24 hours. Ideal for document classification, bulk embedding, batch evaluation, and offline enrichment — any workload that doesn't require real-time responses.

How do I set per-user token budgets in a multi-tenant .NET app?

Maintain a per-user token counter in Redis or a database. Before each AI call, check if the user has remaining daily quota. After each call, deduct completion.Usage.TotalTokenCount from their remaining budget. Return HTTP 429 with a descriptive message when the budget is exhausted rather than forwarding to Azure OpenAI.


#Cost Optimization #Azure OpenAI #Token Budget #Model Routing #Semantic Caching #.NET AI