Azure OpenAI billing is opaque until your first invoice arrives and surprises you. The cost structure is simple — you pay per token — but the levers for controlling that cost are not obvious until you understand what drives consumption. This guide covers six concrete strategies with C# code you can apply immediately.
Understanding Azure OpenAI Pricing
The most important pricing insight for .NET developers is the price ratio between models. As of early 2026, the gap between GPT-4o and GPT-4o-mini is dramatic:
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Ratio vs GPT-4o |
|---|---|---|---|
| GPT-4o | $5.00 | $15.00 | 1x |
| GPT-4o-mini | $0.15 | $0.60 | ~33x cheaper (input); 25x (output) |
| text-embedding-3-small | $0.02 | n/a | — |
| text-embedding-3-large | $0.13 | n/a | — |
See the Azure OpenAI pricing page for current rates — these change as models mature.
The 33x price difference between GPT-4o and GPT-4o-mini is the single most actionable number in AI cost optimization. If 40% of your queries are simple enough for GPT-4o-mini to handle, that 40% now costs 33x less. No architectural changes required — just routing.
Pay-as-you-go vs Provisioned Throughput (PTU). Pay-as-you-go charges per token with no upfront commitment. PTU is a monthly capacity reservation that gives guaranteed throughput at a fixed monthly price. PTU becomes cost-effective at high, consistent volumes — typically 50M+ tokens per month and above 50% utilization of purchased capacity. Below that threshold, pay-as-you-go is cheaper and simpler.
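PTU reservation prices vary by region and commitment term, so treat the reservation cost as a number you look up rather than a constant. A back-of-envelope break-even check (the helper below is a sketch, not an SDK API):

```csharp
using System;

public static class PtuBreakEven
{
    // Pay-as-you-go monthly spend at the given per-1M-token rates.
    public static decimal PayAsYouGoMonthlyCost(
        long inputTokens, long outputTokens,
        decimal inputPricePer1M, decimal outputPricePer1M) =>
        inputTokens / 1_000_000m * inputPricePer1M +
        outputTokens / 1_000_000m * outputPricePer1M;

    // PTU wins only when the flat reservation undercuts metered spend.
    public static bool PtuIsCheaper(decimal paygMonthlyCost, decimal ptuMonthlyCost) =>
        ptuMonthlyCost < paygMonthlyCost;
}
```

For example, 80M input and 20M output tokens per month on GPT-4o comes to 80 × $5.00 + 20 × $15.00 = $700, so a PTU reservation only pays off if it costs less than that while meeting your utilization target.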
Before you optimize cost, you need to know what you are spending. See the token counting guide for how to measure token consumption per request so you have a baseline to optimize against.
Strategy 1: Token Budgeting
In multi-tenant applications, a single user can exhaust shared Azure OpenAI quota and degrade the experience for all other users. Per-user token budgets prevent this and give you predictable cost scaling.
The pattern is straightforward: maintain a per-user daily token counter in a distributed cache, check it before each AI call, and deduct actual usage after. Use Redis (via IDistributedCache) to share state across multiple application instances:
using Microsoft.Extensions.Caching.Distributed;
using System.Text.Json;
public class UserTokenBudgetService
{
private readonly IDistributedCache _cache;
private const int DailyTokenBudget = 50_000; // per user per day
public UserTokenBudgetService(IDistributedCache cache) => _cache = cache;
public async Task<bool> TryConsumeTokensAsync(
string userId, int tokensToConsume, CancellationToken ct = default)
{
var key = $"token-budget:{userId}:{DateTime.UtcNow:yyyy-MM-dd}";
// NOTE: this read-modify-check-set sequence is not atomic across app instances;
// for strict enforcement use an atomic counter (e.g. Redis INCRBY) instead.
var currentBytes = await _cache.GetAsync(key, ct);
var current = currentBytes != null
? JsonSerializer.Deserialize<int>(currentBytes)
: 0;
if (current + tokensToConsume > DailyTokenBudget)
return false;
var newValue = current + tokensToConsume;
await _cache.SetAsync(
key,
JsonSerializer.SerializeToUtf8Bytes(newValue),
new DistributedCacheEntryOptions
{
// Expire at the next UTC midnight, when the daily budget resets.
// Build the DateTimeOffset explicitly: implicit conversion from an
// Unspecified-kind DateTime would assume the server's local offset.
AbsoluteExpiration = new DateTimeOffset(DateTime.UtcNow.Date.AddDays(1), TimeSpan.Zero)
}, ct);
return true;
}
}
In your API layer, check the budget before forwarding the request to Azure OpenAI, passing an estimated token count (prompt tokens plus your output cap), and return HTTP 429 when the budget is exhausted. After the call completes, reconcile the estimate against the actual completion.Usage.TotalTokenCount.
Setting the right budget limit requires telemetry. Start by logging actual token usage per user per day for two weeks, then set the budget at the 90th percentile plus 20%. This accommodates normal heavy users while protecting against runaway consumption.
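The percentile arithmetic itself is small; a sketch assuming you have exported per-user daily totals from your telemetry (nearest-rank percentile, integer headroom):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BudgetCalculator
{
    // dailyTotals: one entry per user-day of observed token usage.
    // Returns the nearest-rank 90th percentile plus 20% headroom.
    public static int RecommendDailyBudget(IReadOnlyList<int> dailyTotals)
    {
        if (dailyTotals.Count == 0)
            throw new ArgumentException("Collect telemetry before setting a budget.");
        var sorted = dailyTotals.OrderBy(t => t).ToArray();
        int rank = (int)Math.Ceiling(0.90 * sorted.Length) - 1; // nearest-rank p90
        return sorted[rank] + sorted[rank] / 5; // +20% headroom, integer math
    }
}
```

Re-run the calculation monthly; usage patterns drift as features ship.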
Strategy 2: Model Routing
Model routing is the highest-return optimization available. The idea is to classify each incoming query before routing it — simple queries go to GPT-4o-mini, complex queries go to GPT-4o.
A classifier that runs on GPT-4o-mini adds only a short instruction, the query text, and a one-word response per call, all billed at GPT-4o-mini rates. If it correctly identifies even 30% of queries as simple and routes them away from GPT-4o, that overhead is paid back many times over.
using Microsoft.Extensions.DependencyInjection; // [FromKeyedServices]
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;
public class ModelRoutingService
{
private readonly IChatCompletionService _miniService; // GPT-4o-mini
private readonly IChatCompletionService _fullService; // GPT-4o
public ModelRoutingService(
[FromKeyedServices("gpt-4o-mini")] IChatCompletionService mini,
[FromKeyedServices("gpt-4o")] IChatCompletionService full)
{
_miniService = mini;
_fullService = full;
}
public async Task<string> CompleteAsync(string userQuery, CancellationToken ct = default)
{
// Classify complexity using the cheap model
var complexity = await ClassifyComplexityAsync(userQuery, ct);
var service = complexity == "simple" ? _miniService : _fullService;
var history = new ChatHistory("You are a helpful assistant.");
history.AddUserMessage(userQuery);
var response = await service.GetChatMessageContentAsync(history, cancellationToken: ct);
return response.Content ?? string.Empty;
}
private async Task<string> ClassifyComplexityAsync(string query, CancellationToken ct)
{
var history = new ChatHistory(
"Classify this query as 'simple' (factual, short answer) or 'complex' (reasoning, analysis, code). " +
"Reply with only the single word: simple or complex.");
history.AddUserMessage(query);
var result = await _miniService.GetChatMessageContentAsync(history, cancellationToken: ct);
// Default to the stronger model when the classifier output is unexpected
return result.Content?.Trim().ToLowerInvariant() == "simple" ? "simple" : "complex";
}
}
Register keyed services in your DI container so each service gets the correct underlying model deployment:
// Semantic Kernel registers a keyed IChatCompletionService when you pass a
// serviceId; point each registration at the matching Azure OpenAI deployment.
builder.Services.AddAzureOpenAIChatCompletion(
deploymentName: "gpt-4o-mini",
endpoint: builder.Configuration["AzureOpenAI:Endpoint"]!,
apiKey: builder.Configuration["AzureOpenAI:ApiKey"]!,
serviceId: "gpt-4o-mini"); // serviceId becomes the keyed-service key
// Repeat with deploymentName and serviceId "gpt-4o" for the full model
Track the routing distribution in production. If your classifier routes fewer than 20% of queries to GPT-4o-mini, either your query mix is genuinely complex or the classifier needs tuning. Logging the routing decision alongside the query lets you spot misclassifications quickly.
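The routing distribution translates directly into a blended token rate; a quick sketch using the input prices from the table above:

```csharp
public static class RoutingEconomics
{
    // Blended input price per 1M tokens for a given fraction routed to the
    // cheap model. Defaults are the table rates above (per 1M input tokens).
    public static decimal BlendedInputPricePer1M(
        decimal miniFraction, decimal miniPrice = 0.15m, decimal fullPrice = 5.00m) =>
        miniFraction * miniPrice + (1 - miniFraction) * fullPrice;
}
```

Routing 40% of queries to GPT-4o-mini drops the blended input rate from $5.00 to $3.06 per 1M tokens, a roughly 39% reduction before any other optimization.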
Strategy 3: Semantic Caching as IFunctionInvocationFilter
Many production AI workloads receive the same or nearly the same queries repeatedly. A support chatbot receives variations of “How do I reset my password?” hundreds of times per day. Computing a fresh AI response each time wastes money.
Semantic caching intercepts AI function calls in Semantic Kernel using the IFunctionInvocationFilter interface. Before the real AI call executes, the filter checks whether a cached response exists for this input. If it does, the filter short-circuits the pipeline and returns the cached result:
using Microsoft.SemanticKernel;
using Microsoft.Extensions.Caching.Memory;
using System.Linq;
using System.Security.Cryptography;
using System.Text;
public class SemanticCacheFilter : IFunctionInvocationFilter
{
private readonly IMemoryCache _cache;
private readonly TimeSpan _cacheDuration;
public SemanticCacheFilter(IMemoryCache cache, TimeSpan? cacheDuration = null)
{
_cache = cache;
_cacheDuration = cacheDuration ?? TimeSpan.FromHours(1);
}
public async Task OnFunctionInvocationAsync(
FunctionInvocationContext context,
Func<FunctionInvocationContext, Task> next)
{
// Only cache prompt functions, not tool calls (PluginName can be null)
if (context.Function.PluginName?.Contains("Prompt") != true)
{
await next(context);
return;
}
var cacheKey = ComputeCacheKey(context);
if (_cache.TryGetValue(cacheKey, out string? cachedResult))
{
context.Result = new FunctionResult(context.Function, cachedResult);
return; // Skip the actual AI call
}
await next(context);
if (context.Result?.GetValue<string>() is string result)
{
_cache.Set(cacheKey, result, _cacheDuration);
}
}
private static string ComputeCacheKey(FunctionInvocationContext context)
{
var keyInput = $"{context.Function.PluginName}:{context.Function.Name}:" +
string.Join(",", context.Arguments.Select(a => $"{a.Key}={a.Value}"));
var hash = SHA256.HashData(Encoding.UTF8.GetBytes(keyInput));
return $"sk-cache:{Convert.ToHexString(hash)[..16]}";
}
}
Register the filter with your kernel:
builder.Services.AddMemoryCache();
builder.Services.AddSingleton<SemanticCacheFilter>();
// After the service provider is built, attach the filter to the kernel:
kernel.FunctionInvocationFilters.Add(serviceProvider.GetRequiredService<SemanticCacheFilter>());
The implementation above uses exact-match caching by hashing function arguments. For true semantic caching — matching near-identical queries — you would compute embeddings of the user query, store them in a vector cache, and return a hit when cosine similarity exceeds a threshold (typically 0.92). The exact-match version shown here handles the common case where repeated queries are truly identical (same FAQ, same document reference) without the added complexity of embedding lookup.
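The similarity test at the heart of the embedding-based variant is a few lines; a self-contained sketch (embedding generation is omitted, and the 0.92 threshold is the starting point mentioned above):

```csharp
using System;

public static class SemanticMatch
{
    // Cosine similarity between two equal-length embedding vectors.
    public static double CosineSimilarity(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
    {
        double dot = 0, magA = 0, magB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += (double)a[i] * b[i];
            magA += (double)a[i] * a[i];
            magB += (double)b[i] * b[i];
        }
        return dot / (Math.Sqrt(magA) * Math.Sqrt(magB));
    }

    // Treat a cached entry as a hit when similarity clears the threshold.
    public static bool IsCacheHit(
        ReadOnlySpan<float> query, ReadOnlySpan<float> cached, double threshold = 0.92) =>
        CosineSimilarity(query, cached) >= threshold;
}
```

In production you would scan candidate embeddings from a vector store rather than comparing pairwise in a loop, but the hit criterion is the same.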
Cache hit rates of 40-60% are achievable in support-chat and documentation-assistant scenarios. Log cache hit/miss ratios alongside estimated cost savings so you can demonstrate the value and tune the TTL.
Strategy 4: Prompt Compression
Every token in your prompt costs money. System prompts that grew incrementally over months often contain redundancy, outdated examples, and verbose phrasing that the model does not need.
Set max_tokens on every request. Without an output token cap, a model can generate an arbitrarily long response. For most use cases, 1,024 or 2,048 tokens is more than sufficient:
var options = new ChatCompletionOptions
{
MaxOutputTokenCount = 1024, // Cap output tokens
};
var completion = await chatClient.CompleteChatAsync(messages, options);
Audit your system prompt for compression opportunities. Common sources of excess tokens:
- Lengthy few-shot examples that the model no longer needs after fine-tuning
- Verbose role definitions (“You are an extremely helpful assistant who always…”) that can be shortened to a single sentence
- Repeated instructions that appear in both the system prompt and user messages
- Full function signatures repeated every turn in tool-use conversations — use abbreviated references in multi-turn history instead
A prompt compression pass on a mature system prompt often reduces token count by 20-40% with no measurable quality degradation. Measure quality before and after using a fixed test set of 50-100 representative queries.
In multi-turn conversations, trim older messages from the context window once the conversation exceeds a threshold. Keep the system prompt and the most recent N turns. Summarizing older turns rather than dropping them preserves context while reducing token count.
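A minimal trimming sketch, shown over a plain record type so it is self-contained (with Semantic Kernel's ChatHistory the same logic applies to its message list; summarizing the dropped turns is the natural next step):

```csharp
using System.Collections.Generic;
using System.Linq;

public record ChatTurn(string Role, string Content);

public static class HistoryTrimmer
{
    // Keep the system message (if present) plus the most recent maxTurns
    // non-system messages; everything older is dropped.
    public static List<ChatTurn> Trim(IReadOnlyList<ChatTurn> history, int maxTurns) =>
        history.Where(t => t.Role == "system").Take(1)
               .Concat(history.Where(t => t.Role != "system").TakeLast(maxTurns))
               .ToList();
}
```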
Strategy 5: Ollama for Local Development
Every API call made during development, debugging, and testing costs real money. A single developer iterating locally burns through hundreds of requests in an afternoon, and across a team and a CI pipeline that spend adds up quickly before any production traffic arrives.
Ollama eliminates this entirely. Register it as your IChatClient for non-production environments, and your application code changes nothing — only the DI registration differs:
// Program.cs
if (builder.Environment.IsDevelopment())
{
// Zero-cost local AI via Ollama's OpenAI-compatible endpoint.
// AsIChatClient() comes from the Microsoft.Extensions.AI.OpenAI package.
builder.Services.AddChatClient(_ =>
new OpenAIClient(
new ApiKeyCredential("ollama"), // Ollama ignores the key, but one is required
new OpenAIClientOptions { Endpoint = new Uri("http://localhost:11434/v1") })
.GetChatClient("phi4-mini")
.AsIChatClient());
}
else
{
// Production Azure OpenAI
builder.Services.AddChatClient(_ =>
new AzureOpenAIClient(
new Uri(builder.Configuration["AzureOpenAI:Endpoint"]!),
new AzureKeyCredential(builder.Configuration["AzureOpenAI:ApiKey"]!))
.GetChatClient(builder.Configuration["AzureOpenAI:Deployment"]!) // deployment name from config
.AsIChatClient());
}
IChatClient is the same interface in both cases. Services that depend on IChatClient receive the correct implementation for their environment without any conditional logic in business code.
Phi-4-mini running locally is capable enough for most development scenarios — testing prompt logic, validating JSON output parsing, exercising the RAG pipeline. For setup instructions including model download and the Ollama service configuration, see the full Phi-4 local development guide.
Strategy 6: Azure OpenAI Batch API
Real-time API calls are expensive because they require reserved capacity and low-latency routing. For workloads where the response is not needed immediately, the Azure OpenAI Batch API processes requests at 50% lower cost than the real-time API.
When to use it:
- Document classification pipelines (classify 10,000 support tickets overnight)
- Bulk embedding generation for vector database population
- Automated evaluation of AI output quality (run nightly, not during user sessions)
- Offline content enrichment (generate summaries, extract entities from a corpus)
The Batch API accepts a JSONL file where each line is a self-contained request:
// Batch API — submit JSONL, get results asynchronously
// Requires Azure.AI.OpenAI Batch API support (GA in 2025)
// Each line in the JSONL is a full chat completion request
var batchRequest = new
{
custom_id = "request-001",
method = "POST",
url = "/v1/chat/completions",
body = new
{
model = "gpt-4o-mini",
messages = new[] { new { role = "user", content = "Classify this document: ..." } },
max_tokens = 10
}
};
// Submit via Azure OpenAI Batch endpoint
// Poll for completion, then retrieve results
The Batch API returns results within 24 hours. This latency is the trade-off for the cost reduction — only use the Batch API for workloads that can tolerate asynchronous processing. For document classification, embedding generation, and offline evaluation, 24-hour latency is entirely acceptable.
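Producing the JSONL input is plain serialization, one request per line. A sketch mirroring the request shape above (field names and the URL path should be verified against the Batch API docs for your API version):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;

public static class BatchFileBuilder
{
    // One self-contained chat completion request, as in the example above.
    public static object MakeClassificationRequest(string id, string document) => new
    {
        custom_id = id,
        method = "POST",
        url = "/v1/chat/completions",
        body = new
        {
            model = "gpt-4o-mini",
            messages = new[] { new { role = "user", content = $"Classify this document: {document}" } },
            max_tokens = 10
        }
    };

    // JSONL: one compact JSON object per line, no trailing newline.
    public static string ToJsonl(IEnumerable<object> requests) =>
        string.Join("\n", requests.Select(r => JsonSerializer.Serialize(r)));
}
```

Write the result to a .jsonl file, upload it to the batch endpoint, then poll for completion.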
See the Azure OpenAI Batch API documentation for request format details and polling patterns.
Monitoring Cost
You cannot optimize what you do not measure: instrument before you optimize. A simple .NET Meter tracks estimated cost per request, per model, and per feature:
using System.Diagnostics;         // TagList
using System.Diagnostics.Metrics; // Meter, Histogram<T>
private static readonly Meter _meter = new("MyApp.AI.Cost", "1.0.0");
private static readonly Histogram<double> _costHistogram =
_meter.CreateHistogram<double>("ai.request.cost.usd", "USD", "Estimated cost per request");
// After each call
double estimatedCost = (completion.Usage.InputTokenCount / 1_000_000.0 * 5.0) +
(completion.Usage.OutputTokenCount / 1_000_000.0 * 15.0);
_costHistogram.Record(estimatedCost, new TagList { { "model", "gpt-4o" }, { "feature", "support-chat" } });
Tag every cost metric with the model name and the application feature. After a week of production data, you will see clearly which features consume the most tokens — often it is one or two features driving 80% of cost. That is where to focus optimization effort.
Export this metric to Azure Monitor or Prometheus. Alert when daily cost exceeds a threshold. Use the feature tag to attribute cost back to product features, which turns AI cost into a conversation about feature value rather than a pure engineering expense.
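Attribution from exported metric records is a simple group-by. A sketch (the CostRecord shape is hypothetical; adapt it to whatever your exporter emits):

```csharp
using System.Collections.Generic;
using System.Linq;

public record CostRecord(string Feature, double CostUsd);

public static class CostAttribution
{
    // Total estimated cost per feature, highest first; the top one or two
    // entries are where optimization effort pays off.
    public static List<(string Feature, double Total)> ByFeature(
        IEnumerable<CostRecord> records) =>
        records.GroupBy(r => r.Feature)
               .Select(g => (g.Key, g.Sum(r => r.CostUsd)))
               .OrderByDescending(x => x.Item2)
               .ToList();
}
```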
Putting It Together
The six strategies are not mutually exclusive — the most cost-effective production systems implement several simultaneously:
- Instrument with cost metrics to establish baseline and identify expensive features
- Route by model complexity to redirect simple queries to GPT-4o-mini (33x savings on routed queries)
- Cache repeated queries with SemanticCacheFilter (40-60% call reduction in high-repetition scenarios)
- Budget per user to protect shared quota in multi-tenant apps
- Compress prompts to remove redundancy (20-40% prompt size reduction)
- Switch to Ollama in development to eliminate dev/test API costs entirely
- Batch non-real-time workloads at 50% discount
Applied together, these strategies routinely reduce Azure OpenAI spend by 60-80% from an unoptimized baseline. The model routing optimization alone — ensuring simple queries go to GPT-4o-mini — typically delivers the largest single reduction.
Further Reading
- Azure OpenAI pricing
- Azure OpenAI Batch API documentation
- University: Token Counting and Context Management in C# for Azure OpenAI