
Cut Azure AI Costs in C#: Token Budgets, Model Routing & Semantic Cache

Verified Apr 2026 · Intermediate · .NET 10 · Azure.AI.OpenAI 2.x · Microsoft.SemanticKernel 1.54.0 · Microsoft.ML.Tokenizers 0.22.0
By Rajesh Mishra · Mar 21, 2026 · 15 min read
[Diagram: token budget controls, semantic caching, and task-based routing across nano, mini, and full Azure AI model tiers]
Cost control comes from budget guardrails, cache hits, and routing work to the smallest model that meets the task.
In 30 Seconds

Azure AI Foundry costs in C# are controlled through six strategies: model tier routing (GPT-5.4-mini is 5x cheaper than GPT-5.4; GPT-5.4-nano is 20x cheaper for simple lookups), semantic caching (40-60% call reduction for repeated queries), token budgets enforced per-user, prompt compression, Ollama for local dev (zero cost), and the Batch API for non-real-time workloads (50% discount). Implement IFunctionInvocationFilter for cross-cutting cost controls in Semantic Kernel. Pricing based on April 2026 Azure AI Foundry rates. Uses .NET 10 (LTS).

Azure AI Foundry billing is opaque until your first invoice arrives and surprises you. The cost structure is simple — you pay per token — but the levers for controlling that cost are not obvious until you understand what drives consumption. This guide covers six concrete strategies with C# code you can apply immediately.

Platform & SDK Reference (2026)

This article targets Azure AI Foundry (the evolution of Azure OpenAI Service) with the GPT-5.4 model family. All code examples use .NET 10 (LTS, supported until Nov 2028). Legacy GPT-4.x models are deprecated for new applications.

SDK / Runtime              Version
.NET                       10 (LTS — recommended)
Azure.AI.OpenAI            2.x
Microsoft.SemanticKernel   1.54.0+

Understanding Azure AI Foundry Pricing

The most important pricing insight for .NET developers in 2026 is that the GPT-5.4 family gives you four distinct price tiers for the same provider. The gap between the top and bottom of the family makes routing decisions extremely high-value:

Model                    Input (per 1M tokens)   Output (per 1M tokens)   Ratio vs GPT-5.4
GPT-5.4                  ~$2.00                  ~$8.00                   1x (baseline)
GPT-5.4-mini             ~$0.40                  ~$1.60                   ~5x cheaper
GPT-5.4-nano             ~$0.10                  ~$0.40                   ~20x cheaper
text-embedding-3-small   ~$0.02                  n/a                      n/a
text-embedding-3-large   ~$0.13                  n/a                      n/a

See the Azure AI Foundry pricing page for current rates — these change as models mature and OpenAI continues reducing prices over time.

The 20x price difference between GPT-5.4 and GPT-5.4-nano is the single most actionable number in AI cost optimization for 2026. GPT-5.4-nano handles classification, extraction, short-answer, and structured output tasks with high reliability. GPT-5.4-mini handles the majority of conversational and reasoning tasks. GPT-5.4 is for complex multi-step reasoning, long-form generation, and code-intensive workloads.

If 50% of your queries are simple enough for GPT-5.4-nano and another 30% are suitable for GPT-5.4-mini, only 20% of your traffic needs the GPT-5.4 tier. That routing decision alone can reduce your AI spend by 70–80% with no quality degradation on the routed queries.
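Using the input rates above, that claim can be sanity-checked with quick arithmetic (a sketch; output rates scale similarly):

```csharp
// Blended input cost of a 50/30/20 nano/mini/full split vs. sending
// everything to GPT-5.4 (input rates per 1M tokens: $0.10, $0.40, $2.00).
double nano = 0.10, mini = 0.40, full = 2.00;

double blended = 0.50 * nano + 0.30 * mini + 0.20 * full; // $0.57 per 1M tokens
double savings = 1 - blended / full;                      // 0.715, i.e. ~71.5% cheaper

Console.WriteLine($"Blended: ${blended:0.00}/1M, savings: {savings:P1}");
```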

Pay-as-you-go vs Provisioned Throughput (PTU). Pay-as-you-go charges per token with no upfront commitment. PTU is a monthly capacity reservation that gives guaranteed throughput at a fixed monthly price. PTU becomes cost-effective at high, consistent volumes — typically 50M+ tokens per month and above 50% utilisation of purchased capacity. Below that threshold, pay-as-you-go is cheaper and simpler.

Before you optimise cost, you need to know what you are spending. See the token counting guide for how to measure token consumption per request so you have a baseline to optimise against.

Strategy 1: Token Budgeting

In multi-tenant applications, a single user can exhaust shared Azure OpenAI quota and degrade the experience for all other users. Per-user token budgets prevent this and give you predictable cost scaling.

The pattern is straightforward: maintain a per-user daily token counter in a distributed cache, check it before each AI call, and deduct actual usage after. Use Redis (via IDistributedCache) to share state across multiple application instances:

using Microsoft.Extensions.Caching.Distributed;
using System.Text.Json;

public class UserTokenBudgetService
{
    private readonly IDistributedCache _cache;
    private const int DailyTokenBudget = 50_000; // per user per day

    public UserTokenBudgetService(IDistributedCache cache) => _cache = cache;

    // Note: GetAsync + SetAsync is a read-modify-write, so two concurrent
    // requests can race past the limit. For strict enforcement use an atomic
    // counter (e.g. Redis INCRBY via StackExchange.Redis) instead.
    public async Task<bool> TryConsumeTokensAsync(
        string userId, int tokensToConsume, CancellationToken ct = default)
    {
        var key = $"token-budget:{userId}:{DateTime.UtcNow:yyyy-MM-dd}";
        var currentBytes = await _cache.GetAsync(key, ct);
        var current = currentBytes != null
            ? JsonSerializer.Deserialize<int>(currentBytes)
            : 0;

        if (current + tokensToConsume > DailyTokenBudget)
            return false;

        var newValue = current + tokensToConsume;
        await _cache.SetAsync(
            key,
            JsonSerializer.SerializeToUtf8Bytes(newValue),
            new DistributedCacheEntryOptions
            {
                AbsoluteExpiration = DateTimeOffset.UtcNow.Date.AddDays(1)
            }, ct);

        return true;
    }
}

In your API layer, check the budget before forwarding to Azure OpenAI and return HTTP 429 when the budget is exhausted. After the AI call completes, deduct completion.Usage.TotalTokenCount from the user’s remaining quota.
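A minimal-API sketch of that flow might look like this; ChatRequest, EstimatePromptTokens, and CallModelAsync are hypothetical placeholders for your own request type, tokenizer (e.g. via Microsoft.ML.Tokenizers), and AI call:

```csharp
app.MapPost("/chat", async (ChatRequest req, UserTokenBudgetService budget, CancellationToken ct) =>
{
    // Pre-check with an estimate of the prompt's token count (hypothetical helper)
    var estimated = EstimatePromptTokens(req.Message);

    if (!await budget.TryConsumeTokensAsync(req.UserId, estimated, ct))
        return Results.StatusCode(StatusCodes.Status429TooManyRequests);

    var completion = await CallModelAsync(req.Message, ct); // hypothetical AI call

    // Reconcile: charge the difference between actual and estimated usage
    var actual = completion.Usage.TotalTokenCount;
    if (actual > estimated)
        await budget.TryConsumeTokensAsync(req.UserId, actual - estimated, ct);

    return Results.Ok(completion.Content);
});
```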

Setting the right budget limit requires telemetry. Start by logging actual token usage per user per day for two weeks, then set the budget at the 90th percentile plus 20%. This accommodates normal heavy users while protecting against runaway consumption.

Strategy 2: Model Routing

Model routing is the highest-return optimisation available. The GPT-5.4 family gives you three deployment tiers to route between — nano, mini, and the flagship — each with a meaningful quality-cost tradeoff. The idea is to classify each incoming query before routing it: simple, factual, or extraction queries go to GPT-5.4-nano; balanced conversational queries go to GPT-5.4-mini; complex reasoning or long-form generation goes to GPT-5.4.

A classifier that runs on GPT-5.4-nano costs fewer than 20 tokens per classification. If it correctly routes even 40% of queries to GPT-5.4-nano, the classifier overhead is paid back many times over.

using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;

public class ModelRoutingService
{
    private readonly IChatCompletionService _nanoService;  // GPT-5.4-nano — simple tasks
    private readonly IChatCompletionService _miniService;  // GPT-5.4-mini — balanced
    private readonly IChatCompletionService _fullService;  // GPT-5.4   — complex

    public ModelRoutingService(
        [FromKeyedServices("gpt-54-nano")] IChatCompletionService nano,
        [FromKeyedServices("gpt-54-mini")] IChatCompletionService mini,
        [FromKeyedServices("gpt-54")] IChatCompletionService full)
    {
        _nanoService = nano;
        _miniService = mini;
        _fullService = full;
    }

    public async Task<string> CompleteAsync(string userQuery, CancellationToken ct = default)
    {
        // Classify using nano — cheapest possible classifier
        var tier = await ClassifyTierAsync(userQuery, ct);

        var service = tier switch
        {
            "simple"   => _nanoService,
            "moderate" => _miniService,
            _          => _fullService,
        };

        var history = new ChatHistory("You are a helpful assistant.");
        history.AddUserMessage(userQuery);

        var response = await service.GetChatMessageContentAsync(history, cancellationToken: ct);
        return response.Content ?? string.Empty;
    }

    private async Task<string> ClassifyTierAsync(string query, CancellationToken ct)
    {
        var history = new ChatHistory(
            "Classify this query as one word: " +
            "'simple' (fact lookup, short answer, classification), " +
            "'moderate' (conversational, multi-step but clear), or " +
            "'complex' (deep reasoning, code generation, long-form analysis). " +
            "Reply with only lowercase: simple, moderate, or complex.");
        history.AddUserMessage(query);

        var result = await _nanoService.GetChatMessageContentAsync(history, cancellationToken: ct);
        return result.Content?.Trim().ToLowerInvariant() switch
        {
            "complex"  => "complex",
            "moderate" => "moderate",
            _          => "simple",
        };
    }
}

Register keyed services in your DI container so each service gets the correct underlying deployment:

// Each key maps to a different Azure AI Foundry deployment
builder.Services.AddKeyedSingleton<IChatCompletionService>("gpt-54-nano",
    (sp, _) => /* kernel wired to your gpt-5.4-nano deployment */ );
builder.Services.AddKeyedSingleton<IChatCompletionService>("gpt-54-mini",
    (sp, _) => /* kernel wired to your gpt-5.4-mini deployment */ );
builder.Services.AddKeyedSingleton<IChatCompletionService>("gpt-54",
    (sp, _) => /* kernel wired to your gpt-5.4 deployment */ );

Track the routing distribution in production. If your classifier routes fewer than 30% of queries to nano or mini, either your query mix is genuinely complex or the classifier is being overly cautious. Log the routing decision alongside the query tier to spot misclassification patterns quickly.
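One way to capture that distribution (a sketch using System.Diagnostics.Metrics; the meter and counter names are illustrative):

```csharp
using System.Diagnostics.Metrics;

// Count routing decisions so the nano/mini/full split shows up in dashboards
private static readonly Meter _routingMeter = new("MyApp.AI.Routing", "1.0.0");
private static readonly Counter<long> _routedQueries =
    _routingMeter.CreateCounter<long>("ai.routed.queries", description: "Queries per model tier");

// In CompleteAsync, after ClassifyTierAsync returns:
_routedQueries.Add(1, new KeyValuePair<string, object?>("tier", tier));
```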

Strategy 3: Semantic Caching as IFunctionInvocationFilter

Many production AI workloads receive the same or nearly the same queries repeatedly. A support chatbot receives variations of “How do I reset my password?” hundreds of times per day. Computing a fresh AI response each time wastes money.

Semantic caching intercepts AI function calls in Semantic Kernel using the IFunctionInvocationFilter interface. Before the real AI call executes, the filter checks whether a cached response exists for this input. If it does, the filter short-circuits the pipeline and returns the cached result:

using Microsoft.SemanticKernel;
using Microsoft.Extensions.Caching.Memory;
using System.Security.Cryptography;
using System.Text;

public class SemanticCacheFilter : IFunctionInvocationFilter
{
    private readonly IMemoryCache _cache;
    private readonly TimeSpan _cacheDuration;

    public SemanticCacheFilter(IMemoryCache cache, TimeSpan? cacheDuration = null)
    {
        _cache = cache;
        _cacheDuration = cacheDuration ?? TimeSpan.FromHours(1);
    }

    public async Task OnFunctionInvocationAsync(
        FunctionInvocationContext context,
        Func<FunctionInvocationContext, Task> next)
    {
        // Only cache prompt functions, not tool calls.
        // PluginName can be null for functions registered without a plugin.
        if (context.Function.PluginName?.Contains("Prompt") != true)
        {
            await next(context);
            return;
        }

        var cacheKey = ComputeCacheKey(context);

        if (_cache.TryGetValue(cacheKey, out string? cachedResult) && cachedResult is not null)
        {
            context.Result = new FunctionResult(context.Function, cachedResult);
            return; // Cache hit — skip the actual AI call
        }

        await next(context);

        if (context.Result?.GetValue<string>() is string result)
        {
            _cache.Set(cacheKey, result, _cacheDuration);
        }
    }

    private static string ComputeCacheKey(FunctionInvocationContext context)
    {
        var keyInput = $"{context.Function.PluginName}:{context.Function.Name}:" +
                       string.Join(",", context.Arguments.Select(a => $"{a.Key}={a.Value}"));

        var hash = SHA256.HashData(Encoding.UTF8.GetBytes(keyInput));
        return $"sk-cache:{Convert.ToHexString(hash)[..16]}";
    }
}

Register the filter with your kernel:

builder.Services.AddMemoryCache();
builder.Services.AddSingleton<SemanticCacheFilter>();
kernel.FunctionInvocationFilters.Add(serviceProvider.GetRequiredService<SemanticCacheFilter>());

The implementation above uses exact-match caching by hashing function arguments. For true semantic caching — matching near-identical queries — you would compute embeddings of the user query, store them in a vector cache, and return a hit when cosine similarity exceeds a threshold (typically 0.92). The exact-match version shown here handles the common case where repeated queries are truly identical (same FAQ, same document reference) without the added complexity of embedding lookup.
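If you do move to embedding-based matching, the comparison itself is a cosine similarity over two embedding vectors (a sketch; generating the embeddings via text-embedding-3-small and storing them in the vector cache is assumed to happen elsewhere):

```csharp
// Cosine similarity between two embedding vectors.
// A cached response is a hit when the score exceeds the threshold (e.g. 0.92).
static double CosineSimilarity(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
{
    double dot = 0, magA = 0, magB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot  += a[i] * b[i];
        magA += a[i] * a[i];
        magB += b[i] * b[i];
    }
    return dot / (Math.Sqrt(magA) * Math.Sqrt(magB));
}
```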

Cache hit rates of 40-60% are achievable in support-chat and documentation-assistant scenarios. Log cache hit/miss ratios alongside estimated cost savings so you can demonstrate the value and tune the TTL.

Strategy 4: Prompt Compression

Every token in your prompt costs money. System prompts that grew incrementally over months often contain redundancy, outdated examples, and verbose phrasing that the model does not need.

Set max_tokens on every request. Without an output token cap, a model can generate an arbitrarily long response. For most use cases, 1,024 or 2,048 tokens is more than sufficient:

var options = new ChatCompletionOptions
{
    MaxOutputTokenCount = 1024, // Cap output tokens
};
var completion = await chatClient.CompleteChatAsync(messages, options);

Audit your system prompt for compression opportunities. Common sources of excess tokens:

  • Lengthy few-shot examples that the model no longer needs after fine-tuning
  • Verbose role definitions (“You are an extremely helpful assistant who always…”) that can be shortened to a single sentence
  • Repeated instructions that appear in both the system prompt and user messages
  • Full function signatures repeated every turn in tool-use conversations — use abbreviated references in multi-turn history instead

A prompt compression pass on a mature system prompt often reduces token count by 20-40% with no measurable quality degradation. Measure quality before and after using a fixed test set of 50-100 representative queries.

In multi-turn conversations, trim older messages from the context window once the conversation exceeds a threshold. Keep the system prompt and the most recent N turns. Summarizing older turns rather than dropping them preserves context while reducing token count.
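A minimal trimming helper might look like this (a sketch; it assumes the first message in the ChatHistory is the system prompt and simply drops older turns rather than summarising them):

```csharp
using Microsoft.SemanticKernel.ChatCompletion;

static ChatHistory TrimHistory(ChatHistory history, int keepRecentMessages = 6)
{
    if (history.Count <= keepRecentMessages + 1)
        return history; // Nothing to trim

    var trimmed = new ChatHistory();
    trimmed.Add(history[0]); // Keep the system prompt

    // Keep only the most recent messages; older ones are dropped
    for (int i = history.Count - keepRecentMessages; i < history.Count; i++)
        trimmed.Add(history[i]);

    return trimmed;
}
```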

Strategy 5: Ollama for Local Development

Every API call made during development, debugging, and testing costs real money. A developer running the application locally for a few hours can easily consume thousands of tokens before any production traffic arrives.

Ollama eliminates this entirely. Register it as your IChatClient for non-production environments, and your application code changes nothing — only the DI registration differs:

// Program.cs
if (builder.Environment.IsDevelopment())
{
    // Zero-cost local AI with Ollama
    builder.Services.AddOpenAIChatClient(
        modelId: "phi4-mini",
        endpoint: new Uri("http://localhost:11434/v1"),
        apiKey: "ollama");
}
else
{
    // Production Azure AI Foundry
    builder.Services.AddAzureOpenAIChatClient(
        new Uri(builder.Configuration["AzureAI:Endpoint"]!),
        new AzureKeyCredential(builder.Configuration["AzureAI:ApiKey"]!));
}

IChatClient is the same interface in both cases. Services that depend on IChatClient receive the correct implementation for their environment without any conditional logic in business code.

Phi-4-mini running locally is capable enough for most development scenarios — testing prompt logic, validating JSON output parsing, exercising the RAG pipeline. For setup instructions including model download and Ollama configuration, see the full Phi-4 local development guide.

Strategy 6: Azure OpenAI Batch API

Real-time API calls are expensive because they require reserved capacity and low-latency routing. For workloads where the response is not needed immediately, the Azure OpenAI Batch API processes requests at 50% lower cost than the real-time API.

When to use it:

  • Document classification pipelines (classify 10,000 support tickets overnight)
  • Bulk embedding generation for vector database population
  • Automated evaluation of AI output quality (run nightly, not during user sessions)
  • Offline content enrichment (generate summaries, extract entities from a corpus)

The Batch API accepts a JSONL file where each line is a self-contained request:

// Batch API — submit JSONL, get results asynchronously
// Each line in the JSONL is a full chat completion request
var batchRequest = new
{
    custom_id = "request-001",
    method = "POST",
    url = "/v1/chat/completions",
    body = new
    {
        model = "gpt-5.4-mini",  // 5x cheaper than gpt-5.4; Batch halves the cost again
        messages = new[] { new { role = "user", content = "Classify this document: ..." } },
        max_tokens = 10
    }
};
// Submit via Azure AI Foundry Batch endpoint
// Poll for completion, then retrieve results
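Writing the input file is plain JSONL, one serialized request per line (a sketch, assuming `requests` is a collection of objects shaped like batchRequest above):

```csharp
using System.Text.Json;

// One JSON object per line is the format the Batch endpoint expects
await File.WriteAllLinesAsync(
    "batch-input.jsonl",
    requests.Select(r => JsonSerializer.Serialize(r)));
```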

The Batch API returns results within 24 hours. This latency is the trade-off for the cost reduction — only use the Batch API for workloads that can tolerate asynchronous processing. For document classification, embedding generation, and offline evaluation, 24-hour latency is entirely acceptable.

See the Azure AI Foundry Batch API documentation for request format details and polling patterns.

Monitoring Cost

You cannot optimize what you do not measure. As the Architect's Note below emphasises: instrument before you optimize. A simple .NET Meter tracks estimated cost per request, per model, and per feature:

using System.Diagnostics;         // TagList
using System.Diagnostics.Metrics; // Meter, Histogram

private static readonly Meter _meter = new("MyApp.AI.Cost", "1.0.0");
private static readonly Histogram<double> _costHistogram =
    _meter.CreateHistogram<double>("ai.request.cost.usd", "USD", "Estimated cost per request");

// April 2026 pricing (per million tokens) — GPT-5.4 family
private static readonly Dictionary<string, (double Input, double Output)> _pricing = new()
{
    ["gpt-5.4"]      = (2.00,  8.00),
    ["gpt-5.4-mini"] = (0.40,  1.60),
    ["gpt-5.4-nano"] = (0.10,  0.40),
};

// After each call — tag with model name and the feature that triggered it
var (inputRate, outputRate) = _pricing.GetValueOrDefault(modelName, (2.00, 8.00));
double estimatedCost = (completion.Usage.InputTokenCount  / 1_000_000.0 * inputRate) +
                       (completion.Usage.OutputTokenCount / 1_000_000.0 * outputRate);
_costHistogram.Record(estimatedCost, new TagList { { "model", modelName }, { "feature", featureName } });

Tag every cost metric with the model name and the application feature. After a week of production data, you will see clearly which features consume the most tokens — often it is one or two features driving 80% of cost. That is where to focus optimization effort.

Export this metric to Azure Monitor or Prometheus. Alert when daily cost exceeds a threshold. Use the feature tag to attribute cost back to product features, which turns AI cost into a conversation about feature value rather than a pure engineering expense.

Putting It Together

These strategies are not mutually exclusive — the most cost-effective production systems layer several of them at once, starting with instrumentation:

  1. Instrument with cost metrics to establish baseline and identify expensive features
  2. Route by model tier to redirect queries to the cheapest capable model (nano/mini/GPT-5.4) — up to 20x savings per routed query
  3. Cache repeated queries with SemanticCacheFilter (40-60% call reduction in high-repetition scenarios)
  4. Budget per user to protect shared quota in multi-tenant apps
  5. Compress prompts to remove redundancy (20-40% prompt size reduction)
  6. Switch to Ollama in development to eliminate dev/test API costs entirely
  7. Batch non-real-time workloads at 50% discount

Applied together, these strategies routinely reduce Azure AI Foundry spend by 60-80% from an unoptimised baseline. In 2026, three-tier model routing across GPT-5.4-nano, GPT-5.4-mini, and GPT-5.4 typically delivers the largest single cost reduction — routing just 50% of queries to nano or mini cuts the per-token spend on that traffic by roughly 80-95%, given their 5x and 20x discounts.

Model Selection for Cost Optimisation (2026)

Choosing the right model tier is the highest-leverage cost decision available. Here is the practical guide for .NET teams:

Scenario                                   Recommended model          Why
All new production APIs                    gpt-5.4-mini               Best cost/performance default
Simple classification, extraction          gpt-5.4-nano               20x cheaper than flagship
Agents, reasoning chains, complex code     gpt-5.4                    Quality justifies cost
Bulk async processing (documents, eval)    gpt-5.4-mini + Batch API   50% additional discount
Local dev and testing                      Ollama (phi4-mini)         Zero cost

Default rule: Start every new feature with gpt-5.4-mini. Only upgrade to gpt-5.4 when you have evidence (from your cost metrics) that mini is producing inadequate output for that specific workload. The savings from defaulting to mini are immediate and require no architectural change.

Cost Optimisation Tips for Azure AI Foundry

  • Prefer mini models for all API endpoints — gpt-5.4-mini handles 80% of use cases that teams initially over-provision to gpt-5.4
  • Cache prompt prefixes where possible — Azure AI Foundry supports prompt caching; repeated system prompts cost significantly less on subsequent calls
  • Use embeddings instead of full context injection — vector retrieval via text-embedding-3-small + Azure AI Search is orders of magnitude cheaper than stuffing full documents into every prompt
  • Instrument before you optimise — without per-feature token metrics you cannot know which 20% of features consume 80% of your spend
  • Set MaxOutputTokenCount on every request — uncapped output is the most common source of unexpected cost spikes in production

Why GPT-5.4 Replaces GPT-4o for New Applications

If you are migrating from code written in 2024 or early 2025, it used GPT-4o or GPT-4o-mini. GPT-5.4 is the current standard for new .NET AI applications:

  • Better reasoning per token — GPT-5.4-mini outperforms GPT-4o on most benchmark tasks at comparable cost
  • Larger context window — supports significantly longer documents and conversation histories natively
  • Three-tier family — nano/mini/flagship gives you granular routing options that GPT-4o did not
  • Native agent support — designed for tool use and multi-step orchestration from the ground up
  • Same SDK, same IChatClient pattern — update the deployment model name and nothing else changes in your C# code

The migration path is: update your Azure AI Foundry deployments to GPT-5.4-mini, update your deployment name strings, re-run your test suite. If quality is maintained (it will be for most workloads), you are done.

⚠ Production Considerations

  • Semantic caching is only safe for deterministic or factual queries. Caching responses to subjective or time-sensitive questions ('What's the best approach today?' 'What happened this week?') returns stale answers. Add a cache TTL and exclude question types that require freshness.
  • The complexity classifier itself consumes tokens. If the classifier averages 20 input tokens and your queries average 100 input tokens, the token overhead is 20% — though because the classifier runs on GPT-5.4-nano, the cost overhead is much smaller. Routing still only earns its complexity when a meaningful share of queries (roughly 30% or more) is correctly sent to cheaper tiers; below that, the engineering and latency overhead outweighs the savings.

🧠 Architect’s Note

Build cost observability before implementing cost optimization. Without metrics on per-feature token consumption, you cannot identify which features are expensive or validate that your optimizations are working. Instrument first, optimize second — the data will show you where 80% of your cost comes from.


Key Takeaways

  • GPT-5.4 model tier routing: GPT-5.4-mini is 5x cheaper than GPT-5.4; GPT-5.4-nano is 20x cheaper — route simple queries to the cheapest capable tier
  • Semantic caching as IFunctionInvocationFilter: 40-60% cost reduction for repeated queries
  • Per-user token budgets prevent individual users from monopolizing shared quota
  • Batch API: 50% cost reduction for non-real-time workloads (document processing, evaluation) — combine with GPT-5.4-mini for maximum savings
  • Ollama for local dev eliminates all API costs during development and testing
  • Use .NET 10 (LTS) for all new cost-optimized production systems

Implementation Checklist

  • Classify request complexity before routing — use nano for simple, mini for balanced, GPT-5.4 for complex
  • Implement semantic cache using vector similarity on embeddings
  • Set max_tokens on every request to cap unexpected output costs
  • Track completion.Usage.TotalTokenCount per user for budget enforcement
  • Register Ollama as IChatClient in development environments — eliminates all dev/test API cost
  • Evaluate Batch API for any non-real-time AI workloads
  • Target .NET 10 (LTS) for all new production systems

Frequently Asked Questions

What is the biggest driver of high Azure OpenAI costs for .NET developers?

The single biggest cost driver is using a premium model (GPT-5.4 or GPT-5.4-pro) for all requests when the majority of queries could be handled by GPT-5.4-mini at 5x lower cost, or GPT-5.4-nano at 20x lower cost for simple lookups. Classify request complexity before routing and send simple queries to the cheapest tier that can handle them reliably.

How does semantic caching reduce Azure OpenAI costs?

Semantic caching stores previous AI responses indexed by embedding. When a new request arrives, compute its embedding and search the cache for similar past queries. If similarity exceeds 0.92, return the cached response without calling the API. This can reduce repeated or near-identical queries by 40-60%.

How do I implement model routing in C# with Semantic Kernel?

Create a classifier function that calls GPT-5.4-nano to assess query complexity as simple/moderate/complex. Route simple queries to GPT-5.4-nano, moderate to GPT-5.4-mini, and complex to GPT-5.4 with a higher max_tokens limit. The classifier call costs fewer than 20 tokens and pays for itself when it correctly routes even 10% of queries to the cheaper tier.

What is Provisioned Throughput (PTU) and when should .NET teams use it?

PTU is a monthly capacity reservation for Azure AI Foundry that gives guaranteed throughput in exchange for a fixed cost. It is cost-effective when your usage exceeds roughly 50% of the PTU capacity consistently — typically at 50M+ tokens per month. Below that, pay-as-you-go is cheaper. Use the Azure AI Foundry PTU calculator to find your break-even point.

How can I use Ollama to reduce costs in development?

Register Ollama as your IChatClient for non-production environments. Your application code does not change — only the DI registration differs between development (Ollama endpoint with a local model such as phi4-mini) and production (Azure OpenAI endpoint). This eliminates API costs entirely during development and testing, which adds up quickly once you are actively building and debugging features.

What is the Azure OpenAI Batch API and how much does it save?

The Batch API processes asynchronous requests at 50% lower cost than the real-time API. Submit a JSONL file of requests, and Azure processes them within 24 hours. Ideal for document classification, bulk embedding, batch evaluation, and offline enrichment — any workload that doesn't require real-time responses.

How do I set per-user token budgets in a multi-tenant .NET app?

Maintain a per-user token counter in Redis or a database. Before each AI call, check if the user has remaining daily quota. After each call, deduct completion.Usage.TotalTokenCount from their remaining budget. Return HTTP 429 with a descriptive message when the budget is exhausted rather than forwarding to Azure AI Foundry.



#Cost Optimization #Azure AI Foundry #Azure OpenAI #GPT-5.4 #Token Budget #Model Routing #Semantic Caching #.NET AI
