
Token Counting and Context Management in C# for Azure OpenAI

Intermediate · .NET 9 · Microsoft.ML.Tokenizers 0.22.0 · Azure.AI.OpenAI 2.1.0
By Rajesh Mishra · Mar 21, 2026 · 13 min read
Verified Mar 2026 · .NET 9 · Microsoft.ML.Tokenizers 0.22.0
In 30 Seconds

Token counting in C# uses Microsoft.ML.Tokenizers 0.22.x with TiktokenTokenizer.CreateForModel(). Chat messages add 4 tokens overhead per message plus 2 for reply priming. Implement a token budget as an IFunctionInvocationFilter in Semantic Kernel to prevent context overflow before API calls. Pre-call cost estimation multiplies counted tokens by per-model per-million-token pricing. Always add a 10% buffer to estimates.

Token counting is one of those unsexy fundamentals that separates production AI applications from prototypes. Every Azure OpenAI request costs money per token. Every model has a hard context window limit. Every RAG chunking strategy needs token-accurate boundaries, not character estimates. Get this wrong and you’ll see unexpected bill spikes, mysterious 400 errors when conversations grow long, and RAG retrieval that degrades because your chunks are the wrong size.

This guide covers the full stack: choosing the right tokenizer library, counting plain text and chat messages, building a token budget middleware for Semantic Kernel, estimating costs before API calls, and processing documents in token-bounded batches.

Why Token Counting Matters

There are three concrete reasons to instrument your application with token counting from the start.

Cost control. Azure OpenAI charges per token — both input and output. A 128K-token context filled with conversation history costs roughly 128 times more than a single short message. Without token counting, you have no visibility into what you’re actually spending per request, per user, or per feature. Production AI systems need per-request cost telemetry just as much as they need latency metrics.

Context window limits. GPT-4o supports a 128,000-token context window, which sounds large until you start accumulating conversation history, RAG chunks, system prompts, and tool schemas simultaneously. When you exceed the limit, the Azure OpenAI API returns a 400 error. Worse, some SDKs silently truncate history, causing the model to lose context without any error surfacing to your application. Proactive token counting lets you trim or summarize history before hitting that wall.
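One simple trimming strategy can be sketched as follows. The `TrimHistory` helper and the `countTokens` delegate are illustrative names, not SDK APIs; the counter is pluggable so the sketch stays independent of any tokenizer package — in practice you would pass `tokenizer.CountTokens`. The idea: always keep the system message, then walk the history newest-to-oldest until the budget is spent.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HistoryTrimmer
{
    // Keeps the first (system) message plus the most recent messages that
    // fit within tokenBudget, preserving chronological order of the result.
    public static List<string> TrimHistory(
        IReadOnlyList<string> messages,
        Func<string, int> countTokens,
        int tokenBudget)
    {
        if (messages.Count == 0) return new List<string>();

        var kept = new LinkedList<string>();
        int used = countTokens(messages[0]); // system message is always kept

        // Walk newest-to-oldest, stopping once the budget is exhausted.
        for (int i = messages.Count - 1; i >= 1; i--)
        {
            int cost = countTokens(messages[i]);
            if (used + cost > tokenBudget) break;
            used += cost;
            kept.AddFirst(messages[i]);
        }

        kept.AddFirst(messages[0]);
        return kept.ToList();
    }
}
```

Dropping whole messages from the oldest end is the cheapest option; summarizing the dropped span into a single synthetic message preserves more context at the cost of an extra model call.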

Chunking accuracy for RAG. The common shortcut of “1 token ≈ 4 characters” is a reasonable approximation for English, but it breaks down significantly for non-English text, code, and technical vocabulary. SQL keywords tokenize differently than prose. Chinese and Japanese characters are typically 1-2 tokens each, not 0.25. If you’re building a RAG pipeline and sizing chunks by character count, your chunk boundaries will be off — some chunks will be too large and get rejected, others too small and lose semantic coherence. Token-accurate chunking requires an actual tokenizer.
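To make the shortcut concrete, here is the character heuristic written out as code — shown only to illustrate its limits, not as a substitute for a real tokenizer (the `EstimateByChars` name is illustrative):

```csharp
using System;

public static class TokenHeuristics
{
    // The common "1 token is roughly 4 characters" shortcut. Reasonable for
    // English prose; misleading for code, SQL, and CJK text, where real
    // tokenizer counts diverge sharply.
    public static int EstimateByChars(string text) =>
        (int)Math.Ceiling(text.Length / 4.0);
}
```

For English prose the heuristic often lands near the real count, but for a SQL fragment or a Japanese sentence the divergence is large enough to misplace chunk boundaries — which is why the chunking and batching code later in this guide calls `CountTokens` on an actual tokenizer instead.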

The Tokenizer Landscape in .NET

Three libraries have circulated in the .NET ecosystem for TikToken-compatible tokenization. They are not equal.

SharpToken is a community-maintained port of the Python TikToken library. It is accurate and was widely used before Microsoft released an official solution, but it has no Microsoft backing and its maintenance cadence depends entirely on volunteer contributors. For production systems, dependence on a community library with uncertain longevity is a risk.

Microsoft.DeepDev.TokenizerLib was Microsoft’s first attempt at a .NET tokenizer. It has been officially deprecated by Microsoft and should not be used in new projects. If you have existing code using this library, migrate away from it.

Microsoft.ML.Tokenizers is the recommended choice. It is maintained by the Microsoft ML.NET team, supports both cl100k_base (used by GPT-4, GPT-3.5-Turbo, and text-embedding-3 models) and o200k_base (used by GPT-4o), and ships on the same release cadence as the broader ML.NET ecosystem. This is what you should use.

Install it:

dotnet add package Microsoft.ML.Tokenizers --version 0.22.0

Setup and Basic Token Counting

The entry point is TiktokenTokenizer.CreateForModel(), which accepts a model name and returns a tokenizer configured with the correct encoding for that model.

using Microsoft.ML.Tokenizers;

// Create tokenizer for GPT-4o (uses o200k_base encoding)
TiktokenTokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");

// Count tokens in plain text
int tokenCount = tokenizer.CountTokens("Hello, how are you today?");
Console.WriteLine($"Token count: {tokenCount}"); // ~6 tokens

// Get the actual tokens (useful for debugging)
IReadOnlyList<int> tokens = tokenizer.EncodeToIds("Hello, how are you today?");
Console.WriteLine($"Token IDs: [{string.Join(", ", tokens)}]");

TiktokenTokenizer.CreateForModel("gpt-4o") downloads the tokenizer vocabulary file on the first call and caches it locally. This means the first call has network overhead and will fail in air-gapped environments. Cache the tokenizer instance — create it once at startup and reuse it throughout the application lifetime. Creating it per-request is wasteful and risks hitting the network on every call.

For dependency injection, register it as a singleton:

builder.Services.AddSingleton(_ => TiktokenTokenizer.CreateForModel("gpt-4o"));

Counting Tokens for Chat Messages

Plain text token counts are not the full picture for chat completions. The OpenAI message format wraps each message with role identifiers and formatting delimiters that consume additional tokens. The overhead per message is 4 tokens (for the role name and boundary markers), plus 2 tokens at the end of the conversation to prime the assistant reply.

using Microsoft.ML.Tokenizers;
using OpenAI.Chat;

public static class TokenCounter
{
    private static readonly TiktokenTokenizer _tokenizer =
        TiktokenTokenizer.CreateForModel("gpt-4o");

    /// <summary>
    /// Counts tokens for a list of chat messages following the OpenAI message format overhead.
    /// Each message adds 4 tokens of overhead; the reply is primed with 2 tokens.
    /// </summary>
    public static int CountChatTokens(IEnumerable<ChatMessage> messages)
    {
        int total = 2; // Reply priming: <|im_start|>assistant

        foreach (var message in messages)
        {
            total += 4; // Per-message overhead: role + formatting tokens
            total += _tokenizer.CountTokens(GetMessageText(message));
        }

        return total;
    }

    private static string GetMessageText(ChatMessage message) => message switch
    {
        UserChatMessage user => string.Join(" ", user.Content.Select(p => p.Text ?? string.Empty)),
        AssistantChatMessage assistant => string.Join(" ", assistant.Content?.Select(p => p.Text ?? string.Empty) ?? []),
        SystemChatMessage system => string.Join(" ", system.Content.Select(p => p.Text ?? string.Empty)),
        _ => string.Empty
    };
}

This formula — 4 tokens per message, 2 tokens for reply priming — is the same calculation used by OpenAI’s Python cookbook. It does not account for function/tool call schemas, which are serialized separately and add their own overhead (addressed in the production pitfalls section).

Call this before every chat completion to know your input token budget:

var messages = new List<ChatMessage>
{
    new SystemChatMessage("You are a helpful assistant."),
    new UserChatMessage("Explain token counting in three sentences.")
};

int inputTokens = TokenCounter.CountChatTokens(messages);
Console.WriteLine($"Input tokens: {inputTokens}");

Implementing a Token Budget as an IFunctionInvocationFilter

In Semantic Kernel, the cleanest place to enforce a token budget is an IFunctionInvocationFilter. This runs before every function call, giving you a pre-flight check that prevents oversized requests from ever reaching the Azure OpenAI API. This is more efficient than handling the 400 error after the fact — you save the network round-trip and get a clean application-layer error with a meaningful message.

If you are not familiar with how Semantic Kernel’s plugin and filter system is structured, the Semantic Kernel Architecture Deep Dive covers the filter pipeline in detail.

using Microsoft.Extensions.Logging;
using Microsoft.ML.Tokenizers;
using Microsoft.SemanticKernel;

public class TokenBudgetFilter : IFunctionInvocationFilter
{
    private readonly TiktokenTokenizer _tokenizer;
    private readonly int _maxInputTokens;
    private readonly ILogger<TokenBudgetFilter> _logger;

    public TokenBudgetFilter(
        ILogger<TokenBudgetFilter> logger,
        int maxInputTokens = 100_000)
    {
        _tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");
        _maxInputTokens = maxInputTokens;
        _logger = logger;
    }

    public async Task OnFunctionInvocationAsync(
        FunctionInvocationContext context,
        Func<FunctionInvocationContext, Task> next)
    {
        // Check if there's a prompt argument to count
        if (context.Arguments.TryGetValue("input", out var input) && input is string promptText)
        {
            int tokenCount = _tokenizer.CountTokens(promptText);

            if (tokenCount > _maxInputTokens)
            {
                _logger.LogWarning(
                    "Token budget exceeded: {TokenCount} tokens, limit is {MaxTokens}",
                    tokenCount, _maxInputTokens);

                throw new InvalidOperationException(
                    $"Input exceeds token budget of {_maxInputTokens} tokens. " +
                    $"Current input: {tokenCount} tokens.");
            }

            _logger.LogDebug("Token count: {TokenCount}/{MaxTokens}", tokenCount, _maxInputTokens);
        }

        await next(context);
    }
}

Register the filter in your ASP.NET Core application:

builder.Services.AddSingleton<IFunctionInvocationFilter, TokenBudgetFilter>();
builder.Services.AddKernel()
    .AddAzureOpenAIChatCompletion(deployment, endpoint, apiKey);

// AddKernel registers Kernel as a transient service, and every Kernel
// resolved from the container picks up all IFunctionInvocationFilter
// registrations automatically. Adding the filter to a single resolved
// Kernel instance would only affect that one instance, so register the
// filter in DI rather than mutating kernel.FunctionInvocationFilters.

For multi-tenant SaaS applications, extend this pattern to accept a per-user budget retrieved from your tenancy configuration, rather than a single global limit. That way each tenant’s token consumption is independently capped.
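One way to sketch that extension — the `ITenantBudgetProvider` abstraction and both type names below are hypothetical, not Semantic Kernel APIs — is to resolve the caller's budget per invocation instead of reading a fixed field:

```csharp
using System.Collections.Generic;

// Hypothetical abstraction: resolves a token budget per tenant, falling
// back to a global default for tenants with no explicit configuration.
public interface ITenantBudgetProvider
{
    int GetMaxInputTokens(string tenantId);
}

public sealed class InMemoryTenantBudgetProvider : ITenantBudgetProvider
{
    private readonly Dictionary<string, int> _budgets;
    private readonly int _defaultBudget;

    public InMemoryTenantBudgetProvider(
        Dictionary<string, int> budgets, int defaultBudget = 100_000)
    {
        _budgets = budgets;
        _defaultBudget = defaultBudget;
    }

    public int GetMaxInputTokens(string tenantId) =>
        _budgets.TryGetValue(tenantId, out var limit) ? limit : _defaultBudget;
}
```

Inject the provider into `TokenBudgetFilter`, pull the tenant id from your ambient request context, and call `GetMaxInputTokens(tenantId)` inside `OnFunctionInvocationAsync` in place of the `_maxInputTokens` field. A database- or cache-backed implementation replaces the in-memory dictionary in production.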

Pre-flight Token Validation

For direct ChatClient usage outside of Semantic Kernel (the client returned by AzureOpenAIClient.GetChatClient), implement the validation at the service layer before calling CompleteChatAsync:

public async Task<string> CompleteChatWithValidationAsync(
    List<ChatMessage> messages,
    int maxContextTokens = 120_000,
    CancellationToken ct = default)
{
    int inputTokens = TokenCounter.CountChatTokens(messages);

    if (inputTokens > maxContextTokens)
    {
        throw new InvalidOperationException(
            $"Request would exceed model context window. " +
            $"Input tokens: {inputTokens}, limit: {maxContextTokens}. " +
            $"Consider trimming chat history or reducing input size.");
    }

    ChatCompletion completion = await _chatClient.CompleteChatAsync(messages, cancellationToken: ct);

    // Log actual usage for monitoring
    _logger.LogInformation(
        "Token usage — Input: {Input}, Output: {Output}, Total: {Total}",
        completion.Usage.InputTokenCount,
        completion.Usage.OutputTokenCount,
        completion.Usage.TotalTokenCount);

    return completion.Content[0].Text;
}

Setting maxContextTokens to 120,000 rather than the full 128,000 leaves an 8,000-token buffer for output and tool overhead. The actual charged token count can differ slightly from your estimate — always leave headroom.

Cost Estimation

Knowing your token count before sending a request lets you estimate the cost and log it as a metric. This is the foundation of per-feature and per-user cost attribution.

public static class AzureOpenAICostEstimator
{
    // Prices per million tokens (as of early 2026 — verify current pricing)
    private static readonly Dictionary<string, (double Input, double Output)> _pricing = new()
    {
        ["gpt-4o"] = (5.00, 15.00),
        ["gpt-4o-mini"] = (0.15, 0.60),
        ["text-embedding-3-small"] = (0.02, 0.00),
        ["text-embedding-3-large"] = (0.13, 0.00),
    };

    public static double EstimateRequestCost(
        string modelName,
        int inputTokens,
        int estimatedOutputTokens)
    {
        if (!_pricing.TryGetValue(modelName, out var price))
            return 0;

        return (inputTokens / 1_000_000.0 * price.Input) +
               (estimatedOutputTokens / 1_000_000.0 * price.Output);
    }
}

Use it before every significant AI call:

int inputTokens = TokenCounter.CountChatTokens(messages);
double estimatedCost = AzureOpenAICostEstimator.EstimateRequestCost(
    "gpt-4o", inputTokens, estimatedOutputTokens: 1000);
_logger.LogInformation("Estimated request cost: ${Cost:F6}", estimatedCost);

The pricing values in the dictionary above are illustrative — Azure OpenAI pricing changes, varies by region, and may be subject to commitment discounts. Always verify against the Azure OpenAI pricing page before building billing or cost allocation features. For a broader treatment of cost optimization strategies including batch APIs, caching, and model routing, see AI Cost Optimization for .NET Developers.

Client-side TikToken counting is accurate for text content but does not account for all formatting overhead. The actual charged token count can differ by 1-3% due to tool call formatting, image tokens in vision models, and internal system overhead. For cost estimation, add a 10% buffer to your calculated estimate.
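A small helper keeps the buffer at the estimation boundary so callers never log an unbuffered figure (the `CostBuffer`/`WithBuffer` names are illustrative):

```csharp
public static class CostBuffer
{
    // Applies a safety margin to a raw cost estimate; 10% by default,
    // matching the guidance above.
    public static double WithBuffer(double estimatedCost, double bufferRate = 0.10) =>
        estimatedCost * (1.0 + bufferRate);
}
```

For example, a raw estimate of $0.065 (10,000 input tokens and 1,000 estimated output tokens at the illustrative GPT-4o prices above) becomes a buffered $0.0715.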

Batch Processing with Token Budgets

When processing large document sets for embeddings or summarization, you need to accumulate documents into token-bounded batches rather than sending one document at a time or trying to fit everything into a single call.

public async Task ProcessDocumentsAsync(
    IEnumerable<string> documents,
    int batchTokenLimit = 50_000)
{
    var batch = new List<string>();
    int batchTokens = 0;

    foreach (var doc in documents)
    {
        int docTokens = _tokenizer.CountTokens(doc);

        if (batchTokens + docTokens > batchTokenLimit && batch.Count > 0)
        {
            // Flush current batch
            await ProcessBatchAsync(batch);
            batch.Clear();
            batchTokens = 0;
        }

        batch.Add(doc);
        batchTokens += docTokens;
    }

    if (batch.Count > 0)
        await ProcessBatchAsync(batch);
}

This pattern ensures no batch exceeds your token limit while maximizing throughput by packing as many documents as possible into each API call. The flush-before-add logic handles the edge case where a single document is larger than the batch limit — in that case it gets added to an empty batch and processed alone. If individual documents can exceed the limit, add a pre-check and split oversized documents before entering the batch loop.

The batchTokenLimit of 50,000 is conservative for a 128K context window. Leave room for the model’s response, system prompts, and any per-document metadata you’re including. For embedding calls, which have no output tokens, you can push closer to the model’s input limit.

⚠ Production Considerations

  • TiktokenTokenizer.CreateForModel() downloads the tokenizer vocabulary on first call. In containerized deployments with no internet access, this will fail. Pre-warm the tokenizer at startup by calling CreateForModel() in a hosted service initialization method, or embed the vocabulary file in your container image.
  • Token count for function definitions (tool schemas) is NOT included in content token counts. When using Auto Function Calling with many plugins, the tool schema tokens can consume 1,000-3,000 tokens of your context budget invisibly. Account for this by reducing your content token limit by an estimated tool overhead.
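The pre-warming suggestion in the first bullet can be sketched as an IHostedService — this assumes the Microsoft.Extensions.Hosting package and the singleton tokenizer registration shown earlier; adapt to your startup pipeline:

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;
using Microsoft.ML.Tokenizers;

// Forces tokenizer vocabulary initialization at startup so the first
// user request never pays the download/initialization cost.
public sealed class TokenizerWarmupService : IHostedService
{
    private readonly TiktokenTokenizer _tokenizer;

    public TokenizerWarmupService(TiktokenTokenizer tokenizer) =>
        _tokenizer = tokenizer;

    public Task StartAsync(CancellationToken cancellationToken)
    {
        _tokenizer.CountTokens("warmup"); // touch the vocabulary
        return Task.CompletedTask;
    }

    public Task StopAsync(CancellationToken cancellationToken) => Task.CompletedTask;
}
```

Register it with `builder.Services.AddHostedService<TokenizerWarmupService>();`. A startup failure here surfaces the air-gapped-environment problem immediately, at deploy time, instead of on the first user request.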


🧠 Architect’s Note

Build token counting as a first-class service in your AI middleware stack, not as an afterthought. A per-user token budget enforced at the application layer prevents individual users from monopolizing shared Azure OpenAI quota — which is especially important in multi-tenant SaaS applications where one power user can starve other tenants.

AI-Friendly Summary


Key Takeaways

  • Use Microsoft.ML.Tokenizers 0.22.x — the Microsoft-maintained TikToken library for .NET
  • Chat messages add 4 tokens overhead per message + 2 for reply priming beyond text content
  • IFunctionInvocationFilter makes a clean token budget middleware in Semantic Kernel
  • Pre-call cost estimation: count tokens × (price / 1M) before every expensive AI call
  • Client-side counting is accurate to ~97-99% — always add a 10% safety buffer

Implementation Checklist

  • Add Microsoft.ML.Tokenizers 0.22.0 NuGet package
  • Create TiktokenTokenizer using TiktokenTokenizer.CreateForModel("gpt-4o")
  • Account for 4-token message overhead and 2-token reply priming in chat token counts
  • Implement IFunctionInvocationFilter as a token budget gate in Semantic Kernel
  • Add pre-call cost estimation logging for every AI request
  • Set a max_tokens limit on every request to cap output token consumption

Frequently Asked Questions

What is the best tokenizer library for .NET in 2026?

Microsoft.ML.Tokenizers 0.22.x is the recommended library. It is maintained by Microsoft, supports the TikToken encodings (cl100k_base for GPT-4 and GPT-3.5-Turbo, o200k_base for GPT-4o), and is actively updated. Avoid SharpToken (community-maintained, uncertain longevity) and the deprecated Microsoft.DeepDev.TokenizerLib.

How do I count tokens for a chat message in C#?

Each chat message adds overhead beyond its text content. The formula is: 4 tokens per message (for role and formatting), plus the token count of the content, plus 2 tokens priming the reply. Use TiktokenTokenizer.CreateForModel("gpt-4o").CountTokens(messageContent) for the content part, then add 4 per message and 2 for the reply.

What is a token budget middleware in Semantic Kernel?

A token budget middleware is an IFunctionInvocationFilter that counts tokens before every function call, compares against a configured limit, and rejects calls that would exceed the budget. This prevents context overflow before it reaches the Azure OpenAI API, which is more efficient than handling the 400 error after the fact.

How do I estimate the cost of an Azure OpenAI request before sending it?

Count input tokens with Microsoft.ML.Tokenizers, estimate output tokens based on your max_tokens setting, then multiply by per-model pricing. For GPT-4o: (inputTokens / 1_000_000 * 5.0) + (estimatedOutputTokens / 1_000_000 * 15.0). This gives a pre-call cost estimate in USD.

Does token count change between Azure OpenAI models?

No single encoding covers them all: GPT-4 and GPT-3.5-Turbo use cl100k_base, while GPT-4o and GPT-4o-mini use the newer o200k_base. TiktokenTokenizer.CreateForModel() selects the correct encoding from the model name, so use CreateForModel("gpt-4o") for GPT-4o-family models and CreateForModel("gpt-4") for the older ones. The text-embedding-3 models use cl100k_base, but always verify against the model's tokenizer documentation.

How accurate is client-side token counting compared to what Azure OpenAI charges?

Client-side TikToken counting is highly accurate for text content but does not account for all formatting overhead. The actual charged token count can differ by 1-3% due to tool call formatting, image tokens in vision models, and internal system overhead. Always add a 10% buffer to your estimated costs.

Can I count tokens without making an API call?

Yes. Microsoft.ML.Tokenizers works entirely offline — no API calls required. TiktokenTokenizer.CreateForModel(modelName) loads the tokenizer vocabulary locally. You can count tokens for thousands of documents per second without any network overhead or API quota consumption.



#Azure OpenAI #Token Counting #Microsoft.ML.Tokenizers #Context Management #.NET AI