Token counting is one of those unsexy fundamentals that separates production AI applications from prototypes. Every Azure OpenAI request costs money per token. Every model has a hard context window limit. Every RAG chunking strategy needs token-accurate boundaries, not character estimates. Get this wrong and you’ll see unexpected bill spikes, mysterious 400 errors when conversations grow long, and RAG retrieval that degrades because your chunks are the wrong size.
This guide covers the full stack: choosing the right tokenizer library, counting plain text and chat messages, building a token budget middleware for Semantic Kernel, estimating costs before API calls, and processing documents in token-bounded batches.
Why Token Counting Matters
There are three concrete reasons to instrument your application with token counting from the start.
Cost control. Azure OpenAI charges per token — both input and output. A 128K-token context filled with conversation history costs roughly 128 times more than a single short message. Without token counting, you have no visibility into what you’re actually spending per request, per user, or per feature. Production AI systems need per-request cost telemetry just as much as they need latency metrics.
Context window limits. GPT-4o supports a 128,000-token context window, which sounds large until you start accumulating conversation history, RAG chunks, system prompts, and tool schemas simultaneously. When you exceed the limit, the Azure OpenAI API returns a 400 error. Worse, some SDKs silently truncate history, causing the model to lose context without any error surfacing to your application. Proactive token counting lets you trim or summarize history before hitting that wall.
Chunking accuracy for RAG. The common shortcut of “1 token ≈ 4 characters” is a reasonable approximation for English, but it breaks down significantly for non-English text, code, and technical vocabulary. SQL keywords tokenize differently than prose. Chinese and Japanese characters are typically 1-2 tokens each, not 0.25. If you’re building a RAG pipeline and sizing chunks by character count, your chunk boundaries will be off — some chunks will be too large and get rejected, others too small and lose semantic coherence. Token-accurate chunking requires an actual tokenizer.
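To see how far the 4-characters-per-token heuristic drifts in practice, here is a quick sketch using the Microsoft.ML.Tokenizers package covered below (the sample strings are arbitrary, and the exact counts depend on the encoding):

```csharp
using Microsoft.ML.Tokenizers;

TiktokenTokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");

string[] samples =
{
    "The quick brown fox jumps over the lazy dog.",                // plain English prose
    "SELECT u.Id, u.Email FROM Users u WHERE u.IsActive = 1;",     // SQL keywords and identifiers
    "トークンのカウントは重要です。"                                  // Japanese: roughly 1-2 tokens per character
};

foreach (string text in samples)
{
    int actual = tokenizer.CountTokens(text);
    int estimate = text.Length / 4; // the "1 token ≈ 4 chars" shortcut
    Console.WriteLine($"chars={text.Length,3} est={estimate,3} actual={actual,3}  {text}");
}
```

For English prose the estimate lands close; for SQL and especially for CJK text the character-based estimate undercounts, which is exactly how character-sized chunks end up oversized in token terms.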
The Tokenizer Landscape in .NET
Three libraries have circulated in the .NET ecosystem for tiktoken-compatible tokenization. They are not equal.
SharpToken is a community-maintained port of the Python tiktoken library. It is accurate and was widely used before Microsoft released an official solution, but it has no Microsoft backing and its maintenance cadence depends entirely on volunteer contributors. For production systems, dependence on a community library with uncertain longevity is a risk.
Microsoft.DeepDev.TokenizerLib was Microsoft’s first attempt at a .NET tokenizer. It has been officially deprecated by Microsoft and should not be used in new projects. If you have existing code using this library, migrate away from it.
Microsoft.ML.Tokenizers is the recommended choice. It is maintained by the Microsoft ML.NET team, supports both cl100k_base (used by GPT-4, GPT-3.5-Turbo, and text-embedding-3 models) and o200k_base (used by GPT-4o), and ships on the same release cadence as the broader ML.NET ecosystem. This is what you should use.
Install it:
dotnet add package Microsoft.ML.Tokenizers --version 0.22.0
Setup and Basic Token Counting
The entry point is TiktokenTokenizer.CreateForModel(), which accepts a model name and returns a tokenizer configured with the correct encoding for that model.
using Microsoft.ML.Tokenizers;
// Create tokenizer for GPT-4o (uses o200k_base encoding)
TiktokenTokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");
// Count tokens in plain text
int tokenCount = tokenizer.CountTokens("Hello, how are you today?");
Console.WriteLine($"Token count: {tokenCount}"); // ~6 tokens
// Get the actual tokens (useful for debugging)
IReadOnlyList<int> tokens = tokenizer.EncodeToIds("Hello, how are you today?");
Console.WriteLine($"Token IDs: [{string.Join(", ", tokens)}]");
TiktokenTokenizer.CreateForModel("gpt-4o") downloads the tokenizer vocabulary file on the first call and caches it locally. This means the first call has network overhead and will fail in air-gapped environments. Cache the tokenizer instance — create it once at startup and reuse it throughout the application lifetime. Creating it per-request is wasteful and risks hitting the network on every call.
For dependency injection, register it as a singleton:
builder.Services.AddSingleton(_ => TiktokenTokenizer.CreateForModel("gpt-4o"));
Counting Tokens for Chat Messages
Plain text token counts are not the full picture for chat completions. The OpenAI message format wraps each message with role identifiers and formatting delimiters that consume additional tokens. For current chat models the overhead is 3 tokens per message (for the role name and boundary markers), plus 3 tokens at the end of the conversation to prime the assistant reply.
using Microsoft.ML.Tokenizers;
using OpenAI.Chat;

public static class TokenCounter
{
    private static readonly TiktokenTokenizer _tokenizer =
        TiktokenTokenizer.CreateForModel("gpt-4o");

    /// <summary>
    /// Counts tokens for a list of chat messages following the OpenAI message format overhead.
    /// Each message adds 3 tokens of overhead; the reply is primed with 3 tokens.
    /// </summary>
    public static int CountChatTokens(IEnumerable<ChatMessage> messages)
    {
        int total = 3; // Reply priming: <|start|>assistant<|message|>
        foreach (var message in messages)
        {
            total += 3; // Per-message overhead: role + formatting tokens
            total += _tokenizer.CountTokens(GetMessageText(message));
        }
        return total;
    }

    private static string GetMessageText(ChatMessage message) => message switch
    {
        UserChatMessage user => string.Join(" ", user.Content.Select(p => p.Text ?? string.Empty)),
        AssistantChatMessage assistant => string.Join(" ", assistant.Content?.Select(p => p.Text ?? string.Empty) ?? []),
        SystemChatMessage system => string.Join(" ", system.Content.Select(p => p.Text ?? string.Empty)),
        _ => string.Empty
    };
}
This formula (3 tokens per message, 3 tokens for reply priming) matches the num_tokens_from_messages calculation in OpenAI's Python cookbook for current chat models; older models such as gpt-3.5-turbo-0301 used 4 tokens per message. It does not account for function/tool call schemas, which are serialized separately and add their own overhead (addressed in the production pitfalls section).
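OpenAI does not publish the exact server-side serialization of tool schemas, so schema overhead can only be approximated. One common approach is to count the tokens of each tool's JSON schema and pad per tool; the helper below is a hypothetical sketch, and the 10-token padding is a heuristic, not a documented constant:

```csharp
// Rough approximation of tool/function schema overhead.
// The exact server-side serialization is not published, so pad each tool.
public static int EstimateToolSchemaTokens(
    TiktokenTokenizer tokenizer, IEnumerable<string> toolSchemaJson)
{
    int total = 0;
    foreach (string schema in toolSchemaJson)
    {
        total += tokenizer.CountTokens(schema);
        total += 10; // per-tool formatting padding (heuristic)
    }
    return total;
}
```

Add this estimate to your chat-message count when tools are registered, and treat the sum as a lower bound rather than an exact figure.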
Call this before every chat completion to know your input token budget:
var messages = new List<ChatMessage>
{
new SystemChatMessage("You are a helpful assistant."),
new UserChatMessage("Explain token counting in three sentences.")
};
int inputTokens = TokenCounter.CountChatTokens(messages);
Console.WriteLine($"Input tokens: {inputTokens}");
Implementing a Token Budget as an IFunctionInvocationFilter
In Semantic Kernel, the cleanest place to enforce a token budget is an IFunctionInvocationFilter. This runs before every function call, giving you a pre-flight check that prevents oversized requests from ever reaching the Azure OpenAI API. This is more efficient than handling the 400 error after the fact — you save the network round-trip and get a clean application-layer error with a meaningful message.
If you are not familiar with how Semantic Kernel’s plugin and filter system is structured, the Semantic Kernel Architecture Deep Dive covers the filter pipeline in detail.
using Microsoft.Extensions.Logging;
using Microsoft.ML.Tokenizers;
using Microsoft.SemanticKernel;
public class TokenBudgetFilter : IFunctionInvocationFilter
{
private readonly TiktokenTokenizer _tokenizer;
private readonly int _maxInputTokens;
private readonly ILogger<TokenBudgetFilter> _logger;
public TokenBudgetFilter(
ILogger<TokenBudgetFilter> logger,
int maxInputTokens = 100_000)
{
_tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");
_maxInputTokens = maxInputTokens;
_logger = logger;
}
public async Task OnFunctionInvocationAsync(
FunctionInvocationContext context,
Func<FunctionInvocationContext, Task> next)
{
// Check if there's a prompt argument to count
if (context.Arguments.TryGetValue("input", out var input) && input is string promptText)
{
int tokenCount = _tokenizer.CountTokens(promptText);
if (tokenCount > _maxInputTokens)
{
_logger.LogWarning(
"Token budget exceeded: {TokenCount} tokens, limit is {MaxTokens}",
tokenCount, _maxInputTokens);
throw new InvalidOperationException(
$"Input exceeds token budget of {_maxInputTokens} tokens. " +
$"Current input: {tokenCount} tokens.");
}
_logger.LogDebug("Token count: {TokenCount}/{MaxTokens}", tokenCount, _maxInputTokens);
}
await next(context);
}
}
Register the filter in your ASP.NET Core application. Register it as an IFunctionInvocationFilter service; Semantic Kernel automatically picks up any filters registered in the container when a Kernel is resolved. (Resolving the Kernel afterward and mutating its FunctionInvocationFilters collection is fragile, because AddKernel registers the Kernel as a transient service.)
builder.Services.AddSingleton<IFunctionInvocationFilter, TokenBudgetFilter>();
builder.Services.AddKernel()
    .AddAzureOpenAIChatCompletion(deployment, endpoint, apiKey);
For multi-tenant SaaS applications, extend this pattern to accept a per-user budget retrieved from your tenancy configuration, rather than a single global limit. That way each tenant’s token consumption is independently capped.
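One way to wire that up is shown below. The ITenantBudgetProvider abstraction and the "tenantId" argument name are hypothetical; substitute whatever your tenancy configuration exposes:

```csharp
// Hypothetical abstraction over per-tenant limits in your tenancy configuration.
public interface ITenantBudgetProvider
{
    int GetMaxInputTokens(string tenantId);
}

public class TenantTokenBudgetFilter : IFunctionInvocationFilter
{
    private static readonly TiktokenTokenizer _tokenizer =
        TiktokenTokenizer.CreateForModel("gpt-4o");
    private readonly ITenantBudgetProvider _budgets;

    public TenantTokenBudgetFilter(ITenantBudgetProvider budgets) => _budgets = budgets;

    public async Task OnFunctionInvocationAsync(
        FunctionInvocationContext context,
        Func<FunctionInvocationContext, Task> next)
    {
        // Expects the caller to pass the tenant id as a kernel argument.
        if (context.Arguments.TryGetValue("tenantId", out var t) && t is string tenantId &&
            context.Arguments.TryGetValue("input", out var i) && i is string input)
        {
            int limit = _budgets.GetMaxInputTokens(tenantId);
            int tokens = _tokenizer.CountTokens(input);
            if (tokens > limit)
            {
                throw new InvalidOperationException(
                    $"Tenant {tenantId} exceeded its {limit}-token budget ({tokens} tokens).");
            }
        }
        await next(context);
    }
}
```

The per-tenant lookup happens on every invocation, so back GetMaxInputTokens with a cached read rather than a database round-trip.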
Pre-flight Token Validation
For direct ChatClient usage (the client returned by AzureOpenAIClient.GetChatClient) outside of Semantic Kernel, implement the validation at the service layer before calling CompleteChatAsync:
public async Task<string> CompleteChatWithValidationAsync(
List<ChatMessage> messages,
int maxContextTokens = 120_000,
CancellationToken ct = default)
{
int inputTokens = TokenCounter.CountChatTokens(messages);
if (inputTokens > maxContextTokens)
{
throw new InvalidOperationException(
$"Request would exceed model context window. " +
$"Input tokens: {inputTokens}, limit: {maxContextTokens}. " +
$"Consider trimming chat history or reducing input size.");
}
ChatCompletion completion = await _chatClient.CompleteChatAsync(messages, cancellationToken: ct);
// Log actual usage for monitoring
_logger.LogInformation(
"Token usage — Input: {Input}, Output: {Output}, Total: {Total}",
completion.Usage.InputTokenCount,
completion.Usage.OutputTokenCount,
completion.Usage.TotalTokenCount);
return completion.Content[0].Text;
}
Setting maxContextTokens to 120,000 rather than the full 128,000 leaves an 8,000-token buffer for output and tool overhead. The actual charged token count can differ slightly from your estimate — always leave headroom.
Cost Estimation
Knowing your token count before sending a request lets you estimate the cost and log it as a metric. This is the foundation of per-feature and per-user cost attribution.
public static class AzureOpenAICostEstimator
{
// Prices per million tokens (as of early 2026 — verify current pricing)
private static readonly Dictionary<string, (double Input, double Output)> _pricing = new()
{
["gpt-4o"] = (5.00, 15.00),
["gpt-4o-mini"] = (0.15, 0.60),
["text-embedding-3-small"] = (0.02, 0.00),
["text-embedding-3-large"] = (0.13, 0.00),
};
public static double EstimateRequestCost(
string modelName,
int inputTokens,
int estimatedOutputTokens)
{
if (!_pricing.TryGetValue(modelName, out var price))
return 0;
return (inputTokens / 1_000_000.0 * price.Input) +
(estimatedOutputTokens / 1_000_000.0 * price.Output);
}
}
Use it before every significant AI call:
int inputTokens = TokenCounter.CountChatTokens(messages);
double estimatedCost = AzureOpenAICostEstimator.EstimateRequestCost(
"gpt-4o", inputTokens, estimatedOutputTokens: 1000);
_logger.LogInformation("Estimated request cost: ${Cost:F6}", estimatedCost);
The pricing values in the dictionary above are illustrative — Azure OpenAI pricing changes, varies by region, and may be subject to commitment discounts. Always verify against the Azure OpenAI pricing page before building billing or cost allocation features. For a broader treatment of cost optimization strategies including batch APIs, caching, and model routing, see AI Cost Optimization for .NET Developers.
Client-side tiktoken counting is accurate for text content but does not account for all formatting overhead. The actual charged token count can differ by 1-3% due to tool call formatting, image tokens in vision models, and internal system overhead. For cost estimation, add a buffer of around 10% to your calculated estimate to stay safely conservative.
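Folding that buffer into the estimate is a one-line multiplier; a minimal sketch (the CostBuffer name and 10% default are this example's choices, not an API guarantee):

```csharp
public static class CostBuffer
{
    // Applies a safety multiplier to a client-side cost estimate.
    public static double WithBuffer(double estimatedCost, double bufferFraction = 0.10)
        => estimatedCost * (1.0 + bufferFraction);
}

// Usage: budget against the buffered figure, not the raw estimate.
double budgeted = CostBuffer.WithBuffer(estimatedCost);
```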
Batch Processing with Token Budgets
When processing large document sets for embeddings or summarization, you need to accumulate documents into token-bounded batches rather than sending one document at a time or trying to fit everything into a single call.
public async Task ProcessDocumentsAsync(
IEnumerable<string> documents,
int batchTokenLimit = 50_000)
{
var batch = new List<string>();
int batchTokens = 0;
foreach (var doc in documents)
{
int docTokens = _tokenizer.CountTokens(doc);
if (batchTokens + docTokens > batchTokenLimit && batch.Count > 0)
{
// Flush current batch
await ProcessBatchAsync(batch);
batch.Clear();
batchTokens = 0;
}
batch.Add(doc);
batchTokens += docTokens;
}
if (batch.Count > 0)
await ProcessBatchAsync(batch);
}
This pattern ensures no batch exceeds your token limit while maximizing throughput by packing as many documents as possible into each API call. The flush-before-add logic handles the edge case where a single document is larger than the batch limit — in that case it gets added to an empty batch and processed alone. If individual documents can exceed the limit, add a pre-check and split oversized documents before entering the batch loop.
The batchTokenLimit of 50,000 is conservative for a 128K context window. Leave room for the model’s response, system prompts, and any per-document metadata you’re including. For embedding calls, which have no output tokens, you can push closer to the model’s input limit.
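For the oversized-document pre-check, one approach is to split on token boundaries by encoding, slicing the ID list, and decoding each slice. SplitByTokens is a hypothetical helper; note that hard token-boundary splits can land mid-sentence, so RAG pipelines usually prefer paragraph- or sentence-aware chunking where possible:

```csharp
using System.Linq;
using Microsoft.ML.Tokenizers;

// Splits a document into pieces of at most maxTokens tokens each.
public static IEnumerable<string> SplitByTokens(
    TiktokenTokenizer tokenizer, string document, int maxTokens)
{
    IReadOnlyList<int> ids = tokenizer.EncodeToIds(document);
    for (int start = 0; start < ids.Count; start += maxTokens)
    {
        int length = Math.Min(maxTokens, ids.Count - start);
        yield return tokenizer.Decode(ids.Skip(start).Take(length));
    }
}
```

Run any document whose count exceeds batchTokenLimit through this before entering the batching loop, and each resulting piece is guaranteed to fit in a batch on its own.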