Your Azure OpenAI chat app has been working fine in development, but in production — after users have long conversations or your RAG pipeline retrieves several documents — requests start failing with a 400 error. The error message is unambiguous: you sent more tokens than the model can handle.
The Error
Azure.RequestFailedException: 400 (Bad Request)
{
"error": {
"code": "context_length_exceeded",
"message": "This model's maximum context length is 128000 tokens. However, your messages resulted in 132450 tokens. Please reduce the length of the messages.",
"type": "invalid_request_error"
}
}
The error is deterministic: send too many tokens, get a 400 back. There is no retry that will fix it. You must reduce the size of the request before resending.
Fixes at a Glance
- Count tokens before sending — use Microsoft.ML.Tokenizers to validate the request size pre-flight
- Sliding window chat history — trim oldest messages from ChatHistory while preserving the system message
- Summarization strategy — compress old conversation turns with GPT-4o-mini when history grows large
- Token budget filter — use a Semantic Kernel IFunctionInvocationFilter to enforce budgets automatically
Root Cause: Context Windows
Every Azure OpenAI model has a fixed context window — the maximum number of tokens it can process in a single request. The window is shared across everything you send:
| Model | Context Window | Max Output |
|---|---|---|
| GPT-4o | 128,000 tokens | 16,384 tokens |
| GPT-4o-mini | 128,000 tokens | 16,384 tokens |
| GPT-4 Turbo | 128,000 tokens | 4,096 tokens |
Every token in the request counts toward this limit: the system prompt, every message in chat history, the current user message, and the schemas of any tool/function calls you have registered. The response tokens also come from this same pool — your max_tokens setting carves out space for the reply, reducing what is available for the input.
A conversation that starts well within limits can cross the threshold after 20-30 exchanges, especially if users send long messages or your RAG pipeline injects retrieved documents. Tool schemas are a particularly sneaky contributor: a Semantic Kernel plugin with five functions can consume 500-2,500 tokens of your budget before any conversation content is included.
Fix 1: Count Tokens Before Sending
The most direct fix is to reject oversized requests before they reach the API. Microsoft.ML.Tokenizers provides the same tokenizer that OpenAI models use, so your counts will be accurate.
using Microsoft.ML.Tokenizers;
using Azure.AI.OpenAI;
using OpenAI.Chat;
public class TokenAwareChatClient
{
private static readonly TiktokenTokenizer _tokenizer =
TiktokenTokenizer.CreateForModel("gpt-4o");
private const int ModelContextLimit = 128_000;
private const int MaxOutputTokens = 4_096;
private const int SafetyBuffer = 500; // Account for tool schemas and formatting
private readonly ChatClient _chatClient;
public TokenAwareChatClient(ChatClient chatClient)
{
_chatClient = chatClient;
}
public async Task<string> CompleteChatAsync(
IList<ChatMessage> messages,
CancellationToken ct = default)
{
int estimatedTokens = EstimateChatTokens(messages) + SafetyBuffer;
int availableForInput = ModelContextLimit - MaxOutputTokens;
if (estimatedTokens > availableForInput)
{
throw new InvalidOperationException(
$"Request would exceed context limit. Estimated: {estimatedTokens} tokens, " +
$"available: {availableForInput} tokens. Trim chat history before sending.");
}
var options = new ChatCompletionOptions { MaxOutputTokenCount = MaxOutputTokens };
ChatCompletion completion = await _chatClient.CompleteChatAsync(messages, options, ct);
return completion.Content[0].Text;
}
private static int EstimateChatTokens(IList<ChatMessage> messages)
{
int total = 2; // reply priming
foreach (var msg in messages)
{
total += 4; // per-message overhead
// Concatenate text parts — adjust if your messages include images or tool calls
total += _tokenizer.CountTokens(
string.Concat(msg.Content.Select(part => part.Text ?? string.Empty)));
}
return total;
}
}
The SafetyBuffer of 500 accounts for tool schemas and any formatting overhead the SDK adds that your estimate does not capture. Increase it if you are using many Semantic Kernel plugins. For a deeper treatment of token counting patterns and cost management, see Token Counting and Context Management in C# for Azure OpenAI.
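Wiring the client up looks something like the following. The endpoint, deployment name, and key handling here are placeholders; substitute your own configuration source:

```csharp
using Azure;
using Azure.AI.OpenAI;
using OpenAI.Chat;

// Hypothetical endpoint and key — use your own configuration in practice
var azureClient = new AzureOpenAIClient(
    new Uri("https://my-resource.openai.azure.com/"),
    new AzureKeyCredential(Environment.GetEnvironmentVariable("AZURE_OPENAI_KEY")!));

ChatClient chatClient = azureClient.GetChatClient("gpt-4o"); // deployment name

var client = new TokenAwareChatClient(chatClient);

try
{
    string reply = await client.CompleteChatAsync(
        new List<ChatMessage> { new UserChatMessage("Summarize today's standup notes.") });
    Console.WriteLine(reply);
}
catch (InvalidOperationException ex)
{
    // Oversized request rejected locally, before any API round trip
    Console.WriteLine(ex.Message);
}
```

The local rejection is cheap: no network call, no billed tokens, and the caller gets an actionable message instead of a 400 from the service.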
Fix 2: Sliding Window Chat History
Pre-flight validation tells you when you are over the limit, but it does not fix the problem. For chat applications, the fix is to drop the oldest messages from history while preserving the system message at index 0.
using Microsoft.ML.Tokenizers;
using Microsoft.SemanticKernel.ChatCompletion;
public static class ChatHistoryExtensions
{
private static readonly TiktokenTokenizer _tokenizer =
TiktokenTokenizer.CreateForModel("gpt-4o");
/// <summary>
/// Trims the chat history to keep total tokens under the specified budget.
/// Always preserves the system message at index 0.
/// </summary>
public static void TrimToTokenBudget(this ChatHistory history, int maxTokens)
{
while (history.Count > 1 && EstimateTokens(history) > maxTokens)
{
// Remove oldest non-system message (index 1)
history.RemoveAt(1);
}
}
private static int EstimateTokens(ChatHistory history)
{
return history.Sum(msg => 4 + _tokenizer.CountTokens(msg.Content ?? string.Empty)) + 2;
}
}
Call it before every request:
var chatHistory = new ChatHistory("You are a helpful assistant.");
chatHistory.AddUserMessage(userMessage);
// Trim before every call
chatHistory.TrimToTokenBudget(maxTokens: 100_000);
var response = await chatService.GetChatMessageContentAsync(chatHistory, kernel: kernel);
chatHistory.AddAssistantMessage(response.Content ?? string.Empty);
The maxTokens of 100,000 leaves 28,000 tokens for the user message, tool schemas, and the model’s output. Adjust this based on your typical prompt structure and output size requirements.
A key correctness note: always remove from index 1, never index 0. Index 0 is the system message that defines the assistant’s behavior. Removing it produces unpredictable results and is almost never what you want. RemoveAt(1) removes the oldest user or assistant turn, which is the correct behavior for a sliding window.
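One refinement worth considering: removing a single message can leave an assistant reply at the head of the window without the user message it answered. A sketch of turn-wise trimming (the pairing heuristic here is an assumption about your message ordering, not a Semantic Kernel guarantee):

```csharp
using Microsoft.SemanticKernel.ChatCompletion;

public static class ChatHistoryTurnExtensions
{
    /// <summary>
    /// Removes the oldest full turn rather than a single message, so an
    /// assistant reply is never orphaned from the user message it answered.
    /// </summary>
    public static void TrimOldestTurn(this ChatHistory history)
    {
        if (history.Count <= 1) return;   // nothing but the system message
        history.RemoveAt(1);              // oldest non-system message
        // If the new oldest message is an assistant reply, its user turn
        // was just removed; drop it too so the window starts on a user turn.
        if (history.Count > 1 && history[1].Role == AuthorRole.Assistant)
            history.RemoveAt(1);
    }
}
```

Call TrimOldestTurn in a loop with the same token-budget condition as TrimToTokenBudget if you want budget enforcement with turn-aligned removal.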
Fix 3: Summarization Strategy
A sliding window has a limitation: it silently discards conversation history. If a user referenced something they said 30 messages ago, the model will have no memory of it. For long-running sessions where context continuity matters, summarization is more appropriate.
public async Task<ChatHistory> SummarizeAndResetHistoryAsync(
ChatHistory history,
ChatClient summarizerClient, // GPT-4o-mini — cheap
CancellationToken ct = default)
{
// Build a summarization prompt from old messages
var summaryRequest = new List<ChatMessage>
{
new SystemChatMessage(
"Summarize the following conversation in 2-3 sentences, " +
"preserving key facts, decisions, and context for future reference."),
new UserChatMessage(
string.Join("\n", history.Skip(1).Select(m =>
$"{m.Role}: {m.Content}")))
};
ChatCompletion summaryResult = await summarizerClient.CompleteChatAsync(
summaryRequest, cancellationToken: ct);
var summary = summaryResult.Content[0].Text;
// Rebuild history: system message + summary + (optionally) last 2 turns
var newHistory = new ChatHistory(history[0].Content ?? string.Empty);
newHistory.AddAssistantMessage($"[Conversation summary: {summary}]");
// Optionally keep the last user message for continuity
var lastUserMsg = history.LastOrDefault(m => m.Role == AuthorRole.User);
if (lastUserMsg != null)
newHistory.AddUserMessage(lastUserMsg.Content ?? string.Empty);
return newHistory;
}
Trigger this when history exceeds a threshold — for example, 20 messages or when the token count crosses 80,000:
if (chatHistory.Count > 20 || EstimateTokens(chatHistory) > 80_000)
{
chatHistory = await SummarizeAndResetHistoryAsync(chatHistory, miniClient, ct);
}
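The EstimateTokens call in that guard assumes a helper in the calling class mirroring the private estimator from Fix 2 (same per-message arithmetic; this is an estimate, not an exact count):

```csharp
using Microsoft.ML.Tokenizers;
using Microsoft.SemanticKernel.ChatCompletion;

private static readonly TiktokenTokenizer _tokenizer =
    TiktokenTokenizer.CreateForModel("gpt-4o");

private static int EstimateTokens(ChatHistory history) =>
    history.Sum(m => 4 + _tokenizer.CountTokens(m.Content ?? string.Empty)) + 2;
```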
Using GPT-4o-mini for summarization keeps the cost low. A typical 20-message conversation compresses to a 2-3 sentence summary that costs a fraction of a cent. The resulting history drops from tens of thousands of tokens to a few hundred, restoring full headroom for the next segment of the conversation.
Fix 4: Token-Budgeting IFunctionInvocationFilter
For Semantic Kernel applications, an IFunctionInvocationFilter can enforce token budgets automatically across all plugin calls. The full filter implementation is covered in Token Counting and Context Management in C# for Azure OpenAI. The key addition for context overflow prevention is checking the ChatHistory size specifically when it is passed as a kernel argument:
// In your token budget filter, also check ChatHistory if present in arguments
if (context.Arguments.TryGetValue("chatHistory", out var histObj) &&
histObj is ChatHistory history)
{
int historyTokens = history.Sum(m =>
4 + _tokenizer.CountTokens(m.Content ?? string.Empty));
if (historyTokens > _maxHistoryTokens)
{
history.TrimToTokenBudget(_maxHistoryTokens);
}
}
This approach is particularly useful when you have multiple entry points into your chat logic — the filter enforces the budget regardless of which code path initiated the call.
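For orientation, that snippet lives inside OnFunctionInvocationAsync. A minimal shell of the pattern follows; the "chatHistory" argument name and the _maxHistoryTokens default are assumptions, and this is a sketch rather than the full filter from the linked article:

```csharp
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;

public sealed class TokenBudgetFilter : IFunctionInvocationFilter
{
    private readonly int _maxHistoryTokens;

    public TokenBudgetFilter(int maxHistoryTokens = 100_000)
        => _maxHistoryTokens = maxHistoryTokens;

    public async Task OnFunctionInvocationAsync(
        FunctionInvocationContext context,
        Func<FunctionInvocationContext, Task> next)
    {
        // Trim any ChatHistory passed as a kernel argument before invocation
        if (context.Arguments.TryGetValue("chatHistory", out var histObj) &&
            histObj is ChatHistory history)
        {
            history.TrimToTokenBudget(_maxHistoryTokens); // extension from Fix 2
        }
        await next(context); // continue the filter pipeline
    }
}

// Registration:
// kernel.FunctionInvocationFilters.Add(new TokenBudgetFilter());
```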
RAG Pipeline: Chunk Size Guidance
RAG pipelines introduce a second source of context overflow: the retrieved documents themselves. Each retrieved chunk is injected into the prompt, and if chunks are large or you retrieve many of them, the total easily exceeds the model window. For a complete walkthrough of building a production RAG pipeline, see Build a RAG Chatbot in .NET with Semantic Kernel and Cosmos DB.
The following settings provide a reliable baseline for most use cases:
| Setting | Value | Rationale |
|---|---|---|
| Chunk size | 512 tokens | Balances granularity and context |
| Overlap | 50 tokens | Prevents boundary information loss |
| Max retrieved chunks | 5 | ~2,560 tokens for document context |
| Reserved for prompt | 8,000 tokens | System + user message headroom |
Count chunk sizes with Microsoft.ML.Tokenizers at indexing time, not by character count. A 512-character chunk is not the same as a 512-token chunk — token density varies significantly between code, prose, and structured data.
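A token-based chunker can be built from the tokenizer's encode/decode round trip. The EncodeToIds/Decode usage below is a sketch; verify the method shapes against the Microsoft.ML.Tokenizers version you are on:

```csharp
using Microsoft.ML.Tokenizers;

public static IEnumerable<string> ChunkByTokens(
    string text, int chunkSize = 512, int overlap = 50)
{
    var tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");
    IReadOnlyList<int> ids = tokenizer.EncodeToIds(text);

    for (int start = 0; start < ids.Count; start += chunkSize - overlap)
    {
        int length = Math.Min(chunkSize, ids.Count - start);
        // Decode the token window back to text for storage in the index
        yield return tokenizer.Decode(ids.Skip(start).Take(length).ToArray());
        if (start + length >= ids.Count) break; // last chunk emitted
    }
}
```

Because the split happens in token space, every stored chunk is guaranteed to fit the 512-token budget regardless of whether the content is prose, code, or tables.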
At query time, enforce the maximum retrieved chunks in your search call. For Azure AI Search:
var searchOptions = new SearchOptions
{
Size = 5, // Limit to 5 chunks maximum
Select = { "content", "title", "source" }
};
If you are using Auto Function Calling with plugins alongside RAG, account for tool schema tokens. Five plugins with moderately complex schemas can consume 1,000-2,500 tokens before a single document chunk is included.
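You can get a rough measure of that schema cost by serializing each registered function's metadata and counting tokens. This is an approximation; the exact wire format the service sees may differ from this hand-rolled JSON:

```csharp
using System.Text.Json;
using Microsoft.ML.Tokenizers;
using Microsoft.SemanticKernel;

static int EstimatePluginSchemaTokens(Kernel kernel)
{
    var tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");
    int total = 0;
    foreach (var fn in kernel.Plugins.GetFunctionsMetadata())
    {
        // Serialize name, description, and parameters as a stand-in for the tool schema
        string schema = JsonSerializer.Serialize(new
        {
            fn.PluginName,
            fn.Name,
            fn.Description,
            Parameters = fn.Parameters.Select(p => new { p.Name, p.Description }).ToArray()
        });
        total += tokenizer.CountTokens(schema);
    }
    return total;
}
```

Logging this number at startup tells you how much of the context window is spoken for before any conversation or document content arrives.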
Monitoring Token Usage
After every successful call, log the actual token counts from the response:
// After every successful call
_logger.LogInformation(
"Token usage — Input: {Input}/{Limit} ({Pct:P0}), Output: {Output}",
completion.Usage.InputTokenCount,
128_000,
(double)completion.Usage.InputTokenCount / 128_000,
completion.Usage.OutputTokenCount);
Set an alert when InputTokenCount / 128_000 > 0.8 — at 80% of context, you have early warning before requests start failing. Tracking this metric over time also reveals trends: a gradual increase across sessions indicates chat history is accumulating and your trimming strategy needs adjustment.
Further Reading
- Token Counting and Context Management in C# for Azure OpenAI
- Build a RAG Chatbot in .NET with Semantic Kernel and Cosmos DB
- Azure OpenAI model documentation