Fix context_length_exceeded in Azure OpenAI C# Apps

From StackOverflow · .NET 9 · Azure.AI.OpenAI 2.1.0 · Microsoft.ML.Tokenizers 0.22.0 · Microsoft.SemanticKernel 1.54.0
By Rajesh Mishra · Mar 21, 2026 · 10 min read
Verified Mar 2026 · .NET 9 · Azure.AI.OpenAI 2.1.0
In 30 Seconds

Context length exceeded errors occur when total input tokens (system prompt + history + user message + tools) exceed the model's context window. Fix by counting tokens with Microsoft.ML.Tokenizers before sending, implementing a sliding window to trim old chat history, summarizing old turns with GPT-4o-mini, or using a token budget IFunctionInvocationFilter. For RAG, keep chunks at 512 tokens with max 5 retrieved chunks.

⚠️ Error Fix Guide: root cause analysis and verified fix. Code examples use Azure.AI.OpenAI 2.1.0.

Your Azure OpenAI chat app has been working fine in development, but in production — after users have long conversations or your RAG pipeline retrieves several documents — requests start failing with a 400 error. The error message is unambiguous: you sent more tokens than the model can handle.

The Error

Azure.RequestFailedException: 400 (Bad Request)

{
  "error": {
    "code": "context_length_exceeded",
    "message": "This model's maximum context length is 128000 tokens. However, your messages resulted in 132450 tokens. Please reduce the length of the messages.",
    "type": "invalid_request_error"
  }
}

The error is deterministic: send too many tokens, get a 400 back. There is no retry that will fix it. You must reduce the size of the request before resending.

Fixes at a Glance

  1. Count tokens before sending — use Microsoft.ML.Tokenizers to validate the request size pre-flight
  2. Sliding window chat history — trim oldest messages from ChatHistory while preserving the system message
  3. Summarization strategy — compress old conversation turns with GPT-4o-mini when history grows large
  4. Token budget filter — use a Semantic Kernel IFunctionInvocationFilter to enforce budgets automatically

Root Cause: Context Windows

Every Azure OpenAI model has a fixed context window — the maximum number of tokens it can process in a single request. The window is shared across everything you send:

Model         Context Window    Max Output
GPT-4o        128,000 tokens    16,384 tokens
GPT-4o-mini   128,000 tokens    16,384 tokens
GPT-4 Turbo   128,000 tokens    4,096 tokens

Every token in the request counts toward this limit: the system prompt, every message in chat history, the current user message, and the schemas of any tool/function calls you have registered. The response tokens also come from this same pool — your max_tokens setting carves out space for the reply, reducing what is available for the input.

A conversation that starts well within limits can cross the threshold after 20-30 exchanges, especially if users send long messages or your RAG pipeline injects retrieved documents. Tool schemas are a particularly sneaky contributor: a Semantic Kernel plugin with five functions can consume 500-2,500 tokens of your budget before any conversation content is included.
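One way to make that hidden cost visible is to serialize each registered function's metadata and count the tokens. The JSON shape below is an assumption for illustration; the SDK's actual wire format differs slightly, so treat the result as a rough lower bound rather than an exact figure:

```csharp
using System.Linq;
using System.Text.Json;
using Microsoft.ML.Tokenizers;
using Microsoft.SemanticKernel;

public static class ToolSchemaEstimator
{
    private static readonly TiktokenTokenizer _tokenizer =
        TiktokenTokenizer.CreateForModel("gpt-4o");

    // Approximates how many tokens the registered plugin schemas will add
    // to every request. The serialized shape is a simplification of what
    // the SDK actually sends.
    public static int EstimateToolSchemaTokens(Kernel kernel)
    {
        int total = 0;
        foreach (var plugin in kernel.Plugins)
        {
            foreach (KernelFunctionMetadata fn in plugin.GetFunctionsMetadata())
            {
                string json = JsonSerializer.Serialize(new
                {
                    name = $"{plugin.Name}_{fn.Name}",
                    description = fn.Description,
                    parameters = fn.Parameters.Select(p => new
                    {
                        p.Name,
                        p.Description,
                        type = p.ParameterType?.Name
                    })
                });
                total += _tokenizer.CountTokens(json);
            }
        }
        return total;
    }
}
```

Running this once at startup and logging the result tells you how much of the window is gone before the first message is sent.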

Fix 1: Count Tokens Before Sending

The most direct fix is to reject oversized requests before they reach the API. Microsoft.ML.Tokenizers provides the same tokenizer that OpenAI models use, so your counts will be accurate.

using Microsoft.ML.Tokenizers;
using Azure.AI.OpenAI;
using OpenAI.Chat;

public class TokenAwareChatClient
{
    private static readonly TiktokenTokenizer _tokenizer =
        TiktokenTokenizer.CreateForModel("gpt-4o");

    private const int ModelContextLimit = 128_000;
    private const int MaxOutputTokens = 4_096;
    private const int SafetyBuffer = 500; // Account for tool schemas and formatting

    private readonly ChatClient _chatClient;

    public TokenAwareChatClient(ChatClient chatClient)
    {
        _chatClient = chatClient;
    }

    public async Task<string> CompleteChatAsync(
        IList<ChatMessage> messages,
        CancellationToken ct = default)
    {
        int estimatedTokens = EstimateChatTokens(messages) + SafetyBuffer;
        int availableForInput = ModelContextLimit - MaxOutputTokens;

        if (estimatedTokens > availableForInput)
        {
            throw new InvalidOperationException(
                $"Request would exceed context limit. Estimated: {estimatedTokens} tokens, " +
                $"available: {availableForInput} tokens. Trim chat history before sending.");
        }

        var options = new ChatCompletionOptions { MaxOutputTokenCount = MaxOutputTokens };
        // Declare the explicit ChatCompletion type: CompleteChatAsync returns
        // ClientResult<ChatCompletion>, which converts implicitly. With `var`
        // you would get the wrapper and `.Content` would not compile.
        ChatCompletion completion = await _chatClient.CompleteChatAsync(messages, options, ct);
        return completion.Content[0].Text;
    }

    private static int EstimateChatTokens(IList<ChatMessage> messages)
    {
        int total = 2; // reply priming
        foreach (var msg in messages)
        {
            total += 4; // per-message overhead
            // Sum tokens across the text parts of each message; msg.ToString()
            // would count the type name, not the content. Non-text parts
            // (e.g. images) are ignored here.
            total += msg.Content.Sum(part => _tokenizer.CountTokens(part.Text ?? string.Empty));
        }
        return total;
    }
}

The SafetyBuffer of 500 accounts for tool schemas and any formatting overhead the SDK adds that your estimate does not capture. Increase it if you are using many Semantic Kernel plugins. For a deeper treatment of token counting patterns and cost management, see Token Counting and Context Management in C# for Azure OpenAI.

Fix 2: Sliding Window Chat History

Pre-flight validation tells you when you are over the limit, but it does not fix the problem. For chat applications, the fix is to drop the oldest messages from history while preserving the system message at index 0.

using Microsoft.ML.Tokenizers;
using Microsoft.SemanticKernel.ChatCompletion;

public static class ChatHistoryExtensions
{
    private static readonly TiktokenTokenizer _tokenizer =
        TiktokenTokenizer.CreateForModel("gpt-4o");

    /// <summary>
    /// Trims the chat history to keep total tokens under the specified budget.
    /// Always preserves the system message at index 0.
    /// </summary>
    public static void TrimToTokenBudget(this ChatHistory history, int maxTokens)
    {
        while (history.Count > 1 && EstimateTokens(history) > maxTokens)
        {
            // Remove oldest non-system message (index 1)
            history.RemoveAt(1);
        }
    }

    private static int EstimateTokens(ChatHistory history)
    {
        return history.Sum(msg => 4 + _tokenizer.CountTokens(msg.Content ?? string.Empty)) + 2;
    }
}

Call it before every request:

var chatHistory = new ChatHistory("You are a helpful assistant.");
chatHistory.AddUserMessage(userMessage);

// Trim before every call
chatHistory.TrimToTokenBudget(maxTokens: 100_000);

var response = await chatService.GetChatMessageContentAsync(chatHistory, kernel: kernel);
chatHistory.AddAssistantMessage(response.Content ?? string.Empty);

The maxTokens of 100,000 leaves 28,000 of the 128,000-token window for tool schemas and the model's output; the current user message is already part of the history being trimmed. Adjust this based on your typical prompt structure and output size requirements.

A key correctness note: always remove from index 1, never index 0. Index 0 is the system message that defines the assistant’s behavior. Removing it produces unpredictable results and is almost never what you want. RemoveAt(1) removes the oldest user or assistant turn, which is the correct behavior for a sliding window.

Fix 3: Summarization Strategy

A sliding window has a limitation: it silently discards conversation history. If a user referenced something they said 30 messages ago, the model will have no memory of it. For long-running sessions where context continuity matters, summarization is more appropriate.

public async Task<ChatHistory> SummarizeAndResetHistoryAsync(
    ChatHistory history,
    ChatClient summarizerClient, // GPT-4o-mini — cheap
    CancellationToken ct = default)
{
    // Build a summarization prompt from old messages
    var summaryRequest = new List<ChatMessage>
    {
        new SystemChatMessage(
            "Summarize the following conversation in 2-3 sentences, " +
            "preserving key facts, decisions, and context for future reference."),
        new UserChatMessage(
            string.Join("\n", history.Skip(1).Select(m =>
                $"{m.Role}: {m.Content}")))
    };

    // Use a named argument: the second positional parameter of
    // CompleteChatAsync is ChatCompletionOptions, not CancellationToken.
    ChatCompletion summaryResult = await summarizerClient.CompleteChatAsync(
        summaryRequest, cancellationToken: ct);
    var summary = summaryResult.Content[0].Text;

    // Rebuild history: system message + summary + (optionally) last 2 turns
    var newHistory = new ChatHistory(history[0].Content ?? string.Empty);
    newHistory.AddAssistantMessage($"[Conversation summary: {summary}]");

    // Optionally keep the last user message for continuity
    var lastUserMsg = history.LastOrDefault(m => m.Role == AuthorRole.User);
    if (lastUserMsg != null)
        newHistory.AddUserMessage(lastUserMsg.Content ?? string.Empty);

    return newHistory;
}

Trigger this when history exceeds a threshold — for example, 20 messages or when the token count crosses 80,000:

if (chatHistory.Count > 20) // or when a token estimate (as in Fix 2) crosses 80,000
{
    chatHistory = await SummarizeAndResetHistoryAsync(chatHistory, miniClient, ct);
}

Using GPT-4o-mini for summarization keeps the cost low. A typical 20-message conversation compresses to a 2-3 sentence summary that costs a fraction of a cent. The resulting history drops from tens of thousands of tokens to a few hundred, restoring full headroom for the next segment of the conversation.

Fix 4: Token-Budgeting IFunctionInvocationFilter

For Semantic Kernel applications, an IFunctionInvocationFilter can enforce token budgets automatically across all plugin calls. The full filter implementation is covered in Token Counting and Context Management in C# for Azure OpenAI. The key addition for context-overflow prevention is checking the ChatHistory size when it is passed as a kernel argument:

// In your token budget filter, also check ChatHistory if present in arguments
if (context.Arguments.TryGetValue("chatHistory", out var histObj) &&
    histObj is ChatHistory history)
{
    int historyTokens = history.Sum(m =>
        4 + _tokenizer.CountTokens(m.Content ?? string.Empty));

    if (historyTokens > _maxHistoryTokens)
    {
        history.TrimToTokenBudget(_maxHistoryTokens);
    }
}

This approach is particularly useful when you have multiple entry points into your chat logic — the filter enforces the budget regardless of which code path initiated the call.
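A minimal end-to-end sketch of such a filter, assuming the TrimToTokenBudget extension from Fix 2; the 100,000-token default and the "chatHistory" argument name are assumptions to adapt to your plugin signatures:

```csharp
using Microsoft.ML.Tokenizers;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;

public sealed class TokenBudgetFilter : IFunctionInvocationFilter
{
    private static readonly TiktokenTokenizer _tokenizer =
        TiktokenTokenizer.CreateForModel("gpt-4o");

    private readonly int _maxHistoryTokens;

    public TokenBudgetFilter(int maxHistoryTokens = 100_000)
        => _maxHistoryTokens = maxHistoryTokens;

    public async Task OnFunctionInvocationAsync(
        FunctionInvocationContext context,
        Func<FunctionInvocationContext, Task> next)
    {
        // Trim any ChatHistory argument before the function runs
        if (context.Arguments.TryGetValue("chatHistory", out var histObj) &&
            histObj is ChatHistory history)
        {
            int historyTokens = history.Sum(m =>
                4 + _tokenizer.CountTokens(m.Content ?? string.Empty));

            if (historyTokens > _maxHistoryTokens)
                history.TrimToTokenBudget(_maxHistoryTokens); // extension from Fix 2
        }

        await next(context); // proceed with the (possibly trimmed) arguments
    }
}

// Registration (sketch):
// builder.Services.AddSingleton<IFunctionInvocationFilter, TokenBudgetFilter>();
```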

RAG Pipeline: Chunk Size Guidance

RAG pipelines introduce a second source of context overflow: the retrieved documents themselves. Each retrieved chunk is injected into the prompt, and if chunks are large or you retrieve many of them, the total easily exceeds the model window. For a complete walkthrough of building a production RAG pipeline, see Build a RAG Chatbot in .NET with Semantic Kernel and Cosmos DB.

The following settings provide a reliable baseline for most use cases:

Setting                Value          Rationale
Chunk size             512 tokens     Balances granularity and context
Overlap                50 tokens      Prevents boundary information loss
Max retrieved chunks   5              ~2,560 tokens for document context
Reserved for prompt    8,000 tokens   System + user message headroom

Count chunk sizes with Microsoft.ML.Tokenizers at indexing time, not by character count. A 512-character chunk is not the same as a 512-token chunk — token density varies significantly between code, prose, and structured data.
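Token-based splitting can be sketched with the same tokenizer used for counting. The helper below is an illustration, not a production chunker: decoding on arbitrary token boundaries can occasionally land mid-character, so a real pipeline usually snaps chunk edges to sentence or paragraph boundaries.

```csharp
using System.Linq;
using Microsoft.ML.Tokenizers;

public static class TokenChunker
{
    private static readonly TiktokenTokenizer _tokenizer =
        TiktokenTokenizer.CreateForModel("gpt-4o");

    // Splits text into chunks of at most maxTokens tokens, repeating the last
    // overlapTokens of each chunk at the start of the next one.
    public static IEnumerable<string> ChunkByTokens(
        string text, int maxTokens = 512, int overlapTokens = 50)
    {
        IReadOnlyList<int> ids = _tokenizer.EncodeToIds(text);
        int step = maxTokens - overlapTokens;

        for (int start = 0; start < ids.Count; start += step)
        {
            int length = Math.Min(maxTokens, ids.Count - start);
            yield return _tokenizer.Decode(ids.Skip(start).Take(length));

            if (start + length >= ids.Count)
                yield break; // last chunk emitted; avoid a trailing overlap-only chunk
        }
    }
}
```

Each emitted chunk is guaranteed to fit the 512-token budget by construction, so no per-chunk re-counting is needed at query time.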

At query time, enforce the maximum retrieved chunks in your search call. For Azure AI Search:

var searchOptions = new SearchOptions
{
    Size = 5, // Limit to 5 chunks maximum
    Select = { "content", "title", "source" }
};

If you are using Auto Function Calling with plugins alongside RAG, account for tool schema tokens. Five plugins with moderately complex schemas can consume 1,000-2,500 tokens before a single document chunk is included.

Monitoring Token Usage

After every successful call, log the actual token counts from the response:

// After every successful call
_logger.LogInformation(
    "Token usage — Input: {Input}/{Limit} ({Pct:P0}), Output: {Output}",
    completion.Usage.InputTokenCount,
    128_000,
    (double)completion.Usage.InputTokenCount / 128_000,
    completion.Usage.OutputTokenCount);

Set an alert when InputTokenCount / 128_000 > 0.8 — at 80% of context, you have early warning before requests start failing. Tracking this metric over time also reveals trends: a gradual increase across sessions indicates chat history is accumulating and your trimming strategy needs adjustment.


⚠ Production Considerations

  • Sliding window by message count is imprecise — a single long message can consume as many tokens as 20 short ones. Use token-count-based trimming with Microsoft.ML.Tokenizers for reliable context management in production.
  • The summarization strategy requires a second API call. In high-traffic scenarios, triggering summarization synchronously adds latency. Consider summarizing in the background after the response is sent, and using the summary on the next turn.
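Deferring that second call off the hot path can be sketched as follows, assuming the SummarizeAndResetHistoryAsync method from Fix 3 and a hypothetical per-user session object:

```csharp
using Microsoft.SemanticKernel.ChatCompletion;

// Hypothetical session state: the live history plus an in-flight summary task.
public sealed class ChatSession
{
    public ChatHistory History { get; set; } = new("You are a helpful assistant.");
    public Task<ChatHistory>? PendingSummary { get; set; }
}

public static class BackgroundSummarization
{
    // After the response has been sent, start summarizing without awaiting:
    //   session.PendingSummary = SummarizeAndResetHistoryAsync(session.History, miniClient);

    // At the start of the next turn, swap in the summary only if it finished.
    public static async Task UseSummaryIfReadyAsync(ChatSession session)
    {
        if (session.PendingSummary is { IsCompletedSuccessfully: true })
        {
            session.History = await session.PendingSummary;
            session.PendingSummary = null;
        }
        // If the summary is still running or faulted, keep the untrimmed
        // history for this turn; the sliding window remains the safety net.
    }
}
```

The user never waits on the summarization call; at worst one extra turn runs against the longer history.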


🧠 Architect’s Note

Design your chat session model around token budgets from the start, not as a retrofit. Allocate fixed budgets: 2,000 tokens for system prompt, 60,000 tokens for history, 8,000 tokens for the user message, 4,000 tokens for tool schemas, leaving 54,000 tokens for the response. This prevents surprises and makes context management deterministic.
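One way to encode that allocation is a small record, so the arithmetic lives in one place (figures from the note above; adjust per deployment):

```csharp
// Fixed per-request token allocations; the response gets whatever remains.
public sealed record TokenBudget(
    int SystemPrompt = 2_000,
    int History = 60_000,
    int UserMessage = 8_000,
    int ToolSchemas = 4_000,
    int ContextWindow = 128_000)
{
    public int Response =>
        ContextWindow - (SystemPrompt + History + UserMessage + ToolSchemas);
}

// new TokenBudget().Response → 54,000
```

Passing the relevant budget fields into the trimming and chunking code keeps every component agreeing on the same numbers.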

Key Takeaways

  • HTTP 400 context_length_exceeded means total input tokens exceeded the model window
  • ChatHistory.RemoveRange(1, count) trims old messages — always keep index 0 (system)
  • Summarization with GPT-4o-mini preserves context while reducing token count by 80-90%
  • RAG chunk size: 512 tokens, max 5 chunks retrieved, 50-token overlap
  • Tool schemas consume tokens invisibly — account for 200-500 tokens per plugin

Implementation Checklist

  • Count tokens before sending with Microsoft.ML.Tokenizers
  • Implement sliding window ChatHistory.RemoveRange(1, removeCount)
  • Add summarization fallback when message count exceeds threshold
  • Log InputTokenCount after each call to track trend toward limit
  • Adjust RAG chunk size to 512 tokens and limit to 5 retrieved chunks
  • Account for tool schema tokens when using Auto Function Calling

Frequently Asked Questions

What HTTP status does context_length_exceeded return?

Azure OpenAI returns HTTP 400 Bad Request with error code 'context_length_exceeded'. The error message includes the number of tokens in your request and the model's maximum context window.

What are the context window limits for GPT-4o and GPT-4o-mini?

Both GPT-4o and GPT-4o-mini support 128,000 token context windows. The combined input (system prompt + chat history + user message + tool schemas) must stay under this limit. Output tokens also count against the limit — the maximum output for most deployments is 4,096-16,384 tokens.

How do I implement a sliding window chat history in Semantic Kernel?

Call chatHistory.RemoveRange(1, removeCount) to delete the oldest non-system messages. Always keep index 0 (the system message). Calculate removeCount as chatHistory.Count - 1 - maxMessages, where maxMessages is your target history length. Combine with Microsoft.ML.Tokenizers to trim by token count instead of message count for precision.

What is the summarization strategy for long chat histories?

When chat history grows large, call a cheap model (GPT-4o-mini) with all old messages and ask it to produce a concise summary. Then clear the chat history, re-add the system message, and add the summary as an assistant message. This preserves context semantics while dramatically reducing token count.

How should I chunk documents for RAG to avoid context overflow?

Use 512-token chunks with 50-token overlap. Count chunk sizes with Microsoft.ML.Tokenizers before indexing. At query time, limit retrieved chunks to 5 (2,560 tokens) to leave room for the system prompt, user message, and expected output. Adjust based on your model's context window and typical prompt size.

Why does my RAG pipeline fail with context_length_exceeded even with small documents?

Tool call schemas are invisible token consumers. When using Auto Function Calling with many plugins, each plugin's function schema adds 100-500 tokens. For a RAG pipeline with 5 plugins, you may be consuming 1,000-2,000 tokens in tool schemas alone before any document content is included.

Can I monitor token usage per request to catch overflow before it happens?

Yes. After each successful call, log completion.Usage.InputTokenCount and OutputTokenCount. Set an alert when InputTokenCount exceeds 80% of the model context window. This gives you early warning before requests start failing with context_length_exceeded.


#Azure OpenAI #Context Length #Token Limit #Error Fix #.NET AI