Fix: 429 Rate Limit Exceeded with Azure OpenAI — Retry Strategies for .NET

From StackOverflow · .NET 9 · Azure.AI.OpenAI 2.1.0 · Microsoft.Extensions.AI 10.3.0
By Rajesh Mishra · Feb 28, 2026 · Verified: Feb 28, 2026 · 8 min read

The Error

Your .NET application calls Azure OpenAI and receives this response:

Azure.RequestFailedException: HTTP 429 (Too Many Requests)

Content:
{
  "error": {
    "code": "429",
    "message": "Requests to the ChatCompletions_Create Operation under Azure OpenAI API
                version 2024-10-21 have exceeded token rate limit of your current
                OpenAI S0 pricing tier. Please retry after 6 seconds."
  }
}

Headers:
  Retry-After: 6
  x-ratelimit-remaining-tokens: 0
  x-ratelimit-remaining-requests: 12

The service is telling you that you have exhausted your quota for this time window. It is not a bug in your code — it is a capacity constraint.

Root Cause: TPM and RPM Quotas

Azure OpenAI enforces two rate limits per deployment:

Quota                      | What It Limits                | Typical Default
TPM (Tokens Per Minute)    | Total input + output tokens   | 10,000 - 120,000 depending on model and tier
RPM (Requests Per Minute)  | Number of API calls           | Derived from TPM (roughly TPM / 1,000 * 6)

When either limit is hit, all subsequent requests receive a 429 until the one-minute window resets. The Retry-After header tells you exactly how long to wait.

This happens most often when you have concurrent users, batch processing jobs, or prompts with large context windows. A single request with a 4,000-token prompt and a 2,000-token response consumes 6,000 tokens from your TPM budget in one call.
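
To see how quickly a large prompt drains the budget, a quick back-of-the-envelope calculation (the 60,000 TPM figure is assumed for illustration):

```csharp
// Assumption for illustration: a deployment with a 60,000 TPM quota.
const int tpmQuota = 60_000;

// From the example above: 4,000 prompt tokens + 2,000 completion tokens.
const int tokensPerRequest = 4_000 + 2_000;

// Maximum sustained requests per minute before the TPM limit triggers 429s.
int maxRequestsPerMinute = tpmQuota / tokensPerRequest;

Console.WriteLine(maxRequestsPerMinute); // 10
```

Ten requests per minute is easy to exceed with even a handful of concurrent users, which is why the mitigations below matter.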

For a complete breakdown of quota limits by model and region, see the Azure OpenAI quotas and limits documentation.

Fix 1: Configure the SDK’s Built-in Retry

The Azure.AI.OpenAI SDK already retries on 429 responses with exponential backoff. By default, it retries 3 times. You can tune this:

using System.ClientModel.Primitives; // ClientRetryPolicy lives here
using Azure;
using Azure.AI.OpenAI;
using OpenAI.Chat;

var options = new AzureOpenAIClientOptions();
options.RetryPolicy = new ClientRetryPolicy(maxRetries: 5);

// "config" is your IConfiguration instance
var client = new AzureOpenAIClient(
    new Uri(config["AzureOpenAI:Endpoint"]!),
    new AzureKeyCredential(config["AzureOpenAI:ApiKey"]!),
    options);

ChatClient chatClient = client.GetChatClient("my-gpt4o-deployment");

// The SDK will automatically retry 429s up to 5 times with exponential backoff
ChatCompletion completion = await chatClient.CompleteChatAsync("Explain rate limiting.");

This is the simplest approach. The SDK reads the Retry-After header and waits the appropriate duration before retrying.
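
If all retries are exhausted, the final 429 surfaces as an exception. A hedged sketch of reading Retry-After yourself at that point: it assumes the System.ClientModel-based 2.x SDK, where the failure is a ClientResultException (the raw error shown above reports Azure.RequestFailedException, so adjust the catch clause to match what your SDK version actually throws). The helper name is illustrative:

```csharp
using System.ClientModel;
using OpenAI.Chat;

async Task<ChatCompletion> CompleteWithRetryAfterFallbackAsync(
    ChatClient chatClient, string prompt)
{
    try
    {
        return await chatClient.CompleteChatAsync(prompt);
    }
    catch (ClientResultException ex) when (ex.Status == 429)
    {
        var response = ex.GetRawResponse();
        if (response is not null &&
            response.Headers.TryGetValue("Retry-After", out string? retryAfter) &&
            int.TryParse(retryAfter, out int seconds))
        {
            // Wait exactly as long as the service asked, then make one last attempt.
            await Task.Delay(TimeSpan.FromSeconds(seconds));
            return await chatClient.CompleteChatAsync(prompt);
        }
        throw;
    }
}
```

This is a last-resort fallback on top of the SDK's retries, not a replacement for them.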

Fix 2: Custom Polly Resilience Pipeline

When you need more control — circuit breaking, custom backoff curves, or shared retry budgets across services — use the Microsoft.Extensions.Http.Resilience package (which is built on Polly v8):

using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Http.Resilience;
using Polly;

// In your DI setup
builder.Services.AddHttpClient("AzureOpenAI")
    .AddResilienceHandler("openai-retry", pipeline =>
    {
        pipeline.AddRetry(new HttpRetryStrategyOptions
        {
            MaxRetryAttempts = 5,
            Delay = TimeSpan.FromSeconds(2),
            BackoffType = DelayBackoffType.Exponential,
            UseJitter = true,
            ShouldHandle = args => ValueTask.FromResult(
                args.Outcome.Result?.StatusCode == System.Net.HttpStatusCode.TooManyRequests)
        });

        pipeline.AddCircuitBreaker(new HttpCircuitBreakerStrategyOptions
        {
            SamplingDuration = TimeSpan.FromSeconds(30),
            FailureRatio = 0.7,
            MinimumThroughput = 10,
            BreakDuration = TimeSpan.FromSeconds(15)
        });
    });

The circuit breaker is important for sustained rate limiting. If 70% of your requests are getting 429s, it stops all requests for 15 seconds instead of continuing to hammer the endpoint — which only extends the throttling window.
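
One sketch of connecting this named HttpClient to the SDK, assuming AzureOpenAIClientOptions exposes the System.ClientModel Transport property (the registration shape here is an illustration, not the only way to wire it):

```csharp
using System.ClientModel.Primitives;
using Azure;
using Azure.AI.OpenAI;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;

builder.Services.AddSingleton(sp =>
{
    // Resolve the "AzureOpenAI" client so SDK traffic flows through
    // the "openai-retry" resilience pipeline registered above.
    var httpClient = sp.GetRequiredService<IHttpClientFactory>()
                       .CreateClient("AzureOpenAI");

    var config = sp.GetRequiredService<IConfiguration>();
    var options = new AzureOpenAIClientOptions
    {
        Transport = new HttpClientPipelineTransport(httpClient),
        // Avoid stacking two retry layers: let the Polly pipeline own retries.
        RetryPolicy = new ClientRetryPolicy(maxRetries: 0)
    };

    return new AzureOpenAIClient(
        new Uri(config["AzureOpenAI:Endpoint"]!),
        new AzureKeyCredential(config["AzureOpenAI:ApiKey"]!),
        options);
});
```

Disabling the SDK's own retry here is deliberate; otherwise the two layers multiply (5 pipeline attempts × the SDK's retries per attempt).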

Fix 3: Request Queue with Bounded Concurrency

For batch processing or high-throughput scenarios, throttle at the application level instead of relying on server-side rejection:

using System.Threading;
using OpenAI.Chat;

public class AzureOpenAIThrottledClient
{
    private readonly ChatClient _chatClient;
    private readonly SemaphoreSlim _semaphore;

    public AzureOpenAIThrottledClient(ChatClient chatClient, int maxConcurrency = 3)
    {
        _chatClient = chatClient;
        _semaphore = new SemaphoreSlim(maxConcurrency);
    }

    public async Task<ChatCompletion> CompleteChatAsync(
        string prompt,
        CancellationToken ct = default)
    {
        await _semaphore.WaitAsync(ct);
        try
        {
            return await _chatClient.CompleteChatAsync(prompt);
        }
        finally
        {
            // Small delay before releasing to smooth out token consumption.
            // Deliberately not observing `ct` here: if the delay were cancelled,
            // Release() would be skipped and the semaphore slot leaked.
            await Task.Delay(TimeSpan.FromMilliseconds(200));
            _semaphore.Release();
        }
    }
}

This approach limits concurrency to 3 simultaneous requests with a 200ms gap between releases. Adjust both values based on your deployment’s TPM allocation.
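
A usage sketch for the wrapper above; `chatClient` is the ChatClient from Fix 1, and the prompts are placeholders:

```csharp
using System.Linq;
using OpenAI.Chat;

// Fan out a batch of prompts; the semaphore inside the wrapper ensures
// at most 3 calls are in flight at any moment, regardless of batch size.
var throttled = new AzureOpenAIThrottledClient(chatClient, maxConcurrency: 3);

string[] prompts =
    ["Summarize doc A", "Summarize doc B", "Summarize doc C", "Summarize doc D"];

ChatCompletion[] results = await Task.WhenAll(
    prompts.Select(p => throttled.CompleteChatAsync(p)));
```

Task.WhenAll starts every task eagerly, but the semaphore serializes admission, so the batch drains at a pace your TPM budget can absorb.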

Fix 4: Response Caching for Repeated Prompts

If your application sends the same or similar prompts repeatedly (think: classification tasks, template-based generation), caching avoids redundant quota consumption:

using Microsoft.Extensions.Caching.Memory;
using OpenAI.Chat;
using System.Security.Cryptography;
using System.Text;

public class CachedChatClient
{
    private readonly ChatClient _inner;
    private readonly IMemoryCache _cache;

    public CachedChatClient(ChatClient inner, IMemoryCache cache)
    {
        _inner = inner;
        _cache = cache;
    }

    public async Task<string> CompleteChatAsync(string prompt)
    {
        // string.GetHashCode is randomized per process and collision-prone;
        // a stable content hash makes a safer cache key.
        var cacheKey = "chat:" + Convert.ToHexString(
            SHA256.HashData(Encoding.UTF8.GetBytes(prompt)));

        if (_cache.TryGetValue(cacheKey, out string? cached))
            return cached!;

        // Declare the result as ChatCompletion explicitly so the SDK's implicit
        // ClientResult<ChatCompletion> conversion applies.
        ChatCompletion completion = await _inner.CompleteChatAsync(prompt);
        var result = completion.Content[0].Text;

        _cache.Set(cacheKey, result, TimeSpan.FromMinutes(30));
        return result;
    }
}

For more sophisticated caching that accounts for semantic similarity, consider a vector similarity check before calling the API.

Monitoring Token Usage

Track token consumption per request to understand your quota utilization:

ChatCompletion completion = await chatClient.CompleteChatAsync(messages);

Console.WriteLine($"Input tokens:  {completion.Usage.InputTokenCount}");
Console.WriteLine($"Output tokens: {completion.Usage.OutputTokenCount}");
Console.WriteLine($"Total tokens:  {completion.Usage.TotalTokenCount}");

Log these values to your telemetry system. Over time, you will see consumption patterns that tell you whether you need a quota increase or better request management.

Requesting a Quota Increase

If your workload legitimately needs more throughput:

  1. Open the Azure portal and navigate to your Azure OpenAI resource.
  2. Select Quotas from the left menu.
  3. Find the model deployment and click Request Quota Increase.
  4. Set the desired TPM value and submit.

Standard quota increases are usually approved within minutes. Large increases may require manual review.

Prevention Patterns

  1. Estimate tokens before sending. Use a tokenizer like Microsoft.ML.Tokenizers to count tokens client-side and avoid sending prompts that will blow your remaining budget.
  2. Distribute across deployments. Create multiple deployments of the same model in different regions. Route requests round-robin or based on remaining quota.
  3. Set max_tokens on every request. Cap the response length to prevent runaway token consumption.
  4. Implement backpressure in your API. If your backend is getting 429s, return 503 to your frontend with a Retry-After header rather than queueing unbounded requests.
  5. Separate batch and interactive workloads. Use different deployments for user-facing chat (low latency, lower throughput) and batch processing (higher throughput, latency-tolerant).
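
Pattern 1 above can be sketched as follows, assuming the Microsoft.ML.Tokenizers package; the 1,000-token headroom and the helper name are illustrative choices, not SDK API:

```csharp
using Microsoft.ML.Tokenizers;

// Count tokens client-side before sending, so oversized prompts can be
// rejected or trimmed instead of burning quota on a doomed request.
TiktokenTokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");

bool FitsInBudget(string prompt, int maxOutputTokens, int remainingTpmBudget)
{
    int promptTokens = tokenizer.CountTokens(prompt);

    // The response also counts against TPM, so reserve maxOutputTokens
    // plus some headroom for tokenizer drift and message framing.
    return promptTokens + maxOutputTokens + 1_000 <= remainingTpmBudget;
}
```

Client-side counts are estimates (the service adds framing tokens per message), which is why the headroom matters.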

⚠ Production Considerations

  • The Retry-After header value can be several seconds to minutes. Naive retry loops without backoff will hammer the endpoint and extend the throttling window.
  • Token counting is approximate at request time. The actual token count (including the response) may exceed your estimate, causing unexpected quota exhaustion mid-conversation.

🧠 Architect’s Note

For production AI workloads, treat Azure OpenAI quota as a shared resource. Implement request queuing with a bounded concurrency semaphore. Distribute across multiple deployments in different regions if your throughput requires it.

Summary

The 429 Rate Limit Exceeded error from Azure OpenAI occurs when TPM or RPM quotas are exceeded. In .NET, handle this with the SDK's built-in retry logic (configurable via AzureOpenAIClientOptions), custom Polly resilience pipelines for advanced backoff, or by implementing request queuing and response caching. Monitor token usage per request, request quota increases when needed, and distribute load across multiple deployments for high-throughput scenarios.

Key Takeaways

  • Azure OpenAI enforces both TPM (tokens per minute) and RPM (requests per minute) quotas per deployment
  • The Azure SDK has built-in retry with exponential backoff — configure it by assigning a ClientRetryPolicy to AzureOpenAIClientOptions.RetryPolicy
  • Polly resilience pipelines give finer control over retry timing, circuit breaking, and timeout handling
  • Cache repeated prompts and batch requests where possible to reduce quota consumption
  • Monitor the Retry-After header in 429 responses for the exact wait time

Implementation Checklist

  • Check current TPM and RPM quota in Azure portal Quotas blade
  • Configure SDK retry options appropriate for your workload
  • Implement response caching for repeated or similar prompts
  • Add circuit breaker logic for sustained rate limiting
  • Request quota increase if current limits are insufficient
  • Consider multiple deployments or regions for load distribution

Frequently Asked Questions

What causes 429 errors with Azure OpenAI?

A 429 Too Many Requests error occurs when your application exceeds the Tokens Per Minute (TPM) or Requests Per Minute (RPM) quota assigned to your Azure OpenAI deployment. Azure throttles additional requests until the quota window resets.

How do I configure retry logic for Azure OpenAI in .NET?

The Azure.AI.OpenAI SDK includes built-in retry logic with exponential backoff. You can customize the retry count by assigning a ClientRetryPolicy to AzureOpenAIClientOptions.RetryPolicy. For more control, use Polly resilience pipelines via Microsoft.Extensions.Http.Resilience.

What is the difference between TPM and RPM quotas?

TPM (Tokens Per Minute) limits the total number of input and output tokens processed per minute across all requests. RPM (Requests Per Minute) limits the number of API calls regardless of token count. Either limit can trigger a 429 — whichever is hit first.

How do I increase my Azure OpenAI quota?

In the Azure portal, navigate to your Azure OpenAI resource, open the Quotas blade, and request an increase for the specific model deployment. Quota increases are subject to regional availability and may require approval.


#Azure OpenAI #429 Rate Limit #Retry Logic #Polly #.NET AI