The Error
Your .NET application calls Azure OpenAI and receives this response:
Azure.RequestFailedException: HTTP 429 (Too Many Requests)
Content:
{
"error": {
"code": "429",
"message": "Requests to the ChatCompletions_Create Operation under Azure OpenAI API
version 2024-10-21 have exceeded token rate limit of your current
OpenAI S0 pricing tier. Please retry after 6 seconds."
}
}
Headers:
Retry-After: 6
x-ratelimit-remaining-tokens: 0
x-ratelimit-remaining-requests: 12
The service is telling you that you have exhausted your quota for this time window. It is not a bug in your code — it is a capacity constraint.
Root Cause: TPM and RPM Quotas
Azure OpenAI enforces two rate limits per deployment:
| Quota | What It Limits | Typical Default |
|---|---|---|
| TPM (Tokens Per Minute) | Total input + output tokens | 10,000 - 120,000 depending on model and tier |
| RPM (Requests Per Minute) | Number of API calls | Derived from TPM (roughly TPM / 1,000 * 6) |
When either limit is hit, all subsequent requests receive a 429 until the one-minute window resets. The Retry-After header tells you exactly how long to wait.
This happens most often when you have concurrent users, batch processing jobs, or prompts with large context windows. A single request with a 4,000-token prompt and a 2,000-token response consumes 6,000 tokens from your TPM budget in one call.
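To see how quickly that budget disappears, run the arithmetic for your own deployment. A rough sketch with illustrative numbers (your actual TPM allocation will differ):
// Back-of-the-envelope capacity check with illustrative numbers
const int tpmQuota = 30_000;      // tokens per minute allocated to the deployment
const int tokensPerCall = 6_000;  // 4,000-token prompt + 2,000-token response
Console.WriteLine($"Calls per minute before throttling: {tpmQuota / tokensPerCall}"); // 5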
For a complete breakdown of quota limits by model and region, see the Azure OpenAI quotas and limits documentation.
Fix 1: Configure the SDK’s Built-in Retry
The Azure.AI.OpenAI SDK already retries on 429 responses with exponential backoff. By default, it retries 3 times. You can tune this:
using Azure;
using Azure.AI.OpenAI;
using OpenAI.Chat;
using System.ClientModel.Primitives; // ClientRetryPolicy lives here

// Raise the retry count from the default of 3
var options = new AzureOpenAIClientOptions();
options.RetryPolicy = new ClientRetryPolicy(maxRetries: 5);

var client = new AzureOpenAIClient(
    new Uri(config["AzureOpenAI:Endpoint"]!),
    new AzureKeyCredential(config["AzureOpenAI:ApiKey"]!),
    options);

ChatClient chatClient = client.GetChatClient("my-gpt4o-deployment");

// The SDK will automatically retry 429s up to 5 times with exponential backoff
ChatCompletion completion = await chatClient.CompleteChatAsync("Explain rate limiting.");
This is the simplest approach. The SDK reads the Retry-After header and waits the appropriate duration before retrying.
Fix 2: Custom Polly Resilience Pipeline
When you need more control, such as circuit breaking, custom backoff curves, or shared retry budgets across services, use Microsoft.Extensions.Http.Resilience, which is built on Polly v8:
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Http.Resilience;
using Polly;

// In your DI setup
builder.Services.AddHttpClient("AzureOpenAI")
    .AddResilienceHandler("openai-retry", pipeline =>
    {
        pipeline.AddRetry(new HttpRetryStrategyOptions
        {
            MaxRetryAttempts = 5,
            Delay = TimeSpan.FromSeconds(2),
            BackoffType = DelayBackoffType.Exponential,
            UseJitter = true,
            ShouldHandle = args => ValueTask.FromResult(
                args.Outcome.Result?.StatusCode == System.Net.HttpStatusCode.TooManyRequests)
        });

        pipeline.AddCircuitBreaker(new HttpCircuitBreakerStrategyOptions
        {
            SamplingDuration = TimeSpan.FromSeconds(30),
            FailureRatio = 0.7,
            MinimumThroughput = 10,
            BreakDuration = TimeSpan.FromSeconds(15)
        });
    });
The circuit breaker is important for sustained rate limiting. If 70% of your requests are getting 429s, it stops all requests for 15 seconds instead of continuing to hammer the endpoint — which only extends the throttling window.
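The resilience handler lives on the named HttpClient, so the SDK client has to send its traffic through that HttpClient. A minimal wiring sketch, assuming an injected IHttpClientFactory and the same configuration keys as above; the SDK's own retries are set to zero so Polly is the single retry authority:
using System.ClientModel.Primitives;

// Resolve the named HttpClient that carries the Polly pipeline
HttpClient httpClient = httpClientFactory.CreateClient("AzureOpenAI");

var options = new AzureOpenAIClientOptions
{
    Transport = new HttpClientPipelineTransport(httpClient), // route SDK calls through the resilient HttpClient
    RetryPolicy = new ClientRetryPolicy(maxRetries: 0)       // let Polly own retries instead of the SDK
};

var client = new AzureOpenAIClient(
    new Uri(config["AzureOpenAI:Endpoint"]!),
    new AzureKeyCredential(config["AzureOpenAI:ApiKey"]!),
    options);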
Fix 3: Request Queue with Bounded Concurrency
For batch processing or high-throughput scenarios, throttle at the application level instead of relying on server-side rejection:
using OpenAI.Chat;
using System.Threading;
using System.Threading.Tasks;

public class AzureOpenAIThrottledClient
{
    private readonly ChatClient _chatClient;
    private readonly SemaphoreSlim _semaphore;

    public AzureOpenAIThrottledClient(ChatClient chatClient, int maxConcurrency = 3)
    {
        _chatClient = chatClient;
        _semaphore = new SemaphoreSlim(maxConcurrency);
    }

    public async Task<ChatCompletion> CompleteChatAsync(
        string prompt,
        CancellationToken ct = default)
    {
        await _semaphore.WaitAsync(ct);
        try
        {
            return await _chatClient.CompleteChatAsync(prompt);
        }
        finally
        {
            // Add a small delay between requests to smooth out token consumption.
            // Release inside a nested finally so a cancelled delay cannot leak the semaphore permit.
            try
            {
                await Task.Delay(TimeSpan.FromMilliseconds(200), ct);
            }
            finally
            {
                _semaphore.Release();
            }
        }
    }
}
This approach limits concurrency to 3 simultaneous requests with a 200ms gap between releases. Adjust both values based on your deployment’s TPM allocation.
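A hypothetical batch usage: start every call up front and let the semaphore decide how many actually run at once.
var throttled = new AzureOpenAIThrottledClient(chatClient, maxConcurrency: 3);

// LoadPrompts() is a hypothetical source of batch prompts
IEnumerable<string> prompts = LoadPrompts();

// All tasks start immediately, but at most three requests are in flight at any moment
Task<ChatCompletion>[] tasks = prompts.Select(p => throttled.CompleteChatAsync(p)).ToArray();
ChatCompletion[] results = await Task.WhenAll(tasks);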
Fix 4: Response Caching for Repeated Prompts
If your application sends the same or similar prompts repeatedly (think: classification tasks, template-based generation), caching avoids redundant quota consumption:
using Microsoft.Extensions.Caching.Memory;
using OpenAI.Chat;
using System.Security.Cryptography;
using System.Text;

public class CachedChatClient
{
    private readonly ChatClient _inner;
    private readonly IMemoryCache _cache;

    public CachedChatClient(ChatClient inner, IMemoryCache cache)
    {
        _inner = inner;
        _cache = cache;
    }

    public async Task<string> CompleteChatAsync(string prompt)
    {
        // Hash the prompt for a stable cache key; string.GetHashCode() is randomized
        // per process and prone to collisions, so it is not a safe cache key.
        var cacheKey = $"chat:{Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(prompt)))}";

        if (_cache.TryGetValue(cacheKey, out string? cached))
            return cached!;

        ChatCompletion completion = await _inner.CompleteChatAsync(prompt);
        var result = completion.Content[0].Text;

        _cache.Set(cacheKey, result, TimeSpan.FromMinutes(30));
        return result;
    }
}
For more sophisticated caching that accounts for semantic similarity, consider a vector similarity check before calling the API.
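A minimal sketch of that idea, assuming an embeddings deployment named "text-embedding-3-small", an in-memory list of previously answered prompts, and a hypothetical GenerateAndCacheAsync helper that calls the API and stores the result; a production version would use a proper vector store:
using OpenAI.Embeddings;

// Hypothetical in-memory cache of (embedding, completion) pairs for prompts already answered
var cachedEntries = new List<(float[] Vector, string Completion)>();

EmbeddingClient embeddings = client.GetEmbeddingClient("text-embedding-3-small");

// Embed the incoming prompt and look for a semantically close cached prompt
OpenAIEmbedding embedding = await embeddings.GenerateEmbeddingAsync(prompt);
float[] vector = embedding.ToFloats().ToArray();

var match = cachedEntries.FirstOrDefault(e => CosineSimilarity(vector, e.Vector) > 0.95);
string answer = match.Completion is not null
    ? match.Completion                              // close enough: reuse the cached completion, spend no tokens
    : await GenerateAndCacheAsync(prompt, vector);  // hypothetical helper: call the API and cache the result

static double CosineSimilarity(float[] a, float[] b)
{
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.Length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
    return dot / (Math.Sqrt(na) * Math.Sqrt(nb));
}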
Monitoring Token Usage
Track token consumption per request to understand your quota utilization:
ChatCompletion completion = await chatClient.CompleteChatAsync(messages);
Console.WriteLine($"Input tokens: {completion.Usage.InputTokenCount}");
Console.WriteLine($"Output tokens: {completion.Usage.OutputTokenCount}");
Console.WriteLine($"Total tokens: {completion.Usage.TotalTokenCount}");
Log these values to your telemetry system. Over time, you will see consumption patterns that tell you whether you need a quota increase or better request management.
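One lightweight way to do that is a System.Diagnostics.Metrics counter, which OpenTelemetry or Application Insights can pick up; the meter and counter names below are arbitrary examples:
using System.Diagnostics.Metrics;

// Arbitrary names; register the meter with your OpenTelemetry or Application Insights exporter
var meter = new Meter("MyApp.AzureOpenAI");
Counter<long> tokenCounter = meter.CreateCounter<long>("openai.tokens.consumed");

ChatCompletion completion = await chatClient.CompleteChatAsync(messages);
tokenCounter.Add(completion.Usage.TotalTokenCount,
    new KeyValuePair<string, object?>("deployment", "my-gpt4o-deployment"));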
Requesting a Quota Increase
If your workload legitimately needs more throughput:
1. Open the Azure portal and navigate to your Azure OpenAI resource.
2. Select Quotas from the left menu.
3. Find the model deployment and click Request Quota Increase.
4. Set the desired TPM value and submit.
Standard quota increases are usually approved within minutes. Large increases may require manual review.
Prevention Patterns
- Estimate tokens before sending. Use a tokenizer like Microsoft.ML.Tokenizers to count tokens client-side and avoid sending prompts that will blow your remaining budget (see the sketch after this list).
- Distribute across deployments. Create multiple deployments of the same model in different regions. Route requests round-robin or based on remaining quota.
- Set max_tokens on every request. Cap the response length to prevent runaway token consumption.
- Implement backpressure in your API. If your backend is getting 429s, return 503 to your frontend with a Retry-After header rather than queueing unbounded requests.
- Separate batch and interactive workloads. Use different deployments for user-facing chat (low latency, lower throughput) and batch processing (higher throughput, latency-tolerant).
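A sketch of the first and third items, assuming the gpt-4o encoding matches your deployment's model and using an arbitrary 3,000-token prompt ceiling; recent SDK versions expose max_tokens as MaxOutputTokenCount:
using Microsoft.ML.Tokenizers;
using OpenAI.Chat;

// Count tokens client-side before spending any quota
Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");
int promptTokens = tokenizer.CountTokens(prompt);
if (promptTokens > 3_000)
    throw new InvalidOperationException($"Prompt is {promptTokens} tokens; trim it before sending.");

// Cap the response length so a single call cannot blow the TPM budget
var chatOptions = new ChatCompletionOptions { MaxOutputTokenCount = 500 };
ChatCompletion completion = await chatClient.CompleteChatAsync(
    new[] { new UserChatMessage(prompt) }, chatOptions);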
Further Reading
- Azure OpenAI Quotas and Limits
- Azure.AI.OpenAI on NuGet
- Microsoft.Extensions.Resilience on NuGet
- Azure OpenAI Rate Limiting on StackOverflow