The Error
Your .NET application calls Azure OpenAI and receives this response:
Azure.RequestFailedException: HTTP 429 (Too Many Requests)
Content:
{
"error": {
"code": "429",
"message": "Requests to the ChatCompletions_Create Operation under Azure OpenAI API
version 2024-10-21 have exceeded token rate limit of your current
OpenAI S0 pricing tier. Please retry after 6 seconds."
}
}
Headers:
Retry-After: 6
x-ratelimit-remaining-tokens: 0
x-ratelimit-remaining-requests: 12
The service is telling you that you have exhausted your quota for this time window. It is not a bug in your code — it is a capacity constraint.
Root Cause: TPM and RPM Quotas
Azure OpenAI enforces two rate limits per deployment:
| Quota | What It Limits | Typical Default |
|---|---|---|
| TPM (Tokens Per Minute) | Total input + output tokens | 10,000 - 120,000 depending on model and tier |
| RPM (Requests Per Minute) | Number of API calls | Derived from TPM (roughly TPM / 1,000 * 6) |
When either limit is hit, all subsequent requests receive a 429 until the one-minute window resets. The Retry-After header tells you exactly how long to wait.
This happens most often when you have concurrent users, batch processing jobs, or prompts with large context windows. A single request with a 4,000-token prompt and a 2,000-token response consumes 6,000 tokens from your TPM budget in one call.
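To see how quickly that budget disappears, run the arithmetic for your own deployment. A rough sketch with illustrative numbers (your actual TPM allocation will differ):
// Back-of-the-envelope capacity check with illustrative numbers
const int tpmQuota = 30_000;      // tokens per minute allocated to the deployment
const int tokensPerCall = 6_000;  // 4,000-token prompt + 2,000-token response
Console.WriteLine($"Calls per minute before throttling: {tpmQuota / tokensPerCall}"); // 5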
For a complete breakdown of quota limits by model and region, see the Azure OpenAI quotas and limits documentation.
Fix 1: Configure the SDK’s Built-in Retry
The Azure.AI.OpenAI SDK already retries on 429 responses with exponential backoff. By default, it retries 3 times. You can tune this:
using Azure;
using Azure.AI.OpenAI;
using OpenAI.Chat;
using System.ClientModel.Primitives; // ClientRetryPolicy lives here

// Raise the retry count from the default of 3
var options = new AzureOpenAIClientOptions();
options.RetryPolicy = new ClientRetryPolicy(maxRetries: 5);

var client = new AzureOpenAIClient(
    new Uri(config["AzureOpenAI:Endpoint"]!),
    new AzureKeyCredential(config["AzureOpenAI:ApiKey"]!),
    options);

ChatClient chatClient = client.GetChatClient("my-gpt4o-deployment");

// The SDK will automatically retry 429s up to 5 times with exponential backoff
ChatCompletion completion = await chatClient.CompleteChatAsync("Explain rate limiting.");
This is the simplest approach. The SDK reads the Retry-After header and waits the appropriate duration before retrying.
Fix 2: Custom Polly Resilience Pipeline
When you need more control, such as circuit breaking, custom backoff curves, or shared retry budgets across services, use Microsoft.Extensions.Http.Resilience, which is built on Polly v8:
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Http.Resilience;
using Polly;

// In your DI setup
builder.Services.AddHttpClient("AzureOpenAI")
    .AddResilienceHandler("openai-retry", pipeline =>
    {
        pipeline.AddRetry(new HttpRetryStrategyOptions
        {
            MaxRetryAttempts = 5,
            Delay = TimeSpan.FromSeconds(2),
            BackoffType = DelayBackoffType.Exponential,
            UseJitter = true,
            ShouldHandle = args => ValueTask.FromResult(
                args.Outcome.Result?.StatusCode == System.Net.HttpStatusCode.TooManyRequests)
        });

        pipeline.AddCircuitBreaker(new HttpCircuitBreakerStrategyOptions
        {
            SamplingDuration = TimeSpan.FromSeconds(30),
            FailureRatio = 0.7,
            MinimumThroughput = 10,
            BreakDuration = TimeSpan.FromSeconds(15)
        });
    });
The circuit breaker is important for sustained rate limiting. If 70% of your requests are getting 429s, it stops all requests for 15 seconds instead of continuing to hammer the endpoint — which only extends the throttling window.
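The resilience handler lives on the named HttpClient, so the SDK client has to send its traffic through that HttpClient. A minimal wiring sketch, assuming an injected IHttpClientFactory and the same configuration keys as above; the SDK's own retries are set to zero so Polly is the single retry authority:
using System.ClientModel.Primitives;

// Resolve the named HttpClient that carries the Polly pipeline
HttpClient httpClient = httpClientFactory.CreateClient("AzureOpenAI");

var options = new AzureOpenAIClientOptions
{
    Transport = new HttpClientPipelineTransport(httpClient), // route SDK calls through the resilient HttpClient
    RetryPolicy = new ClientRetryPolicy(maxRetries: 0)       // let Polly own retries instead of the SDK
};

var client = new AzureOpenAIClient(
    new Uri(config["AzureOpenAI:Endpoint"]!),
    new AzureKeyCredential(config["AzureOpenAI:ApiKey"]!),
    options);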
Fix 3: Request Queue with Bounded Concurrency
For batch processing or high-throughput scenarios, throttle at the application level instead of relying on server-side rejection:
using OpenAI.Chat;
using System.Threading;
using System.Threading.Tasks;

public class AzureOpenAIThrottledClient
{
    private readonly ChatClient _chatClient;
    private readonly SemaphoreSlim _semaphore;

    public AzureOpenAIThrottledClient(ChatClient chatClient, int maxConcurrency = 3)
    {
        _chatClient = chatClient;
        _semaphore = new SemaphoreSlim(maxConcurrency);
    }

    public async Task<ChatCompletion> CompleteChatAsync(
        string prompt,
        CancellationToken ct = default)
    {
        await _semaphore.WaitAsync(ct);
        try
        {
            return await _chatClient.CompleteChatAsync(prompt);
        }
        finally
        {
            // Add a small delay between requests to smooth out token consumption.
            // Release inside a nested finally so a cancelled delay cannot leak the semaphore permit.
            try
            {
                await Task.Delay(TimeSpan.FromMilliseconds(200), ct);
            }
            finally
            {
                _semaphore.Release();
            }
        }
    }
}
This approach limits concurrency to 3 simultaneous requests with a 200ms gap between releases. Adjust both values based on your deployment’s TPM allocation.
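A hypothetical batch usage: start every call up front and let the semaphore decide how many actually run at once.
var throttled = new AzureOpenAIThrottledClient(chatClient, maxConcurrency: 3);

// LoadPrompts() is a hypothetical source of batch prompts
IEnumerable<string> prompts = LoadPrompts();

// All tasks start immediately, but at most three requests are in flight at any moment
Task<ChatCompletion>[] tasks = prompts.Select(p => throttled.CompleteChatAsync(p)).ToArray();
ChatCompletion[] results = await Task.WhenAll(tasks);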
Fix 4: Response Caching for Repeated Prompts
If your application sends the same or similar prompts repeatedly (think: classification tasks, template-based generation), caching avoids redundant quota consumption:
using Microsoft.Extensions.Caching.Memory;
using OpenAI.Chat;
using System.Security.Cryptography;
using System.Text;

public class CachedChatClient
{
    private readonly ChatClient _inner;
    private readonly IMemoryCache _cache;

    public CachedChatClient(ChatClient inner, IMemoryCache cache)
    {
        _inner = inner;
        _cache = cache;
    }

    public async Task<string> CompleteChatAsync(string prompt)
    {
        // Hash the prompt for a stable cache key; string.GetHashCode() is randomized
        // per process and prone to collisions, so it is not a safe cache key.
        var cacheKey = $"chat:{Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(prompt)))}";

        if (_cache.TryGetValue(cacheKey, out string? cached))
            return cached!;

        ChatCompletion completion = await _inner.CompleteChatAsync(prompt);
        var result = completion.Content[0].Text;

        _cache.Set(cacheKey, result, TimeSpan.FromMinutes(30));
        return result;
    }
}
For more sophisticated caching that accounts for semantic similarity, consider a vector similarity check before calling the API.
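A minimal sketch of that idea, assuming an embeddings deployment named "text-embedding-3-small", an in-memory list of previously answered prompts, and a hypothetical GenerateAndCacheAsync helper that calls the API and stores the result; a production version would use a proper vector store:
using OpenAI.Embeddings;

// Hypothetical in-memory cache of (embedding, completion) pairs for prompts already answered
var cachedEntries = new List<(float[] Vector, string Completion)>();

EmbeddingClient embeddings = client.GetEmbeddingClient("text-embedding-3-small");

// Embed the incoming prompt and look for a semantically close cached prompt
OpenAIEmbedding embedding = await embeddings.GenerateEmbeddingAsync(prompt);
float[] vector = embedding.ToFloats().ToArray();

var match = cachedEntries.FirstOrDefault(e => CosineSimilarity(vector, e.Vector) > 0.95);
string answer = match.Completion is not null
    ? match.Completion                              // close enough: reuse the cached completion, spend no tokens
    : await GenerateAndCacheAsync(prompt, vector);  // hypothetical helper: call the API and cache the result

static double CosineSimilarity(float[] a, float[] b)
{
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.Length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
    return dot / (Math.Sqrt(na) * Math.Sqrt(nb));
}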
Monitoring Token Usage
Track token consumption per request to understand your quota utilization:
ChatCompletion completion = await chatClient.CompleteChatAsync(messages);
Console.WriteLine($"Input tokens: {completion.Usage.InputTokenCount}");
Console.WriteLine($"Output tokens: {completion.Usage.OutputTokenCount}");
Console.WriteLine($"Total tokens: {completion.Usage.TotalTokenCount}");
Log these values to your telemetry system. Over time, you will see consumption patterns that tell you whether you need a quota increase or better request management.
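One lightweight way to do that is a System.Diagnostics.Metrics counter, which OpenTelemetry or Application Insights can pick up; the meter and counter names below are arbitrary examples:
using System.Diagnostics.Metrics;

// Arbitrary names; register the meter with your OpenTelemetry or Application Insights exporter
var meter = new Meter("MyApp.AzureOpenAI");
Counter<long> tokenCounter = meter.CreateCounter<long>("openai.tokens.consumed");

ChatCompletion completion = await chatClient.CompleteChatAsync(messages);
tokenCounter.Add(completion.Usage.TotalTokenCount,
    new KeyValuePair<string, object?>("deployment", "my-gpt4o-deployment"));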
Requesting a Quota Increase
If your workload legitimately needs more throughput:
1. Open the Azure portal and navigate to your Azure OpenAI resource.
2. Select Quotas from the left menu.
3. Find the model deployment and click Request Quota Increase.
4. Set the desired TPM value and submit.
Standard quota increases are usually approved within minutes. Large increases may require manual review.
Prevention Patterns
- Estimate tokens before sending. Use a tokenizer like Microsoft.ML.Tokenizers to count tokens client-side and avoid sending prompts that will blow your remaining budget (see the sketch after this list).
- Distribute across deployments. Create multiple deployments of the same model in different regions. Route requests round-robin or based on remaining quota.
- Set max_tokens on every request. Cap the response length to prevent runaway token consumption.
- Implement backpressure in your API. If your backend is getting 429s, return 503 to your frontend with a Retry-After header rather than queueing unbounded requests.
- Separate batch and interactive workloads. Use different deployments for user-facing chat (low latency, lower throughput) and batch processing (higher throughput, latency-tolerant).
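A sketch of the first and third items, assuming the gpt-4o encoding matches your deployment's model and using an arbitrary 3,000-token prompt ceiling; recent SDK versions expose max_tokens as MaxOutputTokenCount:
using Microsoft.ML.Tokenizers;
using OpenAI.Chat;

// Count tokens client-side before spending any quota
Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");
int promptTokens = tokenizer.CountTokens(prompt);
if (promptTokens > 3_000)
    throw new InvalidOperationException($"Prompt is {promptTokens} tokens; trim it before sending.");

// Cap the response length so a single call cannot blow the TPM budget
var chatOptions = new ChatCompletionOptions { MaxOutputTokenCount = 500 };
ChatCompletion completion = await chatClient.CompleteChatAsync(
    new[] { new UserChatMessage(prompt) }, chatOptions);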
Further Reading
- Azure OpenAI Quotas and Limits
- Azure.AI.OpenAI on NuGet
- Microsoft.Extensions.Resilience on NuGet
- Azure OpenAI Rate Limiting on StackOverflow