
Add Resilience to AI Calls in .NET with Polly v8

Intermediate · .NET 9 · Microsoft.Extensions.Resilience 9.3.0 · Azure.AI.OpenAI 2.1.0
By Rajesh Mishra · Mar 21, 2026 · 15 min read
Verified Mar 2026
In 30 Seconds

Production resilience for Azure OpenAI in .NET requires three layers: retry with exponential backoff and jitter (handles transient 429s), circuit breaker (stops calls during sustained outages), and client-side rate limiting (prevents quota exhaustion). Use Microsoft.Extensions.Resilience for HttpClient-based scenarios and ResiliencePipelineBuilder for direct SDK wrapping. Apply strategies in order: rate limiter → timeout → circuit breaker → retry.

What You'll Build

In this workshop you will implement production-grade resilience for Azure OpenAI in C# using Polly v8: retry, circuit breaker, and token bucket rate limiting, each with complete code examples.


Azure OpenAI calls fail in ways that ordinary HTTP calls do not. A database timeout is transient — wait a moment and retry. An Azure OpenAI 429 error may mean your quota window is exhausted for the next 60 seconds, and retrying immediately makes it worse. A circuit breaker that opens at a 70% failure rate does nothing for you if every retry within that window counts as another failure, driving you further into the open state.

This workshop builds production-grade resilience for Azure OpenAI calls in .NET using Polly v8 and Microsoft.Extensions.Resilience. You will implement three coordinated strategies — retry with exponential backoff, circuit breaker, and client-side token bucket rate limiting — in the correct order. Azure OpenAI 429 errors are the most common production incident for .NET AI teams — see our Fix 429 Rate Limit Exceeded guide for the root cause analysis before applying these patterns.

1. Why AI Calls Need Dedicated Resilience

Standard HTTP resilience advice — retry on 5xx, circuit break at 50% failure — does not translate directly to AI API calls. The differences matter.

Rate limits are per token, not just per request. A single 4,000-token completion consumes as much quota as forty 100-token requests. A rate limiter that counts requests rather than tokens will underestimate consumption whenever large requests dominate your traffic.

Failure modes are distinct. Azure OpenAI surfaces four categories of errors that require different handling:

| Status | Cause | Correct Response |
|---|---|---|
| 429 | Quota exhausted (TPM or RPM) | Back off; respect Retry-After header |
| 503 | Regional capacity issue | Retry with backoff; consider failover |
| 400 (content filter) | Input policy violation | Do not retry — fix the prompt |
| 401 / 403 | Auth or RBAC misconfiguration | Do not retry — fix credentials |

Retrying on 400 content filter errors or 401 auth errors wastes quota and adds latency. Your retry predicate must be selective.

Retrying during quota exhaustion worsens the situation. If your 60-second TPM window is exhausted, five aggressive retries over 30 seconds consume quota from the next window before it resets. A circuit breaker that stops all calls for 15 seconds during sustained failures is worth more than five more retries.

No shared state across client instances by default. If you create multiple AzureOpenAIClient instances (one per request, for example), each has its own SDK retry policy with no knowledge of the others. Under load, twenty concurrent requests each retrying three times means sixty actual HTTP calls hitting an already-overloaded quota window.

Incoming Request → Rate Limiter (token bucket) → Circuit Breaker (70% failure) → Retry Policy (exp backoff) → Azure OpenAI

2. Built-in SDK Retry: The Starting Point

The Azure SDK includes a configurable retry policy. For low-traffic, single-user scenarios it is often sufficient:

using Azure;
using Azure.AI.OpenAI;
using System.ClientModel.Primitives; // ClientRetryPolicy lives here in the 2.x SDK

var options = new AzureOpenAIClientOptions();
options.RetryPolicy = new ClientRetryPolicy(maxRetries: 5);

var client = new AzureOpenAIClient(
    new Uri(configuration["AzureOpenAI:Endpoint"]!),
    new AzureKeyCredential(configuration["AzureOpenAI:ApiKey"]!),
    options);

ClientRetryPolicy applies exponential backoff on 429 and 5xx responses. It respects the Retry-After header when present. For a developer tool or low-concurrency internal application, this is a reasonable baseline.

The limitations become apparent at scale:

  • No circuit breaking. If Azure OpenAI is sustaining a 100% error rate for three minutes, the SDK keeps retrying every request individually with no coordinated stop.
  • No request-level timeout. The SDK respects socket timeouts but has no mechanism to fail fast on a slow response that is taking 45 seconds to complete.
  • No shared failure state. Each AzureOpenAIClient instance tracks its own retry count independently. Ten concurrent users, each with their own client, each retrying five times, send fifty requests to an already-failing endpoint.
  • No client-side rate limiting. The SDK cannot prevent your application from exceeding its quota window — it can only recover after the server rejects the excess requests.

For production multi-user applications, these limitations require Polly.

3. Step 1: Retry with Exponential Backoff and Jitter

Install the package:

dotnet add package Microsoft.Extensions.Http.Resilience

Microsoft.Extensions.Http.Resilience wraps Polly v8 and integrates with IHttpClientFactory. Register a resilient HTTP client in Program.cs:

using Microsoft.Extensions.Http.Resilience;
using Polly;

// In Program.cs
builder.Services.AddHttpClient("AzureOpenAI")
    .AddResilienceHandler("azure-openai", pipeline =>
    {
        pipeline.AddRetry(new HttpRetryStrategyOptions
        {
            MaxRetryAttempts = 5,
            Delay = TimeSpan.FromSeconds(2),
            BackoffType = DelayBackoffType.Exponential,
            UseJitter = true,
            // Only retry on 429 and 5xx transient errors
            ShouldHandle = static args => ValueTask.FromResult(
                args.Outcome.Result?.StatusCode == System.Net.HttpStatusCode.TooManyRequests ||
                (args.Outcome.Result?.StatusCode >= System.Net.HttpStatusCode.InternalServerError))
        });
    });

The ShouldHandle predicate is the most important part of this configuration. By explicitly matching only 429 and 5xx status codes, you avoid retrying 400 content filter rejections and 401 authentication failures — both of which are not transient and should surface immediately to the caller.

UseJitter = true adds random variation to each delay interval (for exponential backoff, Polly applies a decorrelated-jitter algorithm rather than a fixed percentage). Without jitter, twenty concurrent requests that all hit a 429 at the same moment will all sleep for exactly two seconds and then arrive simultaneously — the thundering herd problem. Jitter spreads the retry wave across a window instead of a single instant, significantly reducing the probability of another coordinated collision.

Exponential backoff with five retries and a two-second base delay produces these nominal delay intervals (before jitter): 2s, 4s, 8s, 16s, 32s. Total maximum wait before final failure is roughly 60 seconds, which aligns with the one-minute Azure OpenAI quota reset window.
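The backoff arithmetic above can be sketched directly. This is an illustration only, not Polly's internal algorithm (Polly's exponential jitter uses a decorrelated formula; this sketch applies a simple ±25% variation):

```csharp
using System;

// Illustrative only: nominal exponential delays with ±25% jitter applied.
static TimeSpan[] ComputeDelays(TimeSpan baseDelay, int maxRetries, Random rng)
{
    var delays = new TimeSpan[maxRetries];
    for (int attempt = 0; attempt < maxRetries; attempt++)
    {
        // Nominal delays for a 2s base: 2s, 4s, 8s, 16s, 32s
        double seconds = baseDelay.TotalSeconds * Math.Pow(2, attempt);
        // ±25% jitter spreads simultaneous retries apart
        double jitter = 1.0 + (rng.NextDouble() - 0.5) * 0.5;
        delays[attempt] = TimeSpan.FromSeconds(seconds * jitter);
    }
    return delays;
}

foreach (var d in ComputeDelays(TimeSpan.FromSeconds(2), maxRetries: 5, new Random()))
    Console.WriteLine($"{d.TotalSeconds:F2}s");
```

Summing the nominal delays (2 + 4 + 8 + 16 + 32 = 62 seconds) is how the "roughly 60 seconds" figure above falls out.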

4. Step 2: Circuit Breaker

Add a circuit breaker to the same pipeline:

pipeline.AddCircuitBreaker(new HttpCircuitBreakerStrategyOptions
{
    // Open circuit when 70% of last 10+ requests fail
    SamplingDuration = TimeSpan.FromSeconds(30),
    FailureRatio = 0.7,
    MinimumThroughput = 10,
    BreakDuration = TimeSpan.FromSeconds(15),
    // Notify on state changes for monitoring
    OnOpened = static args =>
    {
        // Log or alert that circuit is open
        return ValueTask.CompletedTask;
    }
});

The circuit breaker operates in three states:

Closed is the normal operating state. Every request passes through to Azure OpenAI. The circuit breaker tracks success and failure counts within the SamplingDuration window.

Open is the failure state. When failures exceed the FailureRatio threshold (with at least MinimumThroughput requests observed), the circuit opens. All requests are immediately rejected with a BrokenCircuitException — they never reach Azure OpenAI. This prevents quota exhaustion from cascading retries during a sustained outage.

Half-Open is the recovery probe state. After BreakDuration elapses, the circuit allows a single test request through. If it succeeds, the circuit closes. If it fails, the circuit re-opens for another BreakDuration.
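The three states can be illustrated with a minimal hand-rolled state machine. This is a teaching sketch only: it trips on consecutive failures and is single-threaded, whereas Polly's real breaker evaluates a failure ratio over a sampling window with full thread safety:

```csharp
using System;

// Drive the sketch through a full trip/recovery cycle.
var breaker = new TinyCircuitBreaker(failureThreshold: 3, breakDuration: TimeSpan.FromSeconds(15));
var t0 = new DateTime(2026, 3, 1, 12, 0, 0);

breaker.RecordFailure(t0);
breaker.RecordFailure(t0);
breaker.RecordFailure(t0);                                  // third failure trips the circuit
Console.WriteLine(breaker.Current);                         // Open
Console.WriteLine(breaker.AllowRequest(t0));                // False: rejected, no network call
Console.WriteLine(breaker.AllowRequest(t0.AddSeconds(15))); // True: half-open probe allowed
breaker.RecordSuccess();
Console.WriteLine(breaker.Current);                         // Closed

class TinyCircuitBreaker
{
    public enum CircuitState { Closed, Open, HalfOpen }
    public CircuitState Current { get; private set; } = CircuitState.Closed;

    private readonly int _failureThreshold;
    private readonly TimeSpan _breakDuration;
    private int _consecutiveFailures;
    private DateTime _openedAt;

    public TinyCircuitBreaker(int failureThreshold, TimeSpan breakDuration)
    {
        _failureThreshold = failureThreshold;
        _breakDuration = breakDuration;
    }

    public bool AllowRequest(DateTime now)
    {
        if (Current == CircuitState.Open && now - _openedAt >= _breakDuration)
            Current = CircuitState.HalfOpen;   // break elapsed: allow one probe
        return Current != CircuitState.Open;   // open circuit rejects instantly
    }

    public void RecordSuccess()
    {
        _consecutiveFailures = 0;
        Current = CircuitState.Closed;         // successful probe closes the circuit
    }

    public void RecordFailure(DateTime now)
    {
        _consecutiveFailures++;
        if (Current == CircuitState.HalfOpen || _consecutiveFailures >= _failureThreshold)
        {
            Current = CircuitState.Open;       // trip, or re-trip after a failed probe
            _openedAt = now;
            _consecutiveFailures = 0;
        }
    }
}
```

The state transitions mirror the description above: failures trip Closed → Open, the elapsed break duration moves Open → Half-Open, and the probe result decides between Closed and a fresh Open.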

The MinimumThroughput = 10 setting prevents the circuit from opening based on small samples. If your application sends two requests and both fail (perhaps during startup while credentials are being resolved), a 100% failure ratio should not immediately open the circuit. Requiring at least ten observations before evaluating the failure ratio gives a statistically meaningful sample.

The OnOpened callback is where you should log a warning or trigger an alert. A circuit opening means your application has detected a sustained Azure OpenAI failure — this is an operational event worth tracking.

OnOpened = static args =>
{
    Log.Warning("Azure OpenAI circuit breaker opened. Duration: {Duration}",
        args.BreakDuration);
    return ValueTask.CompletedTask;
}

5. Step 3: Client-Side Rate Limiting

Retry and circuit breaking respond to failures after they occur. A client-side rate limiter prevents quota exhaustion from occurring in the first place.

using System.Threading.RateLimiting;

// Create a token bucket sized to your Azure OpenAI TPM quota
// Example: 100,000 TPM deployment
var rateLimiter = new TokenBucketRateLimiter(new TokenBucketRateLimiterOptions
{
    TokenLimit = 100_000,          // Max tokens in bucket
    QueueProcessingOrder = QueueProcessingOrder.OldestFirst,
    QueueLimit = 0,                 // Reject immediately if at limit
    ReplenishmentPeriod = TimeSpan.FromMinutes(1),
    TokensPerPeriod = 100_000,     // Refill to full every minute
    AutoReplenishment = true
});

// Register as singleton
builder.Services.AddSingleton(rateLimiter);

A TokenBucketRateLimiter works by maintaining a bucket of permits. Each request acquires permits proportional to its estimated token consumption. When the bucket is empty, requests are rejected immediately (QueueLimit = 0) rather than waiting — you want callers to receive a fast failure they can handle, not a silent queue that eventually times out.

The QueueLimit = 0 setting is intentional. Queuing requests at the rate limiter introduces unpredictable latency that is difficult to surface to callers. A rejected RateLimitLease is a clean signal the caller can translate into a 429 response with a Retry-After header, matching what Azure OpenAI would return if the request had reached the server.

Use the rate limiter in a service that wraps chat calls:

public class RateLimitedChatService
{
    private readonly ChatClient _chatClient;
    private readonly TokenBucketRateLimiter _rateLimiter;

    public RateLimitedChatService(ChatClient chatClient, TokenBucketRateLimiter rateLimiter)
    {
        _chatClient = chatClient;
        _rateLimiter = rateLimiter;
    }

    public async Task<string> CompleteChatAsync(
        List<ChatMessage> messages,
        int estimatedTokens,
        CancellationToken ct = default)
    {
        // Acquire token bucket permits proportional to estimated request size
        using var lease = await _rateLimiter.AcquireAsync(
            permitCount: estimatedTokens, cancellationToken: ct);

        if (!lease.IsAcquired)
        {
            throw new InvalidOperationException(
                "Client-side rate limit exceeded. Reduce request frequency.");
        }

        var completion = await _chatClient.CompleteChatAsync(messages, cancellationToken: ct);
        return completion.Value.Content[0].Text;
    }
}

Estimating token count before making the call is straightforward for fixed prompts. For dynamic prompts, a rough heuristic of four characters per token is accurate enough for rate limiting purposes — the goal is approximate quota tracking, not precise billing.
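That heuristic can be captured in a small helper. The function name and the per-message overhead constant here are illustrative assumptions, not part of any SDK; a real tokenizer library would give exact counts:

```csharp
using System;

// Rough token estimate for rate-limiting only; not billing-accurate.
// Assumes ~4 characters per token for English text, plus a small
// per-message overhead for chat formatting (illustrative constant).
static int EstimateTokens(string text, int perMessageOverhead = 4)
    => Math.Max(1, text.Length / 4) + perMessageOverhead;

Console.WriteLine(EstimateTokens("Summarize the incident report in three bullet points."));
```

The estimate feeds directly into the `permitCount` argument of `AcquireAsync` in the service above; overestimating slightly is safer than underestimating, since unused headroom replenishes within a minute.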

6. Step 4: Wrapping the SDK Directly (Non-HttpClient Pattern)

Microsoft.Extensions.Http.Resilience attaches to IHttpClientFactory. When you use the Azure SDK’s ChatClient directly — as most code does — you need to build a Polly pipeline independently and wrap your SDK calls with it.

Install the Polly package:

dotnet add package Polly.Core

Build a standalone resilience pipeline:

using Polly;
using Polly.Retry;
using Polly.CircuitBreaker;
using Azure;

// Build a standalone resilience pipeline
ResiliencePipeline pipeline = new ResiliencePipelineBuilder()
    .AddRetry(new RetryStrategyOptions
    {
        MaxRetryAttempts = 5,
        Delay = TimeSpan.FromSeconds(2),
        BackoffType = DelayBackoffType.Exponential,
        UseJitter = true,
        ShouldHandle = new PredicateBuilder()
            .Handle<RequestFailedException>(ex => ex.Status == 429 || ex.Status >= 500)
    })
    .AddCircuitBreaker(new CircuitBreakerStrategyOptions
    {
        SamplingDuration = TimeSpan.FromSeconds(30),
        FailureRatio = 0.7,
        MinimumThroughput = 10,
        BreakDuration = TimeSpan.FromSeconds(15),
        ShouldHandle = new PredicateBuilder()
            .Handle<RequestFailedException>(ex => ex.Status == 429 || ex.Status >= 500)
    })
    .Build();

// Use it to wrap calls
string result = await pipeline.ExecuteAsync(async ct =>
{
    var completion = await chatClient.CompleteChatAsync(messages, cancellationToken: ct);
    return completion.Value.Content[0].Text;
}, cancellationToken);

The PredicateBuilder().Handle<RequestFailedException>() syntax is Polly v8’s exception-based predicate API. The predicate receives the RequestFailedException thrown by the Azure SDK and filters by HTTP status code. It is equivalent to the ShouldHandle lambda used in the HttpClient-based approach, but targets exceptions rather than HttpResponseMessage objects.

Register this pipeline as a singleton in DI so it is shared across all requests:

builder.Services.AddSingleton(_ =>
    new ResiliencePipelineBuilder()
        .AddRetry(new RetryStrategyOptions
        {
            MaxRetryAttempts = 5,
            Delay = TimeSpan.FromSeconds(2),
            BackoffType = DelayBackoffType.Exponential,
            UseJitter = true,
            ShouldHandle = new PredicateBuilder()
                .Handle<RequestFailedException>(ex => ex.Status == 429 || ex.Status >= 500)
        })
        .AddCircuitBreaker(new CircuitBreakerStrategyOptions
        {
            SamplingDuration = TimeSpan.FromSeconds(30),
            FailureRatio = 0.7,
            MinimumThroughput = 10,
            BreakDuration = TimeSpan.FromSeconds(15)
        })
        .Build());

A singleton pipeline is shared across all chat service instances, which means the circuit breaker’s failure counter is shared. This is the intended behavior — you want one circuit breaker per Azure OpenAI deployment, not one per user session.

7. Step 5: Multi-Deployment Round-Robin

When you have multiple Azure OpenAI deployments (for example, gpt-4o-eastus and gpt-4o-westus), distributing load across them multiplies your effective quota. Round-robin is the simplest distribution strategy:

public class MultiDeploymentChatService
{
    private readonly ChatClient[] _clients;
    private int _currentIndex = 0;

    public MultiDeploymentChatService(IConfiguration configuration)
    {
        var deployments = configuration.GetSection("AzureOpenAI:Deployments").Get<string[]>()!;
        var endpoint = new Uri(configuration["AzureOpenAI:Endpoint"]!);
        var credential = new AzureKeyCredential(configuration["AzureOpenAI:ApiKey"]!);

        var azureClient = new AzureOpenAIClient(endpoint, credential);
        _clients = deployments.Select(d => azureClient.GetChatClient(d)).ToArray();
    }

    public async Task<string> CompleteChatAsync(
        List<ChatMessage> messages,
        CancellationToken ct = default)
    {
        // Thread-safe round-robin; the uint cast keeps the index valid
        // even after the counter eventually wraps past int.MaxValue
        int index = (int)((uint)Interlocked.Increment(ref _currentIndex) % (uint)_clients.Length);
        var completion = await _clients[index].CompleteChatAsync(messages, cancellationToken: ct);
        return completion.Value.Content[0].Text;
    }
}

Interlocked.Increment makes the round-robin counter thread-safe without a lock, and casting through uint keeps the modulo result non-negative after the counter wraps past int.MaxValue. The modulo operation maps the counter into the array bounds, so concurrent requests from twenty users will be distributed evenly across deployments.
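One subtlety worth spelling out: Interlocked.Increment eventually wraps int.MaxValue around to int.MinValue, and C#’s % operator takes the sign of the dividend, so a plain signed modulo of the counter would eventually produce a negative array index. Routing the counter through uint avoids that; a quick stdlib sketch:

```csharp
using System;

// Why the round-robin counter should pass through uint before the modulo:
// C#'s % operator takes the sign of the dividend, so a wrapped (negative)
// counter would otherwise index outside the array.
static int SafeIndex(int counter, int length)
    => (int)((uint)counter % (uint)length);

int deployments = 3;
Console.WriteLine(int.MinValue % deployments);           // -2: invalid as an index
Console.WriteLine(SafeIndex(int.MinValue, deployments)); // 2: always in [0, length)
Console.WriteLine(SafeIndex(7, deployments));            // 1
```

At one request per millisecond the counter wraps after roughly 25 days of uptime, so this is a bug that only appears in long-running services.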

This pairs naturally with the resilience pipeline. If one deployment’s circuit breaker opens, requests routed to that deployment fail fast while requests routed to other deployments continue normally. The circuit breaker’s OnOpened callback becomes an opportunity to temporarily remove the failing deployment from the rotation — but for most scenarios, letting each deployment’s circuit breaker manage independently is sufficient.

For a full streaming implementation that integrates with this service pattern, see the Azure OpenAI Chat Completion Streaming in .NET workshop.

8. Full Composition Example

Combining all three strategies in Program.cs:

// Program.cs — complete resilience setup
using Microsoft.Extensions.Http.Resilience;
using System.Threading.RateLimiting;
using Polly;

var builder = WebApplication.CreateBuilder(args);

// Client-side rate limiter (sized to deployment quota)
var rateLimiter = new TokenBucketRateLimiter(new TokenBucketRateLimiterOptions
{
    TokenLimit = 100_000,
    ReplenishmentPeriod = TimeSpan.FromMinutes(1),
    TokensPerPeriod = 100_000,
    AutoReplenishment = true,
    QueueLimit = 0
});
builder.Services.AddSingleton(rateLimiter);

// Resilient HTTP pipeline for AI calls
builder.Services.AddHttpClient("AzureOpenAI")
    .AddResilienceHandler("azure-openai-full", pipeline =>
    {
        // Order: rate limiter → circuit breaker → retry
        pipeline.AddRateLimiter(rateLimiter);

        pipeline.AddCircuitBreaker(new HttpCircuitBreakerStrategyOptions
        {
            SamplingDuration = TimeSpan.FromSeconds(30),
            FailureRatio = 0.7,
            MinimumThroughput = 10,
            BreakDuration = TimeSpan.FromSeconds(15)
        });

        pipeline.AddRetry(new HttpRetryStrategyOptions
        {
            MaxRetryAttempts = 5,
            Delay = TimeSpan.FromSeconds(2),
            BackoffType = DelayBackoffType.Exponential,
            UseJitter = true,
            ShouldHandle = static args => ValueTask.FromResult(
                args.Outcome.Result?.StatusCode == System.Net.HttpStatusCode.TooManyRequests ||
                args.Outcome.Result?.StatusCode >= System.Net.HttpStatusCode.InternalServerError)
        });
    });

The strategy order is not arbitrary. Each strategy in a Polly pipeline wraps the strategies that follow it, so the first strategy added is the outermost and executes first:

  1. Rate limiter first — rejects requests locally before consuming any network budget. A rejected request never touches Azure OpenAI and never counts as a failure toward the circuit breaker.
  2. Circuit breaker second — the circuit state is checked before the retry loop even starts. If the circuit is open, the call fails immediately with a BrokenCircuitException and the retry strategy is never invoked, so no retry budget is wasted on guaranteed failures.
  3. Retry last — the retry loop runs innermost, closest to the actual HTTP call. The outer strategies execute once per logical call: the rate limiter acquires its permits up front, and the circuit breaker records the final outcome of the whole retry chain as a single success or failure, keeping the failure ratio a measure of logical calls rather than raw attempts.

Reversing the order produces counterproductive behavior. Retry wrapping the circuit breaker means an open circuit surfaces as a BrokenCircuitException inside the retry loop; unless your ShouldHandle predicate explicitly excludes it, the retry strategy burns its entire budget against a circuit that instantly rejects every attempt. Retry wrapping the rate limiter means each retry attempt acquires a fresh lease, so a burst of retries competes with fresh incoming traffic for permits and can starve it entirely.
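The wrapping semantics can be made concrete with plain delegates. This is a conceptual sketch of strategy nesting, not Polly’s actual API; each stage name is illustrative:

```csharp
using System;
using System.Collections.Generic;

var log = new List<string>();

// Each "strategy" is a decorator around the next stage. Composing
// outermost-first reproduces the pipeline order: the first strategy
// added runs first on the way in.
Func<Func<string>, Func<string>> Stage(string name) =>
    next => () =>
    {
        log.Add(name);
        return next();
    };

// rate limiter → circuit breaker → retry → actual call
Func<string> pipelineFunc =
    Stage("rate limiter")(Stage("circuit breaker")(Stage("retry")(() =>
    {
        log.Add("azure openai call");
        return "ok";
    })));

pipelineFunc();
Console.WriteLine(string.Join(" -> ", log));
// rate limiter -> circuit breaker -> retry -> azure openai call
```

Because retry is the innermost wrapper, re-invoking its `next` delegate repeats only the actual call, never the outer stages — which is exactly why the outer strategies see one logical call per execution.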

Testing the Circuit Breaker

The circuit breaker is only useful if it opens under the conditions you expect. Test it by simulating sustained 429 responses:

// Integration test pattern using WireMock.Net
using WireMock.RequestBuilders;
using WireMock.ResponseBuilders;
using WireMock.Server;

var server = WireMockServer.Start();
server.Given(Request.Create().UsingPost())
    .RespondWith(Response.Create()
        .WithStatusCode(429)
        .WithHeader("Retry-After", "60"));

// Send 12 requests (above MinimumThroughput = 10)
// Expect the circuit to open after 70% failure ratio is reached
// Subsequent requests should throw BrokenCircuitException immediately

Verify that:

  • Requests 1-10 each reach the mock server (circuit is evaluating)
  • After the failure ratio threshold, requests fail with BrokenCircuitException without reaching the server
  • After BreakDuration elapses, a probe request goes through
  • On probe success, subsequent requests proceed normally


⚠ Production Considerations

  • The circuit breaker counts failures globally across all requests. If you use a shared circuit breaker for multiple Azure OpenAI deployments, one failing deployment will open the circuit for all deployments. Maintain separate circuit breakers per deployment endpoint.
  • TokenBucketRateLimiter token counts are in your configured unit (e.g., tokens-per-minute). If you size the bucket by request count instead of token count, a single large prompt can deplete the entire bucket before other requests get a chance. Prefer token-based rate limiting over request-count-based limiting.


🧠 Architect’s Note

Design resilience from the quota up. Start by documenting your Azure OpenAI deployment's TPM and RPM limits. Then configure your rate limiter to 80% of that limit, leaving headroom. Set the circuit breaker to open before your application's retry budget can pile failed attempts into the next quota window. Resilience patterns that fight each other — like aggressive retry against a tight circuit breaker — cause more downtime than no resilience at all.


Key Takeaways

  • Three strategies needed: retry (transient), circuit breaker (sustained), rate limiter (prevention)
  • Strategy order matters: rate limiter → timeout → circuit breaker → retry
  • Microsoft.Extensions.Resilience for HttpClient scenarios; Polly directly for SDK wrapping
  • Circuit breaker: 70% failure ratio, 30s window, 15s break — reasonable defaults for Azure OpenAI
  • Client-side token bucket rate limiter prevents 429s more effectively than retrying after them

Implementation Checklist

  • Add Microsoft.Extensions.Resilience NuGet package
  • Configure retry with exponential backoff, jitter, and Retry-After header handling
  • Add circuit breaker with 70% failure ratio, 30s window, 15s break duration
  • Implement TokenBucketRateLimiter sized to your Azure OpenAI deployment quota
  • Ensure strategy order: rate limiter → timeout → circuit breaker → retry
  • Test the circuit breaker by simulating sustained 429 responses

Frequently Asked Questions

Should I use Polly directly or Microsoft.Extensions.Resilience for Azure OpenAI?

Use Microsoft.Extensions.Resilience (which wraps Polly v8) when you are integrating with ASP.NET Core's HttpClient factory — it gives you AddResilienceHandler() and standardized pipeline configuration. Use Polly directly via ResiliencePipelineBuilder when you need resilience outside of HttpClient, such as wrapping the Azure OpenAI SDK's ChatClient directly.

What retry configuration is appropriate for Azure OpenAI 429 errors?

Use exponential backoff with jitter: start at 2 seconds, double each retry, add ±25% jitter, cap at 30 seconds, maximum 5 retries. The Azure OpenAI Retry-After header should take precedence over your calculated delay when present. The SDK's built-in retry (maxRetries: 3) handles simple cases; Polly adds circuit breaking and shared pipeline coordination.

When should the circuit breaker open for Azure OpenAI calls?

Open the circuit when 70% of requests in a 30-second window fail (minimum 10 requests to avoid triggering on small sample sizes). Keep it open for 15 seconds before trying a single probe request. This prevents extended quota exhaustion — continuously retrying during a sustained outage worsens the situation.

How do I implement client-side rate limiting in .NET to avoid Azure OpenAI 429 errors?

Use System.Threading.RateLimiting.TokenBucketRateLimiter. Configure tokens equal to your per-minute token quota, refill rate matching the quota replenishment period. Acquire tokens proportional to estimated request size before each call. This prevents outgoing request bursts that would trigger server-side 429 responses.

What is the correct order of resilience strategies in a pipeline?

For Azure OpenAI: rate limiter first (reject locally before using network), then timeout (fail fast before retry budget is exhausted), then circuit breaker (stop all calls when backend is struggling), then retry last (retry within the circuit breaker's open window would be futile). This order ensures each strategy sees the right behavior.

Can I share a resilience pipeline across multiple Azure OpenAI clients?

Yes. Build a ResiliencePipeline<HttpResponseMessage> once and share it via DI. Multiple ChatClient instances can share the same pipeline — they will collectively count toward the circuit breaker's failure threshold, which is the desired behavior for quota management across a shared deployment.

How does Microsoft.Extensions.Resilience differ from the Azure SDK's built-in retry?

The Azure SDK's built-in ClientRetryPolicy handles transient HTTP errors and 429s with simple exponential backoff. Microsoft.Extensions.Resilience adds circuit breaking (stops all requests during sustained failures), hedging, timeout policies, and composable pipelines. For single-user apps, the SDK retry is sufficient. For multi-user production apps, add Polly/Extensions.Resilience.


#Polly #Resilience #Azure OpenAI #Circuit Breaker #Rate Limiting #.NET AI