Streaming changes the dynamics of a chat API. Instead of waiting several seconds for a complete response, users see tokens arrive in real time. The experience goes from sluggish to conversational. This workshop builds that experience from scratch, covering project setup, dependency injection, retry logic, three distinct endpoint patterns, error handling, and conversation history — all in a single runnable .NET 9 project.
By the end, you will have a complete API that serves non-streaming JSON, streaming IAsyncEnumerable responses, and Server-Sent Events to browser clients.
Prerequisites
Before you begin, make sure you have the following ready:
- .NET 9 SDK installed
- An Azure subscription with an Azure OpenAI resource provisioned
- A deployed chat model (this workshop uses gpt-4o, but gpt-4o-mini works too)
- Your Azure OpenAI endpoint and API key (or a configured managed identity)
If you are new to prompt design for chat models, the Prompt Engineering Fundamentals in C# guide is worth reading first.
Step 1 — Scaffold the Project
Create a new Minimal API project and move into the directory (the webapi template produces a Minimal API by default in .NET 8 and later):
dotnet new webapi -n StreamingChatApi
cd StreamingChatApi
Install the Azure OpenAI SDK from NuGet:
dotnet add package Azure.AI.OpenAI --version 2.1.0
This pulls in the OpenAI base package as a transitive dependency. You do not need to install it separately.
Step 2 — Configure Application Settings
Open appsettings.json and add your Azure OpenAI configuration:
{
  "AzureOpenAI": {
    "Endpoint": "https://<your-resource>.openai.azure.com/",
    "ApiKey": "<your-api-key>",
    "DeploymentName": "gpt-4o"
  },
  "Logging": {
    "LogLevel": {
      "Default": "Information"
    }
  }
}
For production, store the API key in Azure Key Vault or use DefaultAzureCredential with managed identity. Never commit secrets to source control.
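As a sketch of the managed-identity route (this assumes the Azure.Identity package has been added; the endpoint placeholder is the same as above):

```csharp
// Sketch: key-less authentication via managed identity or developer sign-in.
// Assumes: dotnet add package Azure.Identity
using Azure.AI.OpenAI;
using Azure.Identity;

// DefaultAzureCredential probes environment variables, managed identity,
// and local developer credentials (Azure CLI, Visual Studio) in order.
var client = new AzureOpenAIClient(
    new Uri("https://<your-resource>.openai.azure.com/"),
    new DefaultAzureCredential());
```

Swap this constructor into the singleton registration in Step 4 and drop the ApiKey setting entirely.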
Step 3 — Define the Configuration and Request Models
Create a Models folder and add the following files.
Models/AzureOpenAISettings.cs
namespace StreamingChatApi.Models;
public sealed class AzureOpenAISettings
{
    public const string SectionName = "AzureOpenAI";

    public required string Endpoint { get; init; }
    public required string ApiKey { get; init; }
    public required string DeploymentName { get; init; }
}
Models/ChatRequest.cs
namespace StreamingChatApi.Models;
public sealed class ChatRequest
{
    public required string Message { get; init; }
    public string? ConversationId { get; init; }
}
Models/ChatResponse.cs
namespace StreamingChatApi.Models;
public sealed class ChatResponse
{
    public required string Message { get; init; }
    public required string ConversationId { get; init; }
    public int PromptTokens { get; init; }
    public int CompletionTokens { get; init; }
}
Step 4 — Register the Azure OpenAI Client with Retry Logic
The AzureOpenAIClient is thread-safe and should be registered as a singleton. Wrapping it with a retry policy ensures transient failures — especially 429 rate-limit responses — are handled automatically.
Open Program.cs and configure dependency injection:
using System.ClientModel;
using System.ClientModel.Primitives;
using Azure;
using Azure.AI.OpenAI;
using StreamingChatApi.Models;
using StreamingChatApi.Services;
var builder = WebApplication.CreateBuilder(args);

// Bind configuration
builder.Services.Configure<AzureOpenAISettings>(
    builder.Configuration.GetSection(AzureOpenAISettings.SectionName));

// Register AzureOpenAIClient as singleton with retry
builder.Services.AddSingleton(sp =>
{
    var settings = builder.Configuration
        .GetSection(AzureOpenAISettings.SectionName)
        .Get<AzureOpenAISettings>()
        ?? throw new InvalidOperationException("AzureOpenAI settings are missing.");

    var clientOptions = new AzureOpenAIClientOptions
    {
        RetryPolicy = new ClientRetryPolicy(maxRetries: 3)
    };

    return new AzureOpenAIClient(
        new Uri(settings.Endpoint),
        new AzureKeyCredential(settings.ApiKey),
        clientOptions);
});

// Register application services
builder.Services.AddSingleton<ConversationStore>();
builder.Services.AddScoped<ChatService>();

var app = builder.Build();
The ClientRetryPolicy ships with the System.ClientModel library. It retries on 429 and 5xx status codes with exponential back-off, respecting Retry-After headers from Azure. Three retries covers most transient spikes.
Step 5 — Build the Conversation Store
Multi-turn chat requires history. A production system would persist conversations in a database. For this workshop, an in-memory concurrent dictionary does the job.
Services/ConversationStore.cs
using System.Collections.Concurrent;
using OpenAI.Chat;
namespace StreamingChatApi.Services;
public sealed class ConversationStore
{
    private readonly ConcurrentDictionary<string, List<ChatMessage>> _conversations = new();

    public string CreateConversation()
    {
        var id = Guid.NewGuid().ToString("N");
        _conversations[id] = new List<ChatMessage>
        {
            new SystemChatMessage(
                "You are a helpful assistant. Be concise and accurate.")
        };
        return id;
    }

    public List<ChatMessage> GetMessages(string conversationId)
    {
        if (!_conversations.TryGetValue(conversationId, out var messages))
            throw new KeyNotFoundException(
                $"Conversation '{conversationId}' not found.");
        return messages;
    }

    public void AddUserMessage(string conversationId, string content)
    {
        var messages = GetMessages(conversationId);
        lock (messages)
        {
            messages.Add(new UserChatMessage(content));
        }
    }

    public void AddAssistantMessage(string conversationId, string content)
    {
        var messages = GetMessages(conversationId);
        lock (messages)
        {
            messages.Add(new AssistantChatMessage(content));
        }
    }
}
Locking on the list prevents concurrent mutations during streaming. In a real application, you would scope conversation access to authenticated users and apply a maximum history length to stay within token limits.
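A history cap can be sketched as a small helper inside ConversationStore; the method name and the cutoff of 20 messages are illustrative, not part of the workshop code:

```csharp
// Illustrative helper: call this inside the lock in AddUserMessage /
// AddAssistantMessage to bound the history sent with each request.
private const int MaxHistoryLength = 20;

private static void TrimHistory(List<ChatMessage> messages)
{
    // Index 0 holds the system prompt; drop the oldest turns after it.
    while (messages.Count > MaxHistoryLength)
    {
        messages.RemoveAt(1);
    }
}
```

A token-based budget (counting with a tokenizer) is more precise than a message count, but a message cap is often enough for short chats.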
Step 6 — Implement the Chat Service
The ChatService wraps the Azure OpenAI SDK and exposes both non-streaming and streaming methods.
Services/ChatService.cs
using System.Runtime.CompilerServices;
using Azure.AI.OpenAI;
using Microsoft.Extensions.Options;
using OpenAI.Chat;
using StreamingChatApi.Models;
namespace StreamingChatApi.Services;
public sealed class ChatService
{
    private readonly AzureOpenAIClient _aiClient;
    private readonly ConversationStore _store;
    private readonly AzureOpenAISettings _settings;

    public ChatService(
        AzureOpenAIClient aiClient,
        ConversationStore store,
        IOptions<AzureOpenAISettings> settings)
    {
        _aiClient = aiClient;
        _store = store;
        _settings = settings.Value;
    }

    public async Task<ChatResponse> CompleteAsync(ChatRequest request)
    {
        var conversationId = request.ConversationId
            ?? _store.CreateConversation();
        _store.AddUserMessage(conversationId, request.Message);

        var chatClient = _aiClient.GetChatClient(_settings.DeploymentName);
        var messages = _store.GetMessages(conversationId);

        ChatCompletion completion = await chatClient.CompleteChatAsync(messages);
        var reply = completion.Content[0].Text;
        _store.AddAssistantMessage(conversationId, reply);

        return new ChatResponse
        {
            Message = reply,
            ConversationId = conversationId,
            PromptTokens = completion.Usage.InputTokenCount,
            CompletionTokens = completion.Usage.OutputTokenCount
        };
    }

    public async IAsyncEnumerable<string> StreamAsync(
        ChatRequest request,
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        var conversationId = request.ConversationId
            ?? _store.CreateConversation();
        _store.AddUserMessage(conversationId, request.Message);

        var chatClient = _aiClient.GetChatClient(_settings.DeploymentName);
        var messages = _store.GetMessages(conversationId);
        var fullResponse = new System.Text.StringBuilder();

        await foreach (StreamingChatCompletionUpdate update in
            chatClient.CompleteChatStreamingAsync(messages, cancellationToken: cancellationToken))
        {
            foreach (ChatMessageContentPart part in update.ContentUpdate)
            {
                fullResponse.Append(part.Text);
                yield return part.Text;
            }
        }

        _store.AddAssistantMessage(conversationId, fullResponse.ToString());
    }
}
Two critical design decisions live here. First, CompleteAsync returns the entire response in one shot, which is simpler when latency is not the primary concern. Second, StreamAsync returns IAsyncEnumerable<string>, yielding individual token fragments as they arrive. The StringBuilder accumulates the full response so conversation history stays complete.
Step 7 — Wire Up the Endpoints
Add three endpoints to Program.cs after the var app = builder.Build(); line:
// Non-streaming endpoint
app.MapPost("/api/chat", async (ChatRequest request, ChatService chat) =>
{
    try
    {
        var response = await chat.CompleteAsync(request);
        return Results.Ok(response);
    }
    catch (KeyNotFoundException)
    {
        // Thrown by ConversationStore for an unknown conversationId
        return Results.NotFound(
            "Unknown conversationId. Omit it to start a new conversation.");
    }
    catch (ClientResultException ex) when (ex.Status == 401)
    {
        return Results.Problem(
            "Authentication failed. Check your API key.",
            statusCode: 401);
    }
    catch (ClientResultException ex) when (ex.Status == 429)
    {
        return Results.Problem(
            "Rate limit exceeded. Try again later.",
            statusCode: 429);
    }
});
// Streaming endpoint returning IAsyncEnumerable
// Returning the enumerable directly lets ASP.NET Core stream it as a JSON
// array, and binding CancellationToken ties cancellation to client disconnects.
app.MapPost("/api/chat/stream", (
    ChatRequest request,
    ChatService chat,
    CancellationToken cancellationToken) =>
    chat.StreamAsync(request, cancellationToken));
// SSE endpoint for browser clients
app.MapPost("/api/chat/sse", async (
    ChatRequest request,
    ChatService chat,
    HttpContext context) =>
{
    context.Response.ContentType = "text/event-stream";
    context.Response.Headers.CacheControl = "no-cache";
    context.Response.Headers.Connection = "keep-alive";

    try
    {
        await foreach (var token in chat.StreamAsync(
            request, context.RequestAborted))
        {
            var escaped = token
                .Replace("\n", "\\n")
                .Replace("\r", "");
            var line = $"data: {escaped}\n\n";
            await context.Response.WriteAsync(line, context.RequestAborted);
            await context.Response.Body.FlushAsync(context.RequestAborted);
        }
        await context.Response.WriteAsync("data: [DONE]\n\n");
        await context.Response.Body.FlushAsync();
    }
    catch (ClientResultException ex) when (ex.Status == 429)
    {
        await context.Response.WriteAsync(
            "event: error\ndata: Rate limit exceeded\n\n");
        await context.Response.Body.FlushAsync();
    }
    catch (OperationCanceledException)
    {
        // Client disconnected -- nothing to do
    }
});
app.Run();
The /api/chat endpoint is straightforward request-response. The /api/chat/stream endpoint returns IAsyncEnumerable<string>, which ASP.NET Core serializes as a JSON array with chunked transfer encoding. The /api/chat/sse endpoint manually writes the SSE protocol, which browsers can consume using the EventSource API or fetch with a ReadableStream.
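For reference, the raw SSE stream the browser receives looks roughly like this: one data: frame per token, frames separated by a blank line, newlines inside a token escaped as \n. The token boundaries shown are illustrative, since the model decides how text is chunked.

```
data: Dependency

data:  injection is

data:  a technique\nfor...

data: [DONE]
```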
Step 8 — Handle Content Filter Responses
Azure OpenAI applies content filtering by default. When a request or response triggers a filter, the SDK throws a ClientResultException with status 400 and a specific error code. Add global error handling by inserting middleware before the endpoints:
app.Use(async (context, next) =>
{
    try
    {
        await next();
    }
    catch (ClientResultException ex) when (
        ex.Status == 400 && ex.Message.Contains("content_filter"))
    {
        // If streaming already began, the status and headers are committed;
        // the best we can do is stop writing.
        if (context.Response.HasStarted)
        {
            return;
        }
        context.Response.StatusCode = 400;
        await context.Response.WriteAsJsonAsync(new
        {
            error = "Content filter triggered",
            detail = "The request or response was flagged by Azure content safety."
        });
    }
});
If you run into 429 rate-limit or 401 unauthorized errors during development, see the fix for 429 rate-limit errors and the fix for 401 unauthorized errors guides.
Step 9 — Test with curl
Start the application (note the HTTP port printed at startup; the examples below assume 5000):
dotnet run
Non-streaming request:
curl -X POST http://localhost:5000/api/chat \
-H "Content-Type: application/json" \
-d '{"message": "Explain dependency injection in three sentences."}'
You receive a single JSON payload with the full response, conversation ID, and token counts.
Streaming with SSE:
curl -X POST http://localhost:5000/api/chat/sse \
-H "Content-Type: application/json" \
-d '{"message": "What is async/await in C#?"}' \
--no-buffer
Tokens appear incrementally in data: lines. The --no-buffer flag ensures curl writes output as it arrives.
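The /api/chat/stream endpoint can be exercised the same way (as in the other examples, the port is assumed to be 5000); ASP.NET Core streams the body as a JSON array of string fragments:

```shell
# -N is the short form of --no-buffer
curl -N -X POST http://localhost:5000/api/chat/stream \
  -H "Content-Type: application/json" \
  -d '{"message": "What is a span in C#?"}'
```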
Multi-turn conversation:
# First message -- capture the conversationId
curl -X POST http://localhost:5000/api/chat \
-H "Content-Type: application/json" \
-d '{"message": "What is LINQ?"}'
# Follow-up using the returned conversationId
curl -X POST http://localhost:5000/api/chat \
-H "Content-Type: application/json" \
-d '{"message": "Show me an example.", "conversationId": "<id-from-above>"}'
The assistant remembers the first message and builds on it. This works because the conversation store accumulates the full message history and sends it with each request.
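If jq is available, the two calls can be scripted so the conversationId is captured automatically (jq and the port are assumptions about your environment):

```shell
# Capture the conversationId from the first response
ID=$(curl -s -X POST http://localhost:5000/api/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "What is LINQ?"}' | jq -r '.conversationId')

# Reuse it so the assistant sees the earlier turn
curl -s -X POST http://localhost:5000/api/chat \
  -H "Content-Type: application/json" \
  -d "{\"message\": \"Show me an example.\", \"conversationId\": \"$ID\"}"
```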
Step 10 — Consume SSE from a Browser
Here is a minimal JavaScript snippet to consume the SSE endpoint from a web page:
async function streamChat(message) {
  const response = await fetch('/api/chat/sse', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ message })
  });

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  const output = document.getElementById('output');
  let buffer = '';

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop(); // a data: line may be split across chunks
    for (const line of lines) {
      if (line.startsWith('data: ') && line !== 'data: [DONE]') {
        const token = line.slice(6).replace(/\\n/g, '\n');
        output.textContent += token;
      }
    }
  }
}
The fetch API with ReadableStream gives you full control over SSE parsing. For simpler use cases, EventSource works but only supports GET requests. Since our endpoint uses POST, the fetch approach is the better choice.
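The line-buffering concern can also be isolated into a pure helper, sketched here (the function name is ours, not part of any API), which makes the parsing logic easy to unit-test:

```javascript
// Accumulates raw chunk text and extracts complete "data: ..." lines.
// Returns the decoded tokens plus whatever partial line must be kept
// for the next chunk.
function parseSseChunk(buffer, chunk) {
  const lines = (buffer + chunk).split('\n');
  const rest = lines.pop(); // possibly incomplete final line
  const tokens = [];
  for (const line of lines) {
    if (line.startsWith('data: ') && line !== 'data: [DONE]') {
      tokens.push(line.slice(6).replace(/\\n/g, '\n'));
    }
  }
  return { tokens, rest };
}
```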
Complete Project Structure
StreamingChatApi/
  Program.cs
  appsettings.json
  Models/
    AzureOpenAISettings.cs
    ChatRequest.cs
    ChatResponse.cs
  Services/
    ChatService.cs
    ConversationStore.cs
Every file shown in this workshop is complete. Copy them into this structure, update appsettings.json with your Azure OpenAI credentials, and run dotnet run.
What You Learned
This workshop covered the full path from an empty directory to a working streaming chat API. You configured AzureOpenAIClient as a singleton with automatic retry. You built three endpoint patterns — synchronous JSON, IAsyncEnumerable streaming, and SSE — each suited to different client types. You handled 429, 401, and content filter errors at both the endpoint and middleware levels. Finally, you implemented conversation history management for multi-turn interactions.
The Azure OpenAI SDK on NuGet and the Azure OpenAI documentation are the authoritative references for everything shown here. The release notes for Azure.AI.OpenAI 2.1.0 cover the latest SDK changes.
For deeper prompt engineering techniques to improve response quality, continue with the Prompt Engineering Fundamentals in C# guide.