Streaming changes the dynamics of a chat API. Instead of waiting several seconds for a complete response, users see tokens arrive in real time. The experience goes from sluggish to conversational. This workshop builds that experience from scratch, covering project setup, dependency injection, retry logic, three distinct endpoint patterns, error handling, and conversation history — all in a single runnable .NET 9 project.
By the end, you will have a complete API that serves non-streaming JSON, streaming IAsyncEnumerable responses, and Server-Sent Events to browser clients.
Prerequisites
Before you begin, make sure you have the following ready:
- .NET 9 SDK installed
- An Azure subscription with an Azure OpenAI resource provisioned
- A deployed chat model (this workshop uses gpt-4o, but gpt-4o-mini works too)
- Your Azure OpenAI endpoint and API key (or a configured managed identity)
If you are new to prompt design for chat models, the Prompt Engineering Fundamentals in C# guide is worth reading first.
Step 1 — Scaffold the Project
Create a new Minimal API project and move into the directory (the webapi template produces a Minimal API by default in .NET 8 and later):
dotnet new webapi -n StreamingChatApi
cd StreamingChatApi
Install the Azure OpenAI SDK from NuGet:
dotnet add package Azure.AI.OpenAI --version 2.1.0
This pulls in the OpenAI base package as a transitive dependency. You do not need to install it separately.
Step 2 — Configure Application Settings
Open appsettings.json and add your Azure OpenAI configuration:
{
  "AzureOpenAI": {
    "Endpoint": "https://<your-resource>.openai.azure.com/",
    "ApiKey": "<your-api-key>",
    "DeploymentName": "gpt-4o"
  },
  "Logging": {
    "LogLevel": {
      "Default": "Information"
    }
  }
}
For production, store the API key in Azure Key Vault or use DefaultAzureCredential with managed identity. Never commit secrets to source control.
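As a sketch of the managed-identity route (this assumes the Azure.Identity package has been added; the endpoint placeholder is the same as above):

```csharp
// Sketch: key-less authentication via managed identity or developer sign-in.
// Assumes: dotnet add package Azure.Identity
using Azure.AI.OpenAI;
using Azure.Identity;

// DefaultAzureCredential probes environment variables, managed identity,
// and local developer credentials (Azure CLI, Visual Studio) in order.
var client = new AzureOpenAIClient(
    new Uri("https://<your-resource>.openai.azure.com/"),
    new DefaultAzureCredential());
```

Swap this constructor into the singleton registration in Step 4 and drop the ApiKey setting entirely.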
Step 3 — Define the Configuration and Request Models
Create a Models folder and add the following files.
Models/AzureOpenAISettings.cs
namespace StreamingChatApi.Models;
public sealed class AzureOpenAISettings
{
    public const string SectionName = "AzureOpenAI";

    public required string Endpoint { get; init; }
    public required string ApiKey { get; init; }
    public required string DeploymentName { get; init; }
}
Models/ChatRequest.cs
namespace StreamingChatApi.Models;
public sealed class ChatRequest
{
    public required string Message { get; init; }
    public string? ConversationId { get; init; }
}
Models/ChatResponse.cs
namespace StreamingChatApi.Models;
public sealed class ChatResponse
{
    public required string Message { get; init; }
    public required string ConversationId { get; init; }
    public int PromptTokens { get; init; }
    public int CompletionTokens { get; init; }
}
Step 4 — Register the Azure OpenAI Client with Retry Logic
The AzureOpenAIClient is thread-safe and should be registered as a singleton. Wrapping it with a retry policy ensures transient failures — especially 429 rate-limit responses — are handled automatically.
Open Program.cs and configure dependency injection:
using System.ClientModel;
using System.ClientModel.Primitives;
using Azure;
using Azure.AI.OpenAI;
using StreamingChatApi.Models;
using StreamingChatApi.Services;
var builder = WebApplication.CreateBuilder(args);

// Bind configuration
builder.Services.Configure<AzureOpenAISettings>(
    builder.Configuration.GetSection(AzureOpenAISettings.SectionName));

// Register AzureOpenAIClient as singleton with retry
builder.Services.AddSingleton(sp =>
{
    var settings = builder.Configuration
        .GetSection(AzureOpenAISettings.SectionName)
        .Get<AzureOpenAISettings>()
        ?? throw new InvalidOperationException("AzureOpenAI settings are missing.");

    var clientOptions = new AzureOpenAIClientOptions
    {
        RetryPolicy = new ClientRetryPolicy(maxRetries: 3)
    };

    return new AzureOpenAIClient(
        new Uri(settings.Endpoint),
        new AzureKeyCredential(settings.ApiKey),
        clientOptions);
});

// Register application services
builder.Services.AddSingleton<ConversationStore>();
builder.Services.AddScoped<ChatService>();

var app = builder.Build();
The ClientRetryPolicy ships with the System.ClientModel library. It retries on 429 and 5xx status codes with exponential back-off, respecting Retry-After headers from Azure. Three retries covers most transient spikes.
Step 5 — Build the Conversation Store
Multi-turn chat requires history. A production system would persist conversations in a database. For this workshop, an in-memory concurrent dictionary does the job.
Services/ConversationStore.cs
using System.Collections.Concurrent;
using OpenAI.Chat;
namespace StreamingChatApi.Services;
public sealed class ConversationStore
{
    private readonly ConcurrentDictionary<string, List<ChatMessage>> _conversations = new();

    public string CreateConversation()
    {
        var id = Guid.NewGuid().ToString("N");
        _conversations[id] = new List<ChatMessage>
        {
            new SystemChatMessage(
                "You are a helpful assistant. Be concise and accurate.")
        };
        return id;
    }

    public List<ChatMessage> GetMessages(string conversationId)
    {
        if (!_conversations.TryGetValue(conversationId, out var messages))
            throw new KeyNotFoundException(
                $"Conversation '{conversationId}' not found.");
        return messages;
    }

    public void AddUserMessage(string conversationId, string content)
    {
        var messages = GetMessages(conversationId);
        lock (messages)
        {
            messages.Add(new UserChatMessage(content));
        }
    }

    public void AddAssistantMessage(string conversationId, string content)
    {
        var messages = GetMessages(conversationId);
        lock (messages)
        {
            messages.Add(new AssistantChatMessage(content));
        }
    }
}
Locking on the list prevents concurrent mutations during streaming. In a real application, you would scope conversation access to authenticated users and apply a maximum history length to stay within token limits.
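A history cap can be sketched as a small helper inside ConversationStore; the method name and the cutoff of 20 messages are illustrative, not part of the workshop code:

```csharp
// Illustrative helper: call this inside the lock in AddUserMessage /
// AddAssistantMessage to bound the history sent with each request.
private const int MaxHistoryLength = 20;

private static void TrimHistory(List<ChatMessage> messages)
{
    // Index 0 holds the system prompt; drop the oldest turns after it.
    while (messages.Count > MaxHistoryLength)
    {
        messages.RemoveAt(1);
    }
}
```

A token-based budget (counting with a tokenizer) is more precise than a message count, but a message cap is often enough for short chats.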
Step 6 — Implement the Chat Service
The ChatService wraps the Azure OpenAI SDK and exposes both non-streaming and streaming methods.
Services/ChatService.cs
using System.Runtime.CompilerServices;
using Azure.AI.OpenAI;
using Microsoft.Extensions.Options;
using OpenAI.Chat;
using StreamingChatApi.Models;
namespace StreamingChatApi.Services;
public sealed class ChatService
{
    private readonly AzureOpenAIClient _aiClient;
    private readonly ConversationStore _store;
    private readonly AzureOpenAISettings _settings;

    public ChatService(
        AzureOpenAIClient aiClient,
        ConversationStore store,
        IOptions<AzureOpenAISettings> settings)
    {
        _aiClient = aiClient;
        _store = store;
        _settings = settings.Value;
    }

    public async Task<ChatResponse> CompleteAsync(ChatRequest request)
    {
        var conversationId = request.ConversationId
            ?? _store.CreateConversation();
        _store.AddUserMessage(conversationId, request.Message);

        var chatClient = _aiClient.GetChatClient(_settings.DeploymentName);
        var messages = _store.GetMessages(conversationId);

        ChatCompletion completion = await chatClient.CompleteChatAsync(messages);
        var reply = completion.Content[0].Text;
        _store.AddAssistantMessage(conversationId, reply);

        return new ChatResponse
        {
            Message = reply,
            ConversationId = conversationId,
            PromptTokens = completion.Usage.InputTokenCount,
            CompletionTokens = completion.Usage.OutputTokenCount
        };
    }

    public async IAsyncEnumerable<string> StreamAsync(
        ChatRequest request,
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        var conversationId = request.ConversationId
            ?? _store.CreateConversation();
        _store.AddUserMessage(conversationId, request.Message);

        var chatClient = _aiClient.GetChatClient(_settings.DeploymentName);
        var messages = _store.GetMessages(conversationId);
        var fullResponse = new System.Text.StringBuilder();

        await foreach (StreamingChatCompletionUpdate update in
            chatClient.CompleteChatStreamingAsync(messages, cancellationToken: cancellationToken))
        {
            foreach (ChatMessageContentPart part in update.ContentUpdate)
            {
                fullResponse.Append(part.Text);
                yield return part.Text;
            }
        }

        _store.AddAssistantMessage(conversationId, fullResponse.ToString());
    }
}
Two critical design decisions live here. First, CompleteAsync returns the entire response in one shot, which is simpler when latency is not the primary concern. Second, StreamAsync returns IAsyncEnumerable<string>, yielding individual token fragments as they arrive. The StringBuilder accumulates the full response so conversation history stays complete.
Step 7 — Wire Up the Endpoints
Add three endpoints to Program.cs after the var app = builder.Build(); line:
// Non-streaming endpoint
app.MapPost("/api/chat", async (ChatRequest request, ChatService chat) =>
{
    try
    {
        var response = await chat.CompleteAsync(request);
        return Results.Ok(response);
    }
    catch (KeyNotFoundException)
    {
        // Thrown by ConversationStore for an unknown conversationId
        return Results.NotFound(
            "Unknown conversationId. Omit it to start a new conversation.");
    }
    catch (ClientResultException ex) when (ex.Status == 401)
    {
        return Results.Problem(
            "Authentication failed. Check your API key.",
            statusCode: 401);
    }
    catch (ClientResultException ex) when (ex.Status == 429)
    {
        return Results.Problem(
            "Rate limit exceeded. Try again later.",
            statusCode: 429);
    }
});
// Streaming endpoint returning IAsyncEnumerable
// Returning the enumerable directly lets ASP.NET Core stream it as a JSON
// array, and binding CancellationToken ties cancellation to client disconnects.
app.MapPost("/api/chat/stream", (
    ChatRequest request,
    ChatService chat,
    CancellationToken cancellationToken) =>
    chat.StreamAsync(request, cancellationToken));
// SSE endpoint for browser clients
app.MapPost("/api/chat/sse", async (
    ChatRequest request,
    ChatService chat,
    HttpContext context) =>
{
    context.Response.ContentType = "text/event-stream";
    context.Response.Headers.CacheControl = "no-cache";
    context.Response.Headers.Connection = "keep-alive";

    try
    {
        await foreach (var token in chat.StreamAsync(
            request, context.RequestAborted))
        {
            var escaped = token
                .Replace("\n", "\\n")
                .Replace("\r", "");
            var line = $"data: {escaped}\n\n";
            await context.Response.WriteAsync(line, context.RequestAborted);
            await context.Response.Body.FlushAsync(context.RequestAborted);
        }
        await context.Response.WriteAsync("data: [DONE]\n\n");
        await context.Response.Body.FlushAsync();
    }
    catch (ClientResultException ex) when (ex.Status == 429)
    {
        await context.Response.WriteAsync(
            "event: error\ndata: Rate limit exceeded\n\n");
        await context.Response.Body.FlushAsync();
    }
    catch (OperationCanceledException)
    {
        // Client disconnected -- nothing to do
    }
});
app.Run();
The /api/chat endpoint is straightforward request-response. The /api/chat/stream endpoint returns IAsyncEnumerable<string>, which ASP.NET Core serializes as a JSON array with chunked transfer encoding. The /api/chat/sse endpoint manually writes the SSE protocol, which browsers can consume using the EventSource API or fetch with a ReadableStream.
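For reference, the raw SSE stream the browser receives looks roughly like this: one data: frame per token, frames separated by a blank line, newlines inside a token escaped as \n. The token boundaries shown are illustrative, since the model decides how text is chunked.

```
data: Dependency

data:  injection is

data:  a technique\nfor...

data: [DONE]
```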
Step 8 — Handle Content Filter Responses
Azure OpenAI applies content filtering by default. When a request or response triggers a filter, the SDK throws a ClientResultException with status 400 and a specific error code. Add global error handling by inserting middleware before the endpoints:
app.Use(async (context, next) =>
{
    try
    {
        await next();
    }
    catch (ClientResultException ex) when (
        ex.Status == 400 && ex.Message.Contains("content_filter"))
    {
        // If streaming already began, the status and headers are committed;
        // the best we can do is stop writing.
        if (context.Response.HasStarted)
        {
            return;
        }
        context.Response.StatusCode = 400;
        await context.Response.WriteAsJsonAsync(new
        {
            error = "Content filter triggered",
            detail = "The request or response was flagged by Azure content safety."
        });
    }
});
If you run into 429 rate-limit or 401 unauthorized errors during development, see the fix for 429 rate-limit errors and the fix for 401 unauthorized errors guides.
Step 9 — Test with curl
Start the application (note the HTTP port printed at startup; the examples below assume 5000):
dotnet run
Non-streaming request:
curl -X POST http://localhost:5000/api/chat \
-H "Content-Type: application/json" \
-d '{"message": "Explain dependency injection in three sentences."}'
You receive a single JSON payload with the full response, conversation ID, and token counts.
Streaming with SSE:
curl -X POST http://localhost:5000/api/chat/sse \
-H "Content-Type: application/json" \
-d '{"message": "What is async/await in C#?"}' \
--no-buffer
Tokens appear incrementally in data: lines. The --no-buffer flag ensures curl writes output as it arrives.
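The /api/chat/stream endpoint can be exercised the same way (as in the other examples, the port is assumed to be 5000); ASP.NET Core streams the body as a JSON array of string fragments:

```shell
# -N is the short form of --no-buffer
curl -N -X POST http://localhost:5000/api/chat/stream \
  -H "Content-Type: application/json" \
  -d '{"message": "What is a span in C#?"}'
```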
Multi-turn conversation:
# First message -- capture the conversationId
curl -X POST http://localhost:5000/api/chat \
-H "Content-Type: application/json" \
-d '{"message": "What is LINQ?"}'
# Follow-up using the returned conversationId
curl -X POST http://localhost:5000/api/chat \
-H "Content-Type: application/json" \
-d '{"message": "Show me an example.", "conversationId": "<id-from-above>"}'
The assistant remembers the first message and builds on it. This works because the conversation store accumulates the full message history and sends it with each request.
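If jq is available, the two calls can be scripted so the conversationId is captured automatically (jq and the port are assumptions about your environment):

```shell
# Capture the conversationId from the first response
ID=$(curl -s -X POST http://localhost:5000/api/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "What is LINQ?"}' | jq -r '.conversationId')

# Reuse it so the assistant sees the earlier turn
curl -s -X POST http://localhost:5000/api/chat \
  -H "Content-Type: application/json" \
  -d "{\"message\": \"Show me an example.\", \"conversationId\": \"$ID\"}"
```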
Step 10 — Consume SSE from a Browser
Here is a minimal JavaScript snippet to consume the SSE endpoint from a web page:
async function streamChat(message) {
  const response = await fetch('/api/chat/sse', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ message })
  });

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  const output = document.getElementById('output');
  let buffer = '';

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop(); // a data: line may be split across chunks
    for (const line of lines) {
      if (line.startsWith('data: ') && line !== 'data: [DONE]') {
        const token = line.slice(6).replace(/\\n/g, '\n');
        output.textContent += token;
      }
    }
  }
}
The fetch API with ReadableStream gives you full control over SSE parsing. For simpler use cases, EventSource works but only supports GET requests. Since our endpoint uses POST, the fetch approach is the better choice.
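The line-buffering concern can also be isolated into a pure helper, sketched here (the function name is ours, not part of any API), which makes the parsing logic easy to unit-test:

```javascript
// Accumulates raw chunk text and extracts complete "data: ..." lines.
// Returns the decoded tokens plus whatever partial line must be kept
// for the next chunk.
function parseSseChunk(buffer, chunk) {
  const lines = (buffer + chunk).split('\n');
  const rest = lines.pop(); // possibly incomplete final line
  const tokens = [];
  for (const line of lines) {
    if (line.startsWith('data: ') && line !== 'data: [DONE]') {
      tokens.push(line.slice(6).replace(/\\n/g, '\n'));
    }
  }
  return { tokens, rest };
}
```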
Complete Project Structure
StreamingChatApi/
  Program.cs
  appsettings.json
  Models/
    AzureOpenAISettings.cs
    ChatRequest.cs
    ChatResponse.cs
  Services/
    ChatService.cs
    ConversationStore.cs
Every file shown in this workshop is complete. Copy them into this structure, update appsettings.json with your Azure OpenAI credentials, and run dotnet run.
What You Learned
This workshop covered the full path from an empty directory to a working streaming chat API. You configured AzureOpenAIClient as a singleton with automatic retry. You built three endpoint patterns — synchronous JSON, IAsyncEnumerable streaming, and SSE — each suited to different client types. You handled 429, 401, and content filter errors at both the endpoint and middleware levels. Finally, you implemented conversation history management for multi-turn interactions.
The Azure OpenAI SDK on NuGet and the Azure OpenAI documentation are the authoritative references for everything shown here. The release notes for Azure.AI.OpenAI 2.1.0 cover the latest SDK changes.
For deeper prompt engineering techniques to improve response quality, continue with the Prompt Engineering Fundamentals in C# guide.