
Semantic Kernel Streaming in C#: Chat, Tokens, Error Handling

Intermediate Original .NET 9 Microsoft.SemanticKernel 1.54.0 Azure.AI.OpenAI 2.1.0 Microsoft.ML.Tokenizers 0.22.0
By Rajesh Mishra · Mar 21, 2026 · 13 min read
Verified Mar 2026 .NET 9 Microsoft.SemanticKernel 1.54.0
In 30 Seconds

Semantic Kernel streaming in C# uses GetStreamingChatMessageContentsAsync on IChatCompletionService to return IAsyncEnumerable<StreamingChatMessageContent>, or InvokeStreamingAsync on the Kernel for prompt-template functions. Token usage metadata is null during streaming in SK 1.x — workaround by accumulating text and counting with TiktokenTokenizer post-stream. Content filter silent failures stop the stream abruptly without exceptions; detect via FinishReason on the last chunk. Auto function calling works transparently inside the streaming loop. Blazor Server streaming requires await InvokeAsync(StateHasChanged) for thread-safe UI updates. Wrap all streaming loops in try/catch for ClientResultException and OperationCanceledException.

Streaming transforms the feel of an AI application. Instead of a blank screen followed by a wall of text, users see the response build in real time — word by word. Semantic Kernel exposes two distinct APIs for streaming, each suited to different scenarios. Choosing the right one, handling the gaps in token tracking, detecting silent content filter failures, and writing thread-safe Blazor components all require patterns that are not obvious from the documentation alone.

This guide covers all of it — the two streaming APIs, SSE endpoints, token counting workarounds, content filter detection, automatic function calling mid-stream, and Blazor integration — with complete, compilable code examples throughout.

For a lower-level foundation of the Azure OpenAI streaming mechanics that Semantic Kernel wraps, see Build a Streaming Chat API with Azure OpenAI and .NET.

The Two Streaming APIs

Semantic Kernel exposes streaming through two different entry points. Understanding which to use saves significant confusion.

Direct Service Call: GetStreamingChatMessageContentsAsync

The first approach calls GetStreamingChatMessageContentsAsync directly on IChatCompletionService. This is the preferred path for ChatHistory-based conversations:

using Microsoft.Extensions.DependencyInjection;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;
using Microsoft.SemanticKernel.Connectors.OpenAI;
using System.Text;

// kernel is an injected Kernel instance
var chatCompletionService = kernel.Services
    .GetRequiredService<IChatCompletionService>();

var chatHistory = new ChatHistory();
chatHistory.AddSystemMessage("You are a helpful .NET assistant.");
chatHistory.AddUserMessage("Explain IAsyncEnumerable in two sentences.");

var settings = new OpenAIPromptExecutionSettings
{
    MaxTokens = 500
};

var responseBuffer = new StringBuilder();

await foreach (StreamingChatMessageContent chunk in
    chatCompletionService.GetStreamingChatMessageContentsAsync(
        chatHistory,
        settings,
        kernel))
{
    // chunk.Content can be null — always null-check
    if (chunk.Content is not null)
    {
        responseBuffer.Append(chunk.Content);
        Console.Write(chunk.Content);
    }
}

Console.WriteLine();
Console.WriteLine($"Full response: {responseBuffer}");

The method signature is:

IAsyncEnumerable<StreamingChatMessageContent> GetStreamingChatMessageContentsAsync(
    ChatHistory chatHistory,
    PromptExecutionSettings? executionSettings = null,
    Kernel? kernel = null,
    CancellationToken cancellationToken = default)

StreamingChatMessageContent.Content is string? — it can be null for metadata chunks (like tool-call initiation frames). Always null-check before appending.

StreamingChatMessageContent.ChoiceIndex identifies which model choice the chunk belongs to (almost always 0 for single-completion requests).

Kernel Function Call: InvokeStreamingAsync

The second approach uses kernel.InvokeStreamingAsync<StreamingChatMessageContent>, which requires a KernelFunction as input. This is the right choice when you have a prompt template stored as a semantic function:

using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;
using System.Text;

// Create a streaming function from a prompt template
var promptTemplate = "Summarize the following in {{$style}} style:\n\n{{$input}}";

KernelFunction summarizeFunction = KernelFunctionFactory.CreateFromPrompt(
    promptTemplate,
    functionName: "Summarize");

var args = new KernelArguments
{
    ["input"] = "Semantic Kernel is an open-source SDK...",
    ["style"] = "executive"
};

var responseBuffer = new StringBuilder();

await foreach (StreamingChatMessageContent chunk in
    kernel.InvokeStreamingAsync<StreamingChatMessageContent>(
        summarizeFunction,
        args))
{
    if (chunk.Content is not null)
    {
        responseBuffer.Append(chunk.Content);
        Console.Write(chunk.Content);
    }
}

For streaming raw ChatHistory-based chat without a prompt template, prefer GetStreamingChatMessageContentsAsync — it is simpler and does not require wrapping the conversation in a KernelFunction.

Building an SSE Endpoint in ASP.NET Core

Server-Sent Events (SSE) is the browser-native protocol for pushing streaming text from server to client. Setting up an SSE endpoint in ASP.NET Core Minimal API with Semantic Kernel requires three things: the correct response headers, flushing after each chunk, and proper data: line formatting.

Here is a complete, production-ready SSE endpoint:

using Microsoft.Extensions.DependencyInjection;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;
using Microsoft.SemanticKernel.Connectors.OpenAI;
using System.Text;
using System.Text.Json;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddKernel()
    .AddAzureOpenAIChatCompletion(
        deploymentName: builder.Configuration["AzureOpenAI:DeploymentName"]!,
        endpoint: builder.Configuration["AzureOpenAI:Endpoint"]!,
        apiKey: builder.Configuration["AzureOpenAI:ApiKey"]!);

var app = builder.Build();

app.MapPost("/chat/stream", async (
    ChatStreamRequest req,
    Kernel kernel,
    HttpResponse resp,
    CancellationToken ct) =>
{
    // Required SSE headers
    resp.ContentType = "text/event-stream";
    resp.Headers["Cache-Control"] = "no-cache";
    resp.Headers["X-Accel-Buffering"] = "no"; // Disable nginx buffering

    var chatHistory = new ChatHistory();
    chatHistory.AddSystemMessage("You are a helpful .NET assistant.");
    chatHistory.AddUserMessage(req.Message);

    var settings = new OpenAIPromptExecutionSettings { MaxTokens = 1000 };
    var chatService = kernel.Services.GetRequiredService<IChatCompletionService>();
    var responseBuffer = new StringBuilder();

    // Use a timeout to prevent hanging streams
    using var timeoutCts = CancellationTokenSource.CreateLinkedTokenSource(ct);
    timeoutCts.CancelAfter(TimeSpan.FromSeconds(60));

    try
    {
        await foreach (var chunk in chatService.GetStreamingChatMessageContentsAsync(
            chatHistory, settings, kernel, timeoutCts.Token))
        {
            if (chunk.Content is not null)
            {
                responseBuffer.Append(chunk.Content);

                // Serialize the chunk as JSON for structured SSE payloads
                var payload = JsonSerializer.Serialize(new { text = chunk.Content });
                await resp.WriteAsync($"data: {payload}\n\n", timeoutCts.Token);
                await resp.Body.FlushAsync(timeoutCts.Token);
            }

            // Check for content filter silent failure on the final chunk
            if (chunk.FinishReason == "content_filter")
            {
                await resp.WriteAsync("event: content_filter\ndata: {}\n\n", ct);
                await resp.Body.FlushAsync(ct);
                return;
            }
        }

        // Signal stream completion
        await resp.WriteAsync("data: [DONE]\n\n", ct);
        await resp.Body.FlushAsync(ct);
    }
    catch (OperationCanceledException)
    {
        // Client disconnected or timeout — nothing to send
    }
    catch (System.ClientModel.ClientResultException ex)
    {
        var errorPayload = JsonSerializer.Serialize(new
        {
            error = $"API error {ex.Status}: {ex.Message}"
        });
        await resp.WriteAsync($"event: error\ndata: {errorPayload}\n\n", ct);
        await resp.Body.FlushAsync(ct);
    }
});

app.Run();

record ChatStreamRequest(string Message);

Key implementation details:

  • resp.ContentType = "text/event-stream" tells the browser this is an SSE stream
  • Cache-Control: no-cache prevents browsers and intermediaries from caching the event stream
  • X-Accel-Buffering: no disables nginx buffering for deployments behind a reverse proxy
  • await resp.Body.FlushAsync() after each write ensures chunks reach the client immediately rather than buffering
  • A CancellationTokenSource linked to the request CancellationToken with a 60-second timeout prevents hung streams
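For completeness, here is a minimal client-side consumer sketch for the endpoint above, using only HttpClient from the base class library. The base address is an assumption for local development, and error handling is elided:

```csharp
using System.Net.Http.Json;
using System.Text;

// Minimal SSE consumer for the /chat/stream endpoint above.
// Assumes the server is listening at http://localhost:5000.
using var client = new HttpClient { BaseAddress = new Uri("http://localhost:5000") };

using var request = new HttpRequestMessage(HttpMethod.Post, "/chat/stream")
{
    Content = JsonContent.Create(new { Message = "Explain IAsyncEnumerable briefly." })
};

// ResponseHeadersRead lets us start reading before the body is complete —
// essential for streaming; the default would buffer the entire response.
using var response = await client.SendAsync(
    request, HttpCompletionOption.ResponseHeadersRead);
response.EnsureSuccessStatusCode();

await using var stream = await response.Content.ReadAsStreamAsync();
using var reader = new StreamReader(stream, Encoding.UTF8);

string? line;
while ((line = await reader.ReadLineAsync()) is not null)
{
    if (!line.StartsWith("data: ")) continue;   // skip event:/blank/comment lines

    var payload = line["data: ".Length..];
    if (payload == "[DONE]") break;             // server's completion sentinel

    Console.WriteLine(payload);                 // JSON like {"text":"..."}
}
```

This deliberately handles only the `data:` lines the endpoint emits; a fuller client would also dispatch on the `event: content_filter` and `event: error` frames.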

The Token Tracking Gap

When you call GetStreamingChatMessageContentsAsync, Semantic Kernel does not expose token usage in the streaming path. In non-streaming calls, FunctionResult.Metadata["Usage"] contains the token counts. In streaming calls in SK 1.x, this value is null.

This is a known limitation of the current SDK. The Azure OpenAI API does send a final usage chunk at stream completion, but Semantic Kernel’s streaming pipeline does not surface it as structured CompletionUsage metadata through the StreamingChatMessageContent API.

The impact: if you are building billing logic, quota enforcement, or token budget tracking that reads usage from streaming function results, you will silently record zero for every streaming call.

Token Counting Workaround

The workaround is to accumulate the full streamed text and count tokens client-side after the stream closes, using TiktokenTokenizer from Microsoft.ML.Tokenizers 0.22.0.

For a full reference on TikToken token counting in .NET, see Token Counting and Context Management in C# for Azure OpenAI.

using Microsoft.ML.Tokenizers;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;
using Microsoft.SemanticKernel.Connectors.OpenAI;
using System.Text;

var chatService = kernel.Services.GetRequiredService<IChatCompletionService>();
var chatHistory = new ChatHistory();
chatHistory.AddSystemMessage("You are a helpful .NET assistant.");
chatHistory.AddUserMessage("What is async/await in C#?");

var settings = new OpenAIPromptExecutionSettings { MaxTokens = 500 };
var responseBuffer = new StringBuilder();

await foreach (var chunk in chatService.GetStreamingChatMessageContentsAsync(
    chatHistory, settings, kernel))
{
    if (chunk.Content is not null)
    {
        responseBuffer.Append(chunk.Content);
    }
}

// After stream completes, count tokens on the accumulated text
string completionText = responseBuffer.ToString();

// TiktokenTokenizer works offline — no API call required
var tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");
int completionTokens = tokenizer.CountTokens(completionText);

// Count input tokens for the full chat history
int promptTokens = 0;
foreach (var message in chatHistory)
{
    // 4 overhead tokens per message + content tokens
    promptTokens += 4;
    if (message.Content is not null)
        promptTokens += tokenizer.CountTokens(message.Content);
}
promptTokens += 2; // Reply priming

Console.WriteLine($"Prompt tokens (estimated): {promptTokens}");
Console.WriteLine($"Completion tokens (estimated): {completionTokens}");
Console.WriteLine($"Total (estimated): {promptTokens + completionTokens}");

This approach is accurate to approximately 97-99% of the actual billed tokens. The small gap comes from tool schema tokens and internal formatting overhead that the API includes but client-side counting does not see. Add a 3-5% buffer to your estimates for quota enforcement calculations.

TiktokenTokenizer.CreateForModel("gpt-4o") selects the correct encoding for the model you name — o200k_base for the GPT-4o family, cl100k_base for GPT-4 and GPT-3.5-Turbo. Call it once at startup and reuse the instance; construction (loading the vocabulary) is the expensive operation.
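One way to follow that reuse advice in an ASP.NET Core app is to register the tokenizer as a DI singleton at startup. This is a sketch — the /count endpoint is illustrative, and the singleton lifetime assumes CountTokens is safe for concurrent use (it performs no mutation in typical usage, but verify for your library version):

```csharp
using Microsoft.ML.Tokenizers;

var builder = WebApplication.CreateBuilder(args);

// Build the tokenizer once; CreateForModel loads the vocabulary,
// which is the slow part. Every consumer then shares the instance.
builder.Services.AddSingleton(TiktokenTokenizer.CreateForModel("gpt-4o"));

var app = builder.Build();

// Illustrative endpoint: resolves the shared tokenizer from DI per request.
app.MapGet("/count", (string text, TiktokenTokenizer tokenizer) =>
    new { tokens = tokenizer.CountTokens(text) });

app.Run();
```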

Content Filter Silent Failures

Azure OpenAI applies content safety filters during streaming. Unlike non-streaming calls — where a content filter violation throws a ClientResultException with status 400 — a filter hit during streaming causes the stream to stop abruptly, with no exception thrown.

This means your await foreach loop will exit normally (no exception) even though the response was cut short. The only indicator is the FinishReason property on the final StreamingChatMessageContent chunk.

using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;
using System.Text;

var chatService = kernel.Services.GetRequiredService<IChatCompletionService>();
var chatHistory = new ChatHistory();
chatHistory.AddUserMessage(userInput);

var responseBuffer = new StringBuilder();
string? lastFinishReason = null;
bool streamCompletedNormally = false;

using var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
cts.CancelAfter(TimeSpan.FromSeconds(45));

try
{
    await foreach (var chunk in chatService.GetStreamingChatMessageContentsAsync(
        chatHistory, null, kernel, cts.Token))
    {
        if (chunk.Content is not null)
        {
            responseBuffer.Append(chunk.Content);
        }

        // Capture finish reason from the last chunk
        if (chunk.FinishReason is not null)
        {
            lastFinishReason = chunk.FinishReason;
        }
    }

    streamCompletedNormally = true;
}
catch (OperationCanceledException)
{
    // Timeout or client disconnect
    Console.WriteLine("Stream timed out or was cancelled.");
}

// Post-stream analysis
if (streamCompletedNormally)
{
    if (lastFinishReason == "content_filter")
    {
        Console.WriteLine("Content filter triggered — response discarded.");
        // Do not store partial response to conversation history
    }
    else if (lastFinishReason == "stop" || lastFinishReason == "length")
    {
        Console.WriteLine($"Stream completed normally. Finish reason: {lastFinishReason}");
        Console.WriteLine($"Response: {responseBuffer}");
    }
    else if (responseBuffer.Length > 0 && lastFinishReason is null)
    {
        // Stream stopped with content but no finish reason — unexpected termination
        Console.WriteLine("Stream terminated unexpectedly (no finish reason).");
    }
}

The timeout CancellationTokenSource is a second line of defense: if the Azure OpenAI API stalls mid-stream without sending any more data or a finish reason, the cancellation fires and prevents the request from hanging indefinitely.

Streaming with Auto Function Calling

When FunctionChoiceBehavior.Auto() is set, Semantic Kernel intercepts tool-call chunks mid-stream, invokes the matching KernelFunction, feeds the result back to the model, and continues the response stream — all transparently inside the await foreach loop.

Here is the complete pattern with a registered plugin:

using System.ComponentModel;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;
using Microsoft.SemanticKernel.Connectors.OpenAI;
using System.Text;

// Define a plugin
public class WeatherPlugin
{
    [KernelFunction("get_weather")]
    [Description("Gets the current weather for a city. Returns temperature and conditions.")]
    public string GetWeather(
        [Description("The city name, e.g. 'London' or 'Seattle'")] string city)
    {
        // Replace with a real weather API call in production
        return city.ToLowerInvariant() switch
        {
            "london" => $"Current weather in {city}: 12°C, overcast",
            "seattle" => $"Current weather in {city}: 9°C, partly cloudy",
            _ => $"Current weather in {city}: 20°C, clear skies"
        };
    }
}

// In your service or component:
// Register the plugin with the kernel before streaming
kernel.Plugins.AddFromObject(new WeatherPlugin(), "WeatherPlugin");

var chatHistory = new ChatHistory();
chatHistory.AddUserMessage("What is the weather like in London right now?");

// Enable automatic function calling
var settings = new OpenAIPromptExecutionSettings
{
    FunctionChoiceBehavior = FunctionChoiceBehavior.Auto(),
    MaxTokens = 500
};

var chatService = kernel.Services.GetRequiredService<IChatCompletionService>();
var responseBuffer = new StringBuilder();

await foreach (var chunk in chatService.GetStreamingChatMessageContentsAsync(
    chatHistory, settings, kernel))
{
    if (chunk.Content is not null)
    {
        // During a tool call, chunks have no Content text
        // Only text chunks arrive here — tool call dispatch is invisible
        responseBuffer.Append(chunk.Content);
        Console.Write(chunk.Content);
    }
}

// Final response after function call round-trip
Console.WriteLine($"\nFull response: {responseBuffer}");

What happens under the hood during a tool call mid-stream:

  1. The model sends tool-call JSON chunks (no Content text — these are internal protocol frames)
  2. Semantic Kernel detects the tool-call structure and pauses iteration
  3. SK invokes WeatherPlugin.GetWeather("London") on your C# method
  4. SK appends the tool result to the conversation and makes a second API call
  5. The final response streams back normally with Content text chunks

Your await foreach loop sees only text chunks. The tool call round-trip is completely transparent. This means you do not need to write any dispatch logic — just iterate and accumulate.

If you need to observe tool calls as they happen (for logging or tracing), register an IFunctionInvocationFilter on the kernel rather than monitoring the streaming loop.
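A minimal filter sketch along those lines — the class name and console output are illustrative; the interface shape is the SK 1.x IFunctionInvocationFilter contract:

```csharp
using Microsoft.SemanticKernel;

// Logs every automatic function invocation, including tool calls
// that Semantic Kernel dispatches mid-stream.
public sealed class FunctionLoggingFilter : IFunctionInvocationFilter
{
    public async Task OnFunctionInvocationAsync(
        FunctionInvocationContext context,
        Func<FunctionInvocationContext, Task> next)
    {
        Console.WriteLine(
            $"Invoking {context.Function.PluginName}.{context.Function.Name}");

        await next(context); // runs the actual function (and any inner filters)

        Console.WriteLine(
            $"Completed {context.Function.Name}: {context.Result}");
    }
}

// Registration — add to the kernel before starting the streaming loop:
// kernel.FunctionInvocationFilters.Add(new FunctionLoggingFilter());
```

Because the filter wraps the call to next, it sees both the arguments going in and the result coming back, without touching the streaming loop at all.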

Blazor Server Streaming

For a full Blazor chatbot implementation, see Build a Streaming AI Chatbot with Blazor and Semantic Kernel. This section focuses specifically on the streaming mechanics that differ from non-streaming Blazor components.

The two critical requirements for streaming in Blazor Server are:

  1. [StreamRendering(true)] attribute on the component class — enables incremental HTML flushing during the initial render pass
  2. await InvokeAsync(StateHasChanged) inside the await foreach loop — marshals UI updates to the Blazor render thread
@page "/streaming-demo"
@rendermode InteractiveServer
@attribute [StreamRendering(true)]
@inject Kernel Kernel
@using Microsoft.SemanticKernel
@using Microsoft.SemanticKernel.ChatCompletion
@using Microsoft.SemanticKernel.Connectors.OpenAI
@using Microsoft.Extensions.DependencyInjection
@using System.Text

<h2>Streaming Demo</h2>

<div>
    <textarea @bind="_userInput" rows="2" disabled="@_isStreaming"
              placeholder="Ask something..."></textarea>
    <button @onclick="StreamResponseAsync" disabled="@_isStreaming">
        @(_isStreaming ? "Streaming..." : "Send")
    </button>
</div>

@if (_isStreaming || _responseText.Length > 0)
{
    <div class="response">
        <p>@_responseText</p>
        @if (_isStreaming)
        {
            <span class="cursor">|</span>
        }
    </div>
}

@if (!string.IsNullOrEmpty(_errorMessage))
{
    <div class="error">@_errorMessage</div>
}

@if (_estimatedTokens > 0)
{
    <div class="token-info">Estimated completion tokens: @_estimatedTokens</div>
}

@code {
    private string _userInput = "";
    private string _responseText = "";
    private string _errorMessage = "";
    private bool _isStreaming;
    private int _estimatedTokens;

    private async Task StreamResponseAsync()
    {
        if (string.IsNullOrWhiteSpace(_userInput)) return;

        _isStreaming = true;
        _responseText = "";
        _errorMessage = "";
        _estimatedTokens = 0;
        await InvokeAsync(StateHasChanged);

        var userText = _userInput;
        _userInput = "";

        var responseBuffer = new StringBuilder();

        try
        {
            var chatHistory = new ChatHistory();
            chatHistory.AddSystemMessage("You are a helpful .NET assistant.");
            chatHistory.AddUserMessage(userText);

            var settings = new OpenAIPromptExecutionSettings { MaxTokens = 500 };
            var chatService = Kernel.Services
                .GetRequiredService<IChatCompletionService>();

            using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(45));

            await foreach (var chunk in chatService.GetStreamingChatMessageContentsAsync(
                chatHistory, settings, Kernel, cts.Token))
            {
                if (chunk.Content is not null)
                {
                    responseBuffer.Append(chunk.Content);

                    // Update the bound string — Blazor re-renders on StateHasChanged
                    _responseText = responseBuffer.ToString();

                    // CRITICAL: marshal to render thread — never call StateHasChanged() directly
                    await InvokeAsync(StateHasChanged);
                }
            }

            // Post-stream: count tokens from accumulated text
            var tokenizer = Microsoft.ML.Tokenizers.TiktokenTokenizer.CreateForModel("gpt-4o");
            _estimatedTokens = tokenizer.CountTokens(responseBuffer.ToString());
        }
        catch (OperationCanceledException)
        {
            _errorMessage = "Request timed out. Partial response is shown above.";
        }
        catch (System.ClientModel.ClientResultException ex)
        {
            _errorMessage = $"API error ({ex.Status}): {ex.Message}";
            if (responseBuffer.Length > 0)
            {
                // Show whatever was accumulated before the error
                _responseText = responseBuffer.ToString() + " [truncated]";
            }
        }
        catch (Exception ex)
        {
            _errorMessage = $"Unexpected error: {ex.Message}";
        }
        finally
        {
            _isStreaming = false;
            // Final state update to hide the cursor and show token count
            await InvokeAsync(StateHasChanged);
        }
    }
}

The await InvokeAsync(StateHasChanged) call inside the await foreach loop is the engine of streaming UI updates. Each call pushes the current _responseText value through the SignalR connection to the browser, where Blazor patches only the changed DOM node. The result is smooth token-by-token rendering without a full page refresh.

Calling StateHasChanged() directly from the streaming loop can throw an InvalidOperationException ("The current thread is not associated with the Dispatcher") because await foreach may resume on a thread pool thread outside the Blazor synchronization context. InvokeAsync marshals the call back to the component's dispatcher.
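One refinement worth considering: re-rendering on every chunk can flood the SignalR connection on fast models. Below is a hedged sketch of a drop-in variant of the component's loop that throttles renders to roughly every 50 ms (the interval is an illustrative choice, not a recommendation from the SK docs):

```csharp
// Drop-in variant of the component's streaming loop: renders at most
// ~20 times per second instead of once per chunk.
var lastRender = DateTime.MinValue;

await foreach (var chunk in chatService.GetStreamingChatMessageContentsAsync(
    chatHistory, settings, Kernel, cts.Token))
{
    if (chunk.Content is null) continue;
    responseBuffer.Append(chunk.Content);

    if (DateTime.UtcNow - lastRender > TimeSpan.FromMilliseconds(50))
    {
        _responseText = responseBuffer.ToString();
        await InvokeAsync(StateHasChanged);
        lastRender = DateTime.UtcNow;
    }
}

// One final render so the last few chunks are never dropped.
_responseText = responseBuffer.ToString();
await InvokeAsync(StateHasChanged);
```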

Error Handling

Streaming error handling requires a different mindset than non-streaming calls. You can receive a partial response before failure, and different failure modes arrive at different points in the loop.

The complete pattern:

using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;
using System.Text;

var chatService = kernel.Services.GetRequiredService<IChatCompletionService>();
var chatHistory = new ChatHistory();
chatHistory.AddUserMessage(userInput);

var responseBuffer = new StringBuilder();
string? lastFinishReason = null;

using var cts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
cts.CancelAfter(TimeSpan.FromSeconds(45));

try
{
    await foreach (var chunk in chatService.GetStreamingChatMessageContentsAsync(
        chatHistory, null, kernel, cts.Token))
    {
        if (chunk.Content is not null)
        {
            responseBuffer.Append(chunk.Content);
        }

        if (chunk.FinishReason is not null)
        {
            lastFinishReason = chunk.FinishReason;
        }
    }

    // Stream completed — inspect finish reason
    switch (lastFinishReason)
    {
        case "stop":
            Console.WriteLine($"Complete response:\n{responseBuffer}");
            break;

        case "length":
            Console.WriteLine($"Truncated at max_tokens:\n{responseBuffer}");
            // Consider increasing MaxTokens in settings
            break;

        case "content_filter":
            Console.WriteLine("Content filter triggered — do not display this response.");
            break;

        default:
            if (responseBuffer.Length > 0)
                Console.WriteLine($"Stream ended unexpectedly. Partial:\n{responseBuffer}");
            break;
    }
}
catch (OperationCanceledException)
{
    // Timeout or client disconnect
    // Display partial response if useful
    if (responseBuffer.Length > 50)
    {
        Console.WriteLine($"Request timed out. Partial response:\n{responseBuffer}");
    }
    else
    {
        Console.WriteLine("Request timed out with no usable response.");
    }
}
catch (System.ClientModel.ClientResultException ex)
{
    // API-level errors that arrive before or at stream start
    // ex.Status contains the HTTP status code (401, 429, 400, etc.)
    switch (ex.Status)
    {
        case 401:
            Console.WriteLine("Authentication failed. Check your Azure OpenAI API key.");
            break;
        case 429:
            Console.WriteLine("Rate limit exceeded. Retry after backoff.");
            break;
        default:
            Console.WriteLine($"Azure OpenAI API error {ex.Status}: {ex.Message}");
            break;
    }

    // Show partial response if stream had started before the error
    if (responseBuffer.Length > 0)
    {
        Console.WriteLine($"Partial response before error:\n{responseBuffer}");
    }
}

ClientResultException is in the System.ClientModel namespace, shipped in the System.ClientModel package that the Azure OpenAI client library builds on. Check ex.Status for the HTTP status code. A 401 indicates invalid credentials. A 429 indicates rate limiting — for retry policies targeting 429s in Semantic Kernel, see Fix Azure OpenAI 429 Too Many Requests in .NET.

OperationCanceledException fires for both timeout (from CancellationTokenSource.CancelAfter) and explicit cancellation (client disconnect in ASP.NET Core). When it fires, responseBuffer may contain useful partial content — display it if it is long enough to be meaningful.

The combination of FinishReason inspection and exception handling covers all failure modes: pre-stream API errors (exceptions), content filter mid-stream (FinishReason), timeout (OperationCanceledException), and normal completion (FinishReason “stop” or “length”).
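Because a 429 surfaces as a ClientResultException before any chunks arrive, a retry wrapper can safely re-issue the request as long as nothing has streamed yet. A sketch under that assumption — attempt counts and backoff delays are illustrative, and chatService, chatHistory, settings, kernel, and cts come from the surrounding example:

```csharp
// Retries the streaming call on 429, but only if no content has streamed —
// retrying after partial output would duplicate text for the user.
const int maxAttempts = 3;

for (int attempt = 1; attempt <= maxAttempts; attempt++)
{
    var buffer = new StringBuilder();
    try
    {
        await foreach (var chunk in chatService.GetStreamingChatMessageContentsAsync(
            chatHistory, settings, kernel, cts.Token))
        {
            if (chunk.Content is not null) buffer.Append(chunk.Content);
        }

        Console.WriteLine(buffer);
        break; // success — stop retrying
    }
    catch (System.ClientModel.ClientResultException ex)
        when (ex.Status == 429 && buffer.Length == 0 && attempt < maxAttempts)
    {
        // Exponential backoff: 2s, 4s, ... before the next attempt.
        await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt)), cts.Token);
    }
}
```

The exception filter is the key design choice: a 429 that arrives after chunks have streamed, or on the final attempt, falls through to your normal error handling instead of retrying.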

What You Learned

Semantic Kernel provides two streaming paths: GetStreamingChatMessageContentsAsync on IChatCompletionService for ChatHistory-based conversations (the simpler and more common path), and kernel.InvokeStreamingAsync for prompt-template functions. Token usage is not available in streaming results in SK 1.x — the workaround is to accumulate streamed text and count with TiktokenTokenizer after the stream closes.

Content filter silent failures are the most dangerous edge case in streaming: the stream terminates without an exception, and only FinishReason == "content_filter" on the final chunk reveals what happened. Automatic function calling with FunctionChoiceBehavior.Auto() works transparently inside the streaming loop. Blazor Server streaming requires await InvokeAsync(StateHasChanged) for thread safety. Wrap all streaming loops with a CancellationTokenSource timeout and handle both OperationCanceledException and ClientResultException with access to the partial response accumulated so far.

⚠ Production Considerations

  • Do not assume the stream completed normally just because no exception was thrown. A content filter silent failure terminates the stream without an exception. Always check the FinishReason on the final StreamingChatMessageContent chunk. A 'content_filter' finish reason means the response was cut short and must be discarded or flagged.
  • Token usage is not reliably available during streaming in Semantic Kernel 1.x. Building billing or quota enforcement logic that reads FunctionResult.Metadata["Usage"] after a streaming call will silently record zero usage. Always accumulate streamed text and count tokens with TiktokenTokenizer as the authoritative source for streaming token consumption.
  • Calling GetStreamingChatMessageContentsAsync without a CancellationToken leaves the stream open indefinitely if the Azure OpenAI API stalls. Always pass a CancellationToken from a CancellationTokenSource with a timeout (30-60 seconds for typical chat responses) to prevent thread and connection leaks and hung requests.


🧠 Architect’s Note

Semantic Kernel streaming is powerful but requires defensive coding that non-streaming calls do not. The two key architectural decisions are: first, treat token counting as a post-stream operation rather than a real-time metric — accumulate and count after the stream closes; second, treat FinishReason checking as mandatory rather than optional. A content filter hit that silently terminates the stream will look identical to a normal completion unless you inspect the final chunk. Build your streaming service layer with both of these checks as non-negotiable invariants, not optional add-ons.


Key Takeaways

  • Use GetStreamingChatMessageContentsAsync on IChatCompletionService for direct chat streaming — simpler than InvokeStreamingAsync for ChatHistory-based conversations
  • Token usage metadata (FunctionResult.Metadata["Usage"]) is null during streaming in SK 1.x — count tokens client-side post-stream with TiktokenTokenizer
  • Content filter silent failures terminate the stream abruptly; detect via StreamingChatMessageContent.FinishReason on the final chunk and use a timeout CancellationTokenSource as backup
  • FunctionChoiceBehavior.Auto() enables transparent tool call interception mid-stream — no dispatch code required
  • Blazor Server streaming requires await InvokeAsync(StateHasChanged) — never call StateHasChanged() directly from async streaming loops
  • Catch ClientResultException for API-level errors and OperationCanceledException for timeouts; display partial accumulated response on mid-stream failure

Implementation Checklist

  • Resolve IChatCompletionService from kernel.Services.GetRequiredService<IChatCompletionService>()
  • Call GetStreamingChatMessageContentsAsync with ChatHistory, OpenAIPromptExecutionSettings, and the Kernel
  • Null-check chunk.Content before appending to StringBuilder accumulator
  • After stream completes, call TiktokenTokenizer.CreateForModel('gpt-4o').CountTokens(accumulated) for token estimate
  • Check final chunk FinishReason for 'content_filter' to detect silent stream termination
  • Set FunctionChoiceBehavior.Auto() in execution settings to enable automatic tool call handling mid-stream
  • In Blazor Server, use await InvokeAsync(StateHasChanged) inside the streaming loop
  • Wrap streaming loop in try/catch for ClientResultException and OperationCanceledException; display partial response on failure

Frequently Asked Questions

What is the difference between GetStreamingChatMessageContentsAsync and InvokeStreamingAsync in Semantic Kernel?

GetStreamingChatMessageContentsAsync is called directly on IChatCompletionService and is the simpler path — you pass a ChatHistory, execution settings, and optionally a Kernel. It returns IAsyncEnumerable<StreamingChatMessageContent> immediately. InvokeStreamingAsync<StreamingChatMessageContent> is called on the Kernel and requires a KernelFunction first, which makes it better suited for prompt templates stored as semantic functions. For streaming raw chat conversations, GetStreamingChatMessageContentsAsync is the preferred approach.
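Side by side, the two entry points look like this — a sketch assuming a configured `kernel`, a populated `history`, and an `articleText` string already in scope; the prompt template and variable names are illustrative:

```csharp
using Microsoft.Extensions.DependencyInjection;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;

// Path 1: direct chat streaming — preferred for ChatHistory conversations.
var chat = kernel.Services.GetRequiredService<IChatCompletionService>();
await foreach (var chunk in chat.GetStreamingChatMessageContentsAsync(history, kernel: kernel))
{
    Console.Write(chunk.Content);
}

// Path 2: kernel-level streaming, which requires a KernelFunction first.
var summarize = KernelFunctionFactory.CreateFromPrompt("Summarize: {{$input}}");
await foreach (var chunk in kernel.InvokeStreamingAsync<StreamingChatMessageContent>(
                   summarize, new KernelArguments { ["input"] = articleText }))
{
    Console.Write(chunk.Content);
}
```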

Why is token usage null during Semantic Kernel streaming?

In Semantic Kernel 1.x, FunctionResult.Metadata["Usage"] is null when a function is called via streaming. The underlying Azure OpenAI API does send a final usage chunk at stream end, but Semantic Kernel's streaming pipeline does not surface this as structured metadata the way the non-streaming path does. This is a known limitation. The workaround is to accumulate the streamed text and count tokens client-side with TiktokenTokenizer after the stream completes.
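The workaround in code, assuming `chat` and `history` are already set up. Note this is an estimate of completion tokens only — it excludes prompt and tool-call tokens, so it will not match the billed count exactly.

```csharp
using System.Text;
using Microsoft.ML.Tokenizers;

// Accumulate streamed text, then count tokens after the stream completes.
var accumulated = new StringBuilder();
await foreach (var chunk in chat.GetStreamingChatMessageContentsAsync(history))
{
    accumulated.Append(chunk.Content);
}

// Tokenizer for the gpt-4o vocabulary (o200k_base).
var tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");
int completionTokens = tokenizer.CountTokens(accumulated.ToString());
```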

How do I detect a content filter silent failure during streaming?

When Azure OpenAI triggers a content filter mid-stream, the stream stops abruptly without throwing an exception. Detect this by checking StreamingChatMessageContent.FinishReason on the final chunk — a content-filtered response will have FinishReason set to 'content_filter' rather than 'stop'. Wrap the streaming loop in a CancellationTokenSource with a timeout as a secondary safeguard for cases where no final chunk arrives at all.
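One way to surface the finish reason is to check the connector-specific OpenAIStreamingChatMessageContent type on each chunk — a sketch, since the exact property shape can vary across SK 1.x releases:

```csharp
using Microsoft.SemanticKernel.Connectors.OpenAI;
using OpenAI.Chat;

ChatFinishReason? finishReason = null;
// Secondary safeguard for the case where no final chunk arrives at all.
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(60));

await foreach (var chunk in chat.GetStreamingChatMessageContentsAsync(
    history, cancellationToken: cts.Token))
{
    if (chunk is OpenAIStreamingChatMessageContent openAiChunk &&
        openAiChunk.FinishReason is not null)
    {
        finishReason = openAiChunk.FinishReason;
    }
    Console.Write(chunk.Content);
}

if (finishReason == ChatFinishReason.ContentFilter)
{
    Console.Error.WriteLine("Response was truncated by the content filter.");
}
```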

How does function calling work during Semantic Kernel streaming?

With FunctionChoiceBehavior.Auto() in OpenAIPromptExecutionSettings, Semantic Kernel intercepts tool-call chunks mid-stream, invokes your KernelFunction C# method, feeds the result back to the model, and resumes streaming the final response. This entire loop happens transparently inside GetStreamingChatMessageContentsAsync — you do not write any dispatch code. The streaming loop simply continues iterating; tool-call chunks emit no Content text.
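Everything you write lives in the setup, not the loop. A sketch with a stubbed plugin (the plugin name, method, and return value are illustrative):

```csharp
using System.ComponentModel;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.Connectors.OpenAI;

// Register the plugin; the streaming loop itself needs no changes.
kernel.Plugins.AddFromType<WeatherPlugin>();

var settings = new OpenAIPromptExecutionSettings
{
    FunctionChoiceBehavior = FunctionChoiceBehavior.Auto()
};

// A function the model can call mid-stream (stubbed data for illustration).
public sealed class WeatherPlugin
{
    [KernelFunction, Description("Gets the current temperature for a city.")]
    public string GetTemperature(string city) => $"18 °C in {city}";
}
```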

What is the correct way to call StateHasChanged inside a streaming loop in Blazor Server?

Always use await InvokeAsync(StateHasChanged) inside async streaming loops in Blazor Server components. The await foreach loop may execute on a thread pool thread that is outside the component's SignalR dispatcher context. Calling StateHasChanged() directly from a background thread causes threading exceptions in production. InvokeAsync marshals the call to the correct synchronization context.
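Inside a component's @code block the pattern looks like this — a sketch assuming `Chat` (an IChatCompletionService) and `History` are injected or built elsewhere in the component, and `_response` is rendered in the markup:

```csharp
private string _response = string.Empty;

private async Task StreamAsync()
{
    _response = string.Empty;
    await foreach (var chunk in Chat.GetStreamingChatMessageContentsAsync(History))
    {
        _response += chunk.Content;
        // Marshal the re-render onto the component's dispatcher —
        // never call StateHasChanged() directly from this loop.
        await InvokeAsync(StateHasChanged);
    }
}
```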

What is ClientResultException and when does it appear in streaming?

ClientResultException lives in the System.ClientModel namespace (from the System.ClientModel package that the Azure OpenAI client builds on). It wraps HTTP errors returned by the Azure OpenAI API. In streaming scenarios, it is thrown when the API rejects the initial request before the stream starts — for example on 401 authentication failures or 429 rate-limit rejections that occur before the first token is sent. Check ex.Status for the HTTP status code. Errors that occur after streaming begins (like a mid-stream content filter) surface as stream termination rather than exceptions.
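A sketch of the catch blocks, using an exception filter to separate retryable rate limits from other API errors (the retry policy itself is left to you; `chat` and `history` are assumed in scope):

```csharp
using System.ClientModel;
using System.Text;

var accumulated = new StringBuilder();
try
{
    await foreach (var chunk in chat.GetStreamingChatMessageContentsAsync(history))
    {
        accumulated.Append(chunk.Content);
    }
}
catch (ClientResultException ex) when (ex.Status == 429)
{
    // Rate-limited before the first token — a retry with backoff is reasonable here.
}
catch (ClientResultException ex)
{
    Console.Error.WriteLine($"Request rejected: HTTP {ex.Status}");
}
catch (OperationCanceledException)
{
    // Timeout or user cancellation — show the partial accumulated response.
    Console.Error.WriteLine($"Cancelled; partial response: {accumulated}");
}
```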

Can I use kernel.InvokeStreamingAsync to stream a raw prompt string?

Not directly. kernel.InvokeStreamingAsync<StreamingChatMessageContent> requires a KernelFunction. To create a streaming function from a prompt template, use KernelFunctionFactory.CreateFromPrompt(promptTemplate). For raw ChatHistory-based streaming without a prompt template, call GetStreamingChatMessageContentsAsync on the IChatCompletionService directly — that is simpler and does not require a KernelFunction.

#Semantic Kernel #Streaming #InvokeStreamingAsync #Real-Time #.NET AI