How ChatHistory Works in Semantic Kernel
ChatHistory is the core conversation state object in Semantic Kernel. It lives in the Microsoft.SemanticKernel.ChatCompletion namespace and holds an ordered list of ChatMessageContent objects — one per message in the conversation.
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;
// Initialize with a system message at index 0
var chatHistory = new ChatHistory("You are a helpful .NET assistant.");
// Add a user turn
chatHistory.AddUserMessage("How does Semantic Kernel handle retries?");
// Add the assistant's response back
chatHistory.AddAssistantMessage("Semantic Kernel delegates retry logic to the underlying HTTP client...");
// You can also add tool messages
chatHistory.Add(new ChatMessageContent(AuthorRole.Tool, "Tool result data"));
ChatHistory implements IList&lt;ChatMessageContent&gt; rather than inheriting from List&lt;T&gt;, but it exposes the familiar list surface: Count, RemoveAt(index), RemoveRange(index, count), and LINQ queries. This makes truncation straightforward without any custom abstractions.
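The truncation idea is easy to see with a plain List&lt;string&gt; standing in for the message list (a hypothetical mini-session; real code would hold ChatMessageContent objects):

```csharp
using System;
using System.Collections.Generic;

// Stand-in for ChatHistory: index 0 plays the role of the system prompt.
var history = new List<string> { "system", "u1", "a1", "u2", "a2", "u3", "a3" };

// Keep at most 4 non-system messages, dropping the oldest first.
int maxMessages = 4;
int excess = (history.Count - 1) - maxMessages;
if (excess > 0)
{
    history.RemoveRange(1, excess); // removes "u1" and "a1"
}

Console.WriteLine(string.Join(", ", history));
// system, u2, a2, u3, a3
```

The same RemoveRange call works directly on ChatHistory, which is exactly what Pattern 1 below does.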
The four AuthorRole values you will use in production are:
- AuthorRole.System — the system prompt (index 0)
- AuthorRole.User — user turns
- AuthorRole.Assistant — AI responses
- AuthorRole.Tool — function call results
The Unbounded Growth Problem
Every conversation turn appends two messages to ChatHistory — one for the user, one for the assistant. With a 200-token average per message, here is what that looks like over time:
| Turns | Messages | Approx. tokens |
|---|---|---|
| 10 | 21 | ~4,200 |
| 30 | 61 | ~12,200 |
| 50 | 101 | ~20,200 |
| 100 | 201 | ~40,200 |
| 200 | 401 | ~80,400 |
GPT-4o has a 128K context window. A long session with verbose replies can exhaust it well before 200 turns. When this happens, the API returns a context_length_exceeded error — see Fix Azure OpenAI Context Length Exceeded in C# for handling strategies.
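As a sanity check on the "well before 200 turns" claim, a few lines of plain C# estimate when a 128K window runs out with verbose replies (400 tokens per message is an assumption here, double the table's average):

```csharp
using System;

int contextWindow = 128_000;   // GPT-4o context window
int tokensPerMessage = 400;    // assumed "verbose" average, 2x the table
int messagesPerTurn = 2;       // one user + one assistant message

int turns = 0;
int messages = 1;              // the system prompt
while ((messages + messagesPerTurn) * tokensPerMessage <= contextWindow)
{
    messages += messagesPerTurn;
    turns++;
}

Console.WriteLine($"Window exhausted after ~{turns} turns ({messages} messages).");
// Window exhausted after ~159 turns (319 messages).
```

At the table's 200-token average the same loop runs past 300 turns, which matches the ~80K figure at row 200.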
The solution is a deliberate history management strategy chosen at design time. The four patterns below cover the full spectrum from simple to sophisticated.
Pattern 1: Sliding Window
The simplest strategy — keep a fixed number of the most recent messages and discard the oldest. The system message at index 0 is never removed.
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;
public static class ChatHistoryExtensions
{
/// <summary>
/// Trims history to at most maxMessages recent messages,
/// always preserving index 0 (system prompt).
/// </summary>
public static void ApplySlidingWindow(
this ChatHistory chatHistory,
int maxMessages = 20)
{
// chatHistory[0] is the system message — never remove it
// Count - 1 gives us the number of non-system messages
int nonSystemCount = chatHistory.Count - 1;
int excess = nonSystemCount - maxMessages;
if (excess > 0)
{
// Remove from index 1 (oldest non-system) to reduce excess
chatHistory.RemoveRange(1, excess);
}
}
}
Call it after every assistant reply:
using Microsoft.Extensions.DependencyInjection; // for GetRequiredService on kernel.Services
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;
public class SlidingWindowChatService(Kernel kernel)
{
private readonly ChatHistory _chatHistory =
new("You are a helpful .NET assistant.");
private readonly IChatCompletionService _chatCompletion =
kernel.Services.GetRequiredService<IChatCompletionService>();
public async Task<string> ChatAsync(string userMessage)
{
_chatHistory.AddUserMessage(userMessage);
var response = await _chatCompletion.GetChatMessageContentAsync(
_chatHistory,
kernel: kernel);
_chatHistory.AddAssistantMessage(response.Content ?? "");
// Keep only the last 20 messages + system prompt
_chatHistory.ApplySlidingWindow(maxMessages: 20);
return response.Content ?? "";
}
}
Trade-off: Simple and predictable token usage, but the AI loses context for anything older than the window. Conversations that require remembering facts from earlier turns will appear to forget them.
Pattern 2: Summarization with GPT-4o-mini
Instead of discarding old messages, compress them into a single summary. This preserves long-range context at the cost of one extra LLM call.
using Microsoft.Extensions.DependencyInjection; // for GetRequiredService on kernel.Services
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;
using System.Text;
public class SummarizingChatService(Kernel kernel)
{
private readonly ChatHistory _chatHistory =
new("You are a helpful .NET assistant.");
private readonly IChatCompletionService _chatCompletion =
kernel.Services.GetRequiredService<IChatCompletionService>();
// Summarize when non-system message count exceeds this threshold
private const int SummarizationThreshold = 30;
public async Task<string> ChatAsync(string userMessage)
{
_chatHistory.AddUserMessage(userMessage);
var response = await _chatCompletion.GetChatMessageContentAsync(
_chatHistory,
kernel: kernel);
_chatHistory.AddAssistantMessage(response.Content ?? "");
// Trigger summarization if we've grown too large
int nonSystemCount = _chatHistory.Count - 1;
if (nonSystemCount >= SummarizationThreshold)
{
await SummarizeHistoryAsync();
}
return response.Content ?? "";
}
private async Task SummarizeHistoryAsync()
{
// Collect all non-system messages to summarize
var messagesToSummarize = _chatHistory
.Skip(1)
.ToList();
// Build a prompt asking the model to summarize the conversation so far
var summaryPrompt = new StringBuilder();
summaryPrompt.AppendLine("Summarize the following conversation concisely. ");
summaryPrompt.AppendLine("Capture key facts, decisions, and context that would help continue the conversation:");
summaryPrompt.AppendLine();
foreach (var message in messagesToSummarize)
{
summaryPrompt.AppendLine($"{message.Role}: {message.Content}");
}
// Use a fast, cheap model for summarization
// Configure a separate kernel or execution settings for GPT-4o-mini
var summaryHistory = new ChatHistory(
"You are a precise conversation summarizer. Produce concise factual summaries.");
summaryHistory.AddUserMessage(summaryPrompt.ToString());
var summaryResponse = await _chatCompletion.GetChatMessageContentAsync(
summaryHistory,
kernel: kernel);
var summary = summaryResponse.Content ?? "Previous conversation omitted.";
// Rebuild history: system message + summary as assistant context
var systemMessage = _chatHistory[0];
_chatHistory.Clear();
_chatHistory.Add(systemMessage);
// Add the summary as an assistant message to preserve conversational flow
_chatHistory.Add(new ChatMessageContent(
AuthorRole.Assistant,
$"[Summary of previous conversation]: {summary}"));
}
}
Trade-off: Preserves long-range context. The summary costs one extra LLM call per threshold crossing. For production, configure GPT-4o-mini as a dedicated summarization deployment to keep costs low.
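One way to wire up a dedicated cheap model is to register a second connector under a service key. This is a sketch, not the article's code: the gpt-4o-mini deployment name and the "summarizer" serviceId are assumptions, and the AddKernel/AddAzureOpenAIChatCompletion signatures should be checked against your Semantic Kernel version:

```csharp
// Program.cs — two deployments on one kernel; the serviceId key lets the
// summarization path resolve the cheap model explicitly.
builder.Services.AddKernel()
    .AddAzureOpenAIChatCompletion(
        deploymentName: "gpt-4o",
        endpoint: builder.Configuration["AzureOpenAI:Endpoint"]!,
        apiKey: builder.Configuration["AzureOpenAI:ApiKey"]!)
    .AddAzureOpenAIChatCompletion(
        deploymentName: "gpt-4o-mini",
        endpoint: builder.Configuration["AzureOpenAI:Endpoint"]!,
        apiKey: builder.Configuration["AzureOpenAI:ApiKey"]!,
        serviceId: "summarizer");
```

Inside SummarizeHistoryAsync you would then resolve the keyed service (e.g. kernel.GetRequiredService&lt;IChatCompletionService&gt;("summarizer")) so only the summary call pays gpt-4o-mini pricing.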
Pattern 3: Token-Aware Truncation
The most accurate strategy — count actual tokens and truncate until you are within budget. This requires Microsoft.ML.Tokenizers.
dotnet add package Microsoft.ML.Tokenizers --version 0.22.0
using Microsoft.Extensions.DependencyInjection; // for GetRequiredService on kernel.Services
using Microsoft.ML.Tokenizers;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;
public static class TokenAwareTruncation
{
// Each message has ~4 tokens of overhead (role, separators)
private const int TokensPerMessageOverhead = 4;
// Two priming tokens are added by the API at the start of the reply
private const int PrimingTokens = 2;
/// <summary>
/// Removes oldest non-system messages until the total token count
/// falls below maxTokens. Always preserves index 0 (system prompt).
/// </summary>
public static void TruncateToTokenBudget(
this ChatHistory chatHistory,
TiktokenTokenizer tokenizer,
int maxTokens = 8_000)
{
while (chatHistory.Count > 1 && CountTokens(chatHistory, tokenizer) > maxTokens)
{
// Remove the oldest non-system message (index 1)
chatHistory.RemoveAt(1);
}
}
public static int CountTokens(
ChatHistory chatHistory,
TiktokenTokenizer tokenizer)
{
int total = PrimingTokens;
foreach (var message in chatHistory)
{
total += TokensPerMessageOverhead;
total += tokenizer.CountTokens(message.Content ?? "");
}
return total;
}
}
public class TokenAwareChatService(Kernel kernel)
{
private readonly ChatHistory _chatHistory =
new("You are a helpful .NET assistant.");
private readonly IChatCompletionService _chatCompletion =
kernel.Services.GetRequiredService<IChatCompletionService>();
// Create the tokenizer once — it is thread-safe and expensive to construct
private static readonly TiktokenTokenizer _tokenizer =
TiktokenTokenizer.CreateForModel("gpt-4o");
// Leave headroom for the model's response tokens
private const int MaxInputTokens = 100_000;
public async Task<string> ChatAsync(string userMessage)
{
_chatHistory.AddUserMessage(userMessage);
// Truncate before sending to stay within context window
_chatHistory.TruncateToTokenBudget(_tokenizer, MaxInputTokens);
var response = await _chatCompletion.GetChatMessageContentAsync(
_chatHistory,
kernel: kernel);
_chatHistory.AddAssistantMessage(response.Content ?? "");
return response.Content ?? "";
}
}
For a deeper understanding of token counting mechanics and why the 4-token overhead per message exists, see Azure OpenAI Token Counting and Context Management in C#.
Trade-off: The most precise strategy — no surprise context-length errors. The tokenizer adds a small CPU cost per call, but TiktokenTokenizer is efficient. The downside is that abruptly removing messages can leave incoherent context when a user refers back to an earlier exchange.
Pattern 4: Hybrid — Summarize + Recency Window
Combines summarization and sliding window for the best of both worlds. Summarize every 20 turns to capture long-range context, then keep only the last 5 turns for recency.
using Microsoft.Extensions.DependencyInjection; // for GetRequiredService on kernel.Services
using Microsoft.ML.Tokenizers;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;
using System.Text;
public class HybridChatService(Kernel kernel)
{
private readonly ChatHistory _chatHistory =
new("You are a helpful .NET assistant. You are knowledgeable about .NET and Azure.");
private readonly IChatCompletionService _chatCompletion =
kernel.Services.GetRequiredService<IChatCompletionService>();
private static readonly TiktokenTokenizer _tokenizer =
TiktokenTokenizer.CreateForModel("gpt-4o");
// Summarize the oldest messages when non-system count exceeds this
private const int SummarizationTrigger = 20;
// After summarization, keep this many most-recent turns
private const int RecentTurnsToKeep = 5;
// Hard token ceiling before the summarization kicks in
private const int MaxTokensBeforeForce = 90_000;
public async Task<string> ChatAsync(string userMessage)
{
_chatHistory.AddUserMessage(userMessage);
var response = await _chatCompletion.GetChatMessageContentAsync(
_chatHistory,
kernel: kernel);
_chatHistory.AddAssistantMessage(response.Content ?? "");
await ApplyHybridStrategyAsync();
return response.Content ?? "";
}
private async Task ApplyHybridStrategyAsync()
{
int nonSystemCount = _chatHistory.Count - 1;
int tokenCount = TokenAwareTruncation.CountTokens(_chatHistory, _tokenizer);
bool shouldSummarize =
nonSystemCount >= SummarizationTrigger ||
tokenCount >= MaxTokensBeforeForce;
if (!shouldSummarize)
return;
// Identify the messages to summarize (everything except last K turns)
// non-system messages live at index 1..Count-1
// keep the last RecentTurnsToKeep * 2 messages (each turn = 2 messages)
int recentMessageCount = RecentTurnsToKeep * 2;
int messagesToSummarize = nonSystemCount - recentMessageCount;
if (messagesToSummarize <= 0)
return; // Not enough old messages to summarize yet
var oldMessages = _chatHistory
.Skip(1) // skip system
.Take(messagesToSummarize) // oldest messages only
.ToList();
var recentMessages = _chatHistory
.Skip(1 + messagesToSummarize) // skip system + old
.ToList();
// Build summary of old messages
var summaryPrompt = new StringBuilder();
summaryPrompt.AppendLine(
"Summarize this conversation excerpt concisely. Include key facts, " +
"code snippets discussed, decisions made, and any unresolved questions:");
summaryPrompt.AppendLine();
foreach (var msg in oldMessages)
{
summaryPrompt.AppendLine($"{msg.Role}: {msg.Content}");
}
var summaryHistory = new ChatHistory(
"You are a conversation summarizer. Be concise and factual.");
summaryHistory.AddUserMessage(summaryPrompt.ToString());
var summaryResp = await _chatCompletion.GetChatMessageContentAsync(
summaryHistory, kernel: kernel);
var summary = summaryResp.Content ?? "Earlier context omitted.";
// Rebuild: system + summary + recent turns
var systemMessage = _chatHistory[0];
_chatHistory.Clear();
_chatHistory.Add(systemMessage);
_chatHistory.Add(new ChatMessageContent(
AuthorRole.Assistant,
$"[Conversation summary — earlier context]: {summary}"));
foreach (var msg in recentMessages)
{
_chatHistory.Add(msg);
}
}
}
Trade-off: Best coherence across long sessions. Adds latency and cost every 20 turns. Use a dedicated GPT-4o-mini deployment for the summarization call to minimize impact on user-perceived response time.
Persisting ChatHistory Across Requests
For multi-turn chatbots in ASP.NET Core, you need to persist ChatHistory between HTTP requests. ChatMessageContent is a rich polymorphic type that does not round-trip reliably as-is, so the examples below map each message to a small DTO that serializes cleanly with System.Text.Json.
Redis Persistence with IDistributedCache
using Microsoft.Extensions.Caching.Distributed;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;
using System.Text.Json;
public class RedisChatHistoryStore(IDistributedCache cache)
{
private static readonly JsonSerializerOptions _jsonOptions = new()
{
WriteIndented = false,
};
public async Task<ChatHistory> LoadAsync(
string sessionId,
string systemPrompt,
CancellationToken ct = default)
{
var bytes = await cache.GetAsync(sessionId, ct);
if (bytes is null || bytes.Length == 0)
{
// New session — start fresh with the system prompt
return new ChatHistory(systemPrompt);
}
var messages = JsonSerializer.Deserialize<List<SerializableChatMessage>>(bytes, _jsonOptions)
?? [];
var history = new ChatHistory();
foreach (var msg in messages)
{
history.Add(new ChatMessageContent(
new AuthorRole(msg.Role),
msg.Content));
}
return history;
}
public async Task SaveAsync(
string sessionId,
ChatHistory chatHistory,
CancellationToken ct = default)
{
var messages = chatHistory
.Select(m => new SerializableChatMessage
{
Role = m.Role.ToString(),
Content = m.Content ?? "",
})
.ToList();
var bytes = JsonSerializer.SerializeToUtf8Bytes(messages, _jsonOptions);
await cache.SetAsync(sessionId, bytes, new DistributedCacheEntryOptions
{
SlidingExpiration = TimeSpan.FromHours(24),
}, ct);
}
}
public record SerializableChatMessage
{
public string Role { get; init; } = "";
public string Content { get; init; } = "";
}
Register Redis in Program.cs:
builder.Services.AddStackExchangeRedisCache(options =>
{
options.Configuration = builder.Configuration["Redis:ConnectionString"];
options.InstanceName = "chatbot:";
});
builder.Services.AddScoped<RedisChatHistoryStore>();
Cosmos DB Persistence
For applications already using Azure Cosmos DB (as in the RAG chatbot tutorial), store chat sessions as documents alongside your operational data:
using Microsoft.Azure.Cosmos;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;
using System.Text.Json;
public class CosmosChatHistoryStore(CosmosClient cosmosClient)
{
private readonly Container _container =
cosmosClient.GetContainer("chatbot", "sessions");
public async Task<ChatHistory> LoadAsync(
string userId,
string systemPrompt,
CancellationToken ct = default)
{
try
{
var response = await _container.ReadItemAsync<ChatSession>(
id: userId,
partitionKey: new PartitionKey(userId),
cancellationToken: ct);
var history = new ChatHistory();
foreach (var msg in response.Resource.Messages)
{
history.Add(new ChatMessageContent(
new AuthorRole(msg.Role),
msg.Content));
}
return history;
}
catch (CosmosException ex) when (ex.StatusCode == System.Net.HttpStatusCode.NotFound)
{
return new ChatHistory(systemPrompt);
}
}
public async Task SaveAsync(
string userId,
ChatHistory chatHistory,
CancellationToken ct = default)
{
var session = new ChatSession
{
Id = userId,
UserId = userId,
UpdatedAt = DateTime.UtcNow,
Messages = chatHistory
.Select(m => new SerializableChatMessage
{
Role = m.Role.ToString(),
Content = m.Content ?? "",
})
.ToList(),
};
await _container.UpsertItemAsync(
session,
new PartitionKey(userId),
cancellationToken: ct);
}
}
public class ChatSession
{
// Cosmos DB requires the document id to serialize as lowercase "id";
// the Cosmos SDK's default serializer is Newtonsoft.Json, hence this attribute.
[Newtonsoft.Json.JsonProperty("id")]
public string Id { get; set; } = "";
public string UserId { get; set; } = "";
public DateTime UpdatedAt { get; set; }
public List<SerializableChatMessage> Messages { get; set; } = [];
}
Multi-User ASP.NET Core Integration
In ASP.NET Core, register your chat services and history stores as Scoped so each HTTP request gets its own instances. For multi-turn sessions, load ChatHistory from the store at the start of each request and save it back at the end.
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;
// Program.cs
builder.Services.AddKernel()
.AddAzureOpenAIChatCompletion(
deploymentName: "gpt-4o",
endpoint: builder.Configuration["AzureOpenAI:Endpoint"]!,
apiKey: builder.Configuration["AzureOpenAI:ApiKey"]!);
builder.Services.AddScoped<RedisChatHistoryStore>();
builder.Services.AddScoped<HybridChatService>();
Wire up a minimal API endpoint that loads, runs, and saves history per request:
using Microsoft.ML.Tokenizers;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;
app.MapPost("/api/chat/{sessionId}", async (
string sessionId,
ChatRequest request,
RedisChatHistoryStore historyStore,
Kernel kernel,
CancellationToken ct) =>
{
const string SystemPrompt =
"You are a helpful .NET assistant specialized in Semantic Kernel and Azure OpenAI.";
// 1. Load or create history for this session
var chatHistory = await historyStore.LoadAsync(sessionId, SystemPrompt, ct);
// 2. Add the new user message
chatHistory.AddUserMessage(request.Message);
// 3. Apply token-aware truncation before the API call.
//    In production, cache this tokenizer in a static field; it is
//    expensive to construct and safe to share across requests.
var tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");
chatHistory.TruncateToTokenBudget(tokenizer, maxTokens: 100_000);
// 4. Get the AI response
var chatCompletion = kernel.Services
.GetRequiredService<IChatCompletionService>();
var response = await chatCompletion.GetChatMessageContentAsync(
chatHistory,
kernel: kernel,
cancellationToken: ct);
chatHistory.AddAssistantMessage(response.Content ?? "");
// 5. Persist the updated history
await historyStore.SaveAsync(sessionId, chatHistory, ct);
return Results.Ok(new { reply = response.Content });
});
public record ChatRequest(string Message);
The key design rule: never store ChatHistory as a Singleton in a multi-user application. A Singleton shares state across all concurrent requests, causing users to see each other’s conversation history.
Choosing the Right Strategy
| Strategy | Token predictability | Context preservation | Extra latency | Best for |
|---|---|---|---|---|
| Sliding window | High | Low (drops old context) | None | Customer support bots with short sessions |
| Summarization | Medium | High | +1 LLM call per threshold | Long-running assistant sessions |
| Token-aware | Very high | Medium | Small CPU overhead | Applications with strict token budgets |
| Hybrid | High | Very high | +1 LLM call per 20 turns | Production AI assistants |
For most production .NET applications, start with the token-aware truncation pattern — it prevents context-length errors reliably and has minimal overhead. Upgrade to hybrid when users report the bot forgetting important context from earlier in long sessions.