
Semantic Kernel Chat History Management in C#

Intermediate · .NET 9 · Microsoft.SemanticKernel 1.54.0 · Microsoft.ML.Tokenizers 0.22.0
By Rajesh Mishra · Mar 21, 2026 · 14 min read
Verified Mar 2026 · .NET 9 · Microsoft.SemanticKernel 1.54.0
In 30 Seconds

This article covers four production patterns for managing Semantic Kernel ChatHistory growth in C# .NET 9 applications: sliding window truncation, GPT-4o-mini summarization, token-aware truncation using Microsoft.ML.Tokenizers 0.22.0, and a hybrid approach combining summarization with recency windows. It also covers serializing ChatHistory to Cosmos DB or Redis for multi-user ASP.NET Core persistence.

How ChatHistory Works in Semantic Kernel

ChatHistory is the core conversation state object in Semantic Kernel. It lives in the Microsoft.SemanticKernel namespace and holds an ordered list of ChatMessageContent objects — one per turn in the conversation.

using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;

// Initialize with a system message at index 0
var chatHistory = new ChatHistory("You are a helpful .NET assistant.");

// Add a user turn
chatHistory.AddUserMessage("How does Semantic Kernel handle retries?");

// Add the assistant's response back
chatHistory.AddAssistantMessage("Semantic Kernel delegates retry logic to the underlying HTTP client...");

// You can also add tool messages
chatHistory.Add(new ChatMessageContent(AuthorRole.Tool, "Tool result data"));

Because ChatHistory inherits from List<ChatMessageContent>, every standard list operation is available: Count, RemoveAt(index), RemoveRange(index, count), and LINQ queries. This makes truncation straightforward without any custom abstractions.

The four AuthorRole values you will use in production are:

  • AuthorRole.System — the system prompt (index 0)
  • AuthorRole.User — user turns
  • AuthorRole.Assistant — AI responses
  • AuthorRole.Tool — function call results

The Unbounded Growth Problem

Every conversation turn appends two messages to ChatHistory — one for the user, one for the assistant. With a 200-token average per message, here is what that looks like over time:

Turns | Messages | Approx. tokens
10    | 21       | ~4,200
30    | 61       | ~12,200
50    | 101      | ~20,200
100   | 201      | ~40,200
200   | 401      | ~80,200

GPT-4o has a 128K context window. A long session with verbose replies can exhaust it well before 200 turns. When this happens, the API returns a context_length_exceeded error — see Fix Azure OpenAI Context Length Exceeded in C# for handling strategies.
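The arithmetic behind the table is simple enough to verify in a few lines of plain C#, using the same ~200-token-per-message average assumed above:

```csharp
using System;

// Reproduces the growth table: each turn adds one user and one assistant
// message on top of the single system message, at an assumed ~200-token
// average per message.
static (int Messages, int Tokens) Estimate(int turns, int tokensPerMessage = 200)
{
    int messages = turns * 2 + 1;
    return (messages, messages * tokensPerMessage);
}

foreach (int turns in new[] { 10, 30, 50, 100, 200 })
{
    var (messages, tokens) = Estimate(turns);
    Console.WriteLine($"{turns,3} turns -> {messages,3} messages -> ~{tokens:N0} tokens");
}
```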

The solution is a deliberate history management strategy chosen at design time. The four patterns below cover the full spectrum from simple to sophisticated.

Pattern 1: Sliding Window

The simplest strategy — keep a fixed number of the most recent messages and discard the oldest. The system message at index 0 is never removed.

using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;

public static class ChatHistoryExtensions
{
    /// <summary>
    /// Trims history to at most maxMessages recent messages,
    /// always preserving index 0 (system prompt).
    /// </summary>
    public static void ApplySlidingWindow(
        this ChatHistory chatHistory,
        int maxMessages = 20)
    {
        // chatHistory[0] is the system message — never remove it
        // Count - 1 gives us the number of non-system messages
        int nonSystemCount = chatHistory.Count - 1;
        int excess = nonSystemCount - maxMessages;

        if (excess > 0)
        {
            // Remove from index 1 (oldest non-system) to reduce excess
            chatHistory.RemoveRange(1, excess);
        }
    }
}

Call it after every assistant reply:

using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;

public class SlidingWindowChatService(Kernel kernel)
{
    private readonly ChatHistory _chatHistory =
        new("You are a helpful .NET assistant.");

    private readonly IChatCompletionService _chatCompletion =
        kernel.GetRequiredService<IChatCompletionService>();

    public async Task<string> ChatAsync(string userMessage)
    {
        _chatHistory.AddUserMessage(userMessage);

        var response = await _chatCompletion.GetChatMessageContentAsync(
            _chatHistory,
            kernel: kernel);

        _chatHistory.AddAssistantMessage(response.Content ?? "");

        // Keep only the last 20 messages + system prompt
        _chatHistory.ApplySlidingWindow(maxMessages: 20);

        return response.Content ?? "";
    }
}

Trade-off: Simple and predictable token usage, but the AI loses context for anything older than the window. Conversations that require remembering facts from earlier turns will appear to forget them.
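Because the extension method relies only on List semantics, the window math can be sanity-checked without Semantic Kernel at all; here a plain List<string> stands in for ChatHistory:

```csharp
using System;
using System.Collections.Generic;

// Plain list standing in for ChatHistory: "sys" + 15 full turns (30 messages).
var history = new List<string> { "sys" };
for (int turn = 1; turn <= 15; turn++)
{
    history.Add($"user {turn}");
    history.Add($"assistant {turn}");
}

// Same math as ApplySlidingWindow
int maxMessages = 20;
int excess = (history.Count - 1) - maxMessages;  // non-system count minus the budget
if (excess > 0)
    history.RemoveRange(1, excess);              // drop the oldest non-system messages

Console.WriteLine(history.Count);  // 21: system prompt + the 20 most recent messages
Console.WriteLine(history[0]);     // "sys": the system prompt always survives
```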

Pattern 2: Summarization with GPT-4o-mini

Instead of discarding old messages, compress them into a single summary. This preserves long-range context at the cost of one extra LLM call.

using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;
using System.Text;

public class SummarizingChatService(Kernel kernel)
{
    private readonly ChatHistory _chatHistory =
        new("You are a helpful .NET assistant.");

    private readonly IChatCompletionService _chatCompletion =
        kernel.GetRequiredService<IChatCompletionService>();

    // Summarize when non-system message count exceeds this threshold
    private const int SummarizationThreshold = 30;

    public async Task<string> ChatAsync(string userMessage)
    {
        _chatHistory.AddUserMessage(userMessage);

        var response = await _chatCompletion.GetChatMessageContentAsync(
            _chatHistory,
            kernel: kernel);

        _chatHistory.AddAssistantMessage(response.Content ?? "");

        // Trigger summarization if we've grown too large
        int nonSystemCount = _chatHistory.Count - 1;
        if (nonSystemCount >= SummarizationThreshold)
        {
            await SummarizeHistoryAsync();
        }

        return response.Content ?? "";
    }

    private async Task SummarizeHistoryAsync()
    {
        // Collect all non-system messages to summarize
        var messagesToSummarize = _chatHistory
            .Skip(1)
            .ToList();

        // Build a prompt asking the model to summarize the conversation so far
        var summaryPrompt = new StringBuilder();
        summaryPrompt.AppendLine("Summarize the following conversation concisely. ");
        summaryPrompt.AppendLine("Capture key facts, decisions, and context that would help continue the conversation:");
        summaryPrompt.AppendLine();

        foreach (var message in messagesToSummarize)
        {
            summaryPrompt.AppendLine($"{message.Role}: {message.Content}");
        }

        // Use a fast, cheap model for summarization
        // Configure a separate kernel or execution settings for GPT-4o-mini
        var summaryHistory = new ChatHistory(
            "You are a precise conversation summarizer. Produce concise factual summaries.");
        summaryHistory.AddUserMessage(summaryPrompt.ToString());

        var summaryResponse = await _chatCompletion.GetChatMessageContentAsync(
            summaryHistory,
            kernel: kernel);

        var summary = summaryResponse.Content ?? "Previous conversation omitted.";

        // Rebuild history: system message + summary as assistant context
        var systemMessage = _chatHistory[0];

        _chatHistory.Clear();
        _chatHistory.Add(systemMessage);

        // Add the summary as an assistant message to preserve conversational flow
        _chatHistory.Add(new ChatMessageContent(
            AuthorRole.Assistant,
            $"[Summary of previous conversation]: {summary}"));
    }
}

Trade-off: Preserves long-range context. The summary costs one extra LLM call per threshold crossing. For production, configure GPT-4o-mini as a dedicated summarization deployment to keep costs low.
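One way to wire up that dedicated deployment is to register it under a separate service id. This is a sketch only, assuming your Semantic Kernel version exposes the serviceId parameter on AddAzureOpenAIChatCompletion; the "summarizer" id and configuration keys are illustrative:

```csharp
// Program.cs sketch: a second, cheaper deployment registered under a
// service id. The "summarizer" id and config keys are illustrative.
var kernelBuilder = builder.Services.AddKernel();

// Primary model for user-facing replies
kernelBuilder.AddAzureOpenAIChatCompletion(
    deploymentName: "gpt-4o",
    endpoint: builder.Configuration["AzureOpenAI:Endpoint"]!,
    apiKey: builder.Configuration["AzureOpenAI:ApiKey"]!);

// Dedicated summarization deployment
kernelBuilder.AddAzureOpenAIChatCompletion(
    deploymentName: "gpt-4o-mini",
    endpoint: builder.Configuration["AzureOpenAI:Endpoint"]!,
    apiKey: builder.Configuration["AzureOpenAI:ApiKey"]!,
    serviceId: "summarizer");

// In SummarizeHistoryAsync, resolve the cheap model by id:
// var summarizer = kernel.GetRequiredService<IChatCompletionService>("summarizer");
```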

Pattern 3: Token-Aware Truncation

The most accurate strategy — count actual tokens and truncate until you are within budget. This requires Microsoft.ML.Tokenizers.

dotnet add package Microsoft.ML.Tokenizers --version 0.22.0

using Microsoft.ML.Tokenizers;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;

public static class TokenAwareTruncation
{
    // Each message has ~4 tokens of overhead (role, separators)
    private const int TokensPerMessageOverhead = 4;
    // Two priming tokens are added by the API at the start of the reply
    private const int PrimingTokens = 2;

    /// <summary>
    /// Removes oldest non-system messages until the total token count
    /// falls below maxTokens. Always preserves index 0 (system prompt).
    /// </summary>
    public static void TruncateToTokenBudget(
        this ChatHistory chatHistory,
        TiktokenTokenizer tokenizer,
        int maxTokens = 8_000)
    {
        while (chatHistory.Count > 1 && CountTokens(chatHistory, tokenizer) > maxTokens)
        {
            // Remove the oldest non-system message (index 1)
            chatHistory.RemoveAt(1);
        }
    }

    public static int CountTokens(
        ChatHistory chatHistory,
        TiktokenTokenizer tokenizer)
    {
        int total = PrimingTokens;

        foreach (var message in chatHistory)
        {
            total += TokensPerMessageOverhead;
            total += tokenizer.CountTokens(message.Content ?? "");
        }

        return total;
    }
}

public class TokenAwareChatService(Kernel kernel)
{
    private readonly ChatHistory _chatHistory =
        new("You are a helpful .NET assistant.");

    private readonly IChatCompletionService _chatCompletion =
        kernel.GetRequiredService<IChatCompletionService>();

    // Create the tokenizer once — it is thread-safe and expensive to construct
    private static readonly TiktokenTokenizer _tokenizer =
        TiktokenTokenizer.CreateForModel("gpt-4o");

    // Leave headroom for the model's response tokens
    private const int MaxInputTokens = 100_000;

    public async Task<string> ChatAsync(string userMessage)
    {
        _chatHistory.AddUserMessage(userMessage);

        // Truncate before sending to stay within context window
        _chatHistory.TruncateToTokenBudget(_tokenizer, MaxInputTokens);

        var response = await _chatCompletion.GetChatMessageContentAsync(
            _chatHistory,
            kernel: kernel);

        _chatHistory.AddAssistantMessage(response.Content ?? "");

        return response.Content ?? "";
    }
}

For a deeper understanding of token counting mechanics and why the 4-token overhead per message exists, see Azure OpenAI Token Counting and Context Management in C#.

Trade-off: Most precise strategy — no surprise context-length errors. The tokenizer adds a small CPU overhead per call but TiktokenTokenizer is efficient. The downside is that abrupt removal of messages can create incoherent context if a user refers back to an earlier exchange.
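The truncation loop itself is plain list manipulation, so it can be exercised without the real tokenizer. In this sketch a whitespace word count stands in for TiktokenTokenizer, purely for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Whitespace word count stands in for TiktokenTokenizer, illustration only.
Func<string, int> countTokens = text =>
    text.Split(' ', StringSplitOptions.RemoveEmptyEntries).Length;

var history = new List<string>
{
    "system prompt",                         // index 0, never removed (2 tokens)
    "an old message with many words in it",  // 8 tokens
    "another old message",                   // 3 tokens
    "recent question",                       // 2 tokens
    "recent answer",                         // 2 tokens
};

int maxTokens = 8;

// Same loop shape as TruncateToTokenBudget: drop the oldest non-system
// message until the running total fits the budget.
while (history.Count > 1 && history.Sum(m => countTokens(m)) > maxTokens)
    history.RemoveAt(1);

Console.WriteLine(history.Count);  // 3: system + the two recent messages
```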

Pattern 4: Hybrid — Summarize + Recency Window

Combines summarization and sliding window for the best of both worlds. Summarize every 20 turns to capture long-range context, then keep only the last 5 turns for recency.

using Microsoft.ML.Tokenizers;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;
using System.Text;

public class HybridChatService(Kernel kernel)
{
    private readonly ChatHistory _chatHistory =
        new("You are a helpful .NET assistant. You are knowledgeable about .NET and Azure.");

    private readonly IChatCompletionService _chatCompletion =
        kernel.GetRequiredService<IChatCompletionService>();

    private static readonly TiktokenTokenizer _tokenizer =
        TiktokenTokenizer.CreateForModel("gpt-4o");

    // Summarize the oldest messages when non-system count exceeds this
    private const int SummarizationTrigger = 20;
    // After summarization, keep this many most-recent turns
    private const int RecentTurnsToKeep = 5;
    // Hard token ceiling before the summarization kicks in
    private const int MaxTokensBeforeForce = 90_000;

    public async Task<string> ChatAsync(string userMessage)
    {
        _chatHistory.AddUserMessage(userMessage);

        var response = await _chatCompletion.GetChatMessageContentAsync(
            _chatHistory,
            kernel: kernel);

        _chatHistory.AddAssistantMessage(response.Content ?? "");

        await ApplyHybridStrategyAsync();

        return response.Content ?? "";
    }

    private async Task ApplyHybridStrategyAsync()
    {
        int nonSystemCount = _chatHistory.Count - 1;
        int tokenCount = TokenAwareTruncation.CountTokens(_chatHistory, _tokenizer);

        bool shouldSummarize =
            nonSystemCount >= SummarizationTrigger ||
            tokenCount >= MaxTokensBeforeForce;

        if (!shouldSummarize)
            return;

        // Identify the messages to summarize (everything except last K turns)
        // non-system messages live at index 1..Count-1
        // keep the last RecentTurnsToKeep * 2 messages (each turn = 2 messages)
        int recentMessageCount = RecentTurnsToKeep * 2;
        int messagesToSummarize = nonSystemCount - recentMessageCount;

        if (messagesToSummarize <= 0)
            return; // Not enough old messages to summarize yet

        var oldMessages = _chatHistory
            .Skip(1)                        // skip system
            .Take(messagesToSummarize)      // oldest messages only
            .ToList();

        var recentMessages = _chatHistory
            .Skip(1 + messagesToSummarize)  // skip system + old
            .ToList();

        // Build summary of old messages
        var summaryPrompt = new StringBuilder();
        summaryPrompt.AppendLine(
            "Summarize this conversation excerpt concisely. Include key facts, " +
            "code snippets discussed, decisions made, and any unresolved questions:");
        summaryPrompt.AppendLine();

        foreach (var msg in oldMessages)
        {
            summaryPrompt.AppendLine($"{msg.Role}: {msg.Content}");
        }

        var summaryHistory = new ChatHistory(
            "You are a conversation summarizer. Be concise and factual.");
        summaryHistory.AddUserMessage(summaryPrompt.ToString());

        var summaryResp = await _chatCompletion.GetChatMessageContentAsync(
            summaryHistory, kernel: kernel);

        var summary = summaryResp.Content ?? "Earlier context omitted.";

        // Rebuild: system + summary + recent turns
        var systemMessage = _chatHistory[0];

        _chatHistory.Clear();
        _chatHistory.Add(systemMessage);
        _chatHistory.Add(new ChatMessageContent(
            AuthorRole.Assistant,
            $"[Conversation summary — earlier context]: {summary}"));

        foreach (var msg in recentMessages)
        {
            _chatHistory.Add(msg);
        }
    }
}

Trade-off: Best coherence across long sessions. Adds latency and cost every 20 turns. Use a dedicated GPT-4o-mini deployment for the summarization call to minimize impact on user-perceived response time.
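The partition arithmetic in ApplyHybridStrategyAsync (which messages get summarized and which stay verbatim) can be checked on a plain list standing in for ChatHistory:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

const int RecentTurnsToKeep = 5;

// Plain list standing in for ChatHistory: "sys" + 12 full turns.
var history = new List<string> { "sys" };
for (int turn = 1; turn <= 12; turn++)
{
    history.Add($"user {turn}");
    history.Add($"assistant {turn}");
}

// Same partition math as ApplyHybridStrategyAsync
int nonSystemCount = history.Count - 1;                         // 24
int recentMessageCount = RecentTurnsToKeep * 2;                 // 10 (each turn = 2 messages)
int messagesToSummarize = nonSystemCount - recentMessageCount;  // 14

var oldMessages = history.Skip(1).Take(messagesToSummarize).ToList();
var recentMessages = history.Skip(1 + messagesToSummarize).ToList();

Console.WriteLine(oldMessages.Count);     // 14: turns 1-7 get summarized
Console.WriteLine(recentMessages[0]);     // "user 8": turns 8-12 stay verbatim
```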

Persisting ChatHistory Across Requests

For multi-turn chatbots in ASP.NET Core, you need to persist ChatHistory between HTTP requests. ChatMessageContent serializes cleanly with System.Text.Json.
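The round trip itself needs nothing beyond System.Text.Json, so it is easy to verify in isolation. StoredMessage here is an illustrative DTO in the same role/content shape as the SerializableChatMessage record defined below:

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

var messages = new List<StoredMessage>
{
    new("system", "You are a helpful .NET assistant."),
    new("user", "How does Semantic Kernel handle retries?"),
    new("assistant", "Retry logic lives in the underlying HTTP client..."),
};

// Serialize for storage, then restore; role and content survive the round trip
string json = JsonSerializer.Serialize(messages);
var restored = JsonSerializer.Deserialize<List<StoredMessage>>(json)!;

Console.WriteLine(restored.Count);    // 3
Console.WriteLine(restored[1].Role);  // user

// Illustrative DTO, same shape as the SerializableChatMessage record below
public record StoredMessage(string Role, string Content);
```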

Redis Persistence with IDistributedCache

using Microsoft.Extensions.Caching.Distributed;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;
using System.Text.Json;

public class RedisChatHistoryStore(IDistributedCache cache)
{
    private static readonly JsonSerializerOptions _jsonOptions = new()
    {
        WriteIndented = false,
    };

    public async Task<ChatHistory> LoadAsync(
        string sessionId,
        string systemPrompt,
        CancellationToken ct = default)
    {
        var bytes = await cache.GetAsync(sessionId, ct);

        if (bytes is null || bytes.Length == 0)
        {
            // New session — start fresh with the system prompt
            return new ChatHistory(systemPrompt);
        }

        var messages = JsonSerializer.Deserialize<List<SerializableChatMessage>>(bytes, _jsonOptions)
            ?? [];

        var history = new ChatHistory();

        foreach (var msg in messages)
        {
            history.Add(new ChatMessageContent(
                new AuthorRole(msg.Role),
                msg.Content));
        }

        return history;
    }

    public async Task SaveAsync(
        string sessionId,
        ChatHistory chatHistory,
        CancellationToken ct = default)
    {
        var messages = chatHistory
            .Select(m => new SerializableChatMessage
            {
                Role = m.Role.ToString(),
                Content = m.Content ?? "",
            })
            .ToList();

        var bytes = JsonSerializer.SerializeToUtf8Bytes(messages, _jsonOptions);

        await cache.SetAsync(sessionId, bytes, new DistributedCacheEntryOptions
        {
            SlidingExpiration = TimeSpan.FromHours(24),
        }, ct);
    }
}

public record SerializableChatMessage
{
    public string Role { get; init; } = "";
    public string Content { get; init; } = "";
}

Register Redis in Program.cs:

builder.Services.AddStackExchangeRedisCache(options =>
{
    options.Configuration = builder.Configuration["Redis:ConnectionString"];
    options.InstanceName = "chatbot:";
});

builder.Services.AddScoped<RedisChatHistoryStore>();

Cosmos DB Persistence

For applications already using Azure Cosmos DB (as in the RAG chatbot tutorial), store chat sessions as documents alongside your operational data:

using Microsoft.Azure.Cosmos;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;
using System.Text.Json;

public class CosmosChatHistoryStore(CosmosClient cosmosClient)
{
    private readonly Container _container =
        cosmosClient.GetContainer("chatbot", "sessions");

    public async Task<ChatHistory> LoadAsync(
        string userId,
        string systemPrompt,
        CancellationToken ct = default)
    {
        try
        {
            var response = await _container.ReadItemAsync<ChatSession>(
                id: userId,
                partitionKey: new PartitionKey(userId),
                cancellationToken: ct);

            var history = new ChatHistory();

            foreach (var msg in response.Resource.Messages)
            {
                history.Add(new ChatMessageContent(
                    new AuthorRole(msg.Role),
                    msg.Content));
            }

            return history;
        }
        catch (CosmosException ex) when (ex.StatusCode == System.Net.HttpStatusCode.NotFound)
        {
            return new ChatHistory(systemPrompt);
        }
    }

    public async Task SaveAsync(
        string userId,
        ChatHistory chatHistory,
        CancellationToken ct = default)
    {
        var session = new ChatSession
        {
            Id = userId,
            UserId = userId,
            UpdatedAt = DateTime.UtcNow,
            Messages = chatHistory
                .Select(m => new SerializableChatMessage
                {
                    Role = m.Role.ToString(),
                    Content = m.Content ?? "",
                })
                .ToList(),
        };

        await _container.UpsertItemAsync(
            session,
            new PartitionKey(userId),
            cancellationToken: ct);
    }
}

public class ChatSession
{
    public string Id { get; set; } = "";
    public string UserId { get; set; } = "";
    public DateTime UpdatedAt { get; set; }
    public List<SerializableChatMessage> Messages { get; set; } = [];
}

Multi-User ASP.NET Core Integration

In ASP.NET Core, the services that own a ChatHistory must be registered as Scoped so each HTTP request works with its own instance. For multi-turn sessions, load the history at the start of each request and save it at the end.

using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;

// Program.cs
builder.Services.AddKernel()
    .AddAzureOpenAIChatCompletion(
        deploymentName: "gpt-4o",
        endpoint: builder.Configuration["AzureOpenAI:Endpoint"]!,
        apiKey: builder.Configuration["AzureOpenAI:ApiKey"]!);

builder.Services.AddScoped<RedisChatHistoryStore>();
builder.Services.AddScoped<HybridChatService>();

Wire up a minimal API endpoint that loads, runs, and saves history per request:

using Microsoft.ML.Tokenizers;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;

// Create the tokenizer once at startup: it is thread-safe and expensive to construct
var tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");

app.MapPost("/api/chat/{sessionId}", async (
    string sessionId,
    ChatRequest request,
    RedisChatHistoryStore historyStore,
    Kernel kernel,
    CancellationToken ct) =>
{
    const string SystemPrompt =
        "You are a helpful .NET assistant specialized in Semantic Kernel and Azure OpenAI.";

    // 1. Load or create history for this session
    var chatHistory = await historyStore.LoadAsync(sessionId, SystemPrompt, ct);

    // 2. Add the new user message
    chatHistory.AddUserMessage(request.Message);

    // 3. Apply token-aware truncation before the API call
    chatHistory.TruncateToTokenBudget(tokenizer, maxTokens: 100_000);

    // 4. Get the AI response
    var chatCompletion = kernel.GetRequiredService<IChatCompletionService>();

    var response = await chatCompletion.GetChatMessageContentAsync(
        chatHistory,
        kernel: kernel,
        cancellationToken: ct);

    chatHistory.AddAssistantMessage(response.Content ?? "");

    // 5. Persist the updated history
    await historyStore.SaveAsync(sessionId, chatHistory, ct);

    return Results.Ok(new { reply = response.Content });
});

public record ChatRequest(string Message);

The key design rule: never store ChatHistory as a Singleton in a multi-user application. A Singleton shares state across all concurrent requests, causing users to see each other’s conversation history.

Choosing the Right Strategy

Strategy       | Token predictability | Context preservation    | Extra latency             | Best for
Sliding window | High                 | Low (drops old context) | None                      | Customer support bots with short sessions
Summarization  | Medium               | High                    | +1 LLM call per threshold | Long-running assistant sessions
Token-aware    | Very high            | Medium                  | Small CPU overhead        | Applications with strict token budgets
Hybrid         | High                 | Very high               | +1 LLM call per 20 turns  | Production AI assistants

For most production .NET applications, start with the token-aware truncation pattern — it prevents context-length errors reliably and has minimal overhead. Upgrade to hybrid when users report the bot forgetting important context from earlier in long sessions.

⚠ Production Considerations

  • Never remove the system message at index 0 when truncating — it resets the AI persona and breaks safety guardrails. Always start RemoveRange at index 1.
  • Summarization adds latency (one extra LLM round-trip) and cost. Use a fast, cheap model like GPT-4o-mini for summaries, not your primary GPT-4o deployment.
  • Deserializing ChatHistory from Redis without proper AuthorRole mapping produces messages with AuthorRole.User for all entries — always preserve and restore the Role field explicitly.


🧠 Architect’s Note

For high-traffic multi-user chatbots, persist ChatHistory in Redis with a short TTL (24–48 hours) and lazy-load on first message. Avoid Cosmos DB for session storage if your sessions are short-lived — the per-document RU cost adds up. Summarize server-side, never client-side, to keep the summary consistent across clients. Profile your average session token growth rate weekly and adjust window sizes before you hit rate limit errors in production.


Key Takeaways

  • ChatHistory inherits from List<ChatMessageContent> — RemoveAt() and RemoveRange() work natively
  • Sliding window is the simplest strategy but loses long-range context; summarization preserves it at the cost of an extra LLM call
  • Token-aware truncation with TiktokenTokenizer is the most accurate way to stay within model limits
  • The hybrid pattern (summarize every 20 turns, keep last 5) gives the best coherence/cost tradeoff for production chatbots
  • Persist history by mapping messages to a role/content DTO list and serializing it with System.Text.Json for Redis or Cosmos DB

Implementation Checklist

  • Install Microsoft.SemanticKernel 1.54.0 and Microsoft.ML.Tokenizers 0.22.0
  • Initialize ChatHistory with a system prompt: new ChatHistory("system prompt")
  • Choose a truncation strategy based on your coherence requirements and token budget
  • Implement token counting using TiktokenTokenizer.CreateForModel("gpt-4o")
  • Add a persistent store (Redis or Cosmos DB) for multi-request chat sessions
  • Register chat services as Scoped in ASP.NET Core DI and key the persisted history by session ID
  • Test that the system message is always preserved at index 0 after any truncation
  • Monitor average token count per session in production to tune window sizes

Frequently Asked Questions

What is ChatHistory in Semantic Kernel?

ChatHistory is a collection class in the Microsoft.SemanticKernel namespace that holds the conversation messages exchanged between the user, assistant, and system. It inherits from List<ChatMessageContent> and provides helper methods like AddUserMessage(), AddAssistantMessage(), and AddSystemMessage() to append messages to the conversation.

How do I prevent ChatHistory from growing unbounded?

Use one of three strategies: (1) Sliding window — remove the oldest messages (excluding system) once the count exceeds a threshold; (2) Summarization — periodically compress old messages into a summary using a fast model like GPT-4o-mini; (3) Token-aware truncation — count tokens with Microsoft.ML.Tokenizers and remove oldest non-system messages until you fall below your token budget.

Can I serialize ChatHistory to store it in a database?

Yes. Map each message to a simple role/content DTO (such as the SerializableChatMessage record shown above), serialize the list with System.Text.Json, and store the resulting JSON in Cosmos DB, Redis, or any other store. Deserialize by recreating a ChatHistory and adding the restored messages back.

How do I share ChatHistory across multiple requests in ASP.NET Core?

Register the services that manage ChatHistory as Scoped so each HTTP request gets its own instance for the duration of that request. For multi-turn persistence across requests, load from and save to an external store (Redis or Cosmos DB), keyed by session ID or user ID, at the start and end of each request.

Which tokenizer should I use for counting Semantic Kernel messages?

Use TiktokenTokenizer.CreateForModel("gpt-4o") from the Microsoft.ML.Tokenizers 0.22.0 package. This is the same BPE tokenizer that Azure OpenAI GPT-4o models use, so the counts are accurate for planning token budgets.

What is the hybrid summarization pattern?

The hybrid pattern summarizes every N turns (e.g., every 20 messages) using a fast cheap model, then keeps only the last K turns (e.g., 5 messages) plus the summary as context, plus the system message at index 0. This balances coherence (recent turns) with long-range memory (summary) while keeping token usage predictable.

Is it safe to call chatHistory.RemoveRange() while preserving the system message?

Yes. Because ChatHistory inherits from List<ChatMessageContent>, you can call RemoveRange(startIndex, count). Always keep index 0 as your system message. The formula chatHistory.RemoveRange(1, Math.Max(0, chatHistory.Count - 1 - maxMessages)) removes the oldest non-system messages until only maxMessages remain after the system prompt.
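That formula is easy to confirm on a plain list standing in for ChatHistory:

```csharp
using System;
using System.Collections.Generic;

var history = new List<string> { "sys" };
for (int i = 1; i <= 30; i++)
{
    history.Add($"msg {i}");   // 30 non-system messages
}

int maxMessages = 20;

// One-shot removal: everything beyond the newest maxMessages non-system entries
history.RemoveRange(1, Math.Max(0, history.Count - 1 - maxMessages));

Console.WriteLine(history.Count);  // 21: system prompt + 20 most recent
Console.WriteLine(history[1]);     // "msg 11": the oldest surviving message
```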


#Semantic Kernel #Chat History #Token Management #ChatHistory #.NET AI