Retrieval-Augmented Generation Explained for .NET Architects

Intermediate · Original · .NET 9 · Microsoft.SemanticKernel 1.34.0 · Azure.Search.Documents 11.7.0
By Rajesh Mishra · Feb 28, 2026 · Verified: Feb 28, 2026 · 14 min read

Why LLMs Hallucinate

Large language models generate text by predicting the most probable next token based on patterns learned during training. They do not know facts — they have learned statistical associations between words. When asked about information that was not in their training data, or when the statistical pattern is ambiguous, they generate plausible-sounding text that may be completely fabricated.

This is called hallucination, and for enterprise applications it is a fundamental problem. A customer support bot that invents policy details. A legal assistant that cites nonexistent case law. A code assistant that recommends deprecated APIs. The model is not lying — it is doing exactly what it was trained to do: continue the text in the most probable way.

RAG solves this by changing the question. Instead of asking the model “What do you know about X?”, you first retrieve relevant documents about X from your own data store, then ask the model “Given these documents, answer the question about X.” The model generates a response grounded in actual data rather than parametric memory alone.

The RAG Pipeline

RAG is not a single technique — it is a pipeline with distinct stages. Each stage has its own design decisions, failure modes, and optimization opportunities.

┌─────────────────────────────────────────────────────────┐
│                    INGESTION PHASE                       │
│                                                         │
│   Documents → Chunk → Embed → Store in Vector DB        │
│                                                         │
├─────────────────────────────────────────────────────────┤
│                    RETRIEVAL PHASE                       │
│                                                         │
│   User Query → Embed Query → Vector Search → Top-K Docs │
│                                                         │
├─────────────────────────────────────────────────────────┤
│                   GENERATION PHASE                       │
│                                                         │
│   System Prompt + Retrieved Docs + Query → LLM → Answer │
│                                                         │
└─────────────────────────────────────────────────────────┘

Let’s walk through each stage.

Embeddings: Turning Text into Math

An embedding is a numerical representation of text — a vector of floating-point numbers (typically 256 to 3072 dimensions) that captures the meaning of the text, not just the words. Texts with similar meanings produce vectors that are close together in this high-dimensional space.

The sentence “How do I reset my password?” and “I forgot my login credentials” would have very similar embeddings, even though they share almost no words. This is what makes vector search fundamentally different from keyword search.

In .NET, generating embeddings is straightforward:

using Azure.AI.OpenAI;
using Azure.Identity; // for DefaultAzureCredential
using OpenAI.Embeddings;

var client = new AzureOpenAIClient(
    new Uri("https://your-resource.openai.azure.com/"),
    new DefaultAzureCredential());

EmbeddingClient embeddingClient = client.GetEmbeddingClient("text-embedding-3-small");

// Single text embedding
OpenAIEmbedding embedding = await embeddingClient.GenerateEmbeddingAsync(
    "How do I reset my password?");

ReadOnlyMemory<float> vector = embedding.ToFloats();
// vector now holds 1536 floats (the dimensionality of text-embedding-3-small)

Choosing an embedding model:

Model                    Dimensions   Quality   Cost     Use Case
text-embedding-3-small   1536         Good      Low      Most applications
text-embedding-3-large   3072         Best      Higher   High-precision retrieval
text-embedding-ada-002   1536         Good      Low      Legacy — prefer 3-small

For the majority of RAG applications, text-embedding-3-small provides an excellent quality-to-cost ratio. Switch to text-embedding-3-large only if retrieval quality measurements show the smaller model is insufficient for your specific domain.

One critical constraint: you must use the same embedding model for both ingestion and retrieval. Vectors from different models are not compatible. If you change your embedding model, you must re-embed your entire corpus.

Chunking Strategies

Before embedding, documents must be split into chunks — segments small enough to be meaningful as retrieval units but large enough to contain useful context. Chunking strategy has an outsized impact on RAG quality.

Fixed-Size Chunking

The simplest approach. Split text into segments of a fixed token count with overlap between adjacent chunks.

public class FixedSizeChunker
{
    private readonly int _chunkSize;
    private readonly int _overlap;

    public FixedSizeChunker(int chunkSize = 512, int overlap = 50)
    {
        if (overlap >= chunkSize)
            throw new ArgumentException("Overlap must be smaller than chunk size");
        _chunkSize = chunkSize;
        _overlap = overlap;
    }

    public List<string> Chunk(string text)
    {
        // Words approximate tokens here; plug in a real tokenizer for exact counts
        var words = text.Split((char[]?)null, StringSplitOptions.RemoveEmptyEntries);
        var chunks = new List<string>();

        for (int i = 0; i < words.Length; i += _chunkSize - _overlap)
        {
            var chunk = string.Join(' ', words.Skip(i).Take(_chunkSize));
            if (!string.IsNullOrWhiteSpace(chunk))
                chunks.Add(chunk);
        }

        return chunks;
    }
}

The overlap ensures that information near chunk boundaries is not lost. Without overlap, a sentence split across two chunks may not be retrievable by either.

Semantic Chunking

Instead of splitting at arbitrary positions, semantic chunking identifies natural topic boundaries — paragraph breaks, section headers, topic shifts — and splits there.

public class SemanticChunker
{
    private readonly int _maxChunkSize;

    public SemanticChunker(int maxChunkSize = 1000)
    {
        _maxChunkSize = maxChunkSize;
    }

    public List<string> Chunk(string text)
    {
        // Split on paragraph boundaries first
        var paragraphs = text.Split(
            ["\n\n", "\r\n\r\n"],
            StringSplitOptions.RemoveEmptyEntries);

        var chunks = new List<string>();
        var currentChunk = new StringBuilder();

        foreach (var paragraph in paragraphs)
        {
            // Flush when adding this paragraph would exceed the limit
            // (a single oversized paragraph still becomes one oversized chunk)
            if (currentChunk.Length + paragraph.Length > _maxChunkSize
                && currentChunk.Length > 0)
            {
                chunks.Add(currentChunk.ToString().Trim());
                currentChunk.Clear();
            }
            currentChunk.AppendLine(paragraph);
        }

        if (currentChunk.Length > 0)
            chunks.Add(currentChunk.ToString().Trim());

        return chunks;
    }
}

Recursive Chunking

A hybrid approach that first tries to split on the largest structural boundary (sections), then paragraphs, then sentences, then words — recursing until each chunk fits within the size limit. This preserves document structure as much as possible.
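A minimal sketch of the idea, using a separator hierarchy from coarsest to finest (these particular separators are an illustrative choice, not a fixed standard):

```csharp
using System;
using System.Collections.Generic;

// Separator hierarchy, coarsest to finest
string[] separators = ["\n\n", "\n", ". ", " "];

List<string> Chunk(string text, int maxChunkSize = 1000, int level = 0)
{
    var chunks = new List<string>();
    if (text.Length <= maxChunkSize)
    {
        if (!string.IsNullOrWhiteSpace(text))
            chunks.Add(text.Trim());
        return chunks;
    }

    if (level >= separators.Length)
    {
        // No finer separator left: hard-split as a last resort
        for (int i = 0; i < text.Length; i += maxChunkSize)
            chunks.Add(text.Substring(i, Math.Min(maxChunkSize, text.Length - i)));
        return chunks;
    }

    // Split at the current level, then recurse into any piece still too large
    foreach (var piece in text.Split(separators[level], StringSplitOptions.RemoveEmptyEntries))
        chunks.AddRange(Chunk(piece, maxChunkSize, level + 1));

    return chunks;
}

var sample = "Intro paragraph.\n\nA much longer second paragraph. It has several sentences. Each one adds detail.";
foreach (var chunk in Chunk(sample, maxChunkSize: 50))
    Console.WriteLine($"[{chunk.Length}] {chunk}");
```

Production implementations usually also re-pack small adjacent pieces so chunks approach the size limit rather than fragmenting; LangChain popularized this pattern as the RecursiveCharacterTextSplitter.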

Practical guidance: Start with fixed-size chunking at 512 tokens with 50-token overlap. Measure retrieval quality. If relevant documents are being missed, experiment with semantic or recursive chunking. The “best” strategy is entirely dependent on your document corpus and query patterns.

Vector Databases for .NET

Once chunks are embedded, you need somewhere to store the vectors and perform similarity search. The .NET ecosystem has several strong options.

Azure Cosmos DB

Azure Cosmos DB added native vector search to its NoSQL API, making it possible to store operational data and embeddings in the same database. This is compelling if you already use Cosmos DB — no separate vector store means no synchronization problems.

using Microsoft.Azure.Cosmos;

// Store a document chunk with its embedding
var chunk = new
{
    id = Guid.NewGuid().ToString(),
    content = "Semantic Kernel supports multiple AI connectors...",
    embedding = vectorFromEmbeddingModel, // float[]
    source = "docs/semantic-kernel-overview.md",
    category = "documentation"
};

await container.CreateItemAsync(chunk, new PartitionKey(chunk.category));

// Vector similarity search
var query = new QueryDefinition(
    "SELECT TOP @k c.id, c.content, c.source, " +
    "VectorDistance(c.embedding, @queryVector) AS score " +
    "FROM c " +
    "ORDER BY VectorDistance(c.embedding, @queryVector)")
    .WithParameter("@k", 5)
    .WithParameter("@queryVector", queryEmbedding);

For a full end-to-end implementation, see Build a RAG Chatbot with .NET, Semantic Kernel, and Azure Cosmos DB.

Azure AI Search

Azure AI Search is the most feature-rich option for .NET RAG architectures. It supports pure vector search, keyword search (BM25), hybrid search (both combined), and semantic reranking — all in a single service.

For the latest SDK capabilities, see Azure.Search.Documents 11.7.0 Released.

using Azure;
using Azure.Search.Documents;
using Azure.Search.Documents.Models;

var searchClient = new SearchClient(
    new Uri("https://your-search.search.windows.net"),
    "documents-index",
    new AzureKeyCredential("your-key"));

// Hybrid search: keyword + vector + semantic reranking
var options = new SearchOptions
{
    Size = 5,
    Select = { "content", "title", "source" },
    QueryType = SearchQueryType.Semantic,
    SemanticSearch = new SemanticSearchOptions
    {
        SemanticConfigurationName = "default",
        QueryCaption = new QueryCaption(QueryCaptionType.Extractive)
    },
    VectorSearch = new VectorSearchOptions
    {
        Queries =
        {
            new VectorizedQuery(queryEmbedding)
            {
                KNearestNeighborsCount = 10,
                Fields = { "embedding" }
            }
        }
    }
};

SearchResults<DocumentChunk> results = await searchClient.SearchAsync<DocumentChunk>(
    "how to configure vector search", options);

await foreach (SearchResult<DocumentChunk> result in results.GetResultsAsync())
{
    Console.WriteLine($"[{result.Score:F4}] {result.Document.Title}");
    Console.WriteLine($"  {result.Document.Content[..100]}...");
}

Azure AI Search is the recommended choice when you need hybrid search — and for most production RAG systems, you do. Learn more at Vector search overview on Microsoft Learn.

Other Options

Qdrant — Open-source vector database with a .NET client (Qdrant.Client NuGet). Good for teams that want to self-host and avoid Azure lock-in.

Pinecone — Managed vector database with a .NET client. Serverless pricing model suits variable workloads.

PostgreSQL with pgvector — If you already use PostgreSQL, the pgvector extension adds vector similarity search. Use with Npgsql in .NET.
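A sketch of the query side, assuming the Pgvector NuGet package (which teaches Npgsql the vector type), a hypothetical chunks table, and a placeholder connection string; verify the API details against the pgvector-dotnet documentation:

```csharp
using System;
using Npgsql;
using Pgvector;          // NuGet: Pgvector (adds the Vector type)
using Pgvector.Npgsql;   // Npgsql integration for pgvector

// Schema, run once:
//   CREATE EXTENSION IF NOT EXISTS vector;
//   CREATE TABLE chunks (id serial PRIMARY KEY, content text, embedding vector(1536));

var builder = new NpgsqlDataSourceBuilder("Host=localhost;Database=rag;Username=app");
builder.UseVector(); // register pgvector's type mapping
await using var dataSource = builder.Build();

float[] queryEmbedding = new float[1536]; // replace with a real embedding from your model

// <=> is pgvector's cosine distance operator: smaller distance = more similar
await using var cmd = dataSource.CreateCommand(
    "SELECT content, embedding <=> $1 AS distance FROM chunks ORDER BY distance LIMIT 5");
cmd.Parameters.Add(new NpgsqlParameter { Value = new Vector(queryEmbedding) });

await using var reader = await cmd.ExecuteReaderAsync();
while (await reader.ReadAsync())
    Console.WriteLine($"[{reader.GetDouble(1):F4}] {reader.GetString(0)}");
```

pgvector also offers `<->` (Euclidean) and `<#>` (negative inner product) operators; cosine distance is the usual choice for normalized text embeddings.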

Retrieval Methods

How you search for relevant chunks matters as much as how you store them.

Vector Search

Find the K nearest vectors to the query embedding, scored by cosine similarity or dot product. Fast and effective for semantic matching, but misses exact keyword matches.
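Under the hood, cosine similarity is just the dot product of two vectors normalized by their magnitudes. A minimal sketch:

```csharp
using System;

double CosineSimilarity(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
{
    if (a.Length != b.Length) throw new ArgumentException("Dimension mismatch");
    double dot = 0, magA = 0, magB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot  += a[i] * b[i];
        magA += a[i] * a[i];
        magB += b[i] * b[i];
    }
    return dot / (Math.Sqrt(magA) * Math.Sqrt(magB));
}

// Parallel vectors score 1.0; orthogonal vectors score 0.0
Console.WriteLine(CosineSimilarity(new float[] { 1, 0 }, new float[] { 1, 0 })); // 1
Console.WriteLine(CosineSimilarity(new float[] { 1, 0 }, new float[] { 0, 1 })); // 0
```

For production workloads, the System.Numerics.Tensors package provides a SIMD-accelerated TensorPrimitives.CosineSimilarity; vector databases additionally use approximate nearest-neighbor indexes so they never compare against every stored vector.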

Keyword Search (BM25)

Traditional full-text search using term frequency and inverse document frequency. Excellent for exact terms, product names, error codes — anything where the precise words matter.

Hybrid Search

Combine vector and keyword search, then merge the results. This is the approach used by Azure AI Search and increasingly recognized as the optimal default for RAG. Hybrid search catches both semantic matches (concepts the user is asking about) and lexical matches (specific terms and identifiers).
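When you run the two searches yourself rather than delegating to a service, a common way to merge the ranked lists is Reciprocal Rank Fusion (RRF), the same technique Azure AI Search uses internally for hybrid queries. Each document scores the sum of 1/(k + rank) over the lists it appears in; k = 60 is the conventional constant:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Score each document by the sum of 1 / (k + rank) across the lists it appears in
List<string> ReciprocalRankFusion(List<List<string>> rankedLists, int k = 60)
{
    var scores = new Dictionary<string, double>();
    foreach (var list in rankedLists)
        for (int rank = 0; rank < list.Count; rank++)
            scores[list[rank]] =
                scores.GetValueOrDefault(list[rank]) + 1.0 / (k + rank + 1);

    return scores.OrderByDescending(kv => kv.Value).Select(kv => kv.Key).ToList();
}

var keywordResults = new List<string> { "docA", "docB", "docC" };
var vectorResults  = new List<string> { "docB", "docD", "docA" };

// docB ranks high in both lists, so it fuses to the top
var fused = ReciprocalRankFusion([keywordResults, vectorResults]);
Console.WriteLine(string.Join(", ", fused)); // docB, docA, docD, docC
```

Because RRF works on ranks rather than raw scores, it sidesteps the problem that BM25 scores and cosine similarities live on incomparable scales.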

Semantic Reranking

After initial retrieval, pass the top results through a cross-encoder model that evaluates how well each document actually answers the query. This reranking step significantly improves precision, especially when the initial retrieval returns near-miss results.

The recommended stack for most .NET RAG systems: hybrid search (BM25 + vector) with semantic reranking. Start with pure vector search for simplicity, then add hybrid and reranking when you need better precision.

Prompt Augmentation

The retrieval phase gives you relevant documents. The augmentation phase injects those documents into the LLM prompt. This is where most RAG implementations look deceptively simple — but the prompt design significantly affects response quality.

public class RagPromptBuilder
{
    public string BuildPrompt(string userQuery, List<RetrievedDocument> documents)
    {
        var context = new StringBuilder();
        for (int i = 0; i < documents.Count; i++)
        {
            context.AppendLine($"[Source {i + 1}: {documents[i].Title}]");
            context.AppendLine(documents[i].Content);
            context.AppendLine();
        }

        return $"""
            You are a technical assistant. Answer the user's question based ONLY on
            the provided context documents. If the context does not contain enough
            information to answer the question, say so explicitly — do not guess.

            When you use information from a source, cite it as [Source N].

            ## Context Documents

            {context}

            ## User Question

            {userQuery}

            ## Instructions

            - Answer precisely based on the context
            - Cite sources using [Source N] notation
            - If the context is insufficient, state what information is missing
            """;
    }
}

Key principles for augmentation prompts:

  • Instruct the model to use only the provided context. This reduces hallucination.
  • Require citations. This enables users to verify the answer and builds trust.
  • Handle insufficient context explicitly. “I don’t know” is better than a fabricated answer.
  • Place context before the question. Models attend to context positioning — retrieved documents placed before the question tend to be used more effectively.

Evaluation Metrics

Building a RAG system without measuring its quality is flying blind. There are three dimensions to evaluate:

Faithfulness — Does the generated answer actually reflect what the retrieved documents say? A faithful answer does not add information beyond what the context provides.

Relevance — Are the retrieved documents relevant to the user’s question? Poor retrieval quality cascades into poor generation quality.

Groundedness — Can every claim in the generated answer be traced back to a specific retrieved document? This is the strongest measure of RAG effectiveness.

Measuring these in practice typically involves:

  1. Creating an evaluation dataset — a set of questions with known correct answers and the documents that should be retrieved
  2. Running the RAG pipeline against this dataset
  3. Scoring retrieval (precision@K, recall@K) and generation (faithfulness, relevance) metrics
  4. Iterating on chunking, retrieval, and prompt design based on the scores

Automated evaluation using an LLM-as-judge pattern (where a second LLM evaluates the faithfulness of responses against retrieved context) provides affordable, scalable quality measurement during development.
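The retrieval scores themselves are simple to compute directly. Precision@K is the share of the top-K retrieved chunks that are relevant; recall@K is the share of all relevant chunks that made it into the top K (the chunk IDs below are hypothetical):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

(double Precision, double Recall) ScoreAtK(
    List<string> retrieved, HashSet<string> relevant, int k)
{
    var topK = retrieved.Take(k).ToList();
    int hits = topK.Count(relevant.Contains);
    return ((double)hits / topK.Count, (double)hits / relevant.Count);
}

// The pipeline returned five chunks; the eval dataset marks three as relevant
var retrieved = new List<string> { "chunk-12", "chunk-07", "chunk-33", "chunk-02", "chunk-19" };
var relevant  = new HashSet<string> { "chunk-07", "chunk-02", "chunk-41" };

var (p, r) = ScoreAtK(retrieved, relevant, k: 5);
Console.WriteLine($"precision@5 = {p:F2}, recall@5 = {r:F2}"); // 2 of 5 retrieved are relevant; 2 of 3 relevant were found
```

Averaging these over the full evaluation dataset gives you a retrieval baseline to compare chunking and search changes against, independent of anything the LLM does downstream.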

Architecture for .NET

Bringing the full pipeline together in a .NET architecture:

┌─────────────────────────────────────────────────────────────────┐
│                     .NET 9 Web API                              │
│                                                                 │
│  ┌──────────────┐  ┌──────────────┐  ┌────────────────────┐    │
│  │ Ingestion    │  │ Retrieval    │  │ Generation         │    │
│  │ Service      │  │ Service      │  │ Service            │    │
│  │              │  │              │  │                    │    │
│  │ - Read docs  │  │ - Embed      │  │ - Build prompt     │    │
│  │ - Chunk      │  │   query      │  │ - Call LLM         │    │
│  │ - Embed      │  │ - Search     │  │ - Stream response  │    │
│  │ - Store      │  │   vectors    │  │ - Track citations  │    │
│  └──────┬───────┘  │ - Rerank     │  └────────┬───────────┘    │
│         │          └──────┬───────┘           │                │
│         ▼                 ▼                   ▼                │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │            Semantic Kernel Orchestration                 │   │
│  └─────────────────────────┬───────────────────────────────┘   │
│                             │                                   │
└─────────────────────────────┼───────────────────────────────────┘

          ┌───────────────────┼───────────────────┐
          ▼                   ▼                   ▼
   ┌─────────────┐   ┌──────────────┐   ┌──────────────┐
   │ Azure       │   │ Azure AI     │   │ Azure OpenAI │
   │ Cosmos DB   │   │ Search       │   │ GPT-4o       │
   │ (vectors +  │   │ (hybrid      │   │ (generation) │
   │  documents) │   │  search)     │   │              │
   └─────────────┘   └──────────────┘   └──────────────┘

Each service is registered through dependency injection and orchestrated by Semantic Kernel. The ingestion service runs offline (or via background jobs), the retrieval and generation services handle real-time queries.
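A minimal Program.cs wiring for this layout might look as follows; the service interfaces (IRetrievalService and so on) are illustrative names taken from the diagram above, not a prescribed API, and the configuration keys are assumptions:

```csharp
using Microsoft.SemanticKernel;

var builder = WebApplication.CreateBuilder(args);

// Semantic Kernel with an Azure OpenAI chat connector for the generation stage
builder.Services.AddKernel()
    .AddAzureOpenAIChatCompletion(
        deploymentName: "gpt-4o",
        endpoint: builder.Configuration["AzureOpenAI:Endpoint"]!,
        apiKey: builder.Configuration["AzureOpenAI:ApiKey"]!);

// Pipeline services (hypothetical interfaces matching the diagram)
builder.Services.AddSingleton<IRetrievalService, AzureSearchRetrievalService>();
builder.Services.AddSingleton<IGenerationService, GenerationService>();

// Ingestion runs out-of-band as a background job
builder.Services.AddHostedService<IngestionBackgroundService>();

var app = builder.Build();
app.Run();
```

Registering the kernel through AddKernel lets every service resolve the same configured Kernel instance from the container rather than constructing its own.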

For a hands-on implementation of this architecture, see the Build a Semantic Search API with .NET and Azure AI Search workshop.

Common Pitfalls

Chunking too aggressively. Tiny chunks (100 tokens) lose context. The model receives fragments that are technically relevant but lack enough information to produce a useful answer. Start at 512 tokens and adjust.

Ignoring chunk overlap. Without overlap, information at chunk boundaries is effectively invisible to retrieval. A 50-100 token overlap solves this at minimal storage cost.

Skipping hybrid search. Pure vector search misses exact matches for product codes, error numbers, and proper nouns. Hybrid search catches both semantic and lexical matches.

Not measuring retrieval quality separately. When answers are wrong, the instinct is to tune the prompt. But most RAG failures are retrieval failures — the right documents were never found. Measure retrieval precision and recall independently.

Embedding model mismatch. Using one model to embed documents and a different model to embed queries produces garbage results. This seems obvious, but it happens during migrations and upgrades. Version your embedding model choices.

Summary

RAG is the architecture that makes LLMs useful for enterprise applications. It grounds responses in your actual data, provides citation traceability, and keeps knowledge current without retraining models.

The pipeline — ingest, chunk, embed, store, retrieve, augment, generate — has clear boundaries and measurable quality at each stage. For .NET architects, the ecosystem is mature: Azure Cosmos DB and Azure AI Search for storage, Azure OpenAI for embeddings and generation, and Semantic Kernel for orchestration.

Start simple. Fixed-size chunking, pure vector search, a straightforward augmentation prompt. Measure retrieval quality. Then add hybrid search, semantic reranking, and prompt refinements based on what the measurements tell you. RAG is an iterative system, not a one-shot deployment.

⚠ Production Considerations

  • Embedding model changes require re-embedding your entire corpus — version your embedding model choice and plan for re-indexing before switching models.
  • Chunk size significantly impacts retrieval quality. Chunks that are too large dilute relevance; chunks that are too small lose context. Benchmark different sizes against your actual queries before committing to a strategy.

🧠 Architect’s Note

RAG is not a single component — it is a pipeline with multiple failure points. Measure retrieval quality (are the right documents being found?) separately from generation quality (is the LLM using the context correctly?). Most RAG failures are retrieval failures, not generation failures. Invest in chunking and search quality before tuning prompts.

AI-Friendly Summary

Summary

This article explains the Retrieval-Augmented Generation (RAG) architecture for .NET architects. It covers why LLMs hallucinate and how RAG solves it, the complete RAG pipeline from ingestion through generation, embedding mechanics, vector database options for .NET (Azure Cosmos DB, Azure AI Search, Qdrant), chunking strategies, retrieval methods including hybrid search, and evaluation metrics for RAG quality.

Key Takeaways

  • RAG grounds LLM responses in retrieved documents, reducing hallucinations without fine-tuning
  • The pipeline is: ingest, chunk, embed, store, retrieve, augment, generate
  • Embeddings convert text into high-dimensional vectors that capture semantic meaning
  • Hybrid search (BM25 + vector) with semantic reranking outperforms pure vector search
  • Azure Cosmos DB and Azure AI Search are the primary .NET-native vector store options

Implementation Checklist

  • Choose a vector store: Azure Cosmos DB, Azure AI Search, or Qdrant
  • Select an embedding model: text-embedding-3-small for cost efficiency, text-embedding-3-large for quality
  • Implement a chunking strategy with overlap (start with 512 tokens, 50 token overlap)
  • Build the ingestion pipeline: read documents, chunk, embed, store
  • Implement retrieval with top-K vector search
  • Consider hybrid search (BM25 + vector) for better recall
  • Design the augmentation prompt template with retrieved context
  • Measure faithfulness and relevance with evaluation metrics
  • Add citation tracking to link responses back to source documents

Frequently Asked Questions

What is RAG in AI?

Retrieval-Augmented Generation (RAG) is an architecture pattern that improves LLM accuracy by retrieving relevant documents from a knowledge base and injecting them into the prompt before generation. Instead of relying solely on the model's training data, RAG grounds responses in your actual data — reducing hallucinations and enabling the model to answer questions about private, recent, or domain-specific information.

Why is RAG better than fine-tuning for most use cases?

Fine-tuning bakes knowledge into model weights, which is expensive, slow to update, and risks catastrophic forgetting. RAG keeps knowledge in an external store that you can update instantly without retraining. RAG also provides citation traceability — you know exactly which documents informed the answer. For most enterprise scenarios where data changes frequently, RAG is more practical and cost-effective.

What vector database should I use with .NET?

Azure Cosmos DB (with native vector search) is ideal if you already use Cosmos DB for operational data, as it eliminates a separate vector store. Azure AI Search is the strongest choice for hybrid search (BM25 + vector) with semantic reranking. For open-source options, Qdrant and Pinecone both have .NET client libraries. The choice depends on your existing infrastructure and whether you need hybrid search capabilities.

What chunking strategy works best for RAG?

There is no single best strategy — it depends on your content. Fixed-size chunking (500-1000 tokens with overlap) works well for uniform text. Semantic chunking (splitting on topic boundaries) produces higher-quality chunks for varied content but is more complex to implement. Start with fixed-size overlapping chunks and iterate based on retrieval quality metrics.


#RAG #Embeddings #Vector Search #Azure AI Search #Azure Cosmos DB #.NET AI