How Large Language Models Work — A Mental Model for Engineers

By Rajesh Mishra · Feb 28, 2026 · Verified: Feb 28, 2026 · 12 min read

Building on the Foundation

In the previous article, we covered what generative AI is, where LLMs fit in the machine learning hierarchy, and what tools the .NET ecosystem provides. That was the map. Now we need to understand the territory.

This article explains how LLMs actually work — not at the mathematical level, but at the mechanical level that matters for engineering decisions. When you understand the generation process, you make better choices about prompting, temperature settings, context management, and model selection. These are not abstract concerns. They directly affect the reliability, cost, and quality of your AI features.

How LLMs Are Built: Pre-Training and Fine-Tuning

An LLM does not spring into existence fully formed. It goes through distinct phases, and understanding these phases explains both the model’s capabilities and its limitations.

Pre-Training: Learning Language Patterns

Pre-training is where the model learns how language works. The process is conceptually simple, even if the engineering is staggering in scale.

Take an enormous corpus of text — books, websites, code repositories, academic papers, conversation logs. Trillions of tokens worth. Then train a neural network on one task: given a sequence of tokens, predict the next token. Do this billions of times, adjusting the model’s weights to minimize prediction error.

That is it. The model is never told what grammar is. It is never taught facts. It is never given rules about reasoning. It learns all of these things implicitly by getting very good at predicting what token comes next.

The scale is what makes this work. GPT-4 was reportedly trained on over 13 trillion tokens. At that scale, the model develops internal representations that capture syntax, semantics, factual knowledge, reasoning patterns, and even some understanding of code structure. These capabilities were not explicitly programmed — they emerged from the training objective.

Pre-training requires massive computational resources. We are talking about thousands of GPUs running for months at a cost of tens to hundreds of millions of dollars. This is why only a handful of organizations can train frontier models from scratch.
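The training objective described above reduces to a simple calculation per token: the loss is the negative log-probability the model assigned to the token that actually came next. A toy sketch (hypothetical probabilities standing in for a real model's output) shows the arithmetic:

```python
import math

# Toy vocabulary and a hypothetical predicted distribution for the next token
vocab = ["the", "cat", "sat", "mat"]
predicted_probs = [0.1, 0.2, 0.6, 0.1]  # sums to 1

# The token that actually followed in the training text
actual_next = "sat"

# Cross-entropy loss for this single prediction:
# the negative log-probability assigned to the correct token
loss = -math.log(predicted_probs[vocab.index(actual_next)])
print(f"loss = {loss:.4f}")  # modest loss: the model rated "sat" likely

# Training adjusts the weights to push this loss down,
# repeated across trillions of tokens
```

Getting better at this one number, at scale, is the entire pre-training process.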

Fine-Tuning: Shaping Behavior

A pre-trained model is a powerful but raw language predictor. It can continue any text, but it cannot reliably follow instructions, answer questions helpfully, or refuse harmful requests. Fine-tuning addresses this.

Supervised Fine-Tuning (SFT) trains the model on curated examples of good behavior — high-quality prompt-response pairs where the responses demonstrate the kind of output you want. This teaches the model to follow instructions rather than just predict text.

Reinforcement Learning from Human Feedback (RLHF) takes the fine-tuned model further. Human evaluators rank multiple model outputs for a given prompt. A reward model learns those preferences, and the LLM is trained to maximize the reward signal. This is how models learn to be helpful, harmless, and accurate — or at least, how they get significantly better at those qualities.

Anthropic’s Claude uses a variant called Constitutional AI (CAI), where the model critiques and revises its own outputs based on a set of principles, reducing the need for human evaluators in the feedback loop.

The result of this pipeline — pre-training, then SFT, then RLHF — is the model you interact with through an API. Every response it generates reflects all three phases of training.

Tokenization: How Text Becomes Numbers

Neural networks operate on numbers, not text. Tokenization is the process of converting text into numerical representations that the model can process.

Byte Pair Encoding (BPE)

Most modern LLMs use Byte Pair Encoding or a close variant. BPE works by iteratively merging the most frequent pairs of characters (or bytes) in the training corpus until it reaches a target vocabulary size.

The result is a vocabulary of subword units. Common words like “the” or “function” become single tokens. Less common words are split into fragments: “unbelievable” might become [“un”, “believ”, “able”]. Rare words or technical jargon might be split into even smaller pieces.
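A single BPE merge step can be sketched in a few lines. This toy example (a five-word corpus, not a real tokenizer) counts adjacent symbol pairs and merges the most frequent one; real tokenizers repeat this thousands of times:

```python
from collections import Counter

# Tiny corpus; real BPE training runs over trillions of tokens
words = ["low", "lower", "lowest", "newest", "widest"]

# Start with each word as a sequence of single characters
sequences = [list(w) for w in words]

# Count adjacent symbol pairs across the corpus
pairs = Counter()
for seq in sequences:
    for a, b in zip(seq, seq[1:]):
        pairs[(a, b)] += 1

# Merge the most frequent pair into one new symbol
best = max(pairs, key=pairs.get)

merged = []
for seq in sequences:
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
            out.append(seq[i] + seq[i + 1])  # fuse the pair
            i += 2
        else:
            out.append(seq[i])
            i += 1
    merged.append(out)

print("merged pair:", best)
print(merged)
```

After enough merges, frequent fragments like "low" and "est" become single vocabulary entries, which is exactly why common words end up as one token and rare words as several.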

A typical vocabulary size is 50,000 to 100,000 tokens. GPT-4 uses a vocabulary of roughly 100,000 tokens. Claude uses its own tokenizer with a similar scale.

SentencePiece

SentencePiece is an alternative tokenization framework used by models like Llama and Gemini. It treats the input as a raw stream of characters and learns subword units directly, without relying on language-specific pre-processing like whitespace splitting. This makes it more language-agnostic — important for multilingual models.

Why Tokenization Matters for Engineers

Understanding tokenization explains several practical behaviors:

Token counts differ from word counts. A 1,000-word English document typically maps to 1,200-1,500 tokens, depending on the tokenizer, its vocabulary, and the complexity of the words involved. Code tends to have a higher token-per-word ratio because of syntax characters and naming conventions.

Different models tokenize differently. The phrase “Microsoft.Extensions.AI” might be two tokens in one model and five in another. This affects both cost and how the model internally represents your input.

Tokenization affects reasoning. Models reason at the token level. A model that tokenizes “ChatGPT” as a single token treats it as an atomic unit. A model that splits it into “Chat” + “G” + “PT” must compose the meaning from parts. This can create subtle differences in behavior.

For most .NET development work, you will not interact with tokenizers directly. But when you are debugging unexpected model behavior, estimating costs, or managing context windows, knowing how tokenization works gives you the right mental model.
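For quick cost estimates, a rough rule of thumb is about four characters per token for English prose. This is a hedged heuristic only — for real numbers, use your provider's tokenizer library — but it is good enough for budgeting:

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate for English prose.

    Real counts vary by tokenizer; code and non-English text
    usually come out noticeably higher than this estimate.
    """
    return max(1, round(len(text) / chars_per_token))

prompt = "Summarize the following document in three bullet points."
print(estimate_tokens(prompt))  # rough estimate only
```

Treat the result as an order-of-magnitude figure, not a billing-accurate count.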

Next-Token Prediction: The Core Mechanism

Everything an LLM outputs — every answer, every code snippet, every creative story — is the result of repeated next-token prediction.

Here is the process:

  1. Your prompt is tokenized into a sequence of token IDs.
  2. The model processes the full sequence through its transformer layers.
  3. The final layer outputs a probability distribution over the entire vocabulary — a number for each of the ~100,000 tokens indicating how likely it is to come next.
  4. A token is selected from this distribution (more on selection strategy below).
  5. The selected token is appended to the sequence.
  6. Steps 2-5 repeat until the model outputs a special stop token or the maximum output length is reached.

This is called autoregressive generation — the model generates one token at a time, and each new token becomes part of the input for generating the next one. The model is literally talking to itself, one token at a time, building the response incrementally.
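The six steps above can be sketched as a loop. The "model" here is a hypothetical lookup table of next-token probabilities standing in for billions of transformer weights, and selection is greedy (temperature 0):

```python
# Hypothetical next-token distributions, keyed by the tokens so far
toy_model = {
    (): {"The": 1.0},
    ("The",): {"cat": 0.7, "dog": 0.3},
    ("The", "cat"): {"sat": 0.9, "<stop>": 0.1},
    ("The", "cat", "sat"): {"<stop>": 1.0},
}

def generate(max_tokens: int = 10) -> list[str]:
    tokens: list[str] = []
    for _ in range(max_tokens):
        dist = toy_model[tuple(tokens)]
        # Greedy selection (temperature 0): always take the likeliest token
        next_token = max(dist, key=dist.get)
        if next_token == "<stop>":
            break
        tokens.append(next_token)  # the output becomes part of the input
    return tokens

print(generate())  # ['The', 'cat', 'sat']
```

Note that the loop has no notion of the finished sentence — each iteration only sees the tokens generated so far, which is the point made above about the model having no global plan.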

There is a critical implication here: the model has no plan for its complete response. When it starts generating, it does not have a structured outline of what it will say. Each token is chosen based only on everything that came before it. This is why long, complex outputs can sometimes drift or contradict earlier statements — the model is making locally optimal choices without guaranteed global coherence.

Understanding this mechanism is fundamentally important for prompt design. If you want structured output, you need to guide the model’s token-by-token prediction toward the structure you want. This is exactly what prompt engineering is about.

Temperature, Top-p, and Sampling Strategies

When the model produces a probability distribution over its vocabulary, how do you pick the next token? This is where sampling parameters come in, and getting them right for your use case is a genuine engineering decision.

Temperature

Temperature scales the probability distribution before sampling. Mathematically, it divides the raw logits (pre-softmax scores) by the temperature value.

  • Temperature 0 — The model always picks the single highest-probability token. Output is effectively deterministic: given the same input, you get the same output nearly every time (minor floating-point nondeterminism can still creep in on some providers). Use this for factual retrieval, code generation, and structured data extraction.
  • Temperature 0.3-0.5 — Slightly more variety while still favoring high-probability tokens. Good for most production use cases where you want reliable but not completely rigid output.
  • Temperature 0.7-1.0 — Meaningful randomness. Lower-probability tokens have a real chance of being selected. Use this for creative tasks — brainstorming, creative writing, generating diverse alternatives.
  • Temperature > 1.0 — Flattens the distribution aggressively. Output becomes increasingly random and often incoherent. Rarely useful in production.
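The scaling described above — dividing logits by the temperature before softmax — can be sketched directly. The logit values are hypothetical; the mechanics match how samplers apply temperature:

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Scale logits by 1/temperature, then softmax.

    Lower temperature sharpens the distribution; higher flattens it.
    (Temperature 0 is handled separately in practice: take the argmax.)
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical pre-softmax scores for 3 tokens

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
# At t=0.5 the top token dominates; at t=2.0 the distribution flattens
```

Running this makes the bullet points above concrete: the same logits yield a near-certain top token at low temperature and a nearly uniform spread at high temperature.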

Top-p (Nucleus Sampling)

Top-p offers an alternative approach to controlling randomness. Instead of scaling probabilities, it truncates the distribution: sort tokens by probability, then only consider the smallest set of tokens whose cumulative probability exceeds the threshold p.

With top_p = 0.9, the model only considers the top tokens that together account for 90% of the probability mass, regardless of how many tokens that includes. This adapts dynamically — when the model is confident, it considers fewer tokens; when uncertain, it considers more.
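The truncation step can be sketched as follows — a conceptual illustration over a toy distribution (real samplers operate on logit tensors, not dicts):

```python
def top_p_filter(probs: dict[str, float], p: float) -> dict[str, float]:
    """Keep the smallest set of tokens whose cumulative probability
    reaches p, then renormalize so the kept probabilities sum to 1."""
    items = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = {}, 0.0
    for token, prob in items:
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break  # the nucleus is complete
    total = sum(kept.values())
    return {t: pr / total for t, pr in kept.items()}

dist = {"sat": 0.55, "slept": 0.30, "ran": 0.10, "flew": 0.05}
print(top_p_filter(dist, 0.8))  # keeps "sat" and "slept", renormalized
```

The adaptive behavior falls out naturally: a confident, peaked distribution reaches the threshold after one or two tokens, while a flat, uncertain one keeps many candidates in play.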

Practical Guidance

Most API providers support both temperature and top-p. Use one or the other, not both simultaneously (they interact in unintuitive ways).

For most .NET application scenarios:

  • Code generation and JSON output: Temperature 0 to 0.2
  • Summarization and analysis: Temperature 0.3 to 0.5
  • Conversational responses: Temperature 0.5 to 0.7
  • Creative generation: Temperature 0.7 to 1.0

These are starting points. The right values depend on your specific use case, and you should experiment.

Context Windows and Their Engineering Implications

The context window — measured in tokens — defines the total amount of information the model can process in a single request. This includes your system prompt, conversation history, any retrieved documents, the user’s current message, and the model’s response.

Current context window sizes:

Model                 Context Window
GPT-4o                128,000 tokens
Claude 3.5 Sonnet     200,000 tokens
Gemini 1.5 Pro        2,000,000 tokens
Llama 3.1 405B        128,000 tokens
DeepSeek-V3           128,000 tokens

Large context windows solve some problems but create others. Processing 200K tokens is slower and more expensive than processing 2K tokens. The model’s attention over very long contexts is not uniform — information in the middle of a long context can receive less attention than information at the beginning or end (a phenomenon sometimes called “lost in the middle”).

Engineering implications for .NET applications:

  • Conversation management. Track token usage and implement sliding window or summarization strategies when approaching limits.
  • RAG architecture. Even with large context windows, retrieving only the most relevant chunks is more cost-effective and often more accurate than stuffing everything in.
  • Cost modeling. Token-based pricing means context window usage directly impacts your operational costs. A system prompt that uses 2,000 tokens is charged on every single request.
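The sliding-window strategy from the first bullet can be sketched as: pin the system prompt, then evict the oldest turns until the estimated token count fits the budget. The token estimator here is a hypothetical 4-characters-per-token stand-in; production code should count with the provider's tokenizer.

```python
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # rough stand-in for a real tokenizer

def trim_history(system_prompt: str, turns: list[str], budget: int) -> list[str]:
    """Drop the oldest turns until system prompt + history fit the budget."""
    kept = list(turns)
    while kept and estimate_tokens(system_prompt) + sum(map(estimate_tokens, kept)) > budget:
        kept.pop(0)  # evict the oldest turn first
    return kept

system = "You are a helpful assistant for a .NET engineering team."
history = ["old question " * 50, "older answer " * 50, "recent question"]
print(trim_history(system, history, budget=200))  # oldest turn evicted
```

A summarization strategy would replace the evicted turns with a compressed recap instead of discarding them; which one fits depends on how much old context your use case actually needs.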

The Major Model Families

As a .NET developer integrating LLMs, you will encounter several model families. Here is a practical overview of the landscape as it stands.

GPT (OpenAI / Azure OpenAI)

The GPT family — GPT-4, GPT-4o, GPT-4o-mini — is the most widely deployed in enterprise settings, largely due to Azure OpenAI Service. Microsoft’s partnership with OpenAI means these models have first-class support in the Azure ecosystem, including private networking, managed identity, and content filtering.

GPT-4o is the current flagship — fast, multimodal (text + vision), strong at code, and available through Azure OpenAI Service.

Claude (Anthropic)

Claude models — Claude 3.5 Sonnet, Claude 3 Opus, Claude 3.5 Haiku — are known for strong instruction following, long-context processing, and careful safety alignment. Claude 3.5 Sonnet offers an especially good balance of speed, quality, and cost. Available through the Anthropic API and increasingly through cloud provider partnerships.

Gemini (Google)

Gemini models — Gemini 1.5 Pro, Gemini 1.5 Flash — are Google’s multimodal models with notably large context windows (up to 2 million tokens). Available through Google Cloud’s Vertex AI and the Gemini API.

Llama (Meta)

Llama is Meta’s open-source model family. Llama 3.1 is available in 8B, 70B, and 405B parameter sizes. Being open-source, you can host it yourself, fine-tune it on your data, and deploy it without per-token API costs. Available through Azure AI Model Catalog, Hugging Face, and any infrastructure that supports model hosting.

DeepSeek

DeepSeek models, including DeepSeek-V3 and DeepSeek-Coder, have shown surprisingly strong performance at lower cost points. DeepSeek-V3 uses a Mixture-of-Experts architecture that activates only a fraction of its parameters per token, making it efficient despite its large total parameter count.

Open-Source vs. Proprietary Models

This is one of the most consequential architectural decisions you will make when building AI features. The trade-offs are real.

Proprietary models (GPT-4o, Claude, Gemini):

  • Higher quality on complex reasoning tasks (generally)
  • Zero infrastructure burden — call an API, get a response
  • Per-token pricing that scales with usage
  • Data leaves your network (unless using Azure OpenAI with private endpoints)
  • You are dependent on the provider’s availability, pricing, and model changes

Open-source models (Llama, DeepSeek, Mistral, Phi):

  • Growing quality — top open-source models rival proprietary ones for many tasks
  • Fixed infrastructure cost (GPUs) regardless of token volume
  • Full data sovereignty — nothing leaves your network
  • Fine-tuning on your domain data is possible and often transformative
  • You own the operational burden: hosting, scaling, monitoring, updating

For most .NET teams starting with AI, proprietary models via Azure OpenAI offer the fastest path to production. As usage scales and requirements mature, evaluating open-source alternatives for specific workloads — especially high-volume, domain-specific tasks — becomes a meaningful cost optimization.

Why This Matters for Prompt Design

Understanding the generation mechanism directly improves your prompting. When you know the model is predicting one token at a time based on everything before it, several prompt engineering principles become obvious:

  • Clear instructions early in the prompt bias every subsequent token in the right direction.
  • Structured output formats (JSON, tables) work because the model’s token-level predictions follow the structural patterns in its training data.
  • Few-shot examples work because they shift the probability distribution toward the pattern demonstrated in the examples.
  • Temperature tuning matters because different tasks need different amounts of predictability.

These are not mystical prompt engineering tricks. They are direct consequences of how the model generates text.

What Comes Next

You now have a mechanical understanding of how LLMs work — from training through tokenization through generation. The next step is putting this knowledge to work.

In Prompt Engineering Fundamentals for C# Developers, we will apply these concepts practically. You will learn how to structure prompts, use system messages effectively, implement few-shot patterns, and extract structured output — all with C# examples you can use immediately.

⚠ Production Considerations

  • Setting temperature too high for structured tasks (code generation, JSON output) introduces errors that are expensive to detect and retry.
  • Ignoring context window limits leads to silent truncation of conversation history, causing the model to lose important context mid-conversation.

🧠 Architect’s Note

Treat LLM calls as unreliable external dependencies. Design for retries, timeouts, and fallback behavior. The model's output is probabilistic — validate structured outputs with schema validation, not trust.
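As a sketch of the "validate, don't trust" point — a hypothetical shape check, where production code would use a real schema validator or typed deserialization:

```python
import json

def parse_model_output(raw: str) -> dict:
    """Parse and shape-check a model response expected to be JSON with
    a 'summary' string and a 'tags' list. Raise on anything else so the
    caller can retry or fall back instead of propagating bad data."""
    data = json.loads(raw)  # raises ValueError on non-JSON output
    if not isinstance(data.get("summary"), str):
        raise ValueError("missing or non-string 'summary'")
    if not isinstance(data.get("tags"), list):
        raise ValueError("missing or non-list 'tags'")
    return data

good = '{"summary": "ok", "tags": ["ai"]}'
print(parse_model_output(good))

try:
    parse_model_output('{"summary": 42, "tags": []}')
except ValueError as e:
    print("rejected:", e)  # caller can now retry with a corrective prompt
```

The failure path is the important part: a rejected response is a signal to retry, not an exception to let crash the request.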

AI-Friendly Summary

Summary

This article explains how large language models work at a mechanical level for software engineers. It covers pre-training and fine-tuning phases, tokenization with BPE and SentencePiece, the next-token prediction mechanism, sampling strategies including temperature and top-p, context window engineering, and the major model families (GPT, Claude, Gemini, Llama, DeepSeek). It concludes with practical guidance on choosing between open-source and proprietary models.

Key Takeaways

  • LLMs are next-token prediction machines — they calculate probability distributions over vocabulary at each step
  • Pre-training learns language patterns from massive corpora; fine-tuning and RLHF align the model to be helpful and safe
  • Temperature controls randomness: 0 is deterministic, 1.0 is proportional sampling, higher values increase chaos
  • Context windows are fixed-size working memory — everything the model needs must fit within this limit
  • Open-source models are production-viable but require infrastructure; proprietary models trade cost for operational simplicity

Implementation Checklist

  • Understand the pre-training → fine-tuning → RLHF pipeline
  • Learn how tokenization converts text to model inputs
  • Grasp next-token prediction as the core generation mechanism
  • Know when to use low vs. high temperature settings
  • Evaluate open-source vs. proprietary models for your use case
  • Account for context window limits in your application architecture

Frequently Asked Questions

How do LLMs generate text?

LLMs generate text through next-token prediction. Given a sequence of tokens, the model calculates a probability distribution over its entire vocabulary and selects the next token. This token is appended to the sequence, and the process repeats until the model produces a stop token or hits a length limit. The selection process is influenced by parameters like temperature and top-p that control randomness.

What is the difference between GPT and Claude?

GPT (by OpenAI) and Claude (by Anthropic) are both transformer-based large language models, but they differ in training methodology, safety approach, and capabilities. GPT-4o emphasizes multimodal breadth and tool use. Claude emphasizes long-context processing (up to 200K tokens), instruction following, and Constitutional AI safety training. For .NET developers, both are accessible through similar API patterns and Microsoft.Extensions.AI abstractions.

What does temperature mean in LLM settings?

Temperature is a parameter that controls randomness in token selection. A temperature of 0 makes the model always pick the highest-probability token (deterministic output). A temperature of 1.0 samples proportionally from the probability distribution (more creative/varied output). Values above 1.0 increase randomness further. For code generation, use low temperature (0-0.2). For creative writing, use higher temperature (0.7-1.0).

Are open-source LLMs good enough for production?

Yes, for many use cases. Models like Llama 3.1 405B and DeepSeek-V3 approach proprietary model quality on standard benchmarks. Open-source models offer advantages in cost control, data privacy, and customization through fine-tuning. The trade-off is operational complexity — you must host, scale, and maintain the infrastructure yourself, or use managed services like Azure AI Model Catalog.


#LLMs #GPT #Transformers #AIArchitecture #TokenPrediction