Why This Article Exists
If you have spent any time in the .NET ecosystem recently, you have noticed the ground shifting. Microsoft is embedding AI capabilities into every layer of the stack — from Visual Studio to Azure to the core framework libraries. But most of the educational content out there assumes you are either a data scientist or willing to rewrite everything in Python.
That is not how production software gets built.
This article exists to give you — a working C# developer — a clear, engineering-grounded understanding of what generative AI actually is, how the core concepts work, and why the .NET ecosystem is now a first-class platform for building AI applications. No hype. No hand-waving. Just the mental models you need to make informed architectural decisions.
What AI Actually Means for Engineers
Strip away the marketing and AI comes down to one thing: systems that learn patterns from data and use those patterns to make predictions or generate outputs.
That definition covers everything from a spam filter to GPT-4o. The difference is scale, architecture, and what kind of output the model produces.
For engineers, the important distinction is between AI that classifies (is this email spam?) and AI that generates (write me an email about this topic). Generative AI falls into the second category. It produces new content — text, code, images, structured data — based on patterns it learned during training.
This is not magic. It is statistical pattern matching at extraordinary scale. Understanding that framing will save you from both over-trusting and under-utilizing these systems.
Machine Learning vs. Deep Learning vs. LLMs
These terms get thrown around interchangeably, but they represent a clear hierarchy. Getting this right matters because it determines which tools you reach for.
Machine Learning
Machine learning is the broadest category. It encompasses any system that improves at a task through exposure to data rather than explicit programming. Linear regression, decision trees, random forests, support vector machines — these are all classical ML techniques.
In the .NET world, ML.NET is the primary framework for classical machine learning. If you need to predict a number, classify a category, detect anomalies, or recommend items based on tabular data, ML.NET is the right tool. It handles the entire pipeline — data loading, feature engineering, training, evaluation, and deployment — without leaving C#.
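To make that pipeline concrete, here is a minimal sketch of a text classifier in ML.NET. It assumes the Microsoft.ML NuGet package; the spam examples and the `SpamInput`/`SpamPrediction` class names are invented for illustration, and a real model would need far more training data.

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

public class SpamInput
{
    public string Text { get; set; } = "";
    public bool Label { get; set; }   // true = spam
}

public class SpamPrediction
{
    [ColumnName("PredictedLabel")]
    public bool IsSpam { get; set; }
    public float Probability { get; set; }
}

class Program
{
    static void Main()
    {
        var mlContext = new MLContext(seed: 0);

        // Tiny illustrative dataset; real training needs thousands of examples.
        var samples = new[]
        {
            new SpamInput { Text = "WIN a FREE prize now!!!", Label = true },
            new SpamInput { Text = "Meeting moved to 3pm",    Label = false },
        };
        var data = mlContext.Data.LoadFromEnumerable(samples);

        // Pipeline: featurize raw text, then train a logistic regression classifier.
        var pipeline = mlContext.Transforms.Text
            .FeaturizeText("Features", nameof(SpamInput.Text))
            .Append(mlContext.BinaryClassification.Trainers.SdcaLogisticRegression());

        var model = pipeline.Fit(data);

        var engine = mlContext.Model.CreatePredictionEngine<SpamInput, SpamPrediction>(model);
        var result = engine.Predict(new SpamInput { Text = "Claim your free prize" });
        Console.WriteLine($"Spam: {result.IsSpam} ({result.Probability:P0})");
    }
}
```

Note that the entire flow, from data loading through prediction, stays in strongly typed C#.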
Deep Learning
Deep learning is a subset of machine learning that uses neural networks with many layers (hence “deep”). These networks can learn complex, hierarchical representations from raw data — images, audio, text — without manual feature engineering.
The breakthrough of deep learning was that the model figures out what features matter. You feed it pixels, it learns to recognize edges, shapes, objects, and scenes. You feed it text, it learns grammar, meaning, and context.
ONNX Runtime lets you run deep learning models in .NET. Models trained in Python (using PyTorch or TensorFlow) can be exported to the ONNX format and executed with full performance in your C# applications.
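A minimal inference sketch with the Microsoft.ML.OnnxRuntime package looks like this. The model path and the `"input"`/output tensor names are placeholders; they depend on how the model was exported from PyTorch or TensorFlow.

```csharp
using System;
using System.Linq;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

// "model.onnx" and the tensor name "input" are placeholders for your export.
using var session = new InferenceSession("model.onnx");

// A 1x3x224x224 image tensor, the common shape for ImageNet-style classifiers.
var tensor = new DenseTensor<float>(new[] { 1, 3, 224, 224 });
// ... fill tensor with normalized pixel values ...

var inputs = new[] { NamedOnnxValue.CreateFromTensor("input", tensor) };
using var results = session.Run(inputs);

// Classifier outputs are typically one score per class; take the highest.
var scores = results.First().AsEnumerable<float>().ToArray();
int predicted = Array.IndexOf(scores, scores.Max());
Console.WriteLine($"Predicted class index: {predicted}");
```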
Large Language Models
LLMs are a specific class of deep learning model trained on massive text datasets using a particular architecture called a transformer. GPT-4, Claude, Gemini, Llama, DeepSeek — these are all LLMs built on transformer architecture.
What makes LLMs distinct is their generality. Classical ML models are trained for one task. An LLM trained on enough text develops emergent capabilities across many tasks: writing, summarizing, translating, coding, reasoning, and more. This is what makes them both powerful and unpredictable.
The hierarchy is clear: Machine Learning > Deep Learning > Large Language Models. Each is a progressively narrower specialization of the one before it: every LLM is a deep learning model, and every deep learning model is a machine learning model.
How Transformer Models Work
You do not need to understand the mathematics of transformers to use LLMs effectively, but you do need an accurate mental model. Many architectural mistakes come from misunderstanding what the model is actually doing.
The Core Idea: Self-Attention
Before transformers, language models processed text sequentially — one word at a time, left to right. This made them slow and limited their ability to understand relationships between distant words.
Transformers introduced self-attention, a mechanism that lets the model look at every token in the input simultaneously and determine which tokens are most relevant to each other. When processing the word “bank” in a sentence, the model can attend to surrounding context — “river” or “account” — to determine the correct interpretation.
This parallel processing is what made LLMs feasible at scale. Instead of crawling through text token by token, transformers process the entire input in parallel, leveraging GPU hardware that excels at exactly this kind of computation.
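The mechanism itself is simpler than it sounds. This toy sketch computes scaled dot-product attention for one query over three key/value pairs; real models use learned projection matrices and hundreds of dimensions, but the three steps are the same: score, softmax, weighted sum.

```csharp
using System;
using System.Linq;

public class AttentionDemo
{
    public static double Dot(double[] a, double[] b) =>
        a.Zip(b, (x, y) => x * y).Sum();

    // Softmax turns raw scores into weights that sum to 1.
    public static double[] Softmax(double[] scores)
    {
        double max = scores.Max();                        // for numerical stability
        var exps = scores.Select(s => Math.Exp(s - max)).ToArray();
        double sum = exps.Sum();
        return exps.Select(e => e / sum).ToArray();
    }

    static void Main()
    {
        double[] query = { 1.0, 0.0 };
        double[][] keys   = { new[] { 1.0, 0.0 }, new[] { 0.0, 1.0 }, new[] { 0.9, 0.1 } };
        double[][] values = { new[] { 1.0, 2.0 }, new[] { 3.0, 4.0 }, new[] { 5.0, 6.0 } };

        // Score each key against the query, scaled by sqrt(dimension).
        var scores  = keys.Select(k => Dot(query, k) / Math.Sqrt(query.Length)).ToArray();
        var weights = Softmax(scores);   // how much each position "attends"

        // Output is the attention-weighted blend of the values.
        var output = new double[2];
        for (int i = 0; i < keys.Length; i++)
            for (int d = 0; d < 2; d++)
                output[d] += weights[i] * values[i][d];

        Console.WriteLine(string.Join(", ", weights.Select(w => w.ToString("F3"))));
        Console.WriteLine(string.Join(", ", output.Select(o => o.ToString("F3"))));
    }
}
```

Keys similar to the query (the first and third) receive higher weights, so their values dominate the output. That weighting is the "attending" in self-attention.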
Encoder-Decoder and Decoder-Only
The original transformer architecture (introduced in the 2017 paper “Attention Is All You Need”) had two parts: an encoder that reads input and a decoder that generates output. This design works well for translation — encode the source language, decode into the target language.
Most modern LLMs use a decoder-only architecture. GPT, Claude, and Llama are all decoder-only models. They take a sequence of tokens and predict what comes next. This simplification turned out to be remarkably powerful when scaled up.
Tokens: The Fundamental Unit
LLMs do not operate on words. They operate on tokens — fragments of text that the model has learned to recognize during training.
A token might be a whole word (“hello”), a word fragment (“un” + “believ” + “able”), a punctuation mark, or even a single character. The exact tokenization depends on the model’s tokenizer, which is trained alongside the model itself.
Why does this matter for engineers? Three reasons:
Cost. API providers charge per token. A 1,000-word document might be 1,300 tokens with one model and 1,500 with another. Understanding tokenization helps you estimate costs.
Context limits. Every model has a maximum context window measured in tokens. GPT-4o supports 128,000 tokens. Claude 3.5 supports 200,000 tokens. If your input exceeds the limit, it gets truncated or rejected.
Behavior. Models reason at the token level. A model might handle “JavaScript” as one token but “TypeScript” as two tokens (“Type” + “Script”). This can create subtle differences in how the model processes similar inputs.
Most tokenizers use Byte Pair Encoding (BPE) or SentencePiece algorithms. You rarely need to work with tokenizers directly, but knowing they exist explains pricing, context limits, and some model behaviors that would otherwise seem arbitrary.
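When you do need token counts in .NET, the Microsoft.ML.Tokenizers NuGet package provides BPE tokenizers compatible with the OpenAI model families. A sketch, assuming that package; the per-1K price in the comment is a made-up number for illustration, not a real rate.

```csharp
using System;
using System.Linq;
using Microsoft.ML.Tokenizers;

// Count tokens the way a GPT-4o-family model would.
var tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");

string text = "Transformers process the entire input in parallel.";
int count = tokenizer.CountTokens(text);
var ids = tokenizer.EncodeToIds(text);

Console.WriteLine($"{count} tokens");
Console.WriteLine(string.Join(" ", ids));

// Rough cost estimate: tokens / 1000 * price-per-1K (price here is hypothetical).
decimal pricePer1K = 0.005m;
Console.WriteLine($"~{count / 1000m * pricePer1K:F6} at {pricePer1K} per 1K input tokens");
```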
Context Windows and Their Implications
The context window is the total number of tokens a model can process in a single request — including both your input and the model’s output.
Think of it as working memory. Everything the model needs to “know” for a given interaction must fit within this window. There is no persistent memory between API calls (unless you build it yourself).
This has direct engineering implications:
- Conversation history consumes context. A long chat session gradually fills the window, and older messages must be dropped or summarized.
- Document processing requires chunking strategies. A 50-page document will not fit in a single call, so you need to split, process, and reassemble.
- System prompts use context too. A detailed system message with instructions, few-shot examples, and constraints might consume several thousand tokens before the user even says anything.
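The first bullet, dropping older messages, can be sketched as a sliding window. This is self-contained C#; the four-characters-per-token estimate is a rough heuristic for English text, and you would use a real tokenizer for accurate counts.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public record ChatMessage(string Role, string Content);

public static class ContextWindow
{
    // Crude heuristic: ~4 characters per token for English text.
    public static int EstimateTokens(string text) =>
        Math.Max(1, text.Length / 4);

    // Walk from newest to oldest, keeping what fits in the budget.
    public static List<ChatMessage> Trim(List<ChatMessage> history, int budgetTokens)
    {
        var kept = new List<ChatMessage>();
        int used = 0;
        foreach (var msg in Enumerable.Reverse(history))
        {
            int cost = EstimateTokens(msg.Content);
            if (used + cost > budgetTokens) break;
            kept.Insert(0, msg);       // restore chronological order
            used += cost;
        }
        return kept;
    }
}
```

Production systems usually refine this: the system prompt is pinned rather than trimmed, and dropped turns are summarized instead of discarded outright.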
Context window management is one of the most important architectural concerns in AI application development. It is where theoretical understanding meets real engineering trade-offs.
Embeddings: Meaning as Numbers
An embedding is a numerical representation of text — a vector of floating-point numbers that captures semantic meaning. Similar texts produce similar vectors. This property makes embeddings the foundation of semantic search, recommendation systems, and Retrieval-Augmented Generation (RAG).
When you embed the phrases “How do I reset my password?” and “I forgot my login credentials,” they will produce vectors that are close together in the embedding space, even though they share almost no words.
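"Close together" is usually measured with cosine similarity. Here is a self-contained sketch; in practice the `float[]` vectors come from an embedding model rather than being written by hand.

```csharp
using System;

public static class VectorMath
{
    // Cosine similarity: ~1 means semantically similar, ~0 means unrelated.
    public static double CosineSimilarity(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same dimension.");

        double dot = 0, magA = 0, magB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot  += a[i] * b[i];
            magA += a[i] * a[i];
            magB += b[i] * b[i];
        }
        return dot / (Math.Sqrt(magA) * Math.Sqrt(magB));
    }
}
```

Vector databases run exactly this kind of comparison (or a close variant) across millions of stored vectors to find the nearest matches.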
Embedding models are different from generative models. They transform text into vectors but do not generate new text. Azure OpenAI offers dedicated embedding models like text-embedding-3-small and text-embedding-3-large that are optimized for this purpose.
In .NET, you generate embeddings through the same API clients you use for chat completion. The vectors are then stored in a vector database — Azure AI Search, Azure Cosmos DB, Qdrant, or others — for similarity search at query time.
Why .NET Has First-Class AI Support Now
For years, Python dominated the AI landscape. That made sense — the training ecosystem (PyTorch, TensorFlow, Hugging Face) is centered on Python. But training models and consuming models are very different activities.
Most .NET developers are not training models from scratch. They are consuming pre-trained models through APIs — sending prompts, receiving completions, generating embeddings, orchestrating multi-step workflows. For this consumption layer, .NET is now fully equipped.
Microsoft has shipped a comprehensive set of libraries that make .NET a genuine first-choice platform for AI application development. This is not a bolt-on afterthought. These are well-designed, production-grade libraries with strong typing, dependency injection support, and the kind of API design .NET developers expect.
The .NET AI Ecosystem Map
Here is how the major pieces fit together and when you should reach for each one.
Microsoft.Extensions.AI
Microsoft.Extensions.AI is the unified abstraction layer for AI services in .NET. It defines standard interfaces — IChatClient, IEmbeddingGenerator — that any AI provider can implement.
When to use it: You want to call an LLM (chat completion, embeddings) without coupling to a specific provider. It is the ILogger of AI — a thin abstraction that lets you swap providers without changing application code.
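The pattern looks like this. The sketch assumes Microsoft.Extensions.AI 9.x (where the chat extension method is `GetResponseAsync`); the `SummaryService` class and the commented registration lines are illustrative, not prescribed names.

```csharp
using System.Threading.Tasks;
using Microsoft.Extensions.AI;

// Application code depends only on the IChatClient abstraction.
// Which provider backs it is a composition-root decision.
public class SummaryService
{
    private readonly IChatClient _chat;
    public SummaryService(IChatClient chat) => _chat = chat;

    public async Task<string> SummarizeAsync(string document)
    {
        var response = await _chat.GetResponseAsync(
            $"Summarize in two sentences:\n\n{document}");
        return response.Text;
    }
}

// Registration (e.g. in Program.cs) binds the abstraction to a provider,
// assuming the Azure.AI.OpenAI and Microsoft.Extensions.AI.OpenAI packages:
//
// builder.Services.AddChatClient(sp =>
//     new AzureOpenAIClient(endpoint, credential)
//         .GetChatClient("gpt-4o-mini")
//         .AsIChatClient());
```

Swapping Azure OpenAI for another provider means changing the registration, not the service.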
Semantic Kernel
Semantic Kernel is Microsoft’s orchestration SDK for building AI agents. It provides plugins, planners, memory connectors, and a pipeline for composing complex AI workflows.
When to use it: You need more than simple prompt-response. If your application involves tool calling, multi-step reasoning, RAG pipelines, or agent-like behavior, Semantic Kernel provides the structure. For a deep look at its internals, see our Semantic Kernel Architecture Deep Dive.
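A minimal tool-calling sketch, assuming the Microsoft.SemanticKernel 1.x packages. The deployment name, endpoint, and key are placeholders, the `WeatherPlugin` is invented, and Semantic Kernel's API surface evolves quickly, so treat the exact method names as version-dependent.

```csharp
using System;
using System.ComponentModel;
using System.Threading.Tasks;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.Connectors.OpenAI;

// A plugin is an ordinary class; [KernelFunction] marks methods
// the model is allowed to call. The weather data here is hard-coded.
public class WeatherPlugin
{
    [KernelFunction, Description("Gets the current temperature for a city.")]
    public string GetTemperature(string city) => $"18 degrees C in {city}";
}

class Program
{
    static async Task Main()
    {
        var builder = Kernel.CreateBuilder();
        builder.AddAzureOpenAIChatCompletion(
            "gpt-4o-mini", "https://your-resource.openai.azure.com", "<api-key>");
        builder.Plugins.AddFromType<WeatherPlugin>();
        var kernel = builder.Build();

        // FunctionChoiceBehavior.Auto lets the model decide when to call the plugin.
        var settings = new OpenAIPromptExecutionSettings
        {
            FunctionChoiceBehavior = FunctionChoiceBehavior.Auto()
        };
        var result = await kernel.InvokePromptAsync(
            "Should I bring a coat to Amsterdam today?",
            new KernelArguments(settings));
        Console.WriteLine(result);
    }
}
```

The model sees the plugin's description, decides to call `GetTemperature`, receives the result, and folds it into its answer, all orchestrated by the kernel.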
ML.NET
ML.NET is the classical machine learning framework for .NET. It handles supervised learning, unsupervised learning, and common tasks like classification, regression, anomaly detection, and recommendation.
When to use it: Your problem is tabular data, and you need a trained model for prediction or classification. ML.NET is not for generative AI — it is for the bread-and-butter ML tasks that many applications need.
ONNX Runtime
ONNX Runtime executes pre-trained neural network models in .NET. Models trained in Python can be exported to ONNX format and run with near-native performance in C#.
When to use it: You need to run a specific model locally — image classification, object detection, custom NLP. ONNX Runtime bridges the gap between Python training and .NET inference.
Azure.AI.OpenAI
The Azure.AI.OpenAI client library provides direct access to Azure OpenAI Service. It supports chat completion, embeddings, image generation, and all Azure-specific features like content filtering and managed identity authentication.
When to use it: You are building on Azure and need direct access to OpenAI models with enterprise features — private networking, managed keys, regional deployment, content safety filters.
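A direct-client sketch, assuming the Azure.AI.OpenAI 2.x and Azure.Identity packages; the endpoint and deployment name are placeholders for your own Azure resources.

```csharp
using System;
using Azure.AI.OpenAI;
using Azure.Identity;
using OpenAI.Chat;

var client = new AzureOpenAIClient(
    new Uri("https://your-resource.openai.azure.com"),
    new DefaultAzureCredential());        // managed identity: no API key in code

ChatClient chat = client.GetChatClient("gpt-4o-mini");   // your deployment name

ChatCompletion completion = await chat.CompleteChatAsync(
    new SystemChatMessage("You are a concise assistant."),
    new UserChatMessage("Explain a context window in one sentence."));

Console.WriteLine(completion.Content[0].Text);
```

Using `DefaultAzureCredential` rather than a raw key is what enables the enterprise features mentioned above: the same code works with managed identity in production and developer credentials locally.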
How These Pieces Connect
The ecosystem is layered by design. At the bottom, Microsoft.Extensions.AI defines the abstractions. Provider-specific libraries like Azure.AI.OpenAI implement those abstractions. Semantic Kernel sits on top, using the abstraction layer to orchestrate multi-step AI workflows.
For a simple chatbot, Microsoft.Extensions.AI with an Azure OpenAI backend might be all you need. For a RAG system that retrieves documents, reasons about them, and calls external APIs, Semantic Kernel provides the structure you need. For a demand forecasting model, ML.NET is the right tool.
The key architectural principle: choose the thinnest abstraction layer that meets your requirements. Do not pull in Semantic Kernel if you just need to send a prompt and get a response. Do not write raw HTTP calls if Microsoft.Extensions.AI already has an interface for what you need.
What Comes Next
This article gave you the conceptual foundation — what generative AI is, how transformers and LLMs work at a high level, and what tools are available in the .NET ecosystem.
The next step is understanding how LLMs actually generate text. That mechanical understanding — tokenization, next-token prediction, temperature, sampling — is what separates developers who use AI tools effectively from those who treat them as black boxes.
Continue to How Large Language Models Work — A Mental Model for Engineers to build that understanding.