How Large Language Models Actually Work
A Practical, Engineering-Free Explainer for Product Leaders and AI Builders
If you’re working in or around AI product management, you keep hearing terms like LLMs, tokens, embeddings, transformers, pre-training, fine-tuning, context windows, and hallucinations. You don’t need to derive the math, but you do need to understand what’s really going on so you can:
Scope feasible products
Talk credibly with engineers and data scientists
Make smart trade-offs between accuracy, cost, and latency
Avoid promising things models simply can’t do safely
In this article, we’ll explore:
What an LLM actually is
Tokens, vocabulary, and context windows
How training really works (pre-training → post-training)
Why transformers and “attention” changed everything
Why GPUs (and now specialised chips) matter
Large vs small language models (and when to use which)
Hallucinations, stochasticity, and why they happen
What all this means for your product decisions
What is a Large Language Model?
At the core, a large language model is:
A very big neural network trained on massive amounts of text (and sometimes other data) to predict the next token in a sequence, in a way that can look like understanding and reasoning.
Key bits in that sentence:
Neural network – a big function with millions or billions of parameters (weights).
Massive data – web pages, books, code, documentation, forums, etc.
Next token prediction – given input tokens, it outputs probabilities for what token comes next.
The magic is that when you train a model to be extremely good at “next token prediction” across a huge variety of texts, it starts to exhibit behaviour that looks like:
Translation
Summarisation
Question-answering
Code generation
Step-by-step reasoning
All by the same mechanism: predicting the next token, over and over.
Recent surveys and reports from industry and academia agree on this basic framing: LLMs are essentially large sequence models trained on text to predict tokens, based on the Transformer architecture.
Tokens, vocabulary, and context windows
You can’t feed raw characters or words directly to a neural network; you need numbers. That’s where tokenisation comes in.
1. What is a token?
A token is just a small unit of text that the model operates on. It might be:
A word (hello)
A sub-word (ing, tion)
Even a piece of punctuation
Models typically use a scheme like Byte Pair Encoding (BPE) or similar algorithms to split text into tokens. Different models have different vocabularies and tokenisation rules.
Rough rule of thumb (English):
~3–4 characters ≈ 1 token, or roughly 0.75 words per token (very rough).
So a 1,000-word article might be ~1,300–1,500 tokens, depending on language and style.
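To make this concrete, here is a minimal Python sketch of counting tokens. It assumes the tiktoken library (an OpenAI-style BPE tokeniser); other models ship their own tokenisers, but the idea is the same.

```python
# pip install tiktoken  (assumption: an OpenAI-style BPE tokeniser)
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # one common BPE vocabulary

text = "Large language models predict the next token."
token_ids = enc.encode(text)                 # text -> list of integer token IDs

print(len(text.split()), "words")            # 7 words
print(len(token_ids), "tokens")              # typically a similar or slightly larger count
print([enc.decode([t]) for t in token_ids])  # see how the text was split into pieces
```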
2. Vocabulary size
The vocabulary is the set of tokens the model knows. For modern LLMs, this is often on the order of tens of thousands to hundreds of thousands of tokens (e.g., ~50k–200k).
The model essentially learns:
“Given this sequence of tokens, what is the probability distribution over the vocabulary for the next token?”
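Concretely, the model’s final layer produces one raw score (a “logit”) per vocabulary entry, and a softmax turns those scores into probabilities. A toy sketch with a made-up five-token vocabulary:

```python
import numpy as np

# Hypothetical toy vocabulary and the raw scores (logits) the model produced
vocab = ["the", "cat", "sat", "mat", "."]
logits = np.array([2.1, 0.3, 1.7, 0.2, -1.0])

# Softmax: turn scores into a probability distribution over the vocabulary
probs = np.exp(logits) / np.sum(np.exp(logits))

for token, p in zip(vocab, probs):
    print(f"{token!r}: {p:.2f}")
# The next token is then sampled from this distribution
# (or the highest-probability token is picked).
```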
3. Context window
The context window is how many tokens the model can “see” at once for a single request.
For example:
Early GPT-3 models: ~2,048 tokens
Later models: 8k, 16k, 32k, even 100k+ tokens of context
Other vendors similarly expanded context over time
Larger context windows allow:
Longer conversations
“Chat with your docs” over bigger documents
More instruction + examples in a single prompt
But there’s a catch:
More context = more compute = higher cost and latency
And beyond a point, adding more context doesn’t always increase effective understanding; performance can degrade on very long prompts if not handled well.
As a PM, you should think of the context window as the maximum size of the “working memory” per request. Anything outside that has to be brought in smartly—this is where retrieval and context engineering come in.
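As a rough illustration, here is a sketch of a pre-flight check you might run before sending a request. The window size and the 4-characters-per-token estimate are assumptions, not exact values for any particular model.

```python
CONTEXT_WINDOW = 8_000        # assumed context window for the model, in tokens
RESERVED_FOR_OUTPUT = 1_000   # leave room for the model's answer

def rough_token_count(text: str) -> int:
    """Very rough estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def fits_in_context(system_prompt: str, documents: list[str], question: str) -> bool:
    total = sum(rough_token_count(t) for t in [system_prompt, question, *documents])
    return total + RESERVED_FOR_OUTPUT <= CONTEXT_WINDOW

# If this returns False, you need retrieval, chunking, or summarisation
# rather than stuffing everything into one prompt.
```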
How training actually works
Training has three conceptual stages:
Pre-training
Supervised fine-tuning (SFT)
Alignment & post-training (often with human feedback)
1. Pre-training: learning to speak and think in text
In pre-training, the model is fed enormous amounts of raw text (and sometimes code, images, etc.) from:
Web crawls
Books and academic papers
Code repositories
Wikipedia and reference works
Other curated sources
The objective is simple:
Given a sequence of tokens, predict the next one.
Here’s the intuition:
Take a piece of text, convert it to tokens.
For each position in the sequence, ask the model to predict the next token.
Compare its prediction with the true next token.
Compute the difference (the loss), and adjust the model’s parameters to reduce that loss.
Repeat billions and billions of times across all your data.
Mathematically, this is done via stochastic gradient descent and backpropagation, but conceptually you can think of it as:
The model starts off “random” and terrible.
Each time it makes a mistake, you nudge its parameters to be a little less wrong.
After enough iterations on enough data, it becomes very good at modelling patterns in language.
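In code, a single pre-training step looks roughly like the sketch below (PyTorch, with a deliberately tiny toy model). The real thing swaps in a deep transformer and runs distributed across thousands of GPUs, but the loop has the same shape.

```python
import torch
import torch.nn as nn

VOCAB_SIZE, EMBED_DIM = 1000, 64   # toy numbers; real models are far larger

# A tiny stand-in "language model": embedding -> linear layer over the vocabulary.
# A real LLM replaces the middle with a deep stack of transformer layers.
model = nn.Sequential(nn.Embedding(VOCAB_SIZE, EMBED_DIM), nn.Linear(EMBED_DIM, VOCAB_SIZE))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One batch of token IDs (batch x sequence length), random here for illustration
tokens = torch.randint(0, VOCAB_SIZE, (8, 128))
inputs, targets = tokens[:, :-1], tokens[:, 1:]   # target at each position is the next token

logits = model(inputs)                            # scores over the vocabulary at each position
loss = loss_fn(logits.reshape(-1, VOCAB_SIZE), targets.reshape(-1))
loss.backward()                                   # backpropagation: how should each weight change?
optimizer.step()                                  # nudge the parameters to be a little less wrong
optimizer.zero_grad()
```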
This process can take weeks or months on huge clusters of GPUs or TPUs and costs millions of dollars for the largest models.
2. Supervised fine-tuning: teaching it tasks & formats
Pre-training gives you a “base” model that’s good at language, but not necessarily good at:
Following instructions (“Explain X in simple terms”)
Conversational behaviour
Specific tasks (e.g., writing safe code, summarising legal docs)
Producing output in structured formats (JSON, XML, etc.)
So you collect labelled datasets of the form:
Input: prompt / question / instruction
Output: desired answer / completion
Then you run another training phase (often shorter and cheaper) where the model is tuned to produce the desired outputs. This is supervised fine-tuning (SFT).
After SFT, the model becomes more “assistant-like”: it is better at following instructions of the kind it saw in the fine-tuning examples.
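The fine-tuning data itself is usually just pairs of prompts and desired responses. A hypothetical example of what a couple of SFT records might look like (the exact schema varies by provider and tooling):

```python
# Hypothetical supervised fine-tuning examples: prompt in, desired completion out.
# Real datasets contain tens of thousands to millions of such pairs.
sft_examples = [
    {
        "prompt": "Explain what a context window is in one sentence.",
        "completion": "A context window is the maximum number of tokens a model "
                      "can consider in a single request.",
    },
    {
        "prompt": "Summarise this support ticket in two bullet points:\n<ticket text>",
        "completion": "- Customer cannot reset their password\n"
                      "- Issue started after the last app update",
    },
]
```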
3. Alignment and RLHF: giving the model a “personality”
Even after SFT, the model might:
Say unsafe or biased things
Answer confidently when it should say “I don’t know”
Ignore user instructions
Behave in ways that aren’t aligned with your brand or safety requirements
To address that, many vendors use:
RLHF (Reinforcement Learning from Human Feedback)
Other forms of preference optimisation and safety training
Conceptually:
Humans rate model outputs (“good”, “bad”, “unsafe”, “unhelpful”).
Another model learns a reward function from these ratings (“what does a ‘good’ answer look like?”).
The base model is then optimised to maximise that reward, i.e., behave more like the kind of assistant humans prefer and regulation allows.
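The raw material for this is preference data: the same prompt, two (or more) candidate answers, and a human judgement about which is better. A hypothetical record might look like this:

```python
# Hypothetical preference record of the kind used to train a reward model.
preference_example = {
    "prompt": "My flight was cancelled. What are my options?",
    "response_a": "That's frustrating. Airlines typically offer rebooking or a refund; "
                  "check your airline's cancellation policy and contact them directly.",
    "response_b": "No idea, ask the airline.",
    "preferred": "response_a",   # the human rater's judgement
}
```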
For you as a PM, post-training / alignment is what makes the difference between:
A raw language model that will say almost anything
A product-ready assistant that tries to be helpful, harmless, and honest
Transformers & Attention: Why this all became possible
Before around 2017, sequence models were dominated by RNNs/LSTMs, which processed tokens sequentially and struggled with long-range dependencies.
The Transformer architecture (introduced in the paper Attention Is All You Need) changed that.
The key innovation: self-attention.
1. Self-attention in plain language
Self-attention lets the model, for each token, look at all other tokens in the sequence and decide:
“How much should I pay attention to this word?”
“Which previous words are most relevant for predicting the next one?”
For example, in the sentence:
“He lives near the river bank.”
To understand “bank” correctly (river bank vs financial bank), the model needs to look at “river”, not just “bank” in isolation.
Self-attention provides a mechanism for “every token to look at every other token” and compute a weighted combination of their representations. It captures relationships nuanced enough to handle context, ambiguity, and long-range dependencies.
The transformer stacks many layers of attention and feed-forward networks, plus multi-head variants (multiple attention patterns per layer), resulting in a very expressive sequence model.
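For the curious, here is a minimal numpy sketch of single-head, unmasked scaled dot-product attention, the core operation inside each layer. Real implementations add causal masking, many heads, and learned projection matrices far larger than this toy.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model) token representations; W*: learned projection matrices."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv             # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # how much each token "looks at" each other token
    weights = softmax(scores, axis=-1)           # attention weights, one row per token
    return weights @ V                           # weighted combination of value vectors

# Toy example: 5 tokens, 8-dimensional representations
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)       # (5, 8): one updated vector per token
```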
2. Why transformers enabled scaling
Transformers have several properties that made large-scale training practical:
They parallelise much better on GPUs than recurrent models.
Self-attention can be computed efficiently with matrix operations.
This allows training on giant datasets and huge models (billions of parameters) within reasonable time.
This scaling ability is the main reason LLMs went from research to industrial reality. Without transformers and efficient attention, we wouldn’t have GPT-class models, Claude-class models, or their open-source counterparts.
Why GPUs matter
You’ll often hear about:
GPU shortages
Cloud GPU pricing
Chip vendor market caps skyrocketing
Why? Because both training and inference of LLMs are extremely compute-intensive and map very well to GPU architectures:
Neural networks = lots of linear algebra (matrix multiplications)
GPUs are built for massively parallel operations, originally for graphics, now perfect for ML workloads
Training vs inference
Training: adjusting parameters based on loss; requires forward and backward passes, huge batches, and large clusters. Very expensive, done by model providers.
Inference: using the trained model to generate outputs; typically just forward passes token by token. Still expensive at scale, but cheaper than training.
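Inference is a loop: feed in the tokens so far, get a distribution over the next token, pick one, append it, repeat. A sketch of greedy decoding against a hypothetical model and tokenizer (the names here are placeholders, not any specific library’s API):

```python
def generate(model, tokenizer, prompt: str, max_new_tokens: int = 50) -> str:
    """Greedy decoding sketch: always pick the most likely next token.
    `model` and `tokenizer` are placeholders for whatever stack you use."""
    tokens = tokenizer.encode(prompt)
    for _ in range(max_new_tokens):
        probs = model.next_token_probabilities(tokens)    # one forward pass per new token
        next_token = max(range(len(probs)), key=probs.__getitem__)
        tokens.append(next_token)
        if next_token == tokenizer.end_of_text_id:        # stop when the model emits its stop token
            break
    return tokenizer.decode(tokens)
```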
As a PM, you don’t usually decide the chip vendor, but you do care about:
Cost per 1M tokens (input + output)
Latency and throughput your product can afford
Whether you run models via a third-party API, your cloud provider, or on your own infra
These are economic and experience levers, not just technical nerdiness.
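A back-of-the-envelope cost model is often all you need at the scoping stage. The prices below are made-up placeholders; plug in your vendor’s actual per-million-token rates.

```python
# Assumed illustrative prices, NOT real vendor pricing.
PRICE_PER_1M_INPUT_TOKENS = 3.00    # USD
PRICE_PER_1M_OUTPUT_TOKENS = 15.00  # USD

def cost_per_request(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1e6) * PRICE_PER_1M_INPUT_TOKENS + \
           (output_tokens / 1e6) * PRICE_PER_1M_OUTPUT_TOKENS

# Example: a request with a big retrieved context and a short answer
per_request = cost_per_request(input_tokens=6_000, output_tokens=500)
print(f"~${per_request:.4f} per request, "
      f"~${per_request * 100_000:,.0f} per 100k requests/month")
```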


