“I built a tiny language model from scratch — and suddenly, the magic of ChatGPT didn’t feel like magic anymore.”
Large Language Models (LLMs) are AI systems that predict the next word (or token) in a sequence of text.
That’s the core idea — they learn patterns in language by reading billions of examples.
They don’t “understand” the world like humans do.
They approximate patterns of how humans express ideas.
Think of them as supercharged autocomplete engines that have learned:
So when you type:
“The capital of France is…”
The model has seen enough examples to confidently predict → “Paris”.
You don’t need a PhD to understand LLMs.
Here’s the essential math intuition:
| Concept | Why it matters |
|---|---|
| Vectors | Represent tokens as lists of numbers (embeddings). |
| Dot Product | Measures how similar two vectors are, and so how related their tokens are. |
| Softmax | Converts scores into probabilities that sum to 1. |
| Gradients | Tell the model how to adjust weights to improve. |
| Cross-Entropy Loss | Quantifies how wrong the model’s prediction was. |
Everything else is built on top of these ideas.
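To make the table concrete, here is a minimal sketch of three of these ideas in plain Python. The embedding values and scores are made up for illustration, and the function names (`dot`, `softmax`, `cross_entropy`) are just local helpers, not part of any library.

```python
import math

def dot(u, v):
    """Dot product: larger values mean the two vectors point in more similar directions."""
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_index):
    """Loss is the negative log of the probability given to the correct token."""
    return -math.log(probs[target_index])

# Made-up 3-dimensional embeddings, for illustration only.
cat = [0.9, 0.1, 0.3]
kitten = [0.8, 0.2, 0.4]
car = [-0.5, 0.9, 0.0]

sim_cat_kitten = dot(cat, kitten)  # related vectors give a higher score
sim_cat_car = dot(cat, car)        # unrelated vectors give a lower score

probs = softmax([2.0, 1.0, 0.1])       # scores for three candidate tokens
loss_right = cross_entropy(probs, 0)   # correct token was the most likely one
loss_wrong = cross_entropy(probs, 2)   # correct token was the least likely one
```

The loss is small when the model put high probability on the correct token and large when it did not, which is exactly the signal training uses.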
| Term | Meaning |
|---|---|
| Token | A chunk of text (word, subword, or even a character). |
| Embedding | The numeric vector that represents a token. |
| Context Window | How much text the model can “see” at once. |
| Parameter | A trainable number in the model (LLMs have billions). |
| Training | Teaching the model to predict the next token. |
| Inference | Using the trained model to generate text. |
| Prompt | The input text you feed into the model. |
Transformers are the architecture behind nearly every major LLM, including GPT, Claude, Gemini, and LLaMA.
They replaced older sequential models (RNNs, LSTMs) with a design that can look at all tokens simultaneously — through self-attention.
Input Text → Token Embeddings → [Transformer Blocks] → Output Predictions
Each Transformer Block has:
Stack dozens (or hundreds) of these, and you get a model that captures deep relationships between words and ideas.
Imagine reading this sentence:
“The cat sat on the mat because it was tired.”
To understand what “it” refers to, you instinctively connect “it” back to “cat”.
That’s attention.
Attention helps the model decide which previous tokens are relevant when predicting the next one.
Each token computes three vectors:
Query (Q): What am I looking for?
Key (K): What do I contain?
Value (V): What information do I hold?
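Attention = softmax(Q · Kᵀ / √d) × V
Here d is the dimension of the key vectors; dividing by √d keeps the scores from growing with the vector size.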
So each token “looks” at all others, scores their relevance, and forms a weighted summary of what matters most.
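The formula above can be sketched in plain Python without any tensor library. This is a simplified, single-head version with no masking and no learned projections; the Q, K, and V values are made up, and `attention` is a hypothetical helper name.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of vectors (one row per token)."""
    d = len(K[0])  # dimension of the key vectors
    output, weights = [], []
    for q in Q:
        # Score this token's query against every token's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        # Weighted average of the value vectors.
        out = [sum(w[j] * V[j][c] for j in range(len(V))) for c in range(len(V[0]))]
        weights.append(w)
        output.append(out)
    return output, weights

# Toy 2-dimensional Q, K, V for three tokens (made-up numbers).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

output, weights = attention(Q, K, V)
```

Each row of `weights` sums to 1, so every output vector is a blend of the value vectors, weighted by relevance.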
LLMs can’t process raw text, only numbers.
Tokenization converts text into IDs.
Example:
Text: "hello"
Tokens: ['h', 'e', 'l', 'l', 'o']
IDs: [4, 5, 8, 8, 11]
Larger models typically use subword tokenizers such as Byte Pair Encoding (BPE), so that common chunks like “trans” or “ation” become single tokens.
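A character-level tokenizer like the one in the example can be sketched in a few lines. This is an assumed minimal version, not necessarily the one used for the model in this article; the names `stoi`, `itos`, `encode`, and `decode` are illustrative. The exact IDs depend on the vocabulary, so they differ from the numbers shown above.

```python
text = "hello world"

# Vocabulary: every distinct character, sorted so the mapping is reproducible.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> ID
itos = {i: ch for ch, i in stoi.items()}      # ID -> character

def encode(s):
    """Convert a string into a list of token IDs."""
    return [stoi[ch] for ch in s]

def decode(ids):
    """Convert a list of token IDs back into a string."""
    return "".join(itos[i] for i in ids)

ids = encode("hello")
print(ids)          # [3, 2, 4, 4, 5]
print(decode(ids))  # hello
```

Note that both "l" characters map to the same ID: a token's ID says which symbol it is, not where it appears.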
Training is simply teaching the model to guess the next token.
Example:
Input: "I like deep lear"
Target: " like deep learn" (the input shifted by one character)
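In practice the target is usually the input shifted by one position, so every position in the sequence gets its own next-token label. A minimal sketch, assuming character-level tokens and using Unicode code points as stand-in IDs (`make_example` is a hypothetical helper):

```python
def make_example(ids, start, block_size):
    """Return (inputs, targets); targets are the inputs shifted by one position."""
    chunk = ids[start : start + block_size + 1]
    return chunk[:-1], chunk[1:]

text = "I like deep learning"
ids = [ord(ch) for ch in text]  # stand-in IDs, for illustration only

x, y = make_example(ids, start=0, block_size=16)
x_text = "".join(chr(i) for i in x)
y_text = "".join(chr(i) for i in y)

print(repr(x_text))  # 'I like deep lear'
print(repr(y_text))  # ' like deep learn'
```

At position 15 the input ends in "r" and the matching target is "n", which is the prediction the example above describes.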
At every step, the model tries to predict the next token correctly.
The loss measures how wrong each prediction was. Backpropagation computes how each weight contributed to that loss, and the optimizer adjusts the weights to reduce it.
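```python
# Assumes `data` yields (inputs, targets) batches of token IDs
for inputs, targets in data:
    logits, loss = model(inputs, targets)  # forward pass: predictions and loss
    loss.backward()                        # backpropagation: compute gradients
    optimizer.step()                       # adjust the weights
    optimizer.zero_grad()                  # reset gradients for the next batch
```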
Do this millions of times, and the model learns the structure of language.
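To see why repeating this loop helps, here is a hand-rolled sketch of gradient descent on a single prediction, with no framework involved. It treats three logits as the "weights" being trained, which is a deliberate simplification; the learning rate and starting values are arbitrary.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def loss_and_grad(logits, target):
    """Cross-entropy loss and its gradient with respect to the logits."""
    probs = softmax(logits)
    loss = -math.log(probs[target])
    # For softmax followed by cross-entropy, the gradient is probs - one_hot(target).
    grad = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    return loss, grad

logits = [0.5, 0.2, -0.1]  # stand-in for trainable weights
target = 2                 # index of the correct token
learning_rate = 0.5

losses = []
for step in range(20):
    loss, grad = loss_and_grad(logits, target)
    losses.append(loss)
    # Move each value a small step against its gradient.
    logits = [w - learning_rate * g for w, g in zip(logits, grad)]
```

The recorded `losses` shrink step after step as the probability of the correct token rises. A real model does the same thing across millions of weights and examples.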
Once trained, the model can generate text by sampling one token at a time.
prompt → predict next token → append → predict again → ...
Prompt:
"The universe began"
Model:
→ " with a bang."
→ " billions of years ago."
→ " as a cloud of energy."
Different sampling strategies:
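As an illustration, here is a sketch of three commonly used strategies: greedy decoding, temperature scaling, and top-k filtering. These are standard techniques and not necessarily the exact set this article had in mind. `toy_model` is a hypothetical stand-in for a trained model, and treating `temperature=0` as greedy is a convention chosen for this sketch.

```python
import math
import random

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(logits, temperature=1.0, top_k=None, rng=random):
    """Pick the next token ID from raw scores."""
    if temperature == 0:
        # Greedy: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    candidates = list(range(len(logits)))
    if top_k is not None:
        # Top-k: keep only the k highest-scoring tokens.
        candidates = sorted(candidates, key=lambda i: logits[i], reverse=True)[:top_k]
    # Temperature below 1 sharpens the distribution, above 1 flattens it.
    probs = softmax([logits[i] / temperature for i in candidates])
    return rng.choices(candidates, weights=probs, k=1)[0]

def generate(next_logits, prompt_ids, n_new, **kwargs):
    """Autoregressive loop: predict, append, predict again."""
    ids = list(prompt_ids)
    for _ in range(n_new):
        ids.append(sample_next(next_logits(ids), **kwargs))
    return ids

def toy_model(ids):
    """Hypothetical model over 5 tokens that prefers (last ID + 1) mod 5."""
    favorite = (ids[-1] + 1) % 5
    return [3.0 if i == favorite else 0.0 for i in range(5)]

greedy = generate(toy_model, [0], 4, temperature=0)
print(greedy)  # [0, 1, 2, 3, 4]
```

Greedy decoding is deterministic, while sampling with a temperature above 0 can produce a different continuation on every run, which is why the same prompt can yield several plausible endings.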
Here’s the core of the TinyCharTransformer model:
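```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyCharTransformer(nn.Module):
    # TransformerBlock and cfg (n_embd, block_size, n_layers) are not shown here.
    def __init__(self, vocab_size, cfg):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, cfg.n_embd)    # token ID -> vector
        self.pos_emb = nn.Embedding(cfg.block_size, cfg.n_embd)  # position -> vector
        self.blocks = nn.ModuleList([TransformerBlock(cfg) for _ in range(cfg.n_layers)])
        self.ln_f = nn.LayerNorm(cfg.n_embd)
        self.lm_head = nn.Linear(cfg.n_embd, vocab_size)         # vector -> logits over the vocabulary

    def forward(self, idx, targets=None):
        # idx: (batch, seq_len) token IDs; positions must be on the same device as idx
        pos = torch.arange(idx.size(1), device=idx.device)
        x = self.token_emb(idx) + self.pos_emb(pos)
        for blk in self.blocks:
            x = blk(x)
        logits = self.lm_head(self.ln_f(x))
        loss = None
        if targets is not None:
            # Flatten (batch, seq_len, vocab) so each position counts as one prediction
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss
```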
Even this simple model can learn to mimic patterns from text like:
“To be or not to be…” → “that is the question.”
Once you’ve trained your model, you can see how it pays attention using visualization tools:
| Tool | What it shows |
|---|---|
| animate_attention.py | Animated heatmap of attention weights |
| visualize_inference.py | Step-by-step next-token prediction |
| --logit-lens | Which layers predict what |
| --saliency | Which tokens influence the output |
(Imagine each token looking back at previous ones)
token: t h e c a t
↑ ↑ ↑ ↑↑↑
The arrows show where the model is “looking” when generating each token.
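The "looking back" pattern comes from causal masking, which GPT-style models use so that a token cannot attend to tokens that come after it. The sketch below assumes the tiny model is causal in this way and uses made-up, uniform scores; `causal_attention_weights` is a hypothetical helper, not one of the tools listed above.

```python
import math

def causal_attention_weights(scores):
    """Row i holds token i's scores for every token; positions after i are masked out."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]  # a token may only look at itself and earlier tokens
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Masked (future) positions get zero weight.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights

tokens = list("the cat")
# Made-up scores: every token rates every other token equally.
scores = [[0.0] * len(tokens) for _ in tokens]
weights = causal_attention_weights(scores)

# Print a small text heatmap: one row per token, one column per attended position.
for tok, row in zip(tokens, weights):
    print(repr(tok), " ".join(f"{w:.2f}" for w in row))
```

The printed grid is lower-triangular: the first token can only attend to itself, and each later token spreads its attention over everything before it.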
Raw Text → Tokenization → Embeddings
→ Transformer Layers → Softmax → Predicted Token
→ Compare to True Token → Update Weights (Training)
→ Use for Generation (Inference)
That’s the entire lifecycle — from raw data to an AI that writes.
Once you grasp the fundamentals, explore:
You don’t need a supercomputer to understand how ChatGPT works.
You can build a tiny transformer on your laptop — and see it think.
The beauty of modern AI isn’t in the size of the model, but in the simplicity of the math behind it.
Once you’ve trained one yourself, the “mystery” of LLMs turns into pure curiosity. ✨
Written by Nilesh Salpe, October 2025
MIT License · Educational & Open Source