llm-from-scratch

🧠 Building an LLM from Scratch — How Transformers Learn, Think, and Generate

“I built a tiny language model from scratch — and suddenly, the magic of ChatGPT didn’t feel like magic anymore.”

🚀 What Are Large Language Models (LLMs)?

Large Language Models (LLMs) are AI systems that predict the next word (or token) in a sequence of text.

That’s the core idea — they learn patterns in language by reading billions of examples.

They don’t “understand” the world like humans do.
They approximate patterns of how humans express ideas.

Think of them as supercharged autocomplete engines that have learned:

Grammar
Facts
Context
Reasoning structures

So when you type:

“The capital of France is…”

The model has seen enough examples to confidently predict → “Paris”.

🧮 The Math You Actually Need to Know

You don’t need a PhD to understand LLMs.
Here’s the essential math intuition:

Concept	Why it matters
Vectors	Represent tokens as lists of numbers (embeddings).
Dot Product	Measures how similar two token vectors are.
Softmax	Converts scores into probabilities that sum to 1.
Gradients	Tell the model how to adjust weights to improve.
Cross-Entropy Loss	Quantifies how wrong the model’s prediction was.

Everything else is built on top of these ideas.
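To make the table concrete, here is a minimal sketch of three of those ideas in plain Python. The embedding values and the names `cat`, `kitten`, and `car` are made up for illustration; real embeddings are learned and have hundreds of dimensions.

```python
import math

def dot(a, b):
    """Dot product: larger when two vectors point in similar directions."""
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_index):
    """Loss is the negative log-probability given to the correct token."""
    return -math.log(probs[target_index])

# Toy 2-dimensional embeddings (made-up numbers, for illustration only).
cat = [1.0, 0.5]
kitten = [0.9, 0.6]
car = [-0.8, 0.3]

print(dot(cat, kitten))  # similar directions give a larger score
print(dot(cat, car))     # dissimilar directions give a smaller score

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
print(cross_entropy(probs, 0))  # correct token was the favorite: low loss
print(cross_entropy(probs, 2))  # correct token was deemed unlikely: high loss
```

Gradients are the missing piece here: they tell you how to nudge the scores so that the loss goes down.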

🧩 Common LLM Terminology

Term	Meaning
Token	A chunk of text (word, subword, or even a character).
Embedding	The numeric vector that represents a token.
Context Window	How much text the model can “see” at once.
Parameter	A trainable number in the model (LLMs have billions).
Training	Teaching the model to predict the next token.
Inference	Using the trained model to generate text.
Prompt	The input text you feed into the model.

⚙️ The Transformer: Heart of Modern LLMs

Transformers are the architecture behind nearly every major LLM: GPT, Claude, Gemini, LLaMA, and others.

They replaced older sequential models (RNNs, LSTMs) with a design that can look at all tokens simultaneously — through self-attention.

🧠 Transformer Architecture (Conceptually)

Input Text  →  Token Embeddings  →  [Transformer Blocks]  →  Output Predictions

Each Transformer Block has:

Multi-Head Self-Attention
Feedforward Neural Network
Residual Connections
Layer Normalization

Stack dozens (or hundreds) of these, and you get a model that can capture deep relationships between words and ideas.
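Here is a rough sketch of how those four pieces fit together inside one block, in plain Python. It assumes the "pre-norm" ordering (normalize, apply the sub-layer, add the result back), which is one common choice and not necessarily what any particular model uses. `mean_attention` and `relu_feedforward` are hypothetical placeholders so the sketch runs; they are not real attention or feedforward layers.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance.
    (Real layers also learn a scale and shift; omitted here.)"""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def transformer_block(tokens, attention, feedforward):
    """tokens: list of vectors, one per position.
    attention: mixes information across positions (list -> list).
    feedforward: applied to each position on its own (vector -> vector)."""
    # Sub-layer 1: normalize, attend, then add the result back (residual).
    normed = [layer_norm(t) for t in tokens]
    attended = attention(normed)
    tokens = [[a + b for a, b in zip(t, d)] for t, d in zip(tokens, attended)]
    # Sub-layer 2: normalize, feed forward, add back again.
    updated = [feedforward(layer_norm(t)) for t in tokens]
    return [[a + b for a, b in zip(t, d)] for t, d in zip(tokens, updated)]

# Placeholder sub-layers so the sketch runs; both are stand-ins.
def mean_attention(vectors):
    n = len(vectors)
    avg = [sum(col) / n for col in zip(*vectors)]
    return [avg for _ in vectors]

def relu_feedforward(vec):
    return [max(0.0, v) for v in vec]

x = [[1.0, 2.0, 3.0], [0.0, -1.0, 4.0]]
y = transformer_block(x, mean_attention, relu_feedforward)
print(y)
```

The residual additions are why deep stacks train at all: if a sub-layer outputs zeros, the block simply passes its input through unchanged.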

🔍 What Is Attention?

Imagine reading this sentence:

“The cat sat on the mat because it was tired.”

To understand what “it” refers to, you instinctively connect “it” back to “cat”.
That’s attention.

Attention helps the model decide which previous tokens are relevant when predicting the next one.

🧭 Attention Mechanism (Simplified)

Each token computes three vectors:
  Query (Q): What am I looking for?
  Key   (K): What do I contain?
  Value (V): What information do I hold?

Attention = softmax(Q · Kᵀ / √d) × V

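So each token “looks” at the others, scores their relevance, and forms a weighted summary of what matters most. In GPT-style models a causal mask limits this view to the token itself and earlier positions, so the model can't peek at the future.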
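The formula above can be sketched in a few lines of plain Python. This version adds a causal mask (each position only scores itself and earlier positions), as GPT-style models do. The Q, K, and V values are made up; in a real model they come from learned linear projections of the token embeddings.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Q, K, V: lists of vectors, one per token position.
    Returns (outputs, weights); position i attends only to positions 0..i."""
    d = len(K[0])
    outputs, all_weights = [], []
    for i, q in enumerate(Q):
        # Score the query against every key up to and including position i.
        scores = [sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        weights = softmax(scores)
        # Weighted sum of the corresponding value vectors.
        out = [sum(w * V[j][c] for j, w in enumerate(weights))
               for c in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Made-up vectors for three tokens.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

outputs, weights = causal_attention(Q, K, V)
for i, w in enumerate(weights):
    print(i, [round(x, 3) for x in w])  # one row of attention weights per token
```

Each printed row is one token's attention distribution. The first token can only attend to itself, so its output is just its own value vector.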

🪄 Tokenization — Turning Words into Numbers

LLMs can’t process raw text directly; they operate on numbers.

Tokenization converts text into IDs.

Example:

Text: "hello"
Tokens: ['h', 'e', 'l', 'l', 'o']
IDs: [4, 5, 8, 8, 11]

Larger models use subword tokenizers such as Byte Pair Encoding (BPE), so that common chunks like “trans” or “ation” become single tokens.
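A character-level tokenizer fits in a few lines. The sketch below builds one from a sample string (so the IDs it prints depend on that string and will differ from the illustrative ones above), then shows the counting step at the heart of BPE. The sample texts are arbitrary.

```python
from collections import Counter

text = "hello world"

# Character-level vocabulary: every distinct character gets an integer ID.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # string -> int
itos = {i: ch for ch, i in stoi.items()}      # int -> string

def encode(s):
    return [stoi[ch] for ch in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

ids = encode("hello")
print(ids)
print(decode(ids))

# The core BPE step: find the most frequent adjacent pair. BPE would merge
# that pair into one new token and repeat the process many times.
corpus = "banana bandana"
pairs = Counter(zip(corpus, corpus[1:]))
best_pair, count = pairs.most_common(1)[0]
print(best_pair, count)
```

Note that both "l" characters map to the same ID: a tokenizer is just a consistent lookup table, and `decode` reverses it exactly.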

🏗️ Training: How a Model Learns to Write

Training is simply teaching the model to guess the next token.

Example:

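Input:  "I like deep lear"
Target: " like deep learn"

The target is the input shifted one character to the left, so every position has a “next token” to learn from. At every step, the model tries to predict that next token correctly.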

After each prediction, the loss measures how far off the model was, and backpropagation adjusts the weights to reduce it.
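Here is a small sketch of how one chunk of text turns into many training examples at once. The `block_size` of 8 is an arbitrary choice for the demo.

```python
text = "I like deep learning"
block_size = 8  # context length, chosen arbitrarily for this sketch

x = text[:block_size]       # what the model sees
y = text[1:block_size + 1]  # the same window shifted one character left

# Every prefix of x is a training example whose answer is the next character.
for t in range(block_size):
    context = x[:t + 1]
    target = y[t]
    print(repr(context), "->", repr(target))
```

One window of 8 characters yields 8 predictions, which is part of why transformer training is so efficient.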

Simplified Training Loop

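for inputs, targets in data:
    optimizer.zero_grad()                  # clear gradients from the previous step
    logits, loss = model(inputs, targets)  # forward pass
    loss.backward()                        # backpropagate the loss
    optimizer.step()                       # update the weights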

Do this millions of times, and the model learns the structure of language.
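To see the same loop without any framework, here is a toy version in plain Python. It is not the transformer from this article: it swaps in a much simpler bigram model (one table of scores for "which character follows which") so the gradient can be written by hand. The text, learning rate, and step count are arbitrary.

```python
import math

text = "abababababababab"
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
V = len(chars)

# One row of logits per current character: logits[i][j] scores "j follows i".
logits = [[0.0] * V for _ in range(V)]
pairs = [(stoi[a], stoi[b]) for a, b in zip(text, text[1:])]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def average_loss():
    return sum(-math.log(softmax(logits[i])[j]) for i, j in pairs) / len(pairs)

lr = 0.5
loss_before = average_loss()
for step in range(100):
    # Backward pass by hand: for softmax + cross-entropy,
    # d(loss)/d(logit) = prob - one_hot.
    grads = [[0.0] * V for _ in range(V)]
    for i, j in pairs:
        probs = softmax(logits[i])
        for k in range(V):
            grads[i][k] += (probs[k] - (1.0 if k == j else 0.0)) / len(pairs)
    # Gradient descent update (what optimizer.step() does in the loop above).
    for i in range(V):
        for k in range(V):
            logits[i][k] -= lr * grads[i][k]
loss_after = average_loss()
print(loss_before, loss_after)
```

The loss starts at the "pure guessing" level for two characters and falls as the table learns that "b" follows "a" and vice versa.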

🔮 Inference: How Text Generation Works

Once trained, the model can generate text by sampling one token at a time.

prompt → predict next token → append → predict again → ...
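That loop is short enough to sketch directly. Here `predict_next` is a hypothetical stand-in for a trained model: it just counts which character tends to follow the last one in a tiny made-up corpus and always picks the most frequent.

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat. the cat ate."

# Toy "model": count which character follows each character.
follow = defaultdict(Counter)
for a, b in zip(corpus, corpus[1:]):
    follow[a][b] += 1

def predict_next(context):
    """Stand-in for a trained model: looks only at the last character."""
    options = follow[context[-1]]
    return options.most_common(1)[0][0] if options else " "

def generate(prompt, n_tokens):
    text = prompt
    for _ in range(n_tokens):
        text += predict_next(text)  # predict, append, repeat
    return text

out = generate("th", 10)
print(out)
```

A real LLM runs exactly this outer loop; the only difference is that its `predict_next` looks at the whole context window instead of one character.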

Example

Prompt:

"The universe began"

Model:

→ " with a bang."
→ " billions of years ago."
→ " as a cloud of energy."

Different sampling strategies:

Greedy – Always pick the most likely token
Top-k / Top-p – Add controlled randomness
Temperature – Adjust creativity (higher = wilder)
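These strategies all operate on the same thing: the model's scores (logits) for the next token. Below is a minimal sketch in plain Python, using made-up logits for a 4-token vocabulary. The helper names are illustrative, not from any library.

```python
import math
import random

def softmax(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits):
    return max(range(len(logits)), key=lambda i: logits[i])

def top_k_filter(probs, k):
    """Keep the k most likely tokens, zero the rest, renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose total probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    kept = [q if i in keep else 0.0 for i, q in enumerate(probs)]
    total = sum(kept)
    return [q / total for q in kept]

def sample(probs, rng):
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
rng = random.Random(0)          # seeded so runs are repeatable

print(greedy(logits))
print(softmax(logits, temperature=0.5))  # sharper: favors the top token
print(softmax(logits, temperature=2.0))  # flatter: more variety
print(sample(top_k_filter(softmax(logits), k=2), rng))
print(sample(top_p_filter(softmax(logits), p=0.9), rng))
```

In practice these are combined: apply temperature first, filter with top-k or top-p, then sample from what's left.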

🧰 The Tiny LLM — Minimal PyTorch Implementation

Here’s the core idea of your TinyCharTransformer model:

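import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCharTransformer(nn.Module):
    # TransformerBlock and cfg are not shown here.
    def __init__(self, vocab_size, cfg):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, cfg.n_embd)      # token ID -> vector
        self.pos_emb = nn.Embedding(cfg.block_size, cfg.n_embd)    # position -> vector
        self.blocks = nn.ModuleList([TransformerBlock(cfg) for _ in range(cfg.n_layers)])
        self.ln_f = nn.LayerNorm(cfg.n_embd)                       # final layer norm
        self.lm_head = nn.Linear(cfg.n_embd, vocab_size)           # vector -> vocabulary logits

    def forward(self, idx, targets=None):
        # idx: (batch, time) tensor of token IDs
        pos = torch.arange(idx.size(1), device=idx.device)  # keep positions on the same device as idx
        x = self.token_emb(idx) + self.pos_emb(pos)
        for blk in self.blocks:
            x = blk(x)
        logits = self.lm_head(self.ln_f(x))
        loss = None
        if targets is not None:
            # Flatten (batch, time) so each position counts as one prediction.
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss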

Even this simple model can learn to mimic patterns from text like:

“To be or not to be…” → “that is the question.”

🎨 Visualizing What Happens Inside

Once you’ve trained your model, you can see how it pays attention using visualization tools:

Tool	What it shows
`animate_attention.py`	Animated heatmap of attention weights
`visualize_inference.py`	Step-by-step next-token prediction
`--logit-lens`	Which layers predict what
`--saliency`	Which tokens influence the output

Example: Attention Map

(Imagine each token looking back at previous ones)

token:  t  h  e     c  a  t
         ↑  ↑  ↑        ↑↑↑

The arrows show where the model is “looking” when generating each token.

🧭 Understanding the Full Pipeline

Raw Text → Tokenization → Embeddings
→ Transformer Layers → Softmax → Predicted Token
→ Compare to True Token → Update Weights (Training)
→ Use for Generation (Inference)

That’s the entire lifecycle — from raw data to an AI that writes.

💡 Beyond the Basics

Once you grasp the fundamentals, explore:

Fine-tuning – Specialize models for tasks like summarization
RLHF – Align model behavior with human preferences
Prompt Engineering – Guide outputs with clever phrasing
Scaling Laws – Performance improves predictably with more parameters, data, and compute, but with diminishing returns
LoRA / Quantization – Run large models efficiently on smaller GPUs

🪐 Final Thoughts

You don’t need a supercomputer to understand how ChatGPT works.
You can build a tiny transformer on your laptop — and see it think.

The beauty of modern AI isn’t in the size of the model, but in the simplicity of the math behind it.

Once you’ve trained one yourself, the “mystery” of LLMs turns into pure curiosity. ✨

Written by Nilesh Salpe, October 2025

MIT License · Educational & Open Source
