llm-from-scratch

🧠 Building an LLM from Scratch — How Transformers Learn, Think, and Generate

“I built a tiny language model from scratch — and suddenly, the magic of ChatGPT didn’t feel like magic anymore.”


🚀 What Are Large Language Models (LLMs)?

Large Language Models (LLMs) are AI systems that predict the next word (or token) in a sequence of text.

That’s the core idea — they learn patterns in language by reading billions of examples.

They don’t “understand” the world like humans do.
They approximate patterns of how humans express ideas.

Think of them as supercharged autocomplete engines that have learned:

So when you type:

“The capital of France is…”

The model has seen enough examples to confidently predict → “Paris”.
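To make the "autocomplete" idea concrete, here is a minimal sketch that predicts the next word purely by counting which word followed which in a tiny corpus. The corpus and the function names (`train_bigram_counts`, `predict_next`) are made up for this illustration; a real LLM uses a neural network over the whole context rather than a lookup table of counts.

```python
from collections import Counter, defaultdict

def train_bigram_counts(sentences):
    """Count, for each word, which words follow it in the corpus."""
    follows = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            follows[current][nxt] += 1
    return follows

def predict_next(follows, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    counts = follows.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Hypothetical three-sentence corpus, invented for this example.
corpus = [
    "the capital of france is paris",
    "the capital of italy is rome",
    "paris is the capital of france",
]
follows = train_bigram_counts(corpus)
print(predict_next(follows, "capital"))  # of
```

The counting model only ever looks one word back, which is exactly the limitation that attention (covered later) removes.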


🧮 The Math You Actually Need to Know

You don’t need a PhD to understand LLMs.
Here’s the essential math intuition:

| Concept | Why it matters |
|---|---|
| Vectors | Represent tokens as lists of numbers (embeddings). |
| Dot Product | Measures how related two tokens are. |
| Softmax | Converts scores into probabilities that sum to 1. |
| Gradients | Tell the model how to adjust weights to improve. |
| Cross-Entropy Loss | Quantifies how wrong the model’s prediction was. |

Everything else is built on top of these ideas.
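The concepts above fit in a few lines of plain Python. This is a minimal sketch with made-up 3-dimensional "embeddings" (the words and numbers are assumptions for illustration only): it scores candidates with a dot product, turns the scores into probabilities with softmax, measures the error with cross-entropy, and computes the gradient of that loss with respect to the scores.

```python
import math

def dot(a, b):
    """Dot product: larger when two vectors point in similar directions."""
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    m = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_index):
    """Loss is small when the correct item gets high probability."""
    return -math.log(probs[target_index])

# Made-up vectors: a context vector and three candidate next tokens.
context = [1.0, 0.0, 1.0]
candidates = {
    "paris": [1.0, 0.0, 1.0],
    "london": [0.5, 0.5, 0.0],
    "banana": [-1.0, 1.0, 0.0],
}

scores = [dot(context, vec) for vec in candidates.values()]
probs = softmax(scores)
target = 0  # index of "paris", the correct answer in this toy setup
loss = cross_entropy(probs, target)

# Gradient of the loss with respect to each score: probability minus one-hot target.
grads = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]

for word, p in zip(candidates, probs):
    print(f"{word}: {p:.3f}")
print(f"loss: {loss:.3f}")
```

The gradient is negative for the correct token (push its score up) and positive for the others (push them down), which is the signal training uses.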


🧩 Common LLM Terminology

| Term | Meaning |
|---|---|
| Token | A chunk of text (word, subword, or even a character). |
| Embedding | The numeric vector that represents a token. |
| Context Window | How much text the model can “see” at once. |
| Parameter | A trainable number in the model (LLMs have billions). |
| Training | Teaching the model to predict the next token. |
| Inference | Using the trained model to generate text. |
| Prompt | The input text you feed into the model. |

⚙️ The Transformer: Heart of Modern LLMs

Transformers are the architecture behind nearly every major LLM: GPT, Claude, Gemini, LLaMA, and others.

They replaced older sequential models (RNNs, LSTMs) with a design that can look at all tokens simultaneously — through self-attention.


🧠 Transformer Architecture (Conceptually)

Input Text  →  Token Embeddings  →  [Transformer Blocks]  →  Output Predictions

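Each Transformer Block has:
  Self-attention: lets each token gather information from other tokens.
  Feed-forward network: transforms each token's vector on its own.
  Residual connections and layer normalization: keep training stable as blocks are stacked.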

Stack dozens (or hundreds) of these, and you get a model that understands deep relationships between words and ideas.


🔍 What Is Attention?

Imagine reading this sentence:

“The cat sat on the mat because it was tired.”

To understand what “it” refers to, you instinctively connect “it” back to “cat”.
That’s attention.

Attention helps the model decide which previous tokens are relevant when predicting the next one.

🧭 Attention Mechanism (Simplified)

Each token computes three vectors:
  Query (Q): What am I looking for?
  Key   (K): What do I contain?
  Value (V): What information do I hold?

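Attention = softmax(Q · Kᵀ / √d) × V

Here d is the dimension of the key vectors; dividing by √d keeps the scores in a range where softmax does not saturate as the vectors get longer.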

So each token “looks” at the tokens it is allowed to see (in a GPT-style model, itself and the tokens before it), scores their relevance, and forms a weighted summary of what matters most.
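Here is a minimal sketch of scaled dot-product attention in plain Python, following the formula above. The Q, K, and V values are made up, and the `causal` flag is an assumption added for illustration: when it is on, each token is blocked from looking at later tokens, as in GPT-style models. Real implementations use tensor libraries and multiple heads.

```python
import math

def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, causal=False):
    """Scaled dot-product attention; Q, K, V hold one vector per token."""
    d = len(K[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        scores = []
        for j, k in enumerate(K):
            if causal and j > i:
                scores.append(float("-inf"))  # hide future tokens
            else:
                scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d))
        row = softmax(scores)  # how much token i attends to each token j
        weights.append(row)
        # Weighted summary of the value vectors.
        outputs.append([sum(w * v[c] for w, v in zip(row, V)) for c in range(len(V[0]))])
    return outputs, weights

# Made-up 2-dimensional Q, K, V for three tokens.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

outputs, weights = attention(Q, K, V)
causal_outputs, causal_weights = attention(Q, K, V, causal=True)
print(causal_weights[0])  # [1.0, 0.0, 0.0]
```

With the causal mask, the first token can only attend to itself, so its output is simply its own value vector.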


🪄 Tokenization — Turning Words into Numbers

LLMs can’t process raw text directly; they operate on numbers.

Tokenization converts text into IDs.

Example:

Text: "hello"
Tokens: ['h', 'e', 'l', 'l', 'o']
IDs: [4, 5, 8, 8, 11]

Larger models typically use subword tokenizers such as Byte Pair Encoding (BPE), so that common chunks like “trans” or “ation” become single tokens.
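A character-level tokenizer is small enough to write by hand. The sketch below assumes the vocabulary is built from the sorted set of characters in the training text; the helper names (`build_vocab`, `encode`, `decode`) are invented for this example. The exact IDs depend on the vocabulary, so they will differ from one dataset to the next.

```python
def build_vocab(text):
    """Map each distinct character to an integer ID and back."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}  # string -> int
    itos = {i: ch for ch, i in stoi.items()}      # int -> string
    return stoi, itos

def encode(text, stoi):
    """Text -> list of token IDs."""
    return [stoi[ch] for ch in text]

def decode(ids, itos):
    """List of token IDs -> text."""
    return "".join(itos[i] for i in ids)

# Tiny made-up "training text" that defines the vocabulary.
stoi, itos = build_vocab("hello world")
ids = encode("hello", stoi)
print(ids)                # [3, 2, 4, 4, 5]
print(decode(ids, itos))  # hello
```

Encoding a character that never appeared in the training text raises a `KeyError` here; real tokenizers avoid this by falling back to bytes or an unknown-token ID.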


🏗️ Training: How a Model Learns to Write

Training is simply teaching the model to guess the next token.

Example:

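Input:  "I like deep lear"
Target: " like deep learn"

(The target is the input shifted one character to the left, so every position is trained to predict the character that comes next.)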

At every step, the model tries to predict the next token correctly.

After each prediction, the loss measures how wrong the model was. Backpropagation then computes how much each weight contributed to that loss, and the optimizer adjusts the weights to reduce it.
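In a character-level setup, a training example is just a slice of text and the same slice shifted by one position. The sketch below shows one way to build such a pair; `make_example` is a hypothetical helper, and `block_size` here plays the role of the context length.

```python
def make_example(text, start, block_size):
    """Return (input, target) where target is input shifted by one character."""
    chunk = text[start : start + block_size + 1]
    return chunk[:-1], chunk[1:]

x, y = make_example("I like deep learning", start=0, block_size=16)
print(repr(x))  # 'I like deep lear'
print(repr(y))  # ' like deep learn'

# Every position is its own prediction task: context so far -> next character.
for t in range(len(x)):
    print(repr(x[: t + 1]), "->", repr(y[t]))
```

So a single 16-character slice yields 16 training signals, one per position.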

Simplified Training Loop

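for inputs, targets in data:               # assumes data yields (input, target) batches
    optimizer.zero_grad()                  # clear gradients from the previous step
    logits, loss = model(inputs, targets)  # forward pass: predictions and loss
    loss.backward()                        # backpropagation: compute gradients
    optimizer.step()                       # nudge the weights to reduce the loss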

Do this millions of times, and the model learns the structure of language.
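To see the same forward, loss, gradient, update cycle without any framework, here is a minimal sketch that trains a bigram model (a table of scores for "which token follows which") with hand-written gradient descent. This is a toy stand-in, not the Transformer training code: the gradient is computed manually using the fact that, for softmax with cross-entropy, the gradient with respect to the scores is the predicted probabilities minus the one-hot target. The text, step count, and learning rate are arbitrary choices.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train_bigram(ids, vocab_size, steps=200, lr=0.5):
    """logits[a][b] is the score for token b following token a."""
    logits = [[0.0] * vocab_size for _ in range(vocab_size)]
    pairs = list(zip(ids, ids[1:]))  # (current token, next token)
    losses = []
    for _ in range(steps):
        grads = [[0.0] * vocab_size for _ in range(vocab_size)]
        total_loss = 0.0
        for a, b in pairs:
            probs = softmax(logits[a])         # forward pass
            total_loss += -math.log(probs[b])  # cross-entropy loss
            for j in range(vocab_size):        # gradient: probs - one_hot(b)
                grads[a][j] += (probs[j] - (1.0 if j == b else 0.0)) / len(pairs)
        for a in range(vocab_size):            # gradient descent update
            for j in range(vocab_size):
                logits[a][j] -= lr * grads[a][j]
        losses.append(total_loss / len(pairs))
    return logits, losses

text = "abababab"
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
ids = [stoi[ch] for ch in text]

logits, losses = train_bigram(ids, vocab_size=len(chars))
print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}")
```

The loss starts at ln 2 (a coin flip between two characters) and falls as the table learns that "b" follows "a" and "a" follows "b".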


🔮 Inference: How Text Generation Works

Once trained, the model can generate text by sampling one token at a time.

prompt → predict next token → append → predict again → ...
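The loop above can be sketched in a few lines. Here `next_token` is a hypothetical stand-in for a trained model (it just counts upward modulo 5), and `block_size` is assumed to be the context window: the model only ever sees the most recent `block_size` tokens.

```python
def generate(next_token, prompt_ids, max_new_tokens, block_size=8):
    """Autoregressive generation: predict one token, append it, repeat."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        context = ids[-block_size:]      # crop to the context window
        ids.append(next_token(context))  # predict, then append
    return ids

def toy_next_token(context):
    """Hypothetical stand-in for a trained model: counts upward modulo 5."""
    return (context[-1] + 1) % 5

print(generate(toy_next_token, [0, 1], max_new_tokens=4))  # [0, 1, 2, 3, 4, 0]
```

Swapping `toy_next_token` for a function that runs a real model and picks a token from its output gives the usual text-generation loop.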

Example

Prompt:

"The universe began"

Model:

→ " with a bang."
→ " billions of years ago."
→ " as a cloud of energy."

Different sampling strategies:
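Three commonly used strategies are greedy decoding (always take the top-scoring token), temperature sampling (flatten or sharpen the distribution before sampling), and top-k sampling (sample only among the k best tokens). The selection and the function names below are assumptions for illustration; this is a minimal sketch operating on a plain list of logits.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits):
    """Always pick the highest-scoring token (deterministic)."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample_with_temperature(logits, temperature=1.0, rng=random):
    """Temperature below 1 sharpens the distribution; above 1 flattens it."""
    probs = softmax([z / temperature for z in logits])
    return rng.choices(list(range(len(probs))), weights=probs, k=1)[0]

def sample_top_k(logits, k, temperature=1.0, rng=random):
    """Keep only the k highest-scoring tokens, then sample among them."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    probs = softmax([logits[i] / temperature for i in top])
    return rng.choices(top, weights=probs, k=1)[0]

logits = [1.0, 3.0, 2.0, -1.0]  # made-up scores for a 4-token vocabulary
rng = random.Random(0)          # seeded so runs are repeatable
print(greedy(logits))           # 1
print(sample_with_temperature(logits, temperature=0.8, rng=rng))
print(sample_top_k(logits, k=2, rng=rng))
```

Greedy decoding tends to be repetitive, while higher temperatures and larger k trade predictability for variety.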


🧰 The Tiny LLM — Minimal PyTorch Implementation

Here’s the core of the TinyCharTransformer model:

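import torch
import torch.nn as nn
import torch.nn.functional as F

# TransformerBlock is not shown here; it is assumed to be defined elsewhere.
class TinyCharTransformer(nn.Module):
    def __init__(self, vocab_size, cfg):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, cfg.n_embd)     # token ID -> vector
        self.pos_emb = nn.Embedding(cfg.block_size, cfg.n_embd)   # position -> vector
        self.blocks = nn.ModuleList([TransformerBlock(cfg) for _ in range(cfg.n_layers)])
        self.ln_f = nn.LayerNorm(cfg.n_embd)
        self.lm_head = nn.Linear(cfg.n_embd, vocab_size)          # vector -> scores over the vocabulary

    def forward(self, idx, targets=None):
        # idx: (batch, time) tensor of token IDs
        positions = torch.arange(idx.size(1), device=idx.device)  # same device as the input
        x = self.token_emb(idx) + self.pos_emb(positions)
        for blk in self.blocks:
            x = blk(x)
        logits = self.lm_head(self.ln_f(x))                       # (batch, time, vocab_size)
        loss = None
        if targets is not None:
            # Flatten batch and time so every position counts as one prediction.
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss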

Even this simple model can learn to mimic patterns from text like:

“To be or not to be…” → “that is the question.”


🎨 Visualizing What Happens Inside

Once you’ve trained your model, you can see how it pays attention using visualization tools:

| Tool | What it shows |
|---|---|
| animate_attention.py | Animated heatmap of attention weights |
| visualize_inference.py | Step-by-step next-token prediction |
| --logit-lens | Which layers predict what |
| --saliency | Which tokens influence the output |

Example: Attention Map

(Imagine each token looking back at previous ones)

token:  t  h  e     c  a  t
         ↑  ↑  ↑        ↑↑↑

The arrows show where the model is “looking” when generating each token.


🧭 Understanding the Full Pipeline

Raw Text → Tokenization → Embeddings
→ Transformer Layers → Softmax → Predicted Token
→ Compare to True Token → Update Weights (Training)
→ Use for Generation (Inference)

That’s the entire lifecycle — from raw data to an AI that writes.


💡 Beyond the Basics

Once you grasp the fundamentals, explore:



🪐 Final Thoughts

You don’t need a supercomputer to understand how ChatGPT works.
You can build a tiny transformer on your laptop — and see it think.

The beauty of modern AI isn’t in the size of the model, but in the simplicity of the math behind it.

Once you’ve trained one yourself, the “mystery” of LLMs turns into pure curiosity. ✨


Written by Nilesh Salpe, October 2025

MIT License · Educational & Open Source
