Introduction: Disassembling the Magic of LLMs
Large Language Models seem magical: you type a question and get a coherent, structured, often surprisingly intelligent response. But under the hood there is no magic, only mathematics. An LLM is fundamentally a system that predicts the next token in a sequence based on statistical patterns learned from billions of words during training.
Understanding how LLMs work is not an academic exercise: it's an essential practical skill. Knowing what happens between your prompt and the generated response allows you to write better prompts, debug unexpected behaviors, choose the right model for your use case, and understand why LLMs hallucinate.
What You'll Learn in This Article
- How text is transformed into numbers through tokenization
- The role of embeddings in representing semantic meaning
- How the attention mechanism works in Transformers
- The text generation process: from logits to tokens
- Sampling strategies: temperature, top-k, and top-p
- Why LLMs hallucinate and the role of the context window
Phase 1: Tokenization - From Text to Numbers
The first step in LLM processing is tokenization: converting text into a sequence of integers. Neural models don't understand letters or words; they operate on numerical vectors. Tokenization is the bridge between human language and model mathematics.
Byte-Pair Encoding (BPE)
The most common algorithm is Byte-Pair Encoding (BPE), used by GPT, Claude, and most modern models. BPE works iteratively: it starts from individual characters and progressively merges the most frequent pairs into longer tokens.
The result is a vocabulary of 50,000-100,000 tokens representing an optimal compromise between granularity and efficiency. Common words like "the" become a single token, while rare words are split into sub-tokens.
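The merge loop described above can be sketched in a few lines of Python. This is a toy illustration of a single BPE iteration, not a production tokenizer: real implementations also record the learned merge table and typically operate on bytes rather than characters.

```python
from collections import Counter

def bpe_merge_step(corpus):
    """One BPE iteration: find the most frequent adjacent pair and merge it.

    `corpus` is a list of token sequences (initially individual characters).
    """
    pairs = Counter()
    for seq in corpus:
        for a, b in zip(seq, seq[1:]):
            pairs[(a, b)] += 1
    if not pairs:
        return corpus, None
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    merged = []
    for seq in corpus:
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                out.append(seq[i] + seq[i + 1])  # merge the pair into one token
                i += 2
            else:
                out.append(seq[i])
                i += 1
        merged.append(out)
    return merged, best

# Start from individual characters and run a few merges
corpus = [list("lower"), list("lowest"), list("low")]
for _ in range(3):
    corpus, pair = bpe_merge_step(corpus)
    print(f"merged {pair} -> {corpus}")
```

After a few iterations, frequent character sequences like "low" fuse into single tokens, while rarer suffixes ("r", "st") remain separate — exactly the common-word/rare-word split described above.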
```python
# Example of tokenization with tiktoken (OpenAI's tokenizer)
import tiktoken

# Load GPT-4's tokenizer
enc = tiktoken.encoding_for_model("gpt-4")

# Tokenize a sentence
text = "Generative artificial intelligence is revolutionary"
tokens = enc.encode(text)

print(f"Text: {text}")
print(f"Token IDs: {tokens}")
print(f"Token count: {len(tokens)}")

# Decode each token to see the breakdown
for token_id in tokens:
    print(f"  ID {token_id} -> '{enc.decode([token_id])}'")

# Typical output (the IDs are integers; decoding shows each token's text):
# Text: Generative artificial intelligence is revolutionary
# Breakdown: 'Gen' + 'erative' + ' artificial' + ' intelligence' + ' is' + ' revolutionary'
# Token count: 6
```
Practical Impact of Tokenization
Tokenization has important practical consequences every developer should know:
- Cost: APIs charge per token, not per word. One word can be 1-4 tokens
- Context window: the limit is in tokens, not words. 4,000 tokens is roughly 3,000 English words
- Different languages: non-English languages often require more tokens to express the same concept (~1.3x for Italian)
- Code: source code is often less token-efficient than natural text
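The cost point can be made concrete with a back-of-the-envelope calculation. The per-token prices below are hypothetical placeholders — check your provider's current price list:

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Estimate an API call's cost; prices are per 1,000 tokens."""
    return (input_tokens / 1000 * price_in_per_1k
            + output_tokens / 1000 * price_out_per_1k)

# Hypothetical prices: $0.003 / 1K input tokens, $0.015 / 1K output tokens
cost = estimate_cost(input_tokens=2_000, output_tokens=500,
                     price_in_per_1k=0.003, price_out_per_1k=0.015)
print(f"${cost:.4f}")  # 0.006 for input + 0.0075 for output = $0.0135
```

Note that output tokens are typically priced several times higher than input tokens, so verbose responses dominate the bill.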
Phase 2: Embeddings - From Token to Meaning
After tokenization, each token ID is converted into an embedding: a dense vector of real numbers (typically 768-12,288 dimensions) that captures the semantic meaning of the token.
The power of embeddings lies in their geometry: words with similar meanings have nearby vectors in the space. "King" and "Queen" are close, as are "Paris" and "France". These relationships are automatically learned during training.
Embeddings: Numbers with Meaning
An embedding is not a simple numeric ID: it's a high-dimensional vector where each dimension captures an aspect of meaning. Arithmetic operations on embeddings produce semantically sensible results: vec("king") - vec("man") + vec("woman") ≈ vec("queen").
```python
# Visualizing semantic similarity of embeddings
from openai import OpenAI
import numpy as np

client = OpenAI()

def get_embedding(text: str) -> list:
    """Get the embedding of a text using OpenAI."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

def cosine_similarity(a: list, b: list) -> float:
    """Calculate cosine similarity between two vectors."""
    a, b = np.array(a), np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Compare semantic similarities
words = ["cat", "dog", "automobile", "feline"]
embeddings = {w: get_embedding(w) for w in words}

print("Semantic similarities:")
print(f"  cat-feline:     {cosine_similarity(embeddings['cat'], embeddings['feline']):.4f}")
print(f"  cat-dog:        {cosine_similarity(embeddings['cat'], embeddings['dog']):.4f}")
print(f"  cat-automobile: {cosine_similarity(embeddings['cat'], embeddings['automobile']):.4f}")
# cat-feline should have the highest similarity
```
Positional Encoding
Beyond semantic embeddings, Transformers add a positional encoding: a signal indicating the position of each token in the sequence. Without this mechanism, the model wouldn't distinguish "the cat chases the mouse" from "the mouse chases the cat", since the Transformer architecture processes all tokens in parallel, not sequentially.
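As an illustration, here is the sinusoidal positional encoding from the original "Attention Is All You Need" paper. Many recent LLMs use learned or rotary positional embeddings instead, but the idea is the same: inject each token's position into its vector.

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encoding (assumes an even d_model):
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    """
    pos = np.arange(seq_len)[:, None]             # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]          # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))   # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                  # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                  # odd dimensions: cosine
    return pe

pe = positional_encoding(seq_len=50, d_model=64)
print(pe.shape)   # (50, 64) — one vector per position, added to the token embedding
```

Each position gets a unique pattern of sines and cosines at different frequencies, so the model can tell "cat" in position 1 apart from "cat" in position 5.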
Phase 3: The Transformer - Attention Is All You Need
The heart of every modern LLM is the Transformer architecture, composed of repeated blocks of Self-Attention and Feed-Forward Networks. Large models stack on the order of a hundred of these blocks, each refining the text representation.
Self-Attention: The Key Mechanism
Self-attention allows each token to "look at" all other tokens in the sequence and decide how relevant they are to its meaning in the current context. In the sentence "The cat sat on the mat because it was tired", the attention mechanism links "it" back to "cat" (not "mat"), resolving the coreference.
Mathematically, for each token, three vectors are computed: Query (what am I looking for), Key (what I offer as context), and Value (my informational content). The dot product between Query and Key determines the attention weight, which is used to weigh the Values.
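This computation can be sketched in a few lines of numpy. It is a single attention head with random weights for illustration; real decoder LLMs also apply a causal mask so a token cannot attend to future positions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv     # project each token into Query/Key/Value
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))   # (seq, seq): how much each token attends to each other
    return weights @ V, weights                 # weighted sum of Values

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
X = rng.normal(size=(seq_len, d_model))          # token embeddings (+ positions)
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = self_attention(X, Wq, Wk, Wv)
print(out.shape)             # (4, 8): one refined vector per token
print(weights.sum(axis=-1))  # each row of attention weights sums to 1
```

The division by sqrt(d_k) keeps the dot products from growing with dimension, which would otherwise push the softmax into near-one-hot saturation.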
Multi-Head Attention
A single attention mechanism captures one type of relationship. Multi-head attention runs multiple attention operations in parallel (typically 32-128 "heads"), each specialized in a different aspect: syntactic, semantic, proximity, coreference relationships, and so on.
Anatomy of a Transformer Block
Each Transformer block follows this structure: Layer Norm to stabilize input, Multi-Head Self-Attention to capture relationships between tokens, Residual Connection to preserve original information, a second Layer Norm, and a Feed-Forward Network (2 dense layers) to transform the representation, followed by another Residual Connection. GPT-4 is reported to stack on the order of 120 such blocks, though its exact architecture has not been officially disclosed.
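The data flow of such a (pre-norm) block can be sketched with placeholder sublayers. The `attn` and `ffn` callables below are hypothetical stand-ins, not real trained layers; the point is the normalize → sublayer → residual pattern.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def transformer_block(x, attn, ffn):
    """One pre-norm Transformer block: norm -> sublayer -> residual, twice."""
    x = x + attn(layer_norm(x))   # self-attention sublayer + residual connection
    x = x + ffn(layer_norm(x))    # feed-forward sublayer + residual connection
    return x

# Stub sublayers, just to show shapes flowing through
d = 16
W1, W2 = np.eye(d) * 0.1, np.eye(d) * 0.1
attn = lambda h: h * 0.5                     # placeholder for multi-head attention
ffn = lambda h: np.maximum(h @ W1, 0) @ W2   # 2 dense layers with a ReLU between

x = np.random.default_rng(1).normal(size=(4, d))  # 4 tokens, 16 dimensions
y = transformer_block(x, attn, ffn)
print(y.shape)  # (4, 16): same shape in and out, so blocks can be stacked
```

Because input and output shapes match, a model is built simply by stacking this block dozens of times.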
Phase 4: Text Generation
After the input text has passed through all Transformer blocks, the last layer produces an output vector for each position. To generate the next token, this vector is projected onto the entire vocabulary producing logits: a numerical score for every possible token in the vocabulary.
From Logits to Probabilities: Softmax
Logits are transformed into probabilities through the softmax function, which normalizes the scores so they sum to 1. The token with the highest probability is the model's "best prediction", but it's not always the one selected.
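A minimal softmax sketch (the maximum is subtracted before exponentiating for numerical stability, as real implementations do); the four logits stand in for a hypothetical four-token vocabulary:

```python
import numpy as np

def softmax(logits):
    z = np.asarray(logits, dtype=float)
    e = np.exp(z - z.max())   # shift by the max to avoid overflow
    return e / e.sum()

logits = np.array([4.0, 2.0, 1.0, 0.5])   # raw scores over a toy vocabulary
probs = softmax(logits)
print(probs.round(3))   # highest logit -> highest probability
print(probs.sum())      # probabilities sum to 1
```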
Sampling Strategies
The choice of the next token is not deterministic. Different sampling strategies produce outputs with different characteristics:
- Greedy decoding: always picks the most probable token. Deterministic but often repetitive and boring
- Random sampling: samples from the full distribution. Creative but potentially incoherent
- Temperature: controls "randomness". T=0 is greedy, T=1 is the original distribution, T>1 increases creativity
- Top-k sampling: samples only from the k most probable tokens (e.g., k=40)
- Top-p (nucleus) sampling: samples from the smallest set of tokens whose cumulative probability exceeds p (e.g., p=0.9)
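The strategies above can be combined into one sampling function. This is a simplified numpy sketch, not what production inference engines do verbatim (for instance, they typically renormalize between the top-k and top-p filters):

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Pick a token id from logits using temperature, top-k, and top-p filtering."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=float)
    if temperature == 0:                        # greedy decoding: just the argmax
        return int(logits.argmax())
    scaled = logits / temperature               # temperature reshapes the distribution
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    if top_k is not None:                       # keep only the k most probable tokens
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)
    if top_p is not None:                       # nucleus: smallest set with cumulative prob >= p
        order = np.argsort(probs)[::-1]
        cum = np.cumsum(probs[order])
        keep = order[: int(np.searchsorted(cum, top_p)) + 1]
        mask = np.zeros_like(probs)
        mask[keep] = probs[keep]
        probs = mask
    probs /= probs.sum()                        # renormalize after filtering
    return int(rng.choice(len(probs), p=probs))

logits = np.array([3.0, 2.5, 1.0, -1.0, -2.0])  # toy 5-token vocabulary
print(sample_next_token(logits, temperature=0))            # always the argmax
print(sample_next_token(logits, temperature=0.7, top_k=3)) # one of the top 3 tokens
```

Note how greedy decoding is just the limit of temperature → 0, and top-k/top-p act as filters before the final random draw.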
```python
# Example: effect of temperature on generation
from anthropic import Anthropic

client = Anthropic()
prompt = "Write the beginning of a fantasy story in one line:"

# Note: the Anthropic API accepts temperature values between 0.0 and 1.0
for temp in [0.0, 0.3, 0.7, 1.0]:
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=100,
        temperature=temp,
        messages=[{"role": "user", "content": prompt}]
    )
    print(f"\nTemperature {temp}:")
    print(f"  {response.content[0].text}")

# Temperature 0.0: near-deterministic output, usually the same every time
# Temperature 0.3: slight variation, still very coherent
# Temperature 0.7: creative, good balance
# Temperature 1.0: most creative, occasionally less coherent
```
Context Window: The Memory of LLMs
The context window is the maximum number of tokens an LLM can process in a single request (input + output). This is effectively the model's "working memory" during a conversation.
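Since the window covers input and output together, a request must budget for both. A trivial check, with hypothetical numbers:

```python
def fits_context(input_tokens: int, max_output_tokens: int, context_window: int) -> bool:
    """The context window covers input AND output: both must fit together."""
    return input_tokens + max_output_tokens <= context_window

# Hypothetical budget against an 8,192-token window
print(fits_context(7_000, 1_000, 8_192))   # fits: 8,000 <= 8,192
print(fits_context(7_000, 2_000, 8_192))   # does not fit: 9,000 > 8,192
```

In long conversations, the history itself counts as input, which is why chat applications eventually truncate or summarize earlier turns.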
Context Window by Model
| Model | Context Window | Approximate Equivalent |
|---|---|---|
| GPT-3.5 Turbo | 4,096 / 16,384 tokens | ~3,000 / 12,000 words |
| GPT-4 / GPT-4 Turbo | 8,192 / 128,000 tokens | ~6,000 / 96,000 words |
| Claude 3.5 Sonnet | 200,000 tokens | ~150,000 words |
| Gemini 1.5 Pro | 1,000,000 tokens | ~750,000 words |
| Llama 3.1 | 128,000 tokens | ~96,000 words |
Hallucinations: Why LLMs Make Things Up
Hallucinations are one of the most critical problems with LLMs: the model generates false information with the same confidence it generates true information. This happens because an LLM doesn't "know" facts: it predicts the most probable next token given the context.
If the statistical pattern suggests that after "The capital of Australia is" the most likely token is "Sydney", the model will generate "Sydney" even though the correct answer is "Canberra". The model has no internal mechanism to verify the truth of its outputs.
Mitigation: Retrieval-Augmented Generation (RAG)
The most effective strategy to reduce hallucinations is RAG: providing the model with factual information retrieved from reliable sources as part of the context. Instead of asking the model to "remember", we give it updated data and ask it to reason about it.
```python
# Simplified RAG: providing factual context to the model
from anthropic import Anthropic

client = Anthropic()

# Context retrieved from a database or search engine
factual_context = """
Company data updated to Q3 2025:
- Revenue: EUR 12.5M (+23% YoY)
- Employees: 85
- Active clients: 342
- NPS score: 72
"""

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=500,
    messages=[{
        "role": "user",
        "content": f"""Based EXCLUSIVELY on the following data, answer the question.
If the data doesn't contain the answer, say "I don't have this information".

DATA:
{factual_context}

QUESTION: What is the current revenue and how many employees do we have?"""
    }]
)

print(response.content[0].text)
```
Conclusions
Understanding the inner workings of LLMs - from tokenization to generation, through embeddings, attention, and sampling - is not just theoretical knowledge. It's the foundation for using these tools effectively and consciously.
Tokenization influences costs and context limits. Temperature and sampling strategies determine output creativity. The attention mechanism explains why the model understands (or doesn't understand) context. Hallucinations are a direct consequence of the next-token-prediction architecture.
In the next article, we'll put this knowledge into practice with Advanced Prompt Engineering: systematic techniques to get the most from LLMs, from zero-shot to chain-of-thought, from system prompts to the ReAct pattern.