EECS 224: Large Language Models
This is a new course at UC Merced, starting in Spring 2025. In this course, students will learn the fundamentals of the modeling, theory, ethics, and systems aspects of large language models, as well as gain hands-on experience working with them.
The Goal of This Course
Offering useful, fundamental, and detailed LLM knowledge to students.
Coursework
- In-Course Question Answering
- Final Projects
In-Course Question Answering
In each class, students will be asked several questions. Each student can answer each question only once. The first student who correctly answers a question and explains the answer clearly to the class will be granted 1 credit. The final scores will be calculated from the credits accumulated over the whole semester. This score does not contribute to the final grading. In the final class, the students with the highest question-answering scores will be recognized.
Final Projects
Every student should complete a final project related to LLMs and present it in the final classes. A project report should be completed before the presentation and submitted to the Final Project Google Form. The file should be named “Your Name.pdf”. The format should follow the 2025 ACL long paper template. Good examples are at Project Report Examples
The final project scores will be calculated as 70% instructor’s rating + 30% classmates’ rating. After all the presentations, every student will be asked to choose the 3 best projects. The classmates’ rating of each project will be based on the votes of all classmates.
Useful Links
Lecture 1: Overview of LLMs
What is Language?
Language is a systematic means of communicating ideas or feelings using conventionalized signs, sounds, gestures, or marks.

More than 7,000 languages are spoken around the world today, shaping how we describe and perceive the world around us. Source: https://www.snexplores.org/article/lets-learn-about-the-science-of-language
Text in Language
Text represents the written form of language, converting speech and meaning into visual symbols. Key aspects include:
Basic Units of Text
Text can be broken down into hierarchical units:
- Characters: The smallest meaningful units in writing systems
- Words: Combinations of characters that carry meaning
- Sentences: Groups of words expressing complete thoughts
- Paragraphs: Collections of related sentences
- Documents: Complete texts serving a specific purpose
Text Properties
Text demonstrates several key properties:
- Linearity: Written symbols appear in sequence
- Discreteness: Clear boundaries between units
- Conventionality: Agreed-upon meanings within a language community
- Structure: Follows grammatical and syntactic rules
- Context: Meaning often depends on surrounding text
Question 1: Could you give some examples in English where a word has two different meanings in two different sentences?
Based on the above properties shared by different languages, NLP researchers have developed a unified machine learning technique to model language data – large language models. Let’s start to learn this unified language modeling technique.
What is a Language Model?
Mathematical Definition
A language model is fundamentally a probability distribution over sequences of words or tokens. Mathematically, it can be expressed as:
\[P(w_1, w_2, ..., w_n) = \prod_i P(w_i|w_1, ..., w_{i-1})\]where:
- \(w_1, w_2, ..., w_n\) represents a sequence of words or tokens
- The conditional probability of word \(w_i\) given all previous words is \(P(w_i|w_1, ..., w_{i-1})\)
For practical implementation, this often takes the form:
\[P(w_t|context) = \text{softmax}(h(context) \cdot W)\]where:
- Target word: \(w_t\)
- Context encoding function: \(h(context)\)
- Weight matrix: \(W\)
- softmax normalizes the output into probabilities
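As a rough illustration of this parameterization, here is a minimal NumPy sketch; the toy vocabulary, the random embeddings, and the mean-pooling context encoder `h` are illustrative assumptions, not how a real LLM computes its context encoding:

import numpy as np

# Toy vocabulary, embeddings, and weights (illustrative assumptions only).
vocab = ["I", "love", "chocolate", "."]
d_model = 8
rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), d_model))   # token embeddings
W = rng.normal(size=(d_model, len(vocab)))   # output weight matrix W

def h(context_tokens):
    # Toy context encoder: mean of the context token embeddings.
    # A real LLM would use a Transformer here.
    ids = [vocab.index(t) for t in context_tokens]
    return E[ids].mean(axis=0)

def next_token_probs(context_tokens):
    logits = h(context_tokens) @ W           # h(context) · W
    exp = np.exp(logits - logits.max())      # numerically stable softmax
    return exp / exp.sum()                   # P(w_t | context)

print(dict(zip(vocab, next_token_probs(["I", "love"]).round(3))))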
Example 1: Sentence Probability Calculation
Consider the sentence: “I love chocolate.”
The language model predicts the following probabilities:
- \[P(\text{'I'}) = 0.2\]
- \[P(\text{'love'}|\text{'I'}) = 0.4\]
- \[P(\text{'chocolate'}|\text{'I love'}) = 0.5\]
The total probability of the sentence is calculated as:
\(P(\text{'I love chocolate'}) = P(\text{'I'}) \cdot P(\text{'love'}|\text{'I'}) \cdot P(\text{'chocolate'}|\text{'I love'})\)
\(P(\text{'I love chocolate'}) = 0.2 \cdot 0.4 \cdot 0.5 = 0.04\)
Thus, the probability of the sentence “I love chocolate” is 0.04.
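The same chain-rule product can be checked in a couple of lines of Python, using exactly the illustrative probabilities from this example:

import math

# Conditional probabilities taken from Example 1 (illustrative values).
cond_probs = {
    "I": 0.2,                     # P('I')
    "love | I": 0.4,              # P('love' | 'I')
    "chocolate | I love": 0.5,    # P('chocolate' | 'I love')
}

sentence_prob = math.prod(cond_probs.values())
print(sentence_prob)  # 0.2 * 0.4 * 0.5 = 0.04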
Example 2: Dialogue Probability Calculation
For the dialogue:
A: “Hello, how are you?”
B: “I’m fine, thank you.”
The model provides the following probabilities:
- Speaker A’s Sentence:
- \[P(\text{'Hello'}) = 0.3\]
- \[P(\text{','}|\text{'Hello'}) = 0.8\]
- \[P(\text{'how'}|\text{'Hello ,'}) = 0.5\]
- \[P(\text{'are'}|\text{'Hello , how'}) = 0.6\]
- \[P(\text{'you'}|\text{'Hello , how are'}) = 0.7\]
- Speaker B’s Sentence:
- \[P(\text{'I'}) = 0.4\]
- \[P(\text{'m'}|\text{'I'}) = 0.5\]
- \[P(\text{'fine'}|\text{'I m'}) = 0.6\]
- \[P(\text{','}|\text{'I m fine'}) = 0.7\]
- \[P(\text{'thank'}|\text{'I m fine ,'}) = 0.8\]
- \[P(\text{'you'}|\text{'I m fine , thank'}) = 0.9\]
- Total Probability for the Dialogue:
Combine the probabilities for both sentences:
\(P(\text{'Hello, how are you? I\'m fine, thank you.'}) = P(\text{'Hello, how are you?'}) \cdot P(\text{'I\'m fine, thank you.'})\)
\(P(\text{'Hello, how are you? I\'m fine, thank you.'}) = 0.0504 \cdot 0.06048 = 0.003048192\)
Thus, the total probability of the dialogue is approximately 0.00305.
Example 3: Partial Sentence Generation
Consider the sentence: “The dog barked loudly.”
The probabilities assigned by the language model are:
- \[P(\text{'The'}) = 0.25\]
- \[P(\text{'dog'}|\text{'The'}) = 0.4\]
- \[P(\text{'barked'}|\text{'The dog'}) = 0.5\]
- \[P(\text{'loudly'}|\text{'The dog barked'}) = 0.6\]
Question 2: Calculate the total probability of the sentence \(P(\text{'The dog barked loudly'})\) using the given probabilities.
The Transformer Model: Revolutionizing Language Models
The emergence of the Transformer architecture marked a paradigm shift in how machines process and understand human language. Unlike its predecessors, which struggled with long-range patterns in text, this groundbreaking architecture introduced mechanisms that revolutionized natural language processing (NLP).
The Building Blocks of Language Understanding
From Text to Machine-Readable Format
Before any sophisticated processing can occur, raw text must be converted into a format that machines can process. This happens in two crucial stages:
- Text Segmentation
The first challenge is breaking down text into meaningful units. Imagine building with LEGO blocks - just as you need individual blocks to create complex structures, language models need discrete pieces of text to work with. These pieces, called tokens, might be:
- Complete words
- Parts of words
- Individual characters
- Special symbols
For instance, the phrase “artificial intelligence” might become [“art”, “ificial”, “intel”, “ligence”], allowing the model to recognize patterns even in unfamiliar words.
- Numerical Representation
Once we have our text pieces, each token gets transformed into a numerical vector - essentially a long list of numbers. Think of this as giving each word or piece its own unique mathematical “fingerprint” that captures its meaning and relationships with other words.
Adding Sequential Understanding
One of the most innovative aspects of Transformers is how they handle word order. Rather than treating text like a bag of unrelated words, the architecture adds precise positional information to each token’s representation.
Consider how the meaning changes in these sentences:
- “The cat chased the mouse”
- “The mouse chased the cat”
The words are identical, but their positions completely change the meaning. The Transformer’s positional encoding system ensures this crucial information isn’t lost.
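The original Transformer realizes this positional information with fixed sinusoidal encodings added to the token embeddings; below is a minimal NumPy sketch of that scheme (the sequence length and model size are arbitrary illustrative choices):

import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    # Assumes an even d_model for simplicity.
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]            # (1, d_model/2)
    angles = positions / np.power(10000, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(seq_len=6, d_model=8)
print(pe.shape)   # (6, 8); each row is added to the embedding at that position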
The Heart of the System: Information Processing
Context Through Self-Attention
The true magic of Transformers lies in their attention mechanism. Unlike humans who must read text sequentially, Transformers can simultaneously analyze relationships between all words in a text. This is similar to how you might solve a complex puzzle:
- First, you look at all the pieces simultaneously
- Then, you identify which pieces are most likely to connect
- Finally, you use these relationships to build the complete picture
In language, this means the model can:
- Resolve pronouns (“She picked up her book” - who is “her” referring to?)
- Understand idiomatic expressions (“kicked the bucket” means something very different from “kicked the ball”)
- Grasp long-distance dependencies (“The keys, which I thought I had left on the kitchen counter yesterday morning, were actually in my coat pocket”)
Real-World Applications and Impact
The Transformer architecture has enabled breakthrough applications in:
- Cross-Language Communication
- Real-time translation systems
- Multilingual document processing
- Content Creation and Analysis
- Automated report generation
- Text summarization
- Content recommendations
- Specialized Industry Applications
- Legal document analysis
- Medical record processing
- Scientific literature review
The Road Ahead
As this architecture continues to evolve, we’re seeing:
- More efficient processing methods
- Better handling of specialized domains
- Improved understanding of contextual nuances
- Enhanced ability to work with multimodal inputs
The Transformer architecture represents more than just a technical advancement - it’s a fundamental shift in how machines can understand and process human language. Its impact continues to grow as researchers and developers find new ways to apply and improve upon its core principles.
The true power of Transformers lies not just in their technical capabilities, but in how they’ve opened new possibilities for human-machine interaction and understanding. As we continue to refine and build upon this architecture, we’re moving closer to systems that can truly understand and engage with human language in all its complexity and nuance.
What are large language models?
Large language models are transformers with billions to trillions of parameters, trained on massive amounts of text data. These models have several distinguishing characteristics:
- Scale: Models contain billions of parameters and are trained on hundreds of billions of tokens
- Architecture: Based on the Transformer architecture with self-attention mechanisms
- Emergent abilities: Complex capabilities that emerge with scale
- Few-shot learning: Ability to adapt to new tasks with few examples
- Definition: Large Language Models are artificial intelligence systems trained on vast amounts of text data, containing hundreds of billions of parameters. Unlike traditional AI models, they can understand and generate human-like text across a wide range of tasks and domains.
- Scale and Architecture:
- Typically contain >1B parameters (Some exceed 500B)
- Built on Transformer architecture with attention mechanisms
- Require massive computational resources for training
- Examples: GPT-3 (175B), PaLM (540B), LLaMA (65B)
- Key Capabilities:
- Natural language understanding and generation
- Task adaptation without fine-tuning
- Complex reasoning and problem solving
- Knowledge storage and retrieval
- Multi-turn conversation
Historical Evolution
1. Statistical Language Models (SLM) - 1990s
- Core Technology: Used statistical methods to predict next words based on previous context
- Key Features:
- N-gram models (bigram, trigram)
- Markov assumption for word prediction
- Used in early IR and NLP applications
- Limitations:
- Curse of dimensionality
- Data sparsity issues
- Limited context window
- Required smoothing techniques
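To make the n-gram idea above concrete, here is a tiny bigram model estimated by simple counting; the three-sentence corpus is an invented toy example:

from collections import Counter, defaultdict

# Invented mini-corpus for illustration only.
corpus = ["i love chocolate", "i love you", "you love chocolate"]

bigram_counts = defaultdict(Counter)
for sentence in corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]
    for prev, curr in zip(words, words[1:]):
        bigram_counts[prev][curr] += 1

def bigram_prob(prev, curr):
    # Maximum-likelihood estimate P(curr | prev) = count(prev, curr) / count(prev)
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][curr] / total if total else 0.0

print(bigram_prob("i", "love"))          # 1.0 in this tiny corpus
print(bigram_prob("love", "chocolate"))  # 2/3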
2. Neural Language Models (NLM) - 2013
- Core Technology: Neural networks for language modeling
- Key Advances:
- Distributed word representations
- Multi-layer perceptron and RNN architectures
- End-to-end learning
- Better feature extraction
- Impact:
- Word2vec and similar embedding models
- Improved generalization
- Reduced need for feature engineering
3. Pre-trained Language Models (PLM) - 2018
- Core Technology: Transformer-based models with pre-training
- Key Innovations:
- BERT and bidirectional context modeling
- GPT and autoregressive modeling
- Transfer learning approach
- Fine-tuning paradigm
- Benefits:
- Context-aware representations
- Better task performance
- Reduced need for task-specific data
- More efficient training
4. Large Language Models (LLM) - 2020+
- Core Technology: Scaled-up Transformer models
- Major Breakthroughs:
- Emergence of new abilities with scale
- Few-shot and zero-shot learning
- General-purpose problem solving
- Human-like interaction capabilities
- Key Examples:
- GPT-3: First demonstration of powerful in-context learning
- ChatGPT: Advanced conversational abilities
- GPT-4: Multimodal capabilities and improved reasoning
- PaLM: Enhanced multilingual and reasoning capabilities
Key Features of LLMs
Scaling Laws
- KM Scaling Law (OpenAI):
- Describes relationship between model performance (measured by cross entropy loss $L$) and three factors:
- Model size ($N$)
- Dataset size ($D$)
- Computing power ($C$)
- Mathematical formulations:
- $L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}$, where $\alpha_N \sim 0.076$, $N_c \sim 8.8 \times 10^{13}$
- $L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}$, where $\alpha_D \sim 0.095$, $D_c \sim 5.4 \times 10^{13}$
- $L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}$, where $\alpha_C \sim 0.050$, $C_c \sim 3.1 \times 10^8$
- Predicts diminishing returns as model/data/compute scale increases
- Helps optimize resource allocation for training
- Chinchilla Scaling Law (DeepMind):
- Mathematical formulation:
- $L(N,D) = E + \frac{A}{N^\alpha} + \frac{B}{D^\beta}$
- where $E = 1.69$, $A = 406.4$, $B = 410.7$, $\alpha = 0.34$, $\beta = 0.28$
- Optimal compute allocation:
- $N_{opt}(C) = G\left(\frac{C}{6}\right)^a$
- $D_{opt}(C) = G^{-1}\left(\frac{C}{6}\right)^b$
- where $a = \frac{\alpha}{\alpha+\beta}$, $b = \frac{\beta}{\alpha+\beta}$
- Suggests equal scaling of model and data size
- More efficient compute utilization than KM scaling law
- Demonstrated superior performance with smaller models trained on more data
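To make these formulas concrete, the Chinchilla estimate above can be evaluated directly in a few lines; the constants are the published values quoted above, and the model/data sizes below are arbitrary illustrative choices:

# Chinchilla loss estimate: L(N, D) = E + A / N^alpha + B / D^beta
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def chinchilla_loss(n_params, n_tokens):
    return E + A / n_params**alpha + B / n_tokens**beta

# Example: a 70B-parameter model trained on 1.4T tokens vs. a 10x smaller model.
print(chinchilla_loss(70e9, 1.4e12))
print(chinchilla_loss(7e9, 1.4e12))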
Emergent Abilities
- In-context Learning
- Definition: Ability to learn from examples in the prompt
- Characteristics:
- No parameter updates required
- Few-shot and zero-shot capabilities
- Task adaptation through demonstrations
- Emergence Point:
- GPT-3 showed first strong results
Question 3: Design a few-shot prompt that classifies a film’s topic from its title. It must correctly classify more than 5 films proposed by other students. Use ChatGPT as the test LLM.
- Instruction Following
- Definition: Ability to understand and execute natural language instructions
- Requirements:
- Instruction tuning
- Multi-task training
- Natural language task descriptions
- Step-by-step Reasoning
- Definition: Ability to break down complex problems
- Techniques:
- Chain-of-thought prompting
- Self-consistency methods
- Intermediate step generation
- Benefits:
- Better problem solving
- More reliable answers
- Transparent reasoning process
Technical Elements
Architecture
- Transformer Base
- Components:
- Multi-head attention mechanism
- Feed-forward neural networks
- Layer normalization
- Positional encoding
- Variations:
- Decoder-only (GPT-style)
- Encoder-decoder (T5-style)
- Modifications for efficiency
- Scaling Considerations
- Hardware Requirements:
- Distributed training systems
- Memory optimization
- Parallel processing
- Architecture Choices:
- Layer count
- Hidden dimension size
- Attention head configuration
Training Process
- Pre-training
- Data Preparation:
- Web text
- Books
- Code
- Scientific papers
- Objectives:
- Next token prediction
- Masked language modeling
- Multiple auxiliary tasks
- Adaptation Methods
- Instruction Tuning:
- Natural language task descriptions
- Multi-task learning
- Task generalization
- RLHF:
- Human preference learning
- Safety alignment
- Behavior optimization
Utilization Techniques
- Prompting Strategies
- Basic Prompting:
- Direct instructions
- Few-shot examples
- Zero-shot prompts
- Advanced Methods:
- Chain-of-thought
- Self-consistency
- Tool augmentation
- Application Patterns
- Task Types:
- Generation
- Classification
- Question answering
- Coding
- Integration Methods:
- API endpoints
- Model serving
- Application backends
Major Milestones
ChatGPT (2022)
- Technical Achievements
- Advanced dialogue capabilities
- Robust safety measures
- Consistent persona
- Tool integration
- Impact
- Widespread adoption
- New application paradigms
- Industry transformation
- Public AI awareness
GPT-4 (2023)
- Key Advances
- Multimodal understanding
- Enhanced reliability
- Better reasoning
- Improved safety
- Technical Features
- Predictable scaling
- Vision capabilities
- Longer context window
- Advanced system prompting
Challenges and Future Directions
Current Challenges
- Computational Resources
- Training Costs:
- Massive energy requirements
- Expensive hardware needs
- Limited accessibility
- Infrastructure Needs:
- Specialized facilities
- Cooling systems
- Power management
- Data Requirements
- Quality Issues:
- Data cleaning
- Content filtering
- Bias mitigation
- Privacy Concerns:
- Personal information
- Copyright issues
- Regulatory compliance
- Safety and Alignment
- Technical Challenges:
- Hallucination prevention
- Truthfulness
- Bias detection
- Ethical Considerations:
- Harm prevention
- Fairness
- Transparency
Future Directions
- Improved Efficiency
- Architecture Innovation:
- Sparse attention
- Parameter efficiency
- Memory optimization
- Training Methods:
- Better scaling laws
- Efficient fine-tuning
- Reduced compute needs
- Enhanced Capabilities
- Multimodal Understanding:
- Vision-language integration
- Audio processing
- Sensor data interpretation
- Reasoning Abilities:
- Logical deduction
- Mathematical problem solving
- Scientific reasoning
- Safety Development
- Alignment Techniques:
- Value learning
- Preference optimization
- Safety bounds
- Evaluation Methods:
- Robustness testing
- Safety metrics
- Bias assessment
Summary
- LLMs represent a fundamental shift in AI capabilities
- Scale and architecture drive emergent abilities
- Continuing rapid development in capabilities
- Balance between advancement and safety
- Growing impact on society and technology
- Need for responsible development and deployment
References and Further Reading
- Scaling Laws Papers
- Emergent Abilities Research
- Safety and Alignment Studies
- Technical Documentation
- Industry Reports
Paper Reading: A Survey of Large Language Models
Lecture 2: Understanding Tokenization in Language Models
Tokenization is a fundamental concept in Natural Language Processing (NLP) that involves breaking down text into smaller units called tokens. These tokens can be words, subwords, or characters, depending on the tokenization strategy used. The choice of tokenization method can significantly impact a model’s performance and its ability to handle various languages and vocabularies.
Common Tokenization Approaches
- Word-based Tokenization
- Splits text at word boundaries (usually spaces and punctuation)
- Simple and intuitive but struggles with out-of-vocabulary words
- Requires a large vocabulary to cover most words
- Note: purely word-based vocabularies are rarely used in modern models; BERT, for example, uses WordPiece subword tokenization instead
- Character-based Tokenization
- Splits text into individual characters
- Very small vocabulary size
- Can handle any word but loses word-level meaning
- Typically results in longer sequences
- Subword Tokenization
- Breaks words into meaningful subunits
- Balances vocabulary size and semantic meaning
- Better handles rare words
- Popular methods include:
- Byte-Pair Encoding (BPE)
- WordPiece
- Unigram
- SentencePiece
Let’s dive deep into one of the most widely used subword tokenization methods: Byte-Pair Encoding (BPE).
Byte-Pair Encoding (BPE) Tokenization
Reference Tutorial: Byte-Pair Encoding tokenization
Byte-Pair Encoding (BPE) was initially developed as an algorithm to compress texts, and then used by OpenAI for tokenization when pretraining the GPT model. It’s used by many Transformer models, including GPT, GPT-2, Llama1, Llama2, Llama3, RoBERTa, BART, and DeBERTa.
Training Algorithm
BPE training starts by computing the unique set of words used in the corpus (after the normalization and pre-tokenization steps are completed), then building the vocabulary by taking all the symbols used to write those words. As a very simple example, let’s say our corpus uses these five words:
"hug", "pug", "pun", "bun", "hugs"
The base vocabulary will then be ["b", "g", "h", "n", "p", "s", "u"]. For real-world cases, that base vocabulary will contain all the ASCII characters, at the very least, and probably some Unicode characters as well. If an example you are tokenizing uses a character that is not in the training corpus, that character will be converted to the unknown token. That’s one reason why lots of NLP models are very bad at analyzing content with emojis.
The GPT-2 and RoBERTa tokenizers (which are pretty similar) have a clever way to deal with this: they don’t look at words as being written with Unicode characters, but with bytes. This way the base vocabulary has a small size (256), but every character you can think of will still be included and not end up being converted to the unknown token. This trick is called byte-level BPE.
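You can observe byte-level BPE directly with the Hugging Face GPT-2 tokenizer: characters that never appear in typical training text (such as emojis) are still broken into known byte-level tokens rather than mapped to an unknown token. (Running this requires the transformers library and downloading the pretrained tokenizer.)

from transformers import AutoTokenizer

gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokens = gpt2_tokenizer.tokenize("hello 🤗")
print(tokens)  # the emoji is split into byte-level subtokens, not an unknown token
print(gpt2_tokenizer.convert_tokens_to_string(tokens))  # round-trips back to the text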
After getting this base vocabulary, we add new tokens until the desired vocabulary size is reached by learning merges, which are rules to merge two elements of the existing vocabulary together into a new one. So, at the beginning these merges will create tokens with two characters, and then, as training progresses, longer subwords.
At any step during the tokenizer training, the BPE algorithm will search for the most frequent pair of existing tokens (by “pair,” here we mean two consecutive tokens in a word). That most frequent pair is the one that will be merged, and we rinse and repeat for the next step.
Going back to our previous example, let’s assume the words had the following frequencies:
("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5)
meaning “hug” was present 10 times in the corpus, “pug” 5 times, “pun” 12 times, “bun” 4 times, and “hugs” 5 times. We start the training by splitting each word into characters (the ones that form our initial vocabulary) so we can see each word as a list of tokens:
("h" "u" "g", 10), ("p" "u" "g", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "u" "g" "s", 5)
Then we look at pairs. The pair (“h”, “u”) is present in the words “hug” and “hugs”, so 15 times total in the corpus. It’s not the most frequent pair, though: that honor belongs to (“u”, “g”), which is present in “hug”, “pug”, and “hugs”, for a grand total of 20 times in the corpus.
Thus, the first merge rule learned by the tokenizer is (“u”, “g”) -> “ug”, which means that “ug” will be added to the vocabulary, and the pair should be merged in all the words of the corpus. At the end of this stage, the vocabulary and corpus look like this:
Vocabulary: ["b", "g", "h", "n", "p", "s", "u", "ug"]
Corpus: ("h" "ug", 10), ("p" "ug", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "ug" "s", 5)
Now we have some pairs that result in a token longer than two characters: the pair (“h”, “ug”), for instance (present 15 times in the corpus). The most frequent pair at this stage is (“u”, “n”), however, present 16 times in the corpus, so the second merge rule learned is (“u”, “n”) -> “un”. Adding that to the vocabulary and merging all existing occurrences leads us to:
Vocabulary: ["b", "g", "h", "n", "p", "s", "u", "ug", "un"]
Corpus: ("h" "ug", 10), ("p" "ug", 5), ("p" "un", 12), ("b" "un", 4), ("h" "ug" "s", 5)
Now the most frequent pair is (“h”, “ug”), so we learn the merge rule (“h”, “ug”) -> “hug”, which gives us our first three-letter token. After the merge, the corpus looks like this:
Vocabulary: ["b", "g", "h", "n", "p", "s", "u", "ug", "un", "hug"]
Corpus: ("hug", 10), ("p" "ug", 5), ("p" "un", 12), ("b" "un", 4), ("hug" "s", 5)
And we continue like this until we reach the desired vocabulary size.
Tokenization Inference
Tokenization inference follows the training process closely, in the sense that new inputs are tokenized by applying the following steps:
- Splitting the words into individual characters
- Applying the merge rules learned in order on those splits
Let’s take the example we used during training, with the three merge rules learned:
("u", "g") -> "ug"
("u", "n") -> "un"
("h", "ug") -> "hug"
The word “bug” will be tokenized as ["b", "ug"]. “mug”, however, will be tokenized as ["[UNK]", "ug"] since the letter “m” was not in the base vocabulary. Likewise, the word “thug” will be tokenized as ["[UNK]", "hug"]: the letter “t” is not in the base vocabulary, and applying the merge rules results first in “u” and “g” being merged and then “h” and “ug” being merged.
Implementing BPE
Now let’s take a look at an implementation of the BPE algorithm. This won’t be an optimized version you can actually use on a big corpus; we just want to show you the code so you can understand the algorithm a little bit better.
A Colab notebook is provided so you can easily reproduce this part’s experiments: Colab BPE
Training BPE
First we need a corpus, so let’s create a simple one with a few sentences:
corpus = [
"This is the Hugging Face Course.",
"This chapter is about tokenization.",
"This section shows several tokenizer algorithms.",
"Hopefully, you will be able to understand how they are trained and generate tokens."
]
Next, we need to pre-tokenize that corpus into words. Since we are replicating a BPE tokenizer (like GPT-2), we will use the gpt2 tokenizer for the pre-tokenization:
from transformers import AutoTokenizer
# init pre tokenize function
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")
pre_tokenize_function = gpt2_tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str
# pre tokenize
pre_tokenized_corpus = [pre_tokenize_function(text) for text in corpus]
We have the output
[
[('This', (0, 4)), ('Ġis', (4, 7)), ('Ġthe', (7, 11)), ('ĠHugging', (11, 19)), ('ĠFace', (19, 24)), ('ĠCourse', (24, 31)), ('.', (31, 32))],
[('This', (0, 4)), ('Ġchapter', (4, 12)), ('Ġis', (12, 15)), ('Ġabout', (15, 21)), ('Ġtokenization', (21, 34)), ('.', (34, 35))],
[('This', (0, 4)), ('Ġsection', (4, 12)), ('Ġshows', (12, 18)), ('Ġseveral', (18, 26)), ('Ġtokenizer', (26, 36)), ('Ġalgorithms', (36, 47)), ('.', (47, 48))],
[('Hopefully', (0, 9)), (',', (9, 10)), ('Ġyou', (10, 14)), ('Ġwill', (14, 19)), ('Ġbe', (19, 22)), ('Ġable', (22, 27)), ('Ġto', (27, 30)), ('Ġunderstand', (30, 41)), ('Ġhow', (41, 45)), ('Ġthey', (45, 50)), ('Ġare', (50, 54)), ('Ġtrained', (54, 62)), ('Ġand', (62, 66)), ('Ġgenerate', (66, 75)), ('Ġtokens', (75, 82)), ('.', (82, 83))]
]
Then we compute the frequencies of each word in the corpus as we do the pre-tokenization:
from collections import defaultdict
word2count = defaultdict(int)
for split_text in pre_tokenized_corpus:
    for word, _ in split_text:
        word2count[word] += 1
The obtained word2count is as follows:
defaultdict(<class 'int'>, {'This': 3, 'Ġis': 2, 'Ġthe': 1, 'ĠHugging': 1, 'ĠFace': 1, 'ĠCourse': 1, '.': 4, 'Ġchapter': 1, 'Ġabout': 1, 'Ġtokenization': 1, 'Ġsection': 1, 'Ġshows': 1, 'Ġseveral': 1, 'Ġtokenizer': 1, 'Ġalgorithms': 1, 'Hopefully': 1, ',': 1, 'Ġyou': 1, 'Ġwill': 1, 'Ġbe': 1, 'Ġable': 1, 'Ġto': 1, 'Ġunderstand': 1, 'Ġhow': 1, 'Ġthey': 1, 'Ġare': 1, 'Ġtrained': 1, 'Ġand': 1, 'Ġgenerate': 1, 'Ġtokens': 1})
The next step is to compute the base vocabulary, formed by all the characters used in the corpus:
vocab_set = set()
for word in word2count:
    vocab_set.update(list(word))
vocabs = list(vocab_set)
The obtained base vocabulary is as follows:
['i', 't', 'p', 'o', 'r', 'm', 'e', ',', 'y', 'v', 'Ġ', 'F', 'a', 'C', 'H', '.', 'f', 'l', 'u', 'c', 'T', 'k', 'h', 'z', 'd', 'g', 'w', 'n', 's', 'b']
We now need to split each word into individual characters, to be able to start training:
word2splits = {word: [c for c in word] for word in word2count}
The output is:
'This': ['T', 'h', 'i', 's'],
'Ġis': ['Ġ', 'i', 's'],
'Ġthe': ['Ġ', 't', 'h', 'e'],
...
'Ġand': ['Ġ', 'a', 'n', 'd'],
'Ġgenerate': ['Ġ', 'g', 'e', 'n', 'e', 'r', 'a', 't', 'e'],
'Ġtokens': ['Ġ', 't', 'o', 'k', 'e', 'n', 's']
Now that we are ready for training, let’s write a function that computes the frequency of each pair. We’ll need to use this at each step of the training:
def _compute_pair2score(word2splits, word2count):
    pair2count = defaultdict(int)
    for word, word_count in word2count.items():
        split = word2splits[word]
        if len(split) == 1:
            continue
        for i in range(len(split) - 1):
            pair = (split[i], split[i + 1])
            pair2count[pair] += word_count
    return pair2count
The output is
defaultdict(<class 'int'>, {('T', 'h'): 3, ('h', 'i'): 3, ('i', 's'): 5, ('Ġ', 'i'): 2, ('Ġ', 't'): 7, ('t', 'h'): 3, ..., ('n', 's'): 1})
Now, finding the most frequent pair only takes a quick loop:
def _compute_most_score_pair(pair2count):
    best_pair = None
    max_freq = None
    for pair, freq in pair2count.items():
        if max_freq is None or max_freq < freq:
            best_pair = pair
            max_freq = freq
    return best_pair
After counting, the current pair with the highest frequency is: (‘Ġ’, ‘t’), occurring 7 times. We merge (‘Ġ’, ‘t’) into a single token and add it to the vocabulary. Simultaneously, we add the merge rule (‘Ġ’, ‘t’) to our list of merge rules.
merge_rules = []
pair2score = _compute_pair2score(word2splits, word2count)
best_pair = _compute_most_score_pair(pair2score)
vocabs.append(best_pair[0] + best_pair[1])
merge_rules.append(best_pair)
Now the vocabulary is
['i', 't', 'p', 'o', 'r', 'm', 'e', ',', 'y', 'v', 'Ġ', 'F', 'a', 'C', 'H', '.', 'f', 'l', 'u', 'c', 'T', 'k', 'h', 'z', 'd', 'g', 'w', 'n', 's', 'b',
'Ġt']
Based on the updated vocabulary, we re-split word2count. For implementation, we can directly apply the new merge rule (‘Ġ’, ‘t’) to the existing word2split. This is more efficient than performing a complete re-split, as we only need to apply the latest merge rule to the existing splits.
def _merge_pair(a, b, word2splits):
    new_word2splits = dict()
    for word, split in word2splits.items():
        if len(split) == 1:
            new_word2splits[word] = split
            continue
        i = 0
        while i < len(split) - 1:
            if split[i] == a and split[i + 1] == b:
                split = split[:i] + [a + b] + split[i + 2:]
            else:
                i += 1
        new_word2splits[word] = split
    return new_word2splits
The new word2split is
{'This': ['T', 'h', 'i', 's'],
'Ġis': ['Ġ', 'i', 's'],
'Ġthe': ['Ġt', 'h', 'e'],
'ĠHugging': ['Ġ', 'H', 'u', 'g', 'g', 'i', 'n', 'g'],
...
'Ġtokens': ['Ġt', 'o', 'k', 'e', 'n', 's']}
As we can see, the new word2split now contains the newly merged token “Ġt”. We repeat this iterative process until the vocabulary size reaches our predefined target size.
vocab_size = 50
while len(vocabs) < vocab_size:
    pair2score = _compute_pair2score(word2splits, word2count)
    best_pair = _compute_most_score_pair(pair2score)
    vocabs.append(best_pair[0] + best_pair[1])
    merge_rules.append(best_pair)
    word2splits = _merge_pair(best_pair[0], best_pair[1], word2splits)
Let’s say our target vocabulary size is 50. After the above iterations, we obtain the following vocabulary and merge rules:
vocabs = ['i', 't', 'p', 'o', 'r', 'm', 'e', ',', 'y', 'v', 'Ġ', 'F', 'a', 'C', 'H', '.', 'f', 'l', 'u', 'c', 'T', 'k', 'h', 'z', 'd', 'g', 'w', 'n', 's', 'b', 'Ġt', 'is', 'er', 'Ġa', 'Ġto', 'en', 'Th', 'This', 'ou', 'se', 'Ġtok', 'Ġtoken', 'nd', 'Ġis', 'Ġth', 'Ġthe', 'in', 'Ġab', 'Ġtokeni', 'Ġtokeniz']
merge_rules = [('Ġ', 't'), ('i', 's'), ('e', 'r'), ('Ġ', 'a'), ('Ġt', 'o'), ('e', 'n'), ('T', 'h'), ('Th', 'is'), ('o', 'u'), ('s', 'e'), ('Ġto', 'k'), ('Ġtok', 'en'), ('n', 'd'), ('Ġ', 'is'), ('Ġt', 'h'), ('Ġth', 'e'), ('i', 'n'), ('Ġa', 'b'), ('Ġtoken', 'i'), ('Ġtokeni', 'z')]
Thus, we have completed the training of our BPE tokenizer based on the given corpus. This trained tokenizer, consisting of the vocabulary and merge rules, can now be used to tokenize new input text using the learned subword patterns.
BPE’s Inference
During the inference phase, given a sentence, we need to split it into a sequence of tokens. The implementation involves two main steps:
First, we pre-tokenize the sentence and split it into character-level sequences
Then, we apply the merge rules sequentially to form larger tokens
def tokenize(text):
    # pre tokenize
    words = [word for word, _ in pre_tokenize_function(text)]
    # split into char level
    splits = [[c for c in word] for word in words]
    # apply merge rules
    for merge_rule in merge_rules:
        for index, split in enumerate(splits):
            i = 0
            while i < len(split) - 1:
                if split[i] == merge_rule[0] and split[i + 1] == merge_rule[1]:
                    split = split[:i] + ["".join(merge_rule)] + split[i + 2:]
                else:
                    i += 1
            splits[index] = split
    return sum(splits, [])
For example:
>>> tokenize("This is not a token.")
['This', 'Ġis', 'Ġ', 'n', 'o', 't', 'Ġa', 'Ġtoken', '.']
Question 1: Given the tokenizer introduced in Lecture 2, what is the tokenization result of the string “This is a token.”
Lecture 3: Transformer Architecture
Table of Contents
- Introduction
- Background
- General Transformer Architecture
- Attention Mechanism
- Position-Wise Feed-Forward Networks
- Training and Optimization
- Conclusion
Introduction
The Transformer model is a powerful deep learning architecture that has achieved groundbreaking results in various fields—most notably in Natural Language Processing (NLP), computer vision, and speech recognition—since it was introduced in Attention Is All You Need (Vaswani et al., 2017). Its core component is the self-attention mechanism, which efficiently handles long-range dependencies in sequences while allowing for extensive parallelization. Many subsequent models, such as BERT, GPT, Vision Transformer (ViT), and multimodal Transformers, are built upon this foundational structure.
Background
Before the Transformer, sequential modeling primarily relied on Recurrent Neural Networks (RNNs) or Convolutional Neural Networks (CNNs). These networks often struggled with capturing long-distance dependencies, parallelization, and computational efficiency. In contrast, the self-attention mechanism of Transformers captures global dependencies across input and output sequences simultaneously and offers excellent parallelization capabilities.
General Transformer Architecture
Modern Transformer architectures typically fall into one of three categories: encoder-decoder, encoder-only, or decoder-only, depending on the application scenario.
Encoder-Decoder Transformers
An encoder-decoder Transformer first encodes the input sequence into a contextual representation, then the decoder uses this encoded information to generate the target sequence. Typical applications include machine translation and text summarization. Models like T5 and MarianMT are representative of this structure.
Encoder-Only Transformers
Encoder-only models focus on learning bidirectional contextual representations of input sequences for classification, retrieval, and language understanding tasks. BERT and its variants (RoBERTa, ALBERT, etc.) belong to this category.
Decoder-Only Transformers
Decoder-only models generate outputs in an autoregressive manner, making them well-suited for text generation, dialogue systems, code generation, and more. GPT series, LLaMA, and PaLM are examples of this type.
Attention Mechanism
The core of the Transformer lies in its attention mechanism, which allows the model to focus on the most relevant parts of the input sequence given a query. Below, we detail the Scaled Dot-Product Attention and the Multi-Head Attention mechanisms.
What is Attention?
The attention mechanism describes a family of neural network layers that has attracted a lot of interest in recent years, especially for sequence tasks. There are many possible definitions of “attention” in the literature, but the one we will use here is the following: the attention mechanism describes a weighted average of (sequence) elements, with the weights dynamically computed based on an input query and the elements’ keys. So what does this exactly mean? The goal is to take an average over the features of multiple elements. However, instead of weighting each element equally, we want to weight them depending on their actual values. In other words, we want to dynamically decide which inputs we want to “attend” to more than others. In particular, an attention mechanism usually has four parts we need to specify:
- Query: The query is a feature vector that describes what we are looking for in the sequence, i.e. what would we maybe want to pay attention to.
- Keys: For each input element, we have a key which is again a feature vector. This feature vector roughly describes what the element is “offering”, or when it might be important. The keys should be designed such that we can identify the elements we want to pay attention to based on the query.
- Values: For each input element, we also have a value vector. This feature vector is the one we want to average over.
- Score function: To rate which elements we want to pay attention to, we need to specify a score function $f_{attn}$. The score function takes the query and a key as input and outputs the score/attention weight of the query-key pair. It is usually implemented by a simple similarity metric like a dot product, or a small MLP.
The weights of the average are calculated by a softmax over all score function outputs. Hence, we assign those value vectors a higher weight whose corresponding key is most similar to the query. If we try to describe it with pseudo-math, we can write:
\[\alpha_i = \frac{\exp\left(f_{attn}\left(\text{key}_i, \text{query}\right)\right)}{\sum_j \exp\left(f_{attn}\left(\text{key}_j, \text{query}\right)\right)}, \hspace{5mm} \text{out} = \sum_i \alpha_i \cdot \text{value}_i\]Visually, we can show the attention over a sequence of words as follows:
Attention Example
For every word, we have one key and one value vector. The query is compared to all keys with a score function (in this case the dot product) to determine the weights. The softmax is not visualized for simplicity. Finally, the value vectors of all words are averaged using the attention weights.
Most attention mechanisms differ in terms of what queries they use, how the key and value vectors are defined, and what score function is used. The attention applied inside the Transformer architecture is called self-attention. In self-attention, each sequence element provides a key, value, and query. For each element, we perform an attention layer where, based on its query, we check the similarity to all sequence elements’ keys and return a different, averaged value vector for each element. We will now go into a bit more detail by first looking at the specific implementation of the attention mechanism used in the Transformer: the scaled dot-product attention.
Scaled Dot-Product Attention
Given a query matrix $Q$, key matrix $K$, and value matrix $V$, the attention formula is:
\[\text{Attention}(Q, K, V) = \text{softmax}\Bigl( \frac{QK^T}{\sqrt{d_k}} \Bigr)V\]where $d_k$ is the dimensionality of the key vectors (often the same as the query dimensionality). Every row of $Q$ corresponds to a token’s embedding.
Example 1: Detailed Numerical Computation
Suppose we have the following matrices (small dimensions chosen for illustrative purposes):
\[Q = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix}, \quad K = \begin{bmatrix} 1 & 1 \\ 0 & 1 \\ 1 & 0 \end{bmatrix}, \quad V = \begin{bmatrix} 0 & 2 \\ 1 & 1 \\ 2 & 0 \end{bmatrix}\]
- Compute $QK^T$
According to the example setup:
\[QK^T = \begin{bmatrix} 1 & 0 & 1 \\ 1 & 1 & 0 \\ 2 & 1 & 1 \end{bmatrix}\]
- Scale by $\sqrt{d_k}$
Here, $d_k = 2$, so $\sqrt{2} \approx 1.41$:
\[\frac{QK^T}{\sqrt{2}} \approx \begin{bmatrix} 0.71 & 0 & 0.71 \\ 0.71 & 0.71 & 0 \\ 1.41 & 0.71 & 0.71 \end{bmatrix}\]
- Apply softmax row-wise
The softmax of a vector $x$ is given by \(\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}.\) Let’s calculate this row by row:
- Row 1: $[0.71, 0, 0.71]$
- Calculate exponentials:
- $e^{0.71} \approx 2.034$ (for the 1st and 3rd elements)
- $e^{0} = 1$ (for the 2nd element)
- Sum of exponentials: $2.034 + 1 + 2.034 \approx 5.068$
- Softmax values:
- $\frac{2.034}{5.068} \approx 0.401$
- $\frac{1}{5.068} \approx 0.197$
- $\frac{2.034}{5.068} \approx 0.401$
- Final result: $[0.401, 0.197, 0.401]$ ≈ $[0.40, 0.20, 0.40]$
- Row 2: $[0.71, 0.71, 0]$
- Calculate exponentials:
- $e^{0.71} \approx 2.034$ (for the 1st and 2nd elements)
- $e^{0} = 1$ (for the 3rd element)
- Sum of exponentials: $2.034 + 2.034 + 1 \approx 5.068$
- Softmax values:
- $\frac{2.034}{5.068} \approx 0.401$
- $\frac{2.034}{5.068} \approx 0.401$
- $\frac{1}{5.068} \approx 0.197$
- Final result: $[0.401, 0.401, 0.197]$ ≈ $[0.40, 0.40, 0.20]$
- Row 3: $[1.41, 0.71, 0.71]$
- Calculate exponentials:
- $e^{1.41} \approx 4.096$
- $e^{0.71} \approx 2.034$ (for the 2nd and 3rd elements)
- Sum of exponentials: $4.096 + 2.034 + 2.034 \approx 8.164$
- Softmax values:
- $\frac{4.096}{8.164} \approx 0.501$
- $\frac{2.034}{8.164} \approx 0.249$
- $\frac{2.034}{8.164} \approx 0.249$
- Final result: $[0.501, 0.249, 0.249]$ ≈ $[0.50, 0.25, 0.25]$
The final softmax matrix $\alpha$ (with slight rounding) is: \(\alpha = \begin{bmatrix} 0.40 & 0.20 & 0.40 \\ 0.40 & 0.40 & 0.20 \\ 0.50 & 0.25 & 0.25 \end{bmatrix}\)
Key observations about the softmax results:
- All output values are between 0 and 1
- Each row sums to 1
- Equal input values (Row 1) result in equal output probabilities
- Larger input values receive larger output probabilities (middle values in Rows 2 and 3)
- Multiply by $V$
\[\text{Attention}(Q, K, V) = \alpha V.\]
- Row 1 weights $[0.40, 0.20, 0.40]$ applied to $V$:
\[0.40 \times [0,2] + 0.20 \times [1,1] + 0.40 \times [2,0] = [0 + 0.20 + 0.80,\; 0.80 + 0.20 + 0] = [1.00,\; 1.00].\]
- Row 2 weights $[0.40, 0.40, 0.20]$:
\[0.40 \times [0,2] + 0.40 \times [1,1] + 0.20 \times [2,0] = [0,\;0.80] + [0.40,\;0.40] + [0.40,\;0] = [0.80,\;1.20].\]
- Row 3 weights $[0.50, 0.25, 0.25]$:
\[0.50 \times [0,2] + 0.25 \times [1,1] + 0.25 \times [2,0] = [0,\;1.0] + [0.25,\;0.25] + [0.50,\;0] = [0.75,\;1.25].\]
Final Output (rounded values):
\[\begin{bmatrix} 1.00 & 1.00 \\ 0.80 & 1.20 \\ 0.75 & 1.25 \end{bmatrix}\]
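A short NumPy sketch reproduces this whole computation; the matrices are exactly those of Example 1, and the helper below is a straightforward implementation of the formula above:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # QK^T / sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V

Q = np.array([[1., 0.], [0., 1.], [1., 1.]])
K = np.array([[1., 1.], [0., 1.], [1., 0.]])
V = np.array([[0., 2.], [1., 1.], [2., 0.]])

print(scaled_dot_product_attention(Q, K, V).round(2))
# approximately [[1.00, 1.00], [0.80, 1.20], [0.75, 1.25]]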
Example 2: Another Small-Dimension Example
Let us consider an even smaller example:
\[Q = \begin{bmatrix} 1 & 1 \end{bmatrix}, \quad K = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad V = \begin{bmatrix} 2 & 3 \\ 4 & 1 \end{bmatrix}.\]Here, $Q$ is $1 \times 2$, $K$ is $2 \times 2$, and $V$ is $2 \times 2$.
- Compute $QK^T$
Since $K$ here is the identity matrix (and hence symmetric), $K^T = K$:
\[QK^T = QK = \begin{bmatrix} 1 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 1 \end{bmatrix}.\]
- Scale by $\sqrt{d_k}$
$d_k = 2$, so $\frac{1}{\sqrt{2}} \approx \frac{1}{1.41} \approx 0.71$:
\[\frac{[1,\;1]}{1.41} \approx [0.71,\;0.71].\]
- Softmax
$[0.71, 0.71]$ has equal values, so the softmax is $[0.5, 0.5]$.
- Multiply by $V$
\[[0.5,\;0.5] \begin{bmatrix} 2 & 3 \\ 4 & 1 \end{bmatrix} = 0.5 \times [2,3] + 0.5 \times [4,1] = [1,1.5] + [2,0.5] = [3,2].\]
Final Output: $[3,\;2]$.
Example 3: Larger Q and K with V as a Column Vector
Let us consider an example where $Q$ and $K$ have a larger dimension, but $V$ has only one column:
\[Q = \begin{bmatrix} 1 & 1 & 1 & 1 \end{bmatrix}, \quad K = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}, \quad V = \begin{bmatrix} 2 \\ 4 \\ 6 \\ 8 \end{bmatrix}.\]
In-Course Question: Compute the attention output for the above $Q$, $K$, and $V$.
Multi-Head Attention
Multi-head attention projects $Q, K, V$ into multiple subspaces and performs several parallel scaled dot-product attentions (referred to as “heads”). These are concatenated, then transformed via a final linear projection:
\[\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O,\]where each head is computed as:
\[\text{head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V).\]Below are multiple examples illustrating how multi-head attention calculations are performed, with increasingly detailed numeric demonstrations.
Example 1: Two-Head Attention Computation (Conceptual Illustration)
Let us assume we have a 2-head setup ($h = 2$), each head operating on half the dimension of $Q, K, V$. For instance, if the original dimension is 4, each head dimension could be 2.
-
Step 1: Linear transformations and splitting
\[Q W^Q \rightarrow [Q_1,\ Q_2], \quad K W^K \rightarrow [K_1,\ K_2], \quad V W^V \rightarrow [V_1,\ V_2].\]Here, $[Q_1,\ Q_2]$ means we split the transformed $Q$ along its last dimension into two sub-matrices (head 1 and head 2).
-
Step 2: Compute scaled dot-product attention for each head
\[\text{head}_1 = \text{Attention}(Q_1, K_1, V_1), \quad \text{head}_2 = \text{Attention}(Q_2, K_2, V_2).\]Suppose after computation:
\[\text{head}_1 = \begin{bmatrix} h_{11} & h_{12} \\ h_{21} & h_{22} \\ h_{31} & h_{32} \end{bmatrix}, \quad \text{head}_2 = \begin{bmatrix} g_{11} & g_{12} \\ g_{21} & g_{22} \\ g_{31} & g_{32} \end{bmatrix}.\] -
Step 3: Concatenate and apply final linear transform
\[\text{Concat}(\text{head}_1, \text{head}_2) = \begin{bmatrix} h_{11} & h_{12} & g_{11} & g_{12} \\ h_{21} & h_{22} & g_{21} & g_{22} \\ h_{31} & h_{32} & g_{31} & g_{32} \end{bmatrix}.\]
Concatenating the heads yields a $3 \times 4$ matrix (if each head is $3 \times 2$). We then multiply by $W^O$ (e.g., a $4 \times 4$ matrix) to get the final multi-head attention output.
Note: Actual numeric computation requires specifying all projection matrices $W_i^Q, W_i^K, W_i^V, W^O$ and the input $Q, K, V$. Below, we provide more concrete numeric examples.
Example 2: Two-Head Attention with Full Numerical Details
In this example, we will provide explicit numbers for a 2-head setup. We will assume each of $Q, K, V$ has shape $(3,4)$: there are 3 “tokens” (or time steps), each with a hidden size of 4. We split that hidden size into 2 heads, each with size 2.
Step 0: Define inputs and parameters
Let
\[Q = \begin{bmatrix} 1 & 2 & 1 & 0\\ 0 & 1 & 1 & 1\\ 1 & 0 & 2 & 1 \end{bmatrix}, \quad K = \begin{bmatrix} 1 & 1 & 0 & 2\\ 2 & 1 & 1 & 0\\ 0 & 1 & 1 & 1 \end{bmatrix}, \quad V = \begin{bmatrix} 1 & 1 & 0 & 0\\ 0 & 2 & 1 & 1\\ 1 & 1 & 2 & 2 \end{bmatrix}.\]
We also define the projection matrices for the two heads. For simplicity, we assume each projection matrix has shape $(4,2)$ (since we project dimension 4 down to dimension 2), and $W^O$ will have shape $(4,4)$ to map the concatenated result $(3,4)$ back to $(3,4)$.
Let’s define:
\[W^Q_1 = \begin{bmatrix} 1 & 0\\ 0 & 1\\ 1 & 0\\ 0 & 1 \end{bmatrix}, \quad W^K_1 = \begin{bmatrix} 1 & 0\\ 0 & 1\\ 0 & 1\\ 1 & 0 \end{bmatrix}, \quad W^V_1 = \begin{bmatrix} 1 & 0\\ 0 & 1\\ 1 & 0\\ 0 & 1 \end{bmatrix},\] \[W^Q_2 = \begin{bmatrix} 0 & 1\\ 1 & 0\\ 1 & 1\\ 0 & 0 \end{bmatrix}, \quad W^K_2 = \begin{bmatrix} 0 & 1\\ 1 & 0\\ 1 & 0\\ 1 & 1 \end{bmatrix}, \quad W^V_2 = \begin{bmatrix} 0 & 1\\ 1 & 1\\ 0 & 1\\ 1 & 0 \end{bmatrix}.\]And let:
\[W^O = \begin{bmatrix} 1 & 0 & 0 & 1\\ 0 & 1 & 1 & 0\\ 1 & 0 & 1 & 0\\ 0 & 1 & 0 & 1 \end{bmatrix}.\]We will go step by step.
Step 1: Compute $Q_1, K_1, V_1$ for Head 1
\[Q_1 = Q \times W^Q_1,\quad K_1 = K \times W^K_1,\quad V_1 = V \times W^V_1.\]-
$Q_1 = Q W^Q_1$.
\[Q = \begin{bmatrix} 1 & 2 & 1 & 0\\ 0 & 1 & 1 & 1\\ 1 & 0 & 2 & 1 \end{bmatrix}, \quad W^Q_1 = \begin{bmatrix} 1 & 0\\ 0 & 1\\ 1 & 0\\ 0 & 1 \end{bmatrix}.\]
Each row of $Q$ is multiplied by $W^Q_1$:-
Row 1 of $Q$: $[1,2,1,0]$
\[[1,2,1,0] \begin{bmatrix} 1 & 0\\ 0 & 1\\ 1 & 0\\ 0 & 1 \end{bmatrix} = [1*1 + 2*0 + 1*1 + 0*0,\; 1*0 + 2*1 + 1*0 + 0*1] = [2,\;2].\] -
Row 2: $[0,1,1,1]$
\[[0,1,1,1] \begin{bmatrix} 1 & 0\\ 0 & 1\\ 1 & 0\\ 0 & 1 \end{bmatrix} = [1,\;2].\] -
Row 3: $[1,0,2,1]$
\[[1,0,2,1] \begin{bmatrix} 1 & 0\\ 0 & 1\\ 1 & 0\\ 0 & 1 \end{bmatrix} = [3,\;1].\]
Thus,
\[Q_1 = \begin{bmatrix} 2 & 2\\ 1 & 2\\ 3 & 1 \end{bmatrix}.\] -
-
$K_1 = K W^K_1$.
\[K = \begin{bmatrix} 1 & 1 & 0 & 2\\ 2 & 1 & 1 & 0\\ 0 & 1 & 1 & 1 \end{bmatrix},\quad W^K_1 = \begin{bmatrix} 1 & 0\\ 0 & 1\\ 0 & 1\\ 1 & 0 \end{bmatrix}.\]-
Row 1: $[1,1,0,2]$
\[[1,1,0,2] \times \begin{bmatrix} 1 & 0\\ 0 & 1\\ 0 & 1\\ 1 & 0 \end{bmatrix} = [3,\;1].\] -
Row 2: $[2,1,1,0]$
\[[2,1,1,0] \times \begin{bmatrix} 1 & 0\\ 0 & 1\\ 0 & 1\\ 1 & 0 \end{bmatrix} = [2,\;2].\] -
Row 3: $[0,1,1,1]$
\[[0,1,1,1] \times \begin{bmatrix} 1 & 0\\ 0 & 1\\ 0 & 1\\ 1 & 0 \end{bmatrix} = [1,\;2].\]
So,
\[K_1 = \begin{bmatrix} 3 & 1\\ 2 & 2\\ 1 & 2 \end{bmatrix}.\] -
-
$V_1 = V W^V_1$.
\[V = \begin{bmatrix} 1 & 1 & 0 & 0\\ 0 & 2 & 1 & 1\\ 1 & 1 & 2 & 2 \end{bmatrix},\quad W^V_1 = \begin{bmatrix} 1 & 0\\ 0 & 1\\ 1 & 0\\ 0 & 1 \end{bmatrix}.\]-
Row 1: $[1,1,0,0]$
\[[1,1,0,0] \times \begin{bmatrix} 1 & 0\\ 0 & 1\\ 1 & 0\\ 0 & 1 \end{bmatrix} = [1,\;1].\] -
Row 2: $[0,2,1,1]$
\[[0,2,1,1] \times \begin{bmatrix} 1 & 0\\ 0 & 1\\ 1 & 0\\ 0 & 1 \end{bmatrix} = [1,\;3].\] -
Row 3: $[1,1,2,2]$
\[[1,1,2,2] \times \begin{bmatrix} 1 & 0\\ 0 & 1\\ 1 & 0\\ 0 & 1 \end{bmatrix} = [3,\;3].\]
Therefore,
\[V_1 = \begin{bmatrix} 1 & 1\\ 1 & 3\\ 3 & 3 \end{bmatrix}.\] -
Step 2: Compute $Q_2, K_2, V_2$ for Head 2
\[Q_2 = Q \times W^Q_2,\quad K_2 = K \times W^K_2,\quad V_2 = V \times W^V_2.\]-
$Q_2 = Q W^Q_2$:
\[W^Q_2 = \begin{bmatrix} 0 & 1\\ 1 & 0\\ 1 & 1\\ 0 & 0 \end{bmatrix}.\]-
Row 1 $[1,2,1,0]$:
\[[1,2,1,0] \times \begin{bmatrix} 0 & 1\\ 1 & 0\\ 1 & 1\\ 0 & 0 \end{bmatrix} = [3,\;2].\] -
Row 2 $[0,1,1,1]$:
\[[0,1,1,1] \times \begin{bmatrix} 0 & 1\\ 1 & 0\\ 1 & 1\\ 0 & 0 \end{bmatrix} = [2,\;1].\] -
Row 3 $[1,0,2,1]$:
\[[1,0,2,1] \times \begin{bmatrix} 0 & 1\\ 1 & 0\\ 1 & 1\\ 0 & 0 \end{bmatrix} = [2,\;3].\]
Hence,
\[Q_2 = \begin{bmatrix} 3 & 2\\ 2 & 1\\ 2 & 3 \end{bmatrix}.\] -
-
$K_2 = K W^K_2$:
\[W^K_2 = \begin{bmatrix} 0 & 1\\ 1 & 0\\ 1 & 0\\ 1 & 1 \end{bmatrix}.\]-
Row 1 $[1,1,0,2]$:
\[[1,1,0,2] \times \begin{bmatrix} 0 & 1\\ 1 & 0\\ 1 & 0\\ 1 & 1 \end{bmatrix} = [3,\;3].\] -
Row 2 $[2,1,1,0]$:
\[[2,1,1,0] \times \begin{bmatrix} 0 & 1\\ 1 & 0\\ 1 & 0\\ 1 & 1 \end{bmatrix} = [2,\;2].\] -
Row 3 $[0,1,1,1]$:
\[[0,1,1,1] \times \begin{bmatrix} 0 & 1\\ 1 & 0\\ 1 & 0\\ 1 & 1 \end{bmatrix} = [3,\;1].\]
So,
\[K_2 = \begin{bmatrix} 3 & 3\\ 2 & 2\\ 3 & 1 \end{bmatrix}.\] -
-
$V_2 = V W^V_2$:
\[W^V_2 = \begin{bmatrix} 0 & 1\\ 1 & 1\\ 0 & 1\\ 1 & 0 \end{bmatrix}.\]-
Row 1 $[1,1,0,0]$:
\[[1,1,0,0] \times \begin{bmatrix} 0 & 1\\ 1 & 1\\ 0 & 1\\ 1 & 0 \end{bmatrix} = [1,\;2].\] -
Row 2 $[0,2,1,1]$:
\[[0,2,1,1] \times \begin{bmatrix} 0 & 1\\ 1 & 1\\ 0 & 1\\ 1 & 0 \end{bmatrix} = [3,\;3].\] -
Row 3 $[1,1,2,2]$:
\[[1,1,2,2] \times \begin{bmatrix} 0 & 1\\ 1 & 1\\ 0 & 1\\ 1 & 0 \end{bmatrix} = [3,\;4].\]
Thus,
\[V_2 = \begin{bmatrix} 1 & 2\\ 3 & 3\\ 3 & 4 \end{bmatrix}.\] -
Step 3: Compute each head’s Scaled Dot-Product Attention
We now have for head 1:
\[Q_1 = \begin{bmatrix}2 & 2\\1 & 2\\3 & 1\end{bmatrix},\; K_1 = \begin{bmatrix}3 & 1\\2 & 2\\1 & 2\end{bmatrix},\; V_1 = \begin{bmatrix}1 & 1\\1 & 3\\3 & 3\end{bmatrix}.\]Similarly for head 2:
\[Q_2 = \begin{bmatrix}3 & 2\\2 & 1\\2 & 3\end{bmatrix},\; K_2 = \begin{bmatrix}3 & 3\\2 & 2\\3 & 1\end{bmatrix},\; V_2 = \begin{bmatrix}1 & 2\\3 & 3\\3 & 4\end{bmatrix}.\]Assume each key vector dimension is $d_k = 2$. Hence the scale is $\frac{1}{\sqrt{2}} \approx 0.707$.
- Head 1:
-
$Q_1 K_1^T$.
$K_1^T$ is
\[\begin{bmatrix} 3 & 2 & 1\\ 1 & 2 & 2 \end{bmatrix}.\] \[Q_1 K_1^T = \begin{bmatrix} 2 & 2\\ 1 & 2\\ 3 & 1 \end{bmatrix} \times \begin{bmatrix} 3 & 2 & 1\\ 1 & 2 & 2 \end{bmatrix} = \begin{bmatrix} 8 & 8 & 6\\ 5 & 6 & 5\\ 10 & 8 & 5 \end{bmatrix}.\] -
Scale: $\frac{Q_1 K_1^T}{\sqrt{2}}$:
\[\approx \begin{bmatrix} 5.66 & 5.66 & 4.24\\ 3.54 & 4.24 & 3.54\\ 7.07 & 5.66 & 3.54 \end{bmatrix}.\] -
Apply softmax row-wise (approx results after exponentiation and normalization):
\[\alpha_1 \approx \begin{bmatrix} 0.45 & 0.45 & 0.11\\ 0.25 & 0.50 & 0.25\\ 0.79 & 0.19 & 0.02 \end{bmatrix}.\] -
Multiply by $V_1$:
\[\text{head}_1 = \alpha_1 \times V_1.\]Approximating:
\[\text{head}_1 \approx \begin{bmatrix} 1.23 & 2.13\\ 1.50 & 2.50\\ 1.04 & 1.42 \end{bmatrix}.\]
-
- Head 2:
-
$Q_2 K_2^T$.
\[Q_2 = \begin{bmatrix} 3 & 2\\ 2 & 1\\ 2 & 3 \end{bmatrix},\quad K_2 = \begin{bmatrix} 3 & 3\\ 2 & 2\\ 3 & 1 \end{bmatrix}.\]Then
\[K_2^T = \begin{bmatrix} 3 & 2 & 3\\ 3 & 2 & 1 \end{bmatrix}.\] \[Q_2 K_2^T = \begin{bmatrix} 15 & 10 & 11\\ 9 & 6 & 7\\ 15 & 10 & 9 \end{bmatrix}.\] -
Scale: multiply by $1/\sqrt{2} \approx 0.707$:
\[\approx \begin{bmatrix} 10.61 & 7.07 & 7.78\\ 6.36 & 4.24 & 4.95\\ 10.61 & 7.07 & 6.36 \end{bmatrix}.\] -
Softmax row-wise (approx):
\[\alpha_2 \approx \begin{bmatrix} 0.92 & 0.03 & 0.05\\ 0.73 & 0.09 & 0.18\\ 0.96 & 0.03 & 0.01 \end{bmatrix}.\] -
Multiply by $V_2$:
\[V_2 = \begin{bmatrix} 1 & 2\\ 3 & 3\\ 3 & 4 \end{bmatrix}.\]Approximating:
\[\text{head}_2 \approx \begin{bmatrix} 1.16 & 2.13\\ 1.53 & 2.45\\ 1.09 & 2.06 \end{bmatrix}.\]
-
Step 4: Concatenate and apply $W^O$
We now concatenate $\text{head}_1$ and $\text{head}_2$ horizontally to form a $(3 \times 4)$ matrix:
\[\text{Concat}(\text{head}_1, \text{head}_2) \approx \begin{bmatrix} 1.23 & 2.13 & 1.16 & 2.13\\ 1.50 & 2.50 & 1.53 & 2.45\\ 1.04 & 1.42 & 1.09 & 2.06 \end{bmatrix}.\]
Finally, multiply by $W^O$ $(4 \times 4)$:
\[\text{Output} = (\text{Concat}(\text{head}_1, \text{head}_2)) \times W^O.\]Where
\[W^O = \begin{bmatrix} 1 & 0 & 0 & 1\\ 0 & 1 & 1 & 0\\ 1 & 0 & 1 & 0\\ 0 & 1 & 0 & 1 \end{bmatrix}.\]We can do a row-by-row multiplication to get the final multi-head attention output (details omitted for brevity).
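For reference, the entire two-head example can be reproduced with a few lines of NumPy; all matrices are the ones defined in Step 0, and `scaled_dot_product_attention` is the same helper sketched earlier for single-head attention:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

Q = np.array([[1., 2., 1., 0.], [0., 1., 1., 1.], [1., 0., 2., 1.]])
K = np.array([[1., 1., 0., 2.], [2., 1., 1., 0.], [0., 1., 1., 1.]])
V = np.array([[1., 1., 0., 0.], [0., 2., 1., 1.], [1., 1., 2., 2.]])

WQ1 = np.array([[1., 0.], [0., 1.], [1., 0.], [0., 1.]])
WK1 = np.array([[1., 0.], [0., 1.], [0., 1.], [1., 0.]])
WV1 = np.array([[1., 0.], [0., 1.], [1., 0.], [0., 1.]])
WQ2 = np.array([[0., 1.], [1., 0.], [1., 1.], [0., 0.]])
WK2 = np.array([[0., 1.], [1., 0.], [1., 0.], [1., 1.]])
WV2 = np.array([[0., 1.], [1., 1.], [0., 1.], [1., 0.]])
WO  = np.array([[1., 0., 0., 1.], [0., 1., 1., 0.], [1., 0., 1., 0.], [0., 1., 0., 1.]])

head1 = scaled_dot_product_attention(Q @ WQ1, K @ WK1, V @ WV1)
head2 = scaled_dot_product_attention(Q @ WQ2, K @ WK2, V @ WV2)
output = np.concatenate([head1, head2], axis=-1) @ WO   # Concat(head_1, head_2) W^O

print(head1.round(2))   # matches the head_1 values above
print(head2.round(2))   # matches the head_2 values above
print(output.round(2))  # final multi-head attention output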
Example 3: Three-Head Attention with Another Set of Numbers (Short Demonstration)
For completeness, suppose we wanted $h=3$ heads, each of dimension $\frac{d_{\text{model}}}{3}$. The steps are exactly the same:
- Project $Q, K, V$ into three subspaces via $W^Q_i, W^K_i, W^V_i$.
- Perform scaled dot-product attention for each head:
$\text{head}_i = \text{Attention}(Q_i, K_i, V_i)$. - Concatenate all heads: $\text{Concat}(\text{head}_1, \text{head}_2, \text{head}_3)$.
- Multiply by $W^O$.
Each numeric calculation is analogous to the 2-head case—just with different shapes (e.g., each head might have dimension 4/3 if the original dimension is 4, which typically would be handled with rounding or a slightly different total dimension). The procedure remains identical in principle.
Position-Wise Feed-Forward Networks
Each layer in a Transformer includes a position-wise feed-forward network (FFN) that applies a linear transformation and activation to each position independently:
\[\text{FFN}(x) = \max(0,\; xW_1 + b_1)\, W_2 + b_2,\]where $\max(0, \cdot)$ is the ReLU activation function.
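A minimal NumPy sketch of this position-wise FFN is shown below; it reuses the toy matrices of the worked example that follows, so its output can be checked against that example:

import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each row (position) independently
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

x  = np.array([[1., 0.], [0., 1.], [1., 1.]])
W1 = np.array([[1., 1.], [0., 1.]])
b1 = np.array([0., 1.])
W2 = np.array([[1., 0.], [2., 1.]])
b2 = np.array([1., -1.])

print(position_wise_ffn(x, W1, b1, W2, b2))
# [[6. 1.]
#  [5. 1.]
#  [8. 2.]]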
Example: Numerical Computation of the Feed-Forward Network
Let
\[x = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix},\quad W_1 = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix},\quad b_1 = \begin{bmatrix} 0 & 1 \end{bmatrix},\quad W_2 = \begin{bmatrix} 1 & 0 \\ 2 & 1 \end{bmatrix},\quad b_2 = \begin{bmatrix} 1 & -1 \end{bmatrix}.\]- Compute $xW_1 + b_1$
-
Row 1: $[1, 0]$
\[[1, 0] \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} = [1, 1],\]then add $[0, 1]$ to get $[1, 2]$.
-
Row 2: $[0, 1]$
\[[0,1]\times \begin{bmatrix}1 & 1\\0 & 1\end{bmatrix} = [0, 1],\]plus $[0, 1]$ = $[0, 2]$.
-
Row 3: $[1,1]$
\[[1,1]\times \begin{bmatrix}1 & 1\\0 & 1\end{bmatrix} = [1, 2],\]plus $[0, 1]$ = $[1, 3]$.
So
\[X_1 = \begin{bmatrix} 1 & 2\\ 0 & 2\\ 1 & 3 \end{bmatrix}.\] -
- ReLU activation
$\max(0, X_1)$ leaves nonnegative elements unchanged. All entries are already $\ge0$, so
\[\text{ReLU}(X_1) = X_1.\]
- Multiply by $W_2$ and add $b_2$
\[W_2 = \begin{bmatrix} 1 & 0\\ 2 & 1 \end{bmatrix},\quad b_2 = [1, -1].\] \[X_2 = X_1 W_2.\]-
Row 1 of $X_1$: $[1,2]$
\([1,2] \begin{bmatrix} 1\\2 \end{bmatrix} = 1*1 +2*2=5, \quad [1,2] \begin{bmatrix} 0\\1 \end{bmatrix} = 0 +2=2.\) So $[5,2]$.
-
Row 2: $[0,2]$
\[[0,2] \begin{bmatrix}1\\2\end{bmatrix}=4,\quad [0,2] \begin{bmatrix}0\\1\end{bmatrix}=2.\] -
Row 3: $[1,3]$
\[[1,3]\begin{bmatrix}1\\2\end{bmatrix}=1+6=7,\quad [1,3]\begin{bmatrix}0\\1\end{bmatrix}=0+3=3.\]
Thus
\[X_2 = \begin{bmatrix} 5 & 2\\ 4 & 2\\ 7 & 3 \end{bmatrix}.\]Add $b_2=[1,-1]$:
\[X_2 + b_2 = \begin{bmatrix} 6 & 1\\ 5 & 1\\ 8 & 2 \end{bmatrix}.\] -
Final Output:
\[\begin{bmatrix} 6 & 1\\ 5 & 1\\ 8 & 2 \end{bmatrix}.\]
Training and Optimization
Optimizer and Learning Rate Scheduling
Transformers commonly use Adam or AdamW, combined with a piecewise learning rate scheduling strategy:
\[l_{\text{rate}} = d_{\text{model}}^{-0.5} \cdot \min\bigl(\text{step}_\text{num}^{-0.5},\; \text{step}_\text{num}\times \text{warmup}_\text{steps}^{-1.5}\bigr),\]where:
- $d_{\text{model}}$ is the hidden dimension.
- $\text{step}_\text{num}$ is the current training step.
- $\text{warmup}_\text{steps}$ is the number of warmup steps.
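A minimal Python sketch of this schedule, assuming the defaults used for the original base model ($d_{\text{model}} = 512$ and 4,000 warmup steps):

```python
def transformer_lr(step_num, d_model=512, warmup_steps=4000):
    """Warmup-then-decay learning rate schedule used for the Transformer."""
    step_num = max(step_num, 1)  # guard against step 0
    return d_model ** -0.5 * min(step_num ** -0.5, step_num * warmup_steps ** -1.5)

# Rises linearly during warmup, peaks near step 4000 (about 7.0e-4 here), then decays as step^-0.5
for step in (100, 1000, 4000, 40000):
    print(step, f"{transformer_lr(step):.6f}")
```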
Conclusion
The Transformer architecture has become a foundational model in modern deep learning, showing remarkable performance in NLP, computer vision, and multimodal applications. Its ability to capture long-range dependencies, combined with high parallelizability and scalability, has inspired a diverse range of research directions and practical systems. Ongoing work continues to explore ways to improve Transformer efficiency, adapt it to new scenarios, and enhance model interpretability.
Paper Reading: Attention Is All You Need
Below is a paragraph-by-paragraph (or subsection-by-subsection) markdown file that first re-states (“recaps”) each portion of the paper Attention Is All You Need and then comments on or explains that portion in more detail. Each header corresponds to a main section or subsection from the original text. The original content has been paraphrased and condensed to be more concise, but the overall structure and meaning are preserved.
Note: The original paper, “Attention Is All You Need,” was published by Ashish Vaswani et al. This markdown document is for educational purposes, offering an English re-statement of each section followed by commentary.
Paper Reading: Attention Is All You Need
Authors and Affiliations
Original (Condensed)
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin.
Affiliations: Google Brain, Google Research, University of Toronto.
Recap
A group of researchers from Google Brain, Google Research, and the University of Toronto propose a new network architecture that relies solely on attention mechanisms for sequence transduction tasks such as machine translation.
Commentary
This highlights that multiple authors, each potentially focusing on different aspects—model design, optimization, and experiments—came together to create what is now often referred to as the “Transformer” architecture.
Abstract
Original (Condensed)
The dominant sequence transduction models use recurrent or convolutional neural networks (often with attention). This paper proposes the Transformer, which is based entirely on attention mechanisms. It does away with recurrence and convolutions entirely. Experiments on two machine translation tasks show the model is both high-performing in terms of BLEU score and more parallelizable. The paper reports a new state-of-the-art BLEU on WMT 2014 English-German (28.4) and a strong single-model result on English-French (41.8), trained much faster than previous approaches. The Transformer also generalizes well to other tasks, e.g., English constituency parsing.*
Recap
The paper’s abstract introduces a novel approach called the Transformer. It uses only attention (no RNNs or CNNs) for tasks like machine translation and shows exceptional speed and accuracy results.
Commentary
This is a seminal innovation in deep learning for language processing. Removing recurrence (like LSTM layers) and convolutions makes training highly parallelizable, dramatically reducing training time. At the same time, it achieves superior or comparable performance on well-known benchmarks. The abstract also hints that the Transformer concept could generalize to other sequential or structured tasks.
1 Introduction
Original (Condensed)
Recurrent neural networks (RNNs), particularly LSTM or GRU models, have set the standard in sequence modeling and transduction tasks. However, they process input sequentially, limiting parallelization. Attention mechanisms have improved performance in tasks like translation, but they have traditionally been used on top of recurrent networks. This paper proposes a model that relies entirely on attention—called the Transformer—removing the need for recurrence or convolutional architectures. The result is a model that learns global dependencies and can be trained more efficiently.*
Recap
The introduction situates the proposed Transformer within the history of neural sequence modeling: first purely recurrent approaches, then RNN+attention, and finally a pure-attention approach. The authors observe that while recurrent models handle sequences effectively, they rely on step-by-step processing. This strongly limits parallel computation. The Transformer’s innovation is to dispense with recurrences altogether.
Commentary
The introduction highlights a major bottleneck in typical RNN-based models: the inability to parallelize across time steps in a straightforward way. Traditional attention over RNN outputs is still useful, but the authors propose a more radical approach, removing recurrences and using attention everywhere. This sets the stage for a highly parallelizable model that can scale better to longer sequences, given sufficient memory and computational resources.
In-Course Question 1: What is the dimensionality of the Transformer’s query embeddings as designed in this paper?
2 Background
Original (Condensed)
Efforts to reduce the sequential computation have led to alternatives like the Extended Neural GPU, ByteNet, and ConvS2S, which use convolutional networks for sequence transduction. However, even with convolution, the distance between two positions can be large in deep stacks, potentially making it harder to learn long-range dependencies. Attention mechanisms have been used for focusing on specific positions in a sequence, but typically in conjunction with RNNs. The Transformer is the first purely attention-based model for transduction.*
Recap
The background section covers attempts to speed up sequence modeling, including convolution-based architectures. While they improve speed and are more parallelizable than RNNs, they still can have challenges with long-range dependencies. Attention can address such dependencies, but before this paper, it was usually combined with recurrent models.
Commentary
This background motivates why researchers might try to eliminate recurrence and convolution entirely. If attention alone can handle dependency modeling, then the path length between any two positions in a sequence is effectively shorter. This suggests simpler, faster training and potentially better performance.
3 Model Architecture
The Transformer follows an encoder-decoder structure, but with self-attention replacing recurrences or convolutions.
3.1 Encoder and Decoder Stacks
Original (Condensed)
The encoder is composed of N identical layers; each layer has (1) a multi-head self-attention sub-layer, and (2) a position-wise feed-forward network. A residual connection is employed around each of these, followed by layer normalization. The decoder also has N identical layers with an additional sub-layer for attention over the encoder output. A masking scheme ensures each position in the decoder can only attend to positions before it (causal masking).*
Recap
- Encoder: Stack of N layers. Each layer has:
- Self-attention
- Feed-forward
Plus skip (residual) connections and layer normalization.
- Decoder: Similar stack but also attends to the encoder output. Additionally, the decoder masks future positions to preserve the autoregressive property.
Commentary
This design is highly modular: each layer is built around multi-head attention and a feed-forward block. The skip connections help with training stability, and layer normalization is known to speed up convergence. The causal masking in the decoder is crucial for generation tasks such as translation, ensuring that the model cannot “peek” at future tokens.
3.2 Attention
Original (Condensed)
An attention function maps a query and a set of key-value pairs to an output. We use a “Scaled Dot-Product Attention,” where the dot products between query and key vectors are scaled by the square root of the dimension. A softmax yields weights for each value. We also introduce multi-head attention: queries, keys, and values are linearly projected h times, each head performing attention in parallel, then combined.*
Recap
- Scaled Dot-Product Attention: Computes attention weights via softmax((QK^T) / sqrt(d_k)) * V.
- Multi-Head Attention: Instead of a single attention, we project Q, K, V into multiple sub-spaces (heads), do attention in parallel, then concatenate.
Commentary
Dot-product attention is computationally efficient and can be parallelized easily. The scaling factor 1/√(d_k) helps mitigate large magnitude dot products when the dimensionality of keys/queries is big. Multiple heads allow the model to look at different positions/relationships simultaneously, which helps capture various types of information (e.g., syntax, semantics).
3.3 Position-wise Feed-Forward Networks
Original (Condensed)
Each layer in the encoder and decoder has a feed-forward network that is applied to each position separately and identically, consisting of two linear transformations with a ReLU in between.*
Recap
After multi-head attention, each token’s representation goes through a small “fully connected” or “feed-forward” sub-network. This is done independently per position.
Commentary
This structure ensures that after attention-based mixing, each position is then transformed in a non-linear way. It is reminiscent of using small per-position multi-layer perceptrons to refine each embedding.
3.4 Embeddings and Softmax
Original (Condensed)
Token embeddings and the final output linear transformation share the same weight matrix (with a scaling factor). The model uses learned embeddings to convert input and output tokens to vectors of dimension d_model.*
Recap
The model uses standard embedding layers for tokens and ties the same weights in both the embedding and the pre-softmax projection. This helps with parameter efficiency and sometimes improves performance.
Commentary
Weight tying is a known trick that can save on parameters and can help the embedding space align with the output space in generative tasks.
3.5 Positional Encoding
Original (Condensed)
Because there is no recurrence or convolution, the Transformer needs positional information. The paper adds a sinusoidal positional encoding to the input embeddings, allowing the model to attend to relative positions. Learned positional embeddings perform similarly, but sinusoidal encodings might let the model generalize to sequence lengths not seen during training.*
Recap
The Transformer adds sine/cosine signals of varying frequencies to the embeddings so that each position has a unique pattern. This is essential to preserve ordering information.
Commentary
Without positional encodings, the self-attention mechanism would treat input tokens as an unstructured set. Positional information ensures that the model knows how tokens relate to one another in a sequence.
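As a concrete illustration, here is a minimal NumPy sketch of the sinusoidal encoding, following the paper's formulation $PE_{(pos, 2i)} = \sin(pos/10000^{2i/d_{\text{model}}})$ and $PE_{(pos, 2i+1)} = \cos(pos/10000^{2i/d_{\text{model}}})$; the `max_len` and `d_model` values below are illustrative, and an even `d_model` is assumed.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(max_len)[:, None]                    # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                 # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)    # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                         # even dimensions
    pe[:, 1::2] = np.cos(angles)                         # odd dimensions
    return pe

# Each position gets a unique pattern that is added to its token embedding
pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(pe.shape)  # (50, 16)
```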
4 Why Self-Attention
Original (Condensed)
The authors compare self-attention to recurrent and convolutional layers in terms of computation cost and how quickly signals can travel between distant positions in a sequence. Self-attention is more parallelizable and has O(1) maximum path length (all tokens can attend to all others in one step). Convolutions and recurrences require multiple steps to connect distant positions. This can help with learning long-range dependencies.*
Recap
Self-attention:
- Parallelizable across sequence positions.
- Constant number of sequential operations per layer.
- Short paths between positions -> easier to learn long-range dependencies.
Commentary
The authors argue that self-attention layers are efficient (especially when sequence length is not extremely large) and effective at modeling dependencies. This is a key motivation for the entire design.
In-class question: What is the probability assigned to the ground-truth class in the ground-truth distribution after label smoothing when training the Transformer in the default setting of this paper?
5 Training
5.1 Training Data and Batching
Original (Condensed)
The authors use WMT 2014 English-German (about 4.5M sentence pairs) and English-French (36M pairs). They use subword tokenization (byte-pair encoding or word-piece) to handle large vocabularies. Training batches contain roughly 25k source and 25k target tokens.*
Recap
They describe the datasets and how the text is batched using subword units. This avoids issues with out-of-vocabulary tokens.
Commentary
Subword tokenization was pivotal in neural MT systems because it handles rare words well. Batching by approximate length helps the model train more efficiently and speeds up training on GPUs.
5.2 Hardware and Schedule
Original (Condensed)
They trained on a single machine with 8 NVIDIA P100 GPUs. The base model was trained for 100k steps (about 12 hours), while the bigger model took around 3.5 days. Each training step for the base model took ~0.4 seconds on this setup.*
Recap
Base models train surprisingly quickly—only about half a day for high-quality results. The big model uses more parameters and trains longer.
Commentary
This training time is significantly shorter than earlier neural MT models, demonstrating one practical advantage of a highly parallelizable architecture.
5.3 Optimizer
Original (Condensed)
The paper uses the Adam optimizer with specific hyperparameters (β1=0.9, β2=0.98, ε=1e-9). The learning rate increases linearly for the first 4k steps, then decreases proportionally to step^-0.5.*
Recap
A custom learning-rate schedule is used, with a “warm-up” phase followed by a decay. This is crucial to stabilize training early on and then adapt to a more standard rate.
Commentary
This schedule (often called the “Noam” schedule) is well known in the community. The warmup phase ramps the learning rate up gradually, which prevents divergence early in training, and the subsequent decay keeps later updates stable.
5.4 Regularization
Original (Condensed)
Three types of regularization: (1) Dropout after sub-layers and on embeddings, (2) label smoothing of 0.1, (3) early stopping / checkpoint averaging (not explicitly described here but implied). Label smoothing slightly hurts perplexity but improves translation BLEU.*
Recap
- Dropout helps avoid overfitting.
- Label smoothing makes the model less certain about each token prediction, improving generalization.
Commentary
By forcing the model to distribute probability mass across different tokens, label smoothing can prevent the network from becoming overly confident in a small set of predictions, thus improving real-world performance metrics like BLEU.
6 Results
6.1 Machine Translation
Original (Condensed)
On WMT 2014 English-German, the big Transformer achieved 28.4 BLEU, surpassing all previously reported results (including ensembles). On English-French, it got 41.8 BLEU with much less training cost compared to other models. The base model also outperforms previous single-model baselines.*
Recap
Transformer sets a new SOTA on English-German and matches/exceeds on English-French with vastly reduced training time.
Commentary
This was a landmark result, as both speed and quality improved. The authors highlight not just the performance, but the “cost” in terms of floating-point operations, showing how the Transformer is more efficient.
6.2 Model Variations
Original (Condensed)
They explore different hyperparameters, e.g., number of attention heads, dimension of queries/keys, feed-forward layer size, and dropout. They find that more heads can help but too many heads can degrade performance. Bigger dimensions improve results at the expense of more computation.*
Recap
Experiments confirm that the Transformer’s performance scales with model capacity. Properly tuned dropout is vital. Both sinusoidal and learned positional embeddings perform comparably.
Commentary
This section is valuable for practitioners, as it provides insight into how to adjust model size and regularization. It also confirms that the approach is flexible.
6.3 English Constituency Parsing
Original (Condensed)
They show that the Transformer can also tackle English constituency parsing, performing competitively with top models. On the WSJ dataset, it achieves strong results, and in a semi-supervised setting, it is even more impressive.*
Recap
It isn’t just about machine translation: the model generalizes to other tasks with structural dependencies, illustrating self-attention’s adaptability.
Commentary
Constituency parsing requires modeling hierarchical relationships in sentences. Transformer’s ability to attend to any part of the input helps capture these structures without specialized RNNs or grammar-based methods.
7 Conclusion
Original (Condensed)
The Transformer architecture relies entirely on self-attention, providing improved parallelization and, experimentally, new state-of-the-art results in machine translation. The paper suggests applying this approach to other tasks and modalities, possibly restricting attention to local neighborhoods for efficiency with large sequences. The code is made available in an open-source repository.*
Recap
The authors close by reiterating how self-attention replaces recurrence and convolution, giving strong speed advantages. They encourage investigating how to adapt the architecture to other domains and tasks.
Commentary
This conclusion underscores the paper’s broad impact. After publication, the Transformer rapidly became the foundation of many subsequent breakthroughs, including large-scale language models. Future directions—like local attention for very long sequences—have since seen extensive research.
References
(Original references are long and primarily list papers on neural networks, attention, convolutional models, etc. Below is a very brief, high-level mention.)
Recap
The references include prior works on RNN-based machine translation, convolutional approaches, attention mechanisms, and optimization techniques.
Commentary
They form a comprehensive backdrop for the evolution of neural sequence modeling, highlighting both the developments that led to the Transformer and the new directions it subsequently inspired.
Overall Commentary
The paper Attention Is All You Need revolutionized natural language processing by introducing a purely attention-based model (the Transformer). Its core contributions can be summarized as:
- Eliminating Recurrence and Convolution: Replacing them with multi-head self-attention to model dependencies in a single step.
- Superior Performance and Efficiency: Achieving state-of-the-art results on crucial MT tasks faster than prior methods.
- Generalization: Showing that the model concept extends beyond MT to other tasks, e.g., parsing.
This architecture laid the groundwork for many subsequent techniques, including BERT, GPT, and other large language models. The key takeaway is that attention mechanisms alone—when used in a multi-layer, multi-head framework—suffice to capture both local and global information in sequences, drastically improving efficiency and performance in a wide range of NLP tasks.
Lecture 4: Analysis of Transformer Models: Parameter Count, Computation, Activations
In-Class Question 1: Given layer number $N$ as 6, model dimension $d_{model}$ as 512, feed-forward dimension $d_{ff}$ = 2048, number of attention heads $h$ = 8, what is the total number of learnable parameters in a vanilla Transformer model?
In-Class Question 2: Given layer number $N$ as 6, model dimension $d_{model}$ as 1024, feed-forward dimension $d_{ff}$ = 4096, number of attention heads $h$ = 16, what is the total number of learnable parameters in a vanilla Transformer model?
Reference Tutorial: Parameter size of vanilla transformer Reference Tutorial: Analysis of Transformer Models
1. Introduction
Welcome to this expanded class on analyzing the memory and computational efficiency of training large language models (LLMs). With the rise of models like OpenAI’s ChatGPT, researchers and engineers have become increasingly interested in the mechanics behind Large Language Models. The “large” aspect of these models refers both to the number of model parameters and the scale of training data. For example, GPT-3 has 175 billion parameters and was trained on 570 GB of data. Consequently, training such models presents two key challenges: memory efficiency and computational efficiency.
Most large models in industry today utilize the transformer architecture. Their structures can be broadly divided into encoder-decoder (exemplified by T5) and decoder-only. The decoder-only structure can be split into Causal LM (represented by the GPT series) and Prefix LM (represented by GLM). Causal language models like GPT have achieved significant success, so many mainstream LLMs employ the Causal LM paradigm. In this class, we will focus on the decoder-only transformer framework, analyzing its parameter count, computational requirements, and intermediate activations to better understand the memory and computational efficiency of training and inference.
To make the analysis clearer, let us define the following notation:
- $l$: Number of transformer layers
- $h$: Hidden dimension
- $a$: Number of attention heads
- $V$: Vocabulary size
- $b$: Training batch size
- $s$: Sequence length
2. Model Parameter Count
A transformer model commonly consists of $l$ identical layers, each containing a self-attention block and an MLP block. The decoder-only structure also includes an embedding layer and a final output layer (often weight-tied with the embedding).
2.1 Parameter Breakdown per Layer
- Self-Attention Block
The trainable parameters here include:
- Projection matrices for queries, keys, and values: $W_Q, W_K, W_V \in \mathbb{R}^{h \times h}$
- Output projection matrix: $W_O \in \mathbb{R}^{h \times h}$
- Their corresponding bias vectors (each in $\mathbb{R}^{h}$)
Hence, the parameter count in self-attention is: \(3(h \times h) + (h \times h) + \text{(4 biases)} = 4h^2 + 4h.\) However, in multi-head attention, we often split $h$ into $a$ heads, each of dimension $h/a$. Internally, $W_Q, W_K, W_V$ can be viewed as $[h, a\times (h/a)] = [h, h]$, so the total dimension is still $h\times h$. This is why the simpler $h^2$ counting still holds.
- MLP Block
Usually, the MLP block has two linear layers:
- First layer: $W_1 \in \mathbb{R}^{h \times (4h)}$ and bias in $\mathbb{R}^{4h}$
- Second layer: $W_2 \in \mathbb{R}^{(4h) \times h}$ and bias in $\mathbb{R}^{h}$
Therefore, the MLP block has: \(h \times (4h) + (4h) \;+\; (4h)\times h + h \;=\; 8h^2 + 5h\) parameters in total.
- Layer Normalization
Both the self-attention and MLP blocks have a layer normalization containing a scaling parameter $\gamma$ and a shifting parameter $\beta$ in $\mathbb{R}^{h}$. So two layer norms contribute $4h$ parameters: \(2 \times (h + h) = 4h.\)
Summing these, each transformer layer has: \((4h^2 + 4h) + (8h^2 + 5h) + 4h = 12h^2 + 13h\) trainable parameters.
- Embedding Layer
There is a word embedding matrix in $\mathbb{R}^{V \times h}$, which contributes $Vh$ parameters. In many LLM implementations (such as GPT variants), this same matrix is shared with the final output projection for logits (output embedding). Hence the total parameters for input and output embeddings are typically counted as $Vh$ rather than $2Vh$.
If the position encoding is trainable, it might add a few more parameters, but often relative position encodings (e.g., RoPE, ALiBi) contain no trainable parameters. We will ignore any small parameter additions from positional encodings.
Thus, an $l$-layer transformer model has a total trainable parameter count of: \(l \times (12h^2 + 13h) + Vh.\)
When $h$ is large, the $13h$ term is negligible compared to $12h^2$, so the parameter count is roughly: \(12\,l\,h^2.\)
2.2 Estimating LLaMA Parameter Counts
Below is a table comparing the approximate $12\,l\,h^2$ calculation for various LLaMA models to their actual parameter counts:
| Actual Parameter Count | Hidden Dimension $h$ | Layer Count $l$ | $12\,l\,h^2$ |
|---|---|---|---|
| 6.7B | 4096 | 32 | 6,442,450,944 |
| 13.0B | 5120 | 40 | 12,582,912,000 |
| 32.5B | 6656 | 60 | 31,897,681,920 |
| 65.2B | 8192 | 80 | 64,424,509,440 |
We see that the approximation $12\,l\,h^2$ is quite close to actual parameter counts.
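The table can be reproduced directly from the formulas above. This is a minimal sketch; the vocabulary size $V = 32{,}000$ (LLaMA's SentencePiece vocabulary) is only needed for the exact formula, and the exact formula still ignores LLaMA-specific details such as RMSNorm and the SwiGLU feed-forward shape.

```python
def approx_params(l, h):
    """Approximate parameter count: 12 * l * h^2."""
    return 12 * l * h ** 2

def exact_params(l, h, V):
    """Per-layer 12h^2 + 13h, plus Vh for the (tied) embedding matrix."""
    return l * (12 * h ** 2 + 13 * h) + V * h

# LLaMA configurations from the table above; V = 32000 is LLaMA's vocabulary size
for name, h, l in [("6.7B", 4096, 32), ("13.0B", 5120, 40), ("32.5B", 6656, 60), ("65.2B", 8192, 80)]:
    print(name, f"12lh^2 = {approx_params(l, h):,}", f"exact = {exact_params(l, h, 32000):,}")
```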
2.3 Memory Usage Analysis During Training
The main memory consumers during training are:
- Model Parameters
- Intermediate Activations (from the forward pass)
- Gradients
- Optimizer States (e.g., AdamW’s first and second moments)
We first analyze parameters, gradients, and optimizer states. The topic of intermediate activations will be discussed later in detail.
Large models often use the AdamW optimizer with mixed precision (float16 for forward/backward passes and float32 for optimizer updates). Let the total number of trainable parameters be $\Phi$. During a single training iteration:
- There is one gradient element per parameter ($\Phi$ elements total).
- AdamW maintains two optimizer states (first-order and second-order moments), so that is $2\Phi$ elements in total.
A float16 element is 2 bytes, and a float32 element is 4 bytes. In mixed precision training:
- Model parameters (for the forward and backward pass) are stored in float16.
- Gradients are computed in float16.
- For parameter updates, the optimizer internally uses float32 copies of parameters and gradients, as well as float32 for its two moment states.
Hence, each trainable parameter uses (approximately) the following:
- Forward/backward parameter: float16 $\to$ 2 bytes
- Gradient: float16 $\to$ 2 bytes
- Optimizer parameter copy: float32 $\to$ 4 bytes
- Optimizer gradient copy: float32 $\to$ 4 bytes
- First-order moment: float32 $\to$ 4 bytes
- Second-order moment: float32 $\to$ 4 bytes
Summing: \(2 + 2 + 4 + 4 + 4 + 4 = 20\ \text{bytes per parameter}.\)
Therefore, training a large model with $\Phi$ parameters under mixed precision with AdamW requires approximately: \(20\,\Phi \quad \text{bytes}\) to store parameters, gradients, and optimizer states.
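As a quick sanity check of this figure, here is a minimal sketch; the 175B-parameter example is illustrative.

```python
def training_memory_gb(num_params, bytes_per_param=20):
    """Parameters + gradients + AdamW states under mixed precision: ~20 bytes/parameter."""
    return num_params * bytes_per_param / 1e9  # decimal GB

# Example: a 175B-parameter model needs about 3,500 GB just for these states
print(f"{training_memory_gb(175e9):,.0f} GB")
```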
Practical Note on Distributed Training
In practice, distributed training techniques like ZeRO (Zero Redundancy Optimizer) can partition optimizer states across multiple GPUs, reducing per-GPU memory usage. However, the total memory across the entire cluster remains on the same order as the above calculation (though effectively shared among GPUs).
2.4 Memory Usage Analysis During Inference
During inference, there are no gradients or optimizer states, nor do we need to store all intermediate activations for backpropagation. Thus, the main memory usage is from the model parameters themselves. If float16 is used for inference, this is roughly: \(2\,\Phi \quad \text{bytes}.\)
When using a key-value (KV) cache for faster autoregressive inference, some additional memory is used (analyzed later). There is also small overhead for the input data and temporary buffers, but this is typically negligible compared to parameter storage and KV cache.
3. Computational Requirements (FLOPs) Estimation
FLOPs (floating point operations) measure computational cost. For two matrices $A \in \mathbb{R}^{n \times m}$ and $B \in \mathbb{R}^{m \times l}$, computing $AB$ takes roughly $2nml$ FLOPs (one multiplication and one addition per element pair).
In one training iteration with input shape $[b, s]$, let’s break down the self-attention and MLP costs in a single transformer layer.
3.1 Self-Attention Block
A simplified representation of the self-attention operations is:
\[Q = xW_Q,\quad K = xW_K,\quad V = xW_V\] \[\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{Q K^\mathsf{T}}{\sqrt{h}}\right) \cdot V,\] \[x_{\text{out}} = \text{Attention}(Q,K,V)\,W_O + x.\]Let $x\in \mathbb{R}^{b\times s\times h}$. The major FLOP contributors are:
- Computing $Q, K, V$
Each matrix multiplication has shape $[b, s, h]\times[h, h]\to[b, s, h]$.
- Cost: $3 \times 2 \,b\,s\,h^2 = 6\,b\,s\,h^2$ (the factor 2 arises from multiply + add).
- $Q K^\mathsf{T}$
- $Q, K \in \mathbb{R}^{b \times s \times h}$, often reinterpreted as $[b, a, s, \frac{h}{a}]$.
- The multiplication result has shape $[b, a, s, s]$.
- Cost: $2\,b\,s^2\,h$.
- Weighted $V$
- We multiply the attention matrix $[b, a, s, s]$ by $V \in [b, a, s, \frac{h}{a}]$.
- Cost: $2\,b\,s^2\,h$.
- Output linear projection
- $[b, s, h]\times[h, h]\to[b, s, h]$.
- Cost: $2\,b\,s\,h^2$.
Hence, the self-attention block requires about: \(6\,b\,s\,h^2 + 2\,b\,s\,h^2 + 2\,b\,s^2\,h + 2\,b\,s^2\,h\) which simplifies to \(8\,b\,s\,h^2 + 4\,b\,s^2\,h.\) (We will combine final terms more precisely in the overall layer cost.)
3.2 MLP Block
The MLP block typically is: \(x_{\text{MLP}} = \mathrm{GELU}\bigl(x_{\text{out}} W_1\bigr)\,W_2 + x_{\text{out}},\) where $W_1 \in [h, 4h]$ and $W_2 \in [4h, h]$. The major FLOP contributors are:
- First linear layer:
- $[b, s, h]\times [h, 4h]\to[b, s, 4h]$.
- Cost: $2\,b\,s\,h\,(4h) = 8\,b\,s\,h^2$.
- Second linear layer:
- $[b, s, 4h]\times [4h, h]\to[b, s, h]$.
- Cost: $2\,b\,s\,(4h)\,h = 8\,b\,s\,h^2$.
Nonlinear activations like GELU also incur some cost, but often it is modest compared to large matrix multiplications.
3.3 Summing Over One Transformer Layer
Combining self-attention and MLP:
- Self-Attention: ~$8\,b\,s\,h^2 + 4\,b\,s^2\,h$
- MLP: ~$16\,b\,s\,h^2$ (sum of the two 8’s)
Thus, each transformer layer requires about: \((8 + 16)\,b\,s\,h^2 + 4\,b\,s^2\,h \;=\; 24\,b\,s\,h^2 + 4\,b\,s^2\,h\) FLOPs.
Additionally, computing logits in the final output layer has cost: \(2\,b\,s\,h\,V.\)
For an $l$-layer transformer, one forward pass with input $[b, s]$ thus has a total cost:
\[l \times \Bigl(24\,b\,s\,h^2 + 4\,b\,s^2\,h\Bigr) \;+\; 2\,b\,s\,h\,V.\]In many large-scale settings, $h\gg s$, so $4\,b\,s^2\,h$ can be smaller relative to $24\,b\,s\,h^2$, and $2\,b\,s\,h\,V$ can also be relatively smaller if $V$ is not extremely large. Hence a common approximation is:
\[\approx 24\,l\,b\,s\,h^2.\]
3.4 Relationship Between Computation and Parameter Count
Recall the parameter count is roughly $12\,l\,h^2$. Comparing:
\[\frac{24\,b\,s\,h^2\,l}{12\,l\,h^2} = 2\,b\,s.\]Hence, for each token, each parameter performs about 2 FLOPs in one forward pass (one multiplication + one addition). In a training iteration (forward + backward), the cost is typically 3 times the forward pass. Thus per token-parameter we have \(2 \times 3 = 6\) FLOPs in total.
However, activation recomputation (discussed in Section 4.5) can add another forward-like pass during backpropagation, making the factor 4 instead of 3. Then per token-parameter we get $2 \times 4 = 8$ FLOPs.
3.5 Estimating Training Costs
Consider GPT-3 (175B parameters), which has about $1.746\times 10^{11}$ parameters and was trained on roughly $3\times 10^{11}$ tokens. Each parameter-token pair does about 6 FLOPs in forward+backward:
\[6 \times 1.746\times 10^{11} \times 3\times 10^{11} \;=\; 3.1428\times 10^{23}\,\text{FLOPs}.\]
Large Language Models’ Training Costs (source: https://arxiv.org/pdf/2005.14165v4)
3.6 Training Time Estimation
Given the total FLOPs and the GPU hardware specs, we can estimate training time. The raw GPU FLOP rate alone does not reflect real-world utilization, and typical utilization might be between 0.3 and 0.55 due to factors like data loading, communication, and logging overheads.
Also note that activation recomputation adds an extra forward pass, giving a factor of 4 (forward + backward + recomputation) instead of 3. Thus, per token-parameter we get $2 \times 4 = 8$ FLOPs.
Hence, training time can be roughly estimated by: \(\text{Training Time} \approx \frac{8 \times (\text{tokens count}) \times (\text{model parameter count})} {\text{GPU count} \times \text{GPU peak performance (FLOPs)} \times \text{GPU utilization}}.\)
Example: GPT-3 (175B)
Using 1024 A100 (40GB) GPUs to train GPT-3 on 300B tokens:
- Peak performance per A100 (40GB) is about 312 TFLOPS.
- Assume GPU utilization at 0.45.
- Parameter count $\approx 175\text{B}$.
- Training tokens = 300B.
Estimated training time:
\[\text{Time} \approx \frac{8 \times 300\times 10^9 \times 175\times 10^9} {1024 \times 312\times 10^{12} \times 0.45} \;\approx\; 34\,\text{days}.\]This is consistent with reported real-world results in [7].
Example: LLaMA-65B
Using 2048 A100 (80GB) GPUs to train LLaMA-65B on 1.4T tokens:
- Peak performance per A100 (80GB) is about 624 TFLOPS.
- Assume GPU utilization at 0.3.
- Parameter count $\approx 65\text{B}$.
- Training tokens = 1.4T.
Estimated training time:
\[\text{Time} \approx \frac{8 \times 1.4\times 10^{12} \times 65\times 10^9} {2048 \times 624\times 10^{12} \times 0.3} \;\approx\; 22\,\text{days}.\]This is close to the roughly 21 days reported in [4].
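Both estimates follow directly from the training-time formula above. This is a minimal sketch; the GPU counts, peak FLOPS, and utilization values are exactly the assumptions stated in the two examples.

```python
def training_days(tokens, params, n_gpus, peak_flops, utilization):
    """Estimate: 8 * tokens * params / (GPUs * peak FLOPs * utilization), in days."""
    seconds = 8 * tokens * params / (n_gpus * peak_flops * utilization)
    return seconds / 86400

# GPT-3 175B, 300B tokens, 1024 x A100-40GB at 312 TFLOPS, 45% utilization
print(f"{training_days(300e9, 175e9, 1024, 312e12, 0.45):.1f} days")  # about 34 days
# LLaMA-65B, 1.4T tokens, 2048 x A100-80GB at 624 TFLOPS, 30% utilization
print(f"{training_days(1.4e12, 65e9, 2048, 624e12, 0.30):.1f} days")  # about 22 days
```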
In-Class Question 1: What is the training time of using 4096 H100 GPUs to train LLaMA-70B on 300B tokens?
In-Class Question 2: What is the training time of using 1024 H100 GPUs to train LLaMA-70B on 1.4T tokens?
4. Intermediate Activation Analysis
During training, intermediate activations (values generated in the forward pass that are needed for the backward pass) can consume a large portion of memory. These include layer inputs, dropout masks, etc., but exclude model parameters and optimizer states. Although there are small buffers for means and variances in layer normalization, their total size is generally negligible compared to the main tensor dimensions.
Typically, float16 or bfloat16 is used to store activations. We assume 2 bytes per element for these. Dropout masks often use 1 byte per element (or sometimes bit-packing is used in advanced implementations).
Let us analyze the main contributors for each layer.
4.1 Self-Attention Block
Using: \(Q = x\,W_Q,\quad K = x\,W_K,\quad V = x\,W_V,\) \(\text{Attention}(Q,K,V)= \text{softmax}\Bigl(\frac{QK^\mathsf{T}}{\sqrt{h}}\Bigr)\cdot V,\) \(x_{\text{out}} = \text{Attention}(Q,K,V)\,W_O + x,\) we consider:
- Input $x$
- Shape $[b, s, h]$, stored as float16 $\to 2\,b\,s\,h$ bytes.
- Q and K
- Each is $[b, s, h]$ in float16, so $2\,b\,s\,h$ bytes each. Together: $4\,b\,s\,h$ bytes.
- $QK^\mathsf{T}$ (softmax input)
- Shape is $[b, a, s, s]$. Since $a \times \frac{h}{a}=h$, memory cost is $2\,b\,a\,s^2$ bytes.
- Dropout mask for the attention matrix
- Typically uses 1 byte per element, shape $[b, a, s, s]\to b\,a\,s^2$ bytes.
- Softmax output (scores) and $V$
- Score has $2\,b\,a\,s^2$ bytes, $V$ has $2\,b\,s\,h$ bytes.
- Output projection input
- $[b, s, h]$ in float16 $\to 2\,b\,s\,h$ bytes.
- Another dropout mask for the output: $[b, s, h]$ at 1 byte each $\to b\,s\,h$ bytes.
Summing these (grouping terms carefully), the self-attention block activations total around: \(11\,b\,s\,h + 5\,b\,s^2\,a \quad \text{(bytes, counting float16 and dropout masks)}.\)
4.2 MLP Block
For the MLP: \(x = \mathrm{GELU}(x_{\text{out}}\,W_1)\,W_2 + x_{\text{out}},\) the main stored activations are:
- Input to first linear layer: $[b,s,h]$ at float16 $\to 2\,b\,s\,h$ bytes.
- Input to the GELU activation (the output of the first linear layer): $[b,s,4h]$ in float16 $\to 8\,b\,s\,h$ bytes.
- Input to the second linear layer (the output of GELU): $[b,s,4h]$ in float16 $\to 8\,b\,s\,h$ bytes.
- Dropout mask: $[b,s,h]$ at 1 byte per element $\to b\,s\,h$ bytes.
Hence, the MLP block’s stored activations sum to about: \(2\,b\,s\,h + 8\,b\,s\,h + 8\,b\,s\,h + b\,s\,h = 19\,b\,s\,h\) bytes.
4.3 Layer Normalization
Each layer has two layer norms (one for self-attention, one for MLP), each storing its input in float16. That is: \(2\times (2\,b\,s\,h) = 4\,b\,s\,h \quad \text{bytes}.\)
Thus, per layer, the activation memory is roughly: \((11\,b\,s\,h + 5\,b\,s^2\,a) + 19\,b\,s\,h + 4\,b\,s\,h \;=\; 34\,b\,s\,h + 5\,b\,s^2\,a.\)
An $l$-layer transformer has approximately: \(l \times \bigl(34\,b\,s\,h + 5\,b\,s^2\,a\bigr)\) bytes of intermediate activation memory.
4.4 Comparison with Parameter Memory
Unlike model parameter memory, which is essentially constant with respect to $b$ and $s$, activation memory grows with $b$ and $s$. Reducing batch size $b$ or sequence length $s$ is a common way to mitigate OOM (Out Of Memory) issues. For example:
- GPT-3 (175B parameters, $96$ layers, $h = 12288, a=96$) at sequence length $s=2048$:
- Model parameters: $175\text{B} \times 2\text{ bytes}\approx 350\text{ GB}$ in float16.
- Intermediate activations:
- If $b=1$, about $275$ GB (close to $0.79\times$ parameter memory).
- If $b=64$, about $17.6$ TB ($\approx 50\times$ parameter memory).
- If $b=128$, about $35.3$ TB ($\approx 100\times$ parameter memory).
Thus, activation memory can easily exceed parameter memory, especially at large batch sizes.
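These numbers follow from the per-layer formula above; here is a minimal sketch using the GPT-3 configuration ($l = 96$, $h = 12288$, $a = 96$, $s = 2048$).

```python
def activation_bytes(l, b, s, h, a):
    """Per the analysis above: l * (34*b*s*h + 5*a*b*s^2) bytes (float16 tensors + dropout masks)."""
    return l * (34 * b * s * h + 5 * a * b * s ** 2)

l, h, a, s = 96, 12288, 96, 2048  # GPT-3 175B configuration
for b in (1, 64, 128):
    print(f"b={b}: {activation_bytes(l, b, s, h, a) / 1e12:.2f} TB")
# b=1 gives ~0.28 TB (about 275 GB), matching the figures above
```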
4.5 Activation Recomputation
To reduce peak activation memory, activation recomputation (or checkpointing) is often used. The idea is:
- In the forward pass, we do not store all intermediate activations.
- In the backward pass, we recompute them from stored checkpoints (e.g., re-run part of the forward pass) before proceeding with gradient computations.
This trades extra computation for less memory usage and can cut activation memory from $O(l)$ to something smaller like $O(\sqrt{l})$, depending on the strategy. In practice, a common approach is to only store the activations at certain checkpoints (e.g., after each transformer block) and recompute the missing parts in the backward pass.
5. Conclusion
In this class, we explored how to estimate and analyze key aspects of training for large language models:
- Parameter Count
  - For a transformer-based LLM, each layer has approximately $12h^2 + 13h$ parameters, plus $Vh$ for the embeddings, leading to a total of \(l(12h^2+13h)+Vh.\)
  - When $h$ is large, we often approximate this as $12\,l\,h^2$.
- Memory Usage
- During training, parameters, gradients, and optimizer states typically use about $20\,\Phi$ bytes under mixed precision with AdamW (where $\Phi$ is the total parameter count).
- Intermediate activations can exceed parameter storage, especially with large batch size $b$ and long sequence length $s$. Techniques like activation recomputation help reduce this memory footprint.
- During inference, only parameters (2 bytes each in float16) and the KV cache are major memory consumers.
- FLOP Estimation
- Roughly 2 FLOPs per token-parameter during a forward pass (one multiplication + one addition).
- Training (forward + backward) yields about 6 FLOPs per token-parameter if no recomputation is used, or 8 FLOPs per token-parameter if activation recomputation is used.
By dissecting these components, we gain a clearer picture of why training large language models requires extensive memory and computation, and how various strategies (e.g., activation recomputation, KV cache) are applied to optimize hardware resources. Such understanding is crucial for practitioners to make informed decisions about scaling laws, distributed training setups, and memory-saving techniques.
6. References
- Raffel C, Shazeer N, Roberts A, et al. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 2020, 21(1): 5485-5551.
- Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. Advances in neural information processing systems, 2017, 30.
- Brown T, Mann B, Ryder N, et al. Language models are few-shot learners. Advances in neural information processing systems, 2020, 33: 1877-1901.
- Touvron H, Lavril T, Izacard G, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- Sheng Y, Zheng L, Yuan B, et al. High-throughput generative inference of large language models with a single gpu. arXiv preprint arXiv:2303.06865, 2023.
- Korthikanti V, Casper J, Lym S, et al. Reducing activation recomputation in large transformer models. arXiv preprint arXiv:2205.05198, 2022.
- Narayanan D, Shoeybi M, Casper J, et al. Efficient large-scale language model training on gpu clusters using megatron-lm. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2021: 1-15.
- Smith S, Patwary M, Norick B, et al. Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990, 2022.
Lecture 5: Decoder-only Transformer (LLM) vs Vanilla Transformer: A Detailed Comparison
Introduction
Modern Large Language Models (LLMs) are primarily based on decoder-only transformer architectures, while the original transformer model (“vanilla transformer”) uses an encoder-decoder structure. This tutorial will explore the differences between these two architectures in detail, including their respective advantages, disadvantages, and application scenarios.
Vanilla Transformer Architecture
In 2017, Vaswani et al. introduced the original transformer architecture in their paper “Attention is All You Need.”
Key Features
- Dual-module Design: Consists of both encoder and decoder components
- Encoder:
- Processes the input sequence
- Composed of multiple layers of self-attention and feed-forward networks
- Each token can attend to all other tokens in the sequence
- Decoder:
- Generates the output sequence
- Contains self-attention layers, encoder-decoder attention layers, and feed-forward networks
- Uses masked attention to ensure predictions only depend on already generated tokens
Workflow
- Encoder receives and processes the complete input sequence
- Decoder generates output tokens one by one
- When generating each token, the decoder accesses the complete representation from the encoder through cross-attention
Application Scenarios
Mainly used for sequence-to-sequence (seq2seq) tasks, such as:
- Machine translation
- Text summarization
- Dialogue systems
Decoder-only Transformer (LLM) Architecture
Modern LLMs like the GPT (Generative Pre-trained Transformer) series adopt a simplified decoder-only architecture.
Key Features
- Single-module Design: Only retains the decoder part of the transformer (but removes the cross-attention layer)
- Autoregressive Generation: Predicts the next token based on previous tokens
- Masked Self-attention: Ensures each position can only attend to positions before it
- Scale Expansion: Parameter count is typically much larger than vanilla transformers
Workflow
- The model receives a partial sequence as input (prompt)
- Using an autoregressive approach, it predicts and generates subsequent tokens one by one
- Each newly generated token is added to the input for predicting the next token
Advantages
- Simplified Architecture: Removing the encoder simplifies the design
- Unified Framework: Views all NLP tasks as text completion problems
- Long-text Generation: Particularly suitable for open-ended generation tasks
- Scalability: Proven to scale to hundreds of billions of parameters
Key Differences Comparison
| Feature | Vanilla Transformer | Decoder-only Transformer |
|---|---|---|
| Architecture | Encoder-Decoder | Decoder only |
| Attention Mechanism | Encoder: bidirectional attention; Decoder: unidirectional masked attention + cross-attention | Only unidirectional masked self-attention |
| Information Processing | Encoder encodes the entire input; decoder can access the complete encoded information | Can only access previously generated tokens |
| Task Adaptability | Better for explicit transformation tasks | Better for open-ended generation tasks |
| Inference Process | Input processed at once, then output generated step by step | Autoregressive generation, each step depends on previously generated content |
| Parameter Efficiency | Higher for specific tasks | Requires more parameters to achieve similar performance |
| Main Representatives | BERT (encoder-only), T5, BART | GPT series, LLaMA, Claude |
Technical Details
Positional Encoding
Both architectures use positional encoding, but implementation differs:
- Vanilla: Uses fixed sine and cosine functions
- Modern LLMs: Typically use learnable positional encodings or Rotary Position Embedding (RoPE)
Pre-training Methods
- Encoder-style models (e.g., BERT): Masked Language Modeling (MLM) and Next Sentence Prediction (NSP)
- Decoder-only: Autoregressive language modeling, predicting the next token
Attention Mechanism
// Bidirectional self-attention calculation in Vanilla transformer (simplified)
Q = X * Wq
K = X * Wk
V = X * Wv
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V
// Masked self-attention in Decoder-only transformer
// The main difference is using a mask matrix to ensure position i can only attend to positions 0 to i
mask = generateCausalMask(seq_length) // lower triangular matrix
Attention(Q, K, V) = softmax((QK^T / sqrt(d_k)) + mask) * V
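The pseudocode above can be made runnable. Below is a minimal NumPy sketch; the random inputs and dimensions are illustrative assumptions, and the causal mask is implemented by setting the scores strictly above the diagonal to $-\infty$ before the softmax.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv, causal=False):
    """Bidirectional self-attention (vanilla encoder) or causal masked self-attention (decoder-only)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    if causal:
        # Mask out future positions so position i only attends to positions 0..i
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)  # True strictly above the diagonal
        scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Illustrative: 5 tokens with model dimension 8 and random projections
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv, causal=True).shape)  # (5, 8)
```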
Why Have Decoder-only Models Become Mainstream?
- Simplicity: Removes complex encoder-decoder interactions
- Unified Interface: Can transform various NLP tasks into the same format
- Scalability: Proven to scale effectively to massive sizes
- Generalization Ability: Achieves remarkable generalization through large-scale pre-training
Conclusion
While the vanilla transformer architecture excels in specific tasks, the decoder-only architecture has become the preferred choice for modern LLMs due to its simplicity, scalability, and flexibility. Understanding the differences between these architectures is crucial for comprehending current developments in the NLP field.
Each has its advantages, and the choice of architecture should be based on specific task requirements:
- For tasks requiring bidirectional understanding and explicit transformation: Consider vanilla transformers or encoder-only models
- For open-ended generation and general AI capabilities: Decoder-only LLMs are more suitable
Artificial intelligence is developing rapidly, and these architectures continue to evolve, but understanding the fundamental differences will help grasp future development directions.
Lecture 6: Efficient Text Generation of Decoder-Only Transformers: KV-Cache
1. KV Cache
For faster generative inference, transformers often use a KV cache, which stores keys and values from previous tokens so that each new token only attends to the previously computed K and V rather than recomputing them from scratch.
Inference Without KV Cache
During the generation process without KV Cache, each new token is produced as follows:
Initial Token:
Start with the initial token (e.g., a start-of-sequence token). Compute its Q, K, and V vectors, and apply the attention mechanism to generate the first output token.
Subsequent Tokens:
For each new token, recompute the Q, K, and V vectors for the entire sequence (including all previous tokens), and apply the attention mechanism to generate the next token. This approach leads to redundant computations, as the K and V vectors for previously processed tokens are recalculated at each step, resulting in increased computational load and latency.
Inference With KV Cache
The KV Cache technique addresses the inefficiencies of the above method by storing the Key and Value vectors of previously processed tokens:
Initial Token:
Compute the Q, K, and V vectors for the initial token and store the K and V vectors in the cache.
Subsequent Tokens:
For each new token, compute its Q, K, and V vectors. Instead of recomputing K and V for all previous tokens, retrieve them from the cache. Apply the attention mechanism using the current Q vector and the cached K and V vectors to generate the next token. By caching the K and V vectors, the model avoids redundant computations, leading to faster inference times.
With the KV Cache, a typical inference process has two phases:
- Prefill Phase: The full prompt sequence ($s$ tokens) is fed into the model, generating the key and value cache for each layer.
- Decoding Phase: Tokens are generated one by one (or in small batches), each time updating and using the cached keys and values.
1.1 Memory Usage of the KV Cache
Suppose the input sequence length is $s$, and we want to generate $n$ tokens. Let $b$ be the inference batch size (number of parallel sequences). We store $K, V \in \mathbb{R}^{b \times (s+n) \times h}$ in float16. Each element is 2 bytes, and we have both $K$ and $V$, so the memory cost per layer is:
\[2 \;\text{(for K and V)} \;\times\; b(s+n)h \;\times\; 2\,\text{bytes} = 4\,b\,(s+n)\,h.\]For $l$ layers, the total KV cache memory is: \(4\,l\,b\,h\,(s+n).\)
GPT-3 Example
Recall that GPT-3’s parameters occupy about 350 GB in float16. Suppose we do inference with batch size $b=64$, prompt length $s=512$, and we generate $n=32$ tokens:
- Model parameters: 350 GB
- KV cache:
\(4\,l\,b\,h\,(s+n)\approx 164\,\text{GB}\) which is nearly half the parameter size under these settings.
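A minimal sketch reproducing this estimate, using GPT-3's $l = 96$ and $h = 12288$ together with the batch size, prompt length, and generation length above:

```python
def kv_cache_bytes(l, b, h, s, n):
    """KV cache size: 2 tensors (K and V) * 2 bytes (float16) * b*(s+n)*h per layer, over l layers."""
    return 4 * l * b * h * (s + n)

# GPT-3: 96 layers, h = 12288, batch 64, prompt 512 tokens, generate 32 tokens
print(f"{kv_cache_bytes(96, 64, 12288, 512, 32) / 1e9:.0f} GB")  # ~164 GB
```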
2. Conclusion
In this class, we explored how to estimate and analyze key aspects of inference for large language models:
- The KV cache: a powerful mechanism for fast autoregressive decoding, storing keys and values for each token in float16 to avoid recomputing them.
- The total KV cache scales with $b(s+n)h$ per layer.
By dissecting these components, we gain a clearer picture of why inference with large language models requires careful memory management, and how strategies such as the KV cache are applied to optimize hardware resources. Such understanding is crucial for practitioners to make informed decisions about serving large models efficiently.
Lecture 7: Decoding Algorithms in Large Language Models (LLMs)
Decoding algorithms are pivotal in determining how Large Language Models (LLMs) generate text sequences. These methods influence the coherence, diversity, and overall quality of the output. This tutorial delves into various decoding strategies, elucidating their mechanisms and applications.
1. Introduction to Decoding in LLMs
Decoding in LLMs refers to the process of generating text based on the model’s learned probabilities. Given a context or prompt, the model predicts subsequent tokens to construct coherent and contextually relevant text. The choice of decoding strategy significantly impacts the nature of the generated content.
2. Common Decoding Strategies
2.1 Greedy Search
Greedy Search selects the token with the highest probability at each step, aiming for immediate optimality.
Mechanism:
- Step 1: Start with an initial prompt.
- Step 2: At each position $t$, choose the token $x_t$ that maximizes the conditional probability $P(x_t \mid x_{1:t-1})$.
- Step 3: Append $x_t$ to the sequence.
- Step 4: Repeat until a stopping criterion is met (e.g., end-of-sequence token).
Example:
Given the prompt “The capital of France is”, the model might generate “Paris” by selecting the highest-probability token at each step.
Advantages:
- Simple and computationally efficient.
Disadvantages:
- May produce repetitive or generic text.
- Lacks diversity and can miss alternative plausible continuations.
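For concreteness, here is a minimal sketch of a greedy decoding loop; `next_token_logits` is a hypothetical stand-in for a real model's forward pass, and the toy model at the bottom is purely illustrative.

```python
import numpy as np

def greedy_decode(prompt_ids, next_token_logits, max_new_tokens=20, eos_id=0):
    """At each step, append the argmax token until EOS or the length limit."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(ids)   # hypothetical model call -> (vocab_size,)
        token = int(np.argmax(logits))    # highest-probability next token
        ids.append(token)
        if token == eos_id:
            break
    return ids

# Toy "model": always prefers token (last_token + 1) mod 5
toy = lambda ids: np.eye(5)[(ids[-1] + 1) % 5]
print(greedy_decode([1], toy, max_new_tokens=5))  # [1, 2, 3, 4, 0]
```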
2.2 Beam Search
Beam Search maintains multiple candidate sequences (beams) simultaneously, balancing exploration and exploitation.
Mechanism:
- Step 1: Initialize with the prompt, creating the initial beam.
- Step 2: At each step $t$, expand each beam by all possible next tokens.
- Step 3: Score each expanded sequence using a scoring function, often the sum of log probabilities.
- Step 4: Retain the top $B$ beams based on scores, where $B$ is the beam width.
- Step 5: Repeat until beams reach a stopping criterion.
Example:
For a beam width of 3, the model explores three parallel sequences, selecting the most probable completions among them.
Advantages:
- Explores multiple hypotheses, reducing the risk of suboptimal sequences.
Disadvantages:
- Computationally more intensive than Greedy Search.
- Can still produce repetitive outputs if not combined with other techniques.
2.3 Sampling-Based Methods
Sampling introduces randomness into the generation process, allowing for more diverse outputs.
2.3.1 Random Sampling
Tokens are selected randomly based on their conditional probabilities.
Mechanism:
- Step 1: Compute the probability distribution over the vocabulary for the next token.
- Step 2: Sample a token from this distribution.
- Step 3: Append the sampled token to the sequence.
- Step 4: Repeat until a stopping criterion is met.
Example:
Given the prompt “Once upon a time”, the model might generate various continuations like “a princess lived” or “a dragon roamed”, depending on the sampling.
Advantages:
- Produces varied and creative outputs.
Disadvantages:
- Can lead to incoherent or less relevant text.
- Quality depends heavily on the underlying probability distribution.
2.3.2 Top-k Sampling
Limits the sampling pool to the top $k$ tokens with the highest probabilities.
Mechanism:
- Step 1: Compute the probability distribution for the next token.
- Step 2: Select the top $k$ tokens with the highest probabilities.
- Step 3: Normalize the probabilities of these $k$ tokens.
- Step 4: Sample a token from this restricted distribution.
- Step 5: Append the sampled token to the sequence.
- Step 6: Repeat until a stopping criterion is met.
Example:
With $k = 50$, the model considers only the top 50 probable tokens at each step, introducing controlled randomness.
Advantages:
- Balances diversity and coherence.
- Reduces the chance of selecting low-probability, irrelevant tokens.
Disadvantages:
- The choice of $k$ is crucial; too high or too low can affect output quality.
2.3.3 Top-p (Nucleus) Sampling
Considers the smallest set of top tokens whose cumulative probability exceeds a threshold $p$.
Mechanism:
- Step 1: Compute the probability distribution for the next token.
- Step 2: Sort tokens by probability in descending order.
- Step 3: Select the smallest set of tokens whose cumulative probability is at least $p$.
- Step 4: Normalize the probabilities of these tokens.
- Step 5: Sample a token from this distribution.
- Step 6: Append the sampled token to the sequence.
- Step 7: Repeat until a stopping criterion is met.
Example:
With $p = 0.9$, the model dynamically adjusts the number of tokens considered at each step, ensuring that 90% of the probability mass is covered.
Advantages:
- Adapts the sampling pool size based on the distribution, providing flexibility.
- Often results in more natural and coherent text.
Disadvantages:
- Requires careful tuning of $p$ to balance diversity and coherence.
2.4 Temperature Scaling
Temperature scaling adjusts the sharpness of the probability distribution before sampling.
Mechanism:
- Step 1: Compute the logits (unnormalized probabilities) for the next token.
- Step 2: Divide the logits by the temperature $T$ (a positive scalar).
- Step 3: Apply the softmax function to obtain the adjusted probabilities.
- Step 4: Sample a token from this adjusted distribution.
- Step 5: Append the sampled token to the sequence.
- Step 6: Repeat until a stopping criterion is met.
Example:
- With $T = 1$, the distribution remains unchanged.
- With $T < 1$, the distribution becomes sharper, making high-probability tokens more likely.
- With $T > 1$, the distribution flattens, allowing for more diverse token selection.
Advantages:
- Provides control over the randomness of the output.
- Can be combined with other decoding strategies to fine-tune generation behavior.
Disadvantages:
- Setting $T$ too high can lead to incoherent text; too low can make the output deterministic.
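These three controls are often combined in practice. Below is a minimal NumPy sketch that applies temperature scaling, then optional top-$k$ and top-$p$ filtering, to a single vector of toy logits; the specific logits and settings are illustrative assumptions.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Apply temperature scaling, then optional top-k / top-p filtering, then sample."""
    rng = np.random.default_rng() if rng is None else rng
    logits = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    if top_k is not None:
        # Keep only the k most probable tokens
        kth_largest = np.sort(probs)[-min(top_k, len(probs))]
        probs = np.where(probs >= kth_largest, probs, 0.0)
    if top_p is not None:
        # Keep the smallest set of tokens whose cumulative probability reaches p
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        n_keep = int(np.searchsorted(cumulative, top_p) + 1)
        mask = np.zeros_like(probs)
        mask[order[:n_keep]] = 1.0
        probs *= mask

    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# Toy 5-token vocabulary: low temperature plus top-k/top-p concentrates on the likeliest tokens
logits = [2.0, 1.0, 0.5, 0.1, -1.0]
rng = np.random.default_rng(0)
print([sample_next_token(logits, temperature=0.7, top_k=3, top_p=0.9, rng=rng) for _ in range(8)])
```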