\[\newcommand{\nl}[1]{\textsf{#1}} \newcommand{\attention}{\text{Attention}}\]

Welcome to EECS 224! This is a graduate course on understanding and developing large language models (LLMs).

What is a language model?
What are large language models?
Grading of This Course
In-Course Questions

What is Language?

Language is a systematic means of communicating ideas or feelings using conventionalized signs, sounds, gestures, or marks.

Text in Language

Text represents the written form of language, converting speech and meaning into visual symbols. Key aspects include:

Basic Units of Text

Text can be broken down into hierarchical units:

Characters: The smallest units of a writing system
Words: Combinations of characters that carry meaning
Sentences: Groups of words expressing complete thoughts
Paragraphs: Collections of related sentences
Documents: Complete texts serving a specific purpose

Text Properties

Text demonstrates several key properties:

Linearity: Written symbols appear in sequence
Discreteness: Clear boundaries between units
Conventionality: Agreed-upon meanings within a language community
Structure: Follows grammatical and syntactic rules
Context: Meaning often depends on surrounding text

Because different languages share the properties above, NLP researchers have developed a unified machine learning technique for modeling language data: large language models. Let’s start learning this unified language modeling technique.

[Figure: Words in documents that get filtered out of C4]

What is a Language Model?

Mathematical Definition

A language model is fundamentally a probability distribution over sequences of words or tokens. Mathematically, it can be expressed as:

\[P(w_1, w_2, ..., w_n) = \prod_{i=1}^{n} P(w_i|w_1, ..., w_{i-1})\]

where:

\(w_1, w_2, ..., w_n\) represents a sequence of words or tokens
The conditional probability of word \(w_i\) given all previous words is:
\[P(w_i|w_1, ..., w_{i-1})\]
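As a small illustration of the factorization above, the sketch below multiplies per-step conditional probabilities to obtain the probability of a whole sequence. The table `cond_prob` and its values are invented for this example; a real model would compute these conditionals rather than look them up.

```python
import math

# Hypothetical conditional probabilities P(w_i | w_1, ..., w_{i-1}),
# keyed by (prefix tuple, next word). The numbers are invented for illustration.
cond_prob = {
    ((), "the"): 0.5,
    (("the",), "cat"): 0.2,
    (("the", "cat"), "sleeps"): 0.1,
}

def sequence_log_prob(words):
    """Sum log P(w_i | w_1..w_{i-1}) over positions (chain rule in log space)."""
    total = 0.0
    for i, word in enumerate(words):
        prefix = tuple(words[:i])
        total += math.log(cond_prob[(prefix, word)])
    return total

log_p = sequence_log_prob(["the", "cat", "sleeps"])
print(math.exp(log_p))  # approximately 0.5 * 0.2 * 0.1 = 0.01
```

Working in log space is a common design choice: products of many small probabilities underflow quickly, while sums of logs stay well behaved.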

For practical implementation, this often takes the form:

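\[P(w_t \mid \text{context}) = \text{softmax}(h(\text{context}) \cdot W)\]

where:

Target word: \(w_t\)
Context encoding function: \(h(\text{context})\), which maps the context to a vector
Weight matrix: \(W\), with one column of scores per vocabulary word
softmax normalizes the output scores into a probability distribution over the vocabulary; \(P(w_t \mid \text{context})\) is the entry for \(w_t\)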
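The output layer described above can be sketched in a few lines. The vector `h`, the matrix `W`, and their sizes are hypothetical values chosen for readability, and the helper names are not taken from any particular library.

```python
import math

def softmax(scores):
    """Turn arbitrary real scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def next_word_distribution(h, W):
    """Compute softmax(h . W), where h has length d and W is d x V (list of rows)."""
    vocab_size = len(W[0])
    logits = [sum(h[k] * W[k][j] for k in range(len(h))) for j in range(vocab_size)]
    return softmax(logits)

# Hypothetical 2-dimensional context encoding and a 2 x 3 weight matrix
h = [1.0, -1.0]
W = [[0.5, 0.0, -0.5],
     [0.0, 1.0, 0.5]]
probs = next_word_distribution(h, W)
print(probs)  # three probabilities, one per vocabulary word
```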

Why Use Conditional Probability in Language Models?

Core Insight

From a classification perspective, the number of categories directly affects learning difficulty: the more categories there are, the more training data is needed to cover each of them adequately.

Comparing Two Approaches

Joint Probability Approach

When modeling \(P(w_1,...,w_n)\) directly:

Needs to predict among \(V^n\) categories, where \(V\) is the vocabulary size and \(n\) is the sequence length
Requires seeing enough samples of each possible sentence
Most long sequences may never appear in training data
Makes learning practically impossible

Conditional Probability Approach

When modeling \(P(w_i|w_1,...,w_{i-1})\):

Only predicts \(V\) categories at each step
Each word position provides a training sample
Same words in different contexts contribute learning signals
Dramatically improves data efficiency
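To make the point that each word position provides a training sample concrete, the following sketch expands one sentence into (context, target) pairs. The function name `next_token_examples` is made up for this illustration.

```python
def next_token_examples(tokens):
    """Return one (context, target) pair per position in the sequence."""
    return [(tokens[:i], tokens[i]) for i in range(len(tokens))]

sentence = ["the", "cat", "chased", "the", "mouse"]
for context, target in next_token_examples(sentence):
    print(context, "->", target)
```

A sentence of five words therefore yields five prediction problems, each over the same \(V\) categories.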

Numerical Example

Consider a language model with:

Vocabulary size \(V = 10{,}000\)
Sequence length \(n = 5\)

Then:

Joint probability: Must learn \(10{,}000^5 = 10^{20}\) categories
Conditional probability: Must learn \(10{,}000\) categories at each of the 5 steps
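The arithmetic in this example is easy to check directly; the variable names below are chosen for this sketch only.

```python
V = 10_000   # vocabulary size
n = 5        # sequence length

joint_categories = V ** n      # distinct sequences of length n
per_step_categories = V        # choices at each individual step

print(joint_categories)        # 100000000000000000000, i.e. 10**20
print(per_step_categories)     # 10000
```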

Why This Matters

Training Data Requirements
- More categories require more training examples
- Each category needs sufficient representation
- Under the joint formulation, the category count (and with it the data requirement) grows exponentially with sequence length
Learning Efficiency
- Smaller category spaces are easier to model
- More efficient use of training data
- Each word occurrence contributes to learning
Statistical Coverage
- Impossible to see all possible sequences
- But possible to see all words in various contexts
- Makes learning feasible with finite training data

Conclusion

The conditional probability formulation cleverly transforms an intractable large-scale classification problem into a series of manageable smaller classification problems. This is the fundamental reason why language models can learn effectively from finite training data.

Real-world Application: Text Completion

The Prefix-based Generation Task

In practical applications, we often:

Have a fixed prefix of text
Need to predict/generate the continuation
Don’t need to generate text from scratch

Examples

Auto-completion
- Code completion in IDEs
- Search query suggestions
- Email text completion
Text Generation
- Story continuation
- Dialogue response generation
- Document completion

Why Conditional Probability Helps

The formulation \(P(w_i|w_1,...,w_{i-1})\) naturally fits this scenario because:

We can directly condition on the given prefix
No need to model the probability of the prefix itself
Can focus computational resources on predicting what comes next

Comparison with Joint Probability

The joint probability \(P(w_1,...,w_n)\) would be less suitable because:

Would need to model probability of the fixed prefix
Wastes computation on already-known parts
Doesn’t directly give us what we want (continuation probability)

This alignment between the mathematical formulation and practical use cases is another key advantage of the conditional probability approach in language modeling.

The Transformer Model: Revolutionizing Language Models

The emergence of the Transformer architecture marked a paradigm shift in how machines process human language. Unlike earlier recurrent models, which struggled with long-range patterns in text, the Transformer uses self-attention to relate every position in a sequence directly to every other position.

The Building Blocks of Language Understanding

From Text to Machine-Readable Format

Before any sophisticated processing can occur, raw text must be converted into a format that machines can process. This happens in two crucial stages:

Text Segmentation: The first challenge is breaking down text into meaningful units. Imagine building with LEGO blocks - just as you need individual blocks to create complex structures, language models need discrete pieces of text to work with. These pieces, called tokens, might be:
- Complete words
- Parts of words
- Individual characters
- Special symbols

For instance, the phrase “artificial intelligence” might become [“art”, “ificial”, “intel”, “ligence”], allowing the model to recognize patterns even in unfamiliar words.
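One simple way to produce such pieces is greedy longest-match segmentation against a fixed vocabulary, sketched below. This is an illustrative stand-in, not the algorithm of any specific tokenizer (production systems typically learn their vocabularies, for example with byte-pair encoding), and the vocabulary here is hypothetical. Unlike the example above, this sketch also emits the space as its own token.

```python
def greedy_tokenize(text, vocab):
    """Split text into the longest vocabulary pieces, scanning left to right.

    Characters not covered by any vocabulary piece are emitted on their own.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # fallback: single character
            i += 1
    return tokens

vocab = {"art", "ificial", "intel", "ligence", " "}
print(greedy_tokenize("artificial intelligence", vocab))
# ['art', 'ificial', ' ', 'intel', 'ligence']
```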

Numerical Representation: Once we have our text pieces, each token gets transformed into a numerical vector - essentially a long list of numbers. Think of this as giving each word or piece its own unique mathematical “fingerprint” that captures its meaning and relationships with other words.
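In its simplest form, this numerical representation is a table lookup: each token has an integer id, and the id selects a row of an embedding table. The sketch below uses random values and a tiny dimension purely for illustration; in a trained model the rows are learned, and all names here are hypothetical.

```python
import random

random.seed(0)  # fixed seed so the example is repeatable

vocab = ["art", "ificial", "intel", "ligence"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}
embedding_dim = 4  # tiny for readability; real models use hundreds or thousands

# One row per token; in a trained model these values are learned
embedding_table = [[random.uniform(-1, 1) for _ in range(embedding_dim)]
                   for _ in vocab]

def embed(tokens):
    """Map each token to its vector by looking up its row in the table."""
    return [embedding_table[token_to_id[tok]] for tok in tokens]

vectors = embed(["intel", "ligence"])
print(len(vectors), len(vectors[0]))  # 2 4
```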

Adding Sequential Understanding

One of the most innovative aspects of Transformers is how they handle word order. Rather than treating text like a bag of unrelated words, the architecture adds precise positional information to each token’s representation.

Consider how the meaning changes in these sentences:

“The cat chased the mouse”
“The mouse chased the cat”

The words are identical, but their positions completely change the meaning. The Transformer’s positional encoding system ensures this crucial information isn’t lost.
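One well-known way to inject position is the sinusoidal scheme from the original Transformer paper, sketched below. The text above does not say which scheme is meant, so treat this choice as an assumption; many models use learned or rotary position encodings instead.

```python
import math

def sinusoidal_position(pos, d_model):
    """Encoding for one position: sin on even dimensions, cos on odd ones."""
    vec = []
    for i in range(d_model):
        angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec

def add_positions(embeddings):
    """Add the position vector to each token embedding, element by element."""
    d_model = len(embeddings[0])
    return [[e + p for e, p in zip(emb, sinusoidal_position(pos, d_model))]
            for pos, emb in enumerate(embeddings)]

# The same (hypothetical) embedding at positions 0 and 1 ends up different
same = [[0.1, 0.2, 0.3, 0.4], [0.1, 0.2, 0.3, 0.4]]
out = add_positions(same)
print(out[0] != out[1])  # True
```

Because the position vector differs at every index, the two occurrences of “the” in the example sentences would receive different representations even though they share one embedding.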

The Heart of the System: Information Processing

Context Through Self-Attention

The true magic of Transformers lies in their attention mechanism. Unlike humans who must read text sequentially, Transformers can simultaneously analyze relationships between all words in a text. This is similar to how you might solve a complex puzzle:

First, you look at all the pieces simultaneously
Then, you identify which pieces are most likely to connect
Finally, you use these relationships to build the complete picture

In language, this means the model can:

Resolve pronouns (“She picked up her book” - who is “her” referring to?)
Understand idiomatic expressions (“kicked the bucket” means something very different from “kicked the ball”)
Grasp long-distance dependencies (“The keys, which I thought I had left on the kitchen counter yesterday morning, were actually in my coat pocket”)
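The core computation behind this mechanism, scaled dot-product attention, can be sketched in plain Python. As a simplifying assumption, the example feeds the same vectors in as queries, keys, and values; a real Transformer layer first applies learned projections and uses several attention heads. The input vectors are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this position against every position, then normalize
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the weighted average of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three hypothetical token vectors; the first and third are similar
x = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
outputs, weights = self_attention(x, x, x)
print([round(w, 3) for w in weights[0]])
```

The first token places more weight on the similar third token than on the dissimilar second one, which is the sense in which attention identifies pieces that are likely to connect.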

After the attention mechanism identifies relevant connections, the information passes through a series of specialized neural networks. These networks:

Combine and transform the gathered context
Extract higher-level patterns
Refine the understanding of each piece of text

Generation and Decision Making

The final stage converts all this processed information into useful output. The task may be:

Completing a sentence
Translating text
Answering a question
Summarizing a document

In each case, the model produces a probability distribution over possible next tokens and selects its output from it. This is similar to a skilled writer choosing the right word from their vocabulary, considering both meaning and context.
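Two common ways to select an output from a probability distribution are greedy selection and temperature-controlled sampling. The sketch below shows both; the candidate words and scores are hypothetical, and the function names are made up for this example.

```python
import math
import random

def softmax(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(vocab, logits):
    """Always return the single most probable token."""
    probs = softmax(logits)
    return vocab[probs.index(max(probs))]

def sample_pick(vocab, logits, temperature=1.0, rng=random):
    """Draw a token at random in proportion to its probability."""
    probs = softmax(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]

vocab = ["mat", "sofa", "moon"]     # hypothetical candidate next words
logits = [2.0, 1.0, -1.0]           # hypothetical model scores
print(greedy_pick(vocab, logits))   # mat
print(sample_pick(vocab, logits, temperature=0.7, rng=random.Random(0)))
```

Lower temperatures concentrate probability on the top candidates, while higher temperatures make less likely words more competitive.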

Real-World Applications and Impact

The Transformer architecture has enabled breakthrough applications in:

Cross-Language Communication
- Real-time translation systems
- Multilingual document processing
- Cultural context adaptation
Content Creation and Analysis
- Automated report generation
- Text summarization
- Content recommendations
Specialized Industry Applications
- Legal document analysis
- Medical record processing
- Scientific literature review

The Road Ahead

As this architecture continues to evolve, we’re seeing:

More efficient processing methods
Better handling of specialized domains
Improved understanding of contextual nuances
Enhanced ability to work with multimodal inputs

The Transformer architecture represents more than just a technical advancement - it’s a fundamental shift in how machines can understand and process human language. Its impact continues to grow as researchers and developers find new ways to apply and improve upon its core principles.

The true power of Transformers lies not just in their technical capabilities, but in how they’ve opened new possibilities for human-machine interaction and understanding. As we continue to refine and build upon this architecture, we’re moving closer to systems that can truly understand and engage with human language in all its complexity and nuance.

What are large language models?

Large language models are Transformers with billions to trillions of parameters, trained on massive amounts of text data. These models have several distinguishing characteristics:

Scale: Models contain billions of parameters and are trained on hundreds of billions to trillions of tokens
Architecture: Based on the Transformer architecture with self-attention mechanisms
Emergent abilities: Complex capabilities that emerge with scale
Few-shot learning: Ability to adapt to new tasks with few examples

Large Language Models (LLMs): A Comprehensive Introduction

What are Large Language Models?

Definition: Large Language Models are artificial intelligence systems trained on vast amounts of text data, often containing tens to hundreds of billions of parameters. Unlike traditional AI models, they can understand and generate human-like text across a wide range of tasks and domains.
Scale and Architecture:
- Typically contain >10B parameters (Some exceed 500B)
- Built on Transformer architecture with attention mechanisms
- Require massive computational resources for training
- Examples: GPT-3 (175B), PaLM (540B), LLaMA (65B)
Key Capabilities:
- Natural language understanding and generation
- Task adaptation without fine-tuning
- Complex reasoning and problem solving
- Knowledge storage and retrieval
- Multi-turn conversation

Historical Evolution

1. Statistical Language Models (SLM) - 1990s

Core Technology: Used statistical methods to predict next words based on previous context
Key Features:
- N-gram models (bigram, trigram)
- Markov assumption for word prediction
- Used in early IR and NLP applications
Limitations:
- Curse of dimensionality
- Data sparsity issues
- Limited context window
- Required smoothing techniques
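A bigram model with add-one smoothing shows these ideas in miniature: the Markov assumption limits the context to one previous word, and smoothing keeps unseen word pairs from receiving zero probability. The corpus and function names below are invented for this sketch.

```python
from collections import Counter

def train_bigram(corpus):
    """Count contexts and bigrams over a list of tokenized sentences."""
    context_counts, bigram_counts, vocab = Counter(), Counter(), set()
    for sentence in corpus:
        tokens = ["<s>"] + sentence          # <s> marks the sentence start
        vocab.update(tokens)
        context_counts.update(tokens[:-1])   # every token that has a successor
        bigram_counts.update(zip(tokens, tokens[1:]))
    return context_counts, bigram_counts, vocab

def bigram_prob(prev, word, context_counts, bigram_counts, vocab):
    """P(word | prev) with add-one smoothing so unseen pairs are not zero."""
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + len(vocab))

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
context_counts, bigram_counts, vocab = train_bigram(corpus)

print(bigram_prob("the", "cat", context_counts, bigram_counts, vocab))  # (1+1)/(2+5)
print(bigram_prob("the", "sat", context_counts, bigram_counts, vocab))  # (0+1)/(2+5)
```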

2. Neural Language Models (NLM) - 2013

Core Technology: Neural networks for language modeling
Key Advances:
- Distributed word representations
- Multi-layer perceptron and RNN architectures
- End-to-end learning
- Better feature extraction
Impact:
- Word2vec and similar embedding models
- Improved generalization
- Reduced need for feature engineering

3. Pre-trained Language Models (PLM) - 2018

Core Technology: Transformer-based models with pre-training
Key Innovations:
- BERT and bidirectional context modeling
- GPT and autoregressive modeling
- Transfer learning approach
- Fine-tuning paradigm
Benefits:
- Context-aware representations
- Better task performance
- Reduced need for task-specific data
- More efficient training

4. Large Language Models (LLM) - 2020+

Core Technology: Scaled-up Transformer models
Major Breakthroughs:
- Emergence of new abilities with scale
- Few-shot and zero-shot learning
- General-purpose problem solving
- Human-like interaction capabilities
Key Examples:
- GPT-3: First demonstration of powerful in-context learning
- ChatGPT: Advanced conversational abilities
- GPT-4: Multimodal capabilities and improved reasoning
- PaLM: Enhanced multilingual and reasoning capabilities

Key Features of LLMs

Scaling Laws

KM Scaling Law (OpenAI):
- Describes relationship between model performance and three factors:
  - Model size (N)
  - Dataset size (D)
  - Computing power (C)
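- Mathematical formulation: \(L(N) \propto (N_c/N)^{\alpha_N}\), where \(L\) is the loss, \(N_c\) is a fitted constant, and \(\alpha_N\) is a fitted exponent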
- Predicts diminishing returns
- Helps optimize resource allocation
Chinchilla Scaling Law (DeepMind):
- Focuses on compute-optimal training
- Suggests scaling model size and training data in roughly equal proportion as compute grows
- More efficient resource utilization
- Demonstrated with Chinchilla vs Gopher comparison
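Both scaling laws can be explored numerically. The sketch below is illustrative only and rests on assumptions not stated in these notes: the constants \(N_c \approx 8.8 \times 10^{13}\) and \(\alpha_N \approx 0.076\) are the fits reported by Kaplan et al. (2020), the approximation \(C \approx 6ND\) is a common rule of thumb for training compute, and the ratio of about 20 training tokens per parameter is a frequently quoted reading of the Chinchilla result.

```python
# Constants for L(N) = (N_c / N) ** alpha_N as reported by Kaplan et al. (2020);
# treat them as illustrative rather than universal.
N_C = 8.8e13
ALPHA_N = 0.076

def loss_from_params(n_params):
    """Predicted loss as a function of (non-embedding) parameter count."""
    return (N_C / n_params) ** ALPHA_N

def compute_optimal_split(compute_flops, tokens_per_param=20.0):
    """Split a FLOP budget using C ~ 6*N*D and D ~ tokens_per_param * N.

    Both relations are rules of thumb assumed for this sketch.
    """
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

for n in (1e9, 1e10, 1e11):
    print(f"N={n:.0e}  predicted loss={loss_from_params(n):.3f}")

n_params, n_tokens = compute_optimal_split(5.88e23)
print(f"{n_params:.2e} parameters, {n_tokens:.2e} tokens")
```

Each tenfold increase in \(N\) multiplies the predicted loss by the same factor, \(10^{-\alpha_N}\), so the absolute improvement shrinks as models grow. With the budget shown, the split comes out at roughly 70B parameters and 1.4T tokens.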

Emergent Abilities

In-context Learning
- Definition: Ability to learn from examples in the prompt
- Characteristics:
  - No parameter updates required
  - Few-shot and zero-shot capabilities
  - Task adaptation through demonstrations
- Emergence Point:
  - Becomes effective at ~100B parameters
  - GPT-3 showed first strong results
Instruction Following
- Definition: Ability to understand and execute natural language instructions
- Requirements:
  - Instruction tuning
  - Multi-task training
  - Natural language task descriptions
- Emergence Point:
  - Requires >60B parameters
  - Improves significantly with scale
Step-by-step Reasoning
- Definition: Ability to break down complex problems
- Techniques:
  - Chain-of-thought prompting
  - Self-consistency methods
  - Intermediate step generation
- Benefits:
  - Better problem solving
  - More reliable answers
  - Transparent reasoning process
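Among the techniques listed above, self-consistency is simple to sketch: sample several reasoning paths, extract each final answer, and keep the most frequent one. The sampled answers below are hypothetical, and answer extraction from model output is assumed to have happened already.

```python
from collections import Counter

def self_consistent_answer(sampled_answers):
    """Return the most frequent final answer among sampled reasoning paths."""
    counts = Counter(sampled_answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(sampled_answers)

# Hypothetical final answers extracted from five sampled chains of thought
samples = ["18", "18", "20", "18", "17"]
answer, agreement = self_consistent_answer(samples)
print(answer, agreement)  # 18 0.6
```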

Technical Elements

Architecture

Transformer Base
- Components:
  - Multi-head attention mechanism
  - Feed-forward neural networks
  - Layer normalization
  - Positional encoding
- Variations:
  - Decoder-only (GPT-style)
  - Encoder-decoder (T5-style)
  - Modifications for efficiency
Scaling Considerations
- Hardware Requirements:
  - Distributed training systems
  - Memory optimization
  - Parallel processing
- Architecture Choices:
  - Layer count
  - Hidden dimension size
  - Attention head configuration
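Layer count and hidden dimension largely determine model size. A rough estimate for a decoder-only Transformer is about \(12 d^2\) parameters per layer, assuming a feed-forward hidden size of \(4d\) and ignoring biases and layer normalization. The function below is a back-of-the-envelope sketch under those assumptions, not an exact count for any model.

```python
def approx_decoder_params(n_layers, d_model, vocab_size=0):
    """Rough parameter count for a decoder-only Transformer.

    Per layer: about 4*d^2 for attention (Q, K, V, output projections)
    plus about 8*d^2 for a feed-forward block with hidden size 4*d.
    Biases and layer-norm parameters are ignored.
    """
    per_layer = 12 * d_model ** 2
    embeddings = vocab_size * d_model
    return n_layers * per_layer + embeddings

# Configuration reported for the largest GPT-3 model: 96 layers, d_model = 12288
print(f"{approx_decoder_params(96, 12288) / 1e9:.1f}B")  # 173.9B
```

The estimate lands close to the 175B figure quoted for GPT-3, which suggests the approximation is adequate for planning purposes.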

Training Process

Pre-training
- Data Preparation:
  - Web text
  - Books
  - Code
  - Scientific papers
- Objectives:
  - Next token prediction
  - Masked language modeling
  - Multiple auxiliary tasks
Adaptation Methods
- Instruction Tuning:
  - Natural language task descriptions
  - Multi-task learning
  - Task generalization
- RLHF:
  - Human preference learning
  - Safety alignment
  - Behavior optimization
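The next-token prediction objective listed under pre-training is usually implemented as a cross-entropy loss: the average negative log-probability that the model assigns to the token that actually came next. The probabilities below are hypothetical model outputs chosen for this sketch.

```python
import math

def next_token_loss(step_probs, targets):
    """Average negative log-likelihood of the correct next token.

    step_probs[t] maps each candidate token to the model's probability
    at step t; targets[t] is the token that actually came next.
    """
    losses = [-math.log(probs[target])
              for probs, target in zip(step_probs, targets)]
    return sum(losses) / len(losses)

# Hypothetical model outputs for two steps
step_probs = [
    {"cat": 0.7, "dog": 0.2, "sat": 0.1},
    {"cat": 0.1, "dog": 0.1, "sat": 0.8},
]
targets = ["cat", "sat"]
print(round(next_token_loss(step_probs, targets), 4))  # 0.2899
```

The loss falls toward zero as the model puts more probability on the correct tokens, which is what training optimizes.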

Utilization Techniques

Prompting Strategies
- Basic Prompting:
  - Direct instructions
  - Few-shot examples
  - Zero-shot prompts
- Advanced Methods:
  - Chain-of-thought
  - Self-consistency
  - Tool augmentation
Application Patterns
- Task Types:
  - Generation
  - Classification
  - Question answering
  - Coding
- Integration Methods:
  - API endpoints
  - Model serving
  - Application backends
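The basic prompting strategies listed above differ mainly in how the prompt text is assembled. The sketch below builds a prompt from an instruction, optional worked examples, and a query; the `Input:`/`Output:` layout and the function name are assumptions made for illustration, not a required format.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, worked examples, and a new query into one prompt."""
    lines = [instruction, ""]
    for text, label in examples:
        lines += [f"Input: {text}", f"Output: {label}", ""]
    lines += [f"Input: {query}", "Output:"]
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Classify the sentiment as positive or negative.",
    [("I loved this film.", "positive"), ("The plot was dull.", "negative")],
    "A delightful surprise.",
)
print(prompt)
```

Passing an empty list of examples yields a zero-shot prompt with the same structure.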

Major Milestones

ChatGPT (2022)

Technical Achievements
- Advanced dialogue capabilities
- Robust safety measures
- Consistent persona
- Tool integration
Impact
- Widespread adoption
- New application paradigms
- Industry transformation
- Public AI awareness

GPT-4 (2023)

Key Advances
- Multimodal understanding
- Enhanced reliability
- Better reasoning
- Improved safety
Technical Features
- Predictable scaling
- Vision capabilities
- Longer context window
- Advanced system prompting

Challenges and Future Directions

Current Challenges

Computational Resources
- Training Costs:
  - Massive energy requirements
  - Expensive hardware needs
  - Limited accessibility
- Infrastructure Needs:
  - Specialized facilities
  - Cooling systems
  - Power management
Data Requirements
- Quality Issues:
  - Data cleaning
  - Content filtering
  - Bias mitigation
- Privacy Concerns:
  - Personal information
  - Copyright issues
  - Regulatory compliance
Safety and Alignment
- Technical Challenges:
  - Hallucination prevention
  - Truthfulness
  - Bias detection
- Ethical Considerations:
  - Harm prevention
  - Fairness
  - Transparency

Future Directions

Improved Efficiency
- Architecture Innovation:
  - Sparse attention
  - Parameter efficiency
  - Memory optimization
- Training Methods:
  - Better scaling laws
  - Efficient fine-tuning
  - Reduced compute needs
Enhanced Capabilities
- Multimodal Understanding:
  - Vision-language integration
  - Audio processing
  - Sensor data interpretation
- Reasoning Abilities:
  - Logical deduction
  - Mathematical problem solving
  - Scientific reasoning
Safety Development
- Alignment Techniques:
  - Value learning
  - Preference optimization
  - Safety bounds
- Evaluation Methods:
  - Robustness testing
  - Safety metrics
  - Bias assessment

Summary

LLMs represent a fundamental shift in AI capabilities
Scale and architecture drive emergent abilities
Continuing rapid development in capabilities
Balance between advancement and safety
Growing impact on society and technology
Need for responsible development and deployment
