EECS 224: Large Language Models

This is a new course at UC Merced starting in Spring 2025. In this course, students will learn the fundamentals of the modeling, theory, ethics, and systems aspects of large language models, as well as gain hands-on experience working with them.

The Goal of This Course

Offering useful, fundamental, and detailed LLM knowledge to students.

Coursework

Your grade is based on two activities:

  1. In-Course Question Answering (50%)
  2. Final Projects (50%)

In-Course Question Answering

In each class, students will be asked several questions. Each student may answer a given question only once. The first student to answer a question correctly and explain the answer clearly to the class will be granted 1 credit. Final scores will be calculated from the credits accumulated over the whole semester.

Final Projects

Every student should complete a final project related to LLMs and present it in the final classes. A project report should be completed before the presentation and submitted to the Final Project Google Folder. The file should be named “Your Name.pdf”. The format should follow the 2025 ACL long paper template. Good examples are available at Project Report Examples.

The final project scores will be calculated as 50% instructor’s rating + 50% classmates’ rating. After all the presentations, every student will be asked to vote for the 3 best projects, and the classmates’ rating for each project will be computed from all classmates’ votes.

Lecture 1: Overview of LLMs

  1. What is language?
  2. What is a language model?
  3. What are large language models?

What is Language?

Language is a systematic means of communicating ideas or feelings using conventionalized signs, sounds, gestures, or marks.

Text in Language

Text represents the written form of language, converting speech and meaning into visual symbols. Key aspects include:

Basic Units of Text

Text can be broken down into hierarchical units:

Text Properties

Text demonstrates several key properties:

Based on the above properties shared by different languages, NLP researchers have developed a unified machine learning technique to model language data: large language models. Let’s start learning this unified language modeling technique.

What is a Language Model?

Mathematical Definition

A language model is fundamentally a probability distribution over sequences of words or tokens. Mathematically, it can be expressed as:

\[P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})\]

where:

  • \(w_1, w_2, \ldots, w_n\) is a sequence of \(n\) words or tokens
  • \(P(w_i \mid w_1, \ldots, w_{i-1})\) is the probability of the \(i\)-th token given all preceding tokens

For practical implementation, this often takes the form:

\[P(w_t \mid \text{context}) = \text{softmax}(h(\text{context}) \cdot W)\]

where:

  • \(h(\text{context})\) is a vector representation of the preceding tokens, produced by a neural network
  • \(W\) is an output projection matrix with one column per vocabulary token
  • \(\text{softmax}\) turns the resulting scores into a probability distribution over the vocabulary
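
To make the factorization concrete, here is a minimal Python sketch of scoring a short token sequence with the chain rule. Every value in it (the vocabulary, the context vectors, and the projection matrix) is made up for illustration, and the context is reduced to just the previous token; a real model computes \(h(\text{context})\) with a neural network over the whole prefix.

```python
import numpy as np

# Toy vocabulary and made-up "context representations" (one per possible previous token).
# In a real language model, h(context) would be computed by a neural network over the full prefix.
vocab = ["<s>", "the", "cat", "sat", "</s>"]
h = {w: np.random.randn(8) for w in vocab}   # illustrative context vectors
W = np.random.randn(8, len(vocab))           # output projection matrix (one column per token)

def next_token_probs(prev_token):
    """P(. | context) = softmax(h(context) . W), with the context reduced to the previous token."""
    logits = h[prev_token] @ W
    exp = np.exp(logits - logits.max())       # subtract the max for numerical stability
    return exp / exp.sum()

def sequence_prob(tokens):
    """P(w_1, ..., w_n) = prod_i P(w_i | w_1, ..., w_{i-1})."""
    prob, prev = 1.0, "<s>"
    for tok in tokens:
        prob *= next_token_probs(prev)[vocab.index(tok)]
        prev = tok
    return prob

print(sequence_prob(["the", "cat", "sat", "</s>"]))
```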

Why Use Conditional Probability in Language Models?

Core Insight

From a classification perspective, the number of categories directly impacts the learning difficulty - more categories require exponentially more training data to achieve adequate coverage.

Comparing Two Approaches

Joint Probability Approach

When modeling \(P(w_1,...,w_n)\) directly:

Conditional Probability Approach

When modeling \(P(w_i|w_1,...,w_{i-1})\):

Numerical Example

Consider a language model with:

Then:
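
As an illustrative calculation with assumed numbers (a vocabulary of \(|V| = 50{,}000\) tokens and sequences of length \(n = 10\), chosen purely to show the scale):

\[\underbrace{|V|^{n} = 50{,}000^{10} \approx 9.8 \times 10^{46}}_{\text{joint: one category per possible sequence}} \qquad \text{vs.} \qquad \underbrace{|V| = 50{,}000}_{\text{conditional: categories per prediction step}}\]

No realistic corpus can cover anything close to \(10^{47}\) distinct sequences, while a \(50{,}000\)-way classification at each step is learnable from finite data.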

Why This Matters

Training Data Requirements
Learning Efficiency
Statistical Coverage

Conclusion

The conditional probability formulation cleverly transforms an intractable large-scale classification problem into a series of manageable smaller classification problems. This is the fundamental reason why language models can learn effectively from finite training data.

Real-world Application: Text Completion

The Prefix-based Generation Task

In practical applications, we often:

Examples

  1. Auto-completion
    • Code completion in IDEs
    • Search query suggestions
    • Email text completion
  2. Text Generation
    • Story continuation
    • Dialogue response generation
    • Document completion

Why Conditional Probability Helps

The formulation \(P(w_i|w_1,...,w_{i-1})\) naturally fits this scenario because:

Comparison with Joint Probability

The joint probability \(P(w_1,...,w_n)\) would be less suitable because:

This alignment between the mathematical formulation and practical use cases is another key advantage of the conditional probability approach in language modeling.
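
As a concrete sketch of prefix-based generation, the snippet below uses the Hugging Face transformers library with the small gpt2 checkpoint (both are assumptions made for illustration; any autoregressive model is used the same way). The model repeatedly samples the next token from \(P(w_i|w_1,...,w_{i-1})\) and appends it to the prefix.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a small pretrained autoregressive model (assumes `pip install transformers torch`).
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prefix = "The students in this course will learn"
inputs = tokenizer(prefix, return_tensors="pt")

# Extend the prefix one token at a time, sampling each token from the conditional distribution.
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```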

The Transformer Model: Revolutionizing Language Models

The emergence of the Transformer architecture marked a paradigm shift in how machines process and understand human language. Unlike its predecessors, which struggled with long-range patterns in text, this groundbreaking architecture introduced mechanisms that revolutionized natural language processing (NLP).

The Building Blocks of Language Understanding

From Text to Machine-Readable Format

Before any sophisticated processing can occur, raw text must be converted into a format that machines can process. This happens in two crucial stages:

  1. Text Segmentation The first challenge is breaking down text into meaningful units. Imagine building with LEGO blocks - just as you need individual blocks to create complex structures, language models need discrete pieces of text to work with. These pieces, called tokens, might be:
    • Complete words
    • Parts of words
    • Individual characters
    • Special symbols

For instance, the phrase “artificial intelligence” might become [“art”, “ificial”, “intel”, “ligence”], allowing the model to recognize patterns even in unfamiliar words.

  2. Numerical Representation Once we have our text pieces, each token gets transformed into a numerical vector - essentially a long list of numbers. Think of this as giving each word or piece its own unique mathematical “fingerprint” that captures its meaning and relationships with other words. (Both steps are sketched in code below.)
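
A quick way to see both steps in practice, assuming the Hugging Face transformers library is installed (the exact subword pieces depend on the tokenizer’s learned vocabulary, so they may differ from the illustrative split above):

```python
from transformers import AutoTokenizer

# Load a pretrained subword tokenizer (GPT-2's byte-pair-encoding tokenizer, as an example).
tokenizer = AutoTokenizer.from_pretrained("gpt2")

tokens = tokenizer.tokenize("artificial intelligence")
ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens)  # the subword pieces the model actually sees
print(ids)     # each piece mapped to an integer id, ready to be turned into a vector
```
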
Adding Sequential Understanding

One of the most innovative aspects of Transformers is how they handle word order. Rather than treating text like a bag of unrelated words, the architecture adds precise positional information to each token’s representation.

Consider how the meaning changes between these sentences:

  • “The dog chased the cat.”
  • “The cat chased the dog.”

The words are identical, but their positions completely change the meaning. The Transformer’s positional encoding system ensures this crucial information isn’t lost.
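
One common implementation choice is the sinusoidal encoding from the original Transformer paper; learned positional embeddings are an equally common alternative. A minimal NumPy sketch:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """One row per position; even dimensions use sine, odd dimensions use cosine."""
    positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
    dims = np.arange(d_model)[None, :]               # (1, d_model)
    # Frequencies follow the original paper: 1 / 10000^(2i / d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

# Each of the 6 positions gets a distinct vector that is added to its token's embedding.
print(sinusoidal_positional_encoding(6, 16).shape)   # (6, 16)
```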

The Heart of the System: Information Processing

Context Through Self-Attention

The true magic of Transformers lies in their attention mechanism. Unlike humans who must read text sequentially, Transformers can simultaneously analyze relationships between all words in a text. This is similar to how you might solve a complex puzzle:

  1. First, you look at all the pieces simultaneously
  2. Then, you identify which pieces are most likely to connect
  3. Finally, you use these relationships to build the complete picture

In language, this means the model can relate every word to every other word in the text at once, for example linking a pronoun to the noun it refers to, no matter how far apart they appear.
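
The core computation behind this is scaled dot-product attention. A minimal single-head NumPy sketch with toy dimensions and no masking (real Transformers use many heads plus residual connections and layer normalization):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention for a single head."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # project tokens into queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # how strongly each token attends to every other token
    weights = softmax(scores, axis=-1)          # each row is a probability distribution over tokens
    return weights @ V                          # each output mixes information from all positions

# Toy example: 4 tokens, model dimension 8
d = 8
X = np.random.randn(4, d)
out = self_attention(X, np.random.randn(d, d), np.random.randn(d, d), np.random.randn(d, d))
print(out.shape)  # (4, 8): one contextualized vector per token
```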

Information Refinement

After the attention mechanism identifies relevant connections, the information passes through a series of specialized neural networks that further transform each token’s representation.
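
In the standard Transformer these refinement networks are position-wise feed-forward layers: two linear transformations with a nonlinearity in between, applied to each token’s vector independently. A minimal sketch with assumed toy dimensions:

```python
import numpy as np

def feed_forward(X, W1, b1, W2, b2):
    """Position-wise feed-forward network, applied to every token vector independently."""
    hidden = np.maximum(0, X @ W1 + b1)   # expand and apply a ReLU nonlinearity
    return hidden @ W2 + b2               # project back to the model dimension

d_model, d_ff = 8, 32                     # the inner layer is conventionally wider
X = np.random.randn(4, d_model)           # 4 token representations coming out of attention
out = feed_forward(X,
                   np.random.randn(d_model, d_ff), np.zeros(d_ff),
                   np.random.randn(d_ff, d_model), np.zeros(d_model))
print(out.shape)  # (4, 8)
```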

Generation and Decision Making

The final stage involves converting all this processed information into useful output. Whether the task is:

The model uses a probability distribution over possible outputs to select the most appropriate one. This is similar to a skilled writer choosing the perfect word from their vocabulary, considering both meaning and context.
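
Concretely, the model’s final scores over its vocabulary are normalized into a probability distribution, and the next token is either taken greedily or sampled from it. A toy sketch with made-up scores:

```python
import numpy as np

vocab = ["cat", "dog", "sat", "ran"]          # tiny illustrative vocabulary
logits = np.array([2.1, 0.3, 1.7, -0.5])      # made-up scores for the next position

probs = np.exp(logits - logits.max())         # softmax: scores -> probability distribution
probs /= probs.sum()

greedy_choice = vocab[int(np.argmax(probs))]           # always pick the most likely token
sampled_choice = np.random.choice(vocab, p=probs)      # or sample, for more varied output
print(probs.round(3), greedy_choice, sampled_choice)
```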

Real-World Applications and Impact

The Transformer architecture has enabled breakthrough applications in:

  1. Cross-Language Communication
    • Real-time translation systems
    • Multilingual document processing
    • Cultural context adaptation
  2. Content Creation and Analysis
    • Automated report generation
    • Text summarization
    • Content recommendations
  3. Specialized Industry Applications
    • Legal document analysis
    • Medical record processing
    • Scientific literature review

The Road Ahead

As this architecture continues to evolve, we’re seeing:

The Transformer architecture represents more than just a technical advancement - it’s a fundamental shift in how machines can understand and process human language. Its impact continues to grow as researchers and developers find new ways to apply and improve upon its core principles.

The true power of Transformers lies not just in their technical capabilities, but in how they’ve opened new possibilities for human-machine interaction and understanding. As we continue to refine and build upon this architecture, we’re moving closer to systems that can truly understand and engage with human language in all its complexity and nuance.

What are large language models?

Large language models are transformers with billions to trillions of parameters, trained on massive amounts of text data. These models have several distinguishing characteristics:

  1. Scale: Models contain billions of parameters and are trained on hundreds of billions of tokens
  2. Architecture: Based on the Transformer architecture with self-attention mechanisms
  3. Emergent abilities: Complex capabilities that emerge with scale
  4. Few-shot learning: Ability to adapt to new tasks with few examples
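
To make the few-shot idea concrete, here is a sketch of a few-shot prompt (the task and example reviews are made up for illustration): the task is specified entirely inside the prompt, and the model is expected to continue the pattern without any parameter updates.

```python
# A few-shot prompt: a handful of worked examples followed by a new input.
# Sent to any large language model's completion interface, the expected continuation is " Positive".
prompt = """Decide whether each review is Positive or Negative.

Review: "The plot was gripping from start to finish."
Sentiment: Positive

Review: "I walked out halfway through the film."
Sentiment: Negative

Review: "A warm, funny, beautifully acted story."
Sentiment:"""

print(prompt)
```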