Thu 08/31/23 |
Lecture #1:
- Course introduction
- Logistics
- Transformers: high-level overview
[slides]
|
Main readings:
- Attention Is All You Need (2017) [link]
|
|
Tue 09/05/23 |
Lecture #2:
- Optimization, backpropagation, and training (see the autograd sketch below)
[slides]
|
Main readings:
- Deep Feedforward Networks. Ian Goodfellow, Yoshua Bengio, & Aaron Courville (2016). Deep Learning, Chapter 6.5. [link]
- An overview of gradient descent optimization algorithms [link]
- A Gentle Introduction to Torch Autograd [link]
- Autograd Mechanics [link]
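A minimal sketch to accompany the backpropagation and optimization topics above (assuming PyTorch, as in the autograd readings): compute a gradient with autograd and apply a plain gradient-descent update on a toy least-squares problem. The data and learning rate are illustrative.

```python
# Minimal sketch: autograd computes the gradient, we apply a hand-written SGD step.
import torch

# Toy data: targets follow y = 3x exactly.
x = torch.linspace(-1, 1, 20).unsqueeze(1)
y = 3 * x

w = torch.zeros(1, requires_grad=True)   # single learnable weight
lr = 0.1

for step in range(50):
    loss = ((x * w - y) ** 2).mean()      # mean squared error
    loss.backward()                       # autograd fills w.grad
    with torch.no_grad():
        w -= lr * w.grad                  # vanilla gradient-descent update
        w.grad.zero_()                    # clear the accumulated gradient

print(w.item())  # approaches 3.0
```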
|
|
Thu 09/07/23 |
Lecture #3:
- Word embeddings
- Tokenization (see the toy BPE sketch below)
[slides]
|
Main readings:
- Distributed Representations of Words and Phrases and their Compositionality (2013) [link]
- GloVe: Global Vectors for Word Representation (2014) [link]
- BPE: Neural Machine Translation of Rare Words with Subword Units (2016) [link]
- SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing (2018) [link]
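To make the BPE reading concrete, here is a toy byte-pair-encoding learner in plain Python, following the worked example from the Sennrich et al. paper; it is a sketch for intuition, not a production tokenizer.

```python
# Toy BPE: repeatedly merge the most frequent adjacent symbol pair.
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs over a {word-as-symbol-tuple: frequency} vocab."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Words are split into characters, with an end-of-word marker.
vocab = {("l", "o", "w", "</w>"): 5, ("l", "o", "w", "e", "r", "</w>"): 2,
         ("n", "e", "w", "e", "s", "t", "</w>"): 6, ("w", "i", "d", "e", "s", "t", "</w>"): 3}

for _ in range(5):                      # learn 5 merges
    best = get_pair_counts(vocab).most_common(1)[0][0]
    vocab = merge_pair(best, vocab)
    print("merged:", best)
```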
|
|
Tue 09/12/23 |
Lecture #4:
- Transformers
- Implementation details (see the attention sketch below)
[slides]
[notebook: transformer.ipynb]
|
Main readings:
- Attention Is All You Need (2017) [link]
- The Annotated Transformer [link]
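A minimal sketch of the scaled dot-product attention at the core of the Transformer (PyTorch assumed; single head, no learned projections). The full multi-head module is covered in the lecture notebook and The Annotated Transformer.

```python
# Scaled dot-product attention with an optional mask.
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: (batch, seq, d_k). Returns outputs plus the attention weights."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)    # (batch, seq, seq)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, weights

q = k = v = torch.randn(2, 5, 16)
causal = torch.tril(torch.ones(5, 5))            # lower-triangular causal mask
out, attn = scaled_dot_product_attention(q, k, v, mask=causal)
print(out.shape, attn.shape)                     # torch.Size([2, 5, 16]) torch.Size([2, 5, 5])
```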
|
|
Thu 09/14/23 |
Lecture #5:
- Positional information: absolute, relative, RoPE, ALiBi
- Multi-Query Attention
- Grouped Multi-Query Attention
- Inference
- KV caching (see the sketch below)
- Encoder-only / decoder-only vs. encoder-decoder
[slides]
|
Main readings:
- Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation [link]
- RoFormer: Enhanced Transformer with Rotary Position Embedding [link]
- Fast Transformer Decoding: One Write-Head is All You Need [link]
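As referenced in the topics above, a minimal sketch of KV caching during autoregressive decoding (PyTorch assumed; single head, no positional encoding, toy shapes): keys and values of past tokens are stored and reused instead of being recomputed at every step.

```python
# One attention step per decoded token, attending over cached keys/values.
import math
import torch

def attend_with_cache(q_t, k_t, v_t, cache):
    """q_t, k_t, v_t: (batch, 1, d) for the newest token. cache holds past K and V."""
    cache["k"] = torch.cat([cache["k"], k_t], dim=1) if "k" in cache else k_t
    cache["v"] = torch.cat([cache["v"], v_t], dim=1) if "v" in cache else v_t
    scores = q_t @ cache["k"].transpose(-2, -1) / math.sqrt(q_t.size(-1))
    return torch.softmax(scores, dim=-1) @ cache["v"]    # (batch, 1, d)

cache = {}
for step in range(4):                          # pretend these come from the model's projections
    q = k = v = torch.randn(1, 1, 16)          # one new token per decoding step
    out = attend_with_cache(q, k, v, cache)
print(cache["k"].shape)                        # torch.Size([1, 4, 16]): four cached keys
```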
|
Tips on choosing a project [slides]
HW1 out [link]
|
Tue 09/19/23 |
Lecture #6:
[slides]
|
Main readings:
- ELMo: Deep Contextualized Word Representations [link]
- ULMFiT: Universal Language Model Fine-tuning for Text Classification [link]
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [link]
- ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators [link]
|
Project teams due. [Team submission form]
|
Thu 09/21/23 |
Lecture #7:
- Model architecture and training objectives
- Encoder-decoder, decoder-only
- UL2 / FIM
[slides]
|
Main readings:
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer [link]
- UL2: Unifying Language Learning Paradigms (2022) [link]
- What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization? (2022) [link]
- BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension [link]
|
|
Tue 09/26/23 |
Lecture #8:
- Scale
- Compute analysis in transformers (see the FLOPs estimate below)
[slides]
|
Main readings:
- Scaling Laws for Neural Language Models [link]
- Training Compute-Optimal Large Language Models [link]
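For the compute-analysis topic above, a back-of-the-envelope sketch using the common C ≈ 6·N·D approximation (N parameters, D training tokens) that the scaling-law readings rely on; the parameter and token counts below are illustrative, not any particular model's.

```python
# Rough training-compute estimate: ~6 FLOPs per parameter per training token.
def training_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens

n_params = 7e9        # a hypothetical 7B-parameter model
n_tokens = 140e9      # 140B tokens, i.e. roughly the ~20 tokens/parameter heuristic
c = training_flops(n_params, n_tokens)
print(f"~{c:.2e} FLOPs")   # ~5.88e+21 FLOPs
```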
|
|
Thu 09/28/23 |
Lecture #9:
- Scaling laws and GPT-3
- Few-shot Learning
- Prompting
- In-context learning (see the few-shot prompt example below)
[slides]
|
Main readings:
- Language Models are Few-shot Learners (2020) [link]
- Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? (2022) [link]
- Data Distributional Properties Drive Emergent In-Context Learning in Transformers (2022) [link]
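As a concrete illustration of few-shot prompting / in-context learning, a sketch of how demonstrations are packed into the prompt; the task, examples, and labels are made up for illustration.

```python
# Few-shot prompting: "training" examples are placed directly in the prompt,
# and the model is asked to continue the pattern for a new query.
examples = [
    ("The movie was a delight from start to finish.", "positive"),
    ("I want those two hours of my life back.", "negative"),
    ("A stunning, heartfelt performance.", "positive"),
]
query = "The plot made no sense and the pacing dragged."

prompt = "Classify the sentiment of each review as positive or negative.\n\n"
for text, label in examples:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"   # the model completes this line

print(prompt)
```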
|
|
Tue 10/03/23 |
Lecture #10:
- Prompting
- Emergence
- Reasoning (see the chain-of-thought prompt example below)
- Instruction tuning
[slides]
|
Main readings:
- Chain of Thought Prompting Elicits Reasoning in Large Language Models [link]
- Tree of Thoughts: Deliberate Problem Solving with Large Language Models (2023) [link]
- The curious case of neural text degeneration (2020) [link]
- Training language models to follow instructions with human feedback (2022) [link]
- Multitask Prompted Training Enables Zero-Shot Task Generalization (2021) [link]
- Finetuned Language Models Are Zero-Shot Learners (2021) [link]
- Scaling Instruction-Finetuned Language Models (2022) [link]
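For the reasoning and prompting topics above, a sketch of the chain-of-thought prompting format; the worked demonstration mirrors the example used in the chain-of-thought reading.

```python
# Chain-of-thought prompting: the in-context example shows intermediate reasoning
# steps before the final answer, nudging the model to reason about the new question.
cot_example = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls each is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
)
new_question = (
    "Q: A cafeteria had 23 apples. They used 20 to make lunch and bought 6 more. "
    "How many apples do they have?\n"
    "A:"
)
prompt = cot_example + new_question
print(prompt)
```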
|
|
Thu 10/05/23 |
Lecture #11:
- Adaptation
- Reinforcement learning for language model fine-tuning (see the DPO loss sketch below)
[slides]
|
Main readings:
- Fine-Tuning Language Models from Human Preferences (2019) [link]
- Learning to Summarize with Human Feedback (2020) [link]
- InstructGPT: Training Language Models to Follow Instructions with Human Feedback (2022) [link]
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (2022) [link]
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model (2023) [link]
Optional readings:
- Parameter-Efficient Transfer Learning for NLP (2019) [link]
- Prefix-Tuning: Optimizing Continuous Prompts for Generation (2021) [link]
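As referenced above, a minimal sketch of the DPO objective from the Direct Preference Optimization reading (PyTorch assumed): given sequence log-probabilities of a preferred and a dispreferred response under the policy and a frozen reference model, the loss increases the margin in favor of the preferred response. Inputs below are random placeholders.

```python
# DPO loss: -log sigmoid(beta * ((log-ratio of chosen) - (log-ratio of rejected))).
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """All inputs: (batch,) tensors of summed sequence log-probs. Returns mean loss."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()

b = 4
loss = dpo_loss(torch.randn(b), torch.randn(b), torch.randn(b), torch.randn(b))
print(loss.item())
```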
|
HW1 due (10/8)
|
Tue 10/10/23 |
Lecture #12:
- Challenges and Opportunities of Building Open LLMs
Guest lecturer: Iz Beltagy, Allen Institute for AI
[slides]
|
Main readings:
- What Language Model to Train if You Have One Million GPU Hours (2022) [link]
- Dolma: Trillion Token Open Corpus for Language Model Pretraining (2023) [link]
- Llama 2: Open Foundation and Fine-Tuned Chat Models (2023) [link]
- Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling (2023) [link]
- BLOOM: A 176B-Parameter Open-Access Multilingual Language Model (2022) [link]
- Scaling Language Models: Methods, Analysis & Insights from Training Gopher (2022) [link]
|
|
Thu 10/12/23 |
Lecture #13:
- Parameter-efficient fine-tuning (see the LoRA sketch below)
[slides]
|
Main readings:
- Parameter-Efficient Transfer Learning for NLP (2019) [link]
- Prefix-Tuning: Optimizing Continuous Prompts for Generation (2021) [link]
- Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning (2022) [link]
- LoRA: Low-Rank Adaptation of Large Language Models [link]
- Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning [link]
- Efficient Transformers: A Survey [link]
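A minimal sketch of a LoRA-style linear layer from the LoRA reading (PyTorch assumed): the pretrained weight is frozen and only a low-rank update scaled by alpha/r is trained. Sizes below are toy.

```python
# LoRA: y = W x + (alpha / r) * B A x, with W frozen and A, B trainable.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)                     # frozen pretrained weight
        self.lora_a = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_features, r))   # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinear(64, 64)
print(layer(torch.randn(2, 10, 64)).shape)    # torch.Size([2, 10, 64])
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)                              # only the low-rank A and B: 1024
```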
|
|
Tue 10/17/23 |
Lecture #14:
- Reasoning with (De)Composition
Guest lecturer: Tushar Khot, Allen Institute for AI
[slides]
|
Main readings:
- Hey AI, Can You Solve Complex Tasks by Talking to Agents? [link]
- Toolformer: Language Models Can Teach Themselves to Use Tools (2023) [link]
- Improving Language Model Negotiation with Self-Play and In-Context Learning from AI Feedback (2023) [link]
- ReAct: Synergizing Reasoning and Acting in Language Models [link]
|
Project proposal due (10/20)
|
10/18/23 - 10/23/23 |
October recess - No classes
|
Tue 10/24/23 |
Lecture #15:
- Modular deep learning
- Mixture of experts (see the top-k routing sketch below)
[slides]
|
Main readings:
- Modular Deep Learning (2022) [link]
- A Review of Sparse Expert Models in Deep Learning (2022) [link]
- Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (2017) [link]
- Switch Transformers: Scaling to Trillion Parameter Models (2021) [link]
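As referenced above, a toy sketch of top-k expert routing in the spirit of the sparsely-gated MoE readings (PyTorch assumed): a learned gate picks the top-2 experts per token and mixes their outputs with renormalized gate weights. Real implementations dispatch tokens to experts sparsely and add load-balancing losses; this dense loop is only for intuition.

```python
# Top-2 routing over a handful of tiny "experts".
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, d_model=32, n_experts=4, k=2):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_experts))
        self.k = k

    def forward(self, x):                          # x: (tokens, d_model)
        scores = self.gate(x)                      # (tokens, n_experts)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        weights = torch.softmax(topk_scores, dim=-1)       # renormalize over the top-k
        out = torch.zeros_like(x)
        for slot in range(self.k):                 # dense loop for clarity only
            idx = topk_idx[:, slot]
            expert_out = torch.stack([self.experts[int(e)](x[t]) for t, e in enumerate(idx)])
            out += weights[:, slot:slot + 1] * expert_out
        return out

print(TinyMoE()(torch.randn(6, 32)).shape)         # torch.Size([6, 32])
```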
|
|
Thu 10/26/23 |
Midterm
|
Tue 10/31/23 |
Lecture #16:
- Retrieval-augmented language models (see the retrieve-then-prompt sketch below)
Guest lecturer: Sewon Min, University of Washington
[slides]
|
Main readings:
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (2020) [link]
- Improving language models by retrieving from trillions of tokens (2021) [link]
- REPLUG: Retrieval-Augmented Black-Box Language Models [link]
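A sketch of the basic retrieve-then-prompt recipe behind the retrieval-augmented LM readings: score documents against the query and prepend the best match to the prompt. The bag-of-words cosine scoring here is a stand-in for the learned dense retrievers used in the papers.

```python
# Retrieve-then-prompt with a toy bag-of-words retriever.
from collections import Counter
import math

docs = [
    "The Transformer uses self-attention instead of recurrence.",
    "GloVe learns word vectors from global co-occurrence statistics.",
    "KV caching stores past keys and values to speed up decoding.",
]
query = "How does the Transformer replace recurrence?"

def bow(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

best_doc = max(docs, key=lambda d: cosine(bow(query), bow(d)))
prompt = f"Context: {best_doc}\n\nQuestion: {query}\nAnswer:"
print(prompt)
```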
|
|
Thu 11/02/23 |
Lecture #17:
- Build an Ecosystem, Not a Monolith
Guest lecturer: Colin Raffel, University of Toronto
[slides]
|
Main readings:
- Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning (2022) [link]
- Exploring and Predicting Transferability across NLP Tasks (2020) [link]
- Editing Models with Task Arithmetic (2022) [link]
|
|
Tue 11/07/23 |
Lecture #18:
- Modeling long sequences
- Hierarchical and graph-based methods
- Recurrence and memory
[slides]
|
Main readings:
- Higher-order Coreference Resolution with Coarse-to-fine Inference (2018) [link]
- Entity, Relation, and Event Extraction with Contextualized Span Representations (2020) [link]
- Memorizing Transformers (2022) [link]
- Hierarchical Graph Network for Multi-hop Question Answering [link]
- Compressive Transformers for Long-Range Sequence Modelling (2020) [link]
- Efficient Transformers: A Survey (2022) [link]
|
HW2 due
|
Thu 11/09/23 |
Lecture #19:
- Modeling long sequences
- Sparse attention patterns (see the sliding-window mask sketch below)
- Approximating attention
- Hardware-aware efficiency
[slides]
|
Main readings:
- Longformer: The Long-Document Transformer (2020) [link]
- BigBird: Transformers for Longer Sequences (2020) [link]
- Performer: Rethinking Attention with Performers (2021) [link]
- Reformer: The Efficient Transformer (2020) [link]
- LongT5: Efficient Text-To-Text Transformer for Long Sequences (2022) [link]
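As referenced above, a sketch of a Longformer-style sliding-window (local) attention mask (PyTorch assumed): each token attends only to neighbors within a fixed window, so the number of attended positions grows linearly rather than quadratically with sequence length.

```python
# Build a boolean local-attention mask; True = attention allowed.
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() <= window

mask = sliding_window_mask(seq_len=8, window=2)
print(mask.int())
print(mask.sum(dim=-1))   # each row attends to at most 2*window + 1 positions
```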
|
|
Tue 11/14/23 |
Lecture #20:
- Training approaches for long sequences
- Hardware-aware efficiency
- Societal considerations and impacts of foundation models
[slides]
|
Main readings:
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (2022) [link]
- PRIMERA: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization (2022) [link]
- Peek Across: Improving Multi-Document Modeling via Cross-Document Question-Answering (2023) [link]
- What's in my big data? (2023) [link]
- Red Teaming Language Models with Language Models (2022) [link]
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (2022) [link]
|
|
Thu 11/16/23 |
Lecture #21:
- Vision transformers (see the patch-embedding sketch below)
- Diffusion models
[slides]
|
Main readings:
- An Image is Worth 16x16 Words- Transformers for Image Recognition at Scale (2020) [link]
- Training data-efficient image transformers & distillation through attention (2021) [link]
- Denoising Diffusion Probabilistic Models (2020) [link]
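As referenced above, a sketch of ViT-style patch embedding (PyTorch assumed): the image is cut into non-overlapping 16x16 patches, and each patch is flattened and linearly projected into a token, after which a standard Transformer encoder can be applied.

```python
# Turn an image into a sequence of patch tokens.
import torch
import torch.nn as nn

patch, d_model = 16, 64
images = torch.randn(2, 3, 224, 224)                 # (batch, channels, H, W)

# A stride-`patch` convolution is the standard trick for "flatten + linear project".
to_patches = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)
tokens = to_patches(images)                          # (2, 64, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)           # (2, 196, 64): 14*14 patch tokens

print(tokens.shape)                                  # torch.Size([2, 196, 64])
```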
|
|
11/17/23 - 11/26/23 |
Thanksgiving recess - No classes
|
Tue 11/28/23 |
Lecture #22:
- Final project presentations -- session 1
|
|
|
Thu 11/30/23 |
Lecture #23:
- Towards Large Foundation Vision Models
Guest lecturer: Neil Houlsby, Google DeepMind
[slides]
|
Main readings:
- Scaling Vision Transformers to 22 Billion Parameters (2023) [link]
- From Sparse to Soft Mixtures of Experts (2023) [link]
- Scaling Vision Transformers (2021) [link]
Optional readings:
- PaLI-X: On Scaling up a Multilingual Vision and Language Model [link]
- PaLI: A Jointly-Scaled Multilingual Language-Image Model [link]
|
|
Fri 12/01/23 |
Lecture #24:
- Final project presentations -- session 2
|
|
HW3 due (12/9)
|
Tue 12/05/23 |
Lecture #25:
- Foundation Models for Code and Math (see the pass@k sketch below)
Guest lecturer: Ansong Ni, Yale University
[slides]
|
Main readings:
- Evaluating Large Language Models Trained on Code (2021) [link]
- Solving Quantitative Reasoning Problems with Language Models (2022) [link]
- StarCoder: May the source be with you! (2023) [link]
Optional readings:
- Program Synthesis with Large Language Models (2021) [link]
- Show Your Work: Scratchpads for Intermediate Computation with Language Models (2021) [link]
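The Codex reading above evaluates code generation with pass@k; below is a sketch of the unbiased pass@k estimator described in that paper (NumPy assumed), with illustrative numbers.

```python
# Unbiased pass@k: with n samples per problem of which c pass the tests, estimate
# the probability that at least one of k randomly drawn samples passes.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:          # every size-k draw must contain a passing sample
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

print(pass_at_k(n=200, c=10, k=1))    # 0.05
print(pass_at_k(n=200, c=10, k=100))  # much higher with more attempts
```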
|
|
Thu 12/07/23 |
Lecture #26:
- Moved to 12/1 (see above)
|
|
Final project report due (12/18)
|