CMSC 848O, Spring 2025, UMD
Schedule
Make sure to reload this page to ensure you're seeing the latest version.
Week 1 (1/27-29): introduction, neural language models
- Course introduction // [slides]
- No associated readings!
- Language model basics // [slides]
- [reading] Jurafsky & Martin, 3.1-3.5 (language modeling)
- [reading] Jurafsky & Martin, 7 (neural language models)
Week 2 (2/4-6): attention, Transformers, scaling
- Transformer language models // [notes]
- [reading] Attention Is All You Need (Vaswani et al., NeurIPS 2017; the paper that introduced Transformers)
- [optional reading] An easy-to-read blog post on attention
- [optional reading] Training Compute-Optimal Large Language Models (Hoffmann et al., 2022)
Week 3 (2/11-13): LLM post-training, usage, and evaluation
- Instruction tuning & RLHF // [notes]
- [reading] Instruction tuning (Wei et al., 2022, FLAN)
- [reading] Reinforcement learning from human feedback (Ouyang et al., 2022, RLHF)
Week 4 through the end of the semester: student presentations & discussion of research papers
- Topics (papers to be posted soon!)
- Extending LLMs from short context to long context: continual pretraining, mid-training, post-training
- Efficient attention mechanisms: pros and cons
- Architectural modifications: state space models (e.g., Mamba) and hybrid models (e.g., Jamba)
- Efficient implementations of vanilla attention: flash attention, ring attention
- Evaluation of long-context language models: perplexity, point-wise retrieval, summarization, QA, etc.
- Synthetic data generation for long-context instruction following and reasoning
- Generating long outputs from long inputs
- Long context vs. RAG