Transformers and Beyond
Notes summarizing COMP3010 Lecture 8, the labs, and online resources.
Key Concepts
Transformers are the dominant architecture in natural language processing and are increasingly used in vision and multi-modal tasks.
They outperform MLPs and CNNs in many contexts, especially where sequence modeling and long-range dependencies are important.
Motivation from NLP
Early models for tasks like machine translation followed the sequence-to-sequence paradigm with an encoder-decoder structure based on RNNs:
- The encoder processes the input sequence and converts it into a fixed-length context vector.
- The decoder uses this context vector to generate the output sequence step by step.
Transformers replace RNNs in this structure, allowing for parallelization and better handling of long-range dependencies.
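As a rough illustration of the encoder-decoder pattern (not from the lecture), here is a minimal GRU-based sketch in PyTorch; the class, vocabulary sizes, and dimensions are placeholders:

```python
import torch
import torch.nn as nn

# Minimal RNN-based sequence-to-sequence sketch. The encoder compresses the source
# sequence into a fixed-length context vector; the decoder unrolls from that context
# one step at a time.
class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, d=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, d)
        self.tgt_emb = nn.Embedding(tgt_vocab, d)
        self.encoder = nn.GRU(d, d, batch_first=True)
        self.decoder = nn.GRU(d, d, batch_first=True)
        self.out = nn.Linear(d, tgt_vocab)

    def forward(self, src, tgt):
        _, context = self.encoder(self.src_emb(src))           # (1, B, d) fixed-length summary
        dec_out, _ = self.decoder(self.tgt_emb(tgt), context)  # decode conditioned on the context
        return self.out(dec_out)                               # (B, T_tgt, tgt_vocab) next-token logits

logits = Seq2Seq(src_vocab=1000, tgt_vocab=1200)(
    torch.randint(0, 1000, (2, 9)), torch.randint(0, 1200, (2, 7)))
print(logits.shape)  # torch.Size([2, 7, 1200])
```

The fixed-length context vector is the bottleneck that attention (next section) removes.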
Attention Mechanism
- Attention allows models to focus on relevant parts of the input sequence when generating each output.
- To compute attention for an input token (say token 2), its query $q_2$ is compared against every key $k_j$; the resulting weights decide how much each value $v_j$ contributes to that token's output.
- Scaled Dot-Product Attention:
$\text{Attention}(Q, K, V) = \text{softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right) V$
- Scaling by $\sqrt{d_k}$ keeps the dot products from growing with the key dimension, which would otherwise push the softmax into saturated regions with tiny gradients (see the sketch below).
Multi-Headed Attention lets the model attend to different types of relationships simultaneously.
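A minimal PyTorch sketch of the formula above (the helper name and tensor shapes are illustrative); multi-head attention simply runs this in parallel over several smaller heads:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: (batch, heads, seq, d_k). Returns the weighted values and the attention weights."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))   # (batch, heads, seq_q, seq_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))  # block disallowed positions
    weights = scores.softmax(dim=-1)                           # each row sums to 1
    return weights @ v, weights

# Example: 2 heads over a length-5 sequence with d_k = 8
q = k = v = torch.randn(1, 2, 5, 8)
out, attn = scaled_dot_product_attention(q, k, v)
print(out.shape, attn.shape)  # torch.Size([1, 2, 5, 8]) torch.Size([1, 2, 5, 5])
```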
Transformer Architecture
- Encoder-Only: e.g., BERT, RoBERTa – good for representation tasks.
- Decoder-Only: e.g., GPT – optimized for generation.
- Full Transformer: encoder-decoder architecture, as in translation models like T5.
GPT (2018) showed that a decoder-only stack (with causal masking) can be trained efficiently at scale, setting the template for modern large language models.
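A small sketch of the causal mask such a decoder applies during self-attention (it could be passed as `mask` to the attention helper above):

```python
import torch

# Lower-triangular (causal) mask: position i may only attend to positions <= i,
# which is what makes a decoder-only model autoregressive.
T = 5
causal_mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
print(causal_mask.int())
# tensor([[1, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0],
#         [1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1]], dtype=torch.int32)
```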
Vision Transformers (ViT)
- Adapt Transformer architectures to image tasks by treating image patches as tokens (sketched below).
- Unlike CNNs, which build in locality and translation invariance, ViTs must learn these patterns from data and therefore need large-scale training datasets.
- Despite needing more data, ViTs can outperform ConvNets on many vision benchmarks.
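A minimal patch-embedding sketch, assuming the standard ViT-Base setup (224×224 images, 16×16 patches, 768-dim tokens); a stride = kernel-size convolution implements the split-and-project step:

```python
import torch
import torch.nn as nn

# Patch embedding sketch: split an image into non-overlapping patches and project
# each patch to a token embedding. A convolution with stride == kernel_size does exactly this.
class PatchEmbed(nn.Module):
    def __init__(self, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                       # x: (B, 3, 224, 224)
        x = self.proj(x)                        # (B, dim, 14, 14)
        return x.flatten(2).transpose(1, 2)     # (B, 196, dim): 196 patch tokens

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```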
Modern Architectures and Challenges
- Many state-of-the-art models blend ideas across architectures.
- A common challenge is the quadratic time and memory complexity of standard self-attention, especially for long sequences.
- Solutions include IO-aware exact attention (FlashAttention) and approximate or sparse-attention variants (Performer, Linformer, Longformer), among others (see the back-of-the-envelope sketch below).
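A back-of-the-envelope sketch of why this matters: the $n \times n$ attention matrix alone (per head, per layer, stored in float32) quickly dominates memory as the sequence length grows.

```python
# Rough memory footprint of the n x n attention matrix for a single head, in float32.
def attn_matrix_gib(n, bytes_per_element=4):
    return n * n * bytes_per_element / 2**30

for n in (1_000, 10_000, 100_000):
    print(f"n={n:>7}: {attn_matrix_gib(n):10.3f} GiB per head per layer")
# n=   1000:      0.004 GiB per head per layer
# n=  10000:      0.373 GiB per head per layer
# n= 100000:     37.253 GiB per head per layer
```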
Lab 08 Discussion Highlights
1. Self-Attention vs. Cross-Attention
| Feature | Self-Attention | Cross-Attention |
|---|---|---|
| What attends to what? | Tokens attend to other tokens in the same sequence (Q, K, V from one sequence) | Queries attend to an external sequence's K/V (e.g., decoder attends to encoder) |
| Use Case | Intra-sequence modeling (within a sentence or image) | Inter-sequence modeling (e.g., translation, multi-modal) |
| Masking | Causal (in decoder) or none (in encoder) | Usually unmasked: each decoder position can see all encoder positions |
| Parameters | Q, K, V projections all applied to the same input | Q projected from decoder states; K/V projected from encoder output |
| Intuition | “How relevant is token i to j in the same sequence?” | “How relevant is encoder token j to decoder position i?” |
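A quick sketch of the difference using PyTorch's `nn.MultiheadAttention` (shapes are illustrative; a real model would use separate modules and weights for the two attention types):

```python
import torch
import torch.nn as nn

d, heads = 64, 4
attn = nn.MultiheadAttention(d, heads, batch_first=True)

src = torch.randn(2, 10, d)   # encoder-side sequence (e.g., source sentence)
tgt = torch.randn(2, 7, d)    # decoder-side sequence (e.g., partial translation)

# Self-attention: Q, K, V all come from the same sequence.
self_out, _ = attn(tgt, tgt, tgt)      # (2, 7, 64)

# Cross-attention: queries from the decoder, keys/values from the encoder output.
# (One module is reused here only to keep the sketch short.)
cross_out, _ = attn(tgt, src, src)     # (2, 7, 64)
print(self_out.shape, cross_out.shape)
```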
2. Efficiency and Scalability
| Property | Transformer | RNN | CNN |
|---|---|---|---|
| Parallelism | Fully parallel across tokens | Sequential steps | Parallel in spatial dimension |
| Max Path Length | $O(1)$ (any token attends to any other directly) | $O(n)$ (step-by-step dependency) | $O(\log_k n)$ (dilated) or $O(n/k)$ (stacked) |
| Compute per Layer | $O(n^2 \cdot d)$ | $O(n \cdot d^2)$ | $O(k \cdot n \cdot d^2)$ |
| Scalability | Scales to 100B+ params & long contexts | Struggles with long sequences | Strong for 2D (images); limited receptive field on long 1D sequences |
| Inductive Bias | Weak (order only via positional encodings) | Temporal/sequential bias | Locality & translation invariance |
| Best Use Case | Language, code, protein folding | Streaming, time series | Vision, audio, local patterns |
3. Self-Attention vs. 1D Convolution (Kernel Size = Sequence Length)
| Aspect | Self-Attention | 1D Convolution (Kernel = $n$) |
|---|---|---|
| Receptive Field | Global – all tokens attend to each other | Global – the kernel spans the entire sequence |
| Parameter Count | $O(d^2)$ per head | $O(n \cdot d_{in} \cdot d_{out})$ |
| Computation | $O(n^2 \cdot d)$ | $O(n \cdot d_{in} \cdot d_{out})$ |
| Content Dependence | Attention weights vary with input content | Fixed weights, same for all inputs |
| Positional Info | Needs explicit positional encodings | Implicit in the kernel's weight positions |
| Inductive Bias | None – must be learned | Strong locality bias |
| Interpretability | High – via attention maps | Harder to interpret |
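A small sketch of the parameter-count rows, assuming $n = 128$ and $d = 64$: the full-sequence convolution's weights grow with $n$, while the attention projections do not (and, unlike the fixed kernel, the attention weights themselves depend on the input).

```python
import torch
import torch.nn as nn

n, d = 128, 64   # sequence length and channel/embedding size

# 1D convolution whose kernel spans the whole sequence: fixed weights, identical
# for every input, and the parameter count grows linearly with n.
conv = nn.Conv1d(in_channels=d, out_channels=d, kernel_size=n)
print(sum(p.numel() for p in conv.parameters()))   # n*d*d + d = 524,352

# Single-head self-attention projections: parameter count is independent of n;
# the mixing weights softmax(QK^T) are recomputed from the input content.
attn = nn.MultiheadAttention(embed_dim=d, num_heads=1)
print(sum(p.numel() for p in attn.parameters()))   # ~4*d*d + biases = 16,640
```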
4. Encoder vs. Decoder Roles
Encoder
- Input: Full source sequence (text, image patches, etc.)
- Operation: Stacks of self-attention + feedforward layers (no masking)
- Output: Contextualized hidden states
Decoder
- Input: Past outputs (during training, the ground-truth prefix via teacher forcing) and the encoder's output
- Operation:
- Masked self-attention (autoregressive)
- Cross-attention over encoder output
- Feedforward and normalization
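A minimal sketch of both roles using PyTorch's built-in layers (dimensions are illustrative); the decoder layer combines masked self-attention over `tgt` with cross-attention over the encoder's output:

```python
import torch
import torch.nn as nn

d, heads = 64, 4
enc_layer = nn.TransformerEncoderLayer(d_model=d, nhead=heads, batch_first=True)
dec_layer = nn.TransformerDecoderLayer(d_model=d, nhead=heads, batch_first=True)

src = torch.randn(2, 10, d)   # source tokens (already embedded)
tgt = torch.randn(2, 7, d)    # target tokens generated so far (already embedded)

memory = enc_layer(src)                                    # encoder: unmasked self-attention
causal = nn.Transformer.generate_square_subsequent_mask(7)
out = dec_layer(tgt, memory, tgt_mask=causal)              # decoder: masked self-attn + cross-attn
print(memory.shape, out.shape)   # torch.Size([2, 10, 64]) torch.Size([2, 7, 64])
```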
| Feature | Encoder-Only (e.g., BERT) | Decoder-Only (e.g., GPT) |
|---|---|---|
| Architecture | Self-attention blocks only | Masked self-attention blocks |
| Training Task | Masked language modeling, contrastive learning | Autoregressive next-token prediction |
| Usage | Embedding generation, classification, QA | Text generation, code completion, planning |
| Context Window | Typically shorter | Long-range generation (streaming, large context) |
| Fine-Tuning | Add task-specific heads, adapters | Prompting, LoRA, RLHF |
5. Informal Definitions
- Encoder: A neural compressor – ingests input and emits a representation that encodes all relevant context. It’s the reader.
- Decoder: A neural expander – uses this representation (and previous outputs) to produce new content. It’s the writer.