Notes from the summary of COMP3010 Lecture 8, labs, and online resources.

Key Concepts

Transformers are the dominant architecture in natural language processing and are increasingly used in vision and multi-modal tasks.
They outperform MLPs and CNNs in many contexts, especially where sequence modeling and long-range dependencies are important.

Motivation from NLP

Early models for tasks like machine translation followed the sequence-to-sequence paradigm with an encoder-decoder structure based on RNNs:

  • The encoder processes the input sequence and converts it into a fixed-length context vector.
  • The decoder uses this context vector to generate the output sequence step by step.
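To make the fixed-length bottleneck concrete, here is a minimal sketch of an RNN encoder in NumPy. The function name `rnn_encode`, the plain tanh cell, and the toy dimensions are assumptions chosen for illustration, not the model used in the lecture.

```python
import numpy as np

def rnn_encode(inputs, W_x, W_h, b):
    """inputs: (seq_len, d_in). Returns the final hidden state, shape (d_hidden,)."""
    h = np.zeros(W_h.shape[0])
    for x_t in inputs:                        # strictly sequential: step t needs step t-1
        h = np.tanh(W_x @ x_t + W_h @ h + b)
    return h                                  # used as the context vector

rng = np.random.default_rng(0)
d_in, d_hidden = 4, 6                         # toy sizes (assumed)
W_x = rng.normal(size=(d_hidden, d_in))
W_h = rng.normal(size=(d_hidden, d_hidden))
b = np.zeros(d_hidden)

short_ctx = rnn_encode(rng.normal(size=(3, d_in)), W_x, W_h, b)
long_ctx = rnn_encode(rng.normal(size=(50, d_in)), W_x, W_h, b)
print(short_ctx.shape, long_ctx.shape)        # (6,) (6,)
```

A 3-token input and a 50-token input are both squeezed into the same 6 numbers, and the loop cannot be parallelized across time steps.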

Transformers replace RNNs in this structure, allowing for parallelization and better handling of long-range dependencies.

Attention Mechanism

  • Attention allows models to focus on relevant parts of the input sequence when generating each output.
  • To compute attention for input token $i_2$, its query $q_2$ is scored against every key $k_j$; the softmax of those scores gives the weights used to average the values $v_j$.
  • Scaled Dot-Product Attention:

$\text{Attention}(Q, K, V) = \text{softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right) V$

  • Scaling by $\sqrt{d_k}$ keeps the dot products from growing with the key dimension. Without it, large scores push the softmax into saturated regions where gradients become very small.
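The formula above can be sketched directly in NumPy. This is a minimal single-head version without masking or batching; the helper names and toy sizes are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> (output, weights)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (n_q, n_k): each query against each key
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_k, d_v = 4, 8, 8                     # toy sizes (assumed)
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))
out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape, weights.sum(axis=-1))

# Why scale: for unit-variance q and k, q.k has standard deviation about sqrt(d_k).
q = rng.normal(size=(10_000, 256))
k = rng.normal(size=(10_000, 256))
raw = (q * k).sum(axis=-1)
print(raw.std(), (raw / np.sqrt(256)).std())   # roughly 16 and roughly 1
```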

Multi-Headed Attention lets the model attend to different types of relationships simultaneously.
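As a rough sketch of how the heads run side by side, the projected queries, keys, and values can be split into `num_heads` slices, attended over independently, and concatenated again. The weight names (`W_q`, `W_k`, `W_v`, `W_o`) and sizes below are assumptions; real implementations also add batching, biases, and dropout.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """X: (n, d_model); every W_*: (d_model, d_model). Returns (output, weights)."""
    n, d_model = X.shape
    d_head = d_model // num_heads

    def split_heads(M):
        # (n, d_model) -> (num_heads, n, d_head)
        return M.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)     # (num_heads, n, n)
    weights = softmax(scores, axis=-1)                      # one attention map per head
    heads = weights @ V                                     # (num_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)   # concatenate the heads
    return concat @ W_o, weights

rng = np.random.default_rng(0)
n, d_model, num_heads = 5, 16, 4                            # toy sizes (assumed)
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
out, weights = multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(out.shape, weights.shape)                             # (5, 16) (4, 5, 5)
```

Each of the 4 heads produces its own 5×5 attention map, which is what lets different heads specialize in different relationships.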

Transformer Architecture

  • Encoder-Only: e.g., BERT, RoBERTa – good for representation tasks.
  • Decoder-Only: e.g., GPT – optimized for generation.
  • Full Transformer: encoder-decoder architecture, as in the original translation Transformer and text-to-text models like T5.

GPT (2018) showed that a decoder-only stack with causal masking can be pretrained as a language model, an approach that later proved to scale efficiently to large language models.
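Causal masking can be sketched as setting the scores for future positions to $-\infty$ before the softmax, so that position $i$ only attends to positions $j \le i$. The function name and sizes below are assumptions for illustration.

```python
import numpy as np

def causal_attention_weights(Q, K):
    """Q, K: (n, d_k). Returns (n, n) weights with zeros above the diagonal."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True where key index > query index
    scores = np.where(future, -np.inf, scores)           # block attention to the future
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)                                   # exp(-inf) == 0
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
w = causal_attention_weights(Q, K)
print(np.round(w, 2))    # lower-triangular; the first row is [1, 0, 0, 0]
```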

Vision Transformers (ViT)

  • Adapt Transformer architectures for image tasks by treating image patches as tokens.
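  • Unlike CNNs, which build in locality and translation equivariance, ViTs have much weaker spatial priors and must learn most spatial structure from data, so they typically need large-scale training datasets.
  • Despite needing more data, ViTs can match or outperform ConvNets on many vision benchmarks when pretrained at sufficient scale.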
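Treating patches as tokens amounts to a reshape. The sketch below cuts an image into non-overlapping square patches and flattens each one; the 224×224 input and 16×16 patch size are assumed example values, and the function name is hypothetical.

```python
import numpy as np

def image_to_patch_tokens(image, patch_size):
    """image: (H, W, C) with H, W divisible by patch_size
    -> (num_patches, patch_size * patch_size * C)."""
    H, W, C = image.shape
    p = patch_size
    assert H % p == 0 and W % p == 0, "image size must be a multiple of patch size"
    patches = image.reshape(H // p, p, W // p, p, C)
    patches = patches.transpose(0, 2, 1, 3, 4)     # (H/p, W/p, p, p, C)
    return patches.reshape(-1, p * p * C)          # one flattened row per patch

image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
tokens = image_to_patch_tokens(image, 16)
print(tokens.shape)                                # (196, 768)
```

A ViT would then typically apply a learned linear projection to each flattened patch and add position embeddings; those steps are omitted here.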

Modern Architectures and Challenges

  • Many state-of-the-art models blend ideas across architectures.
  • A common challenge is the quadratic time and memory complexity of standard self-attention, especially for long sequences.
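  • Solutions: FlashAttention (exact attention with a memory-efficient, IO-aware implementation), Performer and Linformer (approximations with linear complexity), Longformer (sparse local plus global attention), etc.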
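A quick back-of-the-envelope calculation shows why the quadratic term matters. The sketch below only counts the bytes needed to store one full $n \times n$ score matrix per head, assuming 2-byte (fp16) values; the function name and sequence lengths are illustrative.

```python
def attention_matrix_bytes(seq_len, num_heads=1, bytes_per_value=2):
    """Bytes needed to hold the full (seq_len x seq_len) score matrix."""
    return seq_len * seq_len * num_heads * bytes_per_value

for n in (1_024, 8_192, 65_536):
    gib = attention_matrix_bytes(n) / 2**30
    print(f"n={n:>6}: {gib:.3f} GiB per head (fp16)")
# n=  1024: 0.002 GiB per head (fp16)
# n=  8192: 0.125 GiB per head (fp16)
# n= 65536: 8.000 GiB per head (fp16)
```

Doubling the sequence length quadruples this cost, which is the pressure behind the methods listed above.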

Lab 08 Discussion Highlights

1. Self-Attention vs. Cross-Attention

| Feature | Self-Attention | Cross-Attention |
|---|---|---|
| What attends to what? | Each token attends to all tokens in the same sequence (Q, K, V from the same sequence) | Queries attend to external K/V (e.g., decoder attends to encoder) |
| Use Case | Intra-sequence modeling (within a sentence or image) | Inter-sequence modeling (e.g., translation, multi-modal) |
| Masking | Causal (in decoder) or none (in encoder) | Usually unmasked over the encoder output; the causal mask applies to the decoder's self-attention |
| Parameters | Separate learned projections for Q, K, V, all applied to the same sequence | Q projection applied to decoder states; K/V projections applied to encoder output |
| Intuition | “How relevant is token i to j in the same sequence?” | “How relevant is encoder token j to decoder position i?” |
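The only mechanical difference between the two is where the keys and values come from. A minimal sketch, with assumed names and toy sizes; it reuses one set of projection weights for both calls purely for brevity, whereas a real model gives each attention layer its own weights.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries_from, keys_values_from, W_q, W_k, W_v):
    Q = queries_from @ W_q                 # queries come from one sequence
    K = keys_values_from @ W_k             # keys and values may come from another
    V = keys_values_from @ W_v
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return weights @ V, weights

rng = np.random.default_rng(0)
d = 8
encoder_states = rng.normal(size=(6, d))   # 6 source tokens (assumed)
decoder_states = rng.normal(size=(3, d))   # 3 target positions (assumed)
# Assumption: weights shared between the two calls only to keep the sketch short.
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

self_out, self_w = attention(decoder_states, decoder_states, W_q, W_k, W_v)
cross_out, cross_w = attention(decoder_states, encoder_states, W_q, W_k, W_v)
print(self_w.shape, cross_w.shape)         # (3, 3) (3, 6)
```

The self-attention map is square (target by target), while the cross-attention map has one row per decoder position and one column per encoder token.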

2. Efficiency and Scalability

| Property | Transformer | RNN | CNN |
|---|---|---|---|
| Parallelism | Fully parallel across tokens | Sequential steps | Parallel in spatial dimension |
| Path Length | $O(1)$ (global attention) | $O(n)$ (step-by-step dependency) | $O(\log_k n)$ (dilated) or $O(n/k)$ (stacked, contiguous kernels) |
| Compute per Layer | $O(n^2 \cdot d)$ | $O(n \cdot d^2)$ | $O(k \cdot n \cdot d^2)$ |
| Scalability | Scales to 100B+ params & long contexts | Struggles with long sequences | Great for 2D (images); limited on 1D |
| Inductive Bias | Minimal; must learn structure from data | Temporal bias | Locality & translation invariance |
| Best Use Case | Language, code, protein folding | Streaming, time series | Vision, audio, local patterns |

Here $n$ is the sequence length, $d$ the representation dimension, and $k$ the convolution kernel size.

3. Self-Attention vs. 1D Convolution (Kernel Size = Sequence Length)

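| Aspect | Self-Attention | 1D Convolution (Kernel = $n$) |
|---|---|---|
| Receptive Field | Global – all tokens attend to each other | Global – one fixed kernel spanning the whole sequence |
| Parameter Count | $O(d^2)$ per head, independent of $n$ | $O(n \cdot d_{in} \cdot d_{out})$ |
| Computation | $O(n^2 \cdot d)$ | $O(n \cdot d_{in} \cdot d_{out})$ per output position; $O(n^2 \cdot d_{in} \cdot d_{out})$ if padded to produce $n$ outputs |
| Content Dependence | Attention weights vary with input content | Fixed weights, same for all inputs |
| Positional Info | Needs explicit encodings | Encoded by filter position |
| Inductive Bias | Weak – relationships must be learned | Fixed position-wise weights; the locality bias of small kernels is lost at kernel size $n$ |
| Interpretability | Attention maps give a partial view | Harder to interpret |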
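Two rows of this comparison, content dependence and positional information, can be checked numerically. The sketch below is illustrative only: the names and sizes are assumptions, and the "convolution" is reduced to a single full-length kernel producing one output.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V, w

rng = np.random.default_rng(0)
n, d = 5, 4                                    # toy sizes (assumed)
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
X_a = rng.normal(size=(n, d))
X_b = rng.normal(size=(n, d))

# Content dependence: the mixing weights are recomputed from each input.
out_a, w_a = self_attention(X_a, W_q, W_k, W_v)
_, w_b = self_attention(X_b, W_q, W_k, W_v)
weights_differ = not np.allclose(w_a, w_b)

# No built-in position information: permuting the tokens just permutes the outputs.
perm = rng.permutation(n)
out_perm, _ = self_attention(X_a[perm], W_q, W_k, W_v)
is_permutation_equivariant = np.allclose(out_perm, out_a[perm])

# A full-length kernel is one fixed weight per position, whatever the input is,
# so reordering the tokens generally changes its output.
kernel = rng.normal(size=(n, d))
conv_a = (kernel * X_a).sum()
conv_perm = (kernel * X_a[perm]).sum()
print(weights_differ, is_permutation_equivariant, conv_a, conv_perm)
```

Because plain self-attention cannot tell token order apart, positional encodings have to be added explicitly, while the kernel ties each weight to a position by construction.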

4. Encoder vs. Decoder Roles

Encoder

  • Input: Full source sequence (text, image patches, etc.)
  • Operation: Stacks of self-attention + feedforward layers (no masking)
  • Output: Contextualized hidden states

Decoder

  • Input: Past outputs (during training: ground truth), and encoder’s output
  • Operation:
    1. Masked self-attention (autoregressive)
    2. Cross-attention over encoder output
    3. Feedforward and normalization

Encoder-only and decoder-only models compared:

| Feature | Encoder-Only (e.g., BERT) | Decoder-Only (e.g., GPT) |
|---|---|---|
| Architecture | Self-attention blocks only | Masked self-attention blocks |
| Training Task | Masked language modeling, contrastive learning | Autoregressive next-token prediction |
| Usage | Embedding generation, classification, QA | Text generation, code completion, planning |
| Context Window | Typically shorter | Long-range generation (streaming, large context) |
| Fine-Tuning | Add task-specific heads, adapters | Prompting, LoRA, RLHF |

5. Informal Definitions

  • Encoder: A neural compressor – ingests input and emits a representation that encodes all relevant context. It’s the reader.
  • Decoder: A neural expander – uses this representation (and previous outputs) to produce new content. It’s the writer.
