Transformers and Beyond
Notes summarizing COMP3010 Lecture 8, the labs, and online resources.
Key Concepts
Transformers are the dominant architecture in natural language processing and are increasingly used in vision and multi-modal tasks.
They outperform MLPs and CNNs in many contexts, especially where sequence modeling and long-range dependencies are important.
Motivation from NLP
Early models for tasks like machine translation followed the sequence-to-sequence paradigm with an encoder-decoder structure based on RNNs:
- The encoder processes the input sequence and converts it into a fixed-length context vector.
- The decoder uses this context vector to generate the output sequence step by step.
Transformers replace RNNs in this structure, allowing for parallelization and better handling of long-range dependencies.
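As a rough illustration of the encoder-decoder pattern (not from the lecture), here is a minimal GRU-based sketch in PyTorch; the class, vocabulary sizes, and dimensions are placeholders:

```python
import torch
import torch.nn as nn

# Minimal RNN-based sequence-to-sequence sketch. The encoder compresses the source
# sequence into a fixed-length context vector; the decoder unrolls from that context
# one step at a time.
class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, d=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, d)
        self.tgt_emb = nn.Embedding(tgt_vocab, d)
        self.encoder = nn.GRU(d, d, batch_first=True)
        self.decoder = nn.GRU(d, d, batch_first=True)
        self.out = nn.Linear(d, tgt_vocab)

    def forward(self, src, tgt):
        _, context = self.encoder(self.src_emb(src))           # (1, B, d) fixed-length summary
        dec_out, _ = self.decoder(self.tgt_emb(tgt), context)  # decode conditioned on the context
        return self.out(dec_out)                               # (B, T_tgt, tgt_vocab) next-token logits

logits = Seq2Seq(src_vocab=1000, tgt_vocab=1200)(
    torch.randint(0, 1000, (2, 9)), torch.randint(0, 1200, (2, 7)))
print(logits.shape)  # torch.Size([2, 7, 1200])
```

The fixed-length context vector is the bottleneck that attention (next section) removes.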
Attention Mechanism
- Attention allows models to focus on relevant parts of the input sequence when generating each output.
- To compute attention for an input token (say token 2), its query $q_2$ is compared against every key $k_j$; the resulting weights decide how much each value $v_j$ contributes to that token's output.
- Scaled Dot-Product Attention:
$\text{Attention}(Q, K, V) = \text{softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right) V$
- Scaling by $\sqrt{d_k}$ keeps the dot products from growing with the key dimension, which would otherwise push the softmax into saturated regions with tiny gradients (see the sketch below).
Multi-Headed Attention lets the model attend to different types of relationships simultaneously.
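A minimal PyTorch sketch of the formula above (the helper name and tensor shapes are illustrative); multi-head attention simply runs this in parallel over several smaller heads:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: (batch, heads, seq, d_k). Returns the weighted values and the attention weights."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))   # (batch, heads, seq_q, seq_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))  # block disallowed positions
    weights = scores.softmax(dim=-1)                           # each row sums to 1
    return weights @ v, weights

# Example: 2 heads over a length-5 sequence with d_k = 8
q = k = v = torch.randn(1, 2, 5, 8)
out, attn = scaled_dot_product_attention(q, k, v)
print(out.shape, attn.shape)  # torch.Size([1, 2, 5, 8]) torch.Size([1, 2, 5, 5])
```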
Transformer Architecture
- Encoder-Only: e.g., BERT, RoBERTa – good for representation tasks.
- Decoder-Only: e.g., GPT – optimized for generation.
- Full Transformer: encoder-decoder architecture, as in translation models like T5.
GPT (2018) showed that a decoder-only stack (with causal masking) can be trained efficiently at scale, setting the template for modern large language models.
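A small sketch of the causal mask such a decoder applies during self-attention (it could be passed as `mask` to the attention helper above):

```python
import torch

# Lower-triangular (causal) mask: position i may only attend to positions <= i,
# which is what makes a decoder-only model autoregressive.
T = 5
causal_mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
print(causal_mask.int())
# tensor([[1, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0],
#         [1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1]], dtype=torch.int32)
```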
Vision Transformers (ViT)
- Adapt Transformer architectures to image tasks by treating image patches as tokens (sketched below).
- Unlike CNNs, which build in locality and translation invariance, ViTs must learn these patterns from data and therefore need large-scale training datasets.
- Despite needing more data, ViTs can outperform ConvNets on many vision benchmarks.
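A minimal patch-embedding sketch, assuming the standard ViT-Base setup (224×224 images, 16×16 patches, 768-dim tokens); a stride = kernel-size convolution implements the split-and-project step:

```python
import torch
import torch.nn as nn

# Patch embedding sketch: split an image into non-overlapping patches and project
# each patch to a token embedding. A convolution with stride == kernel_size does exactly this.
class PatchEmbed(nn.Module):
    def __init__(self, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                       # x: (B, 3, 224, 224)
        x = self.proj(x)                        # (B, dim, 14, 14)
        return x.flatten(2).transpose(1, 2)     # (B, 196, dim): 196 patch tokens

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```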
Modern Architectures and Challenges
- Many state-of-the-art models blend ideas across architectures.
- A common challenge is the quadratic time and memory complexity of standard self-attention, especially for long sequences.
- Solutions include IO-aware exact attention (FlashAttention) and approximate or sparse-attention variants (Performer, Linformer, Longformer), among others (see the back-of-the-envelope sketch below).
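A back-of-the-envelope sketch of why this matters: the $n \times n$ attention matrix alone (per head, per layer, stored in float32) quickly dominates memory as the sequence length grows.

```python
# Rough memory footprint of the n x n attention matrix for a single head, in float32.
def attn_matrix_gib(n, bytes_per_element=4):
    return n * n * bytes_per_element / 2**30

for n in (1_000, 10_000, 100_000):
    print(f"n={n:>7}: {attn_matrix_gib(n):10.3f} GiB per head per layer")
# n=   1000:      0.004 GiB per head per layer
# n=  10000:      0.373 GiB per head per layer
# n= 100000:     37.253 GiB per head per layer
```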
Lab 08 Discussion Highlights
1. Self-Attention vs. Cross-Attention
| Feature | Self-Attention | Cross-Attention |
|---|---|---|
| What attends to what? | Tokens attend to other tokens in the same sequence (Q, K, V from one sequence) | Queries attend to an external sequence's K/V (e.g., decoder attends to encoder) |
| Use Case | Intra-sequence modeling (within a sentence or image) | Inter-sequence modeling (e.g., translation, multi-modal) |
| Masking | Causal (in decoder) or none (in encoder) | Usually unmasked: each decoder position can see all encoder positions |
| Parameters | Q, K, V projections all applied to the same input | Q projected from decoder states; K/V projected from encoder output |
| Intuition | “How relevant is token i to j in the same sequence?” | “How relevant is encoder token j to decoder position i?” |
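A quick sketch of the difference using PyTorch's `nn.MultiheadAttention` (shapes are illustrative; a real model would use separate modules and weights for the two attention types):

```python
import torch
import torch.nn as nn

d, heads = 64, 4
attn = nn.MultiheadAttention(d, heads, batch_first=True)

src = torch.randn(2, 10, d)   # encoder-side sequence (e.g., source sentence)
tgt = torch.randn(2, 7, d)    # decoder-side sequence (e.g., partial translation)

# Self-attention: Q, K, V all come from the same sequence.
self_out, _ = attn(tgt, tgt, tgt)      # (2, 7, 64)

# Cross-attention: queries from the decoder, keys/values from the encoder output.
# (One module is reused here only to keep the sketch short.)
cross_out, _ = attn(tgt, src, src)     # (2, 7, 64)
print(self_out.shape, cross_out.shape)
```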
2. Efficiency and Scalability
| Property | Transformer | RNN | CNN |
|---|---|---|---|
| Parallelism | Fully parallel across tokens | Sequential steps | Parallel in spatial dimension |
| Max Path Length | $O(1)$ (any token attends to any other directly) | $O(n)$ (step-by-step dependency) | $O(\log_k n)$ (dilated) or $O(n/k)$ (stacked) |
| Compute per Layer | $O(n^2 \cdot d)$ | $O(n \cdot d^2)$ | $O(k \cdot n \cdot d^2)$ |
| Scalability | Scales to 100B+ params & long contexts | Struggles with long sequences | Strong for 2D (images); limited receptive field on long 1D sequences |
| Inductive Bias | Weak (order only via positional encodings) | Temporal/sequential bias | Locality & translation invariance |
| Best Use Case | Language, code, protein folding | Streaming, time series | Vision, audio, local patterns |
3. Self-Attention vs. 1D Convolution (Kernel Size = Sequence Length)
| Aspect | Self-Attention | 1D Convolution (Kernel = $n$) |
|---|---|---|
| Receptive Field | Global – all tokens attend to each other | Global – the kernel spans the entire sequence |
| Parameter Count | $O(d^2)$ per head | $O(n \cdot d_{in} \cdot d_{out})$ |
| Computation | $O(n^2 \cdot d)$ | $O(n \cdot d_{in} \cdot d_{out})$ |
| Content Dependence | Attention weights vary with input content | Fixed weights, same for all inputs |
| Positional Info | Needs explicit positional encodings | Implicit in the kernel's weight positions |
| Inductive Bias | None – must be learned | Strong locality bias |
| Interpretability | High – via attention maps | Harder to interpret |
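A small sketch of the parameter-count rows, assuming $n = 128$ and $d = 64$: the full-sequence convolution's weights grow with $n$, while the attention projections do not (and, unlike the fixed kernel, the attention weights themselves depend on the input).

```python
import torch
import torch.nn as nn

n, d = 128, 64   # sequence length and channel/embedding size

# 1D convolution whose kernel spans the whole sequence: fixed weights, identical
# for every input, and the parameter count grows linearly with n.
conv = nn.Conv1d(in_channels=d, out_channels=d, kernel_size=n)
print(sum(p.numel() for p in conv.parameters()))   # n*d*d + d = 524,352

# Single-head self-attention projections: parameter count is independent of n;
# the mixing weights softmax(QK^T) are recomputed from the input content.
attn = nn.MultiheadAttention(embed_dim=d, num_heads=1)
print(sum(p.numel() for p in attn.parameters()))   # ~4*d*d + biases = 16,640
```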
4. Encoder vs. Decoder Roles
Encoder
- Input: Full source sequence (text, image patches, etc.)
- Operation: Stacks of self-attention + feedforward layers (no masking)
- Output: Contextualized hidden states
Decoder
- Input: Past outputs (during training, the ground-truth prefix via teacher forcing) and the encoder's output
- Operation:
- Masked self-attention (autoregressive)
- Cross-attention over encoder output
- Feedforward and normalization
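A minimal sketch of both roles using PyTorch's built-in layers (dimensions are illustrative); the decoder layer combines masked self-attention over `tgt` with cross-attention over the encoder's output:

```python
import torch
import torch.nn as nn

d, heads = 64, 4
enc_layer = nn.TransformerEncoderLayer(d_model=d, nhead=heads, batch_first=True)
dec_layer = nn.TransformerDecoderLayer(d_model=d, nhead=heads, batch_first=True)

src = torch.randn(2, 10, d)   # source tokens (already embedded)
tgt = torch.randn(2, 7, d)    # target tokens generated so far (already embedded)

memory = enc_layer(src)                                    # encoder: unmasked self-attention
causal = nn.Transformer.generate_square_subsequent_mask(7)
out = dec_layer(tgt, memory, tgt_mask=causal)              # decoder: masked self-attn + cross-attn
print(memory.shape, out.shape)   # torch.Size([2, 10, 64]) torch.Size([2, 7, 64])
```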
| Feature | Encoder-Only (e.g., BERT) | Decoder-Only (e.g., GPT) |
|---|---|---|
| Architecture | Self-attention blocks only | Masked self-attention blocks |
| Training Task | Masked language modeling, contrastive learning | Autoregressive next-token prediction |
| Usage | Embedding generation, classification, QA | Text generation, code completion, planning |
| Context Window | Typically shorter | Long-range generation (streaming, large context) |
| Fine-Tuning | Add task-specific heads, adapters | Prompting, LoRA, RLHF |
5. Informal Definitions
- Encoder: A neural compressor – ingests input and emits a representation that encodes all relevant context. It’s the reader.
- Decoder: A neural expander – uses this representation (and previous outputs) to produce new content. It’s the writer.