Self-Supervised Learning
Notes summarizing COMP3010 Lecture 10, the labs, and online resources.
Table of Contents
- What is Self-Supervised Learning?
- Pretext Tasks
- Contrastive Learning
- Masked Autoencoders
- Deep Clustering
- Multi-modal SSL
What is Self-Supervised Learning?
Recap:
- Supervised Learning: Data $(x, y)$, where the goal is to learn a mapping $x \rightarrow y$. Examples include classification and regression.
- Unsupervised Learning: Data $x$ only; the goal is to discover hidden structure (e.g., clusters, dimensions).
The Problem:
- Supervised learning is powerful but labeling large datasets is expensive.
- Unsupervised learning is underconstrained—models often struggle to learn useful patterns without guidance.
Enter Self-Supervised Learning (SSL):
- SSL trains models using pseudo-labels generated from the data itself.
- It combines the scalability of unsupervised learning with the structure of supervised frameworks.
SSL Variants:
- Self-supervised learning: Predict part of the data from other parts.
- Contrastive learning: Learn by distinguishing between similar and dissimilar pairs.
- Weak supervision: Use indirect/noisy labels from heuristics or other models.
Pretext Tasks
Definition:
A pretext task is an artificial task designed to help models learn meaningful representations without labeled data.
Strategy:
- Pre-train a model on a pretext task (e.g., rotation prediction, inpainting).
- Transfer the learned encoder to downstream tasks (e.g., classification, detection).
Examples:
- Autoencoders: Learn to reconstruct the input from compressed latent features.
- Rotation Prediction (RotNet): Predict image rotation angle (0°, 90°, 180°, 270°).
- Jigsaw Puzzle: Solve spatial permutations of image patches.
- Inpainting: Predict missing pixels in an image.
- Colorization: Predict color from grayscale images.
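As a concrete illustration, the rotation-prediction pretext task can be sketched in a few lines of NumPy: rotate an image by a random multiple of 90° and use the rotation index as the pseudo-label (a minimal sketch; a real RotNet would train a CNN to classify these labels).

```python
import numpy as np

def make_rotation_example(image, rng):
    """RotNet-style pseudo-labeling: rotate the image by k*90 degrees
    and return (rotated_image, k). The label k requires no annotation,
    since it is generated from the data itself."""
    k = int(rng.integers(0, 4))          # pseudo-label: 0, 1, 2, or 3
    rotated = np.rot90(image, k=k)       # rotate counter-clockwise k times
    return rotated, k

# Usage: a toy 4x4 single-channel "image"
rng = np.random.default_rng(0)
img = np.arange(16).reshape(4, 4)
rotated, label = make_rotation_example(img, rng)
```

The model never sees the original orientation; to predict the label it must learn features (object shape, typical layout) that are useful far beyond the pretext task itself.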
Benefits:
- Models learn generalizable visual representations.
- However, designing pretext tasks can be tedious and task-specific.
Contrastive Learning
Core Idea:
Learn representations by pulling together similar data points (positives) and pushing apart dissimilar ones (negatives).
Setup:
- Given a reference image $x$, generate an augmented view $x^+$ (the positive); other images in the batch serve as negatives $x^-$.
- Use a contrastive loss (e.g., InfoNCE) to:
  - Maximize similarity between $x$ and $x^+$
  - Minimize similarity between $x$ and all $x^-$
Loss (simplified):
$\mathcal{L}_{x} = -\log \frac{\exp(s(x, x^+)/\tau)}{\sum_{x'} \exp(s(x, x')/\tau)}$
Where $s(\cdot, \cdot)$ is a similarity function (e.g., cosine), and $\tau$ is a temperature parameter.
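The loss above can be sketched directly in NumPy for a single anchor, using cosine similarity for $s$ (a toy sketch with made-up 2-D embeddings, not a training-ready implementation):

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE loss for one anchor: -log softmax of the positive's
    similarity over all candidates, with temperature tau.
    anchor, positive: (d,) vectors; negatives: (n, d) array."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    sims = np.array([cos(anchor, positive)] +
                    [cos(anchor, neg) for neg in negatives]) / tau
    sims -= sims.max()                        # numerical stability
    return -(sims[0] - np.log(np.exp(sims).sum()))

# An aligned positive gives a lower loss than a misaligned one
a = np.array([1.0, 0.0])
loss_easy = info_nce(a, np.array([0.9, 0.1]), np.array([[-1.0, 0.0]]))
loss_hard = info_nce(a, np.array([-1.0, 0.0]), np.array([[0.9, 0.1]]))
```

Lowering $\tau$ sharpens the softmax, so the loss concentrates on the hardest negatives; frameworks like SimCLR treat it as an important hyperparameter.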
Applications:
- SimCLR and MoCo are notable contrastive frameworks; BYOL is a closely related method that avoids explicit negatives; CLIP applies a contrastive objective across image–text pairs.
- Visual results show strong performance in downstream classification tasks.
Masked Autoencoders (MAE)
Motivation:
Inspired by NLP models like BERT, MAEs mask parts of input images and reconstruct the missing parts.
Pipeline:
- Divide the image into non-overlapping patches.
- Mask a high percentage of them (e.g., 75%).
- Encode the visible patches using a Vision Transformer (ViT).
- Decode to reconstruct the masked patches.
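The patchify-and-mask steps of the pipeline can be sketched with NumPy (an illustrative sketch only; the encoder/decoder ViT stages are omitted, and the function name is our own):

```python
import numpy as np

def patchify_and_mask(image, patch=4, mask_ratio=0.75, rng=None):
    """Split a square image into non-overlapping patches, then mask a
    fraction of them as in MAE. Returns the visible patches (the only
    ones the encoder sees) and the indices of the masked patches."""
    rng = rng or np.random.default_rng(0)
    h, w = image.shape
    # reshape into a grid of (patch x patch) blocks, flatten each block
    patches = image.reshape(h // patch, patch, w // patch, patch)
    patches = patches.transpose(0, 2, 1, 3).reshape(-1, patch * patch)
    n = patches.shape[0]
    n_mask = int(round(n * mask_ratio))
    perm = rng.permutation(n)
    masked_idx, visible_idx = perm[:n_mask], perm[n_mask:]
    return patches[visible_idx], masked_idx

# A 16x16 image yields 16 patches; at 75% masking only 4 stay visible
img = np.arange(16 * 16, dtype=float).reshape(16, 16)
visible, masked_idx = patchify_and_mask(img)
```

Because the encoder processes only the visible 25% of patches, MAE pretraining is substantially cheaper per image than methods that encode the full input.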
Key Insight:
MAEs scale well and often outperform contrastive methods on large vision tasks. They are efficient and effective for vision pretraining.
Deep Clustering
Process:
1. Initialize a CNN.
2. Extract features for many images.
3. Cluster the features using K-means.
4. Use cluster assignments as pseudo-labels.
5. Train the CNN to predict these pseudo-labels.
6. Repeat steps 2–5.
Benefit:
Does not require labels and progressively improves the feature quality through bootstrapping.
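The clustering step (steps 3–4 above) can be sketched with a minimal NumPy K-means that turns features into pseudo-labels (a toy sketch on synthetic 2-D "features"; a real pipeline like DeepCluster would extract features with the CNN and retrain on the resulting labels):

```python
import numpy as np

def kmeans_pseudo_labels(features, k, iters=20, rng=None):
    """Cluster feature vectors with plain K-means and return the
    cluster assignments, to be used as pseudo-labels for training."""
    rng = rng or np.random.default_rng(0)
    # initialize centers from k random data points
    centers = features[rng.choice(len(features), k, replace=False)]
    for _ in range(iters):
        # assign each feature vector to its nearest center
        d = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        # recompute centers; keep the old center if a cluster empties
        for j in range(k):
            if (labels == j).any():
                centers[j] = features[labels == j].mean(axis=0)
    return labels

# Two well-separated synthetic blobs should get two distinct pseudo-labels
rng = np.random.default_rng(1)
feats = np.vstack([rng.normal(0.0, 0.1, (20, 2)),
                   rng.normal(10.0, 0.1, (20, 2))])
labels = kmeans_pseudo_labels(feats, k=2, rng=rng)
```

In the full loop, the CNN is then trained to predict these labels, its improved features are re-clustered, and the cycle repeats.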
Multi-Modal SSL
Idea:
Use additional modalities (e.g., audio, video, text, depth) alongside images.
Examples:
- Image + Video: Learn from temporal consistency across frames.
- Image + Audio: Predict ambient sound from visual input.
- Image + 3D: Leverage point clouds or depth maps.
- Image + Text: Use captions or web data for joint training (e.g., CLIP, VirTex).
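The image + text case can be sketched as CLIP-style retrieval: normalize embeddings from the two modalities into a shared space and match each image to its most similar caption (toy hand-made embeddings here, not outputs of real encoders).

```python
import numpy as np

def clip_style_match(img_emb, txt_emb):
    """CLIP-style retrieval sketch: L2-normalize each modality's
    embeddings, compute all pairwise cosine similarities, and return
    the best-matching caption index for every image."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    sims = img @ txt.T                  # (n_images, n_captions)
    return sims.argmax(axis=1)          # best caption per image

# Toy embeddings where image i aligns with caption i
imgs = np.array([[1.0, 0.1], [0.1, 1.0]])
caps = np.array([[0.9, 0.0], [0.0, 0.9]])
matches = clip_style_match(imgs, caps)  # → array([0, 1])
```

During pretraining, CLIP applies an InfoNCE-style loss over exactly this similarity matrix, treating matched image–caption pairs as positives and all other pairings in the batch as negatives.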
Why Language?
- Semantic Density: Few words convey rich meaning.
- Universality: Can describe almost anything.
- Scalability: Easy to gather at scale from the web.
Summary
- Self-Supervised Learning enables scalable learning without manual labels.
- Models are pretrained on artificial tasks, then fine-tuned for downstream use.
- Pretext task design, contrastive learning, and MAEs are major strategies.
- SSL is foundational in vision, language, and increasingly in multi-modal learning.