📢 Notice 📢

Notes summarizing COMP3010 Lecture 10, the accompanying labs, and online resources.

Table of Contents

  • What is Self-Supervised Learning?
  • Pretext Tasks
  • Contrastive Learning
  • Masked Autoencoders
  • Deep Clustering
  • Multi-Modal SSL

What is Self-Supervised Learning?

Recap:

  • Supervised Learning: Data $(x, y)$, where the goal is to learn a mapping $x \rightarrow y$. Examples include classification and regression.
  • Unsupervised Learning: Data $x$ only; the goal is to discover hidden structure (e.g., clusters, low-dimensional representations).

The Problem:

  • Supervised learning is powerful, but labeling large datasets is expensive.
  • Unsupervised learning is underconstrained; without guidance, models often struggle to learn useful patterns.

Enter Self-Supervised Learning (SSL):

  • SSL trains models using pseudo-labels generated from the data itself.
  • It combines the scalability of unsupervised learning with the structure of supervised frameworks.

SSL Variants:

  • Self-supervised (predictive) learning: Predict part of the data from other parts.
  • Contrastive learning: Learn by distinguishing between similar and dissimilar pairs.
  • Weak supervision: Use indirect/noisy labels from heuristics or other models.

Pretext Tasks

Definition:

A pretext task is an artificial task designed to help models learn meaningful representations without labeled data.

Strategy:

  1. Pre-train a model on a pretext task (e.g., rotation prediction, inpainting).
  2. Transfer the learned encoder to downstream tasks (e.g., classification, detection).

Examples:

  • Autoencoders: Learn to reconstruct the input from compressed latent features.
  • Rotation Prediction (RotNet): Predict image rotation angle (0°, 90°, 180°, 270°).
  • Jigsaw Puzzle: Solve spatial permutations of image patches.
  • Inpainting: Predict missing pixels in an image.
  • Colorization: Predict color from grayscale images.
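
A minimal sketch of the rotation-prediction pretext task (RotNet-style) is shown below, assuming a torchvision ResNet-18 backbone; the model and hyperparameters are illustrative rather than the lecture's exact setup. The key point is that the rotation labels are generated from the data itself.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class RotNet(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = resnet18(weights=None)
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])  # drop the fc layer
        self.head = nn.Linear(512, 4)  # 4 pseudo-classes: 0°, 90°, 180°, 270°

    def forward(self, x):
        return self.head(self.encoder(x).flatten(1))

def rotate_batch(images):
    """Create pseudo-labels by rotating each image by 0/90/180/270 degrees."""
    views, labels = [], []
    for k in range(4):
        views.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(views), torch.cat(labels)

model = RotNet()
images = torch.randn(8, 3, 224, 224)        # stand-in for an unlabeled batch
views, labels = rotate_batch(images)
loss = nn.functional.cross_entropy(model(views), labels)  # pretext loss, no manual labels
# After pretraining, model.encoder is transferred to the downstream task.
```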

Benefits:

  • Models learn generalizable visual representations.
  • However, designing pretext tasks can be tedious and task-specific.

Contrastive Learning

Core Idea:

Learn representations by pulling together similar data points (positives) and pushing apart dissimilar ones (negatives).

Setup:

  • Given a reference image $x$, create a positive $x^+$ by augmenting $x$; negatives $x^-$ come from other images (often their augmented views).
  • Use a contrastive loss (e.g., InfoNCE) to:
    • Maximize similarity between $x$ and $x^+$
    • Minimize similarity between $x$ and every negative $x^-$

Loss (simplified):

$\mathcal{L}_x = -\log \frac{\exp(s(x, x^+)/\tau)}{\sum_{x'} \exp(s(x, x')/\tau)}$

Where $s(\cdot, \cdot)$ is a similarity function (e.g., cosine), and $\tau$ is a temperature parameter.
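
A minimal sketch of this loss with SimCLR-style in-batch negatives, assuming each of the $N$ images in a batch contributes one positive pair and every other batch entry acts as a negative (names are illustrative):

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.1):
    """z1, z2: (N, D) embeddings of two augmented views of the same N images."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    sim = z1 @ z2.t() / tau                # cosine similarities s(x, x') / tau
    targets = torch.arange(z1.size(0))     # positives lie on the diagonal
    return F.cross_entropy(sim, targets)   # -log softmax over each row, as in the loss above

z1, z2 = torch.randn(32, 128), torch.randn(32, 128)
loss = info_nce(z1, z2)
```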

Applications:

  • SimCLR and MoCo are notable contrastive learning frameworks; BYOL is a closely related method that drops explicit negatives, and CLIP applies the same idea to image-text pairs.
  • Representations learned this way transfer well to downstream classification tasks.

Masked Autoencoders (MAE)

Motivation:

Inspired by masked language modeling in NLP (e.g., BERT), MAEs mask part of the input image and train the model to reconstruct the missing content.

Pipeline:

  1. Divide the image into non-overlapping patches.
  2. Mask a high percentage of them (e.g., 75%).
  3. Encode the visible patches using a Vision Transformer (ViT).
  4. Decode to reconstruct the masked patches.
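
A minimal sketch of steps 1-3 (patchify, mask 75%, encode the visible patches), assuming 16x16 patches on 224x224 images; the linear layer stands in for a real ViT encoder:

```python
import torch
import torch.nn as nn

def patchify(imgs, p=16):
    """(N, 3, H, W) -> (N, L, p*p*3) non-overlapping patches."""
    n, c, h, w = imgs.shape
    x = imgs.reshape(n, c, h // p, p, w // p, p)
    return x.permute(0, 2, 4, 3, 5, 1).reshape(n, (h // p) * (w // p), p * p * c)

def random_mask(patches, mask_ratio=0.75):
    """Keep a random subset of patches; the rest must be reconstructed by the decoder."""
    n, l, d = patches.shape
    keep = int(l * (1 - mask_ratio))
    idx = torch.rand(n, l).argsort(dim=1)[:, :keep]   # indices of the visible patches
    return torch.gather(patches, 1, idx.unsqueeze(-1).expand(-1, -1, d)), idx

imgs = torch.randn(4, 3, 224, 224)
patches = patchify(imgs)                  # (4, 196, 768)
visible, idx = random_mask(patches)       # (4, 49, 768): only 25% of patches are encoded
encoder = nn.Linear(768, 256)             # placeholder for the ViT encoder
latent = encoder(visible)                 # a decoder would then reconstruct the masked patches
```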

Key Insight:

MAEs scale well and often outperform contrastive methods on large vision benchmarks. Because the encoder processes only the visible patches, pretraining is also computationally efficient.

Deep Clustering

Process:

  1. Initialize a CNN.
  2. Extract features for many images.
  3. Cluster the features using K-means.
  4. Use cluster assignments as pseudo-labels.
  5. Train CNN to predict these pseudo-labels.
  6. Repeat steps 2–5.
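
A minimal sketch of one pass through this loop, assuming a small stand-in encoder and illustrative hyperparameters (cluster count, optimizer, number of steps):

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

def deep_cluster_step(encoder, imgs, k=10, steps=5, lr=1e-3):
    # Step 2: extract features for all images.
    with torch.no_grad():
        feats = encoder(imgs)
    # Steps 3-4: cluster the features; cluster IDs become pseudo-labels.
    pseudo = torch.as_tensor(KMeans(n_clusters=k, n_init=10).fit_predict(feats.numpy()))
    # Step 5: train the encoder plus a fresh linear head to predict the pseudo-labels.
    head = nn.Linear(feats.size(1), k)
    opt = torch.optim.SGD(list(encoder.parameters()) + list(head.parameters()), lr=lr)
    for _ in range(steps):
        loss = nn.functional.cross_entropy(head(encoder(imgs)), pseudo)
        opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64), nn.ReLU(), nn.Linear(64, 32))
imgs = torch.randn(256, 3, 32, 32)          # stand-in for unlabeled images
loss = deep_cluster_step(encoder, imgs)     # Step 6: call repeatedly to bootstrap
```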

Benefit:

Requires no labels and progressively improves feature quality through bootstrapping.

Multi-Modal SSL

Idea:

Use additional modalities (e.g., audio, video, text, depth) alongside images.

Examples:

  • Image + Video: Learn from temporal consistency across frames.
  • Image + Audio: Predict ambient sound from visual input.
  • Image + 3D: Leverage point clouds or depth maps.
  • Image + Text: Use captions or web data for joint training (e.g., CLIP, VirTex).
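
As a concrete example of the image + text case, here is a minimal sketch of a CLIP-style symmetric contrastive loss over a batch of matched image-caption embeddings (the image and text encoders are assumed to exist and are omitted; names are illustrative):

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, tau=0.07):
    """img_emb, txt_emb: (N, D) embeddings of N matched image-caption pairs."""
    img_emb, txt_emb = F.normalize(img_emb, dim=1), F.normalize(txt_emb, dim=1)
    logits = img_emb @ txt_emb.t() / tau
    targets = torch.arange(img_emb.size(0))   # matched pairs lie on the diagonal
    # Symmetric loss: match images to their captions and captions to their images.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

img_emb, txt_emb = torch.randn(16, 512), torch.randn(16, 512)
loss = clip_loss(img_emb, txt_emb)
```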

Why Language?

  • Semantic Density: Few words convey rich meaning.
  • Universality: Can describe almost anything.
  • Scalability: Easy to gather at scale from the web.

Summary

  • Self-Supervised Learning enables scalable learning without manual labels.
  • Models are pretrained on pretext tasks, then fine-tuned for downstream use.
  • Pretext task design, contrastive learning, and MAEs are major strategies.
  • SSL is foundational in vision, language, and increasingly in multi-modal learning.
