📢 Notice 📢

Notes summarizing COMP3010 Lecture 10, the accompanying labs, and online resources.

Table of Contents

  • What is Self-Supervised Learning?
  • Pretext Tasks
  • Contrastive Learning
  • Masked Autoencoders
  • Deep Clustering
  • Multi-Modal SSL

What is Self-Supervised Learning?

Recap:

  • Supervised Learning: Data $(x, y)$, where the goal is to learn a mapping $x \rightarrow y$. Examples include classification and regression.
  • Unsupervised Learning: Data $x$ only; the goal is to discover hidden structure (e.g., clusters, low-dimensional representations).

The Problem:

  • Supervised learning is powerful, but labeling large datasets is expensive.
  • Unsupervised learning is underconstrained; without guidance, models often struggle to learn useful patterns.

Enter Self-Supervised Learning (SSL):

  • SSL trains models using pseudo-labels generated from the data itself.
  • It combines the scalability of unsupervised learning with the structure of supervised frameworks.

SSL Variants:

  • Self-supervised (predictive) learning: Predict part of the data from other parts.
  • Contrastive learning: Learn by distinguishing between similar and dissimilar pairs.
  • Weak supervision: Use indirect/noisy labels from heuristics or other models.

Pretext Tasks

Definition:

A pretext task is an artificial task designed to help models learn meaningful representations without labeled data.

Strategy:

  1. Pre-train a model on a pretext task (e.g., rotation prediction, inpainting).
  2. Transfer the learned encoder to downstream tasks (e.g., classification, detection).

Examples:

  • Autoencoders: Learn to reconstruct the input from compressed latent features.
  • Rotation Prediction (RotNet): Predict image rotation angle (0°, 90°, 180°, 270°).
  • Jigsaw Puzzle: Solve spatial permutations of image patches.
  • Inpainting: Predict missing pixels in an image.
  • Colorization: Predict color from grayscale images.
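
A minimal sketch of the rotation-prediction pretext task (RotNet-style) is shown below, assuming a torchvision ResNet-18 backbone; the model and hyperparameters are illustrative rather than the lecture's exact setup. The key point is that the rotation labels are generated from the data itself.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class RotNet(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = resnet18(weights=None)
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])  # drop the fc layer
        self.head = nn.Linear(512, 4)  # 4 pseudo-classes: 0°, 90°, 180°, 270°

    def forward(self, x):
        return self.head(self.encoder(x).flatten(1))

def rotate_batch(images):
    """Create pseudo-labels by rotating each image by 0/90/180/270 degrees."""
    views, labels = [], []
    for k in range(4):
        views.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(views), torch.cat(labels)

model = RotNet()
images = torch.randn(8, 3, 224, 224)        # stand-in for an unlabeled batch
views, labels = rotate_batch(images)
loss = nn.functional.cross_entropy(model(views), labels)  # pretext loss, no manual labels
# After pretraining, model.encoder is transferred to the downstream task.
```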

Benefits:

  • Models learn generalizable visual representations.
  • However, designing pretext tasks can be tedious and task-specific.

Contrastive Learning

Core Idea:

Learn representations by pulling together similar data points (positives) and pushing apart dissimilar ones (negatives).

Setup:

  • Given a reference image $x$, create a positive $x^+$ by augmenting $x$; negatives $x^-$ come from other images (often their augmented views).
  • Use a contrastive loss (e.g., InfoNCE) to:
    • Maximize similarity between $x$ and $x^+$
    • Minimize similarity between $x$ and every negative $x^-$

Loss (simplified):

$\mathcal{L}_x = -\log \frac{\exp(s(x, x^+)/\tau)}{\sum_{x'} \exp(s(x, x')/\tau)}$

Where $s(\cdot, \cdot)$ is a similarity function (e.g., cosine), and $\tau$ is a temperature parameter.
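
A minimal sketch of this loss with SimCLR-style in-batch negatives, assuming each of the $N$ images in a batch contributes one positive pair and every other batch entry acts as a negative (names are illustrative):

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.1):
    """z1, z2: (N, D) embeddings of two augmented views of the same N images."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    sim = z1 @ z2.t() / tau                # cosine similarities s(x, x') / tau
    targets = torch.arange(z1.size(0))     # positives lie on the diagonal
    return F.cross_entropy(sim, targets)   # -log softmax over each row, as in the loss above

z1, z2 = torch.randn(32, 128), torch.randn(32, 128)
loss = info_nce(z1, z2)
```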

Applications:

  • SimCLR and MoCo are notable contrastive learning frameworks; BYOL is a closely related method that drops explicit negatives, and CLIP applies the same idea to image-text pairs.
  • Representations learned this way transfer well to downstream classification tasks.

Masked Autoencoders (MAE)

Motivation:

Inspired by masked language modeling in NLP (e.g., BERT), MAEs mask part of the input image and train the model to reconstruct the missing content.

Pipeline:

  1. Divide the image into non-overlapping patches.
  2. Mask a high percentage of them (e.g., 75%).
  3. Encode the visible patches using a Vision Transformer (ViT).
  4. Decode to reconstruct the masked patches.
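
A minimal sketch of steps 1-3 (patchify, mask 75%, encode the visible patches), assuming 16x16 patches on 224x224 images; the linear layer stands in for a real ViT encoder:

```python
import torch
import torch.nn as nn

def patchify(imgs, p=16):
    """(N, 3, H, W) -> (N, L, p*p*3) non-overlapping patches."""
    n, c, h, w = imgs.shape
    x = imgs.reshape(n, c, h // p, p, w // p, p)
    return x.permute(0, 2, 4, 3, 5, 1).reshape(n, (h // p) * (w // p), p * p * c)

def random_mask(patches, mask_ratio=0.75):
    """Keep a random subset of patches; the rest must be reconstructed by the decoder."""
    n, l, d = patches.shape
    keep = int(l * (1 - mask_ratio))
    idx = torch.rand(n, l).argsort(dim=1)[:, :keep]   # indices of the visible patches
    return torch.gather(patches, 1, idx.unsqueeze(-1).expand(-1, -1, d)), idx

imgs = torch.randn(4, 3, 224, 224)
patches = patchify(imgs)                  # (4, 196, 768)
visible, idx = random_mask(patches)       # (4, 49, 768): only 25% of patches are encoded
encoder = nn.Linear(768, 256)             # placeholder for the ViT encoder
latent = encoder(visible)                 # a decoder would then reconstruct the masked patches
```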

Key Insight:

MAEs scale well and often outperform contrastive methods on large vision benchmarks. Because the encoder processes only the visible patches, pretraining is also computationally efficient.

Deep Clustering

Process:

  1. Initialize a CNN.
  2. Extract features for many images.
  3. Cluster the features using K-means.
  4. Use cluster assignments as pseudo-labels.
  5. Train CNN to predict these pseudo-labels.
  6. Repeat steps 2–5.
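
A minimal sketch of one pass through this loop, assuming a small stand-in encoder and illustrative hyperparameters (cluster count, optimizer, number of steps):

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

def deep_cluster_step(encoder, imgs, k=10, steps=5, lr=1e-3):
    # Step 2: extract features for all images.
    with torch.no_grad():
        feats = encoder(imgs)
    # Steps 3-4: cluster the features; cluster IDs become pseudo-labels.
    pseudo = torch.as_tensor(KMeans(n_clusters=k, n_init=10).fit_predict(feats.numpy()))
    # Step 5: train the encoder plus a fresh linear head to predict the pseudo-labels.
    head = nn.Linear(feats.size(1), k)
    opt = torch.optim.SGD(list(encoder.parameters()) + list(head.parameters()), lr=lr)
    for _ in range(steps):
        loss = nn.functional.cross_entropy(head(encoder(imgs)), pseudo)
        opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64), nn.ReLU(), nn.Linear(64, 32))
imgs = torch.randn(256, 3, 32, 32)          # stand-in for unlabeled images
loss = deep_cluster_step(encoder, imgs)     # Step 6: call repeatedly to bootstrap
```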

Benefit:

Requires no labels and progressively improves feature quality through bootstrapping.

Multi-Modal SSL

Idea:

Use additional modalities (e.g., audio, video, text, depth) alongside images.

Examples:

  • Image + Video: Learn from temporal consistency across frames.
  • Image + Audio: Predict ambient sound from visual input.
  • Image + 3D: Leverage point clouds or depth maps.
  • Image + Text: Use captions or web data for joint training (e.g., CLIP, VirTex).
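
As a concrete example of the image + text case, here is a minimal sketch of a CLIP-style symmetric contrastive loss over a batch of matched image-caption embeddings (the image and text encoders are assumed to exist and are omitted; names are illustrative):

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, tau=0.07):
    """img_emb, txt_emb: (N, D) embeddings of N matched image-caption pairs."""
    img_emb, txt_emb = F.normalize(img_emb, dim=1), F.normalize(txt_emb, dim=1)
    logits = img_emb @ txt_emb.t() / tau
    targets = torch.arange(img_emb.size(0))   # matched pairs lie on the diagonal
    # Symmetric loss: match images to their captions and captions to their images.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

img_emb, txt_emb = torch.randn(16, 512), torch.randn(16, 512)
loss = clip_loss(img_emb, txt_emb)
```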

Why Language?

  • Semantic Density: Few words convey rich meaning.
  • Universality: Can describe almost anything.
  • Scalability: Easy to gather at scale from the web.

Summary

  • Self-Supervised Learning enables scalable learning without manual labels.
  • Models are pretrained on pretext tasks, then fine-tuned for downstream use.
  • Pretext task design, contrastive learning, and MAEs are major strategies.
  • SSL is foundational in vision, language, and increasingly in multi-modal learning.
