📢 Notice 📢

3 minute read

Notes from the summary of COMP3010 Lecture 9, labs, and online resources.

Overview

Clustering and Dimensionality Reduction are both key techniques under the umbrella of unsupervised learning

Key Topics

  • What is unsupervised learning?
  • Clustering (K-Means)
  • Dimensionality Reduction (PCA, t-SNE, Autoencoders)

What is Unsupervised Learning?

Supervised Learning: Learns a mapping from input features ($x$) to output targets ($y$), with human-provided labels.

Unsupervised Learning: Discovers patterns, groupings, or structures in data without labeled targets.

Yann LeCun has argued that most current ML systems are too reliant on labeled data and tend to be brittle. In contrast, humans and animals learn by observation, not constant supervision. Babies accumulate enormous amounts of background knowledge simply by observing the world.

Unsupervised Learning Goals:

  • Structure and organize data → Clustering
  • Simplify data representation → Dimensionality Reduction
  • Learn useful internal representations → Self-supervised learning

Clustering

Clustering involves organizing data into distinct groups (clusters), such that:

  • Intra-cluster similarity is high
  • Inter-cluster similarity is low

Clustering is similar to classification, but we don’t know the “true labels” in advance.

Applications:

  • Customer segmentation
  • Image segmentation
  • Gene clustering
  • Community detection in social networks
  • News and article categorization

K-Means Clustering

K-Means is a centroid-based iterative clustering algorithm:

  1. Choose $K$ random initial centroids.
  2. Repeat until convergence:
    • Assign each point to the nearest centroid.
    • Recompute centroids as the mean of assigned points.

Each centroid represents the “center of gravity” for its cluster.

Time Complexity:

  • Assignment step: $O(KN)$
  • Update step: Recomputing centroids

Advantages:

  • Simple and fast
  • Scalable to large datasets
  • Effective for spherical clusters

Disadvantages:

  • Requires pre-specifying $K$
  • Sensitive to initialization and outliers
  • Poor performance on non-spherical clusters

Tip: Run K-means multiple times with random seeds and select the best result using evaluation metrics (e.g., WCSS).

Dimensionality Reduction (DR)

DR refers to reducing high-dimensional data ($d$) to lower dimensions ($d’ \ll d$).

  • High-dimensional: Large number of features (columns)
  • Large-scale: Large number of instances (rows)

Why DR?

  • Data visualization
  • Resource efficiency (speed, memory)
  • Noise reduction
  • Improved generalization

Principal Component Analysis (PCA)

PCA is a linear, unsupervised method that:

  • Finds directions (principal components) capturing maximum variance
  • Projects data onto a lower-dimensional linear subspace

Key Ideas:

  • Keep components with large eigenvalues
  • Drop components with small variance (noise)

PCA is often used before clustering (e.g., K-means) to improve performance.

t-Distributed Stochastic Neighbor Embedding (t-SNE)

t-SNE is a non-linear DR technique for visualizing high-dimensional data (typically into 2D or 3D).

Comparison with PCA:

  • PCA captures global structure via linear projection
  • t-SNE preserves local structure, maintaining neighborhood relationships
  • PCA is deterministic, t-SNE is stochastic
  • PCA can embed new data; t-SNE cannot easily do so

Autoencoders

Autoencoders are neural networks trained to reconstruct their own inputs.

  • Consist of an encoder (maps input to latent space) and a decoder (reconstructs input)
  • Can be convolutional or transformer-based
  • Unsupervised; learn to represent data compactly and meaningfully

They are used for:

  • Dimensionality reduction
  • Feature learning
  • Pretraining representations

ML Lab 09 Summary

Learning Outcomes

  • Understand and implement K-Means
  • Apply PCA for dimensionality reduction

We move from supervised learning (with labels) to unsupervised learning (without labels), exploring patterns directly from the data.

Two Key Tools:

  • K-Means: Groups similar data points by minimizing intra-cluster variance
  • t-SNE: Visualizes high-dimensional relationships by preserving local structure

Discussion Points

1. Elbow Method for Choosing $K$

The elbow method helps choose the optimal number of clusters $K$ by:

  • Running K-Means with multiple values of $K$
  • Plotting within-cluster sum of squares (WCSS) vs $K$
  • Selecting the point where WCSS starts decreasing slowly (the “elbow”)

This marks diminishing returns for adding more clusters.

2. How Does K-Means React to Outliers?

K-Means is sensitive to outliers:

  • Centroid distortion: Outliers shift the mean
  • Spurious clusters: May form clusters around outliers
  • Higher variance: Increased WCSS
  • Convergence issues

Mitigation:

  • Preprocess to detect/remove outliers
  • Use robust clustering alternatives

3. Kernel PCA (Non-Linear DR)

Kernel PCA extends PCA by:

  • Mapping data to a high-dimensional space using a kernel function
  • Performing PCA in the implicit kernel-induced space

Common Kernels:

  • Polynomial: $k(x_i, x_j) = (x_i^\top x_j + c)^d$
  • RBF: $k(x_i, x_j) = \exp\left(-\frac{|x_i - x_j|^2}{2\sigma^2}\right)$
  • Sigmoid: $k(x_i, x_j) = \tanh(\alpha x_i^\top x_j + c)$

Steps:

  1. Construct kernel matrix $K$
  2. Center $K$
  3. Perform eigen decomposition
  4. Project using top eigenvectors

Kernel PCA enables capturing non-linear patterns without explicitly transforming data.

4. PCA vs. t-SNE

Feature PCA t-SNE
Goal Preserve global variance Preserve local neighborhood structure
Linearity Linear Non-linear
Structure Global Local
Determinism Deterministic Stochastic
Speed Faster Slower (quadratic time complexity)
Visualization Good for linear trends Excellent for revealing clusters
Distance Metric Euclidean distance Probability-based pairwise similarities

Summary:

  • Use PCA when linear structure matters, for preprocessing or fast visualization.
  • Use t-SNE to explore non-linear clusters in small-to-medium datasets for interpretability.

Leave a comment