📢 Notice 📢

Please have a read first!

Clustering and Dimensionality Reduction

May 5, 2025 3 minute read

Notes from the summary of COMP3010 Lecture 9, labs, and online resources.

Overview

Clustering and Dimensionality Reduction are both key techniques under the umbrella of unsupervised learning

Key Topics

What is unsupervised learning?
Clustering (K-Means)
Dimensionality Reduction (PCA, t-SNE, Autoencoders)

What is Unsupervised Learning?

Supervised Learning: Learns a mapping from input features ($x$) to output targets ($y$), with human-provided labels.

Unsupervised Learning: Discovers patterns, groupings, or structures in data without labeled targets.

Yann LeCun has argued that most current ML systems are too reliant on labeled data and tend to be brittle. In contrast, humans and animals learn by observation, not constant supervision. Babies accumulate enormous amounts of background knowledge simply by observing the world.

Unsupervised Learning Goals:

Structure and organize data → Clustering
Simplify data representation → Dimensionality Reduction
Learn useful internal representations → Self-supervised learning

Clustering

Clustering involves organizing data into distinct groups (clusters), such that:

Intra-cluster similarity is high
Inter-cluster similarity is low

Clustering is similar to classification, but we don’t know the “true labels” in advance.

Applications:

Customer segmentation
Image segmentation
Gene clustering
Community detection in social networks
News and article categorization

K-Means Clustering

K-Means is a centroid-based iterative clustering algorithm:

Choose $K$ random initial centroids.
Repeat until convergence:
- Assign each point to the nearest centroid.
- Recompute centroids as the mean of assigned points.

Each centroid represents the “center of gravity” for its cluster.

Time Complexity:

Assignment step: $O(KN)$
Update step: Recomputing centroids

Advantages:

Simple and fast
Scalable to large datasets
Effective for spherical clusters

Disadvantages:

Requires pre-specifying $K$
Sensitive to initialization and outliers
Poor performance on non-spherical clusters

Tip: Run K-means multiple times with random seeds and select the best result using evaluation metrics (e.g., WCSS).

Dimensionality Reduction (DR)

DR refers to reducing high-dimensional data ($d$) to lower dimensions ($d’ \ll d$).

High-dimensional: Large number of features (columns)
Large-scale: Large number of instances (rows)

Why DR?

Data visualization
Resource efficiency (speed, memory)
Noise reduction
Improved generalization

Principal Component Analysis (PCA)

PCA is a linear, unsupervised method that:

Finds directions (principal components) capturing maximum variance
Projects data onto a lower-dimensional linear subspace

Key Ideas:

Keep components with large eigenvalues
Drop components with small variance (noise)

PCA is often used before clustering (e.g., K-means) to improve performance.

t-Distributed Stochastic Neighbor Embedding (t-SNE)

t-SNE is a non-linear DR technique for visualizing high-dimensional data (typically into 2D or 3D).

Comparison with PCA:

PCA captures global structure via linear projection
t-SNE preserves local structure, maintaining neighborhood relationships
PCA is deterministic, t-SNE is stochastic
PCA can embed new data; t-SNE cannot easily do so

Autoencoders

Autoencoders are neural networks trained to reconstruct their own inputs.

Consist of an encoder (maps input to latent space) and a decoder (reconstructs input)
Can be convolutional or transformer-based
Unsupervised; learn to represent data compactly and meaningfully

They are used for:

Dimensionality reduction
Feature learning
Pretraining representations

ML Lab 09 Summary

Learning Outcomes

Understand and implement K-Means
Apply PCA for dimensionality reduction

We move from supervised learning (with labels) to unsupervised learning (without labels), exploring patterns directly from the data.

Two Key Tools:

K-Means: Groups similar data points by minimizing intra-cluster variance
t-SNE: Visualizes high-dimensional relationships by preserving local structure

Discussion Points

1. Elbow Method for Choosing $K$

The elbow method helps choose the optimal number of clusters $K$ by:

Running K-Means with multiple values of $K$
Plotting within-cluster sum of squares (WCSS) vs $K$
Selecting the point where WCSS starts decreasing slowly (the “elbow”)

This marks diminishing returns for adding more clusters.

2. How Does K-Means React to Outliers?

K-Means is sensitive to outliers:

Centroid distortion: Outliers shift the mean
Spurious clusters: May form clusters around outliers
Higher variance: Increased WCSS
Convergence issues

Mitigation:

Preprocess to detect/remove outliers
Use robust clustering alternatives

3. Kernel PCA (Non-Linear DR)

Kernel PCA extends PCA by:

Mapping data to a high-dimensional space using a kernel function
Performing PCA in the implicit kernel-induced space

Common Kernels:

Polynomial: $k(x_i, x_j) = (x_i^\top x_j + c)^d$
RBF: $k(x_i, x_j) = \exp\left(-\frac{|x_i - x_j|^2}{2\sigma^2}\right)$
Sigmoid: $k(x_i, x_j) = \tanh(\alpha x_i^\top x_j + c)$

Steps:

Construct kernel matrix $K$
Center $K$
Perform eigen decomposition
Project using top eigenvectors

Kernel PCA enables capturing non-linear patterns without explicitly transforming data.

4. PCA vs. t-SNE

Feature	PCA	t-SNE
Goal	Preserve global variance	Preserve local neighborhood structure
Linearity	Linear	Non-linear
Structure	Global	Local
Determinism	Deterministic	Stochastic
Speed	Faster	Slower (quadratic time complexity)
Visualization	Good for linear trends	Excellent for revealing clusters
Distance Metric	Euclidean distance	Probability-based pairwise similarities

Summary:

Use PCA when linear structure matters, for preprocessing or fast visualization.
Use t-SNE to explore non-linear clusters in small-to-medium datasets for interpretability.

Share on

X Facebook LinkedIn Bluesky