Machine Learning for Machine Perception
Machine learning is the stage in the pipeline where features become decisions.
If image processing, feature detection, and feature extraction give us good descriptors,
machine learning maps those descriptors to labels, scores, or actions.
1. Where Machine Learning Fits in the Pipeline
Typical vision pipeline in COMP3007:
- Image processing – denoise, normalize, correct artifacts.
- Feature detection – find interesting points/regions (corners, blobs, lines).
- Feature extraction – turn local patches, shapes, or textures into vectors.
- Machine learning – learn a mapping from features → outputs.
Machine learning is used for:
- Recognition & classification – “Which class does this object belong to?”
- Regression – “What continuous value should I predict?” (e.g., depth, angle).
- Ranking / retrieval – “Which database images look most similar?”
2. Types of Machine Learning
2.1 Supervised Learning
Learn a function $ f: X \to Y $ from labeled examples:
- Input X: feature vectors (e.g., HOG descriptors, CNN embeddings).
- Output Y: labels or continuous values.
Tasks:
- Classification – finite set of labels (cat vs dog, digit 0–9).
- Regression – continuous outputs (depth, steering angle, probability score).
Assumptions:
- Test samples belong to one of the known classes.
- Small changes in input should not flip the label (stability / smoothness).
2.2 Unsupervised Learning
- Data has no labels.
- Goal: discover structure or manifolds in the data.
- Common tasks:
- Clustering (k-means, hierarchical clustering).
- Dimensionality reduction (PCA, autoencoders).
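As an illustrative sketch (not course code), k-means clustering can be written in a few lines of numpy: alternate between assigning points to their nearest centroid and moving each centroid to the mean of its points. The toy data and parameters here are made up for the example.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal k-means: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct random samples.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each non-empty centroid to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Two well-separated 2D blobs; no labels are used anywhere.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids = kmeans(X, k=2)
```

Note that the algorithm never sees class labels; the cluster structure is discovered purely from the geometry of the data.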
2.3 Reinforcement Learning
- Agent interacts with an environment.
- Learns a policy that maximizes long-term reward via trial and error.
- Examples:
- Robot navigation.
- Game playing (Atari, Go).
- Control tasks.
In COMP3007, supervised learning + deep learning for vision are the main focus.
3. Practical Issues in Supervised Learning
3.1 Data Quality & Preprocessing
- Garbage in, garbage out – bad data → bad model.
- Typical preprocessing:
- Normalizing features (zero mean, unit variance).
- Handling missing values.
- Removing obvious noise or corrupted samples.
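A minimal sketch of zero-mean, unit-variance normalization. The key practical point: statistics are computed on the training set only, then reused for test data (otherwise information leaks from test to train). The toy matrix is made up for the example.

```python
import numpy as np

def fit_standardizer(X_train):
    """Compute per-feature mean and std on the training set only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0  # avoid division by zero for constant features
    return mu, sigma

def standardize(X, mu, sigma):
    """Apply the training-set statistics to any data (train or test)."""
    return (X - mu) / sigma

X_train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
mu, sigma = fit_standardizer(X_train)
Z = standardize(X_train, mu, sigma)
# Each column of Z now has zero mean and unit variance.
```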
3.2 Curse of Dimensionality & Feature Selection
- High-dimensional feature spaces (e.g., raw pixels) can:
- Make distance-based methods (k-NN) unreliable.
- Require exponentially more data to cover the space.
- Solutions:
- Use compact, informative features (e.g., HOG, CNN embeddings).
- Apply dimensionality reduction (PCA).
- Feature selection / regularization.
3.3 Overfitting vs Generalization
- Overfitting: model learns training noise, fails on new data.
- Symptoms:
- Very high training accuracy, poor validation accuracy.
- Remedies:
- More varied training data.
- Data augmentation (for images).
- Regularization (L2, dropout).
- Simpler models.
- Early stopping based on validation performance.
3.4 Data Imbalance
- One class dominates others (e.g., 95% negatives, 5% positives).
- Accuracy is misleading (always predicting the majority class still scores high).
- Strategies:
- Resampling:
- Oversample minority class.
- Undersample majority class.
- Synthetic samples (e.g., SMOTE in ML literature).
- Weighted loss: penalize errors on minority class more.
3.5 Cross-Validation
- Split training data into k folds.
- Train on k−1 folds, validate on the remaining fold.
- Repeat for each fold, average performance.
- Purpose:
- Robust estimate of generalization.
- Hyperparameter tuning (k in k-NN, C & kernel in SVM, tree depth, etc.).
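The fold bookkeeping above can be sketched in plain Python (sizes and counts here are illustrative, not from the lecture):

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k roughly equal contiguous folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def kfold_splits(n, k):
    """Yield (train_indices, val_indices) for each of the k folds."""
    folds = kfold_indices(n, k)
    for i in range(k):
        val = folds[i]
        # Train on the union of the other k-1 folds.
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        yield train, val

# Example: 10 samples, 3 folds -> fold sizes 4, 3, 3.
splits = list(kfold_splits(10, 3))
```

Each sample is used for validation exactly once, so the averaged score uses all the data without ever validating on a sample that was trained on in the same round. (Real data is usually shuffled before splitting.)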
4. Classical Machine Learning Methods
4.1 k-Nearest Neighbors (k-NN)
- Idea: “Show me the k most similar training examples, then vote.”
- Steps:
- Store all labeled training samples.
- For a new sample, compute distances to all training points.
- Take the k closest neighbors and assign the majority label.
Properties:
- No explicit training phase (just store data).
- Easy to implement, works well for small datasets.
Hyperparameters & choices:
- k:
- Small k → complex decision boundary → overfitting.
- Large k → smoother boundary → possible underfitting.
- Distance metric:
- Euclidean, Manhattan, cosine, Hamming, etc.
Limitations:
- Slow at test time for large datasets.
- Sensitive to irrelevant features and scaling of dimensions.
- Struggles in very high-dimensional spaces (curse of dimensionality).
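The whole classifier fits in a few lines, which makes the steps above concrete. A minimal sketch using Euclidean distance (the toy points and labels are invented for the example):

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    # "Training" is just storing data; all the work happens at test time:
    # compute the distance from x to every stored sample.
    dists = [(math.dist(x, xi), yi) for xi, yi in zip(train_X, train_y)]
    dists.sort(key=lambda d: d[0])
    # Majority vote over the k closest labels.
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["A", "A", "A", "B", "B", "B"]
pred_a = knn_predict(train_X, train_y, (0.5, 0.5), k=3)  # near the A cluster
pred_b = knn_predict(train_X, train_y, (5.5, 5.5), k=3)  # near the B cluster
```

The `dists` loop over every training sample is exactly why k-NN is slow at test time on large datasets, and the raw Euclidean distance is why unscaled or irrelevant features hurt it.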
4.2 Support Vector Machines (SVM)
- Goal: find a hyperplane that maximizes the margin between classes.
- Support vectors = training points lying on the margin; they define the classifier.
Key ideas:
- For linearly separable data:
- Large margin → better generalization.
- For non-separable data:
- Allow misclassifications using slack variables (soft margin).
- Kernels:
- Map inputs into higher-dimensional feature space implicitly.
- Common kernels:
- Linear
- Polynomial
- RBF (Gaussian)
- Sigmoid
Pros:
- Strong theoretical foundation.
- Works well in high-dimensional spaces.
- Often very good performance with the right kernel.
Cons:
- Training can be slow on very large datasets.
- Choosing kernel and hyperparameters can be tricky.
- Less interpretable than decision trees.
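To make the soft-margin idea concrete, here is a minimal sketch (not a production SVM solver) that trains a linear SVM by subgradient descent on the primal objective $ \frac{1}{2}\|w\|^2 + C \sum_i \max(0,\, 1 - y_i(w \cdot x_i + b)) $. The toy data, learning rate, and epoch count are all illustrative choices.

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Soft-margin linear SVM via subgradient descent on the primal."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        # Only samples violating the margin (y*(w.x+b) < 1) contribute
        # a hinge subgradient; w itself gives the regularization term.
        viol = margins < 1
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Linearly separable toy data with labels in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.0],
              [-2.0, -2.0], [-3.0, -3.0], [-2.5, -3.0]])
y = np.array([1, 1, 1, -1, -1, -1])
w, b = train_linear_svm(X, y)
preds = np.sign(X @ w + b)
```

The slack variables never appear explicitly: the hinge loss `max(0, 1 - margin)` plays their role, and the kernel trick would replace the dot products `X @ w` with kernel evaluations.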
4.3 Decision Trees
- Tree-structured model:
- Internal nodes: tests on attributes (e.g., “color == red?”, “size > 3.5?”).
- Branches: outcomes of tests.
- Leaves: predicted class (or distribution over classes).
Growing a tree:
- Start with all data at the root.
- Choose an attribute and threshold to split the data.
- Repeat recursively on each child node.
Split criteria:
- Error rate (rarely used alone).
- Gini impurity.
- Entropy / Information gain.
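A sketch of how one split is chosen under the Gini criterion: try each candidate threshold on a numeric attribute and keep the one with the lowest weighted child impurity. The attribute values and labels below are invented for the example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(values, labels):
    """Threshold on one numeric attribute minimizing weighted Gini."""
    n = len(values)
    best = (None, float("inf"))
    for t in sorted(set(values)):
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        if not left or not right:
            continue  # a split must produce two non-empty children
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (t, score)
    return best

# "size" attribute with a clean class boundary around 3.
sizes = [1.0, 2.0, 2.5, 4.0, 4.5, 5.0]
labels = ["small", "small", "small", "big", "big", "big"]
threshold, impurity = best_split(sizes, labels)
```

Growing a full tree just repeats this search recursively on each child node, over all attributes rather than one.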
Stopping & pruning:
- Trees can overfit if too deep.
- Pre-pruning: limit depth, min samples per leaf, min gain.
- Post-pruning: grow full tree, then cut back branches that don’t improve validation performance.
Pros:
- Highly interpretable (“if–else” rules).
- Handle mixed attribute types (categorical + numerical).
- Very fast at inference.
Cons:
- Unstable: small data changes can produce different trees.
- Easy to overfit.
- Poor at modeling smooth decision boundaries unless deep (which overfits).
4.4 Other Classical Classifiers (Awareness Level)
- Naïve Bayes – simple probabilistic model, often used in text classification.
- Logistic Regression – linear model for classification with probabilistic outputs.
- Linear Discriminant Analysis (LDA) – finds linear projections that best separate classes.
- Relevance Vector Machines (RVM) – Bayesian version of SVM with sparsity.
5. Evaluating Classifiers
For classification:
- Confusion matrix – counts TP, TN, FP, FN per class.
- Accuracy – fraction of correct predictions.
- Precision = TP / (TP + FP) – “When I predict positive, how often am I correct?”
- Recall = TP / (TP + FN) – “How many of the true positives did I find?”
- F1-score – harmonic mean of precision and recall; good for imbalanced data.
For multi-class problems:
- Compute metrics per class and average (macro / micro averaging).
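A worked sketch of these formulas on a made-up imbalanced dataset (5 positives, 95 negatives), showing why accuracy alone is misleading:

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

# The classifier finds 4 of the 5 positives but raises 6 false alarms.
acc, prec, rec, f1 = classification_metrics(tp=4, fp=6, fn=1, tn=89)
# acc = 0.93, prec = 0.4, rec = 0.8, f1 ~ 0.53
```

Accuracy looks excellent at 93%, yet fewer than half the positive predictions are correct, which is exactly what precision and F1 expose.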
6. Neural Networks: Bridge to Deep Learning
6.1 Basic Neural Network Structure
Components:
- Input layer – feature vector (pixels, descriptors, embeddings).
- Hidden layers – intermediate representations.
- Output layer – class scores or regression outputs.
- Weights & biases – parameters learned from data.
- Activation functions – introduce non-linearity:
- Sigmoid
- tanh
- ReLU and variants (modern deep learning).
6.2 Training with Gradient Descent & Backpropagation
- Choose a loss function:
- Cross-entropy for classification.
- MSE for regression.
- Compute gradients of the loss w.r.t. weights using backpropagation.
- Update weights in the opposite direction of the gradient (gradient descent).
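The loop above can be sketched end to end on a tiny network. This is an illustrative toy (a 2-4-1 sigmoid network on XOR with MSE loss, full-batch gradient descent, and hand-written backpropagation), not how real frameworks are used:

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR: not linearly separable, so a hidden layer is required.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(0, 1, (2, 4)); b1 = np.zeros(4)
W2 = rng.normal(0, 1, (4, 1)); b2 = np.zeros(1)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

lr = 1.0
losses = []
for _ in range(2000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)          # hidden activations
    y = sigmoid(h @ W2 + b2)          # network output
    losses.append(((y - t) ** 2).mean())
    # Backward pass: chain rule, using sigmoid'(z) = s * (1 - s).
    dy = 2 * (y - t) / len(X) * y * (1 - y)
    dW2 = h.T @ dy; db2 = dy.sum(axis=0)
    dh = dy @ W2.T * h * (1 - h)
    dW1 = X.T @ dh; db1 = dh.sum(axis=0)
    # Gradient descent: step against the gradient.
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1
```

The two passes are the whole story: the forward pass computes the loss, the backward pass reuses the intermediate activations `h` and `y` to compute every gradient in one sweep.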
Challenges:
- Choosing learning rate.
- Avoiding local minima / plateaus.
- Computational cost for large networks and datasets.
Advantages:
- Model complex, non-linear decision boundaries.
- Naturally suited to GPUs (parallel matrix operations).
Disadvantages:
- Many hyperparameters (layers, units, activations, learning rate, etc.).
- Less interpretable (“black box”).
- Can overfit easily without regularization.
7. Deep Learning: Modern Neural Networks
Deep learning is essentially neural networks with many layers, focusing on representation learning:
- Early layers learn low-level features (edges, textures).
- Deeper layers learn parts and objects.
- Output layers map to task-specific predictions.
7.1 Tensors & Data Representation
- Scalars – single numbers.
- Vectors – 1D tensors (e.g., feature vectors).
- Matrices – 2D tensors.
- Higher-rank tensors – images, videos, batches:
- Images: (batch, height, width, channels).
- Videos: (batch, time, height, width, channels).
7.2 Loss Functions
- Cross-entropy (binary / categorical) – classification.
- MSE – regression.
- KL divergence – comparing probability distributions (e.g., VAEs).
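As a small worked example, categorical cross-entropy is just the negative log of the probability the softmax assigns to the true class (the logits below are invented):

```python
import numpy as np

def softmax_cross_entropy(logits, target_index):
    """-log of the softmax probability of the true class."""
    z = logits - logits.max()            # shift for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return -np.log(probs[target_index])

logits = np.array([2.0, 1.0, 0.1])
loss_correct = softmax_cross_entropy(logits, 0)  # true class has top score
loss_wrong = softmax_cross_entropy(logits, 2)    # true class has low score
```

The loss is small when the network is confident and correct, and grows without bound as the probability assigned to the true class approaches zero, which is what drives the gradients during training.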
7.3 Optimizers
- SGD – simple, strong baseline; often used with momentum.
- RMSProp – adaptive learning rates, good for recurrent nets.
- Adam – combines momentum + adaptive learning rates; widely used default.
- Adagrad / Adadelta – older adaptive methods.
7.4 Vanishing & Exploding Gradients
- In very deep networks, gradients can:
- Vanish – become too small to update early layers.
- Explode – become extremely large, causing instability.
- Mitigation:
- Careful initialization.
- Proper activation functions (ReLU).
- Normalization layers (BatchNorm).
- Residual connections (ResNet in practice, beyond this lecture).
7.5 Convolutional Neural Networks (CNNs)
- Specialized for images.
- Layers:
- Convolution – sliding filters to detect patterns (edges, corners, textures).
- Pooling – downsampling, keeps salient structure, reduces parameters.
- Fully connected – final classification.
Early example: LeNet – classic CNN for digit recognition (MNIST).
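A useful sanity check when sketching a CNN is the spatial output size of each layer, floor((n + 2·padding − kernel) / stride) + 1, which applies to both convolution and pooling. Chaining it reproduces the classic LeNet-5 feature-map sizes on a 32x32 input:

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (n + 2 * padding - kernel) // stride + 1

s = conv_output_size(32, kernel=5)           # 5x5 conv, no padding -> 28
s = conv_output_size(s, kernel=2, stride=2)  # 2x2 pooling         -> 14
s = conv_output_size(s, kernel=5)            # second 5x5 conv     -> 10
s = conv_output_size(s, kernel=2, stride=2)  # second pooling      -> 5
```

The 5x5 feature maps are then flattened and fed to the fully connected layers for classification.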
8. Training Deep Networks in Practice
8.1 Data Augmentation
- Geometric transforms: rotation, flips, translations, zoom.
- Photometric transforms: brightness, contrast, color jitter.
- Noise injection: improve robustness.
- Advanced:
- Mixup, Cutout.
- GAN/diffusion-based synthetic data (beyond basics).
Goal: improve generalization by exposing the network to varied views of the same underlying data.
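A minimal sketch of one geometric and one photometric augmentation applied on the fly (the flip probability and jitter range are illustrative choices, and real pipelines use library transforms):

```python
import numpy as np

def augment(image, rng):
    """Randomly flip and brightness-jitter one image (H, W, C in [0, 1])."""
    out = image.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1, :]           # geometric: horizontal flip
    shift = rng.uniform(-0.2, 0.2)      # photometric: brightness shift
    out = np.clip(out + shift, 0.0, 1.0)
    return out

rng = np.random.default_rng(0)
image = np.full((4, 4, 3), 0.5)         # toy mid-gray image
batch = [augment(image, rng) for _ in range(8)]
```

Because the transforms are sampled fresh each epoch, the network effectively sees a different variant of every image each time, without any extra labeling effort.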
8.2 Regularization: BatchNorm, Dropout, Early Stopping
Batch Normalization
- Normalize activations across a batch, then apply learnable scale and shift.
- Benefits:
- Stabilizes training.
- Allows higher learning rates.
- Acts as a regularizer.
- Caveat: can be unstable with very small batch sizes.
Dropout
- Randomly set a fraction of activations to zero during training.
- Encourages redundancy and robustness in learned representations.
- Typical dropout rate: 0.2–0.5.
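Modern implementations use "inverted" dropout, a sketch of which is below: surviving activations are rescaled by 1/(1 − rate) during training so that no scaling is needed at test time.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero a fraction `rate` of activations during
    training and rescale survivors so the expected value is unchanged."""
    if not training or rate == 0.0:
        return activations              # identity at test time
    keep = rng.random(activations.shape) >= rate
    return activations * keep / (1.0 - rate)

rng = np.random.default_rng(0)
a = np.ones(10000)
d = dropout(a, rate=0.5, rng=rng)
# About half the units are zeroed; survivors are doubled, so the
# mean stays close to 1.
```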
Early Stopping
- Monitor validation performance.
- Stop training when validation loss stops improving.
- Prevents overfitting and saves compute.
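The stopping rule is usually implemented with a "patience" counter: stop once the validation loss has gone `patience` epochs without improving. A minimal sketch on an invented loss curve:

```python
def early_stop_epoch(val_losses, patience=3):
    """Epoch at which to stop: validation loss has not improved
    for `patience` consecutive epochs. Weights from best_epoch
    would be the ones kept."""
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return len(val_losses) - 1

# Validation loss improves, then plateaus and rises (overfitting).
losses = [1.0, 0.8, 0.6, 0.55, 0.56, 0.58, 0.60, 0.65]
stop = early_stop_epoch(losses, patience=3)  # best was epoch 3
```

In practice a checkpoint of the model is saved at each new best epoch, so stopping late costs nothing but compute.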
8.3 Transfer Learning & Foundation Models
- Start from a pre-trained model (e.g., on ImageNet).
- Replace top layers with a small classifier head for the task.
- Freeze some or all of the earlier layers.
- Benefits:
- Transfer general visual features to the domain.
- Requires less data and compute.
- Faster convergence, less overfitting.
Modern “foundation models” (ResNet, ViT, large vision transformers) act as generic backbones for many tasks.
9. Modern Deep Learning Themes (Beyond the Basics)
Awareness level for COMP3007, but good context:
- Self-supervised learning:
- Learn from unlabeled data via proxy tasks (e.g., predicting rotations, jigsaw puzzles, colorization).
- Transformers for vision:
- Self-attention over image patches (ViT, Swin).
- Strong results in classification, detection, segmentation, vision–language.
- Generative models:
- VAEs – probabilistic latent spaces, stable but often blurry outputs.
- GANs – adversarial training, sharp images (StyleGAN, deepfakes).
- Diffusion models – state-of-the-art image synthesis (e.g., Stable Diffusion).
10. Exam-Oriented Takeaways
- Explain where machine learning fits in the perception pipeline.
- Distinguish supervised / unsupervised / reinforcement learning.
- Describe classification vs regression and give examples.
- Discuss overfitting, curse of dimensionality, data imbalance, and fixes.
- Explain k-NN, SVM, and decision trees:
- Mechanism, pros/cons, key hyperparameters.
- Compute and interpret accuracy, precision, recall, F1-score, and confusion matrices.
- Sketch the structure of a neural network and explain:
- Layers, activations, loss, gradient descent, backpropagation.
- Summarize what makes deep learning different:
- Representation learning, large datasets, GPUs, end-to-end training.
- Describe key practical tools:
- Data augmentation, BatchNorm, dropout, early stopping, transfer learning.
- Recognize modern trends:
- Self-supervised learning, transformers, generative models (GANs, VAEs, diffusion).
Machine learning is the glue that turns all the earlier vision steps into useful decisions. Once we have good features, the rest is about picking the right model, training it well, and evaluating it honestly.