📢 Notice 📢

3 minute read

Notes from the summary of COMP3010 Lecture 6, labs, and online resources.

Deep Learning Model Types

  • Multilayer Perceptron (MLP): Fully connected layers, suitable for tabular data and general function approximation.
  • Convolutional Neural Networks (CNNs): Excellent for grid-like data (e.g., images). Used in classification, object detection, facial recognition, and medical imaging.
  • Recurrent Neural Networks (RNNs): Designed for sequential data. Applied to audio, text, and time series analysis.
  • Transformers: Use attention mechanisms for parallel processing of sequences. Widely used in NLP and increasingly in image and audio tasks.

Model selection should align with the structure of the input data. Frameworks like PyTorch, TensorFlow, and Keras provide pre-built models.

Optimization Challenges

Training deep networks involves solving a difficult non-convex optimization problem.

  • Gradient Vanishing/Explosion: Gradients can shrink or grow excessively.
    • Solutions: ReLU activation, gradient clipping, Xavier/He initialization, lower learning rate, layer normalization.
  • Underfitting: Model is too simple to capture patterns.
    • Solutions: Increase network complexity, train longer, reduce learning rate.
  • Overfitting: Model memorizes training data.
    • Solutions: Dropout, L1/L2 regularization, data augmentation, early stopping.

Training Details

  • Epoch: One full pass of the dataset through the network.
  • Mini-Batch Gradient Descent: Efficient and stable optimization strategy using small batches of data.
  • Data Splitting: Divide data into training, validation, and test sets for generalization.
  • Cross-Validation: Used to reduce bias from a single validation set, though less common in deep learning due to compute cost.

Hyperparameter Tuning

Hyperparameters are chosen before training and are not updated through backpropagation.

  • Common Hyperparameters:
    • Learning Rate
    • Batch Size
    • Activation Function (e.g., ReLU, Sigmoid)
    • Loss Function
    • Number of Layers and Neurons
    • Optimizer (e.g., SGD, Adam)
    • Regularization parameters (e.g., weight decay)
    • Gradient Clipping and Normalization

Automated tuning methods like grid search, random search, and Bayesian optimization (e.g., Optuna) are commonly used.

Data Preprocessing

A critical phase that can dominate the project timeline.

  • Handling Missing Values: Impute (mean/median) or drop.
  • Outlier Detection: Use Z-score or IQR; decide to transform, remove, or keep.
  • Categorical Encoding:
    • One-Hot Encoding: Few categories
    • Embedding Layers: Many categories
  • Feature Normalization:
    • Min-Max scaling
    • Z-score standardization
    • Always apply the same scaling to test data
  • Data Augmentation: Improves generalization using transformations (e.g., image flipping, text paraphrasing).
  • Feature Engineering: Domain-informed construction of new features.

Summary

  • Neural networks can handle many data types with appropriate architectures.
  • Training is complex and sensitive to hyperparameters.
  • Data preprocessing is essential for success in deep learning projects.

Key Notes

  • Backpropagation is used to update weights—not hyperparameters.
  • Hyperparameters must be manually or automatically tuned using external strategies.

Lab 6 Discussion

Universal Approximation Theorem (UAT) & Deep NNs

  • A single hidden layer with enough neurons can approximate any continuous function.
  • But in practice, shallow networks:
    • Require many neurons
    • Generalize poorly
  • Deep networks offer:
    • Hierarchical feature learning
    • Efficient representation of complex functions
    • Better generalization with fewer parameters

MLPs for Regression vs. Classification

  • Shared architecture (fully connected layers), different output layers:
    • Regression:
      • Output: 1 neuron, linear activation
      • Loss: MSE or MAE
    • Binary Classification:
      • Output: 1 neuron, sigmoid activation
      • Threshold at 0.5
      • Loss: Binary cross-entropy
    • Multi-Class Classification:
      • Output: multiple neurons (one per class)
      • Activation: softmax
      • Output is a probability distribution (sums to 1)

Training Loss Becoming NaN

Common causes:

  • Exploding Gradients: Large weights produce unstable updates
    • Mitigation: Gradient clipping, lower learning rate
  • Numerical Errors: Division by zero, log(0)
    • Mitigation: Add small epsilon constants
  • Poor Initialization: Use Xavier or He initialization
  • Improper Data Preprocessing: Normalize inputs

Shallow Learners as Special Cases of Neural Networks

Many classical ML models are just simple NNs:

Traditional Model Equivalent NN
Linear Regression 1-layer NN, linear activation, MSE loss
Logistic Regression 1-layer NN, sigmoid activation, cross-entropy
SVM (Soft Margin) 1-layer NN, hinge loss (margin maximization)

Deep neural networks generalize these by stacking layers and introducing nonlinearities, enabling the modeling of highly complex functions.

Leave a comment