📢 Notice 📢

7 minute read

Notes from the summary of COMP3010 Lecture 4, labs, and online resources.

High School Statistics Definition of Variance

In statistics, variance measures how spread out a set of numbers is. If you have a dataset like:

{10, 12, 8, 15, 14}

The variance tells you how much these numbers deviate from the mean. A high variance means the numbers are very spread out, and a low variance means they are close to the mean.

Machine Learning Definition of Variance

In machine learning, variance doesn’t refer to the spread of numbers in a dataset. Instead, it describes how much a model’s predictions change when trained on different datasets.

  • High variance model (overfitting): If you train the model on different datasets, its predictions change a lot because it’s too sensitive to small differences.
  • Low variance model (underfitting): The model gives similar predictions no matter which dataset you train it on, but they might all be wrong.

📌 Analogy to Connect the Two

  • In statistics, variance measures how much data points vary from the mean.
  • In machine learning, variance measures how much a model’s predictions vary across different training datasets.

Even though the word variance is used in both fields, its meaning shifts from describing data spread (statistics) to model sensitivity (machine learning).

What is Bias?

Bias refers to the error due to overly simplistic assumptions in the learning algorithm. A model with high bias is too simple and doesn’t capture the complexity of the data well, leading to underfitting.

Example of High Bias (Underfitting)

Imagine you are trying to predict house prices based on the size of the house.

  • High bias (underfitting): If you use a very simple model, like a straight-line equation (e.g., Price = 500 × Size), it might not fit the actual price trends well.
  • It fails to capture more complex relationships, such as how neighborhood, number of rooms, or house condition affect the price.
  • 🛑 Result: The model performs poorly on both training and test data.

What is Variance?

Variance refers to how much the model changes when trained on different datasets. A model with high variance is too sensitive to small fluctuations in the training data and doesn’t generalize well, leading to overfitting.

Example of High Variance (Overfitting)

Now, suppose we make the house price prediction model very complex, considering too many factors, including house color, window shape, street name, and even the day of the week the house was listed.

  • High variance (overfitting): This model fits the training data extremely well, but it captures noise instead of actual patterns.
  • If we train it on a new dataset, it fails to generalize because it relied too much on specific, irrelevant details from the original training data.
  • 🛑 Result: The model performs well on training data but poorly on new, unseen data.

Bias-Variance Tradeoff

We want a balance between bias and variance:

  • High bias, low variance → The model is too simple (underfitting). It makes similar predictions regardless of the data.
  • Low bias, high variance → The model is too complex (overfitting). It varies too much with small data changes.
  • Balanced bias and variance → The model captures important patterns without memorizing noise.

📌 Analogy: Imagine you are learning to shoot arrows at a target.

  • High bias: All arrows land far from the target in a similar spot (consistent but wrong).
  • High variance: Arrows are scattered all over the target (random, unpredictable).
  • Balanced model: Arrows are mostly close to the center.

How Does This Apply to Decision Trees?

Shallow Trees (High Bias, Low Variance)

  • If the decision tree is too small (e.g., only 1 or 2 splits), it won’t capture enough patterns → underfitting.

Deep Trees (Low Bias, High Variance)

  • If the tree is too deep, it memorizes training data, even capturing random noise → overfitting.

Optimal Tree Depth

  • Using pruning (stopping early) and setting constraints (like max depth or min leaf size) helps find a good balance.

Overview of Decision Trees

  • Decision trees are nonlinear models used in supervised learning.
  • They differ from linear models like logistic regression and support vector machines (SVMs).
  • Topics covered: decision tree inference, learning, bias-variance tradeoff, overfitting, and ensemble learning (bagging and boosting).

1. Decision Tree Inference

  • A decision tree makes predictions by recursively splitting attributes.
  • Example: Classifying a fruit based on weight and height.
  • The tree structure consists of:
    • Root node (first decision)
    • Internal nodes (splitting points)
    • Leaf nodes (final classification)
  • Decision trees can be visualized as 2D graphs, where each branch defines a region.

2. Decision Tree Construction

  • Decision trees can handle both categorical and continuous data.
  • Example: Predicting if a customer will wait at a restaurant based on different attributes.
  • Recursive splitting occurs until nodes reach certainty (pure classification).

3. Decision Trees for Classification and Regression

  • Classification Trees → Assigning categories.
  • Regression Trees → Averaging outputs in a region.
  • They approximate continuous functions using piecewise constant segments.

4. Overfitting & Generalization

  • Deep trees achieve 100% accuracy on training data but may not generalize well.
  • Smaller trees are preferred for better generalization (Ockham’s Razor principle).
  • Finding the smallest optimal tree is NP-hard, so we use heuristics like greedy algorithms.

5. Splitting Criteria & Entropy

  • Decision trees use splitting criteria to determine the best attribute to split on.
  • Accuracy alone is not a good measure for splits.
  • Entropy from information theory measures uncertainty:
    • Low entropy = more certainty (pure nodes).
    • High entropy = uncertainty (mixed classes).
  • A good split reduces entropy, making classification more certain.

6. Bias-Variance Tradeoff in Decision Trees

Decision trees must balance:

  • Bias: Simplifying assumptions that can underfit the data.
  • Variance: Sensitivity to small changes that can overfit the data.

A deep tree may overfit, while a shallow tree may underfit. The goal is to find an optimal depth that minimizes both bias and variance.

Summary

  • Bias-variance tradeoff is key to generalization.
  • Shallow trees underfit, deep trees overfit.
  • Entropy & information gain help in choosing the best splits.
  • Pruning & regularization can help control tree complexity.

This structured breakdown ensures clarity in understanding how decision trees leverage entropy and information gain for effective classification and regression.

Ensemble Methods Overview

What is an Ensemble?

An ensemble of predictors is a set of models whose individual decisions are combined to improve classification accuracy. The combination can be done through majority voting or weighted voting.

To be effective, classifiers in the ensemble must differ in some way, such as:

  • Using different algorithms
  • Choosing different hyperparameters
  • Training on different subsets of data
  • Applying different weightings to training examples

Ensembles are easy to implement, but choosing the right ensemble method depends on specific goals.

Sources of Error in Machine Learning Models

A machine learning model’s error consists of three components:

$ [\text{Error} = \text{Variance} + \text{Bias} + \epsilon ] $

where (epsilon) represents the Bayes error or inherent unpredictability.

Types of Ensemble Strategies

  1. Bagging (Bootstrap Aggregating)
    • Trains classifiers independently on random subsets of training data.
    • Reduces variance without affecting bias.
  2. Boosting
    • Trains classifiers sequentially, focusing on examples that previous models misclassified.
    • Reduces bias by iteratively improving weak learners.

Bagging: Motivation

If we could sample (m) independent training sets from a data distribution (p), we could compute multiple predictions and take their average:

$ [ y = \frac{1}{m} \sum_{i=1}^{m} y_i ] $

Effect on Model Error:

  • Bayes error: Unchanged (cannot be reduced by averaging)
  • Bias: Unchanged (averaged predictions have the same expectation)
  • Variance: Reduced (averaging independent predictions smooths the variance)

$ [ \text{Var}(y) = \sigma^2 \rightarrow \frac{1}{m} \sigma^2 ] $

Random Forests

A random forest is a collection of bagged decision trees with an additional step to decorrelate predictions:

  • At each tree node, a random subset of input features is chosen for splitting.
  • Works well with minimal tuning and is often a strong default model.

Boosting: The Idea

Boosting builds an additive ensemble model step by step:

$ [ f(x) = \sum_{i=1}^{m} \beta_i h_i(x) ] $

  • Each weak learner compensates for errors of the previous ones.
  • Weights (beta) are assigned based on weak learner performance.
  • AdaBoost: The first successful boosting algorithm, used in real-time object detection (Viola-Jones algorithm).
  • Gradient Boosted Decision Trees (GBDT):
    • Uses gradient descent to minimize errors.
    • Commonly used in Kaggle competitions.
    • Often outperforms deep learning on tabular data.

Summary

  • Decision trees are simple and intuitive but prone to overfitting.
  • Bias-variance tradeoff is crucial for all ML models.
  • Ensemble learning improves accuracy and is easy to implement.
  • Ensemble decision trees (e.g., Random Forests, GBDT) make tree-based models more powerful but reduce interpretability.

Leave a comment