Linear Regression and Gradient Descent
Notes from the summary of COMP3010 Lecture 2, labs, and online resources.
Linear Regression Overview
In linear regression, we aim to find the best-fitting line to predict a target value $y$ from an input $x$. The model can be expressed as:
$[ f(x) = wx + b ]$
Where:
- $w$ is the weight (or slope of the line),
- $b$ is the bias (or y-intercept),
- $f(x)$ is the predicted value for a given input $x$.
The goal is to minimize the difference between the predicted values $f(x)$ and the actual target values $y$.
Gradient Descent
Gradient descent is an optimization algorithm used to find the optimal values for $w$ and $b$ by minimizing the loss function (usually Mean Squared Error, MSE).
How Gradient Descent Works:
- Initialize parameters: Start with random values for $w$ and $b$.
- Compute the loss: Calculate the loss (e.g., MSE) based on the current values of $w$ and $b$.
- Calculate gradients: Compute the gradients of the loss with respect to $w$ and $b$.
- Update parameters: Adjust $w$ and $b$ to reduce the loss.
- Repeat: Continue iterating until convergence.
Gradient Equation for $w$:
The gradient of the loss function with respect to the weight $w$ is given by:
$ \frac{\partial L}{\partial w} = \frac{2}{n} \sum_{i=1}^{n} (f(x_i) - y_i) \cdot x_i $
Where:
- $L$ is the loss function (Mean Squared Error),
- $n$ is the number of data points,
- $f(x_i)$ is the predicted value for the $i$-th data point,
- $y_i$ is the actual value for the $i$-th data point,
- $x_i$ is the input value for the $i$-th data point.
Learning Rate
The learning rate $\alpha$ controls how much we adjust the parameters $w$ and $b$ during each update. It determines the step size in gradient descent.
- A high learning rate might cause the algorithm to overshoot the optimal values.
- A low learning rate makes the updates slow and may lead to prolonged convergence or getting stuck in a local minimum.
Convergence
Convergence refers to the point at which the algorithm has found the optimal values for $w$ and $b$, or when further updates result in negligible changes to the loss function.
- Convergence is achieved when the gradient approaches zero and the model’s predictions become stable.
Breaking Down the Gradient Equation Term by Term
Let’s break down the gradient equation for $w$ into simpler terms:
Gradient Equation:
$ \frac{\partial L}{\partial w} = \frac{2}{n} \sum_{i=1}^{n} (f(x_i) - y_i) \cdot x_i $
- $ \frac{2}{n} $:
- The 2 comes from the derivative of the squared error term (squared differences between predicted and actual values).
- $n$ is the number of training examples, and dividing by $n$ averages the gradient over all data points.
- Summation $ \sum_{i=1}^{n} $:
- This is the summation symbol, meaning we add up the values for each data point from $i = 1$ to $n$ (all training examples).
- $ ( f(x_i) - y_i ) $:
- This is the error for the $i$-th data point: the difference between the predicted value $f(x_i)$ and the actual target value $y_i$.
- $ x_i $:
- This is the input value $x_i$ for the $i$-th data point, which is multiplied by the error. This step ensures that the error is weighted by how much the input $x_i$ influences the prediction.
By summing the contributions from all data points and averaging them, the gradient gives us the direction in which to adjust $w$ to minimize the loss function.
Conclusion
- Linear Regression tries to fit a line to the data by adjusting the weights and bias.
- Gradient Descent helps minimize the loss function by iteratively adjusting the weights and bias.
- Learning Rate controls the size of the steps taken in each iteration.
- Convergence occurs when the loss stops decreasing.
Additional Insights
Understanding the Learning Rate
The learning rate determines how quickly or slowly a model learns during training. It influences how much the model updates its parameters in response to the computed gradient.
- A high learning rate may cause the optimization to overshoot the minimum of the loss function. This can lead to:
- Oscillations around the minimum
- Failure to converge (divergence)
- A low learning rate slows down convergence:
- Many more iterations are needed to reach a minimum
- Risk of getting stuck in suboptimal regions
In practice, the learning rate is typically chosen through trial and error or using techniques such as:
- Adaptive learning rate methods (e.g., Adam, RMSprop)
- Learning rate scheduling (gradually decreasing the rate during training)
Local Minima vs. Global Minima
The goal of gradient descent is to minimize the loss function, which measures how far off the predictions are from the actual target values.
- A local minimum is a point where the loss is lower than its neighboring points.
- A global minimum is the absolute lowest point across the entire function domain.
Why It Matters:
- In simple models like linear regression (convex problems), the cost function has a single global minimum — no risk of local traps.
- In complex models (e.g., deep neural networks with non-convex loss surfaces), the cost function may contain multiple local minima.
Finding a good local minimum is often sufficient for practical model performance. Some techniques even introduce randomness into the optimization (like stochastic gradient descent) to help avoid poor local minima.
Weight Magnitude and Feature Scale
The value of a weight (w) should be proportional to the scale of the feature it multiplies.
- If a feature has a large numeric range, a small weight might not contribute effectively.
- If a feature has a small range, a large weight may dominate disproportionately.
To address this, we use feature normalization techniques such as:
- Min-max scaling
- Z-score standardization
These help bring all feature values onto a similar scale, which:
- Prevents one feature from dominating due to magnitude
- Speeds up convergence
- Improves numerical stability
Quick Recap
- Regression problems have continuous-valued targets.
- Linear Regression aims to fit a line:
$\displaystyle f(x) = wx + b$ - Gradient Descent iteratively adjusts $w$ and $b$ to minimize the loss.
- The learning rate controls how big each update step is.
- Convergence happens when updates stop changing the loss significantly.
- Understanding local vs. global minima and using feature normalization are important for model performance.
Leave a comment