Linear Classification
Notes summarizing COMP3010 Lecture 3, the labs, and online resources.
Introduction to Classification
Classification is a core task in machine learning where the goal is to assign input data to predefined categories or classes. Unlike regression, which predicts continuous values, classification outputs discrete labels.
Common examples:
- Email filtering (spam vs. non-spam)
- Disease diagnosis (positive vs. negative)
- Image recognition (e.g., cat vs. dog)
Logistic Regression vs. Linear Regression
| Feature | Linear Regression | Logistic Regression |
|---|---|---|
| Output Type | Continuous values | Probabilities for classification |
| Model | Fits a straight line | Applies sigmoid function |
| Loss Function | Mean Squared Error (MSE) | Log Loss (Cross-Entropy) |
| Optimization | Least squares / Gradient Descent | Gradient Descent |
| Usage | Predicting numerical outputs | Binary or Multi-class Classification |
Despite its name, logistic regression is a classification method that combines a linear model with a sigmoid function to output values between 0 and 1. A threshold (usually 0.5) is applied to map probabilities to class labels.
Training Logistic Regression
The sigmoid function used in logistic regression is:
$\hat{y} = \sigma(w^\top x + b) = \frac{1}{1 + e^{-(w^\top x + b)}}$
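As a minimal NumPy sketch (the weight and bias values here are illustrative assumptions, not from the lecture): the sigmoid squashes the linear score into $(0, 1)$, and a 0.5 threshold converts the probability into a class label.

```python
import numpy as np

def sigmoid(z):
    # Map any real-valued score into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative parameters for a single input feature
w, b = 2.0, -1.0
x = np.array([-2.0, 0.0, 0.5, 3.0])

probs = sigmoid(w * x + b)           # estimated P(y = 1 | x)
labels = (probs >= 0.5).astype(int)  # threshold at 0.5
print(probs, labels)
```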
Why Not MSE?
- Composing MSE with the sigmoid yields a non-convex loss surface, so gradient descent can stall in poor local minima or flat regions.
- Instead, Binary Cross-Entropy (Log Loss) is used:
$L = - \left[ y \log(\hat{y}) + (1 - y) \log(1 - \hat{y}) \right]$
This loss function is convex and better suited for optimization using gradient descent.
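A from-scratch sketch of this training loop, assuming synthetic data and an illustrative learning rate: a convenient fact is that the gradient of the mean cross-entropy through the sigmoid simplifies to $X^\top(\hat{y} - y)/n$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary data: label 1 when x1 + x2 > 0 (illustrative)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b, lr = np.zeros(2), 0.0, 0.1

for _ in range(500):
    y_hat = sigmoid(X @ w + b)
    # Gradient of the mean binary cross-entropy w.r.t. w and b
    w -= lr * (X.T @ (y_hat - y)) / len(y)
    b -= lr * np.mean(y_hat - y)

accuracy = np.mean((sigmoid(X @ w + b) >= 0.5) == y)
print(f"training accuracy: {accuracy:.2f}")
```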
Learning Rate and Convergence
- A proper learning rate is crucial (a toy sketch follows this list):
- Too high → overshooting and oscillation, or outright divergence.
- Too low → very slow convergence, or getting stuck in flat regions.
- Adaptive methods (e.g., Adam, RMSProp) help adjust learning rate dynamically.
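The toy sketch below, using the assumed loss $f(w) = w^2$ with gradient $2w$ and three illustrative rates, shows the three regimes:

```python
def descend(lr, steps=10, w=1.0):
    # Gradient descent on f(w) = w**2, whose gradient is 2*w
    for _ in range(steps):
        w -= lr * 2 * w
    return w

print(descend(0.1))   # shrinks steadily toward 0 (factor 0.8 per step)
print(descend(0.45))  # converges quickly (factor 0.1 per step)
print(descend(1.1))   # factor -1.2 per step: overshoots and diverges
```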
Support Vector Machines (SVM)
SVM is a linear classification technique that finds the optimal hyperplane separating classes by maximizing the margin defined by the support vectors (a minimal example follows the list below).
SVM Intuition:
- Support Vectors: Closest data points to the decision boundary.
- Margin: Distance between support vectors and the decision boundary.
- Focuses on separation rather than probabilities.
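A minimal scikit-learn example of this intuition (the dataset and its parameters are illustrative assumptions): after fitting, the support vectors that define the boundary are exposed directly.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated clusters of points (illustrative data)
X, y = make_blobs(n_samples=100, centers=2, random_state=0)

clf = SVC(kernel="linear").fit(X, y)

# Only these points determine the decision boundary
print("support vectors per class:", clf.n_support_)
print("a support vector:", clf.support_vectors_[0])
```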
The Kernel Trick
For non-linearly separable data, SVM uses the kernel trick to map data into a higher-dimensional space where a linear boundary becomes feasible (a sketch follows the list below).
- Implicit Mapping: Transforms input space without explicitly computing new coordinates.
- Kernel Functions:
- Polynomial
- Gaussian (RBF)
- Sigmoid
- Benefits:
- Allows complex decision boundaries
- Computationally efficient: kernel values are evaluated directly, without ever computing the high-dimensional coordinates
- Adaptable to different data patterns via kernel choice
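A small sketch of the kernel trick in practice, assuming scikit-learn and the two-moons toy dataset: a linear kernel cannot separate the interleaved classes, while the Gaussian (RBF) kernel can (the `gamma` value is an illustrative choice).

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Interleaved half-moons: not separable by any straight line in 2-D
X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf", gamma=1.0).fit(X, y)  # Gaussian kernel

print("linear kernel accuracy:", linear.score(X, y))
print("RBF kernel accuracy:", rbf.score(X, y))
```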
SVM vs Deep Learning
| Aspect | SVM with Kernel | Deep Learning |
|---|---|---|
| Feature Engineering | Handcrafted (kernel selection) | Learned from data |
| Flexibility | Moderate, depends on kernel | Highly flexible, end-to-end |
| Data Requirement | Lower | Requires large labeled datasets |
| Computation | Efficient with good kernel | Computationally expensive |
Effect of Outliers
Logistic Regression:
- Sensitive to outliers: Outliers can distort the decision boundary.
- Why? Log loss grows without bound for confidently misclassified points, so a single extreme point can pull the fitted boundary toward it during likelihood maximization.
Mitigation:
- Regularization: L1/L2 penalties shrink large weights (see the sketch below).
- Preprocessing: Detect and remove outliers.
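A small scikit-learn sketch of the regularization option (the dataset and C values are illustrative assumptions; note that in scikit-learn, C is the inverse of the regularization strength):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Smaller C => stronger penalty on large weights
l2 = LogisticRegression(penalty="l2", C=0.1).fit(X, y)
l1 = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(X, y)

print("L2 weights:", l2.coef_.round(2))
print("L1 weights (some driven to exactly zero):", l1.coef_.round(2))
```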
SVM:
- More robust: Focuses on maximizing margin, not fitting all data.
- Soft Margin SVM: Introduces tolerance for misclassifications, controlled by the C parameter (see the sketch after this list):
- Lower C → more flexibility, larger margin
- Higher C → stricter separation, less tolerance
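A quick sketch of the C trade-off (the overlapping synthetic blobs and specific C values are assumptions for illustration): softer margins typically keep more points as support vectors.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping clusters, so some misclassification is unavoidable
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: {clf.n_support_.sum()} support vectors")
```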
Multi-Class Classification
Both logistic regression and SVM are inherently binary classifiers. For multi-class problems, they can be extended using:
One-vs-All (OvA) / One-vs-Rest (OvR)
- Train one classifier per class (that class vs. the rest); see the sketch after this list.
- Prediction: the class whose classifier outputs the highest score or probability.
- Pros: Simple and efficient.
- Cons: Sensitive to imbalanced datasets.
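A minimal OvR sketch with scikit-learn (Iris is an illustrative 3-class dataset): the wrapper trains one binary logistic-regression model per class.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)  # K = 3 classes

# One binary classifier per class: positive class vs. the rest
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print("classifiers trained:", len(ovr.estimators_))  # 3
print("accuracy:", ovr.score(X, y))
```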
One-vs-One (OvO)
- Train one classifier per pair of classes (see the sketch after this list).
- Total classifiers: $K(K-1)/2$
- Prediction: Majority voting.
- Pros: Smaller training sets per classifier.
- Cons: High computational cost for many classes.
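And the OvO counterpart (same illustrative dataset): with $K = 3$ classes the wrapper trains $3 \cdot 2 / 2 = 3$ pairwise classifiers.

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)  # K = 3 classes

# One classifier per pair of classes: K(K-1)/2 in total
ovo = OneVsOneClassifier(LinearSVC(max_iter=10000)).fit(X, y)
print("classifiers trained:", len(ovo.estimators_))  # 3
print("accuracy:", ovo.score(X, y))
```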
Softmax Regression (Multinomial Logistic Regression)
- Extends logistic regression directly to handle multiple classes.
- Outputs a probability distribution over all classes via the softmax function (see the sketch below).
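A minimal NumPy sketch of the softmax itself (the logits are illustrative class scores $w_k^\top x + b_k$): scores are exponentiated and normalized so the outputs form a distribution.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])  # illustrative per-class scores
probs = softmax(logits)
print(probs, probs.sum())  # probabilities over 3 classes, summing to 1
```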
SVM Multi-Class Extensions
- SVM also supports the OvA and OvO strategies; for example, scikit-learn's SVC handles multi-class problems with OvO internally, while LinearSVC uses OvR.
Summary of Key Concepts
| Concept | Definition |
|---|---|
| Classification | Assigning discrete labels to input data |
| Logistic Regression | Linear model with sigmoid for binary classification |
| Cross-Entropy Loss | Convex loss for classification tasks |
| Gradient Descent | Optimization to minimize the loss |
| Learning Rate | Controls update step size in optimization |
| Decision Boundary | Line or hyperplane separating classes |
| Support Vectors | Points closest to the decision boundary in SVM |
| Kernel Trick | Implicitly maps data to higher dimensions |
| Regularization | Penalizes complexity to prevent overfitting |
| Soft Margin SVM | Allows misclassification with penalty |
| OvA / OvO | Multi-class extension strategies for binary classifiers |
| Softmax Regression | Multinomial logistic regression for multi-class tasks |