Convolutional Neural Network
Notes from the summary of COMP3010 Lecture 7, labs, and online resources.
Convolutional Neural Networks (CNN) Overview
MLP vs. CNN
- MLP (Fully Connected Layers): Best suited for tabular data. Each input feature is treated independently, and spatial structure is lost.
- CNN: Designed for image data with spatial hierarchies. Preserves spatial relationships between pixels.
Fully Connected Layers (MLP)
- A 32×32×3 RGB image is flattened to 3072×1 to be input into a fully connected layer.
- Every pixel is treated as an independent value, connected to every neuron.
- Issue: Flattening discards spatial information—neighboring pixels lose context.
Convolutional Layers
- Apply filters (e.g., 5×5×3) that slide over the image.
- Preserve spatial relationships and detect local patterns like edges, textures, and shapes.
- Filters span the full depth of the input (e.g., 3 channels for RGB).
- Fewer parameters due to weight sharing across the image—same filters scan different regions.
Padding
- Zero-padding is often added to input borders to maintain spatial dimensions.
- Prevents the feature map from shrinking too quickly and helps retain edge information.
- Important for deeper layers to process meaningful spatial data.
Parameter Efficiency in CNNs
- Parameters reside in the filters (e.g., 5×5×3 = 75 weights per filter).
- Example: 10 such filters → 750 weights + 10 biases = 760 parameters.
- Compared to MLPs: 3072 inputs × 1024 neurons = ~3 million weights.
- CNNs reduce parameters by reusing the same filters across the entire image.
Pooling Layers
- No parameters (no weights or biases).
- Perform operations like max pooling or average pooling to reduce spatial dimensions.
- Benefits:
- Reduce computational cost
- Introduce translation invariance
- Summarize local features
CNN Architectures
- Convolutional Layer: Preserves spatial layout via local connections and shared filters. Introduces receptive fields and captures low- to high-level patterns.
- Pooling Layer: Reduces dimensionality and retains key features.
- Fully Connected Layer: Final layer that performs classification or regression.
Notable Architectures
- AlexNet: First CNN to win ImageNet; proved CNNs are viable for vision tasks.
- VGGNet: Used deeper networks with small (3×3) filters.
- GoogLeNet: Introduced inception modules; removed fully connected layers.
- ResNet: Added residual connections; allowed very deep networks.
- MobileNet, ShuffleNet: Lightweight models optimized for speed and efficiency.
- Neural Architecture Search: Automates CNN architecture design.
Transfer Learning
- Use a CNN trained on a large dataset (e.g., ImageNet) to fine-tune for a smaller custom dataset.
- Techniques:
- Freeze lower layers (if new data is similar)
- Fine-tune entire model (if new data is different)
- Use pre-trained CNN as a feature extractor
Available through model zoos in PyTorch, TensorFlow, etc.
Key Concepts
Comparing Convolutional vs. Fully Connected Layers
| Feature | Convolutional Layers | Fully-Connected Layers |
|---|---|---|
| Connectivity | Local (receptive field) | Global (all-to-all) |
| Parameters | Shared across spatial locations | Unique weights for every input-neuron pair |
| Feature Learning | Spatial hierarchies (edges, patterns) | Global relationships |
| Input Format | Multi-dimensional arrays (images) | Flattened 1D vectors |
Translation Equivariance: CNNs preserve spatial structure. If an object moves, the activation shifts accordingly.
Role of Pooling Layers
- Dimensionality Reduction: Downsample feature maps.
- Translation Invariance: Makes features robust to small shifts.
- Feature Summarization: Capture dominant features in local regions.
Common types:
- Max Pooling: Retains maximum value
- Average Pooling: Computes average in region
Receptive Field
- The area of the input that influences a particular neuron’s activation.
- Deeper layers see progressively larger input regions.
Effective Receptive Field for Three 3×3 Conv Layers (stride = 1):
- 1st layer: 3×3
- 2nd layer: 5×5 (3 + 2)
- 3rd layer: 7×7 (5 + 2)
Parameter Count for 3 Layers (assuming constant channels $C$):
- Each: $3×3×C×C + C = 9C^2 + C$
- Total: $3×(9C^2 + C) = 27C^2 + 3C$
CNN Troubleshooting & Strategies
Low Accuracy (Train + Test)
- Check dataset: Cleanliness, label accuracy, class balance
- Model architecture: May be too simple or too complex
- Hyperparameters: Adjust learning rate, batch size, epochs
- Preprocessing: Normalize inputs; verify augmentations
- Debugging: Check for implementation bugs
High Training, Low Test Accuracy
- Overfitting mitigation:
- Data Augmentation
- Dropout
- L1/L2 Regularization
- Batch Normalization
- Early Stopping
- Improving Generalization:
- Collect more data
- Try different architectures
- Use Transfer Learning
- Optimize hyperparameters for validation performance
Leave a comment