📢 Notice 📢

Please have a read first!

Convolutional Neural Network

April 14, 2025 3 minute read

Notes from the summary of COMP3010 Lecture 7, labs, and online resources.

Convolutional Neural Networks (CNN) Overview

MLP vs. CNN

MLP (Fully Connected Layers): Best suited for tabular data. Each input feature is treated independently, and spatial structure is lost.
CNN: Designed for image data with spatial hierarchies. Preserves spatial relationships between pixels.

Fully Connected Layers (MLP)

A 32×32×3 RGB image is flattened to 3072×1 to be input into a fully connected layer.
Every pixel is treated as an independent value, connected to every neuron.
Issue: Flattening discards spatial information—neighboring pixels lose context.

Convolutional Layers

Apply filters (e.g., 5×5×3) that slide over the image.
Preserve spatial relationships and detect local patterns like edges, textures, and shapes.
Filters span the full depth of the input (e.g., 3 channels for RGB).
Fewer parameters due to weight sharing across the image—same filters scan different regions.

Padding

Zero-padding is often added to input borders to maintain spatial dimensions.
Prevents the feature map from shrinking too quickly and helps retain edge information.
Important for deeper layers to process meaningful spatial data.

Parameter Efficiency in CNNs

Parameters reside in the filters (e.g., 5×5×3 = 75 weights per filter).
Example: 10 such filters → 750 weights + 10 biases = 760 parameters.
Compared to MLPs: 3072 inputs × 1024 neurons = ~3 million weights.
CNNs reduce parameters by reusing the same filters across the entire image.

Pooling Layers

No parameters (no weights or biases).
Perform operations like max pooling or average pooling to reduce spatial dimensions.
Benefits:
- Reduce computational cost
- Introduce translation invariance
- Summarize local features

CNN Architectures

Convolutional Layer: Preserves spatial layout via local connections and shared filters. Introduces receptive fields and captures low- to high-level patterns.
Pooling Layer: Reduces dimensionality and retains key features.
Fully Connected Layer: Final layer that performs classification or regression.

Notable Architectures

AlexNet: First CNN to win ImageNet; proved CNNs are viable for vision tasks.
VGGNet: Used deeper networks with small (3×3) filters.
GoogLeNet: Introduced inception modules; removed fully connected layers.
ResNet: Added residual connections; allowed very deep networks.
MobileNet, ShuffleNet: Lightweight models optimized for speed and efficiency.
Neural Architecture Search: Automates CNN architecture design.

Transfer Learning

Use a CNN trained on a large dataset (e.g., ImageNet) to fine-tune for a smaller custom dataset.
Techniques:
- Freeze lower layers (if new data is similar)
- Fine-tune entire model (if new data is different)
- Use pre-trained CNN as a feature extractor

Available through model zoos in PyTorch, TensorFlow, etc.

Key Concepts

Comparing Convolutional vs. Fully Connected Layers

Feature	Convolutional Layers	Fully-Connected Layers
Connectivity	Local (receptive field)	Global (all-to-all)
Parameters	Shared across spatial locations	Unique weights for every input-neuron pair
Feature Learning	Spatial hierarchies (edges, patterns)	Global relationships
Input Format	Multi-dimensional arrays (images)	Flattened 1D vectors

Translation Equivariance: CNNs preserve spatial structure. If an object moves, the activation shifts accordingly.

Role of Pooling Layers

Dimensionality Reduction: Downsample feature maps.
Translation Invariance: Makes features robust to small shifts.
Feature Summarization: Capture dominant features in local regions.

Common types:

Max Pooling: Retains maximum value
Average Pooling: Computes average in region

Receptive Field

The area of the input that influences a particular neuron’s activation.
Deeper layers see progressively larger input regions.

Effective Receptive Field for Three 3×3 Conv Layers (stride = 1):

1st layer: 3×3
2nd layer: 5×5 (3 + 2)
3rd layer: 7×7 (5 + 2)

Parameter Count for 3 Layers (assuming constant channels $C$):

Each: $3×3×C×C + C = 9C^2 + C$
Total: $3×(9C^2 + C) = 27C^2 + 3C$

CNN Troubleshooting & Strategies

Low Accuracy (Train + Test)

Check dataset: Cleanliness, label accuracy, class balance
Model architecture: May be too simple or too complex
Hyperparameters: Adjust learning rate, batch size, epochs
Preprocessing: Normalize inputs; verify augmentations
Debugging: Check for implementation bugs

High Training, Low Test Accuracy

Overfitting mitigation:
- Data Augmentation
- Dropout
- L1/L2 Regularization
- Batch Normalization
- Early Stopping
Improving Generalization:
- Collect more data
- Try different architectures
- Use Transfer Learning
- Optimize hyperparameters for validation performance

Share on

X Facebook LinkedIn Bluesky