Image Segmentation
Image segmentation divides an image into meaningful regions so that each region corresponds to a distinct object or part of an object.
Where object detection gives boxes, segmentation gives pixel-precise masks.
1. What Segmentation Is (and Isn’t)
- Classification: one label for the whole image (e.g., “cat”).
- Detection: labels + bounding boxes around objects.
- Segmentation: labels at the pixel level – we care about the exact shape and extent of objects (road vs sidewalk vs cars vs sky).
Segmentation is crucial in:
- Autonomous driving (road, lane, pedestrians, cars).
- Medical imaging (tumours, organs, vessels).
- Remote sensing (land cover, buildings).
- Agriculture (crops vs weeds, canopy vs soil).
2. Taxonomy: Semantic, Instance, Panoptic
2.1 Semantic Segmentation
- Assigns a class label to every pixel.
- All pixels of the same class share one label (e.g., all “road” pixels = label 0).
- Output: label map where each pixel stores a class ID.
Limitations:
- Cannot distinguish individual instances of the same class (car #1 vs car #2).
2.2 Instance Segmentation
- Separates each object instance of a class:
- Car 1, Car 2, Person 1, Person 2, …
- Output: one mask per object + class label.
Often evaluated with detection-style metrics (AP) adapted to masks.
2.3 Panoptic Segmentation
- Combines both ideas:
- Gives each pixel a semantic label, and
- Distinguishes instances of “thing” classes (cars, people, etc.).
- Treats:
- Stuff: amorphous regions (road, grass, sky).
- Things: countable objects (cars, people, bikes).
Metric: Panoptic Quality (PQ)
- $ \text{PQ} = \text{SQ} \times \text{RQ} $
- SQ – Segmentation Quality (how good the overlaps are).
- RQ – Recognition Quality (how many segments are matched correctly).
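The PQ = SQ × RQ decomposition can be made concrete with a tiny numeric sketch. This is an illustrative toy with made-up IoU values, not a benchmark implementation; in the standard definition, predicted and ground-truth segments count as matched (TP) when their IoU exceeds 0.5, and unmatched predictions/ground truths are FP/FN.

```python
# Toy Panoptic Quality computation (segments assumed matched at IoU > 0.5).
def panoptic_quality(matched_ious, num_fp, num_fn):
    """matched_ious: IoUs of the matched (TP) segment pairs."""
    tp = len(matched_ious)
    if tp == 0:
        return 0.0
    sq = sum(matched_ious) / tp                   # Segmentation Quality
    rq = tp / (tp + 0.5 * num_fp + 0.5 * num_fn)  # Recognition Quality
    return sq * rq                                # PQ = SQ * RQ

# 3 matched segments, 1 false positive, 1 false negative
pq = panoptic_quality([0.9, 0.8, 0.7], num_fp=1, num_fn=1)
print(round(pq, 3))  # SQ = 0.8, RQ = 0.75 -> PQ = 0.6
```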
3. Evaluating Segmentation
Common metrics:
- Pixel Accuracy
- Fraction of correctly labeled pixels.
- Simple but can be misleading if one class (e.g., background) dominates.
- Intersection over Union (IoU)
- For a class: $ \text{IoU} = \frac{\text{TP}}{\text{TP} + \text{FP} + \text{FN}} $
- Measures overlap quality between predicted and ground-truth regions.
- Mean IoU (mIoU)
- Average IoU across all classes.
- Standard metric for semantic segmentation benchmarks (Cityscapes, ADE20K).
- AP / mAP for instance masks
- Borrowed from object detection, but applied to mask IoU instead of box IoU.
- PQ (Panoptic Quality) for panoptic segmentation
- Jointly captures detection and segmentation quality.
4. Classical Unsupervised Segmentation
These methods do not require labeled training data. They operate on the image directly and are very exam-relevant (Otsu, Watershed, Mean Shift).
4.1 Otsu’s Thresholding
Goal: Find a global intensity threshold that best separates foreground and background in a grayscale image.
Key idea:
- Look at the image histogram of gray levels.
- Partition it into two classes (below threshold = background, above = foreground).
- Choose the threshold that:
- Minimises within-class variance, or
- Equivalently maximises between-class variance.
Pros:
- Very fast and fully automatic.
- Works well when the histogram is bimodal (clear separation).
Limitations:
- Ignores spatial relationships between pixels – only looks at the histogram.
- Struggles with:
- Uneven illumination.
- Overlapping intensity distributions.
- Touching/overlapping objects.
- Can produce noisy masks and outliers in complex scenes.
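The threshold search above can be sketched in a few lines of numpy: sweep all 256 candidate thresholds and keep the one maximising between-class variance. Illustrative only; libraries such as OpenCV and scikit-image ship optimised versions.

```python
import numpy as np

def otsu_threshold(image):
    """Return the threshold maximising between-class variance of the histogram."""
    hist = np.bincount(image.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0, w1 = prob[:t].sum(), prob[t:].sum()   # class weights
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (np.arange(t) * prob[:t]).sum() / w0        # class means
        mu1 = (np.arange(t, 256) * prob[t:]).sum() / w1
        between_var = w0 * w1 * (mu0 - mu1) ** 2          # between-class variance
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t  # pixels >= best_t classified as foreground

# Clearly bimodal toy image: dark background (~50), bright object (~200)
img = np.concatenate([np.full(500, 50), np.full(500, 200)]).astype(np.uint8)
t = otsu_threshold(img)
print(50 < t <= 200)  # threshold lands between the two modes
```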
4.2 Watershed (Often with Distance Transform)
Intuition: Treat the (preprocessed) image like a topographic surface.
Typical pipeline for separating touching objects:
- Obtain a coarse binary mask (foreground vs background).
- Compute a distance transform:
- Each foreground pixel’s value = distance to the nearest background pixel.
- Peaks correspond to object centers.
- Identify markers (local maxima / seeds).
- Apply the watershed transform:
- “Flood” from each marker.
- Where two floods meet, create a boundary (watershed line).
Use cases:
- Separating touching coins/cells after thresholding.
- Refining segmentation in binary images.
Pitfalls:
- Highly sensitive to noise – tends to over-segment if not carefully preprocessed.
- Usually combined with smoothing, thresholding, or marker-based approaches.
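The distance-transform step can be illustrated with a brute-force numpy sketch on a tiny mask of two blobs joined by a narrow neck. Real pipelines would use `scipy.ndimage.distance_transform_edt` plus a library watershed; this toy version only shows that the blob centres emerge as separate peaks (markers), while the neck where the objects touch stays shallow.

```python
import numpy as np

def distance_transform(mask):
    """Brute force: each foreground pixel's distance to the nearest background pixel."""
    ys, xs = np.nonzero(mask == 0)                 # background coordinates
    bg = np.stack([ys, xs], axis=1).astype(float)
    dist = np.zeros(mask.shape)
    for y, x in zip(*np.nonzero(mask)):            # each foreground pixel
        d = np.sqrt(((bg - [y, x]) ** 2).sum(axis=1))
        dist[y, x] = d.min()
    return dist

# Two 5x5 squares connected by a 1-pixel bridge (touching objects)
mask = np.zeros((7, 13), dtype=int)
mask[1:6, 1:6] = 1
mask[1:6, 7:12] = 1
mask[3, 6] = 1
dist = distance_transform(mask)
# Blob centres are deep (markers); the bridge is shallow (watershed line forms there)
print(dist[3, 3], dist[3, 6], dist[3, 9])  # 3.0 1.0 3.0
```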
4.3 Mean Shift Segmentation
Mean Shift is a mode-seeking algorithm on a kernel density estimate.
For segmentation:
- Represent each pixel in a feature space such as:
- Color (RGB or CIELAB L*a*b*), and optionally
- Spatial coordinates (x,y).
- Place a kernel around each point and iteratively shift it to the local density maximum (mode).
- Pixels converging to the same mode become a cluster/segment.
Advantages:
- No assumption of Gaussian clusters.
- Can handle arbitrary-shaped clusters.
- Works well in low-dimensional feature spaces (color + position).
Disadvantages:
- Computationally expensive for large images/high-dimensional features.
- Bandwidth (kernel size) strongly affects results:
- Small bandwidth → many small segments.
- Large bandwidth → over-smoothing and loss of detail.
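The mode-seeking loop can be sketched with a flat kernel on intensity-only features of a tiny 1-D "image". This is an illustrative toy (real segmenters such as scikit-learn's `MeanShift` add spatial coordinates and acceleration structures), but it shows the bandwidth effect: with a window of 30 the two intensity clusters collapse to two modes.

```python
import numpy as np

def mean_shift(points, bandwidth, iters=50):
    """Flat-kernel mean shift: move each point to the mean of its window."""
    modes = points.astype(float).copy()
    for _ in range(iters):
        for i, p in enumerate(modes):
            nbrs = points[np.abs(points - p) < bandwidth]  # window neighbours
            modes[i] = nbrs.mean()                         # shift to local mean
    return modes

pixels = np.array([10., 12., 11., 200., 205., 199.])  # two intensity clusters
modes = mean_shift(pixels, bandwidth=30)
# Pixels converging to (nearly) the same mode form one segment
labels = np.round(modes).astype(int)
print(labels)  # first three pixels share one mode, last three another
```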
5. Deep Learning for Segmentation
Modern semantic/instance/panoptic segmentation relies heavily on CNN-based architectures.
5.1 Fully Convolutional Networks (FCN)
- Replace fully connected layers with convolutional layers so the network produces per-pixel predictions.
- Use upsampling / deconvolution (transpose conv) to bring coarse feature maps back to image resolution.
FCN introduced the core idea: take a classification CNN and turn it into a dense predictor.
5.2 U-Net
Classic for medical imaging.
- Encoder–decoder structure:
- Encoder: downsampling path (convolutions + pooling).
- Decoder: upsampling path (deconvolutions / upconvs).
- Skip connections:
- Copy features from encoder layers to corresponding decoder layers.
- Helps recover fine details lost during downsampling.
Strengths:
- Works well even with relatively small datasets (common in medical).
- Sharp, detailed masks.
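The skip-connection idea is mostly shape bookkeeping, which a short numpy sketch can show for one U-Net level: 2×2 max-pool down, nearest-neighbour upsampling back, then channel-wise concatenation with the encoder's feature map. (Convolutions are deliberately omitted; this only illustrates how the decoder regains full-resolution detail from the skip.)

```python
import numpy as np

def max_pool2x2(x):                    # x: (channels, H, W)
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def upsample2x(x):                     # nearest-neighbour upsampling
    return x.repeat(2, axis=1).repeat(2, axis=2)

enc = np.random.rand(16, 64, 64)       # encoder features (skip source)
bottleneck = max_pool2x2(enc)          # (16, 32, 32): coarse, context-rich
up = upsample2x(bottleneck)            # (16, 64, 64): back at full resolution
decoder_in = np.concatenate([up, enc], axis=0)  # skip connection -> (32, 64, 64)
print(bottleneck.shape, decoder_in.shape)
```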
5.3 SegNet / PSPNet / Fast-SCNN (Awareness)
- SegNet:
- Similar encoder–decoder.
- Decoder reuses pooling indices from encoder to upsample more efficiently.
- PSPNet:
- Uses pyramid pooling to capture multi-scale context.
- Improves performance on complex, large-scale scenes.
- Fast-SCNN:
- Lightweight architecture for real-time semantic segmentation.
- Designed for embedded or mobile deployment (speed vs accuracy trade-off).
5.4 Mask R-CNN (Instance Segmentation)
- Built on top of Faster R-CNN.
- Adds a third branch (besides bounding box and class):
- A small FCN that predicts a binary mask for each detected object.
- Output per detection:
- Class label.
- Bounding box.
- Pixel-wise mask.
This turns a detection model into an instance segmentation model.
6. Datasets We Should Know
Common benchmarks mentioned in lectures:
- Cityscapes
- High-quality annotations for urban street scenes.
- Used extensively for autonomous driving research.
- 19 semantic classes (road, sidewalk, building, etc.).
- ADE20K
- ~150 semantic categories.
- Diverse scenes (indoor + outdoor).
- More challenging due to variety and fine-grained labels.
- CamVid
- Older dataset of road scenes.
- Lower resolution and fewer images, but historically important.
Knowing the typical classes and scenes helps when answering exam questions about applications and challenges.
7. Practical Issues & Trade-offs
7.1 Speed vs Accuracy
- Heavier models (PSPNet, large U-Nets, high-res FCNs):
- Higher mIoU, slower inference.
- Lightweight models (Fast-SCNN, smaller backbones):
- Lower mIoU but higher FPS.
- Choice depends on:
- Real-time constraints (e.g., autonomous driving).
- Hardware budget (GPU vs embedded).
7.2 Class Imbalance
- Background or “stuff” classes often dominate.
- Small objects (signs, pedestrians) may be rare.
- Remedies:
- Class-balanced or focal losses.
- Oversampling hard/rare examples.
- Carefully tuned data augmentation.
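The focal-loss remedy can be sketched for the binary per-pixel case: confident, correct pixels are down-weighted by the modulating factor $(1 - p_t)^\gamma$, so rare or hard pixels dominate the gradient. A minimal numpy version, with the usual $\gamma$ and $\alpha$ hyperparameters:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss over per-pixel foreground probabilities p and labels y."""
    p = np.clip(p, eps, 1 - eps)
    p_t = np.where(y == 1, p, 1 - p)          # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return float(np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t)))

y = np.array([1, 1, 0, 0])
easy = np.array([0.95, 0.90, 0.10, 0.05])  # confident, correct predictions
hard = np.array([0.30, 0.40, 0.60, 0.70])  # uncertain predictions
print(focal_loss(easy, y) < focal_loss(hard, y))  # easy pixels barely count
```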
7.3 Domain Shift
- Training on one dataset (e.g., Cityscapes) and deploying on another (different city, country, sensor) can cause performance drops.
- Domain adaptation and fine-tuning are often required in practice.
8. Exam-Oriented Takeaways
Be ready to:
- Define semantic segmentation, instance segmentation, and panoptic segmentation, and explain:
- Output formats.
- Use cases.
- PQ = SQ × RQ for panoptic.
- Describe Otsu’s method:
- Operates on the intensity histogram.
- Chooses the threshold that minimises within-class variance / maximises between-class variance.
- Fails with non-uniform lighting, overlapping distributions, and no spatial context.
- Describe watershed segmentation:
- Optionally combined with distance transform.
- “Basins flood from markers; where floods meet you get boundaries.”
- Good for separating touching objects; prone to over-segmentation.
- Describe Mean Shift segmentation:
- Clustering in joint color–position space via mode seeking on a kernel density estimate.
- Bandwidth controls segment size; works best in low-dimensional feature spaces.
- Outline key deep learning models:
- FCN, U-Net, SegNet/PSPNet/Fast-SCNN (semantic).
- Mask R-CNN (instance).
- Mention speed–accuracy trade-offs and evaluation metrics (Pixel Acc, mIoU, AP, PQ).
Image segmentation is what turns raw images into a dense, structured understanding of the scene – telling us not just what is present, but exactly where each thing begins and ends.