
Image segmentation divides an image into meaningful regions so that each region corresponds to a distinct object or part of an object.

Where object detection gives boxes, segmentation gives pixel-precise masks.

1. What Segmentation Is (and Isn’t)

  • Classification: one label for the whole image (e.g., “cat”).
  • Detection: labels + bounding boxes around objects.
  • Segmentation: labels at the pixel level – we care about the exact shape and extent of objects (road vs sidewalk vs cars vs sky).

Segmentation is crucial in:

  • Autonomous driving (road, lane, pedestrians, cars).
  • Medical imaging (tumours, organs, vessels).
  • Remote sensing (land cover, buildings).
  • Agriculture (crops vs weeds, canopy vs soil).

2. Taxonomy: Semantic, Instance, Panoptic

2.1 Semantic Segmentation

  • Assigns a class label to every pixel.
  • All pixels of the same class share one label (e.g., all “road” pixels = label 0).
  • Output: label map where each pixel stores a class ID.

Limitations:

  • Cannot distinguish individual instances of the same class (car #1 vs car #2).

2.2 Instance Segmentation

  • Separates each object instance of a class:
    • Car 1, Car 2, Person 1, Person 2, …
  • Output: one mask per object + class label.

Often evaluated with detection-style metrics (AP) adapted to masks.

2.3 Panoptic Segmentation

  • Combines both ideas:
    • Gives each pixel a semantic label, and
    • Distinguishes instances of “thing” classes (cars, people, etc.).
  • Treats:
    • Stuff: amorphous regions (road, grass, sky).
    • Things: countable objects (cars, people, bikes).

Metric: Panoptic Quality (PQ)

  • $ \text{PQ} = \text{SQ} \times \text{RQ} $
    • SQ – Segmentation Quality (average IoU of the matched segments).
    • RQ – Recognition Quality (an F1-style measure of how many segments are matched correctly).
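As a toy numerical check of the formula (the IoU values and match counts below are made up, not from any benchmark):

```python
# Toy PQ computation for a hypothetical prediction with two correctly
# matched segments (TP), one spurious segment (FP), and no misses (FN).
# A match requires IoU > 0.5; the IoU values here are invented.
matched_ious = [0.8, 0.6]   # IoU of each true-positive match
tp, fp, fn = len(matched_ious), 1, 0

sq = sum(matched_ious) / tp               # Segmentation Quality: mean IoU of matches
rq = tp / (tp + 0.5 * fp + 0.5 * fn)      # Recognition Quality: F1-style match rate
pq = sum(matched_ious) / (tp + 0.5 * fp + 0.5 * fn)

print(round(sq, 2), round(rq, 2), round(pq, 2))   # 0.7 0.8 0.56
```

Note that PQ comes out exactly equal to SQ × RQ, as the decomposition promises.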

3. Evaluating Segmentation

Common metrics:

  • Pixel Accuracy
    • Fraction of correctly labeled pixels.
    • Simple but can be misleading if one class (e.g., background) dominates.
  • Intersection over Union (IoU)
    • For a class: $ \text{IoU} = \frac{\text{TP}}{\text{TP} + \text{FP} + \text{FN}} $

    • Measures overlap quality between predicted and ground-truth regions.

  • Mean IoU (mIoU)
    • Average IoU across all classes.
    • Standard metric for semantic segmentation benchmarks (Cityscapes, ADE20K).
  • AP / mAP for instance masks
    • Borrowed from object detection, but applied to mask IoU instead of box IoU.
  • PQ (Panoptic Quality) for panoptic segmentation
    • Jointly captures detection and segmentation quality.
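The IoU/mIoU definitions above can be sketched directly in NumPy; the label maps below are tiny toy examples with arbitrary class IDs:

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Per-class IoU and mIoU from integer label maps of the same shape."""
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        inter = np.logical_and(p, g).sum()   # TP
        union = np.logical_or(p, g).sum()    # TP + FP + FN
        if union:                            # skip classes absent in both maps
            ious.append(inter / union)
    return ious, float(np.mean(ious))

gt   = np.array([[0, 0, 1, 1],
                 [0, 0, 1, 1]])
pred = np.array([[0, 0, 1, 0],   # one class-1 pixel mislabelled as class 0
                 [0, 0, 1, 1]])
ious, miou = mean_iou(pred, gt, num_classes=2)
print(ious, miou)   # [0.8, 0.75] 0.775
```

The one mislabelled pixel counts as a false positive for class 0 and a false negative for class 1, so it lowers both per-class IoUs.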

4. Classical Unsupervised Segmentation

These methods do not require labeled training data. They operate on the image directly and are very exam-relevant (Otsu, Watershed, Mean Shift).

4.1 Otsu’s Thresholding

Goal: Find a global intensity threshold that best separates foreground and background in a grayscale image.

Key idea:

  • Look at the image histogram of gray levels.
  • Partition it into two classes (below threshold = background, above = foreground).
  • Choose the threshold that:
    • Minimises within-class variance, or
    • Equivalently maximises between-class variance.

Pros:

  • Very fast and fully automatic.
  • Works well when the histogram is bimodal (clear separation).

Limitations:

  • Ignores spatial relationships between pixels – only looks at the histogram.
  • Struggles with:
    • Uneven illumination.
    • Overlapping intensity distributions.
    • Touching/overlapping objects.
  • Can produce noisy masks and outliers in complex scenes.
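A minimal from-scratch sketch of Otsu's criterion (maximising between-class variance over the histogram) on a synthetic bimodal image; in practice you would use a library implementation such as OpenCV's:

```python
import numpy as np

def otsu_threshold(img):
    """Return the threshold t that maximises between-class variance.
    Pixels <= t are 'background', pixels > t are 'foreground'."""
    hist = np.bincount(img.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    best_t, best_var = 0, -1.0
    for t in range(255):
        w0 = prob[:t + 1].sum()        # class probability below threshold
        w1 = 1.0 - w0
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (np.arange(t + 1) * prob[:t + 1]).sum() / w0        # class means
        mu1 = (np.arange(t + 1, 256) * prob[t + 1:]).sum() / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t

# Synthetic bimodal image: dark background around 50, bright blob around 200
rng = np.random.default_rng(0)
img = rng.normal(50, 10, (64, 64))
img[16:48, 16:48] = rng.normal(200, 10, (32, 32))
img = np.clip(img, 0, 255).astype(np.uint8)

t = otsu_threshold(img)
mask = img > t          # t lands somewhere between the two modes
```

Note that the algorithm only ever touches the histogram: the blob's position in the image is irrelevant, which is exactly the "no spatial context" limitation above.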

4.2 Watershed (Often with Distance Transform)

Intuition: Treat the (preprocessed) image like a topographic surface.

Typical pipeline for separating touching objects:

  1. Obtain a coarse binary mask (foreground vs background).
  2. Compute a distance transform:
    • Each foreground pixel’s value = distance to the nearest background pixel.
    • Peaks correspond to object centers.
  3. Identify markers (local maxima / seeds).
  4. Apply the watershed transform:
    • “Flood” from each marker.
    • Where two floods meet, create a boundary (watershed line).

Use cases:

  • Separating touching coins/cells after thresholding.
  • Refining segmentation in binary images.

Pitfalls:

  • Highly sensitive to noise – tends to over-segment if not carefully preprocessed.
  • Usually combined with smoothing, thresholding, or marker-based approaches.

4.3 Mean Shift Segmentation

Mean Shift is a mode-seeking algorithm on a kernel density estimate.

For segmentation:

  • Represent each pixel in a feature space such as:
    • Color (RGB or CIELAB), and optionally
    • Spatial coordinates (x,y).
  • Place a kernel around each point and iteratively shift it to the local density maximum (mode).
  • Pixels converging to the same mode become a cluster/segment.

Advantages:

  • No assumption of Gaussian clusters.
  • Can handle arbitrary-shaped clusters.
  • Works well in low-dimensional feature spaces (color + position).

Disadvantages:

  • Computationally expensive for large images/high-dimensional features.
  • Bandwidth (kernel size) strongly affects results:
    • Small bandwidth → many small segments.
    • Large bandwidth → over-smoothing and loss of detail.
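A tiny from-scratch mean shift with a flat kernel, run on toy "pixels" in RGB space (no spatial coordinates, synthetic data, hand-picked bandwidth); real implementations use optimised library code:

```python
import numpy as np

def mean_shift_modes(points, bandwidth, n_iter=30):
    """Shift every point to the mode of a flat-kernel density estimate."""
    modes = points.astype(float).copy()
    for _ in range(n_iter):
        for i in range(len(modes)):
            # flat kernel: mean of all original points within the bandwidth
            near = np.linalg.norm(points - modes[i], axis=1) < bandwidth
            modes[i] = points[near].mean(axis=0)
    return modes

# Toy "pixels": a reddish cluster and a bluish cluster in RGB space
rng = np.random.default_rng(0)
red  = rng.normal([200, 40, 40], 5, (50, 3))
blue = rng.normal([40, 40, 200], 5, (50, 3))
points = np.vstack([red, blue])

modes = mean_shift_modes(points, bandwidth=60.0)
# Points whose modes coincide (within half a bandwidth) form one segment
labels = (np.linalg.norm(modes - modes[0], axis=1) < 30).astype(int)
print(labels[:50].min(), labels[50:].max())   # 1 0 - two clean clusters
```

Shrinking the bandwidth below the clusters' spread would fragment each colour into many tiny modes, which is exactly the small-bandwidth failure mode listed above.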

5. Deep Learning for Segmentation

Modern semantic/instance/panoptic segmentation relies heavily on CNN-based architectures.

5.1 Fully Convolutional Networks (FCN)

  • Replace fully connected layers with convolutional layers so the network produces per-pixel predictions.
  • Use upsampling / deconvolution (transposed convolution) to bring coarse feature maps back to image resolution.

FCN introduced the core idea: take a classification CNN and turn it into a dense predictor.
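The shape logistics of that idea can be shown in NumPy: a coarse per-class score map (as a convolutionalised classifier would produce at, say, stride 8) is upsampled back to input resolution and argmaxed per pixel. Real FCNs learn the upsampling with transposed convolutions; nearest-neighbour repetition here is only a stand-in:

```python
import numpy as np

# Coarse per-class score map: shape (num_classes, H/8, W/8), random toy values
num_classes, h8, w8, stride = 3, 4, 4, 8
coarse = np.random.default_rng(0).normal(size=(num_classes, h8, w8))

# Upsample each spatial axis by the stride (nearest-neighbour stand-in
# for a learned transposed convolution)
full = coarse.repeat(stride, axis=1).repeat(stride, axis=2)

# Dense prediction: one class ID per pixel
label_map = full.argmax(axis=0)
print(label_map.shape)   # (32, 32) - a label for every input pixel
```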

5.2 U-Net

Classic for medical imaging.

  • Encoder–decoder structure:
    • Encoder: downsampling path (convolutions + pooling).
    • Decoder: upsampling path (deconvolutions / upconvs).
  • Skip connections:
    • Copy features from encoder layers to corresponding decoder layers.
    • Helps recover fine details lost during downsampling.

Strengths:

  • Works well even with relatively small datasets (common in medical imaging).
  • Sharp, detailed masks.

5.3 SegNet / PSPNet / Fast-SCNN (Awareness)

  • SegNet:
    • Similar encoder–decoder.
    • Decoder reuses pooling indices from encoder to upsample more efficiently.
  • PSPNet:
    • Uses pyramid pooling to capture multi-scale context.
    • Improves performance on complex, large-scale scenes.
  • Fast-SCNN:
    • Lightweight architecture for real-time semantic segmentation.
    • Designed for embedded or mobile deployment (speed vs accuracy trade-off).

5.4 Mask R-CNN (Instance Segmentation)

  • Built on top of Faster R-CNN.
  • Adds a third branch (besides bounding box and class):
    • A small FCN that predicts a binary mask for each detected object.
  • Output per detection:
    • Class label.
    • Bounding box.
    • Pixel-wise mask.

This turns a detection model into an instance segmentation model.

6. Datasets We Should Know

Common benchmarks mentioned in lectures:

  • Cityscapes
    • High-quality annotations for urban street scenes.
    • Used extensively for autonomous driving research.
    • 19 semantic classes (road, sidewalk, building, etc.).
  • ADE20K
    • ~150 semantic categories.
    • Diverse scenes (indoor + outdoor).
    • More challenging due to variety and fine-grained labels.
  • CamVid
    • Older dataset of road scenes.
    • Lower resolution and fewer images, but historically important.

Knowing the typical classes and scenes helps when answering exam questions about applications and challenges.

7. Practical Issues & Trade-offs

7.1 Speed vs Accuracy

  • Heavier models (PSPNet, large U-Nets, high-res FCNs):
    • Higher mIoU, slower inference.
  • Lightweight models (Fast-SCNN, smaller backbones):
    • Lower mIoU but higher FPS.
  • Choice depends on:
    • Real-time constraints (e.g., autonomous driving).
    • Hardware budget (GPU vs embedded).

7.2 Class Imbalance

  • Background or “stuff” classes often dominate.
  • Small objects (signs, pedestrians) may be rare.
  • Remedies:
    • Class-balanced or focal losses.
    • Oversampling hard/rare examples.
    • Carefully tuned data augmentation.
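A minimal NumPy sketch of the binary focal loss mentioned above (the probabilities are made-up example values), showing how the focusing term (1 − p_t)^γ down-weights easy, abundant pixels:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss per pixel: -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    p_t = np.where(y == 1, p, 1 - p)              # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)  # optional class weighting
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

# An easy, confidently-correct background pixel vs a hard rare-class pixel
easy = focal_loss(np.array([0.05]), np.array([0]))   # p_t = 0.95
hard = focal_loss(np.array([0.3]),  np.array([1]))   # p_t = 0.3
print(float(easy[0]) < float(hard[0]))   # True - the easy pixel barely counts
```

With γ = 0 and α = 0.5 this reduces to (half) the ordinary cross-entropy, where the sheer number of easy background pixels would dominate the gradient.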

7.3 Domain Shift

  • Training on one dataset (e.g., Cityscapes) and deploying on another (different city, country, sensor) can cause performance drops.
  • Domain adaptation and fine-tuning are often required in practice.

8. Exam-Oriented Takeaways

Be ready to:

  • Define semantic segmentation, instance segmentation, and panoptic segmentation, and explain:
    • Output formats.
    • Use cases.
    • PQ = SQ × RQ for panoptic.
  • Describe Otsu’s method:
    • Operates on the intensity histogram.
    • Chooses the threshold that minimises within-class variance / maximises between-class variance.
    • Fails with non-uniform lighting, overlapping distributions, and no spatial context.
  • Describe watershed segmentation:
    • Optionally combined with distance transform.
    • “Basins flood from markers; where floods meet you get boundaries.”
    • Good for separating touching objects; prone to over-segmentation.
  • Describe Mean Shift segmentation:
    • Clustering in joint color–position space via mode seeking on a kernel density estimate.
    • Bandwidth controls segment size; works best in low-dimensional feature spaces.
  • Outline key deep learning models:
    • FCN, U-Net, SegNet/PSPNet/Fast-SCNN (semantic).
    • Mask R-CNN (instance).
    • Mention speed–accuracy trade-offs and evaluation metrics (Pixel Acc, mIoU, AP, PQ).

Image segmentation is what turns raw images into a dense, structured understanding of the scene – telling us not just what is present, but exactly where each thing begins and ends.
