
Image segmentation divides an image into meaningful regions so that each region corresponds to a distinct object or part of an object.

Where object detection gives boxes, segmentation gives pixel-precise masks.

1. What Segmentation Is (and Isn’t)

  • Classification: one label for the whole image (e.g., “cat”).
  • Detection: labels + bounding boxes around objects.
  • Segmentation: labels at the pixel level – we care about the exact shape and extent of objects (road vs sidewalk vs cars vs sky).

Segmentation is crucial in:

  • Autonomous driving (road, lane, pedestrians, cars).
  • Medical imaging (tumours, organs, vessels).
  • Remote sensing (land cover, buildings).
  • Agriculture (crops vs weeds, canopy vs soil).

2. Taxonomy: Semantic, Instance, Panoptic

2.1 Semantic Segmentation

  • Assigns a class label to every pixel.
  • All pixels of the same class share one label (e.g., all “road” pixels = label 0).
  • Output: label map where each pixel stores a class ID.

Limitations:

  • Cannot distinguish individual instances of the same class (car #1 vs car #2).

2.2 Instance Segmentation

  • Separates each object instance of a class:
    • Car 1, Car 2, Person 1, Person 2, …
  • Output: one mask per object + class label.

Often evaluated with detection-style metrics (AP) adapted to masks.

2.3 Panoptic Segmentation

  • Combines both ideas:
    • Gives each pixel a semantic label, and
    • Distinguishes instances of “thing” classes (cars, people, etc.).
  • Treats:
    • Stuff: amorphous regions (road, grass, sky).
    • Things: countable objects (cars, people, bikes).

Metric: Panoptic Quality (PQ)

  • $ \text{PQ} = \text{SQ} \times \text{RQ} $
    • SQ – Segmentation Quality (average IoU of the matched segments).
    • RQ – Recognition Quality (an F1-style measure of how many segments are matched correctly).
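As a toy numerical check of the formula (the IoU values and match counts below are made up, not from any benchmark):

```python
# Toy PQ computation for a hypothetical prediction with two correctly
# matched segments (TP), one spurious segment (FP), and no misses (FN).
# A match requires IoU > 0.5; the IoU values here are invented.
matched_ious = [0.8, 0.6]   # IoU of each true-positive match
tp, fp, fn = len(matched_ious), 1, 0

sq = sum(matched_ious) / tp               # Segmentation Quality: mean IoU of matches
rq = tp / (tp + 0.5 * fp + 0.5 * fn)      # Recognition Quality: F1-style match rate
pq = sum(matched_ious) / (tp + 0.5 * fp + 0.5 * fn)

print(round(sq, 2), round(rq, 2), round(pq, 2))   # 0.7 0.8 0.56
```

Note that PQ comes out exactly equal to SQ × RQ, as the decomposition promises.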

3. Evaluating Segmentation

Common metrics:

  • Pixel Accuracy
    • Fraction of correctly labeled pixels.
    • Simple but can be misleading if one class (e.g., background) dominates.
  • Intersection over Union (IoU)
    • For a class: $ \text{IoU} = \frac{\text{TP}}{\text{TP} + \text{FP} + \text{FN}} $

    • Measures overlap quality between predicted and ground-truth regions.

  • Mean IoU (mIoU)
    • Average IoU across all classes.
    • Standard metric for semantic segmentation benchmarks (Cityscapes, ADE20K).
  • AP / mAP for instance masks
    • Borrowed from object detection, but applied to mask IoU instead of box IoU.
  • PQ (Panoptic Quality) for panoptic segmentation
    • Jointly captures detection and segmentation quality.
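The IoU/mIoU definitions above can be sketched directly in NumPy; the label maps below are tiny toy examples with arbitrary class IDs:

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Per-class IoU and mIoU from integer label maps of the same shape."""
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        inter = np.logical_and(p, g).sum()   # TP
        union = np.logical_or(p, g).sum()    # TP + FP + FN
        if union:                            # skip classes absent in both maps
            ious.append(inter / union)
    return ious, float(np.mean(ious))

gt   = np.array([[0, 0, 1, 1],
                 [0, 0, 1, 1]])
pred = np.array([[0, 0, 1, 0],   # one class-1 pixel mislabelled as class 0
                 [0, 0, 1, 1]])
ious, miou = mean_iou(pred, gt, num_classes=2)
print(ious, miou)   # [0.8, 0.75] 0.775
```

The one mislabelled pixel counts as a false positive for class 0 and a false negative for class 1, so it lowers both per-class IoUs.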

4. Classical Unsupervised Segmentation

These methods do not require labeled training data. They operate on the image directly and are very exam-relevant (Otsu, Watershed, Mean Shift).

4.1 Otsu’s Thresholding

Goal: Find a global intensity threshold that best separates foreground and background in a grayscale image.

Key idea:

  • Look at the image histogram of gray levels.
  • Partition it into two classes (below threshold = background, above = foreground).
  • Choose the threshold that:
    • Minimises within-class variance, or
    • Equivalently maximises between-class variance.

Pros:

  • Very fast and fully automatic.
  • Works well when the histogram is bimodal (clear separation).

Limitations:

  • Ignores spatial relationships between pixels – only looks at the histogram.
  • Struggles with:
    • Uneven illumination.
    • Overlapping intensity distributions.
    • Touching/overlapping objects.
  • Can produce noisy masks and outliers in complex scenes.
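A minimal from-scratch sketch of Otsu's criterion (maximising between-class variance over the histogram) on a synthetic bimodal image; in practice you would use a library implementation such as OpenCV's:

```python
import numpy as np

def otsu_threshold(img):
    """Return the threshold t that maximises between-class variance.
    Pixels <= t are 'background', pixels > t are 'foreground'."""
    hist = np.bincount(img.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    best_t, best_var = 0, -1.0
    for t in range(255):
        w0 = prob[:t + 1].sum()        # class probability below threshold
        w1 = 1.0 - w0
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (np.arange(t + 1) * prob[:t + 1]).sum() / w0        # class means
        mu1 = (np.arange(t + 1, 256) * prob[t + 1:]).sum() / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t

# Synthetic bimodal image: dark background around 50, bright blob around 200
rng = np.random.default_rng(0)
img = rng.normal(50, 10, (64, 64))
img[16:48, 16:48] = rng.normal(200, 10, (32, 32))
img = np.clip(img, 0, 255).astype(np.uint8)

t = otsu_threshold(img)
mask = img > t          # t lands somewhere between the two modes
```

Note that the algorithm only ever touches the histogram: the blob's position in the image is irrelevant, which is exactly the "no spatial context" limitation above.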

4.2 Watershed (Often with Distance Transform)

Intuition: Treat the (preprocessed) image like a topographic surface.

Typical pipeline for separating touching objects:

  1. Obtain a coarse binary mask (foreground vs background).
  2. Compute a distance transform:
    • Each foreground pixel’s value = distance to the nearest background pixel.
    • Peaks correspond to object centers.
  3. Identify markers (local maxima / seeds).
  4. Apply the watershed transform:
    • “Flood” from each marker.
    • Where two floods meet, create a boundary (watershed line).

Use cases:

  • Separating touching coins/cells after thresholding.
  • Refining segmentation in binary images.

Pitfalls:

  • Highly sensitive to noise – tends to over-segment if not carefully preprocessed.
  • Usually combined with smoothing, thresholding, or marker-based approaches.

4.3 Mean Shift Segmentation

Mean Shift is a mode-seeking algorithm on a kernel density estimate.

For segmentation:

  • Represent each pixel in a feature space such as:
    • Color (RGB or CIELAB), and optionally
    • Spatial coordinates (x,y).
  • Place a kernel around each point and iteratively shift it to the local density maximum (mode).
  • Pixels converging to the same mode become a cluster/segment.

Advantages:

  • No assumption of Gaussian clusters.
  • Can handle arbitrary-shaped clusters.
  • Works well in low-dimensional feature spaces (color + position).

Disadvantages:

  • Computationally expensive for large images/high-dimensional features.
  • Bandwidth (kernel size) strongly affects results:
    • Small bandwidth → many small segments.
    • Large bandwidth → over-smoothing and loss of detail.
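A tiny from-scratch mean shift with a flat kernel, run on toy "pixels" in RGB space (no spatial coordinates, synthetic data, hand-picked bandwidth); real implementations use optimised library code:

```python
import numpy as np

def mean_shift_modes(points, bandwidth, n_iter=30):
    """Shift every point to the mode of a flat-kernel density estimate."""
    modes = points.astype(float).copy()
    for _ in range(n_iter):
        for i in range(len(modes)):
            # flat kernel: mean of all original points within the bandwidth
            near = np.linalg.norm(points - modes[i], axis=1) < bandwidth
            modes[i] = points[near].mean(axis=0)
    return modes

# Toy "pixels": a reddish cluster and a bluish cluster in RGB space
rng = np.random.default_rng(0)
red  = rng.normal([200, 40, 40], 5, (50, 3))
blue = rng.normal([40, 40, 200], 5, (50, 3))
points = np.vstack([red, blue])

modes = mean_shift_modes(points, bandwidth=60.0)
# Points whose modes coincide (within half a bandwidth) form one segment
labels = (np.linalg.norm(modes - modes[0], axis=1) < 30).astype(int)
print(labels[:50].min(), labels[50:].max())   # 1 0 - two clean clusters
```

Shrinking the bandwidth below the clusters' spread would fragment each colour into many tiny modes, which is exactly the small-bandwidth failure mode listed above.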

5. Deep Learning for Segmentation

Modern semantic/instance/panoptic segmentation relies heavily on CNN-based architectures.

5.1 Fully Convolutional Networks (FCN)

  • Replace fully connected layers with convolutional layers so the network produces per-pixel predictions.
  • Use upsampling / deconvolution (transposed convolution) to bring coarse feature maps back to image resolution.

FCN introduced the core idea: take a classification CNN and turn it into a dense predictor.
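The shape logistics of that idea can be shown in NumPy: a coarse per-class score map (as a convolutionalised classifier would produce at, say, stride 8) is upsampled back to input resolution and argmaxed per pixel. Real FCNs learn the upsampling with transposed convolutions; nearest-neighbour repetition here is only a stand-in:

```python
import numpy as np

# Coarse per-class score map: shape (num_classes, H/8, W/8), random toy values
num_classes, h8, w8, stride = 3, 4, 4, 8
coarse = np.random.default_rng(0).normal(size=(num_classes, h8, w8))

# Upsample each spatial axis by the stride (nearest-neighbour stand-in
# for a learned transposed convolution)
full = coarse.repeat(stride, axis=1).repeat(stride, axis=2)

# Dense prediction: one class ID per pixel
label_map = full.argmax(axis=0)
print(label_map.shape)   # (32, 32) - a label for every input pixel
```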

5.2 U-Net

Classic for medical imaging.

  • Encoder–decoder structure:
    • Encoder: downsampling path (convolutions + pooling).
    • Decoder: upsampling path (deconvolutions / upconvs).
  • Skip connections:
    • Copy features from encoder layers to corresponding decoder layers.
    • Helps recover fine details lost during downsampling.

Strengths:

  • Works well even with relatively small datasets (common in medical imaging).
  • Sharp, detailed masks.

5.3 SegNet / PSPNet / Fast-SCNN (Awareness)

  • SegNet:
    • Similar encoder–decoder.
    • Decoder reuses pooling indices from encoder to upsample more efficiently.
  • PSPNet:
    • Uses pyramid pooling to capture multi-scale context.
    • Improves performance on complex, large-scale scenes.
  • Fast-SCNN:
    • Lightweight architecture for real-time semantic segmentation.
    • Designed for embedded or mobile deployment (speed vs accuracy trade-off).

5.4 Mask R-CNN (Instance Segmentation)

  • Built on top of Faster R-CNN.
  • Adds a third branch (besides bounding box and class):
    • A small FCN that predicts a binary mask for each detected object.
  • Output per detection:
    • Class label.
    • Bounding box.
    • Pixel-wise mask.

This turns a detection model into an instance segmentation model.

6. Datasets We Should Know

Common benchmarks mentioned in lectures:

  • Cityscapes
    • High-quality annotations for urban street scenes.
    • Used extensively for autonomous driving research.
    • 19 semantic classes (road, sidewalk, building, etc.).
  • ADE20K
    • ~150 semantic categories.
    • Diverse scenes (indoor + outdoor).
    • More challenging due to variety and fine-grained labels.
  • CamVid
    • Older dataset of road scenes.
    • Lower resolution and fewer images, but historically important.

Knowing the typical classes and scenes helps when answering exam questions about applications and challenges.

7. Practical Issues & Trade-offs

7.1 Speed vs Accuracy

  • Heavier models (PSPNet, large U-Nets, high-res FCNs):
    • Higher mIoU, slower inference.
  • Lightweight models (Fast-SCNN, smaller backbones):
    • Lower mIoU but higher FPS.
  • Choice depends on:
    • Real-time constraints (e.g., autonomous driving).
    • Hardware budget (GPU vs embedded).

7.2 Class Imbalance

  • Background or “stuff” classes often dominate.
  • Small objects (signs, pedestrians) may be rare.
  • Remedies:
    • Class-balanced or focal losses.
    • Oversampling hard/rare examples.
    • Carefully tuned data augmentation.
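A minimal NumPy sketch of the binary focal loss mentioned above (the probabilities are made-up example values), showing how the focusing term (1 − p_t)^γ down-weights easy, abundant pixels:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss per pixel: -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    p_t = np.where(y == 1, p, 1 - p)              # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)  # optional class weighting
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

# An easy, confidently-correct background pixel vs a hard rare-class pixel
easy = focal_loss(np.array([0.05]), np.array([0]))   # p_t = 0.95
hard = focal_loss(np.array([0.3]),  np.array([1]))   # p_t = 0.3
print(float(easy[0]) < float(hard[0]))   # True - the easy pixel barely counts
```

With γ = 0 and α = 0.5 this reduces to (half) the ordinary cross-entropy, where the sheer number of easy background pixels would dominate the gradient.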

7.3 Domain Shift

  • Training on one dataset (e.g., Cityscapes) and deploying on another (different city, country, sensor) can cause performance drops.
  • Domain adaptation and fine-tuning are often required in practice.

8. Exam-Oriented Takeaways

Be ready to:

  • Define semantic segmentation, instance segmentation, and panoptic segmentation, and explain:
    • Output formats.
    • Use cases.
    • PQ = SQ × RQ for panoptic.
  • Describe Otsu’s method:
    • Operates on the intensity histogram.
    • Chooses the threshold that minimises within-class variance / maximises between-class variance.
    • Fails with non-uniform lighting, overlapping distributions, and no spatial context.
  • Describe watershed segmentation:
    • Optionally combined with distance transform.
    • “Basins flood from markers; where floods meet you get boundaries.”
    • Good for separating touching objects; prone to over-segmentation.
  • Describe Mean Shift segmentation:
    • Clustering in joint color–position space via mode seeking on a kernel density estimate.
    • Bandwidth controls segment size; works best in low-dimensional feature spaces.
  • Outline key deep learning models:
    • FCN, U-Net, SegNet/PSPNet/Fast-SCNN (semantic).
    • Mask R-CNN (instance).
    • Mention speed–accuracy trade-offs and evaluation metrics (Pixel Acc, mIoU, AP, PQ).

Image segmentation is what turns raw images into a dense, structured understanding of the scene – telling us not just what is present, but exactly where each thing begins and ends.
