Object Detection
Object detection goes beyond whole-image classification.
It asks what objects are present and where they are, usually via bounding boxes.
Typical output for a single image:
- A list of bounding boxes
- A class label for each box
- A confidence score per detection
These raw detections are then typically passed through non-maximum suppression (NMS) to remove duplicate boxes.
1. Problem Definition & Outputs
- Classification: what object class (car, person, cat, …)?
- Localization: where is it? (bounding box around the object)
- Detection = classification + localization for all objects in the image.
Each detection is a tuple like:
$(b_x, b_y, b_w, b_h, c, s)$
where $(b_x, b_y, b_w, b_h)$ define the box, $c$ is the class label, and $s$ is the confidence score.
2. Evaluation: IoU, Precision–Recall, AP, mAP
Intersection over Union (IoU)
To compare predicted and ground-truth boxes:
$\text{IoU} = \frac{\text{area}(B_{\text{pred}} \cap B_{\text{gt}})}{\text{area}(B_{\text{pred}} \cup B_{\text{gt}})}$
- IoU ∈ [0, 1]
- Higher IoU → better overlap.
A prediction is usually counted as a True Positive (TP) if:
- The class label is correct, and
- IoU ≥ threshold (e.g., 0.5).
Otherwise, the detection is a False Positive (FP). Missing a ground-truth object gives a False Negative (FN).
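The IoU formula above translates directly into code. A minimal pure-Python sketch, assuming boxes in $(x_{\min}, y_{\min}, x_{\max}, y_{\max})$ format:

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Intersection rectangle (empty if the boxes do not overlap).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (0, 0, 10, 10)))  # identical boxes → 1.0
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```

With an IoU threshold of 0.5, the second pair above would not count as a match.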
Precision & Recall
- Precision = TP / (TP + FP)
  “Of the boxes I predicted, how many were correct?”
- Recall = TP / (TP + FN)
  “Of all real objects, how many did I actually detect?”
By varying the confidence threshold, we get different (precision, recall) pairs and can plot a precision–recall (P–R) curve.
AP & mAP
- Average Precision (AP) = area under the P–R curve for one class.
- Mean AP (mAP) = average AP over all classes.
Common conventions:
- PASCAL VOC: AP at IoU = 0.5 (AP@0.5).
- COCO: AP averaged over IoUs from 0.5 to 0.95 in steps of 0.05 (AP@[.5:.95]) → much stricter.
3. Datasets & Annotation Formats
Popular detection datasets:
- PASCAL VOC – early benchmark; fewer classes, relatively small.
- ImageNet Detection (ILSVRC) – large-scale, many categories.
- MS COCO – “Common Objects in Context”; many everyday categories with complex scenes.
Bounding boxes are stored in different formats:
- PASCAL VOC (XML):
- Boxes as $(x_{\min}, y_{\min}, x_{\max}, y_{\max})$.
- COCO (JSON):
- Boxes typically as $(x, y, w, h)$ with (x, y) = top-left.
Be careful when converting between formats (off-by-one errors, width/height vs max coordinates, etc.).
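A small sketch of the two conversions, assuming float pixel coordinates (integer, 1-indexed VOC files add the off-by-one subtleties mentioned above):

```python
def voc_to_coco(box):
    """(x_min, y_min, x_max, y_max) → (x, y, w, h) with (x, y) = top-left."""
    x_min, y_min, x_max, y_max = box
    return (x_min, y_min, x_max - x_min, y_max - y_min)

def coco_to_voc(box):
    """(x, y, w, h) → (x_min, y_min, x_max, y_max)."""
    x, y, w, h = box
    return (x, y, x + w, y + h)

print(voc_to_coco((10, 20, 50, 80)))  # (10, 20, 40, 60)
print(coco_to_voc((10, 20, 40, 60)))  # (10, 20, 50, 80)
```

Mixing up the third and fourth values (width/height vs max coordinates) is the single most common bug when loading detection annotations.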
4. Classical (Pre-Deep Learning) Detectors
4.1 Sliding Window Template Matching
Pipeline:
- Choose a window size.
- Slide it over the image at many positions and scales.
- For each window, compute a similarity score to a template:
- Sum of Squared Differences (SSD)
- Normalized Cross-Correlation (NCC)
- Threshold scores → detections.
Problems:
- Very slow (many windows).
- Sensitive to scale, rotation, deformation, and background clutter.
- Rarely used alone today.
4.2 Viola–Jones Face Detector
Key ideas:
- Haar-like features:
- Simple rectangular patterns (edges, lines, center–surround).
- Integral image:
- Allows extremely fast feature sums over rectangles.
- Boosting (AdaBoost):
- Combine many weak classifiers into a strong one.
- Cascade:
- Early stages reject easy negatives quickly.
- Later stages focus on harder cases.
Strengths:
- Real-time frontal face detection on CPU.
- Lightweight, classic algorithm still used in embedded systems.
Limitations:
- Sensitive to pose (works best for near-frontal faces).
- Struggles with large appearance changes and complex backgrounds.
4.3 HOG + SVM for Pedestrian Detection
Histograms of Oriented Gradients (HOG) + linear SVM.
Pipeline:
- Compute gradient orientations and magnitudes.
- Build histograms over cells, normalize over blocks (HOG descriptor).
- Train a linear SVM to classify windows as pedestrian vs background.
- Use sliding windows at multiple scales + NMS.
Strengths:
- Robust to moderate illumination changes and small deformations.
- Benchmark method for pedestrian detection pre-deep-learning.
Limitations:
- Heavy at high resolutions and dense scales.
- Less robust to large pose/scale variations compared to modern CNN methods.
5. Deep Learning Detectors
Deep learning detectors use CNN feature extractors and predict class scores + bounding box coordinates.
Two broad families:
- Two-stage: propose candidate regions → classify/refine them.
- One-stage: directly predict boxes and classes on a dense grid or feature map.
5.1 Two-Stage: R-CNN → Fast R-CNN → Faster R-CNN
R-CNN (Region-based CNN)
- Use Selective Search to generate ~2000 region proposals (possible object boxes).
- Warp each region to a fixed size.
- Feed each region into a CNN to extract features.
- Use:
- class-specific SVMs for classification,
- linear regressors for box refinement.
Pros:
- Huge jump in accuracy over classical methods.
Cons:
- Very slow (one forward pass per region).
- Large disk footprint (features for each region stored).
- Multi-stage training pipeline.
Fast R-CNN
Improvements:
- Run CNN once on the whole image to get a feature map.
- For each region proposal:
- Use RoI pooling to crop & resize a region from the feature map.
- Feed pooled features to fully connected layers → class scores + box regression.
Pros:
- Much faster than R-CNN.
- End-to-end training for detection head.
Cons:
- Still relies on external region proposals (Selective Search), which is slow.
Faster R-CNN
Key idea: learn proposals inside the network.
- Add a Region Proposal Network (RPN) on top of shared CNN features:
- Slides a small network over the feature map to predict:
- Objectness scores.
- Box adjustments for anchors (predefined reference boxes).
- Proposed regions are then fed to the detection head (via RoI pooling/align).
Pros:
- End-to-end trainable.
- Very strong detection accuracy.
- Can be combined with Feature Pyramid Networks (FPN) for multi-scale.
Cons:
- Heavier than single-stage detectors for real-time constraints.
5.2 One-Stage: YOLO, SSD, RetinaNet
YOLO (You Only Look Once)
Original idea:
- Divide the image into an $S \times S$ grid.
- Each grid cell predicts:
- $B$ bounding boxes + confidence,
- Class probabilities.
- Combine them into final detections via NMS.
Characteristics:
- Treats detection as a single regression problem from image pixels to box coordinates + class scores.
- Very fast; suitable for real-time detection.
Trade-offs:
- Early versions struggled with small objects and crowded scenes.
- Newer versions (YOLOv3–v10) and variants add multi-scale heads and training tricks that significantly improve accuracy.
SSD & RetinaNet (awareness level)
- SSD (Single Shot MultiBox Detector):
- Predict boxes and classes from multiple feature map scales.
- RetinaNet:
- Built on FPN.
- Uses focal loss to address class imbalance (few objects vs many background locations).
5.3 DETR & Transformers (Awareness Only)
- DETR (DEtection TRansformer):
- Uses a transformer encoder–decoder.
- Predicts a fixed set of boxes directly.
- Bipartite matching during training removes the need for anchors and for NMS.
- More recent variants (e.g. RT-DETR) aim for better speed.
6. Non-Maximum Suppression (NMS)
Problem: detectors often produce many overlapping boxes around the same object.
NMS algorithm:
- Sort all predicted boxes by confidence (high → low).
- Take the highest-confidence box, add it to the final list.
- Remove all remaining boxes with IoU > threshold (e.g., 0.5) w.r.t. this box.
- Repeat with the next highest-confidence box.
Result:
- Only a few strong, non-overlapping detections remain.
- Prevents multiple detections of the same object.
7. Practical Challenges
- Scale variance & tiny objects:
- Harder to detect; may occupy just a few pixels.
- Multi-scale features (FPN), multi-scale training, and custom anchors help.
- Occlusions & crowding:
- People or objects overlap heavily → ambiguous boxes.
- Advanced NMS variants and training with crowded scenes help.
- Domain shift (train vs test mismatch):
- Trained on COCO, deployed in a factory or hospital → different textures, viewpoints, noise.
- Need domain-specific fine-tuning or more diverse training data.
- Speed vs accuracy:
- Two-stage (Faster R-CNN) → better accuracy, slower.
- One-stage (YOLO, SSD, RetinaNet, RT-DETR) → faster, sometimes slightly lower accuracy.
- Choice depends on target hardware & FPS goals (e.g., embedded device vs GPU server).
8. Exam-Oriented Summary
- Define object detection and distinguish it from whole-image classification.
- Explain what bounding boxes, class labels, confidences, and NMS are.
- Define IoU, and describe how it’s used to decide TP/FP.
- Explain precision, recall, AP, and mAP, and know the difference between VOC AP@0.5 and COCO AP@[.5:.95].
- Describe classical detectors:
- Template matching (high-level idea + limitations),
- Viola–Jones (Haar features, integral images, cascade + boosting),
- HOG + SVM (pipeline + strengths/weaknesses).
- Outline the pipelines for:
- R-CNN,
- Fast R-CNN,
- Faster R-CNN (role of RPN + shared CNN + RoI pooling).
- Explain how YOLO formulates detection as a single regression over a grid and why it’s fast.
- Describe what NMS does and why it’s needed.
- Discuss typical challenges (scale, occlusion, domain shift, speed vs accuracy) and how modern detectors address them.
Object detection is one of the core applications of machine perception, bringing together features, CNNs, and smart post-processing to turn raw pixels into a structured list of objects in the scene.