3D perception extends 2D vision into depth and spatial structure.
Instead of just asking “what is in this image?”, we ask: “what does the 3D world look like?”

1. Why 3D Perception?

Goal: reconstruct and recognize 3D scenes from sensors such as cameras, stereo pairs, depth sensors, or LiDAR.

Applications:

  • Robotics & autonomous driving – mapping, obstacle avoidance, path planning.
  • AR/VR & gaming – real-time environment understanding, body and hand tracking.
  • Healthcare – 3D face and body tracking, morphology, medical imaging.
  • Industry – inspection, infrastructure digitisation, kiln/flame monitoring.
  • Cultural heritage & urban modelling – 3D city models, virtual museums.
  • Human motion & safety – fall detection, sports analytics, pose estimation.

2. 3D Representations

Different tasks favour different 3D representations:

  • Point clouds
    • Set of points $ (x, y, z) $, typically from LiDAR or multi-view stereo.
    • Pros: simple, compact, direct sensor output.
    • Cons: no explicit surface connectivity; varying density; irregular structure.
  • Meshes / surfaces
    • Vertices connected into triangles/quads approximating surfaces.
    • Pros: good for rendering, accurate geometry.
    • Cons: more complex to store and process; mesh quality depends on reconstruction.
  • Voxels (volumetric grids)
    • 3D occupancy grid (like 3D pixels).
    • Pros: regular structure → easy for 3D CNNs.
    • Cons: memory heavy; limited resolution for large scenes.
  • Depth maps
    • 2D image where each pixel = distance to camera.
    • Pros: easy to capture and process with 2D CNNs; aligns with RGB.
    • Cons: only encodes surfaces visible from that viewpoint; can lose fine 3D structure.

3. Input Modalities

  • Monocular images
    • Depth inferred from learned priors and cues (texture, size, perspective).
    • Fundamental ambiguity: many 3D scenes project to the same 2D image.
  • Stereo cameras
    • Two calibrated cameras with baseline $B$.
    • Disparity $d$ (horizontal shift between rectified views) gives depth:
      $ Z = \frac{fB}{d} $
      where $f$ is focal length.
    • Larger baseline → better depth accuracy at long range, but harder to match.
  • Multi-view setups
    • Many overlapping views (structure-from-motion, multi-view stereo).
    • Can reconstruct dense point clouds / meshes of scenes.
  • Active depth sensors
    • LiDAR – time-of-flight laser, sparse but accurate at long range.
    • Structured light / active stereo – projected IR pattern; triangulation (e.g., Kinect v1).
    • Time-of-Flight (ToF) – phase/time delay of emitted light; dense depth maps.
    • Pros: direct depth; Cons: cost, power, sensitivity to environment (e.g., sunlight).
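
The stereo depth relation $Z = fB/d$ above is easy to sanity-check numerically. A minimal NumPy sketch, with illustrative values (a 700 px focal length and a 0.54 m baseline, roughly KITTI-like):

```python
import numpy as np

def disparity_to_depth(disparity, f_px, baseline_m, eps=1e-6):
    """Convert a disparity map (pixels) to metric depth via Z = f*B/d.

    f_px: focal length in pixels; baseline_m: stereo baseline in metres.
    Non-positive disparities (no match) are mapped to infinity.
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.full_like(disparity, np.inf)
    valid = disparity > eps
    depth[valid] = (f_px * baseline_m) / disparity[valid]
    return depth

# A disparity of 37.8 px gives 700 * 0.54 / 37.8 = 10 m; zero disparity -> inf
depth = disparity_to_depth(np.array([[37.8, 0.0]]), f_px=700.0, baseline_m=0.54)
```

Note how depth falls off as $1/d$: at long range a one-pixel disparity error causes a much larger depth error, which is why a larger baseline helps.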

4. From Stereo to Depth: Disparity & PSMNet

4.1 Classical Stereo Matching

Rectification: warp images so epipolar lines are horizontal → search only along rows.

Then:

  1. Compute matching cost per disparity (e.g., SAD/SSD/NCC on windows).
  2. Aggregate costs with:
    • Local windows (box, bilateral).
    • Semi-global / global methods (SGM, graph cuts).
  3. Select disparity with lowest cost.
  4. Post-process (median filter, left–right consistency check, hole filling).
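
Steps 1–3 can be sketched as naive window-based block matching with a SAD cost (no aggregation or post-processing, purely illustrative):

```python
import numpy as np

def block_match_sad(left, right, max_disp, win=3):
    """Naive block matching on rectified grayscale images.

    For each pixel, compares a (2*win+1)^2 window in the left image against
    windows shifted left by d in the right image, and picks the disparity
    with the lowest sum of absolute differences (SAD).
    """
    H, W = left.shape
    disp = np.zeros((H, W), dtype=np.int32)
    for y in range(win, H - win):
        for x in range(win + max_disp, W - win):
            patch_l = left[y-win:y+win+1, x-win:x+win+1].astype(np.float64)
            costs = []
            for d in range(max_disp + 1):
                patch_r = right[y-win:y+win+1, x-d-win:x-d+win+1].astype(np.float64)
                costs.append(np.abs(patch_l - patch_r).sum())
            disp[y, x] = int(np.argmin(costs))  # winner-take-all selection
    return disp
```

This works on well-textured regions but fails in exactly the ways listed below: textureless windows make every cost similar, and repetitive patterns create multiple near-identical minima.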

Failure modes:

  • Textureless regions.
  • Repetitive patterns.
  • Specular surfaces, reflections.
  • Occlusions and slanted surfaces.

4.2 Deep Stereo: PSMNet (Pyramid Stereo Matching Network)

Modern approach that learns stereo matching end-to-end:

  • CNN feature extractor with shared weights for left/right images.
  • Spatial Pyramid Pooling (SPP) for multi-scale context.
  • Build a 4D cost volume: $H \times W \times D \times \text{features}$.
  • Use a stacked hourglass 3D CNN to regularise and smooth the cost volume.
  • Regress final disparity map → convert to depth.

Pros:

  • State-of-the-art accuracy on benchmarks (e.g., KITTI).
  • Robust to challenging textures/structures.

Cons:

  • High memory & compute cost (GPU-heavy).
  • Deployment trade-off between speed and accuracy.
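
To make the cost-volume idea concrete, here is a simplified NumPy sketch that builds a per-disparity volume from left/right feature maps using an absolute-difference cost. PSMNet itself instead concatenates features into a 4D volume and lets a learned 3D CNN play the role of the matching cost and regularisation:

```python
import numpy as np

def build_cost_volume(feat_l, feat_r, max_disp):
    """Build a (D, H, W) cost volume from left/right feature maps (C, H, W).

    Slice d holds the feature distance between the left map and the right
    map shifted by d pixels; columns with no valid shift stay at infinity.
    """
    C, H, W = feat_l.shape
    volume = np.full((max_disp + 1, H, W), np.inf)
    for d in range(max_disp + 1):
        # left pixel x corresponds to right pixel x - d
        diff = feat_l[:, :, d:] - feat_r[:, :, :W - d]
        volume[d, :, d:] = np.abs(diff).sum(axis=0)
    return volume
```

Taking `volume.argmin(axis=0)` recovers a winner-take-all disparity map; PSMNet instead regresses a soft, sub-pixel disparity from the regularised volume.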

5. 3D Reconstruction & Geometry (Szeliski Ch. 13 Highlights)

5.1 Camera Models & Epipolar Geometry

  • Pinhole model:
    $ \mathbf{x} = K [R \mid t] \mathbf{X} $
    where:
    • $K$: camera intrinsics (focal length, principal point, skew),
    • $R, t$: extrinsics (rotation and translation),
    • $\mathbf{X}$: 3D point, $\mathbf{x}$: image point.
  • Distortion: radial & tangential; undistort before geometry.

  • Epipolar geometry:
    • Essential matrix $E$ for calibrated cameras, fundamental matrix $F$ for uncalibrated.
    • Epipolar constraint: $ \mathbf{x}'^T F \mathbf{x} = 0 $
    • Reduces matching from 2D → 1D search along epipolar lines.
  • Rectification: warp stereo pair so corresponding points lie on the same row, turning 2D search into 1D disparity search.
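
The pinhole projection $\mathbf{x} = K [R \mid t] \mathbf{X}$ is a one-liner in NumPy. A minimal sketch with an assumed 500 px focal length and a 640×480 principal point:

```python
import numpy as np

def project(K, R, t, X):
    """Project 3D points X (N, 3) to pixels via x = K [R | t] X."""
    Xc = X @ R.T + t              # world -> camera coordinates
    x = Xc @ K.T                  # apply intrinsics
    return x[:, :2] / x[:, 2:3]   # perspective divide

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
# A point on the optical axis lands on the principal point (320, 240)
uv = project(K, R, t, np.array([[0.0, 0.0, 2.0]]))
```

Note the perspective divide is what destroys depth: any point along the same ray projects to the same pixel, which is the monocular ambiguity from Section 3.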

5.2 Multi-View Stereo (MVS) & Structure from Motion (SfM)

  • MVS:
    • Triangulate points seen in many calibrated views.
    • Enforce photometric consistency across views.
    • Techniques: plane-sweep, PatchMatch, depth map fusion (TSDF, Poisson).
  • SfM (unordered photo collections):
    1. Detect & match features (SIFT/ORB).
    2. Robust two-view geometry with RANSAC.
    3. Incrementally add cameras (PnP) and triangulate new points.
    4. Bundle Adjustment:
      • Jointly refine camera poses and 3D points by minimising reprojection error.
  • Monocular reconstructions are recovered only up to an unknown global scale; fix it with additional cues (known distances, IMU, etc.).
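
The triangulation step in both MVS and SfM can be sketched with the classic linear (DLT) method: each observation contributes two rows to a homogeneous system whose null space is the 3D point. A minimal two-view sketch, assuming known projection matrices $P = K[R \mid t]$:

```python
import numpy as np

def triangulate(P1, P2, uv1, uv2):
    """Linear (DLT) triangulation of one point from two 3x4 projection matrices.

    Each view contributes two rows of the form u*P[2] - P[0] and v*P[2] - P[1];
    the SVD null vector of the stacked system is the homogeneous 3D point.
    """
    A = np.stack([
        uv1[0] * P1[2] - P1[0],
        uv1[1] * P1[2] - P1[1],
        uv2[0] * P2[2] - P2[0],
        uv2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]  # dehomogenise
```

With noisy detections the system has no exact null vector, which is why bundle adjustment then refines points and poses jointly against the reprojection error.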

5.3 VO, SLAM & Active Sensing

  • Visual Odometry (VO): estimate camera motion over time from images (feature-based or direct).
  • SLAM: VO + map + loop closure to reduce drift.
  • Visual–inertial fusion: combining cameras with IMU improves robustness and scale.

Active depth sensing recap:

  • Structured light, ToF, LiDAR with different trade-offs in range, density, accuracy, and environmental robustness.

6. 3D Recognition: VoxNet & PointNet

After we have 3D data, we also want to classify or segment objects in 3D.

6.1 VoxNet

  • Early 3D deep-learning approach:
    • Convert point clouds into voxel grids.
    • Apply 3D CNN for object recognition.
  • Pros:
    • Simple extension of 2D CNN ideas to 3D.
  • Cons:
    • Voxelisation introduces:
      • Quantisation errors (loss of detail).
      • Huge memory usage at high resolution.
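
The voxelisation step, and where its costs come from, can be seen in a few lines of NumPy: every point inside a voxel collapses into one occupied cell (quantisation loss), and the grid itself is $O(n^3)$ in resolution (memory cost):

```python
import numpy as np

def voxelize(points, grid_size=32):
    """Quantise an (N, 3) point cloud into a binary occupancy grid."""
    mins = points.min(axis=0)
    extent = np.ptp(points, axis=0).max() + 1e-9  # uniform scale keeps aspect ratio
    idx = ((points - mins) / extent * grid_size).astype(np.int64)
    idx = np.clip(idx, 0, grid_size - 1)
    grid = np.zeros((grid_size,) * 3, dtype=bool)  # O(grid_size^3) memory
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True   # many points -> one cell
    return grid
```

Doubling `grid_size` multiplies memory by 8, which is why VoxNet-style models were typically limited to coarse grids such as 32³.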

6.2 PointNet

Designed to work directly on point clouds:

  • Input: unordered set of points $(x, y, z)$ (optionally with features like intensity, RGB).
  • Architecture:
    • Shared MLPs applied to each point independently.
    • Symmetric aggregation (max pooling) to extract a global descriptor invariant to point ordering.
  • Outputs:
    • Classification (global feature).
    • Segmentation (per-point labels with shared/global features).

Advantages:

  • No voxelisation → avoids resolution and memory issues.
  • Invariant to permutation of input points.

Extensions like PointNet++ add local neighbourhoods to capture fine-grained structure.
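
The permutation invariance falls directly out of the architecture: per-point shared weights followed by a symmetric max. A toy NumPy sketch with random (untrained, purely illustrative) weights:

```python
import numpy as np

def pointnet_global_feature(points, W1, W2):
    """Toy PointNet encoder: shared per-point MLP, then max pooling.

    points: (N, 3); W1: (3, H); W2: (H, F). The same weights are applied to
    every point, and the max over points yields an order-invariant global
    descriptor.
    """
    h = np.maximum(points @ W1, 0)   # shared layer with ReLU, applied per point
    f = np.maximum(h @ W2, 0)        # second shared layer
    return f.max(axis=0)             # symmetric aggregation over the point axis

rng = np.random.default_rng(0)
pts = rng.random((128, 3))
W1, W2 = rng.standard_normal((3, 16)), rng.standard_normal((16, 8))
g1 = pointnet_global_feature(pts, W1, W2)
g2 = pointnet_global_feature(pts[::-1], W1, W2)  # same points, reversed order
# g1 and g2 are identical: max pooling ignores point ordering
```

Any symmetric function (sum, mean, max) would give invariance; PointNet uses max pooling, which tends to select a sparse set of critical points that define the shape.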

7. SADRNet: 3D Face Alignment from a Single Image

SADRNet (Self-Aligned Dual Face Regression Network) is an example of 3D reconstruction from a single RGB image, focused on faces.

Task: estimate dense 3D facial landmarks/shape from monocular input under occlusion and pose changes.

Key ideas:

  • Dual branches:
    • Pose-dependent branch: captures head orientation.
    • Pose-independent branch: captures underlying face shape.
  • Self-alignment module:
    • Learns to align outputs of the two branches via a similarity transform.
  • Occlusion-aware attention:
    • Learns to focus on visible regions.
    • Robust to self-occlusion and external occluders (hands, hair, objects).

Outcome:

  • Real-time (≈70 FPS) 3D face alignment with strong robustness to occlusion and large pose variations.

8. Point Clouds, Registration & Fusion

Once multiple depth maps or scans are available, we often need to align and merge them:

  • Normal & curvature estimation from local neighbourhoods.
  • Plane / cylinder detection via RANSAC.
  • ICP (Iterative Closest Point):
    • Iteratively find correspondences between two point clouds.
    • Estimate the rigid transform that minimises point-to-point or point-to-plane distances.
    • Needs good initialisation, otherwise can get stuck in local minima.
  • Volumetric fusion:
    • Integrate many depth maps into a TSDF volume.
    • Extract surfaces via Marching Cubes to get a mesh.
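
The transform-estimation step inside each ICP iteration has a closed-form SVD solution (the Kabsch/Procrustes alignment). A minimal sketch, assuming correspondences $P_i \leftrightarrow Q_i$ are already fixed; full ICP alternates this with nearest-neighbour matching:

```python
import numpy as np

def best_rigid_transform(P, Q):
    """Least-squares rigid transform (R, t) mapping P (N, 3) onto Q (N, 3).

    Centres both clouds, takes the SVD of the cross-covariance, and corrects
    the sign of the determinant to rule out reflections.
    """
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)              # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T)) # +1 rotation, -1 reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cq - R @ cp
    return R, t
```

Because only this inner step is globally optimal, the overall ICP loop still depends on the initial correspondences, hence the need for a good initialisation noted above.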

Other shape-from-X cues (awareness level):

  • Photometric stereo, shape-from-shading, shape-from-defocus, shape-from-polarisation—each with its own assumptions and failure modes.

9. Practical Pitfalls

  • Scale ambiguity in monocular setups.
  • Rolling shutter: violates simple camera models when camera or scene moves quickly.
  • Dynamic scenes: break static-world assumptions used in SfM/MVS.
  • Calibration quality:
    • Poor calibration ruins depth estimates.
    • Use good patterns, many viewpoints, and refine with bundle adjustment.

10. Exam-Oriented Summary

  • Explain why 3D perception is important and list key applications.
  • Describe and compare point clouds, meshes, voxels, and depth maps.
  • Derive and use the stereo depth formula $Z = fB/d$; discuss how baseline affects depth accuracy.
  • Explain the main steps and challenges in stereo matching and what PSMNet adds.
  • Outline core elements of epipolar geometry, rectification, SfM, and bundle adjustment.
  • Describe how VO and SLAM extend multi-view geometry to continuous sequences.
  • Explain what VoxNet and PointNet do and why PointNet avoids voxelisation.
  • Summarise SADRNet’s purpose (3D face alignment) and its dual-branch + self-alignment design.
  • Discuss common pitfalls: calibration errors, dynamic scenes, scale ambiguity, and sensor trade-offs.

3D perception ties together geometry, learning, and sensing so machines can move from flat images to a structured 3D understanding of the world.
