3D perception extends 2D vision into depth and spatial structure.
Instead of just asking “what is in this image?”, we ask: “what does the 3D world look like?”

1. Why 3D Perception?

Goal: reconstruct and recognize 3D scenes from sensors such as cameras, stereo pairs, depth sensors, or LiDAR.

Applications:

  • Robotics & autonomous driving – mapping, obstacle avoidance, path planning.
  • AR/VR & gaming – real-time environment understanding, body and hand tracking.
  • Healthcare – 3D face and body tracking, morphology, medical imaging.
  • Industry – inspection, infrastructure digitisation, kiln/flame monitoring.
  • Cultural heritage & urban modelling – 3D city models, virtual museums.
  • Human motion & safety – fall detection, sports analytics, pose estimation.

2. 3D Representations

Different tasks favour different 3D representations:

  • Point clouds
    • Set of points $ (x, y, z) $, typically from LiDAR or multi-view stereo.
    • Pros: simple, compact, direct sensor output.
    • Cons: no explicit surface connectivity; varying density; irregular structure.
  • Meshes / surfaces
    • Vertices connected into triangles/quads approximating surfaces.
    • Pros: good for rendering, accurate geometry.
    • Cons: more complex to store and process; mesh quality depends on reconstruction.
  • Voxels (volumetric grids)
    • 3D occupancy grid (like 3D pixels).
    • Pros: regular structure → easy for 3D CNNs.
    • Cons: memory heavy; limited resolution for large scenes.
  • Depth maps
    • 2D image where each pixel = distance to camera.
    • Pros: easy to capture and process with 2D CNNs; aligns with RGB.
    • Cons: only encodes surfaces visible from that viewpoint; can lose fine 3D structure.

3. Input Modalities

  • Monocular images
    • Depth inferred from learned priors and cues (texture, size, perspective).
    • Fundamental ambiguity: many 3D scenes project to the same 2D image.
  • Stereo cameras
    • Two calibrated cameras with baseline $B$.
    • Disparity $d$ (horizontal shift between rectified views) gives depth:
      $ Z = \frac{fB}{d} $
      where $f$ is focal length.
    • Larger baseline → better depth accuracy at long range, but harder to match.
  • Multi-view setups
    • Many overlapping views (structure-from-motion, multi-view stereo).
    • Can reconstruct dense point clouds / meshes of scenes.
  • Active depth sensors
    • LiDAR – time-of-flight laser, sparse but accurate at long range.
    • Structured light / active stereo – projected IR pattern; triangulation (e.g., Kinect v1).
    • Time-of-Flight (ToF) – phase/time delay of emitted light; dense depth maps.
    • Pros: direct depth; Cons: cost, power, sensitivity to environment (e.g., sunlight).
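
The stereo depth relation $Z = fB/d$ above is easy to sanity-check numerically. A minimal NumPy sketch, with illustrative values (a 700 px focal length and a 0.54 m baseline, roughly KITTI-like):

```python
import numpy as np

def disparity_to_depth(disparity, f_px, baseline_m, eps=1e-6):
    """Convert a disparity map (pixels) to metric depth via Z = f*B/d.

    f_px: focal length in pixels; baseline_m: stereo baseline in metres.
    Non-positive disparities (no match) are mapped to infinity.
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.full_like(disparity, np.inf)
    valid = disparity > eps
    depth[valid] = (f_px * baseline_m) / disparity[valid]
    return depth

# A disparity of 37.8 px gives 700 * 0.54 / 37.8 = 10 m; zero disparity -> inf
depth = disparity_to_depth(np.array([[37.8, 0.0]]), f_px=700.0, baseline_m=0.54)
```

Note how depth falls off as $1/d$: at long range a one-pixel disparity error causes a much larger depth error, which is why a larger baseline helps.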

4. From Stereo to Depth: Disparity & PSMNet

4.1 Classical Stereo Matching

Rectification: warp images so epipolar lines are horizontal → search only along rows.

Then:

  1. Compute matching cost per disparity (e.g., SAD/SSD/NCC on windows).
  2. Aggregate costs with:
    • Local windows (box, bilateral).
    • Semi-global / global methods (SGM, graph cuts).
  3. Select disparity with lowest cost.
  4. Post-process (median filter, left–right consistency check, hole filling).
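
Steps 1–3 can be sketched as naive window-based block matching with a SAD cost (no aggregation or post-processing, purely illustrative):

```python
import numpy as np

def block_match_sad(left, right, max_disp, win=3):
    """Naive block matching on rectified grayscale images.

    For each pixel, compares a (2*win+1)^2 window in the left image against
    windows shifted left by d in the right image, and picks the disparity
    with the lowest sum of absolute differences (SAD).
    """
    H, W = left.shape
    disp = np.zeros((H, W), dtype=np.int32)
    for y in range(win, H - win):
        for x in range(win + max_disp, W - win):
            patch_l = left[y-win:y+win+1, x-win:x+win+1].astype(np.float64)
            costs = []
            for d in range(max_disp + 1):
                patch_r = right[y-win:y+win+1, x-d-win:x-d+win+1].astype(np.float64)
                costs.append(np.abs(patch_l - patch_r).sum())
            disp[y, x] = int(np.argmin(costs))  # winner-take-all selection
    return disp
```

This works on well-textured regions but fails in exactly the ways listed below: textureless windows make every cost similar, and repetitive patterns create multiple near-identical minima.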

Failure modes:

  • Textureless regions.
  • Repetitive patterns.
  • Specular surfaces, reflections.
  • Occlusions and slanted surfaces.

4.2 Deep Stereo: PSMNet (Pyramid Stereo Matching Network)

Modern approach that learns stereo matching end-to-end:

  • CNN feature extractor with shared weights for left/right images.
  • Spatial Pyramid Pooling (SPP) for multi-scale context.
  • Build a 4D cost volume: $H \times W \times D \times \text{features}$.
  • Use a stacked hourglass 3D CNN to regularise and smooth the cost volume.
  • Regress final disparity map → convert to depth.

Pros:

  • State-of-the-art accuracy on benchmarks (e.g., KITTI).
  • Robust to challenging textures/structures.

Cons:

  • High memory & compute cost (GPU-heavy).
  • Deployment trade-off between speed and accuracy.
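
To make the cost-volume idea concrete, here is a simplified NumPy sketch that builds a per-disparity volume from left/right feature maps using an absolute-difference cost. PSMNet itself instead concatenates features into a 4D volume and lets a learned 3D CNN play the role of the matching cost and regularisation:

```python
import numpy as np

def build_cost_volume(feat_l, feat_r, max_disp):
    """Build a (D, H, W) cost volume from left/right feature maps (C, H, W).

    Slice d holds the feature distance between the left map and the right
    map shifted by d pixels; columns with no valid shift stay at infinity.
    """
    C, H, W = feat_l.shape
    volume = np.full((max_disp + 1, H, W), np.inf)
    for d in range(max_disp + 1):
        # left pixel x corresponds to right pixel x - d
        diff = feat_l[:, :, d:] - feat_r[:, :, :W - d]
        volume[d, :, d:] = np.abs(diff).sum(axis=0)
    return volume
```

Taking `volume.argmin(axis=0)` recovers a winner-take-all disparity map; PSMNet instead regresses a soft, sub-pixel disparity from the regularised volume.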

5. 3D Reconstruction & Geometry (Szeliski Ch. 13 Highlights)

5.1 Camera Models & Epipolar Geometry

  • Pinhole model:
    $ \mathbf{x} = K [R \mid t] \mathbf{X} $
    where:
    • $K$: camera intrinsics (focal length, principal point, skew),
    • $R, t$: extrinsics (rotation and translation),
    • $\mathbf{X}$: 3D point, $\mathbf{x}$: image point.
  • Distortion: radial & tangential; undistort before geometry.

  • Epipolar geometry:
    • Essential matrix $E$ for calibrated cameras, fundamental matrix $F$ for uncalibrated.
    • Epipolar constraint: $ \mathbf{x}'^T F \mathbf{x} = 0 $
    • Reduces matching from 2D → 1D search along epipolar lines.
  • Rectification: warp stereo pair so corresponding points lie on the same row, turning 2D search into 1D disparity search.
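
The pinhole projection $\mathbf{x} = K [R \mid t] \mathbf{X}$ is a one-liner in NumPy. A minimal sketch with an assumed 500 px focal length and a 640×480 principal point:

```python
import numpy as np

def project(K, R, t, X):
    """Project 3D points X (N, 3) to pixels via x = K [R | t] X."""
    Xc = X @ R.T + t              # world -> camera coordinates
    x = Xc @ K.T                  # apply intrinsics
    return x[:, :2] / x[:, 2:3]   # perspective divide

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
# A point on the optical axis lands on the principal point (320, 240)
uv = project(K, R, t, np.array([[0.0, 0.0, 2.0]]))
```

Note the perspective divide is what destroys depth: any point along the same ray projects to the same pixel, which is the monocular ambiguity from Section 3.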

5.2 Multi-View Stereo (MVS) & Structure from Motion (SfM)

  • MVS:
    • Triangulate points seen in many calibrated views.
    • Enforce photometric consistency across views.
    • Techniques: plane-sweep, PatchMatch, depth map fusion (TSDF, Poisson).
  • SfM (unordered photo collections):
    1. Detect & match features (SIFT/ORB).
    2. Robust two-view geometry with RANSAC.
    3. Incrementally add cameras (PnP) and triangulate new points.
    4. Bundle Adjustment:
      • Jointly refine camera poses and 3D points by minimising reprojection error.
  • Monocular reconstructions are recovered only up to an unknown global scale; fix it with additional cues (known distances, IMU, etc.).
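
The triangulation step in both MVS and SfM can be sketched with the classic linear (DLT) method: each observation contributes two rows to a homogeneous system whose null space is the 3D point. A minimal two-view sketch, assuming known projection matrices $P = K[R \mid t]$:

```python
import numpy as np

def triangulate(P1, P2, uv1, uv2):
    """Linear (DLT) triangulation of one point from two 3x4 projection matrices.

    Each view contributes two rows of the form u*P[2] - P[0] and v*P[2] - P[1];
    the SVD null vector of the stacked system is the homogeneous 3D point.
    """
    A = np.stack([
        uv1[0] * P1[2] - P1[0],
        uv1[1] * P1[2] - P1[1],
        uv2[0] * P2[2] - P2[0],
        uv2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]  # dehomogenise
```

With noisy detections the system has no exact null vector, which is why bundle adjustment then refines points and poses jointly against the reprojection error.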

5.3 VO, SLAM & Active Sensing

  • Visual Odometry (VO): estimate camera motion over time from images (feature-based or direct).
  • SLAM: VO + map + loop closure to reduce drift.
  • Visual–inertial fusion: combining cameras with IMU improves robustness and scale.

Active depth sensing recap:

  • Structured light, ToF, LiDAR with different trade-offs in range, density, accuracy, and environmental robustness.

6. 3D Recognition: VoxNet & PointNet

After we have 3D data, we also want to classify or segment objects in 3D.

6.1 VoxNet

  • Early 3D deep-learning approach:
    • Convert point clouds into voxel grids.
    • Apply 3D CNN for object recognition.
  • Pros:
    • Simple extension of 2D CNN ideas to 3D.
  • Cons:
    • Voxelisation introduces:
      • Quantisation errors (loss of detail).
      • Huge memory usage at high resolution.
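
The voxelisation step, and where its costs come from, can be seen in a few lines of NumPy: every point inside a voxel collapses into one occupied cell (quantisation loss), and the grid itself is $O(n^3)$ in resolution (memory cost):

```python
import numpy as np

def voxelize(points, grid_size=32):
    """Quantise an (N, 3) point cloud into a binary occupancy grid."""
    mins = points.min(axis=0)
    extent = np.ptp(points, axis=0).max() + 1e-9  # uniform scale keeps aspect ratio
    idx = ((points - mins) / extent * grid_size).astype(np.int64)
    idx = np.clip(idx, 0, grid_size - 1)
    grid = np.zeros((grid_size,) * 3, dtype=bool)  # O(grid_size^3) memory
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True   # many points -> one cell
    return grid
```

Doubling `grid_size` multiplies memory by 8, which is why VoxNet-style models were typically limited to coarse grids such as 32³.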

6.2 PointNet

Designed to work directly on point clouds:

  • Input: unordered set of points $(x, y, z)$ (optionally with features like intensity, RGB).
  • Architecture:
    • Shared MLPs applied to each point independently.
    • Symmetric aggregation (max pooling) to extract a global descriptor invariant to point ordering.
  • Outputs:
    • Classification (global feature).
    • Segmentation (per-point labels with shared/global features).

Advantages:

  • No voxelisation → avoids resolution and memory issues.
  • Invariant to permutation of input points.

Extensions like PointNet++ add local neighbourhoods to capture fine-grained structure.
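
The permutation invariance falls directly out of the architecture: per-point shared weights followed by a symmetric max. A toy NumPy sketch with random (untrained, purely illustrative) weights:

```python
import numpy as np

def pointnet_global_feature(points, W1, W2):
    """Toy PointNet encoder: shared per-point MLP, then max pooling.

    points: (N, 3); W1: (3, H); W2: (H, F). The same weights are applied to
    every point, and the max over points yields an order-invariant global
    descriptor.
    """
    h = np.maximum(points @ W1, 0)   # shared layer with ReLU, applied per point
    f = np.maximum(h @ W2, 0)        # second shared layer
    return f.max(axis=0)             # symmetric aggregation over the point axis

rng = np.random.default_rng(0)
pts = rng.random((128, 3))
W1, W2 = rng.standard_normal((3, 16)), rng.standard_normal((16, 8))
g1 = pointnet_global_feature(pts, W1, W2)
g2 = pointnet_global_feature(pts[::-1], W1, W2)  # same points, reversed order
# g1 and g2 are identical: max pooling ignores point ordering
```

Any symmetric function (sum, mean, max) would give invariance; PointNet uses max pooling, which tends to select a sparse set of critical points that define the shape.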

7. SADRNet: 3D Face Alignment from a Single Image

SADRNet (Self-Aligned Dual Face Regression Network) is an example of 3D reconstruction from a single RGB image, focused on faces.

Task: estimate dense 3D facial landmarks/shape from monocular input under occlusion and pose changes.

Key ideas:

  • Dual branches:
    • Pose-dependent branch: captures head orientation.
    • Pose-independent branch: captures underlying face shape.
  • Self-alignment module:
    • Learns to align outputs of the two branches via a similarity transform.
  • Occlusion-aware attention:
    • Learns to focus on visible regions.
    • Robust to self-occlusion and external occluders (hands, hair, objects).

Outcome:

  • Real-time (≈70 FPS) 3D face alignment with strong robustness to occlusion and large pose variations.

8. Point Clouds, Registration & Fusion

Once multiple depth maps or scans are available, we often need to align and merge them:

  • Normal & curvature estimation from local neighbourhoods.
  • Plane / cylinder detection via RANSAC.
  • ICP (Iterative Closest Point):
    • Iteratively find correspondences between two point clouds.
    • Estimate the rigid transform that minimises point-to-point or point-to-plane distances.
    • Needs good initialisation, otherwise can get stuck in local minima.
  • Volumetric fusion:
    • Integrate many depth maps into a TSDF volume.
    • Extract surfaces via Marching Cubes to get a mesh.
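
The transform-estimation step inside each ICP iteration has a closed-form SVD solution (the Kabsch/Procrustes alignment). A minimal sketch, assuming correspondences $P_i \leftrightarrow Q_i$ are already fixed; full ICP alternates this with nearest-neighbour matching:

```python
import numpy as np

def best_rigid_transform(P, Q):
    """Least-squares rigid transform (R, t) mapping P (N, 3) onto Q (N, 3).

    Centres both clouds, takes the SVD of the cross-covariance, and corrects
    the sign of the determinant to rule out reflections.
    """
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)              # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T)) # +1 rotation, -1 reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cq - R @ cp
    return R, t
```

Because only this inner step is globally optimal, the overall ICP loop still depends on the initial correspondences, hence the need for a good initialisation noted above.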

Other shape-from-X cues (awareness level):

  • Photometric stereo, shape-from-shading, shape-from-defocus, shape-from-polarisation—each with its own assumptions and failure modes.

9. Practical Pitfalls

  • Scale ambiguity in monocular setups.
  • Rolling shutter: violates simple camera models when camera or scene moves quickly.
  • Dynamic scenes: break static-world assumptions used in SfM/MVS.
  • Calibration quality:
    • Poor calibration ruins depth estimates.
    • Use good patterns, many viewpoints, and refine with bundle adjustment.

10. Exam-Oriented Summary

  • Explain why 3D perception is important and list key applications.
  • Describe and compare point clouds, meshes, voxels, and depth maps.
  • Derive and use the stereo depth formula $Z = fB/d$; discuss how baseline affects depth accuracy.
  • Explain the main steps and challenges in stereo matching and what PSMNet adds.
  • Outline core elements of epipolar geometry, rectification, SfM, and bundle adjustment.
  • Describe how VO and SLAM extend multi-view geometry to continuous sequences.
  • Explain what VoxNet and PointNet do and why PointNet avoids voxelisation.
  • Summarise SADRNet’s purpose (3D face alignment) and its dual-branch + self-alignment design.
  • Discuss common pitfalls: calibration errors, dynamic scenes, scale ambiguity, and sensor trade-offs.

3D perception ties together geometry, learning, and sensing so machines can move from flat images to a structured 3D understanding of the world.
