3D Perception
3D perception extends 2D vision into depth and spatial structure.
Instead of just asking “what is in this image?”, we ask: “what does the 3D world look like?”
1. Why 3D Perception?
Goal: reconstruct and recognize 3D scenes from sensors such as cameras, stereo pairs, depth sensors, or LiDAR.
Applications:
- Robotics & autonomous driving – mapping, obstacle avoidance, path planning.
- AR/VR & gaming – real-time environment understanding, body and hand tracking.
- Healthcare – 3D face and body tracking, morphology, medical imaging.
- Industry – inspection, infrastructure digitisation, kiln/flame monitoring.
- Cultural heritage & urban modelling – 3D city models, virtual museums.
- Human motion & safety – fall detection, sports analytics, pose estimation.
2. 3D Representations
Different tasks favour different 3D representations:
- Point clouds
- Set of points $ (x, y, z) $, typically from LiDAR or multi-view stereo.
- Pros: simple, compact, direct sensor output.
- Cons: no explicit surface connectivity; varying density; irregular structure.
- Meshes / surfaces
- Vertices connected into triangles/quads approximating surfaces.
- Pros: good for rendering, accurate geometry.
- Cons: more complex to store and process; mesh quality depends on reconstruction.
- Voxels (volumetric grids)
- 3D occupancy grid (like 3D pixels).
- Pros: regular structure → easy for 3D CNNs.
- Cons: memory heavy; limited resolution for large scenes.
- Depth maps
- 2D image where each pixel = distance to camera.
- Pros: easy to capture and process with 2D CNNs; aligns with RGB.
- Cons: only encodes surfaces visible from that viewpoint; can lose fine 3D structure.
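The depth-map representation connects directly back to 3D: each pixel can be back-projected into a point cloud. A minimal numpy sketch, assuming an ideal pinhole camera with focal length `f` (in pixels) and principal point `(cx, cy)` (all values here are illustrative):

```python
import numpy as np

def depth_to_points(depth, f, cx, cy):
    """Back-project a depth map into an Nx3 point cloud, assuming an
    ideal pinhole camera with focal length f (pixels) and principal
    point (cx, cy); invalid pixels (depth <= 0) are dropped."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / f          # X = (u - cx) * Z / f
    y = (v - cy) * z / f          # Y = (v - cy) * Z / f
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]

# Toy example: a 2x2 depth map with every point 1 unit away
depth = np.ones((2, 2))
pts = depth_to_points(depth, f=1.0, cx=0.5, cy=0.5)
```

Note this only recovers the surface visible from that viewpoint, which is exactly the limitation listed above.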
3. Input Modalities
- Monocular images
- Depth inferred from learned priors and cues (texture, size, perspective).
- Fundamental ambiguity: many 3D scenes project to the same 2D image.
- Stereo cameras
- Two calibrated cameras with baseline $B$.
- Disparity $d$ (horizontal shift between rectified views) gives depth:
$ Z = \frac{fB}{d} $
where $f$ is focal length.
- Larger baseline → better depth accuracy at long range, but harder to match.
- Multi-view setups
- Many overlapping views (structure-from-motion, multi-view stereo).
- Can reconstruct dense point clouds / meshes of scenes.
- Active depth sensors
- LiDAR – time-of-flight laser, sparse but accurate at long range.
- Structured light / active stereo – projected IR pattern; triangulation (e.g., Kinect v1).
- Time-of-Flight (ToF) – phase/time delay of emitted light; dense depth maps.
- Pros: direct depth; Cons: cost, power, sensitivity to environment (e.g., sunlight).
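The stereo relation $Z = fB/d$ is easy to sanity-check numerically. A small sketch with hypothetical rig parameters (f = 700 px, B = 0.54 m, loosely KITTI-like; the numbers are illustrative only):

```python
import numpy as np

def disparity_to_depth(d, f, B):
    """Depth from disparity via Z = f*B/d (f in pixels, B in metres).
    Zero disparity corresponds to a point at infinity."""
    d = np.asarray(d, dtype=float)
    Z = np.full_like(d, np.inf)
    valid = d > 0
    Z[valid] = f * B / d[valid]
    return Z

# Hypothetical rig: f = 700 px, baseline B = 0.54 m
Z = disparity_to_depth(np.array([54.0, 27.0, 0.0]), f=700.0, B=0.54)
# Halving the disparity doubles the depth: 54 px -> 7 m, 27 px -> 14 m
```

The inverse relation also shows why depth error grows quadratically with distance: at large $Z$, a one-pixel disparity error corresponds to a large depth change.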
4. From Stereo to Depth: Disparity & PSMNet
4.1 Classical Stereo Matching
Rectification: warp images so epipolar lines are horizontal → search only along rows.
Then:
- Compute matching cost per disparity (e.g., SAD/SSD/NCC on windows).
- Aggregate costs with:
- Local windows (box, bilateral).
- Semi-global / global methods (SGM, graph cuts).
- Select disparity with lowest cost.
- Post-process (median filter, left–right consistency check, hole filling).
Failure modes:
- Textureless regions.
- Repetitive patterns.
- Specular surfaces, reflections.
- Occlusions and slanted surfaces.
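The classical pipeline above can be sketched in its simplest form: winner-take-all block matching with a SAD cost and no aggregation or post-processing (which is exactly why it fails on the cases just listed). A minimal, deliberately naive numpy version:

```python
import numpy as np

def sad_disparity(left, right, max_disp, w=1):
    """Naive winner-take-all block matching on rectified grayscale
    images: for each pixel, pick the disparity whose (2w+1)x(2w+1)
    SAD cost is lowest, searching left along the same row."""
    h, wd = left.shape
    disp = np.zeros((h, wd), dtype=int)
    L = np.pad(left, w, mode='edge').astype(float)
    R = np.pad(right, w, mode='edge').astype(float)
    for y in range(h):
        for x in range(wd):
            best, best_d = np.inf, 0
            patch_l = L[y:y + 2*w + 1, x:x + 2*w + 1]
            for d in range(min(max_disp + 1, x + 1)):
                patch_r = R[y:y + 2*w + 1, x - d:x - d + 2*w + 1]
                cost = np.abs(patch_l - patch_r).sum()   # SAD matching cost
                if cost < best:
                    best, best_d = cost, d
            disp[y, x] = best_d
    return disp

# Synthetic check: the right view is the left view shifted left by 2 px,
# so interior pixels should all receive disparity 2.
left = np.arange(50, dtype=float).reshape(5, 10)
right = np.roll(left, -2, axis=1)
disp = sad_disparity(left, right, max_disp=4)
```

On a textureless region every candidate window has near-identical cost, which is where this estimator (and its smarter descendants) breaks down.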
4.2 Deep Stereo: PSMNet (Pyramid Stereo Matching Network)
Modern approach that learns stereo matching end-to-end:
- CNN feature extractor with shared weights for left/right images.
- Spatial Pyramid Pooling (SPP) for multi-scale context.
- Build a 4D cost volume: $H \times W \times D \times \text{features}$.
- Use a stacked hourglass 3D CNN to regularise and smooth the cost volume.
- Regress final disparity map → convert to depth.
Pros:
- State-of-the-art accuracy on benchmarks (e.g., KITTI).
- Robust to challenging textures/structures.
Cons:
- High memory & compute cost (GPU-heavy).
- Deployment trade-off between speed and accuracy.
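The 4D cost-volume step is the distinctive part of PSMNet. A numpy sketch of the concatenation-based construction, in channels-first layout (the paper's H × W × D × features tensor, transposed to the (2C, D, H, W) layout common in deep-learning frameworks; feature maps here are placeholders):

```python
import numpy as np

def concat_cost_volume(feat_l, feat_r, max_disp):
    """Build a PSMNet-style cost volume from left/right feature maps of
    shape (C, H, W): for each candidate disparity d, concatenate the
    left features with the right features shifted right by d, giving a
    4D tensor of shape (2C, max_disp, H, W) for a 3D CNN to regularise."""
    C, H, W = feat_l.shape
    vol = np.zeros((2 * C, max_disp, H, W), dtype=feat_l.dtype)
    for d in range(max_disp):
        vol[:C, d, :, d:] = feat_l[:, :, d:]
        vol[C:, d, :, d:] = feat_r[:, :, :W - d]
    return vol

# Toy feature maps with C=1, H=2, W=4
feat_l = np.arange(8, dtype=float).reshape(1, 2, 4)
feat_r = feat_l + 100.0
vol = concat_cost_volume(feat_l, feat_r, max_disp=2)
```

The D axis is why memory grows quickly: the volume is `max_disp` times larger than the feature maps, before the 3D convolutions even start.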
5. 3D Reconstruction & Geometry (Szeliski Ch. 13 Highlights)
5.1 Camera Models & Epipolar Geometry
- Pinhole model:
$ \mathbf{x} = K [R \mid t] \mathbf{X} $
where:
- $K$: camera intrinsics (focal length, principal point, skew),
- $R, t$: extrinsics (rotation and translation),
- $\mathbf{X}$: 3D point, $\mathbf{x}$: image point.
- Distortion: radial & tangential; undistort before geometry.
- Epipolar geometry:
- Essential matrix $E$ for calibrated cameras, fundamental matrix $F$ for uncalibrated.
- Epipolar constraint: $ \mathbf{x}'^T F \mathbf{x} = 0 $
- Reduces matching from 2D → 1D search along epipolar lines.
- Rectification: warp stereo pair so corresponding points lie on the same row, turning 2D search into 1D disparity search.
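For calibrated cameras ($K = I$ in normalised coordinates) the constraint uses the essential matrix $E = [t]_\times R$. A numpy sketch verifying $\mathbf{x}_2^T E \mathbf{x}_1 = 0$ on a synthetic two-view setup (the pose and point are hypothetical):

```python
import numpy as np

def skew(t):
    """3x3 cross-product matrix so that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0, -t[2], t[1]],
                     [t[2], 0, -t[0]],
                     [-t[1], t[0], 0]])

# Synthetic second camera: translated along x, rotated slightly about y
theta = 0.1
R = np.array([[np.cos(theta), 0, np.sin(theta)],
              [0, 1, 0],
              [-np.sin(theta), 0, np.cos(theta)]])
t = np.array([1.0, 0.0, 0.0])
E = skew(t) @ R                 # essential matrix E = [t]_x R

X1 = np.array([0.3, -0.2, 4.0]) # a 3D point in camera-1 coordinates
x1 = X1 / X1[2]                 # normalised image point in view 1
X2 = R @ X1 + t                 # same point in camera-2 coordinates
x2 = X2 / X2[2]

residual = x2 @ E @ x1          # epipolar constraint: ~0 up to float error
```

In practice the residual is used the other way round: points that violate the constraint are rejected as outlier matches (e.g. inside RANSAC).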
5.2 Multi-View Stereo (MVS) & Structure from Motion (SfM)
- MVS:
- Triangulate points seen in many calibrated views.
- Enforce photometric consistency across views.
- Techniques: plane-sweep, PatchMatch, depth map fusion (TSDF, Poisson).
- SfM (unordered photo collections):
- Detect & match features (SIFT/ORB).
- Robust two-view geometry with RANSAC.
- Incrementally add cameras (PnP) and triangulate new points.
- Bundle Adjustment:
- Jointly refine camera poses and 3D points by minimising reprojection error.
- Monocular reconstructions are up to an unknown scale; fix with additional cues (known distances, IMU, etc.).
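The quantity bundle adjustment minimises is easy to write down. A sketch of the reprojection error for one camera, with hypothetical intrinsics (f = 500 px, principal point (320, 240)):

```python
import numpy as np

def reprojection_error(K, R, t, X, x_obs):
    """Sum of squared pixel errors between the projections of 3D points
    X (Nx3) and their observed image points x_obs (Nx2). Bundle
    adjustment minimises this over all camera poses and points."""
    Xc = X @ R.T + t                       # world -> camera coordinates
    x_hom = Xc @ K.T                       # apply intrinsics
    x_proj = x_hom[:, :2] / x_hom[:, 2:3]  # perspective division
    return np.sum((x_proj - x_obs) ** 2)

# Hypothetical camera at the origin
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
R, t = np.eye(3), np.zeros(3)
X = np.array([[0.0, 0.0, 2.0]])       # point 2 m in front of the camera
x_obs = np.array([[320.0, 240.0]])    # projects to the principal point
err = reprojection_error(K, R, t, X, x_obs)   # -> 0.0
```

Real solvers minimise this nonlinear least-squares objective with Levenberg–Marquardt over all cameras and points jointly.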
5.3 VO, SLAM & Active Sensing
- Visual Odometry (VO): estimate camera motion over time from images (feature-based or direct).
- SLAM: VO + map + loop closure to reduce drift.
- Visual–inertial fusion: combining cameras with IMU improves robustness and scale.
Active depth sensing recap:
- Structured light, ToF, LiDAR with different trade-offs in range, density, accuracy, and environmental robustness.
6. 3D Recognition: VoxNet & PointNet
After we have 3D data, we also want to classify or segment objects in 3D.
6.1 VoxNet
- Early 3D deep-learning approach:
- Convert point clouds into voxel grids.
- Apply 3D CNN for object recognition.
- Pros:
- Simple extension of 2D CNN ideas to 3D.
- Cons:
- Voxelisation introduces:
- Quantisation errors (loss of detail).
- Huge memory usage at high resolution.
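The voxelisation step VoxNet depends on is a few lines of numpy; the toy example below also shows the quantisation loss (duplicate points collapse into one occupied voxel). Parameter names are illustrative:

```python
import numpy as np

def voxelize(points, res, bounds):
    """Convert an Nx3 point cloud into a binary occupancy grid of shape
    (res, res, res) over the cubic region [lo, hi)^3 -- the kind of
    input a 3D CNN like VoxNet consumes."""
    lo, hi = bounds
    idx = np.floor((points - lo) / (hi - lo) * res).astype(int)
    inside = np.all((idx >= 0) & (idx < res), axis=1)  # drop out-of-range points
    grid = np.zeros((res, res, res), dtype=bool)
    grid[tuple(idx[inside].T)] = True
    return grid

pts = np.array([[0.1, 0.1, 0.1],
                [0.9, 0.9, 0.9],
                [0.9, 0.9, 0.9]])    # duplicates collapse into one voxel
grid = voxelize(pts, res=4, bounds=(0.0, 1.0))
```

Memory is the other cost: doubling `res` multiplies the grid size by 8, which is why high-resolution voxel grids quickly become impractical.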
6.2 PointNet
Designed to work directly on point clouds:
- Input: unordered set of points $(x, y, z)$ (optionally with features like intensity, RGB).
- Architecture:
- Shared MLPs applied to each point independently.
- Symmetric aggregation (max pooling) to extract a global descriptor invariant to point ordering.
- Outputs:
- Classification (global feature).
- Segmentation (per-point labels with shared/global features).
Advantages:
- No voxelisation → avoids resolution and memory issues.
- Invariant to permutation of input points.
Extensions like PointNet++ add local neighbourhoods to capture fine-grained structure.
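The permutation-invariance argument can be demonstrated in a few lines: apply a shared per-point map, then max-pool over points. A toy sketch with random (untrained) weights standing in for the learned MLP:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 16))    # toy shared-MLP weights (untrained)
W2 = rng.normal(size=(16, 32))

def global_feature(points):
    """PointNet-style encoder sketch: a shared two-layer MLP is applied
    to every point independently, then max pooling over the point
    dimension yields a global descriptor that ignores point order."""
    h = np.maximum(points @ W1, 0)   # per-point layer 1 (ReLU)
    h = np.maximum(h @ W2, 0)        # per-point layer 2 (ReLU)
    return h.max(axis=0)             # symmetric aggregation: max pool

cloud = rng.normal(size=(100, 3))
shuffled = cloud[rng.permutation(100)]
# global_feature(cloud) == global_feature(shuffled): order does not matter
```

Because max is symmetric, any reordering of the rows of `cloud` produces the identical descriptor; that is the whole trick.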
7. SADRNet: 3D Face Alignment from a Single Image
SADRNet (Self-Aligned Dual Face Regression Network) is an example of 3D reconstruction from a single RGB image, focused on faces.
Task: estimate dense 3D facial landmarks/shape from monocular input under occlusion and pose changes.
Key ideas:
- Dual branches:
- Pose-dependent branch: captures head orientation.
- Pose-independent branch: captures underlying face shape.
- Self-alignment module:
- Learns to align outputs of the two branches via a similarity transform.
- Occlusion-aware attention:
- Learns to focus on visible regions.
- Robust to self-occlusion and external occluders (hands, hair, objects).
Outcome:
- Real-time (≈70 FPS) 3D face alignment with strong robustness to occlusion and large pose variations.
8. Point Clouds, Registration & Fusion
Once multiple depth maps or scans are available, we often need to align and merge them:
- Normal & curvature estimation from local neighbourhoods.
- Plane / cylinder detection via RANSAC.
- ICP (Iterative Closest Point):
- Iteratively find correspondences between two point clouds.
- Estimate the rigid transform that minimises point-to-point or point-to-plane distances.
- Needs good initialisation, otherwise can get stuck in local minima.
- Volumetric fusion:
- Integrate many depth maps into a TSDF volume.
- Extract surfaces via Marching Cubes to get a mesh.
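The inner step of ICP, estimating the best rigid transform for a fixed set of correspondences, has a closed-form SVD solution (Kabsch/Procrustes). A numpy sketch with a synthetic sanity check:

```python
import numpy as np

def best_rigid_transform(P, Q):
    """Least-squares rigid transform (R, t) mapping Nx3 points P onto
    their known correspondences Q, via SVD -- the step ICP repeats
    after re-estimating correspondences each iteration."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)                  # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # no reflection
    R = Vt.T @ S @ U.T
    t = cq - R @ cp
    return R, t

# Sanity check: recover a known rotation about z plus a translation
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0],
                   [np.sin(theta),  np.cos(theta), 0],
                   [0, 0, 1]])
P = np.random.default_rng(1).normal(size=(50, 3))
Q = P @ R_true.T + np.array([1.0, -2.0, 0.5])
R_est, t_est = best_rigid_transform(P, Q)
```

The hard part of ICP is not this solve but finding the correspondences; with a bad initial alignment, nearest-neighbour matching locks onto the wrong pairs and the iteration converges to a local minimum.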
Other shape-from-X cues (awareness level):
- Photometric stereo, shape-from-shading, shape-from-defocus, shape-from-polarisation—each with its own assumptions and failure modes.
9. Practical Pitfalls
- Scale ambiguity in monocular setups.
- Rolling shutter: violates simple camera models when camera or scene moves quickly.
- Dynamic scenes: break static-world assumptions used in SfM/MVS.
- Calibration quality:
- Poor calibration ruins depth estimates.
- Use good patterns, many viewpoints, and refine with bundle adjustment.
10. Exam-Oriented Summary
- Explain why 3D perception is important and list key applications.
- Describe and compare point clouds, meshes, voxels, and depth maps.
- Derive and use the stereo depth formula $Z = fB/d$; discuss how baseline affects depth accuracy.
- Explain the main steps and challenges in stereo matching and what PSMNet adds.
- Outline core elements of epipolar geometry, rectification, SfM, and bundle adjustment.
- Describe how VO and SLAM extend multi-view geometry to continuous sequences.
- Explain what VoxNet and PointNet do and why PointNet avoids voxelisation.
- Summarise SADRNet’s purpose (3D face alignment) and its dual-branch + self-alignment design.
- Discuss common pitfalls: calibration errors, dynamic scenes, scale ambiguity, and sensor trade-offs.
3D perception ties together geometry, learning, and sensing so machines can move from flat images to a structured 3D understanding of the world.