Motion and Tracking

September 29, 2025 6 minute read

Tracking extends perception from single images to video and dynamic scenes.
Instead of just asking “what is this?”, we now ask: “where is it over time?”

1. Dynamic World & Why Tracking Matters

Objects move, sensors are noisy, and we only see snapshots at discrete times.

Applications:

Autonomous driving and autopilot systems.
Sports analytics (player and ball tracking).
Surveillance and CCTV.
Radar and air-traffic tracking.
Vehicle and pedestrian tracking in smart cities.

Challenges:

Sensor noise and blur.
Occlusions (objects blocking each other).
Changing appearance (scale, pose, lighting, partial view).
Real-time constraints.

2. World Modelling: States and Observations

We describe motion with two types of variables:

State variables (hidden):
- True properties we care about (e.g., position, velocity, acceleration).
- Example: in 1D, state $ \mathbf{x} = [p, v]^T $.
Observation variables (measured):
- Noisy sensor readings from cameras, radar, LiDAR, etc.
- Example: measured position $ z $ with noise.

We also decide on a sampling rate (frame rate):

Higher rate → smoother trajectories, more compute.
Lower rate → cheaper but might miss fast motion.

2.1 State & Observation Equations

We model dynamics as:

State (prediction) equation
$ \mathbf{x}_k = F_k \mathbf{x}_{k-1} + B_k \mathbf{u}_k + \mathbf{w}_k $
- $F_k$: state-transition matrix (physics of motion).
- $\mathbf{u}_k$: control input (e.g., steering, acceleration from controller).
- $B_k$: control-input matrix.
- $\mathbf{w}_k$: process noise (uncertainty).
Observation (measurement) equation
$ \mathbf{z}_k = H_k \mathbf{x}_k + \mathbf{v}_k $
- $H_k$: observation matrix (how state maps to measurements).
- $\mathbf{v}_k$: measurement noise.
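
As a worked instance, consider the 1D state $ \mathbf{x} = [p, v]^T $ under a constant-velocity model with no control input. The sampling interval $\Delta t$ and the choice of measuring position only are assumptions made for illustration:

\[
F = \begin{bmatrix} 1 & \Delta t \\ 0 & 1 \end{bmatrix}, \qquad
H = \begin{bmatrix} 1 & 0 \end{bmatrix}
\]

\[
\begin{bmatrix} p_k \\ v_k \end{bmatrix}
= \begin{bmatrix} p_{k-1} + \Delta t\, v_{k-1} \\ v_{k-1} \end{bmatrix} + \mathbf{w}_k,
\qquad
z_k = p_k + \mathbf{v}_k
\]

Here $\mathbf{v}_k$ is the measurement noise, not the velocity $v_k$. For example, with $\Delta t = 0.1$ s, $p_{k-1} = 2$ m and $v_{k-1} = 5$ m/s, the noise-free prediction is $p_k = 2.5$ m and $v_k = 5$ m/s.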

3. Kalman Filter (KF)

The Kalman Filter is a recursive estimator for linear systems with Gaussian noise.

It combines:

A prediction based on the motion model.
An update using the latest measurement.

3.1 Assumptions

Linear dynamics (state and observation equations).
Gaussian noise (process and measurement).
Known noise covariances.

3.2 Kalman Filter Steps

Let:

$\hat{\mathbf{x}}_{k|k-1}$: Predicted state at time $k$ given measurements up to $k-1$.
$\hat{\mathbf{x}}_{k|k}$: Updated (posterior) state after receiving the measurement at $k$.
$P_{k|k-1},\; P_{k|k}$: Predicted and updated covariance matrices.
$Q_k,\; R_k$: Covariances of the process noise $\mathbf{w}_k$ and the measurement noise $\mathbf{v}_k$.
$K_k$: Kalman gain.

Prediction:

\[\hat{\mathbf{x}}_{k|k-1} = F_k\, \hat{\mathbf{x}}_{k-1|k-1} + B_k\, \mathbf{u}_k\]

\[P_{k|k-1} = F_k\, P_{k-1|k-1}\, F_k^{T} + Q_k\]

Update:

Innovation (residual):
$ \mathbf{y}_k = \mathbf{z}_k - H_k \hat{\mathbf{x}}_{k|k-1} $

Innovation covariance:
$ S_k = H_k P_{k|k-1} H_k^T + R_k $

Kalman gain:
$ K_k = P_{k|k-1} H_k^T S_k^{-1} $

State update:
$ \hat{\mathbf{x}}_{k|k} = \hat{\mathbf{x}}_{k|k-1} + K_k \mathbf{y}_k $

Covariance update:
$ P_{k|k} = (I - K_k H_k) P_{k|k-1} $

Intuition:

If measurements are very noisy (large $R_k$), the filter trusts the prediction more.
If the model is uncertain (large $Q_k$), the filter trusts measurements more.
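
The prediction and update equations can be sketched in a few lines of plain Python. This is a minimal illustration, not a production filter: it assumes a 1D constant-velocity model, a scalar position measurement (so $S_k$ is a number and needs no matrix inverse), no control input, and hand-picked values for the noise levels `q` and `r` and for the initial covariance. All function names are made up for this example.

```python
def matmul(A, B):
    """Multiply two matrices stored as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def kf_predict(x, P, F, Q):
    """Prediction step, assuming no control input (B_k u_k = 0)."""
    x_pred = matmul(F, x)
    FPFt = matmul(matmul(F, P), transpose(F))
    P_pred = [[a + q for a, q in zip(ra, rq)] for ra, rq in zip(FPFt, Q)]
    return x_pred, P_pred


def kf_update(x_pred, P_pred, z, H, R):
    """Update step for a scalar measurement z (H is 1 x n, R is a float)."""
    y = z - matmul(H, x_pred)[0][0]              # innovation
    PHt = matmul(P_pred, transpose(H))           # n x 1
    S = matmul(H, PHt)[0][0] + R                 # innovation covariance (scalar)
    K = [[row[0] / S] for row in PHt]            # Kalman gain, n x 1
    x_new = [[xi[0] + ki[0] * y] for xi, ki in zip(x_pred, K)]
    KH = matmul(K, H)
    n = len(P_pred)
    I_minus_KH = [[(1.0 if i == j else 0.0) - KH[i][j] for j in range(n)]
                  for i in range(n)]
    P_new = matmul(I_minus_KH, P_pred)
    return x_new, P_new, K


def track_constant_velocity(measurements, dt=1.0, q=0.01, r=1.0):
    """Filter a list of position measurements; returns the final state and covariance."""
    F = [[1.0, dt], [0.0, 1.0]]
    H = [[1.0, 0.0]]
    Q = [[q, 0.0], [0.0, q]]
    x = [[0.0], [0.0]]                           # initial guess: at rest at the origin
    P = [[100.0, 0.0], [0.0, 100.0]]             # large initial uncertainty
    for z in measurements:
        x, P = kf_predict(x, P, F, Q)
        x, P, _ = kf_update(x, P, z, H, r)
    return x, P
```

Calling `track_constant_velocity` on positions that grow by a fixed amount per step lets the velocity estimate settle near that amount, even though velocity is never measured directly. Calling `kf_update` twice with the same prediction but different `R` shows the intuition above: the larger `R` is, the smaller the gain on the position component.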

3.3 Nonlinear Extensions

EKF (Extended Kalman Filter):
- Linearizes nonlinear functions via a first-order Taylor expansion (Jacobians) around the current estimate.
UKF (Unscented Kalman Filter):
- Uses carefully chosen sigma points to propagate mean and covariance through nonlinear functions more accurately.
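
To make the EKF linearization concrete, suppose the models are $ \mathbf{x}_k = f(\mathbf{x}_{k-1}, \mathbf{u}_k) + \mathbf{w}_k $ and $ \mathbf{z}_k = h(\mathbf{x}_k) + \mathbf{v}_k $. The function names $f$ and $h$ are notation introduced here for illustration. The matrices in the Kalman equations are then replaced by Jacobians evaluated at the latest estimates, while the state prediction and innovation use the nonlinear functions directly:

\[
F_k = \left.\frac{\partial f}{\partial \mathbf{x}}\right|_{\hat{\mathbf{x}}_{k-1|k-1},\,\mathbf{u}_k},
\qquad
H_k = \left.\frac{\partial h}{\partial \mathbf{x}}\right|_{\hat{\mathbf{x}}_{k|k-1}}
\]

\[
\hat{\mathbf{x}}_{k|k-1} = f(\hat{\mathbf{x}}_{k-1|k-1}, \mathbf{u}_k),
\qquad
\mathbf{y}_k = \mathbf{z}_k - h(\hat{\mathbf{x}}_{k|k-1})
\]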

4. Tracking Approaches: Model-Based vs Appearance-Based

4.1 Model-Based (Top-Down)

Use physical motion models:
- Constant velocity, constant acceleration, bicycle model, etc.
Tools: Kalman Filter, EKF/UKF, particle filters (beyond the scope of this lecture).
Typical for:
- Radar tracking, navigation, robotics.
- Situations where dynamics are well understood.

Pros:

Robust to missing detections for a few frames.
Can predict where the object should be even with partial measurements.

Cons:

Ignores appearance; might drift if the object suddenly changes direction/speed in ways the model doesn’t capture.

4.2 Appearance-Based (Bottom-Up)

Track based on visual cues:
- Color histograms, texture, edges, gradients, corners, etc.
Does not require an explicit physics model.
Good when:
- Appearance is distinctive and stable.
- The underlying motion is too complex to model simply.

Kernel-based tracking (Mean-Shift / CamShift) sits here.

5. Kernel-Based Tracking: Mean-Shift & CamShift

Kernel-based trackers treat the target as a probability distribution over some feature space (often color).

5.1 Representation

Select a region of interest (ROI) containing the target in the first frame.
Compute a target histogram in some feature space:
- e.g., color (HSV) histogram.
Define a kernel to weight pixels near the center more.

Similarity between target and candidate region can be measured with the Bhattacharyya coefficient, which compares histograms.
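
A minimal sketch of the Bhattacharyya coefficient for two histograms with the same bins. The function names are chosen for this example, and the histograms are normalized inside the function so raw pixel counts can be passed directly. The result is 1 for identical distributions and 0 for distributions with no overlapping bins.

```python
import math


def normalize(hist):
    """Scale a histogram of non-negative counts so its bins sum to 1."""
    total = float(sum(hist))
    if total == 0.0:
        raise ValueError("histogram is empty")
    return [h / total for h in hist]


def bhattacharyya_coefficient(target_hist, candidate_hist):
    """Sum over bins of sqrt(p_u * q_u) for the two normalized histograms."""
    if len(target_hist) != len(candidate_hist):
        raise ValueError("histograms must have the same number of bins")
    p = normalize(target_hist)
    q = normalize(candidate_hist)
    return sum(math.sqrt(pu * qu) for pu, qu in zip(p, q))
```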

5.2 Mean-Shift Algorithm

Goal: Given the previous location of the target, find the new location that maximizes similarity.

Steps (per frame):

Start from previous window position.
Compute histogram of the current window.
Compute weights for each pixel based on how well its feature contributes to matching the target histogram.
Compute the mean-shift vector: the offset from the current window center to the weighted average of the pixel positions.
Move the window center by this vector.
Iterate until convergence (shift becomes small).
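
The per-frame loop can be sketched as follows. This simplified version assumes the per-pixel weights have already been computed into a 2D grid (for example, by scoring each pixel against the target histogram), and it uses a flat kernel, so every pixel inside the window counts equally. The function name, the default iteration limit, and the convergence threshold are illustrative choices.

```python
import math


def mean_shift(weights, cx, cy, half_w, half_h, max_iter=20, eps=0.5):
    """Move a fixed-size window toward the weighted centroid of `weights`.

    weights: 2D list indexed as weights[y][x], non-negative values.
    (cx, cy): starting window center; half_w / half_h: half window size in pixels.
    """
    rows, cols = len(weights), len(weights[0])
    for _ in range(max_iter):
        # Clip the current window to the image bounds.
        x0 = max(0, int(round(cx)) - half_w)
        x1 = min(cols - 1, int(round(cx)) + half_w)
        y0 = max(0, int(round(cy)) - half_h)
        y1 = min(rows - 1, int(round(cy)) + half_h)

        total = sum_x = sum_y = 0.0
        for y in range(y0, y1 + 1):
            for x in range(x0, x1 + 1):
                w = weights[y][x]
                total += w
                sum_x += w * x
                sum_y += w * y
        if total == 0.0:
            break  # no target evidence in the window; keep the last position

        # Weighted mean of pixel positions; the shift is its offset from the center.
        new_cx, new_cy = sum_x / total, sum_y / total
        shift = math.hypot(new_cx - cx, new_cy - cy)
        cx, cy = new_cx, new_cy
        if shift < eps:
            break  # converged
    return cx, cy
```

If the target has moved so far that the starting window contains no weight at all, the loop stops immediately, which is the "limited movement between frames" assumption in practice.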

Characteristics:

Purely appearance-based.
Assumes limited movement between frames.
Window size is fixed.

5.3 CamShift (Continuously Adaptive Mean Shift)

Extension of Mean-Shift:

After convergence, use the moments of the distribution inside the window to:
- Update window size (scale), driven by the zeroth moment (total weight), and
- Potentially orientation, from the second-order moments.

That means CamShift can:

Adapt to changes in object size.
Work better for objects moving toward/away from the camera.
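
As an illustration of how moments carry size and orientation, the sketch below computes the zeroth moment, the centroid, and the principal-axis angle and spreads from the second-order central moments of a weight grid. It is not the exact CamShift resizing rule of any particular library. The function name and the returned dictionary keys are made up for this example, and the angle is in image coordinates (x to the right, y downward).

```python
import math


def blob_shape(weights):
    """Describe a 2D weight distribution (weights[y][x]) by its image moments."""
    m00 = m10 = m01 = 0.0
    for y, row in enumerate(weights):
        for x, w in enumerate(row):
            m00 += w          # zeroth moment: total weight ("area")
            m10 += w * x      # first moments give the centroid
            m01 += w * y
    if m00 == 0.0:
        raise ValueError("no target evidence in the window")
    cx, cy = m10 / m00, m01 / m00

    # Normalized second-order central moments (a 2x2 covariance [[a, b], [b, c]]).
    a = b = c = 0.0
    for y, row in enumerate(weights):
        for x, w in enumerate(row):
            a += w * (x - cx) ** 2
            b += w * (x - cx) * (y - cy)
            c += w * (y - cy) ** 2
    a, b, c = a / m00, b / m00, c / m00

    # Orientation of the major axis and standard deviations along both axes.
    angle = 0.5 * math.atan2(2.0 * b, a - c)
    spread = math.sqrt((a - c) ** 2 + 4.0 * b * b)
    major_std = math.sqrt(max(0.0, (a + c + spread) / 2.0))
    minor_std = math.sqrt(max(0.0, (a + c - spread) / 2.0))
    return {"m00": m00, "centre": (cx, cy), "angle": angle,
            "major_std": major_std, "minor_std": minor_std}
```

An object that approaches the camera covers more pixels, so `m00` and both spreads grow. This is the signal a CamShift-style tracker can use to enlarge its window for the next frame.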

Used in:

Real-time face tracking.
Simple single-object tracking scenarios.

Limitations (Mean-Shift & CamShift):

Struggle under heavy occlusion.
Sensitive to significant appearance change (lighting, pose).
Typically track one object at a time.

6. Multiple-Object Tracking (MOT)

Goal: Track many objects and maintain consistent IDs over time.

Typical pipeline:

Detection:
- Use an object detector per frame (e.g., YOLO, Faster R-CNN).
Prediction:
- For each existing track, use a Kalman filter to predict the new position.
Data association:
- Match predicted tracks to current detections.
- Common tools:
  - IoU-based cost between predicted boxes and detections.
  - Hungarian algorithm to find the optimal assignment.
Track management:
- Create new tracks for unmatched detections.
- Mark tracks as “lost” or delete them after several missed frames.
- Optionally do re-identification (ReID) when an object reappears.
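
The data-association step can be sketched as below. Boxes are assumed to be `(x1, y1, x2, y2)` tuples. To stay dependency-free, the optimal assignment is found by brute force over all permutations rather than by the Hungarian algorithm. It returns an assignment with the same total score, but it is only practical for a handful of objects. The threshold `iou_min` and the function names are illustrative.

```python
from itertools import permutations


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def associate(tracks, detections, iou_min=0.3):
    """Match predicted track boxes to detections by maximizing total IoU.

    Returns (matches, unmatched_track_indices, unmatched_detection_indices).
    Brute-force stand-in for the Hungarian algorithm; small inputs only.
    """
    n_t, n_d = len(tracks), len(detections)
    sim = [[iou(t, d) for d in detections] for t in tracks]

    # Enumerate every one-to-one assignment of the smaller set into the larger.
    if n_t <= n_d:
        candidates = ([(i, p[i]) for i in range(n_t)]
                      for p in permutations(range(n_d), n_t))
    else:
        candidates = ([(p[j], j) for j in range(n_d)]
                      for p in permutations(range(n_t), n_d))

    best_pairs, best_score = [], -1.0
    for pairs in candidates:
        score = sum(sim[i][j] for i, j in pairs)
        if score > best_score:
            best_pairs, best_score = pairs, score

    # Reject assigned pairs whose overlap is too low to be the same object.
    matches = [(i, j) for i, j in best_pairs if sim[i][j] >= iou_min]
    matched_t = {i for i, _ in matches}
    matched_d = {j for _, j in matches}
    unmatched_t = [i for i in range(n_t) if i not in matched_t]
    unmatched_d = [j for j in range(n_d) if j not in matched_d]
    return matches, unmatched_t, unmatched_d
```

The three returned lists map directly onto track management: matches update existing tracks, unmatched detections become candidates for new tracks, and unmatched tracks accumulate missed frames.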

6.1 Metrics: MOTA and MOTP

MOTA (Multiple Object Tracking Accuracy):
- Penalizes:
  - False positives (extra tracks).
  - False negatives (missed objects).
  - ID switches (when track ID changes for the same object).
- Higher is better.
MOTP (Multiple Object Tracking Precision):
- Measures localization accuracy:
  - How well the predicted positions/boxes overlap the ground truth (typically using IoU).
- Higher is better when measured as overlap (IoU); if MOTP is instead reported as a distance error, lower is better.
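
In the commonly used CLEAR MOT formulation, the two metrics are computed over all frames $t$ as shown below. The symbol names are chosen here for readability: $\mathrm{FN}_t$, $\mathrm{FP}_t$ and $\mathrm{IDSW}_t$ are the misses, false positives and ID switches in frame $t$, $\mathrm{GT}_t$ is the number of ground-truth objects, $c_t$ is the number of matched pairs, and $d_{t,i}$ is the overlap (or distance) of match $i$ with its ground truth.

\[
\mathrm{MOTA} = 1 - \frac{\sum_t \left( \mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t \right)}{\sum_t \mathrm{GT}_t},
\qquad
\mathrm{MOTP} = \frac{\sum_{t,i} d_{t,i}}{\sum_t c_t}
\]

For example, with 100 ground-truth objects in total, 10 misses, 5 false positives and 2 ID switches, $\mathrm{MOTA} = 1 - 17/100 = 0.83$. MOTA is at most 1 and becomes negative when the tracker makes more errors than there are ground-truth objects.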

7. SORT and DeepSORT

7.1 SORT (Simple Online and Realtime Tracking)

Minimalist yet strong baseline for MOT.

Components:

Detector (e.g., YOLO).
Per-track Kalman Filter (often on bounding box parameters).
Hungarian algorithm for data association, with a cost derived from IoU (e.g., $1 - \text{IoU}$, so that higher overlap means lower cost).

Workflow:

Detect objects in current frame.
Predict track locations with Kalman Filter.
Build a cost matrix from the IoU (e.g., $1 - \text{IoU}$) between predicted tracks and detections.
Hungarian algorithm → optimal assignment.
Update tracks with matched detections; create/delete tracks as needed.
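
The last step of the workflow, track creation and deletion, can be sketched on its own. The code assumes the association step has already produced index lists of matches, unmatched tracks and unmatched detections. A real SORT tracker would feed each matched detection into that track's Kalman filter, whereas this sketch simply overwrites the stored box. The `Track` and `TrackManager` names and the `max_age` parameter are illustrative.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Track:
    track_id: int
    box: tuple
    misses: int = 0  # consecutive frames without a matched detection


class TrackManager:
    def __init__(self, max_age=3):
        self.max_age = max_age   # delete a track after more than this many misses
        self.tracks = []
        self._ids = count(1)     # monotonically increasing track IDs

    def step(self, detections, matches, unmatched_tracks, unmatched_detections):
        """Apply one frame of association results to the track list."""
        # Matched tracks take the new box and reset their miss counter.
        for t_idx, d_idx in matches:
            self.tracks[t_idx].box = detections[d_idx]
            self.tracks[t_idx].misses = 0
        # Unmatched tracks survive for a while in case the object reappears.
        for t_idx in unmatched_tracks:
            self.tracks[t_idx].misses += 1
        self.tracks = [t for t in self.tracks if t.misses <= self.max_age]
        # Unmatched detections start new tracks with fresh IDs.
        for d_idx in unmatched_detections:
            self.tracks.append(Track(next(self._ids), detections[d_idx]))
        return self.tracks
```

A larger `max_age` tolerates longer gaps in detection, at the cost of keeping stale tracks around that may be matched to the wrong object.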

Advantages:

Very fast.
Simple to implement.
Works surprisingly well in many scenarios with moderate occlusion.

Limitations:

Data association relies only on geometry (IoU):
- Tends to fail under long occlusions and in crowded scenes.
- IDs can switch when objects cross paths.

7.2 DeepSORT

Extends SORT by adding appearance features.

Train a CNN to produce embedding vectors for each detection:
- Similar vectors for the same object across frames.
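Combine:
- Geometric/motion cost (in the original DeepSORT, a Mahalanobis distance to the Kalman prediction, with IoU matching as a fallback) and
- Appearance cost (cosine distance between embeddings).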
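
The idea of blending the two costs can be sketched as a weighted sum. This is only the combination step under simple assumptions. The geometric cost is taken as a precomputed number in $[0, 1]$ (for instance $1 - \text{IoU}$), the appearance cost is the cosine distance between two embedding vectors, and `weight` is a hypothetical tuning parameter. Real implementations add gating and a staged matching procedure that are omitted here.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    return 1.0 - dot / (norm_u * norm_v)


def combined_cost(geometric_cost, track_embedding, detection_embedding, weight=0.5):
    """Blend a geometric cost with an appearance cost; lower means a better match."""
    appearance_cost = cosine_distance(track_embedding, detection_embedding)
    return weight * geometric_cost + (1.0 - weight) * appearance_cost
```

When two people cross paths, their boxes may overlap a track equally well, so the geometric term alone cannot separate them. The appearance term then favours the detection whose embedding is closer to what the track has looked like so far.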

Benefits:

Much more robust to:
- Occlusion.
- Crossing trajectories.
- Similar motion patterns.

DeepSORT is widely used for:

Pedestrian tracking.
Multi-camera surveillance.
Traffic analysis.

8. Choosing the Right Tracking Approach

Single object, stable appearance, no strong occlusion:
- Kernel-based methods (Mean-Shift, CamShift) can be enough.
Sensor-based tracking with good motion model:
- Kalman Filter / EKF / UKF with simple measurement model.
Multiple objects in video (e.g., pedestrians, cars):
- Detection + tracking via:
  - SORT for speed.
  - DeepSORT when ID consistency under occlusion matters.
Hybrid setups:
- Use Kalman filters inside MOT frameworks (SORT/DeepSORT).
- Combine appearance-based cues (color, CNN embeddings) with model-based motion prediction.

9. Exam-Oriented Summary

Explain state vs observation and write the basic state + observation equations.
Describe the Kalman Filter:
- Assumptions.
- Prediction and update steps.
- Role of Kalman gain.
Compare model-based (top-down) vs appearance-based (bottom-up) tracking.
Explain how Mean-Shift and CamShift work:
- Target representation (histograms + kernel).
- Iterative mean-shift updates.
- Adaptive window size in CamShift.
Describe the standard MOT pipeline:
- Detection → prediction → data association → track management.
Define MOTA and MOTP.
Outline the workflows of SORT and DeepSORT, and explain when we’d prefer one over the other.

Tracking ties together world modelling, probabilistic state estimation, and visual appearance so we can follow objects reliably across time, even when measurements are noisy and incomplete.
