Perception is the module that transforms raw sensor data into a structured representation of the world: where objects are, what they are, and how they’re moving. Sensor fusion is the art and science of combining data from cameras, LiDAR, radar, and other sensors into a coherent, reliable world model. Together, they form the eyes and brain of the autonomous vehicle.
The output of the perception system is a world model — a structured representation that downstream modules (prediction, planning) can consume. A typical world model includes:
- A list of tracked objects, each with a class label, 3D position, size, orientation, velocity, and confidence score
- A representation of static structure, such as drivable area, free space, or an occupancy grid
- Uncertainty estimates attached to each of the above
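As a minimal sketch, the world model can be expressed as a plain data structure; the field names here (`TrackedObject`, `WorldModel`, `drivable_area`) are illustrative, not a standard API.

```python
from dataclasses import dataclass, field

@dataclass
class TrackedObject:
    track_id: int
    category: str        # e.g. "car", "pedestrian"
    position: tuple      # (x, y, z) in the vehicle frame, metres
    velocity: tuple      # (vx, vy) in m/s
    size: tuple          # (length, width, height) in metres
    yaw: float           # heading angle, radians
    confidence: float    # detection confidence in [0, 1]

@dataclass
class WorldModel:
    timestamp: float                              # common time reference, seconds
    objects: list = field(default_factory=list)   # list[TrackedObject]
    drivable_area: object = None                  # e.g. a BEV occupancy mask
```

Downstream modules then consume `WorldModel` snapshots at a fixed rate rather than raw sensor data.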
LiDAR point clouds provide direct 3D information, making them naturally suited for 3D object detection.
PointPillars (2019): Converts the point cloud into a pseudo-image by:
- Dividing the x–y plane into a grid of vertical pillars
- Encoding the points inside each pillar with a simplified PointNet to produce a fixed-length feature vector
- Scattering the pillar features back onto the grid, yielding a 2D "pseudo-image" that a standard 2D CNN backbone and detection head can process
This approach is fast (~60 Hz) and accurate, making it popular for real-time applications.
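The scatter step above can be sketched in a few lines of NumPy. This is a toy stand-in: the random projection below replaces the learned PointNet, and the grid size and pooling are illustrative.

```python
import numpy as np

def pillarize(points, grid=(4, 4), cell=25.0, feat_dim=8):
    """Toy PointPillars-style encoder: compute a per-point feature,
    max-pool within each pillar, and scatter into a (feat_dim, H, W)
    pseudo-image that a 2D CNN could consume."""
    H, W = grid
    pseudo_image = np.zeros((feat_dim, H, W), dtype=np.float32)
    # Stand-in for the learned PointNet: a fixed random linear map,
    # made non-negative so max-pooling against zeros behaves sensibly.
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((3, feat_dim)).astype(np.float32)
    feats = np.abs(points @ proj)                        # (N, feat_dim)
    ix = np.clip((points[:, 0] // cell).astype(int), 0, W - 1)
    iy = np.clip((points[:, 1] // cell).astype(int), 0, H - 1)
    for f, x, y in zip(feats, ix, iy):
        pseudo_image[:, y, x] = np.maximum(pseudo_image[:, y, x], f)
    return pseudo_image
```

The real network learns the point encoder end-to-end; only the scatter-to-grid structure carries over from this sketch.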
CenterPoint (2021): Detects objects as center points in a BEV representation:
- A keypoint head predicts a heatmap whose peaks are object centers
- Regression heads predict size, orientation, height, and velocity at each center
- An optional second stage refines the boxes using point features
CenterPoint achieves state-of-the-art results on nuScenes and Waymo Open benchmarks.
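Peak extraction from a center heatmap can be illustrated directly. This is a simplified sketch: CenterPoint implements the local-maximum check with max-pooling on the GPU, while here it is an explicit loop.

```python
import numpy as np

def extract_centers(heatmap, threshold=0.5):
    """Return (row, col, score) for every cell that is both above
    the confidence threshold and a local maximum in its 3x3 window."""
    H, W = heatmap.shape
    padded = np.pad(heatmap, 1, constant_values=-np.inf)
    centers = []
    for y in range(H):
        for x in range(W):
            v = heatmap[y, x]
            window = padded[y:y + 3, x:x + 3]   # 3x3 neighbourhood of (y, x)
            if v >= threshold and v == window.max():
                centers.append((y, x, v))
    return centers
```

Cells adjacent to a stronger peak are suppressed automatically, which replaces the non-maximum suppression step used with anchor-based detectors.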
PV-RCNN (2020): Combines voxel-based and point-based representations. Uses voxel processing for efficiency but samples key points from the raw point cloud for precise localization.
3D detection from cameras alone is more challenging because cameras lack direct depth measurement.
BEVDet / BEVDepth: Predicts depth distributions for each pixel, then lifts 2D features into 3D space and projects to BEV for detection.
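The lifting step can be sketched as an outer product between each pixel's feature vector and its predicted depth distribution. This is a minimal illustration of the idea, not the full BEVDet/BEVDepth pipeline (which also splats the frustum features onto the BEV grid).

```python
import numpy as np

def lift_features(features, depth_logits):
    """Lift 2D image features into a frustum of 3D features.
    features: (C, H, W) image features; depth_logits: (D, H, W)
    unnormalised scores over D discrete depth bins per pixel."""
    # Softmax over the depth dimension -> per-pixel depth distribution.
    d = np.exp(depth_logits - depth_logits.max(axis=0, keepdims=True))
    depth_prob = d / d.sum(axis=0, keepdims=True)          # (D, H, W)
    # Outer product per pixel -> (D, C, H, W) frustum features.
    return depth_prob[:, None] * features[None, :]
```

Because the depth distribution sums to one, each pixel's feature mass is spread along its viewing ray in proportion to the predicted depth probability.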
DETR3D (2021): Uses a set of learned 3D reference points. For each reference point:
- Project the point into every camera image using the known calibration
- Sample image features at the projected locations
- Update the corresponding object query via attention, then decode a 3D box from it
StreamPETR (2023): Extends PETR with temporal modeling, propagating object queries across frames to improve consistency and accuracy.
Combining LiDAR and camera data leverages the strengths of both: cameras contribute dense color and texture for classification, while LiDAR contributes precise depth and geometry.
BEVFusion (2022): Transforms camera features into bird's-eye view with a depth-based lifting step, voxelizes the LiDAR point cloud into the same BEV grid, and fuses the two feature maps with a convolutional encoder feeding shared task heads.
This approach achieves the best of both worlds: the semantic richness of cameras and the geometric precision of LiDAR.
TransFusion (2022): Uses cross-attention between LiDAR and camera features. LiDAR features generate queries, which attend to image features for semantic enrichment.
Detection produces per-frame results; tracking maintains consistent object identities across frames.
The dominant paradigm: first detect objects in each frame independently, then associate detections across frames.
Association methods:
Hungarian algorithm: Optimal bipartite matching between detections and existing tracks, using cost metrics like IoU (Intersection over Union), Mahalanobis distance, or appearance similarity.
Kalman Filter prediction: Predict where each tracked object should be in the next frame, then match detections to those predictions rather than to the last observed positions.
Appearance matching: Use learned visual features (ReID embeddings) to match objects that may have been temporarily occluded.
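The Hungarian matching described above can be sketched with `scipy.optimize.linear_sum_assignment`, using 1 − IoU as the cost between axis-aligned BEV boxes. The box format and threshold here are illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def associate(tracks, detections, min_iou=0.3):
    """Optimal bipartite matching of tracks to detections by total IoU;
    pairs whose IoU falls below min_iou are discarded as non-matches."""
    cost = np.array([[1.0 - iou(t, d) for d in detections] for t in tracks])
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if 1.0 - cost[r, c] >= min_iou]
```

Unmatched tracks are typically coasted for a few frames before deletion, and unmatched detections spawn new tracks.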
SORT (Simple Online and Realtime Tracking): Combines Kalman Filter prediction with Hungarian matching on IoU. Fast and effective.
DeepSORT: Extends SORT with appearance features (a deep network extracts a 128-d embedding for each detection) to handle occlusions and ID switches.
Recent work moves beyond the detect-then-track paradigm:
CenterTrack (2020): Jointly detects and tracks objects by predicting a displacement vector for each detection that links it to its previous position.
QDTrack (2021): Learns quasi-dense similarity matching across frames for robust association.
An emerging representation that goes beyond bounding boxes: 3D occupancy grids. Instead of detecting individual objects, the system predicts whether each voxel in 3D space is occupied and by what class.
Bounding boxes work well for regular objects (cars, pedestrians) but poorly for irregular shapes (construction barriers, fallen trees, overturned trucks). Occupancy grids handle arbitrary geometry naturally.
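A minimal version of the idea, restricted to a single BEV slice for brevity (the 3D variant simply adds a z index and per-voxel class labels):

```python
import numpy as np

def occupancy_grid(points, extent=50.0, voxel=1.0):
    """Mark cells of a vehicle-centred BEV grid as occupied wherever
    a LiDAR return lands. points: (N, 3) in the vehicle frame;
    extent: half-width of the grid in metres; voxel: cell size."""
    n = int(2 * extent / voxel)
    grid = np.zeros((n, n), dtype=bool)
    ix = ((points[:, 0] + extent) / voxel).astype(int)
    iy = ((points[:, 1] + extent) / voxel).astype(int)
    keep = (ix >= 0) & (ix < n) & (iy >= 0) & (iy < n)   # drop out-of-range returns
    grid[iy[keep], ix[keep]] = True
    return grid
```

Note the contrast with detection: nothing here requires the occupied cells to form a recognisable object category, which is exactly why occupancy handles irregular obstacles.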
TPVFormer (2023): Uses tri-perspective view representations to predict dense 3D occupancy from camera images.
SurroundOcc (2023): Predicts 3D semantic occupancy from multi-camera images, providing a complete volumetric understanding of the scene.
Occ3D: A benchmark for 3D occupancy prediction, driving progress in this area.
Tesla has adopted occupancy networks as a core component of their FSD system, using them to reason about arbitrary obstacles.
Early fusion: Raw data from different sensors is combined before any processing, for example by painting LiDAR points with camera pixel colors before a single network consumes them.
Late fusion: Each sensor is processed independently to produce detections, then the detection lists are merged, typically by matching overlapping boxes and reconciling their scores.
Mid-level fusion: Features extracted from each sensor are fused at an intermediate representation level, as in the shared BEV grid used by BEVFusion.
As of 2026, mid-level fusion consistently achieves the best results on benchmarks. BEVFusion and similar architectures dominate the nuScenes and Waymo Open Dataset leaderboards. However, the gap between camera-only and multi-modal systems is narrowing, especially for simpler driving scenarios.
Sensor fusion requires precise knowledge of the geometric relationship between sensors — their positions and orientations relative to each other and the vehicle body. This is extrinsic calibration.
For cameras, intrinsic calibration determines:
- Focal lengths (fx, fy) and principal point (cx, cy)
- Lens distortion coefficients (radial and tangential)
Typically done in the factory using checkerboard patterns.
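Once the intrinsics are known, projecting a 3D point in the camera frame to a pixel follows the pinhole model. The matrix values below are illustrative, and distortion is ignored.

```python
import numpy as np

def project(point_cam, K):
    """Pinhole projection of a 3D point (camera frame) to pixels:
    u = fx * X/Z + cx,  v = fy * Y/Z + cy (no lens distortion)."""
    x, y, z = point_cam
    u = K[0, 0] * x / z + K[0, 2]
    v = K[1, 1] * y / z + K[1, 2]
    return u, v

# Illustrative intrinsics: 1000 px focal length, 1280x720 image centre.
K = np.array([[1000.0,    0.0, 640.0],
              [   0.0, 1000.0, 360.0],
              [   0.0,    0.0,   1.0]])
```

In practice the distortion coefficients from calibration are applied before this linear step (or the image is undistorted first).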
The 6-DoF transformation (rotation + translation) between each sensor and a reference frame (usually the vehicle’s rear axle center). This must be accurate to within millimeters and fractions of a degree.
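Applying an extrinsic calibration is a single homogeneous transform per point. The mounting pose below is invented for illustration; a real vehicle would load these values from its calibration store.

```python
import numpy as np

def transform_points(points, T):
    """Apply a 4x4 homogeneous transform (rotation + translation)
    to (N, 3) points, e.g. LiDAR frame -> vehicle frame."""
    homo = np.hstack([points, np.ones((len(points), 1))])   # (N, 4)
    return (homo @ T.T)[:, :3]

# Illustrative extrinsic: sensor 1.0 m ahead of and 1.5 m above the
# reference frame, rotated 90 degrees about the vertical axis.
c, s = np.cos(np.pi / 2), np.sin(np.pi / 2)
T_lidar_to_vehicle = np.array([[c, -s, 0.0, 1.0],
                               [s,  c, 0.0, 0.0],
                               [0.0, 0.0, 1.0, 1.5],
                               [0.0, 0.0, 0.0, 1.0]])
```

Chaining such transforms (sensor → vehicle → world) is how measurements from all sensors end up in one common frame for fusion.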
Calibration methods:
- Target-based: observe a known target (checkerboard or fiducial markers) with multiple sensors simultaneously and solve for the transform that best aligns the observations
- Targetless / online: continuously refine the extrinsics by aligning natural features (edges, planes) across sensors during normal driving, which also catches drift from vibration and temperature changes
Sensors capture data at different rates and with different latencies: cameras commonly run at around 30 Hz and spinning LiDAR at 10–20 Hz, and each modality adds its own processing delay before data reaches the fusion stage.
All sensor data must be timestamped precisely (using hardware triggers or GPS-synchronized clocks) and interpolated to a common time reference for fusion.
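The interpolation step can be sketched with `np.interp` applied per state dimension. This assumes states that vary smoothly enough for linear interpolation, which holds for positions over the tens of milliseconds between sensor frames.

```python
import numpy as np

def interpolate_state(t_query, times, states):
    """Linearly interpolate recorded states (e.g. object positions)
    to a common query timestamp. times: increasing timestamps;
    states: (T, D) array of per-timestamp state vectors."""
    states = np.asarray(states, dtype=float)
    return np.array([np.interp(t_query, times, states[:, i])
                     for i in range(states.shape[1])])
```

Orientations need special care (interpolate quaternions, not Euler angles), and extrapolating beyond the last measurement is usually handled by the motion model instead.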
Perception is inherently uncertain. A detected object might be misclassified, its position might be imprecise, or it might be a false positive entirely. Robust perception systems model and propagate uncertainty throughout the pipeline.
Every detection includes a confidence score $p \in [0, 1]$ representing the model’s certainty. Downstream modules can use these scores to weight decisions.
For tracked objects, the Kalman Filter maintains a covariance matrix representing position and velocity uncertainty. Objects farther away or partially occluded have higher uncertainty.
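The growth of that covariance during prediction can be shown with a one-dimensional constant-velocity model; the process-noise scale `q` is an illustrative tuning parameter.

```python
import numpy as np

def kf_predict(x, P, dt, q=0.1):
    """Kalman prediction for state [position, velocity].
    Each predict step without a measurement inflates the
    covariance P, encoding growing uncertainty."""
    F = np.array([[1.0, dt],
                  [0.0, 1.0]])             # constant-velocity motion model
    Q = q * np.array([[dt**3 / 3, dt**2 / 2],
                      [dt**2 / 2, dt]])    # process noise (white-accel model)
    return F @ x, F @ P @ F.T + Q

x = np.array([0.0, 5.0])   # at 0 m, moving at 5 m/s
P = np.eye(2)
for _ in range(3):         # three frames with no matching detection
    x, P = kf_predict(x, P, dt=0.1)
```

A matching detection would then trigger the update step, shrinking P back down; tracks whose covariance grows too large are usually dropped.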
Running multiple models or multiple sensor pipelines and comparing their outputs provides an additional layer of robustness. Disagreement between pipelines signals potential errors.
Perception and sensor fusion transform raw sensor data into an actionable world model:
- 3D detection locates and classifies objects from LiDAR, cameras, or both
- Tracking links per-frame detections into persistent object identities
- Occupancy representations capture obstacles that bounding boxes cannot
- Calibration and time synchronization make cross-sensor fusion possible
- Uncertainty is estimated and propagated so downstream modules can weigh their decisions
The next chapter explores what happens after perception: predicting what other road users will do next.
| ← Previous: Localization and Mapping | Next: Prediction and Behavior Modeling → |