Perception is the module that transforms raw sensor data into a structured representation of the world: where objects are, what they are, and how they’re moving. Sensor fusion is the art and science of combining data from cameras, LiDAR, radar, and other sensors into a coherent, reliable world model. Together, they form the eyes and brain of the autonomous vehicle.
The output of the perception system is a world model — a structured representation that downstream modules (prediction, planning) can consume. A typical world model includes:
- A list of tracked objects, each with a class label, 3D position, size, orientation, velocity, and confidence score
- A representation of static structure, such as drivable area, free space, or an occupancy grid
- Uncertainty estimates attached to each of the above
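As a minimal sketch, the world model can be expressed as a plain data structure; the field names here (`TrackedObject`, `WorldModel`, `drivable_area`) are illustrative, not a standard API.

```python
from dataclasses import dataclass, field

@dataclass
class TrackedObject:
    track_id: int
    category: str        # e.g. "car", "pedestrian"
    position: tuple      # (x, y, z) in the vehicle frame, metres
    velocity: tuple      # (vx, vy) in m/s
    size: tuple          # (length, width, height) in metres
    yaw: float           # heading angle, radians
    confidence: float    # detection confidence in [0, 1]

@dataclass
class WorldModel:
    timestamp: float                              # common time reference, seconds
    objects: list = field(default_factory=list)   # list[TrackedObject]
    drivable_area: object = None                  # e.g. a BEV occupancy mask
```

Downstream modules then consume `WorldModel` snapshots at a fixed rate rather than raw sensor data.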
LiDAR point clouds provide direct 3D information, making them naturally suited for 3D object detection.
PointPillars (2019): Converts the point cloud into a pseudo-image by:
- Dividing the x–y plane into a grid of vertical pillars
- Encoding the points inside each pillar with a simplified PointNet to produce a fixed-length feature vector
- Scattering the pillar features back onto the grid, yielding a 2D "pseudo-image" that a standard 2D CNN backbone and detection head can process
This approach is fast (~60 Hz) and accurate, making it popular for real-time applications.
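The scatter step above can be sketched in a few lines of NumPy. This is a toy stand-in: the random projection below replaces the learned PointNet, and the grid size and pooling are illustrative.

```python
import numpy as np

def pillarize(points, grid=(4, 4), cell=25.0, feat_dim=8):
    """Toy PointPillars-style encoder: compute a per-point feature,
    max-pool within each pillar, and scatter into a (feat_dim, H, W)
    pseudo-image that a 2D CNN could consume."""
    H, W = grid
    pseudo_image = np.zeros((feat_dim, H, W), dtype=np.float32)
    # Stand-in for the learned PointNet: a fixed random linear map,
    # made non-negative so max-pooling against zeros behaves sensibly.
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((3, feat_dim)).astype(np.float32)
    feats = np.abs(points @ proj)                        # (N, feat_dim)
    ix = np.clip((points[:, 0] // cell).astype(int), 0, W - 1)
    iy = np.clip((points[:, 1] // cell).astype(int), 0, H - 1)
    for f, x, y in zip(feats, ix, iy):
        pseudo_image[:, y, x] = np.maximum(pseudo_image[:, y, x], f)
    return pseudo_image
```

The real network learns the point encoder end-to-end; only the scatter-to-grid structure carries over from this sketch.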
CenterPoint (2021): Detects objects as center points in a BEV representation:
- A keypoint head predicts a heatmap whose peaks are object centers
- Regression heads predict size, orientation, height, and velocity at each center
- An optional second stage refines the boxes using point features
CenterPoint achieves state-of-the-art results on nuScenes and Waymo Open benchmarks.
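Peak extraction from a center heatmap can be illustrated directly. This is a simplified sketch: CenterPoint implements the local-maximum check with max-pooling on the GPU, while here it is an explicit loop.

```python
import numpy as np

def extract_centers(heatmap, threshold=0.5):
    """Return (row, col, score) for every cell that is both above
    the confidence threshold and a local maximum in its 3x3 window."""
    H, W = heatmap.shape
    padded = np.pad(heatmap, 1, constant_values=-np.inf)
    centers = []
    for y in range(H):
        for x in range(W):
            v = heatmap[y, x]
            window = padded[y:y + 3, x:x + 3]   # 3x3 neighbourhood of (y, x)
            if v >= threshold and v == window.max():
                centers.append((y, x, v))
    return centers
```

Cells adjacent to a stronger peak are suppressed automatically, which replaces the non-maximum suppression step used with anchor-based detectors.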
PV-RCNN (2020): Combines voxel-based and point-based representations. Uses voxel processing for efficiency but samples key points from the raw point cloud for precise localization.
3D detection from cameras alone is more challenging because cameras lack direct depth measurement.
BEVDet / BEVDepth: Predicts depth distributions for each pixel, then lifts 2D features into 3D space and projects to BEV for detection.
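The lifting step can be sketched as an outer product between each pixel's feature vector and its predicted depth distribution. This is a minimal illustration of the idea, not the full BEVDet/BEVDepth pipeline (which also splats the frustum features onto the BEV grid).

```python
import numpy as np

def lift_features(features, depth_logits):
    """Lift 2D image features into a frustum of 3D features.
    features: (C, H, W) image features; depth_logits: (D, H, W)
    unnormalised scores over D discrete depth bins per pixel."""
    # Softmax over the depth dimension -> per-pixel depth distribution.
    d = np.exp(depth_logits - depth_logits.max(axis=0, keepdims=True))
    depth_prob = d / d.sum(axis=0, keepdims=True)          # (D, H, W)
    # Outer product per pixel -> (D, C, H, W) frustum features.
    return depth_prob[:, None] * features[None, :]
```

Because the depth distribution sums to one, each pixel's feature mass is spread along its viewing ray in proportion to the predicted depth probability.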
DETR3D (2021): Uses a set of learned 3D reference points. For each reference point:
- Project the point into every camera image using the known calibration
- Sample image features at the projected locations
- Update the corresponding object query via attention, then decode a 3D box from it
StreamPETR (2023): Extends PETR with temporal modeling, propagating object queries across frames to improve consistency and accuracy.
Combining LiDAR and camera data leverages the strengths of both: cameras contribute dense color and texture for classification, while LiDAR contributes precise depth and geometry.
BEVFusion (2022): Transforms camera features into bird's-eye view with a depth-based lifting step, voxelizes the LiDAR point cloud into the same BEV grid, and fuses the two feature maps with a convolutional encoder feeding shared task heads.
This approach achieves the best of both worlds: the semantic richness of cameras and the geometric precision of LiDAR.
TransFusion (2022): Uses cross-attention between LiDAR and camera features. LiDAR features generate queries, which attend to image features for semantic enrichment.
Detection produces per-frame results; tracking maintains consistent object identities across frames.
The dominant paradigm: first detect objects in each frame independently, then associate detections across frames.
Association methods:
Hungarian algorithm: Optimal bipartite matching between detections and existing tracks, using cost metrics like IoU (Intersection over Union), Mahalanobis distance, or appearance similarity.
Kalman Filter prediction: Predict where each tracked object should be in the next frame, then match detections to those predictions rather than to the last observed positions.
Appearance matching: Use learned visual features (ReID embeddings) to match objects that may have been temporarily occluded.
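The Hungarian matching described above can be sketched with `scipy.optimize.linear_sum_assignment`, using 1 − IoU as the cost between axis-aligned BEV boxes. The box format and threshold here are illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def associate(tracks, detections, min_iou=0.3):
    """Optimal bipartite matching of tracks to detections by total IoU;
    pairs whose IoU falls below min_iou are discarded as non-matches."""
    cost = np.array([[1.0 - iou(t, d) for d in detections] for t in tracks])
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if 1.0 - cost[r, c] >= min_iou]
```

Unmatched tracks are typically coasted for a few frames before deletion, and unmatched detections spawn new tracks.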
SORT (Simple Online and Realtime Tracking): Combines Kalman Filter prediction with Hungarian matching on IoU. Fast and effective.
DeepSORT: Extends SORT with appearance features (a deep network extracts a 128-d embedding for each detection) to handle occlusions and ID switches.
Recent work moves beyond the detect-then-track paradigm:
CenterTrack (2020): Jointly detects and tracks objects by predicting a displacement vector for each detection that links it to its previous position.
QDTrack (2021): Learns quasi-dense similarity matching across frames for robust association.
An emerging representation that goes beyond bounding boxes: 3D occupancy grids. Instead of detecting individual objects, the system predicts whether each voxel in 3D space is occupied and by what class.
Bounding boxes work well for regular objects (cars, pedestrians) but poorly for irregular shapes (construction barriers, fallen trees, overturned trucks). Occupancy grids handle arbitrary geometry naturally.
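A minimal version of the idea, restricted to a single BEV slice for brevity (the 3D variant simply adds a z index and per-voxel class labels):

```python
import numpy as np

def occupancy_grid(points, extent=50.0, voxel=1.0):
    """Mark cells of a vehicle-centred BEV grid as occupied wherever
    a LiDAR return lands. points: (N, 3) in the vehicle frame;
    extent: half-width of the grid in metres; voxel: cell size."""
    n = int(2 * extent / voxel)
    grid = np.zeros((n, n), dtype=bool)
    ix = ((points[:, 0] + extent) / voxel).astype(int)
    iy = ((points[:, 1] + extent) / voxel).astype(int)
    keep = (ix >= 0) & (ix < n) & (iy >= 0) & (iy < n)   # drop out-of-range returns
    grid[iy[keep], ix[keep]] = True
    return grid
```

Note the contrast with detection: nothing here requires the occupied cells to form a recognisable object category, which is exactly why occupancy handles irregular obstacles.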
TPVFormer (2023): Uses tri-perspective view representations to predict dense 3D occupancy from camera images.
SurroundOcc (2023): Predicts 3D semantic occupancy from multi-camera images, providing a complete volumetric understanding of the scene.
Occ3D: A benchmark for 3D occupancy prediction, driving progress in this area.
Tesla has adopted occupancy networks as a core component of their FSD system, using them to reason about arbitrary obstacles.
Early fusion: Raw data from different sensors is combined before any processing, for example by painting LiDAR points with camera pixel colors before a single network consumes them.
Late fusion: Each sensor is processed independently to produce detections, then the detection lists are merged, typically by matching overlapping boxes and reconciling their scores.
Mid-level fusion: Features extracted from each sensor are fused at an intermediate representation level, as in the shared BEV grid used by BEVFusion.
As of 2026, mid-level fusion consistently achieves the best results on benchmarks. BEVFusion and similar architectures dominate the nuScenes and Waymo Open Dataset leaderboards. However, the gap between camera-only and multi-modal systems is narrowing, especially for simpler driving scenarios.
Sensor fusion requires precise knowledge of the geometric relationship between sensors — their positions and orientations relative to each other and the vehicle body. This is extrinsic calibration.
For cameras, intrinsic calibration determines:
- Focal lengths (fx, fy) and principal point (cx, cy)
- Lens distortion coefficients (radial and tangential)
Typically done in the factory using checkerboard patterns.
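Once the intrinsics are known, projecting a 3D point in the camera frame to a pixel follows the pinhole model. The matrix values below are illustrative, and distortion is ignored.

```python
import numpy as np

def project(point_cam, K):
    """Pinhole projection of a 3D point (camera frame) to pixels:
    u = fx * X/Z + cx,  v = fy * Y/Z + cy (no lens distortion)."""
    x, y, z = point_cam
    u = K[0, 0] * x / z + K[0, 2]
    v = K[1, 1] * y / z + K[1, 2]
    return u, v

# Illustrative intrinsics: 1000 px focal length, 1280x720 image centre.
K = np.array([[1000.0,    0.0, 640.0],
              [   0.0, 1000.0, 360.0],
              [   0.0,    0.0,   1.0]])
```

In practice the distortion coefficients from calibration are applied before this linear step (or the image is undistorted first).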
The 6-DoF transformation (rotation + translation) between each sensor and a reference frame (usually the vehicle’s rear axle center). This must be accurate to within millimeters and fractions of a degree.
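Applying an extrinsic calibration is a single homogeneous transform per point. The mounting pose below is invented for illustration; a real vehicle would load these values from its calibration store.

```python
import numpy as np

def transform_points(points, T):
    """Apply a 4x4 homogeneous transform (rotation + translation)
    to (N, 3) points, e.g. LiDAR frame -> vehicle frame."""
    homo = np.hstack([points, np.ones((len(points), 1))])   # (N, 4)
    return (homo @ T.T)[:, :3]

# Illustrative extrinsic: sensor 1.0 m ahead of and 1.5 m above the
# reference frame, rotated 90 degrees about the vertical axis.
c, s = np.cos(np.pi / 2), np.sin(np.pi / 2)
T_lidar_to_vehicle = np.array([[c, -s, 0.0, 1.0],
                               [s,  c, 0.0, 0.0],
                               [0.0, 0.0, 1.0, 1.5],
                               [0.0, 0.0, 0.0, 1.0]])
```

Chaining such transforms (sensor → vehicle → world) is how measurements from all sensors end up in one common frame for fusion.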
Calibration methods:
- Target-based: observe a known target (checkerboard or fiducial markers) with multiple sensors simultaneously and solve for the transform that best aligns the observations
- Targetless / online: continuously refine the extrinsics by aligning natural features (edges, planes) across sensors during normal driving, which also catches drift from vibration and temperature changes
Sensors capture data at different rates and with different latencies: cameras commonly run at around 30 Hz and spinning LiDAR at 10–20 Hz, and each modality adds its own processing delay before data reaches the fusion stage.
All sensor data must be timestamped precisely (using hardware triggers or GPS-synchronized clocks) and interpolated to a common time reference for fusion.
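The interpolation step can be sketched with `np.interp` applied per state dimension. This assumes states that vary smoothly enough for linear interpolation, which holds for positions over the tens of milliseconds between sensor frames.

```python
import numpy as np

def interpolate_state(t_query, times, states):
    """Linearly interpolate recorded states (e.g. object positions)
    to a common query timestamp. times: increasing timestamps;
    states: (T, D) array of per-timestamp state vectors."""
    states = np.asarray(states, dtype=float)
    return np.array([np.interp(t_query, times, states[:, i])
                     for i in range(states.shape[1])])
```

Orientations need special care (interpolate quaternions, not Euler angles), and extrapolating beyond the last measurement is usually handled by the motion model instead.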
Perception is inherently uncertain. A detected object might be misclassified, its position might be imprecise, or it might be a false positive entirely. Robust perception systems model and propagate uncertainty throughout the pipeline.
Every detection includes a confidence score $p \in [0, 1]$ representing the model’s certainty. Downstream modules can use these scores to weight decisions.
For tracked objects, the Kalman Filter maintains a covariance matrix representing position and velocity uncertainty. Objects farther away or partially occluded have higher uncertainty.
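The growth of that covariance during prediction can be shown with a one-dimensional constant-velocity model; the process-noise scale `q` is an illustrative tuning parameter.

```python
import numpy as np

def kf_predict(x, P, dt, q=0.1):
    """Kalman prediction for state [position, velocity].
    Each predict step without a measurement inflates the
    covariance P, encoding growing uncertainty."""
    F = np.array([[1.0, dt],
                  [0.0, 1.0]])             # constant-velocity motion model
    Q = q * np.array([[dt**3 / 3, dt**2 / 2],
                      [dt**2 / 2, dt]])    # process noise (white-accel model)
    return F @ x, F @ P @ F.T + Q

x = np.array([0.0, 5.0])   # at 0 m, moving at 5 m/s
P = np.eye(2)
for _ in range(3):         # three frames with no matching detection
    x, P = kf_predict(x, P, dt=0.1)
```

A matching detection would then trigger the update step, shrinking P back down; tracks whose covariance grows too large are usually dropped.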
Running multiple models or multiple sensor pipelines and comparing their outputs provides an additional layer of robustness. Disagreement between pipelines signals potential errors.
Perception and sensor fusion transform raw sensor data into an actionable world model:
- 3D detection locates and classifies objects from LiDAR, cameras, or both
- Tracking links per-frame detections into persistent object identities
- Occupancy representations capture obstacles that bounding boxes cannot
- Calibration and time synchronization make cross-sensor fusion possible
- Uncertainty is estimated and propagated so downstream modules can weigh their decisions
The next chapter explores what happens after perception: predicting what other road users will do next.
| ← Previous: Localization and Mapping | Next: Prediction and Behavior Modeling → |