Chapter 4: Perception and Sensor Fusion

Perception is the module that transforms raw sensor data into a structured representation of the world: where objects are, what they are, and how they’re moving. Sensor fusion is the art and science of combining data from cameras, LiDAR, radar, and other sensors into a coherent, reliable world model. Together, they form the eyes and brain of the autonomous vehicle.

The World Model

The output of the perception system is a world model — a structured representation that downstream modules (prediction, planning) can consume. A typical world model includes:

  1. Detected objects with class, 3D position, size, heading, and velocity
  2. Persistent track identities linking objects across frames
  3. A dense occupancy representation for obstacles that defy bounding boxes
  4. Uncertainty estimates attached to each of the above

3D Object Detection

LiDAR-Based 3D Detection

LiDAR point clouds provide direct 3D information, making them naturally suited for 3D object detection.

PointPillars (2019): Converts the point cloud into a pseudo-image by:

  1. Dividing the ground plane into a grid of “pillars” (vertical columns)
  2. Encoding the points within each pillar using a simplified PointNet
  3. Scattering the pillar features onto a 2D pseudo-image
  4. Applying a standard 2D detection head (SSD-style)

This approach is fast (~60 Hz) and accurate, making it popular for real-time applications.
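
The scatter step above (step 3) reduces to a single indexed assignment. A minimal NumPy sketch, where `scatter_pillars` is a hypothetical helper and the per-pillar PointNet encoding is assumed to have run already:

```python
import numpy as np

def scatter_pillars(pillar_features, pillar_coords, grid_shape):
    """Scatter per-pillar feature vectors onto a dense 2D pseudo-image.

    pillar_features: (P, C) array, one C-dim feature per non-empty pillar
    pillar_coords:   (P, 2) integer (row, col) grid indices of each pillar
    grid_shape:      (H, W) size of the BEV grid
    Returns a (C, H, W) pseudo-image; empty pillars stay zero.
    """
    C = pillar_features.shape[1]
    H, W = grid_shape
    canvas = np.zeros((C, H, W), dtype=pillar_features.dtype)
    canvas[:, pillar_coords[:, 0], pillar_coords[:, 1]] = pillar_features.T
    return canvas

# Two non-empty pillars with 3-dim features on a 4x4 BEV grid
feats = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
coords = np.array([[0, 1], [3, 2]])
img = scatter_pillars(feats, coords, (4, 4))
```

The resulting pseudo-image is what the SSD-style 2D head in step 4 consumes.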

CenterPoint (2021): Detects objects as center points in a BEV representation:

  1. Voxelize the point cloud into a 3D grid
  2. Process with a 3D sparse convolution backbone (VoxelNet)
  3. Flatten to BEV and apply a center-point detection head
  4. Predict bounding box attributes (size, heading, velocity) from each center

CenterPoint achieved state-of-the-art results on the nuScenes and Waymo Open Dataset benchmarks at the time of its publication.
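
The center decoding in step 3 amounts to picking local maxima from a BEV heatmap of per-cell center scores. A minimal sketch (`decode_centers` is an illustrative name; real implementations use max-pooling on the GPU rather than Python loops):

```python
import numpy as np

def decode_centers(heatmap, threshold=0.5):
    """Pick 3x3 local maxima above a score threshold from a BEV heatmap.
    heatmap: (H, W) array of per-cell center scores in [0, 1].
    Returns a list of (row, col, score) candidate object centers.
    """
    H, W = heatmap.shape
    # Pad so border cells can be compared against full 3x3 neighborhoods
    padded = np.pad(heatmap, 1, mode="constant", constant_values=-np.inf)
    centers = []
    for r in range(H):
        for c in range(W):
            v = heatmap[r, c]
            window = padded[r:r + 3, c:c + 3]  # 3x3 neighborhood of (r, c)
            if v >= threshold and v == window.max():
                centers.append((r, c, v))
    return centers

hm = np.zeros((4, 4))
hm[1, 2] = 0.9                 # a single confident center
peaks = decode_centers(hm)
```

Box attributes (size, heading, velocity) are then regressed at each surviving peak.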

PV-RCNN (2020): Combines voxel-based and point-based representations. Uses voxel processing for efficiency but samples key points from the raw point cloud for precise localization.

Camera-Based 3D Detection

3D detection from cameras alone is more challenging because cameras lack direct depth measurement.

BEVDet / BEVDepth: Predicts depth distributions for each pixel, then lifts 2D features into 3D space and projects to BEV for detection.

DETR3D (2021): Uses a set of learned 3D reference points. For each reference point:

  1. Project it into each camera view
  2. Sample image features at the projected locations
  3. Use a transformer decoder to refine 3D predictions
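
Step 1, projecting a 3D reference point into a camera view, is standard pinhole geometry. A sketch under assumed conventions (a 4x4 world-to-camera extrinsic, camera looking down +z):

```python
import numpy as np

def project_to_camera(point_world, extrinsic, intrinsic):
    """Project a 3D reference point into one camera's image plane.

    point_world: (3,) point in the world/ego frame
    extrinsic:   (4, 4) world-to-camera homogeneous transform
    intrinsic:   (3, 3) camera matrix K
    Returns (u, v) pixel coordinates, or None if the point is behind the camera.
    """
    p = extrinsic @ np.append(point_world, 1.0)   # into the camera frame
    if p[2] <= 0:                                 # behind the image plane
        return None
    uv = intrinsic @ p[:3]
    return uv[0] / uv[2], uv[1] / uv[2]

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
T = np.eye(4)  # camera at the origin for simplicity
uv = project_to_camera(np.array([0.0, 0.0, 10.0]), T, K)  # -> (320.0, 240.0)
```

DETR3D performs this projection into every camera, sampling features only from views where the point lands in front of the lens.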

StreamPETR (2023): Extends PETR with temporal modeling, propagating object queries across frames to improve consistency and accuracy.

Multi-Modal Fusion for 3D Detection

Combining LiDAR and camera data leverages the strengths of both:

BEVFusion (2022):

  1. Process camera images through a BEV encoder (using depth estimation + lift-splat)
  2. Process LiDAR point clouds through a voxel-based encoder
  3. Fuse both BEV representations by concatenation or addition
  4. Apply a shared detection head

This approach achieves the best of both worlds: the semantic richness of cameras and the geometric precision of LiDAR.
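
Once both encoders emit features on the same BEV grid, the fusion in step 3 reduces to a tensor operation. A minimal sketch (`fuse_bev` is an illustrative helper, not BEVFusion's actual API):

```python
import numpy as np

def fuse_bev(cam_bev, lidar_bev, mode="concat"):
    """Fuse camera and LiDAR BEV feature maps of shape (C, H, W).
    Both maps must share the same BEV grid (H, W)."""
    assert cam_bev.shape[1:] == lidar_bev.shape[1:], "BEV grids must match"
    if mode == "concat":
        return np.concatenate([cam_bev, lidar_bev], axis=0)
    if mode == "add":
        assert cam_bev.shape == lidar_bev.shape, "channel counts must match"
        return cam_bev + lidar_bev
    raise ValueError(f"unknown fusion mode: {mode}")

cam = np.ones((32, 8, 8))          # camera BEV features, 32 channels
lid = 2.0 * np.ones((64, 8, 8))    # LiDAR BEV features, 64 channels
fused = fuse_bev(cam, lid)         # concat -> 96 channels
```

Concatenation is the more flexible choice, since it lets the shared detection head learn how to weight each modality per channel.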

TransFusion (2022): Uses cross-attention between LiDAR and camera features. LiDAR features generate queries, which attend to image features for semantic enrichment.

Multi-Object Tracking (MOT)

Detection produces per-frame results; tracking maintains consistent object identities across frames.

Tracking-by-Detection

The dominant paradigm: first detect objects in each frame independently, then associate detections across frames.

Association methods:

  1. Hungarian algorithm: Optimal bipartite matching between detections and existing tracks, using cost metrics like IoU (Intersection over Union), Mahalanobis distance, or appearance similarity.

  2. Kalman Filter prediction: Predict where each tracked object should be in the next frame, then match detections to predictions.

  3. Appearance matching: Use learned visual features (ReID embeddings) to match objects that may have been temporarily occluded.

SORT (Simple Online and Realtime Tracking): Combines Kalman Filter prediction with Hungarian matching on IoU. Fast and effective.

DeepSORT: Extends SORT with appearance features (a deep network extracts a 128-d embedding for each detection) to handle occlusions and ID switches.
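
The SORT association step (Hungarian matching on an IoU cost matrix) can be sketched with SciPy's `linear_sum_assignment`; the Kalman prediction that would supply the track boxes is omitted here:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def associate(tracks, detections, min_iou=0.3):
    """Match predicted track boxes to detections via Hungarian matching
    on cost = 1 - IoU; matches below min_iou are rejected."""
    cost = np.array([[1.0 - iou(t, d) for d in detections] for t in tracks])
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if 1.0 - cost[r, c] >= min_iou]

tracks = [(0, 0, 10, 10), (20, 20, 30, 30)]       # predicted track boxes
dets = [(21, 21, 31, 31), (1, 1, 11, 11)]          # this frame's detections
matches = associate(tracks, dets)                  # track 0 -> det 1, etc.
```

DeepSORT keeps the same matching machinery but mixes appearance distance into the cost matrix alongside geometry.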

Track Management

Tracks also need lifecycle rules, not just frame-to-frame association. A new track is typically created as tentative and confirmed only after several consecutive matched detections; a confirmed track that goes unmatched for a set number of frames is deleted. This guards against spawning tracks from false positives and against keeping ghosts of objects that have left the scene.

Joint Detection and Tracking

Recent work moves beyond the detect-then-track paradigm:

CenterTrack (2020): Jointly detects and tracks objects by predicting a displacement vector for each detection that links it to its previous position.

QDTrack (2021): Learns quasi-dense similarity matching across frames for robust association.

Occupancy Networks

An emerging representation that goes beyond bounding boxes: 3D occupancy grids. Instead of detecting individual objects, the system predicts whether each voxel in 3D space is occupied and by what class.

Why Occupancy?

Bounding boxes work well for regular objects (cars, pedestrians) but poorly for irregular shapes (construction barriers, fallen trees, overturned trucks). Occupancy grids handle arbitrary geometry naturally.
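
A minimal occupancy grid can be built from a point cloud by binning points into voxels. This sketch marks a voxel occupied if any point falls inside it; real occupancy networks go further and predict occupancy (and class) even for unobserved space:

```python
import numpy as np

def occupancy_grid(points, origin, voxel_size, grid_shape):
    """Mark voxels containing at least one LiDAR point as occupied.

    points:     (N, 3) point cloud in the ego frame
    origin:     (3,) coordinate of voxel (0, 0, 0)'s corner
    voxel_size: edge length of each cubic voxel, in meters
    grid_shape: (X, Y, Z) number of voxels per axis
    Returns a boolean (X, Y, Z) occupancy grid.
    """
    grid = np.zeros(grid_shape, dtype=bool)
    idx = np.floor((points - origin) / voxel_size).astype(int)
    # Discard points that fall outside the grid bounds
    in_bounds = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    idx = idx[in_bounds]
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid

points = np.array([[0.5, 0.5, 0.5],    # lands in voxel (0, 0, 0)
                   [2.5, 0.5, 0.5],    # lands in voxel (2, 0, 0)
                   [-1.0, 0.0, 0.0]])  # outside the grid, dropped
grid = occupancy_grid(points, np.zeros(3), 1.0, (4, 4, 4))
```

Note how an overturned truck or a fallen tree occupies voxels just like a car does: no box model of the object is required.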

TPVFormer (2023): Uses tri-perspective view representations to predict dense 3D occupancy from camera images.

SurroundOcc (2023): Predicts 3D semantic occupancy from multi-camera images, providing a complete volumetric understanding of the scene.

Occ3D: A benchmark for 3D occupancy prediction, driving progress in this area.

Tesla has adopted occupancy networks as a core component of their FSD system, using them to reason about arbitrary obstacles.

Sensor Fusion Architectures

Early Fusion

Raw data from different sensors is combined before any model processing (for example, painting LiDAR points with the colors of the pixels they project onto). This preserves the most information but demands tight calibration and time synchronization.

Late Fusion

Each sensor is processed independently to produce detections, then the detection lists are merged. This is modular and degrades gracefully if one sensor fails, but it discards low-level cues that cross-sensor reasoning could exploit.

Mid-Level (Feature) Fusion

Features extracted from each sensor are fused at an intermediate representation level, commonly a shared BEV feature space as in BEVFusion. This balances information preservation against modularity.

Which Approach Wins?

As of 2026, mid-level fusion consistently achieves the best results on benchmarks. BEVFusion and similar architectures dominate the nuScenes and Waymo Open Dataset leaderboards. However, the gap between camera-only and multi-modal systems is narrowing, especially for simpler driving scenarios.

Sensor Calibration

Sensor fusion requires precise knowledge of the geometric relationship between sensors — their positions and orientations relative to each other and the vehicle body. This is extrinsic calibration.

Intrinsic Calibration

For cameras, intrinsic calibration determines the parameters of the camera model itself:

  1. Focal length and principal point (the camera matrix)
  2. Lens distortion coefficients (radial and tangential)

Typically done in the factory using checkerboard patterns.

Extrinsic Calibration

The 6-DoF transformation (rotation + translation) between each sensor and a reference frame (usually the vehicle’s rear axle center). This must be accurate to within millimeters and fractions of a degree.
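
Composing these 6-DoF transforms is how a point measured by one sensor gets expressed in another frame. A sketch with hypothetical, deliberately simplified extrinsics (identity rotations, made-up offsets):

```python
import numpy as np

def make_transform(rotation, translation):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T

# Hypothetical extrinsics: LiDAR 1.5 m above the rear-axle reference frame,
# camera 0.5 m forward of the LiDAR.
T_vehicle_lidar = make_transform(np.eye(3), [0.0, 0.0, 1.5])
T_lidar_camera = make_transform(np.eye(3), [0.5, 0.0, 0.0])

# Chaining gives the camera's pose directly in the vehicle frame
T_vehicle_camera = T_vehicle_lidar @ T_lidar_camera

# A LiDAR point, expressed in the vehicle frame (homogeneous coordinates)
p_lidar = np.array([10.0, 0.0, 0.0, 1.0])
p_vehicle = T_vehicle_lidar @ p_lidar   # -> [10, 0, 1.5, 1]
```

The millimeter-level accuracy requirement exists because these transforms are applied to points tens of meters away, where small angular errors become large positional ones.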

Calibration methods include target-based approaches (checkerboards or fiducial markers observed by multiple sensors simultaneously) and targetless, online approaches that align natural scene features, such as edges and planes, across modalities to estimate and continuously refine the extrinsics.

Time Synchronization

Sensors capture data at different rates and with different latencies: cameras commonly run at around 30 Hz, spinning LiDARs near 10 Hz, and radars on their own cadence.

All sensor data must be timestamped precisely (using hardware triggers or GPS-synchronized clocks) and interpolated to a common time reference for fusion.
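
Interpolating to a common time reference can be as simple as linearly blending the two bracketing samples. A translation-only sketch (a real system would also interpolate orientation, e.g. with quaternion slerp):

```python
import numpy as np

def interpolate_pose(t, t0, pose0, t1, pose1):
    """Linearly interpolate a translation-only pose to timestamp t,
    given bracketing samples at t0 and t1 (t0 < t <= t1)."""
    alpha = (t - t0) / (t1 - t0)
    return (1.0 - alpha) * np.asarray(pose0) + alpha * np.asarray(pose1)

# A camera frame at t = 0.05 s falls between LiDAR-synced poses
# recorded at 0.0 s and 0.1 s
p = interpolate_pose(0.05, 0.0, [0.0, 0.0, 0.0], 0.1, [1.0, 0.0, 0.0])
# p -> [0.5, 0.0, 0.0]
```

Hardware triggering shrinks the intervals being interpolated over, which is why it is preferred to software timestamps.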

Handling Uncertainty

Perception is inherently uncertain. A detected object might be misclassified, its position might be imprecise, or it might be a false positive entirely. Robust perception systems model and propagate uncertainty throughout the pipeline.

Confidence Scores

Every detection includes a confidence score $p \in [0, 1]$ representing the model’s certainty. Downstream modules can use these scores to weight decisions.

Covariance Estimation

For tracked objects, the Kalman Filter maintains a covariance matrix representing position and velocity uncertainty. Objects farther away or partially occluded have higher uncertainty.
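
The growth of uncertainty between measurements falls out of the Kalman prediction step, P' = F P Fᵀ + Q. A one-dimensional constant-velocity sketch with assumed noise values:

```python
import numpy as np

# Constant-velocity model in 1D: state x = [position, velocity]
dt = 0.1                                   # time step (s)
F = np.array([[1.0, dt],
              [0.0, 1.0]])                 # state transition matrix
Q = np.diag([0.01, 0.1])                   # process noise (assumed values)

P = np.diag([1.0, 4.0])                    # current state covariance
P_pred = F @ P @ F.T + Q                   # predicted covariance one step ahead

# Without a new measurement, total uncertainty grows:
# trace(P_pred) > trace(P), and velocity uncertainty leaks into position
# through the off-diagonal terms of F P F^T.
```

A subsequent measurement update would shrink P again, which is exactly why occluded objects (no updates) accumulate uncertainty over time.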

Ensemble Methods

Running multiple models or multiple sensor pipelines and comparing their outputs provides an additional layer of robustness. Disagreement between pipelines signals potential errors.

Summary

Perception and sensor fusion transform raw sensor data into an actionable world model:

  1. 3D object detection identifies and locates objects in 3D space using LiDAR, cameras, or both
  2. Multi-object tracking maintains consistent object identities across time
  3. Occupancy networks represent the world as a dense 3D grid, handling arbitrary geometry
  4. Sensor fusion combines complementary sensors for robust, reliable perception
  5. Calibration and synchronization ensure sensors are properly aligned in space and time
  6. Uncertainty modeling quantifies confidence for downstream decision-making

The next chapter explores what happens after perception: predicting what other road users will do next.

