An autonomous vehicle needs to know not just what is around it, but exactly where it is — to within centimeters. GPS alone provides accuracy of only 2–5 meters, which is insufficient to distinguish between lanes on a highway. This chapter explores the algorithms that achieve centimeter-level localization and the maps that make it possible.
Localization answers the question: “Where am I, precisely, in the world?” For an autonomous vehicle, this means determining its 6 Degrees of Freedom (6-DoF) pose: position (x, y, z) and orientation (roll, pitch, yaw).
The required accuracy depends on the task: staying centered in a lane demands roughly 10 cm of lateral accuracy, stopping precisely at a stop line requires decimeter-level longitudinal accuracy, and route-level navigation can tolerate errors of several meters.
Most autonomous driving systems rely on HD maps — detailed, pre-built maps of the environment that go far beyond consumer navigation maps.
| Layer | Content | Accuracy |
|---|---|---|
| Geometric | 3D point cloud of static environment (buildings, poles, curbs) | ~5 cm |
| Semantic | Lane boundaries, lane types, road edges, crosswalks | ~10 cm |
| Topological | Lane connectivity, intersection structure, merge/diverge points | Logical |
| Regulatory | Speed limits, traffic light positions, stop sign locations, yield rules | Associated with geometry |
| Dynamic | Construction zones, temporary closures (updated frequently) | Variable |
HD maps become stale quickly. Road construction, seasonal changes (snow covering lane markings), and even parked cars can invalidate map assumptions. This is why map maintenance matters as much as map creation: the vehicle must detect when live sensor data disagrees with the map, degrade gracefully, and report the discrepancy so the map can be updated.
Some companies are developing crowd-sourced mapping where production vehicles contribute sensor data to update maps continuously. Mobileye’s REM (Road Experience Management) system is a leading example — millions of vehicles with Mobileye chips upload road landmark data to continuously refresh the map.
SLAM is the problem of building a map of an unknown environment while simultaneously tracking the agent’s position within it. It is a classic chicken-and-egg problem: you need a map to localize, and you need to know your position to build a map.
Visual SLAM uses camera images as the primary sensor. Key algorithms:
ORB-SLAM (2015–2021): One of the most influential visual SLAM systems. It:
- Tracks fast, rotation-invariant ORB features across frames
- Runs three parallel threads: tracking, local mapping, and loop closing
- Detects loop closures and relocalizes using a bag-of-words place-recognition database
- Evolved through ORB-SLAM2 (stereo and RGB-D support) to ORB-SLAM3 (visual-inertial fusion and multi-session maps)
Direct methods (LSD-SLAM, DSO): Instead of extracting features, direct methods minimize the photometric error between frames, using raw pixel intensities. They work well in texture-poor environments where feature extraction fails.
LiDAR SLAM matches 3D point clouds between frames to estimate motion:
ICP (Iterative Closest Point): The foundational algorithm. Given two point clouds, ICP iteratively:
1. Matches each point in the source cloud to its nearest neighbor in the reference cloud
2. Computes the rigid transform $(R, t)$ that minimizes the sum of squared distances between matched pairs (solvable in closed form via SVD)
3. Applies the transform to the source cloud and repeats until the alignment converges
The optimization minimizes:
\[E(R, t) = \sum_{i} \|p_i - (R q_i + t)\|^2\]

where $R$ is the rotation matrix, $t$ is the translation vector, and $(p_i, q_i)$ are corresponding points.
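The objective above can be minimized in closed form once correspondences are fixed, which is exactly what ICP exploits. A minimal NumPy sketch (brute-force nearest-neighbor matching and no outlier rejection, so it assumes a reasonable initial alignment):

```python
import numpy as np

def best_rigid_transform(src, dst):
    """Closed-form least-squares (R, t) aligning src to dst (Kabsch/SVD)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)                  # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.linalg.det(Vt.T @ U.T)])   # guard against reflection
    R = Vt.T @ D @ U.T
    t = dst_c - R @ src_c
    return R, t

def icp(src, dst, iters=20):
    """Point-to-point ICP with brute-force nearest-neighbor correspondences."""
    R_total, t_total = np.eye(3), np.zeros(3)
    cur = src.copy()
    for _ in range(iters):
        # 1. Correspondences: nearest dst point for each current src point
        d2 = ((cur[:, None, :] - dst[None, :, :]) ** 2).sum(-1)
        matched = dst[d2.argmin(axis=1)]
        # 2. Best rigid transform for these pairs
        R, t = best_rigid_transform(cur, matched)
        # 3. Apply and accumulate
        cur = cur @ R.T + t
        R_total, t_total = R @ R_total, R @ t_total + t
    return R_total, t_total
```

In practice a KD-tree replaces the brute-force search, and distance gating or robust kernels reject outlier correspondences.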
NDT (Normal Distributions Transform): Divides the reference point cloud into a grid and fits a normal distribution to the points in each cell. The new scan is aligned by maximizing the likelihood of its points under these distributions. Because it avoids explicit point-to-point correspondences, NDT is typically faster than ICP and more robust to poor initial alignment.
LOAM (LiDAR Odometry and Mapping): Separates the SLAM problem into a fast odometry module (10 Hz) and a slower mapping module (1 Hz). Extracts edge and planar features from point clouds for efficient registration.
LIO-SAM (2020): Tightly couples LiDAR inertial odometry with factor graph optimization, achieving state-of-the-art accuracy and robustness.
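LOAM's edge/planar feature selection is driven by a per-point smoothness score computed over a local window along each scan line. A hedged sketch of that score (it assumes points are already ordered along a single scan line; the window size `k` is an illustrative choice):

```python
import numpy as np

def loam_smoothness(scan, k=5):
    """Per-point smoothness c along one ordered scan line (cf. LOAM):
    c_i = || sum_{j in window} (p_j - p_i) || / (|window| * ||p_i||).
    High c suggests a sharp edge point; low c suggests a planar point."""
    n = len(scan)
    c = np.full(n, np.nan)            # ends of the line have no full window
    for i in range(k, n - k):
        window = np.vstack([scan[i - k:i], scan[i + 1:i + 1 + k]])
        diff = (window - scan[i]).sum(axis=0)   # cancels out on smooth surfaces
        c[i] = np.linalg.norm(diff) / (2 * k * np.linalg.norm(scan[i]))
    return c
```

Points with the highest scores in each sector are kept as edge features and the lowest as planar features, giving a small, well-conditioned set for registration.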
Modern SLAM systems represent the problem as a factor graph: nodes are the unknowns (vehicle poses and landmark positions); factors encode the measurements that constrain them, such as odometry between consecutive poses, landmark observations, loop closures, and GPS fixes.
The optimal map and trajectory are found by solving a nonlinear least-squares optimization:
\[\hat{x} = \arg\min_x \sum_{i,j} \|z_{ij} - h_{ij}(x_i, x_j)\|^2_{\Sigma_{ij}}\]

where $z_{ij}$ is the measurement, $h_{ij}$ is the measurement model, and $\Sigma_{ij}$ is the measurement covariance.
This is typically solved using iterative solvers like Gauss-Newton or Levenberg-Marquardt, implemented in libraries like GTSAM, g2o, or Ceres Solver.
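As a concrete miniature of this optimization, here is a hedged sketch of Gauss-Newton on a toy 1-D pose graph; the poses, measurements, and anchor weight are invented for illustration:

```python
import numpy as np

# Toy 1-D pose graph: four poses along a road. Odometry factors link
# consecutive poses; one loop-closure factor links pose 0 and pose 3.
# Measurement model: h_ij(x) = x_j - x_i ; residual r_ij = z_ij - (x_j - x_i).
edges = [(0, 1, 1.0), (1, 2, 1.1), (2, 3, 1.0),   # odometry (slightly drifted)
         (0, 3, 3.0)]                              # loop closure
x = np.array([0.0, 1.0, 2.1, 3.1])                 # initial guess: raw odometry
w = 1e6                                            # prior weight anchoring pose 0

for _ in range(5):                                 # Gauss-Newton iterations
    H, b = np.zeros((4, 4)), np.zeros(4)           # normal equations J^T J, -J^T r
    for i, j, z in edges:
        r = z - (x[j] - x[i])                      # residual of this factor
        # dr/dx_i = +1, dr/dx_j = -1
        H[i, i] += 1; H[j, j] += 1; H[i, j] -= 1; H[j, i] -= 1
        b[i] -= r; b[j] += r
    # Prior factor fixing the gauge freedom: r0 = 0 - x[0]
    H[0, 0] += w; b[0] += w * (0.0 - x[0])
    x = x + np.linalg.solve(H, b)

print(x)  # the loop closure redistributes the odometry drift
```

Real systems solve the same normal equations with sparse solvers over SE(3) poses, which is precisely what GTSAM, g2o, and Ceres provide.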
During normal operation, the AV doesn’t need to build a map from scratch — it localizes within a pre-existing HD map. This is a simpler problem than full SLAM.
The most common approach: match the current LiDAR scan against the map’s reference point cloud.
Cameras can also localize against a map by matching visual features: features extracted from the live image are matched to map features with known 3D positions, and the camera pose is then recovered by solving the Perspective-n-Point (PnP) problem, typically within a RANSAC loop to reject bad matches.
Different features are useful for different scenarios: lane markings and curbs constrain lateral position, poles and sign posts pin down longitudinal position, and building facades provide constraints in urban areas where road markings are sparse or worn.
Localization is not a one-shot computation — it’s a continuous process of estimating the vehicle’s state (position, velocity, orientation) over time, fusing measurements from multiple sensors.
The EKF is the workhorse of state estimation. It maintains two quantities: a state estimate $\hat{\mathbf{x}}$ (the mean) and a covariance matrix $\mathbf{P}$ (the uncertainty), both propagated forward in time and corrected whenever a measurement arrives.
Predict step (using IMU/odometry): \(\hat{\mathbf{x}}_{k|k-1} = f(\hat{\mathbf{x}}_{k-1}, \mathbf{u}_k)\) \(\hat{\mathbf{P}}_{k|k-1} = \mathbf{F}_k \hat{\mathbf{P}}_{k-1} \mathbf{F}_k^T + \mathbf{Q}_k\)
Update step (using GPS, LiDAR match, etc.): \(\mathbf{K}_k = \hat{\mathbf{P}}_{k|k-1} \mathbf{H}_k^T (\mathbf{H}_k \hat{\mathbf{P}}_{k|k-1} \mathbf{H}_k^T + \mathbf{R}_k)^{-1}\) \(\hat{\mathbf{x}}_k = \hat{\mathbf{x}}_{k|k-1} + \mathbf{K}_k (\mathbf{z}_k - h(\hat{\mathbf{x}}_{k|k-1}))\) \(\hat{\mathbf{P}}_k = (\mathbf{I} - \mathbf{K}_k \mathbf{H}_k) \hat{\mathbf{P}}_{k|k-1}\)
where $\mathbf{K}_k$ is the Kalman gain, $\mathbf{F}_k$ is the state transition Jacobian, $\mathbf{H}_k$ is the observation Jacobian, $\mathbf{Q}_k$ is process noise, and $\mathbf{R}_k$ is measurement noise.
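The predict/update cycle above can be exercised end to end. A minimal sketch using a linear constant-velocity model with GPS position updates (linear, so the Jacobians F and H are exact; the noise values are illustrative tuning assumptions):

```python
import numpy as np

dt = 0.1
# State [x, y, vx, vy] with a constant-velocity motion model
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]], dtype=float)
H = np.array([[1, 0, 0, 0],    # GPS observes position only
              [0, 1, 0, 0]], dtype=float)
Q = np.eye(4) * 1e-3           # process noise (illustrative tuning)
R = np.eye(2) * 2.0 ** 2       # GPS noise: roughly 2 m standard deviation

x = np.zeros(4)                # initial state estimate
P = np.eye(4) * 10.0           # initial uncertainty

def ekf_step(x, P, z):
    # Predict with the motion model
    x = F @ x
    P = F @ P @ F.T + Q
    # Update with the GPS fix z
    y = z - H @ x                          # innovation
    S = H @ P @ H.T + R                    # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)         # Kalman gain
    x = x + K @ y
    P = (np.eye(4) - K @ H) @ P
    return x, P

# Feed noisy GPS fixes from a vehicle moving at 1 m/s along the x axis
rng = np.random.default_rng(1)
for k in range(1, 200):
    z = np.array([k * dt, 0.0]) + rng.normal(0.0, 2.0, size=2)
    x, P = ekf_step(x, P, z)
```

Note how the velocity, although never measured directly, is recovered from the sequence of noisy positions; this is the same mechanism that lets an AV fuse sparse LiDAR-match fixes with high-rate IMU predictions.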
Particle filters represent the probability distribution over possible states using a set of weighted samples (particles). They can handle non-Gaussian distributions and multimodal hypotheses (e.g., ambiguity about which road the vehicle is on).
The algorithm:
1. Predict: propagate each particle through the motion model with sampled noise
2. Update: weight each particle by the likelihood of the current measurement given that particle's state
3. Resample: draw a new particle set with probability proportional to the weights, concentrating particles in likely regions
4. Estimate: report the weighted mean (or the dominant mode) as the state
Particle filters are used for global localization (when the vehicle doesn’t know its initial position) and for handling ambiguous situations.
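A hedged sketch of global localization with a particle filter on a 1-D road (the road layout, landmark positions, and noise levels are invented for illustration; the measurement is the distance to the nearest mapped pole, which is initially ambiguous):

```python
import numpy as np

rng = np.random.default_rng(2)
poles = np.array([10.0, 35.0, 62.0, 80.0])    # mapped landmarks on a 1-D road
N = 2000
particles = rng.uniform(0, 100, N)            # global localization: unknown start

def nearest_pole_dist(x):
    return np.min(np.abs(np.asarray(x)[..., None] - poles), axis=-1)

true_pos = 20.0
for step in range(30):
    # 1. Predict: move each particle with the (noisy) motion model
    true_pos += 1.0
    particles += 1.0 + rng.normal(0, 0.2, N)
    # 2. Update: weight by likelihood of the measured distance-to-nearest-pole
    z = nearest_pole_dist(true_pos) + rng.normal(0, 0.5)
    weights = np.exp(-0.5 * ((z - nearest_pole_dist(particles)) / 0.5) ** 2)
    weights /= weights.sum()
    # 3. Resample particles in proportion to their weights
    particles = particles[rng.choice(N, N, p=weights)]

estimate = particles.mean()
```

Early on, several clusters of particles survive because many road positions are consistent with a single distance reading; as the vehicle moves past poles, only the cluster whose distance profile matches the whole measurement history remains.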
Some approaches aim to reduce or eliminate the dependency on pre-built HD maps:
One approach uses a sparse topological map (such as OpenStreetMap) combined with real-time sensor data: the vehicle detects road boundaries and drivable space with its own sensors, needing only basic connectivity information from the map.
Recent work uses neural networks to learn implicit map representations: instead of storing explicit geometry, the environment is encoded in network weights or learned embeddings, and localization becomes a matter of matching live sensor data against this learned representation.
These approaches are promising because they could enable autonomous driving without expensive HD map creation and maintenance.
Urban canyons (tall buildings blocking satellite signals), tunnels, parking garages, and dense tree cover all degrade GPS accuracy. The system must seamlessly switch to LiDAR/visual localization in these areas.
The world changes: construction zones, seasonal variation (snow covering landmarks), time of day (lighting), and long-term evolution (new buildings). Localization must remain robust to all of them.
Real-time localization at 100+ Hz requires efficient algorithms. LiDAR scan matching is computationally expensive — sub-sampling, voxel grids, and coarse-to-fine strategies are used to meet latency requirements.
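Voxel-grid downsampling is the simplest of these cost reducers: all points falling in the same voxel are replaced by their centroid. A minimal pure-NumPy sketch (production systems use optimized implementations in libraries such as PCL or Open3D):

```python
import numpy as np

def voxel_downsample(points, voxel_size):
    """Replace all points in each occupied voxel by their centroid,
    cutting scan-matching cost with little loss of geometric accuracy."""
    keys = np.floor(points / voxel_size).astype(np.int64)
    # Group points by voxel via a lexicographic sort of the integer keys
    order = np.lexsort(keys.T)
    keys, points = keys[order], points[order]
    boundaries = np.any(np.diff(keys, axis=0) != 0, axis=1)
    starts = np.r_[0, np.nonzero(boundaries)[0] + 1, len(points)]
    return np.array([points[s:e].mean(axis=0)
                     for s, e in zip(starts[:-1], starts[1:])])
```

A 100k-point scan downsampled at 0.5 m voxels typically shrinks by an order of magnitude, which directly reduces the nearest-neighbor workload of ICP or NDT.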
Rain, fog, and snow degrade both visual and LiDAR-based localization. Multi-sensor fusion with IMU dead-reckoning provides continuity through brief outages, but extended degradation remains a challenge.
Localization is the glue that connects perception to planning. Without knowing precisely where the vehicle is, it cannot make correct decisions about lane changes, turns, or stops. The key takeaways: HD maps supply a centimeter-accurate prior; SLAM builds the map and trajectory jointly when no map exists; during operation, scan matching against the map combined with filtering (EKF or particle filter) fuses GPS, IMU, LiDAR, and camera data into a single pose estimate; and the system must degrade gracefully when GPS is denied or the world has changed.
The next chapter covers perception and sensor fusion — how the vehicle builds a complete understanding of its dynamic surroundings.