Chapter 3: Localization and Mapping

An autonomous vehicle needs to know not just what is around it, but exactly where it is — to within centimeters. GPS alone provides accuracy of only 2–5 meters, which is insufficient to distinguish between lanes on a highway. This chapter explores the algorithms that achieve centimeter-level localization and the maps that make it possible.

The Localization Problem

Localization answers the question: “Where am I, precisely, in the world?” For an autonomous vehicle, this means determining its 6 Degrees of Freedom (6-DoF) pose: position (x, y, z) and orientation (roll, pitch, yaw).

The required accuracy depends on the task: staying centered in a 3.7 m highway lane with a roughly 2 m wide vehicle leaves well under a meter of margin per side, so lane-level driving typically demands on the order of 10 cm of lateral accuracy, while turn-by-turn route guidance tolerates several meters.

High-Definition Maps (HD Maps)

Most autonomous driving systems rely on HD maps — detailed, pre-built maps of the environment that go far beyond consumer navigation maps.

What HD Maps Contain

| Layer | Content | Accuracy |
|---|---|---|
| Geometric | 3D point cloud of static environment (buildings, poles, curbs) | ~5 cm |
| Semantic | Lane boundaries, lane types, road edges, crosswalks | ~10 cm |
| Topological | Lane connectivity, intersection structure, merge/diverge points | Logical |
| Regulatory | Speed limits, traffic light positions, stop sign locations, yield rules | Associated with geometry |
| Dynamic | Construction zones, temporary closures (updated frequently) | Variable |
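To make the layering concrete, here is a hypothetical sketch of how such layers might be organized in software. The dataclasses and field names below are purely illustrative, not any vendor's actual schema:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class LaneBoundary:
    """Semantic layer: a polyline of map-frame points (~10 cm accuracy)."""
    points: List[Tuple[float, float, float]]   # (x, y, z) in the map frame
    boundary_type: str                         # e.g. "solid_white", "curb"

@dataclass
class RegulatoryElement:
    """Regulatory layer: a rule associated with a geometric position."""
    kind: str                                  # e.g. "speed_limit", "stop_sign"
    position: Tuple[float, float, float]
    value: Optional[float] = None              # speed limit in m/s; None for signs

@dataclass
class LaneSegment:
    """Topological layer: one lane and its connectivity."""
    lane_id: int
    left: LaneBoundary
    right: LaneBoundary
    successor_ids: List[int] = field(default_factory=list)
    regulatory: List[RegulatoryElement] = field(default_factory=list)
```

Note how the regulatory and topological layers reference geometry rather than duplicating it, mirroring the "associated with geometry" entry in the table.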

How HD Maps Are Built

  1. Data collection: Mapping vehicles equipped with high-end LiDAR, cameras, GPS/IMU, and wheel odometry drive every road in the operating area multiple times.
  2. SLAM processing: Simultaneous Localization and Mapping algorithms align all the sensor data into a coherent 3D map.
  3. Annotation: Human annotators (and increasingly, automated tools) label lane boundaries, traffic signs, curbs, and other semantic features.
  4. Validation: Quality assurance to ensure accuracy and completeness.
  5. Continuous updates: Maps must be updated as the world changes — new construction, repainted lanes, new traffic signals.

The Map Maintenance Challenge

HD maps become stale quickly. Road construction, seasonal changes (snow covering lane markings), and even parked cars can invalidate map assumptions. This is why map providers treat maintenance as a continuous process rather than a one-time build.

Some companies are developing crowd-sourced mapping where production vehicles contribute sensor data to update maps continuously. Mobileye’s REM (Road Experience Management) system is a leading example — millions of vehicles with Mobileye chips upload road landmark data to continuously refresh the map.

Simultaneous Localization and Mapping (SLAM)

SLAM is the problem of building a map of an unknown environment while simultaneously tracking the agent’s position within it. It is a classic chicken-and-egg problem: you need a map to localize, and you need to know your position to build a map.

Visual SLAM

Visual SLAM uses camera images as the primary sensor. Key algorithms:

ORB-SLAM (2015, with successors ORB-SLAM2 and ORB-SLAM3 through 2021): One of the most influential visual SLAM systems. It:

  1. Extracts ORB (Oriented FAST and Rotated BRIEF) features from each frame
  2. Matches features between frames to estimate camera motion
  3. Triangulates 3D landmark positions from matched features
  4. Optimizes the trajectory and map jointly using bundle adjustment
  5. Detects loop closures — recognizing previously visited places to correct accumulated drift

Direct methods (LSD-SLAM, DSO): Instead of extracting features, direct methods minimize the photometric error between frames, using raw pixel intensities. They can work in texture-poor environments where corner-based feature extraction fails, provided the images still contain intensity gradients.

LiDAR SLAM

LiDAR SLAM matches 3D point clouds between frames to estimate motion:

ICP (Iterative Closest Point): The foundational algorithm. Given two point clouds, ICP iteratively:

  1. Finds the closest point in cloud B for each point in cloud A
  2. Computes the rigid transformation (rotation + translation) that minimizes the sum of squared distances
  3. Applies the transformation and repeats until convergence

The optimization minimizes:

\[E(R, t) = \sum_{i} \|p_i - (R q_i + t)\|^2\]

where $R$ is the rotation matrix, $t$ is the translation vector, and $(p_i, q_i)$ are corresponding points.
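The loop above can be sketched in a few lines of NumPy. This is a minimal point-to-point variant with brute-force nearest neighbors and no outlier rejection; real systems add correspondence gating, point-to-plane error metrics, and k-d trees:

```python
import numpy as np

def icp(source, target, iters=30):
    """Minimal point-to-point ICP: returns (R, t) aligning source onto target."""
    dim = source.shape[1]
    R, t = np.eye(dim), np.zeros(dim)
    src = source.copy()
    for _ in range(iters):
        # 1. closest target point for each source point (brute force)
        d2 = ((src[:, None, :] - target[None, :, :]) ** 2).sum(-1)
        matched = target[d2.argmin(axis=1)]
        # 2. rigid transform minimizing summed squared distances (Kabsch/SVD)
        mu_s, mu_m = src.mean(0), matched.mean(0)
        U, _, Vt = np.linalg.svd((src - mu_s).T @ (matched - mu_m))
        D = np.eye(dim)
        D[-1, -1] = np.sign(np.linalg.det(Vt.T @ U.T))  # avoid reflections
        R_step = Vt.T @ D @ U.T
        t_step = mu_m - R_step @ mu_s
        # 3. apply the step, accumulate the total transform, repeat
        src = src @ R_step.T + t_step
        R, t = R_step @ R, R_step @ t + t_step
    return R, t
```

Because correspondences are recomputed greedily each iteration, ICP only converges to the correct alignment when the initial misalignment is small, which is why a GPS/IMU prior is used to seed it.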

NDT (Normal Distributions Transform): Divides the reference point cloud into a grid and fits a normal distribution to the points in each cell. The new scan is aligned by maximizing the likelihood of its points under these distributions. Because the cost function is smooth and avoids explicit point correspondences, NDT is typically faster than ICP and more robust to imperfect initial alignment.
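A minimal 2-D sketch of the NDT representation and its score function follows. The function names are illustrative, and a real NDT implementation additionally optimizes the pose to maximize this score (e.g., with Newton's method) rather than just evaluating it:

```python
import numpy as np

def build_ndt(points, cell_size=1.0, min_pts=5):
    """Fit a Gaussian (mean, inverse covariance) to the points in each grid cell."""
    cells = {}
    keys = np.floor(points / cell_size).astype(int)
    for key in set(map(tuple, keys)):
        pts = points[np.all(keys == key, axis=1)]
        if len(pts) >= min_pts:
            cov = np.cov(pts.T) + 1e-3 * np.eye(points.shape[1])  # regularize
            cells[key] = (pts.mean(axis=0), np.linalg.inv(cov))
    return cells

def ndt_score(cells, scan, cell_size=1.0):
    """Sum of per-point Gaussian terms; higher means better alignment."""
    score = 0.0
    for p in scan:
        key = tuple(np.floor(p / cell_size).astype(int))
        if key in cells:
            mu, cov_inv = cells[key]
            d = p - mu
            score += np.exp(-0.5 * d @ cov_inv @ d)
    return score
```

A well-aligned scan lands inside modeled cells near their means and scores highly; a misaligned scan falls into empty or poorly matching cells and scores near zero.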

LOAM (LiDAR Odometry and Mapping): Separates the SLAM problem into a fast odometry module (10 Hz) and a slower mapping module (1 Hz). Extracts edge and planar features from point clouds for efficient registration.

LIO-SAM (2020): Tightly couples LiDAR inertial odometry with factor graph optimization, achieving state-of-the-art accuracy and robustness.

Graph-Based SLAM

Modern SLAM systems represent the problem as a factor graph: nodes represent vehicle poses (and optionally landmarks), while factors encode the measurements that constrain them, such as odometry between consecutive poses, loop closures between revisited places, and GPS fixes on individual poses.

The optimal map and trajectory are found by solving a nonlinear least-squares optimization:

\[\hat{x} = \arg\min_x \sum_{i,j} \|z_{ij} - h_{ij}(x_i, x_j)\|^2_{\Sigma_{ij}}\]

where $z_{ij}$ is the measurement, $h_{ij}$ is the measurement model, and $\Sigma_{ij}$ is the measurement covariance.

This is typically solved using iterative solvers like Gauss-Newton or Levenberg-Marquardt, implemented in libraries like GTSAM, g2o, or Ceres Solver.
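The structure of this optimization is easiest to see in one dimension. The toy example below builds a four-pose graph with odometry factors and one loop-closure factor whose measurement disagrees with the summed odometry, then solves it with a hand-rolled Gauss-Newton iteration. In 1-D the problem is linear, so it converges in one step; GTSAM, g2o, and Ceres do the analogous computation for full 6-DoF poses:

```python
import numpy as np

# Factors: (i, j, z_ij, weight), with measurement model h_ij(x) = x_j - x_i.
# The loop closure (0 -> 3) measures 2.9 m, but odometry sums to 3.1 m.
measurements = [
    (0, 1, 1.0, 1.0),
    (1, 2, 1.1, 1.0),
    (2, 3, 1.0, 1.0),
    (0, 3, 2.9, 1.0),   # loop closure
]

def solve_pose_graph(x0, measurements, iters=5):
    x = x0.copy()
    n = len(x)
    for _ in range(iters):
        H = np.zeros((n, n))        # Gauss-Newton normal matrix J^T W J
        g = np.zeros(n)             # gradient J^T W r
        H[0, 0] += 1e6              # strong prior factor pinning x[0] near 0
        g[0] += 1e6 * x[0]
        for i, j, z, w in measurements:
            r = (x[j] - x[i]) - z   # residual; Jacobian is -1 wrt x_i, +1 wrt x_j
            H[i, i] += w; H[j, j] += w
            H[i, j] -= w; H[j, i] -= w
            g[i] -= w * r
            g[j] += w * r
        x += np.linalg.solve(H, -g) # Gauss-Newton update
    return x
```

The solver spreads the 0.2 m disagreement evenly: each odometry edge is shortened by 0.05 m and the loop closure is stretched by 0.05 m, yielding poses near [0, 0.95, 2.0, 2.95].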

Map-Based Localization

During normal operation, the AV doesn’t need to build a map from scratch — it localizes within a pre-existing HD map. This is a simpler problem than full SLAM.

Point Cloud Matching

The most common approach: match the current LiDAR scan against the map’s reference point cloud.

  1. Preprocessing: Remove dynamic objects (cars, pedestrians) from the current scan — we only want to match against static structures.
  2. Initial estimate: Use GPS + IMU to get a rough position estimate.
  3. Scan matching: Align the current scan to the map using ICP or NDT.
  4. Refinement: The matched pose is fused with IMU and wheel odometry using an Extended Kalman Filter (EKF) or factor graph.

Visual Localization

Cameras can also localize against a map by matching visual features:

  1. Landmark matching: Detect landmarks in the image (traffic signs, poles, building corners) and match them to landmarks stored in the map.
  2. Image retrieval: Find the most similar stored image using visual place recognition (e.g., NetVLAD), then establish local feature correspondences with a matcher such as SuperGlue.
  3. Pose estimation: Solve the PnP (Perspective-n-Point) problem to recover the camera pose from 2D-3D correspondences.
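A bare-bones Direct Linear Transform (DLT) solution to PnP can be written with NumPy alone. This sketch assumes at least six correspondences, normalized image coordinates (pixels already multiplied by the inverse intrinsics), and noiseless data; production systems instead use a minimal P3P solver inside RANSAC followed by nonlinear refinement, as in OpenCV's solvePnP:

```python
import numpy as np

def pnp_dlt(X, x):
    """Recover camera pose (R, t) from n >= 6 correspondences between
    3D map points X (n, 3) and normalized image points x (n, 2)."""
    n = X.shape[0]
    A = np.zeros((2 * n, 12))
    for i in range(n):
        Xh = np.append(X[i], 1.0)       # homogeneous 3D point
        u, v = x[i]
        A[2 * i, 4:8] = -Xh             # -p2.X + v p3.X = 0
        A[2 * i, 8:12] = v * Xh
        A[2 * i + 1, 0:4] = Xh          #  p1.X - u p3.X = 0
        A[2 * i + 1, 8:12] = -u * Xh
    _, _, Vt = np.linalg.svd(A)
    P = Vt[-1].reshape(3, 4)            # null-space solution, scale ambiguous
    M = P[:, :3]
    if np.linalg.det(M) < 0:            # fix the overall sign
        P, M = -P, -M
    U, S, Vt2 = np.linalg.svd(M)
    R = U @ Vt2                         # nearest rotation matrix to M
    t = P[:, 3] / S.mean()              # divide out the scale
    return R, t
```

With exact correspondences the DLT recovers the pose to numerical precision; with noise it serves only as an initialization for iterative refinement.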

Feature Types for Localization

Different features are useful for different scenarios: poles and traffic signs are compact and stable over months; lane markings are ubiquitous but fade and get repainted; building facades provide rich structure in cities but are absent on rural roads.

State Estimation and Sensor Fusion

Localization is not a one-shot computation — it’s a continuous process of estimating the vehicle’s state (position, velocity, orientation) over time, fusing measurements from multiple sensors.

Extended Kalman Filter (EKF)

The EKF is the workhorse of state estimation. It maintains a Gaussian belief over the state: a mean estimate $\hat{\mathbf{x}}$ and a covariance matrix $\hat{\mathbf{P}}$ quantifying its uncertainty, propagated through two alternating steps.

Predict step (using IMU/odometry):

\[\hat{\mathbf{x}}_{k|k-1} = f(\hat{\mathbf{x}}_{k-1}, \mathbf{u}_k)\]

\[\hat{\mathbf{P}}_{k|k-1} = \mathbf{F}_k \hat{\mathbf{P}}_{k-1} \mathbf{F}_k^T + \mathbf{Q}_k\]

Update step (using GPS, LiDAR match, etc.):

\[\mathbf{K}_k = \hat{\mathbf{P}}_{k|k-1} \mathbf{H}_k^T (\mathbf{H}_k \hat{\mathbf{P}}_{k|k-1} \mathbf{H}_k^T + \mathbf{R}_k)^{-1}\]

\[\hat{\mathbf{x}}_k = \hat{\mathbf{x}}_{k|k-1} + \mathbf{K}_k (\mathbf{z}_k - h(\hat{\mathbf{x}}_{k|k-1}))\]

\[\hat{\mathbf{P}}_k = (\mathbf{I} - \mathbf{K}_k \mathbf{H}_k) \hat{\mathbf{P}}_{k|k-1}\]

where $\mathbf{K}_k$ is the Kalman gain, $\mathbf{F}_k$ is the state transition Jacobian, $\mathbf{H}_k$ is the observation Jacobian, $\mathbf{Q}_k$ is process noise, and $\mathbf{R}_k$ is measurement noise.
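To make the symbols concrete, here is a minimal linear instance: a constant-velocity model observed through position-only (GPS-style) measurements. Since $f$ and $h$ are linear here, the EKF reduces to the standard Kalman filter; all numeric values are illustrative:

```python
import numpy as np

dt = 0.1
F = np.array([[1.0, dt], [0.0, 1.0]])   # state transition for [position, velocity]
Q = np.diag([0.01, 0.01])               # process noise covariance
H = np.array([[1.0, 0.0]])              # we observe position only
R = np.array([[0.25]])                  # measurement noise (0.5 m std dev)

def predict(x, P):
    """Predict step: propagate mean and covariance through the motion model."""
    return F @ x, F @ P @ F.T + Q

def update(x, P, z):
    """Update step: correct the prediction with a position measurement z."""
    y = z - H @ x                        # innovation
    S = H @ P @ H.T + R                  # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)       # Kalman gain
    x = x + K @ y
    P = (np.eye(len(x)) - K @ H) @ P
    return x, P
```

Note that velocity is never measured directly; the filter infers it through the cross-covariance between position and velocity built up by the predict step.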

Particle Filters

Particle filters represent the probability distribution over possible states using a set of weighted samples (particles). They can handle non-Gaussian distributions and multimodal hypotheses (e.g., ambiguity about which road the vehicle is on).

The algorithm:

  1. Sample: Generate particles by propagating previous particles through the motion model with noise
  2. Weight: Assign weights based on how well each particle’s predicted observation matches the actual observation
  3. Resample: Duplicate high-weight particles and remove low-weight ones

Particle filters are used for global localization (when the vehicle doesn’t know its initial position) and for handling ambiguous situations.
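The sample-weight-resample loop is compact enough to show in full. Below is a minimal 1-D global-localization sketch: the vehicle starts with no idea of its position along a 100 m stretch of road and receives position-like measurements; the noise parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def particle_filter_step(particles, control, measurement, meas_std=0.5):
    """One sample-weight-resample cycle for 1-D position particles."""
    # 1. Sample: propagate through the motion model with noise
    particles = particles + control + rng.normal(0.0, 0.1, size=len(particles))
    # 2. Weight: Gaussian likelihood of the position measurement
    w = np.exp(-0.5 * ((particles - measurement) / meas_std) ** 2)
    w /= w.sum()
    # 3. Resample: draw particles in proportion to their weights
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return particles[idx]
```

Starting from particles spread uniformly over the road, a few measurement updates collapse the cloud onto the true position, which is exactly the global-localization behavior an EKF (with its single Gaussian hypothesis) cannot provide.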

Map-Free Localization

Some approaches aim to reduce or eliminate the dependency on pre-built HD maps:

MapLite (MIT CSAIL, 2018)

Uses a sparse topological map (like OpenStreetMap) combined with real-time sensor data. The vehicle uses its sensors to detect road boundaries and navigate, needing only basic connectivity information from the map.

Neural Map Priors

Recent work uses neural networks to learn implicit map representations, encoding lane geometry and landmark layout in network weights that are queried online rather than stored as explicit point clouds and polylines.

These approaches are promising because they could enable autonomous driving without expensive HD map creation and maintenance.

Challenges in Localization

GPS-Denied Environments

Urban canyons (tall buildings blocking satellite signals), tunnels, parking garages, and dense tree cover all degrade GPS accuracy. The system must seamlessly switch to LiDAR/visual localization in these areas.

Dynamic Environments

The world changes: construction zones, seasonal changes (snow covering landmarks), time of day (lighting changes), and long-term changes (new buildings). Localization must be robust to these changes.

Computational Constraints

Real-time operation requires efficient algorithms: the fused state estimate is typically published at IMU rate (100 Hz or more), with slower LiDAR or visual corrections folded in as they arrive. LiDAR scan matching is computationally expensive, so sub-sampling, voxel grids, and coarse-to-fine strategies are used to meet latency requirements.
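One of these tricks, voxel-grid downsampling, is easy to sketch: replace all points in each voxel with their centroid before scan matching, capping the number of points regardless of raw scan density. This is a NumPy-only illustration; libraries like PCL and Open3D provide optimized implementations:

```python
import numpy as np

def voxel_downsample(points, voxel=0.2):
    """Replace all points falling in the same voxel with their centroid."""
    keys = np.floor(points / voxel).astype(np.int64)
    sums, counts = {}, {}
    for key, p in zip(map(tuple, keys), points):
        sums[key] = sums.get(key, 0.0) + p     # running sum per voxel
        counts[key] = counts.get(key, 0) + 1   # point count per voxel
    return np.array([sums[k] / counts[k] for k in sums])
```

Choosing the voxel size trades accuracy against latency: coarse voxels for a fast initial alignment, fine voxels for the final refinement, which is the coarse-to-fine strategy mentioned above.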

Degraded Conditions

Rain, fog, and snow degrade both visual and LiDAR-based localization. Multi-sensor fusion with IMU dead-reckoning provides continuity through brief outages, but extended degradation remains a challenge.

Summary

Localization is the glue that connects perception to planning. Without knowing precisely where the vehicle is, it cannot make correct decisions about lane changes, turns, or stops. The key takeaways:

  1. HD maps provide a rich prior about the environment but are expensive to create and maintain
  2. SLAM algorithms build maps and localize simultaneously, using LiDAR, cameras, and IMU
  3. Map-based localization matches current sensor data against pre-built maps for centimeter accuracy
  4. Sensor fusion (EKF, factor graphs) combines GPS, IMU, LiDAR, and visual measurements for robust state estimation
  5. Map-free approaches are emerging but not yet at the accuracy level of map-based systems

The next chapter covers perception and sensor fusion — how the vehicle builds a complete understanding of its dynamic surroundings.

