An autonomous vehicle needs to know not just what is around it, but exactly where it is — to within centimeters. GPS alone provides accuracy of only 2–5 meters, which is insufficient to distinguish between lanes on a highway. This chapter explores the algorithms that achieve centimeter-level localization and the maps that make it possible.
Localization answers the question: “Where am I, precisely, in the world?” For an autonomous vehicle, this means determining its 6 Degrees of Freedom (6-DoF) pose: position (x, y, z) and orientation (roll, pitch, yaw).
The required accuracy depends on the task: staying centered in a lane demands roughly 10 cm of lateral accuracy, stopping precisely at a stop line requires decimeter-level longitudinal accuracy, and route-level navigation can tolerate errors of several meters.
Most autonomous driving systems rely on HD maps — detailed, pre-built maps of the environment that go far beyond consumer navigation maps.
| Layer | Content | Accuracy |
|---|---|---|
| Geometric | 3D point cloud of static environment (buildings, poles, curbs) | ~5 cm |
| Semantic | Lane boundaries, lane types, road edges, crosswalks | ~10 cm |
| Topological | Lane connectivity, intersection structure, merge/diverge points | Logical |
| Regulatory | Speed limits, traffic light positions, stop sign locations, yield rules | Associated with geometry |
| Dynamic | Construction zones, temporary closures (updated frequently) | Variable |
HD maps become stale quickly. Road construction, seasonal changes (snow covering lane markings), and even parked cars can invalidate map assumptions. This is why map maintenance matters as much as map creation: the vehicle must detect when live sensor data disagrees with the map, degrade gracefully, and report the discrepancy so the map can be updated.
Some companies are developing crowd-sourced mapping where production vehicles contribute sensor data to update maps continuously. Mobileye’s REM (Road Experience Management) system is a leading example — millions of vehicles with Mobileye chips upload road landmark data to continuously refresh the map.
SLAM is the problem of building a map of an unknown environment while simultaneously tracking the agent’s position within it. It is a classic chicken-and-egg problem: you need a map to localize, and you need to know your position to build a map.
Visual SLAM uses camera images as the primary sensor. Key algorithms:
ORB-SLAM (2015–2021): One of the most influential visual SLAM systems. It:
- Tracks fast, rotation-invariant ORB features across frames
- Runs three parallel threads: tracking, local mapping, and loop closing
- Detects loop closures and relocalizes using a bag-of-words place-recognition database
- Evolved through ORB-SLAM2 (stereo and RGB-D support) to ORB-SLAM3 (visual-inertial fusion and multi-session maps)
Direct methods (LSD-SLAM, DSO): Instead of extracting features, direct methods minimize the photometric error between frames, using raw pixel intensities. They work well in texture-poor environments where feature extraction fails.
LiDAR SLAM matches 3D point clouds between frames to estimate motion:
ICP (Iterative Closest Point): The foundational algorithm. Given two point clouds, ICP iteratively:
1. Matches each point in the source cloud to its nearest neighbor in the reference cloud
2. Computes the rigid transform $(R, t)$ that minimizes the sum of squared distances between matched pairs (solvable in closed form via SVD)
3. Applies the transform to the source cloud and repeats until the alignment converges
The optimization minimizes:
\[E(R, t) = \sum_{i} \|p_i - (R q_i + t)\|^2\]

where $R$ is the rotation matrix, $t$ is the translation vector, and $(p_i, q_i)$ are corresponding points.
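The objective above can be minimized in closed form once correspondences are fixed, which is exactly what ICP exploits. A minimal NumPy sketch (brute-force nearest-neighbor matching and no outlier rejection, so it assumes a reasonable initial alignment):

```python
import numpy as np

def best_rigid_transform(src, dst):
    """Closed-form least-squares (R, t) aligning src to dst (Kabsch/SVD)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)                  # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.linalg.det(Vt.T @ U.T)])   # guard against reflection
    R = Vt.T @ D @ U.T
    t = dst_c - R @ src_c
    return R, t

def icp(src, dst, iters=20):
    """Point-to-point ICP with brute-force nearest-neighbor correspondences."""
    R_total, t_total = np.eye(3), np.zeros(3)
    cur = src.copy()
    for _ in range(iters):
        # 1. Correspondences: nearest dst point for each current src point
        d2 = ((cur[:, None, :] - dst[None, :, :]) ** 2).sum(-1)
        matched = dst[d2.argmin(axis=1)]
        # 2. Best rigid transform for these pairs
        R, t = best_rigid_transform(cur, matched)
        # 3. Apply and accumulate
        cur = cur @ R.T + t
        R_total, t_total = R @ R_total, R @ t_total + t
    return R_total, t_total
```

In practice a KD-tree replaces the brute-force search, and distance gating or robust kernels reject outlier correspondences.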
NDT (Normal Distributions Transform): Divides the reference point cloud into a grid and fits a normal distribution to the points in each cell. The new scan is aligned by maximizing the likelihood of its points under these distributions. Because it avoids explicit point-to-point correspondences, NDT is typically faster than ICP and more robust to poor initial alignment.
LOAM (LiDAR Odometry and Mapping): Separates the SLAM problem into a fast odometry module (10 Hz) and a slower mapping module (1 Hz). Extracts edge and planar features from point clouds for efficient registration.
LIO-SAM (2020): Tightly couples LiDAR inertial odometry with factor graph optimization, achieving state-of-the-art accuracy and robustness.
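LOAM's edge/planar feature selection is driven by a per-point smoothness score computed over a local window along each scan line. A hedged sketch of that score (it assumes points are already ordered along a single scan line; the window size `k` is an illustrative choice):

```python
import numpy as np

def loam_smoothness(scan, k=5):
    """Per-point smoothness c along one ordered scan line (cf. LOAM):
    c_i = || sum_{j in window} (p_j - p_i) || / (|window| * ||p_i||).
    High c suggests a sharp edge point; low c suggests a planar point."""
    n = len(scan)
    c = np.full(n, np.nan)            # ends of the line have no full window
    for i in range(k, n - k):
        window = np.vstack([scan[i - k:i], scan[i + 1:i + 1 + k]])
        diff = (window - scan[i]).sum(axis=0)   # cancels out on smooth surfaces
        c[i] = np.linalg.norm(diff) / (2 * k * np.linalg.norm(scan[i]))
    return c
```

Points with the highest scores in each sector are kept as edge features and the lowest as planar features, giving a small, well-conditioned set for registration.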
Modern SLAM systems represent the problem as a factor graph: nodes are the unknowns (vehicle poses and landmark positions); factors encode the measurements that constrain them, such as odometry between consecutive poses, landmark observations, loop closures, and GPS fixes.
The optimal map and trajectory are found by solving a nonlinear least-squares optimization:
\[\hat{x} = \arg\min_x \sum_{i,j} \|z_{ij} - h_{ij}(x_i, x_j)\|^2_{\Sigma_{ij}}\]

where $z_{ij}$ is the measurement, $h_{ij}$ is the measurement model, and $\Sigma_{ij}$ is the measurement covariance.
This is typically solved using iterative solvers like Gauss-Newton or Levenberg-Marquardt, implemented in libraries like GTSAM, g2o, or Ceres Solver.
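As a concrete miniature of this optimization, here is a hedged sketch of Gauss-Newton on a toy 1-D pose graph; the poses, measurements, and anchor weight are invented for illustration:

```python
import numpy as np

# Toy 1-D pose graph: four poses along a road. Odometry factors link
# consecutive poses; one loop-closure factor links pose 0 and pose 3.
# Measurement model: h_ij(x) = x_j - x_i ; residual r_ij = z_ij - (x_j - x_i).
edges = [(0, 1, 1.0), (1, 2, 1.1), (2, 3, 1.0),   # odometry (slightly drifted)
         (0, 3, 3.0)]                              # loop closure
x = np.array([0.0, 1.0, 2.1, 3.1])                 # initial guess: raw odometry
w = 1e6                                            # prior weight anchoring pose 0

for _ in range(5):                                 # Gauss-Newton iterations
    H, b = np.zeros((4, 4)), np.zeros(4)           # normal equations J^T J, -J^T r
    for i, j, z in edges:
        r = z - (x[j] - x[i])                      # residual of this factor
        # dr/dx_i = +1, dr/dx_j = -1
        H[i, i] += 1; H[j, j] += 1; H[i, j] -= 1; H[j, i] -= 1
        b[i] -= r; b[j] += r
    # Prior factor fixing the gauge freedom: r0 = 0 - x[0]
    H[0, 0] += w; b[0] += w * (0.0 - x[0])
    x = x + np.linalg.solve(H, b)

print(x)  # the loop closure redistributes the odometry drift
```

Real systems solve the same normal equations with sparse solvers over SE(3) poses, which is precisely what GTSAM, g2o, and Ceres provide.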
During normal operation, the AV doesn’t need to build a map from scratch — it localizes within a pre-existing HD map. This is a simpler problem than full SLAM.
The most common approach: match the current LiDAR scan against the map’s reference point cloud.
Cameras can also localize against a map by matching visual features: features extracted from the live image are matched to map features with known 3D positions, and the camera pose is then recovered by solving the Perspective-n-Point (PnP) problem, typically within a RANSAC loop to reject bad matches.
Different features are useful for different scenarios: lane markings and curbs constrain lateral position, poles and sign posts pin down longitudinal position, and building facades provide constraints in urban areas where road markings are sparse or worn.
Localization is not a one-shot computation — it’s a continuous process of estimating the vehicle’s state (position, velocity, orientation) over time, fusing measurements from multiple sensors.
The EKF is the workhorse of state estimation. It maintains two quantities: a state estimate $\hat{\mathbf{x}}$ (the mean) and a covariance matrix $\mathbf{P}$ (the uncertainty), both propagated forward in time and corrected whenever a measurement arrives.
Predict step (using IMU/odometry): \(\hat{\mathbf{x}}_{k|k-1} = f(\hat{\mathbf{x}}_{k-1}, \mathbf{u}_k)\) \(\hat{\mathbf{P}}_{k|k-1} = \mathbf{F}_k \hat{\mathbf{P}}_{k-1} \mathbf{F}_k^T + \mathbf{Q}_k\)
Update step (using GPS, LiDAR match, etc.): \(\mathbf{K}_k = \hat{\mathbf{P}}_{k|k-1} \mathbf{H}_k^T (\mathbf{H}_k \hat{\mathbf{P}}_{k|k-1} \mathbf{H}_k^T + \mathbf{R}_k)^{-1}\) \(\hat{\mathbf{x}}_k = \hat{\mathbf{x}}_{k|k-1} + \mathbf{K}_k (\mathbf{z}_k - h(\hat{\mathbf{x}}_{k|k-1}))\) \(\hat{\mathbf{P}}_k = (\mathbf{I} - \mathbf{K}_k \mathbf{H}_k) \hat{\mathbf{P}}_{k|k-1}\)
where $\mathbf{K}_k$ is the Kalman gain, $\mathbf{F}_k$ is the state transition Jacobian, $\mathbf{H}_k$ is the observation Jacobian, $\mathbf{Q}_k$ is process noise, and $\mathbf{R}_k$ is measurement noise.
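The predict/update cycle above can be exercised end to end. A minimal sketch using a linear constant-velocity model with GPS position updates (linear, so the Jacobians F and H are exact; the noise values are illustrative tuning assumptions):

```python
import numpy as np

dt = 0.1
# State [x, y, vx, vy] with a constant-velocity motion model
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]], dtype=float)
H = np.array([[1, 0, 0, 0],    # GPS observes position only
              [0, 1, 0, 0]], dtype=float)
Q = np.eye(4) * 1e-3           # process noise (illustrative tuning)
R = np.eye(2) * 2.0 ** 2       # GPS noise: roughly 2 m standard deviation

x = np.zeros(4)                # initial state estimate
P = np.eye(4) * 10.0           # initial uncertainty

def ekf_step(x, P, z):
    # Predict with the motion model
    x = F @ x
    P = F @ P @ F.T + Q
    # Update with the GPS fix z
    y = z - H @ x                          # innovation
    S = H @ P @ H.T + R                    # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)         # Kalman gain
    x = x + K @ y
    P = (np.eye(4) - K @ H) @ P
    return x, P

# Feed noisy GPS fixes from a vehicle moving at 1 m/s along the x axis
rng = np.random.default_rng(1)
for k in range(1, 200):
    z = np.array([k * dt, 0.0]) + rng.normal(0.0, 2.0, size=2)
    x, P = ekf_step(x, P, z)
```

Note how the velocity, although never measured directly, is recovered from the sequence of noisy positions; this is the same mechanism that lets an AV fuse sparse LiDAR-match fixes with high-rate IMU predictions.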
Particle filters represent the probability distribution over possible states using a set of weighted samples (particles). They can handle non-Gaussian distributions and multimodal hypotheses (e.g., ambiguity about which road the vehicle is on).
The algorithm:
1. Predict: propagate each particle through the motion model with sampled noise
2. Update: weight each particle by the likelihood of the current measurement given that particle's state
3. Resample: draw a new particle set with probability proportional to the weights, concentrating particles in likely regions
4. Estimate: report the weighted mean (or the dominant mode) as the state
Particle filters are used for global localization (when the vehicle doesn’t know its initial position) and for handling ambiguous situations.
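A hedged sketch of global localization with a particle filter on a 1-D road (the road layout, landmark positions, and noise levels are invented for illustration; the measurement is the distance to the nearest mapped pole, which is initially ambiguous):

```python
import numpy as np

rng = np.random.default_rng(2)
poles = np.array([10.0, 35.0, 62.0, 80.0])    # mapped landmarks on a 1-D road
N = 2000
particles = rng.uniform(0, 100, N)            # global localization: unknown start

def nearest_pole_dist(x):
    return np.min(np.abs(np.asarray(x)[..., None] - poles), axis=-1)

true_pos = 20.0
for step in range(30):
    # 1. Predict: move each particle with the (noisy) motion model
    true_pos += 1.0
    particles += 1.0 + rng.normal(0, 0.2, N)
    # 2. Update: weight by likelihood of the measured distance-to-nearest-pole
    z = nearest_pole_dist(true_pos) + rng.normal(0, 0.5)
    weights = np.exp(-0.5 * ((z - nearest_pole_dist(particles)) / 0.5) ** 2)
    weights /= weights.sum()
    # 3. Resample particles in proportion to their weights
    particles = particles[rng.choice(N, N, p=weights)]

estimate = particles.mean()
```

Early on, several clusters of particles survive because many road positions are consistent with a single distance reading; as the vehicle moves past poles, only the cluster whose distance profile matches the whole measurement history remains.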
Some approaches aim to reduce or eliminate the dependency on pre-built HD maps:
One approach uses a sparse topological map (such as OpenStreetMap) combined with real-time sensor data: the vehicle detects road boundaries and drivable space with its own sensors, needing only basic connectivity information from the map.
Recent work uses neural networks to learn implicit map representations: instead of storing explicit geometry, the environment is encoded in network weights or learned embeddings, and localization becomes a matter of matching live sensor data against this learned representation.
These approaches are promising because they could enable autonomous driving without expensive HD map creation and maintenance.
Urban canyons (tall buildings blocking satellite signals), tunnels, parking garages, and dense tree cover all degrade GPS accuracy. The system must seamlessly switch to LiDAR/visual localization in these areas.
The world changes: construction zones, seasonal variation (snow covering landmarks), time of day (lighting), and long-term evolution (new buildings). Localization must remain robust to all of them.
Real-time localization at 100+ Hz requires efficient algorithms. LiDAR scan matching is computationally expensive — sub-sampling, voxel grids, and coarse-to-fine strategies are used to meet latency requirements.
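Voxel-grid downsampling is the simplest of these cost reducers: all points falling in the same voxel are replaced by their centroid. A minimal pure-NumPy sketch (production systems use optimized implementations in libraries such as PCL or Open3D):

```python
import numpy as np

def voxel_downsample(points, voxel_size):
    """Replace all points in each occupied voxel by their centroid,
    cutting scan-matching cost with little loss of geometric accuracy."""
    keys = np.floor(points / voxel_size).astype(np.int64)
    # Group points by voxel via a lexicographic sort of the integer keys
    order = np.lexsort(keys.T)
    keys, points = keys[order], points[order]
    boundaries = np.any(np.diff(keys, axis=0) != 0, axis=1)
    starts = np.r_[0, np.nonzero(boundaries)[0] + 1, len(points)]
    return np.array([points[s:e].mean(axis=0)
                     for s, e in zip(starts[:-1], starts[1:])])
```

A 100k-point scan downsampled at 0.5 m voxels typically shrinks by an order of magnitude, which directly reduces the nearest-neighbor workload of ICP or NDT.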
Rain, fog, and snow degrade both visual and LiDAR-based localization. Multi-sensor fusion with IMU dead-reckoning provides continuity through brief outages, but extended degradation remains a challenge.
Localization is the glue that connects perception to planning. Without knowing precisely where the vehicle is, it cannot make correct decisions about lane changes, turns, or stops. The key takeaways: HD maps supply a centimeter-accurate prior; SLAM builds the map and trajectory jointly when no map exists; during operation, scan matching against the map combined with filtering (EKF or particle filter) fuses GPS, IMU, LiDAR, and camera data into a single pose estimate; and the system must degrade gracefully when GPS is denied or the world has changed.
The next chapter covers perception and sensor fusion — how the vehicle builds a complete understanding of its dynamic surroundings.