The modular autonomous driving stack — perception, localization, prediction, planning, control — has been the dominant paradigm for decades. But a revolutionary alternative is gaining ground: end-to-end learning, where a single neural network learns to map directly from raw sensor input to driving actions. This chapter explores this paradigm shift.
In an end-to-end system, the entire driving pipeline is replaced (or augmented) by a single learned model:
Traditional: Sensors → Perception → Prediction → Planning → Control → Actuators
End-to-End: Sensors → Neural Network → Actuators (or waypoints)
The neural network learns all intermediate representations implicitly through training, without human-designed modules for detection, tracking, or path planning.
The first end-to-end driving system was ALVINN (Autonomous Land Vehicle In a Neural Network), built by Dean Pomerleau at Carnegie Mellon in 1989. A simple 3-layer neural network mapped 30×32 camera images directly to steering commands. Trained by observing human driving, it successfully drove at speeds of up to 70 mph on highways.
NVIDIA demonstrated a modern end-to-end approach:
The key insight from PilotNet: visualization of the network’s attention (using saliency maps) showed it naturally learned to focus on lane markings, road edges, and other relevant features — without ever being told what these were.
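The idea behind those saliency maps can be shown in miniature. In this sketch (the linear "model" and its weights are invented for illustration), the gradient of the output with respect to each input feature, taken by central finite differences, reveals which features the model relies on:

```python
import numpy as np

# Miniature saliency sketch (toy linear model, invented weights): features
# with large |d output / d input| are the ones the model "attends" to --
# the analogue of lane markings lighting up in PilotNet's visualizations.
w = np.array([0.0, 2.0, 0.0, -1.0])                # toy model: output = w . x
model = lambda x: float(w @ x)

x0, eps = np.ones(4), 1e-6
saliency = np.array([
    (model(x0 + eps * np.eye(4)[i]) - model(x0 - eps * np.eye(4)[i])) / (2 * eps)
    for i in range(4)
])                                                 # equals w for a linear model
```

For a deep network the same per-pixel gradient is computed by backpropagation rather than finite differences, but the interpretation is identical.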
The simplest end-to-end approach: directly copy human behavior.
Training: Collect (observation, action) pairs from human driving, then train a neural network to predict actions from observations using supervised learning:
\[\mathcal{L} = \sum_{i} \|a_i - \pi_\theta(o_i)\|^2\]

where $a_i$ is the human driver’s action, $o_i$ is the observation, and $\pi_\theta$ is the learned policy.
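A minimal sketch of this supervised objective, assuming a toy linear policy and synthetic numpy data in place of real driving logs:

```python
import numpy as np

# Minimal behavior-cloning sketch: a linear policy pi(o) = o @ W is fit to
# (observation, action) pairs by minimizing the squared-error loss above.
# The "expert" is a synthetic linear map -- a toy stand-in, not real data.
rng = np.random.default_rng(0)

obs_dim, act_dim, n = 8, 2, 256
W_expert = rng.normal(size=(obs_dim, act_dim))   # stand-in for the human driver
obs = rng.normal(size=(n, obs_dim))              # observations o_i
acts = obs @ W_expert                            # expert actions a_i

W = np.zeros((obs_dim, act_dim))                 # policy parameters theta
lr = 0.05
for _ in range(200):
    pred = obs @ W
    W -= lr * (2.0 / n) * obs.T @ (pred - acts)  # gradient of the mean MSE

mse = float(np.mean((obs @ W - acts) ** 2))      # near zero after training
```

In practice the policy is a deep network trained with stochastic gradient descent, but the objective is exactly this squared-error imitation loss.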
Distribution shift problem: The fundamental challenge. The model is trained on the distribution of states visited by the human driver. But when the model drives, any small error causes it to visit states the human never encountered. In these unseen states, the model may behave unpredictably, leading to compounding errors.
Example: If the training data only includes driving in the center of the lane, the model never learns how to recover from being off-center. When a small perturbation pushes it off-center, it doesn’t know how to return.
DAgger addresses distribution shift iteratively:
This progressively exposes the model to its own mistakes and teaches it to recover.
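The DAgger loop can be sketched on a toy 1-D lane-keeping task (the dynamics, expert gain, and all names here are invented for illustration): roll out the current policy, have the expert label the states the policy actually visits, aggregate those labels into the dataset, and retrain:

```python
import numpy as np

# Toy 1-D lane-keeping DAgger sketch (dynamics and gain invented): the state
# is the lateral offset from lane center; the expert steers back toward
# center with gain 0.5.
rng = np.random.default_rng(1)
expert = lambda s: -0.5 * s

def rollout(policy, steps=20, noise=0.05):
    """Drive with `policy` from the lane center; return visited states."""
    states, s = [], 0.0
    for _ in range(steps):
        states.append(s)
        s = s + policy(s) + noise * rng.normal()   # simple additive dynamics
    return np.array(states)

# Start from plain behavior cloning on states the expert visits.
data_s = rollout(expert)
data_a = expert(data_s)

theta = 0.0                                        # learned policy: a = theta*s
for _ in range(5):                                 # DAgger iterations
    theta = float(np.sum(data_s * data_a) / np.sum(data_s ** 2))  # 1-D fit
    new_s = rollout(lambda s: theta * s)           # states the *learner* visits
    new_a = expert(new_s)                          # expert relabels them
    data_s = np.concatenate([data_s, new_s])       # aggregate the dataset
    data_a = np.concatenate([data_a, new_a])
```

The crucial difference from plain behavior cloning is the relabeling step: the expert is queried on the learner's own state distribution, so recovery behavior enters the training set.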
A navigation command (turn left, turn right, go straight, follow lane) is provided as input to handle intersections where the correct action depends on the intended route:
\[a = \pi_\theta(o, c)\]

where $c$ is the high-level command. Without this conditioning, the model would average the left- and right-turn behaviors seen at intersections and simply drive straight.
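One common way to feed the command in (an illustrative design choice, not a prescribed architecture) is to one-hot encode $c$ and concatenate it with the observation:

```python
import numpy as np

# Sketch of command conditioning (command names and the 2-D observation are
# illustrative): the high-level command c is one-hot encoded and
# concatenated with the observation before it enters the policy network.
COMMANDS = ["follow_lane", "go_straight", "turn_left", "turn_right"]

def conditioned_input(obs, command):
    onehot = np.zeros(len(COMMANDS))
    onehot[COMMANDS.index(command)] = 1.0
    return np.concatenate([obs, onehot])           # the policy sees (o, c)

x = conditioned_input(np.array([0.1, -0.2]), "turn_left")
```

Other designs gate separate network branches on the command rather than concatenating it, but the effect is the same: the ambiguity at intersections is resolved by the route.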
Model the driving task as a Markov Decision Process:
The goal is to learn a policy $\pi(a \mid s)$ that maximizes the expected cumulative reward:

\[J(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{T} \gamma^t \, r(s_t, a_t)\right]\]

where $\gamma \in [0, 1)$ is a discount factor and $r(s_t, a_t)$ is the reward received at time $t$.
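As a small worked example of the cumulative-reward objective (the discount factor and reward sequence are illustrative), the discounted return of a single short trajectory is:

```python
# Worked example of a discounted cumulative reward (gamma and the rewards
# are illustrative; the large negative reward stands for a near-miss).
gamma = 0.99
rewards = [1.0, 1.0, -5.0, 1.0]
ret = sum(gamma ** t * r for t, r in enumerate(rewards))
```

The policy is updated to make high-return trajectories more likely, rather than to copy any particular expert action.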
To avoid the safety issues of online RL, offline RL learns from a fixed dataset of driving logs (similar to imitation learning) but optimizes a reward function rather than just copying the expert:
Tesla’s FSD v12 represents a watershed moment for end-to-end driving. Released in March 2024, it replaces over 300,000 lines of C++ planning code with a single neural network:
Key observations from FSD v12:
UniAD by Shanghai AI Lab demonstrates a fully differentiable, multi-task end-to-end framework:
All modules are jointly trained end-to-end. The key finding: joint training improves performance across all tasks compared to training each module separately. Planning benefits from better prediction, which benefits from better tracking, which benefits from better perception.
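The joint objective can be sketched as a sum of per-module losses driving a single backward pass (the module names follow UniAD's tasks, but the loss values and uniform weights here are illustrative):

```python
# Sketch of a joint multi-task objective: per-module losses (values and
# uniform weights illustrative) are summed so that one backward pass
# updates tracking, mapping, prediction, occupancy, and planning together.
task_losses = {"track": 0.8, "map": 0.5, "motion": 1.2, "occ": 0.3, "plan": 0.9}
weights = {name: 1.0 for name in task_losses}
total_loss = sum(weights[name] * loss for name, loss in task_losses.items())
```

Because every module's parameters receive gradients from the planning loss, the intermediate representations are shaped by what the final task actually needs.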
An end-to-end approach that introduces explicit “thinking” into the planning process:
This two-stage reasoning improves safety in complex scenarios.
VAD represents the scene using vectorized elements (agent trajectories, map polylines) rather than dense rasterized images. The vectorized representation is more compact and structured, enabling more efficient planning.
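The compactness argument can be made concrete with a toy comparison (all sizes illustrative): a lane boundary stored as a short polyline versus a dense bird's-eye-view raster:

```python
# Why vectorized beats rasterized in size (numbers illustrative): a lane
# boundary as a 3-point polyline versus a dense 200x200 bird's-eye-view grid.
lane_polyline = [(0.0, 0.0), (5.0, 0.1), (10.0, 0.4)]  # (x, y) in meters
vector_size = 2 * len(lane_polyline)                   # floats for this element
raster_size = 200 * 200                                # pixels for a BEV image
```

A handful of coordinates captures the same geometry that would otherwise occupy tens of thousands of pixels, and the structure (which points belong to which lane) comes for free.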
A world model is a neural network that predicts how the world will evolve — effectively, a learned simulator. World models are increasingly important for end-to-end driving.
A generative world model for autonomous driving:
An end-to-end model that combines a world model with an action-generating policy, demonstrating that world models can improve driving performance by enabling “mental simulation” of consequences.
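Mental simulation with a world model can be sketched in miniature (the model, dynamics, and goal are all invented for illustration): candidate action sequences are rolled forward inside the learned model and scored, and the best-scoring plan is executed:

```python
# Toy "mental simulation" sketch (model, dynamics, and goal invented):
# candidate action sequences are rolled forward inside a learned world
# model f(s, a) and scored; the policy executes the best-scoring plan.
def world_model(s, a):
    return s + a                                   # stand-in for a learned f

def imagined_score(s0, actions, goal):
    """Roll the model forward and score closeness to the goal state."""
    s = s0
    for a in actions:
        s = world_model(s, a)
    return -abs(s - goal)                          # higher is better

candidates = [(0.1, 0.1), (0.5, 0.5), (-0.2, 0.3)]
best = max(candidates, key=lambda acts: imagined_score(0.0, acts, goal=1.0))
```

The point is that consequences are evaluated in imagination, inside the learned model, before any action is taken in the real world.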
The success of foundation models (large pre-trained models) in NLP and vision has inspired their application to autonomous driving.
Models like GPT-4V can reason about driving scenes:
DriveGPT4: Uses a multi-modal large language model to interpret driving scenes and generate driving decisions with natural language explanations.
Large-scale pre-training on internet video (not just driving data) can provide a strong foundation for driving models:
End-to-end models are “black boxes.” When a modular system makes a mistake, you can identify which module failed (detection missed an object, prediction was wrong, planner chose poorly). With end-to-end, the failure could be anywhere in the learned representation.
Mitigation: Attention visualization, intermediate representation inspection, natural language explanations.
The model might learn spurious correlations from training data:
Driving is dominated by routine scenarios (highway cruising, following traffic). Rare but critical scenarios (construction zones, emergency vehicles, unusual road geometries) are underrepresented in training data.
How do you prove that a neural network will always drive safely? Traditional verification methods don’t apply to deep learning models. This is perhaps the biggest open challenge in end-to-end autonomous driving.
End-to-end models are data-hungry. Tesla’s approach works partly because they have access to billions of miles of driving data from their fleet. Smaller companies cannot match this data advantage.
Many practical systems combine end-to-end learning with modular components:
End-to-end learning represents a paradigm shift in autonomous driving:
The next chapter covers how autonomous driving systems are tested and validated before deployment.