Chapter 8: End-to-End Learning Approaches

The modular autonomous driving stack — perception, localization, prediction, planning, control — has been the dominant paradigm for decades. But a revolutionary alternative is gaining ground: end-to-end learning, where a single neural network learns to map directly from raw sensor input to driving actions. This chapter explores this paradigm shift.

What Is End-to-End Driving?

In an end-to-end system, the entire driving pipeline is replaced (or augmented) by a single learned model:

Traditional:  Sensors → Perception → Prediction → Planning → Control → Actuators
End-to-End:   Sensors → Neural Network → Actuators (or waypoints)

The neural network learns all intermediate representations implicitly through training, without human-designed modules for detection, tracking, or path planning.

Historical Roots

ALVINN (1989)

The first end-to-end driving system: Autonomous Land Vehicle In a Neural Network, built by Dean Pomerleau at Carnegie Mellon. A simple 3-layer neural network mapped low-resolution (30×32) camera images directly to steering commands and was trained by observing human driving. Later versions drove at up to 70 mph on highways.

DAVE-2 / PilotNet (NVIDIA, 2016)

NVIDIA demonstrated a modern end-to-end approach: DAVE-2 (later known as PilotNet), a convolutional neural network trained on recorded human driving to map raw front-camera images directly to steering commands.

The key insight from PilotNet: visualization of the network’s attention (using saliency maps) showed it naturally learned to focus on lane markings, road edges, and other relevant features — without ever being told what these were.
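Saliency maps of this kind can be approximated with occlusion sensitivity: mask a patch of the input and measure how much the output changes. The sketch below is a minimal illustration with a stand-in linear "model" (the actual PilotNet analysis used a network-specific backpropagation method):

```python
import numpy as np

def steering_model(image):
    """Stand-in for a trained network: a fixed linear map
    from a 6x8 grayscale image to a steering angle."""
    rng = np.random.default_rng(0)
    w = rng.normal(size=image.shape)
    return float((w * image).sum())

def occlusion_saliency(model, image, patch=2):
    """Zero out each patch and record how much the output changes.
    Large changes mark regions the model relies on."""
    base = model(image)
    saliency = np.zeros_like(image)
    h, w = image.shape
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            occluded = image.copy()
            occluded[i:i + patch, j:j + patch] = 0.0
            saliency[i:i + patch, j:j + patch] = abs(model(occluded) - base)
    return saliency

image = np.random.default_rng(1).uniform(size=(6, 8))
saliency = occlusion_saliency(steering_model, image)
```

On a real network, bright regions of such a map tend to coincide with lane markings and road edges, which is exactly what the PilotNet visualizations showed.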

Imitation Learning

Behavioral Cloning

The simplest end-to-end approach: directly copy human behavior.

Training: Collect (observation, action) pairs from human driving, then train a neural network to predict actions from observations using supervised learning:

\[\mathcal{L} = \sum_{i} \|a_i - \pi_\theta(o_i)\|^2\]

where $a_i$ is the human driver’s action, $o_i$ is the observation, and $\pi_\theta$ is the learned policy.

Distribution shift problem: The fundamental challenge. The model is trained on the distribution of states visited by the human driver. But when the model drives, any small error causes it to visit states the human never encountered. In these unseen states, the model may behave unpredictably, leading to compounding errors.

Example: If the training data only includes driving in the center of the lane, the model never learns how to recover from being off-center. When a small perturbation pushes it off-center, it doesn’t know how to return.
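The behavioral cloning recipe above can be sketched end to end with a linear policy and synthetic demonstrations (all data here is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic demonstrations: observations are lane features, actions are
# the expert's steering corrections (a = obs @ expert_w + noise).
obs = rng.normal(size=(500, 3))
expert_w = np.array([-2.0, 0.5, 0.0])
actions = obs @ expert_w + 0.01 * rng.normal(size=500)

# Linear policy pi_theta(o) = o @ theta, fit by gradient descent on
# the squared-error imitation loss L = sum_i ||a_i - pi_theta(o_i)||^2.
theta = np.zeros(3)
lr = 0.01
for _ in range(200):
    pred = obs @ theta
    grad = 2 * obs.T @ (pred - actions) / len(obs)
    theta -= lr * grad
```

After training, `theta` recovers the expert's mapping on the states the expert visited; nothing in the loss says what to do off that distribution, which is exactly the problem described above.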

DAgger (Dataset Aggregation)

DAgger addresses distribution shift iteratively:

  1. Train an initial policy $\pi_1$ via behavioral cloning
  2. Execute $\pi_1$ to collect new observations from the states it actually visits
  3. Have the expert label these new observations with the correct actions
  4. Aggregate the new data with the original dataset
  5. Retrain the policy and repeat

This progressively exposes the model to its own mistakes and teaches it to recover.
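A toy version of the DAgger loop, with a scripted expert and a one-parameter policy (all quantities are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def expert(state):
    """Expert action: steer proportionally back toward lane center."""
    return -0.5 * state

def fit(states, actions):
    """Least-squares fit of a linear policy a = w * state."""
    return float(states @ actions / (states @ states))

def rollout(w, steps=50):
    """Execute the policy and record the states it actually visits."""
    s, visited = 0.1, []
    for _ in range(steps):
        visited.append(s)
        s = s + w * s + 0.05 * rng.normal()
    return np.array(visited)

# 1. Initial behavioral cloning on expert-visited states (near center)
states = rng.normal(scale=0.05, size=100)
actions = expert(states) + 0.01 * rng.normal(size=100)
w = fit(states, actions)

# 2-5. DAgger: roll out the policy, have the expert label the states it
# actually visits, aggregate with the old data, and retrain.
for _ in range(3):
    new_states = rollout(w)
    states = np.concatenate([states, new_states])
    actions = np.concatenate([actions, expert(new_states)])
    w = fit(states, actions)
```

The aggregated dataset now covers off-center states the initial demonstrations never contained, so the retrained policy has seen how the expert recovers from them.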

Conditional Imitation Learning (CIL)

A navigation command (turn left, turn right, go straight, follow lane) is provided as input to handle intersections where the correct action depends on the intended route:

\[a = \pi_\theta(o, c)\]

where $c$ is the high-level command. Without this conditioning, the training targets at intersections are ambiguous: the model averages the left-turn and right-turn demonstrations and tends to produce a straight-ahead trajectory that matches neither.
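One minimal way to realize $\pi_\theta(o, c)$ is a separate output head per command, selected by $c$ at inference time, mirroring the branched CIL architecture; the features and gains below are toy stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
COMMANDS = ["left", "right", "straight", "follow"]

# Toy expert whose steering depends on the route command, not just the
# observation (the gains are illustrative assumptions).
expert_gain = {"left": -1.0, "right": 1.0, "straight": 0.0, "follow": -0.2}

obs = rng.uniform(0.5, 1.5, size=400)
cmds = rng.choice(COMMANDS, size=400)
acts = np.array([expert_gain[c] * o for c, o in zip(cmds, obs)])

# One linear head per command, each fit only on its own command's data.
heads = {}
for c in COMMANDS:
    mask = cmds == c
    heads[c] = float(obs[mask] @ acts[mask] / (obs[mask] @ obs[mask]))

def policy(o, c):
    """pi_theta(o, c): route the observation through the head for command c."""
    return heads[c] * o
```

Because each head only sees demonstrations for its own command, the left and right behaviors are never averaged together.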

Reinforcement Learning

The RL Framework for Driving

Model the driving task as a Markov Decision Process (MDP): states $s$ describe the driving scene, actions $a$ are the control commands (steering, throttle, brake), transition dynamics $p(s_{t+1} \mid s_t, a_t)$ describe how the scene evolves, and a reward function $r(s, a)$ scores driving quality.

The goal is to learn a policy $\pi(a \mid s)$ that maximizes the expected cumulative reward:
\[J(\pi) = \mathbb{E}_\pi\left[\sum_{t=0}^{T} \gamma^t r(s_t, a_t)\right]\]

where $\gamma \in [0, 1)$ is a discount factor.
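The discounted sum inside the expectation can be computed for a logged episode with a simple backward recursion:

```python
# Discounted return G = sum_t gamma^t * r_t for one episode,
# accumulated back to front; the rewards here are placeholders.
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 1.0, 1.0]  # r_0, r_1, r_2
g = discounted_return(rewards)  # 1 + 0.99 + 0.99^2
```

Averaging this quantity over many episodes gives a Monte Carlo estimate of $J(\pi)$.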

Challenges of RL for Driving

  1. Safety during training: You cannot learn to drive by crashing thousands of times on real roads. RL training requires simulation.
  2. Reward engineering: Designing a reward function that captures all aspects of good driving is extremely difficult. Missing a subtle case (e.g., not penalizing driving too close to a cyclist) can lead to dangerous behavior.
  3. Sample efficiency: RL typically requires millions of episodes to learn, even in simulation.
  4. Sim-to-real transfer: Policies learned in simulation may not transfer to the real world due to the “sim-to-real gap” — differences between simulated and real physics, rendering, and agent behavior.
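To make the reward-engineering point concrete, here is a hand-crafted reward sketch; every term and weight is an illustrative assumption, and note how easy it would be to omit the cyclist-clearance term entirely:

```python
def driving_reward(speed, lane_offset, collision, cyclist_gap):
    """Hand-crafted reward sketch; all terms and weights are
    illustrative assumptions, not a production reward."""
    r = 0.0
    r += 0.1 * min(speed, 25.0)       # progress, capped at a speed limit
    r -= 0.5 * abs(lane_offset)       # stay centered in the lane
    r -= 100.0 if collision else 0.0  # hard penalty on collision
    # Easily forgotten term: clearance to vulnerable road users (meters).
    if cyclist_gap < 1.5:
        r -= 10.0 * (1.5 - cyclist_gap)
    return r
```

A policy trained against this without the last term would happily shave past cyclists at full speed, since nothing in the objective penalizes it.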

Offline Reinforcement Learning

To avoid the safety issues of online RL, offline RL learns from a fixed dataset of driving logs (similar to imitation learning) but optimizes a reward function rather than just copying the expert.
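One representative offline RL recipe is advantage-weighted regression: imitate the logged actions, but weight each sample by its exponentiated advantage so that high-return behavior is copied more strongly than low-return behavior. A toy version with synthetic logs (the data and return proxy are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed driving log: observations, logged actions, and per-step returns.
# Half the logged actions come from good behavior, half from bad.
obs = rng.normal(size=300)
good_a = -0.5 * obs                 # high-return behavior
bad_a = 0.5 * obs                   # low-return behavior
logged = np.where(rng.uniform(size=300) < 0.5, good_a, bad_a)
returns = -np.abs(logged - good_a)  # toy return proxy for each sample

# Advantage-weighted regression: weighted least squares where
# w_i = exp(A_i / beta) upweights high-advantage logged actions.
adv = returns - returns.mean()
weights = np.exp(adv / 0.1)
theta = float((weights * obs * logged).sum() / (weights * obs * obs).sum())
```

Plain behavioral cloning on this log would average the good and bad behaviors (giving a gain near zero); the advantage weighting recovers the high-return policy instead.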

Modern End-to-End Architectures

Tesla FSD v12 (2024)

Tesla’s FSD v12 represents a watershed moment for end-to-end driving. Released in March 2024, it replaces over 300,000 lines of C++ planning code with a single neural network trained on video of human driving collected from Tesla’s fleet.

UniAD (Unified Autonomous Driving, 2023)

UniAD by Shanghai AI Lab demonstrates a fully differentiable, multi-task end-to-end framework:

  1. Perception: BEV feature extraction from multi-camera images
  2. Tracking: Query-based object tracking with learned object queries
  3. Mapping: Online vectorized map prediction
  4. Motion forecasting: Transformer-based multi-agent trajectory prediction
  5. Occupancy prediction: Future occupancy estimation
  6. Planning: Trajectory planning conditioned on all upstream outputs

All modules are jointly trained end-to-end. The key finding: joint training improves performance across all tasks compared to training each module separately. Planning benefits from better prediction, which benefits from better tracking, which benefits from better perception.
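The joint-training idea can be illustrated with a toy fully differentiable stack: a shared encoder feeding two task heads, trained on one summed loss so gradients from every head shape the shared features (all maps and data below are synthetic stand-ins, not UniAD's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=(200, 4))           # input features
y_det = x @ rng.normal(size=(4, 1))     # synthetic "perception" targets
y_plan = x @ rng.normal(size=(4, 1))    # synthetic "planning" targets

W_enc = rng.normal(size=(4, 3)) * 0.1   # shared encoder
W_det = rng.normal(size=(3, 1)) * 0.1   # perception head
W_plan = rng.normal(size=(3, 1)) * 0.1  # planning head

def joint_loss():
    z = x @ W_enc
    return float(((z @ W_det - y_det) ** 2).mean()
                 + ((z @ W_plan - y_plan) ** 2).mean())

loss_before = joint_loss()
lr = 0.01
for _ in range(500):
    z = x @ W_enc
    e_det = z @ W_det - y_det
    e_plan = z @ W_plan - y_plan
    # One summed loss: both heads' errors backpropagate into W_enc,
    # so the shared features improve for every task at once.
    g_enc = x.T @ (e_det @ W_det.T + e_plan @ W_plan.T) / len(x)
    g_det = z.T @ e_det / len(x)
    g_plan = z.T @ e_plan / len(x)
    W_enc -= lr * g_enc
    W_det -= lr * g_det
    W_plan -= lr * g_plan
loss_after = joint_loss()
```

If the encoder were trained on only one task and then frozen, the other head could not reshape the shared features; joint training is what lets planning gradients reach perception.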

Think Twice (2023)

An end-to-end approach that introduces explicit “thinking” into the planning process:

  1. First thought: Generate an initial trajectory based on current perception
  2. Look ahead: Use a world model to simulate the consequences of the initial trajectory
  3. Second thought: Refine the trajectory based on the simulated outcomes

This two-stage reasoning improves safety in complex scenarios.
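A toy rendering of this propose-simulate-refine pattern, with an exact point-mass "world model" and a hypothetical obstacle standing in for the learned components:

```python
import numpy as np

OBSTACLE = np.array([5.0, 0.0])  # hypothetical obstacle ahead

def world_model(traj):
    """Stand-in world model: predict minimum clearance to the obstacle."""
    return float(np.min(np.linalg.norm(traj - OBSTACLE, axis=1)))

def first_thought():
    """Initial trajectory from current perception: drive straight."""
    xs = np.linspace(0.0, 10.0, 21)
    return np.stack([xs, np.zeros_like(xs)], axis=1)

def second_thought(traj, margin=1.0):
    """Refine: nudge the plan sideways until the simulated outcome is safe."""
    offset = 0.0
    while world_model(traj + np.array([0.0, offset])) < margin:
        offset += 0.25
    return traj + np.array([0.0, offset])

plan = second_thought(first_thought())
```

The first thought drives straight through the obstacle; only the look-ahead step reveals the collision and triggers the lateral refinement.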

VAD (Vectorized Scene Representation for Efficient Autonomous Driving, 2023)

VAD represents the scene using vectorized elements (agent trajectories, map polylines) rather than dense rasterized images. The vectorized representation is more compact and structured, enabling more efficient planning.

World Models

A world model is a neural network that predicts how the world will evolve — effectively, a learned simulator. World models are increasingly important for end-to-end driving.

How World Models Work

  1. Encode the current scene (sensor inputs) into a latent state
  2. Predict future latent states given actions
  3. Decode predicted latent states into observable predictions (images, occupancy, agent positions)
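These three steps can be sketched with fixed linear maps standing in for the learned encoder, dynamics, and decoder (everything below is an untrained toy):

```python
import numpy as np

rng = np.random.default_rng(0)

E = rng.normal(size=(4, 2)) * 0.5  # encoder: observation -> latent
A = np.array([[0.9, 0.1],          # latent dynamics (state part)
              [-0.1, 0.9]])
B = np.array([[0.1], [0.05]])      # latent dynamics (action part)
D = rng.normal(size=(2, 4)) * 0.5  # decoder: latent -> predicted observation

def rollout(obs, actions):
    """1. Encode the scene; 2. predict future latents under the given
    actions; 3. decode each predicted latent into an observation."""
    z = obs @ E
    preds = []
    for a in actions:
        z = z @ A.T + (B @ np.array([a])).ravel()
        preds.append(z @ D)
    return np.array(preds)

obs = rng.normal(size=4)
preds = rollout(obs, actions=[0.1, 0.1, -0.2])
```

In a real world model the encoder, dynamics, and decoder are deep networks trained so the decoded predictions match what the sensors later actually observe.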

GAIA-1 (Wayve, 2023)

A generative world model for autonomous driving: given past video frames, and optionally text or action prompts, it generates plausible future driving video, which can be used to synthesize diverse training and evaluation scenarios.

ADriver-I (2023)

An end-to-end model that combines a world model with an action-generating policy, demonstrating that world models can improve driving performance by enabling “mental simulation” of consequences.

Applications of World Models

  1. Simulation: Generate realistic scenarios for training and testing
  2. Planning: Evaluate potential actions by predicting their outcomes
  3. Data augmentation: Generate diverse training scenarios from limited real data
  4. Interpretability: The model’s predictions can be inspected to understand its reasoning

Foundation Models for Driving

The success of foundation models (large pre-trained models) in NLP and vision has inspired their application to autonomous driving.

Vision-Language Models

Vision-language models such as GPT-4V can describe driving scenes, identify hazards, and explain recommended maneuvers in natural language, bringing broad world knowledge to scene understanding.

DriveGPT4: Uses a multi-modal large language model to interpret driving scenes and generate driving decisions with natural language explanations.

Pre-Training at Scale

Large-scale pre-training on internet video (not just driving data) can provide a strong foundation for driving models, since generic video teaches visual dynamics (how objects, people, and vehicles move) that transfer to driving.

Challenges of End-to-End Learning

Interpretability

End-to-end models are “black boxes.” When a modular system makes a mistake, you can identify which module failed (detection missed an object, prediction was wrong, planner chose poorly). With end-to-end, the failure could be anywhere in the learned representation.

Mitigation: Attention visualization, intermediate representation inspection, natural language explanations.

Causal Confusion

The model might learn spurious correlations from training data. A well-known example is the “inertia problem”: if the policy’s input includes the vehicle’s own recent speed or actions, it can learn that being stopped predicts staying stopped, because that correlation holds in human data, and then fail to pull away after a stop. The correlation is real in the dataset but not causal.

Long-Tail Distribution

Driving is dominated by routine scenarios (highway cruising, following traffic). Rare but critical scenarios (construction zones, emergency vehicles, unusual road geometries) are underrepresented in training data.

Safety Verification

How do you prove that a neural network will always drive safely? Traditional verification methods don’t apply to deep learning models. This is perhaps the biggest open challenge in end-to-end autonomous driving.

Data Requirements

End-to-end models are data-hungry. Tesla’s approach works partly because they have access to billions of miles of driving data from their fleet. Smaller companies cannot match this data advantage.

Hybrid Approaches

Many practical systems combine end-to-end learning with modular components: for example, a learned planner whose proposed trajectory passes through a rule-based safety check before execution, or learned perception feeding a classical planner.

Summary

End-to-end learning represents a paradigm shift in autonomous driving:

  1. Imitation learning at massive scale (Tesla FSD v12) has proven surprisingly effective
  2. Joint training across perception, prediction, and planning improves all tasks
  3. World models enable learned simulation for planning and evaluation
  4. Foundation models bring broad world knowledge to driving systems
  5. Interpretability and safety verification remain critical open challenges
  6. Hybrid approaches combining learning and classical methods offer the best of both worlds

The next chapter covers how autonomous driving systems are tested and validated before deployment.

