Chapter 8: End-to-End Learning Approaches

The modular autonomous driving stack — perception, localization, prediction, planning, control — has been the dominant paradigm for decades. But a revolutionary alternative is gaining ground: end-to-end learning, where a single neural network learns to map directly from raw sensor input to driving actions. This chapter explores this paradigm shift.

What Is End-to-End Driving?

In an end-to-end system, the entire driving pipeline is replaced (or augmented) by a single learned model:

Traditional:  Sensors → Perception → Prediction → Planning → Control → Actuators
End-to-End:   Sensors → Neural Network → Actuators (or waypoints)

The neural network learns all intermediate representations implicitly through training, without human-designed modules for detection, tracking, or path planning.

Historical Roots

ALVINN (1989)

The first end-to-end driving system: Autonomous Land Vehicle In a Neural Network, built by Dean Pomerleau at Carnegie Mellon. A simple 3-layer neural network mapped low-resolution (30×32) camera images directly to steering commands and was trained by observing human driving. Later versions drove at up to 70 mph on highways.

DAVE-2 / PilotNet (NVIDIA, 2016)

NVIDIA demonstrated a modern end-to-end approach: DAVE-2 (later known as PilotNet), a convolutional neural network trained on recorded human driving to map raw front-camera images directly to steering commands.

The key insight from PilotNet: visualization of the network’s attention (using saliency maps) showed it naturally learned to focus on lane markings, road edges, and other relevant features — without ever being told what these were.
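Saliency maps of this kind can be approximated with occlusion sensitivity: mask a patch of the input and measure how much the output changes. The sketch below is a minimal illustration with a stand-in linear "model" (the actual PilotNet analysis used a network-specific backpropagation method):

```python
import numpy as np

def steering_model(image):
    """Stand-in for a trained network: a fixed linear map
    from a 6x8 grayscale image to a steering angle."""
    rng = np.random.default_rng(0)
    w = rng.normal(size=image.shape)
    return float((w * image).sum())

def occlusion_saliency(model, image, patch=2):
    """Zero out each patch and record how much the output changes.
    Large changes mark regions the model relies on."""
    base = model(image)
    saliency = np.zeros_like(image)
    h, w = image.shape
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            occluded = image.copy()
            occluded[i:i + patch, j:j + patch] = 0.0
            saliency[i:i + patch, j:j + patch] = abs(model(occluded) - base)
    return saliency

image = np.random.default_rng(1).uniform(size=(6, 8))
saliency = occlusion_saliency(steering_model, image)
```

On a real network, bright regions of such a map tend to coincide with lane markings and road edges, which is exactly what the PilotNet visualizations showed.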

Imitation Learning

Behavioral Cloning

The simplest end-to-end approach: directly copy human behavior.

Training: Collect (observation, action) pairs from human driving, then train a neural network to predict actions from observations using supervised learning:

\[\mathcal{L} = \sum_{i} \|a_i - \pi_\theta(o_i)\|^2\]

where $a_i$ is the human driver’s action, $o_i$ is the observation, and $\pi_\theta$ is the learned policy.

Distribution shift problem: The fundamental challenge. The model is trained on the distribution of states visited by the human driver. But when the model drives, any small error causes it to visit states the human never encountered. In these unseen states, the model may behave unpredictably, leading to compounding errors.

Example: If the training data only includes driving in the center of the lane, the model never learns how to recover from being off-center. When a small perturbation pushes it off-center, it doesn’t know how to return.
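The behavioral cloning recipe above can be sketched end to end with a linear policy and synthetic demonstrations (all data here is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic demonstrations: observations are lane features, actions are
# the expert's steering corrections (a = obs @ expert_w + noise).
obs = rng.normal(size=(500, 3))
expert_w = np.array([-2.0, 0.5, 0.0])
actions = obs @ expert_w + 0.01 * rng.normal(size=500)

# Linear policy pi_theta(o) = o @ theta, fit by gradient descent on
# the squared-error imitation loss L = sum_i ||a_i - pi_theta(o_i)||^2.
theta = np.zeros(3)
lr = 0.01
for _ in range(200):
    pred = obs @ theta
    grad = 2 * obs.T @ (pred - actions) / len(obs)
    theta -= lr * grad
```

After training, `theta` recovers the expert's mapping on the states the expert visited; nothing in the loss says what to do off that distribution, which is exactly the problem described above.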

DAgger (Dataset Aggregation)

DAgger addresses distribution shift iteratively:

  1. Train an initial policy $\pi_1$ via behavioral cloning
  2. Execute $\pi_1$ to collect new observations from the states it actually visits
  3. Have the expert label these new observations with the correct actions
  4. Aggregate the new data with the original dataset
  5. Retrain the policy and repeat

This progressively exposes the model to its own mistakes and teaches it to recover.
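A toy version of the DAgger loop, with a scripted expert and a one-parameter policy (all quantities are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def expert(state):
    """Expert action: steer proportionally back toward lane center."""
    return -0.5 * state

def fit(states, actions):
    """Least-squares fit of a linear policy a = w * state."""
    return float(states @ actions / (states @ states))

def rollout(w, steps=50):
    """Execute the policy and record the states it actually visits."""
    s, visited = 0.1, []
    for _ in range(steps):
        visited.append(s)
        s = s + w * s + 0.05 * rng.normal()
    return np.array(visited)

# 1. Initial behavioral cloning on expert-visited states (near center)
states = rng.normal(scale=0.05, size=100)
actions = expert(states) + 0.01 * rng.normal(size=100)
w = fit(states, actions)

# 2-5. DAgger: roll out the policy, have the expert label the states it
# actually visits, aggregate with the old data, and retrain.
for _ in range(3):
    new_states = rollout(w)
    states = np.concatenate([states, new_states])
    actions = np.concatenate([actions, expert(new_states)])
    w = fit(states, actions)
```

The aggregated dataset now covers off-center states the initial demonstrations never contained, so the retrained policy has seen how the expert recovers from them.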

Conditional Imitation Learning (CIL)

A navigation command (turn left, turn right, go straight, follow lane) is provided as input to handle intersections where the correct action depends on the intended route:

\[a = \pi_\theta(o, c)\]

where $c$ is the high-level command. Without this conditioning, the training targets at intersections are ambiguous: the model averages the left-turn and right-turn demonstrations and tends to produce a straight-ahead trajectory that matches neither.
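One minimal way to realize $\pi_\theta(o, c)$ is a separate output head per command, selected by $c$ at inference time, mirroring the branched CIL architecture; the features and gains below are toy stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
COMMANDS = ["left", "right", "straight", "follow"]

# Toy expert whose steering depends on the route command, not just the
# observation (the gains are illustrative assumptions).
expert_gain = {"left": -1.0, "right": 1.0, "straight": 0.0, "follow": -0.2}

obs = rng.uniform(0.5, 1.5, size=400)
cmds = rng.choice(COMMANDS, size=400)
acts = np.array([expert_gain[c] * o for c, o in zip(cmds, obs)])

# One linear head per command, each fit only on its own command's data.
heads = {}
for c in COMMANDS:
    mask = cmds == c
    heads[c] = float(obs[mask] @ acts[mask] / (obs[mask] @ obs[mask]))

def policy(o, c):
    """pi_theta(o, c): route the observation through the head for command c."""
    return heads[c] * o
```

Because each head only sees demonstrations for its own command, the left and right behaviors are never averaged together.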

Reinforcement Learning

The RL Framework for Driving

Model the driving task as a Markov Decision Process (MDP): states $s$ describe the driving scene, actions $a$ are the control commands (steering, throttle, brake), transition dynamics $p(s_{t+1} \mid s_t, a_t)$ describe how the scene evolves, and a reward function $r(s, a)$ scores driving quality.

The goal is to learn a policy $\pi(a \mid s)$ that maximizes the expected cumulative reward:
\[J(\pi) = \mathbb{E}_\pi\left[\sum_{t=0}^{T} \gamma^t r(s_t, a_t)\right]\]

where $\gamma \in [0, 1)$ is a discount factor.
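The discounted sum inside the expectation can be computed for a logged episode with a simple backward recursion:

```python
# Discounted return G = sum_t gamma^t * r_t for one episode,
# accumulated back to front; the rewards here are placeholders.
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 1.0, 1.0]  # r_0, r_1, r_2
g = discounted_return(rewards)  # 1 + 0.99 + 0.99^2
```

Averaging this quantity over many episodes gives a Monte Carlo estimate of $J(\pi)$.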

Challenges of RL for Driving

  1. Safety during training: You cannot learn to drive by crashing thousands of times on real roads. RL training requires simulation.
  2. Reward engineering: Designing a reward function that captures all aspects of good driving is extremely difficult. Missing a subtle case (e.g., not penalizing driving too close to a cyclist) can lead to dangerous behavior.
  3. Sample efficiency: RL typically requires millions of episodes to learn, even in simulation.
  4. Sim-to-real transfer: Policies learned in simulation may not transfer to the real world due to the “sim-to-real gap” — differences between simulated and real physics, rendering, and agent behavior.
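To make the reward-engineering point concrete, here is a hand-crafted reward sketch; every term and weight is an illustrative assumption, and note how easy it would be to omit the cyclist-clearance term entirely:

```python
def driving_reward(speed, lane_offset, collision, cyclist_gap):
    """Hand-crafted reward sketch; all terms and weights are
    illustrative assumptions, not a production reward."""
    r = 0.0
    r += 0.1 * min(speed, 25.0)       # progress, capped at a speed limit
    r -= 0.5 * abs(lane_offset)       # stay centered in the lane
    r -= 100.0 if collision else 0.0  # hard penalty on collision
    # Easily forgotten term: clearance to vulnerable road users (meters).
    if cyclist_gap < 1.5:
        r -= 10.0 * (1.5 - cyclist_gap)
    return r
```

A policy trained against this without the last term would happily shave past cyclists at full speed, since nothing in the objective penalizes it.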

Offline Reinforcement Learning

To avoid the safety issues of online RL, offline RL learns from a fixed dataset of driving logs (similar to imitation learning) but optimizes a reward function rather than just copying the expert.
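One representative offline RL recipe is advantage-weighted regression: imitate the logged actions, but weight each sample by its exponentiated advantage so that high-return behavior is copied more strongly than low-return behavior. A toy version with synthetic logs (the data and return proxy are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed driving log: observations, logged actions, and per-step returns.
# Half the logged actions come from good behavior, half from bad.
obs = rng.normal(size=300)
good_a = -0.5 * obs                 # high-return behavior
bad_a = 0.5 * obs                   # low-return behavior
logged = np.where(rng.uniform(size=300) < 0.5, good_a, bad_a)
returns = -np.abs(logged - good_a)  # toy return proxy for each sample

# Advantage-weighted regression: weighted least squares where
# w_i = exp(A_i / beta) upweights high-advantage logged actions.
adv = returns - returns.mean()
weights = np.exp(adv / 0.1)
theta = float((weights * obs * logged).sum() / (weights * obs * obs).sum())
```

Plain behavioral cloning on this log would average the good and bad behaviors (giving a gain near zero); the advantage weighting recovers the high-return policy instead.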

Modern End-to-End Architectures

Tesla FSD v12 (2024)

Tesla’s FSD v12 represents a watershed moment for end-to-end driving. Released in March 2024, it replaces over 300,000 lines of C++ planning code with a single neural network trained on video of human driving collected from Tesla’s fleet.

UniAD (Unified Autonomous Driving, 2023)

UniAD by Shanghai AI Lab demonstrates a fully differentiable, multi-task end-to-end framework:

  1. Perception: BEV feature extraction from multi-camera images
  2. Tracking: Query-based object tracking with learned object queries
  3. Mapping: Online vectorized map prediction
  4. Motion forecasting: Transformer-based multi-agent trajectory prediction
  5. Occupancy prediction: Future occupancy estimation
  6. Planning: Trajectory planning conditioned on all upstream outputs

All modules are jointly trained end-to-end. The key finding: joint training improves performance across all tasks compared to training each module separately. Planning benefits from better prediction, which benefits from better tracking, which benefits from better perception.
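The joint-training idea can be illustrated with a toy fully differentiable stack: a shared encoder feeding two task heads, trained on one summed loss so gradients from every head shape the shared features (all maps and data below are synthetic stand-ins, not UniAD's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=(200, 4))           # input features
y_det = x @ rng.normal(size=(4, 1))     # synthetic "perception" targets
y_plan = x @ rng.normal(size=(4, 1))    # synthetic "planning" targets

W_enc = rng.normal(size=(4, 3)) * 0.1   # shared encoder
W_det = rng.normal(size=(3, 1)) * 0.1   # perception head
W_plan = rng.normal(size=(3, 1)) * 0.1  # planning head

def joint_loss():
    z = x @ W_enc
    return float(((z @ W_det - y_det) ** 2).mean()
                 + ((z @ W_plan - y_plan) ** 2).mean())

loss_before = joint_loss()
lr = 0.01
for _ in range(500):
    z = x @ W_enc
    e_det = z @ W_det - y_det
    e_plan = z @ W_plan - y_plan
    # One summed loss: both heads' errors backpropagate into W_enc,
    # so the shared features improve for every task at once.
    g_enc = x.T @ (e_det @ W_det.T + e_plan @ W_plan.T) / len(x)
    g_det = z.T @ e_det / len(x)
    g_plan = z.T @ e_plan / len(x)
    W_enc -= lr * g_enc
    W_det -= lr * g_det
    W_plan -= lr * g_plan
loss_after = joint_loss()
```

If the encoder were trained on only one task and then frozen, the other head could not reshape the shared features; joint training is what lets planning gradients reach perception.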

Think Twice (2023)

An end-to-end approach that introduces explicit “thinking” into the planning process:

  1. First thought: Generate an initial trajectory based on current perception
  2. Look ahead: Use a world model to simulate the consequences of the initial trajectory
  3. Second thought: Refine the trajectory based on the simulated outcomes

This two-stage reasoning improves safety in complex scenarios.
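A toy rendering of this propose-simulate-refine pattern, with an exact point-mass "world model" and a hypothetical obstacle standing in for the learned components:

```python
import numpy as np

OBSTACLE = np.array([5.0, 0.0])  # hypothetical obstacle ahead

def world_model(traj):
    """Stand-in world model: predict minimum clearance to the obstacle."""
    return float(np.min(np.linalg.norm(traj - OBSTACLE, axis=1)))

def first_thought():
    """Initial trajectory from current perception: drive straight."""
    xs = np.linspace(0.0, 10.0, 21)
    return np.stack([xs, np.zeros_like(xs)], axis=1)

def second_thought(traj, margin=1.0):
    """Refine: nudge the plan sideways until the simulated outcome is safe."""
    offset = 0.0
    while world_model(traj + np.array([0.0, offset])) < margin:
        offset += 0.25
    return traj + np.array([0.0, offset])

plan = second_thought(first_thought())
```

The first thought drives straight through the obstacle; only the look-ahead step reveals the collision and triggers the lateral refinement.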

VAD (Vectorized Scene Representation for Efficient Autonomous Driving, 2023)

VAD represents the scene using vectorized elements (agent trajectories, map polylines) rather than dense rasterized images. The vectorized representation is more compact and structured, enabling more efficient planning.

World Models

A world model is a neural network that predicts how the world will evolve — effectively, a learned simulator. World models are increasingly important for end-to-end driving.

How World Models Work

  1. Encode the current scene (sensor inputs) into a latent state
  2. Predict future latent states given actions
  3. Decode predicted latent states into observable predictions (images, occupancy, agent positions)
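These three steps can be sketched with fixed linear maps standing in for the learned encoder, dynamics, and decoder (everything below is an untrained toy):

```python
import numpy as np

rng = np.random.default_rng(0)

E = rng.normal(size=(4, 2)) * 0.5  # encoder: observation -> latent
A = np.array([[0.9, 0.1],          # latent dynamics (state part)
              [-0.1, 0.9]])
B = np.array([[0.1], [0.05]])      # latent dynamics (action part)
D = rng.normal(size=(2, 4)) * 0.5  # decoder: latent -> predicted observation

def rollout(obs, actions):
    """1. Encode the scene; 2. predict future latents under the given
    actions; 3. decode each predicted latent into an observation."""
    z = obs @ E
    preds = []
    for a in actions:
        z = z @ A.T + (B @ np.array([a])).ravel()
        preds.append(z @ D)
    return np.array(preds)

obs = rng.normal(size=4)
preds = rollout(obs, actions=[0.1, 0.1, -0.2])
```

In a real world model the encoder, dynamics, and decoder are deep networks trained so the decoded predictions match what the sensors later actually observe.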

GAIA-1 (Wayve, 2023)

A generative world model for autonomous driving: given past video frames, and optionally text or action prompts, it generates plausible future driving video, which can be used to synthesize diverse training and evaluation scenarios.

ADriver-I (2023)

An end-to-end model that combines a world model with an action-generating policy, demonstrating that world models can improve driving performance by enabling “mental simulation” of consequences.

Applications of World Models

  1. Simulation: Generate realistic scenarios for training and testing
  2. Planning: Evaluate potential actions by predicting their outcomes
  3. Data augmentation: Generate diverse training scenarios from limited real data
  4. Interpretability: The model’s predictions can be inspected to understand its reasoning

Foundation Models for Driving

The success of foundation models (large pre-trained models) in NLP and vision has inspired their application to autonomous driving.

Vision-Language Models

Vision-language models such as GPT-4V can describe driving scenes, identify hazards, and explain recommended maneuvers in natural language, bringing broad world knowledge to scene understanding.

DriveGPT4: Uses a multi-modal large language model to interpret driving scenes and generate driving decisions with natural language explanations.

Pre-Training at Scale

Large-scale pre-training on internet video (not just driving data) can provide a strong foundation for driving models, since generic video teaches visual dynamics (how objects, people, and vehicles move) that transfer to driving.

Challenges of End-to-End Learning

Interpretability

End-to-end models are “black boxes.” When a modular system makes a mistake, you can identify which module failed (detection missed an object, prediction was wrong, planner chose poorly). With end-to-end, the failure could be anywhere in the learned representation.

Mitigation: Attention visualization, intermediate representation inspection, natural language explanations.

Causal Confusion

The model might learn spurious correlations from training data. A well-known example is the “inertia problem”: if the policy’s input includes the vehicle’s own recent speed or actions, it can learn that being stopped predicts staying stopped, because that correlation holds in human data, and then fail to pull away after a stop. The correlation is real in the dataset but not causal.

Long-Tail Distribution

Driving is dominated by routine scenarios (highway cruising, following traffic). Rare but critical scenarios (construction zones, emergency vehicles, unusual road geometries) are underrepresented in training data.

Safety Verification

How do you prove that a neural network will always drive safely? Traditional verification methods don’t apply to deep learning models. This is perhaps the biggest open challenge in end-to-end autonomous driving.

Data Requirements

End-to-end models are data-hungry. Tesla’s approach works partly because they have access to billions of miles of driving data from their fleet. Smaller companies cannot match this data advantage.

Hybrid Approaches

Many practical systems combine end-to-end learning with modular components: for example, a learned planner whose proposed trajectory passes through a rule-based safety check before execution, or learned perception feeding a classical planner.

Summary

End-to-end learning represents a paradigm shift in autonomous driving:

  1. Imitation learning at massive scale (Tesla FSD v12) has proven surprisingly effective
  2. Joint training across perception, prediction, and planning improves all tasks
  3. World models enable learned simulation for planning and evaluation
  4. Foundation models bring broad world knowledge to driving systems
  5. Interpretability and safety verification remain critical open challenges
  6. Hybrid approaches combining learning and classical methods offer the best of both worlds

The next chapter covers how autonomous driving systems are tested and validated before deployment.

