The most unpredictable element in driving is not the road geometry or the traffic lights — it’s the other road users. Humans are complex, sometimes irrational agents who change their minds, get distracted, and make mistakes. The prediction module must anticipate what every detected agent will do over the next 3–8 seconds to enable safe planning.
Consider a car in the adjacent lane, slightly ahead of you, with its left turn signal on. Will it:

- change lanes in front of you?
- continue to an upcoming left turn?
- do nothing, because the driver forgot to cancel the signal?
Each of these outcomes is plausible. The prediction system must reason about multiple possible futures and assign probabilities to each.
Key challenges include:

- **Multimodality**: an agent's future is inherently uncertain, with several plausible outcomes.
- **Interaction**: agents react to each other, so their futures are coupled.
- **Agent diversity**: cars, trucks, cyclists, and pedestrians all behave differently.
- **Real-time constraints**: predictions for dozens of agents must be produced many times per second.
Given:

- the tracked state history (position, velocity, heading) of each agent over the past few seconds
- the HD map (lanes, crosswalks, traffic controls)
- each agent's class (vehicle, pedestrian, cyclist)

Predict:

- a distribution over each agent's future trajectory for the next 3–8 seconds
The simplest prediction: assume constant velocity and heading.
\[
\hat{x}(t + \Delta t) = x(t) + v_x \cdot \Delta t, \qquad
\hat{y}(t + \Delta t) = y(t) + v_y \cdot \Delta t
\]
This works well for short horizons (< 1 second) but diverges quickly for longer predictions. Variants include constant acceleration and constant turn rate models.
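The constant-velocity baseline amounts to a few lines of numpy; a minimal sketch (parameter names are illustrative):

```python
import numpy as np

def constant_velocity(x, y, vx, vy, horizon_s, dt=0.1):
    """Extrapolate position under a constant-velocity, constant-heading assumption."""
    t = np.arange(dt, horizon_s + dt / 2, dt)       # future time offsets
    return np.stack([x + vx * t, y + vy * t], axis=1)

# a car at the origin doing 10 m/s forward with a slight lateral drift
traj = constant_velocity(x=0.0, y=0.0, vx=10.0, vy=0.5, horizon_s=3.0)
# final predicted position after 3 s is approximately (30.0, 1.5)
```

The same loop structure extends to constant-acceleration or constant-turn-rate variants by integrating the corresponding state equations instead.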
Classify the agent’s current maneuver (lane keep, lane change left, lane change right, turn, stop), then predict a trajectory consistent with that maneuver. This decomposes prediction into a classification stage followed by a maneuver-conditioned regression stage.
Model trajectory prediction as a Gaussian Process regression problem. Given historical data of similar maneuvers, predict the distribution over future trajectories. GPs naturally provide uncertainty estimates but scale poorly with dataset size.
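A GP posterior has a closed form; a minimal numpy sketch of GP regression on an agent's lateral position, with an illustrative RBF kernel and made-up observation history:

```python
import numpy as np

def rbf(a, b, ell=1.0, sf=1.0):
    """Squared-exponential kernel between two vectors of timestamps."""
    d = a[:, None] - b[None, :]
    return sf**2 * np.exp(-0.5 * (d / ell) ** 2)

def gp_predict(t_obs, y_obs, t_new, noise=0.05):
    """Posterior mean and standard deviation of a GP fit to observed positions."""
    K = rbf(t_obs, t_obs) + noise**2 * np.eye(len(t_obs))
    Ks = rbf(t_new, t_obs)
    Kss = rbf(t_new, t_new)
    mean = Ks @ np.linalg.solve(K, y_obs)
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)
    return mean, np.sqrt(np.clip(np.diag(cov), 0.0, None))

t_obs = np.linspace(0.0, 2.0, 9)   # 2 s of tracked history
y_obs = 1.5 * t_obs                # lateral position drifting steadily left
t_new = np.array([2.5, 3.0, 4.0])  # prediction horizon
mean, std = gp_predict(t_obs, y_obs, t_new)
# posterior uncertainty grows with the prediction horizon
```

Note how the uncertainty estimate comes for free from the posterior covariance, which is exactly the property that makes GPs attractive despite their poor scaling.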
Treat prediction as a sequence translation problem: given a sequence of past positions, predict a sequence of future positions.
LSTM-based: Encode the past trajectory with an LSTM (Long Short-Term Memory) encoder, then decode the future trajectory with an LSTM decoder:
Past trajectory → LSTM Encoder → Hidden state → LSTM Decoder → Future trajectory
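The encoder-decoder loop above can be made concrete; a toy numpy sketch with a single LSTM cell per side and untrained random weights (shapes and names are illustrative, not a production model):

```python
import numpy as np

def lstm_cell(x, h, c, W, U, b):
    """One LSTM step: input x, hidden state h, cell state c."""
    z = W @ x + U @ h + b                       # stacked gate pre-activations
    d = h.shape[0]
    i = 1.0 / (1.0 + np.exp(-z[:d]))            # input gate
    f = 1.0 / (1.0 + np.exp(-z[d:2 * d]))       # forget gate
    o = 1.0 / (1.0 + np.exp(-z[2 * d:3 * d]))   # output gate
    g = np.tanh(z[3 * d:])                      # candidate cell update
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

def predict(past_xy, horizon, params):
    """Encode a (T, 2) position history, then decode `horizon` future positions."""
    (We, Ue, be), (Wd, Ud, bd), Wout = params
    d = be.shape[0] // 4
    h = c = np.zeros(d)
    for x in past_xy:                           # encoder: consume the history
        h, c = lstm_cell(x, h, c, We, Ue, be)
    preds, x = [], past_xy[-1]
    for _ in range(horizon):                    # decoder: feed back its own output
        h, c = lstm_cell(x, h, c, Wd, Ud, bd)
        x = Wout @ h                            # project hidden state to (x, y)
        preds.append(x)
    return np.stack(preds)

rng = np.random.default_rng(0)
d_h = 16
mk = lambda: (rng.normal(0, 0.1, (4 * d_h, 2)),
              rng.normal(0, 0.1, (4 * d_h, d_h)),
              np.zeros(4 * d_h))
params = (mk(), mk(), rng.normal(0, 0.1, (2, d_h)))
future = predict(rng.normal(size=(10, 2)), horizon=5, params=params)  # (5, 2)
```

In practice one would use a framework LSTM and train on logged trajectories; the point here is the encode-then-autoregressively-decode structure.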
Social LSTM (2016): Introduces a “social pooling” mechanism where each agent’s LSTM shares information with nearby agents through a pooling grid, enabling modeling of social forces (people avoiding each other).
Social Force CNN: Rasterize the scene (map + agent histories) into a multi-channel image and use a CNN to predict future trajectories. The input channels might include:

- the drivable area and lane geometry rendered from the HD map
- one occupancy channel per past timestep for surrounding agents
- the target agent's own history and the ego vehicle's position
This approach is simple and effective, and was widely used in early deep learning prediction systems.
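The rasterization step itself is straightforward; a minimal sketch that renders agent histories into a bird's-eye-view occupancy tensor (grid size and resolution are illustrative, and the map channels are omitted):

```python
import numpy as np

def rasterize(tracks, grid=64, res=0.5, history=3):
    """Rasterize agent histories into a (history, grid, grid) BEV image.
    Channel t holds agent occupancy t steps in the past; ego is at the center."""
    img = np.zeros((history, grid, grid))
    for track in tracks:                        # track: list of (x, y), newest last
        for t in range(min(history, len(track))):
            x, y = track[-1 - t]
            col = int(grid / 2 + x / res)
            row = int(grid / 2 - y / res)       # image rows grow downward
            if 0 <= row < grid and 0 <= col < grid:
                img[t, row, col] = 1.0
    return img

tracks = [[(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]]  # one car moving in +x
bev = rasterize(tracks)                          # shape (3, 64, 64), sum 3.0
```

A standard image CNN then consumes this tensor, which is what made the approach so easy to adopt early on.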
Model the scene as a graph:

- **Nodes**: agents and map elements (lane segments, crosswalks)
- **Edges**: spatial or semantic relationships between them (agent-to-agent, agent-to-lane)

A GNN then passes messages along these edges to produce interaction-aware features.
VectorNet (Waymo, 2020): Represents both agents and map elements as vectors (polylines). A GNN processes the interactions: a polyline subgraph first pools each polyline's vectors into a single feature, then a global interaction graph applies self-attention across all polyline features.
LaneGCN (2020): Builds a lane graph from the HD map and uses graph convolutions to propagate information along the lane structure, then between the lane graph and agents.
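The two VectorNet-style stages (per-polyline pooling, then global attention) can be sketched in a few lines of numpy; weights here are random and the attention is a single unparameterized head, purely to show the data flow:

```python
import numpy as np

def encode_polyline(vectors, W):
    """Subgraph stage: embed each vector, then max-pool over the polyline."""
    h = np.maximum(0.0, vectors @ W)   # shared ReLU embedding (one layer, for brevity)
    return h.max(axis=0)               # permutation-invariant polyline feature

def global_attention(feats):
    """Global stage: one head of self-attention over polyline features."""
    scores = feats @ feats.T / np.sqrt(feats.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ feats

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 8))            # each vector is (x0, y0, x1, y1)
polylines = [rng.normal(size=(n, 4)) for n in (5, 3, 7)]  # e.g. 2 lanes + 1 track
feats = np.stack([encode_polyline(p, W) for p in polylines])
ctx = global_attention(feats)          # (3, 8): one context feature per polyline
```

A trajectory decoder would then read the target agent's context feature to produce future positions.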
Transformers have become the dominant architecture for trajectory prediction:
Scene Transformer (Waymo, 2022): A unified transformer that attends across agents, time steps, and map features, and uses different masking of the future to handle marginal, joint, and conditional (ego-conditioned) prediction within a single model.
Wayformer (Waymo, 2023): A deliberately simple attention architecture that studies how to fuse the multimodal inputs (agent history, map, traffic signals), comparing early, late, and hierarchical fusion, and uses latent query attention to keep computation tractable.
MTR (Motion Transformer, 2023): Combines global intention localization, using learnable intention queries tied to representative goal points, with local movement refinement that iteratively polishes each predicted trajectory.
A critical requirement: predictions must be multimodal — representing multiple possible futures, not just a single trajectory.
Multiple trajectory regression: Predict K trajectories with associated probabilities. A common approach in competition-winning systems.
Mixture models: Predict parameters of a Gaussian Mixture Model (GMM) at each future timestep:
\[
p(\mathbf{x}_t) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}_t \mid \mu_k, \Sigma_k)
\]

Goal-conditioned prediction: First predict a set of possible goal locations (destinations), then predict trajectories conditioned on each goal.
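The mixture density above can be evaluated directly; a minimal numpy sketch with two illustrative modes at a single future timestep:

```python
import numpy as np

def gmm_density(x, pis, mus, sigmas):
    """Evaluate a 2-D Gaussian mixture p(x) with full covariances."""
    p = 0.0
    for pi, mu, S in zip(pis, mus, sigmas):
        d = x - mu
        norm = 2.0 * np.pi * np.sqrt(np.linalg.det(S))   # 2-D Gaussian normalizer
        p += pi * np.exp(-0.5 * d @ np.linalg.solve(S, d)) / norm
    return p

# two hypothetical modes 2 s ahead: "keep lane" (likely) vs "change left"
pis = [0.7, 0.3]
mus = [np.array([20.0, 0.0]), np.array([18.0, 3.5])]
sigmas = [np.eye(2) * 1.0, np.eye(2) * 2.0]
p_keep = gmm_density(mus[0], pis, mus, sigmas)
p_change = gmm_density(mus[1], pis, mus, sigmas)   # lower: smaller weight, wider mode
```

At training time the network's GMM parameters are typically fit with a negative-log-likelihood loss over this density.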
TNT (Target-driven Trajectory Prediction, 2020):

1. Target prediction: sample candidate goal points (e.g., along lane centerlines) and score each as the agent's likely destination.
2. Motion estimation: predict a trajectory reaching each of the top-scoring targets.
3. Scoring and selection: rank the complete trajectories and keep a diverse, high-probability subset.
DenseTNT: Extends TNT with denser goal sampling and direct trajectory optimization.
CVAE (Conditional Variational Autoencoder): Learn a latent space of trajectory modes. During inference, sample multiple latent codes to generate diverse trajectory predictions.
Diffusion Models (2023–2025): Apply denoising diffusion to trajectory prediction. Start from noise and iteratively refine to produce diverse, realistic trajectory samples. MotionDiffuser and LED are notable examples. Diffusion models naturally produce diverse predictions but are computationally expensive.
Agents don’t move independently — their futures are correlated. If a car starts to merge, nearby vehicles react. Joint prediction models this:
M2I (2022): Identifies influencer-reactor pairs and predicts the reactor’s trajectory conditioned on the influencer’s prediction.
Scene Transformer: Jointly predicts all agents’ trajectories, capturing correlations through attention.
GameFormer (2023): Models multi-agent prediction as an iterative game, where agents take turns refining their trajectories based on others’ predicted actions.
For planning, it’s useful to predict how other agents would react to different actions by the ego vehicle:
“If I change lanes now, will that car let me in or speed up?”
This requires conditional prediction models that take the ego vehicle’s planned trajectory as input and predict others’ responses.
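In its crudest form, such a conditional model reduces to a gap-acceptance rule; a toy sketch (the function, its parameters, and the 1.5 s threshold are all hypothetical — real systems learn this response from data):

```python
def predict_yield(gap_m, closing_speed_mps, min_time_gap_s=1.5):
    """Toy ego-conditioned model: will the trailing car yield to our lane change?
    Assumes a simple time-gap acceptance rule on the gap behind the ego vehicle."""
    if closing_speed_mps <= 0.0:      # trailing car is not catching up: safe
        return True
    return gap_m / closing_speed_mps >= min_time_gap_s

predict_yield(gap_m=30.0, closing_speed_mps=5.0)    # 6 s time gap: yields
predict_yield(gap_m=10.0, closing_speed_mps=10.0)   # 1 s time gap: does not
```

A learned conditional predictor replaces this rule with a model that consumes the ego's candidate trajectory alongside the scene context and outputs the other agents' trajectory distributions.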
Pedestrians are particularly challenging to predict because:

- they can change speed and direction almost instantaneously
- they are not constrained to lanes or road geometry
- their motion is strongly shaped by social interaction with other pedestrians
- their intent cues (gaze, posture, gestures) are subtle and hard to perceive
The classic model (Helbing & Molnar, 1995): pedestrians are modeled as particles subject to forces:

- a driving force toward the pedestrian's goal at a desired walking speed
- repulsive forces from other pedestrians and from obstacles
- attractive forces toward companions or points of interest
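One Euler step of the social force model fits in a few lines; a minimal sketch with illustrative constants (the values of `v0`, `tau`, `A`, `B` here are not the paper's calibrated ones):

```python
import numpy as np

def social_force_step(pos, vel, goal, others, dt=0.1,
                      v0=1.3, tau=0.5, A=2.0, B=0.3):
    """One Euler step of a Helbing-Molnar-style social force model.
    The driving force relaxes velocity toward the goal; others repel exponentially."""
    to_goal = goal - pos
    desired = v0 * to_goal / (np.linalg.norm(to_goal) + 1e-9)
    force = (desired - vel) / tau                 # relaxation toward desired velocity
    for q in others:                              # repulsion from other pedestrians
        d = pos - q
        dist = np.linalg.norm(d) + 1e-9
        force += A * np.exp(-dist / B) * d / dist
    vel = vel + force * dt
    return pos + vel * dt, vel

pos, vel = np.array([0.0, 0.0]), np.array([0.0, 0.0])
goal = np.array([10.0, 0.0])
others = [np.array([2.0, 0.2])]                   # standing pedestrian slightly left
for _ in range(50):                               # simulate 5 s
    pos, vel = social_force_step(pos, vel, goal, others)
# the walker advances toward the goal while deflecting away from the obstacle
```

Despite its age, this model remains a useful prior and a common baseline for learned pedestrian predictors.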
Recent work incorporates pedestrian body pose and head orientation to improve intent prediction: a head turned toward oncoming traffic suggests awareness and possible crossing intent, while body orientation often signals an imminent change of direction.
Prediction is the bridge between perception (what is the world now?) and planning (what should I do?). Key takeaways:

- Predictions must be multimodal: a single trajectory per agent is not enough.
- Transformers over vectorized scene representations are the current dominant approach.
- Agents interact, so joint and conditional prediction matter for realistic futures.
- Conditioning predictions on candidate ego plans tightly couples prediction with planning.
The next chapter covers what the ego vehicle does with all this information: planning its own trajectory.
| ← Previous: Perception and Sensor Fusion | Next: Path Planning and Decision Making → |