The most unpredictable element in driving is not the road geometry or the traffic lights — it’s the other road users. Humans are complex, sometimes irrational agents who change their minds, get distracted, and make mistakes. The prediction module must anticipate what every detected agent will do over the next 3–8 seconds to enable safe planning.
Consider a car in the adjacent lane, slightly ahead of you, with its left turn signal on. Will it:

- change lanes in front of you?
- continue to an upcoming left turn?
- do nothing, because the driver forgot to cancel the signal?
Each of these outcomes is plausible. The prediction system must reason about multiple possible futures and assign probabilities to each.
Key challenges include:

- **Multimodality**: an agent's future is inherently uncertain, with several plausible outcomes.
- **Interaction**: agents react to each other, so their futures are coupled.
- **Agent diversity**: cars, trucks, cyclists, and pedestrians all behave differently.
- **Real-time constraints**: predictions for dozens of agents must be produced many times per second.
Given:

- the tracked state history (position, velocity, heading) of each agent over the past few seconds
- the HD map (lanes, crosswalks, traffic controls)
- each agent's class (vehicle, pedestrian, cyclist)

Predict:

- a distribution over each agent's future trajectory for the next 3–8 seconds
The simplest prediction: assume constant velocity and heading.
\[
\hat{x}(t + \Delta t) = x(t) + v_x \cdot \Delta t, \qquad
\hat{y}(t + \Delta t) = y(t) + v_y \cdot \Delta t
\]
This works well for short horizons (< 1 second) but diverges quickly for longer predictions. Variants include constant acceleration and constant turn rate models.
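The constant-velocity baseline amounts to a few lines of numpy; a minimal sketch (parameter names are illustrative):

```python
import numpy as np

def constant_velocity(x, y, vx, vy, horizon_s, dt=0.1):
    """Extrapolate position under a constant-velocity, constant-heading assumption."""
    t = np.arange(dt, horizon_s + dt / 2, dt)       # future time offsets
    return np.stack([x + vx * t, y + vy * t], axis=1)

# a car at the origin doing 10 m/s forward with a slight lateral drift
traj = constant_velocity(x=0.0, y=0.0, vx=10.0, vy=0.5, horizon_s=3.0)
# final predicted position after 3 s is approximately (30.0, 1.5)
```

The same loop structure extends to constant-acceleration or constant-turn-rate variants by integrating the corresponding state equations instead.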
Classify the agent’s current maneuver (lane keep, lane change left, lane change right, turn, stop), then predict a trajectory consistent with that maneuver. This decomposes prediction into a classification stage followed by a maneuver-conditioned regression stage.
Model trajectory prediction as a Gaussian Process regression problem. Given historical data of similar maneuvers, predict the distribution over future trajectories. GPs naturally provide uncertainty estimates but scale poorly with dataset size.
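A GP posterior has a closed form; a minimal numpy sketch of GP regression on an agent's lateral position, with an illustrative RBF kernel and made-up observation history:

```python
import numpy as np

def rbf(a, b, ell=1.0, sf=1.0):
    """Squared-exponential kernel between two vectors of timestamps."""
    d = a[:, None] - b[None, :]
    return sf**2 * np.exp(-0.5 * (d / ell) ** 2)

def gp_predict(t_obs, y_obs, t_new, noise=0.05):
    """Posterior mean and standard deviation of a GP fit to observed positions."""
    K = rbf(t_obs, t_obs) + noise**2 * np.eye(len(t_obs))
    Ks = rbf(t_new, t_obs)
    Kss = rbf(t_new, t_new)
    mean = Ks @ np.linalg.solve(K, y_obs)
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)
    return mean, np.sqrt(np.clip(np.diag(cov), 0.0, None))

t_obs = np.linspace(0.0, 2.0, 9)   # 2 s of tracked history
y_obs = 1.5 * t_obs                # lateral position drifting steadily left
t_new = np.array([2.5, 3.0, 4.0])  # prediction horizon
mean, std = gp_predict(t_obs, y_obs, t_new)
# posterior uncertainty grows with the prediction horizon
```

Note how the uncertainty estimate comes for free from the posterior covariance, which is exactly the property that makes GPs attractive despite their poor scaling.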
Treat prediction as a sequence translation problem: given a sequence of past positions, predict a sequence of future positions.
LSTM-based: Encode the past trajectory with an LSTM (Long Short-Term Memory) encoder, then decode the future trajectory with an LSTM decoder:
Past trajectory → LSTM Encoder → Hidden state → LSTM Decoder → Future trajectory
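The encoder-decoder loop above can be made concrete; a toy numpy sketch with a single LSTM cell per side and untrained random weights (shapes and names are illustrative, not a production model):

```python
import numpy as np

def lstm_cell(x, h, c, W, U, b):
    """One LSTM step: input x, hidden state h, cell state c."""
    z = W @ x + U @ h + b                       # stacked gate pre-activations
    d = h.shape[0]
    i = 1.0 / (1.0 + np.exp(-z[:d]))            # input gate
    f = 1.0 / (1.0 + np.exp(-z[d:2 * d]))       # forget gate
    o = 1.0 / (1.0 + np.exp(-z[2 * d:3 * d]))   # output gate
    g = np.tanh(z[3 * d:])                      # candidate cell update
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

def predict(past_xy, horizon, params):
    """Encode a (T, 2) position history, then decode `horizon` future positions."""
    (We, Ue, be), (Wd, Ud, bd), Wout = params
    d = be.shape[0] // 4
    h = c = np.zeros(d)
    for x in past_xy:                           # encoder: consume the history
        h, c = lstm_cell(x, h, c, We, Ue, be)
    preds, x = [], past_xy[-1]
    for _ in range(horizon):                    # decoder: feed back its own output
        h, c = lstm_cell(x, h, c, Wd, Ud, bd)
        x = Wout @ h                            # project hidden state to (x, y)
        preds.append(x)
    return np.stack(preds)

rng = np.random.default_rng(0)
d_h = 16
mk = lambda: (rng.normal(0, 0.1, (4 * d_h, 2)),
              rng.normal(0, 0.1, (4 * d_h, d_h)),
              np.zeros(4 * d_h))
params = (mk(), mk(), rng.normal(0, 0.1, (2, d_h)))
future = predict(rng.normal(size=(10, 2)), horizon=5, params=params)  # (5, 2)
```

In practice one would use a framework LSTM and train on logged trajectories; the point here is the encode-then-autoregressively-decode structure.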
Social LSTM (2016): Introduces a “social pooling” mechanism where each agent’s LSTM shares information with nearby agents through a pooling grid, enabling modeling of social forces (people avoiding each other).
Social Force CNN: Rasterize the scene (map + agent histories) into a multi-channel image and use a CNN to predict future trajectories. The input channels might include:

- the drivable area and lane geometry rendered from the HD map
- one occupancy channel per past timestep for surrounding agents
- the target agent's own history and the ego vehicle's position
This approach is simple and effective, and was widely used in early deep learning prediction systems.
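The rasterization step itself is straightforward; a minimal sketch that renders agent histories into a bird's-eye-view occupancy tensor (grid size and resolution are illustrative, and the map channels are omitted):

```python
import numpy as np

def rasterize(tracks, grid=64, res=0.5, history=3):
    """Rasterize agent histories into a (history, grid, grid) BEV image.
    Channel t holds agent occupancy t steps in the past; ego is at the center."""
    img = np.zeros((history, grid, grid))
    for track in tracks:                        # track: list of (x, y), newest last
        for t in range(min(history, len(track))):
            x, y = track[-1 - t]
            col = int(grid / 2 + x / res)
            row = int(grid / 2 - y / res)       # image rows grow downward
            if 0 <= row < grid and 0 <= col < grid:
                img[t, row, col] = 1.0
    return img

tracks = [[(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]]  # one car moving in +x
bev = rasterize(tracks)                          # shape (3, 64, 64), sum 3.0
```

A standard image CNN then consumes this tensor, which is what made the approach so easy to adopt early on.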
Model the scene as a graph:

- **Nodes**: agents and map elements (lane segments, crosswalks)
- **Edges**: spatial or semantic relationships between them (agent-to-agent, agent-to-lane)

A GNN then passes messages along these edges to produce interaction-aware features.
VectorNet (Waymo, 2020): Represents both agents and map elements as vectors (polylines). A GNN processes the interactions: a polyline subgraph first pools each polyline's vectors into a single feature, then a global interaction graph applies self-attention across all polyline features.
LaneGCN (2020): Builds a lane graph from the HD map and uses graph convolutions to propagate information along the lane structure, then between the lane graph and agents.
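The two VectorNet-style stages (per-polyline pooling, then global attention) can be sketched in a few lines of numpy; weights here are random and the attention is a single unparameterized head, purely to show the data flow:

```python
import numpy as np

def encode_polyline(vectors, W):
    """Subgraph stage: embed each vector, then max-pool over the polyline."""
    h = np.maximum(0.0, vectors @ W)   # shared ReLU embedding (one layer, for brevity)
    return h.max(axis=0)               # permutation-invariant polyline feature

def global_attention(feats):
    """Global stage: one head of self-attention over polyline features."""
    scores = feats @ feats.T / np.sqrt(feats.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ feats

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 8))            # each vector is (x0, y0, x1, y1)
polylines = [rng.normal(size=(n, 4)) for n in (5, 3, 7)]  # e.g. 2 lanes + 1 track
feats = np.stack([encode_polyline(p, W) for p in polylines])
ctx = global_attention(feats)          # (3, 8): one context feature per polyline
```

A trajectory decoder would then read the target agent's context feature to produce future positions.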
Transformers have become the dominant architecture for trajectory prediction:
Scene Transformer (Waymo, 2022): A unified transformer that attends across agents, time steps, and map features, and uses different masking of the future to handle marginal, joint, and conditional (ego-conditioned) prediction within a single model.
Wayformer (Waymo, 2023): A deliberately simple attention architecture that studies how to fuse the multimodal inputs (agent history, map, traffic signals), comparing early, late, and hierarchical fusion, and uses latent query attention to keep computation tractable.
MTR (Motion Transformer, 2023): Combines global intention localization, using learnable intention queries tied to representative goal points, with local movement refinement that iteratively polishes each predicted trajectory.
A critical requirement: predictions must be multimodal — representing multiple possible futures, not just a single trajectory.
Multiple trajectory regression: Predict K trajectories with associated probabilities. A common approach in competition-winning systems.
Mixture models: Predict parameters of a Gaussian Mixture Model (GMM) at each future timestep:
\[
p(\mathbf{x}_t) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}_t \mid \mu_k, \Sigma_k)
\]

Goal-conditioned prediction: First predict a set of possible goal locations (destinations), then predict trajectories conditioned on each goal.
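The mixture density above can be evaluated directly; a minimal numpy sketch with two illustrative modes at a single future timestep:

```python
import numpy as np

def gmm_density(x, pis, mus, sigmas):
    """Evaluate a 2-D Gaussian mixture p(x) with full covariances."""
    p = 0.0
    for pi, mu, S in zip(pis, mus, sigmas):
        d = x - mu
        norm = 2.0 * np.pi * np.sqrt(np.linalg.det(S))   # 2-D Gaussian normalizer
        p += pi * np.exp(-0.5 * d @ np.linalg.solve(S, d)) / norm
    return p

# two hypothetical modes 2 s ahead: "keep lane" (likely) vs "change left"
pis = [0.7, 0.3]
mus = [np.array([20.0, 0.0]), np.array([18.0, 3.5])]
sigmas = [np.eye(2) * 1.0, np.eye(2) * 2.0]
p_keep = gmm_density(mus[0], pis, mus, sigmas)
p_change = gmm_density(mus[1], pis, mus, sigmas)   # lower: smaller weight, wider mode
```

At training time the network's GMM parameters are typically fit with a negative-log-likelihood loss over this density.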
TNT (Target-driven Trajectory Prediction, 2020):

1. Target prediction: sample candidate goal points (e.g., along lane centerlines) and score each as the agent's likely destination.
2. Motion estimation: predict a trajectory reaching each of the top-scoring targets.
3. Scoring and selection: rank the complete trajectories and keep a diverse, high-probability subset.
DenseTNT: Extends TNT with denser goal sampling and direct trajectory optimization.
CVAE (Conditional Variational Autoencoder): Learn a latent space of trajectory modes. During inference, sample multiple latent codes to generate diverse trajectory predictions.
Diffusion Models (2023–2025): Apply denoising diffusion to trajectory prediction. Start from noise and iteratively refine to produce diverse, realistic trajectory samples. MotionDiffuser and LED are notable examples. Diffusion models naturally produce diverse predictions but are computationally expensive.
Agents don’t move independently — their futures are correlated. If a car starts to merge, nearby vehicles react. Joint prediction models this:
M2I (2022): Identifies influencer-reactor pairs and predicts the reactor’s trajectory conditioned on the influencer’s prediction.
Scene Transformer: Jointly predicts all agents’ trajectories, capturing correlations through attention.
GameFormer (2023): Models multi-agent prediction as an iterative game, where agents take turns refining their trajectories based on others’ predicted actions.
For planning, it’s useful to predict how other agents would react to different actions by the ego vehicle:
“If I change lanes now, will that car let me in or speed up?”
This requires conditional prediction models that take the ego vehicle’s planned trajectory as input and predict others’ responses.
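In its crudest form, such a conditional model reduces to a gap-acceptance rule; a toy sketch (the function, its parameters, and the 1.5 s threshold are all hypothetical — real systems learn this response from data):

```python
def predict_yield(gap_m, closing_speed_mps, min_time_gap_s=1.5):
    """Toy ego-conditioned model: will the trailing car yield to our lane change?
    Assumes a simple time-gap acceptance rule on the gap behind the ego vehicle."""
    if closing_speed_mps <= 0.0:      # trailing car is not catching up: safe
        return True
    return gap_m / closing_speed_mps >= min_time_gap_s

predict_yield(gap_m=30.0, closing_speed_mps=5.0)    # 6 s time gap: yields
predict_yield(gap_m=10.0, closing_speed_mps=10.0)   # 1 s time gap: does not
```

A learned conditional predictor replaces this rule with a model that consumes the ego's candidate trajectory alongside the scene context and outputs the other agents' trajectory distributions.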
Pedestrians are particularly challenging to predict because:

- they can change speed and direction almost instantaneously
- they are not constrained to lanes or road geometry
- their motion is strongly shaped by social interaction with other pedestrians
- their intent cues (gaze, posture, gestures) are subtle and hard to perceive
The classic model (Helbing & Molnar, 1995): pedestrians are modeled as particles subject to forces:

- a driving force toward the pedestrian's goal at a desired walking speed
- repulsive forces from other pedestrians and from obstacles
- attractive forces toward companions or points of interest
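One Euler step of the social force model fits in a few lines; a minimal sketch with illustrative constants (the values of `v0`, `tau`, `A`, `B` here are not the paper's calibrated ones):

```python
import numpy as np

def social_force_step(pos, vel, goal, others, dt=0.1,
                      v0=1.3, tau=0.5, A=2.0, B=0.3):
    """One Euler step of a Helbing-Molnar-style social force model.
    The driving force relaxes velocity toward the goal; others repel exponentially."""
    to_goal = goal - pos
    desired = v0 * to_goal / (np.linalg.norm(to_goal) + 1e-9)
    force = (desired - vel) / tau                 # relaxation toward desired velocity
    for q in others:                              # repulsion from other pedestrians
        d = pos - q
        dist = np.linalg.norm(d) + 1e-9
        force += A * np.exp(-dist / B) * d / dist
    vel = vel + force * dt
    return pos + vel * dt, vel

pos, vel = np.array([0.0, 0.0]), np.array([0.0, 0.0])
goal = np.array([10.0, 0.0])
others = [np.array([2.0, 0.2])]                   # standing pedestrian slightly left
for _ in range(50):                               # simulate 5 s
    pos, vel = social_force_step(pos, vel, goal, others)
# the walker advances toward the goal while deflecting away from the obstacle
```

Despite its age, this model remains a useful prior and a common baseline for learned pedestrian predictors.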
Recent work incorporates pedestrian body pose and head orientation to improve intent prediction: a head turned toward oncoming traffic suggests awareness and possible crossing intent, while body orientation often signals an imminent change of direction.
Prediction is the bridge between perception (what is the world now?) and planning (what should I do?). Key takeaways:

- Predictions must be multimodal: a single trajectory per agent is not enough.
- Transformers over vectorized scene representations are the current dominant approach.
- Agents interact, so joint and conditional prediction matter for realistic futures.
- Conditioning predictions on candidate ego plans tightly couples prediction with planning.
The next chapter covers what the ego vehicle does with all this information: planning its own trajectory.
| ← Previous: Perception and Sensor Fusion | Next: Path Planning and Decision Making → |