Computer vision is the discipline that enables autonomous vehicles to extract meaning from visual data. Over the past decade, deep learning has revolutionized this field, transforming perception from hand-crafted feature pipelines to end-to-end learned representations. This chapter covers the core algorithms that turn pixels into understanding.
An AV’s vision system must solve several tasks simultaneously: detecting and classifying objects, segmenting the scene, estimating depth, and reading lane markings, traffic lights, and signs — all from camera pixels.
CNNs form the backbone of most vision systems in autonomous driving. A CNN learns hierarchical features through stacked convolutional layers.
A 2D convolution applies a small learnable filter (kernel) across the input image:
\[(f * g)(x, y) = \sum_{i}\sum_{j} f(i, j) \cdot g(x-i, y-j)\]

Early layers learn low-level features (edges, corners, textures). Deeper layers compose these into high-level representations (wheels, headlights, human silhouettes).
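The sliding-window computation can be sketched in a few lines of pure Python (a minimal illustration; note that deep-learning frameworks actually compute cross-correlation — the kernel is not flipped, as it is in the formula above — and implement it as batched matrix multiplies):

```python
def conv2d(image, kernel):
    """Valid-mode 2D cross-correlation (what DL frameworks call 'convolution')."""
    H, W = len(image), len(image[0])
    kH, kW = len(kernel), len(kernel[0])
    out = []
    for y in range(H - kH + 1):
        row = []
        for x in range(W - kW + 1):
            # Sum of elementwise products over the kH x kW window at (y, x).
            s = sum(image[y + i][x + j] * kernel[i][j]
                    for i in range(kH) for j in range(kW))
            row.append(s)
        out.append(row)
    return out

# A vertical edge kernel responds strongly where intensity changes left-to-right.
img = [[0, 0, 1, 1]] * 4          # left half dark, right half bright
kx  = [[-1, 0, 1],
       [-1, 0, 1],
       [-1, 0, 1]]
print(conv2d(img, kx))            # → [[3, 3], [3, 3]]
```

Every output cell fires on the dark-to-bright transition, which is exactly the "edge detector" behavior early CNN layers learn on their own.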
ResNet (2015): Introduced residual connections (skip connections) that enable training very deep networks (50–152+ layers). The key insight: instead of learning the function $H(x)$, learn the residual $F(x) = H(x) - x$, then compute $H(x) = F(x) + x$. This solves the vanishing gradient problem for deep networks.
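In code, the residual formulation is a one-line wrapper around any transform (a schematic pure-Python sketch; real residual blocks use convolution layers and a learned projection when input and output shapes differ):

```python
def residual_block(x, transform):
    """y = F(x) + x: the block learns only the residual F, while the
    identity path lets gradients flow unimpeded through deep stacks."""
    return [fx + xi for fx, xi in zip(transform(x), x)]

# Hypothetical F initialized to output zeros: the block is then an exact
# identity map, which is why very deep residual networks remain trainable.
zero_init = lambda x: [0.0] * len(x)
print(residual_block([1.0, 2.0, 3.0], zero_init))  # → [1.0, 2.0, 3.0]
```

This is the key property: a 152-layer ResNet can do no worse than a shallower one, because extra blocks can always fall back to the identity.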
EfficientNet (2019): Systematically scales network depth, width, and resolution using a compound scaling coefficient, achieving better accuracy-efficiency trade-offs.
ConvNeXt (2022): Modernized the classic ResNet design by incorporating ideas from Vision Transformers (large kernel sizes, LayerNorm, GELU activation), showing that pure CNNs can match transformer performance.
Transformers, originally designed for natural language processing, have transformed computer vision since the introduction of ViT (Vision Transformer) in 2020.
Patch embedding: The input image is divided into fixed-size patches (e.g., 16×16 pixels). Each patch is flattened and linearly projected to produce a sequence of patch embeddings.
Positional encoding: Position information is added to each patch embedding since transformers have no inherent notion of spatial order.
Self-attention: The core mechanism. For each patch, the model computes:
\[\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V\]
where $Q$ (queries), $K$ (keys), and $V$ (values) are linear projections of the patch embeddings, and $d_k$ is the key dimension. This allows every patch to attend to every other patch, capturing long-range dependencies.
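The pipeline above — patchify, embed, attend — can be sketched end to end in pure Python (a toy version with tiny dimensions; a real ViT uses a learned projection matrix, positional encodings, and multi-head attention):

```python
import math

def patchify(image, p):
    """Split an H x W image into flattened, non-overlapping p x p patches."""
    H, W = len(image), len(image[0])
    return [[image[y + i][x + j] for i in range(p) for j in range(p)]
            for y in range(0, H, p) for x in range(0, W, p)]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]   # numerically stable softmax
        w = [e / sum(exps) for e in exps]
        out.append([sum(wi * v[j] for wi, v in zip(w, V))
                    for j in range(len(V[0]))])
    return out

img = [[0, 1, 2, 3],
       [4, 5, 6, 7],
       [8, 9, 10, 11],
       [12, 13, 14, 15]]
tokens = patchify(img, 2)       # 4 patch vectors, each of length 4
# In a real ViT each token would first be projected and position-encoded;
# here we attend directly over the raw patch vectors for illustration.
ctx = attention(tokens, tokens, tokens)
print(len(tokens), tokens[0])   # 4 [0, 1, 4, 5]
print(len(ctx), len(ctx[0]))    # 4 4
```

Because the softmax weights sum to 1, each output token is a convex combination of all value vectors — this is the "every patch attends to every other patch" property in concrete form.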
The Swin Transformer (2021) introduces shifted window attention, which restricts attention to local windows and shifts them between layers. This reduces the quadratic complexity of global attention to linear complexity with respect to image size, making it practical for high-resolution images in autonomous driving.
A major innovation in AV perception is the Bird’s Eye View (BEV) transformer. These models transform multi-camera images into a unified top-down (BEV) representation that is natural for driving.
The BEV representation is powerful because it is metric (positions and distances carry real-world units), it fuses all cameras into a single coordinate frame, and it is the natural input space for downstream prediction and planning.
Faster R-CNN (2015): The canonical two-stage detector. A Region Proposal Network (RPN) first proposes candidate object regions; a second stage then classifies and refines each proposal.
Two-stage detectors are accurate but relatively slow.
YOLO (You Only Look Once): Divides the image into a grid and predicts bounding boxes and class probabilities directly in a single pass. Dramatically faster than two-stage detectors. The YOLO family has evolved through many versions (YOLOv1 through YOLOv8+), with each iteration improving speed and accuracy.
SSD (Single Shot Detector): Predicts bounding boxes at multiple scales from different layers of a feature pyramid.
Modern detectors increasingly avoid predefined anchor boxes:
CenterNet (2019): Represents objects as center points and regresses properties (size, offset) from each center point. Elegant and efficient.
FCOS (Fully Convolutional One-Stage): Predicts bounding boxes from every foreground pixel, with a centerness branch to downweight low-quality predictions.
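The CenterNet decoding step can be sketched as follows (a simplified illustration; real implementations run max-pooling-based NMS on GPU heatmaps and regress box sizes and offsets from separate network heads):

```python
def heatmap_peaks(hm, thresh=0.5):
    """Extract object centers from a class heatmap, CenterNet-style:
    keep cells that are local maxima above a score threshold."""
    H, W = len(hm), len(hm[0])
    peaks = []
    for y in range(H):
        for x in range(W):
            s = hm[y][x]
            if s < thresh:
                continue
            # Compare against the 8-connected neighborhood (clipped at borders).
            neighbors = [hm[j][i]
                         for j in range(max(0, y - 1), min(H, y + 2))
                         for i in range(max(0, x - 1), min(W, x + 2))
                         if (i, j) != (x, y)]
            if all(s >= n for n in neighbors):
                peaks.append((x, y, s))
    return peaks

hm = [[0.1, 0.2, 0.1],
      [0.2, 0.9, 0.2],
      [0.1, 0.2, 0.7]]
print(heatmap_peaks(hm))   # → [(1, 1, 0.9)]
```

Note that the 0.7 cell is suppressed because it sits next to the stronger 0.9 peak — the local-maximum test plays the role that non-maximum suppression plays in anchor-based detectors.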
Detecting objects in 3D from monocular or multi-camera images is a key challenge, because a single image does not directly measure depth. Approaches include direct monocular 3D box regression, depth-based lifting of 2D detections (pseudo-LiDAR), and BEV transformer detection heads.
Semantic segmentation assigns a class label to every pixel in the image.
FCN (2015) replaced the fully-connected layers in classification networks with convolutional layers, enabling pixel-wise prediction at any input resolution. The key challenge: convolution and pooling reduce spatial resolution. To recover it, FCN uses upsampling (transposed convolution or bilinear interpolation) and skip connections from earlier layers.
U-Net (2015) introduced the symmetric encoder-decoder structure with skip connections at each level:
```
Encoder                                       Decoder
───────                                       ───────
Input → Conv → Pool ─────────────→ Upsample + Concat → Conv → Output
          Conv → Pool ───────────→ Upsample + Concat → Conv
            Conv → Pool ─────────→ Upsample + Concat → Conv
                Conv (bottleneck)
```
The skip connections preserve fine spatial details that are lost during downsampling.
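One decoder step can be sketched concretely (a minimal pure-Python version using nearest-neighbor upsampling; real U-Nets typically use learned transposed convolutions, and the channel axis here is modeled as a per-pixel list):

```python
def upsample2x(fm):
    """Nearest-neighbor 2x upsampling of an H x W feature map."""
    out = []
    for row in fm:
        wide = [v for v in row for _ in range(2)]   # duplicate each column
        out.append(wide)
        out.append(list(wide))                      # duplicate each row
    return out

def concat_channels(a, b):
    """Concatenate two same-sized feature maps along the channel axis —
    this is the U-Net skip connection joining decoder and encoder features."""
    return [[[pa, pb] for pa, pb in zip(ra, rb)] for ra, rb in zip(a, b)]

decoder = upsample2x([[1, 2],
                      [3, 4]])              # 2x2 → 4x4
skip = [[9] * 4 for _ in range(4)]          # matching encoder feature map
fused = concat_channels(decoder, skip)
print(decoder[0], fused[0][0])              # [1, 1, 2, 2] [1, 9]
```

The upsampled feature carries coarse semantics, while the concatenated encoder feature re-injects the fine spatial detail lost during pooling; a convolution then mixes the two.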
DeepLabv3+ (2018): Uses atrous (dilated) convolution to increase the receptive field without reducing resolution, plus an Atrous Spatial Pyramid Pooling (ASPP) module that captures multi-scale context.
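The dilation idea is easiest to see in one dimension (a toy pure-Python sketch; DeepLab applies the same trick to 2-D convolutions, and ASPP runs several dilation rates in parallel):

```python
def dilated_conv1d(signal, kernel, dilation):
    """1-D atrous convolution: kernel taps are spaced `dilation` apart,
    enlarging the receptive field with no extra parameters."""
    span = (len(kernel) - 1) * dilation
    return [sum(k * signal[i + j * dilation] for j, k in enumerate(kernel))
            for i in range(len(signal) - span)]

sig = [0, 0, 0, 1, 0, 0, 0]      # a single impulse
k   = [1, 1, 1]                  # 3-tap kernel
print(dilated_conv1d(sig, k, 1))  # each output sees 3 inputs → [0, 1, 1, 1, 0]
print(dilated_conv1d(sig, k, 2))  # each output sees 5 inputs → [0, 1, 0]
```

With dilation 2, the same three weights cover a span of five samples — which is exactly how DeepLab grows the receptive field without pooling away spatial resolution.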
Panoptic segmentation unifies semantic segmentation (stuff: road, sky, vegetation) and instance segmentation (things: individual cars, pedestrians) into a single, coherent per-pixel description of the scene.
Lane detection is critical for autonomous driving but challenging due to occlusions, worn markings, and complex road geometries.
Segmentation-based: Treat lane detection as pixel-wise segmentation, then fit curves to the segmented regions.
Anchor-based: Define a set of predefined lane anchors (e.g., straight lines at various positions) and predict offsets from these anchors. LaneATT uses this approach.
Parametric curve fitting: Directly regress polynomial or spline coefficients that describe each lane. PolyLaneNet fits 3rd-degree polynomials.
Row-wise classification: For each row of the image, classify which column contains the lane marking. Ultra Fast Lane Detection achieves real-time performance with this approach.
Transformer-based: CLRNet (2022) uses cross-layer refinement with attention mechanisms for robust lane detection.
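The row-wise formulation is simple enough to sketch directly (a minimal decoding step under assumed per-row class probabilities; the column grid, abstain index, and probabilities here are all illustrative, not taken from any specific model):

```python
def decode_rowwise(row_probs, no_lane_col):
    """Row-wise lane decoding in the spirit of Ultra Fast Lane Detection:
    for each image row, pick the most probable column; a reserved
    'no lane' column lets the model abstain on rows without a marking."""
    points = []
    for y, probs in enumerate(row_probs):
        col = max(range(len(probs)), key=probs.__getitem__)
        if col != no_lane_col:
            points.append((col, y))
    return points

# 3 candidate columns plus 1 abstain column (index 3), over 4 rows.
probs = [[0.1, 0.7, 0.1, 0.1],
         [0.1, 0.1, 0.7, 0.1],
         [0.1, 0.1, 0.6, 0.2],
         [0.1, 0.1, 0.1, 0.7]]   # lane leaves the image on the last row
print(decode_rowwise(probs, no_lane_col=3))  # → [(1, 0), (2, 1), (2, 2)]
```

Because only one classification is made per row (rather than one per pixel), inference cost scales with image height instead of area, which is the source of the method's speed.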
Traffic light recognition involves detecting the light housing, classifying its state (red, yellow, green, arrows), and associating each light with the lane it governs.
This is particularly challenging because traffic lights are small in the image (especially at distance), can be occluded, and their relevance depends on lane assignment.
Traffic sign recognition (TSR) is a well-studied problem. The German Traffic Sign Recognition Benchmark (GTSRB) has been solved to superhuman accuracy. Modern systems can recognize hundreds of sign types, including speed limits, stop signs, yield signs, and regulatory signs.
Estimating depth from a single image is inherently ambiguous (an ill-posed problem: many different 3D scenes project to the same 2D image), but deep networks can learn strong depth priors (typical object sizes, perspective cues, ground-plane geometry) from training data.
Stereo matching estimates depth by finding corresponding pixels between left and right camera images:
\[\text{depth} = \frac{f \cdot B}{d}\]

where $f$ is the focal length, $B$ is the baseline (the distance between the two cameras), and $d$ is the disparity.
RAFT-Stereo (2021): Uses iterative updates to a correlation volume for high-quality stereo matching.
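Plugging numbers into the depth formula makes the trade-off concrete (a hypothetical rig: the 0.54 m baseline is KITTI-like, and the 700 px focal length is purely illustrative):

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Triangulate metric depth from stereo disparity: depth = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

print(depth_from_disparity(700, 0.54, 10.0))   # → 37.8  (metres)
print(depth_from_disparity(700, 0.54, 1.0))    # → 378.0 (metres)
```

Because depth is inversely proportional to disparity, a one-pixel matching error matters little for nearby objects but causes huge depth errors for distant ones — one reason stereo range is limited by baseline and resolution.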
Deep learning models are data-hungry. Autonomous driving perception requires massive annotated datasets:
| Dataset | Year | Size | Annotations |
|---|---|---|---|
| KITTI | 2012 | 15K frames | 2D/3D boxes, depth, flow |
| nuScenes | 2019 | 1.4M frames | 3D boxes, maps, trajectories |
| Waymo Open | 2019 | 1.2M frames | 3D boxes, segmentation |
| Argoverse 2 | 2021 | 1M frames | 3D boxes, maps, forecasting |
| ONCE | 2021 | 1M frames | 3D boxes (semi-supervised) |
Labeling 3D bounding boxes is expensive — typically $2–$10 per box. A single 20-second driving clip might contain 100+ objects. Companies like Scale AI and Appen provide annotation services, but the cost of labeling at scale (billions of frames) is a major bottleneck.
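The arithmetic is worth making explicit (using only the per-box cost range and the per-clip object count quoted above):

```python
def clip_cost(objects, low, high):
    """Annotation cost range for one clip, at $low-$high per 3D box."""
    return objects * low, objects * high

# 100+ objects in a single 20-second clip at $2-$10 per box:
print(clip_cost(100, 2, 10))   # → (200, 1000), i.e. $200-$1,000 per clip
```

At hundreds or thousands of dollars per 20-second clip, scaling to billions of frames by human labeling alone is economically infeasible, which motivates the auto-labeling strategy below.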
Auto-labeling uses the AV’s own high-quality sensors (especially LiDAR) to generate annotations automatically. For example, 3D bounding boxes can be auto-generated from LiDAR point clouds and used to train camera-only detectors. This is a key advantage of having a LiDAR-equipped fleet, even if the production system relies on cameras alone.
Autonomous driving perception must operate in real time — typically at 10–30 Hz with latency under 100 ms. This creates a fundamental tension between accuracy and speed: the most accurate models are often too large to meet the latency budget on in-vehicle hardware.
Computer vision for autonomous driving has been transformed by deep learning. The key trends of 2024–2026 run through this chapter: transformer and BEV-based architectures, auto-labeling at fleet scale, and ever tighter accuracy-latency trade-offs on in-vehicle hardware.
The next chapter explores how the vehicle determines its precise location in the world — the localization problem.
| ← Previous: Sensor Technologies and Hardware | Next: Localization and Mapping → |