Chapter 2: Computer Vision and Deep Learning

Computer vision is the discipline that enables autonomous vehicles to extract meaning from visual data. Over the past decade, deep learning has revolutionized this field, transforming perception from hand-crafted feature pipelines to end-to-end learned representations. This chapter covers the core algorithms that turn pixels into understanding.

The Perception Tasks

An AV’s vision system must solve several tasks simultaneously:

  1. Object detection: Locating and classifying objects (cars, pedestrians, cyclists, trucks) with 2D bounding boxes
  2. 3D object detection: Estimating 3D position, size, and orientation of objects
  3. Semantic segmentation: Classifying every pixel (road, sidewalk, vegetation, sky, vehicle)
  4. Instance segmentation: Distinguishing individual object instances
  5. Lane detection: Finding lane boundaries and road edges
  6. Traffic light/sign recognition: Detecting and classifying traffic signals
  7. Depth estimation: Inferring 3D structure from monocular or stereo images
  8. Optical flow: Computing pixel-level motion between frames

Convolutional Neural Networks (CNNs)

CNNs form the backbone of most vision systems in autonomous driving. A CNN learns hierarchical features through stacked convolutional layers.

Convolution Operation

A 2D convolution applies a small learnable filter (kernel) across the input image:

\[(f * g)(x, y) = \sum_{i}\sum_{j} f(i, j) \cdot g(x-i, y-j)\]

Early layers learn low-level features (edges, corners, textures). Deeper layers compose these into high-level representations (wheels, headlights, human silhouettes).
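In practice, deep-learning frameworks do not flip the kernel, so what they call "convolution" is technically cross-correlation; the distinction is irrelevant when the kernel is learned. A minimal NumPy sketch (the function name and the Sobel edge example are illustrative, not from any particular library):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2D cross-correlation (what DL frameworks call 'convolution')."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

# A hand-crafted vertical-edge (Sobel) kernel responds to the step edge.
img = np.zeros((5, 5))
img[:, 2:] = 1.0                       # step edge down the middle
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
resp = conv2d(img, sobel_x)
print(resp.shape)  # (3, 3)
```

In a CNN, kernels like `sobel_x` are not hand-designed; they are exactly what the early layers learn from data.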

Key CNN Architectures

ResNet (2015): Introduced residual connections (skip connections) that enable training very deep networks (50–152+ layers). The key insight: instead of learning the function $H(x)$, learn the residual $F(x) = H(x) - x$, then compute $H(x) = F(x) + x$. This solves the vanishing gradient problem for deep networks.
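The residual formulation can be sketched in a few lines of NumPy. With near-zero weights, $F(x) \approx 0$ and the block reduces to (roughly) the identity, which is what makes very deep stacks trainable (all names and sizes below are illustrative):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """y = relu(F(x) + x): F is two weight layers; the skip adds the identity path."""
    f = relu(x @ w1) @ w2      # the learned residual F(x)
    return relu(f + x)         # skip connection

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w1 = rng.normal(size=(8, 8)) * 0.01    # near-zero weights => F(x) ~ 0
w2 = rng.normal(size=(8, 8)) * 0.01
y = residual_block(x, w1, w2)
# With tiny weights the block stays close to the identity (on positive inputs):
print(np.allclose(y, relu(x), atol=0.05))
```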

EfficientNet (2019): Systematically scales network depth, width, and resolution using a compound scaling coefficient, achieving better accuracy-efficiency trade-offs.

ConvNeXt (2022): Modernized the classic ResNet design by incorporating ideas from Vision Transformers (large kernel sizes, LayerNorm, GELU activation), showing that pure CNNs can match transformer performance.

Vision Transformers

Transformers, originally designed for natural language processing, have transformed computer vision since the introduction of ViT (Vision Transformer) in 2020.

How ViT Works

  1. Patch embedding: The input image is divided into fixed-size patches (e.g., 16×16 pixels). Each patch is flattened and linearly projected to produce a sequence of patch embeddings.

  2. Positional encoding: Position information is added to each patch embedding since transformers have no inherent notion of spatial order.

  3. Self-attention: The core mechanism. For each patch, the model computes:

\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]

where $Q$ (queries), $K$ (keys), and $V$ (values) are linear projections of the patch embeddings, and $d_k$ is the key dimension. This allows every patch to attend to every other patch, capturing long-range dependencies.

  4. Multi-head attention: Multiple attention heads capture different types of relationships in parallel.
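The attention formula above is short enough to implement directly. A single-head NumPy sketch over a toy sequence of patch embeddings (dimensions and names are illustrative):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)    # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (n_patches, n_patches)
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(42)
n_patches, d = 6, 4                            # toy sequence of 6 patch embeddings
x = rng.normal(size=(n_patches, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out, w = attention(x @ Wq, x @ Wk, x @ Wv)
print(out.shape, w.sum(axis=-1))               # (6, 4), rows of w sum to 1
```

Note that `scores` is an $n \times n$ matrix: every patch attends to every other patch, which is both the source of ViT's long-range modeling power and of its quadratic cost.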

Swin Transformer

The Swin Transformer (2021) introduces shifted window attention, which restricts attention to local windows and shifts them between layers. This reduces the quadratic complexity of global attention to linear complexity with respect to image size, making it practical for high-resolution images in autonomous driving.
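The window partition itself is just a reshape; attention then runs independently inside each window, so cost grows with the number of windows rather than with $(HW)^2$. A minimal NumPy sketch (the function name is illustrative; the "shift" between layers amounts to rolling the feature map by half a window before partitioning):

```python
import numpy as np

def window_partition(x, ws):
    """Split an (H, W, C) feature map into (num_windows, ws*ws, C) local windows."""
    H, W, C = x.shape
    x = x.reshape(H // ws, ws, W // ws, ws, C)
    x = x.transpose(0, 2, 1, 3, 4)             # group each ws x ws window together
    return x.reshape(-1, ws * ws, C)

feat = np.arange(8 * 8 * 1).reshape(8, 8, 1)
wins = window_partition(feat, ws=4)
print(wins.shape)  # (4, 16, 1): four 4x4 windows instead of one 64-token sequence
```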

BEV Transformers

A major innovation in AV perception is the Bird’s Eye View (BEV) transformer. These models use attention to lift features from multiple camera images into a unified top-down (BEV) representation that is natural for driving.

The BEV representation is powerful because it is metric and ego-centric: distances on the grid correspond directly to distances on the road, features from multiple cameras (and other sensors) fuse cleanly into a single frame, and downstream prediction and planning modules already operate in this top-down space.

Object Detection

Two-Stage Detectors

Faster R-CNN (2015): The canonical two-stage detector.

  1. Stage 1: A Region Proposal Network (RPN) generates candidate bounding boxes (proposals) that might contain objects.
  2. Stage 2: A classification head and regression head refine each proposal’s class label and bounding box coordinates.

Two-stage detectors are accurate but relatively slow.

Single-Stage Detectors

YOLO (You Only Look Once): Divides the image into a grid and predicts bounding boxes and class probabilities directly in a single pass. Dramatically faster than two-stage detectors. The YOLO family has evolved through many versions (YOLOv1 through YOLOv8+), with each iteration improving speed and accuracy.

SSD (Single Shot Detector): Predicts bounding boxes at multiple scales from different layers of a feature pyramid.

Anchor-Free Detectors

Modern detectors increasingly avoid predefined anchor boxes:

CenterNet (2019): Represents objects as center points and regresses properties (size, offset) from each center point. Elegant and efficient.
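Decoding a CenterNet-style output is simple: object centers are the local maxima of the class heatmap. A NumPy sketch of that peak extraction (a stand-in for the 3×3 max-pool trick used in practice; threshold and values are invented):

```python
import numpy as np

def heatmap_peaks(hm, thresh=0.3):
    """Return (row, col) of confident local maxima of a class heatmap."""
    H, W = hm.shape
    padded = np.pad(hm, 1, constant_values=-np.inf)
    # 3x3 neighborhood maximum at every pixel
    neigh = np.max([padded[dy:dy + H, dx:dx + W]
                    for dy in range(3) for dx in range(3)], axis=0)
    ys, xs = np.where((hm == neigh) & (hm > thresh))
    return list(zip(ys.tolist(), xs.tolist()))

hm = np.zeros((6, 6))
hm[1, 1] = 0.9      # a confident object center
hm[4, 4] = 0.6      # a second object
hm[4, 5] = 0.4      # not the maximum of its neighborhood: suppressed
print(heatmap_peaks(hm))  # [(1, 1), (4, 4)]
```

Box size and sub-pixel offset are then read from separate regression maps at each surviving peak.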

FCOS (Fully Convolutional One-Stage): Predicts bounding boxes from every foreground pixel, with a centerness branch to downweight low-quality predictions.
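Whatever the detector family, overlap between boxes is scored with intersection-over-union (IoU), and duplicate predictions are pruned with non-maximum suppression (NMS). A minimal sketch of both (boxes and scores below are invented):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop anything overlapping it."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < thresh for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: the second box overlaps the first too much
```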

3D Object Detection from Camera

Detecting objects in 3D from monocular or multi-camera images is a key challenge, because a single image does not directly measure depth. Approaches include direct 3D regression from image features (e.g., FCOS3D), “pseudo-LiDAR” methods that first estimate a depth map and then run a point-cloud detector on it, and BEV methods that lift multi-camera features into a top-down grid before detection.

Semantic Segmentation

Semantic segmentation assigns a class label to every pixel in the image.

Fully Convolutional Networks (FCN)

FCN (2015) replaced the fully-connected layers in classification networks with convolutional layers, enabling pixel-wise prediction at any input resolution. The key challenge: convolution and pooling reduce spatial resolution. To recover it, FCN uses upsampling (transposed convolution or bilinear interpolation) and skip connections from earlier layers.
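The upsampling step can be made concrete. FCN’s transposed convolutions are typically initialized to plain bilinear interpolation; a self-contained NumPy sketch of 2× align-corners bilinear upsampling (function name illustrative):

```python
import numpy as np

def upsample2x_bilinear(x):
    """Double the spatial resolution of a 2D map with bilinear interpolation."""
    H, W = x.shape
    ys = np.linspace(0, H - 1, 2 * H)          # sample positions in the input grid
    xs = np.linspace(0, W - 1, 2 * W)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, H - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, W - 1)
    wy = (ys - y0)[:, None]; wx = (xs - x0)[None, :]
    top = x[np.ix_(y0, x0)] * (1 - wx) + x[np.ix_(y0, x1)] * wx
    bot = x[np.ix_(y1, x0)] * (1 - wx) + x[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bot * wy

coarse = np.array([[0.0, 1.0],
                   [2.0, 3.0]])
up = upsample2x_bilinear(coarse)
print(up.shape)  # (4, 4)
```

Repeating this (or its learnable transposed-convolution counterpart) recovers full resolution, while the skip connections re-inject the fine detail that interpolation alone cannot.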

U-Net Architecture

U-Net (2015) introduced the symmetric encoder-decoder structure with skip connections at each level:

Encoder                    Decoder
──────                    ──────
Input  → Conv → Pool ─────────────> Upsample + Concat → Conv → Output
          Conv → Pool ───────────> Upsample + Concat → Conv
            Conv → Pool ─────────> Upsample + Concat → Conv
              Conv (bottleneck)

The skip connections preserve fine spatial details that are lost during downsampling.

DeepLab Series

DeepLabv3+ (2018): Uses atrous (dilated) convolution to increase the receptive field without reducing resolution, plus an Atrous Spatial Pyramid Pooling (ASPP) module that captures multi-scale context.
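The effect of the dilation rate is easiest to see in 1D: spacing the kernel taps apart widens the receptive field without adding parameters or pooling. A minimal sketch (function name illustrative):

```python
import numpy as np

def dilated_conv1d(x, kernel, rate):
    """1D atrous convolution: taps are spaced `rate` samples apart, so a
    3-tap kernel covers a receptive field of 2*rate + 1 samples."""
    span = rate * (len(kernel) - 1)
    out = np.zeros(len(x) - span)
    for i in range(len(out)):
        out[i] = sum(k * x[i + j * rate] for j, k in enumerate(kernel))
    return out

x = np.arange(10, dtype=float)
print(dilated_conv1d(x, [1.0, 1.0, 1.0], rate=1))  # sums over a 3-wide window
print(dilated_conv1d(x, [1.0, 1.0, 1.0], rate=3))  # same kernel, 7-wide span
```

ASPP simply applies several such kernels with different rates in parallel and concatenates the results, capturing context at multiple scales at once.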

Panoptic Segmentation

Panoptic segmentation unifies semantic segmentation (stuff: road, sky, vegetation) and instance segmentation (things: individual cars, pedestrians). Every pixel receives a semantic class, and pixels of “thing” classes additionally receive an instance ID, yielding a single coherent description of the scene.

Lane Detection

Lane detection is critical for autonomous driving but challenging due to occlusions, worn markings, and complex road geometries.

Approaches

  1. Segmentation-based: Treat lane detection as pixel-wise segmentation, then fit curves to the segmented regions.

  2. Anchor-based: Define a set of predefined lane anchors (e.g., straight lines at various positions) and predict offsets from these anchors. LaneATT uses this approach.

  3. Parametric curve fitting: Directly regress polynomial or spline coefficients that describe each lane. PolyLaneNet fits 3rd-degree polynomials.

  4. Row-wise classification: For each row of the image, classify which column contains the lane marking. Ultra Fast Lane Detection achieves real-time performance with this approach.

  5. Transformer-based: CLRNet (2022) uses cross-layer refinement with attention mechanisms for robust lane detection.
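To make approach 4 concrete, here is a toy sketch of row-wise decoding: each image row is a classification over column bins, with one extra “no lane” bin so occluded rows can opt out (the grid size and scores below are invented):

```python
import numpy as np

def decode_rowwise(logits, no_lane_col):
    """For each row, pick the most likely column; skip rows that vote 'no lane'."""
    cols = logits.argmax(axis=1)
    return [(row, int(c)) for row, c in enumerate(cols) if c != no_lane_col]

# Toy grid: 4 rows x 5 column bins, plus column index 5 meaning "no lane here".
logits = np.array([
    [0.1, 0.8, 0.1, 0.0, 0.0, 0.0],   # lane near column 1
    [0.0, 0.2, 0.7, 0.1, 0.0, 0.0],   # lane drifts to column 2
    [0.0, 0.0, 0.3, 0.6, 0.0, 0.1],   # column 3
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.9],   # occluded row: no lane
])
print(decode_rowwise(logits, no_lane_col=5))  # [(0, 1), (1, 2), (2, 3)]
```

Because there are only rows × bins classifications instead of a full-resolution segmentation, decoding is extremely fast, which is the source of the real-time performance noted above.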

Traffic Light and Sign Recognition

Traffic Light Detection

Traffic light recognition involves:

  1. Detection: Locating traffic lights in the image
  2. State classification: Determining the light state (red, yellow, green, flashing, arrow direction)
  3. Relevance association: Determining which traffic light applies to the ego vehicle’s lane

This is particularly challenging because traffic lights are small in the image (especially at distance), can be occluded, and their relevance depends on lane assignment.

Sign Recognition

Traffic sign recognition (TSR) is a well-studied problem. The German Traffic Sign Recognition Benchmark (GTSRB) has been solved to superhuman accuracy. Modern systems can recognize hundreds of sign types, including speed limits, stop signs, yield signs, and regulatory signs.

Depth Estimation

Monocular Depth Estimation

Estimating depth from a single image is inherently ambiguous (an ill-posed problem), but deep networks can learn depth priors from training data. Supervised methods regress depth against LiDAR or stereo ground truth, while self-supervised methods (e.g., Monodepth2) train from unlabeled video by using photometric consistency between frames as the supervisory signal.

Stereo Depth Estimation

Stereo matching estimates depth by finding corresponding pixels between left and right camera images:

\[\text{depth} = \frac{f \cdot B}{d}\]

where $f$ is the focal length, $B$ is the baseline, and $d$ is the disparity.
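Plugging illustrative numbers into the formula (a hypothetical 1000 px focal length and 0.5 m baseline) shows how quickly depth grows as disparity shrinks, and why a one-pixel disparity error matters far more at long range:

```python
def stereo_depth(focal_px, baseline_m, disparity_px):
    """depth = f * B / d; disparity shrinks as objects get farther away."""
    return focal_px * baseline_m / disparity_px

# Hypothetical rig: f = 1000 px, B = 0.5 m.
for d in (50.0, 10.0, 1.0):
    print(f"disparity {d:5.1f} px -> depth {stereo_depth(1000.0, 0.5, d):6.1f} m")
```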

RAFT-Stereo (2021): Uses iterative updates to a correlation volume for high-quality stereo matching.

Training Data and Annotation

Deep learning models are data-hungry. Autonomous driving perception requires massive annotated datasets:

Major Datasets

Dataset        Year   Size          Annotations
KITTI          2012   15K frames    2D/3D boxes, depth, flow
nuScenes       2019   1.4M frames   3D boxes, maps, trajectories
Waymo Open     2019   1.2M frames   3D boxes, segmentation
Argoverse 2    2021   1M frames     3D boxes, maps, forecasting
ONCE           2021   1M frames     3D boxes (semi-supervised)

Annotation Challenges

Labeling 3D bounding boxes is expensive — typically $2–$10 per box. A single 20-second driving clip might contain 100+ objects. Companies like Scale AI and Appen provide annotation services, but the cost of labeling at scale (billions of frames) is a major bottleneck.

Auto-labeling uses the AV’s own high-quality sensors (especially LiDAR) to generate annotations automatically. For example, 3D bounding boxes can be auto-generated from LiDAR point clouds and used to train camera-only detectors. This is a key advantage of having a LiDAR-equipped fleet, even if the production system relies on cameras alone.

Real-Time Constraints

Autonomous driving perception must operate in real time — typically at 10–30 Hz with latency under 100 ms. This creates a fundamental tension between accuracy and speed: larger models are more accurate but slower, so deployed systems lean on techniques such as quantization, pruning, knowledge distillation, and optimized inference runtimes to fit the latency budget.
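A back-of-the-envelope check makes the constraint concrete: at 30 Hz, the entire stack must finish in about 33 ms per frame. The stage latencies below are invented for illustration, not measurements of any real system:

```python
def frame_budget_ms(rate_hz):
    """Time available per frame at a given processing rate."""
    return 1000.0 / rate_hz

# Hypothetical per-stage latencies for a camera perception pipeline.
stages = {"preprocess": 3.0, "backbone": 15.0, "detection heads": 8.0,
          "post-processing": 4.0}
total = sum(stages.values())
print(f"pipeline {total:.0f} ms vs budget {frame_budget_ms(30):.1f} ms at 30 Hz")
```

Even this optimistic sketch leaves only a few milliseconds of slack, which is why every stage of the pipeline is aggressively optimized.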

Summary

Computer vision for autonomous driving has been transformed by deep learning. The key trends in 2024–2026 include:

  1. Transformers replacing CNNs for many tasks, especially with BEV representations
  2. Multi-task architectures that jointly solve detection, segmentation, depth, and lane estimation
  3. Camera-based 3D perception narrowing the gap with LiDAR-based methods
  4. Foundation models (pre-trained on internet-scale data) being adapted for driving tasks
  5. Self-supervised and auto-labeling reducing dependency on manual annotation

The next chapter explores how the vehicle determines its precise location in the world — the localization problem.


← Previous: Sensor Technologies and Hardware Next: Localization and Mapping →

← Back to Table of Contents