Chapter 2: Computer Vision and Deep Learning

Computer vision is the discipline that enables autonomous vehicles to extract meaning from visual data. Over the past decade, deep learning has revolutionized this field, transforming perception from hand-crafted feature pipelines to end-to-end learned representations. This chapter covers the core algorithms that turn pixels into understanding.

The Perception Tasks

An AV’s vision system must solve several tasks simultaneously:

  1. Object detection: Locating and classifying objects (cars, pedestrians, cyclists, trucks) with 2D bounding boxes
  2. 3D object detection: Estimating 3D position, size, and orientation of objects
  3. Semantic segmentation: Classifying every pixel (road, sidewalk, vegetation, sky, vehicle)
  4. Instance segmentation: Distinguishing individual object instances
  5. Lane detection: Finding lane boundaries and road edges
  6. Traffic light/sign recognition: Detecting and classifying traffic signals
  7. Depth estimation: Inferring 3D structure from monocular or stereo images
  8. Optical flow: Computing pixel-level motion between frames

Convolutional Neural Networks (CNNs)

CNNs form the backbone of most vision systems in autonomous driving. A CNN learns hierarchical features through stacked convolutional layers.

Convolution Operation

A 2D convolution applies a small learnable filter (kernel) across the input image:

\[(f * g)(x, y) = \sum_{i}\sum_{j} f(i, j) \cdot g(x-i, y-j)\]

Early layers learn low-level features (edges, corners, textures). Deeper layers compose these into high-level representations (wheels, headlights, human silhouettes).
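In practice, deep-learning frameworks do not flip the kernel, so what they call "convolution" is technically cross-correlation; the distinction is irrelevant when the kernel is learned. A minimal NumPy sketch (the function name and the Sobel edge example are illustrative, not from any particular library):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2D cross-correlation (what DL frameworks call 'convolution')."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

# A hand-crafted vertical-edge (Sobel) kernel responds to the step edge.
img = np.zeros((5, 5))
img[:, 2:] = 1.0                       # step edge down the middle
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
resp = conv2d(img, sobel_x)
print(resp.shape)  # (3, 3)
```

In a CNN, kernels like `sobel_x` are not hand-designed; they are exactly what the early layers learn from data.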

Key CNN Architectures

ResNet (2015): Introduced residual connections (skip connections) that enable training very deep networks (50–152+ layers). The key insight: instead of learning the function $H(x)$, learn the residual $F(x) = H(x) - x$, then compute $H(x) = F(x) + x$. This solves the vanishing gradient problem for deep networks.
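The residual formulation can be sketched in a few lines of NumPy. With near-zero weights, $F(x) \approx 0$ and the block reduces to (roughly) the identity, which is what makes very deep stacks trainable (all names and sizes below are illustrative):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """y = relu(F(x) + x): F is two weight layers; the skip adds the identity path."""
    f = relu(x @ w1) @ w2      # the learned residual F(x)
    return relu(f + x)         # skip connection

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w1 = rng.normal(size=(8, 8)) * 0.01    # near-zero weights => F(x) ~ 0
w2 = rng.normal(size=(8, 8)) * 0.01
y = residual_block(x, w1, w2)
# With tiny weights the block stays close to the identity (on positive inputs):
print(np.allclose(y, relu(x), atol=0.05))
```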

EfficientNet (2019): Systematically scales network depth, width, and resolution using a compound scaling coefficient, achieving better accuracy-efficiency trade-offs.

ConvNeXt (2022): Modernized the classic ResNet design by incorporating ideas from Vision Transformers (large kernel sizes, LayerNorm, GELU activation), showing that pure CNNs can match transformer performance.

Vision Transformers

Transformers, originally designed for natural language processing, have transformed computer vision since the introduction of ViT (Vision Transformer) in 2020.

How ViT Works

  1. Patch embedding: The input image is divided into fixed-size patches (e.g., 16×16 pixels). Each patch is flattened and linearly projected to produce a sequence of patch embeddings.

  2. Positional encoding: Position information is added to each patch embedding since transformers have no inherent notion of spatial order.

  3. Self-attention: The core mechanism. For each patch, the model computes:

\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]

where $Q$ (queries), $K$ (keys), and $V$ (values) are linear projections of the patch embeddings, and $d_k$ is the key dimension. This allows every patch to attend to every other patch, capturing long-range dependencies.

  4. Multi-head attention: Multiple attention heads capture different types of relationships in parallel.
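The attention formula above is short enough to implement directly. A single-head NumPy sketch over a toy sequence of patch embeddings (dimensions and names are illustrative):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)    # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (n_patches, n_patches)
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(42)
n_patches, d = 6, 4                            # toy sequence of 6 patch embeddings
x = rng.normal(size=(n_patches, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out, w = attention(x @ Wq, x @ Wk, x @ Wv)
print(out.shape, w.sum(axis=-1))               # (6, 4), rows of w sum to 1
```

Note that `scores` is an $n \times n$ matrix: every patch attends to every other patch, which is both the source of ViT's long-range modeling power and of its quadratic cost.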

Swin Transformer

The Swin Transformer (2021) introduces shifted window attention, which restricts attention to local windows and shifts them between layers. This reduces the quadratic complexity of global attention to linear complexity with respect to image size, making it practical for high-resolution images in autonomous driving.
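The window partition itself is just a reshape; attention then runs independently inside each window, so cost grows with the number of windows rather than with $(HW)^2$. A minimal NumPy sketch (the function name is illustrative; the "shift" between layers amounts to rolling the feature map by half a window before partitioning):

```python
import numpy as np

def window_partition(x, ws):
    """Split an (H, W, C) feature map into (num_windows, ws*ws, C) local windows."""
    H, W, C = x.shape
    x = x.reshape(H // ws, ws, W // ws, ws, C)
    x = x.transpose(0, 2, 1, 3, 4)             # group each ws x ws window together
    return x.reshape(-1, ws * ws, C)

feat = np.arange(8 * 8 * 1).reshape(8, 8, 1)
wins = window_partition(feat, ws=4)
print(wins.shape)  # (4, 16, 1): four 4x4 windows instead of one 64-token sequence
```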

BEV Transformers

A major innovation in AV perception is the Bird’s Eye View (BEV) transformer. These models use attention to lift features from multiple camera images into a unified top-down (BEV) representation that is natural for driving.

The BEV representation is powerful because it is metric and ego-centric: distances on the grid correspond directly to distances on the road, features from multiple cameras (and other sensors) fuse cleanly into a single frame, and downstream prediction and planning modules already operate in this top-down space.

Object Detection

Two-Stage Detectors

Faster R-CNN (2015): The canonical two-stage detector.

  1. Stage 1: A Region Proposal Network (RPN) generates candidate bounding boxes (proposals) that might contain objects.
  2. Stage 2: A classification head and regression head refine each proposal’s class label and bounding box coordinates.

Two-stage detectors are accurate but relatively slow.

Single-Stage Detectors

YOLO (You Only Look Once): Divides the image into a grid and predicts bounding boxes and class probabilities directly in a single pass. Dramatically faster than two-stage detectors. The YOLO family has evolved through many versions (YOLOv1 through YOLOv8+), with each iteration improving speed and accuracy.

SSD (Single Shot Detector): Predicts bounding boxes at multiple scales from different layers of a feature pyramid.

Anchor-Free Detectors

Modern detectors increasingly avoid predefined anchor boxes:

CenterNet (2019): Represents objects as center points and regresses properties (size, offset) from each center point. Elegant and efficient.
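Decoding a CenterNet-style output is simple: object centers are the local maxima of the class heatmap. A NumPy sketch of that peak extraction (a stand-in for the 3×3 max-pool trick used in practice; threshold and values are invented):

```python
import numpy as np

def heatmap_peaks(hm, thresh=0.3):
    """Return (row, col) of confident local maxima of a class heatmap."""
    H, W = hm.shape
    padded = np.pad(hm, 1, constant_values=-np.inf)
    # 3x3 neighborhood maximum at every pixel
    neigh = np.max([padded[dy:dy + H, dx:dx + W]
                    for dy in range(3) for dx in range(3)], axis=0)
    ys, xs = np.where((hm == neigh) & (hm > thresh))
    return list(zip(ys.tolist(), xs.tolist()))

hm = np.zeros((6, 6))
hm[1, 1] = 0.9      # a confident object center
hm[4, 4] = 0.6      # a second object
hm[4, 5] = 0.4      # not the maximum of its neighborhood: suppressed
print(heatmap_peaks(hm))  # [(1, 1), (4, 4)]
```

Box size and sub-pixel offset are then read from separate regression maps at each surviving peak.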

FCOS (Fully Convolutional One-Stage): Predicts bounding boxes from every foreground pixel, with a centerness branch to downweight low-quality predictions.
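Whatever the detector family, overlap between boxes is scored with intersection-over-union (IoU), and duplicate predictions are pruned with non-maximum suppression (NMS). A minimal sketch of both (boxes and scores below are invented):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop anything overlapping it."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < thresh for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: the second box overlaps the first too much
```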

3D Object Detection from Camera

Detecting objects in 3D from monocular or multi-camera images is a key challenge, because a single image does not directly measure depth. Approaches include direct 3D regression from image features (e.g., FCOS3D), “pseudo-LiDAR” methods that first estimate a depth map and then run a point-cloud detector on it, and BEV methods that lift multi-camera features into a top-down grid before detection.

Semantic Segmentation

Semantic segmentation assigns a class label to every pixel in the image.

Fully Convolutional Networks (FCN)

FCN (2015) replaced the fully-connected layers in classification networks with convolutional layers, enabling pixel-wise prediction at any input resolution. The key challenge: convolution and pooling reduce spatial resolution. To recover it, FCN uses upsampling (transposed convolution or bilinear interpolation) and skip connections from earlier layers.
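The upsampling step can be made concrete. FCN’s transposed convolutions are typically initialized to plain bilinear interpolation; a self-contained NumPy sketch of 2× align-corners bilinear upsampling (function name illustrative):

```python
import numpy as np

def upsample2x_bilinear(x):
    """Double the spatial resolution of a 2D map with bilinear interpolation."""
    H, W = x.shape
    ys = np.linspace(0, H - 1, 2 * H)          # sample positions in the input grid
    xs = np.linspace(0, W - 1, 2 * W)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, H - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, W - 1)
    wy = (ys - y0)[:, None]; wx = (xs - x0)[None, :]
    top = x[np.ix_(y0, x0)] * (1 - wx) + x[np.ix_(y0, x1)] * wx
    bot = x[np.ix_(y1, x0)] * (1 - wx) + x[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bot * wy

coarse = np.array([[0.0, 1.0],
                   [2.0, 3.0]])
up = upsample2x_bilinear(coarse)
print(up.shape)  # (4, 4)
```

Repeating this (or its learnable transposed-convolution counterpart) recovers full resolution, while the skip connections re-inject the fine detail that interpolation alone cannot.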

U-Net Architecture

U-Net (2015) introduced the symmetric encoder-decoder structure with skip connections at each level:

Encoder                    Decoder
──────                    ──────
Input  → Conv → Pool ─────────────> Upsample + Concat → Conv → Output
          Conv → Pool ───────────> Upsample + Concat → Conv
            Conv → Pool ─────────> Upsample + Concat → Conv
              Conv (bottleneck)

The skip connections preserve fine spatial details that are lost during downsampling.

DeepLab Series

DeepLabv3+ (2018): Uses atrous (dilated) convolution to increase the receptive field without reducing resolution, plus an Atrous Spatial Pyramid Pooling (ASPP) module that captures multi-scale context.
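The effect of the dilation rate is easiest to see in 1D: spacing the kernel taps apart widens the receptive field without adding parameters or pooling. A minimal sketch (function name illustrative):

```python
import numpy as np

def dilated_conv1d(x, kernel, rate):
    """1D atrous convolution: taps are spaced `rate` samples apart, so a
    3-tap kernel covers a receptive field of 2*rate + 1 samples."""
    span = rate * (len(kernel) - 1)
    out = np.zeros(len(x) - span)
    for i in range(len(out)):
        out[i] = sum(k * x[i + j * rate] for j, k in enumerate(kernel))
    return out

x = np.arange(10, dtype=float)
print(dilated_conv1d(x, [1.0, 1.0, 1.0], rate=1))  # sums over a 3-wide window
print(dilated_conv1d(x, [1.0, 1.0, 1.0], rate=3))  # same kernel, 7-wide span
```

ASPP simply applies several such kernels with different rates in parallel and concatenates the results, capturing context at multiple scales at once.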

Panoptic Segmentation

Panoptic segmentation unifies semantic segmentation (stuff: road, sky, vegetation) and instance segmentation (things: individual cars, pedestrians). Every pixel receives a semantic class, and pixels of “thing” classes additionally receive an instance ID, yielding a single coherent description of the scene.

Lane Detection

Lane detection is critical for autonomous driving but challenging due to occlusions, worn markings, and complex road geometries.

Approaches

  1. Segmentation-based: Treat lane detection as pixel-wise segmentation, then fit curves to the segmented regions.

  2. Anchor-based: Define a set of predefined lane anchors (e.g., straight lines at various positions) and predict offsets from these anchors. LaneATT uses this approach.

  3. Parametric curve fitting: Directly regress polynomial or spline coefficients that describe each lane. PolyLaneNet fits 3rd-degree polynomials.

  4. Row-wise classification: For each row of the image, classify which column contains the lane marking. Ultra Fast Lane Detection achieves real-time performance with this approach.

  5. Transformer-based: CLRNet (2022) uses cross-layer refinement with attention mechanisms for robust lane detection.
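To make approach 4 concrete, here is a toy sketch of row-wise decoding: each image row is a classification over column bins, with one extra “no lane” bin so occluded rows can opt out (the grid size and scores below are invented):

```python
import numpy as np

def decode_rowwise(logits, no_lane_col):
    """For each row, pick the most likely column; skip rows that vote 'no lane'."""
    cols = logits.argmax(axis=1)
    return [(row, int(c)) for row, c in enumerate(cols) if c != no_lane_col]

# Toy grid: 4 rows x 5 column bins, plus column index 5 meaning "no lane here".
logits = np.array([
    [0.1, 0.8, 0.1, 0.0, 0.0, 0.0],   # lane near column 1
    [0.0, 0.2, 0.7, 0.1, 0.0, 0.0],   # lane drifts to column 2
    [0.0, 0.0, 0.3, 0.6, 0.0, 0.1],   # column 3
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.9],   # occluded row: no lane
])
print(decode_rowwise(logits, no_lane_col=5))  # [(0, 1), (1, 2), (2, 3)]
```

Because there are only rows × bins classifications instead of a full-resolution segmentation, decoding is extremely fast, which is the source of the real-time performance noted above.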

Traffic Light and Sign Recognition

Traffic Light Detection

Traffic light recognition involves:

  1. Detection: Locating traffic lights in the image
  2. State classification: Determining the light state (red, yellow, green, flashing, arrow direction)
  3. Relevance association: Determining which traffic light applies to the ego vehicle’s lane

This is particularly challenging because traffic lights are small in the image (especially at distance), can be occluded, and their relevance depends on lane assignment.

Sign Recognition

Traffic sign recognition (TSR) is a well-studied problem. The German Traffic Sign Recognition Benchmark (GTSRB) has been solved to superhuman accuracy. Modern systems can recognize hundreds of sign types, including speed limits, stop signs, yield signs, and regulatory signs.

Depth Estimation

Monocular Depth Estimation

Estimating depth from a single image is inherently ambiguous (an ill-posed problem), but deep networks can learn depth priors from training data. Supervised methods regress depth against LiDAR or stereo ground truth, while self-supervised methods (e.g., Monodepth2) train from unlabeled video by using photometric consistency between frames as the supervisory signal.

Stereo Depth Estimation

Stereo matching estimates depth by finding corresponding pixels between left and right camera images:

\[\text{depth} = \frac{f \cdot B}{d}\]

where $f$ is the focal length, $B$ is the baseline, and $d$ is the disparity.
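Plugging illustrative numbers into the formula (a hypothetical 1000 px focal length and 0.5 m baseline) shows how quickly depth grows as disparity shrinks, and why a one-pixel disparity error matters far more at long range:

```python
def stereo_depth(focal_px, baseline_m, disparity_px):
    """depth = f * B / d; disparity shrinks as objects get farther away."""
    return focal_px * baseline_m / disparity_px

# Hypothetical rig: f = 1000 px, B = 0.5 m.
for d in (50.0, 10.0, 1.0):
    print(f"disparity {d:5.1f} px -> depth {stereo_depth(1000.0, 0.5, d):6.1f} m")
```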

RAFT-Stereo (2021): Uses iterative updates to a correlation volume for high-quality stereo matching.

Training Data and Annotation

Deep learning models are data-hungry. Autonomous driving perception requires massive annotated datasets:

Major Datasets

Dataset        Year   Size          Annotations
KITTI          2012   15K frames    2D/3D boxes, depth, flow
nuScenes       2019   1.4M frames   3D boxes, maps, trajectories
Waymo Open     2019   1.2M frames   3D boxes, segmentation
Argoverse 2    2021   1M frames     3D boxes, maps, forecasting
ONCE           2021   1M frames     3D boxes (semi-supervised)

Annotation Challenges

Labeling 3D bounding boxes is expensive — typically $2–$10 per box. A single 20-second driving clip might contain 100+ objects. Companies like Scale AI and Appen provide annotation services, but the cost of labeling at scale (billions of frames) is a major bottleneck.

Auto-labeling uses the AV’s own high-quality sensors (especially LiDAR) to generate annotations automatically. For example, 3D bounding boxes can be auto-generated from LiDAR point clouds and used to train camera-only detectors. This is a key advantage of having a LiDAR-equipped fleet, even if the production system relies on cameras alone.

Real-Time Constraints

Autonomous driving perception must operate in real time — typically at 10–30 Hz with latency under 100 ms. This creates a fundamental tension between accuracy and speed: larger models are more accurate but slower, so deployed systems lean on techniques such as quantization, pruning, knowledge distillation, and optimized inference runtimes to fit the latency budget.
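A back-of-the-envelope check makes the constraint concrete: at 30 Hz, the entire stack must finish in about 33 ms per frame. The stage latencies below are invented for illustration, not measurements of any real system:

```python
def frame_budget_ms(rate_hz):
    """Time available per frame at a given processing rate."""
    return 1000.0 / rate_hz

# Hypothetical per-stage latencies for a camera perception pipeline.
stages = {"preprocess": 3.0, "backbone": 15.0, "detection heads": 8.0,
          "post-processing": 4.0}
total = sum(stages.values())
print(f"pipeline {total:.0f} ms vs budget {frame_budget_ms(30):.1f} ms at 30 Hz")
```

Even this optimistic sketch leaves only a few milliseconds of slack, which is why every stage of the pipeline is aggressively optimized.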

Summary

Computer vision for autonomous driving has been transformed by deep learning. The key trends in 2024–2026 include:

  1. Transformers replacing CNNs for many tasks, especially with BEV representations
  2. Multi-task architectures that jointly solve detection, segmentation, depth, and lane estimation
  3. Camera-based 3D perception narrowing the gap with LiDAR-based methods
  4. Foundation models (pre-trained on internet-scale data) being adapted for driving tasks
  5. Self-supervised and auto-labeling reducing dependency on manual annotation

The next chapter explores how the vehicle determines its precise location in the world — the localization problem.


← Previous: Sensor Technologies and Hardware Next: Localization and Mapping →

← Back to Table of Contents