Deep learning in 2025 is defined by large foundation models, multimodal capabilities, and an increasing focus on reasoning and efficiency. But the field continues to evolve rapidly. This chapter surveys the most promising research directions and the challenges that remain.
Transformers have a fundamental limitation: self-attention is $O(n^2)$ in sequence length. Processing a document with 100,000 tokens requires computing attention between every pair of tokens — 10 billion operations per layer. This makes long-context processing expensive and limits practical context windows.
Gu et al. (2021) proposed S4, a structured state-space sequence model that draws on continuous-time dynamical systems:
\[h'(t) = Ah(t) + Bx(t)\]
\[y(t) = Ch(t) + Dx(t)\]
where $A, B, C, D$ are learnable matrices. Discretized for sequence processing, this becomes a linear recurrence that can be computed as a convolution during training (parallelizable) and as a recurrence during inference (efficient).
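The recurrent/convolutional duality can be checked numerically. Below is a minimal sketch with a scalar state; the values of $A, B, C, D$ are illustrative, not taken from any paper:

```python
import numpy as np

# Toy discretized SSM with a scalar hidden state; parameter values are illustrative.
A, B, C, D = 0.9, 1.0, 0.5, 0.1
rng = np.random.default_rng(0)
x = rng.standard_normal(16)

# Recurrent form (efficient at inference): h[t] = A*h[t-1] + B*x[t], y[t] = C*h[t] + D*x[t]
h, y_rec = 0.0, []
for xt in x:
    h = A * h + B * xt
    y_rec.append(C * h + D * xt)
y_rec = np.array(y_rec)

# Convolutional form (parallelizable during training): kernel K[j] = C * A^j * B
K = C * (A ** np.arange(len(x))) * B
y_conv = np.array([K[:t + 1][::-1] @ x[:t + 1] for t in range(len(x))]) + D * x

assert np.allclose(y_rec, y_conv)  # the two views produce identical outputs
```

The same output sequence falls out of either view, which is exactly why SSMs can train like a CNN and decode like an RNN.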
Key advantages follow from this dual view: cost scales linearly with sequence length, training parallelizes like a convolution, and inference needs only a constant-size hidden state per step, like an RNN.
Gu and Dao (2023) introduced Mamba, which made SSMs competitive with Transformers for language modeling by making the state-space parameters input-dependent ("selective" state spaces) and computing the resulting recurrence with a hardware-aware parallel scan.
Mamba achieved Transformer-level performance on language tasks while being significantly faster for long sequences.
The most promising approach may be hybrids that combine Transformers and SSMs: architectures such as Jamba interleave attention layers with Mamba layers, keeping attention's precise token-level recall while inheriting the SSM's efficiency on long sequences.
Standard LLMs generate answers in a single forward pass — essentially “thinking” for the same amount of time regardless of question difficulty. A simple arithmetic question and a complex proof get the same computational budget.
Wei et al. showed that prompting models to “think step by step” dramatically improves performance on reasoning tasks:
Q: Roger has 5 balls. He buys 2 cans with 3 balls each. How many balls does he have in total?
A: Roger starts with 5 balls. He buys 2 cans of 3 balls, which is 2 × 3 = 6 balls.
Total: 5 + 6 = 11 balls.
This was an early form of trading more compute at inference time for better results.
OpenAI’s o1 and o3 models, and DeepSeek’s R1, explicitly spend extended reasoning before answering: the model generates a long internal chain of thought, typically trained with reinforcement learning on problems whose answers can be verified, and only then emits its final response.
Emerging research shows that just as training-time compute follows scaling laws, inference-time compute does too. Allowing models to “think longer” (more tokens of reasoning) predictably improves accuracy on hard problems.
This suggests a fundamental shift: instead of only scaling the model and training data, scale the compute at inference time too.
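One simple form of inference-time scaling is self-consistency: sample several independent answers and take a majority vote. A toy simulation (the 60% per-sample accuracy is an arbitrary assumption, not a measured number) shows how voting lifts accuracy:

```python
import random

random.seed(0)

def majority_vote_accuracy(p_correct, n_samples, trials=10_000):
    """Fraction of trials where a majority of n_samples independent
    attempts (each correct with probability p_correct) is correct."""
    wins = 0
    for _ in range(trials):
        n_right = sum(random.random() < p_correct for _ in range(n_samples))
        wins += n_right > n_samples // 2
    return wins / trials

acc_1 = majority_vote_accuracy(0.6, 1)    # single sample: ~60% accuracy
acc_15 = majority_vote_accuracy(0.6, 15)  # 15-way vote: noticeably higher
assert acc_15 > acc_1
```

Spending 15× the inference compute buys a large accuracy gain without touching the model's weights, which is the core of the inference-time scaling argument.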
Models that process both images and text natively:
| Model | Year | Capabilities |
|---|---|---|
| GPT-4V | 2023 | Image understanding, visual QA |
| Gemini | 2023 | Native multimodal (text, image, audio, video) |
| Claude 3 | 2024 | Vision + long-context text |
| LLaVA | 2023 | Open-source vision-language model |
Two main approaches for multimodal models:
Early fusion: Convert all modalities to tokens and process with a single Transformer: \(\text{tokens} = [\text{text\_tokens}; \text{image\_tokens}; \text{audio\_tokens}]\)
Cross-attention fusion: Separate encoders per modality, connected by cross-attention: \(\text{text\_output} = \text{CrossAttn}(\text{text\_features}, \text{image\_features})\)
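A minimal single-head cross-attention sketch in NumPy makes the data flow concrete; it omits learned projections, and random features stand in for real encoder outputs. Text tokens form the queries; image patches supply the keys and values:

```python
import numpy as np

def cross_attention(text_feats, image_feats):
    """Single-head cross-attention without learned projections:
    each text token computes a softmax-weighted mix of image patches."""
    d = text_feats.shape[-1]
    scores = text_feats @ image_feats.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ image_feats

rng = np.random.default_rng(0)
text = rng.standard_normal((4, 8))    # 4 text tokens, dim 8
image = rng.standard_normal((10, 8))  # 10 image patches, dim 8
out = cross_attention(text, image)
assert out.shape == text.shape  # one image-informed vector per text token
```

The output keeps the text sequence length, which is why cross-attention fusion lets each modality keep its own encoder and sequence length.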
Video adds the temporal dimension to vision-language models, raising new challenges: choosing which frames to sample, coping with the enormous token counts of long videos, and modeling motion and temporal order rather than isolated frames.
Reduce model precision from 16-bit to 4-bit or even lower:
| Precision | Bits per Parameter | Memory (7B model) | Quality Impact |
|---|---|---|---|
| FP16/BF16 | 16 | 14 GB | Baseline |
| INT8 | 8 | 7 GB | Minimal |
| INT4 (GPTQ/AWQ) | 4 | 3.5 GB | Small |
| 2-bit | 2 | 1.75 GB | Noticeable |
Modern quantization methods such as GPTQ and AWQ (often distributed in formats like GGUF) can compress a 70B model to fit on consumer hardware with surprisingly little quality loss.
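The core of round-to-nearest quantization is easy to sketch. Here is a minimal symmetric, per-tensor 4-bit quantizer; methods like GPTQ and AWQ are considerably more sophisticated (per-group scales, calibration data, error compensation):

```python
import numpy as np

def quantize_int4(w):
    """Symmetric per-tensor 4-bit quantization: floats -> integers in [-8, 7]."""
    scale = np.abs(w).max() / 7  # map the largest magnitude to +/-7
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, scale = quantize_int4(w)
w_hat = dequantize(q, scale)

# Round-to-nearest error is bounded by half a quantization step.
err = np.abs(w - w_hat).max()
assert err <= scale / 2 + 1e-6
```

Each weight now needs 4 bits plus a shared scale instead of 16 bits, matching the 4× memory reduction in the table above.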
Train a smaller “student” model to mimic a larger “teacher” model:
\[L = \alpha \cdot L_{\text{task}} + (1 - \alpha) \cdot \text{KL}(p_{\text{teacher}} \,\|\, p_{\text{student}})\]
where $\alpha$ balances the standard task loss against matching the teacher’s output distribution. Distillation can produce models that are 10× smaller but retain 90%+ of the teacher’s capability.
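The combined loss takes only a few lines to compute. A sketch for a single example with hypothetical logits over four classes; the temperature $T$ and $\alpha = 0.5$ are illustrative choices:

```python
import numpy as np

def softmax(z, T=1.0):
    e = np.exp(z / T - np.max(z / T))
    return e / e.sum()

# Hypothetical logits over 4 classes for one training example.
teacher_logits = np.array([2.0, 1.0, 0.2, -1.0])
student_logits = np.array([1.5, 0.8, 0.5, -0.5])
label, alpha, T = 0, 0.5, 2.0

p_teacher = softmax(teacher_logits, T)  # softened teacher distribution
p_student = softmax(student_logits, T)

task_loss = -np.log(softmax(student_logits)[label])               # cross-entropy on the hard label
kl = np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)))  # KL(teacher || student)
loss = alpha * task_loss + (1 - alpha) * kl
assert loss > 0 and kl >= 0
```

The softened teacher probabilities carry information the hard label does not (which wrong classes are "almost right"), which is where distillation's extra signal comes from.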
Use a small, fast “draft” model to generate candidate tokens, then verify them in parallel with the large model. This can achieve 2–3× speedup with no quality loss, because the large model only needs to verify (not generate) most tokens.
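The accept/verify loop can be sketched with greedy decoding and toy integer "tokens"; real implementations verify against the target model's full probability distribution rather than a single greedy token:

```python
def speculative_step(draft_next, target_next, context, k=4):
    """Greedy speculative decoding sketch: the draft proposes k tokens,
    the target accepts the longest matching prefix and fixes the first mismatch."""
    proposal, ctx = [], list(context)
    for _ in range(k):
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)
    accepted, ctx = [], list(context)
    for t in proposal:
        correct = target_next(ctx)  # in practice, all k verifications run in one batch
        if t == correct:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(correct)  # one free corrected token, then stop
            break
    return accepted

# Toy "models" over integer tokens; the draft disagrees whenever len(ctx) % 3 == 0.
target = lambda ctx: (sum(ctx) + 1) % 10
draft = lambda ctx: target(ctx) if len(ctx) % 3 else (sum(ctx) + 2) % 10

out = speculative_step(draft, target, [1, 2, 3])
assert 1 <= len(out) <= 4
```

Each call to the large model verifies up to $k$ draft tokens in parallel, so when the draft is usually right, the target model runs far fewer sequential steps.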
| Method | Complexity | Approach |
|---|---|---|
| Standard attention | $O(n^2)$ | Full pairwise computation |
| Flash Attention | $O(n^2)$ (but faster) | IO-aware GPU kernel |
| Sparse attention | $O(n \sqrt{n})$ | Attend to fixed patterns |
| Linear attention | $O(n)$ | Kernel approximation |
| Sliding window | $O(n \cdot w)$ | Local attention window |
| Ring attention | Distributed | Distribute across devices |
Flash Attention (Dao, 2022) doesn’t reduce asymptotic complexity but restructures the computation to minimize memory reads/writes on GPUs, achieving 2–4× wall-clock speedup.
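Sliding-window attention from the table above is straightforward to sketch: each token attends only to itself and the previous $w-1$ tokens, so the cost grows as $O(n \cdot w)$ rather than $O(n^2)$ (a minimal single-head version, no learned projections):

```python
import numpy as np

def sliding_window_attention(x, w=4):
    """Each token attends to itself and the previous w-1 tokens: O(n*w) work."""
    n, d = x.shape
    out = np.zeros_like(x)
    for i in range(n):
        window = x[max(0, i - w + 1):i + 1]  # local keys/values
        scores = window @ x[i] / np.sqrt(d)  # token i is the query
        a = np.exp(scores - scores.max())
        a /= a.sum()
        out[i] = a @ window
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 8))
y = sliding_window_attention(x)
assert y.shape == x.shape
assert np.allclose(y[0], x[0])  # the first token can only attend to itself
```

Stacking such layers still lets information propagate beyond the window, since each layer extends the effective receptive field by $w$ tokens.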
Reinforcement Learning from Human Feedback (RLHF) aligns models with human preferences but is complex and unstable. Direct Preference Optimization (DPO) is the most widely used alternative: it optimizes the preference objective directly on the policy, eliminating the separate reward model and the reinforcement-learning loop.
Instead of relying solely on knowledge stored in model weights, retrieval-augmented generation (RAG) fetches relevant documents from an external corpus at inference time and conditions generation on them.
RAG addresses hallucination (grounded in real documents), knowledge recency (database can be updated), and attribution (can cite sources).
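A minimal retrieval sketch, with a hypothetical three-document corpus and a bag-of-words vector standing in for a real embedding model:

```python
import numpy as np

# Hypothetical corpus; a bag-of-words vector stands in for a learned embedding.
corpus = [
    "mamba is a state space model for long sequences",
    "quantization reduces model memory footprint",
    "speculative decoding speeds up inference",
]
vocab = {w: i for i, w in enumerate(sorted({w for doc in corpus for w in doc.split()}))}

def embed(text):
    v = np.zeros(len(vocab))
    for word in text.lower().split():
        if word in vocab:
            v[vocab[word]] += 1.0
    return v

doc_vecs = np.array([embed(doc) for doc in corpus])

def retrieve(query, k=1):
    """Return the k documents most cosine-similar to the query."""
    q = embed(query)
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * (np.linalg.norm(q) + 1e-9))
    return [corpus[i] for i in np.argsort(-sims)[:k]]

top = retrieve("how does quantization reduce memory use")
assert top[0] == corpus[1]
```

In a real RAG pipeline the retrieved passages are prepended to the prompt, so the model generates from the documents rather than from memorized weights alone.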
Modern AI systems are evolving from passive text generators into active agents that can use tools, call APIs, search the web, and execute code, typically in a loop:
Observe → Think → Act → Observe → Think → Act → ...
The model generates an action (e.g., “search for X”), executes it, observes the result, and decides the next step. This loop continues until the task is complete.
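The loop itself is only a few lines; the intelligence lives in the model. A sketch with deterministic stand-ins for the model and a single search tool (all names here are hypothetical):

```python
def run_agent(llm_decide, tools, task, max_steps=5):
    """Minimal agent loop: the model picks an action, we execute it,
    and the observation feeds back in until the model answers."""
    observation = task
    for _ in range(max_steps):
        action, arg = llm_decide(observation)
        if action == "answer":
            return arg
        observation = tools[action](arg)
    return None  # gave up after max_steps

# Deterministic stand-ins: a "model" that searches once, then answers.
def fake_llm(observation):
    if observation.startswith("result: "):
        return ("answer", observation[len("result: "):])
    return ("search", observation)

tools = {"search": lambda q: f"result: found info about {q}"}
answer = run_agent(fake_llm, tools, "state space models")
assert answer == "found info about state space models"
```

The `max_steps` cap matters in practice: without it, an agent that never decides to answer would loop forever.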
The internet contains a finite amount of high-quality text. Some estimates suggest current LLMs have nearly exhausted available high-quality data. Where does more training data come from?
One answer is synthetic data: using existing models to generate training data for new models, usually filtered by verifiers or reward models. The risk is that repeatedly training on model output degrades quality, a failure mode sometimes called model collapse.
Several fundamental questions remain unanswered.
Deep learning has come remarkably far in just over eight decades:
1943: Mathematical neurons → Pure theory
1958: Perceptron → Can learn simple patterns
1986: Backpropagation → Can train multi-layer networks
1997: LSTM → Can process sequences
2012: AlexNet → Deep networks beat everything
2015: ResNet → Networks can be arbitrarily deep
2017: Transformer → Universal architecture
2020: GPT-3 → Models can do tasks from examples
2022: ChatGPT → AI goes mainstream
2024: Reasoning models → AI that "thinks" before answering
2025: ???
Each breakthrough built on the ones before it. Each solved a specific information flow problem. And each opened new possibilities that no one had predicted.
The next breakthrough is out there — perhaps in a technique we’ve already invented but haven’t yet applied correctly, or perhaps in an idea that hasn’t been conceived yet. What’s certain is that the arc of deep learning is far from complete.