Imagine you’re writing data to a disk. The disk is slow — a write might take 10 milliseconds. The CPU could sit in a loop checking “is it done yet?” every microsecond (polling) — 10,000 checks, millions of wasted cycles, all doing nothing useful.
Linux uses two mechanisms to avoid this waste: interrupts and DMA.
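The waste polling causes can be sketched in pure Python: a background thread stands in for the slow device, and a threading.Event stands in for its “done” signal. The numbers are illustrative, not a benchmark.

```python
import threading
import time

done = threading.Event()   # stands in for the device's "done" flag

def device_write():
    """Simulate a slow device: a ~10 ms 'disk write'."""
    time.sleep(0.01)
    done.set()             # the "interrupt": completion is signalled

threading.Thread(target=device_write).start()

# Polling: spin on the flag, burning CPU for the whole 10 ms.
polls = 0
while not done.is_set():
    polls += 1

print(f"busy-polled {polls} times for one write")

# The interrupt-driven alternative is done.wait(): the thread sleeps
# and the OS wakes it on completion, costing essentially zero CPU.
```

Run it a few times — the poll count varies wildly, which is exactly the point: every iteration is CPU time that could have gone to useful work.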
An interrupt is a hardware signal from a device to the CPU that says “I need attention.” When the interrupt fires, the CPU:

1. Pauses whatever it was executing
2. Saves its current state
3. Jumps to the interrupt handler registered for that device
4. Runs the handler (usually very briefly)
5. Restores its state and resumes the interrupted code
This means your Python code runs uninterrupted most of the time. When a network packet arrives or a key is pressed, the CPU briefly handles it and returns. From your program’s perspective, it never happened.
CPU running your Python code
│
├─────── Network packet arrives ──→ CPU pauses
│            CPU runs network interrupt handler
│            Packet stored in kernel buffer
│            CPU resumes Python code
│
├─────── Key press ───────────────→ CPU pauses
│            CPU runs keyboard handler
│            Keycode stored in input queue
│            CPU resumes Python code
│
└─ (Python code never noticed)
Linux distinguishes two kinds of interrupts:

- Hardware IRQs — physical lines from devices to the interrupt controller.
- Software interrupts (softirqs and tasklets) — deferred work scheduled by hardware interrupt handlers, used for networking (packet processing), block I/O completion, and similar bottom-half work.
Devices don’t connect directly to the CPU. They connect to an interrupt controller (APIC on x86), which aggregates all interrupt lines and delivers them to CPU cores in a managed way.
On SMP systems (multi-core), the kernel distributes interrupts across CPUs using the I/O APIC and LAPIC (local APIC per core). You can see and control this via /proc/irq/.
# Real-time interrupt counts per CPU
watch -n 1 cat /proc/interrupts
# Interrupts for a specific device (e.g., eth0)
grep eth0 /proc/interrupts
# Which CPU handles which interrupt
cat /proc/irq/24/smp_affinity # bitmask of allowed CPUs
cat /proc/irq/24/smp_affinity_list # human-readable: "0-3"
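Reading these counters from Python is just text parsing. A minimal parser, run here on a made-up sample (the IRQ numbers and counts are invented):

```python
def parse_interrupts(text):
    """Parse /proc/interrupts-style text into {irq: [per-CPU counts]}."""
    lines = text.strip().splitlines()
    n_cpus = len(lines[0].split())        # header row: CPU0 CPU1 ...
    counts = {}
    for line in lines[1:]:
        fields = line.split()
        irq = fields[0].rstrip(":")       # "24:" -> "24", "LOC:" -> "LOC"
        counts[irq] = [int(x) for x in fields[1:1 + n_cpus]]
    return counts

# Sample shaped like real output (values are made up):
sample = """\
           CPU0       CPU1
  24:      1000       2000   IR-PCI-MSI  eth0
 LOC:      5555       6666   Local timer interrupts
"""
table = parse_interrupts(sample)
print(table["24"])   # [1000, 2000]
```

On a real system you would feed it `open("/proc/interrupts").read()`; note that a few summary rows (like ERR) have fewer columns and would need extra handling.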
To move an IRQ to a specific CPU (useful for latency tuning):
echo 2 | sudo tee /proc/irq/24/smp_affinity # pin to CPU1
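The mask is just hex with bit N set for CPU N. A tiny helper (the function name is mine) makes the encoding concrete:

```python
def cpus_to_affinity_mask(cpus):
    """Build the hex bitmask that /proc/irq/*/smp_affinity expects."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu     # bit N means "CPU N is allowed"
    return format(mask, "x")

print(cpus_to_affinity_mask([1]))           # "2"  — pin to CPU1
print(cpus_to_affinity_mask([0, 1, 2, 3]))  # "f"  — allow CPUs 0-3
```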
The time from an interrupt firing to its handler running is the interrupt latency. Standard Linux sees latencies of tens to hundreds of microseconds. For real-time applications (audio, industrial control), you need the PREEMPT_RT patch, which reduces this dramatically.
DMA (Direct Memory Access) lets hardware devices read from and write to RAM without CPU involvement. The CPU sets up a DMA transfer and does other work while the hardware moves data.
Without DMA:
CPU reads byte from disk controller register
CPU writes byte to RAM
CPU reads next byte from disk controller...
(repeat 4096 times for one page)
With DMA:
CPU tells DMA controller: "move 4096 bytes from disk controller to RAM at 0x12345678"
CPU does other work
DMA controller moves all 4096 bytes autonomously
DMA controller fires interrupt: "done"
CPU processes the result
DMA is why modern systems can saturate 10 Gbit/s network cards, NVMe SSDs at 7 GB/s, and GPUs at hundreds of GB/s — the CPU would be the bottleneck otherwise.
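The “With DMA” flow above can be loosely mimicked in Python: a worker thread plays the DMA controller and a completion callback plays the interrupt. This is an analogy for the control flow only — real DMA is programmed by drivers, not applications.

```python
from concurrent.futures import ThreadPoolExecutor

def dma_transfer(src, dst, nbytes):
    """The 'DMA controller': copies data without the main thread's help."""
    dst[:nbytes] = src[:nbytes]

src = bytearray(b"x" * 4096)   # the device's buffer
dst = bytearray(4096)          # the destination page in "RAM"

with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(dma_transfer, src, dst, 4096)
    future.add_done_callback(lambda f: print("interrupt: transfer done"))
    # ... the main thread is free to do other work here ...

print(dst == src)   # True: the whole page was moved off the main thread
```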
The kernel has a DMA API that drivers use to:

- Allocate DMA-safe buffers (physically contiguous, device-addressable memory)
- Map buffers so a device can read or write them (dma_map_single() and friends)
- Handle cache coherency, so the CPU and the device see consistent data
As a Python developer you never program DMA directly. But understanding it explains:
- why mmap() on a device is so fast (the device uses DMA into the mapped buffer)
- why O_DIRECT file I/O bypasses the page cache (it uses DMA into your buffer)

Legacy ISA DMA channels (rarely used today):
cat /proc/dma
Modern DMA usage is per-driver and not exposed in a single place. You can spot DMA-related memory regions in:
cat /proc/iomem | grep -i dma
Let’s trace a recv() call — receiving data from a TCP socket:
1. Ethernet frame arrives at NIC
2. NIC stores frame in its receive ring buffer (DMA into RAM, no CPU involved)
3. NIC fires an interrupt: "new data in ring"
4. CPU runs NIC interrupt handler (top half, fast)
5. Handler schedules a softirq for packet processing (bottom half, deferred)
6. CPU returns from interrupt, resumes other work
7. Softirq runs: kernel processes Ethernet → IP → TCP layers
8. Data placed in socket receive buffer
9. If your process is blocked on recv(), kernel wakes it up
10. Your Python code gets the bytes
Steps 2–8 happen completely outside your Python code, driven by interrupts and DMA.
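You can watch the last steps from userspace with a plain blocking socket — everything before recv() returns is the kernel path above. The server runs in a thread just to keep the example self-contained:

```python
import socket
import threading

def server(listener):
    """Accept one connection and send a message (the sending side)."""
    conn, _ = listener.accept()
    conn.sendall(b"hello")
    conn.close()

listener = socket.socket()
listener.bind(("127.0.0.1", 0))    # port 0: let the kernel pick one
listener.listen(1)
threading.Thread(target=server, args=(listener,)).start()

client = socket.socket()
client.connect(listener.getsockname())
data = client.recv(1024)   # blocks at step 9 until the softirq path
print(data)                # has filled the socket buffer and woken us
client.close()
listener.close()
```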
Modern high-speed NICs use interrupt coalescing: instead of firing an interrupt for every packet (which would overwhelm the CPU at 10 Gbit/s), they batch multiple packets and fire one interrupt. This trades latency for throughput.
You can tune this with ethtool:
ethtool -c eth0 # show current coalescing settings
ethtool -C eth0 rx-usecs 50 # wait up to 50 µs before firing an interrupt
Lower coalescing = lower latency, higher CPU usage. Higher coalescing = higher latency, lower CPU usage.
This is why high-frequency trading systems tune their NICs and why low-latency audio systems need careful interrupt configuration.
When your Python code seems to be waiting on I/O, it’s usually waiting for one of these:

- a disk interrupt saying a read or write completed
- a network interrupt saying a packet arrived
- a timer interrupt saying a sleep or timeout expired
asyncio, select, and epoll all work by telling the kernel “wake me up when this file descriptor has data” and then sleeping. The kernel wakes them up in response to hardware interrupts. This is why async Python is more efficient than threads for I/O — one thread can manage thousands of hardware events through a single epoll wait.
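A minimal sketch with the selectors module (which picks epoll on Linux), using socketpair() so the example needs no network:

```python
import selectors
import socket

# A connected socket pair: one end writes, the other end waits.
a, b = socket.socketpair()

sel = selectors.DefaultSelector()       # uses epoll on Linux
sel.register(b, selectors.EVENT_READ)   # "wake me when b has data"

a.send(b"ping")                         # makes b readable

# select() blocks in the kernel; the wake-up is ultimately driven by
# the interrupt/softirq machinery traced above.
events = sel.select(timeout=1.0)
data = b.recv(4) if events else b""
print(data)

sel.close(); a.close(); b.close()
```

One selector like this can watch thousands of descriptors at once, which is exactly what asyncio’s event loop does under the hood.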