Network problems are notoriously difficult to diagnose because they span multiple layers, involve multiple devices, and can be intermittent. But with a systematic approach and the right tools, you can isolate and resolve most issues efficiently.
This final chapter brings together everything you’ve learned into a practical troubleshooting methodology and a set of best practices for building reliable networked systems.
Follow the bottom-up approach — start at the Physical layer and work your way up:
Layer 1 (Physical) → Is the cable plugged in? Is the link up?
Layer 2 (Data Link) → Can I see MAC addresses? Is ARP working?
Layer 3 (Network) → Can I ping the gateway? Is routing correct?
Layer 4 (Transport) → Is the port open? Is TCP connecting?
Layer 7 (Application) → Is the service responding correctly?
At each layer, ask: “Does this layer work?” If yes, move up. If no, you’ve found the problem layer.
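The bottom-up walk can be sketched as a small runner that stops at the first failing layer. This is a minimal illustration, not part of any tool: `LAYER_ORDER` and `first_failing_layer` are hypothetical names, and each check is assumed to be a zero-argument callable that returns `True` when its layer works.

```python
from typing import Callable, Dict, Optional

# Layers in the order they are checked (bottom-up).
LAYER_ORDER = ["Physical", "Data Link", "Network", "Transport", "Application"]


def first_failing_layer(checks: Dict[str, Callable[[], bool]]) -> Optional[str]:
    """Run the supplied checks bottom-up and return the first layer that fails.

    Returns None when every supplied check passes.
    """
    for layer in LAYER_ORDER:
        check = checks.get(layer)
        if check is None:
            continue  # no check supplied for this layer
        if not check():
            return layer  # stop here: higher layers depend on this one
    return None
```

In practice each callable would wrap one of the commands in this chapter, for example a function that runs `ping` against the gateway for the Network layer.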
# Is the interface up? Is the link detected?
ip link show eth0
# Look for:
# - state UP (interface enabled)
# - <LOWER_UP> (physical link detected)
# - NO-CARRIER (cable disconnected or link partner down)
# Check for errors
ip -s link show eth0
# Look for: RX errors, TX errors, dropped packets, overruns
# Can we resolve the gateway's MAC address?
ip neigh show
# If the gateway shows FAILED or is missing:
# - Wrong IP configuration
# - Wrong VLAN
# - Physical connectivity issue
# Force ARP resolution
arping -c 3 192.168.1.1
# Ping the default gateway
ping -c 4 192.168.1.1
# Ping a public DNS server
ping -c 4 8.8.8.8
# Ping with specific packet size (check MTU issues)
ping -c 4 -s 1472 -M do 8.8.8.8 # 1472 + 28 = 1500 (standard MTU)
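The payload arithmetic behind the MTU test can be written out as follows. This is a sketch assuming IPv4 without options (20-byte header) and an 8-byte ICMP header. `largest_passing_payload` is a hypothetical helper that binary-searches the largest payload a caller-supplied `probe` function reports as passing, for example a wrapper around `ping -M do -s <size>`.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
ICMP_HEADER = 8   # bytes


def max_ping_payload(mtu: int) -> int:
    """Largest ICMP payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def largest_passing_payload(probe, low: int = 0, high: int = 1472) -> int:
    """Binary-search the largest size for which probe(size) is True.

    Assumes probe(low) succeeds and that every size above the limit fails.
    """
    while low < high:
        mid = (low + high + 1) // 2
        if probe(mid):
            low = mid       # mid fits, search above it
        else:
            high = mid - 1  # mid is too large, search below it
    return low
```

Adding 28 to the result gives the path MTU that the probes were able to confirm.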
Interpreting ping results:
| Symptom | Likely Cause |
|---|---|
| Destination Host Unreachable | No route to host, or ARP failure |
| Network is unreachable | No route in routing table |
| Request timeout | Firewall blocking ICMP, or host is down |
| High latency (>100 ms on LAN) | Congestion, duplex mismatch, bad cable |
| Packet loss (>1%) | Congestion, interference, hardware issue |
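The loss and latency thresholds in the table can be applied to captured ping output automatically. The sketch below assumes the summary format printed by the common Linux `ping` (a `% packet loss` line and an `rtt min/avg/max/mdev` line); other implementations may format these lines differently. `summarize_ping` and `flag_ping` are illustrative names.

```python
import re


def summarize_ping(output: str) -> dict:
    """Extract packet loss (%) and average RTT (ms) from ping's summary lines."""
    loss = re.search(r"([\d.]+)% packet loss", output)
    rtt = re.search(r"= ([\d.]+)/([\d.]+)/([\d.]+)", output)  # min/avg/max
    return {
        "loss_percent": float(loss.group(1)) if loss else None,
        "avg_ms": float(rtt.group(2)) if rtt else None,
    }


def flag_ping(summary: dict, max_loss_percent: float = 1.0,
              max_avg_ms: float = 100.0) -> list:
    """Return the symptoms from the table that this summary exhibits."""
    flags = []
    loss, avg = summary["loss_percent"], summary["avg_ms"]
    if loss is not None and loss > max_loss_percent:
        flags.append("packet loss")
    if avg is not None and avg > max_avg_ms:
        flags.append("high latency")
    return flags
```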
# Show the path to a destination
traceroute 8.8.8.8
# Use ICMP instead of UDP (may work better through firewalls)
sudo traceroute -I 8.8.8.8
# Use TCP SYN on port 80 (works through most firewalls)
sudo traceroute -T -p 80 example.com
# MTR — continuous traceroute with statistics
mtr 8.8.8.8
Reading traceroute output:
1 192.168.1.1 1.2 ms ← Local gateway
2 10.0.0.1 5.3 ms ← ISP router
3 * * * ← Router doesn't respond to probes
4 72.14.215.85 15.7 ms ← Internet backbone
5 8.8.8.8 12.1 ms ← Destination
Stars (* * *) don’t necessarily mean a problem — many routers are configured to not respond to traceroute probes.
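To make that point concrete, the following sketch separates silent hops from the question that matters: did the trace reach the destination? It assumes simplified numeric output (as from `traceroute -n`) where each line starts with the hop number followed by either an address or `*`. `analyze_hops` is a hypothetical helper, and it only looks at the first probe of each hop.

```python
def analyze_hops(lines, destination):
    """Return (silent_hops, reached) for simplified numeric traceroute lines."""
    silent = []
    reached = False
    for line in lines:
        fields = line.split()
        if len(fields) < 2 or not fields[0].isdigit():
            continue  # skip the header line and anything unparseable
        hop = int(fields[0])
        if fields[1] == "*":
            silent.append(hop)  # first probe for this hop went unanswered
        elif fields[1] == destination:
            reached = True
    return silent, reached
```

A silent hop followed by responding hops, as with hop 3 in the sample output, usually indicates a router that ignores probes rather than a broken path.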
# Show the routing table
ip route show
# Which route will be used for a specific destination?
ip route get 8.8.8.8
# Check if IP forwarding is enabled (for routers/gateways)
sysctl net.ipv4.ip_forward
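The core rule that `ip route get` applies, longest-prefix match, can be illustrated with the standard `ipaddress` module. This is a simplified sketch: the route list and `select_route` are hypothetical, and a real kernel lookup also considers metrics and policy routing rules.

```python
import ipaddress


def select_route(routes, destination):
    """Pick the most specific route containing the destination address.

    routes is a list of (prefix, next_hop) pairs.
    Returns (prefix, next_hop), or None if no route matches.
    """
    dest = ipaddress.ip_address(destination)
    best = None
    for prefix, next_hop in routes:
        network = ipaddress.ip_network(prefix)
        if dest in network:
            # A longer prefix is more specific, so it wins.
            if best is None or network.prefixlen > best[0].prefixlen:
                best = (network, next_hop)
    return None if best is None else (str(best[0]), best[1])
```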
ss (replacement for netstat) is the primary tool for inspecting TCP/UDP sockets:
# List all listening TCP ports
ss -tlnp
# List all established TCP connections
ss -tnp
# List all UDP sockets
ss -ulnp
# Show socket statistics (retransmissions, RTT)
ss -ti
# Filter by port
ss -tnp 'sport = :80'
# Filter by state
ss -tn state established
ss -tn state time-wait
# Test if a TCP port is open
nc -zv example.com 80
nc -zv example.com 443
# Test with timeout
nc -zv -w 3 example.com 22
# Show connection timing
curl -o /dev/null -w "DNS: %{time_namelookup}s\nConnect: %{time_connect}s\nTLS: %{time_appconnect}s\nTotal: %{time_total}s\n" https://example.com
Understanding TCP states helps diagnose connection issues:
| State | Meaning | Common Issue |
|---|---|---|
| LISTEN | Server waiting for connections | Normal |
| ESTABLISHED | Active connection | Normal |
| SYN_SENT | Client waiting for SYN-ACK | Firewall blocking, server down |
| SYN_RECV | Server received SYN, waiting for the final ACK | Normal briefly; many = possible SYN flood |
| TIME_WAIT | Connection closed, waiting for stale packets | Normal; too many = port exhaustion |
| CLOSE_WAIT | Remote side closed, local side hasn’t | Application bug (not closing sockets) |
| FIN_WAIT1/2 | Closing connection | Normal transition |
# Count connections by state
ss -tan | awk 'NR>1 {print $1}' | sort | uniq -c | sort -rn # NR>1 skips the header line
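The same count can be done in Python when the result feeds a monitoring script. The sketch below assumes the text of `ss -tan` has already been captured, with its header line present. Note that `ss` abbreviates state names in its output (for example `ESTAB`, `CLOSE-WAIT`, `TIME-WAIT`). `count_tcp_states` is an illustrative name.

```python
from collections import Counter


def count_tcp_states(ss_output: str) -> Counter:
    """Count sockets per TCP state from captured `ss -tan` output."""
    counts = Counter()
    for line in ss_output.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if fields:
            counts[fields[0]] += 1  # first column is the state
    return counts
```

A `CLOSE-WAIT` count that keeps growing points to the application bug described in the table.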
# Test DNS resolution
dig example.com
dig @8.8.8.8 example.com # Query a specific server
# Query ANY (many servers refuse or return a minimal answer; prefer specific types such as A, AAAA, MX)
dig example.com ANY
# Reverse DNS
dig -x 93.184.216.34
# Trace the delegation path from the root servers down to the authoritative server
dig +trace example.com
# Test with system resolver
getent hosts example.com
Common DNS issues:
| Symptom | Check |
|---|---|
| Name resolution fails | cat /etc/resolv.conf, try dig @8.8.8.8 |
| Resolves to wrong IP | Check for stale DNS cache, conflicting /etc/hosts |
| Slow resolution | DNS server may be unreachable; check with dig |
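When name resolution fails, the first check in the table is which resolvers the system is configured to use. A minimal sketch of that step, assuming the standard `resolv.conf` syntax (`nameserver <address>` lines, comments starting with `#` or `;`); `parse_nameservers` is a hypothetical helper that takes the file's text so it can be tested without touching the system.

```python
def parse_nameservers(resolv_conf_text: str) -> list:
    """Return the nameserver addresses listed in resolv.conf-style text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        line = line.strip()
        if not line or line[0] in "#;":
            continue  # blank line or comment
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers
```

Each returned address can then be queried directly with `dig @<address> example.com` to see which resolver is slow or unreachable.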
# Verbose HTTP request
curl -v https://example.com
# Show only headers
curl -I https://example.com
# Follow redirects
curl -L -v https://example.com
# Test with specific HTTP version
curl --http2 https://example.com
# POST with data
curl -X POST -d '{"key": "value"}' -H "Content-Type: application/json" https://api.example.com/data
Build a comprehensive diagnostic script:
import socket
import subprocess
import ssl
import time
def check_dns(hostname: str) -> str:
    try:
        ip = socket.gethostbyname(hostname)
        return f"✓ DNS: {hostname} → {ip}"
    except socket.gaierror as e:
        return f"✗ DNS: {hostname} — {e}"

def check_tcp(host: str, port: int, timeout: float = 5.0) -> str:
    try:
        start = time.time()
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = (time.time() - start) * 1000
            return f"✓ TCP: {host}:{port} — {elapsed:.1f} ms"
    except (socket.timeout, ConnectionRefusedError, OSError) as e:
        return f"✗ TCP: {host}:{port} — {e}"

def check_tls(hostname: str, port: int = 443) -> str:
    try:
        ctx = ssl.create_default_context()
        with socket.create_connection((hostname, port), timeout=5) as sock:
            with ctx.wrap_socket(sock, server_hostname=hostname) as ssock:
                cert = ssock.getpeercert()
                version = ssock.version()
                not_after = cert.get("notAfter", "unknown")
                return f"✓ TLS: {version}, expires {not_after}"
    except Exception as e:
        return f"✗ TLS: {e}"

def check_ping(host: str) -> str:
    result = subprocess.run(
        ["ping", "-c", "3", "-W", "2", host],
        capture_output=True, text=True
    )
    if result.returncode == 0:
        # Extract average RTT
        for line in result.stdout.split("\n"):
            if "avg" in line:
                return f"✓ Ping: {line.strip()}"
    return f"✗ Ping: {host} unreachable"

# Run diagnostics
target = "example.com"
print(check_dns(target))
print(check_ping(target))
print(check_tcp(target, 80))
print(check_tcp(target, 443))
print(check_tls(target))
Full example: code/network_diagnostics.py
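The `notAfter` string reported by the TLS check is easier to act on as a number of days. The sketch below uses the standard library's `ssl.cert_time_to_seconds` to parse it; `days_until_expiry` is a hypothetical helper, and the `now` parameter exists only so the calculation can be tested with a fixed clock.

```python
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days remaining before a certificate's notAfter time.

    not_after uses the format returned by getpeercert(),
    for example "Jan 11 00:00:00 2030 GMT". Negative means expired.
    """
    expires = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires - now) / 86400  # seconds per day
```

A diagnostic script could warn when the result drops below a chosen threshold such as 14 days.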
import socket
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Increase buffer sizes for high-throughput applications
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 262144) # 256 KB
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 262144)
# Disable Nagle's algorithm for low-latency
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
# Enable TCP keepalive for long-lived connections
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
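Socket options are requests, and the kernel may adjust them, so it can help to read the values back. The sketch below applies some of the same options and reports what was actually set. `configure_socket` is an illustrative name. On Linux the reported buffer size is typically double the requested value and is capped by system limits, so no exact size is assumed.

```python
import socket


def configure_socket(sock: socket.socket, rcvbuf: int = 262144) -> dict:
    """Apply low-latency and keepalive options, then report the effective values."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcvbuf)
    return {
        "nodelay": bool(sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)),
        "keepalive": bool(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)),
        # Effective size as reported by the OS; may differ from the request.
        "rcvbuf": sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF),
    }
```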
# Increase the maximum number of open file descriptors
ulimit -n 65536
# Increase listen backlog limits (accept queue and SYN queue)
sudo sysctl -w net.core.somaxconn=65535
sudo sysctl -w net.ipv4.tcp_max_syn_backlog=65535
# Enable TCP fast open
sudo sysctl -w net.ipv4.tcp_fastopen=3
# Increase port range for outgoing connections
sudo sysctl -w net.ipv4.ip_local_port_range="1024 65535"
# Shorten the FIN_WAIT_2 timeout for orphaned connections (this does not change TIME_WAIT, which is fixed at 60 s on Linux)
sudo sysctl -w net.ipv4.tcp_fin_timeout=15
Always reuse connections when making multiple requests to the same server:
import requests

# Bad: new connection per request
for i in range(100):
    requests.get("https://api.example.com/data") # 100 TCP + TLS handshakes

# Good: reuse connections via session
session = requests.Session()
for i in range(100):
    session.get("https://api.example.com/data") # 1 TCP + TLS handshake
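The difference can be measured locally without an external API. The following sketch uses only the standard library rather than `requests`: it starts a throwaway HTTP/1.1 server on the loopback interface and counts how many TCP connections the client opens. `CountingHandler` and `count_connections` are hypothetical names for this demonstration, and TLS is left out to keep it short.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class CountingHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # allows keep-alive connections
    connections = 0
    lock = threading.Lock()

    def setup(self):
        # setup() runs once per accepted TCP connection, not per request.
        with CountingHandler.lock:
            CountingHandler.connections += 1
        super().setup()

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def count_connections(requests_to_send: int, reuse: bool) -> int:
    """Send GET requests to a local server and return the TCP connections used."""
    CountingHandler.connections = 0
    server = ThreadingHTTPServer(("127.0.0.1", 0), CountingHandler)
    server.daemon_threads = True
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    try:
        conn = None
        for _ in range(requests_to_send):
            if conn is None or not reuse:
                conn = http.client.HTTPConnection(host, port, timeout=5)
            conn.request("GET", "/")
            conn.getresponse().read()  # drain the body so the socket can be reused
            if not reuse:
                conn.close()
        if conn is not None:
            conn.close()
        return CountingHandler.connections
    finally:
        server.shutdown()
        server.server_close()
```

With `reuse=True` the server sees a single connection regardless of the request count. With `reuse=False` it sees one per request, each of which would also cost a TLS handshake over HTTPS.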
| Practice | Description |
|---|---|
| Encrypt everything | Use TLS for all network communication |
| Validate certificates | Never disable certificate verification in production |
| Use strong ciphers | TLS 1.3 with AES-256-GCM or ChaCha20-Poly1305 |
| Firewall defaults | Default-deny; whitelist only necessary traffic |
| Least privilege | Run services as non-root; bind to specific interfaces |
| Input validation | Never trust data from the network |
| Rate limiting | Protect services from abuse and DoS |
| Logging | Log connection attempts, errors, and unusual patterns |
| Update regularly | Patch OS, libraries, and network device firmware |
| Segment networks | Use VLANs and firewalls to limit blast radius |
# Problem: Restarting a server fails because the port is in TIME_WAIT
# Solution: Set SO_REUSEADDR before binding
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
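The ordering matters: the option has to be set before `bind()`. A minimal sketch of a listener factory that does this; `make_listener` is a hypothetical helper, and port 0 asks the OS to pick any free port, which is convenient for testing.

```python
import socket


def make_listener(host: str = "127.0.0.1", port: int = 0,
                  backlog: int = 128) -> socket.socket:
    """Create a listening TCP socket with SO_REUSEADDR set before bind()."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Must come before bind(), otherwise it has no effect on this bind.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind((host, port))
        sock.listen(backlog)
    except OSError:
        sock.close()  # do not leak the descriptor on failure
        raise
    return sock
```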
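Checklist:
- Is the service listening? (ss -tlnp | grep <port>)
- Is it bound to the right address? (0.0.0.0 vs 127.0.0.1)
- Is a firewall blocking the port? (iptables -L -n)

Checklist:
- Is the host reachable? (ping <host>)
- Is the port reachable? (nc -zv <host> <port>)
- Where does the path stop? (traceroute <host>)
- Are packets arriving? (tcpdump)

# Problem: recv() returns less data than expected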
# Solution: Loop until you have all the data
def recv_all(sock, length):
    data = b""
    while len(data) < length:
        chunk = sock.recv(length - len(data))
        if not chunk:
            raise ConnectionError("Connection closed prematurely")
        data += chunk
    return data
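A receive loop like this needs to know how many bytes to expect. One common way to provide that is length-prefixed framing, sketched below under the assumption of a 4-byte big-endian length header. The helper names `recv_exact`, `send_message` and `recv_message` are illustrative, and the block carries its own receive loop so that it runs standalone.

```python
import socket
import struct


def recv_exact(sock: socket.socket, length: int) -> bytes:
    """Read exactly `length` bytes, looping over short reads."""
    buf = bytearray()
    while len(buf) < length:
        chunk = sock.recv(length - len(buf))
        if not chunk:
            raise ConnectionError("Connection closed prematurely")
        buf.extend(chunk)
    return bytes(buf)


def send_message(sock: socket.socket, payload: bytes) -> None:
    """Send a 4-byte big-endian length header followed by the payload."""
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def recv_message(sock: socket.socket) -> bytes:
    """Read one length-prefixed message."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)
```

The receiver first reads the fixed-size header, then exactly as many bytes as the header announces, so message boundaries survive however TCP splits the stream.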
# Problem: DNS lookups add latency to every connection
# Solution: Cache DNS results
import functools

# Note: lru_cache never expires entries, so DNS changes are not seen until the process restarts
@functools.lru_cache(maxsize=256)
def resolve(hostname: str) -> str:
    return socket.gethostbyname(hostname)
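A cache with no expiry keeps returning an old address after a record changes. One possible refinement is a cache with a fixed time-to-live, sketched below. `TTLResolver` is a hypothetical class, the 60-second default is an arbitrary assumption rather than the record's real DNS TTL, and the `resolver` and `clock` parameters exist so the behavior can be tested without network access.

```python
import socket
import time


class TTLResolver:
    """Cache hostname lookups for a fixed number of seconds."""

    def __init__(self, ttl: float = 60.0, resolver=socket.gethostbyname,
                 clock=time.monotonic):
        self._ttl = ttl
        self._resolver = resolver
        self._clock = clock
        self._cache = {}  # hostname -> (ip, expiry time)

    def resolve(self, hostname: str) -> str:
        now = self._clock()
        cached = self._cache.get(hostname)
        if cached is not None and cached[1] > now:
            return cached[0]  # entry is still fresh
        ip = self._resolver(hostname)  # miss or expired: look it up again
        self._cache[hostname] = (ip, now + self._ttl)
        return ip
```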
Before deploying a networked application, verify:
You’ve covered a lot of ground in this book — from the OSI model to packet crafting, from TCP sockets to async servers, from firewall rules to network automation. Here are paths for continued learning:
- io_uring, DPDK, and kernel bypass networking

- Core diagnostic tools: ping, traceroute, ss, dig, curl, tcpdump
- TCP states that reveal application bugs (CLOSE_WAIT) and resource exhaustion (TIME_WAIT)
- Socket options (TCP_NODELAY, buffer sizes, keepalive) and OS parameters for performance