Chapter 13: Troubleshooting and Best Practices


The Art of Network Debugging

Network problems are notoriously difficult to diagnose because they span multiple layers, involve multiple devices, and can be intermittent. But with a systematic approach and the right tools, you can isolate and resolve most issues efficiently.

This final chapter brings together everything you’ve learned into a practical troubleshooting methodology and a set of best practices for building reliable networked systems.


The Troubleshooting Methodology

Follow the bottom-up approach — start at the Physical layer and work your way up:

Layer 1 (Physical)    → Is the cable plugged in? Is the link up?
Layer 2 (Data Link)   → Can I see MAC addresses? Is ARP working?
Layer 3 (Network)     → Can I ping the gateway? Is routing correct?
Layer 4 (Transport)   → Is the port open? Is TCP connecting?
Layer 7 (Application) → Is the service responding correctly?

At each layer, ask: “Does this layer work?” If yes, move up. If no, you’ve found the problem layer.
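
The bottom-up walk can be sketched as a small runner that stops at the first failing layer. This is only an illustration: the check functions below are hypothetical stand-ins, and in practice each would wrap one of the commands shown later in this chapter.

```python
from typing import Callable, List, Optional, Tuple

# Each entry pairs a layer name with a check that returns True when the layer works.
Check = Tuple[str, Callable[[], bool]]

def first_failing_layer(checks: List[Check]) -> Optional[str]:
    """Run checks in order (bottom-up) and return the first failing layer, or None."""
    for layer, check in checks:
        if not check():
            return layer
    return None

# Hypothetical stand-ins for real probes
checks = [
    ("Layer 1 (Physical)", lambda: True),
    ("Layer 2 (Data Link)", lambda: True),
    ("Layer 3 (Network)", lambda: False),   # pretend the gateway ping failed
    ("Layer 4 (Transport)", lambda: True),  # never reached
]
print(first_failing_layer(checks))
```

Stopping at the first failure matters because a fault at a lower layer usually makes every check above it fail as well.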


Layers 1 and 2: Physical and Data Link Diagnostics

Check Interface Status

# Is the interface up? Is the link detected?
ip link show eth0

# Look for:
# - UP inside <...> (interface administratively enabled)
# - LOWER_UP inside <...> (physical link detected)
# - state UP (interface operational)
# - NO-CARRIER (cable disconnected or link partner down)

# Check for errors
ip -s link show eth0
# Look for: RX errors, TX errors, dropped packets, overruns
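
If you want to script this check, the first line of `ip link show` output can be parsed for the flags described above. The following is a rough sketch; the function name and the sample line are illustrative assumptions, and the output format may vary between iproute2 versions.

```python
import re

def parse_link_line(line: str) -> dict:
    """Extract admin state, carrier, and operational state from an `ip link` line."""
    flags_match = re.search(r"<([^>]*)>", line)
    state_match = re.search(r"\bstate (\S+)", line)
    flags = set(flags_match.group(1).split(",")) if flags_match else set()
    return {
        "admin_up": "UP" in flags,       # interface enabled by the administrator
        "carrier": "LOWER_UP" in flags,  # physical link detected
        "state": state_match.group(1) if state_match else "UNKNOWN",
    }

# Assumed sample: an enabled interface with the cable unplugged
sample = ("2: eth0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 "
          "qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000")
print(parse_link_line(sample))
```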

Check ARP/NDP

# Can we resolve the gateway's MAC address?
ip neigh show

# If the gateway shows FAILED or is missing:
# - Wrong IP configuration
# - Wrong VLAN
# - Physical connectivity issue

# Force ARP resolution
arping -c 3 192.168.1.1

Layer 3: Network Diagnostics

Ping — Basic Reachability

# Ping the default gateway
ping -c 4 192.168.1.1

# Ping a public DNS server
ping -c 4 8.8.8.8

# Ping with specific packet size (check MTU issues)
ping -c 4 -s 1472 -M do 8.8.8.8  # 1472 + 28 = 1500 (standard MTU)
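
The payload size in the MTU test comes from subtracting the IP and ICMP headers from the MTU. A small helper makes the arithmetic explicit; the function name is an assumption for illustration, and the header sizes assume no IPv4 options or IPv6 extension headers.

```python
IPV4_HEADER = 20  # bytes, without IP options
IPV6_HEADER = 40  # bytes, fixed header only
ICMP_HEADER = 8   # bytes, echo request header (same size for ICMPv6)

def max_ping_payload(mtu: int, ipv6: bool = False) -> int:
    """Largest ping payload (-s value) that fits in one unfragmented packet."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    return mtu - ip_header - ICMP_HEADER

print(max_ping_payload(1500))        # 1472
print(max_ping_payload(1500, True))  # 1452
print(max_ping_payload(9000))        # 8972 (jumbo frames)
```

If a ping with this payload and the don't-fragment flag fails while smaller payloads succeed, some hop on the path likely has a smaller MTU.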

Interpreting ping results:

Symptom                          Likely Cause
Destination Host Unreachable     No route to host, or ARP failure
Network is unreachable           No route in routing table
Request timeout                  Firewall blocking ICMP, or host is down
High latency (>100 ms on LAN)    Congestion, duplex mismatch, bad cable
Packet loss (>1%)                Congestion, interference, hardware issue
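
The latency and loss thresholds in the table can be checked automatically. This sketch assumes the summary format printed by the Linux iputils `ping`; other implementations word the summary differently, and the sample text and function names are illustrative.

```python
import re

def parse_ping_summary(output: str) -> dict:
    """Pull packet loss and average RTT out of iputils-style ping output."""
    loss = re.search(r"([\d.]+)% packet loss", output)
    rtt = re.search(r"= ([\d.]+)/([\d.]+)/([\d.]+)", output)  # min/avg/max
    return {
        "loss_pct": float(loss.group(1)) if loss else None,
        "avg_ms": float(rtt.group(2)) if rtt else None,
    }

def flag_problems(summary: dict) -> list:
    """Apply the LAN rules of thumb: more than 1% loss or 100 ms average latency."""
    problems = []
    if summary["loss_pct"] is not None and summary["loss_pct"] > 1:
        problems.append("packet loss above 1%")
    if summary["avg_ms"] is not None and summary["avg_ms"] > 100:
        problems.append("average latency above 100 ms")
    return problems

# Assumed sample output, not captured from a real run
sample = (
    "4 packets transmitted, 3 received, 25% packet loss, time 3004ms\n"
    "rtt min/avg/max/mdev = 0.412/0.530/0.701/0.120 ms\n"
)
summary = parse_ping_summary(sample)
print(summary, flag_problems(summary))
```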

Traceroute — Path Analysis

# Show the path to a destination
traceroute 8.8.8.8

# Use ICMP instead of UDP (may work better through firewalls)
sudo traceroute -I 8.8.8.8

# Use TCP SYN on port 80 (works through most firewalls)
sudo traceroute -T -p 80 example.com

# MTR — continuous traceroute with statistics
mtr 8.8.8.8

Reading traceroute output:

 1  192.168.1.1      1.2 ms     ← Local gateway
 2  10.0.0.1         5.3 ms     ← ISP router
 3  * * *                       ← Router doesn't respond to probes
 4  72.14.215.85    15.7 ms     ← Internet backbone
 5  8.8.8.8         12.1 ms     ← Destination

Stars (* * *) don’t necessarily mean a problem — many routers are configured to not respond to traceroute probes.

Route Verification

# Show the routing table
ip route show

# Which route will be used for a specific destination?
ip route get 8.8.8.8

# Check if IP forwarding is enabled (for routers/gateways)
sysctl net.ipv4.ip_forward
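
To build intuition for what `ip route get` reports, here is a simplified sketch of longest-prefix matching, the basic rule for choosing a route. It deliberately ignores metrics, policy routing, and multiple routing tables, and the route entries are made up for illustration.

```python
import ipaddress

def select_route(routes, destination):
    """Return (prefix, gateway) for the most specific matching route, or None."""
    dest = ipaddress.ip_address(destination)
    matches = []
    for prefix, gateway in routes:
        network = ipaddress.ip_network(prefix)
        if dest in network:
            matches.append((network, gateway))
    if not matches:
        return None
    # The route with the longest prefix (most specific network) wins
    network, gateway = max(matches, key=lambda m: m[0].prefixlen)
    return str(network), gateway

# Hypothetical routing table: (destination prefix, next hop)
routes = [
    ("0.0.0.0/0", "192.168.1.1"),     # default route
    ("10.0.0.0/8", "192.168.1.254"),
    ("10.1.2.0/24", "10.1.2.1"),
]
print(select_route(routes, "10.1.2.7"))
print(select_route(routes, "8.8.8.8"))
```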

Layer 4: Transport Diagnostics

ss — Socket Statistics

ss (replacement for netstat) is the primary tool for inspecting TCP/UDP sockets:

# List all listening TCP ports
ss -tlnp

# List all established TCP connections
ss -tnp

# List all UDP sockets
ss -ulnp

# Show socket statistics (retransmissions, RTT)
ss -ti

# Filter by port
ss -tnp 'sport = :80'

# Filter by state
ss -tn state established
ss -tn state time-wait

Connection Testing

# Test if a TCP port is open
nc -zv example.com 80
nc -zv example.com 443

# Test with timeout
nc -zv -w 3 example.com 22

# Show connection timing
curl -o /dev/null -w "DNS: %{time_namelookup}s\nConnect: %{time_connect}s\nTLS: %{time_appconnect}s\nTotal: %{time_total}s\n" https://example.com

TCP Connection States

Understanding TCP states helps diagnose connection issues:

State        Meaning                                       Common Issue
LISTEN       Server waiting for connections                Normal
ESTABLISHED  Active connection                             Normal
SYN_SENT     Client waiting for SYN-ACK                    Firewall blocking, server down
SYN_RECV     Server received SYN, awaiting final ACK       Normal in small numbers; many = possible SYN flood
TIME_WAIT    Connection closed, waiting for stale packets  Normal; too many = port exhaustion
CLOSE_WAIT   Remote side closed, local side hasn’t         Application bug (not closing sockets)
FIN_WAIT1/2  Closing connection                            Normal transition

# Count connections by state
ss -tan | awk 'NR > 1 {print $1}' | sort | uniq -c | sort -rn   # NR > 1 skips the header row
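
The same count can be done in Python when the result feeds a monitoring script. This sketch works on text in the format `ss -tan` prints (where state names are abbreviated, for example ESTAB and CLOSE-WAIT); the sample output and the function name are assumptions for illustration.

```python
from collections import Counter

def count_states(ss_output: str) -> Counter:
    """Count sockets per TCP state from `ss -tan` style output."""
    lines = ss_output.strip().splitlines()
    # Skip the header row; the state is the first column of every other row
    return Counter(line.split()[0] for line in lines[1:] if line.strip())

# Assumed sample, not captured from a real host
sample = """\
State      Recv-Q Send-Q Local Address:Port  Peer Address:Port
LISTEN     0      128    0.0.0.0:80          0.0.0.0:*
ESTAB      0      0      10.0.0.5:80         10.0.0.9:51514
CLOSE-WAIT 1      0      10.0.0.5:80         10.0.0.7:40112
CLOSE-WAIT 1      0      10.0.0.5:80         10.0.0.8:40230
"""
states = count_states(sample)
print(states.most_common())
```

A CLOSE-WAIT count that keeps growing over time is the pattern to watch for, since it points at sockets the application never closes.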

Layer 7: Application Diagnostics

DNS Troubleshooting

# Test DNS resolution
dig example.com
dig @8.8.8.8 example.com   # Query a specific server

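# Request all record types (many servers refuse or minimize ANY
# queries per RFC 8482; query specific types such as A, AAAA, MX instead)
dig example.com ANY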

# Reverse DNS
dig -x 93.184.216.34

# Trace the delegation path from the root servers down to the
# authoritative servers (bypasses local caches)
dig +trace example.com

# Test with system resolver
getent hosts example.com

Common DNS issues:

Symptom                Check
Name resolution fails  cat /etc/resolv.conf, try dig @8.8.8.8
Resolves to wrong IP   Check for stale DNS cache, conflicting /etc/hosts
Slow resolution        DNS server may be unreachable; check with dig
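
Since /etc/resolv.conf is the first thing to check, a short parser can list the configured nameservers so each one can be tested in turn. This is a minimal sketch that handles only the `nameserver` and `search` directives; the sample contents are made up.

```python
def parse_resolv_conf(text: str) -> dict:
    """Return the nameservers and search domains from resolv.conf contents."""
    nameservers, search = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":  # skip blank lines and comments
            continue
        parts = line.split()
        if parts[0] == "nameserver" and len(parts) > 1:
            nameservers.append(parts[1])
        elif parts[0] == "search":
            search = parts[1:]
    return {"nameservers": nameservers, "search": search}

# Hypothetical file contents
sample = """\
# Generated by a network manager
nameserver 192.168.1.1
nameserver 8.8.8.8
search corp.example.com example.com
"""
print(parse_resolv_conf(sample))
```

On a real system you would pass in the contents of /etc/resolv.conf. If the only nameserver is a local stub such as 127.0.0.53, the upstream servers are configured elsewhere.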

HTTP Troubleshooting

# Verbose HTTP request
curl -v https://example.com

# Show only headers
curl -I https://example.com

# Follow redirects
curl -L -v https://example.com

# Test with specific HTTP version
curl --http2 https://example.com

# POST with data
curl -X POST -d '{"key": "value"}' -H "Content-Type: application/json" https://api.example.com/data

Python Troubleshooting Toolkit

Build a comprehensive diagnostic script:

import socket
import subprocess
import ssl
import time

def check_dns(hostname: str) -> str:
    try:
        ip = socket.gethostbyname(hostname)  # IPv4 only
        return f"✓ DNS: {hostname} → {ip}"
    except socket.gaierror as e:
        return f"✗ DNS: {hostname} → {e}"

def check_tcp(host: str, port: int, timeout: float = 5.0) -> str:
    try:
        start = time.perf_counter()  # monotonic clock, suited to measuring intervals
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = (time.perf_counter() - start) * 1000
            return f"✓ TCP: {host}:{port} → {elapsed:.1f} ms"
    except OSError as e:  # covers timeouts, refusals, and unreachable hosts
        return f"✗ TCP: {host}:{port} → {e}"

def check_tls(hostname: str, port: int = 443) -> str:
    try:
        ctx = ssl.create_default_context()
        with socket.create_connection((hostname, port), timeout=5) as sock:
            with ctx.wrap_socket(sock, server_hostname=hostname) as ssock:
                cert = ssock.getpeercert()
                version = ssock.version()
                not_after = cert.get("notAfter", "unknown")
                return f"✓ TLS: {version}, expires {not_after}"
    except Exception as e:
        return f"✗ TLS: {e}"

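def check_ping(host: str) -> str:
    # -c and -W are the Linux (iputils) flags: 3 probes, 2-second reply timeout
    result = subprocess.run(
        ["ping", "-c", "3", "-W", "2", host],
        capture_output=True, text=True
    )
    if result.returncode == 0:
        # Extract average RTT from the "rtt min/avg/max/mdev" summary line
        for line in result.stdout.splitlines():
            if "avg" in line:
                return f"✓ Ping: {line.strip()}"
        # Ping succeeded but printed no summary line we recognize
        return f"✓ Ping: {host} reachable"
    return f"✗ Ping: {host} unreachable"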

# Run diagnostics
target = "example.com"
print(check_dns(target))
print(check_ping(target))
print(check_tcp(target, 80))
print(check_tcp(target, 443))
print(check_tls(target))

Full example: code/network_diagnostics.py


Performance Best Practices

Socket Tuning

import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Increase buffer sizes for high-throughput applications
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 262144)  # 256 KB
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 262144)

# Disable Nagle's algorithm for low-latency
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Enable TCP keepalive for long-lived connections
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
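
The kernel may adjust the values you request, so it is worth reading them back with `getsockopt`. On Linux, for example, the reported buffer size is typically double the requested value and is capped by `net.core.rmem_max` / `net.core.wmem_max`. A brief sketch, with a helper name chosen for illustration:

```python
import socket

def effective_options(sock: socket.socket) -> dict:
    """Read back the options the kernel actually applied to the socket."""
    return {
        "rcvbuf": sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF),
        "sndbuf": sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
        "nodelay": bool(sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)),
        "keepalive": bool(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)),
    }

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 262144)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 262144)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    opts = effective_options(s)

print(opts)  # buffer sizes depend on the operating system and its limits
```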

System-Level Tuning

# Increase the maximum number of open file descriptors
# (applies to the current shell and the processes it starts)
ulimit -n 65536

# Increase the accept queue and SYN backlog limits for listening sockets
sudo sysctl -w net.core.somaxconn=65535
sudo sysctl -w net.ipv4.tcp_max_syn_backlog=65535

# Enable TCP fast open
sudo sysctl -w net.ipv4.tcp_fastopen=3

# Increase port range for outgoing connections
sudo sysctl -w net.ipv4.ip_local_port_range="1024 65535"

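# Shorten how long orphaned connections stay in FIN_WAIT_2
# (this does not change TIME_WAIT, which is fixed at 60 seconds on Linux)
sudo sysctl -w net.ipv4.tcp_fin_timeout=15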

Connection Pooling

Always reuse connections when making multiple requests to the same server:

import requests

# Bad: new connection per request
for i in range(100):
    requests.get("https://api.example.com/data")  # 100 TCP + TLS handshakes

# Good: reuse connections via a session
with requests.Session() as session:
    for i in range(100):
        session.get("https://api.example.com/data")  # typically 1 TCP + TLS handshake

Security Best Practices

Practice               Description
Encrypt everything     Use TLS for all network communication
Validate certificates  Never disable certificate verification in production
Use strong ciphers     TLS 1.3 with AES-256-GCM or ChaCha20-Poly1305
Firewall defaults      Default-deny; allow only necessary traffic
Least privilege        Run services as non-root; bind to specific interfaces
Input validation       Never trust data from the network
Rate limiting          Protect services from abuse and DoS
Logging                Log connection attempts, errors, and unusual patterns
Update regularly       Patch OS, libraries, and network device firmware
Segment networks       Use VLANs and firewalls to limit blast radius

Common Pitfalls and Solutions

“Address Already in Use”

# Problem: Restarting a server fails because the port is in TIME_WAIT
# Solution: Set SO_REUSEADDR before binding
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
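
The option only helps if it is set before `bind()`. Here is a minimal sketch of a listener that follows that order; the helper name and defaults are assumptions, and port 0 asks the operating system to pick any free port so the example runs anywhere.

```python
import socket

def make_listener(host: str = "127.0.0.1", port: int = 0, backlog: int = 128) -> socket.socket:
    """Create a listening TCP socket with SO_REUSEADDR set before binding."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)  # must precede bind()
        sock.bind((host, port))
        sock.listen(backlog)
    except OSError:
        sock.close()  # do not leak the descriptor if setup fails
        raise
    return sock

listener = make_listener()
print("listening on", listener.getsockname())
listener.close()
```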

“Connection Refused”

Checklist:

  1. Is the server running? (ss -tlnp | grep <port>)
  2. Is it listening on the right interface? (0.0.0.0 vs 127.0.0.1)
  3. Is a firewall blocking the port? (iptables -L -n)

“Connection Timed Out”

Checklist:

  1. Is the host reachable? (ping <host>)
  2. Is the port open? (nc -zv <host> <port>)
  3. Is there a routing issue? (traceroute <host>)
  4. Is a firewall silently dropping packets? (check with tcpdump)
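
Telling "refused" apart from "timed out" is the first step in choosing between the two checklists above. The sketch below classifies a connection attempt by the exception it raises; the function name is illustrative, and the demo uses a throwaway local listener so it needs no network access.

```python
import socket

def diagnose_connect(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a TCP connection attempt as open, refused, timed out, or other error."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"    # something answered with a reset: host is up, port is closed
    except socket.timeout:
        return "timed out"  # no answer at all: packets dropped, or host unreachable
    except OSError as exc:
        return f"error: {exc}"

# Demo against a temporary local listener
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
demo_port = server.getsockname()[1]
print(diagnose_connect("127.0.0.1", demo_port))  # listener is up
server.close()
print(diagnose_connect("127.0.0.1", demo_port))  # listener is gone
```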

Partial Reads / Incomplete Data

# Problem: recv() returns less data than expected
# Solution: Loop until you have all the data
def recv_all(sock, length):
    data = b""
    while len(data) < length:
        chunk = sock.recv(length - len(data))
        if not chunk:
            raise ConnectionError("Connection closed prematurely")
        data += chunk
    return data
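
A receive loop like this is usually paired with a framing scheme so the receiver knows how many bytes to wait for. One common approach, sketched here as an assumption rather than a prescribed protocol, is a 4-byte big-endian length prefix before each message. The block repeats the loop under the name `recv_exactly` so it runs on its own.

```python
import socket
import struct

def recv_exactly(sock: socket.socket, length: int) -> bytes:
    """Read exactly `length` bytes, looping over partial reads."""
    buf = bytearray()
    while len(buf) < length:
        chunk = sock.recv(length - len(buf))
        if not chunk:
            raise ConnectionError("Connection closed prematurely")
        buf += chunk
    return bytes(buf)

def send_message(sock: socket.socket, payload: bytes) -> None:
    """Send a 4-byte big-endian length header followed by the payload."""
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def recv_message(sock: socket.socket) -> bytes:
    """Read the length header, then read exactly that many payload bytes."""
    (length,) = struct.unpack("!I", recv_exactly(sock, 4))
    return recv_exactly(sock, length)

# Demo over a connected pair of local sockets
left, right = socket.socketpair()
with left, right:
    send_message(left, b"hello")
    send_message(left, b"world!")
    print(recv_message(right), recv_message(right))
```

The two messages arrive as separate units even though TCP itself delivers only a byte stream with no message boundaries.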

DNS Resolution Delays

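# Problem: DNS lookups add latency to every connection
# Solution: Cache DNS results
# Caveat: lru_cache never expires entries, so a long-running
# process will not notice later DNS changes
import functools
import socket

@functools.lru_cache(maxsize=256)
def resolve(hostname: str) -> str:
    return socket.gethostbyname(hostname)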
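
For long-running services, a cache whose entries expire is usually safer than one that keeps them forever. The sketch below uses a fixed lifetime per entry; that lifetime is an assumption, since `socket.gethostbyname` does not expose the record's real TTL. The class name is hypothetical, and the lookup function and clock are injectable so the demo runs without network access.

```python
import socket
import time

class TTLResolver:
    """Cache lookups for a fixed number of seconds (an assumed TTL, not the record's)."""

    def __init__(self, ttl=60.0, lookup=socket.gethostbyname, clock=time.monotonic):
        self._ttl = ttl
        self._lookup = lookup
        self._clock = clock
        self._cache = {}  # hostname -> (ip, expiry time)

    def resolve(self, hostname: str) -> str:
        now = self._clock()
        cached = self._cache.get(hostname)
        if cached is not None and cached[1] > now:
            return cached[0]  # still fresh
        ip = self._lookup(hostname)
        self._cache[hostname] = (ip, now + self._ttl)
        return ip

# Demo with a fake lookup and a fake clock
calls = []
fake_now = [0.0]

def fake_lookup(name: str) -> str:
    calls.append(name)
    return "192.0.2.10"  # documentation-range address

resolver = TTLResolver(ttl=30.0, lookup=fake_lookup, clock=lambda: fake_now[0])
resolver.resolve("example.com")  # miss: performs a lookup
resolver.resolve("example.com")  # hit: served from the cache
fake_now[0] = 31.0
resolver.resolve("example.com")  # expired: looks up again
print(len(calls))
```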

Production Readiness Checklist

Before deploying a networked application, verify:


Where to Go from Here

You’ve covered a lot of ground in this book — from the OSI model to packet crafting, from TCP sockets to async servers, from firewall rules to network automation. Here are paths for continued learning:


Key Takeaways

