Chapter 13: Troubleshooting and Best Practices

The Art of Network Debugging

Network problems are notoriously difficult to diagnose because they span multiple layers, involve multiple devices, and can be intermittent. But with a systematic approach and the right tools, you can isolate and resolve most issues efficiently.

This final chapter brings together everything you’ve learned into a practical troubleshooting methodology and a set of best practices for building reliable networked systems.

The Troubleshooting Methodology

Follow the bottom-up approach — start at the Physical layer and work your way up:

Layer 1 (Physical)    → Is the cable plugged in? Is the link up?
Layer 2 (Data Link)   → Can I see MAC addresses? Is ARP working?
Layer 3 (Network)     → Can I ping the gateway? Is routing correct?
Layer 4 (Transport)   → Is the port open? Is TCP connecting?
Layer 7 (Application) → Is the service responding correctly?

At each layer, ask: “Does this layer work?” If yes, move up. If no, you’ve found the problem layer.
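
The bottom-up walk can be sketched as a small loop that runs one check per layer and stops at the first failure. This is a minimal illustration, not a complete tool: the function name `first_failing_layer` and the placeholder lambdas are assumptions made for the example, and real checks would wrap the commands shown later in this chapter.

```python
from typing import Callable, List, Optional, Tuple

# Each entry pairs a layer label with a zero-argument check returning True on success.
Check = Tuple[str, Callable[[], bool]]


def first_failing_layer(checks: List[Check]) -> Optional[str]:
    """Run the checks in order (bottom-up) and return the first failing layer."""
    for layer, check in checks:
        try:
            ok = check()
        except Exception:
            ok = False  # a check that raises is treated as a failure
        if not ok:
            return layer
    return None  # every layer passed


# Placeholder checks (hypothetical); Layer 3 simulates a failure.
checks = [
    ("Layer 1 (Physical)", lambda: True),
    ("Layer 2 (Data Link)", lambda: True),
    ("Layer 3 (Network)", lambda: False),
    ("Layer 4 (Transport)", lambda: True),
    ("Layer 7 (Application)", lambda: True),
]
print(first_failing_layer(checks))
```

Because the loop stops at the first failure, the higher layers are never tested until the lower ones are known to work, which is the point of the methodology.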

Layer 1–2: Physical and Link Diagnostics

Check Interface Status

# Is the interface up? Is the link detected?
ip link show eth0

# Look for:
# - state UP (interface enabled)
# - <LOWER_UP> (physical link detected)
# - NO-CARRIER (cable disconnected or link partner down)

# Check for errors
ip -s link show eth0
# Look for: RX errors, TX errors, dropped packets, overruns
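
On Linux, the same link state and error counters are also exposed as files under /sys/class/net/<interface>/. The sketch below reads them directly, which can be convenient in a monitoring script. The function name `read_link_status` and the `sysfs_root` parameter are assumptions for illustration; the parameter exists only so the function can be pointed at a different directory.

```python
from pathlib import Path
from typing import Dict, Union


def read_link_status(iface: str, sysfs_root: str = "/sys/class/net") -> Dict[str, Union[str, bool, int]]:
    """Read link state and error counters for one interface from sysfs."""
    base = Path(sysfs_root) / iface

    def read(rel: str, default: str) -> str:
        # Reading "carrier" fails on an administratively down interface,
        # so every read falls back to a default instead of raising.
        try:
            return (base / rel).read_text().strip()
        except OSError:
            return default

    def counter(name: str) -> int:
        value = read(f"statistics/{name}", "0")
        return int(value) if value.isdigit() else 0

    return {
        "operstate": read("operstate", "unknown"),      # e.g. "up" or "down"
        "link_detected": read("carrier", "0") == "1",   # physical link present
        "rx_errors": counter("rx_errors"),
        "tx_errors": counter("tx_errors"),
        "rx_dropped": counter("rx_dropped"),
        "tx_dropped": counter("tx_dropped"),
    }
```

Calling `read_link_status("eth0")` returns a dictionary; counters that keep growing between two calls are a stronger signal than a single non-zero value.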

Check ARP/NDP

# Can we resolve the gateway's MAC address?
ip neigh show

# If the gateway shows FAILED or is missing:
# - Wrong IP configuration
# - Wrong VLAN
# - Physical connectivity issue

# Force ARP resolution
arping -c 3 192.168.1.1
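
The neighbor-table check can be automated by parsing the text printed by `ip neigh show`. The sketch below assumes the usual line layout (address first, state last, MAC after `lladdr`); the function names and the sample text are made up for illustration.

```python
from typing import Dict, List, Optional


def parse_neighbors(output: str) -> List[Dict[str, Optional[str]]]:
    """Parse `ip neigh show` text into dicts with ip, mac, and state."""
    entries = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        mac = None
        if "lladdr" in fields:
            idx = fields.index("lladdr")
            if idx + 1 < len(fields):
                mac = fields[idx + 1]
        entries.append({"ip": fields[0], "mac": mac, "state": fields[-1]})
    return entries


def gateway_resolved(output: str, gateway: str) -> bool:
    """True if the gateway has an entry that is not FAILED or INCOMPLETE."""
    for entry in parse_neighbors(output):
        if entry["ip"] == gateway:
            return entry["state"] not in ("FAILED", "INCOMPLETE")
    return False  # a missing entry is treated as unresolved


# Illustrative sample, not captured from a real host.
sample = """192.168.1.1 dev eth0 lladdr aa:bb:cc:dd:ee:ff REACHABLE
192.168.1.50 dev eth0 FAILED"""
print(gateway_resolved(sample, "192.168.1.1"))
```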

Layer 3: Network Diagnostics

Ping — Basic Reachability

# Ping the default gateway
ping -c 4 192.168.1.1

# Ping a public DNS server
ping -c 4 8.8.8.8

# Ping with specific packet size (check MTU issues)
ping -c 4 -s 1472 -M do 8.8.8.8  # 1472 + 28 = 1500 (standard MTU)
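
The 1472 figure comes from subtracting the IP and ICMP headers from the MTU. A small helper makes the arithmetic explicit for other MTU values; the function name is hypothetical, and the header sizes assume no IPv4 options or IPv6 extension headers.

```python
IPV4_HEADER = 20  # bytes, without options
IPV6_HEADER = 40  # bytes, fixed header only
ICMP_HEADER = 8   # bytes, echo request header (ICMP and ICMPv6)


def max_ping_payload(mtu: int, ipv6: bool = False) -> int:
    """Largest ping payload (-s value) that fits in one packet of the given MTU."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    return mtu - ip_header - ICMP_HEADER


print(max_ping_payload(1500))             # 1472, standard Ethernet
print(max_ping_payload(1492))             # 1464, a common PPPoE MTU
print(max_ping_payload(1500, ipv6=True))  # 1452
```

If a ping with the computed size and `-M do` fails while a smaller one succeeds, some hop on the path has a smaller MTU than the local interface.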

Interpreting ping results:

Symptom	Likely Cause
`Destination Host Unreachable`	No route to host, or ARP failure
`Network is unreachable`	No route in routing table
`Request timeout`, or no replies and 100% packet loss	Firewall blocking ICMP, or host is down
High latency (>100 ms on LAN)	Congestion, duplex mismatch, bad cable
Packet loss (>1%)	Congestion, interference, hardware issue
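
The last two rows of the table can be checked automatically by parsing the summary that ping prints at the end. The sketch below assumes the Linux iputils summary format; the function names are hypothetical, and the thresholds are the ones from the table.

```python
import re
from typing import Dict, List, Optional


def summarize_ping(output: str) -> Dict[str, Optional[float]]:
    """Extract packet loss (%) and average RTT (ms) from ping's summary lines."""
    loss = re.search(r"(\d+(?:\.\d+)?)% packet loss", output)
    rtt = re.search(r"= (\d+(?:\.\d+)?)/(\d+(?:\.\d+)?)/(\d+(?:\.\d+)?)", output)
    return {
        "loss_pct": float(loss.group(1)) if loss else None,
        "avg_ms": float(rtt.group(2)) if rtt else None,  # min/avg/max order
    }


def ping_warnings(summary: Dict[str, Optional[float]]) -> List[str]:
    """Apply the LAN thresholds from the table above."""
    warnings = []
    if summary["loss_pct"] is not None and summary["loss_pct"] > 1:
        warnings.append("packet loss above 1%")
    if summary["avg_ms"] is not None and summary["avg_ms"] > 100:
        warnings.append("average latency above 100 ms")
    return warnings


# Illustrative sample in the iputils format, not captured from a real run.
sample = """4 packets transmitted, 3 received, 25% packet loss, time 3004ms
rtt min/avg/max/mdev = 11.2/12.1/13.0/0.7 ms"""
print(ping_warnings(summarize_ping(sample)))
```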

Traceroute — Path Analysis

# Show the path to a destination
traceroute 8.8.8.8

# Use ICMP instead of UDP (may work better through firewalls)
sudo traceroute -I 8.8.8.8

# Use TCP SYN on port 80 (works through most firewalls)
sudo traceroute -T -p 80 example.com

# MTR — continuous traceroute with statistics
mtr 8.8.8.8

Reading traceroute output:

192.168.1.1      1.2 ms     ← Local gateway
10.0.0.1         5.3 ms     ← ISP router
* * *                       ← Router doesn't respond to probes
72.14.215.85    15.7 ms     ← Internet backbone
8.8.8.8         12.1 ms     ← Destination

Stars (* * *) don’t necessarily mean a problem — many routers are configured to not respond to traceroute probes.

Route Verification

# Show the routing table
ip route show

# Which route will be used for a specific destination?
ip route get 8.8.8.8

# Check if IP forwarding is enabled (for routers/gateways)
sysctl net.ipv4.ip_forward
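
The core rule behind `ip route get` is longest-prefix match: among all routes that contain the destination, the most specific one wins. The sketch below shows that rule with the standard `ipaddress` module. It is a simplification, since a real kernel lookup also considers metrics and policy routing, and the route list and function name are invented for the example.

```python
import ipaddress
from typing import List, Optional, Tuple


def select_route(routes: List[Tuple[str, str]], destination: str) -> Optional[str]:
    """Return the next hop of the most specific route containing destination."""
    dest = ipaddress.ip_address(destination)
    best = None
    for prefix, next_hop in routes:
        network = ipaddress.ip_network(prefix)
        if dest in network and (best is None or network.prefixlen > best[0].prefixlen):
            best = (network, next_hop)
    return best[1] if best else None


# Hypothetical routing table: (prefix, next hop)
routes = [
    ("0.0.0.0/0", "192.168.1.1"),   # default route
    ("10.0.0.0/8", "10.1.1.1"),
    ("10.2.0.0/16", "10.2.0.1"),
]
print(select_route(routes, "10.2.3.4"))  # 10.2.0.1 (the /16 beats the /8)
print(select_route(routes, "8.8.8.8"))   # 192.168.1.1 (default route)
```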

Layer 4: Transport Diagnostics

ss — Socket Statistics

ss (replacement for netstat) is the primary tool for inspecting TCP/UDP sockets:

# List all listening TCP ports
ss -tlnp

# List all established TCP connections
ss -tnp

# List listening UDP sockets (use -a instead of -l to include connected ones)
ss -ulnp

# Show socket statistics (retransmissions, RTT)
ss -ti

# Filter by port
ss -tnp 'sport = :80'

# Filter by state
ss -tn state established
ss -tn state time-wait

Connection Testing

# Test if a TCP port is open
nc -zv example.com 80
nc -zv example.com 443

# Test with timeout
nc -zv -w 3 example.com 22

# Show connection timing
curl -o /dev/null -w "DNS: %{time_namelookup}s\nConnect: %{time_connect}s\nTLS: %{time_appconnect}s\nTotal: %{time_total}s\n" https://example.com

TCP Connection States

Understanding TCP states helps diagnose connection issues:

State	Meaning	Common Issue
LISTEN	Server waiting for connections	Normal
ESTABLISHED	Active connection	Normal
SYN_SENT	Client waiting for SYN-ACK	Firewall blocking, server down
SYN_RECV	Server received SYN, waiting for the final ACK	Normal briefly; many = possible SYN flood
TIME_WAIT	Connection closed, waiting for stale packets	Normal; too many = port exhaustion
CLOSE_WAIT	Remote side closed, local side hasn’t	Application bug (not closing sockets)
FIN_WAIT1/2	Closing connection	Normal transition

# Count connections by state (NR > 1 skips the header line)
ss -tan | awk 'NR > 1 {print $1}' | sort | uniq -c | sort -rn
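
The same count is easy to produce in Python when the result needs to feed an alert or a report. The sketch below parses text in the layout printed by `ss -tan`; note that ss abbreviates some state names (for example ESTAB and TIME-WAIT). The function name and the sample text are illustrative assumptions.

```python
from collections import Counter


def count_tcp_states(ss_output: str) -> Counter:
    """Count sockets per state from `ss -tan` text (first column, header skipped)."""
    counts = Counter()
    for line in ss_output.splitlines()[1:]:  # the first line is the column header
        fields = line.split()
        if fields:
            counts[fields[0]] += 1
    return counts


# Illustrative sample, not captured from a real host.
sample = """State      Recv-Q Send-Q Local Address:Port  Peer Address:Port
LISTEN     0      128    0.0.0.0:22          0.0.0.0:*
ESTAB      0      0      192.168.1.10:22     192.168.1.20:51234
ESTAB      0      0      192.168.1.10:443    192.168.1.30:40022
CLOSE-WAIT 1      0      192.168.1.10:8080   192.168.1.40:50310
TIME-WAIT  0      0      192.168.1.10:443    192.168.1.31:40100"""

for state, count in count_tcp_states(sample).most_common():
    print(f"{count:5d} {state}")
```

A CLOSE-WAIT count that keeps growing over time usually points at the application-side socket leak described in the table.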

Layer 7: Application Diagnostics

DNS Troubleshooting

# Test DNS resolution
dig example.com
dig @8.8.8.8 example.com   # Query a specific server

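# Check specific record types (many servers refuse or minimize ANY queries, per RFC 8482)
dig example.com A
dig example.com AAAA
dig example.com MX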

# Reverse DNS
dig -x 93.184.216.34

# Trace the delegation path from the root servers down to the authoritative server
dig +trace example.com

# Test with system resolver
getent hosts example.com

Common DNS issues:

Symptom	Check
Name resolution fails	`cat /etc/resolv.conf`, try `dig @8.8.8.8`
Resolves to wrong IP	Check for stale DNS cache, conflicting `/etc/hosts`
Slow resolution	DNS server may be unreachable; check with `dig`

HTTP Troubleshooting

# Verbose HTTP request
curl -v https://example.com

# Show only headers
curl -I https://example.com

# Follow redirects
curl -L -v https://example.com

# Test with specific HTTP version
curl --http2 https://example.com

# POST with data
curl -X POST -d '{"key": "value"}' -H "Content-Type: application/json" https://api.example.com/data

Python Troubleshooting Toolkit

Build a comprehensive diagnostic script:

import socket
import subprocess
import ssl
import time

def check_dns(hostname: str) -> str:
    try:
        ip = socket.gethostbyname(hostname)
        return f"✓ DNS: {hostname} → {ip}"
    except socket.gaierror as e:
        return f"✗ DNS: {hostname} — {e}"

def check_tcp(host: str, port: int, timeout: float = 5.0) -> str:
    try:
        start = time.perf_counter()  # monotonic clock, unaffected by system time changes
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = (time.perf_counter() - start) * 1000
            return f"✓ TCP: {host}:{port} — {elapsed:.1f} ms"
    except OSError as e:  # covers timeouts, refused connections, and DNS failures
        return f"✗ TCP: {host}:{port} — {e}"

def check_tls(hostname: str, port: int = 443) -> str:
    try:
        ctx = ssl.create_default_context()
        with socket.create_connection((hostname, port), timeout=5) as sock:
            with ctx.wrap_socket(sock, server_hostname=hostname) as ssock:
                cert = ssock.getpeercert()
                version = ssock.version()
                not_after = cert.get("notAfter", "unknown")
                return f"✓ TLS: {version}, expires {not_after}"
    except Exception as e:
        return f"✗ TLS: {e}"

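def check_ping(host: str) -> str:
    try:
        result = subprocess.run(
            ["ping", "-c", "3", "-W", "2", host],  # flags as used by Linux ping
            capture_output=True, text=True
        )
    except FileNotFoundError:
        return "✗ Ping: ping command not found"
    if result.returncode == 0:
        # Extract the summary line that contains the average RTT
        for line in result.stdout.split("\n"):
            if "avg" in line:
                return f"✓ Ping: {line.strip()}"
        return f"✓ Ping: {host} reachable"
    return f"✗ Ping: {host} unreachable"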

# Run diagnostics
target = "example.com"
print(check_dns(target))
print(check_ping(target))
print(check_tcp(target, 80))
print(check_tcp(target, 443))
print(check_tls(target))

Full example: code/network_diagnostics.py

Performance Best Practices

Socket Tuning

import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Increase buffer sizes for high-throughput applications
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 262144)  # 256 KB
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 262144)

# Disable Nagle's algorithm for low-latency
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Enable TCP keepalive for long-lived connections
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
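
A requested buffer size is not necessarily the size the kernel applies. On Linux, for example, the value is doubled for bookkeeping overhead and capped by net.core.rmem_max. The sketch below reads the options back with getsockopt() so the effective values are visible; the function name `apply_and_report` is an assumption for illustration.

```python
import socket
from typing import Dict


def apply_and_report(rcvbuf: int = 262144) -> Dict[str, int]:
    """Set a few socket options, then read back what the kernel actually applied."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcvbuf)
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
        return {
            "requested_rcvbuf": rcvbuf,
            # The effective value may differ from the request (doubled or capped).
            "effective_rcvbuf": s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF),
            "nodelay": s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY),
            "keepalive": s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE),
        }


print(apply_and_report())
```

If the effective receive buffer comes back much smaller than requested, the system-wide limit is the likely cause and needs raising before the per-socket setting can take effect.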

System-Level Tuning

# Increase the maximum number of open file descriptors
ulimit -n 65536

# Increase the listen backlog and SYN queue limits
sudo sysctl -w net.core.somaxconn=65535
sudo sysctl -w net.ipv4.tcp_max_syn_backlog=65535

# Enable TCP fast open
sudo sysctl -w net.ipv4.tcp_fastopen=3

# Increase port range for outgoing connections
sudo sysctl -w net.ipv4.ip_local_port_range="1024 65535"

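# Reduce the FIN_WAIT_2 timeout for orphaned connections
# (this does not shorten TIME_WAIT, which is fixed at 60 seconds on Linux)
sudo sysctl -w net.ipv4.tcp_fin_timeout=15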

Connection Pooling

Always reuse connections when making multiple requests to the same server:

import requests

# Bad: new connection per request
for i in range(100):
    requests.get("https://api.example.com/data")  # 100 TCP + TLS handshakes

# Good: reuse connections via a session (closed automatically by the with block)
with requests.Session() as session:
    for i in range(100):
        session.get("https://api.example.com/data")  # 1 TCP + TLS handshake
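
The effect of reuse can be observed without any external service. The sketch below swaps in the standard library (`http.client` and `http.server`) for `requests`, starts a throwaway local server, and counts how many TCP connections it accepts in each mode. The class and function names are hypothetical, and plain HTTP is used, so TLS handshakes are not part of the measurement.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class CountingHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # required for keep-alive
    connections = 0

    def setup(self):
        super().setup()
        type(self).connections += 1  # one handler instance per TCP connection

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example output quiet


def count_connections(reuse: bool, request_count: int = 5) -> int:
    """Send request_count GETs and return how many connections the server saw."""
    CountingHandler.connections = 0
    server = ThreadingHTTPServer(("127.0.0.1", 0), CountingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address[:2]
    try:
        conn = http.client.HTTPConnection(host, port, timeout=5)
        for _ in range(request_count):
            conn.request("GET", "/")
            conn.getresponse().read()  # drain the body before the next request
            if not reuse:
                conn.close()  # the next request() opens a fresh connection
        conn.close()
    finally:
        server.shutdown()
        server.server_close()
    return CountingHandler.connections


print("without reuse:", count_connections(reuse=False))
print("with reuse:   ", count_connections(reuse=True))
```

With reuse, the server sees a single connection for all requests; without it, it sees one per request, each paying the full handshake cost.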

Security Best Practices

Practice	Description
Encrypt everything	Use TLS for all network communication
Validate certificates	Never disable certificate verification in production
Use strong ciphers	TLS 1.3 with AES-256-GCM or ChaCha20-Poly1305
Firewall defaults	Default-deny; whitelist only necessary traffic
Least privilege	Run services as non-root; bind to specific interfaces
Input validation	Never trust data from the network
Rate limiting	Protect services from abuse and DoS
Logging	Log connection attempts, errors, and unusual patterns
Update regularly	Patch OS, libraries, and network device firmware
Segment networks	Use VLANs and firewalls to limit blast radius
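
In Python, the first three rows map onto a few lines with the standard `ssl` module. The sketch below builds a client context that keeps certificate and hostname verification on and refuses old protocol versions. The function name is hypothetical, and the TLS 1.2 floor is an assumption chosen for compatibility; raise it to `ssl.TLSVersion.TLSv1_3` where every peer supports it.

```python
import ssl


def make_client_context() -> ssl.SSLContext:
    """Client-side TLS context with verification on and a protocol floor."""
    ctx = ssl.create_default_context()  # verifies certificates and hostnames by default
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2  # assumed floor; refuse anything older
    return ctx


ctx = make_client_context()
print(ctx.verify_mode == ssl.CERT_REQUIRED, ctx.check_hostname)
```

The context is then passed to `wrap_socket()` together with `server_hostname`, as in the check_tls function earlier in this chapter.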

Common Pitfalls and Solutions

“Address Already in Use”

# Problem: Restarting a server fails because the port is in TIME_WAIT
# Solution: Set SO_REUSEADDR before binding
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
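
For context, here is the option in a complete listener setup. The ordering matters: the option must be set before bind(), or it has no effect on that bind. The function name `make_listener` and the default values are assumptions for illustration; port 0 asks the OS to pick a free port.

```python
import socket


def make_listener(host: str = "127.0.0.1", port: int = 0, backlog: int = 128) -> socket.socket:
    """Create a listening TCP socket with SO_REUSEADDR set before bind()."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)  # must precede bind()
        s.bind((host, port))
        s.listen(backlog)
    except OSError:
        s.close()  # do not leak the descriptor if setup fails
        raise
    return s


listener = make_listener()
print("listening on", listener.getsockname())
listener.close()
```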

“Connection Refused”

Checklist:

Is the server running? (ss -tlnp | grep <port>)
Is it listening on the right interface? (0.0.0.0 vs 127.0.0.1)
Is a firewall blocking the port? (iptables -L -n)
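
From the client side, it helps to tell a refusal apart from a timeout, because the two point at different causes: a refusal means a host answered and rejected the connection, while a timeout means nothing answered at all. The sketch below separates the two by exception type; the function name and the returned labels are assumptions for illustration.

```python
import socket


def classify_connect(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt a TCP connection and report how it ended."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"    # host reachable, nothing listening (or an active reject)
    except socket.timeout:
        return "timed out"  # no answer: silent drop, wrong address, or host down
    except OSError as e:
        return f"error: {e}"  # e.g. name resolution failure or no route


# Demonstration against a local port that was just released, so nothing listens there.
probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
probe.bind(("127.0.0.1", 0))
closed_port = probe.getsockname()[1]
probe.close()
print(classify_connect("127.0.0.1", closed_port))
```

A result of "refused" sends you to the first two items of the checklist; "timed out" makes a firewall silently dropping packets the more likely suspect.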

“Connection Timed Out”