Chapter 4: Spam Prevention
The Spam Reality
Spam is the inevitable companion of any public comment system. Without effective prevention, your comment sections will fill with pharmaceutical ads, cryptocurrency scams, and SEO link farms. This chapter explores the multi-layered approach needed to keep your comments clean.
Understanding the Enemy
Automated Bot Spam:
- Scripts that submit forms automatically
- High volume, low sophistication
- Often generic, irrelevant content
- Usually includes links
Semi-Automated Spam:
- Humans solving CAPTCHAs
- Bots handling submission
- Higher quality than pure automation
- Harder to detect
Manual Spam:
- Human spammers posting individually
- Context-aware content
- May appear legitimate initially
- Most difficult to prevent
Coordinated Campaigns:
- Multiple accounts working together
- Gradual reputation building
- Sophisticated evasion tactics
- Often politically motivated
Spammer Motivations
Understanding why spammers target you helps design defenses:
- SEO value: Links from your site boost their rankings
- Direct traffic: Some visitors will click spam links
- Credential harvesting: Fake links lead to phishing sites
- Malware distribution: Links to malicious downloads
- Reputation damage: Competitors or trolls
- Cryptocurrency/scams: Pump-and-dump, fake investments
Defense in Depth
Effective spam prevention uses multiple layers. No single technique is sufficient.
Layer 1: Submission Barriers
Increase the cost of submitting comments:
Honeypot Fields:
Hidden form fields that humans don’t see but bots fill in. Any submission with these fields completed is rejected.
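A server-side honeypot check can be very small. The sketch below assumes the form is parsed into a dictionary and that the hidden trap field is named `website_url`; both are illustrative choices, not requirements.

```python
def is_honeypot_triggered(form: dict[str, str], trap_field: str = "website_url") -> bool:
    """Return True when the hidden trap field contains any non-whitespace text.

    `trap_field` is a hypothetical name; pick one that looks plausible to a bot
    but is hidden from people with CSS and excluded from the tab order.
    """
    return bool(form.get(trap_field, "").strip())


# A human leaves the hidden field empty; a naive bot fills every field it finds.
human = {"comment": "Thanks for the article", "website_url": ""}
bot = {"comment": "Cheap pills", "website_url": "http://spam.example"}
print(is_honeypot_triggered(human), is_honeypot_triggered(bot))
```

Browser autofill can occasionally populate hidden fields, so it may be safer to treat a filled honeypot as a very strong signal rather than absolute proof.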
Time-Based Validation:
- Minimum time between page load and submission
- Maximum time (an extremely long delay is also suspicious)
- Bots often submit instantly
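The timing rules above might be sketched as follows. The 3-second floor and 6-hour ceiling are assumed values for illustration, and `rendered_at` is assumed to come from something the server issued (for example a signed token), not from a plain form field the client can edit.

```python
MIN_SECONDS = 3.0            # assumed floor: faster than a person can read and type
MAX_SECONDS = 6 * 60 * 60    # assumed ceiling: a very stale form is suspicious


def classify_elapsed(rendered_at: float, submitted_at: float) -> str:
    """Classify the delay between page render and submission (Unix timestamps)."""
    elapsed = submitted_at - rendered_at
    if elapsed < MIN_SECONDS:
        return "too_fast"    # also covers negative values from forged timestamps
    if elapsed > MAX_SECONDS:
        return "too_slow"
    return "ok"
```

Both limits are worth tuning per site: long articles invite slow, thoughtful comments, so a ceiling that is too tight would penalize exactly the readers you want.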
JavaScript Requirements:
Require JavaScript execution to submit. Many simple bots don’t execute JavaScript, although bots built on headless browsers do. The trade-off is that this excludes legitimate users who browse without JavaScript.
Form Token Rotation:
Generate unique tokens that expire. Prevents replay attacks and forces fresh page loads.
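One possible way to build expiring, single-use form tokens is an HMAC-signed timestamp plus a random nonce. The sketch below is an assumption about how this could be done, not a prescribed design: the one-hour lifetime, the token layout, and the in-memory `used_nonces` set are all illustrative.

```python
import hashlib
import hmac
import secrets
import time
from typing import Optional

SECRET_KEY = secrets.token_bytes(32)   # illustrative; load a stable key from config
TOKEN_TTL_SECONDS = 3600               # assumed lifetime of one hour


def _sign(payload: str) -> str:
    return hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()


def issue_token(now: Optional[float] = None) -> str:
    """Create a token of the form '<issued>.<nonce>.<signature>'."""
    issued = int(time.time() if now is None else now)
    payload = f"{issued}.{secrets.token_hex(8)}"
    return f"{payload}.{_sign(payload)}"


def verify_token(token: str, used_nonces: set[str], now: Optional[float] = None) -> bool:
    """Accept a token only if it is authentic, unexpired, and not seen before."""
    parts = token.split(".")
    if len(parts) != 3:
        return False
    issued_str, nonce, signature = parts
    expected = _sign(f"{issued_str}.{nonce}")
    # Constant-time comparison on bytes avoids leaking how many characters matched.
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return False
    current = time.time() if now is None else now
    if current - int(issued_str) > TOKEN_TTL_SECONDS:
        return False          # expired: forces a fresh page load
    if nonce in used_nonces:
        return False          # replayed: each token works once
    used_nonces.add(nonce)
    return True
```

In a multi-process deployment the set of used nonces would need to live in shared storage with its own expiry; the in-memory set here only keeps the example self-contained.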
Layer 2: Content Analysis
Examine what’s being submitted:
Link Detection:
- Count links in comment
- Check link destinations
- Flag shortened URLs
- Block known spam domains
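The link checks listed above could be combined into a single helper. This is a minimal sketch: the shortener list and the blocked domain are placeholder values, and a real deployment would load maintained lists.

```python
import re
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://[^\s<>]+", re.IGNORECASE)
SHORTENERS = {"bit.ly", "tinyurl.com", "t.co"}   # illustrative sample only
BLOCKED_DOMAINS = {"spam.example"}               # placeholder blocklist


def _host(url: str) -> str:
    """Extract the lowercase hostname, tolerating malformed URLs."""
    try:
        return (urlparse(url).hostname or "").rstrip(".")
    except ValueError:
        return ""


def _matches(host: str, domains: set[str]) -> bool:
    """True if host equals a listed domain or is a subdomain of one."""
    return any(host == d or host.endswith("." + d) for d in domains)


def analyze_links(text: str) -> dict:
    hosts = [_host(u) for u in URL_RE.findall(text)]
    return {
        "link_count": len(hosts),
        "has_shortener": any(_matches(h, SHORTENERS) for h in hosts),
        "has_blocked_domain": any(_matches(h, BLOCKED_DOMAINS) for h in hosts),
    }
```

Checking where a shortened URL actually leads would require following the redirect, which is slower and is usually better done asynchronously after the comment has been held.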
Keyword Filtering:
- Block known spam keywords
- Pharmaceutical terms
- Adult content terms
- Common scam phrases
Pattern Matching:
- Excessive capitalization
- Repetitive characters
- Known spam patterns
- Suspicious formatting
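Two of these patterns, excessive capitalization and repeated characters, are easy to check mechanically. The thresholds below (70% uppercase, at least 10 letters, runs of 5 identical characters) are assumptions chosen for illustration.

```python
import re

# The same non-space character five or more times in a row, e.g. "!!!!!" or "ooooo".
REPEAT_RE = re.compile(r"(\S)\1{4,}")


def caps_ratio(text: str) -> float:
    """Fraction of alphabetic characters that are uppercase."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c.isupper() for c in letters) / len(letters)


def pattern_flags(text: str, caps_threshold: float = 0.7, min_letters: int = 10) -> set[str]:
    """Return the names of the suspicious patterns found in the text."""
    flags = set()
    letter_count = sum(c.isalpha() for c in text)
    # Very short comments ("OK", "LOL") are exempt from the caps check.
    if letter_count >= min_letters and caps_ratio(text) > caps_threshold:
        flags.add("excessive_caps")
    if REPEAT_RE.search(text):
        flags.add("repeated_chars")
    return flags
```

These flags work better as inputs to a score than as hard rejections, since enthusiastic but legitimate commenters trip them too.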
Language Analysis:
- Does comment relate to post content?
- Grammatical anomalies common in spam
- Generic praise (“Great post!”)
- Template-like structure
Layer 3: Behavioral Analysis
Examine how submissions happen:
Rate Limiting:
- Limit comments per IP per time period
- Limit comments per user session
- Limit across the entire site
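A sliding-window limiter is one common way to implement these limits. The sketch below keeps timestamps in memory and takes the current time as a parameter so it is easy to test; the class name and the choice of key (IP address, session ID, or a site-wide constant) are illustrative.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_events` per key within the last `window_seconds`."""

    def __init__(self, max_events: int, window_seconds: float) -> None:
        self.max_events = max_events
        self.window_seconds = window_seconds
        self._events: dict[str, deque] = defaultdict(deque)

    def allow(self, key: str, now: float) -> bool:
        timestamps = self._events[key]
        # Drop timestamps that have aged out of the window.
        while timestamps and now - timestamps[0] >= self.window_seconds:
            timestamps.popleft()
        if len(timestamps) >= self.max_events:
            return False
        timestamps.append(now)
        return True


# Example: at most 2 comments per IP per minute.
per_ip = SlidingWindowLimiter(max_events=2, window_seconds=60)
print([per_ip.allow("203.0.113.5", t) for t in (0, 10, 20, 61)])
```

The same class can serve all three limits in the list by using different keys and thresholds, for example a single `"site"` key for the site-wide cap.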
Velocity Checks:
- Unusual comment frequency
- Geographic impossibility (same user posting from different continents within minutes)
- Suspicious timing patterns
Browser Fingerprinting:
- Consistent fingerprint across submissions
- Detect browser automation
- Note: Privacy implications
Mouse/Keyboard Patterns:
- Bots often don’t generate natural interaction events
- Track engagement before submission
- Note: Can be spoofed
Layer 4: Reputation Systems
Build trust over time:
IP Reputation:
- Track spam history by IP
- Use external IP reputation services
- Note: Shared IPs (VPNs, offices) complicate this
Email Reputation:
- Disposable email detection
- Domain age and reputation
- Previous behavior from email
User Trust Scores:
- New users require approval
- Trust increases with good behavior
- Trust decreases with flags/spam
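A trust score along these lines can start as a simple counter. In this sketch the approval threshold of 3 and the heavier penalty for flags are assumptions, and `UserTrust` is a hypothetical name.

```python
from dataclasses import dataclass

APPROVAL_THRESHOLD = 3   # assumed: approved comments needed before auto-posting


@dataclass
class UserTrust:
    score: int = 0

    def record_approved(self) -> None:
        """Good behavior raises trust slowly."""
        self.score += 1

    def record_flagged(self) -> None:
        """A flag or confirmed spam lowers trust faster than approvals raise it."""
        self.score -= 2

    @property
    def needs_approval(self) -> bool:
        return self.score < APPROVAL_THRESHOLD
```

Making penalties larger than rewards means an account cannot cheaply alternate between good and bad comments, which matters for the gradual reputation building described earlier in this chapter.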
Layer 5: External Services
Leverage specialized services:
CAPTCHA Systems:
- reCAPTCHA, hCaptcha, Turnstile
- Adds friction but is effective against basic bots
- Accessibility concerns
- Privacy considerations
Spam Detection APIs:
- Akismet (WordPress ecosystem)
- CleanTalk
- Stop Forum Spam
- Pricing varies: some are paid or paid for commercial use, while Stop Forum Spam is free
Email Verification Services:
- Check if email is valid
- Detect disposable emails
- Risk scoring
CAPTCHA Considerations
Types of CAPTCHA
Traditional Image CAPTCHA:
- Distorted text to read
- Accessibility nightmare
- Increasingly solved by AI
- Poor user experience
Image Selection:
- “Select all images with X”
- Somewhat better accessibility than distorted text, but still a barrier for users with visual impairments
- Still solved by services
- Moderate user friction
Invisible/Risk-Based:
- Analyzes behavior, shows challenge only if suspicious
- Best user experience
- Requires third-party service
- Privacy concerns
Proof of Work:
- Browser computes mathematical problem
- No user interaction
- Increases submission cost
- Energy/battery concerns
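A hashcash-style scheme is one way to realize proof of work: the client searches for a nonce whose SHA-256 hash, combined with a server-issued challenge, starts with a given number of zero bits. In a real deployment the search would run in the browser; the sketch below does both sides in Python only to stay self-contained, and the `challenge:nonce` format is an assumption.

```python
import hashlib


def has_leading_zero_bits(digest: bytes, bits: int) -> bool:
    """True if the first `bits` bits of the digest are all zero."""
    value = int.from_bytes(digest, "big")
    return value >> (len(digest) * 8 - bits) == 0


def _digest(challenge: str, nonce: int) -> bytes:
    return hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()


def solve(challenge: str, bits: int) -> int:
    """Client side: brute-force the smallest nonce that meets the difficulty."""
    nonce = 0
    while not has_leading_zero_bits(_digest(challenge, nonce), bits):
        nonce += 1
    return nonce


def verify(challenge: str, nonce: int, bits: int) -> bool:
    """Server side: a single hash confirms the client did the work."""
    return has_leading_zero_bits(_digest(challenge, nonce), bits)


nonce = solve("comment-form-42", bits=12)   # about 4,096 hashes on average
print(verify("comment-form-42", nonce, bits=12))
```

Each extra bit of difficulty doubles the expected client work while verification stays at one hash, which is what makes the cost asymmetric. The challenge should be unique per form load so that solutions cannot be reused.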
CAPTCHA Trade-offs
Pros:
- Effective against basic bots
- Well-understood by users
- Easy to implement (third-party)
- Adjustable difficulty
Cons:
- Annoying for legitimate users
- Accessibility challenges
- Privacy concerns (tracking)
- Solved by human farms
- Doesn’t stop manual spam
CAPTCHA Recommendations
- If you do use a CAPTCHA, start with an invisible/risk-based one
- Only show challenges when suspicious
- Have accessible alternatives
- Don’t rely solely on CAPTCHA
- Consider CAPTCHA-free alternatives first
Building Your Spam Score
Combine multiple signals into a spam probability score:
Signal Weighting
Assign points to various signals:
- Contains links: +20 points
- New user: +15 points
- Failed honeypot: +100 points (near-certain spam)
- Contains spam keywords: +30 points
- Submitted too quickly: +25 points
- From known spam IP: +50 points
Threshold Actions
Based on total score:
- 0-30: Auto-approve
- 31-70: Hold for moderation
- 71+: Auto-reject (or shadow-ban)
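The weights and thresholds above translate directly into code. The signal names in this sketch are made-up identifiers for the signals in the list; the point values and score bands are the ones given in the text.

```python
SIGNAL_WEIGHTS = {
    "contains_links": 20,
    "new_user": 15,
    "failed_honeypot": 100,
    "spam_keywords": 30,
    "submitted_too_quickly": 25,
    "known_spam_ip": 50,
}


def spam_score(signals: set[str]) -> int:
    """Sum the weights of every signal that fired; unknown names raise KeyError."""
    return sum(SIGNAL_WEIGHTS[name] for name in signals)


def action_for(score: int) -> str:
    """Map a score onto the three bands: 0-30, 31-70, 71+."""
    if score <= 30:
        return "approve"
    if score <= 70:
        return "moderate"
    return "reject"


# A new user posting a link scores 20 + 15 = 35 and is held for moderation.
print(action_for(spam_score({"contains_links", "new_user"})))
```

Keeping the weights in one table makes later tuning a data change instead of a code change.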
Tuning Over Time
- Track false positives (legitimate comments blocked)
- Track false negatives (spam that got through)
- Adjust weights based on your data
- Different sites need different tuning
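To make tuning concrete, the two error types can be measured from moderator decisions. This sketch assumes each reviewed comment is recorded as a pair `(flagged_as_spam, actually_spam)`; that record shape is an illustrative choice.

```python
def error_rates(outcomes: list[tuple[bool, bool]]) -> dict[str, float]:
    """Compute error rates from (flagged_as_spam, actually_spam) pairs.

    false_positive_rate: share of legitimate comments that were flagged.
    false_negative_rate: share of spam comments that were not flagged.
    """
    legit_total = sum(1 for _, is_spam in outcomes if not is_spam)
    spam_total = len(outcomes) - legit_total
    false_pos = sum(1 for flagged, is_spam in outcomes if flagged and not is_spam)
    false_neg = sum(1 for flagged, is_spam in outcomes if not flagged and is_spam)
    return {
        "false_positive_rate": false_pos / legit_total if legit_total else 0.0,
        "false_negative_rate": false_neg / spam_total if spam_total else 0.0,
    }
```

Tracking both numbers over time shows whether a weight change traded one kind of error for the other, which a single accuracy figure would hide.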
Shadow Banning
A technique where spammers don’t know they’re blocked:
How it works:
- Spam is accepted normally (from spammer’s view)
- Spammer sees their comment on the page
- No one else sees it
- Spammer thinks they’re successful
Advantages:
- Spammer doesn’t adapt tactics
- No feedback loop for them
- Reduces return attempts
- Satisfying (arguably)
Disadvantages:
- Ethical concerns
- Can catch legitimate users
- Requires per-user view logic
- Complexity in implementation
Recommendation:
Use sparingly and review shadow-banned submissions periodically.
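The per-user view logic mentioned under the disadvantages can be sketched as a filter applied when rendering a thread. `Comment` and its fields are hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Comment:
    author_id: str
    body: str
    shadow_banned: bool = False


def visible_comments(comments: list[Comment], viewer_id: Optional[str]) -> list[Comment]:
    """Hide shadow-banned comments from everyone except their own author.

    `viewer_id` is None for anonymous visitors, who therefore never see them.
    """
    return [c for c in comments if not c.shadow_banned or c.author_id == viewer_id]
```

Because the result depends on who is viewing, pages containing shadow-banned comments cannot be served from a shared cache without extra care, which is part of the implementation complexity noted above.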
Handling False Positives
Legitimate comments caught as spam:
Prevention
- Err toward approval for borderline cases
- Clear feedback when comments are held
- Quick moderation turnaround
- Whitelist known good users
Recovery
- Easy appeal process
- Notification when approved
- Apology for inconvenience
- Adjust rules that caused false positive
Spam Prevention Without CAPTCHA
It’s possible to have effective spam prevention without user-facing challenges:
The Invisible Approach
- Honeypot fields
- Time validation
- JavaScript token
- Content analysis
- Rate limiting
- Behavior analysis
- Manual moderation for flagged items
This combination catches most automated spam while providing a friction-free experience for legitimate users.
Cost Considerations
Free Options
- Honeypots
- Time validation
- Basic keyword filtering
- Rate limiting
- Manual moderation
Low-Cost Options
- Basic CAPTCHA services (free tiers)
- Simple IP reputation checks
- Disposable email detection
- Open-source spam detection
Paid Services
- Advanced CAPTCHA (reCAPTCHA Enterprise)
- Spam detection APIs
- IP reputation services
- Machine learning solutions
For most small sites, free options combined with manual moderation are sufficient.
Spam Prevention Checklist
Summary
Effective spam prevention requires:
- Multiple layers: No single technique is sufficient
- Balance: Don’t sacrifice UX for security
- Adaptation: Spammers evolve, and so must your defenses
- Monitoring: Track what’s working and what’s not
- Moderation: Human review remains important
Start with the free, invisible techniques (honeypots, time validation, content analysis). Add visible challenges only if needed. Remember that the goal is minimizing spam while maximizing legitimate participation.
The next chapter covers moderation systems: what happens after a comment passes your spam filters.