Chapter 3: Database Design and Data Modeling

Choosing the right database and designing an efficient schema are crucial for a comment system that performs well and scales with your needs.

Database Options for Comments

SQL Databases

PostgreSQL is the gold standard for comment systems: it provides UUIDs, JSONB columns, full-text search, triggers, and partial indexes, all of which the examples in this chapter rely on.

SQLite works well for smaller sites: a single-file, zero-maintenance database that comfortably handles modest comment volumes on a single server.

MySQL/MariaDB remains popular: widely hosted and well understood, and perfectly capable of handling comment workloads, though the examples in this chapter use PostgreSQL syntax.

NoSQL Options

MongoDB offers flexibility: comments can be stored as documents, optionally with replies embedded, at the cost of harder cross-collection queries for moderation and reporting.

Redis works best as a caching layer: keep hot threads and counters in memory in front of the primary database rather than using it as the system of record.

Firestore/Firebase suits serverless setups: real-time listeners push new comments to clients without a backend of your own, with vendor lock-in and per-read pricing as the trade-offs.

Managed Solutions

Service         Free Tier   Best For
Supabase        500MB       PostgreSQL + Auth
PlanetScale     5GB         MySQL, branching
Turso           8GB         Edge SQLite
Neon            512MB       Serverless PostgreSQL
MongoDB Atlas   512MB       Document storage

Core Data Model

Comments Table Schema

CREATE TABLE comments (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    page_id VARCHAR(255) NOT NULL,
    parent_id UUID REFERENCES comments(id),
    
    -- Author information
    author_name VARCHAR(100) NOT NULL,
    author_email VARCHAR(255),
    author_url VARCHAR(500),
    author_avatar VARCHAR(500),
    
    -- Content
    content TEXT NOT NULL,
    content_html TEXT,  -- Pre-rendered HTML
    
    -- Metadata
    status VARCHAR(20) DEFAULT 'pending',
    ip_address INET,
    user_agent TEXT,
    
    -- Timestamps
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    published_at TIMESTAMP WITH TIME ZONE,
    
    -- Constraints
    CONSTRAINT valid_status CHECK (status IN ('pending', 'approved', 'spam', 'deleted'))
);

CREATE INDEX idx_comments_page ON comments(page_id);
CREATE INDEX idx_comments_parent ON comments(parent_id);
CREATE INDEX idx_comments_status ON comments(status);
CREATE INDEX idx_comments_created ON comments(created_at DESC);
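
To make the schema concrete, here is a sketch of inserting a top-level comment and a reply that references it; the page ID and author values are illustrative:

WITH parent AS (
    INSERT INTO comments (page_id, author_name, content, status)
    VALUES ('/posts/hello-world', 'Ada', 'Great post!', 'approved')
    RETURNING id
)
-- The reply reuses the generated UUID of the parent comment
INSERT INTO comments (page_id, parent_id, author_name, content, status)
SELECT '/posts/hello-world', parent.id, 'Grace', 'Agreed.', 'approved'
FROM parent;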

Supporting Tables

Pages Table - Track pages with comments:

CREATE TABLE pages (
    id VARCHAR(255) PRIMARY KEY,  -- URL path or slug
    url VARCHAR(500) NOT NULL,
    title VARCHAR(255),
    comment_count INTEGER DEFAULT 0,
    last_comment_at TIMESTAMP WITH TIME ZONE,
    settings JSONB DEFAULT '{}',
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);
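
Pages are usually created lazily, the first time a comment arrives for a given URL; a minimal upsert sketch (the URL and title values are placeholders):

-- Create the page row if it does not exist yet, refreshing the title if it does
INSERT INTO pages (id, url, title)
VALUES ('/posts/hello-world', 'https://example.com/posts/hello-world', 'Hello World')
ON CONFLICT (id) DO UPDATE
SET title = EXCLUDED.title;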

Users Table - For authenticated comments:

CREATE TABLE users (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    provider VARCHAR(50) NOT NULL,  -- github, google, email
    provider_id VARCHAR(255) NOT NULL,
    email VARCHAR(255),
    name VARCHAR(100),
    avatar_url VARCHAR(500),
    role VARCHAR(20) DEFAULT 'commenter',
    banned_at TIMESTAMP WITH TIME ZONE,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    
    UNIQUE(provider, provider_id)
);

Reactions Table - Likes, upvotes, etc.:

CREATE TABLE reactions (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    comment_id UUID REFERENCES comments(id) ON DELETE CASCADE,
    user_identifier VARCHAR(255) NOT NULL,  -- User ID or hash
    reaction_type VARCHAR(20) NOT NULL,  -- like, heart, helpful
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    
    UNIQUE(comment_id, user_identifier, reaction_type)
);
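
The common read against this table is an aggregate of reactions per comment; a sketch against the schema above (the page ID is a placeholder):

-- Reaction totals per comment for one page, grouped by reaction type
SELECT r.comment_id, r.reaction_type, COUNT(*) AS total
FROM reactions r
JOIN comments c ON c.id = r.comment_id
WHERE c.page_id = '/posts/hello-world'
GROUP BY r.comment_id, r.reaction_type;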

Threading Models

Flat Comments

Simple structure, comments displayed chronologically:

Comment 1
Comment 2
Comment 3
Comment 4

Implementation: parent_id is always NULL.
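
Fetching a flat thread is then a single ordered scan; a sketch against the schema above:

-- Flat listing: approved comments for a page, oldest first
SELECT id, author_name, content, created_at
FROM comments
WHERE page_id = '/posts/hello-world'
  AND status = 'approved'
ORDER BY created_at ASC;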

Two-Level Threading

Common in blogs - replies grouped under parent:

Comment 1
├── Reply 1.1
├── Reply 1.2
└── Reply 1.3
Comment 2
└── Reply 2.1

Implementation: parent_id references top-level comment only.
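
One way to load a two-level thread in a single query is to order rows by the timestamp of their top-level comment, so each parent is immediately followed by its replies; a sketch:

-- Parents and their direct replies, grouped under the parent's timestamp
SELECT c.*
FROM comments c
LEFT JOIN comments p ON p.id = c.parent_id
WHERE c.page_id = '/posts/hello-world'
  AND c.status = 'approved'
ORDER BY COALESCE(p.created_at, c.created_at), c.created_at;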

Deep Threading (Reddit-style)

Unlimited nesting levels:

Comment 1
├── Reply 1.1
│   └── Reply 1.1.1
│       └── Reply 1.1.1.1
└── Reply 1.2

Implementation: Use recursive queries or materialized path.
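
With unbounded nesting, a recursive CTE walks the tree from the top-level comments downwards and tracks the depth of each row; a sketch:

-- Full thread tree for one page, with nesting depth per comment
WITH RECURSIVE thread AS (
    SELECT c.*, 0 AS depth
    FROM comments c
    WHERE c.page_id = '/posts/hello-world'
      AND c.parent_id IS NULL
      AND c.status = 'approved'
    UNION ALL
    SELECT c.*, t.depth + 1
    FROM comments c
    JOIN thread t ON c.parent_id = t.id
    WHERE c.status = 'approved'
)
SELECT * FROM thread;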

Materialized Path Pattern

Store the path to each comment for efficient queries:

ALTER TABLE comments ADD COLUMN path VARCHAR(1000);

-- Example paths:
-- Comment 1: '001'
-- Reply 1.1: '001.001'
-- Reply 1.1.1: '001.001.001'

Query all descendants:

SELECT * FROM comments 
WHERE path LIKE '001.%'
ORDER BY path;
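
The path of a new reply is its parent's path plus a zero-padded sibling index; a sketch where $1 is the new comment's id and $2 its parent's id, with three-digit padding to match the example paths above (no protection against concurrent inserts under the same parent):

-- Derive the materialized path for a freshly inserted reply
UPDATE comments
SET path = (
    SELECT p.path || '.' ||
           LPAD((SELECT COUNT(*) FROM comments s WHERE s.parent_id = $2)::TEXT, 3, '0')
    FROM comments p
    WHERE p.id = $2
)
WHERE id = $1;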

Closure Table Pattern

For complex hierarchies:

CREATE TABLE comment_closure (
    ancestor_id UUID REFERENCES comments(id),
    descendant_id UUID REFERENCES comments(id),
    depth INTEGER NOT NULL,
    PRIMARY KEY (ancestor_id, descendant_id)
);
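
Each new comment adds one self-row at depth 0 plus one row per ancestor, and a subtree is then a single join. A sketch, where $1 is the new comment's id and $2 its parent's id:

-- Self-reference for the new comment
INSERT INTO comment_closure (ancestor_id, descendant_id, depth)
VALUES ($1, $1, 0);

-- Copy every ancestor of the parent, one level deeper
INSERT INTO comment_closure (ancestor_id, descendant_id, depth)
SELECT ancestor_id, $1, depth + 1
FROM comment_closure
WHERE descendant_id = $2;

-- All descendants of a comment, nearest first
SELECT c.*
FROM comment_closure cc
JOIN comments c ON c.id = cc.descendant_id
WHERE cc.ancestor_id = $1
  AND cc.depth > 0
ORDER BY cc.depth;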

Performance Optimizations

Denormalization Strategies

Comment counts - Maintain counters:

-- Trigger to update page comment count
CREATE OR REPLACE FUNCTION update_comment_count()
RETURNS TRIGGER AS $$
BEGIN
    IF TG_OP = 'INSERT' AND NEW.status = 'approved' THEN
        UPDATE pages SET comment_count = comment_count + 1 
        WHERE id = NEW.page_id;
    ELSIF TG_OP = 'UPDATE' THEN
        IF OLD.status != 'approved' AND NEW.status = 'approved' THEN
            UPDATE pages SET comment_count = comment_count + 1 
            WHERE id = NEW.page_id;
        ELSIF OLD.status = 'approved' AND NEW.status != 'approved' THEN
            UPDATE pages SET comment_count = comment_count - 1 
            WHERE id = NEW.page_id;
        END IF;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
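
The function has no effect until it is attached to the table; a minimal trigger definition (the trigger name is illustrative, PostgreSQL 11+ syntax):

CREATE TRIGGER trg_update_comment_count
AFTER INSERT OR UPDATE OF status ON comments
FOR EACH ROW
EXECUTE FUNCTION update_comment_count();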

Reply counts - Track per comment:

ALTER TABLE comments ADD COLUMN reply_count INTEGER DEFAULT 0;
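
The new column can be backfilled once from existing data and then kept current by the application or a trigger; a backfill sketch:

-- One-off backfill of reply_count from approved direct replies
UPDATE comments c
SET reply_count = (
    SELECT COUNT(*)
    FROM comments r
    WHERE r.parent_id = c.id
      AND r.status = 'approved'
);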

Indexing Strategies

Essential indexes:

-- Composite index for common query
CREATE INDEX idx_comments_page_status_created 
ON comments(page_id, status, created_at DESC);

-- Partial index for approved comments only
CREATE INDEX idx_comments_approved 
ON comments(page_id, created_at DESC) 
WHERE status = 'approved';

-- Full-text search index
CREATE INDEX idx_comments_content_search 
ON comments USING gin(to_tsvector('english', content));

Query Optimization

Efficient comment fetching:

-- Get comments with reply counts
SELECT 
    c.*,
    COUNT(r.id) AS reply_count
FROM comments c
LEFT JOIN comments r ON r.parent_id = c.id AND r.status = 'approved'
WHERE c.page_id = $1 
    AND c.status = 'approved'
    AND c.parent_id IS NULL
GROUP BY c.id
ORDER BY c.created_at DESC
LIMIT 50;

Data Integrity

Foreign Key Considerations

Handle deletions gracefully:

-- Soft delete pattern
ALTER TABLE comments 
ADD COLUMN deleted_at TIMESTAMP WITH TIME ZONE;

-- Archive instead of delete
CREATE TABLE comments_archive (LIKE comments INCLUDING ALL);
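
Archiving can be done atomically with a data-modifying CTE; a sketch that moves comments soft-deleted more than 90 days ago (the retention window is illustrative, and subtrees must be archived together so the parent_id foreign key is not violated):

-- Move old soft-deleted comments into the archive in one statement
WITH moved AS (
    DELETE FROM comments
    WHERE deleted_at < NOW() - INTERVAL '90 days'
    RETURNING *
)
INSERT INTO comments_archive
SELECT * FROM moved;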

Audit Trail

Track all changes:

CREATE TABLE comment_history (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    comment_id UUID NOT NULL,
    action VARCHAR(20) NOT NULL,  -- created, edited, moderated
    old_data JSONB,
    new_data JSONB,
    actor_id UUID,  -- Moderator or system
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);
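
A history row is written whenever a comment is created, edited, or moderated; a minimal application-level sketch, where the $-parameters are supplied by the caller:

-- Snapshot the state before and after a change
INSERT INTO comment_history (comment_id, action, old_data, new_data, actor_id)
VALUES (
    $1,            -- comment being changed
    'edited',
    $2::jsonb,     -- previous state, serialized by the application
    $3::jsonb,     -- new state
    $4             -- moderator or user performing the change
);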

Migration Strategies

From Disqus

Disqus provides XML export:

  1. Export comments from Disqus admin
  2. Parse XML file
  3. Map thread URLs to page IDs
  4. Import maintaining parent relationships

From WordPress

Direct database access:

-- WordPress comments structure
SELECT 
    comment_ID,
    comment_post_ID,
    comment_parent,
    comment_author,
    comment_author_email,
    comment_content,
    comment_date
FROM wp_comments
WHERE comment_approved = '1';
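
Assuming the exported rows have been copied into a staging table in the target PostgreSQL database (called wp_comments_import here, which is an assumption), mapping them onto the new schema is a single INSERT ... SELECT; parent relationships need a second pass because WordPress uses integer IDs while the new schema generates UUIDs:

-- Map WordPress columns onto the comments schema
INSERT INTO comments (page_id, author_name, author_email, content, status, created_at)
SELECT
    'wp-post-' || comment_post_id,   -- derive a page identifier from the post ID
    comment_author,
    comment_author_email,
    comment_content,
    'approved',
    comment_date
FROM wp_comments_import;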

From JSON Files

For static site generators:

import json
import psycopg2

def import_comments(json_file, db_conn):
    with open(json_file) as f:
        comments = json.load(f)
    
    cursor = db_conn.cursor()
    for comment in comments:
        cursor.execute("""
            INSERT INTO comments (page_id, author_name, content, created_at)
            VALUES (%s, %s, %s, %s)
        """, (comment['page'], comment['author'], 
              comment['text'], comment['date']))
    
    db_conn.commit()

Chapter Summary

Aspect       Recommendation
Database     PostgreSQL for features, SQLite for simplicity
Threading    Two-level for blogs, materialized path for forums
Indexes      Composite on (page_id, status, created_at)
Counts       Denormalize with triggers
Deletions    Soft delete with archive

In the next chapter, we’ll implement the backend API using these schemas.

