Chapter 11: Maximum Likelihood Estimation and Model Selection

We have seen the log-likelihood for individual models in earlier chapters. This chapter pulls together the full framework for likelihood-based inference: parameter estimation, uncertainty quantification, and model comparison. The tools here apply uniformly across all the point process families in this book.


11.1 The Universal Log-Likelihood

For any simple point process with conditional intensity λ*(t; θ), the log-likelihood of observing events t₁ < ... < tₙ on [0, T] is:

ℓ(θ) = Σᵢ₌₁ⁿ log λ*(tᵢ; θ) − ∫₀ᵀ λ*(t; θ) dt

Derivation sketch: Partition [0, T] into bins of width δ. The probability of the observed event pattern is approximately:

∏ᵢ [λ*(tᵢ)δ] · ∏_{bins without events} [1 − λ*(t)δ]
≈ δⁿ · ∏ᵢ λ*(tᵢ) · exp(−∫₀ᵀ λ*(t) dt)

Taking logs and dropping the δ-dependent term n·log δ (constant in θ) gives the formula.
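As a concrete check, for an HPP with constant rate λ the formula reduces to ℓ(λ) = n log λ − λT, which is maximized at λ̂ = n/T. A minimal numerical sketch (the event times are illustrative):

```python
import numpy as np

def hpp_loglik(lam, events, T):
    """HPP log-likelihood: sum of log-intensities minus the compensator lam * T."""
    n = len(events)
    return n * np.log(lam) - lam * T

# 5 events on [0, 10]: the MLE should be n / T = 0.5
events = [1.2, 3.4, 5.0, 7.7, 9.1]
T = 10.0
grid = np.linspace(0.01, 2.0, 2000)
lam_hat = grid[np.argmax([hpp_loglik(l, events, T) for l in grid])]
print(round(lam_hat, 2))  # 0.5
```

Note that the maximizer depends only on n and T, not on where the events fall — a peculiarity of the homogeneous model.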


11.2 Computing the Integral

The compensator ∫₀ᵀ λ*(t) dt must be evaluated at every optimizer step, so a closed form (when one exists) matters greatly for speed and stability. Its form depends on the model:

Model                  Compensator ∫₀ᵀ λ*(t) dt
---------------------  ------------------------------------------------
HPP(λ)                 λ · T
NHPP(λ(t))             ∫₀ᵀ λ(t) dt (numerical if no closed form)
Hawkes (exp kernel)    μT + (α/β) Σᵢ [1 − exp(−β(T − tᵢ))]
Renewal                no simple closed form; use numerical integration

For the HPP and Hawkes (exponential kernel), the compensator is analytic, making optimization fast and stable.
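The Hawkes exponential-kernel compensator above takes only a few lines to evaluate. A sketch (the function name and parameter names are ours):

```python
import numpy as np

def hawkes_exp_compensator(mu, alpha, beta, events, T):
    """Closed-form Hawkes compensator with exponential kernel:
    mu*T + (alpha/beta) * sum_i [1 - exp(-beta*(T - t_i))]."""
    t = np.asarray(events)
    return mu * T + (alpha / beta) * np.sum(1.0 - np.exp(-beta * (T - t)))

# With alpha = 0 the process is an HPP and the compensator is mu * T:
print(hawkes_exp_compensator(0.5, 0.0, 1.0, [1.0, 2.0], 10.0))  # 5.0
```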


11.3 Practical Optimization

The negative log-likelihood −ℓ(θ) is minimized using scipy.optimize.minimize:

import numpy as np
from scipy.optimize import minimize

def fit_model(neg_loglik_fn, x0, bounds):
    best = None
    for _ in range(10):     # multiple restarts
        x0_perturbed = x0 * np.exp(np.random.randn(len(x0)) * 0.5)
        result = minimize(neg_loglik_fn, x0_perturbed,
                          method='L-BFGS-B', bounds=bounds,
                          options={'maxiter': 1000, 'ftol': 1e-12})
        if best is None or result.fun < best.fun:
            best = result
    return best
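As a usage sketch, here is fit_model applied to the HPP, whose negative log-likelihood λT − n log λ we write inline (fit_model is repeated from above so the snippet is self-contained):

```python
import numpy as np
from scipy.optimize import minimize

def fit_model(neg_loglik_fn, x0, bounds):
    """Minimize a negative log-likelihood with multiplicative random restarts."""
    best = None
    for _ in range(10):
        x0_perturbed = x0 * np.exp(np.random.randn(len(x0)) * 0.5)
        result = minimize(neg_loglik_fn, x0_perturbed,
                          method='L-BFGS-B', bounds=bounds)
        if best is None or result.fun < best.fun:
            best = result
    return best

events = np.array([1.2, 3.4, 5.0, 7.7, 9.1]); T = 10.0
neg_ll = lambda th: th[0] * T - len(events) * np.log(th[0])
best = fit_model(neg_ll, np.array([1.0]), [(1e-6, None)])
print(best.x[0])  # ≈ n / T = 0.5
```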

Key practices:

- Use multiple random restarts to avoid poor local optima; perturb the starting point multiplicatively (as above) so positive parameters stay positive.
- Supply bounds that keep parameters in their valid range (e.g. μ, α, β > 0).
- Minimize the negative log-likelihood with a quasi-Newton method such as L-BFGS-B.
- Check result.success and compare the restart solutions before trusting a fit.

11.4 Model Comparison: AIC and BIC

Given two fitted models with k₁ and k₂ parameters and maximized log-likelihoods ℓ̂₁ and ℓ̂₂:

AIC = −2ℓ̂ + 2k
BIC = −2ℓ̂ + k · log(n)

Lower AIC/BIC is better. AIC targets predictive accuracy; BIC penalizes parameters more heavily and is consistent: if the true model is among the candidates, it is selected with probability tending to 1 as n → ∞.
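Both criteria are one-liners to compute; a minimal sketch:

```python
import numpy as np

def aic(loglik, k):
    """Akaike information criterion: -2*loglik + 2*k."""
    return -2.0 * loglik + 2.0 * k

def bic(loglik, k, n):
    """Bayesian information criterion: -2*loglik + k*log(n)."""
    return -2.0 * loglik + k * np.log(n)

print(round(aic(-312.4, 1), 1))  # 626.8
```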

Example comparison:

Model               Parameters k   log-likelihood ℓ̂   AIC     BIC
------------------  -----------  ------------------  ------  ------
HPP                 1            −312.4              626.8   628.9
NHPP (sinusoidal)   3            −287.1              580.2   586.4
Hawkes              3            −261.8              529.6   535.8

The Hawkes model wins decisively here, justifying the extra complexity.


11.5 Likelihood Ratio Test

For nested models (model 0 is a special case of model 1 with k₁ − k₀ = d extra parameters), the likelihood ratio test (LRT) gives a formal hypothesis test:

H₀: restricted model (k₀ parameters)
H₁: full model (k₁ parameters)

LRT statistic: W = 2(ℓ̂₁ − ℓ̂₀) ~ χ²(d) under H₀

Example: Testing whether a Hawkes process reduces to a Poisson process (H₀: α = 0, which removes both α and β, so d = 2):

from scipy.stats import chi2
W = 2 * (loglik_hawkes - loglik_hpp)
p_value = chi2.sf(W, df=2)     # survival function: more accurate than 1 - cdf
print(f"W = {W:.2f}, p = {p_value:.4f}")

One caveat: α = 0 lies on the boundary of the parameter space, so the χ²(2) reference distribution is only approximate here (and tends to be conservative).

11.6 Confidence Intervals

Wald confidence intervals (using the observed Fisher information):

SE(θ̂ⱼ) ≈ sqrt([I_obs⁻¹]ⱼⱼ)
CI: θ̂ⱼ ± 1.96 · SE(θ̂ⱼ)

where I_obs = −∇²ℓ(θ̂) is the negative Hessian.
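For the HPP the Wald interval is fully explicit: ℓ(λ) = n log λ − λT gives I_obs = n/λ̂², hence SE(λ̂) = λ̂/√n. A sketch (the helper name is ours):

```python
import numpy as np

def hpp_wald_ci(n, T, z=1.96):
    """95% Wald CI for the HPP rate: lam_hat = n/T,
    SE = lam_hat/sqrt(n) from the observed information n / lam_hat**2."""
    lam_hat = n / T
    se = lam_hat / np.sqrt(n)
    return lam_hat - z * se, lam_hat + z * se

lo, hi = hpp_wald_ci(n=100, T=200.0)
print(round(lo, 3), round(hi, 3))  # 0.402 0.598
```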

Profile likelihood intervals are more accurate near boundaries and for non-linear parameters like n* = α/β:

ℓ_profile(n*) = max_{μ, β} ℓ(μ, n*·β, β)
CI: {n* : 2(ℓ̂ − ℓ_profile(n*)) ≤ 3.84}

Profile likelihood CIs are preferred for the branching ratio because the Wald approximation can be poor near n* = 0 or n* = 1.
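The profile construction can be sketched end to end. The snippet below implements the exponential-kernel Hawkes log-likelihood with the standard O(n) recursion and profiles out (μ, β) on a grid of n* values; the event list is small synthetic data for illustration only, and hawkes_loglik / profile_loglik are our own helper names:

```python
import numpy as np
from scipy.optimize import minimize

def hawkes_loglik(mu, alpha, beta, events, T):
    """Exponential-kernel Hawkes log-likelihood, using the recursion
    A_i = exp(-beta*(t_i - t_{i-1})) * (1 + A_{i-1})."""
    t = np.asarray(events)
    A, ll = 0.0, 0.0
    for i in range(len(t)):
        if i > 0:
            A = np.exp(-beta * (t[i] - t[i - 1])) * (1.0 + A)
        ll += np.log(mu + alpha * A)
    ll -= mu * T + (alpha / beta) * np.sum(1.0 - np.exp(-beta * (T - t)))
    return ll

def profile_loglik(nstar, events, T):
    """Maximize over the nuisance parameters (mu, beta), holding alpha = nstar*beta."""
    obj = lambda th: -hawkes_loglik(th[0], nstar * th[1], th[1], events, T)
    res = minimize(obj, x0=[0.5, 1.0], method='L-BFGS-B',
                   bounds=[(1e-6, None), (1e-6, None)])
    return -res.fun

events = [0.5, 1.1, 1.3, 1.4, 2.8, 3.0, 3.1, 4.5, 6.2, 6.3, 6.4, 8.0]
T = 10.0
grid = np.linspace(0.0, 0.9, 19)
prof = np.array([profile_loglik(ns, events, T) for ns in grid])
ci = grid[2 * (prof.max() - prof) <= 3.84]   # 95% profile CI for n*
print(f"n* CI: [{ci.min():.2f}, {ci.max():.2f}]")
```

A finer grid (or a root-finder on 2(ℓ̂ − ℓ_profile(n*)) − 3.84) gives sharper endpoints; the grid version keeps the idea visible.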


11.7 Parametric Bootstrap

For small samples where asymptotic theory may not apply, use the parametric bootstrap:

  1. Fit the model to data → θ̂
  2. Simulate B datasets from the fitted model
  3. Re-fit the model to each simulated dataset → θ̂₁*, ..., θ̂_B*
  4. Use the empirical distribution of θ̂* to compute standard errors and confidence intervals
In code:

import numpy as np

B = 500
bootstrap_estimates = []
for _ in range(B):
    sim_events = simulate_hawkes(mu_hat, alpha_hat, beta_hat, T)
    result = fit_hawkes(sim_events, T)
    bootstrap_estimates.append(result.x)
bootstrap_estimates = np.array(bootstrap_estimates)
SE_bootstrap = bootstrap_estimates.std(axis=0)
CI_bootstrap = np.percentile(bootstrap_estimates, [2.5, 97.5], axis=0)  # 95% percentile CIs

See code/11_mle_model_selection.py for the full pipeline.


11.8 Key Takeaways

- The log-likelihood ℓ(θ) = Σᵢ log λ*(tᵢ; θ) − ∫₀ᵀ λ*(t; θ) dt applies to every simple point process; only the conditional intensity changes.
- Use a closed-form compensator when one exists (HPP, exponential-kernel Hawkes); otherwise integrate numerically.
- Optimize the negative log-likelihood with bounds and multiple restarts; fits can be sensitive to the starting point.
- Compare non-nested models with AIC/BIC; test nested models with the likelihood ratio test.
- Use Wald intervals for well-behaved parameters, and profile likelihood or parametric bootstrap intervals near boundaries or in small samples.