Chapter 11: Maximum Likelihood Estimation and Model Selection

We have seen the log-likelihood for individual models in earlier chapters. This chapter pulls together the full framework for likelihood-based inference: parameter estimation, uncertainty quantification, and model comparison. The tools here apply uniformly across all the point process families in this book.


11.1 The Universal Log-Likelihood

For any simple point process with conditional intensity λ*(t; θ), the log-likelihood of observing events t₁ < ... < tₙ on [0, T] is:

ℓ(θ) = Σᵢ₌₁ⁿ log λ*(tᵢ; θ) − ∫₀ᵀ λ*(t; θ) dt

Derivation sketch: Partition [0, T] into bins of width δ. The probability of the observed event pattern is approximately:

∏ᵢ [λ*(tᵢ)δ] · ∏_{bins without events} [1 − λ*(t)δ]
≈ δⁿ · ∏ᵢ λ*(tᵢ) · exp(−∫₀ᵀ λ*(t) dt)

Taking logs and dropping the δ-dependent term n·log δ (constant in θ) gives the formula.
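As a concrete check, for an HPP with constant rate λ the formula reduces to ℓ(λ) = n log λ − λT, which is maximized at λ̂ = n/T. A minimal numerical sketch (the event times are illustrative):

```python
import numpy as np

def hpp_loglik(lam, events, T):
    """HPP log-likelihood: sum of log-intensities minus the compensator lam * T."""
    n = len(events)
    return n * np.log(lam) - lam * T

# 5 events on [0, 10]: the MLE should be n / T = 0.5
events = [1.2, 3.4, 5.0, 7.7, 9.1]
T = 10.0
grid = np.linspace(0.01, 2.0, 2000)
lam_hat = grid[np.argmax([hpp_loglik(l, events, T) for l in grid])]
print(round(lam_hat, 2))  # 0.5
```

Note that the maximizer depends only on n and T, not on where the events fall — a peculiarity of the homogeneous model.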


11.2 Computing the Integral

The compensator ∫₀ᵀ λ*(t) dt must be evaluated at every optimizer step, so a closed form (when one exists) matters greatly for speed and stability. Its form depends on the model:

Model                  Compensator ∫₀ᵀ λ*(t) dt
---------------------  ------------------------------------------------
HPP(λ)                 λ · T
NHPP(λ(t))             ∫₀ᵀ λ(t) dt (numerical if no closed form)
Hawkes (exp kernel)    μT + (α/β) Σᵢ [1 − exp(−β(T − tᵢ))]
Renewal                no simple closed form; use numerical integration

For the HPP and Hawkes (exponential kernel), the compensator is analytic, making optimization fast and stable.
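The Hawkes exponential-kernel compensator above takes only a few lines to evaluate. A sketch (the function name and parameter names are ours):

```python
import numpy as np

def hawkes_exp_compensator(mu, alpha, beta, events, T):
    """Closed-form Hawkes compensator with exponential kernel:
    mu*T + (alpha/beta) * sum_i [1 - exp(-beta*(T - t_i))]."""
    t = np.asarray(events)
    return mu * T + (alpha / beta) * np.sum(1.0 - np.exp(-beta * (T - t)))

# With alpha = 0 the process is an HPP and the compensator is mu * T:
print(hawkes_exp_compensator(0.5, 0.0, 1.0, [1.0, 2.0], 10.0))  # 5.0
```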


11.3 Practical Optimization

The negative log-likelihood −ℓ(θ) is minimized using scipy.optimize.minimize:

import numpy as np
from scipy.optimize import minimize

def fit_model(neg_loglik_fn, x0, bounds):
    best = None
    for _ in range(10):     # multiple restarts
        x0_perturbed = x0 * np.exp(np.random.randn(len(x0)) * 0.5)
        result = minimize(neg_loglik_fn, x0_perturbed,
                          method='L-BFGS-B', bounds=bounds,
                          options={'maxiter': 1000, 'ftol': 1e-12})
        if best is None or result.fun < best.fun:
            best = result
    return best
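As a usage sketch, here is fit_model applied to the HPP, whose negative log-likelihood λT − n log λ we write inline (fit_model is repeated from above so the snippet is self-contained):

```python
import numpy as np
from scipy.optimize import minimize

def fit_model(neg_loglik_fn, x0, bounds):
    """Minimize a negative log-likelihood with multiplicative random restarts."""
    best = None
    for _ in range(10):
        x0_perturbed = x0 * np.exp(np.random.randn(len(x0)) * 0.5)
        result = minimize(neg_loglik_fn, x0_perturbed,
                          method='L-BFGS-B', bounds=bounds)
        if best is None or result.fun < best.fun:
            best = result
    return best

events = np.array([1.2, 3.4, 5.0, 7.7, 9.1]); T = 10.0
neg_ll = lambda th: th[0] * T - len(events) * np.log(th[0])
best = fit_model(neg_ll, np.array([1.0]), [(1e-6, None)])
print(best.x[0])  # ≈ n / T = 0.5
```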

Key practices:

- Use multiple random restarts to avoid poor local optima; perturb the starting point multiplicatively (as above) so positive parameters stay positive.
- Supply bounds that keep parameters in their valid range (e.g. μ, α, β > 0).
- Minimize the negative log-likelihood with a quasi-Newton method such as L-BFGS-B.
- Check result.success and compare the restart solutions before trusting a fit.

11.4 Model Comparison: AIC and BIC

Given two fitted models with k₁ and k₂ parameters and maximized log-likelihoods ℓ̂₁ and ℓ̂₂:

AIC = −2ℓ̂ + 2k
BIC = −2ℓ̂ + k · log(n)

Lower AIC/BIC is better. AIC targets predictive accuracy; BIC penalizes parameters more heavily and is consistent: if the true model is among the candidates, it is selected with probability tending to 1 as n → ∞.
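Both criteria are one-liners to compute; a minimal sketch:

```python
import numpy as np

def aic(loglik, k):
    """Akaike information criterion: -2*loglik + 2*k."""
    return -2.0 * loglik + 2.0 * k

def bic(loglik, k, n):
    """Bayesian information criterion: -2*loglik + k*log(n)."""
    return -2.0 * loglik + k * np.log(n)

print(round(aic(-312.4, 1), 1))  # 626.8
```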

Example comparison:

Model               Parameters k   log-likelihood ℓ̂   AIC     BIC
------------------  -----------  ------------------  ------  ------
HPP                 1            −312.4              626.8   628.9
NHPP (sinusoidal)   3            −287.1              580.2   586.4
Hawkes              3            −261.8              529.6   535.8

The Hawkes model wins decisively here, justifying the extra complexity.


11.5 Likelihood Ratio Test

For nested models (model 0 is a special case of model 1 with k₁ − k₀ = d extra parameters), the likelihood ratio test (LRT) gives a formal hypothesis test:

H₀: restricted model (k₀ parameters)
H₁: full model (k₁ parameters)

LRT statistic: W = 2(ℓ̂₁ − ℓ̂₀) ~ χ²(d) under H₀

Example: Testing whether a Hawkes process reduces to a Poisson process (H₀: α = 0, which removes both α and β, so d = 2):

from scipy.stats import chi2
W = 2 * (loglik_hawkes - loglik_hpp)
p_value = chi2.sf(W, df=2)     # survival function: more accurate than 1 - cdf
print(f"W = {W:.2f}, p = {p_value:.4f}")

One caveat: α = 0 lies on the boundary of the parameter space, so the χ²(2) reference distribution is only approximate here (and tends to be conservative).

11.6 Confidence Intervals

Wald confidence intervals (using the observed Fisher information):

SE(θ̂ⱼ) ≈ sqrt([I_obs⁻¹]ⱼⱼ)
CI: θ̂ⱼ ± 1.96 · SE(θ̂ⱼ)

where I_obs = −∇²ℓ(θ̂) is the negative Hessian.
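For the HPP the Wald interval is fully explicit: ℓ(λ) = n log λ − λT gives I_obs = n/λ̂², hence SE(λ̂) = λ̂/√n. A sketch (the helper name is ours):

```python
import numpy as np

def hpp_wald_ci(n, T, z=1.96):
    """95% Wald CI for the HPP rate: lam_hat = n/T,
    SE = lam_hat/sqrt(n) from the observed information n / lam_hat**2."""
    lam_hat = n / T
    se = lam_hat / np.sqrt(n)
    return lam_hat - z * se, lam_hat + z * se

lo, hi = hpp_wald_ci(n=100, T=200.0)
print(round(lo, 3), round(hi, 3))  # 0.402 0.598
```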

Profile likelihood intervals are more accurate near boundaries and for non-linear parameters like n* = α/β:

ℓ_profile(n*) = max_{μ, β} ℓ(μ, n*·β, β)
CI: {n* : 2(ℓ̂ − ℓ_profile(n*)) ≤ 3.84}

Profile likelihood CIs are preferred for the branching ratio because the Wald approximation can be poor near n* = 0 or n* = 1.
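The profile construction can be sketched end to end. The snippet below implements the exponential-kernel Hawkes log-likelihood with the standard O(n) recursion and profiles out (μ, β) on a grid of n* values; the event list is small synthetic data for illustration only, and hawkes_loglik / profile_loglik are our own helper names:

```python
import numpy as np
from scipy.optimize import minimize

def hawkes_loglik(mu, alpha, beta, events, T):
    """Exponential-kernel Hawkes log-likelihood, using the recursion
    A_i = exp(-beta*(t_i - t_{i-1})) * (1 + A_{i-1})."""
    t = np.asarray(events)
    A, ll = 0.0, 0.0
    for i in range(len(t)):
        if i > 0:
            A = np.exp(-beta * (t[i] - t[i - 1])) * (1.0 + A)
        ll += np.log(mu + alpha * A)
    ll -= mu * T + (alpha / beta) * np.sum(1.0 - np.exp(-beta * (T - t)))
    return ll

def profile_loglik(nstar, events, T):
    """Maximize over the nuisance parameters (mu, beta), holding alpha = nstar*beta."""
    obj = lambda th: -hawkes_loglik(th[0], nstar * th[1], th[1], events, T)
    res = minimize(obj, x0=[0.5, 1.0], method='L-BFGS-B',
                   bounds=[(1e-6, None), (1e-6, None)])
    return -res.fun

events = [0.5, 1.1, 1.3, 1.4, 2.8, 3.0, 3.1, 4.5, 6.2, 6.3, 6.4, 8.0]
T = 10.0
grid = np.linspace(0.0, 0.9, 19)
prof = np.array([profile_loglik(ns, events, T) for ns in grid])
ci = grid[2 * (prof.max() - prof) <= 3.84]   # 95% profile CI for n*
print(f"n* CI: [{ci.min():.2f}, {ci.max():.2f}]")
```

A finer grid (or a root-finder on 2(ℓ̂ − ℓ_profile(n*)) − 3.84) gives sharper endpoints; the grid version keeps the idea visible.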


11.7 Parametric Bootstrap

For small samples where asymptotic theory may not apply, use the parametric bootstrap:

  1. Fit the model to data → θ̂
  2. Simulate B datasets from the fitted model
  3. Re-fit the model to each simulated dataset → θ̂₁*, ..., θ̂_B*
  4. Use the empirical distribution of θ̂* to compute standard errors and confidence intervals
In code:

import numpy as np

B = 500
bootstrap_estimates = []
for _ in range(B):
    sim_events = simulate_hawkes(mu_hat, alpha_hat, beta_hat, T)
    result = fit_hawkes(sim_events, T)
    bootstrap_estimates.append(result.x)
bootstrap_estimates = np.array(bootstrap_estimates)
SE_bootstrap = bootstrap_estimates.std(axis=0)
CI_bootstrap = np.percentile(bootstrap_estimates, [2.5, 97.5], axis=0)  # 95% percentile CIs

See code/11_mle_model_selection.py for the full pipeline.


11.8 Key Takeaways

- The log-likelihood ℓ(θ) = Σᵢ log λ*(tᵢ; θ) − ∫₀ᵀ λ*(t; θ) dt applies to every simple point process; only the conditional intensity changes.
- Use a closed-form compensator when one exists (HPP, exponential-kernel Hawkes); otherwise integrate numerically.
- Optimize the negative log-likelihood with bounds and multiple restarts; fits can be sensitive to the starting point.
- Compare non-nested models with AIC/BIC; test nested models with the likelihood ratio test.
- Use Wald intervals for well-behaved parameters, and profile likelihood or parametric bootstrap intervals near boundaries or in small samples.