Frequentist uncertainty vs. Bayesian uncertainty analysis#
Mark Bakker, TU Delft & Raoul Collenteur, Hydroconsult, February, 2025 (last updated May 2026)
In this notebook, the fit and uncertainty are compared for Pastas models solved with least squares (frequentist uncertainty) and with MCMC (Bayesian uncertainty). Besides Pastas the following Python Packages have to be installed to run this notebook:
import corner
import emcee
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pastas as ps
ps.set_log_level("ERROR")
ps.show_versions()
1. A ‘regular’ Pastas Model#
The first step is to create a Pastas Model with a linear RechargeModel and a Gamma response function to simulate the effect of precipitation and evaporation on the heads. The AR1 noise model is used. We first estimate the model parameters using the standard least-squares approach.
head = pd.read_csv(
"data/B32C0639001.csv", parse_dates=["date"], index_col="date"
).squeeze()
head = head["1990":] # use data from 1990 on for this example
evap = pd.read_csv("data/evap_260.csv", index_col=0, parse_dates=[0]).squeeze()
rain = pd.read_csv("data/rain_260.csv", index_col=0, parse_dates=[0]).squeeze()
ml1 = ps.Model(head)
ml1.add_noisemodel(ps.ArNoiseModel())
rm = ps.RechargeModel(
rain, evap, recharge=ps.rch.Linear(), rfunc=ps.Gamma(), name="rch"
)
ml1.add_stressmodel(rm)
ml1.solve()
ax = ml1.plots.results(figsize=(8, 4))
The diagnostics show that the noise meets the statistical requirements for uncertainty analysis reasonably well.
ml1.plots.diagnostics();
ml1.parameters
The estimated least squares parameters and standard errors are stored for later reference
ls_params = ml1.parameters[["optimal", "stderr"]].copy()
ls_params.rename(columns={"optimal": "ls_opt", "stderr": "ls_sig"}, inplace=True)
ls_params
# Compute prediction interval Pastas
pi = ml1.solver.prediction_interval(n=1000)
ax = ml1.plot(figsize=(10, 3))
ax.fill_between(pi.index, pi.iloc[:, 0], pi.iloc[:, 1], color="lightgray")
ax.legend(["Observations", "Simulation", "95% Prediction interval"], ncol=3, loc=2)
pi_pasta = np.mean(pi[0.975] - pi[0.025])
print(f"Mean prediction interval width: {pi_pasta:.3f} m")
print(f"Prediction interval coverage probability: {ps.stats.picp(head, pi): .3f}")
2. Use the EmceeSolver#
We will now use MCMC to estimate the model parameters and their uncertainties. The EmceeSolve solver wraps the Emcee package, which implements different versions of MCMC. A good understanding of Emcee helps when using this solver, so it comes recommended to check out their documentation as well.
We start by making a pastas model with a linear recharge model and a Gamma response function. No noise model is added, as this is taken care of in the likelihood function. The model is solved using the regular solve (least squares) to have a good estimate of the starting values of the parameters.
ml2 = ps.Model(head)
rm = ps.RechargeModel(
rain, evap, recharge=ps.rch.Linear(), rfunc=ps.Gamma(), name="rch"
)
ml2.add_stressmodel(rm)
ml2.parameters
To set up the EmceeSolve solver, a number of decisions need to be made:
Select a (log) likelihood function
Select the priors of the parameters
Select the number of steps and thinning
2a. Create the solver instance#
The next step is to create an instance of the EmceeSolve solver class. At this stage all the settings need to be provided on how the Ensemble Sampler is created (https://emcee.readthedocs.io/en/stable/user/sampler/). Important settings are the nwalkers, the moves, the objfunction. More advanced options are to parallelize the MCMC algorithm (parallel=True), and to set a backend to store the results. Here’s an example:
# Choose the objective function
ln_prob = ps.likelihood.GaussianLikelihoodAr1()
# Create the EmceeSolver with some settings
s = ps.solver.Emcee(
nwalkers=20,
moves=emcee.moves.DEMove(),
objfunction=ln_prob,
progress_bar=True,
parallel=False,
)
ml2.add_solver(s)
ml2.parameters
In the above code we created an EmceeSolve instance with 20 walkers, which take steps according to the DEMove move algorithm (see Emcee docs), and a Gaussian likelihood function that assumes AR1 correlated errors. Different objective functions are available, see the Pastas documentation on the different options.
Depending on the likelihood function, a number of additional parameters need to be inferred. These parameters are added to the Pastas Model instance. Using the set_parameter method of the model, these parameters can be changed. In this example where we use the GaussianLikelihoodAr1 function, the \(\sigma^2\) and \(\phi\) are estimated; the unknown standard deviation of the errors and the autoregressive parameter.
2b. Priors#
The first step is to select the priors of the parameters. This is done by using the ml.set_parameter method and the dist argument (from distribution). Any distribution from scipy.stats can be chosen (https://docs.scipy.org/doc/scipy/tutorial/stats/continuous.html), for example uniform, norm, or lognorm. Here, we select normal distributions for the priors.
Currently, pastas will use the initial value of the parameter for the loc argument of the distribution (e.g., the mean of a normal distribution), and the sigma as the scale argument (e.g., the standard deviation of a normal distribution). Only for the parameters with a uniform distribution, the pmin and pmax values are used to determine a uniform prior.
# Set the initial parameters to optimal values from least squares and a normal distribution
# for a good starting point
for name in ml1.parameters.index.intersection(ml2.parameters.index):
ml2.set_parameter(name, initial=ml1.parameters.at[name, "optimal"])
ml2.set_parameter(name, dist="norm")
ml2.set_parameter(name, sigma=2 * ml1.parameters.at[name, "stderr"])
ml2.parameters
2c. Run the solver and solve the model#
After setting the parameters and creating a EmceeSolve solver instance we are now ready to run the MCMC analysis. We can do this by running ml.solve. We can pass the same parameters that we normally provide to this method (e.g., tmin or fit_constant). Here we use the initial parameters from our least-square solve, and do not fit a noise model, because we take autocorrelated errors into account through the likelihood function.
All the arguments that are not used by ml.solve, for example steps and tune, are passed on to the run_mcmc method from the sampler (see Emcee docs). The most important is the steps argument, that determines how many steps each of the walkers takes.
# Use the solver to run MCMC
ml2.solve(
solver=s,
initial=False,
tmin="1990",
steps=1000,
tune=True,
report=False,
)
3. Posterior parameter distributions#
The results from the MCMC analysis are stored in the sampler object, accessible through ml.solver.sampler variable. The object ml.solver.sampler.flatchain contains a Pandas DataFrame with \(n\) the parameter samples, where \(n\) is calculated as follows:
\(n = \frac{\left(\text{steps}-\text{burn}\right)\cdot\text{nwalkers}}{\text{thin}} \)
Corner.py#
Corner is a simple but great python package that makes creating corner graphs easy. A couple of lines of code suffice to create a plot of the parameter distributions and the covariances between the parameters.
# Corner plot of the results
fig = plt.figure(figsize=(8, 8))
labels = list(ml2.parameters.index[ml2.parameters.vary])
labels = [label.split("_")[1] for label in labels]
best = list(ml2.parameters[ml2.parameters.vary].optimal)
axes = corner.corner(
ml2.solver.sampler.get_chain(flat=True, discard=500),
quantiles=[0.025, 0.5, 0.975],
labelpad=0.1,
show_titles=True,
title_kwargs=dict(fontsize=10),
label_kwargs=dict(fontsize=10),
max_n_ticks=3,
fig=fig,
labels=labels,
truths=best,
)
plt.show()
4. The trace shows when MCMC converges#
The walkers take steps in different directions for each step. It is expected that after a number of steps, the direction of the step becomes random, as a sign that an optimum has been found. This can be checked by looking at the autocorrelation, which should be insignificant after a number of steps. Below we just show how to obtain the different chains, the interpretation of which is outside the scope of this notebook.
fig, axes = plt.subplots(len(labels), figsize=(10, 7), sharex=True)
samples = ml2.solver.sampler.get_chain(flat=True)
for i in range(len(labels)):
ax = axes[i]
ax.plot(samples[:, i], "k", alpha=0.5)
ax.set_xlim(0, len(samples))
ax.set_ylabel(labels[i])
ax.yaxis.set_label_coords(-0.1, 0.5)
axes[-1].set_xlabel("step number")
mcn_params = pd.DataFrame(index=ls_params.index, columns=["mcn_opt", "mcn_sig"])
params = ml2.solver.sampler.get_chain(
flat=True, discard=500
) # discard first 500 of every chain
for iparam in range(params.shape[1] - 1):
mcn_params.iloc[iparam] = np.median(params[:, iparam]), np.std(params[:, iparam])
mean_time_diff = head.index.to_series().diff().mean().total_seconds() / 86400
# Translate phi into the value of alpha also used by the noisemodel
mcn_params.loc["noise_alpha", "mcn_opt"] = -mean_time_diff / np.log(
np.median(params[:, -1])
)
mcn_params.loc["noise_alpha", "mcn_sig"] = -mean_time_diff / np.log(
np.std(params[:, -1])
)
pd.concat((ls_params, mcn_params), axis=1)
Repeat with uniform priors#
Set more or less uninformative uniform priors. Now also include \(\sigma^2\).
ml3 = ps.Model(head)
rm = ps.RechargeModel(
rain, evap, recharge=ps.rch.Linear(), rfunc=ps.Gamma(), name="rch"
)
ml3.add_stressmodel(rm)
Uniform prior selected from 0.25 till 4 times the optimal values
# Choose the objective function
ln_prob = ps.likelihood.GaussianLikelihoodAr1()
# Create the EmceeSolver with some settings
s = ps.solver.Emcee(
nwalkers=20,
moves=emcee.moves.DEMove(),
objfunction=ln_prob,
progress_bar=True,
parallel=False,
)
ml3.add_solver(s)
for name in ml1.parameters.index.intersection(ml3.parameters.index):
opt = ml1.parameters.loc[name, "optimal"]
if not np.isnan(opt):
sign = np.sign(opt)
ml3.set_parameter(
name,
initial=sign * opt,
dist="uniform",
pmin=sign * 0.25 * opt,
pmax=sign * 4 * opt,
sigma=ml1.parameters.loc[name, "stderr"] ** 2,
)
ml3.set_parameter(
"solver_var",
initial=0.05**2,
pmin=0.05**2 / 4,
pmax=4 * 0.05**2,
vary=True,
dist="uniform",
)
ml3.set_parameter("solver_phi", dist="uniform")
ml3.parameters
# Use the solver to run MCMC
ml3.solve(
initial=False,
tmin="1990",
steps=1000,
tune=True,
report=False,
)
# Corner plot of the results
fig = plt.figure(figsize=(8, 8))
labels = list(ml3.parameters.index[ml3.parameters.vary])
labels = [label.split("_")[1] for label in labels]
best = list(ml3.parameters[ml3.parameters.vary].optimal)
axes = corner.corner(
ml3.solver.sampler.get_chain(flat=True, discard=500),
quantiles=[0.025, 0.5, 0.975],
labelpad=0.1,
show_titles=True,
title_kwargs=dict(fontsize=10),
label_kwargs=dict(fontsize=10),
max_n_ticks=3,
fig=fig,
labels=labels,
truths=best,
)
plt.show()
fig, axes = plt.subplots(len(labels), figsize=(10, 7), sharex=True)
samples = ml3.solver.sampler.get_chain(flat=True)
for i in range(len(labels)):
ax = axes[i]
ax.plot(samples[:, i], "k", alpha=0.5)
ax.set_xlim(0, len(samples))
ax.set_ylabel(labels[i])
ax.yaxis.set_label_coords(-0.1, 0.5)
axes[-1].set_xlabel("step number")
mcu_params = pd.DataFrame(index=ls_params.index, columns=["mcu_opt", "mcu_sig"])
params = ml3.solver.sampler.get_chain(
flat=True, discard=500
) # discard first 500 of every chain
for iparam in range(params.shape[1] - 1):
mcu_params.iloc[iparam] = np.median(params[:, iparam]), np.std(params[:, iparam])
mean_time_diff = head.index.to_series().diff().mean().total_seconds() / 86400
mcu_params.loc["noise_alpha", "mcu_opt"] = -mean_time_diff / np.log(
np.median(params[:, -1])
)
mcu_params.loc["noise_alpha", "mcu_sig"] = -mean_time_diff / np.log(
np.std(params[:, -1])
)
pd.concat((ls_params, mcn_params, mcu_params), axis=1)
5. Compute prediction interval#
nobs = len(head)
params = ml3.solver.sampler.get_chain(flat=True, discard=500)
sim = {}
# compute for 1000 random samples of chain
np.random.seed(1)
for i in np.random.choice(np.arange(10000), size=1000, replace=False):
h = ml3.simulate(p=params[i, :-2])
res = ml3.residuals(p=params[i, :-2])
h += np.random.normal(loc=0, scale=np.std(res), size=len(h))
sim[i] = h
simdf = pd.DataFrame.from_dict(sim, orient="columns", dtype=float)
alpha = 0.05
q = [alpha / 2, 1 - alpha / 2]
pi = simdf.quantile(q, axis=1).transpose()
pimean = np.mean(pi[0.975] - pi[0.025])
print(f"prediction interval emcee with uniform priors: {pimean:.3f} m")
print(f"PICP: {ps.stats.picp(head, pi):.3f}")
For this example, the prediction interval is dominated by the residuals not by the uncertainty of the parameters. In the code cell below, the parameter uncertainty is not included: the coverage only changes slightly and is mostly affected by the difference in randomly drawing residuals.
logprob = ml3.solver.sampler.compute_log_prob(
ml3.solver.sampler.get_chain(flat=True, discard=500)
)[0]
imax = np.argmax(logprob) # parameter set with larges likelihood
#
nobs = len(head)
params = ml3.solver.sampler.get_chain(flat=True, discard=500)
sim = {}
# compute for 1000 random samples of residuals, but one parameter set
h = ml3.simulate(p=params[imax, :-2])
res = ml3.residuals(p=params[imax, :-2])
np.random.seed(1)
for i in range(1000):
sim[i] = h + np.random.normal(loc=0, scale=np.std(res), size=len(h))
simdf = pd.DataFrame.from_dict(sim, orient="columns", dtype=float)
alpha = 0.05
q = [alpha / 2, 1 - alpha / 2]
pi = simdf.quantile(q, axis=1).transpose()
pimean = np.mean(pi[0.975] - pi[0.025])
print(f"prediction interval emcee with uniform priors: {pimean:.3f} m")
print(f"PICP: {ps.stats.picp(head, pi):.3f}")