MCMC uncertainty#

R.A. Collenteur, University of Graz, November 2019

In this notebook it is shown how the MCMC-algorithm can be used to estimate the model parameters for a Pastas model. Besides Pastas the following Python Packages have to be installed to run this notebook:

[1]:
import numpy as np
import pandas as pd

import pastas as ps
import corner
import emcee as mc

import matplotlib.pyplot as plt

ps.set_log_level("ERROR")
ps.show_versions(lmfit=True)
Python version: 3.10.12
NumPy version: 1.23.5
Pandas version: 2.0.3
SciPy version: 1.11.1
Matplotlib version: 3.7.2
Numba version: 0.57.1
LMfit version: 1.2.2
Latexify version: Not Installed
Pastas version: 1.1.0

1. Create a Pastas Model#

The first step is to create a Pastas Model object, including the RechargeModel to simulate the effect of precipitation and evaporation on the groundwater heads.

[2]:
# read observations and create the time series model
obs = pd.read_csv("data/head_nb1.csv", parse_dates=["date"], index_col="date").squeeze()
rain = pd.read_csv(
    "data/rain_nb1.csv", parse_dates=["date"], index_col="date"
).squeeze()
evap = pd.read_csv(
    "data/evap_nb1.csv", parse_dates=["date"], index_col="date"
).squeeze()

# Create the time series model
ml = ps.Model(obs, name="head")

sm = ps.RechargeModel(prec=rain, evap=evap, rfunc=ps.Exponential(), name="recharge")
ml.add_stressmodel(sm)
ml.solve(noise=True, report="basic")

ml.plot(figsize=(10, 3))
Fit report head                     Fit Statistics
==================================================
nfev    19                     EVP           92.91
nobs    644                    R2             0.93
noise   True                   RMSE           0.11
tmin    1985-11-14 00:00:00    AIC        -3257.53
tmax    2015-06-28 00:00:00    BIC        -3235.20
freq    D                      Obj            2.02
warmup  3650 days 00:00:00     ___
solver  LeastSquares           Interp.          No

Parameters (5 optimized)
==================================================
                optimal   stderr     initial  vary
recharge_A   686.246671   ±5.30%  215.674528  True
recharge_a   159.386048   ±5.01%   10.000000  True
recharge_f    -1.305359   ±4.04%   -1.000000  True
constant_d    27.920134   ±0.21%   27.900078  True
noise_alpha   49.911855  ±11.86%   15.000000  True
[2]:
<Axes: title={'center': 'Results of head'}, xlabel='date', ylabel='Groundwater levels [meter]'>
../_images/examples_uncertainty_emcee_3_2.png

2. Use the EMCEE Hammer#

Apart from the default solver (ps.LeastSquares()), Pastas also contains the option to use the LmFit package to estimate the parameters. This package wraps multiple optimization techniques, one of which is Emcee. The code bock below shows how to use this method to estimate the parameters of Pastas models.

Emcee takes a number of keyword arguments that determine how the optimization is done. The most important is the steps argument, that determines how many steps each of the walkers takes. The argument nwalkers can be used to set the number of walkers (default is 100). The burn argument determines how many samples from the start of the walkers are removed. The argument thin finally determines how many samples are accepted (1 in thin samples).

[3]:
ml.set_parameter("noise_alpha", vary=True)

ml.solve(
    tmin="2002",
    noise=True,
    initial=False,
    fit_constant=True,
    solver=ps.LmfitSolve(),
    method="emcee",
    nwalkers=10,
    steps=20,
    burn=2,
    thin=2,
    is_weighted=True,
    nan_policy="omit",
);
100%|██████████| 20/20 [00:01<00:00, 16.12it/s]
The chain is shorter than 50 times the integrated autocorrelation time for 5 parameter(s). Use this estimate with caution and run a longer chain!
N/50 = 0;
tau: [1.86471589 1.99831886 2.00288079 1.95886418 1.98561012]
Fit report head                    Fit Statistics
=================================================
nfev    200                    EVP          92.11
nobs    287                    R2            0.92
noise   True                   RMSE          0.11
tmin    2002-01-01 00:00:00    AIC       -1477.30
tmax    2015-06-28 00:00:00    BIC       -1459.00
freq    D                      Obj           1.61
warmup  3650 days 00:00:00     ___
solver  LmfitSolve             Interp.         No

Parameters (5 optimized)
=================================================
                optimal  stderr     initial  vary
recharge_A   686.240021  ±0.14%  686.246671  True
recharge_a   159.354362  ±0.18%  159.386048  True
recharge_f    -1.305439  ±0.23%   -1.305359  True
constant_d    27.919346  ±0.27%   27.920134  True
noise_alpha   49.909080  ±0.19%   49.911855  True

3. Visualize the results#

The results are stored in the result object, accessible through ml.fit.result. The object ml.fit.result.flatchain contains a Pandas DataFrame with \(n\) the parameter samples, whgere \(n\) is calculated as follows:

$n = \frac{\left(\text{steps}-\text{burn}\right)\cdot\text{nwalkers}}{\text{thin}} $

Corner.py#

Corner is a simple but great python package that makes creating corner graphs easy. One line of code suffices to create a plot of the parameter distributions and the covariances between the parameters.

[4]:
corner.corner(
    ml.fit.result.flatchain,
    truths=list(ml.parameters[ml.parameters.vary == True].optimal),
);
WARNING:root:Too few points to create valid contours
WARNING:root:Too few points to create valid contours
WARNING:root:Too few points to create valid contours
WARNING:root:Too few points to create valid contours
WARNING:root:Too few points to create valid contours
WARNING:root:Too few points to create valid contours
WARNING:root:Too few points to create valid contours
WARNING:root:Too few points to create valid contours
WARNING:root:Too few points to create valid contours
../_images/examples_uncertainty_emcee_7_1.png

4. What happens to the walkers at each step?#

The walkers take steps in different directions for each step. It is expected that after a number of steps, the direction of the step becomes random, as a sign that an optimum has been found. This can be checked by looking at the autocorrelation, which should be insignificant after a number of steps (NOT SURE HOW TO INTERPRET THIS YET). However it does not seem the case that the parameters converge to come optimum yet, even for the simple Linear model.

[5]:
labels = ml.fit.result.flatchain.columns

fig, axes = plt.subplots(labels.size, figsize=(10, 7), sharex=True)
samples = ml.fit.result.chain
for i in range(labels.size):
    ax = axes[i]
    ax.plot(samples[:, :, i], "k", alpha=0.3)
    ax.set_xlim(0, len(samples))
    ax.set_ylabel(labels[i])
    ax.yaxis.set_label_coords(-0.1, 0.5)

axes[-1].set_xlabel("step number");
../_images/examples_uncertainty_emcee_9_0.png

5. Plot some simulated time series to display uncertainty?#

[6]:
ax = ml.plot(figsize=(10, 3))

inds = np.random.randint(len(ml.fit.result.flatchain), size=100)
for ind in inds:
    params = ml.fit.result.flatchain.iloc[ind].values
    ml.simulate(params).plot(c="k", alpha=0.1, zorder=0)
../_images/examples_uncertainty_emcee_11_0.png