Testing MCMC on Pastas Models

R.A. Collenteur, University of Graz, November 2019

This notebook shows how the MCMC algorithm can be used to estimate the parameters of a Pastas model. Besides Pastas, the following Python packages have to be installed to run this notebook:
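
These can be installed with pip, for example (lmfit is also required, since ps.LmfitSolve used below wraps the LmFit package):

[ ]:
!pip install pastas emcee corner lmfit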

[1]:
import numpy as np
import pandas as pd

import pastas as ps
import corner
import emcee as mc

import matplotlib.pyplot as plt

ps.set_log_level("ERROR")

1. Create a Pastas Model

The first step is to create a Pastas Model object, including the RechargeModel to simulate the effect of precipitation and evaporation on the groundwater heads.

[2]:
# read observations and create the time series model
obs = pd.read_csv('../data/head_nb1.csv', parse_dates=['date'], index_col='date').squeeze()
rain = pd.read_csv('../data/rain_nb1.csv', parse_dates=['date'], index_col='date').squeeze()
evap = pd.read_csv('../data/evap_nb1.csv', parse_dates=['date'], index_col='date').squeeze()

# Create the time series model
ml = ps.Model(obs, name="head")

sm = ps.RechargeModel(prec=rain, evap=evap, rfunc=ps.Exponential, name='recharge')
ml.add_stressmodel(sm)
ml.solve(noise=True, report="basic")

ml.plot(figsize=(10,3))
Model Results head                  Fit Statistics
==================================================
nfev     21                     EVP          92.91
nobs     644                    R2            0.93
noise    True                   RMSE          0.11
tmin     1985-11-14 00:00:00    AIC           5.74
tmax     2015-06-28 00:00:00    BIC          28.07
freq     D                      Obj           2.02
warmup   3650 days 00:00:00     ___
solver   LeastSquares           ___

Parameters (5 were optimized)
==================================================
                optimal   stderr     initial  vary
recharge_A   686.244582   ±5.30%  215.674528  True
recharge_a   159.386045   ±5.01%   10.000000  True
recharge_f    -1.305369   ±4.04%   -1.000000  True
constant_d    27.920146   ±0.21%   27.900078  True
noise_alpha   49.911834  ±11.86%   15.000000  True

Parameter correlations |rho| > 0.5
==================================================
recharge_A recharge_a  0.82
recharge_f constant_d -0.98
[2]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fec9b877f50>
[Image: ../_images/examples_012_emcee_uncertainty.ipynb_3_2.png]

2. Use the EMCEE Hammer

Apart from the default solver (ps.LeastSquares), Pastas also contains the option to use the LmFit package to estimate the parameters. This package wraps multiple optimization techniques, one of which is Emcee. The code block below shows how to use this method to estimate the parameters of Pastas models.

Emcee takes a number of keyword arguments that determine how the optimization is done. The most important is the steps argument, which determines how many steps each of the walkers takes. The nwalkers argument can be used to set the number of walkers (default is 100). The burn argument determines how many samples from the start of each walker's chain are discarded. Finally, the thin argument determines how the chain is thinned: only 1 in every thin samples is kept.

[8]:
ml.set_parameter("noise_alpha", vary=True)
# ml.set_parameter("constant_d", vary=False)
ml.solve(tmin="2002", noise=True, initial=False, fit_constant=False,
         solver=ps.LmfitSolve, method="emcee", steps=200, burn=10,
         thin=5, is_weighted=True, nan_policy="omit");
100%|██████████| 200/200 [03:48<00:00,  1.14s/it]
The chain is shorter than 50 times the integrated autocorrelation time for 4 parameter(s). Use this estimate with caution and run a longer chain!
N/50 = 4;
tau: [20.39566676 21.85900797 16.31593659 23.78300956]
Model Results head                     Fit Statistics
=====================================================
nfev     20000                  EVP             76.23
nobs     287                    R2               0.76
noise    True                   RMSE             0.18
tmin     2002-01-01 00:00:00    AIC              5.44
tmax     2015-06-28 00:00:00    BIC             23.74
freq     D                      Obj              2.38
warmup   3650 days 00:00:00     ___
solver   LmfitSolve             ___

Parameters (4 were optimized)
=====================================================
                 optimal    stderr     initial   vary
recharge_A   1507.575822  ±102.79%  857.766358   True
recharge_a    493.655951   ±63.28%  263.710900   True
recharge_f     -1.065318   ±55.18%   -1.300288   True
constant_d     27.386438     ±nan%    0.000000  False
noise_alpha   176.294705  ±221.96%   49.911834   True

Parameter correlations |rho| > 0.5
=====================================================
Empty DataFrame
Columns: [rho]
Index: []

3. Visualize the results

The results are stored in the result object, accessible through ml.fit.result. The object ml.fit.result.flatchain contains a Pandas DataFrame with the \(n\) parameter samples, where \(n\) is calculated as follows:

$n = \frac{(\text{steps}-\text{burn})\cdot\text{nwalkers}}{\text{thin}} $
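
For the settings used above (steps=200, burn=10, thin=5, and the default nwalkers=100) a quick sanity check of this formula could look as follows; the printed value should match the length of ml.fit.result.flatchain:

[ ]:
# Sanity check of the sample-count formula (assumes the default nwalkers=100)
steps, burn, thin, nwalkers = 200, 10, 5, 100
n = (steps - burn) // thin * nwalkers
print(n)  # 3800, which should equal len(ml.fit.result.flatchain)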

Corner.py

Corner is a simple but great Python package that makes creating corner plots easy. One line of code suffices to create a plot of the parameter distributions and the correlations between the parameters.

[9]:
corner.corner(ml.fit.result.flatchain, truths=list(ml.parameters[ml.parameters.vary == True].optimal));
[Image: ../_images/examples_012_emcee_uncertainty.ipynb_7_0.png]

4. What happens to the walkers at each step?

The walkers take steps in different directions at each step. It is expected that after a number of steps, the direction of the steps becomes random, a sign that an optimum has been found. This can be checked by looking at the autocorrelation, which should become insignificant after a number of steps (how exactly to interpret this diagnostic is still an open question here). However, it does not seem that the parameters have converged to some optimum yet, even for the simple Linear recharge model.

[5]:
mc.autocorr.integrated_time(ml.fit.result.chain)
---------------------------------------------------------------------------
AutocorrError                             Traceback (most recent call last)
<ipython-input-5-55ad95715032> in <module>
----> 1 mc.autocorr.integrated_time(ml.fit.result.chain)

/Applications/anaconda3/envs/py37_pastas/lib/python3.7/site-packages/emcee/autocorr.py in integrated_time(x, c, tol, quiet)
    108         msg += "N/{0} = {1:.0f};\ntau: {2}".format(tol, n_t / tol, tau_est)
    109         if not quiet:
--> 110             raise AutocorrError(tau_est, msg)
    111         logging.warning(msg)
    112

AutocorrError: The chain is shorter than 50 times the integrated autocorrelation time for 4 parameter(s). Use this estimate with caution and run a longer chain!
N/50 = 1;
tau: [3.87630749 4.09444635 3.11912338 3.33374754]
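
The check can be silenced with emcee's quiet keyword, which logs the warning instead of raising; the returned estimates are still unreliable for such a short chain and should be used with caution:

[ ]:
# quiet=True returns the tau estimates despite the short chain
tau = mc.autocorr.integrated_time(ml.fit.result.chain, quiet=True)
print(tau)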
[10]:
labels = ml.fit.result.flatchain.columns

# Plot the trace of every walker for each of the optimized parameters
fig, axes = plt.subplots(labels.size, figsize=(10, 7), sharex=True)
samples = ml.fit.result.chain
for i in range(labels.size):
    ax = axes[i]
    ax.plot(samples[:, :, i], "k", alpha=0.3)
    ax.set_xlim(0, len(samples))
    ax.set_ylabel(labels[i])
    ax.yaxis.set_label_coords(-0.1, 0.5)

axes[-1].set_xlabel("step number");
[Image: ../_images/examples_012_emcee_uncertainty.ipynb_10_0.png]

5. Plot some simulated time series to display the uncertainty

[7]:
ax = ml.plot(figsize=(10,3))

# Draw 100 random parameter samples from the chain and plot their simulations
inds = np.random.randint(len(ml.fit.result.flatchain), size=100)
for ind in inds:
    params = ml.fit.result.flatchain.iloc[ind].values
    ml.simulate(params).plot(c="k", alpha=0.1)
plt.ylim(27,30)
[7]:
(27, 30)
[Image: ../_images/examples_012_emcee_uncertainty.ipynb_12_1.png]
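
Instead of plotting individual realizations, the same samples can also be summarized as a credible band. A minimal sketch, reusing the inds drawn above:

[ ]:
# Summarize the sampled simulations as a 95% band
sims = pd.concat([ml.simulate(ml.fit.result.flatchain.iloc[i].values) for i in inds], axis=1)
ax = ml.observations().plot(style=".", color="k", figsize=(10, 3), label="observations")
ax.fill_between(sims.index, sims.quantile(0.025, axis=1), sims.quantile(0.975, axis=1),
                color="gray", alpha=0.5, label="95% band")
ax.legend();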