R.A. Collenteur, University of Graz, November 2019
This notebook shows how the MCMC algorithm can be used to estimate the parameters of a Pastas model. Besides Pastas, the following Python packages have to be installed to run this notebook:
- emcee
- lmfit
- corner
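A quick way to check that these packages are available is to import them and print their versions (each of these packages exposes a __version__ attribute):

import corner
import emcee
import lmfit

print(emcee.__version__, lmfit.__version__, corner.__version__)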
[1]:
import numpy as np
import pandas as pd
import pastas as ps
import corner
import emcee as mc
import matplotlib.pyplot as plt

ps.set_log_level("ERROR")
The first step is to create a Pastas Model object, including the RechargeModel to simulate the effect of precipitation and evaporation on the groundwater heads.
[2]:
# read observations and create the time series model
obs = pd.read_csv('../data/head_nb1.csv', parse_dates=['date'], index_col='date', squeeze=True)
rain = pd.read_csv('../data/rain_nb1.csv', parse_dates=['date'], index_col='date', squeeze=True)
evap = pd.read_csv('../data/evap_nb1.csv', parse_dates=['date'], index_col='date', squeeze=True)

# Create the time series model
ml = ps.Model(obs, name="head")
sm = ps.RechargeModel(prec=rain, evap=evap, rfunc=ps.Exponential, name='recharge')
ml.add_stressmodel(sm)
ml.solve(noise=True, report="basic")
ml.plot(figsize=(10, 3))
Model Results head                    Fit Statistics
==================================================
nfev     21                      EVP      92.91
nobs     644                     R2        0.93
noise    True                    RMSE      0.11
tmin     1985-11-14 00:00:00     AIC       5.74
tmax     2015-06-28 00:00:00     BIC      28.07
freq     D                       Obj       2.02
warmup   3650 days 00:00:00      ___
solver   LeastSquares            ___

Parameters (5 were optimized)
==================================================
                optimal   stderr     initial  vary
recharge_A   686.244582   ±5.30%  215.674528  True
recharge_a   159.386045   ±5.01%   10.000000  True
recharge_f    -1.305369   ±4.04%   -1.000000  True
constant_d    27.920146   ±0.21%   27.900078  True
noise_alpha   49.911834  ±11.86%   15.000000  True

Parameter correlations |rho| > 0.5
==================================================
recharge_A recharge_a  0.82
recharge_f constant_d -0.98
Apart from the default solver (ps.LeastSquares), Pastas also contains the option to use the lmfit package to estimate the parameters. This package wraps multiple optimization techniques, one of which is Emcee. The code block below shows how to use this method to estimate the parameters of a Pastas model.
Emcee takes a number of keyword arguments that determine how the optimization is done. The most important is the steps argument, which determines how many steps each of the walkers takes. The nwalkers argument sets the number of walkers (default is 100). The burn argument determines how many samples from the start of each walker are removed, and the thin argument determines that only 1 in every thin samples is kept.
[8]:
ml.set_vary("noise_alpha", True)
# ml.set_parameter("constant_d", vary=False)
ml.solve(tmin="2002", noise=True, initial=False, fit_constant=False,
         solver=ps.LmfitSolve, method="emcee", steps=200, burn=10,
         thin=5, is_weighted=True, nan_policy="omit");
100%|██████████| 200/200 [03:48<00:00, 1.14s/it]
The chain is shorter than 50 times the integrated autocorrelation time for 4 parameter(s). Use this estimate with caution and run a longer chain!
N/50 = 4;
tau: [20.39566676 21.85900797 16.31593659 23.78300956]

Model Results head                    Fit Statistics
=====================================================
nfev     20000                   EVP      76.23
nobs     287                     R2        0.76
noise    True                    RMSE      0.18
tmin     2002-01-01 00:00:00     AIC       5.44
tmax     2015-06-28 00:00:00     BIC      23.74
freq     D                       Obj       2.38
warmup   3650 days 00:00:00      ___
solver   LmfitSolve              ___

Parameters (4 were optimized)
=====================================================
                 optimal    stderr     initial   vary
recharge_A   1507.575822  ±102.79%  857.766358   True
recharge_a    493.655951   ±63.28%  263.710900   True
recharge_f     -1.065318   ±55.18%   -1.300288   True
constant_d     27.386438     ±nan%    0.000000  False
noise_alpha   176.294705  ±221.96%   49.911834   True

Parameter correlations |rho| > 0.5
=====================================================
Empty DataFrame
Columns: [rho]
Index: []
The results are stored in the result object, accessible through ml.fit.result. The attribute ml.fit.result.flatchain contains a Pandas DataFrame with the \(n\) parameter samples, where \(n\) is calculated as follows:
$n = \frac{(\text{steps}-\text{burn})\cdot\text{nwalkers}}{\text{thin}} $
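For the settings used above, and assuming the default of 100 walkers, this works out as follows:

# expected number of samples in ml.fit.result.flatchain
steps, burn, thin, nwalkers = 200, 10, 5, 100
n = (steps - burn) * nwalkers // thin
print(n)  # 3800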
Corner is a simple but great Python package that makes creating corner plots easy. One line of code suffices to create a plot of the parameter distributions and the correlations between the parameters.
[9]:
corner.corner(ml.fit.result.flatchain,
              truths=list(ml.parameters[ml.parameters.vary == True].optimal));
The walkers take steps in different directions at each step. It is expected that, after a number of steps, the direction of the steps becomes random, a sign that an optimum has been found. This can be checked by looking at the integrated autocorrelation time, which should become insignificant after a number of steps (how to interpret this diagnostic is not entirely clear yet). However, the parameters do not seem to have converged to some optimum yet, even for this simple linear model.
[5]:
# Estimate the integrated autocorrelation time for each parameter
mc.autocorr.integrated_time(ml.fit.result.chain)
---------------------------------------------------------------------------
AutocorrError                             Traceback (most recent call last)
<ipython-input-5-55ad95715032> in <module>
----> 1 mc.autocorr.integrated_time(ml.fit.result.chain)

/Applications/anaconda3/envs/py37_pastas/lib/python3.7/site-packages/emcee/autocorr.py in integrated_time(x, c, tol, quiet)
    108         msg += "N/{0} = {1:.0f};\ntau: {2}".format(tol, n_t / tol, tau_est)
    109         if not quiet:
--> 110             raise AutocorrError(tau_est, msg)
    111         logging.warning(msg)
    112

AutocorrError: The chain is shorter than 50 times the integrated autocorrelation time for 4 parameter(s). Use this estimate with caution and run a longer chain!
N/50 = 1;
tau: [3.87630749 4.09444635 3.11912338 3.33374754]
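As the traceback shows, integrated_time accepts a quiet argument; passing quiet=True returns the (cautionary) estimate and only logs a warning instead of raising the error:

# quiet=True: return the estimate and log a warning instead of raising
tau = mc.autocorr.integrated_time(ml.fit.result.chain, quiet=True)
print(tau)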
[10]:
# Plot the chain of each walker for every parameter
labels = ml.fit.result.flatchain.columns
fig, axes = plt.subplots(labels.size, figsize=(10, 7), sharex=True)
samples = ml.fit.result.chain

for i in range(labels.size):
    ax = axes[i]
    ax.plot(samples[:, :, i], "k", alpha=0.3)
    ax.set_xlim(0, len(samples))
    ax.set_ylabel(labels[i])
    ax.yaxis.set_label_coords(-0.1, 0.5)

axes[-1].set_xlabel("step number");
[7]:
# Plot the observed heads and overlay simulations for a random
# subset of the parameter samples to visualize the uncertainty
ax = ml.plot(figsize=(10, 3))

inds = np.random.randint(len(ml.fit.result.flatchain), size=100)
for ind in inds:
    params = ml.fit.result.flatchain.iloc[ind].values
    ml.simulate(params).plot(c="k", alpha=0.1)

plt.ylim(27, 30)
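The samples in ml.fit.result.flatchain can also be used to summarize the parameter uncertainty directly, for example with 95% credible intervals. A minimal sketch, using the quantile method of the pandas DataFrame:

# 95% credible intervals of the parameters from the flattened chains
ci = ml.fit.result.flatchain.quantile([0.025, 0.5, 0.975]).transpose()
ci.columns = ["2.5%", "median", "97.5%"]
print(ci)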