{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Bayesian uncertainty analysis\n", "*R.A. Collenteur, Eawag, June, 2023*\n", "\n", "In this notebook it is shown how the MCMC-algorithm can be used to estimate the model parameters and quantify the (parameter) uncertainties for a Pastas model using a Bayesian approach. For this the EmceeSolver is introduced, based on the [emcee](https://emcee.readthedocs.io) Python package. \n", "\n", "Besides Pastas the following Python Packages have to be installed to run this notebook:\n", "\n", "- [emcee](https://emcee.readthedocs.io)\n", "- [corner](https://corner.readthedocs.io)\n", "\n", "
\n", "Note:\n", "The EmceeSolver is still an experimental feature and some of the arguments might be changed in the near future (2023/06/22). We welcome testing and feedback on this new feature!.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import corner\n", "import emcee\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "\n", "import pastas as ps\n", "\n", "ps.set_log_level(\"ERROR\")\n", "ps.show_versions()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Create a Pastas Model\n", "The first step is to create a Pastas Model, including the RechargeModel to simulate the effect of precipitation and evaporation on the heads. Here, we first estimate the model parameters using the standard least-squares approach." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "head = pd.read_csv(\n", " \"data/B32C0639001.csv\", parse_dates=[\"date\"], index_col=\"date\"\n", ").squeeze()\n", "\n", "evap = pd.read_csv(\"data/evap_260.csv\", index_col=0, parse_dates=[0]).squeeze()\n", "rain = pd.read_csv(\"data/rain_260.csv\", index_col=0, parse_dates=[0]).squeeze()\n", "\n", "ml = ps.Model(head)\n", "ml.add_noisemodel(ps.ArNoiseModel())\n", "\n", "# Select a recharge model\n", "rch = ps.rch.FlexModel()\n", "\n", "rm = ps.RechargeModel(rain, evap, recharge=rch, rfunc=ps.Gamma(), name=\"rch\")\n", "ml.add_stressmodel(rm)\n", "\n", "ml.solve(tmin=\"1990\")\n", "\n", "ax = ml.plot(figsize=(10, 3))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Use the EmceeSolver\n", "\n", "We will now use the EmceeSolve solver to estimate the model parameters and their uncertainties. This solver wraps the [Emcee](https://lmfit.github.io/lmfit-py/fitting.html#lmfit.minimizer.Minimizer.emcee) package, which implements different versions of MCMC. A good understanding of Emcee helps when using this solver, so it comes recommended to check out their documentation as well.\n", "\n", "To set up the solver, a number of decisions need to be made:\n", "\n", "- Determine the priors of the parameters\n", "- Choose a (log) likelihood function\n", "- Choose the number of steps and thinning\n", "\n", "### 2a. Choose and set the priors\n", "\n", "The first step is to choose and set the priors of the parameters. This is done by using the `ml.set_parameter` method and the `dist` argument (from distribution). Any distribution from the `scipy.stats` can be chosen (https://docs.scipy.org/doc/scipy/tutorial/stats/continuous.html), for example `uniform`, `norm`, or `lognorm`. Here, for the sake of the example, we set all prior distributions to a normal distribution." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Set the initial parameters to a normal distribution\n", "for name in ml.parameters.index:\n", " ml.set_parameter(name, dist=\"norm\")\n", "\n", "ml.parameters" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pastas will use the `initial` value of the parameter for the `loc` argument of the distribution (e.g., the mean of a normal distribution), and the `stderr` as the `scale` argument (e.g., the standard deviation of a normal distribution). Only for the parameters with a `uniform` distribution, the `pmin` and `pmax` values are used to determine a uniform prior. By default, all parameters are assigned a `uniform` prior.\n", "\n", "
\n", "Note:\n", "This means that either the `pmin` and `pmax` should be set for uniform distributions, or the `stderr` for any other distribution. That is why in this example model was first solved using LeastSquares, in order to obtain estimates for the `stderr`. In practice, these could also be set based on expert judgement or information about the parameters.\n", "
\n", "\n", "### 2b. Create the solver instance\n", "\n", "The next step is to create an instance of the `EmceeSolve` solver class. At this stage all the settings need to be provided on how the Ensemble Sampler is created (https://emcee.readthedocs.io/en/stable/user/sampler/). Important settings are the `nwalkers`, the `moves`, the `objective_function`. More advanced options are to parallelize the MCMC algorithm (`parallel=True`), and to set a backend to store the results. Here's an example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Choose the objective function\n", "ln_prob = ps.objfunc.GaussianLikelihoodAr1()\n", "\n", "# Create the EmceeSolver with some settings\n", "s = ps.EmceeSolve(\n", " nwalkers=20,\n", " moves=emcee.moves.DEMove(),\n", " objective_function=ln_prob,\n", " progress_bar=True,\n", " parallel=False,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the above code we created an `EmceeSolve` instance with 20 walkers, which take steps according to the `DEMove` move algorithm (see Emcee docs), and a Gaussian likelihood function that assumes AR1 correlated errors. Different objective functions are available, see the Pastas documentation on the different options. \n", "\n", "Depending on the likelihood function, a number of additional parameters need to be inferred. These parameters are not added to the Pastas Model instance, but are available from the solver object. Using the `set_parameter` method of the solver, these parameters can be changed. In this example where we use the `GaussianLikelihoodAr1` function the sigma and theta are estimated; the unknown standard deviation of the errors and the autoregressive parameter." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s.parameters" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s.set_parameter(\"ln_var\", initial=0.0028, vary=False, dist=\"norm\")\n", "s.parameters" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2c. Run the solver and solve the model\n", "\n", "After setting the parameters and creating a EmceeSolve solver instance we are now ready to run the MCMC analysis. We can do this by running `ml.solve`. We can pass the same parameters that we normally provide to this method (e.g., `tmin` or `fit_constant`). Here we use the initial parameters from our least-square solve, and do not fit a noise model, because we take autocorrelated errors into account through the likelihood function. \n", "\n", "All the arguments that are not used by `ml.solve`, for example `steps` and `tune`, are passed on to the `run_mcmc` method from the sampler (see Emcee docs). The most important is the `steps` argument, that determines how many steps each of the walkers takes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Use the solver to run MCMC\n", "ml.del_noisemodel()\n", "ml.solve(\n", " solver=s,\n", " initial=False,\n", " fit_constant=False,\n", " tmin=\"1990\",\n", " steps=100,\n", " tune=True,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Posterior parameter distributions\n", "\n", "The results from the MCMC analysis are stored in the `sampler` object, accessible through `ml.solver.sampler` variable. The object `ml.solver.sampler.flatchain` contains a Pandas DataFrame with $n$ the parameter samples, where $n$ is calculated as follows:\n", "\n", "$n = \\frac{\\left(\\text{steps}-\\text{burn}\\right)\\cdot\\text{nwalkers}}{\\text{thin}} $\n", "\n", "## Corner.py\n", "Corner is a simple but great python package that makes creating corner graphs easy. A couple of lines of code suffice to create a plot of the parameter distributions and the covariances between the parameters." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Corner plot of the results\n", "fig = plt.figure(figsize=(8, 8))\n", "\n", "labels = list(ml.parameters.index[ml.parameters.vary]) + list(\n", " ml.solver.parameters.index[ml.solver.parameters.vary]\n", ")\n", "labels = [label.split(\"_\")[1] for label in labels]\n", "\n", "best = list(ml.parameters[ml.parameters.vary].optimal) + list(\n", " ml.solver.parameters[ml.solver.parameters.vary].optimal\n", ")\n", "\n", "axes = corner.corner(\n", " ml.solver.sampler.get_chain(flat=True, discard=50),\n", " quantiles=[0.025, 0.5, 0.975],\n", " labelpad=0.1,\n", " show_titles=True,\n", " title_kwargs=dict(fontsize=10),\n", " label_kwargs=dict(fontsize=10),\n", " max_n_ticks=3,\n", " fig=fig,\n", " labels=labels,\n", " truths=best,\n", ")\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. What happens to the walkers at each step?\n", "The walkers take steps in different directions for each step. It is expected that after a number of steps, the direction of the step becomes random, as a sign that an optimum has been found. This can be checked by looking at the autocorrelation, which should be insignificant after a number of steps. Below we just show how to obtain the different chains, the interpretation of which is outside the scope of this notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, axes = plt.subplots(len(labels), figsize=(10, 7), sharex=True)\n", "\n", "samples = ml.solver.sampler.get_chain(flat=True)\n", "for i in range(len(labels)):\n", " ax = axes[i]\n", " ax.plot(samples[:, i], \"k\", alpha=0.5)\n", " ax.set_xlim(0, len(samples))\n", " ax.set_ylabel(labels[i])\n", " ax.yaxis.set_label_coords(-0.1, 0.5)\n", "\n", "axes[-1].set_xlabel(\"step number\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Plot some simulated time series to display uncertainty?\n", "\n", "We can now draw parameter sets from the chain and simulate the uncertainty in the head simulation. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Plot results and uncertainty\n", "ax = ml.plot(figsize=(10, 3))\n", "plt.title(None)\n", "\n", "chain = ml.solver.sampler.get_chain(flat=True, discard=50)\n", "inds = np.random.randint(len(chain), size=100)\n", "for ind in inds:\n", " params = chain[ind]\n", " p = ml.parameters.optimal.copy().values\n", " p[ml.parameters.vary] = params[: ml.parameters.vary.sum()]\n", " _ = ml.simulate(p, tmin=\"1990\").plot(c=\"gray\", alpha=0.1, zorder=-1)\n", "\n", "plt.legend([\"Measurements\", \"Simulation\", \"Ensemble members\"], numpoints=3)" ] } ], "metadata": { "kernelspec": { "display_name": "pastas_dev", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.9" } }, "nbformat": 4, "nbformat_minor": 4 }