{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Frequentist uncertainty vs. Bayesian uncertainty analysis\n", "*Mark Bakker, TU Delft & Raoul Collenteur, Eawag, February, 2025*\n", "\n", "In this notebook, the fit and uncertainty are compared for `pastas` models solved with least squares (frequentist uncertainty) and with MCMC (Bayesian uncertainty). \n", "Besides Pastas the following Python Packages have to be installed to run this notebook:\n", "\n", "- [emcee](https://emcee.readthedocs.io)\n", "- [corner](https://corner.readthedocs.io)\n", "\n", "
\n", "Note:\n", "The EmceeSolver is still an experimental feature and some of the arguments may change in before official release. We welcome testing and feedback on this new feature!\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import corner\n", "import emcee\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "\n", "import pastas as ps\n", "\n", "ps.set_log_level(\"ERROR\")\n", "ps.show_versions()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. A 'regular' Pastas Model\n", "The first step is to create a Pastas Model with a linear `RechargeModel` and a `Gamma` response function to simulate the effect of precipitation and evaporation on the heads. The AR1 noise model is used. We first estimate the model parameters using the standard least-squares approach." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "head = pd.read_csv(\n", " \"data/B32C0639001.csv\", parse_dates=[\"date\"], index_col=\"date\"\n", ").squeeze()\n", "head = head[\"1990\":] # use data from 1990 on for this example\n", "\n", "evap = pd.read_csv(\"data/evap_260.csv\", index_col=0, parse_dates=[0]).squeeze()\n", "rain = pd.read_csv(\"data/rain_260.csv\", index_col=0, parse_dates=[0]).squeeze()\n", "\n", "ml1 = ps.Model(head)\n", "ml1.add_noisemodel(ps.ArNoiseModel())\n", "\n", "rm = ps.RechargeModel(\n", " rain, evap, recharge=ps.rch.Linear(), rfunc=ps.Gamma(), name=\"rch\"\n", ")\n", "ml1.add_stressmodel(rm)\n", "\n", "ml1.solve()\n", "\n", "ax = ml1.plots.results(figsize=(10, 4))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The diagnostics show that the noise meets the statistical requirements for uncertainty analysis reasonably well. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ml1.plots.diagnostics();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The estimated least squares parameters and standard errors are stored for later reference" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ls_params = ml1.parameters[[\"optimal\", \"stderr\"]].copy()\n", "ls_params.rename(columns={\"optimal\": \"ls_opt\", \"stderr\": \"ls_sig\"}, inplace=True)\n", "ls_params" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Compute prediction interval Pastas\n", "pi = ml1.solver.prediction_interval(n=1000)\n", "ax = ml1.plot(figsize=(10, 3))\n", "ax.fill_between(pi.index, pi.iloc[:, 0], pi.iloc[:, 1], color=\"lightgray\")\n", "ax.legend([\"Observations\", \"Simulation\", \"95% Prediction interval\"], ncol=3, loc=2)\n", "pi_pasta = np.mean(pi[0.975] - pi[0.025])\n", "print(f\"Mean prediction interval width: {pi_pasta:.3f} m\")\n", "print(f\"Prediction interval coverage probability: {ps.stats.picp(head, pi): .3f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Use the EmceeSolver\n", "\n", "We will now use MCMC to estimate the model parameters and their uncertainties. The `EmceeSolve` solver wraps the [Emcee](https://lmfit.github.io/lmfit-py/fitting.html#lmfit.minimizer.Minimizer.emcee) package, which implements different versions of MCMC. A good understanding of Emcee helps when using this solver, so it comes recommended to check out their documentation as well.\n", "\n", "We start by making a `pastas` model with a linear recharge model and a Gamma response function. No noise model is added, as this is taken care of in the likelihood function. 
The model is solved with the regular least-squares solver to obtain a good estimate of the starting values of the parameters." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ml2 = ps.Model(head)\n", "rm = ps.RechargeModel(\n", " rain, evap, recharge=ps.rch.Linear(), rfunc=ps.Gamma(), name=\"rch\"\n", ")\n", "ml2.add_stressmodel(rm)\n", "ml2.solve()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To set up the `EmceeSolve` solver, a number of decisions need to be made:\n", "\n", "- Select the priors of the parameters\n", "- Select a (log) likelihood function\n", "- Select the number of steps and thinning\n", " \n", "### 2a. Priors\n", "\n", "The first step is to select the priors of the parameters. This is done with the `ml.set_parameter` method and its `dist` argument (short for distribution). Any [continuous distribution from `scipy.stats`](https://docs.scipy.org/doc/scipy/tutorial/stats/continuous.html) can be chosen, for example `uniform`, `norm`, or `lognorm`. Here, we select normal distributions for the priors. Currently, `pastas` will use the `initial` value of the parameter for the `loc` argument of the distribution (e.g., the mean of a normal distribution), and the `stderr` as the `scale` argument (e.g., the standard deviation of a normal distribution). For parameters with a `uniform` distribution, the `pmin` and `pmax` values are used as the bounds of the prior. By default, all parameters are assigned a `uniform` prior.\n", "\n", "<div class=\"alert alert-warning\">
\n", "Note:\n", "This means that either the `pmin` and `pmax` should be set for uniform distributions, or the `stderr` for any other distribution. That is why in this example model was first solved using LeastSquares, in order to obtain estimates for the `stderr`. In practice, these could also be set based on expert judgement or information about the parameters.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Set the initial parameters to a normal distribution\n", "ml2.parameters[\"initial\"] = ml2.parameters[\n", " \"optimal\"\n", "] # set initial value to the optimal from least squares for good starting point\n", "ml2.parameters[\"stderr\"] = (\n", " 2 * ml2.parameters[\"stderr\"]\n", ") # this column is used (for now) to set the scale of the normal distribution\n", "\n", "for name in ml2.parameters.index:\n", " ml2.set_parameter(\n", " name,\n", " dist=\"norm\",\n", " )\n", "\n", "ml2.parameters" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "### 2b. Create the solver instance\n", "\n", "The next step is to create an instance of the `EmceeSolve` solver class. At this stage all the settings need to be provided on how the Ensemble Sampler is created (https://emcee.readthedocs.io/en/stable/user/sampler/). Important settings are the `nwalkers`, the `moves`, the `objective_function`. More advanced options are to parallelize the MCMC algorithm (`parallel=True`), and to set a backend to store the results. Here's an example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Choose the objective function\n", "ln_prob = ps.objfunc.GaussianLikelihoodAr1()\n", "\n", "# Create the EmceeSolver with some settings\n", "s = ps.EmceeSolve(\n", " nwalkers=20,\n", " moves=emcee.moves.DEMove(),\n", " objective_function=ln_prob,\n", " progress_bar=True,\n", " parallel=False,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the above code we created an `EmceeSolve` instance with 20 walkers, which take steps according to the `DEMove` move algorithm (see Emcee docs), and a Gaussian likelihood function that assumes AR1 correlated errors. Different objective functions are available, see the Pastas documentation on the different options. \n", "\n", "Depending on the likelihood function, a number of additional parameters need to be inferred. These parameters are not added to the Pastas Model instance, but are available from the solver object. Using the `set_parameter` method of the solver, these parameters can be changed. In this example where we use the `GaussianLikelihoodAr1` function, the $\\sigma^2$ and $\\phi$ are estimated; the unknown standard deviation of the errors and the autoregressive parameter." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s.parameters" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sigsq = ml1.noise().std() ** 2\n", "s.set_parameter(\"ln_var\", initial=sigsq, vary=True)\n", "s.parameters.loc[\"ln_var\", \"stderr\"] = stderr = sigsq / 8\n", "s.parameters" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2c. Run the solver and solve the model\n", "\n", "After setting the parameters and creating a EmceeSolve solver instance we are now ready to run the MCMC analysis. We can do this by running `ml.solve`. We can pass the same parameters that we normally provide to this method (e.g., `tmin` or `fit_constant`). Here we use the initial parameters from our least-square solve, and do not fit a noise model, because we take autocorrelated errors into account through the likelihood function. \n", "\n", "All the arguments that are not used by `ml.solve`, for example `steps` and `tune`, are passed on to the `run_mcmc` method from the sampler (see Emcee docs). 
The most important is the `steps` argument, which determines how many steps each of the walkers takes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Use the solver to run MCMC\n", "ml2.solve(\n", " solver=s,\n", " initial=False,\n", " tmin=\"1990\",\n", " steps=1000,\n", " tune=True,\n", " report=False,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Posterior parameter distributions\n", "\n", "The results from the MCMC analysis are stored in the `sampler` object, accessible through the `ml.solver.sampler` attribute. The object `ml.solver.sampler.flatchain` contains a Pandas DataFrame with the $n$ parameter samples, where $n$ is calculated as follows:\n", "\n", "$n = \frac{\left(\text{steps}-\text{burn}\right)\cdot\text{nwalkers}}{\text{thin}}$\n", "\n", "For the run above, with 1000 steps, 20 walkers, and no burn-in or thinning, this gives $n = 20000$ samples.\n", "\n", "## Corner.py\n", "Corner is a simple but great Python package that makes it easy to create corner plots. A couple of lines of code suffice to create a plot of the parameter distributions and the covariances between the parameters." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Corner plot of the results\n", "fig = plt.figure(figsize=(8, 8))\n", "\n", "labels = list(ml2.parameters.index[ml2.parameters.vary]) + list(\n", " ml2.solver.parameters.index[ml2.solver.parameters.vary]\n", ")\n", "labels = [label.split(\"_\")[1] for label in labels]\n", "\n", "best = list(ml2.parameters[ml2.parameters.vary].optimal) + list(\n", " ml2.solver.parameters[ml2.solver.parameters.vary].optimal\n", ")\n", "\n", "axes = corner.corner(\n", " ml2.solver.sampler.get_chain(flat=True, discard=500),\n", " quantiles=[0.025, 0.5, 0.975],\n", " labelpad=0.1,\n", " show_titles=True,\n", " title_kwargs=dict(fontsize=10),\n", " label_kwargs=dict(fontsize=10),\n", " max_n_ticks=3,\n", " fig=fig,\n", " labels=labels,\n", " truths=best,\n", ")\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. The trace shows when MCMC converges\n", "Each walker takes a step in a different direction at every iteration. It is expected that after a number of steps, the direction of the steps becomes random, a sign that an optimum has been found. This can be checked by looking at the autocorrelation, which should be insignificant after a number of steps. Below we just show how to obtain the different chains; their interpretation is outside the scope of this notebook."
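, "\n\n", "As a quick numeric check (a sketch using emcee's built-in estimator), the integrated autocorrelation time of each parameter can be estimated first; a chain is only long enough for reliable estimates when it is many times longer than this time." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sketch: estimate the integrated autocorrelation time per parameter.\n", "# quiet=True returns the estimate with a warning instead of raising an\n", "# error when the chain is too short for a reliable estimate.\n", "tau = ml2.solver.sampler.get_autocorr_time(quiet=True)\n", "print(\"integrated autocorrelation time per parameter:\", tau)"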
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, axes = plt.subplots(len(labels), figsize=(10, 7), sharex=True)\n", "\n", "samples = ml2.solver.sampler.get_chain(flat=True)\n", "for i in range(len(labels)):\n", " ax = axes[i]\n", " ax.plot(samples[:, i], \"k\", alpha=0.5)\n", " ax.set_xlim(0, len(samples))\n", " ax.set_ylabel(labels[i])\n", " ax.yaxis.set_label_coords(-0.1, 0.5)\n", "\n", "axes[-1].set_xlabel(\"step number\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mcn_params = pd.DataFrame(index=ls_params.index, columns=[\"mcn_opt\", \"mcn_sig\"])\n", "params = ml2.solver.sampler.get_chain(\n", " flat=True, discard=500\n", ") # discard first 500 steps of every chain\n", "for iparam in range(params.shape[1] - 1):\n", " mcn_params.iloc[iparam] = np.median(params[:, iparam]), np.std(params[:, iparam])\n", "mean_time_diff = head.index.to_series().diff().mean().total_seconds() / 86400\n", "\n", "# Translate phi into the value of alpha also used by the noisemodel;\n", "# applying the same transformation to the standard deviation of phi\n", "# gives only a rough indication of the spread of alpha\n", "mcn_params.loc[\"noise_alpha\", \"mcn_opt\"] = -mean_time_diff / np.log(\n", " np.median(params[:, -1])\n", ")\n", "mcn_params.loc[\"noise_alpha\", \"mcn_sig\"] = -mean_time_diff / np.log(\n", " np.std(params[:, -1])\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.concat((ls_params, mcn_params), axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Repeat with uniform priors\n", "Set more or less uninformative uniform priors. Now also include $\sigma^2$ in the estimation." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ml3 = ps.Model(head)\n", "rm = ps.RechargeModel(\n", " rain, evap, recharge=ps.rch.Linear(), rfunc=ps.Gamma(), name=\"rch\"\n", ")\n", "ml3.add_stressmodel(rm)\n", "ml3.solve(report=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Uniform priors are selected from 0.25 to 4 times the optimal values." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Assign uniform prior distributions to the parameters\n", "ml3.parameters[\"initial\"] = ml3.parameters[\n", " \"optimal\"\n", "] # set initial value to the optimal from least squares for a good starting point\n", "for name in ml3.parameters.index:\n", " if ml3.parameters.loc[name, \"optimal\"] > 0:\n", " ml3.set_parameter(\n", " name,\n", " dist=\"uniform\",\n", " pmin=0.25 * ml3.parameters.loc[name, \"optimal\"],\n", " pmax=4 * ml3.parameters.loc[name, \"optimal\"],\n", " )\n", " else:\n", " ml3.set_parameter(\n", " name,\n", " dist=\"uniform\",\n", " pmin=4 * ml3.parameters.loc[name, \"optimal\"],\n", " pmax=0.25 * ml3.parameters.loc[name, \"optimal\"],\n", " )\n", "\n", "ml3.parameters" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Choose the objective function\n", "ln_prob = ps.objfunc.GaussianLikelihoodAr1()\n", "\n", "# Create the EmceeSolver with some settings\n", "s = ps.EmceeSolve(\n", " nwalkers=20,\n", " moves=emcee.moves.DEMove(),\n", " objective_function=ln_prob,\n", " progress_bar=True,\n", " parallel=False,\n", ")\n", "\n", "# Set a uniform prior for the noise variance\n", "s.parameters.loc[\"ln_var\", \"initial\"] = 0.05**2\n", "s.parameters.loc[\"ln_var\", \"pmin\"] = 0.05**2 / 4\n", "s.parameters.loc[\"ln_var\", \"pmax\"] = 4 * 0.05**2\n", "\n", "# Use the solver to run MCMC\n", "ml3.solve(\n", " solver=s,\n", " initial=False,\n", " tmin=\"1990\",\n", " steps=1000,\n", " 
tune=True,\n", " report=False,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s.parameters" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Corner plot of the results\n", "fig = plt.figure(figsize=(8, 8))\n", "\n", "labels = list(ml3.parameters.index[ml3.parameters.vary]) + list(\n", " ml3.solver.parameters.index[ml3.solver.parameters.vary]\n", ")\n", "labels = [label.split(\"_\")[1] for label in labels]\n", "\n", "best = list(ml3.parameters[ml3.parameters.vary].optimal) + list(\n", " ml3.solver.parameters[ml3.solver.parameters.vary].optimal\n", ")\n", "\n", "axes = corner.corner(\n", " ml3.solver.sampler.get_chain(flat=True, discard=500),\n", " quantiles=[0.025, 0.5, 0.975],\n", " labelpad=0.1,\n", " show_titles=True,\n", " title_kwargs=dict(fontsize=10),\n", " label_kwargs=dict(fontsize=10),\n", " max_n_ticks=3,\n", " fig=fig,\n", " labels=labels,\n", " truths=best,\n", ")\n", "\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, axes = plt.subplots(len(labels), figsize=(10, 7), sharex=True)\n", "\n", "samples = ml3.solver.sampler.get_chain(flat=True)\n", "for i in range(len(labels)):\n", " ax = axes[i]\n", " ax.plot(samples[:, i], \"k\", alpha=0.5)\n", " ax.set_xlim(0, len(samples))\n", " ax.set_ylabel(labels[i])\n", " ax.yaxis.set_label_coords(-0.1, 0.5)\n", "\n", "axes[-1].set_xlabel(\"step number\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mcu_params = pd.DataFrame(index=ls_params.index, columns=[\"mcu_opt\", \"mcu_sig\"])\n", "params = ml3.solver.sampler.get_chain(\n", " flat=True, discard=500\n", ") # discard first 500 steps of every chain\n", "for iparam in range(params.shape[1] - 1):\n", " mcu_params.iloc[iparam] = np.median(params[:, iparam]), np.std(params[:, iparam])\n", "mean_time_diff = head.index.to_series().diff().mean().total_seconds() / 86400\n", "# Translate phi into the value of alpha (see the comment in the earlier cell)\n", "mcu_params.loc[\"noise_alpha\", \"mcu_opt\"] = -mean_time_diff / np.log(\n", " np.median(params[:, -1])\n", ")\n", "mcu_params.loc[\"noise_alpha\", \"mcu_sig\"] = -mean_time_diff / np.log(\n", " np.std(params[:, -1])\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.concat((ls_params, mcn_params, mcu_params), axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Compute prediction interval" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "params = ml3.solver.sampler.get_chain(flat=True, discard=500)\n", "sim = {}\n", "# compute for 1000 random samples of the chain\n", "# (the chain contains 10000 samples: 500 retained steps times 20 walkers)\n", "np.random.seed(1)\n", "for i in np.random.choice(np.arange(10000), size=1000, replace=False):\n", " h = ml3.simulate(p=params[i, :-2])\n", " res = ml3.residuals(p=params[i, :-2])\n", " h += np.random.normal(loc=0, scale=np.std(res), size=len(h))\n", " sim[i] = h\n", "simdf = pd.DataFrame.from_dict(sim, orient=\"columns\", dtype=float)\n", "alpha = 0.05\n", "q = [alpha / 2, 1 - alpha / 2]\n", "pi = simdf.quantile(q, axis=1).transpose()\n", "pimean = np.mean(pi[0.975] - pi[0.025])\n", "print(f\"Mean prediction interval width (emcee, uniform priors): {pimean:.3f} m\")\n", "print(f\"PICP: {ps.stats.picp(head, pi):.3f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this example, the prediction interval is dominated by the residuals, not by the uncertainty of the parameters. 
In the code cell below, the parameter uncertainty is not included: the coverage only changes slightly and is mostly affected by the difference in randomly drawing residuals." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "logprob = ml3.solver.sampler.compute_log_prob(\n", " ml3.solver.sampler.get_chain(flat=True, discard=500)\n", ")[0]\n", "imax = np.argmax(logprob) # parameter set with the largest likelihood\n", "\n", "params = ml3.solver.sampler.get_chain(flat=True, discard=500)\n", "sim = {}\n", "# compute for 1000 random samples of the residuals, but one parameter set\n", "h = ml3.simulate(p=params[imax, :-2])\n", "res = ml3.residuals(p=params[imax, :-2])\n", "np.random.seed(1)\n", "for i in range(1000):\n", " sim[i] = h + np.random.normal(loc=0, scale=np.std(res), size=len(h))\n", "simdf = pd.DataFrame.from_dict(sim, orient=\"columns\", dtype=float)\n", "alpha = 0.05\n", "q = [alpha / 2, 1 - alpha / 2]\n", "pi = simdf.quantile(q, axis=1).transpose()\n", "pimean = np.mean(pi[0.975] - pi[0.025])\n", "print(f\"Mean prediction interval width (best parameter set only): {pimean:.3f} m\")\n", "print(f\"PICP: {ps.stats.picp(head, pi):.3f}\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" } }, "nbformat": 4, "nbformat_minor": 4 }