Uncertainty quantification#

R.A. Collenteur, University of Graz, WIP (May-2021)

In this notebook it is shown how to compute the uncertainty of the model simulation using the built-in uncertainty quantification options of Pastas.

Confidence interval of simulation
Prediction interval of simulation
Confidence interval of step response
Confidence interval of block response
Confidence interval of contribution
Custom confidence intervals

The compute the confidence intervals, parameters sets are drawn from a multivariate normal distribution based on the jacobian matrix obtained during parameter optimization. This method to quantify uncertainties has some underlying assumptions on the model residuals (or noise) that should be checked. This notebook only deals with parameter uncertainties and not with model structure uncertainties.

import pandas as pd
import pastas as ps

import matplotlib.pyplot as plt

ps.set_log_level("ERROR")
ps.show_versions()

/tmp/ipykernel_1338/685018981.py:1: DeprecationWarning: 
Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
        
  import pandas as pd

Python version: 3.11.6
NumPy version: 1.26.4
Pandas version: 2.2.0
SciPy version: 1.12.0
Matplotlib version: 3.8.3
Numba version: 0.59.0
LMfit version: 1.2.2
Latexify version: Not Installed
Pastas version: 1.4.0

Create a model#

We first create a toy model to simulate the groundwater levels in southeastern Austria. We will use this model to illustrate how the different methods for uncertainty quantification can be used.

gwl = (
    pd.read_csv("data_wagna/head_wagna.csv", index_col=0, parse_dates=True, skiprows=2)
    .squeeze()
    .loc["2006":]
    .iloc[0::10]
)
gwl = gwl.loc[~gwl.index.duplicated(keep='first')]

evap = pd.read_csv(
    "data_wagna/evap_wagna.csv", index_col=0, parse_dates=True, skiprows=2
).squeeze()
prec = pd.read_csv(
    "data_wagna/rain_wagna.csv", index_col=0, parse_dates=True, skiprows=2
).squeeze()

# Model settings
tmin = pd.Timestamp("2007-01-01")  # Needs warmup
tmax = pd.Timestamp("2016-12-31")

ml = ps.Model(gwl)
sm = ps.RechargeModel(
    prec, evap, recharge=ps.rch.FlexModel(), rfunc=ps.Exponential(), name="rch"
)
ml.add_stressmodel(sm)

# Add the ARMA(1,1) noise model and solve the Pastas model
ml.add_noisemodel(ps.ArmaModel())
ml.solve(tmin=tmin, tmax=tmax, noise=True)

/home/docs/checkouts/readthedocs.org/user_builds/pastas/envs/v1.4.0/lib/python3.11/site-packages/pastas/timeseries_utils.py:90: FutureWarning: Day.delta is deprecated and will be removed in a future version. Use pd.Timedelta(obj) instead
  if hasattr(offset, "delta"):

/home/docs/checkouts/readthedocs.org/user_builds/pastas/envs/v1.4.0/lib/python3.11/site-packages/pastas/stressmodels.py:1302: FutureWarning: 'A' is deprecated and will be removed in a future version, please use 'YE' instead.
  if self.prec.series.resample("A").sum().max() < 12:

Fit report GWL                          Fit Statistics
======================================================
nfev    39                     EVP               74.63
nobs    365                    R2                 0.75
noise   True                   RMSE               0.19
tmin    2007-01-01 00:00:00    AIC            -2051.60
tmax    2016-12-31 00:00:00    BIC            -2020.40
freq    D                      Obj                0.63
warmup  3650 days 00:00:00     ___                    
solver  LeastSquares           Interp.              No

Parameters (8 optimized)
======================================================
                optimal     initial   vary      stderr
rch_A          0.522818    0.529381   True      ±9.99%
rch_a         63.841311   10.000000   True     ±13.23%
rch_srmax    421.096959  250.000000   True     ±42.73%
rch_lp         0.250000    0.250000  False         nan
rch_ks       757.764872  100.000000   True    ±206.27%
rch_gamma      4.808958    2.000000   True     ±20.19%
rch_kv         1.000000    1.000000  False         nan
rch_simax      2.000000    2.000000  False         nan
constant_d   262.646988  263.166264   True  ±2.55e-02%
noise_alpha  123.010528   10.000000   True     ±24.84%
noise_beta     8.492118    1.000000   True     ±12.35%

Diagnostic Checks#

Before we perform the uncertainty quantification, we should check if the underlying statistical assumptions are met. We refer to the notebook on Diagnostic checking for more details on this.

ml.plots.diagnostics();

../_images/c2f9750071f8ba3acaf167f7cfbad6eb23b2229f4e69ebc951a46e5f30f6ece2.png

Confidence intervals#

After the model is calibrated, a solver attribute is added to the Pastas Model object (ml.solver). This object contains information about the optimizations (e.g., the jacobian matrix) and a number of methods that can be used to quantify uncertainties.

ci = ml.solver.ci_simulation(alpha=0.05, n=1000)
ax = ml.plot(figsize=(10, 3))
ax.fill_between(ci.index, ci.iloc[:, 0], ci.iloc[:, 1], color="lightgray")
ax.legend(["Observations", "Simulation", "95% Confidence interval"], ncol=3, loc=2)

<matplotlib.legend.Legend at 0x7f8f9e54e350>

../_images/8c96621bc0d947539669d7bcc639eca0554459b05338d1e32b406762a40ca358.png

Prediction interval#

ci = ml.solver.prediction_interval(n=1000)
ax = ml.plot(figsize=(10, 3))
ax.fill_between(ci.index, ci.iloc[:, 0], ci.iloc[:, 1], color="lightgray")
ax.legend(["Observations", "Simulation", "95% Prediction interval"], ncol=3, loc=2)

<matplotlib.legend.Legend at 0x7f8f9e467150>

../_images/8ac4e54c72dd68ae7f12a5a643b332d689d7129ec0d056110fd1ab1c40aab64e.png

Uncertainty of step response#

ci = ml.solver.ci_step_response("rch")
ax = ml.plots.step_response(figsize=(6, 2))
ax.fill_between(ci.index, ci.iloc[:, 0], ci.iloc[:, 1], color="lightgray")
ax.legend(["Simulation", "95% Prediction interval"], ncol=3, loc=4)

<matplotlib.legend.Legend at 0x7f8f9e0a8c50>

../_images/43d5db77a1c8a1fdcecd9575527eaf32e4fca1f78a1f60c03bd4a23d36be19dd.png

Uncertainty of block response#

ci = ml.solver.ci_block_response("rch")
ax = ml.plots.block_response(figsize=(6, 2))
ax.fill_between(ci.index, ci.iloc[:, 0], ci.iloc[:, 1], color="lightgray")
ax.legend(["Simulation", "95% Prediction interval"], ncol=3, loc=1)

<matplotlib.legend.Legend at 0x7f8f9de95cd0>

../_images/57ad156f56b05e78f1d4b5196b660d1ac2b3c9d45d6c348ffe916d4c78aa3818.png

Uncertainty of the contributions#

ci = ml.solver.ci_contribution("rch")
r = ml.get_contribution("rch")
ax = r.plot(figsize=(10, 3))
ax.fill_between(ci.index, ci.iloc[:, 0], ci.iloc[:, 1], color="lightgray")
ax.legend(["Simulation", "95% Prediction interval"], ncol=3, loc=1)
plt.tight_layout()

../_images/79e491cff2d585a2d4099386c5c76dee0b8c6de4f4bb1e95800152251bfb48d7.png

Custom Confidence intervals#

It is also possible to compute the confidence intervals manually, for example to estimate the uncertainty in the recharge or statistics (e.g., SGI, NSE). We can call ml.solver.get_parameter_sample to obtain random parameter samples from a multivariate normal distribution using the optimal parameters and the covariance matrix. Next, we use the parameter sets to obtain multiple simulations of ‘something’, here the recharge.

params = ml.solver.get_parameter_sample(n=1000, name="rch")
data = {}

# Here we run the model n times with different parameter samples
for i, param in enumerate(params):
    data[i] = ml.stressmodels["rch"].get_stress(p=param)

df = pd.DataFrame.from_dict(data, orient="columns").loc[tmin:tmax].resample("A").sum()
ci = df.quantile([0.025, 0.975], axis=1).transpose()

r = ml.get_stress("rch").resample("A").sum()
ax = r.plot.bar(figsize=(10, 2), width=0.5, yerr=[r - ci.iloc[:, 0], ci.iloc[:, 1] - r])
ax.set_xticklabels(labels=r.index.year, rotation=0, ha="center")
ax.set_ylabel("Recharge [mm a$^{-1}$]")
ax.legend(ncol=3);

/tmp/ipykernel_1338/1253942525.py:8: FutureWarning: 'A' is deprecated and will be removed in a future version, please use 'YE' instead.
  df = pd.DataFrame.from_dict(data, orient="columns").loc[tmin:tmax].resample("A").sum()
/tmp/ipykernel_1338/1253942525.py:11: FutureWarning: 'A' is deprecated and will be removed in a future version, please use 'YE' instead.
  r = ml.get_stress("rch").resample("A").sum()

../_images/7930a34f24e012977839d2de8e504a7c3f51e58fea21bb9571d8014b569bb733.png

Uncertainty of the NSE#

The code pattern shown above can be used for many types of uncertainty analyses. Another example is provided below, where we compute the uncertainty of the Nash-Sutcliffe efficacy.

params = ml.solver.get_parameter_sample(n=1000)
data = []

# Here we run the model n times with different parameter samples
for i, param in enumerate(params):
    sim = ml.simulate(p=param)
    data.append(ps.stats.nse(obs=ml.observations(), sim=sim))

fig, ax = plt.subplots(1, 1, figsize=(4, 3))
plt.hist(data, bins=50, density=True)
ax.axvline(ml.stats.nse(), linestyle="--", color="k")
ax.set_xlabel("NSE [-]")
ax.set_ylabel("frequency [-]")

from scipy.stats import norm
import numpy as np

mu, std = norm.fit(data)

# Plot the PDF.
xmin, xmax = ax.set_xlim()
x = np.linspace(xmin, xmax, 100)
p = norm.pdf(x, mu, std)
ax.plot(x, p, "k", linewidth=2)

[<matplotlib.lines.Line2D at 0x7f8f9e3cbb10>]

../_images/b3d1cc52c467692fed3946a017a82d2cba7a36c0d6b9aa7d26e562493c7d3bdb.png