Ensemble Predictions#

R.A. Collenteur, Eawag, 2025

This notebook shows how to use Pastas and meteorological ensemble forecasts to generate ensemble groundwater predictions (EGPs). The goal is to forecast the groundwater levels for a well in Switzerland (Gossau), one month ahead. Meteorological ensemble forecasts of the ECMWF with 51 ensemble members are used as data input. These members represent the uncertainty in the meteorological input data.

The Pastas model is calibrated on 10 years of head data prior to the start of the forecast, using meteorological data provided by MeteoSwiss. In this example, meteorological forecasts are used, but it is straightforward to extend this to other input data such as ensembles of pumping forecasts.

Note: Collenteur et al. (In submission) Ensemble groundwater predictions (EGP) in alluvial aquifers in Switzerland.

0. Import Python Packages#

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import pastas as ps

ps.set_log_level("INFO")
ps.show_versions()

Pastas     : 1.14.0
Python     : 3.11.12
Numpy      : 2.4.3
Pandas     : 2.3.3
Scipy      : 1.17.1
Matplotlib : 3.10.8
Numba      : 0.64.0

1. Load data#

Below the head and meteorological data is loaded for the well in Gossau, Switzerland.

head = pd.read_csv("data_forecast/heads.csv", index_col=0, parse_dates=True).squeeze()
prec = pd.read_csv("data_forecast/prec.csv", index_col=0, parse_dates=True).squeeze()
evap = pd.read_csv("data_forecast/evap.csv", index_col=0, parse_dates=True).squeeze()
temp = pd.read_csv("data_forecast/temp.csv", index_col=0, parse_dates=True).squeeze()

ps.plots.series(
    head,
    [prec, evap, temp],
    tmin="2004",
    tmax="2014",
    titles=False,
    labels=["Head\n[m]", "Prec.\n[mm/d]", "Evap.\n[mm/d]", "Temp.\n[C]"],
    table=True,
)
plt.tight_layout()

../_images/1ac5fe8681250068f92c2205b8083714ded6c22eba5143f9846e896bb2bf691c.png

2. Make Pastas Model#

We now make a Pastas model to simulate the heads for this monitoring well in Gossau. Only meterological data, which is also available as forecast data, is used to model the groundwater levels. A nonlinear recharge model including a snow module is applied to compute the recharge. The model is calibrated on weekly groundwater level data in the period 2004-2014.

ml = ps.Model(head)
ml.add_stressmodel(
    ps.RechargeModel(
        prec,
        evap,
        rfunc=ps.Gamma(),
        recharge=ps.rch.FlexModel(snow=True),
        temp=temp,
        name="rch",
    )
)

ml.set_parameter("rch_tt", vary=False)
ml.solve(
    tmin="2004", tmax="2014-02-01", report=True, fit_constant=False, freq_obs="10D"
)
ml.add_noisemodel(ps.ArNoiseModel())
ml.set_parameter("rch_srmax", vary=False)
ml.solve(
    tmin="2004", tmax="2014-02-01", initial=False, fit_constant=False, freq_obs="10D"
)

Fit report Gossau                   Fit Statistics
==================================================
nfev     41                     EVP          80.91
nobs     370                    R2            0.81
noise    False                  RMSE          0.21
tmin     2004-01-01 00:00:00    AICc      -1141.68
tmax     2014-02-01 00:00:00    BIC       -1110.77
freq     D                      Obj           8.09
freq_obs 10D                    ___               
warmup   3650 days 00:00:00     Interp.         No
solver   LeastSquares           weights         No

Parameters (8 optimized)
==================================================
               optimal    initial   vary
rch_A         0.460337    0.40514   True
rch_n         1.266660    1.00000   True
rch_a        19.727775   10.00000   True
rch_srmax    75.598819  250.00000   True
rch_lp        0.250000    0.25000  False
rch_ks       33.201024  100.00000   True
rch_gamma     8.255618    2.00000   True
rch_kv        1.087088    1.00000   True
rch_simax     2.000000    2.00000  False
rch_tt        0.000000    0.00000  False
rch_k         6.195118    4.00000   True
constant_d  637.512321    0.00000  False
Fit report Gossau                   Fit Statistics
==================================================
nfev     26                     EVP          79.81
nobs     370                    R2            0.80
noise    True                   RMSE          0.22
tmin     2004-01-01 00:00:00    AICc      -1406.78
tmax     2014-02-01 00:00:00    BIC       -1375.87
freq     D                      Obj           3.95
freq_obs 10D                    ___               
warmup   3650 days 00:00:00     Interp.         No
solver   LeastSquares           weights         No

Parameters (8 optimized)
==================================================
                optimal    initial   vary
rch_A          0.446321   0.460337   True
rch_n          1.097789   1.266660   True
rch_a         24.817030  19.727775   True
rch_srmax     75.598819  75.598819  False
rch_lp         0.250000   0.250000  False
rch_ks        33.091309  33.201024   True
rch_gamma      5.815464   8.255618   True
rch_kv         1.029982   1.087088   True
rch_simax      2.000000   2.000000  False
rch_tt         0.000000   0.000000  False
rch_k          5.725803   6.195118   True
constant_d   637.476813   0.000000  False
noise_alpha   31.999374   1.000000   True

The Time Series 'Gossau' has nan-values. Pastas will use the fill_nan settings to fill up the nan-values.
The Time Series 'Gossau' has nan-values. Pastas will use the fill_nan settings to fill up the nan-values.

Visualize the model results#

ml.plots.results();

../_images/d273bd2d8eeafe38934c2094e67bd79ce10d3e4d79eb460ecbb485e58b47389f.png

3. Prepare the forecast ensembles#

Now that we have a calibrated Pastas model, we can prepare the forecast data used to generate the groundwater ensemble predictions. The forecast data should be prepared carefully as a dictionary. For each stressmodel, one item in the dictionary should be provided where the key is the stressmodel name (i.e., same as in ml.stressmodels) and the value a list or dictionary of pandas.DataFrames.

Each DataFrame should have the same DateimeIndex with the dates of the forecasts of the input data. It should also have the same number of columns, where each column represents an ensemble member (i.e., the precipitation and evaporation data need the same number of members). All these properties are internally checked by the internal method ps.forecast._check_forecast_data and an error is raised if something is wrong.

fc = {
    "rch": {
        "rech": pd.read_csv(
            "data_forecast/ensemble_prec.csv", index_col=0, parse_dates=True
        ),
        "evap": pd.read_csv(
            "data_forecast/ensemble_evap.csv", index_col=0, parse_dates=True
        ),
        "temp": pd.read_csv(
            "data_forecast/ensemble_temp.csv", index_col=0, parse_dates=True
        ),
    }
}

fig, axes = plt.subplot_mosaic([list(fc["rch"].keys())], figsize=(9.5, 3.5))
for k, series in fc["rch"].items():
    series = series.cumsum() if k != "temp" else series
    title = "Cumulative " + k.capitalize() if k != "temp" else k.capitalize()
    series.plot(legend=False, ax=axes[k], color="k", alpha=0.7, title=title)

../_images/56b55147b12fe71d1393f06f71df3272f32197b597fb7e741b6b7f59aae8b4a5.png

4. Compute GWL forecasts#

We can now generate the forecasts of the groundwater levels using the calibrated model. We can include the parameter uncertainty, by drawing multiple parameter sets (i.e., \(N\) parameter sets). The model is run with each of the \(N\) parameter sets for all \(M\) ensemble member sets (i.e., set of precipitation, evaporation and temperature), resulting in \(N\) x \(M\) forecasts.

For each individual forecast the mean forecast and the variance of the error or noise is returned, depending on whether or not post_process is True or False. If True, the forecast is:

corrected using the last GWL measurement and the fitted noisemodel, and
the variance of the error is computed using the noise and the noisemodel.

If False, the forecast is not corrected and the variance of the residuals is returned. Thus, for each combination of a forecast of the ensemble members and the parameter sets a mean forecast and the variance of theb forecast is returned. These can be used to compute a prediction interval.

# Draw parameter sets
params = ml.solver.get_parameter_sample(n=10)
params[:2]

array([[ 4.13975930e-01,  1.03738627e+00,  2.57370638e+01,
         7.55988186e+01,  2.50000001e-01,  3.10410763e+01,
         5.83214546e+00,  1.10041079e+00,  2.00000000e+00,
        -1.86916632e-18,  5.34042850e+00,  6.37476813e+02,
         2.45330577e+01],
       [ 4.25533423e-01,  1.16418504e+00,  2.13516387e+01,
         7.55988186e+01,  2.49999996e-01,  3.73260742e+01,
         5.96156331e+00,  9.47760229e-01,  2.00000000e+00,
         1.36936071e-18,  5.95932827e+00,  6.37476813e+02,
         3.06580163e+01]])

# Generate the forecast
df_nopp = ps.forecast.forecast(ml, fc, p=params, post_process=False)
df_pp = ps.forecast.forecast(ml, fc, p=params, post_process=True)

df_pp.head()

The Time Series 'Gossau' has nan-values. Pastas will use the fill_nan settings to fill up the nan-values.
The Time Series 'Gossau' has nan-values. Pastas will use the fill_nan settings to fill up the nan-values.
The Time Series 'Gossau' has nan-values. Pastas will use the fill_nan settings to fill up the nan-values.
The Time Series 'Gossau' has nan-values. Pastas will use the fill_nan settings to fill up the nan-values.

ensemble_member	0										...	50
param_member	0		1		2		3		4		...	5		6		7		8		9
forecast	mean	var	mean	var	mean	var	mean	var	mean	var	...	mean	var	mean	var	mean	var	mean	var	mean	var
2014-01-02	638.027890	0.003146	638.030928	0.002911	638.029550	0.003188	638.032329	0.003021	638.034288	0.002965	...	638.030335	0.003176	638.026979	0.002944	638.033180	0.002857	638.035265	0.002808	638.029222	0.002922
2014-01-03	638.020257	0.006047	638.024596	0.005638	638.022550	0.006112	638.026456	0.005823	638.029854	0.005730	...	638.027009	0.006094	638.022425	0.005717	638.030917	0.005555	638.034908	0.005461	638.025755	0.005657
2014-01-04	638.011979	0.008720	638.017430	0.008193	638.014798	0.008794	638.019852	0.008421	638.024552	0.008308	...	638.021903	0.008776	638.015940	0.008329	638.027183	0.008103	638.032344	0.007967	638.020438	0.008216
2014-01-05	638.004279	0.011184	638.010538	0.010586	638.007313	0.011254	638.013769	0.010830	638.019715	0.010711	...	638.081625	0.011240	638.080162	0.010788	638.073769	0.010509	638.089195	0.010336	638.081304	0.010611
2014-01-06	638.019350	0.013455	638.024295	0.012829	638.018420	0.013510	638.030339	0.013064	638.039881	0.012952	...	638.134256	0.013505	638.130552	0.013104	638.117934	0.012781	638.139906	0.012573	638.128348	0.012852

5 rows × 1020 columns

The returned variable df is a DataFrame containing the ensemble groundwater predictions. The columns of df are a MultiIndex with the first row the \(M\) ensemble members (i.e., a single meteorological forecast), the second row the \(N\) parameter sets, and the third row three columns with the mean prediction and the variance of the error that can be used to construct the prediction interval.

5. Compute the overall mean and variance#

The forecast method returns a matrix with with \(N\) x \(M\) forecasts of the mean and variance. From these, we can compute the mean forecast over all ensembles and parameter sets easily. For the variance, which we use to compute the (for example) the 95% prediction interval, we apply the law of total variance. The forecast method provides a convenience method to do this: ps.forecast.get_overall_mean_and_variance. The method takes the output dataframe df and returns two pandas.Series with the overall mean and the variance.

fig, axes = plt.subplots(1, 2, figsize=(10.0, 4.0), sharey=True, layout="tight")

suptitles = ["Without post-processing", "With post-processing"]

for i, (df, label) in enumerate(zip([df_nopp, df_pp], ["No PP", "PP"])):
    mean, var = ps.forecast.get_overall_mean_and_variance(df)
    std = np.sqrt(var)

    ax = axes[i]
    ml.oseries.series.loc[df.index].plot(
        ax=ax, marker=".", color="k", linestyle="None", zorder=100
    )

    mean.plot(ax=ax, alpha=0.5, zorder=100)
    ax.plot(mean, color="k", label="Mean")
    ax.fill_between(
        mean.index,
        mean - std * 1.96,
        mean + std * 1.96,
        color="k",
        alpha=0.2,
        label="1 std",
    )

    df.loc[:, (slice(None), slice(None), "mean")].plot(
        color="gray", legend=False, ax=ax
    )
    ax.set_title(suptitles[i], x=0.01, y=0.93, ha="left", va="top")

axes[-1].legend(
    ["Observations", "Mean forecast", "Ensemble Members", "95% Prediction interval"],
    loc="lower right",
    ncol=2,
    bbox_to_anchor=(1.0, 1.0),
    frameon=False,
)

axes[0].set_ylabel("Head [m]")

/tmp/ipykernel_1004/3313499384.py:15: UserWarning: This axis already has a converter set and is updating to a potentially incompatible converter
  ax.plot(mean, color="k", label="Mean")
/home/docs/checkouts/readthedocs.org/user_builds/pastas/envs/v1.14.0/lib/python3.11/site-packages/pandas/plotting/_matplotlib/core.py:981: UserWarning: This axis already has a converter set and is updating to a potentially incompatible converter
  return ax.plot(*args, **kwds)
/tmp/ipykernel_1004/3313499384.py:15: UserWarning: This axis already has a converter set and is updating to a potentially incompatible converter
  ax.plot(mean, color="k", label="Mean")
/home/docs/checkouts/readthedocs.org/user_builds/pastas/envs/v1.14.0/lib/python3.11/site-packages/pandas/plotting/_matplotlib/core.py:981: UserWarning: This axis already has a converter set and is updating to a potentially incompatible converter
  return ax.plot(*args, **kwds)

Text(0, 0.5, 'Head [m]')

../_images/88b7df1d17700b57b9d7be0e970d0bc2dad1dbf5c965c1d76e36d8ef0252be68.png

In the figure above we see the two ensemble predictions, without (left) and with (right) post-processing. What we can see is that the prediction intervals widen over time when using the post/processing, whereas without it, these remain more stable and only widen due to the increased uncertainty in the ensemble members. In general, we can see that the largest part of the uncertainty is due to the uncertainty in the input data, as visible bz the mean forecasts of the ensemble members (the grey lines) covering a large part of the prediction intervals.