Parameter (uncertainty) estimates benchmarked with synthetic time series#
Mark Bakker - Delft University of Technology
The purpose of this notebook is to demonstrate that the parameters estimated by Pastas are unbiased and that the coverage of the estimated uncertainty of the parameters is correct, i.e., the 95% confidence intervals of the estimated parameters contain the true parameter values approximately 95% of the time.
All examples in this notebook use synthetic data. The true head is simulated with a known response function. The performance of Pastas is evaluated for head series with different kinds of errors, both uncorrelated and correlated.
from platform import system as os_system
from typing import Callable, List, Tuple, Union
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.special import gammainc, gammaincinv
import pastas as ps
plt.rcParams["figure.figsize"] = (5, 3) # default figure size
ps.show_versions()
Pastas version: 1.9.0
Python version: 3.11.10
NumPy version: 2.2.5
Pandas version: 2.2.3
SciPy version: 1.15.2
Matplotlib version: 3.10.1
Numba version: 0.61.2
The following function generates AR1-correlated data with an underlying standard deviation \(\sigma_x\) and a lag-one correlation \(\rho\).
def generate_correlated_data(
N: int, sigx: float, rho: float, seed: int = 1
) -> np.ndarray[float]:
"""
N: length of array of correlated data
sigx: standard deviation of the correlated data
rho: correlation coefficient
seed: seed to be used in random number generation
"""
sige = np.sqrt(1 - rho**2) * sigx
np.random.seed(seed)
e = np.random.normal(0, sige, N - 1)
x = np.zeros(N)
x[0] = np.random.normal(0, sigx) # first point variance sigx
for j in range(1, len(x)):
x[j] = rho * x[j - 1] + e[j - 1]
return x
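As a quick sanity check (an addition here, not part of the original workflow), the sample standard deviation and lag-one autocorrelation of a long generated series should be close to the specified sigx and rho:
x = generate_correlated_data(N=100_000, sigx=0.1, rho=0.9, seed=42)
print(x.std())  # should be close to sigx = 0.1
print(np.corrcoef(x[:-1], x[1:])[0, 1])  # should be close to rho = 0.9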
Generation of synthetic head series (the truth)#
The synthetic head series are generated with a Gamma response function. The response functions are defined below for clarity (alternatively, the response function of Pastas can be used).
def gamma_tmax(A: float, n: float, a: float, cutoff: float = 0.999) -> float:
"""returns time for which step function equals cutoff * A"""
return gammaincinv(n, cutoff) * a
def gamma_step(
A: float, n: float, a: float, cutoff: float = 0.999
) -> np.ndarray[float]:
"""
returns gamma step function starting at t=0 with intervals of delt = 1
tmax is the time for which the step function reaches cutoff * A
"""
tmax = gamma_tmax(A, n, a, cutoff)
t = np.arange(0, tmax, 1)
s = A * gammainc(n, t / a)
return s
def gamma_block(
A: float, n: float, a: float, cutoff: float = 0.999
) -> np.ndarray[float]:
"""returns the gamma block response starting at t=0 with intervals of delt = 1"""
s = gamma_step(A, n, a, cutoff)
return np.append(s[0], s[1:] - s[:-1])
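For reference, gamma_step implements the step response \(s(t) = A\,P(n, t/a)\), where \(P\) is the regularized lower incomplete gamma function (scipy.special.gammainc), so that \(s(t) \to A\) as \(t \to \infty\). gamma_tmax inverts this relation with gammaincinv to find the time at which \(s(t) = \text{cutoff} \cdot A\).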
The Gamma response function requires three input arguments: A, n and a. The true values of the parameters used in this notebook are given below and the response function is drawn. The response function reflects the response to one unit of recharge during one day. For example, if the units of the recharge are mm/d, then the head response is in mm (in response to 1 mm of recharge). If the units of the recharge are m/d, then the head response is in m (in response to 1 m of recharge).
Atrue = 400 # true value of A
ntrue = 2  # true value of n
atrue = 100  # true value of a
dtrue = 20 # base level of heads
ptrue = np.array([Atrue, ntrue, atrue, dtrue]) # array with true parameter values
pnames = ["A", "n", "a", "d"] # names of parameters
hblock = gamma_block(Atrue, ntrue, atrue)
tmax = gamma_tmax(Atrue, ntrue, atrue)
plt.plot(hblock)
plt.xlabel("Time (days)")
plt.ylabel("Head response")
plt.title(f"Gamma block response with tmax={tmax:.0f} d")
plt.grid()

Create synthetic observations#
Rainfall is used as the input series to create a synthetic head series. The head series is purposely generated with explicit superposition rather than convolution, so that it is clear how the head is computed. The computed head at day 1 is the head at the end of day 1 due to rainfall during day 1.
def generate_heads(
Atrue: float,
ntrue: float,
atrue: float,
dtrue: float,
rain: pd.Series,
    yearstart: str,
    yearend: str,
) -> pd.Series:
"""yearstart and yearend are strings"""
step = gamma_block(Atrue, ntrue, atrue)[
1:
] # start head computation at end of first day
lenstep = len(step)
h = dtrue * np.ones(len(rain) + lenstep)
for i in range(len(rain)):
        h[i : i + lenstep] += rain.iloc[i] * step  # .iloc: positional indexing
head = pd.Series(index=rain.index, data=h[: len(rain)]).loc[yearstart:yearend]
return head
Synthetic heads are generated for the period 1990 through 1999. Computations start in 1980, so that the first ten years serve as a warm-up period; the series with daily rainfall starts in 1980.
rain = (
pd.read_csv("../examples/data/rain_260.csv", index_col=0, parse_dates=[0]).squeeze()
/ 1000
)
head = generate_heads(Atrue, ntrue, atrue, dtrue, rain, "1990", "1999")
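As a cross-check (an addition, reusing the functions defined above), the superposition loop in generate_heads is equivalent to a discrete convolution of the rainfall with the block response:
step = gamma_block(Atrue, ntrue, atrue)[1:]  # same block response as in generate_heads
h_conv = pd.Series(
    index=rain.index, data=dtrue + np.convolve(rain.values, step)[: len(rain)]
).loc["1990":"1999"]
print(np.allclose(head.values, h_conv.values))  # expected: True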
#
plt.figure(figsize=(10, 3))
plt.plot(head, "k.", label="head", markersize=1)
plt.legend(loc=0)
plt.ylabel("head (m)")
plt.xlabel("time (years)")
plt.grid()

Synthetic heads that contain no errors#
In this first test, it is demonstrated that Pastas recovers the correct parameter values when the heads contain no errors.
ml = ps.Model(oseries=head)
sm = ps.StressModel(rain, rfunc=ps.Gamma(cutoff=0.999), name="rain")
ml.add_stressmodel(sm)
ml.solve()
Fit report None Fit Statistics
==================================================
nfev 17 EVP 100.00
nobs 3652 R2 1.00
noise False RMSE 0.00
tmin 1990-01-01 00:00:00 AICc -228576.40
tmax 1999-12-31 00:00:00 BIC -228551.60
freq D Obj 0.00
warmup 3650 days 00:00:00 ___
solver LeastSquares Interp. No
Parameters (4 optimized)
==================================================
optimal initial vary
rain_A 400.0 217.313623 True
rain_n 2.0 1.000000 True
rain_a 100.0 10.000000 True
constant_d 20.0 20.901710 True
Warnings! (1)
==================================================
Response tmax for 'rain' > than warmup period.
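The optimized parameters are exactly equal to the true values and the RMSE is zero, which demonstrates that Pastas recovers the true parameters when the heads contain no errors.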
Perform experiments to test whether the uncertainty estimate of the parameters is reasonable#
def model1(
headseries: pd.Series, rain: pd.Series = rain, returnmodel: bool = False
) -> Union[ps.Model, Tuple[np.ndarray[float], np.ndarray[float], float, float]]:
"""
Function that creates pastas model for given headseries, rain, a linear
recharge model and a Gamma response function
Returns:
    Optimal parameters, their estimated standard errors, r-squared, and RMSE
"""
ml = ps.Model(oseries=headseries)
sm = ps.StressModel(rain, rfunc=ps.Gamma(cutoff=0.999), name="rain")
ml.add_stressmodel(sm)
ml.solve(report=False)
if returnmodel:
return ml
return (
ml.parameters["optimal"].values,
ml.parameters["stderr"].values,
ml.stats.rsq(),
ml.stats.rmse(),
)
def model2(
headseries: pd.Series, rain: pd.Series = rain, returnmodel: bool = False
) -> Union[ps.Model, Tuple[np.ndarray[float], np.ndarray[float], float, float]]:
"""
Function that creates pastas model for given headseries, rain, a linear
recharge model, a Gamma response function, and an AR1 noise model
Returns:
    Optimal parameters, their estimated standard errors, r-squared, and RMSE
"""
ml = ps.Model(oseries=headseries)
sm = ps.StressModel(rain, rfunc=ps.Gamma(cutoff=0.999), name="rain")
ml.add_stressmodel(sm)
nm = ps.ArNoiseModel()
ml.add_noisemodel(nm)
ml.solve(report=False)
if returnmodel:
return ml
return (
ml.parameters["optimal"].values,
ml.parameters["stderr"].values,
ml.stats.rsq(),
ml.stats.rmse(),
)
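As an illustration (an addition, not part of the original notebook), model1 can be applied to the error-free head series generated above; the returned optimal parameters should be close to the true values:
p_opt, p_std, r2, rmse = model1(head)
print(p_opt)  # expected to be close to [400, 2, 100, 20]
print(r2)  # expected to be close to 1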
A numerical experiment is conducted to test whether Pastas estimates the uncertainty of the parameters correctly. In each run, the synthetic head consists of the true head plus an error; different errors are generated every time. A Pastas model is created of the synthetic head series with the introduced errors, and it is determined whether the true parameters are within the 95% confidence intervals of the estimated parameters. The experiment consists of 1000 runs of heads with different random errors.
def experiment(
nexp: int,
nparam: int,
model: Callable,
ptrue: np.ndarray[float],
pnames: List[str],
sigh: float = 0.1,
rho: float = 0,
parallel: bool = True,
):
"""
nexp: number of times to run the experiment
nparam: number of parameters to estimate
model: model function to call (as defined above)
    ptrue: array with true values of the parameters
    pnames: names of the parameters
    sigh: standard deviation of the errors in the synthetic head
    rho: lag-one correlation between the errors in the synthetic head
    parallel: if True, run the experiments in parallel (Linux only)
    Returns:
    p: DataFrame with estimated parameters for each of the nexp runs
    sig: DataFrame with estimated standard errors for each of the nexp runs
    coverage: DataFrame with 1 row and nparam columns with the coverage
    of each parameter
    rsq: array with the R-squared value of each run
    rmse: array with the RMSE of each run
"""
head = generate_heads(Atrue, ntrue, atrue, dtrue, rain, "1990", "1999")
head_errors = [
head + generate_correlated_data(len(head), sigx=sigh, rho=rho, seed=i)
for i in range(nexp)
]
if parallel and os_system() == "Linux": # Windows and MacOS do not support fork
from concurrent.futures import ProcessPoolExecutor
with ProcessPoolExecutor() as executor:
p, sig, rsq, rmse = zip(*executor.map(model, head_errors))
p = np.array(p, dtype=float)
sig = np.array(sig, dtype=float)
rsq = np.array(rsq, dtype=float)
rmse = np.array(rmse, dtype=float)
    else:
        p = np.empty((nexp, nparam))
        sig = np.empty((nexp, nparam))
        rsq = np.empty(nexp)
        rmse = np.empty(nexp)
        for i in range(nexp):
            if i % 100 == 0:
                print(".", end="")  # progress indicator
            p[i], sig[i], rsq[i], rmse[i] = model(head_errors[i])
        print("\n")
coverage = (np.abs(p - ptrue) < 1.96 * sig).sum(0)
p_df = pd.DataFrame(p, columns=pnames)
sig_df = pd.DataFrame(sig, columns=pnames)
cov_df = pd.DataFrame(
coverage[np.newaxis, :] * 100 / nexp,
columns=pnames,
index=["coverage percentage"],
)
return p_df, sig_df, cov_df, rsq, rmse
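The cells that run Experiments 1 and 2 (producing p1, sig1, p2, and sig2, which are used below) are not included in this excerpt. A minimal sketch of what they presumably look like is given here; the value \(\rho=0.5\) for the day-to-day correlation of the errors is an assumption, as the value actually used is not shown:
nexp = 1000  # number of runs (see the description of the experiment above)
pnames2 = pnames + ["alpha"]  # the AR1 noise model adds the parameter alpha
# Experiment 1: no noise model; rho=0.5 is an assumed correlation
p1, sig1, coverage1, rsq1, rmse1 = experiment(
    nexp=nexp, nparam=4, model=model1, ptrue=ptrue, pnames=pnames, sigh=0.1, rho=0.5
)
# Experiment 2: AR1 noise model; the true alpha follows from rho
ptrue1 = np.hstack((ptrue, -1 / np.log(0.5)))
p2, sig2, coverage2, rsq2, rmse2 = experiment(
    nexp=nexp, nparam=5, model=model2, ptrue=ptrue1, pnames=pnames2, sigh=0.1, rho=0.5
)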
Parameters are estimated correctly when no noise model is added#
In Experiments 1 and 2, the parameters were estimated for synthetic head data with correlated errors in two ways: first with a Pastas model without a noise model, and second with a Pastas model with a noise model. It was shown that the Pastas model without a noise model did a poor job of estimating the standard errors of the parameters, as expected. But this does not mean that the estimated parameters themselves are biased. On the contrary, the following two graphs demonstrate that the parameters estimated with and without a noise model are essentially the same. The first graph shows histograms of the estimated parameters with and without the noise model; note that the histograms largely overlap. The second graph plots the estimated values with a noise model vs. the estimated values without a noise model. These values plot almost on a straight 45\(^\circ\) line, which indicates that the estimates differ only slightly with or without a noise model.
plt.figure(figsize=(8, 3))
for i in range(4):
plt.subplot(2, 2, i + 1)
plt.hist(p1.iloc[:, i], bins=20, density=True)
plt.hist(p2.iloc[:, i], bins=20, density=True, alpha=0.5)
if i == 1:
plt.legend(["w/o noise model", "w/ noise model"], bbox_to_anchor=(1.05, 1))
plt.title(pnames[i], fontsize="small")
plt.tight_layout()

plt.figure(figsize=(6, 6))
for i in range(4):
plt.subplot(2, 2, i + 1, aspect=1)
plt.plot(p1.iloc[:, i], p2.iloc[:, i], ".", markersize=1)
plt.xlabel("w/o noise model")
plt.ylabel("w/ noise model")
plt.title(pnames[i], fontsize="small")
plt.tight_layout()

The ultimate test of whether the parameters estimated without a noise model differ significantly from the parameters estimated with a noise model is to check how often the parameters estimated without a noise model fall within the 95% confidence interval of the parameters estimated with the noise model.
(np.abs(p1.values - p2.values[:, :-1]) < 1.96 * sig2.values[:, :-1]).sum(0)
array([1000, 1000, 998, 1000])
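For parameters \(A\), \(n\), and \(d\) this is the case in all 1000 runs, and for parameter \(a\) in 998 of the 1000 runs, which confirms that the estimates with and without a noise model are essentially the same.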
Parameter uncertainty is larger when the correlation is larger#
The experiments are repeated for the case that the standard deviation of the errors is still \(\sigma=0.1\) but the correlation between errors from one day to the next is increased to \(\rho=0.99\) (\(\alpha\approx 99.5\) d). The coverage of the estimated parameter uncertainty is somewhat smaller than 95% when using the AR1 noise model.
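For reference, the parameter \(\alpha\) of the AR1 noise model relates to the day-to-day correlation as \(\rho = e^{-\Delta t / \alpha}\), so for daily data \(\alpha = -1/\ln\rho = -1/\ln 0.99 \approx 99.5\) d, which is the relation used in the code below to compute alphatrue.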
sigh = 0.1
rho = 0.99
p3, sig3, coverage3, rsq3, rmse3 = experiment(
nexp=nexp, nparam=4, model=model1, ptrue=ptrue, pnames=pnames, sigh=sigh, rho=rho
)
print("coverage without noise model")
display(coverage3)
#
alphatrue = -1 / np.log(rho)
ptrue2 = np.hstack((ptrue, alphatrue))
p4, sig4, coverage4, rsq4, rmse4 = experiment(
nexp=nexp, nparam=5, model=model2, ptrue=ptrue2, pnames=pnames2, sigh=sigh, rho=rho
)
print("coverage with AR1 noise model")
display(coverage4)
coverage without noise model

| | A | n | a | d |
|---|---|---|---|---|
| coverage percentage | 11.9 | 20.1 | 18.4 | 10.2 |

coverage with AR1 noise model

| | A | n | a | d | alpha |
|---|---|---|---|---|---|
| coverage percentage | 89.8 | 93.0 | 90.5 | 89.7 | 77.7 |
It is sometimes difficult to estimate the parameters when the correlation \(\rho\) is very large#
The difficulty arises in the model without a noise model: it seems that the wrong optimum is sometimes found, and the parameter \(n\) is especially difficult to estimate. This requires further investigation.
plt.figure(figsize=(8, 3))
for i in range(4):
plt.subplot(2, 2, i + 1)
plt.hist(p3.iloc[:, i], bins=20, density=True)
plt.hist(p4.iloc[:, i], bins=20, density=True, alpha=0.5)
if i == 1:
plt.legend(["w/o noise model", "w noise model"], bbox_to_anchor=(1.05, 1))
plt.title(pnames[i], fontsize="small")
plt.tight_layout()

plt.figure(figsize=(6, 4))
for i in range(4):
plt.subplot(2, 2, i + 1, aspect=1)
plt.plot(p3.iloc[:, i], p4.iloc[:, i], ".", markersize=1)
plt.title(pnames[i], fontsize="small")
plt.xlabel("w/o noise model")
plt.ylabel("w/ noise model")
plt.tight_layout()

The parameters estimated without a noise model are sometimes significantly different from parameters estimated with a noise model. Below, it is computed how many times the parameters estimated without a noise model are within the 95% confidence interval of the parameters estimated with the noise model. The largest deviation is indeed for the parameter \(n\) as could also be seen from the graphs above.
(np.abs(p3.values - p4.values[:, :-1]) < 1.96 * sig4.values[:, :-1]).sum(0)
array([985, 716, 878, 985])