Autocorrelation for irregular time series#
R.A. Collenteur, Artesia Water & University of Graz, 2020
In this notebook the autocorrelation function for irregular time steps that is built-in in Pastas is tested and validated on synthetic data. The methodology for calculating the autocorrelation is based on Rehfeld et al. (2011) and Edelson and Krolik (1978). The methods are available through pastas.stats
package (e.g. pastas.stats.acf(series)
). The full report of the methods underlying this Notebook are available in Collenteur (2018, in Dutch).
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pastas as ps
DeprecationWarning: As of Pastas 1.5, no noisemodel is added to the pastas Model class by default anymore. To solve your model using a noisemodel, you have to explicitly add a noisemodel to your model before solving. For more information, and how to adapt your code, please see this issue on GitHub: https://github.com/pastas/pastas/issues/735
Create synthetic time series#
Two synthetic time series are created with a known autocorrelation and a regular time interval. The first is a sine wave with a wave period of one year. The second is a series of correlated noise generated through an AR(1) process. Both synthetic time series have a length of ten years and a daily observation time step (\(n=3650\)). From both time series three time series with irregular time steps are created:
Time Series 1: 3 years of monthly data, 3 years of bi-weekly data, and 4 years of daily data.
Time Series 2: 3 years of monthly data, 3 years of bi-weekly data, a one year gap, and 4 years of daily data.
Time Series 3: reindex time series using the indices from a real groundwater level time series.
index_test = (
pd.read_csv(
"../examples/data/test_index.csv", parse_dates=True, index_col=0, names=["Date"]
)
.squeeze()
.index.ceil("D")
.unique()
)
n_years = 10
index = pd.to_datetime(np.arange(0, n_years * 365, 1), unit="D", origin="2005")
index.name = "Time [Years]"
# 1. Sine timeseries 1: equal timesteps
np.random.seed(0)
data = np.sin(
np.linspace(0, n_years * 2 * np.pi, len(index))
) # +np.random.rand(index.size)
v = pd.Series(data=data, index=index, name="STS_SIN")
# 2. Sine timeseries with three frequencies
v1 = pd.concat(
[
v.iloc[: -7 * 365].asfreq("30D"),
v.iloc[-7 * 365 : -4 * 365].asfreq("14D"),
v.iloc[-4 * 365 :],
]
)
v1.name = "STS_SIN1"
# 3. Sine timeseries with three frequencies and a data gap
v2 = pd.concat(
[
v.iloc[: -7 * 365].asfreq("30D"),
v.iloc[-7 * 365 : -4 * 365].asfreq("14D"),
v.iloc[-3 * 365 :],
]
)
v2.name = "STS_SIN2"
# 4. Sine timeseries with indices based on a true groundwater level measurement indices
v3 = v.reindex(index_test).dropna()
v3.name = "STS_SIN3"
# Convoluting a random noise process with a exponential decay function to
# obtain a autocorrelation timeseries similar to an Auto Regressive model
# of order 1 (AR(1))
alpha = 10
np.random.seed(0)
n = np.random.rand(index.size + 365)
b = np.exp(-np.arange(366) / alpha)
n = np.convolve(n, b, mode="valid")
n = n - n.mean()
index = pd.to_datetime(np.arange(0, n.size, 1), unit="D", origin="2005")
index.name = "Time [Years]"
n = pd.Series(data=n, index=index, name="STS_EXP")
n1 = n.reindex(v1.index).dropna()
n1.name = "STS_EXP1"
n2 = n.reindex(v2.index).dropna()
n2.name = "STS_EXP2"
n3 = n.reindex(index_test).dropna()
n3.name = "STS_EXP3"
# Create a DataFrame with all series and plot them all
d = pd.concat([v, v1, v2, v3, n, n1, n2, n3], axis=1)
d.plot(subplots=True, figsize=(10, 6), marker=".", markersize=2, color="steelblue");
Theoretical autocorrelation#
In this codeblock a figure is created showing the time series with equidistant timesteps and their theoretical autocorrelation.
# Compute true autocorrelation functions
acf_v_true = pd.Series(np.cos(np.linspace(0, 2 * np.pi, 365)))
acf_n_true = pd.Series(b)
Computing the autocorrelation#
The autocorrelation function for all time series is calculated for a number of time lags. Two different methods are used:
binning in a rectangular bin
weighting through a Gaussian kernel
Warning
Calculation of all the autocorrelation functions can take several minutes!
lags = np.arange(1.0, 366)
acf_df = pd.DataFrame(index=lags)
for name, sts in d.items():
for method in ["rectangle", "gaussian"]:
acf = ps.stats.acf(sts.dropna(), bin_method=method, max_gap=30)
acf.index = acf.index.days
acf_df.loc[:, name + "_" + method] = acf
---------------------------------------------------------------------------
KeyboardInterrupt Traceback (most recent call last)
File ~/checkouts/readthedocs.org/user_builds/pastas/envs/latest/lib/python3.11/site-packages/numba/core/serialize.py:30, in _numba_unpickle(address, bytedata, hashed)
27 _unpickled_memo = {}
---> 30 def _numba_unpickle(address, bytedata, hashed):
31 """Used by `numba_unpickle` from _helperlib.c
32
33 Parameters
(...)
42 unpickled object
43 """
KeyboardInterrupt:
The above exception was the direct cause of the following exception:
SystemError Traceback (most recent call last)
SystemError: <function _numba_unpickle at 0x7f21eb9744a0> returned a result with an exception set
The above exception was the direct cause of the following exception:
SystemError Traceback (most recent call last)
Cell In[4], line 6
4 for name, sts in d.items():
5 for method in ["rectangle", "gaussian"]:
----> 6 acf = ps.stats.acf(sts.dropna(), bin_method=method, max_gap=30)
7 acf.index = acf.index.days
8 acf_df.loc[:, name + "_" + method] = acf
File ~/checkouts/readthedocs.org/user_builds/pastas/envs/latest/lib/python3.11/site-packages/pastas/stats/core.py:99, in acf(x, lags, bin_method, bin_width, max_gap, min_obs, full_output, alpha)
34 def acf(
35 x: Series,
36 lags: ArrayLike = 365,
(...)
42 alpha: float = 0.05,
43 ) -> Union[Series, DataFrame]:
44 """Calculate the autocorrelation function for irregular time steps.
45
46 Parameters
(...)
97 statsmodels.api.tsa.acf
98 """
---> 99 c = ccf(
100 x=x,
101 y=x,
102 lags=lags,
103 bin_method=bin_method,
104 bin_width=bin_width,
105 max_gap=max_gap,
106 min_obs=min_obs,
107 full_output=full_output,
108 alpha=alpha,
109 )
110 c.name = "ACF"
111 if full_output:
File ~/checkouts/readthedocs.org/user_builds/pastas/envs/latest/lib/python3.11/site-packages/pastas/stats/core.py:188, in ccf(x, y, lags, bin_method, bin_width, max_gap, min_obs, full_output, alpha)
186 c, b = _compute_ccf_rectangle(lags, t_x, x, t_y, y, bin_width)
187 elif bin_method == "gaussian":
--> 188 c, b = _compute_ccf_gaussian(lags, t_x, x, t_y, y, bin_width)
189 elif bin_method == "regular":
190 c, b = _compute_ccf_regular(arange(1.0, len(lags) + 1), x, y)
SystemError: CPUDispatcher(<function _compute_ccf_gaussian at 0x7f21eb52de40>) returned a result with an exception set
Autocorrelation for regular time series#
fig, axes = plt.subplots(3, 2, figsize=(10, 7), sharey="row")
for i, name in enumerate(["STS_SIN", "STS_EXP"]):
sts = d.loc[:, name]
sts.plot(ax=axes[0][i], style=".", legend=True, label=name)
if name == "STS_SIN":
acf_true = acf_v_true
else:
acf_true = acf_n_true
acf_true.plot(ax=axes[1][i])
axes[2][i].plot([0.0, 365], [0, 0], label="true ACF")
for bm in ["gaussian", "rectangle"]:
acf_name = name + "_" + bm
acf = acf_df.loc[:, acf_name]
if bm == "gaussian":
kwargs = dict(marker="x", markersize=10)
else:
kwargs = dict(marker="o")
acf.plot(label=bm, ax=axes[1][i], logx=True, linestyle="", **kwargs)
dif = acf.subtract(acf_true).dropna()
rmse = " (RMSE={:.2f})".format(np.sqrt((dif.pow(2)).sum() / dif.size))
dif.plot(label=bm + rmse, ax=axes[2][i], logx=True, **kwargs)
axes[2][i].legend()
axes[1][i].set_xticks([])
axes[2][i].set_xlabel("Lags [Days]")
axes[2][i].set_xticks([1, 10, 30, 60, 180, 365])
axes[2][i].set_xticklabels([1, 10, 30, 60, 180, 365])
ax = axes[0][0].set_ylabel("Noise [L]")
axes[1][0].set_ylabel("ACF [-]")
axes[2][0].set_ylabel("Error [L]");
Sine wave with non-equidistant timesteps#
fig, axes = plt.subplots(3, 3, figsize=(16, 9), sharey="row")
#
for i, name in enumerate(["STS_SIN1", "STS_SIN2", "STS_SIN3"]):
sts = d.loc[:, name]
sts.plot(ax=axes[0][i], style=".", label=name)
axes[0][i].legend(loc=3)
acf_v_true.plot(ax=axes[1][i])
axes[2][i].plot([0.0, 365], [0, 0], label="true ACF")
for bm in ["gaussian", "rectangle"]:
acf_name = name + "_" + bm
acf = acf_df.loc[:, acf_name]
if bm == "gaussian":
kwargs = dict(marker="x", markersize=10)
else:
kwargs = dict(marker="o")
acf.plot(label=bm, ax=axes[1][i], logx=True, linestyle="", **kwargs)
dif = acf.subtract(acf_v_true).dropna()
rmse = " (RMSE={:.2f})".format(np.sqrt((dif.pow(2)).sum() / dif.size))
dif.plot(label=bm + rmse, ax=axes[2][i], logx=True, **kwargs)
axes[1][i].set_xticks([])
axes[2][i].set_xticks([1, 10, 30, 60, 180, 365])
axes[2][i].set_xticklabels([1, 10, 30, 60, 180, 365])
axes[2][i].legend(loc=3)
axes[2][i].set_xlabel("Lags [Days]")
axes[2][i].relim()
axes[0][0].set_ylabel("Noise [L]")
axes[1][0].set_ylabel("ACF [-]")
axes[2][0].set_ylabel("Error [L]");
Exponential decay with non-equidistant timesteps#
fig, axes = plt.subplots(3, 3, figsize=(16, 9), sharey="row")
for i, name in enumerate(["STS_EXP1", "STS_EXP2", "STS_EXP3"]):
sts = d.loc[:, name]
sts.plot(ax=axes[0][i], style=".", label=name)
axes[0][i].legend(loc=3)
acf_n_true.plot(ax=axes[1][i])
axes[2][i].plot([0.0, 365], [0, 0], label="true ACF")
for j, bm in enumerate(["gaussian", "rectangle"]):
acf_name = name + "_" + bm
acf = acf_df.loc[:, acf_name]
if bm == "gaussian":
kwargs = dict(marker="x", markersize=10)
else:
kwargs = dict(marker="o")
acf.plot(label=bm, ax=axes[1][i], logx=True, linestyle="", **kwargs)
dif = acf.subtract(acf_n_true).dropna()
rmse = " (RMSE={:.2f})".format(np.sqrt((dif.pow(2)).sum() / dif.size))
dif.plot(label=bm + rmse, ax=axes[2][i], logx=True, sharey=axes[2][0], **kwargs)
axes[2][i].set_xticks([1, 10, 30, 60, 180, 365])
axes[2][i].set_xticklabels([1, 10, 30, 60, 180, 365])
axes[1][i].set_xticks([])
axes[2][i].legend(loc=2)
axes[2][i].set_xlabel("Lags [Days]")
axes[0][0].set_ylabel("Noise [L]")
axes[1][0].set_ylabel("Autocorrelation [-]")
axes[2][0].set_ylabel("Difference from true value [-]");
References#
Rehfeld, K., Marwan, N., Heitzig, J., Kurths, J. (2011). Comparison of correlation analysis techniques for irregularly sampled time series. Nonlinear Processes in Geophysics. 18. 389-404.
Edelson, R. A., & Krolik, J. H. (1988). The discrete correlation function-A new method for analyzing unevenly sampled variability data. The Astrophysical Journal, 333, 646-659.
Collenteur, R.A. (2018) Over autocorrelatie van tijdreeksmoddellen met niet-equidistante tijdstappen, Artesia, Schoonhoven, Nederland. In Dutch.