{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Autocorrelation for irregular time series\n", "*R.A. Collenteur, Artesia Water & University of Graz, 2020*\n", "\n", "In this notebook the autocorrelation function for irregular time steps that is built-in in Pastas is tested and validated on synthetic data. The methodology for calculating the autocorrelation is based on [Rehfeld et al. (2011)](#References) and [Edelson and Krolik (1978)](#References). The methods are available through `pastas.stats` package (e.g. `pastas.stats.acf(series)`). The full report of the methods underlying this Notebook are available in [Collenteur (2018, in Dutch)](#References). \n", "\n", "1. [Create synthetic time series](#Create-synthetic-time-series)\n", "2. [Compute the autocorrelation](#Computing-the-autocorrelation)\n", "3. [Autocorrelation for regular time series](#Autocorrelation-for-regular-time-series)\n", "4. [Autocorrelation for irregular time series](#Sine-wave-with-non-equidistant-timesteps)\n", "5. [References](#References)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import pastas as ps\n", "\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create synthetic time series\n", "Two synthetic time series are created with a known autocorrelation and a regular time interval. The first is a sine wave with a wave period of one year. The second is a series of correlated noise generated through an AR(1) process. Both synthetic time series have a length of ten years and a daily observation time step ($n=3650$). From both time series three time series with irregular time steps are created:\n", "\n", "- **Time Series 1:** 3 years of monthly data, 3 years of bi-weekly data, and 4 years of daily data.\n", "- **Time Series 2:** 3 years of monthly data, 3 years of bi-weekly data, a one year gap, and 4 years of daily data.\n", "- **Time Series 3:** reindex time series using the indices from a real groundwater level time series." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "index_test = pd.read_csv(\"../examples/data/test_index.csv\", parse_dates=True, index_col=0, names=[\"Date\"]).squeeze().index.ceil(\"D\").unique()\n", "n_years=10\n", "index = pd.to_datetime(np.arange(0, n_years*365, 1), unit=\"D\", origin=\"2005\")\n", "index.name = \"Time [Years]\"\n", "\n", "# 1. Sine timeseries 1: equal timesteps\n", "np.random.seed(0)\n", "data = np.sin(np.linspace(0, n_years*2*np.pi, len(index)))#+np.random.rand(index.size)\n", "v = pd.Series(data=data, index=index, name=\"STS_SIN\")\n", "\n", "# 2. Sine timeseries with three frequencies\n", "v1 = v.iloc[:-7*365].asfreq(\"30D\").append(v.iloc[-7*365:-4*365].asfreq(\"14D\")).append(v.iloc[-4*365:])\n", "v1.name = \"STS_SIN1\"\n", "\n", "# 3. Sine timeseries with three frequencies and a data gap\n", "v2 = v.iloc[:-7*365].asfreq(\"30D\").append(v.iloc[-7*365:-4*365].asfreq(\"14D\")).append(v.iloc[-3*365:])\n", "v2.name = \"STS_SIN2\"\n", "\n", "# 4. Sine timeseries with indices based on a true groundwater level measurement indices\n", "v3 = v.reindex(index_test).dropna()\n", "v3.name = \"STS_SIN3\"\n", "\n", "# Convoluting a random noise process with a exponential decay function to obtain a autocorrelation \n", "# timeseries similar to an Auto Regressive model of order 1 (AR(1))\n", "alpha = 10\n", "np.random.seed(0)\n", "n = np.random.rand(index.size+365)\n", "b = np.exp(-np.arange(366) / alpha)\n", "n = np.convolve(n, b, mode=\"valid\")\n", "n = n-n.mean()\n", "\n", "index = pd.to_datetime(np.arange(0, n.size, 1), unit=\"D\", origin=\"2005\")\n", "index.name = \"Time [Years]\"\n", "n = pd.Series(data=n, index=index, name=\"STS_EXP\")\n", "\n", "n1 = n.reindex(v1.index).dropna()\n", "n1.name = \"STS_EXP1\"\n", "n2 = n.reindex(v2.index).dropna()\n", "n2.name = \"STS_EXP2\"\n", "n3 = n.reindex(index_test).dropna()\n", "n3.name = \"STS_EXP3\"\n", "\n", "# Create a DataFrame with all series and plot them all\n", "d = pd.concat([v,v1,v2,v3,n,n1,n2,n3], axis=1)\n", "d.plot(subplots=True, figsize=(10,6), marker=\".\",markersize=2, color=\"steelblue\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Theoretical autocorrelation\n", "In this codeblock a figure is created showing the time series with equidistant timesteps and their theoretical autocorrelation." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Compute true autocorrelation functions\n", "acf_v_true = pd.Series(np.cos(np.linspace(0, 2*np.pi, 365)))\n", "acf_n_true = pd.Series(b)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Computing the autocorrelation\n", "\n", "The autocorrelation function for all time series is calculated for a number of time lags. Two different methods are used:\n", "\n", "- binning in a rectangular bin\n", "- weighting through a Gaussian kernel\n", "\n", "
\n", " \n", "Warning \n", " \n", "Calculation of all the autocorrelation functions can take several minutes!\n", "\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lags = np.arange(1.0, 366)\n", "acf_df = pd.DataFrame(index=lags)\n", "\n", "for name, sts in d.items():\n", " for method in [\"rectangle\", \"gaussian\"]:\n", " acf = ps.stats.acf(sts.dropna(), bin_method=method, max_gap=30)\n", " acf.index = acf.index.days\n", " acf_df.loc[:, name + \"_\" + method] = acf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Autocorrelation for regular time series" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, axes = plt.subplots(3, 2, figsize=(10,7), sharey=\"row\")\n", "\n", "for i, name in enumerate([\"STS_SIN\", \"STS_EXP\"]):\n", " sts = d.loc[:, name]\n", " sts.plot(ax=axes[0][i], style=\".\", legend=True, label=name)\n", " \n", " if name is \"STS_SIN\":\n", " acf_true = acf_v_true\n", " else:\n", " acf_true = acf_n_true\n", " \n", " acf_true.plot(ax=axes[1][i])\n", "\n", " axes[2][i].plot([0.0, 365], [0,0], label=\"true ACF\")\n", " for bm in [\"gaussian\", \"rectangle\"]:\n", " acf_name = name + \"_\" + bm\n", " acf = acf_df.loc[:, acf_name]\n", " if bm == \"gaussian\":\n", " kwargs=dict(marker=\"x\", markersize=10)\n", " else:\n", " kwargs = dict(marker=\"o\")\n", " acf.plot(label=bm, ax=axes[1][i], logx=True, linestyle=\"\", **kwargs)\n", " dif = acf.subtract(acf_true).dropna()\n", " rmse = \" (RMSE={:.2f})\".format(np.sqrt((dif.pow(2)).sum() / dif.size))\n", " dif.plot(label=bm+rmse, ax=axes[2][i], logx=True, **kwargs)\n", " axes[2][i].legend()\n", " axes[1][i].set_xticks([])\n", " axes[2][i].set_xlabel(\"Lags [Days]\")\n", " axes[2][i].set_xticks([1, 10, 30, 60, 180, 365])\n", " axes[2][i].set_xticklabels([1, 10, 30, 60, 180, 365])\n", "\n", "ax=axes[0][0].set_ylabel(\"Noise [L]\")\n", "axes[1][0].set_ylabel(\"ACF [-]\") \n", "axes[2][0].set_ylabel(\"Error [L]\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sine wave with non-equidistant timesteps" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, axes = plt.subplots(3, 3, figsize=(16,9), sharey=\"row\")\n", "# \n", "for i, name in enumerate([\"STS_SIN1\", \"STS_SIN2\", \"STS_SIN3\"]):\n", " sts = d.loc[:, name]\n", " sts.plot(ax=axes[0][i], style=\".\",label=name)\n", " axes[0][i].legend(loc=3)\n", " acf_v_true.plot(ax=axes[1][i])\n", " axes[2][i].plot([0.0, 365], [0,0], label=\"true ACF\")\n", " for bm in [\"gaussian\", \"rectangle\"]:\n", " acf_name = name + \"_\" + bm\n", " acf = acf_df.loc[:, acf_name] \n", " if bm == \"gaussian\":\n", " kwargs=dict(marker=\"x\", markersize=10)\n", " else:\n", " kwargs = dict(marker=\"o\")\n", " acf.plot(label=bm, ax=axes[1][i], logx=True, linestyle=\"\", **kwargs)\n", " dif = acf.subtract(acf_v_true).dropna()\n", " rmse = \" (RMSE={:.2f})\".format(np.sqrt((dif.pow(2)).sum() / dif.size))\n", " dif.plot(label=bm+rmse, ax=axes[2][i], logx=True, **kwargs)\n", " axes[1][i].set_xticks([])\n", " axes[2][i].set_xticks([1, 10, 30, 60, 180, 365])\n", " axes[2][i].set_xticklabels([1, 10, 30, 60, 180, 365])\n", " axes[2][i].legend(loc=3)\n", " axes[2][i].set_xlabel(\"Lags [Days]\")\n", "\n", "axes[2][i].relim()\n", "axes[0][0].set_ylabel(\"Noise [L]\")\n", "axes[1][0].set_ylabel(\"ACF [-]\") \n", "axes[2][0].set_ylabel(\"Error [L]\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exponential decay with non-equidistant timesteps" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, axes = plt.subplots(3, 3, figsize=(16,9), sharey=\"row\")\n", "\n", "for i, name in enumerate([\"STS_EXP1\", \"STS_EXP2\", \"STS_EXP3\"]):\n", " sts = d.loc[:, name]\n", " sts.plot(ax=axes[0][i], style=\".\", label=name)\n", " axes[0][i].legend(loc=3)\n", " acf_n_true.plot(ax=axes[1][i])\n", " axes[2][i].plot([0.0, 365], [0,0], label=\"true ACF\")\n", " for j, bm in enumerate([\"gaussian\", \"rectangle\"]):\n", " acf_name = name + \"_\" + bm\n", " acf = acf_df.loc[:, acf_name]\n", " if bm == \"gaussian\":\n", " kwargs=dict(marker=\"x\", markersize=10)\n", " else:\n", " kwargs = dict(marker=\"o\")\n", " acf.plot(label=bm, ax=axes[1][i], logx=True, linestyle=\"\", **kwargs)\n", " dif = acf.subtract(acf_n_true).dropna()\n", " rmse = \" (RMSE={:.2f})\".format(np.sqrt((dif.pow(2)).sum() / dif.size))\n", " dif.plot(label=bm+rmse, ax=axes[2][i], logx=True, sharey=axes[2][0], **kwargs)\n", " axes[2][i].set_xticks([1, 10, 30, 60, 180, 365])\n", " axes[2][i].set_xticklabels([1, 10, 30, 60, 180, 365])\n", " axes[1][i].set_xticks([])\n", " axes[2][i].legend(loc=2)\n", " axes[2][i].set_xlabel(\"Lags [Days]\")\n", "\n", "axes[0][0].set_ylabel(\"Noise [L]\")\n", "axes[1][0].set_ylabel(\"Autocorrelation [-]\") \n", "axes[2][0].set_ylabel(\"Difference from true value [-]\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## References\n", "- Rehfeld, K., Marwan, N., Heitzig, J., Kurths, J. (2011). [Comparison of correlation analysis techniques for irregularly sampled time series](https://doi.org/10.5194/npg-18-389-2011). Nonlinear Processes in Geophysics. 18. 389-404.\n", "- Edelson, R. A., & Krolik, J. H. (1988). The discrete correlation function-A new method for analyzing unevenly sampled variability data. The Astrophysical Journal, 333, 646-659.\n", "- Collenteur, R.A. (2018) [Over autocorrelatie van tijdreeksmoddellen met niet-equidistante tijdstappen](http://www.artesia-water.nl/wp-content/uploads/Autocorrelatie_niet_gelijke_tijdstappen.pdf), Artesia, Schoonhoven, Nederland. In Dutch." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 4 }