{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Autocorrelation for irregular time series\n", "*R.A. Collenteur, Artesia Water & University of Graz, 2020*\n", "\n", "In this notebook the autocorrelation function for irregular time steps that is built-in in Pastas is tested and validated on synthetic data. The methodology for calculating the autocorrelation is based on [Rehfeld et al. (2011)](#References) and [Edelson and Krolik (1978)](#References). The methods are available through `pastas.stats` package (e.g. `pastas.stats.acf(series)`). The full report of the methods underlying this Notebook are available in [Collenteur (2018, in Dutch)](#References). \n", "\n", "1. [Create synthetic time series](#Create-synthetic-time-series)\n", "2. [Compute the autocorrelation](#Computing-the-autocorrelation)\n", "3. [Autocorrelation for regular time series](#Autocorrelation-for-regular-time-series)\n", "4. [Autocorrelation for irregular time series](#Sine-wave-with-non-equidistant-timesteps)\n", "5. [References](#References)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import pastas as ps\n", "\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create synthetic time series\n", "Two synthetic time series are created with a known autocorrelation and a regular time interval. The first is a sine wave with a wave period of one year. The second is a series of correlated noise generated through an AR(1) process. Both synthetic time series have a length of ten years and a daily observation time step ($n=3650$). From both time series three time series with irregular time steps are created:\n", "\n", "- **Time Series 1:** 3 years of monthly data, 3 years of bi-weekly data, and 4 years of daily data.\n", "- **Time Series 2:** 3 years of monthly data, 3 years of bi-weekly data, a one year gap, and 4 years of daily data.\n", "- **Time Series 3:** reindex time series using the indices from a real groundwater level time series." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "index_test = (\n", " pd.read_csv(\n", " \"../examples/data/test_index.csv\", parse_dates=True, index_col=0, names=[\"Date\"]\n", " )\n", " .squeeze()\n", " .index.ceil(\"D\")\n", " .unique()\n", ")\n", "n_years = 10\n", "index = pd.to_datetime(np.arange(0, n_years * 365, 1), unit=\"D\", origin=\"2005\")\n", "index.name = \"Time [Years]\"\n", "\n", "# 1. Sine timeseries 1: equal timesteps\n", "np.random.seed(0)\n", "data = np.sin(\n", " np.linspace(0, n_years * 2 * np.pi, len(index))\n", ") # +np.random.rand(index.size)\n", "v = pd.Series(data=data, index=index, name=\"STS_SIN\")\n", "\n", "# 2. Sine timeseries with three frequencies\n", "v1 = (\n", " v.iloc[: -7 * 365]\n", " .asfreq(\"30D\")\n", " .append(v.iloc[-7 * 365 : -4 * 365].asfreq(\"14D\"))\n", " .append(v.iloc[-4 * 365 :])\n", ")\n", "v1.name = \"STS_SIN1\"\n", "\n", "# 3. Sine timeseries with three frequencies and a data gap\n", "v2 = (\n", " v.iloc[: -7 * 365]\n", " .asfreq(\"30D\")\n", " .append(v.iloc[-7 * 365 : -4 * 365].asfreq(\"14D\"))\n", " .append(v.iloc[-3 * 365 :])\n", ")\n", "v2.name = \"STS_SIN2\"\n", "\n", "# 4. Sine timeseries with indices based on a true groundwater level measurement indices\n", "v3 = v.reindex(index_test).dropna()\n", "v3.name = \"STS_SIN3\"\n", "\n", "# Convoluting a random noise process with a exponential decay function to obtain a autocorrelation\n", "# timeseries similar to an Auto Regressive model of order 1 (AR(1))\n", "alpha = 10\n", "np.random.seed(0)\n", "n = np.random.rand(index.size + 365)\n", "b = np.exp(-np.arange(366) / alpha)\n", "n = np.convolve(n, b, mode=\"valid\")\n", "n = n - n.mean()\n", "\n", "index = pd.to_datetime(np.arange(0, n.size, 1), unit=\"D\", origin=\"2005\")\n", "index.name = \"Time [Years]\"\n", "n = pd.Series(data=n, index=index, name=\"STS_EXP\")\n", "\n", "n1 = n.reindex(v1.index).dropna()\n", "n1.name = \"STS_EXP1\"\n", "n2 = n.reindex(v2.index).dropna()\n", "n2.name = \"STS_EXP2\"\n", "n3 = n.reindex(index_test).dropna()\n", "n3.name = \"STS_EXP3\"\n", "\n", "# Create a DataFrame with all series and plot them all\n", "d = pd.concat([v, v1, v2, v3, n, n1, n2, n3], axis=1)\n", "d.plot(subplots=True, figsize=(10, 6), marker=\".\", markersize=2, color=\"steelblue\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Theoretical autocorrelation\n", "In this codeblock a figure is created showing the time series with equidistant timesteps and their theoretical autocorrelation." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Compute true autocorrelation functions\n", "acf_v_true = pd.Series(np.cos(np.linspace(0, 2 * np.pi, 365)))\n", "acf_n_true = pd.Series(b)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Computing the autocorrelation\n", "\n", "The autocorrelation function for all time series is calculated for a number of time lags. Two different methods are used:\n", "\n", "- binning in a rectangular bin\n", "- weighting through a Gaussian kernel\n", "\n", "
\n", " \n", "Warning \n", " \n", "Calculation of all the autocorrelation functions can take several minutes!\n", "\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lags = np.arange(1.0, 366)\n", "acf_df = pd.DataFrame(index=lags)\n", "\n", "for name, sts in d.items():\n", " for method in [\"rectangle\", \"gaussian\"]:\n", " acf = ps.stats.acf(sts.dropna(), bin_method=method, max_gap=30)\n", " acf.index = acf.index.days\n", " acf_df.loc[:, name + \"_\" + method] = acf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Autocorrelation for regular time series" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, axes = plt.subplots(3, 2, figsize=(10, 7), sharey=\"row\")\n", "\n", "for i, name in enumerate([\"STS_SIN\", \"STS_EXP\"]):\n", " sts = d.loc[:, name]\n", " sts.plot(ax=axes[0][i], style=\".\", legend=True, label=name)\n", "\n", " if name == \"STS_SIN\":\n", " acf_true = acf_v_true\n", " else:\n", " acf_true = acf_n_true\n", "\n", " acf_true.plot(ax=axes[1][i])\n", "\n", " axes[2][i].plot([0.0, 365], [0, 0], label=\"true ACF\")\n", " for bm in [\"gaussian\", \"rectangle\"]:\n", " acf_name = name + \"_\" + bm\n", " acf = acf_df.loc[:, acf_name]\n", " if bm == \"gaussian\":\n", " kwargs = dict(marker=\"x\", markersize=10)\n", " else:\n", " kwargs = dict(marker=\"o\")\n", " acf.plot(label=bm, ax=axes[1][i], logx=True, linestyle=\"\", **kwargs)\n", " dif = acf.subtract(acf_true).dropna()\n", " rmse = \" (RMSE={:.2f})\".format(np.sqrt((dif.pow(2)).sum() / dif.size))\n", " dif.plot(label=bm + rmse, ax=axes[2][i], logx=True, **kwargs)\n", " axes[2][i].legend()\n", " axes[1][i].set_xticks([])\n", " axes[2][i].set_xlabel(\"Lags [Days]\")\n", " axes[2][i].set_xticks([1, 10, 30, 60, 180, 365])\n", " axes[2][i].set_xticklabels([1, 10, 30, 60, 180, 365])\n", "\n", "ax = axes[0][0].set_ylabel(\"Noise [L]\")\n", "axes[1][0].set_ylabel(\"ACF [-]\")\n", "axes[2][0].set_ylabel(\"Error [L]\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sine wave with non-equidistant timesteps" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, axes = plt.subplots(3, 3, figsize=(16, 9), sharey=\"row\")\n", "#\n", "for i, name in enumerate([\"STS_SIN1\", \"STS_SIN2\", \"STS_SIN3\"]):\n", " sts = d.loc[:, name]\n", " sts.plot(ax=axes[0][i], style=\".\", label=name)\n", " axes[0][i].legend(loc=3)\n", " acf_v_true.plot(ax=axes[1][i])\n", " axes[2][i].plot([0.0, 365], [0, 0], label=\"true ACF\")\n", " for bm in [\"gaussian\", \"rectangle\"]:\n", " acf_name = name + \"_\" + bm\n", " acf = acf_df.loc[:, acf_name]\n", " if bm == \"gaussian\":\n", " kwargs = dict(marker=\"x\", markersize=10)\n", " else:\n", " kwargs = dict(marker=\"o\")\n", " acf.plot(label=bm, ax=axes[1][i], logx=True, linestyle=\"\", **kwargs)\n", " dif = acf.subtract(acf_v_true).dropna()\n", " rmse = \" (RMSE={:.2f})\".format(np.sqrt((dif.pow(2)).sum() / dif.size))\n", " dif.plot(label=bm + rmse, ax=axes[2][i], logx=True, **kwargs)\n", " axes[1][i].set_xticks([])\n", " axes[2][i].set_xticks([1, 10, 30, 60, 180, 365])\n", " axes[2][i].set_xticklabels([1, 10, 30, 60, 180, 365])\n", " axes[2][i].legend(loc=3)\n", " axes[2][i].set_xlabel(\"Lags [Days]\")\n", "\n", "axes[2][i].relim()\n", "axes[0][0].set_ylabel(\"Noise [L]\")\n", "axes[1][0].set_ylabel(\"ACF [-]\")\n", "axes[2][0].set_ylabel(\"Error [L]\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exponential decay with non-equidistant timesteps" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, axes = plt.subplots(3, 3, figsize=(16, 9), sharey=\"row\")\n", "\n", "for i, name in enumerate([\"STS_EXP1\", \"STS_EXP2\", \"STS_EXP3\"]):\n", " sts = d.loc[:, name]\n", " sts.plot(ax=axes[0][i], style=\".\", label=name)\n", " axes[0][i].legend(loc=3)\n", " acf_n_true.plot(ax=axes[1][i])\n", " axes[2][i].plot([0.0, 365], [0, 0], label=\"true ACF\")\n", " for j, bm in enumerate([\"gaussian\", \"rectangle\"]):\n", " acf_name = name + \"_\" + bm\n", " acf = acf_df.loc[:, acf_name]\n", " if bm == \"gaussian\":\n", " kwargs = dict(marker=\"x\", markersize=10)\n", " else:\n", " kwargs = dict(marker=\"o\")\n", " acf.plot(label=bm, ax=axes[1][i], logx=True, linestyle=\"\", **kwargs)\n", " dif = acf.subtract(acf_n_true).dropna()\n", " rmse = \" (RMSE={:.2f})\".format(np.sqrt((dif.pow(2)).sum() / dif.size))\n", " dif.plot(label=bm + rmse, ax=axes[2][i], logx=True, sharey=axes[2][0], **kwargs)\n", " axes[2][i].set_xticks([1, 10, 30, 60, 180, 365])\n", " axes[2][i].set_xticklabels([1, 10, 30, 60, 180, 365])\n", " axes[1][i].set_xticks([])\n", " axes[2][i].legend(loc=2)\n", " axes[2][i].set_xlabel(\"Lags [Days]\")\n", "\n", "axes[0][0].set_ylabel(\"Noise [L]\")\n", "axes[1][0].set_ylabel(\"Autocorrelation [-]\")\n", "axes[2][0].set_ylabel(\"Difference from true value [-]\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## References\n", "- Rehfeld, K., Marwan, N., Heitzig, J., Kurths, J. (2011). [Comparison of correlation analysis techniques for irregularly sampled time series](https://doi.org/10.5194/npg-18-389-2011). Nonlinear Processes in Geophysics. 18. 389-404.\n", "- Edelson, R. A., & Krolik, J. H. (1988). The discrete correlation function-A new method for analyzing unevenly sampled variability data. The Astrophysical Journal, 333, 646-659.\n", "- Collenteur, R.A. (2018) [Over autocorrelatie van tijdreeksmoddellen met niet-equidistante tijdstappen](http://www.artesia-water.nl/wp-content/uploads/Autocorrelatie_niet_gelijke_tijdstappen.pdf), Artesia, Schoonhoven, Nederland. In Dutch." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "artesia", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" }, "vscode": { "interpreter": { "hash": "dace5e1b41a98a8e52d2a8eebc3b981caf2c12e7a76736ebfb89a489e3b62e79" } } }, "nbformat": 4, "nbformat_minor": 4 }