{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Autocorrelation for irregular time series\n",
    "*R.A. Collenteur, Artesia Water & University of Graz, 2020*\n",
    "\n",
    "In this notebook the autocorrelation function for irregular time steps that is built-in in Pastas is tested and validated on synthetic data. The methodology for calculating the autocorrelation is based on [Rehfeld et al. (2011)](#References) and [Edelson and Krolik (1978)](#References). The methods are available through `pastas.stats` package (e.g. `pastas.stats.acf(series)`). The full report of the methods underlying this Notebook are available in [Collenteur (2018, in Dutch)](#References). \n",
    "\n",
    "1. [Create synthetic time series](#Create-synthetic-time-series)\n",
    "2. [Compute the autocorrelation](#Computing-the-autocorrelation)\n",
    "3. [Autocorrelation for regular time series](#Autocorrelation-for-regular-time-series)\n",
    "4. [Autocorrelation for irregular time series](#Sine-wave-with-non-equidistant-timesteps)\n",
    "5. [References](#References)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import pastas as ps\n",
    "\n",
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create synthetic time series\n",
    "Two synthetic time series are created with a known autocorrelation and a regular time interval. The first is a sine wave with a wave period of one year. The second is a series of correlated noise generated through an AR(1) process. Both synthetic time series have a length of ten years and a daily observation time step ($n=3650$). From both time series three time series with irregular time steps are created:\n",
    "\n",
    "- **Time Series 1:** 3 years of monthly data, 3 years of bi-weekly data, and 4 years of daily data.\n",
    "- **Time Series 2:** 3 years of monthly data, 3 years of bi-weekly data, a one year gap, and 4 years of daily data.\n",
    "- **Time Series 3:** reindex time series using the indices from a real groundwater level time series."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "index_test = pd.read_csv(\"../examples/data/test_index.csv\", parse_dates=True, index_col=0, names=[\"Date\"]).squeeze().index.ceil(\"D\").unique()\n",
    "n_years=10\n",
    "index = pd.to_datetime(np.arange(0, n_years*365, 1), unit=\"D\", origin=\"2005\")\n",
    "index.name = \"Time [Years]\"\n",
    "\n",
    "# 1. Sine timeseries 1: equal timesteps\n",
    "np.random.seed(0)\n",
    "data = np.sin(np.linspace(0,  n_years*2*np.pi, len(index)))#+np.random.rand(index.size)\n",
    "v = pd.Series(data=data, index=index, name=\"STS_SIN\")\n",
    "\n",
    "# 2. Sine timeseries with three frequencies\n",
    "v1 = v.iloc[:-7*365].asfreq(\"30D\").append(v.iloc[-7*365:-4*365].asfreq(\"14D\")).append(v.iloc[-4*365:])\n",
    "v1.name = \"STS_SIN1\"\n",
    "\n",
    "# 3. Sine timeseries with three frequencies and a data gap\n",
    "v2 = v.iloc[:-7*365].asfreq(\"30D\").append(v.iloc[-7*365:-4*365].asfreq(\"14D\")).append(v.iloc[-3*365:])\n",
    "v2.name = \"STS_SIN2\"\n",
    "\n",
    "# 4. Sine timeseries with indices based on a true groundwater level measurement indices\n",
    "v3 = v.reindex(index_test).dropna()\n",
    "v3.name = \"STS_SIN3\"\n",
    "\n",
    "# Convoluting a random noise process with a exponential decay function to obtain a autocorrelation \n",
    "# timeseries similar to an Auto Regressive model of order 1 (AR(1))\n",
    "alpha = 10\n",
    "np.random.seed(0)\n",
    "n = np.random.rand(index.size+365)\n",
    "b = np.exp(-np.arange(366) / alpha)\n",
    "n = np.convolve(n, b, mode=\"valid\")\n",
    "n = n-n.mean()\n",
    "\n",
    "index = pd.to_datetime(np.arange(0, n.size, 1), unit=\"D\", origin=\"2005\")\n",
    "index.name = \"Time [Years]\"\n",
    "n = pd.Series(data=n, index=index, name=\"STS_EXP\")\n",
    "\n",
    "n1 = n.reindex(v1.index).dropna()\n",
    "n1.name = \"STS_EXP1\"\n",
    "n2 = n.reindex(v2.index).dropna()\n",
    "n2.name = \"STS_EXP2\"\n",
    "n3 = n.reindex(index_test).dropna()\n",
    "n3.name = \"STS_EXP3\"\n",
    "\n",
    "# Create a DataFrame with all series and plot them all\n",
    "d = pd.concat([v,v1,v2,v3,n,n1,n2,n3], axis=1)\n",
    "d.plot(subplots=True, figsize=(10,6), marker=\".\",markersize=2, color=\"steelblue\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Theoretical autocorrelation\n",
    "In this codeblock a figure is created showing the time series with equidistant timesteps and their theoretical autocorrelation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compute true autocorrelation functions\n",
    "acf_v_true = pd.Series(np.cos(np.linspace(0,  2*np.pi, 365)))\n",
    "acf_n_true = pd.Series(b)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Computing the autocorrelation\n",
    "\n",
    "The autocorrelation function for all time series is calculated for a number of time lags. Two different methods are used:\n",
    "\n",
    "- binning in a rectangular bin\n",
    "- weighting through a Gaussian kernel\n",
    "\n",
    "<div class=\"alert alert-warning\">\n",
    "    \n",
    "<b>Warning</b> \n",
    "    \n",
    "Calculation of all the autocorrelation functions can take several minutes!\n",
    "\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lags = np.arange(1.0, 366)\n",
    "acf_df = pd.DataFrame(index=lags)\n",
    "\n",
    "for name, sts in d.items():\n",
    "    for method in [\"rectangle\", \"gaussian\"]:\n",
    "        acf = ps.stats.acf(sts.dropna(), bin_method=method, max_gap=30)\n",
    "        acf.index = acf.index.days\n",
    "        acf_df.loc[:, name + \"_\" + method] = acf"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Autocorrelation for regular time series"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, axes = plt.subplots(3, 2, figsize=(10,7), sharey=\"row\")\n",
    "\n",
    "for i, name in enumerate([\"STS_SIN\", \"STS_EXP\"]):\n",
    "    sts = d.loc[:, name]\n",
    "    sts.plot(ax=axes[0][i], style=\".\", legend=True, label=name)\n",
    "    \n",
    "    if name is \"STS_SIN\":\n",
    "        acf_true = acf_v_true\n",
    "    else:\n",
    "        acf_true = acf_n_true\n",
    "        \n",
    "    acf_true.plot(ax=axes[1][i])\n",
    "\n",
    "    axes[2][i].plot([0.0, 365], [0,0], label=\"true ACF\")\n",
    "    for bm in [\"gaussian\", \"rectangle\"]:\n",
    "        acf_name = name + \"_\" + bm\n",
    "        acf = acf_df.loc[:, acf_name]\n",
    "        if bm == \"gaussian\":\n",
    "            kwargs=dict(marker=\"x\", markersize=10)\n",
    "        else:\n",
    "            kwargs = dict(marker=\"o\")\n",
    "        acf.plot(label=bm, ax=axes[1][i], logx=True, linestyle=\"\", **kwargs)\n",
    "        dif = acf.subtract(acf_true).dropna()\n",
    "        rmse = \" (RMSE={:.2f})\".format(np.sqrt((dif.pow(2)).sum() / dif.size))\n",
    "        dif.plot(label=bm+rmse, ax=axes[2][i], logx=True, **kwargs)\n",
    "        axes[2][i].legend()\n",
    "        axes[1][i].set_xticks([])\n",
    "        axes[2][i].set_xlabel(\"Lags [Days]\")\n",
    "        axes[2][i].set_xticks([1, 10, 30, 60, 180, 365])\n",
    "        axes[2][i].set_xticklabels([1, 10, 30, 60, 180, 365])\n",
    "\n",
    "ax=axes[0][0].set_ylabel(\"Noise [L]\")\n",
    "axes[1][0].set_ylabel(\"ACF [-]\") \n",
    "axes[2][0].set_ylabel(\"Error [L]\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Sine wave with non-equidistant timesteps"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, axes = plt.subplots(3, 3, figsize=(16,9), sharey=\"row\")\n",
    "# \n",
    "for i, name in enumerate([\"STS_SIN1\", \"STS_SIN2\", \"STS_SIN3\"]):\n",
    "    sts = d.loc[:, name]\n",
    "    sts.plot(ax=axes[0][i], style=\".\",label=name)\n",
    "    axes[0][i].legend(loc=3)\n",
    "    acf_v_true.plot(ax=axes[1][i])\n",
    "    axes[2][i].plot([0.0, 365], [0,0], label=\"true ACF\")\n",
    "    for bm in [\"gaussian\", \"rectangle\"]:\n",
    "        acf_name = name + \"_\" + bm\n",
    "        acf = acf_df.loc[:, acf_name]        \n",
    "        if bm == \"gaussian\":\n",
    "            kwargs=dict(marker=\"x\", markersize=10)\n",
    "        else:\n",
    "            kwargs = dict(marker=\"o\")\n",
    "        acf.plot(label=bm, ax=axes[1][i], logx=True, linestyle=\"\", **kwargs)\n",
    "        dif = acf.subtract(acf_v_true).dropna()\n",
    "        rmse = \" (RMSE={:.2f})\".format(np.sqrt((dif.pow(2)).sum() / dif.size))\n",
    "        dif.plot(label=bm+rmse, ax=axes[2][i], logx=True, **kwargs)\n",
    "        axes[1][i].set_xticks([])\n",
    "        axes[2][i].set_xticks([1, 10, 30, 60, 180, 365])\n",
    "        axes[2][i].set_xticklabels([1, 10, 30, 60, 180, 365])\n",
    "        axes[2][i].legend(loc=3)\n",
    "        axes[2][i].set_xlabel(\"Lags [Days]\")\n",
    "\n",
    "axes[2][i].relim()\n",
    "axes[0][0].set_ylabel(\"Noise [L]\")\n",
    "axes[1][0].set_ylabel(\"ACF [-]\") \n",
    "axes[2][0].set_ylabel(\"Error [L]\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exponential decay with non-equidistant timesteps"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, axes = plt.subplots(3, 3, figsize=(16,9), sharey=\"row\")\n",
    "\n",
    "for i, name in enumerate([\"STS_EXP1\", \"STS_EXP2\", \"STS_EXP3\"]):\n",
    "    sts = d.loc[:, name]\n",
    "    sts.plot(ax=axes[0][i], style=\".\", label=name)\n",
    "    axes[0][i].legend(loc=3)\n",
    "    acf_n_true.plot(ax=axes[1][i])\n",
    "    axes[2][i].plot([0.0, 365], [0,0], label=\"true ACF\")\n",
    "    for j, bm in enumerate([\"gaussian\", \"rectangle\"]):\n",
    "        acf_name = name + \"_\" + bm\n",
    "        acf = acf_df.loc[:, acf_name]\n",
    "        if bm == \"gaussian\":\n",
    "            kwargs=dict(marker=\"x\", markersize=10)\n",
    "        else:\n",
    "            kwargs = dict(marker=\"o\")\n",
    "        acf.plot(label=bm, ax=axes[1][i], logx=True, linestyle=\"\", **kwargs)\n",
    "        dif = acf.subtract(acf_n_true).dropna()\n",
    "        rmse = \" (RMSE={:.2f})\".format(np.sqrt((dif.pow(2)).sum() / dif.size))\n",
    "        dif.plot(label=bm+rmse, ax=axes[2][i], logx=True, sharey=axes[2][0], **kwargs)\n",
    "        axes[2][i].set_xticks([1, 10, 30, 60, 180, 365])\n",
    "        axes[2][i].set_xticklabels([1, 10, 30, 60, 180, 365])\n",
    "        axes[1][i].set_xticks([])\n",
    "        axes[2][i].legend(loc=2)\n",
    "        axes[2][i].set_xlabel(\"Lags [Days]\")\n",
    "\n",
    "axes[0][0].set_ylabel(\"Noise [L]\")\n",
    "axes[1][0].set_ylabel(\"Autocorrelation [-]\") \n",
    "axes[2][0].set_ylabel(\"Difference from true value [-]\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## References\n",
    "- Rehfeld, K., Marwan, N., Heitzig, J., Kurths, J. (2011). [Comparison of correlation analysis techniques for irregularly sampled time series](https://doi.org/10.5194/npg-18-389-2011). Nonlinear Processes in Geophysics. 18. 389-404.\n",
    "- Edelson, R. A., & Krolik, J. H. (1988). The discrete correlation function-A new method for analyzing unevenly sampled variability data. The Astrophysical Journal, 333, 646-659.\n",
    "- Collenteur, R.A. (2018) [Over autocorrelatie van tijdreeksmoddellen met niet-equidistante tijdstappen](http://www.artesia-water.nl/wp-content/uploads/Autocorrelatie_niet_gelijke_tijdstappen.pdf), Artesia, Schoonhoven, Nederland. In Dutch."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}