{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Creating Equidistant Timeseries\n",
    "\n",
    "This notebooks shows functionality for creating equidistant timeseries in Pastas.\n",
    "This is sometimes useful or necessary, i.e. the Stoffer-Toloi test for autocorrelation requires an\n",
    "equidistant timeseries (that is allowed to have missing data).\n",
    "\n",
    "*Developed by D. Brakenhoff, Artesia, 2021*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pastas as ps\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "ps.show_versions()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We define 3 pandas methods for resampling to an equidistant timeseries. \n",
    "\n",
    "1. The first takes a sample at equidistant timesteps from the original series, at the user-specified frequency.\n",
    "2. The second creates a new equidistant index, rounded to the user-specified frequency. Then `Series.reindex()` is used with `method=\"nearest\"`.\n",
    "2. The third method rounds the series index down to the nearest user-specified frequency, then drops the duplicates before calling `Series.asfreq` with the user-specified frequency. This ensures no duplicates are in the resulting timeseries.\n",
    "\n",
    "Pastas contains the function `pastas.utils.get_equidistant_timeseries()` which does something similar, but attempts to minimize the number of dropped points and ensures that each observation from the original timeseries is used only once in the resulting equidistant timeseries.\n",
    "\n",
    "\n",
    "_**Note:** in terms of performance the pandas methods are undoubtedly faster._"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def pandas_sample(series, freq):\n",
    "    series = series.copy()\n",
    "    t_offset = ps.utils._get_time_offset(series.index, freq).value_counts().idxmax()\n",
    "    new_idx = pd.date_range(\n",
    "        series.index[0].floor(freq) + t_offset,\n",
    "        series.index[-1].floor(freq) + t_offset, \n",
    "        freq=freq\n",
    "        )\n",
    "    return series.reindex(new_idx)\n",
    "\n",
    "def pandas_nearest(series, freq, tolerance=None):\n",
    "    series = series.copy()\n",
    "    # Create equidistant timeseries with Pandas\n",
    "    idx = pd.date_range(series.index[0].floor(freq),\n",
    "                        series.index[-1].ceil(freq),\n",
    "                        freq=freq)\n",
    "    spandas = series.reindex(idx, method=\"nearest\", tolerance=tolerance)\n",
    "    return spandas\n",
    "\n",
    "\n",
    "def pandas_asfreq(series, freq):\n",
    "    # Create equidistant timeseries with most frequent samples\n",
    "    series = series.copy()\n",
    "    series.index = series.index.floor(freq)\n",
    "    spandas = (series\n",
    "               .reset_index()\n",
    "               .drop_duplicates(subset=\"index\", keep=\"first\", inplace=False)\n",
    "               .set_index(\"index\")\n",
    "               .asfreq(freq)\n",
    "               .squeeze()\n",
    "               )\n",
    "    return spandas\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Example 1\n",
    "\n",
    "Lets create a timeseries spaced which is normally spaced with a frequency of 6 hours. The first and last measurement are shifted a bit later and earlier respectively. The two method compared here are the new function in Pastas and the Pandas reindex function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create timeseries\n",
    "freq = \"6H\"\n",
    "idx0 = pd.date_range(\"2000-01-01\", freq=freq, periods=7).tolist()\n",
    "idx0[0] = pd.Timestamp(\"2000-01-01 04:00:00\")\n",
    "idx0[-1] = pd.Timestamp(\"2000-01-02 11:00:00\")\n",
    "series = pd.Series(index=idx0, data=np.arange(len(idx0), dtype=float))\n",
    "\n",
    "# Create equidistant timeseries with Pastas\n",
    "s_pd1 = pandas_sample(series, freq)\n",
    "s_pd2 = pandas_nearest(series, freq)\n",
    "s_pd3 = pandas_asfreq(series, freq)\n",
    "s_pastas = ps.utils.get_equidistant_series(series, freq)\n",
    "\n",
    "# Create figure\n",
    "plt.figure(figsize=(10, 4))\n",
    "ax = series.plot(marker=\"o\", label=\"original timeseries\", ms=10,)\n",
    "s_pd2.plot(ax=ax, marker=\"x\", ms=8, label=\"pandas_nearest\")\n",
    "s_pd3.plot(ax=ax, marker=\"^\", ms=8, label=\"pandas_asfreq\")\n",
    "s_pd1.plot(ax=ax, marker=\"+\", ms=16, label=\"pandas_sample\")\n",
    "s_pastas.plot(ax=ax, marker=\".\", label=\"pastas equidistant\")\n",
    "ax.grid(b=True)\n",
    "ax.legend(loc=\"best\")\n",
    "ax.set_xlabel(\"\");\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can see, both the `pandas_nearest` and `pandas_asfreq` methods and `get_equidistant_series` show the expected behavior. The data at the beginning and at the end is shifted to the nearest equidistant timestamp. The `pandas_sample` method drops 2 datapoints because they're measured at different time offsets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dfall = pd.concat([series, s_pd1, s_pd2, s_pd3, s_pastas], axis=1)\n",
    "dfall.columns = [\n",
    "    \"original\",\n",
    "    \"pandas_sample\",\n",
    "    \"pandas_nearest\",\n",
    "    \"pandas_asfreq\",\n",
    "    \"pastas\"\n",
    "]\n",
    "dfall"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Example 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create timeseries\n",
    "freq = \"D\"\n",
    "idx0 = pd.date_range(\"2000-01-01\", freq=freq, periods=7).tolist()\n",
    "idx0[0] = pd.Timestamp(\"2000-01-01 09:00:00\")\n",
    "del idx0[2]\n",
    "del idx0[2]\n",
    "idx0[-2] = pd.Timestamp(\"2000-01-06 13:00:00\")\n",
    "idx0[-1] = pd.Timestamp(\"2000-01-06 23:00:00\")\n",
    "series = pd.Series(index=idx0, data=np.arange(len(idx0), dtype=float))\n",
    "\n",
    "# Create equidistant timeseries\n",
    "s_pd1 = pandas_sample(series, freq)\n",
    "s_pd2 = pandas_nearest(series, freq)\n",
    "s_pd3 = pandas_asfreq(series, freq)\n",
    "s_pastas = ps.utils.get_equidistant_series(series, freq)\n",
    "\n",
    "# Create figure\n",
    "plt.figure(figsize=(10, 4))\n",
    "ax = series.plot(marker=\"o\", label=\"original\", ms=10)\n",
    "s_pd2.plot(ax=ax, marker=\"x\", ms=10, label=\"pandas nearest\")\n",
    "s_pd3.plot(ax=ax, marker=\"^\", ms=8, label=\"pandas asfreq\")\n",
    "s_pd1.plot(ax=ax, marker=\"+\", ms=16, label=\"pandas sample\")\n",
    "s_pastas.plot(ax=ax, marker=\".\", label=\"equidistant\")\n",
    "ax.grid(b=True)\n",
    "ax.legend(loc=\"best\")\n",
    "ax.set_xlabel(\"\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this example, the shortcomings of `pandas_nearest` are clearly visible. It duplicates observations from the original timeseries to fill the gaps. This can be solved by passing e.g. `tolerance=\"0.99{freq}\"` to `series.reindex()` in which case the gaps will not be filled. However, with very irregular timesteps this is not guaranteed to work and duplicates may still occur. The `pandas_asfreq` and pastas methods work as expected and use the available data to create a reasonable equidistant timeseries from the original data. The `pandas_sample` method is only able to keep two observations from the original series in this example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dfall = pd.concat([series, s_pd1, s_pd2, s_pd3, s_pastas], axis=1)\n",
    "dfall.columns = [\n",
    "    \"original\",\n",
    "    \"pandas_sample\",\n",
    "    \"pandas_nearest\",\n",
    "    \"pandas_asfreq\",\n",
    "    \"pastas\",\n",
    "]\n",
    "dfall"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Example 3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create timeseries\n",
    "freq = \"2H\"\n",
    "freq2 = \"1H\"\n",
    "idx0 = pd.date_range(\"2000-01-01 18:00:00\", freq=freq, periods=3).tolist()\n",
    "idx1 = pd.date_range(\"2000-01-02 01:30:00\", freq=freq2, periods=10).tolist()\n",
    "idx0 = idx0 + idx1\n",
    "idx0[3] = pd.Timestamp(\"2000-01-02 01:31:00\")\n",
    "series = pd.Series(index=idx0, data=np.arange(len(idx0), dtype=float))\n",
    "series.iloc[8:10] = np.nan\n",
    "\n",
    "\n",
    "# Create equidistant timeseries\n",
    "s_pd1 = pandas_sample(series, freq)\n",
    "s_pd2 = pandas_nearest(series, freq)\n",
    "s_pd3 = pandas_asfreq(series, freq)\n",
    "s_pastas1 = ps.utils.get_equidistant_series(\n",
    "    series, freq, minimize_data_loss=True)\n",
    "s_pastas2 = ps.utils.get_equidistant_series(\n",
    "    series, freq, minimize_data_loss=False)\n",
    "\n",
    "\n",
    "# Create figure\n",
    "plt.figure(figsize=(10, 6))\n",
    "ax = series.plot(marker=\"o\", label=\"original\", ms=10)\n",
    "s_pd2.plot(ax=ax, marker=\"x\", ms=10, label=\"pandas nearest\")\n",
    "s_pd3.plot(ax=ax, marker=\"^\", ms=8, label=\"pandas asfreq\")\n",
    "s_pd1.plot(ax=ax, marker=\"+\", ms=16, label=\"pandas sample\")\n",
    "s_pastas1.plot(ax=ax, marker=\".\", ms=6, label=\"equidistant (minimize data loss)\")\n",
    "s_pastas2.plot(ax=ax, marker=\"+\", ms=10, label=\"equidistant (default)\")\n",
    "ax.grid(b=True)\n",
    "ax.legend(loc=\"best\")\n",
    "ax.set_xlabel(\"\");\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this example we can observe the following behavior in each method:\n",
    "- `pandas_sample` retains 4 values.\n",
    "- `pandas_nearest` duplicates some observations in the equidistant timeseries.\n",
    "- `pandas_asfreq` does quite well, but drops some observations near the gap in the original timeseries.\n",
    "- the pastas method with the default option misses an observation right after the gap in the original timeseries.\n",
    "- the pastas method with `minimize_data_loss=True` fills this gap, using as much data as possible from the original timeseries.\n",
    "\n",
    "The results from the `pandas_asfreq` and pastas method are both good, but the pastas methods retains more of the original data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dfall = pd.concat([series, s_pd1, s_pd2, s_pd3, s_pastas2, s_pastas1], axis=1)\n",
    "dfall.columns = [\n",
    "    \"original\",\n",
    "    \"pandas_sample\",\n",
    "    \"pandas_nearest\",\n",
    "    \"pandas_asfreq\",\n",
    "    \"pastas (default)\",\n",
    "    \"pastas (minimize data loss)\",\n",
    "]\n",
    "dfall"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Example 4"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create timeseries\n",
    "freq = \"2H\"\n",
    "freq2 = \"1H\"\n",
    "idx0 = pd.date_range(\"2000-01-01 18:00:00\", freq=freq, periods=3).tolist()\n",
    "idx1 = pd.date_range(\"2000-01-02 00:00:00\", freq=freq2, periods=10).tolist()\n",
    "idx0 = idx0 + idx1\n",
    "series = pd.Series(index=idx0, data=np.arange(len(idx0), dtype=float))\n",
    "series.iloc[8:10] = np.nan\n",
    "\n",
    "# Create equidistant timeseries\n",
    "s_pd1 = pandas_sample(series, freq)\n",
    "s_pd2 = pandas_nearest(series, freq)\n",
    "s_pd3 = pandas_asfreq(series, freq)\n",
    "s_pastas1 = ps.utils.get_equidistant_series(\n",
    "    series, freq, minimize_data_loss=True)\n",
    "s_pastas2 = ps.utils.get_equidistant_series(\n",
    "    series, freq, minimize_data_loss=False)\n",
    "\n",
    "# Create figure\n",
    "plt.figure(figsize=(10, 6))\n",
    "ax = series.plot(marker=\"o\", label=\"original\", ms=10)\n",
    "s_pd2.plot(ax=ax, marker=\"x\", ms=10, label=\"pandas nearest\")\n",
    "s_pd3.plot(ax=ax, marker=\"^\", ms=8, label=\"pandas asfreq\")\n",
    "s_pd1.plot(ax=ax, marker=\"+\", ms=16, label=\"pandas sample\")\n",
    "s_pastas1.plot(ax=ax, marker=\".\", ms=6,\n",
    "               label=\"equidistant (minimize data loss)\")\n",
    "s_pastas2.plot(ax=ax, marker=\"+\", ms=10, label=\"equidistant (default)\")\n",
    "ax.grid(b=True)\n",
    "ax.legend(loc=\"best\")\n",
    "ax.set_xlabel(\"\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Similar to the previous example, the pastas method retains the most data from the original timeseries. In this case both pandas methods perform well, but do omit some of the original data at the end of the timeseries or near the gap in the original timeseries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dfall = pd.concat([series, s_pd1, s_pd2, s_pd3, s_pastas2, s_pastas1], axis=1)\n",
    "dfall.columns = [\n",
    "    \"original\",\n",
    "    \"pandas_sample\",\n",
    "    \"pandas_nearest\",\n",
    "    \"pandas_asfreq\",\n",
    "    \"pastas (default)\",\n",
    "    \"pastas (minimize data loss)\",\n",
    "]\n",
    "dfall\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}