{ "cells": [ { "cell_type": "markdown", "id": "e901f921", "metadata": {}, "source": [ "# Preprocessing user-provided time series\n", "*Developed by D. Brakenhoff, Artesia, R Caljé, Artesia and R.A. Collenteur, Eawag, January (2021-2023)*\n", "\n", "This notebooks shows how to solve the most common errors that arise during the validation of the user provided time series. After showing how to deal with some of the easier errors, we will dive into the topic of making time series equidistant. For this purpose Pastas contains a lot of helper functions. \n", "\n", "
\n", "\n", "Note \n", "To see the Error messages returned by `validate_series` method, you have to uncomment the code-line. This is done to make sure the notebook still runs automatically for our Documentation website.\n", "
\n" ] }, { "cell_type": "code", "execution_count": null, "id": "9c439993", "metadata": {}, "outputs": [], "source": [ "import pastas as ps\n", "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "ps.show_versions()" ] }, { "attachments": {}, "cell_type": "markdown", "id": "7318278d", "metadata": {}, "source": [ "## 1. Validating the time series, what is checked?\n", "\n", "Let us first look at the docstring of the `ps.validate_stress` method, which can be used to automatically check user-provided input time series. This method is also used internally in Pastas to check all user provided time series. For the stresses `ps.validate_stress` is used and for the oseries the `ps.validate_oseries` is used. The only difference between these methods is that the oseries are not checked for equidistant time series." ] }, { "cell_type": "code", "execution_count": null, "id": "61a4d76f", "metadata": {}, "outputs": [], "source": [ "?ps.validate_stress" ] }, { "cell_type": "markdown", "id": "626071f4", "metadata": {}, "source": [ "### a. If the time series is a DataFrame" ] }, { "cell_type": "code", "execution_count": null, "id": "7810ad02", "metadata": {}, "outputs": [], "source": [ "index = pd.date_range(\"2000-01-01\", \"2000-01-10\")\n", "series = pd.DataFrame(data=[np.arange(10.0)], index=index)\n", "\n", "# Here's error message returned by Pastas\n", "# ps.validate_stress(series)" ] }, { "cell_type": "code", "execution_count": null, "id": "badd62d9", "metadata": {}, "outputs": [], "source": [ "series.iloc[:, 0] # Simpy select the first column" ] }, { "cell_type": "markdown", "id": "ff78c299", "metadata": {}, "source": [ "### b. If values are not floats" ] }, { "cell_type": "code", "execution_count": null, "id": "60acf137", "metadata": {}, "outputs": [], "source": [ "index = pd.date_range(\"2000-01-01\", \"2000-01-10\")\n", "series = pd.Series(data=range(10), index=index, name=\"Stress\")\n", "\n", "# Here's error message returned by Pastas\n", "# ps.validate_stress(series)" ] }, { "cell_type": "code", "execution_count": null, "id": "992f3634", "metadata": {}, "outputs": [], "source": [ "# Here is a possible fix to this issue\n", "series = series.astype(float)" ] }, { "cell_type": "markdown", "id": "c173f0ac", "metadata": {}, "source": [ "### c. If the index is not a datetimeindex" ] }, { "cell_type": "code", "execution_count": null, "id": "eff81fdf", "metadata": {}, "outputs": [], "source": [ "series = pd.Series(data=np.arange(10.0), index=range(10), name=\"Stress\")\n", "\n", "# Here's error message returned by Pastas\n", "# ps.validate_stress(series)" ] }, { "cell_type": "code", "execution_count": null, "id": "081f25fc", "metadata": {}, "outputs": [], "source": [ "# Here is a possible fix to this issue\n", "series.index = pd.to_datetime(series.index)\n", "ps.validate_stress(series)" ] }, { "cell_type": "markdown", "id": "e513fe6b", "metadata": {}, "source": [ "### d. If index is not monotonically increasing" ] }, { "cell_type": "code", "execution_count": null, "id": "0069b7a6", "metadata": {}, "outputs": [], "source": [ "index = pd.to_datetime([\"2000-01-01\", \"2000-01-03\", \"2000-01-02\", \"2000-01-4\"])\n", "series = pd.Series(data=np.arange(4.0), index=index, name=\"Stress\")\n", "\n", "# Here's error message returned by Pastas\n", "ps.validate_stress(series)" ] }, { "cell_type": "code", "execution_count": null, "id": "662a029d", "metadata": {}, "outputs": [], "source": [ "# Here is a possible fix to this issue\n", "series = series.sort_index()\n", "ps.validate_stress(series)" ] }, { "cell_type": "markdown", "id": "d6cb3c37", "metadata": {}, "source": [ "### e. If the index has duplicate indices" ] }, { "cell_type": "code", "execution_count": null, "id": "bf21d496", "metadata": {}, "outputs": [], "source": [ "index = pd.to_datetime([\"2000-01-01\", \"2000-01-02\", \"2000-01-02\", \"2000-01-3\"])\n", "series = pd.Series(data=np.arange(4.0), index=index, name=\"Stress\")\n", "\n", "# Here's error message returned by Pastas\n", "ps.validate_stress(series)" ] }, { "cell_type": "code", "execution_count": null, "id": "4fe9b9d0", "metadata": {}, "outputs": [], "source": [ "# Here is a possible fix to this issue\n", "grouped = series.groupby(level=0)\n", "series = grouped.mean()\n", "ps.validate_stress(series)" ] }, { "cell_type": "markdown", "id": "b52c1724", "metadata": {}, "source": [ "### f. If the time series has nan-values" ] }, { "cell_type": "code", "execution_count": null, "id": "1a8244e1", "metadata": {}, "outputs": [], "source": [ "index = pd.date_range(\"2000-01-01\", \"2000-01-10\")\n", "series = pd.Series(data=np.arange(10.0), index=index, name=\"Stress\")\n", "series.loc[\"2000-01-05\"] = np.nan\n", "\n", "# Here's warning message returned by Pastas\n", "ps.validate_stress(series)" ] }, { "cell_type": "code", "execution_count": null, "id": "305f1760", "metadata": {}, "outputs": [], "source": [ "# Here is a possible fix to this issue for oseries\n", "series.dropna() # simply drop the nan-values\n", "\n", "# Here is a possible fix to this issue for stresses\n", "series = series.fillna(series.mean()) # For example for water levels\n", "series = series.fillna(0.0) # For example for precipitation\n", "series = series.interpolate(method=\"time\") # For example for evaporation" ] }, { "cell_type": "markdown", "id": "4a362629", "metadata": {}, "source": [ "## 2. If a stress time series has non-equidistant time steps\n" ] }, { "cell_type": "code", "execution_count": null, "id": "a2e4b1f4", "metadata": {}, "outputs": [], "source": [ "# Create timeseries\n", "freq = \"6H\"\n", "idx0 = pd.date_range(\"2000-01-01\", freq=freq, periods=7).tolist()\n", "idx0[3] = idx0[3] + pd.to_timedelta(1, \"H\")\n", "series = pd.Series(index=idx0, data=np.arange(len(idx0), dtype=float), name=\"Stress\")\n", "\n", "# Here's error message return by Pastas\n", "ps.validate_stress(series)" ] }, { "cell_type": "markdown", "id": "9a2f2373", "metadata": {}, "source": [ "Pastas contains some convenience functions for creating equidistant time series. The method for creating an equidistant time series depends on the type of stress and different methods are used for for fluxes (e.g. precipitation, evaporation, pumping discharge) and levels (e.g. head, water levels). These methods are presented in the next sections." ] }, { "attachments": {}, "cell_type": "markdown", "id": "86d5aa86", "metadata": {}, "source": [ "### Creating equidistant time series for fluxes\n", "There are several methods in Pandas to create equidistant series. A flux describes a measured or logged quantity over a period of time, ressulting in a pandas Series. In this series, each flux is asssigned a timestamp in the index. In Pastas, we assume the timestamp is at _the end of the period_ that belongs to each measurement. This means that the precipitation of march 5 2022 gets the timestamp '2022-03-06 00:00:00' (which can be counter-intuitive, as the index is now a day later). Therefore, when using Pandas resample methods, we add two parameters: closed='right' and label='right'. So given this series of precipitation in mm:" ] }, { "cell_type": "code", "execution_count": null, "id": "19f3be77", "metadata": {}, "outputs": [], "source": [ "# plot the original series\n", "series" ] }, { "attachments": {}, "cell_type": "markdown", "id": "9277882e", "metadata": {}, "source": [ "using `\"right\"` would yield:" ] }, { "cell_type": "code", "execution_count": null, "id": "90fffac6", "metadata": {}, "outputs": [], "source": [ "series.resample(\"12H\", closed=\"right\", label=\"right\").sum()" ] }, { "attachments": {}, "cell_type": "markdown", "id": "4e8f4e47", "metadata": {}, "source": [ "which is logical because over the first 12 hours (between 01-01 00:00:00 and 01-01 12:00:00) 3mm of precipitation fell. However, using `\"left\"` would yield:" ] }, { "cell_type": "code", "execution_count": null, "id": "47536431", "metadata": {}, "outputs": [], "source": [ "series.resample(\"12H\", closed=\"left\", label=\"left\").sum()" ] }, { "attachments": {}, "cell_type": "markdown", "id": "583f24ad", "metadata": {}, "source": [ "Pastas helps users with a simple wrapper around the pandas resample function with setting \"right\" keyword arguments for `closed` and `label`:\n", "\n", "`resampler = pastas.timeseries_utils.resample(series, freq)`. \n", "\n", "When this resampler is return series can easily be interpolated, summed or averaged using all [resample methods](https://pandas.pydata.org/docs/reference/resampling.html) available in the Pandas library." ] }, { "cell_type": "code", "execution_count": null, "id": "53fb5987", "metadata": {}, "outputs": [], "source": [ "ps.timeseries_utils.resample(\n", " series, \"12H\"\n", ").sum() # gives the same as series.resample(\"12H\", closed=\"right\", label=\"right\").sum()" ] }, { "attachments": {}, "cell_type": "markdown", "id": "e12fb4a6", "metadata": {}, "source": [ "The resample-method of pandas basically is a groupby-method. This creates problems when there is not a single measurement in each group / bin." ] }, { "cell_type": "code", "execution_count": null, "id": "9f911d9b", "metadata": {}, "outputs": [], "source": [ "series.resample(\"6H\", closed=\"right\", label=\"right\").mean()" ] }, { "attachments": {}, "cell_type": "markdown", "id": "871598bf", "metadata": {}, "source": [ "This is demponstrated by the NaN at 2000-01-01 18:00:00. Also when there are two values in a group, these are averaged, even though one of the values only counts for one hour, and the other for 6 hours. This is demonstrated by the value of 3.5 at 2000-01-02 00:00:00.\n", "\n", "Because of these problems there is a method called 'timestep_weighted_resample' in Pastas. This method assumes the index is at the end of the period that belongs to each measurement, just like the rest of Pastas. Using this assumption, the method can calculate a new series, using an overlapping period weighted average:" ] }, { "cell_type": "code", "execution_count": null, "id": "0bd45b5a", "metadata": {}, "outputs": [], "source": [ "new_index = pd.date_range(series.index[0], series.index[-1], freq=\"12H\")\n", "ps.ts.timestep_weighted_resample(series, new_index)" ] }, { "cell_type": "markdown", "id": "1d232423", "metadata": {}, "source": [ "We see the value at 2000-01-02 00:00:00 is now 3.833333, which is 1/6th of 3.0 (the original value at 2000-01-01 19:00:00) and 5/6th of 4.0 (the original value at 2000-01-02 00:00:00). This methods sets a NaN for the value at 2000-01-01 00:00:00, as the length of the period cannot be determined." ] }, { "attachments": {}, "cell_type": "markdown", "id": "ef4a8f1b", "metadata": {}, "source": [ "The following examples showcase some potentially useful operations for which\n", "`timestep_weighted_resample` can be used. " ] }, { "attachments": {}, "cell_type": "markdown", "id": "83bc9e13", "metadata": {}, "source": [ "### Example 1: Resample monthly pumping volumes to daily values\n", "\n", "Monthly aggregated data is common, and in this synthetic example we'll assume\n", "that we received data for monthly pumping rate from a well. We want to\n", "convert these to daily data so that we can create a time series model with a\n", "daily time step.\n", "\n", "First let's invent some data." ] }, { "cell_type": "code", "execution_count": null, "id": "80b04dd3", "metadata": {}, "outputs": [], "source": [ "# monthly discharge volumes\n", "# volume on 1 february 2022 at 00:00 is the volume for month of january\n", "index = pd.date_range(\"2022-02-01\", \"2023-01-01\", freq=\"MS\")\n", "discharge = np.arange(12) * 100\n", "\n", "Q0 = pd.Series(index=index, data=discharge)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "5a1c1a8e", "metadata": {}, "source": [ "Next, use `timestep_weighted_resample` to calculate a daily time series for\n", "well pumping.\n", "\n", "**Note**: _the unit of the resulting daily time series has not changed! This\n", "means the daily value is assumed to be the same as the monthly value. So take\n", "note of your units when modeling resampled time series with Pastas!_" ] }, { "cell_type": "code", "execution_count": null, "id": "c99e071e", "metadata": {}, "outputs": [], "source": [ "# define a new index\n", "new_index = pd.date_range(\"2022-01-01\", \"2022-12-31\", freq=\"D\")\n", "\n", "# resample to daily data\n", "Q_daily = ps.ts.timestep_weighted_resample(Q0, new_index)\n", "\n", "# plot the comparison\n", "ax = Q0.plot(marker=\"o\", label=\"original monthly series\")\n", "Q_daily.plot(ax=ax, label=\"resampled daily series\")\n", "ax.legend()\n", "ax.grid(True)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "2d1af6e1", "metadata": {}, "source": [ "### Example 2: Resample precipitation between 9AM-9AM to 12AM-12AM\n", "\n", "In the Netherlands, the rainfall gauges that are measured daily measure between\n", "9 AM one day and 9 AM the next day. For time series models with a daily\n", "timestep it is often simpler to use time series represent one full day (from 12\n", "AM - 12 AM). We can do this ourselved by applying the formula:\n", "$data_{24}[t] = \\frac{9}{24}data_{9}[t] + \\frac{15}{24}data_{9}[t+1]$ but once again,\n", "`timestep_weighted_resample` can help us calculate this resampled time series.\n", "First we invent some new random data." ] }, { "cell_type": "code", "execution_count": null, "id": "d3026993", "metadata": {}, "outputs": [], "source": [ "index = pd.date_range(\"2022-01-01 09:00:00\", \"2022-01-10 09:00:00\")\n", "data = np.random.rand(len(index))\n", "\n", "p0 = pd.Series(index=index, data=data)\n", "p0" ] }, { "attachments": {}, "cell_type": "markdown", "id": "35f6e4c7", "metadata": {}, "source": [ "The result is shown below. Note how each resampled point represents the\n", "weighted average of the two surrounding observations, which is what we want." ] }, { "cell_type": "code", "execution_count": null, "id": "24cc9ad9", "metadata": {}, "outputs": [], "source": [ "# create a new daily index, running from 12AM-12AM\n", "new_index = pd.date_range(p0.index[0].normalize(), p0.index[-1].normalize())\n", "\n", "# resample time series\n", "p_resample = ps.ts.timestep_weighted_resample(p0, new_index)\n", "\n", "# try it ourselves:\n", "p_self = [\n", " data[i] * (h / 24) + data[i + 1] * ((24 - h) / 24)\n", " for i, h in enumerate(index.hour.values[1:])\n", "]\n", "p_resample_self = pd.Series(p_self, new_index[1:])\n", "\n", "# plot comparison\n", "ax = p0.plot(marker=\"o\", label=\"original 9AM series\")\n", "p_resample.plot(marker=\"^\", ax=ax, label=\"resampled series\")\n", "p_resample_self.plot(\n", " marker=\"d\", markersize=3, linestyle=\"--\", ax=ax, label=\"resampled series self\"\n", ")\n", "ax.legend()\n", "ax.grid(True)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "8c2db168", "metadata": {}, "source": [ "### Example 3: Resample hourly data to daily data\n", "\n", "Another frequent pre-processing step is converting hourly data to daily values.\n", "This can be done with simple pandas methods, but `timestep_weighted_resample`\n", "can also handle this calculation, with the added advantage of supporting\n", "irregular time steps in the new time series.\n", "\n", "In this example we'll simply convert an hourly time series into a daily time series." ] }, { "cell_type": "code", "execution_count": null, "id": "bbfde7aa", "metadata": {}, "outputs": [], "source": [ "# create some hourly data\n", "index = pd.date_range(\"2022-01-01 00:00:00\", \"2022-01-03 00:00:00\", freq=\"H\")\n", "data = np.hstack([np.arange(len(index) // 2), 24 * np.ones(len(index) // 2 + 1)])\n", "\n", "# convert to series\n", "p0 = pd.Series(index=index, data=data)\n", "p0.head()" ] }, { "attachments": {}, "cell_type": "markdown", "id": "d2b5d1a7", "metadata": {}, "source": [ "The result shows that the resampled value at the end of each day represents the\n", "average value of that day, which is what we would expect." ] }, { "cell_type": "code", "execution_count": null, "id": "28097131", "metadata": {}, "outputs": [], "source": [ "# create a new daily index\n", "new_index = pd.date_range(p0.index[0].normalize(), p0.index[-1].normalize(), freq=\"D\")\n", "\n", "# resample measurements\n", "p_resample = ps.ts.timestep_weighted_resample(p0, new_index)\n", "\n", "# plot the comparison\n", "ax = p0.plot(marker=\"o\", label=\"original hourly series\")\n", "p_resample.plot(marker=\"^\", ax=ax, label=\"resampled daily series\")\n", "ax.legend()\n", "ax.grid(True)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "0395b960", "metadata": {}, "source": [ "### Creating equidistant time series for (ground)waterlevels\n", "\n", "The following methods are used for time series that do not need to be resampled. For example, the measurements represent a state (e.g. head, water level) and either an equidistant sample is taken from the original time series, or observations are shifted slightly, to create an equidistant time series. This can be done in a number of different ways:\n", "\n", "- `pandas_equidistant_sample` takes a sample at equidistant timesteps from the original series, at the user-specified frequency. For very irregular time series lots of observations will be lost. The advantage is that observations are not shifted in time, unlike in the other methods.\n", "- `pandas_equidistant_nearest` creates a new equidistant index with the user-specified frequency, then `Series.reindex()` is used with `method=\"nearest\"` which will shift certain observations in time to fill the equidistant time series. This method can introduce duplicates (i.e. an observation that is used more than once) in the final result.\n", "- `pandas_equidistant_asfreq` rounds the series index to the user-specified frequency, then drops any duplicates before calling `Series.asfreq` with the user-specified frequency. This ensures no duplicates are contained in the resulting time series.\n", "- `get_equidistant_timeseries_nearest` creates a equidistant time series minimizing the number of dropped points and ensuring that each observation from the original time series is used only once in the resulting equidistant time series. This method \n", "\n", "\n", "The different methods are compared in the following four examples.\n", "\n", "_**Note:** in terms of performance the pandas methods are much faster._" ] }, { "attachments": {}, "cell_type": "markdown", "id": "2cb59811", "metadata": {}, "source": [ "#### Example 1\n", "\n", "Lets create a timeseries spaced which is normally spaced with a frequency of 6\n", "hours. The first and last measurement are shifted a bit later and earlier\n", "respectively. \n", "\n", "The different methods for creating equidistant time series for levels are compared." ] }, { "cell_type": "code", "execution_count": null, "id": "141d0369", "metadata": {}, "outputs": [], "source": [ "# Create time series\n", "freq = \"6H\"\n", "idx0 = pd.date_range(\"2000-01-01\", freq=freq, periods=7).tolist()\n", "idx0[0] = pd.Timestamp(\"2000-01-01 04:00:00\")\n", "idx0[-1] = pd.Timestamp(\"2000-01-02 11:00:00\")\n", "series = pd.Series(index=idx0, data=np.arange(len(idx0), dtype=float))\n", "\n", "# Create equidistant time series with Pastas\n", "s_pd1 = ps.ts.pandas_equidistant_sample(series, freq)\n", "s_pd2 = ps.ts.pandas_equidistant_nearest(series, freq)\n", "s_pd3 = ps.ts.pandas_equidistant_asfreq(series, freq)\n", "s_pastas = ps.ts.get_equidistant_series_nearest(series, freq)\n", "\n", "# Create figure\n", "plt.figure(figsize=(10, 4))\n", "ax = series.plot(\n", " marker=\"o\",\n", " label=\"original time series\",\n", " ms=10,\n", ")\n", "s_pd2.plot(ax=ax, marker=\"x\", ms=8, label=\"pandas_equidistant_nearest\")\n", "s_pd3.plot(ax=ax, marker=\"^\", ms=8, label=\"pandas_equidistant_asfreq\")\n", "s_pd1.plot(ax=ax, marker=\"+\", ms=16, label=\"pandas_equidistant_sample\")\n", "s_pastas.plot(ax=ax, marker=\".\", label=\"get_equidistant_series_nearest\")\n", "\n", "ax.grid(True)\n", "ax.legend(loc=\"best\")\n", "ax.set_xlabel(\"\");" ] }, { "attachments": {}, "cell_type": "markdown", "id": "8682dad5", "metadata": {}, "source": [ "Both the `pandas_equidistant_nearest` and `pandas_equidistant_asfreq` methods and `get_equidistant_series_nearest` show the observations at the beginning and the end of the time series are shifted to the nearest equidistant timestamp. The `pandas_equidistant_sample` method drops 2 datapoints because they're measured at different time offsets." ] }, { "cell_type": "code", "execution_count": null, "id": "834e18a2", "metadata": {}, "outputs": [], "source": [ "# some helper functions to show differences in performance\n", "def values_kept(s, original):\n", " diff = set(original.dropna().values) & set(s.dropna().values)\n", " return len(diff)\n", "\n", "\n", "def n_duplicates(s):\n", " return (s.value_counts() >= 2).sum()" ] }, { "cell_type": "code", "execution_count": null, "id": "cf30f93b", "metadata": {}, "outputs": [], "source": [ "dfall = pd.concat([series, s_pd1, s_pd2, s_pd3, s_pastas], axis=1)\n", "dfall.columns = [\n", " \"original\",\n", " \"pandas_equidistant_sample\",\n", " \"pandas_equidistant_nearest\",\n", " \"pandas_equidistant_asfreq\",\n", " \"get_equidistant_series_nearest\",\n", "]\n", "dfall" ] }, { "cell_type": "markdown", "id": "f273ac47", "metadata": {}, "source": [ "The following table summarizes the results, showing how many values from the original time series are kept and how many duplicates are contained in the final result." ] }, { "cell_type": "code", "execution_count": null, "id": "431be517", "metadata": {}, "outputs": [], "source": [ "valueskept = dfall.apply(values_kept, args=(dfall[\"original\"],))\n", "valueskept.name = \"values kept\"\n", "duplicates = dfall.apply(n_duplicates)\n", "duplicates.name = \"duplicates\"\n", "\n", "pd.concat([valueskept, duplicates], axis=1)" ] }, { "cell_type": "markdown", "id": "479436fc", "metadata": {}, "source": [ "#### Example 2" ] }, { "cell_type": "code", "execution_count": null, "id": "485a1069", "metadata": {}, "outputs": [], "source": [ "# Create timeseries\n", "freq = \"D\"\n", "idx0 = pd.date_range(\"2000-01-01\", freq=freq, periods=7).tolist()\n", "idx0[0] = pd.Timestamp(\"2000-01-01 09:00:00\")\n", "del idx0[2]\n", "del idx0[2]\n", "idx0[-2] = pd.Timestamp(\"2000-01-06 13:00:00\")\n", "idx0[-1] = pd.Timestamp(\"2000-01-06 23:00:00\")\n", "series = pd.Series(index=idx0, data=np.arange(len(idx0), dtype=float))\n", "\n", "# Create equidistant timeseries\n", "s_pd1 = ps.ts.pandas_equidistant_sample(series, freq)\n", "s_pd2 = ps.ts.pandas_equidistant_nearest(series, freq)\n", "s_pd3 = ps.ts.pandas_equidistant_asfreq(series, freq)\n", "s_pastas = ps.ts.get_equidistant_series_nearest(series, freq)\n", "\n", "# Create figure\n", "plt.figure(figsize=(10, 4))\n", "ax = series.plot(marker=\"o\", label=\"original\", ms=10)\n", "s_pd2.plot(ax=ax, marker=\"x\", ms=10, label=\"pandas_equidistant_nearest\")\n", "s_pd3.plot(ax=ax, marker=\"^\", ms=8, label=\"pandas_equidistant_asfreq\")\n", "s_pd1.plot(ax=ax, marker=\"+\", ms=16, label=\"pandas_equidistant_sample\")\n", "s_pastas.plot(ax=ax, marker=\".\", label=\"get_equidistant_series_nearest\")\n", "ax.grid(True)\n", "ax.legend(loc=\"best\")\n", "ax.set_xlabel(\"\");" ] }, { "cell_type": "markdown", "id": "4a4fdd60", "metadata": {}, "source": [ "In this example, the shortcomings of `pandas_equidistant_nearest` are clearly visible. It duplicates observations from the original timeseries to fill the gaps. This can be solved by passing e.g. `tolerance=\"0.99{freq}\"` to `series.reindex()` in which case the gaps will not be filled. However, with very irregular timesteps this is not guaranteed to work and duplicates may still occur. The `pandas_equidistant_asfreq` and pastas methods work as expected and uses the available data to create a reasonable equidistant timeseries from the original data. The `pandas_equidistant_sample` method is only able to keep two observations from the original series in this example." ] }, { "cell_type": "code", "execution_count": null, "id": "33ff7b9c", "metadata": {}, "outputs": [], "source": [ "dfall = pd.concat([series, s_pd1, s_pd2, s_pd3, s_pastas], axis=1)\n", "dfall.columns = [\n", " \"original\",\n", " \"pandas_equidistant_sample\",\n", " \"pandas_equidistant_nearest\",\n", " \"pandas_equidistant_asfreq\",\n", " \"get_equidistant_series_nearest\",\n", "]\n", "dfall" ] }, { "cell_type": "markdown", "id": "1ada72be", "metadata": {}, "source": [ "The following table summarizes the results, showing how many values from the original time series are kept and how many duplicates are contained in the final result." ] }, { "cell_type": "code", "execution_count": null, "id": "6992eb07", "metadata": {}, "outputs": [], "source": [ "valueskept = dfall.apply(values_kept, args=(dfall[\"original\"],))\n", "valueskept.name = \"values kept\"\n", "duplicates = dfall.apply(n_duplicates)\n", "duplicates.name = \"duplicates\"\n", "\n", "pd.concat([valueskept, duplicates], axis=1)" ] }, { "cell_type": "markdown", "id": "c5f3ad30", "metadata": {}, "source": [ "#### Example 3" ] }, { "cell_type": "code", "execution_count": null, "id": "5cba7522", "metadata": {}, "outputs": [], "source": [ "# Create timeseries\n", "freq = \"2H\"\n", "freq2 = \"1H\"\n", "idx0 = pd.date_range(\"2000-01-01 18:00:00\", freq=freq, periods=3).tolist()\n", "idx1 = pd.date_range(\"2000-01-02 01:30:00\", freq=freq2, periods=10).tolist()\n", "idx0 = idx0 + idx1\n", "idx0[3] = pd.Timestamp(\"2000-01-02 01:31:00\")\n", "series = pd.Series(index=idx0, data=np.arange(len(idx0), dtype=float))\n", "series.iloc[8:10] = np.nan\n", "\n", "\n", "# Create equidistant timeseries\n", "s_pd1 = ps.ts.pandas_equidistant_sample(series, freq)\n", "s_pd2 = ps.ts.pandas_equidistant_nearest(series, freq)\n", "s_pd3 = ps.ts.pandas_equidistant_asfreq(series, freq)\n", "s_pastas1 = ps.ts.get_equidistant_series_nearest(series, freq, minimize_data_loss=True)\n", "s_pastas2 = ps.ts.get_equidistant_series_nearest(series, freq, minimize_data_loss=False)\n", "\n", "\n", "# Create figure\n", "plt.figure(figsize=(10, 6))\n", "ax = series.plot(marker=\"o\", label=\"original\", ms=10)\n", "s_pd2.plot(ax=ax, marker=\"x\", ms=10, label=\"pandas_equidistant_nearest\")\n", "s_pd3.plot(ax=ax, marker=\"^\", ms=8, label=\"pandas_equidistant_asfreq\")\n", "s_pd1.plot(ax=ax, marker=\"+\", ms=16, label=\"pandas_equidistant_sample\")\n", "s_pastas1.plot(\n", " ax=ax, marker=\".\", ms=6, label=\"get_equidistant_series_nearest (minimize data loss)\"\n", ")\n", "s_pastas2.plot(\n", " ax=ax, marker=\"+\", ms=10, label=\"get_equidistant_series_nearest (default)\"\n", ")\n", "ax.grid(True)\n", "ax.legend(loc=\"best\")\n", "ax.set_xlabel(\"\");" ] }, { "attachments": {}, "cell_type": "markdown", "id": "34b2daa3", "metadata": {}, "source": [ "In this example we can observe the following behavior in each method:\n", "- `pandas_equidistant_sample` retains 4 values.\n", "- `pandas_equidistant_nearest` duplicates some observations in the equidistant timeseries.\n", "- `pandas_equidistant_asfreq` does quite well, but drops some observations near the gap in the original timeseries.\n", "- `get_equidistant_series_nearest` method misses an observation right after the gap in the original timeseries.\n", "- `get_equidistant_series_nearest` with `minimize_data_loss=True` fills this gap, using as much data as possible from the original timeseries.\n", "\n", "The results from the `pandas_equidistant_asfreq` and `get_equidistant_series_nearest` methods both work well, but the latter method retains more of the original data." ] }, { "cell_type": "code", "execution_count": null, "id": "a795fc12", "metadata": {}, "outputs": [], "source": [ "dfall = pd.concat([series, s_pd1, s_pd2, s_pd3, s_pastas2, s_pastas1], axis=1)\n", "dfall.columns = [\n", " \"original\",\n", " \"pandas_equidistant_sample\",\n", " \"pandas_equidistant_nearest\",\n", " \"pandas_equidistant_asfreq\",\n", " \"get_equidistant_series_nearest (default)\",\n", " \"get_equidistant_series_nearest (minimize data loss)\",\n", "]\n", "dfall" ] }, { "cell_type": "markdown", "id": "f304f555", "metadata": {}, "source": [ "The following table summarizes the results, showing how many values from the original time series are kept and how many duplicates are contained in the final result." ] }, { "cell_type": "code", "execution_count": null, "id": "795e5e5b", "metadata": {}, "outputs": [], "source": [ "valueskept = dfall.apply(values_kept, args=(dfall[\"original\"],))\n", "valueskept.name = \"values kept\"\n", "duplicates = dfall.apply(n_duplicates)\n", "duplicates.name = \"duplicates\"\n", "\n", "pd.concat([valueskept, duplicates], axis=1)" ] }, { "cell_type": "markdown", "id": "1211dc96", "metadata": {}, "source": [ "#### Example 4" ] }, { "cell_type": "code", "execution_count": null, "id": "8696911b", "metadata": {}, "outputs": [], "source": [ "# Create timeseries\n", "freq = \"2H\"\n", "freq2 = \"1H\"\n", "idx0 = pd.date_range(\"2000-01-01 18:00:00\", freq=freq, periods=3).tolist()\n", "idx1 = pd.date_range(\"2000-01-02 00:00:00\", freq=freq2, periods=10).tolist()\n", "idx0 = idx0 + idx1\n", "series = pd.Series(index=idx0, data=np.arange(len(idx0), dtype=float))\n", "series.iloc[8:10] = np.nan\n", "\n", "# Create equidistant timeseries\n", "s_pd1 = ps.ts.pandas_equidistant_sample(series, freq)\n", "s_pd2 = ps.ts.pandas_equidistant_nearest(series, freq)\n", "s_pd3 = ps.ts.pandas_equidistant_asfreq(series, freq)\n", "s_pastas1 = ps.ts.get_equidistant_series_nearest(series, freq, minimize_data_loss=True)\n", "s_pastas2 = ps.ts.get_equidistant_series_nearest(series, freq, minimize_data_loss=False)\n", "\n", "# Create figure\n", "plt.figure(figsize=(10, 6))\n", "ax = series.plot(marker=\"o\", label=\"original\", ms=10)\n", "s_pd2.plot(ax=ax, marker=\"x\", ms=10, label=\"pandas_equidistant_nearest\")\n", "s_pd3.plot(ax=ax, marker=\"^\", ms=8, label=\"pandas_equidistant_asfreq\")\n", "s_pd1.plot(ax=ax, marker=\"+\", ms=16, label=\"pandas_equidistant_sample\")\n", "s_pastas1.plot(\n", " ax=ax, marker=\".\", ms=6, label=\"get_equidistant_series_nearest (minimize data loss)\"\n", ")\n", "s_pastas2.plot(\n", " ax=ax, marker=\"+\", ms=10, label=\"get_equidistant_series_nearest (default)\"\n", ")\n", "ax.grid(True)\n", "ax.legend(loc=\"best\")\n", "ax.set_xlabel(\"\");" ] }, { "cell_type": "markdown", "id": "33b56249", "metadata": {}, "source": [ "Similar to the previous example, `get_equidistant_timeseries` retains the most data from the original timeseries. In this case both the `pandas_equidistant_asfreq` and `pandas_equidistant_nearest` methods perform well, but do omit some of the original data at the end of the timeseries or near the gap in the original timeseries." ] }, { "cell_type": "code", "execution_count": null, "id": "1b157111", "metadata": {}, "outputs": [], "source": [ "dfall = pd.concat([series, s_pd1, s_pd2, s_pd3, s_pastas2, s_pastas1], axis=1)\n", "dfall.columns = [\n", " \"original\",\n", " \"pandas_equidistant_sample\",\n", " \"pandas_equidistant_nearest\",\n", " \"pandas_equidistant_asfreq\",\n", " \"get_equidistant_series_nearest (default)\",\n", " \"get_equidistant_series_nearest (minimize data loss)\",\n", "]\n", "dfall" ] }, { "cell_type": "markdown", "id": "b3f34c90", "metadata": {}, "source": [ "The following table summarizes the results, showing how many values from the original time series are kept and how many duplicates are contained in the final result." ] }, { "cell_type": "code", "execution_count": null, "id": "950db6ec", "metadata": { "scrolled": true }, "outputs": [], "source": [ "valueskept = dfall.apply(values_kept, args=(dfall[\"original\"],))\n", "valueskept.name = \"values kept\"\n", "duplicates = dfall.apply(n_duplicates)\n", "duplicates.name = \"duplicates\"\n", "\n", "pd.concat([valueskept, duplicates], axis=1)" ] } ], "metadata": { "kernelspec": { "display_name": "artesia", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.15" }, "vscode": { "interpreter": { "hash": "dace5e1b41a98a8e52d2a8eebc3b981caf2c12e7a76736ebfb89a489e3b62e79" } } }, "nbformat": 4, "nbformat_minor": 5 }