
Preprocessing user-provided time series#

Developed by D. Brakenhoff, Artesia, R Caljé, Artesia and R.A. Collenteur, Eawag, January (2021-2023)

This notebook shows how to solve the most common errors that arise during the validation of user-provided time series. After showing how to deal with some of the easier errors, we will dive into the topic of making time series equidistant. For this purpose Pastas contains a number of helper functions.

import pastas as ps
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

ps.show_versions()
Python version: 3.11.6
NumPy version: 1.26.4
Pandas version: 2.2.2
SciPy version: 1.13.0
Matplotlib version: 3.8.4
Numba version: 0.59.1
LMfit version: 1.3.1
Latexify version: Not Installed
Pastas version: 1.5.0

1. Validating the time series, what is checked?#

Let us first look at the docstring of the ps.validate_stress method, which can be used to automatically check user-provided input time series. Pastas also uses these methods internally to check all user-provided time series: ps.validate_stress for the stresses and ps.validate_oseries for the oseries. The only difference between the two methods is that the oseries are not required to be equidistant.

?ps.validate_stress
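The checks that the following subsections walk through can be approximated with plain pandas. The sketch below is only an illustration: find_series_problems is a hypothetical helper, not part of Pastas, and it does not reproduce the exact logic or messages of ps.validate_stress. It assumes that pd.infer_freq returning None is a reasonable stand-in for "not equidistant".

```python
import numpy as np
import pandas as pd


def find_series_problems(series, equidistant=True):
    """Hypothetical helper: list the problems discussed in this notebook."""
    if not isinstance(series, pd.Series):
        return ["not a pandas Series (select one column of a DataFrame first)"]
    problems = []
    if not pd.api.types.is_float_dtype(series.dtype):
        problems.append("values are not floats")
    if not isinstance(series.index, pd.DatetimeIndex):
        # The remaining checks only make sense for a DatetimeIndex
        return problems + ["index is not a DatetimeIndex"]
    sorted_ok = series.index.is_monotonic_increasing
    unique_ok = series.index.is_unique
    if not sorted_ok:
        problems.append("index is not monotonically increasing")
    if not unique_ok:
        problems.append("index contains duplicates")
    if series.hasnans:
        problems.append("series contains nan-values")
    # pd.infer_freq needs at least three timestamps
    if equidistant and sorted_ok and unique_ok and len(series) >= 3:
        if pd.infer_freq(series.index) is None:
            problems.append("time steps are not equidistant")
    return problems


index = pd.date_range("2000-01-01", periods=4, freq="D")
good = pd.Series(np.arange(4.0), index=index, name="Stress")
ints = pd.Series(range(4), index=index, name="Stress")
shuffled = good.iloc[[0, 2, 1, 3]]
irregular = good.drop(good.index[1])

for name, s in [("good", good), ("ints", ints), ("shuffled", shuffled), ("irregular", irregular)]:
    print(name, find_series_problems(s))
```

Passing equidistant=False skips the last check, which mirrors the difference between stresses and oseries described above.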

a. If the time series is a DataFrame#

index = pd.date_range("2000-01-01", "2000-01-10")
series = pd.DataFrame(data=[np.arange(10.0)], index=index)

# Here's error message returned by Pastas
# ps.validate_stress(series)
series.iloc[:, 0]  # Simply select the first column
2000-01-01    0.0
2000-01-02    0.0
2000-01-03    0.0
2000-01-04    0.0
2000-01-05    0.0
2000-01-06    0.0
2000-01-07    0.0
2000-01-08    0.0
2000-01-09    0.0
2000-01-10    0.0
Freq: D, Name: 0, dtype: float64

b. If values are not floats#

index = pd.date_range("2000-01-01", "2000-01-10")
series = pd.Series(data=range(10), index=index, name="Stress")

# Here's error message returned by Pastas
# ps.validate_stress(series)
# Here is a possible fix to this issue
series = series.astype(float)

c. If the index is not a datetimeindex#

series = pd.Series(data=np.arange(10.0), index=range(10), name="Stress")

# Here's error message returned by Pastas
# ps.validate_stress(series)
# Here is a possible fix to this issue
series.index = pd.to_datetime(series.index)
ps.validate_stress(series)
True
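Note that pd.to_datetime interprets integers as nanoseconds since 1970-01-01 by default, so the fix above is only meaningful if that is what the integers represent. If the integers are, for example, day counts from a known start date, the index can be built explicitly. The sketch below assumes a hypothetical start date of 2000-01-01 and a daily step; both are illustration choices.

```python
import numpy as np
import pandas as pd

series = pd.Series(data=np.arange(10.0), index=range(10), name="Stress")

# Assumption: the integers count days since a known start date (hypothetical)
start = pd.Timestamp("2000-01-01")
series.index = start + pd.to_timedelta(series.index.to_numpy(), unit="D")

print(series.index[0], series.index[-1])
```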

d. If index is not monotonically increasing#

index = pd.to_datetime(["2000-01-01", "2000-01-03", "2000-01-02", "2000-01-4"])
series = pd.Series(data=np.arange(4.0), index=index, name="Stress")

# Here's error message returned by Pastas
ps.validate_stress(series)
ERROR: The time-indices of series Stress are not monotonically increasing. Try to use `series.sort_index()` to fix it.
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[9], line 5
      2 series = pd.Series(data=np.arange(4.0), index=index, name="Stress")
      4 # Here's error message returned by Pastas
----> 5 ps.validate_stress(series)

File ~/checkouts/readthedocs.org/user_builds/pastas/envs/latest/lib/python3.11/site-packages/pastas/timeseries.py:644, in validate_stress(series)
    610 def validate_stress(series: Series):
    611     """Method to validate user-provided stress input time series.
    612 
    613     Parameters
   (...)
    642     >>> ps.validate_stress(series)
    643     """
--> 644     return _validate_series(series, equidistant=True)

File ~/checkouts/readthedocs.org/user_builds/pastas/envs/latest/lib/python3.11/site-packages/pastas/timeseries.py:756, in _validate_series(series, equidistant)
    751     msg = (
    752         "The time-indices of series %s are not monotonically increasing. Try "
    753         "to use `series.sort_index()` to fix it."
    754     )
    755     logger.error(msg, name)
--> 756     raise ValueError(msg % name)
    758 # 6. Make sure there are no duplicate indices
    759 if not series.index.is_unique:

ValueError: The time-indices of series Stress are not monotonically increasing. Try to use `series.sort_index()` to fix it.
# Here is a possible fix to this issue
series = series.sort_index()
ps.validate_stress(series)
True

e. If the index has duplicate indices#

index = pd.to_datetime(["2000-01-01", "2000-01-02", "2000-01-02", "2000-01-3"])
series = pd.Series(data=np.arange(4.0), index=index, name="Stress")

# Here's error message returned by Pastas
ps.validate_stress(series)
ERROR: duplicate time-indexes were found in the time series Stress. Make sure there are no duplicate indices. For example by `grouped = series.groupby(level=0); series = grouped.mean()`or `series = series.loc[~series.index.duplicated(keep='first/last')]`
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[11], line 5
      2 series = pd.Series(data=np.arange(4.0), index=index, name="Stress")
      4 # Here's error message returned by Pastas
----> 5 ps.validate_stress(series)

File ~/checkouts/readthedocs.org/user_builds/pastas/envs/latest/lib/python3.11/site-packages/pastas/timeseries.py:644, in validate_stress(series)
    610 def validate_stress(series: Series):
    611     """Method to validate user-provided stress input time series.
    612 
    613     Parameters
   (...)
    642     >>> ps.validate_stress(series)
    643     """
--> 644     return _validate_series(series, equidistant=True)

File ~/checkouts/readthedocs.org/user_builds/pastas/envs/latest/lib/python3.11/site-packages/pastas/timeseries.py:767, in _validate_series(series, equidistant)
    760     msg = (
    761         "duplicate time-indexes were found in the time series %s. Make sure "
    762         "there are no duplicate indices. For example by "
    763         "`grouped = series.groupby(level=0); series = grouped.mean()`"
    764         "or `series = series.loc[~series.index.duplicated(keep='first/last')]`"
    765     )
    766     logger.error(msg, name)
--> 767     raise ValueError(msg % name)
    769 # 7. Make sure the time series has no nan-values
    770 if series.hasnans:

ValueError: duplicate time-indexes were found in the time series Stress. Make sure there are no duplicate indices. For example by `grouped = series.groupby(level=0); series = grouped.mean()`or `series = series.loc[~series.index.duplicated(keep='first/last')]`
# Here is a possible fix to this issue
grouped = series.groupby(level=0)
series = grouped.mean()
ps.validate_stress(series)
True
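Averaging is not always appropriate. The error message also suggests keeping only one of the duplicated entries; a minimal sketch of that alternative, using only pandas (the names first and last are introduced here for illustration):

```python
import numpy as np
import pandas as pd

index = pd.to_datetime(["2000-01-01", "2000-01-02", "2000-01-02", "2000-01-03"])
series = pd.Series(data=np.arange(4.0), index=index, name="Stress")

# Keep the first value of every duplicated timestamp and drop the rest
first = series.loc[~series.index.duplicated(keep="first")]

# Or keep the last value instead
last = series.loc[~series.index.duplicated(keep="last")]

print(first)
print(last)
```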

f. If the time series has nan-values#

index = pd.date_range("2000-01-01", "2000-01-10")
series = pd.Series(data=np.arange(10.0), index=index, name="Stress")
series.loc["2000-01-05"] = np.nan

# Here's warning message returned by Pastas
ps.validate_stress(series)
WARNING: The Time Series 'Stress' has nan-values. Pastas will use the fill_nan settings to fill up the nan-values.
True
# Here is a possible fix to this issue for oseries
series.dropna()  # simply drop the nan-values

# Here is a possible fix to this issue for stresses
series = series.fillna(series.mean())  # For example for water levels
series = series.fillna(0.0)  # For example for precipitation
series = series.interpolate(method="time")  # For example for evaporation
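The three fills above are alternatives, not steps to apply in sequence: once the first fillna has run, no nan-values remain for the others to act on. A small sketch that applies each option to the original series and compares the value filled in at 2000-01-05 (the column names are chosen here for illustration):

```python
import numpy as np
import pandas as pd

index = pd.date_range("2000-01-01", "2000-01-10")
series = pd.Series(data=np.arange(10.0), index=index, name="Stress")
gap = pd.Timestamp("2000-01-05")
series.loc[gap] = np.nan

# Each option starts from the original series that still has the nan-value
options = pd.DataFrame(
    {
        "mean": series.fillna(series.mean()),
        "zero": series.fillna(0.0),
        "time interpolation": series.interpolate(method="time"),
    }
)
print(options.loc[gap])
```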

2. If a stress time series has non-equidistant time steps#

# Create timeseries
freq = "6H"
idx0 = pd.date_range("2000-01-01", freq=freq, periods=7).tolist()
idx0[3] = idx0[3] + pd.to_timedelta(1, "H")
series = pd.Series(index=idx0, data=np.arange(len(idx0), dtype=float), name="Stress")

# Here's error message return by Pastas
ps.validate_stress(series)
ERROR: The frequency of the index of time series Stress could not be inferred. Please provide a time series with a regular time step.
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[15], line 8
      5 series = pd.Series(index=idx0, data=np.arange(len(idx0), dtype=float), name="Stress")
      7 # Here's error message return by Pastas
----> 8 ps.validate_stress(series)

File ~/checkouts/readthedocs.org/user_builds/pastas/envs/latest/lib/python3.11/site-packages/pastas/timeseries.py:644, in validate_stress(series)
    610 def validate_stress(series: Series):
    611     """Method to validate user-provided stress input time series.
    612 
    613     Parameters
   (...)
    642     >>> ps.validate_stress(series)
    643     """
--> 644     return _validate_series(series, equidistant=True)

File ~/checkouts/readthedocs.org/user_builds/pastas/envs/latest/lib/python3.11/site-packages/pastas/timeseries.py:785, in _validate_series(series, equidistant)
    780         msg = (
    781             "The frequency of the index of time series %s could not be "
    782             "inferred. Please provide a time series with a regular time step."
    783         )
    784         logger.error(msg, name)
--> 785         raise ValueError(msg % name)
    787 # If all checks are passed, return True
    788 return True

ValueError: The frequency of the index of time series Stress could not be inferred. Please provide a time series with a regular time step.

Pastas contains some convenience functions for creating equidistant time series. The appropriate method depends on the type of stress: different methods are used for fluxes (e.g. precipitation, evaporation, pumping discharge) and levels (e.g. head, water levels). These methods are presented in the next sections.

Creating equidistant time series for fluxes#

There are several methods in pandas to create equidistant series. A flux describes a quantity measured or logged over a period of time, resulting in a pandas Series. In this series, each flux is assigned a timestamp in the index. In Pastas, we assume the timestamp is at the end of the period that belongs to each measurement. This means that the precipitation of March 5, 2022 gets the timestamp '2022-03-06 00:00:00' (which can be counter-intuitive, as the index is now a day later). Therefore, when using pandas resample methods, we add two parameters: closed='right' and label='right'. So given this series of precipitation in mm:

# show the original series
series
2000-01-01 00:00:00    0.0
2000-01-01 06:00:00    1.0
2000-01-01 12:00:00    2.0
2000-01-01 19:00:00    3.0
2000-01-02 00:00:00    4.0
2000-01-02 06:00:00    5.0
2000-01-02 12:00:00    6.0
Name: Stress, dtype: float64

using "right" would yield:

series.resample("12H", closed="right", label="right").sum()
2000-01-01 00:00:00     0.0
2000-01-01 12:00:00     3.0
2000-01-02 00:00:00     7.0
2000-01-02 12:00:00    11.0
Freq: 12h, Name: Stress, dtype: float64

which is logical because over the first 12 hours (between 01-01 00:00:00 and 01-01 12:00:00) 3mm of precipitation fell. However, using "left" would yield:

series.resample("12H", closed="left", label="left").sum()
2000-01-01 00:00:00    1.0
2000-01-01 12:00:00    5.0
2000-01-02 00:00:00    9.0
2000-01-02 12:00:00    6.0
Freq: 12h, Name: Stress, dtype: float64

Pastas helps users with a simple wrapper around the pandas resample function that sets the closed and label keyword arguments to "right":

resampler = ps.timeseries_utils.resample(series, freq)

The returned resampler can then be interpolated, summed or averaged using any of the resample methods available in pandas.

ps.timeseries_utils.resample(
    series, "12H"
).sum()  # gives the same as series.resample("12H", closed="right", label="right").sum()
2000-01-01 00:00:00     0.0
2000-01-01 12:00:00     3.0
2000-01-02 00:00:00     7.0
2000-01-02 12:00:00    11.0
Freq: 12h, Name: Stress, dtype: float64

The resample method of pandas is essentially a groupby method. This creates problems when a group (bin) does not contain exactly one measurement.

series.resample("6H", closed="right", label="right").mean()
2000-01-01 00:00:00    0.0
2000-01-01 06:00:00    1.0
2000-01-01 12:00:00    2.0
2000-01-01 18:00:00    NaN
2000-01-02 00:00:00    3.5
2000-01-02 06:00:00    5.0
2000-01-02 12:00:00    6.0
Freq: 6h, Name: Stress, dtype: float64

This is demonstrated by the NaN at 2000-01-01 18:00:00, a bin that contains no measurement. When there are two values in a group, they are simply averaged, even though one of the values only counts for one hour of the bin and the other for five hours. This is demonstrated by the value of 3.5 at 2000-01-02 00:00:00.
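One way to spot such bins before aggregating is to count the observations per bin. The sketch below rebuilds the series from this section with plain pandas and flags every bin that does not hold exactly one observation. Passing the frequency as a pd.Timedelta is an illustration choice made here, not something Pastas requires.

```python
import numpy as np
import pandas as pd

# Rebuild the irregular 6-hourly series used above
step = pd.Timedelta(hours=6)
idx = pd.date_range("2000-01-01", periods=7, freq=step).tolist()
idx[3] = idx[3] + pd.Timedelta(hours=1)
series = pd.Series(index=idx, data=np.arange(len(idx), dtype=float), name="Stress")

# Count the observations that fall in each right-closed 6-hour bin
counts = series.resample(step, closed="right", label="right").count()

# Bins with zero observations become NaN in .mean(); bins with more than one
# observation are averaged without regard to the period each value represents
print(counts[counts != 1])
```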

Because of these problems, Pastas provides a method called timestep_weighted_resample. This method assumes the index is at the end of the period that belongs to each measurement, just like the rest of Pastas. Using this assumption, the method calculates a new series as an average weighted by the overlap between the original and the new periods:

new_index = pd.date_range(series.index[0], series.index[-1], freq="12H")
ps.ts.timestep_weighted_resample(series, new_index)
2000-01-01 00:00:00         NaN
2000-01-01 12:00:00    1.500000
2000-01-02 00:00:00    3.416667
2000-01-02 12:00:00    5.500000
Freq: 12h, dtype: float64

We see the value at 2000-01-02 00:00:00 is now 3.416667, which is 7/12 of 3.0 (the original value at 2000-01-01 19:00:00, which covers 7 of the 12 hours) plus 5/12 of 4.0 (the original value at 2000-01-02 00:00:00, which covers the remaining 5 hours). This method sets a NaN for the value at 2000-01-01 00:00:00, as the length of the period cannot be determined.
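The weighting can be reproduced by hand with plain pandas and NumPy. The sketch below is not the Pastas implementation; overlap_weighted_mean is a hypothetical helper that assumes end-of-period timestamps and a sorted index. For the series and the 12-hour index used above, it weights 3.0 by the 7 hours between 12:00 and 19:00 and 4.0 by the 5 hours between 19:00 and 00:00.

```python
import numpy as np
import pandas as pd


def overlap_weighted_mean(series, new_index):
    """Hypothetical helper: average each value over the part of its period
    that overlaps a new period, assuming end-of-period timestamps."""
    origin = series.index[0]
    hour = pd.Timedelta(hours=1)
    t_old = ((series.index - origin) / hour).to_numpy()
    t_new = ((new_index - origin) / hour).to_numpy()

    # The value at t_old[i + 1] belongs to the period (start[i], end[i]]
    start, end = t_old[:-1], t_old[1:]
    values = series.to_numpy()[1:]

    result = np.full(len(t_new), np.nan)  # the first new period has no known start
    for k in range(1, len(t_new)):
        t0, t1 = t_new[k - 1], t_new[k]
        # Hours of overlap between every original period and (t0, t1]
        overlap = np.clip(np.minimum(end, t1) - np.maximum(start, t0), 0.0, None)
        if np.isclose(overlap.sum(), t1 - t0):  # new period fully covered by data
            result[k] = (overlap * values).sum() / (t1 - t0)
    return pd.Series(result, index=new_index)


idx = pd.date_range("2000-01-01", periods=7, freq=pd.Timedelta(hours=6)).tolist()
idx[3] = idx[3] + pd.Timedelta(hours=1)
series = pd.Series(index=idx, data=np.arange(len(idx), dtype=float))

new_index = pd.date_range(series.index[0], series.index[-1], freq=pd.Timedelta(hours=12))
weighted = overlap_weighted_mean(series, new_index)
print(weighted)
```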

The following examples showcase some potentially useful operations for which timestep_weighted_resample can be used.

Example 1: Resample monthly pumping volumes to daily values#

Monthly aggregated data is common, and in this synthetic example we’ll assume that we received monthly pumping volumes from a well. We want to convert these to daily data so that we can create a time series model with a daily time step.

First let’s invent some data.

# monthly discharge volumes
# volume on 1 february 2022 at 00:00 is the volume for month of january
index = pd.date_range("2022-02-01", "2023-01-01", freq="MS")
discharge = np.arange(12) * 100

Q0 = pd.Series(index=index, data=discharge)

Next, use timestep_weighted_resample to calculate a daily time series for well pumping.

Note: the unit of the resulting daily time series has not changed! This means the daily value is assumed to be the same as the monthly value. So take note of your units when modeling resampled time series with Pastas!

# define a new index
new_index = pd.date_range("2022-01-01", "2022-12-31", freq="D")

# resample to daily data
Q_daily = ps.ts.timestep_weighted_resample(Q0, new_index)

# plot the comparison
ax = Q0.plot(marker="o", label="original monthly series")
Q_daily.plot(ax=ax, label="resampled daily series")
ax.legend()
ax.grid(True)
[Figure: original monthly series and resampled daily series]
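If the monthly values are volumes per month, one possible way to obtain a rate per day is to divide each volume by the number of days in the month it covers before resampling. A minimal sketch with plain pandas, assuming (as above) that each timestamp marks the end of its month; the names period_start, days and Q_rate are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2022-02-01", "2023-01-01", freq="MS")
Q0 = pd.Series(index=index, data=np.arange(12) * 100.0)

# Each value covers the month that ends at its timestamp, so the period
# length is the difference with the previous month start
period_start = Q0.index - pd.DateOffset(months=1)
days = (Q0.index - period_start).days.to_numpy()

# Volume per month -> average volume per day
Q_rate = Q0 / days
print(Q_rate.head(3))
```

The resulting series can then be resampled to a daily index in the same way as the monthly series.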

Example 2: Resample precipitation between 9AM-9AM to 12AM-12AM#

In the Netherlands, the rainfall gauges that are measured daily measure between 9 AM one day and 9 AM the next day. For time series models with a daily time step it is often simpler to use time series that represent one full day (from 12 AM to 12 AM). We can do this ourselves by applying the formula: \(data_{24}[t] = \frac{9}{24}data_{9}[t] + \frac{15}{24}data_{9}[t+1]\) but once again, timestep_weighted_resample can help us calculate this resampled time series. First we invent some new random data.

index = pd.date_range("2022-01-01 09:00:00", "2022-01-10 09:00:00")
data = np.random.rand(len(index))

p0 = pd.Series(index=index, data=data)
p0
2022-01-01 09:00:00    0.205328
2022-01-02 09:00:00    0.229926
2022-01-03 09:00:00    0.760973
2022-01-04 09:00:00    0.220669
2022-01-05 09:00:00    0.282238
2022-01-06 09:00:00    0.772791
2022-01-07 09:00:00    0.756235
2022-01-08 09:00:00    0.578174
2022-01-09 09:00:00    0.378007
2022-01-10 09:00:00    0.275680
Freq: D, dtype: float64

The result is shown below. Note how each resampled point represents the weighted average of the two surrounding observations, which is what we want.

# create a new daily index, running from 12AM-12AM
new_index = pd.date_range(p0.index[0].normalize(), p0.index[-1].normalize())

# resample time series
p_resample = ps.ts.timestep_weighted_resample(p0, new_index)

# try it ourselves:
p_self = [
    data[i] * (h / 24) + data[i + 1] * ((24 - h) / 24)
    for i, h in enumerate(index.hour.values[1:])
]
p_resample_self = pd.Series(p_self, new_index[1:])

# plot comparison
ax = p0.plot(marker="o", label="original 9AM series")
p_resample.plot(marker="^", ax=ax, label="resampled series")
p_resample_self.plot(
    marker="d", markersize=3, linestyle="--", ax=ax, label="resampled series self"
)
ax.legend()
ax.grid(True)
[Figure: original 9AM series, resampled series and resampled series self]

Example 3: Resample hourly data to daily data#

Another frequent pre-processing step is converting hourly data to daily values. This can be done with simple pandas methods, but timestep_weighted_resample can also handle this calculation, with the added advantage of supporting irregular time steps in the new time series.

In this example we’ll simply convert an hourly time series into a daily time series.

# create some hourly data
index = pd.date_range("2022-01-01 00:00:00", "2022-01-03 00:00:00", freq="H")
data = np.hstack([np.arange(len(index) // 2), 24 * np.ones(len(index) // 2 + 1)])

# convert to series
p0 = pd.Series(index=index, data=data)
p0.head()
2022-01-01 00:00:00    0.0
2022-01-01 01:00:00    1.0
2022-01-01 02:00:00    2.0
2022-01-01 03:00:00    3.0
2022-01-01 04:00:00    4.0
Freq: h, dtype: float64

The result shows that the resampled value at the end of each day represents the average value of that day, which is what we would expect.

# create a new daily index
new_index = pd.date_range(p0.index[0].normalize(), p0.index[-1].normalize(), freq="D")

# resample measurements
p_resample = ps.ts.timestep_weighted_resample(p0, new_index)

# plot the comparison
ax = p0.plot(marker="o", label="original hourly series")
p_resample.plot(marker="^", ax=ax, label="resampled daily series")
ax.legend()
ax.grid(True)
[Figure: original hourly series and resampled daily series]
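For a regular hourly series, the same daily averages should also follow from a right-closed, right-labelled pandas resample. The sketch below uses only pandas and NumPy and rebuilds the hourly data from this example; it is meant as a cross-check, not as a replacement for timestep_weighted_resample on irregular data.

```python
import numpy as np
import pandas as pd

index = pd.date_range(
    "2022-01-01 00:00:00", "2022-01-03 00:00:00", freq=pd.Timedelta(hours=1)
)
data = np.hstack([np.arange(len(index) // 2), 24 * np.ones(len(index) // 2 + 1)])
p0 = pd.Series(index=index, data=data)

# Each daily bin is (00:00, 00:00 next day] and is labelled with its end
daily = p0.resample("D", closed="right", label="right").mean()
print(daily)
```

The first bin only contains the single observation at 2022-01-01 00:00:00, so its value says nothing about a full day.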

Creating equidistant time series for (ground)water levels#

The following methods are used for time series that do not need to be resampled. For example, the measurements represent a state (e.g. head, water level) and either an equidistant sample is taken from the original time series, or observations are shifted slightly, to create an equidistant time series. This can be done in a number of different ways:

  • pandas_equidistant_sample takes a sample at equidistant timesteps from the original series, at the user-specified frequency. For very irregular time series lots of observations will be lost. The advantage is that observations are not shifted in time, unlike in the other methods.

  • pandas_equidistant_nearest creates a new equidistant index with the user-specified frequency, then Series.reindex() is used with method="nearest" which will shift certain observations in time to fill the equidistant time series. This method can introduce duplicates (i.e. an observation that is used more than once) in the final result.

  • pandas_equidistant_asfreq rounds the series index to the user-specified frequency, then drops any duplicates before calling Series.asfreq with the user-specified frequency. This ensures no duplicates are contained in the resulting time series.

  • get_equidistant_series_nearest creates an equidistant time series, minimizing the number of dropped points and ensuring that each observation from the original time series is used only once in the resulting equidistant time series.

The different methods are compared in the following four examples.

Note: in terms of performance the pandas methods are much faster.

Example 1#

Let's create a time series that is normally spaced at a frequency of 6 hours. The first and last measurements are shifted a bit later and earlier, respectively.

The different methods for creating equidistant time series for levels are compared.

# Create time series
freq = "6H"
idx0 = pd.date_range("2000-01-01", freq=freq, periods=7).tolist()
idx0[0] = pd.Timestamp("2000-01-01 04:00:00")
idx0[-1] = pd.Timestamp("2000-01-02 11:00:00")
series = pd.Series(index=idx0, data=np.arange(len(idx0), dtype=float))

# Create equidistant time series with Pastas
s_pd1 = ps.ts.pandas_equidistant_sample(series, freq)
s_pd2 = ps.ts.pandas_equidistant_nearest(series, freq)
s_pd3 = ps.ts.pandas_equidistant_asfreq(series, freq)
s_pastas = ps.ts.get_equidistant_series_nearest(series, freq)

# Create figure
plt.figure(figsize=(10, 4))
ax = series.plot(
    marker="o",
    label="original time series",
    ms=10,
)
s_pd2.plot(ax=ax, marker="x", ms=8, label="pandas_equidistant_nearest")
s_pd3.plot(ax=ax, marker="^", ms=8, label="pandas_equidistant_asfreq")
s_pd1.plot(ax=ax, marker="+", ms=16, label="pandas_equidistant_sample")
s_pastas.plot(ax=ax, marker=".", label="get_equidistant_series_nearest")

ax.grid(True)
ax.legend(loc="best")
ax.set_xlabel("")
Text(0.5, 0, '')
[Figure: original time series compared with pandas_equidistant_nearest, pandas_equidistant_asfreq, pandas_equidistant_sample and get_equidistant_series_nearest]

The pandas_equidistant_nearest and get_equidistant_series_nearest methods shift the observations at the beginning and the end of the time series to the nearest equidistant timestamp. The pandas_equidistant_asfreq method shifts the first observation in the same way but, according to the table below, does not retain the last one. The pandas_equidistant_sample method drops 2 data points because they’re measured at different time offsets.

# some helper functions to show differences in performance
def values_kept(s, original):
    # number of distinct original values that are still present in s
    kept = set(original.dropna().values) & set(s.dropna().values)
    return len(kept)


def n_duplicates(s):
    # number of distinct values that occur more than once in s
    return (s.value_counts() >= 2).sum()


dfall = pd.concat([series, s_pd1, s_pd2, s_pd3, s_pastas], axis=1, sort=True)
dfall.columns = [
    "original",
    "pandas_equidistant_sample",
    "pandas_equidistant_nearest",
    "pandas_equidistant_asfreq",
    "get_equidistant_series_nearest",
]
dfall
                     original  pandas_equidistant_sample  pandas_equidistant_nearest  pandas_equidistant_asfreq  get_equidistant_series_nearest
2000-01-01 00:00:00       NaN                        NaN                         0.0                        0.0                             0.0
2000-01-01 04:00:00       0.0                        NaN                         NaN                        NaN                             NaN
2000-01-01 06:00:00       1.0                        1.0                         1.0                        1.0                             1.0
2000-01-01 12:00:00       2.0                        2.0                         2.0                        2.0                             2.0
2000-01-01 18:00:00       3.0                        3.0                         3.0                        3.0                             3.0
2000-01-02 00:00:00       4.0                        4.0                         4.0                        4.0                             4.0
2000-01-02 06:00:00       5.0                        5.0                         5.0                        5.0                             5.0
2000-01-02 11:00:00       6.0                        NaN                         NaN                        NaN                             NaN
2000-01-02 12:00:00       NaN                        NaN                         6.0                        NaN                             6.0

The following table summarizes the results, showing how many values from the original time series are kept and how many duplicates are contained in the final result.

valueskept = dfall.apply(values_kept, args=(dfall["original"],))
valueskept.name = "values kept"
duplicates = dfall.apply(n_duplicates)
duplicates.name = "duplicates"

pd.concat([valueskept, duplicates], axis=1)
                                values kept  duplicates
original                                  7           0
pandas_equidistant_sample                 5           0
pandas_equidistant_nearest                7           0
pandas_equidistant_asfreq                 6           0
get_equidistant_series_nearest            7           0

Example 2#

# Create timeseries
freq = "D"
idx0 = pd.date_range("2000-01-01", freq=freq, periods=7).tolist()
idx0[0] = pd.Timestamp("2000-01-01 09:00:00")
del idx0[2]
del idx0[2]
idx0[-2] = pd.Timestamp("2000-01-06 13:00:00")
idx0[-1] = pd.Timestamp("2000-01-06 23:00:00")
series = pd.Series(index=idx0, data=np.arange(len(idx0), dtype=float))

# Create equidistant timeseries
s_pd1 = ps.ts.pandas_equidistant_sample(series, freq)
s_pd2 = ps.ts.pandas_equidistant_nearest(series, freq)
s_pd3 = ps.ts.pandas_equidistant_asfreq(series, freq)
s_pastas = ps.ts.get_equidistant_series_nearest(series, freq)

# Create figure
plt.figure(figsize=(10, 4))
ax = series.plot(marker="o", label="original", ms=10)
s_pd2.plot(ax=ax, marker="x", ms=10, label="pandas_equidistant_nearest")
s_pd3.plot(ax=ax, marker="^", ms=8, label="pandas_equidistant_asfreq")
s_pd1.plot(ax=ax, marker="+", ms=16, label="pandas_equidistant_sample")
s_pastas.plot(ax=ax, marker=".", label="get_equidistant_series_nearest")
ax.grid(True)
ax.legend(loc="best")
ax.set_xlabel("")
Text(0.5, 0, '')
[Figure: original time series compared with pandas_equidistant_nearest, pandas_equidistant_asfreq, pandas_equidistant_sample and get_equidistant_series_nearest]

In this example, the shortcomings of pandas_equidistant_nearest are clearly visible. It duplicates observations from the original time series to fill the gaps. This can be solved by passing e.g. tolerance="0.99{freq}" to series.reindex(), in which case the gaps will not be filled. However, with very irregular time steps this is not guaranteed to work and duplicates may still occur. The pandas_equidistant_asfreq and get_equidistant_series_nearest methods work as expected and use the available data to create a reasonable equidistant time series from the original data. The pandas_equidistant_sample method is only able to keep two observations from the original series in this example.
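The effect of a tolerance on Series.reindex can be checked with plain pandas. The sketch below rebuilds the series of this example by hand and passes a pd.Timedelta of 0.99 days as the tolerance; both are illustration choices, and the result is not guaranteed to match pandas_equidistant_nearest for other inputs.

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime(
    [
        "2000-01-01 09:00",
        "2000-01-02 00:00",
        "2000-01-05 00:00",
        "2000-01-06 13:00",
        "2000-01-06 23:00",
    ]
)
series = pd.Series(index=idx, data=np.arange(len(idx), dtype=float))
new_index = pd.date_range("2000-01-01", "2000-01-07", freq="D")

# Without a tolerance every new timestamp gets its nearest observation
filled = series.reindex(new_index, method="nearest")

# With a tolerance just under one time step the gaps stay empty
limited = series.reindex(
    new_index, method="nearest", tolerance=pd.Timedelta(days=0.99)
)

print(pd.concat([filled, limited], axis=1, keys=["no tolerance", "tolerance"]))
```

With the tolerance, 2000-01-03 and 2000-01-04 remain NaN because their nearest observations are a full day away.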

dfall = pd.concat([series, s_pd1, s_pd2, s_pd3, s_pastas], axis=1, sort=True)
dfall.columns = [
    "original",
    "pandas_equidistant_sample",
    "pandas_equidistant_nearest",
    "pandas_equidistant_asfreq",
    "get_equidistant_series_nearest",
]
dfall
                     original  pandas_equidistant_sample  pandas_equidistant_nearest  pandas_equidistant_asfreq  get_equidistant_series_nearest
2000-01-01 00:00:00       NaN                        NaN                         0.0                        0.0                             0.0
2000-01-01 09:00:00       0.0                        NaN                         NaN                        NaN                             NaN
2000-01-02 00:00:00       1.0                        1.0                         1.0                        1.0                             1.0
2000-01-03 00:00:00       NaN                        NaN                         1.0                        NaN                             NaN
2000-01-04 00:00:00       NaN                        NaN                         2.0                        NaN                             NaN
2000-01-05 00:00:00       2.0                        2.0                         2.0                        2.0                             2.0
2000-01-06 00:00:00       NaN                        NaN                         3.0                        3.0                             3.0
2000-01-06 13:00:00       3.0                        NaN                         NaN                        NaN                             NaN
2000-01-06 23:00:00       4.0                        NaN                         NaN                        NaN                             NaN
2000-01-07 00:00:00       NaN                        NaN                         4.0                        NaN                             4.0

The following table summarizes the results, showing how many values from the original time series are kept and how many duplicates are contained in the final result.

valueskept = dfall.apply(values_kept, args=(dfall["original"],))
valueskept.name = "values kept"
duplicates = dfall.apply(n_duplicates)
duplicates.name = "duplicates"

pd.concat([valueskept, duplicates], axis=1)
                                values kept  duplicates
original                                  5           0
pandas_equidistant_sample                 2           0
pandas_equidistant_nearest                5           2
pandas_equidistant_asfreq                 4           0
get_equidistant_series_nearest            5           0

Example 3#

# Create timeseries
freq = "2H"
freq2 = "1H"
idx0 = pd.date_range("2000-01-01 18:00:00", freq=freq, periods=3).tolist()
idx1 = pd.date_range("2000-01-02 01:30:00", freq=freq2, periods=10).tolist()
idx0 = idx0 + idx1
idx0[3] = pd.Timestamp("2000-01-02 01:31:00")
series = pd.Series(index=idx0, data=np.arange(len(idx0), dtype=float))
series.iloc[8:10] = np.nan


# Create equidistant timeseries
s_pd1 = ps.ts.pandas_equidistant_sample(series, freq)
s_pd2 = ps.ts.pandas_equidistant_nearest(series, freq)
s_pd3 = ps.ts.pandas_equidistant_asfreq(series, freq)
s_pastas1 = ps.ts.get_equidistant_series_nearest(series, freq, minimize_data_loss=True)
s_pastas2 = ps.ts.get_equidistant_series_nearest(series, freq, minimize_data_loss=False)


# Create figure
plt.figure(figsize=(10, 6))
ax = series.plot(marker="o", label="original", ms=10)
s_pd2.plot(ax=ax, marker="x", ms=10, label="pandas_equidistant_nearest")
s_pd3.plot(ax=ax, marker="^", ms=8, label="pandas_equidistant_asfreq")
s_pd1.plot(ax=ax, marker="+", ms=16, label="pandas_equidistant_sample")
s_pastas1.plot(
    ax=ax, marker=".", ms=6, label="get_equidistant_series_nearest (minimize data loss)"
)
s_pastas2.plot(
    ax=ax, marker="+", ms=10, label="get_equidistant_series_nearest (default)"
)
ax.grid(True)
ax.legend(loc="best")
ax.set_xlabel("")
Text(0.5, 0, '')
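[Figure: original time series compared with pandas_equidistant_nearest, pandas_equidistant_asfreq, pandas_equidistant_sample, get_equidistant_series_nearest (minimize data loss) and get_equidistant_series_nearest (default)]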

In this example we can observe the following behavior in each method:

  • pandas_equidistant_sample retains 4 values.

  • pandas_equidistant_nearest duplicates some observations in the equidistant timeseries.

  • pandas_equidistant_asfreq does quite well, but drops some observations near the gap in the original timeseries.

  • get_equidistant_series_nearest method misses an observation right after the gap in the original timeseries.

  • get_equidistant_series_nearest with minimize_data_loss=True fills this gap, using as much data as possible from the original timeseries.

Both the pandas_equidistant_asfreq and get_equidistant_series_nearest methods work well, but the latter retains more of the original data when minimize_data_loss=True is used.

dfall = pd.concat(
    [series, s_pd1, s_pd2, s_pd3, s_pastas2, s_pastas1], axis=1, sort=True
)
dfall.columns = [
    "original",
    "pandas_equidistant_sample",
    "pandas_equidistant_nearest",
    "pandas_equidistant_asfreq",
    "get_equidistant_series_nearest (default)",
    "get_equidistant_series_nearest (minimize data loss)",
]
dfall
                     original  pandas_equidistant_sample  pandas_equidistant_nearest  pandas_equidistant_asfreq  get_equidistant_series_nearest (default)  get_equidistant_series_nearest (minimize data loss)
2000-01-01 18:00:00       0.0                        NaN                         0.0                        0.0                                       NaN                                                  NaN
2000-01-01 18:30:00       NaN                        NaN                         NaN                        NaN                                       0.0                                                  0.0
2000-01-01 20:00:00       1.0                        NaN                         1.0                        1.0                                       NaN                                                  NaN
2000-01-01 20:30:00       NaN                        NaN                         NaN                        NaN                                       1.0                                                  1.0
2000-01-01 22:00:00       2.0                        NaN                         2.0                        2.0                                       NaN                                                  NaN
2000-01-01 22:30:00       NaN                        NaN                         NaN                        NaN                                       2.0                                                  2.0
2000-01-02 00:00:00       NaN                        NaN                         3.0                        3.0                                       NaN                                                  NaN
2000-01-02 00:30:00       NaN                        NaN                         NaN                        NaN                                       3.0                                                  3.0
2000-01-02 01:31:00       3.0                        NaN                         NaN                        NaN                                       NaN                                                  NaN
2000-01-02 02:00:00       NaN                        NaN                         3.0                        4.0                                       NaN                                                  NaN
2000-01-02 02:30:00       4.0                        4.0                         NaN                        NaN                                       4.0                                                  4.0
2000-01-02 03:30:00       5.0                        NaN                         NaN                        NaN                                       NaN                                                  NaN
2000-01-02 04:00:00       NaN                        NaN                         6.0                        6.0                                       NaN                                                  NaN
2000-01-02 04:30:00       6.0                        6.0                         NaN                        NaN                                       6.0                                                  6.0
2000-01-02 05:30:00       7.0                        NaN                         NaN                        NaN                                       NaN                                                  NaN
2000-01-02 06:00:00       NaN                        NaN                         NaN                        NaN                                       NaN                                                  NaN
2000-01-02 06:30:00       NaN                        NaN                         NaN                        NaN                                       NaN                                                  7.0
2000-01-02 07:30:00       NaN                        NaN                         NaN                        NaN                                       NaN                                                  NaN
2000-01-02 08:00:00       NaN                        NaN                        10.0                       10.0                                       NaN                                                  NaN
2000-01-02 08:30:00      10.0                       10.0                         NaN                        NaN                                      10.0                                                 10.0
2000-01-02 09:30:00      11.0                        NaN                         NaN                        NaN                                       NaN                                                  NaN
2000-01-02 10:00:00       NaN                        NaN                        12.0                       12.0                                       NaN                                                  NaN
2000-01-02 10:30:00      12.0                       12.0                         NaN                        NaN                                      12.0                                                 12.0
2000-01-02 12:00:00       NaN                        NaN                        12.0                        NaN                                       NaN                                                  NaN
2000-01-02 12:30:00       NaN                        NaN                         NaN                        NaN                                       NaN                                                  NaN

The following table summarizes the results, showing how many values from the original time series are kept and how many duplicates are contained in the final result.

valueskept = dfall.apply(values_kept, args=(dfall["original"],))
valueskept.name = "values kept"
duplicates = dfall.apply(n_duplicates)
duplicates.name = "duplicates"

pd.concat([valueskept, duplicates], axis=1)
                                                     values kept  duplicates
original                                                      11           0
pandas_equidistant_sample                                      4           0
pandas_equidistant_nearest                                     7           2
pandas_equidistant_asfreq                                      8           0
get_equidistant_series_nearest (default)                       8           0
get_equidistant_series_nearest (minimize data loss)            9           0

Example 4#

# Create timeseries
freq = "2H"
freq2 = "1H"
idx0 = pd.date_range("2000-01-01 18:00:00", freq=freq, periods=3).tolist()
idx1 = pd.date_range("2000-01-02 00:00:00", freq=freq2, periods=10).tolist()
idx0 = idx0 + idx1
series = pd.Series(index=idx0, data=np.arange(len(idx0), dtype=float))
series.iloc[8:10] = np.nan

# Create equidistant timeseries
s_pd1 = ps.ts.pandas_equidistant_sample(series, freq)
s_pd2 = ps.ts.pandas_equidistant_nearest(series, freq)
s_pd3 = ps.ts.pandas_equidistant_asfreq(series, freq)
s_pastas1 = ps.ts.get_equidistant_series_nearest(series, freq, minimize_data_loss=True)
s_pastas2 = ps.ts.get_equidistant_series_nearest(series, freq, minimize_data_loss=False)

# Create figure
plt.figure(figsize=(10, 6))
ax = series.plot(marker="o", label="original", ms=10)
s_pd2.plot(ax=ax, marker="x", ms=10, label="pandas_equidistant_nearest")
s_pd3.plot(ax=ax, marker="^", ms=8, label="pandas_equidistant_asfreq")
s_pd1.plot(ax=ax, marker="+", ms=16, label="pandas_equidistant_sample")
s_pastas1.plot(
    ax=ax, marker=".", ms=6, label="get_equidistant_series_nearest (minimize data loss)"
)
s_pastas2.plot(
    ax=ax, marker="+", ms=10, label="get_equidistant_series_nearest (default)"
)
ax.grid(True)
ax.legend(loc="best")
ax.set_xlabel("")
Text(0.5, 0, '')
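[Figure: original time series compared with pandas_equidistant_nearest, pandas_equidistant_asfreq, pandas_equidistant_sample, get_equidistant_series_nearest (minimize data loss) and get_equidistant_series_nearest (default)]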

Similar to the previous example, get_equidistant_series_nearest with minimize_data_loss=True retains the most data from the original time series. In this case both the pandas_equidistant_asfreq and pandas_equidistant_nearest methods perform well, but they do omit some of the original data at the end of the time series or near the gap in the original time series.

dfall = pd.concat(
    [series, s_pd1, s_pd2, s_pd3, s_pastas2, s_pastas1], axis=1, sort=True
)
dfall.columns = [
    "original",
    "pandas_equidistant_sample",
    "pandas_equidistant_nearest",
    "pandas_equidistant_asfreq",
    "get_equidistant_series_nearest (default)",
    "get_equidistant_series_nearest (minimize data loss)",
]
dfall
                     original  pandas_equidistant_sample  pandas_equidistant_nearest  pandas_equidistant_asfreq  get_equidistant_series_nearest (default)  get_equidistant_series_nearest (minimize data loss)
2000-01-01 18:00:00       0.0                        0.0                         0.0                        0.0                                       0.0                                                  0.0
2000-01-01 20:00:00       1.0                        1.0                         1.0                        1.0                                       1.0                                                  1.0
2000-01-01 22:00:00       2.0                        2.0                         2.0                        2.0                                       2.0                                                  2.0
2000-01-02 00:00:00       3.0                        3.0                         3.0                        3.0                                       3.0                                                  3.0
2000-01-02 01:00:00       4.0                        NaN                         NaN                        NaN                                       NaN                                                  NaN
2000-01-02 02:00:00       5.0                        5.0                         5.0                        5.0                                       5.0                                                  5.0
2000-01-02 03:00:00       6.0                        NaN                         NaN                        NaN                                       NaN                                                  NaN
2000-01-02 04:00:00       7.0                        7.0                         7.0                        7.0                                       7.0                                                  7.0
2000-01-02 05:00:00       NaN                        NaN                         NaN                        NaN                                       NaN                                                  NaN
2000-01-02 06:00:00       NaN                        NaN                         NaN                        NaN                                       NaN                                                 10.0
2000-01-02 07:00:00      10.0                        NaN                         NaN                        NaN                                       NaN                                                  NaN
2000-01-02 08:00:00      11.0                       11.0                        11.0                       11.0                                      11.0                                                 11.0
2000-01-02 09:00:00      12.0                        NaN                         NaN                        NaN                                       NaN                                                  NaN
2000-01-02 10:00:00       NaN                        NaN                        12.0                        NaN                                      12.0                                                 12.0

The following table summarizes the results, showing how many values from the original time series are kept and how many duplicates are contained in the final result.

valueskept = dfall.apply(values_kept, args=(dfall["original"],))
valueskept.name = "values kept"
duplicates = dfall.apply(n_duplicates)
duplicates.name = "duplicates"

pd.concat([valueskept, duplicates], axis=1)
                                                     values kept  duplicates
original                                                      11           0
pandas_equidistant_sample                                      7           0
pandas_equidistant_nearest                                     8           0
pandas_equidistant_asfreq                                      7           0
get_equidistant_series_nearest (default)                       8           0
get_equidistant_series_nearest (minimize data loss)            9           0