Example:
import pandas as pd
import numpy as np
rng = pd.date_range("2000-01-01", periods=12, freq="T")
ts = pd.Series(np.arange(12), index=rng)
ts["2000-01-01 00:02"] = np.nan
ts
2000-01-01 00:00:00 0.0
2000-01-01 00:01:00 1.0
2000-01-01 00:02:00 NaN
2000-01-01 00:03:00 3.0
2000-01-01 00:04:00 4.0
2000-01-01 00:05:00 5.0
2000-01-01 00:06:00 6.0
2000-01-01 00:07:00 7.0
2000-01-01 00:08:00 8.0
2000-01-01 00:09:00 9.0
2000-01-01 00:10:00 10.0
2000-01-01 00:11:00 11.0
Freq: T, dtype: float64
ts.resample("5min").sum()
2000-01-01 00:00:00 8.0
2000-01-01 00:05:00 35.0
2000-01-01 00:10:00 21.0
Freq: 5T, dtype: float64
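In the example above, the sum for the 00:00-00:05 interval is computed as if the missing value were zero. What I want is for the result at 00:00 to be NaN.
Alternatively, I might accept one missing value in an interval, but want NaN if the interval has two missing values.
How can I do either of these?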
To get NaN when an interval contains one or more NaN values:
ts.resample('5min').agg(pd.Series.sum, skipna=False)
To require a minimum of 2 non-NaN values:
ts.resample('5min').agg(pd.Series.sum, min_count=2)
Allowing a maximum of 2 NaN values seems trickier:
ts.resample('5min').apply(lambda x: x.sum() if x.isnull().sum() <= 2 else np.nan)
You might expect ts.resample('5min').sum(skipna=False) to work the same way as ts.sum(skipna=False), but the two implementations are not consistent.
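As a possible alternative to the lambda-based apply, the same "at most N missing values" rule can be sketched with two built-in resample sums: one over the values and one over a NaN indicator. The helper name resample_sum_max_nans and its max_nans parameter are made up for this illustration and are not part of pandas.

```python
import numpy as np
import pandas as pd


def resample_sum_max_nans(obj, freq, max_nans=0):
    """Sum each resampled bin; give NaN for bins holding more than `max_nans` NaNs.

    Hypothetical helper for illustration, not a pandas API.
    """
    sums = obj.resample(freq).sum()  # NaNs are skipped in this sum
    nan_counts = obj.isna().astype(int).resample(freq).sum()  # NaNs per bin
    return sums.where(nan_counts <= max_nans)  # blank out bins over the limit


rng = pd.date_range("2000-01-01", periods=12, freq="min")
ts = pd.Series(np.arange(12, dtype=float), index=rng)
ts.iloc[2] = np.nan  # one missing value in the first 5-minute bin

strict = resample_sum_max_nans(ts, "5min", max_nans=0)   # NaN if anything is missing
lenient = resample_sum_max_nans(ts, "5min", max_nans=1)  # tolerate one missing value
print(strict)
print(lenient)
```

With max_nans=0 this should behave like the skipna=False variant above, while staying on the built-in resample sum instead of a Python-level function per group. It works on a Series or a DataFrame.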
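.resample().agg(pd.Series.sum, skipna=False) is much slower than .resample().sum(), especially on dataframes with many columns. These are different methods that produce different results, but the core intent is the same, so I propose the function further down as a faster option.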
test = pd.DataFrame(index = pd.date_range('2023-01-01 00:00', periods=16, freq='15T'), data={'A':10})
test.iloc[13] = pd.NA
test.iloc[:6] = pd.NA
test
date A
2023-01-01 00:00:00 NaN
2023-01-01 00:15:00 NaN
2023-01-01 00:30:00 NaN
2023-01-01 00:45:00 NaN
2023-01-01 01:00:00 NaN
2023-01-01 01:15:00 NaN
2023-01-01 01:30:00 10.0
2023-01-01 01:45:00 10.0
2023-01-01 02:00:00 10.0
2023-01-01 02:15:00 10.0
2023-01-01 02:30:00 10.0
2023-01-01 02:45:00 10.0
2023-01-01 03:00:00 10.0
2023-01-01 03:15:00 NaN
2023-01-01 03:30:00 10.0
2023-01-01 03:45:00 10.0
When the dataframe above is replicated to 100 columns, the skipna=False version turns out to be about 17 times slower.
%%timeit pd.concat([test]*100, axis=1).resample('H').sum()
2.07 ms ± 23.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit pd.concat([test]*100, axis=1).resample('H').agg(pd.Series.sum, skipna=False)
34.1 ms ± 857 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
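The function below is roughly 8 times faster than the .resample().agg(pd.Series.sum, skipna=False) method. Note that it slightly inflates some of the sums, but this can be controlled by keeping the nan_number input small.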
import pandas as pd

def resample_sum_keep_nans(df, target_freq='H', nan_number=0.01):
    """Return a downsampled dataframe that holds NaN for downsampled
    intervals when all values in the source intervals are NaN.
    Input df should have a time-series index that 'is_monotonic_increasing'
    and has a defined frequency.
    Select a 'nan_number' value that is very small relative to the input
    df values to avoid significant alteration of the output downsampled totals
    for source intervals that have some but not all NaN values. Refer to the pandas
    resample method for orientation on defining the 'target_freq' of the output df.
    Returns None (after printing a message) if nan_number is found in df."""
    assert df.index.is_monotonic_increasing
    assert df.index.freq is not None
    # confirm that the suggested nan_number is not present in the df already;
    # this is important because the temporary fill-in nan_number must be easily
    # distinguished from the values already present in df.
    # note: only a random half of the rows is checked here
    if df.sample(frac=0.5).dropna(how='all').eq(nan_number).any().any():
        print(f'nan_number {nan_number} exists in input df. '
              'Input a nan_number that is not present in df')
    else:
        # determine what the temporary nan_number will be after being summed
        # across intervals in downsampling
        # (pd.Timedelta(offset) replaces the deprecated offset.delta)
        old_delta = pd.Timedelta(df.index.freq)
        new_delta = pd.Timedelta(df.iloc[:, 0].resample(target_freq).sum().index.freq)
        freq_multiplier = new_delta / old_delta
        # define the number to search for and replace with NaN in the resampled df
        nan_number_resampled = freq_multiplier * nan_number
        # fill NaNs with nan_number & resample
        df = df.fillna(nan_number).resample(target_freq).sum()
        # replace any values equal to nan_number_resampled with NaN and return df
        return df.mask(df.eq(nan_number_resampled), pd.NA)
%%timeit resample_sum_keep_nans(pd.concat([test]*100, axis=1))
3.94 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
test.resample('H').agg(pd.Series.sum, skipna=False) discards the values in any source interval that also contains NaN (for example 2023-01-01 01:00:00).
A
date
2023-01-01 00:00:00 NaN
2023-01-01 01:00:00 NaN
2023-01-01 02:00:00 40.0
2023-01-01 03:00:00 NaN
By contrast, resample_sum_keep_nans(test, nan_number=0.00001).round(0) recognizes the values that are mixed with NaN in a source interval and sums them.
A
date
2023-01-01 00:00:00 NaN
2023-01-01 01:00:00 20.0
2023-01-01 02:00:00 40.0
2023-01-01 03:00:00 30.0
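For comparison, here is a minimal sketch of what looks like the same result using only the built-in min_count argument of the resample sum, assuming the goal is "NaN only when every source value in the interval is NaN". It needs no placeholder number, so the sums are not altered. The frame is rebuilt here with float data and np.nan so the block runs on its own, and the '15min' and '60min' aliases are used in place of '15T' and 'H'.

```python
import numpy as np
import pandas as pd

# Rebuild the test frame from above (float column, np.nan for missing values)
idx = pd.date_range("2023-01-01 00:00", periods=16, freq="15min")
test = pd.DataFrame({"A": 10.0}, index=idx)
test.iloc[13] = np.nan
test.iloc[:6] = np.nan

# min_count=1: a bin needs at least one non-NaN value, otherwise its sum is NaN
hourly = test.resample("60min").sum(min_count=1)
print(hourly)
```

Timing was not measured for this variant; since it stays on the built-in sum path, it may be worth benchmarking against the function above before choosing between them.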