When resampling, put NaN in the result if the source interval contains NaN values

Question (votes: 0, answers: 2)

Example:

import pandas as pd
import numpy as np

rng = pd.date_range("2000-01-01", periods=12, freq="T")
ts = pd.Series(np.arange(12), index=rng)
ts["2000-01-01 00:02"] = np.nan
ts
2000-01-01 00:00:00     0.0
2000-01-01 00:01:00     1.0
2000-01-01 00:02:00     NaN
2000-01-01 00:03:00     3.0
2000-01-01 00:04:00     4.0
2000-01-01 00:05:00     5.0
2000-01-01 00:06:00     6.0
2000-01-01 00:07:00     7.0
2000-01-01 00:08:00     8.0
2000-01-01 00:09:00     9.0
2000-01-01 00:10:00    10.0
2000-01-01 00:11:00    11.0
Freq: T, dtype: float64
ts.resample("5min").sum()
2000-01-01 00:00:00     8.0
2000-01-01 00:05:00    35.0
2000-01-01 00:10:00    21.0
Freq: 5T, dtype: float64

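In the example above, the sum for the interval 00:00-00:05 is computed as if the missing value were zero. What I want is for the result at 00:00 to be NaN.

Alternatively, I might be OK with one missing value in an interval, but want NaN if the interval has two missing values.

How can I do either of these?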

python pandas series
2 Answers

13 votes

To get NaN when an interval contains one or more NaN values:

ts.resample('5min').agg(pd.Series.sum, skipna=False)
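As a minimal sketch of the same idea, the strict behaviour can also be written with an explicit lambda, which makes it obvious that skipna=False is handled by Series.sum on each bin rather than by the resampler. The variable name strict is only illustrative, and the series is rebuilt here so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Same series as in the question ("min" is the non-deprecated alias for "T").
rng = pd.date_range("2000-01-01", periods=12, freq="min")
ts = pd.Series(np.arange(12, dtype=float), index=rng)
ts.iloc[2] = np.nan  # the value at 00:02

# Each 5-minute bin is passed to the lambda as a Series, so any NaN in the
# bin makes that bin's sum NaN.
strict = ts.resample("5min").apply(lambda x: x.sum(skipna=False))
print(strict)
```

Only the first bin contains a NaN, so only the first result is NaN; the other two bins keep their ordinary sums.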

To require a minimum of 2 non-NaN values:

ts.resample('5min').agg(pd.Series.sum, min_count=2)
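A short sketch of what min_count does, assuming a variant of the question's series with an extra NaN at 00:11 so that the last bin has only one valid value. Here min_count is passed straight to the resampler's sum, which accepts it; at_least_two is just an illustrative name.

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2000-01-01", periods=12, freq="min")
ts = pd.Series(np.arange(12, dtype=float), index=rng)
ts.iloc[[2, 11]] = np.nan  # one NaN in the first bin, one in the last bin

# A bin needs at least 2 non-NaN values, otherwise its sum is NaN.
at_least_two = ts.resample("5min").sum(min_count=2)
print(at_least_two)
```

The first bin still has four valid values and is summed normally, while the last bin (10 and NaN) has only one valid value and becomes NaN.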

Allowing a maximum of 2 NaN values seems trickier:

ts.resample('5min').apply(lambda x: x.sum() if x.isnull().sum() <= 2 else np.nan)
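One possible alternative to the lambda, sketched under the assumption that counting NaNs per bin separately is acceptable: resample the boolean isna() mask to get the NaN count per bin, then blank out the sums where the count exceeds the limit. The helper name resample_sum_max_nan and its parameters are hypothetical, not part of pandas. With max_nan=1 it gives the behaviour the question asks for (one missing value tolerated, two not).

```python
import numpy as np
import pandas as pd

def resample_sum_max_nan(s, freq, max_nan):
    """Sum per bin, but return NaN for bins with more than max_nan NaNs."""
    sums = s.resample(freq).sum()
    # Summing the boolean mask counts the NaNs in each bin.
    nan_counts = s.isna().resample(freq).sum()
    return sums.where(nan_counts <= max_nan)

rng = pd.date_range("2000-01-01", periods=12, freq="min")
ts = pd.Series(np.arange(12, dtype=float), index=rng)
ts.iloc[[2, 5, 6]] = np.nan  # one NaN in the first bin, two in the second

result = resample_sum_max_nan(ts, "5min", max_nan=1)
print(result)
```

Both intermediate results share the same bin index, so where() lines them up without any extra alignment step.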

You might expect

ts.resample('5min').sum(skipna=False)

to work the same way as ts.sum(skipna=False), but the implementations are not consistent.


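0 votes

.resample().agg(pd.Series.sum, skipna=False) is much slower than .resample().sum(), especially on DataFrames with many columns. These are of course different methods that produce different results, but the core intent is the same, so I propose the function further down to improve speed.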

test = pd.DataFrame(index=pd.date_range('2023-01-01 00:00', periods=16, freq='15T'), data={'A': 10})
test.iloc[13] = pd.NA
test.iloc[:6] = pd.NA
test


                        A
date
2023-01-01 00:00:00   NaN
2023-01-01 00:15:00   NaN
2023-01-01 00:30:00   NaN
2023-01-01 00:45:00   NaN
2023-01-01 01:00:00   NaN
2023-01-01 01:15:00   NaN
2023-01-01 01:30:00  10.0
2023-01-01 01:45:00  10.0
2023-01-01 02:00:00  10.0
2023-01-01 02:15:00  10.0
2023-01-01 02:30:00  10.0
2023-01-01 02:45:00  10.0
2023-01-01 03:00:00  10.0
2023-01-01 03:15:00   NaN
2023-01-01 03:30:00  10.0
2023-01-01 03:45:00  10.0

Timing the DataFrame above replicated to 100 columns shows that the skipna=False version is roughly 16 to 17 times slower (34.1 ms versus 2.07 ms).

%%timeit
pd.concat([test]*100, axis=1).resample('H').sum()

2.07 ms ± 23.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
pd.concat([test]*100, axis=1).resample('H').agg(pd.Series.sum, skipna=False)

34.1 ms ± 857 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


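The function below is about 8 times faster than the

.resample().agg(pd.Series.sum, skipna=False)

approach (3.94 ms versus 34.1 ms in the timings here). Note that the function slightly inflates some of the sums, because each NaN is temporarily replaced by nan_number; the effect can be kept negligible by choosing a very small nan_number.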

import pandas as pd

def resample_sum_keep_nans(df, target_freq='H', nan_number=0.01):
    """Function returns a downsampled dataframe that returns NaN for downsampled
    intervals when all values in source intervals are NaN.

    Input df should have a time-series index that 'is_monotonic_increasing' and
    has a defined frequency.

    Select a 'nan_number' value that is very small relative to the input df values
    to avoid significant alteration of the output downsampled totals for source
    intervals that have some but not all NaN values.

    Refer to Pandas resample method for orientation on defining 'target_freq'
    of output df"""
    assert df.index.is_monotonic_increasing
    assert df.index.freq is not None

    # confirm that the suggested nan_number is not present in the df already
    # this is important because the temporary fill-in nan_number must be easily
    # distinguished from the values already present in df.
    if df.sample(frac=0.5).dropna(how='all').eq(nan_number).any().any():
        print(f'nan_number {nan_number} exists in input df. '
              'Input a nan_number that is not present in df')
    else:
        # determine what the temporary nan_number will be after being summed
        # across intervals in downsampling
        old_delta = df.iloc[:, 0].index.freq.delta
        new_delta = df.iloc[:, 0].resample(target_freq).sum().index.freq.delta
        freq_multiplier = new_delta / old_delta
        # define the number to search for and replace with NaN in the resampled df
        nan_number_resampled = freq_multiplier * nan_number
        # print(nan_number_resampled)
        # fill NaNs with nan_number & resample
        df = df.fillna(nan_number).resample(target_freq).sum()
        # fill any values equal to nan_number_resampled with NaN and return df
        return df.mask(df.eq(nan_number_resampled), pd.NA)
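For comparison, here is a hedged sketch that avoids the placeholder value altogether. Going by the docstring above (NaN only when every value in the source interval is NaN), the built-in min_count=1 argument appears to give the same totals without inflating any sum, and a NaN-count mask gives the stricter "any NaN" behaviour while staying vectorized. The names all_nan_only and strict are illustrative, and "60min" is used as the target frequency to sidestep the differing hourly aliases across pandas versions. No timing claim is made for these variants.

```python
import numpy as np
import pandas as pd

# Rebuild the 15-minute test frame used in this answer.
idx = pd.date_range("2023-01-01 00:00", periods=16, freq="15min")
test = pd.DataFrame({"A": 10.0}, index=idx)
test.iloc[13] = np.nan
test.iloc[:6] = np.nan

# NaN only when a bin has no valid value at all.
all_nan_only = test.resample("60min").sum(min_count=1)

# NaN as soon as a bin contains any NaN: count NaNs per bin and mask the sums.
nan_counts = test.isna().resample("60min").sum()
strict = test.resample("60min").sum().where(nan_counts.eq(0))

print(all_nan_only)
print(strict)
```

Because no sentinel is added to the data, there is nothing to round away afterwards and no floating-point equality test on the summed placeholder.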

%%timeit
resample_sum_keep_nans(pd.concat([test]*100, axis=1))

3.94 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Note that

test.resample('H').agg(pd.Series.sum, skipna=False)

returns NaN for any source interval that contains a NaN, discarding the valid values in it (for example at 2023-01-01 01:00:00):

                        A
date
2023-01-01 00:00:00   NaN
2023-01-01 01:00:00   NaN
2023-01-01 02:00:00  40.0
2023-01-01 03:00:00   NaN

whereas

resample_sum_keep_nans(test, nan_number=0.00001).round(0)

keeps the valid values that are mixed with NaN in a source interval and sums them:

                        A
date
2023-01-01 00:00:00   NaN
2023-01-01 01:00:00  20.0
2023-01-01 02:00:00  40.0
2023-01-01 03:00:00  30.0