When resampling, put NaN in the result value if there are some NaN values in the source interval

Question

Example:

import pandas as pd
import numpy as np

rng = pd.date_range("2000-01-01", periods=12, freq="T")
ts = pd.Series(np.arange(12), index=rng)
ts["2000-01-01 00:02"] = np.nan
ts

2000-01-01 00:00:00     0.0
2000-01-01 00:01:00     1.0
2000-01-01 00:02:00     NaN
2000-01-01 00:03:00     3.0
2000-01-01 00:04:00     4.0
2000-01-01 00:05:00     5.0
2000-01-01 00:06:00     6.0
2000-01-01 00:07:00     7.0
2000-01-01 00:08:00     8.0
2000-01-01 00:09:00     9.0
2000-01-01 00:10:00    10.0
2000-01-01 00:11:00    11.0
Freq: T, dtype: float64

ts.resample("5min").sum()

2000-01-01 00:00:00     8.0
2000-01-01 00:05:00    35.0
2000-01-01 00:10:00    21.0
Freq: 5T, dtype: float64

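In the example above, the sum for the 00:00-00:05 interval is computed as if the missing value were zero. What I want instead is a NaN result at 00:00.

Alternatively, I might be fine with one missing value in an interval, but want NaN if the interval has two missing values.

How can I do either of these?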

Answer 1

To get NaN when a bin contains one or more NaN values:

ts.resample('5min').agg(pd.Series.sum, skipna=False)
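To make the effect concrete, here is a minimal sketch on the question's series. It assumes a reasonably recent pandas, spells the frequency alias as "min", and uses a lambda that calls Series.sum(skipna=False) per bin, which should behave like the agg call above. The names default and strict are only illustrative.

```python
import numpy as np
import pandas as pd

# Same data as in the question: 12 one-minute samples with one gap.
rng = pd.date_range("2000-01-01", periods=12, freq="min")
ts = pd.Series(np.arange(12, dtype=float), index=rng)
ts.iloc[2] = np.nan  # the 00:02 sample is missing

# Default behaviour: NaN is skipped, so the first bin still gets a sum.
default = ts.resample("5min").sum()

# Strict behaviour: any NaN in a bin makes that bin's sum NaN.
strict = ts.resample("5min").apply(lambda s: s.sum(skipna=False))

print(default.tolist())  # [8.0, 35.0, 21.0]
print(strict.tolist())   # [nan, 35.0, 21.0]
```

Only the first bin changes, because it is the only one that contains a missing sample.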

To require a minimum of 2 non-NaN values per bin:

ts.resample('5min').agg(pd.Series.sum, min_count=2)
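Note that min_count counts valid values rather than missing ones, so its effect depends on how many samples each bin holds. The sketch below passes min_count directly to the resampler's sum method, an assumed shorter spelling of the agg call above, and compares two thresholds on the question's series. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2000-01-01", periods=12, freq="min")
ts = pd.Series(np.arange(12, dtype=float), index=rng)
ts.iloc[2] = np.nan  # the 00:02 sample is missing

# At least 2 valid values per bin: every bin qualifies here,
# including the short last bin (10 and 11).
at_least_two = ts.resample("5min").sum(min_count=2)

# At least 5 valid values per bin: only a complete 5-minute bin passes,
# so the bin with the gap and the short trailing bin both become NaN.
full_bins_only = ts.resample("5min").sum(min_count=5)

print(at_least_two.tolist())    # [8.0, 35.0, 21.0]
print(full_bins_only.tolist())  # [nan, 35.0, nan]
```

Setting min_count to the expected bin size therefore also blanks incomplete bins at the edges of the data, which may or may not be what you want.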

Allowing a maximum of 2 NaN values per bin seems to be trickier:

ts.resample('5min').apply(lambda x: x.sum() if x.isnull().sum() <= 2 else np.nan)
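An alternative that avoids calling a Python lambda for every bin is to resample twice, once for the sums and once for the count of missing values, and then mask the sums. This is only a sketch: the helper name resample_sum_max_nan and its max_nan parameter are made up for illustration and are not part of pandas.

```python
import numpy as np
import pandas as pd


def resample_sum_max_nan(obj, freq, max_nan):
    """Sum per bin, returning NaN for bins with more than max_nan missing values.

    Hypothetical helper; obj is a Series or DataFrame with a DatetimeIndex.
    """
    sums = obj.resample(freq).sum()                # NaN treated as absent
    n_missing = obj.isna().resample(freq).sum()    # count of NaN per bin
    return sums.where(n_missing <= max_nan)        # blank out the bad bins


rng = pd.date_range("2000-01-01", periods=12, freq="min")
ts = pd.Series(np.arange(12, dtype=float), index=rng)
ts.iloc[[2, 3]] = np.nan  # two gaps in the first 5-minute bin

allow_one = resample_sum_max_nan(ts, "5min", max_nan=1)
allow_two = resample_sum_max_nan(ts, "5min", max_nan=2)

print(allow_one.tolist())  # [nan, 35.0, 21.0]
print(allow_two.tolist())  # [5.0, 35.0, 21.0]
```

With max_nan=1 this gives the behaviour described in the question (one missing value is tolerated, two produce NaN), and max_nan=0 reproduces the strict skipna=False result.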

You might expect ts.resample('5min').sum(skipna=False) to work the same way as ts.sum(skipna=False), but the two implementations are not consistent.

Answer 2

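Calling .resample().agg(pd.Series.sum, skipna=False) is much slower than calling .resample().sum(), especially on dataframes with many columns. Obviously these are different methods that produce different results, but the core intent is the same, so I propose the function below to improve speed.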

test = pd.DataFrame(index = pd.date_range('2023-01-01 00:00', periods=16, freq='15T'), data={'A':10})
test.iloc[13] = pd.NA
test.iloc[:6] = pd.NA
test

               date    A
2023-01-01 00:00:00  NaN
2023-01-01 00:15:00  NaN
2023-01-01 00:30:00  NaN
2023-01-01 00:45:00  NaN
2023-01-01 01:00:00  NaN
2023-01-01 01:15:00  NaN
2023-01-01 01:30:00 10.0
2023-01-01 01:45:00 10.0
2023-01-01 02:00:00 10.0
2023-01-01 02:15:00 10.0
2023-01-01 02:30:00 10.0
2023-01-01 02:45:00 10.0
2023-01-01 03:00:00 10.0
2023-01-01 03:15:00  NaN
2023-01-01 03:30:00 10.0
2023-01-01 03:45:00 10.0

When tested on the dataframe above replicated to 100 columns, the 'skipna=False' version turns out to be about 17 times slower.

%timeit pd.concat([test]*100, axis=1).resample('H').sum()

2.07 ms ± 23.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit pd.concat([test]*100, axis=1).resample('H').agg(pd.Series.sum, skipna=False)

34.1 ms ± 857 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

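The function below is about 8 times faster than the .resample().agg(pd.Series.sum, skipna=False) approach. Note that it slightly inflates some of the sums, but this can be controlled by making the nan_number input as small as possible.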

import pandas as pd

def resample_sum_keep_nans(df, target_freq='H', nan_number = 0.01):
    """Function returns a downsampled dataframe that returns NaN for downsampled
    intervals when all values in source intervals are NaN.
    Input df should have a time-series index that 'is_monotonic_increasing'
    and has a defined frequency.
    Select a 'nan_number' value that is very small relative to the input 
    df values to avoid significant alteration of the output downsampled totals
    for source intervals that have some but not all NaN values.  Refer to Pandas
    resample method for orientation on defining 'target_freq' of output df"""
    assert df.index.is_monotonic_increasing
    assert df.index.freq is not None
    # Confirm that the suggested nan_number is not already present in df.
    # This matters because the temporary fill-in value must be easily
    # distinguished from the values already present in df.
    if df.sample(frac=0.5).dropna(how='all').eq(nan_number).any().any():
        print(f'nan_number {nan_number} exists in input df.  '
              'Input a nan_number that is not present in df')
    else:
        # Determine what the temporary nan_number becomes after being summed
        # across intervals in downsampling.
        old_delta = df.iloc[:, 0].index.freq.delta
        new_delta = df.iloc[:, 0].resample(target_freq).sum().index.freq.delta
        freq_multiplier = new_delta / old_delta
        # Define the number to search for and replace with NaN in the resampled df.
        nan_number_resampled = freq_multiplier * nan_number
        # Fill NaNs with nan_number and resample.
        df = df.fillna(nan_number).resample(target_freq).sum()
        # Replace any value equal to nan_number_resampled with NaN and return df.
        return df.mask(df.eq(nan_number_resampled), pd.NA)

%timeit resample_sum_keep_nans(pd.concat([test]*100, axis=1))

3.94 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Note that test.resample('H').agg(pd.Series.sum, skipna=False) discards the valid values in any source interval that also contains NaN (see 2023-01-01 01:00:00).

                        A
date                     
2023-01-01 00:00:00   NaN
2023-01-01 01:00:00   NaN
2023-01-01 02:00:00  40.0
2023-01-01 03:00:00   NaN

In contrast, resample_sum_keep_nans(test, nan_number=0.00001).round(0) keeps the valid values that are mixed with NaN in a source interval and sums them.

                        A
date                     
2023-01-01 00:00:00   NaN
2023-01-01 01:00:00  20.0
2023-01-01 02:00:00  40.0
2023-01-01 03:00:00  30.0
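If the only requirement is the one stated in the docstring (NaN only when every source value in an interval is missing), the built-in min_count argument appears to give the same table without a sentinel value or rounding. The sketch below rebuilds the test frame with a float column and uses "15min" and "60min" as frequency strings, which are assumptions chosen to avoid the older 'T' and 'H' aliases.

```python
import numpy as np
import pandas as pd

# Rebuild the test frame: 16 quarter-hour samples of 10.0 with some gaps.
idx = pd.date_range("2023-01-01 00:00", periods=16, freq="15min")
test = pd.DataFrame({"A": 10.0}, index=idx)
test.iloc[13] = np.nan   # a single gap in the 03:00 hour
test.iloc[:6] = np.nan   # the whole 00:00 hour and half of the 01:00 hour

# min_count=1: a bin needs at least one valid value, otherwise it is NaN.
hourly = test.resample("60min").sum(min_count=1)

print(hourly["A"].tolist())  # [nan, 20.0, 40.0, 30.0]
```

Because this stays on the resampler's built-in sum, it should also avoid the per-group Python calls that make the agg-based version slow, though the timings above were not re-measured for it.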
