Capping outliers in a dataframe


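I have a dataframe with a "tot_dl_vol" column. I want to cap the values of this column wherever the year-over-year percentage change is above 80% or below 10%. How can I achieve this? So far I have written this code.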

df['YoY_dl'] = (df['tot_dl_vol'].pct_change(12)) * 100
upper = 80
lower = 10
df.loc[df['YoY_dl'] > upper, 'tot_dl_vol'] = df['tot_dl_vol'].shift(12) * (1 + upper/100)
df.loc[df['YoY_dl'] < lower, 'tot_dl_vol'] = df['tot_dl_vol'].shift(12) * (1 - lower/100)

Here is a sample dataframe:

import pandas as pd
from pandas import Timestamp

data = {'key': ['A1', 'A1', 'A1', 'A1', 'A1', 'A1', 'A1', 'A1', 'A1', 'A1', 'A1', 'A1', 'A1', 'A1', 'A1', 'A1', 'A1', 'A1', 'A1', 'A1', 'A1', 'A1', 'A1'],
        'volume': [1714.11, 1907.1, 2927.58, 2656.2, 2364.18, 2372.41, 2363.76, 1956.16, 4146.98, 1971.72, 2588.72, 1853.93, 2050.91, 2267.84, 2634.94, 2750.46, 3072.91, 3363.62, 2717.2, 2273.96, 2228.8, 1886.77, 1864.19],
        'ds': [Timestamp('2021-04-01 00:00:00'), Timestamp('2021-05-01 00:00:00'), Timestamp('2021-06-01 00:00:00'), Timestamp('2021-07-01 00:00:00'), Timestamp('2021-08-01 00:00:00'), Timestamp('2021-09-01 00:00:00'), Timestamp('2021-10-01 00:00:00'), Timestamp('2021-11-01 00:00:00'), Timestamp('2021-12-01 00:00:00'), Timestamp('2022-01-01 00:00:00'), Timestamp('2022-02-01 00:00:00'), Timestamp('2022-03-01 00:00:00'), Timestamp('2022-04-01 00:00:00'), Timestamp('2022-05-01 00:00:00'), Timestamp('2022-06-01 00:00:00'), Timestamp('2022-07-01 00:00:00'), Timestamp('2022-08-01 00:00:00'), Timestamp('2022-09-01 00:00:00'), Timestamp('2022-10-01 00:00:00'), Timestamp('2022-11-01 00:00:00'), Timestamp('2022-12-01 00:00:00'), Timestamp('2023-01-01 00:00:00'), Timestamp('2023-02-01 00:00:00')]}
df = pd.DataFrame(data)

key  volume         ds
 A1 1714.11 2021-04-01
 A1 1907.10 2021-05-01
 A1 2927.58 2021-06-01
 A1 2656.20 2021-07-01
 A1 2364.18 2021-08-01
 A1 2372.41 2021-09-01
 A1 2363.76 2021-10-01
 A1 1956.16 2021-11-01
 A1 4146.98 2021-12-01
 A1 1971.72 2022-01-01
 A1 2588.72 2022-02-01
 A1 1853.93 2022-03-01
 A1 2050.91 2022-04-01
 A1 2267.84 2022-05-01
 A1 2634.94 2022-06-01
 A1 2750.46 2022-07-01
 A1 3072.91 2022-08-01
 A1 3363.62 2022-09-01
 A1 2717.20 2022-10-01
 A1 2273.96 2022-11-01
 A1 2228.80 2022-12-01
 A1 1886.77 2023-01-01
 A1 1864.19 2023-02-01
python pandas dataframe outliers
1 Answer

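When capping, you also need to apply the mask on the right-hand side of the assignment, shifted back by 12 rows so that it selects the prior-year values. Note that the lower bound is compared as -lower, since a 10% drop is a change of -10: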

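df['YoY_dl'] = df['volume'].pct_change(12) * 100  # year-over-year change in percent
upper = 80
lower = 10
# Rows that grew by more than `upper` percent: cap at prior-year value * (1 + upper/100)
mask = df['YoY_dl'] > upper
df.loc[mask, 'volume'] = df.loc[mask.shift(-12, fill_value=False), 'volume'].values * (1 + upper / 100)
# Rows that fell by more than `lower` percent: cap at prior-year value * (1 - lower/100)
mask = df['YoY_dl'] < -lower
df.loc[mask, 'volume'] = df.loc[mask.shift(-12, fill_value=False), 'volume'].values * (1 - lower / 100)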
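To see what the shifted mask selects, here is a minimal sketch on a toy series. The lag of 2 (instead of 12) and the values in `s` are assumptions chosen only to keep the example short; they are not from the question's data.

```python
import pandas as pd

lag = 2      # assumed short lag for illustration; the question uses 12
upper = 80
s = pd.Series([100.0, 100.0, 250.0, 105.0, 300.0, 50.0])

yoy = s.pct_change(lag) * 100                # NaN for the first `lag` rows
mask = yoy > upper                           # rows whose value must be capped
base = mask.shift(-lag, fill_value=False)    # rows holding the matching reference value

capped = s.copy()
# .values drops the index so the reference values are assigned positionally
capped.loc[mask] = s.loc[base].values * (1 + upper / 100)

print(pd.DataFrame({"s": s, "yoy": yoy, "mask": mask, "base": base, "capped": capped}))
```

Only row 2 exceeds the upper bound here, so `base` is True only at row 0, the row that sits `lag` positions earlier.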

I would suggest a different approach:

df['YoY_dl_clipped'] = df['YoY_dl'].clip(lower=-lower, upper=upper)
df['volume_clipped'] = df['volume'].shift(12) * (1 + df['YoY_dl_clipped'] / 100)

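This yields the same capped volumes for every row that has a prior-year value. The first 12 rows of volume_clipped are NaN, because shift(12) has nothing to look back to.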
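As a rough end-to-end sketch of the clip-based idea, the snippet below wraps it in a helper and keeps the original value wherever no prior-year value exists. The function name `cap_yoy` and its parameters are hypothetical, introduced here for illustration; the volume figures are the sample data from the question.

```python
import pandas as pd

def cap_yoy(volume: pd.Series, lag: int = 12,
            lower: float = -10, upper: float = 80) -> pd.Series:
    """Cap `volume` so its percent change versus `lag` rows earlier stays in [lower, upper]."""
    prev = volume.shift(lag)                   # value `lag` rows earlier
    yoy = (volume / prev - 1) * 100            # year-over-year change in percent
    capped = prev * (1 + yoy.clip(lower, upper) / 100)
    # Keep the original where nothing was clipped or no prior-year value exists
    unchanged = yoy.isna() | yoy.between(lower, upper)
    return volume.where(unchanged, capped)

volume = pd.Series([1714.11, 1907.1, 2927.58, 2656.2, 2364.18, 2372.41,
                    2363.76, 1956.16, 4146.98, 1971.72, 2588.72, 1853.93,
                    2050.91, 2267.84, 2634.94, 2750.46, 3072.91, 3363.62,
                    2717.2, 2273.96, 2228.8, 1886.77, 1864.19])

capped = cap_yoy(volume)
changed = capped[capped != volume]   # rows that were actually capped
print(changed)
```

On this sample no row exceeds the +80% bound; only the 2022-12 and 2023-02 rows fall by more than 10% and are raised to 90% of their prior-year values.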
