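I have a DataFrame of population counts broken down by combinations of categorical demographic features and date. Some dates have missing values (consistently across all combinations), which leaves gaps in the data.
I am trying to:
The function in (2) operates on the existing population column, which has missing values, and redistributes the post-gap count backward across the gap. I believe the function works as intended, but I am struggling to stitch it into a group-by context and turn the result into a new column in the DataFrame.
Here is the sample data: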
     age   race  gender        date  population
0  15-24   AAPI    Male  2020-01-01         1.0
1  15-24   AAPI    Male  2020-01-02         2.0
2  15-24   AAPI    Male  2020-01-03         2.0
...
7  15-24  Black  Female  2020-01-01         0.0
8  15-24  Black  Female  2020-01-02         NaN
9  15-24  Black  Female  2020-01-03         3.0
For the simple example above, the desired output would be:
     age   race  gender        date  population  interpolated
0  15-24   AAPI    Male  2020-01-01         1.0           1.0
1  15-24   AAPI    Male  2020-01-02         2.0           2.0
2  15-24   AAPI    Male  2020-01-03         2.0           2.0
...
7  15-24  Black  Female  2020-01-01         0.0           0.0
8  15-24  Black  Female  2020-01-02         NaN           1.5
9  15-24  Black  Female  2020-01-03         3.0           1.5
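To spell out the arithmetic behind the two 1.5 values: the count recorded on the day after the gap (3.0) is split evenly across the gap day and that day itself. A minimal sketch, with variable names chosen purely for illustration:

```python
# Count recorded on the first day after the gap (2020-01-03 for Black / Female)
count_after_gap = 3.0
# Number of missing days in the gap (only 2020-01-02 here)
gap_length = 1

# Spread the count evenly over the gap days plus the day after
share = count_after_gap / (gap_length + 1)
print(share)  # 1.5
```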
I wrote the following function, which takes a list of date gaps as input:
gaps = [
    {
        "gap": [2020-01-02],
        "day_after": 2020-01-03,
    }
]

def bfill_pop(gaps, group):
    for el in gaps:
        fill_val = group.loc[group["date"] == el["day_after"], "population"] / (
            len(el["gap"]) + 1
        )
        group.loc[group["date"].isin(el["gap"]), "population"] = fill_val
        group.loc[group["date"] == el["day_after"], "population"] = fill_val
    return group.rename(columns={"population": "interpolated"})["interpolated"]
When I try to apply it to the DataFrame with apply() or transform(), I get errors such as:
df["interpolated"] = df.groupby(["age", "race", "gender"]).apply(
    lambda g: bfill_pop(gaps, g)
)
> ValueError: cannot handle a non-unique multi-index!
Is there a way to do this with an apply or transform function?
You're almost there. The problem is just how the function handles the index:
import numpy as np
import pandas as pd

def bfill_pop(gaps, group):
    for el in gaps:
        # Select the population on the day after the gap (a Series, possibly empty)
        day_after_population = group.loc[group['date'] == pd.to_datetime(el['day_after']), 'population']
        # Skip groups that have no row for the day after the gap
        if not day_after_population.empty:
            # .iloc[0] yields a scalar, so the assignments below do not depend on index alignment
            fill_val = day_after_population.iloc[0] / (len(el['gap']) + 1)
            group.loc[group['date'].isin([pd.to_datetime(date) for date in el['gap']]), 'population'] = fill_val
            group.loc[group['date'] == pd.to_datetime(el['day_after']), 'population'] = fill_val
    return group

data = {
    'age': ['15-24'] * 6,
    'race': ['AAPI', 'AAPI', 'Black', 'Black', 'Black', 'Black'],
    'gender': ['Male', 'Male', 'Female', 'Female', 'Female', 'Female'],
    'date': pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-01', '2020-01-02', '2020-01-02', '2020-01-03']),
    'population': [1.0, 2.0, 0.0, np.nan, np.nan, 3.0]
}
df = pd.DataFrame(data)

gaps = [
    {
        "gap": ['2020-01-02'],
        "day_after": '2020-01-03',
    }
]

# reset_index(drop=True) discards the group keys added by apply; the result lines up
# with df here because df is already sorted by the group keys and has a default index
df['interpolated'] = df.groupby(['age', 'race', 'gender']).apply(
    lambda g: bfill_pop(gaps, g)
).reset_index(drop=True)['population']
print(df)
This gives you:
     age   race  gender        date  population  interpolated
0  15-24   AAPI    Male  2020-01-01         1.0           1.0
1  15-24   AAPI    Male  2020-01-02         2.0           2.0
2  15-24  Black  Female  2020-01-01         0.0           0.0
3  15-24  Black  Female  2020-01-02         NaN           1.5
4  15-24  Black  Female  2020-01-02         NaN           1.5
5  15-24  Black  Female  2020-01-03         3.0           1.5
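Note that the reset_index(drop=True) step relies on df already being sorted by the group keys and having a default integer index. If that can't be guaranteed, one possible alternative is to skip apply and broadcast each group's day-after value with transform, which keeps the original index. The sketch below is only that: the helper name redistribute_gaps and its parameters are made up for illustration, and it assumes the same gaps structure and a datetime date column.

```python
import numpy as np
import pandas as pd

def redistribute_gaps(df, gaps, keys, date_col="date", value_col="population"):
    """Hypothetical helper: spread each group's day-after count back over the gap."""
    out = df[value_col].copy()
    group_keys = [df[k] for k in keys]
    for el in gaps:
        gap_dates = pd.to_datetime(el["gap"])
        day_after = pd.to_datetime(el["day_after"])
        is_after = df[date_col] == day_after
        in_span = df[date_col].isin(gap_dates) | is_after
        # Keep only the day-after values, then broadcast the first
        # non-null one to every row of the same group
        after_val = out.where(is_after).groupby(group_keys).transform("first")
        fill = after_val / (len(gap_dates) + 1)
        # Overwrite gap and day-after rows, but only in groups
        # that actually have a day-after value
        out = out.mask(in_span & fill.notna(), fill)
    return out

df = pd.DataFrame({
    "age": ["15-24"] * 6,
    "race": ["AAPI", "AAPI", "Black", "Black", "Black", "Black"],
    "gender": ["Male", "Male", "Female", "Female", "Female", "Female"],
    "date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-01",
                            "2020-01-02", "2020-01-02", "2020-01-03"]),
    "population": [1.0, 2.0, 0.0, np.nan, np.nan, 3.0],
})
gaps = [{"gap": ["2020-01-02"], "day_after": "2020-01-03"}]

result = redistribute_gaps(df, gaps, keys=["age", "race", "gender"])
df["interpolated"] = result
print(df)
```

On the example data this should produce the same interpolated column as above. One difference from the apply version: it uses the first non-null day-after value in each group rather than the first row, which only matters if a group has several rows for that date.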