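I have a pandas DataFrame with 4 columns: ID, DATE, PRIMARY_INDICATOR, PHONE.
Each ID can have multiple rows in the table. Within each ID subgroup, rows are guaranteed to be sorted by DATE in descending order. For example: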
>>> df.head(10)
ID DATE PRIMARY_INDICATOR PHONE
0 123 20230125 1 8071234
1 123 20230124 0 8079999
2 999 20230125 1 8074312
3 999 20230120 1 9087654
4 999 20230119 0 1235678
5 765 20230125 0 9990000
6 765 20230125 0 9999999
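My goal is:
- If an ID has exactly one row with PRIMARY_INDICATOR set to 1, leave the group unchanged (df.loc[df['ID'] == 123] is an example of such a group).
- If an ID has no row with PRIMARY_INDICATOR set to 1, leave the group unchanged (df.loc[df['ID'] == 765] is an example of such a group).
- If an ID has more than one row with PRIMARY_INDICATOR set to 1, keep the 1 only on the most recent of those rows and set the others to 0. For the group made up of df.loc[df['ID'] == 999], the result would look like this:
>>> df.head(10)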
ID DATE PRIMARY_INDICATOR PHONE
2 999 20230125 1 8074312
3 999 20230120 0 9087654
4 999 20230119 0 1235678
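I have about 280,000 unique IDs.
I tried adding the IDs to a set, popping them from the set one at a time, creating a subset of the DataFrame for each ID with loc, and then iterating over that subset with iterrows and a bool flag.
This approach works, but it is very slow. The only meaningfully slow part of the script is creating the subset DataFrame for each popped ID, idDataframe = df.loc[df['ID'] == currentId]. That step takes about 0.016475 seconds per ID, and the whole script took 80 minutes.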
>>> import pandas as pd
>>>
>>> pd.set_option('display.max_columns', None)
>>> data = {'ID': [123, 123, 999, 999, 999, 765, 765],
...         'DATE': ['20230125', '20230124', '20230125', '20230120', '20230119', '20230125', '20230125'],
...         'PRIMARY_INDICATOR': [1, 0, 1, 1, 0, 0, 0],
...         'PHONE': [8071234, 8079999, 8074312, 9087654, 1235678, 9990000, 9999999]}
>>> df = pd.DataFrame.from_dict(data)
>>> df.head(10)
ID DATE PRIMARY_INDICATOR PHONE
0 123 20230125 1 8071234
1 123 20230124 0 8079999
2 999 20230125 1 8074312
3 999 20230120 1 9087654
4 999 20230119 0 1235678
5 765 20230125 0 9990000
6 765 20230125 0 9999999
import pandas as pd
data = {'ID': [123, 123, 999, 999, 999, 765, 765],
        'DATE': ['20230125', '20230124', '20230125', '20230120', '20230119', '20230125', '20230125'],
        'PRIMARY_INDICATOR': [1, 0, 1, 1, 0, 0, 0],
        'PHONE': [8071234, 8079999, 8074312, 9087654, 1235678, 9990000, 9999999]}
df = pd.DataFrame.from_dict(data)
idSet = set(df.ID.unique())
while idSet:
    # pop one id from the set
    currentId = idSet.pop()
    # get a subset of the original dataframe which only shows the pop'd id's records
    idDataframe = df.loc[df['ID'] == currentId]
    # drop_duplicates() returns a new frame, so the result has to be assigned back
    idDataframe = idDataframe.drop_duplicates()
    # create the output row from each row in the subframe
    primaryPhoneIndicated = False
    for index, row in idDataframe.iterrows():
        # only the first row flagged 1 (the most recent, given the sort order) keeps the flag
        if not primaryPhoneIndicated and row['PRIMARY_INDICATOR'] == 1:
            PRIMARY_INDICATOR = 1
            primaryPhoneIndicated = True
        else:
            PRIMARY_INDICATOR = 0
        print([row['ID'], row['DATE'], PRIMARY_INDICATOR, row['PHONE']])
Is there a pandas-y way to do this without creating a DataFrame for each ID just to apply the logic?
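You can try:
def change_indicator(group):
    # only groups with more than one flagged row need fixing
    if group["PRIMARY_INDICATOR"].sum() > 1:
        # flag the row(s) carrying the latest DATE in the group, clear the rest
        group["PRIMARY_INDICATOR"] = (group["DATE"] == group["DATE"].max()).astype(int)
    return group
out = df.groupby("ID", group_keys=False).apply(change_indicator)
print(out)
Prints: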
ID DATE PRIMARY_INDICATOR PHONE
0 123 20230125 1 8071234
1 123 20230124 0 8079999
2 999 20230125 1 8074312
3 999 20230120 0 9087654
4 999 20230119 0 1235678
5 765 20230125 0 9990000
6 765 20230125 0 9999999
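With about 280,000 groups, calling a Python function once per group can still be noticeably slow. As a possible alternative, here is a minimal sketch that avoids `apply` by using a grouped cumulative sum. It assumes, as the question states, that rows are already sorted by DATE descending within each ID, and it mirrors the loop in the question: only the first row flagged 1 in each ID keeps the flag. The names `flag`, `running` and `keep` are illustrative only.

```python
import pandas as pd

# Sample data from the question (assumed sorted by DATE descending within each ID)
data = {
    "ID": [123, 123, 999, 999, 999, 765, 765],
    "DATE": ["20230125", "20230124", "20230125", "20230120",
             "20230119", "20230125", "20230125"],
    "PRIMARY_INDICATOR": [1, 0, 1, 1, 0, 0, 0],
    "PHONE": [8071234, 8079999, 8074312, 9087654, 1235678, 9990000, 9999999],
}
df = pd.DataFrame(data)

# 1 where the row is currently flagged as primary, 0 otherwise
flag = df["PRIMARY_INDICATOR"].eq(1).astype(int)

# Running count of flagged rows within each ID, in the existing row order
running = flag.groupby(df["ID"]).cumsum()

# Keep the flag only on the first flagged row of each ID
keep = flag.eq(1) & running.eq(1)

# assign() returns a new frame and leaves df untouched
out = df.assign(PRIMARY_INDICATOR=keep.astype(int))
print(out)
```

The two approaches are not identical in every case. The `apply` version flags every row whose DATE equals the group maximum, so two rows sharing the latest date would both end up as 1, and the latest row is flagged even if it was originally 0. The sketch above only ever keeps a flag that was already set. Which behaviour is right depends on the data, so it is worth checking both against a few real IDs.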