So I have the following dataframe:
Period group ID
20130101 A 10
20130101 A 20
20130301 A 20
20140101 A 20
20140301 A 30
20140401 A 40
20130101 B 11
20130201 B 21
20130401 B 31
20140401 B 41
20140501 B 51
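For anyone who wants to reproduce this, the table above can be built roughly as follows. This is only a sketch: it assumes `Period` is stored as `YYYYMMDD` strings and parsed with `pd.to_datetime`, which is one way to get the datetime column described in the question.

```python
import pandas as pd

# Rows copied from the table above, in the same order.
rows = [
    ("20130101", "A", 10), ("20130101", "A", 20), ("20130301", "A", 20),
    ("20140101", "A", 20), ("20140301", "A", 30), ("20140401", "A", 40),
    ("20130101", "B", 11), ("20130201", "B", 21), ("20130401", "B", 31),
    ("20140401", "B", 41), ("20140501", "B", 51),
]
df = pd.DataFrame(rows, columns=["Period", "group", "ID"])

# Assumption: the YYYYMMDD strings are converted to datetime64 values.
df["Period"] = pd.to_datetime(df["Period"], format="%Y%m%d")
```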
I need to count how many distinct ID values each group had over the last year. So my desired output looks like this:
Period group num_ids_last_year
20130101 A 2 # ID 10 and 20 in the last year
20130301 A 2
20140101 A 2
20140301 A 2 # ID 30 enters, ID 10 leaves
20140401 A 3 # ID 40 enters
20130101 B 1
20130201 B 2
20130401 B 3
20140401 B 2 # ID 11 and 21 leave
20140501 B 2 # ID 31 leaves, ID 51 enters
Period is in datetime format. I have tried many things:
df.groupby(['group','Period'])['ID'].nunique() # Get number of IDs by group in a given period.
df.groupby(['group'])['ID'].nunique() # Get total number of IDs by group.
df.set_index('Period').groupby('group')['ID'].rolling(window=1, freq='Y').nunique()
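But the last one does not even work. Is there a simple way to do this? I was thinking of some combination of cumcount() and pd.DateOffset or ge(df.Period - dt.timedelta(365)), but I can't find the answer.
Thanks.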
Edit: added the fact that I can find more than one ID in a given Period.
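import pandas as pd
from dateutil.relativedelta import relativedelta

df.sort_values(by=['Period'], inplace=True) # if not already sorted
# create new output df: one row per (Period, group), holding the list of IDs seen
df1 = (df.groupby(['Period','group'])['ID']
.apply(list)
.reset_index())
# for each row, count the distinct IDs of the same group whose Period lies
# in [Period - 1 year, Period] (both ends inclusive)
df1['num_ids_last_year'] = df1.apply(
    lambda x: len(set(
        df1.loc[(df1['Period'] >= x['Period'] - relativedelta(years=1))
                & (df1['Period'] <= x['Period'])
                & (df1['group'] == x['group'])]
        .ID.apply(pd.Series).stack())),
    axis=1)
# sort by group and then Period so the rows come out in the order shown in the question
df1.sort_values(by=['group', 'Period'], inplace=True)
df1.drop('ID', axis=1, inplace=True)
df1 = df1.reset_index(drop=True)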
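The row-wise `apply` above rescans the whole frame for every row. A possible alternative, sketched here and not part of the answer above, is to self-merge on `group` and filter by the date window. The function name `count_ids_last_year` is hypothetical, `pd.DateOffset(years=1)` is swapped in for `relativedelta`, and the sample data is rebuilt from the question's table so the block runs on its own.

```python
import pandas as pd

def count_ids_last_year(df):
    """Count distinct IDs per group in the year ending at each Period (inclusive)."""
    # Distinct (Period, group) pairs become the rows of the output.
    keys = df[["Period", "group"]].drop_duplicates()
    # Pair every output row with every observation from the same group.
    pairs = keys.merge(df, on="group", suffixes=("", "_obs"))
    # Keep observations whose Period falls inside [Period - 1 year, Period].
    in_window = (
        (pairs["Period_obs"] >= pairs["Period"] - pd.DateOffset(years=1))
        & (pairs["Period_obs"] <= pairs["Period"])
    )
    return (
        pairs[in_window]
        .groupby(["group", "Period"])["ID"]
        .nunique()
        .rename("num_ids_last_year")
        .reset_index()
    )

# Sample data from the question, with Period parsed as datetime.
rows = [
    ("20130101", "A", 10), ("20130101", "A", 20), ("20130301", "A", 20),
    ("20140101", "A", 20), ("20140301", "A", 30), ("20140401", "A", 40),
    ("20130101", "B", 11), ("20130201", "B", 21), ("20130401", "B", 31),
    ("20140401", "B", 41), ("20140501", "B", 51),
]
df = pd.DataFrame(rows, columns=["Period", "group", "ID"])
df["Period"] = pd.to_datetime(df["Period"], format="%Y%m%d")

result = count_ids_last_year(df)
print(result)
```

The merge builds every within-group pair of rows, so memory grows quadratically with group size. For very large groups the row-wise version may be the safer choice.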