Suppose I have a dataframe like the following:
Contact Amount Last updated
0 011000111 2 2023-01-01
1 011000111 2 2023-01-02
2 011000112 2 2023-01-03
3 011000112 2 2023-01-04
4 011000112 2 2023-01-06
5 011000111 2 2023-01-07
6 011000111 2 2023-01-09
7 011000111 3 2023-01-11
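I want to keep the row with the latest date in the Last updated column each time the combination of Contact and Amount changes. The expected dataframe should look like this: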
Contact Amount Last updated
1 011000111 2 2023-01-02
4 011000112 2 2023-01-06
6 011000111 2 2023-01-09
7 011000111 3 2023-01-11
This is what I have so far:
import pandas as pd
# create the dataframe
data = {'Contact': ['011000111', '011000111', '011000112', '011000112', '011000112', '011000111', '011000111', '011000111'],
'Amount': [2, 2, 2, 2, 2, 2, 2, 3],
'Last updated': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-06', '2023-01-07', '2023-01-09', '2023-01-11']}
df = pd.DataFrame(data)
# convert the "Last updated" column to a datetime type
df['Last updated'] = pd.to_datetime(df['Last updated'])
# sort the dataframe by the "Last updated" column in ascending order
df = df.sort_values(['Last updated'], ascending=True)
# drop the duplicates based on the "Contact" and "Amount" columns and keep the last occurrence
result = df.drop_duplicates(subset=['Contact','Amount'], keep='last')
print(result)
Output:
Contact Amount Last updated
4 011000112 2 2023-01-06
6 011000111 2 2023-01-09
7 011000111 3 2023-01-11
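There should be four rows, because Contact 011000111 with Amount 2 occurs in two separate runs, and the latest row of the first run (Last updated 2023-01-02) should also be kept.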
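To see why the row dated 2023-01-02 disappears, it may help to look at which rows are treated as duplicates. The following is a minimal sketch using the same data as above; the `dropped` column name is only illustrative. `duplicated(keep="last")` flags every row that has a later row with the same (Contact, Amount) pair anywhere in the frame, so the two separate runs of 011000111 with Amount 2 are merged into one.

```python
import pandas as pd

df = pd.DataFrame({
    "Contact": ["011000111", "011000111", "011000112", "011000112",
                "011000112", "011000111", "011000111", "011000111"],
    "Amount": [2, 2, 2, 2, 2, 2, 2, 3],
    "Last updated": pd.to_datetime([
        "2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04",
        "2023-01-06", "2023-01-07", "2023-01-09", "2023-01-11"]),
})

# True means drop_duplicates(keep="last") would remove the row:
# some later row carries the same (Contact, Amount) pair,
# no matter which other rows lie in between.
dropped = df.duplicated(subset=["Contact", "Amount"], keep="last")
print(df.assign(dropped=dropped))
```

Row 1 is flagged because rows 5 and 6 repeat its Contact and Amount later on, even though rows 2 to 4 belong to a different Contact.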
This includes all rows with the most recent Last updated for each Contact:
import pandas as pd
# create the dataframe
data = {'Contact': ['011000111', '011000111', '011000112', '011000112', '011000112', '011000111', '011000111', '011000111'],
'Amount': [2, 2, 2, 2, 2, 2, 2, 3],
'Last updated': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-06', '2023-01-07', '2023-01-09', '2023-01-11']}
df = pd.DataFrame(data)
# convert the "Last updated" column to a datetime type
df['Last updated'] = pd.to_datetime(df['Last updated'])
# sort the dataframe by "Contact" ascending and "Last updated" descending
df = df.sort_values(['Contact', 'Last updated'], ascending=[True, False])
# drop the duplicates based on the "Contact" column; after the descending date sort, the first occurrence is the most recent row
result = df.drop_duplicates(subset='Contact', keep='first')
# keep all rows with the most recent 'Last updated' date for each unique 'Contact'
result = result.loc[result.groupby('Contact')['Last updated'].idxmax()].sort_values(['Last updated'])
print(result)
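What I do here is drop all duplicates based on Contact so that one row per Contact remains. I then group the resulting dataframe by Contact and use idxmax() to pick the row with the maximum Last updated in each group. The result has one row per Contact.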
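The group-and-idxmax step described above can be sketched on its own, without the preceding sort and de-duplication. This is a minimal illustration on the question's data, not necessarily the answer author's exact intent; the name `latest_per_contact` is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Contact": ["011000111", "011000111", "011000112", "011000112",
                "011000112", "011000111", "011000111", "011000111"],
    "Amount": [2, 2, 2, 2, 2, 2, 2, 3],
    "Last updated": pd.to_datetime([
        "2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04",
        "2023-01-06", "2023-01-07", "2023-01-09", "2023-01-11"]),
})

# idxmax() returns, for each Contact, the index label of the row
# holding the largest "Last updated" value; .loc then selects those rows.
latest_idx = df.groupby("Contact")["Last updated"].idxmax()
latest_per_contact = df.loc[latest_idx].sort_values("Last updated")
print(latest_per_contact)
```

Grouping by Contact alone yields two rows here (index 4 and index 7), so it does not separate the two runs of 011000111 that the question asks about.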
# compare current and prev row
c = ['Contact', 'Amount']
mask = df[c] != df[c].shift()
# Are the rows different?
mask = mask.any(axis=1)
# ensure the column is datetime type
df['Last updated'] = pd.to_datetime(df['Last updated'])
# use cumsum to label each block of consecutive identical rows
# then group the dataframe by that label and find the index of the max value in "Last updated"
result = df.loc[df.groupby(mask.cumsum())['Last updated'].idxmax()]
Output:
Contact Amount Last updated
1 011000111 2 2023-01-02
4 011000112 2 2023-01-06
6 011000111 2 2023-01-09
7 011000111 3 2023-01-11
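A possible variation on the same idea, shown here as a sketch rather than as part of the answer above: if the rows are already sorted by Last updated in ascending order (an assumption that holds for the sample data), the latest row of each block is simply the row whose successor has a different Contact or Amount. Comparing against `shift(-1)` finds those rows without a groupby. The names `keys` and `is_run_end` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Contact": ["011000111", "011000111", "011000112", "011000112",
                "011000112", "011000111", "011000111", "011000111"],
    "Amount": [2, 2, 2, 2, 2, 2, 2, 3],
    "Last updated": pd.to_datetime([
        "2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04",
        "2023-01-06", "2023-01-07", "2023-01-09", "2023-01-11"]),
})

# Assumption: rows are in ascending "Last updated" order, so the last
# row of each consecutive block also carries the block's latest date.
df = df.sort_values("Last updated")

keys = ["Contact", "Amount"]
# Compare each row with the NEXT row; the final row is compared with NaN,
# which always counts as different, so it is kept as a block end.
is_run_end = (df[keys] != df[keys].shift(-1)).any(axis=1)
result = df[is_run_end]
print(result)
```

On the sample data this selects the rows with index 1, 4, 6 and 7. If the dates inside a block are not in ascending order, the groupby with idxmax() shown above is the safer choice.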