Pandas: keep the most recently updated row

Question

Suppose I have a dataframe like this:

     Contact  Amount Last updated
0  011000111       2   2023-01-01
1  011000111       2   2023-01-02
2  011000112       2   2023-01-03
3  011000112       2   2023-01-04
4  011000112       2   2023-01-06
5  011000111       2   2023-01-07
6  011000111       2   2023-01-09
7  011000111       3   2023-01-11

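I want to keep the row with the latest date in the "Last updated" column each time the combination of "Contact" and "Amount" changes. The expected dataframe should look like: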

     Contact  Amount Last updated
1  011000111       2   2023-01-02
4  011000112       2   2023-01-06
6  011000111       2   2023-01-09
7  011000111       3   2023-01-11

This is what I have so far:

import pandas as pd

# create the dataframe
data = {'Contact': ['011000111', '011000111', '011000112', '011000112', '011000112', '011000111', '011000111', '011000111'],
        'Amount': [2, 2, 2, 2, 2, 2, 2, 3],
        'Last updated': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-06', '2023-01-07', '2023-01-09', '2023-01-11']}
df = pd.DataFrame(data)

# convert the "Last updated" column to a datetime type
df['Last updated'] = pd.to_datetime(df['Last updated'])

# sort the dataframe by the "Last updated" column in ascending order
df = df.sort_values(['Last updated'], ascending=True)

# drop the duplicates based on the "Contact" and "Amount" columns and keep the last occurrence
result = df.drop_duplicates(subset=['Contact','Amount'], keep='last')

print(result)

Output:

     Contact  Amount Last updated
4  011000112       2   2023-01-06
6  011000111       2   2023-01-09
7  011000111       3   2023-01-11

Answer 1

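The result should have four rows, because contact 011000111 also has an earlier run whose latest update is 2023-01-02. The following is meant to include every row with the most recent "Last updated" for its "Contact":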

import pandas as pd

# create the dataframe
data = {'Contact': ['011000111', '011000111', '011000112', '011000112', '011000112', '011000111', '011000111', '011000111'],
        'Amount': [2, 2, 2, 2, 2, 2, 2, 3],
        'Last updated': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-06', '2023-01-07', '2023-01-09', '2023-01-11']}
df = pd.DataFrame(data)

# convert the "Last updated" column to a datetime type
df['Last updated'] = pd.to_datetime(df['Last updated'])

# sort the dataframe by "Contact" (ascending) and "Last updated" (descending)
df = df.sort_values(['Contact', 'Last updated'], ascending=[True, False])

# drop the duplicates based on the "Contact" column and keep the last occurrence
result = df.drop_duplicates(subset='Contact', keep='last')

# keep all rows with the most recent 'Last updated' date for each unique 'Contact'
result = result.loc[result.groupby('Contact')['Last updated'].idxmax()].sort_values(['Last updated'])

print(result)

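What I did is drop all duplicates based on "Contact" and keep the last occurrence. Where the same contact still appears with different dates, the resulting dataframe is grouped by "Contact", and idxmax() is then used to get the row with the maximum "Last updated" in each group.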
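
A minimal sketch of the groupby and idxmax() step on its own, applied directly to the question's data. The variable names idx and latest_per_contact are illustrative only and do not come from the answer.

```python
import pandas as pd

# the question's data
df = pd.DataFrame({
    "Contact": ["011000111", "011000111", "011000112", "011000112",
                "011000112", "011000111", "011000111", "011000111"],
    "Amount": [2, 2, 2, 2, 2, 2, 2, 3],
    "Last updated": pd.to_datetime([
        "2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04",
        "2023-01-06", "2023-01-07", "2023-01-09", "2023-01-11"]),
})

# index label of the latest "Last updated" within each Contact group
idx = df.groupby("Contact")["Last updated"].idxmax()

# select those rows: one row per distinct Contact
latest_per_contact = df.loc[idx]
print(latest_per_contact)
```

Because the grouping key is "Contact" alone, this selects one row per contact (index 7 and index 4 here), so separate runs of the same contact are not distinguished.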

Answer 2

Annotated code

# compare current and prev row
c = ['Contact', 'Amount']
mask = df[c] != df[c].shift()

# Are the rows different?
mask = mask.any(axis=1)

# ensure the column is datetime type
df['Last updated'] = pd.to_datetime(df['Last updated'])

# use cumsum to identify different blocks of same rows
# then group the dataframe and find the index of max value in last updated
result = df.loc[df.groupby(mask.cumsum())['Last updated'].idxmax()]

Result

     Contact Amount Last updated
1  011000111      2   2023-01-02
4  011000112      2   2023-01-06
6  011000111      2   2023-01-09
7  011000111      3   2023-01-11
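
To make the block-labelling step easier to follow, here is a self-contained sketch of the same idea that also exposes the intermediate block ids. The names keys, changed, block and result_tail are illustrative, and the tail(1) variant is an alternative that assumes the rows are already in chronological order, as they are in the question.

```python
import pandas as pd

# the question's data
df = pd.DataFrame({
    "Contact": ["011000111", "011000111", "011000112", "011000112",
                "011000112", "011000111", "011000111", "011000111"],
    "Amount": [2, 2, 2, 2, 2, 2, 2, 3],
    "Last updated": pd.to_datetime([
        "2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04",
        "2023-01-06", "2023-01-07", "2023-01-09", "2023-01-11"]),
})

keys = ["Contact", "Amount"]

# True wherever a row differs from the previous row in any key column
changed = (df[keys] != df[keys].shift()).any(axis=1)

# running count of changes: rows in the same consecutive run share an id
block = changed.cumsum()
print(block.tolist())

# latest row of each run, found by the maximum date
result = df.loc[df.groupby(block)["Last updated"].idxmax()]

# alternative, assuming rows are already sorted by date: last row of each run
result_tail = df.groupby(block).tail(1)
print(result)
```

Grouping by the run id rather than by the key columns themselves is what keeps the two separate runs of contact 011000111 with amount 2 apart.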
