Pandas: keep the most recently updated row

Question  Votes: 0  Answers: 2

Suppose I have a dataframe like this:

     Contact  Amount Last updated
0  011000111       2   2023-01-01
1  011000111       2   2023-01-02
2  011000112       2   2023-01-03
3  011000112       2   2023-01-04
4  011000112       2   2023-01-06
5  011000111       2   2023-01-07
6  011000111       2   2023-01-09
7  011000111       3   2023-01-11

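I want to keep the row with the latest date in the `Last updated` column each time the combination of `Contact` and `Amount` changes. The expected dataframe should look like: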

     Contact  Amount Last updated
1  011000111       2   2023-01-02
4  011000112       2   2023-01-06
6  011000111       2   2023-01-09
7  011000111       3   2023-01-11

This is what I have so far:

import pandas as pd

# create the dataframe
data = {'Contact': ['011000111', '011000111', '011000112', '011000112', '011000112', '011000111', '011000111', '011000111'],
        'Amount': [2, 2, 2, 2, 2, 2, 2, 3],
        'Last updated': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-06', '2023-01-07', '2023-01-09', '2023-01-11']}
df = pd.DataFrame(data)

# convert the "Last updated" column to a datetime type
df['Last updated'] = pd.to_datetime(df['Last updated'])

# sort the dataframe by the "Last updated" column in ascending order
df = df.sort_values(['Last updated'], ascending=True)

# drop the duplicates based on the "Contact" and "Amount" columns and keep the last occurrence
result = df.drop_duplicates(subset=['Contact','Amount'], keep='last')

print(result)

Output:

     Contact  Amount Last updated
4  011000112       2   2023-01-06
6  011000111       2   2023-01-09
7  011000111       3   2023-01-11
python pandas group-by row data-cleaning
2 Answers
0 votes

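There should be four rows, because contact `011000111` has two separate runs of rows with `Amount` 2, and the first run was last updated on `2023-01-02`.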
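To see why the question's attempt returns only three rows, here is a minimal sketch using the question's sample data. `duplicated(..., keep='last')` flags every row that has a later row with the same `Contact` and `Amount` anywhere in the frame, so row 1 is flagged because rows 5 and 6 carry the same pair, even though they belong to a later, separate run. The names `dup` and `kept` are only illustrative.

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({
    'Contact': ['011000111', '011000111', '011000112', '011000112',
                '011000112', '011000111', '011000111', '011000111'],
    'Amount': [2, 2, 2, 2, 2, 2, 2, 3],
    'Last updated': pd.to_datetime([
        '2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04',
        '2023-01-06', '2023-01-07', '2023-01-09', '2023-01-11']),
})

# True where a later row anywhere in the frame has the same (Contact, Amount)
dup = df.duplicated(subset=['Contact', 'Amount'], keep='last')

# index labels that drop_duplicates(..., keep='last') would keep
kept = df.index[~dup].tolist()
print(kept)  # [4, 6, 7]
```

Row 1 is missing because deduplication is global, not per run of consecutive rows.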

This keeps all rows with the most recent `Last updated` for each `Contact`:

import pandas as pd

# create the dataframe
data = {'Contact': ['011000111', '011000111', '011000112', '011000112', '011000112', '011000111', '011000111', '011000111'],
        'Amount': [2, 2, 2, 2, 2, 2, 2, 3],
        'Last updated': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-06', '2023-01-07', '2023-01-09', '2023-01-11']}
df = pd.DataFrame(data)

# convert the "Last updated" column to a datetime type
df['Last updated'] = pd.to_datetime(df['Last updated'])

# sort by "Contact" in ascending order and "Last updated" in descending order
df = df.sort_values(['Contact', 'Last updated'], ascending=[True, False])

# drop the duplicates based on the "Contact" column and keep the last occurrence
result = df.drop_duplicates(subset='Contact', keep='last')

# keep all rows with the most recent 'Last updated' date for each unique 'Contact'
result = result.loc[result.groupby('Contact')['Last updated'].idxmax()].sort_values(['Last updated'])

print(result)

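What I did here is drop all duplicates based on `Contact` and keep the last occurrence. Because the same `Contact` can appear several times with different dates, the result is then grouped by `Contact`, and `idxmax()` is used to pick the row with the maximum `Last updated` in each group.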
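As a quick sanity check on grouping by `Contact` alone, the sketch below (again on the question's sample data) shows that such a grouping yields one row per distinct contact. The two separate runs of `011000111` fall into the same group, so rows 1 and 6 of the expected output cannot both come out of it. The name `per_contact` is only illustrative.

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({
    'Contact': ['011000111', '011000111', '011000112', '011000112',
                '011000112', '011000111', '011000111', '011000111'],
    'Amount': [2, 2, 2, 2, 2, 2, 2, 3],
    'Last updated': pd.to_datetime([
        '2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04',
        '2023-01-06', '2023-01-07', '2023-01-09', '2023-01-11']),
})

# one idxmax per distinct Contact value, no matter how many runs it has
per_contact = df.groupby('Contact')['Last updated'].idxmax().tolist()
print(per_contact)  # [7, 4]
```

Getting all four expected rows requires a grouping key that changes whenever the consecutive `Contact` and `Amount` values change.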


0 votes

Annotated code:

# assumes df from the question, with pandas imported as pd
# compare current and prev row
c = ['Contact', 'Amount']
mask = df[c] != df[c].shift()

# Are the rows different?
mask = mask.any(axis=1)

# ensure the column is datetime type
df['Last updated'] = pd.to_datetime(df['Last updated'])

# use cumsum to identify different blocks of same rows
# then group the dataframe and find the index of max value in last updated
result = df.loc[df.groupby(mask.cumsum())['Last updated'].idxmax()]

Result:

     Contact Amount Last updated
1  011000111      2   2023-01-02
4  011000112      2   2023-01-06
6  011000111      2   2023-01-09
7  011000111      3   2023-01-11
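To make the `shift` and `cumsum` step more concrete, here is a small sketch on the question's sample data that prints the block id assigned to each row. The names `changed`, `block`, and `last_rows` are illustrative, and the `tail(1)` variant is an alternative that assumes the dates are already ascending within each block, as they are in this sample.

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({
    'Contact': ['011000111', '011000111', '011000112', '011000112',
                '011000112', '011000111', '011000111', '011000111'],
    'Amount': [2, 2, 2, 2, 2, 2, 2, 3],
    'Last updated': pd.to_datetime([
        '2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04',
        '2023-01-06', '2023-01-07', '2023-01-09', '2023-01-11']),
})

c = ['Contact', 'Amount']

# True where a row differs from the previous row in Contact or Amount
# (the first row is compared with NaN, so it always starts a block)
changed = (df[c] != df[c].shift()).any(axis=1)

# running count of changes gives each consecutive block its own id
block = changed.cumsum()
print(block.tolist())  # [1, 1, 2, 2, 2, 3, 3, 4]

# last row of each block; matches idxmax only if dates ascend within a block
last_rows = df.groupby(block).tail(1)
print(last_rows.index.tolist())  # [1, 4, 6, 7]
```

If the rows are not guaranteed to be in date order inside a block, the `idxmax()` form from the answer is the safer choice.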