python pandas - Remove duplicates in a column and keep rows according to complex criteria

Question

Suppose I have this DF:

import pandas as pd

s1 = pd.Series([1,1,2,2,2,3,3,3,4])
s2 = pd.Series([10,20,10,5,10,7,7,3,10])
s3 = pd.Series([0,0,0,0,1,1,0,2,0])
df = pd.DataFrame([s1,s2,s3]).transpose()
df.columns = ['id','qual','nm']
df
   id  qual  nm
0   1    10   0
1   1    20   0
2   2    10   0
3   2     5   0
4   2    10   1
5   3     7   1
6   3     7   0
7   3     3   2
8   4    10   0

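I want a new DF with no duplicate ids, so it should have 4 rows with ids 1, 2, 3, 4. The row to keep for each id should be chosen by the following criteria: take the row with the smallest nm; if there is a tie, take the row with the largest qual; if there is still a tie, just pick one. I think my code should look something like: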

df.groupby('id').apply(lambda x: ???)

It should return:

   id  qual  nm
0   1    20   0
1   2    10   0
2   3     7   0
3   4    10   0

But I am not sure what my function should accept and return.
Or is there perhaps an easier way?
Thanks!

Answer 1

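Use boolean indexing with GroupBy.transform to keep the rows with the minimal nm per group, then the rows with the maximal qual among those. If duplicates still remain, remove them with DataFrame.drop_duplicates: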

# keep the rows with the minimal nm per id
df1 = df[df['nm'] == df.groupby('id')['nm'].transform('min')]
# among those, keep the rows with the maximal qual per id
df1 = df1[df1['qual'] == df1.groupby('id')['qual'].transform('max')]
# if duplicates still remain, keep the first row per id
df1 = df1.drop_duplicates('id')
print(df1)
   id  qual  nm
1   1    20   0
2   2    10   0
6   3     7   0
8   4    10   0
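As a possible alternative to the filtering steps above, the same priority rule (smallest nm, then largest qual, then any row) can be sketched as a sort followed by drop_duplicates. This is only a minimal sketch, not part of the original answer. It rebuilds the question's data from a dict for self-containment, and the name `result` is an illustrative choice.

```python
import pandas as pd

# Same data as in the question, built directly from a dict.
df = pd.DataFrame({
    'id':   [1, 1, 2, 2, 2, 3, 3, 3, 4],
    'qual': [10, 20, 10, 5, 10, 7, 7, 3, 10],
    'nm':   [0, 0, 0, 0, 1, 1, 0, 2, 0],
})

# Sort so that, within each id, the preferred row comes first:
# smallest nm first, and for equal nm the largest qual first.
# drop_duplicates then keeps that first row per id.
result = (
    df.sort_values(['id', 'nm', 'qual'], ascending=[True, True, False])
      .drop_duplicates('id')
      .reset_index(drop=True)
)
print(result)
```

Because reset_index(drop=True) is applied, the index runs from 0 to 3 as in the output the question asks for, rather than keeping the original row labels.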

Answer 2

Use:

grouper = df.groupby(['id'])
mask = (grouper['nm'].transform('min') == df['nm']) & (grouper['qual'].transform('max') == df['qual'])
df.loc[mask, :].drop_duplicates(subset=['id'])

Output

   id  qual  nm
1   1    20   0
2   2    10   0
6   3     7   0
8   4    10   0
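One point that may be worth checking before relying on the single combined mask: it compares qual against the maximum over the whole id group, not over the rows that already have the minimal nm. On the question's data both approaches agree, but the sketch below uses a hypothetical two-row frame (`toy`, not from the question) where the row with the largest qual does not have the smallest nm. In that case the combined mask selects nothing for that id, while filtering in two steps still returns a row.

```python
import pandas as pd

# Hypothetical data: the row with the largest qual (9) has the larger nm (1).
toy = pd.DataFrame({'id': [1, 1], 'qual': [5, 9], 'nm': [0, 1]})

# Combined mask: both conditions are evaluated over the full group.
g = toy.groupby('id')
combined = toy[(g['nm'].transform('min') == toy['nm'])
               & (g['qual'].transform('max') == toy['qual'])]

# Sequential filtering: the max of qual is taken only among min-nm rows.
step1 = toy[toy['nm'] == toy.groupby('id')['nm'].transform('min')]
sequential = step1[step1['qual'] == step1.groupby('id')['qual'].transform('max')]

print(len(combined))    # number of rows kept by the combined mask
print(sequential)       # rows kept by the two-step filter
```

If every id must appear in the result, the two-step filter (or a sort followed by drop_duplicates) is the safer choice under this assumption about the data.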

Answer 3

duckdb:

df1.sql.row_number("over(partition by id order by nm,qual desc) col1","*").filter("col1=1").order("index").select("id,qual,nm")


┌───────┬───────┬───────┐
│  id   │ qual  │  nm   │
│ int64 │ int64 │ int64 │
├───────┼───────┼───────┤
│     1 │    20 │     0 │
│     2 │    10 │     0 │
│     3 │     7 │     0 │
│     4 │    10 │     0 │
└───────┴───────┴───────┘
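The window-function logic behind the call above can also be written as a plain SQL query. The sketch below is an illustration under stated assumptions, not the answer's own code: it runs the query through Python's standard-library sqlite3 module instead of duckdb, purely to stay dependency-free (this needs an SQLite build with window-function support, 3.25 or later). The table name `t` and the alias `ranked` are hypothetical, and the result is ordered by id rather than by an index column.

```python
import sqlite3
import pandas as pd

# Same data as in the question.
df = pd.DataFrame({
    'id':   [1, 1, 2, 2, 2, 3, 3, 3, 4],
    'qual': [10, 20, 10, 5, 10, 7, 7, 3, 10],
    'nm':   [0, 0, 0, 0, 1, 1, 0, 2, 0],
})

# Load the frame into an in-memory SQLite table (name "t" is arbitrary).
conn = sqlite3.connect(':memory:')
df.to_sql('t', conn, index=False)

# Number the rows within each id, preferred row first
# (smallest nm, then largest qual), and keep only row number 1.
query = """
SELECT id, qual, nm
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY nm, qual DESC) AS rn
    FROM t
) AS ranked
WHERE rn = 1
ORDER BY id
"""
result = pd.read_sql_query(query, conn)
conn.close()
print(result)
```

The query uses only ROW_NUMBER with PARTITION BY and ORDER BY, so the same text should be portable to other engines that support window functions.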
