How do I drop duplicates based on specific columns with pandas?

Question

I currently import the following data frame from Excel into pandas, and I want to drop duplicate rows based on the values of two columns.

# Python 3.5.2
# Pandas library version 0.22

import pandas as pd 

# Save the Excel workbook in a variable
current_workbook  = pd.ExcelFile('C:\\Users\\userX\\Desktop\\cost_values.xlsx')

# convert the workbook to a data frame
current_worksheet = pd.read_excel(current_workbook, index_col = 'vend_num') 

# current output
print(current_worksheet)


| vend_number | vend_name            | quantity  | source  |
| ----------- | -------------------- | --------- | ------- |
| CHARLS      | Charlie & Associates | $5,700.00 | Central |
| CHARLS      | Charlie & Associates | $5,700.00 | South   |
| CHARLS      | Charlie & Associates | $5,700.00 | North   |
| CHARLS      | Charlie & Associates | $5,700.00 | West    |
| HUGHES      | Hughinos             | $3,800.00 | Central |
| HUGHES      | Hughinos             | $3,800.00 | South   |
| FERNAS      | Fernanda Industries  | $3,500.00 | South   |
| FERNAS      | Fernanda Industries  | $3,500.00 | North   |
| FERNAS      | Fernanda Industries  | $3,000.00 | West    |
    ....

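What I want is to drop duplicates based on the quantity and source columns.

Looking at the quantity and source column values:

1.1. If a vendor's quantity is equal in another row of the same vendor and one of those rows has source Central, drop the duplicate rows for that vendor and keep only the Central row.
1.2. Otherwise, if a vendor's quantity is equal in another row of the same vendor and none of those rows has source Central, drop the duplicate rows.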

Desired result

| vend_number | vend_name            | quantity  | source  |
| ----------- | -------------------- | --------- | ------- |
| CHARLS      | Charlie & Associates | $5,700.00 | Central |
| HUGHES      | Hughinos             | $3,800.00 | Central |
| FERNAS      | Fernanda Industries  | $3,500.00 | South   |
| FERNAS      | Fernanda Industries  | $3,000.00 | West    |
    ....

So far I have tried the following code, but pandas does not detect any duplicate rows at all.

print(current_worksheet.loc[current_worksheet.duplicated()])
print(current_worksheet.duplicated())

I have tried to work out a solution but am struggling with this problem, so any help is much appreciated. Feel free to improve the question.
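A likely reason the check above finds nothing is that `DataFrame.duplicated()` compares every column by default, and in the data shown the source value differs on each otherwise repeated row. The sketch below uses a small hypothetical sample modelled on the table in the question (the frame name `df` and the shortened row set are assumptions) to show how the `subset` argument changes what counts as a duplicate.

```python
import pandas as pd

# Hypothetical sample modelled on the rows shown in the question.
df = pd.DataFrame({
    'vend_number': ['CHARLS', 'CHARLS', 'HUGHES', 'HUGHES',
                    'FERNAS', 'FERNAS', 'FERNAS'],
    'quantity': ['$5,700.00', '$5,700.00', '$3,800.00', '$3,800.00',
                 '$3,500.00', '$3,500.00', '$3,000.00'],
    'source': ['Central', 'South', 'Central', 'South',
               'South', 'North', 'West'],
})

# All columns are compared by default; 'source' differs on every
# repeated row, so no row is flagged.
all_columns = df.duplicated()

# Restricting the comparison to selected columns flags the repeats.
by_subset = df.duplicated(subset=['vend_number', 'quantity'])

print(all_columns.any())
print(by_subset)
```

With `subset`, the first occurrence of each (vend_number, quantity) pair is treated as the original and later ones as duplicates, which is why row order matters for the answers that follow.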

Answer 1

Here is one way.

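df['CentralFlag'] = (df['source'] == 'Central')

# A stable sort (mergesort) keeps the original row order within each flag value
df = df.sort_values('CentralFlag', ascending=False, kind='mergesort')\
       .drop_duplicates(['vend_name', 'quantity'])\
       .drop(columns='CentralFlag')  # drop('CentralFlag', 1) fails on pandas 2.x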

#   vend_number           vend_name   quantity   source
# 0      CHARLS  Charlie&Associates  $5,700.00  Central
# 4      HUGHES            Hughinos  $3,800.00  Central
# 6      FERNAS  FernandaIndustries  $3,500.00    South
# 8      FERNAS  FernandaIndustries  $3,000.00     West

Explanation

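Create a flag column and sort by it in descending order, so that Central rows come first.
Drop duplicates on vend_name and quantity, which keeps the first (Central, if present) row of each group, then drop the flag column.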
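The steps above can be tried end to end with the sketch below. It is a minimal illustration rather than the answerer's exact code: the helper name `drop_duplicates_preferring` and its parameters are hypothetical, the sample frame is rebuilt by hand from the question's table, and a stable sort is requested explicitly so that ties keep their original order.

```python
import pandas as pd

# Hypothetical sample rebuilt from the table in the question.
df = pd.DataFrame({
    'vend_number': ['CHARLS'] * 4 + ['HUGHES'] * 2 + ['FERNAS'] * 3,
    'vend_name': ['Charlie & Associates'] * 4 + ['Hughinos'] * 2
                 + ['Fernanda Industries'] * 3,
    'quantity': ['$5,700.00'] * 4 + ['$3,800.00'] * 2
                + ['$3,500.00', '$3,500.00', '$3,000.00'],
    'source': ['Central', 'South', 'North', 'West', 'Central', 'South',
               'South', 'North', 'West'],
})


def drop_duplicates_preferring(frame, keys, column, preferred):
    """Hypothetical helper: keep one row per `keys` group, choosing the
    row where `column == preferred` when the group has one."""
    flagged = frame.assign(_flag=frame[column] == preferred)
    return (
        flagged
        # Stable sort: preferred rows first, original order kept among ties.
        .sort_values('_flag', ascending=False, kind='mergesort')
        .drop_duplicates(keys)
        .drop(columns='_flag')
        .sort_index()  # restore the original row order
    )


result = drop_duplicates_preferring(
    df, ['vend_number', 'quantity'], 'source', 'Central')
print(result)
```

Using `assign` leaves the caller's frame unmodified, and the final `sort_index` is optional; it only puts the surviving rows back in their original order.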

Answer 2

You can do it in two steps.

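# Step 1: keep every row whose source is Central
s = df.loc[df['source'] == 'Central', :]
# Step 2: for vendors without a Central row, keep the first row per quantity
t = df.loc[~df['vend_number'].isin(s['vend_number']), :]

result = pd.concat([s, t.drop_duplicates(['vend_number', 'quantity'], keep='first')])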
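To check the two-step approach against the question's data, here is a self-contained sketch. The sample frame is rebuilt by hand from the question's table, and the names `central`, `others` and `result` are assumptions chosen for readability.

```python
import pandas as pd

# Hypothetical sample rebuilt from the table in the question.
df = pd.DataFrame({
    'vend_number': ['CHARLS'] * 4 + ['HUGHES'] * 2 + ['FERNAS'] * 3,
    'quantity': ['$5,700.00'] * 4 + ['$3,800.00'] * 2
                + ['$3,500.00', '$3,500.00', '$3,000.00'],
    'source': ['Central', 'South', 'North', 'West', 'Central', 'South',
               'South', 'North', 'West'],
})

# Step 1: every Central row is kept as-is.
central = df.loc[df['source'] == 'Central']

# Step 2: rows of vendors that have no Central row at all.
others = df.loc[~df['vend_number'].isin(central['vend_number'])]

# Deduplicate only the second group, then stack the two pieces.
result = pd.concat(
    [central, others.drop_duplicates(['vend_number', 'quantity'], keep='first')]
)
print(result)
```

One design consequence worth noting: a vendor that has a Central row keeps only its Central rows under this approach, so any other row of that vendor is dropped even when its quantity differs. On the sample data this gives the same rows as the flag-and-sort approach, but the two can diverge on other inputs.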
