How can I remove duplicate rows based on specific columns with pandas?

Question (votes: 1, answers: 2)

I have imported the following data frame from Excel into pandas, and I want to remove duplicate rows based on the values in two columns.

# Python 3.5.2
# Pandas library version 0.22

import pandas as pd 

# Save the Excel workbook in a variable
current_workbook  = pd.ExcelFile('C:\\Users\\userX\\Desktop\\cost_values.xlsx')

# convert the workbook to a data frame
current_worksheet = pd.read_excel(current_workbook, index_col = 'vend_num') 

# current output
print(current_worksheet)


| vend_number | vend_name            | quantity  | source  |
| ----------- | -------------------- | --------- | ------- |
| CHARLS      | Charlie & Associates | $5,700.00 | Central |
| CHARLS      | Charlie & Associates | $5,700.00 | South   |
| CHARLS      | Charlie & Associates | $5,700.00 | North   |
| CHARLS      | Charlie & Associates | $5,700.00 | West    |
| HUGHES      | Hughinos             | $3,800.00 | Central |
| HUGHES      | Hughinos             | $3,800.00 | South   |
| FERNAS      | Fernanda Industries  | $3,500.00 | South   |
| FERNAS      | Fernanda Industries  | $3,500.00 | North   |
| FERNAS      | Fernanda Industries  | $3,000.00 | West    |
....

What I want is to remove duplicate rows based on the quantity and source columns:

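  1. Check the values in the quantity and source columns:
     1.1. If a vendor's quantity is equal in another row for the same vendor and the source is not Central, drop the repeated rows for that vendor, keeping only the Central row.
     1.2. Otherwise, if a vendor's quantity is equal in another row for the same vendor and there is no Central source, drop the repeated rows.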

Desired result

| vend_number | vend_name            | quantity  | source  |
| ----------- | -------------------- | --------- | ------- |
| CHARLS      | Charlie & Associates | $5,700.00 | Central |
| HUGHES      | Hughinos             | $3,800.00 | Central |
| FERNAS      | Fernanda Industries  | $3,500.00 | South   |
| FERNAS      | Fernanda Industries  | $3,000.00 | West    |
....

So far I have tried the following code, but pandas does not detect any duplicate rows at all.

print(current_worksheet.loc[current_worksheet.duplicated()])
print(current_worksheet.duplicated())
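
A likely reason nothing is flagged: `DataFrame.duplicated()` compares every column by default, and in the data above the source column differs from row to row. The sketch below uses a tiny hand-built frame (the name `df` and the two sample rows are assumptions for illustration, not the real workbook) to show how the `subset` argument changes the result.

```python
import pandas as pd

# Two illustrative rows modelled on the HUGHES rows in the question.
df = pd.DataFrame({
    'vend_name': ['Hughinos', 'Hughinos'],
    'quantity': ['$3,800.00', '$3,800.00'],
    'source': ['Central', 'South'],
})

# All columns are compared by default; 'source' differs, so nothing is flagged.
all_cols = df.duplicated()

# Restricting the comparison to selected columns flags the second row.
by_subset = df.duplicated(subset=['vend_name', 'quantity'])

print(all_cols.tolist())   # [False, False]
print(by_subset.tolist())  # [False, True]
```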

I have tried to work out a solution but am struggling with this problem, so any help is much appreciated. Feel free to improve the question.

python python-3.x pandas
2 answers
Votes: 1

Here is one way.

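# Flag rows whose source is Central
df['CentralFlag'] = (df['source'] == 'Central')

# Sort Central rows first, keep the first row per (vend_name, quantity), then drop the flag
df = df.sort_values('CentralFlag', ascending=False)\
       .drop_duplicates(['vend_name', 'quantity'])\
       .drop(columns='CentralFlag')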

#   vend_number           vend_name   quantity   source
# 0      CHARLS  Charlie&Associates  $5,700.00  Central
# 4      HUGHES            Hughinos  $3,800.00  Central
# 6      FERNAS  FernandaIndustries  $3,500.00    South
# 8      FERNAS  FernandaIndustries  $3,000.00     West

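Explanation

  • Create a flag column and sort by it in descending order, so that Central rows come first.
  • Drop duplicates on vend_name and quantity, which keeps the first row of each group, then drop the flag column.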
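
As a self-contained sketch of the steps above, the following rebuilds the sample rows from the question by hand (the real data would come from the Excel file) and applies the flag, sort and de-duplicate sequence. Two details are additions for illustration rather than part of the answer: `kind='mergesort'` makes the sort stable, so rows with the same flag keep their original relative order, and the final `sort_index()` restores the original row order.

```python
import pandas as pd

# Sample rows copied from the question; the real data comes from Excel.
df = pd.DataFrame({
    'vend_number': ['CHARLS'] * 4 + ['HUGHES'] * 2 + ['FERNAS'] * 3,
    'vend_name': (['Charlie & Associates'] * 4 + ['Hughinos'] * 2
                  + ['Fernanda Industries'] * 3),
    'quantity': (['$5,700.00'] * 4 + ['$3,800.00'] * 2
                 + ['$3,500.00', '$3,500.00', '$3,000.00']),
    'source': ['Central', 'South', 'North', 'West', 'Central', 'South',
               'South', 'North', 'West'],
})

# Flag Central rows so that a descending sort moves them to the top.
df['CentralFlag'] = df['source'] == 'Central'

result = (
    df.sort_values('CentralFlag', ascending=False, kind='mergesort')  # stable sort
      .drop_duplicates(['vend_name', 'quantity'])  # keep='first' is the default
      .drop(columns='CentralFlag')
      .sort_index()  # back to the original row order
)
print(result)
```

On this sample the rows that remain are those with index 0, 4, 6 and 8, matching the desired result in the question.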

Votes: 1

You can do this in two steps.

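# Step 1: keep every Central row
s = df.loc[df['source'] == 'Central', :]
# Step 2: take the vendors that have no Central row
t = df.loc[~df['vend_number'].isin(s['vend_number']), :]

# Combine the Central rows with the de-duplicated remainder
pd.concat([s, t.drop_duplicates(['vend_number', 'quantity'], keep='first')])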
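
For anyone who wants to try the two-step approach without the workbook, here is a minimal runnable sketch. The frame is rebuilt by hand from the sample rows in the question, and the names `central`, `no_central` and `result` are illustrative choices, not part of the answer.

```python
import pandas as pd

# Sample rows copied from the question; the real data comes from Excel.
df = pd.DataFrame({
    'vend_number': ['CHARLS'] * 4 + ['HUGHES'] * 2 + ['FERNAS'] * 3,
    'vend_name': (['Charlie & Associates'] * 4 + ['Hughinos'] * 2
                  + ['Fernanda Industries'] * 3),
    'quantity': (['$5,700.00'] * 4 + ['$3,800.00'] * 2
                 + ['$3,500.00', '$3,500.00', '$3,000.00']),
    'source': ['Central', 'South', 'North', 'West', 'Central', 'South',
               'South', 'North', 'West'],
})

# Step 1: every row whose source is Central.
central = df.loc[df['source'] == 'Central']

# Step 2: rows for vendors that have no Central row at all.
no_central = df.loc[~df['vend_number'].isin(central['vend_number'])]

# Combine, de-duplicating the second group on vendor and quantity.
result = pd.concat([
    central,
    no_central.drop_duplicates(['vend_number', 'quantity'], keep='first'),
])
print(result)
```

One consequence of this design is that a vendor with a Central row keeps only its Central rows: any non-Central row for that vendor is dropped even when its quantity differs. Whether that matches the intended rule depends on the data.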