Currently, I import the following data frame from Excel into pandas, and I want to remove duplicate rows based on the values of two columns.
# Python 3.5.2
# Pandas library version 0.22
import pandas as pd
# Save the Excel workbook in a variable
current_workbook = pd.ExcelFile('C:\\Users\\userX\\Desktop\\cost_values.xlsx')
# convert the workbook to a data frame
current_worksheet = pd.read_excel(current_workbook, index_col = 'vend_num')
# current output
print(current_worksheet)
| vend_number | vend_name | quantity | source |
| ----------- |----------------------- | -------- | -------- |
| CHARLS | Charlie & Associates | $5,700.00 | Central |
| CHARLS | Charlie & Associates | $5,700.00 | South |
| CHARLS | Charlie & Associates | $5,700.00 | North |
| CHARLS | Charlie & Associates | $5,700.00 | West |
| HUGHES | Hughinos | $3,800.00 | Central |
| HUGHES | Hughinos | $3,800.00 | South |
| FERNAS | Fernanda Industries | $3,500.00 | South |
| FERNAS | Fernanda Industries | $3,500.00 | North |
| FERNAS | Fernanda Industries | $3,000.00 | West |
....
What I want is to remove the duplicates based on the quantity and source columns:
Desired result:
| vend_number | vend_name | quantity | source |
| ----------- |----------------------- | -------- | -------- |
| CHARLS | Charlie & Associates | $5,700.00 | Central |
| HUGHES | Hughinos | $3,800.00 | Central |
| FERNAS | Fernanda Industries | $3,500.00 | South |
| FERNAS | Fernanda Industries | $3,000.00 | West |
....
So far I have tried the following code, but pandas does not detect any duplicate rows at all.
print(current_worksheet.loc[current_worksheet.duplicated()])
print(current_worksheet.duplicated())
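I have tried to work out a solution myself, but I am struggling with this problem, so any help is much appreciated. Feel free to improve this question.
Here is one way.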
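# Flag the Central rows so they can be sorted to the top.
df['CentralFlag'] = (df['source'] == 'Central')
# A stable sort (mergesort) keeps the original order among the other rows.
df = df.sort_values('CentralFlag', ascending=False, kind='mergesort')\
       .drop_duplicates(['vend_name', 'quantity'])\
       .drop('CentralFlag', axis=1)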
# vend_number vend_name quantity source
# 0 CHARLS Charlie&Associates $5,700.00 Central
# 4 HUGHES Hughinos $3,800.00 Central
# 6 FERNAS FernandaIndustries $3,500.00 South
# 8 FERNAS FernandaIndustries $3,000.00 West
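To try the snippet above without the Excel file, here is a minimal self-contained sketch. The frame is typed in by hand from the rows shown in the question, so the data and the variable names `no_subset`, `with_subset` and `result` are assumptions for illustration only. It also shows why the plain `duplicated()` call in the question flags nothing: with no `subset`, a row counts as a duplicate only when every column matches, and `source` differs on every repeated row.

```python
import pandas as pd

# Sample rows copied by hand from the question; the real data comes from Excel.
df = pd.DataFrame({
    'vend_number': ['CHARLS'] * 4 + ['HUGHES'] * 2 + ['FERNAS'] * 3,
    'vend_name': (['Charlie & Associates'] * 4 + ['Hughinos'] * 2
                  + ['Fernanda Industries'] * 3),
    'quantity': (['$5,700.00'] * 4 + ['$3,800.00'] * 2
                 + ['$3,500.00', '$3,500.00', '$3,000.00']),
    'source': ['Central', 'South', 'North', 'West', 'Central', 'South',
               'South', 'North', 'West'],
})

# With no subset, every column must match, so no row is flagged.
no_subset = df.duplicated()

# Restricting the comparison to the key columns does flag the repeats.
with_subset = df.duplicated(subset=['vend_name', 'quantity'])

# Flag approach: Central rows first (stable sort), then keep the first
# row for each (vend_name, quantity) pair and remove the helper column.
result = (
    df.assign(CentralFlag=df['source'].eq('Central'))
      .sort_values('CentralFlag', ascending=False, kind='mergesort')
      .drop_duplicates(['vend_name', 'quantity'])
      .drop('CentralFlag', axis=1)
)
print(result)
```

Using `assign` here instead of writing the flag into `df` leaves the original frame untouched, which is convenient when comparing several approaches on the same data.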
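Explanation
Create a flag column that marks the rows whose source is Central, and sort by it in descending order so those rows come first. Drop duplicates on vend_name and quantity, which keeps the first row of each group, then drop the flag column.

You can also do it in two steps: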
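# Step 1: keep every row whose source is Central.
s = df.loc[df['source'] == 'Central', :]
# Step 2: rows for vendors that have no Central row at all.
t = df.loc[~df['vend_number'].isin(s['vend_number']), :]
# Combine, keeping the first row per vendor and quantity among the rest.
pd.concat([s, t.drop_duplicates(['vend_number', 'quantity'], keep='first')])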
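The two-step version can be checked the same way. The sketch below wraps it in a hypothetical helper, `prefer_central` (an illustrative name, not part of pandas), and runs it on a hand-built copy of the question's rows. One caveat, as far as I can tell: a vendor with any Central row is left out of the second step entirely, so a vendor that has a Central row for one quantity and only non-Central rows for another would lose the latter. The sample data has no such case.

```python
import pandas as pd


def prefer_central(df):
    """Keep Central rows; for vendors without one, keep the first row per quantity."""
    # Step 1: every row whose source is Central.
    central = df.loc[df['source'] == 'Central']
    # Step 2: rows of vendors that never appear in the Central subset.
    others = df.loc[~df['vend_number'].isin(central['vend_number'])]
    others = others.drop_duplicates(['vend_number', 'quantity'], keep='first')
    return pd.concat([central, others])


# Sample rows copied by hand from the question (an assumption for illustration).
df = pd.DataFrame({
    'vend_number': ['CHARLS'] * 4 + ['HUGHES'] * 2 + ['FERNAS'] * 3,
    'vend_name': (['Charlie & Associates'] * 4 + ['Hughinos'] * 2
                  + ['Fernanda Industries'] * 3),
    'quantity': (['$5,700.00'] * 4 + ['$3,800.00'] * 2
                 + ['$3,500.00', '$3,500.00', '$3,000.00']),
    'source': ['Central', 'South', 'North', 'West', 'Central', 'South',
               'South', 'North', 'West'],
})

result = prefer_central(df)
print(result)
```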