I have a CSV file that contains several duplicate vendor_name entries, but with different amt values across 2015-2017.
Here is my code.
df = pd.read_csv('government-procurement-via-gebiz.csv', parse_dates=['award_date'],
infer_datetime_format=True, usecols=['supplier_name', 'award_date', 'awarded_amt'],)
df = df[(df['supplier_name'] != 'na') & (df['award_date'].dt.year == 2015)].reset_index(drop=True)
d1 = df.set_index('supplier_name').to_dict()['awarded_amt']
top5D1 = dict(sorted(d1.iteritems(), key=operator.itemgetter(1), reverse=True)[:5])
print top5D1
The output is:
{'KAJIMA OVERSEAS ASIA PTE LTD': 595800000.0, 'SAMSUNG C&T CORPORATION': 555322063.0, 'GS Engineering & Construction Corp.': 428301000.0, 'HYUNDAI ENGINEERING & CONSTRUCTION CO. LTD': 601726000.0, 'THE GO-AHEAD GROUP PLC': 497738104.0}
I checked the CSV file, and the correct result should be this:
supplier_name award_date awarded_amt
1 SANTARLI CONSTRUCTION PTE. LTD. 2015-01-07 1.030000e+09
2 HYUNDAI ENGINEERING & CONSTRUCTION CO. LTD 2015-08-04 6.017260e+08
3 KAJIMA OVERSEAS ASIA PTE LTD 2015-02-03 5.958000e+08
4 SAMSUNG C&T CORPORATION 2015-11-20 5.553221e+08
5 THE GO-AHEAD GROUP PLC 2015-11-23 4.977381e+08
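Looking at the CSV file, I found that the supplier_name "SANTARLI CONSTRUCTION PTE. LTD." appears twice: one entry has the lowest amount and the other has the highest.
How can I output the highest entry for "SANTARLI CONSTRUCTION PTE. LTD."?
The CSV data looks like this: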
1/7/2015 SANTARLI CONSTRUCTION PTE. LTD. 1030000000
8/4/2015 HYUNDAI ENGINEERING & CONSTRUCTION CO. LTD 601726000
2/3/2015 KAJIMA OVERSEAS ASIA PTE LTD 595800000
11/20/2015 SAMSUNG C&T CORPORATION 555322063
11/23/2015 THE GO-AHEAD GROUP PLC 497738104
6/19/2015 GS Engineering & Construction Corp. 428301000
6/25/2015 TIONG SENG CONTRACTORS (PRIVATE) LIMITED 277265946
2/27/2015 CHIP ENG SENG CONTRACTORS (1988) PTE LTD 258000000
11/18/2015 TEAMBUILD ENGINEERING & CONSTRUCTION PTE. LTD. 236800000
2/23/2015 NCS PTE. LTD. 223028240
11/11/2015 HSL Constructor Pte Ltd 217354000
7/31/2015 HI-TEK CONSTRUCTION PTE LTD 215000000
6/22/2015 HWA SENG BUILDER PTE LTD 189339600
3/19/2015 EXPAND CONSTRUCTION PTE LTD 189000000
11/30/2015 CNQC ENGINEERING & CONSTRUCTION PTE. LTD. 163980000
9/7/2015 Master Contract Services Pte Ltd 163000000
3/5/2015 Yongnam Engineering & Construction Pte Ltd 159000000
5/19/2015 SANTARLI CONSTRUCTION PTE. LTD. 148800000
So I commented out drop_duplicates and removed to_dict().
Here is the new code:
df = pd.read_csv('government-procurement-via-gebiz.csv', parse_dates=['award_date'],
infer_datetime_format=True, usecols=['supplier_name', 'award_date', 'awarded_amt'],)
df = df[(df['supplier_name'] != 'na') & (df['award_date'].dt.year ==
2016)].reset_index(drop=True)
# df = df.drop_duplicates(subset=['supplier_name'])
df = df.sort_values('awarded_amt', ascending=False).nlargest(5,'awarded_amt')
d1 = df.set_index('supplier_name')['awarded_amt']
print d1
The output is:
supplier_name
GS ENGINEERING & CONSTRUCTION CORP. 1.988000e+09
SAMSUNG C&T CORPORATION 8.336120e+08
PENTA-OCEAN CONSTRUCTION CO LTD 6.744177e+08
SAMSUNG C&T CORPORATION 4.509105e+08
KTC CIVIL ENGINEERING & CONSTRUCTION PTE LTD 4.175000e+08
But I want the output in dictionary format.
What should I do?
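The problem is this: when you build the dictionary with to_dict, it first stores the "SANTARLI" row you want under that key. It then keeps going, finds the second "SANTARLI" row, and uses the same key again, so the value stored for the first row is overwritten.
Dictionary keys must be unique, so you need to remove the redundant rows from the data first. See below...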
import pandas as pd
#df = pd.read_csv('government-procurement-via-gebiz.csv', parse_dates=['award_date'], infer_datetime_format=True, usecols=['supplier_name', 'award_date', 'awarded_amt'],)
# I created the df from the data supplied in the question.
# `data` is a list of [award_date, supplier_name, awarded_amt] rows.
df = pd.DataFrame(data, columns=['award_date', 'supplier_name', 'awarded_amt'])
df['award_date'] = pd.to_datetime(df['award_date'])
# Make sure the amounts are numeric, so sorting compares values rather than strings
df['awarded_amt'] = pd.to_numeric(df['awarded_amt'])
print(df)
# Select by date (your original code)
df = df[(df['supplier_name'] != 'na') & (df['award_date'].dt.year == 2015)].reset_index(drop=True)
# Sort by column 'awarded_amt' in descending order.
# This will leave the duplicates like 'SANTARLI', but put the one with the highest
# value in 'awarded_amt' first
df = df.sort_values('awarded_amt', ascending=False)
# Drop the duplicates. This has a parameter "keep" which defaults to "first".
# Thus, it will keep the first instance of 'SANTARLI',
# which will also be the greatest 'awarded_amt'
df = df.drop_duplicates(subset=['supplier_name'])
# Now create your dict
d1 = df.set_index('supplier_name').to_dict()['awarded_amt']
print(d1)
Output (from the original run, in which the amounts were still strings, hence the quoted values):
award_date supplier_name awarded_amt
0 2015-01-07 SANTARLI CONSTRUCTION PTE. LTD. 1030000000
1 2014-08-04 HYUNDAI ENGINEERING & CONSTRUCTION CO. LTD 601726000
2 2014-02-03 KAJIMA OVERSEAS ASIA PTE LTD 595800000
3 2015-11-20 SAMSUNG C&T CORPORATION 555322063
4 2015-11-23 THE GO-AHEAD GROUP PLC 497738104
5 2015-06-19 GS Engineering & Construction Corp. 428301000
6 2015-09-07 Master Contract Services Pte Ltd 163000000
7 2015-03-05 Yongnam Engineering & Construction Pte Ltd 159000000
8 2015-12-30 NANJING DADI CONSTRUCTION (GROUP) CO., LTD. SI... 152600000
9 2015-05-19 SANTARLI CONSTRUCTION PTE. LTD. 148800000
{'SANTARLI CONSTRUCTION PTE. LTD.': '1030000000', 'NANJING DADI CONSTRUCTION (GROUP) CO., LTD. SINGAPORE BRANCH': '152600000', 'Yongnam Engineering & Construction Pte Ltd': '159000000', 'Master Contract Services Pte Ltd': '163000000', 'GS Engineering & Construction Corp.': '428301000', 'THE GO-AHEAD GROUP PLC': '497738104', 'SAMSUNG C&T CORPORATION': '555322063'}
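As a compact illustration of both the overwrite and one way around it, here is a minimal sketch. It assumes the rows from the question's CSV excerpt (only a subset is copied here), and the names `sample`, `naive` and `top5` are made up for this example. It swaps the sort-and-drop_duplicates step for `groupby(...).max()`, which keeps the largest award per supplier, and then uses `nlargest(5)` to pick the top five before converting to a dict.

```python
import pandas as pd

# Sample rows copied from the question's CSV excerpt (2015 only).
rows = [
    ("1/7/2015", "SANTARLI CONSTRUCTION PTE. LTD.", 1030000000),
    ("8/4/2015", "HYUNDAI ENGINEERING & CONSTRUCTION CO. LTD", 601726000),
    ("2/3/2015", "KAJIMA OVERSEAS ASIA PTE LTD", 595800000),
    ("11/20/2015", "SAMSUNG C&T CORPORATION", 555322063),
    ("11/23/2015", "THE GO-AHEAD GROUP PLC", 497738104),
    ("6/19/2015", "GS Engineering & Construction Corp.", 428301000),
    ("5/19/2015", "SANTARLI CONSTRUCTION PTE. LTD.", 148800000),
]
sample = pd.DataFrame(rows, columns=["award_date", "supplier_name", "awarded_amt"])

# Plain to_dict: the later SANTARLI row silently overwrites the earlier one.
naive = sample.set_index("supplier_name")["awarded_amt"].to_dict()

# groupby + max keeps the largest award per supplier; nlargest picks the top 5.
top5 = sample.groupby("supplier_name")["awarded_amt"].max().nlargest(5).to_dict()

print(naive["SANTARLI CONSTRUCTION PTE. LTD."])
print(top5)
```

With this approach the dict has one key per supplier, so it only fits the case where each company should appear once with its largest award.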
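Edit: If you only want the top 5 per year based on "awarded_amt" (that is, the five largest "awarded_amt" values, whether they belong to five different companies or to the same company), then there is no need to drop duplicates at all.
Just sort the whole DataFrame by "awarded_amt" and take the top 5 (perhaps with df.head(5)), but do not use to_dict() with the company name as the key, because a dict cannot hold two (or more) identical company names.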
import pandas as pd
import sys
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
data = [["1/7/2015", "SANTARLI CONSTRUCTION PTE. LTD.", 1030000000],
["8/4/2015", "HYUNDAI ENGINEERING & CONSTRUCTION CO. LTD", 601726000],
["2/3/2015", "KAJIMA OVERSEAS ASIA PTE LTD", 595800000],
["11/20/2015","SAMSUNG C&T CORPORATION", 555322063],
["11/23/2015" ,"THE GO-AHEAD GROUP PLC", 497738104],
["6/19/2015" ,"GS Engineering & Construction Corp.", 428301000],
["6/25/2015" ,"TIONG SENG CONTRACTORS (PRIVATE) LIMITED", 277265946],
["5/19/2015" ,"SANTARLI CONSTRUCTION PTE. LTD." , 649800000],
["5/19/2016" ,"SANTARLI CONSTRUCTION PTE. LTD." , 650800000],
["5/19/2016" ,"SANTARLI CONSTRUCTION PTE. LTD." , 651800000],
["11/20/2016","SAMSUNG C&T CORPORATION", 555322063],
["11/23/2016" ,"THE GO-AHEAD GROUP PLC", 497738104],
["6/19/2016" ,"GS Engineering & Construction Corp.", 428301000]
]
df = pd.DataFrame(data, columns = ['award_date', 'supplier_name', 'awarded_amt'])
df['award_date'] = pd.to_datetime(df['award_date'])
# Separate df by years and keep the top 5 rows of each year
top_rows = []
years = [2015, 2016]
for year in years:
    temp_df = df[(df['supplier_name'] != 'na') & (df['award_date'].dt.year == year)].reset_index(drop=True)
    # Sort by column 'awarded_amt'.
    # This will leave the duplicates like 'SANTARLI', but put the one with the highest
    # value in 'awarded_amt' first
    temp_df = temp_df.sort_values('awarded_amt', ascending=False)
    top_rows.append(temp_df.iloc[:5])
finaldf = pd.concat(top_rows)
print(finaldf)
Output:
award_date supplier_name awarded_amt
0 2015-01-07 SANTARLI CONSTRUCTION PTE. LTD. 1030000000
7 2015-05-19 SANTARLI CONSTRUCTION PTE. LTD. 649800000
1 2015-08-04 HYUNDAI ENGINEERING & CONSTRUCTION CO. LTD 601726000
2 2015-02-03 KAJIMA OVERSEAS ASIA PTE LTD 595800000
3 2015-11-20 SAMSUNG C&T CORPORATION 555322063
1 2016-05-19 SANTARLI CONSTRUCTION PTE. LTD. 651800000
0 2016-05-19 SANTARLI CONSTRUCTION PTE. LTD. 650800000
2 2016-11-20 SAMSUNG C&T CORPORATION 555322063
3 2016-11-23 THE GO-AHEAD GROUP PLC 497738104
4 2016-06-19 GS Engineering & Construction Corp. 428301000
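The same per-year selection can also be written without an explicit loop. The following is only a sketch under a few assumptions: it uses a small subset of the sample rows, adds a helper column named `year` that is not in the original data, and uses n = 2 instead of 5 so the effect is visible on such a small sample. It sorts once, takes the first n rows of each year with `groupby(...).head(n)`, and exports a list of records, which keeps repeated supplier names intact.

```python
import pandas as pd

rows = [
    ("1/7/2015", "SANTARLI CONSTRUCTION PTE. LTD.", 1030000000),
    ("8/4/2015", "HYUNDAI ENGINEERING & CONSTRUCTION CO. LTD", 601726000),
    ("5/19/2015", "SANTARLI CONSTRUCTION PTE. LTD.", 649800000),
    ("5/19/2016", "SANTARLI CONSTRUCTION PTE. LTD.", 650800000),
    ("5/19/2016", "SANTARLI CONSTRUCTION PTE. LTD.", 651800000),
    ("11/20/2016", "SAMSUNG C&T CORPORATION", 555322063),
]
awards = pd.DataFrame(rows, columns=["award_date", "supplier_name", "awarded_amt"])
awards["award_date"] = pd.to_datetime(awards["award_date"], format="%m/%d/%Y")
awards["year"] = awards["award_date"].dt.year  # helper column for grouping

n = 2  # small n so the effect is visible on this tiny sample
top_per_year = (
    awards.sort_values("awarded_amt", ascending=False)
    .groupby("year")
    .head(n)  # first n rows of each year, i.e. the n largest amounts
    .sort_values(["year", "awarded_amt"], ascending=[True, False])
)

# A list of records keeps duplicate supplier names, unlike a name-keyed dict.
records = top_per_year[["year", "supplier_name", "awarded_amt"]].to_dict(orient="records")
print(records)
```

Each element of `records` is a small dict for one award, so the same company can appear several times in the same year.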
Edit:
To convert finaldf to a dictionary, I suggest the following. It creates a nested, JSON-like dictionary. You can also use Python's json module.
final_dict = {}
for _, row in finaldf.iterrows():
    award_date = row['award_date']
    supplier_name = row['supplier_name']
    awarded_amt = row['awarded_amt']
    if supplier_name not in final_dict:
        final_dict[supplier_name] = {}
    # A second award on the same date for the same supplier overwrites the first
    final_dict[supplier_name][award_date] = awarded_amt
print(final_dict)
Output:
{
'SANTARLI CONSTRUCTION PTE. LTD.': {
Timestamp('2015-01-07 00:00:00'): 1030000000,
Timestamp('2015-05-19 00:00:00'): 649800000,
Timestamp('2016-05-19 00:00:00'): 650800000
},
'HYUNDAI ENGINEERING & CONSTRUCTION CO. LTD': {
Timestamp('2015-08-04 00:00:00'): 601726000
},
'KAJIMA OVERSEAS ASIA PTE LTD': {
Timestamp('2015-02-03 00:00:00'): 595800000
},
'SAMSUNG C&T CORPORATION': {
Timestamp('2015-11-20 00:00:00'): 555322063,
Timestamp('2016-11-20 00:00:00'): 555322063
},
'THE GO-AHEAD GROUP PLC': {
Timestamp('2016-11-23 00:00:00'): 497738104
},
'GS Engineering & Construction Corp.': {
Timestamp('2016-06-19 00:00:00'): 428301000
}
}
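If the nested dictionary is meant to go through the json module, note that `json.dumps` raises a TypeError for Timestamp keys, so the dates have to become strings first. Below is a minimal sketch of one way to do that, assuming three of the sample rows and an ISO-style date format chosen for this example. The names `sample_df`, `nested` and `json_text` are hypothetical.

```python
import json
import pandas as pd

rows = [
    ("1/7/2015", "SANTARLI CONSTRUCTION PTE. LTD.", 1030000000),
    ("5/19/2015", "SANTARLI CONSTRUCTION PTE. LTD.", 649800000),
    ("11/20/2015", "SAMSUNG C&T CORPORATION", 555322063),
]
sample_df = pd.DataFrame(rows, columns=["award_date", "supplier_name", "awarded_amt"])
sample_df["award_date"] = pd.to_datetime(sample_df["award_date"], format="%m/%d/%Y")

nested = {}
for row in sample_df.itertuples(index=False):
    # Turn the Timestamp into a plain string so it is a valid JSON key.
    date_key = row.award_date.strftime("%Y-%m-%d")
    # setdefault creates the inner dict the first time a supplier is seen.
    nested.setdefault(row.supplier_name, {})[date_key] = int(row.awarded_amt)

json_text = json.dumps(nested, indent=2)
print(json_text)
```

As with the loop above, two awards to the same supplier on the same date would still collide on the date key, so a list of amounts per date may be a better fit for such data.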