Group by region and date using pandas

Question · Votes: 0 · Answers: 2

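Hello,

I have a CSV file (please see the sample input data below). I need a data frame that contains, for each date, the totals for the "Amazon Elastic Compute Cloud" service running in each "AvailabilityZone" group.

Something like this: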

| UsageStartDate | AvailabilityZone | Sum of products used | Total cost for each |

[6/1/16, ap-northeast-1a, Amazon Elastic Compute Cloud = 6, 15$]
[6/2/16, ap-southeast-2,  Amazon Elastic Compute Cloud = 3, 12$]

This is what I tried with pandas:

import pandas as pd

funk = pd.read_csv('/tmp/temp.csv')
funk.sort_values('UsageStartDate')
k = funk['AvailabilityZone'][funk['ProductName'] == 'Amazon Elastic Compute Cloud'].sum()
print(k)

Any help with this? I am still learning pandas.

Here is the data:

    ProductName               AvailabilityZone  UsageStartDate  BlendedCost
0   Amazon Simple Queue Service                   6/1/16 0:00       0
1   Alexa Web Information Service                 6/1/16 0:00       0.00347032
2   Amazon DynamoDB        ap-southeast-2          6/1/16 0:00      0
3   Amazon DynamoDB        ap-southeast-2          6/1/16 0:00      0
4   Amazon Elastic Compute Cloud    ap-northeast-1a 6/1/16 0:00     0.1
5   Amazon Elastic Compute Cloud    ap-northeast-1a 6/1/16 0:00     0.02
6   Amazon Elastic Compute Cloud                     6/1/16 0:00    0
7   Amazon Elastic Compute Cloud                     6/1/16 0:00    0
8   Amazon Elastic Compute Cloud                     6/1/16 0:00    4.70E-06
9   Amazon Elastic Compute Cloud                     6/1/16 0:00    8.00E-08
10  Amazon Elastic Compute Cloud                     6/1/16 0:00    0.00133333
11  Amazon Elastic Compute Cloud                     6/1/16 0:00    0.005
12  Amazon Elastic Compute Cloud    ap-southeast-1a 6/1/16 0:00     0.02
13  Amazon Elastic Compute Cloud    ap-southeast-1a 6/1/16 0:00     0.02
14  Amazon Elastic Compute Cloud    ap-southeast-1b 6/1/16 0:00     0.02
15  Amazon Elastic Compute Cloud                    6/1/16 0:00     0
python csv pandas dataframe grouping
2 Answers

2 votes

I think you need groupby with aggregate: the group size, computed with len on column AvailabilityZone, and the sum of column BlendedCost:

print (df.groupby(['UsageStartDate', 'AvailabilityZone', 'ProductName'])
         .agg({'AvailabilityZone':len,
               'BlendedCost':sum}))

Sample:

import pandas as pd

raw_data = {
    'ProductName': ['ASQS', 'AWIS', 'AWIS', 'AECC', 'AECC'], 
    'UsageStartDate': ['6/1/16','6/1/16','6/1/16','6/1/16','6/1/16'],
    'AvailabilityZone':['ap-northeast-1a','ap-northeast-1a','ap-northeast-1a','ap-southeast-2','ap-southeast-2'],
    'BlendedCost':[1,2,3,4,5]}
df = pd.DataFrame(raw_data)
print (df)
  AvailabilityZone  BlendedCost ProductName UsageStartDate
0  ap-northeast-1a            1        ASQS         6/1/16
1  ap-northeast-1a            2        AWIS         6/1/16
2  ap-northeast-1a            3        AWIS         6/1/16
3   ap-southeast-2            4        AECC         6/1/16
4   ap-southeast-2            5        AECC         6/1/16

print (df.groupby(['UsageStartDate', 'AvailabilityZone', 'ProductName'])
         .agg({'AvailabilityZone':len,'BlendedCost':sum})
         .rename(columns={'AvailabilityZone':'Sum of products used', 'BlendedCost':'Total'})
         .reset_index())

  UsageStartDate AvailabilityZone ProductName  Sum of products used  Total
0         6/1/16  ap-northeast-1a        ASQS                     1      1
1         6/1/16  ap-northeast-1a        AWIS                     2      5
2         6/1/16   ap-southeast-2        AECC                     2      9
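
On recent pandas releases the dictionary form above may raise an error, because a grouping column such as AvailabilityZone is no longer offered as a column to aggregate. A minimal sketch of the same result using named aggregation (available since pandas 0.25) follows. Counting rows through 'size' on BlendedCost is my substitution for len on AvailabilityZone, not part of the original answer:

```python
import pandas as pd

# Same sample frame as above.
df = pd.DataFrame({
    'ProductName': ['ASQS', 'AWIS', 'AWIS', 'AECC', 'AECC'],
    'UsageStartDate': ['6/1/16'] * 5,
    'AvailabilityZone': ['ap-northeast-1a'] * 3 + ['ap-southeast-2'] * 2,
    'BlendedCost': [1, 2, 3, 4, 5],
})

# Named aggregation: each output column is (input column, function).
# 'size' counts the rows in each group, which is what len did before.
summary = (
    df.groupby(['UsageStartDate', 'AvailabilityZone', 'ProductName'])
      .agg(**{
          'Sum of products used': ('BlendedCost', 'size'),
          'Total': ('BlendedCost', 'sum'),
      })
      .reset_index()
)
print(summary)
```

The output columns are named directly in agg, so the separate rename step is not needed.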

Solution with the sample data:

import pandas as pd
import io

temp=u"""ProductName;AvailabilityZone;UsageStartDate;BlendedCost
Amazon Simple Queue Service;;6/1/16 0:00;0
Alexa Web Information Service;;6/1/16 0:00;0.00347032
Amazon DynamoDB;ap-southeast-2;6/1/16 0:00;0
Amazon DynamoDB;ap-southeast-2;6/1/16 0:00;0
Amazon Elastic Compute Cloud;ap-northeast-1a;6/1/16 0:00;0.1
Amazon Elastic Compute Cloud;ap-northeast-1a;6/1/16 0:00;0.02
Amazon Elastic Compute Cloud;;6/1/16 0:00;0
Amazon Elastic Compute Cloud;;6/1/16 0:00;0
Amazon Elastic Compute Cloud;;6/1/16 0:00;4.70E-06
Amazon Elastic Compute Cloud;;6/1/16 0:00;8.00E-08
Amazon Elastic Compute Cloud;;6/1/16 0:00;0.00133333
Amazon Elastic Compute Cloud;;6/1/16 0:00;0.005
Amazon Elastic Compute Cloud;ap-southeast-1a;6/1/16 0:00;0.02
Amazon Elastic Compute Cloud;ap-southeast-1a;6/1/16 0:00;0.02
Amazon Elastic Compute Cloud;ap-southeast-1b;6/1/16 0:00;0.02
Amazon Elastic Compute Cloud;;6/1/16 0:00;0"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep=";", index_col=None)

#print (df)
print (df.groupby(['UsageStartDate', 'AvailabilityZone', 'ProductName'])
         .agg({'AvailabilityZone':len,'BlendedCost':sum})
         .rename(columns={'AvailabilityZone':'Sum of products used', 'BlendedCost':'Total'})
         .reset_index())

  UsageStartDate AvailabilityZone                   ProductName  \
0    6/1/16 0:00  ap-northeast-1a  Amazon Elastic Compute Cloud   
1    6/1/16 0:00  ap-southeast-1a  Amazon Elastic Compute Cloud   
2    6/1/16 0:00  ap-southeast-1b  Amazon Elastic Compute Cloud   
3    6/1/16 0:00   ap-southeast-2               Amazon DynamoDB   

   Sum of products used  Total  
0                     2   0.12  
1                     2   0.04  
2                     1   0.02  
3                     2   0.00  
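
Note that the result above has only four rows: groupby drops the rows whose AvailabilityZone is empty (NaN). The question also asks for the Elastic Compute Cloud rows only, grouped by date. The sketch below covers both points under a few assumptions of mine: the dates are month/day/year, empty zones should be kept under a placeholder label '(no zone)', and the output column names are illustrative. It uses a shortened excerpt of the question's rows:

```python
import io
import pandas as pd

# Shortened excerpt of the question's data (semicolon-separated).
temp = """ProductName;AvailabilityZone;UsageStartDate;BlendedCost
Amazon DynamoDB;ap-southeast-2;6/1/16 0:00;0
Amazon Elastic Compute Cloud;ap-northeast-1a;6/1/16 0:00;0.1
Amazon Elastic Compute Cloud;ap-northeast-1a;6/1/16 0:00;0.02
Amazon Elastic Compute Cloud;;6/1/16 0:00;0.005
Amazon Elastic Compute Cloud;ap-southeast-1a;6/1/16 0:00;0.02
"""
df = pd.read_csv(io.StringIO(temp), sep=';')

# Assumption: timestamps are month/day/year; normalize() resets the time to midnight.
df['UsageStartDate'] = pd.to_datetime(
    df['UsageStartDate'], format='%m/%d/%y %H:%M'
).dt.normalize()

# Keep only the EC2 rows and label empty zones so groupby does not drop them.
ec2 = df[df['ProductName'] == 'Amazon Elastic Compute Cloud']
ec2 = ec2.assign(AvailabilityZone=ec2['AvailabilityZone'].fillna('(no zone)'))

summary = (
    ec2.groupby(['UsageStartDate', 'AvailabilityZone'])
       .agg(**{
           'Sum of products used': ('ProductName', 'size'),
           'Total cost': ('BlendedCost', 'sum'),
       })
       .reset_index()
)
print(summary)
```

In pandas 1.1 and later, passing dropna=False to groupby is another way to keep the rows with a missing zone instead of filling in a label.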

-2 votes

Here is the documentation for the pandas general aggregation framework and the pandas.groupby function.

funk.groupby(['AvailabilityZone', 'UsageStartDate', 'ProductName'])['BlendedCost'].sum()
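
A one-liner of this shape returns a Series with a MultiIndex rather than a flat table. As a small sketch, using the column names from the question's data and a three-row frame made up from its values, reset_index can turn the result into an ordinary DataFrame. The column name 'Total' is my choice:

```python
import pandas as pd

# Three rows taken from the question's data, for illustration only.
funk = pd.DataFrame({
    'ProductName': ['Amazon Elastic Compute Cloud'] * 3,
    'AvailabilityZone': ['ap-northeast-1a', 'ap-northeast-1a', 'ap-southeast-1b'],
    'UsageStartDate': ['6/1/16 0:00'] * 3,
    'BlendedCost': [0.1, 0.02, 0.02],
})

# Summing one column after groupby yields a Series indexed by the group keys.
totals = funk.groupby(
    ['AvailabilityZone', 'UsageStartDate', 'ProductName']
)['BlendedCost'].sum()

# reset_index(name=...) moves the keys back into columns and names the values.
table = totals.reset_index(name='Total')
print(table)
```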