How to filter a DataFrame using pandas

Question (votes: 0, answers: 1)

I have a challenge today: use pandas to process two CSV files and get the same result as the following query:

SELECT *
     , CASE COALESCE(first_msg_courier , 'infinity')
          < COALESCE(first_msg_customer, 'infinity')
          WHEN true  THEN 'Courier'
          WHEN false THEN 'Customer'  -- ELSE null
       END AS first_msg_by
     , LEAST(first_msg_courier, first_msg_customer) AS conversation_start
FROM  (
   SELECT order_id
        , city_code
        , min(message_sent_time) FILTER (WHERE app_type = 'Co') AS first_msg_courier
        , min(message_sent_time) FILTER (WHERE app_type = 'Cu') AS first_msg_customer
        , max(message_sent_time) FILTER (WHERE app_type = 'Co') AS last_msg_courier
        , max(message_sent_time) FILTER (WHERE app_type = 'Cu') AS last_msg_customer
        , max(message_sent_time) - min(message_sent_time) FILTER (WHERE app_type = 'Co') AS responsive_delay_courier
        , max(message_sent_time) - min(message_sent_time) FILTER (WHERE app_type = 'Cu') AS responsive_delay_customer
        , count(*) FILTER (WHERE app_type = 'Co') AS num_msg_courier
        , count(*) FILTER (WHERE app_type = 'Cu') AS num_msg_customer        
   FROM  (
    SELECT ccc.order_id, 
           ord.city_code,
           left(ccc.sender_app_type, 2) AS app_type, 
           ccc.message_sent_time 
    FROM customer_courier_chat_messages ccc
    INNER JOIN "Orders" ord
        ON ccc.order_id = ord.order_id) c
   GROUP  BY order_id, city_code
   ) sub;


The example CSV files are:

customer_courier_chat_messages.csv

sender_app_type,customer_id,from_id,to_id,chat_started_by_message,order_id,order_stage,courier_id,message_sent_time
Customer IOS,99,99,21,FALSE,555,PICKING_UP,21,9/8/22 8:02
Courier IOS,99,21,99,FALSE,555,ARRIVING,21,9/8/22 8:01
Customer IOS,99,99,21,FALSE,555,PICKING_UP,21,9/8/22 8:00
Courier Android,122,87,122,TRUE,38,ADDRESS_DELIVERY,87,9/8/22 7:55
Customer Android,43,43,75,FALSE,875,PICKING_UP,75,7/8/22 14:55
Courier Android,43,75,43,FALSE,875,ARRIVING,75,7/8/22 14:53
Customer Android,43,43,75,FALSE,875,PICKING_UP,75,7/8/22 14:51
Courier Android,43,75,43,TRUE,875,ADDRESS_DELIVERY,75,7/8/22 14:50
Customer IOS,23,23,21,FALSE,134,PICKING_UP,21,7/8/22 10:02
Courier IOS,23,21,23,FALSE,134,ARRIVING,21,7/8/22 10:01
Customer IOS,23,23,21,FALSE,134,PICKING_UP,21,7/8/22 10:00

orders.csv

order_id,city_code
38,BCN
134,OPO
555,BCN
875,VAL

I started coding it like this:

import pandas as pd

ccc = pd.read_csv('customer_courier_chat_messages.csv')
ord = pd.read_csv('orders.csv')


inner_tables = pd.merge(ccc, ord, on=['order_id'], how='inner')
inner_tables['app_type'] = inner_tables.sender_app_type.astype(str).str[:2]

filter_1 = inner_tables[['order_id', 'city_code', 'app_type', 'message_sent_time']]

print(filter_1)

cols = ['order_id', 'city_code']
grouped_df1 = filter_1.groupby(cols).size()

# filter_2 = filter_1.groupby('order_id', 'city_code')
print(grouped_df1)
print('\n')

I don't understand how to group by two columns while also adding the other new columns (first_msg_courier, first_msg_customer, last_msg_courier, etc.).

Can anyone help me?

Regards

python-3.x pandas dataframe numpy
1 Answer
0 votes
# Group by app_type as well, so min/max of message_sent_time are computed per sender type.
# Note: read_csv leaves message_sent_time as strings; convert it with pd.to_datetime first
# if you need chronological rather than lexicographic min/max.
cols = ['order_id', 'city_code', 'app_type']
grouped_df1 = filter_1.groupby(cols).agg({'message_sent_time': ['min', 'max']}).unstack('app_type')  # move app_type from the index into the columns
grouped_df1.columns = grouped_df1.columns.map('_'.join)  # flatten the multi-level column names
# rename() returns a new DataFrame, so the result has to be assigned back
grouped_df1 = grouped_df1.rename(columns={
    'message_sent_time_min_Co': 'first_msg_courier',
    'message_sent_time_max_Co': 'last_msg_courier',
    'message_sent_time_min_Cu': 'first_msg_customer',
    'message_sent_time_max_Cu': 'last_msg_customer'}).reset_index()  # turn order_id and city_code back into columns
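
The snippet above covers only the first and last timestamps. The remaining columns of the SQL query (message counts, responsive delays, first_msg_by, conversation_start) could be built along the following lines. This is a minimal sketch, not a definitive translation. It assumes the timestamps are day/month/year (the sample data is ambiguous on this point), it inlines the sample CSV data so that it runs on its own, and the variable names (per_app, last_any, result, and so on) are made up for illustration. Note that in the SQL, FILTER applies only to the min() in the responsive_delay_* expressions, so the max() there is taken over all messages of the order; the sketch mirrors that.

```python
import io

import numpy as np
import pandas as pd

# Sample data from the question, inlined so the sketch is self-contained.
messages_csv = """sender_app_type,customer_id,from_id,to_id,chat_started_by_message,order_id,order_stage,courier_id,message_sent_time
Customer IOS,99,99,21,FALSE,555,PICKING_UP,21,9/8/22 8:02
Courier IOS,99,21,99,FALSE,555,ARRIVING,21,9/8/22 8:01
Customer IOS,99,99,21,FALSE,555,PICKING_UP,21,9/8/22 8:00
Courier Android,122,87,122,TRUE,38,ADDRESS_DELIVERY,87,9/8/22 7:55
Customer Android,43,43,75,FALSE,875,PICKING_UP,75,7/8/22 14:55
Courier Android,43,75,43,FALSE,875,ARRIVING,75,7/8/22 14:53
Customer Android,43,43,75,FALSE,875,PICKING_UP,75,7/8/22 14:51
Courier Android,43,75,43,TRUE,875,ADDRESS_DELIVERY,75,7/8/22 14:50
Customer IOS,23,23,21,FALSE,134,PICKING_UP,21,7/8/22 10:02
Courier IOS,23,21,23,FALSE,134,ARRIVING,21,7/8/22 10:01
Customer IOS,23,23,21,FALSE,134,PICKING_UP,21,7/8/22 10:00
"""
orders_csv = """order_id,city_code
38,BCN
134,OPO
555,BCN
875,VAL
"""

ccc = pd.read_csv(io.StringIO(messages_csv))
orders = pd.read_csv(io.StringIO(orders_csv))

# ASSUMPTION: dates are day/month/2-digit year. Adjust the format if they are month/day.
ccc["message_sent_time"] = pd.to_datetime(ccc["message_sent_time"], format="%d/%m/%y %H:%M")

# Innermost subquery: inner join plus left(sender_app_type, 2).
df = ccc.merge(orders, on="order_id", how="inner")
df["app_type"] = df["sender_app_type"].str[:2]

keys = ["order_id", "city_code"]

# min/max/count per (order, city, app_type), then pivot app_type into columns.
per_app = (
    df.groupby(keys + ["app_type"])["message_sent_time"]
    .agg(["min", "max", "count"])
    .unstack("app_type")
)

# Map the (aggregate, app_type) column pairs to the SQL column names.
names = {
    ("min", "Co"): "first_msg_courier",
    ("min", "Cu"): "first_msg_customer",
    ("max", "Co"): "last_msg_courier",
    ("max", "Cu"): "last_msg_customer",
    ("count", "Co"): "num_msg_courier",
    ("count", "Cu"): "num_msg_customer",
}
per_app.columns = [names[col] for col in per_app.columns]
result = per_app[list(names.values())].copy()

# Orders with no message from one side get NaN counts after unstack; SQL count(*) gives 0.
for col in ["num_msg_courier", "num_msg_customer"]:
    result[col] = result[col].fillna(0).astype(int)

# As in the SQL: max over ALL messages of the order minus the filtered min.
last_any = df.groupby(keys)["message_sent_time"].max()
result["responsive_delay_courier"] = last_any - result["first_msg_courier"]
result["responsive_delay_customer"] = last_any - result["first_msg_customer"]

# COALESCE(..., 'infinity') comparison: a missing timestamp never counts as "first".
co = result["first_msg_courier"]
cu = result["first_msg_customer"]
courier_first = co.notna() & (cu.isna() | (co < cu))
result["first_msg_by"] = np.where(courier_first, "Courier", "Customer")

# LEAST(...) ignores NULLs, so it equals the timestamp of whoever wrote first.
result["conversation_start"] = co.where(courier_first, cu)

result = result.reset_index()
print(result)
```

Grouping by app_type and unstacking keeps the aggregation in a single groupby call. An alternative would be to filter df into courier and customer subsets, aggregate each one separately, and join the two results on order_id and city_code.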