I have a challenge today: use pandas to process CSV files and get the same result as the following query:
SELECT *
, CASE COALESCE(first_msg_courier , 'infinity')
< COALESCE(first_msg_customer, 'infinity')
WHEN true THEN 'Courier'
WHEN false THEN 'Customer' -- ELSE null
END AS first_msg_by
, LEAST(first_msg_courier, first_msg_customer) AS conversation_start
FROM (
SELECT order_id
, city_code
, min(message_sent_time) FILTER (WHERE app_type = 'Co') AS first_msg_courier
, min(message_sent_time) FILTER (WHERE app_type = 'Cu') AS first_msg_customer
, max(message_sent_time) FILTER (WHERE app_type = 'Co') AS last_msg_courier
, max(message_sent_time) FILTER (WHERE app_type = 'Cu') AS last_msg_customer
, max(message_sent_time) - min(message_sent_time) FILTER (WHERE app_type = 'Co') AS responsive_delay_courier
, max(message_sent_time) - min(message_sent_time) FILTER (WHERE app_type = 'Cu') AS responsive_delay_customer
, count(*) FILTER (WHERE app_type = 'Co') AS num_msg_courier
, count(*) FILTER (WHERE app_type = 'Cu') AS num_msg_customer
FROM (
SELECT ccc.order_id,
ord.city_code,
left(ccc.sender_app_type, 2) AS app_type,
ccc.message_sent_time
FROM customer_courier_chat_messages ccc
INNER JOIN "Orders" ord
ON ccc.order_id = ord.order_id) c
GROUP BY order_id, city_code
) sub;
The example CSV files are:
customer_courier_chat_messages.csv
sender_app_type,customer_id,from_id,to_id,chat_started_by_message,order_id,order_stage,courier_id,message_sent_time
Customer IOS,99,99,21,FALSE,555,PICKING_UP,21,9/8/22 8:02
Courier IOS,99,21,99,FALSE,555,ARRIVING,21,9/8/22 8:01
Customer IOS,99,99,21,FALSE,555,PICKING_UP,21,9/8/22 8:00
Courier Android,122,87,122,TRUE,38,ADDRESS_DELIVERY,87,9/8/22 7:55
Customer Android,43,43,75,FALSE,875,PICKING_UP,75,7/8/22 14:55
Courier Android,43,75,43,FALSE,875,ARRIVING,75,7/8/22 14:53
Customer Android,43,43,75,FALSE,875,PICKING_UP,75,7/8/22 14:51
Courier Android,43,75,43,TRUE,875,ADDRESS_DELIVERY,75,7/8/22 14:50
Customer IOS,23,23,21,FALSE,134,PICKING_UP,21,7/8/22 10:02
Courier IOS,23,21,23,FALSE,134,ARRIVING,21,7/8/22 10:01
Customer IOS,23,23,21,FALSE,134,PICKING_UP,21,7/8/22 10:00
orders.csv
order_id,city_code
38,BCN
134,OPO
555,BCN
875,VAL
I started coding it like this:
import pandas as pd
ccc = pd.read_csv('customer_courier_chat_messages.csv')
ord = pd.read_csv('orders.csv')
inner_tables = pd.merge(ccc, ord, on=['order_id'], how='inner')
inner_tables['app_type'] = inner_tables.sender_app_type.astype(str).str[:2]
filter_1 = inner_tables[['order_id', 'city_code', 'app_type', 'message_sent_time']]
print(filter_1)
cols = ['order_id', 'city_code']
grouped_df1 = filter_1.groupby(cols).size()
# filter_2 = filter_1.groupby('order_id', 'city_code')
print(grouped_df1)
print('\n')
I don't understand how to group by two columns while also adding the other new columns (first_msg_courier, first_msg_customer, last_msg_courier, etc.).
Can anyone help me?
Regards
cols = ['order_id', 'city_code', 'app_type']  # also group by app_type to get the min and max of message_sent_time per sender type
# note: read_csv loads message_sent_time as text, so min/max compare strings unless the column is converted with pd.to_datetime first
grouped_df1 = filter_1.groupby(cols).agg({'message_sent_time': ['min', 'max']}).unstack('app_type')  # after aggregating, unstack app_type into columns
grouped_df1.columns = grouped_df1.columns.map('_'.join)  # flatten the multi-level column names
grouped_df1 = grouped_df1.rename(columns={
    'message_sent_time_min_Co': 'first_msg_courier',
    'message_sent_time_max_Co': 'last_msg_courier',
    'message_sent_time_min_Cu': 'first_msg_customer',
    'message_sent_time_max_Cu': 'last_msg_customer'}).reset_index()  # rename the columns, reset the index, and keep the result
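The snippet above covers the first/last timestamps. The remaining columns of the query (message counts, responsive delays, first_msg_by and conversation_start) could be built along the same lines. What follows is only a sketch under some assumptions: the sample data from the question is inlined so it runs on its own, the timestamps are assumed to be day/month/two-digit-year (swap the format string if they are month-first), only the `Co` and `Cu` prefixes are assumed to occur, and names such as `stats`, `role` and `result` are made up for illustration. The delays mirror the SQL as written, where `FILTER` applies only to `min(...)`, so each delay is the order's overall last message minus that sender's first message.

```python
import io
import pandas as pd

# Sample data copied from the question, inlined so the sketch is self-contained.
CHAT_CSV = """sender_app_type,customer_id,from_id,to_id,chat_started_by_message,order_id,order_stage,courier_id,message_sent_time
Customer IOS,99,99,21,FALSE,555,PICKING_UP,21,9/8/22 8:02
Courier IOS,99,21,99,FALSE,555,ARRIVING,21,9/8/22 8:01
Customer IOS,99,99,21,FALSE,555,PICKING_UP,21,9/8/22 8:00
Courier Android,122,87,122,TRUE,38,ADDRESS_DELIVERY,87,9/8/22 7:55
Customer Android,43,43,75,FALSE,875,PICKING_UP,75,7/8/22 14:55
Courier Android,43,75,43,FALSE,875,ARRIVING,75,7/8/22 14:53
Customer Android,43,43,75,FALSE,875,PICKING_UP,75,7/8/22 14:51
Courier Android,43,75,43,TRUE,875,ADDRESS_DELIVERY,75,7/8/22 14:50
Customer IOS,23,23,21,FALSE,134,PICKING_UP,21,7/8/22 10:02
Courier IOS,23,21,23,FALSE,134,ARRIVING,21,7/8/22 10:01
Customer IOS,23,23,21,FALSE,134,PICKING_UP,21,7/8/22 10:00
"""
ORDERS_CSV = """order_id,city_code
38,BCN
134,OPO
555,BCN
875,VAL
"""

chat = pd.read_csv(io.StringIO(CHAT_CSV))
orders = pd.read_csv(io.StringIO(ORDERS_CSV))

# Assumption: day/month/year. Parsing matters, otherwise min/max compare strings.
chat["message_sent_time"] = pd.to_datetime(
    chat["message_sent_time"], format="%d/%m/%y %H:%M"
)
chat["app_type"] = chat["sender_app_type"].str[:2]  # 'Co' or 'Cu'

merged = chat.merge(orders, on="order_id", how="inner")
keys = ["order_id", "city_code"]

# One row per (order, city, app_type), then pivot app_type into columns.
stats = (
    merged.groupby(keys + ["app_type"])["message_sent_time"]
    .agg(first_msg="min", last_msg="max", num_msg="count")
    .unstack("app_type")
)
role = {"Co": "courier", "Cu": "customer"}
stats.columns = [f"{stat}_{role.get(app, app)}" for stat, app in stats.columns]

# A sender with no messages in an order yields NaN after unstack; SQL count gives 0.
for col in ["num_msg_courier", "num_msg_customer"]:
    stats[col] = stats[col].fillna(0).astype(int)

# As in the SQL: overall last message of the order minus each sender's first message.
overall_last = merged.groupby(keys)["message_sent_time"].max()
stats["responsive_delay_courier"] = overall_last - stats["first_msg_courier"]
stats["responsive_delay_customer"] = overall_last - stats["first_msg_customer"]

# COALESCE(..., 'infinity') comparison: a missing timestamp never counts as first.
courier = stats["first_msg_courier"]
customer = stats["first_msg_customer"]
courier_first = courier.notna() & (customer.isna() | (courier < customer))
stats["first_msg_by"] = courier_first.map({True: "Courier", False: "Customer"})

# LEAST(...) ignoring NULLs: take the courier time when the courier was first.
stats["conversation_start"] = courier.where(courier_first, customer)

result = stats.reset_index()
print(result)
```

With the sample data, order 38 has only courier messages, so its customer columns stay empty (NaT) and its customer count is 0, which is how the SQL `FILTER` aggregates behave on an empty set.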