I have two datasets. One is supplies and the other is sales. They have different dates and times.
Supplies
Year Month Day Hour Item
0 2023 05 17 10 8
1 2023 06 01 12 8
2 2023 06 10 16 3
3 2023 06 17 10 8
4 2023 07 01 10 8
5 2023 08 17 10 8
Sales
Year Month Day Hour Sale
0 2023 05 17 16 3
1 2023 05 18 12 3
2 2023 05 24 16 3
3 2023 05 27 10 1
4 2023 06 02 10 2
5 2023 06 03 10 3
I need the information from both, so I merged them:
Year Month Day Hour Item Year Month Day Hour Sale
0 2023 05 17 10 8 2023 05 17 16 3
1 2023 06 01 12 8 2023 05 18 12 3
2 2023 06 10 16 3 2023 05 24 16 3
3 2023 06 17 10 8 2023 05 27 10 1
4 2023 07 01 10 8 2023 06 02 10 2
5 2023 08 17 10 8 2023 06 03 10 3
My idea is that if there was no delivery on a given day, the date on the right keeps repeating and the quantity is 0, until a new delivery arrives.
What I would like to get:
Year Month Day Hour Item Year Month Day Hour Sale
0 2023 05 17 10 8 2023 05 17 16 3
1 2023 NaN NaN NaN 0 2023 05 18 12 3
2 2023 NaN NaN NaN 0 2023 05 24 16 3
3 2023 NaN NaN NaN 0 2023 05 27 10 1
4 2023 06 01 12 8 2023 06 02 10 2
5 2023 NaN NaN NaN 0 2023 06 03 10 3
I want to get this result: wherever the date value on the left would be smaller than the date value on the right, the left-hand date columns should be replaced with NaN and the Item value set to 0.
There are two ways to combine the two datasets in the desired way:

1. A conditional join, as it exists in SQL. Specifically, a sale record is joined to the current supply record when its date is equal to or later than the date of that supply record, and not later than the date of the subsequent supply record. As far as I know, conditional joins (as they exist in SQL) cannot be done directly in pandas; see the SO post linked here.
2. A cross join of the two tables, followed by filtering the rows on the same condition.

Below you can find code showing how both approaches can be done, where the first approach additionally requires the sqlite3 module. I would personally recommend the first approach, since performing a cross join can be computationally very expensive.
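As an aside: while general conditional joins are indeed not available in pandas, the specific "match each sale to the most recent supply at or before it" pattern is covered by `pd.merge_asof`. A minimal sketch on toy frames mirroring the structure above (column names and values are illustrative, not part of either approach below):

```python
import pandas as pd

# Toy frames mirroring the supplies/sales structure (values illustrative)
supplies = pd.DataFrame({
    'supply_datetime': pd.to_datetime(
        ['2023-05-17 10:00', '2023-06-01 12:00']),
    'Item': [8, 8],
})
sales = pd.DataFrame({
    'sale_datetime': pd.to_datetime(
        ['2023-05-17 16:00', '2023-05-18 12:00', '2023-06-02 10:00']),
    'Sale': [3, 3, 2],
})

# As-of join: each sale gets the latest supply whose datetime <= sale datetime.
# Both frames must already be sorted on their join keys.
df = pd.merge_asof(sales, supplies,
                   left_on='sale_datetime', right_on='supply_datetime',
                   direction='backward')

# Blank out repeated supply info, as in the expected output
dupe = df.duplicated('supply_datetime')
df.loc[dupe, 'supply_datetime'] = pd.NaT
df.loc[dupe, 'Item'] = 0
print(df)
```

This avoids both the external database and the cross join, but only works because the matching condition here is exactly an as-of match; it is not a general conditional join.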
Code for the first approach:
# loading data using solution from https://stackoverflow.com/a/53692642/8718701
from io import StringIO
import numpy as np
import pandas as pd
import sqlite3
d = '''
Year Month Day Hour Item
0 2023 05 17 10 8
1 2023 06 01 12 8
2 2023 06 10 16 3
3 2023 06 17 10 8
4 2023 07 01 10 8
5 2023 08 17 10 8
'''
supplies_df = pd.read_csv(StringIO(d), sep=r'\s+')
d = '''
Year Month Day Hour Sale
0 2023 05 17 16 3
1 2023 05 18 12 3
2 2023 05 24 16 3
3 2023 05 27 10 1
4 2023 06 02 10 2
5 2023 06 03 10 3
'''
sales_df = pd.read_csv(StringIO(d), sep=r'\s+')
# first approach based on https://stackoverflow.com/a/42796283/8718701
supplies_df['supply_datetime'] = pd.to_datetime(supplies_df[['Year', 'Month', 'Day', 'Hour']])
supplies_df['next_supply_datetime'] = supplies_df['supply_datetime'].shift(-1)
sales_df['sale_datetime'] = pd.to_datetime(sales_df[['Year', 'Month', 'Day', 'Hour']])
#Make the db in memory
conn = sqlite3.connect(':memory:')
#write the tables
supplies_df.to_sql('supplies', conn, index=False)
sales_df.to_sql('sales', conn, index=False)
# sql query joining tables on conditional join:
qry = '''
SELECT
A.supply_datetime,
A.Year AS supply_year,
A.Month AS supply_month,
A.Day AS supply_day,
A.Hour AS supply_hour,
A.Item,
B.sale_datetime,
B.Year AS sale_year,
B.Month AS sale_month,
B.Day AS sale_day,
B.Hour AS sale_hour,
B.Sale
FROM supplies AS A
JOIN sales AS B
ON A.supply_datetime <= B.sale_datetime
   -- the last supply record has no successor (NULL next_supply_datetime),
   -- so keep its sales as well
   AND (A.next_supply_datetime > B.sale_datetime
        OR A.next_supply_datetime IS NULL)
'''
df = pd.read_sql_query(qry, conn)
# remove duplicated info
record_is_dupe = df.duplicated('supply_datetime')
df.loc[record_is_dupe, ['supply_datetime', 'supply_year', 'supply_month', 'supply_day', 'supply_hour']] = np.nan
df.loc[record_is_dupe, ['Item']] = 0
# remove datetime columns
df.drop(columns=['supply_datetime', 'sale_datetime'], inplace=True)
# matches expected output
print(df.to_markdown(index=False))
# | supply_year | supply_month | supply_day | supply_hour | Item | sale_year | sale_month | sale_day | sale_hour | Sale |
# |--------------:|---------------:|-------------:|--------------:|-------:|------------:|-------------:|-----------:|------------:|-------:|
# | 2023 | 5 | 17 | 10 | 8 | 2023 | 5 | 17 | 16 | 3 |
# | nan | nan | nan | nan | 0 | 2023 | 5 | 18 | 12 | 3 |
# | nan | nan | nan | nan | 0 | 2023 | 5 | 24 | 16 | 3 |
# | nan | nan | nan | nan | 0 | 2023 | 5 | 27 | 10 | 1 |
# | 2023 | 6 | 1 | 12 | 8 | 2023 | 6 | 2 | 10 | 2 |
# | nan | nan | nan | nan | 0 | 2023 | 6 | 3 | 10 | 3 |
Code for the second approach:
# loading data using solution from https://stackoverflow.com/a/53692642/8718701
from io import StringIO
import numpy as np
import pandas as pd
d = '''
Year Month Day Hour Item
0 2023 05 17 10 8
1 2023 06 01 12 8
2 2023 06 10 16 3
3 2023 06 17 10 8
4 2023 07 01 10 8
5 2023 08 17 10 8
'''
supplies_df = pd.read_csv(StringIO(d), sep=r'\s+')
d = '''
Year Month Day Hour Sale
0 2023 05 17 16 3
1 2023 05 18 12 3
2 2023 05 24 16 3
3 2023 05 27 10 1
4 2023 06 02 10 2
5 2023 06 03 10 3
'''
sales_df = pd.read_csv(StringIO(d), sep=r'\s+')
# second approach based on https://stackoverflow.com/a/53699198/8718701
supplies_df['datetime'] = pd.to_datetime(supplies_df[['Year', 'Month', 'Day', 'Hour']])
supplies_df['datetime_shift'] = supplies_df['datetime'].shift(-1)
# rename columns, since duplicate names cause problems after the merge
supplies_df.columns = ['supply_' + col.lower() for col in supplies_df.columns]
sales_df['datetime'] = pd.to_datetime(sales_df[['Year', 'Month', 'Day', 'Hour']])
sales_df.columns = ['sale_' + col.lower() for col in sales_df.columns]
# Cartesian product
df = pd.merge(left=supplies_df, right=sales_df, how='cross')
# Filtering rows based on condition
winnowing_condition = (
    (df['supply_datetime'] <= df['sale_datetime'])
    # fill the NaT of the last supply record so its sales are not dropped
    & (df['supply_datetime_shift'].fillna(pd.Timestamp.max) > df['sale_datetime'])
)
df = df.loc[winnowing_condition, :]
# remove duplicated info
record_is_dupe = df.duplicated('supply_datetime')
df.loc[record_is_dupe, ['supply_datetime', 'supply_year', 'supply_month', 'supply_day', 'supply_hour']] = np.nan
df.loc[record_is_dupe, ['supply_item']] = 0
# remove datetime columns
df.drop(columns=['supply_datetime', 'sale_datetime', 'supply_datetime_shift'], inplace=True)
df.rename(columns={'supply_item': 'Item', 'sale_sale': 'Sale'}, inplace=True)
# matches expected output
print(df.to_markdown(index=False))
# | supply_year | supply_month | supply_day | supply_hour | Item | sale_year | sale_month | sale_day | sale_hour | Sale |
# |--------------:|---------------:|-------------:|--------------:|-------:|------------:|-------------:|-----------:|------------:|-------:|
# | 2023 | 5 | 17 | 10 | 8 | 2023 | 5 | 17 | 16 | 3 |
# | nan | nan | nan | nan | 0 | 2023 | 5 | 18 | 12 | 3 |
# | nan | nan | nan | nan | 0 | 2023 | 5 | 24 | 16 | 3 |
# | nan | nan | nan | nan | 0 | 2023 | 5 | 27 | 10 | 1 |
# | 2023 | 6 | 1 | 12 | 8 | 2023 | 6 | 2 | 10 | 2 |
# | nan | nan | nan | nan | 0 | 2023 | 6 | 3 | 10 | 3 |