How can I align merged datasets?


I have two datasets: one for supplies and one for sales. Their dates and times differ.

Supplies

         Year      Month   Day  Hour       Item
    0    2023       05     17    10         8 
    1    2023       06     01    12         8 
    2    2023       06     10    16         3
    3    2023       06     17    10         8 
    4    2023       07     01    10         8 
    5    2023       08     17    10         8 

Sales

         Year      Month   Day  Hour       Sale
    0    2023       05     17    16         3 
    1    2023       05     18    12         3 
    2    2023       05     24    16         3 
    3    2023       05     27    10         1 
    4    2023       06     02    10         2 
    5    2023       06     03    10         3 

I need the information from both, so I merged them:

         Year      Month   Day  Hour       Item    Year      Month   Day  Hour       Item  
    0    2023       05     17    10         8      2023       05     17    16         3
    1    2023       06     01    12         8      2023       05     18    12         3
    2    2023       06     10    16         3      2023       05     24    16         3
    3    2023       06     17    10         8      2023       05     27    10         1
    4    2023       07     01    10         8      2023       06     02    10         2
    5    2023       08     17    10         8      2023       06     03    10         3

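If there was no supply delivery on a given day, I want the sale dates on the right to continue row by row with the supply quantity set to 0, until a new delivery arrives.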

Desired output:

         Year      Month   Day  Hour       Item    Year      Month   Day  Hour       Item
    0    2023       05     17    10         8      2023       05     17    16         3
    1    2023      NaN    NaN   NaN         0      2023       05     18    12         3
    2    2023      NaN    NaN   NaN         0      2023       05     24    16         3
    3    2023      NaN    NaN   NaN         0      2023       05     27    10         1
    4    2023       06     01    12         8      2023       06     02    10         2
    5    2023      NaN    NaN   NaN         0      2023       06     03    10         3

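I want to get this result: if the date on the left is earlier than the date on the right, and if the value in the column is 0, then it is replaced with NaN.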

python pandas dataframe merge cross-join
1 Answer

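There are two ways to combine the two datasets in the desired way:

  1. An SQL-like conditional join on the supply and sale dates. Specifically, a sale record is joined to a supply record when the sale date is equal to or later than that supply record's date and strictly earlier than the date of the next supply record.
  2. A Cartesian product of the two tables followed by a filter. We cross join all rows of both tables and then filter the data as described in 1.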
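The condition in 1 amounts to a half-open window, from one supply up to (but not including) the next, for each supply record. A minimal sketch of that check for a single sale, using three supply timestamps from the question; the variable names here are illustrative and do not appear in the code further down:

```python
import pandas as pd

# Three supply timestamps taken from the question's data.
supply_times = pd.Series(
    pd.to_datetime(["2023-05-17 10:00", "2023-06-01 12:00", "2023-06-10 16:00"])
)
# The start of the next supply closes each window; the last supply gets NaT.
next_supply_times = supply_times.shift(-1)

sale_time = pd.Timestamp("2023-05-24 16:00")

# A sale belongs to a supply when supply <= sale < next supply.
in_window = (supply_times <= sale_time) & (next_supply_times > sale_time)
print(in_window.tolist())
```

Because any comparison against NaT evaluates to False, the last supply record never matches under this rule. That is worth keeping in mind if sales can occur after the final supply.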

As far as I know, conditional joins (as they exist in SQL) cannot be done in pandas; see the related SO post.
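For this particular pattern (attach each sale to the most recent supply at or before it), `pandas.merge_asof` with `direction="backward"` may be worth a look as a pandas-only alternative. It is not a general conditional join and it is not one of the two approaches in this answer. The sketch below rebuilds the question's data by hand; the helper `add_datetime_and_prefix` is a made-up name for illustration.

```python
import numpy as np
import pandas as pd

supplies = pd.DataFrame({
    "Year": [2023] * 6, "Month": [5, 6, 6, 6, 7, 8],
    "Day": [17, 1, 10, 17, 1, 17], "Hour": [10, 12, 16, 10, 10, 10],
    "Item": [8, 8, 3, 8, 8, 8],
})
sales = pd.DataFrame({
    "Year": [2023] * 6, "Month": [5, 5, 5, 5, 6, 6],
    "Day": [17, 18, 24, 27, 2, 3], "Hour": [16, 12, 16, 10, 10, 10],
    "Sale": [3, 3, 3, 1, 2, 3],
})


def add_datetime_and_prefix(df, prefix):
    """Add a '<prefix>_datetime' column and prefix the date-part columns."""
    parts = ["Year", "Month", "Day", "Hour"]
    out = df.copy()
    out[f"{prefix}_datetime"] = pd.to_datetime(out[parts])
    return out.rename(columns={p: f"{prefix}_{p.lower()}" for p in parts})


supplies = add_datetime_and_prefix(supplies, "supply")
sales = add_datetime_and_prefix(sales, "sale")

# Both frames must be sorted by their key; each sale picks the latest
# supply whose datetime is <= the sale datetime.
merged = pd.merge_asof(
    sales.sort_values("sale_datetime"),
    supplies.sort_values("supply_datetime"),
    left_on="sale_datetime",
    right_on="supply_datetime",
    direction="backward",
)

# Blank out supply info that repeats for later sales of the same supply.
supply_cols = ["supply_year", "supply_month", "supply_day", "supply_hour"]
repeated = merged.duplicated("supply_datetime")
merged[supply_cols] = merged[supply_cols].astype("float64")  # make room for NaN
merged.loc[repeated, supply_cols] = np.nan
merged.loc[repeated, "Item"] = 0

merged = merged.drop(columns=["supply_datetime", "sale_datetime"])
print(merged)
```

Two differences from the windowed join are worth noting: a sale after the last supply would still match that last supply, and a sale before the first supply would be kept with empty supply columns rather than dropped. Neither case occurs in the question's data.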

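Below you can find code showing how to do both approaches; the first one also requires the sqlite3 module. I would personally recommend the first approach, because performing a cross join can be computationally very expensive.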
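To get a feel for that cost: a cross join produces one row for every pair of input rows, so the intermediate table grows with the product of the two lengths before any filtering happens. A small sketch with arbitrary, made-up frame sizes:

```python
import pandas as pd

# Frame sizes are arbitrary, chosen only to show the growth.
left = pd.DataFrame({"supply_id": range(300)})
right = pd.DataFrame({"sale_id": range(400)})

# how="cross" is available in pandas 1.2 and later.
crossed = pd.merge(left, right, how="cross")

print(len(left), len(right), len(crossed))
```

With a few hundred thousand rows on each side the intermediate result would no longer fit in memory on most machines, which is the main reason to prefer a join that filters while it matches.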

Code for the first approach:

# loading data using solution from https://stackoverflow.com/a/53692642/8718701
from io import StringIO

import numpy as np
import pandas as pd
import sqlite3

d = '''
         Year      Month   Day  Hour       Item
    0    2023       05     17    10         8
    1    2023       06     01    12         8
    2    2023       06     10    16         3
    3    2023       06     17    10         8
    4    2023       07     01    10         8
    5    2023       08     17    10         8
'''

supplies_df = pd.read_csv(StringIO(d), sep=r'\s+')  # raw string avoids an invalid-escape warning

d = '''
         Year      Month   Day  Hour       Sale
    0    2023       05     17    16         3
    1    2023       05     18    12         3
    2    2023       05     24    16         3
    3    2023       05     27    10         1
    4    2023       06     02    10         2
    5    2023       06     03    10         3
'''

sales_df = pd.read_csv(StringIO(d), sep=r'\s+')

# first approach based on https://stackoverflow.com/a/42796283/8718701

supplies_df['supply_datetime'] = pd.to_datetime(supplies_df[['Year', 'Month', 'Day', 'Hour']])
supplies_df['next_supply_datetime'] = supplies_df['supply_datetime'].shift(-1)

sales_df['sale_datetime'] = pd.to_datetime(sales_df[['Year', 'Month', 'Day', 'Hour']])

# create an in-memory SQLite database
conn = sqlite3.connect(':memory:')
# write both DataFrames to it as tables
supplies_df.to_sql('supplies', conn, index=False)
sales_df.to_sql('sales', conn, index=False)

# sql query joining tables on conditional join:
qry = '''
    SELECT 
        A.supply_datetime,
        A.Year AS supply_year,
        A.Month AS supply_month,
        A.Day AS supply_day,
        A.Hour AS supply_hour,
        A.Item,
        B.sale_datetime,
        B.Year AS sale_year,
        B.Month AS sale_month,
        B.Day AS sale_day,
        B.Hour AS sale_hour,
        B.Sale
    FROM supplies AS A
    JOIN sales AS B
    ON A.supply_datetime <= B.sale_datetime AND A.next_supply_datetime > B.sale_datetime
    ORDER BY B.sale_datetime  -- row order matters for the duplicate check below
    '''
df = pd.read_sql_query(qry, conn)

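# blank out supply info that repeats for later sales of the same supply
record_is_dupe = df.duplicated('supply_datetime')
supply_cols = ['supply_year', 'supply_month', 'supply_day', 'supply_hour']
# cast to float first so NaN can be stored without an implicit dtype upcast
df[supply_cols] = df[supply_cols].astype('float64')
df.loc[record_is_dupe, supply_cols] = np.nan  # np.NaN was removed in NumPy 2.0
df.loc[record_is_dupe, ['Item']] = 0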

# remove datetime columns
df.drop(columns=['supply_datetime', 'sale_datetime'], inplace=True)

# matches expected output
print(df.to_markdown(index=False))

# |   supply_year |   supply_month |   supply_day |   supply_hour |   Item |   sale_year |   sale_month |   sale_day |   sale_hour |   Sale |
# |--------------:|---------------:|-------------:|--------------:|-------:|------------:|-------------:|-----------:|------------:|-------:|
# |          2023 |              5 |           17 |            10 |      8 |        2023 |            5 |         17 |          16 |      3 |
# |           nan |            nan |          nan |           nan |      0 |        2023 |            5 |         18 |          12 |      3 |
# |           nan |            nan |          nan |           nan |      0 |        2023 |            5 |         24 |          16 |      3 |
# |           nan |            nan |          nan |           nan |      0 |        2023 |            5 |         27 |          10 |      1 |
# |          2023 |              6 |            1 |            12 |      8 |        2023 |            6 |          2 |          10 |      2 |
# |           nan |            nan |          nan |           nan |      0 |        2023 |            6 |          3 |          10 |      3 |

Code for the second approach:

# loading data using solution from https://stackoverflow.com/a/53692642/8718701
from io import StringIO

import numpy as np
import pandas as pd

d = '''
         Year      Month   Day  Hour       Item
    0    2023       05     17    10         8
    1    2023       06     01    12         8
    2    2023       06     10    16         3
    3    2023       06     17    10         8
    4    2023       07     01    10         8
    5    2023       08     17    10         8
'''

supplies_df = pd.read_csv(StringIO(d), sep=r'\s+')  # raw string avoids an invalid-escape warning

d = '''
         Year      Month   Day  Hour       Sale
    0    2023       05     17    16         3
    1    2023       05     18    12         3
    2    2023       05     24    16         3
    3    2023       05     27    10         1
    4    2023       06     02    10         2
    5    2023       06     03    10         3
'''

sales_df = pd.read_csv(StringIO(d), sep=r'\s+')

# second approach based on https://stackoverflow.com/a/53699198/8718701

# build the datetimes, then prefix all column names,
# as duplicate names cause problems in the cross join
supplies_df['datetime'] = pd.to_datetime(supplies_df[['Year', 'Month', 'Day', 'Hour']])
supplies_df['datetime_shift'] = supplies_df['datetime'].shift(-1)
supplies_df.columns = ['supply_' + col.lower() for col in supplies_df.columns]

sales_df['datetime'] = pd.to_datetime(sales_df[['Year', 'Month', 'Day', 'Hour']])
sales_df.columns = ['sale_' + col.lower() for col in sales_df.columns]

# Cartesian product
df = pd.merge(left=supplies_df, right=sales_df, how='cross')

# Filtering rows based on condition
winnowing_condition = (
    ((df['supply_datetime'] <= df['sale_datetime'])
    & (df['supply_datetime_shift'] > df['sale_datetime']))
)
df = df.loc[winnowing_condition, :].copy()  # copy so later assignments do not hit a view

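# blank out supply info that repeats for later sales of the same supply
record_is_dupe = df.duplicated('supply_datetime')
supply_cols = ['supply_year', 'supply_month', 'supply_day', 'supply_hour']
# cast to float first so NaN can be stored without an implicit dtype upcast
df[supply_cols] = df[supply_cols].astype('float64')
df.loc[record_is_dupe, supply_cols] = np.nan  # np.NaN was removed in NumPy 2.0
df.loc[record_is_dupe, ['supply_item']] = 0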

# remove datetime columns
df.drop(columns=['supply_datetime', 'sale_datetime', 'supply_datetime_shift'], inplace=True)
df.rename(columns={'supply_item': 'Item', 'sale_sale': 'Sale'}, inplace=True)

# matches expected output
print(df.to_markdown(index=False))


# |   supply_year |   supply_month |   supply_day |   supply_hour |   Item |   sale_year |   sale_month |   sale_day |   sale_hour |   Sale |
# |--------------:|---------------:|-------------:|--------------:|-------:|------------:|-------------:|-----------:|------------:|-------:|
# |          2023 |              5 |           17 |            10 |      8 |        2023 |            5 |         17 |          16 |      3 |
# |           nan |            nan |          nan |           nan |      0 |        2023 |            5 |         18 |          12 |      3 |
# |           nan |            nan |          nan |           nan |      0 |        2023 |            5 |         24 |          16 |      3 |
# |           nan |            nan |          nan |           nan |      0 |        2023 |            5 |         27 |          10 |      1 |
# |          2023 |              6 |            1 |            12 |      8 |        2023 |            6 |          2 |          10 |      2 |
# |           nan |            nan |          nan |           nan |      0 |        2023 |            6 |          3 |          10 |      3 |