(Featuretools) How are aggregation feature primitives calculated?


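Even when I test with very simple data, I cannot work out how aggregation feature primitives are calculated. I also looked through the featuretools source code, but could not find where the aggregation actually happens.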

Here is the sample code:

from datetime import datetime

import pandas as pd
from sklearn.utils import shuffle

periods = 5
end_date = "2012-04-13"
train_df = pd.DataFrame(
    {
        "store_id": [0]*periods + [1]*periods + [2]*periods + [3]*periods,
        "region": ["A"]*periods+["B"]*periods*3,
        "amount": shuffle(range(periods*4)),
        "transacted_date": [
            "2012-02-05", "2012-02-10", "2012-03-01", "2012-03-18", "2012-04-23",
        ]*4
    }
)
train_df["transacted_date"] = pd.to_datetime(train_df["transacted_date"])
train_df.sort_values(["store_id", "transacted_date"], inplace=True)


def make_retail_cutoffs_amounts(data_df, amount_start_date, amount_end_date):
    store_pool = data_df[data_df['transacted_date'] < amount_start_date]['store_id'].unique()
    tmp = pd.DataFrame({'store_id': store_pool})

    amounts = data_df[
        (data_df['store_id'].isin(store_pool)) &
        (amount_start_date <= data_df['transacted_date']) &
        (data_df['transacted_date'] < amount_end_date)
    ].groupby('store_id')['amount'].sum().reset_index()

    amounts = amounts.merge(tmp, on = 'store_id', how = 'right')
    amounts['amount'] = amounts['amount'].fillna(0)  # a store filled with 0 had no revenue in all 3 months

    amounts['cutoff_time'] = pd.to_datetime(amount_start_date)

    amounts = amounts[['store_id', 'cutoff_time', 'amount']]
    amounts = amounts.rename(columns={"amount":"1month_amount_from_cutoff_time"})
    return amounts


amount_start_date = "2012-02-01"
amount_end_date = end_date
agg_month = 1

data_df_list = []
date_list = pd.date_range(amount_start_date, datetime.strptime(end_date, "%Y-%m-%d") + pd.DateOffset(months=1), freq="MS")

for amount_start_date, amount_end_date in zip(date_list[:-agg_month], date_list[agg_month:]):
    data_df_list.append(
        make_retail_cutoffs_amounts(
            train_df, amount_start_date, amount_end_date
        )
    )
data_df = pd.concat(data_df_list)
data_df.sort_values(["store_id", "cutoff_time", ], inplace=True)

import featuretools as ft

es = ft.EntitySet(id="sale_set")
es = es.entity_from_dataframe(
    "sales",
    dataframe=train_df,
    index="sale_id", make_index=True,
    time_index='transacted_date',
)
es.normalize_entity(
    new_entity_id="stores",
    base_entity_id="sales",
    index="store_id",
    additional_variables=['region']
)

# When using a training window, the last time indexes of the
# entity set must be calculated first.
es.add_last_time_indexes()

features  = ft.dfs(
    entityset=es,
    target_entity='stores',
    cutoff_time=data_df,
    verbose=1,
    cutoff_time_in_index=True,
    n_jobs=1,
    max_depth=2,

    agg_primitives=["sum",],
    trans_primitives=["cum_max"], 
    training_window="1 month",
)

dfs runs without problems, but I cannot explain the resulting features.

Here is a sample of the resulting features:

[Image: sample of the resulting feature matrix]

As you can see here, the first-row values of SUM(sales.amount) and SUM(sales.CUM_MAX(amount)) are 19 and 37, respectively. I would like to know how they are calculated.

Here is my interpretation of the result:

[Image: the asker's interpretation of the result]

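  1. As you can see, store_0 has 2 sales records in February 2012. So SUM(sales.amount) for store_id = 0 at the 2012-03-01 cutoff time should be 0 + 8 = 8, not 19.

  2. Similarly, SUM(sales.CUM_MAX(amount)) for store_id = 0 at the 2012-01-01 cutoff time should be SUM(sales.CUM_MAX(amount)) = SUM([0, 8]) = 8, not 37.

  3. Am I missing something? How are these values calculated?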
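The expectation in points 1 and 2 can be reproduced by hand with plain pandas. The following is a minimal sketch of that interpretation only, not of what Featuretools does internally. It assumes the window covers the month before the 2012-03-01 cutoff and excludes the cutoff date itself. The amounts 0 and 8 are taken from the question; the other three amounts are made up, since the real values come from a random shuffle.

```python
import pandas as pd

# Hypothetical stand-in for store 0's rows. Only the first two amounts
# (0 and 8) come from the question; the rest are illustrative.
sales = pd.DataFrame(
    {
        "store_id": [0] * 5,
        "amount": [0, 8, 3, 5, 1],
        "transacted_date": pd.to_datetime(
            ["2012-02-05", "2012-02-10", "2012-03-01", "2012-03-18", "2012-04-23"]
        ),
    }
)

cutoff = pd.Timestamp("2012-03-01")
window_start = cutoff - pd.DateOffset(months=1)  # training_window="1 month"

# Assumption: rows on or after window_start and strictly before the cutoff
# count. Featuretools may treat the boundaries differently.
in_window = sales[
    (sales["transacted_date"] >= window_start)
    & (sales["transacted_date"] < cutoff)
]

# SUM(sales.amount) as the asker expects it
expected_sum = in_window["amount"].sum()

# SUM(sales.CUM_MAX(amount)), with the cumulative max taken over the
# windowed rows of this one store only
expected_sum_cum_max = in_window["amount"].cummax().sum()

print(expected_sum, expected_sum_cum_max)
```

Both values come out as 8 under these assumptions, which matches the hand calculation. The gap to the reported 19 and 37 must therefore come from which rows Featuretools selects or from how CUM_MAX is applied before the sum.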


python featuretools
1 Answer

These concepts will help you understand how the features are calculated:
