First question on SO, still learning Python & pandas.
EDIT: I've successfully pivoted the df values from long to wide so that there is a unique uniqueID+day index (i.e., no uniqueID has more than one row per day). However, I still haven't been able to reach the result I'm after.
I have several dfs that I want to merge based on A) uniqueID and B) whether that uniqueID falls within several different date ranges. I found this question, but after the workaround proved unfeasible and some more digging, it seems what I'm attempting may not be possible. Or am I missing something(?)
The gist of it: sum all the values in df_values if the uniqueID is in df_dates_range and its corresponding day column falls within the dates_range start:end range.
There are many more columns in each df, but these are the relevant ones. Everything is duplicated all over and in no particular order. All df series are properly formatted.
So, here's df1, dates_range:
import pandas as pd
import numpy as np
dates_range = {"uniqueID": [1, 2, 3, 4, 1, 7, 10, 11, 3, 4, 7, 10],
"start": ["12/31/2019", "12/31/2019", "12/31/2019", "12/31/2019", "02/01/2020", "02/01/2020", "02/01/2020", "02/01/2020", "03/03/2020", "03/03/2020", "03/03/2020", "03/03/2020"],
"end": ["01/04/2020", "01/04/2020", "01/04/2020", "01/04/2020", "02/05/2020", "02/05/2020", "02/05/2020", "02/05/2020", "03/08/2020", "03/08/2020", "03/08/2020", "03/08/2020"],
"df1_tag1": ["v1", "v1", "v1", "v1", "v2", "v2", "v2", "v2", "v3", "v3", "v3", "v3"]}
df_dates_range = pd.DataFrame(dates_range,
columns = ["uniqueID",
"start",
"end",
"df1_tag1"])
df_dates_range[["start","end"]] = df_dates_range[["start","end"]].apply(pd.to_datetime, infer_datetime_format = True)
df2, values:
values = {"uniqueID": [1, 2, 7, 3, 4, 4, 10, 1, 8, 7, 10, 9, 10, 8, 3, 10, 11, 3, 7, 4, 10, 14],
"df2_tag1": ["abc", "abc", "abc", "abc", "abc", "def", "abc", "abc", "abc", "abc", "abc", "abc", "def", "def", "abc", "abc", "abc", "def", "abc", "abc", "def", "abc"],
"df2_tag2": ["type 1", "type 1", "type 2", "type 2", "type 1", "type 2", "type 1", "type 2", "type 2", "type 1", "type 2", "type 1", "type 1", "type 2", "type 1", "type 1", "type 2", "type 1", "type 2", "type 1", "type 1", "type 1"],
"day": ["01/01/2020", "01/02/2020", "01/03/2020", "01/03/2020", "01/04/2020", "01/04/2020", "01/04/2020", "02/01/2020", "02/02/2020", "02/03/2020", "02/03/2020", "02/04/2020", "02/05/2020", "02/05/2020", "03/03/2020", "03/04/2020", "03/04/2020", "03/06/2020", "03/06/2020", "03/07/2020", "03/06/2020", "04/08/2020"],
"df2_value1": [2, 10, 6, 5, 7, 9, 3, 10, 9, 7, 4, 9, 1, 8, 7, 5, 4, 4, 2, 8, 8, 4],
"df2_value2": [1, 5, 10, 13, 15, 10, 12, 50, 3, 10, 2, 1, 4, 6, 80, 45, 3, 30, 20, 7.5, 15, 3],
"df2_value3": [0.547, 2.160, 0.004, 9.202, 7.518, 1.076, 1.139, 25.375, 0.537, 7.996, 1.475, 0.319, 1.118, 2.927, 7.820, 19.755, 2.529, 2.680, 17.762, 0.814, 1.201, 2.712]}
values["day"] = pd.to_datetime(values["day"], format = "%m/%d/%Y")
df_values = pd.DataFrame(values,
columns = ["uniqueID",
"df2_tag1",
"df2_tag2",
"day",
"df2_value1",
"df2_value2",
"df2_value1"])
Based on the first link, I tried running the following:
df_dates_range.index = pd.IntervalIndex.from_arrays(df_dates_range["start"],
df_dates_range["end"],
closed = "both")
df_values_date_index = df_values.set_index(pd.DatetimeIndex(df_values["day"]))
df_values = df_values_date_index["day"].apply( lambda x : df_values_date_index.iloc[df_values_date_index.index.get_indexer_non_unique(x)])
However, I got the error below. As a n00b sanity check I removed the day index from the second-to-last line, but the problem persisted.
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-58-54ea384e06f7> in <module>
14 df_values_date_index = df_values.set_index(pd.DatetimeIndex(df_values["day"]))
15
---> 16 df_values = df_values_date_index["day"].apply( lambda x : df_values_date_index.iloc[df_values_date_index.index.get_indexer_non_unique(x)])
C:\anaconda\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
3846 else:
3847 values = self.astype(object).values
-> 3848 mapped = lib.map_infer(values, f, convert=convert_dtype)
3849
3850 if len(mapped) and isinstance(mapped[0], Series):
pandas\_libs\lib.pyx in pandas._libs.lib.map_infer()
<ipython-input-58-54ea384e06f7> in <lambda>(x)
14 df_values_date_index = df_values.set_index(pd.DatetimeIndex(df_values["day"]))
15
---> 16 df_values = df_values_date_index["day"].apply( lambda x : df_values_date_index.iloc[df_values_date_index.index.get_indexer_non_unique(x)])
C:\anaconda\lib\site-packages\pandas\core\indexes\base.py in get_indexer_non_unique(self, target)
4471 @Appender(_index_shared_docs["get_indexer_non_unique"] % _index_doc_kwargs)
4472 def get_indexer_non_unique(self, target):
-> 4473 target = ensure_index(target)
4474 pself, ptarget = self._maybe_promote(target)
4475 if pself is not self or ptarget is not target:
C:\anaconda\lib\site-packages\pandas\core\indexes\base.py in ensure_index(index_like, copy)
5355 index_like = copy(index_like)
5356
-> 5357 return Index(index_like)
5358
5359
C:\anaconda\lib\site-packages\pandas\core\indexes\base.py in __new__(cls, data, dtype, copy, name, tupleize_cols, **kwargs)
420 return Index(np.asarray(data), dtype=dtype, copy=copy, name=name, **kwargs)
421 elif data is None or is_scalar(data):
--> 422 raise cls._scalar_data_error(data)
423 else:
424 if tupleize_cols and is_list_like(data):
TypeError: Index(...) must be called with a collection of some kind, Timestamp('2020-01-01 00:00:00') was passed
The expected result should be:
desired = {"uniqueID": [1, 2, 3, 4, 1, 7, 10, 11, 3, 4, 7, 10],
"start": ["12/31/2019", "12/31/2019", "12/31/2019", "12/31/2019", "02/01/2020", "02/01/2020", "02/01/2020", "02/01/2020", "03/03/2020", "03/03/2020", "03/03/2020", "03/03/2020"],
"end": ["01/04/2020", "01/04/2020", "01/04/2020", "01/04/2020", "02/05/2020", "02/05/2020", "02/05/2020", "02/05/2020", "03/08/2020", "03/08/2020", "03/08/2020", "03/08/2020"],
"df1_tag1": ["v1", "v1", "v1", "v1", "v2", "v2", "v2", "v2", "v3", "v3", "v3", "v3"],
"df2_tag1": ["abc", "abc", "abc", "abc", "abc", "abc", "abc", "abc", "abc", "abc", "abc", "abc"],
"df2_value1": [2, 10, 5, 16, 10, 7, 5, np.nan, 11, 8, 2, 8],
"df2_value2+df2_value3": [1.547, 7.160, 22.202, 33.595, 75.375, 17.996, 8.594, np.nan, 120.501, 8.314, 37.762, 16.201],
"df2_tag3": ["abc", "abc", "abc", "abc", "abc", "abc", "abc", "abc", "abc", "abc", "abc", "abc"]}
df_desired = pd.DataFrame(desired,
columns = ["uniqueID",
"start",
"end",
"df1_tag1",
"df2_tag1",
"df2_value1",
"df2_value2+df2_value3",
"df2_tag3"])
df_desired[["start","end"]] = df_desired[["start","end"]].apply(pd.to_datetime, infer_datetime_format = True)
Or, visualized graphically:
Note that cols S & T at row 10 are NaN because uniqueID 11 had no "activity" during the v2 period; however, if possible, I'd love to be able to pull the tags from df2 anyway. They are 100% there, just maybe not for that period; perhaps a task for a second script? Also, note that col T is the sum of cols J+K.
EDIT: Forgot to mention, I previously tried @firelynx's solution in this question, but despite my 32GB of RAM my machine couldn't cope. The SQL alternative didn't work for me either, due to some issue with the sqlite3 library.
The simplest thing in these situations, if you can afford it hardware-wise, is to create a temporary merged DataFrame and then aggregate. This has the big benefit of separating the merge from the aggregation, which greatly reduces complexity.
In [22]: df = pd.merge(df_dates_range, df_values)
Out[22]:
uniqueID start end day value1 medium
0 1 2019-12-31 2020-01-04 2020-01-01 1 Online
1 1 2019-12-31 2020-01-04 2020-02-01 50 Online
2 1 2020-02-01 2020-02-05 2020-01-01 1 Online
3 1 2020-02-01 2020-02-05 2020-02-01 50 Online
4 2 2019-12-31 2020-01-04 2020-01-02 5 Shop
.. ... ... ... ... ... ...
23 10 2020-02-01 2020-02-05 2020-03-04 45 Shop
24 10 2020-03-03 2020-03-08 2020-01-03 13 Shop
25 10 2020-03-03 2020-03-08 2020-02-03 2 Online
26 10 2020-03-03 2020-03-08 2020-03-04 45 Shop
27 11 2020-02-01 2020-02-05 2020-02-05 4 Shop
In [24]: df = df[(df['day'] > df['start']) & (df['day'] <= df['end'])]
Out[24]:
uniqueID start end day value1 medium
0 1 2019-12-31 2020-01-04 2020-01-01 1 Online
4 2 2019-12-31 2020-01-04 2020-01-02 5 Shop
5 3 2019-12-31 2020-01-04 2020-01-04 12 Shop
10 3 2020-03-03 2020-03-08 2020-03-06 30 Online
11 4 2019-12-31 2020-01-04 2020-01-04 15 Online
12 4 2019-12-31 2020-01-04 2020-01-04 10 Shop
16 7 2020-02-01 2020-02-05 2020-02-03 10 Shop
20 7 2020-03-03 2020-03-08 2020-03-06 20 Shop
22 10 2020-02-01 2020-02-05 2020-02-03 2 Online
26 10 2020-03-03 2020-03-08 2020-03-04 45 Shop
27 11 2020-02-01 2020-02-05 2020-02-05 4 Shop
Then you can do something like
In [30]: df.groupby(['start', 'end', 'uniqueID', 'medium'])['value1'].agg(['count', 'sum']).reset_index()
Out[30]:
start end uniqueID medium count sum
0 2019-12-31 2020-01-04 1 Online 1 1
1 2019-12-31 2020-01-04 2 Shop 1 5
2 2019-12-31 2020-01-04 3 Shop 1 12
3 2019-12-31 2020-01-04 4 Online 1 15
4 2019-12-31 2020-01-04 4 Shop 1 10
5 2020-02-01 2020-02-05 7 Shop 1 10
6 2020-02-01 2020-02-05 10 Online 1 2
7 2020-02-01 2020-02-05 11 Shop 1 4
8 2020-03-03 2020-03-08 3 Online 1 30
9 2020-03-03 2020-03-08 7 Shop 1 20
10 2020-03-03 2020-03-08 10 Shop 1 45
to aggregate the data into the desired shape. However, I don't quite get the result you expected: there are rows with Shop in the values, and some of the dates are slightly off. I blame the initial data ;) Hopefully this pushes you in the right direction.
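For completeness, the same merge-filter-aggregate approach can be sketched against the question's actual column names (a minimal sketch with tiny stand-in frames; whether the boundary comparisons should be inclusive or exclusive is an assumption you may need to adjust, since my filter above uses `>` on start):

```python
import pandas as pd

# tiny stand-ins for the question's frames
df_dates_range = pd.DataFrame({
    "uniqueID": [1, 2],
    "start": pd.to_datetime(["12/31/2019", "12/31/2019"]),
    "end": pd.to_datetime(["01/04/2020", "01/04/2020"]),
    "df1_tag1": ["v1", "v1"],
})
df_values = pd.DataFrame({
    "uniqueID": [1, 2, 2],
    "day": pd.to_datetime(["01/01/2020", "01/02/2020", "02/02/2020"]),
    "df2_value1": [2, 10, 99],
})

# merge on uniqueID, keep only rows whose day falls inside [start, end], then sum
merged = df_dates_range.merge(df_values, on="uniqueID")
inside = merged[(merged["day"] >= merged["start"]) & (merged["day"] <= merged["end"])]
result = (inside.groupby(["uniqueID", "start", "end", "df1_tag1"])["df2_value1"]
                .sum()
                .reset_index())
```

Here the 02/02/2020 row for uniqueID 2 is dropped because it falls outside the interval, and the remaining rows are summed per uniqueID+interval.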
Side note: if you are only interested in the first or last value of an interval, pd.merge_asof is an interesting option:
In [17]: pd.merge_asof(df_dates_range, df_values, left_on='start', right_on='day', by='uniqueID', direction='forward')
Out[17]:
uniqueID start end day value1 medium
0 1 2019-12-31 2020-01-04 2020-01-01 1.0 Online
1 2 2019-12-31 2020-01-04 2020-01-02 5.0 Shop
2 3 2019-12-31 2020-01-04 2020-01-04 12.0 Shop
3 4 2019-12-31 2020-01-04 2020-01-04 15.0 Online
4 1 2020-02-01 2020-02-05 2020-02-01 50.0 Online
5 7 2020-02-01 2020-02-05 2020-02-03 10.0 Shop
6 10 2020-02-01 2020-02-05 2020-02-03 2.0 Online
7 11 2020-02-01 2020-02-05 2020-02-05 4.0 Shop
8 3 2020-03-03 2020-03-08 2020-03-03 80.0 Online
9 4 2020-03-03 2020-03-08 NaT NaN NaN
10 7 2020-03-03 2020-03-08 2020-03-06 20.0 Shop
11 10 2020-03-03 2020-03-08 2020-03-04 45.0 Shop
However, squeezing an aggregation into that is close to impossible.
Finally cracked it!
Since IntervalIndex can only handle unique dates, what I did was map those unique start:end intervals, together with their unique tags, onto df_values. The mistake I'd been making was passing the whole df_dates_range as the values of the IntervalIndex array, so it was just a matter of extracting the unique ones first. One thing I'm still not clear on is what happens when any interval range has more than one applicable df1_tag1 value; hopefully it just creates a list of tags and keeps working.
Bear in mind that before doing the below, I needed to convert my df_values from long to wide format, for which I used a groupby, since the long format produced duplicate uniqueID+day rows. For some reason I couldn't reproduce that step with the sample data here, but in any case, provided your data is in the format you need (wide vs. long), the below should work as long as df_values has no duplicate uniqueID+day rows.
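That long-to-wide collapse might look something like this (a sketch with hypothetical sample rows, assuming summing is the right aggregation for the value columns):

```python
import pandas as pd

# long format: duplicate uniqueID+day rows
df_long = pd.DataFrame({
    "uniqueID": [1, 1, 2],
    "day": pd.to_datetime(["01/01/2020", "01/01/2020", "01/02/2020"]),
    "df2_value1": [2, 3, 10],
    "df2_value2": [1.0, 0.5, 5.0],
})

# collapse to exactly one row per uniqueID+day, summing the value columns
df_wide = (df_long.groupby(["uniqueID", "day"], as_index=False)
                  [["df2_value1", "df2_value2"]]
                  .sum())

# sanity check: no duplicate uniqueID+day rows remain
assert not df_wide.duplicated(subset=["uniqueID", "day"]).any()
```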
After that, I did the following:
# In order to bypass the non "uniqueness" of the desired tag to apply, we create a list of the unique df1_tag1's with their respective start:end dates
df1_tag1_list = df_dates_range.groupby(["start",
"end",
"df1_tag1"]).size().reset_index().rename(columns={0:'records'})
Then,
# Create a new pandas IntervalIndex series variable to then map ("paste onto") the copied df_values using
applicable_df1_tag1 = pd.Series(df1_tag1_list["df1_tag1"].values,
pd.IntervalIndex.from_arrays(df1_tag1_list['start'],
df1_tag1_list['end']))
# copy df_values, then map the applicable df1_tag1 onto each row by its day
df_values_with_df1_tag1 = df_values.copy()
df_values_with_df1_tag1["applicable_df1_tag1"] = df_values_with_df1_tag1["day"].map(applicable_df1_tag1)
The result of this should be the aggregated df_values (or whatever other math you did during the groupby) with non-duplicate uniqueID+day rows, now carrying a mapped df1_tag1 that, together with uniqueID, can be used to merge back into df_dates_range.
Hope this is an answer that works for someone :)
EDIT: One more thing that might be important: when doing the left merge, to avoid unwanted duplication I used the following:
df_date_ranges_all = df_date_ranges.merge(df_values_wide_with_df1_tag1.drop_duplicates(subset = ['uniqueID'],
keep = "last"),
how = "left",
left_on = ["uniqueID", "df1_tag1"],
right_on = ["uniqueID", "applicable_df1_tag1"],
indicator = True)