Pandas pd.cut on Timestamps - “ValueError: bins must increase monotonically”

Question (votes: 2, answers: 1)

I am trying to split time series data into labelled segments, like this:

import pandas as pd
import numpy as np

# Create example DataFrame of stock values
df = pd.DataFrame({
    'ticker':np.repeat( ['aapl','goog','yhoo','msft'], 25 ),
    'date':np.tile( pd.date_range('1/1/2011', periods=25, freq='D'), 4 ),
    'price':(np.random.randn(100).cumsum() + 10) })

# Cut the date into sections 
today = df['date'].max()
bin_edges = [pd.Timestamp.min, today - pd.Timedelta('14 days'), today - pd.Timedelta('7 days'), pd.Timestamp.max]
df['Time Group'] = pd.cut(df['date'], bins=bin_edges, labels=['history', 'previous week', 'this week'])

But I get an error, even though bin_edges appears to be monotonically increasing.

Traceback (most recent call last):
  File "C:\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 3267, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-42-00524c0a883b>", line 13, in <module>
    df['Time Group'] = pd.cut(df['date'], bins=bin_edges, labels=['history', 'previous week', 'this week'])
  File "C:\Anaconda3\lib\site-packages\pandas\core\reshape\tile.py", line 228, in cut
    raise ValueError('bins must increase monotonically.')
ValueError: bins must increase monotonically.


In[43]: bin_edges
Out[43]: 
[Timestamp('1677-09-21 00:12:43.145225'),
 Timestamp('2011-01-11 00:00:00'),
 Timestamp('2011-01-18 00:00:00'),
 Timestamp('2262-04-11 23:47:16.854775807')]

Why is this happening?

python pandas datetime binning
1 answer
2 votes

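This is a bug in pandas. Your edges need to be converted to numeric values in order to perform the cut, and by using pd.Timestamp.min and pd.Timestamp.max you are effectively placing the outer edges at the lower and upper bounds of what a 64-bit integer can represent. When the edges are checked for monotonicity, the difference between adjacent edges overflows, which makes the sequence look as if it is not monotonically increasing.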

Demonstration of the overflow:

In [2]: bin_edges_numeric = [t.value for t in bin_edges]

In [3]: bin_edges_numeric
Out[3]:
[-9223372036854775000,
 1294704000000000000,
 1295308800000000000,
 9223372036854775807]

In [4]: np.diff(bin_edges_numeric)
Out[4]:
array([-7928668036854776616,      604800000000000,  7928063236854775807],
      dtype=int64)
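
The negative first difference above can be reproduced without pandas. The following is a minimal sketch, assuming only NumPy and the four nanosecond values copied from the output above; it illustrates the int64 wraparound and is not a copy of the pandas internals.

```python
import numpy as np

# Nanosecond values of the four bin edges, copied from the output above.
edges = np.array([-9223372036854775000,
                  1294704000000000000,
                  1295308800000000000,
                  9223372036854775807], dtype=np.int64)

# Comparing neighbours directly shows the edges really are increasing.
values_increasing = bool(np.all(edges[1:] > edges[:-1]))

# Subtracting neighbours is a different matter: the first gap does not
# fit in int64, so it wraps around to a negative number.
diffs = np.diff(edges)
diff_check_passes = bool(np.all(diffs > 0))

# Python integers have arbitrary precision, so they give the true gap.
true_first_gap = int(edges[1]) - int(edges[0])
int64_max = int(np.iinfo(np.int64).max)

print(values_increasing, diff_check_passes)
print(true_first_gap > int64_max)
```

The edges themselves are in order; it is only a check based on differences that reports them as non-monotonic.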

Until this is fixed, my suggestion is to use lower and upper bounds that are closer to your actual dates but still give the same end result:

first = df['date'].min()
today = df['date'].max()
bin_edges = [first - pd.Timedelta('1000 days'), today - pd.Timedelta('14 days'),
             today - pd.Timedelta('7 days'), today + pd.Timedelta('1000 days')]

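I picked 1000 days arbitrarily; you can choose a different value as you see fit. With these modifications, cut should no longer raise the error.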
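
Putting the question's DataFrame and the adjusted edges together gives a self-contained sketch of the workaround. The `counts` variable and the use of `.values` inside `np.tile` are additions for illustration, not part of the original post, and the 1000-day padding is the same arbitrary choice as above.

```python
import numpy as np
import pandas as pd

# Example DataFrame of stock values, as in the question.
dates = pd.date_range('1/1/2011', periods=25, freq='D')
df = pd.DataFrame({
    'ticker': np.repeat(['aapl', 'goog', 'yhoo', 'msft'], 25),
    'date': np.tile(dates.values, 4),
    'price': np.random.randn(100).cumsum() + 10,
})

# Outer edges padded by an arbitrary 1000 days instead of
# pd.Timestamp.min / pd.Timestamp.max, so edge differences fit in int64.
first = df['date'].min()
today = df['date'].max()
bin_edges = [first - pd.Timedelta('1000 days'), today - pd.Timedelta('14 days'),
             today - pd.Timedelta('7 days'), today + pd.Timedelta('1000 days')]

df['Time Group'] = pd.cut(df['date'], bins=bin_edges,
                          labels=['history', 'previous week', 'this week'])

# Rows per segment (hypothetical helper variable, for inspection only).
counts = df['Time Group'].value_counts()
print(counts)
```

Because pd.cut uses right-closed intervals by default, a date that falls exactly on an inner edge is assigned to the earlier segment.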
