Evaluating frequency, duration, and value of a time series

Question

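I'm new to Python and have a simple question that I haven't found an answer to yet. Suppose I have a time series c(t):

I now want to evaluate this series in terms of how long the value c stays continuously within a given range, and how often periods of that length occur.

The result would therefore have three columns: c (binned), duration (binned), and frequency. Translated into a simple example, the result might look like this: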

c_      Dt_  Freq_ 
0-50    8    1 
50-100  2    1
0-50    5    1

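Can you give me a suggestion?

Thanks in advance,

Ulrike

//Edit: Thanks for your replies! My sample data was somewhat flawed, so I couldn't show part of my problem. So here is a new data series:

If I apply the code suggested by Christoph below: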

bins = pd.cut(series['c'], [-1, 5, 100])
same_as_prev = (bins != bins.shift())
run_ids = same_as_prev.cumsum()
result = bins.groupby(run_ids).aggregate(["first", "count"])

I get a result like this:

first   count
(-1, 5]   2
(5, 100]  3
(-1, 5]   2
(5, 100]  3
(-1, 5]   3

But I'm more interested in something like this:

c        length  freq
(-1, 5]    2      2
(-1, 5]    3      1
(5, 100]   3      2

How can I achieve this? And how could I plot it as a KDE plot?

Best,

Ulrike

Answer 1

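Nicely asked, with an example :) Here is one way to do it. It is most likely incomplete, but it should help you.

Since your data is sampled at fixed time increments, I don't build a time series and simply use the index as time. So I convert c to an array and use np.where() to find the values that fall in each bin.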

import numpy as np

c = np.array([40, 41, 4, 5, 7, 20, 20, 8, 90, 99, 10, 5, 8, 8, 19])

bin1 = np.where((0 <= c) & (c <= 50))[0]
bin2 = np.where((50 < c) & (c <= 100))[0]

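For bin1, the output is array([ 0, 1, 2, 3, 4, 5, 6, 7, 10, 11, 12, 13, 14], dtype=int64), which holds the indices of the values from c that fall in that bin.

The next step is to find the runs of consecutive indices. Following this SO post: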

from itertools import groupby
from operator import itemgetter

data = bin1
for k, g in groupby(enumerate(data), lambda ix : ix[0] - ix[1]):
    print(list(map(itemgetter(1), g)))

# Output is:
#[0, 1, 2, 3, 4, 5, 6, 7]
#[10, 11, 12, 13, 14]
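
As an aside, the same grouping of consecutive indices can be done in NumPy alone. The following is a minimal sketch, not part of the original answer, that assumes the indices are sorted in ascending order (as np.where() returns them); the names breaks, runs, and run_lengths are chosen here for illustration.

```python
import numpy as np

# Indices of the values that fell into bin1 in the example above.
idx = np.array([0, 1, 2, 3, 4, 5, 6, 7, 10, 11, 12, 13, 14])

# A step other than 1 between neighbouring indices marks the start of a new run.
breaks = np.where(np.diff(idx) != 1)[0] + 1

# Split the index array at those positions to get one array per run.
runs = np.split(idx, breaks)
run_lengths = [len(r) for r in runs]

for r in runs:
    print(r.tolist())
print(run_lengths)
```

Each entry of run_lengths is the duration of one uninterrupted stay in the bin, measured in samples.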

Last step: put the new sub-bins in the right order and keep track of which bin each sub-bin belongs to. The complete code then looks like this:

import numpy as np
from itertools import groupby
from operator import itemgetter

c = np.array([40, 41, 4, 5, 7, 20, 20, 8, 90, 99, 10, 5, 8, 8, 19])

bin1 = np.where((0 <= c) & (c <= 50))[0]
bin2 = np.where((50 < c) & (c <= 100))[0]

# 1 and 2 for the range names.
bins = [(bin1, 1), (bin2, 2)]
subbins = list()

for b in bins:
    data = b[0]
    name = b[1] # 1 or 2
    for k, g in groupby(enumerate(data), lambda ix : ix[0] - ix[1]):
        subbins.append((list(map(itemgetter(1), g)), name))

subbins = sorted(subbins, key=lambda x: x[0][0])

Output: [([0, 1, 2, 3, 4, 5, 6, 7], 1), ([8, 9], 2), ([10, 11, 12, 13, 14], 1)]

After that, you just need to compute whatever statistics you want :)
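
One possible way to get those statistics is to count how often each combination of bin and run length occurs. This is only a sketch built on the output shown above; the bin_labels mapping is a hypothetical naming for bins 1 and 2 and is not part of the original answer.

```python
from collections import Counter

# Output of the complete code above: (indices in the run, bin name).
subbins = [([0, 1, 2, 3, 4, 5, 6, 7], 1), ([8, 9], 2), ([10, 11, 12, 13, 14], 1)]

# Hypothetical labels for the two bins, chosen here for readability.
bin_labels = {1: "0-50", 2: "50-100"}

# Count how often each (bin, run length) combination occurs.
freq = Counter((bin_labels[name], len(indices)) for indices, name in subbins)

# Print one row per combination: bin, duration, frequency.
for (label, duration), count in sorted(freq.items()):
    print(f"{label:8s}{duration:5d}{count:6d}")
```

With this sample data every combination occurs exactly once, which matches the frequency column in the question's first example.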

Answer 2

import pandas as pd

def bin_run_lengths(series, bins):

    binned = pd.cut(pd.Series(series), bins)
    return binned.groupby(
        (1 - (binned == binned.shift())).cumsum()
    ).aggregate(
        ["first", "count"]
    )

(I'm not sure where your frequency column comes from; in the problem as you describe it, it seems to always be 1.)

Binning

Binning a series is easy with pandas.cut():

https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.cut.html

import pandas as pd

pd.cut(pd.Series(range(100)), bins=[-1,0,10,20,50,100])

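The bins here are given as boundaries that are exclusive on the left and inclusive on the right; the argument can also be given in other forms.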

0       (-1.0, 0.0]
1       (0.0, 10.0]
2       (0.0, 10.0]
3       (0.0, 10.0]
4       (0.0, 10.0]
5       (0.0, 10.0]
6       (0.0, 10.0]
          ...
19     (10.0, 20.0]
20     (10.0, 20.0]
21     (20.0, 50.0]
22     (20.0, 50.0]
23     (20.0, 50.0]
          ...
29     (20.0, 50.0]
          ...      
99    (50.0, 100.0]
Length: 100, dtype: category
Categories (5, interval[int64]): [(-1, 0] < (0, 10] < (10, 20] < (20, 50] < (50, 100]]

This converts it from a series of values into a series of intervals.
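
To illustrate how the boundaries behave and what other forms the bins argument can take, here is a small sketch. The sample values are made up for this example; right=False and pd.IntervalIndex.from_breaks() are standard pandas features, but which form fits your data is for you to judge.

```python
import pandas as pd

values = pd.Series([0, 5, 10, 20, 50])

# Default: intervals are open on the left and closed on the right,
# so the value 10 lands in (0, 10].
right_closed = pd.cut(values, bins=[-1, 0, 10, 20, 50, 100])

# right=False flips this: intervals are closed on the left,
# so the value 10 lands in [10, 20).
left_closed = pd.cut(values, bins=[-1, 0, 10, 20, 50, 100], right=False)

# An IntervalIndex is another accepted form for the bins argument.
edges = pd.IntervalIndex.from_breaks([0, 10, 20, 50, 100], closed="left")
from_index = pd.cut(values, bins=edges)

print(right_closed.astype(str).tolist())
print(left_closed.astype(str).tolist())
print(from_index.astype(str).tolist())
```

Values that sit exactly on a bin edge are the ones affected by this choice, so it is worth checking which side your edges should belong to.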

Count consecutive values

There is no native idiom for this in pandas, but it is fairly easy with a few common functions. The top-voted StackOverflow answer is very good: Counting consecutive positive value in Python array

same_as_prev = (series != series.shift())

This yields a boolean series that is True wherever the value differs from the previous one (despite the variable's name).

run_ids = same_as_prev.cumsum()

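This produces an integer series that increases by one each time the value changes and a new run begins, thereby assigning every position in the series a "run ID".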
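
The two steps above can be traced on a tiny example. This sketch uses made-up string labels instead of intervals, and the name changed is chosen here for clarity.

```python
import pandas as pd

# Hypothetical toy data: bin labels forming three runs.
series = pd.Series(["low", "low", "high", "high", "high", "low"])

# True wherever the value differs from the one before it. The first element
# is compared with the NaN introduced by shift(), so it is always True.
changed = series != series.shift()

# The cumulative sum of the change flags gives one integer ID per run.
run_ids = changed.cumsum()

print(changed.tolist())   # [True, False, True, False, False, True]
print(run_ids.tolist())   # [1, 1, 2, 2, 2, 3]
```

Because the first element always counts as a change, the run IDs in this sketch start at 1.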

result = series.groupby(run_ids).aggregate(["first", "count"])

This produces a data frame showing the value in each run and the length of that run:

      first   count
0   (-1, 0]      1
1   (0, 10]     10
2   (10, 20]    10
3   (20, 50]    30
4   (50, 100]   49
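
To get from this per-run table to the bin, length, and frequency layout requested in the question's edit, one option is to group the runs once more and count them. This is a sketch under assumptions: the data series c below is invented to reproduce the run pattern from the question (the original data is not shown), and the interval labels are converted to strings to keep the grouping simple.

```python
import pandas as pd

# Hypothetical data chosen to reproduce the run pattern in the question's edit.
c = pd.Series([1, 2, 10, 20, 30, 3, 4, 50, 60, 70, 0, 1, 2])

# Bin the values and keep the interval labels as plain strings.
labels = pd.cut(c, [-1, 5, 100]).astype(str)

# Assign a run ID that increases whenever the bin changes.
run_ids = (labels != labels.shift()).cumsum()

# One row per run: the bin it belongs to and how many samples it spans.
runs = labels.groupby(run_ids).agg(["first", "count"])
runs.columns = ["c", "length"]

# Count how many runs share the same (bin, length) combination.
freq = runs.groupby(["c", "length"]).size().reset_index(name="freq")
print(freq)
```

The resulting freq table has one row per distinct combination of bin and run length, with the number of such runs in the freq column.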
