Find the longest run of consecutive zeros for each user in a DataFrame

Question (votes: 0, answers: 5)

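I'm looking for the longest run of consecutive zeros in a DataFrame, with the result grouped by user. I'm interested in running a run-length encoding (RLE) on the usage column.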

Sample input:

user  day  usage
A     1    0
A     2    0
A     3    1
B     1    0
B     2    1
B     3    0
B-----3------0

Desired output:

user  longest_run
A     2
B     1

mydata <- mydata[order(mydata$user, mydata$day),]
user <- unique(mydata$user)
d2 <- data.frame(matrix(NA, ncol = 2, nrow = length(user)))
names(d2) <- c("user", "longest_no_usage")
d2$user <- user
for (i in user) {
  if (0 %in% mydata$usage[mydata$user == i]) {
    run <- rle(mydata$usage[mydata$user == i]) #Run Length Encoding
    d2$longest_no_usage[d2$user == i] <- max(run$lengths[run$values == 0])
  } else {
    d2$longest_no_usage[d2$user == i] <- 0 #some users did not have no-usage days
  }
}
d2 <- d2[order(-d2$longest_no_usage),]

This works in R, but I want to do the same thing in Python and I'm completely stumped.

python binary counting run-length-encoding
5 Answers
4 votes

First use groupby with size, grouping by the columns user and usage plus a helper Series that labels runs of consecutive values:

print (df)
  user  day  usage
0    A    1      0
1    A    2      0
2    A    3      1
3    B    1      0
4    B    2      1
5    B    3      0
6    C    1      1


df1 = (df.groupby([df['user'], 
                   df['usage'].rename('val'), 
                   df['usage'].ne(df['usage'].shift()).cumsum()])
        .size()
        .to_frame(name='longest_run'))

print (df1)
                longest_run
user val usage             
A    0   1                2
     1   2                1
B    0   3                1
         5                1
     1   4                1
C    1   6                1

Then keep only the rows where val is 0, take the max per user, and add reindex to bring back the users that have no zeros at all:

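# groupby(level=0).max() replaces the older max(level=0),
# which recent pandas versions no longer accept.
df2 = (df1.query('val == 0')
          .groupby(level=0).max()
          .reindex(df['user'].unique(), fill_value=0)
          .reset_index())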
print (df2)
  user  longest_run
0    A            2
1    B            1
2    C            0

Details:

print (df['usage'].ne(df['usage'].shift()).cumsum())
0    1
1    1
2    2
3    3
4    4
5    5
6    6
Name: usage, dtype: int32
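
The helper Series printed above is what separates the runs: a new label starts wherever usage differs from the previous row. Because user is also one of the grouping keys, a run that happens to continue across the boundary between two users is still counted separately for each user. A compact variant of the same idea is sketched below. It is only an illustration: the names run_id, zero_runs and longest are made up for this example, and it filters the zero rows before counting instead of afterwards.

```python
import pandas as pd

df = pd.DataFrame({
    "user":  ["A", "A", "A", "B", "B", "B", "C"],
    "day":   [1, 2, 3, 1, 2, 3, 1],
    "usage": [0, 0, 1, 0, 1, 0, 1],
})

# A new run starts wherever usage differs from the previous row.
run_id = df["usage"].ne(df["usage"].shift()).cumsum()

# Size of every (user, run) block, keeping only the rows where usage is 0.
zero_runs = (
    df.assign(run_id=run_id)
      .loc[lambda d: d["usage"].eq(0)]
      .groupby(["user", "run_id"])
      .size()
)

# Longest zero run per user; users without any zero get 0.
longest = (
    zero_runs.groupby(level="user").max()
             .reindex(df["user"].unique(), fill_value=0)
             .rename("longest_run")
)
print(longest)
```

For the sample frame this gives 2 for A, 1 for B and 0 for C, the same values as df2 above.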

2 votes

To get the maximum number of consecutive zeros in a Series:

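import pandas as pd

def max0(sr):
    # Each non-zero value starts a new group; count only the zeros in each group.
    zeros_per_group = (sr == 0).groupby((sr != 0).cumsum()).sum()
    return int(zeros_per_group.max()) if len(zeros_per_group) else 0


max0(pd.Series([1,0,0,0,0,2,3]))

4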
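
The function above works on a single Series. To get one value per user it can be applied group by group. The following is a sketch under assumptions: the column names user, day and usage follow the question, and longest_zero_run is a hypothetical helper written for this example, not part of pandas.

```python
import pandas as pd

def longest_zero_run(sr: pd.Series) -> int:
    """Length of the longest block of consecutive zeros in sr (0 if none)."""
    is_zero = sr.eq(0)
    # Every non-zero value opens a new group; the zeros after it share its label.
    group_id = sr.ne(0).cumsum()
    # Summing the boolean mask per group counts the zeros in that group.
    lengths = is_zero.groupby(group_id).sum()
    return int(lengths.max()) if len(lengths) else 0

df = pd.DataFrame({
    "user":  ["A", "A", "A", "B", "B", "B"],
    "day":   [1, 2, 3, 1, 2, 3],
    "usage": [0, 0, 1, 0, 1, 0],
})

# Sort so that each user's rows are in day order before measuring runs.
result = (
    df.sort_values(["user", "day"])
      .groupby("user")["usage"]
      .apply(longest_zero_run)
      .rename("longest_run")
)
print(result)
```

On the question's sample data this yields 2 for A and 1 for B, which matches the desired output.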


0 votes

I think the following does what you need, where the consecutive_zero function is an adaptation of the top answer here.

Hope this helps!

import pandas as pd
from itertools import groupby

df = pd.DataFrame([['A', 1], ['A', 0], ['A', 0], ['B', 0],['B',1],['C',2]], 
                  columns=["user", "usage"])

def len_iter(items):
    return sum(1 for _ in items)

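def consecutive_zero(data):
    # Length of each run of zeros; default=0 covers data with no zeros at all.
    return max((len_iter(run) for val, run in groupby(data) if val == 0), default=0)

# Select the usage column first so only that column is passed to the function.
df.groupby('user')['usage'].apply(consecutive_zero)

Output:

user
A    2
B    1
C    0
Name: usage, dtype: int64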

0 votes

If you have a large dataset and speed matters, you may want to try the high-performance pyrle library.

Setup:

# pip install pyrle
# or 
# conda install -c bioconda pyrle
import numpy as np
np.random.seed(0)
import pandas as pd
from pyrle import Rle
size = int(1e7)
number = np.random.randint(2, size=size)
user = np.random.randint(5, size=size)
df = pd.DataFrame({"User": np.sort(user), "Number": number})
df
#          User  Number
# 0           0       0
# 1           0       1
# 2           0       1
# 3           0       0
# 4           0       1
# ...       ...     ...
# 9999995     4       1
# 9999996     4       1
# 9999997     4       0
# 9999998     4       0
# 9999999     4       1
# 
# [10000000 rows x 2 columns]

Execution:

for u, udf in df.groupby("User"):
    r = Rle(udf.Number)
    is_0 = r.values == 0
    print("User", u, "Max", np.max(r.runs[is_0]))
# (Wall time: 1.41 s)


# User 0 Max 20
# User 1 Max 23
# User 2 Max 20
# User 3 Max 22
# User 4 Max 23
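
For comparison with R's rle, a run-length encoding can also be sketched with NumPy alone. This is an illustration only and not how pyrle is implemented; the helper name rle and the small input array are made up for the example, and no timing is claimed for it.

```python
import numpy as np

def rle(values):
    """Return (run_values, run_lengths) for a 1-D array, similar to R's rle()."""
    values = np.asarray(values)
    if values.size == 0:
        return values, np.array([], dtype=int)
    # Positions where a new run begins: index 0 plus every change point.
    starts = np.flatnonzero(np.r_[True, values[1:] != values[:-1]])
    # Run lengths are the gaps between consecutive starts (and the end).
    lengths = np.diff(np.r_[starts, values.size])
    return values[starts], lengths

run_values, run_lengths = rle([0, 0, 1, 0, 1, 1, 0, 0, 0])

# Keep only the runs of zeros and take the longest one (0 if there are none).
zero_lengths = run_lengths[run_values == 0]
longest = int(zero_lengths.max()) if zero_lengths.size else 0
print(longest)  # 3
```

The same function could be called once per user inside a groupby loop like the one above.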

0 votes

Here is another, NumPy-centric approach:

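import numpy as np

def max0(x):
    # Positions of the zeros in x (x is expected to be a NumPy array).
    z = np.argwhere(x == 0).flatten()
    if not z.size:
        return 0
    # Consecutive positions collapse to one value once their rank is subtracted.
    z -= np.arange(len(z))
    # The largest count of any value is the longest run of zeros.
    return np.unique(z, return_counts=True)[-1].max()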
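
The trick in this function is that the positions of a run of zeros are consecutive integers, so subtracting 0, 1, 2, ... from them maps every position in the same run to one shared value. A small worked example, with an input array chosen only for illustration:

```python
import numpy as np

x = np.array([1, 0, 0, 0, 0, 2, 0, 0])

# Positions of the zeros.
z = np.flatnonzero(x == 0)            # [1 2 3 4 6 7]

# Subtract each zero's rank among the zeros; a run collapses to one key.
keys = z - np.arange(len(z))          # [1 1 1 1 2 2]

# The count of each key is the length of that run.
_, run_lengths = np.unique(keys, return_counts=True)   # [4 2]
longest = int(run_lengths.max())
print(longest)  # 4
```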