Bin a pandas DataFrame based on two columns

Question

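I have large text files that I need to bin based on two columns, and then add two new columns: the number of rows in each bin and the index of each row within its bin. I asked this question before and got a suitable reply here, but I now realize that I need to bin on an additional column as well. A simplified version of my dataset looks like this: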

   time  A  RN  NOR
0   100  1   0    0
1   101  1   0    0
2   104  2   0    0
3   105  3   0    0
4   107  3   0    0
5   110  3   0    0
6   114  4   0    0
7   115  5   0    0
8   116  5   0    0
9   118  6   0    0

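where RN is the index of each row within its bin and NR is the number of rows in each bin. I want to bin the data first by the time column, for example at intervals of 5, and then by column A at intervals of 1.

I would like the result to look like this: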

   time  A  RN  NR  bin
0   100  1   1   2    1
1   101  1   2   2    1

2   104  2   1   1    2

3   105  3   1   2    3
4   107  3   2   2    3

5   110  3   1   1    4

6   114  4   1   1    5

7   115  5   1   2    6
8   116  5   2   2    6

9   118  6   1   1    7

Here is the code that bins based on the time column only:

df = df.sort_values(by=['time'], ascending=True)

ranges1 = np.arange(df.time.min()-5, df.time.max()+5, 5)

ranges2 = np.arange(df.A.min()-1, df.A.max()+1, 1)

# bin the data based on two specific columns
bins1 = pd.cut(df.time, ranges1)
bins2 = pd.cut(df.A, ranges2)

# add a column for the row count in each bin
df['NR'] = df.groupby(bins1)['time'].transform('count')

# add a column for the index of each row in each bin
df['RN'] = df.groupby(bins1).cumcount()+1

But I can't figure out how to bin on both the time column and column A.

Answer 1

I think you need to change the right boundary of your bins and then use ngroup instead of cumcount.

bins3 = pd.cut(df.time, ranges1, right=False)
df['bin'] = df.groupby([bins3, 'A']).ngroup()+1

Output:

   time  A  RN  NR  bin
0   100  1   1   1    1
1   101  1   1   3    1
2   104  2   2   3    2
3   105  3   3   3    3
4   107  3   1   2    3
5   110  3   2   2    4
6   114  4   1   2    5
7   115  5   2   2    6
8   116  5   1   2    6
9   118  6   2   2    7
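
To illustrate why the right boundary matters, here is a minimal sketch using the question's time values. The names `edges`, `right_closed` and `left_closed` are made up for this illustration. It compares the bin each value lands in with the default `right=True` against `right=False`:

```python
import numpy as np
import pandas as pd

time = pd.Series([100, 101, 104, 105, 107, 110, 114, 115, 116, 118])
edges = np.arange(time.min() - 5, time.max() + 5, 5)  # 95, 100, ..., 120

# Default (right=True): intervals are closed on the right, e.g. (95, 100]
right_closed = pd.cut(time, edges)
# right=False: intervals are closed on the left, e.g. [100, 105)
left_closed = pd.cut(time, edges, right=False)

# Show the integer position of the bin each value falls into
print(pd.DataFrame({"time": time,
                    "right=True": right_closed.cat.codes,
                    "right=False": left_closed.cat.codes}))
```

With the default, 100 sits alone in (95, 100] and 105 is grouped with 101 and 104. With `right=False`, 100, 101 and 104 share [100, 105) and 105 starts the next bin, which is the grouping the question asks for. Note that RN and NR in the output above still come from the question's time-only grouping; only the bin column uses both columns.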

Answer 2

I recreated the DataFrame from your description. I have not dropped the 'NOR' column; the rest of the output should be the same.

import numpy as np
import pandas as pd

data = {'time': [100, 101, 104, 105, 107, 110, 114, 115, 116, 118],
        'A': [1, 1, 2, 3, 3, 3, 4, 5, 5, 6],
        'RN': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        'NOR': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}

df = pd.DataFrame(data)

# set bin sizes
time_bin_size = 5
A_bin_size = 1

# calculate bins
df['time_bin'] = (df['time'] // time_bin_size).astype(int)
df['A_bin'] = (df['A'] // A_bin_size).astype(int)

# create bin groups
groups = df.groupby(['time_bin', 'A_bin'])

# calculate bin statistics
df['RN'] = groups.cumcount() + 1
df['NR'] = groups['time'].transform('count')
df['bin'] = groups.ngroup() + 1

# drop the temporary columns
df = df.drop(columns=['time_bin', 'A_bin'])

# print the result
print(df)

# Output

   time  A  RN  NOR  NR  bin
0   100  1   1    0   2    1
1   101  1   2    0   2    1
2   104  2   1    0   1    2
3   105  3   1    0   2    3
4   107  3   2    0   2    3
5   110  3   1    0   1    4
6   114  4   1    0   1    5
7   115  5   1    0   2    6
8   116  5   2    0   2    6
9   118  6   1    0   1    7
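
One detail that may be worth checking on real data: plain floor division places bin edges at multiples of the bin size (..., 100, 105, 110, ...), whereas subtracting the minimum first places them relative to the smallest time value. For the sample data the two coincide because the minimum is 100. The sketch below uses hypothetical times that are not from the question to show where they differ:

```python
import pandas as pd

# Hypothetical times, chosen so that the minimum is not a multiple of 5
time = pd.Series([102, 104, 106, 108, 111])
size = 5

anchored_at_zero = time // size                # edges at ..., 100, 105, 110, ...
anchored_at_min = (time - time.min()) // size  # edges at 102, 107, 112, ...

print(pd.DataFrame({"time": time,
                    "time // 5": anchored_at_zero,
                    "(time - min) // 5": anchored_at_min}))
```

Here 104 and 106 fall into different bins with the first scheme but into the same bin with the second, so which one fits depends on where the intervals are meant to start.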

Answer 3

You can adapt the previous answer by using GroupBy.ngroup with both columns:

df = df.sort_values(by=['time'], ascending=True)

# bin the data based on two specific columns
bins1 = df['time'].sub(df['time'].min()).floordiv(5).add(1)

df['bin'] = df.groupby([bins1, 'A']).ngroup() + 1

# add a column for the row count in each bin
df['NR'] = df.groupby('bin')['time'].transform('count')

# add a column for the index of each row in each bin
df['RN'] = df.groupby('bin').cumcount()+1

print (df)
   time  A  RN  NOR  bin  NR
0   100  1   1    0    1   2
1   101  1   2    0    1   2
2   104  2   1    0    2   1
3   105  3   1    0    3   2
4   107  3   2    0    3   2
5   110  3   1    0    4   1
6   114  4   1    0    5   1
7   115  5   1    0    6   2
8   116  5   2    0    6   2
9   118  6   1    0    7   1

If you don't need the bin column, pass bins1 and A to groupby for the new NR and RN columns:

df = df.sort_values(by=['time'], ascending=True)

# bin the data based on two specific columns
bins1 = df['time'].sub(df['time'].min()).floordiv(5).add(1)

# add a column for the row count in each bin
df['NR'] = df.groupby([bins1, 'A'])['time'].transform('count')

# add a column for the index of each row in each bin
df['RN'] = df.groupby([bins1, 'A']).cumcount()+1

print (df)
   time  A  RN  NOR  NR
0   100  1   1    0   2
1   101  1   2    0   2
2   104  2   1    0   1
3   105  3   1    0   2
4   107  3   2    0   2
5   110  3   1    0   1
6   114  4   1    0   1
7   115  5   1    0   2
8   116  5   2    0   2
9   118  6   1    0   1
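
For reuse across several files, the steps above could be wrapped in a small function. This is only a sketch: `add_bin_columns` and its parameters `time_col`, `key_col` and `width` are hypothetical names, not part of pandas or of the answers above.

```python
import pandas as pd

def add_bin_columns(df, time_col="time", key_col="A", width=5):
    """Return a sorted copy of df with RN, NR and bin columns (hypothetical helper)."""
    out = df.sort_values(time_col)
    # Integer label per row: 0 for [min, min + width), 1 for the next interval, ...
    time_bin = ((out[time_col] - out[time_col].min()) // width).rename("time_bin")
    groups = out.groupby([time_bin, key_col])
    return out.assign(
        RN=groups.cumcount() + 1,                # position of the row within its bin
        NR=groups[time_col].transform("count"),  # number of rows in the bin
        bin=groups.ngroup() + 1,                 # bin number, in sorted key order
    )

df = pd.DataFrame({"time": [100, 101, 104, 105, 107, 110, 114, 115, 116, 118],
                   "A": [1, 1, 2, 3, 3, 3, 4, 5, 5, 6]})
result = add_bin_columns(df)
print(result)
```

The bin numbers follow the sorted order of the (time bin, A) pairs, which matches the time order for the sample data because A never decreases as time increases.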
