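I have large text files that I need to bin based on two columns, and then add two new columns: the number of rows in each bin and the index of each row within its bin. I asked this before and got a suitable answer here, but I now realize I need to group by an additional column as well. A simplified version of my dataset looks like this: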
time A RN NOR
0 100 1 0 0
1 101 1 0 0
2 104 2 0 0
3 105 3 0 0
4 107 3 0 0
5 110 3 0 0
6 114 4 0 0
7 115 5 0 0
8 116 5 0 0
9 118 6 0 0
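where RN is the index of each row within its bin and NR is the number of rows in each bin. I want to bin the data first by the time column, for example in intervals of 5, and then by column A in intervals of 1.
I would like the result to look like this: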
time A RN NR bin
0 100 1 1 2 1
1 101 1 2 2 1
2 104 2 1 1 2
3 105 3 1 2 3
4 107 3 2 2 3
5 110 3 1 1 4
6 114 4 1 1 5
7 115 5 1 2 6
8 116 5 2 2 6
9 118 6 1 1 7
Here is the code that bins on the time column only:
df = df.sort_values(by=['time'], ascending=True)
ranges1 = np.arange(df.time.min()-5, df.time.max()+5, 5)
ranges2 = np.arange(df.A.min()-1, df.A.max()+1, 1)
# bin the data based on two specific columns
bins1 = pd.cut(df.time, ranges1)
bins2 = pd.cut(df.A, ranges2)
# add a column for the row count in each bin
df['NR'] = df.groupby(bins1)['time'].transform('count')
# add a column for the index of each row in each bin
df['RN'] = df.groupby(bins1).cumcount()+1
But I can't figure out how to bin on both the time column and column A.
Use ngroup instead of cumcount.
bins3 = pd.cut(df.time, ranges1, right=False)
df['bin'] = df.groupby([bins3, 'A']).ngroup()+1
Output:
time A RN NR bin
0 100 1 1 1 1
1 101 1 1 3 1
2 104 2 2 3 2
3 105 3 3 3 3
4 107 3 1 2 3
5 110 3 2 2 4
6 114 4 1 2 5
7 115 5 2 2 6
8 116 5 1 2 6
9 118 6 2 2 7
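To make the difference between the two methods concrete, here is a minimal sketch on the question's sample data. It swaps in integer floor division for pd.cut to keep the example short (that substitution and the helper name time_bin are my own assumptions, not part of the answer above). cumcount numbers the rows inside each group, while ngroup numbers the groups themselves.

```python
import pandas as pd

df = pd.DataFrame({
    "time": [100, 101, 104, 105, 107, 110, 114, 115, 116, 118],
    "A":    [1, 1, 2, 3, 3, 3, 4, 5, 5, 6],
})

# Hypothetical helper: 5-wide time bins anchored at the smallest time value.
time_bin = (df["time"] - df["time"].min()) // 5

grouped = df.groupby([time_bin, "A"])

# cumcount: position of each row inside its group (0-based, so add 1).
df["RN"] = grouped.cumcount() + 1
# ngroup: number of the group the row belongs to (0-based, so add 1).
df["bin"] = grouped.ngroup() + 1

print(df)
```

On this data the bin column comes out as 1, 1, 2, 3, 3, 4, 5, 6, 6, 7, which is the numbering the question asks for.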
I recreated the DataFrame from what you described. I haven't dropped the "NOR" column; the rest of the output should be the same.
import numpy as np
import pandas as pd
data = {'time': [100, 101, 104, 105, 107, 110, 114, 115, 116, 118],
'A': [1, 1, 2, 3, 3, 3, 4, 5, 5, 6],
'RN': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
'NOR': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
df = pd.DataFrame(data)
# set bin sizes
time_bin_size = 5
A_bin_size = 1
# calculate bins
df['time_bin'] = (df['time'] // time_bin_size).astype(int)
df['A_bin'] = (df['A'] // A_bin_size).astype(int)
# create bin groups
groups = df.groupby(['time_bin', 'A_bin'])
# calculate bin statistics
df['RN'] = groups.cumcount() + 1
df['NR'] = groups['time'].transform('count')
df['bin'] = groups.ngroup() + 1
# drop the temporary columns
df = df.drop(columns=['time_bin', 'A_bin'])
# print the result
print(df)
# Output
> time A RN NOR NR bin
> 0 100 1 1 0 2 1
> 1 101 1 2 0 2 1
> 2 104 2 1 0 1 2
> 3 105 3 1 0 2 3
> 4 107 3 2 0 2 3
> 5 110 3 1 0 1 4
> 6 114 4 1 0 1 5
> 7 115 5 1 0 2 6
> 8 116 5 2 0 2 6
> 9 118 6 1 0 1 7
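One caveat worth checking on your own data: `time // 5` aligns bins to multiples of 5, whereas subtracting the minimum before dividing anchors the bins at the smallest value. The sample data starts at 100, so both give the same grouping here, but in general they can differ. A small sketch with made-up values (not from the question):

```python
import pandas as pd

# Made-up values whose minimum is not a multiple of the bin size.
time = pd.Series([102, 104, 106, 108, 111])

# Bins aligned to multiples of 5: [100, 105), [105, 110), [110, 115)
aligned = time // 5
# Bins anchored at the minimum: [102, 107), [107, 112)
anchored = (time - time.min()) // 5

print(aligned.tolist())   # [20, 20, 21, 21, 22]
print(anchored.tolist())  # [0, 0, 0, 1, 1]
```

Which one you want depends on whether the bin edges should be fixed (multiples of 5) or relative to the first observation.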
You can change the previous answer to use GroupBy.ngroup on both columns:
df = df.sort_values(by=['time'], ascending=True)
# bin the data based on two specific columns
bins1 = df['time'].sub(df['time'].min()).floordiv(5).add(1)
df['bin'] = df.groupby([bins1, 'A']).ngroup() + 1
# add a column for the row count in each bin
df['NR'] = df.groupby('bin')['time'].transform('count')
# add a column for the index of each row in each bin
df['RN'] = df.groupby('bin').cumcount()+1
print (df)
time A RN NOR bin NR
0 100 1 1 0 1 2
1 101 1 2 0 1 2
2 104 2 1 0 2 1
3 105 3 1 0 3 2
4 107 3 2 0 3 2
5 110 3 1 0 4 1
6 114 4 1 0 5 1
7 115 5 1 0 6 2
8 116 5 2 0 6 2
9 118 6 1 0 7 1
If the bin column isn't needed, pass bins1 and A to groupby for the new NR and RN columns:
df = df.sort_values(by=['time'], ascending=True)
# bin the data based on two specific columns
bins1 = df['time'].sub(df['time'].min()).floordiv(5).add(1)
# add a column for the row count in each bin
df['NR'] = df.groupby([bins1, 'A'])['time'].transform('count')
# add a column for the index of each row in each bin
df['RN'] = df.groupby([bins1, 'A']).cumcount()+1
print (df)
time A RN NOR NR
0 100 1 1 0 2
1 101 1 2 0 2
2 104 2 1 0 1
3 105 3 1 0 2
4 107 3 2 0 2
5 110 3 1 0 1
6 114 4 1 0 1
7 115 5 1 0 2
8 116 5 2 0 2
9 118 6 1 0 1