I have a Pandas DataFrame, for example:
| word_number | change_type | doc1_paragraph_number | doc2_paragraph_number |
|-------------|-------------|-----------------------|-----------------------|
| 1 | -1 | 0 | -1 |
| 2 | 0 | 0 | 0 |
| 3 | 0 | 1 | 1 |
| 4 | 0 | 1 | 1 |
| 5 | 1 | 1 | 1 |
| 6 | 1 | 1 | 1 |
| 7 | 1 | -1 | 1 |
| 8 | 0 | 1 | 1 |
| 9 | 1 | -1 | 1 |
| 10 | 0 | 1 | 1 |
| 11 | 0 | 2 | 2 |
| 12 | 1 | 2 | 2 |
| 13 | -1 | 2 | -1 |
| 14 | 1 | 2 | 2 |
| 15 | 0 | 3 | 3 |
| 16 | 1 | -1 | 3 |
| 17 | 1 | -1 | 3 |
| 18 | 0 | 3 | 3 |
| 19 | 0 | 3 | 3 |
| 20 | 0 | 3 | 3 |
| 21 | 0 | 3 | 3 |
| 22 | 1 | -1 | 3 |
| 23 | 1 | -1 | 3 |
| 24 | 0 | 4 | 4 |
| 25 | 0 | 4 | 4 |
| 26 | 1 | -1 | 5 |
| 27 | 1 | -1 | 5 |
| 28 | 0 | 4 | 5 |
| 29 | 0 | 4 | 5 |
| 30 | 1 | -1 | 5 |
| 31 | -1 | 4 | -1 |
| 32 | 0 | 5 | 6 |
| 33 | 0 | 5 | 6 |
| 34 | 0 | 5 | 6 |
| 35 | -1 | 5 | -1 |
| 36 | -1 | 5 | -1 |
| 37 | 1 | -1 | 7 |
| 38 | 1 | -1 | 7 |
| 39 | 0 | 6 | 7 |
| 40 | 0 | 6 | 7 |
I am trying to develop a function that merges the word_number values into lists and adds those lists to separate doc1 and doc2 columns, with change_type as a third column holding the type of change made to the document.
Expected DataFrame:
| doc1 | doc2 | change_type |
|---------------------|-----------------------|-------------|
| [1] | [] | -1 |
| [2] | [2] | 0 |
| [3, 4] | [3, 4] | 0 |
| [5, 6, 8] | [5, 6, 7, 8, 9] | 2 |
| [10] | [10] | 0 |
| [11] | [11] | 0 |
| [12, 13, 14] | [12, 14] | 2 |
| [15] | [15] | 0 |
| [] | [16, 17] | 1 |
| [18, 19, 20, 21] | [18, 19, 20, 21] | 0 |
| [] | [22, 23] | 1 |
So if there are consecutive word_numbers whose change_type is 0, meaning no change was made there, we can add them to a single list. For example: word_numbers 5, 6, 8 are part of the doc1 list and 5, 6, 7, 8, 9 are part of the doc2 list.
Could someone suggest an approach to point me in the right direction?
Legend:
1. -1 in a paragraph number column means the word isn't present in that paragraph.
2. change_type: no change = 0, insert = 1, delete = -1.
3. word_number: each word is represented by an integer.
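A building block worth knowing for this kind of problem is the pandas diff/cumsum idiom for labeling runs of consecutive integers; a minimal sketch on made-up data:

```python
import pandas as pd

# diff() equals 1 inside a run of consecutive integers, so ne(1) flags
# the start of each run and cumsum() turns the flags into run labels.
s = pd.Series([1, 2, 3, 7, 8, 12])
runs = s.diff().ne(1).cumsum()
print(runs.tolist())  # [1, 1, 1, 2, 2, 3]
```

The resulting labels can then be passed to `groupby` to aggregate each run separately.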
The logic isn't entirely clear, but you can build groupers for a custom aggregation:
```python
# group consecutive words
g1 = df['word_number'].diff().ne(1).cumsum()

# group consecutive change types,
# ignoring intermediate 0/-1
c = (df['change_type']
     .replace({0: None, -1: None})
     .ffill().fillna(0)
     )
g2 = c.diff().ne(0).cumsum()

# select the paragraph-number columns,
# group identical values, ignoring -1
tmp = df.filter(like='paragraph_number')
g3 = (tmp
      .replace(-1, None)
      .ffill().diff()
      .ne(0).any(axis=1)
      .cumsum()
      )

# form the groups
g = df.groupby([g1, g2, g3], sort=False)

# aggregate
out = (g[list(tmp)]
       .agg(lambda x: df.loc[x.index, 'word_number'][x.ne(-1)].tolist())
       .join(g['change_type'].max())
       .reset_index(drop=True)
       )
print(out)
```
Output:
```
   doc1_paragraph_number                 doc2_paragraph_number  change_type
0                    [1]                                    []           -1
1                    [2]                                   [2]            0
2                 [3, 4]                                [3, 4]            0
3          [5, 6, 8, 10]                   [5, 6, 7, 8, 9, 10]            1
4       [11, 12, 13, 14]                          [11, 12, 14]            1
5   [15, 18, 19, 20, 21]  [15, 16, 17, 18, 19, 20, 21, 22, 23]            1
6               [24, 25]                              [24, 25]            0
7           [28, 29, 31]                  [26, 27, 28, 29, 30]            1
8   [32, 33, 34, 35, 36]                          [32, 33, 34]            0
9                     []                              [37, 38]            1
10              [39, 40]                              [39, 40]            0
```
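To make the approach reproducible end to end, here is a self-contained sketch on the first eight rows of the example data. It uses `mask` instead of `replace(..., None)` to sidestep dtype-downcasting warnings in recent pandas versions; the grouping logic is otherwise the same as above:

```python
import pandas as pd

# A trimmed-down version of the example data (first 8 words).
df = pd.DataFrame({
    'word_number': [1, 2, 3, 4, 5, 6, 7, 8],
    'change_type': [-1, 0, 0, 0, 1, 1, 1, 0],
    'doc1_paragraph_number': [0, 0, 1, 1, 1, 1, -1, 1],
    'doc2_paragraph_number': [-1, 0, 1, 1, 1, 1, 1, 1],
})

# Grouper 1: runs of consecutive word numbers.
g1 = df['word_number'].diff().ne(1).cumsum()

# Grouper 2: runs of the same effective change type; 0 and -1 are
# masked out and forward-filled so they don't split an insert run.
c = df['change_type'].mask(df['change_type'].isin([0, -1])).ffill().fillna(0)
g2 = c.diff().ne(0).cumsum()

# Grouper 3: runs of identical paragraph numbers, ignoring -1 markers.
tmp = df.filter(like='paragraph_number')
g3 = tmp.mask(tmp.eq(-1)).ffill().diff().ne(0).any(axis=1).cumsum()

g = df.groupby([g1, g2, g3], sort=False)

# For each group and each doc, keep the word numbers whose paragraph
# number is not -1 (i.e. the word actually appears in that doc).
out = (g[list(tmp)]
       .agg(lambda x: df.loc[x.index, 'word_number'][x.ne(-1)].tolist())
       .join(g['change_type'].max())
       .reset_index(drop=True)
       )
print(out)
# Produces:
#   doc1_paragraph_number doc2_paragraph_number  change_type
# 0                   [1]                    []           -1
# 1                   [2]                   [2]            0
# 2                [3, 4]                [3, 4]            0
# 3             [5, 6, 8]          [5, 6, 7, 8]            1
```

Note that `change_type` is aggregated with `max`, so a group containing both unchanged (0) and inserted (1) words is reported as 1; if you need a distinct code for mixed groups (like the 2 in the expected output), replace `max` with a custom aggregation.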