Identifying corresponding word changes in two given documents


I have a Pandas DataFrame such as:

| word_number | change_type | doc1_paragraph_number | doc2_paragraph_number |
|-------------|-------------|-----------------------|-----------------------|
| 1           | -1          | 0                     | -1                    |
| 2           | 0           | 0                     | 0                     |
| 3           | 0           | 1                     | 1                     |
| 4           | 0           | 1                     | 1                     |
| 5           | 1           | 1                     | 1                     |
| 6           | 1           | 1                     | 1                     |
| 7           | 1           | -1                    | 1                     |
| 8           | 0           | 1                     | 1                     |
| 9           | 1           | -1                    | 1                     |
| 10          | 0           | 1                     | 1                     |
| 11          | 0           | 2                     | 2                     |
| 12          | 1           | 2                     | 2                     |
| 13          | -1          | 2                     | -1                    |
| 14          | 1           | 2                     | 2                     |
| 15          | 0           | 3                     | 3                     |
| 16          | 1           | -1                    | 3                     |
| 17          | 1           | -1                    | 3                     |
| 18          | 0           | 3                     | 3                     |
| 19          | 0           | 3                     | 3                     |
| 20          | 0           | 3                     | 3                     |
| 21          | 0           | 3                     | 3                     |
| 22          | 1           | -1                    | 3                     |
| 23          | 1           | -1                    | 3                     |
| 24          | 0           | 4                     | 4                     |
| 25          | 0           | 4                     | 4                     |
| 26          | 1           | -1                    | 5                     |
| 27          | 1           | -1                    | 5                     |
| 28          | 0           | 4                     | 5                     |
| 29          | 0           | 4                     | 5                     |
| 30          | 1           | -1                    | 5                     |
| 31          | -1          | 4                     | -1                    |
| 32          | 0           | 5                     | 6                     |
| 33          | 0           | 5                     | 6                     |
| 34          | 0           | 5                     | 6                     |
| 35          | -1          | 5                     | -1                    |
| 36          | -1          | 5                     | -1                    |
| 37          | 1           | -1                    | 7                     |
| 38          | 1           | -1                    | 7                     |
| 39          | 0           | 6                     | 7                     |
| 40          | 0           | 6                     | 7                     |
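
For anyone who wants to reproduce this, the table above can be rebuilt roughly as follows. This is only a sketch: the helper lists `change_type`, `doc1` and `doc2` are local names chosen for the example, and `df` is assumed to be the name of the DataFrame.

```python
import pandas as pd

# Values copied row by row from the table above (word_number 1..40).
change_type = [-1, 0, 0, 0, 1, 1, 1, 0, 1, 0,
               0, 1, -1, 1, 0, 1, 1, 0, 0, 0,
               0, 1, 1, 0, 0, 1, 1, 0, 0, 1,
               -1, 0, 0, 0, -1, -1, 1, 1, 0, 0]
doc1 = [0, 0, 1, 1, 1, 1, -1, 1, -1, 1,
        2, 2, 2, 2, 3, -1, -1, 3, 3, 3,
        3, -1, -1, 4, 4, -1, -1, 4, 4, -1,
        4, 5, 5, 5, 5, 5, -1, -1, 6, 6]
doc2 = [-1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
        2, 2, -1, 2, 3, 3, 3, 3, 3, 3,
        3, 3, 3, 4, 4, 5, 5, 5, 5, 5,
        -1, 6, 6, 6, -1, -1, 7, 7, 7, 7]

df = pd.DataFrame({
    "word_number": range(1, 41),
    "change_type": change_type,
    "doc1_paragraph_number": doc1,
    "doc2_paragraph_number": doc2,
})
```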

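I am trying to write a function that merges the word_number values into lists and puts those lists into separate columns doc1 and doc2, with change_type as a third column describing the type of change made to the document.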

Expected DataFrame:

| doc1                | doc2                  | change_type |
|---------------------|-----------------------|-------------|
| [1]                 | []                    | -1          |
| [2]                 | [2]                   | 0           |
| [3, 4]              | [3, 4]                | 0           |
| [5, 6, 8]           | [5, 6, 7, 8, 9]       | 2           |
| [10]                | [10]                  | 0           |
| [11]                | [11]                  | 0           |
| [12, 13, 14]        | [12, 14]              | 2           |
| [15]                | [15]                  | 0           |
| []                  | [16, 17]              | 1           |
| [18, 19, 20, 21]    | [18, 19, 20, 21]      | 0           |
| []                  | [22, 23]              | 1           |

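So, where the word_numbers are consecutive and change_type is 0, no change was made there, which means those words can be collected into a single list. For example, word_numbers 5, 6, 8, 9 are part of the doc1 list and 5, 6, 7, 8, 9 are part of the doc2 list.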

Can someone suggest some approaches that would get me moving in the right direction?

Legend

 1. A paragraph number of -1 shows that the word isn't present in that document's paragraph.
 2. change_type: no change = 0, insert = 1, delete = -1.
 3. word_number: each word is represented by an integer.
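
To make the legend concrete, here is a small sketch on the rows for words 5 to 9 only. The names `sample` and `CHANGE_LABELS` are made up for this illustration; the rule it applies is just the one stated above, that -1 in a paragraph column means the word is absent from that document.

```python
import pandas as pd

# Rows for word_number 5..9, copied from the table above.
sample = pd.DataFrame({
    "word_number": [5, 6, 7, 8, 9],
    "change_type": [1, 1, 1, 0, 1],
    "doc1_paragraph_number": [1, 1, -1, 1, -1],
    "doc2_paragraph_number": [1, 1, 1, 1, 1],
})

# Hypothetical label mapping that follows the legend.
CHANGE_LABELS = {0: "no change", 1: "insert", -1: "delete"}
labels = sample["change_type"].map(CHANGE_LABELS).tolist()

# A word belongs to a document when its paragraph number is not -1.
doc1_words = sample.loc[sample["doc1_paragraph_number"].ne(-1), "word_number"].tolist()
doc2_words = sample.loc[sample["doc2_paragraph_number"].ne(-1), "word_number"].tolist()

print(labels)      # ['insert', 'insert', 'insert', 'no change', 'insert']
print(doc1_words)  # [5, 6, 8]
print(doc2_words)  # [5, 6, 7, 8, 9]
```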
python pandas dataframe
1 Answer

The logic is not fully clear, but you could form groupers to perform a custom aggregation:

# group consecutive words
g1 = df['word_number'].diff().ne(1).cumsum()

# group consecutive change type
# ignore intermediate 0/-1
c = (df['change_type']
     .replace({0: None, -1: None})
     .ffill().fillna(0)
     )
g2 = c.diff().ne(0).cumsum()

# select document/paragraph columns
# group by identical type
# ignore -1
tmp = df.filter(like='paragraph_number')
g3 = (tmp
      .replace(-1, None)
      .ffill().diff()
      .ne(0).any(axis=1)
      .cumsum()
      )

# form groups
g = df.groupby([g1,g2,g3], sort=False)

# aggregate
out = (g[list(tmp)]
       .agg(lambda x: df.loc[x.index, 'word_number'][x.ne(-1)].tolist())
       .join(g['change_type'].max())
       .reset_index(drop=True)
       )

print(out)

Output:

   doc1_paragraph_number                 doc2_paragraph_number  change_type
0                    [1]                                    []           -1
1                    [2]                                   [2]            0
2                 [3, 4]                                [3, 4]            0
3          [5, 6, 8, 10]                   [5, 6, 7, 8, 9, 10]            1
4       [11, 12, 13, 14]                          [11, 12, 14]            1
5   [15, 18, 19, 20, 21]  [15, 16, 17, 18, 19, 20, 21, 22, 23]            1
6               [24, 25]                              [24, 25]            0
7           [28, 29, 31]                  [26, 27, 28, 29, 30]            1
8   [32, 33, 34, 35, 36]                          [32, 33, 34]            0
9                     []                              [37, 38]            1
10              [39, 40]                              [39, 40]            0
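
The groupers above all rely on the same `diff().ne(...).cumsum()` idiom: a new group id starts wherever the difference from the previous row is not the expected step. A minimal sketch on a made-up series (the values in `s` are illustrative, not taken from the question):

```python
import pandas as pd

# Illustrative values with two gaps (3 -> 7 and 8 -> 10).
s = pd.Series([1, 2, 3, 7, 8, 10])

# diff() is NaN for the first row and 1 inside a consecutive run;
# ne(1) flags the first row and every break; cumsum() numbers the runs.
group_id = s.diff().ne(1).cumsum()

# Collect each run of consecutive values into a list.
runs = s.groupby(group_id).agg(list).tolist()

print(group_id.tolist())  # [1, 1, 1, 2, 2, 3]
print(runs)               # [[1, 2, 3], [7, 8], [10]]
```

If the column names from the question are wanted, the result can be renamed afterwards with `out.rename(columns={'doc1_paragraph_number': 'doc1', 'doc2_paragraph_number': 'doc2'})`.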