How do I split sentences into sentence ID, words, and labels with pandas?


I want to convert my pandas dataframe into a format that can be used with an NER model.

I have a pandas dataframe like this:

```
Sentence_id    Sentence                                                       labels
1              Did not  enjoy the new Windows 8 and touchscreen functions.    Windows 8
1              Did not  enjoy the new Windows 8 and touchscreen functions.    touchscreen functions
```
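For anyone who wants to reproduce this, the frame above can be built roughly as follows. This is a minimal sketch: the column names and values are copied from the table, and the variable name `df` is an assumption.

```python
import pandas as pd

# The same sentence appears once per labelled entity.
sentence = "Did not  enjoy the new Windows 8 and touchscreen functions."

df = pd.DataFrame({
    "Sentence_id": [1, 1],
    "Sentence": [sentence, sentence],
    "labels": ["Windows 8", "touchscreen functions"],
})
print(df)
```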

Is it possible to convert it to the following format?

```
Sentence_id    words          labels                                                       
1              Did            O
1              not            O
1              enjoy          O
1              the            O
1              new            O
1              Windows        B
1              8              I
1              and            O
1              touchscreen    B
1              functions      I
1              .              O
```

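The first word of a label should be tagged "B" (beginning), and the following words of that label should be tagged "I" (inside). All other words and punctuation marks should be tagged "O" (outside).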
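To make the tagging rule concrete, here is a small pure-Python sketch of it, independent of pandas. The function name `bio_tags` and the assumption that the sentence is already split into tokens are mine, not part of the original data.

```python
def bio_tags(tokens, entities):
    """Return one tag per token: 'B', 'I' or 'O'.

    tokens   -- the tokens of one sentence, in order
    entities -- list of entities, each given as a list of tokens
    """
    tags = ["O"] * len(tokens)
    for entity in entities:
        n = len(entity)
        if n == 0:
            continue  # nothing to match
        # Tag every place where the entity occurs as a contiguous run.
        for start in range(len(tokens) - n + 1):
            if tokens[start:start + n] == entity:
                tags[start] = "B"
                for i in range(start + 1, start + n):
                    tags[i] = "I"
    return tags


tokens = ["Did", "not", "enjoy", "the", "new", "Windows", "8",
          "and", "touchscreen", "functions", "."]
entities = [["Windows", "8"], ["touchscreen", "functions"]]

for token, tag in zip(tokens, bio_tags(tokens, entities)):
    print(token, tag)
```

Matching whole token runs, rather than checking each word on its own, keeps a word such as "8" tagged "O" when it appears outside the labelled phrase.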

python pandas named-entity-recognition
1 Answer

The solution is a bit long, but you can use df.iterrows():

```python
import string

import pandas as pd

PUNCT = set(string.punctuation)

def tokenize(text):
    """Split on whitespace; every punctuation character becomes its own token."""
    spaced = ''.join(f' {ch} ' if ch in PUNCT else ch for ch in text)
    return spaced.split()

rows = []
# Assuming the name of your dataframe is df.
# Rows sharing a Sentence_id hold the same sentence with a different entity each.
for sent_id, group in df.groupby('Sentence_id', sort=False):
    tokens = tokenize(group['Sentence'].iloc[0])
    tags = ['O'] * len(tokens)
    for _, row in group.iterrows():
        entity = tokenize(row['labels'])
        n = len(entity)
        if n == 0:
            continue
        # Tag each place where the entity occurs as a contiguous run of tokens.
        for start in range(len(tokens) - n + 1):
            if tokens[start:start + n] == entity:
                tags[start] = 'B'
                tags[start + 1:start + n] = ['I'] * (n - 1)
    # One output row per token, in sentence order (repeated words are kept).
    rows.extend((sent_id, tok, tag) for tok, tag in zip(tokens, tags))

new_df = pd.DataFrame(rows, columns=['Sentence_id', 'words', 'labels'])
new_df
```

Output:

```
    Sentence_id  words        labels
0   1            Did          O
1   1            not          O
2   1            enjoy        O
3   1            the          O
4   1            new          O
5   1            Windows      B
6   1            8            I
7   1            and          O
8   1            touchscreen  B
9   1            functions    I
10  1            .            O
```
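Splitting the sentence into words and punctuation is a step of its own, and a regular expression is one way to do it while keeping the tokens in their original order. The following is only a sketch: the `tokenize` name and the pattern are assumptions, and a real NER pipeline may need a tokenizer that matches the one used by the model.

```python
import re

# A run of word characters, or a single character that is neither
# a word character nor whitespace (i.e. punctuation and symbols).
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize(sentence):
    """Return word and punctuation tokens in the order they appear."""
    return TOKEN_RE.findall(sentence)


print(tokenize("Did not  enjoy the new Windows 8 and touchscreen functions."))
```

Note that this pattern also splits hyphenated words and contractions into several tokens, so entity labels need to be tokenized the same way before they are matched.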