Mapping BERT token indices to spaCy token indices


I'm trying to map BERT's (bert-base-uncased) tokenized token indices (not the ids, the token positions) to spaCy's tokenized token indices. In the example below my approach doesn't work, because spaCy's tokenization behaves a bit more intricately than I expected. Any ideas on how to solve this?

import spacy
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
nlp = spacy.load("en_core_web_sm")

sent = nlp("BRITAIN'S railways cost £20.7bn during the 2020-21 financial year, with £2.5bn generated through fares and other income, £1.3bn through other sources and £16.9bn from government, figures released by the regulator the Office of Rail and Road (ORR) on November 30 revealed.")
# For each BERT sub-token (from tokenizing each spaCy word separately),
# record the index of the spaCy word it came from
wd_to_tok_map = [wd.i for wd in sent for el in tokenizer.encode(wd.text, add_special_tokens=False)]
len(sent) # 55
len(wd_to_tok_map) # 67     <- Should be 65

input_ids = tokenizer.encode(sent.text, add_special_tokens=False)
len(input_ids) # 65

I could print both tokenizations and look for exact text matches, but the problem I run into is: what if a word occurs twice in the tokenization? Looking for a word match would then return two indices from different parts of the sentence.

[el.text for el in sent]
['BRITAIN', "'S", 'railways', 'cost', '£', '20.7bn', 'during', 'the', '2020', '-', '21', 'financial', 'year', ',', 'with', '£', '2.5bn', 'generated', 'through', 'fares', 'and', 'other', 'income', ',', '£', '1.3bn', 'through', 'other', 'sources', 'and', '£', '16.9bn', 'from', 'government', ',', 'figures', 'released', 'by', 'the', 'regulator', 'the', 'Office', 'of', 'Rail', 'and', 'Road', '(', 'ORR', ')', 'on', 'November', '30', 'revealed', '.']

[tokenizer.ids_to_tokens[el] for el in input_ids]
['britain', "'", 's', 'railways', 'cost', '£2', '##0', '.', '7', '##bn', 'during', 'the', '2020', '-', '21', 'financial', 'year', ',', 'with', '£2', '.', '5', '##bn', 'generated', 'through', 'fares', 'and', 'other', 'income', ',', '£1', '.', '3', '##bn', 'through', 'other', 'sources', 'and', '£1', '##6', '.', '9', '##bn', 'from', 'government', ',', 'figures', 'released', 'by', 'the', 'regulator', 'the', 'office', 'of', 'rail', 'and', 'road', '(', 'orr', ')', 'on', 'november', '30', 'revealed', '.']

decode() doesn't seem to give me what I want, since I'm looking for indices.

python mapping spacy tokenize bert-language-model
1 Answer

Use a fast tokenizer with return_offsets_mapping=True to get character offsets directly from the transformers tokenizer, then map them to the spaCy tokens you want:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "BRITAIN'S railways cost £20.7bn"
output = tokenizer([text], return_offsets_mapping=True)

print(output["input_ids"])
# [[101, 3725, 1005, 1055, 7111, 3465, 21853, 2692, 1012, 1021, 24700, 102]]

print(tokenizer.convert_ids_to_tokens(output["input_ids"][0]))
# ['[CLS]', 'britain', "'", 's', 'railways', 'cost', '£2', '##0', '.', '7', '##bn', '[SEP]']

print(output["offset_mapping"])
# [[(0, 0), (0, 7), (7, 8), (8, 9), (10, 18), (19, 23), (24, 26), (26, 27), (27, 28), (28, 29), (29, 31), (0, 0)]]
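For the second step, here is a minimal sketch (not part of the original answer, just one way to finish the mapping) that assigns each BERT sub-token to the spaCy token containing its start character, using token.idx and the offsets above:

import spacy
from transformers import AutoTokenizer

nlp = spacy.load("en_core_web_sm")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "BRITAIN'S railways cost £20.7bn"
doc = nlp(text)
enc = tokenizer(text, return_offsets_mapping=True, add_special_tokens=False)

# bert_to_spacy[i] = index of the spaCy token whose character span
# contains the start character of BERT sub-token i
bert_to_spacy = []
for start, end in enc["offset_mapping"]:
    idx = next((t.i for t in doc if t.idx <= start < t.idx + len(t.text)), None)
    bert_to_spacy.append(idx)

print(bert_to_spacy)
# expected roughly: [0, 1, 1, 2, 3, 4, 5, 5, 5, 5]

Note that a sub-token like '£2' spans two spaCy tokens ('£' and '20.7bn'); matching on the start character assigns it to '£'. In spaCy 3.x you could also use doc.char_span(start, end, alignment_mode="expand") to recover the full covering span instead.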