Is there any approach for combining BIO tokens into compound words? I implemented a method that forms words from the BIO scheme, but it does not work for words that contain punctuation. For example, S.E.C is joined as S . E . C by the function below.
import itertools

def collapse(ner_result):
    # List with the result
    collapsed_result = []

    current_entity_tokens = []
    current_entity = None

    # Iterate over the tagged tokens
    for token, tag in ner_result:

        if tag.startswith("B-"):
            # ... if we have a previous entity in the buffer, store it in the result list
            if current_entity is not None:
                collapsed_result.append([" ".join(current_entity_tokens), current_entity])

            current_entity = tag[2:]
            # The new entity has so far only one token
            current_entity_tokens = [token]

        # If the entity continues ...
        elif current_entity_tokens is not None and tag == "I-" + str(current_entity):
            # Just add the token to the buffer
            current_entity_tokens.append(token)

        else:
            collapsed_result.append([" ".join(current_entity_tokens), current_entity])
            collapsed_result.append([token, tag[2:]])

            current_entity_tokens = []
            current_entity = None

    # The last entity is still in the buffer, so add it to the result
    # ... but only if there was some entity at all
    if current_entity is not None:
        collapsed_result.append([" ".join(current_entity_tokens), current_entity])

    # Sort and drop duplicate entries
    collapsed_result = sorted(collapsed_result)
    collapsed_result = list(k for k, _ in itertools.groupby(collapsed_result))

    return collapsed_result
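For reference, here is how the problem shows up; the NER output below is a made-up example of what a token-level tagger might return for S.E.C:

ner_result = [("S", "B-ORG"), (".", "I-ORG"), ("E", "I-ORG"),
              (".", "I-ORG"), ("C", "I-ORG")]
print(collapse(ner_result))
# [['S . E . C', 'ORG']]   <- a space is inserted between every token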
Another approach:
I also tried detokenizing with TreebankWordDetokenizer, but it still does not reconstruct the original sentence. For example:
Original sentence -> parties. \n \n IN WITNESS WHEREOF, the parties hereto
Tokenized and detokenized sentence -> parties . IN WITNESS WHEREOF, the parties hereto
Another example:
Original sentence -> Group’s employment, Group shall be
Tokenized and detokenized sentence -> Group ’ s employment, Group shall be
Note that TreebankWordDetokenizer drops the newlines and leaves the period detached from "parties".
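A minimal round trip that shows this behaviour (assuming the tokens come from NLTK's word_tokenize; with a different tokenizer the exact output will differ):

from nltk.tokenize import word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer

text = "parties. \n \n IN WITNESS WHEREOF, the parties hereto"
tokens = word_tokenize(text)  # whitespace, including the newlines, is discarded here
print(TreebankWordDetokenizer().detokenize(tokens))
# The newlines are gone, and depending on how the tokens were produced
# the spacing around punctuation may not be restored exactly.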
Is there any workaround to form the compound words correctly?
A very small fix should do it:
def join_tokens(tokens):
    res = ''
    if tokens:
        res = tokens[0]
        for token in tokens[1:]:
            if not (token.isalpha() and res[-1].isalpha()):
                res += token  # punctuation
            else:
                res += ' ' + token  # regular word
    return res
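A quick sanity check (the token lists are just illustrative):

print(join_tokens(["S", ".", "E", ".", "C"]))    # S.E.C
print(join_tokens(["Exchange", "Commission"]))   # Exchange Commission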
def collapse(ner_result):
    # List with the result
    collapsed_result = []

    current_entity_tokens = []
    current_entity = None

    # Iterate over the tagged tokens
    for token, tag in ner_result:

        if tag.startswith("B-"):
            # ... if we have a previous entity in the buffer, store it in the result list
            if current_entity is not None:
                collapsed_result.append([join_tokens(current_entity_tokens), current_entity])

            current_entity = tag[2:]
            # The new entity has so far only one token
            current_entity_tokens = [token]

        # If the entity continues ...
        elif current_entity_tokens is not None and tag == "I-" + str(current_entity):
            # Just add the token to the buffer
            current_entity_tokens.append(token)

        else:
            collapsed_result.append([join_tokens(current_entity_tokens), current_entity])
            collapsed_result.append([token, tag[2:]])

            current_entity_tokens = []
            current_entity = None

    # The last entity is still in the buffer, so add it to the result
    # ... but only if there was some entity at all
    if current_entity is not None:
        collapsed_result.append([join_tokens(current_entity_tokens), current_entity])

    # Sort and drop duplicate entries
    collapsed_result = sorted(collapsed_result)
    collapsed_result = list(k for k, _ in itertools.groupby(collapsed_result))

    return collapsed_result
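With the same made-up NER output as before, the patched version keeps the punctuation glued to its word:

ner_result = [("S", "B-ORG"), (".", "I-ORG"), ("E", "I-ORG"), (".", "I-ORG"), ("C", "I-ORG"),
              ("Securities", "B-ORG"), ("and", "I-ORG"), ("Exchange", "I-ORG"), ("Commission", "I-ORG")]
print(collapse(ner_result))
# [['S.E.C', 'ORG'], ['Securities and Exchange Commission', 'ORG']]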
This will solve most cases, but as the comments below show, there will always be outliers. So the complete solution is to keep track of which original word each token was created from. Thus
text = "U.S. Securities and Exchange Commission"
lut = [(token, ix) for ix, word in enumerate(text.split()) for token in tokenize(word)]
# lut = [("U", 0), (".", 0), ("S", 0), (".", 0), ("Securities", 1), ("and", 2), ("Exchange", 3), ("Commission", 4)]
Now, given a token's index, you know exactly which word it came from, so you can concatenate tokens that belong to the same word and add a space only when consecutive tokens belong to different words.
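A rough sketch of that idea (tokenize stands for whatever tokenizer produced the NER tokens; the toy re.split below is only a stand-in for it):

import re

def build_lut(text, tokenize):
    # Map every token position to the index of the whitespace-separated word it came from
    return [(token, ix) for ix, word in enumerate(text.split()) for token in tokenize(word)]

def join_with_lut(tokens, lut, start):
    # Join the tokens at positions start .. start+len(tokens)-1,
    # inserting a space only when two consecutive tokens come from different words
    res = ""
    prev_word_ix = None
    for offset, token in enumerate(tokens):
        _, word_ix = lut[start + offset]
        if prev_word_ix is None:
            res = token
        elif word_ix == prev_word_ix:
            res += token          # same original word: no space
        else:
            res += " " + token    # new word: add a space
        prev_word_ix = word_ix
    return res

text = "U.S. Securities and Exchange Commission"
toy_tokenize = lambda w: [t for t in re.split(r"([.])", w) if t]  # stand-in tokenizer
lut = build_lut(text, toy_tokenize)
print(join_with_lut(["U", ".", "S", ".", "Securities"], lut, 0))
# U.S. Securities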