Is there any approach for combining BIO tokens into compound words? I implemented a method that forms words from the BIO scheme, but it does not work for words that contain punctuation. For example, S.E.C is joined as S . E . C by the function below.
import itertools

def collapse(ner_result):
    # List with the result
    collapsed_result = []

    current_entity_tokens = []
    current_entity = None

    # Iterate over the tagged tokens
    for token, tag in ner_result:

        if tag.startswith("B-"):
            # ... if we have a previous entity in the buffer, store it in the result list
            if current_entity is not None:
                collapsed_result.append([" ".join(current_entity_tokens), current_entity])

            current_entity = tag[2:]
            # The new entity has so far only one token
            current_entity_tokens = [token]

        # If the entity continues ...
        elif current_entity_tokens is not None and tag == "I-" + str(current_entity):
            # Just add the token to the buffer
            current_entity_tokens.append(token)

        else:
            collapsed_result.append([" ".join(current_entity_tokens), current_entity])
            collapsed_result.append([token, tag[2:]])

            current_entity_tokens = []
            current_entity = None

    # The last entity is still in the buffer, so add it to the result
    # ... but only if there was some entity at all
    if current_entity is not None:
        collapsed_result.append([" ".join(current_entity_tokens), current_entity])

    # Sort and drop duplicate entries
    collapsed_result = sorted(collapsed_result)
    collapsed_result = list(k for k, _ in itertools.groupby(collapsed_result))

    return collapsed_result
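For reference, here is how the problem shows up; the NER output below is a made-up example of what a token-level tagger might return for S.E.C:

ner_result = [("S", "B-ORG"), (".", "I-ORG"), ("E", "I-ORG"),
              (".", "I-ORG"), ("C", "I-ORG")]
print(collapse(ner_result))
# [['S . E . C', 'ORG']]   <- a space is inserted between every token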
Another approach:
I also tried detokenizing with TreebankWordDetokenizer, but it still does not reconstruct the original sentence. For example:
Original sentence -> parties. \n \n IN WITNESS WHEREOF, the parties hereto
Tokenized and detokenized sentence -> parties . IN WITNESS WHEREOF, the parties hereto
Another example:
Original sentence -> Group’s employment, Group shall be
Tokenized and detokenized sentence -> Group ’ s employment, Group shall be
Note that TreebankWordDetokenizer drops the newlines and leaves the period detached from "parties".
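A minimal round trip that shows this behaviour (assuming the tokens come from NLTK's word_tokenize; with a different tokenizer the exact output will differ):

from nltk.tokenize import word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer

text = "parties. \n \n IN WITNESS WHEREOF, the parties hereto"
tokens = word_tokenize(text)  # whitespace, including the newlines, is discarded here
print(TreebankWordDetokenizer().detokenize(tokens))
# The newlines are gone, and depending on how the tokens were produced
# the spacing around punctuation may not be restored exactly.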
Is there any workaround to form the compound words correctly?
A very small fix should do it:
def join_tokens(tokens):
    res = ''
    if tokens:
        res = tokens[0]
        for token in tokens[1:]:
            if not (token.isalpha() and res[-1].isalpha()):
                res += token  # punctuation
            else:
                res += ' ' + token  # regular word
    return res
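A quick sanity check (the token lists are just illustrative):

print(join_tokens(["S", ".", "E", ".", "C"]))    # S.E.C
print(join_tokens(["Exchange", "Commission"]))   # Exchange Commission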
def collapse(ner_result):
    # List with the result
    collapsed_result = []

    current_entity_tokens = []
    current_entity = None

    # Iterate over the tagged tokens
    for token, tag in ner_result:

        if tag.startswith("B-"):
            # ... if we have a previous entity in the buffer, store it in the result list
            if current_entity is not None:
                collapsed_result.append([join_tokens(current_entity_tokens), current_entity])

            current_entity = tag[2:]
            # The new entity has so far only one token
            current_entity_tokens = [token]

        # If the entity continues ...
        elif current_entity_tokens is not None and tag == "I-" + str(current_entity):
            # Just add the token to the buffer
            current_entity_tokens.append(token)

        else:
            collapsed_result.append([join_tokens(current_entity_tokens), current_entity])
            collapsed_result.append([token, tag[2:]])

            current_entity_tokens = []
            current_entity = None

    # The last entity is still in the buffer, so add it to the result
    # ... but only if there was some entity at all
    if current_entity is not None:
        collapsed_result.append([join_tokens(current_entity_tokens), current_entity])

    # Sort and drop duplicate entries
    collapsed_result = sorted(collapsed_result)
    collapsed_result = list(k for k, _ in itertools.groupby(collapsed_result))

    return collapsed_result
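With the same made-up NER output as before, the patched version keeps the punctuation glued to its word:

ner_result = [("S", "B-ORG"), (".", "I-ORG"), ("E", "I-ORG"), (".", "I-ORG"), ("C", "I-ORG"),
              ("Securities", "B-ORG"), ("and", "I-ORG"), ("Exchange", "I-ORG"), ("Commission", "I-ORG")]
print(collapse(ner_result))
# [['S.E.C', 'ORG'], ['Securities and Exchange Commission', 'ORG']]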
This will solve most cases, but as the comments below show, there will always be outliers. So the complete solution is to keep track of which original word each token was created from. Thus
text = "U.S. Securities and Exchange Commission"
lut = [(token, ix) for ix, word in enumerate(text.split()) for token in tokenize(word)]
# lut = [("U", 0), (".", 0), ("S", 0), (".", 0), ("Securities", 1), ("and", 2), ("Exchange", 3), ("Commission", 4)]
Now, given a token's index, you know exactly which word it came from, so you can concatenate tokens that belong to the same word and add a space only when consecutive tokens belong to different words.
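A rough sketch of that idea (tokenize stands for whatever tokenizer produced the NER tokens; the toy re.split below is only a stand-in for it):

import re

def build_lut(text, tokenize):
    # Map every token position to the index of the whitespace-separated word it came from
    return [(token, ix) for ix, word in enumerate(text.split()) for token in tokenize(word)]

def join_with_lut(tokens, lut, start):
    # Join the tokens at positions start .. start+len(tokens)-1,
    # inserting a space only when two consecutive tokens come from different words
    res = ""
    prev_word_ix = None
    for offset, token in enumerate(tokens):
        _, word_ix = lut[start + offset]
        if prev_word_ix is None:
            res = token
        elif word_ix == prev_word_ix:
            res += token          # same original word: no space
        else:
            res += " " + token    # new word: add a space
        prev_word_ix = word_ix
    return res

text = "U.S. Securities and Exchange Commission"
toy_tokenize = lambda w: [t for t in re.split(r"([.])", w) if t]  # stand-in tokenizer
lut = build_lut(text, toy_tokenize)
print(join_with_lut(["U", ".", "S", ".", "Securities"], lut, 0))
# U.S. Securities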