Extracting company names with bert-base-ner: a simple way to know which words belong together?


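Hi, I am trying to use bert-base-ner to extract full company names from text descriptions of companies. I am open to other approaches too, but I have not been able to find one. The problem is that although it tags organizations correctly, it tags them per word/token, so I cannot easily get the full company name without joining the pieces and rebuilding it myself.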

Is there a simpler way, or a model, that does this?

Here is my code:

from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

nlp = pipeline("ner", model=model, tokenizer=tokenizer)

ner_results = nlp(text1)
print(ner_results)

This is the output I get for one text string:

[{'entity': 'B-ORG', 'score': 0.99965024, 'index': 1, 'word': 'Orion', 'start': 0, 'end': 5}, {'entity': 'I-ORG', 'score': 0.99945647, 'index': 2, 'word': 'Metal', 'start': 6, 'end': 11}, {'entity': 'I-ORG', 'score': 0.99943095, 'index': 3, 'word': '##s', 'start': 11, 'end': 12}, {'entity': 'I-ORG', 'score': 0.99939036, 'index': 4, 'word': 'Limited', 'start': 13, 'end': 20}, {'entity': 'B-LOC', 'score': 0.9997398, 'index': 14, 'word': 'Australia', 'start': 78, 'end': 87}]
bert-language-model named-entity-recognition tagging pos-tagger
1 Answer

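I ran into a similar problem and solved it by switching to the model "xlm-roberta-large-finetuned-conll03-english". In my experience it works considerably better than the model you are using now, and, combined with aggregation_strategy="simple" in the pipeline call, it returns complete organization names rather than broken pieces. Feel free to test the code below, which extracts the full list of organizations from a document. If you find it useful, please click the tick button to accept my answer.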

import time

import docx2txt
from pdfminer.high_level import extract_text
from transformers import pipeline

start = time.time()

# aggregation_strategy="simple" makes the pipeline merge word pieces and
# B-/I- tags into whole entities, returned under the "entity_group" key.
model_checkpoint = "xlm-roberta-large-finetuned-conll03-english"
token_classifier = pipeline(
    "token-classification", model=model_checkpoint, aggregation_strategy="simple"
)



def text_extraction(file):
    """Extract text from a PDF or Word (.docx) file."""
    if file.endswith(".pdf"):
        return extract_text(file)
    # Anything that is not a PDF is treated as a Word document
    resume_text = docx2txt.process(file)
    if resume_text:
        return resume_text.replace('\t', ' ')
    return None



# Organisation names extraction
def org_name(file):
    # Extract the complete text of the document
    extracted_text = text_extraction(file)
    entities = token_classifier(extracted_text)
    # Keep only the aggregated entities labelled as organisations
    orgs = [item for item in entities if item["entity_group"] == "ORG"]
    # Collect the full organisation name of each entity
    names = [item["word"] for item in orgs]
    unique_names = list(set(names))  # Remove duplicates (order is not preserved)
    final = list(filter(None, unique_names))  # Remove empty strings
    print(final)

       
org_name("your file name")

end = time.time()

print("The time of execution of above program is :", round((end - start), 2))
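
To make it clearer what the grouping step does, here is a minimal plain-Python sketch that merges the per-token output shown in the question into whole entities. The helper name merge_entities and its merging rules (consecutive token indices, same label, character offsets deciding whether to insert a space, minimum score kept) are my own assumptions for illustration; this is not the code the transformers pipeline runs internally, only an approximation of the same idea.

```python
# Per-token output copied from the question.
ner_results = [
    {'entity': 'B-ORG', 'score': 0.99965024, 'index': 1, 'word': 'Orion', 'start': 0, 'end': 5},
    {'entity': 'I-ORG', 'score': 0.99945647, 'index': 2, 'word': 'Metal', 'start': 6, 'end': 11},
    {'entity': 'I-ORG', 'score': 0.99943095, 'index': 3, 'word': '##s', 'start': 11, 'end': 12},
    {'entity': 'I-ORG', 'score': 0.99939036, 'index': 4, 'word': 'Limited', 'start': 13, 'end': 20},
    {'entity': 'B-LOC', 'score': 0.9997398, 'index': 14, 'word': 'Australia', 'start': 78, 'end': 87},
]


def merge_entities(tokens):
    """Hypothetical helper: group B-/I- tagged word pieces into whole entities."""
    merged = []
    for tok in tokens:
        if tok["entity"] == "O":
            continue  # not part of any entity
        prefix, _, label = tok["entity"].partition("-")
        last = merged[-1] if merged else None
        continues = (
            last is not None
            and prefix == "I"
            and last["entity_group"] == label
            and tok["index"] == last["_last_index"] + 1
        )
        # Drop the WordPiece continuation marker, e.g. "##s" -> "s".
        piece = tok["word"][2:] if tok["word"].startswith("##") else tok["word"]
        if continues:
            # Touching character offsets mean the piece belongs to the same word.
            glue = "" if tok["start"] == last["end"] else " "
            last["word"] += glue + piece
            last["end"] = tok["end"]
            last["score"] = min(last["score"], tok["score"])
            last["_last_index"] = tok["index"]
        else:
            merged.append({
                "entity_group": label,
                "word": piece,
                "start": tok["start"],
                "end": tok["end"],
                "score": tok["score"],
                "_last_index": tok["index"],
            })
    for entity in merged:
        del entity["_last_index"]  # internal bookkeeping only
    return merged


entities = merge_entities(ner_results)
print([e["word"] for e in entities])  # ['Orion Metals Limited', 'Australia']

# If the order of first appearance matters, dict.fromkeys removes
# duplicates while keeping it, unlike list(set(...)).
org_names = list(dict.fromkeys(e["word"] for e in entities if e["entity_group"] == "ORG"))
print(org_names)  # ['Orion Metals Limited']
```

In practice there is no need to maintain a helper like this: passing aggregation_strategy="simple" to pipeline(...), as in the code above, returns already grouped entities under the "entity_group" key.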