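Hello, I'm trying to use bert-base-NER to extract full company names from text descriptions of companies. I'm open to other approaches too, but I haven't been able to find any. The problem is that although it tags organisations correctly, it tags them per word/token, so I can't easily extract the full company name without joining the pieces together myself.
Is there a simpler method or model for this?
Here is my code: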
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
# text1 holds the company description string (not shown here)
ner_results = nlp(text1)
print(ner_results)
Here is my output for one text string:
[{'entity': 'B-ORG', 'score': 0.99965024, 'index': 1, 'word': 'Orion', 'start': 0, 'end': 5}, {'entity': 'I-ORG', 'score': 0.99945647, 'index': 2, 'word': 'Metal', 'start': 6, 'end': 11}, {'entity': 'I-ORG', 'score': 0.99943095, 'index': 3, 'word': '##s', 'start': 11, 'end': 12}, {'entity': 'I-ORG', 'score': 0.99939036, 'index': 4, 'word': 'Limited', 'start': 13, 'end': 20}, {'entity': 'B-LOC', 'score': 0.9997398, 'index': 14, 'word': 'Australia', 'start': 78, 'end': 87}]
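I ran into a similar problem and solved it with a model called "xlm-roberta-large-finetuned-conll03-english". In my experience it works much better than the model you are using now, and it returns complete organisation names instead of broken fragments. The grouping itself comes from passing aggregation_strategy="simple" to the pipeline, which merges word pieces into whole entities. Feel free to test the code below, which extracts the full list of organisations from a document. If you find it useful, please click the tick button to accept my answer.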
from transformers import pipeline
from pdfminer.high_level import extract_text
import docx2txt
import time

start = time.time()
model_checkpoint = "xlm-roberta-large-finetuned-conll03-english"
# aggregation_strategy="simple" groups tokens into whole entities
token_classifier = pipeline(
    "token-classification", model=model_checkpoint, aggregation_strategy="simple"
)
def text_extraction(file):
    """
    To extract text from both PDF and Word files
    """
    if file.endswith(".pdf"):
        return extract_text(file)
    else:
        resume_text = docx2txt.process(file)
        if resume_text:
            return resume_text.replace('\t', ' ')
    return None
# Organisation names extraction
def org_name(file):
    # Extract the complete text of the document
    extracted_text = text_extraction(file)
    if not extracted_text:
        # Nothing could be extracted, so there is nothing to classify
        return []
    entities = token_classifier(extracted_text)
    # Keep only the dictionaries whose "entity_group" is "ORG"
    values = [item for item in entities if item["entity_group"] == "ORG"]
    # Collect the entity text of each match
    res = [sub['word'] for sub in values]
    final1 = list(set(res))  # Remove duplicates
    final = list(filter(None, final1))  # Remove empty strings
    print(final)
    return final

org_name("your file name")
end = time.time()
print("The time of execution of above program is :", round((end - start), 2))
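If you would rather keep dslim/bert-base-NER and join the pieces yourself, the manual approach mentioned in the question can be sketched roughly as below. This is only a minimal sketch: merge_entities is a hypothetical helper name (it is not part of transformers), and it assumes the pipeline output has the same keys as the output shown in the question ('entity', 'word', 'start', 'end') with B-/I- tags and '##' marking word-piece continuations.

```python
def merge_entities(ner_results):
    """Join B-/I- tagged word pieces into whole entities.

    Hypothetical helper, not part of transformers. Assumes each item has
    the keys 'entity', 'word', 'start' and 'end'.
    """
    merged = []
    current = None
    for token in ner_results:
        # "B-ORG" -> prefix "B", label "ORG"
        prefix, _, label = token["entity"].partition("-")
        word = token["word"]
        continues = (
            current is not None
            and prefix == "I"
            and current["entity_group"] == label
        )
        if continues:
            if word.startswith("##"):
                # Word-piece continuation: glue it on without the marker
                current["word"] += word[2:]
            elif token["start"] == current["end"]:
                # Directly adjacent in the source text: no space
                current["word"] += word
            else:
                current["word"] += " " + word
            current["end"] = token["end"]
        else:
            # A B- tag (or a label change) starts a new entity
            if current is not None:
                merged.append(current)
            current = {
                "entity_group": label,
                "word": word[2:] if word.startswith("##") else word,
                "start": token["start"],
                "end": token["end"],
            }
    if current is not None:
        merged.append(current)
    return merged


# The output posted in the question
ner_results = [
    {'entity': 'B-ORG', 'score': 0.99965024, 'index': 1, 'word': 'Orion', 'start': 0, 'end': 5},
    {'entity': 'I-ORG', 'score': 0.99945647, 'index': 2, 'word': 'Metal', 'start': 6, 'end': 11},
    {'entity': 'I-ORG', 'score': 0.99943095, 'index': 3, 'word': '##s', 'start': 11, 'end': 12},
    {'entity': 'I-ORG', 'score': 0.99939036, 'index': 4, 'word': 'Limited', 'start': 13, 'end': 20},
    {'entity': 'B-LOC', 'score': 0.9997398, 'index': 14, 'word': 'Australia', 'start': 78, 'end': 87},
]

entities = merge_entities(ner_results)
org_names = [e["word"] for e in entities if e["entity_group"] == "ORG"]
print(org_names)  # ['Orion Metals Limited']
```

Each merged entity keeps its start and end offsets, so if you still have the original string you can slice it with those offsets to get the exact surface text instead of the rebuilt one.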