SpaCy entities in a dictionary (NLP) (NER)


Good morning, I am working on an OCR project and want to build a dict from the entities found in the OCR-generated text files.

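The first step works well. After "reading" the images, I have a folder containing one txt file per image. Now I want to process these text files with SpaCy (NLP). I am interested in all entities labeled "PER" and "LOC", and I would like to store them as JSON data. My approach so far: the code gives me, for each txt file, a list of person entities and a list of location entities. Is this the right way to do it?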

Now my question for the experts: how can I store these lists in a single dictionary? Something like this:

    "1.txt": [
        {
            "name": "Abbas Hilmi Pasha",
            "type": "PER",
            "frequency": 2
        },
        {
            "name": "Turkey",
            "type": "LOC",
            "frequency":    

....

        }
    ],
    "2.txt": [
        {
            "name": "Englishmen",
            "type": "PER",
            "frequency": 1
        },

--------------------------------------- FULLCODE -------- -------------------

import os

import spacy

# Multilingual NER model
nlp = spacy.load('xx_ent_wiki_sm')

# Folder that holds one txt file per OCR'd image
docs = 'Corpus'

def get_filename(path):
    # Return the paths of all regular files directly inside `path`
    return [i.path for i in os.scandir(path) if i.is_file()]

files = get_filename(docs)

for filepath in files:

    with open(filepath, 'r', encoding='UTF8') as file_to_read:
        some_text = file_to_read.read()
        print(os.path.basename(filepath))
        doc = nlp(some_text)
        perlist = []
        loclist = []
        # Collect the text of every person and location entity
        for ent in doc.ents:
            if ent.label_ == "PER":
                perlist.append(str(ent))
            elif ent.label_ == "LOC":
                loclist.append(str(ent))

        print(perlist)
        print(loclist)

Maybe someone can help me. I would be glad to learn more about this. This is my first time working with (OCR) NLP and entities.

Regards!

dictionary nlp spacy entities ner
1 Answer

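If I understand you correctly, you can achieve this by adding a few things to your code.

All you have to do is use Counter on perlist and loclist and store the results in a dictionary.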

...

from collections import Counter

final_dict = {}  # stores the desired final output in a single dict
for filepath in files:

    with open(filepath, 'r', encoding='UTF8') as file_to_read:
        some_text = file_to_read.read()
        base_name = os.path.basename(filepath)
        print(base_name)
        doc = nlp(some_text)
        perlist=[]
        loclist=[]
        for ent in doc.ents:
            if ent.label_ == "PER":
                perlist.append(str(ent))
            elif ent.label_ == "LOC":
                loclist.append(str(ent))

        # Count the number of PER/LOC entities and store in final_dict
        final_list = []  # {"1.txt": final_list}

        # Count PER entities
        c = Counter(perlist)
        for p, count in c.most_common():
            final_list.append({
                'name': p,
                'type': 'PER',
                'frequency': count
            })

        # Count LOC entities
        c = Counter(loclist)
        for loc, count in c.most_common():
            final_list.append({
                'name': loc,
                'type': 'LOC',
                'type': 'LOC',
                'frequency': count
            })

        # store list of results in final_dict.
        # eg. final_dict['1.txt'] = [{'name': 'Englishmen', 'type': 'PER', 'frequency': 1}, ...]
        final_dict[base_name] = final_list

print('Final result', final_dict)
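
Since the question asks for JSON output, here is a minimal sketch of a more compact variant that counts (name, label) pairs in one Counter and then serializes the result with the standard json module. The helper name entity_records, the LABELS_OF_INTEREST constant, and the sample data are hypothetical and only for illustration; in the real loop the pairs would come from something like [(ent.text, ent.label_) for ent in doc.ents].

```python
import json
from collections import Counter

# Labels to keep; everything else (e.g. ORG) is ignored.
LABELS_OF_INTEREST = ("PER", "LOC")


def entity_records(entities, labels=LABELS_OF_INTEREST):
    """Turn (text, label) pairs into name/type/frequency records."""
    counts = Counter(
        (text, label) for text, label in entities if label in labels
    )
    # most_common() sorts by frequency, highest first
    return [
        {"name": text, "type": label, "frequency": freq}
        for (text, label), freq in counts.most_common()
    ]


# Hypothetical sample data standing in for the entities spaCy would return
sample = {
    "a.txt": [("Ada", "PER"), ("Cairo", "LOC"), ("Ada", "PER"), ("Acme", "ORG")],
    "b.txt": [("Bob", "PER")],
}

final_dict = {name: entity_records(ents) for name, ents in sample.items()}

# ensure_ascii=False keeps non-ASCII names readable in the output
json_text = json.dumps(final_dict, indent=4, ensure_ascii=False)
print(json_text)
```

Note that with a single Counter the records are ordered by frequency across both types, not with all PER entries before the LOC entries as in the code above. To write the result to disk instead of printing it, json.dump(final_dict, f, indent=4, ensure_ascii=False) on a file opened with encoding='utf-8' should work the same way.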