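Good morning, I am working on an OCR project and want to build a dict from the entities found in the OCR-generated text files.
The first step works well. After "reading" the images, I have a folder containing one txt file per image. Now I want to process these text files with spaCy (NLP). I am interested in all entities labeled "PER" and "LOC", and I want them stored as JSON data. My idea so far: the code gives me, for each txt file, a list of person entities and a list of location entities. Is that correct?
Now my question for the experts: how can I store these lists in a single dictionary? Like this: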
"1.txt": [
    {
        "name": "Abbas Hilmi Pasha",
        "type": "PER",
        "frequency": 2
    },
    {
        "name": "Turkey",
        "type": "LOC",
        "frequency":
        ....
    }
],
"2.txt": [
    {
        "name": "Englishmen",
        "type": "PER",
        "frequency": 1
    },
--------------------------------------- FULL CODE ---------------------------------------
import os

import spacy

nlp = spacy.load('xx_ent_wiki_sm')
docs = 'Corpus'  # folder containing the OCR-generated txt files


def get_filename(path):
    # Return the paths of all regular files directly inside `path`
    return [i.path for i in os.scandir(path) if i.is_file()]


files = get_filename(docs)
for filepath in files:
    with open(filepath, 'r', encoding='UTF8') as file_to_read:
        some_text = file_to_read.read()
    print(os.path.basename(filepath))
    doc = nlp(some_text)
    perlist = []
    loclist = []
    # Collect the text of every person and location entity in this file
    for ent in doc.ents:
        if ent.label_ == "PER":
            perlist.append(str(ent))
        elif ent.label_ == "LOC":
            loclist.append(str(ent))
    print(perlist)
    print(loclist)
Maybe someone can help me. I would be glad to learn more about this. This is my first time working with (OCR) NLP and entities.
Regards!
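If I understand you correctly, you can achieve this by adding a few things to your code.
All you need to do is use Counter (from the collections module) on perlist and loclist, and store the results in a dictionary.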
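As a quick illustration of what Counter does with such a list, here is a minimal sketch. The sample `perlist` is hypothetical data standing in for the entities found in one file; it is not output from the actual corpus.

```python
from collections import Counter

# Hypothetical sample standing in for the PER entities of one file
perlist = ["Abbas Hilmi Pasha", "Englishmen", "Abbas Hilmi Pasha"]

counts = Counter(perlist)

# most_common() returns (item, count) pairs, highest count first
pairs = counts.most_common()
print(pairs)  # [('Abbas Hilmi Pasha', 2), ('Englishmen', 1)]
```

Each pair maps directly onto one `{"name": ..., "type": ..., "frequency": ...}` entry of the desired output.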
from collections import Counter  # in addition to the imports in the question
...
final_dict = {}  # stores the desired final output in a single dict
for filepath in files:
    with open(filepath, 'r', encoding='UTF8') as file_to_read:
        some_text = file_to_read.read()
    base_name = os.path.basename(filepath)
    print(base_name)
    doc = nlp(some_text)
    perlist = []
    loclist = []
    for ent in doc.ents:
        if ent.label_ == "PER":
            perlist.append(str(ent))
        elif ent.label_ == "LOC":
            loclist.append(str(ent))
    # Count the number of PER/LOC entities and store in final_dict
    final_list = []  # {"1.txt": final_list}
    # Count PER entities
    c = Counter(perlist)
    for p, count in c.most_common():
        final_list.append({
            'name': p,
            'type': 'PER',
            'frequency': count
        })
    # Count LOC entities
    c = Counter(loclist)
    for loc, count in c.most_common():
        final_list.append({
            'name': loc,
            'type': 'LOC',
            'frequency': count
        })
    # Store the list of results in final_dict,
    # e.g. final_dict['1.txt'] = [{'name': 'Englishmen', 'type': 'PER', 'frequency': 1}, ...]
    final_dict[base_name] = final_list
print('Final result', final_dict)
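Since the question asks for JSON data, the resulting dict could then be serialized with the standard json module. The following is only a sketch: the sample `final_dict` reuses entries from the question, and the helper `save_as_json` and the file name are hypothetical, not part of the code above.

```python
import json

# Sample in the same shape as final_dict above (entries taken from the question)
final_dict = {
    "1.txt": [{"name": "Abbas Hilmi Pasha", "type": "PER", "frequency": 2}],
    "2.txt": [{"name": "Englishmen", "type": "PER", "frequency": 1}],
}


def save_as_json(data, path):
    """Hypothetical helper: write `data` to `path` as UTF-8 encoded JSON."""
    with open(path, "w", encoding="utf-8") as out_file:
        # ensure_ascii=False keeps non-ASCII names readable in the file
        json.dump(data, out_file, indent=2, ensure_ascii=False)


# Serialize to a string first to inspect the result
json_text = json.dumps(final_dict, indent=2, ensure_ascii=False)
print(json_text)
```

Calling `save_as_json(final_dict, "entities.json")` (an assumed file name) would write the same content to disk.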