How can I convert entities (a list) into a dictionary? The code I tried is commented out and does not work (NLP problem).


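How can I convert entities (a list) into a dictionary? The code I tried is commented out below, but it did not work. Alternatively, instead of converting, how could I build entities as something dictionary-like in the first place? I want a dictionary so that I can find the 5 most common names in the first 500 sentences.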

! pip install wget
import wget
url = 'https://raw.githubusercontent.com/dirkhovy/NLPclass/master/data/moby_dick.txt'
wget.download(url, 'moby_dick.txt')
documents = [line.strip() for line in open('moby_dick.txt', encoding='utf8').readlines()]

import spacy

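# 'en' is a spaCy 2.x shortcut; spaCy 3 requires the full model name, e.g. 'en_core_web_sm'
nlp = spacy.load('en')
# One list of (text, label) tuples per sentence
entities = [[(entity.text, entity.label_) for entity in nlp(sentence).ents] for sentence in documents[:50]]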
entities


# I tried this, but it is wrong
#def Convert(lst): 
#    res_dct = {lst[i]: lst[i + 1] for i in range(0, len(lst), 2)} 
#    return res_dct
#print(Convert(ent)) 

python nlp nltk spacy
1 answer

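The list stored in the variable entities has the type list[list[tuple[str, str]]], where the first item of each tuple is the entity's text and the second item is the entity's type. For example: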

>>> from pprint import pprint
>>> pprint(entities)
[[],
 [('Ishmael', 'GPE')],
 [('Some years ago', 'DATE')],
 [],
 [('November', 'DATE')],
 [],
 [('Cato', 'ORG')],
 [],
 [],
 [('Manhattoes', 'ORG'), ('Indian', 'NORP')],
 [],
 [('a few hours', 'TIME')],
...

You can then build an inverted dict, mapping each entity type to the entities of that type, as follows:

>>> sum(filter(None, entities), [])
[('Ishmael', 'GPE'), ('Some years ago', 'DATE'), ('November', 'DATE'), ('Cato', 'ORG'), ('Manhattoes', 'ORG'), ('Indian', 'NORP'), ('a few hours', 'TIME'), ('Sabbath afternoon', 'TIME'), ('Corlears Hook to Coenties Slip', 'WORK_OF_ART'), ('Whitehall', 'PERSON'), ('thousands upon thousands', 'CARDINAL'), ('China', 'GPE'), ('week days', 'DATE'), ('ten', 'CARDINAL'), ('American', 'NORP'), ('June', 'DATE'), ('one', 'CARDINAL'), ('Niagara', 'ORG'), ('thousand miles', 'QUANTITY'), ('Tennessee', 'GPE'), ('two', 'CARDINAL'), ('Rockaway Beach', 'GPE'), ('first', 'ORDINAL'), ('first', 'ORDINAL'), ('Persians', 'NORP')]
>>> from collections import defaultdict
>>> type2entities = defaultdict(list)
>>> for entity, entity_type in sum(filter(None, entities), []):
...   type2entities[entity_type].append(entity)
...
>>> from pprint import pprint
>>> pprint(type2entities)
defaultdict(<class 'list'>,
            {'CARDINAL': ['thousands upon thousands', 'ten', 'one', 'two'],
             'DATE': ['Some years ago', 'November', 'week days', 'June'],
             'GPE': ['Ishmael', 'China', 'Tennessee', 'Rockaway Beach'],
             'NORP': ['Indian', 'American', 'Persians'],
             'ORDINAL': ['first', 'first'],
             'ORG': ['Cato', 'Manhattoes', 'Niagara'],
             'PERSON': ['Whitehall'],
             'QUANTITY': ['thousand miles'],
             'TIME': ['a few hours', 'Sabbath afternoon'],
             'WORK_OF_ART': ['Corlears Hook to Coenties Slip']})
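
The flattening and grouping steps above can also be sketched without spaCy, which makes the data transformation easier to check in isolation. This is only a sketch: `sample_entities` and `group_by_type` are hypothetical names introduced here, and `itertools.chain.from_iterable` is swapped in for `sum(filter(None, entities), [])` as an alternative way to flatten one level of nesting.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical sample in the same shape as `entities`:
# one list of (text, label) tuples per sentence, possibly empty.
sample_entities = [
    [],
    [("Ishmael", "GPE")],
    [("Manhattoes", "ORG"), ("Indian", "NORP")],
    [("China", "GPE")],
]


def group_by_type(entities_per_sentence):
    """Map each entity label to the list of entity texts that carry it."""
    type2entities = defaultdict(list)
    # chain.from_iterable flattens the per-sentence lists; empty lists
    # simply contribute nothing, so no filtering step is needed.
    for text, label in chain.from_iterable(entities_per_sentence):
        type2entities[label].append(text)
    return dict(type2entities)


grouped = group_by_type(sample_entities)
print(grouped)
```

Returning a plain `dict` at the end is a design choice: it avoids silently creating empty lists when a missing label is looked up later.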

The dict stored in type2entities is what you want. To get the most frequent person names in the first 500 lines (along with their mention counts):

>>> from collections import Counter
>>> entities = [[(entity.text, entity.label_) for entity in nlp(sentence).ents] for sentence in documents[:500]]
>>> person_cnt = Counter()
>>> for entity, entity_type in sum(filter(None, entities), []):
...   if entity_type == 'PERSON':
...     person_cnt[entity] += 1
...
>>> person_cnt.most_common(5)
[('Queequeg', 17), ('don', 4), ('Nantucket', 2), ('Jonah', 2), ('Sal', 2)]
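
The counting step can be wrapped in a small helper so the label and the number of results are parameters. The following is a minimal sketch under stated assumptions: `top_entities` and `sample` are hypothetical names, and the sample data is made up to mirror the shape of `entities` rather than taken from the real spaCy output.

```python
from collections import Counter
from itertools import chain


def top_entities(entities_per_sentence, label="PERSON", n=5):
    """Return the n most frequent entity texts carrying the given label."""
    counts = Counter(
        text
        for text, entity_label in chain.from_iterable(entities_per_sentence)
        if entity_label == label
    )
    return counts.most_common(n)


# Hypothetical sample: one list of (text, label) tuples per sentence.
sample = [
    [("Queequeg", "PERSON"), ("Nantucket", "GPE")],
    [("Queequeg", "PERSON")],
    [],
    [("Jonah", "PERSON")],
]

top = top_entities(sample, n=2)
print(top)
```

The counts reflect whatever labels the model assigned, so entries such as 'Nantucket' in the output above are counted as PERSON only because the model tagged them that way.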