How can I retrieve only the words with noun tags from a file?

Question  Votes: -1  Answers: 3

I need to retrieve from a file only those words whose POS tag is 'NN', 'NNP', 'NNS', or 'NNPS'. My sample input is:

  [['For,IN', ',,,', 'We,PRP', 'the,DT', 'divine,NN', 'caused,VBD', 'apostle,NN', 'We,PRP', 'vouchsafed,VBD', 'unto,JJ', 'Jesus,NNP', 'the,DT', 'son,NN', 'of,IN', 'Mary,NNP', 'all,DT', 'evidence,NN', 'of,IN', 'the,DT', 'truth,NN', ',,,', 'and,CC', 'strengthened,VBD', 'him,PRP', 'with,IN', 'holy,JJ'], [ 'be,VB', 'nor,CC', 'ransom,NN', 'taken,VBN', 'from,IN', 'them,PRP', 'and,CC', 'none,NN', '\n']]

My expected output is:

 [ 'divine', 'apostle','Jesus', 'son','Mary',  'evidence',  'truth',  'ransom', 'none']
python arrays list pos-tagger
3 Answers
1
vote

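Since your input is a list of lists, you can use a nested list comprehension. `str.endswith` accepts a tuple of suffixes, so a single call checks all four noun tags: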

a_list = [['For,IN', ',,,', 'indeed,RB', ',,,', 'We,PRP', 'vouchsafed,VBD', 'unto,JJ', 'Moses,NNPS', 'the,DT', 'divine,NN', 'writ,NN', 'and,CC', 'caused,VBD', 'apostle,NN', 'after,IN', 'apostle,NN', 'to,TO', 'follow,VB', 'him,PRP', ';,:', 'and,CC', 'We,PRP', 'vouchsafed,VBD', 'unto,JJ', 'Jesus,NNP', ',,,', 'the,DT', 'son,NN', 'of,IN', 'Mary,NNP', ',,,', 'all,DT', 'evidence,NN', 'of,IN', 'the,DT', 'truth,NN', ',,,', 'and,CC', 'strengthened,VBD', 'him,PRP', 'with,IN', 'holy,JJ']]

pos_tags = (',NN', ',NNP', ',NNS', ',NNPS')

nouns = [s.split(',')[0] for sub in a_list for s in sub if s.endswith(pos_tags)]

print(nouns)

['Moses', 'divine', 'writ', 'apostle', 'apostle', 'Jesus', 'son', 'Mary', 'evidence', 'truth']

Edit (using the sample input from the question):

a_list = [['For,IN', ',,,', 'We,PRP', 'the,DT', 'divine,NN', 'caused,VBD', 'apostle,NN', 'We,PRP', 'vouchsafed,VBD', 'unto,JJ', 'Jesus,NNP', 'the,DT', 'son,NN', 'of,IN', 'Mary,NNP', 'all,DT', 'evidence,NN', 'of,IN', 'the,DT', 'truth,NN', ',,,', 'and,CC', 'strengthened,VBD', 'him,PRP', 'with,IN', 'holy,JJ'], ['be,VB', 'nor,CC', 'ransom,NN', 'taken,VBN', 'from,IN', 'them,PRP', 'and,CC', 'none,NN', '\n']]
pos_tags = (',NN', ',NNP', ',NNS', ',NNPS')

nouns = [s.split(',')[0] for sub in a_list for s in sub if s.endswith(pos_tags)]

print(nouns)

['divine', 'apostle', 'Jesus', 'son', 'Mary', 'evidence', 'truth', 'ransom', 'none']
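As a variation on the comprehension above, here is a minimal sketch that splits each token on its last comma with `str.rpartition` and compares the tag against a set. The names `NOUN_TAGS` and `extract_nouns` are hypothetical, and the sketch assumes every token has the form `word,TAG` as in the question. Splitting on the last comma keeps a word intact even if the word itself contains a comma.

```python
# Hypothetical names; the four tags are the ones listed in the question.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}

def extract_nouns(tagged_sentences):
    """Return the words whose POS tag is in NOUN_TAGS, in input order."""
    nouns = []
    for sentence in tagged_sentences:
        for token in sentence:
            # Split on the LAST comma: everything before it is the word,
            # everything after it is the tag.
            word, sep, tag = token.rpartition(",")
            # Skip tokens without a comma (e.g. '\n') and punctuation like ',,,'.
            if sep and word and tag in NOUN_TAGS:
                nouns.append(word)
    return nouns

sample = [["the,DT", "divine,NN", ",,,", "Jesus,NNP"], ["ransom,NN", "none,NN", "\n"]]
print(extract_nouns(sample))  # ['divine', 'Jesus', 'ransom', 'none']
```

An exact comparison of the tag against a set also avoids accidentally matching any other tag that merely ends with the same characters.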

2
votes

Here is a simple approach based on a list comprehension:

x = ['For,IN', ....]
y = [w.split(',')[0] for w in x if ',NN' in w]

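It iterates over all the words, keeps only those that contain ",NN", and takes the part before the comma. This works because all four noun tags (NN, NNP, NNS, NNPS) start with "NN".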
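The snippet above works on a flat list, while the input in the question is a list of lists. One possible way to bridge the two, sketched here with a shortened sample of the question's data, is to flatten the nested list with `itertools.chain.from_iterable` before filtering. The variable names `nested` and `flat` are illustrative only.

```python
from itertools import chain

# Shortened sample in the same 'word,TAG' format as the question.
nested = [["the,DT", "son,NN", "of,IN", "Mary,NNP"], ["ransom,NN", "taken,VBN", "\n"]]

# Flatten the list of lists into a single list of tokens.
flat = list(chain.from_iterable(nested))

# Same substring filter as above: keep tokens containing ',NN', drop the tag.
nouns = [w.split(",")[0] for w in flat if ",NN" in w]
print(nouns)  # ['son', 'Mary', 'ransom']
```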


0
votes

You can try something like this in one line with a regular expression:

import re
pattern=r'\w+(?=,NN)'
data=[['For,IN', ',,,', 'We,PRP', 'the,DT', 'divine,NN', 'caused,VBD', 'apostle,NN', 'We,PRP', 'vouchsafed,VBD', 'unto,JJ',
  'Jesus,NNP', 'the,DT', 'son,NN', 'of,IN', 'Mary,NNP', 'all,DT', 'evidence,NN', 'of,IN', 'the,DT', 'truth,NN', ',,,',
  'and,CC', 'strengthened,VBD', 'him,PRP', 'with,IN', 'holy,JJ'],
 ['be,VB', 'nor,CC', 'ransom,NN', 'taken,VBN', 'from,IN', 'them,PRP', 'and,CC', 'none,NN', '\n']]

print(list(map(lambda x: list(filter(lambda y: re.search(pattern, y) is not None, x)), data)))

Output:

[['divine,NN', 'apostle,NN', 'Jesus,NNP', 'son,NN', 'Mary,NNP', 'evidence,NN', 'truth,NN'], ['ransom,NN', 'none,NN']]

If you want the words without the tag, in a single flat list:

print([re.search(pattern, j).group() for i in data for j in i if isinstance(i, list) and re.search(pattern, j) is not None])

Output:

['divine', 'apostle', 'Jesus', 'son', 'Mary', 'evidence', 'truth', 'ransom', 'none']
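One caveat with the pattern `\w+(?=,NN)`: `\w` does not match a hyphen, so a token such as 'well-being,NN' would yield only 'being'. A possible alternative, sketched below, is to match the whole token with `fullmatch` and capture everything before the final tag. The names `NOUN_TOKEN` and `nouns_from` are hypothetical, and the sample data is made up for illustration.

```python
import re

# Capture everything before the trailing ',TAG', where TAG is one of the
# four noun tags from the question. Compiled once and reused.
NOUN_TOKEN = re.compile(r"(.+),(?:NN|NNS|NNP|NNPS)")

def nouns_from(tagged_sentences):
    """Return the word part of every token that is tagged as a noun."""
    result = []
    for sentence in tagged_sentences:
        for token in sentence:
            # fullmatch requires the entire token to fit the pattern,
            # so tags such as VBN or punctuation tokens are rejected.
            match = NOUN_TOKEN.fullmatch(token)
            if match:
                result.append(match.group(1))
    return result

sample = [["well-being,NN", "taken,VBN", ",,,"], ["Mary,NNP", "\n"]]
print(nouns_from(sample))  # ['well-being', 'Mary']
```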