I have the following data:
[['The',
'Fulton',
'County',
'Grand',
'Jury',
'said',
'Friday',
'an',
'investigation',
'of',
"Atlanta's",
'recent',
'primary',
'election',
'produced',
'``',
'no',
'evidence',
"''",
'that',
'any',
'irregularities',
'took',
'place',
'.'],
['The',
'jury',
'further',
'said',
'in',
'term-end',
'presentments',
'that',
'the',
'City',
'Executive',
'Committee',
',',
'which',
'had',
'over-all',
'charge',
'of',
'the',
'election',
',',
'``',
'deserves',
'the',
'praise',
'and',
'thanks',
'of',
'the',
'City',
'of',
'Atlanta',
"''",
'for',
'the',
'manner',
'in',
'which',
'the',
'election',
'was',
'conducted',
'.']]
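So I have a list that contains 2 other lists (in my real case, one big list holds 50,000 lists). I want to remove all punctuation and stopwords such as "the", "a", "of", etc.
Here is the code I wrote: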
import string
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
punct = list(string.punctuation)
punct.append("``")
punct.append("''")
stops = set(stopwords.words("english"))
# dataset is the list of lists shown above
res = [[word.lower() for word in sentence if word not in punct or word.lower() not in stops] for sentence in dataset]
But it returns the same list of lists I started with. What is wrong with my code?
You should use and instead of or:
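A minimal sketch of the corrected filter, using the first sentence from the question. To keep it runnable without downloading NLTK data, stops here is a small hand-written set standing in for set(stopwords.words("english")); that substitution is an assumption for illustration, and the real NLTK list is much longer.

```python
import string

# Assumption: a tiny stand-in for set(stopwords.words("english")),
# so the sketch runs without the NLTK corpus download.
stops = {"the", "a", "an", "of", "that", "any", "no"}

# Punctuation characters plus the two quote tokens used in the data.
punct = set(string.punctuation) | {"``", "''"}

dataset = [["The", "Fulton", "County", "Grand", "Jury", "said", "Friday",
            "an", "investigation", "of", "Atlanta's", "recent", "primary",
            "election", "produced", "``", "no", "evidence", "''", "that",
            "any", "irregularities", "took", "place", "."]]

# Keep a word only if it is NOT punctuation AND NOT a stopword.
res = [[word.lower() for word in sentence
        if word not in punct and word.lower() not in stops]
       for sentence in dataset]

print(res)
```

With or, the condition is true whenever a word is missing from at least one of the two collections, so nothing gets filtered out; with and, a word has to be missing from both to be kept.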
Since punct and stops do not overlap, every