Modify stopword removal code to also remove numbers

Question (votes: 0, answers: 1)

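I have tokenized text in a df column. The code that removes stopwords from it works, but I would also like to remove punctuation, numbers, and special characters without spelling each of them out. In particular, I want to make sure it also removes larger numbers that are tokenized as a single token.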

My current code is:

from nltk.corpus import stopwords

eng_stopwords = stopwords.words('english')
punctuation = ['.', ',', ';', ':', '!']  # and so on
complete_stopwords = punctuation + eng_stopwords
df['removed'] = df['tokenized_text'].apply(lambda words: [word for word in words if word not in complete_stopwords])

python python-3.x pandas stop-words

1 Answer

1
vote

You can get the punctuation characters from the string module:

import string
print(string.punctuation)

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

eng_stopwords = stopwords.words('english')

punctuation = list(string.punctuation) 

complete_stopwords = punctuation + eng_stopwords

df['removed'] = df['tokenized_text'].apply(lambda words: [word for word in words if word not in complete_stopwords])
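The list-membership test above only drops tokens that exactly match a single punctuation character, so numbers and multi-character tokens such as "..." would pass through. One possible way to cover those cases is sketched below. The helper names `is_noise` and `remove_noise` are hypothetical, the small `eng_stopwords` set is a stand-in for `stopwords.words('english')` so that the snippet runs without NLTK data, and lowercasing tokens before the stopword check is an assumption that the original code does not make.

```python
import string

# Hypothetical stand-in for stopwords.words('english'); use the NLTK list in practice.
eng_stopwords = {"the", "a", "is", "in", "of"}
punctuation_chars = set(string.punctuation)
# Translation table that deletes every ASCII punctuation character.
strip_punctuation = str.maketrans("", "", string.punctuation)


def is_noise(token):
    """Return True for stopwords, numbers, and tokens made only of punctuation."""
    if token.lower() in eng_stopwords:
        return True
    # Tokens such as "...", "--" or "?!" consist only of punctuation characters.
    if all(ch in punctuation_chars for ch in token):
        return True
    # Drop separators first so that "1,000", "3.14" and "-5" count as numbers.
    return token.translate(strip_punctuation).isdigit()


def remove_noise(words):
    """Keep only the tokens that is_noise does not flag."""
    return [word for word in words if not is_noise(word)]


tokens = ["The", "price", "is", "1,000", "dollars", "...", "in", "2024", "!"]
cleaned = remove_noise(tokens)
print(cleaned)
```

With the DataFrame from the question, this could be applied as df['removed'] = df['tokenized_text'].apply(remove_noise). Mixed tokens such as "3rd" or "covid-19" are kept, since they are not purely numeric; whether that is the desired behavior depends on the data.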

