NLP: keep word frequencies while cleaning data

Question · votes: 0 · answers: 1

I am cleaning a corpus with the following code:

token = ['hi', 'hi', 'account', 'is', 'follow', 'follow', 'account', 'delhi']
to_remove = set(...)  # union of stopword, city, country, first-name, last-name and other-word lists
set(token) - to_remove
# {'account', 'follow'}

Taking a set of the tokens loses the frequency of repeated words, which hurts tf-idf performance. I want the output to keep word frequencies. My corpus is large: removing words manually with a for loop takes about a week to clean, while the code above finishes the job in about 1.5 hours.
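If only the frequencies (not the token order) are needed downstream, one fast stdlib approach is to count first and then delete the unwanted keys, so each unique word is touched once. This is a minimal sketch; the `to_remove` contents here stand in for the real stopword/city/name union:

```python
from collections import Counter

token = ['hi', 'hi', 'account', 'is', 'follow', 'follow', 'account', 'delhi']
to_remove = {'hi', 'is', 'delhi'}  # placeholder for the real removal set

# Count every token, then drop unwanted words; intersecting with the
# Counter's keys first means we only delete keys that actually exist.
counts = Counter(token)
for word in to_remove & counts.keys():
    del counts[word]

print(counts)  # Counter({'account': 2, 'follow': 2})
```

Unlike `set(token) - to_remove`, the surviving words keep their counts, which is exactly what tf-idf needs.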

I want, in the fastest possible way, output like this (a list, since a set cannot hold duplicates):

['account', 'follow', 'follow', 'account']
python nlp set stop-words
1 Answer

0 votes

Try this; I hope it helps:

from collections import Counter

token = ['hi', 'hi', 'account', 'is', 'follow', 'follow', 'account', 'delhi']

# Build the removal set once; set membership tests are O(1)
to_remove = {'hi', 'is', 'delhi'}  # placeholder for the stopword/city/name union

# A list comprehension keeps duplicates (and their order), unlike set subtraction
filtered_token = [word for word in token if word not in to_remove]

print(filtered_token)  # ['account', 'follow', 'follow', 'account']

# If only the frequencies are needed, Counter gives them directly
print(Counter(filtered_token))  # Counter({'account': 2, 'follow': 2})
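To see why preserving duplicates matters, here is a minimal, simplified tf-idf computation over two toy documents (the documents and the bare-bones formula are assumptions for illustration, not the asker's corpus or any library's exact weighting):

```python
import math
from collections import Counter

# Two toy pre-tokenized documents, duplicates preserved
docs = [['account', 'follow', 'follow', 'account'],
        ['account', 'tweet']]

def tf_idf(docs):
    n = len(docs)
    # Document frequency: in how many documents each word appears
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)  # term frequency relies on duplicates surviving
        scores.append({w: (c / len(doc)) * math.log(n / df[w])
                       for w, c in tf.items()})
    return scores

scores = tf_idf(docs)
print(scores[0]['follow'])   # 0.5 * log(2): 'follow' occurs twice in a 4-token doc
print(scores[0]['account'])  # 0.0: 'account' appears in every document
```

If the tokens had been deduplicated with `set()`, every term frequency would collapse to 1 and the weighting would be distorted.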