I am cleaning a corpus with the following code:
token = ['hi', 'hi', 'account', 'is', 'follow', 'follow', 'account', 'delhi']
# to_remove is the union of my stopword, city, country, first-name, last-name and other-word lists
to_remove = set(stopwords) | set(cities) | set(countries) | set(firstnames) | set(lastnames) | set(otherwords)
set(token) - to_remove
# {'account', 'follow'}
Because converting the tokens to a set discards the frequency of duplicate words, tf-idf performance suffers. I want the output to keep the word frequencies. My corpus is large: removing words manually with a for loop takes about a week, while the set-difference code above finishes in about 1.5 hours.
I want the fastest way to get this output (a list, since a set cannot hold duplicates):
['account', 'follow', 'follow', 'account']
Try this, I hope it helps:
from collections import Counter
token = ['hi', 'hi', 'account', 'is', 'follow', 'follow', 'account', 'delhi']
# The words you actually want to strip (your stopword/city/other-word lists merged into one set)
to_remove = {'hi', 'is', 'delhi'}

# A list comprehension keeps every surviving occurrence, so word frequencies are preserved;
# set membership makes each lookup O(1), so this stays fast on a large corpus
filtered_token = [word for word in token if word not in to_remove]
print(filtered_token)  # ['account', 'follow', 'follow', 'account']

# If you only need the frequencies for tf-idf, count the filtered list directly
counts = Counter(filtered_token)
print(counts)  # Counter({'account': 2, 'follow': 2})
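If the corpus is so large that even rebuilding the token list is costly, a sketch of an alternative (using the same `token` and `to_remove` names as above, which are assumptions for illustration) is to count the raw tokens once and then drop the unwanted keys from the `Counter`. The surviving counts are exactly the per-word frequencies tf-idf needs, without materializing a second list:

```python
from collections import Counter

token = ['hi', 'hi', 'account', 'is', 'follow', 'follow', 'account', 'delhi']
to_remove = {'hi', 'is', 'delhi'}

# Count every token in one O(n) pass, then keep only the keys
# that are not in the removal set; frequencies are preserved.
counts = Counter({w: c for w, c in Counter(token).items()
                  if w not in to_remove})

print(counts)  # Counter({'account': 2, 'follow': 2})
```

Whether this beats filtering the list depends on how many distinct words the corpus has versus its total length: the dict comprehension iterates unique words only, so it can be cheaper when tokens repeat heavily.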