I am cleaning a corpus with the following code:
token = ['hi', 'hi', 'account', 'is', 'follow', 'follow', 'account', 'delhi']
# to_remove is the union of my stopword, city, country, first-name, last-name and other-word lists
to_remove = set(stopwords) | set(cities) | set(countries) | set(firstnames) | set(lastnames) | set(otherwords)
set(token) - to_remove
# {'account', 'follow'}
Because converting the tokens to a set discards the frequency of duplicate words, tf-idf performance suffers. I want the output to keep the word frequencies. My corpus is large: removing words manually with a for loop takes about a week, while the set-difference code above finishes in about 1.5 hours.
I want the fastest way to get this output (a list, since a set cannot hold duplicates):
['account', 'follow', 'follow', 'account']
Try this, I hope it helps:
from collections import Counter
token = ['hi', 'hi', 'account', 'is', 'follow', 'follow', 'account', 'delhi']
# The words you actually want to strip (your stopword/city/other-word lists merged into one set)
to_remove = {'hi', 'is', 'delhi'}

# A list comprehension keeps every surviving occurrence, so word frequencies are preserved;
# set membership makes each lookup O(1), so this stays fast on a large corpus
filtered_token = [word for word in token if word not in to_remove]
print(filtered_token)  # ['account', 'follow', 'follow', 'account']

# If you only need the frequencies for tf-idf, count the filtered list directly
counts = Counter(filtered_token)
print(counts)  # Counter({'account': 2, 'follow': 2})
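If the corpus is so large that even rebuilding the token list is costly, a sketch of an alternative (using the same `token` and `to_remove` names as above, which are assumptions for illustration) is to count the raw tokens once and then drop the unwanted keys from the `Counter`. The surviving counts are exactly the per-word frequencies tf-idf needs, without materializing a second list:

```python
from collections import Counter

token = ['hi', 'hi', 'account', 'is', 'follow', 'follow', 'account', 'delhi']
to_remove = {'hi', 'is', 'delhi'}

# Count every token in one O(n) pass, then keep only the keys
# that are not in the removal set; frequencies are preserved.
counts = Counter({w: c for w, c in Counter(token).items()
                  if w not in to_remove})

print(counts)  # Counter({'account': 2, 'follow': 2})
```

Whether this beats filtering the list depends on how many distinct words the corpus has versus its total length: the dict comprehension iterates unique words only, so it can be cheaper when tokens repeat heavily.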