I am working with Twitter hashtags and have already counted how many times each one appears in my CSV file. The file looks like this:
GilletsJaunes, 100
Macron, 50
gilletsjaune, 20
tax, 10
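Now I would like to group near-identical terms together, for example "GilletsJaunes" and "gilletsjaune", using the fuzzywuzzy library. If the similarity between two terms is greater than 80, their counts should be added together under one of the two terms and the other term deleted. That would give: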
GilletsJaunes, 120
Macron, 50
tax, 10
Using fuzzywuzzy:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
fuzz.ratio("GiletsJaunes", "giletsjaune")
82 #output
First, copy these two functions so that you can compute the argmax:
# given an iterable of pairs return the key corresponding to the greatest value
def argmax(pairs):
    return max(pairs, key=lambda x: x[1])[0]

# given an iterable of values return the index of the greatest value
def argmax_index(values):
    return argmax(enumerate(values))
Second, load the contents of your CSV into a Python dictionary, then proceed as follows:
from fuzzywuzzy import fuzz

input = {
    'GilletsJaunes': 100,
    'Macron': 50,
    'gilletsjaune': 20,
    'tax': 10,
}

threshold = 50
output = dict()

for query in input:
    references = list(output.keys())  # important: this is output.keys(), not input.keys()!
    scores = [fuzz.ratio(query, ref) for ref in references]
    if any(s > threshold for s in scores):
        # fold the count into the most similar tag that was already kept
        best_reference = references[argmax_index(scores)]
        output[best_reference] += input[query]
    else:
        # no kept tag is close enough, so this tag starts its own entry
        output[query] = input[query]

print(output)
{'GilletsJaunes': 120, 'Macron': 50, 'tax': 10}
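The answer above starts from a hard-coded dictionary and leaves the CSV loading step out. A minimal sketch of that step might look like the following, assuming each row is `tag, count` with no header line. The helper name `load_counts` is hypothetical, and the in-memory `sample` stands in for the real file.

```python
import csv
import io

def load_counts(csv_file):
    """Read `tag, count` rows into a dict mapping tag -> int count."""
    counts = {}
    # skipinitialspace drops the blank after each comma in rows like "Macron, 50"
    for row in csv.reader(csv_file, skipinitialspace=True):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        tag, count = row[0].strip(), int(row[1])
        # if the same tag appears on several rows, add the counts up
        counts[tag] = counts.get(tag, 0) + count
    return counts

# In-memory stand-in for the real file, using the rows from the question.
sample = io.StringIO("GilletsJaunes, 100\nMacron, 50\ngilletsjaune, 20\ntax, 10\n")
counts = load_counts(sample)
print(counts)
```

For a file on disk, `with open('file.csv', newline='') as f: counts = load_counts(f)` would be the equivalent call, and the resulting dictionary can replace the hard-coded `input` above.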
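This solves your problem. You can also shrink the input sample by converting the tags to lowercase first. I'm not sure exactly how fuzzywuzzy scores strings, but variants such as "HeLlO", "hello", and "HELLO" represent the same word, and once lowercased they are identical, so they will always score above 80.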
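import csv
from fuzzywuzzy import fuzz

data = dict()
output = dict()
tags = list()

with open('file.csv') as csvDataFile:
    csvReader = csv.reader(csvDataFile)
    for row in csvReader:
        # csv yields strings, so convert the count before summing
        data[row[0]] = int(row[1])
        tags.append(row[0])

merged = set()  # tags whose counts were already added to an earlier tag
for tag in tags:
    if tag in merged:
        continue  # already absorbed, so it must not get its own entry
    output[tag] = 0
    for key in data.keys():
        if key not in merged and fuzz.ratio(tag, key) > 80:
            output[tag] = output[tag] + data[key]
            merged.add(key)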
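The lowercasing suggestion can be sketched as follows. To keep the example runnable without extra packages, `ratio` here is a standard-library stand-in built on `difflib.SequenceMatcher`, not fuzzywuzzy itself, so its scores may differ slightly from `fuzz.ratio`. The names `ratio` and `merge_similar` are hypothetical. With this stand-in, the case-sensitive score for "GilletsJaunes" and "gilletsjaune" should come out at 80, which does not exceed a threshold of 80, while the lowercased strings score well above it.

```python
from difflib import SequenceMatcher

def ratio(a, b):
    """Stdlib stand-in for fuzz.ratio: similarity score from 0 to 100."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def merge_similar(counts, threshold=80):
    """Fold each tag into the first already-kept tag it resembles, ignoring case."""
    merged = {}
    for tag, count in counts.items():
        # look for a kept tag whose lowercased form is close enough
        match = next(
            (kept for kept in merged if ratio(tag.lower(), kept.lower()) > threshold),
            None,
        )
        if match is None:
            merged[tag] = count      # nothing similar yet: keep the original spelling
        else:
            merged[match] += count   # similar tag found: add the count to it
    return merged

counts = {'GilletsJaunes': 100, 'Macron': 50, 'gilletsjaune': 20, 'tax': 10}
result = merge_similar(counts)
print(result)
```

Only the comparison is lowercased; the first spelling encountered is kept as the key, so the output still shows "GilletsJaunes" as in the question.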