Grouping strings by their values in Python

Question

I am studying Twitter hashtags, and I have counted how many times each one appears in my CSV file. The file looks like this:

GilletsJaunes, 100
Macron, 50
gilletsjaune, 20
tax, 10

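Now I want to merge two similar terms, for example "GilletsJaunes" and "gilletsjaune", using the fuzzywuzzy library. If the similarity between two terms is greater than 80, their values are summed under one of the two terms and the other term is removed. That would give: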

GilletsJaunes, 120
Macron, 50
tax, 10

Using fuzzywuzzy:

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

fuzz.ratio("GiletsJaunes", "giletsjaune")
82 #output

Answer 1

First, copy these two functions so that you can compute an argmax:

# given an iterable of pairs return the key corresponding to the greatest value
def argmax(pairs):
    return max(pairs, key=lambda x: x[1])[0]


# given an iterable of values return the index of the greatest value
def argmax_index(values):
    return argmax(enumerate(values))
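
As a quick check of what the two helpers return, here is a minimal sketch. The helpers are repeated so the snippet runs on its own, and the scores and tag names are made-up values for illustration only.

```python
# Same helpers as above, repeated so the snippet is self-contained.
def argmax(pairs):
    return max(pairs, key=lambda x: x[1])[0]


def argmax_index(values):
    return argmax(enumerate(values))


# Hypothetical similarity scores for three reference tags.
scores = [35, 92, 60]
print(argmax_index(scores))  # index of the largest score: 1

# argmax works on any (key, value) pairs, not only on enumerate().
print(argmax([("Macron", 35), ("GilletsJaunes", 92)]))  # GilletsJaunes
```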

Second, load the contents of the CSV into a Python dictionary, then proceed as follows:

from fuzzywuzzy import fuzz

# named "counts" rather than "input" to avoid shadowing the built-in input()
counts = {
    'GilletsJaunes': 100,
    'Macron': 50,
    'gilletsjaune': 20,
    'tax': 10,
}

threshold = 50

output = dict()
for query in counts:
    references = list(output.keys()) # important: this is output.keys(), not counts.keys()!
    scores = [fuzz.ratio(query, ref) for ref in references]
    if any(s > threshold for s in scores):
        # merge into the closest tag that has already been kept
        best_reference = references[argmax_index(scores)]
        output[best_reference] += counts[query]
    else:
        output[query] = counts[query]

print(output)

{'GilletsJaunes': 120, 'Macron': 50, 'tax': 10}
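
The step "load the contents of the CSV into a Python dictionary" is not shown above. One possible sketch uses the standard csv module. The function name load_counts is hypothetical, and an in-memory string stands in for the real file so the snippet runs as written; with a file on disk you would pass open("file.csv", newline="") instead.

```python
import csv
import io

# Stand-in for the real CSV file (same rows as in the question).
csv_text = "GilletsJaunes, 100\nMacron, 50\ngilletsjaune, 20\ntax, 10\n"


def load_counts(lines):
    """Read 'tag, count' rows into a dict mapping tag -> int count."""
    counts = {}
    # skipinitialspace drops the blank that follows each comma.
    for row in csv.reader(lines, skipinitialspace=True):
        if not row:
            continue  # ignore empty lines
        tag, count = row[0].strip(), int(row[1])
        # If the same tag appears twice, add the counts together.
        counts[tag] = counts.get(tag, 0) + count
    return counts


counts = load_counts(io.StringIO(csv_text))
print(counts)
# {'GilletsJaunes': 100, 'Macron': 50, 'gilletsjaune': 20, 'tax': 10}
```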

Answer 2

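This solves your problem. You can shrink the input sample by converting the tags to lowercase first. I am not sure how the fuzzy matching works, but I suspect that "HeLlO", "hello" and "HELLO" will always score above 80, since they represent the same word.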

import csv
from fuzzywuzzy import fuzz

data = dict()
output = dict()
tags = list()

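with open('file.csv', newline='') as csvDataFile:
    # skipinitialspace drops the blank after each comma ("tag, 100")
    csvReader = csv.reader(csvDataFile, skipinitialspace=True)
    for row in csvReader:
        data[row[0]] = int(row[1])  # counts must be numbers, not strings
        tags.append(row[0])

for tag in tags:
    output[tag] = 0
    for key in data.keys():
        # add the count of every tag that is close enough to this one
        if fuzz.ratio(tag, key) > 80:
            output[tag] = output[tag] + data[key]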
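
The lowercase suggestion in this answer can be sketched as follows. To keep the snippet dependency-free, difflib.SequenceMatcher from the standard library is used as a stand-in for fuzz.ratio; that substitution is an assumption, and the scores may differ from what fuzzywuzzy reports. The names ratio and merge_counts are hypothetical. With this stand-in, mixed-case spellings of the same word score lower until both sides are lowercased.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    # Stand-in for fuzz.ratio: a 0-100 similarity score from the stdlib.
    return round(100 * SequenceMatcher(None, a, b).ratio())


print(ratio("HeLlO", "hello"))          # case differences lower the score
print(ratio("HeLlO".lower(), "hello"))  # identical after lowercasing: 100


def merge_counts(counts, threshold=80):
    """Sum counts of tags whose lowercased forms score above threshold."""
    merged = {}
    for tag, count in counts.items():
        # Compare case-insensitively against the tags already kept.
        match = next(
            (kept for kept in merged
             if ratio(tag.lower(), kept.lower()) > threshold),
            None,
        )
        if match is None:
            merged[tag] = count      # first spelling seen becomes the key
        else:
            merged[match] += count   # later spellings are folded into it
    return merged


result = merge_counts(
    {"GilletsJaunes": 100, "Macron": 50, "gilletsjaune": 20, "tax": 10}
)
print(result)
# {'GilletsJaunes': 120, 'Macron': 50, 'tax': 10}
```

Unlike the loop above, this variant keeps only the first spelling of each group, so the merged tag is not listed twice.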
