Is there an alternative to `difflib.get_close_matches()` that returns indexes (list positions) instead of a list of strings?

Question

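I want to use something like difflib.get_close_matches, but instead of the most similar strings I would like to get their indexes (i.e. their positions in the list).

Indexes are more flexible than the strings themselves, because an index can be used to look up related data in other data structures that are aligned with the list of strings.

For example, instead of: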

>>> import difflib
>>> words = ['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo', 'question', 'format']
>>> difflib.get_close_matches('Hello', words)
['hello', 'Hallo', 'hallo']

I would like:

>>> difflib.get_close_matches('Hello', words)
[0, 1, 6]

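There does not seem to be a parameter that produces this result. Is there an alternative to difflib.get_close_matches() that returns the indexes?

My research on alternatives

I know I could use difflib.SequenceMatcher and compare the strings one by one with ratio (or quick_ratio). However, I was afraid this would be very inefficient, because: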

1. I would have to create thousands of SequenceMatcher objects and compare them (I was expecting get_close_matches to avoid using that class). EDIT: Wrong. I checked the source code of get_close_matches, and it actually uses SequenceMatcher.
2. There is no cutoff (I guessed there was an optimization that avoids computing the ratio for every string). EDIT: Partially wrong. The code of get_close_matches has no major optimizations, except that it uses real_quick_ratio, quick_ratio and ratio all together. In any case, I can easily copy that optimization into my own function. I also had not considered that SequenceMatcher has methods to set the sequences, set_seq1 and set_seq2, so at least I would not have to create a new object every time.
3. As far as I know, all Python libraries are compiled in C, which would improve performance. EDIT: I am quite sure this is the case. The function is in a folder called cpython. EDIT: There is a small difference (p-value 0.030198) between running the function directly from difflib and running a copy of the function placed in a file mydifflib.py:

    ipdb> timeit.repeat("gcm('hello', _vals)", setup="from difflib import get_close_matches as gcm; _vals=['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo', 'question', 'format']", number=100000, repeat=10)
    [13.230449825001415, 13.126462900007027, 12.965455356999882, 12.955717618009658, 13.066136312991148, 12.935014379996574, 13.082025538009475, 12.943519036009093, 13.149949093989562, 12.970130036002956]
    ipdb> timeit.repeat("gcm('hello', _vals)", setup="from mydifflib import get_close_matches as gcm; _vals=['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo', 'question', 'format']", number=100000, repeat=10)
    [13.363269686000422, 13.087718107010005, 13.112324478992377, 13.358293497993145, 13.283965317998081, 13.056695280989516, 13.021098569995956, 13.04310674899898, 13.024205000008806, 13.152750282009947]
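The "compiled in C" assumption in the last point can be checked directly rather than inferred from the folder name. In CPython, difflib is a pure-Python module (Lib/difflib.py), which would be consistent with the two timings being nearly identical. A minimal sketch using only the standard inspect module (the variable names are illustrative):

```python
import difflib
import inspect

func = difflib.get_close_matches

# True for functions written in Python with `def`
is_python_level = inspect.isfunction(func)
# True only for functions implemented in C, such as len()
is_c_builtin = inspect.isbuiltin(func)

print(is_python_level, is_c_builtin)
print(inspect.isbuiltin(len))
```

If the first line printed shows a Python-level function, copying it into your own module should not be expected to cost a compiled-code speedup.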

Still, it is not as bad as I thought, so I think I will go ahead with this approach unless someone knows of another library or alternative.
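One lightweight alternative that avoids copying any difflib source is to call get_close_matches as usual and map the returned strings back to their positions. The sketch below is only one possible approach, not a difflib feature: the helper name close_match_indexes is hypothetical, and it assumes the possibilities are hashable (e.g. strings).

```python
import difflib
from collections import defaultdict

def close_match_indexes(word, possibilities, n=3, cutoff=0.6):
    """Hypothetical helper: indexes of close matches via get_close_matches."""
    # Remember every position at which each distinct string occurs
    positions = defaultdict(list)
    for idx, candidate in enumerate(possibilities):
        positions[candidate].append(idx)

    # Match against the distinct strings only, then map back to indexes
    matched = difflib.get_close_matches(word, list(positions), n=n, cutoff=cutoff)
    indexes = []
    for candidate in matched:
        indexes.extend(positions[candidate])
    # Duplicated strings can yield more than n positions, so trim
    return indexes[:n]

words = ['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo', 'question', 'format']
result = close_match_indexes('Hello', words)
print(result)
```

The trade-off is an extra dictionary built on every call; for a list that is queried many times, the mapping could be built once and reused.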

Answer 1

I took the source code of get_close_matches and modified it to return the indexes instead of the string values.

# mydifflib.py
from difflib import SequenceMatcher
from heapq import nlargest as _nlargest

def get_close_matches_indexes(word, possibilities, n=3, cutoff=0.6):
    """Use SequenceMatcher to return a list of the indexes of the best 
    "good enough" matches. word is a sequence for which close matches 
    are desired (typically a string).
    possibilities is a list of sequences against which to match word
    (typically a list of strings).
    Optional arg n (default 3) is the maximum number of close matches to
    return.  n must be > 0.
    Optional arg cutoff (default 0.6) is a float in [0, 1].  Possibilities
    that don't score at least that similar to word are ignored.
    """

    if not n > 0:
        raise ValueError("n must be > 0: %r" % (n,))
    if not 0.0 <= cutoff <= 1.0:
        raise ValueError("cutoff must be in [0.0, 1.0]: %r" % (cutoff,))
    result = []
    s = SequenceMatcher()
    s.set_seq2(word)
    for idx, x in enumerate(possibilities):
        s.set_seq1(x)
        # Cheap upper bounds first; the expensive ratio() runs only if both pass
        if (s.real_quick_ratio() >= cutoff and
                s.quick_ratio() >= cutoff and
                s.ratio() >= cutoff):
            result.append((s.ratio(), idx))

    # Move the best scorers to head of list (equal scores: larger index first)
    result = _nlargest(n, result)

    # Strip scores for the best n matches
    return [x for score, x in result]

Usage

>>> from mydifflib import get_close_matches_indexes
>>> words = ['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo', 'question', 'format']
>>> get_close_matches_indexes('hello', words)
[0, 6, 1]

Now I can use these indexes to look up the data associated with each string, without having to search for the string itself.
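As an illustration of that last point, here is a minimal sketch that uses the returned indexes to read from a second list aligned with words. The languages list is invented for the example, and the function is a condensed copy of the one above so that the snippet runs on its own.

```python
from difflib import SequenceMatcher
from heapq import nlargest

def get_close_matches_indexes(word, possibilities, n=3, cutoff=0.6):
    """Condensed copy of the function above: indexes of the best matches."""
    matcher = SequenceMatcher()
    matcher.set_seq2(word)
    scored = []
    for idx, candidate in enumerate(possibilities):
        matcher.set_seq1(candidate)
        if (matcher.real_quick_ratio() >= cutoff
                and matcher.quick_ratio() >= cutoff
                and matcher.ratio() >= cutoff):
            scored.append((matcher.ratio(), idx))
    return [idx for _score, idx in nlargest(n, scored)]

words = ['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo', 'question', 'format']
# Hypothetical parallel list: languages[i] describes words[i]
languages = ['en', 'de', 'en', 'en', 'en', 'en', 'de', 'en', 'en']

matches = get_close_matches_indexes('hello', words)
matched_languages = [languages[i] for i in matches]
print(matches, matched_languages)
```

The same indexes work equally well as row numbers in a table or as keys into any other structure that shares the list's ordering.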
