User type-in search on a dictionary (asking for directions/techniques to improve the efficiency of what I already have)

Question

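I have built a database containing a table with the following structure:

Column 1 (text) - Column 2 (text) - Column 3 (text) - Column 4 (text) - Column 5 (boolean) - Column 6 (boolean) - Column 7 (boolean) - Column 8 (boolean) - Column 9 (text)

The table has roughly 1,000 rows. More details about the data: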

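Column 1: contains 30 unique values (categories)
Column 2: contains 50 unique values (categories)
Column 3: contains 200 unique values (categories)
Columns 1, 2, and 3 are independent of one another, so any combination of their unique values can appear in a given row.
Column 4: the value differs in nearly every row, with almost no repeats.
All columns have the same number of rows.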

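One purpose of this table is to let the user type a string (or a series of strings separated by spaces " ") (variable name: search_term), which is then used to search the database and return every entry that is similar to the search term. The search runs over columns 1, 2, 3, and 4 and returns row indices, which are then used to display the data to the user. I also cannot query the server every time the user performs a search, because returning the results that way is too slow.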

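Currently, because searchString is typed by the user, I use the SequenceMatcher class from the difflib library to do partial matching and pick up any matches (this is how I handle misspellings and word stems). At startup I read the entire table into a dictionary in the following format (this is not the slow part, just context; see Anvil Lazy Search if you are interested in why it is not slow):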

result = {'column title': [value0, ... , valueN], ...}
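
For illustration, here is a made-up instance of that column-oriented layout and how a returned row index maps back to a full row. The column names, the sample values, and the helper `get_row` are hypothetical and are not taken from the real table.

```python
# Hypothetical sample in the column-oriented layout described above.
# Column names and values are invented for illustration only.
result = {
    'column1': ['Animals', 'Animals', 'Clothing'],
    'column2': ['Pets', 'Wild', 'Accessories'],
    'column3': ['Dogs', 'Birds', 'Hats'],
    'column4': ['a doggy hat', 'blue parrot print', 'wool winter cap'],
}


def get_row(result, index):
    """Collect the values stored at one row index across every column."""
    return {title: values[index] for title, values in result.items()}


# A search returns row indices; each index is then resolved like this.
row = get_row(result, 0)
print(row)
```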

My current search method is as follows (this is the slow part):

from difflib import SequenceMatcher
from operator import itemgetter

def similarity_sort(search_term, result):
    # See minimal example below.
    search_terms = search_term.split()
    # Will be used to store unsorted results.
    temp = []
    # Will be used to store results after being sorted by SequenceMatcher returned float.
    sorted_result = []
    for index in range(len(result['column1'])):
        # Used to calculate the average of the match values of each column item.
        check_count = 0
        # Overall of match values.
        overall = 0
        for term in search_terms:
            a = perform_check(term, result['column1'][index])
            b = perform_check(term, result['column2'][index])
            c = perform_check(term, result['column3'][index])
            d = perform_check(term, result['column4'][index])
            overall += sum([a, b, c, d])
            check_count += 4
        # Calculate average.
        overall /= check_count
        temp.append([index, overall])
    # Sorted based off of overall value High -> Low.
    temp.sort(key=itemgetter(1), reverse=True)
    for x in temp:
        # 0.4 was selected based off of testing observations. This value
        # likely needs some revision, so any input would be great. Is there a
        # way to calculate it mathematically (not the focus of this question)?
        if x[1] > 0.4:
            sorted_result.append(x[0])
    return sorted_result


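def perform_check(term, row_value):
    # Lowercase both sides once so the exact-match test and the fuzzy
    # comparison both see the same case-normalized strings.
    term = term.lower()
    row_value = row_value.lower()
    if term == row_value:
        check_ratio = 2
    else:
        check_ratio = SequenceMatcher(None, term, row_value).quick_ratio()
    return check_ratio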

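The code above has two main problems. First, I am worried that it will not scale well as the table grows. Second, the text columns can contain multiple words, and because I split the search term, an exact match on a multi-word search is impossible. A minimal example: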

searchTerm = 'a doggy hat'
searchTerms = searchTerm.split()  # ['a', 'doggy', 'hat']
Row_value = 'a doggy hat'
perform_check('a', 'a doggy hat')
# etc...
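
The mismatch can be reproduced with a short script. `perform_check` is repeated here (with both sides lowercased) only so the snippet runs on its own; the scoring rule is the one described in this question.

```python
from difflib import SequenceMatcher


def perform_check(term, row_value):
    # 2 for an exact case-insensitive match, otherwise quick_ratio().
    term, row_value = term.lower(), row_value.lower()
    if term == row_value:
        return 2
    return SequenceMatcher(None, term, row_value).quick_ratio()


search_term = 'a doggy hat'
row_value = 'a doggy hat'

# Score each token against the multi-word value, as similarity_sort does.
token_scores = [perform_check(t, row_value) for t in search_term.split()]
# Score the unsplit search term against the same value.
whole_score = perform_check(search_term, row_value)

print(token_scores)  # every entry stays below 2: no single token equals the value
print(whole_score)   # 2: only the unsplit term hits the exact-match branch
```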

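This would be easy to fix by adding another round of perform_check calls that compare the whole search_term against every searched column, but that makes the search bigger and slower, which is why I am here asking for advice.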
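
A minimal sketch of what that extra whole-term pass could look like, under stated assumptions: `COLUMNS` and `row_score` are hypothetical names that do not exist in my code, the sample table is invented, and `perform_check` is repeated so the snippet is self-contained. For a search with n tokens this makes (n + 1) * 4 comparisons per row instead of n * 4, which is the extra cost mentioned above.

```python
from difflib import SequenceMatcher

# Hypothetical names: COLUMNS and row_score are not part of the original code.
COLUMNS = ('column1', 'column2', 'column3', 'column4')


def perform_check(term, row_value):
    # 2 for an exact case-insensitive match, otherwise quick_ratio().
    term, row_value = term.lower(), row_value.lower()
    if term == row_value:
        return 2
    return SequenceMatcher(None, term, row_value).quick_ratio()


def row_score(search_term, result, index):
    """Average score for one row, with the whole term as one extra query."""
    queries = search_term.split()
    if not queries:
        return 0.0  # nothing typed: avoid dividing by zero
    if len(queries) > 1:
        # Re-join the tokens so stray whitespace does not block an exact match.
        queries.append(' '.join(queries))
    total = sum(perform_check(q, result[col][index])
                for q in queries for col in COLUMNS)
    return total / (len(queries) * len(COLUMNS))


# Invented two-row table; only column4 differs between the rows.
result = {
    'column1': ['Animals', 'Animals'],
    'column2': ['Pets', 'Pets'],
    'column3': ['Dogs', 'Dogs'],
    'column4': ['a doggy hat', 'wool winter cap'],
}
scores = [row_score('a doggy hat', result, i) for i in range(2)]
print(scores)  # the first row scores higher because of the exact whole-term match
```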

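I am looking for a way to structure the data (it may be unsorted) that lets me run a fast search over the whole table to the specification above. Any technique you think would be useful to me (really, any technique) would be much appreciated. Also, if you spot any mistakes (everything works, it is just inefficient), I would be glad to have them pointed out!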
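
One possible direction that follows from the column cardinalities described above, offered as an untested sketch and not as a benchmarked result: columns 1, 2, and 3 hold only 30, 50, and 200 distinct values across roughly 1,000 rows, so the same (term, value) pair is scored many times. Memoizing the comparison with `functools.lru_cache` would compute each distinct pair once. The name `cached_check` and the sample column are hypothetical.

```python
from difflib import SequenceMatcher
from functools import lru_cache


@lru_cache(maxsize=None)
def cached_check(term, value):
    """Same scoring rule as perform_check, memoized per (term, value) pair."""
    term, value = term.lower(), value.lower()
    if term == value:
        return 2
    return SequenceMatcher(None, term, value).quick_ratio()


# Invented category column with repeated values, as in columns 1 to 3.
column1 = ['Animals', 'Clothing', 'Animals', 'Animals', 'Clothing']
scores = [cached_check('animal', value) for value in column1]

# cache_info() reports how many calls were answered from the cache.
info = cached_check.cache_info()
print(info.hits, info.misses)
```

An unbounded cache grows with every distinct term a user types, so a bounded `maxsize` may be the safer choice. Column 4 has almost no repeated values and would gain little from this.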

Sorry this question is so long. If you have the time and expertise, I would greatly appreciate any help you can offer. Thanks in advance!
