Keystroke aggregation for live search phrases

Question

Because of live search on a website, I end up with a set of strings like this:

[
  'how',
  'how do i',
  'how do i cancel my',
  'how do i cancel my account',
  'where is',
  'where is the',
  'where is the analytics',
  'where is the analytics page'
]

I need to apply some kind of edit-distance algorithm that leaves only the two "final" phrases:

[
  'how do i cancel my account',
  'where is the analytics page'
]

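I would appreciate any suggestions on how to implement this.

UPD: This will be used for search analytics, so there may be tens of thousands of records to process.

UPD2: I ended up using the approach below, which gives me a stable score above 0.8 between consecutive keystroke states of the same query, so I can use that threshold to filter out the final queries. I am curious to hear about other options. Jaro-Winkler similarity seems to be the best fit, because it weights leading characters more heavily than trailing ones.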

require 'edits'

values = [
  'how',
  'how do i',
  'how do i cancel my',
  'how do i cancel my account',
  'where is',
  'where is the',
  'where is the analytics',
  'where is the analytics page'
]

values.map(&:strip).uniq
  .each_cons(2)
  .map do |seq|
    [
      seq.first,
      seq.last,
      Edits::JaroWinkler.similarity(seq.first, seq.last)
    ]
  end

["how", "how do i", 0.8541666666666666]
["how do i", "how do i cancel my", 0.888888888888889]
["how do i cancel my", "how do i cancel my account", 0.9384615384615385]
["how do i cancel my account", "where is", 0.47243589743589737]
["where is", "where is the", 0.9333333333333333]
["where is the", "where is the analytics", 0.9090909090909091]
["where is the analytics", "where is the analytics page", 0.962962962962963]
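
The scores above suggest one way to turn the threshold into a filter: keep a phrase whenever its similarity to the next phrase falls below 0.8, and always keep the last phrase. The following is only a sketch under that assumption. The method name `final_phrases` and the `threshold:` keyword are made up for illustration, and the block passed in is a simple prefix check standing in for `Edits::JaroWinkler.similarity`, so that the example runs without the gem.

```ruby
# Hypothetical helper (not part of the `edits` gem): returns the phrases
# that look like "final" queries. The block receives two consecutive
# phrases and must return a similarity score between 0.0 and 1.0.
def final_phrases(values, threshold: 0.8, &similarity)
  cleaned = values.map(&:strip).uniq
  cleaned.each_with_index.select do |phrase, i|
    nxt = cleaned[i + 1]
    # Keep the last phrase, and any phrase whose successor is dissimilar.
    nxt.nil? || similarity.call(phrase, nxt) < threshold
  end.map(&:first)
end

values = [
  'how',
  'how do i',
  'how do i cancel my',
  'how do i cancel my account',
  'where is',
  'where is the',
  'where is the analytics',
  'where is the analytics page'
]

# Stand-in similarity (an assumption for this sketch): 1.0 when the next
# phrase extends the current one, 0.0 otherwise.
finals = final_phrases(values) { |a, b| b.start_with?(a) ? 1.0 : 0.0 }
puts finals
```

With the gem installed, the block could presumably be swapped for `{ |a, b| Edits::JaroWinkler.similarity(a, b) }`, which is the call used in the snippet above. This relies on the phrases arriving in keystroke order.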

Answer 1

The following code should remove the prefixes.

require 'set'

suggestions = Set.new([
  'how',
  'how do i',
  'how do i cancel my',
  'how do i cancel my account',
  'where is',
  'where is the',
  'where is the analytics',
  'where is the analytics page'
])
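# Set#each returns the set itself, so `phrases` is the same (now filtered) Set.
phrases = suggestions.each do |a|
  # Drop every other entry that is a prefix of `a`.
  suggestions.delete_if {|b| a != b && a.start_with?(b) }
end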

phrases.to_a

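Note that the code above does not scale to large arrays, because it compares every phrase against every other one. But I assume you will not get more than 15 or 20 suggestions (with prefixes) from your application.

Reference: Set#delete_if

Hope this helps.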
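
For larger inputs, one possible alternative is to sort the phrases first. In lexicographic order every phrase that starts with a given string sits directly after that string, so a phrase is a prefix of some other phrase exactly when it is a prefix of its immediate successor. The sketch below relies on that property; the method name `drop_prefixes` is hypothetical, and the result comes back in sorted order rather than input order.

```ruby
# Hypothetical helper: removes every phrase that is a prefix of another
# phrase, using one sort plus a single pass instead of pairwise checks.
def drop_prefixes(phrases)
  sorted = phrases.map(&:strip).uniq.sort
  sorted.each_with_index.reject do |phrase, i|
    nxt = sorted[i + 1]
    # After sorting, only the immediate successor needs to be checked.
    nxt && nxt.start_with?(phrase)
  end.map(&:first)
end

suggestions = [
  'how',
  'how do i',
  'how do i cancel my',
  'how do i cancel my account',
  'where is',
  'where is the',
  'where is the analytics',
  'where is the analytics page'
]

remaining = drop_prefixes(suggestions)
puts remaining
```

The sort dominates the cost, so this should stay practical for tens of thousands of records, assuming plain prefix matching is all that is needed.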
