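Is it possible to fuzzy match a substring within a larger string in Postgres?
Example:
A search for colour (spelled with ou) should return all records whose string contains color, colors, or colour.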
select
*
from things
where fuzzy(color) in description;
id | description
----------------
1 | A red coloured car
2 | The garden
3 | Painting colors
=> return records 1 and 3
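I would like to know whether fuzzystrmatch and tsvector can be combined, so that the fuzzy matching is applied to each vectorized term.
Or is there another approach?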
You certainly can do that (levenshtein comes from the fuzzystrmatch extension), but I doubt it would be very useful:
select *,levenshtein(lexeme,'color') from things, unnest(to_tsvector('english',description))
order by levenshtein;
id | description | lexeme | positions | weights | levenshtein
----+--------------------+--------+-----------+---------+-------------
3 | Painting colors | color | {2} | {D} | 0
1 | A red coloured car | colour | {3} | {D} | 1
1 | A red coloured car | car | {4} | {D} | 3
1 | A red coloured car | red | {2} | {D} | 5
3 | Painting colors | paint | {1} | {D} | 5
2 | The garden | garden | {2} | {D} | 6
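You would probably want to embellish the query to apply a cutoff, most likely one that depends on the lengths of the strings being compared, and to return only the best-matching lexeme for each description that meets the cutoff. Doing that is just routine SQL manipulation.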
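One way to sketch that embellishment is shown here. The cutoff rule (one edit allowed per three characters of the shorter string, with a minimum of one) is an assumption chosen for illustration, not something tuned against real data; DISTINCT ON keeps only the closest lexeme per row.

```sql
-- Requires: CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
SELECT DISTINCT ON (t.id)
       t.id,
       t.description,
       u.lexeme,
       levenshtein(u.lexeme, 'color') AS distance
FROM things AS t
CROSS JOIN LATERAL unnest(to_tsvector('english', t.description)) AS u
-- Assumed cutoff: allow 1 edit per 3 characters of the shorter string, at least 1.
WHERE levenshtein(u.lexeme, 'color')
      <= greatest(1, least(length(u.lexeme), length('color')) / 3)
-- DISTINCT ON (t.id) keeps the first row per id, i.e. the smallest distance.
ORDER BY t.id, distance;
```

With the distances listed in the output above, this cutoff should keep only records 1 and 3. A looser or stricter rule only requires changing the WHERE clause.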
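Perhaps better would be the word-similarity operators recently added to pg_trgm. The <->> operator used below returns a distance, that is, one minus the word similarity between the search term and the best-matching part of the description: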
select *, description <->> 'color' as distance from things order by description <->> 'color';
id | description | distance
----+--------------------+----------
3 | Painting colors | 0.166667
1 | A red coloured car | 0.333333
2 | The garden | 1
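To filter rather than only sort, the same extension provides a boolean word-similarity operator. The following is a minimal sketch; the index name is hypothetical, and the filter relies on the pg_trgm.word_similarity_threshold setting, which defaults to 0.6 in current PostgreSQL releases.

```sql
-- Requires: CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- Optional: a trigram GIN index can support the %> filter on larger tables.
-- The index name is made up for this example.
CREATE INDEX things_description_trgm_idx
    ON things USING gin (description gin_trgm_ops);

SELECT id,
       description,
       description <->> 'color' AS distance
FROM things
-- True when the word similarity of 'color' to some part of description
-- exceeds pg_trgm.word_similarity_threshold.
WHERE description %> 'color'
ORDER BY distance;
```

Given the distances shown above (word similarities of roughly 0.83 and 0.67), the default threshold should keep records 1 and 3 and drop record 2.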
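Another option would be to find a stemmer or thesaurus that standardizes British/American spellings (I am not aware of one that is readily available), and then not use fuzzy matching at all. I think this would be the best approach, if you can do it.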
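If no ready-made thesaurus turns up, a hand-maintained synonym dictionary is one way to sketch the idea. Everything named uk_us_spelling or english_us below is hypothetical, and the synonym file has to be written and installed on the server by hand. Note that a synonym dictionary matches whole words only, so inflected forms such as coloured need their own entries.

```sql
-- Assumes a hand-written file $SHAREDIR/tsearch_data/uk_us_spelling.syn
-- (hypothetical) containing lines such as:
--   colour    color
--   colours   color
--   coloured  color
CREATE TEXT SEARCH DICTIONARY uk_us_spelling (
    TEMPLATE = synonym,
    SYNONYMS = uk_us_spelling
);

-- Copy the stock English configuration and consult the synonym
-- dictionary before the stemmer for plain ASCII words.
CREATE TEXT SEARCH CONFIGURATION english_us (COPY = english);

ALTER TEXT SEARCH CONFIGURATION english_us
    ALTER MAPPING FOR asciiword
    WITH uk_us_spelling, english_stem;

-- Ordinary full-text search, no fuzzy matching involved.
SELECT *
FROM things
WHERE to_tsvector('english_us', description)
      @@ to_tsquery('english_us', 'colour');
```

Under these assumptions both the documents and the query term are normalized to the lexeme color, so a search for either spelling should find records 1 and 3.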