I have a query that uses phrase search to match a whole phrase.
SELECT ts_headline(
'simple',
'This is my test text. My test text has many words. Well, not THAT many words.',
phraseto_tsquery('simple', 'text has many words')
);
The result is:
This is my test <b>text</b>. My test <b>text</b> <b>has</b> <b>many</b> <b>words</b>. Well, not THAT <b>many</b> <b>words</b>.
But I expected this:
This is my test text. My test <b>text</b> <b>has</b> <b>many</b> <b>words</b>. Well, not THAT many words.
Or, ideally, even this:
This is my test text. My test <b>text has many words</b>. Well, not THAT many words.
Side note:
phraseto_tsquery('simple', 'text has many words')
is equivalent to
to_tsquery('simple', 'text <-> has <-> many <-> words')
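The equivalence in the side note can be checked directly. This is a minimal sketch that assumes a stock PostgreSQL installation with the built-in 'simple' text search configuration; the column aliases are only illustrative.

```sql
-- Compare the query built by phraseto_tsquery with the hand-written one.
-- Under the 'simple' configuration no stop words are dropped, so every
-- word of the phrase should appear, joined by the <-> (followed-by) operator.
SELECT
    phraseto_tsquery('simple', 'text has many words') AS phrase_query,
    to_tsquery('simple', 'text <-> has <-> many <-> words') AS explicit_query,
    phraseto_tsquery('simple', 'text has many words')
        = to_tsquery('simple', 'text <-> has <-> many <-> words') AS same_query;
```

If same_query comes back true, the two forms are interchangeable for this phrase, so any difference in highlighting is not caused by how the query is built.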
I am not sure whether I am doing something wrong or whether ts_headline simply does not support this kind of highlighting.
phraseto_tsquery('simple', 'text has many words') generates the correct query, but the problem appears to be in the ts_headline function. It seems this has already been reported as BUG #155172.
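One way to confirm that the query itself is phrase-aware is to test it with the @@ match operator, which is independent of ts_headline. The sketch below splits the sample text into two sentences by hand; the column aliases are hypothetical names chosen for illustration.

```sql
-- The match operator honours the <-> adjacency requirement:
-- only the sentence containing the full phrase should match.
SELECT
    to_tsvector('simple', 'My test text has many words.')
        @@ phraseto_tsquery('simple', 'text has many words') AS full_phrase_matches,
    to_tsvector('simple', 'Well, not THAT many words.')
        @@ phraseto_tsquery('simple', 'text has many words') AS partial_phrase_matches;
```

The first column should be true and the second false, which suggests that the extra highlighting of "many words" in the last sentence comes from how ts_headline marks individual lexemes rather than from the query.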
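I am writing an extension that improves on ts_headline so that matched phrases are highlighted correctly with a single tag and partial matches are not highlighted. The extension is available at https://github.com/thevermeer/pg_ts_semantic_headline and is intended as a drop-in replacement for ts_headline.
Usage: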
SELECT ts_semantic_headline(
'simple',
'This is my test text. My test text has many words. Well, not THAT many words.',
phraseto_tsquery('simple', 'text has many words')
);
This produces:
| ts_semantic_headline |
| --- |
| This is my test text. My test <b>text has many words</b>. Well, not THAT many words. |
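The ts_semantic_headline solution uses ts_headline under the hood to generate content fragments, then applies text parsing and customized TSVectors, together with the included ts_fast_headline function, to perform multi-word highlighting at a minimal performance cost (5-10%) above ts_headline.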
If performance is a concern, the ts_fast_headline function can also work from 2 pre-computed columns (TSPVector + TEXT[]) and deliver highlighted content 5-10 times faster than ts_headline.