I have created an index with this analyzer:
{
  "settings": {
    "analysis": {
      "filter": {
        "specialCharFilter": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 30
        }
      },
      "analyzer": {
        "specialChar": {
          "type": "custom",
          "tokenizer": "custom_tokenizer",
          "filter": [
            "lowercase",
            "specialCharFilter"
          ]
        }
      },
      "tokenizer": {
        "custom_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 30,
          "token_chars": [
            "letter",
            "digit",
            "symbol",
            "punctuation"
          ]
        }
      }
    },
    "index.max_ngram_diff": 30
  },
  "mappings": {
    "properties": {
      "partyName": {
        "type": "keyword",
        "analyzer": "specialChar",
        "search_analyzer": "standard"
      }
    }
  }
}
The index contains these documents:
[
  {
    "partyName": "FLYJAC LOGISTICS PVT LTD-TPTBLR ."
  },
  {
    "partyName": "L&T GEOSTRUCTURE PRIVATE LIMITED"
  }
]
If I query with {"query": {"match": {"partyName": "L&T"}}}, I want the following object in the output: {"partyName": "L&T GEOSTRUCTURE PRIVATE LIMITED"}
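First, having both an ngram tokenizer and an ngram token filter makes no sense: it generates far too many useless tokens and needlessly inflates the index size.
Next, your search for L&T returns nothing because the standard search-time analyzer strips the & symbol and searches only for l and t. Neither can match anything, since you only index tokens with a minimum length of 2.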
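Both points can be checked with a rough Python simulation. This is only a sketch, not Elasticsearch itself: the helper names (ngrams, index_terms, standard_search_terms) are made up for illustration, the ngram tokenizer is approximated by splitting on whitespace (the token_chars list keeps every other character class that occurs in the sample), and the standard analyzer is approximated by a regular expression that keeps runs of ASCII letters and digits.

```python
import re

def ngrams(text, min_gram=2, max_gram=30):
    """All substrings of text whose length is between min_gram and max_gram."""
    return [
        text[i:i + n]
        for i in range(len(text))
        for n in range(min_gram, max_gram + 1)
        if i + n <= len(text)
    ]

def index_terms(value):
    """Approximate the question's chain: ngram tokenizer, lowercase, ngram filter."""
    terms = []
    # Assumption: only whitespace separates chunks, because letters, digits,
    # symbols and punctuation are all listed in token_chars.
    for chunk in value.split():
        for token in ngrams(chunk):              # ngram tokenizer
            terms.extend(ngrams(token.lower()))  # lowercase + ngram filter
    return terms

def standard_search_terms(query):
    """Rough stand-in for the standard analyzer: alphanumeric runs, lowercased."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", query)]

indexed = index_terms("L&T GEOSTRUCTURE PRIVATE LIMITED")
searched = standard_search_terms("L&T")

print(searched)                         # ['l', 't']
print(set(searched) & set(indexed))     # set()
print(len(indexed), len(set(indexed)))  # total terms vs. distinct terms
```

The empty intersection is the reason for the missing hit, and the gap between the two counts on the last line gives a feel for how much redundant work the double n-gram setup does.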
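I suggest the following analyzer instead. It uses the whitespace tokenizer to simply split the text on whitespace and then runs edge-ngram on each token, so you can search for any prefix (minimum length 2) of any indexed token. In addition, the partyName field must be of type text (not keyword) if you want its content to be analyzed: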
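PUT test
{
  "settings": {
    "analysis": {
      "filter": {
        "specialCharFilter": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 30
        }
      },
      "analyzer": {
        "specialChar": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "specialCharFilter"
          ]
        },
        "specialCharSearch": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase"
          ]
        }
      }
    },
    "index.max_ngram_diff": 30
  },
  "mappings": {
    "properties": {
      "partyName": {
        "type": "text",
        "analyzer": "specialChar",
        "search_analyzer": "specialCharSearch"
      }
    }
  }
}
The search analyzer is defined explicitly as a whitespace tokenizer plus the lowercase filter, so the query L&T becomes the single term l&t, which is one of the indexed prefixes.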
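To get a feel for what the index-time chain emits, here is a small Python approximation. It is a sketch under stated assumptions rather than the real analyzer: edge_ngrams and special_char_terms are hypothetical helper names, and whitespace splitting is done with str.split.

```python
def edge_ngrams(token, min_gram=2, max_gram=30):
    """Prefixes of token whose length is between min_gram and max_gram."""
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]

def special_char_terms(value):
    """Approximate the chain: whitespace tokenizer, lowercase, edge_ngram."""
    terms = []
    for token in value.split():
        terms.extend(edge_ngrams(token.lower()))
    return terms

print(special_char_terms("L&T GEOSTRUCTURE PRIVATE LIMITED")[:4])
# ['l&', 'l&t', 'ge', 'geo']
```

Because the & stays inside the token, l&t is indexed as a term in its own right. Tokens shorter than min_gram, such as the lone "." in the first sample document, produce no terms at all.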
We can then index your sample data:
PUT test/_doc/1
{
  "partyName": "FLYJAC LOGISTICS PVT LTD-TPTBLR ."
}

PUT test/_doc/2
{
  "partyName": "L&T GEOSTRUCTURE PRIVATE LIMITED"
}
Searching with the query you provided then returns the second document:
POST test/_search
{
  "query": {
    "match": {
      "partyName": "L&T"
    }
  }
}
=>
"hits": [
  {
    "_index": "test",
    "_id": "2",
    "_score": 1.0538965,
    "_source": {
      "partyName": "L&T GEOSTRUCTURE PRIVATE LIMITED"
    }
  }
]
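The matching behaviour can be mimicked end to end in a few lines of Python. This is an illustrative sketch only: it assumes the search side merely splits on whitespace and lowercases, it treats a match query as "any search term equals any indexed term", it ignores scoring, and all function names are invented for the example.

```python
def edge_ngrams(token, min_gram=2, max_gram=30):
    """Prefixes of token whose length is between min_gram and max_gram."""
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]

def index_terms(value):
    """Index side: whitespace split, lowercase, edge n-grams."""
    return {gram for token in value.split() for gram in edge_ngrams(token.lower())}

def search_terms(query):
    """Search side (assumed): whitespace split and lowercase, no n-grams."""
    return [token.lower() for token in query.split()]

docs = {
    "1": "FLYJAC LOGISTICS PVT LTD-TPTBLR .",
    "2": "L&T GEOSTRUCTURE PRIVATE LIMITED",
}

def search(query):
    """Ids of documents that share at least one term with the query."""
    wanted = search_terms(query)
    return [doc_id for doc_id, name in docs.items()
            if any(term in index_terms(name) for term in wanted)]

print(search("L&T"))  # ['2']
print(search("log"))  # ['1']
print(search("TPT"))  # []
```

The last call shows the trade-off of prefix-only indexing: LTD-TPTBLR is a single whitespace-delimited token, so TPT is not a prefix of any indexed token and does not match.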