How to select the longest token in an Elasticsearch filter


The input is a list of person names, and I want to build an exact match that is slightly fuzzy.

The indexed text is Bao-An Feng, and this is my analyzer:

PUT trim
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "word_joiner": {
            "type": "shingle",
            "output_unigrams": false,
            "token_separator": "",
            "output_unigrams_if_no_shingles": true,
            "max_shingle_size": 5
          }
        },
        "analyzer": {
          "word_join_analyzer": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "word_joiner"
            ]
          }
        },
        "tokenizer": {}
      }
    }
  }
}

It produces three tokens:

{
  "tokens": [
    {
      "token": "baoan",
      "start_offset": 0,
      "end_offset": 6,
      "type": "shingle",
      "position": 0
    },
    {
      "token": "baoanfeng",
      "start_offset": 0,
      "end_offset": 11,
      "type": "shingle",
      "position": 0,
      "positionLength": 2
    },
    {
      "token": "anfeng",
      "start_offset": 4,
      "end_offset": 11,
      "type": "shingle",
      "position": 1
    }
  ]
}

I only want "baoanfeng". I can't raise "min_shingle_size", because the input may contain just two words.
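To make the constraint concrete, here is a rough Python sketch of how the shingle settings above combine tokens. It is a simplified emulation written for illustration, not Lucene's actual implementation; the function name `shingles` and its parameters are hypothetical and only mirror the filter options `min_shingle_size`, `max_shingle_size`, `token_separator` and `output_unigrams_if_no_shingles`.

```python
def shingles(tokens, min_size=2, max_size=5, sep=""):
    """Simplified emulation of a shingle filter with output_unigrams=false
    and output_unigrams_if_no_shingles=true (hypothetical helper)."""
    out = []
    for start in range(len(tokens)):
        # Emit every shingle that begins at this token, shortest first.
        for size in range(min_size, max_size + 1):
            if start + size > len(tokens):
                break
            out.append(sep.join(tokens[start:start + size]))
    # If the input is too short to form any shingle, fall back to the unigrams.
    return out if out else list(tokens)


# "Bao-An Feng" after the standard tokenizer and lowercase filter
print(shingles(["bao", "an", "feng"]))      # ['baoan', 'baoanfeng', 'anfeng']
# A two-word name with the default minimum of 2
print(shingles(["bao", "an"]))              # ['baoan']
# The same name with the minimum raised to 3: no shingle can be formed
print(shingles(["bao", "an"], min_size=3))  # ['bao', 'an']
```

With the minimum raised to 3, the three-word name would yield only "baoanfeng", but a two-word name would come back as separate unigrams instead of one joined token, which is the problem described above.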

elasticsearch
1 Answer

If all you need is the longest shingle, I'm not sure why you are using the shingle filter at all...

Why not simply use the keyword tokenizer together with a pattern_replace filter that strips every non-word character? Something like this:

PUT trim
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "pattern": {
            "type": "pattern_replace",
            "pattern": "\\W+",
            "replacement": ""
          }
        },
        "analyzer": {
          "word_join_analyzer": {
            "type": "custom",
            "tokenizer": "keyword",
            "filter": [
              "lowercase",
              "pattern"
            ]
          }
        },
        "tokenizer": {}
      }
    }
  }
}

Then test it:

POST trim/_analyze
{
  "analyzer": "word_join_analyzer",
  "text": "Bao-An Feng"
}

{
  "tokens": [
    {
      "token": "baoanfeng",
      "start_offset": 0,
      "end_offset": 12,
      "type": "word",
      "position": 0
    }
  ]
}
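For a quick check outside Elasticsearch, the same chain can be approximated in a few lines of Python. This is only a sketch: the helper name `word_join` is hypothetical, and `re.ASCII` is used on the assumption that the filter's Java regex treats `\W` as ASCII-only by default, so non-ASCII letters would be stripped as well.

```python
import re

# \W+ matches runs of characters outside [A-Za-z0-9_]; re.ASCII is an
# assumption meant to mirror Java's default (non-Unicode) character classes.
NON_WORD = re.compile(r"\W+", re.ASCII)


def word_join(text):
    """Approximate keyword tokenizer + lowercase + pattern_replace:
    the whole input is one token, lowercased, with non-word runs removed."""
    lowered = text.lower()          # "lowercase" filter
    return NON_WORD.sub("", lowered)  # "pattern" filter with replacement ""


print(word_join("Bao-An Feng"))  # baoanfeng
print(word_join("Bao An"))       # baoan
print(word_join("Bao_An"))       # bao_an
```

Two-word names collapse to a single token just like three-word names, so no minimum size setting is needed. Note that the underscore counts as a word character and is kept.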
