Inconsistent scoring in Elasticsearch

Question. Votes: 3, Answers: 1

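I am trying to enable full-text search over tags (keyword phrases) that I created and that can be assigned to documents in an index (named "Delta").

My results are (1) not what I expect and (2) inconsistent when I run the same code repeatedly.

Some code follows. I simplified the mapping and the documents to make the code clearer and to make sure the problem does not lie in other parts of the documents or the mapping. I am running all of this in the Kibana Dev Tools console.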

PUT /mdelta 
{
  "mappings":{
    "tags":{
      "properties":{
        "synonyms":{ 
          "type":"text"
        }
      }
    }
  }
}

POST _bulk
{ "index" : { "_index" : "mdelta", "_type" : "tags" }}
{"synonyms":"Iron"}
{ "index" : { "_index" : "mdelta", "_type" : "tags" }}
{"synonyms":"Fe"}
{ "index" : { "_index" : "mdelta", "_type" : "tags" }}
{"synonyms":"Iron Deficiency"}
{ "index" : { "_index" : "mdelta", "_type" : "tags" }}
{"synonyms":"Serum Iron"}
{ "index" : { "_index" : "mdelta", "_type" : "tags" }}
{"synonyms":"Iron Sulfate"}
{ "index" : { "_index" : "mdelta", "_type" : "tags" }}
{"synonyms":"Iron Deficiency Anemia"}

GET mdelta/tags/_search
{
    "explain":false,
    "query": {
        "match" : {
            "synonyms" : "iron"
        }
    }
}

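Based on my understanding of the scoring algorithm, I expected the document {"synonyms":"Iron"} to be returned first (highest score). That is not the case. The results: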

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 0.5377023,
    "hits": [
      {
        "_index": "mdelta",
        "_type": "tags",
        "_id": "AWA8jRR9YXA6OBvYOfj9",
        "_score": 0.5377023,
        "_source": {
          "synonyms": "Iron Sulfate"
        }
      },
      {
        "_index": "mdelta",
        "_type": "tags",
        "_id": "AWA8jRR9YXA6OBvYOfj5",
        "_score": 0.2876821,
        "_source": {
          "synonyms": "Iron"
        }
      },
      {
        "_index": "mdelta",
        "_type": "tags",
        "_id": "AWA8jRR9YXA6OBvYOfj8",
        "_score": 0.25811607,
        "_source": {
          "synonyms": "Serum Iron"
        }
      },
      {
        "_index": "mdelta",
        "_type": "tags",
        "_id": "AWA8jRR9YXA6OBvYOfj7",
        "_score": 0.1805489,
        "_source": {
          "synonyms": "Iron Deficiency"
        }
      },
      {
        "_index": "mdelta",
        "_type": "tags",
        "_id": "AWA8jRR9YXA6OBvYOfj-",
        "_score": 0.14638957,
        "_source": {
          "synonyms": "Iron Deficiency Anemia"
        }
      }
    ]
  }
}

I repeated the query with explain set to true.

{
  "took": 38,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 0.5377023,
    "hits": [
      {
        "_shard": "[mdelta][4]",
        "_node": "McQ619KqR0akS1mHvTXjDw",
        "_index": "mdelta",
        "_type": "tags",
        "_id": "AWA8jRR9YXA6OBvYOfj9",
        "_score": 0.5377023,
        "_source": {
          "synonyms": "Iron Sulfate"
        },
        "_explanation": {
          "value": 0.5377023,
          "description": "weight(synonyms:iron in 1) [PerFieldSimilarity], result of:",
          "details": [
            {
              "value": 0.5377023,
              "description": "score(doc=1,freq=1.0 = termFreq=1.0\n), product of:",
              "details": [
                {
                  "value": 0.6931472,
                  "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                  "details": [
                    {
                      "value": 1,
                      "description": "docFreq",
                      "details": []
                    },
                    {
                      "value": 2,
                      "description": "docCount",
                      "details": []
                    }
                  ]
                },
                {
                  "value": 0.7757405,
                  "description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                  "details": [
                    {
                      "value": 1,
                      "description": "termFreq=1.0",
                      "details": []
                    },
                    {
                      "value": 1.2,
                      "description": "parameter k1",
                      "details": []
                    },
                    {
                      "value": 0.75,
                      "description": "parameter b",
                      "details": []
                    },
                    {
                      "value": 1.5,
                      "description": "avgFieldLength",
                      "details": []
                    },
                    {
                      "value": 2.56,
                      "description": "fieldLength",
                      "details": []
                    }
                  ]
                }
              ]
            }
          ]
        }
      },
      {
        "_shard": "[mdelta][2]",
        "_node": "McQ619KqR0akS1mHvTXjDw",
        "_index": "mdelta",
        "_type": "tags",
        "_id": "AWA8jRR9YXA6OBvYOfj5",
        "_score": 0.2876821,
        "_source": {
          "synonyms": "Iron"
        },
        "_explanation": {
          "value": 0.2876821,
          "description": "weight(synonyms:iron in 0) [PerFieldSimilarity], result of:",
          "details": [
            {
              "value": 0.2876821,
              "description": "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
              "details": [
                {
                  "value": 0.2876821,
                  "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                  "details": [
                    {
                      "value": 1,
                      "description": "docFreq",
                      "details": []
                    },
                    {
                      "value": 1,
                      "description": "docCount",
                      "details": []
                    }
                  ]
                },
                {
                  "value": 1,
                  "description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                  "details": [
                    {
                      "value": 1,
                      "description": "termFreq=1.0",
                      "details": []
                    },
                    {
                      "value": 1.2,
                      "description": "parameter k1",
                      "details": []
                    },
                    {
                      "value": 0.75,
                      "description": "parameter b",
                      "details": []
                    },
                    {
                      "value": 1,
                      "description": "avgFieldLength",
                      "details": []
                    },
                    {
                      "value": 1,
                      "description": "fieldLength",
                      "details": []
                    }
                  ]
                }
              ]
            }
          ]
        }
      },
      {
        "_shard": "[mdelta][3]",
        "_node": "McQ619KqR0akS1mHvTXjDw",
        "_index": "mdelta",
        "_type": "tags",
        "_id": "AWA8jRR9YXA6OBvYOfj8",
        "_score": 0.25811607,
        "_source": {
          "synonyms": "Serum Iron"
        },
        "_explanation": {
          "value": 0.25811607,
          "description": "weight(synonyms:iron in 0) [PerFieldSimilarity], result of:",
          "details": [
            {
              "value": 0.25811607,
              "description": "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
              "details": [
                {
                  "value": 0.2876821,
                  "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                  "details": [
                    {
                      "value": 1,
                      "description": "docFreq",
                      "details": []
                    },
                    {
                      "value": 1,
                      "description": "docCount",
                      "details": []
                    }
                  ]
                },
                {
                  "value": 0.89722675,
                  "description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                  "details": [
                    {
                      "value": 1,
                      "description": "termFreq=1.0",
                      "details": []
                    },
                    {
                      "value": 1.2,
                      "description": "parameter k1",
                      "details": []
                    },
                    {
                      "value": 0.75,
                      "description": "parameter b",
                      "details": []
                    },
                    {
                      "value": 2,
                      "description": "avgFieldLength",
                      "details": []
                    },
                    {
                      "value": 2.56,
                      "description": "fieldLength",
                      "details": []
                    }
                  ]
                }
              ]
            }
          ]
        }
      },
      {
        "_shard": "[mdelta][1]",
        "_node": "McQ619KqR0akS1mHvTXjDw",
        "_index": "mdelta",
        "_type": "tags",
        "_id": "AWA8jRR9YXA6OBvYOfj7",
        "_score": 0.1805489,
        "_source": {
          "synonyms": "Iron Deficiency"
        },
        "_explanation": {
          "value": 0.1805489,
          "description": "weight(synonyms:iron in 0) [PerFieldSimilarity], result of:",
          "details": [
            {
              "value": 0.1805489,
              "description": "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
              "details": [
                {
                  "value": 0.18232156,
                  "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                  "details": [
                    {
                      "value": 2,
                      "description": "docFreq",
                      "details": []
                    },
                    {
                      "value": 2,
                      "description": "docCount",
                      "details": []
                    }
                  ]
                },
                {
                  "value": 0.9902773,
                  "description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                  "details": [
                    {
                      "value": 1,
                      "description": "termFreq=1.0",
                      "details": []
                    },
                    {
                      "value": 1.2,
                      "description": "parameter k1",
                      "details": []
                    },
                    {
                      "value": 0.75,
                      "description": "parameter b",
                      "details": []
                    },
                    {
                      "value": 2.5,
                      "description": "avgFieldLength",
                      "details": []
                    },
                    {
                      "value": 2.56,
                      "description": "fieldLength",
                      "details": []
                    }
                  ]
                }
              ]
            }
          ]
        }
      },
      {
        "_shard": "[mdelta][1]",
        "_node": "McQ619KqR0akS1mHvTXjDw",
        "_index": "mdelta",
        "_type": "tags",
        "_id": "AWA8jRR9YXA6OBvYOfj-",
        "_score": 0.14638957,
        "_source": {
          "synonyms": "Iron Deficiency Anemia"
        },
        "_explanation": {
          "value": 0.14638956,
          "description": "weight(synonyms:iron in 1) [PerFieldSimilarity], result of:",
          "details": [
            {
              "value": 0.14638956,
              "description": "score(doc=1,freq=1.0 = termFreq=1.0\n), product of:",
              "details": [
                {
                  "value": 0.18232156,
                  "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                  "details": [
                    {
                      "value": 2,
                      "description": "docFreq",
                      "details": []
                    },
                    {
                      "value": 2,
                      "description": "docCount",
                      "details": []
                    }
                  ]
                },
                {
                  "value": 0.8029196,
                  "description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                  "details": [
                    {
                      "value": 1,
                      "description": "termFreq=1.0",
                      "details": []
                    },
                    {
                      "value": 1.2,
                      "description": "parameter k1",
                      "details": []
                    },
                    {
                      "value": 0.75,
                      "description": "parameter b",
                      "details": []
                    },
                    {
                      "value": 2.5,
                      "description": "avgFieldLength",
                      "details": []
                    },
                    {
                      "value": 4,
                      "description": "fieldLength",
                      "details": []
                    }
                  ]
                }
              ]
            }
          ]
        }
      }
    ]
  }
}

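If you look at the first hit ("Iron Sulfate"), it reports a docFreq of 1 and a docCount of 2. That is not correct.

Also, if I run DELETE /mdelta and then rerun my code, I can get the results in a different order, for example: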

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "mdelta",
        "_type": "tags",
        "_id": "Qd0JQWABt4cFDxBHv7Fe",
        "_score": 0.2876821,
        "_source": {
          "synonyms": "Serum Iron"
        }
      },
      {
        "_index": "mdelta",
        "_type": "tags",
        "_id": "Pt0JQWABt4cFDxBHv7Fe",
        "_score": 0.2876821,
        "_source": {
          "synonyms": "Iron"
        }
      },
      {
        "_index": "mdelta",
        "_type": "tags",
        "_id": "QN0JQWABt4cFDxBHv7Fe",
        "_score": 0.2876821,
        "_source": {
          "synonyms": "Iron Deficiency"
        }
      },
      {
        "_index": "mdelta",
        "_type": "tags",
        "_id": "Qt0JQWABt4cFDxBHv7Fe",
        "_score": 0.19856805,
        "_source": {
          "synonyms": "Iron Sulfate"
        }
      },
      {
        "_index": "mdelta",
        "_type": "tags",
        "_id": "Q90JQWABt4cFDxBHv7Fe",
        "_score": 0.16853254,
        "_source": {
          "synonyms": "Iron Deficiency Anemia"
        }
      }
    ]
  }
}

Any ideas about what I am doing wrong would be much appreciated.

elasticsearch
1 Answer
3
votes

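The reason you do not get consistent results when you reindex the data is that term frequencies are computed per shard. When you rebuild the index, the documents are assigned to shards differently than in the previous index, because you did not specify any routing.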
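The per-shard effect can be checked by hand against the explain output in the question. The following is a minimal Python sketch, not Elasticsearch code: it applies the idf and tfNorm formulas quoted in the explain output to a shard layout inferred from the reported docCount and avgFieldLength values. The layout is an assumption (in particular, that "Fe" shares shard 4 with "Iron Sulfate"), and the `ENCODED_LENGTH` table simply copies the fieldLength values the output reports for fields of 1, 2 and 3 tokens.

```python
import math

K1, B = 1.2, 0.75  # BM25 parameters shown in the explain output

# fieldLength values as reported by the explain output for 1, 2 and 3 tokens
# (the stored length is lossy, so 2 tokens shows up as 2.56 and 3 as 4).
ENCODED_LENGTH = {1: 1.0, 2: 2.56, 3: 4.0}


def idf(doc_freq, doc_count):
    # log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def tf_norm(freq, field_length, avg_field_length):
    # (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength))
    return (freq * (K1 + 1)) / (
        freq + K1 * (1 - B + B * field_length / avg_field_length)
    )


def score_shard(docs, term):
    """Score matching documents using only the statistics of one shard."""
    tokens = [d.lower().split() for d in docs]
    doc_count = len(docs)
    doc_freq = sum(term in t for t in tokens)
    avg_len = sum(len(t) for t in tokens) / doc_count
    return {
        d: idf(doc_freq, doc_count)
        * tf_norm(t.count(term), ENCODED_LENGTH[len(t)], avg_len)
        for d, t in zip(docs, tokens)
        if term in t
    }


# Assumed layout, inferred from docCount and avgFieldLength in the output.
shards = {
    1: ["Iron Deficiency", "Iron Deficiency Anemia"],
    2: ["Iron"],
    3: ["Serum Iron"],
    4: ["Fe", "Iron Sulfate"],
}

scores = {}
for docs in shards.values():
    scores.update(score_shard(docs, "iron"))

for doc, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{s:.7f}  {doc}")
```

Under this layout, "Iron Sulfate" comes out on top only because its shard holds one other document that does not contain "iron", which raises the idf on that shard. A different shard assignment after reindexing changes these per-shard numbers and therefore the order.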

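On not getting the results you expect:

This is probably because the index contains very few documents. Try running the query with the search_type parameter, like this: GET mdelta/tags/_search?search_type=dfs_query_then_fetch. This ensures that index-level frequencies are computed first. You can use it in development, but I would not recommend it in production. If you have enough data, the frequencies should be roughly the same across shards.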

See: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-search-type.html
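To see why index-level frequencies restore the expected order, here is a rough Python sketch that scores the six documents from the question with statistics taken over the whole index rather than per shard. It is an approximation of the idea behind dfs_query_then_fetch, not output from a cluster. The `ENCODED_LENGTH` table is an assumption copied from the fieldLength values in the question's explain output, and `global_scores` is a hypothetical helper name.

```python
import math

K1, B = 1.2, 0.75  # BM25 parameters from the explain output

# fieldLength values reported in the explain output for 1, 2 and 3 tokens.
ENCODED_LENGTH = {1: 1.0, 2: 2.56, 3: 4.0}

DOCS = [
    "Iron",
    "Fe",
    "Iron Deficiency",
    "Serum Iron",
    "Iron Sulfate",
    "Iron Deficiency Anemia",
]


def global_scores(docs, term):
    """BM25 scores computed from statistics over all documents at once."""
    tokens = [d.lower().split() for d in docs]
    doc_count = len(docs)
    doc_freq = sum(term in t for t in tokens)
    avg_len = sum(len(t) for t in tokens) / doc_count
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

    scores = {}
    for doc, t in zip(docs, tokens):
        freq = t.count(term)
        if freq == 0:
            continue  # document does not match the query term
        length = ENCODED_LENGTH[len(t)]
        tf_norm = (freq * (K1 + 1)) / (
            freq + K1 * (1 - B + B * length / avg_len)
        )
        scores[doc] = idf * tf_norm
    return scores


ranking = sorted(global_scores(DOCS, "iron").items(), key=lambda kv: -kv[1])
for doc, score in ranking:
    print(f"{score:.5f}  {doc}")
```

With a single idf shared by every document, only the field length separates the scores, so the one-token "Iron" ranks first, the three two-token phrases tie, and "Iron Deficiency Anemia" comes last, regardless of how the documents are spread across shards.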
