How does Elasticsearch calculate tf-idf? The results look strange

Question (votes: 0, answers: 1)

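I have an index whose documents store system information plus searchable fields; the searchable fields are copied into the searchable_keys field. In this case there is only one such field: name.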

Here is the index definition:

{
  "settings":{
    "analysis":{
      "analyzer":{
        "my_analyzer":{
          "filter":[
            "lowercase"
          ],
          "type":"custom",
          "tokenizer":"my_tokenizer"
        }
      },
      "tokenizer":{
        "my_tokenizer":{
          "token_chars":[
            "letter",
            "digit"
          ],
          "type":"edge_ngram",
          "min_gram":3,
          "max_gram":20
        }
      }
    }
  },
  "mappings":{
    "properties":{
      "entry_id":{
        "type":"keyword"
      },
      "workspace_id":{
        "type":"keyword"
      },
      "name":{
        "type":"text",
        "copy_to":"searchable_keys"
      },
      "searchable_keys":{
        "type":"text",
        "analyzer":"my_analyzer"
      }
    }
  }
}
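
To see what this mapping does to a name, the analyzer can be approximated outside Elasticsearch. The sketch below is a plain-Python imitation, not the Lucene implementation: it assumes the edge_ngram tokenizer splits on anything that is not a letter or digit and emits every prefix of 3 to 20 characters per word, which are then lowercased. The helper name edge_ngram_tokens is made up for illustration.

```python
import re


def edge_ngram_tokens(text, min_gram=3, max_gram=20):
    """Approximate my_analyzer: split on non-letter/digit characters,
    emit the prefixes of each word (min_gram..max_gram long), lowercase them."""
    tokens = []
    # [^\W_]+ matches runs of letters and digits (word characters minus "_").
    for word in re.findall(r"[^\W_]+", text):
        # Words shorter than min_gram produce an empty range, so no tokens.
        for size in range(min_gram, min(len(word), max_gram) + 1):
            tokens.append(word[:size].lower())
    return tokens


sled_dog = edge_ngram_tokens("Skagway Sled Dog and Musher's Camp")
dogsledding = edge_ngram_tokens("• Private Emerald Lake & Dogsledding Tour •")

print(len(sled_dog), sled_dog)
print(len(dogsledding), dogsledding)
```

Under these assumptions both names yield the token dog exactly once (from "Dog" in one case, as a prefix of "Dogsledding" in the other), and the token counts, 15 and 23, agree with the dl values reported in the explain output further down.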

I ran the following query:

{
  "explain":true,
  "query":{
    "match":{
      "searchable_keys":{
        "query":"dog",
        "operator":"AND"
      }
    }
  }
}

I got a strange result (the full documents from the response are shown below): the document named "• Private Emerald Lake & Dogsledding Tour •" scored 3.7377324, while the document named "Skagway Sled Dog and Musher's Camp" scored 3.718998.

Full documents from the response:

[
  {
    "_index":"tours",
    "_id":"018bb59a-bc8c-76a2-9e76-eaf747bac7c1",
    "_score":3.7377324,
    "_source":{
      "entry_id":"018bb59a-bc8c-76a2-9e76-eaf747bac7c1",
      "workspace_id":"018bb598-708a-7e8d-8995-b30cf0aba239",
      "name":"• Private Emerald Lake & Dogsledding Tour •",
      "type":"Tour"
    },
    "_explanation":{
      "value":3.7377324,
      "description":"weight(searchable_keys:dog in 68) [PerFieldSimilarity], result of:",
      "details":[
        {
          "value":3.7377324,
          "description":"score(freq=1.0), computed as boost * idf * tf from:",
          "details":[
            {
              "value":2.2,
              "description":"boost",
              "details":[
                
              ]
            },
            {
              "value":4.017076,
              "description":"idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
              "details":[
                {
                  "value":6,
                  "description":"n, number of documents containing term",
                  "details":[
                    
                  ]
                },
                {
                  "value":360,
                  "description":"N, total number of documents with field",
                  "details":[
                    
                  ]
                }
              ]
            },
            {
              "value":0.4229368,
              "description":"tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
              "details":[
                {
                  "value":1.0,
                  "description":"freq, occurrences of term within document",
                  "details":[
                    
                  ]
                },
                {
                  "value":1.2,
                  "description":"k1, term saturation parameter",
                  "details":[
                    
                  ]
                },
                {
                  "value":0.75,
                  "description":"b, length normalization parameter",
                  "details":[
                    
                  ]
                },
                {
                  "value":23.0,
                  "description":"dl, length of field",
                  "details":[
                    
                  ]
                },
                {
                  "value":19.447222,
                  "description":"avgdl, average length of field",
                  "details":[
                    
                  ]
                }
              ]
            }
          ]
        }
      ]
    }
  },
  {
    "_index":"tours",
    "_id":"018bb598-e50e-7d6d-a639-97ed40bb2ee7",
    "_score":3.718998,
    "_source":{
      "entry_id":"018bb598-e50e-7d6d-a639-97ed40bb2ee7",
      "workspace_id":"018bb598-708a-7e8d-8995-b30cf0aba239",
      "name":"Skagway Sled Dog and Musher's Camp",
      "type":"Tour"
    },
    "_explanation":{
      "value":3.718998,
      "description":"weight(searchable_keys:dog in 105) [PerFieldSimilarity], result of:",
      "details":[
        {
          "value":3.718998,
          "description":"score(freq=1.0), computed as boost * idf * tf from:",
          "details":[
            {
              "value":2.2,
              "description":"boost",
              "details":[
                
              ]
            },
            {
              "value":3.3953834,
              "description":"idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
              "details":[
                {
                  "value":11,
                  "description":"n, number of documents containing term",
                  "details":[
                    
                  ]
                },
                {
                  "value":342,
                  "description":"N, total number of documents with field",
                  "details":[
                    
                  ]
                }
              ]
            },
            {
              "value":0.49786824,
              "description":"tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
              "details":[
                {
                  "value":1.0,
                  "description":"freq, occurrences of term within document",
                  "details":[
                    
                  ]
                },
                {
                  "value":1.2,
                  "description":"k1, term saturation parameter",
                  "details":[
                    
                  ]
                },
                {
                  "value":0.75,
                  "description":"b, length normalization parameter",
                  "details":[
                    
                  ]
                },
                {
                  "value":15.0,
                  "description":"dl, length of field",
                  "details":[
                    
                  ]
                },
                {
                  "value":19.052631,
                  "description":"avgdl, average length of field",
                  "details":[
                    
                  ]
                }
              ]
            }
          ]
        }
      ]
    }
  }
]
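
The formulas printed in the explain output are those of BM25, the default similarity in current Elasticsearch versions, rather than classic tf-idf. As a check, the short script below recomputes both scores from the values shown above. The function names are illustrative, and the results should differ from the reported scores only by floating-point rounding.

```python
import math


def bm25_idf(n, N):
    # idf = log(1 + (N - n + 0.5) / (n + 0.5)), natural logarithm
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


def bm25_tf(freq, dl, avgdl, k1=1.2, b=0.75):
    # tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


def bm25_score(n, N, freq, dl, avgdl, boost=2.2):
    # score = boost * idf * tf, as stated in the explain output
    return boost * bm25_idf(n, N) * bm25_tf(freq, dl, avgdl)


# Values copied from the two _explanation blocks.
dogsledding = dict(n=6, N=360, freq=1.0, dl=23.0, avgdl=19.447222)
sled_dog = dict(n=11, N=342, freq=1.0, dl=15.0, avgdl=19.052631)

for label, stats in (("Dogsledding", dogsledding), ("Sled Dog", sled_dog)):
    print(label, round(bm25_score(**stats), 4))
```

The tf factor favours the shorter "Sled Dog" name (about 0.498 versus 0.423); that document ranks second only because its idf is lower (about 3.40 versus 4.02). N and n differ between the two explanations (360 and 6 versus 342 and 11), which is what one would expect if the two hits were scored on different shards, since by default each shard scores with its own term statistics.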

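Questions:

  1. Why is the idf different for the two documents? The idf of a given term should be the same for every document in the collection. Or am I wrong?

  2. What is this strange formula for tf? Isn't tf simply the number of occurrences of the term divided by the number of words in the document?

  3. How can I make a document that contains the standalone word "dog" score higher than a document where "dog" only appears as a substring of a longer word, without losing the ability to match partial occurrences that the edge n-gram tokenizer provides?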

elasticsearch nlp tokenize tf-idf n-gram
1 Answer
0 votes
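  1. Going by the basic tf-idf formula (https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting), yes, the idf should be a constant. The formula in your setup is slightly different, but it should still be a constant. Could the two results come from two different document sets? I notice that the "entry_id" values differ (I am not an Elasticsearch expert).
  2. At the link above you will find: "term frequency, the number of times a term occurs in a given document", i.e. a raw count rather than a ratio. I assume that whoever computes tf-idf in your case has modified the formula for some reason. I don't know whether it is the native Elasticsearch formula or someone else's implementation.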
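
Two tentative suggestions follow from the mapping and explain output in the question. For question 1, the search_type=dfs_query_then_fetch URL parameter asks Elasticsearch to score with term statistics gathered across all shards, which should make the idf identical for every hit if the difference does come from per-shard statistics. For question 3, one common pattern is to keep the edge n-gram match as the required clause and add an optional clause on name, which the mapping analyzes as ordinary text with the default analyzer, so only whole-word matches earn the extra score. The sketch below only builds the request and does not send it; the index name tours is taken from the response, while build_search and the boost of 2.0 are assumptions for illustration.

```python
import json


def build_search(term, index="tours", whole_word_boost=2.0):
    """Return (path, body) for a search that still matches prefixes
    but ranks whole-word matches on `name` higher."""
    # dfs_query_then_fetch: collect term statistics from all shards first.
    path = f"/{index}/_search?search_type=dfs_query_then_fetch"
    body = {
        "explain": True,
        "query": {
            "bool": {
                # Required: prefix match through the edge n-gram field.
                "must": [
                    {"match": {"searchable_keys": {"query": term, "operator": "AND"}}}
                ],
                # Optional: whole-word match on the default-analyzed field.
                "should": [
                    {"match": {"name": {"query": term, "boost": whole_word_boost}}}
                ],
            }
        },
    }
    return path, body


path, body = build_search("dog")
print("POST", path)
print(json.dumps(body, indent=2))
```

With this body, "Skagway Sled Dog and Musher's Camp" would match both clauses, while the Dogsledding document would match only the must clause, because the default analyzer indexes "Dogsledding" as a single token. The boost value is a placeholder and would need tuning against real data.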