How does Elasticsearch compute tf-idf? It looks strange

Question

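I have an index containing documents that store system information along with searchable fields, and the searchable fields are copied into the searchable_keys field. In this case there is only one such field: name.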

Here is the index definition:

{
  "settings":{
    "analysis":{
      "analyzer":{
        "my_analyzer":{
          "filter":[
            "lowercase"
          ],
          "type":"custom",
          "tokenizer":"my_tokenizer"
        }
      },
      "tokenizer":{
        "my_tokenizer":{
          "token_chars":[
            "letter",
            "digit"
          ],
          "type":"edge_ngram",
          "min_gram":3,
          "max_gram":20
        }
      }
    }
  },
  "mappings":{
    "properties":{
      "entry_id":{
        "type":"keyword"
      },
      "workspace_id":{
        "type":"keyword"
      },
      "name":{
        "type":"text",
        "copy_to":"searchable_keys"
      },
      "searchable_keys":{
        "type":"text",
        "analyzer":"my_analyzer"
      }
    }
  }
}
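
To see why both document names match the query "dog", it can help to emulate what my_analyzer emits. The following is a rough Python approximation, not the actual Lucene tokenizer: it assumes that splitting on anything other than letters and digits, emitting every prefix of 3 to 20 characters, and lowercasing is close enough to the edge_ngram tokenizer plus lowercase filter configured above. The function name edge_ngrams is made up for this illustration.

```python
import re


def edge_ngrams(text, min_gram=3, max_gram=20):
    """Approximate my_analyzer: edge n-grams per word, then lowercase."""
    tokens = []
    # Runs of letters/digits stand in for token_chars = ["letter", "digit"].
    for word in re.findall(r"[^\W_]+", text):
        # Words shorter than min_gram produce no tokens at all.
        for n in range(min_gram, min(len(word), max_gram) + 1):
            tokens.append(word[:n].lower())
    return tokens


dogsledding = edge_ngrams("• Private Emerald Lake & Dogsledding Tour •")
sled_dog = edge_ngrams("Skagway Sled Dog and Musher's Camp")

print(len(dogsledding), dogsledding.count("dog"))
print(len(sled_dog), sled_dog.count("dog"))
```

Under this approximation both names yield the token "dog" exactly once, and the token counts line up with the dl values (23.0 and 15.0) reported in the explain output of the response further down. At index time, "Dogsledding" and the standalone word "Dog" are therefore indistinguishable for the term "dog".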

I ran the following query:

{
  "explain":true,
  "query":{
    "match":{
      "searchable_keys":{
        "query":"dog",
        "operator":"AND"
      }
    }
  }
}

I got a strange result (the full documents from the response are shown below): the document named "• Private Emerald Lake & Dogsledding Tour •" scored 3.7377324, while the document named "Skagway Sled Dog and Musher's Camp" scored 3.718998.

Full documents from the response:

[
  {
    "_index":"tours",
    "_id":"018bb59a-bc8c-76a2-9e76-eaf747bac7c1",
    "_score":3.7377324,
    "_source":{
      "entry_id":"018bb59a-bc8c-76a2-9e76-eaf747bac7c1",
      "workspace_id":"018bb598-708a-7e8d-8995-b30cf0aba239",
      "name":"• Private Emerald Lake & Dogsledding Tour •",
      "type":"Tour"
    },
    "_explanation":{
      "value":3.7377324,
      "description":"weight(searchable_keys:dog in 68) [PerFieldSimilarity], result of:",
      "details":[
        {
          "value":3.7377324,
          "description":"score(freq=1.0), computed as boost * idf * tf from:",
          "details":[
            {
              "value":2.2,
              "description":"boost",
              "details":[
                
              ]
            },
            {
              "value":4.017076,
              "description":"idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
              "details":[
                {
                  "value":6,
                  "description":"n, number of documents containing term",
                  "details":[
                    
                  ]
                },
                {
                  "value":360,
                  "description":"N, total number of documents with field",
                  "details":[
                    
                  ]
                }
              ]
            },
            {
              "value":0.4229368,
              "description":"tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
              "details":[
                {
                  "value":1.0,
                  "description":"freq, occurrences of term within document",
                  "details":[
                    
                  ]
                },
                {
                  "value":1.2,
                  "description":"k1, term saturation parameter",
                  "details":[
                    
                  ]
                },
                {
                  "value":0.75,
                  "description":"b, length normalization parameter",
                  "details":[
                    
                  ]
                },
                {
                  "value":23.0,
                  "description":"dl, length of field",
                  "details":[
                    
                  ]
                },
                {
                  "value":19.447222,
                  "description":"avgdl, average length of field",
                  "details":[
                    
                  ]
                }
              ]
            }
          ]
        }
      ]
    }
  },
  {
    "_index":"tours",
    "_id":"018bb598-e50e-7d6d-a639-97ed40bb2ee7",
    "_score":3.718998,
    "_source":{
      "entry_id":"018bb598-e50e-7d6d-a639-97ed40bb2ee7",
      "workspace_id":"018bb598-708a-7e8d-8995-b30cf0aba239",
      "name":"Skagway Sled Dog and Musher's Camp",
      "type":"Tour"
    },
    "_explanation":{
      "value":3.718998,
      "description":"weight(searchable_keys:dog in 105) [PerFieldSimilarity], result of:",
      "details":[
        {
          "value":3.718998,
          "description":"score(freq=1.0), computed as boost * idf * tf from:",
          "details":[
            {
              "value":2.2,
              "description":"boost",
              "details":[
                
              ]
            },
            {
              "value":3.3953834,
              "description":"idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
              "details":[
                {
                  "value":11,
                  "description":"n, number of documents containing term",
                  "details":[
                    
                  ]
                },
                {
                  "value":342,
                  "description":"N, total number of documents with field",
                  "details":[
                    
                  ]
                }
              ]
            },
            {
              "value":0.49786824,
              "description":"tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
              "details":[
                {
                  "value":1.0,
                  "description":"freq, occurrences of term within document",
                  "details":[
                    
                  ]
                },
                {
                  "value":1.2,
                  "description":"k1, term saturation parameter",
                  "details":[
                    
                  ]
                },
                {
                  "value":0.75,
                  "description":"b, length normalization parameter",
                  "details":[
                    
                  ]
                },
                {
                  "value":15.0,
                  "description":"dl, length of field",
                  "details":[
                    
                  ]
                },
                {
                  "value":19.052631,
                  "description":"avgdl, average length of field",
                  "details":[
                    
                  ]
                }
              ]
            }
          ]
        }
      ]
    }
  }
]
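
The numbers in the explain output can be checked by hand. The idf and tf expressions shown there have the form used by BM25, which is Elasticsearch's default similarity, rather than the textbook tf-idf. Below is a minimal Python sketch that plugs the reported values into those expressions; the function names are made up for this illustration, and small differences in the last digits are to be expected because Lucene works with single-precision floats.

```python
import math


def bm25_idf(n, N):
    # idf, computed as log(1 + (N - n + 0.5) / (n + 0.5))
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


def bm25_tf(freq, k1, b, dl, avgdl):
    # tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl))
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


def bm25_score(boost, n, N, freq, k1, b, dl, avgdl):
    # score, computed as boost * idf * tf
    return boost * bm25_idf(n, N) * bm25_tf(freq, k1, b, dl, avgdl)


# Values copied from the two _explanation blocks above.
dogsledding_score = bm25_score(boost=2.2, n=6, N=360, freq=1.0,
                               k1=1.2, b=0.75, dl=23.0, avgdl=19.447222)
sled_dog_score = bm25_score(boost=2.2, n=11, N=342, freq=1.0,
                            k1=1.2, b=0.75, dl=15.0, avgdl=19.052631)

print(dogsledding_score, sled_dog_score)
```

The two hits report different N, n, and avgdl values, and that is what makes their idf values differ. This would be consistent with the documents living on different shards, since Elasticsearch computes term statistics per shard by default; running the search with search_type=dfs_query_then_fetch makes it use index-wide statistics instead. Whether sharding is the cause here is an assumption, because the source does not state the shard layout.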

Questions:

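Why is the idf different for the two documents? As I understand it, the idf of a given word is the same for every document in the collection. Am I wrong?
What is this strange formula for tf? Shouldn't tf equal the number of times the word occurs divided by the number of words in the document?
How can I make a document score higher when it contains the standalone word "dog" than when "dog" appears only as a substring of another word, without losing the ability to search by partial occurrence that the edge n-gram tokenizer provides?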

Answer 1

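Based on the basic tf-idf formula (https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting), yes, the idf should be a constant. The formula in your configuration is slightly different, but it should still be a constant. Could the two results come from two different sets of documents? I notice that the "entry_id" values differ (I am not an Elasticsearch expert).
In the link I provided you can find: "term frequency, the number of times a term occurs in a given document", i.e. a raw count rather than a ratio. I assume that whoever computes tf-idf in your example modified the formula for some reason. I don't know whether it is the native Elasticsearch formula or someone else's implementation.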
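
The third question is not addressed above. One possible approach, sketched here in Python as a request-body builder, is to keep the edge n-gram match as the required clause and add an optional clause on the name field. In the mapping from the question, name is a text field with no explicit analyzer, so it is indexed with the default standard analyzer and holds whole words only. The helper name build_query and the whole_word_boost value are hypothetical and would need tuning against real data.

```python
import json


def build_query(text, whole_word_boost=2.0):
    """Build a search body that rewards whole-word matches on `name`."""
    return {
        "explain": True,
        "query": {
            "bool": {
                # Required: the original edge n-gram match, so prefix
                # matches such as "Dogsledding" are still returned.
                "must": [
                    {"match": {"searchable_keys": {"query": text,
                                                   "operator": "AND"}}}
                ],
                # Optional: extra score only when the whole word matches.
                "should": [
                    {"match": {"name": {"query": text,
                                        "boost": whole_word_boost}}}
                ],
            }
        },
    }


print(json.dumps(build_query("dog"), indent=2))
```

With a body like this, "Skagway Sled Dog and Musher's Camp" would pick up an additional score from the should clause, while "• Private Emerald Lake & Dogsledding Tour •" would match only through searchable_keys. If more fields were copied into searchable_keys, each would need its own clause or a shared whole-word field.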
