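I have an index whose documents store system information along with searchable fields; the searchable fields are copied into the searchable_keys field. In this case there is only one such field: name.
Here is the index definition: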
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "filter": [
            "lowercase"
          ],
          "type": "custom",
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "token_chars": [
            "letter",
            "digit"
          ],
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 20
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "entry_id": {
        "type": "keyword"
      },
      "workspace_id": {
        "type": "keyword"
      },
      "name": {
        "type": "text",
        "copy_to": "searchable_keys"
      },
      "searchable_keys": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
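To make clear what I expect this analyzer to index, here is a rough Python approximation of my_analyzer. It is only a sketch of my understanding, not Elasticsearch's actual code: the function name edge_ngrams is made up for illustration, and I am assuming the tokenizer splits on every character that is not a letter or digit and then emits the prefixes of each word from min_gram to max_gram characters.

```python
import re

def edge_ngrams(text, min_gram=3, max_gram=20):
    """Approximate my_analyzer: split on non-letter/non-digit characters,
    emit the leading n-grams of each word, then lowercase them."""
    tokens = []
    # [^\W_]+ matches runs of letters and digits only (no underscore).
    for word in re.findall(r"[^\W_]+", text):
        longest = min(len(word), max_gram)
        # Words shorter than min_gram produce no tokens at all.
        for size in range(min_gram, longest + 1):
            tokens.append(word[:size].lower())
    return tokens

# The two tour names that appear in the response further down.
for name in ("• Private Emerald Lake & Dogsledding Tour •",
             "Skagway Sled Dog and Musher's Camp"):
    grams = edge_ngrams(name)
    print(len(grams), "dog" in grams, grams)
```

Under this approximation both names produce the token dog exactly once, and the token counts come out as 23 and 15, which is the same as the dl values reported in the explain output.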
I ran the following query:
{
  "explain": true,
  "query": {
    "match": {
      "searchable_keys": {
        "query": "dog",
        "operator": "AND"
      }
    }
  }
}
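I got a strange result (the full documents from the response are shown below): the document named "• Private Emerald Lake & Dogsledding Tour •" scored 3.7377324, while the document named "Skagway Sled Dog and Musher's Camp" scored 3.718998.
Full documents from the response: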
[
  {
    "_index": "tours",
    "_id": "018bb59a-bc8c-76a2-9e76-eaf747bac7c1",
    "_score": 3.7377324,
    "_source": {
      "entry_id": "018bb59a-bc8c-76a2-9e76-eaf747bac7c1",
      "workspace_id": "018bb598-708a-7e8d-8995-b30cf0aba239",
      "name": "• Private Emerald Lake & Dogsledding Tour •",
      "type": "Tour"
    },
    "_explanation": {
      "value": 3.7377324,
      "description": "weight(searchable_keys:dog in 68) [PerFieldSimilarity], result of:",
      "details": [
        {
          "value": 3.7377324,
          "description": "score(freq=1.0), computed as boost * idf * tf from:",
          "details": [
            {
              "value": 2.2,
              "description": "boost",
              "details": []
            },
            {
              "value": 4.017076,
              "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
              "details": [
                {
                  "value": 6,
                  "description": "n, number of documents containing term",
                  "details": []
                },
                {
                  "value": 360,
                  "description": "N, total number of documents with field",
                  "details": []
                }
              ]
            },
            {
              "value": 0.4229368,
              "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
              "details": [
                {
                  "value": 1.0,
                  "description": "freq, occurrences of term within document",
                  "details": []
                },
                {
                  "value": 1.2,
                  "description": "k1, term saturation parameter",
                  "details": []
                },
                {
                  "value": 0.75,
                  "description": "b, length normalization parameter",
                  "details": []
                },
                {
                  "value": 23.0,
                  "description": "dl, length of field",
                  "details": []
                },
                {
                  "value": 19.447222,
                  "description": "avgdl, average length of field",
                  "details": []
                }
              ]
            }
          ]
        }
      ]
    }
  },
  {
    "_index": "tours",
    "_id": "018bb598-e50e-7d6d-a639-97ed40bb2ee7",
    "_score": 3.718998,
    "_source": {
      "entry_id": "018bb598-e50e-7d6d-a639-97ed40bb2ee7",
      "workspace_id": "018bb598-708a-7e8d-8995-b30cf0aba239",
      "name": "Skagway Sled Dog and Musher's Camp",
      "type": "Tour"
    },
    "_explanation": {
      "value": 3.718998,
      "description": "weight(searchable_keys:dog in 105) [PerFieldSimilarity], result of:",
      "details": [
        {
          "value": 3.718998,
          "description": "score(freq=1.0), computed as boost * idf * tf from:",
          "details": [
            {
              "value": 2.2,
              "description": "boost",
              "details": []
            },
            {
              "value": 3.3953834,
              "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
              "details": [
                {
                  "value": 11,
                  "description": "n, number of documents containing term",
                  "details": []
                },
                {
                  "value": 342,
                  "description": "N, total number of documents with field",
                  "details": []
                }
              ]
            },
            {
              "value": 0.49786824,
              "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
              "details": [
                {
                  "value": 1.0,
                  "description": "freq, occurrences of term within document",
                  "details": []
                },
                {
                  "value": 1.2,
                  "description": "k1, term saturation parameter",
                  "details": []
                },
                {
                  "value": 0.75,
                  "description": "b, length normalization parameter",
                  "details": []
                },
                {
                  "value": 15.0,
                  "description": "dl, length of field",
                  "details": []
                },
                {
                  "value": 19.052631,
                  "description": "avgdl, average length of field",
                  "details": []
                }
              ]
            }
          ]
        }
      ]
    }
  }
]
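To check that I am reading the explain output correctly, here is a small Python sketch that recomputes both scores from the formulas and values printed above. The function names (bm25_idf, bm25_tf, bm25_score) are my own for illustration; the formulas are copied from the description strings in the response, and I am assuming log means the natural logarithm.

```python
import math

def bm25_idf(n, N):
    # idf, computed as log(1 + (N - n + 0.5) / (n + 0.5))
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

def bm25_tf(freq, dl, avgdl, k1=1.2, b=0.75):
    # tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl))
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))

def bm25_score(n, N, freq, dl, avgdl, boost=2.2):
    # score, computed as boost * idf * tf
    return boost * bm25_idf(n, N) * bm25_tf(freq, dl, avgdl)

# Values taken verbatim from the two _explanation blocks.
dogsledding = bm25_score(n=6, N=360, freq=1.0, dl=23.0, avgdl=19.447222)
sled_dog = bm25_score(n=11, N=342, freq=1.0, dl=15.0, avgdl=19.052631)

print("Dogsledding Tour:", round(dogsledding, 4))
print("Sled Dog Camp:   ", round(sled_dog, 4))
```

Both results agree with the reported _score values to about four decimal places, so the arithmetic itself is consistent. The shorter "Sled Dog" name actually gets the higher tf (dl of 15 versus 23), so the entire difference in ranking comes from the two explanations using different n and N (6 of 360 versus 11 of 342).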
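Questions:
Why do the two documents have different idf values? As I understand it, the idf of a given term should be the same for every document in the collection. Am I wrong?
What is this strange formula for tf? Isn't tf simply the number of occurrences of the term divided by the number of words in the document?
How can I make a document that contains "dog" as a standalone word score higher than a document in which "dog" only appears as a substring of a longer word, without losing the ability to search by partial occurrence that the edge n-gram tokenizer provides?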