Why does "More Like This" in ElasticSearch not respect the TF-IDF order for a single term?

Question

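I have been trying to wrap my head around the "More Like This" functionality in ElasticSearch. I have read and re-read the documentation, but I am struggling to understand why the following behavior occurs.

Basically, I insert three documents and then run a "More Like This" query with max_query_terms=1, expecting the term with the higher TF-IDF to be used, but that does not seem to be the case.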

curl -XPOST --header 'Content-Type: application/json' http://localhost:9200/samples/_doc -d '{
   "message": "dog barks"
}';
curl -XPOST --header 'Content-Type: application/json' http://localhost:9200/samples/_doc -d '{
   "message": "cat fur"
}';
curl -XPOST --header 'Content-Type: application/json' http://localhost:9200/samples/_doc -d '{
   "message": "cat naps"
}';
curl -XGET --header 'Content-Type: application/json' 'http://localhost:9200/samples/_search/' -d '{
    "query": {
        "more_like_this" : {
            "like" : ["cat", "dog"],
            "fields" : ["message"],
            "minimum_should_match" : 1,
            "min_term_freq" : 1,
            "min_doc_freq" : 1,
            "max_query_terms" : 1
        }
    }
}';

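Expected output:

"dog barks" document

Actual output:

"cat naps" and "cat fur" documents

Explanation of the expected output

The documentation states:

Suppose we wanted to find all documents similar to a given input document. Obviously, the input document itself should be its best match for that type of query. And the reason would be mostly, according to Lucene scoring formula, due to the terms with the highest tf-idf. Therefore, the terms of the input document that have the highest tf-idf are good representatives of that document, and could be used within a disjunctive query (or OR) to retrieve similar documents. The MLT query simply extracts the text from the input document, analyzes it, usually using the same analyzer at the field, then selects the top K terms with highest tf-idf to form a disjunctive query of these terms.

Since I specified max_query_terms = 1, only the term of the input document with the highest TF-IDF score should be used in the disjunctive query. In this case the input document has two terms. They have the same term frequency in the input document, but cat occurs in twice as many documents of the corpus, so it has the higher document frequency. Therefore dog should have a higher TF-IDF score than cat, so I expect the disjunctive query to be just "message":"dog" and the returned result to be the "dog barks" document.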
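The expectation above can be checked with a small calculation. This is a minimal sketch, not Elasticsearch code: it assumes whitespace tokenization, statistics taken over all three documents at once, and an idf of the form 1 + ln((N + 1) / (df + 1)) as used by Lucene's classic similarity. The names `classic_idf` and `top_terms` are made up for this illustration.

```python
import math

# The three indexed documents from the question, tokenized on whitespace.
corpus = [["dog", "barks"], ["cat", "fur"], ["cat", "naps"]]
# Terms of the "like" input; each one occurs exactly once.
like_terms = ["cat", "dog"]


def classic_idf(doc_freq, num_docs):
    # Assumed idf formula (classic Lucene style), used only for illustration.
    return 1.0 + math.log((num_docs + 1) / (doc_freq + 1))


def top_terms(terms, docs, max_query_terms):
    """Rank the input terms by tf * idf and keep the best max_query_terms."""
    scored = []
    for term in set(terms):
        tf = terms.count(term)                      # frequency in the input
        df = sum(1 for doc in docs if term in doc)  # documents containing it
        scored.append((tf * classic_idf(df, len(docs)), term))
    scored.sort(reverse=True)
    return [term for _, term in scored[:max_query_terms]]


print(top_terms(like_terms, corpus, max_query_terms=1))  # ['dog']
```

With corpus-wide statistics, dog (document frequency 1) outranks cat (document frequency 2), which is the reasoning behind the expected "dog barks" result.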

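I am trying to understand what is going on here. Any help is much appreciated :)

Note on determinism

I tried re-running this setup a few times. When running the 4 ES commands above (3 POST + MLT GET) after a curl -XDELETE 'http://localhost:9200/samples' command, sometimes I get "cat naps" and "cat fur", other times I get "cat naps", "cat fur" and "dog barks", and a few times I even get only "dog barks".

Full output

Earlier I hand-waved and only described what the output of the GET query was. Let me be more precise.

Actual output #1 (happens some of the time):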

{"took":1,"timed_out":false,"_shards":
{"total":5,"successful":5,"skipped":0,"failed":0},"hits":
{"total":2,"max_score":0.6931472,"hits":
[{"_index":"samples","_type":"_doc","_id":"UHAoI3IBapDWjHWvsQ0_","_score":0.6931472,"_source":{
   "message": "cat fur"
}},{"_index":"samples","_type":"_doc","_id":"UXAoI3IBapDWjHWvsQ1c","_score":0.2876821,"_source":{
   "message": "cat naps"
}}]}}

Actual output #2 (happens some of the time):

{"took":2,"timed_out":false,"_shards":
{"total":5,"successful":5,"skipped":0,"failed":0},"hits":
{"total":3,"max_score":0.2876821,"hits":
[{"_index":"samples","_type":"_doc","_id":"VHAtI3IBapDWjHWvvA0B","_score":0.2876821,"_source":{
   "message": "cat fur"
}},{"_index":"samples","_type":"_doc","_id":"U3AtI3IBapDWjHWvuw3l","_score":0.2876821,"_source":{
   "message": "dog barks"
}},{"_index":"samples","_type":"_doc","_id":"VXAtI3IBapDWjHWvvA0V","_score":0.2876821,"_source":{
   "message": "cat naps"
}}]}}

Actual output #3 (the rarest of the three):

{"took":1,"timed_out":false,"_shards":
{"total":5,"successful":5,"skipped":0,"failed":0},"hits":
{"total":1,"max_score":0.9808292,"hits":
[{"_index":"samples","_type":"_doc","_id":"WXAzI3IBapDWjHWvbQ3s","_score":0.9808292,"_source":{
   "message": "dog barks"
}}]}}

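Tried spacing out the inserts and the MLT query

Maybe Elasticsearch is in some odd "processing state" and needs a little time between documents. So I gave ES some time between inserting the documents and before running the GET command.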

filename="testEsOutput-10-incremental.txt"
amount=10
echo "Test-10-incremental"
for i in {1..10}
do
    curl -XDELETE 'http://localhost:9200/samples';
    sleep $amount
    curl -XPOST --header 'Content-Type: application/json' http://localhost:9200/samples/_doc -d '{
       "message": "dog barks"
    }';
    sleep $amount
    curl -XPOST --header 'Content-Type: application/json' http://localhost:9200/samples/_doc -d '{
       "message": "cat fur"
    }';
    sleep $amount
    curl -XPOST --header 'Content-Type: application/json' http://localhost:9200/samples/_doc -d '{
       "message": "cat naps"
    }';
    sleep $amount

    curl -XGET --header 'Content-Type: application/json' 'http://localhost:9200/samples/_search/' -d '{
        "query": {
            "more_like_this" : {
                "like" : ["cat", "dog"],
                "fields" : ["message"],
                "minimum_should_match" : 1,
                "min_term_freq" : 1,
                "min_doc_freq" : 1,
                "max_query_terms" : 1
            }
        }
    }' >> $filename
    # printf interprets \n portably; plain echo in bash would print the backslashes literally
    printf '\n----\n----\n' >> "$filename"
done
echo "Done!"

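However, this did not seem to affect the non-deterministic output in any meaningful way.

Tried `search_type=dfs_query_then_fetch`

Following this SO post about ES non-determinism, I tried adding the dfs_query_then_fetch option, that is: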

curl -XGET --header 'Content-Type: application/json' 'http://localhost:9200/samples/_search/?search_type=dfs_query_then_fetch' -d '{
        "query": {
            "more_like_this" : {
                "like" : ["cat", "dog"],
                "fields" : ["message"],
                "minimum_should_match" : 1,
                "min_term_freq" : 1,
                "min_doc_freq" : 1,
                "max_query_terms" : 1
            }
        }
    }'

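But the results are still not deterministic; they vary among the three outcomes.

Additional notes

I tried looking at additional debug information via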

curl -XGET --header 'Content-Type: application/json' 'http://localhost:9200/samples/_validate/query?rewrite=true' -d '{
    "query": {
        "more_like_this" : {
            "like" : ["cat", "dog"],
            "fields" : ["message"],
            "minimum_should_match" : 1,
            "min_term_freq" : 1,
            "min_doc_freq" : 1,
            "max_query_terms" : 1
        }
    }
}';

but this sometimes outputs

{"_shards":{"total":1,"successful":1,"failed":0},"valid":true,"explanations":
[{"index":"samples","valid":true,"explanation":"message:cat"}]}

and sometimes

{"_shards":{"total":1,"successful":1,"failed":0},"valid":true,"explanations":
[{"index":"samples","valid":true,"explanation":"like:[cat, dog]"}]}

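So the output is not even deterministic (when run back to back).

Note: tested on ElasticSearch 6.8.8, both locally and in an online REPL. I also tested with an actual document, for example: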

curl -XPUT --header 'Content-Type: application/json' http://localhost:9200/samples/_doc/72 -d '{
   "message" : "dog cat"
}';
curl -XGET --header 'Content-Type: application/json' 'http://localhost:9200/samples/_search/' -d '{
    "query": {
        "more_like_this" : {
            "like" : {
                "_id" : "72"
            }
            ,
            "fields" : ["message"],
            "minimum_should_match" : 1,
            "min_term_freq" : 1,
            "min_doc_freq" : 1,
            "max_query_terms" : 1
        }
    }
}';

but got the same "cat naps" and "cat fur" documents.

Answer 1

OK, after a lot of debugging, I tried limiting the index to a single shard, i.e.

curl -XPUT --header 'Content-Type: application/json' 'http://localhost:9200/samples' -d '{
    "settings" : {
        "index" : {
            "number_of_shards" : 1, 
            "number_of_replicas" : 0 
        }
    }
}';

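When I do this, I get, 100% of the time, only the "dog barks" document.

It seems that even with the search_type=dfs_query_then_fetch option (on a multi-shard index), ES still does not do a fully exact job. I do not know what other option I could use to force exact behavior. Maybe someone else can answer with more insight.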
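One hypothesis consistent with the single-shard result is that each shard picks its MLT terms from its own local statistics, so the outcome depends on how the three documents happen to be spread over the shards. The sketch below is a toy model of that hypothesis, not Elasticsearch code. The helper names, the idf formula, and the alphabetical tie-break are all assumptions made for illustration.

```python
import math

DOCS = {
    "dog barks": ["dog", "barks"],
    "cat fur": ["cat", "fur"],
    "cat naps": ["cat", "naps"],
}
LIKE = ["cat", "dog"]


def shard_hits(shard_docs, min_doc_freq=1, max_query_terms=1):
    """Choose query terms from shard-local statistics, then match locally."""
    candidates = []
    for term in LIKE:
        df = sum(1 for name in shard_docs if term in DOCS[name])
        if df < min_doc_freq:
            continue  # this shard has never seen the term
        idf = 1.0 + math.log((len(shard_docs) + 1) / (df + 1))
        candidates.append((idf, term))
    # Highest idf first; ties broken alphabetically (arbitrary choice of this sketch).
    candidates.sort(key=lambda c: (-c[0], c[1]))
    chosen = {term for _, term in candidates[:max_query_terms]}
    return [name for name in shard_docs if chosen & set(DOCS[name])]


def search(placement):
    """placement is a list of shards, each a list of document names."""
    hits = []
    for shard in placement:
        hits.extend(shard_hits(shard))
    return sorted(hits)


# All three documents on one shard: only "dog barks" matches.
print(search([["dog barks", "cat fur", "cat naps"]]))
# One document per shard: every shard matches its own document.
print(search([["dog barks"], ["cat fur"], ["cat naps"]]))
# "dog barks" and "cat fur" share a shard, "cat naps" is alone.
print(search([["dog barks", "cat fur"], ["cat naps"]]))
```

Under these assumptions the three placements reproduce the three observed result sets. It would also explain why search_type=dfs_query_then_fetch did not help, if that option only affects the statistics used for scoring and not the term selection.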
