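We are facing a strange problem when querying Solr. SolrCloud gives different scores than a Solr master/slave setup for the same content. A second problem is that SolrCloud changes the score for the same content and the same query between requests, so the order of documents differs across calls. On master/slave the scores are fixed and do not change between calls.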
Here is the score calculation for the same records on master/slave:
63372217#83#-2128821991: "
8.439063 = boost(((title:narendra | keywords:narendra) (title:mod | keywords:mod))~1,1.0/(3.16E-11*float(ms(const(1524117881692),date(effectivetriedate)))+1.0)), product of:
9.141734 = sum of:
9.141734 = max of:
9.141734 = weight(title:mod in 10186378) [SchemaSimilarity], result of:
9.141734 = score(doc=10186378,freq=1.0 = termFreq=1.0
), product of:
9.458362 = idf(docFreq=805, docCount=10322376)
0.96652406 = tfNorm, computed from:
1.0 = termFreq=1.0
1.2 = parameter k1
0.75 = parameter b
6.5560484 = avgFieldLength
7.111111 = fieldLength
8.783037 = weight(keywords:mod in 10186378) [SchemaSimilarity], result of:
8.783037 = score(doc=10186378,freq=1.0 = termFreq=1.0
), product of:
8.783037 = idf(docFreq=886, docCount=5782333)
1.0 = tfNorm, computed from:
1.0 = termFreq=1.0
1.2 = parameter k1
0.0 = parameter b (norms omitted for field)
0.92313594 = 1.0/(3.16E-11*float(ms(const(1524117881692),date(effectivetriedate)=2018-03-19T18:09:00Z))+1.0)
",
60930380#83#-2128833038: "
8.3860035 = boost(((title:narendra | keywords:narendra) (title:mod | keywords:mod))~1,1.0/(3.16E-11*float(ms(const(1524117881692),date(effectivetriedate)))+1.0)), product of:
12.907965 = sum of:
4.1249275 = max of:
4.1249275 = weight(keywords:narendra in 3310267) [SchemaSimilarity], result of:
4.1249275 = score(doc=3310267,freq=1.0 = termFreq=1.0
), product of:
4.1249275 = idf(docFreq=93469, docCount=5782333)
1.0 = tfNorm, computed from:
1.0 = termFreq=1.0
1.2 = parameter k1
0.0 = parameter b (norms omitted for field)
8.783037 = max of:
8.783037 = weight(keywords:mod in 3310267) [SchemaSimilarity], result of:
8.783037 = score(doc=3310267,freq=1.0 = termFreq=1.0
), product of:
8.783037 = idf(docFreq=886, docCount=5782333)
1.0 = tfNorm, computed from:
1.0 = termFreq=1.0
1.2 = parameter k1
0.0 = parameter b (norms omitted for field)
0.6496767 = 1.0/(3.16E-11*float(ms(const(1524117881692),date(effectivetriedate)=2017-10-03T18:02:13Z))+1.0)
"
SolrCloud setup:
63372217#83#-2128821991=
8.45718 = boost(((title:narendra | keywords:narendra) (title:mod | keywords:mod))~1,1.0/(3.16E-11*float(ms(const(1524118417608),date(effectivetriedate)))+1.0)), product of:
9.161503 = sum of:
9.161503 = max of:
9.161503 = weight(title:mod in 49446) [SchemaSimilarity], result of:
9.161503 = score(doc=49446,freq=1.0 = termFreq=1.0
), product of:
9.522509 = idf(docFreq=298, docCount=4078658)
0.96208924 = tfNorm, computed from:
1.0 = termFreq=1.0
1.2 = parameter k1
0.75 = parameter b
6.4863324 = avgFieldLength
7.111111 = fieldLength
8.878012 = weight(keywords:mod in 49446) [SchemaSimilarity], result of:
8.878012 = score(doc=49446,freq=1.0 = termFreq=1.0
), product of:
8.878012 = idf(docFreq=319, docCount=2291617)
1.0 = tfNorm, computed from:
1.0 = termFreq=1.0
1.2 = parameter k1
0.0 = parameter b (norms omitted for field)
0.9231215 = 1.0/(3.16E-11*float(ms(const(1524118417608),date(effectivetriedate)=2018-03-19T18:09:00Z))+1.0)
63372217#83#-2128821991=
8.499447 = boost(((title:narendra | keywords:narendra) (title:mod | keywords:mod))~1,1.0/(3.16E-11*float(ms(const(1524118478192),date(effectivetriedate)))+1.0)), product of:
9.207306 = sum of:
9.207306 = max of:
9.207306 = weight(title:mod in 90314) [SchemaSimilarity], result of:
9.207306 = score(doc=90314,freq=1.0 = termFreq=1.0
), product of:
9.534913 = idf(docFreq=306, docCount=4240239)
0.96564126 = tfNorm, computed from:
1.0 = termFreq=1.0
1.2 = parameter k1
0.75 = parameter b
6.5421023 = avgFieldLength
7.111111 = fieldLength
8.90691 = weight(keywords:mod in 90314) [SchemaSimilarity], result of:
8.90691 = score(doc=90314,freq=1.0 = termFreq=1.0
), product of:
8.90691 = idf(docFreq=320, docCount=2366191)
1.0 = tfNorm, computed from:
1.0 = termFreq=1.0
1.2 = parameter k1
0.0 = parameter b (norms omitted for field)
Please advise.
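The default behaviour in Solr is to compute scores separately on each shard and then merge them further up the hierarchy. This assumes that you have a decent number of documents and that they are distributed randomly (uniformly) across your shards, regardless of their content. If that is the case, the scores should be similar enough not to cause any larger problems.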
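To make the effect of local statistics concrete, here is a minimal sketch that recomputes the title:mod term score from the explain output in the question. It assumes the BM25 formulas Lucene used at the time (idf = ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) plus the usual k1/b length normalisation); the function names are illustrative and are not Solr APIs. The only inputs that differ between the two setups are the node-local docFreq, docCount and avgFieldLength.

```python
import math

def bm25_idf(doc_freq, doc_count):
    # Inverse document frequency as reported in the "idf(...)" explain line.
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25_tf_norm(freq, k1, b, field_length, avg_field_length):
    # Term-frequency normalisation as reported in the "tfNorm" explain line.
    length_ratio = field_length / avg_field_length
    return (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * length_ratio))

def term_score(doc_freq, doc_count, avg_field_length,
               field_length=7.111111, freq=1.0, k1=1.2, b=0.75):
    # Score of a single term in a single field: idf * tfNorm.
    return (bm25_idf(doc_freq, doc_count)
            * bm25_tf_norm(freq, k1, b, field_length, avg_field_length))

# Statistics copied from the explain output for title:mod.
master_score = term_score(805, 10322376, 6.5560484)  # explain reports 9.141734
cloud_score = term_score(298, 4078658, 6.4863324)    # explain reports 9.161503

print(f"master/slave: {master_score:.6f}")
print(f"SolrCloud:    {cloud_score:.6f}")
```

The document and the query are identical in both runs; the score moves only because the index answering the request holds a different set of documents.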
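In your case, this is what creates the different scores. When a single node answers the query (your master/slave setup), the counts differ from those in the cluster (SolrCloud) setup: in the cluster the documents are spread across multiple servers and, by default, each node scores using only its local counts. Comparing scores across different queries is also hard, especially with a recency boost that changes over time. My guess is that these documents score so close to each other that which one ranks as most relevant is up in the air, and that the score varies with the number of documents present in each shard (i.e. adding another document to one of the shards changes the local scores in that shard).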
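The time dependence of the recency boost can be checked with a few lines of Python. This is a sketch under the assumption that the const(...) value in the explain output is the request time in epoch milliseconds; the function name recency_boost is made up for illustration.

```python
from datetime import datetime, timezone

def recency_boost(now_ms, doc_date, a=3.16e-11):
    # Mirrors the explain expression 1.0 / (3.16E-11 * ms(now, date) + 1.0).
    doc_ms = doc_date.timestamp() * 1000
    return 1.0 / (a * (now_ms - doc_ms) + 1.0)

# effectivetriedate of document 63372217#83#-2128821991
doc_date = datetime(2018, 3, 19, 18, 9, 0, tzinfo=timezone.utc)

# const(...) values copied from the two explain outputs in the question.
boost_master = recency_boost(1524117881692, doc_date)  # explain reports 0.92313594
boost_cloud = recency_boost(1524118417608, doc_date)   # explain reports 0.9231215

print(boost_master, boost_cloud)
```

The two requests were issued roughly nine minutes apart, so the boost factor already differs in the fifth decimal place before any index statistics come into play.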
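One possible solution is to use distributed IDF, i.e. a scoring method that uses the frequencies of the whole collection instead of only those of the local shard. This is done by configuring the stats cache to use ExactStatsCache, ExactSharedStatsCache or LRUStatsCache instead of the default LocalStatsCache. LocalStatsCache is described as:
LocalStatsCache: This only uses local term and document statistics to compute relevance. In cases with uniform term distribution across shards, this works reasonably well. This option is the default if no <statsCache> is configured.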
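The description of ExactStatsCache, by contrast, explains that it uses collection-wide values:
ExactStatsCache: This implementation uses global values (across the collection) for document frequency.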
The other two are different caching implementations of ExactStatsCache.
You can change the stats cache used in solrconfig.xml:
<statsCache class="org.apache.solr.search.stats.ExactStatsCache"/>
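As a rough illustration of what collection-wide statistics change, the sketch below compares per-shard IDF values with an IDF computed from summed counts. It only shows the arithmetic idea and is not how ExactStatsCache is implemented internally. The first (docFreq, docCount) pair is taken from the SolrCloud explain output in the question; the second pair is hypothetical, chosen so that the totals equal the master/slave figures.

```python
import math

def bm25_idf(doc_freq, doc_count):
    # BM25 idf as shown in Lucene explain output (assumed formula).
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

# (docFreq, docCount) for title:mod on each shard.
# Shard 1: from the SolrCloud explain output. Shard 2: hypothetical values.
shards = [(298, 4078658), (507, 6243718)]

# Local scoring: every shard uses only its own counts.
local_idfs = [bm25_idf(df, n) for df, n in shards]

# Distributed IDF: counts are summed across the collection first.
global_df = sum(df for df, _ in shards)
global_n = sum(n for _, n in shards)
global_idf = bm25_idf(global_df, global_n)

for i, idf in enumerate(local_idfs, start=1):
    print(f"shard {i} local idf: {idf:.6f}")
print(f"global idf:        {global_idf:.6f}")
```

With global counts every shard scores the term with the same IDF, so the same document gets the same term weight no matter which shard holds it.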