A friend of mine has 65,000 documents stored in Elasticsearch on Elastic Cloud, and I want to retrieve all of them using Python. However, when I run my current script, I get this error:
RequestError(400, 'search_phase_execution_exception', 'Result window is too large, from + size must be less than or equal to: [10000] but was [30000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting.')
My script:
es = Elasticsearch(cloud_id=cloud_id, http_auth=(username, password))
docs = es.search(body={"query": {"match_all": {}}, '_source': ["_id"], 'size': 65000})
What is the simplest way to retrieve all of these documents without being limited to 10,000? Thanks.
The limit exists so that result sets don't overwhelm your nodes. Results take up memory on the Elasticsearch node: the larger the result set, the larger the memory footprint and the impact on the node.
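If you really do need from + size beyond 10,000, the error message also points at the index.max_result_window setting. A minimal sketch of raising it, reusing the es client from the question (the index name is illustrative); note this only moves the memory problem, so scroll or search_after is usually the better choice:

# Raise the result window cap mentioned in the error message.
# This increases memory pressure per request; prefer scroll/search_after.
es.indices.put_settings(
    index="my_index",
    body={"index": {"max_result_window": 70000}},
)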
Depending on what you want to do with the retrieved documents, you have two options:

- the scroll API (as suggested in the error message). In that case, mind the lifetime of the scroll context (see the sketch after this list): https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-body.html#request-body-search-scroll

- search_after ("Search After"): https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-body.html#request-body-search-search-after
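If you go the scroll route, the Python client's helpers module has scan(), which drives the scroll API and cleans up the scroll context for you. A minimal sketch, assuming the es client from the question and an illustrative index name:

from elasticsearch.helpers import scan

# scan() wraps the scroll API and clears the scroll context when done.
for hit in scan(es, index="my_index",
                query={"query": {"match_all": {}}},
                size=5000):
    print(hit["_id"])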
Here's an example using Python and elasticsearch_dsl:

from elasticsearch_dsl import Search

# `client` is a connected Elasticsearch client, e.g. the `es` instance
# from the question.
def hit_generator(index, chunk_size=5000):
    i = 0
    search_after_id = None
    while True:
        print(f'Aggregating next {chunk_size} documents, aggregated {i*chunk_size} so far...')
        s = Search(using=client, index=index)
        s = s.extra(size=chunk_size)
        # search_after requires a deterministic sort order.
        s = s.sort('_id')
        if search_after_id:
            s = s.extra(search_after=[search_after_id])
        response = s.execute()
        if len(response) == 0:
            print(f'No more results to return for index {index}, scanned <{i*chunk_size} documents')
            break
        for hit in response:
            search_after_id = hit.meta.id
            yield hit
        i += 1

# How to use it?
for hit in hit_generator('my_index'):
    print(f'Got hit: {hit}')
This way, you only read the next 5000 documents at a time, starting from the document where you "left off" in the previous request.
It will perform (number of documents) / (chunk size) search requests in total.
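The same search_after pattern also works with the plain elasticsearch client used in the question. A minimal sketch (iter_hits is a hypothetical helper name; same es client as above, illustrative index name):

def iter_hits(es, index, chunk_size=5000):
    # search_after needs a deterministic sort; sort on _id here,
    # matching the elasticsearch_dsl example above.
    search_after = None
    while True:
        body = {
            "query": {"match_all": {}},
            "sort": ["_id"],
            "size": chunk_size,
        }
        if search_after is not None:
            body["search_after"] = search_after
        hits = es.search(index=index, body=body)["hits"]["hits"]
        if not hits:
            break
        for hit in hits:
            yield hit
        # The last hit's sort values seed the next page.
        search_after = hits[-1]["sort"]

for hit in iter_hits(es, "my_index"):
    print(hit["_id"])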