Loading a Wikipedia dump into Elasticsearch

Question (votes: 1, answers: 2)

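I want to load an XML Wikipedia dump, for example http://ftp.acc.umu.se/mirror/wikimedia.org/dumps/enwiki/20171001/enwiki-20171001-pages-articles.xml.bz2, into Elasticsearch (5.6.4). However, all the tools and tutorials I have found are outdated and incompatible with my Elasticsearch version. Can anyone explain the best way to import the dump into Elasticsearch?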

xml elasticsearch wikipedia
2 answers
3
votes

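For about two years now, Wikimedia has provided dumps of its production Elasticsearch indices.

The indices are exported once a week, and there are two exports for each wiki: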

The content index, which contains only article pages, called content;
The general index, containing all pages. This includes talk pages, templates, etc, called general;

You can find them here: http://dumps.wikimedia.org/other/cirrussearch/current/

  • Create a mapping according to your needs. For example:

    {
      "mappings": {
        "page": {
          "properties": {
            "auxiliary_text": { "type": "text" },
            "category": { "type": "text" },
            "coordinates": {
              "properties": {
                "coord": {
                  "properties": {
                    "lat": { "type": "double" },
                    "lon": { "type": "double" }
                  }
                },
                "country": { "type": "text" },
                "dim": { "type": "long" },
                "globe": { "type": "text" },
                "name": { "type": "text" },
                "primary": { "type": "boolean" },
                "region": { "type": "text" },
                "type": { "type": "text" }
              }
            },
            "defaultsort": { "type": "boolean" },
            "external_link": { "type": "text" },
            "heading": { "type": "text" },
            "incoming_links": { "type": "long" },
            "language": { "type": "text" },
            "namespace": { "type": "long" },
            "namespace_text": { "type": "text" },
            "opening_text": { "type": "text" },
            "outgoing_link": { "type": "text" },
            "popularity_score": { "type": "double" },
            "redirect": {
              "properties": {
                "namespace": { "type": "long" },
                "title": { "type": "text" }
              }
            },
            "score": { "type": "double" },
            "source_text": { "type": "text" },
            "template": { "type": "text" },
            "text": { "type": "text" },
            "text_bytes": { "type": "long" },
            "timestamp": { "type": "date", "format": "strict_date_optional_time||epoch_millis" },
            "title": { "type": "text" },
            "version": { "type": "long" },
            "version_type": { "type": "text" },
            "wiki": { "type": "text" },
            "wikibase_item": { "type": "text" }
          }
        }
      }
    }
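The mapping has to be sent to Elasticsearch when the index is created. As a rough sketch (not part of the original answer), this is how the request could be built with only the Python standard library. The function name `build_create_index_request` is made up for illustration, and the base URL `http://localhost:9200` and index name `enwiki` are assumptions taken from the bulk command in this answer.

```python
import json
import urllib.request


def build_create_index_request(mapping, base_url="http://localhost:9200", index="enwiki"):
    """Build (but do not send) a PUT request that creates `index` with `mapping`.

    `mapping` is the whole JSON body as a dict, i.e. {"mappings": {...}}.
    The base URL and index name are assumptions; adjust them to your cluster.
    """
    body = json.dumps(mapping).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def create_index(mapping, **kwargs):
    """Send the request and return the HTTP status code (needs a running cluster)."""
    request = build_create_index_request(mapping, **kwargs)
    with urllib.request.urlopen(request) as response:
        return response.status
```

Calling `create_index(mapping)` with the mapping loaded via `json.load` from a file should create the index, provided a node is listening on that address.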

Once the index has been created, just type:

zcat enwiki-current-cirrussearch-general.json.gz | parallel --pipe -L 2 -N 2000 -j3 'curl -s http://localhost:9200/enwiki/_bulk --data-binary @- > /dev/null'
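As I read the command, `-L 2` treats every two lines as one record (the dump alternates a bulk action line with a document line), `-N 2000` sends up to 2000 such records per request, and `-j3` runs three uploads in parallel. If GNU parallel is not available, the same batching can be sketched in Python. This is only a sequential approximation under those assumptions: the function names are made up, and the `application/x-ndjson` content type is my choice, not something the original command sets.

```python
import gzip
import itertools
import urllib.request


def iter_bulk_batches(lines, pairs_per_batch=2000):
    """Yield request bodies holding up to `pairs_per_batch` action/document pairs.

    `lines` is any iterable of bytes lines, assumed to alternate between a bulk
    action line and the document line that belongs to it.
    """
    iterator = iter(lines)
    while True:
        chunk = list(itertools.islice(iterator, 2 * pairs_per_batch))
        if not chunk:
            return
        # The bulk API expects every line, including the last, to end with "\n".
        yield b"".join(line.rstrip(b"\r\n") + b"\n" for line in chunk)


def post_bulk(body, url="http://localhost:9200/enwiki/_bulk"):
    """POST one batch to the bulk endpoint and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-ndjson"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def load_dump(path, url="http://localhost:9200/enwiki/_bulk"):
    """Stream a gzipped cirrussearch dump into Elasticsearch, batch by batch."""
    with gzip.open(path, "rb") as handle:
        for body in iter_bulk_batches(handle):
            post_bulk(body, url)
```

Unlike the shell pipeline, this version uploads one batch at a time, so it will likely be slower; it also only checks the HTTP status, not the per-item errors in the bulk response.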

Enjoy!


0
votes

I tried many ways to import Wikipedia. I found two approaches: using Logstash, or writing the importer directly in Python.
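To give an idea of what the Python route can look like for the XML dump from the question, here is a minimal sketch, not the code I actually used. It assumes the usual MediaWiki export layout (`<page>` elements with `<title>`, `<id>` and `<revision><text>`); the function names, the index name `enwiki` and the type name `page` are assumptions, and `_type` in the bulk action line fits the 5.x series, while newer Elasticsearch versions handle types differently.

```python
import bz2
import json
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (page_id, document) for each <page> in a MediaWiki XML export."""
    context = ET.iterparse(stream, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event != "end" or local_name(elem.tag) != "page":
            continue
        page_id = None
        doc = {"title": None, "text": None}
        for child in elem:  # direct children only, so <id> is the page id
            name = local_name(child.tag)
            if name == "title":
                doc["title"] = child.text
            elif name == "id":
                page_id = child.text
            elif name == "revision":
                for sub in child:
                    if local_name(sub.tag) == "text":
                        doc["text"] = sub.text or ""
        yield page_id, doc
        root.clear()  # drop pages already handled so memory stays bounded


def to_bulk_lines(page_id, doc, index="enwiki", doc_type="page"):
    """Return the action line and document line for one page, newline-terminated."""
    action = {"index": {"_index": index, "_type": doc_type, "_id": page_id}}
    return json.dumps(action) + "\n" + json.dumps(doc, ensure_ascii=False) + "\n"


def iter_dump_bulk_lines(path):
    """Stream a .xml.bz2 dump and yield bulk-format text for each page."""
    with bz2.open(path, "rb") as handle:
        for page_id, doc in iter_pages(handle):
            yield to_bulk_lines(page_id, doc)
```

The output of `iter_dump_bulk_lines` can be collected into batches and posted to the `_bulk` endpoint. Note that the `text` field here is raw wikitext, so templates and markup are indexed as-is.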
