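I am using StormCrawler (v2.10) to crawl intranet websites and store the data in Elasticsearch (v7.8.0), with Kibana for visualization. The intranet pages have custom meta tags, as follows:
I want to store these in the Elasticsearch index "crawler content", but none of these fields show up in Kibana/Elasticsearch.
Updated index script: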
{
  "settings": {
    "index": {
      "number_of_shards": 5,
      "number_of_replicas": 1,
      "refresh_interval": "5s",
      "default_pipeline": "timestamp"
    }
  },
  "mappings": {
    "_source": {
      "enabled": true
    },
    "properties": {
      "content": {
        "type": "text"
      },
      "description": {
        "type": "text"
      },
      "domain": {
        "type": "keyword"
      },
      "format": {
        "type": "keyword"
      },
      "keywords": {
        "type": "keyword"
      },
      "host": {
        "type": "keyword"
      },
      "title": {
        "type": "text"
      },
      "url": {
        "type": "keyword"
      },
      "timestamp": {
        "type": "date",
        "format": "date_optional_time"
      },
      "metatag": {
        "properties": {
          "article_description": {
            "type": "text"
          },
          "article_heading": {
            "type": "text"
          },
          "article_publisheddate": {
            "type": "date"
          },
          "article_type": {
            "type": "text"
          },
          "article_year": {
            "type": "text"
          }
        }
      }
    }
  }
}
Added to jsoupfilters.json:
"parse.article_description": "//META[@name=\"Article_Description\"]/@content",
"parse.article_heading": "//META[@name=\"Article_Heading\"]/@content",
"parse.article_publisheddate": "//META[@name=\"Article_PublishedDate\"]/@content",
"parse.article_type": "//META[@name=\"Article_Type\"]/@content",
"parse.article_year": "//META[@name=\"Article_Year\"]/@content"
Added to crawler-conf.yaml:
indexer.md.mapping:
- parse.title=title
- parse.search=search
- parse.keywords=keywords
- parse.description=description
- parse.article_description=metatag.article_description
- parse.article_heading=metatag.article_heading
- parse.article_publisheddate=metatag.article_publisheddate
- parse.article_type=metatag.article_type
- parse.article_year=metatag.article_year
- domain
- format
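To make the intent of the mapping above explicit, here is a small Python sketch of how I read the entries. It is not StormCrawler code: `resolve_mapping` is a hypothetical helper, and it assumes that `key=field` indexes the metadata key under the name `field`, while a bare entry such as `domain` keeps its own name.

```python
# Hypothetical helper, not part of StormCrawler: it only mirrors the
# assumed reading of indexer.md.mapping entries described above.
def resolve_mapping(entries):
    """Return {metadata_key: index_field} for a list of mapping entries."""
    resolved = {}
    for entry in entries:
        key, sep, field = entry.partition("=")
        key = key.strip()
        # Assumption: an entry without "=" is indexed under its own name.
        resolved[key] = field.strip() if sep else key
    return resolved


# A few entries copied from the configuration above.
entries = [
    "parse.title=title",
    "parse.article_heading=metatag.article_heading",
    "domain",
]

mapping = resolve_mapping(entries)
for key, field in mapping.items():
    print(f"{key} -> {field}")
```

The left-hand keys have to match the keys written by jsoupfilters.json exactly, so listing them side by side like this is a quick way to spot a typo.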
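I can't see anything obviously wrong in your setup. You can run the class https://github.com/DigitalPebble/storm-crawler/blob/master/core/src/main/java/com/digitalpebble/stormcrawler/parse/JSoupFilters.java on a single URL to check the extraction. It is also useful to test the output of the protocol on the command line; see our recent blog post for an example.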
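Before running the crawler classes, it can also help to confirm that the HTML itself carries meta tags with exactly the names the XPath expressions compare against. The following is a minimal standalone sketch using only the Python standard library. It does not reproduce what JSoupFilters does, and `SAMPLE_HTML`, its values, and the `MetaCollector` class are made up for illustration; you would paste in the source of one of your own pages instead.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects name -> content for every <meta> tag that has both attributes."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names,
        # but attribute values keep their original case.
        if tag != "meta":
            return
        attr = dict(attrs)
        name, content = attr.get("name"), attr.get("content")
        if name is not None and content is not None:
            self.meta[name] = content


# Hypothetical page source; replace with the HTML of a real intranet page.
SAMPLE_HTML = """<html><head>
<meta name="Article_Heading" content="Quarterly update">
<meta name="Article_Year" content="2020">
</head><body></body></html>"""

# The names used in the @name comparisons in jsoupfilters.json.
EXPECTED = [
    "Article_Description",
    "Article_Heading",
    "Article_PublishedDate",
    "Article_Type",
    "Article_Year",
]

collector = MetaCollector()
collector.feed(SAMPLE_HTML)
missing = [name for name in EXPECTED if name not in collector.meta]

print("found:", collector.meta)
print("missing:", missing)
```

If a name shows up as missing here, the page and the filter disagree on the tag name (for example in letter case), and the problem is in the extraction step rather than in the index mapping.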