How to store custom meta tags from a website in an Elasticsearch index using StormCrawler

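I am using StormCrawler (v2.10) to crawl intranet websites and store the data in Elasticsearch (v7.8.0), with Kibana for visualization. The intranet pages have custom meta tags as follows:

I want to store these in the Elasticsearch index "crawler content", but none of these fields show up in Kibana/Elasticsearch.

Updated index script: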

{
  "settings": {
    "index": {
      "number_of_shards": 5,
      "number_of_replicas": 1,
      "refresh_interval": "5s",
      "default_pipeline": "timestamp"
    }
  },
  "mappings": {
    "_source": {
      "enabled": true
    },
    "properties": {
      "content": {
        "type": "text"
      },
      "description": {
        "type": "text"
      },
      "domain": {
        "type": "keyword"
      },
      "format": {
        "type": "keyword"
      },
      "keywords": {
        "type": "keyword"
      },
      "host": {
        "type": "keyword"
      },
      "title": {
        "type": "text"
      },
      "url": {
        "type": "keyword"
      },
      "timestamp": {
        "type": "date",
        "format": "date_optional_time"
      },
      "metatag": {
        "properties": {
          "article_description": {
            "type": "text"
          },
          "article_heading": {
            "type": "text"
          },
          "article_publisheddate": {
            "type": "date"
          },
          "article_type": {
            "type": "text"
          },
          "article_year": {
            "type": "text"
          }
        }
      }
    }
  }
}

Added to jsoupfilters.json:

"parse.article_description": "//META[@name=\"Article_Description\"]/@content",
"parse.article_heading": "//META[@name=\"Article_Heading\"]/@content",
"parse.article_publisheddate": "//META[@name=\"Article_PublishedDate\"]/@content",
"parse.article_type": "//META[@name=\"Article_Type\"]/@content",
"parse.article_year": "//META[@name=\"Article_Year\"]/@content"

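One way to sanity-check expressions like these is to list the meta tags a page actually exposes and compare the names character for character. The sketch below is only a stand-in: it uses Python's standard html.parser rather than StormCrawler's own XPath evaluation, and the sample HTML and the helper name extract_meta are hypothetical, since the real intranet markup is not shown.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects <meta name="..." content="..."> pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names, but keeps
        # attribute values (such as "Article_Heading") exactly as written.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("name")
        content = attr_map.get("content")
        if name is not None and content is not None:
            self.meta[name] = content


def extract_meta(html):
    """Return a dict mapping each meta tag's name to its content."""
    collector = MetaCollector()
    collector.feed(html)
    return collector.meta


# Hypothetical sample page, used only for illustration.
SAMPLE_HTML = """
<html><head>
<meta name="Article_Heading" content="Quarterly update">
<meta name="Article_Year" content="2024">
</head><body></body></html>
"""

if __name__ == "__main__":
    for name, content in extract_meta(SAMPLE_HTML).items():
        print(f"{name} = {content}")
```

XPath compares attribute values case-sensitively, so a page that emits name="article_heading" would not match a predicate written as [@name="Article_Heading"].
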
Added to crawler-conf.yaml:

indexer.md.mapping:
  - parse.title=title
  - parse.search=search
  - parse.keywords=keywords
  - parse.description=description
  - parse.article_description=metatag.article_description
  - parse.article_heading=metatag.article_heading
  - parse.article_publisheddate=metatag.article_publisheddate
  - parse.article_type=metatag.article_type
  - parse.article_year=metatag.article_year
  - domain
  - format
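
As a rough illustration of what the indexer.md.mapping entries are expected to do, the following Python sketch applies "key=field" pairs to a metadata map. It is not StormCrawler code: apply_md_mapping and the sample metadata are hypothetical, and it assumes that a bare key keeps its own name and that a key with no extracted value yields no field.

```python
def apply_md_mapping(metadata, mapping):
    """Return the document fields produced from `metadata` under `mapping`.

    Each mapping entry is either "source_key=target_field" or a bare key,
    which is assumed here to keep its own name as the field name.
    """
    fields = {}
    for entry in mapping:
        source, sep, target = entry.partition("=")
        if not sep:
            target = source
        values = metadata.get(source)
        if values:
            # Only keys that actually carry values become fields.
            fields[target] = values
    return fields


# A shortened copy of the mapping from the question.
MAPPING = [
    "parse.title=title",
    "parse.article_heading=metatag.article_heading",
    "parse.article_year=metatag.article_year",
    "domain",
]

# Hypothetical metadata for one page; parse.article_year was not extracted.
SAMPLE_METADATA = {
    "parse.title": ["Home"],
    "parse.article_heading": ["Quarterly update"],
    "domain": ["example.com"],
}

if __name__ == "__main__":
    for field, values in apply_md_mapping(SAMPLE_METADATA, MAPPING).items():
        print(field, values)
```

Under these assumptions, a field such as metatag.article_year simply never appears when the parse filter extracted nothing for parse.article_year, which would look the same in Kibana as a mapping problem.
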

Tags: apache-storm, stormcrawler

1 answer

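I can't see anything obviously incorrect in your setup. You can run the class https://github.com/DigitalPebble/storm-crawler/blob/master/core/src/main/java/com/digitalpebble/stormcrawler/parse/JSoupFilters.java on a single URL to check the extraction. Testing the output of the protocol on the command line is also useful; see our recent blog for an example.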
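
Alongside checking the extraction, it can help to ask Elasticsearch directly whether any indexed document carries one of the custom fields. The Python sketch below builds an exists query and sends it to the _count endpoint; the base URL and index name are placeholders, and the function names are hypothetical.

```python
import json
import urllib.request


def exists_query(field):
    """Build a query body matching documents where `field` has a value."""
    return {"query": {"exists": {"field": field}}}


def count_docs_with_field(base_url, index, field):
    """POST the exists query to <base_url>/<index>/_count and return the count."""
    body = json.dumps(exists_query(field)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/{index}/_count",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        result = json.load(response)
    # The _count response carries the number of matches under "count".
    return result["count"]


# Example call (placeholder URL and index name, adjust to your cluster):
# count_docs_with_field("http://localhost:9200", "my-index", "metatag.article_heading")
```

A count of zero for every metatag.* field, while title or description are populated, would suggest the values are lost before indexing rather than in the index mapping.
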
