Nutch 1.16: CSVIndexWriter fails


I have just installed Nutch 1.16 on Fedora 30. I went through the steps of seeding the initial URL list (inject), generating fetch lists, fetching, parsing, updating the database, and inverting links. Before indexing, I updated index-writers.xml:

<?xml version="1.0" encoding="UTF-8" ?>
<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->
<writers xmlns="http://lucene.apache.org/nutch"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://lucene.apache.org/nutch index-writers.xsd">


  <writer id="indexer_csv_1" class="org.apache.nutch.indexwriter.csv.CSVIndexWriter">
    <parameters>
      <param name="fields" value="id,title,content"/>
      <param name="charset" value="UTF-8"/>
      <param name="separator" value=","/>
      <param name="valuesep" value="|"/>
      <param name="quotechar" value="&quot;"/>
      <param name="escapechar" value="&quot;"/>
      <param name="maxfieldlength" value="4096"/>
      <param name="maxfieldvalues" value="12"/>
      <param name="header" value="true"/>
      <param name="outpath" value="csvindexwriter"/>
    </parameters>
    <mapping>
      <copy />
      <rename />
      <remove />
    </mapping>
  </writer>

</writers>

Then I ran:

bin/nutch index crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/2020* -filter -normalize -deleteGone

Below is the error I got, but I'm not sure why:

2020-01-31 12:03:09,385 INFO  crawl.LinkDb - LinkDb: finished at 2020-01-31 12:03:09, elapsed: 00:00:04
2020-01-31 12:04:24,945 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2020-01-31 12:04:25,260 INFO  segment.SegmentChecker - Segment dir is complete: crawl/segments/20200127084916.
2020-01-31 12:04:25,264 INFO  segment.SegmentChecker - Segment dir is complete: crawl/segments/20200127093759.
2020-01-31 12:04:25,268 INFO  segment.SegmentChecker - Segment dir is complete: crawl/segments/20200130115418.
2020-01-31 12:04:25,271 INFO  segment.SegmentChecker - Segment dir is complete: crawl/segments/20200131101723.
2020-01-31 12:04:25,273 INFO  indexer.IndexingJob - Indexer: starting at 2020-01-31 12:04:25
2020-01-31 12:04:25,282 INFO  indexer.IndexingJob - Indexer: deleting gone documents: true
2020-01-31 12:04:25,282 INFO  indexer.IndexingJob - Indexer: URL filtering: true
2020-01-31 12:04:25,283 INFO  indexer.IndexingJob - Indexer: URL normalizing: true
2020-01-31 12:04:25,283 INFO  indexer.IndexerMapReduce - IndexerMapReduce: crawldb: crawl/crawldb
2020-01-31 12:04:25,283 INFO  indexer.IndexerMapReduce - IndexerMapReduce: linkdb: crawl/linkdb
2020-01-31 12:04:25,284 INFO  indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20200127084916
2020-01-31 12:04:25,286 INFO  indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20200127093759
2020-01-31 12:04:25,288 INFO  indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20200130115418
2020-01-31 12:04:25,290 INFO  indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20200131101723
2020-01-31 12:04:26,115 INFO  mapreduce.Job - The url to track the job: http://localhost:8080/
2020-01-31 12:04:26,116 INFO  mapreduce.Job - Running job: job_local1773068951_0001
2020-01-31 12:04:27,120 INFO  mapreduce.Job - Job job_local1773068951_0001 running in uber mode : false
2020-01-31 12:04:27,122 INFO  mapreduce.Job -  map 0% reduce 0%
2020-01-31 12:04:34,127 INFO  mapreduce.Job -  map 100% reduce 0%
2020-01-31 12:04:45,868 INFO  indexer.IndexWriters - Index writer org.apache.nutch.indexwriter.solr.SolrIndexWriter identified.
2020-01-31 12:04:45,965 WARN  exchange.Exchanges - No exchange was configured. The documents will be routed to all index writers.
2020-01-31 12:04:46,272 INFO  indexer.IndexerOutputFormat - Active IndexWriters :
SolrIndexWriter:
┌────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────┬────────────────────────────────┐
│type        │Specifies the SolrClient implementation to use. This is a string value of one of the following "cloud"  or│http                            │
│            │"http". The values represent CloudSolrServer or HttpSolrServer respectively.                              │                                │
├────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────┼────────────────────────────────┤
│url         │Defines the fully qualified URL of Solr into which data should be indexed. Multiple URL  can  be  provided│http://localhost:8983/solr/nutch│
│            │using comma as a delimiter. When the value of type property is cloud,  the  URL  should  not  include  any│                                │
│            │collections or cores; just the root Solr path.                                                            │                                │
├────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────┼────────────────────────────────┤
│collection  │The collection used in requests. Only used when the value of type property is cloud.                      │                                │
├────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────┼────────────────────────────────┤
│commitSize  │Defines the number of documents to send to Solr in a single update  batch.  Decrease  when  handling  very│100                             │
│            │large documents to prevent Nutch from running out of memory. Note: It does not explicitly trigger a server│                                │
│            │side commit.                                                                                              │                                │
├────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────┼────────────────────────────────┤
│weight.field│Field's name where the weight of the documents will be written. If it is empty no field will be used.     │                                │
├────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────┼────────────────────────────────┤
│auth        │Whether to enable HTTP basic authentication for communicating with Solr. Use  the  username  and  password│false                           │
│            │properties to configure your credentials.                                                                 │                                │
├────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────┼────────────────────────────────┤
│username    │The username of Solr server.                                                                              │username                        │
├────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────┼────────────────────────────────┤
│password    │The password of Solr server.                                                                              │password                        │
└────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────┴────────────────────────────────┘


2020-01-31 12:04:46,448 INFO  solr.SolrIndexWriter - Indexing 72/72 documents
2020-01-31 12:04:46,449 INFO  solr.SolrIndexWriter - Deleting 0 documents
2020-01-31 12:04:46,490 INFO  solr.SolrIndexWriter - Indexing 72/72 documents
2020-01-31 12:04:46,490 INFO  solr.SolrIndexWriter - Deleting 0 documents
2020-01-31 12:04:46,528 WARN  mapred.LocalJobRunner - job_local1773068951_0001
java.lang.Exception: java.io.IOException: org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://localhost:8983/solr/nutch
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:491)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:558)
Caused by: java.io.IOException: org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://localhost:8983/solr/nutch
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.makeIOException(SolrIndexWriter.java:282)
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.push(SolrIndexWriter.java:250)
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.commit(SolrIndexWriter.java:214)
    at org.apache.nutch.indexer.IndexWriters.commit(IndexWriters.java:264)
    at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:54)
    at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.close(ReduceTask.java:550)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:629)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:346)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://localhost:8983/solr/nutch
    at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:650)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:255)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:244)
    at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1219)
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.push(SolrIndexWriter.java:247)
    ... 12 more
Caused by: org.apache.http.conn.HttpHostConnectException: Connect to localhost:8983 [localhost/127.0.0.1, localhost/0:0:0:0:0:0:0:1] failed: Connection refused (Connection refused)
    at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:159)
    at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:359)
    at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:381)
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:237)
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
    at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
    at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:111)
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:542)
    ... 16 more
Caused by: java.net.ConnectException: Connection refused (Connection refused)
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:606)
    at org.apache.http.conn.socket.PlainConnectionSocketFactory.connectSocket(PlainConnectionSocketFactory.java:75)
    at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
    ... 26 more
2020-01-31 12:04:47,133 INFO  mapreduce.Job - Job job_local1773068951_0001 failed with state FAILED due to: NA
2020-01-31 12:04:47,167 INFO  mapreduce.Job - Counters: 30
    File System Counters
        FILE: Number of bytes read=2027841168
        FILE: Number of bytes written=3564196112
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
    Map-Reduce Framework
        Map input records=711822
        Map output records=711822
        Map output bytes=224057287
        Map output materialized bytes=225563661
        Input split bytes=3175
        Combine input records=0
        Combine output records=0
        Reduce input groups=0
        Reduce shuffle bytes=225563661
        Reduce input records=0
        Reduce output records=0
        Spilled Records=711822
        Shuffled Maps =19
        Failed Shuffles=0
        Merged Map outputs=19
        GC time elapsed (ms)=667
        Total committed heap usage (bytes)=16629366784
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters 
        Bytes Read=124418962
    File Output Format Counters 
        Bytes Written=0
2020-01-31 12:04:47,167 ERROR indexer.IndexingJob - Indexing job did not succeed, job status:FAILED, reason: NA
2020-01-31 12:04:47,168 ERROR indexer.IndexingJob - Indexer: java.lang.RuntimeException: Indexing job did not succeed, job status:FAILED, reason: NA
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:150)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:231)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:240)

Any ideas why the CSV index writer fails?

Regards,

1 Answer

According to the log, it is the Solr indexer that fails. It is the only active index writer, and it is not configured properly — which is clearly not what you intend, since you want to use the CSV indexer. Nutch index writers are pluggable: to activate the CSV indexer, you need to add the plugin to the property plugin.includes. This is usually done by editing the file conf/nutch-site.xml, where you need to modify or insert the following lines:
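A sketch of the required property, assuming the stock Nutch 1.16 plugin list — the change is swapping indexer-solr for indexer-csv; keep whatever other plugins (protocols, parsers, filters) your crawl actually relies on:

```xml
<!-- conf/nutch-site.xml: override plugin.includes so that the CSV
     index writer plugin (indexer-csv) is loaded instead of indexer-solr.
     The surrounding plugin list mirrors the nutch-default.xml defaults;
     adjust it to match your own setup. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-csv|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

With indexer-csv activated, the writer definition in index-writers.xml (id indexer_csv_1) is picked up, and the CSV output is written to the outpath configured there (csvindexwriter) instead of Nutch trying to reach a Solr server.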
