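I'm trying to crawl some websites with Apache Nutch 1.15 and index them for searching with Solr 7.7.0, following this tutorial: https://wiki.apache.org/nutch/NutchTutorial. I'm using cygwin64 on Windows 10.
Every time I run a command I get the message below (I did some research and it seems it can't be fixed, am I right?), but apart from that everything seems to work.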
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by
org.apache.hadoop.security.authentication.util.KerberosUtil
(file:/C:/cygwin64/home/apache-nutch-1.15/lib/hadoop-auth-2.7.4.jar) to
method sun.security.krb5.Config.getInstance()
WARNING: Please consider reporting this to the maintainers of
org.apache.hadoop.security.authentication.util.KerberosUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal
reflective access operations
WARNING: All illegal access operations will be denied in a future
release
The problem occurs when I try to index with this command:
$ bin/nutch solrindex crawl/crawldb crawl/linkdb crawl/segments/*
I get this error message:
Segment dir is complete: crawl/segments/20190218180046.
Segment dir is complete: crawl/segments/20190218180429.
Segment dir is complete: crawl/segments/20190218180720.
Segment dir is complete: crawl/segments/20190219113805.
Indexer: starting at 2019-02-19 16:18:44
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
No exchange was configured. The documents will be routed to all index
writers.
Active IndexWriters :
SOLRIndexWriter
type : Type of the server. Can be: "cloud", "concurrent", "http" or "lb"
url : URL of the SOLR instance or URL of the Zookeeper quorum
commitSize : buffer size when sending to SOLR (default 1000)
auth : use authentication (default false)
username : username for authentication
password : password for authentication
Indexing 591/591 documents
Deleting 0 documents
Indexing job did not succeed, job status:FAILED, reason: NA
Indexer: java.lang.RuntimeException: Indexing job did not succeed, job
status:FAILED, reason: NA
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:152)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:235)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:244)
In Solr's log file I found this error:
2019-02-19 16:18:51.668 ERROR (qtp2031588185-21) [ x:nutch]
o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Exception
writing document id http://apache.org/foundation/records/minutes/2010/board_minutes_2010_06_16.txt
to the index; possible analysis error: DocValuesField "content_str" is too
large, must be <= 32766
at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:243)
at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:67)
(…)
Caused by: java.lang.IllegalArgumentException: DocValuesField "content_str"
is too large, must be <= 32766
at org.apache.lucene.index.SortedSetDocValuesWriter.addValue(SortedSetDocValuesWriter.java:82)
at org.apache.lucene.index.DefaultIndexingChain.indexDocValue(DefaultIndexingChain.java:616)
at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:235)
... 71 more
2019-02-19 16:19:06.612 INFO (commitScheduler-13-thread-5) [ ]
o.a.s.u.DirectUpdateHandler2 start
commit{,optimize=false,openSearcher=false,waitSearcher=true,
expungeDeletes=false,softCommit=false,prepareCommit=false}
2019-02-19 16:19:06.612 INFO (commitScheduler-13-thread-5) [ ]
o.a.s.u.SolrIndexWriter Calling setCommitData with
IW:org.apache.solr.update.SolrIndexWriter@2d18ed13 commitCommandVersion:0
2019-02-19 16:19:06.671 INFO (commitScheduler-13-thread-5) [ ]
o.a.s.s.SolrIndexSearcher Opening [Searcher@28296264[nutch] realtime]
2019-02-19 16:19:06.676 INFO (commitScheduler-13-thread-5) [ ]
o.a.s.u.DirectUpdateHandler2 end_commit_flush
I couldn't find any solution to this problem. Can someone help me? If you need more information, please let me know.
Thanks
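This error is not coming from Nutch; it is thrown on the Solr side. What is more suspicious is that Nutch does not provide a content_str field out of the box. Try setting docValues to false on the field or on its fieldType. Doc values provide certain benefits, but storing large values in them (such as 32k) will hurt performance in the long run. You should use a stored field instead, or change the field's type to one that is tokenized.
More details about doc values can be found in the Solr Reference Guide.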
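As a rough illustration of the suggestion above, here is a minimal Python sketch that turns off docValues for content_str through Solr's Schema API. It rests on several assumptions that are not confirmed by the question: Solr is reachable at localhost:8983, the core is called nutch (the log's "x:nutch" hints at this), and the core uses a managed schema. The helper names build_field_command and post_schema_command are made up for this example.

```python
import json
import urllib.request

# Assumptions (not from the original post): Solr listens on localhost:8983,
# the core is named "nutch", and the core uses a managed schema so that the
# Schema API is allowed to modify it.
SOLR_SCHEMA_URL = "http://localhost:8983/solr/nutch/schema"


def build_field_command(name, field_type="string", action="replace-field"):
    """Build a Schema API command that defines `name` without docValues.

    `action` is a Schema API command name such as "replace-field" or
    "add-field"; which one applies depends on how the field is currently
    defined in the schema.
    """
    return {
        action: {
            "name": name,
            "type": field_type,
            "stored": True,      # keep the raw value retrievable
            "indexed": False,    # an indexed string is one term with its own size limit
            "docValues": False,  # the setting the answer suggests turning off
        }
    }


def post_schema_command(command, url=SOLR_SCHEMA_URL):
    """POST the command to Solr and return the parsed JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(command).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling post_schema_command(build_field_command("content_str")) would send the change. If content_str only exists through a dynamic-field rule rather than as an explicit field, replace-field will probably be rejected, and passing action="add-field" may be the right choice instead. Either way, the affected documents need to be reindexed afterwards.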