Apache Nutch 1.15 / Solr 7.7.0 indexing fails: DocValuesField is too large, must be <= 32766


I am trying to crawl some websites with Apache Nutch 1.15 and index them for search with Solr 7.7.0, following this tutorial: https://wiki.apache.org/nutch/NutchTutorial. I am using cygwin64 on Windows 10.

Every time I run a command I get the message below (I did some research and it seems it cannot be resolved; am I right about that?), but apart from that everything appears to work.

    WARNING: An illegal reflective access operation has occurred
    WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/C:/cygwin64/home/apache-nutch-1.15/lib/hadoop-auth-2.7.4.jar) to method sun.security.krb5.Config.getInstance()
    WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
    WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
    WARNING: All illegal access operations will be denied in a future release

The problem occurs when I try to index with this command:

$ bin/nutch solrindex crawl/crawldb crawl/linkdb crawl/segments/*

I get this error message:

Segment dir is complete: crawl/segments/20190218180046.
Segment dir is complete: crawl/segments/20190218180429.
Segment dir is complete: crawl/segments/20190218180720.
Segment dir is complete: crawl/segments/20190219113805.
Indexer: starting at 2019-02-19 16:18:44
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
No exchange was configured. The documents will be routed to all index writers.
Active IndexWriters :
SOLRIndexWriter
    type : Type of the server. Can be: "cloud", "concurrent", "http" or "lb"
    url : URL of the SOLR instance or URL of the Zookeeper quorum
    commitSize : buffer size when sending to SOLR (default 1000)
    auth : use authentication (default false)
    username : username for authentication
    password : password for authentication


Indexing 591/591 documents
Deleting 0 documents
Indexing job did not succeed, job status:FAILED, reason: NA
Indexer: java.lang.RuntimeException: Indexing job did not succeed, job status:FAILED, reason: NA
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:152)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:235)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:244)

In Solr's log file I found this error:

2019-02-19 16:18:51.668 ERROR (qtp2031588185-21) [   x:nutch] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Exception writing document id http://apache.org/foundation/records/minutes/2010/board_minutes_2010_06_16.txt to the index; possible analysis error: DocValuesField "content_str" is too large, must be <= 32766
    at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:243)
    at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:67)
(…)
Caused by: java.lang.IllegalArgumentException: DocValuesField "content_str" is too large, must be <= 32766
    at org.apache.lucene.index.SortedSetDocValuesWriter.addValue(SortedSetDocValuesWriter.java:82)
    at org.apache.lucene.index.DefaultIndexingChain.indexDocValue(DefaultIndexingChain.java:616)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:235)
    ... 71 more

2019-02-19 16:19:06.612 INFO  (commitScheduler-13-thread-5) [   ] o.a.s.u.DirectUpdateHandler2 start commit{,optimize=false,openSearcher=false,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
2019-02-19 16:19:06.612 INFO  (commitScheduler-13-thread-5) [   ] o.a.s.u.SolrIndexWriter Calling setCommitData with IW:org.apache.solr.update.SolrIndexWriter@2d18ed13 commitCommandVersion:0
2019-02-19 16:19:06.671 INFO  (commitScheduler-13-thread-5) [   ] o.a.s.s.SolrIndexSearcher Opening [Searcher@28296264[nutch] realtime]
2019-02-19 16:19:06.676 INFO  (commitScheduler-13-thread-5) [   ] o.a.s.u.DirectUpdateHandler2 end_commit_flush

I have not been able to find a solution to this problem. Can anyone help me? Let me know if you need more information.

Thanks

java indexing solr nutch
1 Answer

This error does not come from Nutch; it is thrown on the Solr side. What makes it more suspicious is that Nutch does not directly provide a content_str field. Try setting docValues to false in the fieldType (or field) definition for this field. Doc values offer certain benefits, but storing large amounts of data in them (such as 32k values) has negative performance implications in the long run. You should use a stored field instead, or change this field's type to one that is tokenized.
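As a sketch of what that schema change could look like: in the default Solr 7.x configset, `*_str` fields are typically produced by a dynamic-field rule with docValues enabled, and Lucene caps each docValues term at 32766 bytes. The exact rule and copyField below are assumptions based on the default managed-schema; verify the names against your own schema before editing:

```xml
<!-- Hypothetical excerpt from Solr's managed-schema (default 7.x configset).
     Check your own schema for the actual rule that creates content_str. -->

<!-- Typical default: *_str values are kept as docValues, which hit the
     32766-byte-per-value limit on large page content. -->
<dynamicField name="*_str" type="strings" stored="false" indexed="false"
              docValues="true" useDocValuesAsStored="false"/>

<!-- Option 1 (as suggested above): disable docValues and store the value
     instead, so large documents no longer hit the cap. -->
<dynamicField name="*_str" type="strings" stored="true" indexed="false"
              docValues="false"/>

<!-- Option 2: keep docValues but truncate the copied value below the limit
     (source/dest names here are illustrative). -->
<copyField source="content" dest="content_str" maxChars="32000"/>
```

After changing the schema, reload the core (or restart Solr) and re-run the `bin/nutch solrindex` command; existing documents are unaffected until they are re-indexed.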

More details on doc values can be found in the Solr Reference Guide.
