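We are trying to index around 5 billion records stored in HDFS (Parquet files) into a Solr collection. We are using Solr 7.2.1. We spun up an EMR cluster with 7 data nodes (16 vCores and 128 GB each) and use the same HDFS nodes to store the Solr data (the Solr collection was created with 7 shards). After running for a while, we saw indexing slow down and some tasks start to fail because of 2 main errors (listed below). After indexing roughly 550 million records, the job failed because some tasks had failed too many times.
Error 1: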
scala.MatchError: java.net.SocketTimeoutException: Read timed out (of class java.net.SocketTimeoutException)
at com.lucidworks.spark.util.SolrSupport$.sendBatchToSolrWithRetry(SolrSupport.scala:358)
at com.lucidworks.spark.util.SolrSupport$$anonfun$indexDocs$1.apply(SolrSupport.scala:333)
at com.lucidworks.spark.util.SolrSupport$$anonfun$indexDocs$1.apply(SolrSupport.scala:322)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Error 2:
scala.MatchError: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://solr1:8983/solr/connection-2637-config6: 2 Async exceptions during distributed update:
Broken pipe (Write failed)
Broken pipe (Write failed) (of class org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException)
at com.lucidworks.spark.util.SolrSupport$.sendBatchToSolrWithRetry(SolrSupport.scala:358)
at com.lucidworks.spark.util.SolrSupport$$anonfun$indexDocs$1.apply(SolrSupport.scala:333)
at com.lucidworks.spark.util.SolrSupport$$anonfun$indexDocs$1.apply(SolrSupport.scala:322)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
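We then tried splitting the large file into files of 300 million records each and indexing them with some cool-down periods in between. We found that each subsequent job took longer (the first 300 million took 3 hours, the second 300 million took 4.5 hours, the third took 6 hours, and so on). Now that we have indexed more than 2 billion records, we still get a large number of socket timeout exceptions and the job fails, even after waiting 5-6 hours. We believe this may be happening because index merging continues in the background even after the job shows as completed (we have seen the number of files in the individual core directories decrease, so this is our guess). Is this true? Either way, is there any way to resolve this and improve performance?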
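To put the slowdown in numbers, here is a small sketch that converts the batch times reported above into indexing throughput. Python is used only for the arithmetic; the function and variable names are illustrative and not part of our job.

```python
# Batch size and wall-clock times as reported in the post.
RECORDS_PER_BATCH = 300_000_000
batch_hours = [3.0, 4.5, 6.0]  # first, second, third batch


def docs_per_second(records: int, hours: float) -> float:
    """Average indexing throughput for one batch."""
    return records / (hours * 3600)


rates = [docs_per_second(RECORDS_PER_BATCH, h) for h in batch_hours]

for i, (hours, rate) in enumerate(zip(batch_hours, rates), start=1):
    # Show each batch's rate and how it compares to the first batch.
    print(f"batch {i}: {hours} h -> {rate:,.0f} docs/s "
          f"({rate / rates[0]:.0%} of batch 1)")
```

By this measure the third batch runs at half the rate of the first, which is consistent with the cluster doing more work per document as the index grows.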
These are some of the other Solr configurations we have set.
SOLR_JAVA_MEM="-Xms30g -Xmx30g"
GC_TUNE="-XX:+UseG1GC -XX:+PerfDisableSharedMem -XX:+ParallelRefProcEnabled -XX:G1HeapRegionSize=8m -XX:MaxGCPauseMillis=200 -XX:InitiatingHeapOccupancyPercent=75 -XX:+UseLargePages -XX:+AggressiveOpts -XX:MaxDirectMemorySize=20g"
solrconfig.xml
<bool name="solr.hdfs.blockcache.enabled">true</bool>
<int name="solr.hdfs.blockcache.slab.count">1</int>
<bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
<int name="solr.hdfs.blockcache.blocksperbank">16384</int>
<bool name="solr.hdfs.blockcache.read.enabled">true</bool>
<bool name="solr.hdfs.nrtcachingdirectory.enable">true</bool>
<int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">16</int>
<int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">192</int>
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
<int name="maxMergeAtOnce">25</int>
<int name="segmentsPerTier">25</int>
<double name="noCFSRatio">0.1</double>
</mergePolicyFactory>
<autoCommit>
<maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
<openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
<maxTime>${solr.autoSoftCommit.maxTime:1200000}</maxTime>
</autoSoftCommit>
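As a sanity check on the memory settings above, the following sketch works out how large the HDFS block cache is with these values. It assumes the block cache uses 8 KB blocks (our understanding of the Solr default, not something set in our config); the variable names are only for illustration.

```python
# Values taken from the settings listed above.
BLOCKS_PER_BANK = 16384        # solr.hdfs.blockcache.blocksperbank
SLAB_COUNT = 1                 # solr.hdfs.blockcache.slab.count
MAX_DIRECT_MEMORY_GIB = 20     # -XX:MaxDirectMemorySize=20g

# Assumption: each cache block is 8 KB.
BLOCK_SIZE_BYTES = 8 * 1024

slab_bytes = BLOCKS_PER_BANK * BLOCK_SIZE_BYTES
slab_mib = slab_bytes / 2**20
cache_mib = SLAB_COUNT * slab_mib

# How many slabs of this size would fit under the direct memory cap.
max_slabs = int(MAX_DIRECT_MEMORY_GIB * 1024 // slab_mib)

print(f"one slab: {slab_mib:.0f} MiB")
print(f"configured block cache: {cache_mib:.0f} MiB")
print(f"slabs that fit in {MAX_DIRECT_MEMORY_GIB} GiB direct memory: {max_slabs}")
```

Under that assumption, a slab count of 1 gives a block cache much smaller than the 20g direct memory limit, so the two settings may be worth reviewing together.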
Any help is much appreciated.
*Checked the executor logs and found a more detailed stack trace.
21/08/31 11:35:53 ERROR BaseCloudSolrClient: Request to collection [connection-2637-config6-65536] failed due to (0) java.net.SocketTimeoutException: Read timed out, retry=0 commError=false errorCode=0
21/08/31 11:35:53 INFO BaseCloudSolrClient: request was not communication error it seems
21/08/31 11:35:53 ERROR SolrSupport$: Send batch to collection connection-2637-config6-65536 failed due to: org.apache.solr.client.solrj.SolrServerException: Timeout occurred while waiting response from server at: http://solr4:8983/solr/connection-2637-config6-65536
org.apache.solr.client.solrj.SolrServerException: Timeout occurred while waiting response from server at: http://solr4:8983/solr/connection-2637-config6-65536
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:676)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:265)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:248)
at org.apache.solr.client.solrj.impl.LBSolrClient.doRequest(LBSolrClient.java:368)
at org.apache.solr.client.solrj.impl.LBSolrClient.request(LBSolrClient.java:296)
at org.apache.solr.client.solrj.impl.BaseCloudSolrClient.sendRequest(BaseCloudSolrClient.java:1128)
at org.apache.solr.client.solrj.impl.BaseCloudSolrClient.requestWithRetryOnStaleState(BaseCloudSolrClient.java:897)
at org.apache.solr.client.solrj.impl.BaseCloudSolrClient.request(BaseCloudSolrClient.java:829)
at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1290)
at com.lucidworks.spark.util.SolrSupport$.sendBatchToSolr(SolrSupport.scala:389)
at com.lucidworks.spark.util.SolrSupport$.sendBatchToSolrWithRetry(SolrSupport.scala:354)
at com.lucidworks.spark.util.SolrSupport$$anonfun$indexDocs$1.apply(SolrSupport.scala:333)
at com.lucidworks.spark.util.SolrSupport$$anonfun$indexDocs$1.apply(SolrSupport.scala:322)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at shaded.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137)
at shaded.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153)
at shaded.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:282)
at shaded.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138)
at shaded.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56)
at shaded.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259)
at shaded.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163)
at shaded.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:165)
at shaded.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273)
at shaded.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125)
at shaded.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272)
at shaded.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
at shaded.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
at shaded.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
at shaded.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
at shaded.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
at shaded.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:564)
... 24 more
21/08/31 11:35:53 ERROR Executor: Exception in task 75.0 in stage 1.0 (TID 77)
scala.MatchError: java.net.SocketTimeoutException: Read timed out (of class java.net.SocketTimeoutException)
at com.lucidworks.spark.util.SolrSupport$.sendBatchToSolrWithRetry(SolrSupport.scala:358)
at com.lucidworks.spark.util.SolrSupport$$anonfun$indexDocs$1.apply(SolrSupport.scala:333)
at com.lucidworks.spark.util.SolrSupport$$anonfun$indexDocs$1.apply(SolrSupport.scala:322)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Were you able to resolve this? I am facing the same problem while editing around 1 billion records. If you managed to solve it, please let me know.