Frequent write latency spikes in Cassandra

Question (votes: 1, answers: 1)

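In our production cluster, the cluster-level write latency frequently spikes from 7 ms to 4 s. As a result, clients see a large number of read and write timeouts. This recurs every few hours.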

Observations:
Cluster write latency (99th percentile) - 4 s
Local write latency (99th percentile) - 10 ms
Read and write consistency - LOCAL_ONE
Total nodes - 7

I enabled tracing for a few minutes with settraceprobability and saw that most of the time is spent in inter-node communication:

 session_id                           | event_id                             | activity                                                                                                                    | source        | source_elapsed | thread
--------------------------------------+--------------------------------------+-----------------------------------------------------------------------------------------------------------------------------+---------------+----------------+------------------------------------------
 4267dca2-bb79-11e8-aeca-439c84a4762c | 429c3314-bb79-11e8-aeca-439c84a4762c | Parsing  SELECT * FROM table1 WHERE uaid = '506a5f3b' AND messageid >= '01;'  | cassandranode3 |              7 |                     SharedPool-Worker-47
 4267dca2-bb79-11e8-aeca-439c84a4762c | 429c5a20-bb79-11e8-aeca-439c84a4762c |                                                                                                         Preparing statement | Cassandranode3 |             47 |                     SharedPool-Worker-47
 4267dca2-bb79-11e8-aeca-439c84a4762c | 429c5a21-bb79-11e8-aeca-439c84a4762c |                                                                                            reading data from /Cassandranode1 | Cassandranode3 |            121 |                     SharedPool-Worker-47
 4267dca2-bb79-11e8-aeca-439c84a4762c | 42a38610-bb79-11e8-aeca-439c84a4762c |                                                                       REQUEST_RESPONSE message received from /cassandranode1 | cassandranode3 |          40614 | MessagingService-Incoming-/Cassandranode1
 4267dca2-bb79-11e8-aeca-439c84a4762c | 42a38611-bb79-11e8-aeca-439c84a4762c |                                                                                     Processing response from /Cassandranode1 | Cassandranode3 |          40626 |                      SharedPool-Worker-5
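
To make the gap in the trace above concrete, here is a minimal Python sketch that subtracts consecutive source_elapsed values (reported by Cassandra tracing in microseconds). The numbers are copied from the five rows shown. The subtraction is only meaningful because every row has the same source node. The names TRACE, step_gaps and slowest_step are made up for this illustration.

```python
# source_elapsed values (microseconds) copied from the trace rows above.
# All five rows were recorded on the same source node, so the values
# share one clock and can be subtracted from each other.
TRACE = [
    ("Parsing SELECT", 7),
    ("Preparing statement", 47),
    ("reading data from /Cassandranode1", 121),
    ("REQUEST_RESPONSE message received from /cassandranode1", 40614),
    ("Processing response from /Cassandranode1", 40626),
]


def step_gaps(events):
    """Return (activity, microseconds since the previous event) pairs."""
    gaps = []
    previous = 0
    for activity, elapsed in events:
        gaps.append((activity, elapsed - previous))
        previous = elapsed
    return gaps


def slowest_step(events):
    """Return the (activity, gap) pair with the largest gap."""
    return max(step_gaps(events), key=lambda pair: pair[1])


if __name__ == "__main__":
    for activity, gap in step_gaps(TRACE):
        print(f"{gap:>8} us  {activity}")
```

For this trace, almost all of the roughly 40 ms passes between sending the read to the replica and receiving its REQUEST_RESPONSE. The coordinator cannot tell from this alone whether the time went to the network or to the replica itself being stalled.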

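I checked connectivity between the Cassandra nodes and did not find any problems. The Cassandra logs are flooded with read timeout exceptions, because this is a very busy cluster handling 30k reads/s and 10k writes/s.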

Warnings in system.log:

WARN  [SharedPool-Worker-28] 2018-09-19 01:39:16,999 SliceQueryFilter.java:320 - Read 122 live and 266 tombstone cells in system.schema_columns for key: system (see tombstone_warn_threshold). 2147483593 columns were requested, slices=[-]

During a spike the cluster effectively stalls, and even simple commands such as "use system_traces" fail.

cassandra@cqlsh:system_traces> select * from sessions ;
Warning: schema version mismatch detected, which might be caused by DOWN nodes; if this is not the case, check the schema versions of your nodes in system.local and system.peers.
Schema metadata was not refreshed. See log for details.

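I verified the schema version on all nodes and it is identical everywhere, but while the problem is occurring it looks as if Cassandra cannot even read its own metadata.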

Has anyone run into a similar problem? Any suggestions?

cassandra production-environment latency cassandra-2.1
1 Answer
2 votes

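(Based on the data in your comment above.) Long full GC pauses would definitely cause this. Add -XX:+DisableExplicitGC. You are getting full GCs because System.gc() is being called, most likely by the silly RMI DGC, which calls it periodically whether it is needed or not. With larger heaps that is very expensive. It is safe to disable.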

Check the header of your GC log and make sure the minimum heap size is not set. I would also recommend setting -XX:G1ReservePercent=20.
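
One way to check the System.gc() theory before changing any flags is to scan the GC log for full collections whose cause is an explicit call. The following is a minimal Python sketch, assuming JDK 8 style log lines of the form `[Full GC (System.gc())  ...M->...M(...M), N secs]`. The SAMPLE lines are illustrative, not taken from the poster's logs, and explicit_full_gcs is a hypothetical helper name. The pattern may need adjusting for other JVM versions or logging options.

```python
import re

# Assumed JDK 8 style full-GC line, for example:
#   ...: [Full GC (System.gc())  6144M->2048M(8192M), 3.9000000 secs]
# "cause" is the text inside the outer parentheses after "Full GC".
FULL_GC = re.compile(
    r"\[Full GC \((?P<cause>.*?)\)\s.*?(?P<secs>\d+\.\d+) secs\]"
)

# Illustrative lines only (NOT from the original poster's GC log).
SAMPLE = [
    "2018-09-19T01:39:12.345+0000: 86400.123: "
    "[Full GC (System.gc())  6144M->2048M(8192M), 3.9000000 secs]",
    "2018-09-19T01:40:02.000+0000: 86449.778: "
    "[GC pause (G1 Evacuation Pause) (young), 0.0450000 secs]",
    "2018-09-19T02:39:12.400+0000: 90000.178: "
    "[Full GC (Allocation Failure)  7000M->3000M(8192M), 2.5000000 secs]",
]


def explicit_full_gcs(lines, min_pause_secs=1.0):
    """Return pause times (seconds) of full GCs triggered by System.gc()."""
    pauses = []
    for line in lines:
        match = FULL_GC.search(line)
        if match is None:
            continue  # not a full-GC line
        if match.group("cause") != "System.gc()":
            continue  # full GC, but not an explicit one
        secs = float(match.group("secs"))
        if secs >= min_pause_secs:
            pauses.append(secs)
    return pauses


if __name__ == "__main__":
    for pause in explicit_full_gcs(SAMPLE):
        print(f"explicit full GC paused for {pause:.1f} s")
```

To run it against a real log, pass the open file object instead of SAMPLE, for example `explicit_full_gcs(open("gc.log"))`, where the path is a placeholder. Multi-second pauses that line up with the latency spikes would support adding -XX:+DisableExplicitGC.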
