I am performing a rolling restart of a 4-node cluster running Cassandra 2.1.9. I stopped and started Cassandra on node 1 via "service cassandra stop/start", and saw nothing unusual in either system.log or cassandra.log. Running "nodetool status" from node 1 shows all four nodes up:
user@node001=> nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 192.168.187.121 538.95 GB 256 ? c99cf581-f4ae-4aa9-ab37-1a114ab2429b rack1
UN 192.168.187.122 630.72 GB 256 ? bfa07f47-7e37-42b4-9c0b-024b3c02e93f rack1
UN 192.168.187.123 572.73 GB 256 ? 273df9f3-e496-4c65-a1f2-325ed288a992 rack1
UN 192.168.187.124 625.05 GB 256 ? b8639cf1-5413-4ece-b882-2161bbb8a9c3 rack1
But running the same command from any of the other nodes shows node 1 as still being down:
user@node002=> nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
DN 192.168.187.121 538.94 GB 256 ? c99cf581-f4ae-4aa9-ab37-1a114ab2429b rack1
UN 192.168.187.122 630.72 GB 256 ? bfa07f47-7e37-42b4-9c0b-024b3c02e93f rack1
UN 192.168.187.123 572.73 GB 256 ? 273df9f3-e496-4c65-a1f2-325ed288a992 rack1
UN 192.168.187.124 625.04 GB 256 ? b8639cf1-5413-4ece-b882-2161bbb8a9c3 rack1
"nodetool compactionstats" shows no pending tasks, and "nodetool netstats" shows nothing unusual. More than 12 hours have passed and the inconsistency persists. Another example: when I run "nodetool gossipinfo" on the restarted node, its own status shows as NORMAL:
user@node001=> nodetool gossipinfo
/192.168.187.121
generation:1574364410
heartbeat:209150
NET_VERSION:8
RACK:rack1
STATUS:NORMAL,-104847506331695918
RELEASE_VERSION:2.1.9
SEVERITY:0.0
LOAD:5.78684155614E11
HOST_ID:c99cf581-f4ae-4aa9-ab37-1a114ab2429b
SCHEMA:fd2dcb4b-ca62-30df-b8f2-d3fd774f2801
DC:datacenter1
RPC_ADDRESS:192.168.185.121
But from another node, it shows node001's status as "shutdown":
user@node002=> nodetool gossipinfo
/192.168.187.121
generation:1491825076
heartbeat:2147483647
STATUS:shutdown,true
RACK:rack1
NET_VERSION:8
LOAD:5.78679987693E11
RELEASE_VERSION:2.1.9
DC:datacenter1
SCHEMA:fd2dcb4b-ca62-30df-b8f2-d3fd774f2801
HOST_ID:c99cf581-f4ae-4aa9-ab37-1a114ab2429b
RPC_ADDRESS:192.168.185.121
SEVERITY:0.0
Is there anything I can do to correct this state, so that I can continue with the rolling restart?
Here is what I eventually did to get the "bad" node back into the cluster and finish the rolling restart:
Perform a full shutdown
nodetool disablethrift
nodetool disablebinary
sleep 5
nodetool disablegossip
nodetool drain
sleep 10
/sbin/service cassandra restart
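Since the same drain-and-restart sequence is run twice in this procedure, it can be wrapped in a function. This is only a sketch; the `run` wrapper and `full_shutdown` name are my own additions (with a DRY_RUN mode for previewing the steps), not part of the original procedure:

```shell
# run CMD... : echo the command, then execute it unless DRY_RUN=1.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-0}" = "1" ]; then return 0; fi
  "$@"
}

# full_shutdown : disable client traffic and gossip, drain, then restart
# the service -- the same seven steps listed above.
full_shutdown() {
  run nodetool disablethrift
  run nodetool disablebinary
  run sleep 5
  run nodetool disablegossip
  run nodetool drain
  run sleep 10
  run /sbin/service cassandra restart
}
```

Running `DRY_RUN=1; full_shutdown` prints the seven steps without touching the node.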
Monitor for the node to come back up
until echo "SELECT * FROM system.peers LIMIT 1;" | cqlsh `hostname` > /dev/null 2>&1; do echo "Node is still DOWN"; sleep 10; done && echo "Node is now UP"
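The one-liner above polls forever if the node never comes back. A variant with a retry cap (the `wait_for_up` name and the default limits are my own, illustrative choices) avoids hanging a restart script:

```shell
# wait_for_up CMD... : run CMD until it succeeds, sleeping POLL_INTERVAL
# seconds between attempts and giving up after MAX_TRIES attempts.
# Returns 0 once CMD succeeds, 1 on timeout.
wait_for_up() {
  max="${MAX_TRIES:-90}"
  interval="${POLL_INTERVAL:-10}"
  n=0
  while [ "$n" -lt "$max" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "Node is now UP"
      return 0
    fi
    echo "Node is still DOWN"
    n=$((n + 1))
    sleep "$interval"
  done
  echo "Node did not come back after $max attempts"
  return 1
}
```

Usage with the same probe as above: `wait_for_up sh -c "echo 'SELECT * FROM system.peers LIMIT 1;' | cqlsh \$(hostname)"`.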
Remove the restarted node from the cluster
From another node in the cluster, run the following command:
nodetool removenode <host-id>
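Note that `removenode` takes the Host ID, not the address. It can be pulled out of the `nodetool status` output shown earlier; the `host_id_for` helper below is a hypothetical name of my own:

```shell
# host_id_for IP : read `nodetool status` output on stdin and print the
# Host ID column of the row whose Address matches IP.
# Row format: Status Address Load(value unit) Tokens Owns HostID Rack,
# so the Host ID is awk field $7.
host_id_for() {
  awk -v ip="$1" '$2 == ip { print $7 }'
}

# Example: nodetool status | host_id_for 192.168.187.121
```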
Perform a second full shutdown
nodetool disablethrift
nodetool disablebinary
sleep 5
nodetool disablegossip
nodetool drain
sleep 10
/sbin/service cassandra restart
Monitor for the node to come back up
until echo "SELECT * FROM system.peers LIMIT 1;" | cqlsh `hostname` > /dev/null 2>&1; do echo "Node is still DOWN"; sleep 10; done && echo "Node is now UP"
Confirm that the restarted node has rejoined the cluster
Tail the /var/log/cassandra/system.log file on one or more of the other nodes, looking for messages like these:
INFO [HANDSHAKE-/192.168.187.124] 2019-12-12 19:17:33,654 OutboundTcpConnection.java:485 - Handshaking version with /192.168.187.124
INFO [GossipStage:1] 2019-12-12 19:18:23,212 Gossiper.java:1019 - Node /192.168.187.124 is now part of the cluster
INFO [SharedPool-Worker-1] 2019-12-12 19:18:23,213 Gossiper.java:984 - InetAddress /192.168.187.124 is now UP
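The same check can be scripted by grepping the log for all three messages for the restarted node's address (the `rejoined` helper is illustrative, not part of nodetool):

```shell
# rejoined IP LOGFILE : succeed only if LOGFILE contains all three rejoin
# messages (handshake, gossip membership, marked UP) for the given address.
rejoined() {
  ip="$1"
  log="$2"
  grep -q "Handshaking version with /$ip" "$log" &&
  grep -q "Node /$ip is now part of the cluster" "$log" &&
  grep -q "InetAddress /$ip is now UP" "$log"
}

# Example: rejoined 192.168.187.124 /var/log/cassandra/system.log && echo "rejoined"
```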
Confirm that the expected number of nodes is in the cluster
The output of the following command should be identical on every node:
nodetool status
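One way to verify that automatically (the node hostnames and the use of ssh are assumptions about the environment): capture each node's output to a file, then check that all the files are identical:

```shell
# consistent FILE... : succeed only if every file has identical contents.
consistent() {
  first="$1"
  shift
  for f in "$@"; do
    diff -q "$first" "$f" >/dev/null || return 1
  done
}

# Example (hypothetical hostnames):
#   for h in node001 node002 node003 node004; do
#     ssh "$h" nodetool status > "status.$h"
#   done
#   consistent status.node* && echo "cluster view is consistent"
```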