Ignite Server pod cannot acquire file lock


I am debugging a problem with one of our Ignite servers, which had been running in Kubernetes for several months without any issues.

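Client nodes started failing to start because their readiness probe against the Ignite server could not succeed. When I looked at the Ignite server, I saw that its log was full of messages like this:

[PdsFoldersResolver] Unable to acquire lock to file [/ignite/work/db], reason: No locks available

These messages appear to have started at about the same time the client nodes began to fail.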

I tried scaling the Ignite server down and back up, but that did not fix the problem. The logs after the restart are below.

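Looking further at the contents of the /ignite/work/log directory, I found a large number of log file locks. Normally I only see a lock for the current log, not for previous ones.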

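My questions are:

  1. Why do the locks for old logs persist, and how can I prevent that from happening?
  2. How do I clean up the Ignite server? Scaling the pod down did not help. Would manually deleting the lock files fix the problem?

# ls -l /ignite/work/log/*.lck | wc -l
3846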

Edit: I also noticed that the number of lock files is far greater than the number of log files:

# ls -l ./*.log |wc -l
35

Example lock file:

-rw-r--r--    1 root     root             0 Nov 17 10:32 ignite-f543b237.0.log.lck
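
The two counts above can be cross-checked by listing the lock files that no longer have a matching log. The following is a minimal sketch, assuming each lock is named "<log file name>.lck" as in the example above; the function name orphan_lock_files is hypothetical, and the script only reports files, it does not delete anything.

```python
from pathlib import Path


def orphan_lock_files(log_dir):
    """Return names of *.lck files whose matching log file no longer exists.

    Assumes each lock is named "<log file name>.lck", for example
    ignite-f543b237.0.log.lck belongs to ignite-f543b237.0.log.
    """
    directory = Path(log_dir)
    orphans = []
    for lock in directory.glob("*.lck"):
        log_file = lock.with_suffix("")  # strip only the trailing ".lck"
        if not log_file.exists():
            orphans.append(lock.name)
    return sorted(orphans)
```

Calling orphan_lock_files("/ignite/work/log") on the pod would return the lock files left behind without a log, which may help decide whether they are leftovers from earlier runs.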

>>>    __________  ________________
>>>   /  _/ ___/ |/ /  _/_  __/ __/
>>>  _/ // (7 7    // /  / / / _/
>>> /___/\___/_/|_/___/ /_/ /___/
>>>
>>> ver. 2.11.1#20211220-sha1:eae1147d
>>> 2021 Copyright(C) Apache Software Foundation
>>>
>>> Ignite documentation: http://ignite.apache.org

[21:20:08,116][INFO][main][IgniteKernal] Config URL: file:/ignite/config/ignite_server_config.xml
[21:20:08,140][INFO][main][IgniteKernal] IgniteConfiguration [igniteInstanceName=null, pubPoolSize=8, svcPoolSize=8, callbackPoolSize=8, stripedPoolSize=8, sysPoolSize=8, mgmtPoolSize=4, dataStreamerPoolSize=8, utilityCachePoolSize=8, utilityCacheKeepAliveTime=60000, p2pPoolSize=2, qryPoolSize=8, buildIdxPoolSize=1, ign
iteHome=/opt/ignite/apache-ignite, igniteWorkDir=/ignite/work, mbeanSrv=com.sun.jmx.mbeanserver.JmxMBeanServer@6e38921c, nodeId=165f2068-01fd-4356-ae65-f199e33bf7cd, marsh=BinaryMarshaller [], marshLocJobs=false, daemon=false, p2pEnabled=false, netTimeout=5000, netCompressionLevel=1, sndRetryDelay=1000, sndRetryCnt=3, m
etricsHistSize=10000, metricsUpdateFreq=2000, metricsExpTime=9223372036854775807, discoSpi=TcpDiscoverySpi [addrRslvr=null, addressFilter=null, sockTimeout=0, ackTimeout=0, marsh=null, reconCnt=10, reconDelay=2000, maxAckTimeout=600000, soLinger=0, forceSrvMode=false, clientReconnectDisabled=false, internalLsnr=null, sk
ipAddrsRandomization=false], segPlc=STOP, segResolveAttempts=2, waitForSegOnStart=true, allResolversPassReq=true, segChkFreq=10000, commSpi=TcpCommunicationSpi [connectGate=org.apache.ignite.spi.communication.tcp.internal.ConnectGateway@6ad82709, ctxInitLatch=java.util.concurrent.CountDownLatch@510f3d34[Count = 1], stop
ping=false, clientPool=null, nioSrvWrapper=null, stateProvider=null], evtSpi=org.apache.ignite.spi.eventstorage.NoopEventStorageSpi@7817fd62, colSpi=NoopCollisionSpi [], deploySpi=LocalDeploymentSpi [], indexingSpi=org.apache.ignite.spi.indexing.noop.NoopIndexingSpi@24313fcc, addrRslvr=null, encryptionSpi=org.apache.ign
ite.spi.encryption.noop.NoopEncryptionSpi@7d20d0b, tracingSpi=org.apache.ignite.spi.tracing.NoopTracingSpi@77f1baf5, clientMode=false, rebalanceThreadPoolSize=1, rebalanceTimeout=10000, rebalanceBatchesPrefetchCnt=3, rebalanceThrottle=0, rebalanceBatchSize=524288, txCfg=TransactionConfiguration [txSerEnabled=false, dflt
Isolation=REPEATABLE_READ, dfltConcurrency=PESSIMISTIC, dfltTxTimeout=0, txTimeoutOnPartitionMapExchange=0, deadlockTimeout=10000, pessimisticTxLogSize=0, pessimisticTxLogLinger=10000, tmLookupClsName=null, txManagerFactory=null, useJtaSync=false], cacheSanityCheckEnabled=true, discoStartupDelay=60000, deployMode=SHARED
, p2pMissedCacheSize=100, locHost=null, timeSrvPortBase=31100, timeSrvPortRange=100, failureDetectionTimeout=10000, sysWorkerBlockedTimeout=null, clientFailureDetectionTimeout=30000, metricsLogFreq=60000, connectorCfg=ConnectorConfiguration [jettyPath=null, host=null, port=11211, noDelay=true, directBuf=false, sndBufSiz
e=32768, rcvBufSize=32768, idleQryCurTimeout=600000, idleQryCurCheckFreq=60000, sndQueueLimit=0, selectorCnt=1, idleTimeout=7000, sslEnabled=false, sslClientAuth=false, sslCtxFactory=null, sslFactory=null, portRange=100, threadPoolSize=8, msgInterceptor=null], odbcCfg=null, warmupClos=null, atomicCfg=AtomicConfiguration
 [seqReserveSize=1000, cacheMode=PARTITIONED, backups=1, aff=null, grpName=null], classLdr=null, sslCtxFactory=null, platformCfg=null, binaryCfg=null, memCfg=null, pstCfg=null, dsCfg=DataStorageConfiguration [sysRegionInitSize=41943040, sysRegionMaxSize=104857600, pageSize=0, concLvl=0, dfltDataRegConf=DataRegionConfigu
ration [name=default, maxSize=54081806336, initSize=268435456, swapPath=null, pageEvictionMode=DISABLED, pageReplacementMode=CLOCK, evictionThreshold=0.9, emptyPagesPoolSize=100, metricsEnabled=false, metricsSubIntervalCount=5, metricsRateTimeInterval=60000, persistenceEnabled=true, checkpointPageBufSize=0, lazyMemoryAl
location=true, warmUpCfg=null], dataRegions=null, storagePath=null, checkpointFreq=180000, lockWaitTime=10000, checkpointThreads=4, checkpointWriteOrder=SEQUENTIAL, walHistSize=20, maxWalArchiveSize=1073741824, walSegments=10, walSegmentSize=67108864, walPath=/ignite/wal, walArchivePath=/ignite/walarchive, metricsEnable
d=false, walMode=LOG_ONLY, walTlbSize=131072, walBuffSize=0, walFlushFreq=2000, walFsyncDelay=1000, walRecordIterBuffSize=67108864, alwaysWriteFullPages=false, fileIOFactory=org.apache.ignite.internal.processors.cache.persistence.file.AsyncFileIOFactory@66d18979, metricsSubIntervalCnt=5, metricsRateTimeInterval=60000, w
alAutoArchiveAfterInactivity=-1, writeThrottlingEnabled=false, walCompactionEnabled=false, walCompactionLevel=1, checkpointReadLockTimeout=null, walPageCompression=DISABLED, walPageCompressionLevel=null, dfltWarmUpCfg=null, encCfg=org.apache.ignite.configuration.EncryptionConfiguration@609cd4d8, defragmentationThreadPoo
lSize=4], snapshotPath=snapshots, activeOnStart=true, activeOnStartPropSetFlag=false, autoActivation=true, autoActivationPropSetFlag=false, clusterStateOnStart=null, sqlConnCfg=null, cliConnCfg=ClientConnectorConfiguration [host=null, port=10800, portRange=100, sockSndBufSize=0, sockRcvBufSize=0, tcpNoDelay=true, maxOpe
nCursorsPerConn=128, threadPoolSize=8, selectorCnt=4, idleTimeout=0, handshakeTimeout=10000, jdbcEnabled=true, odbcEnabled=true, thinCliEnabled=true, sslEnabled=false, useIgniteSslCtxFactory=true, sslClientAuth=false, sslCtxFactory=null, thinCliCfg=ThinClientConfiguration [maxActiveTxPerConn=100, maxActiveComputeTasksPe
rConn=0]], mvccVacuumThreadCnt=2, mvccVacuumFreq=5000, authEnabled=false, failureHnd=null, commFailureRslvr=null, sqlCfg=SqlConfiguration [longQryWarnTimeout=3000, dfltQryTimeout=0, sqlQryHistSize=1000, validationEnabled=false], asyncContinuationExecutor=null]
[21:20:08,141][INFO][main][IgniteKernal] Daemon mode: off
[21:20:08,141][INFO][main][IgniteKernal] OS: Linux 5.15.0-58-generic amd64
[21:20:08,141][INFO][main][IgniteKernal] OS user: root
[21:20:08,144][INFO][main][IgniteKernal] PID: 1
[21:20:08,144][INFO][main][IgniteKernal] Language runtime: Java Platform API Specification ver. 1.8
[21:20:08,144][INFO][main][IgniteKernal] VM information: OpenJDK Runtime Environment 1.8.0_212-b04 IcedTea OpenJDK 64-Bit Server VM 25.212-b04
[21:20:08,147][INFO][main][IgniteKernal] VM total memory: 4.0GB
[21:20:08,148][INFO][main][IgniteKernal] Remote Management [restart: off, REST: on, JMX (remote: off)]
[21:20:08,148][INFO][main][IgniteKernal] Logger: JavaLogger [quiet=true, config=null]
[21:20:08,148][INFO][main][IgniteKernal] IGNITE_HOME=/opt/ignite/apache-ignite
[21:20:08,148][INFO][main][IgniteKernal] VM arguments: [-XX:+AggressiveOpts, -DIGNITE_WAL_MMAP=false, -DIGNITE_UPDATE_NOTIFIER=false, -XX:+UseG1GC, -Xmx4g, -XX:+DisableExplicitGC, -Xms4g, -XX:+AlwaysPreTouch, -XX:+ScavengeBeforeFullGC, -DIGNITE_HOME=/opt/ignite/apache-ignite]
[21:20:08,149][INFO][main][IgniteKernal] System cache's DataRegion size is configured to 40 MB. Use DataStorageConfiguration.systemRegionInitialSize property to change the setting.
[21:20:08,149][INFO][main][IgniteKernal] Configured caches [in 'sysMemPlc' dataRegion: ['ignite-sys-cache']]
[21:20:08,149][WARNING][main][IgniteKernal] Please set system property '-Djava.net.preferIPv4Stack=true' to avoid possible problems in mixed environments.
[21:20:08,153][INFO][main][IgniteKernal] 3-rd party licenses can be found at: /opt/ignite/apache-ignite/libs/licenses
[21:20:08,315][INFO][main][IgnitePluginProcessor] Configured plugins:
[21:20:08,315][INFO][main][IgnitePluginProcessor]   ^-- None
[21:20:08,315][INFO][main][IgnitePluginProcessor]
[21:20:08,324][INFO][main][FailureProcessor] Configured failure handler: [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]]]
[21:20:08,770][INFO][main][TcpCommunicationSpi] Successfully bound communication NIO server to TCP port [port=47100, locHost=0.0.0.0/0.0.0.0, selectorsCnt=4, selectorSpins=0, pairedConn=false]
[21:20:08,770][WARNING][main][TcpCommunicationSpi] Message queue limit is set to 0 which may lead to potential OOMEs when running cache operations in FULL_ASYNC or PRIMARY_SYNC modes due to message queues growth on sender and receiver sides.
[21:20:08,785][WARNING][main][NoopCheckpointSpi] Checkpoints are disabled (to enable configure any GridCheckpointSpi implementation)
[21:20:08,832][WARNING][main][GridCollisionManager] Collision resolution is disabled (all jobs will be activated upon arrival).
[21:20:08,833][INFO][main][IgniteKernal] Security status [authentication=off, sandbox=off, tls/ssl=off]
[21:20:08,908][INFO][main][TcpDiscoverySpi] Successfully bound to TCP port [port=47500, localHost=0.0.0.0/0.0.0.0, locNodeId=165f2068-01fd-4356-ae65-f199e33bf7cd]
[22:01:05,356][INFO][main][PdsFoldersResolver] Unable to acquire lock to file [/ignite/work/db/node00-2fc48c4c-2bf8-44e2-a4c3-ca22a0001fa2], reason: No locks available
[00:10:06,798][INFO][main][PdsFoldersResolver] Unable to acquire lock to file [/ignite/work/db], reason: No locks available
[00:34:41,360][INFO][main][PdsFoldersResolver] Unable to acquire lock to file [/ignite/work/db], reason: No locks available
[03:01:07,293][INFO][main][PdsFoldersResolver] Unable to acquire lock to file [/ignite/work/db], reason: No locks available
[03:25:41,852][INFO][main][PdsFoldersResolver] Unable to acquire lock to file [/ignite/work/db], reason: No locks available
[05:52:07,788][INFO][main][PdsFoldersResolver] Unable to acquire lock to file [/ignite/work/db], reason: No locks available
[06:16:42,348][INFO][main][PdsFoldersResolver] Unable to acquire lock to file [/ignite/work/db], reason: No locks available
[08:43:08,270][INFO][main][PdsFoldersResolver] Unable to acquire lock to file [/ignite/work/db], reason: No locks available
[09:07:42,832][INFO][main][PdsFoldersResolver] Unable to acquire lock to file [/ignite/work/db], reason: No locks available
[11:34:08,764][INFO][main][PdsFoldersResolver] Unable to acquire lock to file [/ignite/work/db], reason: No locks available
[11:58:43,329][INFO][main][PdsFoldersResolver] Unable to acquire lock to file [/ignite/work/db], reason: No locks available
[14:25:09,266][INFO][main][PdsFoldersResolver] Unable to acquire lock to file [/ignite/work/db], reason: No locks available
[14:49:43,829][INFO][main][PdsFoldersResolver] Unable to acquire lock to file [/ignite/work/db], reason: No locks available

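Edit:

Even after scaling the Ignite server up and down, I no longer see any log entries indicating that the server is starting, only more entries saying that it cannot acquire the lock.

I tried starting the server manually on the pod, but it just hangs at startup and makes no progress: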

./ignite.sh ../config/nkw-ignite.xml -v
Ignite Command Line Startup, ver. 2.11.1#20211220-sha1:eae1147d
2021 Copyright(C) Apache Software Foundation
java linux-kernel ignite alpine-linux file-locking
2 Answers
2 votes

First, according to your log, the node was started in server mode:

clientMode=false

So it looks like the following message:

Unable to acquire lock to file ...

and the ignite-xxx.0.log.lck files are two different things.

According to the first message:

[22:01:05,356][INFO][main][PdsFoldersResolver] Unable to acquire lock to file 
[/ignite/work/db/node00-2fc48c4c-2bf8-44e2-a4c3-ca22a0001fa2], 
reason: No locks available

Ignite tried to acquire an exclusive lock on /ignite/work/db/node00-2fc48c4c-2bf8-44e2-a4c3-ca22a0001fa2, because the current instance was started in server mode, but failed.
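
To see what such a failure looks like outside Ignite, the sketch below tries to take a non-blocking exclusive lock on a file and reports the error name when it cannot. It is only an illustration under assumptions: it presumes the lock in question is an ordinary POSIX advisory lock, the function name try_exclusive_lock is hypothetical, and it should be pointed at a scratch file on the same volume rather than at Ignite's own files.

```python
import errno
import fcntl
import os


def try_exclusive_lock(path):
    """Attempt a non-blocking exclusive POSIX lock on `path`.

    Returns (True, None) on success, or (False, errno name) on failure.
    The file is created if missing, and the lock is released before returning.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        try:
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as exc:
            # EAGAIN / EACCES: another process currently holds the lock.
            # ENOLCK: the system could not grant any lock at all.
            return False, errno.errorcode.get(exc.errno, str(exc.errno))
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return True, None
    finally:
        os.close(fd)
```

A result of (False, 'ENOLCK') would correspond to the "No locks available" reason in the log, while EAGAIN or EACCES would suggest that another process holds the lock.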

This PDS folder is most likely already in use by a different instance. Is the same persistent volume shared between different pods?

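So first try to determine whether the volume is shared. If it is, try to find the other instance that may be holding the lock on this directory. You should find a file named lock in the /$IGNITE_WORK/db/$NODE_CONSISTENT_ID folder, with content like this: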

[d55a32ec-6929-4a60-a9d0-2de33afd7007][]

If you find this file but no other running node has this ID, you can probably delete the file and try restarting the problematic node.
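
The check described above can be sketched as a small helper. This is a hedged illustration, not Ignite's own logic: it assumes the first bracketed value in the lock file is the ID of the node that wrote it (the meaning of the second, empty pair of brackets is not covered here), and the names lock_owner_id and looks_stale are hypothetical. The list of running node IDs has to be collected separately, for example from the logs of the other pods.

```python
import re

# Matches the first "[...]" group at the start of the lock file content.
_FIRST_BRACKET = re.compile(r"^\s*\[([^\]]*)\]")


def lock_owner_id(lock_text):
    """Return the value inside the first [...] group, or None if absent/empty."""
    match = _FIRST_BRACKET.match(lock_text)
    if match is None or not match.group(1):
        return None
    return match.group(1)


def looks_stale(lock_text, running_node_ids):
    """True if the lock names an ID that is not among the running nodes."""
    owner = lock_owner_id(lock_text)
    return owner is not None and owner not in set(running_node_ids)
```

If looks_stale returns True for the content of the lock file, that is a hint, not proof, that the file was left behind by a node that is no longer running.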


0 votes

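If the file system on which Ignite is trying to create its lock file is an NFS v3 mount on Unix/Linux, the problem may be caused at the operating-system level. NFS v3 requires the statd and lockd processes to be up and running in order to support file locking.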

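We recently ran into the same problem: after a machine reboot, some other process grabbed statd's port, so statd did not start properly, and Ignite could not lock its files at startup and refused to start.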

Ask your system administrator to fix the statd / lockd processes, or switch to NFS v4, which does not need these processes to be running.
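
Whether the work directory actually sits on an NFS mount can be checked from the mount table. The following is a rough sketch, assuming a Linux system where /proc/mounts is readable; the function name fs_type_for is hypothetical, and mount points containing spaces (which /proc/mounts escapes) are not handled.

```python
import os


def fs_type_for(path, mounts_text):
    """Return (mount_point, fs_type, options) for the mount containing `path`.

    `mounts_text` is the content of /proc/mounts: one mount per line with
    whitespace-separated fields (device, mount point, type, options, ...).
    The longest matching mount point wins; returns None if nothing matches.
    """
    path = os.path.normpath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        mount_point, fs_type, options = fields[1], fields[2], fields[3]
        prefix = mount_point.rstrip("/") + "/"
        if (path + "/").startswith(prefix):
            # ">=" lets a later mount on the same point shadow an earlier one.
            if best is None or len(mount_point) >= len(best[0]):
                best = (mount_point, fs_type, options)
    return best
```

For example, fs_type_for("/ignite/work/db", open("/proc/mounts").read()) returns the mount point, type, and options for that directory. A type of nfs together with a vers=3 option would point toward the statd / lockd explanation above.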
