ClickHouse ZooKeeper failures - apache-zookeeper

There are large snapshot files that fail to load when the ZooKeeper pods are started. The logs show the following:
java.io.IOException: Unreasonable length = 2883600
at org.apache.jute.BinaryInputArchive.checkLength(BinaryInputArchive.java:166)
at org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:127)
at org.apache.zookeeper.server.persistence.Util.readTxnBytes(Util.java:159)
at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:768)
at org.apache.zookeeper.server.persistence.FileTxnLog.getLastLoggedZxid(FileTxnLog.java:362)
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:277)
at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:285)
at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:1090)
at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:1075)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:227)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:136)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:90)
2022-09-19 21:52:27,658 [myid:2] - ERROR [main:QuorumPeer#1136] - Unable to load database on disk
java.io.IOException: No snapshot found, but there are log entries. Something is broken!
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:281)
at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:285)
at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:1090)
at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:1075)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:227)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:136)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:90)
Has anyone faced this issue while using ClickHouse with replication? We hadn't seen it in months.
We are using:
ZooKeeper v3.6.1
ClickHouse 21.11.10.1
Any help is appreciated.

Related

Unable to load database on disk

I got this error after stopping my ZooKeeper instance, copying all of its data to another path, and changing dataDir=/data/zookeeper-data in zookeeper.properties.
ERROR Unable to load database on disk (org.apache.zookeeper.server.quorum.QuorumPeer)
java.io.IOException: Unreasonable length = 198238896
at org.apache.jute.BinaryInputArchive.checkLength(BinaryInputArchive.java:127)
at org.apache.jute.BinaryInputArchive.readBuffer(BinaryInputArchive.java:92)
at org.apache.zookeeper.server.persistence.Util.readTxnBytes(Util.java:233)
at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:629)
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:166)
at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223)
at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:601)
at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:591)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:164)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:111)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)
It seemed that some snapshot files under /opt/confluent/zookeeper/data/version-2 were corrupted, or that the process lacked permission: when I used systemctl start confluent-zookeeper I got this error, but when I started ZooKeeper manually I did not have the problem.
The cause was that the systemd unit wanted to write logs into a path owned by root:root, for which it had no permission. After running chown -R kafka:kafka dir/to/log/path and fixing the permissions, the problem was solved.
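A minimal sketch of that fix, assuming the service runs as the kafka user and keeping the placeholder path from above (replace dir/to/log/path with your actual log directory):

# Give the service user ownership of the directory systemd could not write to,
# then restart through systemd so it runs with the same permissions as a manual start.
sudo chown -R kafka:kafka dir/to/log/path
sudo systemctl restart confluent-zookeeper
sudo systemctl status confluent-zookeeper    # confirm the unit stays up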

Kubernetes DSE Cassandra CommitLogReplayer$CommitLogReplayException

I have installed Cassandra on Kubernetes (9 pods). All the pods are up and running except for one pod, which shows the error below.
org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException: Encountered bad header at position 47137 of commit log /var/lib/cassandra/commitlog/CommitLog-600-1630582314923.log, with bad position but valid CRC
at org.apache.cassandra.db.commitlog.CommitLogReplayer.shouldSkipSegmentOnError(CommitLogReplayer.java:438)
at org.apache.cassandra.db.commitlog.CommitLogReplayer.handleUnrecoverableError(CommitLogReplayer.java:452)
at org.apache.cassandra.db.commitlog.CommitLogSegmentReader$SegmentIterator.computeNext(CommitLogSegmentReader.java:109)
at org.apache.cassandra.db.commitlog.CommitLogSegmentReader$SegmentIterator.computeNext(CommitLogSegmentReader.java:84)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
at org.apache.cassandra.db.commitlog.CommitLogReader.readCommitLogSegment(CommitLogReader.java:236)
at org.apache.cassandra.db.commitlog.CommitLogReader.readAllFiles(CommitLogReader.java:134)
at org.apache.cassandra.db.commitlog.CommitLogReplayer.replayFiles(CommitLogReplayer.java:154)
at org.apache.cassandra.db.commitlog.CommitLog.recoverFiles(CommitLog.java:213)
at org.apache.cassandra.db.commitlog.CommitLog.recoverSegmentsOnDisk(CommitLog.java:194)
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:338)
at com.datastax.bdp.server.DseDaemon.setup(DseDaemon.java:527)
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:702)
at com.datastax.bdp.DseModule.main(DseModule.java:96)
Caused by: org.apache.cassandra.db.commitlog.CommitLogReadHandler$CommitLogReadException: Encountered bad header at position 47137 of commit log /var/lib/cassandra/commitlog/CommitLog-600-1630582314923.log, with bad position but valid CRC
at org.apache.cassandra.db.commitlog.CommitLogSegmentReader$SegmentIterator.computeNext(CommitLogSegmentReader.java:111)
... 12 more
ERROR [main] 2021-09-06 06:19:08,990 JVMStabilityInspector.java:251 - JVM state determined to be unstable. Exiting forcefully due to:
org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException: Encountered bad header at position 47137 of commit log /var/lib/cassandra/commitlog/CommitLog-600-1630582314923.log, with bad position but valid CRC
at org.apache.cassandra.db.commitlog.CommitLogReplayer.shouldSkipSegmentOnError(CommitLogReplayer.java:438)
at org.apache.cassandra.db.commitlog.CommitLogReplayer.handleUnrecoverableError(CommitLogReplayer.java:452)
at org.apache.cassandra.db.commitlog.CommitLogSegmentReader$SegmentIterator.computeNext(CommitLogSegmentReader.java:109)
at org.apache.cassandra.db.commitlog.CommitLogSegmentReader$SegmentIterator.computeNext(CommitLogSegmentReader.java:84)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
at org.apache.cassandra.db.commitlog.CommitLogReader.readCommitLogSegment(CommitLogReader.java:236)
at org.apache.cassandra.db.commitlog.CommitLogReader.readAllFiles(CommitLogReader.java:134)
at org.apache.cassandra.db.commitlog.CommitLogReplayer.replayFiles(CommitLogReplayer.java:154)
at org.apache.cassandra.db.commitlog.CommitLog.recoverFiles(CommitLog.java:213)
at org.apache.cassandra.db.commitlog.CommitLog.recoverSegmentsOnDisk(CommitLog.java:194)
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:338)
at com.datastax.bdp.server.DseDaemon.setup(DseDaemon.java:527)
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:702)
at com.datastax.bdp.DseModule.main(DseModule.java:96)
Can someone help me out, please?
For whatever reason, one of the commit log segments got corrupted on the node.
You can work around the issue by manually deleting this file on the pod:
/var/lib/cassandra/commitlog/CommitLog-600-1630582314923.log
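A minimal sketch of that workaround, assuming kubectl access and a hypothetical pod name cassandra-8 (adjust the pod name, and add -c <container> if the pod runs multiple containers):

# Remove the corrupted commit log segment named in the error, then let Kubernetes recreate
# the pod so Cassandra replays the remaining segments. If the container is crash-looping,
# exec while it is briefly running, or mount the data volume from a debug pod instead.
kubectl exec cassandra-8 -- rm /var/lib/cassandra/commitlog/CommitLog-600-1630582314923.log
kubectl delete pod cassandra-8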
Interestingly, that commit log segment was created on September 2 (1630582314923) but the log entry you posted was from September 6. This indicates something happened to the pod which resulted in the corrupted file.
You'll need to review the Cassandra logs on the pod (not the pod logs themselves) to determine the root cause and address it. Cheers!

HADOOPFS - Could not verify the base directory in streamsets

I am having issues running the pipeline in StreamSets; I can see the following error:
HADOOPFS_44 - Could not verify the base directory: 'java.net.ConnectException: Call From SDC/...... to ......failed on connection exception: java.net.ConnectException: Connection refused;
For more details see:
https://cwiki.apache.org/confluence/display/HADOOP2/ConnectionRefused
You should follow the steps given in the Hadoop wiki. The machine running StreamSets Data Collector cannot connect to the Hadoop cluster for some reason.
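A quick way to confirm the connectivity problem from the Data Collector host, assuming a hypothetical NameNode address namenode.example.com:8020 (take the real host and port from fs.defaultFS in your core-site.xml):

# On a cluster node: print the NameNode URI that clients are expected to use.
hdfs getconf -confKey fs.defaultFS
# From the SDC machine: check that the NameNode RPC port is actually reachable.
nc -zv namenode.example.com 8020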

Backup Daemon error - ops manager 3.4

I have installed Ops Manager and set up the configuration for backup. When it tries to sync the MongoDB deployment, it gives an error because it cannot find mongod in /opt/mongodb/mms/mongodb-releases.
Here is the error below; it is thrown by the backup daemon (backup.jobs.590664394c9f732dd6c88b7c.tax):
Failed to start mongod
com.xgen.svc.brs.util.GenericMongoManager$MongoManagerConfigException: Could not find mongod. Found /opt/mongodb/mms/mongodb-releases/mongodb-linux-x86_64-rhel70-3.2.8/bin, but did not find /opt/mongodb/mms/mongodb-releases/mongodb-linux-x86_64-rhel70-3.2.8/bin/mongod.
com.xgen.svc.brs.util.GenericMongoManager$Purpose.<init>(GenericMongoManager.java:132)
com.xgen.svc.brs.util.MongoManager$MongoDPurpose.<init>(MongoManager.java:331)
com.xgen.svc.brs.util.MongoManager$HeadPurpose.<init>(MongoManager.java:477)
com.xgen.svc.brs.job.ReplicaSetJob.startMongo(ReplicaSetJob.java:103)
com.xgen.svc.brs.job.ReplicaSetJob.startMongo(ReplicaSetJob.java:80)
com.xgen.svc.brs.job.IncrementalSyncJob.doWork(IncrementalSyncJob.java:82)
Can you please show how it can be resolved?
Restart the backup daemon (only the backup daemon).
It should normally download the MongoDB releases right at startup.
2017-05-05T00:00:47.558+0000 [MongoDbReleaseAutoDownload Thread] INFO com.xgen.svc.brs.autoDownloader.MongoDbReleaseAutoDownloadThread [MongoDbReleaseAutoDownloadThread.java.runInternal:105] - MongoDbReleaseAutoDownload run completed.
If it does not work, look at the daemon log (by default /opt/mongodb/mms/logs/daemon.log).
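A minimal sketch, assuming a default package install where the Backup Daemon runs as the mongodb-mms-backup-daemon service (the service name may differ in your installation):

# Restart only the backup daemon, then watch its log for the MongoDbReleaseAutoDownload run.
sudo service mongodb-mms-backup-daemon restart
tail -f /opt/mongodb/mms/logs/daemon.log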

OrientDB multi-node configuration: hazelcast issues

I have an OrientDB 2.1.4 cluster of 3 nodes with a basic configuration. The only change I made in hazelcast.xml was to replace multicast with an explicit tcp-ip host list.
After a heavy request to the DB (a select without joins, about 300k rows in the result set), OrientDB stops responding to network connection attempts from the application (OrientDB Studio still works), and the following exceptions continuously appear in the logs:
On the master node:
2016-02-24 10:02:17:647 INFO [10.10.10.124]:2434 [zertodb] [3.3.5] Remaining migration tasks in queue => 1 [InternalPartitionService][10.10.10.124]:2434 [zertodb] [3.3.5] Received data format is invalid. (An old version of Hazelcast may be running here.)
com.hazelcast.nio.serialization.HazelcastSerializationException: java.io.UTFDataFormatException: Length check failed, maybe broken bytestream or wrong stream position
at com.hazelcast.nio.serialization.SerializationServiceImpl.handleException(SerializationServiceImpl.java:354)
at com.hazelcast.nio.serialization.SerializationServiceImpl.readObject(SerializationServiceImpl.java:341)
at com.hazelcast.nio.serialization.ByteArrayObjectDataInput.readObject(ByteArrayObjectDataInput.java:454)
at com.hazelcast.cluster.MulticastService.receive(MulticastService.java:155)
at com.hazelcast.cluster.MulticastService.run(MulticastService.java:113)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.UTFDataFormatException: Length check failed, maybe broken bytestream or wrong stream position
at com.hazelcast.nio.UTFEncoderDecoder.readUTF0(UTFEncoderDecoder.java:505)
at com.hazelcast.nio.UTFEncoderDecoder.readUTF(UTFEncoderDecoder.java:77)
at com.hazelcast.nio.serialization.ByteArrayObjectDataInput.readUTF(ByteArrayObjectDataInput.java:450)
at com.hazelcast.cluster.ConfigCheck.readData(ConfigCheck.java:219)
at com.hazelcast.cluster.JoinMessage.readData(JoinMessage.java:80)
at com.hazelcast.cluster.JoinRequest.readData(JoinRequest.java:64)
at com.hazelcast.nio.serialization.DataSerializer.read(DataSerializer.java:111)
at com.hazelcast.nio.serialization.DataSerializer.read(DataSerializer.java:39)
at com.hazelcast.nio.serialization.StreamSerializerAdapter.read(StreamSerializerAdapter.java:44)
at com.hazelcast.nio.serialization.SerializationServiceImpl.readObject(SerializationServiceImpl.java:335)
... 4 more
On the other nodes:
[10.10.10.194]:2434 [zertodb] [3.3.5] Received data format is invalid. (An old version of Hazelcast may be running here.)
com.hazelcast.nio.serialization.HazelcastSerializationException: java.io.StreamCorruptedException: invalid type code: 00
at com.hazelcast.nio.serialization.SerializationServiceImpl.handleException(SerializationServiceImpl.java:354)
at com.hazelcast.nio.serialization.SerializationServiceImpl.readObject(SerializationServiceImpl.java:341)
at com.hazelcast.nio.serialization.ByteArrayObjectDataInput.readObject(ByteArrayObjectDataInput.java:454)
at com.hazelcast.cluster.ConfigCheck.readData(ConfigCheck.java:215)
at com.hazelcast.cluster.JoinMessage.readData(JoinMessage.java:80)
at com.hazelcast.cluster.JoinRequest.readData(JoinRequest.java:64)
at com.hazelcast.nio.serialization.DataSerializer.read(DataSerializer.java:111)
at com.hazelcast.nio.serialization.DataSerializer.read(DataSerializer.java:39)
at com.hazelcast.nio.serialization.StreamSerializerAdapter.read(StreamSerializerAdapter.java:44)
at com.hazelcast.nio.serialization.SerializationServiceImpl.readObject(SerializationServiceImpl.java:335)
at com.hazelcast.nio.serialization.ByteArrayObjectDataInput.readObject(ByteArrayObjectDataInput.java:454)
at com.hazelcast.cluster.MulticastService.receive(MulticastService.java:155)
at com.hazelcast.cluster.MulticastService.run(MulticastService.java:113)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.StreamCorruptedException: invalid type code: 00
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1379)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at com.hazelcast.nio.serialization.DefaultSerializers$ObjectSerializer.read(DefaultSerializers.java:196)
at com.hazelcast.nio.serialization.StreamSerializerAdapter.read(StreamSerializerAdapter.java:44)
at com.hazelcast.nio.serialization.SerializationServiceImpl.readObject(SerializationServiceImpl.java:335)
... 12 more
The same query with a smaller result set works fine.
I have found this Hazelcast issue about your problem: https://github.com/hazelcast/hazelcast/issues/4327
Hope it helps.