MongoDB replication errors NetworkInterfaceExceededTimeLimit and MaxTimeMSExpired - Causes and Fix

In a replica set consisting of a primary, a secondary, and an arbiter, the replication connection times out intermittently and the logs show the errors below. Could you please share what the issue might be and what the recommended fix is?
NetworkInterfaceExceededTimeLimit: error in fetcher batch callback :: caused by :: timed out. Last fetched optime (with hash): { ts: Timestamp(1554364591, 71), t: 65596 }[3697357721798898959]. Restarts remaining: 1
I REPL [replication-1] Error returned from oplog query (no more query restarts left): MaxTimeMSExpired: error in fetcher batch callback :: caused by :: operation exceeded time limit
W REPL [rsBackgroundSync] Fetcher stopped querying remote oplog with error: MaxTimeMSExpired: error in fetcher batch callback :: caused by :: operation exceeded time limit
I REPL [SyncSourceFeedback] SyncSourceFeedback error sending update to <ServerFQDN>:27017: InvalidSyncSource: Sync source was cleared. Was <ServerFQDN>:27017
I did refer to https://stackoverflow.com/questions/44798577/mongodb-replication-timeout, but it does not match our case; the disk drives have plenty of free space. Any suggestions on what could be wrong here would be much appreciated. Thank you!
Each time, I restart the mongod service on both servers, but it does not help; the error keeps coming back intermittently.
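For diagnosis, here is a minimal PyMongo sketch (the connection string below is a placeholder; substitute your own hosts and credentials) that prints each member's state, health, round-trip ping, and replication lag relative to the primary:

from pymongo import MongoClient

# Placeholder connection string; point it at any member of the replica set.
client = MongoClient("mongodb://primary.example.com:27017/?replicaSet=rs0")

status = client.admin.command("replSetGetStatus")
primary = next((m for m in status["members"] if m["stateStr"] == "PRIMARY"), None)

for m in status["members"]:
    lag = None
    if primary is not None and m["stateStr"] not in ("PRIMARY", "ARBITER"):
        # Replication lag of this member relative to the primary, in seconds.
        lag = (primary["optimeDate"] - m["optimeDate"]).total_seconds()
    print(m["name"], m["stateStr"],
          "health:", m.get("health"),
          "pingMs:", m.get("pingMs"),
          "lag(s):", lag)

Consistently high pingMs or growing lag while these fetcher timeouts appear usually points at slow or lossy network connectivity between the secondary and its sync source (or an overloaded sync source) rather than at disk space.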

Related

Mongo RealmSync Getting Error: Encountered non-recoverable resume token error

I am getting the error below from MongoDB Realm Sync; after it occurs I need to manually re-enable the sync process. Does anyone know what causes this error?
encountered non-recoverable resume token error. Sync cannot be resumed from this state and must be terminated and re-enabled to continue functioning: (ChangeStreamHistoryLost) PlanExecutor error during aggregation :: caused by :: Resume of change stream was not possible, as the resume point may no longer be in the oplog.
Note: the sync workload does not have many transactions to sync.
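ChangeStreamHistoryLost means the resume token is no longer present in the oplog, so sync cannot pick up where it left off. One common mitigation is to give the oplog more headroom so the resume point survives longer. Below is a minimal PyMongo sketch (the connection string and the 16 GB target size are placeholder assumptions) that checks the current oplog window and, on a self-managed replica set member, resizes the oplog; on Atlas the oplog size is changed through the cluster settings instead:

from pymongo import MongoClient

# Placeholder connection string; must target a replica set member with admin rights.
client = MongoClient("mongodb://db.example.com:27017/?replicaSet=rs0")

# How many seconds of history does the oplog currently cover?
oplog = client["local"]["oplog.rs"]
first = oplog.find_one(sort=[("$natural", 1)])
last = oplog.find_one(sort=[("$natural", -1)])
print("oplog window (s):", last["ts"].time - first["ts"].time)

# Grow the oplog to 16 GB (size is given in megabytes) on a self-managed member.
# On Atlas, use the cluster's oplog size setting instead of this command.
client.admin.command({"replSetResizeOplog": 1, "size": 16384})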

Kubernetes DSE Cassandra CommitLogReplayer$CommitLogReplayException

I have installed Cassandra on Kubernetes (9 pods). All the pods are up and running except for one, which shows the error below.
org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException: Encountered bad header at position 47137 of commit log /var/lib/cassandra/commitlog/CommitLog-600-1630582314923.log, with bad position but valid CRC
at org.apache.cassandra.db.commitlog.CommitLogReplayer.shouldSkipSegmentOnError(CommitLogReplayer.java:438)
at org.apache.cassandra.db.commitlog.CommitLogReplayer.handleUnrecoverableError(CommitLogReplayer.java:452)
at org.apache.cassandra.db.commitlog.CommitLogSegmentReader$SegmentIterator.computeNext(CommitLogSegmentReader.java:109)
at org.apache.cassandra.db.commitlog.CommitLogSegmentReader$SegmentIterator.computeNext(CommitLogSegmentReader.java:84)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
at org.apache.cassandra.db.commitlog.CommitLogReader.readCommitLogSegment(CommitLogReader.java:236)
at org.apache.cassandra.db.commitlog.CommitLogReader.readAllFiles(CommitLogReader.java:134)
at org.apache.cassandra.db.commitlog.CommitLogReplayer.replayFiles(CommitLogReplayer.java:154)
at org.apache.cassandra.db.commitlog.CommitLog.recoverFiles(CommitLog.java:213)
at org.apache.cassandra.db.commitlog.CommitLog.recoverSegmentsOnDisk(CommitLog.java:194)
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:338)
at com.datastax.bdp.server.DseDaemon.setup(DseDaemon.java:527)
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:702)
at com.datastax.bdp.DseModule.main(DseModule.java:96)
Caused by: org.apache.cassandra.db.commitlog.CommitLogReadHandler$CommitLogReadException: Encountered bad header at position 47137 of commit log /var/lib/cassandra/commitlog/CommitLog-600-1630582314923.log, with bad position but valid CRC
at org.apache.cassandra.db.commitlog.CommitLogSegmentReader$SegmentIterator.computeNext(CommitLogSegmentReader.java:111)
... 12 more
ERROR [main] 2021-09-06 06:19:08,990 JVMStabilityInspector.java:251 - JVM state determined to be unstable. Exiting forcefully due to:
org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException: Encountered bad header at position 47137 of commit log /var/lib/cassandra/commitlog/CommitLog-600-1630582314923.log, with bad position but valid CRC
at org.apache.cassandra.db.commitlog.CommitLogReplayer.shouldSkipSegmentOnError(CommitLogReplayer.java:438)
at org.apache.cassandra.db.commitlog.CommitLogReplayer.handleUnrecoverableError(CommitLogReplayer.java:452)
at org.apache.cassandra.db.commitlog.CommitLogSegmentReader$SegmentIterator.computeNext(CommitLogSegmentReader.java:109)
at org.apache.cassandra.db.commitlog.CommitLogSegmentReader$SegmentIterator.computeNext(CommitLogSegmentReader.java:84)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
at org.apache.cassandra.db.commitlog.CommitLogReader.readCommitLogSegment(CommitLogReader.java:236)
at org.apache.cassandra.db.commitlog.CommitLogReader.readAllFiles(CommitLogReader.java:134)
at org.apache.cassandra.db.commitlog.CommitLogReplayer.replayFiles(CommitLogReplayer.java:154)
at org.apache.cassandra.db.commitlog.CommitLog.recoverFiles(CommitLog.java:213)
at org.apache.cassandra.db.commitlog.CommitLog.recoverSegmentsOnDisk(CommitLog.java:194)
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:338)
at com.datastax.bdp.server.DseDaemon.setup(DseDaemon.java:527)
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:702)
at com.datastax.bdp.DseModule.main(DseModule.java:96)
Can someone help me out, please?
For whatever reason, one of the commit log segments got corrupted on the node.
You can workaround the issue by manually deleting this file on the pod:
/var/lib/cassandra/commitlog/CommitLog-600-1630582314923.log
Interestingly, that commit log segment was created on September 2 (1630582314923), but the log entry you posted is from September 6. This indicates that something happened to the pod which resulted in the corrupted file.
You'll need to review the Cassandra logs on the pod (not the pod's own Kubernetes logs) to determine the root cause and address it. Cheers!
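If you want to script that workaround, here is a minimal sketch with the official kubernetes Python client (the pod, namespace, and container names are assumptions; substitute the values of the failing Cassandra pod). It removes the corrupted segment and then deletes the pod so its controller recreates it and Cassandra replays the remaining commit logs:

from kubernetes import client, config
from kubernetes.stream import stream

config.load_kube_config()  # or config.load_incluster_config() when running in-cluster
v1 = client.CoreV1Api()

# Assumed names; replace with the failing pod, its namespace, and its container.
pod, namespace, container = "dse-cassandra-3", "cassandra", "cassandra"

# Remove the corrupted commit log segment inside the pod.
out = stream(
    v1.connect_get_namespaced_pod_exec,
    pod,
    namespace,
    container=container,
    command=["rm", "-f",
             "/var/lib/cassandra/commitlog/CommitLog-600-1630582314923.log"],
    stderr=True, stdin=False, stdout=True, tty=False,
)
print(out)

# Delete the pod so its controller (e.g. a StatefulSet) recreates it and
# Cassandra replays the remaining commit log segments on startup.
v1.delete_namespaced_pod(pod, namespace)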

AWS DMS task failing after some time in CDC mode

I'm having trouble setting up a task that migrates the data in an RDS database (PostgreSQL, engine 10.15) into an S3 bucket in initial migration + CDC mode.
Both endpoints are configured and tested successfully.
I have created the task twice, and both times it ran for a couple of hours at most. The first time, the initial dump went fine and some of the incremental dumps took place as well; the second time, only the initial dump finished and no incremental dump was performed before the task failed.
The error message is now:
Last Error Task 'data-migration-bp-dev' was suspended after 9 successive recovery failures Stop Reason FATAL_ERROR Error Level FATAL_
but just after it failed for the first time it was:
Last Error An internal WAL conversational protocol error has occurred. Task error notification received from subtask 0, thread 0 reptask/replicationtask.c:2859 1020452 Error executing source loop; Stream component failed at subtask 0, component st_0_data-migration-rds-bp-dev; Stream component 'st_0_data-migration-rds-bp-dev' terminated reptask/replicationtask.c:2866 1020452 Stop Reason RECOVERABLE_ERROR Error Level RECOVERABLE
In the CloudWatch logs I see the following error messages:
SOURCE_CAPTURE I: Streaming initiated successfully (postgres_pglogical.c:274)
SOURCE_CAPTURE I: #1 : Non-monotonic LSN sequence: Current LSN '00000000/00000000' < Previous LSN '000001E3/94016430'. Event is ignored. (postgres_endpoint_wal_engine.c:710)
SOURCE_CAPTURE I: Unable to resolve attributes for relation id '28804'. Aborting action. (postgres_pglogical.c:1643)
SOURCE_CAPTURE I: End of CDC / CAPTURE events for POSTGRES endpoint. (postgres_endpoint_capture.c:520)
SOURCE_CAPTURE I: CAPTURE ended with exceptions. (postgres_endpoint_capture.c:527)
SOURCE_CAPTURE E: Could not find relation id '28804' in hash. 1020483 (postgres_pglogical.c:1470)
SOURCE_CAPTURE E: Failed to parse relation from dml command 1020483 (postgres_pglogical.c:2515)
SOURCE_CAPTURE E: Failed to find relation id on target while processing message from source 1020452 (postgres_endpoint_wal_engine.c:805)
SOURCE_CAPTURE E: WAL stream loop ended abnormally. (STATUS_PROTOCOL_ERROR) 1020452 (postgres_endpoint_wal_engine.c:992)
SOURCE_CAPTURE E: WAL reader terminated with irrecoverable error. 1020452 (postgres_endpoint_capture.c:496)
TASK_MANAGER I: Task - data-migration-bp-dev is in ERROR state, updating starting status to AR_NOT_APPLICABLE (repository.c:5102)
SOURCE_CAPTURE E: Error executing source loop 1020452 (streamcomponent.c:1870)
TASK_MANAGER E: Stream component failed at subtask 0, component st_0_data-migration-rds-bp-dev 1020452 (subtask.c:1409)
SOURCE_CAPTURE E: Stream component 'st_0_data-migration-rds-bp-dev' terminated 1020452 (subtask.c:1578)
TASK_MANAGER E: Task error notification received from subtask 0, thread 0 1020452 (replicationtask.c:2859)
TASK_MANAGER E: Error executing source loop; Stream component failed at subtask 0, component st_0_data-migration-rds-bp-dev; Stream component 'st_0_data-migration-rds-bp-dev' terminated 1020452 (replicationtask.c:2866)
TASK_MANAGER E: Task 'data-migration-bp-dev' encountered a recoverable error, retry attempt # 0 (repository.c:5184)
At this point I should mention that we had to configure the pglogical plugin and restart the database. We got an error at the end of that operation, which we ignored since the DMS task started afterwards:
ERROR: current database is not configured as pglogical node
HINT: create pglogical node first
Is our failing DMS task related to the pglogical plugin configuration? If so, how can we configure it so that it works (our DB engine should be compatible with it, shouldn't it)? And if not, how do we fix this?
Thank you in advance!
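Regarding the HINT itself: a pglogical node is registered with pglogical.create_node(). Here is a minimal psycopg2 sketch (connection parameters, node name, and DSN are placeholders), although, as described below, we ended up not needing pglogical at all:

import psycopg2

# Placeholder connection parameters; point at the RDS source database.
conn = psycopg2.connect(host="my-rds-instance.example.com", dbname="mydb",
                        user="admin", password="secret")
conn.autocommit = True

with conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS pglogical;")
    # Register this database as a pglogical node (name and dsn are placeholders).
    cur.execute(
        "SELECT pglogical.create_node("
        "  node_name := 'dms_provider',"
        "  dsn := 'host=my-rds-instance.example.com port=5432 dbname=mydb'"
        ");"
    )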
Should anyone get the same error in the future, here is what we were told by the AWS tech specialist:
There is a known (to AWS) issue with the pglogical plugin. The solution requires using the test_decoding plugin instead.
1. Force the test_decoding plugin on the DMS source endpoint by specifying pluginName=test_decoding in its Extra Connection Attributes.
2. Create a new DMS task using this endpoint (reusing the old task may cause it to fail because the task and the logs are out of sync).
This did resolve the issue, but we still don't know what the actual problem was with the plugin, which (at the moment) is strongly recommended throughout the DMS documentation.
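For reference, the plugin switch can also be applied with boto3 (the region and endpoint ARN below are placeholders); after modifying the endpoint, create a fresh task rather than resuming the old one:

import boto3

dms = boto3.client("dms", region_name="eu-west-1")  # placeholder region

# Force DMS to use test_decoding instead of pglogical on the source endpoint.
dms.modify_endpoint(
    EndpointArn="arn:aws:dms:eu-west-1:123456789012:endpoint:EXAMPLE",  # placeholder ARN
    ExtraConnectionAttributes="pluginName=test_decoding",
)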

OrientDB multi-node configuration: hazelcast issues

I have an OrientDB 2.1.4 cluster of 3 nodes with a basic configuration. The only change I made in hazelcast.xml is to replace multicast with an explicit tcp-ip host list.
After a heavy request to the DB (a select without joins, about 300k rows in the result set), OrientDB stops responding to network connection attempts from the application (OrientDB Studio still works), and the following exceptions continuously appear in the logs:
On the master node:
2016-02-24 10:02:17:647 INFO [10.10.10.124]:2434 [zertodb] [3.3.5] Remaining migration tasks in queue => 1 [InternalPartitionService]
[10.10.10.124]:2434 [zertodb] [3.3.5] Received data format is invalid. (An old version of Hazelcast may be running here.)
com.hazelcast.nio.serialization.HazelcastSerializationException: java.io.UTFDataFormatException: Length check failed, maybe broken bytestream or wrong stream position
at com.hazelcast.nio.serialization.SerializationServiceImpl.handleException(SerializationServiceImpl.java:354)
at com.hazelcast.nio.serialization.SerializationServiceImpl.readObject(SerializationServiceImpl.java:341)
at com.hazelcast.nio.serialization.ByteArrayObjectDataInput.readObject(ByteArrayObjectDataInput.java:454)
at com.hazelcast.cluster.MulticastService.receive(MulticastService.java:155)
at com.hazelcast.cluster.MulticastService.run(MulticastService.java:113)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.UTFDataFormatException: Length check failed, maybe broken bytestream or wrong stream position
at com.hazelcast.nio.UTFEncoderDecoder.readUTF0(UTFEncoderDecoder.java:505)
at com.hazelcast.nio.UTFEncoderDecoder.readUTF(UTFEncoderDecoder.java:77)
at com.hazelcast.nio.serialization.ByteArrayObjectDataInput.readUTF(ByteArrayObjectDataInput.java:450)
at com.hazelcast.cluster.ConfigCheck.readData(ConfigCheck.java:219)
at com.hazelcast.cluster.JoinMessage.readData(JoinMessage.java:80)
at com.hazelcast.cluster.JoinRequest.readData(JoinRequest.java:64)
at com.hazelcast.nio.serialization.DataSerializer.read(DataSerializer.java:111)
at com.hazelcast.nio.serialization.DataSerializer.read(DataSerializer.java:39)
at com.hazelcast.nio.serialization.StreamSerializerAdapter.read(StreamSerializerAdapter.java:44)
at com.hazelcast.nio.serialization.SerializationServiceImpl.readObject(SerializationServiceImpl.java:335)
... 4 more
On the other nodes:
[10.10.10.194]:2434 [zertodb] [3.3.5] Received data format is invalid. (An old version of Hazelcast may be running here.)
com.hazelcast.nio.serialization.HazelcastSerializationException: java.io.StreamCorruptedException: invalid type code: 00
at com.hazelcast.nio.serialization.SerializationServiceImpl.handleException(SerializationServiceImpl.java:354)
at com.hazelcast.nio.serialization.SerializationServiceImpl.readObject(SerializationServiceImpl.java:341)
at com.hazelcast.nio.serialization.ByteArrayObjectDataInput.readObject(ByteArrayObjectDataInput.java:454)
at com.hazelcast.cluster.ConfigCheck.readData(ConfigCheck.java:215)
at com.hazelcast.cluster.JoinMessage.readData(JoinMessage.java:80)
at com.hazelcast.cluster.JoinRequest.readData(JoinRequest.java:64)
at com.hazelcast.nio.serialization.DataSerializer.read(DataSerializer.java:111)
at com.hazelcast.nio.serialization.DataSerializer.read(DataSerializer.java:39)
at com.hazelcast.nio.serialization.StreamSerializerAdapter.read(StreamSerializerAdapter.java:44)
at com.hazelcast.nio.serialization.SerializationServiceImpl.readObject(SerializationServiceImpl.java:335)
at com.hazelcast.nio.serialization.ByteArrayObjectDataInput.readObject(ByteArrayObjectDataInput.java:454)
at com.hazelcast.cluster.MulticastService.receive(MulticastService.java:155)
at com.hazelcast.cluster.MulticastService.run(MulticastService.java:113)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.StreamCorruptedException: invalid type code: 00
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1379)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at com.hazelcast.nio.serialization.DefaultSerializers$ObjectSerializer.read(DefaultSerializers.java:196)
at com.hazelcast.nio.serialization.StreamSerializerAdapter.read(StreamSerializerAdapter.java:44)
at com.hazelcast.nio.serialization.SerializationServiceImpl.readObject(SerializationServiceImpl.java:335)
... 12 more
The same query with a smaller result set works fine.
I found this Hazelcast issue that matches your problem: https://github.com/hazelcast/hazelcast/issues/4327
Hope it helps.

SQL timeout on Azure website suddenly started when returning a large number (1500) of rows

An Azure Website with EF6 has just started to get timeouts on pages where I retrieve more than about 1000 rows (I'm unsure about the exact limit; it works with 400 or fewer and fails with 1500 or more).
[Win32Exception (0x80004005): The wait operation timed out]
[SqlException (0x80131904): Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding. This failure occurred while attempting to connect to the routing destination. The duration spent while attempting to connect to the original server was - [Pre-Login] initialization=1; handshake=21; [Login] initialization=0; authentication=0; [Post-Login] complete=1; ]
The app has been running smoothly for several months; I just noticed this today. Any ideas?
(In case the error is still present: page with the error: http://fartslek.no/fartslek/15 ; page without the error: http://fartslek.no/fartslek/3 )