AWS DMS task failing after some time in CDC mode - postgresql

I'm having trouble setting up a task that migrates data from an RDS database (PostgreSQL, engine 10.15) into an S3 bucket in initial migration + CDC mode.
Both endpoints are configured and tested successfully.
I have created the task twice, and both times it ran for a couple of hours at most. The first time, the initial dump went fine and some of the incremental dumps took place as well; the second time, only the initial dump finished, and no incremental dump was performed before the task failed.
The error message is now:
Last Error Task 'data-migration-bp-dev' was suspended after 9 successive recovery failures Stop Reason FATAL_ERROR Error Level FATAL_
but just after it failed for the first time it was:
Last Error An internal WAL conversational protocol error has occurred. Task error notification received from subtask 0, thread 0 reptask/replicationtask.c:2859 1020452 Error executing source loop; Stream component failed at subtask 0, component st_0_data-migration-rds-bp-dev; Stream component 'st_0_data-migration-rds-bp-dev' terminated reptask/replicationtask.c:2866 1020452 Stop Reason RECOVERABLE_ERROR Error Level RECOVERABLE
In the CloudWatch logs I see the following error messages:
SOURCE_CAPTURE I: Streaming initiated successfully (postgres_pglogical.c:274)
SOURCE_CAPTURE I: #1 : Non-monotonic LSN sequence: Current LSN '00000000/00000000' < Previous LSN '000001E3/94016430'. Event is ignored. (postgres_endpoint_wal_engine.c:710)
SOURCE_CAPTURE I: Unable to resolve attributes for relation id '28804'. Aborting action. (postgres_pglogical.c:1643)
SOURCE_CAPTURE I: End of CDC / CAPTURE events for POSTGRES endpoint. (postgres_endpoint_capture.c:520)
SOURCE_CAPTURE I: CAPTURE ended with exceptions. (postgres_endpoint_capture.c:527)
SOURCE_CAPTURE E: Could not find relation id '28804' in hash. 1020483 (postgres_pglogical.c:1470)
SOURCE_CAPTURE E: Failed to parse relation from dml command 1020483 (postgres_pglogical.c:2515)
SOURCE_CAPTURE E: Failed to find relation id on target while processing message from source 1020452 (postgres_endpoint_wal_engine.c:805)
SOURCE_CAPTURE E: WAL stream loop ended abnormally. (STATUS_PROTOCOL_ERROR) 1020452 (postgres_endpoint_wal_engine.c:992)
SOURCE_CAPTURE E: WAL reader terminated with irrecoverable error. 1020452 (postgres_endpoint_capture.c:496)
TASK_MANAGER I: Task - data-migration-bp-dev is in ERROR state, updating starting status to AR_NOT_APPLICABLE (repository.c:5102)
SOURCE_CAPTURE E: Error executing source loop 1020452 (streamcomponent.c:1870)
TASK_MANAGER E: Stream component failed at subtask 0, component st_0_data-migration-rds-bp-dev 1020452 (subtask.c:1409)
SOURCE_CAPTURE E: Stream component 'st_0_data-migration-rds-bp-dev' terminated 1020452 (subtask.c:1578)
TASK_MANAGER E: Task error notification received from subtask 0, thread 0 1020452 (replicationtask.c:2859)
TASK_MANAGER E: Error executing source loop; Stream component failed at subtask 0, component st_0_data-migration-rds-bp-dev; Stream component 'st_0_data-migration-rds-bp-dev' terminated 1020452 (replicationtask.c:2866)
TASK_MANAGER E: Task 'data-migration-bp-dev' encountered a recoverable error, retry attempt # 0 (repository.c:5184)
At this point I should mention that we had to configure the pglogical plugin and restart the database. We got an error at the end, which we ignored because the DMS task started after that operation:
ERROR: current database is not configured as pglogical node
HINT: create pglogical node first
Is the failure of our DMS task related to the pglogical plugin configuration? If so, how can we configure it so that it works (our DB engine should be compatible with it, shouldn't it)? And if not, how can we fix it?
Thank you in advance!

Should anyone get the same error in the future, here is what we were told by the AWS tech specialist:
There is a known (to AWS) issue with the pglogical plugin. The solution requires using the test_decoding plugin instead.
Enforce the use of the test_decoding plugin on the DMS endpoint by specifying pluginName=test_decoding in Extra Connection Attributes.
Create a new DMS task using this endpoint (reusing the old task may cause it to fail due to desynchronization between the task and the logs).
It sure did resolve the issue, but we still don't know what the problem really was with the plugin that is strongly suggested everywhere in the DMS documentation (at the moment).
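For anyone scripting the endpoint change described above, here is a minimal Python sketch, not an official procedure. It assumes the extra connection attributes are a semicolon-separated list of key=value pairs and that you pass in a boto3 DMS client (for example `boto3.client("dms")`). The helper names `with_plugin_name` and `set_endpoint_plugin` are made up for illustration, and the endpoint ARN is yours to supply.

```python
def with_plugin_name(existing: str, plugin: str = "test_decoding") -> str:
    """Return an extra-connection-attributes string with pluginName set.

    Assumes `existing` is a semicolon-separated list of key=value pairs.
    Any previous pluginName entry (matched case-insensitively) is dropped.
    """
    pairs = []
    for part in existing.split(";"):
        part = part.strip()
        if not part:
            continue
        key, _, value = part.partition("=")
        if key.strip().lower() == "pluginname":
            continue  # drop the old plugin setting, e.g. pglogical
        pairs.append(f"{key.strip()}={value.strip()}")
    pairs.append(f"pluginName={plugin}")
    return ";".join(pairs)


def set_endpoint_plugin(dms_client, endpoint_arn: str, existing: str = "") -> str:
    """Apply the updated attributes to a source endpoint.

    `dms_client` is assumed to be a boto3 DMS client; only its
    modify_endpoint call is used here.
    """
    attrs = with_plugin_name(existing)
    dms_client.modify_endpoint(
        EndpointArn=endpoint_arn,
        ExtraConnectionAttributes=attrs,
    )
    return attrs
```

After changing the endpoint, create a new task against it rather than resuming the old one, as the AWS specialist advised.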

Related

Kogito Kafka messages - Message Trigger information is not complete TriggerMetaData

Has anyone ever encountered this error when using Kogito (with Quarkus) to send Kafka messages from a flow (bpmn2 file) that throws an "Intermediate Message" from the intermediate events?
[ERROR] com.package.xxxxxx.whenRunTestSupermarketSuccess Time elapsed: 0.006 s
<<< ERROR!
java.lang.RuntimeException:
java.lang.RuntimeException: io.quarkus.builder.BuildException: Build failure: Build failed due to errors
[error]: Build step org.kie.kogito.quarkus.common.deployment.KogitoAssetsProcessor#generateSources
threw an exception: org.kie.kogito.codegen.process.ProcessCodegenException: Error while elaborating
process id = "supermarketFlow", packageName = "com.package.xxxxxx": Message Trigger information is
not complete TriggerMetaData [name=cartCreated, type=ProduceMessage, dataType=String, modelRef=null, ownerId=1]
at org.kie.kogito.codegen.process.ProcessCodegen.internalGenerate(ProcessCodegen.java:325)
at org.kie.kogito.codegen.core.AbstractGenerator.generate(AbstractGenerator.java:69)
I sense that modelRef=null could be the issue, but as far as I've seen, it is not mentioned in the official documentation. I suppose I should give it a value in the bpmn file, but I don't know where.
I'm trying to send Kafka messages after running some logic; for the moment, nothing is consuming the messages.
Maybe the issue is related to this thread?
BPMN 2.0, kogito, triggering signal or message from another process

Mongo RealmSync Getting Error: Encountered non-recoverable resume token error

I am getting the error below from Mongo RealmSync; after that, I need to manually re-enable the sync process.
Does anyone know what causes this error?
encountered non-recoverable resume token error. Sync cannot be resumed from this state and must be terminated and re-enabled to continue functioning: (ChangeStreamHistoryLost) PlanExecutor error during aggregation :: caused by :: Resume of change stream was not possible, as the resume point may no longer be in the oplog.
Note: My sync does not have many transactions to sync.

MongoDB replication error NetworkInterfaceExceededTimeLimit and MaxTimeMSExpired - Causes and Fix

In a replication setup with a primary, a secondary and an arbiter, the replication connection URI times out intermittently and the logs show the error below. Please help identify what the issue could be and what the recommended fix would be.
NetworkInterfaceExceededTimeLimit: error in fetcher batch callback ::
caused by :: timed out. Last fetched optime (with hash): { ts:
Timestamp(1554364591, 71), t: 65596 }[3697357721798898959]. Restarts
remaining: 1 I REPL [replication-1] Error returned from oplog
query (no more query restarts left): MaxTimeMSExpired: error in
fetcher batch callback :: caused by :: operation exceeded time limit W
REPL [rsBackgroundSync] Fetcher stopped querying remote oplog with
error: MaxTimeMSExpired: error in fetcher batch callback :: caused by
:: operation exceeded time limit I REPL [SyncSourceFeedback]
SyncSourceFeedback error sending update to <ServerFQDN>:27017:
InvalidSyncSource: Sync source was cleared. Was <ServerFQDN>:27017
I did refer to this link: https://stackoverflow.com/questions/44798577/mongodb-replication-timeout
but it does not match our case. The disk drives have enough space. Please suggest what could be wrong here. Thank you!!
Each time, we restart the mongo service on both servers, but it does not help. The error keeps occurring intermittently.

GCS Connector Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found

We are trying to run Hive queries on HDP 2.1 using the GCS Connector. It was working fine until yesterday, but since this morning our jobs have started failing randomly. When we restart them manually, they work fine. I suspect it has something to do with the number of parallel Hive jobs running at a given point in time.
Below is the error message:
vertexId=vertex_1407434664593_37527_2_00, diagnostics=[Vertex Input: audience_history initializer failed., java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found]
DAG failed due to vertex failure. failedVertices:1 killedVertices:0
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask
Any help will be highly appreciated.
Thanks!

How do I fix server Oops error when deploying Play Framework 2.0 application as a Windows service?

I was following the processes in this very helpful post:
How do I run a Play Framework 2.0 application as a Windows service?
when I ran into trouble in Step 9. When I execute runConsole.bat, the service cycles between the running and restarting states. The full log is here:
wrapper.log
but what jumps out at me in the log is the following:
INFO|7268/0|play.core.server.NettyServer|13-12-28 13:07:28|Oops, cannot start the server.
INFO|7268/0|play.core.server.NettyServer|13-12-28 13:07:28|Configuration error: Configuration error[Cannot connect to database [default]]
...
INFO|7268/0|play.core.server.NettyServer|13-12-28 13:07:28|Caused by: org.h2.jdbc.JdbcSQLException: Database may be already in use: "Locked by another process". Possible solutions: close all other connection(s); use the server mode [90020-168]
...
INFO|wrapper|play.core.server.NettyServer|13-12-28 13:07:28|restart process due to default exit code rule
INFO|wrapper|play.core.server.NettyServer|13-12-28 13:07:28|restart internal RUNNING
INFO|wrapper|play.core.server.NettyServer|13-12-28 13:07:28|stopping process with pid/timeout 7268 45000
INFO|wrapper|play.core.server.NettyServer|13-12-28 13:07:30|process exit code: -1
...
INFO|7812/1|play.core.server.NettyServer|13-12-28 13:07:45|[error] c.j.b.h.AbstractConnectionHook - Failed to obtain initial connection Sleeping for 0ms and trying again. Attempts left: 0. Exception: null
INFO|7812/1|play.core.server.NettyServer|13-12-28 13:07:45|Oops, cannot start the server.
INFO|7812/1|play.core.server.NettyServer|13-12-28 13:07:45|Configuration error: Configuration error[Cannot connect to database [default]]
repeats several times...
Halfway through writing this post, I realized that before step 9 you should terminate the start.bat script that you started in step 6. When I did this, runConsole.bat executed normally without any of the errors listed above.