I need help. I just tried to set up a PostgreSQL -> Debezium -> Pub/Sub -> BigQuery pipeline. Debezium seems to be receiving updates from PostgreSQL correctly, but when I pull the Pub/Sub subscription I can't find any messages.
This is the log from Debezium:
2022-03-25 08:44:13,053 INFO [io.deb.con.pos.PostgresSchema] (debezium-postgresconnector-debezium-postgres-change-event-source-coordinator) REPLICA IDENTITY for 'public.test_table' is 'DEFAULT'; UPDATE and DELETE events will contain previous values only for PK columns
2022-03-25 08:44:13,054 INFO [io.deb.pip.sou.sna.inc.AbstractIncrementalSnapshotChangeEventSource] (debezium-postgresconnector-debezium-postgres-change-event-source-coordinator) No incremental snapshot in progress, no action needed on start
2022-03-25 08:44:13,055 INFO [io.deb.con.pos.PostgresStreamingChangeEventSource] (debezium-postgresconnector-debezium-postgres-change-event-source-coordinator) Retrieved latest position from stored offset 'LSN{0/1632FE58}'
2022-03-25 08:44:13,056 INFO [io.deb.con.pos.con.WalPositionLocator] (debezium-postgresconnector-debezium-postgres-change-event-source-coordinator) Looking for WAL restart position for last commit LSN 'null' and last change LSN 'LSN{0/1632FE58}'
2022-03-25 08:44:13,056 INFO [io.deb.con.pos.con.PostgresReplicationConnection] (debezium-postgresconnector-debezium-postgres-change-event-source-coordinator) Initializing PgOutput logical decoder publication
2022-03-25 08:44:13,136 INFO [io.deb.con.pos.con.PostgresConnection] (debezium-postgresconnector-debezium-postgres-change-event-source-coordinator) Obtained valid replication slot ReplicationSlot [active=false, latestFlushedLsn=LSN{0/12106F78}, catalogXmin=124135]
2022-03-25 08:44:13,142 INFO [io.deb.jdb.JdbcConnection] (pool-8-thread-1) Connection gracefully closed
2022-03-25 08:44:13,162 INFO [io.deb.uti.Threads] (debezium-postgresconnector-debezium-postgres-change-event-source-coordinator) Requested thread factory for connector PostgresConnector, id = debezium-postgres named = keep-alive
2022-03-25 08:44:13,163 INFO [io.deb.uti.Threads] (debezium-postgresconnector-debezium-postgres-change-event-source-coordinator) Creating thread debezium-postgresconnector-debezium-postgres-keep-alive
2022-03-25 08:44:13,177 INFO [io.deb.con.pos.PostgresSchema] (debezium-postgresconnector-debezium-postgres-change-event-source-coordinator) REPLICA IDENTITY for 'public.test_table' is 'DEFAULT'; UPDATE and DELETE events will contain previous values only for PK columns
2022-03-25 08:44:13,179 INFO [io.deb.con.pos.PostgresStreamingChangeEventSource] (debezium-postgresconnector-debezium-postgres-change-event-source-coordinator) Searching for WAL resume position
2022-03-25 08:45:14,447 INFO [io.deb.con.pos.con.WalPositionLocator] (debezium-postgresconnector-debezium-postgres-change-event-source-coordinator) First LSN 'LSN{0/16330180}' received
2022-03-25 08:45:14,448 INFO [io.deb.con.pos.PostgresStreamingChangeEventSource] (debezium-postgresconnector-debezium-postgres-change-event-source-coordinator) WAL resume position 'LSN{0/16330180}' discovered
2022-03-25 08:45:14,449 INFO [io.deb.jdb.JdbcConnection] (pool-11-thread-1) Connection gracefully closed
2022-03-25 08:45:14,451 INFO [io.deb.jdb.JdbcConnection] (pool-12-thread-1) Connection gracefully closed
2022-03-25 08:45:14,484 INFO [io.deb.con.pos.con.PostgresReplicationConnection] (debezium-postgresconnector-debezium-postgres-change-event-source-coordinator) Initializing PgOutput logical decoder publication
2022-03-25 08:45:14,499 INFO [io.deb.uti.Threads] (debezium-postgresconnector-debezium-postgres-change-event-source-coordinator) Requested thread factory for connector PostgresConnector, id = debezium-postgres named = keep-alive
2022-03-25 08:45:14,499 INFO [io.deb.uti.Threads] (debezium-postgresconnector-debezium-postgres-change-event-source-coordinator) Creating thread debezium-postgresconnector-debezium-postgres-keep-alive
2022-03-25 08:45:14,500 INFO [io.deb.con.pos.PostgresStreamingChangeEventSource] (debezium-postgresconnector-debezium-postgres-change-event-source-coordinator) Processing messages
2022-03-25 08:45:15,515 INFO [io.deb.con.pos.con.WalPositionLocator] (debezium-postgresconnector-debezium-postgres-change-event-source-coordinator) Message with LSN 'LSN{0/16330180}' arrived, switching off the filtering
2022-03-25 08:47:01,552 INFO [io.deb.ser.DebeziumServer] (main) Received request to stop the engine
2022-03-25 08:47:01,554 INFO [io.deb.emb.EmbeddedEngine] (main) Stopping the embedded engine
2022-03-25 08:47:01,555 INFO [io.deb.emb.EmbeddedEngine] (main) Waiting for PT5M for connector to stop
Stream closed EOF for debezium/debezium-0 (debezium)
This is the application.properties file:
debezium.sink.pravega.scope=''
debezium.sink.type=pubsub
debezium.sink.pubsub.ordering.enabled=false
debezium.format.value=json
debezium.format.value.schemas.enable=false
debezium.sink.pubsub.project.id=xxxxxxxxxx
debezium.source.connector.class=io.debezium.connector.postgresql.PostgresConnector
debezium.source.offset.storage.file.filename=data/offsets.dat
debezium.source.offset.flush.interval.ms=0
debezium.source.database.hostname=xxx.xxx.xxx.xxx
debezium.source.database.port=5432
debezium.source.database.user=replication_user
debezium.source.database.password=secret
debezium.source.database.dbname=debezium-test
debezium.source.database.server.name=xxxxxxxxxx
debezium.source.table.include.list=public.test_table
debezium.source.plugin.name=pgoutput
What should I do next? Any suggestions? Please help, thank you very much!
We have created a service account and bound it to the Google service account using Workload Identity.
The expected result is that the Debezium sink to Pub/Sub works correctly, i.e. messages show up when pulling the subscription.
This is usually a problem where Debezium tries to obtain an access token from one of the Google APIs but cannot get one, for example because the service account doesn't have permission to generate tokens. One way to verify this is to set the property quarkus.log.level=DEBUG, which will surface the error; another way is to generate a credentials file and pass it into the container if you are running it from Docker:
docker run -p 80:80 -v "$HOME/.config/gcloud/application_default_credentials.json":/gcp/creds.json:ro --env GOOGLE_APPLICATION_CREDENTIALS=/gcp/creds.json --network host deb-server:v1
These should help you identify the underlying problem and resolve it.
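As a concrete starting point, here is a minimal sketch of the checks you could run with gcloud, assuming placeholder names PROJECT_ID, GSA_EMAIL (the Google service account bound via Workload Identity) and a subscription called test-subscription; replace them with your actual values:

# List the roles currently granted to the service account on the project
gcloud projects get-iam-policy PROJECT_ID \
  --flatten="bindings[].members" \
  --filter="bindings.members:GSA_EMAIL" \
  --format="table(bindings.role)"

# Grant the Pub/Sub publisher role if it is missing
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:GSA_EMAIL" \
  --role="roles/pubsub.publisher"

# Verify the Workload Identity binding (the Kubernetes service account
# must hold roles/iam.workloadIdentityUser on GSA_EMAIL)
gcloud iam service-accounts get-iam-policy GSA_EMAIL

# Check whether anything has arrived on the subscription
gcloud pubsub subscriptions pull test-subscription --limit=5 --auto-ack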
Related
I have 2 members in a Patroni cluster (1 master and 1 replica). In the logs I saw a problem after the master reconnected to a new etcd server:
ERROR: Request to server http://etcd2:2379 failed: MaxRetryError('HTTPConnectionPool(host=\'etcd2\', port=2379): Max retries exceeded with url: /v2/keys/patroni/patroni-cluster/?recursive=true (Caused by ReadTimeoutError("HTTPConnectionPool(host=\'etcd2\', port=2379): Read timed out. (read timeout=3.333078201239308)"))')
INFO: Reconnection allowed, looking for another server.
INFO: Retrying on http://etcd1:2379
INFO: Selected new etcd server http://etcd1:2379
INFO: Lock owner: patroni2; I am patroni1
INFO: does not have lock
INFO: Reaped pid=3098484, exit status=0
LOG: received immediate shutdown request
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
After this, the replica node became the master:
INFO: Got response from patroni1 http://0.0.0.0:8008/patroni: {"state": "running", "postmaster_start_time": "2021-08-09 14:43:18.372 UTC", "role": "replica", "server_version": 120003, "cluster_unlocked": true, "xlog": {"received_location": 139045264096, "replayed_location": 139045264096, "replayed_timestamp": "2021-09-27 15:03:10.389 UTC", "paused": false}, "timeline": 30, "database_system_identifier": "6904244251638517787", "patroni": {"version": "1.6.5", "scope": "patroni-cluster"}}
WARNING: Could not activate Linux watchdog device: "Can't open watchdog device: [Errno 2] No such file or directory: '/dev/watchdog'"
INFO: promoted self to leader by acquiring session lock
server promoting
LOG: received promote request
INFO: Lock owner: patroni2; I am patroni2
INFO: no action. i am the leader with the lock
ERROR: replication slot "patroni1" does not exist
ERROR: replication slot "patroni1" does not exist
INFO: acquired session lock as a leader
As you can see above, the new master cannot see patroni1 now. After several attempts to recover WAL, patroni1 wrote the logs below:
INFO: establishing a new patroni connection to the postgres cluster
INFO: My wal position exceeds maximum replication lag
INFO: following a different leader because i am not the healthiest node
INFO: My wal position exceeds maximum replication lag
This log output does not change over time: patroni2 keeps writing acquired session lock as a leader and patroni1 keeps writing My wal position exceeds maximum replication lag.
But I can't see them in the Patroni cluster when I use the patronictl -c /patroni.yml list command.
What is the best way to bring them back into the cluster?
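A minimal sketch of the commands typically used here, assuming the scope patroni-cluster seen in the logs and the /patroni.yml config; note that reinit wipes the member's data directory and re-copies it from the leader, so treat it as a last resort:

# Show which members the DCS currently knows about, and their lag
patronictl -c /patroni.yml list

# Re-initialize the lagging member from the current leader
patronictl -c /patroni.yml reinit patroni-cluster patroni1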
I am trying to migrate my Django/Celery app from Nomad (HashiCorp) to Kubernetes, and jobs decorated with @shared_task() only complete about 15 minutes after the message is received.
I don't see anything unusual in stats or status, and the Redis connection is OK.
I can see the task in Flower, but it stays in the Started state for about 15 minutes:
Received 2021-09-28 20:30:56.387649 UTC
Started 2021-09-28 20:30:56.390532 UTC
Succeeded 2021-09-28 20:46:00.556030 UTC
Received 2021-09-28 21:18:43.436750 UTC
Started 2021-09-28 21:18:43.441041 UTC
Succeeded 2021-09-28 21:33:49.391542 UTC
Celery version is 4.4.2
Any resolution to this problem?
Fixed: the delay was caused by a Redis key cached with SETEX (a key stored with a TTL).
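For reference, SETEX simply stores a key together with a TTL in seconds; a minimal illustration via redis-cli, with a hypothetical key name:

# Cache a value for 15 minutes (900 seconds)
redis-cli SETEX some:cache:key 900 "cached-value"

# See how long the key has left before it expires
redis-cli TTL some:cache:key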
thanks
I had a Docker container running TimescaleDB. The database data was stored outside the container:
docker run -d --name timescale -v /<DATA>:/var/lib/postgresql/data timescale/timescaledb-postgis:latest-pg10
Something strange happened recently: I logged in and saw that all the databases had suddenly vanished.
I see the following in the log file:
2021-03-13 11:32:00.215 UTC [21] LOG: database system was interrupted; last known up at 2021-03-11 16:16:19 UTC
2021-03-13 11:32:00.242 UTC [21] LOG: database system was not properly shut down; automatic recovery in progress
2021-03-13 11:32:00.243 UTC [21] LOG: redo starts at 0/15C1270
2021-03-13 11:32:00.243 UTC [21] LOG: invalid record length at 0/15C12A8: wanted 24, got 0
2021-03-13 11:32:00.243 UTC [21] LOG: redo done at 0/15C1270
2021-03-13 11:32:00.247 UTC [8] LOG: database system is ready to accept connections
2021-03-13 20:33:10.424 UTC [31] LOG: could not receive data from client: Operation timed out
2021-03-13 20:33:10.424 UTC [29] LOG: could not receive data from client: Operation timed out
Does that mean the database has been corrupted? If so, is there a way to recover it somehow? The container has been running for 3 years without a problem, and suddenly there is this unexpected loss of databases.
Thanks
Yes, the database was corrupted, but it was recovered by the automatic recovery process. It looks like the database system started working, since it logged database system is ready to accept connections. This means that the WAL recovery completed properly (which doesn't necessarily mean that the database files are fully consistent).
When the database is shut down abruptly, there is a small chance of file-level corruption as well. The good news is that I don't see anything in the log after the recovery that suggests this is the case; however, you should still have a backup of the files.
The next log message, could not receive data from client: Operation timed out, is not related to the recovery; it is caused by a client application that terminated without properly closing its connection.
You can find more information on corruption and its causes in the PostgreSQL wiki.
If you depend on the data in the database, always keep a backup. The easiest way is to use pg_dumpall, which dumps the data in plain-text format as a series of SQL statements, so you will be able to import it into later versions of PostgreSQL.
So my recommendation: before you do anything else with it, STOP THE CONTAINER AND TAKE A BACKUP OF THE FILES. Recovery is a trial-and-error process, and you will need a fresh copy of the files to try different things. After you have done this, export the data with pg_dumpall; if that succeeds, you can resume normal operation of the database.
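A minimal sketch of that sequence, assuming the container name and bind mount from the docker run command above and the default postgres superuser (adjust names and paths to your setup):

# 1. Stop the container so the data files are no longer being written to
docker stop timescale

# 2. Take a file-level copy of the data directory (the /<DATA> bind mount)
cp -a /<DATA> /<DATA>_backup_$(date +%F)

# 3. Start the container again and export everything as plain SQL
docker start timescale
docker exec timescale pg_dumpall -U postgres > full_dump.sql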
After an unfortunate docker outage, one of our test channels has stopped working properly. This is the output of a previously working “peer chaincode invoke” command:
Error: error sending transaction for invoke: got unexpected status: SERVICE_UNAVAILABLE -- will not enqueue, consenter for this channel hasn't started yet - proposal response: version:1 response:<status:200 payload:"be85bda14845a33cd07db9825d2e473dc65902e6986fdfccea30d8c32385f758" > payload:"\n \364\242+\t\222\216\361\020\024}d7\203\277WY04\233\225vA\376u\330r\2045\312\206\304\333\022\211\001\n3\022\024\n\004lscc\022\014\n\n\n\004strs\022\002\010\004\022\033\n\004strs\022\023\032\021\n\tkeepalive\032\004ping\032E\010\310\001\032#be85bda14845a33cd07db9825d2e473dc65902e6986fdfccea30d8c32385f758\"\013\022\004strs\032\0031.0" endorsement:<endorser:"\n\013BackboneMSP\022\203\007-----BEGIN CERTIFICATE----- etc. more output removed from here
I find this in the orderer's log:
2019-07-29 14:46:50.930 UTC [orderer/consensus/kafka] try -> DEBU 3c10 [channel: steel] Connecting to the Kafka cluster
2019-07-29 14:46:50.931 UTC [orderer/consensus/kafka] try -> DEBU 3c11 [channel: steel] Need to retry because process failed = kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.
2019-07-29 14:46:56.967 UTC [common/deliver] Handle -> WARN 3c12 Error reading from 10.0.0.4:32800: rpc error: code = Canceled desc = context canceled
2019-07-29 14:46:56.967 UTC [orderer/common/server] func1 -> DEBU 3c13 Closing Deliver stream
2019-07-29 14:46:56.972 UTC [orderer/common/server] Deliver -> DEBU 3c14 Starting new Deliver handler
2019-07-29 14:46:56.972 UTC [common/deliver] Handle -> DEBU 3c15 Starting new deliver loop for 10.0.0.4:32802
2019-07-29 14:46:56.973 UTC [common/deliver] Handle -> DEBU 3c16 Attempting to read seek info message from 10.0.0.4:32802
2019-07-29 14:46:56.973 UTC [common/deliver] deliverBlocks -> WARN 3c17 [channel: steel] Rejecting deliver request for 10.0.0.4:32802 because of consenter error
2019-07-29 14:46:56.973 UTC [common/deliver] Handle -> DEBU 3c18 Waiting for new SeekInfo from 10.0.0.4:32802
2019-07-29 14:46:56.973 UTC [common/deliver] Handle -> DEBU 3c19 Attempting to read seek info message from 10.0.0.4:32802
2019-07-29 14:46:56.995 UTC [common/deliver] Handle -> WARN 3c1a Error reading from 10.0.0.23:49844: rpc error: code = Canceled desc = context canceled
2019-07-29 14:46:56.995 UTC [orderer/common/server] func1 -> DEBU 3c1b Closing Deliver stream
And this is from the endorser peer’s log:
2019-07-29 15:14:17.829 UTC [ConnProducer] DisableEndpoint -> WARN 3d6 Only 1 endpoint remained, will not black-list it
2019-07-29 15:14:17.834 UTC [blocksProvider] DeliverBlocks -> WARN 3d7 [steel] Got error &{SERVICE_UNAVAILABLE}
2019-07-29 15:14:27.839 UTC [blocksProvider] DeliverBlocks -> WARN 3d8 [steel] Got error &{SERVICE_UNAVAILABLE}
I use these docker images:
hyperledger/fabric-kafka:0.4.10
hyperledger/fabric-orderer:1.2.0
hyperledger/fabric-peer:1.2.0
Based on the above, I assume that the consistency between the orderer and the corresponding Kafka topic is broken. It also doesn't help if I redirect requests to another orderer or force a change of the Kafka topic leader. Is it correct that if KAFKA_LOG_RETENTION_MS=-1 had been set, this error would probably have been prevented?
After reviewing the archives, I found that it is not possible to fix this error. As I see it, I can't shut down only one channel, and I even have to stop all the peers subscribed to the channel if I don't want continuous error messages in the orderer logs. What is the best practice in cases like mine?
Regards,
Sandor
consenter error

This means you are trying to perform operations before the connection between Kafka and the orderer has been established, i.e. there is an error somewhere between Kafka and the orderer. Note: you have probably set up a connection that is not stable.

Check the orderer logs: whenever Kafka and the orderer connect, the orderer logs a message saying it successfully posted to the topic. If you see that message, the connection is configured correctly, so make sure the connection between Kafka and the orderer is set up properly.

2019-07-29 14:46:50.931 UTC [orderer/consensus/kafka] try -> DEBU 3c11 [channel: steel] Need to retry because process failed = kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.

Based on the clue above, the problem lies entirely with Kafka and has nothing to do with the orderer itself. Try checking that first.
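As a rough sketch of how to look for those messages, assuming the orderer container is named orderer.example.com and the standard Kafka CLI scripts are available on a broker (adjust names to your deployment):

# Look for Kafka connection/retry messages for the channel in the orderer log
docker logs orderer.example.com 2>&1 | grep -iE "kafka|channel: steel"

# On a Kafka broker: the oldest offset still retained for the channel topic (--time -2 = earliest)
kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic steel --time -2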
I was setting up a new Hyperledger Fabric cluster on EKS. The cluster has 4 Kafka nodes, 3 ZooKeeper nodes, 4 peers, 3 orderers, and 1 CA. All the containers come up individually, and the Kafka/ZooKeeper backend is also stable: I can SSH into any Kafka/ZooKeeper node, check for connections to any other nodes, create topics, post messages, etc. Kafka is reachable via Telnet from all orderers.
When I try to create a channel I get the following error from the orderer:
2019-04-25 13:34:17.660 UTC [orderer.common.broadcast] ProcessMessage -> WARN 025 [channel: channel1] Rejecting broadcast of message from 192.168.94.15:53598 with SERVICE_UNAVAILABLE: rejected by Consenter: backing Kafka cluster has not completed booting; try again later
2019-04-25 13:34:17.660 UTC [comm.grpc.server] 1 -> INFO 026 streaming call completed grpc.service=orderer.AtomicBroadcast grpc.method=Broadcast grpc.peer_address=192.168.94.15:53598 grpc.code=OK grpc.call_duration=14.805833ms
2019-04-25 13:34:17.661 UTC [common.deliver] Handle -> WARN 027 Error reading from 192.168.94.15:53596: rpc error: code = Canceled desc = context canceled
2019-04-25 13:34:17.661 UTC [comm.grpc.server] 1 -> INFO 028 streaming call completed grpc.service=orderer.AtomicBroadcast grpc.method=Deliver grpc.peer_address=192.168.94.15:53596 error="rpc error: code = Canceled desc = context canceled" grpc.code=Canceled grpc.call_duration=24.987468ms
And the Kafka leader reports the following error:
[2019-04-25 14:07:09,453] WARN [SocketServer brokerId=2] Unexpected error from /192.168.89.200; closing connection (org.apache.kafka.common.network.Selector)
org.apache.kafka.common.network.InvalidReceiveException: Invalid receive (size = 369295617 larger than 104857600)
at org.apache.kafka.common.network.NetworkReceive.readFromReadableChannel(NetworkReceive.java:132)
at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:93)
at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:231)
at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:192)
at org.apache.kafka.common.network.Selector.attemptRead(Selector.java:528)
at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:469)
at org.apache.kafka.common.network.Selector.poll(Selector.java:398)
at kafka.network.Processor.poll(SocketServer.scala:535)
at kafka.network.Processor.run(SocketServer.scala:452)
at java.lang.Thread.run(Thread.java:748)
[2019-04-25 14:13:53,917] INFO [GroupMetadataManager brokerId=2] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
The error indicates that the broker is receiving requests larger than the permitted maximum size, which defaults to ~100 MB. Try increasing the following property in the server.properties file so that it can fit the larger receive (in this case at least 369295617 bytes):
# Set to 500MB
socket.request.max.bytes=500000000
and then restart your Kafka Cluster.
If this doesn't work for you, then my guess is that a TLS client is connecting to a non-SSL listener (the reported size 369295617 is 0x16030101, which is exactly what a broker sees when the first bytes of a TLS handshake are misread as a request length). In that case, verify that the broker's SSL listener is on port 9092 (or the corresponding port if you are not using the default one). The following should do the trick:
listeners=SSL://:9092
advertised.listeners=SSL://:9092
inter.broker.listener.name=SSL
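One way to verify what the listener is actually speaking, assuming the broker is reachable as kafka0 (replace with your broker's hostname): against a TLS listener openssl prints the certificate chain, while against a plaintext listener the handshake fails almost immediately.

# Test whether port 9092 answers with a TLS handshake
openssl s_client -connect kafka0:9092 </dev/null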