Celery on Kubernetes executes task 15 minutes after receiving - Kubernetes

I'm trying to migrate my Django/Celery app from Nomad (HashiCorp) to Kubernetes, and jobs decorated with @shared_task() only complete about 15 minutes after the message is received.
I don't see anything unusual in stats or status, and the Redis connection is OK.
I can see the task in Flower, but it stays in the started state for 15 minutes:
Received 2021-09-28 20:30:56.387649 UTC
Started 2021-09-28 20:30:56.390532 UTC
Succeeded 2021-09-28 20:46:00.556030 UTC
Received 2021-09-28 21:18:43.436750 UTC
Started 2021-09-28 21:18:43.441041 UTC
Succeeded 2021-09-28 21:33:49.391542 UTC
The Celery version is 4.4.2.
Is there any resolution to this problem?

Fixed: it was caused by a Redis cache key set with SETEX.
Thanks
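
For anyone hitting the same symptom, here is a minimal diagnostic sketch with redis-cli; the key name my_task_lock and the <redis-host> placeholder are hypothetical and must be replaced with your own cache key and broker host:

# Check how long the suspect cache key lives; a TTL near 900 seconds
# (15 minutes) would match the observed delay exactly.
redis-cli -h <redis-host> TTL my_task_lock
# SETEX is what creates such a key with a fixed expiry, e.g. 900 s:
redis-cli -h <redis-host> SETEX my_task_lock 900 locked
# Deleting the key releases the task immediately:
redis-cli -h <redis-host> DEL my_task_lock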

Related

Debezium Pub/Sub Sink Not Working, Messages Don't Arrive

I need help: I just tried to create a PostgreSQL -> Debezium -> Pub/Sub -> BigQuery pipeline.
Debezium seems to receive updates from PostgreSQL fine, but when I try to pull the Pub/Sub subscription I can't find any messages.
This is the log from Debezium:
2022-03-25 08:44:13,053 INFO [io.deb.con.pos.PostgresSchema] (debezium-postgresconnector-debezium-postgres-change-event-source-coordinator) REPLICA IDENTITY for 'public.test_table' is 'DEFAULT'; UPDATE and DELETE events will contain previous values only for PK columns
2022-03-25 08:44:13,054 INFO [io.deb.pip.sou.sna.inc.AbstractIncrementalSnapshotChangeEventSource] (debezium-postgresconnector-debezium-postgres-change-event-source-coordinator) No incremental snapshot in progress, no action needed on start
2022-03-25 08:44:13,055 INFO [io.deb.con.pos.PostgresStreamingChangeEventSource] (debezium-postgresconnector-debezium-postgres-change-event-source-coordinator) Retrieved latest position from stored offset 'LSN{0/1632FE58}'
2022-03-25 08:44:13,056 INFO [io.deb.con.pos.con.WalPositionLocator] (debezium-postgresconnector-debezium-postgres-change-event-source-coordinator) Looking for WAL restart position for last commit LSN 'null' and last change LSN 'LSN{0/1632FE58}'
2022-03-25 08:44:13,056 INFO [io.deb.con.pos.con.PostgresReplicationConnection] (debezium-postgresconnector-debezium-postgres-change-event-source-coordinator) Initializing PgOutput logical decoder publication
2022-03-25 08:44:13,136 INFO [io.deb.con.pos.con.PostgresConnection] (debezium-postgresconnector-debezium-postgres-change-event-source-coordinator) Obtained valid replication slot ReplicationSlot [active=false, latestFlushedLsn=LSN{0/12106F78}, catalogXmin=124135]
2022-03-25 08:44:13,142 INFO [io.deb.jdb.JdbcConnection] (pool-8-thread-1) Connection gracefully closed
2022-03-25 08:44:13,162 INFO [io.deb.uti.Threads] (debezium-postgresconnector-debezium-postgres-change-event-source-coordinator) Requested thread factory for connector PostgresConnector, id = debezium-postgres named = keep-alive
2022-03-25 08:44:13,163 INFO [io.deb.uti.Threads] (debezium-postgresconnector-debezium-postgres-change-event-source-coordinator) Creating thread debezium-postgresconnector-debezium-postgres-keep-alive
2022-03-25 08:44:13,177 INFO [io.deb.con.pos.PostgresSchema] (debezium-postgresconnector-debezium-postgres-change-event-source-coordinator) REPLICA IDENTITY for 'public.test_table' is 'DEFAULT'; UPDATE and DELETE events will contain previous values only for PK columns
2022-03-25 08:44:13,179 INFO [io.deb.con.pos.PostgresStreamingChangeEventSource] (debezium-postgresconnector-debezium-postgres-change-event-source-coordinator) Searching for WAL resume position
2022-03-25 08:45:14,447 INFO [io.deb.con.pos.con.WalPositionLocator] (debezium-postgresconnector-debezium-postgres-change-event-source-coordinator) First LSN 'LSN{0/16330180}' received
2022-03-25 08:45:14,448 INFO [io.deb.con.pos.PostgresStreamingChangeEventSource] (debezium-postgresconnector-debezium-postgres-change-event-source-coordinator) WAL resume position 'LSN{0/16330180}' discovered
2022-03-25 08:45:14,449 INFO [io.deb.jdb.JdbcConnection] (pool-11-thread-1) Connection gracefully closed
2022-03-25 08:45:14,451 INFO [io.deb.jdb.JdbcConnection] (pool-12-thread-1) Connection gracefully closed
2022-03-25 08:45:14,484 INFO [io.deb.con.pos.con.PostgresReplicationConnection] (debezium-postgresconnector-debezium-postgres-change-event-source-coordinator) Initializing PgOutput logical decoder publication
2022-03-25 08:45:14,499 INFO [io.deb.uti.Threads] (debezium-postgresconnector-debezium-postgres-change-event-source-coordinator) Requested thread factory for connector PostgresConnector, id = debezium-postgres named = keep-alive
2022-03-25 08:45:14,499 INFO [io.deb.uti.Threads] (debezium-postgresconnector-debezium-postgres-change-event-source-coordinator) Creating thread debezium-postgresconnector-debezium-postgres-keep-alive
2022-03-25 08:45:14,500 INFO [io.deb.con.pos.PostgresStreamingChangeEventSource] (debezium-postgresconnector-debezium-postgres-change-event-source-coordinator) Processing messages
2022-03-25 08:45:15,515 INFO [io.deb.con.pos.con.WalPositionLocator] (debezium-postgresconnector-debezium-postgres-change-event-source-coordinator) Message with LSN 'LSN{0/16330180}' arrived, switching off the filtering
2022-03-25 08:47:01,552 INFO [io.deb.ser.DebeziumServer] (main) Received request to stop the engine
2022-03-25 08:47:01,554 INFO [io.deb.emb.EmbeddedEngine] (main) Stopping the embedded engine
2022-03-25 08:47:01,555 INFO [io.deb.emb.EmbeddedEngine] (main) Waiting for PT5M for connector to stop
Stream closed EOF for debezium/debezium-0 (debezium)
This is the application.properties file:
debezium.sink.pravega.scope=''
debezium.sink.type=pubsub
debezium.sink.pubsub.ordering.enabled=false
debezium.format.value=json
debezium.format.value.schemas.enable=false
debezium.sink.pubsub.project.id=xxxxxxxxxx
debezium.source.connector.class=io.debezium.connector.postgresql.PostgresConnector
debezium.source.offset.storage.file.filename=data/offsets.dat
debezium.source.offset.flush.interval.ms=0
debezium.source.database.hostname=xxx.xxx.xxx.xxx
debezium.source.database.port=5432
debezium.source.database.user=replication_user
debezium.source.database.password=secret
debezium.source.database.dbname=debezium-test
debezium.source.database.server.name=xxxxxxxxxx
debezium.source.table.include.list=public.test_table
debezium.source.plugin.name=pgoutput
What should I do next? Any suggestions? Please help, thank you very much!
We have created a service account and bound it to a Google service account using Workload Identity.
The expected result is that the Debezium sink to Pub/Sub works correctly, indicated by messages appearing when pulling the subscription.
This is usually a problem where Debezium tries to obtain an access token from one of the Google APIs and cannot get one, for example because the service account doesn't have permission to generate tokens. One way to verify is to set the property quarkus.log.level=DEBUG, which will surface the error; another is to generate a credentials file and pass it to the container if you are running it from Docker:
docker run -p 80:80 -v "$HOME/.config/gcloud/application_default_credentials.json":/gcp/creds.json:ro --env GOOGLE_APPLICATION_CREDENTIALS=/gcp/creds.json --network host deb-server:v1
These should help you identify the underlying problem and resolve it.
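As a complement, a hedged way to test both halves of that theory when using Workload Identity; this assumes curl is available in the Debezium image, the pod name debezium-0 is taken from the log above, and <your-subscription> is a placeholder:

# From inside the pod, ask the GKE metadata server for an access token;
# an error here points at the service-account / Workload Identity binding.
kubectl exec -it debezium-0 -- curl -s -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token"
# Independently confirm whether anything reaches the subscription:
gcloud pubsub subscriptions pull <your-subscription> --auto-ack --limit=5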

Docker Postgres database corrupted

I had a Docker container running TimescaleDB. The database data was stored outside the container:
docker run -d --name timescale -v /<DATA>:/var/lib/postgresql/data timescale/timescaledb-postgis:latest-pg10
Something strange happened recently: I logged in and saw that all the databases had suddenly vanished.
I see the following in the log file:
2021-03-13 11:32:00.215 UTC [21] LOG: database system was interrupted; last known up at 2021-03-11 16:16:19 UTC
2021-03-13 11:32:00.242 UTC [21] LOG: database system was not properly shut down; automatic recovery in progress
2021-03-13 11:32:00.243 UTC [21] LOG: redo starts at 0/15C1270
2021-03-13 11:32:00.243 UTC [21] LOG: invalid record length at 0/15C12A8: wanted 24, got 0
2021-03-13 11:32:00.243 UTC [21] LOG: redo done at 0/15C1270
2021-03-13 11:32:00.247 UTC [8] LOG: database system is ready to accept connections
2021-03-13 20:33:10.424 UTC [31] LOG: could not receive data from client: Operation timed out
2021-03-13 20:33:10.424 UTC [29] LOG: could not receive data from client: Operation timed out
Does that mean the database is corrupted? If so, is there a way to recover it? The container had been running for 3 years without a problem, and then this sudden, unexpected loss of the databases.
Thanks
Yes, the database was corrupted, but it was repaired by the automatic recovery process. It looks like the database system came back up, since it logged this message: database system is ready to accept connections. That means the WAL recovery completed properly (which doesn't guarantee that the data files are fully consistent).
When a database is shut down abruptly, there is a small chance of file-level corruption as well. The good news is that I don't see anything in the log after the recovery suggesting that this is the case; still, you need to have a backup of the files.
The next log message, could not receive data from client: Operation timed out, is not related to recovery; it comes from a client application that terminated without properly closing its connection.
You can find more information on corruption and its causes in the PostgreSQL wiki.
If you depend on the data in the database, always keep backups. The easiest way is pg_dumpall, which dumps the data as a series of plain-text SQL statements, so you will be able to import the data into later versions of PostgreSQL.
So my recommendation, before you do anything else with it: STOP THE CONTAINER AND TAKE A BACKUP OF THE FILES. Recovery is a trial-and-error process, and you will need a fresh copy of the files to try different things. After you do this, export the data with pg_dumpall. If that succeeds, you can resume normal operation of the database.
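The sequence above, expressed as commands; a sketch only: timescale is the container name from the question, /<DATA> is the host data directory, and the postgres superuser is assumed:

# 1. Stop the container and take a file-level backup first.
docker stop timescale
tar czf timescale-data-backup.tar.gz /<DATA>
# 2. Start it again and export everything as plain SQL.
docker start timescale
docker exec -t timescale pg_dumpall -U postgres > dump.sql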

PSQL TimescaleDB, ERROR: the database system is in recovery mode

We have an application pipeline and PostgreSQL 12 (TimescaleDB, managed through Patroni) on a separate server (a VM with Ubuntu 18.04 LTS).
We are facing an issue with the DB: it suddenly gets stuck in recovery mode, we can't access it from the psql client, and SELECT queries hang.
After an hour or so everything went back to normal (once my current pipeline terminated) and we were able to run queries against the DB server again.
Master DB error details:
2020-11-03 18:35:08.612 IST [9773] [unknown]#[unknown] LOG: connection received: host=x.x.x.x port=58780
2020-11-03 18:35:08.612 IST [9773] FATAL: the database system is in recovery mode
2020-11-03 18:35:08.596 IST [18276] LOG: could not send data to client: Broken pipe
Replica server error details:
2020-11-03 18:34:55 IST [18316]: [85649-1] user=postgres,db=postgres,app=[unknown],client=x.x.x.x LOG: duration: 10.228 ms statement: SELECT * FROM pg_stat_bgwriter;
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-03 18:35:08 IST [18322]: [2-1] user=,db=,app=,client= FATAL: could not receive data from WAL stream: SSL SYSCALL error: EOF detected
2020-11-03 18:35:08 IST [20500]: [1-1] user=,db=,app=,client= FATAL: could not connect to the primary server: FATAL: the database system is in recovery mode
FATAL: the database system is in recovery mode
Pipeline error details:
Job aborted due to stage failure: Task 4 in stage 0.0 failed 3 times, most recent failure: Lost task 4.2 in stage 0.0 (TID 29, ip-x-x-x-x.ap-southeast-1.compute.internal, executor 19): org.postgresql.util.PSQLException: FATAL: the database system is in recovery mode
at org.postgresql.core.v3.ConnectionFactoryImpl.doAuthentication(ConnectionFactoryImpl.java:514)
at org.postgresql.core.v3.ConnectionFactoryImpl.tryConnect(ConnectionFactoryImpl.java:141)
at org.postgresql.core.v3.ConnectionFactoryImpl.openConnectionImpl(ConnectionFactoryImpl.java:192)
at org.postgresql.core.ConnectionFactory.openConnection(ConnectionFactory.java:49)
at org.postgresql.jdbc.PgConnection.<init>(PgConnection.java:195)
at org.postgresql.Driver.makeConnection(Driver.java:454)
at org.postgresql.Driver.connect(Driver.java:256)
at org.apache.spark.sql.execution.datasources.jdbc.DriverWrapper.connect(DriverWrapper.scala:45)
Can anyone please advise on this issue?
What version of TimescaleDB are you running? In particular, there were some issues with 1.7.x when querying a read replica; we recommend upgrading to 1.7.4.
(Otherwise, there's not much information here to suggest what might have happened.)
https://github.com/timescale/timescaledb/releases/tag/1.7.4
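If it helps, a hedged way to check the installed extension version and, once the 1.7.4 packages are in place, update it; this assumes psql access as the postgres user:

# Show the TimescaleDB version the database is actually running.
psql -U postgres -d postgres -c "SELECT extversion FROM pg_extension WHERE extname = 'timescaledb';"
# After upgrading the packages, update the extension (run with -X so no
# psqlrc commands interfere, as the TimescaleDB docs recommend).
psql -U postgres -d postgres -X -c "ALTER EXTENSION timescaledb UPDATE;"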

Hyperledger Fabric in Kubernetes: Not able to instantiate chaincode

Hello everyone, I am working on setting up the Fabric default first-network in Kubernetes, but when I instantiate the chaincode it gives me an error. Please check my peer logs below:
2019-07-22 07:25:02.134 UTC [endorser] SimulateProposal -> ERRO 066 [mychannel][c4b4e2ae] failed to invoke chaincode name:"lscc" , error: container exited with 0
github.com/hyperledger/fabric/core/chaincode.(*RuntimeLauncher).Launch.func1
/opt/gopath/src/github.com/hyperledger/fabric/core/chaincode/runtime_launcher.go:63
runtime.goexit
/opt/go/src/runtime/asm_amd64.s:1333
chaincode registration failed
I am getting this error on the CLI:
2019-07-22 07:24:58.263 UTC [chaincodeCmd] checkChaincodeCmdParams -> INFO 001 Using default escc
2019-07-22 07:24:58.264 UTC [chaincodeCmd] checkChaincodeCmdParams -> INFO 002 Using default vscc
Error: could not assemble transaction, err proposal response was not successful, error code 500, msg chaincode registration failed: container exited with 0
First check whether all your Docker containers are up and running. If you are simply running the sample network without making any changes to the smart contract or the Docker files, you can stop the network and start it fresh (that worked in my case).
I checked my configuration files; it was due to a wrong CORE_PEER_CHAINCODELISTENADDRESS environment variable value for the peer.
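For reference, a sketch of the peer environment involved; the values here are illustrative (a hypothetical service name and the conventional chaincode port), not authoritative for your deployment, and the listen address must be reachable from the chaincode container:

# Hypothetical values; adjust to your Kubernetes service names.
export CORE_PEER_CHAINCODELISTENADDRESS=0.0.0.0:7052
export CORE_PEER_CHAINCODEADDRESS=peer0-org1:7052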

What is the purpose of the `pg_logical` directory inside PostgreSQL data?

I've just stumbled upon this error while testing failover of a PostgreSQL 9.4 cluster I've set up. Here I'm trying to promote a slave to be the new master:
$ repmgr -f /etc/repmgr/repmgr.conf --verbose standby promote
2014-09-22 10:46:37 UTC LOG: database system shutdown was interrupted; last known up at 2014-09-22 10:44:02 UTC
2014-09-22 10:46:37 UTC LOG: database system was not properly shut down; automatic recovery in progress
2014-09-22 10:46:37 UTC LOG: redo starts at 0/18000028
2014-09-22 10:46:37 UTC LOG: consistent recovery state reached at 0/19000600
2014-09-22 10:46:37 UTC LOG: record with zero length at 0/1A000090
2014-09-22 10:46:37 UTC LOG: redo done at 0/1A000028
2014-09-22 10:46:37 UTC LOG: last completed transaction was at log time 2014-09-22 10:36:22.679806+00
2014-09-22 10:46:37 UTC FATAL: could not open directory "pg_logical/snapshots": No such file or directory
2014-09-22 10:46:37 UTC LOG: startup process (PID 2595) exited with exit code 1
2014-09-22 10:46:37 UTC LOG: aborting startup due to startup process failure
The pg_logical/snapshots dir in fact exists on the master node, and it is empty.
UPDATE: I've just manually created the empty directories pg_logical/snapshots and pg_logical/mappings, and the server started without complaining. repmgr standby clone seems to omit these dirs while syncing. But the question still remains, because I'm curious what this directory is for; maybe I'm missing something in my setup. Simply googling it did not yield any meaningful results.
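For reference, the manual workaround expressed as commands; a sketch that assumes $PGDATA points at the standby's data directory and that the server runs as the postgres user:

# Recreate the directories that the clone step skipped.
mkdir -p "$PGDATA/pg_logical/snapshots" "$PGDATA/pg_logical/mappings"
chown -R postgres:postgres "$PGDATA/pg_logical"
# Then start the server again, e.g. with pg_ctl.
pg_ctl -D "$PGDATA" start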
It's for the new logical changeset extraction / logical replication feature in 9.4.
This shouldn't happen, though; it suggests a significant bug somewhere, probably in repmgr. I'll wait for details (repmgr version, etc.).
Update: Confirmed, it's a repmgr bug. It's already fixed in git master (and was before this report) and will be in the next release, which had better be soon, given the significance of this issue.