PRECONDITION_FAILED: Delivery Acknowledgement Timeout on Celery & RabbitMQ with Gevent and Concurrency - Kubernetes

I just switched the pool method for my Celery workers running in Kubernetes pods from the prefork pool to gevent with a concurrency of 5. After the switch I've been getting a non-recoverable error in the worker:
amqp.exceptions.PreconditionFailed: (0, 0): (406) PRECONDITION_FAILED - delivery acknowledgement on channel 1 timed out. Timeout value used: 1800000 ms. This timeout value can be configured, see consumers doc guide to learn more
The broker logs give basically the same message:
2021-11-01 22:26:17.251 [warning] <0.18574.1> Consumer None4 on channel 1 has timed out waiting for delivery acknowledgement. Timeout used: 1800000 ms. This timeout value can be configured, see consumers doc guide to learn more
I have CELERY_ACK_LATE set, but I wasn't aware that a timeout for the acknowledgement period needed to be configured, and this never happened while using processes. Tasks can be fairly long (sometimes 60-120 seconds), but I can't find a specific setting to allow for that.
I've read a post on another forum from a user who set the timeout in the broker configuration to a huge number (around 24 hours) and was still having the same problem, which makes me think something else may be involved.
Any ideas or suggestions on how to make the worker more resilient?

For future reference, it seems that newer RabbitMQ versions (3.8+) introduced a tight default for consumer_timeout (the 1800000 ms in the error above is 30 minutes).
The solution I found (which was also added to the Celery docs not long ago) was simply to set consumer_timeout in RabbitMQ to a large value.
In a related question, someone mentions disabling the timeout by setting consumer_timeout to false, so that using a large number isn't needed, but apparently there are some specifics about the configuration format for that to work.
I'm running RabbitMQ in Kubernetes and just did something like this in the config:
rabbitmq.conf: |
  consumer_timeout = 31622400000
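For context, the Celery side of this setup might look roughly like the sketch below. The app name and broker URL are placeholders, not taken from the question; the relevant point is that with task_acks_late enabled, consumer_timeout has to be larger than the longest time a delivery can sit unacknowledged (queue/prefetch wait plus task runtime).

from celery import Celery

# Placeholder app and broker URL; only the two settings below matter here.
app = Celery("worker", broker="amqp://guest:guest@rabbitmq:5672//")

app.conf.update(
    task_acks_late=True,           # acknowledge only after the task finishes
    worker_prefetch_multiplier=1,  # limit how long a prefetched delivery waits unacked
)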

The accepted answer is correct. However, if you have an existing RabbitMQ server running and do not want to restart it, you can set the configuration value dynamically by running the following command on the RabbitMQ server:
rabbitmqctl eval 'application:set_env(rabbit, consumer_timeout, 36000000).'
This sets the new timeout to 10 hours (36000000 ms). For it to take effect you still need to restart your workers, though; existing worker connections will continue to use the old timeout.
You can check the current configured timeout value as well:
rabbitmqctl eval 'application:get_env(rabbit, consumer_timeout).'
If you are running RabbitMQ via a Docker image, here's how to set the value: simply add -e RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="-rabbit consumer_timeout 36000000" to your docker run command, or set the environment variable RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS to "-rabbit consumer_timeout 36000000".
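If you happen to start the broker from Python via the Docker SDK instead of docker run, the same environment override can be passed like this (a sketch only; the image tag and port mappings are placeholders):

import docker

client = docker.from_env()
client.containers.run(
    "rabbitmq:3.9-management",  # placeholder image tag
    detach=True,
    ports={"5672/tcp": 5672, "15672/tcp": 15672},
    environment={
        "RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS": "-rabbit consumer_timeout 36000000",
    },
)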
Hope this helps!

Related

RabbitMQ randomly disconnecting application consumers in a Kubernetes/Istio environment

Issue:
My company has recently moved workers from Heroku to Kubernetes. We previously used a Heroku-managed add-on (CloudAMQP) for our RabbitMQ brokers. This worked perfectly and we never saw issues with dropped consumer connections.
Now that our workloads live in Kubernetes deployments on separate node groups, we are seeing daily dropped consumer connections, which means messages are not being processed by our applications running in Kubernetes. Our new RabbitMQ brokers live in CloudAMQP but are not managed Heroku add-ons.
Errors on the consumer side just indicate an unexpected disconnect. No additional details.
No evident errors at the Istio Envoy proxy level.
We do not have an Istio egress, so no destination rules are set here.
No evident errors on the RabbitMQ server.
Remediation Attempts:
We have read all the Stack Overflow/GitHub issues covering the unexpected-disconnect errors we are seeing. Nothing we have found has remediated the issue.
Our first attempt at remediation was to set the heartbeat to 0 (disabling heartbeats) on our RabbitMQ server and consumer (a sketch of the consumer-side change is shown after this list). This did not fix anything; connections are still randomly dropping. CloudAMQP also suggests disabling heartbeats, because they rely heavily on TCP keepalive.
We created a message that simply logs on the consumer every five minutes, to keep the connection active. This has been a band-aid fix for whatever the real issue is. It is not perfect, but we have seen a reduction in disconnects.
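For reference, the consumer-side heartbeat change from the second attempt above would look roughly like this if the consumers use pika (the client library isn't stated in the question, so this is an assumption; the broker host is a placeholder):

import pika

# Sketch only: heartbeat=0 disables AMQP heartbeats entirely, which leaves
# idle-connection detection to TCP keepalive, i.e. the kernel/Istio settings
# discussed further down in this question.
params = pika.ConnectionParameters(
    host="rabbitmq.example.com",
    heartbeat=0,
)
connection = pika.BlockingConnection(params)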
What we think the issue is:
We have researched why this might be happening and are homing in on TCP keepalive settings, either within Kubernetes or in our Istio Envoy proxy's outbound connection settings.
Any ideas on how we can troubleshoot this further, or what we might be missing here to diagnose it?
Thanks!

Confluent Kafka services (local) do not start properly on WSL2 and seem to time out communicating their status

I am seeing various issues while trying to start Kafka services on WSL2. Details/symptoms below:
Confluent Kafka platform (7.0.0)
WSL2 - Ubuntu 20.04 LTS
When I use the command:
confluent local services start
Typically the system takes a long time and then exits saying the service failed (e.g. zookeeper, as that is the first service to start).
If I check the logs, the service has actually started. So I run the command again and, sure enough, it immediately reports zookeeper as up, then proceeds to try to start kafka, which again after a minute it reports as failed to start (but it really has started).
I suspect that after starting the service (which is quite fast), the system is not able to communicate the status back and exit, and thus times out; I am not sure where the logs related to this are.
This means that starting the whole stack (zookeeper/kafka/schema-registry/kafka-rest/kafka-connect/etc.) takes forever, and in between I start getting other errors (sometimes schema-registry is not able to find the cluster id, sometimes it's a log-file-related error), which means I need to destroy everything and start again.
I have tried this over a couple of days and can't get it to work. Is Confluent Kafka that unstable on Windows, or am I missing some config change?
In terms of setup, I have not changed the config at all and am using the default config/ports.

24-hour performance test execution stopped abruptly in a JMeter pod in AKS

I am running a 24-hour load test using JMeter in Azure Kubernetes Service. I am using the Throughput Shaping Timer in my JMX file. No listener is added as part of the JMX file.
My test stopped abruptly after 6 or 7 hours.
The jmeter-server.log file in the JMeter slave pod shows the following warning:
WARN k.a.j.t.VariableThroughputTimer: No free threads left in worker pool.
Using JMeter version 5.2.1 and Kubernetes version 1.19.6.
I checked, and the JMeter master and slave pods are continuously running in AKS (no restarts happened).
I gave the JMeter slave pod 2 GB of memory, but the load test still stopped abruptly.
I am using a Log Analytics workspace for logging; I checked the ContainerLog table and am not seeing any errors.
The JMX file uses the following elements: Thread Group, Throughput Controller, HTTP Request Sampler and Throughput Shaping Timer.
Please advise.
It looks like your Schedule Feedback Function configuration is wrong in its last parameter.
The warning means that the Throughput Shaping Timer attempts to increase the number of threads to reach/maintain the desired concurrency, but it doesn't have enough spare threads in the pool to do so.
So either increase the spare-threads ratio to be closer to 1 if you're using a float value (a ratio), or increase the absolute value so it matches the number of threads required.
Quote from documentation:
Example function call: ${__tstFeedback(tst-name,1,100,10)} , where "tst-name" is name of Throughput Shaping Timer to integrate with, 1 and 100 are starting threads and max allowed threads, 10 is how many spare threads to keep in thread pool. If spare threads parameter is a float value <1, then it is interpreted as a ratio relative to the current estimate of threads needed. If above 1, spare threads is interpreted as an absolute count.
More information: Using JMeter’s Throughput Shaping Timer Plugin
However, this doesn't explain the premature termination of the test, so make sure there are no errors in the JMeter/Kubernetes logs; one possible reason is that the JMeter process is being terminated by the OOM killer.
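One way to check the Kubernetes side for that kind of termination is to look at the events attached to the JMeter pods. Here is a rough sketch using the official Kubernetes Python client; the pod name and namespace are placeholders, and this assumes you have kubectl-style access to the cluster:

from kubernetes import client, config

config.load_kube_config()   # or config.load_incluster_config() when run inside the cluster
v1 = client.CoreV1Api()

# List recent events for the JMeter slave pod; look for warning events
# such as evictions or container restarts.
events = v1.list_namespaced_event(
    namespace="default",                                  # placeholder namespace
    field_selector="involvedObject.name=jmeter-slave-0",  # placeholder pod name
)
for e in events.items:
    print(e.last_timestamp, e.reason, e.message)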

Airflow KubernetesPodOperator Losing Connection to Worker Pod

Experiencing an odd issue with KubernetesPodOperator on Airflow 1.1.14.
Essentially for some jobs Airflow is losing contact with the pod it creates.
[2021-02-10 07:30:13,657] {taskinstance.py:1150} ERROR - ('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
When I check the logs in Kubernetes with kubectl logs, I can see that the job carried on past the "Connection broken" error.
The "Connection broken" error seems to happen exactly one hour after the last logs that Airflow pulls from the pod (we do have a one-hour config on connections), but the pod keeps running happily in the background.
I've seen this behaviour repeatedly, and it tends to happen with longer-running jobs that have a gap in their log output, but I have no other leads. Happy to update the question if certain specifics are missing.
As I mentioned in the comments section, I think you can try setting the operator's get_logs parameter to False (the default value is True).
Take a look at: airflow-connection-broken, airflow-connection-issue.
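A minimal sketch of what that might look like in a DAG. The import path varies between Airflow versions, and the DAG and task arguments here are placeholders, not taken from the question:

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

with DAG("pod_logs_example", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    long_job = KubernetesPodOperator(
        task_id="long_job",
        name="long-job",
        namespace="default",
        image="myrepo/my-job:latest",  # placeholder image
        get_logs=False,                # do not stream pod logs, per the suggestion above
    )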

How to use the Python Kubernetes client in a way resilient to GKE Kubernetes Master disruptions?

We sometimes use Python scripts to spin up and monitor Kubernetes Pods running on Google Kubernetes Engine using the Official Python client library for kubernetes. We also enable auto-scaling on several of our node pools.
According to this, "Master VM is automatically scaled, upgraded, backed up and secured". The post also seems to indicate that some automatic scaling of the control plane / Master VM occurs when the node count increases from 0-5 to 6+ and potentially at other times when more nodes are added.
It seems like the control plane can go down at times like this, when many nodes have been brought up. Around the time this happens, our Python scripts that monitor pods via the control plane often crash, seemingly unable to reach the Kube API / control plane endpoint, triggering some of the following exceptions:
ApiException, urllib3.exceptions.NewConnectionError, urllib3.exceptions.MaxRetryError.
What's the best way to handle this situation? Are there any properties of the autoscaling events that might be helpful?
To clarify what we're doing with the Python client: we are in a loop, reading the status of the pod of interest via read_namespaced_pod every few minutes and catching exceptions similar to the provided example (in addition, we've also tried catching exceptions for the underlying urllib3 calls). We have also added retrying with exponential back-off, but things are unable to recover and fail after a specified maximum number of retries, even when that number is high (e.g. keep retrying for more than 5 minutes).
One thing we haven't tried is recreating the kubernetes.client.CoreV1Api object on each retry. Would that make much of a difference?
When a node pool's size changes, depending on the size, this can trigger a change in the size of the master; node pool sizes are mapped to specific master sizes. In cases where the node pool size requires a larger master, automatic scaling of the master is initiated on GCP. During this process the master will be unavailable for approximately 1-5 minutes. Please note that these events are not available in Stackdriver Logging.
At this point all API calls to the master will fail, including the ones from the Python API client and kubectl. However, after 1-5 minutes the master should be available again, and calls from both the client and kubectl should work. I was able to test this by scaling my cluster from 3 nodes to 20 nodes; for 1-5 minutes the master wasn't available.
I obtained the following errors from the Python API client:
Max retries exceeded with url: /api/v1/pods?watch=False (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at>: Failed to establish a new connection: [Errno 111] Connection refused',))
With kubectl I had:
“Unable to connect to the server: dial tcp”
After 1-5 minutes the master was available and the calls were successful. There was no need to recreate the kubernetes.client.CoreV1Api object, as it is just a client for the API endpoint.
According to your description, your master wasn't accessible even after 5 minutes, which signals a potential issue with your master or with the setup of the Python script. To troubleshoot this further, you can check the availability of the master on the side by running any kubectl command while your Python script runs.
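For what it's worth, here is a minimal sketch of the retry pattern described in the question, sized so that it rides out a 1-5 minute control-plane outage; the pod and namespace names are placeholders:

import time

import urllib3
from kubernetes import client, config
from kubernetes.client.rest import ApiException

config.load_kube_config()   # or config.load_incluster_config() when running in a pod
v1 = client.CoreV1Api()     # reusing the same client object across retries is fine

def read_pod_phase(name, namespace, max_wait=600):
    # Retry read_namespaced_pod with exponential back-off for up to max_wait seconds,
    # which comfortably covers the 1-5 minute master resize window described above.
    delay, waited = 5, 0
    while True:
        try:
            return v1.read_namespaced_pod(name=name, namespace=namespace).status.phase
        except (ApiException, urllib3.exceptions.HTTPError):
            if waited >= max_wait:
                raise
            time.sleep(delay)
            waited += delay
            delay = min(delay * 2, 60)

print(read_pod_phase("my-pod", "default"))   # placeholder pod/namespace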