We use Celery with a RabbitMQ broker, and some of our servers hang with the error "[Errno 113] No route to host" (which may be a result of half of our servers being in the US and half in Europe).
I need to be sure that every task is delivered. Unfortunately, I have no idea how to retry tasks sent via send_task with a string identifier (the server that sends tasks has no access to the code of the remote worker), like this:
send_task("remote1.tasks.add_data", args=[...], kwargs={}, queue="remote1")
Is it possible to retry such a task?
send_task just sends a message to the broker. If the exception is raised on the server that calls send_task, the message probably never reached the broker; in that case there is no task to retry, only an exception to handle.
If, on the other hand, your workers randomly raise this exception because they cannot reach the broker for some reason, you can probably solve it by setting the Celery configuration variable:
CELERY_ACKS_LATE = True
"Late ack means the task messages will be acknowledged after the task has been executed, not just before, which is the default behavior."
This means that if something goes wrong during execution of the task on the worker, the broker does not receive the ack and another worker will execute the task.
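For the delivery problem on the producer side, one approach is to retry the publish itself when the broker is unreachable. Below is a minimal sketch; the helper name and backoff values are illustrative, and in a real setup you would likely catch kombu.exceptions.OperationalError, which Celery raises when it cannot reach the broker (OSError is used here to keep the sketch self-contained):

```python
import time

def publish_with_retry(publish, max_retries=5, base_delay=2.0,
                       retry_on=(OSError,)):
    """Call `publish` (e.g. a lambda wrapping app.send_task) and retry
    with a simple linear backoff while the broker is unreachable."""
    for attempt in range(1, max_retries + 1):
        try:
            return publish()
        except retry_on:
            if attempt == max_retries:
                raise  # give up: let the caller handle the failure
            time.sleep(base_delay * attempt)

# usage (on the producer, which only knows the task *name*):
# publish_with_retry(lambda: app.send_task(
#     "remote1.tasks.add_data", args=[...], kwargs={}, queue="remote1"))
```

Note that Celery also has a built-in publish retry: apply_async/send_task accept a retry=True flag and a retry_policy dict (max_retries, interval_start, and so on), which may be simpler than a hand-rolled loop.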
I am trying to understand how to handle failed consumer records. How do we know there is a record failure? What I am seeing is that when record processing fails in the consumer with a runtime exception, the consumer keeps retrying. But when the next record is available to process, it commits the offset of the latest record, which is expected. My question is: how do we know about the failed record? In older messaging systems, failed messages are rolled back to queues and processing stops there. Then we know the queue is down and we can take action.
I can record the failed record in some DB table, but what happens if this recording fails?
I can move failures to error/dead-letter queues, but again, what happens if this move fails?
I am using Kafka 2.6 with Spring Boot 2.3.4. Any help would be appreciated.
Sounds like you would need to disable auto commits and manually commit the offsets yourself once your scope of "successfully processed" is achieved. If you include external processes like a database, then you will also need to increase the Kafka client timeouts so it doesn't think the consumer is dead while waiting on error logging/handling.
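In Spring Kafka that maps to a manual AckMode plus Acknowledgment.acknowledge(); the control flow itself is client-agnostic, so here is a sketch in Python, with process/commit/dead_letter standing in for your business logic, the consumer's commit call, and the DLQ producer respectively:

```python
def consume_with_manual_commit(records, process, commit, dead_letter):
    """Commit an offset only once a record is fully handled.

    If processing fails, the record is parked in a dead-letter sink and
    the offset is then committed; if dead-lettering *also* fails, the
    offset is NOT committed, so the record is redelivered after restart.
    Returns the record that stopped consumption, or None.
    """
    for record in records:
        try:
            process(record)
        except Exception:
            try:
                dead_letter(record)
            except Exception:
                return record  # stop without committing -> redelivery
        commit(record)
    return None
```

The key design choice is that every failure path ends either in a durable record of the failure or in a redelivery; there is no path where the offset is committed and the failure is silently lost.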
Using its statsd plugin, Airflow can report on the metric executor.queued_tasks, as well as some others.
I am using CeleryExecutor and need to know how many tasks are waiting in the Celery broker, so I know when new workers should be spawned. Indeed, I set my workers so they cannot take many tasks concurrently. Is this metric what I need?
Nope. If you want to know how many TIs are waiting in the broker, you'll have to connect to it.
Task instances that are waiting to get picked up in the celery broker are queued according to the Airflow DB, but running according to the CeleryExecutor. This is because the CeleryExecutor considers that any task instance that was successfully sent to the broker is now running (unlike the DB, which waits for a worker to pick it up before marking it as running).
Metric executor.queued_tasks reports the number of tasks queued according to the executor, not the DB.
The number of queued task instances according to the DB is not exactly what you need either, because it reports the number of task instances that are waiting in the broker plus the number of task instances queued to the executor. But when would TIs be stuck in the executor's queue, you ask? When the parallelism setting of Airflow prevents the executor from sending them to the broker.
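If you do decide to connect to the broker, RabbitMQ's management API exposes the per-queue backlog directly. A sketch, assuming the management plugin is enabled on port 15672; the queue name, credentials, and vhost are placeholders, and the opener parameter exists only to make the function testable:

```python
import json
from base64 import b64encode
from urllib.request import Request, urlopen

def broker_queue_depth(queue="default", host="localhost", port=15672,
                      user="guest", password="guest", vhost="%2F",
                      opener=urlopen):
    """Number of ready (not-yet-picked-up) messages in a RabbitMQ queue."""
    url = f"http://{host}:{port}/api/queues/{vhost}/{queue}"
    auth = b64encode(f"{user}:{password}".encode()).decode()
    req = Request(url, headers={"Authorization": f"Basic {auth}"})
    with opener(req) as resp:
        return json.load(resp)["messages_ready"]

# e.g. spawn a new worker when broker_queue_depth(queue="default")
# stays above some threshold
```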
This is not a question about how to capture logging on Celery workers. Is there any way to capture Celery logging on a producer? What I want is to capture every log that gets generated by Celery on the producer when I call task.delay(...) or task.apply_async(...).
EDIT:
I don't want to capture worker logs on the producer. I want to capture everything that happens in Celery from the time of my call to apply_async until the task is sent to the broker.
No, there is no way to capture worker logs on the producer. All you get is the exception, if thrown. Any logging is happening on the worker side, so you have to examine logs of that particular worker, or if you use some centralised log system then you have to look for logs from that worker...
Update: seems like you want to capture eventual logging from Celery on the producer (client) side. As far as I know Celery and the underlying transport handling library (Kombu) do not log anything. I could be wrong of course, but I can't remember seeing any logging there and I have read Celery (Kombu not that much to be fair) code many times...
A possible solution is to make Celery workers send logs to some centralised system that your Celery client can access...
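If you want to check for yourself whether anything is emitted on the producer side, you can turn the client-side loggers up to DEBUG before calling apply_async. The logger names below are assumptions based on the package names involved (celery, kombu, amqp):

```python
import logging

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(name)s %(levelname)s %(message)s"))

# Attach a DEBUG handler to the client-side libraries' loggers so that
# anything they *do* log during apply_async/send_task becomes visible.
for name in ("celery", "kombu", "amqp"):
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.addHandler(handler)
```

If nothing shows up with this in place, that supports the point above: the publish path simply doesn't log much, and the exception is all the producer gets.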
There is a retry feature on Kafka clients. I am struggling to find out when a retry happens. Would a retry happen if the connection to the broker is interrupted briefly? What if the brokers were not reachable for 5 minutes? Will the messages get delivered once the brokers are back up? Or does a retry only happen in scenarios known to the Kafka clients?
The Kafka producer consists of a pool of buffer space that holds records that haven't yet been transmitted to the server, as well as a background I/O thread that is responsible for turning these batched records into requests and transmitting them to the cluster.
For example, if records are sent faster than they can be delivered to the server, the producer will block for max.block.ms, after which it throws an exception. The client then considers the batch failed and retries sending it, based on the retries config:
org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for my-test-topic-4 due to 30024 ms has passed since batch creation plus linger time
If the retries config is set to 3 and all retries fail, then the batch is lost:
error: Failed to send message after 3 tries
How about if the brokers were not reachable for 5 mins?
If the broker is down and the retries are exhausted in the meantime, then you lose the data.
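The retry behaviour above can be sketched as a loop: re-send attempts continue until either the retries budget or the delivery deadline (cf. delivery.timeout.ms in newer clients, with retry.backoff.ms between attempts) runs out, and only then is the batch given up as lost. This is an illustrative model, not the real client's code:

```python
import time

def send_with_retries(attempt_send, retries=3, deadline_s=120.0,
                      backoff_s=0.1, clock=time.monotonic, sleep=time.sleep):
    """Model of the producer retry loop: one initial attempt plus up to
    `retries` re-sends, all bounded by a delivery deadline. If every
    attempt fails, the last error is raised and the batch is lost."""
    start = clock()
    last_err = None
    for attempt in range(retries + 1):
        try:
            return attempt_send()
        except Exception as err:
            last_err = err
            if attempt == retries or clock() - start >= deadline_s:
                break
            sleep(backoff_s)  # cf. retry.backoff.ms
    raise last_err
```

A brief connection blip is absorbed as long as one attempt succeeds before the limits are hit; a 5-minute broker outage exceeds the default budget, so those batches fail and are not delivered when the brokers come back.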
I am using HornetQ 2.2.14.Final, and connection-ttl in hornetq-jms.xml is configured as 60000 ms. I have a publisher program which sends messages to a topic and a consumer program which consumes messages from the topic. My consumer program exited abruptly without closing its resources. I waited one minute, since the TTL is 60000 ms, but the server did not clean up the resources even after one minute. Can anyone help me resolve this, if it is a configuration issue?
Sometimes it can take as long as 2× the TTL, depending on how the disconnect happened. We recently had a fix on master to make sure it is always close to the configured TTL.
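For reference, the TTL in question is the one set on the connection factory in hornetq-jms.xml; a minimal fragment might look like the following, where the connector name and JNDI entry are placeholders for your deployment:

```xml
<connection-factory name="ConnectionFactory">
   <connectors>
      <connector-ref connector-name="netty"/>
   </connectors>
   <entries>
      <entry name="/ConnectionFactory"/>
   </entries>
   <!-- how long the server keeps the connection's resources alive
        after it stops receiving pings from the client -->
   <connection-ttl>60000</connection-ttl>
</connection-factory>
```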