Preserve beanstalkd queue on restart or crash

I'm using beanstalkd to manage queues. I just realised that if there are jobs in a queue and the beanstalkd process is restarted or crashes, those jobs are lost forever (or so I think).
Is there a way to preserve the jobs in the queue across a beanstalkd failure or restart? If not, what's the best practice to ensure jobs are never lost?

Beanstalkd can be started with the -b (binary log) option, pointed at a directory, and it will write all jobs to a binlog. If the power goes out, you can restart beanstalkd with the same option and it will recover the contents of the log.
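For example (the listen address, port, and binlog directory below are placeholders; adjust them for your setup):

    # Start beanstalkd with persistence; -b takes a directory (it must exist
    # and be writable) where the binlog files are kept.
    beanstalkd -l 127.0.0.1 -p 11300 -b /var/lib/beanstalkd

    # After a crash or restart, start it again with the same -b directory and
    # the jobs recorded in the binlog are reloaded into their tubes.
    beanstalkd -l 127.0.0.1 -p 11300 -b /var/lib/beanstalkd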

Related

How to run a Kafka Producer and Kafka Consumer via CLI commands for 24 hours

We have a requirement where we need to showcase the resiliency of a Kafka cluster. To prove this, we have a use case where we need to run a producer and a consumer (I am thinking kafka-console-producer and kafka-console-consumer), preferably via CLI commands and/or scripts, continuously for 24 hours. We are not concerned with the message size or contents; preferably the size should be as small as possible and the messages can be any random value, say the current timestamp.
How can I achieve this?
There's nothing preventing you from doing this, and the problem isn't unique to Kafka.
You can use nohup to run a script as a daemon; otherwise, the commands will terminate when that console session ends. You could also use cron to schedule any script, at a minimum of every minute...
Or you can write your own app with a simple while(true) loop.
Regardless, you will want a process supervisor to truly ensure the command remains running at all times.
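A minimal sketch of the loop approach with the console tools and nohup, assuming a recent Kafka distribution (older releases use --broker-list instead of --bootstrap-server for the console producer); the broker address, topic name, and script/file names are placeholders:

    # run-producer.sh - emit the current Unix timestamp once per second
    while true; do date +%s; sleep 1; done |
      kafka-console-producer.sh --bootstrap-server localhost:9092 --topic resiliency-test

    # run-consumer.sh - read the topic continuously and log whatever arrives
    kafka-console-consumer.sh --bootstrap-server localhost:9092 \
      --topic resiliency-test --from-beginning

    # Detach both from the terminal so they survive the session ending
    nohup ./run-producer.sh > producer.log 2>&1 &
    nohup ./run-consumer.sh > consumer.log 2>&1 &

A process supervisor (systemd, supervisord, or a Kubernetes Deployment) would restart these automatically if they die, which plain nohup does not do.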

Apache Flink - duplicate message processing during job deployments, with ActiveMQ as source

Given,
I have a Flink job that reads from an ActiveMQ source and writes to a MySQL database, keyed on an identifier. I have enabled checkpointing for this job every second, pointing the checkpoints to a MinIO instance, and I have verified that checkpoints are written for the job id. I deploy this job on OpenShift (Kubernetes underneath) and can scale it up/down as and when required.
Problem
When the job is deployed (rolling) or goes down due to a bug/error, and there were unconsumed messages in ActiveMQ or messages unacknowledged by Flink (but already written to the database), then when the job recovers (or a new job is deployed) it processes those already-processed messages again, resulting in duplicate records inserted into the database.
Question
Shouldn't the checkpoints help the job recover from where it left off?
Should I take a checkpoint before I (rolling) deploy a new job?
What happens if the job quits with an error or there is a cluster failure?
As the job id keeps changing on every deployment, how does recovery happen?
Edit: As I cannot expect idempotency from the database, can I avoid duplicates being saved to the database (exactly-once) by writing a database-specific upsert query (e.g. MySQL's INSERT ... ON DUPLICATE KEY UPDATE) that updates the given record if it is present and inserts it if not?
The JDBC sink currently only supports at-least-once, meaning you get duplicate messages upon recovery. There is currently a draft to add support for exactly-once, which would probably be released with 1.11.
Shouldn't the checkpoints help the job recover from where it left off?
Yes, but the time between the last successful checkpoint and recovery could produce the observed duplicates. I gave a more detailed answer on a somewhat related topic.
Should I take a checkpoint before I (rolling) deploy a new job?
Absolutely. You should actually use cancel with savepoint. That is the only reliable way to change the topology. Additionally, cancel with savepoint avoids duplicates in the data, as it gracefully shuts down the job.
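With the CLI that looks roughly like this (the savepoint target directory and job id are placeholders; with MinIO you would point the s3:// path at your bucket):

    # Trigger a savepoint and gracefully cancel the running job in one step
    flink cancel -s s3://my-bucket/savepoints <job-id>

Newer Flink versions also provide flink stop, which stops the job with a savepoint.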
What happens if the job quits with an error or there is a cluster failure?
It should automatically restart (depending on your restart settings). It would use the latest checkpoint for recovery. That would most certainly result in duplicates.
As the job id keeps changing on every deployment, how does recovery happen?
You usually point explicitly to the same checkpoint directory (on S3?).
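For example, the new deployment can be submitted with the last savepoint (or a retained checkpoint's path) passed explicitly; the bucket, path, and jar name here are made up:

    # Submit the new job, restoring its state from the given savepoint/checkpoint path
    flink run -d \
      -s s3://my-bucket/savepoints/savepoint-abc123 \
      my-flink-job.jar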
As I cannot expect idempotency from the database, is an upsert the only way to achieve exactly-once processing?
Currently, I do not see a way around it. It should change with 1.11.

Logging Celery on Client/Producer

This is not a question about how to capture logging on Celery workers. Is there any way to capture Celery logging on a producer? What I want is to capture every log that gets generated by Celery on the producer when I call task.delay(...) or task.apply_async(...).
EDIT:
I don't want to capture worker logs on the producer. I want to capture everything that happens in Celery from the time of my call to apply_async until the task is sent to the broker.
No, there is no way to capture worker logs on the producer. All you get is the exception, if one is thrown. Any logging happens on the worker side, so you have to examine the logs of that particular worker, or, if you use some centralised log system, look for the logs from that worker...
Update: it seems you want to capture any logging from Celery on the producer (client) side. As far as I know, Celery and the underlying transport-handling library (Kombu) do not log anything there. I could be wrong of course, but I can't remember seeing any logging there, and I have read the Celery code (Kombu not that much, to be fair) many times...
A possible solution is to make Celery workers send logs to some centralised system that your Celery client can access...

Storm fault tolerance: Nimbus reassigns worker to a different machine?

How do I make storm-nimbus to restart worker on the same machine?
To test the fault tolerance, I do a kill -9 on a worker process expecting the worker to be restarted on the same machine, but on one of the machines, nimbus launches the worker on another machine!!!
The Nimbus log does not show any retries, errors, or anything unusual.
Would appreciate any help, Thanks!
You shouldn't need to. Workers should be able to switch to an open slot on any supervisor. If you have a bolt that can't accommodate this because it reads data on a particular supervisor, that is a design problem.
Additionally, Storm's fault tolerance is intended to handle not only worker failures but also supervisor failures, in which case you won't be able to restart a worker on the same supervisor. You shouldn't need to worry about where a worker runs: that's a feature of Storm.

Purging tasks in Mongodb broker

I am using Celery 3.0.13 with MongoDB as the broker.
I am trying to purge the tasks in a custom queue. I stopped all the workers and tried to purge the waiting tasks in the custom queue using the "celery purge" command, but the command reports that no tasks were purged. I rechecked (using Flower) and the tasks are still in the queue even after running the command.
Am I missing anything?
Thanks.
It may be related to this issue:
https://github.com/celery/celery/issues/1047