Zookeeper priority queue - distributed-computing

My problem description is as follows:
I have n state-based database infinite crawlers.
Currently, this is how it works:
We are using a single machine for crawling.
We have three levels of priority queue: High, Medium and Low.
At the start, all database jobs are put into the low-priority queue.
A worker reads a job from the queue and performs the operation.
After finishing a job, it reschedules the job with a delay of 5 minutes.
Solution I found
For the priority queue I can use:
- http://zookeeper.apache.org/doc/r3.2.2/recipes.html#sc_recipes_priorityQueues
Problems I am still searching for solutions to:
How to reschedule a job in the queue with a future schedule time. Is there a way to do that in ZooKeeper?
Canceling an already started job. Suppose a user changes his database authentication details; I want to stop the already running job for that database and restart it with the new details.
What I thought is that when a worker starts a job, it will subscribe to that database's znode changes, and if something happens, it will stop that job and reschedule it (see the sketch below).
Infinite queue
What I thought is that after finishing a job, the worker will remove it from the queue and re-add it with a future schedule time. (This implementation depends on point 1.)
Is this a correct way of implementing such an infinite task?
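To make the watch idea from point 2 concrete, here is a rough, untested sketch in Java; the znode path, the CrawlJob handle and the class name are placeholders, not part of the question:

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    // Sketch: watch a database's config znode and restart the running crawl
    // when its data (e.g. auth details) changes.
    public class DatabaseConfigWatcher implements Watcher {

        // Hypothetical handle on the running crawl for this database.
        interface CrawlJob {
            void stop();
            void restartWith(byte[] newConfig);
        }

        private final ZooKeeper zk;
        private final String dbNode;       // e.g. a per-database node (placeholder path)
        private final CrawlJob runningJob;

        public DatabaseConfigWatcher(ZooKeeper zk, String dbNode, CrawlJob runningJob)
                throws Exception {
            this.zk = zk;
            this.dbNode = dbNode;
            this.runningJob = runningJob;
            // Register the watch before starting work on this database.
            zk.getData(dbNode, this, null);
        }

        @Override
        public void process(WatchedEvent event) {
            if (event.getType() == Event.EventType.NodeDataChanged) {
                try {
                    // Auth details changed: re-register the watch, stop the
                    // current run and restart it with the new configuration.
                    byte[] newConfig = zk.getData(dbNode, this, null);
                    runningJob.stop();
                    runningJob.restartWith(newConfig);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }
    }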

Related

Processing Groups of Results with Vertx - How to coordinate?

I have a job processing system where each job contains thousands of individual tasks that require different strategies to complete. The individual tasks make up the whole job. If all tasks complete, the job is marked as successfully completed and other steps are taken; if any task fails, the job must be marked as failed and other steps are taken; if the job times out, it must likewise be marked as failed and other steps are taken.
Once all of the results for a job have been received, the next job can be fetched. The next job shouldn't be fetched while a job is currently being processed.
Here is what the flow looks like:
The Job Polling Verticle publishes a job to the event bus, and the Job Processing Verticle publishes each task to the event bus. When the job strategy completes, it publishes the task result to the event bus.
The issue is that I don't know the right way to determine when all tasks have been completed in this model. All verticles are stateless, the Job Processing Verticle doesn't await any futures, and even if the Job Results Verticle were stateful, it wouldn't know how many results to expect.
The only way I can think of to do this would be a global stateful object, but I don't think that is good design.
Additionally, I need to know when a job has timed out. That is, it has run longer than it should, and I need to consider it failed, log it, and move on.
I could do this with the global state, but again I don't think that's the right solution.
Does this verticle pattern make sense for what I'm trying to do?
First, let me try to address your questions. Then I'll try to explain what problems this design has.
The issue is that I don't know the right way to determine when all tasks have been completed in this model. All verticles are stateless, the Job Processing Verticle doesn't await any futures, and even if the Job Results Verticle were stateful, it wouldn't know how many results to expect.
The solution could be a reference-counting verticle. Each worker should emit a start message with the jobId on the event bus when it starts, and an end message with the jobId when it completes. Even if you have fan-out (the cases where you don't know how many workers there are), the counting verticle will know. In your diagram, the "Job Post Processing Verticle" is a good candidate for this: it can maintain a counter, and only when the counter reaches zero should it start the next job. That also avoids actually sharing a memory reference.
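A minimal sketch of such a counting verticle, assuming event-bus addresses "task.started", "task.finished" and "job.completed" (those names are illustrative, not from your design):

    import io.vertx.core.AbstractVerticle;
    import java.util.HashMap;
    import java.util.Map;

    public class JobCounterVerticle extends AbstractVerticle {

        // Handlers of a standard verticle run on one event-loop thread,
        // so a plain map is safe here.
        private final Map<String, Integer> pendingTasks = new HashMap<>();

        @Override
        public void start() {
            vertx.eventBus().<String>consumer("task.started", msg -> {
                pendingTasks.merge(msg.body(), 1, Integer::sum);
            });

            vertx.eventBus().<String>consumer("task.finished", msg -> {
                String jobId = msg.body();
                int remaining = pendingTasks.merge(jobId, -1, Integer::sum);
                if (remaining == 0) {
                    pendingTasks.remove(jobId);
                    // Every task of this job has reported back: signal that
                    // the next job may be fetched.
                    vertx.eventBus().publish("job.completed", jobId);
                }
            });
        }
    }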
Additionally, I need to know when a job has timed out. That is, it has run longer than it should, and I need to consider it failed, log it, and move on.
In the same verticle you can start a timer every time you get a new start message. If the end message arrives, cancel the timer; if the timer fires first, cancel the current job and start again.
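A sketch of that timeout handling in the same style (the 10-minute limit and the addresses "job.started", "job.completed" and "job.failed" are assumptions):

    import io.vertx.core.AbstractVerticle;
    import java.util.HashMap;
    import java.util.Map;

    public class JobTimeoutVerticle extends AbstractVerticle {

        private static final long JOB_TIMEOUT_MS = 10 * 60 * 1000;
        private final Map<String, Long> timers = new HashMap<>();

        @Override
        public void start() {
            vertx.eventBus().<String>consumer("job.started", msg -> {
                String jobId = msg.body();
                long timerId = vertx.setTimer(JOB_TIMEOUT_MS, id -> {
                    // The job ran too long: consider it failed and move on.
                    timers.remove(jobId);
                    vertx.eventBus().publish("job.failed", jobId);
                });
                timers.put(jobId, timerId);
            });

            vertx.eventBus().<String>consumer("job.completed", msg -> {
                Long timerId = timers.remove(msg.body());
                if (timerId != null) {
                    vertx.cancelTimer(timerId);
                }
            });
        }
    }

Wiring the timer to the same "job.completed" message that the counting verticle publishes keeps both concerns out of the workers themselves.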
Now, this solution will work, but the design has two main flaws. The first is that you seem to maintain all your flow state in memory. If your application crashes, all progress is lost, and it's not clear how you record it. Maybe polling a Jobs table in the database would actually be better, since your job execution is sequential anyway.
The second is that all those timeouts and reference counting amount to a homemade implementation of structured concurrency. Maybe you should take a look at something like Kotlin coroutines, as they will handle many of these problems for you.

Distributed queue consumers in an unstable net

I'm working on the design of a distributed system. The system consists of multiple producers, distributed queue and multiple consumers aka workers.
Worker instances reside within data centres in different locations. Sometimes one location is manually disconnected.
In such a case, the issue is that a worker in the disconnected location has taken a task from the queue and then shuts down before completing it. I want:
workers from a live location to be able to pick up such a task and complete it;
the disconnected worker, when it finally comes back up, to determine whether the task was already completed by another worker and decide what to do with it.
What is a convenient way to solve such an issue?
This design might help you. Every time a worker consumes a task, move the task from the queue to some other distributed list of consumed tasks, and maintain a timestamp with every task in that list.
The worker that consumed the task should then send some kind of still-alive message every second or so (similar to Hadoop's heartbeat message) that updates the timestamp of the task in the consumed-tasks list. This indicates that the worker that consumed the task is still alive and has been heard from recently.
Now, implement a daemon to monitor this consumed-tasks list and move back to the queue any task whose timestamp is older than a threshold number of seconds (chosen with message losses in mind).
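A minimal sketch of that monitor daemon (the ConcurrentMap and BlockingQueue below are in-memory stand-ins for the distributed consumed-tasks list and the queue, which in a real system would live in a shared store; the 30-second threshold is an assumption):

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ConcurrentMap;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class ConsumedTaskMonitor {

        private static final long STALE_AFTER_MS = 30_000;

        private final BlockingQueue<String> queue;          // stand-in for the distributed queue
        private final ConcurrentMap<String, Long> consumed; // taskId -> last heartbeat (millis)

        public ConsumedTaskMonitor(BlockingQueue<String> queue,
                                   ConcurrentMap<String, Long> consumed) {
            this.queue = queue;
            this.consumed = consumed;
        }

        // A worker calls this every second or so while it holds taskId.
        public void heartbeat(String taskId) {
            consumed.put(taskId, System.currentTimeMillis());
        }

        // A worker calls this once the task is done.
        public void complete(String taskId) {
            consumed.remove(taskId);
        }

        // Daemon: requeue any task whose heartbeat is older than the threshold.
        public void start() {
            ScheduledExecutorService daemon = Executors.newSingleThreadScheduledExecutor();
            daemon.scheduleAtFixedRate(() -> {
                long now = System.currentTimeMillis();
                consumed.forEach((taskId, lastSeen) -> {
                    if (now - lastSeen > STALE_AFTER_MS) {
                        consumed.remove(taskId, lastSeen); // only if not refreshed meanwhile
                        queue.offer(taskId);               // give it back to live workers
                    }
                });
            }, 5, 5, TimeUnit.SECONDS);
        }
    }

When the disconnected worker comes back, it can check this list (and a list of completed task ids, if one is kept) to see whether its task was already finished elsewhere.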

zookeeper queue delay?

What would you guys suggest to be a good way to implement a queue in zookeeper that has the ability to delay a job without blocking a worker?
For reference, see beanstalkd's delayed-job option.
What you need is to build a barrier using ZooKeeper.
I assume the delay time is controlled by another process, call it the master.
The master first creates a node, say /work/flag, with the data "false".
What the worker needs to do is get and watch the node /work/flag. The watcher calls back asynchronously, so the worker can do other things and is not blocked.
When the time comes, the master sets the /work/flag data to "true", which causes a ZOO_CHANGED_EVENT.
The worker's callback then receives the ZOO_CHANGED_EVENT for /work/flag; it can read /work/flag, check whether it is "true", and decide whether to continue the workflow.
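A rough sketch of the worker side of that barrier (the connection string and the node path are assumptions, and error handling is omitted; NodeDataChanged is the Java-client equivalent of ZOO_CHANGED_EVENT):

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class DelayBarrierWorker {

        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });

            Watcher flagWatcher = new Watcher() {
                @Override
                public void process(WatchedEvent event) {
                    if (event.getType() == Event.EventType.NodeDataChanged) {
                        try {
                            byte[] data = zk.getData("/work/flag", this, null);
                            if ("true".equals(new String(data))) {
                                // The master flipped the flag: continue the workflow.
                                continueWorkflow();
                            }
                            // Otherwise the watch is re-registered and we keep waiting.
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                }
            };

            // The call returns immediately; the worker is free to do other work
            // until the callback fires.
            zk.getData("/work/flag", flagWatcher, null);

            Thread.currentThread().join(); // keep the demo process alive while waiting
        }

        private static void continueWorkflow() {
            System.out.println("/work/flag is true, resuming work");
        }
    }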

How to ensure when time arrive Quartz job has an available thread

Is there a way to ensure that when a trigger's fire time arrives, there is always a thread for it to run on? Triggers have priorities, but that is not a guarantee, and giving the pool more threads than the maximum number of jobs is not an option. I have lots of jobs, but one of them must run at a specified time.
Use a separate scheduler for that job. The important job will then always have a free thread, because no other job uses that scheduler.
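A minimal sketch of that approach with Quartz (the scheduler name, the single-thread pool, the placeholder MyImportantJob class and the fire time are illustrative assumptions):

    import org.quartz.Job;
    import org.quartz.JobBuilder;
    import org.quartz.JobDetail;
    import org.quartz.JobExecutionContext;
    import org.quartz.Scheduler;
    import org.quartz.Trigger;
    import org.quartz.TriggerBuilder;
    import org.quartz.impl.StdSchedulerFactory;

    import java.util.Date;
    import java.util.Properties;

    public class DedicatedSchedulerExample {

        // Placeholder for the time-critical job.
        public static class MyImportantJob implements Job {
            @Override
            public void execute(JobExecutionContext context) {
                // the work that must run at the specified time
            }
        }

        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("org.quartz.scheduler.instanceName", "CriticalScheduler");
            props.put("org.quartz.threadPool.class", "org.quartz.simpl.SimpleThreadPool");
            props.put("org.quartz.threadPool.threadCount", "1"); // reserved for this job
            props.put("org.quartz.jobStore.class", "org.quartz.simpl.RAMJobStore");

            Scheduler critical = new StdSchedulerFactory(props).getScheduler();
            critical.start();

            JobDetail job = JobBuilder.newJob(MyImportantJob.class)
                    .withIdentity("importantJob")
                    .build();

            Trigger trigger = TriggerBuilder.newTrigger()
                    .withIdentity("importantTrigger")
                    .startAt(new Date(System.currentTimeMillis() + 60_000)) // fire in one minute
                    .build();

            critical.scheduleJob(job, trigger);
        }
    }

All other jobs keep running on the regular, fully loaded scheduler; only this one is scheduled on the dedicated instance.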

Use beanstalkd, for periodic tasks, how to always make a job replaced by its latest one?

I am trying to use beanstalkd for queuing a large number of periodic tasks (for example, tasks that need to be processed every N minutes). For each task, if the last queued job has not been completed (not reserved, I mean) when the current job is about to be added, the last queued job should be replaced with the current job; in other words, only the latest queued job of a task should be processed.
How can I achieve that using beanstalkd?
The idea I have right now is:
for each task, use memcached to store its latest timestamp (set this when adding jobs to the queue);
every time a worker reserves a job successfully, it first checks the timestamp for this task in memcached;
if the timestamp of the job is the same as the timestamp in memcached, it processes this job;
otherwise it skips this job and deletes it from the queue.
So is there a better way to do this? Please give your suggestions, thanks.
I also found a memcached/beanstalkd combination to be the best solution for an implementation where I didn't want a newer but identical job entering a queue.
Until 'named jobs' are done and the software is released, that may be one of the better solutions.
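A minimal sketch of the check described in the question (BeanstalkJob, BeanstalkClient and MemcacheClient are hypothetical stand-ins, not a real client API; only the timestamp comparison is being illustrated):

    public class LatestJobWorker {

        // Hypothetical shapes of the job and the two clients.
        interface BeanstalkJob {
            long id();
            String taskKey();   // identifies the periodic task
            long timestamp();   // set by the producer when the job was queued
        }

        interface BeanstalkClient {
            BeanstalkJob reserve();
            void delete(long jobId);
        }

        interface MemcacheClient {
            Long get(String key);              // latest timestamp for a task, or null
            void set(String key, long value);  // written by the producer on enqueue
        }

        private final BeanstalkClient beanstalk;
        private final MemcacheClient memcache;

        LatestJobWorker(BeanstalkClient beanstalk, MemcacheClient memcache) {
            this.beanstalk = beanstalk;
            this.memcache = memcache;
        }

        void runOnce() {
            BeanstalkJob job = beanstalk.reserve();
            Long latest = memcache.get(job.taskKey());

            if (latest == null || latest == job.timestamp()) {
                process(job);              // this is the newest job for the task
            }
            beanstalk.delete(job.id());    // stale or finished: drop it either way
        }

        private void process(BeanstalkJob job) {
            // task-specific work goes here
        }
    }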