How to configure Druid properly to fire a periodic kill task

I have been trying to get Druid to fire a kill task periodically to clean up unused segments.
These are the configuration variables responsible for it:
druid.coordinator.kill.on=true
druid.coordinator.kill.period=PT45M
druid.coordinator.kill.durationToRetain=PT45M
druid.coordinator.kill.maxSegments=10
From the above configuration, my mental model is: once ingested data is marked unused, a kill task will fire and delete the segments that are older than 45 minutes while retaining 45 minutes' worth of data. period and durationToRetain are the config vars that are confusing me; I'm not quite sure how to leverage them. Any help would be appreciated.

The caveat for druid.coordinator.kill.on=true is that segments are deleted from whitelisted datasources. The whitelist is empty by default.
To populate the whitelist with all datasources, set killAllDataSources to true. Once I did that, the kill task fired as expected and deleted the segments from s3 (COS). This was tested for Druid version 0.18.1.
Now, while the above configuration properties can be set when you build your image, killAllDataSources needs to be set through an API. It can also be set via the Druid UI.
When you click the option, a modal appears that includes Kill All Data Sources. Set it to True and you should see a kill task (under Ingestion -> Tasks) firing at the specified interval. It would be really nice to have this as part of runtime.properties or some sort of common configuration file whose value we can set when building the Druid image.
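For reference, a minimal sketch of what setting this through the coordinator dynamic configuration API might look like, using Python and requests and a hypothetical coordinator address; note that a POST replaces the whole dynamic config, so it is safest to fetch the current one first:

import requests

COORDINATOR = "http://coordinator:8081"  # hypothetical host:port

# Fetch the current coordinator dynamic config so other settings are preserved.
config = requests.get(f"{COORDINATOR}/druid/coordinator/v1/config").json()

# Equivalent of the Kill All Data Sources toggle in the UI.
config["killAllDataSources"] = True

resp = requests.post(f"{COORDINATOR}/druid/coordinator/v1/config", json=config)
resp.raise_for_status()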

Use crontab; it works quite well for us.

If you want control over segment removal outside of Druid, you can use a scheduled task that runs at your desired interval and registers kill tasks in Druid. This gives you more control over your segments, since once they are gone, you cannot recover them. You can use this script to help you:
https://github.com/mostafatalebi/druid-kill-task
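Either way (cron or the script above), the scheduled job essentially just posts a kill task spec to the Overlord. A minimal sketch in Python, assuming a hypothetical Overlord address, datasource, and interval; only segments already marked unused inside that interval are deleted:

import requests

OVERLORD = "http://overlord:8090"  # hypothetical host:port

kill_task = {
    "type": "kill",
    "dataSource": "my_datasource",        # hypothetical datasource
    "interval": "2020-01-01/2020-06-01",  # unused segments in this interval get deleted
}

resp = requests.post(f"{OVERLORD}/druid/indexer/v1/task", json=kill_task)
resp.raise_for_status()
print(resp.json())  # contains the task id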

Related

Stop running Azure Data Factory Pipeline when it is still running

I have an Azure Data Factory pipeline. My trigger has been set to run every 5 minutes.
Sometimes my pipeline takes more than 5 minutes to finish its jobs. In this case, the trigger runs again and creates another instance of my pipeline, and two instances of the same pipeline cause problems in my ETL.
How can I be sure that just one instance of my pipeline runs at a time?
As you can see, there are several instances of my pipeline running.
A few options I could think of:
OPT 1
Specify a 5-minute timeout on your pipeline activities:
https://learn.microsoft.com/en-us/azure/data-factory/concepts-pipelines-activities
https://learn.microsoft.com/en-us/azure/data-factory/concepts-pipelines-activities#activity-policy
OPT 2
1) Create a 1-row, 1-column SQL RunStatus table: 1 will be our "completed" status, 0 our "running" status.
2) At the end of your pipeline add a stored procedure activity that would set the bit to 1.
3) At the start of your pipeline add a lookup activity to read that bit.
4) The output of this lookup will then be used in an If Condition activity:
if 1 - start the pipeline's job, but before that add another stored procedure activity to set our status bit to 0.
if 0 - depending on the details of your project: do nothing, add a wait activity, send an email, etc.
To make full use of this option, you can turn the table into a log, where a new line with start and end times is added after each successful run (before initiating a new run, you can check whether the previous run has an end time). Having this log might help you gather data on how long it takes to run your pipeline, so you can either add more resources or increase the interval between runs. A rough sketch of the gate logic follows below.
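Just to make the sequence concrete, here is the same gate logic as a standalone Python sketch against a hypothetical RunStatus table (in ADF itself this would be the Lookup, Stored Procedure, and If Condition activities described above; the table, DSN, and run_pipeline_work placeholder are all hypothetical):

import pyodbc

conn = pyodbc.connect("DSN=etl_db")  # hypothetical connection
cur = conn.cursor()

# Lookup: read the status bit (1 = completed, 0 = running).
status = cur.execute("SELECT TOP 1 Status FROM RunStatus").fetchval()

if status == 1:
    # Mark as running before starting the job.
    cur.execute("UPDATE RunStatus SET Status = 0")
    conn.commit()

    run_pipeline_work()  # placeholder for the pipeline's actual job

    # Stored-procedure equivalent: mark as completed at the end of the run.
    cur.execute("UPDATE RunStatus SET Status = 1")
    conn.commit()
else:
    # Previous run still in progress: do nothing, wait, send an email, etc.
    pass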
OPT 3
Monitor the pipeline run with SDKs (I have not tried that, so this is just to point you in a possible direction):
https://learn.microsoft.com/en-us/azure/data-factory/monitor-programmatically
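A rough sketch of this with the Python SDK (azure-mgmt-datafactory), assuming hypothetical subscription, resource group, factory, and pipeline names; the idea is to skip or delay a new run while a previous one is still 'InProgress':

from datetime import datetime, timedelta
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    RunFilterParameters, RunQueryFilter, RunQueryFilterOperand, RunQueryFilterOperator,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

filters = RunFilterParameters(
    last_updated_after=datetime.utcnow() - timedelta(hours=2),
    last_updated_before=datetime.utcnow(),
    filters=[RunQueryFilter(
        operand=RunQueryFilterOperand.PIPELINE_NAME,
        operator=RunQueryFilterOperator.EQUALS,
        values=["MyPipeline"],  # hypothetical pipeline name
    )],
)

runs = client.pipeline_runs.query_by_factory("my-rg", "my-adf", filters)  # hypothetical names
if any(r.status == "InProgress" for r in runs.value):
    print("Previous run still in progress, skipping this interval")
else:
    client.pipelines.create_run("my-rg", "my-adf", "MyPipeline")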
Hopefully you can use at least one of them.
It sounds like you're trying to run a process more or less constantly, which is a good fit for tumbling window triggers. You can create a dependency such that the trigger is dependent on itself - so it won't run until the previous run has completed.
Start by creating a trigger that runs a pipeline on a tumbling window, then create a tumbling window trigger dependency. The section at the bottom of that article discusses "tumbling window self-dependency properties", which shows you what the code should look like once you've successfully set this up.
Try changing the concurrency of the pipeline to 1.
Link: https://www.datastackpros.com/2020/05/prevent-azure-data-factory-from-running.html
My first thought is that the recurrence is too frequent under these circumstances. If the graph you shared is all for the same pipeline, then most of the runs take close to 5 minutes, but you have some that take 30, 40, even 60 minutes. Situations like this are when a simple recurrence trigger probably isn't sufficient. What is supposed to happen while the 60-minute one is running? There will be 10-12 runs that won't start: do they still need to run, or can they be ignored?
To make sure all the pipelines run, and to manage concurrency, you're going to need to build a queue manager of some kind. ADF cannot handle this itself, so I have built such a system internally and rely on it extensively. I use a combination of Logic Apps, Stored Procedures (Azure SQL), and Azure Functions to queue, execute, and monitor pipeline executions. Here is a high-level breakdown of what you probably need (a sketch of the Function side follows the list):
Logic App 1: runs every 5 minutes and queues an ADF job in the SQL database.
Logic App 2: runs every 2-3 minutes and checks the queue to see if a) there is not a job currently running (status = 'InProgress') and b) there is a job in the queue waiting to run (I do this with a Stored Procedure). If this state is met, execute the next ADF pipeline and update its status to 'InProgress'.
I use an Azure Function to submit jobs instead of the built-in Logic App activity because I have better control over variable parameters. Also, it can return the newly created ADF RunId, which I rely on in #3.
Logic App 3: runs every minute and updates the status of any 'InProgress' jobs.
I use an Azure Function to check the status of the ADF pipeline based on RunId.
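As a rough illustration of those two Functions (submit and status check), here is a minimal Python sketch with azure-mgmt-datafactory, using hypothetical resource names; the real versions would read from the queue table and write the status back to SQL:

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

def submit_pipeline(pipeline_name, parameters=None):
    # Start the pipeline and return the ADF RunId for tracking in the queue table.
    run = client.pipelines.create_run("my-rg", "my-adf", pipeline_name, parameters=parameters)
    return run.run_id

def get_run_status(run_id):
    # Used by the monitoring Logic App to refresh 'InProgress' rows.
    run = client.pipeline_runs.get("my-rg", "my-adf", run_id)
    return run.status  # e.g. 'Queued', 'InProgress', 'Succeeded', 'Failed'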

Does the Overpass API installation's Area Creation process resume if stopped/started?

The Area Creation process can take up to 24 hours. If something happens during that time which causes the process to stop, will it resume when I run it again or does it start back over from the beginning?
We can assume for this question that the files in $DB_DIR remain in place throughout the running/stopping/starting process.
It will start over from the beginning, assuming you're using areas.osm3s to define the area creation rules. This file contains a number of queries which are being executed to generate the areas. If you restart the process, it will execute those very same queries again from the beginning.
For performance reasons, we use areas_delta.osm3s and the accompanying rules_delta_loop.sh script on the production servers. This way, we can limit the workload to those areas which have changed since the last area creation run.

Graceful custom activity timeout in data factory

I'm looking to impose a timeout on custom activities in data factory via the policy.timeout property in the activity json.
However, I haven't seen any documentation suggesting how the timeout operates on Azure Batch. I assume that the batch task is forcibly terminated somehow.
But is the task -> custom activity informed so it can tidy up?
The reason I ask is that I could be mid-copying to data lake store and I neither want to let it run indefinitely nor stop it without some sort of clean up (I can't see a way of doing transactions as such using the data lake store SDK).
I'm considering putting the timeout within the custom activity, but it would be a shame to have timeouts defined at 2 different levels (I'd probably still want the overall timeout).
I feel your pain.
ADF does simply terminate the activity if its own timeout is reached, regardless of what state the invoked service is in.
I have the same issue with U-SQL processing calls. It takes a lot of proactive monitoring via PowerShell to ensure data lake or batch jobs have enough compute to complete jobs with naturally increasing data volumes before the ADF timeout kill occurs.
I'm not aware of any graceful way for ADF to handle this because it would differ for each activity type.
Time to create another feedback article for Microsoft!

Is there any method to get mutual exclusion in a chef node?

For example, if a process updates a node while a chef-client is running, the chef-client will overwrite the node data:
chef-client gets node data (state 1)
The process A gets node data (state 1)
The process A updates locally the node data (state 2)
This process saves node data (state 2)
chef-client updates the node data locally (state 2*)
chef-client saves node data, and this node data does not contain the changes from process A (state 2). The chef-client overwrites the node data (state 2*).
The same problem occurs if we have two processes saving node data at the same moment.
EDIT
We need the external modification because we have a nice UI on top of the Chef server to manage a lot of computers remotely, showing them as a tree (similar to LDAP). An administrator can update the values used by the recipes from there. This project is open source: https://github.com/gecos-team/
Although we had a semaphore system, we have detected that if we have two or more simultaneous requests, we can have a concurrency problem:
The regular case is that the system works
But sometimes the system does not work
EDIT 2
I have added a document with a lot of information about our problem.
Throwing what I would do for this case as an answer:
Have a distributed lock mechanism like
This
I'm not using it myself, it is just for the idea
Build a start/report/error handler which will:
at start, acquire a lock on the node name in the DLM from step 1;
if it can't, abort the run or wait until the lock is free;
at the end (report or error), release the lock.
Modify the external system to do the same as the handler above: acquire a lock before modifying and release it when done.
Pay attention to the lock lifetime! It should be longer than your Chef run plus a margin, and the UI should ensure its lock is still there before writing, and abort if not.
A way to get rid of the handler (but you still need a lock for the UI) is to take advantage of the reporting API (a premium feature of Chef 12, free under 25 nodes, license needed above that).
This turns out a bit convoluted and needs the node to do reporting (so the chef-server URL should end with organizations/ and the client version should be above 11.16, or use the backport).
Then you can ask about the runs for a node, check if there's one at started status for this node, and wait until it has ended.
Chef doesn't implement a transaction feature, and it does not re-converge nodes on updates automatically by default. It's open to race conditions, which you can try to reduce by updating node attributes from within a chef-client run (right before you do something critical), but you will never end up with a reliable, working setup.
The longer the converge runs, the higher the gap and risk of corruption.
Chef's node attributes are only useful for debugging or modification by the chef-client running on the node itself and pretty much useless in highly concurrent/dynamic environments.
I would use Consul.io to coordinate semaphores and key/value configuration data in realtime. Access it from Chef recipes or LWRPs using one of the various interfaces Consul provides (HTTP, DNS, …).
You can implement a very easy push-job task to run chef-client (IMHO easier and more powerful than the Chef "push jobs" feature, though not integrated in Chef's ACL/user management) which is also guarded by a distributed semaphore, or use the "Leader Election" feature (see the sketch below). Of course, you'll have to add this logic to your node update script, too.
Chef-client will then retrieve a lock on start and block you from manipulating data while it converges and vice versa.
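A rough sketch of that kind of coordination, assuming the python-consul client and a hypothetical per-node key layout; both the chef-client wrapper and the external UI could guard their writes like this:

import consul

c = consul.Consul()  # talks to the local Consul agent

# The session TTL should outlive the longest expected chef run, plus a margin.
session_id = c.session.create(ttl=3600, behavior="release")

# Try to take the per-node lock before converging or editing node data.
acquired = c.kv.put("locks/chef/mynode.example.com", "busy", acquire=session_id)
if not acquired:
    raise SystemExit("node is locked by another writer, try again later")

try:
    run_chef_client_or_update_node()  # placeholder for the guarded work
finally:
    c.kv.put("locks/chef/mynode.example.com", "done", release=session_id)
    c.session.destroy(session_id)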
Discovered this one in production and came to the conclusion that there is no safe way to edit the node attributes directly. Leave it to the chef-client :-)
Good news is that there are other, more reliable ways to set node attributes. Chef roles and environments can both be edited safely while a client is running and only take effect during the next chef run. Additionally, node attribute precedence rules ensure that any settings you make override those that might be made by a recipe.
I suggest avoiding Chef node data updates from your external app and moving the desired node configuration to a Chef data bag.
So nodes will read both the Chef node data and the configuration data bag but write only to node data, and your external app will read both but write only to the data bag.
If you want to avoid a dependency on another external service, perhaps you could use some kind of time slicing.
Roughly: nodes only start a chef-client run on odd minutes; the API only updates Chef data on even minutes (distribute these even minutes if you have more than one queue).
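A tiny sketch of that convention (purely hypothetical; both sides just check the clock before touching node data):

from datetime import datetime

def my_turn(is_chef_client):
    # chef-client runs on odd minutes, the external API writes on even minutes.
    odd = datetime.utcnow().minute % 2 == 1
    return odd if is_chef_client else not odd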

Work around celerybeat being a single point of failure

I'm looking for a recommended solution to work around celerybeat being a single point of failure for a celery/rabbitmq deployment. I haven't found anything that makes sense so far by searching the web.
In my case, a once-a-day timed scheduler kicks off a series of jobs that could run for half a day or longer. Since there can only be one celerybeat instance, if something happens to it or the server that it's running on, critical jobs will not be run.
I'm hoping there is already a working solution for this, as I can't be the only one who needs reliable (clustered or the like) scheduler. I don't want to resort to some sort of database-backed scheduler, if I don't have to.
There is an open issue in the celery GitHub repo about this. I don't know if they are working on it, though.
As a workaround you could add a lock for tasks so that only 1 instance of specific PeriodicTask will run at a time.
Something like:
if not cache.add('My-unique-lock-name', True, timeout=lock_timeout):
    return
Figuring out the lock timeout is, well, tricky. We're using 0.9 * the task's run_every seconds, in case different celerybeats try to run them at different times.
0.9 just to leave some margin (e.g. when celery is a little behind schedule once, then back on schedule, which would cause the lock to still be active).
Then you can run a celerybeat instance on all machines. Each task will be queued by every celerybeat instance, but only one of them will finish the run.
Tasks will still respect run_every this way - worst case scenario: tasks will run at 0.9*run_every speed.
One issue with this approach: if tasks were queued but not processed at the scheduled time (for example, because the queue processors were unavailable), then the lock may be placed at the wrong time, possibly causing the next task to simply not run. To work around this, you would need some kind of detection mechanism for whether the task is more or less on time.
Still, this shouldn't be a common situation in production.
Another solution is to subclass the celerybeat Scheduler and override its tick method. Then for every tick, add a lock before processing tasks. This makes sure that celerybeats with the same periodic tasks won't queue the same tasks multiple times. Only one celerybeat for each tick (the one that wins the race) will queue tasks. If one celerybeat goes down, another one will win the race on the next tick (a minimal sketch follows below).
This of course can be used in combination with the first solution.
Of course, for this to work, the cache backend needs to be replicated and/or shared across all servers.
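A minimal sketch of that scheduler approach, assuming a Django-style shared cache and the celery.beat.Scheduler interface; the lock name and timeout are hypothetical and should roughly match your beat interval:

from celery.beat import Scheduler
from django.core.cache import cache  # must be a shared backend (redis, memcached, ...)

class LockedScheduler(Scheduler):
    lock_timeout = 60  # hypothetical; should roughly cover one tick interval

    def tick(self, *args, **kwargs):
        # Only the instance that wins this add() queues tasks for this tick.
        if not cache.add("celerybeat-tick-lock", True, timeout=self.lock_timeout):
            return self.max_interval  # lost the race, sleep until the next tick
        return super().tick(*args, **kwargs)

You would then point each beat instance at it, e.g. with celery beat -S myapp.schedulers.LockedScheduler (module path hypothetical).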
It's an old question, but I hope this helps someone.