How to evenly distribute DAG runs throughout the day - Airflow scheduler

I have a huge number of DAGs (>>100,000) that should each run once a day.
In order to avoid big spikes in processing at certain times during the day (and for other reasons), I would like the actual DAG runs to be distributed evenly throughout the day.
Do I need to do this programmatically myself, by spreading the start_date values throughout the day, or is there a better way where Airflow does that for me?
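To make the question concrete, this is roughly what I mean by distributing the start_date myself. A minimal sketch with a hypothetical DAG id, using recent Airflow 2.x parameter names (schedule_interval on older versions):

```python
# Sketch: derive a per-DAG offset from a hash of the DAG id and fold it into
# start_date, so each DAG's daily run fires at a different time of day.
# DAG id and operator are placeholders.
import hashlib
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.empty import EmptyOperator

dag_id = "example_dag_000123"  # hypothetical

# Deterministic minute-of-day in [0, 1440) derived from the DAG id.
offset_minutes = int(hashlib.md5(dag_id.encode()).hexdigest(), 16) % 1440

with DAG(
    dag_id=dag_id,
    # Daily schedule anchored at an offset start time, so this DAG's runs
    # land at the same (spread-out) time every day.
    start_date=datetime(2023, 1, 1) + timedelta(minutes=offset_minutes),
    schedule=timedelta(days=1),
    catchup=False,
) as dag:
    EmptyOperator(task_id="placeholder")
```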

One possible solution: if you create one or more pools, each with a limited number of slots, you can effectively set a 'maximum parallelism' of execution, and tasks will wait until a slot is available. However, it may not give you quite the flexibility you need.
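A minimal sketch of the pool approach; the pool name, slot count, and DAG are hypothetical, the pool itself would be created in the UI or with the CLI (e.g. `airflow pools set daily_ingest_pool 50 "throttle daily work"`), and parameter names follow recent Airflow 2.x:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="pool_throttled_example",   # hypothetical
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    BashOperator(
        task_id="do_work",
        bash_command="echo processing",
        pool="daily_ingest_pool",      # task waits here until a pool slot is free
    )
```

Whatever slot count you pick, the workers will run at most that many of these tasks concurrently, regardless of how many DAG runs are queued.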

Related

Several group-bys in PySpark by different columns

I have a big table in Databricks; this table has 500 columns and I want to compute a groupBy for each column, like this:
from pyspark.sql.functions import lit

list_grp = [
    df_ref_crt.groupBy('SEGMENT', colum)
              .count()
              .withColumnRenamed(colum, "new_name")
              .withColumn('variable', lit(colum))
    for colum in df_ref_crt.columns
]
The code works but it is too slow, so my question is whether there is a faster way to do this.
(The purpose of this is to calculate PSI and this is the reference; of course there is another groupBy that I have to do with respect to the same variables and Period.)
Thanks.
You can improve Databricks performance by implementing the suggestions below. This should definitely improve your execution performance:
Use a larger cluster: a general cluster configuration with two workers with four cores each takes forever to do anything. It's actually not any more expensive to use a large cluster for a workload than it is to use a smaller one; it's just faster.
Note that you're renting the cluster for the length of the workload. So, if you spin up that two-worker cluster and it takes an hour, you're paying for those workers for the full hour. However, if you spin up a four-worker cluster and it takes only half an hour, the cost is actually the same!
Delta cache: if you're using regular clusters, be sure to use the L series or E series on Azure Databricks. These have fast SSDs and caching enabled by default.
You can refer to the official Microsoft documentation to learn more about optimization.
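If you need to turn the disk (delta) cache on explicitly, for example on a cluster type where it is not enabled by default, it is controlled by a Spark conf. A minimal sketch, assuming the notebook-provided spark session on a Databricks runtime:

```python
# Enable the Databricks disk (delta) cache for this cluster/session;
# on Ls/E-series workers it is typically on by default already.
spark.conf.set("spark.databricks.io.cache.enabled", "true")
```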

Druid Cluster going into Restricted Mode

We have a Druid Cluster with the following specs
3X Coordinators & Overlords - m5.2xlarge
6X Middle Managers(Ingest nodes with 5 slots) - m5d.4xlarge
8X Historical - i3.4xlarge
2X Router & Broker - m5.2xlarge
The cluster often goes into restricted mode.
All calls to the cluster get rejected with a 502 error.
Even with 30 available slots for index_parallel tasks, the cluster only runs 10 at a time and the other tasks go into a waiting state.
Loader task submission time has been increasing monotonically from 1s, 2s, ..., 6s, ..., 10s (we submit a job to load the data in S3); after recycling the cluster the submission time decreases and then increases again over a period of time.
We submit around 100 jobs per minute, but we need to scale to 300 to catch up with our current incoming load.
Could someone help us with our questions:
Tune the specs of the cluster
What parameters should be optimized to run the maximum number of tasks in parallel without increasing the load on the master nodes?
Why is the loader task submission time increasing, and what parameters should be monitored here?
Submitting 100 jobs per minute is probably why the Overlord is being overloaded.
The Overlord initiates a job by communicating with the MiddleManagers across the cluster. It defines the tasks that each MiddleManager will need to complete and it monitors task progress until completion. The startup of each job has some overhead, so submitting that many jobs likely keeps the Overlord busy and prevents it from processing the other jobs you are requesting. This might explain why the time for job submissions increases over time. You could increase the resources on the Overlord, but this sounds like there may be a better way to ingest the data.
The recommendation would be to use far fewer jobs and have each job do more work.
If the flow of data is as continuous as you describe, perhaps a Kafka queue would be the best target, followed up with a Druid Kafka ingestion job, which is fully scalable.
If you need to do batch, a single index_parallel job that reads many files would be much more efficient than many jobs (see the sketch below).
Also consider that each task in an ingestion job creates a set of segments. By running a lot of really small jobs, you create a lot of very small segments, which is not ideal for query performance. Here is some info on how to think about segment size optimization which I think might help.
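To illustrate the single index_parallel job mentioned above, here is a minimal sketch. The bucket, datasource, column names, and Overlord host are all hypothetical; only the Overlord task-submission endpoint and the general spec shape are standard:

```python
# One HTTP call submits one job covering a whole prefix of S3 files,
# instead of ~100 separate loader jobs per minute.
import requests

spec = {
    "type": "index_parallel",
    "spec": {
        "ioConfig": {
            "type": "index_parallel",
            "inputSource": {
                "type": "s3",
                "prefixes": ["s3://my-bucket/events/2023-01-01/"],  # hypothetical
            },
            "inputFormat": {"type": "json"},
        },
        "tuningConfig": {
            "type": "index_parallel",
            # how many sub-tasks this single job may fan out to
            "maxNumConcurrentSubTasks": 10,
        },
        "dataSchema": {
            "dataSource": "events",
            "timestampSpec": {"column": "ts", "format": "iso"},
            "dimensionsSpec": {"dimensions": []},  # schemaless discovery
            "granularitySpec": {"segmentGranularity": "day"},
        },
    },
}

resp = requests.post("http://overlord:8090/druid/indexer/v1/task", json=spec)
print(resp.json())  # {"task": "<task id>"} if accepted
```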

Resource Maintenance

I have a resource pool with 10 units. I want only 1 of those units to go for scheduled maintenance. But the Maintenance option under the resource tab sends all 10 units for maintenance at the same time. Can I select only 1 at a time? I just want 1 out of the 10 to go under maintenance according to the schedule. How can I do this?
Just split the 10 resources into two resource pools, one with a maintenance schedule and one without.
Seize from both pools for all blocks where you were seizing from the original one.
Please show your setup for the resource pool. By default, maintenance is created for each unit in the resource pool individually; depending on your settings they should not all go for maintenance at the same time.
See the simple example below. I changed the default maintenance settings slightly to get a better example.
I never get a situation where all 10 go for maintenance at the same time. See below; the time plot is simply plotting resourcePool.idle().
If, however, you used deterministic values instead of distributions in your maintenance setup, you will find that they all go off and on maintenance at exactly the same time.

Overhead of using many partitions in Postgres

In the following link, the creator of a tool I use (Airflow) suggests creating partitions for daily snapshots of dimension tables. I am wondering about the overhead of doing something like this in Postgres.
I am using the Postgres 10 built-in partitioning for several tables, but mostly at a monthly or yearly level for facts. I have never tried implementing daily partitions for dimensions before and it seems scary. It would simplify things for me in several areas, though, in case I need to rerun old tasks.
https://medium.com/@maximebeauchemin/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a
Simple. With dimension snapshots where a new partition is appended at each ETL schedule. The dimension table becomes a collection of dimension snapshots where each partition contains the full dimension as-of a point in time. "But only a small percentage of the data changes every day, that's a lot of data duplication!". That's right, though typically dimension tables are negligible in size in proportion to facts. It's also an elegant way to solve SCD-type problematic by its simplicity and reproducibility. Now that storage and compute are dirt cheap compared to engineering time, snapshoting dimensions make sense in most cases.
While the traditional type-2 slowly changing dimension approach is conceptually sound and may be more computationally efficient overall, it's cumbersome to manage. The processes around this approach, like managing surrogate keys on dimensions and performing surrogate key lookup when loading facts, are error-prone, full of mutations and hardly reproducible.
I have worked with systems with different levels of partitioning.
Generally any partitioning is OK as long as you have check constraints on the partitions which allow the query planner to find the adequate partitions for a query (or you will have to query a specific partition directly for some special cases). Otherwise you will see sequential scans over all partitions even for simple queries.
Daily partitions are completely OK, do not worry. I even worked with a data collector based on PG which needed partitions for every 5 minutes of data because it collected several TB per day.
The number of partitions only becomes a bigger problem when you have several thousand or tens of thousands of partitions - with that many partitions everything moves to a different level of problems.
For example, you will have to set a proper max_locks_per_transaction to be able to work with them, because even a simple select over the parent table places an AccessShareLock on all partitions, which is not exactly nice, but PG inheritance works this way.
There is also higher planning time for queries - in our data warehouse we sometimes see planning times of several minutes for queries that take only seconds to execute, which is not great... But it is hard to do anything about it because the current PG planner works this way.
The pros still outweigh the cons, so I highly recommend using whatever partitioning granularity you need.
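For the daily dimension-snapshot case specifically, a minimal sketch using psycopg2 with hypothetical table and column names; Postgres 10+ declarative range partitioning gives each partition implicit bounds that the planner uses to prune when queries filter on snapshot_date:

```python
# Create a dimension-snapshot parent table partitioned by day, plus one
# day's partition (typically the ETL task creates the partition it loads).
import datetime

import psycopg2

conn = psycopg2.connect("dbname=dwh")  # hypothetical connection string
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS dim_customer_snapshot (
        snapshot_date date   NOT NULL,
        customer_id   bigint NOT NULL,
        attributes    jsonb
    ) PARTITION BY RANGE (snapshot_date);
""")

day = datetime.date(2023, 1, 1)
cur.execute(f"""
    CREATE TABLE IF NOT EXISTS dim_customer_snapshot_{day:%Y%m%d}
    PARTITION OF dim_customer_snapshot
    FOR VALUES FROM ('{day}') TO ('{day + datetime.timedelta(days=1)}');
""")

conn.commit()
```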

Celery - Granular tasks vs. message passing overhead

The Celery docs section Performance and Strategies suggests that tasks with multiple 'steps' should be divided into subtasks for more efficient parallelization. It then mentions that (of course) there will be more message-passing overhead, so dividing into subtasks may not be worth the overhead.
In my case, I have an overall task of retrieving a small image (150px x 115px) from a third-party API and then uploading it via HTTP to my site's REST API. I can either implement this as a single task, or divide the steps of retrieving the image and then uploading it into two separate tasks. If I go with separate tasks, I assume I will have to pass the image as part of the message to the second task.
My question is, which approach should be better in this case, and how can I measure the performance in order to know for sure?
Since your jobs are I/O-constrained, dividing the task may increase the number of operations that can be done in parallel. The message-passing overhead is likely to be tiny, since any capable broker should be able to handle lots of messages per second with only a few milliseconds of latency.
In your case, uploading the image will probably take longer than downloading it. With separate tasks, the download jobs needn't wait for uploads to finish (as long as there are available workers). Another advantage of separation is that you can put each job on a different queue and dedicate more workers as backed-up queues reveal themselves.
If I were to benchmark this, I would compare execution times using the same number of workers for each of the two strategies: for instance, 2 workers on the combined task vs. 2 workers on the divided one, then 4 workers on each, and so on. My inclination is that the separated tasks will show themselves to be faster, especially as the worker count is increased.
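A minimal sketch of both variants, assuming Celery with a Redis broker and hypothetical URLs and queue names; the image bytes are base64-encoded so the default JSON serializer can carry them between the two subtasks:

```python
import base64

import requests
from celery import Celery, chain

app = Celery("images", broker="redis://localhost:6379/0")  # assumed broker


@app.task
def fetch_and_upload(image_url, upload_url):
    """Combined variant: download and upload inside a single task."""
    data = requests.get(image_url).content
    requests.post(upload_url, files={"file": ("image.png", data)})


@app.task
def fetch_image(image_url):
    """Divided variant, step 1: return the image as a base64 string."""
    return base64.b64encode(requests.get(image_url).content).decode()


@app.task
def upload_image(encoded, upload_url):
    """Divided variant, step 2: decode and upload the image."""
    requests.post(
        upload_url,
        files={"file": ("image.png", base64.b64decode(encoded))},
    )


# The chain forwards fetch_image's return value as upload_image's first
# argument; .set(queue=...) routes each step to its own queue/workers.
chain(
    fetch_image.s("https://thirdparty.example/img.png").set(queue="download"),
    upload_image.s("https://mysite.example/api/images/").set(queue="upload"),
).apply_async()
```

Running separate workers for the "download" and "upload" queues (e.g. `celery -A tasks worker -Q download`) makes the benchmark comparison above straightforward: fix the worker counts, push the same batch of images through each variant, and compare total wall-clock time.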