Airflow celery worker will be blocked if sensor number large than concurrency? - celery

Let's say, I set celery concurrency to n, but I have m(m>n) ExternalTaskSensor in a dag, it will check another dag named do_sth, these ExternalTaskSensor will consume all celery worker, so that no one will work in fact.
But I can't set concurreny too high(like 2*m), because dag do_sth may start too many process which will lead to out of memory.
I am confused what number I should set to celery concurrency?

In ETL best practices with Airflow's Gotchas section the author addresses this general problem. One of the suggestions is to setup a pool for your sensor tasks so that your other tasks don't get starved. For your situation determine the number of sensor tasks that you want running at one time (less than your concurrency level) and setup a pool with that as a limit. Once your pool is setup pass the pool argument to each of your sensor operators.
For more on pools see Airflow's documentation on concepts. Here is an example of passing a pool argument to an operator:
aggregate_db_message_job = BashOperator(
task_id='aggregate_db_message_job',
execution_timeout=timedelta(hours=3),
pool='ep_data_pipeline_db_msg_agg',
bash_command=aggregate_db_message_job_cmd, dag=dag)

Related

Is there a Cadence metric that can help spot overloads for each specific activity worker?

My company would like to automatically scale the activity workers and each workflow workers independently according to the load of a tasklist.
Reading the docs I have found the following metrics for activity workers:
cadence_activity_scheduled_to_start_latency_bucket
cadence_activity_scheduled_to_start_latency_count
cadence_activity_scheduled_to_start_latency_sum
However these seem to be global metrics for activity workers. Is there a Cadence metric that would allow me to spot overloads for each specific activity worker?
Example:
We have 4 different activity workers : A, B, C and D
We would like to scale independently A or B or C or D without impacting the others
Understand scheduled_to_start_latency
scheduled_to_start_latency is a measurement of the time from scheduled to started by worker. From scheduled to started, a task is transferred from matching service to an activity worker.
These are the potential hotspots when this latency got high:
The matching service is too hot to dispatch tasks -- in this case, need to confirm with CPU/memory of the matching nodes
The tasklist is overloaded because it defaults to have one partition which mapped to only one matching node: https://cadenceworkflow.io/docs/operation-guide/maintain/#scale-up-a-tasklist-using-scalable-tasklist-feature -- in this case, use task per second metrics to confirm the task rate of the tasklist
The activity worker is overloaded.
How to monitor activity worker being overloaded
CPU/memory/Thread usage/Garbage collection of the activity worker is usually enough to make sure an worker is not overloaded
You can also use scheduled_to_start_latency, but the high latency could mean different things like above. Use other metrics to rule out the causes.

Apache Spark - How does internal job scheduler in spark define what are users and what are pools

I am sorry about being a little general here, but I am a little confused about how job scheduling works internally in spark. From the documentation here I get that it is some sort of implementation of Hadoop Fair Scheduler.
I am unable to come around to understand that who exactly are users here (are the linux users, hadoop users, spark clients?). I am also unable to understand how are the pools defined here. For example, In my hadoop cluster I have given resource allocation to two different pools (lets call them team 1 and team 2). But in spark cluster, wont different pools and the users in them instantiate their own spark context? Which again brings me to question that what parameters do I pass when I am setting property to spark.scheduler.pool.
I have a basic understanding of how driver instantiates a spark context and then splits them into task and jobs. May be I am missing the point completely here but I would really like to understand how Spark's internal scheduler works in context of actions, tasks and job
I find official documentation quite thorough and covering all your questions. However, one might find it hard to digest from the first time.
Let us put some definitions and rough analogues before we delve into details. application is what creates SparkContext sc and may be referred to as something you deploy with spark-submit. job is an action in spark definition of transformation and action meaning anything like count, collect etc.
There are two main and in some sense separate topics: Scheduling Across applications and Scheduling Within application. The former relates more to Resource Managers including Spark Standalone FIFO only mode and also concept of static and dynamic allocation.
The later, Scheduling Within Spark application is the matter of your question, as I understood from your comment. Let me try to describe what happens there at some level of abstraction.
Suppose, you submitted your application and you have two jobs
sc.textFile("..").count() //job1
sc.textFile("..").collect() //job2
If this code happens to be executed in the same thread there is no much interesting happening here, job2 and all its tasks get resources only after job1 is done.
Now say you have the following
thread1 { job1 }
thread2 { job2 }
This is getting interesting. By default, within your application scheduler will use FIFO to allocate resources to all the tasks of whichever job happens to appear to scheduler as first. Tasks for the other job will get resources only when there are spare cores and no more pending tasks from more "prioritized" first job.
Now suppose you set spark.scheduler.mode=FAIR for your application. From now on each job has a notion of pool it belongs to. If you do nothing then for every job pool label is "default". To set the label for your job you can do the following
sc.setLocalProperty("spark.scheduler.pool", "pool1").textFile("").count() // job1
sc.setLocalProperty("spark.scheduler.pool", "pool2").textFile("").collect() // job2
One important note here is that setLocalProperty is effective per thread and also all spawned threads. What it means for us? Well if you are within the same thread it means nothing as jobs are executed one after another.
However, once you have the following
thread1 { job1 } // pool1
thread2 { job2 } // pool2
job1 and job2 become unrelated in the sense of resource allocation. In general, properly configuring each pool in fairscheduler file with minShare > 0 you can be sure that jobs from different pools will have resources to proceed.
However, you can go even further. By default, within each pool jobs are queued up in a FIFO manner and this situation is basically the same as in the scenario when we have had FIFO mode and jobs from different threads. To change that you you need to change the pool in the xml file to have <schedulingMode>FAIR</schedulingMode>.
Given all that, if you just set spark.scheduler.mode=FAIR and let all the jobs fall into the same "default" pool, this is roughly the same as if you would use default spark.scheduler.mode=FIFO and have your jobs be launched in different threads. If you still just want single "default" fair pool just change config for "default" pool in xml file to reflect that.
To leverage the mechanism of pools you need to define the concept of user which is the same as setting "spark.scheduler.pool" from a proper thread to a proper value. For example, if your application listens to JMS, then a message processor may set the pool label for each message processing job depending on its content.
Eventually, not sure if the number of words is less than in the official doc, but hopefully it helps is some way :)
By default spark works with FIFO scheduler where jobs are executed in FIFO manner.
But if you have your cluster on YARN, YARN has pluggable scheduler, it means in YARN you can scheduler of your choice. If you are using YARN distributed by CDH you will have FAIR scheduler by deafult but you can also go for Capacity scheduler.
If you are using YARN distributed by HDP you will have CAPACITY scheduler by default and you can move to FAIR if you need that.
How Scheduler works with spark?
I'm assuming that you have your spark cluster on YARN.
When you submit a job in spark, it first hits your resource manager. Now your resource manager is responsible for all the scheduling and allocating resources. So its basically same as that of submitting a job in Hadoop.
How scheduler works?
Fair scheduling is a method of assigning resources to jobs such that all jobs get, on average, an equal share of resources over time. When there is a single job running, that job uses the entire cluster. When other jobs are submitted, tasks slots that free up are assigned to the new jobs, so that each job gets roughly the same amount of CPU time(using preemption killing all over used tasks). Unlike the default Hadoop scheduler(FIFO), which forms a queue of jobs, this lets short jobs finish in reasonable time while not starving long jobs. It is also a reasonable way to share a cluster between a number of users. Finally, fair sharing can also work with job priorities - the priorities are used as weights to determine the fraction of total compute time that each job should get.
The CapacityScheduler is designed to allow sharing a large cluster while giving each organization a minimum capacity guarantee. The central idea is that the available resources in the Hadoop Map-Reduce cluster are partitioned among multiple organizations who collectively fund the cluster based on computing needs. There is an added benefit that an organization can access any excess capacity no being used by others. This provides elasticity for the organizations in a cost-effective manner.
Spark internally uses FIFO/FCFS job scheduler. But, when you talk about the tasks, it works in a Round Robin fashion. It will be clear if we concentrate on the below example:
Suppose, the first job in Spark's own queue doesn't require all the resources of the cluster to be utilized; so, immediately second job in the queue will also start getting executed. Now, both jobs are running simultaneously. Each job has few tasks to be executed in order to execute the whole job. Assume, the first job assigns 10 tasks and the second one assigns 8. Then, those 18 tasks will share the CPU cycles of the whole cluster in a preemptive manner. If you want to further drill down, lets start with executors.
There will be few executors in the cluster. Assume the number is 6. So, in an ideal condition, each executor will be assigned 3 tasks and those 3 tasks will get same CPU time of the executors(separate JVM).
This is how spark internally schedules the tasks.

Least load scheduler

I'm working on a system that uses several hundreds of workers in parallel (physical devices evaluating small tasks). Some workers are faster than others so I was wondering what the easiest way to load balance tasks on them without a priori knowledge of their speed.
I was thinking about keeping track of the number of tasks a worker is currently working on with a simple counter and then sorting the list to get the worker with the lowest active task count. This way slow workers would get some tasks but not slow down the whole system. The reason I'm asking is that the current round-robin method is causing hold up with some really slow workers (100 times slower than others) that keep accumulating tasks and blocking new tasks.
It should be a simple matter of sorting the list according to the current number of active tasks, but since I would be sorting the list several times a second (average work time per task is below 25ms) I fear that this might be a major bottleneck. So is there a simple version of getting the worker with the lowest task count without having to sort over and over again.
EDIT: The tasks are pushed to the workers via an open TCP connection. Since the dependencies between the tasks are rather complex (exclusive resource usage) let's say that all tasks are assigned to start with. As soon as a task returns from the worker all tasks that are no longer blocked are queued, and a new task is pushed to the worker. The work queue will never be empty.
How about this system:
Worker reaches the end of its task queue
Worker requests more tasks from load balancer
Load balancer assigns N tasks (where N is probably more than 1, perhaps 20 - 50 if these tasks are very small).
In this system, since you are assigning new tasks when the workers are actually done, you don't have to guess at how long the remaining tasks will take.
I think that you need to provide more information about the system:
How do you get a task to a worker? Does the worker request it or does it get pushed?
How do you know if a worker is out of work, or even how much work is it doing?
How are the physical devices modeled?
What you want to do is avoid tracking anything and find a more passive way to distribute the work.

How to design task distribution with ZooKeeper

I am planning to write an application which will have distributed Worker processes. One of them will be Leader which will assign tasks to other processes. Designing the Leader elelection process is quite simple: each process tries to create a ephemeral node in the same path. Whoever is successful, becomes the leader.
Now, my question is how to design the process of distributing the tasks evenly? Any recipe for this?
I'll elaborate a little on the environment setup:
Suppose there are 10 worker maschines, each one runs a process, one of them become leader. Tasks are submitted in the queue, the Leader takes them and assigns to a worker. The worker processes gets notified whenever a tasks is submitted.
I am not sure I understand your algorithm for Leader election, but the recommended way of implementing this is to use sequential ephemeral nodes and use the algorithm at http://zookeeper.apache.org/doc/r3.3.3/recipes.html#sc_leaderElection which explains how to avoid the "herd" effect.
Distribution of tasks can be done with a simple distributed queue and does not strictly need a Leader. The producer enqueues tasks and consumers keep a watch on the tasks node - a triggered watch will lead the consumer to take a task and delete the associated znode. There are certain edge conditions to consider with requeuing tasks from failed consumers. http://zookeeper.apache.org/doc/r3.3.3/recipes.html#sc_recipes_Queues
I would recommend the section Example: Master-Worker Application of this book ZooKeeper Distributed Process Coordination http://shop.oreilly.com/product/0636920028901.do
The example demonstrates to distribute tasks to worker using znodes and common zookeeper commands.
Consider using an actor singleton service pattern. For example, in Scala there is Akka which solves this class of problem with less code.

Select node in Quartz cluster to execute a job

I have some questions about Quartz clustering, specifically about how triggers fire / jobs execute within the cluster.
Does quartz give any preference to nodes when executing jobs? Such as always or never the node that executed the same job the last time, or is it simply whichever node that gets to the job first?
Is it possible to specify the node which should execute the job?
The answer to this will be something of a "it depends".
For quartz 1.x, the answer is that the execution of the job is always (only) on a more-or-less random node. Where "randomness" is really based on whichever node gets to it first. For "busy" schedulers (where there are always a lot of jobs to run) this ends up giving a pretty balanced load across the cluster nodes. For non-busy scheduler (only an occasional job to fire) it may sometimes look like a single node is firing all the jobs (because the scheduler looks for the next job to fire whenever a job execution completes - so the node just finishing an execution tends to find the next job to execute).
With quartz 2.0 (which is in beta) the answer is the same as above, for standard quartz. But the Terracotta folks have built an Enterprise Edition of their TerracottaJobStore which offers more complex clustering control - as you schedule jobs you can specify which nodes of the cluster are valid for the execution of the job, or you can specify node characteristics/requisites, such as "a node with at least 100 MB RAM available". This also works along with ehcache, such that you can specify the job to run "on the node where the data keyed by X is local".
I solved this question for my web application using Spring + AOP + memcached. My jobs do know from the data they traverse if the job has already been executed, so the only thing I need to avoid is two or more nodes running at the same time.
You can read it here:
http://blog.oio.de/2013/07/03/cluster-job-synchronization-with-spring-aop-and-memcached/