Does Dataproc have a resource allocation limit per job - google-cloud-dataproc

Let's say I have a Dataproc cluster of 100 worker nodes with a certain spec.
When I submit a job to Dataproc, is there a usage allocation limit for each job,
e.g. job A cannot use more than 50% of the total nodes?
Is there this kind of limit, or can any job allocate all the resources of the cluster?

There is no such per-job limit on Dataproc. One job can use all of YARN's resources, and that is usually the default configuration for the various job types on Dataproc. But users can set per-job limits as they see fit, e.g., for Spark, disable dynamic allocation and set the number of executors and the memory size of each executor.
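For example, a minimal sketch of capping a single Spark job at submit time (the cluster name, region, jar, and the specific sizes are placeholders, not recommendations):
# Disable dynamic allocation and pin the executor count/size so this job
# cannot grow beyond roughly 25 executors x 4 cores.
gcloud dataproc jobs submit spark \
  --cluster=my-cluster --region=us-central1 \
  --class=com.example.MyJob --jars=gs://my-bucket/my-job.jar \
  --properties=spark.dynamicAllocation.enabled=false,spark.executor.instances=25,spark.executor.cores=4,spark.executor.memory=8g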

Related

Cluster Resource Usage in Databricks

I was just wondering if anyone could explain if all compute resources in a Databricks cluster are shared or if the resources are tied to each worker. For example, if two users were connected to a cluster made up of 2 workers with 4 cores per worker and one user's job required 2 cores and the other's required 6 cores, would they be able to share the 8 total cores or would the full 4 cores from one worker be unavailable during the job that only required 2 cores?
TL;DR: Yes, the default behavior is to allow sharing, but you're going to have to tightly control the default parallelism with such a small cluster.
Take a look at Job Scheduling for Apache Spark. I'm assuming you are using an "all-purpose" / "interactive" cluster where users are working on notebooks OR you are submitting jobs to an existing, all-purpose cluster and it is NOT a job cluster with multiple spark applications being deployed.
Databricks Runs in FAIR Scheduling Mode by Default
Under fair sharing, Spark assigns tasks between jobs in a “round robin” fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish. This mode is best for multi-user settings.
By default, all queries started in a notebook run in the same fair scheduling pool
The Apache Spark scheduler in Azure Databricks automatically preempts tasks to enforce fair sharing.
Apache Spark Defaults to FIFO
By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.
Keep in mind the word "job" is a specific Spark term that represents an action being taken that launches one or more stages and tasks. See What is the concept of application, job, stage and task in spark?.
So in your example you have...
2 Workers with 4 cores each == 8 cores == 8 tasks can be handled in parallel
One application (App A) that has a job that launches a stage with only 2 tasks.
One application (App B) that has a job that launches a stage with 6 tasks.
In this case, YES, you will be able to share the resources of the cluster. However, the devil is in the default behaviors. If you're reading from many files, performing a join, aggregating, etc., you're going to run into the fact that Spark is going to partition your data into chunks that can be acted on in parallel (see settings like spark.default.parallelism).
So, in a more realistic example, you're going to have...
2 Workers with 4 cores each == 8 cores == 8 tasks can be handled in parallel
One application (App A) that has a job that launches a stage with 200 tasks.
One application (App B) that has a job that launches three stages with 8, 200, and 1 tasks respectively.
In a scenario like this, FIFO scheduling (the Apache Spark default) will result in one of these applications blocking the other, since the available executors are completely overwhelmed by the number of tasks in just one stage.
In FAIR scheduling mode, there will still be some blocking since the number of executors is small, but some work will be done on each job because FAIR scheduling does a round robin at the task level.
In Apache Spark, you get tighter control by creating different resource pools and submitting apps only to those pools, where they have "isolated" resources. The "better" way of doing this is with Databricks job clusters that have isolated compute dedicated to the application being run.
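As a rough sketch of the open-source Spark side of that (the pool name, file path, and shares are made up; on Databricks you would normally lean on per-notebook fair scheduler pools or job clusters instead):
# Define a pool and enable FAIR scheduling for a plain Spark application.
cat > /tmp/fairscheduler.xml <<'EOF'
<?xml version="1.0"?>
<allocations>
  <pool name="etl">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
  </pool>
</allocations>
EOF
spark-submit \
  --conf spark.scheduler.mode=FAIR \
  --conf spark.scheduler.allocation.file=/tmp/fairscheduler.xml \
  my_app.py
# Inside the application, a job opts into a pool with
# sc.setLocalProperty("spark.scheduler.pool", "etl").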

How can I increase the max number of concurrent jobs in Dataproc?

I need to run hundreds of concurrent jobs in a Dataproc cluster. Each job is pretty lightweight (e.g., a Hive query that fetches a table's metadata) and doesn't take many resources. But there seem to be some unknown factors that limit the maximum number of concurrent jobs. What can I do if I want to increase the max concurrency limit?
If you are submitting the jobs through the Dataproc API / CLI, these are the factors which affect the max number of concurrent jobs:
The number of master nodes;
The master memory size;
The cluster properties dataproc:agent.process.threads.job.max and dataproc:dataproc.scheduler.driver-size-mb, see Dataproc Properties for more details.
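For example, a hedged sketch of raising both properties at cluster creation time (the cluster name, region, and values are placeholders, not recommendations):
# These cluster properties are set when the cluster is created.
gcloud dataproc clusters create my-cluster \
  --region=us-central1 \
  --properties=dataproc:agent.process.threads.job.max=100,dataproc:dataproc.scheduler.driver-size-mb=512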
For debugging, when submitting jobs with gcloud, SSH into the master node and run ps aux | grep dataproc-launcher.py | wc -l every few seconds to see how many concurrent jobs are running. At the same time, you can run tail -f /var/log/google-dataproc-agent.0.log to monitor how the agent is launching the jobs. You can tune the parameters above to get higher concurrency.
You can also try submitting the jobs directly from the master node through spark-submit or Hive beeline, which will bypass the Dataproc job concurrency control mechanism. This can help you identify where the bottleneck is.
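For instance, a quick sanity check over SSH on the master node (the JDBC URL and table name are placeholders; HiveServer2 on Dataproc normally listens on the master on port 10000):
# Bypasses the Dataproc agent entirely, so it only exercises Hive/YARN.
beeline -u jdbc:hive2://localhost:10000 -e "DESCRIBE FORMATTED my_table;"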

Airflow Memory Error: Task exited with return code -9

According to both Link1 and Link2, my Airflow DAG run is returning the error INFO - Task exited with return code -9 due to an out-of-memory issue. My DAG run has 10 tasks/operators, and each task simply:
makes a query to get one of my BigQuery tables, and
writes the results to a collection in my Mongo database.
The size of the 10 BigQuery tables range from 1MB to 400MB, and the total size of all 10 tables is ~1GB. My docker container has default 2GB of memory and I've increased this to 4GB, however I am still receiving this error from a few of the tasks. I am confused about this, as 4GB should be plenty of memory for this. I am also concerned because, in the future, these tables may become larger (a single table query could be 1-2GB), and I'd like to avoid these return code -9 errors at that time.
I'm not quite sure how to handle this issue, since the point of the DAG is to transfer data from BigQuery to Mongo daily, and the data each task holds in memory is therefore necessarily fairly large, given the size of the tables.
As you said, the error message you get corresponds to an out of memory issue.
Referring to the official documentation:
DAG execution is RAM limited. Each task execution starts with two Airflow processes: task execution and monitoring. Currently, each node can take up to 6 concurrent tasks. More memory can be consumed, depending on the size of the DAG.
High memory pressure in any of the GKE nodes will lead the Kubernetes scheduler to evict pods from nodes in an attempt to relieve that pressure. While many different Airflow components are running within GKE, most don't tend to use much memory, so the case that happens most frequently is that a user uploaded a resource-intensive DAG. The Airflow workers run those DAGs, run out of resources, and then get evicted.
You can check it with following steps:
In the Cloud Console, navigate to Kubernetes Engine -> Workloads
Click on airflow-worker, and look under Managed pods
If there are pods that show Evicted, click each evicted pod and look for the "The node was low on resource: memory" message at the top of the window.
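If you prefer the command line, a rough equivalent (this assumes kubectl is already authenticated against the environment's GKE cluster; the pod name and namespace are placeholders):
# List evicted pods, then check the eviction reason on a specific one.
kubectl get pods --all-namespaces | grep -i evicted
kubectl describe pod airflow-worker-xxxxx -n NAMESPACE | grep -i "low on resource"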
What are the possible ways to fix OOM issue?
Create a new Cloud Composer environment with a larger machine type than the current machine type.
Ensure that the tasks in the DAG are idempotent, which means that the result of running the same DAG run multiple times should be the same as the result of running it once.
Configure task retries by setting the number of retries on the task (e.g. retries=3 on the operator or in default_args) - this way, when your task gets -9'ed by the scheduler, it will go to up_for_retry instead of failed
Additionally you can check the behavior of CPU:
In the Cloud Console, navigate to Kubernetes Engine -> Clusters
Locate Node Pools at the bottom of the page, and expand the default-pool section
Click the link listed under Instance groups
Switch to the Monitoring tab, where you can find CPU utilization
Ideally, the GCE instances shouldn't be running over 70% CPU all the time, or the Composer environment may become unstable under heavy resource usage.
I hope you find the above pieces of information useful.
I am going to chunk the data so that less is loaded into any one task at any given time. I'm not sure yet whether I will need to use GCS/S3 for intermediary storage.
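In case it helps, one hedged sketch of that GCS-intermediary route (the project, dataset, bucket, database, and collection names are placeholders, and it assumes the bq, gsutil, and mongoimport CLIs are available where the task runs):
# Export the table to sharded newline-delimited JSON in GCS (the wildcard lets
# BigQuery split the output), then load one shard at a time into Mongo so the
# whole table is never held in memory at once.
bq extract --destination_format=NEWLINE_DELIMITED_JSON \
  'my_project:my_dataset.my_table' 'gs://my-bucket/my_table-*.json'
mkdir -p /tmp/export
gsutil -m cp 'gs://my-bucket/my_table-*.json' /tmp/export/
for f in /tmp/export/my_table-*.json; do
  mongoimport --uri "mongodb://localhost:27017" \
    --db my_db --collection my_collection --file "$f"
done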

Why does Dataproc not create two executors per worker when spark.yarn.executor.memoryOverhead is configured?

Dataproc is supposed to fit two executors per worker (or YARN NodeManager), with each one getting half the cores and half the memory.
And it does work that way.
However, if we override a setting, say spark.yarn.executor.memoryOverhead=4096,
then it only creates one executor per worker. Half the cores and memory of the cluster are not utilized. And no matter how we play around with spark.executor.memory or spark.executor.cores, it still doesn't spin up enough executors to utilize all cluster resources.
How can we make Dataproc still create two executors per worker? The YARN overhead is deducted from the executor memory, so it should still be able to fit two executors, shouldn't it?
When executing in YARN, Spark will request containers with memory sized as spark.executor.memory + spark.yarn.executor.memoryOverhead. If you're adding to memoryOverhead, you will want to subtract an equal amount from spark.executor.memory to preserve the same container packing characteristics.
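For example (purely hypothetical names and sizes): if an executor would otherwise have been given 12g of heap, adding 4g of overhead means requesting 8g of executor memory so the total container size YARN sees stays the same:
# YARN sizes the container as spark.executor.memory + spark.yarn.executor.memoryOverhead,
# so the 4096m added as overhead is taken back out of the executor heap.
gcloud dataproc jobs submit spark \
  --cluster=my-cluster --region=us-central1 \
  --class=com.example.MyJob --jars=gs://my-bucket/my-job.jar \
  --properties=spark.yarn.executor.memoryOverhead=4096,spark.executor.memory=8g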

YARN cluster doesn't equally manage vcores, queue resource limit exceeded

I have 3 YARN NodeManagers working in a YARN cluster, and an issue connected with vcore availability per YARN node.
For example, I have:
on the first node: 15 vcores available,
on the second node: no vcores available,
on the third node: 37 vcores available.
And now a job tries to start and fails with the error:
"Queue's AM resource limit exceeded"
Is this connected with there being no vcores available on the second node, or can I somehow increase the resource limit of the queue?
I also want to mention that I have the following setting:
yarn.scheduler.capacity.maximum-am-resource-percent=1.0
That means that your drivers have exceeded the maximum memory configured in Max Application Master Resources. You can either increase the maximum memory available to AMs or decrease the driver memory in your jobs.
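Assuming these are Spark jobs (an assumption; the same idea applies to other application types), a minimal sketch of the second option, shrinking the driver/AM container at submit time (the 2g value is a placeholder):
# In YARN cluster mode the driver runs inside the ApplicationMaster, so a
# smaller driver means more applications fit under the queue's AM limit.
spark-submit --master yarn --deploy-mode cluster \
  --driver-memory 2g \
  my_app.py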