Avoid over-utilizing Cluster - google-cloud-dataproc

Given a default Dataproc cluster, are there any configurations to avoid overloading the job queue with too many tasks on the YARN side?
For instance, if a spike of job submissions occurs, is there a way to force the cluster to honor a concurrency limit, so that the entire spike of jobs doesn't deplete or crash the YARN master?

As @igor-dvorzhak from Google mentioned, the resolution for this is https://stackoverflow.com/a/49693693/1195652
Ref: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.1.3/bk_system-admin-guide/content/setting_application_limits.html
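The referenced guide covers Capacity Scheduler application limits. As a rough sketch only, assuming the cluster uses the Capacity Scheduler and that the limits are supplied as Dataproc cluster properties with the capacity-scheduler: prefix, the snippet below renders two such limits into a value for gcloud's --properties flag. The numeric values and the helper name to_dataproc_properties are made up for illustration.

```python
# Hypothetical limits; the numbers are placeholders, not recommendations.
yarn_limits = {
    # Upper bound on applications that may be running or pending at once.
    "yarn.scheduler.capacity.maximum-applications": "50",
    # Fraction of cluster resources that ApplicationMasters may occupy.
    "yarn.scheduler.capacity.maximum-am-resource-percent": "0.2",
}


def to_dataproc_properties(limits, prefix="capacity-scheduler"):
    """Render settings as 'prefix:key=value' pairs joined by commas."""
    return ",".join(
        f"{prefix}:{key}={value}" for key, value in sorted(limits.items())
    )


properties_flag = to_dataproc_properties(yarn_limits)
print(f"--properties={properties_flag}")
```

The resulting string would be passed at cluster creation time; whether these particular limits are the right lever depends on which scheduler the cluster actually runs.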

Related

Airflow fault tolerance

I have 2 questions:
First, what does it mean that the Kubernetes executor is fault tolerant? In other words, what happens if one of the worker nodes goes down?
Second, is it possible for the whole Airflow server to go down? If yes, is there a backup that runs automatically to continue the work?
Note: I have started learning Airflow recently.
Thanks in advance
This is a theoretical question that came up while I was learning Apache Airflow. I have read the documentation, but it does not mention how fault tolerance is handled.
What does it mean that the Kubernetes executor is fault tolerant?
The Airflow scheduler uses a Kubernetes API watcher to track state changes of the worker pods (tasks) so that it can discover failed pods. When a worker pod goes down, the scheduler detects the failure and updates the state of the failed tasks in the metadata database; these tasks can then be rescheduled and executed according to their retry configuration.
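To make the retry configuration concrete, here is a minimal sketch of the task-level retry settings, written as a plain dictionary so that it runs without Airflow installed. The values are arbitrary; in a real DAG file the dictionary would typically be passed as default_args.

```python
from datetime import timedelta

# Illustrative values only. In a DAG file this would be used as:
#   DAG(dag_id="example", default_args=default_args, ...)
default_args = {
    "retries": 3,                         # retry a failed task up to 3 times
    "retry_delay": timedelta(minutes=5),  # wait 5 minutes between attempts
}

# A task with retries=3 can run at most 4 times: the first try plus 3 retries.
max_attempts = default_args["retries"] + 1
print(max_attempts)
```

With settings like these, a task whose pod is lost is not failed permanently on the first loss; it is retried after the delay until the retries are used up.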
Is it possible for the whole Airflow server to go down?
Yes, it is possible for several reasons, and there are different solutions/tips for each one:
Problem with the metadata database: this is the most important part of Airflow. It is the central point of communication between the schedulers and workers, and it stores the state of all DAG runs and tasks, the messages shared between tasks, and the variables and connections. When it goes down, everything fails:
You can use a managed service (AWS RDS or Aurora, GCP Cloud SQL or Cloud Spanner, ...)
You can deploy it on your K8S cluster, but in HA mode (doc for postgresql)
Problem with the scheduler: the scheduler runs as a pod, and there is a possibility of losing it, depending on how you deploy it:
Try to request enough resources (especially memory) to avoid OOM problems
Avoid running it on spot/preemptible VMs
Create multiple replicas (minimum 3) of the scheduler to activate HA mode; in this case, if one scheduler goes down, the other schedulers will still be up
Problem with the webserver pod: it doesn't affect your workload, but you will not be able to access the UI/API during the downtime:
Try to request enough resources (especially memory) to avoid OOM problems
It's a stateless service, so you can create multiple replicas without any problem; if one goes down, you can access the UI/API through the other replicas

How can I increase the max num of concurrent jobs in Dataproc?

I need to run hundreds of concurrent jobs in a Dataproc cluster. Each job is pretty lightweight (e.g., a Hive query that fetches table metadata) and doesn't take many resources. But there seem to be some unknown factors that limit the maximum number of concurrent jobs. What can I do to increase the max concurrency limit?
If you are submitting the jobs through the Dataproc API / CLI, these are the factors which affect the max number of concurrent jobs:
The number of master nodes;
The master memory size;
The cluster properties dataproc:agent.process.threads.job.max and dataproc:dataproc.scheduler.driver-size-mb, see Dataproc Properties for more details.
For debugging, when submitting jobs with gcloud, SSH into the master node and run ps aux | grep dataproc-launcher.py | wc -l every few seconds to see how many concurrent jobs are running. At the same time, you can run tail -f /var/log/google-dataproc-agent.0.log to monitor how the agent is launching the jobs. You can tune the parameters above to get higher concurrency.
You can also try submitting the jobs directly from the master node through spark-submit or Hive beeline, which will bypass the Dataproc job concurrency control mechanism. This can help you identify where the bottleneck is.
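The process-counting step above can also be scripted. The sketch below parses ps output and counts lines that mention dataproc-launcher.py. Unlike the raw shell pipeline, it skips the grep process itself. The function names and the sample lines are illustrative, not real output from a cluster.

```python
import subprocess


def count_launcher_processes(ps_output, marker="dataproc-launcher.py"):
    """Count ps lines that mention the launcher, ignoring any grep process."""
    return sum(
        1
        for line in ps_output.splitlines()
        if marker in line and "grep" not in line
    )


def count_running_jobs():
    """Run 'ps aux' on the current machine and count launcher processes."""
    result = subprocess.run(
        ["ps", "aux"], capture_output=True, text=True, check=True
    )
    return count_launcher_processes(result.stdout)


# Made-up sample lines for illustration only.
sample = "\n".join([
    "root 101 python dataproc-launcher.py job-a",
    "root 102 python dataproc-launcher.py job-b",
    "user 200 grep dataproc-launcher.py",
])
print(count_launcher_processes(sample))
```

On the master node, calling count_running_jobs() in a loop with a short sleep gives the same kind of running count as repeating the shell command by hand.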

Kubernetes dynamic Job scaling

I’m finally dipping my toes in the kubernetes pool and wanted to get some advice on the best way to approach a problem I have:
Tech we are using:
GCP
GKE
GCP Pub/Sub
We need to do bursts of batch processing spread out across a fleet and have decided on the following approach:
New raw data flows in
A node analyses this and breaks the data up into manageable portions which are pushed onto a queue
We have a cluster with Autoscaling On and Min Size ‘0’
A Kubernetes job spins up a pod for each new message on this cluster
When pods can’t pull any more messages they terminate successfully
The question is:
What is the standard approach for triggering jobs such as this?
Do you create a new job each time or are jobs meant to be long lived and re-run?
I have only seen examples using a YAML file; however, we would probably want the node that did the portioning of work to create the job, as it knows how many parallel pods should be run. Would it be recommended to use the Python SDK to create the job spec programmatically? Or, if jobs are long-lived, would you simply hit the Kubernetes API, modify the number of parallel pods required, and then re-run the job?
Jobs in Kubernetes are meant to be short-lived and are not designed to be reused. Jobs are designed for run-once, run-to-completion workloads. Typically they are assigned a specific task, e.g. to process a single queue item.
However, if you want to process multiple items in a work queue with a single instance then it is generally advisable to instead use a Deployment to scale a pool of workers that continue to process items in the queue, scaling the number of pool workers dependent on the number of items in the queue. If there are no work items remaining then you can scale the deployment to 0 replicas, scaling back up when there is work to be done.
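As a rough illustration of scaling a worker pool with queue depth, the following sketch computes a replica count from the number of pending items. The function name, the items_per_worker ratio and the max_replicas cap are all hypothetical; a real setup would read the queue length from Pub/Sub metrics and apply the result to the Deployment.

```python
import math


def desired_replicas(queue_length, items_per_worker=10, max_replicas=20):
    """Return how many workers to run for the given number of queued items."""
    if queue_length <= 0:
        return 0  # nothing to do, so scale the pool down to zero
    return min(max_replicas, math.ceil(queue_length / items_per_worker))


for pending in (0, 25, 1000):
    print(pending, desired_replicas(pending))
```

Capping the result keeps a large burst from requesting more pods than the cluster autoscaler is allowed to provision.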
To create and control your workloads in Kubernetes, the best practice would be to use the Kubernetes SDK. While you can generate YAML files and shell out to another tool like kubectl, using the SDK simplifies configuration and error handling, and also allows for simpler introspection of resources in the cluster.
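A minimal sketch of building a Job spec programmatically is shown below. It only constructs the manifest as a dictionary, so it runs without a cluster. The job name, image and parallelism value are placeholders. Submitting it with the official Python client is indicated in the comments and assumes the kubernetes package is installed and a kubeconfig is available.

```python
def build_job_manifest(name, image, parallelism):
    """Build a batch/v1 Job manifest for a pool of queue-draining pods."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Number of pods to run at the same time; completions is left
            # unset so each pod exits on its own once the queue is empty.
            "parallelism": parallelism,
            "template": {
                "spec": {
                    "containers": [{"name": "worker", "image": image}],
                    "restartPolicy": "Never",
                }
            },
        },
    }


# Placeholder name and image, for illustration only.
manifest = build_job_manifest("queue-workers", "example.com/worker:latest", 5)
print(manifest["spec"]["parallelism"])

# Submitting with the official client (not executed here):
#   from kubernetes import client, config
#   config.load_kube_config()
#   client.BatchV1Api().create_namespaced_job(namespace="default", body=manifest)
```

Because the node that partitions the work knows how many portions it produced, it can pass that number straight in as parallelism and create a fresh Job for every burst.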

dataproc cluster update (resize) command not completing

We have a Dataproc cluster that we dynamically resize for large jobs. I submitted a cluster resize request to reduce our cluster to its original size (1 master, 2 workers) from 10 workers and 3 preemptible workers, but it still hasn't completed an hour later.
Is this normal? Is there a way to re-issue the request? At the moment I get "cluster update in progress" style messages.
If you downscale a Dataproc 1.2+ cluster using Graceful Decommissioning, it is expected that this could take a long time if there are running jobs on the cluster: the downscale operation waits until the YARN containers on the decommissioned nodes finish.
Also, if you use HDFS intensively, node decommissioning could take a long time because data has to be replicated to prevent data loss.
You cannot issue another resize operation until the current operation has finished.

Spark fails with too many open files on HDInsight YARN cluster

I am running into the same issue as in this thread with my Scala Spark Streaming application: Why does Spark job fail with "too many open files"?
But I am using Azure HDInsight to deploy my YARN cluster, and I don't think I can log into the machines and update the ulimit on all of them.
Is there any other way to solve this problem? I cannot reduce the number of reducers by too much either, or my job will become much slower.
You can SSH into all nodes from the head node (the Ambari UI shows the FQDN of all nodes).
ssh sshuser@nameofthecluster.azurehdinsight.net
You can then write a custom action that alters the settings on the necessary nodes if you want to automate this.