Pre-emptible machines on genomics pipeline - gcloud

I tried running thousands of machines with the genomics pipeline, using the preemptible flag in the v2alpha1 JSON mapping.
Even though the machines were preemptible, a lot of workers were using up persistent disk space without even having started.
When I run gcloud alpha genomics operations describe $operation_id, I see:
- description: Worker released
  details:
    '#type': type.googleapis.com/google.genomics.v2alpha1.WorkerReleasedEvent
    instance: google-pipelines-worker-10f2002aa213b3108fb69a7488d0d4ce
    zone: us-east1-c
  timestamp: '2019-04-15T01:49:29.065576Z'
- description: Worker "google-pipelines-worker-10f2002aa213b3108fb69a7488d0d4ce"
    assigned in "us-east1-c"
  details:
    '#type': type.googleapis.com/google.genomics.v2alpha1.WorkerAssignedEvent
    instance: google-pipelines-worker-10f2002aa213b3108fb69a7488d0d4ce
    zone: us-east1-c
  timestamp: '2019-04-14T19:16:08.141993Z'
I expected workers to be assigned only when a preemptible instance was available. It looks like the assigned workers took up disk space without taking up CPU resources.
Is there something more I should be doing when setting up the pipeline JSON?
https://cloud.google.com/genomics/reference/rest/Shared.Types/Metadata#Pipeline

Could you please join the Google Genomics Discuss mailing list, post this question there, and provide some more detail about what you're seeing? We would like to understand which pipeline you're trying to run, the config you're using, and where you're seeing the workers take up storage without CPU.

Related

Airflow fault tolerance

I have two questions:
First, what does it mean that the Kubernetes executor is fault tolerant? In other words, what happens if one worker node goes down?
Second, is it possible for the whole Airflow server to go down? If so, is there a backup that takes over automatically to continue the work?
Note: I have started learning Airflow recently.
Thanks in advance.
This is a theoretical question that came up while I was learning Apache Airflow. I have read the documentation, but it does not mention how fault tolerance is handled.
What does it mean that the Kubernetes executor is fault tolerant?
The Airflow scheduler uses a Kubernetes API watcher to track the state of the worker pods (tasks) on each change, so that it can discover failed pods. When a worker pod goes down, the scheduler detects the failure and updates the state of the failed tasks in the metadata database; these tasks can then be rescheduled and executed according to their retry configuration.
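As a rough illustration of the retry configuration mentioned above, here is a minimal sketch. retries and retry_delay are standard Airflow operator arguments; the values are arbitrary, and the state_after_failure helper is hypothetical and only models the behaviour in a simplified way (it is not Airflow code).

```python
from datetime import timedelta

# Arguments applied to every task in a DAG when passed as default_args,
# e.g. DAG("my_dag", default_args=default_args, ...). Values are examples only.
default_args = {
    "retries": 3,                         # retry a failed task up to 3 times
    "retry_delay": timedelta(minutes=5),  # wait 5 minutes between attempts
}


def state_after_failure(try_number, retries):
    """Simplified model (an assumption, not Airflow internals): state of a
    task after its attempt number `try_number` (1-based) has failed."""
    return "up_for_retry" if try_number <= retries else "failed"


# A pod lost on the first attempt leaves the task eligible for another run.
print(state_after_failure(1, default_args["retries"]))
# Once all retries are used up, the task is finally marked as failed.
print(state_after_failure(4, default_args["retries"]))
```

With retries left at 0, a task whose pod is lost would go straight to failed, so the fault tolerance described above depends on retries being configured.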
Is it possible for the whole Airflow server to go down?
Yes, it is possible, for several reasons, and there are different solutions and tips for each one:
Problem in the metadata database: this is the most important part of Airflow. It is the central point of communication between the schedulers and workers, and it stores the state of all DAG runs and tasks, the messages shared between tasks, and the variables and connections. When it goes down, everything fails:
You can use a managed service (AWS RDS or Aurora, GCP Cloud SQL or Cloud Spanner, ...)
You can deploy it on your K8s cluster, but in HA mode (doc for postgresql)
Problem with the scheduler: the scheduler runs as a pod, and there is a possibility of losing it, depending on how you deploy it:
Request enough resources (especially memory) to avoid OOM problems
Avoid running it on spot/preemptible VMs
Create multiple replicas (minimum 3) of the scheduler to activate HA mode; in this case, if one scheduler goes down, the other schedulers are still up
Problem with the webserver pod: it doesn't affect your workload, but you will not be able to access the UI/API during the downtime:
Request enough resources (especially memory) to avoid OOM problems
It's a stateless service, so you can create multiple replicas without any problem; if one goes down, you can still access the UI/API through the other replicas

Google DATA FUSION CPU

I have a problem when I try to deploy a downstream pipeline. The error I receive in the logs is this:
PROVISION task failed in REQUESTING_CREATE state for program run program_run:default.ListaNomi1_v3.-SNAPSHOT.workflow.DataPipelineWorkflow.182bbf2c-576b-11ec-8095-da8d4f8ab0b3 due to Dataproc operation failure: INVALID_ARGUMENT: Multiple validation errors: - Insufficient 'CPUS' quota. Requested 10.0, available 3.0. - Insufficient 'CPUS_ALL_REGIONS' quota. Requested 10.0, available 7.0. - Insufficient 'IN_USE_ADDRESSES' quota. Requested 3.0, available 1.0. - This request exceeds CPU quota. Some things to try: request fewer workers (a minimum of 2 is required), use smaller master and/or worker machine types (such as n1-standard-2)..
I have tried changing the worker and master node configuration, but it always fails.
I can't modify the quota because I'm not the leader, and he says it can't be changed.
To process data with Cloud Data Fusion you need a cluster.
There are two options:
Ephemeral cluster, which is created for each pipeline run. This is the one you are trying to use, but it needs compute quota to create the cluster.
Static cluster (existing Dataproc). In this case the cluster is created beforehand and you simply point your pipeline at it by creating and selecting a provisioning profile. This can be an option to prevent quota issues at pipeline start, but such a static cluster incurs costs while it's running, even without any jobs.
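To see why resizing the nodes may not be enough, it can help to do the quota arithmetic from the error message by hand. The sketch below is only that arithmetic: the available figures are copied from the log above, while the cluster shapes, the function names, and the assumption that every node consumes one in-use address are mine.

```python
# Available quota, copied from the error message in the question.
AVAILABLE = {"CPUS": 3, "CPUS_ALL_REGIONS": 7, "IN_USE_ADDRESSES": 1}


def cluster_request(master_vcpus, worker_vcpus, num_workers, num_masters=1):
    """Quota a cluster of this shape would ask for.

    Assumption: every node counts as one in-use address.
    """
    vcpus = num_masters * master_vcpus + num_workers * worker_vcpus
    nodes = num_masters + num_workers
    return {"CPUS": vcpus, "CPUS_ALL_REGIONS": vcpus, "IN_USE_ADDRESSES": nodes}


def shortfalls(request, available=AVAILABLE):
    """Return how far each requested quota exceeds what is available."""
    return {
        name: needed - available[name]
        for name, needed in request.items()
        if needed > available[name]
    }


# One hypothetical shape that adds up to the 10 CPUs and 3 addresses in the log.
print(shortfalls(cluster_request(master_vcpus=2, worker_vcpus=4, num_workers=2)))
# The smallest shape (2 workers minimum, 1 vCPU each) fits the CPU quota,
# but under the one-address-per-node assumption it still needs 3 addresses.
print(shortfalls(cluster_request(master_vcpus=1, worker_vcpus=1, num_workers=2)))
```

If that assumption holds for your profile, shrinking the machine types alone would still hit the IN_USE_ADDRESSES limit, which may be why changing the node configuration keeps failing.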

Airflow Memory Error: Task exited with return code -9

According to both of these Link1 and Link2, my Airflow DAG run is returning the error INFO - Task exited with return code -9 due to an out-of-memory issue. My DAG run has 10 tasks/operators, and each task simply:
makes a query to get one of my BigQuery tables, and
writes the results to a collection in my Mongo database.
The sizes of the 10 BigQuery tables range from 1MB to 400MB, and the total size of all 10 tables is ~1GB. My Docker container has 2GB of memory by default and I've increased this to 4GB; however, I am still receiving this error from a few of the tasks. I am confused about this, as 4GB should be plenty of memory. I am also concerned because, in the future, these tables may become larger (a single table query could be 1-2GB), and I'd like to avoid these return code -9 errors at that time.
I'm not quite sure how to handle this issue: the point of the DAG is to transfer data from BigQuery to Mongo daily, so the query results held in memory by the DAG's tasks are necessarily fairly large, given the size of the tables.
As you said, the error message you get corresponds to an out of memory issue.
Referring to the official documentation:
DAG execution is RAM limited. Each task execution starts with two
Airflow processes: task execution and monitoring. Currently, each node
can take up to 6 concurrent tasks. More memory can be consumed,
depending on the size of the DAG.
High memory pressure in any of the GKE nodes will lead the Kubernetes scheduler to evict pods from nodes in an attempt to relieve that pressure. While many different Airflow components are running within GKE, most don't tend to use much memory, so the case that happens most frequently is that a user uploaded a resource-intensive DAG. The Airflow workers run those DAGs, run out of resources, and then get evicted.
You can check it with the following steps:
In the Cloud Console, navigate to Kubernetes Engine -> Workloads
Click on airflow-worker, and look under Managed pods
If there are pods that show Evicted, click each evicted pod and look for the "The node was low on resource: memory" message at the top of the window.
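The same check can be scripted. Below is a minimal sketch that filters a list of pod objects for evictions; it relies on the fact that evicted pods report phase Failed with reason Evicted. The helper name and the fake pods are made up for the demo, and the commented-out lines show how the list might be fetched with the Kubernetes Python client (the namespace there is a placeholder).

```python
from types import SimpleNamespace


def evicted_pod_names(pods):
    """Return the names of pods whose status reports an eviction."""
    return [
        pod.metadata.name
        for pod in pods
        if pod.status.phase == "Failed" and pod.status.reason == "Evicted"
    ]


# With a real cluster the pods could come from the Kubernetes Python client:
#   from kubernetes import client, config
#   config.load_kube_config()
#   pods = client.CoreV1Api().list_namespaced_pod("<your-namespace>").items


def fake_pod(name, phase, reason=None):
    """Stand-in for a V1Pod, used only so the demo runs without a cluster."""
    return SimpleNamespace(
        metadata=SimpleNamespace(name=name),
        status=SimpleNamespace(phase=phase, reason=reason),
    )


pods = [
    fake_pod("airflow-worker-a", "Running"),
    fake_pod("airflow-worker-b", "Failed", "Evicted"),
]
print(evicted_pod_names(pods))
```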
What are the possible ways to fix the OOM issue?
Create a new Cloud Composer environment with a larger machine type than the current machine type.
Ensure that the tasks in the DAG are idempotent, which means that the result of running the same DAG run multiple times should be the same as the result of running it once.
Configure task retries by setting the number of retries on the task - this way when your task gets -9'ed by the scheduler it will go to up_for_retry instead of failed
Additionally you can check the behavior of CPU:
In the Cloud Console, navigate to Kubernetes Engine -> Clusters
Locate Node Pools at the bottom of the page, and expand the default-pool section
Click the link listed under Instance groups
Switch to the Monitoring tab, where you can find CPU utilization
Ideally, the GCE instances shouldn't be running above 70% CPU at all times; otherwise the Composer environment may become unstable under load.
I hope you find the above pieces of information useful.
I am going to chunk the data so that less is loaded into any one task at any given time. I'm not sure yet whether I will need to use GCS/S3 for intermediary storage.
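A minimal sketch of what that chunking could look like, assuming the query results can be consumed as an iterator of rows and the sink accepts a list of documents per call (pymongo's insert_many has that shape). The chunked and transfer helpers and the chunk size are hypothetical; the demo uses a generator and a plain list instead of BigQuery and Mongo.

```python
from itertools import islice


def chunked(rows, size):
    """Lazily yield lists of at most `size` rows from any iterable."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def transfer(rows, insert_many, chunk_size=1000):
    """Write rows to the sink one batch at a time; return the row count."""
    written = 0
    for batch in chunked(rows, chunk_size):
        insert_many(batch)  # only one batch is held in memory at a time
        written += len(batch)
    return written


# Demo with stand-ins: a generator as the source, a list as the sink.
sink = []
total = transfer(({"n": i} for i in range(2500)), sink.append, chunk_size=1000)
print(total, [len(batch) for batch in sink])
```

Peak memory then depends on the chunk size rather than on the table size, provided the source really streams rows instead of materialising the whole result first.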

How do we choose --nthreads and --nprocs per worker in Dask distributed running via Helm on Kubernetes?

I'm running some I/O intensive Python code on Dask and want to increase the number of threads per worker. I've deployed a Kubernetes cluster that runs Dask distributed via helm. I see from the worker deployment template that the number of threads for a worker is set to the number of CPUs, but I'd like to set the number of threads higher unless that's an anti-pattern. How do I do that?
From this similar question, it looks like I can SSH into the Dask scheduler and spin up workers with dask-worker. But ideally I'd be able to configure the worker resources via Helm, so that I don't have to interact with the scheduler other than submitting jobs to it via the Client.
Kubernetes resource limits and requests should match the --memory-limit and --nthreads parameters given to the dask-worker command. For more information, see link 1 (best practices described in Dask's official documentation) and link 2.
Threading in Python is a careful art and is really dependent on your code. To take the easy one first, --nprocs should almost certainly be 1; if you want more processes, launch more replicas instead. For the thread count, first remember that the GIL means only one thread can be running Python code at a time. So you only get concurrency gains in two main situations: 1) some threads are blocked on I/O, like waiting to hear back from a database or web API, or 2) some threads are running non-GIL-bound C code inside NumPy or friends. In the second situation, you still can't get more concurrency than the number of CPUs, since that's just how many slots there are to run at once, but the first can benefit from more threads than CPUs in some situations.
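To make the I/O-bound case concrete, here is a small standard-library sketch (no Dask involved). time.sleep stands in for a blocking database or web call, which is an assumption about your workload; like real blocking I/O, it releases the GIL while waiting. The function names and delays are made up.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io_call(delay=0.1):
    """Stand-in for a blocking I/O request; sleeping releases the GIL."""
    time.sleep(delay)
    return delay


def timed_run(num_threads, num_calls=8):
    """Run `num_calls` fake I/O calls on `num_threads` threads; return seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        list(pool.map(lambda _: fake_io_call(), range(num_calls)))
    return time.perf_counter() - start


serial = timed_run(num_threads=1)    # calls wait one after another
threaded = timed_run(num_threads=8)  # calls wait at the same time
print(f"1 thread: {serial:.2f}s, 8 threads: {threaded:.2f}s")
```

The speed-up here does not depend on how many CPUs the machine has, because the threads spend their time waiting rather than computing. CPU-bound Python code would not behave this way.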
Dask's Helm chart has a limitation: it doesn't allow setting --nthreads in the chart. I confirmed this with the Dask team and filed an issue: https://github.com/helm/charts/issues/18708.
In the meantime, use Dask Kubernetes for a higher degree of customization.

Kubernetes different container args depending on number of pods in replica set

I want to scale an application with workers.
There could be 1 worker or 100, and I want to scale them seamlessly.
The idea is to use a ReplicaSet. However, for domain-specific reasons, the appropriate way to scale them is for each worker to know its ID and the total number of workers.
For example, in case I have 3 workers, I'd have this:
id:0, num_workers:3
id:1, num_workers:3
id:2, num_workers:3
Is there a way of using kubernetes to do so?
I pass this information in command line arguments to the app, and I assume it would be fine having it in environment variables too.
It's ok on size changes for all workers to be killed and new ones spawned.
Before giving the kubernetes-specific answer, I wanted to point out that it seems like the problem is trying to push cluster-coordination down into the app, which is almost by definition harder than using a distributed system primitive designed for that task. For example, if every new worker identifies themselves in etcd, then they can watch keys to detect changes, meaning no one needs to destroy a running application just to update its list of peers, their contact information, their capacity, current workload, whatever interesting information you would enjoy having while building a distributed worker system.
But, on with the show:
If you want stable identifiers, then StatefulSets is the modern answer to that. Whether that is an exact fit for your situation depends on whether (for your problem domain) id:0 being "rebooted" still counts as id:0 or the fact that it has stopped and started now disqualifies it from being id:0.
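As a sketch of how a worker could read its ID under a StatefulSet: the pods are named <statefulset-name>-<ordinal> and the pod hostname matches that name, so the ordinal can be parsed from the hostname. The NUM_WORKERS environment variable is hypothetical; it would have to be set in the pod template to match the replica count, and the function names are made up.

```python
import os
import re
import socket


def ordinal_from_hostname(hostname):
    """Extract the trailing ordinal from a StatefulSet pod name like 'worker-2'."""
    match = re.search(r"-(\d+)$", hostname)
    if match is None:
        raise ValueError(f"no StatefulSet ordinal in hostname {hostname!r}")
    return int(match.group(1))


def worker_identity(hostname=None, environ=None):
    """Return (worker_id, num_workers) for the current pod.

    NUM_WORKERS is a hypothetical variable that you would set yourself.
    """
    hostname = hostname or socket.gethostname()
    environ = os.environ if environ is None else environ
    return ordinal_from_hostname(hostname), int(environ["NUM_WORKERS"])


# Demo with explicit values instead of a real pod environment.
print(worker_identity("worker-2", {"NUM_WORKERS": "3"}))
```

Because NUM_WORKERS lives in the pod template, changing it means the pods have to be recreated, which fits the "all workers are killed and new ones spawned" requirement in the question.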
The running list of cluster size is tricky. If you are willing to be flexible in the launch mechanism, then you can have a pre-launch binary populate the environment right before spawning the actual worker (that example is for reading from etcd directly, but the same principle holds for interacting with the kubernetes API, then launching).
You could do that same trick in a more static manner by having an initContainer write the current state of affairs to a file, which the app would then read in. Or, due to all Pod containers sharing networking, the app could contact a "sidecar" container on localhost to obtain that information via an API.
So far so good, except for the
on size changes for all workers to be killed and new one spawned
The best answer I have for that requirement is that if the app must know its peers at launch time, then I am pretty sure you have left the realm of "scale $foo --replicas=5" and entered into the "destroy the peers and start all afresh" realm, with kubectl delete pods -l some-label=of-my-pods; which is, thankfully, what updateStrategy: type: OnDelete does, when combined with the delete pods command.
In the end, I tried something different. I used the Kubernetes API to get the number of running pods with the same label. This is Python code using the Kubernetes Python client.
import socket
from kubernetes import client
from kubernetes import config
# Authenticate with the service account mounted into the pod.
config.load_incluster_config()
v1 = client.CoreV1Api()
# The pod's namespace is mounted next to the service-account token.
with open(
    '/var/run/secrets/kubernetes.io/serviceaccount/namespace',
    'r'
) as f:
    namespace = f.readline().strip()
# Collect the names of all pods that carry the worker label.
workers = []
for pod in v1.list_namespaced_pod(
    namespace,
    watch=False,
    label_selector="app=worker"
).items:
    workers.append(pod.metadata.name)
# Sort so that every pod derives the same ordering, then find our own position.
workers.sort()
num_workers = len(workers)
worker_id = workers.index(socket.gethostname())