Airflow fault tolerance - Kubernetes

I have 2 questions:
First, what does it mean that the Kubernetes executor is fault tolerant; in other words, what happens if a worker node goes down?
Second, is it possible for the whole Airflow server to go down? If yes, is there a backup that runs automatically to continue the work?
Note: I have started learning Airflow recently.
Thanks in advance.
This is a theoretical question that came up while I was learning Apache Airflow. I have read the documentation,
but it does not mention how fault tolerance is handled.

What does it mean that the Kubernetes executor is fault tolerant?
The Airflow scheduler uses a Kubernetes API watcher to watch the state of the workers (tasks) on each change in order to detect failed pods. When a worker pod goes down, the scheduler detects this failure and changes the state of the failed tasks in the metadata database, and these tasks can then be rescheduled and executed based on their retry configuration.
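For illustration, the retry behaviour is configured per task. Here is a minimal sketch of a DAG whose task would be rescheduled if its worker pod were lost (the DAG and task ids are made up):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical example: if the worker pod running this task is lost, the
# scheduler marks the task as failed and reschedules it per these settings.
with DAG(
    dag_id="retry_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    BashOperator(
        task_id="resilient_task",
        bash_command="echo hello",
        retries=3,                         # retry up to 3 times
        retry_delay=timedelta(minutes=5),  # wait 5 minutes between attempts
    )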
Is it possible for the whole Airflow server to go down?
Yes, it is possible for different reasons, and there are different solutions/tips for each one:
Problem with the metadata database: the most important part of Airflow is the metadata database, since it is the central point used to communicate between the different schedulers and workers. It stores the state of all the DAG runs and tasks, shares messages between tasks, and stores variables and connections, so when it goes down, everything fails:
you can use a managed service (AWS RDS or Aurora, GCP Cloud SQL or Cloud Spanner, ...)
you can deploy it on your K8s cluster, but in HA mode (see the PostgreSQL HA docs)
Problem with the scheduler: the scheduler runs as a pod, and there is a possibility of losing it depending on how you deploy it:
Try to request enough resources (especially memory) to avoid OOM problems
Avoid running it on spot/preemptible VMs
Create multiple replicas (minimum 3) for the scheduler to activate HA mode; in this case, if one scheduler goes down, the other schedulers stay up (see the sketch after this list)
Problem with the webserver pod: it doesn't affect your workloads, but you will not be able to access the UI/API during the downtime:
Try to request enough resources (especially memory) to avoid OOM problems
It's a stateless service, so you can create multiple replicas without any problem; if one goes down, you can access the UI/API through the other replicas
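A minimal sketch of bumping the scheduler replica count with the kubernetes Python client, assuming a typical deployment named airflow-scheduler in an airflow namespace (the names depend on your Helm chart or manifests); the same approach works for the webserver deployment:

from kubernetes import client, config

# Assumption: deployment and namespace names follow a typical Helm release;
# adjust "airflow-scheduler" and "airflow" to match your installation.
config.load_kube_config()  # or config.load_incluster_config() inside the cluster

client.AppsV1Api().patch_namespaced_deployment_scale(
    name="airflow-scheduler",
    namespace="airflow",
    body={"spec": {"replicas": 3}},  # the minimum recommended above for HA
)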

Related

GridGain server deployment/Statefulset Termination grace period

I deployed a GridGain cluster in a Google Kubernetes cluster following [1]. I enabled native persistence using a StatefulSet. In statefulset.yaml in [2], terminationGracePeriodSeconds is set to 60000. What is the purpose of this large timeout?
When deleting a pod using the kubectl delete pod command, it takes a very long time. What is a suitable value for terminationGracePeriodSeconds without losing any data?
[1]. https://www.gridgain.com/docs/latest/installation-guide/kubernetes/gke-deployment
[2]. https://www.gridgain.com/docs/latest/installation-guide/kubernetes/gke-deployment#creating-pod-configuration
I believe the reason behind setting it to 60000 was: do not rely on it. Prior to Ignite 2.9 there was an issue with the startup script that didn't forward the termination signal to the underlying Java app, making it impossible to perform a graceful shutdown.
If a node is being restarted gracefully and IGNITE_WAIT_FOR_BACKUPS_ON_SHUTDOWN is enabled, Ignite will ensure that the node leaving won't lead to data loss. Sometimes a rebalance might take a while.
Keeping the above in mind: the hang issue might happen for Apache Ignite 2.8 and below, so keeping the recommended terminationGracePeriodSeconds should be fine, and it should never actually be needed in practice (in a normal flow).
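For reference, a rough sketch of how the grace period would look if the pod template were built with the kubernetes Python client instead of YAML; the 600-second value and the container details are placeholders, not a recommendation from the GridGain docs:

from kubernetes import client

# Placeholder values: choose a grace period long enough for Ignite to finish
# its graceful shutdown and rebalance; 60000 is the "never rely on it" ceiling.
pod_spec = client.V1PodSpec(
    termination_grace_period_seconds=600,
    containers=[
        client.V1Container(
            name="gridgain-node",
            image="gridgain/community",  # hypothetical image reference
        )
    ],
)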

Kubernetes Handling a Sudden Request of Processing Power (Such as a Python Script using 5 Processes)

I have a little scenario that I am running in my mind with the following setup:
A Django web server running in Kubernetes with the ability to autoscale resources (Google Kubernetes Engine), where I set the resource values to request nodes with 8 processing units (8 cores) and 16 GB RAM.
Because it is a web server, I have a frontend that can call a Python script that executes with 5 processes, and here's what I am worried about:
I know that if I run this script twice on my web server (located in the same container as my Django code), I am going to be using (to keep it simple) 10 processes/CPUs to execute this code.
So what would happen?
Would the first Python script be run on Pod 1, and would the second Python script (since we used 5 out of the 8 processing units) trigger a Pod 2 on another node, then run on that new replica with full access to 5 new processes?
Or would the first Python script be run on Replica 1, and then the second Python script be throttled to 3 processing units because, perhaps, Kubernetes is allocating based on CPU usage in the replica, not on how many processes I called the script with?
If your system has a Django application that launches scripts with subprocess or a similar mechanism, those will always be in the same container as the main server, in the same pod, on the same node. You'll never trigger any of the Kubernetes autoscaling capabilities. If the pod has resource limits set, you could get CPU utilization throttled, and if you exceed the configured memory limit, the pod could get killed off (with Django and all of its subprocesses together).
If you want to take better advantage of Kubernetes scheduling and resource management, you may need to restructure this application. Ideally you could run the Django server and each of the supporting tasks in a separate pod. You would then need a way to trigger the tasks and collect the results.
I'd generally build this by introducing a job queue system such as RabbitMQ or Celery into the mix. The Django application adds items to the queue, but doesn't directly do the work itself. Then you have a worker for each of the processes that reads the queue and does work. This is not directly tied to Kubernetes, and you could run this setup with a RabbitMQ or Redis installation and a local virtual environment.
If you deploy this setup to Kubernetes, then each of the tasks can run in its own deployment, fed by the work queue. Each task could run up to its own configured memory and CPU limits, and they could run on different nodes. With a little extra work you can connect a horizontal pod autoscaler to scale the workers based on the length of the job queue, so if you're running behind processing one of the tasks, the HPA can cause more workers to get launched, which will create more pods, which can potentially allocate more nodes.
The important detail here, though, is that a pod is the key unit of scaling; if all of your work stays within a single pod then you'll never trigger the horizontal pod autoscaler or the cluster autoscaler.
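A minimal sketch of that queue-backed approach, assuming Celery with a RabbitMQ broker (the broker URL, task name, and module layout are made up):

# tasks.py -- runs in the worker deployment, not in the Django pod
from celery import Celery

# Assumption: a RabbitMQ service reachable at this address inside the cluster.
app = Celery("workers", broker="amqp://guest:guest@rabbitmq:5672//")

@app.task
def heavy_job(payload):
    """The CPU-heavy work that previously ran as subprocesses of Django."""
    ...  # do the real processing here

# In a Django view, enqueue the work instead of running it in-process:
#   heavy_job.delay(payload)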

Airflow - Async pod launch

We have a long-running task in Airflow (5 hours and more).
I'm looking for a way to do fire-and-forget (after the pod has started running) and monitor the status of the task using a sensor.
The rationale is to free the worker and reduce resource consumption.
This line seems to monitor for pod health.
https://github.com/apache/airflow/blob/10ce31127f1ff87176158935925afce46a989917/airflow/kubernetes/pod_launcher.py#L140
Now, I know I can work around this limitation by using a Cloud Function to trigger the pod, but this kind of over-complicates the DAG.
I've also opened an issue on GitHub.
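For what it's worth, one way to sketch the fire-and-forget pattern described above is to launch the pod yourself and poll it with a lightweight sensor in reschedule mode, so the worker slot is freed between pokes. Everything below (the sensor class, pod name, and namespace) is hypothetical, not an existing Airflow operator:

from airflow.sensors.base import BaseSensorOperator
from kubernetes import client, config


class PodPhaseSensor(BaseSensorOperator):
    """Hypothetical sensor: returns True once the watched pod reaches a terminal phase."""

    def __init__(self, pod_name, namespace="default", **kwargs):
        super().__init__(**kwargs)
        self.pod_name = pod_name
        self.namespace = namespace

    def poke(self, context):
        config.load_incluster_config()
        pod = client.CoreV1Api().read_namespaced_pod(self.pod_name, self.namespace)
        return pod.status.phase in ("Succeeded", "Failed")

# In the DAG, one task creates the pod (fire and forget) and this sensor, run
# with mode="reschedule" and a long poke_interval, checks it periodically
# without holding a worker slot for the whole 5+ hours.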

Kubernetes dynamic Job scaling

I’m finally dipping my toes in the kubernetes pool and wanted to get some advice on the best way to approach a problem I have:
Tech we are using:
GCP
GKE
GCP Pub/Sub
We need to do bursts of batch processing spread out across a fleet and have decided on the following approach:
New raw data flows in
A node analyses this and breaks the data up into manageable portions which are pushed onto a queue
We have a cluster with Autoscaling On and Min Size ‘0’
A Kubernetes job spins up a pod for each new message on this cluster
When pods can't pull any more messages they terminate successfully
The question is:
What is the standard approach for triggering jobs such as this?
Do you create a new job each time or are jobs meant to be long lived and re-run?
I have only seen examples using a YAML file; however, we would probably want the node which did the portioning of the work to create the job, as it knows how many parallel pods should be run. Would it be recommended to use the Python SDK to create the job spec programmatically? Or, if jobs are long-lived, would you simply hit the K8s API, modify the number of parallel pods required, and then re-run the job?
Jobs in Kubernetes are meant to be short-lived and are not designed to be reused. Jobs are designed for run-once, run-to-completion workloads. Typically they are assigned a specific task, i.e. to process a single queue item.
However, if you want to process multiple items in a work queue with a single instance then it is generally advisable to instead use a Deployment to scale a pool of workers that continue to process items in the queue, scaling the number of pool workers dependent on the number of items in the queue. If there are no work items remaining then you can scale the deployment to 0 replicas, scaling back up when there is work to be done.
To create and control your workloads in Kubernetes, the best practice would be to use the Kubernetes SDK. While you can generate YAML files and shell out to another tool like kubectl, using the SDK simplifies configuration and error handling, and it also allows for simpler introspection of the resources in the cluster.
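As a rough illustration, the node that portions the work could create a Job with the desired parallelism using the kubernetes Python client; the names, image, and counts below are placeholders:

from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside the cluster

# Placeholder values: one pod per queue portion, 5 portions in this run.
job = client.V1Job(
    metadata=client.V1ObjectMeta(name="batch-run-001"),
    spec=client.V1JobSpec(
        parallelism=5,
        completions=5,
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(name="worker", image="gcr.io/my-project/worker:latest"),
                ],
            )
        ),
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="default", body=job)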

Is there downtime when a partition is moved to a new node?

Service Fabric offers the capability to rebalance partitions whenever a node is removed or added to the cluster. The Service Fabric Cluster Resource Manager will move one or more partitions to this node so more work can be done.
Imagine a reliable actor service which has thousands of actors running who are distributed across multiple partitions. If the Resource Manager decides to move one or more partitions, will this cause any downtime? Or does rebalancing partitions work the same as upgrading a service?
They act pretty much the same way. The main difference I can point out is that upgrades might affect only the services being updated, while re-balancing might affect multiple services at once. During an upgrade, the cluster might re-balance the services as well to fit the new service instance on a node.
Adding or removing nodes I would compare more with node failures. In any of these cases, services will be rebalanced because the cluster capacity changed, not because the service metric/load changed.
The main difference between a node failure and cluster scaling (adding/removing a node) is that the rebalance will take into account the service states during the process: when an infrastructure notification comes in telling that a node is being shut down (for updates, maintenance, or scaling down), Service Fabric will ask the infrastructure to wait so it can prepare for this announced 'failure', and then start re-balancing the services.
Even though re-balancing cares about the service states during a scale-down, it should not be considered more reliable than a node failure, because the infrastructure will only wait for a while before shutting down the node (the limit it can wait depends on the reliability tier you defined for your cluster) while Service Fabric checks whether the services meet health conditions, such as shutting down services, creating new ones, and checking that they run fine without errors. If this process takes too long, these services might be killed once the timeout is reached and the infrastructure proceeds with the changes. Also, the new instances of the services might fail on the new nodes, forcing the services to move again.
When you design the services, it is safer to consider re-balancing as a node failure, because in the end it is not much different. Your services will move around, data stored in memory will be lost if not persisted, the service address will change, and so on. The services should have replicated data, and the clients should always use retry logic and refresh the service location to reduce the downtime.
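A generic sketch of that client-side pattern in Python (the resolver and request callables are placeholders for whatever naming and communication client you use):

import time


def call_with_retry(resolve_endpoint, send_request, attempts=5, base_delay=1.0):
    """Re-resolve the service address before every retry, since a rebalanced
    replica may now live on a different node."""
    for attempt in range(attempts):
        endpoint = resolve_endpoint()  # e.g. query the naming service again
        try:
            return send_request(endpoint)
        except ConnectionError:
            time.sleep(base_delay * (2 ** attempt))  # back off before retrying
    raise RuntimeError("service unavailable after retries")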
The main difference between a service upgrade and service rebalancing is that during an upgrade, all replicas from all partitions get turned off on a particular node. According to the documentation, balancing is done on a replica basis, i.e. only some replicas from some partitions will get moved, so there shouldn't be any outage.