Trigger DAG across Airflow Instances (Multi Tenant) - triggers

We want to experiment with running multiple Airflow instances (one per team), and I am looking for a way to trigger DAGs across instances. For example, I have a DAG A1 in instance A, which should wait for DAG B1 in instance B to finish before starting.
I'm looking at Pub/Sub as a solution: DAG B1 would publish a message to a Pub/Sub topic, and DAG A1 would pull the message from that topic using PubSubPullSensor. However, there doesn't seem to be a way to tell Airflow to check for a specific message.
Is there a better way to send messages/triggers across Airflow instances? We are primarily using GCP.
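To make the "specific message" problem concrete, here is a minimal sketch of the matching step in plain Python. It assumes the pulled messages have already been converted to plain dicts and that B1 attaches two attributes, `dag_id` and `run_date`; both attribute names and the helper `find_completion_message` are hypothetical, and real Pub/Sub client message objects are shaped differently.

```python
from typing import Iterable, Mapping, Optional


def find_completion_message(
    messages: Iterable[Mapping],
    dag_id: str,
    run_date: str,
) -> Optional[Mapping]:
    """Return the first message whose attributes match the upstream DAG run.

    Each message is assumed (hypothetically) to look like:
    {"data": "...", "attributes": {"dag_id": "...", "run_date": "..."}}
    """
    for message in messages:
        attrs = message.get("attributes", {})
        if attrs.get("dag_id") == dag_id and attrs.get("run_date") == run_date:
            return message
    # No matching message in this batch; the caller would keep waiting.
    return None


pulled = [
    {"data": "done-1", "attributes": {"dag_id": "B1", "run_date": "2024-01-01"}},
    {"data": "done-2", "attributes": {"dag_id": "B1", "run_date": "2024-01-02"}},
]
match = find_completion_message(pulled, dag_id="B1", run_date="2024-01-02")
```

A sensor in A1 would only succeed once such a match is found, rather than on any message arriving on the topic.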

Related

Is it possible in Airflow to run a single task on multiple worker nodes, i.e. running a task in a distributed way?

I am using Spring Batch to create a workflow of batch jobs. A single batch job takes 2 hours to complete (~1 million records to process), so I decided to run it in a distributed way, where one task is spread across multiple worker nodes so that it finishes sooner. The other jobs in the workflow (all of which run in a distributed manner) need to run sequentially, one after the other. These are multi-node distributed jobs (master/slave architecture).
Now, I am considering deploying the workflow on Airflow. While exploring that, I could not find any way to run a single task that is distributed across multiple machines. Is this possible in Airflow?
Yes, you can create the task using the Spark framework. Spark allows you to process the data on multiple nodes in a distributed fashion.
You can then use SparkSubmitOperator to include the task in your DAG.

Best practice when deploying a Flink Job Cluster on Kubernetes regarding savepointing and updating the job

I am looking into deploying a Flink job on Kubernetes. Looking through the documentation, I'm having a hard time working out what the best practices are for deploying the job, specifically when the job has to maintain state.
There are two main points regarding this job:
It is a streaming job dealing with unbounded data (never ending stream)
Keeps and uses state that needs to be maintained over different job versions
Currently, we are running on Hadoop. There it is quite easy when you want to deploy a new version of the job and keep state. The steps are: cancel the job with savepoint, then deploy a new job and point to that savepoint.
Kubernetes:
Based on the definitions, it seems that for our use case a Job Cluster is the best fit for the requirements. There will only be one job running on this cluster.
The issue with the Kubernetes setup is that the savepoint location needs to be added as an argument to the Deployment. In the case that a pod is taken offline, it will restart the application with the original savepoint in the Deployment. Specifically this will reset the Kafka offset to whenever the job was deployed and reprocess a lot of data.
In addition to that, how would I go about canceling a job with a savepoint when running on a Job Cluster from something like CI/CD? Would I need to create another deployer pod and use the REST API?
What is the best practice for deploying a stateful Flink job on Kubernetes and upgrading it without losing the state?

Use Airflow to run parametrized jobs on-demand and with a schedule

I have a reporting application that uses Celery to process thousands of jobs per day. There is one Python module per report type that encapsulates all job steps. Jobs take customer-specific parameters and typically complete within a few minutes. Currently, jobs are triggered by customers on demand when they create a new report or request a refresh of an existing one.
Now, I would like to add scheduling, so the jobs run daily, and reports get refreshed automatically. I understand that Airflow shines at task orchestration and scheduling. I also like the idea of expressing my jobs as DAGs and getting the benefit of task retries. I can see how I can use Airflow to run scheduled batch-processing jobs, but I am unsure about my use case.
If I express my jobs as Airflow DAGs, I will still need to run them parametrized for each customer. This means that if a customer creates a new report, I will need a way to trigger a DAG with the customer-specific configuration. And for scheduled execution, I will need to enumerate all customers and create a parametrized (sub-)DAG for each of them. My understanding is that this should be possible, since Airflow supports dynamically created DAGs; however, I am not sure whether this is an efficient and correct way to use Airflow.
I wonder if anyone has considered using Airflow for a scenario similar to mine.
Celery workflows do literally the same, and you can create and run them at any point of time. Also, Celery has a pretty good scheduler (I have never seen it failing in 5 years of using Celery) - Celery Beat.
Sure, Airflow can be used to do what you need without any problems.
You can use Airflow to create DAGs dynamically, though I am not sure whether this will work at a scale of thousands of DAGs. There are some good examples on astronomer.io under Dynamically Generating DAGs in Airflow.
I have some DAGs and tasks that are dynamically generated from a YAML configuration with different schedules and configurations. It all works without any issue.
The only thing that might be challenging is the "jobs are triggered by customers on-demand" part. I guess you could trigger any DAG with Airflow's REST API, but it's still in an experimental state.
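As a rough illustration of an on-demand trigger with customer-specific configuration, the sketch below only builds the HTTP request and does not send it. It assumes the stable REST API that ships with Airflow 2.x (`POST /api/v1/dags/{dag_id}/dagRuns`), which is not the experimental API mentioned above and uses a different path. The base URL, the DAG id `customer_report`, the `conf` keys, and the helper name `build_trigger_request` are all hypothetical, and authentication is left out.

```python
import json
import urllib.request


def build_trigger_request(base_url: str, dag_id: str, conf: dict) -> urllib.request.Request:
    """Build (but do not send) a request that asks Airflow to start a DAG run."""
    url = f"{base_url.rstrip('/')}/api/v1/dags/{dag_id}/dagRuns"
    # The "conf" object is what carries the per-customer parameters.
    body = json.dumps({"conf": conf}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_trigger_request(
    "http://airflow.example.com/",  # hypothetical host
    "customer_report",              # hypothetical DAG id
    {"customer_id": 42, "report_type": "sales"},
)
# Sending it would be: urllib.request.urlopen(request), plus whatever
# authentication the Airflow deployment requires.
```

Inside the DAG, the values passed this way are available to tasks through `dag_run.conf`, so one DAG per report type can serve every customer instead of one DAG per customer.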

Flink - multiple instances of flink application deployment on kubernetes

I need help with deploying a Flink application on Kubernetes.
We have 3 sources that send trigger conditions in the form of SQL queries. There are ~3-6k queries in total, which puts a heavy load on a single Flink instance. I tried running them, but it was very slow and took a long time to start.
Because of the high volume of queries, we decided to create one Flink app instance per source, so each Flink instance executes only ~1-2k queries.
Example: the SQL query sources are A, B, C
Flink instances:
App A --> will be responsible to handle source A queries only
App B --> will be responsible to handle source B queries only
App C --> will be responsible to handle source C queries only
I want to deploy these instances on Kubernetes.
Question:
a) Is it possible to deploy a standalone Flink jar with the built-in mini cluster? That is, just start the main method: Java -cp mainMethod (sourceName is a command-line argument: A/B/C).
b) If one Kubernetes pod or Flink instance goes down, how can we handle that in another pod or Flink instance? Is it possible to hand the work over to another pod or Flink instance?
Sorry if I mixed up two or more things together :(
I appreciate your help. Thanks.
Leaving aside issues of exactly-once semantics, one way to handle this would be to have a parallel source function that emits the SQL queries (one per sub-task), and a downstream FlatMapFunction that executes the query (one per sub-task). Your source could then send out updates to the query without forcing you to restart the workflow.

Kubernetes dynamic Job scaling

I’m finally dipping my toes in the kubernetes pool and wanted to get some advice on the best way to approach a problem I have:
Tech we are using:
GCP
GKE
GCP Pub/Sub
We need to do bursts of batch processing spread out across a fleet and have decided on the following approach:
New raw data flows in
A node analyses this and breaks the data up into manageable portions which are pushed onto a queue
We have a cluster with Autoscaling On and Min Size ‘0’
A Kubernetes job spins up a pod for each new message on this cluster
When pods can’t pull any more messages they terminate successfully
The question is:
What is the standard approach for triggering jobs such as this?
Do you create a new job each time or are jobs meant to be long lived and re-run?
I have only seen examples that use a YAML file; however, we would probably want the node that did the portioning of work to create the job, as it knows how many parallel pods should be run. Would it be recommended to use the Python SDK to create the job spec programmatically? Or, if jobs are long-lived, would you simply hit the Kubernetes API, modify the number of parallel pods required, and then re-run the job?
Jobs in Kubernetes are meant to be short-lived and are not designed to be reused. Jobs are designed for run-once, run-to-completion workloads. Typically they are assigned a specific task, e.g. processing a single queue item.
However, if you want to process multiple items in a work queue with a single instance then it is generally advisable to instead use a Deployment to scale a pool of workers that continue to process items in the queue, scaling the number of pool workers dependent on the number of items in the queue. If there are no work items remaining then you can scale the deployment to 0 replicas, scaling back up when there is work to be done.
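The scaling rule described above can be sketched as a small pure function. This is only an illustration under stated assumptions: the function name `desired_replicas` and the parameters `items_per_worker` and `max_replicas` are hypothetical, and how the queue depth is measured and how the result is applied to the Deployment are left out.

```python
import math


def desired_replicas(queue_depth: int, items_per_worker: int, max_replicas: int) -> int:
    """Pick a worker count for a Deployment based on how many items are queued."""
    if queue_depth <= 0:
        # Nothing to do: scale the worker pool down to zero.
        return 0
    # One worker per `items_per_worker` queued items, rounded up, capped at the maximum.
    return min(max_replicas, math.ceil(queue_depth / items_per_worker))


empty = desired_replicas(0, items_per_worker=10, max_replicas=5)
partial = desired_replicas(25, items_per_worker=10, max_replicas=5)
capped = desired_replicas(1000, items_per_worker=10, max_replicas=5)
```

A controller loop would evaluate this periodically and set the Deployment's replica count to the result.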
To create and control your workloads in Kubernetes, the best practice is to use the Kubernetes SDK. While you can generate YAML files and shell out to another tool like kubectl, using the SDK simplifies configuration and error handling, and also makes it easier to inspect resources in the cluster.
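As a sketch of building the Job spec programmatically, the function below produces a `batch/v1` Job manifest as a plain dict, with the parallelism chosen by whatever component split up the work. The job name, image, and helper name `build_job_manifest` are hypothetical. With the official Kubernetes Python client, a dict like this should be accepted as the `body` of `BatchV1Api.create_namespaced_job`; that call is not made here, so the example runs without a cluster.

```python
def build_job_manifest(name: str, image: str, parallelism: int) -> dict:
    """Build a batch/v1 Job manifest for a work-queue style Job."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Number of pods to run at once. `completions` is left unset,
            # which is the usual shape for work-queue Jobs where each pod
            # exits successfully once the queue is empty.
            "parallelism": parallelism,
            "template": {
                "spec": {
                    "containers": [{"name": "worker", "image": image}],
                    # Job pod templates must use Never or OnFailure.
                    "restartPolicy": "Never",
                }
            },
        },
    }


manifest = build_job_manifest(
    name="batch-2024-01-01",                  # hypothetical job name
    image="gcr.io/example/worker:latest",     # hypothetical image
    parallelism=8,
)
```

Creating a fresh, uniquely named Job per batch fits the run-once model described above better than modifying and re-running an existing Job.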