How to keep Flink taskmanager pod running on K8s - kubernetes

I've created a Flink cluster using Session mode on native K8s using the command:
$ ./bin/kubernetes-session.sh -Dkubernetes.cluster-id=my-first-flink-cluster
based on these instructions.
I am able to submit jobs to the cluster and view the Flink UI. However, I noticed that Flink creates a taskmanager pod only when the job is submitted and deletes it right after the job is finished. Previously I tried the same using YARN based deployment on Google Dataproc and with that method the cluster had a taskmanager always running which reduced job start time.
Hence, is there a way to keep a taskmanager pod always running using K8s Flink deployment such that job start time is reduced?

The intention of Flink's native K8s support is active resource allocation: it starts new TaskManager instances (and with them task slots) when they are needed. In addition, it shuts down TaskManager pods that are no longer used. That's the behavior you're observing.
What you're looking for is the standalone k8s support. Here, Flink does not try to start new TaskManager pods. The ResourceManager is passive, i.e. it only considers the TaskManagers that are registered. Some outside process (or a user) has to manage TaskManager pods instead. This might lead to jobs failing if there are not enough task slots available.
Best,
Matthias
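To illustrate the "outside process" that standalone mode relies on, here is a minimal sketch in Python. It assumes the TaskManagers run as a Kubernetes Deployment named flink-taskmanager in the default namespace (both names are assumptions, not something stated above) and that the official kubernetes Python client is installed. The helper names are hypothetical.

```python
import math


def taskmanagers_needed(required_slots: int, slots_per_taskmanager: int) -> int:
    """Smallest number of TaskManagers that offers at least required_slots."""
    if slots_per_taskmanager < 1:
        raise ValueError("slots_per_taskmanager must be at least 1")
    return max(0, math.ceil(required_slots / slots_per_taskmanager))


def scale_taskmanagers(replicas: int,
                       name: str = "flink-taskmanager",  # assumed Deployment name
                       namespace: str = "default") -> None:
    """Set the replica count of the TaskManager Deployment."""
    # Imported lazily so the helper above works without the client installed.
    from kubernetes import client, config

    config.load_kube_config()  # use config.load_incluster_config() inside a pod
    client.AppsV1Api().patch_namespaced_deployment_scale(
        name=name, namespace=namespace, body={"spec": {"replicas": replicas}}
    )
```

Keeping the replica count at or above taskmanagers_needed(...) for your largest job would leave TaskManagers running between jobs, at the cost of idle pods.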

Related

Best practice to clean up Flink application cluster on Kubernetes when the application is completed

We are running Flink jobs on Kubernetes in Application mode. The problem is that when the job completes or is stopped, the JobManager container exits, but (1) the Deployment for the TaskManagers, (2) the JobManager Service, and (3) the ConfigMap remain until we run kubectl delete to clean them up.
This is not a big deal if we stop the job manually, but if our Flink job is a batch job that completes some time later, we need an external service that keeps monitoring the JobManager container and cleans up the remaining resources when it is done, which is not very practical.
What is the best practice here? Is running Flink batch jobs on Kubernetes supported? If so, there should be a way for the Flink job itself to clean up everything when it completes, right?
I assume that you are running a standalone Flink application on Kubernetes. In that mode, Flink is not aware of the Kubernetes cluster, so users have to rely on external tools (e.g. kubectl, k8s-operator) to manage the lifecycle of Flink clusters. This means that you need to delete the TaskManager Deployment, ConfigMaps, and Services manually.
I think this situation could be improved in the following two ways.
Set the owner reference of the TaskManager Deployment, ConfigMaps, and Services to the JobManager Job. However, you still need to delete the Kubernetes Job manually after the application has finished.
Try the native Kubernetes integration. Flink then has an embedded Kubernetes client and can delete the resources automatically when the application has finished.

Flink Statefun HA kubernetes cluster

I'm trying to deploy a highly available Flink cluster on Kubernetes. In the examples below the worker nodes are replicated, but there is only one master pod.
https://github.com/apache/flink-statefun
As far as I understand there are 2 approaches to make job manager HA.
https://ci.apache.org/projects/flink/flink-docs-stable/ops/jobmanager_high_availability.html
https://medium.com/hepsiburadatech/high-available-flink-cluster-on-kubernetes-setup-73b2baf9200e
In the first example we deploy another job manager and switch between them in case of failure.
In the second example Kubernetes redeploys the job manager pod in case of failure.
So I have a few questions:
For both examples, what happens to the running jobs when the active job manager fails?
Can the first scenario be applied on Kubernetes?
For the second scenario, in case of job manager failure the Flink UI will be unavailable until the pod recovers, but in the first scenario it will stay available. Am I right?
What are the pros and cons of the two scenarios?
There is only one approach to JobManager HA: both of your links use ZooKeeper-based JM HA, which gives the JM an active/standby architecture.
When the JobManager fails there is a "failover", as described in the Apache Flink documentation (first link): the standby JM becomes active.
Of course. Kubernetes is just where the whole Flink cluster is deployed; you can still use the HA cluster mode with ZooKeeper.
No, in both cases there will be a failover and a standby JM will become active.
What you are missing is that Kubernetes is only the deployment platform for Flink. Just as you can deploy Flink on physical or virtual servers, you can deploy it on Kubernetes, but things like high availability stay the same.
EDIT:
You can run two or more JobManager pods in Kubernetes, and then it will be equivalent to the first solution.

Flink session cluster and jobs submission in Kubernetes

Our team set up a Flink Session Cluster in our K8S cluster. We chose a Session Cluster rather than a Job Cluster because we have a number of different Flink jobs and want to decouple the development and deployment of Flink from those of our jobs. Our Flink setup contains:
Single JobManager as a K8S pod, no High Availability (HA) setup
A number of TaskManagers, each as a K8S pod
We develop our jobs in a separate repository and deploy them to the Flink cluster whenever code is merged.
Now, we noticed that the JobManager, as a pod in K8S, can be redeployed at any time by K8S, and once it is redeployed it loses all jobs. To solve this problem, we developed a script that keeps monitoring the jobs in Flink; if the jobs are not running, the script resubmits them to the cluster. Since it may take some time for the script to discover and resubmit the jobs, there is a short service interruption quite often, and we are wondering whether this could be improved.
So far, we have some ideas or questions:
One possible solution could be: when the JobManager is (re)deployed, it will fetch the latest Jobs jar and run the jobs. This solution looks overall good. Still, since our jobs are developed in a separate repo, we need a solution for the cluster to notice the latest jobs when there are changes in the jobs, either JobManager keeps polling the latest jobs jar or Jobs repo deploys the latest jobs jar.
I see that the Flink HA feature can store checkpoints/savepoints, but I am not sure whether Flink HA already handles this redeployment issue.
Does anyone have any comment or suggestion on this? Thanks!
Yes, Flink HA will solve the JobManager failover problems you're concerned about. The new job manager will pick up information about what jobs are (supposed to be) running, their jars, checkpoint status, etc, from the HA storage.
Note also that Flink 1.10 includes a beta release of native support for Kubernetes session clusters. See the docs.

Deployment "A" checks a set of checks and scales deployment "B" to run tasks

I have a GKE cluster running (v1.12.8-gke.10). I am trying to set up an app that works in a specific way, but I can't seem to find any documentation to piece it together. What I am trying to accomplish may not even be possible.
I would like to set up a deployment (1 pod) using a Python Docker image that runs a looping Python script performing checks. If the checks all pass, I would like this deployment/pod to start/scale another deployment that does a simple task, and then kill the pod that was started.
I am not sure if I should be using a deployment or if I need a HPA mixed somewhere in this process. I have also tried looking at KEDA but it only has specified triggers and doesn't fit what I am trying to do.
I am expecting two different deployments.
Deploy A = 1 pod constantly running a python script that is checking if it should be sending any commands to Deploy B.
Deploy B = listening for Deploy A to reach out to tell it to start a pod to run a task. After the task is completed, have the pod terminate.
The workflow you describe is possible. The controller would need access to the Kubernetes API, probably using the official Python client. When you received a request, you would create a Job, and probably pass information about what to run as command-line arguments. The process inside the Job's Pod would do the work and then exit normally. You'd then be responsible for monitoring the Job's status and noticing when it finished, but you wouldn't have to explicitly scale it down; deleting the completed Job would be polite.
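As a rough sketch of that first approach, the controller could build a Job manifest as a plain dict and hand it to the official Python client. The helper names, the container name "task", and the backoffLimit value are illustrative assumptions. The controller's service account would also need RBAC permission to create, read, and delete Jobs.

```python
def build_job_manifest(name: str, image: str, args: list) -> dict:
    """Return a batch/v1 Job manifest that runs `image` once with `args`."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 2,  # assumed value: retry a failed Pod at most twice
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {"name": "task", "image": image, "args": args}
                    ],
                }
            },
        },
    }


def submit_job(manifest: dict, namespace: str = "default") -> None:
    """Create the Job; call this from inside the controller pod."""
    from kubernetes import client, config  # lazy import: needs the client installed

    config.load_incluster_config()
    client.BatchV1Api().create_namespaced_job(namespace=namespace, body=manifest)


def cleanup_if_succeeded(name: str, namespace: str = "default") -> bool:
    """Delete the Job once it has succeeded; return True if it was deleted."""
    from kubernetes import client, config

    config.load_incluster_config()
    batch = client.BatchV1Api()
    job = batch.read_namespaced_job_status(name=name, namespace=namespace)
    if job.status.succeeded:
        # Background propagation also removes the Job's finished Pods.
        batch.delete_namespaced_job(
            name=name, namespace=namespace, propagation_policy="Background"
        )
        return True
    return False
```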
The architecture I'd generally recommend here would be to use a job queue like RabbitMQ. You'd have a Deployment for your controller, a separate Deployment for your worker, and a StatefulSet to run the job queue (or perhaps something like the stable/rabbitmq Helm chart). None of these would directly interact with the Kubernetes API. When a new request came in, the controller would post a message to RabbitMQ, and when the worker received a message off the queue, it would do a job.
This has the advantage of being easier to develop locally (you can just run RabbitMQ on your laptop or in a container, but getting access to the Kubernetes API is harder). If you suddenly get swamped with a huge number of job submissions, you won't try to overload the cluster with thousands of jobs; they'll back up in RabbitMQ and you can do them one at a time. If you want the cluster to do more, you can kubectl scale deployment to get more workers. If you run out of jobs the worker pod(s) will sit idle but that's not really a problem.
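The shape of the queue-based design can be sketched with Python's standard library, using queue.Queue purely as a stand-in for RabbitMQ and threads as a stand-in for worker pods. This is an illustration of the pattern only, not RabbitMQ client code; the function names are hypothetical.

```python
import queue
import threading

STOP = object()  # sentinel telling a worker to shut down


def worker(jobs: queue.Queue, results: list) -> None:
    """Take messages off the queue one at a time until STOP arrives."""
    while True:
        task = jobs.get()
        if task is STOP:
            break
        results.append(f"done:{task}")  # placeholder for the real work


def run(tasks, n_workers: int = 2) -> list:
    """Controller side: post every task, then wait for the workers to drain it."""
    jobs = queue.Queue()
    results = []
    workers = [
        threading.Thread(target=worker, args=(jobs, results))
        for _ in range(n_workers)
    ]
    for w in workers:
        w.start()
    for task in tasks:
        jobs.put(task)  # a burst of submissions just backs up in the queue
    for _ in workers:
        jobs.put(STOP)
    for w in workers:
        w.join()
    return results
```

Raising n_workers here plays the role of kubectl scale deployment: more consumers on the same queue, with no change to the controller.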

How to properly use Kubernetes for job scheduling?

I have the following system in mind: A master program that polls a list of tasks to see if they should be launched (based on some trigger information). The tasks themselves are container images in some repository. Tasks are executed as jobs on a Kubernetes cluster to ensure that they are run to completion. The master program is a container executing in a pod that is kept running indefinitely by a replication controller.
However, I have not stumbled upon this pattern of launching jobs from a pod. Every tutorial seems to be assuming that I just call kubectl from outside the cluster. Of course I could do this but then I would have to ensure the master program's availability and reliability through some other system. So am I missing something? Launching one-off jobs from inside an indefinitely running pod seems to me as a perfectly valid use case for Kubernetes.
Your master program can use the Kubernetes client libraries to perform operations on a cluster. Find a complete example here.
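A minimal sketch of such a master program in Python, assuming the official kubernetes client and an in-cluster service account that is allowed to create Jobs. The task fields (name, image, run_at) and the function names are hypothetical and only stand in for whatever trigger information you actually poll.

```python
def due_tasks(tasks: list, now: float) -> list:
    """Tasks whose trigger time has passed and that were not launched yet."""
    return [t for t in tasks if t["run_at"] <= now and not t.get("launched")]


def poll_once(tasks: list, launch, now: float) -> list:
    """Launch every due task once; return the names that were launched."""
    launched = []
    for task in due_tasks(tasks, now):
        launch(task)
        task["launched"] = True
        launched.append(task["name"])
    return launched


def launch_as_job(task: dict, namespace: str = "default") -> None:
    """Create a Kubernetes Job for the task from inside the master pod."""
    from kubernetes import client, config  # lazy import: needs the client installed

    config.load_incluster_config()  # uses the pod's service account token
    body = {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": task["name"]},
        "spec": {
            "template": {
                "spec": {
                    "restartPolicy": "OnFailure",
                    "containers": [{"name": "task", "image": task["image"]}],
                }
            }
        },
    }
    client.BatchV1Api().create_namespaced_job(namespace=namespace, body=body)
```

The master's main loop would then call poll_once(tasks, launch_as_job, time.time()) and sleep between iterations.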