I'm trying to deploy a highly available Flink cluster on Kubernetes. In the examples below, the worker nodes are replicated, but we have only one master pod.
https://github.com/apache/flink-statefun
As far as I understand, there are two approaches to making the JobManager highly available.
https://ci.apache.org/projects/flink/flink-docs-stable/ops/jobmanager_high_availability.html
https://medium.com/hepsiburadatech/high-available-flink-cluster-on-kubernetes-setup-73b2baf9200e
In the first example, we deploy a standby JobManager so Flink can switch to it in case of failure.
In the second example, Kubernetes redeploys the JobManager pod in case of failure.
So I have a few questions:
For both approaches, what happens to the running jobs when the active JobManager fails?
Can the first scenario be applied on Kubernetes?
In the second scenario, the Flink UI will be unavailable after a JobManager failure until the pod recovers, but in the first scenario it will stay available. Am I right?
What are the pros and cons of the two scenarios?
There is really only one approach to making the JobManager highly available: both of your links use JobManager HA backed by a ZooKeeper cluster to build an active/standby architecture for the JobManager.
When the active JobManager fails there is a failover, as described in the Apache Flink documentation (your first link): a standby JobManager becomes the active one.
Of course, Kubernetes is just the deployment platform for the whole Flink cluster; you can still use the HA cluster mode with ZooKeeper.
No: in both scenarios there is a failover, and a standby JobManager becomes active.
What you may be missing is that Kubernetes is only the deployment layer for Flink. Just as you can deploy it on physical or virtual servers, you can deploy it on Kubernetes, but concerns like high availability stay the same.
EDIT:
You can run two or more JobManager pods in Kubernetes, and then it will be equivalent to the first solution.
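For reference, a minimal sketch of what the ZooKeeper-based HA settings in flink-conf.yaml look like; the quorum addresses, storage path and cluster id below are placeholders you would replace with your own:

high-availability: zookeeper
high-availability.zookeeper.quorum: zk-0.zk-hs:2181,zk-1.zk-hs:2181,zk-2.zk-hs:2181
high-availability.zookeeper.path.root: /flink
high-availability.cluster-id: /my-flink-cluster
high-availability.storageDir: s3://my-bucket/flink/ha

The storageDir must point at durable storage (S3, HDFS, NFS, ...), because that is where the metadata needed for a failover is kept; ZooKeeper only stores a pointer to it. On Kubernetes you would then run the JobManager Deployment with two or more replicas so a standby instance is always available.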
Related
We are running Flink jobs on Kubernetes in Application mode. The problem is that when the job is completed or stopped, the JobManager container exits, but 1. the Deployment for the TaskManagers, 2. the JobManager Service, and 3. the ConfigMap will still be there unless we run kubectl delete to clean them up.
This is not a big deal if we stop the job manually, but if our Flink job is a batch job that completes some time later, it means we need an external service to keep monitoring the JobManager container and clean up the remaining resources when it is done, which is not very practical.
I wonder what the best practice is here. Does Flink support running batch jobs on Kubernetes? If yes, then there should be a way for the Flink job itself to clean everything up when it completes, right?
I assume that you are running a standalone Flink application on Kubernetes. In that mode, Flink is not aware of the Kubernetes cluster, so users have to leverage external tools (e.g. kubectl, a Kubernetes operator) to manage the lifecycle of Flink clusters. This means that you need to delete the TaskManager Deployment, ConfigMaps, and Services manually.
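For example, the manual cleanup after the application finishes might look like this (the resource names are placeholders for whatever your manifests created):

kubectl delete deployment flink-taskmanager
kubectl delete service flink-jobmanager
kubectl delete configmap flink-config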
I think this situation could be improved in the following two ways.
Set the owner reference of the TaskManager Deployment, ConfigMaps, and Services to the JobManager Job. However, you still need to delete the Kubernetes Job manually after the application finishes.
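A sketch of what such an owner reference could look like on the TaskManager Deployment, assuming the JobManager runs as a Kubernetes Job named flink-jobmanager (the uid must match the live Job object, so it usually has to be filled in by whatever tooling creates the resources):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: flink-taskmanager
  ownerReferences:
  - apiVersion: batch/v1
    kind: Job
    name: flink-jobmanager
    uid: <uid-of-the-flink-jobmanager-job>
    controller: false
    blockOwnerDeletion: false
spec:
  # ... the usual TaskManager pod template ...

With this in place, deleting the JobManager Job lets Kubernetes garbage-collect the dependent Deployment (and likewise ConfigMaps and Services set up the same way), but as noted above the Job itself still has to be deleted manually.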
Try the native Kubernetes integration. Flink then has an embedded Kubernetes client and can delete the resources automatically when the application finishes.
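A rough example of launching the same job with the native Kubernetes integration in application mode (cluster id, image name and jar path are placeholders; check the native Kubernetes docs for the options your Flink version supports):

./bin/flink run-application \
    --target kubernetes-application \
    -Dkubernetes.cluster-id=my-flink-app \
    -Dkubernetes.container.image=my-registry/my-flink-job:latest \
    local:///opt/flink/usrlib/my-job.jar

When the application reaches a terminal state, Flink itself tears down the cluster resources it created, so no external cleanup script is needed.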
What are the major differences between the native Kubernetes and the standalone Kubernetes deployments of Flink?
I'm new to Kubernetes and trying to understand how the different Flink deployment modes on it compare.
Any insight into the internals would be of great help.
In a Kubernetes session or per-job deployment, Flink has no idea it's running on Kubernetes. In this mode, Flink behaves as it does in any standalone deployment (where there is no cluster framework available to do resource management). Kubernetes just happens to be how the infrastructure was created, but as far as Flink is concerned, it could have been bare metal. You will have to arrange for Kubernetes to create the infrastructure that you have configured Flink to expect.
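To make that concrete, in a standalone session deployment you write the manifests yourself; a trimmed-down TaskManager Deployment might look roughly like this (image tag and service name are placeholders, and I'm assuming the official image's entrypoint honors FLINK_PROPERTIES):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: flink-taskmanager
spec:
  replicas: 2
  selector:
    matchLabels:
      app: flink
      component: taskmanager
  template:
    metadata:
      labels:
        app: flink
        component: taskmanager
    spec:
      containers:
      - name: taskmanager
        image: flink:1.13
        args: ["taskmanager"]
        env:
        - name: FLINK_PROPERTIES
          value: "jobmanager.rpc.address: flink-jobmanager"

Scaling up or down here is entirely on you (kubectl scale, or editing replicas), which is exactly the contrast with the native mode described next.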
In a Native Kubernetes session deployment, Flink uses its KubernetesResourceManager, which submits a description of the cluster it wants to the Kubernetes ApiServer, which creates it. As jobs come and go, and the requirements for task managers (and slots) go up and down, Flink is able to obtain and release resources from Kubernetes as appropriate.
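For comparison, a native session cluster is started by Flink itself rather than by manifests you write; roughly (cluster id and image are placeholders):

./bin/kubernetes-session.sh \
    -Dkubernetes.cluster-id=my-session-cluster \
    -Dkubernetes.container.image=flink:1.13

Jobs submitted against that cluster id then cause the KubernetesResourceManager to request TaskManager pods on demand and release them again once they sit idle.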
In Application Mode (blog post) (details) you end up with Flink running as a Kubernetes application, which will automatically create and destroy cluster components as needed for the job(s) in one Flink application.
Our team set up a Flink Session Cluster in our K8S cluster. We chose a Flink Session Cluster rather than a Job Cluster because we have a number of different Flink jobs, and we want to decouple the development and deployment of Flink itself from those of our jobs. Our Flink setup contains:
Single JobManager as a K8S pod, no High Availability (HA) setup
A number of TaskManagers, each as a K8S pod
And we develop our jobs in a separate repository and deploy them to the Flink cluster when code is merged.
Now, we noticed that the JobManager, as a pod in K8S, can be redeployed at any time by K8S. Once it is redeployed, it loses all jobs. To solve this problem, we developed a script that keeps monitoring the jobs in Flink; if the jobs are not running, the script resubmits them to the cluster. Since it may take some time for the script to discover and resubmit the jobs, there is a small service interruption fairly often, and we are wondering whether this could be improved.
So far, we have some ideas or questions:
One possible solution could be: when the JobManager is (re)deployed, it fetches the latest job jar and runs the jobs. This solution looks good overall, but since our jobs are developed in a separate repo, we need a way for the cluster to notice the latest jobs whenever they change: either the JobManager keeps polling for the latest job jar, or the jobs repo deploys the latest jar to the cluster.
I see that the Flink HA feature can store checkpoints/savepoints, but I'm not sure whether Flink HA can already handle this redeployment issue.
Does anyone have any comment or suggestion on this? Thanks!
Yes, Flink HA will solve the JobManager failover problems you're concerned about. The new job manager will pick up information about what jobs are (supposed to be) running, their jars, checkpoint status, etc, from the HA storage.
Note also that Flink 1.10 includes a beta release of native support for Kubernetes session clusters. See the docs.
We are still in the design phase of moving away from a monolithic architecture towards microservices with Docker and Kubernetes. We did some basic research on Docker and Kubernetes and got some understanding, but we still have a couple of open questions, given that we will be creating the K8s cluster on multiple Linux hosts (for various reasons we can't consider the cloud right now).
Consider a scenario where our K8s cluster spans multiple Linux hosts (5+).
1) If one of the Linux worker nodes crashes and we later bring it back, is enabling kubelet in systemctl in advance sufficient for the node to bring up the required K8s components so that it is detected by the master again?
2) I believe that once a worker node (running X pods) crashes, after the pod eviction timeout the master will reschedule those X pods onto other healthy node(s). Once the node is up again, it won't get those X pods back, since the master has already scheduled them onto other nodes, but it will be ready to accept new work from the master.
Is this correct?
Yes, that should be the default behavior; check your cluster deployment tool.
Yes, Kubernetes handles these things automatically for Deployments. For StatefulSets (with local volumes) and DaemonSets, things can be node-specific, and Kubernetes will wait for the node to come back.
It's better to create a test environment and try out the failure scenarios yourself.
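If you do build such a test environment, a rough sequence for exercising these scenarios could look like this (the node name is a placeholder, and drain's flag names vary a bit across kubectl versions):

# on the worker node: make sure kubelet comes back after a reboot
sudo systemctl enable kubelet

# from a control-plane machine: simulate the node going away and watch the pods move
kubectl drain worker-2 --ignore-daemonsets
kubectl get pods -o wide --watch
kubectl uncordon worker-2

Pods owned by a Deployment should reappear on the remaining nodes after the drain (or, for a real crash, after the eviction timeout), and the uncordoned node simply becomes schedulable again for new pods.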
I installed and configured a 3-node K8s cluster. The worker nodes are Windows nodes. We have one .NET application that we want to containerize. This application internally uses Apache Ignite as a distributed cache.
We built a Docker image for this application, wrote a deployment file, and deployed it to the K8s cluster. The deployment also creates a Service of type LoadBalancer, and using this Service we connect to the application from the outside world. All is good so far.
Coming to the issue: since we are using Apache Ignite for the distributed cache, one of the pods will be the master. We want to always forward traffic to the pod that is acting as the master node in the Apache Ignite cluster, and the identification of that master node must be dynamic.
I had gone through the link below. There, the pod configuration is static; we want to identify the master pod dynamically and forward traffic to it. What do we have to do on the Service side?
https://appscode.com/products/voyager/7.4.0/guides/ingress/http/statefulset-pod/
Any help on how to forward the traffic to that pod is greatly appreciated.
Given that you have a leader/follower topology, the ask to direct traffic to a particular node (the master node) is flawed, for a couple of reasons:
What happens when the current leader fails over and a new election selects a new leader?
Pods are ephemeral and should not be given special roles in production; work with Deployments and their replicas instead. What you are trying to achieve is an anti-pattern.
In any case, if this is what you want, you may want to read about gateways in Istio, which can be found here.
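If you do look at Istio, the rough shape is a Gateway plus a VirtualService in front of your existing Service; the hostname and service name below are placeholders, and note that this still routes to the Service as a whole, not to one specific "master" pod:

apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: ignite-app-gateway
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "app.example.com"
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: ignite-app-routes
spec:
  hosts:
  - "app.example.com"
  gateways:
  - ignite-app-gateway
  http:
  - route:
    - destination:
        host: my-dotnet-app
        port:
          number: 80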