Flink HA JobManager cluster cannot elect a leader - kubernetes

I'm trying to deploy Apache Flink 1.6 on Kubernetes, following the tutorial on the JobManager high availability page.
I already have a working Zookeeper 3.10 cluster; from its logs I can see that it is healthy, it is not configured for Kerberos or SASL, and the ACL rules allow every client to read and write znodes. When I start the cluster everything works as expected: every JobManager and TaskManager pod successfully reaches the Running state, and I can see the connected TaskManager instances in the master JobManager's web UI. But when I delete the master JobManager's pod, the remaining JobManager pods cannot elect a leader, and every JobManager UI in the cluster shows the following error message.
{
"errors": [
"Service temporarily unavailable due to an ongoing leader election. Please refresh."
]
}
Even if I refresh the page nothing changes; it stays stuck on this error message.
My suspicion is that the problem is related to the high-availability.storageDir option. I already have a working MinIO S3 deployment in my k8s cluster (tested with CloudExplorer), but Flink cannot write anything to the S3 server. You can find every config in this GitHub gist.
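For anyone reading, the options in question are of this general shape. This is only a hedged sketch, not my actual gist: the quorum addresses, bucket name, and MinIO credentials are placeholders, and the exact S3 key names depend on whether the flink-s3-fs-presto or flink-s3-fs-hadoop filesystem is used.

# flink-conf.yaml (sketch; hostnames, bucket and credentials are placeholders)
high-availability: zookeeper
high-availability.zookeeper.quorum: zk-0.zk-svc:2181,zk-1.zk-svc:2181,zk-2.zk-svc:2181
high-availability.zookeeper.path.root: /flink
high-availability.cluster-id: /flink-ha
# must point to storage that every JobManager can read and write
high-availability.storageDir: s3://flink-ha/recovery
# MinIO endpoint and credentials (key names vary slightly between the S3 filesystem plugins)
s3.endpoint: http://minio.minio.svc.cluster.local:9000
s3.path.style.access: true
s3.access-key: <minio-access-key>
s3.secret-key: <minio-secret-key>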

According to the logs it looks as if the TaskManager cannot connect to the new leader, and I assume the same is true for the web UI. The logs say that it tries to connect to flink-job-manager-0.flink-job-svc.flink.svc.cluster.local/10.244.3.166:44013. I cannot tell from the logs whether flink-job-manager-1 binds to this IP, but my suspicion is that the headless service might return multiple IPs and Flink picks the wrong/old one. Could you log into the flink-job-manager-1 pod and check what its IP address is?
I think you should be able to resolve this problem by defining a dedicated service for each JobManager, or by using the pod hostname instead; a rough sketch of the per-JobManager service is below.
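To illustrate the dedicated-service idea, a hedged sketch only: the names mirror the pod names above, the selector relies on the statefulset.kubernetes.io/pod-name label that the StatefulSet controller sets on its pods, and the ports are Flink's defaults.

apiVersion: v1
kind: Service
metadata:
  name: flink-job-manager-1
  namespace: flink
spec:
  selector:
    # targets exactly one pod of the StatefulSet
    statefulset.kubernetes.io/pod-name: flink-job-manager-1
  ports:
    - name: rpc
      port: 6123
    - name: blob
      port: 6124
    - name: ui
      port: 8081

Each JobManager would then advertise its own service name (or, with the hostname approach, its pod FQDN) as its RPC address, so the address written to ZooKeeper always resolves to that single pod.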

Related

Failed to send the transaction successfully to the order status: SERVICE UNAVAILABLE

I am using the Kafka orderer service for Hyperledger Fabric 1.4. While updating chaincode or making any putState call I get an error message stating Failed to send the transaction successfully to the order status: SERVICE UNAVAILABLE. Checking the Zookeeper and Kafka nodes, it looks like the Kafka nodes are not able to talk to each other.
kafka & zookeeper logs
Could you provide more info about the topology of the zookeeper-kafka cluster?
Did you use docker to deploy the zk cluster? If so, you can refer to this file.
https://github.com/whchengaa/hyperledger-fabric-technical-tutorial/blob/cross-machine-hosts-file/balance-transfer/artifacts/docker-compose-kafka.yaml
Remember to specify the IP address of the other zookeeper nodes in the hosts file which is mounted to /etc/hosts of that zookeeper node.
Make sure the port numbers of the zookeeper nodes listed in the ZOO_SERVERS environment variable are correct.
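As a hedged sketch of what that looks like in a compose file (hostnames, IDs and IP addresses are placeholders; 2888/3888 are zookeeper's standard peer and leader-election ports):

# under services: in the compose file, one entry per zookeeper node
zookeeper1:
  image: hyperledger/fabric-zookeeper
  environment:
    - ZOO_MY_ID=1
    - ZOO_SERVERS=server.1=zookeeper1:2888:3888 server.2=zookeeper2:2888:3888 server.3=zookeeper3:2888:3888
  extra_hosts:
    # same effect as mounting a hosts file into /etc/hosts of the node
    - "zookeeper2:192.168.1.12"
    - "zookeeper3:192.168.1.13"

Each server.N entry in ZOO_SERVERS must agree with the ZOO_MY_ID set on the corresponding node, and every node must be able to resolve the other hostnames.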

Flink Statefun HA kubernetes cluster

I'm trying to deploy a highly available Flink cluster on Kubernetes. In the examples below the worker nodes are replicated, but we have only one master pod.
https://github.com/apache/flink-statefun
As far as I understand there are 2 approaches to make job manager HA.
https://ci.apache.org/projects/flink/flink-docs-stable/ops/jobmanager_high_availability.html
https://medium.com/hepsiburadatech/high-available-flink-cluster-on-kubernetes-setup-73b2baf9200e
In the first example we deploy another job manager so we can switch between them in case of failure.
In the second example Kubernetes redeploys the job manager pod in case of failure.
So I have a few questions:
For both examples, what happens to the running jobs when the active job manager fails?
Can the first scenario be applied on Kubernetes?
For the second scenario, in case of job manager failure the Flink UI will be unavailable until the pod recovers, but in the first scenario it will stay available, am I right?
What are the pros/cons of the two scenarios?
There is really only one approach to making the JobManager HA: both of your links use JobManager HA backed by a Zookeeper cluster to build an active/standby architecture for the JobManager.
When the JobManager fails there is a failover, as described in the Apache Flink documentation (first link): the standby JobManager becomes active.
Of course; Kubernetes is just the deployment platform for the whole Flink cluster, so you can still use the HA cluster mode with Zookeeper.
No, both will perform the failover and a standby JobManager will become active.
Keep in mind that Kubernetes is only the cluster Flink is deployed on. Just as you can deploy Flink on physical/virtual servers, you can deploy it on Kubernetes, but things like high availability stay the same.
EDIT:
You can run 2 or more JobManager pods in Kubernetes, which makes it equivalent to the first solution; see the sketch below.
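As a hedged sketch (names, labels and the image tag are placeholders, and the ZooKeeper/HA options from the linked documentation still have to be set in flink-conf.yaml), running two JobManager replicas could look like:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: flink-jobmanager
spec:
  replicas: 2                 # one becomes active, the other stays standby via ZooKeeper leader election
  selector:
    matchLabels:
      app: flink
      component: jobmanager
  template:
    metadata:
      labels:
        app: flink
        component: jobmanager
    spec:
      containers:
        - name: jobmanager
          image: flink:1.8    # placeholder version
          args: ["jobmanager"]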

Restarting NiFi Node Joins Cluster as New Node

I am currently running Apache NiFi as a StatefulSet on Kubernetes. I'm testing to see how the cluster recovers if I kill a pod but am experiencing a problem when the pod (NiFi node) rejoins the cluster.
The node will rejoin as an additional node instead of reappearing under its original identity. For example, if I have a 3-node NiFi cluster and kill and restart one pod/NiFi node, I end up with a 4-node cluster with one node disconnected.
Before:
After:
I believe that the NiFi node is identified somehow in a config file which isn't persisting when it is killed. So far I am using persistent volumes to persist the following config files:
state-management.xml
authorizers.xml
I haven't persisted nifi.properties (it is dynamically generated on startup and I can't see anything in there that could uniquely identify the node).
So I guess the question is: how is the node uniquely identified to the server, and where is that stored?
EDIT: I'm using an external Zookeeper.
Thank you in advance,
Harry
Each node stores the state of the cluster in the local state manager, which by default is written to a write-ahead log under nifi-home/state/local. Most likely you are losing the state/local directory on the node being restarted.
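Since the pods run as a StatefulSet, one way to keep that directory across restarts is a persistent volume mounted over it. A hedged sketch only; the mount path assumes the apache/nifi image layout and should be adjusted to wherever your NiFi home actually is:

# excerpt from the NiFi container spec in the StatefulSet
volumeMounts:
  - name: nifi-state
    mountPath: /opt/nifi/nifi-current/state   # assumed NiFi home; contains state/local

# excerpt from the StatefulSet spec, sibling of template:
volumeClaimTemplates:
  - metadata:
      name: nifi-state
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi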

What sort of logs should I see when Ignite nodes discover each other?

I'm setting up an Ignite cluster in Kubernetes.
Installation is straightforward using this Helm chart: Ignite helm chart
My cluster has network policies enabled, so I added policies allowing the Ignite nodes to talk to each other on all the relevant ports (jdbc, spi_communication, spi_discovery, jmx, sql, rest, thin_clients).
However, my failover test clearly shows that they haven't discovered each other to form a cluster.
Reconnected to 100.96.0.28:10800
Error: Cache does not exist [cacheId= -479252689]
Reconnected to 100.96.0.28:10800
Error: Cache does not exist [cacheId= -479252689]
Also, there seem to be no logs hinting that the Ignite nodes are finding each other.
I'm out of clues at the moment. So, if anyone has got Ignite SPI discovery working, could you give an example of what sort of log I should expect?
There is a message that you can see in the logs on each topology change event:
Topology snapshot [ver=2, locNode=abffa279, servers=2, clients=0,
state=ACTIVE, CPUs=32, offheap=32.0GB, heap=16.0GB]
It means that nodes were able to find each other and joined the cluster.
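If the servers count in that line never goes above 1, peer traffic is most likely still blocked by the network policies. A hedged sketch of a NetworkPolicy allowing the default peer ports (the app: ignite label is an assumption about what the Helm chart sets, and only the base ports are listed; Ignite's defaults can range up through 47509 and 47109):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ignite-peer-traffic
spec:
  podSelector:
    matchLabels:
      app: ignite            # assumed label from the Helm chart
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: ignite
      ports:
        - protocol: TCP
          port: 47500        # discovery SPI (default base port)
        - protocol: TCP
          port: 47100        # communication SPI (default base port)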

In Kubernetes 1.7 with multiple master nodes and an etcd cluster, how does a master node connect to the etcd cluster by default?

I want to know: when a master node wants to connect to the etcd cluster, which etcd node will be selected? Does the master node always connect to the same etcd node until it becomes unavailable? Does each node in the master cluster connect to the same node in the etcd cluster?
The scheduler and controller-manager talk to the API server on the same node. In an HA setup you'll have only one of each active at a time (based on a lease), and whichever is currently active talks to the local API server. If for some reason it fails to connect to the local API server, it doesn't renew the lease and another leader is elected.
As described, only one will be the leader at any given moment, so that's the only place that needs to worry about reaching the etcd cluster. As for the etcd cluster itself, when you configure the Kubernetes API server you pass it the etcd-servers flag, which is a list of etcd nodes like:
--etcd-servers=https://10.240.0.10:2379,https://10.240.0.11:2379,https://10.240.0.12:2379
This is then passed to the Go etcd/client library which, looking at its README, states:
etcd/client does round-robin rotation on other available endpoints if the preferred endpoint isn't functioning properly. For example, if the member that etcd/client connects to is hard killed, etcd/client will fail on the first attempt with the killed member, and succeed on the second attempt with another member. If it fails to talk to all available endpoints, it will return all errors happened.
This means that it will try each of the available nodes until it succeeds in connecting to one.