Recently I encountered a problem with Kafka (running on our company's Kubernetes cluster). Everything was running fine, then suddenly all of my Kafka and ZooKeeper pods could not connect to their headless services (the pods are still in the Running state), which results in a timeout exception every time I publish a message to a topic. Below is an image from the log of a ZooKeeper pod:
The same thing happens to all of my broker pods.
Has anyone faced this problem and solved it? Please let me know.
Thanks in advance! By the way, I'm sorry for my bad English.
This seems like a networking problem: "no route to host".
I have a single instance of ZooKeeper running without issues; however, when I add two more nodes, it crashes with leader election errors, or with "got a connection request from the server with own id".
I'd appreciate any help here.
In short, you should use a StatefulSet.
If you would like the community to help you, please provide the logs and the errors from the crashes.
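For reference, here is a minimal sketch (not a production config) of a StatefulSet-based ZooKeeper ensemble behind a headless Service. The names and image are assumptions; you still need to give each replica a unique myid (e.g. derived from the pod ordinal) and list the stable peer names in the ensemble configuration.
apiVersion: v1
kind: Service
metadata:
  name: zk-headless
spec:
  clusterIP: None              # headless: gives each pod a stable DNS record
  selector:
    app: zk
  ports:
    - name: client
      port: 2181
    - name: peer
      port: 2888
    - name: leader-election
      port: 3888
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: zk
spec:
  serviceName: zk-headless     # pods become zk-0.zk-headless, zk-1.zk-headless, ...
  replicas: 3
  selector:
    matchLabels:
      app: zk
  template:
    metadata:
      labels:
        app: zk
    spec:
      containers:
        - name: zookeeper
          image: zookeeper:3.6           # assumed image; use the one you already deploy
          ports:
            - containerPort: 2181        # client
            - containerPort: 2888        # peer
            - containerPort: 3888        # leader election
With stable names, each server entry in the ensemble configuration (server.1 pointing at zk-0.zk-headless, and so on) keeps referring to the same pod across restarts, which helps avoid identity mix-ups such as a server receiving a connection request with its own id.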
Is anyone aware of this issue? I have a cluster of 3 nodes and I am running pods in a StatefulSet. Three pods are running in order: assume pod-0 runs on node-1, pod-1 on node-2, and pod-2 on node-3. Traffic flows properly and responses come back immediately, but when we stop one node (e.g. node-2), responses become intermittent and traffic is still routed to the stopped pod. Is there any solution or workaround for this issue?
This seems to be a reported issue. However, Kubernetes is a distributed, cloud-native system, and you should design for resilience, for example with request retries (see the sketch after the links below).
Improve availability and resilience of your Microservices using these seven cloud design patterns
How to Make Services Resilient in a Microservices Environment
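As a concrete example of client-side retries, and assuming the clients are Kafka producers configured through Spring Boot (only an assumption here; any client library with retry settings works the same way), a minimal application.yml sketch:
spring:
  kafka:
    producer:
      retries: 5                       # resend on transient errors instead of failing immediately
      properties:
        retry.backoff.ms: 1000         # wait 1s between attempts
        delivery.timeout.ms: 120000    # overall upper bound for a single send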
I'm having the same problem with kafka-streams and spring-kafka applications. The first one uses the kafka-clients:1.0.0 library, while the other uses version 1.0.2.
There is just one broker instance running in Kubernetes (KAFKA_ADVERTISED_LISTENERS="PLAINTEXT://${POD_IP}:9092"). It's a StatefulSet, and it's accessed from the client app via the headless service's internal endpoint (although I've tried a ClusterIP service and the issue is the same).
Once I delete this Kafka pod and it's recreated, my client application can't reconnect. The pod is indeed recreated with another IP address, but since I'm accessing it via the service's internal endpoint, I expect my client app to resolve the new address, yet that's not happening.
The kafka-clients library keeps logging "Found least loaded node [old_ip]:9092 (id: 0 rack: null)" even though nothing is running at that address anymore.
The JVM TTL cache is not the problem, as I've set it to refresh periodically.
Restarting the client application solves the problem.
If providing ${POD_IP} in KAFKA_ADVERTISED_LISTENERS causes this problem, would providing the pod's hostname solve it? Or is there a way to direct my client to resolve the new address?
It seems to have something to do with KAFKA-7755. Updating the client version to 2.2.0 / 2.1.1 should help.
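On the hostname question: with a StatefulSet behind a headless service, each broker already has a stable DNS name (e.g. kafka-0.kafka-headless), and advertising that name instead of ${POD_IP} means a recreated pod comes back at the same address. A minimal sketch (the service name, labels, and image are assumptions):
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: kafka
spec:
  serviceName: kafka-headless            # must match an existing headless Service
  replicas: 1
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      containers:
        - name: kafka
          image: confluentinc/cp-kafka:5.2.1   # assumed image
          ports:
            - containerPort: 9092
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name     # e.g. kafka-0
            - name: KAFKA_ADVERTISED_LISTENERS
              # $(POD_NAME) is expanded by Kubernetes because POD_NAME is
              # declared earlier in this env list
              value: "PLAINTEXT://$(POD_NAME).kafka-headless:9092"
Even with a stable advertised name, KAFKA-7755 is about the client caching resolved IP addresses and not re-resolving the hostname, so the client upgrade suggested above is still worthwhile.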
I'm trying to deploy Apache Flink 1.6 on Kubernetes, following the tutorial on the JobManager high availability page. I already have a working ZooKeeper 3.10 cluster; from its logs I can see that it's healthy and not configured for Kerberos or SASL, and the ACL rules let every client read and write znodes. When I start the cluster, everything works as expected: every JobManager and TaskManager pod successfully gets into the Running state, and I can see the connected TaskManager instances from the master JobManager's web UI. But when I delete the master JobManager's pod, the other JobManager pods cannot elect a leader, and every JobManager UI in the cluster shows the following error message.
{
  "errors": [
    "Service temporarily unavailable due to an ongoing leader election. Please refresh."
  ]
}
Even if I refresh the page, nothing changes; it stays stuck at this error message.
My suspicion is that the problem is related to the high-availability.storageDir option. I already have a working MinIO S3 deployment (tested with CloudExplorer) in my k8s cluster, but Flink cannot write anything to the S3 server. Here you can find every config in the GitHub gist.
According to the logs, it looks as if the TaskManager cannot connect to the new leader; I assume the same is true for the web UI. The logs say that it tries to connect to flink-job-manager-0.flink-job-svc.flink.svc.cluster.local/10.244.3.166:44013. I cannot tell from the logs whether flink-job-manager-1 binds to this IP, but my suspicion is that the headless service might return multiple IPs and Flink picks the wrong/old one. Could you log into the flink-job-manager-1 pod and check what its IP address is?
I think you should be able to resolve this problem by defining a dedicated service for each JobManager, or by using the pod hostname instead.
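For the dedicated-service option, a per-pod Service can select a single StatefulSet pod via the statefulset.kubernetes.io/pod-name label that the StatefulSet controller adds automatically. A minimal sketch for the first JobManager (the ports are Flink defaults; the names are taken from the logs above):
apiVersion: v1
kind: Service
metadata:
  name: flink-job-manager-0              # one such Service per JobManager replica
spec:
  selector:
    statefulset.kubernetes.io/pod-name: flink-job-manager-0
  ports:
    - name: rpc
      port: 6123                         # jobmanager.rpc.port default
    - name: ui
      port: 8081                         # web UI / REST default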
I faced an issue with Kubernetes after an OOM on the master node. The Kubernetes services looked OK and there were no error or warning messages in the logs, but Kubernetes failed to process a new deployment that was created after the OOM happened.
I reloaded Kubernetes with systemctl restart kube-*, and that solved the issue: Kubernetes began working normally again.
I just wonder: is this expected behavior or a bug in Kubernetes?
It would be great if you could share the kube-controller's logs. When the API server crashes or is OOMKilled, there can be synchronization problems in early versions of Kubernetes (I remember we saw similar problems with DaemonSets, and I have a bug filed with the Kubernetes community), but they are rare.
Meanwhile, we did a lot of work to make Kubernetes production ready, both tuning Kubernetes itself and crafting the other microservices that need to talk to it. I hope these blog entries help:
https://applatix.com/making-kubernetes-production-ready-part-2/ This is about the 30+ knobs we used to tune Kubernetes
https://applatix.com/making-kubernetes-production-ready-part-3/ This is about microservice behavior to ensure cluster stability
It turned out the problem wasn't caused by the OOM; it was caused by the kube-controller, regardless of whether the OOM happened or not.
If I restart the kube-controller, Kubernetes begins processing deployments and pods normally.