Can you run a ZooKeeper cluster without using StatefulSets in OpenShift? - kubernetes

I have a single instance of ZooKeeper running without issues; however, when I add two more nodes the ensemble crashes during leader election, or the logs show a connection request from a server with its own id.
Appreciate any help here.

In short, you should use a StatefulSet.
If you would like the community to help you, please provide the logs and the errors from the crashes.
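For illustration, here is a minimal sketch of a ZooKeeper StatefulSet backed by a headless Service. The names, image, and replica count are assumptions, and a real deployment also needs per-pod myid/server-list configuration and persistent storage:

```yaml
# Minimal sketch, not production-ready; names and image are assumptions.
apiVersion: v1
kind: Service
metadata:
  name: zk-headless
spec:
  clusterIP: None            # headless Service gives each pod a stable DNS name
  selector:
    app: zk
  ports:
    - name: peer
      port: 2888
    - name: leader-election
      port: 3888
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: zk
spec:
  serviceName: zk-headless   # ties stable pod identities (zk-0, zk-1, zk-2) to the Service
  replicas: 3
  selector:
    matchLabels:
      app: zk
  template:
    metadata:
      labels:
        app: zk
    spec:
      containers:
        - name: zookeeper
          image: zookeeper:3.6        # assumption: official image
          ports:
            - containerPort: 2181     # client
            - containerPort: 2888     # peer connections
            - containerPort: 3888     # leader election
          # A real setup must also derive the server id from the pod ordinal and
          # list the ensemble members by their stable DNS names
          # (zk-0.zk-headless, zk-1.zk-headless, zk-2.zk-headless).
```

The stable network identities are exactly what leader election needs; with a plain Deployment the pod names and addresses change on every restart, which is why the ensemble keeps crashing.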

Related

GKE Upgrade Causing Kafka to become Unavailable

I have a Kafka cluster hosted in GKE. Google updates GKE nodes on a weekly basis, and whenever this happens Kafka becomes temporarily unavailable, which causes massive errors/rebalancing to get it back to a healthy state. Currently we rely on K8s retries to eventually succeed once the upgrade completes and the cluster becomes available again. Is there a way to gracefully handle this type of situation in Kafka, or avoid it if possible?
In order to give you better advice, we would need a little more information: what is your setup? Which versions of Kubernetes and Kafka? How many Kafka and ZK pods? How are you deploying your Kafka cluster (via a simple Helm chart or an operator)? What are the exact symptoms you see when you upgrade your Kube cluster? What errors do you get? What is the state of the Kafka cluster, etc.? How do you monitor it?
But here are some points worth investigating.
Are you spreading the Kafka/ZK pods correctly across the nodes/zones?
Do you set PodDisruptionBudgets (PDBs) with a reasonable maxUnavailable setting? (A sketch follows below.)
What are your readiness/liveness probes for your Kafka/ZK pods?
Are your topics correctly replicated?
I would strongly encourage you to take a look at https://strimzi.io/, which can be very helpful if you want to operate Kafka on Kube. It is an open-source operator and very well documented.
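As a rough illustration of the PDB and pod-spreading points above, here is a sketch; the labels, names, and values are assumptions and need to match your actual Kafka pods:

```yaml
# Sketch only; label selectors and values are assumptions.
# apiVersion policy/v1 applies to recent Kubernetes; older clusters use policy/v1beta1.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kafka-pdb
spec:
  maxUnavailable: 1          # let node drains evict at most one Kafka pod at a time
  selector:
    matchLabels:
      app: kafka
---
# Pod anti-affinity snippet for the Kafka pod template (goes under
# spec.template.spec of the broker StatefulSet), spreading brokers across nodes:
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: kafka
        topologyKey: kubernetes.io/hostname
```

With a PDB in place, a GKE node upgrade drains brokers one at a time instead of taking several down at once, which keeps replicated topics available during the rolling upgrade.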
You can control GKE node auto-upgrades through the "maintenance window" setting, which decides when upgrades should occur. Based on your business criticality, you can configure this option along with the K8s retry feature.

Could not connect to Kafka headless service

Recently I encountered a problem with Kafka (running on our company's K8s system). Everything was running fine, then suddenly all of my Kafka and ZooKeeper pods could not connect to their headless services (the pods are still in a Running state), which results in a timeout exception every time I publish a message to a topic. Below is an image from the log of a ZooKeeper pod:
The same thing happens to all of my broker pods.
Has anyone faced this problem and solved it? Please let me know.
Thanks in advance! By the way, I'm sorry for my bad English.
This looks like a networking problem: "no route to host".
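One way to narrow "no route to host" down to a DNS versus network issue is a throwaway debug pod that tries to resolve the headless Service; the service name and namespace below are placeholders for your own:

```yaml
# Throwaway debug pod; the service name and namespace are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: dns-debug
spec:
  restartPolicy: Never
  containers:
    - name: debug
      image: busybox:1.36
      # Resolving the headless Service should return one A record per ready pod.
      command: ["nslookup", "kafka-headless.default.svc.cluster.local"]
```

If the lookup returns the pod IPs but connections still fail, the problem is routing/CNI between nodes rather than service discovery.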

Kafka on Kubernetes

We have 12 APIs deployed on a cluster, and we are using Kafka, which is deployed on 3 EC2 instances. Should I move the Kafka servers into K8s too, or should I keep them where they are? Or should I start using AWS MSK?
I'm still experimenting, so any suggestions or good documentation would be helpful.
This is opinion-based, so it's probably going to be closed, but check out https://strimzi.io/. It's been working great for us.
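For reference, with Strimzi the whole cluster is declared as a Kafka custom resource and the operator creates the brokers, ZooKeeper, and services for you. This is a minimal sketch; check the exact fields and apiVersion against the Strimzi documentation for your operator version:

```yaml
# Minimal sketch of a Strimzi Kafka custom resource; verify fields against the
# Strimzi docs for your operator version.
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
spec:
  kafka:
    replicas: 3
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
    storage:
      type: persistent-claim
      size: 100Gi
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 100Gi
  entityOperator:
    topicOperator: {}
    userOperator: {}
```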

Hazelcast split-brain

I'm using Hazelcast (3.7.4) with OpenShift.
Each application is starting a HazelcastInstance.
The network discovery is done via hazelcast-kubernetes (1.1.0).
Sometimes when I deploy the whole application, the cluster gets stuck in a split-brain syndrome forever. It never fixes itself and re-forms a single cluster.
I have to restart the pods to force the reconstruction of a single cluster.
Can someone help me prevent the split brain, or at least make it recover afterwards?
Thanks
Use a StatefulSet instead of a Deployment (or ReplicationController). Then, pods start one by one, which prevents the split-brain issue. You can have a look at the official OpenShift code sample for Hazelcast, or specifically at the OpenShift template for Hazelcast.
What's more, try to use the latest Hazelcast version; I think it should re-form the cluster even if you use a Deployment and the cluster starts with a split brain.
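To illustrate the ordered startup this answer relies on, here is a bare-bones StatefulSet sketch for Hazelcast; the names, image, and replica count are assumptions, and the hazelcast-kubernetes discovery plugin still has to be enabled in the Hazelcast configuration:

```yaml
# Bare-bones sketch; name, image, and replica count are assumptions.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: hazelcast
spec:
  serviceName: hazelcast               # assumes a matching headless Service exists
  replicas: 3
  podManagementPolicy: OrderedReady    # default: pods start one by one, so members
                                       # join an existing cluster instead of forming their own
  selector:
    matchLabels:
      app: hazelcast
  template:
    metadata:
      labels:
        app: hazelcast
    spec:
      containers:
        - name: hazelcast
          image: hazelcast/hazelcast:3.7.4   # assumption: official image matching the question
          ports:
            - containerPort: 5701
          # hazelcast-kubernetes discovery still needs to be enabled in hazelcast.xml
          # or via system properties; this manifest only controls startup ordering.
```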

Kubernetes breaks after OOM

I faced an issue with Kubernetes after an OOM on the master node. The Kubernetes services looked OK, and there were no error or warning messages in the logs. But Kubernetes failed to process a new deployment which was created after the OOM happened.
I reloaded Kubernetes with systemctl restart kube-*, and that solved the issue; Kubernetes began to work normally.
I just wonder, is this expected behavior or a bug in Kubernetes?
It would be great if you could share the kube-controller's log. When the API server crashes or is OOMKilled, there can be synchronization problems in early versions of Kubernetes (I remember we saw similar problems with DaemonSets and filed a bug with the Kubernetes community), but they are rare.
Meanwhile, we did a lot of work to make Kubernetes production-ready: both tuning Kubernetes and crafting the other microservices that need to talk to Kubernetes. I hope these blog entries help:
https://applatix.com/making-kubernetes-production-ready-part-2/ This is about the 30+ knobs we used to tune Kubernetes
https://applatix.com/making-kubernetes-production-ready-part-3/ This is about microservice behavior to ensure cluster stability
It seems the problem wasn't caused by the OOM. It was caused by the kube-controller, regardless of whether the OOM happened or not.
If I restart the kube-controller, Kubernetes begins processing deployments and pods normally.