Hazelcast split-brain - kubernetes

I'm using Hazelcast (3.7.4) with OpenShift.
Each application is starting a HazelcastInstance.
The network discovery is done via hazelcast-kubernetes (1.1.0).
Sometimes when I deploy the whole application, the cluster gets stuck in a split-brain state forever. It never recovers and never re-forms a single cluster.
I have to restart pods to enable the reconstruction of a single cluster.
Can someone help me prevent the split-brain, or at least make the cluster recover afterwards?
Thanks

Use a StatefulSet instead of a Deployment (or ReplicationController). With a StatefulSet, pods start one at a time, which prevents the split-brain issue. You can have a look at the official OpenShift Code Sample for Hazelcast or specifically at the OpenShift template for Hazelcast.
Also, try the latest Hazelcast version. I think it should re-form the cluster even if you use a Deployment and the cluster starts in a split-brain state.

Related

How to configure readiness/shutdown in Embedded Hazelcast in Spring app running in Kubernetes (OpenShift)

I'm working on a system that uses embedded Hazelcast in Spring apps running on Kubernetes (OpenShift), and I'm wondering how to configure it properly so that scaling up/down, pod restarts, etc. work. I realize I should be using the Hazelcast Kubernetes plugin, but I'm unsure about the following aspects:
1. Should I do something special in the readiness check to report Hazelcast readiness? Things that could be done: explicitly test whether Hazelcast is working (e.g. by doing something in a map), or directly check the status of the Hazelcast readiness probe. Or is there a third option (maybe not doing anything at all; note that the application only reports UP once the whole app has initialized, including the initial Hazelcast startup)?
2. Do I need to install something like a shutdown hook to shut down Hazelcast properly? Or does Hazelcast itself react to SIGTERM etc.? The documentation I've read seems to suggest the latter is not the case out of the box, so either the Hazelcast configuration needs to be changed to do this, or an explicit shutdown hook must be installed.
I'm a bit wary regarding #1, especially about whether it's wise to include an explicit check that Hazelcast is working (accessing a map etc.), because how will this interact with Hazelcast migration and discovery? Will the other peers only discover the node when readiness in this sense is UP, and might they even kick the node out if it reports DOWN? This could give rise to bad feedback effects, with nodes repeatedly entering and leaving the cluster. I'm thinking it may be safer to relay the information from the Hazelcast readiness probe. Or maybe there's a different way in this specific Kubernetes/Hazelcast setting?
I tried the out-of-the-box configuration, but I'm seeing instability when scaling pods up and down.

GKE Upgrade Causing Kafka to become Unavailable

I have a Kafka cluster hosted in GKE. Google updates the GKE nodes on a weekly basis, and whenever this happens Kafka becomes temporarily unavailable, which causes massive errors and rebalancing before it gets back to a healthy state. Currently we rely on K8s retries to eventually succeed once the upgrade completes and the cluster becomes available again. Is there a way to handle this type of situation gracefully in Kafka, or to avoid it if possible?
To give you better advice, we would need a little more information. What is your setup? Which versions of Kubernetes and Kafka? How many Kafka and ZooKeeper pods? How are you deploying your Kafka cluster (via a simple Helm chart or an operator)? What are the exact symptoms you see when you upgrade your Kubernetes cluster? What errors do you get? What is the state of the Kafka cluster? How do you monitor it?
But here are some points worth investigating.
Are you spreading the Kafka/ZK pods correctly across the nodes/zones?
Do you set PDBs to a reasonable maxUnavailable setting?
What are your readiness/liveness probes for your Kafka/ZK pods?
Are your topics correctly replicated?
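As a rough way to reason about the replication question above, here is a minimal Python sketch. It assumes producers use acks=all and that the topic has a min.insync.replicas setting; neither is mentioned in the question, so treat both as assumptions, and the function name is purely illustrative.

```python
def tolerated_broker_outages(replication_factor: int, min_insync_replicas: int) -> int:
    """Number of replica-hosting brokers that can be down while acks=all writes still succeed.

    Assumption: a write with acks=all needs at least `min_insync_replicas`
    in-sync replicas, so the slack is the difference between the two values.
    """
    if replication_factor < 1 or min_insync_replicas < 1:
        raise ValueError("both values must be at least 1")
    if min_insync_replicas > replication_factor:
        raise ValueError("min_insync_replicas cannot exceed replication_factor")
    return replication_factor - min_insync_replicas


if __name__ == "__main__":
    # Replication factor 3 with min.insync.replicas 2 leaves room for one broker restart.
    print(tolerated_broker_outages(3, 2))
    # Replication factor 1 leaves no room at all: any node upgrade interrupts writes.
    print(tolerated_broker_outages(1, 1))
```

If the result is 0, a rolling node upgrade will interrupt writes to that topic no matter how the PDB is set; a PDB with maxUnavailable of 1 only helps when the topics can tolerate at least one broker being down.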
I would strongly encourage you to take a look at https://strimzi.io/ which can be very helpful if you want to operate Kafka on Kubernetes. It is an open-source operator and very well documented.
You can control GKE node auto-upgrades through the "upgrade maintenance window" setting, which determines when upgrades may occur. Depending on how business-critical the workload is, you can configure this option alongside the K8s retry behavior.

Off-Loading of k8s deployments to different cluster in case of high loads

Since I am unable to find anything on google or the official docs, I have a question.
I have a local minikube cluster with deployment, service and ingress, which is working fine. Now when the load on my local cluster becomes too high I want to automatically switch to a remote cluster.
Is this possible?
How would I achieve this?
Thank you in advance
EDIT:
A remote cluster in my case would be a rancher Kubernetes cluster, but as long as the resources on my local one are sufficient I want to stay there.
So let's say my local cluster has enough resources to run two replicas of my application, but when a third one is needed to distribute the load, it should be deployed to the remote Rancher cluster. (I hope that is clearer now.)
I imagine it would be doable with kubefed (https://github.com/kubernetes-sigs/kubefed) when using the ReplicaSchedulingPreferences (https://github.com/kubernetes-sigs/kubefed/blob/master/docs/userguide.md#replicaschedulingpreference) and just weighting the local cluster very high and the remote one very low and then setting spec.rebalance to true to distribute it in case of high loads, but that approach seems a bit like a workaround.
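To make the desired behavior concrete, here is a minimal Python sketch of the "fill the local cluster first, overflow to the remote one" rule described in the edit. It is only an illustration of the intent, not how kubefed or any other scheduler computes placements; the function name and the dictionary keys are hypothetical.

```python
def place_replicas(desired: int, local_capacity: int) -> dict:
    """Split `desired` replicas between a local and a remote cluster.

    The local cluster is filled up to `local_capacity`; only the remainder
    is sent to the remote cluster.
    """
    if desired < 0 or local_capacity < 0:
        raise ValueError("desired and local_capacity must be non-negative")
    local = min(desired, local_capacity)
    return {"local": local, "remote": desired - local}


if __name__ == "__main__":
    # Local cluster fits two replicas; the third one overflows to the remote cluster.
    print(place_replicas(3, 2))
    # As long as the local cluster is sufficient, nothing runs remotely.
    print(place_replicas(2, 2))
```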
Your idea of using Kubefed sounds good, but there is another option: Multicluster-Scheduler.
Multicluster-scheduler is a system of Kubernetes controllers that
intelligently schedules workloads across clusters. It is simple to use
and simple to integrate with other tools.
To be able to make a better choice for your use case you can read through the Comparison with Kubefed (Federation v2).
All the necessary info can be found in the provided GitHub thread.
Please let me know if that helped.

Forward Traffic to POD in Kubernetes Cluster

I installed and configured a 3-node K8s cluster. The worker nodes are Windows nodes. We have one .NET application that we want to containerize. This application internally uses Apache Ignite for the distributed cache.
We built a Docker image for this application, wrote a deployment file, and deployed it in the K8s cluster. The deployment also creates a service of “LoadBalancer” type. Using this service, we connect to the application from the outside world. All is good so far.
Now to the issue. Because we are using Apache Ignite for the distributed cache, one of the pods will be the master. We want to always forward the traffic to the pod that is acting as the master node in the Apache Ignite cluster. The identification of the Apache Ignite master node must be dynamic.
I have gone through the link below, but there the pod configuration is static. We want to identify the master pod dynamically and forward the traffic to it. What do we have to do on the service side?
https://appscode.com/products/voyager/7.4.0/guides/ingress/http/statefulset-pod/
Any help on how to forward the traffic to the POD is greatly appreciated.
Given that you have a leader/follower topology, the request to direct traffic to a specific node (the master node) is flawed for a couple of reasons:
What happens when the current leader fails over and there is a new election to select a new leader?
Pods are ephemeral, so an individual pod should not play a major role in production; work with Deployments and their replicas instead. What you are trying to achieve is an anti-pattern.
In any case, if this is what you want, you may want to read about gateways in Istio, which can be found here

Ignite ReadinessProbe

While deploying an Ignite cluster within Kubernetes, I came across an issue that prevents cluster members from joining the group. If I use a readinessProbe and a livenessProbe, even with a delay as low as 10 seconds, the nodes never join each other. If I remove those probes, they find each other just fine.
So, my question is: can you use these probes to monitor node health, and if so, what are appropriate settings? On top of that, what would be good, fast health checks for Ignite, anyway?
Update:
After posting on the Ignite mailing list, it looks like StatefulSets are the way to go. (Thanks Dmitry!)
I think I'm going to leave in the below logic to self-heal any segmentation issues although hopefully it won't be triggered often.
Original answer:
We are having the same issue and I think we have a workable solution. The Kubernetes discovery SPI lists services as they become ready.
This means that if there are no ready pods at startup time, the Ignite instances all think that they are the first and each creates its own grid.
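The effect described above can be sketched with a deliberately simplified model in Python. It assumes a pod only appears in the discovery list a fixed delay after it starts, and that a starting pod joins the grid of any peer already listed; the function and parameter names are hypothetical, and real discovery is more involved.

```python
def form_grids(start_times, ready_delay):
    """Return a mapping of pod index -> grid id under readiness-gated discovery.

    A pod looks up its peers once, at its own start time. A peer is visible
    only if it has been running for at least `ready_delay`. If no peer is
    visible, the pod assumes it is first and creates its own grid.
    """
    grids = {}
    # Process pods in start order so earlier pods are already assigned.
    order = sorted(range(len(start_times)), key=lambda i: start_times[i])
    for pod in order:
        visible = [p for p in grids if start_times[p] + ready_delay <= start_times[pod]]
        if visible:
            grids[pod] = grids[visible[0]]  # join an existing grid
        else:
            grids[pod] = pod  # nobody is listed yet: start a new grid
    return grids


if __name__ == "__main__":
    # All pods start together (Deployment-style): every pod forms its own grid.
    print(len(set(form_grids([0, 0, 0], ready_delay=10).values())))
    # Pods start one after another (StatefulSet-style): a single grid forms.
    print(len(set(form_grids([0, 15, 30], ready_delay=10).values())))
```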
The cluster should be able to self-heal if we have a deterministic way to fail pods that aren't part of an 'authoritative' grid.
In order to do this, we keep a reference to the TcpDiscoveryKubernetesIpFinder and use it to periodically check the list of Ignite pods.
If the instance is part of a cluster that doesn't contain the alphabetically first IP in the list, we know we have a segmented topology. Killing the pods that get into that state should cause them to come up again, look at the service list, and join the correct topology.
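The check itself is small. Below is a minimal Python sketch of the rule, not the answerer's actual code (which would be Java against TcpDiscoveryKubernetesIpFinder). The function name and both arguments are hypothetical: `service_ips` stands for the addresses returned by the IP finder and `cluster_ips` for the addresses of the nodes in the local topology.

```python
def is_segmented(service_ips, cluster_ips):
    """Return True if this node's cluster lacks the alphabetically first service IP.

    Plain string ordering is used on purpose, matching the "alphabetical"
    rule in the answer. It only has to be deterministic, so that every pod
    agrees on which grid is the authoritative one.
    """
    if not service_ips:
        # Nothing is listed yet, so there is no authoritative grid to compare with.
        return False
    authoritative_ip = min(service_ips)
    return authoritative_ip not in set(cluster_ips)


if __name__ == "__main__":
    listed = ["10.0.0.2", "10.0.0.1", "10.0.0.3"]
    # This node's grid does not include 10.0.0.1, so it should be restarted.
    print(is_segmented(listed, ["10.0.0.2", "10.0.0.3"]))
    # This node's grid includes the first IP, so it is the authoritative grid.
    print(is_segmented(listed, ["10.0.0.1", "10.0.0.2"]))
```

A pod for which the check returns True would then fail its health check or exit, so that Kubernetes restarts it.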
I am facing the same issue, using Ignite embedded within a Java Spring application.
As you said, the readinessProbe on the Kubernetes Deployment's spec.template.spec.container has the side effect of preventing the Kubernetes pods from being listed as Endpoints on the related Kubernetes Service.
Without any readinessProbe, it does indeed seem to work better (the Ignite nodes all join the same Ignite cluster).
Yet this has the undesired side effect of exposing the Kubernetes pods when they are not yet ready, as Spring has not yet fully started ...