Deploying MongoDB on Kubernetes - mongodb

Suppose we're using Kubernetes Deployments to host MongoDB.
I know that Deployments suppose their pods are identical, and the corresponding service divides requests between the pods regardless of which pod is which.
Does this yield data inconsistency?
Or will mongo run in its replication mode, where one pod becomes Primary (and accepts write/read requests) and the others become Secondary (and answer read requests)?
Suppose we're using Kubernetes StatefulSets to host MongoDB.
Naturally, there's a service dividing requests between the stateful pods.
Now in my application server, should I manually send write requests to the Primary node? Or will it be handled by MongoDB itself? If so, how does MongoDB keep the data consistent? Where do the stateful pods get notified about each other, and by whom?
I suppose one scenario would be that my request gets sent to any pod randomly. If that pod can handle it (i.e. a read request), it will do so. Otherwise (a write request), the pod will forward the request to the Primary pod.
But that seems slow and inefficient, doesn't it? Sending requests randomly, hoping they reach the correct place on the first try.
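For context, MongoDB drivers are generally replica-set aware: the client is given the addresses of the members and works out which one is the current Primary, rather than relying on a Service to pick a pod at random. A minimal sketch of building such a connection string, assuming a StatefulSet named `mongo` behind a headless service `mongo-svc` in namespace `default`, a replica set called `rs0`, and the default `cluster.local` cluster domain (all of these names are hypothetical):

```python
def replica_set_uri(statefulset, service, namespace, replicas, replica_set, port=27017):
    """Build a MongoDB connection string listing every StatefulSet pod.

    Each pod of a StatefulSet gets a stable DNS name of the form
    <statefulset>-<ordinal>.<service>.<namespace>.svc.cluster.local
    (assuming the default cluster domain).
    """
    hosts = [
        f"{statefulset}-{i}.{service}.{namespace}.svc.cluster.local:{port}"
        for i in range(replicas)
    ]
    # The replicaSet option tells the driver to treat the hosts as one replica set.
    return f"mongodb://{','.join(hosts)}/?replicaSet={replica_set}"


print(replica_set_uri("mongo", "mongo-svc", "default", 3, "rs0"))
```

The application would pass this string to its MongoDB driver instead of the address of a load-balancing Service.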

Related

In kubernetes, is there a way to make statefulset pods linger to finish requests on rolling update?

In Kubernetes, I have a statefulset with a number of replicas.
I've set the updateStrategy to RollingUpdate.
I've set podManagementPolicy to Parallel.
My statefulset instances do not have a persistent volume claim -- I use the statefulset as a way to allocate ordinals 0..(N-1) to pods in a deterministic manner.
The main reason for this is to keep availability for new requests while rolling out software updates (freshly built containers), while still allowing each container, and other services in the cluster, to "know" its ordinal.
The behavior I want, when doing a rolling update, is for the previous statefulset pods to linger while there are still long-running requests processing on them, but I want new traffic to go to the new pods in the statefulset (mapped by the ordinal) without a temporary outage.
Unfortunately, I don't see a way of doing this -- what am I missing?
Because I don't use volume claims, you might think I could use deployments instead, but I really do need each of the pods to have a deterministic ordinal, that:
is unique at the point of dispatching new service requests (incoming HTTP requests, including public ingresses)
is discoverable by the pod itself
is persistent for the duration of the pod lifetime
is contiguous from 0 .. (N-1)
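To illustrate the "discoverable by the pod itself" requirement: a StatefulSet pod is named `<statefulset-name>-<ordinal>`, and that name is also the pod's hostname, so the ordinal can be read from it. A minimal sketch (the function names are mine, not from any library):

```python
import re
import socket


def ordinal_from_hostname(hostname):
    """Return the trailing ordinal of a StatefulSet-style pod name such as 'web-3'."""
    match = re.fullmatch(r"(?P<name>.+)-(?P<ordinal>\d+)", hostname)
    if match is None:
        raise ValueError(f"not a StatefulSet-style pod name: {hostname!r}")
    return int(match.group("ordinal"))


def my_ordinal():
    """Ordinal of the pod this process runs in (assumes the default pod hostname)."""
    return ordinal_from_hostname(socket.gethostname())
```

For example, `ordinal_from_hostname("web-3")` gives `3`.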
The second-best option I can think of is using something like zookeeper or etcd to separately manage this property, using some of the traditional long-poll or leader-election mechanisms, but given that kubernetes already knows (or can know) about all the necessary bits, AND kubernetes service mapping knows how to steer incoming requests from old instances to new instances, that seems more redundant and complicated than necessary, so I'd like to avoid that.
I assume that you need this for a stateful workload, e.g. a workload that requires writes. Otherwise you can use Deployments with multiple pods online for your shards. A key feature of StatefulSets is that they provide unique, stable network identities for the instances.
The behavior I want, when doing a rolling update, is for the previous statefulset pods to linger while there are still long-running requests processing on them, but I want new traffic to go to the new pods in the statefulset.
This behavior is supported by Kubernetes pods. But you also need to implement support for it in your application.
New traffic will not be sent to your "old" pods.
A SIGTERM signal will be sent to the pod - your application may want to listen to this and do some action.
After a configurable "termination grace period", your pod will get killed.
See Kubernetes best practices: terminating with grace for more info about pod termination.
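The application side of this can be sketched roughly as follows: on SIGTERM, stop accepting new work, then wait for in-flight requests to finish before exiting. This is only an outline, assuming the drain timeout is chosen to be shorter than the pod's termination grace period; the class and method names are hypothetical.

```python
import signal
import threading
import time


class DrainingServer:
    """Tracks in-flight requests so the process can drain them after SIGTERM."""

    def __init__(self):
        self.shutting_down = False
        self._in_flight = 0
        self._lock = threading.Lock()

    def handle_sigterm(self, signum, frame):
        # Only flip a flag here; the real work happens outside the signal handler.
        self.shutting_down = True

    def try_begin_request(self):
        """Return False once shutdown has started, so new work is refused."""
        with self._lock:
            if self.shutting_down:
                return False
            self._in_flight += 1
            return True

    def end_request(self):
        with self._lock:
            self._in_flight -= 1

    def wait_for_drain(self, timeout_seconds, poll_seconds=0.05):
        """Block until no request is in flight; return False if the timeout expires."""
        deadline = time.monotonic() + timeout_seconds
        while True:
            with self._lock:
                if self._in_flight == 0:
                    return True
            if time.monotonic() >= deadline:
                return False
            time.sleep(poll_seconds)


def install(server):
    """Register the handler; must be called from the main thread."""
    signal.signal(signal.SIGTERM, server.handle_sigterm)
```

A request handler would wrap its work in `try_begin_request()` / `end_request()`, and the main loop would call `wait_for_drain(...)` once `shutting_down` is set.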
Be aware that you should connect to services instead of directly to pods for this to work. E.g. you need to create headless services for the replicas in a StatefulSet.
If your clients are connecting to a specific headless service, e.g. N, this means that it will not be available for some time during upgrades. You need to decide whether your clients should retry their connections during this period, or connect to another headless service if N is not available.
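Both client strategies (retry the same instance, then fall back to another one) can be combined. A minimal sketch, where `connect` stands for whatever function your client library uses to open a connection (a placeholder, not a real API) and the host names are made up:

```python
import time


def connect_with_fallback(hosts, connect, attempts_per_host=3, delay_seconds=0.5):
    """Try each host in order, retrying a few times before moving on to the next."""
    last_error = None
    for host in hosts:
        for _ in range(attempts_per_host):
            try:
                return connect(host)
            except ConnectionError as error:
                # The instance may be restarting as part of the rolling update.
                last_error = error
                time.sleep(delay_seconds)
    raise ConnectionError(f"no host reachable, tried {list(hosts)}") from last_error
```

Whether falling back to another instance is acceptable depends on the workload; for a pure retry strategy, pass a single host.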
If you are in a case where you need:
stateful workload (e.g. support for write operations)
want high availability for your instances
then you need a form of distributed system that does some form of replication/synchronization, e.g. using Raft or a product that implements it. Such a system is most easily deployed as a StatefulSet.
You may be able to do this using Container Lifecycle Hooks, specifically the preStop hook.
We use this to drain connections from our Varnish service before it terminates.
However, you would need to implement (or find) a script to do the draining.

How to connect a web server to a Kubernetes statefulset and headless service

I have been learning Kubernetes for a few weeks and now I am trying to figure out the right way to connect a web server to a statefulset correctly.
Let's say I deployed a master-slave Postgres statefulset and now I will connect my web server to it. By using a cluster IP service, the requests will be load balanced across the master and the slaves for both reading (SELECT) and writing (UPDATE, INSERT, DELETE) records, right? But I can't do that because writing requests should be handled by the master. However, when I point my web server to the master using the headless service that will give us a DNS entry for each pod, I won't get any load balancing to the other slave replications and all of the requests will be handled by one instance and that is the master. So how am I supposed to connect them the right way? By obtaining both load balancing to all replications along with the slave in reading records and forwarding writing records requests to the master?
Should I use two endpoints in the web server and configure them in writing and reading records?
Or maybe I am using headless services and statefulsets the wrong way since I am new to Kubernetes?
Well, your thinking is correct - the master should be read-write and replicas should be read only. How to configure it properly? There are different possible approaches.
The first approach is what you are thinking about: set up two headless services, one for accessing the primary instance and a second one for accessing the replica instances. A good example is Kubegres:
In this example, Kubegres created 2 Kubernetes Headless services (of default type ClusterIP) using the name defined in YAML (e.g. "mypostgres"):
a Kubernetes service "mypostgres" allowing to access to the Primary PostgreSql instances
a Kubernetes service "mypostgres-replica" allowing to access to the Replica PostgreSql instances
Then you will have two endpoints:
Consequently, a client app running inside a Kubernetes cluster, would use the hostname "mypostgres" to connect to the Primary PostgreSql for read and write requests, and optionally it can also use the hostname "mypostgres-replica" to connect to any of the available Replica PostgreSql for read requests.
Check this starting guide for more details.
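On the web server side, the two-endpoint approach comes down to picking a hostname per statement. A deliberately simplified sketch using the hostnames from the Kubegres example; the keyword check is my own simplification, and anything it does not recognise as a plain read goes to the primary to stay on the safe side:

```python
PRIMARY_HOST = "mypostgres"          # read and write requests
REPLICA_HOST = "mypostgres-replica"  # read requests only

# Statements treated as read-only in this sketch (an assumption, not a complete list).
READ_ONLY_KEYWORDS = ("SELECT", "SHOW")


def host_for_statement(sql):
    """Return the hostname that should receive the given SQL statement."""
    parts = sql.split(None, 1)
    first_word = parts[0].upper() if parts else ""
    return REPLICA_HOST if first_word in READ_ONLY_KEYWORDS else PRIMARY_HOST


print(host_for_statement("SELECT * FROM users"))
print(host_for_statement("UPDATE users SET name = 'x'"))
```

In a real application this decision is usually made per use case (e.g. a separate read-only connection pool) rather than by inspecting SQL text, since a `SELECT` can still take locks or call functions that write.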
It's worth noting that there are many database solutions which are using this approach - another example is MySQL. Here is a good article in Kubernetes documentation about setting MySQL using Stateful set.
Another approach is to use some middleware component which will act as a gatekeeper to the cluster, for example Pg-Pool:
Pg pool is a middleware component that sits in front of the Postgres servers and acts as a gatekeeper to the cluster.
It mainly serves two purposes: Load balancing & Limiting the requests.
Load Balancing: Pg pool takes connection requests and queries. It analyzes the query to decide where the query should be sent.
Read-only queries can be handled by read-replicas. Write operations can only be handled by the primary server. In this way, it load balances the cluster.
Limits the requests: Like any other system, Postgres has a limit on no. of concurrent connections it can handle gracefully.
Pg-pool limits the no. of connections it takes up and queues up the remaining. Thus, gracefully handling the overload.
Then you will have one endpoint for all operations - the Pg-Pool service. Check this article for more details, including the whole setup process.

Load balancing onto replicas of pods

We have an AKS cluster and we want to achieve below two points in our architecture:
We have replicas of pods, and we want each pod to serve only one request at a time: basically a one pod, one request design.
When all pods are busy, the next incoming request should not be queued at the pod level. Instead it should be queued at the service level, and dispatched only once one of the busy pods becomes idle.
How can we achieve this?
Generally, this could be achieved by creating a custom proxy that creates pods on demand, but in practice it will be very difficult and performance will be poor. This was very well explained by David Maze in his comment:
You need to write a custom proxy with access to the Kubernetes API that can create new pods on demand; this is not a standard Kubernetes setup. This is also an extremely heavy-weight setup (if it takes tens of seconds to pull and deploy a new pod you can hit HTTP request timeouts very easily) and every Web framework supports handling multiple requests per process.
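To make the queueing part of such a proxy concrete, here is a small in-process sketch. Note that it does not create pods on demand as described in the comment above; it only shows requests waiting at the dispatcher until one of a fixed set of pods is idle. The class name and `send_to_pod` callback are hypothetical.

```python
import queue


class OneRequestPerPodDispatcher:
    """Hands each request to an idle pod; callers wait here while all pods are busy."""

    def __init__(self, pod_names):
        self._idle = queue.Queue()
        for name in pod_names:
            self._idle.put(name)

    def handle(self, request, send_to_pod, timeout=None):
        # Blocks while every pod is busy, so the request is queued at the
        # dispatcher and never at the pod. Raises queue.Empty on timeout.
        pod = self._idle.get(timeout=timeout)
        try:
            return send_to_pod(pod, request)
        finally:
            # The pod becomes idle again and can take the next waiting request.
            self._idle.put(pod)
```

Each incoming request would call `handle(...)` from its own thread; a real proxy would additionally need to track pods that appear and disappear.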

How to signal "bad" but not "fatal" health check from spring boot to Kubernetes?

What we're looking for is a way for an actuator health check to signal some intention like "I am limping but not dead. If there are X number of other pods claiming to be healthy, then you should restart me, otherwise, let me limp."
We have a rest service hosted in clustered Kubernetes containers that periodically call out to fetch fresh data from an external resource. Occasionally we have failures reaching those external resources, and sometimes, but not every time, a restart of the pod will resolve the issue.
The services can operate just fine on possibly stale data. Although we wouldn't want to continue operating on stale data, that's preferable to just going down entirely.
In the interim, we're planning on having a node unilaterally decide not to report any problems through actuator until X amount of time has passed since the last successful sync, but that really only delays the point at which all nodes would still report failure.
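The interim plan amounts to something like the following sketch (the class is hypothetical, not part of Spring Boot actuator; it treats startup as a successful sync, which is an assumption):

```python
import time


class SyncHealth:
    """Reports healthy until too much time has passed since the last successful sync."""

    def __init__(self, grace_seconds, clock=time.monotonic):
        self._grace_seconds = grace_seconds
        self._clock = clock
        self._last_success = clock()  # assumption: startup counts as a success

    def record_success(self):
        self._last_success = self._clock()

    def is_healthy(self):
        return self._clock() - self._last_success <= self._grace_seconds
```

The health endpoint would return `is_healthy()`, and the sync job would call `record_success()` after every successful fetch.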
In Kubernetes you can use a LivenessProbe and a ReadinessProbe to let a controller heal your service, but some situations are better handled with HTTP response codes or an alternative degraded service.
LivenessProbe
Use a LivenessProbe to resolve a deadlock situation. When your pod does not respond on a LivenessProbe, it will be killed and a new pod will replace it.
ReadinessProbe
Use a ReadinessProbe when your pod is not prepared for serving requests, e.g. if your pod needs to read some files or connect to an external service before serving requests.
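As a rough illustration of how the two probes differ, here is a sketch of separate endpoints backed by separate flags. The paths `/livez` and `/readyz` and the state fields are assumptions for this example; Kubernetes treats any HTTP status from 200 to 399 as success for an HTTP probe.

```python
from http.server import BaseHTTPRequestHandler


class AppState:
    def __init__(self):
        self.stuck = False   # set when the app detects it cannot make progress
        self.ready = False   # set once files are loaded / dependencies connected


def probe_status(path, state):
    """Map a probe path to the HTTP status code the pod should answer with."""
    if path == "/livez":
        # Failing liveness asks Kubernetes to restart the container.
        return 500 if state.stuck else 200
    if path == "/readyz":
        # Failing readiness only removes the pod from Service endpoints.
        return 200 if state.ready else 503
    return 404


def make_handler(state):
    """Build a request handler class bound to the given state object."""
    class ProbeHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(probe_status(self.path, state))
            self.end_headers()
    return ProbeHandler
```

The handler class returned by `make_handler(state)` can be served with `http.server.HTTPServer` on whatever port the probes are configured to hit.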
Fault affecting all replicas
If you have a problem that all your replicas depend on, e.g. an external service is down, then you cannot solve it by restarting your pods. In this situation you can use an OpsToggle or a circuit breaker, and notify other services that you are degraded or show a message about a temporary error.
For your situation
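A circuit breaker for this case could look roughly like the sketch below: after a few consecutive failures it stops calling the external resource for a while and serves the fallback (e.g. stale data) instead. This is a minimal single-threaded illustration with made-up names and thresholds, not a production implementation.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_seconds=30.0, clock=time.monotonic):
        self._failure_threshold = failure_threshold
        self._reset_seconds = reset_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, operation, fallback):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_seconds:
                return fallback()   # open: do not touch the failing dependency
            self._opened_at = None  # half-open: allow one trial call
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self._failures >= self._failure_threshold:
                self._opened_at = self._clock()
            return fallback()
        self._failures = 0          # success closes the circuit again
        return result
```

Usage would be along the lines of `breaker.call(fetch_fresh_data, use_stale_data)`, where both functions are your own.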
If there are X number of other pods claiming to be healthy, then you should restart me, otherwise, let me limp.
You cannot delegate that logic to Kubernetes. Your application needs to understand each fault situation, e.g. whether an error was a transient network error or whether it will affect all replicas.

Persistent Kafka transaction-id across restarts on Kubernetes

I am trying to achieve the exactly-once delivery on Kafka using Spring-Kafka on Kubernetes.
As far as I understood, the transactional-ID must be set on the producer and it should be the same across restarts, as stated here https://stackoverflow.com/a/52304789/3467733.
The problem arises using this semantic on Kubernetes. How can you get a consistent ID?
To solve this problem I implemented a Spring Boot application, let's call it "Replicas counter", that checks, through the Kubernetes API, how many pods there are with the same name as the caller, so I have a counter for every pod replica.
For example, suppose I want to deploy a Pod, let's call it APP-1.
This app does the following:
It performs a GET to the Replicas-Counter, passing the pod name as a parameter.
The Replicas-Counter calls the Kubernetes API to check how many pods there are with that pod name, adds 1 and returns the result to the caller. I also need to count not-ready pods (think about a first deploy: the pods couldn't get an ID if I wasn't counting not-ready pods).
The APP-1 gets the id and will use it as the transactional-id
But, as you can see a problem could arise when performing rolling updates, for example:
Suppose we have 3 pods:
At the beginning we have:
app-1: transactional-id-1
app-2: transactional-id-2
app-3: transactional-id-3
So, during a rolling update we would have:
old-app-1: transactional-id-1
old-app-2: transactional-id-2
old-app-3: transactional-id-3
new-app-3: transactional-id-4 (Not ready, waiting to be ready)
New-app-3 goes ready, so Kubernetes brings down the Old-app-3. So time to continue the rolling update.
old-app-1: transactional-id-1
old-app-2: transactional-id-2
new-app-3: transactional-id-4
new-app-2: transactional-id-4 (Not ready, waiting to be ready)
As you can see now I have 2 pods with the same transactional-id.
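The walk-through above can be reproduced with a few lines of simulation. This is only a model of the counting logic as described (pod count plus one, replacing pods from the highest number down); the function names are made up.

```python
def next_transactional_id(existing_pod_count):
    """What the Replicas-Counter returns: number of existing pods plus one."""
    return existing_pod_count + 1


def simulate_rolling_update(replicas):
    """Return pod name -> transactional id after all pods have been replaced."""
    pods = {f"old-app-{i}": i for i in range(1, replicas + 1)}
    for i in range(replicas, 0, -1):
        # The new pod asks for an id while all current pods still exist.
        pods[f"new-app-{i}"] = next_transactional_id(len(pods))
        # Once the new pod is ready, the old one is brought down.
        del pods[f"old-app-{i}"]
    return pods


print(simulate_rolling_update(3))
```

With three replicas, every new pod sees three existing pods when it starts and therefore receives id 4.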
As far as I understood, these IDs have to be the same across restarts and unique.
How can I implement something that gives me consistent IDs? Has anyone dealt with this problem?
The problem with these IDs exists only for Kubernetes Deployments, not for StatefulSets, as those have a stable identifier as their name. I don't want to convert all Deployments to StatefulSets to solve this problem, as I think it is not the correct way to handle this scenario.
The only way to guarantee the uniqueness of Pods is to use StatefulSet.
A StatefulSet keeps the requested number of replicas alive, and every time a pod dies it is replaced by a pod with the same hostname and configuration. That prevents the loss of data that must be kept.
The Service for a StatefulSet must be headless: since each pod is unique, you will need certain traffic to reach certain pods.
Every pod requires a PVC (to store its data, so that the pod can be recreated from that data whenever it is deleted).
Here is a great article describing why StatefulSet should be used in similar case.
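With a StatefulSet, the pod name (e.g. `app-0`) is stable across restarts and is also the pod's hostname, so a transactional ID can be derived from it directly instead of being counted. A minimal sketch in Python for brevity (the prefix and function name are hypothetical; the same idea applies in a Spring Boot application):

```python
import socket


def transactional_id(prefix, pod_name=None):
    """Derive a stable transactional id from the pod name.

    Assumes the process runs in a StatefulSet pod, whose hostname defaults
    to the pod name and stays the same when the pod is recreated.
    """
    if pod_name is None:
        pod_name = socket.gethostname()
    return f"{prefix}-{pod_name}"


print(transactional_id("orders-tx", "app-0"))
```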