Sharded load balancing for stateful services in Kubernetes - kubernetes

I am currently switching from Service Fabric to Kubernetes and was wondering how to do custom and more complex load balancing.
So far I already read about Kubernetes offering "Services" which do load balancing for pods hidden behind them, but this is only available in more plain ways.
What I want to rewrite right now looks like the following in Service Fabric:
I have this interface:
public interface IEndpointSelector
{
int HashableIdentifier { get; }
}
A context keeping track of the account in my ASP.Net application e.g. inherits this. Then, I wrote some code which would as of now do service discovery through the service fabric cluster API and keep track of all services, updating them when any instances die or are being respawned.
Then, based on the deterministic nature of this identifier (due to the context being cached etc.) and given multiple replicas of the target service of a frontend -> backend call, I can reliably route traffic for a certain account to a certain endpoint instance.
Now, how would I go about doing this in Kubernetes?
As I already mentioned, I found "Services", but it seems like their load balancing does not support custom logic and is rather only useful when working with stateless instances.
Is there also a way to have service discovery in Kubernetes which I could use here to replace my existing code at some points?

StatefulSet
StatefulSet is a building block for stateful workload on Kubernetes with certain guarantees.
Stable and unique network identity
StatefulSet Pods have a unique identity that is comprised of an ordinal, a stable network identity, and stable storage.
As an example, if your StatefulSet has the name sharded-svc
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: sharded-svc
And you have e.g. 3 replicas, those will be named by <name>-<ordinal> where ordinal starts from 0 up to replicas-1.
The name of your pods will be:
sharded-svc-0
sharded-svc-1
sharded-svc-2
and those pods can be reached with a dns-name:
sharded-svc-0.sharded-svc.your-namespace.svc.cluster.local
sharded-svc-1.sharded-svc.your-namespace.svc.cluster.local
sharded-svc-2.sharded-svc.your-namespace.svc.cluster.local
given that your Headless Service is named sharded-svc and you deploy it in namespace your-namespace.
Sharding or Partitioning
given multiple replicas of the target service of a frontend -> backend call, I can reliably route traffic for a certain account to a certain endpoint instance.
What you describe here is that your stateful service is what is called sharded or partitioned. This does not come out of the box from Kubernetes, but you have all the needed building blocks for this kind of service. It may happen that it exists an 3rd party service providing this feature that you can deploy, or it can be developed.
Sharding Proxy
You can create a service sharding-proxy consisting of one of more pods (possibly from Deployment since it can be stateless). This app need to watch the pods/service/endpoints in your sharded-svc to know where it can route traffic. This can be developed using client-go or other alternatives.
This service implements the logic you want in your sharding, e.g. account-nr modulus 3 is routed to the corresponding pod ordinal
Update: There are 3rd party proxies with sharding functionallity, e.g. Weaver Proxy
Sharding request based on headers/path/body fields
Recommended reading: Weaver: Sharding with simplicity
Consuming sharded service
To consume your sharded service, the clients send request to your sharding-proxy that then apply your routing or sharding logic (e.g. request with account-nr modulus 3 is routed to the corresponding pod ordinal) and forward the request to the replica of sharded-svc that match your logic.
Alternative Solutions
Directory Service: It is probably easier to implement sharded-proxy as a directory service but it depends on your requirements. The clients can ask your directory service to what statefulSet replica should I send account-nr X and your serice reply with e.g. sharded-svc-2
Routing logic in client: The probably most easy solution is to have your routing logic in the client, and let this logic calculate to what statefulSet replica to send the request.

Services generally run the proxy in kernel space for performance reasons so writing custom code is difficult. Cillium does allow writing eBPF programs for some network features but I don't think service routing is one of them. So that pretty much means working with a userspace proxy instead. If your service is HTTP-based, you could look at some of the existing Ingress controllers to see if any are close enough or allow you to write your own custom session routing logic. Otherwise you would have to write a daemon yourself to handle it.

Related

Why headless services are not recommended in stateless applications?

I'm a kubernetes beginner and I found that headless service is generally recommended for stateful applications, so I'm wondering why it can't be used in stateless again? Is it bad that requests should be faster if they go directly to the Pod and not through the Service?
A regular Service provides you with a single, stable, IP address to access any of the replicas associated to the Service.
Given that the application is really stateless, it does not matter which replica you talk to, so the "hiding" behind a single IP.
This relieves you of the responsibility of load balancing: You can just use the very same IP every time and don't have to handle added replicas, removed replicas or repeated DNS lookups.
If you would use a headless service for a stateless application, which is of course possible, you would have to handle all that or risk that the replicas do not deliver scaling and/or high-availability as intended.
In fact a headless service shuffles the order of returned records when performing a DNS looup to help clients with the load balancing, if they can't handle it themself. But you have to perform regular DNS lookups for headless services to learn about added and/or removed instances, which is not the case for a regular service with a service IP.
First of all Headless services are vaguely used to access all the
pod replicas directly instead of using the Services.
In the case of deployment(Stateless services) the pods are
interchangeable because if the pod needs to reschedule it wont
maintain the same id as the previous pod.
Whereas the Statefulsets maintain a unique ID for each pod.
Statefulsets provide 2 unique identities for each pod. First Network
Identity which helps us to give the same DNS name to the pod
regardless of numerous restarts and second is the Storage Identity,
it will also remain the same regardless of the restarts. So,
statefulset won’t use the shared volume.
An IP address is not assigned to a headless service. Internally, it
builds the required endpoints for DNS-named pod exposure.
Conclusion: Pods having a distinct identity are required for stateful applications (hostname). To communicate one pod with other pods a StatefulSet requires a Headless Service. A service with a service IP is a headless service. As a result, it immediately returns the IPs of our related pods. This enables us to communicate with the pods directly instead of using a proxy. Whereas in stateless applications a Service is required to interact with other pods.
For more detailed information refer to this article

Dynamic deployment of stateful applications in GKE

I'm trying to figure out which tools from GKE stack I should apply to my use case which is a dynamic deployment of stateful application with dynamic HTTP endpoints.
Stateful in my case means that I don't want any replicas and load-balancing (because the app doesn't scale horizontally at all). I understand though that in k8s/gke nomenclature I'm still going to be using a 'load-balancer' even though it'll act as a reverse proxy and not actually balance any load.
The use case is as follows. I have some web app where I can request for a 'new instance' and in return I get a dynamically generated url (e.g. http://random-uuid-1.acme.io). This domain should point to a newly spawned, single instance of a container (Pod) hosting some web application. Again, if I request another 'new instance', I'll get a http://random-uuid-2.acme.io which will point to another (separate), newly spawned instance of the same app.
So far I figured out following setup. Every time I request a 'new instance' I do the following:
create a new Pod with dynamic name app-${uuid} that exposes HTTP port
create a new Service with NodePort that "exposes" the Pod's HTTP port to the Cluster
create or update (if exists) Ingress by adding a new http rule where I specify that domain X should point at NodePort X
The Ingress mentioned above uses a LoadBalancer as its controller, which is automated process in GKE.
A few issues that I've already encountered which you might be able to help me out with:
While Pod and NodePort are separate resources per each app, Ingress is shared. I am thus not able to just create/delete a resource but I'm also forced to keep track of what has been added to the Ingress to be then able to append/delete from the yaml which is definitely not the way to do that (i.e. editing yamls). Instead I'd probably want to have something like an Ingress to monitor a specific namespace and create rules automatically based on Pod labels. Say I have 3 pods with labels, app-1, app-2 and app-3 and I want Ingress to automatically monitor all Pods in my namespace and create rules based on the labels of these pods (i.e. app-1.acme.io -> reverse proxy to Pod app-1).
Updating Ingress with a new HTTP rule takes around a minute to allow traffic into the Pod, until then I keep getting 404 even though both Ingress and LoadBalancer look as 'ready'. I can't figure out what I should watch/wait for to get a clear message that the Ingress Controller is ready for accepting traffic for newly spawned app.
What would be the good practice of managing such cluster where you can't strictly define Pods/Services manifests because you are creating them dynamically (with different names, endpoints or rules). You surely don't want to create bunch of yaml-s for every application you spawn to maintain. I would imagine something similar to consul templates in case of Consul but for k8s?
I participated in a similar project and our decision was to use Kubernetes Client Library to spawn instances. The instances were managed by a simple web application, which took some customisation parameters, saved them into its database, then created an instance. Because of the database, there was no problem with keeping track of what have been created so far. By querying the database we were able to tell if such deployment was already created or update/delete any associated resources.
Each instance consisted of:
a deployment (single or multi-replica, depending on the instance);
a ClusterIp service (no reason to reserve machine port with NodePort);
an ingress object for shared ingress controller;
and some shared configMaps.
And we also used external DNS and cert manager, one to manage DNS records and another to issue SSL certificates for the ingress. With this setup it took about 10 minutes to deploy a new instance. The pod and ingress controller were ready in seconds but we had to wait for the certificate and it's readiness depended on whether issuer's DNS got our new record. This problem might be avoided by using a wildcard domain but we had to use many different domains so it wasn't an option in our case.
Other than that you might consider writing a Helm chart and make use of helm list command to find existing instances and manage them. Though, this is a rather 'manual' solution. If you want this functionality to be a part of your application - better use a client library for Kubernetes.

Kubernetes Pod to Pod communication

I have 2 Deployment - A(1 replica) and B(4 replica)
I have scheduled job in the POD A and on successful completion it hits endpoint present in one of the Pods from Deployment B through the service for Deployment B.
Is there a way I can hit all the endpoints of 4 PODS from Deployment B on successful completion of job?
Ideally one of the pod is notified!! But is this possible as I don't want to use pub-sub for this.
Is there a way I can hit all the endpoints of 4 PODS from Deployment B on successful completion of job?
But is this possible as I don't want to use pub-sub for this.
As you say, a pub-sub solution is best for this problem. But you don't want to use it.
Use stable network identity for Service B
To solve this without pub-sub, you need a stable network-identity for the pods in Deployment B. To get this, you need to change to a StatefulSet for your service B.
StatefulSets are valuable for applications that require one or more of the following.
Stable, unique network identifiers.
When B is deployed with a StatefulSet, your job or other applications can reach your pods of B, with a stable network identity that is the same for every version of service B that you deploy. Remember that you also need to deploy a Headless Service for your pods.
Scatter pattern: You can have an application aware (e.g. aware of number of pods of Service B) proxy, possibly as a sidecar. Your job sends the request to this proxy. The proxy then sends a request to all your replicas. As described in Designing Distributed Systems: Patterns and Paradigms
Pub-Sub or Request-Reply
If using pub-sub, the job only publish an event. Each pod in B is responsible to subscribe.
In a request-reply solution, the job or a proxy is responsible for watching what pods exists (unless it is a fixed number of pods) in service B, in addition it need to send request to all, if requests fails to any pod (it will happen on deployments sometime) it is responsibly to retry the request to those pods.
So, yes, it is a much more complicated problem in a request-reply way.
Kubernetes service is an abstraction to provide service discovery and load balancing. So if you are using service your request will be sent to one of the backend pods.
To achieve what you want I suggest you create 4 different services each having only one backend pod or use a message queue such as rabbitmq between service A and service B.
You can use headless service. That way Kubernetes won't allocate separate IP address and instead will setup DNS record with IP addresses of all the pods. Then in your application just resolve the record and send notification to all endpoints. But really, this is ideal use case for pub-sub or service discovery system. DNS is too unreliable for this.
PUB-SUB is the best option here. I have similar use case and I am using pub-sub , which is in production for last 6 months.

Synchronize HTTP requests between several service instances in Kubernetes

We have a service with multiple replicas which works with storage without transactions and blocking approaches. So we need somehow to synchronize concurrent requests between multiple instances by some "sharding" key. Right now we host this service in Kubernetes environment as a ReplicaSet.
Don't you know any simple out-of-the-box approaches on how to do this to not implement it from scratch?
Here are several of our ideas on how to do this:
Deploy the service as a StatefulSet and implement some proxy API which will route traffic to the specific pod in this StatefulSet by sharding key from the HTTP request. In this scenario, all requests which should be synchronized will be handled by one instance and it wouldn't be a problem to handle this case.
Deploy the service as a StatefulSet and implement some custom logic in the same service to re-route traffic to the specific instance (or process on this exact instance). As I understand it's not possible to have abstract implementation and it would work only in Kubernetes environment.
Somehow expose each pod IP outside the cluster and implement routing logic on the client-side.
Just implement synchronization between instances through some third-party service like Redis.
I would like to try to route traffic to the specific pods. If you know standard approaches how to handle this case I'll be much appreciated.
Thank you a lot in advance!
Another approach would be to put a messaging queue (like Kafka and RabbitMq) in front of your service.
Then your pods will subscribe to the MQ topic/stream. The pod will decide if it should process the message or not.
Also, try looking into service meshes like Istio or Linkerd.
They might have an OOTB solution for your use-case, although I wasn't able to find one.
Remember that Network Policy is not traffic routing !
Pods are intended to be stateless and indistinguishable from one another, pod-networking.
I recommend to Istio. It has special component which is responsible or routing- Envoy. It is a high-performance proxy developed in C++ to mediate all inbound and outbound traffic for all services in the service mesh.
Useful article: istio-envoy-proxy.
Istio documentation: istio-documentation.
Useful Istio explaination https://www.youtube.com/watch?v=e2kowI0fAz0.
But you should be able to create a Deployment per customer group, and a Service per Deployment. The Ingress nginx should be able to be told to map incoming requests by whatever attributes are relevant to specific customer group Services.
Other solution is to use kube-router.
Kube-router can be run as an agent or a Pod (via DaemonSet) on each node and leverages standard Linux technologies iptables, ipvs/lvs, ipset, iproute2.
Kube-router uses IPVS/LVS technology built in Linux to provide L4 load balancing. Each ClusterIP, NodePort, and LoadBalancer Kubernetes Service type is configured as an IPVS virtual service. Each Service Endpoint is configured as real server to the virtual service. The standard ipvsadm tool can be used to verify the configuration and monitor the active connections.
How it works: service-proxy.

Why don't Kubernetes deployments support services?

I'm new to K8s, so still trying to get my head around things. I've been looking at deployments and can appreciate how useful they will be. However, I don't understand why they don't support services (only replica sets and pods).
Why is this? Does this mean that services would typically be deployed outside of a deployment?
To answer your question, Kubernetes deployments are used for managing stateless services running in the cluster instead of StatefulSets which are built for the stateful application run-time. Actually, with deployments you can describe the update strategy and road map for all underlying objects that have to be created during implementation.Therefore, we can distinguish separate specification fields for some objects determination, like needful replica number of Pods, template for Pod by describing a list of containers that should be in the Pod, etc.
However, as #P Ekambaram already mention in his answer, Services represent abstraction layer of network communication model inside Kubernetes cluster, and they declare a way to access Pods within a cluster via corresponded Endpoints. Services are separated from deployment object manifest specification, because of their mission to dynamically provide specific network behavior for the nested Pods without affecting or restarting them in case of any communication modification via appropriate Service Types.
Yes, services should be deployed as separate objects. Note that deployment is used to upgrade or rollback the image and works above ReplicaSet
Kubernetes Pods are mortal. They are born and when they die, they are not resurrected. ReplicaSets in particular create and destroy Pods dynamically (e.g. when scaling out or in). While each Pod gets its own IP address, even those IP addresses cannot be relied upon to be stable over time. This leads to a problem: if some set of Pods (let’s call them backends) provides functionality to other Pods (let’s call them frontends) inside the Kubernetes cluster, how do those frontends find out and keep track of which backends are in that set?
Services.come to the rescue.
A Kubernetes Service is an abstraction which defines a logical set of Pods and a policy by which to access them. The set of Pods targeted by a Service is (usually) determined by a Label Selector
Something I've just learnt that is somewhat related to my question: multiple K8s objects can be included in the same yaml file, separate by ---. Something like:
apiVersion: v1
kind: Deployment
# other stuff here
---
apiVersion: v1
kind: Service
# other stuff here
i think it intends to decoupled and fine-grained.