Does sessionAffinity over ClientIP work with the UDP protocol on a Kubernetes setup? - kubernetes

Let's say we have two independent Kubernetes clusters, Cluster 1 and Cluster 2. Each of them has two replicas of the same application Pod:
Cluster 1: Pod A & Pod B
Cluster 2: Pod C & Pod D
Application code in Pod A (the client) wants to connect to any Pod running in Cluster 2 via a NodePort/LoadBalancer service over UDP to send messages. The only requirement is to maintain affinity so that all messages from Pod A go to one pod only (either Pod C or Pod D). Since UDP is a connectionless protocol, my concern is about session affinity based on ClientIP. Would setting sessionAffinity to ClientIP solve my issue?

Since UDP is a connectionless protocol, my concern is about session affinity based on ClientIP. Would setting sessionAffinity to ClientIP solve my issue?
sessionAffinity pins each session by source IP regardless of protocol, but only within a single cluster. That does not mean your actual session is preserved as you expect along the whole access path.
In other words, sessionAffinity alone does not guarantee that the session is kept end to end.
For example, Pod A's outbound IP is translated to the IP of the node it runs on (SNAT) unless you use an egress IP solution for Pod A.
It also depends on how your NodePort and LoadBalancer Services in Cluster 2 are configured with respect to source IP. Refer to "Using Source IP" in the Kubernetes documentation for more details.
So you should consider how to keep the session reliably when the clusters talk to each other. Personally, I think you would be better off with application-layer (Layer 7) sticky sessions rather than the Service's sessionAffinity.
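For reference, a Service in Cluster 2 with ClientIP affinity on a UDP port might look like the sketch below. The name udp-receiver, the app label, and the port numbers are placeholders that do not come from the question. Keep in mind that the affinity key is the source IP as seen by the node, which may be an SNATed node IP rather than Pod A's own address.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: udp-receiver            # hypothetical name
spec:
  type: NodePort
  selector:
    app: udp-receiver           # hypothetical label carried by Pod C and Pod D
  sessionAffinity: ClientIP     # pin each source IP to one backend pod
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800     # affinity window; 10800 (3 hours) is the default
  ports:
    - name: messages
      protocol: UDP
      port: 5000                # assumed service port
      targetPort: 5000          # assumed container port
      nodePort: 30500           # assumed; must fall inside the cluster's NodePort range
```

If Pod A stops sending for longer than timeoutSeconds, the next packet may be assigned to a different pod.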

Related

Preserve SourceIP address in Kubernetes and distribute the load

In a multiple node cluster we want to expose a service handling UDP traffic. There are two requirements:
We want the service to be backed by multiple pods (possibly running on different nodes) in order to scale horizontally.
The service needs the UDP source IP address of the client (i.e., should use DNAT instead of SNAT)
Is that possible?
We currently use a NodePort service with externalTrafficPolicy: Local. This forces DNAT, but only the pod running on the requested node receives the traffic.
There doesn't seem to be a way to spread the load over multiple pods on multiple nodes.
I already looked at this Kubernetes tutorial and also this article here.
The Problem
I feel like there is a need for some explanation before facing the actual issue(s) in order to understand why things do not work as expected:
Usually what happens when using NodePort is that you expose a port on every node in your cluster. When making a call to node1:port the traffic will then (same as with a ClusterIP type) be forwarded to one Pod that matches the selector, regardless of that Pod being on node1 or another node.
Now comes the tricky part.
When using externalTrafficPolicy: Local, packets that arrive on a node that does not have a Pod on it will be dropped.
Perhaps the following illustration explains the behavior in a more understandable way.
NodePort with default externalTrafficPolicy: Cluster:
packet --> node1 --> forwards to random pod on any node (node1 OR node2 OR ... nodeX)
NodePort with externalTrafficPolicy: Local:
packet --> node1 --> forwards to pod on node1 (if pod exists on node1)
packet --> node1 --> drops packet (if there is no pod on node1)
So in essence, to properly distribute the load when using externalTrafficPolicy: Local, two main issues need to be addressed:
There has to be a Pod running on every node so that packets are not dropped
The client has to send packets to multiple nodes in order for the load to be distributed
The solution
The first issue can be resolved rather easily by using a DaemonSet. It will ensure that one instance of the Pod runs on every node in the cluster.
Alternatively one could also use a simple Deployment, manage the replicas manually and ensure proper distribution across the nodes by using podAntiAffinity. This approach would take more effort to maintain since replicas must be adjusted manually but can be useful if you want to have more than just 1 Pod on each node.
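The Deployment variant could be sketched roughly as follows. The name, label, image, port, and replica count are all placeholders. The sketch uses the "preferred" form of podAntiAffinity, on the assumption that you want pods spread across nodes while still allowing more than one pod per node.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: udp-backend               # hypothetical name
spec:
  replicas: 3                     # assumed; adjust by hand as nodes are added
  selector:
    matchLabels:
      app: udp-backend
  template:
    metadata:
      labels:
        app: udp-backend
    spec:
      affinity:
        podAntiAffinity:
          # Soft rule: the scheduler tries to place replicas on different
          # nodes, but may co-locate them when replicas exceed the node count.
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: udp-backend
                topologyKey: kubernetes.io/hostname
      containers:
        - name: server
          image: example.com/udp-backend:1.0   # placeholder image
          ports:
            - containerPort: 5000              # assumed port
              protocol: UDP
```

Because the rule is only a preference, it does not by itself guarantee a pod on every node. Switching to requiredDuringSchedulingIgnoredDuringExecution enforces at most one pod per node instead.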
Now for the second issue.
The easiest solution would be to have the client implement the logic itself and send requests to all the nodes in round-robin fashion; however, that is not a very practical or realistic way of doing it.
Usually when using NodePort there is still a load balancer of some kind in front of it to distribute the load (not talking about the Kubernetes service type LoadBalancer here). This may seem redundant since by default NodePort will distribute the traffic across all the Pods anyway; however, the node that gets requested still receives the traffic, and then another hop happens. Furthermore, if the same node is always addressed, then once that node goes down (for whatever reason) traffic will never reach any of the Pods. So for those (and many other) reasons a load balancer should always be used in combination with NodePort. To solve the issue, simply configure the load balancer to preserve the source IP of the original client.
Furthermore, depending on which cloud you are running on, you may be able to configure a service of type LoadBalancer instead of NodePort (which is basically a NodePort service plus a load balancer in front of it, as described above). Configure it with externalTrafficPolicy: Local, address the first issue as described earlier, and you have achieved what you wanted.
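A minimal sketch of such a Service, assuming the backend pods carry a hypothetical app: udp-backend label and listen on UDP port 5000 (both are placeholders). Whether a cloud load balancer can carry UDP at all depends on the provider.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: udp-backend               # hypothetical name
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local    # deliver only to pods on the receiving node, keeping the client source IP
  selector:
    app: udp-backend              # hypothetical pod label
  ports:
    - protocol: UDP
      port: 5000                  # assumed port
      targetPort: 5000
```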

Performance considerations for NodePort vs. ClusterIP vs. Headless Service on Kubernetes

We have two types of services that we run on AWS EKS:
external-facing services which we expose through an application-level load balancer using aws-alb-ingress-controller
internal-facing services which we use both directly through the service name (for EKS applications) and through an internal application-level loadbalancer also using aws-alb-ingress-controller (for non-EKS applications)
I would like to understand the performance implications of choosing Nodeport, ClusterIP or Headless Service for both the external and internal services. I have the setup working with all three options.
If I understand the networking correctly, it seems that a Headless Service requires fewer hops and would hence be (slightly) faster? This article, however, seems to suggest that a Headless Service would not be properly load balanced when called directly. Is this correct? And would this still hold when called through the external (or internal) ALB?
Is there any difference in performance for NodePort vs ClusterIP?
Finally, what is the most elegant/performant way of using internal services from outside of the cluster (where we don't have access to the Kubernetes DNS) but within the same VPC? Would it be to use ClusterIP and specify the IP address in the service definition so it remains stable? Or are there better options?
I've put more detailed info on each of the connection forwarding types, and on how the services are forwarded, under the headings below as context for my answers.
If I understand the networking correctly, it seems that a Headless Service requires fewer hops and would hence be (slightly) faster?
Not substantially faster. The "extra hop" is the packet traversing local lookup tables, which it traverses anyway, so there is no noticeable difference. The destination pod is still going to be the same number of actual network hops away.
If you have thousands of services that each run on a single pod and could be headless, then you might use that to limit the number of iptables NAT rules and speed up rule processing (see iptables vs. IPVS below).
Is < a headless service not load balanced > correct? And would this still hold when called through the external (or internal) ALB?
Yes, it is correct: the client (or ALB) would need to implement the load balancing across the Pod IPs.
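For context, a Service becomes headless by setting clusterIP to None, as in this sketch (the name, label, and port are placeholders). A DNS lookup of the service name then returns the individual Pod IPs, and picking one of them is left to the caller.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: internal-api        # hypothetical name
spec:
  clusterIP: None           # headless: no virtual IP is allocated for the service
  selector:
    app: internal-api       # hypothetical pod label
  ports:
    - port: 8080            # assumed port
      targetPort: 8080
```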
Is there any difference in performance for NodePort vs ClusterIP?
A NodePort has a possible extra network hop from the entry node to the node running the pod. This comparison assumes the ClusterIP ranges are routed to the correct node (and routed at all).
If you happen to be using a service of type: LoadBalancer, this behaviour can be changed by setting .spec.externalTrafficPolicy to Local (https://kubernetes.io/docs/concepts/services-networking/service/#aws-nlb-support), which means traffic will only be directed to a local pod.
Finally, what is the most elegant/performant way of using internal services from outside of the cluster
I would say use the AWS ALB Ingress Controller with the alb.ingress.kubernetes.io/target-type: ip annotation. The k8s config from the cluster will be pushed out to the ALB via the ingress controller and address pods directly without traversing any connection forwarding or extra hops. All cluster reconfig will be automatically pushed out.
There is a little bit of latency for config to reach the ALB compared to cluster kube-proxy reconfiguration. Something like a rolling deployment might not be as seamless, as the updates arrive after a pod is gone. The ALBs are equipped to handle the outage themselves, eventually.
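An Ingress using that annotation might look like the sketch below. The service name and port are placeholders, and the ingressClassName: alb line is an assumption about how the controller is installed; older controller releases select the class with the kubernetes.io/ingress.class annotation instead.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: internal-api                               # hypothetical name
  annotations:
    alb.ingress.kubernetes.io/scheme: internal     # ALB reachable only from inside the VPC
    alb.ingress.kubernetes.io/target-type: ip      # register Pod IPs as targets instead of node ports
spec:
  ingressClassName: alb                            # assumed class name
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: internal-api                 # hypothetical Service
                port:
                  number: 8080                     # assumed port
```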
Kubernetes Connection Forwarding
There is a kube-proxy process running on each node which manages how and where connections are forwarded. There are 3 options for how kube-proxy does that: Userspace proxy, iptables or IPVS. Most clusters will be on iptables and that will cater for the vast majority of use cases.
Userspace proxy
The forwarding is via a process that runs in userspace to terminate and forward the connections. It's slow. It's unlikely you are using it, don't use it.
iptables
iptables forwards connections in kernel via NAT, which is fast. This is the most common setup and will cover 90% of use cases. New connections are distributed at random, with equal probability, across all pods backing a service.
IPVS
IPVS runs in kernel and is fast and scalable. If you route traffic to a large number of apps, this might improve the forwarding performance. It also supports different service load balancing modes:
- rr: round-robin
- lc: least connection (smallest number of open connections)
- dh: destination hashing
- sh: source hashing
- sed: shortest expected delay
- nq: never queue
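As a sketch, the mode and scheduler are selected in the kube-proxy configuration file. The choice of sh (source hashing) here is only an example, and how the file reaches kube-proxy (for instance, through a ConfigMap) depends on how the cluster was installed.

```yaml
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"          # use IPVS instead of the iptables proxier
ipvs:
  scheduler: "sh"     # one of the modes listed above; rr is the default
```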
Access to services
My explanations are iptables based as I haven't done much detailed work with ipvs clusters yet. I'm gonna handwave the ipvs complexity away and say it's basically the same as iptables, just with faster rule processing as the number of rules increases on huge clusters (i.e number of pods/services/network policies).
I'm also ignoring the userspace proxy in the description, due to the overhead just don't use it.
The basic thing to understand is that a "Service ClusterIP" is a virtual construct in the cluster that only exists as a rule for where the traffic should go. Every node maintains this rule mapping of all ClusterIP/port to PodIP/port (via kube-proxy).
Nodeport
The ALB routes to any node. The node/NodePort forwards the connection to a pod handling the service. This could be a remote pod, which would involve sending traffic back out over the "wire".
ALB > wire > Node > Kernel Forward to SVC ( > wire if remote node ) > Pod
ClusterIP
Using the ClusterIP for direct access depends on the Service cluster IP ranges being routed to the correct node. Sometimes they aren't routed at all.
ALB > wire > Node > Kernel Forward to SVC > Pod
The "Kernel Forward to SVC" step can be skipped with an ALB annotation without using a headless service.
Headless Service
Again, Pod IPs aren't always addressable from outside the cluster, depending on the network setup. You should be fine on EKS.
ALB > wire > Node > Pod
Note
I'll suffix this by noting that requests are probably looking at < 1ms of additional latency if a connection is forwarded to another node in a VPC, with enhanced networking instances at the low end of that. Inter-availability-zone comms might be a tad higher than intra-AZ. If you happen to have a geographically separated cluster, controlling traffic flow might become more important, for example with a tunnelled Calico network that actually jumps over a number of real networks.
what is the most elegant/performant way of using internal services from outside of the cluster (where we don't have access to the Kubernetes DNS) but within the same VPC?
To achieve this, I think you should have a look at a service mesh, for example Istio (https://istio.io). It handles your internal service calls manually so that the call doesn't have to go through Kubernetes DNS. Please have a look at Istio's docs (https://istio.io/docs) for more info.
Also, you can have a look at Istio at EKS (https://aws.amazon.com/blogs/opensource/getting-started-istio-eks)
A headless service will not have any load balancing at L4, but if you use it behind an ALB you get load balancing at L7.
NodePort internally uses the cluster IP, but your request may be randomly routed to a pod on another host when it could have been routed to a pod on the same host, which would avoid that extra hop out to the network. NodePort is generally a bad idea for production usage.
IMHO best way to access internal services from outside of the cluster will be using ingress.
You can use nginx as ingress controller where you deploy the nginx ingress controller on your cluster and expose it via a LoadBalancer type service using ALB. Then you can configure path or host based routing using ingress api to route traffic between backend kubernetes services.

Unique external IP per kubernetes pod

I need to scale my application so that it won't get banned for exceeding the request rate limit of a site it uses frequently (which allows up to X requests per minute per IP).
I meant to use Kubernetes and split the requests between multiple workers, but I saw that all the pods get the same external IP.
So what can I do?
I used a Kubernetes DaemonSet to attach a pod to each node, and instead of scaling by changing the deployment, I'm scaling by adding new nodes.
If you run in the cloud, you can create worker nodes with public IP addresses. Then your pods will use their node's public IP address, and you can distribute your pods across nodes using multiple replicas or a DaemonSet.
Do not worry about getting one external IP, because if you have 3 workers and one master like below
worker1 192.168.1.10
worker2 192.168.1.11
worker3 192.168.1.12
master 192.168.1.13
and, for example, you expose nginx on port 30000, Kubernetes opens this port on every node and you can access it by
curl 192.168.1.10:30000
curl 192.168.1.11:30000
curl 192.168.1.12:30000
curl 192.168.1.13:30000
and if you want every worker to have one pod, you can use a DaemonSet, or you can add a label to the nodes that you want.
This probably has less to do with your Kubernetes implementation and more to do with your network setup. It would depend on the source of the "external IP" you're referencing: is it given to you by your ISP? If you google "what is my ip", does it match the single IP you're talking about? If so, then you would need to negotiate with your ISP for additional external IPs.
Worth noting that #JamesJJ is correct. Using additional IPs to 'trick' the API into allowing more connections is most likely a violation of that site's TOS and may result in your access getting terminated.

kubernetes session affinity behavior

I am using Kubernetes 1.9.2, created by kubeadm.
This Kubernetes cluster is running on 4 EC2 nodes.
I have a deployment that requires using a cache in every pod.
In order to accomplish that, we used session affinity from ClusterIP.
Since I have an ELB in front of my Kubernetes cluster, I wonder how the session affinity behaves.
The natural behavior would be that for every client IP a different pod gets the requests, but given that the traffic is transferred via the ELB, which IP does the session affinity recognize: the ELB IP or the actual client IP?
when I check the traffic to the pods I see that 102 pods get all the requests and the 2 other pods are just waiting.
many thanks for any help.
SessionAffinity recognizes Client IP and ELB should pass the Client IP.
I think you should work with HTTP headers and Classic Load Balancers and set up X-Forwarded-For: client-ip-address
Also, this seems to be a known issue: "enabling Session affinity goes to a single pod only" #3056.
It was reported for versions 0.18.0 and 0.19.0 of the NGINX Ingress controller.
The issue was closed with a comment that it was fixed in version 0.21.0, but in December the initial author said it still doesn't work for him.

How to set up DNS and ingress controllers for a public-facing web app?

I'm trying to understand the concepts of ingress and ingress controllers in kubernetes. But I'm not so sure what the end product should look like. Here is what I don't fully understand:
Given I'm having a running Kubernetes cluster somewhere with a master node which runs the control plane and the etcd database. Besides that I'm having like 3 worker nodes - each of the worker nodes has a public IPv4 address with a corresponding DNS A record (worker{1,2,3}.domain.tld) and I've full control over my DNS server. I want that my users access my web application via www.domain.tld. So I point the www CNAME to one of the worker nodes (I saw that my ingress controller, for example, got scheduled to worker1, so I point it to worker1.domain.tld).
Now when I schedule a workload consisting of 2 frontend pods and 1 database pod with 1 service for the frontend and 1 service for the database. From what I understand right now, I need an ingress controller pointing to the frontend service to achieve some kind of load balancing. Two questions here:
Isn't running the ingress controller only on one worker node pointless to internally load balance to the two frontend pods via its service? Is it best practice to run an ingress controller on every worker node in the cluster?
For whatever reason the worker which runs the ingress controller dies and it gets rescheduled to another worker. So the ingress point will be at another IPv4 address, right? From a user perspective which tries to access the frontend via www.domain.tld, this DNS entry has to be updated, right? How so? Do I need to run a specific kubernetes-aware DNS server somewhere? I don't understand the connection between the DNS server and the kubernetes cluster.
Bonus question: If I run more ingress controllers replicas (spread across multiple workers), do I use a DNS round-robin based approach here with multiple IPv4 addresses bound to one DNS entry? Or what's the best solution to achieve HA? I would rather not use load balancing IP addresses where the workers share the same IP address.
Given I'm having a running Kubernetes cluster somewhere with a master
node which runs the control plane and the etcd database. Besides that
I'm having like 3 worker nodes - each of the worker nodes has a public
IPv4 address with a corresponding DNS A record
(worker{1,2,3}.domain.tld) and I've full control over my DNS server. I
want that my users access my web application via www.domain.tld. So I
point the www CNAME to one of the worker nodes (I saw that my
ingress controller, for example, got scheduled to worker1, so I point it to
worker1.domain.tld).
Now when I schedule a workload consisting of 2 frontend pods and 1
database pod with 1 service for the frontend and 1 service for the
database. From what I understand right now, I need an ingress
controller pointing to the frontend service to achieve some kind of
load balancing. Two questions here:
Isn't running the ingress controller only on one worker node pointless to internally load balance to the two frontend pods via its
service? Is it best practice to run an ingress controller on every
worker node in the cluster?
Yes, it's a good practice. Having multiple pods for the load balancer is important to ensure high availability. For example, if you run the ingress-nginx controller, you should probably deploy it to multiple nodes.
For whatever reason the worker which runs the ingress controller dies and it gets rescheduled to another worker. So the ingress point
will be at another IPv4 address, right? From a user perspective
which tries to access the frontend via www.domain.tld, this DNS entry
has to be updated, right? How so? Do I need to run a specific
kubernetes-aware DNS server somewhere? I don't understand the
connection between the DNS server and the kubernetes cluster.
Yes, the IP will change. And yes, this needs to be updated in your DNS server.
There are a few ways to handle this:
- Assume clients will deal with outages. You can list all load balancer nodes in round-robin DNS and assume clients will fall back. This works with some protocols, but mostly implies timeouts and problems and should generally not be used, especially since you still need to update the records by hand whenever Kubernetes decides to create or remove LB entries.
- Configure an external DNS server automatically. This can be done with the external-dns project, which can sync against most of the popular DNS servers, including standard RFC 2136 dynamic updates but also cloud providers like Amazon, Google, Azure, etc.
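With external-dns, the desired record is usually declared as an annotation on the Service (or Ingress) that exposes the controller. The sketch below assumes the ingress controller is exposed through a Service named ingress-nginx that receives an external address; the name, namespace, selector label, and port are placeholders.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx                 # hypothetical name
  namespace: ingress-nginx            # hypothetical namespace
  annotations:
    # external-dns watches for this annotation and syncs a matching record
    external-dns.alpha.kubernetes.io/hostname: www.domain.tld
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: ingress-nginx   # assumed controller pod label
  ports:
    - name: http
      port: 80
      targetPort: 80                         # assumed container port
```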
Bonus question: If I run more ingress controllers replicas (spread
across multiple workers), do I use a DNS round-robin based approach here
with multiple IPv4 addresses bound to one DNS entry? Or what's the
best solution to achieve HA? I would rather not use load balancing
IP addresses where the worker share the same IP address.
Yes, you should basically do DNS round-robin. I would assume external-dns would do the right thing here as well.
Another alternative is to do some sort of ECMP. This can be accomplished by having both load balancers "announce" the same IP space. That is an advanced configuration, however, which may not be necessary. There are interesting tradeoffs between BGP/ECMP and DNS updates; see this Dropbox engineering post for a deeper discussion about those.
Finally, note that CoreDNS is looking at implementing public DNS records which could resolve this natively in Kubernetes, without external resources.
Isn't running the ingress controller only on one worker node pointless to internally load balance to the two frontend pods via its service? Is it best practice to run an ingress controller on every worker node in the cluster?
The number of ingress controller replicas does not affect the quality of load balancing. But for HA you can run more than one replica of the controller.
For whatever reason the worker which runs the ingress controller dies and it gets rescheduled to another worker. So the ingress point will be at another IPv4 address, right? From a user perspective which tries to access the frontend via www.domain.tld, this DNS entry has to be updated, right? How so? Do I need to run a specific kubernetes-aware DNS server somewhere? I don't understand the connection between the DNS server and the kubernetes cluster.
Right, it will be on another IPv4 address. Yes, DNS should be updated for that. There are no standard tools for that included in Kubernetes. Yes, you need to run an external DNS server and manage the records on it yourself (with some tools or scripts).
The DNS server inside a Kubernetes cluster and your external DNS server are totally different things. The DNS server inside the cluster provides resolution only inside the cluster, for service discovery. Kubernetes does not know anything about access from external networks to the cluster, at least on bare metal. In a cloud, it can manage some things like load balancers to automate external access management.
If I run more ingress controllers replicas (spread across multiple workers), do I use a DNS round-robin based approach here with multiple IPv4 addresses bound to one DNS entry? Or what's the best solution to achieve HA?
DNS round-robin works in that case, but if one of the nodes is down, your clients will have problems connecting to that node, so you need to find some way to move or remove the IP of that node.
The solution for HA provided by #jjo is not the worst way to achieve what you want if you can prepare an environment for it. If not, you should choose something else, but the best practice is to use a load balancer provided by the infrastructure. Whether it is based on several dedicated servers, load-balancing IPs, or something else does not matter.
The behavior you describe is actually a LoadBalancer (a Service with type=LoadBalancer in Kubernetes), which is "naturally" provided when you're running Kubernetes on top of a cloud provider.
From your description, it looks like your cluster is on bare metal (either true or virtual metal). A possible approach (that has worked for me) would be:
- Deploy https://github.com/google/metallb
  - this is where your external IP will "live" (HA'd), via the speaker-xxx pods deployed as a DaemonSet to each worker node
  - depending on your extn L2/L3 setup, you'll need to choose between L3 (BGP) or L2 (ARP) modes
  - fyi I've successfully used L2 mode + simple proxyarp at the border router
- Deploy the nginx-ingress controller, with its Service as type=LoadBalancer
  - this will make metallb "land" (actually: L3 or L2 "advertise" ...) the assigned IP on the nodes
  - fyi I successfully tested it together with kube-router using --advertise-loadbalancer-ip as CNI; the effect will be that e.g. <LB_IP>:80 will be redirected to the ingress-nginx Service NodePort
- Point your DNS to the ingress-nginx LB IP, i.e. what's shown by:
  kubectl get svc --namespace=ingress-nginx ingress-nginx -ojsonpath='{.status.loadBalancer.ingress[].ip}{"\n"}'
  - fyi you can also quickly test it using fake DNSing with http://A.B.C.D.xip.io/ (A.B.C.D being your public IP addr)
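For the L2 mode mentioned above, a MetalLB address pool might be declared roughly as follows. This sketch assumes a recent MetalLB release that is configured through custom resources (older releases, likely including the one this answer was written against, used a ConfigMap instead). The resource names and the address range are placeholders.

```yaml
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: public-pool                  # hypothetical name
  namespace: metallb-system
spec:
  addresses:
    - 203.0.113.10-203.0.113.20      # placeholder range; use addresses you actually own
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: public-l2                    # hypothetical name
  namespace: metallb-system
spec:
  ipAddressPools:
    - public-pool                    # announce addresses from the pool above via L2
```

Any Service of type=LoadBalancer, such as the ingress controller's, would then be assigned an address from this pool.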
Here is a Kubernetes DNS add-on, "Configure external DNS servers (AWS Route53, Google CloudDNS and others) for Kubernetes Ingresses and Services", which handles DNS record updates for ingress LoadBalancers. It keeps DNS records up to date according to the Ingress controller config.