Kubernetes: How does CNI take advantage of BGP?

When learning about Kubernetes CNI plugins, I heard that some of them use BGP or VXLAN under the hood.
From what I have read, on the internet the Border Gateway Protocol (BGP) manages how packets are routed between edge routers.
An autonomous system (AS) is a network of routers managed by a single enterprise or service provider, for example Facebook or Google.
Autonomous systems communicate with their peers and form a mesh.
But I still can't figure out how a CNI plugin takes advantage of BGP.
Imagine a Kubernetes cluster composed of 10 nodes, with Calico as the chosen CNI plugin.
Who plays the autonomous system (AS) role? Is each node an AS?
How are packets forwarded from one node to another? Is iptables still required?

The CNI plugin is responsible for allocating IP addresses (IPAM) and ensuring that packets get where they need to go.
For Calico specifically, you can get a lot of information from the architecture page as well as the Calico network design memoirs.
Whenever a new Pod is created, the IPAM plugin allocates an IP address from the global pool and the Kubernetes scheduler assigns the Pod to a Node. The Calico CNI plugin (like any other) configures the networking stack to accept connections to the Pod IP and routes them to the processes inside. This happens with iptables and uses a helper process called Felix.
Each Node also runs a BIRD (BGP) daemon that watches for configuration events of the form "IP 10.x.y.z is hosted on node A". These events are turned into BGP updates and sent to the other nodes over the open BGP sessions.
When the other nodes receive these BGP updates, they program the node route table (with simple ip route commands) to ensure the node knows how to reach the Pod. In this model, yes, every node is an AS.
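As a rough illustration (the addresses, CIDRs, and interface names below are made up, not taken from any real cluster), the result on a receiving node is just ordinary kernel routes:

    # Routes learned over BGP from other nodes show up with "proto bird":
    ip route show proto bird
    # 172.16.1.0/26 via 10.0.0.11 dev eth0 proto bird    <- pod block hosted on node A
    # 172.16.3.0/26 via 10.0.0.13 dev eth0 proto bird    <- pod block hosted on node C

    # Pods hosted locally get /32 routes pointing at their veth interfaces:
    ip route show | grep cali
    # 172.16.2.5 dev cali1a2b3c4d5 scope link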
What I just described is the "AS per compute server" model: it is suitable for small deployments in environments where nodes are not necessarily on the same L2 network. The problem is that each node needs to maintain a BGP session with every other node, which scales as O(N^2).
For larger deployments, therefore, a compromise is to run one AS per rack of compute servers ("AS per rack"). Each top-of-rack switch then runs BGP to communicate routes to other racks, while internally the switch knows how to route packets within its own rack.
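If you want to see those BGP sessions on a Calico node, something like the following should work, assuming calicoctl is installed on the node (output shape is approximate): in the default node-to-node mesh every other node shows up as a peer, which is exactly the O(N^2) mesh mentioned above.

    sudo calicoctl node status
    # Calico process is running.
    # IPv4 BGP status
    # +--------------+-------------------+-------+------------+-------------+
    # | PEER ADDRESS |     PEER TYPE     | STATE |   SINCE    |    INFO     |
    # +--------------+-------------------+-------+------------+-------------+
    # | 10.0.0.11    | node-to-node mesh | up    | 2023-01-01 | Established |
    # | 10.0.0.12    | node-to-node mesh | up    | 2023-01-01 | Established |
    # ...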

Related

Clarification about Ports in Kubernetes scaling

Let's say I have a web application backend that I want to deploy with the help of Kubernetes. How exactly does scaling work in this case?
I understand scaling in Kubernetes as: we have one master node that orchestrates multiple worker nodes, where each worker node runs 0-n containers of the same image. My question is, if this is correct, how does Kubernetes deal with the fact that instances of the same application use the same port within one worker node? Does the request reach the master node, which then handles this problem internally?
Does the request reach the master node which then handles this problem internally?
No, the master nodes do not handle traffic for your apps. Typically, traffic meant for your apps arrives at a load balancer or gateway, e.g. Google Cloud Load Balancer or AWS Elastic Load Balancer; the load balancer then forwards the request to a replica of the matching service - this routing is managed by the Kubernetes Ingress resource in your cluster.
The master nodes - the control plane - are only used for management, e.g. when you deploy a new image or service.
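A minimal sketch of that flow (the names, image, and ports below are placeholders, not taken from the question): expose a Deployment through a cloud load balancer, and the traffic path becomes LB -> node -> Service -> Pod, never touching the control plane.

    kubectl create deployment backend --image=example.com/backend:1.0   # placeholder image
    kubectl expose deployment backend --type=LoadBalancer --port=80 --target-port=8080
    kubectl get service backend    # EXTERNAL-IP is provisioned by the cloud load balancer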
How does Kubernetes deal with the fact that instances of the same application use the same port within one worker node?
Kubernetes uses a container runtime for your containers. You can try this on your own machine: with Docker, for example, you can create multiple containers (instances) of your app, all listening on e.g. port 8080. This is a key feature of containers - they provide network isolation.
On Kubernetes, all containers are tied together with a custom container network. How this works depends on which Container Network Interface (CNI) plugin you use in your cluster. Each Pod in your cluster gets its own IP address. All your containers can listen on the same port if you want - this is an abstraction.
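A quick sketch of the isolation point above (image names are just examples): the same port can be reused because every container or Pod has its own network namespace and IP.

    # Plain Docker: two containers of the same image, both listening on 8080 internally
    docker run -d --name app1 example.com/my-app:1.0
    docker run -d --name app2 example.com/my-app:1.0

    # Kubernetes: every Pod replica gets its own IP, so the same port is fine across replicas
    kubectl get pods -o wide    # one distinct IP per Pod in the IP column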

Source NAT traffic through Calico

I have a bunch of Kubernetes worker nodes on RFC 1918 addresses, so they do not have direct internet access. In most cases, I just set HTTP_PROXY environment variables in the containers to bounce outgoing (HTTP) traffic through a set of Squid servers.
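For reference, the proxy setup described above might look roughly like this (the deployment name, proxy host, and port are placeholders):

    kubectl set env deployment/my-app \
      HTTP_PROXY=http://squid.internal.example:3128 \
      HTTPS_PROXY=http://squid.internal.example:3128 \
      NO_PROXY=10.0.0.0/8,.svc,.cluster.local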
However, I also need to get non-HTTP traffic out of the cluster (to the internet); database and SSH connections, for example. While I am aware I can set up tunnels etc., I want a more generic solution.
In my Kubernetes cluster, I already have a few nodes in a non-RFC 1918 address range, so any pods on these nodes can reach the internet (coincidentally, that is where my Squid proxies reside). As they are part of the same Kubernetes cluster, they share the same Calico overlay network with the non-routable nodes.
I noticed https://projectcalico.docs.tigera.io/networking/workloads-outside-cluster; however, it doesn't go into much detail. Is there a way I can use this (or any other Calico mechanism) to route all "internet" traffic from all pods on the RFC 1918 worker nodes through the non-RFC 1918 nodes in our cluster?
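For what it's worth, the main thing that linked page appears to configure is outgoing NAT on the Calico IP pool; a sketch of that setting is below (the pool name assumes the default install). Note that on its own this SNATs on whichever node the Pod runs on, so it does not by itself steer egress through the non-RFC 1918 nodes.

    calicoctl patch ippool default-ipv4-ippool \
      --patch '{"spec": {"natOutgoing": true}}'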

In which cases do we need container networks in kubernetes while we already have kubernetes Service?

Why do we need point-to-point connections between Pods when we already have workload abstractions and networking mechanisms (Service/kube-proxy/Ingress, etc.) on top of them?
What is the default CNI?
REDACTED: I was confused about this question because I felt like I hadn't installed any of the popular CNI plugins when I was installing Kubernetes. It turns out Kubernetes defaults to kubenet.
Btw, I see a lot of overlapping features between Istio and container networking. IMO they could achieve identical objectives; the only difference is that Istio is high-level while CNI is low-level and more efficient. Is that correct?
REDACTED: Interestingly, Istio has its own CNI.
Kubernetes networking has some requirements:
pods on a node can communicate with all pods on all nodes without NAT
agents on a node (e.g. system daemons, kubelet) can communicate with all pods on that node
pods in the host network of a node can communicate with all pods on all nodes without NAT
CNI (the Container Network Interface) defines a standard interface, and all implementations (Calico, Flannel, etc.) need to follow it.
So CNI exists to implement this Kubernetes networking model.
A Service is different: it supplies a stable virtual address that proxies to the Pods, since Pods are ephemeral and their IPs change, while the address of the Service does not.
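A minimal sketch of that idea (names and ports are examples): the Service below keeps a stable virtual IP and DNS name in front of whatever Pods currently match its selector.

    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: Service
    metadata:
      name: backend
    spec:
      selector:
        app: backend
      ports:
        - port: 80
          targetPort: 8080
    EOF
    # Pods labelled app=backend can come and go; the Service's ClusterIP stays the same.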
Istio is another thing entirely: it turns the connections between microservices into infrastructure and pulls that concern out of the business code (think of Spring Cloud).
Why do we need point-to-point connections between Pods when we already have workload abstractions and networking mechanisms (Service/kube-proxy/Ingress, etc.) on top of them?
In general, you will find everything about networking in a cluster in this documentation. You can find more information about pod networking:
Every Pod gets its own IP address. This means you do not need to explicitly create links between Pods and you almost never need to deal with mapping container ports to host ports. This creates a clean, backwards-compatible model where Pods can be treated much like VMs or physical hosts from the perspectives of port allocation, naming, service discovery, load balancing, application configuration, and migration.
Kubernetes imposes the following fundamental requirements on any networking implementation (barring any intentional network segmentation policies):
pods on a node can communicate with all pods on all nodes without NAT
agents on a node (e.g. system daemons, kubelet) can communicate with all pods on that node
Note: For those platforms that support Pods running in the host network (e.g. Linux):
pods in the host network of a node can communicate with all pods on all nodes without NAT
Then you are asking:
What is the default CNI?
There is no single default CNI in a Kubernetes cluster. It depends on the kind of cluster you are dealing with, and where and how you set it up. As you can see reading this doc about implementing the networking model, there are many CNIs available for Kubernetes.
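If you want to check what your own cluster uses, a rough way (paths and labels vary by distribution) is to look at the CNI configuration on a node and at the kube-system pods:

    ls /etc/cni/net.d/      # CNI config files installed on the node
    kubectl -n kube-system get pods -o wide | grep -Ei 'calico|flannel|cilium|weave|kindnet'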
Istio is a completely different tool for something else. You can't compare them like that. Istio is a service mesh tool.
Istio extends Kubernetes to establish a programmable, application-aware network using the powerful Envoy service proxy. Working with both Kubernetes and traditional workloads, Istio brings standard, universal traffic management, telemetry, and security to complex deployments.

What route do service requests take through Kubernetes?

Let's say that a Service in a Kubernetes cluster is mapped to a group of cloned containers that will fulfill requests made for that service from the outside world.
What are the steps in the journey that a request from the outside world will make into the Kubernetes cluster, then through the cluster to the designated container, and then back through the Kubernetes cluster out to the original requestor in the outside world?
The documentation indicates that kube-controller-manager includes the Endpoints controller, which joins services to Pods. But I have not found specific documentation illustrating the steps in the journey that each request makes through a Kubernetes cluster.
This is important because it affects how one might design security for services, including the configuration of routing around the control plane.
Assuming you are using mostly the defaults:
Packet comes in to your cloud load balancer of choice.
It gets forwarded to a random node in the cluster.
It is received by the kernel and run through iptables.
Iptables defines a mapping rule to forward the packet to a container IP.
Unless the chosen Pod randomly happens to be on the same box, the packet then goes through your CNI network, usually some kind of overlay, possibly with encapsulation and decapsulation along the way.
It eventually gets to the container IP, and then is delivered to whatever the process inside the container is.
The Services and Endpoints system is what creates and manages the iptables rules and the cloud load balancers so that the LB knows the right node IPs and the iptables rules know the right container IPs.
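If you want to see those pieces on a live cluster, something like this shows them (the service name is a placeholder; the chain names are what kube-proxy creates in iptables mode):

    kubectl get service,endpoints my-service          # the objects the controllers maintain
    sudo iptables -t nat -L KUBE-SERVICES -n | head   # Service VIP -> per-service chains
    sudo iptables -t nat -L KUBE-NODEPORTS -n | head  # NodePort entries, if any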

In K8S, does every kube-proxy (running on every node) have the same implementation?

I am new to K8S and I am trying to understand the exact role of kube-proxy running on each node in a cluster. The documentation mentions that "kube-proxy reflects services as defined in the Kubernetes API on each node and can do simple TCP, UDP, and SCTP stream forwarding or round robin TCP, UDP, and SCTP forwarding across a set of backends". For this to be true, each kube-proxy will need to have complete information about all the services running in the cluster as it is the responsibility of the kube-proxy to provide access to any service which is demanded by an application running on a pod (on that respective node). So does that mean that all the kube-proxies inside a K8S cluster (running on each node) are mirror images? If so, why is a kube-proxy present on each node instead of a centralized one for entire cluster?
Link to the K8S documentation on proxies: https://kubernetes.io/docs/concepts/cluster-administration/proxies/
So does that mean that all the kube-proxies inside a K8S cluster (running on each node) are mirror images?
Yes, they are instances of the same image.
If so, why is a kube-proxy present on each node instead of a centralized one for entire cluster?
kube-proxy uses the operating system's packet filtering layer if one exists and is available, such as iptables or IPVS. Otherwise, kube-proxy forwards the traffic itself (see the kube-proxy documentation).
kube-proxy is itself a Kubernetes controller: it watches the desired state of the cluster (Services and Endpoints) and makes changes on the nodes, e.g. managing iptables rules when running in iptables mode.
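In many clusters (kubeadm-based ones, for example) you can see both points directly: kube-proxy is deployed as a DaemonSet, so every node runs an identical copy of the same image (the label below assumes the standard kubeadm manifests):

    kubectl -n kube-system get daemonset kube-proxy
    kubectl -n kube-system get pods -l k8s-app=kube-proxy -o wide   # one pod per node, same image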
For this to be true, each kube-proxy will need to have complete information about all the services running in the cluster .....
The following flags control how often kube-proxy refreshes its rules (a usage sketch follows the list):
--iptables-min-sync-period duration
The minimum interval of how often the iptables rules can be refreshed as endpoints and services change (e.g. '5s', '1m', '2h22m').
--iptables-sync-period duration Default: 30s
The maximum interval of how often iptables rules are refreshed (e.g. '5s', '1m', '2h22m'). Must be greater than 0.
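As a rough usage sketch (the values are examples), these can be passed as flags, or in kubeadm clusters adjusted via the kube-proxy ConfigMap's iptables.syncPeriod / iptables.minSyncPeriod fields:

    kube-proxy --proxy-mode=iptables \
      --iptables-sync-period=30s \
      --iptables-min-sync-period=10s

    # or, in a kubeadm cluster:
    kubectl -n kube-system edit configmap kube-proxy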
IMO, decisions about connections (forwarding, accepting) among Pods on nodes should be made by node components rather than by a central control-plane component. Besides, the Kubernetes control plane (api-server, etcd) keeps the desired state and the current state of the cluster, so all of the controllers can reconcile according to their configured behaviour.