I have a bunch of Kubernetes worker nodes on RFC1918 addresses, so they do not have direct internet access. In most cases, I just set HTTP_PROXY environment variables in the containers to bounce outgoing (HTTP) traffic through a set of Squid servers.
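For illustration, this is roughly how I inject the proxy today (the image and Squid address below are placeholders):

```bash
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: example-app
spec:
  containers:
  - name: app
    image: example/app:latest                      # placeholder image
    env:
    - name: HTTP_PROXY
      value: http://squid.example.internal:3128    # placeholder Squid address
    - name: HTTPS_PROXY
      value: http://squid.example.internal:3128
EOF
```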
However, I need to get non-HTTP traffic out of the cluster (to the internet) as well, for example database and SSH connections. Whilst I am aware I can set up tunnels etc., I want a more generic solution.
In my Kubernetes cluster, I already have a few nodes in a non-RFC1918 address range, so any pods on these nodes can reach the internet (coincidentally, that is also where my Squid proxies reside). As they are part of the same Kubernetes cluster, they share the same Calico overlay network with the non-routable nodes.
I noticed https://projectcalico.docs.tigera.io/networking/workloads-outside-cluster; however, it doesn't go into much detail. Is there a way I can use this (or any other Calico mechanism) to route all "internet" traffic from all pods on the RFC1918 worker nodes through the non-RFC1918 nodes in our cluster?
Related
When learning about Kubernetes CNI plugins, I heard that some of them use BGP or VXLAN under the hood.
On the internet, the Border Gateway Protocol (BGP) manages how packets are routed between edge routers.
An autonomous system (AS) is a collection of networks and routers managed by a single enterprise or service provider, for example Facebook or Google.
Autonomous systems communicate with their peers and form a mesh.
But I still can't figure out how a CNI plugin takes advantage of BGP.
Imagine there is a Kubernetes cluster composed of 10 nodes, with Calico as the chosen CNI plugin.
Who plays the autonomous system (AS) role? Is each node an AS?
How are packets forwarded from one node to another? Is iptables still required?
The CNI plugin is responsible for allocating IP addresses (IPAM) and ensuring that packets get where they need to go.
For Calico specifically, you can get a lot of information from the architecture page as well as the Calico network design memoirs.
Whenever a new Pod is created, the IPAM plugin allocates an IP address from the global pool and the Kubernetes scheduler assigns the Pod to a Node. The Calico CNI plugin (like any other) configures the networking stack to accept connections to the Pod IP and routes them to the processes inside. This happens with iptables and uses a helper process called Felix.
Each Node also runs a BIRD (BGP) daemon that watches for these configuration events: "IP 10.x.y.z is hosted on node A". These configuration events are turned into BGP updates and sent to other nodes using the open BGP sessions.
When the other nodes receive these BGP updates, they program the node route table (with simple ip route commands) to ensure the node knows how to reach the Pod. In this model, yes, every node is an AS.
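Concretely, the kind of route that gets programmed looks roughly like this (the addresses are hypothetical, and in reality BIRD installs the route itself rather than you typing it):

```bash
# On node B, after learning via BGP that the Pod block 10.244.1.0/26
# is hosted on node A (192.168.0.11):
ip route add 10.244.1.0/26 via 192.168.0.11 dev eth0 proto bird
```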
What I just described is the "AS per compute server" model: it is suitable for small deployments in environments where nodes are not necessarily on the same L2 network. The problem is that each node needs to maintain a BGP session with every other node, which scales as O(N^2).
For larger deployments, therefore, a compromise is to run one AS per rack of compute servers ("AS per rack"). Each top-of-rack switch then runs BGP to communicate routes to other racks, while within the rack the switch itself knows how to route packets.
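On the Calico side, moving away from the full mesh boils down to configuration like this sketch: disable the node-to-node mesh and peer with a top-of-rack switch instead (the AS numbers and peer IP are examples, not a recommendation):

```bash
calicoctl apply -f - <<'EOF'
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  nodeToNodeMeshEnabled: false   # turn off the O(N^2) full mesh
  asNumber: 64512                # example AS number for the rack
---
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: rack1-tor
spec:
  peerIP: 192.168.1.1            # example top-of-rack switch address
  asNumber: 64513                # example AS number of the switch
EOF
```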
I want to publish my node IPs to an open endpoint so my IDS system can whitelist all cluster node IPs.
I just want to double-check: do you see any risk in doing so?
I guess the nodes should be safe by design when I use GKE.
Normally there is only one load balancer IP, which the DNS points to.
So with the ping command I can only get the load balancer IP.
Is there a way for an attacker to get the node IPs anyway?
Do you see a big security issue here?
To prioritize high-value cluster security, it is better to keep node IP addresses from being reachable over the internet. You should limit exposure of your cluster control plane and nodes to the internet.
To disable direct internet access to nodes, specify the gcloud option --enable-private-nodes at cluster creation. This tells GKE to provision nodes with internal IP addresses, which means the nodes aren't directly reachable over the public internet.
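As a sketch (the cluster name and CIDR are examples, and the exact set of accompanying flags depends on your setup):

```bash
gcloud container clusters create private-cluster \
  --enable-private-nodes \
  --enable-ip-alias \
  --master-ipv4-cidr 172.16.0.0/28   # example range for the control plane
```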
If the endpoint is trusted and secure, then whitelisting is fine. As long as you don't have to open any extra ports for that endpoint, it's OK to whitelist.
Refer to Restrict network access to the control plane and nodes for more information.
You can also use Shielded GKE nodes, which provide strong, verifiable node identity and integrity to increase the security of Google Kubernetes Engine (GKE) nodes.
Refer to Using Shielded GKE nodes for more information.
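If I recall correctly, this can be enabled with a single flag at creation time (the cluster name is a placeholder):

```bash
gcloud container clusters create my-cluster --enable-shielded-nodes
```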
Why do we need point-to-point connections between pods when we already have workload abstractions and networking mechanisms (Service/kube-proxy/Ingress etc.) on top of them?
What is the default CNI?
REDACTED: I was confused about this question because I felt like I hadn't installed any of the popular CNI plugins when I was installing Kubernetes. It turns out Kubernetes defaults to kubenet.
Btw, I see a lot of overlapping features between Istio and container networks. IMO they could achieve identical objectives; the only difference is that Istio is high-level while CNI is low-level and more efficient. Is that correct?
REDACTED: Interestingly, Istio has its own CNI
Kubernetes networking has some requirements:
pods on a node can communicate with all pods on all nodes without NAT
agents on a node (e.g. system daemons, kubelet) can communicate with all pods on that node
pods in the host network of a node can communicate with all pods on all nodes without NAT
CNI (the Container Network Interface) defines a standard interface, and all implementations (Calico, Flannel, etc.) need to follow it.
So it aims to solve Kubernetes networking.
A Service is different: it supplies a stable virtual address that proxies the Pods, since Pods are ephemeral and their IPs change while the Service address does not.
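A minimal sketch (names and ports are examples): the ClusterIP below stays stable while the Pods matching the selector come and go.

```bash
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  selector:
    app: my-app          # Pods carrying this label back the Service
  ports:
  - port: 80             # stable virtual port on the ClusterIP
    targetPort: 8080     # port the Pods actually listen on
EOF
```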
Istio is something else again: it turns the connections between microservices into infrastructure and pulls that concern out of the business code (think of Spring Cloud).
Why do we need point-to-point connections between pods while we have workload abstractions and networking mechanisms (Service/kube-proxy/Ingress etc.) over them?
In general, you will find everything about networking in a cluster in this documentation. You can find more information about pod networking:
Every Pod gets its own IP address. This means you do not need to explicitly create links between Pods and you almost never need to deal with mapping container ports to host ports. This creates a clean, backwards-compatible model where Pods can be treated much like VMs or physical hosts from the perspectives of port allocation, naming, service discovery, load balancing, application configuration, and migration.
Kubernetes imposes the following fundamental requirements on any networking implementation (barring any intentional network segmentation policies):
pods on a node can communicate with all pods on all nodes without NAT
agents on a node (e.g. system daemons, kubelet) can communicate with all pods on that node
Note: For those platforms that support Pods running in the host network (e.g. Linux):
pods in the host network of a node can communicate with all pods on all nodes without NAT
Then you are asking:
What is the default CNI?
There is no single default CNI in a Kubernetes cluster. It depends on what kind of cluster you are dealing with, and where and how you set it up. As you can see from this doc about implementing the networking model, there are many CNIs available for Kubernetes.
Istio is a completely different tool for something else. You can't compare them like that. Istio is a service mesh tool.
Istio extends Kubernetes to establish a programmable, application-aware network using the powerful Envoy service proxy. Working with both Kubernetes and traditional workloads, Istio brings standard, universal traffic management, telemetry, and security to complex deployments.
Do we need any specific Wi-Fi/LAN router to use MetalLB in Kubernetes?
How does MetalLB help if it runs on a machine? All the router traffic would have to arrive at that machine first and then get routed, making the machine a bottleneck.
Shouldn't the MetalLB solution fit somewhere in the router itself?
Maybe first, what is MetalLB and why use it:
MetalLB is a load-balancer implementation for bare metal Kubernetes clusters, using standard routing protocols.
...
Bare metal cluster operators are left with two lesser tools to bring user traffic into their clusters, “NodePort” and “externalIPs” services. Both of these options have significant downsides for production use, which makes bare metal clusters second class citizens in the Kubernetes ecosystem.
MetalLB aims to redress this imbalance by offering a Network LB implementation that integrates with standard network equipment, so that external services on bare metal clusters also “just work” as much as possible.
There is nothing special needed besides correctly routing the traffic to your bare metal server. You might set it up as a DMZ host or just forward ports to the server behind the router.
If you are looking into load balancing the traffic before it reaches the server, that will only work with several servers.
If you have 4 bare metal servers, you can set one up as the control-plane node and the other three as worker nodes, so that incoming traffic can be spread across the worker nodes.
You can use MetalLB in layer 2 mode:
In layer 2 mode, one node assumes the responsibility of advertising a service to the local network. From the network’s perspective, it simply looks like that machine has multiple IP addresses assigned to its network interface.
Under the hood, MetalLB responds to ARP requests for IPv4 services, and NDP requests for IPv6.
The major advantage of the layer 2 mode is its universality: it will work on any ethernet network, with no special hardware required, not even fancy routers.
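With recent MetalLB versions, the layer 2 configuration is a couple of CRDs like the sketch below (the address range is an example; older releases used a ConfigMap instead):

```bash
kubectl apply -f - <<'EOF'
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: example-pool
  namespace: metallb-system
spec:
  addresses:
  - 192.168.1.240-192.168.1.250   # example range MetalLB may hand out
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: example-l2
  namespace: metallb-system
spec:
  ipAddressPools:
  - example-pool
EOF
```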
You can also use it in BGP mode:
In BGP mode, each node in your cluster establishes a BGP peering session with your network routers, and uses that peering session to advertise the IPs of external cluster services.
Assuming your routers are configured to support multipath, this enables true load-balancing: the routes published by MetalLB are equivalent to each other, except for their nexthop. This means that the routers will use all nexthops together, and load-balance between them.
Once the packets arrive at the node, kube-proxy is responsible for the final hop of traffic routing, to get the packets to one specific pod in the service.
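The BGP-mode equivalent is a sketch like this, reusing the address pool from above (the router address and AS numbers are examples):

```bash
kubectl apply -f - <<'EOF'
apiVersion: metallb.io/v1beta2
kind: BGPPeer
metadata:
  name: example-router
  namespace: metallb-system
spec:
  myASN: 64500          # example AS number for the cluster
  peerASN: 64501        # example AS number of your router
  peerAddress: 10.0.0.1 # example router address
---
apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: example-bgp
  namespace: metallb-system
spec:
  ipAddressPools:
  - example-pool
EOF
```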
You can read more about the usage of MetalLB here.
I've been looking into Kubernetes networking, more specifically how to serve HTTPS to users most efficiently.
I was watching this talk: https://www.youtube.com/watch?v=0Omvgd7Hg1I and from 22:18 he explains the problem with a load balancer that is not pod-aware. The way they solve this in Kubernetes is by letting the nodes also act as a 'router', passing the request on to another node (explained at 22:46). This does not seem very efficient, but looking around, SoundCloud (https://developers.soundcloud.com/blog/how-soundcloud-uses-haproxy-with-kubernetes-for-user-facing-traffic) actually seems to do something similar, just with NodePorts. They say that the overhead costs less than building a better load balancer.
From what I have read, an option might be an ingress controller: making sure there is no more than one ingress controller per node, and routing traffic only to the specific nodes that have one. That way no traffic re-routing is needed, but it does add another layer of routing.
This information is all from 2017, so my question is: is there any pod-aware load balancer out there, or is there some other method that does not involve sending the HTTP request and response over the network twice?
Thank you in advance,
Hendrik
EDIT:
A bit more information about my use case:
There is a bare-metal setup with Kubernetes. The firewall load-balances the incoming traffic between two HAProxy instances. These HAProxy instances do SSL termination and forward the traffic to a few sites, including an Exchange setup, a few internal IIS sites and an nginx server for a static web app. The idea is to move the app servers into Kubernetes.
Now my main problem is how to get the requests from HAProxy into Kubernetes. I see a few options:
Use the SoundCloud setup. The infrastructure could stay almost the same, and the HAProxy servers could keep operating the way they do now.
I could use an ingress controller on EACH node in the Kubernetes cluster and have the firewall load-balance between the nodes. I believe it is possible to forward traffic from the ingress controller to servers outside the cluster, e.g. Exchange.
Some magic load balancer that I do not know about, which is pod-aware and able to operate outside of the Kubernetes cluster.
Options 1 and 2 are relatively simple and quite close in how they work, but they come with a performance penalty: when the node the firewall forwards a request to does not have the required pod running, or another pod is doing less work, the request gets forwarded to another node, thus using the network twice.
Is this just the price you pay when using Kubernetes, or is there something that I am missing?
How traffic reaches Pods depends on whether a managed cluster is used.
Almost all cloud providers can forward traffic in a cloud-native way in their managed K8s clusters. First, you create a managed cluster with some special network settings (e.g. a VPC-native cluster in GKE). Then the only thing you need to do is create a LoadBalancer-typed Service to expose your workload. You can also create Ingresses for your L7 workloads; they will be handled by the provided ingress controllers (e.g. the ALB on AWS).
In an on-premises cluster without any cloud provider (OpenStack or vSphere), the only built-in way to expose workloads is a NodePort-typed Service. That doesn't mean you can't improve on it.
If your cluster is behind reverse proxies (the SoundCloud case), setting externalTrafficPolicy: Local on Services avoids traffic being forwarded among worker nodes: traffic received through a NodePort is forwarded to local Pods, or dropped if the Pods reside on other nodes. The reverse proxy will then mark those NodePorts as unhealthy in its backend health checks and stop forwarding traffic to them. Another choice is topology-aware service routing: in that case local Pods have priority, but traffic is still forwarded between nodes when no local Pod matches.
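A minimal sketch of such a Service (names and ports are examples):

```bash
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  type: NodePort
  externalTrafficPolicy: Local   # only serve from Pods on the receiving node
  selector:
    app: my-app
  ports:
  - port: 80
    targetPort: 8080
    nodePort: 30080              # example port the reverse proxy targets
EOF
```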
For ingress controllers in on-prem clusters, it is a little different. You may have some worker nodes that have an EIP or public IP. To expose HTTP(S) services, an ingress controller is usually deployed on those worker nodes through a DaemonSet with hostNetwork, so that clients access the ingress controller via the well-known ports on the nodes' EIPs. These worker nodes regularly don't accept other workloads (e.g. the infra nodes in OpenShift), so one more forward on the Pod network is needed. You can also deploy the ingress controller on all worker nodes alongside the other workloads, so traffic could be forwarded to a closer Pod if the ingress controller supports topology-aware service routing.
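A hedged sketch of the first pattern (the image, labels and node selector are examples, not any particular controller's real manifest):

```bash
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ingress-controller
spec:
  selector:
    matchLabels:
      app: ingress-controller
  template:
    metadata:
      labels:
        app: ingress-controller
    spec:
      hostNetwork: true                 # bind directly to the node's ports
      nodeSelector:
        node-role/edge: "true"          # example label for nodes with an EIP
      containers:
      - name: controller
        image: example/ingress-controller:latest   # placeholder image
        ports:
        - containerPort: 80
        - containerPort: 443
EOF
```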
Hope it helps!