kube-proxy has an option called --proxy-mode, and according to the help message it can be set to userspace or iptables (see below).
# kube-proxy -h
Usage of kube-proxy:
...
--proxy-mode="": Which proxy mode to use: 'userspace' (older, stable) or 'iptables' (experimental). If blank, look at the Node object on the Kubernetes API and respect the 'net.experimental.kubernetes.io/proxy-mode' annotation if provided. Otherwise use the best-available proxy (currently userspace, but may change in future versions). If the iptables proxy is selected, regardless of how, but the system's kernel or iptables versions are insufficient, this always falls back to the userspace proxy.
...
I can't figure out what userspace mode means here.
Can anyone tell me how kube-proxy works when it runs in userspace mode?
Userspace and iptables refer to what actually handles the connection forwarding. In both cases, local iptables rules are installed to intercept outbound TCP connections that have a destination IP address associated with a service.
In the userspace mode, the iptables rule forwards to a local port where a Go binary (kube-proxy) is listening for connections. The binary (running in userspace) terminates the connection, establishes a new connection to a backend for the service, and then forwards requests to the backend and responses back to the local process. An advantage of the userspace mode is that, because the connections are created by an application (the proxy itself), the proxy can retry against a different backend if a connection is refused.
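The retry behavior described above can be sketched in a few lines of Python. This is only an illustration of the idea, not kube-proxy's actual code (which is written in Go); the function name connect_with_retry and the injectable dial parameter are made up for this example.

```python
import socket


def connect_with_retry(backends, dial=socket.create_connection, timeout=1.0):
    """Try each (host, port) backend in order and return the first that accepts.

    `dial` is injectable so the logic can be exercised without a network;
    by default it opens a real TCP connection.
    """
    last_error = None
    for backend in backends:
        try:
            # A userspace proxy owns the outbound connection, so it sees
            # the refusal and can simply move on to the next backend.
            return backend, dial(backend, timeout)
        except OSError as exc:  # ConnectionRefusedError is a subclass of OSError
            last_error = exc
    raise ConnectionError(f"no backend accepted the connection: {last_error}")
```

With packet-level forwarding (iptables mode) there is no such place to retry: the kernel rewrites the destination and the client's connection attempt either succeeds or fails against the chosen backend.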
In iptables mode, the iptables rules are installed to directly forward packets that are destined for a service to a backend for the service. This is more efficient than moving the packets from the kernel to kube-proxy and then back to the kernel so it results in higher throughput and better tail latency. The main downside is that it is more difficult to debug, because instead of a local binary that writes a log to /var/log/kube-proxy you have to inspect logs from the kernel processing iptables rules.
In both cases there will be a kube-proxy binary running on your machine. In userspace mode it inserts itself as the proxy; in iptables mode it configures iptables rather than proxying connections itself. The same binary works in both modes, and the behavior is switched via a flag or by setting an annotation in the apiserver for the node.
Problem statement:
I cannot access services running in pods within third-party containers that don't listen on a few specific ports when using istio-sidecar.
Facts:
I am running on a network with firewalled connections, so only a handful of ports can be used to communicate across the nodes; I cannot change that.
I have some third-party containers running within pods that listen on ports that are not among the handful that are allowed.
Without istio, I can do an iptables REDIRECT in an initContainer and just use any port I want.
With istio-sidecar, the envoy catch-all iptables rules forward the ORIGINAL_DST to envoy with the original port, so it always tries to connect to a port that nobody is listening at. I see envoy receiving it and trying to connect to the pod at the port that I faked, the one that is allowed in the network, not the one the service is listening at.
I am trying to avoid using a socat-like solution that runs another process copying from one port to another.
I can use any kind of iptables rules and/or istio resources, EnvoyFilters, etc.
My istio setup is the standard sidecar setup with nothing particular to it.
I am trying to connect to an external SCTP server from a pod inside a K8s cluster. The external SCTP server only allows connections from a configured IP address AND a specific source port.
From what I can understand, K8s performs SNAT during connection establishment, replacing the source IP address with the K8s node IP address and the source port with a random port. So the SCTP server sees the random source port and therefore rejects the connection.
The K8s cluster uses the Calico plugin, and I have tried the "disable NAT for target CIDR range" option, as explained here, by installing an ip-pool. But it didn't work: I can see via tcpdump on the server that the source port is still random, and I am not sure whether the ip-pool gets picked up.
So my question is: Is there a way to preserve the source port? Am I on the right track trying to disable NAT, i.e. will it work considering the pod IPs are internal?
Note: I am not sure if the problem/solution is related to it, but kube-proxy is in iptables mode.
Note: There is actually an identical question here, and the accepted answer, which suggests using hostNetwork: true, works for me as well. But I can't use hostNetwork, so I wanted to post this as a new question. Also, the "calico way" of disabling NAT towards specific targets seemed promising, and I am hoping the Calico folks can help.
Thanks
What algorithm does a Kubernetes service use to assign requests to the pods that it exposes? Can this algorithm be customized?
Thanks.
kube-proxy in userspace mode chooses a backend via a round-robin algorithm.
kube-proxy in iptables mode chooses a backend at random.
IPVS provides more options for balancing traffic to backend Pods; these are:
rr: round-robin
lc: least connection (smallest number of open connections)
dh: destination hashing
sh: source hashing
sed: shortest expected delay
nq: never queue
As mentioned here: Service
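To make the difference between round-robin and random selection concrete, here is a small Python sketch. The function names are invented for this example. The cascade model is an assumption based on how kube-proxy's iptables mode is commonly described: one rule per backend, where rule i (0-based, out of n) matches with probability 1/(n - i) if no earlier rule matched. The code only checks that such a cascade gives every backend the same overall chance.

```python
from fractions import Fraction
from itertools import cycle


def round_robin(backends):
    """Userspace-style selection: hand out backends in a fixed rotation."""
    return cycle(backends)


def cascade_probabilities(n):
    """Per-rule match probability for a chain of n rules (assumed model)."""
    return [Fraction(1, n - i) for i in range(n)]


def overall_probabilities(n):
    """Chance that a new connection ends up at each of the n backends."""
    remaining = Fraction(1)  # probability that no earlier rule matched
    result = []
    for p in cascade_probabilities(n):
        result.append(remaining * p)
        remaining *= 1 - p
    return result
```

For three backends the per-rule probabilities are 1/3, 1/2 and 1, and overall_probabilities(3) works out to 1/3 for each backend, so the selection is uniform even though the individual rules have different probabilities.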
For application-level routing you would need to use a service mesh or proxy such as Istio, Envoy, or Kong.
You can use the kube-proxy component. What is it?
kube-proxy is a network proxy that runs on each node in your cluster, implementing part of the Kubernetes Service concept.
kube-proxy maintains network rules on nodes. These network rules allow network communication to your Pods from network sessions inside or outside of your cluster.
kube-proxy uses the operating system packet filtering layer if there is one and it's available. Otherwise, kube-proxy forwards the traffic itself.
But why use a proxy when there is a round-robin DNS algorithm? There are a few reasons for using proxying for Services:
There is a long history of DNS implementations not respecting record TTLs, and caching the results of name lookups after they should have expired.
Some apps do DNS lookups only once and cache the results indefinitely.
Even if apps and libraries did proper re-resolution, the low or zero TTLs on the DNS records could impose a high load on DNS that then becomes difficult to manage.
kube-proxy has several modes:
User space proxy mode - In the userspace mode, the iptables rule forwards to a local port where a Go binary (kube-proxy) is listening for connections. The binary (running in userspace) terminates the connection, establishes a new connection to a backend for the service, and then forwards requests to the backend and responses back to the local process. An advantage of the userspace mode is that, because the connections are created by an application (the proxy itself), the proxy can retry against a different backend if a connection is refused.
Iptables proxy mode - In iptables mode, the iptables rules are installed to directly forward packets that are destined for a service to a backend for the service. This is more efficient than moving the packets from the kernel to kube-proxy and then back to the kernel so it results in higher throughput and better tail latency. The main downside is that it is more difficult to debug, because instead of a local binary that writes a log to /var/log/kube-proxy you have to inspect logs from the kernel processing iptables rules.
IPVS proxy mode - IPVS is a Linux kernel feature that is specifically designed for load balancing. In IPVS mode, kube-proxy programs the IPVS load balancer instead of using iptables. This relies on a mature kernel feature, and IPVS is designed for load balancing large numbers of services; it has an optimized API and an optimized look-up routine rather than a list of sequential rules.
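The "optimized look-up routine rather than a list of sequential rules" point can be illustrated with a toy model in Python. This is not how either kernel subsystem is implemented; it only contrasts walking an ordered rule list with a keyed table lookup. All names and addresses below are hypothetical.

```python
def lookup_sequential(rules, dst):
    """Scan an ordered list of ((vip, port), backends) rules.

    Returns (backends, rules_checked); the cost grows with the rule count.
    """
    for checked, (key, backends) in enumerate(rules, start=1):
        if key == dst:
            return backends, checked
    return None, len(rules)


def lookup_hashed(table, dst):
    """Look up the same mapping in a dict keyed by (vip, port)."""
    return table.get(dst)


# 1000 hypothetical services, each with a single backend.
rules = [((f"10.96.{i // 256}.{i % 256}", 80), [f"pod-{i}"]) for i in range(1000)]
table = dict(rules)
```

Looking up the last service costs 1000 comparisons in the sequential model, while the dict lookup takes roughly the same time regardless of how many services exist.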
You can read more here - good question about proxy mode on StackOverflow, here - comparing proxy modes and here - good article about proxy modes.
As rohatgisanat mentioned in his answer, you can also use a service mesh. Here is also a good article with a Kubernetes service mesh comparison.
I've got a lift 'n shift deployment type (i.e. by no means cloud-native) and I'd like to set up sticky sessions so that the requests keep being handled by the same pod if it's available (from the client's perspective).
Client --> LB --> Ingress --> Service --> Deployment
Due to the fact that LB does SNAT, I think service.spec.sessionAffinityConfig.clientIP will work, but because all the requests would be coming with the same source IP of the loadbalancer, the workload won't be truly balanced across all the pods in the deployment.
Can you think of any way to consider source IP & port pair in the sticky session behavior?
Edit 1: The deployment runs in Oracle Cloud. We're using the Oracle Cloud Loadbalancer service in plain TCP mode (i.e. OSI Layer4).
What the question describes is actually the default traffic management behavior in K8s. The packets within each TCP session target the same pod. The TCP session is initiated from a particular source IP (in our case the LB) and source port (which is different for each session), and this session remains "sticky" for its whole duration.
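A rough way to picture this per-session stickiness is a table keyed by the connection's 5-tuple: the backend is picked once, when the first packet of a session arrives, and every later packet of that session reuses the entry. The Python sketch below is a simplified model of that idea (similar in spirit to connection tracking), not actual Kubernetes or kernel code; the class and parameter names are made up.

```python
import random


class ConnectionTable:
    """Toy model: choose a backend per connection, then stick to it."""

    def __init__(self, backends, rng=None):
        self.backends = list(backends)
        self.rng = rng or random.Random()
        self.flows = {}  # 5-tuple -> chosen backend

    def route(self, src_ip, src_port, dst_ip, dst_port, proto="tcp"):
        key = (proto, src_ip, src_port, dst_ip, dst_port)
        if key not in self.flows:
            # First packet of a new session: pick a backend at random.
            self.flows[key] = self.rng.choice(self.backends)
        # Later packets of the same session reuse the stored choice.
        return self.flows[key]
```

Because the source port is part of the key, many sessions coming from the same load balancer IP can still spread across pods, while each individual session stays on one pod. Note that this covers a single TCP session only; a client that opens a new connection may land on a different pod.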
I have network policies created and applied as per https://github.com/ahmetb/kubernetes-network-policy-recipes, and they are working fine. However, I would like to understand how exactly this is implemented in the back end: how does a network policy allow or deny traffic? By modifying iptables? Which Kubernetes components are involved in implementing this?
"It depends". It's up to whatever controller actually does the setup, which is usually (but not always) part of your CNI plugin.
The most common implementation is Calico's Felix daemon, which supports several backends, but iptables is a common one. Other plugins use eBPF network programs or other firewall subsystems to similar effect.
Network Policy is implemented by network plugins (Calico, for example), most commonly by setting up Linux iptables (Netfilter) rules on the Kubernetes nodes.
From the docs here
In the Calico approach, IP packets to or from a workload are routed and firewalled by the Linux routing table and iptables infrastructure on the workload’s host. For a workload that is sending packets, Calico ensures that the host is always returned as the next hop MAC address regardless of whatever routing the workload itself might configure. For packets addressed to a workload, the last IP hop is that from the destination workload’s host to the workload itself