Can't get to GCE instance from k8s pods on the same subnet - kubernetes

I have a cluster with container range 10.101.64.0/19 on a network A and subnet SA with range 10.101.0.0/18. On the same subnet, there is a VM in GCE with IP 10.101.0.4, and it can be pinged just fine from within the cluster, e.g. from a node with 10.101.0.3. However, if I go to a pod on this node which got the address 10.101.67.191 (which is expected; this node assigns addresses from 10.101.67.0/24 or thereabouts), I don't get a meaningful answer from the VM I want to access from this pod. Using tcpdump on ICMP, I can see that when I ping that VM from the pod, the ping reaches it, but I never receive the echo reply in the pod. It seems like the VM is just throwing it away.
Any idea how to resolve this? Some routes or firewall rules? I am using the same topology in the default subnet created by Kubernetes, where this works, but I cannot find anything relevant that could explain the difference (there are some routes and firewall rules which could influence it, but I wasn't successful when trying to mimic them in my subnet).

I think it is a firewall issue.
I've already provided a solution for this on Stack Overflow here.
It may help with your case.
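If it is a firewall issue, a minimal sketch of the kind of rule that usually fixes it would look like the following; the rule name is a placeholder, the network and source range are taken from the question (net A, container range 10.101.64.0/19), and you may want to narrow --rules to specific protocols:

    gcloud compute firewall-rules create allow-pods-to-vms \
      --network=A \
      --direction=INGRESS \
      --action=ALLOW \
      --rules=all \
      --source-ranges=10.101.64.0/19

This mirrors what the auto-created rules in the default network do: they allow traffic whose source is the pod CIDR, which is why the same topology works there.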

Related

External IP of Kubernetes node is missing

I recently added a new Linux node to our on-prem, bare-bones Kubernetes cluster, and it does not get assigned an external IP the way the rest of the Linux worker nodes have in the past. I'm not sure what could be causing this. The only reason this is an issue is that if our ingress controller gets scheduled on a node with no external IP, DNS resolution slows down 10x. Have others experienced this issue? If so, how did you get an external IP assigned to the node? We can always taint the node so that it doesn't get the ingress controller, but that would be sub-optimal for us.
Here are the results of running "kubectl get nodes -o wide" (screenshot omitted).
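As a side note, a hedged way to double-check exactly which addresses the new node reports (the node name is a placeholder):

    kubectl get node <new-node-name> \
      -o jsonpath='{range .status.addresses[*]}{.type}{"\t"}{.address}{"\n"}{end}'

If no ExternalIP line appears, the cloud provider integration (or whatever normally populates it on the other nodes) never set one for this node.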

Why does a K8s app fail to connect to MongoDB Atlas? - persist k8s node IPs

I'm just trying to get an app on k8s to connect to MongoDB Atlas.
So far I have tried the following:
Changed the dnsPolicy to Default, and many other settings - no luck
Created an nginx-ingress link - so I have the main IP address of the cluster
Added that IP to the IP access list - but still no luck
The cluster tier is M2 - so no private peering or private endpoints.
The Deployment/Pod that is trying to connect will not have a DNS name assigned to it; it is simply a service running inside of the k8s cluster and processing RabbitMQ messages.
So I'm not sure what I should whitelist if the service is never exposed.
I assume it would have to be something with the nodes or k8s egress, but I'm not sure where to even look.
I've tried pretty much everything I could and still cannot find clear documentation on how to achieve the desired result, apart from whitelisting all IP addresses.
UPDATE: Managed to find this article: https://www.digitalocean.com/community/questions/urgent-how-to-connect-to-mongodb-atlas-cluster-from-a-kubernetes-pod
So now I'm trying to find a way to persist the node IP addresses; as I understand it, scaling nodes up or down, or upgrading them, will create new IP addresses.
So is there a way to persist them?
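One hedged way to see exactly which source IP Atlas observes is to check the egress IP from inside the cluster; the curl image and the echo service below are just illustrative choices:

    kubectl run egress-check --rm -it --restart=Never \
      --image=curlimages/curl -- curl -s https://ifconfig.me

Whatever address that prints (typically a node's public IP, or a NAT gateway's IP if one sits in front of the nodes) is what actually needs to be on the Atlas access list.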

Unable to access Kubernetes service from one cluster to another (over VPC peering)

I'm wondering if anyone can help with my issue, here's the setup:
We have 2 separate kubernetes clusters in GKE, running on v1.17, and they each sit in a separate project
We have set up VPC peering between the two projects
On cluster 1, we have 'service1' which is exposed by an internal HTTPS load balancer, we don't want this to be public
On cluster 2, we intend on being able to access 'service1' via the internal load balancer, and it should do this over the VPC peering connection between the two projects
Here's the issue:
When I'm connected via SSH on a GKE node in cluster 2, I can successfully run a curl request to https://service1.domain.com running on cluster 1 and get the expected response, so traffic is definitely routing from cluster 2 > cluster 1. However, when I run the same curl command from a pod running on a GKE node, the request times out.
I have done as much troubleshooting as I can, including telnet, traceroute, etc., and I'm really stuck on why this might be. If anyone can shed light on the difference here, that would be great.
I did wonder whether pod networking is somehow forwarding traffic over the cluster's public IP rather than over the VPC peering connection.
So it seems you're not using a "VPC-native" cluster and what you need is "IP masquerading".
From this document:
"A GKE cluster uses IP masquerading so that destinations outside of the cluster only receive packets from node IP addresses instead of Pod IP addresses. This is useful in environments that expect to only receive packets from node IP addresses."
You can use ip-masq-agent or k8s-custom-iptables. After that it will work, since it will be as if you're making the call from the node rather than from inside the pod.
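As a minimal sketch, assuming the agent reads the standard ip-masq-agent ConfigMap in kube-system and its DaemonSet is (or will be) running in the cluster: the CIDRs below are placeholders for the local cluster's own ranges, and anything not listed (including the peered VPC) will then see the node IP instead of the pod IP.

    # placeholder CIDRs: keep pod IPs only for in-cluster traffic
    cat <<'EOF' > config
    nonMasqueradeCIDRs:
      - 10.4.0.0/14     # this cluster's pod range (placeholder)
      - 10.0.32.0/20    # this cluster's service range (placeholder)
    masqLinkLocal: false
    resyncInterval: 60s
    EOF
    kubectl create configmap ip-masq-agent -n kube-system --from-file=config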
As mentioned in one of the answers, IP aliases (VPC-native) should work out of the box. If you are using a route-based GKE cluster rather than a VPC-native one, you need to export custom routes.
As per this document:
"By default, VPC Network Peering with GKE is supported when used with IP aliases. If you don't use IP aliases, you can export custom routes so that GKE containers are reachable from peered networks."
This is also explained in this document:
"If you have GKE clusters without VPC native addressing, you might have multiple static routes to direct traffic to VM instances that are hosting your containers. You can export these static routes so that the containers are reachable from peered networks."
The problem you're facing seems similar to the one mentioned in this SO question; perhaps your pods are using IPs outside of the VPC range and for that reason cannot access the peered VPC?
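If you do end up on the custom-routes path, a hedged sketch with placeholder peering and network names would be to export the routes on the network hosting the containers and import them on the peer:

    # on the project/network hosting the GKE routes
    gcloud compute networks peerings update peer-to-cluster2 \
      --network=cluster1-network --export-custom-routes
    # on the peered project/network
    gcloud compute networks peerings update peer-to-cluster1 \
      --network=cluster2-network --import-custom-routes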
UPDATE: In Google Cloud, I tried to access the service from another cluster which had VPC-native networking enabled, which I believe allows pods to use the VPC routing and possibly the internal IPs.
Problem solved :-)

DNS problem on AWS EKS when running in private subnets

I have an EKS cluster setup in a VPC. The worker nodes are launched in private subnets. I can successfully deploy pods and services.
However, I'm not able to perform DNS resolution from within the pods. (It works fine on the worker nodes, outside the container.)
Troubleshooting using https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/ results in the following from nslookup (timeout after a minute or so):
Server: 172.20.0.10
Address 1: 172.20.0.10
nslookup: can't resolve 'kubernetes.default'
When I launch the cluster in an all-public VPC, I don't have this problem. Am I missing any necessary steps for DNS resolution from within a private subnet?
Many thanks,
Daniel
I feel like I have to give this a proper answer, because coming upon this question was the answer to 10 straight hours of debugging for me. As @Daniel said in his comment, the issue I found was my ACL blocking outbound traffic on UDP port 53, which Kubernetes uses to resolve DNS records.
The process was especially confusing for me because one of my pods actually worked the entire time, since (I think?) it happened to be in the same zone as the Kubernetes DNS resolver.
To elaborate on the comment from @Daniel, you need:
an ingress rule for UDP port 53
an ingress rule for UDP on ephemeral ports (e.g. 1025–65535)
I hadn't added (2) and was seeing CoreDNS receiving requests and trying to respond, but the response wasn't getting back to the requester.
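A hedged AWS CLI sketch of those two NACL entries (the ACL ID, rule numbers, and CIDR block are placeholders for your VPC):

    # (1) allow DNS queries into the subnet hosting CoreDNS
    aws ec2 create-network-acl-entry --network-acl-id acl-0123456789abcdef0 \
      --ingress --rule-number 100 --protocol udp \
      --port-range From=53,To=53 --cidr-block 10.0.0.0/16 --rule-action allow
    # (2) allow the responses back in on ephemeral ports
    aws ec2 create-network-acl-entry --network-acl-id acl-0123456789abcdef0 \
      --ingress --rule-number 110 --protocol udp \
      --port-range From=1025,To=65535 --cidr-block 10.0.0.0/16 --rule-action allow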
A tip for others dealing with these kinds of issues: turn on CoreDNS logging by adding the log configuration to the ConfigMap, which I was able to do with kubectl edit configmap -n kube-system coredns. See the CoreDNS docs on this: https://github.com/coredns/coredns/blob/master/README.md#examples. This can help you figure out whether the issue is with CoreDNS receiving the queries or with the responses getting back.
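For illustration, the log plugin just needs to appear inside the server block of the coredns ConfigMap; a minimal sketch, with the rest of the Corefile left unchanged:

    .:53 {
        log            # log every DNS query and response
        errors
        # ...rest of the default Corefile unchanged...
    }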
I ran into this as well. I have multiple node groups, and each one was created from a CloudFormation template. The CloudFormation template created a security group for each node group that allowed the nodes in that group to communicate with each other.
The DNS error resulted from Pods running in separate node groups from the CoreDNS Pods, so the Pods were unable to reach CoreDNS (network communications were only permitted within node groups). I will make a new CloudFormation template for the node security group so that all the node groups in my cluster can share the same security group.
I resolved the issue for now by allowing inbound UDP traffic on port 53 for each of my node group security groups.
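A hedged AWS CLI sketch of that stop-gap rule (the security group IDs are placeholders; it allows DNS traffic from the other node group's security group):

    aws ec2 authorize-security-group-ingress \
      --group-id sg-0aaaaaaaaaaaaaaaa \
      --protocol udp --port 53 \
      --source-group sg-0bbbbbbbbbbbbbbbb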
I have been struggling with this issue as well for a couple of hours, I think; I lost track of time.
Since I am using the default VPC but with the worker nodes inside a private subnet, it wasn't working.
I went through the amazon-vpc-cni-k8s repository and found the solution.
We have to set the environment variable AWS_VPC_K8S_CNI_EXTERNALSNAT=true on the aws-node DaemonSet.
You can either grab the updated YAML and apply it, or just fix it through the dashboard. However, for it to work you have to restart the worker node instances so the IP route tables are refreshed.
The issue link is here.
Thanks!
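A minimal sketch of doing that with kubectl, assuming the CNI runs as the usual aws-node DaemonSet in kube-system:

    kubectl set env daemonset aws-node -n kube-system \
      AWS_VPC_K8S_CNI_EXTERNALSNAT=true
    # confirm the variable is set
    kubectl set env daemonset aws-node -n kube-system --list | grep EXTERNALSNAT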
Re: AWS EKS Kube Cluster and Route53 internal/private Route53 queries from pods
Just wanted to post a note on what we needed to do to resolve our issues. Noting that YMMV and everyone has different environments and resolutions, etc.
Disclaimer:
We're using the community terraform eks module to deploy/manage vpcs and the eks clusters. We didn't need to modify any security groups. We are working with multiple clusters, regions, and VPC's.
ref:
Terraform EKS module
CoreDNS Changes:
We have a DNS relay for private internal zones, so we needed to modify the coredns ConfigMap and add in the DNS relay IP address:
...
ec2.internal:53 {
    errors
    cache 30
    forward . 10.1.1.245
}
foo.dev.com:53 {
    errors
    cache 30
    forward . 10.1.1.245
}
foo.stage.com:53 {
    errors
    cache 30
    forward . 10.1.1.245
}
...
VPC DHCP option sets:
Update it with the IP of the relay server above, if applicable; this requires creating a new option set, since existing ones cannot be modified.
Our DHCP options set looks like this:
["AmazonProvidedDNS", "10.1.1.245", "169.254.169.253"]
ref: AWS DHCP Option Sets
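For reference, a hedged AWS CLI sketch of recreating and attaching such an option set (the IDs are placeholders; the DNS values mirror the set above):

    aws ec2 create-dhcp-options --dhcp-configurations \
      'Key=domain-name-servers,Values=AmazonProvidedDNS,10.1.1.245,169.254.169.253'
    # attach the newly created set to the VPC
    aws ec2 associate-dhcp-options \
      --dhcp-options-id dopt-0123456789abcdef0 --vpc-id vpc-0123456789abcdef0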
Route-53 Updates:
Associate every Route53 zone with the VPC ID it needs to be resolvable from (i.e. the VPC where our kube cluster resides and from which the pods will make queries).
There is also a Terraform resource for that:
https://www.terraform.io/docs/providers/aws/r/route53_zone_association.html
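Outside Terraform, the equivalent hedged one-liner per zone (the zone ID, region, and VPC ID are placeholders) would be:

    aws route53 associate-vpc-with-hosted-zone \
      --hosted-zone-id Z0123456789EXAMPLE \
      --vpc VPCRegion=us-east-1,VPCId=vpc-0123456789abcdef0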
We ran into a similar issue where DNS resolution timed out on some of the pods, but re-creating the pod a couple of times resolved the problem. Also, it's not every pod on a given node that shows issues, only some pods.
It turned out to be due to a bug in version 1.5.4 of the Amazon VPC CNI; more details here: https://github.com/aws/amazon-vpc-cni-k8s/issues/641.
The quick solution is to revert to the recommended version 1.5.3: https://docs.aws.amazon.com/eks/latest/userguide/update-cluster.html
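To check which CNI version a cluster is actually running before reverting, something like this should work:

    kubectl get daemonset aws-node -n kube-system \
      -o jsonpath='{.spec.template.spec.containers[0].image}'
    # expect an image tag such as ...amazon-k8s-cni:v1.5.x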
Like many others, I've been struggling with this bug for a few hours.
In my case the issue was this bug, https://github.com/awslabs/amazon-eks-ami/issues/636, which basically sets up an incorrect DNS server when you specify the endpoint and certificate manually.
To confirm, check:
That you have connectivity (NACLs and security groups) allowing DNS on TCP and UDP. For me, the easiest way was to SSH into a node and see if it resolves (nslookup). If it doesn't resolve, it is most likely either a NACL or an SG issue, but also check that the DNS nameserver on the node is configured correctly.
If you can get name resolution on the node, but not inside the pod, check that the nameserver in the pod's /etc/resolv.conf points to an IP in your service network (if you see 172.20.0.10, your service network should be 172.20.0.0/24 or so).
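A hedged sketch of those two checks (the pod name is a placeholder; 172.20.0.10 is the cluster DNS IP shown in the question above):

    # on the node, via SSH: ask the cluster DNS service directly
    nslookup kubernetes.default.svc.cluster.local 172.20.0.10
    # inside a pod: confirm which nameserver it is actually using
    kubectl exec -it <some-pod> -- cat /etc/resolv.conf
    kubectl exec -it <some-pod> -- nslookup kubernetes.default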

Kubernetes service ip isn't always accessible within cluster (with flannel)

I built a Kubernetes cluster using the flannel overlay network. The problem is that one of the service IPs isn't always accessible.
I tested from within the cluster by telnetting to the service IP and port, which ended in a connection timeout. Checking with netstat, the connection was always in the "SYN_SENT" state; it seemed the peer didn't accept the connection.
But if I telnet directly to the IP and port of a pod backing the service, the connection is made successfully.
It only happens for one of the services; the other services are fine.
And if I scale the backend up to a larger number of replicas, like 2, then some of the requests to the service IP succeed. It seems the service isn't able to connect to one of the backing pods.
Which component may be the cause of such a problem? My service configuration, kube-proxy, or flannel?
Check the discussion here: https://github.com/kubernetes/kubernetes/issues/38802
It's required to set sysctl net.bridge.bridge-nf-call-iptables=1 on the nodes.
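For reference, a minimal sketch of applying and persisting that setting on each node (the br_netfilter module must be loaded for the sysctl key to exist; the sysctl.d file name is arbitrary):

    # load the bridge netfilter module so the sysctl key exists
    sudo modprobe br_netfilter
    # apply immediately
    sudo sysctl net.bridge.bridge-nf-call-iptables=1
    # persist across reboots
    echo 'net.bridge.bridge-nf-call-iptables = 1' | sudo tee /etc/sysctl.d/99-kubernetes.conf
    sudo sysctl --system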