Accessing GCP Memorystore from Kubernetes

I'm trying to connect to Google Cloud Memorystore from a Kubernetes pod, but I always get a connection timeout error.
After investigating I found the following:
when I try to connect to Redis from a pod scheduled on the normal node pool, it works fine,
but when I try to connect to Redis from a pod scheduled on the preemptible node pool, it fails with a connection timeout error.
So how can I solve this problem?

It's a bit hard to give an answer with the little information you gave; we don't know anything about the configuration of your cluster.
I'm not sure if I'm totally wrong, but this may help.
Normal versus preemptible nodes should not have any effect on network connections if the nodes are on the same network. What could cause this for GKE pods is that Memorystore works by creating a VPC peering, and GKE works in a somewhat similar way, which prevents Memorystore and the pods from speaking to one another, since two peerings can't exchange traffic with each other.
What should be done in this case is to use IP aliasing (a VPC-native cluster) at GKE creation time: https://cloud.google.com/kubernetes-engine/docs/how-to/alias-ips
Hope this helps.
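For reference, this is roughly how a VPC-native (alias IP) cluster can be created with gcloud; the cluster name, zone, and network below are placeholders and the exact flags may vary by gcloud version:

    # Create a VPC-native GKE cluster so pod IPs come from alias ranges
    # instead of custom routes (names below are placeholders)
    gcloud container clusters create my-cluster \
        --zone us-central1-a \
        --network default \
        --enable-ip-alias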

Related

Issues with outbound connections from pods on GKE cluster with NAT (and router)

I'm trying to investigate an issue with random 'Connection reset by peer' errors and long (up to 2 minutes) PDO connection initializations, but I'm failing to find a solution.
Similar issue: https://kubernetes.io/blog/2019/03/29/kube-proxy-subtleties-debugging-an-intermittent-connection-reset/, but that is supposed to be fixed in the version of Kubernetes that I'm running.
GKE config details:
GKE is running version 1.20.12-gke.1500, with a NAT network configuration and a router. The cluster has 2 nodes, and the router has 2 static IPs assigned, with dynamic port allocation and a range of 32728-65536 ports per VM.
On the Kubernetes side:
deployments: a Docker image with local nginx, php-fpm, and the Google Cloud SQL proxy
services: a LoadBalancer to expose the deployment
To replicate the issue, I created a simple script that connects to the database in a loop and runs a simple count query. I ruled out issues with the database server by testing the script on a standalone GCE VM, where I didn't get any errors. When I run the script on any of the application pods in the cluster, I get random 'Connection reset by peer' errors. I have tested the script both through the Cloud SQL proxy service and against the direct database IP, with the same random connection issues.
Any help would be appreciated.
Update
On https://cloud.google.com/kubernetes-engine/docs/release-notes I can see that a fix has been released that potentially addresses what I'm seeing: "The following GKE versions fix a known issue in which random TCP connection resets might happen for GKE nodes that use Container-Optimized OS with Docker (cos). To fix the issue, upgrade your nodes to any of these versions:"
I'm updating the nodes this evening, so I hope that will solve the issue.
Update
Updating the nodes solved the random connection resets.
Upgrading the cluster and nodes to version 1.20.15-gke.3400 using the Google Cloud console resolved the issue.
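For anyone preferring the CLI over the console, a rough equivalent of that upgrade (cluster name, zone, and node pool name are placeholders) would be:

    # Upgrade the control plane first, then the node pool, to the fixed version
    gcloud container clusters upgrade my-cluster \
        --zone us-central1-a \
        --master \
        --cluster-version 1.20.15-gke.3400
    gcloud container clusters upgrade my-cluster \
        --zone us-central1-a \
        --node-pool default-pool \
        --cluster-version 1.20.15-gke.3400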

GKE Autopilot failing to create new deployment after 10-12 deployments saying "Insufficient CPU"

I am having an issue with GKE (Autopilot).
I am deploying StatefulSets, and for each StatefulSet I deploy a Service with a public IP.
But after deploying around 10-12 StatefulSets, if I try to deploy any more, the new one stays red (Unschedulable) with the message "Insufficient cpu".
When I go to the cluster section, it shows a different message:
Can't scale up because instances in managed instance groups hosting node pools ran out of IPs
Image of error: https://i.imgur.com/t8I4Yij.png
I am new to GKE and tried doing what was suggested in the links in that image, but most of the steps give an error saying they are not supported in Autopilot mode.
Any help/suggestion is appreciated.
Thanks.
If you are on GKE Autopilot, ideally it will create new nodes in the cluster when it runs out of CPU or has no space left to schedule the Pods.
However, if it is an IP exhaustion issue, you can read more here: https://cloud.google.com/kubernetes-engine/docs/how-to/alias-ips#not_enough_space
Cluster autoscaler might not have enough unallocated IP address space to use to add new nodes or Pods, resulting in scale-up failures, which are indicated by eventResult events with the reason scale.up.error.ip.space.exhausted. You can add more IP addresses for nodes by expanding the primary subnet, or add new IP addresses for Pods using discontiguous multi-Pod CIDR. For more information, see Not enough free IP space for Pods.
But since you are on Autopilot, you won't be able to access the cluster's underlying subnet and node pools.
Unfortunately, the only option at this point is to create a new cluster and make sure that the CIDR ranges you assign to the cluster have enough available IPs for the number of nodes you believe you'll need. The default setting for Autopilot should be enough.
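A hedged sketch of creating the replacement Autopilot cluster with explicitly sized ranges (name, region, and CIDRs are placeholders; double-check the available flags for your gcloud version):

    # New Autopilot cluster with explicitly chosen pod and service ranges
    gcloud container clusters create-auto my-new-cluster \
        --region us-central1 \
        --cluster-ipv4-cidr 10.8.0.0/16 \
        --services-ipv4-cidr 10.12.0.0/20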

GCP kubernetes cluster node error NetworkUnavailable

I am trying to set up a Kubernetes cluster on GCP with Terraform. The Terraform script contains the VPC (firewall, subnet, default route) and the Kubernetes cluster.
Randomly I get a "NetworkUnavailable" condition on a cluster node, but the same Terraform script works fine on the next run.
So there is no issue with the Terraform itself, and I don't know how to resolve this. If I run the script 10 times, provisioning fails 4 to 5 times.
Error waiting for creating GKE NodePool: All cluster resources were brought up, but the cluster API is reporting that: only 3 nodes out of 4 have registered; cluster may be unhealthy.
Please help me.
Thanks
Shrwan
This is a fairly common issue when using Terraform to create GKE clusters. If you create the cluster manually through the GKE API, you won't have the same error.
Note that when creating a GKE cluster, you only need to create the cluster itself. It is not necessary to create firewall rules or routes, as the GKE API will create them during cluster creation.
Most of the time, this error message means that the nodes are unable to communicate with the control plane, which is usually linked to an issue with the network configuration.
If you are creating a zonal cluster, you might also hit this issue; I'll add this third point as well, since it is another root cause for the same error.
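As a sanity check, you can create an equivalent cluster by hand with gcloud and let GKE manage the network plumbing itself; a minimal sketch (all names are placeholders):

    # Regional cluster created directly with gcloud; no manual firewall rules
    # or routes are needed, GKE creates what it requires during creation
    gcloud container clusters create my-cluster \
        --region us-central1 \
        --network my-vpc \
        --subnetwork my-subnet \
        --num-nodes 1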

DNS problem on AWS EKS when running in private subnets

I have an EKS cluster setup in a VPC. The worker nodes are launched in private subnets. I can successfully deploy pods and services.
However, I'm not able to perform DNS resolution from within the pods. (It works fine on the worker nodes, outside the container.)
Troubleshooting using https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/ results in the following from nslookup (timeout after a minute or so):
Server: 172.20.0.10
Address 1: 172.20.0.10
nslookup: can't resolve 'kubernetes.default'
When I launch the cluster in an all-public VPC, I don't have this problem. Am I missing any necessary steps for DNS resolution from within a private subnet?
Many thanks,
Daniel
I feel like I have to give this a proper answer, because coming across this question was the answer to 10 straight hours of debugging for me. As #Daniel said in his comment, the issue I found was my ACL blocking outbound traffic on UDP port 53, which Kubernetes uses to resolve DNS records.
The process was especially confusing for me because one of my pods actually worked the entire time, since (I think?) it happened to be in the same zone as the Kubernetes DNS resolver.
To elaborate on the comment from #Daniel, you need:
an ingress rule for UDP port 53
an ingress rule for UDP on ephemeral ports (e.g. 1025-65535)
I hadn't added (2) and was seeing CoreDNS receiving requests and trying to respond, but the response wasn't getting back to the requester; the corresponding ACL entries are sketched below.
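Since network ACLs are stateless, both rules have to be added explicitly. A rough sketch with the AWS CLI, where the ACL ID, rule numbers, and CIDR are placeholders for your own values:

    # NACL entries are stateless, so you need explicit inbound rules both for
    # DNS (UDP 53) and for the UDP ephemeral ports used by the responses
    aws ec2 create-network-acl-entry \
        --network-acl-id acl-0123456789abcdef0 \
        --ingress --rule-number 100 --rule-action allow \
        --protocol 17 --port-range From=53,To=53 \
        --cidr-block 10.0.0.0/16
    aws ec2 create-network-acl-entry \
        --network-acl-id acl-0123456789abcdef0 \
        --ingress --rule-number 110 --rule-action allow \
        --protocol 17 --port-range From=1025,To=65535 \
        --cidr-block 10.0.0.0/16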
Some tips for others dealing with these kinds of issues: turn on CoreDNS logging by adding the log plugin to the ConfigMap, which I was able to do with kubectl edit configmap -n kube-system coredns. See the CoreDNS docs on this: https://github.com/coredns/coredns/blob/master/README.md#examples. This can help you figure out whether the issue is CoreDNS receiving queries or sending the responses back.
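The logging change itself is just the log plugin near the top of the server block in the Corefile held by that ConfigMap, roughly like this (the rest of your Corefile stays as it is):

    .:53 {
        log        # add this line to log every query CoreDNS handles
        errors
        ...        # the remaining plugins stay unchanged
    }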
I ran into this as well. I have multiple node groups, and each one was created from a CloudFormation template. The CloudFormation template created a security group for each node group that allowed the nodes in that group to communicate with each other.
The DNS errors resulted from Pods running in node groups separate from the CoreDNS Pods, so those Pods were unable to reach CoreDNS (network communication was only permitted within a node group). I will make a new CloudFormation template for the node security group so that all the node groups in my cluster can share the same security group.
For now, I have resolved the issue by allowing inbound UDP traffic on port 53 in each of my node group security groups.
So I had been struggling with this issue as well for a couple of hours, I think; I lost track of time.
Since I am using the default VPC but with the worker nodes inside a private subnet, it wasn't working.
I went through amazon-vpc-cni-k8s and found the solution.
We have to set the environment variable AWS_VPC_K8S_CNI_EXTERNALSNAT=true on the aws-node DaemonSet.
You can either get the new YAML and apply it, or just fix it through the dashboard. However, for it to take effect you have to restart the worker node instances so the IP route tables are refreshed.
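One way to set that variable is with kubectl (a sketch; you can also edit the DaemonSet manifest directly):

    # Enable external SNAT on the AWS VPC CNI so return traffic from the VPC
    # can reach pods in private subnets
    kubectl set env daemonset/aws-node -n kube-system AWS_VPC_K8S_CNI_EXTERNALSNAT=true

    # Verify the variable is now present on the DaemonSet
    kubectl describe daemonset aws-node -n kube-system | grep AWS_VPC_K8S_CNI_EXTERNALSNAT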
The issue link is here.
Thanks.
Re: AWS EKS Kube Cluster and Route53 internal/private Route53 queries from pods
Just wanted to post a note on what we needed to do to resolve our issues. Note that YMMV and everyone has different environments, resolutions, etc.
Disclaimer:
We're using the community Terraform EKS module to deploy/manage VPCs and the EKS clusters. We didn't need to modify any security groups. We are working with multiple clusters, regions, and VPCs.
ref:
Terraform EKS module
CoreDNS Changes:
We have a DNS relay for private internal DNS, so we needed to modify the coredns ConfigMap and add the DNS relay IP address:
...
ec2.internal:53 {
    errors
    cache 30
    forward . 10.1.1.245
}
foo.dev.com:53 {
    errors
    cache 30
    forward . 10.1.1.245
}
foo.stage.com:53 {
    errors
    cache 30
    forward . 10.1.1.245
}
...
VPC DHCP option sets:
Update it with the IP of the above relay server, if applicable. This requires regeneration of the option set, as DHCP option sets cannot be modified.
Our DHCP options set looks like this:
["AmazonProvidedDNS", "10.1.1.245", "169.254.169.253"]
ref: AWS DHCP Option Sets
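A hedged sketch of that with the AWS CLI (the IDs are placeholders); because option sets are immutable, you create a new one and point the VPC at it:

    # Create a new DHCP options set with the relay added, then attach it
    aws ec2 create-dhcp-options \
        --dhcp-configurations "Key=domain-name-servers,Values=AmazonProvidedDNS,10.1.1.245,169.254.169.253"
    aws ec2 associate-dhcp-options \
        --dhcp-options-id dopt-0123456789abcdef0 \
        --vpc-id vpc-0123456789abcdef0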
Route-53 Updates:
Associate every Route 53 private hosted zone with the VPC ID it needs to be associated with (the VPC where our kube cluster resides and from which the pods will make queries).
There is also a Terraform module for that:
https://www.terraform.io/docs/providers/aws/r/route53_zone_association.html
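If you do a one-off association rather than Terraform, the equivalent with the AWS CLI (zone ID and VPC ID are placeholders) is:

    # Associate a private hosted zone with the VPC the cluster runs in
    aws route53 associate-vpc-with-hosted-zone \
        --hosted-zone-id Z0123456789ABCDEFGHIJ \
        --vpc VPCRegion=us-east-1,VPCId=vpc-0123456789abcdef0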
We had run into a similar issue where DNS resolution timed out on some of the pods, but re-creating the pod a couple of times resolved the problem. Also, it was not every pod on a given node showing the issue, only some pods.
It turned out to be due to a bug in version 1.5.4 of the Amazon VPC CNI; more details here: https://github.com/aws/amazon-vpc-cni-k8s/issues/641.
The quick solution is to revert to the recommended version 1.5.3: https://docs.aws.amazon.com/eks/latest/userguide/update-cluster.html
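To check which CNI version is actually running before and after the change, something like this should do (a sketch):

    # Show the image tag of the VPC CNI plugin currently deployed
    kubectl describe daemonset aws-node -n kube-system | grep amazon-k8s-cni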
Like many others, I've been struggling with this bug for a few hours.
In my case the issue was this bug, https://github.com/awslabs/amazon-eks-ami/issues/636, which basically sets up an incorrect DNS configuration when you specify the endpoint and certificate but not the cluster DNS IP.
To confirm, check:
that you have connectivity (NACLs and security groups) allowing DNS on both TCP and UDP. For me the best way was to SSH into a node and see if it resolves names (nslookup). If it doesn't resolve, it is most likely either the NACL or the SG, but also check that the DNS nameserver on the node is configured correctly.
If you can get name resolution on the node but not inside the pod, check that the nameserver in /etc/resolv.conf points to an IP in your service network (if you see 172.20.0.10, your service network should be 172.20.0.0/24 or so). A few example commands are sketched below.
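Example commands for those checks (the pod name is a placeholder; any running pod will do):

    # On the node (via SSH): does the node itself resolve names?
    cat /etc/resolv.conf
    nslookup amazon.com

    # Inside a pod: resolv.conf should point at the cluster DNS service IP
    kubectl exec -it my-pod -- cat /etc/resolv.conf
    kubectl exec -it my-pod -- nslookup kubernetes.default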

Can't get to GCE instance from k8s pods on the same subnet

I have a cluster with container range 10.101.64.0/19 on a network A and subnetwork SA with range 10.101.0.0/18. On the same subnet there is a GCE VM with IP 10.101.0.4, and it can be pinged just fine from within the cluster, e.g. from a node with IP 10.101.0.3. However, if I go to a pod on this node, which got the address 10.101.67.191 (which is expected; this node assigns addresses from 10.101.67.0/24 or so), I don't get a meaningful answer from the VM I want to access from this pod. Using tcpdump on ICMP, I can see that when I ping that VM from the pod, the ping gets there, but I don't receive the reply in the pod. It seems like the VM is just throwing it away.
Any idea how to resolve this? Some routes or firewall rules? I am using the same topology in the default subnet created by Kubernetes, where this works, but I cannot find anything relevant that could explain the difference (there are some routes and firewall rules that could influence it, but I wasn't successful when trying to mimic them in my subnet).
I think it is a firewall issue.
I've already provided a solution for this on Stack Overflow.
It may help to solve your case.
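If it is indeed the firewall, the usual fix is an ingress rule on the VM's network that allows traffic from the pod range; a sketch using the ranges from the question (the rule name is a placeholder):

    # Allow traffic from the cluster's pod range to instances on network A
    gcloud compute firewall-rules create allow-pods-to-vms \
        --network A \
        --direction INGRESS \
        --action ALLOW \
        --rules icmp,tcp,udp \
        --source-ranges 10.101.64.0/19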