I use the default network configuration and try to run a standard cluster with 1 master and 2 workers, but it always fails. The worker nodes fail to make RPC calls to the master, or vice versa. I also get an info message on the cluster page notifying me that:
The firewall rules for specified network or subnetwork would likely
not permit sufficient VM-to-VM communication for Dataproc to function
properly
The error messages are as follows:
Cannot start master: Insufficient number of DataNodes reporting Worker
cluster-597c-w-0 unable to register with master cluster-597c-m. This
could be because it is offline, or network is misconfigured. Worker
cluster-597c-w-1 unable to register with master cluster-597c-m. This
could be because it is offline, or network is misconfigured.
This happens even though I use the default configurations.
So I found there was a misconfiguration in the firewall rule: it was only allowing TCP ports 1 to 33535, so I changed the upper bound to 65535.
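For anyone hitting the same warning, the fix can be applied with gcloud (a minimal sketch, assuming the cluster is on the default network and the internal rule has the usual name default-allow-internal; verify the rule name in your project first):

gcloud compute firewall-rules list --filter="network=default"
gcloud compute firewall-rules update default-allow-internal --allow=tcp:0-65535,udp:0-65535,icmp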
In my GKE cluster, I have a node pool with 2 nodes.
Also, I have an external dedicated server with a database.
During a Google maintenance event, one of the nodes in this node pool was replaced with a new one.
After this, my pods on the new node can't connect to my external server with the error 'no route to host'.
But pods that are located on the old node can connect to my external server without any problems.
So the problem is with the new node only.
The network and firewall settings in the cluster are the defaults.
As a result, I have 2 nodes in the node pool, but only one of them works correctly.
The hotfix that works is to replace the problem node with a new one.
But a new node only works correctly about 50% of the time.
If the new one doesn't work either, I repeat this step until I get a node that works correctly.
I think it's a very bad solution.
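For reference, the replacement hotfix roughly amounts to draining the node and deleting its VM so the node pool's managed instance group recreates it (a sketch; the node name and zone are illustrative placeholders):

kubectl drain gke-mycluster-pool-1-abcd1234-wxyz --ignore-daemonsets --delete-emptydir-data
kubectl delete node gke-mycluster-pool-1-abcd1234-wxyz
gcloud compute instances delete gke-mycluster-pool-1-abcd1234-wxyz --zone=europe-west1-b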
I added a connectivity test from this problem node to my external dedicated server and database port.
The test says that everything is OK and the destination is reached.
But when I connect to the node over SSH and try to reach the external dedicated server with telnet, I get the same 'no route to host' as the pods on that node.
How is that possible?
Things that also didn't help:
An additional firewall rule allowing all traffic to my dedicated server
Adding ip-masq-agent
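For anyone debugging the same symptom, a couple of routing checks worth running over SSH on the broken node (a sketch; 203.0.113.10 and 5432 are placeholders for the dedicated server's IP and database port):

# 'no route to host' usually means either a missing route or a REJECT rule answering with icmp-host-unreachable
ip route get 203.0.113.10
sudo iptables -S OUTPUT
sudo iptables -t nat -S POSTROUTING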
I have a cluster provisioned using KubeSpray on AWS. It has two bastions, one controller, one worker, and one etcd server.
I am seeing endless messages in the APISERVER logs:
http: TLS handshake error from 10.250.227.53:47302: EOF
They come from two IP addresses, 10.250.227.53 and 10.250.250.158. The port numbers change every time.
None of the cluster nodes correspond to those two IP addresses. The subnet cidr ranges are shown below.
The cluster seems stable. This behavior does not seem to have any negative effect. But I don't like having random HTTPS requests hitting the API server.
How can I debug this issue?
They're from the health check configured on the AWS ELB; you can stop those messages by changing the health check configuration to HTTPS:6443/healthz instead of the plain TCP check it is likely using now.
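If the ELB was created outside of Kubernetes (for example by Terraform), the change can also be made from the CLI for a classic ELB (a sketch; the load balancer name is an illustrative placeholder):

aws elb configure-health-check --load-balancer-name kube-apiserver-lb --health-check Target=HTTPS:6443/healthz,Interval=10,Timeout=5,UnhealthyThreshold=2,HealthyThreshold=2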
How can I debug this issue?
Aside from generally being cognizant of how your cluster was installed, and observing that those connections come at regular intervals, I would bet that those two IP addresses belong to the two ENIs allocated to the ELB in each public subnet (they'll show up in the Network Interfaces list on the console as "owner: elasticloadbalancer" or something similar).
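You can confirm that guess from the CLI by looking up which network interface owns one of those source addresses (a sketch; set the region to match your VPC):

aws ec2 describe-network-interfaces --region us-east-1 --filters Name=addresses.private-ip-address,Values=10.250.227.53 --query 'NetworkInterfaces[].{Description:Description,Requester:RequesterId}'

For an ELB-owned ENI the description will start with "ELB" and the requester ID will belong to the load balancer service.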
I need to understand what keepalive is in the nginx.conf inside an nginx-ingress-controller container (/etc/nginx/nginx.conf). What does keepalive do to the upstream servers (i.e. the load balancers)?
keepalived enables a virtual IP (VIP) in a Kubernetes cluster. It is particularly useful when you need to set up a highly available Kubernetes cluster:
it helps you enable high availability for the master servers.
You need to install keepalived on each of the master machines. All the nodes (the kubelets and kube-proxies running on each node) can then reach the masters using the virtual IP. If one of the master servers crashes, the virtual IP fails over to another available master. That way you achieve high availability for the master servers in Kubernetes.
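A minimal keepalived configuration on one of the masters looks roughly like this (a sketch; the interface name, priority, and virtual IP are assumptions you would adapt to your environment):

vrrp_instance apiserver-vip {
    state MASTER            # BACKUP on the other masters
    interface eth0
    virtual_router_id 51
    priority 100            # use a lower priority on the other masters
    advert_int 1
    virtual_ipaddress {
        10.0.0.100          # the VIP the nodes will use
    }
}

The kubelets and kube-proxies then point at https://10.0.0.100:6443 instead of any single master's address.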
When I try to create a cluster with 1 master and 2 data nodes, I get the error below:
Cannot start master: Insufficient number of DataNodes reporting
Worker test-sparkjob-w-0 unable to register with master test-sparkjob-m. This could be because it is offline, or network is misconfigured.
Worker test-sparkjob-w-1 unable to register with master test-sparkjob-m. This could be because it is offline, or network is misconfigured.
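This is the same symptom as the firewall issue described above, so it is worth checking first whether the firewall rules on the cluster's network allow VM-to-VM traffic across the full port range (a sketch, assuming the default network):

gcloud compute firewall-rules list --filter="network=default"
gcloud compute firewall-rules describe default-allow-internal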
I want to know: when the master nodes connect to the etcd cluster, which etcd node will be selected? Does a master node always connect to the same etcd node until it becomes unavailable? Does each node in the master cluster connect to the same node in the etcd cluster?
The scheduler and controller-manager talk to the API server present on the same node. In an HA setup you'll have only one instance of each active at a time (based on a lease), and whichever is currently active talks to the local API server. If it fails to connect to the local API server for some reason, it doesn't renew the lease and another leader is elected.
As described, only one API server will be the leader at any given moment, so that's the only place that needs to worry about reaching the etcd cluster. As for the etcd cluster itself, when you configure the Kubernetes API server you pass it the --etcd-servers flag, which is a list of etcd nodes like:
--etcd-servers=https://10.240.0.10:2379,https://10.240.0.11:2379,https://10.240.0.12:2379
This is then passed to the Go etcd/client library which, looking at its README, states:
etcd/client does round-robin rotation on other available endpoints if the preferred endpoint isn't functioning properly. For example, if the member that etcd/client connects to is hard killed, etcd/client will fail on the first attempt with the killed member, and succeed on the second attempt with another member. If it fails to talk to all available endpoints, it will return all errors happened.
This means it will try each of the available nodes until it succeeds in connecting to one.
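You can see the same multi-endpoint behavior from the command line with etcdctl, which takes the same comma-separated endpoint list and reports on each member individually (a sketch; the certificate paths are assumptions):

ETCDCTL_API=3 etcdctl \
  --endpoints=https://10.240.0.10:2379,https://10.240.0.11:2379,https://10.240.0.12:2379 \
  --cacert=/etc/etcd/ca.pem --cert=/etc/etcd/client.pem --key=/etc/etcd/client-key.pem \
  endpoint health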