Facing error when creating Dataproc cluster on Google - google-cloud-dataproc

When I try to create a cluster with 1 master and 2 data nodes, I get the error below:
Cannot start master: Insufficient number of DataNodes reporting
Worker test-sparkjob-w-0 unable to register with master test-sparkjob-m. This could be because it is offline, or network is misconfigured.
Worker test-sparkjob-w-1 unable to register with master test-sparkjob-m. This could be because it is offline, or network is misconfigured.
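For reference, a minimal sketch of the kind of gcloud command that hits this error, assuming the cluster is created from the CLI (the region and machine types are placeholders; the cluster name comes from the messages above):

# create a small Dataproc cluster: 1 master + 2 workers on the default network
gcloud dataproc clusters create test-sparkjob \
    --region=us-central1 \
    --num-workers=2 \
    --master-machine-type=n1-standard-4 \
    --worker-machine-type=n1-standard-4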

Related

How to debug GKE node pool nodes being re-created

GKE Cloud Logging and Monitoring are not leading me to a root cause. Over some period every single node was replaced (I could verify this by their age with kubectl), leading to a short complete outage (several minutes) for all services, as detected by external monitoring.
Nodes are not preemptible
gcloud container operations list does not show any node-upgrade operations (see the sketch after this list)
In the Cloud Logging node event logs, there were multiple entries like these:
Node <...> status is now: NodeNotReady
Deleting node <...> because it does not exist in the cloud provider
Node <...> event: Removing Node <...> from Controller
The node-problem-detector logs have a bunch of these:
"2022/05/26 20:35:10 Failed to export to Stackdriver: rpc error: code = DeadlineExceeded desc = One or more TimeSeries could not be written: Deadline expired before operation could complete.: gce_instance{zone:europe-west2-a,instance_id:<...>} timeSeries[0-199]: kubernetes.io/internal/node/guest/net/rx_multicast{instance_name<...>,interface_namegkeb5dd8ca7306}"
Cluster autoscaler shows only a few nodes added and removed, but spread out over several hours.
During the period building up to this, one service in the cluster was receiving a DDoS attack, so network pressure was high, but there was no CPU throttling or OOM kills.
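A sketch of the checks described above, assuming gcloud and kubectl already point at the affected cluster (the zone is taken from the log line above; the Logging filter is approximate and may need adjusting):

# node ages show which nodes were recently replaced
kubectl get nodes -o wide

# any upgrade/repair operations recorded for the cluster
gcloud container operations list --zone=europe-west2-a

# node events exported to Cloud Logging over the last two days
gcloud logging read 'resource.type="k8s_node" AND jsonPayload.reason="NodeNotReady"' --freshness=2d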

Kubernetes change multi-master node to single master node on failure

Is there any way to convert/change a multi-master setup (3 masters, HA & LB) to a single master in a stacked etcd configuration?
With 3 master nodes, it only tolerates 1 failure, right?
So if 2 of these master nodes go down, the control plane won't work.
What I need to do is convert these 3 masters to a single master. Is there any way to do this while minimizing the downtime of the control plane (in case the other 2 masters need some time to come back up)?
The tests I've done:
I've tried to restore an etcd snapshot to a completely different environment with a new setup of 1 master & 2 workers, and it seems to work fine: the status of the other 2 master nodes is NotReady, the 2 worker nodes are Ready, and requests to the api-server work normally.
But if I restore the etcd snapshot to the original environment, after resetting the last master node with kubeadm reset, the cluster seems to be broken: the status of the 2 workers is NotReady, and it looks like they have different certificates.
Any suggestion on how to make this work?
UPDATE: So apparently I could restore the etcd snapshot directly without doing "kubeadm reset"; even if I do reset, as long as I update the certificates, the cluster is restored successfully.
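A minimal sketch of that restore-plus-certificates sequence on a kubeadm stacked-etcd master (the snapshot path and data dir are placeholders; kubeadm certs renew is kubeadm alpha certs renew on older releases):

# stop the static control-plane pods while touching the etcd data dir
sudo mv /etc/kubernetes/manifests /etc/kubernetes/manifests.bak

# restore the snapshot into a fresh data dir and swap it in
sudo ETCDCTL_API=3 etcdctl snapshot restore ./snapshot.db --data-dir=/var/lib/etcd-restored
sudo rm -rf /var/lib/etcd && sudo mv /var/lib/etcd-restored /var/lib/etcd

# renew the control-plane certificates
sudo kubeadm certs renew all

# bring the static pods back
sudo mv /etc/kubernetes/manifests.bak /etc/kubernetes/manifests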
BUT now I run into a different issue. After restoring the etcd snapshot everything works fine, so now I want to add a new control plane to this cluster. The current node status is:
master1   Ready
master2   NotReady
master3   NotReady
Before adding the new CP, I removed the 2 failed master nodes from the cluster. After removing them I tried to join the new CP to the cluster, and the join process got stuck at:
[etcd] Waiting for the new etcd member to join the cluster. This can take up to 40s
[kubelet-check] Initial timeout of 40s passed.
Now the original master node is broken again and I can't access the api-server. Do you guys have any idea what's going wrong?
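For context, a sketch of how a failed control-plane node is usually removed cleanly on a kubeadm stacked-etcd cluster; node and pod names follow the question and the member ID is a placeholder. Deleting only the Node object leaves the old etcd member registered, which is a common reason the "[etcd] Waiting for the new etcd member" step hangs:

# remove the failed masters from Kubernetes
kubectl delete node master2 master3

# list the etcd members via the surviving master's etcd pod
kubectl -n kube-system exec etcd-master1 -- etcdctl \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    member list

# remove the stale members using the IDs printed above
kubectl -n kube-system exec etcd-master1 -- etcdctl \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    member remove <member-id>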

Flink HA JobManager cluster cannot elect a leader

I'm trying to deploy Apache Flink 1.6 on Kubernetes, following the tutorial on the job manager high availability page. I already have a working ZooKeeper 3.10 cluster; from its logs I can see that it's healthy and not configured for Kerberos or SASL, and all ACL rules let every client read and write znodes. When I start the cluster everything works as expected: every JobManager and TaskManager pod successfully reaches the Running state, and I can see the connected TaskManager instances from the master JobManager's web UI. But when I delete the master JobManager's pod, the other JobManager pods cannot elect a leader, with the following error message on every JobManager UI in the cluster:
{
"errors": [
"Service temporarily unavailable due to an ongoing leader election. Please refresh."
]
}
Even if I refresh this page, nothing changes; it stays stuck at this error message.
My suspicion is that the problem is related to the high-availability.storageDir option. I already have a working (tested with CloudExplorer) MinIO S3 deployment in my k8s cluster, but Flink cannot write anything to the S3 server. Here you can find every config in a GitHub gist.
According to the logs it looks as if the TaskManager cannot connect to the new leader. I assume that this is the same for the web ui. The logs say that it tries to connect to flink-job-manager-0.flink-job-svc.flink.svc.cluster.local/10.244.3.166:44013. I cannot say from the logs whether flink-job-manager-1 binds to this IP. But my suspicion is that the headless service might return multiple IPs and Flink picks the wrong/old one. Could you log into the flink-job-manager-1 pod and check what its IP address is?
I think you should be able to resolve this problem by defining a dedicated service for each JobManager, or by using the pod hostname instead.
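A quick way to do that check, assuming the pods run in a namespace called flink (as the DNS name above suggests):

# pod IPs as Kubernetes sees them
kubectl -n flink get pods -o wide

# IP as seen from inside the pod itself
kubectl -n flink exec flink-job-manager-1 -- hostname -i

# what the headless-service DNS name currently resolves to
kubectl -n flink run dns-test --rm -it --image=busybox --restart=Never -- \
    nslookup flink-job-manager-0.flink-job-svc.flink.svc.cluster.local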

Google Dataproc Insufficient number of DataNodes reporting

I use the default network configuration and try to run a standard cluster with 1 master and 2 workers, but it always fails: the worker nodes fail to do an RPC to the master, or vice versa. I also get an info message on the cluster page notifying me that
The firewall rules for specified network or subnetwork would likely
not permit sufficient VM-to-VM communication for Dataproc to function
properly
The error message is as follows:
Cannot start master: Insufficient number of DataNodes reporting Worker
cluster-597c-w-0 unable to register with master cluster-597c-m. This
could be because it is offline, or network is misconfigured. Worker
cluster-597c-w-1 unable to register with master cluster-597c-m. This
could be because it is offline, or network is misconfigured.
This happens even though I use the default configuration.
So I found there was a misconfiguration in the firewall rule: it was allowing TCP ports from 1 to 33535, so I changed the upper bound to 65535.
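For anyone who needs the concrete command, a sketch of the fix, assuming the cluster runs on the default network (the rule names below are placeholders):

# widen the existing internal rule to the full port range
gcloud compute firewall-rules update <your-internal-rule> \
    --allow=tcp:0-65535,udp:0-65535,icmp

# or create a rule equivalent to GCP's default-allow-internal
gcloud compute firewall-rules create allow-internal \
    --network=default \
    --source-ranges=10.128.0.0/9 \
    --allow=tcp:0-65535,udp:0-65535,icmp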

In Kubernetes 1.7 with multi-master nodes and an etcd cluster, how does a master node connect to the etcd cluster by default?

I want to know: when the master nodes connect to the etcd cluster, which etcd node is selected? Does a master node always connect to the same etcd node until it becomes unavailable? Does each node in the master cluster connect to the same node in the etcd cluster?
The scheduler and controller-manager talk to the API server present on the same node. In an HA setup you'll have only one of each running at a time (based on a lease), and whichever one is currently active will be talking to the local API server. If for some reason it fails to connect to the local API server, it doesn't renew the lease and another leader is elected.
As described, the scheduler and controller-manager only ever go through an API server, so the API server is the only component that needs to worry about reaching the etcd cluster. When you configure the Kubernetes API server you pass it the etcd-servers flag, which is a list of etcd nodes like:
--etcd-servers=https://10.240.0.10:2379,https://10.240.0.11:2379,https://10.240.0.12:2379
This is then passed to the Go etcd/client library which, looking at its README, states:
etcd/client does round-robin rotation on other available endpoints if the preferred endpoint isn't functioning properly. For example, if the member that etcd/client connects to is hard killed, etcd/client will fail on the first attempt with the killed member, and succeed on the second attempt with another member. If it fails to talk to all available endpoints, it will return all errors happened.
This means that it will try each of the available endpoints until it succeeds in connecting to one.
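To check from the command line which of those endpoints are actually reachable, you can query each one's health with etcdctl (a sketch, assuming etcdctl v3 and placeholder TLS cert paths):

ETCDCTL_API=3 etcdctl \
    --endpoints=https://10.240.0.10:2379,https://10.240.0.11:2379,https://10.240.0.12:2379 \
    --cacert=/etc/etcd/ca.pem --cert=/etc/etcd/client.pem --key=/etc/etcd/client-key.pem \
    endpoint health

# endpoint health reports each member separately, so a single dead member
# shows up as unhealthy while clients keep working through the others.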