Failed to create a cluster with command kubeadm init - kubernetes

I am trying to install Kubernetes on Debian 9.3, following the instructions in this document: https://kubernetes.io/docs/setup/independent/install-kubeadm/. It fails to create the cluster with a timeout error. The commands I used are as follows:
export HTTP_PROXY=http://192.168.56.1:1080 # this is my internet proxy
export HTTPS_PROXY=http://192.168.56.1:1080
export NO_PROXY=127.0.0.1,192.168.56.*,10.244.*,10.96.*
kubeadm init --apiserver-advertise-address=192.168.56.101 --pod-network-cidr=10.244.0.0/16
The last command hung for about an hour and then failed with a timeout. Running docker ps showed that several containers were up: kube-controller-manager-amd64, etcd-amd64, kube-apiserver-amd64, kube-scheduler-amd64, and 4 instances of pause-amd64.
The error messages are as follows:
duler-debvm01_kube-system(660259102d57385a8043d025ac189c87)": Get https://192.168.56.101:6443/api/v1/namespaces/kube-system/pods/kube-scheduler-debvm01: net/http: TLS handshake timeout
Apr 06 21:44:49 DebVM01 kubelet[10665]: E0406 21:44:49.923017 10665 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:474: Failed to list *v1.Node: Get https://192.168.56.101:6443/api/v1/nodes?fieldSelector=metadata.name%3Ddebvm01&limit=500&resourceVersion=0: net/http: TLS handshake timeout
Apr 06 21:44:49 DebVM01 kubelet[10665]: E0406 21:44:49.924966 10665 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:465: Failed to list *v1.Service: Get https://192.168.56.101:6443/api/v1/services?limit=500&resourceVersion=0: net/http: TLS handshake timeout
Apr 06 21:44:49 DebVM01 kubelet[10665]: E0406 21:44:49.925892 10665 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get xxx/api/v1/pods?fieldSelector=spec.nodeName%3Ddebvm01&limit=500&resourceVersion=0: net/http: TLS handshake timeout
Apr 06 21:44:50 DebVM01 kubelet[10665]: E0406 21:44:50.029333 10665 eviction_manager.go:238] eviction manager: unexpected err: failed to get node info: node "debvm01" not found
Apr 06 21:44:50 DebVM01 kubelet[10665]: E0406 21:44:50.379543 10665 kubelet_node_status.go:106] Unable to register node "debvm01" with API server: Post xxx: net/http: TLS handshake timeout
Apr 06 21:44:52 DebVM01 kubelet[10665]: E0406 21:44:52.575452 10665 event.go:209] Unable to write event: 'Post xxxx: net/http: TLS handshake timeout' (may retry after sleeping)
Apr 06 21:44:57 DebVM01 kubelet[10665]: I0406 21:44:57.380498 10665 kubelet_node_status.go:273] Setting node annotation to enable volume controller attach/detach
Apr 06 21:44:57 DebVM01 kubelet[10665]: I0406 21:44:57.430059 10665 kubelet_node_status.go:82] Attempting to register node debvm01
Apr 06 21:45:00 DebVM01 kubelet[10665]: E0406 21:45:00.030635 10665 eviction_manager.go:238] eviction manager: unexpected err: failed to get node info: node "debvm01" not found
Apr 06 21:45:01 DebVM01 kubelet[10665]: I0406 21:45:01.484580 10665 kubelet_node_status.go:85] Successfully registered node debvm01
The log above has already been trimmed to remove many repeated lines like the following:
Apr 06 22:46:20 DebVM01 kubelet[10665]: E0406 22:46:20.773690 10665 kubelet.go:2104] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Apr 06 22:46:25 DebVM01 kubelet[10665]: W0406 22:46:25.779141 10665 cni.go:171] Unable to update cni config: No networks found in /etc/cni/net.d
Kubernetes v1.9.3
Could anyone help me?

kubeadm init --apiserver-advertise-address=192.168.56.101 --pod-network-cidr=10.244.0.0/16
From kubeadm documentation:
--apiserver-advertise-address ip-address - The IP address the API Server will advertise it's listening on. Specify '0.0.0.0' to use the address of the default network interface.
Unless otherwise specified, kubeadm uses the default gateway's network interface to advertise the master's IP. If you want to use a different network interface, specify --apiserver-advertise-address=ip-address.
From kubernetes api-server documentation:
--advertise-address ip-address - The IP address on which to advertise the apiserver to members of the cluster. This address must be reachable by the rest of the cluster. If blank, the --bind-address will be used. If --bind-address is unspecified, the host's default interface will be used.
I've done a couple of experiments which confirm that the ip-address must be configured on (or added as a secondary IP to) one of the master's network interfaces.
Also double-check that the interface is up.
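A quick way to check both on the master, using the address from the question:
ip addr show | grep 192.168.56.101   # the address should appear under one of the interfaces
ip link show                         # the matching interface should report state UP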
The last error message,
network plugin is not ready: cni config uninitialized
means that the Kubernetes networking subsystem is absent or broken. Try to install/reinstall it with
kubectl apply -f https://docs.projectcalico.org/v3.0/getting-started/kubernetes/installation/hosted/kubeadm/1.7/calico.yaml
This part is described in the section "(3/4) Installing a pod network" of the document you've mentioned.
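After applying the network add-on, you can confirm that it came up and that the node becomes Ready; a quick check:
kubectl get pods -n kube-system -o wide   # the CNI pods should reach the Running state
kubectl get nodes                         # the node should report Ready once the CNI config is initialized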
If you are stuck, try to reinstall your cluster following this manual.

Related

Jenkins on Kubernetes - only 1 agent can connect

I have installed Jenkins on Kubernetes (EKS). It's been running for over a year now. I've noticed that when you run parallel jobs, it just runs one job at a time. I cannot remember if it ever ran two jobs in parallel.
The second pod only starts working after the first one finishes.
Logs from the second job, which gets stuck:
jenkins-master logs:
2023-02-02 09:14:00.995+0000 [id=962] INFO hudson.slaves.NodeProvisioner#update: devops-iac-prod-eu-west-1-1-master-28-71sh2-hhnlk-46vsg provisioning successfully completed. We have now 3 computer(s)
[jenkins-0][jenkins] 2023-02-02 09:14:22.133+0000 [id=961] INFO o.c.j.p.k.KubernetesLauncher#launch: Created Pod: kubernetes jenkins/devops-iac-prod-eu-west-1-2-master-18-9vk02-n88q-qxlv5
[jenkins-0][config-reload] [2023-02-02 09:14:22] Working on ADDED configmap jenkins/jenkins-jenkins-jcasc-config
[jenkins-0][config-reload] [2023-02-02 09:14:22] Contents of jcasc-default-config.yaml haven't changed. Not overwriting existing file
[jenkins-0][jenkins] 2023-02-02 09:14:24.818+0000 [id=961] INFO o.c.j.p.k.KubernetesLauncher#launch: Pod is running: kubernetes jenkins/devops-iac-prod-eu-west-1-2-master-18-9vk02-n88q-qxlv5
[jenkins-0][jenkins] 2023-02-02 09:14:24.942+0000 [id=982] INFO h.TcpSlaveAgentListener$ConnectionHandler#run: Accepted JNLP4-connect connection #8 from /10.20.41.123:54556
[jenkins-0][jenkins] 2023-02-02 09:14:31.008+0000 [id=974] INFO o.c.j.p.k.KubernetesLauncher#launch: Created Pod: kubernetes jenkins/devops-iac-prod-eu-west-1-1-master-28-71sh2-hhnlk-46vsg
[jenkins-0][jenkins] 2023-02-02 09:14:33.287+0000 [id=974] INFO o.c.j.p.k.KubernetesLauncher#launch: Pod is running: kubernetes jenkins/devops-iac-prod-eu-west-1-1-master-28-71sh2-hhnlk-46vsg
[jenkins-0][jenkins] 2023-02-02 09:14:43.168+0000 [id=995] INFO o.c.j.p.k.p.ContainerExecDecorator$1#doLaunch: Created process inside pod: [devops-iac-prod-eu-west-1-2-master-18-9vk02-n88q-qxlv5], container: [awsterraform][384 ms]
[jenkins-0][jenkins] 2023-02-02 09:15:03.536+0000 [id=974] INFO o.c.j.p.k.KubernetesLauncher#launch: Waiting for agent to connect (30/1,000): devops-iac-prod-eu-west-1-1-master-28-71sh2-hhnlk-46vsg
[jenkins-0][jenkins] 2023-02-02 09:15:19.081+0000 [id=1023] INFO o.c.j.p.k.p.ContainerExecDecorator$1#doLaunch: Created process inside pod: [devops-iac-prod-eu-west-1-2-master-18-9vk02-n88q-qxlv5], container: [awsterraform][312 ms]
[jenkins-0][jenkins] 2023-02-02 09:15:21.728+0000 [id=1027] INFO o.c.j.p.k.p.ContainerExecDecorator$1#doLaunch: Created process inside pod: [devops-iac-prod-eu-west-1-2-master-18-9vk02-n88q-qxlv5], container: [awsterraform][303 ms]
[jenkins-0][jenkins] 2023-02-02 09:15:25.390+0000 [id=996] INFO o.c.j.p.k.p.ContainerExecDecorator$1#doLaunch: Created process inside pod: [devops-iac-prod-eu-west-1-2-master-18-9vk02-n88q-qxlv5], container: [awsterraform][323 ms]
[jenkins-0][config-reload] [2023-02-02 09:15:27] Working on ADDED configmap jenkins/jenkins-jenkins-jcasc-config
[jenkins-0][config-reload] [2023-02-02 09:15:27] Contents of jcasc-default-config.yaml haven't changed. Not overwriting existing file
[jenkins-0][jenkins] 2023-02-02 09:15:33.771+0000 [id=974] INFO o.c.j.p.k.KubernetesLauncher#launch: Waiting for agent to connect (60/1,000): devops-iac-prod-eu-west-1-1-master-28-71sh2-hhnlk-46vsg
[jenkins-0][jenkins] 2023-02-02 09:15:51.476+0000 [id=994] INFO o.c.j.p.k.p.ContainerExecDecorator$1#doLaunch: Created process inside pod: [devops-iac-prod-eu-west-1-2-master-18-9vk02-n88q-qxlv5], container: [awsterraform][285 ms]
[jenkins-0][jenkins] 2023-02-02 09:16:04.013+0000 [id=974] INFO o.c.j.p.k.KubernetesLauncher#launch: Waiting for agent to connect (90/1,000): devops-iac-prod-eu-west-1-1-master-28-71sh2-hhnlk-46vsg
[jenkins-0][config-reload] [2023-02-02 09:16:32] Working on ADDED configmap jenkins/jenkins-jenkins-jcasc-config
[jenkins-0][config-reload] [2023-02-02 09:16:32] Contents of jcasc-default-config.yaml haven't changed. Not overwriting existing file
[jenkins-0][jenkins] 2023-02-02 09:16:34.209+0000 [id=974] INFO o.c.j.p.k.KubernetesLauncher#launch: Waiting for agent to connect (120/1,000): devops-iac-prod-eu-west-1-1-master-28-71sh2-hhnlk-46vsg
[jenkins-0][jenkins] 2023-02-02 09:17:04.399+0000 [id=974] INFO o.c.j.p.k.KubernetesLauncher#launch: Waiting for agent to connect (150/1,000): devops-iac-prod-eu-west-1-1-master-28-71sh2-hhnlk-46vsg
iac-prod-eu-west-1-1-master-28-71sh2-hhnlk-46vsg (stuck agent)
Feb 02, 2023 9:14:32 AM hudson.remoting.jnlp.Main createEngine
INFO: Setting up agent: devops-iac-prod-eu-west-1-nyxgib-master-28-71sh2-hhnlk-46vsg
Feb 02, 2023 9:14:33 AM hudson.remoting.jnlp.Main$CuiListener <init>
INFO: Jenkins agent is running in headless mode.
Feb 02, 2023 9:14:33 AM hudson.remoting.Engine startEngine
INFO: Using Remoting version: 4.11
Feb 02, 2023 9:14:33 AM org.jenkinsci.remoting.engine.WorkDirManager initializeWorkDir
INFO: Using /home/jenkins/agent/remoting as a remoting work directory
Feb 02, 2023 9:14:33 AM org.jenkinsci.remoting.engine.WorkDirManager setupLogging
INFO: Both error and output logs will be printed to /home/jenkins/agent/remoting
Feb 02, 2023 9:14:33 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Locating server among [http://jenkins.jenkins.svc.cluster.local:80/]
Feb 02, 2023 9:14:33 AM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve
INFO: Remoting server accepts the following protocols: [JNLP4-connect, Ping]
Feb 02, 2023 9:14:33 AM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver resolve
INFO: Remoting TCP connection tunneling is enabled. Skipping the TCP Agent Listener Port availability check
Feb 02, 2023 9:14:33 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Agent discovery successful
Agent address: jenkins-agent.jenkins.svc.cluster.local
Agent port: 50000
Identity: fe:1f:fb:8a:cf:ca:ad:1e:55:a3:a0:b1:94:5a:36:2b
Feb 02, 2023 9:14:33 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Handshaking
Feb 02, 2023 9:14:33 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Connecting to jenkins-agent.jenkins.svc.cluster.local:50000
Feb 02, 2023 9:16:53 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Connecting to jenkins-agent.jenkins.svc.cluster.local:50000 (retrying:2)
java.io.IOException: Failed to connect to jenkins-agent.jenkins.svc.cluster.local:50000
at org.jenkinsci.remoting.engine.JnlpAgentEndpoint.open(JnlpAgentEndpoint.java:248)
at hudson.remoting.Engine.connectTcp(Engine.java:880)
at hudson.remoting.Engine.innerRun(Engine.java:757)
at hudson.remoting.Engine.run(Engine.java:540)
Caused by: java.net.ConnectException: Connection timed out
at java.base/sun.nio.ch.Net.connect0(Native Method)
at java.base/sun.nio.ch.Net.connect(Unknown Source)
at java.base/sun.nio.ch.Net.connect(Unknown Source)
at java.base/sun.nio.ch.SocketChannelImpl.connect(Unknown Source)
at java.base/java.nio.channels.SocketChannel.open(Unknown Source)
at org.jenkinsci.remoting.engine.JnlpAgentEndpoint.open(JnlpAgentEndpoint.java:206)
... 3 more
Feb 02, 2023 9:19:13 AM hudson.remoting.jnlp.Main$CuiListener status
INFO: Connecting to jenkins-agent.jenkins.svc.cluster.local:50000 (retrying:3)
java.io.IOException: Failed to connect to jenkins-agent.jenkins.svc.cluster.local:50000
at org.jenkinsci.remoting.engine.JnlpAgentEndpoint.open(JnlpAgentEndpoint.java:248)
at hudson.remoting.Engine.connectTcp(Engine.java:880)
at hudson.remoting.Engine.innerRun(Engine.java:757)
at hudson.remoting.Engine.run(Engine.java:540)
Caused by: java.net.ConnectException: Connection timed out
at java.base/sun.nio.ch.Net.connect0(Native Method)
at java.base/sun.nio.ch.Net.connect(Unknown Source)
at java.base/sun.nio.ch.Net.connect(Unknown Source)
at java.base/sun.nio.ch.SocketChannelImpl.connect(Unknown Source)
at java.base/java.nio.channels.SocketChannel.open(Unknown Source)
at org.jenkinsci.remoting.engine.JnlpAgentEndpoint.open(JnlpAgentEndpoint.java:206)
... 3 more
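When an inbound agent times out on port 50000 like this, a useful first check is whether the jenkins-agent Service actually has endpoints and whether anything blocks traffic to it; a diagnostic sketch, assuming the service and namespace names shown in the logs:
kubectl -n jenkins get svc jenkins-agent -o wide   # the Service the agents dial on port 50000
kubectl -n jenkins get endpoints jenkins-agent     # should list the controller pod IP; empty means a selector/port mismatch
kubectl -n jenkins get networkpolicy               # a NetworkPolicy could also block agent-to-controller traffic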

Apache and Tomcat communication in Kubernetes

I am running two pods for Apache and four pods for Tomcat (all in the same namespace). I expose the Apache and Tomcat ports (80 and 8080) to the outside world via Services.
When I call the Apache URL, the request never reaches the Tomcat pods.
When I check the mod_jk log, it says that Tomcat is not up or is not listening on the expected port.
I am using mod_jk for Apache-to-Tomcat communication.
I have included my worker.properties file and JkMount file below. When I access Tomcat directly, the application works fine. I am new to Kubernetes.
worker.properties:
worker.list=tomcat,loadbalancer
worker.tomcat.type=ajp13
#worker.myserver.host=127.0.0.1
worker.tomcat.port=8009
worker.tomcat.lbfactor=1
worker.loadbalancer.type=lb
worker.loadbalancer.balance_workers=tomcat
worker.loadbalancer.sticky_session=1
JK mount:
JkMount /Userinfo/* tomcat
httpd.conf:
JkWorkersFile /opt/apache/conf/worker.properties
Include /opt/apache/conf/jkmount.conf
LoadModule jk_module modules/mod_jk.so
Tomcat AJP:
<!-- Define an AJP 1.3 Connector on port 8009 -->
<Connector port="8009" protocol="AJP/1.3" redirectPort="8443" />
Message:
[Tue Dec 08 17:52:54.811 2020] [7:140256921777920] [info] jk_open_socket::jk_connect.c (817): connect to 127.0.0.1:8009 failed (errno=111)
[Tue Dec 08 17:52:54.811 2020] [7:140256921777920] [info] ajp_connect_to_endpoint::jk_ajp_common.c (1068): (tomcat) Failed opening socket to (127.0.0.1:8009) (errno=111)
[Tue Dec 08 17:52:54.811 2020] [7:140256921777920] [error] ajp_send_request::jk_ajp_common.c (1728): (tomcat) connecting to backend failed. Tomcat is probably not started or is listening on the wrong port (errno=111)
[Tue Dec 08 17:52:54.811 2020] [7:140256921777920] [info] ajp_service::jk_ajp_common.c (2778): (tomcat) sending request to tomcat failed (recoverable), because of error during request sending (attempt=2)
[Tue Dec 08 17:52:54.811 2020] [7:140256921777920] [error] ajp_service::jk_ajp_common.c (2799): (tomcat) connecting to tomcat failed (rc=-3, errors=1, client_errors=0).
[Tue Dec 08 17:52:54.811 2020] [7:140256921777920] [info] jk_handler::mod_jk.c (2995): Service error=-3 for worker=tomcat
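The log shows mod_jk dialing 127.0.0.1:8009, i.e. the Apache pod itself, while Tomcat runs in separate pods, so the worker has to target a Service that selects the Tomcat pods. A rough sketch of how to verify that from inside the cluster, using placeholder names (a Service called tomcat exposing AJP port 8009):
kubectl get svc -o wide   # confirm a ClusterIP Service in front of the Tomcat pods exists
kubectl run ajp-test --rm -it --image=busybox --restart=Never -- nc -vz tomcat 8009   # in-cluster reachability check (nc flags depend on the image)
If the Service resolves and the port is reachable, worker.tomcat.host can be pointed at that Service name (for example tomcat.<namespace>.svc.cluster.local) instead of the 127.0.0.1 seen in the log.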

Why might a gvisor-based node pool fail to bootstrap properly?

I'm trying to provision a new node pool using gvisor sandboxing in GKE. I use the GCP web console to add a new node pool, use the cos_containerd OS and check the Enable gvisor Sandboxing checkbox, but the node pool provisioning fails each time with an "Unknown Error" in the GCP console notifications. The nodes never join the K8S cluster.
The GCE VM seems to boot fine, and when I look at journalctl on the node I see that cloud-init finished just fine, but the kubelet doesn't seem to be able to start. I see error messages like this:
Oct 12 16:58:07 gke-main-sanboxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:07.184163 1143 kubelet.go:2271] node "gke-main-sanboxes-dd9b8d84-dmzz" not found
Oct 12 16:58:07 gke-main-sanboxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:07.284735 1143 kubelet.go:2271] node "gke-main-sanboxes-dd9b8d84-dmzz" not found
Oct 12 16:58:07 gke-main-sanboxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:07.385229 1143 kubelet.go:2271] node "gke-main-sanboxes-dd9b8d84-dmzz" not found
Oct 12 16:58:07 gke-main-sanboxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:07.485626 1143 kubelet.go:2271] node "gke-main-sanboxes-dd9b8d84-dmzz" not found
Oct 12 16:58:07 gke-main-sanboxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:07.522961 1143 eviction_manager.go:251] eviction manager: failed to get summary stats: failed to get node info: node "gke-main-sanboxes-dd9b8d84-dmzz" not found
Oct 12 16:58:07 gke-main-sanboxes-dd9b8d84-dmzz containerd[976]: time="2020-10-12T16:58:07.576735750Z" level=error msg="Failed to load cni configuration" error="cni config load failed: no network config found in /etc/cni/net.d: cni plugin not initialized: failed to load cni config"
Oct 12 16:58:07 gke-main-sanboxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:07.577353 1143 kubelet.go:2191] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
Oct 12 16:58:07 gke-main-sanboxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:07.587824 1143 kubelet.go:2271] node "gke-main-sanboxes-dd9b8d84-dmzz" not found
kubelet.go:2271] node "gke-main-sanboxes-dd9b8d84-dmzz" not found
Oct 12 16:58:07 gke-main-sanboxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:07.989869 1143 kubelet.go:2271] node "gke-main-sanboxes-dd9b8d84-dmzz" not found
Oct 12 16:58:08 gke-main-sanboxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:08.090287 1143
Oct 12 16:58:09 gke-main-sanboxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:09.296365 1143 kubelet.go:2271] node "gke-main-sanboxes-dd9b8d84-dmzz" not found
Oct 12 16:58:09 gke-main-sanboxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:09.396933 1143 kubelet.go:2271] node "gke-main-sanboxes-dd9b8d84-dmzz" not found
Oct 12 16:58:09 gke-main-sanboxes-dd9b8d84-dmzz node-problem-detector[1166]: F1012 16:58:09.449446 2481 main.go:71] cannot create certificate signing request: Post https://172.17.0.2/apis/certificates.k8s.io/v1beta1/certificatesigningrequests?timeout=5m0s: dial tcp 172.17.0.2:443: connect: no route
Oct 12 16:58:09 gke-main-sanboxes-dd9b8d84-dmzz node-problem-detector[1166]: E1012 16:58:09.450695 1166 manager.go:162] failed to update node conditions: Patch https://172.17.0.2/api/v1/nodes/gke-main-sanboxes-dd9b8d84-dmzz/status: getting credentials: exec: exit status 1
Oct 12 16:58:09 gke-main-sanboxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:09.453825 2486 cache.go:125] failed reading existing private key: open /var/lib/kubelet/pki/kubelet-client.key: no such file or directory
Oct 12 16:58:09 gke-main-sanboxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:09.543449 1143 kubelet.go:2271] node "gke-main-sanboxes-dd9b8d84-dmzz" not found
Oct 12 16:58:09 gke-main-sanboxes-dd9b8d84-dmzz kubelet[1143]: E1012 16:58:09.556623 2486 tpm.go:124] failed reading AIK cert: tpm2.NVRead(AIK cert): decoding NV_ReadPublic response: handle 1, error code 0xb : the handle is not correct for the use
I am not really sure what might be causing that, and I'd really like to be able to use autoscaling with this node pool, so I don't want to fix it manually for this node and then have to do the same for any new nodes that join. How can I configure the node pool so that the gvisor-based nodes provision correctly on their own?
My cluster details:
GKE version: 1.17.9-gke.6300
Cluster type: Regional
VPC-native
Private cluster
Shielded GKE Nodes
You can report issues with Google products by following the link below:
Cloud.google.com: Support: Docs: Issue Trackers
You will need to choose Create new Google Kubernetes Engine issue under the Compute section.
I can confirm that I stumbled upon the same issue when creating a cluster as described in the question (private, shielded, etc.):
Create a cluster with one node pool.
Add the node pool with gvisor enabled after the cluster has been successfully created.
Creating a cluster like the above pushes the GKE cluster into the RECONCILING state:
NAME LOCATION MASTER_VERSION MASTER_IP MACHINE_TYPE NODE_VERSION NUM_NODES STATUS
gke-gvisor europe-west3 1.17.9-gke.6300 XX.XXX.XXX.XXX e2-medium 1.17.9-gke.6300 6 RECONCILING
The changes in the cluster state:
Provisioning - creating the cluster
Running - created the cluster
Reconciling - added the node pool
Running - the node pool was added (for about a minute)
Reconciling - the cluster went into that state for about 25 minutes
GCP Cloud Console (Web UI) reports: Repairing Cluster
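For reference, the same kind of node pool can also be added from the command line; a rough sketch using the cluster name and region from the output above and a placeholder pool name:
gcloud container node-pools create gvisor-pool \
  --cluster=gke-gvisor --region=europe-west3 \
  --image-type=cos_containerd \
  --sandbox type=gvisor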

Deploying GlusterFS DaemonSet on Kubernetes

I'm new to GlusterFS and am trying to deploy GlusterFS as a DaemonSet in a new K8S cluster.
My K8S cluster is set up on bare metal and all the host machines are Debian 9 based.
I'm getting the GlusterFS DaemonSet from the official Kubernetes Incubator repo, which is here. The image being used is based on CentOS.
Now when I deploy the DaemonSet, all the pods stay in the Pending state. When I describe the pods, I see livenessProbe/readinessProbe failures with the following errors.
[glusterfspod-6h85 /]# systemctl status glusterd
● glusterd.service - GlusterFS, a clustered file-system server
Loaded: loaded (/usr/lib/systemd/system/glusterd.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Sat 2018-11-10 19:41:53 UTC; 2min 2s ago
Process: 68 ExecStart=/usr/sbin/glusterd -p /var/run/glusterd.pid --log-level $LOG_LEVEL $GLUSTERD_OPTIONS (code=exited, status=1/FAILURE)
Nov 10 19:41:53 kubernetes-agent-4 systemd[1]: Starting GlusterFS, a clustered file-system server...
Nov 10 19:41:53 kubernetes-agent-4 systemd[1]: glusterd.service: control process exited, code=exited status=1
Nov 10 19:41:53 kubernetes-agent-4 systemd[1]: Failed to start GlusterFS, a clustered file-system server.
Nov 10 19:41:53 kubernetes-agent-4 systemd[1]: Unit glusterd.service entered failed state.
Nov 10 19:41:53 kubernetes-agent-4 systemd[1]: glusterd.service failed.
Then I exec into the pods and check the logs, which say:
-- Unit sshd.service has begun starting up.
Nov 10 19:35:24 kubernetes-agent-4 sshd[93]: error: Bind to port 2222 on 0.0.0.0 failed: Address already in use.
Nov 10 19:35:24 kubernetes-agent-4 sshd[93]: error: Bind to port 2222 on :: failed: Address already in use.
Nov 10 19:35:24 kubernetes-agent-4 sshd[93]: fatal: Cannot bind any address.
Nov 10 19:35:24 kubernetes-agent-4 systemd[1]: sshd.service: main process exited, code=exited, status=255/n/a
Nov 10 19:35:24 kubernetes-agent-4 systemd[1]: Failed to start OpenSSH server daemon.
And
[2018-11-10 19:34:42.330154] I [MSGID: 106479] [glusterd.c:1481:init] 0-management: Using /var/lib/glusterd as working directory
[2018-11-10 19:34:42.330165] I [MSGID: 106479] [glusterd.c:1486:init] 0-management: Using /var/run/gluster as pid file working directory
[2018-11-10 19:34:42.333893] E [socket.c:802:__socket_server_bind] 0-socket.management: binding to failed: Address already in use
[2018-11-10 19:34:42.333911] E [socket.c:805:__socket_server_bind] 0-socket.management: Port is already in use
[2018-11-10 19:34:42.333926] W [rpcsvc.c:1788:rpcsvc_create_listener] 0-rpc-service: listening on transport failed
[2018-11-10 19:34:42.333938] E [MSGID: 106244] [glusterd.c:1757:init] 0-management: creation of listener failed
[2018-11-10 19:34:42.333949] E [MSGID: 101019] [xlator.c:720:xlator_init] 0-management: Initialization of volume 'management' failed, review your volfile again
[2018-11-10 19:34:42.333965] E [MSGID: 101066] [graph.c:367:glusterfs_graph_init] 0-management: initializing translator failed
[2018-11-10 19:34:42.333974] E [MSGID: 101176] [graph.c:738:glusterfs_graph_activate] 0-graph: init failed
[2018-11-10 19:34:42.334371] W [glusterfsd.c:1514:cleanup_and_exit] (-->/usr/sbin/glusterd(glusterfs_volumes_init+0xfd) [0x55adc15817dd] -->/usr/sbin/glusterd(glusterfs_process_volfp+0x163) [0x55adc1581683] -->/usr/sbin/glusterd(cleanup_and_exit+0x6b) [0x55adc1580b8b] ) 0-: received signum (-1), shutting down
And
[2018-11-10 19:34:03.299298] I [MSGID: 100030] [glusterfsd.c:2691:main] 0-/usr/sbin/glusterd: Started running /usr/sbin/glusterd version 5.0 (args: /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO)
[2018-11-10 19:34:03.330091] I [MSGID: 106478] [glusterd.c:1435:init] 0-management: Maximum allowed open file descriptors set to 65536
[2018-11-10 19:34:03.330125] I [MSGID: 106479] [glusterd.c:1491:init] 0-management: Using /var/lib/glusterd as working directory
[2018-11-10 19:34:03.330135] I [MSGID: 106479] [glusterd.c:1497:init] 0-management: Using /var/run/gluster as pid file working directory
[2018-11-10 19:34:03.334414] W [MSGID: 103071] [rdma.c:4475:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event channel creation failed [No such device]
[2018-11-10 19:34:03.334435] W [MSGID: 103055] [rdma.c:4774:init] 0-rdma.management: Failed to initialize IB Device
[2018-11-10 19:34:03.334444] W [rpc-transport.c:339:rpc_transport_load] 0-rpc-transport: 'rdma' initialization failed
[2018-11-10 19:34:03.334537] W [rpcsvc.c:1789:rpcsvc_create_listener] 0-rpc-service: cannot create listener, initing the transport failed
[2018-11-10 19:34:03.334549] E [MSGID: 106244] [glusterd.c:1798:init] 0-management: creation of 1 listeners failed, continuing with succeeded transport
[2018-11-10 19:34:05.496746] E [MSGID: 101032] [store.c:447:gf_store_handle_retrieve] 0-: Path corresponding to /var/lib/glusterd/glusterd.info. [No such file or directory]
[2018-11-10 19:34:05.496843] E [MSGID: 101032] [store.c:447:gf_store_handle_retrieve] 0-: Path corresponding to /var/lib/glusterd/glusterd.info. [No such file or directory]
[2018-11-10 19:34:05.496846] I [MSGID: 106514] [glusterd-store.c:2304:glusterd_restore_op_version] 0-management: Detected new install. Setting op-version to maximum : 50000
[2018-11-10 19:34:05.513644] I [MSGID: 106194] [glusterd-store.c:3983:glusterd_store_retrieve_missed_snaps_list] 0-management: No missed snaps list.
Is there something I have missed? The volumes section of the DaemonSet manifest mounts volumes from hostPath. Should I deploy glusterfs-server on my host machines as well? Or is this a CentOS/Debian mismatch issue?
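Since "Address already in use" shows up for both glusterd and the pod's sshd, and this DaemonSet typically runs with hostNetwork, something on the affected host (or a leftover process) may already hold those ports; a quick check on the host, assuming ss is available:
ss -tlnp | grep -E ':(2222|24007|24008)'   # see what currently owns the sshd and gluster management ports
systemctl status glusterd                  # a glusterd installed on the host itself would also conflict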

Connection timeout on cluster.openBucket call with Couchbase / Kubernetes

I have deployed a 4-node Couchbase cluster using Google GKE.
The master node exposes ports 8091 and 8093 to the LoadBalancer.
When connecting to the LoadBalancer's external IP via a Java app to insert data, I get a timeout error with this stack trace:
Apr 04, 2017 3:32:15 PM com.couchbase.client.core.endpoint.AbstractEndpoint$2 operationComplete
WARNING: [null][ViewEndpoint]: Socket connect took longer than specified timeout.
Apr 04, 2017 3:32:15 PM com.couchbase.client.core.endpoint.AbstractEndpoint$2 operationComplete
WARNING: [null][KeyValueEndpoint]: Socket connect took longer than specified timeout.
Apr 04, 2017 3:32:15 PM com.couchbase.client.deps.io.netty.util.concurrent.DefaultPromise notifyListener0
WARNING: An exception was thrown by com.couchbase.client.core.endpoint.AbstractEndpoint$2.operationComplete()
rx.exceptions.OnErrorNotImplementedException: connection timed out: /10.4.0.3:8093
at rx.Observable$26.onError(Observable.java:7955)
at rx.observers.SafeSubscriber._onError(SafeSubscriber.java:159)
at rx.observers.SafeSubscriber.onError(SafeSubscriber.java:120)
at rx.internal.operators.OperatorMap$1.onError(OperatorMap.java:48)
What's puzzling is that the stack trace shows 10.4.0.3:8093, which is actually the IP of the Docker container.
Appreciate all suggestions.
Have you checked the firewall rules for the master node and the workers? You need to allow ingress for the ports you have set up.
See this answer
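On GCP that check could look roughly like this, with a placeholder rule name and source range (8091-8094 and 11210 are the usual Couchbase client ports):
gcloud compute firewall-rules list   # confirm what ingress is currently allowed to the nodes
gcloud compute firewall-rules create allow-couchbase-client \
  --allow tcp:8091-8094,tcp:11210 --source-ranges=<client-ip>/32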