Unable to launch Alluxio on Kubernetes - sockets

I am trying Alluxio 1.7.1 with Docker 1.13.1 and Kubernetes 1.9.6 / 1.10.1.
I created the Alluxio Docker image as per the instructions at https://www.alluxio.org/docs/1.7/en/Running-Alluxio-On-Docker.html
Then I followed the https://www.alluxio.org/docs/1.7/en/Running-Alluxio-On-Kubernetes.html guide to run Alluxio on Kubernetes. The Alluxio master pod comes up properly, but when I try to bring up the Alluxio worker I get an "Address in use" error. I have not modified anything in the YAMLs I downloaded from the Alluxio Git repository; the only changes I made were the Alluxio Docker image name and the Kubernetes API versions in the YAMLs so they match my setup.
I checked the ports in use in my Kubernetes cluster and on the nodes themselves. None of the ports Alluxio wants are being used by any other process, yet I still get the "address in use" error. I don't know how to debug this further or what to change to make it work. No other application is running on my Kubernetes cluster. I tried both single-node and multi-node cluster setups, and Kubernetes versions 1.9 and 1.10.
There is definitely some issue on the Alluxio worker side which I am unable to debug.
This is the log I get from the worker pod:
[root@vm-sushil-scrum1-08062018-alluxio-1 kubernetes]# kubectl logs po/alluxio-worker-knqt4
Formatting Alluxio Worker @ vm-sushil-scrum1-08062018-alluxio-1
2018-06-08 10:09:55,723 INFO Configuration - Configuration file /opt/alluxio/conf/alluxio-site.properties loaded.
2018-06-08 10:09:55,845 INFO Format - Formatting worker data folder: /alluxioworker/
2018-06-08 10:09:55,845 INFO Format - Formatting Data path for tier 0:/dev/shm/alluxioworker
2018-06-08 10:09:55,856 INFO Format - Formatting complete
2018-06-08 10:09:56,357 INFO Configuration - Configuration file /opt/alluxio/conf/alluxio-site.properties loaded.
2018-06-08 10:09:56,549 INFO TieredIdentityFactory - Initialized tiered identity TieredIdentity(node=10.194.11.7, rack=null)
2018-06-08 10:09:56,866 INFO BlockWorkerFactory - Creating alluxio.worker.block.BlockWorker
2018-06-08 10:09:56,866 INFO FileSystemWorkerFactory - Creating alluxio.worker.file.FileSystemWorker
2018-06-08 10:09:56,942 WARN StorageTier - Failed to verify memory capacity
2018-06-08 10:09:57,082 INFO log - Logging initialized @1160ms
2018-06-08 10:09:57,509 INFO AlluxioWorkerProcess - Domain socket data server is enabled at /opt/domain.
Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException: io.netty.channel.unix.Errors$NativeIoException: bind(..) failed: Address in use
at alluxio.worker.AlluxioWorkerProcess.<init>(AlluxioWorkerProcess.java:164)
at alluxio.worker.WorkerProcess$Factory.create(WorkerProcess.java:45)
at alluxio.worker.WorkerProcess$Factory.create(WorkerProcess.java:37)
at alluxio.worker.AlluxioWorker.main(AlluxioWorker.java:56)
Caused by: java.lang.RuntimeException: io.netty.channel.unix.Errors$NativeIoException: bind(..) failed: Address in use
at alluxio.util.CommonUtils.createNewClassInstance(CommonUtils.java:224)
at alluxio.worker.DataServer$Factory.create(DataServer.java:45)
at alluxio.worker.AlluxioWorkerProcess.<init>(AlluxioWorkerProcess.java:159)
... 3 more
Caused by: io.netty.channel.unix.Errors$NativeIoException: bind(..) failed: Address in use
at io.netty.channel.unix.Errors.newIOException(Errors.java:117)
at io.netty.channel.unix.Socket.bind(Socket.java:259)
at io.netty.channel.epoll.EpollServerDomainSocketChannel.doBind(EpollServerDomainSocketChannel.java:75)
at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:504)
at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1226)
at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:495)
at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:480)
at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:973)
at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:213)
at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:305)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
at java.lang.Thread.run(Thread.java:748)
-----------------------
[root@vm-sushil-scrum1-08062018-alluxio-1 kubernetes]# kubectl get all
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
ds/alluxio-worker 1 1 0 1 0 <none> 42m
ds/alluxio-worker 1 1 0 1 0 <none> 42m
NAME DESIRED CURRENT AGE
statefulsets/alluxio-master 1 1 44m
NAME READY STATUS RESTARTS AGE
po/alluxio-master-0 1/1 Running 0 44m
po/alluxio-worker-knqt4 0/1 Error 12 42m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
svc/alluxio-master ClusterIP None <none> 19998/TCP,19999/TCP 44m
svc/kubernetes ClusterIP 10.254.0.1 <none> 443/TCP 1h
---------------------
[root@vm-sushil-scrum1-08062018-alluxio-1 kubernetes]# kubectl describe po/alluxio-worker-knqt4
Name: alluxio-worker-knqt4
Namespace: default
Node: vm-sushil-scrum1-08062018-alluxio-1/10.194.11.7
Start Time: Fri, 08 Jun 2018 10:09:05 +0000
Labels: app=alluxio
controller-revision-hash=3081903053
name=alluxio-worker
pod-template-generation=1
Annotations: <none>
Status: Running
IP: 10.194.11.7
Controlled By: DaemonSet/alluxio-worker
Containers:
alluxio-worker:
Container ID: docker://40a1eff2cd4dff79d9189d7cb0c4826a6b6e4871fbac65221e7cdd341240e358
Image: alluxio:1.7.1
Image ID: docker://sha256:b080715bd53efc783ee5f54e7f1c451556f93e7608e60e05b4615d32702801af
Ports: 29998/TCP, 29999/TCP, 29996/TCP
Command:
/entrypoint.sh
Args:
worker
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Fri, 08 Jun 2018 11:01:37 +0000
Finished: Fri, 08 Jun 2018 11:02:02 +0000
Ready: False
Restart Count: 14
Limits:
cpu: 1
memory: 2G
Requests:
cpu: 500m
memory: 2G
Environment Variables from:
alluxio-config ConfigMap Optional: false
Environment:
ALLUXIO_WORKER_HOSTNAME: (v1:status.hostIP)
Mounts:
/dev/shm from alluxio-ramdisk (rw)
/opt/domain from alluxio-domain (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-7xlz7 (ro)
Conditions:
Type Status
Initialized True
Ready False
PodScheduled True
Volumes:
alluxio-ramdisk:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium: Memory
alluxio-domain:
Type: HostPath (bare host directory volume)
Path: /tmp/domain
HostPathType: Directory
default-token-7xlz7:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-7xlz7
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule
node.kubernetes.io/memory-pressure:NoSchedule
node.kubernetes.io/not-ready:NoExecute
node.kubernetes.io/unreachable:NoExecute
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulMountVolume 56m kubelet, vm-sushil-scrum1-08062018-alluxio-1 MountVolume.SetUp succeeded for volume "alluxio-domain"
Normal SuccessfulMountVolume 56m kubelet, vm-sushil-scrum1-08062018-alluxio-1 MountVolume.SetUp succeeded for volume "alluxio-ramdisk"
Normal SuccessfulMountVolume 56m kubelet, vm-sushil-scrum1-08062018-alluxio-1 MountVolume.SetUp succeeded for volume "default-token-7xlz7"
Normal Pulled 53m (x5 over 56m) kubelet, vm-sushil-scrum1-08062018-alluxio-1 Container image "alluxio:1.7.1" already present on machine
Normal Created 53m (x5 over 56m) kubelet, vm-sushil-scrum1-08062018-alluxio-1 Created container
Normal Started 53m (x5 over 56m) kubelet, vm-sushil-scrum1-08062018-alluxio-1 Started container
Warning BackOff 1m (x222 over 55m) kubelet, vm-sushil-scrum1-08062018-alluxio-1 Back-off restarting failed container
[root@vm-sushil-scrum1-08062018-alluxio-1 kubernetes]# lsof -n -i :19999 | grep LISTEN
java 8949 root 29u IPv4 12518521 0t0 TCP *:dnp-sec (LISTEN)
[root@vm-sushil-scrum1-08062018-alluxio-1 kubernetes]# lsof -n -i :19998 | grep LISTEN
java 8949 root 19u IPv4 12520458 0t0 TCP *:iec-104-sec (LISTEN)
[root@vm-sushil-scrum1-08062018-alluxio-1 kubernetes]# lsof -n -i :29998 | grep LISTEN
[root@vm-sushil-scrum1-08062018-alluxio-1 kubernetes]# lsof -n -i :29999 | grep LISTEN
[root@vm-sushil-scrum1-08062018-alluxio-1 kubernetes]# lsof -n -i :29996 | grep LISTEN
The alluxio-worker container keeps restarting and failing again and again with the same error.
Please guide me on what I can do to solve this.
Thanks

The problem was the short-circuit Unix domain socket path. I was using whatever was present by default in the Alluxio Git repository. In the default integration/kubernetes/conf/alluxio.properties.template, the address for ALLUXIO_WORKER_DATA_SERVER_DOMAIN_SOCKET_ADDRESS was not a complete path. This is properly explained at https://www.alluxio.org/docs/1.7/en/Running-Alluxio-On-Docker.html in the section on enabling short-circuit reads in Alluxio worker containers using Unix domain sockets.
Just because the Unix domain socket path was incomplete, the Alluxio worker could not come up in Kubernetes when short-circuit reads were enabled for it.
Once I corrected the path in integration/kubernetes/conf/alluxio.properties to ALLUXIO_WORKER_DATA_SERVER_DOMAIN_SOCKET_ADDRESS=/opt/domain/d, things started working properly. Some tests are still failing, but at least the Alluxio setup comes up properly; I will debug the failing tests next.
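For reference, the corrected entry in integration/kubernetes/conf/alluxio.properties looks like this (the /opt/domain/d path must point to a file inside the domain socket volume mounted at /opt/domain):
ALLUXIO_WORKER_DATA_SERVER_DOMAIN_SOCKET_ADDRESS=/opt/domain/d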
I have submitted this fix to the Alluxio Git repository so they can merge it into the master branch:
https://github.com/Alluxio/alluxio/pull/7376

On the node where your worker is running, it seems that you have a port already in use.
Try to find which process is using it:
sudo lsof -n -i :80 | grep LISTEN
I read the Alluxio configuration files: try ports 19998, 19999, 29996, 29998, and 29999, substituting each of them for 80 in the command above.
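For example, a quick loop over the Alluxio ports (a sketch; it assumes lsof is installed on the node):
for port in 19998 19999 29996 29998 29999; do
  echo "--- port $port ---"
  sudo lsof -n -i :$port | grep LISTEN
done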

Related

CockroachDB Cluster on Kubernetes Pods Crashing

I'm trying to install a CockroachDB Helm chart on a 2-node Kubernetes cluster using this command:
helm install my-release --set statefulset.replicas=2 stable/cockroachdb
I have already created 2 persistent volumes:
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pv00001 100Gi RWO Recycle Bound default/datadir-my-release-cockroachdb-0 11m
pv00002 100Gi RWO Recycle Bound default/datadir-my-release-cockroachdb-1 11m
I'm getting a weird error, and since I'm new to Kubernetes I'm not sure what I'm doing wrong. I've tried creating a StorageClass and using it with my PVs, but then the CockroachDB PVCs won't bind to them. I suspect there may be something wrong with my PV setup?
I've tried using kubectl logs but the only error I'm seeing is this:
standard_init_linux.go:211: exec user process caused "exec format error"
and the pods are crashing over and over:
NAME READY STATUS RESTARTS AGE
my-release-cockroachdb-0 0/1 Pending 0 11m
my-release-cockroachdb-1 0/1 CrashLoopBackOff 7 11m
my-release-cockroachdb-init-tfcks 0/1 CrashLoopBackOff 5 5m29s
Any idea why the pods are crashing?
Here's kubectl describe for the init pod:
Name: my-release-cockroachdb-init-tfcks
Namespace: default
Priority: 0
Node: axon/192.168.1.7
Start Time: Sat, 04 Apr 2020 00:22:19 +0100
Labels: app.kubernetes.io/component=init
app.kubernetes.io/instance=my-release
app.kubernetes.io/name=cockroachdb
controller-uid=54c7c15d-eb1c-4392-930a-d9b8e9225a45
job-name=my-release-cockroachdb-init
Annotations: <none>
Status: Running
IP: 10.44.0.1
IPs:
IP: 10.44.0.1
Controlled By: Job/my-release-cockroachdb-init
Containers:
cluster-init:
Container ID: docker://82a062c6862a9fd5047236feafe6e2654ec1f6e3064fd0513341a1e7f36eaed3
Image: cockroachdb/cockroach:v19.2.4
Image ID: docker-pullable://cockroachdb/cockroach@sha256:511b6d09d5bc42c7566477811a4e774d85d5689f8ba7a87a114b96d115b6149b
Port: <none>
Host Port: <none>
Command:
/bin/bash
-c
while true; do initOUT=$(set -x; /cockroach/cockroach init --insecure --host=my-release-cockroachdb-0.my-release-cockroachdb:26257 2>&1); initRC="$?"; echo $initOUT; [[ "$initRC" == "0" ]] && exit 0; [[ "$initOUT" == *"cluster has already been initialized"* ]] && exit 0; sleep 5; done
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Sat, 04 Apr 2020 00:28:04 +0100
Finished: Sat, 04 Apr 2020 00:28:04 +0100
Ready: False
Restart Count: 6
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-cz2sn (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
default-token-cz2sn:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-cz2sn
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled <unknown> default-scheduler Successfully assigned default/my-release-cockroachdb-init-tfcks to axon
Normal Pulled 5m9s (x5 over 6m45s) kubelet, axon Container image "cockroachdb/cockroach:v19.2.4" already present on machine
Normal Created 5m8s (x5 over 6m45s) kubelet, axon Created container cluster-init
Normal Started 5m8s (x5 over 6m44s) kubelet, axon Started container cluster-init
Warning BackOff 92s (x26 over 6m42s) kubelet, axon Back-off restarting failed container
When Pods crash, the most important things to look at are their descriptions (kubectl describe) and their logs.
The logs of the failed Pod (the "exec format error") show that the architecture of the cockroach image doesn't match the node's architecture.
Run kubectl get po -o wide to see which nodes the cockroach pods run on, then check the architecture of those nodes.
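For example, a quick way to compare the two (a sketch; the nodeInfo field and the docker inspect format string are standard, adjust the image tag to yours):
kubectl get po -o wide        # note the NODE column for the cockroach pods
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.architecture}{"\n"}{end}'
docker image inspect --format '{{.Architecture}}' cockroachdb/cockroach:v19.2.4   # run on that node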
A 2-node CockroachDB cluster is an anti-pattern. You need 3 or more nodes to avoid data or cluster-wide unavailability when a single node fails. Consider checking out these videos explaining how data in CockroachDB is organized and then how the nodes in a cluster work together to keep data available in the face of node failure.
Only if you have 3 or more nodes will you not risk losing data if any of the nodes gets corrupted. Apart from that, it's easier to explain how to do it right than to find out what went wrong, and to find out what went wrong one must go through the logs.
If you attach the log, I can take a look.
I also wrote a detailed guide that may address the "doing it right" part of my answer. I elaborated even more about the entire process here.
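As a sketch, assuming a third Kubernetes node is added as well, the install command from the question would become:
helm install my-release --set statefulset.replicas=3 stable/cockroachdb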

Coredns in CrashLoopBackOff (kubernetes 1.11)

I'm trying to install Kubernetes on an Ubuntu 16.04 VM. I followed the instructions at https://kubernetes.io/docs/setup/independent/create-cluster-kubeadm/, using Weave as my pod network add-on.
I'm seeing a similar issue to "coredns pods have CrashLoopBackOff or Error state", but I didn't see a solution there, and the versions I'm using are different:
kubeadm 1.11.4-00
kubectl 1.11.4-00
kubelet 1.11.4-00
kubernetes-cni 0.6.0-00
Docker version 1.13.1-cs8, build 91ca5f2
weave script 2.5.0
weave 2.5.0
I'm running behind a corporate firewall, so I set my proxy variables, then ran kubeadm init as follows:
# echo $http_proxy
http://135.28.13.11:8080
# echo $https_proxy
http://135.28.13.11:8080
# echo $no_proxy
127.0.0.1,135.21.27.139,135.0.0.0/8,10.96.0.0/12,10.32.0.0/12
# kubeadm init --pod-network-cidr=10.32.0.0/12
# kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"
# kubectl taint nodes --all node-role.kubernetes.io/master-
Both coredns pods stay in CrashLoopBackOff
# kubectl get pods --all-namespaces -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE
default hostnames-674b556c4-2b5h2 1/1 Running 0 5h 10.32.0.6 mtpnjvzonap001 <none>
default hostnames-674b556c4-4bzdj 1/1 Running 0 5h 10.32.0.5 mtpnjvzonap001 <none>
default hostnames-674b556c4-64gx5 1/1 Running 0 5h 10.32.0.4 mtpnjvzonap001 <none>
kube-system coredns-78fcdf6894-s7rvx 0/1 CrashLoopBackOff 18 1h 10.32.0.7 mtpnjvzonap001 <none>
kube-system coredns-78fcdf6894-vxwgv 0/1 CrashLoopBackOff 80 6h 10.32.0.2 mtpnjvzonap001 <none>
kube-system etcd-mtpnjvzonap001 1/1 Running 0 6h 135.21.27.139 mtpnjvzonap001 <none>
kube-system kube-apiserver-mtpnjvzonap001 1/1 Running 0 1h 135.21.27.139 mtpnjvzonap001 <none>
kube-system kube-controller-manager-mtpnjvzonap001 1/1 Running 0 6h 135.21.27.139 mtpnjvzonap001 <none>
kube-system kube-proxy-2c4tx 1/1 Running 0 6h 135.21.27.139 mtpnjvzonap001 <none>
kube-system kube-scheduler-mtpnjvzonap001 1/1 Running 0 1h 135.21.27.139 mtpnjvzonap001 <none>
kube-system weave-net-bpx22 2/2 Running 0 6h 135.21.27.139 mtpnjvzonap001 <none>
coredns pods have this entry in their log
E1114 20:59:13.848196 1 reflector.go:205] github.com/coredns/coredns/plugin/kubernetes/controller.go:313: Failed to list *v1.Service: Get https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
This suggests to me that coredns cannot access apiserver pod using its cluster IP:
# kubectl describe svc/kubernetes
Name: kubernetes
Namespace: default
Labels: component=apiserver
provider=kubernetes
Annotations: <none>
Selector: <none>
Type: ClusterIP
IP: 10.96.0.1
Port: https 443/TCP
TargetPort: 6443/TCP
Endpoints: 135.21.27.139:6443
Session Affinity: None
Events: <none>
I also went through the troubleshooting steps at https://kubernetes.io/docs/tasks/debug-application-cluster/troubleshooting/
I created a busybox pod for testing
I created the hostnames deployment successfully
I exposed the hostnames deployment successfully
From the busybox pod, I accessed the hostnames service by its cluster IP successfully
from the node, I accessed the hostnames service by its cluster IP successfully
So in short, I created the hostnames service which had a cluster IP in 10.96.0.0/12 space (as expected), and it works, but for some reason, pods cannot access the apiserver's cluster IP of 10.96.0.1, though from the node I can access 10.96.0.1:
# wget --no-check-certificate https://10.96.0.1/hello
--2018-11-14 21:44:25-- https://10.96.0.1/hello
Connecting to 10.96.0.1:443... connected.
WARNING: cannot verify 10.96.0.1's certificate, issued by ‘CN=kubernetes’:
Unable to locally verify the issuer's authority.
HTTP request sent, awaiting response... 403 Forbidden
2018-11-14 21:44:25 ERROR 403: Forbidden.
Some other things I checked, based on advice from others who reported a similar problem:
# sysctl net.ipv4.conf.all.forwarding
net.ipv4.conf.all.forwarding = 1
# sysctl net.bridge.bridge-nf-call-iptables
net.bridge.bridge-nf-call-iptables = 1
# iptables-save | egrep ':INPUT|:OUTPUT|:POSTROUTING|:FORWARD'
:INPUT ACCEPT [0:0]
:OUTPUT ACCEPT [11:692]
:POSTROUTING ACCEPT [11:692]
:INPUT ACCEPT [1697:364811]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [1652:363693]
# ls -l /usr/sbin/conntrack
-rwxr-xr-x 1 root root 65632 Jan 24 2016 /usr/sbin/conntrack
# systemctl status firewalld
● firewalld.service
Loaded: not-found (Reason: No such file or directory)
Active: inactive (dead)
I checked the kube-proxy log and did not see any errors.
I also tried deleting the coredns pods and the apiserver pod; they are recreated (as expected), but the problem remains.
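(For reference, those two checks look roughly like this, with the pod names taken from the listing above:)
kubectl logs -n kube-system kube-proxy-2c4tx
kubectl delete pod -n kube-system coredns-78fcdf6894-s7rvx coredns-78fcdf6894-vxwgv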
Here's a copy of the log from the weave container
# kubectl logs -n kube-system weave-net-bpx22 weave
DEBU: 2018/11/14 15:56:10.909921 [kube-peers] Checking peer "aa:53:be:75:71:f7" against list &{[]}
Peer not in list; removing persisted data
INFO: 2018/11/14 15:56:11.041807 Command line options: map[name:aa:53:be:75:71:f7 nickname:mtpnjvzonap001 ipalloc-init:consensus=1 ipalloc-range:10.32.0.0/12 db-prefix:/weavedb/weave-net docker-api: expect-npc:true host-root:/host http-addr:127.0.0.1:6784 metrics-addr:0.0.0.0:6782 conn-limit:100 datapath:datapath no-dns:true port:6783]
INFO: 2018/11/14 15:56:11.042230 weave 2.5.0
INFO: 2018/11/14 15:56:11.198348 Bridge type is bridged_fastdp
INFO: 2018/11/14 15:56:11.198372 Communication between peers is unencrypted.
INFO: 2018/11/14 15:56:11.203206 Our name is aa:53:be:75:71:f7(mtpnjvzonap001)
INFO: 2018/11/14 15:56:11.203249 Launch detected - using supplied peer list: [135.21.27.139]
INFO: 2018/11/14 15:56:11.216398 Checking for pre-existing addresses on weave bridge
INFO: 2018/11/14 15:56:11.229313 [allocator aa:53:be:75:71:f7] No valid persisted data
INFO: 2018/11/14 15:56:11.233391 [allocator aa:53:be:75:71:f7] Initialising via deferred consensus
INFO: 2018/11/14 15:56:11.233443 Sniffing traffic on datapath (via ODP)
INFO: 2018/11/14 15:56:11.234120 ->[135.21.27.139:6783] attempting connection
INFO: 2018/11/14 15:56:11.234302 ->[135.21.27.139:49182] connection accepted
INFO: 2018/11/14 15:56:11.234818 ->[135.21.27.139:6783|aa:53:be:75:71:f7(mtpnjvzonap001)]: connection shutting down due to error: cannot connect to ourself
INFO: 2018/11/14 15:56:11.234843 ->[135.21.27.139:49182|aa:53:be:75:71:f7(mtpnjvzonap001)]: connection shutting down due to error: cannot connect to ourself
INFO: 2018/11/14 15:56:11.236010 Listening for HTTP control messages on 127.0.0.1:6784
INFO: 2018/11/14 15:56:11.236424 Listening for metrics requests on 0.0.0.0:6782
INFO: 2018/11/14 15:56:11.990529 [kube-peers] Added myself to peer list &{[{aa:53:be:75:71:f7 mtpnjvzonap001}]}
DEBU: 2018/11/14 15:56:11.995901 [kube-peers] Nodes that have disappeared: map[]
10.32.0.1
135.21.27.139
DEBU: 2018/11/14 15:56:12.075738 registering for updates for node delete events
INFO: 2018/11/14 15:56:41.279799 Error checking version: Get https://checkpoint-api.weave.works/v1/check/weave-net?arch=amd64&flag_docker-version=none&flag_kernel-version=4.4.0-135-generic&flag_kubernetes-cluster-size=1&flag_kubernetes-cluster-uid=ce66cb23-e825-11e8-abc3-525400314503&flag_kubernetes-version=v1.11.4&os=linux&signature=EJdydeNysrC7LC5xAJAKyDvxXCvkeWUFzepdk3QDfr0%3D&version=2.5.0: dial tcp 74.125.196.121:443: i/o timeout
INFO: 2018/11/14 20:52:47.025412 Error checking version: Get https://checkpoint-api.weave.works/v1/check/weave-net?arch=amd64&flag_docker-version=none&flag_kernel-version=4.4.0-135-generic&flag_kubernetes-cluster-size=1&flag_kubernetes-cluster-uid=ce66cb23-e825-11e8-abc3-525400314503&flag_kubernetes-version=v1.11.4&os=linux&signature=EJdydeNysrC7LC5xAJAKyDvxXCvkeWUFzepdk3QDfr0%3D&version=2.5.0: dial tcp 74.125.196.121:443: i/o timeout
INFO: 2018/11/15 01:46:32.842792 Error checking version: Get https://checkpoint-api.weave.works/v1/check/weave-net?arch=amd64&flag_docker-version=none&flag_kernel-version=4.4.0-135-generic&flag_kubernetes-cluster-size=1&flag_kubernetes-cluster-uid=ce66cb23-e825-11e8-abc3-525400314503&flag_kubernetes-version=v1.11.4&os=linux&signature=EJdydeNysrC7LC5xAJAKyDvxXCvkeWUFzepdk3QDfr0%3D&version=2.5.0: dial tcp 74.125.196.121:443: i/o timeout
INFO: 2018/11/15 09:06:03.624359 Error checking version: Get https://checkpoint-api.weave.works/v1/check/weave-net?arch=amd64&flag_docker-version=none&flag_kernel-version=4.4.0-135-generic&flag_kubernetes-cluster-size=1&flag_kubernetes-cluster-uid=ce66cb23-e825-11e8-abc3-525400314503&flag_kubernetes-version=v1.11.4&os=linux&signature=EJdydeNysrC7LC5xAJAKyDvxXCvkeWUFzepdk3QDfr0%3D&version=2.5.0: dial tcp 172.217.9.147:443: i/o timeout
INFO: 2018/11/15 14:34:02.070893 Error checking version: Get https://checkpoint-api.weave.works/v1/check/weave-net?arch=amd64&flag_docker-version=none&flag_kernel-version=4.4.0-135-generic&flag_kubernetes-cluster-size=1&flag_kubernetes-cluster-uid=ce66cb23-e825-11e8-abc3-525400314503&flag_kubernetes-version=v1.11.4&os=linux&signature=EJdydeNysrC7LC5xAJAKyDvxXCvkeWUFzepdk3QDfr0%3D&version=2.5.0: dial tcp 172.217.9.147:443: i/o timeout
Here are the events for the 2 coredns pods
# kubectl get events -n kube-system --field-selector involvedObject.name=coredns-78fcdf6894-6f9q6
LAST SEEN FIRST SEEN COUNT NAME KIND SUBOBJECT TYPE REASON SOURCE MESSAGE
41m 20h 245 coredns-78fcdf6894-6f9q6.1568eab25f0acb02 Pod spec.containers{coredns} Normal Killing kubelet, mtpnjvzonap001 Killing container with id docker://coredns:Container failed liveness probe.. Container will be killed and recreated.
26m 20h 248 coredns-78fcdf6894-6f9q6.1568ea920f72ddd4 Pod spec.containers{coredns} Normal Pulled kubelet, mtpnjvzonap001 Container image "k8s.gcr.io/coredns:1.1.3" already present on machine
5m 20h 1256 coredns-78fcdf6894-6f9q6.1568eaa1fd9216d2 Pod spec.containers{coredns} Warning Unhealthy kubelet, mtpnjvzonap001 Liveness probe failed: HTTP probe failed with statuscode: 503
1m 19h 2963 coredns-78fcdf6894-6f9q6.1568eb75f2b1af3e Pod spec.containers{coredns} Warning BackOff kubelet, mtpnjvzonap001 Back-off restarting failed container
# kubectl get events -n kube-system --field-selector involvedObject.name=coredns-78fcdf6894-skjwz
LAST SEEN FIRST SEEN COUNT NAME KIND SUBOBJECT TYPE REASON SOURCE MESSAGE
6m 20h 1259 coredns-78fcdf6894-skjwz.1568eaa181fbeffe Pod spec.containers{coredns} Warning Unhealthy kubelet, mtpnjvzonap001 Liveness probe failed: HTTP probe failed with statuscode: 503
1m 19h 2969 coredns-78fcdf6894-skjwz.1568eb7578188f24 Pod spec.containers{coredns} Warning BackOff kubelet, mtpnjvzonap001 Back-off restarting failed container
#
Any help or further troubleshooting steps are welcome
I had the same problem and needed to allow several ports in my firewall: 22, 53, 6443, 6783, 6784, 8285.
I copied the rules from an existing healthy cluster. Probably only 6443, shown above as the target port of the kubernetes (apiserver) service, is required for this particular error; the others are for other services I run in my cluster.
On Ubuntu this was done with Uncomplicated Firewall (ufw):
ufw allow 22/tcp # allowed for ssh, included in case you had firewall disabled altogether
ufw allow 6443
ufw allow 53
ufw allow 8285
ufw allow 6783
ufw allow 6784
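If ufw is enabled, you can confirm that the rules are active with:
sudo ufw status verbose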

Minikube running out of space and failing despite --disk-size flag

I am trying to run a Docker container registry in Minikube for testing a CSI driver that I am writing.
I am running minikube on macOS and am using the following minikube start command: minikube start --vm-driver=hyperkit --disk-size=40g. I have tried both the kubeadm and localkube bootstrappers, and also the virtualbox vm-driver.
This is the resource definition I am using for the registry pod deployment.
---
apiVersion: v1
kind: Pod
metadata:
  name: registry
  labels:
    app: registry
  namespace: docker-registry
spec:
  containers:
    - name: registry
      image: registry:2
      imagePullPolicy: Always
      ports:
        - containerPort: 5000
      volumeMounts:
        - mountPath: /var/lib/registry
          name: registry-data
  volumes:
    - hostPath:
        path: /var/lib/kubelet/plugins/csi-registry
        type: DirectoryOrCreate
      name: registry-data
I attempt to create it using kubectl apply -f registry-setup.yaml. Before running this my minikube cluster reports itself as ready and with all the normal minikube containers running.
However, this fails to run and upon running kubectl describe pod, I see the following message:
Name: registry
Namespace: docker-registry
Node: minikube/192.168.64.43
Start Time: Wed, 08 Aug 2018 12:24:27 -0700
Labels: app=registry
Annotations: kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"labels":{"app":"registry"},"name":"registry","namespace":"docker-registry"},"spec":{"cont...
Status: Running
IP: 172.17.0.2
Containers:
registry:
Container ID: docker://42e5193ac563c2b2e2a2b381c91350d30f7e7c5009a30a5977d33b403a374e7f
Image: registry:2
...
TRUNCATED FOR SPACE
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 1m default-scheduler Successfully assigned registry to minikube
Normal SuccessfulMountVolume 1m kubelet, minikube MountVolume.SetUp succeeded for volume "registry-data"
Normal SuccessfulMountVolume 1m kubelet, minikube MountVolume.SetUp succeeded for volume "default-token-kq5mq"
Normal Pulling 1m kubelet, minikube pulling image "registry:2"
Normal Pulled 1m kubelet, minikube Successfully pulled image "registry:2"
Normal Created 1m kubelet, minikube Created container
Normal Started 1m kubelet, minikube Started container
...
TRUNCATED
...
Name: storage-provisioner
Namespace: kube-system
Node: minikube/192.168.64.43
Start Time: Wed, 08 Aug 2018 12:24:38 -0700
Labels: addonmanager.kubernetes.io/mode=Reconcile
integration-test=storage-provisioner
Annotations: kubectl.kubernetes.io/last-applied-configuration={"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"labels":{"addonmanager.kubernetes.io/mode":"Reconcile","integration-test":"storage-provis...
Status: Pending
IP: 192.168.64.43
Containers:
storage-provisioner:
Container ID:
Image: gcr.io/k8s-minikube/storage-provisioner:v1.8.1
Image ID:
Port: <none>
Host Port: <none>
Command:
/storage-provisioner
State: Waiting
Reason: ErrImagePull
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/tmp from tmp (rw)
/var/run/secrets/kubernetes.io/serviceaccount from storage-provisioner-token-sb5hz (ro)
Conditions:
Type Status
Initialized True
Ready False
PodScheduled True
Volumes:
tmp:
Type: HostPath (bare host directory volume)
Path: /tmp
HostPathType: Directory
storage-provisioner-token-sb5hz:
Type: Secret (a volume populated by a Secret)
SecretName: storage-provisioner-token-sb5hz
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 1m default-scheduler Successfully assigned storage-provisioner to minikube
Normal SuccessfulMountVolume 1m kubelet, minikube MountVolume.SetUp succeeded for volume "tmp"
Normal SuccessfulMountVolume 1m kubelet, minikube MountVolume.SetUp succeeded for volume "storage-provisioner-token-sb5hz"
Normal Pulling 23s (x3 over 1m) kubelet, minikube pulling image "gcr.io/k8s-minikube/storage-provisioner:v1.8.1"
Warning Failed 21s (x3 over 1m) kubelet, minikube Failed to pull image "gcr.io/k8s-minikube/storage-provisioner:v1.8.1": rpc error: code = Unknown desc = failed to register layer: Error processing tar file(exit status 1): write /storage-provisioner: no space left on device
Warning Failed 21s (x3 over 1m) kubelet, minikube Error: ErrImagePull
Normal BackOff 7s (x3 over 1m) kubelet, minikube Back-off pulling image "gcr.io/k8s-minikube/storage-provisioner:v1.8.1"
Warning Failed 7s (x3 over 1m) kubelet, minikube Error: ImagePullBackOff
------------------------------------------------------------
...
So while the registry container starts up correctly, a few of the other minikube services (including DNS, the HTTP ingress service, etc.) begin to fail with errors such as: write /storage-provisioner: no space left on device. Despite allocating a 40 GB disk size to minikube, it seems as though minikube is trying to write to rootfs or devtmpfs (depending on the vm-driver), which has only about 1 GB of space.
$ df -h
Filesystem Size Used Avail Use% Mounted on
rootfs 919M 713M 206M 78% /
devtmpfs 919M 0 919M 0% /dev
tmpfs 996M 0 996M 0% /dev/shm
tmpfs 996M 8.9M 987M 1% /run
tmpfs 996M 0 996M 0% /sys/fs/cgroup
tmpfs 996M 8.0K 996M 1% /tmp
/dev/sda1 34G 1.3G 30G 4% /mnt/sda1
Is there a way to make minikube actually use the 34GB of space that was allocated to /mnt/sda1 instead of rootfs when pulling images and creating containers?
Thanks in advance for any help!
You need to configure your Minikube virtual machine to use /dev/sda1 instead of / for Docker. To log in to the VM, use the minikube ssh command.
Then you have two options:
1. Mount /dev/sda1 to /var/lib/docker, but don't forget to copy the content from the original /var/lib/docker to /mnt/sda1 first.
2. Reconfigure Docker to use /mnt/sda1 instead of /var/lib/docker for storing images (see the sketch below). Look through this link for more information about it.
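For option 2, a minimal sketch (assuming dockerd inside the minikube VM reads /etc/docker/daemon.json; note that changes made inside the VM may not survive a minikube delete):
# inside `minikube ssh`
sudo mkdir -p /mnt/sda1/docker
echo '{ "data-root": "/mnt/sda1/docker" }' | sudo tee /etc/docker/daemon.json
sudo systemctl restart docker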
You can also use the minikube --docker-opt option to set the --data-root option of the dockerd daemon running inside minikube. --docker-opt can be used as a pass-through for any parameter to dockerd.
For example, in the case you describe above it would look like:
minikube start --vm-driver=hyperkit --disk-size=40g --docker-opt="--data-root /mnt/sda1"
Keep in mind that if you try to modify an existing minikube cluster, you either have to copy /var/lib/docker to /mnt/sda1 (as the previous answer also suggested) before restarting, or delete and rebuild the cluster.
Update:
After experimentation, I noticed that the above solution will not work the first time you run minikube start as it somehow interferes with minikube's own core-system build and boot-up process.
In practice this means that you need to run minikube start at least once without the --docker-opt to build the core system and then re-run it with --docker-opt.
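In practice that is roughly (a sketch reusing the flags from above):
minikube start --vm-driver=hyperkit --disk-size=40g
# wait for the cluster to come up, then re-run with the Docker option:
minikube start --vm-driver=hyperkit --disk-size=40g --docker-opt="--data-root /mnt/sda1"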

Service never runs

I am using minikube on Linux to get started with Kubernetes. Following the examples in the README and using the none vm-driver, I do the following.
$ minikube start --vm-driver=none
Starting local Kubernetes v1.9.0 cluster...
Starting VM...
Getting VM IP address...
Moving files into cluster...
Setting up certs...
Connecting to cluster...
Setting up kubeconfig...
Starting cluster components...
Kubectl is now configured to use the cluster.
===================
WARNING: IT IS RECOMMENDED NOT TO RUN THE NONE DRIVER ON PERSONAL WORKSTATIONS
The 'none' driver will run an insecure kubernetes apiserver as root that may leave the host vulnerable to CSRF attacks
When using the none driver, the kubectl config and credentials generated will be root owned and will appear in the root home directory.
You will need to move the files to the appropriate location and then set the correct permissions. An example of this is below:
sudo mv /root/.kube $HOME/.kube # this will write over any previous configuration
sudo chown -R $USER $HOME/.kube
sudo chgrp -R $USER $HOME/.kube
sudo mv /root/.minikube $HOME/.minikube # this will write over any previous configuration
sudo chown -R $USER $HOME/.minikube
sudo chgrp -R $USER $HOME/.minikube
This can also be done automatically by setting the env var CHANGE_MINIKUBE_NONE_USER=true
Loading cached images from config file.
$ kubectl get nodes
No resources found.
$ kubectl run hello-minikube --image=k8s.gcr.io/echoserver:1.4 --port=8080
deployment "hello-minikube" created
$ kubectl expose deployment hello-minikube --type=NodePort
service "hello-minikube" exposed
$ kubectl get pod
NAME READY STATUS RESTARTS AGE
hello-minikube-c6c6764d-h64t8 0/1 Pending 0 3m
Now, the problem is that this pod remains Pending. It looks like there are no nodes to run it on, but I do not know why. Where am I going wrong?
EDIT: Here is the output of describing the pod.
$ kubectl describe pod hello-minikube-c6c6764d-h64t8
Name: hello-minikube-c6c6764d-h64t8
Namespace: default
Node: <none>
Labels: pod-template-hash=72723208
run=hello-minikube
Annotations: <none>
Status: Pending
IP:
Controlled By: ReplicaSet/hello-minikube-c6c6764d
Containers:
hello-minikube:
Image: k8s.gcr.io/echoserver:1.4
Port: 8080/TCP
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-dw4j7 (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
default-token-dw4j7:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-dw4j7
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 20h (x4 over 20h) default-scheduler no nodes available to schedule pods
Warning FailedScheduling 20h (x4 over 20h) default-scheduler no nodes available to schedule pods
Warning FailedScheduling 20h (x5 over 20h) default-scheduler no nodes available to schedule pods
Warning FailedScheduling 19h (x5 over 19h) default-scheduler no nodes available to schedule pods
Warning FailedScheduling 19h (x5 over 19h) default-scheduler no nodes available to schedule pods
Warning FailedScheduling 18h (x5 over 18h) default-scheduler no nodes available to schedule pods
Warning FailedScheduling 1s (x5 over 16s) default-scheduler no nodes available to schedule pods
Try running minikube status. If you get the error below, then run the following command; this fix worked for me:
sudo chown -R $USER /home/docker/.minikube/machines/minikube/config.json
The error from minikube status was:
"X Error getting host status: load: filestore: open /home/docker/.minikube/machines/minikube/config.json: permission denied
*
* Sorry that minikube crashed. If this was unexpected, we would love to hear from you:
- https://github.com/kubernetes/minikube/issues/new/choose"
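After the chown, minikube status and the commands from the question should show a healthy host and a node coming up, and the Pending pod should then get scheduled:
minikube status
kubectl get nodes
kubectl get pod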

failed to set up pod network: Unhandled Exception killed plugin

I'm trying to play with a Kubernetes 1.4 install using rkt containers on CoreOS beta (1185.1.0).
I have two CoreOS PC machines at home that are configured with etcd2 TLS certificates.
I patched the coreos-kubernetes automated generic install script to support etcd2 TLS certificates. The latest versions of the worker and controller install scripts are posted at https://github.com/kfirufk/coreos-kubernetes-multi-node-generic-install-script
I used the following environment variables for the controller CoreOS installation script (IP: 10.79.218.2, domain: coreos-2.tux-in.com):
ADVERTISE_IP=10.79.218.2
ETCD_ENDPOINTS="https://coreos-2.tux-in.com:2379,https://coreos-3.tux-in.com:2379"
K8S_VER=v1.4.1_coreos.0
HYPERKUBE_IMAGE_REPO=quay.io/coreos/hyperkube
POD_NETWORK=10.2.0.0/16
SERVICE_IP_RANGE=10.3.0.0/24
K8S_SERVICE_IP=10.3.0.1
DNS_SERVICE_IP=10.3.0.10
USE_CALICO=true
CONTAINER_RUNTIME=rkt
ETCD_CERT_FILE="/etc/ssl/etcd/etcd1.pem"
ETCD_KEY_FILE="/etc/ssl/etcd/etcd1-key.pem"
ETCD_TRUSTED_CA_FILE="/etc/ssl/etcd/ca.pem"
ETCD_CLIENT_CERT_AUTH=true
OVERWRITE_ALL_FILES=true
CONTROLLER_HOSTNAME="coreos-2.tux-in.com"
ETCD_CERT_ROOT_DIR="/etc/ssl/etcd"
ETCD_SCHEME="https"
ETCD_AUTHORITY="coreos-2.tux-in.com:2379"
IS_MASK_UPDATE_ENGINE=false
and these are the environment variables I used for the worker CoreOS installation script (IP: 10.79.218.3, domain: coreos-3.tux-in.com):
ETCD_AUTHORITY=coreos-3.tux-in.com:2379
ETCD_ENDPOINTS="https://coreos-2.tux-in.com:2379,https://coreos-3.tux-in.com:2379"
CONTROLLER_ENDPOINT=https://coreos-2.tux-in.com
K8S_VER=v1.4.1_coreos.0
HYPERKUBE_IMAGE_REPO=quay.io/coreos/hyperkube
DNS_SERVICE_IP=10.3.0.10
USE_CALICO=true
CONTAINER_RUNTIME=rkt
OVERWRITE_ALL_FILES=true
ADVERTISE_IP=10.79.218.3
ETCD_CERT_FILE="/etc/ssl/etcd/etcd2.pem"
ETCD_KEY_FILE="/etc/ssl/etcd/etcd2-key.pem"
ETCD_TRUSTED_CA_FILE="/etc/ssl/etcd/ca.pem"
ETCD_SCHEME="https"
IS_MASK_UPDATE_ENGINE=false
After installing Kubernetes on both machines and configuring kubectl properly, when I type kubectl get nodes I get:
NAME STATUS AGE
10.79.218.2 Ready,SchedulingDisabled 1h
10.79.218.3 Ready 1h
kubectl get pods --namespace=kube-system returns
NAME READY STATUS RESTARTS AGE
heapster-v1.2.0-3646253287-j951o 0/2 ContainerCreating 0 1d
kube-apiserver-10.79.218.2 1/1 Running 0 1d
kube-controller-manager-10.79.218.2 1/1 Running 0 1d
kube-dns-v20-u3pd0 0/3 ContainerCreating 0 1d
kube-proxy-10.79.218.2 1/1 Running 0 1d
kube-proxy-10.79.218.3 1/1 Running 0 1d
kube-scheduler-10.79.218.2 1/1 Running 0 1d
kubernetes-dashboard-v1.4.1-ehiez 0/1 ContainerCreating 0 1d
So heapster-v1.2.0-3646253287-j951o, kube-dns-v20-u3pd0 and kubernetes-dashboard-v1.4.1-ehiez are stuck in ContainerCreating status.
When I run kubectl describe on any of them, I basically get the same error: Error syncing pod, skipping: failed to SyncPod: failed to set up pod network: Unhandled Exception killed plugin.
For example, kubectl describe pods kubernetes-dashboard-v1.4.1-ehiez --namespace kube-system returns:
Name: kubernetes-dashboard-v1.4.1-ehiez
Namespace: kube-system
Node: 10.79.218.3/10.79.218.3
Start Time: Mon, 17 Oct 2016 23:31:43 +0300
Labels: k8s-app=kubernetes-dashboard
kubernetes.io/cluster-service=true
version=v1.4.1
Status: Pending
IP:
Controllers: ReplicationController/kubernetes-dashboard-v1.4.1
Containers:
kubernetes-dashboard:
Container ID:
Image: gcr.io/google_containers/kubernetes-dashboard-amd64:v1.4.1
Image ID:
Port: 9090/TCP
Limits:
cpu: 100m
memory: 50Mi
Requests:
cpu: 100m
memory: 50Mi
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Liveness: http-get http://:9090/ delay=30s timeout=30s period=10s #success=1 #failure=3
Volume Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-svbiv (ro)
Environment Variables: <none>
Conditions:
Type Status
Initialized True
Ready False
PodScheduled True
Volumes:
default-token-svbiv:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-svbiv
QoS Class: Guaranteed
Tolerations: CriticalAddonsOnly=:Exists
Events:
FirstSeen LastSeen Count From SubobjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
1d 25s 9350 {kubelet 10.79.218.3} Warning FailedSync Error syncing pod, skipping: failed to SyncPod: failed to set up pod network: Unhandled Exception killed plugin
I'm guessing that pod networking isn't working because of a faulty Calico configuration,
so I tried to install the calicoctl rkt container, but I had problems with that as well; that's a different Stack Overflow question :) (starting calicoctl container on coreos).
So I can't really check whether Calico works properly.
This is the calico-network systemd service file for the controller node:
[Unit]
Description=Calico per-host agent
Requires=network-online.target
After=network-online.target
[Service]
Slice=machine.slice
Environment=CALICO_DISABLE_FILE_LOGGING=true
Environment=HOSTNAME=10.79.218.3
Environment=IP=10.79.218.3
Environment=FELIX_FELIXHOSTNAME=10.79.218.3
Environment=CALICO_NETWORKING=true
Environment=NO_DEFAULT_POOLS=true
Environment=ETCD_ENDPOINTS=https://coreos-2.tux-in.com:2379,https://coreos-3.tux-in.com:2379
Environment=ETCD_AUTHORITY=coreos-3.tux-in.com:2379
Environment=ETCD_SCHEME=https
Environment=ETCD_CA_CERT_FILE=/etc/ssl/etcd/ca.pem
Environment=ETCD_CERT_FILE=/etc/ssl/etcd/etcd2.pem
Environment=ETCD_KEY_FILE=/etc/ssl/etcd/etcd2-key.pem
ExecStart=/usr/bin/rkt run --inherit-env --stage1-from-dir=stage1-fly.aci --volume=var-run-calico,kind=host,source=/var/run/calico --volume=modules,kind=host,source=/lib/modules,readOnly=false --mount=volume=modules,target=/lib/modules --volume=dns,kind=host,source=/etc/resolv.conf,readOnly=true --volume=etcd-tls-certs,kind=host,source=/etc/ssl/etcd,readOnly=true --mount=volume=dns,target=/etc/resolv.conf --mount=volume=etcd-tls-certs,target=/etc/ssl/etcd --mount=volume=var-run-calico,target=/var/run/calico --trust-keys-from-https quay.io/calico/node:v0.22.0
KillMode=mixed
Restart=always
TimeoutStartSec=0
[Install]
WantedBy=multi-user.target
And this is the calico-node service file for the worker node:
[Unit]
Description=Calico per-host agent
Requires=network-online.target
After=network-online.target
[Service]
Slice=machine.slice
Environment=CALICO_DISABLE_FILE_LOGGING=true
Environment=HOSTNAME=10.79.218.2
Environment=IP=10.79.218.2
Environment=FELIX_FELIXHOSTNAME=10.79.218.2
Environment=CALICO_NETWORKING=true
Environment=NO_DEFAULT_POOLS=false
Environment=ETCD_ENDPOINTS=https://coreos-2.tux-in.com:2379,https://coreos-3.tux-in.com:2379
ExecStart=/usr/bin/rkt run --inherit-env --stage1-from-dir=stage1-fly.aci --volume=var-run-calico,kind=host,source=/var/run/calico --volume=modules,kind=host,source=/lib/modules,readOnly=false --mount=volume=modules,target=/lib/modules --volume=dns,kind=host,source=/etc/resolv.conf,readOnly=true --volume=etcd-tls-certs,kind=host,source=/etc/ssl/etcd,readOnly=true --mount=volume=dns,target=/etc/resolv.conf --mount=volume=etcd-tls-certs,target=/etc/ssl/etcd --mount=volume=var-run-calico,target=/var/run/calico --trust-keys-from-https quay.io/calico/node:v0.22.0
KillMode=mixed
Environment=ETCD_CA_CERT_FILE=/etc/ssl/etcd/ca.pem
Environment=ETCD_CERT_FILE=/etc/ssl/etcd/etcd1.pem
Environment=ETCD_KEY_FILE=/etc/ssl/etcd/etcd1-key.pem
Restart=always
TimeoutStartSec=0
[Install]
WantedBy=multi-user.target
And this is the content of /etc/kubernetes/cni/net.d/10-calico.conf on the controller node:
{
  "name": "calico",
  "type": "flannel",
  "delegate": {
    "type": "calico",
    "etcd_endpoints": "https://coreos-2.tux-in.com:2379,https://coreos-3.tux-in.com:2379",
    "etcd_key_file": "/etc/ssl/etcd/etcd1-key.pem",
    "etcd_cert_file": "/etc/ssl/etcd/etcd1.pem",
    "etcd_ca_cert_file": "/etc/ssl/etcd/ca.pem",
    "log_level": "none",
    "log_level_stderr": "info",
    "hostname": "10.79.218.2",
    "policy": {
      "type": "k8s",
      "k8s_api_root": "http://127.0.0.1:8080/api/v1/"
    }
  }
}
And this is the /etc/kubernetes/cni/net.d/10-calico.conf of the worker node:
{
  "name": "calico",
  "type": "flannel",
  "delegate": {
    "type": "calico",
    "etcd_endpoints": "https://coreos-2.tux-in.com:2379,https://coreos-3.tux-in.com:2379",
    "etcd_key_file": "/etc/ssl/etcd/etcd2-key.pem",
    "etcd_cert_file": "/etc/ssl/etcd/etcd2.pem",
    "etcd_ca_cert_file": "/etc/ssl/etcd/ca.pem",
    "log_level": "debug",
    "log_level_stderr": "info",
    "hostname": "10.79.218.3",
    "policy": {
      "type": "k8s",
      "k8s_api_root": "https://coreos-2.tux-in.com:443/api/v1/",
      "k8s_client_key": "/etc/kubernetes/ssl/worker-key.pem",
      "k8s_client_certificate": "/etc/kubernetes/ssl/worker.pem"
    }
  }
}
I have no idea how to investigate the issue further.
I understand that since the new calico-cni was moved to Go, it doesn't store log information in a log file anymore, so I'm lost from here.
Any information regarding the issue would be greatly appreciated.
Thanks!
The "Unhandled Exception Killed plugin" error message is being generated by the Calico CNI plugin. From my experience that means it is unlikely to be something wrong with the calico-node.service causing that error.
As such it is probably something subtly wrong with you CNI network configuration. Could you share that file?
The CNI plugin should also emit more detailed logging information - either to stderr or to /var/log/calico/cni/calico.log based on how its configured in your CNI network config. I suspect that file will give you more clues into exactly what is going wrong.
All that said, the "Unhandled Exception" error is coming from the Python version of the CNI plugin, which is rather old at this point. I'd recommend upgrading to the latest stable release from here: https://github.com/projectcalico/calico-cni/releases
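As a concrete starting point (a sketch based on the configs shared above; the log file path is the one mentioned in this answer), you could raise the controller's CNI log level and watch the plugin log while Kubernetes retries creating the pod sandboxes:
# in /etc/kubernetes/cni/net.d/10-calico.conf on the controller, change
#   "log_level": "none",
# to
#   "log_level": "debug",
# then follow the plugin log while the stuck pods retry:
tail -f /var/log/calico/cni/calico.log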