Adding removed etcd member in Kubernetes master

I was following Kelsey Hightower's kubernetes-the-hard-way repo and successfully created a cluster with 3 master nodes and 3 worker nodes. Here are the problems I encountered when removing one of the etcd members and then adding it back, along with all the steps I used:
3 master nodes:
10.240.0.10 controller-0
10.240.0.11 controller-1
10.240.0.12 controller-2
Step 1:
isaac@controller-0:~$ sudo ETCDCTL_API=3 etcdctl member list --endpoints=https://127.0.0.1:2379 --cacert=/etc/etcd/ca.pem --cert=/etc/etcd/kubernetes.pem --key=/etc/etcd/kubernetes-key.pem
Result:
b28b52253c9d447e, started, controller-2, https://10.240.0.12:2380, https://10.240.0.12:2379
f98dc20bce6225a0, started, controller-0, https://10.240.0.10:2380, https://10.240.0.10:2379
ffed16798470cab5, started, controller-1, https://10.240.0.11:2380, https://10.240.0.11:2379
Step 2 (Remove etcd member of controller-2):
isaac@controller-0:~$ sudo ETCDCTL_API=3 etcdctl member remove b28b52253c9d447e --endpoints=https://127.0.0.1:2379 --cacert=/etc/etcd/ca.pem --cert=/etc/etcd/kubernetes.pem --key=/etc/etcd/kubernetes-key.pem
Step 3 (Add the member back):
isaac@controller-0:~$ sudo ETCDCTL_API=3 etcdctl member add controller-2 --peer-urls=https://10.240.0.12:2380 --endpoints=https://127.0.0.1:2379 --cacert=/etc/etcd/ca.pem --cert=/etc/etcd/kubernetes.pem --key=/etc/etcd/kubernetes-key.pem
Result:
Member 66d450d03498eb5c added to cluster 3e7cc799faffb625
ETCD_NAME="controller-2"
ETCD_INITIAL_CLUSTER="controller-2=https://10.240.0.12:2380,controller-0=https://10.240.0.10:2380,controller-1=https://10.240.0.11:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://10.240.0.12:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
Step 4 (run member list command):
isaac@controller-0:~$ sudo ETCDCTL_API=3 etcdctl member list --endpoints=https://127.0.0.1:2379 --cacert=/etc/etcd/ca.pem --cert=/etc/etcd/kubernetes.pem --key=/etc/etcd/kubernetes-key.pem
Result:
66d450d03498eb5c, unstarted, , https://10.240.0.12:2380,
f98dc20bce6225a0, started, controller-0, https://10.240.0.10:2380, https://10.240.0.10:2379
ffed16798470cab5, started, controller-1, https://10.240.0.11:2380, https://10.240.0.11:2379
Step 5 (Run the command to start etcd in controller-2):
isaac@controller-2:~$ sudo etcd --name controller-2 --listen-client-urls https://10.240.0.12:2379,http://127.0.0.1:2379 --advertise-client-urls https://10.240.0.12:2379 --listen-peer-urls https://10.240.0.12:2380 --initial-advertise-peer-urls https://10.240.0.12:2380 --initial-cluster-state existing --initial-cluster controller-0=http://10.240.0.10:2380,controller-1=http://10.240.0.11:2380,controller-2=http://10.240.0.12:2380 --ca-file /etc/etcd/ca.pem --cert-file /etc/etcd/kubernetes.pem --key-file /etc/etcd/kubernetes-key.pem
Result:
2019-06-09 13:10:14.958799 I | etcdmain: etcd Version: 3.3.9
2019-06-09 13:10:14.959022 I | etcdmain: Git SHA: fca8add78
2019-06-09 13:10:14.959106 I | etcdmain: Go Version: go1.10.3
2019-06-09 13:10:14.959177 I | etcdmain: Go OS/Arch: linux/amd64
2019-06-09 13:10:14.959237 I | etcdmain: setting maximum number of CPUs to 1, total number of available CPUs is 1
2019-06-09 13:10:14.959312 W | etcdmain: no data-dir provided, using default data-dir ./controller-2.etcd
2019-06-09 13:10:14.959435 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2019-06-09 13:10:14.959575 C | etcdmain: cannot listen on TLS for 10.240.0.12:2380: KeyFile and CertFile are not presented
Clearly, the etcd service did not start as expected, so I did some troubleshooting as below:
isaac@controller-2:~$ sudo systemctl status etcd
Result:
● etcd.service - etcd
   Loaded: loaded (/etc/systemd/system/etcd.service; enabled; vendor preset: enabled)
   Active: inactive (dead) since Sun 2019-06-09 13:06:55 UTC; 29min ago
     Docs: https://github.com/coreos
  Process: 1876 ExecStart=/usr/local/bin/etcd --name controller-2 --cert-file=/etc/etcd/kubernetes.pem --key-file=/etc/etcd/kubernetes-key.pem --peer-cert-file=/etc/etcd/kubernetes.pem --peer-key-file=/etc/etcd/kube
 Main PID: 1876 (code=exited, status=0/SUCCESS)
Jun 09 13:06:55 controller-2 etcd[1876]: stopped peer f98dc20bce6225a0
Jun 09 13:06:55 controller-2 etcd[1876]: stopping peer ffed16798470cab5...
Jun 09 13:06:55 controller-2 etcd[1876]: stopped streaming with peer ffed16798470cab5 (writer)
Jun 09 13:06:55 controller-2 etcd[1876]: stopped streaming with peer ffed16798470cab5 (writer)
Jun 09 13:06:55 controller-2 etcd[1876]: stopped HTTP pipelining with peer ffed16798470cab5
Jun 09 13:06:55 controller-2 etcd[1876]: stopped streaming with peer ffed16798470cab5 (stream MsgApp v2 reader)
Jun 09 13:06:55 controller-2 etcd[1876]: stopped streaming with peer ffed16798470cab5 (stream Message reader)
Jun 09 13:06:55 controller-2 etcd[1876]: stopped peer ffed16798470cab5
Jun 09 13:06:55 controller-2 etcd[1876]: failed to find member f98dc20bce6225a0 in cluster 3e7cc799faffb625
Jun 09 13:06:55 controller-2 etcd[1876]: forgot to set Type=notify in systemd service file?
I tried to start the etcd member using several different commands, but the etcd member on controller-2 still seems stuck in the unstarted state. May I know the reason for that? Any pointers would be highly appreciated! Thanks.

Turned out I solved the problem as follows (credit to Matthew):
Delete the etcd data directory with the following command:
rm -rf /var/lib/etcd/*
To fix the message cannot listen on TLS for 10.240.0.12:2380: KeyFile and CertFile are not presented, I revised the command used to start etcd as follows:
sudo etcd --name controller-2 --listen-client-urls https://10.240.0.12:2379,http://127.0.0.1:2379 --advertise-client-urls https://10.240.0.12:2379 --listen-peer-urls https://10.240.0.12:2380 --initial-advertise-peer-urls https://10.240.0.12:2380 --initial-cluster-state existing --initial-cluster controller-0=https://10.240.0.10:2380,controller-1=https://10.240.0.11:2380,controller-2=https://10.240.0.12:2380 --peer-trusted-ca-file /etc/etcd/ca.pem --cert-file /etc/etcd/kubernetes.pem --key-file /etc/etcd/kubernetes-key.pem --peer-cert-file /etc/etcd/kubernetes.pem --peer-key-file /etc/etcd/kubernetes-key.pem --data-dir /var/lib/etcd
A few points to note here:
The newly added arguments --cert-file and --key-file present the required certificate and key of controller-2.
The argument --peer-trusted-ca-file is also presented so that the x509 certificates presented by controller-0 and controller-1 can be checked against a known CA. If it is not presented, the error etcdserver: could not get cluster response from https://10.240.0.11:2380: Get https://10.240.0.11:2380/members: x509: certificate signed by unknown authority may be encountered.
The value given for the argument --initial-cluster needs to be in line with what is shown in the systemd unit file.
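For reference, the same flags can live in the systemd unit instead of being passed manually on the command line. Here is a rough sketch of the relevant fragment, reusing only the paths, names, and URLs from the working command above (the rest of the unit is omitted):
# /etc/systemd/system/etcd.service (fragment, sketch only)
[Service]
ExecStart=/usr/local/bin/etcd \
  --name controller-2 \
  --cert-file=/etc/etcd/kubernetes.pem \
  --key-file=/etc/etcd/kubernetes-key.pem \
  --peer-cert-file=/etc/etcd/kubernetes.pem \
  --peer-key-file=/etc/etcd/kubernetes-key.pem \
  --peer-trusted-ca-file=/etc/etcd/ca.pem \
  --initial-advertise-peer-urls https://10.240.0.12:2380 \
  --listen-peer-urls https://10.240.0.12:2380 \
  --listen-client-urls https://10.240.0.12:2379,http://127.0.0.1:2379 \
  --advertise-client-urls https://10.240.0.12:2379 \
  --initial-cluster-state existing \
  --initial-cluster controller-0=https://10.240.0.10:2380,controller-1=https://10.240.0.11:2380,controller-2=https://10.240.0.12:2380 \
  --data-dir=/var/lib/etcd
Restart=on-failure
RestartSec=5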

If you are re-adding a member, the easier solution is the following:
rm -rf /var/lib/etcd/*
kubeadm join phase control-plane-join etcd --control-plane
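Once the join completes, a check along these lines should show all three members started again (a sketch; the certificate paths below are the kubeadm defaults and may differ from the paths used in the question):
sudo ETCDCTL_API=3 etcdctl member list \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key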

Error: Unable to write pid file / mqtt broker - access from remote

I have been reading the Eclipse MQTT documentation and relevant posts about the MQTT broker failing to start, and have implemented the suggestions and ideas that seem relevant to my problem. However, as a newbie, I am now stuck and need more support to get the broker started and accessible remotely.
I'm using Raspberry Pi OS Bullseye & Mosquitto version 2.0.11
mosquitto.conf is created in /etc/mosquitto:
pid_file /var/run/mosquitto/mosquitto.pid
per_listener_settings true
persistence true
persistence_file mosquitto.db
persistence_location /var/lib/mosquitto/
log_dest file /var/log/mosquitto.log
include_dir /etc/mosquitto/conf.d
listener 1883 192.168.1.99
protocol mqtt
log_type all
acl_file /etc/mosquitto/acls
allow-anonymous false
connection_messages true
max_keepalive 10
log_timestamp true
log_dest topic
log_dest syslog
log_dest stdout
log_type all
password_file /etc/mosquitto/pwfile
and local.conf in /etc/mosquitto/conf.d to separate local access from remote access
allow_anonymous true
listener 1883 localhost
Updated /lib/systemd/system/mosquitto.service to:
ExecStartPre=/bin/mkdir -m 740 -p /var/run/mosquitto
ExecStartPre=/bin/chown mosquitto /var/run/mosquitto
(Have tried chown mosquitto:.., chown mosquitto:mosquitto.., chown -hR mosq... and chown -R mosq...)
Permissions of /var/run/mosquitto and mosquitto.pid:
drwxr----- 2 mosquitto root 60 Dec 16 10:14 .
drwxr-xr-x 33 root root 1000 Dec 16 14:46 ..
-rw-r--r-- 1 mosquitto mosquitto 4 Dec 16 10:14 mosquitto.pid
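For what it's worth, I understand that a drop-in along the following lines would let systemd create and own /run/mosquitto itself, instead of the ExecStartPre lines above (a sketch; it assumes the installed systemd supports RuntimeDirectory=):
# /etc/systemd/system/mosquitto.service.d/override.conf (hypothetical drop-in)
[Service]
# systemd creates /run/mosquitto owned by the service user before each start
RuntimeDirectory=mosquitto
RuntimeDirectoryMode=0750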
Broker is started with:
mosquitto -c /etc/mosquitto/mosquitto.conf -v
Error message returned:
1639655912: Loading config file /etc/mosquitto/conf.d/local.conf
2021-12-16|12:58:32: Error: Unable to write pid file
When I delete mosquitto.pid (with sudo) or rename its directory and restart the mosquitto daemon, a new mosquitto.pid is not created and I get the same error message as above.
Command "systemctl status mosquitto.service" returns:
Warning: The unit file, source configuration file or drop-ins of mosquitto.service
changed on disk. Run 'systemctl daemon-reload' to reload >
mosquitto.service - Mosquitto MQTT Broker
Loaded: loaded (/lib/systemd/system/mosquitto.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Thu 2021-12-16 10:14:03 CET; 2h 56min ago
Process: 5035 ExecStartPre=/bin/mkdir -m 740 -p /var/log/mosquitto (code=exited, status=0/SUCCESS)
Process: 5036 ExecStartPre=/bin/chown mosquitto /var/log/mosquitto (code=exited, status=0/SUCCESS)
Process: 5037 ExecStartPre=/bin/mkdir -m 740 -p /var/run/mosquitto (code=exited, status=0/SUCCESS)
Process: 5038 ExecStartPre=/bin/chown mosquitto /var/run/mosquitto (code=exited, status=0/SUCCESS)
Process: 5039 ExecStart=/usr/sbin/mosquitto -c /etc/mosquitto/mosquitto.conf (code=exited, status=3)
Main PID: 5039 (code=exited, status=3)
Dec 16 10:14:03 Pi4 systemd[1]: mosquitto.service: Scheduled restart job, restart counter is at 5.
Dec 16 10:14:03 Pi4 systemd[1]: Stopped Mosquitto MQTT Broker.
Dec 16 10:14:03 Pi4 systemd[1]: mosquitto.service: Start request repeated too quickly.
Dec 16 10:14:03 Pi4 systemd[1]: mosquitto.service: Failed with result 'exit-code'.
Dec 16 10:14:03 Pi4 systemd[1]: Failed to start Mosquitto MQTT Broker.
I appreciate any guidance or help
Generate a new client ID. I'm testing mine using MQTTLENS, which is a Chrome app. Once I did this, the problem stopped and everything is working OK.

Service starting gunicorn failing with "Start request repeated too quickly"

I'm trying to start a service that runs gunicorn as the backend server for Flask, but it's not working. Running nginx as the frontend server for React works fine.
Server:
Virtualization: vmware
Operating System: Red Hat Enterprise Linux 8.4 (Ootpa)
CPE OS Name: cpe:/o:redhat:enterprise_linux:8.4:GA
Kernel: Linux 4.18.0-305.3.1.el8_4.x86_64
Architecture: x86-64
Service file in /etc/systemd/system/myservice.service:
[Unit]
Description="Description"
After=network.target
[Service]
User=root
Group=root
WorkingDirectory=/home/project/app/api
ExecStart=/home/project/app/api/venv/bin/gunicorn -b 127.0.0.1:5000 api:app
Restart=always
[Install]
WantedBy=multi-user.target
/app/api:
-rwxr-xr-x. 1 root root 2018 Jun 9 20:06 api.py
drwxrwxr-x+ 5 root root 100 Jun 7 10:11 venv
Error message:
● myservice.service - "Description"
Loaded: loaded (/etc/systemd/system/myservice.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Thu 2021-06-10 19:01:01 CEST; 5s ago
Process: 18307 ExecStart=/home/project/app/api/venv/bin/gunicorn -b 127.0.0.1:5000 api:app (code=exited, status=203/EXEC)
Main PID: 18307 (code=exited, status=203/EXEC)
Jun 10 19:01:01 xxxx systemd[1]: myservice.service: Service RestartSec=100ms expired, scheduling restart.
Jun 10 19:01:01 xxxx systemd[1]: myservice.service: Scheduled restart job, restart counter is at 5.
Jun 10 19:01:01 xxxx systemd[1]: Stopped "Description".
Jun 10 19:01:01 xxxx systemd[1]: myservice.service: Start request repeated too quickly.
Jun 10 19:01:01 xxxx systemd[1]: myservice.service: Failed with result 'exit-code'.
Jun 10 19:01:01 xxxx systemd[1]: Failed to start "Description".
Tried, not working:
Adding Environment="PATH=/home/project/app/api/venv/bin" under [Service]
$ systemctl reset-failed myservice.service
$ systemctl daemon-reload
Reboot, ofc.
Tried, working:
Running (as root) /home/project/app/api/venv/bin/gunicorn -b 127.0.0.1:5000 api:app while in /app/api directory
Does anyone know how to fix this problem?
Typically enough, I figured it out shortly after posting this issue.
SELinux is messing with permissions for files and directories, so for anyone experiencing the same issue, make sure to test with the following alterations (as root):
$ setsebool -P httpd_can_network_connect on
$ chcon -Rt httpd_sys_content_t /path/to/your/Flask/dir
In my case: $ chcon -Rt httpd_sys_content_t /home/project/app/api
While this is NOT a permanent fix, it's worth a try. Check out the SELinux docs for more permanent solutions.
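If you'd rather build a local policy from the actual denials than relabel the whole directory, something along these lines is the usual route (a sketch; it assumes the audit and policycoreutils tools are installed, and gunicorn_local is just an arbitrary module name):
# show recent SELinux denials involving gunicorn
ausearch -m AVC -ts recent | grep gunicorn
# generate and install a local policy module from those denials
ausearch -m AVC -ts recent | audit2allow -M gunicorn_local
semodule -i gunicorn_local.pp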

Kubelet service not starting up - KUBELET_EXTRA_ARGS (code=exited, status=255)

I am trying to start the kubelet service on a worker node (the 3rd worker node). At the moment, I can't quite tell what the error is; I do, however, see F0716 16:42:20.047413 556 server.go:155] unknown command: $KUBELET_EXTRA_ARGS in the output given by sudo systemctl status kubelet -l:
[svc.jenkins@node6 ~]$ sudo systemctl status kubelet -l
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: activating (auto-restart) (Result: exit-code) since Mon 2018-07-16 16:42:20 CDT; 4s ago
Docs: http://kubernetes.io/docs/
Process: 556 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_SYSTEM_PODS_ARGS $KUBELET_NETWORK_ARGS $KUBELET_DNS_ARGS $KUBELET_AUTHZ_ARGS $KUBELET_CADVISOR_ARGS $KUBELET_CGROUP_ARGS $KUBELET_CERTIFICATE_ARGS $KUBELET_EXTRA_ARGS (code=exited, status=255)
Main PID: 556 (code=exited, status=255)
Jul 16 16:42:20 node6 kubelet[556]: --tls-cert-file string File containing x509 Certificate used for serving HTTPS (with intermediate certs, if any, concatenated after server cert). If --tls-cert-file and --tls-private-key-file are not provided, a self-signed certificate and key are generated for the public address and saved to the directory passed to --cert-dir. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
Jul 16 16:42:20 node6 kubelet[556]: --tls-cipher-suites strings Comma-separated list of cipher suites for the server. If omitted, the default Go cipher suites will be used. Possible values: TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_RC4_128_SHA,TLS_ECDHE_RSA_WITH_3DES_EDE_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_RC4_128_SHA,TLS_RSA_WITH_3DES_EDE_CBC_SHA,TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_128_CBC_SHA256,TLS_RSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_256_CBC_SHA,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_RC4_128_SHA (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
Jul 16 16:42:20 node6 kubelet[556]: --tls-min-version string Minimum TLS version supported. Possible values: VersionTLS10, VersionTLS11, VersionTLS12 (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
Jul 16 16:42:20 node6 kubelet[556]: --tls-private-key-file string File containing x509 private key matching --tls-cert-file. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
Jul 16 16:42:20 node6 kubelet[556]: -v, --v Level log level for V logs
Jul 16 16:42:20 node6 kubelet[556]: --version version[=true] Print version information and quit
Jul 16 16:42:20 node6 kubelet[556]: --vmodule moduleSpec comma-separated list of pattern=N settings for file-filtered logging
Jul 16 16:42:20 node6 kubelet[556]: --volume-plugin-dir string The full path of the directory in which to search for additional third party volume plugins (default "/usr/libexec/kubernetes/kubelet-plugins/volume/exec/")
Jul 16 16:42:20 node6 kubelet[556]: --volume-stats-agg-period duration Specifies interval for kubelet to calculate and cache the volume disk usage for all pods and volumes. To disable volume calculations, set to 0. (default 1m0s) (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
Jul 16 16:42:20 node6 kubelet[556]: F0716 16:42:20.047413 556 server.go:155] unknown command: $KUBELET_EXTRA_ARGS
Here is the configuration for my drop-in located at /etc/systemd/system/kubelet.service.d/10-kubeadm.conf (it is the same on the other nodes that are in a working state):
[Service]
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf"
Environment="KUBELET_SYSTEM_PODS_ARGS=--pod-manifest-path=/etc/kubernetes/manifests --allow-privileged=true"
Environment="KUBELET_NETWORK_ARGS=--network-plugin=cni --cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/opt/cni/bin"
Environment="KUBELET_DNS_ARGS=--cluster-dns=10.96.0.10 --cluster-domain=cluster.local"
Environment="KUBELET_AUTHZ_ARGS=--authorization-mode=Webhook --client-ca-file=/etc/kubernetes/pki/ca.crt"
Environment="KUBELET_CADVISOR_ARGS=--cadvisor-port=0"
Environment="KUBELET_CGROUP_ARGS=--cgroup-driver=cgroupfs"
Environment="KUBELET_CERTIFICATE_ARGS=--rotate-certificates=true --cert-dir=/data01/kubelet/pki"
Environment="KUBELET_EXTRA_ARGS=$KUBELET_EXTRA_ARGS --root-dir=/data01/kubelet"
ExecStart=
ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_SYSTEM_PODS_ARGS $KUBELET_NETWORK_ARGS $KUBELET_DNS_ARGS $KUBELET_AUTHZ_ARGS $KUBELET_CADVISOR_ARGS $KUBELET_CGROUP_ARGS $KUBELET_CERTIFICATE_ARGS $KUBELET_EXTRA_ARGS
I just need help diagnosing the issue that is preventing it from starting so that it can be resolved. Thanks in advance :)
EDIT:
[svc.jenkins@node6 ~]$ kubelet --version
Kubernetes v1.10.4
Currently, a slightly different approach is used with systemd: all options are put into a separate file, and the systemd unit refers to that file.
In your case, it would be something like this:
/etc/sysconfig/kubelet
----------------------
KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf
KUBELET_SYSTEM_PODS_ARGS=--pod-manifest-path=/etc/kubernetes/manifests --allow-privileged=true
KUBELET_NETWORK_ARGS=--network-plugin=cni --cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/opt/cni/bin
KUBELET_DNS_ARGS=--cluster-dns=10.96.0.10 --cluster-domain=cluster.local
KUBELET_AUTHZ_ARGS=--authorization-mode=Webhook --client-ca-file=/etc/kubernetes/pki/ca.crt
KUBELET_CADVISOR_ARGS=--cadvisor-port=0
KUBELET_CGROUP_ARGS=--cgroup-driver=cgroupfs
KUBELET_CERTIFICATE_ARGS=--rotate-certificates=true --cert-dir=/data01/kubelet/pki
KUBELET_EXTRA_ARGS=$KUBELET_EXTRA_ARGS --root-dir=/data01/kubelet
/etc/systemd/system/kubelet.service.d/10-kubeadm.conf
-----------------------------------------------------
...
[Service]
EnvironmentFile=/etc/sysconfig/kubelet
ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_SYSTEM_PODS_ARGS $KUBELET_NETWORK_ARGS $KUBELET_DNS_ARGS $KUBELET_AUTHZ_ARGS $KUBELET_CADVISOR_ARGS $KUBELET_CGROUP_ARGS $KUBELET_CERTIFICATE_ARGS $KUBELET_EXTRA_ARGS
The variables in the systemd config file can be written as ${VARIABLE} or $VARIABLE; both forms should work fine.
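After moving the options into /etc/sysconfig/kubelet and updating the drop-in, something like the following should pick up the change (a sketch):
sudo systemctl daemon-reload
sudo systemctl restart kubelet
sudo systemctl status kubelet -l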

rkt discovery failed error while bringing flannel up (Kubernetes setup)

We are trying to set up a Kubernetes cluster on 3 nodes with CoreOS, following the official step-by-step documentation - https://coreos.com/kubernetes/docs/latest/deploy-master.html
The servers are behind a company proxy and have a proxy configuration defined in both
/etc/systemd/system/docker.service.d
/etc/systemd/system/flanneld.service.d
The following is picked up in
systemctl cat flanneld
# /usr/lib/systemd/system/flanneld.service
[Unit]
Description=flannel - Network fabric for containers (System Application Container)
Documentation=https://github.com/coreos/flannel
After=etcd.service etcd2.service etcd-member.service
Before=docker.service flannel-docker-opts.service
Requires=flannel-docker-opts.service
[Service]
Type=notify
Restart=always
RestartSec=10s
LimitNOFILE=40000
LimitNPROC=1048576
Environment="FLANNEL_IMAGE_TAG=v0.6.2"
Environment="FLANNEL_OPTS=--ip-masq=true"
Environment="RKT_RUN_ARGS=--uuid-file-save=/var/lib/coreos/flannel-wrapper.uuid"
EnvironmentFile=-/run/flannel/options.env
ExecStartPre=/sbin/modprobe ip_tables
ExecStartPre=/usr/bin/mkdir --parents /var/lib/coreos /run/flannel
ExecStartPre=-/usr/bin/rkt rm --uuid-file=/var/lib/coreos/flannel-wrapper.uuid
ExecStart=/usr/lib/coreos/flannel-wrapper $FLANNEL_OPTS
ExecStop=-/usr/bin/rkt stop --uuid-file=/var/lib/coreos/flannel-wrapper.uuid
[Install]
WantedBy=multi-user.target
# /etc/systemd/system/flanneld.service.d/40-ExecStartPre-symlink.conf
[Service]
ExecStartPre=/usr/bin/ln -sf /etc/flannel/options.env /run/flannel/options.env
# /etc/systemd/system/flanneld.service.d/proxy.conf
[Service]
Environment="HTTP_PROXY=http://10.140.65.114:8080/"
Environment="HTTPS_PROXY=http://10.140.65.114:8080/"
and
systemctl cat docker
# /usr/lib/systemd/system/docker.service
[Unit]
Description=Docker Application Container Engine
Documentation=http://docs.docker.com
After=containerd.service docker.socket early-docker.target network.target
Requires=containerd.service docker.socket early-docker.target
[Service]
Type=notify
EnvironmentFile=-/run/flannel/flannel_docker_opts.env
# the default is not to use systemd for cgroups because the delegate issues still
# exists and systemd currently does not support the cgroup feature set required
# for containers run by docker
ExecStart=/usr/lib/coreos/dockerd --host=fd:// --containerd=/var/run/docker/libcontainerd/docker-containerd.sock $DOCKER_OPTS $DOCKER_CGROUPS $DOCKER_OPT_BIP $DOCKER_OP
ExecReload=/bin/kill -s HUP $MAINPID
LimitNOFILE=1048576
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNPROC=infinity
LimitCORE=infinity
# Uncomment TasksMax if your systemd version supports it.
# Only systemd 226 and above support this version.
TasksMax=infinity
TimeoutStartSec=0
# set delegate yes so that systemd does not reset the cgroups of docker containers
Delegate=yes
[Install]
WantedBy=multi-user.target
# /etc/systemd/system/docker.service.d/40-flannel.conf
[Unit]
Requires=flanneld.service
After=flanneld.service
[Service]
EnvironmentFile=/etc/kubernetes/cni/docker_opts_cni.env
# /etc/systemd/system/docker.service.d/http-proxy.conf
[Service]
Environment="HTTP_PROXY=http://10.140.65.114:8080/"
Environment="HTTPS_PROXY=http://10.140.65.114:8080/"
# /etc/systemd/system/flanneld.service.d/40-ExecStartPre-symlink.conf
[Service]
ExecStartPre=/usr/bin/ln -sf /etc/flannel/options.env /run/flannel/options.env
# /etc/systemd/system/flanneld.service.d/proxy.conf
[Service]
Environment="HTTP_PROXY=http://10.140.65.114:8080/"
Environment="HTTPS_PROXY=http://10.140.65.114:8080/"
After running systemctl daemon-reload and systemctl start flanneld, we get the following error:
Feb 16 19:50:40 localhost systemd[1]: Starting flannel - Network fabric for containers (System Application Container)...
-- Subject: Unit flanneld.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit flanneld.service has begun starting up.
Feb 16 19:50:40 localhost rkt[52933]: rm: cannot get pod: no matches found for "26778eb4-9d8a-4d3c-9bb7-6ffb13a55d6a"
Feb 16 19:50:40 localhost rkt[52933]: rm: failed to remove one or more pods
Feb 16 19:50:40 localhost flannel-wrapper[52947]: + exec /usr/bin/rkt run --uuid-file-save=/var/lib/coreos/flannel-wrapper.uuid --trust-keys-from-https --mount volume=notify,target=/run/systemd/notify --volume notify,kind=host,source=/run/systemd/notify --set-env=NOTIFY_SOCKET=/run/systemd/notify --net=host --volume run-flannel,kind=host,source=/run/flannel,readOnly=false --volume etc-ssl-certs,kind=host,source=/usr/share/ca-certificates,readOnly=true --volume usr-share-certs,kind=host,source=/usr/share/ca-certificates,readOnly=true --volume etc-hosts,kind=host,source=/etc/hosts,readOnly=true --volume etc-resolv,kind=host,source=/etc/resolv.conf,readOnly=true --mount volume=run-flannel,target=/run/flannel --mount volume=etc-ssl-certs,target=/etc/ssl/certs --mount volume=usr-share-certs,target=/usr/share/ca-certificates --mount volume=etc-hosts,target=/etc/hosts --mount volume=etc-resolv,target=/etc/resolv.conf --inherit-env --stage1-from-dir=stage1-fly.aci quay.io/coreos/flannel:v0.6.2 -- --ip-masq=true
Feb 16 19:50:41 localhost sudo[52978]: admin : TTY=pts/1 ; PWD=/home/admin ; USER=root ; COMMAND=/bin/journalctl -e -u kubelet
Feb 16 19:50:41 localhost sudo[52978]: pam_unix(sudo:session): session opened for user root by admin(uid=0)
Feb 16 19:50:41 localhost sudo[52978]: pam_systemd(sudo:session): Cannot create session: Already running in a session
Feb 16 19:50:41 localhost sudo[52978]: pam_unix(sudo:session): session closed for user root
Feb 16 19:50:42 localhost flannel-wrapper[52947]: image: keys already exist for prefix "quay.io/coreos/flannel", not fetching again
Feb 16 19:50:43 localhost sudo[52990]: admin : TTY=pts/1 ; PWD=/home/admin ; USER=root ; COMMAND=/bin/journalctl -e -u kubelet
Feb 16 19:50:43 localhost sudo[52990]: pam_unix(sudo:session): session opened for user root by admin(uid=0)
Feb 16 19:50:43 localhost sudo[52990]: pam_systemd(sudo:session): Cannot create session: Already running in a session
Feb 16 19:50:43 localhost sudo[52990]: pam_unix(sudo:session): session closed for user root
Feb 16 19:50:44 localhost flannel-wrapper[52947]: Downloading signature: 0 B/473 B
Feb 16 19:50:44 localhost flannel-wrapper[52947]: Downloading signature: 473 B/473 B
Feb 16 19:50:45 localhost flannel-wrapper[52947]: Downloading signature: 473 B/473 B
Feb 16 19:50:45 localhost flannel-wrapper[52947]: run: Get https://quay-registry.s3.amazonaws.com/sharedimages/36acf4f7-a5bd-470b-9a44-13cbd244b571/layer?Signature=v8rQghQZR0k%2B1UxDG8oGw89vTqY%3D&Expires=1487255465&AWSAccessKeyId=AKIAJWZWUIS24TWSMWRA: Blocked site:
Feb 16 19:50:45 localhost systemd[1]: flanneld.service: Main process exited, code=exited, status=254/n/a
Feb 16 19:50:45 localhost rkt[52993]: stop: cannot get pod: no matches found for "26778eb4-9d8a-4d3c-9bb7-6ffb13a55d6a"
Feb 16 19:50:45 localhost rkt[52993]: stop: failed to stop 1 pod(s)
Feb 16 19:50:45 localhost systemd[1]: Failed to start flannel - Network fabric for containers (System Application Container).
-- Subject: Unit flanneld.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit flanneld.service has failed.
--
-- The result is failed.
Feb 16 19:50:45 localhost systemd[1]: flanneld.service: Unit entered failed state.
Feb 16 19:50:45 localhost systemd[1]: flanneld.service: Failed with result 'exit-code'.
Feb 16 19:50:45 localhost systemd[1]: Starting flannel docker export service - Network fabric for containers (System Application Container)...
-- Subject: Unit flannel-docker-opts.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit flannel-docker-opts.service has begun starting up.
Feb 16 19:50:45 localhost sudo[53003]: admin : TTY=pts/1 ; PWD=/home/admin ; USER=root ; COMMAND=/bin/journalctl -e -u kubelet
Feb 16 19:50:45 localhost sudo[53003]: pam_unix(sudo:session): session opened for user root by admin(uid=0)
Feb 16 19:50:45 localhost sudo[53003]: pam_systemd(sudo:session): Cannot create session: Already running in a session
Feb 16 19:50:45 localhost sudo[53003]: pam_unix(sudo:session): session closed for user root
Feb 16 19:50:45 localhost rkt[53000]: rm: cannot get pod: UUID cannot be empty
Feb 16 19:50:45 localhost rkt[53000]: rm: failed to remove one or more pods
Feb 16 19:50:45 localhost flannel-wrapper[53019]: + exec /usr/bin/rkt run --uuid-file-save=/var/lib/coreos/flannel-wrapper2.uuid --trust-keys-from-https --net=host --volume run-flannel,kind=host,source=/run/flannel,readOnly=false --volume etc-ssl-certs,kind=host,source=/usr/share/ca-certificates,readOnly=true --volume usr-share-certs,kind=host,source=/usr/share/ca-certificates,readOnly=true --volume etc-hosts,kind=host,source=/etc/hosts,readOnly=true --volume etc-resolv,kind=host,source=/etc/resolv.conf,readOnly=true --mount volume=run-flannel,target=/run/flannel --mount volume=etc-ssl-certs,target=/etc/ssl/certs --mount volume=usr-share-certs,target=/usr/share/ca-certificates --mount volume=etc-hosts,target=/etc/hosts --mount volume=etc-resolv,target=/etc/resolv.conf --inherit-env --stage1-from-dir=stage1-fly.aci quay.io/coreos/flannel:v0.6.2 --exec=/opt/bin/mk-docker-opts.sh -- -d /run/flannel/flannel_docker_opts.env -i
Feb 16 19:50:46 localhost flannel-wrapper[53019]: run: discovery failed
Feb 16 19:50:46 localhost systemd[1]: flannel-docker-opts.service: Main process exited, code=exited, status=254/n/a
Feb 16 19:50:46 localhost systemd[1]: Failed to start flannel docker export service - Network fabric for containers (System Application Container).
-- Subject: Unit flannel-docker-opts.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit flannel-docker-opts.service has failed.
--
-- The result is failed.
Feb 16 19:50:46 localhost systemd[1]: flannel-docker-opts.service: Unit entered failed state.
Feb 16 19:50:46 localhost systemd[1]: flannel-docker-opts.service: Failed with result 'exit-code'.
We tried a different document, https://www.upcloud.com/support/deploy-kubernetes-coreos/; following it, we get the same type of error while starting kubelet.
It seems to be a problem with rkt and the quay registry behind the company proxy.
Let us know if we missed something or configured something wrong.
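For reference, one check we can run is whether the proxy variables from the drop-ins actually reach the units, e.g. (a sketch):
# show the Environment= settings systemd has applied to each unit
sudo systemctl show flanneld --property=Environment
sudo systemctl show docker --property=Environment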
Can you please try running
$ sudo rkt fetch quay.io/coreos/flannel:v0.6.2
first in the shell?
I believe the issue is due to either running an HTTPS proxy over HTTP, or rkt fetch running as an unprivileged user and not inheriting the system environment variables.
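If it is the environment, running the fetch with the proxy variables set explicitly on the command line (values copied from the drop-ins above; a sketch, not a permanent fix) should tell you quickly:
sudo HTTP_PROXY=http://10.140.65.114:8080/ HTTPS_PROXY=http://10.140.65.114:8080/ \
  rkt fetch quay.io/coreos/flannel:v0.6.2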

kube-apiserver unable to communicate with TLS enabled etcd

I'm trying to set up a kubernetes cluster on some raspberry pis. I have successfully set up an etcd cluster with TLS enabled, and I can access this cluster via etcdctl and curl.
However, when I try to run kube-apiserver with the same ca file, I get messages saying that the etcd cluster is misconfigured or unavailable.
My question is, why are curl and etcdctl able to view cluster health and add keys with the same ca file that kube-apiserver is trying to use, but the kube-apiserver cannot?
When I run kube-apiserver and hit 127.0.0.1 over HTTP, not HTTPS, I can start the api server.
If this and the information below are not enough to understand the problem, please let me know. I'm not experienced with TLS/x509 certificates at all. I've been using Kelsey Hightower's Kubernetes The Hard Way mixed with the CoreOS docs for spinning up a Kubernetes cluster, as well as looking at GitHub issues and the like.
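One sanity check I plan to run is verifying the client certificate against the CA I'm handing to kube-apiserver and inspecting its Subject Alternative Names, e.g. (a sketch using the paths from my setup):
# does the etcd client cert chain to the CA passed via --etcd-cafile?
openssl verify -CAfile /etc/etcd/ca.pem /etc/etcd/etcd.pem
# which hostnames/IPs is the certificate actually valid for?
openssl x509 -in /etc/etcd/etcd.pem -noout -text | grep -A1 'Subject Alternative Name'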
Here is my etcd unit file:
[Unit]
Description=etcd
Documentation=https://github.com/coreos/etcd
[Service]
Environment=ETCD_UNSUPPORTED_ARCH=arm
ExecStart=/usr/bin/etcd \
--name etcd-master1 \
--cert-file=/etc/etcd/etcd.pem \
--key-file=/etc/etcd/etcd-key.pem \
--peer-cert-file=/etc/etcd/etcd.pem \
--peer-key-file=/etc/etcd/etcd-key.pem \
--trusted-ca-file=/etc/etcd/ca.pem \
--peer-trusted-ca-file=/etc/etcd/ca.pem \
--initial-advertise-peer-urls=https://10.0.1.200:2380 \
--listen-peer-urls https://10.0.1.200:2380 \
--listen-client-urls https://10.0.1.200:2379,http://127.0.0.1:2379 \
--advertise-client-urls https://10.0.1.200:2379 \
--initial-cluster-token etcd-cluster-1 \
--initial-cluster etcd-master1=https://10.0.1.200:2380,etcd-master2=https://10.0.1.201:2380 \
--initial-cluster-state new \
--data-dir=var/lib/etcd \
--debug
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
Here is the kube-apiserver command I'm trying to run:
#!/bin/bash
./kube-apiserver \
--etcd-cafile=/etc/etcd/ca.pem \
--etcd-certfile=/etc/etcd/etcd.pem \
--etcd-keyfile=/etc/etcd/etcd-key.pem \
--etcd-servers=https://10.0.1.200:2379,https://10.0.1.201:2379 \
--service-cluster-ip-range=10.32.0.0/24
Here is some output of that attempt. I think it's kind of weird that it is trying to list the etcd node, but nothing is printed out:
deploy@master1:~$ sudo ./run_kube_apiserver.sh
I1210 00:11:35.096887 20480 config.go:499] Will report 10.0.1.200 as public IP address.
I1210 00:11:35.842049 20480 trace.go:61] Trace "List *api.PodTemplateList" (started 2016-12-10 00:11:35.152287704 -0500 EST):
[134.479µs] [134.479µs] About to list etcd node
[688.376235ms] [688.241756ms] Etcd node listed
[689.062689ms] [686.454µs] END
E1210 00:11:35.860221 20480 cacher.go:261] unexpected ListAndWatch error: pkg/storage/cacher.go:202: Failed to list *api.PodTemplate: client: etcd cluster is unavailable or misconfigured
I1210 00:11:36.588511 20480 trace.go:61] Trace "List *api.LimitRangeList" (started 2016-12-10 00:11:35.273714755 -0500 EST):
[184.478µs] [184.478µs] About to list etcd node
[1.314010127s] [1.313825649s] Etcd node listed
[1.314362833s] [352.706µs] END
E1210 00:11:36.596092 20480 cacher.go:261] unexpected ListAndWatch error: pkg/storage/cacher.go:202: Failed to list *api.LimitRange: client: etcd cluster is unavailable or misconfigured
I1210 00:11:37.286714 20480 trace.go:61] Trace "List *api.ResourceQuotaList" (started 2016-12-10 00:11:35.325895387 -0500 EST):
[133.958µs] [133.958µs] About to list etcd node
[1.96003213s] [1.959898172s] Etcd node listed
[1.960393274s] [361.144µs] END
A successful cluster-health query:
deploy@master1:~$ sudo etcdctl --cert-file /etc/etcd/etcd.pem --key-file /etc/etcd/etcd-key.pem --ca-file /etc/etcd/ca.pem cluster-health
member 133c48556470c88d is healthy: got healthy result from https://10.0.1.200:2379
member 7acb9583fc3e7976 is healthy: got healthy result from https://10.0.1.201:2379
I am also seeing a lot of timeouts on the etcd servers themselves trying to send heartbeats back:
Dec 10 00:19:56 master1 etcd[19308]: failed to send out heartbeat on time (exceeded the 100ms timeout for 790.808604ms)
Dec 10 00:19:56 master1 etcd[19308]: server is likely overloaded
Dec 10 00:22:40 master1 etcd[19308]: failed to send out heartbeat on time (exceeded the 100ms timeout for 122.586925ms)
Dec 10 00:22:40 master1 etcd[19308]: server is likely overloaded
Dec 10 00:22:41 master1 etcd[19308]: failed to send out heartbeat on time (exceeded the 100ms timeout for 551.618961ms)
Dec 10 00:22:41 master1 etcd[19308]: server is likely overloaded
I can still do etcd operations like gets and puts, but I'm wondering if this could be a contributing factor. Can I tell the kube-apiserver to wait longer for etcd? I've been trying to figure this out myself, but IMO the technical parts of the Kubernetes components aren't really well documented, and a lot of the examples are very turnkey, without really explaining what everything is doing and why. I can find all kinds of diagrams and blog posts about high-level stuff, but things like how to run the actual binaries, and which flags are and are not required, are kind of lacking.
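One thing I did find in the etcd tuning docs is that the heartbeat interval and election timeout can be raised for slow hardware, so I may try adding flags along these lines to the etcd ExecStart above (a sketch; the values are illustrative, not recommendations):
  --heartbeat-interval=500 \
  --election-timeout=2500 \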