I have a preemptible node pool of size 1 on GKE. I've been running this node pool with size 1 for almost a month now. Every day the node restarts after 24 hours and rejoins the cluster. Today it restarted but did not rejoin the cluster.
Instead, I noticed that according to gcloud compute instances list the underlying instance was running but not included in the output of kubectl get node. I increased the node pool size to 2, whereupon a second instance was launched. That node immediately joined my GKE cluster and pods were scheduled onto it. The first node is still running according to gcloud, but it won't join the cluster.
What's going on? How can I debug this this problem?
Update:
I SSHed into the instance and was immediately greeted with this excellent error message:
Broken (or in progress) Kubernetes node setup! Check the cluster initialization status
using the following commands:
Master instance:
- sudo systemctl status kube-master-installation
- sudo systemctl status kube-master-configuration
Node instance:
- sudo systemctl status kube-node-installation
- sudo systemctl status kube-node-configuration
The results of sudo systemctl status kube-node-installation:
goto mark: ● kube-node-installation.service - Download and install k8s binaries and configurations
Loaded: loaded (/etc/systemd/system/kube-node-installation.service; enabled; vendor preset: disabled)
Active: active (exited) since Thu 2017-12-28 21:08:53 UTC; 6h ago
Process: 945 ExecStart=/home/kubernetes/bin/configure.sh (code=exited, status=0/SUCCESS)
Process: 941 ExecStartPre=/bin/chmod 544 /home/kubernetes/bin/configure.sh (code=exited, status=0/SUCCESS)
Process: 937 ExecStartPre=/usr/bin/curl --fail --retry 5 --retry-delay 3 --silent --show-error -H X-Google-Metadata-Request: True -o /home/kubernetes/bin/configure.sh http://metadata.google.internal/com
puteMetadata/v1/instance/attributes/configure-sh (code=exited, status=0/SUCCESS)
Process: 933 ExecStartPre=/bin/mount -o remount,exec /home/kubernetes/bin (code=exited, status=0/SUCCESS)
Process: 930 ExecStartPre=/bin/mount --bind /home/kubernetes/bin /home/kubernetes/bin (code=exited, status=0/SUCCESS)
Process: 925 ExecStartPre=/bin/mkdir -p /home/kubernetes/bin (code=exited, status=0/SUCCESS)
Main PID: 945 (code=exited, status=0/SUCCESS)
Tasks: 0 (limit: 4915)
Memory: 0B
CPU: 0
CGroup: /system.slice/kube-node-installation.service
Dec 28 21:08:52 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: Downloading node problem detector.
Dec 28 21:08:52 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: % Total % Received % Xferd Average Speed Time Time Time Current
Dec 28 21:08:52 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: Dload Upload Total Spent Left Speed
Dec 28 21:08:52 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: [158B blob data]
Dec 28 21:08:52 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: == Downloaded https://storage.googleapis.com/kubernetes-release/node-problem-detector/node-problem-detector-v0.4.1.tar.gz (SHA1 = a57a3fe
64cab8a18ec654f5cef0aec59dae62568) ==
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: cni-0799f5732f2a11b329d9e3d51b9c8f2e3759f2ff.tar.gz is preloaded.
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: kubernetes-manifests.tar.gz is preloaded.
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: mounter is preloaded.
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: Done for installing kubernetes files
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc systemd[1]: Started Download and install k8s binaries and configurations.
And the result of sudo systemctl status kube-node-configuration:
● kube-node-configuration.service - Configure kubernetes node
Loaded: loaded (/etc/systemd/system/kube-node-configuration.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Thu 2017-12-28 21:08:53 UTC; 6h ago
Process: 994 ExecStart=/home/kubernetes/bin/configure-helper.sh (code=exited, status=4)
Process: 990 ExecStartPre=/bin/chmod 544 /home/kubernetes/bin/configure-helper.sh (code=exited, status=0/SUCCESS)
Main PID: 994 (code=exited, status=4)
CPU: 33ms
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc systemd[1]: Starting Configure kubernetes node...
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[994]: Start to configure instance for kubernetes
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[994]: Configuring IP firewall rules
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[994]: Another app is currently holding the xtables lock. Perhaps you want to use the -w option?
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[994]: Another app is currently holding the xtables lock. Perhaps you want to use the -w option?
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[994]: Another app is currently holding the xtables lock. Perhaps you want to use the -w option?
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc systemd[1]: kube-node-configuration.service: Main process exited, code=exited, status=4/NOPERMISSION
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc systemd[1]: Failed to start Configure kubernetes node.
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc systemd[1]: kube-node-configuration.service: Unit entered failed state.
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc systemd[1]: kube-node-configuration.service: Failed with result 'exit-code'.
So it looks like kube-node-configuration failed. I ran sudo systemctl restart kube-node-configuration and now the status output is:
● kube-node-configuration.service - Configure kubernetes node
Loaded: loaded (/etc/systemd/system/kube-node-configuration.service; enabled; vendor preset: disabled)
Active: active (exited) since Fri 2017-12-29 03:41:36 UTC; 3s ago
Main PID: 20802 (code=exited, status=0/SUCCESS)
CPU: 1.851s
Dec 29 03:41:28 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Extend the docker.service configuration to set a higher pids limit
Dec 29 03:41:28 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Docker command line is updated. Restart docker to pick it up
Dec 29 03:41:30 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Start kubelet
Dec 29 03:41:35 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Using kubelet binary at /home/kubernetes/bin/kubelet
Dec 29 03:41:35 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Start kube-proxy static pod
Dec 29 03:41:35 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Start node problem detector
Dec 29 03:41:35 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Using node problem detector binary at /home/kubernetes/bin/node-problem-detector
Dec 29 03:41:36 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Prepare containerized mounter
Dec 29 03:41:36 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Done for the configuration for kubernetes
Dec 29 03:41:36 gke-cluster0-pool-d59e9506-g9sc systemd[1]: Started Configure kubernetes node.
...and the node joined the cluster :). But, my original question stands: what happened?
We were experiencing a similar problem on GKE with preemptible nodes, seeing error messaging like these from the nodes:
Extend the docker.service configuration to set a higher pids limit
Docker command line is updated. Restart docker to pick it up
level=info msg="Processing signal 'terminated'"
level=info msg="stopping event stream following graceful shutdown" error="<nil>" module=libcontainerd namespace=moby
level=info msg="Daemon shutdown complete"
docker daemon exited
Start kubelet
After about a month of back-and-forth with Google Support, we learned that the Nodes were getting preempted and replaced, and the new node that comes in, uses the same name, and it all happens without the normal pod disruption of a node being evicted.
Backstory: we were running into this problem because Jenkins was running it's workers on the nodes, and during this ~2 minute "restart" of the node going and returning, Jenkins master would loose connection and fail the job.
tldr; don't use preemptible nodes for this kind of work.
Related
we're experiencing a problem with one of our ceph monitors. Cluster uses 3 monitors and they are all up&running. They can communicate with each other and gives a relevant ceph -s output. However quorum shows second monitor is down. The ceph -s output from supposedly down monitor is below:
cluster:
id: bb1ab46a-d282-4530-bf5c-021e9c940958
health: HEALTH_WARN
insufficient standby MDS daemons available
noout flag(s) set
9 large omap objects
47 pgs not deep-scrubbed in time
application not enabled on 2 pool(s)
1/3 mons down, quorum mon1,mon3
services:
mon: 3 daemons, quorum mon1,mon3 (age 3d), out of quorum: mon2
mgr: mon1(active, since 3d)
mds: filesystem:1 {0=mon1=up:active}
osd: 77 osds: 77 up (since 3d), 77 in (since 2w)
flags noout
rbd-mirror: 1 daemon active (12512649)
rgw: 1 daemon active (mon1)
data:
pools: 13 pools, 1500 pgs
objects: 65.36M objects, 23 TiB
usage: 85 TiB used, 701 TiB / 785 TiB avail
pgs: 1500 active+clean
io:
client: 806 KiB/s wr, 0 op/s rd, 52 op/s wr
systemctl status ceph-mon#2.service shows:
ceph-mon#2.service - Ceph cluster monitor daemon
Loaded: loaded (/usr/lib/systemd/system/ceph-mon#.service; enabled; vendor preset: disabled)
Active: failed (Result: start-limit) since Tue 2020-12-08 12:12:58 +03; 28s ago
Process: 2681 ExecStart=/usr/bin/ceph-mon -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
Main PID: 2681 (code=exited, status=1/FAILURE)
Dec 08 12:12:48 mon2 systemd[1]: Unit ceph-mon#2.service entered failed state.
Dec 08 12:12:48 mon2 systemd[1]: ceph-mon#2.service failed.
Dec 08 12:12:58 mon2 systemd[1]: ceph-mon#2.service holdoff time over, scheduling restart.
Dec 08 12:12:58 mon2 systemd[1]: Stopped Ceph cluster monitor daemon.
Dec 08 12:12:58 mon2 systemd[1]: start request repeated too quickly for ceph-mon#2.service
Dec 08 12:12:58 mon2 systemd[1]: Failed to start Ceph cluster monitor daemon.
Dec 08 12:12:58 mon2 systemd[1]: Unit ceph-mon#2.service entered failed state.
Dec 08 12:12:58 mon2 systemd[1]: ceph-mon#2.service failed.
Restarting, Stop/Starting, Enable/Disabling the monitor daemon did not work. Docs mention the monitor asok file in var/run/ceph and i don't have it in the supposed directory yet the other monitors have their asok files right in place. And now im in a state that i can't even stop the monitor daemon on second monitor it only stays at failed state. There is no logs shown in /var/log/ceph monitor logs. What am i supposed to do? I don't have much experience in ceph so i don't want to change things without being absolutely sure in order to avoid messing up the cluster.
try to start the service manually on MON2 with:
/usr/bin/ceph-mon -f --cluster Ceph --id 2 --setuser ceph --setgroup ceph
I have installed postgresql for a long time, but it suddenly stopped working when I added an IP address in the pg_hba file.
I'm trying to run it using sudo service postgresql restart but this doesn't work.
Also
root#ubuntu-dev-server:/# systemctl start postgresql#9.6-main
Job for postgresql#9.6-main.service failed because the service did not take the steps required by its unit configuration.
See "systemctl status postgresql#9.6-main.service" and "journalctl -xe" for details.
And then,
root#ubuntu-dev-server:/# systemctl status postgresql#9.6-main.service
● postgresql#9.6-main.service - PostgreSQL Cluster 9.6-main
Loaded: loaded (/lib/systemd/system/postgresql#.service; indirect; vendor preset: enabled)
Active: failed (Result: protocol) since Fri 2020-07-24 16:56:06 UTC; 1min 54s ago
Process: 31877 ExecStop=/usr/bin/pg_ctlcluster --skip-systemctl-redirect -m fast 9.6-main stop (code=exited, status=0/SUCCESS)
Process: 951 ExecStart=/usr/bin/pg_ctlcluster --skip-systemctl-redirect 9.6-main start (code=exited, status=1/FAILURE)
Main PID: 24944 (code=exited, status=0/SUCCESS)
Jul 24 16:56:06 ubuntu-dev-server systemd[1]: Starting PostgreSQL Cluster 9.6-main...
Jul 24 16:56:06 ubuntu-dev-server postgresql#9.6-main[951]: The PostgreSQL server failed to start. Please check the log output:
Jul 24 16:56:06 ubuntu-dev-server postgresql#9.6-main[951]: 2020-07-24 16:56:06.236 UTC [956] FATAL: could not map anonymous shared memory: Cannot allocate memory
Jul 24 16:56:06 ubuntu-dev-server postgresql#9.6-main[951]: 2020-07-24 16:56:06.236 UTC [956] HINT: This error usually means that PostgreSQL's request for a shared memory segment exceeded available memory, swap space, or huge pages. To reduce the request size (currently 148471808 bytes), reduce PostgreSQL's shared memory usage, perhaps by reducing shared_buffers or max_connections.
Jul 24 16:56:06 ubuntu-dev-server postgresql#9.6-main[951]: 2020-07-24 16:56:06.236 UTC [956] LOG: database system is shut down
Jul 24 16:56:06 ubuntu-dev-server systemd[1]: postgresql#9.6-main.service: Can't open PID file /run/postgresql/9.6-main.pid (yet?) after start: No such file or directory
Jul 24 16:56:06 ubuntu-dev-server systemd[1]: postgresql#9.6-main.service: Failed with result 'protocol'.
Jul 24 16:56:06 ubuntu-dev-server systemd[1]: Failed to start PostgreSQL Cluster 9.6-main.
I have tried several things but still do not start. Any recommendation is welcome.
I have a trouble/problem with my Filebeat installation.
When I try it to start with "service filebeat start", it says "Starting Filebeat". After "service filebeat status" I get 4 PIDs (until here everything looks "normal"):
[root#(Server) run]# service filebeat status
Filebeat is running with pid: 30650 30657 30658 30659
But after checking the PID, we see that it is not running:
[root#(Server) run]# ps -ef | grep 30650
root 30665 31360 0 16:27 pts/0 00:00:00 grep --color=auto 30650
Trying to start it with systemctl doesn't help:
[root#(Server) run]# systemctl start filebeat
Job for filebeat.service failed because a configured resource limit was exceeded. See "systemctl status filebeat.service" and "journalctl -xe" for details.
Status says:
[root#Server run]# systemctl status filebeat
● filebeat.service - LSB: start and stop filebeat
Loaded: loaded (/etc/rc.d/init.d/filebeat; bad; vendor preset: disabled)
Active: failed (Result: resources) since Tue 2017-09-26 16:30:33 CEST; 1min 41s ago
Docs: man:systemd-sysv-generator(8)
Process: 32118 ExecStart=/etc/rc.d/init.d/filebeat start (code=exited, status=0/SUCCESS)
Sep 26 16:30:33 Server... systemd[1]: Starting LSB: start and stop filebeat...
Sep 26 16:30:33 Server... filebeat[32118]: Starting Filebeat
Sep 26 16:30:33 Server... su[32119]: (to user) root on none
Sep 26 16:30:33 Server... systemd[1]: PID file /var/run/filebeat.pid not readable (yet?) after start.
Sep 26 16:30:33 Server... systemd[1]: Failed to start LSB: start and stop filebeat.
Sep 26 16:30:33 Server... systemd[1]: Unit filebeat.service entered failed state.
Sep 26 16:30:33 Server... systemd[1]: filebeat.service failed.
Does somebody has any idea?
Regards
Problem was "chown permissions". I installed filebeat not as root and the "data" directory had root user & group ownership. After changing that, it runs and starts automatically after boot.
Regards
On the beginning i wanted to point out i am fairly new into Linux systems, and totally, totally new with kubernetes so my question may be trivial.
As stated in the title i have problem with setting up the Kubernetes cluster. I am working on the Atomic Host Version: 7.1707 (2017-07-31 16:12:06)
I am following this guide:
http://www.projectatomic.io/docs/gettingstarted/
in addition to that i followed this:
http://www.projectatomic.io/docs/kubernetes/
(to be precise, i ran this command:
rpm-ostree install kubernetes-master --reboot
everything was going fine until this point:
systemctl start etcd kube-apiserver kube-controller-manager kube-scheduler
the problem is with:
systemctl start etcd kube-apiserver
as it gives me back this response:
Job for kube-apiserver.service failed because the control process
exited with error code. See "systemctl status kube-apiserver.service"
and "journalctl -xe" for details.
systemctl status kube-apiserver.service
gives me back:
● kube-apiserver.service - Kubernetes API Server
Loaded: loaded (/usr/lib/systemd/system/kube-apiserver.service; enabled; vendor preset: disabled)
Active: failed (Result: start-limit) since Fri 2017-08-25 14:29:56 CEST; 2s ago
Docs: https://github.com/GoogleCloudPlatform/kubernetes
Process: 17876 ExecStart=/usr/bin/kube-apiserver $KUBE_LOGTOSTDERR $KUBE_LOG_LEVEL $KUBE_ETCD_SERVERS $KUBE_API_ADDRESS $KUBE_API_PORT $KUBELET_PORT $KUBE_ALLOW_PRIV $KUBE_SERVICE_ADDRESSES $KUBE_ADMISSION_CONTROL $KUBE_API_ARGS (code=exited, status=255)
Main PID: 17876 (code=exited, status=255)
Aug 25 14:29:56 master systemd[1]: kube-apiserver.service: main process exited, code=exited, status=255/n/a
Aug 25 14:29:56 master systemd[1]: Failed to start Kubernetes API Server.
Aug 25 14:29:56 master systemd[1]: Unit kube-apiserver.service entered failed state.
Aug 25 14:29:56 master systemd[1]: kube-apiserver.service failed.
Aug 25 14:29:56 master systemd[1]: kube-apiserver.service holdoff time over, scheduling restart.
Aug 25 14:29:56 master systemd[1]: start request repeated too quickly for kube-apiserver.service
Aug 25 14:29:56 master systemd[1]: Failed to start Kubernetes API Server.
Aug 25 14:29:56 master systemd[1]: Unit kube-apiserver.service entered failed state.
Aug 25 14:29:56 master systemd[1]: kube-apiserver.service failed.
I have no clue where to start and i will be more than thankful for any advices.
It turned out to be a typo in /etc/kubernetes/config. I misunderstood the "# Comma separated list of nodes in the etcd cluster".
Idk how to close the thread or anything.
I installed it a Centos 7 box.
R studio server service could not start.
I run the command
systemctl status rstudio-server.service
and it showed:
● rstudio-server.service - RStudio Server
Loaded: loaded (/etc/systemd/system/rstudio-server.service; enabled; vendor preset: disabled)
Active: failed (Result: start-limit) since Thu 2016-01-28 20:18:20 ICT; 1min 6s ago
Process: 48820 ExecStart=/usr/lib/rstudio-server/bin/rserver (code=exited, status=203/EXEC)
Jan 28 20:18:20 localhost.localdomain systemd[1]: rstudio-server.service: control process exited, code=exited s...=203
Jan 28 20:18:20 localhost.localdomain systemd[1]: Failed to start RStudio Server.
Jan 28 20:18:20 localhost.localdomain systemd[1]: Unit rstudio-server.service entered failed state.
Jan 28 20:18:20 localhost.localdomain systemd[1]: rstudio-server.service failed.
Jan 28 20:18:20 localhost.localdomain systemd[1]: rstudio-server.service holdoff time over, scheduling restart.
Jan 28 20:18:20 localhost.localdomain systemd[1]: start request repeated too quickly for rstudio-server.service
Jan 28 20:18:20 localhost.localdomain systemd[1]: Failed to start RStudio Server.
Jan 28 20:18:20 localhost.localdomain systemd[1]: Unit rstudio-server.service entered failed state.
Jan 28 20:18:20 localhost.localdomain systemd[1]: rstudio-server.service failed.
I installed and run an old version (rstudio-server-0.99.491-1.x86_64) on the same box without any problem.
How could I fix the issues?
Although you asked this question 3 years ago, I think it's still necessary to share my solution to this problem.
I encounter this problem after I updated R.
The reason why you can not restart rstudio-server is that the PORT 8787 was been using by previous rserver. After knowing this, the solution is easy.
First, check the pid that was using PORT 8787
sudo netstat -anp | grep 8787
tcp 0 0 0.0.0.0:8787 0.0.0.0:* LISTEN pid/rserver
Second, kill this pid (use your pid)
sudo kill -9 pid
Third, restart rstudio-server or reinstall resutio server package