Context:
I'm trying to setup a new ceph-17.2.5 cluster
Initial setup consists of 1 mgr, 1 mon and 1 osd all installed on same box using cephadm.
So far I am able to access http://192.168.3.50:8080/api/auth and get token in the response.
But http://192.168.3.50:8080/api/rgw/bucket returns:
`
{
"status": "500 Internal Server Error",
"detail": "The server encountered an unexpected condition which prevented it from fulfilling the request.",
"request_id": "f08247f9-d9ea-4f88-99b7-d07d5141d5ea"
}
`
This seems because the dashboard shows rgw service is not running, I have tried to setup rgw service using below commands:
sudo cephadm shell --mount radosgw.yml:/var/lib/ceph/radosgw/radosgw.yml
ceph orch apply -i /var/lib/ceph/radosgw/radosgw.yml
radosgw.yml file contents:
service_type: rgw
service_id: default
placement:
hosts:
- <machine-hostname>
count_per_host: 1
spec:
rgw_realm: default
rgw_zone: default
rgw_frontend_port: 8123
networks:
- 192.168.3.0/24
After running above the cluster in ceph dashboard does show one rgw service:
But the object gateway daemon still gives error as below:
The RGW API still returns same error.
Looking at the rgw logs I see rgw keeps trying restart on a loop and eventually stops:
Nov 17 05:31:16 <hostname> systemd[1]: Started Ceph rgw.default.<hostname>.rerude for 76ec21dc-4f10-11ed-8fda-0050568dc50c. Nov 17 05:31:17 <hostname> bash[2537663]: debug 2022-11-17T05:31:17.012+0000 7fc0de7ed5c0 0 deferred set uid:gid to 167:167 (ceph:ceph) Nov 17 05:31:17 <hostname> bash[2537663]: debug 2022-11-17T05:31:17.012+0000 7fc0de7ed5c0 0 ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable), process radosgw, pid 7 Nov 17 05:31:17 <hostname> bash[2537663]: debug 2022-11-17T05:31:17.012+0000 7fc0de7ed5c0 0 framework: beast Nov 17 05:31:17 <hostname> bash[2537663]: debug 2022-11-17T05:31:17.012+0000 7fc0de7ed5c0 0 framework conf key: endpoint, val: 192.168.3.59:8123 Nov 17 05:31:17 <hostname> bash[2537663]: debug 2022-11-17T05:31:17.012+0000 7fc0de7ed5c0 1 radosgw_Main not setting numa affinity Nov 17 05:31:17 <hostname> bash[2537663]: debug 2022-11-17T05:31:17.015+0000 7fc0de7ed5c0 1 rgw_d3n: rgw_d3n_l1_local_datacache_enabled=0 Nov 17 05:31:17 <hostname> bash[2537663]: debug 2022-11-17T05:31:17.015+0000 7fc0de7ed5c0 1 D3N datacache enabled: 0 Nov 17 05:36:17 <hostname> bash[2537663]: debug 2022-11-17T05:36:17.013+0000 7fc0dd36f700 -1 Initialization timeout, failed to initialize Nov 17 05:36:17 <hostname> systemd[1]: ceph-76ec21dc-4f10-11ed-8fda-0050568dc50c#rgw.default.<hostname>.rerude.service: Main process exited, code=exited, status=1/FAILURE Nov 17 05:36:17 <hostname> systemd[1]: ceph-76ec21dc-4f10-11ed-8fda-0050568dc50c#rgw.default.<hostname>.rerude.service: Failed with result 'exit-code'. Nov 17 05:36:27 <hostname> systemd[1]: ceph-76ec21dc-4f10-11ed-8fda-0050568dc50c#rgw.default.<hostname>.rerude.service: Service RestartSec=10s expired, scheduling restart. Nov 17 05:36:27 <hostname> systemd[1]: ceph-76ec21dc-4f10-11ed-8fda-0050568dc50c#rgw.default.<hostname>.rerude.service: Scheduled restart job, restart counter is at 4. Nov 17 05:36:27 <hostname> systemd[1]: Stopped Ceph rgw.default.<hostname>.rerude for 76ec21dc-4f10-11ed-8fda-0050568dc50c. Nov 17 05:36:27 <hostname> systemd[1]: Started Ceph rgw.default.<hostname>.rerude for 76ec21dc-4f10-11ed-8fda-0050568dc50c. Nov 17 05:36:27 <hostname> bash[2538424]: debug 2022-11-17T05:36:27.465+0000 7f2589bf95c0 0 deferred set uid:gid to 167:167 (ceph:ceph) Nov 17 05:36:27 <hostname> bash[2538424]: debug 2022-11-17T05:36:27.465+0000 7f2589bf95c0 0 ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable), process radosgw, pid 6 Nov 17 05:36:27 <hostname> bash[2538424]: debug 2022-11-17T05:36:27.465+0000 7f2589bf95c0 0 framework: beast Nov 17 05:36:27 <hostname> bash[2538424]: debug 2022-11-17T05:36:27.465+0000 7f2589bf95c0 0 framework conf key: endpoint, val: 192.168.3.50:8123 Nov 17 05:36:27 <hostname> bash[2538424]: debug 2022-11-17T05:36:27.465+0000 7f2589bf95c0 1 radosgw_Main not setting numa affinity Nov 17 05:36:27 <hostname> bash[2538424]: debug 2022-11-17T05:36:27.468+0000 7f2589bf95c0 1 rgw_d3n: rgw_d3n_l1_local_datacache_enabled=0 Nov 17 05:36:27 <hostname> bash[2538424]: debug 2022-11-17T05:36:27.468+0000 7f2589bf95c0 1 D3N datacache enabled: 0 Nov 17 05:41:27 <hostname> bash[2538424]: debug 2022-11-17T05:41:27.466+0000 7f258877b700 -1 Initialization timeout, failed to initialize
> sudo ceph df
--- RAW STORAGE ---
CLASS SIZE AVAIL USED RAW USED %RAW USED
ssd 8 GiB 8.0 GiB 9.7 MiB 9.7 MiB 0.12
TOTAL 8 GiB 8.0 GiB 9.7 MiB 9.7 MiB 0.12
--- POOLS ---
POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL
.rgw.root 1 32 0 B 0 0 B 0 2.5 GiB
.mgr 2 1 449 KiB 2 452 KiB 0 7.6 GiB
Not sure what should I look for to get the rgw apis to work in a minimal setup.
Related
we're experiencing a problem with one of our ceph monitors. Cluster uses 3 monitors and they are all up&running. They can communicate with each other and gives a relevant ceph -s output. However quorum shows second monitor is down. The ceph -s output from supposedly down monitor is below:
cluster:
id: bb1ab46a-d282-4530-bf5c-021e9c940958
health: HEALTH_WARN
insufficient standby MDS daemons available
noout flag(s) set
9 large omap objects
47 pgs not deep-scrubbed in time
application not enabled on 2 pool(s)
1/3 mons down, quorum mon1,mon3
services:
mon: 3 daemons, quorum mon1,mon3 (age 3d), out of quorum: mon2
mgr: mon1(active, since 3d)
mds: filesystem:1 {0=mon1=up:active}
osd: 77 osds: 77 up (since 3d), 77 in (since 2w)
flags noout
rbd-mirror: 1 daemon active (12512649)
rgw: 1 daemon active (mon1)
data:
pools: 13 pools, 1500 pgs
objects: 65.36M objects, 23 TiB
usage: 85 TiB used, 701 TiB / 785 TiB avail
pgs: 1500 active+clean
io:
client: 806 KiB/s wr, 0 op/s rd, 52 op/s wr
systemctl status ceph-mon#2.service shows:
ceph-mon#2.service - Ceph cluster monitor daemon
Loaded: loaded (/usr/lib/systemd/system/ceph-mon#.service; enabled; vendor preset: disabled)
Active: failed (Result: start-limit) since Tue 2020-12-08 12:12:58 +03; 28s ago
Process: 2681 ExecStart=/usr/bin/ceph-mon -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
Main PID: 2681 (code=exited, status=1/FAILURE)
Dec 08 12:12:48 mon2 systemd[1]: Unit ceph-mon#2.service entered failed state.
Dec 08 12:12:48 mon2 systemd[1]: ceph-mon#2.service failed.
Dec 08 12:12:58 mon2 systemd[1]: ceph-mon#2.service holdoff time over, scheduling restart.
Dec 08 12:12:58 mon2 systemd[1]: Stopped Ceph cluster monitor daemon.
Dec 08 12:12:58 mon2 systemd[1]: start request repeated too quickly for ceph-mon#2.service
Dec 08 12:12:58 mon2 systemd[1]: Failed to start Ceph cluster monitor daemon.
Dec 08 12:12:58 mon2 systemd[1]: Unit ceph-mon#2.service entered failed state.
Dec 08 12:12:58 mon2 systemd[1]: ceph-mon#2.service failed.
Restarting, Stop/Starting, Enable/Disabling the monitor daemon did not work. Docs mention the monitor asok file in var/run/ceph and i don't have it in the supposed directory yet the other monitors have their asok files right in place. And now im in a state that i can't even stop the monitor daemon on second monitor it only stays at failed state. There is no logs shown in /var/log/ceph monitor logs. What am i supposed to do? I don't have much experience in ceph so i don't want to change things without being absolutely sure in order to avoid messing up the cluster.
try to start the service manually on MON2 with:
/usr/bin/ceph-mon -f --cluster Ceph --id 2 --setuser ceph --setgroup ceph
OS: RHEL 8.2
I am trying to create a systemctl service for zookeeper. It fails to access the datadir.
Here is my config for zookeeper,
dataDir=/opt/zookeeper
maxClientCnxns=20
tickTime=2000
dataDir=/var/zookeeper/
initLimit=20
syncLimit=10
server.0=master:2888:3888
clientPort=2181
admin.serverPort=8082
Permission of /opt/zookeeper is set to 777.
[user1#server1 opt]$ ls -lart
total 0
dr-xr-xr-x. 17 root root 244 Jul 3 10:56 ..
drwxr-xr-x 3 root root 27 Jul 10 10:29 rh
drw-r--r-- 2 user2 user2 6 Jul 17 08:48 hsluw_data
drw-r--r-- 2 user2 user2 6 Jul 17 08:58 hsluw_config
drwxr-xr-x. 6 root root 71 Jul 17 08:58 .
drwxrwxrwx 3 user2 user2 23 Jul 17 09:40 zookeeper
If I run the command,
./bin/zookeeper-server-start.sh config/zookeeper.properties
it gives me an error message: Unable to access datadir
[2020-07-30 10:25:50,767] ERROR Invalid configuration, only one server specified (ignoring) (org.apache.zookeeper.server.quorum.QuorumPeerConfig)
[2020-07-30 10:25:50,767] INFO Starting server (org.apache.zookeeper.server.ZooKeeperServerMain)
[2020-07-30 10:25:50,769] INFO zookeeper.snapshot.trust.empty : false (org.apache.zookeeper.server.persistence.FileTxnSnapLog)
[2020-07-30 10:25:50,769] ERROR Unable to access datadir, exiting abnormally (org.apache.zookeeper.server.ZooKeeperServerMain)
org.apache.zookeeper.server.persistence.FileTxnSnapLog$DatadirException: Cannot write to data directory /var/zookeeper/version-2
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.<init>(FileTxnSnapLog.java:132)
at org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:124)
at org.apache.zookeeper.server.ZooKeeperServerMain.initializeAndRun(ZooKeeperServerMain.java:106)
at org.apache.zookeeper.server.ZooKeeperServerMain.main(ZooKeeperServerMain.java:64)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:128)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:82)
Unable to access datadir, exiting abnormally
However, sudoing the above command works,
sudo ./bin/zookeeper-server-start.sh config/zookeeper.properties
Now I have created a service in /etc/systemd/system/zookeeper.service
I wrote the service in /etc/systemd/system/zookeeper.service in this way,
[Unit]
Requires=network.target remote-fs.target
After=network.target remote-fs.target
[Service]
Type=simple
User=user2
ExecStart=/home/user2/kafka/bin/zookeeper-server-start.sh /home/user2/kafka/config/zookeeper.properties
ExecStop=/home/user2/kafka/bin/zookeeper-server-stop.sh
Restart=on-abnormal
[Install]
WantedBy=multi-user.target
The SELinux status is disabled.
user2#server1$ sestatus
SELinux status: disabled
Now if I do the following
sudo systemctl daemon-reload
sudo systemctl start zookeeper
sudo systemctl enable zookeeper
I am getting the the same Unable to access the datadir error like the following,
[user2#server1 /]$ systemctl status zookeeper
\u25cf zookeeper.service
Loaded: loaded (/etc/systemd/system/zookeeper.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Thu 2020-07-30 10:13:19 CEST; 24s ago
Main PID: 12911 (code=exited, status=3)
Jul 30 10:13:19 server1.localdomain zookeeper-server-start.sh[12911]: org.apache.zookeeper.server.persistence.FileTxnSnapLog$Data>
Jul 30 10:13:19 server1.localdomain zookeeper-server-start.sh[12911]: at org.apache.zookeeper.server.persistence.FileTxnS>
Jul 30 10:13:19 server1.localdomain zookeeper-server-start.sh[12911]: at org.apache.zookeeper.server.ZooKeeperServerMain.>
Jul 30 10:13:19 server1.localdomain zookeeper-server-start.sh[12911]: at org.apache.zookeeper.server.ZooKeeperServerMain.>
Jul 30 10:13:19 server1.localdomain zookeeper-server-start.sh[12911]: at org.apache.zookeeper.server.ZooKeeperServerMain.>
Jul 30 10:13:19 server1.localdomain zookeeper-server-start.sh[12911]: at org.apache.zookeeper.server.quorum.QuorumPeerMai>
Jul 30 10:13:19 server1.localdomain zookeeper-server-start.sh[12911]: at org.apache.zookeeper.server.quorum.QuorumPeerMai>
Jul 30 10:13:19 server1.localdomain zookeeper-server-start.sh[12911]: Unable to access datadir, exiting abnormally
Jul 30 10:13:19 server1.localdomain systemd[1]: zookeeper.service: Main process exited, code=exited, status=3/NOTIMPLEMENTED
Jul 30 10:13:19 server1.localdomain systemd[1]: zookeeper.service: Failed with result 'exit-code'.
What am I missing here?
In the configuration file, this line
dataDir=/var/zookeeper/
appears twice. Removing that line solves the issue.
When I'm trying to add magento 2 varnish.vcl file by creating a symbolic link, varnish service stop working with error permission denied, while if I use default varnish configuration file varnish works smooth.
My Stack is ubuntu 16.04, varnish 4.1
ls -al
drwxr-xr-x 2 root root 4096 Mar 21 13:14 .
drwxr-xr-x 96 root root 4096 Mar 21 12:56 ..
lrwxrwxrwx 1 root root 44 Mar 21 13:14 default.vcl -> /var/www/bazaar/varnish.vcl
-rw-r--r-- 1 root root 1225 Aug 22 2017 default.vcl_bak
-rw-r--r-- 1 root root 37 Mar 21 12:56 secret
here is the status for varnish service
● varnish.service - Varnish HTTP accelerator
Loaded: loaded (/lib/systemd/system/varnish.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/varnish.service.d
└─customexec.conf
Active: failed (Result: exit-code) since Wed 2018-03-21 13:59:08 UTC; 2s ago
Docs: https://www.varnish-cache.org/docs/4.1/
man:varnishd
Process: 3093 ExecStart=/usr/sbin/varnishd -j unix,user=vcache -F -a :80 -T localhost:6082 -f /etc/varnish/default.vcl -S /etc/varnish/secret -s malloc,256m (code=exited, status=2)
Main PID: 3093 (code=exited, status=2)
Mar 21 13:59:08 bazaar systemd[1]: Stopped Varnish HTTP accelerator.
Mar 21 13:59:08 bazaar systemd[1]: Started Varnish HTTP accelerator.
Mar 21 13:59:08 bazaar varnishd[3093]: Error: Cannot read -f file (/etc/varnish/default.vcl): Permission denied
Mar 21 13:59:08 bazaar systemd[1]: varnish.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Mar 21 13:59:08 bazaar systemd[1]: varnish.service: Unit entered failed state.
Mar 21 13:59:08 bazaar systemd[1]: varnish.service: Failed with result 'exit-code'.
my current user for nginx is bazaar
and permissions for varnish.vcl is as follow
-rw-r--r-- 1 bazaar bazaar 7226 Mar 21 13:24 varnish.vcl
Any hint or help will be highly appreciated.
Thanks.
It is likely that the user (vcache) does not have access to read in the parent directory(s) /var/www/bazaar.
I have a preemptible node pool of size 1 on GKE. I've been running this node pool with size 1 for almost a month now. Every day the node restarts after 24 hours and rejoins the cluster. Today it restarted but did not rejoin the cluster.
Instead, I noticed that according to gcloud compute instances list the underlying instance was running but not included in the output of kubectl get node. I increased the node pool size to 2, whereupon a second instance was launched. That node immediately joined my GKE cluster and pods were scheduled onto it. The first node is still running according to gcloud, but it won't join the cluster.
What's going on? How can I debug this this problem?
Update:
I SSHed into the instance and was immediately greeted with this excellent error message:
Broken (or in progress) Kubernetes node setup! Check the cluster initialization status
using the following commands:
Master instance:
- sudo systemctl status kube-master-installation
- sudo systemctl status kube-master-configuration
Node instance:
- sudo systemctl status kube-node-installation
- sudo systemctl status kube-node-configuration
The results of sudo systemctl status kube-node-installation:
goto mark: ● kube-node-installation.service - Download and install k8s binaries and configurations
Loaded: loaded (/etc/systemd/system/kube-node-installation.service; enabled; vendor preset: disabled)
Active: active (exited) since Thu 2017-12-28 21:08:53 UTC; 6h ago
Process: 945 ExecStart=/home/kubernetes/bin/configure.sh (code=exited, status=0/SUCCESS)
Process: 941 ExecStartPre=/bin/chmod 544 /home/kubernetes/bin/configure.sh (code=exited, status=0/SUCCESS)
Process: 937 ExecStartPre=/usr/bin/curl --fail --retry 5 --retry-delay 3 --silent --show-error -H X-Google-Metadata-Request: True -o /home/kubernetes/bin/configure.sh http://metadata.google.internal/com
puteMetadata/v1/instance/attributes/configure-sh (code=exited, status=0/SUCCESS)
Process: 933 ExecStartPre=/bin/mount -o remount,exec /home/kubernetes/bin (code=exited, status=0/SUCCESS)
Process: 930 ExecStartPre=/bin/mount --bind /home/kubernetes/bin /home/kubernetes/bin (code=exited, status=0/SUCCESS)
Process: 925 ExecStartPre=/bin/mkdir -p /home/kubernetes/bin (code=exited, status=0/SUCCESS)
Main PID: 945 (code=exited, status=0/SUCCESS)
Tasks: 0 (limit: 4915)
Memory: 0B
CPU: 0
CGroup: /system.slice/kube-node-installation.service
Dec 28 21:08:52 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: Downloading node problem detector.
Dec 28 21:08:52 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: % Total % Received % Xferd Average Speed Time Time Time Current
Dec 28 21:08:52 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: Dload Upload Total Spent Left Speed
Dec 28 21:08:52 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: [158B blob data]
Dec 28 21:08:52 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: == Downloaded https://storage.googleapis.com/kubernetes-release/node-problem-detector/node-problem-detector-v0.4.1.tar.gz (SHA1 = a57a3fe
64cab8a18ec654f5cef0aec59dae62568) ==
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: cni-0799f5732f2a11b329d9e3d51b9c8f2e3759f2ff.tar.gz is preloaded.
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: kubernetes-manifests.tar.gz is preloaded.
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: mounter is preloaded.
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: Done for installing kubernetes files
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc systemd[1]: Started Download and install k8s binaries and configurations.
And the result of sudo systemctl status kube-node-configuration:
● kube-node-configuration.service - Configure kubernetes node
Loaded: loaded (/etc/systemd/system/kube-node-configuration.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Thu 2017-12-28 21:08:53 UTC; 6h ago
Process: 994 ExecStart=/home/kubernetes/bin/configure-helper.sh (code=exited, status=4)
Process: 990 ExecStartPre=/bin/chmod 544 /home/kubernetes/bin/configure-helper.sh (code=exited, status=0/SUCCESS)
Main PID: 994 (code=exited, status=4)
CPU: 33ms
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc systemd[1]: Starting Configure kubernetes node...
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[994]: Start to configure instance for kubernetes
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[994]: Configuring IP firewall rules
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[994]: Another app is currently holding the xtables lock. Perhaps you want to use the -w option?
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[994]: Another app is currently holding the xtables lock. Perhaps you want to use the -w option?
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[994]: Another app is currently holding the xtables lock. Perhaps you want to use the -w option?
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc systemd[1]: kube-node-configuration.service: Main process exited, code=exited, status=4/NOPERMISSION
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc systemd[1]: Failed to start Configure kubernetes node.
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc systemd[1]: kube-node-configuration.service: Unit entered failed state.
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc systemd[1]: kube-node-configuration.service: Failed with result 'exit-code'.
So it looks like kube-node-configuration failed. I ran sudo systemctl restart kube-node-configuration and now the status output is:
● kube-node-configuration.service - Configure kubernetes node
Loaded: loaded (/etc/systemd/system/kube-node-configuration.service; enabled; vendor preset: disabled)
Active: active (exited) since Fri 2017-12-29 03:41:36 UTC; 3s ago
Main PID: 20802 (code=exited, status=0/SUCCESS)
CPU: 1.851s
Dec 29 03:41:28 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Extend the docker.service configuration to set a higher pids limit
Dec 29 03:41:28 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Docker command line is updated. Restart docker to pick it up
Dec 29 03:41:30 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Start kubelet
Dec 29 03:41:35 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Using kubelet binary at /home/kubernetes/bin/kubelet
Dec 29 03:41:35 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Start kube-proxy static pod
Dec 29 03:41:35 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Start node problem detector
Dec 29 03:41:35 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Using node problem detector binary at /home/kubernetes/bin/node-problem-detector
Dec 29 03:41:36 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Prepare containerized mounter
Dec 29 03:41:36 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Done for the configuration for kubernetes
Dec 29 03:41:36 gke-cluster0-pool-d59e9506-g9sc systemd[1]: Started Configure kubernetes node.
...and the node joined the cluster :). But, my original question stands: what happened?
We were experiencing a similar problem on GKE with preemptible nodes, seeing error messaging like these from the nodes:
Extend the docker.service configuration to set a higher pids limit
Docker command line is updated. Restart docker to pick it up
level=info msg="Processing signal 'terminated'"
level=info msg="stopping event stream following graceful shutdown" error="<nil>" module=libcontainerd namespace=moby
level=info msg="Daemon shutdown complete"
docker daemon exited
Start kubelet
After about a month of back-and-forth with Google Support, we learned that the Nodes were getting preempted and replaced, and the new node that comes in, uses the same name, and it all happens without the normal pod disruption of a node being evicted.
Backstory: we were running into this problem because Jenkins was running it's workers on the nodes, and during this ~2 minute "restart" of the node going and returning, Jenkins master would loose connection and fail the job.
tldr; don't use preemptible nodes for this kind of work.
I installed it a Centos 7 box.
R studio server service could not start.
I run the command
systemctl status rstudio-server.service
and it showed:
● rstudio-server.service - RStudio Server
Loaded: loaded (/etc/systemd/system/rstudio-server.service; enabled; vendor preset: disabled)
Active: failed (Result: start-limit) since Thu 2016-01-28 20:18:20 ICT; 1min 6s ago
Process: 48820 ExecStart=/usr/lib/rstudio-server/bin/rserver (code=exited, status=203/EXEC)
Jan 28 20:18:20 localhost.localdomain systemd[1]: rstudio-server.service: control process exited, code=exited s...=203
Jan 28 20:18:20 localhost.localdomain systemd[1]: Failed to start RStudio Server.
Jan 28 20:18:20 localhost.localdomain systemd[1]: Unit rstudio-server.service entered failed state.
Jan 28 20:18:20 localhost.localdomain systemd[1]: rstudio-server.service failed.
Jan 28 20:18:20 localhost.localdomain systemd[1]: rstudio-server.service holdoff time over, scheduling restart.
Jan 28 20:18:20 localhost.localdomain systemd[1]: start request repeated too quickly for rstudio-server.service
Jan 28 20:18:20 localhost.localdomain systemd[1]: Failed to start RStudio Server.
Jan 28 20:18:20 localhost.localdomain systemd[1]: Unit rstudio-server.service entered failed state.
Jan 28 20:18:20 localhost.localdomain systemd[1]: rstudio-server.service failed.
I installed and run an old version (rstudio-server-0.99.491-1.x86_64) on the same box without any problem.
How could I fix the issues?
Although you asked this question 3 years ago, I think it's still necessary to share my solution to this problem.
I encounter this problem after I updated R.
The reason why you can not restart rstudio-server is that the PORT 8787 was been using by previous rserver. After knowing this, the solution is easy.
First, check the pid that was using PORT 8787
sudo netstat -anp | grep 8787
tcp 0 0 0.0.0.0:8787 0.0.0.0:* LISTEN pid/rserver
Second, kill this pid (use your pid)
sudo kill -9 pid
Third, restart rstudio-server or reinstall resutio server package