Error running Cygnus as a service - fiware-cygnus

I've been following this procedure to install Cygnus as a service (https://github.com/telefonicaid/fiware-cygnus/tree/master/cygnus-ngsi), but when running it the error below pops up:
● cygnus.service - SYSV: cygnus
Loaded: loaded (/etc/rc.d/init.d/cygnus; bad; vendor preset: disabled)
Active: failed (Result: exit-code) since Tue 2017-01-17 18:27:15 UTC; 8s ago
Docs: man:systemd-sysv-generator(8)
Process: 8260 ExecStart=/etc/rc.d/init.d/cygnus start (code=exited, status=1/FAILURE)
Jan 17 18:27:15 servername systemd[1]: Starting SYSV: cygnus...
Jan 17 18:27:15 servername systemd[1]: cygnus.service: control process exited, code=exited status=1
Jan 17 18:27:15 servername systemd[1]: Failed to start SYSV: cygnus.
Jan 17 18:27:15 servername systemd[1]: Unit cygnus.service entered failed state.
Jan 17 18:27:15 servername systemd[1]: cygnus.service failed.
Jan 17 18:27:15 servername cygnus[8260]: There aren't any instance of Cygnus configured. Refer to file /usr/cygnus/conf/README.md ...mation.
Hint: Some lines were ellipsized, use -l to show in full
I have already tried two mechanisms to solve the issue, without success:
Giving full permissions to the /var/run/cygnus directory, as suggested here: Fiware: can not start cygnus as service.
In my case, it does run as a standalone application (not as a service) by using:
`/usr/cygnus/bin/cygnus-flume-ng agent --conf /usr/cygnus/conf/ -f /usr/cygnus/conf/agent.conf -n cygnusagent -Dflume.root.logger=INFO,console`
Changing the owner, as suggested here: unable to start Fiware Cygnus as a service
Complementary information:
When I query the stats endpoint:
curl -X GET "http://localhost:8081/v1/stats" | python -m json.tool
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 67 100 67 0 0 3983 0 --:--:-- --:--:-- --:--:-- 4187
{
"stats": {
"channels": [],
"sinks": [],
"sources": []
},
"success": "true"
}
Files I'm using:
agent.conf
cygnus-ngsi.sources = http-source
cygnus-ngsi.sinks = hdfs-sink
cygnus-ngsi.channels = hdfs-channel
cygnus-ngsi.sources.http-source.type = org.apache.flume.source.http.HTTPSource
cygnus-ngsi.sources.http-source.channels = hdfs-channel
cygnus-ngsi.sources.http-source.port = 5050
cygnus-ngsi.sources.http-source.handler = com.telefonica.iot.cygnus.handlers.NGSIRestHandler
cygnus-ngsi.sources.http-source.handler.notification_target = /notify
cygnus-ngsi.sources.http-source.handler.default_service = default
cygnus-ngsi.sources.http-source.handler.default_service_path = /
cygnus-ngsi.sources.http-source.interceptors = ts gi
cygnus-ngsi.sources.http-source.interceptors.ts.type = timestamp
cygnus-ngsi.sources.http-source.interceptors.gi.type = com.telefonica.iot.cygnus.interceptors.NGSIGroupingInterceptor$Builder
cygnus-ngsi.sources.http-source.interceptors.gi.grouping_rules_conf_file = /usr/cygnus/conf/grouping_rules.conf #/opt/apache-flume/conf/grouping_rules.conf
cygnus-ngsi.sinks.hdfs-sink.type = com.telefonica.iot.cygnus.sinks.NGSIHDFSSink
cygnus-ngsi.sinks.hdfs-sink.channel = hdfs-channel
cygnus-ngsi.sinks.hdfs-sink.hdfs_host = iot-hdfs
cygnus-ngsi.sinks.hdfs-sink.hdfs_port = 14000
cygnus-ngsi.sinks.hdfs-sink.hdfs_username = <USERNAME>
cygnus-ngsi.sinks.hdfs-sink.oauth2_token = <TOKEN>
cygnus-ngsi.channels.hdfs-channel.type = com.telefonica.iot.cygnus.channels.CygnusMemoryChannel
cygnus-ngsi.channels.hdfs-channel.capacity = 1000
cygnus-ngsi.channels.hdfs-channel.transactionCapacity = 100
cygnus_instance.conf
CYGNUS_USER=cygnus
# Where is the config folder
CONFIG_FOLDER=/usr/cygnus/conf
# Which is the config file
CONFIG_FILE=/usr/cygnus/conf/agent.conf
# Name of the agent. The name of the agent is not trivial, since it is the base for the Flume parameters
# naming conventions, e.g. it appears in .sources.http-source.channels=...
AGENT_NAME=cygnusagent
# Name of the logfile located at /var/log/cygnus. It is important to use the '.log' extension in order for log rotation to work properly
LOGFILE_NAME=cygnus.log
# Administration port. Must be unique per instance
ADMIN_PORT=8081
# Polling interval (seconds) for the configuration reloading
POLLING_INTERVAL=30
Any hints on how to work this out?

As you can see in the logs when starting the service:
Jan 17 18:27:15 servername cygnus[8260]: There aren't any instance of Cygnus configured. Refer to file /usr/cygnus/conf/README.md ...mation.
When running Cygnus as a service, two files must be given, the agent one and the instance one. You have already configured them, so that's OK, but an important thing is missing: both file names must include an ID string. For instance, they could be named agent_1.conf and cygnus_instance_1.conf (the "1" string is the ID of the instance).
This is explained here. Nevertheless, we'll try to improve the documentation in order to make it clearer.
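A minimal sketch of that fix, assuming the /usr/cygnus layout from the question (the "1" ID is just an example):
sudo mv /usr/cygnus/conf/agent.conf /usr/cygnus/conf/agent_1.conf
sudo mv /usr/cygnus/conf/cygnus_instance.conf /usr/cygnus/conf/cygnus_instance_1.conf
# keep CONFIG_FILE in the instance file pointing at the renamed agent file
sudo sed -i 's|/usr/cygnus/conf/agent.conf|/usr/cygnus/conf/agent_1.conf|' /usr/cygnus/conf/cygnus_instance_1.conf
sudo service cygnus start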

Related

Service starting gunicorn failing with "Start request repeated too quickly"

I'm trying to start a service that runs gunicorn as the backend server for Flask, but it's not working. nginx, running as the frontend server for React, works fine.
Server:
Virtualization: vmware
Operating System: Red Hat Enterprise Linux 8.4 (Ootpa)
CPE OS Name: cpe:/o:redhat:enterprise_linux:8.4:GA
Kernel: Linux 4.18.0-305.3.1.el8_4.x86_64
Architecture: x86-64
Service file in /etc/systemd/system/myservice.service:
[Unit]
Description="Description"
After=network.target
[Service]
User=root
Group=root
WorkingDirectory=/home/project/app/api
ExecStart=/home/project/app/api/venv/bin/gunicorn -b 127.0.0.1:5000 api:app
Restart=always
[Install]
WantedBy=multi-user.target
/app/api:
-rwxr-xr-x. 1 root root 2018 Jun 9 20:06 api.py
drwxrwxr-x+ 5 root root 100 Jun 7 10:11 venv
Error message:
● myservice.service - "Description"
Loaded: loaded (/etc/systemd/system/myservice.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Thu 2021-06-10 19:01:01 CEST; 5s ago
Process: 18307 ExecStart=/home/project/app/api/venv/bin/gunicorn -b 127.0.0.1:5000 api:app (code=exited, status=203/EXEC)
Main PID: 18307 (code=exited, status=203/EXEC)
Jun 10 19:01:01 xxxx systemd[1]: myservice.service: Service RestartSec=100ms expired, scheduling restart.
Jun 10 19:01:01 xxxx systemd[1]: myservice.service: Scheduled restart job, restart counter is at 5.
Jun 10 19:01:01 xxxx systemd[1]: Stopped "Description".
Jun 10 19:01:01 xxxx systemd[1]: myservice.service: Start request repeated too quickly.
Jun 10 19:01:01 xxxx systemd[1]: myservice.service: Failed with result 'exit-code'.
Jun 10 19:01:01 xxxx systemd[1]: Failed to start "Description".
Tried, not working:
Adding Environment="PATH=/home/project/app/api/venv/bin" under [Service]
$ systemctl reset-failed myservice.service
$ systemctl daemon-reload
Rebooting, of course.
Tried, working:
Running (as root) /home/project/app/api/venv/bin/gunicorn -b 127.0.0.1:5000 api:app while in the /app/api directory
Does anyone know how to fix this problem?
Typically enough, I figured it out shortly after posting this issue.
SELinux is messing with permissions for files and directories, so for anyone experiencing the same issue, make sure to test the following alterations (as root):
$ setsebool -P httpd_can_network_connect on
$ chcon -Rt httpd_sys_content_t /path/to/your/Flask/dir
In my case: $ chcon -Rt httpd_sys_content_t /home/project/app/api
While this is NOT a permanent fix, it's worth a try. Check out the SELinux docs for more permanent solutions.
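For a persistent variant of the chcon change above, a sketch using the audit and SELinux management tools (assuming the audit and policycoreutils-python-utils packages are installed; this is not from the original answer):
# confirm SELinux is the culprit by looking for recent denials
$ ausearch -m avc -ts recent
# record the file context persistently, then apply it
$ semanage fcontext -a -t httpd_sys_content_t "/home/project/app/api(/.*)?"
$ restorecon -Rv /home/project/app/api
Unlike chcon, a semanage fcontext rule survives a filesystem relabel.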

Rauc and Yocto on Jetson Nano - Unable to find primary boot slot

This is a continuation of my other post.
I've managed to create an image with u-boot and rauc.
I've made a simple rauc system.conf:
[system]
compatible=Jetson Nano
bootloader=uboot
#
[slot.rootfs.0]
device=/dev/mmcblk0p1
type=ext4
bootname=system0
#
[slot.rootfs.1]
device=/dev/mmcblk0p13
type=ext4
bootname=system1
[UPDATED]:
Pretty much copy-pasted the contrib uboot.sh script.
Then I've added a bb file from here into my bsp layer.
And added rauc to my IMAGE_INSTALL.
When I boot up the Nano with my image, rauc isn't working as it should. When I check the status of the service with systemctl status rauc-mark-good.service, it returns:
● rauc-mark-good.service - Rauc Good-marking Service
Loaded: loaded (/lib/systemd/system/rauc-mark-good.service; enabled; vendor preset: enabled)
Active: inactive (dead) since Tue 2019-10-01 07:51:22 UTC; 4s ago
Process: 4147 ExecStart=/usr/bin/rauc status mark-good (code=exited, status=0/SUCCESS)
Main PID: 4147 (code=exited, status=0/SUCCESS)
Oct 01 07:51:22 jetson-nano systemd[1]: Started Rauc Good-marking Service.
Oct 01 07:51:22 jetson-nano rauc[4147]: Failed getting primary slot: Failed getting primary slot: Unable to find primary boot slot
Oct 01 07:51:22 jetson-nano rauc[4147]: rauc mark: marked slot rootfs.0 as good
Oct 01 07:51:22 jetson-nano systemd[1]: rauc-mark-good.service: Succeeded.
systemctl status rauc returns:
● rauc.service - Rauc Update Service
Loaded: loaded (/lib/systemd/system/rauc.service; static; vendor preset: enabled)
Active: active (running) since Tue 2019-10-01 07:49:36 UTC; 2min 0s ago
Docs: https://rauc.readthedocs.io
Main PID: 4092 (rauc)
Tasks: 3 (limit: 4178)
Memory: 4.4M
CGroup: /system.slice/rauc.service
└─4092 /usr/bin/rauc --mount=/run/rauc service
Oct 01 07:49:36 jetson-nano systemd[1]: Starting Rauc Update Service...
Oct 01 07:49:36 jetson-nano systemd[1]: Started Rauc Update Service.
Oct 01 07:49:48 jetson-nano rauc[4092]: Failed getting primary slot: Failed getting primary slot: Unable to find primary boot slot
Oct 01 07:49:48 jetson-nano rauc[4092]: Failed to load status file /slot.raucs: No such file or directory
Oct 01 07:49:48 jetson-nano rauc[4092]: mounting slot /dev/mmcblk0p13
Oct 01 07:49:48 jetson-nano rauc[4092]: Failed to load status file /run/rauc/rootfs.1/slot.raucs: No such file or directory
Oct 01 07:51:22 jetson-nano rauc[4092]: Failed getting primary slot: Failed getting primary slot: Unable to find primary boot slot
Oct 01 07:51:22 jetson-nano rauc[4092]: rauc mark: marked slot rootfs.0 as good
And rauc status returns:
(rauc:4195): rauc-WARNING **: 07:51:46.126: Failed getting primary slot: Failed getting primary slot: Unable to find primary boot slot
Compatible: Jetson Nano
Variant:
Booted from: rootfs.0 (/dev/mmcblk0p1)
Activated: (null) ((null))
slot states:
rootfs.0: class=rootfs, device=/dev/mmcblk0p1, type=ext4, bootname=system0
state=booted, description=, parent=(none), mountpoint=/
boot status=bad
rootfs.1: class=rootfs, device=/dev/mmcblk0p13, type=ext4, bootname=system1
state=inactive, description=, parent=(none), mountpoint=(none)
boot status=bad
So there is no /slot.raucs file, and it fails to find the primary boot slot.
After that, systemctl status rauc-mark-good reports that the rootfs.0 slot has been marked as good, but rauc status still shows the boot status as bad.
What am I missing here?
I edited the uboot script to the following:
test -n "${BOOT_ORDER}" || setenv BOOT_ORDER "system0 system1"
test -n "${BOOT_system0_LEFT}" || setenv BOOT_system0_LEFT 3
test -n "${BOOT_system1_LEFT}" || setenv BOOT_system1_LEFT 3

setenv bootargs
for BOOT_SLOT in "${BOOT_ORDER}"; do
  if test "x${bootargs}" != "x"; then
    # skip remaining slots
  elif test "x${BOOT_SLOT}" = "xsystem0"; then
    if test ${BOOT_system0_LEFT} -gt 0; then
      setexpr BOOT_system0_LEFT ${BOOT_system0_LEFT} - 1
      echo "Found valid slot system0, ${BOOT_system0_LEFT} attempts remaining"
      setenv distro_bootpart "1"
      setenv boot_line "mmc 1:1 any ${scriptaddr} /boot/extlinux/extlinux.conf"
    fi
  elif test "x${BOOT_SLOT}" = "xsystem1"; then
    if test ${BOOT_system1_LEFT} -gt 0; then
      setexpr BOOT_system1_LEFT ${BOOT_system1_LEFT} - 1
      echo "Found valid slot system1, ${BOOT_system1_LEFT} attempts remaining"
      setenv distro_bootpart "13"
      setenv boot_line "mmc 1:D any ${scriptaddr} /boot/extlinux/extlinux.conf"
    fi
  fi
done

if test -n "${bootargs}"; then
  saveenv
else
  echo "No valid slot found, resetting tries to 3"
  setenv BOOT_system0_LEFT 3
  setenv BOOT_system1_LEFT 3
  saveenv
  reset
fi

sysboot ${boot_line}
And it ended up working. Apparently there was some issue with the BOOT_ORDER "system0 system1" in the uboot script that somehow did not match the RAUC system.conf. When I rewrote the script, there were no issues and RAUC ran fine.
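A quick way to verify that the two sides agree, assuming fw_printenv (from u-boot-tools or libubootenv) is set up with a correct /etc/fw_env.config (this check is an illustration, not from the original post):
fw_printenv BOOT_ORDER
# expected: BOOT_ORDER=system0 system1
grep bootname /etc/rauc/system.conf
# expected: bootname=system0 and bootname=system1
RAUC's uboot backend derives the primary slot from BOOT_ORDER, so a missing variable or a name that doesn't match a slot's bootname produces exactly the "Unable to find primary boot slot" warning above.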

systemd: seems like ExecStop script is executed immediately after the start command is run

I am trying to start a docker-compose project as a systemd service on RHEL 7. Here is my systemd unit file (/etc/systemd/system/wp.service):
[Unit]
Description=wp service with docker compose
Requires=docker.service
After=docker.service
[Service]
EnvironmentFile=/home/ec2-user/projects/wp/project-dir/vars.env
WorkingDirectory=/home/ec2-user/projects/wp/project-dir
# ExecStartPre=/usr/bin/docker-compose down
ExecStart=/usr/bin/docker-compose up -d --build --remove-orphans
# ExecStop=/usr/bin/docker-compose down
[Install]
WantedBy=multi-user.target
When I execute the following command:
sudo systemctl start wp.service
Everything works fine - the containers run and stay running. Here is the output of sudo systemctl status wp.service:
Aug 15 03:07:22 ip-172-31-33-87.ec2.internal docker-compose[4185]: ---> Using cache
Aug 15 03:07:22 ip-172-31-33-87.ec2.internal docker-compose[4185]: ---> 7392974149d3
Aug 15 03:07:22 ip-172-31-33-87.ec2.internal docker-compose[4185]: Successfully built 7392974149d3
Aug 15 03:07:22 ip-172-31-33-87.ec2.internal docker-compose[4185]: Successfully tagged foo_wp:latest
Aug 15 03:07:22 ip-172-31-33-87.ec2.internal docker-compose[4185]: Creating mysql ...
Aug 15 03:07:22 ip-172-31-33-87.ec2.internal docker-compose[4185]: [55B blob data]
Aug 15 03:07:23 ip-172-31-33-87.ec2.internal docker-compose[4185]: [37B blob data]
and the containers are up:
[ec2-user@ip-172-31-33-87 ~]$ sudo docker container ls -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
579b52c8e3bc foo_wp "docker-entrypoint.s…" About a minute ago Up About a minute 0.0.0.0:9101->80/tcp wp
3c418cfe2b9c mariadb:10.3.8-bionic "docker-entrypoint.s…" About a minute ago Up About a minute 3306/tcp mysql
If, however, I uncomment the ExecStop line above (and run docker-compose down and reload the service), then the containers are removed after they are run. The output of the status command is:
Loaded: loaded (/etc/systemd/system/wp.service; disabled; vendor preset: disabled)
Active: deactivating (stop) since Wed 2018-08-15 03:12:12 UTC; 7s ago
Process: 4862 ExecStart=/usr/bin/docker-compose up -d --build --remove-orphans (code=exited, status=0/SUCCESS)
Main PID: 4862 (code=exited, status=0/SUCCESS); Control PID: 5165 (docker-compose)
Tasks: 2
Memory: 19.0M
CGroup: /system.slice/wp.service
└─control
└─5165 /usr/bin/python2 /usr/bin/docker-compose down
Aug 15 03:12:11 ip-172-31-33-87.ec2.internal docker-compose[4862]: Step 3/3 : COPY wordpress/ /var/www/html/
Aug 15 03:12:11 ip-172-31-33-87.ec2.internal docker-compose[4862]: ---> Using cache
Aug 15 03:12:11 ip-172-31-33-87.ec2.internal docker-compose[4862]: ---> 7392974149d3
Aug 15 03:12:11 ip-172-31-33-87.ec2.internal docker-compose[4862]: Successfully built 7392974149d3
Aug 15 03:12:11 ip-172-31-33-87.ec2.internal docker-compose[4862]: Successfully tagged foo_wp:latest
Aug 15 03:12:11 ip-172-31-33-87.ec2.internal docker-compose[4862]: Creating mysql ...
Aug 15 03:12:11 ip-172-31-33-87.ec2.internal docker-compose[4862]: [55B blob data]
Aug 15 03:12:12 ip-172-31-33-87.ec2.internal docker-compose[4862]: [37B blob data]
Aug 15 03:12:12 ip-172-31-33-87.ec2.internal docker-compose[5165]: Stopping wp ...
Aug 15 03:12:12 ip-172-31-33-87.ec2.internal docker-compose[5165]: Stopping mysql ...
and the containers have been removed:
sudo docker container ls -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
[ec2-user@ip-172-31-33-87 foo]$
It seems as though the systemd service is executing the ExecStop script immediately after the ExecStart script. What could be the cause?
You are running docker-compose in detached mode (option -d). After starting the containers, docker-compose will daemonise the containers and exit.
Systemd monitors the PID of docker-compose, and when it exits, assumes that your program has stopped and will invoke the ExecStop commands.
Try running it without the -d option.
The reason systemd does this is that you haven't specified the type of your unit, so it defaults to Type=simple.
See the official documentation for Type and ExecStop.
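If you would rather keep the -d option, a common alternative is to declare the unit as oneshot and leave it active after docker-compose exits; a sketch of the [Service] section under that approach (not the answerer's suggestion):
[Service]
Type=oneshot
RemainAfterExit=yes
EnvironmentFile=/home/ec2-user/projects/wp/project-dir/vars.env
WorkingDirectory=/home/ec2-user/projects/wp/project-dir
ExecStart=/usr/bin/docker-compose up -d --build --remove-orphans
ExecStop=/usr/bin/docker-compose down
With RemainAfterExit=yes, systemd keeps the unit active after ExecStart exits successfully, so ExecStop runs only on an explicit stop or at shutdown.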

fail2ban fails to start on Ubuntu 16.04

I have used this tutorial to install fail2ban for my Ubuntu 16.04 server.
After going through this I tried to start it with: /etc/init.d/fail2ban start
Here was the response:
[....] Starting fail2ban (via systemctl): fail2ban.serviceJob for fail2ban.service failed because the control process exited with error code. See "systemctl status fail2ban.service" and "journalctl -xe" for details.
failed!
When I then run: systemctl status fail2ban.service
I get this:
● fail2ban.service - Fail2Ban Service
Loaded: loaded (/lib/systemd/system/fail2ban.service; enabled; vendor preset: enabled)
Active: inactive (dead) (Result: exit-code) since Tue 2018-05-15 14:01:38 UTC; 1min 40s ago
Docs: man:fail2ban(1)
Process: 4468 ExecStart=/usr/bin/fail2ban-client -x start (code=exited, status=255)
May 15 14:01:38 tastycoders-prod1 systemd[1]: fail2ban.service: Control process exited, code=exited status=255
May 15 14:01:38 tastycoders-prod1 systemd[1]: Failed to start Fail2Ban Service.
May 15 14:01:38 tastycoders-prod1 systemd[1]: fail2ban.service: Unit entered failed state.
May 15 14:01:38 tastycoders-prod1 systemd[1]: fail2ban.service: Failed with result 'exit-code'.
May 15 14:01:38 tastycoders-prod1 systemd[1]: fail2ban.service: Service hold-off time over, scheduling restart.
May 15 14:01:38 tastycoders-prod1 systemd[1]: Stopped Fail2Ban Service.
May 15 14:01:38 tastycoders-prod1 systemd[1]: fail2ban.service: Start request repeated too quickly.
May 15 14:01:38 tastycoders-prod1 systemd[1]: Failed to start Fail2Ban Service.
Some tutorials at DigitalOcean contain errors. Check your /etc/fail2ban/jail.local. Try to keep it as simple as you can, i.e. keep only those options you want to change in it.
Otherwise, if you have copied jail.conf to jail.local (according to the guide at DO), then delete or comment out the pam section in the jail.local file if you do not use it.
Go to line 146 of /etc/fail2ban/jail.local
# [pam-generic]
# enabled = false
# pam-generic filter can be customized to monitor specific subset of 'tty's
# filter = pam-generic
# port actually must be irrelevant but lets leave it all for some possible uses
# port = all
# banaction = iptables-allports
# port = anyport
# logpath = /var/log/auth.log
# maxretry = 6
More details are here: https://github.com/fail2ban/fail2ban/issues/1396
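To illustrate the "keep it as simple as you can" advice, a minimal jail.local could look like this (the values are placeholders, not taken from the original post):
[DEFAULT]
bantime = 600
findtime = 600
maxretry = 5
[sshd]
enabled = true
Everything not set here is inherited from jail.conf, which leaves far less room for a copy-paste error to stop fail2ban from starting.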

Preemptible node is sometimes failing to join the GKE cluster

I have a preemptible node pool of size 1 on GKE. I've been running this node pool with size 1 for almost a month now. Every day the node restarts after 24 hours and rejoins the cluster. Today it restarted but did not rejoin the cluster.
Instead, I noticed that according to gcloud compute instances list the underlying instance was running but not included in the output of kubectl get node. I increased the node pool size to 2, whereupon a second instance was launched. That node immediately joined my GKE cluster and pods were scheduled onto it. The first node is still running according to gcloud, but it won't join the cluster.
What's going on? How can I debug this problem?
Update:
I SSHed into the instance and was immediately greeted with this excellent error message:
Broken (or in progress) Kubernetes node setup! Check the cluster initialization status
using the following commands:
Master instance:
- sudo systemctl status kube-master-installation
- sudo systemctl status kube-master-configuration
Node instance:
- sudo systemctl status kube-node-installation
- sudo systemctl status kube-node-configuration
The results of sudo systemctl status kube-node-installation:
● kube-node-installation.service - Download and install k8s binaries and configurations
Loaded: loaded (/etc/systemd/system/kube-node-installation.service; enabled; vendor preset: disabled)
Active: active (exited) since Thu 2017-12-28 21:08:53 UTC; 6h ago
Process: 945 ExecStart=/home/kubernetes/bin/configure.sh (code=exited, status=0/SUCCESS)
Process: 941 ExecStartPre=/bin/chmod 544 /home/kubernetes/bin/configure.sh (code=exited, status=0/SUCCESS)
Process: 937 ExecStartPre=/usr/bin/curl --fail --retry 5 --retry-delay 3 --silent --show-error -H X-Google-Metadata-Request: True -o /home/kubernetes/bin/configure.sh http://metadata.google.internal/computeMetadata/v1/instance/attributes/configure-sh (code=exited, status=0/SUCCESS)
Process: 933 ExecStartPre=/bin/mount -o remount,exec /home/kubernetes/bin (code=exited, status=0/SUCCESS)
Process: 930 ExecStartPre=/bin/mount --bind /home/kubernetes/bin /home/kubernetes/bin (code=exited, status=0/SUCCESS)
Process: 925 ExecStartPre=/bin/mkdir -p /home/kubernetes/bin (code=exited, status=0/SUCCESS)
Main PID: 945 (code=exited, status=0/SUCCESS)
Tasks: 0 (limit: 4915)
Memory: 0B
CPU: 0
CGroup: /system.slice/kube-node-installation.service
Dec 28 21:08:52 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: Downloading node problem detector.
Dec 28 21:08:52 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: % Total % Received % Xferd Average Speed Time Time Time Current
Dec 28 21:08:52 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: Dload Upload Total Spent Left Speed
Dec 28 21:08:52 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: [158B blob data]
Dec 28 21:08:52 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: == Downloaded https://storage.googleapis.com/kubernetes-release/node-problem-detector/node-problem-detector-v0.4.1.tar.gz (SHA1 = a57a3fe64cab8a18ec654f5cef0aec59dae62568) ==
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: cni-0799f5732f2a11b329d9e3d51b9c8f2e3759f2ff.tar.gz is preloaded.
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: kubernetes-manifests.tar.gz is preloaded.
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: mounter is preloaded.
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: Done for installing kubernetes files
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc systemd[1]: Started Download and install k8s binaries and configurations.
And the result of sudo systemctl status kube-node-configuration:
● kube-node-configuration.service - Configure kubernetes node
Loaded: loaded (/etc/systemd/system/kube-node-configuration.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Thu 2017-12-28 21:08:53 UTC; 6h ago
Process: 994 ExecStart=/home/kubernetes/bin/configure-helper.sh (code=exited, status=4)
Process: 990 ExecStartPre=/bin/chmod 544 /home/kubernetes/bin/configure-helper.sh (code=exited, status=0/SUCCESS)
Main PID: 994 (code=exited, status=4)
CPU: 33ms
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc systemd[1]: Starting Configure kubernetes node...
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[994]: Start to configure instance for kubernetes
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[994]: Configuring IP firewall rules
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[994]: Another app is currently holding the xtables lock. Perhaps you want to use the -w option?
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[994]: Another app is currently holding the xtables lock. Perhaps you want to use the -w option?
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[994]: Another app is currently holding the xtables lock. Perhaps you want to use the -w option?
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc systemd[1]: kube-node-configuration.service: Main process exited, code=exited, status=4/NOPERMISSION
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc systemd[1]: Failed to start Configure kubernetes node.
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc systemd[1]: kube-node-configuration.service: Unit entered failed state.
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc systemd[1]: kube-node-configuration.service: Failed with result 'exit-code'.
So it looks like kube-node-configuration failed. I ran sudo systemctl restart kube-node-configuration and now the status output is:
● kube-node-configuration.service - Configure kubernetes node
Loaded: loaded (/etc/systemd/system/kube-node-configuration.service; enabled; vendor preset: disabled)
Active: active (exited) since Fri 2017-12-29 03:41:36 UTC; 3s ago
Main PID: 20802 (code=exited, status=0/SUCCESS)
CPU: 1.851s
Dec 29 03:41:28 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Extend the docker.service configuration to set a higher pids limit
Dec 29 03:41:28 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Docker command line is updated. Restart docker to pick it up
Dec 29 03:41:30 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Start kubelet
Dec 29 03:41:35 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Using kubelet binary at /home/kubernetes/bin/kubelet
Dec 29 03:41:35 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Start kube-proxy static pod
Dec 29 03:41:35 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Start node problem detector
Dec 29 03:41:35 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Using node problem detector binary at /home/kubernetes/bin/node-problem-detector
Dec 29 03:41:36 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Prepare containerized mounter
Dec 29 03:41:36 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Done for the configuration for kubernetes
Dec 29 03:41:36 gke-cluster0-pool-d59e9506-g9sc systemd[1]: Started Configure kubernetes node.
...and the node joined the cluster :). But, my original question stands: what happened?
We were experiencing a similar problem on GKE with preemptible nodes, seeing error messages like these from the nodes:
Extend the docker.service configuration to set a higher pids limit
Docker command line is updated. Restart docker to pick it up
level=info msg="Processing signal 'terminated'"
level=info msg="stopping event stream following graceful shutdown" error="<nil>" module=libcontainerd namespace=moby
level=info msg="Daemon shutdown complete"
docker daemon exited
Start kubelet
After about a month of back-and-forth with Google Support, we learned that the nodes were getting preempted and replaced; the new node that comes in uses the same name, and it all happens without the normal pod disruption of a node being evicted.
Backstory: we were running into this problem because Jenkins was running its workers on the nodes, and during this ~2 minute "restart" of the node going away and returning, the Jenkins master would lose the connection and fail the job.
tldr; don't use preemptible nodes for this kind of work.
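If you follow that advice, a sketch of moving the Jenkins workers to a regular pool (the pool and cluster names are hypothetical): create a node pool without the --preemptible flag and pin the workers to it via the pool's node label:
gcloud container node-pools create stable-pool --cluster=cluster0 --num-nodes=1
# then, in the worker pod spec:
#   nodeSelector:
#     cloud.google.com/gke-nodepool: stable-pool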