systemd: seems like ExecStop script is executed immediately after the start command is run - docker-compose

I am trying to start a docker-compose project as a systemd service on RHEL 7. Here is my systemd script (/etc/systemd/system/wp.service):
[Unit]
Description=wp service with docker compose
Requires=docker.service
After=docker.service
[Service]
EnvironmentFile=/home/ec2-user/projects/wp/project-dir/vars.env
WorkingDirectory=/home/ec2-user/projects/wp/project-dir
# ExecStartPre=/usr/bin/docker-compose down
ExecStart=/usr/bin/docker-compose up -d --build --remove-orphans
# ExecStop=/usr/bin/docker-compose down
[Install]
WantedBy=multi-user.target
When I execute the following command:
sudo systemctl start wp.service
everything works fine: the containers run and stay running. Here is the output of sudo systemctl status wp.service:
Aug 15 03:07:22 ip-172-31-33-87.ec2.internal docker-compose[4185]: ---> Using cache
Aug 15 03:07:22 ip-172-31-33-87.ec2.internal docker-compose[4185]: ---> 7392974149d3
Aug 15 03:07:22 ip-172-31-33-87.ec2.internal docker-compose[4185]: Successfully built 7392974149d3
Aug 15 03:07:22 ip-172-31-33-87.ec2.internal docker-compose[4185]: Successfully tagged foo_wp:latest
Aug 15 03:07:22 ip-172-31-33-87.ec2.internal docker-compose[4185]: Creating mysql ...
Aug 15 03:07:22 ip-172-31-33-87.ec2.internal docker-compose[4185]: [55B blob data]
Aug 15 03:07:23 ip-172-31-33-87.ec2.internal docker-compose[4185]: [37B blob data]
and the containers are up:
[ec2-user@ip-172-31-33-87 ~]$ sudo docker container ls -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
579b52c8e3bc foo_wp "docker-entrypoint.s…" About a minute ago Up About a minute 0.0.0.0:9101->80/tcp wp
3c418cfe2b9c mariadb:10.3.8-bionic "docker-entrypoint.s…" About a minute ago Up About a minute 3306/tcp mysql
If, however, I uncomment the ExecStop line above (and run docker-compose down and reload the service), then the containers are removed right after they are started. The output of the status command is:
Loaded: loaded (/etc/systemd/system/wp.service; disabled; vendor preset: disabled)
Active: deactivating (stop) since Wed 2018-08-15 03:12:12 UTC; 7s ago
Process: 4862 ExecStart=/usr/bin/docker-compose up -d --build --remove-orphans (code=exited, status=0/SUCCESS)
Main PID: 4862 (code=exited, status=0/SUCCESS); Control PID: 5165 (docker-compose)
Tasks: 2
Memory: 19.0M
CGroup: /system.slice/wp.service
└─control
└─5165 /usr/bin/python2 /usr/bin/docker-compose down
Aug 15 03:12:11 ip-172-31-33-87.ec2.internal docker-compose[4862]: Step 3/3 : COPY wordpress/ /var/www/html/
Aug 15 03:12:11 ip-172-31-33-87.ec2.internal docker-compose[4862]: ---> Using cache
Aug 15 03:12:11 ip-172-31-33-87.ec2.internal docker-compose[4862]: ---> 7392974149d3
Aug 15 03:12:11 ip-172-31-33-87.ec2.internal docker-compose[4862]: Successfully built 7392974149d3
Aug 15 03:12:11 ip-172-31-33-87.ec2.internal docker-compose[4862]: Successfully tagged foo_wp:latest
Aug 15 03:12:11 ip-172-31-33-87.ec2.internal docker-compose[4862]: Creating mysql ...
Aug 15 03:12:11 ip-172-31-33-87.ec2.internal docker-compose[4862]: [55B blob data]
Aug 15 03:12:12 ip-172-31-33-87.ec2.internal docker-compose[4862]: [37B blob data]
Aug 15 03:12:12 ip-172-31-33-87.ec2.internal docker-compose[5165]: Stopping wp ...
Aug 15 03:12:12 ip-172-31-33-87.ec2.internal docker-compose[5165]: Stopping mysql ...
and the containers have been removed:
sudo docker container ls -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
[ec2-user@ip-172-31-33-87 foo]$
It seems as though the systemd service is executing the ExecStop script immediately after the ExecStart script. What could be the cause?

You are running docker-compose in detached mode (option -d). After starting the containers, docker-compose will daemonise the containers and exit.
Systemd monitors the PID of docker-compose, and when it exits, assumes that your program has stopped and will invoke the ExecStop commands.
Try running it without the -d option.
Systemd behaves this way because you haven't specified the unit's Type=, so it defaults to Type=simple.
See the official documentation for Type and ExecStop.
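For reference, here is a minimal sketch of the unit with -d removed, using the same paths as above; docker-compose then stays in the foreground and becomes the main PID that systemd tracks:
[Unit]
Description=wp service with docker compose
Requires=docker.service
After=docker.service
[Service]
EnvironmentFile=/home/ec2-user/projects/wp/project-dir/vars.env
WorkingDirectory=/home/ec2-user/projects/wp/project-dir
# Without -d, docker-compose keeps running until the service is stopped,
# so ExecStop only fires on an actual systemctl stop.
ExecStart=/usr/bin/docker-compose up --build --remove-orphans
ExecStop=/usr/bin/docker-compose down
[Install]
WantedBy=multi-user.target
If you would rather keep -d, an alternative is Type=oneshot together with RemainAfterExit=yes, so systemd treats the unit as active even after the docker-compose up -d process exits.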

Related

Service starting gunicorn failing with "Start request repeated too quickly"

I am trying to start a service that runs gunicorn as the backend server for Flask, but it is not working. nginx, running as the frontend server for React, works fine.
Server:
Virtualization: vmware
Operating System: Red Hat Enterprise Linux 8.4 (Ootpa)
CPE OS Name: cpe:/o:redhat:enterprise_linux:8.4:GA
Kernel: Linux 4.18.0-305.3.1.el8_4.x86_64
Architecture: x86-64
Service file in /etc/systemd/system/myservice.service:
[Unit]
Description="Description"
After=network.target
[Service]
User=root
Group=root
WorkingDirectory=/home/project/app/api
ExecStart=/home/project/app/api/venv/bin/gunicorn -b 127.0.0.1:5000 api:app
Restart=always
[Install]
WantedBy=multi-user.target
/app/api:
-rwxr-xr-x. 1 root root 2018 Jun 9 20:06 api.py
drwxrwxr-x+ 5 root root 100 Jun 7 10:11 venv
Error message:
● myservice.service - "Description"
Loaded: loaded (/etc/systemd/system/myservice.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Thu 2021-06-10 19:01:01 CEST; 5s ago
Process: 18307 ExecStart=/home/project/app/api/venv/bin/gunicorn -b 127.0.0.1:5000 api:app (code=exited, status=203/EXEC)
Main PID: 18307 (code=exited, status=203/EXEC)
Jun 10 19:01:01 xxxx systemd[1]: myservice.service: Service RestartSec=100ms expired, scheduling restart.
Jun 10 19:01:01 xxxx systemd[1]: myservice.service: Scheduled restart job, restart counter is at 5.
Jun 10 19:01:01 xxxx systemd[1]: Stopped "Description".
Jun 10 19:01:01 xxxx systemd[1]: myservice.service: Start request repeated too quickly.
Jun 10 19:01:01 xxxx systemd[1]: myservice.service: Failed with result 'exit-code'.
Jun 10 19:01:01 xxxx systemd[1]: Failed to start "Description".
Tried, not working:
Adding Environment="PATH=/home/project/app/api/venv/bin" under [Service]
$ systemctl reset-failed myservice.service
$ systemctl daemon-reload
Rebooting, of course.
Tried, working:
Running (as root) /home/project/app/api/venv/bin/gunicorn -b 127.0.0.1:5000 api:app while in /app/api directory
Does anyone know how to fix this problem?
Typically, I figured it out shortly after posting this issue.
SELinux is interfering with permissions on the files and directories, so for anyone experiencing the same issue, try the following changes (as root):
$ setsebool -P httpd_can_network_connect on
$ chcon -Rt httpd_sys_content_t /path/to/your/Flask/dir
In my case: $ chcon -Rt httpd_sys_content_t /home/project/app/api
While this is NOT a permanent fix, it's worth a try. Check out the SELinux docs for more permanent solutions.
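If you want the same relabeling to persist (chcon changes can be lost on a filesystem relabel), here is a sketch using the policy store instead, assuming semanage is available (on RHEL 8 it comes from the policycoreutils-python-utils package) and using the same path and context as above:
$ semanage fcontext -a -t httpd_sys_content_t "/home/project/app/api(/.*)?"
$ restorecon -Rv /home/project/app/api
The first command records the rule persistently; the second applies it to the existing files.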

Apache Zookeeper: Unable to access data directory

OS: RHEL 8.2
I am trying to create a systemd service for ZooKeeper. It fails to access the dataDir.
Here is my ZooKeeper config:
dataDir=/opt/zookeeper
maxClientCnxns=20
tickTime=2000
dataDir=/var/zookeeper/
initLimit=20
syncLimit=10
server.0=master:2888:3888
clientPort=2181
admin.serverPort=8082
Permissions on /opt/zookeeper are set to 777.
[user1#server1 opt]$ ls -lart
total 0
dr-xr-xr-x. 17 root root 244 Jul 3 10:56 ..
drwxr-xr-x 3 root root 27 Jul 10 10:29 rh
drw-r--r-- 2 user2 user2 6 Jul 17 08:48 hsluw_data
drw-r--r-- 2 user2 user2 6 Jul 17 08:58 hsluw_config
drwxr-xr-x. 6 root root 71 Jul 17 08:58 .
drwxrwxrwx 3 user2 user2 23 Jul 17 09:40 zookeeper
If I run the command,
./bin/zookeeper-server-start.sh config/zookeeper.properties
it gives me an error message: Unable to access datadir
[2020-07-30 10:25:50,767] ERROR Invalid configuration, only one server specified (ignoring) (org.apache.zookeeper.server.quorum.QuorumPeerConfig)
[2020-07-30 10:25:50,767] INFO Starting server (org.apache.zookeeper.server.ZooKeeperServerMain)
[2020-07-30 10:25:50,769] INFO zookeeper.snapshot.trust.empty : false (org.apache.zookeeper.server.persistence.FileTxnSnapLog)
[2020-07-30 10:25:50,769] ERROR Unable to access datadir, exiting abnormally (org.apache.zookeeper.server.ZooKeeperServerMain)
org.apache.zookeeper.server.persistence.FileTxnSnapLog$DatadirException: Cannot write to data directory /var/zookeeper/version-2
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.<init>(FileTxnSnapLog.java:132)
at org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:124)
at org.apache.zookeeper.server.ZooKeeperServerMain.initializeAndRun(ZooKeeperServerMain.java:106)
at org.apache.zookeeper.server.ZooKeeperServerMain.main(ZooKeeperServerMain.java:64)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:128)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:82)
Unable to access datadir, exiting abnormally
However, sudoing the above command works,
sudo ./bin/zookeeper-server-start.sh config/zookeeper.properties
Now I have created a service in /etc/systemd/system/zookeeper.service, written like this:
[Unit]
Requires=network.target remote-fs.target
After=network.target remote-fs.target
[Service]
Type=simple
User=user2
ExecStart=/home/user2/kafka/bin/zookeeper-server-start.sh /home/user2/kafka/config/zookeeper.properties
ExecStop=/home/user2/kafka/bin/zookeeper-server-stop.sh
Restart=on-abnormal
[Install]
WantedBy=multi-user.target
The SELinux status is disabled.
user2#server1$ sestatus
SELinux status: disabled
Now if I do the following
sudo systemctl daemon-reload
sudo systemctl start zookeeper
sudo systemctl enable zookeeper
I am getting the same Unable to access datadir error, as follows:
[user2#server1 /]$ systemctl status zookeeper
● zookeeper.service
Loaded: loaded (/etc/systemd/system/zookeeper.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Thu 2020-07-30 10:13:19 CEST; 24s ago
Main PID: 12911 (code=exited, status=3)
Jul 30 10:13:19 server1.localdomain zookeeper-server-start.sh[12911]: org.apache.zookeeper.server.persistence.FileTxnSnapLog$Data>
Jul 30 10:13:19 server1.localdomain zookeeper-server-start.sh[12911]: at org.apache.zookeeper.server.persistence.FileTxnS>
Jul 30 10:13:19 server1.localdomain zookeeper-server-start.sh[12911]: at org.apache.zookeeper.server.ZooKeeperServerMain.>
Jul 30 10:13:19 server1.localdomain zookeeper-server-start.sh[12911]: at org.apache.zookeeper.server.ZooKeeperServerMain.>
Jul 30 10:13:19 server1.localdomain zookeeper-server-start.sh[12911]: at org.apache.zookeeper.server.ZooKeeperServerMain.>
Jul 30 10:13:19 server1.localdomain zookeeper-server-start.sh[12911]: at org.apache.zookeeper.server.quorum.QuorumPeerMai>
Jul 30 10:13:19 server1.localdomain zookeeper-server-start.sh[12911]: at org.apache.zookeeper.server.quorum.QuorumPeerMai>
Jul 30 10:13:19 server1.localdomain zookeeper-server-start.sh[12911]: Unable to access datadir, exiting abnormally
Jul 30 10:13:19 server1.localdomain systemd[1]: zookeeper.service: Main process exited, code=exited, status=3/NOTIMPLEMENTED
Jul 30 10:13:19 server1.localdomain systemd[1]: zookeeper.service: Failed with result 'exit-code'.
What am I missing here?
In the configuration file, dataDir is set twice: first to /opt/zookeeper and then to /var/zookeeper/. The second setting wins, and user2 cannot write to /var/zookeeper/ (note the error path /var/zookeeper/version-2). Removing the dataDir=/var/zookeeper/ line solves the issue.
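For clarity, a sketch of the cleaned-up zookeeper.properties with a single dataDir, keeping the same values as above:
dataDir=/opt/zookeeper
maxClientCnxns=20
tickTime=2000
initLimit=20
syncLimit=10
server.0=master:2888:3888
clientPort=2181
admin.serverPort=8082
You can also confirm that the service user can write to the effective dataDir (the file name here is just an example):
$ sudo -u user2 touch /opt/zookeeper/write-test && sudo -u user2 rm /opt/zookeeper/write-test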

systemd service not starting on boot, starts when I restart it

I have made this service file to start a Python script when my Raspberry Pi (4) boots up:
/etc/systemd/system/plants.service
[Unit]
Description=plant-sender
After=network.target
[Service]
Type=simple
User=root
Group=root
WorkingDirectory=/home/theo/Repos/plants-monitor/remote
ExecStart=/usr/bin/python main.py
Restart=on-failure
[Install]
WantedBy=multi-user.target
However, once the Pi is up, I run sudo systemctl status plants and get:
* plants.service - plant-sender
Loaded: loaded (/etc/systemd/system/plants.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Mon 2020-03-30 20:22:43 EDT; 1min 45s ago
Process: 323 ExecStart=/usr/bin/python main.py (code=exited, status=1/FAILURE)
Main PID: 323 (code=exited, status=1/FAILURE)
Mar 30 20:22:43 arpi systemd[1]: plants.service: Scheduled restart job, restart counter is at 5.
Mar 30 20:22:43 arpi systemd[1]: Stopped plant-sender.
Mar 30 20:22:43 arpi systemd[1]: plants.service: Start request repeated too quickly.
Mar 30 20:22:43 arpi systemd[1]: plants.service: Failed with result 'exit-code'.
Mar 30 20:22:43 arpi systemd[1]: Failed to start plant-sender.
But, after running sudo systemctl restart plants, the service starts up and everything is fine.
If it doesn't start on boot but does on systemctl restart, I'd be looking at whether /home/theo/Repos/plants-monitor/remote is mounted at that point.
Something may be automounting your home directory, or mounting it only when you log in.
If so, you could change the working directory to something that always exists, even if only as a test.
Additionally, using journalctl -n 9999 -u plants will get you more log messages, so you can see why it's failing, rather than just seeing the "tried too many times, giving up" messages.
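If the working directory does turn out to be mounted late, one option is to state that dependency in the unit; here is a sketch of the [Unit] section, assuming RequiresMountsFor= (a standard systemd directive) is the mechanism you want, using the path from your unit:
[Unit]
Description=plant-sender
After=network.target
# Do not start until the filesystem containing the script is mounted.
RequiresMountsFor=/home/theo/Repos/plants-monitor/remote
Run sudo systemctl daemon-reload after editing, then test with a reboot.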

Preemptible node is sometimes failing to join the GKE cluster

I have a preemptible node pool of size 1 on GKE. I've been running this node pool with size 1 for almost a month now. Every day the node restarts after 24 hours and rejoins the cluster. Today it restarted but did not rejoin the cluster.
Instead, I noticed that according to gcloud compute instances list the underlying instance was running but not included in the output of kubectl get node. I increased the node pool size to 2, whereupon a second instance was launched. That node immediately joined my GKE cluster and pods were scheduled onto it. The first node is still running according to gcloud, but it won't join the cluster.
What's going on? How can I debug this problem?
Update:
I SSHed into the instance and was immediately greeted with this excellent error message:
Broken (or in progress) Kubernetes node setup! Check the cluster initialization status
using the following commands:
Master instance:
- sudo systemctl status kube-master-installation
- sudo systemctl status kube-master-configuration
Node instance:
- sudo systemctl status kube-node-installation
- sudo systemctl status kube-node-configuration
The results of sudo systemctl status kube-node-installation:
● kube-node-installation.service - Download and install k8s binaries and configurations
Loaded: loaded (/etc/systemd/system/kube-node-installation.service; enabled; vendor preset: disabled)
Active: active (exited) since Thu 2017-12-28 21:08:53 UTC; 6h ago
Process: 945 ExecStart=/home/kubernetes/bin/configure.sh (code=exited, status=0/SUCCESS)
Process: 941 ExecStartPre=/bin/chmod 544 /home/kubernetes/bin/configure.sh (code=exited, status=0/SUCCESS)
Process: 937 ExecStartPre=/usr/bin/curl --fail --retry 5 --retry-delay 3 --silent --show-error -H X-Google-Metadata-Request: True -o /home/kubernetes/bin/configure.sh http://metadata.google.internal/computeMetadata/v1/instance/attributes/configure-sh (code=exited, status=0/SUCCESS)
Process: 933 ExecStartPre=/bin/mount -o remount,exec /home/kubernetes/bin (code=exited, status=0/SUCCESS)
Process: 930 ExecStartPre=/bin/mount --bind /home/kubernetes/bin /home/kubernetes/bin (code=exited, status=0/SUCCESS)
Process: 925 ExecStartPre=/bin/mkdir -p /home/kubernetes/bin (code=exited, status=0/SUCCESS)
Main PID: 945 (code=exited, status=0/SUCCESS)
Tasks: 0 (limit: 4915)
Memory: 0B
CPU: 0
CGroup: /system.slice/kube-node-installation.service
Dec 28 21:08:52 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: Downloading node problem detector.
Dec 28 21:08:52 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: % Total % Received % Xferd Average Speed Time Time Time Current
Dec 28 21:08:52 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: Dload Upload Total Spent Left Speed
Dec 28 21:08:52 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: [158B blob data]
Dec 28 21:08:52 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: == Downloaded https://storage.googleapis.com/kubernetes-release/node-problem-detector/node-problem-detector-v0.4.1.tar.gz (SHA1 = a57a3fe64cab8a18ec654f5cef0aec59dae62568) ==
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: cni-0799f5732f2a11b329d9e3d51b9c8f2e3759f2ff.tar.gz is preloaded.
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: kubernetes-manifests.tar.gz is preloaded.
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: mounter is preloaded.
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure.sh[945]: Done for installing kubernetes files
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc systemd[1]: Started Download and install k8s binaries and configurations.
And the result of sudo systemctl status kube-node-configuration:
● kube-node-configuration.service - Configure kubernetes node
Loaded: loaded (/etc/systemd/system/kube-node-configuration.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Thu 2017-12-28 21:08:53 UTC; 6h ago
Process: 994 ExecStart=/home/kubernetes/bin/configure-helper.sh (code=exited, status=4)
Process: 990 ExecStartPre=/bin/chmod 544 /home/kubernetes/bin/configure-helper.sh (code=exited, status=0/SUCCESS)
Main PID: 994 (code=exited, status=4)
CPU: 33ms
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc systemd[1]: Starting Configure kubernetes node...
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[994]: Start to configure instance for kubernetes
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[994]: Configuring IP firewall rules
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[994]: Another app is currently holding the xtables lock. Perhaps you want to use the -w option?
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[994]: Another app is currently holding the xtables lock. Perhaps you want to use the -w option?
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[994]: Another app is currently holding the xtables lock. Perhaps you want to use the -w option?
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc systemd[1]: kube-node-configuration.service: Main process exited, code=exited, status=4/NOPERMISSION
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc systemd[1]: Failed to start Configure kubernetes node.
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc systemd[1]: kube-node-configuration.service: Unit entered failed state.
Dec 28 21:08:53 gke-cluster0-pool-d59e9506-g9sc systemd[1]: kube-node-configuration.service: Failed with result 'exit-code'.
So it looks like kube-node-configuration failed. I ran sudo systemctl restart kube-node-configuration and now the status output is:
● kube-node-configuration.service - Configure kubernetes node
Loaded: loaded (/etc/systemd/system/kube-node-configuration.service; enabled; vendor preset: disabled)
Active: active (exited) since Fri 2017-12-29 03:41:36 UTC; 3s ago
Main PID: 20802 (code=exited, status=0/SUCCESS)
CPU: 1.851s
Dec 29 03:41:28 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Extend the docker.service configuration to set a higher pids limit
Dec 29 03:41:28 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Docker command line is updated. Restart docker to pick it up
Dec 29 03:41:30 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Start kubelet
Dec 29 03:41:35 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Using kubelet binary at /home/kubernetes/bin/kubelet
Dec 29 03:41:35 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Start kube-proxy static pod
Dec 29 03:41:35 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Start node problem detector
Dec 29 03:41:35 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Using node problem detector binary at /home/kubernetes/bin/node-problem-detector
Dec 29 03:41:36 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Prepare containerized mounter
Dec 29 03:41:36 gke-cluster0-pool-d59e9506-g9sc configure-helper.sh[20802]: Done for the configuration for kubernetes
Dec 29 03:41:36 gke-cluster0-pool-d59e9506-g9sc systemd[1]: Started Configure kubernetes node.
...and the node joined the cluster :). But, my original question stands: what happened?
We were experiencing a similar problem on GKE with preemptible nodes, seeing error messages like these from the nodes:
Extend the docker.service configuration to set a higher pids limit
Docker command line is updated. Restart docker to pick it up
level=info msg="Processing signal 'terminated'"
level=info msg="stopping event stream following graceful shutdown" error="<nil>" module=libcontainerd namespace=moby
level=info msg="Daemon shutdown complete"
docker daemon exited
Start kubelet
After about a month of back-and-forth with Google Support, we learned that the nodes were getting preempted and replaced; the replacement node comes in with the same name, and it all happens without the normal pod disruption of a node being evicted.
Backstory: we were running into this problem because Jenkins was running its workers on the nodes, and during this ~2 minute "restart" of the node going away and returning, the Jenkins master would lose its connection and fail the job.
tldr; don't use preemptible nodes for this kind of work.

rkt discovery failed error while bringing flannel up (Kubernetes setup)

We are trying to set up a Kubernetes cluster on 3 nodes with CoreOS, following the official step-by-step documentation: https://coreos.com/kubernetes/docs/latest/deploy-master.html
The servers are behind a company proxy and have proxy settings defined in both
/etc/systemd/system/docker.service.d
/etc/systemd/system/flanneld.service.d
The drop-ins are picked up, as shown by
systemctl cat flanneld
# /usr/lib/systemd/system/flanneld.service
[Unit]
Description=flannel - Network fabric for containers (System Application Container)
Documentation=https://github.com/coreos/flannel
After=etcd.service etcd2.service etcd-member.service
Before=docker.service flannel-docker-opts.service
Requires=flannel-docker-opts.service
[Service]
Type=notify
Restart=always
RestartSec=10s
LimitNOFILE=40000
LimitNPROC=1048576
Environment="FLANNEL_IMAGE_TAG=v0.6.2"
Environment="FLANNEL_OPTS=--ip-masq=true"
Environment="RKT_RUN_ARGS=--uuid-file-save=/var/lib/coreos/flannel-wrapper.uuid"
EnvironmentFile=-/run/flannel/options.env
ExecStartPre=/sbin/modprobe ip_tables
ExecStartPre=/usr/bin/mkdir --parents /var/lib/coreos /run/flannel
ExecStartPre=-/usr/bin/rkt rm --uuid-file=/var/lib/coreos/flannel-wrapper.uuid
ExecStart=/usr/lib/coreos/flannel-wrapper $FLANNEL_OPTS
ExecStop=-/usr/bin/rkt stop --uuid-file=/var/lib/coreos/flannel-wrapper.uuid
[Install]
WantedBy=multi-user.target
# /etc/systemd/system/flanneld.service.d/40-ExecStartPre-symlink.conf
[Service]
ExecStartPre=/usr/bin/ln -sf /etc/flannel/options.env /run/flannel/options.env
# /etc/systemd/system/flanneld.service.d/proxy.conf
[Service]
Environment="HTTP_PROXY=http://10.140.65.114:8080/"
Environment="HTTPS_PROXY=http://10.140.65.114:8080/"
and
systemctl cat docker
# /usr/lib/systemd/system/docker.service
[Unit]
Description=Docker Application Container Engine
Documentation=http://docs.docker.com
After=containerd.service docker.socket early-docker.target network.target
Requires=containerd.service docker.socket early-docker.target
[Service]
Type=notify
EnvironmentFile=-/run/flannel/flannel_docker_opts.env
# the default is not to use systemd for cgroups because the delegate issues still
# exists and systemd currently does not support the cgroup feature set required
# for containers run by docker
ExecStart=/usr/lib/coreos/dockerd --host=fd:// --containerd=/var/run/docker/libcontainerd/docker-containerd.sock $DOCKER_OPTS $DOCKER_CGROUPS $DOCKER_OPT_BIP $DOCKER_OP
ExecReload=/bin/kill -s HUP $MAINPID
LimitNOFILE=1048576
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNPROC=infinity
LimitCORE=infinity
# Uncomment TasksMax if your systemd version supports it.
# Only systemd 226 and above support this version.
TasksMax=infinity
TimeoutStartSec=0
# set delegate yes so that systemd does not reset the cgroups of docker containers
Delegate=yes
[Install]
WantedBy=multi-user.target
# /etc/systemd/system/docker.service.d/40-flannel.conf
[Unit]
Requires=flanneld.service
After=flanneld.service
[Service]
EnvironmentFile=/etc/kubernetes/cni/docker_opts_cni.env
# /etc/systemd/system/docker.service.d/http-proxy.conf
[Service]
Environment="HTTP_PROXY=http://10.140.65.114:8080/"
Environment="HTTPS_PROXY=http://10.140.65.114:8080/"
# /etc/systemd/system/flanneld.service.d/40-ExecStartPre-symlink.conf
[Service]
ExecStartPre=/usr/bin/ln -sf /etc/flannel/options.env /run/flannel/options.env
# /etc/systemd/system/flanneld.service.d/proxy.conf
[Service]
Environment="HTTP_PROXY=http://10.140.65.114:8080/"
Environment="HTTPS_PROXY=http://10.140.65.114:8080/"
After running systemctl daemon-reload and systemctl start flanneld, we get the following error:
Feb 16 19:50:40 localhost systemd[1]: Starting flannel - Network fabric for containers (System Application Container)...
-- Subject: Unit flanneld.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit flanneld.service has begun starting up.
Feb 16 19:50:40 localhost rkt[52933]: rm: cannot get pod: no matches found for "26778eb4-9d8a-4d3c-9bb7-6ffb13a55d6a"
Feb 16 19:50:40 localhost rkt[52933]: rm: failed to remove one or more pods
Feb 16 19:50:40 localhost flannel-wrapper[52947]: + exec /usr/bin/rkt run --uuid-file-save=/var/lib/coreos/flannel-wrapper.uuid --trust-keys-from-https --mount volume=notify,target=/run/systemd/notify --volume notify,kind=host,source=/run/systemd/notify --set-env=NOTIFY_SOCKET=/run/systemd/notify --net=host --volume run-flannel,kind=host,source=/run/flannel,readOnly=false --volume etc-ssl-certs,kind=host,source=/usr/share/ca-certificates,readOnly=true --volume usr-share-certs,kind=host,source=/usr/share/ca-certificates,readOnly=true --volume etc-hosts,kind=host,source=/etc/hosts,readOnly=true --volume etc-resolv,kind=host,source=/etc/resolv.conf,readOnly=true --mount volume=run-flannel,target=/run/flannel --mount volume=etc-ssl-certs,target=/etc/ssl/certs --mount volume=usr-share-certs,target=/usr/share/ca-certificates --mount volume=etc-hosts,target=/etc/hosts --mount volume=etc-resolv,target=/etc/resolv.conf --inherit-env --stage1-from-dir=stage1-fly.aci quay.io/coreos/flannel:v0.6.2 -- --ip-masq=true
Feb 16 19:50:41 localhost sudo[52978]: admin : TTY=pts/1 ; PWD=/home/admin ; USER=root ; COMMAND=/bin/journalctl -e -u kubelet
Feb 16 19:50:41 localhost sudo[52978]: pam_unix(sudo:session): session opened for user root by admin(uid=0)
Feb 16 19:50:41 localhost sudo[52978]: pam_systemd(sudo:session): Cannot create session: Already running in a session
Feb 16 19:50:41 localhost sudo[52978]: pam_unix(sudo:session): session closed for user root
Feb 16 19:50:42 localhost flannel-wrapper[52947]: image: keys already exist for prefix "quay.io/coreos/flannel", not fetching again
Feb 16 19:50:43 localhost sudo[52990]: admin : TTY=pts/1 ; PWD=/home/admin ; USER=root ; COMMAND=/bin/journalctl -e -u kubelet
Feb 16 19:50:43 localhost sudo[52990]: pam_unix(sudo:session): session opened for user root by admin(uid=0)
Feb 16 19:50:43 localhost sudo[52990]: pam_systemd(sudo:session): Cannot create session: Already running in a session
Feb 16 19:50:43 localhost sudo[52990]: pam_unix(sudo:session): session closed for user root
Feb 16 19:50:44 localhost flannel-wrapper[52947]: Downloading signature: 0 B/473 B
Feb 16 19:50:44 localhost flannel-wrapper[52947]: Downloading signature: 473 B/473 B
Feb 16 19:50:45 localhost flannel-wrapper[52947]: Downloading signature: 473 B/473 B
Feb 16 19:50:45 localhost flannel-wrapper[52947]: run: Get https://quay-registry.s3.amazonaws.com/sharedimages/36acf4f7-a5bd-470b-9a44-13cbd244b571/layer?Signature=v8rQghQZR0k%2B1UxDG8oGw89vTqY%3D&Expires=1487255465&AWSAccessKeyId=AKIAJWZWUIS24TWSMWRA: Blocked site:
Feb 16 19:50:45 localhost systemd[1]: flanneld.service: Main process exited, code=exited, status=254/n/a
Feb 16 19:50:45 localhost rkt[52993]: stop: cannot get pod: no matches found for "26778eb4-9d8a-4d3c-9bb7-6ffb13a55d6a"
Feb 16 19:50:45 localhost rkt[52993]: stop: failed to stop 1 pod(s)
Feb 16 19:50:45 localhost systemd[1]: Failed to start flannel - Network fabric for containers (System Application Container).
-- Subject: Unit flanneld.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit flanneld.service has failed.
--
-- The result is failed.
Feb 16 19:50:45 localhost systemd[1]: flanneld.service: Unit entered failed state.
Feb 16 19:50:45 localhost systemd[1]: flanneld.service: Failed with result 'exit-code'.
Feb 16 19:50:45 localhost systemd[1]: Starting flannel docker export service - Network fabric for containers (System Application Container)...
-- Subject: Unit flannel-docker-opts.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit flannel-docker-opts.service has begun starting up.
Feb 16 19:50:45 localhost sudo[53003]: admin : TTY=pts/1 ; PWD=/home/admin ; USER=root ; COMMAND=/bin/journalctl -e -u kubelet
Feb 16 19:50:45 localhost sudo[53003]: pam_unix(sudo:session): session opened for user root by admin(uid=0)
Feb 16 19:50:45 localhost sudo[53003]: pam_systemd(sudo:session): Cannot create session: Already running in a session
Feb 16 19:50:45 localhost sudo[53003]: pam_unix(sudo:session): session closed for user root
Feb 16 19:50:45 localhost rkt[53000]: rm: cannot get pod: UUID cannot be empty
Feb 16 19:50:45 localhost rkt[53000]: rm: failed to remove one or more pods
Feb 16 19:50:45 localhost flannel-wrapper[53019]: + exec /usr/bin/rkt run --uuid-file-save=/var/lib/coreos/flannel-wrapper2.uuid --trust-keys-from-https --net=host --volume run-flannel,kind=host,source=/run/flannel,readOnly=false --volume etc-ssl-certs,kind=host,source=/usr/share/ca-certificates,readOnly=true --volume usr-share-certs,kind=host,source=/usr/share/ca-certificates,readOnly=true --volume etc-hosts,kind=host,source=/etc/hosts,readOnly=true --volume etc-resolv,kind=host,source=/etc/resolv.conf,readOnly=true --mount volume=run-flannel,target=/run/flannel --mount volume=etc-ssl-certs,target=/etc/ssl/certs --mount volume=usr-share-certs,target=/usr/share/ca-certificates --mount volume=etc-hosts,target=/etc/hosts --mount volume=etc-resolv,target=/etc/resolv.conf --inherit-env --stage1-from-dir=stage1-fly.aci quay.io/coreos/flannel:v0.6.2 --exec=/opt/bin/mk-docker-opts.sh -- -d /run/flannel/flannel_docker_opts.env -i
Feb 16 19:50:46 localhost flannel-wrapper[53019]: run: discovery failed
Feb 16 19:50:46 localhost systemd[1]: flannel-docker-opts.service: Main process exited, code=exited, status=254/n/a
Feb 16 19:50:46 localhost systemd[1]: Failed to start flannel docker export service - Network fabric for containers (System Application Container).
-- Subject: Unit flannel-docker-opts.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit flannel-docker-opts.service has failed.
--
-- The result is failed.
Feb 16 19:50:46 localhost systemd[1]: flannel-docker-opts.service: Unit entered failed state.
Feb 16 19:50:46 localhost systemd[1]: flannel-docker-opts.service: Failed with result 'exit-code'.
We tried a different document, https://www.upcloud.com/support/deploy-kubernetes-coreos/, and following it we get the same type of error while starting the kubelet.
It seems to be a problem with rkt and the quay registry when running behind the company proxy.
Let us know if we missed something or configured something wrong.
Can you please try running
$ sudo rkt fetch quay.io/coreos/flannel:v0.6.2
in the shell first?
I believe the issue is due to either running the HTTPS proxy over plain http, or rkt fetch running as an unprivileged user and not inheriting the system environment variables.
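Since sudo usually strips most of the environment, here is a sketch of a fetch that passes the proxy settings through explicitly, assuming rkt honours the standard proxy variables and using the proxy address from your drop-ins:
$ sudo env HTTP_PROXY=http://10.140.65.114:8080/ HTTPS_PROXY=http://10.140.65.114:8080/ rkt fetch quay.io/coreos/flannel:v0.6.2
If that succeeds, the missing proxy environment is the likely cause; if it still fails with a "Blocked site" response like the one in your journal, the proxy itself is blocking quay-registry.s3.amazonaws.com and would need to allow it.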