I run a Kubernetes cluster with version 1.5.2, setup with Kops on AWS. The setup has nothing exotic. My nodes run on m4.xlarge with 70 Gb of disk storage with 1000 iops.
I have periods where some of my nodes get crazy with iops. Here is what I see:
So du take all my iops in the docker overlay directory. Here is what the kubelet logs display:
fsHandler.go:131] du and find on following dirs took 4.22914425s: [/var/lib/docker/overlay/592c1d88d1fd115f21e8fe6f198a8a27cd44efefb9b5dc58940fbf6d7999eda3 /var/lib/docker/containers/2347d28886bc0e6b74fc326538e1483927ddeb89b38e035acd845d5db621cb79]
fsHandler.go:131] du and find on following dirs took 24.94283434s: [/var/lib/docker/overlay/81f24df3624ebf7b7e45edc38fafeb41958bc675ae57fd0126c44cb2c3a6d6d6 /var/lib/docker/containers/43d576931081500fd4cd316afe5bfc6ff2442ff20e8e8266c27e930a0a77dd34]
fsHandler.go:131] du and find on following dirs took 18.478782737s: [/var/lib/docker/overlay/422ef31413df4e76de51acaa7d6ff6f77edc65fabde88a7c70e7edad3b1e55e5 /var/lib/docker/containers/1519a33729c8fb13297358edc53fe22f0b4b684636884976dfcb67c47fbf320b]
helpers.go:101] Unable to get network stats from pid 13515: couldn't read network stats: failure opening /proc/13515/net/dev: open /proc/13515/net/dev: no such file or directory
fsHandler.go:131] du and find on following dirs took 7.971745844s: [/var/lib/docker/overlay/45b83939bd1b4ec7dfa627bb6a9eb8b89a380007f9e22a93fff2ba4054252271 /var/lib/docker/containers/f6d3387423398d7dd4fac6c19ee0a1446d0465b5f9cf90289fcd605ad28c0d6e]
fsHandler.go:131] du and find on following dirs took 5.886763577s: [/var/lib/docker/overlay/8c01a73671eedb2e62c58fa12fc2d25df58c506545b6ea048fa0db1756d19f2c /var/lib/docker/containers/1d9c0ebcc6dbbd7065923f7f81c05c0d9d710aed0d353a1bab90ce1c994dfb57]
fsHandler.go:131] du and find on following dirs took 5.714942029s: [/var/lib/docker/overlay/26213ba30a17f240a9b9756a0d23ab32550f921de533667c9ab91cfb7f10ed5b /var/lib/docker/containers/7c27c242a49d8d33cee8b2e8335dae450af13b26f010794dc83ef5750a212d0d]
fsHandler.go:131] du and find on following dirs took 6.111478835s: [/var/lib/docker/overlay/0fe2bd0feeda24699bd6d443ca126ac1a33071cdff039ae9fd9159bbef80867b /var/lib/docker/containers/ec6fb966139e9666ec0be5e13399773f1971ddd99841b84167a7463402e28d73]
fsHandler.go:131] du and find on following dirs took 2.661604836s: [/var/lib/docker/overlay/04f9d01a8863cfee26e678e938fced84f826dda6ed03626dda11b6aad6901465 /var/lib/docker/containers/a4e37aee69c7523c46c5252c1834fa3fcd5a804a7aee256a468e44b4d6bcbd64]
fsHandler.go:131] du and find on following dirs took 11.834409809s: [/var/lib/docker/overlay/4cb1476621b90e2c2ee2b1131c0e6ac62f62dc3ca418129812b487bffac1d827 /var/lib/docker/containers/5a01521cfdd3041aff128dce7353ab336ddafa60c8c0b2254fb6bae697cb1676]
I recommend upgrading to k8s version 1.6, there are many updates noted in the CHANGELOG that should help debug your issue.
Generally, EBS volumes are not fully available in terms of IO unless you have fully "pre-warmed" them by reading and writing to every block on the device.
Related
I am trying to install kubernetes with kubeadm in my laptop which has Ubuntu 16.04. I have disabled swap, since kubelet does not work with swap on. The command I used is :
swapoff -a
I also commented out the reference to swap in /etc/fstab.
# /etc/fstab: static file system information.
#
# Use 'blkid' to print the universally unique identifier for a
# device; this may be used with UUID= as a more robust way to name devices
# that works even if disks are added and removed. See fstab(5).
#
# <file system> <mount point> <type> <options> <dump> <pass>
# / was on /dev/sda1 during installation
UUID=1d343a19-bd75-47a6-899d-7c8bc93e28ff / ext4 errors=remount-ro 0 1
# swap was on /dev/sda5 during installation
#UUID=d0200036-b211-4e6e-a194-ac2e51dfb27d none swap sw 0 0
I confirmed swap is turned off by running the following:
free -m
total used free shared buff/cache available
Mem: 15936 2108 9433 954 4394 12465
Swap: 0 0 0
When I start kubeadm, I get the following error:
kubeadm init --pod-network-cidr=10.244.0.0/16
[init] Using Kubernetes version: v1.14.2
[preflight] Running pre-flight checks
[WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/
error execution phase preflight: [preflight] Some fatal errors occurred:
[ERROR Swap]: running with swap on is not supported. Please disable swap
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
I also tried restarting my laptop, but I get the same error. What could the reason be?
below was the root cause.
detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd".
you need to update the docker cgroup driver.
follow the below fix
cat > /etc/docker/daemon.json <<EOF
{
"exec-opts": ["native.cgroupdriver=systemd"],
"log-driver": "json-file",
"log-opts": {
"max-size": "100m"
},
"storage-driver": "overlay2",
"storage-opts": [
"overlay2.override_kernel_check=true"
]
}
EOF
mkdir -p /etc/systemd/system/docker.service.d
# Restart Docker
systemctl daemon-reload
systemctl restart docker
you could try kubeadm reset , then kubeadm init --ignore-preflight-errors Swap .
first try with sudo
sudo swapoff -a
then check if there's anything swapped
cat /proc/swaps
and
free -h
I have this Docker command:
docker run -d mongo
this will build and run a mongodb server running in a docker container
However, I get an error:
no space left on device
I am on MacOS, and using the newer versions of Docker which use hyper-v instead of VirtualBox (I think that's correct).
Here is the exact error message from the mongo container:
$ docker logs efee16702c5756659d563b98d4ae0f58ecf1f1bba8a54f63443c0ae4b520ab4e
about to fork child process, waiting until server is ready for connections.
forked process: 21
2017-05-04T20:23:51.412+0000 I CONTROL [main] ***** SERVER RESTARTED *****
2017-05-04T20:23:51.430+0000 I CONTROL [main] ERROR: Cannot write pid file to /tmp/tmp.Lo035QkbfL: No space left on device
ERROR: child process failed, exited with error number 1
Any idea how to fix this and prevent it from happening in future?
As suggested, the output of df -h is:
Filesystem Size Used Avail Capacity iused ifree %iused Mounted on
/dev/disk1 465Gi 116Gi 349Gi 25% 1963838 4293003441 0% /
devfs 183Ki 183Ki 0Bi 100% 634 0 100% /dev
map -hosts 0Bi 0Bi 0Bi 100% 0 0 100% /net
map auto_home 0Bi 0Bi 0Bi 100% 0 0 100% /home
Output of docker info is:
$ docker info
Containers: 5
Running: 0
Paused: 0
Stopped: 5
Images: 741
Server Version: 17.03.1-ce
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 4ab9917febca54791c5f071a9d1f404867857fcc
runc version: 54296cf40ad8143b62dbcaa1d90e520a2136ddfe
init version: N/A (expected: 949e6facb77383876aeff8a6944dde66b3089574)
Security Options:
seccomp
Profile: default
Kernel Version: 4.9.13-moby
Operating System: Alpine Linux v3.5
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 1.952 GiB
Name: moby
ID: OR4L:WYWW:FFAP:IDX3:B6UK:O2AN:UVTO:EPH6:GYSV:4GV4:L5WP:BQTH
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
File Descriptors: 17
Goroutines: 30
System Time: 2017-05-04T20:45:27.056157913Z
EventsListeners: 1
No Proxy: *.local, 169.254/16
Registry: https://index.docker.io/v1/
Experimental: true
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
As you state in the comments to the question, ls -altrh ~/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/Docker.qcow2 returns the following:
-rw-r--r--# 1 alexamil staff 53G
This is a known bug on MacOS (actually, not only) and an official dev comment could be found here. Except for one thing: I read, that different people get different size limit. In the comment it is 64Gb, but for another person it was 20Gb.
There are a couple walkarounds, but no definite solution that I could find.
The manual one
Run docker ps -a and manually remove all unused containers. Then run docker images and remove manually all the intermediate and unused images.
The simplest one
Delete the Docker.qcow2 file entirely. But you will lose all images and containers. Completely.
The less simple
Another way is to run docker volume prune, which will remove all unused volumes
The resizing one (keeps the data)
Another idea that comes to me is to expand the disk image size with QEMU or something like it:
$ brew install qemu
$ /Applications/Docker.app/Contents/MacOS/qemu-img resize ~/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/Docker.qcow2 +5G
After you expanded the image, you will need to run a VM in which you should run GParted against Docker.qcow2 and expand the partition to use added space. You could use GParted Live ISO for that:
$ qemu-system-x86_64 -drive file=Docker.qcow2 -m 512 -cdrom ~/Downloads/gparted-live.iso -boot d -device usb-mouse -usb
Some people report this either doesn't work or doesn't help.
Yet another resizing one (wipes the data)
Create a substitute image with desired size (120G):
$ qemu-img create -f qcow2 ~/data.qcow2 120G
$ cp ~/data.qcow2 /Application/Docker.app/Contents/Resources/moby/data.qcow2
$ rm ~/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/Docker.qcow2
data.qcow2 is copied to ~/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/Docker.qcow2 when you restart docker.
This walkaround comes from this comment.
Hope this helps. Good luck!
I'm attempting a whole-planet OSM data import on an AWS EC2. During or possibly after the "Ways" processing i receive the following message:
"Failed to read from node cache: Input/output error"
The EC2 has the following specs:
type: i3.xlarge
memory: 30.5 Gb
vCPUs: 4
Postgresql: v9.5.6
PostGIS: 2.2
In addition to the root volume, I have mounted 900GB SSD and a 2TB HHD (high throughput). The Postgresql data directory is on the HHD. I have commanded the osm2pgsql to write the flat-nodes file the SSD.
Here is my osm2pgsql command:
osm2pgsql -c -d gis --number-processes 4 --slim -C 20000 --flat-nodes /data-cache/flat-node-cache/flat.nodes /data-postgres/planet-latest.osm.pbf
I run the above command as user renderaccount that is a member of the following groups renderaccount ubuntu postgres. The flat-nodes file appears to be successfully created at /data-cache/flat-node-cache/flat.nodes and has this profile:
ubuntu#ip-172-31-25-230:/data-cache/flat-node-cache$ ls -l
total 37281800
-rw------- 1 renderaccount renderaccount 38176555024 Apr 13 05:45 flat.nodes
Has anyone run into and or resolved this? I suspect maybe a permissions issue? I notice now that since this last osm2pgsql failure, the mounted SSD that is the destination of the flat-nodes file has been converted to a "read-only" file system - which sounds like may happen when there are i/o errors on the mounted volume(?).
Also, does osm2pgsql write to a log that I could acquire additional info?
UPDATE: dmesg output:
[ 6206.884412] blk_update_request: I/O error, dev nvme0n1, sector 66250752
[ 6206.890813] EXT4-fs warning (device nvme0n1): ext4_end_bio:329: I/O error -5 writing to inode 14024706 (offset 10871640064 size 8388608 starting block 8281600)
[ 6206.890817] Buffer I/O error on device nvme0n1, logical block 8281344
After researching the above output, it appears it might be a bug in Ubuntu 16.04. https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1668129?comments=all
This was an error with Ubuntu 16.04 writing to the volume nvme0n1. Solved by this https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1668129/comments/29
I installed Kubernetes on my Ubuntu machine. For some debugging purposes I need to look at the kubelet log file (if there is any such file).
I have looked in /var/logs but I couldn't find a such file. Where could that be?
If you run kubelet using systemd, then you could use the following method to see kubelet's logs:
# journalctl -u kubelet
If you are trying to go directly to the file you can find the kubelet logs in /var/log/syslog directory. This is for ubuntu 16.04 and above.
It depends how it was installed. I installed Kubernetes on some Ubuntu machines following the Docker-MultiNode instructions.
With this install, I find the logs using the logs command like this.
Find your container ID.
$ docker ps | egrep kubelet
Use that container ID to view the logs
$ docker logs `<container-id>`
Finally I could find it in /var/log/upstart directory. Kubernetes in my machine is started using upstart. That's why those log files are in upstart directory
I installed Kubernetes by kind (Kubernetes in docker).
find docker container of kind to enter
$ docker container ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
62588e4d284b kindest/node:v1.17.0 "/usr/local/bin/entr…" 2 weeks ago Up 2 weeks 127.0.0.1:32769->6443/tcp kind2-control-plane
$ docker container exec -it kind2-control-plane bash
root#kind2-control-plane:/#
Inside container kind2-control-plane, you could find logfiles in two place:
/var/log/containers/
/var/log/pods/
And then,you will find they are the same, you can see the example below:
root#kind2-control-plane:/# cat /var/log/containers/redis-master-7db7f6579f-scw95_default_master-f6374281c2c6afcfcd0ee1214d9bd51c1684c0b6c0ba1056295246ecd055563c.log | tail -n 5
2020-04-08T12:09:29.824252114Z stdout F
2020-04-08T12:09:29.824372278Z stdout F [1] 08 Apr 12:09:29.822 # Server started, Redis version 2.8.19
2020-04-08T12:09:29.824440661Z stdout F [1] 08 Apr 12:09:29.823 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
2020-04-08T12:09:29.824459317Z stdout F [1] 08 Apr 12:09:29.823 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
2020-04-08T12:09:29.82446451Z stdout F [1] 08 Apr 12:09:29.824 * The server is now ready to accept connections on port 6379
root#kind2-control-plane:/# cat /var/log/pods/default_redis-master-7db7f6579f-scw95_094824e1-25aa-4e1e-ab23-d4bae861988a/master/0.log | tail -n 5
2020-04-08T12:09:29.824252114Z stdout F
2020-04-08T12:09:29.824372278Z stdout F [1] 08 Apr 12:09:29.822 # Server started, Redis version 2.8.19
2020-04-08T12:09:29.824440661Z stdout F [1] 08 Apr 12:09:29.823 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
2020-04-08T12:09:29.824459317Z stdout F [1] 08 Apr 12:09:29.823 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
2020-04-08T12:09:29.82446451Z stdout F [1] 08 Apr 12:09:29.824 * The server is now ready to accept connections on port 6379
root#kind2-control-plane:/# ls -l /var/log/containers/ | grep redis
lrwxrwxrwx 1 root root 101 Apr 8 12:09 redis-master-7db7f6579f-scw95_default_master-f6374281c2c6afcfcd0ee1214d9bd51c1684c0b6c0ba1056295246ecd055563c.log -> /var/log/pods/default_redis-master-7db7f6579f-scw95_094824e1-25aa-4e1e-ab23-d4bae861988a/master/0.log
If you want to know more in detail about the directories, you can see 2019-2-merge-request in Github.
I am new to Zookeeper and it has being a real issue to install it and run. I am not sure what is wrong in here but I will explain what I've being doing to make it more clear:
1.- I've followed the installation guide provided by Apache. This means download the Zookeeper distribution (stable release) extracted the file and moved into the home directory.
2.- As I am using Ubuntu 12.04 I've modified the .bashrc file including this:
export ZOOKEEPER_INSTALL=/home/myusername/zookeeper-3.4.5
export PATH=$PATH:$ZOOKEEPER_INSTALL/bin
3.- Create a config file on conf/zoo.cfg
tickTime=2000
dataDir=/var/zookeeper
clientPort=2181
and also tried with:
dataDir=/var/log/zookeeper
and
dataDir=/var/bin/zookeeper
4.- When running the start command
zkServer.sh start or `bin/zkServer.sh start` nothing happens and always returns this
JMX enabled by default
Using config: /home/sasuke/zookeeper-3.4.5/bin/../conf/zoo.cfg
mkdir: cannot create directory `/var/zookeeper': Permission denied
Starting zookeeper ... /home/sasuke/zookeeper-3.4.5/bin/zkServer.sh: line 113: /var/zookeeper/zookeeper_server.pid: No such file or directory
FAILED TO WRITE PID
I have Java installed and inside the zookeper directory there is a zookeeper.jar file that I think it's not running.
Checking here on stackoverflow there was a guy that said he could run zookeeper after typing
ssh localhost
But when I try to do it I get this error
ssh: connect to host localhost port 22: Connection refused
Please help. I've being here trying to solve it for too long.
Getting started guide of zookeeper:
http://zookeeper.apache.org/doc/r3.1.2/zookeeperStarted.html
Previous case solved with the shh localhost
Zookeeper: FAILED TO WRITE PID
UPDATE:
The permissions for log are:
drwxr-xr-x 19 root root 4096 Oct 10 07:52 log
and for zookeeper:
drwxr-xr-x 2 zookeeper zookeeper 4096 Mar 23 2012 zookeeper
Should I change any of these?
I have had the same problem. In my case was useful to start Zookeeper and directly specify a configuration file:
/bin/zkServer.sh start conf/zoo.conf
It seems you do not have the required permissions. The /var/log owner is is going to be root. Zookeeper stores the process id and snapshot of data in that directory. The process id of the spawned zookeeper server is stored in a file -zookeeper_server.pid (as of 3.3.6)
If you have root previleges, you could start zookeeper with sudo (root) previleges, it should work but definitely not recommended. Make sure you start zookeeper with the same(or higher) permissions as the owner of the directory.
Create a new directory in your home folder like /home/username/zookeeper-data.
Let dataDir point to that directory and it should work.
The default zookeeper installation (tar extract) comes with the conf file named conf/zoo_sample.cfg while the same extract's bin/zkServer.sh expects the conf file to be called zoo.cfg thereby resulting in a "No such file or dir" and the "failed to write pid" error. So before running zkServer.sh to start or stop zookeeper instance, either:
rename the zoo_sample.cfg in the conf dir to zoo.cfg, or
give the name (and path) to the conf file (as suggested by Ilya Lapitan), or, of course
edit zkServer.sh ;-)
When you create the Directory for dataDir make sure to use the -p option. This will allow subsequent directories to be created as required by the application placing files.
mkdir -p /var/log/zookeeperData
Then set:
dataDir=/var/log/zookeeperData
Seems there's all kinds of reasons this can happen. So many helpful answers here!
For me, I had improper line endings in my zoo.cfg file, and possibly invisible characters, so zookeeper was trying to create directories like /var/zookeeper? and /var/zookeeper\r. Reworking my zoo.cfg a bit fixed it for me, along with deleting zoo_sample.conf.
This happens to me due to low disk space. cause zookeeper cant create pid file inside zookeeper data folder.
I have faced the same issue while starting the zookeeper with this command:
hadoop#ubuntu:~/hadoop/zookeeper/zookeeper-3.4.8$ bin/zkServer.sh
start
ERROR [main] client.ConnectionManager$HConnectionImplementation:
The node /hbase is not in ZooKeeper.
It should have been written by the master. Check the value configured in zookeeper.znode.parent. There could be a mismatch with the one configured in the master.
But running the script as su rectified the issue:
hadoop#ubuntu:~/hadoop/zookeeper/zookeeper-3.4.8$ sudo bin/zkServer.sh
start
ZooKeeper JMX enabled by default Using config:
/home/hadoop/hadoop/zookeeper/zookeeper-3.4.8/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
Go to /usr/local/etc/
You will find zookeeper directory
delete the directory
and restart the server - zkServer start
Change the path give dataDir=/tmp/zookeeper. If it works then its clearly access issues
But its generally not advisable to use tmp directory.
This seems to be an ownership issue; running the following solved this for me.
$ sudo chown -R $USER /var/lib/zookeeper
N.B.
I've outlined my steps below which show the error I was getting (the same as the error in this SO question) and the attempt at trying the solution proposed by a user above, which advised to provide zoo.cfg as an argument.
13:01:29 ✔ ~ :: $ZK/bin/zkServer.sh start
ZooKeeper JMX enabled by default
Using config: /usr/local/Cellar/zookeeper/3.4.14/libexec/bin/../conf/zoo.cfg
Starting zookeeper ... /usr/local/Cellar/zookeeper/3.4.14/libexec/bin/zkServer.sh: line 149: /var/lib/zookeeper/zookeeper_server.pid: Permission denied
FAILED TO WRITE PID
13:01:32 ✘ ~ :: $ZK/bin/zkServer.sh start $ZK/conf/zoo.cfg
ZooKeeper JMX enabled by default
Using config: /usr/local/Cellar/zookeeper/3.4.14/libexec/conf/zoo.cfg
Starting zookeeper ... /usr/local/Cellar/zookeeper/3.4.14/libexec/bin/zkServer.sh: line 149: /var/lib/zookeeper/zookeeper_server.pid: Permission denied
FAILED TO WRITE PID
13:04:45 ✔ /var/lib :: ls -la
total 0
drwxr-xr-x 4 root wheel 128 Apr 19 18:55 .
drwxr-xr-x 27 root wheel 864 Apr 19 18:55 ..
drwxr--r-- 3 root wheel 96 Mar 24 15:07 zookeeper
13:04:48 ✔ /var/lib :: echo $USER
tallamjr
13:06:03 ✔ /var/lib :: sudo chown -R $USER zookeeper
Password:
13:06:44 ✔ /var/lib :: ls -la
total 0
drwxr-xr-x 4 root wheel 128 Apr 19 18:55 .
drwxr-xr-x 27 root wheel 864 Apr 19 18:55 ..
drwxr--r-- 3 tallamjr wheel 96 Mar 24 15:07 zookeeper
13:06:48 ✔ ~ :: $ZK/bin/zkServer.sh start
ZooKeeper JMX enabled by default
Using config: /usr/local/Cellar/zookeeper/3.4.14/libexec/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
REF:
- https://askubuntu.com/questions/6723/change-folder-permissions-and-ownership
For me this solution worked:
I granted the read, write and execute permissions for everyone using the command $sudo chmod 777 foldername for the directory zookeeper by going inside the directory /var (/var/zookeeper).
After executing this command try running the zookeeper. It ran in my case
try to use sudo -E bin/zkServer.sh start