Rook/Ceph data not syncing between two clusters

I have two Rook Ceph clusters and created a replicated pool with RBD mirroring, but the data is not being synced. Here is the status output:
[root@ceph-node5 tmp]# rbd mirror pool status replicapool --verbose
health: WARNING
images: 2 total
2 unknown
rbd: newer release of Ceph OSDs required to map image to rbd-mirror daemon instance
2019-06-05 15:15:07.132 7fddc28b7b00 -1 librbd::api::Mirror: image_instance_id_list: failed to list mirror image instances: (95) Operation not supported
pvc-b8976843-8780-11e9-910e-000c2975ae04:
global_id: 12b37e22-c5b4-49ac-a41b-5514bca6e192
state: down+unknown
description: status not found
last_update:
test_image:
global_id: 790eaf92-5bdc-4abf-b15d-d8c995bfe1b3
state: down+unknown
description: status not found
last_update:

gitlab job pod exits unexpectedly

I currently have gitlab runner deployed in my kubernetes cluster with 2 replicas.
When I run a job in gitlab, the runners are successful in spawning pods that run the pipeline. But in some cases, after the pipeline runs the job, I suddenly get the error
Running after_script
00:00
Uploading artifacts for failed job
00:00
Cleaning up project directory and file based variables
00:00
ERROR: Job failed (system failure): pods "runner-hzfiusrx-project-37057717-concurrent-21gs8vm" not found
When I have a look at the runner logs, all I see is
gitlab-runners-exchange-587cdbf898-pkgt2 | grep "runner-hzfiusrx-project-37057717-concurrent-21gs8vm"
WARNING: Error streaming logs exchange/runner-hzfiusrx-project-37057717-concurrent-21gs8vm/helper:/logs-37057717-2986450184/output.log: command terminated with exit code 137. Retrying... job=2986450184 project=37057717 runner=hzFiusRx
WARNING: Error streaming logs exchange/runner-hzfiusrx-project-37057717-concurrent-21gs8vm/helper:/logs-37057717-2986450184/output.log: pods "runner-hzfiusrx-project-37057717-concurrent-21gs8vm" not found. Retrying... job=2986450184 project=37057717 runner=hzFiusRx
WARNING: Error while executing file based variables removal script error=couldn't get pod details: pods "runner-hzfiusrx-project-37057717-concurrent-21gs8vm" not found job=2986450184 project=37057717 runner=hzFiusRx
ERROR: Job failed (system failure): pods "runner-hzfiusrx-project-37057717-concurrent-21gs8vm" not found duration_s=2067.525269137 job=2986450184 project=37057717 runner=hzFiusRx
WARNING: Failed to process runner builds=32 error=pods "runner-hzfiusrx-project-37057717-concurrent-21gs8vm" not found executor=kubernetes runner=hzFiusRx
I'm trying to understand the issue here.
My Kubernetes runner config is:
[runners.kubernetes]
host = ""
bearer_token_overwrite_allowed = true
image = "ubuntu:20.04"
namespace = "exchange"
namespace_overwrite_allowed = ""
privileged = true
cpu_request = "5"
memory_request = "25Gi"
The nodes on which the job pods get scheduled have the following capacity
Capacity:
attachable-volumes-aws-ebs: 25
cpu: 8
ephemeral-storage: 20959212Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 32523380Ki
pods: 58
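As a rough check of the sizing, the requests from the runner config can be compared with the node capacity listed above. This is only a sketch: the helper name is hypothetical, and it uses raw Capacity, whereas the scheduler works with Allocatable, which is somewhat lower.

```python
# Values are copied from the question; the helper name is hypothetical.
KI_PER_GI = 1024 * 1024  # one Gi expressed in Ki


def pods_that_fit(capacity_cpu, capacity_mem_ki, request_cpu, request_mem_gi):
    """Return how many pods with these requests fit on one node by raw capacity."""
    by_cpu = capacity_cpu // request_cpu
    by_mem = capacity_mem_ki // (request_mem_gi * KI_PER_GI)
    return min(by_cpu, by_mem)


node_mem_gi = 32523380 / KI_PER_GI
fit = pods_that_fit(8, 32523380, 5, 25)
print(f"node memory: {node_mem_gi:.2f} GiB, job pods per node: {fit}")
```

With these numbers only one job pod fits on a node. Requests are also not limits: the config shown sets no memory limit, so a job can grow past 25Gi and put the node under memory pressure.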
So what exactly might be the issue here? The CPU and memory sizing of the nodes seems correct.
Looking at the utilization, everything seems fine too.
Is it that Kubernetes/GitLab is not gracefully killing the job pod? Or does the job need more memory?
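One clue in the runner log is "command terminated with exit code 137". By the usual shell convention, an exit code above 128 means the process was killed by signal number (code - 128). The sketch below decodes that; the function name is hypothetical and it assumes a POSIX system.

```python
import signal


def describe_exit_code(code):
    """Translate a container exit code into a short description."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with status {code}"


print(describe_exit_code(137))
```

137 decodes to SIGKILL, so the helper container was killed rather than exiting on its own. In Kubernetes this commonly happens when a container is OOM-killed or the pod is forcibly removed (eviction, node drain or scale-down). The logs shown do not say which, so this is a starting point, not a diagnosis.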

Cassandra pod fails after kubernetes node restart

I have successfully installed DSE in my Kubernetes environment using the Kubernetes Operator instructions.
With nodetool I checked that all pods successfully joined the ring.
The problem is that when I reboot one of the Kubernetes nodes, the Cassandra pod that was running on that node never recovers:
[root@node1 ~]# kubectl exec -it -n cassandra cluster1-dc1-r2-sts-0 -c cassandra nodetool status
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving/Stopped
--  Address         Load        Tokens  Owns (effective)  Host ID                               Rack
UN  10.244.166.132  153.82 KiB  1       77.9%             053cc18e-397c-4abe-bb1b-d48a3fef3c93  r3
DS  10.244.104.1    136.09 KiB  1       26.9%             8ae31e1c-856e-44a8-b081-c5c040b535b9  r1
UN  10.244.135.2    202.8 KiB   1       95.2%             06200794-298c-4122-b8ff-4239bc7a8ded  r2
[root@node1 ~]# kubectl get pods -n cassandra
NAME                            READY   STATUS    RESTARTS   AGE
cass-operator-56f5f8c7c-w6l2c   1/1     Running   0          17h
cluster1-dc1-r1-sts-0           1/2     Running   2          17h
cluster1-dc1-r2-sts-0           2/2     Running   0          17h
cluster1-dc1-r3-sts-0           2/2     Running   0          17h
I have looked into the logs but I can't figure out what the problem is.
The "kubectl logs" command returns the logs below:
INFO [nioEventLoopGroup-2-1] 2020-03-25 12:13:13,536 Cli.java:555 - address=/192.168.0.11:38590 url=/api/v0/probes/liveness status=200 OK
INFO [epollEventLoopGroup-6506-1] 2020-03-25 12:13:14,110 Clock.java:35 - Could not access native clock (see debug logs for details), falling back to Java system clock
WARN [epollEventLoopGroup-6506-2] 2020-03-25 12:13:14,111 Slf4JLogger.java:146 - Unknown channel option 'TCP_NODELAY' for channel '[id: 0x8a898bf3]'
WARN [epollEventLoopGroup-6506-2] 2020-03-25 12:13:14,116 Loggers.java:28 - [s6501] Error connecting to /tmp/dse.sock, trying next node
java.io.FileNotFoundException: null
at io.netty.channel.unix.Errors.throwConnectException(Errors.java:110)
at io.netty.channel.unix.Socket.connect(Socket.java:257)
at io.netty.channel.epoll.AbstractEpollChannel.doConnect0(AbstractEpollChannel.java:732)
at io.netty.channel.epoll.AbstractEpollChannel.doConnect(AbstractEpollChannel.java:717)
at io.netty.channel.epoll.EpollDomainSocketChannel.doConnect(EpollDomainSocketChannel.java:87)
at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.connect(AbstractEpollChannel.java:559)
at io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1366)
at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:545)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:530)
at io.netty.channel.ChannelOutboundHandlerAdapter.connect(ChannelOutboundHandlerAdapter.java:47)
at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:545)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:530)
at io.netty.channel.ChannelDuplexHandler.connect(ChannelDuplexHandler.java:50)
at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:545)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:530)
at io.netty.channel.ChannelDuplexHandler.connect(ChannelDuplexHandler.java:50)
at com.datastax.oss.driver.internal.core.channel.ConnectInitHandler.connect(ConnectInitHandler.java:57)
at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:545)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:530)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:512)
at io.netty.channel.DefaultChannelPipeline.connect(DefaultChannelPipeline.java:1024)
at io.netty.channel.AbstractChannel.connect(AbstractChannel.java:276)
at io.netty.bootstrap.Bootstrap$3.run(Bootstrap.java:252)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:375)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
INFO [nioEventLoopGroup-2-2] 2020-03-25 12:13:14,118 Cli.java:555 - address=/192.168.0.11:38592 url=/api/v0/probes/readiness status=500 Internal Server Error
The error java.io.FileNotFoundException: null appears also when cassandra starts successfully.
So what remains is the error:
address=/192.168.0.11:38592 url=/api/v0/probes/readiness status=500 Internal Server Error
Which doesn't say much to me.
The "kubectl describe" shows the following
Warning Unhealthy 4m41s (x6535 over 18h) kubelet, node2 Readiness probe failed: HTTP probe failed with statuscode: 500
In the cassandra container only this process is running:
java -Xms128m -Xmx128m -jar /opt/dse/resources/management-api/management-api-6.8.0.20200316-LABS-all.jar --dse-socket /tmp/dse.sock --host tcp://0.0.0.0
I can't find any errors in /var/log/cassandra/system.log either.
Andrea, the error "java.io.FileNotFoundException: null" is harmless: it reports a transient failure that occurs while the Cassandra pod is starting up and being health-checked.
I was able to reproduce the issue you ran into. If you run kubectl get pods, you should see the affected pod showing 1/2 under the "READY" column, which means the Cassandra container was not brought up in the auto-restarted pod; only the management API container is running. I suspect this is a bug in the operator, and I'll work with the developers to sort it out.
As a workaround you can run kubectl delete pod/<pod_name> to recover your Cassandra cluster back to a normal state (in your case kubectl delete pod/cluster1-dc1-r1-sts-0). This will redeploy the pod and remount the data volume automatically, without losing anything.
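To spot affected pods quickly, the READY column of kubectl get pods can be scanned for entries where fewer containers are ready than expected. The following is a minimal sketch that parses the text output shown in the question; the function name is hypothetical.

```python
def not_fully_ready(get_pods_output):
    """Return (pod name, ready, total) for pods whose READY value n/m has n < m."""
    result = []
    for line in get_pods_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 2 or "/" not in fields[1]:
            continue
        ready, total = (int(x) for x in fields[1].split("/"))
        if ready < total:
            result.append((fields[0], ready, total))
    return result


# Output copied from the question.
sample = """\
NAME READY STATUS RESTARTS AGE
cass-operator-56f5f8c7c-w6l2c 1/1 Running 0 17h
cluster1-dc1-r1-sts-0 1/2 Running 2 17h
cluster1-dc1-r2-sts-0 2/2 Running 0 17h
cluster1-dc1-r3-sts-0 2/2 Running 0 17h
"""
print(not_fully_ready(sample))
```

For the output in the question this reports only cluster1-dc1-r1-sts-0, the pod the workaround deletes.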
I got this error when the CoreDNS pods were not running on the node where I had started Cassandra, so DNS resolution was not working properly. Debugging network connectivity may help.

How can I run multiple network interfaces on a k8s node?

I'm running OpenShift 4.1 on K8s v1.13.4. I'm trying to add a second network (for NFS storage) to my compute nodes, and as soon as I do, the node stops reporting NodeReady.
See the kubelet logs below. I'm completely lost. How can I add another interface to my nodes?
v1.13.4
FieldPath:""}): type: 'Normal' reason: 'NodeReady' Node compute-0 status is now: NodeReady
Jun 26 05:41:22 compute-0 hyperkube[923]: E0626 05:41:22.367174 923 kubelet_node_status.go:380] Error updating node status, will retry: failed to patch status "{\"status\":{\"$setElementOrder/addresses\":[{\"type\":\"ExternalIP\"},{\"type\":\"InternalIP\"},{\"type\":\"ExternalIP\"},{\"type\":\"InternalIP\"},{\"type\":\"Hostname\"}],\"$setElementOrder/conditions\":[{\"type\":\"MemoryPressure\"},{\"type\":\"Dis
...
Jun 26 05:41:22 compute-0 hyperkube[923]: [map[address:10.90.49.111 type:ExternalIP] map[type:ExternalIP address:10.90.51.94] map[address:10.90.49.111 type:InternalIP] map[address:10.90.51.94 type:InternalIP]]
Jun 26 05:41:22 compute-0 hyperkube[923]: doesn't match $setElementOrder list:
The resolution was to delete the node from the cluster with kubectl delete node compute-0, reboot it, and let Ignition rejoin it to the cluster.
This is a known bug.

Ceph luminous rbd map hangs forever

I am running a single-node Ceph cluster and using the Ceph client from another node. QEMU works fine with RBD. When I try to map an RBD block device on the client, the command hangs indefinitely with no output. How can I diagnose what's wrong?
The system is Ubuntu 16.04 Server with Ceph Luminous.
sudo ceph tell osd.* version
{
"version": "ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)"
}
ceph -s
cluster:
id: 4bfcc109-e432-4ac0-ba9d-bf81243aea
health: HEALTH_OK
services:
mon: 1 daemons, quorum gcmaster
mgr: gcmaster(active)
osd: 1 osds: 1 up, 1 in
data:
pools: 1 pools, 128 pgs
objects: 1512 objects, 5879 MB
usage: 7356 MB used, 216 GB / 223 GB avail
pgs: 128 active+clean
rbd info gcbase
rbd image 'gcbase':
size 512 MB in 128 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.376974b0dc51
format: 2
features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
flags:
create_timestamp: Fri Dec 29 17:58:02 2017
This hangs forever
rbd map gcbase --pool rbd
As does this
rbd map typo_gcbase --pool rbd
dmesg shows
Dec 29 13:27:32 cephclient1 kernel: [85798.195468] libceph: mon0 192.168.1.55:6789 feature set mismatch, my 106b84a842a42 < server's 40106b84a842a42, missing 400000000000000
Dec 29 13:27:32 cephclient1 kernel: [85798.222070] libceph: mon0 192.168.1.55:6789 missing required protocol features
The dmesg output tells you what's going on: the cluster requires a feature bit that is not supported by the libceph kernel module.
The feature bit in question is either CEPH_FEATURE_CRUSH_TUNABLES5, CEPH_FEATURE_NEW_OSDOPREPLY_ENCODING or CEPH_FEATURE_FS_FILE_LAYOUT_V2 (they are overlapping because they were introduced at the same time) which only became available on kernel 4.5, whereas Ubuntu 16.04 uses a 4.4 kernel.
A similar question (although related to CephFS) came up on the mailing list with a possible solution:
Yes, you should be able to set your CRUSH tunables profile to hammer
with "ceph osd crush tunables hammer".
This will disable some features, but should make the older kernel compatible with the cluster.
Alternatively you could upgrade to a mainline kernel or to a newer OS release.
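The "missing" value in the dmesg line is the set of feature bits the server advertises and the client lacks. The following sketch recomputes it from the two hex masks in the log; the function name is hypothetical.

```python
def missing_feature_bits(client_hex, server_hex):
    """Return the mask and bit positions set on the server but not on the client."""
    client = int(client_hex, 16)
    server = int(server_hex, 16)
    missing = server & ~client
    bits = [bit for bit in range(64) if (missing >> bit) & 1]
    return missing, bits


# Masks copied from the dmesg line in the question.
missing, bits = missing_feature_bits("106b84a842a42", "40106b84a842a42")
print(hex(missing), bits)  # 0x400000000000000 [58]
```

The result matches the kernel's own "missing 400000000000000", so bit 58 is the only feature the client lacks. After changing the tunables profile, the same check on a fresh dmesg line should return an empty list.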

Ran out of Docker disk space

I have this Docker command:
docker run -d mongo
This pulls the mongo image if it is not already present and runs a MongoDB server in a Docker container.
However, I get an error:
no space left on device
I am on macOS, using a newer version of Docker that runs on HyperKit (not Hyper-V, which is Windows-only) instead of VirtualBox.
Here is the exact error message from the mongo container:
$ docker logs efee16702c5756659d563b98d4ae0f58ecf1f1bba8a54f63443c0ae4b520ab4e
about to fork child process, waiting until server is ready for connections.
forked process: 21
2017-05-04T20:23:51.412+0000 I CONTROL [main] ***** SERVER RESTARTED *****
2017-05-04T20:23:51.430+0000 I CONTROL [main] ERROR: Cannot write pid file to /tmp/tmp.Lo035QkbfL: No space left on device
ERROR: child process failed, exited with error number 1
Any idea how to fix this and prevent it from happening in future?
As suggested, the output of df -h is:
Filesystem      Size   Used   Avail  Capacity  iused    ifree       %iused  Mounted on
/dev/disk1      465Gi  116Gi  349Gi  25%       1963838  4293003441  0%      /
devfs           183Ki  183Ki  0Bi    100%      634      0           100%    /dev
map -hosts      0Bi    0Bi    0Bi    100%      0        0           100%    /net
map auto_home   0Bi    0Bi    0Bi    100%      0        0           100%    /home
Output of docker info is:
$ docker info
Containers: 5
Running: 0
Paused: 0
Stopped: 5
Images: 741
Server Version: 17.03.1-ce
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 4ab9917febca54791c5f071a9d1f404867857fcc
runc version: 54296cf40ad8143b62dbcaa1d90e520a2136ddfe
init version: N/A (expected: 949e6facb77383876aeff8a6944dde66b3089574)
Security Options:
seccomp
Profile: default
Kernel Version: 4.9.13-moby
Operating System: Alpine Linux v3.5
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 1.952 GiB
Name: moby
ID: OR4L:WYWW:FFAP:IDX3:B6UK:O2AN:UVTO:EPH6:GYSV:4GV4:L5WP:BQTH
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
File Descriptors: 17
Goroutines: 30
System Time: 2017-05-04T20:45:27.056157913Z
EventsListeners: 1
No Proxy: *.local, 169.254/16
Registry: https://index.docker.io/v1/
Experimental: true
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
As you state in the comments to the question, ls -altrh ~/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/Docker.qcow2 returns the following:
-rw-r--r--@ 1 alexamil staff 53G
This is a known bug on macOS (and not only there), and an official developer comment can be found here. One caveat: I have read that different people hit different size limits. In the comment it is 64 GB, but for another person it was 20 GB.
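To see how close the disk image is to such a limit, its file size can be compared against the assumed cap. This is a minimal sketch: the function name is hypothetical, and the 64 GiB default is only the figure from the comment mentioned above, which may differ on your installation.

```python
import os

GIB = 1024 ** 3


def image_usage(path, limit_gib):
    """Return (size in GiB, fraction of the assumed limit) for a disk image file."""
    size_gib = os.path.getsize(path) / GIB
    return size_gib, size_gib / limit_gib


if __name__ == "__main__":
    # Path taken from the ls command above; the limit is an assumption.
    path = os.path.expanduser(
        "~/Library/Containers/com.docker.docker/Data/"
        "com.docker.driver.amd64-linux/Docker.qcow2"
    )
    if os.path.exists(path):
        size, fraction = image_usage(path, limit_gib=64)
        print(f"{size:.1f} GiB used, {fraction:.0%} of the assumed limit")
    else:
        print("Docker.qcow2 not found at the expected path")
```

os.path.getsize reports the same apparent file size that ls shows, so the 53G image above would come out at roughly 83% of a 64 GiB limit.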
There are a couple of workarounds, but no definitive solution that I could find.
The manual one
Run docker ps -a and manually remove all unused containers. Then run docker images and remove manually all the intermediate and unused images.
The simplest one
Delete the Docker.qcow2 file entirely. But you will lose all images and containers. Completely.
The less simple
Another way is to run docker volume prune, which removes all unused volumes.
The resizing one (keeps the data)
Another idea that comes to me is to expand the disk image size with QEMU or something like it:
$ brew install qemu
$ /Applications/Docker.app/Contents/MacOS/qemu-img resize ~/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/Docker.qcow2 +5G
After you expanded the image, you will need to run a VM in which you should run GParted against Docker.qcow2 and expand the partition to use added space. You could use GParted Live ISO for that:
$ qemu-system-x86_64 -drive file=Docker.qcow2 -m 512 -cdrom ~/Downloads/gparted-live.iso -boot d -device usb-mouse -usb
Some people report this either doesn't work or doesn't help.
Yet another resizing one (wipes the data)
Create a substitute image with desired size (120G):
$ qemu-img create -f qcow2 ~/data.qcow2 120G
$ cp ~/data.qcow2 /Applications/Docker.app/Contents/Resources/moby/data.qcow2
$ rm ~/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/Docker.qcow2
data.qcow2 is copied to ~/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/Docker.qcow2 when you restart Docker.
This workaround comes from this comment.
Hope this helps. Good luck!