Ceph got timeout - ceph

I upgraded my proxmox cluster from 6 to 7. After upgrading, Ceph services are not responding. Any command in the console, for example ceph -s hangs and does not return any result. And in the web interface I see Error got timeout (500)
enter image description here
In the log I saw the error cluster [WRN] overall HEALTH_WARN noout flag (s) set; 1/4 mons down, quorum server-1, server-2, server-3
But when I try to execute ceph osd unset noout I see this error
2021-12-03T18: 59: 17.068 + 0300 7fac8c30d700 0 monclient (hunting): authenticate timed out after 300

Related

Resizing Pool to 1 has been disabled by default on Ceph Pacific stable 6.0

I implemented Ceph Pacific Stable 6.0 with Ceph-Ansible on Ubuntu 20.04 LTS
but when I want to change size of my pool from 3 to 1 with following command:
sudo ceph -n client.admin --keyring=/etc/ceph/ceph.client.admin.keyring osd pool set cephfs_data size 1
I get following error:
Error EPERM: configuring pool size as 1 is disabled by default.
Add the following option under section [global] in ceph.conf:
mon_allow_pool_size_one = true
Restart the monitor service
For those working with Quincy (17.2.3) you can do:
$ ceph config set global mon_allow_pool_size_one true
$ ceph osd pool set data_pool min_size 1
$ ceph osd pool set data_pool size 1 --yes-i-really-mean-it

Cassandra pod fails after kubernetes node restart

I have successfully installed dse in my kubernetes environment using the Kubernetes Operator instructions:
With nodetool I checked that all pod successfully joined the ring
The problem is that when I reboot one of the kubernetes node the cassandra pod that was running on that node never recover:
[root#node1 ~]# kubectl exec -it -n cassandra cluster1-dc1-r2-sts-0 -c cassandra nodetool status
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving/Stopped
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.244.166.132 153.82 KiB 1 77.9% 053cc18e-397c-4abe-bb1b-d48a3fef3c93 r3
DS 10.244.104.1 136.09 KiB 1 26.9% 8ae31e1c-856e-44a8-b081-c5c040b535b9 r1
UN 10.244.135.2 202.8 KiB 1 95.2% 06200794-298c-4122-b8ff-4239bc7a8ded r2
[root#node1 ~]# kubectl get pods -n cassandra
NAME READY STATUS RESTARTS AGE
cass-operator-56f5f8c7c-w6l2c 1/1 Running 0 17h
cluster1-dc1-r1-sts-0 1/2 Running 2 17h
cluster1-dc1-r2-sts-0 2/2 Running 0 17h
cluster1-dc1-r3-sts-0 2/2 Running 0 17h
I have looked into the logs but I can't figure out what is the problem.
The "kubectl logs"" command return the logs below:
INFO [nioEventLoopGroup-2-1] 2020-03-25 12:13:13,536 Cli.java:555 - address=/192.168.0.11:38590 url=/api/v0/probes/liveness status=200 OK
INFO [epollEventLoopGroup-6506-1] 2020-03-25 12:13:14,110 Clock.java:35 - Could not access native clock (see debug logs for details), falling back to Java system clock
WARN [epollEventLoopGroup-6506-2] 2020-03-25 12:13:14,111 Slf4JLogger.java:146 - Unknown channel option 'TCP_NODELAY' for channel '[id: 0x8a898bf3]'
WARN [epollEventLoopGroup-6506-2] 2020-03-25 12:13:14,116 Loggers.java:28 - [s6501] Error connecting to /tmp/dse.sock, trying next node
java.io.FileNotFoundException: null
at io.netty.channel.unix.Errors.throwConnectException(Errors.java:110)
at io.netty.channel.unix.Socket.connect(Socket.java:257)
at io.netty.channel.epoll.AbstractEpollChannel.doConnect0(AbstractEpollChannel.java:732)
at io.netty.channel.epoll.AbstractEpollChannel.doConnect(AbstractEpollChannel.java:717)
at io.netty.channel.epoll.EpollDomainSocketChannel.doConnect(EpollDomainSocketChannel.java:87)
at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.connect(AbstractEpollChannel.java:559)
at io.netty.channel.DefaultChannelPipeline$HeadContext.connect(DefaultChannelPipeline.java:1366)
at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:545)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:530)
at io.netty.channel.ChannelOutboundHandlerAdapter.connect(ChannelOutboundHandlerAdapter.java:47)
at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:545)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:530)
at io.netty.channel.ChannelDuplexHandler.connect(ChannelDuplexHandler.java:50)
at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:545)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:530)
at io.netty.channel.ChannelDuplexHandler.connect(ChannelDuplexHandler.java:50)
at com.datastax.oss.driver.internal.core.channel.ConnectInitHandler.connect(ConnectInitHandler.java:57)
at io.netty.channel.AbstractChannelHandlerContext.invokeConnect(AbstractChannelHandlerContext.java:545)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:530)
at io.netty.channel.AbstractChannelHandlerContext.connect(AbstractChannelHandlerContext.java:512)
at io.netty.channel.DefaultChannelPipeline.connect(DefaultChannelPipeline.java:1024)
at io.netty.channel.AbstractChannel.connect(AbstractChannel.java:276)
at io.netty.bootstrap.Bootstrap$3.run(Bootstrap.java:252)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:375)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
INFO [nioEventLoopGroup-2-2] 2020-03-25 12:13:14,118 Cli.java:555 - address=/192.168.0.11:38592 url=/api/v0/probes/readiness status=500 Internal Server Error
The error java.io.FileNotFoundException: null appears also when cassandra starts successfully.
So what remains is the error:
address=/192.168.0.11:38592 url=/api/v0/probes/readiness status=500 Internal Server Error
Which doesn't say much to me.
The "kubectl describe" shows the following
Warning Unhealthy 4m41s (x6535 over 18h) kubelet, node2 Readiness probe failed: HTTP probe failed with statuscode: 500
In the cassandra container only this process is running:
java -Xms128m -Xmx128m -jar /opt/dse/resources/management-api/management-api-6.8.0.20200316-LABS-all.jar --dse-socket /tmp/dse.sock --host tcp://0.0.0.0```
And in the /var/log/cassandra/system.log I can't point out any error
Andrea, the error "java.io.FileNotFoundException: null" is a harmless message about a transient error during the Cassandra pod starting up and healthcheck.
I was able to reproduce the issue you ran into. If you run kubectl get pods you should see the affected pod showing 1/2 under "READY" column, this means the Cassandra container was not brought up in the auto-restarted pod. Only the management API container is running. I suspect this is a bug in the operator and I'll work with the developers to sort it out.
As a workaround you can run kubectl delete pod/<pod_name> to recover your Cassandra cluster back to a normal state (in your case kubectl delete pod/cluster1-dc1-r1-sts-0). This will redeploy the pod and remount the data volume automatically, without losing anything.
I got this error when CoreDNS pods were not running on the node, on which I had started Cassandra. The DNS resolutions were not working properly. So, debugging network connectivity may help.

kubelet saying node "master01" not found

I try to stack up my kubeadm cluster with three masters. I receive this problem from my init command...
[kubelet-check] Initial timeout of 40s passed.
Unfortunately, an error has occurred:
timed out waiting for the condition
This error is likely caused by:
- The kubelet is not running
- The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)
If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
- 'systemctl status kubelet'
- 'journalctl -xeu kubelet'
Additionally, a control plane component may have crashed or exited when started by the container runtime.
To troubleshoot, list all containers using your preferred container runtimes CLI, e.g. docker.
Here is one example how you may list all Kubernetes containers running in docker:
- 'docker ps -a | grep kube | grep -v pause'
Once you have found the failing container, you can inspect its logs with:
- 'docker logs CONTAINERID'
error execution phase wait-control-plane: couldn't initialize a Kubernetes cluster
But I do not use no cgroupfs but systemd
And my kubelet complain for not knowing his nodename.
Jan 23 14:54:12 master01 kubelet[5620]: E0123 14:54:12.251885 5620 kubelet.go:2266] node "master01" not found
Jan 23 14:54:12 master01 kubelet[5620]: E0123 14:54:12.352932 5620 kubelet.go:2266] node "master01" not found
Jan 23 14:54:12 master01 kubelet[5620]: E0123 14:54:12.453895 5620 kubelet.go:2266] node "master01" not found
Please let me know where is the issue.
The issue can be because of docker version, as docker version < 18.6 is supported in latest kubernetes version i.e. v1.13.xx.
Actually I also got the same issue but it get resolved after downgrading the docker version from 18.9 to 18.6.
If the problem is not related to Docker it might be because the Kubelet service failed to establish connection to API server.
I would first of all check the status of Kubelet: systemctl status kubelet and consider restarting with systemctl restart kubelet.
If this doesn't help try re-installing kubeadm or running kubeadm init with other version (use the --kubernetes-version=X.Y.Z flag).
In my case,my k8s version is 1.21.1 and my docker version is 19.03. I solved this bug by upgrading docker to version 20.7.

Failed to start puppetserver Service

While trying to run a puppet update form a node:
sudo /opt/puppetlabs/bin/puppet agent -t
I get an error:
Error: Could not retrieve catalog; skipping run
Error: Could not send report: Connection refused - connect(2) for "puppet" port 8140`
Elsewhere indicates this is likely a problem with the puppetserver service, and suggests to reboot the server. Restarting didn't help, and when I try to restart the service I get failure:
~$ sudo service puppetserver restart
Job for puppetserver.service failed because the control process exited with error code. See "systemctl status puppetserver.service" and "journalctl -xe" for details.
I've looked at these logs, and as a puppet/linux noob, I'm not sure what to do next.
systemctl status puppetserver.service
● puppetserver.service - puppetserver Service
Loaded: loaded (/lib/systemd/system/puppetserver.service; enabled; vendor preset: enabled)
Active: activating (start-post) since Fri 2016-09-02 15:54:26 PDT; 2s ago
Process: 22301 ExecStartPre=/usr/bin/install --directory --owner=puppet --group=puppet --mode=775 /var/run/puppetlabs/puppetserver (code=exited
Main PID: 22306 (java); : 22307 (bash)
Tasks: 17
Memory: 335.7M
CPU: 5.535s
CGroup: /system.slice/puppetserver.service
├─22306 /usr/bin/java -Xms6g -Xmx6g -XX:MaxPermSize=256m -XX:OnOutOfMemoryError=kill -9 %p -Djava.security.egd=/dev/urandom -cp /opt/p
└─control
├─22307 /bin/bash /opt/puppetlabs/server/apps/puppetserver/ezbake-functions.sh wait_for_app
└─22331 sleep 1
Sep 02 15:54:26 puppet systemd[1]: Starting puppetserver Service...
Sep 02 15:54:26 puppet java[22306]: OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=256m; support was removed in 8.0
puppet version 4.6.1
The puppet master communicates with the other node using port number 8140.
I don't think a restart will help, since this looks like a connection issue between the server and the node.
please try the following -
first make sure that the puppet master is actually listening on port 8140. run the following command on the puppetmaster -
netstat -ntlp | grep 8140
this command should return something like this -
tcp 0 0 0.0.0.0:8140 0.0.0.0:* LISTEN 1783/puppetmaster
If you don't get the same output, your puppetmaster is not listening, and therefore can not compile catalogs for the node.
Try checking the puppet master log at /var/log/puppetmaster.log
check that the node can communicate with the puppetmaster on the relevant port. you can check this quickly with the telnet command. run this on your node -
telnet < puppetmaster ip address \ dns name> 8140
you should get something like -
Connected to <puppet-master-IP/DNS-name>
Escape character is '^]'.
if you don't get this output, this means that something is blocking you from accessing the puppetmaster. try opening the port in your firewall to access the puppetmaster.
if you're still stuck try using the --debug flag for verbose output and edit your question.
Could be 2 things: (1) in puppet.conf you have configured more memory than you have on your machine. Or (2) You installed both apt-get install puppetserver and apt-get install puppet.
If you get failed to start puppet.service: unit not found. error on slave machine while connecting to puppet.
Close the putty and then again open and connect it.The issue wont come while starting putty on slave.
The error occurs because there is not enough RAM and to fix the error, open the Puppet server configuration file:
sudo nano /etc/sysconfig/puppetserver
And reduce the amount of allocated RAM for the Puppet server (for example, I specified 512m instead of 2g):
JAVA_ARGS="-Xms512m -Xmx512m"
Now let’s start the Puppet server:
sudo systemctl start puppetserver

mesos masters keep restarting

I have 3 mesos masters with version 0.26.0 setup with a quorum of 2. When I start them, they keep restarting even before I turn up any frameworks or slaves.
Here's the errors I'm seeing:
F0322 19:36:56.009903 51459 master.cpp:1368] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
E0322 19:37:18.300568 41095 process.cpp:1911] Failed to shutdown socket with fd 26: Transport endpoint is not connected
There's no firewall running.
I start them with supervisord and the following command:
/usr/sbin/mesos-master --cluster=int --log_dir=/var/log/mesos/int --quorum=2 --port=5050 --work_dir=/tmp/mesos/work/int --zk=zk://intMesosMaster01:2181,intMesosMaster02:2181,intMesosMaster03:2181/mesos
Zookeeper is up and running fine with 3 nodes. It's in use for other projects and has no issues at all with them.