Ceph: accidentally deleted client.admin

While learning Ceph, I accidentally deleted client.admin by running
ceph auth del client.admin
Now I get
client.admin authentication error (1) Operation not permitted
Error connecting to cluster: PermissionError
all the time.
Is there a way to recover or recreate a new client.admin?
I've tried
ceph auth import -i /etc/ceph/ceph.client.admin.keyring
and
ceph add client.admin
Neither worked for me.

Try connecting with the monitor keyring on ceph-node-1:
ceph -n mon. --keyring /var/lib/ceph/mon/ceph-node-1/keyring get-or-create client.admin mon 'allow *' mds 'allow *' mgr 'allow *' osd 'allow *'
then update /etc/ceph/ceph.client.admin.keyring with the regenerated key.
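For example, a hedged way to write the regenerated key into the admin keyring and verify it (same paths as above):
ceph -n mon. --keyring /var/lib/ceph/mon/ceph-node-1/keyring auth get client.admin -o /etc/ceph/ceph.client.admin.keyring
ceph -s    # should authenticate as client.admin again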

Ceph (cepadm) quincy: can't add osd from remote nodes (command hanging)

I'm stuck on a problem while trying to create a cluster of 3 nodes (AWS EC2 instances):
fa11.something.com ~ # ceph orch host ls
HOST ADDR LABELS STATUS
fa11.something.com 172.16.24.67 _admin
fa12.something.com 172.16.23.159 _admin
fa13.something.com 172.16.25.119 _admin
3 hosts in cluster
Each of them has 2 disks (all accepted by Ceph):
fa11.something.com ~ # ceph orch device ls
HOST PATH TYPE DEVICE ID SIZE AVAILABLE REFRESHED REJECT REASONS
fa11.something.com /dev/nvme1n1 ssd Amazon_Elastic_Block_Store_vol016651cf7f3b9c9dd 8589M Yes 7m ago
fa11.something.com /dev/nvme2n1 ssd Amazon_Elastic_Block_Store_vol034082d7d364dfbdb 5368M Yes 7m ago
fa12.something.com /dev/nvme1n1 ssd Amazon_Elastic_Block_Store_vol0ec193fa3f77fee66 8589M Yes 3m ago
fa12.something.com /dev/nvme2n1 ssd Amazon_Elastic_Block_Store_vol018736f7eeab725f5 5368M Yes 3m ago
fa13.something.com /dev/nvme1n1 ssd Amazon_Elastic_Block_Store_vol0443a031550be1024 8589M Yes 84s ago
fa13.something.com /dev/nvme2n1 ssd Amazon_Elastic_Block_Store_vol0870412d37717dc2c 5368M Yes 84s ago
fa11.something.com is the first host, from which I manage the cluster.
Adding an OSD on fa11.something.com itself works fine:
fa11.something.com ~ # ceph orch daemon add osd fa11.something.com:/dev/nvme1n1
Created osd(s) 0 on host 'fa11.something.com'
But it doesn't work for the other 2 hosts (it hangs forever):
fa11.something.com ~ # ceph orch daemon add osd fa12.something.com:/dev/nvme1n1
^CInterrupted
Logs on fa12.something.com show that it hangs at the following step:
fa12.something.com ~ # tail /var/log/ceph/a9ef6c26-ac38-11ed-9429-06e6bc29c1db/ceph-volume.log
...
[2023-02-14 07:38:20,942][ceph_volume.process][INFO ] Running command: /usr/bin/ceph-authtool --gen-print-key
[2023-02-14 07:38:20,964][ceph_volume.process][INFO ] Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new a51506c2-e910-4763-9a0c-f6c2194944e2
I'm not sure what the reason for this hang might be.
Additional details:
cephadm was installed using curl (https://docs.ceph.com/en/quincy/cephadm/install/#curl-based-installation).
I use the user ceph instead of root, and port 2222 instead of 22. The first node was bootstrapped using the command below:
cephadm bootstrap --mon-ip 172.16.24.67 --allow-fqdn-hostname --ssh-user ceph --ssh-config /home/anton/ceph/ssh_config --cluster-network 172.16.16.0/20 --skip-monitoring-stack
Content of /home/anton/ceph/ssh_config:
fa11.something.com ~ # cat /home/anton/ceph/ssh_config
Host *
User ceph
Port 2222
IdentityFile /home/ceph/.ssh/id_rsa
StrictHostKeyChecking no
UserKnownHostsFile=/dev/null
Hosts fa12.something.com and fa13.something.com were added using the following commands:
ceph orch host add fa12.something.com 172.16.23.159 --labels _admin
ceph orch host add fa13.something.com 172.16.25.119 --labels _admin
I'm not sure whether I need to check that some specific ports are not blocked; I expected to get an error at an early stage if Ceph couldn't access a port...
Thanks in advance for any suggestion!
It turned out the problem was caused by blocked port(s). I dropped all restrictions at the iptables level (iptables -F; iptables -P INPUT ACCEPT) and it started to work. I'll check later exactly which ports are needed; I guess they are described here: https://docs.ceph.com/en/latest/rados/configuration/network-config-ref/
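For reference, a hedged sketch of the iptables rules that are usually sufficient for the core daemons, based on the network configuration reference linked above (restrict sources to your cluster networks; cephadm additionally needs SSH, port 2222 in this setup):
iptables -A INPUT -p tcp --dport 3300 -j ACCEPT        # monitor, msgr2
iptables -A INPUT -p tcp --dport 6789 -j ACCEPT        # monitor, legacy msgr1
iptables -A INPUT -p tcp --dport 6800:7300 -j ACCEPT   # OSD/MGR/MDS daemons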

How can I upgrade PostgreSQL from version 11 to version 13?

I'm trying to upgrade PostgreSQL from 11 to 13 on a Debian system, but it fails. I have a single cluster that needs to be upgraded:
$ sudo -u postgres pg_lsclusters
Ver Cluster Port Status Owner Data directory Log file
11 main 5432 online postgres /var/lib/postgresql/11/main /var/log/postgresql/postgresql-11-main.log
Here's what I've tried to upgrade it:
$ sudo -u postgres pg_upgradecluster 11 main
Stopping old cluster...
Warning: stopping the cluster using pg_ctlcluster will mark the systemd unit as failed. Consider using systemctl:
sudo systemctl stop postgresql@11-main
Restarting old cluster with restricted connections...
Notice: extra pg_ctl/postgres options given, bypassing systemctl for start operation
Error: cluster configuration already exists
Error: Could not create target cluster
After this, the system is left in an unusable state:
$ sudo systemctl status postgresql@11-main.service
● postgresql@11-main.service - PostgreSQL Cluster 11-main
Loaded: loaded (/lib/systemd/system/postgresql@.service; enabled-runtime; vendor preset: enabled)
Active: failed (Result: exit-code) since Tue 2022-06-14 06:48:20 CEST; 19s ago
Process: 597 ExecStart=/usr/bin/pg_ctlcluster --skip-systemctl-redirect 11-main start (code=exited, status=0/SUCCE>
Process: 4508 ExecStop=/usr/bin/pg_ctlcluster --skip-systemctl-redirect -m fast 11-main stop (code=exited, status=>
Main PID: 684 (code=exited, status=0/SUCCESS)
CPU: 1.862s
Jun 14 06:47:23 argos systemd[1]: Starting PostgreSQL Cluster 11-main...
Jun 14 06:47:27 argos systemd[1]: Started PostgreSQL Cluster 11-main.
Jun 14 06:48:20 argos postgresql@11-main[4508]: Cluster is not running.
Jun 14 06:48:20 argos systemd[1]: postgresql@11-main.service: Control process exited, code=exited, status=2/INVALIDARG>
Jun 14 06:48:20 argos systemd[1]: postgresql@11-main.service: Failed with result 'exit-code'.
Jun 14 06:48:20 argos systemd[1]: postgresql@11-main.service: Consumed 1.862s CPU time.
$ sudo systemctl start postgresql@11-main.service
Job for postgresql@11-main.service failed because the service did not take the steps required by its unit configuration.
See "systemctl status postgresql@11-main.service" and "journalctl -xe" for details.
Luckily, rebooting the system brought the old cluster back online, but nothing has been upgraded. Why does the upgrade fail? What are "the steps required by its unit configuration"? How can I upgrade PostgreSQL with minimal downtime?
I found the source of my problem: a configuration file owned by the wrong user (root instead of postgres) that could not be removed by the pg_dropcluster command because I ran it as the user postgres.
For future reference, here are the correct steps to upgrade a PostgreSQL cluster from 11 to 13:
Verify that the current cluster is still the old version:
$ pg_lsclusters
Ver Cluster Port Status Owner Data directory Log file
11 main 5432 online postgres /var/lib/postgresql/11/main /var/log/postgresql/postgresql-11-main.log
13 main 5434 down postgres /var/lib/postgresql/13/main /var/log/postgresql/postgresql-13-main.log
Run pg_dropcluster 13 main as user postgres:
$ sudo -u postgres pg_dropcluster 13 main
Warning: systemd was not informed about the removed cluster yet.
Operations like "service postgresql start" might fail. To fix, run:
sudo systemctl daemon-reload
Run the pg_upgradecluster command as user postgres:
$ sudo -u postgres pg_upgradecluster 11 main
Verify that everything works, and that the only online cluster is now 13:
$ pg_lsclusters
Ver Cluster Port Status Owner Data directory Log file
11 main 5434 down postgres /var/lib/postgresql/11/main /var/log/postgresql/postgresql-11-main.log
13 main 5432 online postgres /var/lib/postgresql/13/main /var/log/postgresql/postgresql-13-main.log
Drop the old cluster:
$ sudo -u postgres pg_dropcluster 11 main
Uninstall the previous version of PostgreSQL:
$ sudo apt remove 'postgresql*11'
The Debian packages create a cluster automatically when you install the server package, so get rid of that:
pg_dropcluster 13 main
Then stop the v11 server and try again.
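For the "minimal downtime" part of the question, a hedged sketch: recent postgresql-common versions let pg_upgradecluster use pg_upgrade instead of the default dump-and-restore, which is usually much faster. Check man pg_upgradecluster to confirm the -m upgrade method is available on your system before relying on it:
sudo -u postgres pg_dropcluster 13 main                 # remove the automatically created, empty 13 cluster first
sudo -u postgres pg_upgradecluster -m upgrade 11 main   # pg_upgrade-based upgrade; the default method is dump
sudo -u postgres pg_lsclusters                          # 13/main should now be online on port 5432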

Failed to start PostgreSQL Cluster 10-main when booting

When I try to boot Ubuntu, it never finishes the boot process because the message "Failed to start PostgreSQL Cluster 10-main." appears. I also get the same message with 9.5-main, but let's focus on 10.
When I execute:
systemctl status postgresql@10-main.service
I get the following message:
postgresql@10-main.service - PostgreSQL Cluster 10-main
Loaded: loaded (/lib/systemd/system/postgresql@.service; indirect; vendor preset: enabled)
Active: failed (Result: protocol) since Wed 2020-02-19 17:57:22 CET; 30 min ago
Process: 1602 ExecStart=/usr/bin/pg_ctlcluster --skip-systemctl-redirect 10-main start (code=exited, status=1/FAILURE)
PC_info systemd[1]: Starting PostgreSQL Cluster 10-main...
PC_info postgresql@10-main[1602]: Error: /usr/lib/postgresql/10/bin/pg_ctl /usr/lib/postgresql/10/bin/pg_ctl start -D /var/lib/postgresql/10/main -l /var/log/postgresql/postgresql-10-main.log -s -o -c config_file="/etc/postgresql/10/main/postgresql.conf" exit with status 1:
PC_info systemd[1]: postgresql@10-main.service: Can't open PID file /var/run/postgresql/10-main.pid (yet?) after start: No such file or directory
PC_info systemd[1]: postgresql@10-main.service: Failed with result 'protocol'.
PC_info systemd[1]: Failed to start PostgreSQL Cluster 10-main.
PC_info is information about my computer (user, date, ...), not relevant here.
I got this error from one day to the next without touching anything related to database servers.
I tried to fix it by myself, but nothing worked.
Running the command
service postgresql@10-main start
I get
Job for postgresql@10-main.service failed because the service did not take the steps required by its unit configuration.
See "systemctl status postgresql@10-main.service" and "journalctl -xe" for details.
Running these two commands, I get the message from the beginning.
Does anyone have an idea of what is happening? How can I fix it?
I had the same issue; I followed the steps below.
Error status:
pg_lsclusters
Ver Cluster Port Status Owner Data directory Log file
10 main 5432 down postgres /var/lib/postgresql/10/main /var/log/postgresql/postgresql-10-main.log
Applied solution:
sudo chmod 700 -R /var/lib/postgresql/10/main
sudo -i -u postgres
postgres@abc:~$ /usr/lib/postgresql/10/bin/pg_ctl restart -D /var/lib/postgresql/10/main
Status after applying the solution:
pg_lsclusters
Ver Cluster Port Status Owner Data directory Log file
10 main 5432 online postgres /var/lib/postgresql/10/main /var/log/postgresql/postgresql-10-main.log
As mentioned by @gruentee in the comment above,
/usr/lib/postgresql/10/bin/pg_ctl restart -D /var/lib/postgresql/10/main -l /var/log/postgresql/postgresql-10-main.log -s -o '-c config_file="/etc/postgresql/10/main/postgresql.conf"'
started the PostgreSQL DB. Don't forget to run
sudo -i -u postgres
before issuing the above command.
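For context, a hedged explanation of why the chmod helps: PostgreSQL refuses to start when the data directory is not owned by postgres or is group/world accessible, and the exact reason is written to the cluster log. A quick check:
ls -ld /var/lib/postgresql/10/main                            # should show drwx------ postgres postgres
sudo chown -R postgres:postgres /var/lib/postgresql/10/main   # only if ownership is wrong
tail -n 50 /var/log/postgresql/postgresql-10-main.log         # states exactly why startup failed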
I was facing the same challenge. I tried a lot of methods, but none worked for me until I tried the commands below:
sudo apt-get -y install postgresql
sudo systemctl start postgresql@15-main.service
pg_lsclusters
Then everything started working fine. Note that the version of PostgreSQL I'm using is 15, which is why you see 15 in the second command; in your case, substitute the version you are using.

Why does kubeadm not start even after disabling swap?

I am trying to install Kubernetes with kubeadm on my laptop, which runs Ubuntu 16.04. I have disabled swap, since kubelet does not work with swap on. The command I used is:
swapoff -a
I also commented out the reference to swap in /etc/fstab.
# /etc/fstab: static file system information.
#
# Use 'blkid' to print the universally unique identifier for a
# device; this may be used with UUID= as a more robust way to name devices
# that works even if disks are added and removed. See fstab(5).
#
# <file system> <mount point> <type> <options> <dump> <pass>
# / was on /dev/sda1 during installation
UUID=1d343a19-bd75-47a6-899d-7c8bc93e28ff / ext4 errors=remount-ro 0 1
# swap was on /dev/sda5 during installation
#UUID=d0200036-b211-4e6e-a194-ac2e51dfb27d none swap sw 0 0
I confirmed swap is turned off by running the following:
free -m
total used free shared buff/cache available
Mem: 15936 2108 9433 954 4394 12465
Swap: 0 0 0
When I start kubeadm, I get the following error:
kubeadm init --pod-network-cidr=10.244.0.0/16
[init] Using Kubernetes version: v1.14.2
[preflight] Running pre-flight checks
[WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/
error execution phase preflight: [preflight] Some fatal errors occurred:
[ERROR Swap]: running with swap on is not supported. Please disable swap
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
I also tried restarting my laptop, but I get the same error. What could the reason be?
Below was the root cause:
detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd".
You need to update the Docker cgroup driver.
Follow the fix below:
cat > /etc/docker/daemon.json <<EOF
{
  "exec-opts": ["native.cgroupdriver=systemd"],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m"
  },
  "storage-driver": "overlay2",
  "storage-opts": [
    "overlay2.override_kernel_check=true"
  ]
}
EOF
mkdir -p /etc/systemd/system/docker.service.d
# Restart Docker
systemctl daemon-reload
systemctl restart docker
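A quick, hedged way to confirm the new driver is active before re-running kubeadm init (output wording may vary by Docker version):
docker info 2>/dev/null | grep -i cgroup    # should now report: Cgroup Driver: systemd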
You could try kubeadm reset, then kubeadm init --ignore-preflight-errors Swap.
First try with sudo:
sudo swapoff -a
then check whether there's anything swapped:
cat /proc/swaps
and
free -h
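A hedged follow-up check once free -h shows no swap (both are standard util-linux/coreutils tools):
swapon --show                         # prints nothing when no swap device is active
grep -v '^#' /etc/fstab | grep swap   # should print nothing if every swap entry is commented out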

kube-apiserver unable to communicate with TLS enabled etcd

I'm trying to set up a Kubernetes cluster on some Raspberry Pis. I have successfully set up an etcd cluster with TLS enabled, and I can access this cluster via etcdctl and curl.
However, when I try to run kube-apiserver with the same ca file, I get messages saying that the etcd cluster is misconfigured or unavailable.
My question is, why are curl and etcdctl able to view cluster health and add keys with the same ca file that kube-apiserver is trying to use, but the kube-apiserver cannot?
When I run kube-apiserver against 127.0.0.1 over HTTP, not HTTPS, the API server starts fine.
If this and the information below are not enough to understand the problem, please let me know. I'm not experienced with TLS/x509 certificates at all. I've been using Kelsey Hightower's Kubernetes The Hard Way mixed with the CoreOS docs for spinning up a Kubernetes cluster, as well as looking at GitHub issues and things like that.
Here is my etcd unit file:
[Unit]
Description=etcd
Documentation=https://github.com/coreos/etcd
[Service]
Environment=ETCD_UNSUPPORTED_ARCH=arm
ExecStart=/usr/bin/etcd \
--name etcd-master1 \
--cert-file=/etc/etcd/etcd.pem \
--key-file=/etc/etcd/etcd-key.pem \
--peer-cert-file=/etc/etcd/etcd.pem \
--peer-key-file=/etc/etcd/etcd-key.pem \
--trusted-ca-file=/etc/etcd/ca.pem \
--peer-trusted-ca-file=/etc/etcd/ca.pem \
--initial-advertise-peer-urls=https://10.0.1.200:2380 \
--listen-peer-urls https://10.0.1.200:2380 \
--listen-client-urls https://10.0.1.200:2379,http://127.0.0.1:2379 \
--advertise-client-urls https://10.0.1.200:2379 \
--initial-cluster-token etcd-cluster-1 \
--initial-cluster etcd-master1=https://10.0.1.200:2380,etcd-master2=https://10.0.1.201:2380 \
--initial-cluster-state new \
--data-dir=var/lib/etcd \
--debug
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
Here is the kube-apiserver command I'm trying to run:
#!/bin/bash
./kube-apiserver \
--etcd-cafile=/etc/etcd/ca.pem \
--etcd-certfile=/etc/etcd/etcd.pem \
--etcd-keyfile=/etc/etcd/etcd-key.pem \
--etcd-servers=https://10.0.1.200:2379,https://10.0.1.201:2379 \
--service-cluster-ip-range=10.32.0.0/24
Here is some output of that attempt. I think it's kind of weird that it is trying to list the etcd node, but nothing is printed out:
deploy@master1:~$ sudo ./run_kube_apiserver.sh
I1210 00:11:35.096887 20480 config.go:499] Will report 10.0.1.200 as public IP address.
I1210 00:11:35.842049 20480 trace.go:61] Trace "List *api.PodTemplateList" (started 2016-12-10 00:11:35.152287704 -0500 EST):
[134.479µs] [134.479µs] About to list etcd node
[688.376235ms] [688.241756ms] Etcd node listed
[689.062689ms] [686.454µs] END
E1210 00:11:35.860221 20480 cacher.go:261] unexpected ListAndWatch error: pkg/storage/cacher.go:202: Failed to list *api.PodTemplate: client: etcd cluster is unavailable or misconfigured
I1210 00:11:36.588511 20480 trace.go:61] Trace "List *api.LimitRangeList" (started 2016-12-10 00:11:35.273714755 -0500 EST):
[184.478µs] [184.478µs] About to list etcd node
[1.314010127s] [1.313825649s] Etcd node listed
[1.314362833s] [352.706µs] END
E1210 00:11:36.596092 20480 cacher.go:261] unexpected ListAndWatch error: pkg/storage/cacher.go:202: Failed to list *api.LimitRange: client: etcd cluster is unavailable or misconfigured
I1210 00:11:37.286714 20480 trace.go:61] Trace "List *api.ResourceQuotaList" (started 2016-12-10 00:11:35.325895387 -0500 EST):
[133.958µs] [133.958µs] About to list etcd node
[1.96003213s] [1.959898172s] Etcd node listed
[1.960393274s] [361.144µs] END
A successful cluster-health query:
deploy@master1:~$ sudo etcdctl --cert-file /etc/etcd/etcd.pem --key-file /etc/etcd/etcd-key.pem --ca-file /etc/etcd/ca.pem cluster-health
member 133c48556470c88d is healthy: got healthy result from https://10.0.1.200:2379
member 7acb9583fc3e7976 is healthy: got healthy result from https://10.0.1.201:2379
I am also seeing a lot of timeouts on the etcd servers themselves trying to send heartbeats back:
Dec 10 00:19:56 master1 etcd[19308]: failed to send out heartbeat on time (exceeded the 100ms timeout for 790.808604ms)
Dec 10 00:19:56 master1 etcd[19308]: server is likely overloaded
Dec 10 00:22:40 master1 etcd[19308]: failed to send out heartbeat on time (exceeded the 100ms timeout for 122.586925ms)
Dec 10 00:22:40 master1 etcd[19308]: server is likely overloaded
Dec 10 00:22:41 master1 etcd[19308]: failed to send out heartbeat on time (exceeded the 100ms timeout for 551.618961ms)
Dec 10 00:22:41 master1 etcd[19308]: server is likely overloaded
I can still do etcd operations like gets and puts, but I'm wondering if this could be a contributing factor. Can I tell the kube-apiserver to wait longer for etcd? I've been trying to figure this out myself, but in my opinion the technical parts of the Kubernetes components aren't really well documented, and a lot of the examples are very turnkey, without really explaining what everything is doing and why. I can find all kinds of diagrams and blog posts about high-level stuff, but things like how to run the actual binaries, and which flags are and are not required, are kind of lacking.
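One hedged avenue, given the "failed to send out heartbeat on time" messages: etcd exposes tuning flags for slow disks and networks (common with Raspberry Pi SD cards). The values below are illustrative only; the election timeout should stay several times larger than the heartbeat interval. They would be appended to the ExecStart line of the etcd unit file shown earlier:
--heartbeat-interval 500 \
--election-timeout 5000 \
followed by sudo systemctl daemon-reload and sudo systemctl restart etcd on each etcd node.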