ceph 15.2.4 -- authentication changes -- unable to reattach with kubernetes - kubernetes

Yesterday my teammates found a way to disable cephx authentication cluster wide (2 server cluster) in order to bypass issues that were preventing us from joining a 3rd server. Unfortunately they were uncertain which of the steps taken let to the successful addition. I request assistance getting my ceph operational again. Yesterday we left off after editing /etc/ceph/ceph.conf, turning authentication back on here, then copying the file to /var/lib/ceph///config and ensuring permissions were set to 644.
This got one command to work that had previously not been -- my ceph osd df correctly shows all 24 OSDs again, but I cannot run a ceph osd status nor a ceph orch status.

Related

Ceph Mgr not responding to certain commands

We have a ceph cluster built with rook(2 mgrs, 3 mons, 2 mds each cephfs, 24 osds, rook: 1.9.3, ceph: 16.2.7, kubelet: 1.24.1). Our operation requires constantly creating and deleting cephfilesystems. Overtime we experienced issues with rook-ceph-mgr. After the cluster was built, in a week or two, rook-ceph-mgr failed to respond to certain ceph commands, like ceph osd pool autoscale-status, ceph fs subvolumegroup ls, while other commands, like ceph -s, worked fine. We have to restart rook-ceph-mgr to get it going. Now we have around 30 cephfilesystems and the issue happens more frequently.
We tried disabling mgr modules dashboard, prometheus and iostat, set ceph progress off, increased mgr_stats_period & mon_mgr_digest_period. That didn't help much. The issue happened again after one or two creating & deleting cycles.

How can I fix ceph commands hanging after a reboot?

I'm pretty new to Ceph, so I've included all my steps I used to set up my cluster since I'm not sure what is or is not useful information to fix my problem.
I have 4 CentOS 8 VMs in VirtualBox set up to teach myself how to bring up Ceph. 1 is a client and 3 are Ceph monitors. Each ceph node has 6 8Gb drives. Once I learned how the networking worked, it was pretty easy.
I set each VM to have a NAT (for downloading packages) and an internal network that I called "ceph-public". This network would be accessed by each VM on the 10.19.10.0/24 subnet. I then copied the ssh keys from each VM to every other VM.
I followed this documentation to install cephadm, bootstrap my first monitor, and added the other two nodes as hosts. Then I added all available devices as OSDs, created my pools, then created my images, then copied my /etc/ceph folder from the bootstrapped node to my client node. On the client, I ran rbd map mypool/myimage to mount the image as a block device, then used mkfs to create a filesystem on it, and I was able to write data and see the IO from the bootstrapped node. All was well.
Then, as a test, I shutdown and restarted the bootstrapped node. When it came back up, I ran ceph status but it just hung with no output. Every single ceph and rbd command now hangs and I have no idea how to recover or properly reset or fix my cluster.
Has anyone ever had the ceph command hang on their cluster, and what did you do to solve it?
Let me share a similar experience. I also tried some time ago to perform some tests on Ceph (mimic i think) an my VMs on my VirtualBox acted very strange, nothing comparing with actual bare metal servers so please bare this in mind... the tests are not quite relevant.
As regarding your problem, try to see the following:
have at least 3 monitors (or an even number). It's possible that hang is because of monitor election.
make sure the networking part is OK (separated VLANs for ceph servers and clients)
DNS is resolving OK. (you have added the servername in hosts)
...just my 2 cents...

How to sync user directory on bitbucket server to jira with both running on aks?

When trying to sync the user directories of Jira to other atlassian products (confluence and bitbucket server running on aks) a 403 error is returned.
Upon looking into this error the following steps have been attempted:
https://confluence.atlassian.com/stashkb/unable-to-connect-to-jira-for-authentication-forbidden-403-323391874.html
The IP adresses have been added to the whitelist of Jira. The next step in solutions online is to restart the Jira
service.
This however causes issues as upon running the stop/start-jira.sh files inside the pod the service returns
with none of the previous settings and all configurations including backups are gone. Taking us back to square one.
cluster size:
current set-up
3 x Standard D8 v3 (8 vcpus, 32 GiB memory) cluster on aks
Used the following images installed through UI:
atlassian/jira-software
cptactionhank/docker-atlassian-jira
Exec into pod and go to /opt/atlassian/jira/bin
run ./(start/stop)-jira.sh
What should happen is that when going back to the url the Jira instance is reset and all configuration files in the pod for the service are lost.
The logs of the pod give error no 137 as a common error when restarting.
update:
https://github.com/int128/devops-kompose/tree/master/atlassian-jira-software
The following helm chart has also been used and achieved the same result.

Why would running a container on GCE get stuck Metadata request unsuccessful forbidden (403)

I'm trying to run a container in a custom VM on Google Compute Engine. This is to perform a heavy ETL process so I need a large machine but only for a couple of hours a month. I have two versions of my container with small startup changes. Both versions were built and pushed to the same google container registry by the same computer using the same Google login. The older one works fine but the newer one fails by getting stuck in an endless list of the following error:
E0927 09:10:13 7f5be3fff700 api_server.cc:184 Metadata request unsuccessful: Server responded with 'Forbidden' (403): Transport endpoint is not connected
Can anyone tell me exactly what's going on here? Can anyone please explain why one of my images doesn't have this problem (well it gives a few of these messages but gets past them) and the other does have this problem (thousands of this message and taking over 24 hours before I killed it).
If I ssh in to a GCE instance then both versions of the container pull and run just fine. I'm suspecting the INTEGRITY_RULE checking from the logs but I know nothing about how that works.
MORE INFO: this is down to "restart policy: never". Even a simple Centos:7 container that says "hello world" deployed from the console triggers this if the restart policy is never. At least in the short term I can fix this in the entrypoint script as the instance will be destroyed when the monitor realises that the process has finished
I suggest you try creating a 3rd container that's focused on the metadata service functionality to isolate the issue. It may be that there's a timing difference between the 2 containers that's not being overcome.
Make sure you can ‘curl’ the metadata service from the VM and that the request to the metadata service is using the VM's service account.

how to update cluster status in dataproc

I changed my initialization script after creating a cluster with 2 worker nodes for spark. Then I changed the script a bit and tried to update the cluster with 2 more worker nodes. The script failed because I simply forgot to apt-get update before apt-get install, so dataproc reports error and the cluster's status changed to ERROR. When I try to reduce the size back to 2 nodes again, it doesn't work anymore with the following message
ERROR: (gcloud.dataproc.clusters.update) Cluster 'cluster-1' must be running before it can be updated, current cluster state is 'ERROR'.
The two worker nodes are still added, but they don't seem to be detected by a running spark application at first because no more executors are added. I manually reset the two instances on the Google Compute Engine page, and then 4 executors are added. So it seems everything is working fine again except that the cluster's status is still ERROR, and I cannot increase or decrease the number of worker nodes anymore.
How can I update the cluster status back to normal (RUNNING)?
In your case ERROR indicates that workflow to re-configure the cluster has failed, and Dataproc is not sure of its health. At this point Dataproc cannot guarantee that another reconfigure attempt will succeed so further updates are disallowed. You can however submit jobs.
Your best bet is to delete it and start over.