Deleting files in Ceph does not free up space

I am using Ceph and uploading many files through radosgw. Afterwards, I want to delete the files. I am trying to do that in Python, like this:
bucket = conn.get_bucket(BUCKET)
for key in bucket.list():
    bucket.delete_key(key)
Afterwards, I use bucket.list() to list files in the bucket, and this says that the bucket is now empty, as I intended.
However, when I run ceph df on the mon, it shows that the OSDs still have high utilization (e.g. %RAW USED 90.91). If I continue writing (thinking that the status data just hasn't caught up with the state yet), Ceph essentially locks up (100% utilization).
What's going on?
Note: I do have these warnings standing out in ceph status:
health HEALTH_WARN
3 near full osd(s)
too many PGs per OSD (2168 > max 300)
pool default.rgw.buckets.data has many more objects per pg than average (too few pgs?)
From what I gather online, this wouldn't cause my particular issue. But I'm new to Ceph and could be wrong.
I have one mon and 3 OSDs. This is just for testing.

You can check whether the objects are really deleted with rados -p $pool ls.
I know that for CephFS, deleting a file returns OK as soon as the MDS marks it as deleted in local memory; the real deletion happens later, when delete messages are sent to the related OSDs.
radosgw may use the same design to speed up deletes.
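For example, here is a minimal sketch of that check (assuming the data pool is named default.rgw.buckets.data and that the rados and radosgw-admin tools are in the PATH; adjust the pool name to your setup):

import subprocess

POOL = "default.rgw.buckets.data"  # assumed data pool name; adjust for your cluster

# Count the RADOS objects still present in the bucket data pool.
objects = subprocess.run(
    ["rados", "-p", POOL, "ls"],
    capture_output=True, text=True, check=True,
).stdout.splitlines()
print(f"{len(objects)} objects still in {POOL}")

# radosgw deletes object data asynchronously through its garbage collector;
# list what is still queued for deletion.
gc_pending = subprocess.run(
    ["radosgw-admin", "gc", "list", "--include-all"],
    capture_output=True, text=True, check=True,
).stdout
print(gc_pending)

If the object count only drops after garbage collection runs (radosgw-admin gc process), the asynchronous-delete behaviour described above is the likely explanation for the space not being freed immediately.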

Related

Ceph reports bucket space utilization and total cluster utilization that are inconsistent

I copied the contents of an older Ceph cluster to a new Ceph cluster using rclone. Because several of the buckets had tens of millions of objects in a single directory, I had to enumerate these individually and use the "rclone copyto" command to move them. After copying, the numbers of objects match, but the space utilization on the second Ceph cluster is much higher.
Each Ceph cluster is configured with the default triple redundancy.
The older Ceph cluster has 1.4PiB of raw capacity.
The older Ceph cluster has 526TB in total bucket utilization as reported by "radosgw-admin metadata bucket stats". The "ceph -s" status on this cluster shows 360TiB of object utilization with a total capacity of 1.4PiB for 77% space utilization. The two indicated quantities of 360TiB used in the cluster and 526TB used by buckets are significantly different. There isn't enough raw capacity on this cluster to hold 526TB.
After copying the contents to the new Ceph cluster, the total bucket utilization of 553TB is reflected in the "ceph -s" status as 503TiB. This is slightly higher than the source's bucket total, which I assume is due to the larger drives' block sizes, but the status utilization matches the sum of the bucket utilization as expected. The number of objects in each bucket of the destination cluster also matches the source buckets.
Is there a setting in the first Ceph cluster that merges duplicate objects, like a simplistic form of compression? There isn't enough capacity in the first Ceph cluster to hold much over 500TB, so this seems like the only way this could happen. I assume that when two objects are the same, each bucket gets a symlink-like pointer to the same object. The new Ceph cluster doesn't seem to have this capability, or it's not set to behave this way.
The first cluster is Ceph version 13.2.6 and the second is version 17.2.3.
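One way to compare the two figures yourself is to sum the per-bucket usage that radosgw-admin reports and put it next to ceph df. A rough sketch, assuming both CLIs are available and that the JSON field names below (usage / size_kb_actual) match your release:

import json
import subprocess

# Sum per-bucket usage as reported by radosgw-admin.
stats = json.loads(subprocess.run(
    ["radosgw-admin", "bucket", "stats", "--format=json"],
    capture_output=True, text=True, check=True,
).stdout)

total_kb = 0
for bucket in stats:
    for category in bucket.get("usage", {}).values():  # e.g. "rgw.main"
        total_kb += category.get("size_kb_actual", 0)

print(f"Sum of bucket utilization: {total_kb / 1024**3:.1f} TiB")

# Compare against what the cluster itself reports.
print(subprocess.run(["ceph", "df"], capture_output=True,
                     text=True, check=True).stdout)

If the two numbers disagree on one cluster but not the other, the discrepancy is in how each cluster accounts for the same objects rather than in the copy itself.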

Ceph storage OSD disk upgrade (replace with larger drive)

I have three servers, each with 1 x SSD drive (for the Ceph base OS) and 6 x 300GB SAS drives. At the moment I'm only using 4 drives on each server as the OSDs in my Ceph storage array, and everything is fine.
My question is: now that I have built this and have everything up and running, if in, say, 6 months or so I need to replace these OSDs because the storage array is running out of space, is it possible to remove one disk at a time from each server and replace it with a larger drive?
For example, if server 1 has OSDs 0-5, server 2 has OSDs 6-11 and server 3 has OSDs 12-17, could I one day remove OSD 0 and replace it with a 600GB SAS drive, wait for the cluster to heal, then do the same with OSD 6, then OSD 12, and so on until all the disks are replaced? Would this then give me a larger storage pool?
OK, just for anyone looking for this answer in the future: you can upgrade your drives in the way I mention above. Here are the steps I took (please note that this was in a lab, not production); a command-level sketch follows the list:
Mark the OSD as down
Mark the OSD as Out
Remove the drive in question
Install new drive (must be either the same size or larger)
I needed to reboot the server in question for the new disk to be seen by the OS
Add the new disk into Ceph as normal
Wait for the cluster to heal then repeat on a different server
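For reference, here is roughly how those steps map to commands on a recent release, wrapped in a small Python helper (the OSD id and device path are hypothetical, and older releases would use ceph-disk or ceph-deploy instead of ceph-volume):

import subprocess

OSD_ID = "0"             # hypothetical OSD being replaced
NEW_DEVICE = "/dev/sdf"  # hypothetical new, larger drive

def run(*cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Take the OSD out of data placement and stop its daemon.
run("ceph", "osd", "out", OSD_ID)
run("systemctl", "stop", f"ceph-osd@{OSD_ID}")

# Remove it from the CRUSH map, auth database and OSD map before pulling the disk.
run("ceph", "osd", "crush", "remove", f"osd.{OSD_ID}")
run("ceph", "auth", "del", f"osd.{OSD_ID}")
run("ceph", "osd", "rm", OSD_ID)

# After physically swapping the drive (and rebooting if needed), create the
# new OSD on the replacement disk and let the cluster backfill.
run("ceph-volume", "lvm", "create", "--data", NEW_DEVICE)
run("ceph", "health")  # repeat on the next server once HEALTH_OK returns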
I have now done this with 6 of my 15 drives across the 3 servers, and each time the size of the Ceph storage has increased a little (I'm only going from 320GB drives to 400GB drives, as this is only a test and I have some of these spare).
I plan on starting this on the live production servers next week now that I know it works; going from 300GB to 600GB drives, I should see a larger increase in storage (I hope).

How to use zdb -e poolname to recover data from a single ZFS device

I have the following situation:
1 x 10TB drive, full of data, on ZFS
I wanted to add a 100GB NVMe partition as a cache
Instead of using zpool add poolname cache nvmepartition, I wrote zpool add poolname nvmepartition
I did not see my mistake and exported this pool.
Now the NVMe drive is not available any more, and the system has no information about this pool in the ZFS cache (because of the export).
current status:
zpool import shows the pool, but I cannot import it using any method I have found on the internet.
zdb -e poolname shows me what I already know: the pool, its name, that it (sadly) has 2 children, one of which is not available any more, and that the system has no information about the missing child (so all the tricks I found on the internet for linking in a ghost device etc. won't work either).
As far as I know, the only way is to use zdb to generate all the files through the "journal" and pipe/save them to another path.
But how? I have not found any documentation on that anywhere.
Note: the 10TB drive was 90% full when I added the NVMe partition as a sibling vdev. Since ZFS striping is not a real RAID 0, the two devices are so unequal in size, and I did not write much data after my mistake happened, I am quite sure that most of my data is still there.
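I don't have a definitive procedure, but as a starting point, here is a small sketch (the pool name and device directory are placeholders, and it assumes a zdb build that supports -e for exported pools) that dumps what zdb can still enumerate, which is the usual first step before attempting any block-level recovery:

import subprocess

POOL = "poolname"  # placeholder

def zdb(*args):
    # -e operates on an exported pool; -p points at the directory holding the device nodes
    return subprocess.run(
        ["zdb", "-e", "-p", "/dev", *args, POOL],
        capture_output=True, text=True,
    ).stdout

# Pool-level view: labels, the vdev tree, and which child is missing.
print(zdb())

# Dataset and object listing; add more d's (e.g. -dddd) for per-object detail,
# which is what you would walk when trying to copy data out.
print(zdb("-d"))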

Google Compute Engine snapshot of instance with persistent disks attached failed

I have a working VM instance that I'm trying to copy in order to have redundancy behind a Google load balancer.
A test run with a dummy instance worked fine, creating a new instance from a snapshot of a running one.
Now, the real "original" instance has a persistent disk attached, and this causes a problem when starting up the cloned instance because of the (obviously) missing persistent disk mount.
The serial console output shows:
* Stopping cold plug devices  [ OK ]
* Stopping log initial device creation  [ OK ]
* Starting enable remaining boot-time encrypted block devices  [ OK ]
The disk drive for /mnt/XXXX-log is not ready yet or not present.
keys:Continue to wait, or Press S to skip mounting or M for manual recovery
As I understand it, there is no way to send any of these keystrokes to the instance, so is there any other way to overcome this issue? I know that I could unmount the disk before the snapshot, but the workflow I would like to put in place creates periodic snapshots of production servers, so unmounting the disks every time beforehand would require instance downtime (plus all the unnecessary risks of doing an action that would seem pointless).
Is there a way to boot this type of cloned instance successfully and attach a new persistent disk afterwards?
Is this happening because the original persistent disk is in use, or would the same problem occur even if the original instance were offline (for example due to a failure, in which case I would try to create a new instance from a snapshot)?
One workaround I am using to get around the same issue is that I don't actually unmount the disk; rather, I just comment out the mount line in /etc/fstab and take the snapshot. This way my instance has no downtime or down disks while snapshotting. (I am using Ubuntu 14.04 as the OS, if that matters.)
Later I fix and uncomment it when I use that snapshot on a new instance.
However, you can also look into adding the nofail option to that line for a better solution; a hypothetical example is shown below.
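For example, a hypothetical /etc/fstab entry using nofail (the device name and filesystem type are assumptions; nofail lets the boot continue if the disk is absent, as on a freshly cloned instance):

/dev/disk/by-id/google-data-disk  /mnt/XXXX-log  ext4  defaults,nofail  0  2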
By the way, I am doing a similar task, building a load-balanced setup with multiple webserver nodes, each cloned from the said snapshot with extra persistent disks mounted for e.g. uploads, data, logs, etc.
I'm a little unclear as to what you're trying to accomplish. It sounds like you're looking to periodically snapshot the data volumes of a production server so you can clone them later.
In all likelihood, you simply need to sync and fsfreeze the filesystem before you make your snapshot, rather than unmounting/remounting it. The GCP documentation has a basic example of this in the Snapshots documentation.
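A rough sketch of that flow as a script run on the instance itself (the disk, zone and snapshot names are made up, and it assumes the Cloud SDK is installed and authorized):

import subprocess

MOUNT_POINT = "/mnt/XXXX-log"     # the mounted persistent disk from the question
DISK = "my-data-disk"             # hypothetical GCE disk name
ZONE = "europe-west1-b"           # hypothetical zone
SNAPSHOT = "my-data-disk-snap-1"  # hypothetical snapshot name

def run(*cmd):
    subprocess.run(cmd, check=True)

run("sync")                         # flush dirty pages to disk
run("fsfreeze", "-f", MOUNT_POINT)  # block new writes so the snapshot is consistent
try:
    run("gcloud", "compute", "disks", "snapshot", DISK,
        f"--snapshot-names={SNAPSHOT}", f"--zone={ZONE}")
finally:
    run("fsfreeze", "-u", MOUNT_POINT)  # always unfreeze, even if the snapshot fails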

Linux Page Cache Replacement

I have two PostgreSQL databases named data-1 and data-2 that sit on the same machine. Both databases keep 40 GB of data, and the total memory available on the machine is 68GB.
I started data-1 and data-2, and ran several queries to go over all their data. Then I shut down data-1 and kept issuing queries against data-2. For some reason, the OS still holds on to large parts of data-1's pages in its page cache and reserves only about 35 GB of RAM for data-2's files. As a result, my queries on data-2 keep hitting disk.
I'm checking page cache usage with fincore. When I run a table scan query against data-2, I see that data-2's pages get evicted and put back into the cache in a round-robin manner. Nothing happens to data-1's pages, although they haven't been touched for days.
Does anybody know why data-1's pages aren't evicted from the page cache? I'm open to any kind of suggestion you think might be related to this problem.
This is an EC2 m2.4xlarge instance on Amazon with 68 GB of RAM and no swap space. The kernel version is:
$ uname -r
3.2.28-45.62.amzn1.x86_64
Edit-1:
It seems that there is no NUMA configuration:
$ dmesg | grep -i numa
[ 0.000000] No NUMA configuration found
Edit-2:
I used the page-types tool from the Linux kernel source tree to monitor page cache status. From the results I conclude that:
data-1 pages are in state : referenced,uptodate,lru,active,private
data-2 pages are in state : referenced,uptodate,lru,mappedtodisk
Take a look at the cpusets you have configured in /dev/cpusets. If you have multiple directories in here then you have multiple cpusets, and potentially multiple memory nodes.
The cpusets mechanism is documented in detail here: http://www.kernel.org/doc/man-pages/online/pages/man7/cpuset.7.html
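A quick way to see whether any cpusets actually pin memory to particular nodes, assuming cpusets are mounted at /dev/cpusets as described above (the mount point and file names vary; some setups use /dev/cpuset or a cgroup mount, and the file may be called cpuset.mems instead of mems):

import os

CPUSET_ROOT = "/dev/cpusets"  # as referenced above; adjust to where cpusets are mounted

for dirpath, dirnames, filenames in os.walk(CPUSET_ROOT):
    for name in ("mems", "cpuset.mems"):
        if name in filenames:
            with open(os.path.join(dirpath, name)) as f:
                print(f"{dirpath}: memory nodes {f.read().strip()}")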