Ceph delete huge file

I have a running cluster on Octopus 15.2.12.
When I delete a 2 TB file, the file is removed from the bucket listing, but the bucket size reported by the cluster doesn't change.
I checked my waste usage and saw this size was added to the waste usage.
I checked the gc pool object size and saw it didn't change after the delete.
The gc configs on my cluster have their default values.
Can anyone help me with this?
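A hedged sketch of how to inspect and force the RGW garbage collection, since deleted RGW data is reclaimed asynchronously; the config names and flags below are assumptions based on Octopus-era radosgw-admin and ceph, so adjust to your release:

# List objects currently queued for RGW garbage collection, including not-yet-expired entries
radosgw-admin gc list --include-all | head

# Show the GC-related settings in effect (defaults delay reclamation by hours)
ceph config get client.rgw rgw_gc_obj_min_wait
ceph config get client.rgw rgw_gc_processor_period

# Force a GC pass instead of waiting for the next scheduled run
radosgw-admin gc process --include-all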

Related

Is there a way to calculate the total disk space used by each pod on nodes?

context
Our current context is the following: researchers are running HPC calculations on our Kubernetes cluster. Unfortunately, some pods cannot get scheduled because the container engine (here Docker) is not able to pull the images because the node is running out of disk space.
hypotheses
images too big
The first hypothesis is that the images are too big. This is probably the case, because we know that some images are bigger than 7 GB.
datasets being decompressed locally
Our second hypothesis is that some people are downloading their datasets locally (e.g. curl ...) and inflating them locally. This would explain the behavior we are observing.
Envisioned solution
I believe that this problem is a good case for a DaemonSet that would have access to the node's file system. Typically, this pod would calculate the total disk space used by all the pods on the node and expose it as a Prometheus metric. From there it would be easy to set up alert rules to check which pods have grown a lot over a short period of time; a sketch of what such a pod could measure is shown below.
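As a sketch of what the DaemonSet could run on each node (the kubelet and runtime paths below are the usual defaults but are assumptions; adjust for your distro and container runtime):

# Per-pod usage of emptyDir volumes and other pod-local data managed by the kubelet
# (run on the node, or in a privileged DaemonSet pod with /var/lib/kubelet host-mounted)
du -sh /var/lib/kubelet/pods/* 2>/dev/null | sort -rh | head -n 20

# Container writable layers live under the runtime's data dir instead,
# e.g. overlay2 for Docker (also an assumption, depends on the runtime)
du -sh /var/lib/docker/overlay2/* 2>/dev/null | sort -rh | head -n 20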
How to calculate the total disk space used by a pod?
The question then becomes: is there a way to calculate the total disk space used by a pod?
Does anyone have any experience with this?
Kubernetes does not track overall storage available. It only knows things about emptyDir volumes and the filesystem backing those.
For calculating the total disk space of a node you can use the command below:
kubectl describe nodes
In the output of that command you can grep for ephemeral-storage, which is the node's virtual disk size; this partition is shared and consumed by pods via emptyDir volumes, image layers, container logs and container writable layers.
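For example, to pull just the ephemeral-storage figures per node (a small sketch; the exact label in the describe output may vary slightly by Kubernetes version):

# Grep the describe output for the ephemeral-storage lines (Capacity and Allocatable)
kubectl describe nodes | grep -i ephemeral-storage

# Or query the capacity field directly
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.capacity.ephemeral-storage}{"\n"}{end}'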
Also check whether a process is still running and holding file descriptors (and therefore space) for deleted files; there may be other processes with unreleased file descriptors as well. Check whether that process is the kubelet.
You can verify by running $ ps -Af | grep xxxx
With Prometheus you can calculate it with the formula below:
sum(node_filesystem_size_bytes)
Please go through Get total and free disk space using Prometheus for more information.

Ceph reports bucket space utilization and total cluster utilization that is inconsistent

I copied the contents of an older Ceph cluster to a new Ceph cluster using rclone. Because several of the buckets had tens of millions of objects in a single directory, I had to enumerate these individually and use the "rclone copyto" command to move them. After copying, the number of objects matches, but the space utilization on the second Ceph cluster is much higher.
Each Ceph cluster is configured with the default triple redundancy.
The older Ceph cluster has 1.4PiB of raw capacity.
The older Ceph cluster has 526TB in total bucket utilization as reported by "radosgw-admin metadata bucket stats". The "ceph -s" status on this cluster shows 360TiB of object utilization with a total capacity of 1.4PiB for 77% space utilization. The two indicated quantities of 360TiB used in the cluster and 526TB used by buckets are significantly different. There isn't enough raw capacity on this cluster to hold 526TB.
After copying the contents to the new Ceph cluster, the total bucket utilization of 553TB is reflected in the "ceph -s" status as 503TiB. This is slightly higher than the bucket total of the source, which I assume is due to the larger drives' block sizes, but the status utilization matches the sum of the bucket utilization, as expected. The number of objects in each bucket of the destination cluster also matches the source buckets.
Is there a setting in the first Ceph cluster that merges duplicate objects, like a simplistic form of compression? There isn't enough capacity in the first Ceph cluster to hold much over 500TB, so this seems like the only way this could happen. I assume that when two objects are the same, each bucket gets a symlink-like pointer to the same object. The new Ceph cluster doesn't seem to have this capability, or it's not set to behave this way.
The first cluster is Ceph version 13.2.6 and the second is version 17.2.3.
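A hedged way to compare the two numbers yourself; the JSON field names below match recent radosgw-admin output but may differ between Mimic 13.2 and Quincy 17.2, so treat this as a sketch:

# Sum of per-bucket logical usage as reported by RGW, converted from KiB to TiB
radosgw-admin bucket stats | \
  jq '[.[].usage["rgw.main"].size_kb_actual // 0] | add / 1024 / 1024 / 1024'

# Raw and per-pool usage as seen by RADOS; STORED vs USED shows the replication overhead
ceph df detail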

EFS storage growing too big

We have an ECS Fargate cluster that runs the fluentd application for collecting logs and routing them to Elasticsearch. Logs are buffered on disk (file buffer) before being routed to the destination. Since we are using Fargate, we mount the buffer path /var/log/fluentd/buffer/ to EFS.
What we would ideally expect is that the data in the buffer path is flushed to Elasticsearch and the buffer directory is then deleted. However, we see a huge number of these buffer directories left over from tasks that died and were restarted several months ago.
So when an ECS task dies and comes back up again (autoscaling), it creates a new path under /var/log/fluentd/buffer/ that gets mounted on EFS, while the old buffer path is also retained. I am not sure if it is EFS that is holding on to these and remounting them on the new tasks.
Is there a way to delete these stale directories from EFS and keep only the paths specific to the running tasks? At any given time, we have 5 tasks running in the service.
Any help is appreciated.
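One hedged approach, assuming the buffer directories are keyed per task and nothing else writes to that EFS path: mount the filesystem from a utility container or a one-off task and prune directories that have not been modified recently. Fluentd flushes live buffers frequently, so an old mtime is a reasonable staleness signal, but verify the age threshold against your flush settings before deleting anything. The mount point below is an assumption:

# Dry run: list buffer directories not modified in the last 7 days
find /mnt/efs/var/log/fluentd/buffer -mindepth 1 -maxdepth 1 -type d -mtime +7 -print

# Once the list looks right, actually delete them
find /mnt/efs/var/log/fluentd/buffer -mindepth 1 -maxdepth 1 -type d -mtime +7 -exec rm -rf {} +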

the value of container_fs_writes_bytes_total isn't correct in k8s?

If the application in the container writes or reads files without using a volume, the value of the container_fs_writes_bytes_total metric is always 0 when I upload a huge file and the application saves it to a folder. container_fs_reads_bytes_total is also always 0 when I download the uploaded file. Is there a way to get the disk IO caused by the container?
Btw, I want to know the disk usage of each container in k8s. There is a container_fs_usage_bytes metric in cAdvisor, but its value isn't correct: if there are many containers, the metric for each container is exactly the same, and the value is the whole disk usage of the machine. Someone suggested using kubelet_volume_stats_used_bytes, but it has the same problem.
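If the raw per-container disk IO is needed regardless of what cAdvisor exports, one workaround is to read the container's cgroup IO counters directly on the node. This is only a sketch: the container name is hypothetical, and the cgroup paths and files depend on your runtime, cgroup driver and cgroup version.

# On the node: find the container ID for the workload (container name is hypothetical)
CID=$(crictl ps --name my-app -q | head -n1)

# Locate its cgroup directory and dump the IO counters
# (io.stat on cgroup v2, blkio.throttle.io_service_bytes on cgroup v1)
CGDIR=$(find /sys/fs/cgroup -type d -name "*${CID}*" | head -n1)
cat "${CGDIR}/io.stat" 2>/dev/null || cat "${CGDIR}/blkio.throttle.io_service_bytes"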

Deleting files in Ceph does not free up space

I am using Ceph, uploading many files through radosgw. Afterwards, I want to delete the files. I am trying to do that in Python, like this:
bucket = conn.get_bucket(BUCKET)
for key in bucket.list():
    bucket.delete_key(key)
Afterwards, I use bucket.list() to list files in the bucket, and this says that the bucket is now empty, as I intended.
However, when I run ceph df on the mon, it shows that the OSDs still have high utilization (e.g. %RAW USED 90.91). If I continue writing (thinking that the status data just hasn't caught up with the state yet), Ceph essentially locks up (100% utilization).
What's going on?
Note: I do have these standing out in ceph status:
health HEALTH_WARN
3 near full osd(s)
too many PGs per OSD (2168 > max 300)
pool default.rgw.buckets.data has many more objects per pg than average (too few pgs?)
From what I gather online, this wouldn't cause my particular issue. But I'm new to Ceph and could be wrong.
I have one mon and 3 OSDs. This is just for testing.
You can check whether the objects are really deleted with rados -p $pool ls.
I know that for CephFS, when you delete a file, the operation returns OK as soon as the MDS marks it as deleted in local memory; the real delete happens later, by sending delete messages to the related OSDs.
Maybe radosgw uses the same design to speed up deletes.
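A minimal way to check this from the shell; the data pool name below is the usual RGW default and is an assumption, so confirm it with ceph osd pool ls first:

# Count objects still present in the RGW data pool after the bucket listing shows empty
rados -p default.rgw.buckets.data ls | wc -l

# If objects remain, they are likely queued for the RGW garbage collector, which
# reclaims space asynchronously; this forces a pass instead of waiting for the schedule
radosgw-admin gc process --include-all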