How does OpenStack Cinder distribute volumes without a specific volume type when two Ceph backend pools are configured? - scheduler

I found that cinder-volume distributes a new volume to a pool based on the pools' virtual or actual free capacity when the volume is created without a specific volume type, there are two or more backend pools, and the default_volume_type configuration option is not set.
In my case there is a ceph_common pool with 30 TiB of MAX AVAIL remaining and a ceph_specs pool with 10 TiB of MAX AVAIL remaining, yet a new volume created without a volume type (openstack volume create --size 10 test_vol_without_type) ends up in the ceph_specs pool.
I checked these links but couldn't find any clue:
https://docs.openstack.org/cinder/latest/admin/default-volume-types.html
https://docs.openstack.org/cinder/latest/configuration/block-storage/samples/cinder.conf.html
Could anyone give any advice? What is the principle behind this behavior? Thanks.

According to the default cinder.conf, there are two steps (filter and weigh) that determine which host the volume is placed on.
# Which filter class names to use for filtering hosts when not specified in the
# request. (list value)
#scheduler_default_filters = AvailabilityZoneFilter,CapacityFilter,CapabilitiesFilter
# Which weigher class names to use for weighing hosts. (list value)
#scheduler_default_weighers = CapacityWeigher
1. We use the default availability zone and do not specify a volume type when creating the new volume, so the AvailabilityZoneFilter and CapabilitiesFilter return True.
2. The CapacityFilter returns True for a host if there is sufficient capacity to create a volume of the requested size (a worked example follows below).
The available capacity is calculated as:
total_physical_capacity * max_over_subscription_ratio - backend_provisioned
total_physical_capacity: the pool's TOTAL size as reported by `ceph df`
max_over_subscription_ratio: defaults to 20.0, meaning the provisioned capacity can be 20 times the total physical capacity
backend_provisioned: the pool's PROVISIONED TOTAL size as reported by `rbd du -p pool_name`
3. The CapacityWeigher then selects the host with the most available capacity for the volume.
So in some situations new volumes seem to be distributed to the backend pools at random; actually it is the CapacityWeigher at work.
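For illustration, here is the calculation with hypothetical numbers (pure assumptions for the example, not values from my cluster):
ceph_common: 100 TiB total * 20.0 - 1900 TiB provisioned = 100 TiB virtual free capacity
ceph_specs:   20 TiB total * 20.0 -  100 TiB provisioned = 300 TiB virtual free capacity
Both pools pass the CapacityFilter for a 10 GiB volume, but the CapacityWeigher picks ceph_specs because its virtual free capacity is larger, even though its MAX AVAIL in `ceph df` is smaller.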
How I solved my problem:
1. Create a volume type that pins the backend: openstack volume type create --property volume_backend_name='ceph_common' vol_type_common
2. Configure default_volume_type = vol_type_common in all Cinder components (cinder-api, cinder-scheduler, cinder-volume).
3. References: Default Volume Types, Untyped volumes to default volume type
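For reference, a minimal sketch of the whole change (the backend name ceph_common and the exact config layout are assumptions based on my own setup, not copied from the docs):
# Create the volume type and pin it to the backend
openstack volume type create --property volume_backend_name='ceph_common' vol_type_common
# In cinder.conf on every node running cinder-api, cinder-scheduler or cinder-volume:
# [DEFAULT]
# default_volume_type = vol_type_common
# Restart the Cinder services afterwards so the new default takes effect.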


Is there a way to calculate the total disk space used by each pod on nodes?

Context
Our current context is the following: researchers are running HPC calculations on our Kubernetes cluster. Unfortunately, some pods cannot get scheduled because the container engine (here Docker) is not able to pull the images because the node is running out of disk space.
Hypotheses
images too big
The first hypothesis is that the images are too big. This is probably the case because we know that some images are bigger than 7 GB.
datasets being decompressed locally
Our second hypothesis is that some people are downloading their datasets locally (e.g. with curl ...) and decompressing them there. This would generate the behavior we are observing.
Envisioned solution
I believe that this problem is a good case for a DaemonSet that would have access to the node's file system. Typically, this pod would calculate the total disk space used by all the pods on the node and expose it as a Prometheus metric. From there it would be easy to set alert rules to check which pods have grown a lot over a short period of time.
How to calculate the total disk space used by a pod?
The question then becomes: is there a way to calculate the total disk space used by a pod?
Does anyone have any experience with this?
Kubernetes does not track overall storage available. It only knows things about emptyDir volumes and the filesystem backing those.
For calculating the total disk space you can use the command below:
kubectl describe nodes
From the output you can grep for ephemeral-storage, which is the virtual disk size; this partition is also shared and consumed by Pods via emptyDir volumes, image layers, container logs, and container writable layers.
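If you need per-pod usage rather than the node total, one option (a sketch; <node_name> is a placeholder and jq is assumed to be installed) is the kubelet summary API, which reports ephemeral-storage usage per pod:
kubectl get --raw "/api/v1/nodes/<node_name>/proxy/stats/summary" \
  | jq '.pods[] | {pod: .podRef.name, namespace: .podRef.namespace, usedBytes: .["ephemeral-storage"].usedBytes}'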
Check whether a process is still running and holding file descriptors and/or some space (there may be other processes holding file descriptors that are not being released either). Check whether it is the kubelet.
You can verify by running $ ps -Af | grep xxxx
With Prometheus you can calculate it with the formula below:
sum(node_filesystem_size_bytes)
Please go through Get total and free disk space using Prometheus for more information.
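For example, a sketch of a query for used space per node (standard node-exporter metrics; adjust the mountpoint filter to your environment):
sum by (instance) (node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_avail_bytes{mountpoint="/"})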

Ceph reports bucket space utilization and total cluster utilization that are inconsistent

I copied the contents of an older Ceph cluster to a new Ceph cluster using rclone. Because several of the buckets had tens of millions of objects in a single directory, I had to enumerate these individually and use the "rclone copyto" command to move them. After copying, the object counts match, but the space utilization on the second Ceph cluster is much higher.
Each Ceph cluster is configured with the default triple redundancy.
The older Ceph cluster has 1.4PiB of raw capacity.
The older Ceph cluster has 526TB in total bucket utilization as reported by "radosgw-admin metadata bucket stats". The "ceph -s" status on this cluster shows 360TiB of object utilization with a total capacity of 1.4PiB for 77% space utilization. The two indicated quantities of 360TiB used in the cluster and 526TB used by buckets are significantly different. There isn't enough raw capacity on this cluster to hold 526TB.
After copying the contents to the new Ceph cluster, the total bucket utilization of 553TB is reflected in the "ceph -s" status as 503TiB. This is slightly higher than the bucket total of the source I assume due to larger drive's block sizes, but the status utilization matches the sum of the bucket utilization as expected. The number of objects in each bucket of the destination cluster matches the source buckets also.
Is this a setting in the first Ceph cluster that merges duplicate objects like a simplistic compression? There isn't enough capacity in the first Ceph cluster to hold much over 500TB so this seems like the only way this could happen. I assume that when two objects are the same, that each bucket gets a symlink like pointer to the same object. The new Ceph cluster doesn't seem to have this capability or it's not set to behave this way.
The first cluster is Ceph version 13.2.6 and the second is version 17.2.3.

Ulimits on AWS ECS Fargate

The default ULIMIT "NOFILE" is set to 1024 for containers launched using Fargate. So if I have a cluster of let's say 10 services with two or three tasks each (all running on Fargate), what are the implications if I set them all to use a huge NOFILE number such as 900000?
More specifically, do we need to care about the host machine? My assumption is that if I were using the EC2 launch type and set all my tasks to effectively use as many files as they wanted, the hosting EC2 instance(s) could easily get overwhelmed. Or maybe the hosts wouldn't get overwhelmed, but the containers registered on them would get a first-come-first-served share of the files they can open, potentially leading to one service starving another? But since with Fargate we don't manage the instances, what's the harm in setting the ULIMIT as high as possible for all services? Do our containers sit side by side on a host and therefore share the host's resource limits, or do we get a host per service / per task?
Of course it's possible my assumptions are wrong about how this all works.
The maximum nofile limit on Fargate is 4096.
Amazon ECS tasks hosted on Fargate use the default resource limit values set by the operating system with the exception of the nofile resource limit parameter which Fargate overrides. The nofile resource limit sets a restriction on the number of open files that a container can use. The default nofile soft limit is 1024 and hard limit is 4096.
https://docs.aws.amazon.com/AmazonECS/latest/userguide/task_definition_parameters.html
A slight correction on this answer. As the linked documentation states, these are the DEFAULT soft and hard limits for ulimit nofile. You can override them by updating your ECS Task Definition. The Ulimit settings go under the ContainerDefinitions section of the Task Definition.
I've successfully set the soft and hard limits for nofile on some of my AWS Fargate Tasks using this method.
So while you cannot use the Linux "ulimit -n" command to change this on the fly, you can alter it via the ECS Task Definition.
EDIT:
I've done some testing and for my setup, running AWS ECS Fargate on a Python Bullseye distro, I was able to max out at NOFILE = 1024 x 1024 = 1048576 files.
{
  "ulimits": [
    {
      "name": "nofile",
      "softLimit": 1048576,
      "hardLimit": 1048576
    }
  ]
}
Any larger multiple of this (1024 x 1024 x INT) caused ECS to report an error when trying to start the ECS Fargate task:
CannotStartContainerError: ResourceInitializationError: failed to create new container runtime task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container
Hope this helps someone.
Please refer to:
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-ecs-taskdefinition-containerdefinitions-ulimit.html
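To confirm the limit that actually applied inside a running task, one option (a sketch, assuming ECS Exec is enabled on the task; the cluster, task and container names are placeholders) is:
aws ecs execute-command \
  --cluster my-cluster \
  --task <task-id> \
  --container my-container \
  --interactive \
  --command "sh -c 'ulimit -n'"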

Monitoring persistent volume performance

Use case
I am operating a Kafka cluster in Kubernetes which is heavily dependent on proper disk performance (IOPS, throughput, etc.). I am using Google Compute Engine disks + Google Kubernetes Engine. Thus I know that the disks I created have the following approximate limits:
IOPS (Read/Write): 375 / 750
Throughput in MB/s (Read/Write): 60 / 60
The problem
Even though I know the approximate IOPS and throughput limits, I have no idea what I am actually using at the moment. I'd like to monitor it with Prometheus + Grafana, but I couldn't find anything that exports disk I/O stats for persistent volumes. The best I found were disk space stats from the kubelet:
kubelet_volume_stats_capacity_bytes
kubelet_volume_stats_available_bytes
The question
What possibilities do I have to monitor (preferably via prometheus) the disk io usage for my kafka persistent volumes attached in Kubernetes?
Edit:
Another thing I found is node-exporter's node_disk_io metrics:
rate(node_disk_io_time_seconds_total[5m]) * 100
Unfortunately the result doesn't contain a node name, or even a persistent volume (claim) name. Instead it has a device label (e.g. 'sdb') and an instance label (e.g. '10.90.206.10'), which are the only labels that would somehow allow me to monitor a specific persistent volume. The downside of these labels is that they are dynamic and can change with a pod restart or similar.
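For completeness, node-exporter also exposes per-device throughput and IOPS counters (metric names from node-exporter 0.16+; they carry the same device/instance labels, so the same caveat applies):
# Read / write throughput in bytes per second
rate(node_disk_read_bytes_total[5m])
rate(node_disk_written_bytes_total[5m])
# Read / write IOPS
rate(node_disk_reads_completed_total[5m])
rate(node_disk_writes_completed_total[5m])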
You should be able to get the metrics that you are looking for using Stackdriver. Check the new Stackdriver Kubernetes Monitoring.
You can use this QWikiLab to test the tools without installing anything in your environment.
You can use Stackdriver Monitoring to see the disk I/O of an instance. You can use the Cloud Console and go to the VM instance -> Monitoring page to find it.

Can the operating system use EBS transparently?

I understand we can attach and detach a volume to an instance dynamically. My question is: will the OS allocate these physical resources automatically, or does it have to be configured by the user, i.e. create a mount point for the file system and explicitly tell the application where the mount point is?
I use this CloudFormation template to deploy MongoDB to AWS. The template gives users the option to specify the volume size that hosts the database server. I just wonder, even if I allocate the physical resource, how can the template use it? How can I know which volume the data resides on? When I try to detach one of the volumes from the instance, things just break. But I am sure I do not need so many volumes to host the data.
Yes, you need to do that manually. As soon as you create an EBS volume and attach it to an instance, you need to follow these steps (on Linux systems):
- Check that the volume is attached and get its name.
lsblk
- Format the newly attached volume.
mkfs -t ext4 /dev/<volume_name>
- Create a mount point.
mkdir mount_point
- Mount the volume on the mount point.
mount /dev/<volume_name> mount_point
- Verify the newly attached partition.
df -hT
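Putting it together, a minimal sketch (the device name /dev/xvdf is an assumption; check lsblk for the real name, and note that mkfs wipes any existing data):
lsblk                          # find the new device, e.g. /dev/xvdf
sudo file -s /dev/xvdf         # shows "data" if the volume has no filesystem yet
sudo mkfs -t ext4 /dev/xvdf    # format the empty volume
sudo mkdir /mnt/data           # create a mount point
sudo mount /dev/xvdf /mnt/data # mount the volume
df -hT                         # verify the new mount
# Optionally add an /etc/fstab entry so the mount survives reboots.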
I can't see your CloudFormation template.