Is there a reason not to share hosts for OSDs and Radosgw in a Ceph setup? - ceph

I am performance testing Ceph. I have a limited number of VMs to do this with. I want to have several radosgws, for a round-robin set up. Will my bechmarks be grossly inaccurate if I use the same hosts for OSDs and radosgw?

Main issue with sharing OSD with any other part of installation, is a thread count. Ceph OSD daemon creates a lot of threads during high load (you want to use Ceph under high load, aren't you?). I can't say how many threads radosgw creates, but it is a well known problem with scenario 'OSDs on compute hosts'. When you have too many threads, OS scheduler starts to mess up with them, threshing CPU cache and significantly drops performance (and raises latencies).

Ceph RGW is light weight process, does not require much CPU and Memory but it does require Network bandwidth. IMO you can collocate RGWs and OSDs provided that you have dedicated Ceph cluster and public networks and RGW should use Ceph public network.
I have done a similar kind of performance benchmarking which includes co-located and dedicated RGWs. I have not found significant performance difference between the two configurations. Co-located RGWs were performing a bit less ( but not substantial difference ).
So if one has to design a low cost object storage solution based on Ceph , then he might want to consider co-locating RGWs on OSDs. You can save some $$
FYI , co-located RGW configuration is not a supported configuration from RedHat point of view. Things are progressing preety fast in that direction.

Related

Rightsizing Kubernetes Nodes | How much cost we save when we switch from VMs to containers

We are running 4 different micro-services on 4 different ec2 autoscaling groups:
service-1 - vcpu:4, RAM:32 GB, VM count:8
service-2 - vcpu:4, RAM:32 GB, VM count:8
service-3 - vcpu:4, RAM:32 GB, VM count:8
service-4 - vcpu:4, RAM:32 GB, VM count:16
We are planning to migrate this workload on EKS (in containers)
We need help in deciding the right node configuration (in EKS) to start with.
We can start with a small machine vcpu:4, RAM:32 GB, but will not get any cost saving as each container will need a separate vm.
We can use a large machine vcpu:16, RAM: 128 GB, but when these machines scale out, scaled out machine will be large and thus can be underutiliized.
Or we can go with a Medium machine like vcpu: 8, RAM:64 GB.
Other than this recommendation, we were also evaluating the cost saving of moving to containers.
As per our understanding, every VM machine comes with following overhead
Overhead of running hypervisor/virtualisation
Overhead of running separate Operating system
Note: One large VM vs many small VMs cost the same on public cloud as cost is based on number of vCPUs + RAM.
Hypervisor/virtualization cost is only valid if we are running on-prem, so no need to consider this.
On the 2nd point, how much resources a typical linux machine can take to run a OS? If we provision a small machine (vcpu:2, RAM:4GB), an approximate cpu usage is 0.2% and memory consumption (other than user space is 500Mb).
So, running large instances (count:5 instances in comparison to small instances count:40) can save 35 times of this cpu and RAM, which does not seem significant.
You are unlikely to see any cost savings in resources when you move to containers in EKS from applications running directly on VM's.
A Linux Container is just an isolated Linux process with specified resource limits, it is no different from a normal process when it comes to resource consumption. EKS still uses virtual machines to provide compute to the cluster, so you will still be running processes on a VM, regardless of containerization or not and from a resource point of view it will be equal. (See this answer for a more detailed comparison of VM's and containers)
When you add Kubernetes to the mix you are actually adding more overhead compared to running directly on VM's. The Kubernetes control plane runs on a set of dedicated VM's. In EKS those are fully managed in a PaaS, but Amazon charges a small hourly fee for each cluster.
In addition to the dedicated control plane nodes, each worker node in the cluster need a set of programs (system pods) to function properly (kube-proxy, kubelet etc.) and you may also define containers that must run on each node (daemon sets), like log collectors and security agents.
When it comes to sizing the nodes you need to find a balance between scaling and cost optimization.
The larger the worker node is the smaller the relative overhead of system pods and daemon sets become. In theory a worker node large enough to accommodate all your containers would maximize resources consumed by your applications compared to supporting applications on the node.
The smaller the worker nodes are the smaller the horizontal scaling steps can be, which is likely to reduce waste when scaling. It also provides better resilience as a node failure will impact fewer containers.
I tend to prefer nodes that are small so that scaling can be handled efficiently. They should be slightly larger than what is required from the largest containers, so that system pods and daemon sets also can fit.

Is there a way to force the use of the same physical CPU while allocating cores to a pod in Kubernetes?

I was wondering if it was possible to force Kubernetes to allocate the cores from the same CPU while spinning up a POD. What I would like Kubernetes to do is, as new PODs are created, the cores allocated to them should come from -let's say- CPU1 as long as there are cores still available on it. CPU2's, CPU3's, etc. cores should not be used in the newly initiated pod. I would like my PODs to have cores allocated from a single CPU as long as it is possible.
Is there a way to achieve this?
Also, is there a way to see from which physical CPUs the cores(cpu) of a POD is coming from?
Thanks a lot.
Edit: Let me explain why I want to do this.
We are running a Spark cluster on Kubernetes. The lead of system/linux administration team warned us about the concept of NUMA. He told us that we could improve the performance of our executor pods if we were to allocate the cores from the same physical CPU. That is why I started digging into this.
I found this Kubernetes CPU Manager. The documentation says:
CPU Manager allocates CPUs in a topological order on a best-effort
basis. If a whole socket is free, the CPU Manager will exclusively
allocate the CPUs from the free socket to the workload. This boosts
the performance of the workload by avoiding any cross-socket traffic.
Also on the same page:
Allocate all the logical CPUs (hyperthreads) from the same physical
CPU core if available and the container requests an entire core worth
of CPUs.
So now I am starting to think maybe what I need is to enable the static policy for the CPU manager to get what I want.

Performance Postgresql on Local Volume K8s

currently I recently switched our PostgreSQL cluster from a simple "bare-metal" (vms) workload to a containerised K8s cluster (also on vms).
Currently we run zalando-incubator/postgres-operator and use Local Volume's with volumeMode: FileSystem the volume itself is a "simple" xfs volume mounted on the host.
However we actually seen performance drops up to 50% on the postgres cluster inside k8s.
Some heavy join workloads actually perform way worse than on the old cluster which did not use containers at all.
Is there a way to tune the behavior or at least measure the performance of I/O to find the actual bottleneck (i.e. what is a good way to measure I/O, etc.)
Is there a way to tune the behavior
Be cognizant of two things that might be impacting your in-cluster behavior: increased cache thrashing and the inherent problem of running concurrent containers on a Node. If you haven't already tried it, you may want to use taints and tolerations to sequester your PG Pods away from other Pods and see if that helps.
what is a good way to measure I/O, etc.
I would expect the same iostat tools one is used to using would work on the Node, since no matter how much kernel namespace trickery is going on, it's still the Linux kernel.
Prometheus (and likely a ton of other such toys) surfaces some I/O specific metrics for containers, and I would presume they are at the scrape granularity, meaning you can increase the scrape frequency, bearing in mind the observation cost impacting your metrics :-(
It appears new docker daemons ship with Prom metrics, although I don't know what version introduced that functionality. There is a separate page discussing the implications of high frequency metric collection. There also appears to be a Prometheus exporter for monitoring arbitrary processes, above and beyond the PostgreSQL specific exporter.
Getting into my opinion, it may be a very reasonable experiment to go head-to-head with ext4 versus a non-traditional FS like xfs. I can't even fathom how much extra production experience has gone into ext4, merely by the virtue of almost every Linux on the planet deploying on it by default. You may have great reasons for using xfs, but I just wanted to ensure you had at least considered that xfs might have performance characteristics that make it problematic in a shared environment like a kubernetes cluster.

How to manage page cache resources when running Kafka in Kubernetes

I've been running Kafka on Kubernetes without any major issue for a while now; however, I recently introduced a cluster of Cassandra pods and started having performance problems with Kafka.
Even though Cassandra doesn't use page cache like Kafka does, it does make frequent writes to disk, which presumably effects the kernel's underlying cache.
I understand that Kubernetes pods are managing memory resources through cgroups, which can be configured by setting memory requests and limits in Kubernetes, but I've noticed that Cassandra's utilization of page cache can increase the number of page faults in my Kafka pods even when they don't seem to be competing for resources (i.e., there's memory available on their nodes).
In Kafka more page faults leads to more writes to disk, which hamper the benefits of sequential IO and compromise disk performance. If you use something like AWS's EBS volumes, this will eventually deplete your burst balance and eventually cause catastrophic failures across your cluster.
My question is, is it possible to isolate page cache resources in Kubernetes or somehow let the kernel know that pages owned by my Kafka pods should be kept in the cache longer than those in my Cassandra pods?
I thought this was an interesting question, so this is a posting of some findings from a bit of digging.
Best guess: there is no way with k8s OOB to do this, but enough tooling is available such that it could be a fruitful area for research and development of a tuning and policy application that could be deployed as a DaemonSet.
Findings:
Applications can use the fadvise() system call to provide guidance to the kernel regarding which file-backed pages are needed by the application and which are not and can be reclaimed.
http://man7.org/linux/man-pages/man2/posix_fadvise.2.html
Applications can also use O_DIRECT to attempt to avoid the use of page cache when doing IO:
https://lwn.net/Articles/457667/
There is some indication that Cassandra already uses fadvise in a way that attempts to optimize for reducing its page cache footprint:
http://grokbase.com/t/cassandra/commits/122qha309v/jira-created-cassandra-3948-sequentialwriter-doesnt-fsync-before-posix-fadvise
There is also some recent (Jan 2017) research from Samsung patching Cassandra and fadvise in the kernel to better utilize multi-stream SSDs:
http://www.samsung.com/us/labs/pdfs/collateral/Multi-stream_Cassandra_Whitepaper_Final.pdf
Kafka is page cache architecture aware, though it doesn't appear to use fadvise directly. The knobs available from the kernel are sufficient for tuning Kafka on a dedicated host:
vm.dirty* for guidance on when to get written-to (dirty) pages back onto disk
vm.vfs_cache_pressure for guidance on how aggressive to be in using RAM for page cache
Support in the kernel for device-specific writeback threads goes way back to the 2.6 days:
https://www.thomas-krenn.com/en/wiki/Linux_Page_Cache_Basics
Cgroups v1 and v2 focus on pid-based IO throttling, not file-based cache tuning:
https://andrestc.com/post/cgroups-io/
That said, the old linux-ftools set of utilities has a simple example of a command-line knob for use of fadvise on specific files:
https://github.com/david415/linux-ftools
So there's enough there. Given specific kafka and cassandra workloads (e.g. read-heavy vs write-heavy), specific prioritizations (kafka over cassandra or vice versa) and specific IO configurations (dedicated vs shared devices), one could emerge with a specific tuning model, and those could be generalized into a policy model.

Should I use SSD or HDD as local disks for kubernetes cluster?

Is it worth using SSD as boot disk? I'm not planning to access local disks within pods.
Also, GCP by default creates 100GB disk. If I use 20GB disk, will it cripple the cluster or it's OK to use smaller sized disks?
Why one or the other?. Kubernetes (Google Conainer Engine) is mainly Memory and CPU intensive unless your applications need a huge throughput on the hard drives. If you want to save money you can create tags on the nodes with HDD and use the node-affinity to tweak which pods goes where so you can have few nodes with SSD and target them with the affinity tags.
I would always recommend SSD considering the small difference in price and large difference in performance. Even if it just speeds up the deployment/upgrade of containers.
Reducing the disk size to what is required for running your PODs should save you more. I cannot give a general recommendation for disk size since it depends on the OS you are using and how many PODs you will end up on each node as well as how big each POD is going to be. To give an example: When I run coreOS based images with staging deployments for nginx, php and some application servers I can reduce the disk size to 10gb with ample free room (both for master and worker nodes). On the extreme side - If I run self-contained golang application containers without storage need, each POD will only require a few MB space.