ReadWriteMany volumes on Kubernetes with terabytes of data

We want to deploy a k8s cluster which will run ~100 IO-heavy pods at the same time. They should all be able to access the same volume.
What we tried so far:
CephFS
was very complicated to set up. Hard to troubleshoot. In the end, it crashed a lot and the cause was not entirely clear.
Helm NFS Server Provisioner
runs pretty well, but when IO peaks a single replica is not enough. We could not get multiple replicas to work at all.
MinIO
is a great tool for creating storage buckets in k8s, but our operations require filesystem mounting. That is theoretically possible with s3fs, but since we run ~100 pods, we would need to run ~100 additional s3fs sidecars. That seems like a bad idea.
There has to be some way to get 2TB of data mounted in a GKE cluster with relatively high availability?
Cloud Filestore seems to work, but it's an order of magnitude more expensive than the other solutions, and with a lot of IO operations it quickly becomes infeasible.
I contemplated creating this question on Server Fault, but the k8s community there is a lot smaller than SO's.

I think I have a definitive answer as of Jan 2020, at least for our use case:
| Solution | Complexity | Performance | Cost |
|-----------------|------------|-------------|----------------|
| NFS | Low | Low | Low |
| Cloud Filestore | Low | Mediocre? | Per Read/Write |
| CephFS | High* | High | Low |
* On GKE you need one additional step: change the node base image to Ubuntu.
I haven't benchmarked Filestore myself, but I'll just go with stringy05's response: others have trouble getting really good throughput from it.
Ceph could be a lot easier if it were supported by Helm.
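For reference, here is a minimal sketch of what the NFS route looks like once the nfs-server-provisioner Helm chart is installed. The StorageClass name `nfs`, the claim name, and the image are assumptions; adjust them to your chart values.

```yaml
# PVC that many pods can mount simultaneously (ReadWriteMany).
# Assumes the nfs-server-provisioner chart created a StorageClass named "nfs".
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data
spec:
  accessModes:
    - ReadWriteMany            # the key difference vs. a regular GCE PD claim
  storageClassName: nfs        # assumption: matches the chart's storageClass name
  resources:
    requests:
      storage: 2Ti             # the ~2TB data set from the question
---
# Each worker pod simply references the same claim.
apiVersion: v1
kind: Pod
metadata:
  name: io-worker
spec:
  containers:
    - name: worker
      image: busybox           # placeholder image
      command: ["sh", "-c", "sleep 3600"]
      volumeMounts:
        - name: shared
          mountPath: /data
  volumes:
    - name: shared
      persistentVolumeClaim:
        claimName: shared-data
```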

Monitoring persistent volume performance

Use case
I am operating a Kafka cluster in Kubernetes which is heavily dependent on proper disk performance (IOPS, throughput, etc.). I am using Google Compute Engine disks with Google Kubernetes Engine, so I know that the disks I created have the following approximate limits:
IOPS (Read/Write): 375 / 750
Throughput in MB/s (Read/Write): 60 / 60
The problem
Even though I know the approximate IOPS and throughput limits, I have no idea what I am actually using at the moment. I'd like to monitor it with Prometheus + Grafana, but I couldn't find anything that exports disk IO stats for persistent volumes. The best I found were disk space stats from kubelet:
kubelet_volume_stats_capacity_bytes
kubelet_volume_stats_available_bytes
The question
What possibilities do I have to monitor (preferably via prometheus) the disk io usage for my kafka persistent volumes attached in Kubernetes?
Edit:
Another thing I found is node-exporter's node_disk_io_time_seconds_total metric:
rate(node_disk_io_time_seconds_total[5m]) * 100
Unfortunately the result doesn't contain a node name, or even a persistent volume (claim) name. Instead it has a device label (e.g. 'sdb') and an instance label (e.g. '10.90.206.10'), which are the only labels that would somehow allow me to monitor a specific persistent volume. The downside of these labels is that they are dynamic and can change with a pod restart or similar.
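For reference, a sketch of Prometheus recording rules built on the standard node-exporter disk metrics; the rule names are made up, and attributing a device/instance pair to a specific PV or PVC still has to be done by hand.

```yaml
# Per-device disk IO recording rules (node-exporter >= 0.16 metric names).
groups:
  - name: disk-io
    rules:
      - record: instance_device:disk_iops:rate5m
        expr: >
          rate(node_disk_reads_completed_total[5m])
          + rate(node_disk_writes_completed_total[5m])
      - record: instance_device:disk_throughput_bytes:rate5m
        expr: >
          rate(node_disk_read_bytes_total[5m])
          + rate(node_disk_written_bytes_total[5m])
      - record: instance_device:disk_utilization_percent:rate5m
        expr: rate(node_disk_io_time_seconds_total[5m]) * 100
```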
You should be able to get the metrics that you are looking for using Stackdriver. Check the new Stackdriver Kubernetes Monitoring.
You can use this Qwiklab to test the tools without installing anything in your environment.
You can use Stackdriver Monitoring to see the disk I/O of an instance. You can use the Cloud Console and go to the VM instance -> Monitoring page to find it.

Large number of worker nodes vs. few worker nodes with more resources

Is it preferable to have a Kubernetes cluster with 4 nodes of 4 CPUs and 16 GB RAM each, or a 2-node cluster where each node has 8 CPUs and 32 GB RAM?
What benefits does a user get from horizontal scaling over vertical scaling in Kubernetes? I mean, suppose we want to run 4 pods: is it better to go with a 2-node cluster of 8 CPUs and 32 GB RAM each, or a 4-node cluster of 4 CPUs and 16 GB RAM each?
In general I would recommend larger nodes because it's easier to place containers on them.
If you have a pod that requests resources: {requests: {cpu: 2.5}}, you can only place one of them on a 4-core node (and only two across 2x 4-core nodes), but you can fit three on a single 8-core node.
+----+----+----+----+ +----+----+----+----+
|-WORKLOAD--| | |-WORKLOAD--| |
+----+----+----+----+ +----+----+----+----+
+----+----+----+----+----+----+----+----+
|-WORKLOAD--|--WORKLOAD--|-WORKLOAD--| |
+----+----+----+----+----+----+----+----+
If you have 16 cores total and 8 cores allocated, it's possible that no single node has more than 2 cores free with 4x 4-CPU nodes, but you're guaranteed to be able to fit that pod with 2x 8-CPU nodes.
+----+----+----+----+ +----+----+----+----+
|-- USED -| | |-- USED -| |
+----+----+----+----+ +----+----+----+----+
+----+----+----+----+ +----+----+----+----+
|-- USED -| | |-- USED -| |
+----+----+----+----+ +----+----+----+----+
Where does |-WORKLOAD--| go?
+----+----+----+----+----+----+----+----+
|------- USED ------| |
+----+----+----+----+----+----+----+----+
+----+----+----+----+----+----+----+----+
|------- USED ------| |
+----+----+----+----+----+----+----+----+
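For concreteness, the request discussed above would look something like this in the pod spec (the name and image are illustrative):

```yaml
# A pod requesting 2.5 CPUs: the scheduler can bin-pack only one of these
# onto a 4-core node, but three onto an 8-core node.
apiVersion: v1
kind: Pod
metadata:
  name: workload
spec:
  containers:
    - name: app
      image: example/app:latest   # placeholder image
      resources:
        requests:
          cpu: "2.5"
```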
At the specific scale you're talking about, though, I'd be a little worried about running a 2-node cluster: if a single node dies you've lost half your cluster capacity. Unless I knew that I was running multiple pods that needed 2.0 CPU or more I might lean towards the 4-node setup here so that it will be more resilient in the event of node failure (and that does happen in reality).
Horizontal Autoscaling
Pros
Likely to have more capacity, since you are adding VMs and/or servers. You are essentially expanding your cluster.
In theory, more redundancy since you are spreading your workloads across different physical servers.
Cons
In theory, it's slower: provisioning servers and VMs takes longer than starting more pods/containers on the same machine (as vertical autoscaling does).
Also, you need to provision both servers/VMs and containers/pods when you scale up.
Doesn't work that well with plain bare-metal infrastructure/servers.
Vertical Autoscaling
Pros
In theory, it should be faster to autoscale if you have large servers provisioned. (Also, faster response)
If you have data-intensive apps you might benefit from workloads running on the same machines.
Great if you have a few extra unused bare-metal servers.
Cons
If you have large servers provisioned you may waste a lot of resources.
You need to calculate the capacity of your workloads more precisely (this could be a pro or a con, depending on how you see it).
If you have a fixed set of physical servers, you will run into eventual limitations of CPUs, Storage, Memory, etc.
Generally, you'd want to have a combination of both Horizontal and Vertical autoscaling.
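As a rough illustration of the horizontal side at the pod level, here is a minimal HorizontalPodAutoscaler sketch (the Deployment name and targets are assumptions; node-level horizontal scaling would be the cluster autoscaler on top of this, and the vertical side would be a VerticalPodAutoscaler or simply bigger node sizes):

```yaml
# Scales the number of replicas between 2 and 10 based on average CPU usage.
# autoscaling/v2 on recent clusters (older clusters use autoscaling/v2beta2).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app               # placeholder Deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```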

Performance Postgresql on Local Volume K8s

I recently switched our PostgreSQL cluster from a simple "bare-metal" (VM) deployment to a containerised K8s cluster (also on VMs).
We currently run zalando-incubator/postgres-operator and use local volumes with volumeMode: Filesystem; the volume itself is a "simple" XFS volume mounted on the host.
However, we have actually seen performance drops of up to 50% on the Postgres cluster inside k8s.
Some heavy join workloads actually perform far worse than on the old cluster, which did not use containers at all.
Is there a way to tune the behavior, or at least to measure I/O performance to find the actual bottleneck (i.e. what is a good way to measure I/O, etc.)?
Is there a way to tune the behavior
Be cognizant of two things that might be impacting your in-cluster behavior: increased cache thrashing and the inherent problem of running concurrent containers on a Node. If you haven't already tried it, you may want to use taints and tolerations to sequester your PG Pods away from other Pods and see if that helps.
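A minimal sketch of that sequestering idea; the taint key/value, node name, and image are made up, and with the zalando operator you would put the toleration into the operator's pod template rather than into a raw Pod:

```yaml
# 1) Taint a node so only pods tolerating "dedicated=postgres" are scheduled there:
#      kubectl taint nodes <node-name> dedicated=postgres:NoSchedule
# 2) Give the PG pods a matching toleration, plus a nodeSelector so they actually
#    land on that node rather than merely being allowed to:
apiVersion: v1
kind: Pod
metadata:
  name: pg-example
spec:
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "postgres"
      effect: "NoSchedule"
  nodeSelector:
    dedicated: postgres        # assumes the node is also labelled dedicated=postgres
  containers:
    - name: postgres
      image: postgres:12       # placeholder image
```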
what is a good way to measure I/O, etc.
I would expect the same iostat tools one is used to using would work on the Node, since no matter how much kernel namespace trickery is going on, it's still the Linux kernel.
Prometheus (and likely a ton of other such toys) surfaces some I/O specific metrics for containers, and I would presume they are at the scrape granularity, meaning you can increase the scrape frequency, bearing in mind the observation cost impacting your metrics :-(
It appears new docker daemons ship with Prom metrics, although I don't know what version introduced that functionality. There is a separate page discussing the implications of high frequency metric collection. There also appears to be a Prometheus exporter for monitoring arbitrary processes, above and beyond the PostgreSQL specific exporter.
Getting into my opinion, it may be a very reasonable experiment to go head-to-head with ext4 versus a non-traditional FS like xfs. I can't even fathom how much extra production experience has gone into ext4, merely by the virtue of almost every Linux on the planet deploying on it by default. You may have great reasons for using xfs, but I just wanted to ensure you had at least considered that xfs might have performance characteristics that make it problematic in a shared environment like a kubernetes cluster.
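If you want a number rather than a feeling, one way to measure raw I/O from inside the cluster is a throwaway fio Job run against a claim on the same local-volume class, once with XFS and once with ext4. The image, claim name, and fio parameters below are assumptions, not anything from the original setup.

```yaml
# One-off Job running a 4k random read/write fio benchmark against a PVC.
apiVersion: batch/v1
kind: Job
metadata:
  name: fio-bench
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: fio
          image: your-registry/fio:latest   # placeholder; any image containing fio works
          command:
            - fio
            - --name=randrw
            - --directory=/data
            - --rw=randrw
            - --bs=4k
            - --size=1G
            - --numjobs=4
            - --time_based
            - --runtime=60
            - --group_reporting
          volumeMounts:
            - name: bench
              mountPath: /data
      volumes:
        - name: bench
          persistentVolumeClaim:
            claimName: pg-bench-claim       # assumption: a claim on the local-volume class
```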

Is there a reason not to share hosts for OSDs and Radosgw in a Ceph setup?

I am performance testing Ceph. I have a limited number of VMs to do this with. I want to have several radosgws for a round-robin setup. Will my benchmarks be grossly inaccurate if I use the same hosts for OSDs and radosgw?
The main issue with sharing OSD hosts with any other part of the installation is thread count. The Ceph OSD daemon creates a lot of threads under high load (and you do want to use Ceph under high load, don't you?). I can't say how many threads radosgw creates, but this is a well-known problem with the 'OSDs on compute hosts' scenario. When there are too many threads, the OS scheduler starts to juggle them, thrashing the CPU cache, which significantly drops performance (and raises latencies).
Ceph RGW is a lightweight process; it does not require much CPU or memory, but it does require network bandwidth. IMO you can co-locate RGWs and OSDs provided that you have dedicated Ceph cluster and public networks, and the RGWs use the Ceph public network.
I have done a similar kind of performance benchmarking, which included co-located and dedicated RGWs. I did not find a significant performance difference between the two configurations. Co-located RGWs performed a bit worse, but the difference was not substantial.
So if you have to design a low-cost object storage solution based on Ceph, you might want to consider co-locating RGWs on the OSD hosts. You can save some $$.
FYI, a co-located RGW configuration is not a supported configuration from Red Hat's point of view, though things are progressing pretty fast in that direction.

Do you need to run RAID 10 on Mongo when using Provisioned IOPS on Amazon EBS?

I'm trying to set up a production Mongo system on Amazon to use as a datastore for a real-time metrics system.
I initially used the MongoDB AMIs[1] in the Marketplace, but I'm confused in that there is only one data EBS volume. I've read that Mongo recommends RAID 10 on EBS storage (8 EBS volumes on each server). Additionally, I've read that the bare minimum for production is a primary/secondary with an arbiter. Is RAID 10 still the recommended setup, or is one Provisioned IOPS EBS volume sufficient?
Please advise. We are a small shop, so what is the bare minimum we can get away with and still be reasonably safe?
[1] MongoDB 2.4 with 1000 IOPS - data: 200 GB @ 1000 IOPS, journal: 25 GB @ 250 IOPS, log: 10 GB @ 100 IOPS
So, I just got off of a call with an Amazon System Engineer, and he had some interesting insights related to this question.
First off, if you are going to use RAID, he said to simply do striping, as the EBS blocks are mirrored behind the scenes anyway, so RAID 10 seemed like overkill to him.
Standard EBS volumes tend to handle spiky traffic well (they may be able to handle 1K-2K IOPS for a few seconds), but eventually they tail off to an average of around 100 IOPS. One suggestion was to use many small EBS volumes and stripe them to get better IOPS throughput.
Some of his customers use just the ephemeral storage on the EC2 instances, but then have multiple (3-5) nodes in the availability set. The ephemeral storage is the storage on the physical machine. Apparently, if you use an EC2 instance with SSD storage, you can get up to 20K IOPS.
Some customers will use a huge EC2 instance with SSD storage for the primary, then a smaller EC2 instance with EBS for the secondary. The primary machine is performant; failover is available, but with degraded performance.
Make sure you check 'EBS Optimized' when you spin up an instance. That means you have a dedicated channel to the EBS storage (of any kind) instead of sharing the NIC.
Important! Provisioned IOPS EBS is expensive, and the bill does not shut off when you shut down the EC2 instances it is attached to (this sucks while you are testing). His advice was to take a snapshot of the EBS volumes, then delete them. When you need them again, just create new Provisioned IOPS EBS volumes, restore the snapshot, then reconfigure your EC2 instances to attach the new storage. (It's more work than it should be, but it's worth it not to get sucker-punched by the IOPS bill.)
I've got the same question. Both Amazon and MongoDB market Provisioned IOPS heavily, touting its advantages over a standard EBS volume. We run prod instances on m2.4xlarge AWS instances with 1 primary and 2 secondaries set up per service. In the most heavily utilized service cluster, apart from a few slow queries, the monitoring charts do not reveal any drop in performance at all. Page faults are rare, at between 0.0001 and 0.0004 faults once or twice a day. Background flushes are in milliseconds, and locks and queues are so far at manageable levels. I/O wait on the primary node at any time ranges between 0 and 2%, mostly less than 1%, and %idle steadily stays above the 90% mark. Do I still need to consider Provisioned IOPS, given that we still have budget to fix any potential performance drag? Any guidance will be appreciated.