I recently switched our PostgreSQL cluster from a simple "bare-metal" (VM) workload to a containerised K8s cluster (also on VMs).
Currently we run zalando-incubator/postgres-operator and use Local Volumes with volumeMode: FileSystem; the volume itself is a "simple" XFS volume mounted on the host.
However, we have seen performance drops of up to 50% on the Postgres cluster inside k8s.
Some heavy-join workloads perform far worse than on the old cluster, which did not use containers at all.
Is there a way to tune the behavior, or at least measure the I/O performance to find the actual bottleneck (i.e. what is a good way to measure I/O, etc.)?
Is there a way to tune the behavior
Be cognizant of two things that might be impacting your in-cluster behavior: increased cache thrashing and the inherent problem of running concurrent containers on a Node. If you haven't already tried it, you may want to use taints and tolerations to sequester your PG Pods away from other Pods and see if that helps.
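As a rough sketch of that experiment (node name, taint key and label are all made up here; I believe the Zalando operator's postgresql manifest also exposes toleration and node-affinity fields, so check its docs for the exact place to put them):

    # Reserve a node for PostgreSQL: only Pods with a matching toleration may land there.
    kubectl taint nodes db-node-1 dedicated=postgres:NoSchedule
    kubectl label nodes db-node-1 dedicated=postgres

    # In the PG Pod template, tolerate the taint and pin to the labelled node(s):
    spec:
      tolerations:
      - key: "dedicated"
        operator: "Equal"
        value: "postgres"
        effect: "NoSchedule"
      nodeSelector:
        dedicated: postgres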
what is a good way to measure I/O, etc.
I would expect the same iostat tools one is used to would work on the Node, since no matter how much kernel namespace trickery is going on, it's still the Linux kernel.
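For example, a minimal starting point on the Node might be (the cgroup path on the last line is only indicative, it varies with distro, cgroup version and kubelet configuration):

    # Per-device latency, queue depth and utilisation, refreshed every second:
    iostat -xz 1

    # Per-process read/write throughput, to see which postgres backends dominate:
    pidstat -d 1

    # Per-cgroup (i.e. per-Pod) block I/O counters; exact path depends on
    # cgroup v1 vs v2 and how the kubelet lays out the hierarchy:
    cat /sys/fs/cgroup/blkio/kubepods.slice/*/blkio.throttle.io_service_bytes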
Prometheus (and likely a ton of other such toys) surfaces some I/O specific metrics for containers, and I would presume they are at the scrape granularity, meaning you can increase the scrape frequency, bearing in mind the observation cost impacting your metrics :-(
It appears new docker daemons ship with Prom metrics, although I don't know what version introduced that functionality. There is a separate page discussing the implications of high frequency metric collection. There also appears to be a Prometheus exporter for monitoring arbitrary processes, above and beyond the PostgreSQL specific exporter.
Getting into my opinion, it may be a very reasonable experiment to go head-to-head with ext4 versus a non-traditional FS like xfs. I can't even fathom how much extra production experience has gone into ext4, merely by the virtue of almost every Linux on the planet deploying on it by default. You may have great reasons for using xfs, but I just wanted to ensure you had at least considered that xfs might have performance characteristics that make it problematic in a shared environment like a kubernetes cluster.
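If you want to test that, a crude A/B comparison might look like the following (device and mount point are placeholders; the 8k block size roughly mirrors PostgreSQL's page size):

    # Format a spare device with ext4 and mount it next to the existing XFS volume:
    mkfs.ext4 /dev/sdX1
    mount /dev/sdX1 /mnt/ext4-test

    # Run the same fio job against both mounts and compare IOPS / latency:
    fio --name=pgsim --directory=/mnt/ext4-test \
        --rw=randrw --rwmixread=70 --bs=8k --size=2G \
        --ioengine=libaio --direct=1 --numjobs=4 \
        --runtime=60 --time_based --group_reporting
    # ...then repeat with --directory pointed at the XFS mount.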
Related
I have Tomcat, ZooKeeper and Kafka deployed in a local k8s (kind) cluster. The database is remote, i.e. in the cloud. The pages load very slowly.
But when I moved Tomcat outside of the pod and started it manually, with ZooKeeper and Kafka still in the local k8s cluster and the DB in the remote cloud, the pages load fine.
Why is Tomcat very slow when inside a Kubernetes pod?
In theory, a program running in a container can run as fast as a program running on the host machine.
In practice, there are many things that can affect the performance.
When running on Windows or macOS (for instance with Docker Desktop), containers don't run directly on the machine, but in a small Linux virtual machine. This VM adds a bit of overhead, and it might not have as much CPU and RAM as the host environment. One way to look at the resource usage of containers is to use docker stats; or docker run -ti --pid host alpine and then use classic UNIX tools like free, top, vmstat, ... to see the resource usage in the VM.
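Concretely, that might look like this (the alpine image is just a convenient throwaway shell):

    # Per-container CPU / memory / I/O as seen by the Docker daemon:
    docker stats --no-stream

    # Enter the VM's PID namespace and poke around with classic tools:
    docker run -ti --rm --pid host alpine sh
    / # free -m      # RAM actually available to the VM
    / # top          # per-process CPU and memory inside the VM
    / # vmstat 1     # run queue, swap and I/O wait over time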
In most environments (at least with Docker, and with Kubernetes clusters in their most common default configurations), containers run without resource constraints and limits. However, it is fairly common (and, in fact, highly recommended!) to set resource requests and limits when running containers on Kubernetes. You can check resource limits of a pod with kubectl describe. If metrics-server is installed (which is recommended, even on dev/staging environments), you can check resource usage with kubectl top. Tools like k9s will show you resource requests, limits, and usage in a comprehensive way (as long as the data is available; i.e. you still need to install metrics-server to obtain pod metrics, for instance).
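A minimal sketch of what such requests and limits look like in a container spec (the numbers are placeholders; size them from what kubectl top actually reports):

    # Fragment of a Pod / Deployment container spec:
    resources:
      requests:
        cpu: "500m"       # the scheduler guarantees this much
        memory: "1Gi"
      limits:
        cpu: "2"          # the container is throttled above this
        memory: "2Gi"     # the container is OOM-killed above this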
In addition to the VM overhead described above, if the container does a lot of I/O (whether it's disk or network), there might be a bit of overhead in comparison to a native process. This can become noticeable if the container writes on the container copy-on-write filesystem (instead of a volume), especially when using the device-mapper storage driver.
Applications that use "live reload" techniques (that automatically rebuild or restart when source code is edited) are particularly prone to this I/O issue, because there are unfortunately no efficient methods to watch file modifications across a virtual machine boundary. This means that many web frameworks exhibit extreme performance degradations when running in containers on Mac or Windows when the source code is mounted to the container.
In addition to these factors, there can be other subtle differences that might affect the overall performance of a containerized application. When observing performance issues, it is very helpful to use a profiler (or some kind of APM solution) to see which parts of the code take longer to execute. If no profiler or APM is available, try to execute individual portions of the code independently to compare their performance. For instance, have a small piece of code that executes a single query to the database; or executes a single task from a job queue, etc.
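For instance, a very crude sketch of that idea (shown in Python for brevity; run_single_query is a placeholder for your own code path, and the point is to run it on your machine, from a Pod talking to the remote DB, and fully in-cluster, then compare the numbers):

    import time

    def run_single_query():
        """Placeholder: execute one representative query against your database here."""
        pass

    def timed(label, fn):
        # Wall-clock timing is crude, but enough to compare environments.
        start = time.perf_counter()
        fn()
        print(f"{label}: {time.perf_counter() - start:.3f}s")

    timed("single DB query", run_single_query)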
Good luck!
I have Python code where I process some data, write Neo4j queries and then commit these queries to Neo4j. When I run the code on my local machine and write the output to a local Neo4j, it doesn't take more than 15 minutes. However, when I run my code locally and write the output to a Neo4j pod in k8s, it takes double the time, and when I build my code, deploy it to k8s, run that pod and write the output to the Neo4j pod, it takes around 3 hours. Since I'm new to k8s deployment it might be something in the pod configuration or settings, so I would appreciate some hints.
There could be a few reasons for that.
I would first check how many resources your pod consumes while you are processing data; you can do that using kubectl top pod.
Second I would check if there are any limits inside pod. You can read a great deal about them on Managing Compute Resources for Containers.
If a limit is set, it might be too low, and that could be what's causing the extended processing time.
If limits are not set, then it might be down to how you installed minik8s. I think by default it is installed with 4G of memory; you can look at alternative methods of installing minik8s. With multipass you can specify how much memory to allocate.
There can also be an issue with Page Cache Sizing, Heap Sizing or the number of open files. Please read the Neo4j Performance Tuning guide.
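As a hedged example, those knobs live in neo4j.conf (these are the Neo4j 3.x/4.x setting names; the values are placeholders, and their sum has to fit inside the pod's memory limit):

    # neo4j.conf -- illustrative values only
    dbms.memory.heap.initial_size=2g
    dbms.memory.heap.max_size=2g
    dbms.memory.pagecache.size=4g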
Is there a way to enable caching between pods in a Kubernetes cluster? For example: let's say we have more than one pod running in high-availability mode, and we want to share some value between them using distributed caching between the pods. Is this possible?
There are some experimental projects to let you reuse the etcd that powers the cluster, but I probably wouldn’t. Just run your own using etcd-operator or something. The specifics will massively depend on what your exact use case and software is, distributed databases are among the most complex things ever.
I've been running Kafka on Kubernetes without any major issue for a while now; however, I recently introduced a cluster of Cassandra pods and started having performance problems with Kafka.
Even though Cassandra doesn't use the page cache the way Kafka does, it does make frequent writes to disk, which presumably affects the kernel's underlying cache.
I understand that Kubernetes pods are managing memory resources through cgroups, which can be configured by setting memory requests and limits in Kubernetes, but I've noticed that Cassandra's utilization of page cache can increase the number of page faults in my Kafka pods even when they don't seem to be competing for resources (i.e., there's memory available on their nodes).
In Kafka, more page faults lead to more writes to disk, which hampers the benefits of sequential I/O and compromises disk performance. If you use something like AWS's EBS volumes, this will deplete your burst balance and eventually cause catastrophic failures across your cluster.
My question is, is it possible to isolate page cache resources in Kubernetes or somehow let the kernel know that pages owned by my Kafka pods should be kept in the cache longer than those in my Cassandra pods?
I thought this was an interesting question, so this is a posting of some findings from a bit of digging.
Best guess: there is no way with k8s OOB to do this, but enough tooling is available such that it could be a fruitful area for research and development of a tuning and policy application that could be deployed as a DaemonSet.
Findings:
Applications can use the fadvise() system call to provide guidance to the kernel regarding which file-backed pages are needed by the application and which are not and can be reclaimed.
http://man7.org/linux/man-pages/man2/posix_fadvise.2.html
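For illustration, Python exposes this syscall directly (Linux only; the file path below is made up), so a sidecar or maintenance job could, in principle, ask the kernel to drop or prefetch a given file's pages:

    import os

    def drop_from_cache(path):
        """Advise the kernel that this file's cached pages can be reclaimed."""
        fd = os.open(path, os.O_RDONLY)
        try:
            size = os.fstat(fd).st_size
            os.posix_fadvise(fd, 0, size, os.POSIX_FADV_DONTNEED)
        finally:
            os.close(fd)

    # Hypothetical example: deprioritise a Cassandra SSTable in favour of Kafka's segments.
    drop_from_cache("/var/lib/cassandra/data/ks/table/md-1-big-Data.db")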
Applications can also use O_DIRECT to attempt to avoid the use of page cache when doing IO:
https://lwn.net/Articles/457667/
There is some indication that Cassandra already uses fadvise in a way that attempts to optimize for reducing its page cache footprint:
http://grokbase.com/t/cassandra/commits/122qha309v/jira-created-cassandra-3948-sequentialwriter-doesnt-fsync-before-posix-fadvise
There is also some recent (Jan 2017) research from Samsung patching Cassandra and fadvise in the kernel to better utilize multi-stream SSDs:
http://www.samsung.com/us/labs/pdfs/collateral/Multi-stream_Cassandra_Whitepaper_Final.pdf
Kafka's architecture is page-cache aware, though it doesn't appear to use fadvise directly. The knobs available from the kernel are sufficient for tuning Kafka on a dedicated host (a sketch of inspecting and setting them follows the list below):
vm.dirty* for guidance on when to get written-to (dirty) pages back onto disk
vm.vfs_cache_pressure for guidance on how aggressive to be in using RAM for page cache
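For example (the values here are purely illustrative, not a recommendation; they also apply node-wide, which is exactly the problem in a shared cluster):

    # Inspect the current values on the Node:
    sysctl vm.dirty_ratio vm.dirty_background_ratio vm.vfs_cache_pressure

    # Flush dirty pages earlier and make the kernel less eager to reclaim page cache:
    sysctl -w vm.dirty_background_ratio=5
    sysctl -w vm.dirty_ratio=20
    sysctl -w vm.vfs_cache_pressure=50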
Support in the kernel for device-specific writeback threads goes way back to the 2.6 days:
https://www.thomas-krenn.com/en/wiki/Linux_Page_Cache_Basics
Cgroups v1 and v2 focus on pid-based IO throttling, not file-based cache tuning:
https://andrestc.com/post/cgroups-io/
That said, the old linux-ftools set of utilities has a simple example of a command-line knob for use of fadvise on specific files:
https://github.com/david415/linux-ftools
So there's enough there. Given specific kafka and cassandra workloads (e.g. read-heavy vs write-heavy), specific prioritizations (kafka over cassandra or vice versa) and specific IO configurations (dedicated vs shared devices), one could emerge with a specific tuning model, and those could be generalized into a policy model.
I am performance testing Ceph. I have a limited number of VMs to do this with. I want to have several radosgws for a round-robin setup. Will my benchmarks be grossly inaccurate if I use the same hosts for OSDs and radosgw?
The main issue with sharing OSD hosts with any other part of the installation is thread count. The Ceph OSD daemon creates a lot of threads during high load (and you do want to use Ceph under high load, don't you?). I can't say how many threads radosgw creates, but this is a well-known problem with the "OSDs on compute hosts" scenario. When you have too many threads, the OS scheduler starts to mess with them, thrashing the CPU cache and significantly dropping performance (and raising latencies).
Ceph RGW is a lightweight process; it does not require much CPU or memory, but it does require network bandwidth. IMO you can collocate RGWs and OSDs provided that you have dedicated Ceph cluster and public networks, and the RGWs use the Ceph public network.
I have done similar performance benchmarking which included both co-located and dedicated RGWs. I did not find a significant performance difference between the two configurations; the co-located RGWs performed a bit worse, but not by a substantial margin.
So if one has to design a low-cost object storage solution based on Ceph, co-locating RGWs on the OSD hosts is worth considering. You can save some $$.
FYI, a co-located RGW configuration is not a supported configuration from Red Hat's point of view, though things are progressing pretty fast in that direction.