How to stop eviction policy in memcached?

I've come across a situation where I don't want any eviction policy (LRU) in my memcached server setup. How do I stop the eviction policy in memcached?
In other words, is there a no-eviction policy in memcached like Redis has?

By default, memcached always evicts: even if the items in your storage have no expiration set, once memory is full it will start evicting the least-recently-used items to make room for new writes. However, you can start memcached with the -M flag, which (per memcached -h) returns an error on memory exhaustion rather than removing items.
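A minimal sketch; the cache size and port are illustrative:
# start with 64 MB of cache and evictions disabled; when memory fills up,
# writes fail with an out-of-memory error instead of evicting old items
memcached -m 64 -M -p 11211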

Related

limit the amount of memory kube-controller-manager uses

I'm running v1.10 and I notice that kube-controller-manager's memory usage spikes and it OOMs all the time. It wouldn't be so bad if the system didn't slow to a crawl before this happens.
I tried modifying /etc/kubernetes/manifests/kube-controller-manager.yaml to set resources.limits.memory=1Gi, but the kube-controller-manager pod never seems to want to come back up.
Any other options?
There is a bug in kube-controller-manager, and it's fixed in https://github.com/kubernetes/kubernetes/pull/65339
First of all, you didn't mention how much memory your nodes have.
Second, what do you mean by "the system falls to a crawl"? Do you mean the nodes are swapping?
All Kubernetes masters and nodes are expected to have swap disabled. This is recommended by the Kubernetes community, as mentioned in the Kubernetes documentation.
Support for swap is non-trivial, and swapping degrades performance.
Turn off swap on every node by:
sudo swapoff -a
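Note that swapoff -a only lasts until the next reboot. To keep swap off permanently, comment out the swap entries in /etc/fstab; a sketch, assuming a standard fstab layout:
# comment out any swap lines so swap stays disabled across reboots
sudo sed -i.bak '/\sswap\s/ s/^/#/' /etc/fstab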
Finally,
resources.limits.memory=1Gi
is a hard limit per pod, not a guarantee. A pod that reaches this level of allocated memory can be OOM-killed, even if the node has gigabytes of unallocated memory.
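For reference, a minimal sketch of what that block looks like in the static manifest, assuming a kubeadm-style layout (the kubelet recreates the pod when the file changes; the request value is illustrative):
# edit the static pod manifest in place
sudo vi /etc/kubernetes/manifests/kube-controller-manager.yaml
# under spec.containers[0], the relevant section is:
#   resources:
#     requests:
#       memory: 512Mi    # illustrative; keep the request below the limit
#     limits:
#       memory: 1Gi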

How to manage page cache resources when running Kafka in Kubernetes

I've been running Kafka on Kubernetes without any major issue for a while now; however, I recently introduced a cluster of Cassandra pods and started having performance problems with Kafka.
Even though Cassandra doesn't use the page cache the way Kafka does, it does make frequent writes to disk, which presumably affects the kernel's underlying cache.
I understand that Kubernetes pods manage memory resources through cgroups, which can be configured by setting memory requests and limits in Kubernetes. However, I've noticed that Cassandra's use of the page cache can increase the number of page faults in my Kafka pods, even when they don't seem to be competing for resources (i.e., there's memory available on their nodes).
In Kafka, more page faults lead to more writes to disk, which hampers the benefits of sequential IO and compromises disk performance. If you use something like AWS's EBS volumes, this will deplete your burst balance and eventually cause catastrophic failures across your cluster.
My question is, is it possible to isolate page cache resources in Kubernetes or somehow let the kernel know that pages owned by my Kafka pods should be kept in the cache longer than those in my Cassandra pods?
I thought this was an interesting question, so this is a posting of some findings from a bit of digging.
Best guess: there is no way to do this with Kubernetes out of the box, but enough tooling is available that it could be a fruitful area for research and development of a tuning and policy application deployable as a DaemonSet.
Findings:
Applications can use the posix_fadvise() system call to tell the kernel which file-backed pages the application needs and which it does not, so that the latter can be reclaimed.
http://man7.org/linux/man-pages/man2/posix_fadvise.2.html
Applications can also use O_DIRECT to attempt to avoid the use of page cache when doing IO:
https://lwn.net/Articles/457667/
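Both behaviors can be exercised from the shell with GNU dd, which exposes them as flags; a sketch with illustrative paths:
# write 100 MB bypassing the page cache entirely via O_DIRECT
dd if=/dev/zero of=/tmp/directio-test bs=1M count=100 oflag=direct
# ask the kernel to drop cached pages for a file (posix_fadvise DONTNEED)
dd if=/tmp/directio-test iflag=nocache count=0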
There is some indication that Cassandra already uses fadvise in a way that attempts to optimize for reducing its page cache footprint:
http://grokbase.com/t/cassandra/commits/122qha309v/jira-created-cassandra-3948-sequentialwriter-doesnt-fsync-before-posix-fadvise
There is also some recent (Jan 2017) research from Samsung patching Cassandra and fadvise in the kernel to better utilize multi-stream SSDs:
http://www.samsung.com/us/labs/pdfs/collateral/Multi-stream_Cassandra_Whitepaper_Final.pdf
Kafka's architecture is page-cache aware, though it doesn't appear to use fadvise directly. The knobs available from the kernel are sufficient for tuning Kafka on a dedicated host (a sketch follows the list):
vm.dirty* for guidance on when to get written-to (dirty) pages back onto disk
vm.vfs_cache_pressure for guidance on how aggressive to be in using RAM for page cache
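On a dedicated host these are plain sysctls; a sketch where the values are illustrative starting points, not recommendations:
# flush dirty pages earlier so writeback happens in smaller bursts
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=20
# below the default of 100 the kernel holds on to dentry/inode caches longer;
# above 100 it reclaims them more aggressively in favor of the page cache
sysctl -w vm.vfs_cache_pressure=50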
Support in the kernel for device-specific writeback threads goes way back to the 2.6 days:
https://www.thomas-krenn.com/en/wiki/Linux_Page_Cache_Basics
Cgroups v1 and v2 focus on pid-based IO throttling, not file-based cache tuning:
https://andrestc.com/post/cgroups-io/
That said, the old linux-ftools set of utilities has a simple example of a command-line knob for use of fadvise on specific files:
https://github.com/david415/linux-ftools
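For example, checking how much of a given file is resident in the page cache; a sketch, noting that util-linux ships its own fincore with different flags, and the path here is hypothetical:
# report how many pages of this file are currently in the page cache
fincore /var/lib/kafka/data/topic-0/00000000000000000000.log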
So there's enough there. Given specific Kafka and Cassandra workloads (e.g. read-heavy vs. write-heavy), specific prioritizations (Kafka over Cassandra or vice versa), and specific IO configurations (dedicated vs. shared devices), one could arrive at a specific tuning model, and those models could be generalized into a policy model.

Is there a reason not to share hosts for OSDs and Radosgw in a Ceph setup?

I am performance testing Ceph. I have a limited number of VMs to do this with. I want to have several radosgws for a round-robin setup. Will my benchmarks be grossly inaccurate if I use the same hosts for OSDs and radosgw?
The main issue with sharing OSD hosts with any other part of the installation is thread count. The Ceph OSD daemon creates a lot of threads under high load (you do want to use Ceph under high load, don't you?). I can't say how many threads radosgw creates, but this is a well-known problem with the 'OSDs on compute hosts' scenario. When there are too many threads, the OS scheduler starts to struggle with them, thrashing the CPU cache and significantly dropping performance (and raising latencies).
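A quick way to see the thread pressure on a running host, as a sketch assuming a standard Ceph package install:
# show the thread count (NLWP) for each ceph-osd process
ps -C ceph-osd -o pid,nlwp,cmd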
Ceph RGW is a lightweight process; it does not require much CPU or memory, but it does require network bandwidth. IMO you can co-locate RGWs and OSDs, provided that you have dedicated Ceph cluster and public networks and the RGWs use the Ceph public network.
I have done similar performance benchmarking covering both co-located and dedicated RGWs, and I have not found a significant performance difference between the two configurations. Co-located RGWs performed a bit worse, but not by a substantial margin.
So if you have to design a low-cost object storage solution based on Ceph, you might want to consider co-locating RGWs on OSD hosts. You can save some $$.
FYI, a co-located RGW configuration is not supported from Red Hat's point of view, though things are progressing pretty fast in that direction.

What are some useful tips/tools for monitoring/tuning memcached health?

Yesterday, I found this cool script, memcache-top, which nicely prints out live memcached stats. It looks like this:
memcache-top v0.6 (default port: 11211, color: on, refresh: 3 seconds)
INSTANCE USAGE HIT % CONN TIME EVICT/s READ/s WRITE/s
127.0.0.1:11211 88.8% 94.8% 20 0.8ms 9.0 311.3K 162.8K
AVERAGE: 88.8% 94.8% 20 0.8ms 9.0 311.3K 162.8K
TOTAL: 1.8GB/ 2.0GB 20 0.8ms 9.0 311.3K 162.8K
(ctrl-c to quit.)
It even makes certain text red when you should pay attention to something!
Q. Broadly, what are some useful tools/techniques you've used to check that memcached is set up well?
A good interface for accessing Memcached server instances is phpMemCacheAdmin.
I prefer access from the command line using telnet.
To connect to Memcached using Telnet, run telnet localhost 11211 from the command line.
If at any time you wish to terminate the Telnet session, simply type quit and hit return.
You can get an overview of the important statistics of your Memcached server by running the stats command once connected.
Memory is allocated in chunks internally and constantly reused. Since memory is broken into slabs of different sizes, you do waste memory if your items do not fit perfectly into the slab the server chooses to put them in.
So Memcached automatically allocates your data into different "slabs" (think of these as partitions) of memory, based on the size of your data, which in turn makes memory allocation more optimal.
To list the slabs in the instance you are connected to, use the stats slabs command.
A more useful command is stats items, which will give you a list of slabs that includes a count of the items stored within each slab.
Now that you know how to list slabs, you can browse inside each slab to list the items contained within by using the stats cachedump [slab ID] [number of items, 0 for all items] command.
If you want to get the actual value of that item, you can use the get [key] command.
To delete an item from the cache you can use the delete [key] command.
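Put together, a typical interactive session looks like this; the slab ID, item count, and key are illustrative:
telnet localhost 11211
stats items
stats cachedump 5 10
get some_key
delete some_key
quit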
For production systems, you should really set up active monitoring (with downtime alerts, automated restarts, etc.) of Memcached using something like Monit. Here is an example config: Monitoring Memcache with Monit
It is good to monitor overall memory usage of memcached for resource planning.
Track the eviction statistics counter to know how often cached items are getting evicted due to lack of memory.
Track cache hits/misses, reclaims (the number of expired items removed to make space for new writes), current connections, and flush commands, all of which are available in stats.
Memcached stats (can be read via telnet, libmemcached, or a language-specific library):
stats
stats slabs
stats items
stats sizes
stats detail
stats settings
Run the above commands over telnet, or simply use netcat:
echo "stats settings" | nc 127.0.0.1 11211
Other scripts/tools
https://github.com/memcached/memcached/tree/master/scripts
memcached top
memcached metrics per slab
Descriptions of some of the fields can be found here.
Memcached Prometheus exporter - Exports metrics from memcached servers for consumption by Prometheus.

Varnish restarting suddenly

Does varnish keep a crash / restart log?
I am currently monitoring a Varnish server and it seems to restart every week or so, when CPU usage reaches about 100% (load gets a bit high, about 6-7 on a 2-core machine) and IO wait averages around 45% of CPU time.
Am I missing any configuration or predefined behavior? Does it mean that I have a bottleneck in my hardware causing varnish failures?
Thanks!
When the child dies you should see a message in syslog. It will say something like Child exited.... Varnish is good about keeping track of the child process, so when it does crash it will be restarted immediately, and the restart should be logged.
A load of 6-7 seems high. If you are using file-backed storage, I suggest switching to malloc. If you need more cache space, get a box with more memory. Use the nuking behavior as your guide (varnishstat -1 | grep nuke). If the value varnish reports there is 0, your cache size is sufficient.
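A sketch of both checks (the log location varies by distribution, and the counter name varies by Varnish version):
# look for child panic/restart messages in syslog
grep -i child /var/log/syslog | tail
# a nuke counter of 0 means the cache has never had to evict for space
varnishstat -1 | grep -i nuke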