What are some useful tips/tools for monitoring/tuning memcached health? - memcached

Yesterday, I found this cool script 'memcache-top' which nicely prints out stats of memcached live. It looks like,
memcache-top v0.6 (default port: 11211, color: on, refresh: 3 seconds)
INSTANCE USAGE HIT % CONN TIME EVICT/s READ/s WRITE/s
127.0.0.1:11211 88.8% 94.8% 20 0.8ms 9.0 311.3K 162.8K
AVERAGE: 88.8% 94.8% 20 0.8ms 9.0 311.3K 162.8K
TOTAL: 1.8GB/ 2.0GB 20 0.8ms 9.0 311.3K 162.8K
(ctrl-c to quit.)
it even makes certain text red when you should pay attention to something!
Q. Broadly, what are some useful tools/techniques you've used to check that memcached is set up well?

Good interface to accessing Memcached server instances is phpMemCacheAdmin.
I prefer access from the command line using telnet.
To make a connection to Memcached using Telnet, use the following telnet localhost 11211 command from the command line.
If at any time you wish to terminate the Telnet session, simply type quit and hit return.
You can get an overview of the important statistics of your Memcached server by running the stats command once connected.
Memory is allocated in chunks internally and constantly reused. Since memory is broken into different size slabs, you do waste memory if your items do not fit perfectly into the slab the server chooses to put it in.
So Memcached allocates your data into different "slabs" (think of these as partitions) of memory automatically, based on the size of your data, which in turn makes memory allocation more optimal.
To list the slabs in the instance you are connected to, use the stats slab command.
A more useful command is the stats items, which will give you a list of slabs which includes a count of the items store within each slab.
Now that you know how to list slabs, you can browse inside each slab to list the items contained within by using the stats cachedump [slab ID] [number of items, 0 for all items] command.
If you want to get the actual value of that item, you can use the get [key] command.
To delete an item from the cache you can use the delete [key] command.

For a production systems, you should really set up active monitoring (with downtime alerts, automated restarts etc.) of Memcache using something like Monit. Here is an example config: Monitoring Memcache with Monit

It is good to monitor overall memory usage of memcached for resource planning.
Track the eviction statistics counter to know how often cached items are getting evicted due to lack of memory.
Track cache hit/misses, reclaims(The number of expired items removed to allow space for new writes), current connections, flush cmd which is available in stats.
Memcached stats (can be read from telnet, libmemcached, language specific library)
stats
stats slabs
stats items
stats sizes
stats detail
stats settings
run the above commands using telnet
or simply run using netcat
echo "stats settings" | nc 127.0.0.1 11211
Other scripts/tools
https://github.com/memcached/memcached/tree/master/scripts
memcached top
memcached metrics per slab
This is what memcached metrics per slab looks like
desc for some fields can be found here.
Memcached Prometheus exporter - Exports metrics from memcached servers for consumption by Prometheus.

Related

HAProxy reverse ssl termination: Memory keeps growing. Memory leak?

I have haproxy 2.5.1 in SSL termination config running in a container of a Kubernetes POD, the backend is an Scala App that runs in another container of same POD.
I have seen that I can put 500K connections in the setup and the RSS memory usage of HAProxy is 20GB. If I remove the traffic and wait 15 minutes the RSS memory drops to 15GB, but if I repeat the same exercise one or two more times, RSS for HAProxy will hit 30GB and HAProxy will be kill as I have a limit of 30GB in the POD for HAProxy.
The question here is if this behavior of continuous memory growth is expected?
Here is the incoming traffic:
And here is the memory usage chart which shows how after 3 cycles of Placing Load and Removing Load, the RSS memory reached 30GB and then got killed (Just as an observation the two charts have different timezone but they belong to same execution)
We switched from Alpine based image(musl) into libc based image and that solved the problem. We got 5X increase on connection rate and memory growth gone too.

Monitor Nestjs backend

In the old days, when we wanted to monitor a "Daemon" / Service, we were asking the software editor the list of all the services running in the background in Windows.
If a "Daemon / service" would be down, it would be restarted.
On top of that, we would use a software like NAGIOS or Centreon to monitore this particular "Daemon / service".
I have a team of Software developper in charge of implementing a nice Nest JS.
Here is what we are going to implement:
2 differents VMs running on a high availability VMWARE cluster with a SAN
the two VMs has Vmotion / High availabity settings
an HA Proxy is setup in order to provide load balancing and additional high availability
Our questions are, how can we detect that :
one of our backend is down ?
one of our backend moving from 50ms average response time to 800ms ?
one of our backend consumes more that 15Gb of ram ?
etc
When we were using "old school" daemon, it was enough, when it comes to JS backend, I am a bit clue less.
Cheers
Kynes
nb : the datacenter in charge of our infrastructure is not "docker / kubernetes / ansible etc compliant)
To be fair, all of these seem doable out of the box for Centreon/Nagios. I'd say check the documentation...
one of our backend is down ?
VM DOWN: the centreon-vmware plugins provides monitoring of VM status.
VM UP but Backend DOWN : use the native http/https url checks provided by Centreon/Nagios to load the web page.
Or use the native SNMP plugins to monitor the status of your node process.
one of our backend moving from 50ms average response time to 800ms ?
Ping Response time: Use the native ping check
Status of the network interfaces of the VM: the centreon-vmware plugin has network interface checks for VMs.
Page loading time: use the native http/https url checks provided by Centreon/Nagios.
You may go even further and use a browser automation tool like selenium to run scenarios on your pages and monitor the time for each step.
one of our backend consumes more that 15Gb of ram ?
Total RAM consumed on server: use the native SNMP memory checks from centreon/nagios.
RAM consumed by a specific process: possible through the native SNMP memory plugin.
Like so:
/usr/lib/centreon/plugins/centreon_linux_snmp.pl --plugin os::linux::snmp::plugin --mode processcount --hostname=127.0.0.1 --process-name="centengine" --memory --cpu
OK: Number of current processes running: 1 - Total memory usage: 8.56 MB - Average memory usage: 8.56 MB - Total CPU usage: 0.00 % | 'nbproc'=1;;;0; 'mem_total'=8978432B;;;0; 'mem_avg'=8978432.00B;;;0; 'cpu_total'=0.00%;;;0;`

What is the best way to monitor Heroku Postgres memory and cpu

We're on Heroku and trying to understand if it's time to upgrade our Postgres database or not. I have two questions:
Is there any tools you know of that track heroku postgres logs to track their memory and cpu
usage stats over time?
Are those (Memory and CPU usage) even the best metrics to look at to determine if we should upgrade to a larger instance or not?
The most useful tool I've found for monitoring heroku postgres instancs is the logs associated with the database's dyno, which you can monitor using heroku logs -t -d heroku-postgres. This spits out some useful stats every 5 minutes, so if you fill your logs up quickly, this might not output anything right away — use -t to wait for the next log line.
Output will look something like this:
2022-06-27T16:34:49.000000+00:00 app[heroku-postgres]: source=HEROKU_POSTGRESQL_SILVER addon=postgresql-fluffy-55941 sample#current_transaction=81770844 sample#db_size=44008084127bytes sample#tables=1988 sample#active-connections=27 sample#waiting-connections=0 sample#index-cache-hit-rate=0.99818 sample#table-cache-hit-rate=0.9647 sample#load-avg-1m=0.03 sample#load-avg-5m=0.205 sample#load-avg-15m=0.21 sample#read-iops=14.328 sample#write-iops=15.336 sample#tmp-disk-used=543633408 sample#tmp-disk-available=72435159040 sample#memory-total=16085852kB sample#memory-free=236104kB sample#memory-cached=15075900kB sample#memory-postgres=223120kB sample#wal-percentage-used=0.0692420374380985
The main stats I pay attention to are table-cache-hit-rate which is a good proxy for how much of your active dataset fits in memory, and load-avg-1m, which tells you how much load per CPU the server is experiencing.
You can read more about all these metrics here.

AWS EB should create new instance once my docker reached its maximum memory limit

I have deployed my dockerized micro services in AWS server using Elastic Beanstalk which is written using Akka-HTTP(https://github.com/theiterators/akka-http-microservice) and Scala.
I have allocated 512mb memory size for each docker and performance problems. I have noticed that the CPU usage increased when server getting more number of requests(like 20%, 23%, 45%...) & depends on load, then it automatically came down to the normal state (0.88%). But Memory usage keeps on increasing for every request and it failed to release unused memory even after CPU usage came to the normal stage and it reached 100% and docker killed by itself and restarted again.
I have also enabled auto scaling feature in EB to handle a huge number of requests. So it created another duplicate instance only after CPU usage of the running instance is reached its maximum.
How can I setup auto-scaling to create another instance once memory usage is reached its maximum limit(i.e 500mb out of 512mb)?
Please provide us a solution/way to resolve these problems as soon as possible as it is a very critical problem for us?
CloudWatch doesn't natively report memory statistics. But there are some scripts that Amazon provides (usually just referred to as the "CloudWatch Monitoring Scripts for Linux) that will get the statistics into CloudWatch so you can use those metrics to build a scaling policy.
The Elastic Beanstalk documentation provides some information on installing the scripts on the Linux platform at http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/customize-containers-cw.html.
However, this will come with another caveat in that you cannot use the native Docker deployment JSON as it won't pick up the .ebextensions folder (see Where to put ebextensions config in AWS Elastic Beanstalk Docker deploy with dockerrun source bundle?). The solution here would be to create a zip of your application that includes the JSON file and .ebextensions folder and use that as the deployment artifact.
There is also one thing I am unclear on and that is if these metrics will be available to choose from under the Configuration -> Scaling section of the application. You may need to create another .ebextensions config file to set the custom metric such as:
option_settings:
aws:elasticbeanstalk:customoption:
BreachDuration: 3
LowerBreachScaleIncrement: -1
MeasureName: MemoryUtilization
Period: 60
Statistic: Average
Threshold: 90
UpperBreachScaleIncrement: 2
Now, even if this works, if the application will not lower its memory usage after scaling and load goes down then the scaling policy would just continue to trigger and reach max instances eventually.
I'd first see if you can get some garbage collection statistics for the JVM and maybe tune the JVM to do garbage collection more often to help bring memory down faster after application load goes down.

What are possible reasons for memcached to be significantly slower on a remote server?

I have a PHP/Apache server with 12GB of RAM. I have been running Memcached on the same machine with 6GB of allotted RAM.
I wanted to run Memcached on a separate server (same datacenter, vlan, subnet), just as I do for MySQL. I setup a separate, identical server with the same memcached configuration.
I am seeing a roughly 10x page load time using Memcached from the remote server than what I get when running locally. I have primed both caches and I still have a 10x load time from remote.
I'm having trouble trouble shooting this.
You're loading 500kb of data per pageload, in all small keys? How many keys per pageload is this?
Latency to a remote server is very low, but running many roundtrips is still a bad idea. Memcached clients support multi-get operations, where you batch many keys into a single request/response with much lower latency.
Just for info, DDR3-1333 is about 10667 MB/s.
If you have, let's say, Gigabit ethernet, I guess it can explains some of the problems you are experiencing...