Memcached Eviction Rate Performance - memcached

We are monitoring our Memcached servers with Dynatrace. Memcached is mainly used to store user-session data.
Number of Memcached Servers: 3
Memory Assigned to each instance: 2 GB
We recently increased the Memcached memory from 1 GB to 2 GB after user-session issues where users could not sign in.
After the increase, we observed a decrease in the eviction rate in the monitoring tool (Dynatrace).
(There were NO other server/JVM related issues)
My Question:
Can I correlate the user sessions with the eviction rate?
Are there any other parameters/metrics I should watch for Memcached?
Is an eviction rate of 200/s to 400/s normal?
Here is a snapshot of the eviction rate and max_bytes for the last 72 hours:
[Chart: Eviction Rate and Max_bytes]

When Dynatrace detects a problem with your user sessions it will also automatically check for anomalies in related metrics.
It is impossible to give a general recommendation on a "good" eviction rate. Focus on the user sessions and the user experience; as long as user experience is OK, I wouldn't worry about the eviction rate. If you do see problems with user sessions and Dynatrace also reports a relation to anomalies in the eviction-rate metric, I can recommend this article. It explains an interesting relationship between the size of the objects stored in Memcached and unexpected evictions.
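If you do want to keep an eye on the raw counters yourself, here is a minimal sketch. It assumes the pymemcache client and a hypothetical host "memcached-1:11211"; it samples the stats counters twice and derives the eviction rate, hit ratio, and memory fill level, which are the metrics most directly related to an undersized session cache.

```python
# Minimal sketch: sample Memcached STATS counters and derive per-second rates.
# Assumes the pymemcache client and a hypothetical host "memcached-1:11211".
import time
from pymemcache.client.base import Client

client = Client(("memcached-1", 11211))

def snapshot():
    stats = client.stats()
    # Stats are keyed by bytes; pick out the counters we care about.
    return {
        "evictions": int(stats[b"evictions"]),
        "get_hits": int(stats[b"get_hits"]),
        "get_misses": int(stats[b"get_misses"]),
        "bytes": int(stats[b"bytes"]),
        "limit_maxbytes": int(stats[b"limit_maxbytes"]),
    }

interval = 60  # seconds between samples
first = snapshot()
time.sleep(interval)
second = snapshot()

eviction_rate = (second["evictions"] - first["evictions"]) / interval
hits = second["get_hits"] - first["get_hits"]
misses = second["get_misses"] - first["get_misses"]
hit_ratio = hits / (hits + misses) if (hits + misses) else 1.0
fill_pct = 100.0 * second["bytes"] / second["limit_maxbytes"]

print(f"evictions/s: {eviction_rate:.1f}, hit ratio: {hit_ratio:.2%}, memory used: {fill_pct:.0f}%")
```

A falling hit ratio together with a rising eviction rate while the cache sits near 100% of limit_maxbytes is the pattern that usually lines up with session loss.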

Related

MongoDB: Disk I/O % utilization on Data Partition has gone

Recently I got an alert from MongoDB Atlas:
Disk I/O % utilization on Data Partition has gone above 70 on nvme2n1
But I have no idea how to localize the problem: which query, index, part of the code, or collection is responsible.
How can I analyze this to find the root cause?
Not a full answer, but I have seen that many people face a similar problem.
In my case the root cause was: we had a collection with huge documents containing an array of data (in fact, a list of coordinates with some metadata), and we updated each document as many times as we had coordinates (when adding new coordinates), plus some additional operations.
As far as I know, MongoDB cannot fetch just part of a document; it fetches the full document. When we fetch many different, large documents, they do not fit into MongoDB's in-memory cache, so every access goes to disk, and that led to this issue.
So we split this document into several smaller ones, and that fixed the issue. While we need frequent access to update/add this data, we keep it in separate documents; after the process is done, we gather them all back into one big document for "history check" purposes.
Recently we hit this alert on MongoDB Atlas, Disk I/O % utilization on Data Partition has gone above 90, after instance-reboot maintenance. After a discussion with the Atlas support team, we now clearly understand this metric.
Understanding Disk I/O % Utilization
The definitions of Disk I/O % Utilization and Disk I/O % utilization on Data Partition, per the documentation:
Disk I/O % Utilization alerts indicate that the percentage of time during which requests are being issued reaches a specified threshold.
Disk I/O % utilization on Data Partition occurs if the percentage of time during which requests are being issued to any partition that contains the MongoDB collection data meets or exceeds the threshold.
Two traps in iostat: %util and svctm
Device saturation occurs when this value is close to 100% for devices serving requests serially. But for devices serving requests in parallel, such as RAID arrays and modern SSDs, this number does not reflect their performance limits.
This means if there was even just one I/O operation in progress for a given time period, the operating system would report 100% Disk Util, as the disk was in use 100% of that time.
Thus, the disk utilization percentage by itself is NOT an indicator of stress on the disk relative to its maximum IOPS capacity.
Having disk utilization at 100% does not in itself imply there is an issue. Disk utilization is the percentage of time requests are issued to any partition containing the MongoDB collection data. This includes requests from any process, not just MongoDB processes. Modern disk storage can sustain multiple I/O operations simultaneously, so having a ~100% utilization is not unusual, because it just means that the disk is constantly processing at least one operation during the 100% interval.
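As a rough illustration of how %util is derived (and why 100% only means "at least one request was in flight"), here is a small sketch that computes it from /proc/diskstats on Linux. The device name (taken from the alert) and the sampling interval are placeholders, and this is not the exact calculation iostat or Atlas performs, just the same idea.

```python
# Rough sketch: compute disk %util for one device from /proc/diskstats (Linux).
# The "io_ticks" field counts milliseconds during which the device had at least
# one I/O in flight, so delta(io_ticks) / elapsed_ms is the utilization fraction.
import time

DEVICE = "nvme2n1"   # device name from the alert
INTERVAL = 5         # seconds between samples (placeholder)

def io_ticks_ms(device):
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                return int(fields[12])  # time spent doing I/Os, in ms
    raise ValueError(f"device {device} not found")

start = io_ticks_ms(DEVICE)
time.sleep(INTERVAL)
end = io_ticks_ms(DEVICE)

util_pct = 100.0 * (end - start) / (INTERVAL * 1000)
print(f"{DEVICE} utilization over {INTERVAL}s: {util_pct:.1f}%")
```

Note that a single long-running operation keeps io_ticks counting the whole time, which is exactly why a parallel-capable SSD can show 100% while still having plenty of IOPS headroom.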
Conclusion
We should look at a combination of all the available disk-related metrics, as well as IOWait in the System CPU when diagnosing potential disk performance-related issues.
Possible actions to help resolve Disk Utilization % alerts
Optimize your queries
Create an Index to Support Read Operations (see the sketch after this list)
Pay attention to Query Selectivity and Covered Query
Use the Atlas Performance Advisor to view slow queries and suggested indexes.
Review Indexing Strategies for possible further indexing improvements.
Analyze Query Performance to review how your queries are using your indexes.
Use the Profiler to optimize queries with long execution times
Increase hardware resources, such as instance size and IOPS on Atlas
Source: Mongo Doc
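To make the indexing items in the list above concrete, here is a minimal sketch using pymongo. The connection string, database, collection, and field names (shop, orders, status, created_at) are hypothetical; the point is to create an index that supports the read pattern and then use explain with executionStats to compare documents examined versus returned.

```python
# Minimal sketch (hypothetical database/collection/fields): create an index that
# supports the read pattern, then compare documents examined vs. returned.
from pymongo import MongoClient, ASCENDING, DESCENDING

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
db = client["shop"]
orders = db["orders"]

# Compound index matching the query's filter and sort (hypothetical fields).
orders.create_index([("status", ASCENDING), ("created_at", DESCENDING)])

# Explain the query with executionStats to see how selective it really is.
explain = db.command(
    "explain",
    {"find": "orders", "filter": {"status": "PENDING"},
     "sort": {"created_at": -1}, "limit": 100},
    verbosity="executionStats",
)
stats = explain["executionStats"]
print(f'examined: {stats["totalDocsExamined"]}, returned: {stats["nReturned"]}')
```

If examined is orders of magnitude larger than returned, the query is scanning far more documents than it returns, which translates directly into disk reads when the working set does not fit in cache.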
As the alert says, it is due to high utilization of the disk. The most common cause is unoptimized queries with a poor query targeting ratio, or simply reading/writing a lot of documents from/to disk in a relatively short time window.
In order to identify these queries, start with the Profiler and look for operations with a poor Examined:Returned ratio. You can also refer to the Performance Advisor to see if it suggests any indexes for the inefficient operations. Since the Profiler's window is limited to the last 24 hours, you can also refer to your logs to identify slow queries.
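If profiling is enabled on the database, a query like the following sketch can surface those operations from system.profile. The database name and the "poor ratio" threshold are placeholders; the docsExamined, nreturned, ns, ts, and millis fields are standard profiler output.

```python
# Sketch: pull recent profiled operations with a poor docsExamined:nreturned ratio.
# Assumes profiling is already enabled on the database; "shop" and the threshold
# of 100 are placeholders.
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["shop"]  # placeholder

for op in db["system.profile"].find().sort("ts", -1).limit(200):
    examined = op.get("docsExamined", 0)
    returned = op.get("nreturned", 0)
    if returned and examined / returned > 100:  # arbitrary "poor ratio" threshold
        print(op["ts"], op["ns"], f'{op.get("millis", 0)} ms',
              "examined:", examined, "returned:", returned)
```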
Ultimately, the effort to solve this is tri-directional:
Optimize query execution with efficient indexing and filtering strategies
Keep a check on the volume of data being read/written in one go.
Increase the IOPS of the cluster
For official reference, check out the documentation here.

How to determine resource limit for Openshift Pods required for my tomcat application?

I have a web application (SOAP service) running on a Tomcat 8 server in OpenShift. The payload size is relatively small (5-10 elements) and the traffic is also small (300 calls per day, 5-10 max threads at a time). I'm a little confused about the pod resource restrictions. How do I come up with min and max CPU and memory limits for each pod if I'm going to use a minimum of 1 and a maximum of 3 pods for my application?
It's tricky to configure accurate limit values without a performance test, because we don't know how many resources your application needs to process each request. A good rule of thumb is to base the limits on the heaviest workload in your environment. A memory limit can trigger the OOM killer, so you should set a value with headroom, based on your Tomcat heap and static memory size.
A CPU limit, in contrast, will not kill your pod when the limit is reached, but will slow processing down.
My suggested starting point for each limit is as follows.
Memory: Tomcat (Java) memory size + 30% buffer (see the sketch after this list)
CPU: personally, I think a CPU limit is counterproductive if you want to maximize processing performance and efficiency. Even when CPU is available and the pod could use the full CPU resources to process requests as quickly as possible, the limit setting can get in the way. But if you need to spread resource usage evenly to suppress an aggressive resource eater, you can consider a CPU limit.
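Here is the small sketch referenced in the Memory item above, with hypothetical numbers for the Tomcat heap and non-heap footprint; the result is what you would plug into the pod's memory request/limit as a starting point.

```python
# Sketch of the "Tomcat memory + 30% buffer" starting point, hypothetical numbers.
heap_mb = 512        # -Xmx of the Tomcat JVM (assumed)
non_heap_mb = 256    # Metaspace, thread stacks, code cache, native overhead (assumed)
buffer_ratio = 0.30  # safety buffer suggested above

limit_mb = (heap_mb + non_heap_mb) * (1 + buffer_ratio)
print(f"memory request/limit starting point: {limit_mb:.0f} Mi")  # ~998 Mi here
```

From there, observe actual usage under your heaviest workload and adjust; the number is a starting point, not a final answer.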
This answer might not be exactly what you were looking for, but I hope it helps you with your capacity planning.

How to calculate the number of CPU, memory and storage that my Google Cloud SQL needs

My DB is reaching 100% CPU utilization, and increasing the number of CPUs no longer helps.
What kind of information should I consider when sizing my Google Cloud SQL instance? How do you decide on the DB configuration?
Info I have:
For 10-50 minutes a day I have 120 requests/second and the CPU reaches 100% utilization
Memory usage peaks at 2.5 GB during this critical period
Storage usage is currently around 1.3 GB
Current configuration:
vCPUs: 10
Memory: 10 GB
SSD storage: 50 GB
Unfortunately, there is no magic formula for determining the correct database size. This is because queries have variable load - some are small and simple and take no time at all, others are complex or huge and take lots of resources to complete.
There are generally two strategies to dealing with high load: Reduce your load (use connection pooling, optimize your queries, cache results), or increase the size of your database (add additional CPUs, Storage, or Read replicas).
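If you go the "reduce load" route, connection pooling is usually the cheapest win. Below is a minimal sketch using SQLAlchemy against a hypothetical Cloud SQL PostgreSQL instance; the host, credentials, table name, and pool sizes are placeholders to tune for your workload.

```python
# Sketch: reuse a bounded connection pool instead of opening a connection per request.
# Host, credentials, table, and pool sizes are placeholders (hypothetical Cloud SQL
# PostgreSQL instance reachable over a private IP).
from sqlalchemy import create_engine, text

engine = create_engine(
    "postgresql+psycopg2://app_user:app_pass@10.0.0.5:5432/appdb",
    pool_size=5,        # steady-state connections per application instance
    max_overflow=2,     # allow short bursts above pool_size
    pool_timeout=30,    # seconds to wait for a free connection
    pool_recycle=1800,  # recycle connections to avoid stale sockets
)

def count_recent_orders():
    # The connection is returned to the pool when the context manager exits.
    with engine.connect() as conn:
        return conn.execute(text("SELECT count(*) FROM orders")).scalar()
```

Keeping connection churn down matters on Cloud SQL because connection setup and teardown themselves burn CPU on the instance, on top of the queries.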
Usually, when we have high CPU utilization, it is because the CPU is overloaded or there are too many database tables in the same instance. Here are some common issues and fixes from Google's documentation:
If CPU utilization is over 98% for 6 hours, your instance is not properly sized for your workload, and it is not covered by the SLA.
If you have 10,000 or more database tables on a single instance, it could result in the instance becoming unresponsive or unable to perform maintenance operations, and the instance is not covered by the SLA.
When the CPU is overloaded, it is recommended to use this documentation to view the percentage of available CPU your instance is using on the Instance details page in the Google Cloud Console.
It is also recommended to monitor your CPU usage and receive alerts at a specified threshold; to do that, set up a Stackdriver alert.
Increasing the number of CPUs for your instance should reduce the strain on your instance. Note that changing CPUs requires an instance restart. If your instance is already at the maximum number of CPUs, shard your database across multiple instances.
Google has this very interesting documentation about investigating high utilization and determining whether a system or user task is causing high CPU utilization. You could use it to troubleshoot your instance and find what's causing the high CPU utilization.

Couchbase - Full Eviction vs Value Eviction

How advantageous is value eviction over full eviction? In the case of value eviction, I assume the metadata is kept in RAM. How does the presence of metadata help in quicker retrieval of content? Are the NRU (not recently used) documents evicted when the high watermark is reached? Are there any aspects we need to consider before changing the eviction policy from value eviction to full eviction?
Value eviction keeps all document meta data in memory and full eviction does not.
Let's say you do a get on a key that does not exist. In value eviction mode you instantly know that the key is not there, since it is a memory-only operation. In full eviction mode, if the metadata for that key is not in memory, then you have to do a disk fetch to be sure that the key does not exist.
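To illustrate that point, here is a rough sketch using the Couchbase Python SDK (imports follow the 3.x/4.x style and may differ by SDK version; the address, credentials, and bucket name are placeholders). Under value ejection the DocumentNotFoundException for a missing key is answered from memory; under full ejection the same call may first trigger a background disk fetch for the metadata, which shows up as extra latency.

```python
# Rough sketch (Couchbase Python SDK, placeholder credentials/bucket): time a get
# on a key that does not exist. Under full ejection this may include a disk fetch.
import time
from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions
from couchbase.exceptions import DocumentNotFoundException

cluster = Cluster("couchbase://127.0.0.1",
                  ClusterOptions(PasswordAuthenticator("user", "password")))
collection = cluster.bucket("sessions").default_collection()

start = time.perf_counter()
try:
    collection.get("no-such-key")
except DocumentNotFoundException:
    pass  # expected: the key does not exist
print(f"miss latency: {(time.perf_counter() - start) * 1000:.2f} ms")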
Basically, any operation that requires knowledge of a key's metadata may require a background fetch. Other operations that may be slow if the metadata is not in memory are CAS sets (set only if the value has not changed), appends, incrs, decrs, and prepends. Also keep in mind that the extra disk activity may cause disk contention that affects other parts of Couchbase.
The NRU is the same across both full eviction and value eviction and Couchbase will do its best to keep your working set in memory.
I would recommend getting an idea of what your workload looks like before switching modes, and testing it out with full eviction, because you may see performance issues that will vary depending on your workload.
In addition to Mike's answer, it's worth mentioning bloom filters here, which are a very powerful feature of Couchbase and can decrease trips to disk significantly. Bloom filters are also enabled in value-only ejection, but Couchbase really leverages their functionality in full ejection mode. I was in a situation where the system had reached its limit with value-only ejection buckets; I tested the two eviction modes, and ultimately full ejection ended up being far superior - at least in my case.

Impact of Compaction and Flushing on Write Latency in Cassandra

Will frequent Compaction and Memtable Flushing affect write latency of the cluster?
In our implementation we have a bunch of counter column families [about 30] which get updated very actively. Every request to our system does around 15-20 updates [all to different CFs].
We can see compaction and flushing happening very frequently in Cassandra's system logs under heavy traffic. At the same time we also experience high load on the nodes responsible for the keys [Day Timestamp, Minute Timestamp, Hour Timestamp], and the write latency of the cluster rises well above the usual [from 0.6 ms to 26 ms].
We haven't touched any of Cassandra's defaults, and the machines running Cassandra have a reasonably good configuration [32 GB RAM and 16 cores], with 4 GB allocated to Cassandra.
We tried disabling durable_writes to see whether it would help, but it didn't do as much good as we expected.
Short version: if Cassandra is configured as recommended with commitlog on a separate disk from the data directories, then flush and compaction should have negligible impact.
Caveats:
Updates are primarily CPU-bound, and compaction takes a lot of CPU. If you are running on machines or VMs with less than 4 cores [not your situation, but for the sake of completeness] you might want to reduce compaction_throughput_mb_per_sec to throttle it down.
If you have enough CFs flushing all at the same time (which it sounds like may be the case with updating 2/3 of your CFs with each request) then Cassandra may block writes temporarily to make sure that it's not accepting writes faster than it can flush them (which could otherwise eventually result in running out of memory and dying). 4 GB is a relatively small heap for high volume inserts across many CFs; I'd suggest increasing that to 8. It's also worth enabling the JVM GC logging to see how hard the JVM is having to work -- example settings are in cassandra-env.sh.
Finally, you don't mention the Cassandra version you are using but performance has reliably increased with each major release. Especially if you are using something older than 0.8, I would recommend upgrading.