On AWS RDS Postgres, what could cause disk latency to go up while IOPS / throughput go down?

I'm investigating an approximately 3 hour period of increased query latency on a production Postgres RDS instance (m4.xlarge, 400 GiB of gp2 storage).
The driver seems to be a spike in both read and write disk latency: write latency goes from a baseline of ~0.0005 up to a peak of 0.0136, and read latency up to 0.0081 (RDS reports these metrics in seconds, so roughly 0.5 ms rising to ~13.6 ms and ~8.1 ms).
I also see an increase in disk queue depth, from a baseline of around 2 to a peak of 14.
When there's a spike in disk latencies, I generally expect to see an increase in data being written to disk. But read IOPS, write IOPS, read throughput, and write throughput all went down (by approximately 50%) during the time when latency was elevated.
I also have server-side metrics on the total query volume I'm sending (measured in both queries per second and amount of data written: this is a write-heavy workload), and those metrics were flat during this time period.
I'm at a loss for what to investigate next. What are possible reasons that disk latency could increase while IOPS go down?
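A sketch of how the raw per-minute CloudWatch metrics for that window can be pulled with the AWS CLI, assuming a placeholder instance identifier and time range; BurstBalance is included only because gp2 burst-credit exhaustion is one possible cause of this pattern that seems worth ruling out:
# Placeholder DB instance identifier and time window; adjust to the 3-hour period in question.
for METRIC in ReadLatency WriteLatency ReadIOPS WriteIOPS DiskQueueDepth BurstBalance; do
  aws cloudwatch get-metric-statistics \
    --namespace AWS/RDS \
    --metric-name "$METRIC" \
    --dimensions Name=DBInstanceIdentifier,Value=my-db-instance \
    --start-time 2021-06-01T00:00:00Z --end-time 2021-06-01T03:00:00Z \
    --period 60 --statistics Average Maximum \
    --output table
done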

Related

Troubleshooting Latency Increase for Lambda to EFS Reads

The Gist
We've got a Lambda job running that reads data from EFS (elastic throughput) for up to 200 TPS of read requests.
PercentIOLimit is well below 20%.
Latency goes from about 20 ms to about 400 ms during traffic spikes.
Are there any steps I can take to get more granularity into where the latency for the reads is coming from?
Additional Info:
At low TPS (~5), reads take about 10-20 ms.
At higher TPS (~50), p90 can take 300-400 ms.
I'd really like to narrow down which limit is causing these latency spikes, especially when PercentIOLimit usage is around 60%.
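One hedged way to get more granularity from the AWS side is to pull the per-minute EFS CloudWatch metrics for a spike window and compare the metered I/O against PermittedThroughput; the file-system id and time range below are placeholders:
# Placeholder file-system id and a time window covering one of the traffic spikes.
for METRIC in PercentIOLimit PermittedThroughput MeteredIOBytes DataReadIOBytes ClientConnections; do
  aws cloudwatch get-metric-statistics \
    --namespace AWS/EFS \
    --metric-name "$METRIC" \
    --dimensions Name=FileSystemId,Value=fs-12345678 \
    --start-time 2023-05-01T12:00:00Z --end-time 2023-05-01T13:00:00Z \
    --period 60 --statistics Average Maximum
done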

NVMe SSD's bandwidth decreases when increasing the number of I/O queues

As far as I have learned from the relevant articles about NVMe SSDs, one of their benefits is multiple queues: by leveraging multiple NVMe I/O queues, the drive's bandwidth can be fully utilized.
However, what I have found from my own experiment does not agree with that.
I want to do parallel 4K-granularity sequential reads from an NVMe SSD. I'm using a Samsung 970 EVO Plus 250GB, and I used fio to benchmark it. The command I used (shown with --numjobs=1; I repeated it with --numjobs=2 and --numjobs=4) is:
fio --size=1000m --directory=/home/xxx/fio_test/ --ioengine=libaio --direct=1 --name=4kseqread --bs=4k --iodepth=64 --rw=read --numjobs=1 --group_reporting
And below is what I got for 1, 2, and 4 parallel sequential read jobs:
numjobs=1: 1008.7MB/s
numjobs=2: 927 MB/s
numjobs=4: 580 MB/s
Even if adding I/O queues does not increase bandwidth, I would expect it to at least match the single-queue performance. The decrease is counter-intuitive. What are the possible reasons for it?
Thank you.
I would like to highlight three reasons why you may be seeing this issue:
Effective queue depth is too high,
capacity under test is limited to only 1 GB,
the drive has not been preconditioned.
First, the --iodepth=X parameter is specified per job, which means that in your last experiment (--iodepth=64 and --numjobs=4) the effective queue depth is 4x64=256. That may be too high for your drive. Based on the vendor specification for your 250GB drive, 4KB random read should show 250 KIOPS (1 GB/s) at a queue depth of 32; in other words, the vendor is stating that QD32 is already enough for this drive to reach its best performance. If we keep increasing the queue depth, commands simply start aggregating and waiting in the submission queue. That does not improve performance; on the contrary, it starts to eat system resources (CPU, memory) and degrades throughput.
Second, limiting the capacity under test to such a small range (1GB) can cause a lot of collisions inside the SSD. This is the situation where reads keep hitting the same physical media read unit (aka die, aka LUN), so each new read has to wait for the previous one to complete. Increasing the test capacity to the entire drive, or at least to 50-100GB, should minimize the collisions.
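As a hedged illustration of the first two points, the runs below keep the effective queue depth constant at 64 while varying the number of jobs, and spread the reads over roughly 50 GB in total (the directory path is the placeholder from the question, and --size is per job):
# Same effective queue depth (numjobs x iodepth = 64) in every run, over a ~50 GB range.
fio --size=50g --directory=/home/xxx/fio_test/ --ioengine=libaio --direct=1 --name=4kseqread_j1 --bs=4k --iodepth=64 --rw=read --numjobs=1 --group_reporting
fio --size=25g --directory=/home/xxx/fio_test/ --ioengine=libaio --direct=1 --name=4kseqread_j2 --bs=4k --iodepth=32 --rw=read --numjobs=2 --group_reporting
fio --size=12g --directory=/home/xxx/fio_test/ --ioengine=libaio --direct=1 --name=4kseqread_j4 --bs=4k --iodepth=16 --rw=read --numjobs=4 --group_reporting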
Third, in order to get performance numbers matching the specification, the drive needs to be preconditioned accordingly. For measuring sequential and random reads it is better to use a full-drive sequential precondition. The command below performs a 128KB sequential write at an effective QD of 32 (iodepth 4 x 8 jobs) across the entire drive capacity; point --filename at your NVMe device (shown here with the example path /dev/nvme0n1), and be aware that it destroys all data on that device.
fio --filename=/dev/nvme0n1 --size=100% --ioengine=libaio --direct=1 --name=128KB_SEQ_WRITE_QD32 --bs=128k --iodepth=4 --rw=write --numjobs=8 --group_reporting

Meaning of ADX Cache utilization more than 100%

We see the Cache Utilization dashboard for an ADX cluster in the Azure portal, but at times I have noticed that this utilization shows up as more than 100%. I am trying to understand how to interpret it. Say, for example, cache utilization shows up as 250%: does it mean that 100% of the memory cache is utilized and then, beyond that, 150% of the disk cache is being utilized?
As explained in the documentation for the Cache Utilization metric:
[this is the] Percentage of allocated cache resources currently in use by the cluster.
Cache is the size of SSD allocated for user activity according to the defined cache policy.
An average cache utilization of 80% or less is a sustainable state for a cluster.
If the average cache utilization is above 80%, the cluster should be scaled up to a storage optimized pricing tier or scaled out to more instances. Alternatively, adapt the cache policy (fewer days in cache).
If cache utilization is over 100%, the size of data to be cached, according to the caching policy, is larger than the total size of cache on the cluster.
Utilization > 100% means that there's not enough room in the (SSD) cache to hold all the data that the policy indicates should be cached. If auto-scale is enabled then the cluster will be scaled-out as a result.
The cache applies an LRU eviction policy, so even when utilization exceeds 100%, query performance will be as good as possible (though, of course, if queries constantly reference more data than the cache can hold, some performance degradation will be observed).

GCE SSD persistent disk slower than Standard persistent disk

We are using GCE for a MongoDB replica set with three members. As our data is quite large, the initial sync for a new member takes quite a long time: in our case, 7 hours to copy the records and then 30 hours to create the indexes.
The database is stored on a separate disk with these properties (copy-paste from the GCE console):
Type: Standard persistent disk
Size: 2000 GB
Zone: us-central1-c
Sustained random IOPS limit - estimated (R/W): 1,500 / 3,000
Sustained throughput limit (MB/s) - estimated (R/W): 180 / 120
To speed up we tried to add an SSD disk:
Type: SSD persistent disk
Size: 1000 GB
Zone: us-central1-c
Sustained random IOPS limit - estimated (R/W): 15,000 / 15,000
Sustained throughput limit (MB/s) - estimated: 240 / 240
One would expect the SSD disk to be considerably faster than the Standard disk, but our results say otherwise. During the initial MongoDB sync the Standard disk was several times faster than the SSD: while the Standard disk copied all the data in 7 hours, the SSD disk had copied just half of it after 12 hours. Measuring with the Linux tool iostat, the Standard disk achieves around 80,000 kB_wrtn/s while the SSD disk is around 8,000. How is it possible that the SSD disk is 10 times slower than the Standard disk?
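One hedged way to separate the raw disk from MongoDB is to benchmark the SSD persistent disk directly with fio while watching iostat; note that persistent-disk performance also scales with the instance's vCPU count, and a single small-block write stream (closer to what an initial sync produces) will not reach the quoted limits. The mount point below is a placeholder:
# Placeholder mount point on the SSD persistent disk.
# Large sequential writes at high queue depth should approach the quoted ~240 MB/s limit;
# single-threaded 4k random writes will be far slower on any disk type.
fio --directory=/mnt/ssd-pd/fio_test --size=10g --direct=1 --ioengine=libaio --name=seqwrite_1m_qd32 --bs=1m --iodepth=32 --rw=write --numjobs=1 --group_reporting
fio --directory=/mnt/ssd-pd/fio_test --size=10g --direct=1 --ioengine=libaio --name=randwrite_4k_qd1 --bs=4k --iodepth=1 --rw=randwrite --numjobs=1 --group_reporting
# In another terminal, watch what the device itself is doing:
iostat -xm 5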

Why is swap not good when using an SSD?

On DigitalOcean I came across this message when I wanted to add swap:
Although swap is generally recommended for systems utilizing traditional spinning hard drives, using swap with SSDs can cause issues with hardware degradation over time. Due to this consideration, we do not recommend enabling swap on DigitalOcean or any other provider that utilizes SSD storage. Doing so can impact the reliability of the underlying hardware for you and your neighbors. This guide is provided as reference for users who may have spinning disk systems elsewhere.
If you need to improve the performance of your server on DigitalOcean, we recommend upgrading your Droplet. This will lead to better results in general and will decrease the likelihood of contributing to hardware issues that can affect your service.
Why is that? I thought swap was necessary for creating a stable server (so it doesn't run into memory issues).
I believe that this is your answer:
Early SSDs had a reputation for failing after fewer writes than HDDs. If swap was used often, the SSD might fail sooner. This might be why you heard it could be bad to use an SSD for swap.
Modern SSDs don't have this issue, and they should not fail any faster than a comparable HDD. Placing swap on an SSD will result in better performance than placing it on an HDD due to its faster speeds.
I believe this is referring to the fact that SSDs have a relatively limited lifetime, measured in the number of times data can be written to each memory location. Although that number has become large enough that using SSDs as storage drives is no longer a concern, swap, as an overflow for RAM, can potentially be written to quite frequently, thus reducing the overall life of the SSD.
SSD endurance is measured in so-called DWPD units, which stands for Drive (full) Writes Per Day. DWPD requirements differ greatly between the mobile, client, and enterprise storage market segments. SSD vendors usually state the warranty as, for example, 0.8 DWPD / 3 years or 3.0 DWPD / 5 years. The first example means that writing 80% of the drive capacity every single day will result in a 3-year lifetime. Technically, you can kill a 480GB drive (say, with a 1 DWPD / 3 years warranty) within about 12 days by performing non-stop writes at 500 MB/s.
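A hedged back-of-the-envelope check of that 12-day figure, using only the numbers from the example above:
awk 'BEGIN {
  capacity_gb = 480; dwpd = 1; warranty_days = 3 * 365;   # 1 DWPD over a 3-year warranty
  rated_writes_gb = capacity_gb * dwpd * warranty_days;   # ~525,600 GB of rated writes
  daily_writes_gb = 0.5 * 86400;                          # 500 MB/s non-stop = 43,200 GB/day
  printf "days to exhaust rated endurance: %.1f\n", rated_writes_gb / daily_writes_gb;
}'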
SSDs show much higher throughput than HDDs, but at the same time a much lower endurance level, partly due to the physical structure of the media and its mapping. For example, when 1GB of user data is written to an HDD, the physical media internally receives around 10% more data (metadata, error-protection data, etc.). The ratio between the host data amount and the internal data amount is called the Write Amplification Factor (WAF). An SSD, in comparison, may need to write 4 times more data than it receives from the host. Pure random access is the worst scenario: writing 1GB of host data results in 4GB being written to the internal flash media. With purely sequential write access, the WAF for SSDs will be close to 1.0, as for HDDs.
Enabling system swap, and using it intensively (probably due to a DRAM shortage), will generate more random access to the SSD, so endurance will degrade more quickly than with swap disabled. Unless you are running an enterprise system with non-stop I/O traffic to the SSD, I would not expect enabling swap to affect SSD endurance much. You can always monitor the SSD SMART health parameter called SSD Life Left; watching how it changes over time with and without swap enabled will help you make a decision.
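As a hedged example of that kind of monitoring with smartmontools (device paths are placeholders; the exact attribute name varies by vendor, and NVMe drives expose a standard Percentage Used field instead):
# SATA/SAS SSD: look for a wear attribute such as SSD_Life_Left, Wear_Leveling_Count
# or Media_Wearout_Indicator in the attribute table.
sudo smartctl -A /dev/sda
# NVMe SSD: the health log includes "Percentage Used" and "Data Units Written".
sudo smartctl -a /dev/nvme0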