Troubleshooting Latency Increase for Lambda to EFS Reads - amazon-efs

The Gist
We've got a Lambda job running that reads data from EFS (elastic throughput) at up to 200 TPS of read requests.
PercentIOLimit is well below 20%.
Latency goes from about 20 ms to about 400 ms during traffic spikes.
Are there any steps I can take to get more granularity into where the latency for the reads is coming from?
Additional Info:
At low TPS (~5), reads take about 10-20 ms.
At higher TPS (~50), p90 can take 300-400 ms.
I'd really like to narrow down which limit is causing these latency spikes, especially when PercentIOLimit usage is around 60%.
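One way to get more granularity is to measure the read latency client-side, inside the handler, so the EFS time can be separated from cold starts, downstream calls, and other invocation overhead. Below is a minimal Python sketch; the mount path, event shape, and metric name are hypothetical placeholders, not anything EFS- or Lambda-specific.

import time

EFS_MOUNT = "/mnt/efs"  # hypothetical mount point configured on the function

def lambda_handler(event, context):
    path = f"{EFS_MOUNT}/{event['key']}"  # hypothetical event shape

    start = time.perf_counter()
    with open(path, "rb") as f:
        data = f.read()
    elapsed_ms = (time.perf_counter() - start) * 1000

    # Structured log line; a CloudWatch metric filter (or embedded metric
    # format) can turn this into a per-invocation latency metric.
    print({"metric": "efs_read_ms", "value": round(elapsed_ms, 2), "bytes": len(data)})
    return {"bytes": len(data)}

Plotting that per-read distribution next to PercentIOLimit and the other EFS CloudWatch metrics over the same window makes it easier to see whether the spikes line up with a file-system limit or with something on the client side.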

Related

Understand CPU utilisation with image preprocessing applications

I'm trying to understand how to compute the CPU utilisation for audio and video use cases.
In real time audio applications, this is what I typically do:
if an application takes 4ms to process 28ms of audio data, I say that the CPU utilisation is 14.28% (4/28).
How should this be done for applications like resize/crop? Let's say I'm resizing a 162*122 image to a 128*128 image at 1 FPS, and it takes 11 ms. What would be the CPU utilisation?
CPU utilization is quite complicated, and strongly depends on stuff like:
The CPU itself
The algorithms utilized for the task
Other tasks running on the CPU at the same time
CPU utilization is also strongly related to your machine's process scheduling, and hence to the operating system used. Most operating systems expose some kind of API for CPU utilization diagnostics, but such APIs are highly platform-dependent.
But how do CPU utilization calculations work anyway?
The simplest way to calculate CPU utilization is to take a (for example) 1-second period, observe how long the CPU spent doing useful work (as opposed to idling, i.e. not executing any processes), and divide that by the time interval you selected. For example, if the CPU did useful calculations for 10 milliseconds and you were observing for 500 ms, the CPU utilization is 2%.
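As an illustration of that idle-time sampling (not part of the original answer), on Linux the counters in /proc/stat can be read at the start and end of the window:

import time

def cpu_times():
    # First line of /proc/stat: "cpu  user nice system idle iowait irq softirq steal ..."
    with open("/proc/stat") as f:
        fields = [int(x) for x in f.readline().split()[1:]]
    idle = fields[3] + fields[4]  # idle + iowait columns
    return idle, sum(fields)

idle_a, total_a = cpu_times()
time.sleep(1.0)  # the observation window
idle_b, total_b = cpu_times()

idle_fraction = (idle_b - idle_a) / (total_b - total_a)
print(f"CPU utilization over the window: {(1 - idle_fraction) * 100:.1f}%")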
Answering your question / TL;DR
You can apply this principle in your program. For the case you provided (processing video), it can be done in more or less the same way: you measure how long it takes to process one frame and divide that by the length of a frame (1 / FPS). Of course, this can be done over a longer period to get a more accurate reading: track how much time it takes to process, for example, 2 seconds of video, and divide that by 2. Then you'll have your CPU utilization.
NOTE: if you aren't able to process a frame in time, for example your video is 10 FPS (0.1 s per frame) and processing one frame takes 0.5 s, then your CPU utilization will seemingly be 500%. Obviously you can't utilize more than 100% of your CPU, so you should just cap the reported utilization at 100%.
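Putting the per-frame approach and the 100% cap together, a small sketch (process_frame() is a stand-in for your own resize/crop step, and the FPS value is just an example):

import time

FPS = 1.0                  # frames per second of the input
FRAME_BUDGET = 1.0 / FPS   # seconds available per frame

def process_frame(frame):
    time.sleep(0.011)      # stand-in for ~11 ms of resize/crop work
    return frame

def frame_utilization(frame):
    start = time.perf_counter()
    process_frame(frame)
    elapsed = time.perf_counter() - start
    return min(elapsed / FRAME_BUDGET, 1.0)   # cap at 100%

print(f"CPU utilization: {frame_utilization(None) * 100:.2f}%")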

NVMe SSD's bandwidth decreases when increasing the number of I/O queues

As far as I have learned from the relevant articles about NVMe SSDs, one of their benefits is multiple I/O queues. By leveraging multiple NVMe I/O queues, the NVMe bandwidth can be better utilized.
However, what I have found from my own experiment does not agree with that.
I want to do parallel 4k-granularity sequential reads from an NVMe SSD. I'm using Samsung 970 EVO Plus 250GB. I used FIO to benchmark the SSD. The command I used is:
fio --size=1000m --directory=/home/xxx/fio_test/ --ioengine=libaio --direct=1 --name=4kseqread --bs=4k --iodepth=64 --rw=read --numjobs 1/2/4 --group_reporting
And below is what I got testing 1/2/4 parallel sequential reads:
numjobs=1: 1008.7MB/s
numjobs=2: 927 MB/s
numjobs=4: 580 MB/s
Even if it does not increase bandwidth, I would expect increasing the number of I/O queues to at least keep the same bandwidth as single-queue performance. The bandwidth decrease is a little counter-intuitive. What are the possible reasons for the decrease?
Thank you.
I would like to highlight three reasons why you may see this issue:
Effective queue depth is too high,
Capacity under test is limited to only 1 GB,
Lack of drive preconditioning.
First, the parameter --iodepth=X is specified per job. That means in your last experiment (--iodepth=64 and --numjobs=4) the effective queue depth is 4x64=256. This may be too high for your drive. Based on the vendor specification of your 250GB drive, a 4KB random read should show 250 KIOPS (1 GB/s) at a queue depth of 32. By this, the vendor is stating that QD32 is about optimal for the drive to reach its best performance. If we increase QD beyond that, commands start aggregating and waiting in the submission queue. That does not improve performance; on the contrary, it starts to consume system resources (CPU, memory) and degrades throughput.
Second, limiting the capacity under test to such a small range (1 GB) can cause a lot of collisions inside the SSD. This is the situation where reads hit the same physical read unit of the media (aka die, aka LUN). In that situation, new reads have to wait for the previous ones to complete. Increasing the test capacity to the entire drive, or at least to 50-100 GB, should minimize the collisions.
Third, in order to get performance numbers as per the specification, the drive needs to be preconditioned accordingly. For measuring sequential and random reads, it is better to use a full-drive sequential precondition. The command below performs a 128 KB sequential write at an effective QD of 32 (8 jobs x QD4) across the entire drive capacity (point it at the device under test, e.g. via --filename):
fio --size=100% --ioengine=libaio --direct=1 --name=128KB_SEQ_WRITE_QD32 --bs=128k --iodepth=4 --rw=write --numjobs=8
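As a side note (not part of the original answer), the iodepth x numjobs interaction from the first point above is easy to sweep with a small wrapper that reads fio's JSON output. This sketch assumes fio is installed and reuses the test directory from the question:

import json
import subprocess

DIRECTORY = "/home/xxx/fio_test/"   # same path as in the question

for numjobs in (1, 2, 4):
    cmd = [
        "fio", "--size=1000m", f"--directory={DIRECTORY}",
        "--ioengine=libaio", "--direct=1", "--name=4kseqread",
        "--bs=4k", "--iodepth=64", "--rw=read",
        f"--numjobs={numjobs}", "--group_reporting",
        "--output-format=json",
    ]
    out = subprocess.run(cmd, capture_output=True, check=True).stdout
    jobs = json.loads(out)["jobs"]
    bw_kib = sum(job["read"]["bw"] for job in jobs)   # aggregate read bandwidth, KiB/s
    print(f"numjobs={numjobs} (effective QD={64 * numjobs}): {bw_kib / 1024:.0f} MiB/s")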

On AWS RDS Postgres, what could cause disk latency to go up while IOPS / throughput go down?

I'm investigating an approximately 3 hour period of increased query latency on a production Postgres RDS instance (m4.xlarge, 400 GiB of gp2 storage).
The driver seems to be a spike in both read and write disk latencies: I see them going from a baseline of ~0.0005 s up to a peak of 0.0136 s write latency / 0.0081 s read latency.
I also see an increase in disk queue depth from a baseline of around 2, to a peak of 14.
When there's a spike in disk latencies, I generally expect to see an increase in data being written to disk. But read IOPS, write IOPS, read throughput, and write throughput all went down (by approximately 50%) during the time when latency was elevated.
I also have server-side metrics on the total query volume I'm sending (measured in both queries per second and amount of data written: this is a write-heavy workload), and those metrics were flat during this time period.
I'm at a loss for what to investigate next. What are possible reasons that disk latency could increase while IOPS go down?
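For what it's worth, the metrics described above can be pulled over the same window and compared minute by minute; a boto3 sketch with a placeholder instance identifier and time range (RDS reports the latency metrics in seconds):

from datetime import datetime, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
INSTANCE_ID = "my-db-instance"   # placeholder
START = datetime(2023, 1, 1, 0, 0, tzinfo=timezone.utc)   # placeholder window
END = datetime(2023, 1, 1, 3, 0, tzinfo=timezone.utc)

for metric in ("ReadLatency", "WriteLatency", "ReadIOPS", "WriteIOPS",
               "ReadThroughput", "WriteThroughput", "DiskQueueDepth"):
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName=metric,
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": INSTANCE_ID}],
        StartTime=START,
        EndTime=END,
        Period=60,
        Statistics=["Average"],
    )
    points = sorted(stats["Datapoints"], key=lambda p: p["Timestamp"])
    peak = max((p["Average"] for p in points), default=0.0)
    print(f"{metric}: {len(points)} datapoints, peak 1-min average {peak:.4f}")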

aerospike bad latencies with aws

We have Aerospike running in SoftLayer on bare-metal machines in a 2-node cluster. Our average profile size is 1.5 KB and, at peak, operations are around 6000 ops/sec on each node. The latencies are fine: at peak, around 5% of operations take > 1 ms.
Now we have planned to migrate to AWS, so we booted 2 i3.xlarge machines. We ran the benchmark with the 1.5 KB object size at 3x load; results were satisfactory, around 4-5% (> 1 ms). Then we started actual processing: the latencies at peak jumped to 25-30% (> 1 ms), and the maximum each node can accommodate is some 5K ops/sec. So we added one more node and ran the benchmark again (4.5 KB object size and 3x load); the results were 2-4% (> 1 ms). After adding it to the cluster, the peak came down to 16-22%. We added one more node and the peak is now at 10-15%.
The version in AWS is aerospike-server-community-3.15.0.2; the version in SoftLayer is Aerospike Enterprise Edition 3.6.3.
Our config is as follows:
#Aerospike database configuration file.
service {
user xxxxx
group xxxxx
run-as-daemon
paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
pidfile /var/run/aerospike/asd.pid
service-threads 8
transaction-queues 8
transaction-threads-per-queue 8
proto-fd-max 15000
}
logging {
#Log file must be an absolute path.
file /var/log/aerospike/aerospike.log {
context any info
}
}
network {
service {
port 13000
address h1 reuse-address
}
heartbeat {
mode mesh
port 13001
address h1
mesh-seed-address-port h1 13001
mesh-seed-address-port h2 13001
mesh-seed-address-port h3 13001
mesh-seed-address-port h4 13001
interval 150
timeout 10
}
fabric {
port 13002
address h1
}
info {
port 13003
address h1
}
}
namespace XXXX {
replication-factor 2
memory-size 27G
default-ttl 10d
high-water-memory-pct 70
high-water-disk-pct 60
stop-writes-pct 90
storage-engine device {
device /dev/nvme0n1
scheduler-mode noop
write-block-size 128K
}
}
What should be done to bring down the latencies in AWS?
This comes down to the difference in the performance characteristics of the SSDs of the i3 nodes compared to what you had on SoftLayer. If you ran Aerospike on a floppy disk you'd get 0.5 TPS.
Piyush's comment mentions ACT, the open source tool Aerospike has created to benchmark SSDs with real database workloads. The point of ACT is to find the sustained rate at which the SSD can be relied on to deliver the latency you want. Burst rates don't matter much for databases.
The performance engineering team at Aerospike has used ACT to find what the i3 1900G SSD can do, and published the results in a post. Its ACT rating is 4x, meaning that the full 1900G SSD can do 8Ktps reads, 4Ktps writes with the standard 1.5K object size and 128K block size, and stay at 95% < 1ms, 99% < 8ms, 99.9% < 64ms. This is not particularly good for an SSD. By comparison, a Micron 9200 PRO rates at 94.5x, nearly 24 times the TPS load. What's more, with the i3.xlarge you're sharing half that drive with a neighbor. There's no way to cap the IOPS so that you each get half; there's only a partition of the storage. This means that you can expect latency spikes originating from the neighbor. The i3.2xlarge is the smallest instance that gives you the entire SSD.
So, you take the ACT information and you use it to do capacity planning. The main factors you need to know are the average object size (you can find that using objsz histogram), number of objects (again, available via asadm), peak read TPS and peak write TPS (how does the 60Ktps you mentioned split between reads and writes?).
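As a rough illustration of that capacity-planning arithmetic (the workload numbers below are placeholders, and it assumes reads are served from one replica while writes land on every replica), the ACT rating implied by the answer (1x = 2,000 reads/s + 1,000 writes/s of 1.5 KB objects, so 4x = 8K reads / 4K writes) can be compared with the per-drive load:

ACT_READS_PER_X = 2000       # reads/s per 1x ACT rating (1.5 KB objects)
ACT_WRITES_PER_X = 1000      # writes/s per 1x ACT rating
DRIVE_ACT_RATING = 4         # i3 1900G SSD, per the answer

peak_read_tps = 40_000       # placeholder: cluster-wide client reads/s
peak_write_tps = 20_000      # placeholder: cluster-wide client writes/s
replication_factor = 2       # from the namespace config above
nodes = 4

reads_per_drive = peak_read_tps / nodes
writes_per_drive = peak_write_tps * replication_factor / nodes

required = max(reads_per_drive / ACT_READS_PER_X,
               writes_per_drive / ACT_WRITES_PER_X)
print(f"Required ACT rating per drive: {required:.1f}x (drive provides {DRIVE_ACT_RATING}x)")

With your real peak read/write split, the same comparison shows roughly how many nodes of a given drive you would need.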
Check your logs for your cache-read-pct values. If they're in the range of 10% or higher, you should raise your post-write-queue value to get better read latencies (and also reduce IOPS pressure on the drive).

What is the meaning of 99th percentile latency and throughput

I've read some articles benchmarking the performance of stream processing engines like Spark Streaming, Storm, and Flink. In the evaluation part, the criteria were 99th percentile latency and throughput. For example, Apache Kafka sent data at around 100,000 events per second, the three engines acted as stream processors, and their performance was described using 99th percentile latency and throughput.
Can anyone clarify these two criteria for me?
A 99th percentile latency of X milliseconds in a streaming job means that 99% of the items arrived at the end of the pipeline in less than X milliseconds. Read this reference for more details.
When application developers expect a certain latency, they often need a latency bound. We measure several latency bounds for the stream record grouping job, which shuffles data over the network. The following figure shows the median latency observed, as well as the 90th, 95th, and 99th percentiles (a 99th percentile latency of 50 milliseconds, for example, means that 99% of the elements arrive at the end of the pipeline in less than 50 milliseconds).
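To make the definition concrete, here is a small sketch (synthetic latency values, nearest-rank percentiles) that reports the same percentiles as the quoted passage:

import random

random.seed(0)
# Synthetic end-to-end latencies in milliseconds for 100,000 records.
latencies_ms = [random.lognormvariate(2.5, 0.6) for _ in range(100_000)]

def percentile(values, pct):
    ordered = sorted(values)
    index = min(int(len(ordered) * pct / 100), len(ordered) - 1)
    return ordered[index]

for pct in (50, 90, 95, 99):
    print(f"p{pct}: {percentile(latencies_ms, pct):.1f} ms")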