aerospike bad latencies with aws - latency

We have Aerospike running on bare-metal machines in SoftLayer, in a 2-node cluster. Our average profile size is 1.5 KB, and at peak each node handles around 6000 ops/sec. Latencies are fine: at peak, around 5% of operations take > 1ms.
We now plan to migrate to AWS, so we booted 2 i3.xlarge machines. We ran the benchmark with a 1.5KB object size at 3x load, and the results were satisfactory: around 4-5% of operations took > 1ms. When we started actual processing, the share of operations taking > 1ms jumped to 25-30% at peak, and the maximum the cluster could accommodate was about 5K ops/sec. So we added one more node and benchmarked it (4.5KB object size and 3x load); the results were 2-4% > 1ms. After adding it to the cluster, the peak came down to 16-22%. We added one more node, and the peak is now at 10-15%.
The version in AWS is aerospike-server-community-3.15.0.2; the version in SoftLayer is Aerospike Enterprise Edition 3.6.3.
Our config is as follows:
# Aerospike database configuration file.
service {
    user xxxxx
    group xxxxx
    run-as-daemon
    paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
    pidfile /var/run/aerospike/asd.pid
    service-threads 8
    transaction-queues 8
    transaction-threads-per-queue 8
    proto-fd-max 15000
}
logging {
    # Log file must be an absolute path.
    file /var/log/aerospike/aerospike.log {
        context any info
    }
}
network {
    service {
        port 13000
        address h1 reuse-address
    }
    heartbeat {
        mode mesh
        port 13001
        address h1
        mesh-seed-address-port h1 13001
        mesh-seed-address-port h2 13001
        mesh-seed-address-port h3 13001
        mesh-seed-address-port h4 13001
        interval 150
        timeout 10
    }
    fabric {
        port 13002
        address h1
    }
    info {
        port 13003
        address h1
    }
}
namespace XXXX {
    replication-factor 2
    memory-size 27G
    default-ttl 10d
    high-water-memory-pct 70
    high-water-disk-pct 60
    stop-writes-pct 90
    storage-engine device {
        device /dev/nvme0n1
        scheduler-mode noop
        write-block-size 128K
    }
}
What should be done to bring down latencies in aws?

This comes down to the difference in the performance characteristics of the SSDs of the i3 nodes, compared to what you had on Softlayer. If you ran Aerospike on a floppy disk you'd get 0.5TPS.
Piyush's comment mentions ACT, the open source tool Aerospike has created to benchmark SSDs with real database workloads. The point of ACT is to find the sustained rate at which the SSD can be relied on to deliver the latency you want. Burst rates don't matter much for databases.
The performance engineering team at Aerospike has used ACT to find what the i3 1900G SSD can do, and published the results in a post. Its ACT rating is 4x, meaning that the full 1900G SSD can do 8Ktps reads and 4Ktps writes with the standard 1.5K object size and 128K block size, and stay at 95% < 1ms, 99% < 8ms, 99.9% < 64ms. This is not particularly good for an SSD. By comparison, a Micron 9200 PRO rates at 94.5x, nearly 24 times the TPS load. What's more, with the i3.xlarge you get only half of that drive and share it with a neighbor. There's no way to cap the IOPS so that you each get half; only the storage is partitioned. This means that you can expect latency spikes originating in the neighbor. The i3.2xlarge is the smallest instance that gives you the entire SSD.
So, you take the ACT information and use it to do capacity planning. The main factors you need to know are the average object size (you can find that using the objsz histogram), the number of objects (again, available via asadm), and the peak read TPS and peak write TPS (how do the roughly 6K ops/sec per node you mentioned split between reads and writes?).
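As a rough illustration of that capacity check, here is a minimal Python sketch. It assumes that a 1x ACT rating corresponds to 2000 reads/s and 1000 writes/s (derived from the 4x = 8Ktps reads / 4Ktps writes figure above), and it treats the read and write budgets independently, which is a simplification. The function names and the example workload numbers are hypothetical, and the sketch ignores replica writes and defragmentation overhead, so treat the result as an optimistic lower bound.

```python
# Assumption: 1x ACT = 2000 reads/s + 1000 writes/s at 1.5K objects,
# 128K block size (derived from "4x = 8Ktps reads, 4Ktps writes").
ACT_1X_READS = 2000
ACT_1X_WRITES = 1000


def required_act_rating(peak_read_tps, peak_write_tps):
    """Smallest ACT multiple that covers both the read and the write peak."""
    return max(peak_read_tps / ACT_1X_READS, peak_write_tps / ACT_1X_WRITES)


def fits_drive(peak_read_tps, peak_write_tps, drive_rating):
    """True if the per-drive peak load stays within the drive's ACT rating."""
    return required_act_rating(peak_read_tps, peak_write_tps) <= drive_rating


# Hypothetical per-node workloads checked against a 4x drive.
print(required_act_rating(4000, 2000))   # needs a 2x drive
print(fits_drive(4000, 2000, 4.0))       # within a 4x rating
print(fits_drive(10000, 2000, 4.0))      # reads alone need 5x
```

On a shared drive such as the one in the i3.xlarge, the rating of the whole SSD is not guaranteed to you, so leave extra headroom.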
Check your logs for your cache-read-pct values. If they're in the range of 10% or higher, you should raise your post-write-queue value to get better read latencies (and also reduce IOPS pressure on the drive).

Related

How many Postgres connections is normal for a 4 CPU & 32 GB machine (db.m5.2xlarge)?

We have about 100 concurrent connections open and I'm wondering if that is suitable or we should be working to reduce that number. CPU is about 80% utilized.
I'm happy to extrapolate from other machine sizes if you have numbers for larger or smaller VMs.

Erasure Coded Pool suggested PG count

I'm messing around with the pg calculator to figure out the best pg count for my cluster. I have an erasure coded FS pool which will most likely use half the space of the cluster in the foreseeable future. But the pg calculator only has options for replicated pools. Should I just enter the erasure-code ratio as the replica #, or is there another way around this?
From Ceph Nautilus version onwards there's a pg-autoscaler that does the scaling for you. You just need to create a pool with an initial (maybe low) value. As for the calculation itself your assumption is correct, you take the number of chunks into account when planning the pg count.
From the Red Hat docs:
3.3.4. Calculating PG Count
If you have more than 50 OSDs, we recommend approximately 50-100 placement groups per OSD to balance out resource usage, data durability and distribution. If you have less than 50 OSDs, choosing among the PG Count for Small Clusters is ideal. For a single pool of objects, you can use the following formula to get a baseline:
Total PGs = (OSDs * 100) / pool size
Where pool size is either the number of replicas for replicated pools or the K+M sum for erasure coded pools (as returned by ceph osd erasure-code-profile get).
You should then check if the result makes sense with the way you designed your Ceph cluster to maximize data durability, data distribution and minimize resource usage.
The result should be rounded up to the nearest power of two. Rounding up is optional, but recommended for CRUSH to evenly balance the number of objects among placement groups.
For a cluster with 200 OSDs and a pool size of 3 replicas, you would estimate your number of PGs as follows:
(200 * 100) / 3 = 6667. Nearest power of 2: 8192
With 8192 placement groups distributed across 200 OSDs, that evaluates to approximately 41 placement groups per OSD. You also need to consider the number of pools you are likely to use in your cluster, since each pool will create placement groups too. Ensure that you have a reasonable maximum PG count.
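The baseline formula quoted above can be sketched in Python as follows. The function name and the 4+2 erasure-code profile in the example are hypothetical; the pool size for an erasure coded pool is taken as K+M, as the quoted docs describe.

```python
import math


def suggested_pg_count(osds, pool_size, pgs_per_osd=100):
    """Baseline PG count: (OSDs * 100) / pool size, rounded up to a power of two.

    pool_size is the replica count for replicated pools,
    or K + M for erasure coded pools.
    """
    raw = math.ceil(osds * pgs_per_osd / pool_size)
    # Round up to the next power of two (a value that already is one is kept).
    return 1 << (raw - 1).bit_length()


# The replicated example from the quoted docs: 200 OSDs, 3 replicas.
print(suggested_pg_count(200, 3))        # 8192

# Hypothetical erasure coded pool with K=4, M=2 on the same cluster.
print(suggested_pg_count(200, 4 + 2))    # 4096
```

This only gives a starting point for a single pool; with several pools, or on Nautilus and later with the pg-autoscaler, the final numbers may differ.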

Choosing the compute resources of the nodes in the cluster with horizontal scaling

Horizontal scaling means that we scale by adding more machines into the pool of resources. Still, there is a choice of how much power (CPU, RAM) each node in the cluster will have.
When a cluster is managed with Kubernetes, it is extremely easy to set any CPU and memory limit for Pods. How do you choose the optimal CPU and memory size for cluster nodes (or Pods in Kubernetes)?
For example, there are 3 nodes in a cluster with 1 vCPU and 1GB RAM each. To handle more load there are 2 options:
Add the 4th node with 1 vCPU and 1GB RAM
Add to each of the 3 nodes more power (e.g. 2 vCPU and 2GB RAM)
A straightforward solution is to calculate the throughput and cost of each option and choose the cheaper one. Are there any more advanced approaches for choosing the compute resources of the nodes in a cluster with horizontal scalability?
For this particular example I would go for 2 vCPUs per node instead of another 1 vCPU node, but that is mainly because I believe running an OS for anything serious on a single vCPU is just wrong. To behave decently, a system needs 2+ cores available; otherwise it's too easy to overwhelm that one vCPU and bring the node down. There is no ideal algorithm for this, though. It will depend on your budget, on the characteristics of your workloads, etc.
As a rule of thumb, don't stick to instances that are too small: a bunch of stuff has to run on them always, regardless of their size, and the more nodes, the more overhead. 3x 4 vCPU + 16/32GB RAM sounds like a nice plan for starters, but again... it depends on what you want, need and can afford.
The answer is related to such performance metrics as latency and throughput:
Latency is the time interval between sending a request and receiving the response.
Throughput is the request processing rate (requests per second).
Latency influences throughput: higher latency = lower throughput.
If a business transaction consists of multiple sequential calls to services that can't be parallelized, then compute resources (CPU and memory) have to be chosen based on the desired latency value. Adding more instances of the services (horizontal scaling) will not have any positive influence on the latency in this case.
Adding more instances of the service increases throughput, allowing more requests to be processed in parallel (if there are no bottlenecks).
In other words, allocate CPU and memory resources so that the service has the desired response time, and add more service instances (scale horizontally) to handle more requests in parallel.
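The relationship between the two metrics can be illustrated with a small Python sketch. It assumes a simplified model in which each instance handles a fixed number of requests concurrently and there are no other bottlenecks; the function names and all the numbers are hypothetical.

```python
def transaction_latency_ms(call_latencies_ms):
    """Latency of a transaction made of sequential, non-parallelizable calls."""
    return sum(call_latencies_ms)


def max_throughput_rps(instances, concurrency_per_instance, latency_ms):
    """Upper bound on requests/second if every slot is always busy."""
    return instances * concurrency_per_instance * 1000 / latency_ms


# Three sequential service calls: the transaction takes the sum of them,
# no matter how many instances of each service are running.
latency = transaction_latency_ms([20, 30, 50])
print(latency)                               # 100 ms

# Scaling out raises throughput but leaves latency untouched.
print(max_throughput_rps(3, 10, latency))    # 300.0 requests/second
print(max_throughput_rps(4, 10, latency))    # 400.0 requests/second

# Scaling up (assumed here to halve the latency) helps both.
print(max_throughput_rps(3, 10, latency / 2))  # 600.0 requests/second
```

Whether more CPU and memory actually reduce latency depends on the workload, so the last line is an assumption to be measured, not a given.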

On AWS RDS Postgres, what could cause disk latency to go up while IOPS / throughput go down?

I'm investigating an approximately 3 hour period of increased query latency on a production Postgres RDS instance (m4.xlarge, 400 GiB of gp2 storage).
The driver seems to be a spike in both read and write disk latencies: I see them going from a baseline of ~0.0005 up to a peak of 0.0136 write latency / 0.0081 read latency.
I also see an increase in disk queue depth from a baseline of around 2, to a peak of 14.
When there's a spike in disk latencies, I generally expect to see an increase in data being written to disk. But read IOPS, write IOPS, read throughput, and write throughput all went down (by approximately 50%) during the time when latency was elevated.
I also have server-side metrics on the total query volume I'm sending (measured in both queries per second and amount of data written: this is a write-heavy workload), and those metrics were flat during this time period.
I'm at a loss for what to investigate next. What are possible reasons that disk latency could increase while IOPS go down?

Scaling Kafka for Throughput

I have set up a sample Kafka cluster on AWS and am trying to identify the maximum throughput possible with the given configuration. I am currently following the post provided here for this analysis.
https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
I would appreciate it if you could clarify the following issues.
I observed a throughput of 40MB/s for messages of size 512 bytes ( single producer - single consumer ) with given hardware. Assume I need to achieve a throughput of 80MB/s.
As I understand it, one way to do this is to increase the number of partitions per topic and increase the number of threads in the producer and consumer. ( Assuming I do not change the default values for batch size, compression ratio etc. )
1. How to find the maximum throughput possible with the given hardware, i.e. the point after which we are required to improve our hardware resources if we are to further improve the throughput?
( In other words, how to make the decision "With X GB RAM and Y GB disk space this is the maximum throughput I can achieve. If I need to further improve the throughput I have to upgrade RAM to XX GB and disk space to YY GB" )
2. Should we scale the cluster vertically or horizontally? What is the recommended approach?
Thank you.
If we define throughput as the volume of data transmitted over the network per second, the maximum throughput cannot exceed number of machines * bandwidth per machine. For a single machine with a 1Gbps NIC, the maximum throughput cannot exceed 1Gbps. In your case, throughput is 40MB/s, i.e. 320Mbps, which is well below 1Gbps, so there is still room for improvement. However, if your target is far larger than 1Gbps, you definitely need more machines.
AFAIK, bandwidth is the most likely system bottleneck. Unlike CPU and RAM, it's not easy to scale vertically, so scaling horizontally might be an option.
You could do some math before scaling. Say the throughput target is "produce 2 billion 512-byte records in 1 hour". That means the throughput has to reach 2,000,000,000 * 8 * 512 / 3600 / 1024 / 1024 = 2170Mbps. Assuming the usable bandwidth of a single machine is 700Mbps (over 70% usage normally brings packet loss), at least 4 machines should be planned for the producer application.
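The same arithmetic as a short Python sketch, in case you want to plug in your own targets. The function names are hypothetical, and the 700Mbps usable bandwidth is the assumption from the answer above, not a measured value.

```python
import math


def required_mbps(records, record_bytes, seconds):
    """Network throughput needed to send `records` records in `seconds`."""
    bits = records * record_bytes * 8
    return bits / seconds / 1024 / 1024


def machines_needed(target_mbps, usable_mbps_per_machine):
    """Minimum number of machines so that none exceeds its usable bandwidth."""
    return math.ceil(target_mbps / usable_mbps_per_machine)


# 2 billion 512-byte records in 1 hour, 700Mbps usable per machine.
target = required_mbps(2_000_000_000, 512, 3600)
print(round(target))                  # 2170
print(machines_needed(target, 700))   # 4
```

This only bounds the network side; disk, CPU and replication traffic on the brokers can become the limit earlier.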