How many Postgres connections is normal for a 4 CPU & 32 GB machine (db.m5.2xlarge)?

We have about 100 concurrent connections open and I'm wondering whether that is reasonable or whether we should be working to reduce that number. CPU is about 80% utilized.
I'm happy to extrapolate from other machine sizes if you have numbers for larger or smaller VMs.
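For reference, a quick way to see how many of those connections are actually doing work at any instant (a minimal sketch, assuming psycopg2 is installed and credentials come from the usual libpq environment variables):

import psycopg2

# Group current client connections by state (active, idle, idle in transaction, ...).
# pg_stat_activity.backend_type exists on PostgreSQL 10+.
conn = psycopg2.connect("")  # empty DSN: uses PGHOST/PGUSER/PGDATABASE etc. from the environment
with conn.cursor() as cur:
    cur.execute(
        """
        SELECT state, count(*)
        FROM pg_stat_activity
        WHERE backend_type = 'client backend'
        GROUP BY state
        ORDER BY count(*) DESC
        """
    )
    for state, n in cur.fetchall():
        print(f"{state or 'unknown'}: {n}")
conn.close()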

Related

Is there a way to set worker weight?

I have two machines for the load test. One machine has worse CPU performance, and it reaches high CPU usage as the number of users keeps increasing, while the other machine's CPU usage stays low. Locust complains:
[2022-07-28 11:22:15,529] PF1YW96X-MUO/WARNING/root: CPU usage above 90%! This may constrain your throughput and may even give inconsistent response time measurements! See https://docs.locust.io/en/stable/running-locust-distributed.html for how to distribute the load over multiple CPU cores or machines
[2022-07-28 11:25:06,766] PF1YW96X-MUO/WARNING/locust.runners: CPU usage was too high at some point during the test! See https://docs.locust.io/en/stable/running-distributed.html for how to distribute the load over multiple CPU cores or machines
I want to set a lower weight for the machine with the worse CPU performance. Is there a way to do that?
You can run fewer worker processes on the weak machine. If necessary you could run more than one process per core on the strong machine, just to make it take more Users.
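For illustration, a minimal launcher along these lines can start a different number of worker processes per machine, e.g. 2 on the weak box and 6-8 on the strong one (a sketch: the master address and worker counts are placeholders; --worker and --master-host are Locust's distributed-mode flags):

import subprocess
import sys

MASTER_HOST = "192.0.2.10"  # placeholder: address of the machine running `locust --master`
# Pass the worker count on the command line, e.g. 2 on the weak machine, 8 on the strong one.
num_workers = int(sys.argv[1]) if len(sys.argv) > 1 else 2

procs = [
    subprocess.Popen(["locust", "-f", "locustfile.py", "--worker", "--master-host", MASTER_HOST])
    for _ in range(num_workers)
]
for p in procs:
    p.wait()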

Using a cluster of Raspberry Pi 4 as a cluster for number crunching?

So I am currently developing an algorithm in MATLAB that is computationally expensive but is parallel processing friendly. Given that, I have been using the parallel processing library but I am still falling short of my computation time goals.
I am currently running my algorithm on an Intel i7 8086k CPU (6 cores, 12 logical, @ 4.00 GHz, turbo is 5 GHz).
Here are my questions:
If I was to purchase, let's say, 10 Raspberry Pi 4 SBCs (4 cores @ 1.5 GHz), could I use my main desktop as the host and the Pis as the clients? (Let us assume I migrate my algorithm to C++ and run it on Ubuntu for now.)
1a. If I was to go through with the build in question 1, will there be a significant upgrade in computation for the ~$500 spent?
1b. If I am not able to use my desktop as the host (I believe this shouldn't be an issue), how many Raspberry Pis would I need to equal my current CPU, or how many would I need to make a Pi cluster advantageous vs. my computer?
Is it possible to run Windows on the host computer and Linux on the clients (Pis) so that I can continue using MATLAB?
Thanks for your help; any other advice and recommendations are welcome.
Does your algorithm bottleneck on raw FMA / FLOPS throughput? If so then a cluster of weak ARM cores is more trouble than it's worth. I'd expect a used Zen2 machine, or maybe Haswell or Broadwell, could be good if you can find one cheaply. (You'd have to look at core counts, clocks, and FLOPS/$. And whether the problem would still not be memory bottlenecked on an older system with less memory bandwidth.)
If you bottleneck instead on cache misses from memory bandwidth or latency (e.g. cache-unfriendly data layout), there might possibly be something to gain from having more weaker CPUs each with their own memory controller and cache, even if those caches are smaller than your Intel.
Does Matlab use your GPU at all (e.g. via OpenCL)? Your current CPU's peak double (FP64) throughput from the IA cores is 384 GFLOPS (from 2x 256-bit SIMD FMA instructions per clock cycle per core), while its integrated GPU is capable of 115.2 GFLOPS; for single precision it's 768 GFLOPS from your x86 cores vs. 460.8 GFLOPS from the GPU. These are theoretical max throughput numbers in both cases, so the iGPU on its own isn't a big win, but a discrete card is a different story.
Upgrading to a beefy GPU could be vastly more effective than a cluster of RPi4. e.g. https://en.wikipedia.org/wiki/FLOPS#Hardware_costs shows that cost per single-precision GFLOP in 2017 was about 5 cents, adding big GPUs to a cheapo CPU. Or 79 cents per double-precision GFLOP.
If your problem is GPU-friendly but Matlab hasn't been using your GPU, look into that. Maybe Matlab has options, or you could use OpenCL from C++.
will there be a significant upgrade in computation for the ~$500 spent?
RPi4 model B has a Broadcom BCM2711 SoC. The CPU is Cortex-A72.
Their cache hierarchy is 32 KB data + 48 KB instruction L1 cache per core, with a 1 MB shared L2 cache. That's weaker than your 4GHz i7 with 32k L1d + 256k L2 private per-core, and a shared 12MiB L3 cache. But faster cores waste more cycles for the same absolute time waiting for a cache miss, and the ARM chips run their DRAM at a competitive DDR4-2400.
RPi CPUs are not FP powerhouses. There's a large gap in the raw numbers, but with enough of them the throughput does add up.
https://en.wikipedia.org/wiki/FLOPS#FLOPs_per_cycle_for_various_processors shows that Cortex-A72 has peak FPU throughput of 2 double FLOPS per core per cycle, vs. 16 for Intel since Haswell, AMD since Zen2.
Dropping to single precision float improves x86 by a factor of 2, but A72 by a factor of 4. Apparently their SIMD units have lower throughput for FP64 instructions, as well as half the work per SIMD vector. (Some other ARM cores aren't extra slow for double, just the expected 2:1, like Cortex-A57 and A76.)
But all this is peak FLOPS throughput; coming close to it requires well-tuned code with good computational intensity (lots of work each time the data is loaded into cache, and/or into registers). A dense matrix multiply is the classic example: O(n^3) FPU work over O(n^2) data, in a way that makes cache-blocking possible. Prime95 is another example.
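To put a rough number on that, here is the arithmetic-intensity calculation for a dense FP64 matmul (illustrative values only):

# Arithmetic intensity of a dense n x n FP64 matrix multiply:
# ~2*n^3 FLOPs (one multiply + one add per element of each inner product)
# over ~3*n^2*8 bytes if A, B and C are each touched once from DRAM.
n = 4096
flops = 2 * n**3
bytes_touched = 3 * n**2 * 8
print(f"n={n}: about {flops / bytes_touched:.0f} FLOPs per byte of data")  # ~341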
Still, a rough back of the envelope calculation, being generous and assuming sustained non-turbo clocks for the Coffee Lake. (All 6 cores busy running 2x 256-bit FMA instructions per clock makes a lot of heat. That's literally what Prime95 does, so expect that level of power consumption if your code is that efficient.)
6 * 4GHz * 4 elements/vec * 2 vec/cycle = 192 G FMAs / sec = 384 GFLOP/sec on the CFL (counting an FMA as 2 FLOPs)
4 * 1.5GHz * 2 DP flops / clock = 12 GFLOP / sec per RPi.
With 5x RPi systems, that's 60 GFLOPS added to your existing 384 GFLOP.
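The same back-of-the-envelope numbers as a quick script, using only the figures above (theoretical peak, FMA counted as 2 FLOPs):

# Peak FP64 throughput, back of the envelope.
cfl_gflops = 6 * 4.0 * (4 * 2) * 2      # 6 cores * 4 GHz * (4 doubles/vec * 2 vec/cycle) * 2 FLOP/FMA = 384
rpi_gflops = 4 * 1.5 * 2                # 4 cores * 1.5 GHz * 2 DP FLOPs/cycle = 12 per board

boards = 5
added = boards * rpi_gflops             # 60 GFLOP/s for ~$500 of RPi 4 systems
print(f"Coffee Lake: {cfl_gflops:.0f} GFLOP/s, {boards}x RPi 4: +{added:.0f} GFLOP/s "
      f"({100 * added / cfl_gflops:.0f}% of the desktop's peak)")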
Doesn't sound worth the trouble to manage 5 RPi systems for a small fraction of your existing FP throughput. But again, if your problem has the right kind of parallelism, a GPU can run it much more efficiently. 60 GFLOPS for $500 is not a good deal compared to ~$50 for the same 60 GFLOPs from a high-end (in 2017) video card.
The GPU in an RPi might have some compute capability, but it's almost certainly not worth it compared to slapping a $500 discrete GPU into your existing machine, if your code is GPU-friendly.
Or your problem might not scale with theoretical max FLOPS, but instead perhaps with cache bandwidth or some other factor.
Is it possible to run Windows on the host computer and Linux on the clients (Pis) so that I can continue using MATLAB?
Zero clue; I'm only considering theoretical best case for efficient machine code running on these CPUs.

aerospike bad latencies with aws

We have Aerospike running on SoftLayer bare-metal machines in a 2-node cluster. Our profile objects average 1.5 KB, and at peak each node handles around 6,000 ops/sec. The latencies are all fine: at peak, around 5% of operations take > 1 ms.
Now we planned to migrate to AWS, so we booted 2 i3.xlarge machines. We ran the benchmark with the 1.5 KB object size at 3x the load; results were satisfactory, around 4-5% (> 1 ms). Then we started actual processing: the latencies at peak jumped to 25-30% (> 1 ms), and the maximum it can accommodate is about 5K ops/sec. So we added one more node and benchmarked again (4.5 KB object size and 3x load); the results were 2-4% (> 1 ms). After adding it to the cluster, the peak came down to 16-22%. We added one more node and the peak is now at 10-15%.
The version in AWS is aerospike-server-community-3.15.0.2; the version in SoftLayer is Aerospike Enterprise Edition 3.6.3.
Our config is as follows:
# Aerospike database configuration file.
service {
    user xxxxx
    group xxxxx
    run-as-daemon
    paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
    pidfile /var/run/aerospike/asd.pid
    service-threads 8
    transaction-queues 8
    transaction-threads-per-queue 8
    proto-fd-max 15000
}
logging {
    # Log file must be an absolute path.
    file /var/log/aerospike/aerospike.log {
        context any info
    }
}
network {
    service {
        port 13000
        address h1 reuse-address
    }
    heartbeat {
        mode mesh
        port 13001
        address h1
        mesh-seed-address-port h1 13001
        mesh-seed-address-port h2 13001
        mesh-seed-address-port h3 13001
        mesh-seed-address-port h4 13001
        interval 150
        timeout 10
    }
    fabric {
        port 13002
        address h1
    }
    info {
        port 13003
        address h1
    }
}
namespace XXXX {
    replication-factor 2
    memory-size 27G
    default-ttl 10d
    high-water-memory-pct 70
    high-water-disk-pct 60
    stop-writes-pct 90
    storage-engine device {
        device /dev/nvme0n1
        scheduler-mode noop
        write-block-size 128K
    }
}
What should be done to bring down the latencies in AWS?
This comes down to the difference in the performance characteristics of the SSDs of the i3 nodes, compared to what you had on SoftLayer. If you ran Aerospike on a floppy disk you'd get 0.5 TPS.
Piyush's comment mentions ACT, the open source tool Aerospike has created to benchmark SSDs with real database workloads. The point of ACT is to find the sustained rate at which the SSD can be relied on to deliver the latency you want. Burst rates don't matter much for databases.
The performance engineering team at Aerospike has used ACT to find what the i3 1900G SSD can do, and published the results in a post. Its ACT rating is 4x, meaning that the full 1900G SSD can do 8Ktps reads and 4Ktps writes with the standard 1.5K object size and 128K block size, and stay at 95% < 1ms, 99% < 8ms, 99.9% < 64ms. This is not particularly good for an SSD. By comparison, a Micron 9200 PRO rates at 94.5x, nearly 24 times the TPS load. What's more, with the i3.xlarge you're sharing that drive with a neighbor: there's no way to cap the IOPS so that you each get half, there's only a partition of the storage. This means you can expect latency spikes originating from the neighbor. The i3.2xlarge is the smallest instance that gives you the entire SSD.
So, you take the ACT information and you use it to do capacity planning. The main factors you need to know are the average object size (you can find that using objsz histogram), number of objects (again, available via asadm), peak read TPS and peak write TPS (how does the 60Ktps you mentioned split between reads and writes?).
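A rough sketch of that capacity-planning arithmetic, using the ACT figures above; the read/write split of the 60Ktps peak is a placeholder you'd replace with your real numbers, and it assumes reads hit one replica while writes go to all replicas:

import math

# Sustained per-drive rates from the 4x ACT rating of the full 1900G i3 SSD
# (halve these if you stay on i3.xlarge, since you only get half a drive).
drive_read_tps, drive_write_tps = 8000, 4000

# Placeholder workload split - substitute the real read/write breakdown of your peak load.
peak_read_tps, peak_write_tps = 40000, 20000
replication_factor = 2  # each client write is applied on replication-factor nodes

drives_for_reads = peak_read_tps / drive_read_tps
drives_for_writes = peak_write_tps * replication_factor / drive_write_tps
nodes = math.ceil(max(drives_for_reads, drives_for_writes))
print(f"reads need {drives_for_reads:.1f} drives, writes need {drives_for_writes:.1f} -> plan for {nodes} nodes")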
Check your logs for your cache-read-pct values. If they're in the range of 10% or higher you should be raising your post-write-queue value to get better read latencies (and also reduce IOPS pressure from the drive).

What is the maximum memory per worker that MATLAB can address?

Short version: Is there a maximum amount of RAM / worker, that MATLAB can address?
Long version: My wife uses MATLAB's parallel processing capabilities in data-heavy spatial analyses (I don't really understand it, I just know how to build computers that make her work quicker) and I would like to build a new computer so she can radically reduce her process times.
I am leaning toward something in the 10-16 core range, since prices on such processors seem to be dropping with each generation, and I would like to use 128 GB of RAM, because 'why not' if you can stomach the cost and see some meaningful time savings.
The number of cores I shoot for will depend on the maximum amount of RAM that MATLAB can address for each worker (if such a limit exists). The computer I built for similar work in 2013 has 4 physical cores (Intel i7-3770K) and 32 GB RAM (which she maxed out), and whatever I build next, I would like it to have at least the same memory per core. With 128 GB of RAM a given, 10 cores would be 12.8 GB/core, 12 cores would be ~10.7 GB/core, and 16 cores would be 8 GB/core. I am inclined to maximize cores rather than memory, but since she doesn't know what will benefit her processes the most, I would like to know how realistic those three options are. To preempt the obvious follow-up question: she has an Nvidia GPU capable of parallel processing, but she believes her processes would not benefit from its CUDA cores.
Thank you for your insights. Many, many Google searches did not yield an answer.

Scaling Kafka for Throughput

I have set up a sample Kafka cluster on AWS and am trying to identify the maximum throughput possible with the given configuration. I am currently following the post linked here for this analysis.
https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
I would appreciate it if you could clarify the following issues.
I observed a throughput of 40 MB/s for messages of size 512 bytes (single producer, single consumer) with the given hardware. Assume I need to achieve a throughput of 80 MB/s.
As I understand it, one way to do this is to increase the number of partitions per topic and increase the number of threads in the producer and consumer. (Assuming I do not change the default values for batch size, compression ratio, etc.)
How do I find the maximum throughput possible with the given hardware, i.e. the point after which we are required to improve our hardware resources if we are to further improve the throughput?
(In other words, how do I make the decision "With X GB RAM and Y GB disk space this is the maximum throughput I can achieve. If I need to further improve the throughput I have to upgrade RAM to XX GB and disk space to YY GB"?)
2. Should we scale the cluster vertically or horizontally? What is the recommended approach?
Thank you.
If we define throughput as the volume of data transmitted over the network per second, the maximum throughput cannot exceed (number of machines) * (per-machine bandwidth). Given a single machine whose NIC is configured at 1 Gbps, the maximum throughput on a single machine cannot be larger than 1 Gbps. In your case the throughput is 40 MB/s, i.e. 320 Mbps, which is well below 1 Gbps, so there is still room for improvement on one machine. However, if your target is far larger than 1 Gbps, you definitely need more machines.
AFAIK, bandwidth is the most likely system bottleneck here. Unlike CPU and RAM, it's not easy to scale vertically, so horizontal scaling might be the option.
You could do some maths before scaling. Say the throughput target is "produce 2 billion records of 512 bytes each in 1 hour". That is to say, the required throughput is 2,000,000,000 * 8 * 512 / 3600 / 1024 / 1024 ≈ 2170 Mbps. Assuming the available bandwidth for a single machine is 700 Mbps (over 70% usage normally brings packet loss), at least 4 machines should be planned for the producer application.
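The same arithmetic as a small script, so it can be re-run with other targets (the 700 Mbps figure is the assumed usable share of a 1 Gbps NIC):

import math

records_per_hour = 2_000_000_000
record_bytes = 512
usable_mbps_per_machine = 700  # assumed safe share of a 1 Gbps NIC (~70% utilisation)

required_mbps = records_per_hour * record_bytes * 8 / 3600 / 1024 / 1024
machines = math.ceil(required_mbps / usable_mbps_per_machine)
print(f"need ~{required_mbps:.0f} Mbps, so plan for at least {machines} producer machines")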