Simulating very high loads using ApacheBench (ab) - webserver

Is it possible to simulate very high loads (~30k concurrent requests) using ApacheBench? I increased my ulimit to 30k and then load tested my server using ab -n 60000 -c 30000 .... Sometimes I get receive exceptions in ab. Also, the number of requests that ab posts to my server is greater than what I specify (I monitored my server stats, and it is getting more than 60k requests). Why am I getting this weird behaviour?

Related

HAProxy reverse ssl termination: Memory keeps growing. Memory leak?

I have haproxy 2.5.1 in an SSL termination config running in a container of a Kubernetes pod; the backend is a Scala app that runs in another container of the same pod.
I have seen that I can put 500K connections in the setup and the RSS memory usage of HAProxy is 20GB. If I remove the traffic and wait 15 minutes, the RSS memory drops to 15GB, but if I repeat the same exercise one or two more times, RSS for HAProxy will hit 30GB and HAProxy will be killed, as I have a limit of 30GB in the pod for HAProxy.
The question here is whether this behavior of continuous memory growth is expected.
Here is the incoming traffic:
And here is the memory usage chart, which shows how, after 3 cycles of placing load and removing load, the RSS memory reached 30GB and HAProxy then got killed. (Note that the two charts use different timezones, but they belong to the same execution.)
We switched from an Alpine-based image (musl) to a glibc-based image, and that solved the problem. We got a 5X increase in connection rate, and the memory growth is gone too.
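To compare RSS before and after such an image change without relying on dashboards, the value can be read directly from the kernel. The following is a minimal sketch, assuming a Linux host where /proc/<pid>/status is readable; the function names are hypothetical and not part of HAProxy.

```python
from pathlib import Path


def parse_vm_rss_kib(status_text: str) -> int:
    """Return the VmRSS value (in KiB) found in the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like: "VmRSS:     20971520 kB"
            return int(line.split()[1])
    raise ValueError("VmRSS not found in status text")


def read_rss_gib(pid: int) -> float:
    """Read the current resident set size of a process in GiB (Linux only)."""
    text = Path(f"/proc/{pid}/status").read_text()
    return parse_vm_rss_kib(text) / (1024 * 1024)
```

Sampling read_rss_gib for the HAProxy PID after each load cycle would show whether memory returns to its baseline or keeps ratcheting upward.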

ActiveMQ Artemis produce/consume latency issue

I have been monitoring the end to end latency of my microservice applications. Each service is loosely coupled via an ActiveMQ Artemis queue.
-------------     -------------     -------------
| Service 1 | --> | Service 2 | --> | Service 3 |
-------------     -------------     -------------
Service 1 listens as an HTTP endpoint and produces to queue 1. Service 2 consumes from queue 1, modifies the message, and produces to queue 2. Service 3 consumes from queue 2. Each service inserts a row into a separate database table, which also lets me monitor latency. So "end-to-end" means going into "Service 1" and coming out of "Service 3".
Each service processing time remains steady, and most messages have a reasonable e2e latency of a few milliseconds. I produce with a constant rate using JMeter of 400 req/sec, and I can monitor this via Grafana.
Sporadically I notice a dip in this constant rate, which can be seen throughout the chain. At first I thought it could be the producer side (Service 1), since the rate suddenly dropped to 370 req/sec and might be attributed to GC or possibly a fault in the JMeter HTTP simulator, but this does not explain why the e2e latency of certain messages jumps to ~2-3 sec.
Since it would be hard to reproduce my scenario, I checked out this load generator for ActiveMQ Artemis and bumped the versions up to 2.17.0, 5.16.2 & 0.58.0 to match my broker (2.17.0), which is a cluster of 2 masters/slaves using NFSv4 shared storage.
The command below generated 5,000,000 messages to a single queue, q6, with 4 producers/consumers and a max overall produce rate of 400. Messages are persistent. The only code change in the artemis-load-generator was in ConsumerLatencyRecorderTask: when elapsedTime > 1 sec, I print out the message ID and latency.
java -jar destination-bench.jar --persistent --bytes 1000 --protocol artemis --url tcp://localhost:61616?producerMaxRate=100 --out /tmp/test1.txt --name q6 --iterations 5000000 --runs 1 --warmup 20000 --forks 4 --destinations 1
From this I noticed that there were outlier messages with produce/consume latency nearing 2 secs. Most (90.00%) were below 3358.72 microseconds.
I am not sure why or how this happens. Is this reasonable?
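The 1-second check described above, together with the percentile figures the tool reports, can be sketched as follows. This is only an illustration in Python, not the actual ConsumerLatencyRecorderTask code (which is Java); the function names and the nearest-rank percentile method are assumptions, and the load generator may compute its percentiles differently.

```python
import math
from typing import Iterable, List, Sequence, Tuple

ONE_SECOND_US = 1_000_000  # the 1 sec threshold from the question, in microseconds


def find_outliers(
    samples: Iterable[Tuple[str, float]], threshold_us: float = ONE_SECOND_US
) -> List[Tuple[str, float]]:
    """Return (message_id, latency_us) pairs whose latency exceeds the threshold."""
    return [(msg_id, latency) for msg_id, latency in samples if latency > threshold_us]


def percentile(latencies_us: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not latencies_us:
        raise ValueError("no samples")
    ordered = sorted(latencies_us)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

Logging the outlier message IDs alongside broker-side timestamps makes it easier to line up slow messages with GC pauses or storage stalls.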
EDIT/UPDATE
I have run the test a few times; this is the output of a shorter run.
java -jar destination-bench.jar --persistent --bytes 1000 --protocol artemis --url tcp://localhost:61616?producerMaxRate=100 --out ~/test-perf1.txt --name q6 --iterations 400000 --runs 1 --warmup 20000 --forks 4 --destinations 1
The result is below
RUN 1 EndToEnd Throughput: 398 ops/sec
**************
EndToEnd SERVICE-TIME Latencies distribution in MICROSECONDS
mean 10117.30
min 954.37
50.00% 1695.74
90.00% 2637.82
99.00% 177209.34
99.90% 847249.41
99.99% 859832.32
max 5939134.46
count 1600000
The JVM thread status shows what I am noticing in my actual system: the broker has a lot of TIMED_WAITING threads, and where there are spikes, the push-to-queue latency seems to increase.
Currently my data is, as I said, hosted on NFS v4, as shown. I read in the Artemis persistence section that:
If the journal is on a volume which is shared with other processes which might be writing other files (e.g. bindings journal, database, or transaction coordinator) then the disk head may well be moving rapidly between these files as it writes them, thus drastically reducing performance.
Should I move the bindings folder off NFS and onto the VM's disk? Will this improve performance? It is unclear to me.
How does this affect Shared Store HA?
I started a fresh, default instance of ActiveMQ Artemis 2.17.0, cloned and built the artemis-load-generator (with a modification to alert immediately on messages that take > 1 second to process), and then ran the same command you ran. I let the test run for about an hour on my local machine, but I didn't let it finish because it was going to take over 3 hours (5 million messages at 400 messages per second). Out of roughly 1 million messages I saw only 1 "outlier" - certainly nothing close to the 10% you're seeing. It's worth noting that I was still using my computer for my normal development work during this time.
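The "over 3 hours" estimate follows directly from the numbers quoted in this thread. A small worked calculation, assuming the overall rate of 400 messages per second comes from 4 forks each capped at producerMaxRate=100:

```python
forks = 4                   # --forks 4
producer_max_rate = 100     # producerMaxRate=100 in the connection URL
total_messages = 5_000_000  # message count quoted in the question and answer

overall_rate = forks * producer_max_rate        # messages per second
duration_seconds = total_messages / overall_rate
duration_hours = duration_seconds / 3600

print(f"{overall_rate} msg/s -> {duration_seconds:.0f} s (~{duration_hours:.2f} h)")
```

At that rate the full run takes roughly three and a half hours, so about an hour of testing covers a little over a quarter of it.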
At this point I have to attribute this to some kind of environmental issue, e.g.:
Garbage Collection
Low performance disk
Network latency
Insufficient CPU, RAM, etc.

How can I increase 5000 concurrent request with mpm_prefork on Apache

I've configured the Apache web server on my CentOS server machine. I want to handle 5000 concurrent requests with MPM prefork. Please suggest the best prefork configuration for that. I've put a prefork configuration in httpd.conf, but it's not working.
My Prefork configuration:
<IfModule mpm_prefork_module>
StartServers 5
MinSpareServers 30
MaxSpareServers 40
MaxClients 5000
ServerLimit 20
MaxRequestsPerChild 500
</IfModule>
KeepAlive On
MaxKeepAliveRequests 5000
KeepAliveTimeout 5
Some suggestions.
MaxRequestsPerChild: if your Apache is stable, you can increase this to a MUCH higher value. This prevents your child processes from dying too often. I have a (high volume) web site set at 10 000.
MaxClients (renamed MaxRequestWorkers) is set to 5000, but ServerLimit is at 20, so MaxRequestWorkers is capped at 20. ServerLimit is the highest value MaxRequestWorkers is allowed to reach. So set ServerLimit higher.
You said your tests showed 1000 requests at a time with ServerLimit == 20. So do not set ServerLimit to 5000!
If you expect high traffic, increase StartServers so Apache will be ready.
When you perform your tests, ramp the load up to its maximum, then let it sit there for a while. Do not try 5000 in one go.
Set up server-status in your Apache. This will allow you to view the state of your Apache (number of workers, what they are doing, ...). If you see all workers busy and you have still not reached your 5000, increase the values accordingly.
And finally, realise that 5000 concurrent requests means 5000 browsers actively requesting data at the same time, each with an open connection. Real users have think time and read time, so their requests are staggered more than those of a load testing tool.
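The two checks above (the ServerLimit cap and watching worker usage through server-status) can be sketched in a few lines. This is a hedged illustration: the helper names are hypothetical, and the parser assumes the machine-readable output of mod_status (the ?auto form of the server-status URL), which reports lines such as "BusyWorkers: 20" and "IdleWorkers: 0".

```python
from typing import Dict, Tuple


def effective_max_workers(max_request_workers: int, server_limit: int) -> int:
    """With prefork, MaxRequestWorkers (MaxClients) cannot exceed ServerLimit."""
    return min(max_request_workers, server_limit)


def parse_server_status(auto_text: str) -> Dict[str, str]:
    """Parse 'Key: value' lines from the server-status?auto output."""
    fields = {}
    for line in auto_text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:
            fields[key] = value.strip()
    return fields


def busy_and_idle(auto_text: str) -> Tuple[int, int]:
    """Return (busy, idle) worker counts from the server-status?auto output."""
    fields = parse_server_status(auto_text)
    return int(fields["BusyWorkers"]), int(fields["IdleWorkers"])
```

For the configuration in the question, effective_max_workers(5000, 20) evaluates to 20, which is why the server never gets close to 5000 workers. If busy_and_idle keeps reporting zero idle workers during a ramp-up, the limits are still too low for the load.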

Docker blocking outgoing connections on high load?

We have a node.js web server that makes some outgoing http requests to an external API. It's running in docker using dokku.
After some time of load (30req/s) these outgoing requests aren't getting responses anymore.
Here's a graph I made while testing with constant req/s:
incoming and outgoing is the amount of concurrent requests (not the number of initialized requests). (It's hard to see in the graph, but it's fairly constant at ~10 requests for each.)
response time is for external requests only.
You can clearly see that they start failing all of a sudden (hitting our 1000ms timeout).
The more req/s we send, the faster we run into this problem, so we must have some sort of limit we're getting closer to with each request.
I used netstat -ant | tail -n +3 | wc -l on the host to get the number of open connections, but it was only ~450 (most of them TIME_WAIT). That shouldn't hit the socket limit. We aren't hitting any RAM or CPU limits, either.
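A raw line count hides how the connections are distributed across TCP states. As a rough sketch, the same netstat -ant output can be broken down by state; the function name is hypothetical, and the parser assumes the usual Linux column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State).

```python
from collections import Counter


def count_tcp_states(netstat_output: str) -> Counter:
    """Count TCP connections per state from `netstat -ant` output."""
    states = Counter()
    for line in netstat_output.splitlines():
        parts = line.split()
        # Skip header lines; TCP rows have at least six columns and start with tcp/tcp6.
        if len(parts) >= 6 and parts[0].startswith("tcp"):
            states[parts[5]] += 1
    return states
```

Running it on output captured inside the container as well as on the host can show whether one side accumulates TIME_WAIT or SYN_SENT entries that the other does not.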
I also tried running the same app on the same machine outside docker and it only happens in docker.
It could be due to the Docker userland proxy. If you are running a recent version of Docker, try running the daemon with the --userland-proxy=false option. This will make Docker handle port forwarding with just iptables and there is less overhead.

What are some useful tips/tools for monitoring/tuning memcached health?

Yesterday, I found this cool script 'memcache-top' which nicely prints out stats of memcached live. It looks like,
memcache-top v0.6 (default port: 11211, color: on, refresh: 3 seconds)
INSTANCE          USAGE    HIT %    CONN  TIME    EVICT/s  READ/s  WRITE/s
127.0.0.1:11211   88.8%    94.8%    20    0.8ms   9.0      311.3K  162.8K
AVERAGE:          88.8%    94.8%    20    0.8ms   9.0      311.3K  162.8K
TOTAL:            1.8GB/   2.0GB    20    0.8ms   9.0      311.3K  162.8K
(ctrl-c to quit.)
It even makes certain text red when you should pay attention to something!
Q. Broadly, what are some useful tools/techniques you've used to check that memcached is set up well?
A good interface for accessing Memcached server instances is phpMemCacheAdmin.
I prefer access from the command line using telnet.
To make a connection to Memcached using Telnet, run the telnet localhost 11211 command from the command line.
If at any time you wish to terminate the Telnet session, simply type quit and hit return.
You can get an overview of the important statistics of your Memcached server by running the stats command once connected.
Memory is allocated in chunks internally and constantly reused. Since memory is broken into different size slabs, you do waste memory if your items do not fit perfectly into the slab the server chooses to put it in.
So Memcached allocates your data into different "slabs" (think of these as partitions) of memory automatically, based on the size of your data, which in turn makes memory allocation more optimal.
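The waste described above can be made concrete with a small model of slab size classes. This is a sketch under stated assumptions, not memcached's actual allocator: it assumes a smallest chunk of 96 bytes, the default growth factor of 1.25, 8-byte alignment, and a 1 MB largest chunk; real values depend on the memcached version and on the -f, -n and -I options.

```python
from typing import List


def slab_chunk_sizes(smallest: int = 96, factor: float = 1.25,
                     largest: int = 1024 * 1024) -> List[int]:
    """Build an illustrative list of chunk sizes, one per slab class."""
    sizes = []
    size = smallest
    while size < largest:
        sizes.append(size)
        # Grow by the factor, then round up to the next multiple of 8 bytes.
        size = -(-int(size * factor) // 8) * 8
    sizes.append(largest)
    return sizes


def wasted_bytes(item_size: int, sizes: List[int]) -> int:
    """Bytes left unused when an item is stored in the smallest chunk that fits it."""
    for chunk in sizes:
        if item_size <= chunk:
            return chunk - item_size
    raise ValueError("item is larger than the largest chunk")
```

Under these assumptions a 100-byte item lands in a 120-byte chunk and wastes 20 bytes, while a 120-byte item fits its chunk exactly.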
To list the slabs in the instance you are connected to, use the stats slabs command.
A more useful command is stats items, which gives you a list of slabs that includes a count of the items stored within each slab.
Now that you know how to list slabs, you can browse inside each slab to list the items contained within by using the stats cachedump [slab ID] [number of items, 0 for all items] command.
If you want to get the actual value of that item, you can use the get [key] command.
To delete an item from the cache you can use the delete [key] command.
For production systems, you should really set up active monitoring (with downtime alerts, automated restarts, etc.) of Memcached using something like Monit. Here is an example config: Monitoring Memcache with Monit
It is good to monitor overall memory usage of memcached for resource planning.
Track the eviction statistics counter to know how often cached items are getting evicted due to lack of memory.
Track cache hits/misses, reclaims (the number of expired items removed to make space for new writes), current connections, and flush commands, all of which are available in stats.
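The counters listed above can be turned into a few health figures. The sketch below assumes the stats output has already been collected into a dict of strings keyed by the stat names memcached reports (get_hits, get_misses, evictions, reclaimed, curr_connections, cmd_flush, bytes, limit_maxbytes); the function name and the shape of the returned dict are illustrative.

```python
from typing import Dict, Union

Number = Union[int, float]


def cache_health(stats: Dict[str, str]) -> Dict[str, Number]:
    """Derive hit ratio and memory usage from memcached `stats` counters."""
    hits = int(stats["get_hits"])
    misses = int(stats["get_misses"])
    gets = hits + misses
    return {
        # Fraction of get requests served from the cache.
        "hit_ratio": hits / gets if gets else 0.0,
        # Items removed before expiry because memory ran out.
        "evictions": int(stats["evictions"]),
        # Expired items whose space was reused for new writes.
        "reclaimed": int(stats["reclaimed"]),
        "curr_connections": int(stats["curr_connections"]),
        "cmd_flush": int(stats["cmd_flush"]),
        # Share of the configured memory limit currently in use.
        "memory_used_ratio": int(stats["bytes"]) / int(stats["limit_maxbytes"]),
    }
```

The counters are cumulative since server start, so comparing two snapshots taken some time apart gives rates that are more useful for alerting than the absolute values.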
Memcached stats (can be read from telnet, libmemcached, language specific library)
stats
stats slabs
stats items
stats sizes
stats detail
stats settings
Run the above commands using telnet,
or simply run them using netcat:
echo "stats settings" | nc 127.0.0.1 11211
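The same request can be made from a script, which is handy for feeding the values into a monitoring system. This is a minimal sketch of the memcached text protocol over a plain socket; the function names are hypothetical, and it assumes an unauthenticated server on 127.0.0.1:11211 that answers with "STAT name value" lines terminated by "END".

```python
import socket
from typing import Dict


def parse_stats(raw: bytes) -> Dict[str, str]:
    """Parse 'STAT name value' lines up to the terminating 'END' line."""
    stats = {}
    for line in raw.decode("ascii", errors="replace").splitlines():
        if line == "END":
            break
        parts = line.split(" ", 2)
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats


def fetch_stats(host: str = "127.0.0.1", port: int = 11211,
                subcommand: str = "", timeout: float = 2.0) -> Dict[str, str]:
    """Send 'stats [subcommand]' to a memcached server and return the parsed reply."""
    command = f"stats {subcommand}".strip() + "\r\n"
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(command.encode("ascii"))
        while True:
            data = conn.recv(4096)
            if not data:
                break
            chunks.append(data)
            if b"".join(chunks).endswith(b"END\r\n"):
                break
    return parse_stats(b"".join(chunks))
```

For example, fetch_stats(subcommand="settings") corresponds to the netcat one-liner above, and fetch_stats() with no subcommand returns the general counters.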
Other scripts/tools
https://github.com/memcached/memcached/tree/master/scripts
memcached top
memcached metrics per slab
This is what memcached metrics per slab looks like
Descriptions of some of the fields can be found here.
Memcached Prometheus exporter - Exports metrics from memcached servers for consumption by Prometheus.