JBoss 7 and HornetQ with ASYNCIO: low performance

I am using JBoss 7, which embeds the HornetQ server. I want to use ASYNCIO instead of Java NIO, as the literature indicates that ASYNCIO performs better than NIO. However, in my measurements, with NIO the system can transfer 600 messages per second through the queue, while with ASYNCIO it transfers only 250 messages per second. What can cause this low performance when I use ASYNCIO?

Try increasing the journal sizes... use 10 files of 10 MiB each. If you are using the defaults, that's quite low.
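As a sketch, on JBoss 7 these journal settings live in the messaging subsystem of standalone-full.xml; the values below express the suggestion above (sizes are in bytes). Also note that ASYNCIO only takes effect if the native libaio library is installed - HornetQ logs a warning and falls back to NIO otherwise, so it's worth checking the startup log:

```xml
<!-- standalone-full.xml, inside the messaging subsystem -->
<hornetq-server>
    <journal-type>ASYNCIO</journal-type>
    <!-- 10 MiB per journal file, 10 files pre-allocated -->
    <journal-file-size>10485760</journal-file-size>
    <journal-min-files>10</journal-min-files>
    <!-- ... rest of the server configuration ... -->
</hornetq-server>
```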

Related

Maximizing S3 upload performance with AWS C++ SDK

I am using a c5.18xlarge instance with the ENA adapter enabled (so expect to have 25 Gbps connectivity to S3 per AWS support). I am using the AWS C++ SDK (version 1.3.59) on RHEL 7 to upload a 70 GB file to a single S3 object using a 256 MB part size. Per AWS support, I've set the ClientConfiguration's maxConnections field to 999 and its executor field to use a PooledThreadExecutor with a pool size of 999 (and these have improved my performance). I am performing a series of S3Client::UploadPart() calls, threading these myself; I get very similar performance when using UploadPartCallable() and letting the SDK manage the threading.
Here's the performance I'm seeing:
- 36 threads: 7.5 Gbps
- 200 threads: 15.7 Gbps
AWS support reported similar behavior (actually they were using 900 threads).
I've looked through the underlying implementation of S3Client and all the low-level thread management and curl handle management. I don't see anything obviously inefficient going on. It just doesn't make sense to me that I would need 200 threads to achieve this performance on a machine that has 36 physical cores. Is this expected? Could someone provide an explanation for what's happening, or a different way to configure the SDK that doesn't require this many threads? I think I could provide my own HTTPClientFactory and customize things to cut out a mutex in how the curl handles are managed if I'm careful, but this seems unlikely to account for what I'm seeing.
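For reference, here's roughly what that configuration looks like (the allocation tag string is arbitrary):

```cpp
#include <aws/core/Aws.h>
#include <aws/core/client/ClientConfiguration.h>
#include <aws/core/utils/threading/Executor.h>
#include <aws/s3/S3Client.h>

int main() {
    Aws::SDKOptions options;
    Aws::InitAPI(options);
    {
        Aws::Client::ClientConfiguration config;
        config.maxConnections = 999;  // per AWS support's suggestion
        config.executor = Aws::MakeShared<Aws::Utils::Threading::PooledThreadExecutor>(
            "s3-upload",              // allocation tag (arbitrary)
            999);                     // thread pool size
        Aws::S3::S3Client s3(config);
        // ... UploadPart()/UploadPartAsync() calls go here ...
    }
    Aws::ShutdownAPI(options);
}
```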
Thanks for any help.
-Adam
"I am using the AWS C++ SDK (version 1.3.59) on RHEL 7 to upload a 70 GB file to a single S3 object using a 256 MB part size."
You're probably being limited by your disk/storage device's read throughput. It's actually impressive that you're able to reach 15.7 Gbps.
In my test, I see that all threads created by Aws::Utils::Threading::PooledThreadExecutor are running on one single CPU core (while the spot instance has 72 vCPUs). Have you seen the same behavior in your tests?
The way I further improved performance is by using my own threading model with the S3Client blocking APIs, instead of PooledThreadExecutor with the S3 async methods (such as UploadPartAsync()).
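A minimal sketch of that approach, assuming the multipart upload was already initiated elsewhere (the bucket, key, and upload id below are placeholders): each thread drives a blocking UploadPart() call itself, so the SDK executor is never involved.

```cpp
#include <aws/core/Aws.h>
#include <aws/core/utils/memory/stl/AWSStringStream.h>
#include <aws/s3/S3Client.h>
#include <aws/s3/model/UploadPartRequest.h>
#include <functional>
#include <thread>
#include <vector>

// Each thread issues a blocking UploadPart() call directly;
// S3Client is thread-safe, so one client can be shared.
void uploadPart(const Aws::S3::S3Client& client, int partNumber) {
    Aws::S3::Model::UploadPartRequest req;
    req.SetBucket("my-bucket");        // placeholder
    req.SetKey("my-70gb-object");      // placeholder
    req.SetUploadId("my-upload-id");   // from CreateMultipartUpload
    req.SetPartNumber(partNumber);
    auto body = Aws::MakeShared<Aws::StringStream>("part-body");
    // ... fill body with this part's 256 MB slice of the file ...
    req.SetBody(body);
    auto outcome = client.UploadPart(req);  // blocks until the part is done
    if (outcome.IsSuccess()) {
        // real code would record outcome.GetResult().GetETag()
        // for the final CompleteMultipartUpload call
    }
}

int main() {
    Aws::SDKOptions options;
    Aws::InitAPI(options);
    {
        Aws::S3::S3Client client;
        std::vector<std::thread> threads;
        for (int part = 1; part <= 36; ++part)  // one thread per part here
            threads.emplace_back(uploadPart, std::cref(client), part);
        for (auto& t : threads) t.join();
    }
    Aws::ShutdownAPI(options);
}
```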

Multi-core machine - CPU load metric

On a multi-core machine, what is the best metric to understand whether the CPU is loaded or not?
I have a web application that sends a POST request to an Apache CGI server. The CGI server loops over the POST data and launches a Perl process for each item in the loop. Since requests from clients end up hitting a single endpoint, I am concerned that I may end up creating more processes than my server can handle. Hence I want to understand which system metric I should check before launching a new process from the loop.
Note: I have a 20 core machine.
The reason the answer isn't easy to find is that it depends on the nature of your processes and on which system constraint is your limiting factor.
For CPU-intensive work, the metric to look at is load average - a measure of processes in a runnable state. Very roughly, if the load average equals the number of cores, you're running your CPUs at maximum.
However, it's increasingly the case that CPU is not the limiting factor - you may have a finite amount of memory, and memory-hungry processes will consume it. 'Spare' memory is used for caching, so filling the whole lot up actually starts to slow things down (because you have a smaller cache), and overspilling the available memory will cause either swapping or the OOM killer.
But as you mention Apache and the web, chances are pretty good that your network pipe is the limiting factor - controlling bandwidth from the local host is actually surprisingly hard.
And then there's disk IO, which may also be a factor - though I think that's unlikely for a web server, because your outbound network will usually be a tighter limit.
It all depends what your processes are doing - whether they're lightweight 'helpers' that are mostly idle, or heavyweight 'grinders' that each introduce noticeable load.
So the best answer I can give is a very vague estimate: if your processes are CPU-intensive, cap them at 2 per core (see the sketch below). If your processes are memory-intensive, aim to consume about 50% of your system RAM. If your processes are IO-intensive, aim to consume about 50% of your IO bandwidth (either network or disk).
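To act on the CPU side of that estimate, a minimal sketch in C++ (the CGI in the question is Perl, but the check is language-agnostic; getloadavg() is a Linux/BSD extension, and the 2-per-core threshold is the rough estimate above):

```cpp
#include <cstdlib>   // getloadavg() - Linux/BSD extension
#include <iostream>
#include <thread>

// Gate process creation on the 1-minute load average,
// capped at 2 runnable processes per core.
bool canLaunchWorker() {
    double load[1];
    if (getloadavg(load, 1) < 1) {
        return false;  // couldn't read the load average; be conservative
    }
    const unsigned cores = std::thread::hardware_concurrency();  // 20 here
    return cores > 0 && load[0] < 2.0 * cores;
}

int main() {
    if (canLaunchWorker()) {
        std::cout << "OK to launch another worker\n";
    } else {
        std::cout << "System busy - defer or queue the work\n";
    }
}
```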

Mirth performance benchmark

We are using Mirth Connect for message transformation from HL7 to text, storing the transformed messages in an Azure SQL database. Our current performance is 45,000 messages per hour.
The machine configuration is 8 GB RAM and a 2-core CPU. The memory assigned to Mirth is -Xms6122m.
We don't have any idea what the expected performance would be for Mirth with the above configuration. Does anyone have performance benchmarks for Mirth Connect?
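As an aside, the flag is -Xms/-Xmx rather than -XMS; in a default Mirth Connect install the heap is normally set in the mcserver.vmoptions file (or mcservice.vmoptions when running as a service), one JVM flag per line, along these lines:

```
-Xms6122m
-Xmx6122m
```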
I'd recommend looking into the Max Processing Threads option in version 3.4 and above. It's configurable in the Source Settings (Source tab). By default it's set to 1, which means only one message can process through the channel's main processing thread at any given time. This is important for certain interfaces where order of messages is paramount, but obviously it limits throughput.
Note that whatever client is sending your channel messages also needs to be reconfigured to send multiple messages in parallel. For example if you have a single-threaded process that is sending your channel messages via TCP/MLLP one after another in sequence, increasing the max processing threads isn't necessarily going to help because the client is still single-threaded. But, for example, if you stand up 10 clients all sending to your channel simultaneously, then you'll definitely reap the benefits of increasing the max processing threads.
If your source connector is a polling type, like a File Reader, you can still benefit from this by turning the Source Queue on and increasing the Max Processing Threads. When the source queue is enabled and you have multiple processing threads, multiple queue consumers are started and all read and process from the source queue at the same time.
Another thing to look at is destination queuing. In the Advanced (wrench icon) queue settings, there is a similar option to increase the number of Destination Queue Threads. By default when you have destination queuing enabled, there's just a single queue thread that processes messages in a FIFO sequence. Again, good for message order but hampers throughput.
If you do need messages to be ordered and want to maximize parallel throughput (AKA have your cake and eat it too), you can use the Thread Assignment Variable in conjunction with multiple destination Queue Threads. This allows you to preserve order among messages with the same unique identifier, while messages pertaining to different identifiers can process simultaneously. A common use-case is to use the patient MRN for this, so that all messages for a given patient are guaranteed to process in the order they were received, but messages longitudinally across different patients can process simultaneously.
We are using an AWS EC2 c4.4xlarge instance to test the performance limit of a bare-bones proof of concept. We got about 50 msgs/sec without obvious bottlenecks on CPU/memory/network/disk IO/DB IO, etc. We want to push the limits higher. Please share your observations if any.
We run the same process. Mirth -> Azure SQL Database. We're running through performance testing right now and have been stuck at 12 - 15 messages/second (43000 - 54000 per hour).
We've run tests on each channel and found this:
- 1 channel (source: file reader -> destination: Azure SQL DB) was about 36k per hour
- 2 channels (same source/destination) were about 59k per hour
- 3 channels (same source/destination) were about 80k per hour
We've added multi-threading (2, 4, 8 threads) to both the source and destination on 1 channel with no performance increase. Mirth is running on 8 GB of memory and 2 cores with the heap size set to 2048 MB.
We are now going to run a few tests with Mirth running on "hardware" similar to a c4.4xlarge, which in Azure is 16 cores and 32 GB of memory. There is 200 GB of SSD available as well.
Our goal is 100k messages per hour per channel.

NUMA awareness of JVM

My question concerns the extent to which a JVM application can exploit the NUMA layout of a host.
I have an Akka application in which actors concurrently process requests by combining incoming data with 'common' data already loaded into an immutable (Scala) object. The application scales well in the cloud, using many dual-core VMs, but performs poorly on a single 64-core machine. I presume this is because the common data object resides in one NUMA cell, and many threads concurrently accessing it from other cells is too much for the interconnects.
If I run 64 separate JVM applications, each containing 1 actor, then performance is good again. A more moderate approach might be to run as many JVM applications as there are NUMA cells (8 in my case), giving the host OS a chance to keep the threads and memory together.
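If I go that route, one JVM per cell could presumably be pinned explicitly with numactl, so the OS doesn't have to guess; a sketch, with the node indices and jar name as placeholders:

```sh
# one JVM per NUMA cell, bound to that cell's CPUs and memory
numactl --cpunodebind=0 --membind=0 java -jar worker.jar &
numactl --cpunodebind=1 --membind=1 java -jar worker.jar &
# ... one instance per NUMA node, up to node 7 ...
```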
But is there a smarter way to achieve the same effect within a single JVM? E.g. if I replaced my common data object with several instances of a case class, would the JVM have the capability to place them on the optimal NUMA cell?
Update:
I'm using Oracle JDK 1.7.0_05, and Akka 2.1.4
I've now tried the UseNUMA and UseParallelGC JVM options. Neither seemed to have any significant impact on the slow performance when using one or a few JVMs. I've also tried using a PinnedDispatcher and the thread-pool-executor with no effect. I'm not sure the configuration is having any effect though, since nothing seems different in the startup logs.
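For reference, a pinned dispatcher is typically declared in application.conf along these lines (the dispatcher name here is illustrative) and attached to an actor with .withDispatcher("pinned-dispatcher"); if nothing changes in the logs, it may be worth verifying the config file is actually on the classpath:

```
pinned-dispatcher {
  type = PinnedDispatcher
  executor = "thread-pool-executor"
}
```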
The biggest improvement remains when I use a single JVM per worker (~50). However, the problem with this appears to be that there is a long delay (up to a couple of minutes) before the FailureDetector registers the successful exchange of 'first heartbeat' between Akka cluster JVMs. I suspect there is some other issue here that I've not yet uncovered. I already had to increase ulimit -u since I was hitting the default maximum number of processes (1024).
Just to clarify, I'm not trying to achieve large numbers of messages, just trying to have lots of separate actors concurrently access an immutable object.
I think if you are sure that the problem is not in your message-processing algorithms, then you should take into account not only the NUMA option but the whole environment configuration, starting from the JVM version (latest is better, and Oracle JDK also mostly performs better than OpenJDK), then JVM options (including GC, memory, and concurrency options), then the Scala and Akka versions (the latest release candidates and milestones can be much better), and also the Akka configuration.
From here you can borrow all the things that matter to get 50M messages per second of total throughput for Akka actors on contemporary laptops.
I never had a chance to run these benchmarks on a 64-core server, so any feedback will be greatly appreciated.
One finding that may help: current implementations of ForkJoinPool increase message-send latency as the number of threads in the pool increases. This is quite noticeable when the rate of request-response calls between actors is high; e.g. on my laptop, increasing the pool size from 4 to 64 grows the message-send latency of Akka actors by 2-3x for most executor services (Scala's ForkJoinPool, the JDK's ForkJoinPool, ThreadPoolExecutor).
You can check whether there are any differences by running mvnAll.sh with the benchmark.parallelism system property set to different values, as shown below.
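Assuming mvnAll.sh passes JVM system properties through to Maven in the usual -D form, that would look something like:

```sh
./mvnAll.sh -Dbenchmark.parallelism=4
./mvnAll.sh -Dbenchmark.parallelism=64
```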

NServiceBus: How to stop distributor acting as a processing bottleneck (reduces rate 65%)

We have an event processing system that processes events sent directly from the source to the handler process at 200 events per second (eps). The queues and message sends are transactional. Adding the NSB distributor between the event generator and the handler process reduces this rate from 200 eps to 70 eps. The disk usage and CPU on the distributor box become significantly higher as well.
Seen with commercial build of NServiceBus, version 2.6.0.1505.
Has anyone else seen this behaviour or have any advice?
One thing you can play with is where MSDTC is located. You can have your workers use the same MSDTC as the distributor, thereby downgrading the level of the transaction and speeding up commits. If you do this, I would recommend clustering MSDTC to protect against failures.
Assuming you are operating on a DB, you could shard your databases to work on different sets of data. You could also move the DB(s) closer to the workers (onto the same machine).
I would also check the settings of your DB provider and MSMQ, as there are a few things to tweak there in terms of timeouts and such. Note that there is a trade-off when applying certain settings, but it sounds like you'd prefer the quickest throughput.
There are lots of other system-level things to check; I'll assume you've been through all those (network/disk/RAM/etc.).