Scala: performance boost on incremental garbage collection

I have written an application in Scala. Basically, the first step is to create an array of objects and then to initialise these objects from a CSV file. When running the application on the JVM it is really slow, and after some experimenting I found out that using the -J-Xincgc flag, which enables incremental garbage collection, speeds up the application by a factor of 4 (it's 4 times faster with the switch!). I wonder:
Why?
Did I use some inefficient coding, and if so, where should I start to find out what's going on?
Thanks!

I'll assume you're running this on HotSpot.
The HotSpot JVM has a whole zoo of garbage collectors, most of which also have sub-modes or various command-line switches that significantly alter their behavior.
Which GC is used by default varies based on JVM version, operating system and 32/64-bit VM.
So you basically changed whatever the default was to a specific algorithm that happened to perform "faster" for your workload.
But "faster" is a fuzzy measure. Wall time is not the same as CPU cycles spent if you consider multi-threading. And some collectors may simply choose to grow the heap more aggressively, thus deferring the cost of collection to a later point in time, which you might not have measured if your program didn't run long enough.
To make an accurate assessment, much more information would be needed:
what GC was used by default
your VM version
how many cores your CPU has
what kind of workload you have (multi/single-threaded, long/short-running, expected memory footprint, object allocation rate)
Oracle's GC tuning guide may prove useful for you.
In your case, -Xincgc translates to CMS in incremental mode, which is intended for single-core environments and has been deprecated as of Java 8. It probably just happened to be better than the default, but it's not necessarily an optimal choice.
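If you want to see which collectors the JVM actually picked by default, one option is to query the garbage collector MXBeans at runtime. A minimal sketch in Scala (the bean names differ between collectors and JVM versions); running it once with and once without -J-Xincgc shows which collectors are in play and how much time they spend:
import java.lang.management.ManagementFactory

object GcInfo extends App {
  // getGarbageCollectorMXBeans returns a java.util.List; iterate it directly
  val beans = ManagementFactory.getGarbageCollectorMXBeans
  for (i <- 0 until beans.size) {
    val gc = beans.get(i)
    println(s"${gc.getName}: collections=${gc.getCollectionCount}, time=${gc.getCollectionTime} ms")
  }
}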

If you get into a situation where you are running close to your heap-size limit, you can waste a lot of GC time, which can lead to a lot of false findings about performance. If that's your situation, first increase your heap-size limit before doing anything else. Consider using jvisualvm to eyeball the situation; it's trivially easy to get started with.
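For a quick programmatic check of how close you are running to the limit, the Runtime API is enough (jvisualvm shows the same picture graphically). A minimal sketch:
object HeapCheck extends App {
  // Rough picture of current heap usage versus the configured maximum (-Xmx).
  val rt = Runtime.getRuntime
  val usedMb = (rt.totalMemory - rt.freeMemory) / (1024 * 1024)
  val maxMb = rt.maxMemory / (1024 * 1024)
  println(s"heap used: $usedMb MB of $maxMb MB max")
}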

Related

Fastest way to reset or free copy-on-write mapped area

I have a CoW region of memory that I need to reset to the original state.
Sadly, MADV_DONTNEED behaves exactly the same as munmap and seemingly frees all pages. munmap is extremely expensive and the performance is horrendous to say the least; it's way cheaper to create a new $thing from scratch using MAP_ANONYMOUS, initialize it manually, then munmap that. That makes zero sense to me and just shows something is really broken with mmap and CoW mappings. Unfortunately for me, I really need CoW. It's that or memcpy from one range to another, and since it's 2021 I expect Linux to be able to do copy-on-write.
See: https://kostja.github.io/2012/04/04/1111.html
I would like to only discard the dirty pages of my memory range.
munmap anon: 435110ns (435 micros)
munmap memfd: 21958015ns (21958 micros)
This is the average time when freeing 400x 128MB ranges. If I only free 1 range then I get sane numbers, so there's some kind of bad scaling going on in the kernel that I don't understand. The tmpfs-backed area is untouched after allocating it with MAP_NORESERVE. This is completely insane. Are memfd (tmpfs-backed) files just that slow?
The fastest way ended up being to use hardware virtualization itself to implement the copy-on-write mechanism. It is extremely complex and fraught with footguns, but most importantly very fast. It is possible to use just a few pages of working memory to call into a copy-on-write VM; most of the pages are duplicated page table entries.
Additionally, this opens up the possibility for copies of copies, as well as flattening a copy-on-write VM so that it can be used as master.
Linux has no support for this whatsoever.

Throttling CPU usage in a Swift thread

I want to traverse the file tree for a potentially large directory in a macOS app. It takes about 3 mins for my example case if I just do it, but the CPU spikes to 80% or so for those 3 minutes.
I can afford to do it more slowly on a background thread, but am not sure of what the best approach would be.
I thought of just inserting a 1 millisecond sleep inside the loop, but I am not confident that won't have some negative impact on scheduling / disk I/O etc. An alternative would be to do 1 second of work, then wait 2-3 seconds, but I am guessing there is something more elegant?
The core functionality I want is traversing a directory in a nested fashion checking file attributes:
let enumerator = FileManager.default.enumerator(atPath: filePath)
while let element = enumerator?.nextObject() as? String {
    // do something here
}
It's generally more energy efficient to spike the CPU for a short time than to run it at a low level for a longer time. As long as your process has a lower priority than other processes, running the CPU at even 100% for a short time isn't a problem (particularly if it doesn't turn the fans on). Modern CPUs want to be run very hard for short periods of time and then be completely idle. "Somewhat busy" for a longer time is much worse because the CPU can't power off any of its subsystems.
Even so, users get very upset when they see high CPU usage. I used to work on system management software, and we spoke with Apple about throttling our CPU usage. They told us the above. We said "yes, but when users see us running at 100%, they complain to IT and try to uninstall our app." Apple's answer was to use sleep, like you're describing. If it makes your process take longer, it will likely have a negative overall impact on total energy use, but I wouldn't expect it to cause any other trouble.
That said, if you are scanning the same directory tree more than once, you should look at File System Events and File Metadata Search, which may perform these operations much more efficiently.
See also: Schedule Background Activity in the Energy Efficiency Guide for Mac Apps. I highly recommend this entire doc. There are many tools that have been added to macOS in recent years that may be useful for your problem. I also recommend Writing Energy Efficient Apps from WWDC 2017.
If you do need to scan everything directly with an enumerator, you can likely greatly improve things by using the URL-based API rather than the String-based API. It allows you to pre-fetch certain values (including attributeModificationDateKey, which may be of use here). Also, be aware of the fileAttributes property of DirectoryEnumerator, which caches the last-read file's attributes (so you don't need to query them again).
Three minutes is a long time; it's possible you're doing more work than needed. Run your operation using the find command-line tool and use that as a benchmark for how much time it should take.

Parallel processing of input/output, queries, and indexes on AS400

IBM i V6.1
When using System i Navigator and clicking System values, the parallel processing options (the QQRYDEGREE system value) are displayed.
By default, "Do not allow parallel processing" is selected.
What will the impact on processing in programs be when you choose multiple processes? We have a lot of RPG IV programs and SQL queries being executed, and I think it will increase performance.
Basically, I want to turn this on in the production environment, but I am not sure whether I will break anything by doing this, for example input or output of different programs running in parallel, or data getting out of sequence.
I did do some research:
https://publib.boulder.ibm.com/iseries/v5r2/ic2924/index.htm?info/rzakz/rzakzqqrydegree.htm
I understand each option, but I do not know the risk of changing it from the default to multiple.
First off, in order to get the most out of *MAX and *OPTIMIZE, you'd need a system with more than one core (enabled for IBM i / DB2) along with the DB2 Symmetric Multiprocessing (SMP) (57xx-SS1 option 26) licensed program installed, thus allowing the system to use SMP for queries and index builds.
For *IO, the system can use multiple tasks via simultaneous multithreading (SMT), even on a single-core POWER5 or higher box. SMT is enabled via the Processor multitasking (QPRCMLTTSK) system value.
You're unlikely to "break" anything by changing the value, as long as your applications don't make bad assumptions about result set ordering. For example, CPYxxxIMPF makes use of SQL behind the scenes; with anything but *NONE you might end up with the rows in your DB2 table in a different order from the rows in the import file.
You will most certainly increase the CPU usage. This is not a bad thing, unless you're currently pushing 90%+ CPU usage regularly. If you're only using 50% of your CPU, it's probably a good thing to make use of SMT/SMP to provide better response time, even if it increases the CPU utilization to 60%.
Having said that, here's a story of it being a problem... http://archive.midrange.com/midrange-l/200304/msg01338.html
Note that in the above case, the OP was pre-building work tables at sign on in order to minimize the wait when it was time to use them. Great idea 20 years ago with single threaded systems. Today, the alternative would be to take advantage of SMP/SMT and build only what's needed when needed.
As you note in a comment, this kind of change is difficult to test in non-production environments, since workloads in DEV & TEST are different. So it's important to collect good performance data before and after the change. You might also consider moving in stages: *NONE --> *IO --> *OPTIMIZE and then *MAX if you wish. I'd spend at least a month at each level if you have periodic month-end jobs.

Couchbase - Value Eviction to Full Eviction change for a huge database

We have production servers with a high volume of data using value-eviction buckets. Since we are running out of memory, we have decided to change the eviction mode to full eviction. If we do this:
Is there any impact on live operations?
Is there any process that runs? (e.g. rebalancing)
What are the pros and cons ?
Yes, there is an impact, though not a big one: that operation requires the memcached processes to be restarted on all nodes at the same time and the caches to be warmed up again. So you will incur downtime, of course. How much depends on a few factors.
Not that I can think of. It just has to restart the processes.
Pros: you have more room in RAM, as the metadata is now ejected in addition to the values. Cons: if your code does any operation that checks for the existence of an object first, it will be much slower. For example, if you do an upsert, the DB has to check whether that object exists first as part of the process. If you are running value eviction, it checks for the metadata object in RAM, which is super quick; that object ID is either there or not. If you are running with full eviction, Couchbase now has to go to disk to look through the metadata there. As you might imagine, there is a penalty for that, which depending on some factors could be large.
IMO, running out of memory is not a good enough reason to move to full eviction; you need a functional reason. Without knowing more (resident ratios, RAM size, cache sizes, etc.), you are probably better off adding more servers or larger ones, your choice. Keeping Couchbase properly sized is critical to a well-functioning system, as it is for most databases, but especially for Couchbase. If you have an Enterprise contract with Couchbase, their support team can help you with this. If not, read the documentation on this REALLY carefully before you turn on this feature. Like I said, have more than "I am running out of RAM" as the reason you are changing how the DB works; otherwise you may be doing more harm than good.

Using Drools in a heavy batch process

We used Drools as part of a solution to act as a sort of filter in a very intense processing application, maybe running up to 100 rules on 500,000+ working memory objects.
It turns out that it is extremely slow.
Does anybody else have any experience using Drools in a batch-type processing application?
It kind of depends on your rules. 500K objects is reasonable given enough memory (Drools has to populate a RETE network in memory, so memory usage is a multiple of the 500K objects, i.e. space for the objects plus space for the network structure, indexes, etc.); it's possible you are paging to disk, which would be really slow.
Of course, if you have rules that match combinations of the same type of fact, that can cause an explosion of combinations to try, which will be really, really slow even if you have only one rule.
If you had any more information on the analysis you are doing, that would probably help with possible solutions.
I've used Drools with a stateful working memory containing over 1M facts. With some tuning of both your rules and the underlying JVM, performance can be quite good after a few minutes of initial start-up. Let me know if you want more details.
I haven't worked with the latest version of Drools (last time I used it was about a year ago), but back then our high-load benchmarks proved it to be utterly slow. A huge disappointment after having based much of our architecture on it.
At least one good thing I remember about Drools is that their dev team was available on IRC and very helpful; you might give them a try, they're the experts after all: irc.codehaus.org #drools
I'm just learning drools myself, so maybe I'm missing something, but why is the whole batch of five hundred thousand objects added to working memory at once? The only reason I can think of is that there are rules that kick in only when two or more items in the batch are related.
If that isn't the case, then perhaps you could use a stateless session and assert one object at a time. I assume rules will run 500k times faster in that case.
Even if it is the case, do all your rules need access to all 500k objects? Could you speed things up by applying per-item rules one at a time, and then in a second phase of processing apply batch level rules using a different rulebase and working memory? This would not change the volume of data, but the RETE network would be smaller because the simple rules would have been removed.
An alternative approach would be to try and identify the related groups of objects and assert the objects in groups during the second phase, further reducing the volume of data in working memory as well as splitting up the RETE network.
Drools is not really designed to be run on a huge number of objects. It's optimized for running complex rules on a few objects.
The working memory initialization for each additional object is too slow and the caching strategies are designed to work per working memory object.
Use a stateless session and add the objects one at a time?
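A minimal sketch of that approach, assuming the Drools 6+ KIE API with a rule module on the classpath (older releases expose the same idea through StatelessKnowledgeSession); the fact-loading method is a hypothetical placeholder:
import org.kie.api.KieServices

object StatelessBatch {
  def main(args: Array[String]): Unit = {
    // Build a stateless session from the rules found on the classpath (kmodule.xml).
    val container = KieServices.Factory.get().getKieClasspathContainer
    val session = container.newStatelessKieSession()

    // One fact per execute: each call inserts the object, fires the matching rules
    // and discards the working memory, so the network never holds the whole batch.
    loadFacts().foreach(fact => session.execute(fact))
  }

  // Hypothetical placeholder for however you read your 500k objects.
  def loadFacts(): Iterator[AnyRef] = Iterator.empty
}
Whether this is applicable still depends on the earlier caveat: it only works if no rule needs to see two or more of the batch objects at once.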
I had problems with OutOfMemory errors after parsing a few thousand objects. Setting a different default optimizer solved the problem.
OptimizerFactory.setDefaultOptimizer(OptimizerFactory.SAFE_REFLECTIVE);
We were looking at Drools as well, but for us the number of objects is low, so this isn't an issue. I do remember reading that there are alternative implementations of the same algorithm that take memory usage more into account and are optimized for speed. Not sure if any of them have made it into a real, usable library though.
This optimizer can also be set by using the parameter
-Dmvel2.disable.jit=true
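Either way, the switch generally needs to happen before the rules are compiled. A minimal sketch showing both forms (the property can also be passed on the command line as shown above):
import org.mvel2.optimizers.OptimizerFactory

object DroolsBootstrap {
  def configureMvel(): Unit = {
    // Must run before the knowledge base is built, otherwise MVEL has already
    // picked its JIT optimizer for the compiled rule expressions.
    OptimizerFactory.setDefaultOptimizer(OptimizerFactory.SAFE_REFLECTIVE)

    // Similar effect via the system property instead of the command-line flag.
    System.setProperty("mvel2.disable.jit", "true")
  }
}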