Argon2id Hashing Slow (Large Variance)

I'm using the C# Argon2 implementation provided through the Isopoh.Cryptography.Argon2 NuGet package (latest version 1.1.2 from here: https://github.com/mheyman/Isopoh.Cryptography.Argon2). Generating and verifying Argon2 hashes is sometimes fast and sometimes extremely slow (several seconds), even though I understand the configuration below to be on the low-cost end.
My Argon2 configuration is as follows:
var config = new Argon2Config
{
    Type = Argon2Type.HybridAddressing,
    Version = Argon2Version.Nineteen,
    MemoryCost = 16, // I had this at 2048 and lowered to 16 for testing; still slow
    TimeCost = 2,
    Lanes = 2,
    Threads = 1,
    HashLength = 32,
    ClearPassword = true,
    ClearSecret = true
};
This results in Argon2 hashes that show the following configuration header:
$argon2id$v=19$m=16,t=2,p=2$<<hash>>
I wrote a performance profiler and found that the implementation gets (mostly) slower with each iteration after roughly the 10th to 12th iteration, even on repeated tests. The slowdown is two or more orders of magnitude (from ~10 ms to several seconds), which leads me to believe there is a garbage collection/memory leak issue.

I found that the issue was indeed related to the Argon2 implementation, garbage collection, or both.
Forcing garbage collection via GC.Collect() (which I usually would not advise doing manually) immediately after generating the hash removes the odd variance in hash generation speeds. Note that the hash generation already wrapped both the Argon2 instance and the SecureArray<byte> instance in using blocks when the issue occurred.
It also allowed me to tune the parameters to a more secure setting while staying within a ~100-150 ms generation-time envelope.
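For illustration, a minimal sketch of the workaround, following the usage shown in the library's README (the password/salt plumbing here is illustrative; whether to keep the explicit GC.Collect() is exactly the trade-off described above):

using System;
using System.Text;
using Isopoh.Cryptography.Argon2;
using Isopoh.Cryptography.SecureArray;

public static class PasswordHasher
{
    public static string Hash(string password, byte[] salt, Argon2Config config)
    {
        config.Password = Encoding.UTF8.GetBytes(password);
        config.Salt = salt;
        string encoded;
        using (var argon2 = new Argon2(config))
        using (SecureArray<byte> hash = argon2.Hash())
        {
            encoded = config.EncodeString(hash.Buffer);
        }
        // Workaround from above: force a collection right after hashing so the
        // secured buffers are reclaimed immediately instead of accumulating
        // across iterations.
        GC.Collect();
        return encoded;
    }
}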

Same here: using Verify() in this library, memory increases constantly and drastically. GC.Collect() after each iteration (obviously not recommended) solves the issue.
Unfortunately this made it into production and QA missed it (it is only felt under heavy load), so we'll either have to switch or modify the open-source library itself.

Related

Throttling CPU usage in a Swift thread

I want to traverse the file tree for a potentially large directory in a macOS app. It takes about 3 mins for my example case if I just do it, but the CPU spikes to 80% or so for those 3 minutes.
I can afford to do it more slowly on a background thread, but am not sure of what the best approach would be.
I thought of just inserting a 1 millisecond sleep inside the loop, but I am not confident that won't have some negative impact on scheduling, disk I/O, etc. An alternative would be to do 1 second of work, then wait 2-3 seconds, but I am guessing there is something more elegant?
The core functionality I want is traversing a directory in a nested fashion checking file attributes:
let enumerator = FileManager.default.enumerator(atPath: filePath)
while let element = enumerator?.nextObject() as? String {
    // do something here
}
It's generally more energy efficient to spike the CPU for a short time than to run it at a low level for a longer time. As long as your process has a lower priority than other processes, running the CPU at even 100% for a short time isn't a problem (particularly if it doesn't turn the fans on). Modern CPUs are designed to be run very hard for short periods of time and then be completely idle. "Somewhat busy" for a longer time is much worse, because the CPU can't power off any of its subsystems.
Even so, users get very upset when they see high CPU usage. I used to work on system management software, and we spoke with Apple about throttling our CPU usage. They told us the above. We said "yes, but when users see us running at 100%, they complain to IT and try to uninstall our app." Apple's answer was to use sleep, like you're describing. If it makes your process take longer, it will likely have a negative overall impact on total energy use, but I wouldn't expect it to cause any other trouble.
That said, if you are scanning the same directory tree more than once, you should look at File System Events and File Metadata Search, which may perform these operations much more efficiently.
See also: Schedule Background Activity in the Energy Efficiency Guide for Mac Apps. I highly recommend this entire doc. There are many tools that have been added to macOS in recent years that may be useful for your problem. I also recommend Writing Energy Efficient Apps from WWDC 2017.
If you do need to scan everything directly with an enumerator, you can likely improve things greatly by using the URL-based API rather than the String-based API. It allows you to pre-fetch certain values (including attributeModificationDateKey, which may be of use here). Also, be aware of the fileAttributes property of DirectoryEnumerator, which caches the last-read file's attributes (so you don't need to query them again).
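A rough sketch of that approach, assuming you want modification dates (the directory, keys, and options shown are illustrative):

import Foundation

let filePath = NSHomeDirectory() // stand-in for the directory being scanned
let url = URL(fileURLWithPath: filePath, isDirectory: true)
let keys: [URLResourceKey] = [.isDirectoryKey, .attributeModificationDateKey]
let enumerator = FileManager.default.enumerator(
    at: url,
    includingPropertiesForKeys: keys,
    options: [.skipsHiddenFiles]
)
while let fileURL = enumerator?.nextObject() as? URL {
    // Values for the pre-fetched keys come from a per-item cache,
    // avoiding a separate attribute query for each file.
    if let modified = try? fileURL.resourceValues(forKeys: Set(keys)).attributeModificationDate {
        // do something with `modified` here
        _ = modified
    }
}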
Three minutes is a long time; it's possible you're doing more work than needed. Run your operation with the find command-line tool and use that as a benchmark for how much time it should take.

Why is Swift multithreading much less efficient when using functions in imported modules?

I am going to leave out a lot of 'irrelevant' details here to help people concentrate on the actual question.
I have a Swift project which involves a great deal of calculation (numerical integration, multi-parameter best fitting, etc.). To speed things up I am aiming to use concurrent processing.
Using XCTest classes, I have discovered that with my closure calling a function defined within my module, DispatchQueue.concurrentPerform with a single iteration takes time t. When repeated with 5 iterations, it runs about 5% slower (I am happy with that).
NB the function is a static function on a struct (my collection of calculus routines).
However, if I put the function in a separate module and import it, repeating the test with 1 iteration takes a similar time t. But now with 5 iterations, the call takes twice as long (105% slower, in fact).
Swift version: 4.2.1
OS: macOS 10.14.3
Xcode 10.1
Processor: 6 core Core i7 (Mac mini 2018)
All 'objects' are structs, and value types are used everywhere apart from the function reference.
Quick summary again: using DispatchQueue.concurrentPerform(), compared to the baseline time for 1 iteration of a function defined in the same module, 5 iterations is 5% slower. However, when doing the same thing with a function defined in an imported module, the baseline time remains unchanged for 1 iteration, but 5 iterations is 105% slower.
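For reference, a minimal sketch of the kind of test being described (Calculus.integrate is a hypothetical stand-in for the static struct function, whether same-module or imported):

import Dispatch
import Foundation

// Hypothetical stand-in for the module's static calculus routine.
struct Calculus {
    static func integrate(samples: Int) -> Double {
        var sum = 0.0
        for i in 0..<samples { sum += sin(Double(i)) } // heavy numeric work
        return sum
    }
}

let iterations = 5
let start = Date()
DispatchQueue.concurrentPerform(iterations: iterations) { _ in
    _ = Calculus.integrate(samples: 5_000_000)
}
print("elapsed: \(Date().timeIntervalSince(start))s")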
Can anyone explain why this happens, and hopefully suggest a way of avoiding this slowdown while keeping my collection in an importable module?
Feel free to ask questions if you need more information.
Problem has been solved. No clue what the cause was.
Removed my project, created a fresh one, imported the files back into it, and included it in the workspace.
Now 5 concurrent iterations take only 10% longer than a single one (while doing 5x the amount of work). I would still love to know what caused the issue, and it would be good if the efficiency drop were only 5%, as in the earlier case.
But for this improvement just from starting a fresh project, I'm not going to quibble!

Scala: performance boost on incremental garbage collection

I have written an application in Scala. Basically, the first step is to create an array of objects and then to initialise these objects from a CSV file. When running the application on the JVM it is really slow, and after some experimenting I found that using the -J-Xincgc flag, which enables incremental garbage collection, speeds up the application by a factor of 4 (it's 4 times faster with the switch!). I wonder:
Why?
Did I use some inefficient coding, and if so, where should I start to find out what's going on?
Thanks!
I'll assume you're running this on hotspot.
The hotspot JVM has a whole zoo of garbage collectors, most of which also may have some sort of sub-modes or various command-line switches that significantly alter their behavior.
Which GC is used by default varies based on JVM version, operating system and 32/64bit VM.
So you basically changed whatever the default was to a specific algorithm that happened to perform "faster" for your workload.
But "faster" is a fuzzy measure. Wall time is not the same as CPU cycles spent if you consider multi-threading. And some collectors may simply choose to grow the heap more aggressively, thus deferring the cost of collection to a later point in time, which you might not have measured if your program didn't run long enough.
To make an accurate assessment, much more information would be needed:
what GC was used by default
your VM version
how many cores your CPU has
what kind of workload you have (multi/single-threaded, long/short-running, expected memory footprint, object allocation rate)
Oracle's GC tuning guide may prove useful for you.
In your case, -Xincgc translates to CMS in incremental mode, which is intended for single-core environments and has been deprecated as of Java 8. It probably just happened to be better than the default, but it's not necessarily an optimal choice.
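To find out which collector a given JVM picks by default (the first point in the list above), HotSpot can print its ergonomically chosen flags. A couple of illustrative invocations (the main class name "Main" is a placeholder):

java -XX:+PrintCommandLineFlags -version    # prints the default GC among other chosen flags
scala -J-Xincgc Main                        # CMS in incremental mode (deprecated since Java 8)
scala -J-XX:+UseParallelGC Main             # throughput collector, a common multi-core default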
If you get into a situation where you are running close to your heap-size limit, you can waste a lot of GC time, which can lead to a lot of false findings about performance. If that's your situation, first increase your heap-size limit before doing anything else. Consider using jvisualvm to eyeball the situation; it's trivially easy to get started with.

Data management in MATLAB versus other common analysis packages

Background:
I am analyzing large amounts of data using an object-oriented composition structure for sanity and easy analysis. Often the highest level of my OO hierarchy is an object that is about 2 GB when saved. Loading the data into memory is not always an issue, and populating sub-objects and then higher objects based on their content is much more memory efficient (in terms of MATLAB's Java memory) than just loading in a lot of MAT-files directly.
The Problem:
Saving these objects that are > 2 GB will often fail. It is a somewhat well-known problem that I have gotten around by just deleting a number of sub-objects until the total size is below 2-3 GB. This happens regardless of how powerful the computer is; a machine with 16 GB of RAM and 8 cores will still fail to save the objects correctly. Saving with an older file-format version also does not help.
Questions:
Is this a problem that others have solved somehow in MATLAB? Is there an alternative that I should look into that still has a lot of high level analysis and will NOT have this problem?
Questions welcome, thanks.
I am not sure this will help, but: do you make sure to use a recent MAT-file version? Check, for instance, the documentation for save. Quoting from the page:
'-v7.3' (7.3, R2006b or later): Version 7.0 features plus support for data items greater than or equal to 2 GB on 64-bit systems.
'-v7' (7.0, R14 or later): Version 6 features plus data compression and Unicode character encoding. Unicode encoding enables file sharing between systems that use different default character encoding schemes.
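In practice that just means passing the version flag to save; a minimal example (the file and variable names are placeholders):

% Save in the v7.3 MAT-file format (HDF5-based), which supports
% variables >= 2 GB on 64-bit systems:
save('analysis.mat', 'topLevelObject', '-v7.3');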
Also, could your object by any chance be or contain a graphics handle object? In that case, it is wise to use hgsave.

How to efficiently process 300+ files concurrently in Scala

I'm going to work on comparing around 300 binary files using Scala, byte-by-byte, 4 MB each. However, judging from what I've already done, processing 15 files at the same time using java.io.BufferedInputStream took around 90 seconds on my machine, so I don't think my solution would scale well to a large number of files.
Ideas and suggestions are highly appreciated.
EDIT: The actual task is not just comparing for differences but processing those files in the same sequential order. Let's say I have to look at byte i in every file at the same time, then move on to byte (i + 1).
Did you notice your hard drive slowly evaporating as you read the files? Reading that many files in parallel is not something mechanical hard drives are designed to do at full speed.
If the files will always be this small (4MB is plenty small enough), I would read the entire first file into memory, and then compare each file with it in series.
I can't comment on solid-state drives, as I have no first-hand experience with their performance.
You are quite screwed, indeed.
Let's see... 300 * 4 MB = 1.2 GB. Does that fit your memory budget? If it does, by all means read them all into memory. But, to speed things up, you might try the following:
Read 512 KB of every file, sequentially. You might try reading from 2 to 8 files at the same time, perhaps through Futures, and see how well it scales. Depending on your I/O system, you may gain some speed by reading a few files at the same time, but I do not expect it to scale much. EXPERIMENT! BENCHMARK!
Process those 512 KB using Futures.
Go back to step 1, unless you are finished with the files.
Get the result back from the processing Futures.
In step 1, by limiting the parallel reads you avoid thrashing your I/O subsystem. Push it as hard as you can, maybe a bit less than that, but definitely not more than that.
By not reading all of the files in step 1, you overlap some of the time spent reading with useful CPU work. You may experiment with lowering the number of bytes read in step 1 as well.
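A sketch of that chunked scheme, assuming a script run with the scala runner (the input directory, chunk size, and compare routine are placeholders to experiment with):

import java.io.{BufferedInputStream, File, FileInputStream}
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

implicit val ec: ExecutionContext = ExecutionContext.global
val chunkSize = 512 * 1024 // step 1: 512 KB per file per round

def readChunk(in: BufferedInputStream): Array[Byte] = {
  val buf = new Array[Byte](chunkSize)
  val n = in.read(buf)
  if (n < 0) Array.empty[Byte] else buf.take(n)
}

def compareRound(round: Seq[Array[Byte]]): Unit = {
  // look at byte i across every chunk, then move on to i + 1 ...
}

val files = new File("data").listFiles().toSeq // hypothetical input directory
val streams = files.map(f => new BufferedInputStream(new FileInputStream(f)))
try {
  var chunks = streams.map(readChunk) // sequential reads (step 1)
  while (chunks.forall(_.nonEmpty)) {
    val work = Future(compareRound(chunks)) // CPU work on the pool (step 2)
    chunks = streams.map(readChunk)         // overlap the next round of reads
    Await.result(work, Duration.Inf)        // collect the result (step 4)
  }
} finally streams.foreach(_.close())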
Are the files exactly the same number of bytes? If they are not, the files can be compared simply via the File.length() method to determine a first-order guess of equality.
Of course you may be wanting to do a much deeper comparison than just "are these files the same?"
If you are just looking to see if they are the same, I would suggest using a hashing algorithm like SHA-1 to see if they match.
Here is some Java source to make that happen.
Many large systems that handle data use SHA-1, including the NSA and Git.
It's simply more efficient to use a hash instead of a byte-by-byte compare. The hashes can also be stored for later, to see if the data has been altered.
Here is a talk by Linus Torvalds specifically about Git; it also mentions why he uses SHA-1.
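A minimal Scala sketch of that approach, using the JDK's MessageDigest (the file paths are placeholders):

import java.io.{BufferedInputStream, FileInputStream}
import java.security.MessageDigest

def sha1Of(path: String): String = {
  val digest = MessageDigest.getInstance("SHA-1")
  val in = new BufferedInputStream(new FileInputStream(path))
  try {
    val buf = new Array[Byte](8192)
    var n = in.read(buf)
    while (n > 0) { digest.update(buf, 0, n); n = in.read(buf) }
  } finally in.close()
  digest.digest().map("%02x".format(_)).mkString
}

// Equal hashes mean the files are (almost certainly) identical:
// sha1Of("a.bin") == sha1Of("b.bin")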
I would suggest using NIO if possible. Introduction To Java NIO and NIO2 seems like a decent guide to using NIO if you are not familiar with it. I would not suggest reading a file and doing a comparison byte by byte, if that is what you are currently doing. You can create a ByteBuffer to read in chunks of data from a file and then do comparisons from that.
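A rough sketch of that chunked NIO read (the path and chunk size are illustrative):

import java.nio.ByteBuffer
import java.nio.channels.FileChannel
import java.nio.file.{Paths, StandardOpenOption}

val channel = FileChannel.open(Paths.get("a.bin"), StandardOpenOption.READ)
val buffer = ByteBuffer.allocateDirect(512 * 1024) // chunk size to experiment with
try {
  while (channel.read(buffer) > 0) {
    buffer.flip() // switch the buffer from writing mode to reading mode
    while (buffer.hasRemaining) {
      val b = buffer.get() // compare b against the other files' bytes here
    }
    buffer.clear() // reuse the buffer for the next chunk
  }
} finally channel.close()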