How to efficiently process 300+ files concurrently in Scala

I'm going to compare around 300 binary files in Scala, byte-by-byte, 4MB each. However, judging from what I've already done, processing 15 files at the same time using java.io.BufferedInputStream took around 90 seconds on my machine, so I don't think my solution would scale well to a large number of files.
Ideas and suggestions are highly appreciated.
EDIT: The actual task is not just to compare the files for differences but to process them in the same sequential order. Say I have to look at the i-th byte in every file at the same time, then move on to byte (i + 1).

Did you notice your hard drive slowly evaporating as you read the files? Reading that many files in parallel is not something mechanical hard drives are designed to do at full speed.
If the files will always be this small (4MB is plenty small enough), I would read the entire first file into memory, and then compare each file with it in series.
I can't comment on solid-state drives, as I have no first-hand experience with their performance.

You are quite screwed, indeed.
Let's see... 300 * 4 MB = 1.2 GB. Does that fit your memory budget? If it does, by all means read them all into memory. But, to speed things up, you might try the following:
1. Read 512 KB of every file, sequentially. You might try reading 2 to 8 files at the same time -- perhaps through Futures -- and see how well it scales. Depending on your I/O system, you may gain some speed by reading a few files at the same time, but I do not expect it to scale much. EXPERIMENT! BENCHMARK!
2. Process those 512 KB chunks using Futures.
3. Go back to step 1, unless you are finished with the files.
4. Get the results back from the processing Futures.
On step 1, by limiting the parallel reads you avoid thrashing your I/O subsystem. Push it as much as you can, maybe a bit less than that, but definitely not more than that.
By not reading all files on step 1, you spend some of the time that would otherwise go to reading on useful CPU work. You may experiment with lowering the bytes read on step 1 as well (see the sketch below).
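A minimal Scala sketch of that read-throttled, Future-based loop (the pool size of 4, the RandomAccessFile approach, and all names are illustrative assumptions, not the only way to do it):

    import java.io.RandomAccessFile
    import java.util.concurrent.Executors
    import scala.concurrent.{Await, ExecutionContext, Future}
    import scala.concurrent.duration.Duration

    object ChunkedProcessing {
      val ChunkSize = 512 * 1024
      // A small fixed pool throttles the parallel reads so the disk is not thrashed.
      implicit val io: ExecutionContext =
        ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(4))

      // Step 1: read one 512 KB window of one file.
      def readChunk(path: String, offset: Long): Future[Array[Byte]] = Future {
        val raf = new RandomAccessFile(path, "r")
        try {
          raf.seek(offset)
          val buf = new Array[Byte](ChunkSize)
          val n = raf.read(buf)
          if (n <= 0) Array.emptyByteArray else buf.take(n)
        } finally raf.close()
      }

      // Steps 2-4: for each window, read the same offset from every file, then
      // process the chunks in another Future.
      def run(paths: Seq[String], fileSize: Long): Unit =
        (0L until fileSize by ChunkSize.toLong).foreach { offset =>
          val chunks = Future.sequence(paths.map(readChunk(_, offset)))
          val processed = chunks.map { cs =>
            // inspect byte i of every chunk in cs here; () is a stand-in
            ()
          }
          // Await per window keeps the sketch simple; a real pipeline would kick
          // off the next window's reads while this one is still being processed.
          Await.result(processed, Duration.Inf)
        }
    }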

Are the files exactly the same number of bytes? If they are not, the files can be compared simply via the File.length() method to determine a first-order guess of equality.
Of course you may want to do a much deeper comparison than just "are these files the same?"
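For what it's worth, that first-order check is a one-liner on the JVM:

    import java.io.File

    // Differing lengths guarantee differing contents; equal lengths prove nothing.
    def maybeEqual(a: File, b: File): Boolean = a.length() == b.length()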

If you are just looking to see if they are the same, I would suggest using a hashing algorithm like SHA-1 to see if they match.
Here is some source to make that happen (see the sketch below).
Many large systems that handle data use SHA-1, including the NSA and Git.
It's simply more efficient to use a hash instead of a byte-by-byte compare. The hashes can also be stored for later to see if the data has been altered.
There is a talk by Linus Torvalds specifically about Git that also mentions why he uses SHA-1.
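A minimal Scala sketch of the idea, using the JDK's built-in java.security.MessageDigest (the answer above refers to Java source; this is the same approach on the JVM):

    import java.io.{BufferedInputStream, FileInputStream}
    import java.security.MessageDigest

    // Stream a file through SHA-1; equal digests are a near-certain content match.
    def sha1(path: String): String = {
      val md = MessageDigest.getInstance("SHA-1")
      val in = new BufferedInputStream(new FileInputStream(path))
      try {
        val buf = new Array[Byte](8192)
        Iterator.continually(in.read(buf))
          .takeWhile(_ != -1)
          .foreach(n => md.update(buf, 0, n))
        md.digest().map("%02x".format(_)).mkString
      } finally in.close()
    }

Two files are then duplicate candidates when sha1(a) == sha1(b), and the hex strings can be stored for later change detection.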

I would suggest using NIO if possible. Introduction to Java NIO and NIO2 seems like a decent guide to NIO if you are not familiar with it. I would not suggest reading a file and doing a comparison byte by byte, if that is what you are currently doing. You can create a ByteBuffer to read in chunks of data from a file and then do comparisons from that.
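A hedged sketch of that ByteBuffer approach (the chunk size and helper names are my own choices):

    import java.nio.ByteBuffer
    import java.nio.channels.FileChannel
    import java.nio.file.{Paths, StandardOpenOption}

    // Fill a buffer from a channel until it is full or EOF; returns bytes read.
    def readFully(ch: FileChannel, buf: ByteBuffer): Int = {
      var total = 0
      var eof = false
      while (!eof && buf.hasRemaining) {
        val n = ch.read(buf)
        if (n == -1) eof = true else total += n
      }
      total
    }

    // Compare two files chunk-by-chunk instead of with single-byte reads.
    def sameContents(a: String, b: String, chunkSize: Int = 1 << 20): Boolean = {
      val ca = FileChannel.open(Paths.get(a), StandardOpenOption.READ)
      val cb = FileChannel.open(Paths.get(b), StandardOpenOption.READ)
      try {
        if (ca.size() != cb.size()) false
        else {
          val ba = ByteBuffer.allocate(chunkSize)
          val bb = ByteBuffer.allocate(chunkSize)
          var same = true
          var more = true
          while (same && more) {
            ba.clear(); bb.clear()
            val na = readFully(ca, ba)
            val nb = readFully(cb, bb)
            ba.flip(); bb.flip()
            same = na == nb && ba == bb // ByteBuffer.equals compares contents
            more = na == chunkSize      // a short read means we hit EOF
          }
          same
        }
      } finally { ca.close(); cb.close() }
    }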

What if my mmap virtual memory exceeds my computer’s RAM?

Background and Use Case
I have around 30 GB of data that never changes, specifically, every dictionary of every language.
When a client requests the definition of a word, I simply respond with it.
On every request I have to run a search algorithm of my choice so I don't have to loop through the two hundred million plus words I have stored in my .txt file.
If I open the txt file and read it so I can search for the word, it would take forever due to the size of the file (even if that file is broken down into smaller files, that is not feasible, nor is it what I want to do).
I came across the concept of mmap, mentioned to me as a possible solution to my problem by a very kind gentleman on Discord.
Problem
As I was learning about mmap I came across the fact that mmap does not store the data in RAM but rather in virtual memory. Regardless of which it is, my server or Docker instances may have no more than 64 GB of RAM, and having that chunk of data take 30 of them is quite painful; it makes me feel like there needs to be a better alternative. Even in the worst case, if my server or Docker container does not have enough RAM for the data mapped with mmap, then it is not feasible, unless I am wrong about how this works, which is why I am asking this question.
Questions
Is there a better solution for my use case than mmap?
Will accessing such a large amount of data through mmap (so I don't have to open and read the file on every request) allocate RAM equal to the portion of the file that I am accessing?
Lastly, if I was wrong about any specific statement I made in what I have written so far, please do correct me, as I am still learning about mmap.
Requirements For My Specific Use Case
I may get a request from one client that has tens of words that I have to look up, so I need to be able to retrieve lots of data from the txt file efficiently.
The response to the client has to be as quick as possible, the quicker the better; I am talking ideally less than three seconds, or if that is impossible, then as quick as it can be.
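For orientation, here is what memory-mapping looks like on the JVM (file name and offsets are hypothetical). The key point is that map() only reserves virtual address space; the OS pages data into physical RAM lazily, one touched page at a time, and can evict those pages again under pressure, so a 30 GB mapping does not pin 30 GB of RAM:

    import java.nio.channels.FileChannel
    import java.nio.charset.StandardCharsets
    import java.nio.file.{Paths, StandardOpenOption}

    // Map (a window of) a large read-only file into virtual memory.
    val ch = FileChannel.open(Paths.get("dictionary.txt"), StandardOpenOption.READ)
    // One MappedByteBuffer is capped at 2 GB (Int indexing), so a 30 GB file
    // would need ~15 such windows; we map only the first one here.
    val window = math.min(ch.size(), Int.MaxValue.toLong)
    val map = ch.map(FileChannel.MapMode.READ_ONLY, 0L, window)

    // Random access with no read() syscall; only the touched pages fault in.
    val probe = new Array[Byte](100)
    map.position(1000000) // hypothetical offset of some word's entry
    map.get(probe)
    println(new String(probe, StandardCharsets.UTF_8))
    ch.close()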

Best Time Series Format for Querying and Converting to Matlab (HDF5)

I have somewhat of a unique problem that looks similar to the problem here:
https://news.ycombinator.com/item?id=8368509
I have a high-speed traffic analysis box that is capturing at about 5 Gbps, and picking out specific packets from this to save into some format in a C++ program. Each day there will probably be 1-3 TB written to disk. Since it's network data, it's all time series down to the nanosecond level, but it would be fine to save it at second or millisecond level and have another application sort the embedded higher-resolution timestamps afterwards. My problem is deciding which format to use. My main requirements are:
1. Be able to write to disk at about 50 MB/s continuously with several different timestamped parameters.
2. Be able to export chunks of this data into MATLAB (HDF5).
3. Query this data once or twice a day for analytics purposes.
Another nice thing that's not a hard requirement is:
There will be 4 of these boxes running independently, and it would be nice to query across all of them and combine data if possible. I should mention all 4 of these boxes are in physically different locations, so there is some overhead in sharing data.
The second one is something I cannot change because of legacy applications, but I think the first is more important. The types of queries I may want to export into MATLAB are something like "Pull metric X between time Y and Z", so this would eventually have to go into an HDF5 format. There is an external library called MatIO that I can use to write MATLAB files if needed, but it would be even better if there wasn't a translation step. I have read the entire thread mentioned above, and there are many options that appear to stand out: kdb+, Cassandra, PyTables, and OpenTSDB. All of these seem to do what I want, but I can't really figure out how easy it would be to get it into the MATLAB HDF5 format, and if any of these would make it harder than others.
If anyone has experience doing something similar, it would be a big help. Thanks!
A KDB+ tickerplant is certainly capable of capturing data at that rate; however, there are lots of things you need to make sure of (whatever solution you pick):
Do the machine(s) that are capturing the data have enough cores? Best to taskset a tickerplant, for example, to a core that nothing else will contend with
Similarly with disk - SSD, be sure there is no contention on the bus
Separate the workload - you can write different types of data (maybe packets can be partitioned by source or stream?) to different CPUs/disks/tickerplant processes.
Basically there are lots of ways you can cut this. I can say though that with the appropriate hardware KDB+ could do the job. However, given you want HDF5, it's probably even better to have a simple process capturing the data and writing/converting to disk on the fly.
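As a concrete illustration of that "simple process writing to disk on the fly" idea, here is a minimal JVM sketch (the record layout, field names, and hourly partitioning scheme are all assumptions of mine, not anything from the thread); converting a selected time range to HDF5/MATLAB would then happen offline on just the partition files that overlap the query window:

    import java.io.{BufferedOutputStream, DataOutputStream, FileOutputStream}

    // Hypothetical record: (epochNanos: Long, metricId: Int, value: Double) = 20 bytes.
    // At 50 MB/s that is ~2.5M records/s; a buffered stream per hourly partition
    // keeps up on an SSD, and "pull metric X between Y and Z" only has to open
    // the partitions that overlap [Y, Z].
    final class PartitionedWriter(dir: String) {
      private var hour = -1L
      private var out: DataOutputStream = _

      def write(epochNanos: Long, metricId: Int, value: Double): Unit = {
        val h = epochNanos / 3600000000000L // nanoseconds per hour
        if (h != hour) {                    // roll to a new partition file
          if (out != null) out.close()
          out = new DataOutputStream(new BufferedOutputStream(
            new FileOutputStream(s"$dir/metrics-$h.bin"), 1 << 20))
          hour = h
        }
        out.writeLong(epochNanos); out.writeInt(metricId); out.writeDouble(value)
      }

      def close(): Unit = if (out != null) out.close()
    }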

Which is more efficient, storing output in a variable or writing output to a file?

I am using Robocopy in PowerShell to sort through and output millions of filenames older than a user-specified age. My question is this: Is it better to make use of Robocopy's logging feature, then import the log via Get-Content -ReadCount, or would it be better to store Robocopy's output in a variable so that the script doesn't have to write to disk?
I would have to regex either way to get the actual file names. I'm using Robocopy because many of the files have paths longer than 248 chars.
Is one way more preferred than the other? Don't want to miss something that should be considered obvious.
You can skip all the theory and speculation about the multiple factors in play by measuring how long each method takes using Measure-Command, for example:
Measure-Command {$rc_output = robocopy <arguments>}
Measure-Command {robocopy <arguments> /log:rc.log; Get-Content rc.log [...]}
You'll get output telling you exactly how long each version took, down to the millisecond. Try it out on a small amount of sample data, see which one is quicker, then apply it to your millions of files.
I will add to #mjolinor's comment, and the other comments. To answer the question directly:
Saving information to a variable (and therefore to RAM) is always faster than writing directly to disk, but only with the following caveat:
Variables are designed to store small (<10 MB) amounts of data. They are not designed to hold things like entire databases. If the size of the data is large (i.e. millions of rows of data, i.e. tens of megabytes), then disk is always better. The problem is that if you shove a ton of information into a variable, you will fill up your RAM, and once your RAM is full things slow down, paging memory to disk starts happening, and basically everything stops working, including any commands that you are currently running (i.e. Robocopy).
Overall, because you are dealing with millions of rows, my recommendation is to write it to disk, because your results are likely to take up quite a bit of space, much more than a variable "should" hold.
Now, after saying all that and delving into the details of how programs manipulate bits in memory, none of it really matters, because the time spent writing things to disk is very small compared to the time it takes to process all the files.
If you are processing 1,000,000 files at a good speed, say 1,000 files a second, it will take 1,000 seconds, i.e. over 16 minutes, to run through all the files.
If, let's say, writing to disk is bad and slows you down to 995 files per second, the run takes only about 5 seconds longer. 5 seconds is an impact of 0.5%, which is nothing compared to the time the whole process takes.
It is much more likely that writing to a variable will cause much more trouble than writing to disk.
It depends on how much output you're talking about and what your available system resources are. It will be faster to write the names out to a file and then read them back in if the disk I/O time is less than the memory-management overhead of holding them all in memory. You can try it both ways and time it, but I'd try reading it into memory first while monitoring with Task Manager. If it starts throwing lots of page faults, that's a clue that you may be better off using the disk as intermediate storage.

Perl: performance hit with reading multiple files

I was wondering what is better in this case?
I have to read in thousands of files. I was thinking of opening each file, reading it, and closing it. Or of cat-ing all the files into one file and reading that.
Suggestions? This is all in Perl.
It shouldn't make that much of a difference. This sounds like premature optimization to me.
If the time spent cat-ing all files into one bigger file doesn't matter, reading the single file will be faster (only when reading it sequentially, which is the default).
Of course, if the cat step is taken into account, the whole thing will be much slower, because you have to read, write, and then read again.
In general, reading one file of 1000M should be faster than reading 100 files of 10M, because for the 100 files you'll need to look up the metadata of each.
As tchrist says the performance difference might not be important. I think it depends on the type of file (e.g. for a huge number of files which are very small it would differ much more) and the overall performance of your system and its storage.
Note that cat * can fail if the number of files makes the expanded argument list exceed the shell's limit (it is the length of the command line, not ulimit -n, that bites here). So sequential reads can actually be safer.
Also, consider using opendir and readdir instead of glob if all your files are located in the same dir.
Just read the files sequentially. Perl's file i/o functions are pretty thin wrappers around native file i/o calls in the OS, so there isn't much point in fretting about performance from simple file i/o.

Using a combination of file size and the hash of only the first 20KB of a file to detect duplicates?

A project I'm working on requires detection of duplicate files. Under normal circumstances I would simply compare the file bytes in blocks or hash value of the entire file contents. However, the system does not have access to the entire file - only the first 50KB or so. It also knows the total file size of the original file.
I was thinking of implementing the following: each time a file is added, I would look for possible duplicates using both the total file size and a hash calculation of (file-size)+(first-20KB-of-file). The hash algorithm itself is not the issue at this stage, but will likely be MurmurHash2.
Another option is to also store, say, bytes 9000 through 9020 and use that as a third condition when looking up a duplicate copy, or alternatively to compare byte-by-byte when the aforementioned lookup returns possible duplicates, as a last attempt to discard false positives.
How naive is my proposed implementation? Is there a reliable way to predict the amount of false positives? What other considerations should I be aware of?
Edit: I forgot to mention that the files are generally going to be compressed archives (ZIP,RAR) and on occasion JPG images.
You can use file size, hashes and partial-contents to quickly detect when two files are different, but you can only determine if they are exactly the same by comparing their contents in full.
It's up to you to decide whether the failure rate of your partial-file check will be low enough to be acceptable in your specific circumstances. Bear in mind that even an "exceedingly unlikely" event will happen frequently if you have enough volume. But if you know the type of data that the files will contain, you can judge the chances of two near-identical files (identical in the first 50KB) cropping up.
I would think that if a partial-file-match is acceptable to you, then a hash of those partial file contents is probably going to be pretty acceptable too.
If you have access to 50kB then I'd use all 50kB rather than only the first 20kB in my hash.
Picking an arbitrary 20 bytes probably won't help much (your file contents will either be very different, in which case hash+size clashes will be unlikely, or they will be very similar, in which case the chances of a randomly chosen 20 bytes being different will be quite low).
In your circumstances I would check the size, then a hash of the available data (50KB), and then, if this suggests a file match, a brute-force comparison of the available data just to minimise the risk, provided you don't expect to be adding so many duplicates that this would bog the system down.
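A sketch of that size -> prefix-hash -> prefix-compare cascade in Scala (the 50KB constant, SHA-1 standing in for MurmurHash2, and all names are illustrative; readNBytes needs JDK 11+):

    import java.io.{File, FileInputStream}
    import java.security.MessageDigest
    import java.util.Arrays

    val PrefixBytes = 50 * 1024 // the portion of each file the system can see

    // Read at most the first PrefixBytes of a file (JDK 11+ readNBytes).
    def head(f: File): Array[Byte] = {
      val in = new FileInputStream(f)
      try in.readNBytes(math.min(PrefixBytes.toLong, f.length()).toInt)
      finally in.close()
    }

    // Lookup key: total size plus a hash of the visible prefix.
    def partialKey(f: File): (Long, String) = {
      val md = MessageDigest.getInstance("SHA-1") // any stable hash would do
      md.update(head(f))
      (f.length(), md.digest().map("%02x".format(_)).mkString)
    }

    // When keys collide, a byte-for-byte check of the prefix removes hash
    // collisions; beyond the prefix a duplicate can never be fully proven.
    def likelyDuplicate(a: File, b: File): Boolean =
      partialKey(a) == partialKey(b) && Arrays.equals(head(a), head(b))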
It depends on the file types, but in most cases false positives will be pretty rare.
You probably won't have any in Office and graphical files. And executables are supposed to have a checksum in the header.
I'd say that the most likely false positive you may encounter is in source code files. They change often, and it may happen that a programmer replaces a few symbols somewhere after the first 20K.
Other than that I'd say they are pretty unlikely.
Why not use a hash of the first 50 KB and then store the size on the side? That would give you the most security with what you have to work with (with that said, there could be totally different content in the files after the first 50 KB without you knowing, so it's not a really secure system).
I find this dubious. It's likely that you would catch most duplicates with this method, but the possibility of false positives is huge. What about two versions of a 5 MB XML document whose last chapter is modified?