How to keep a big hash on disk instead of in RAM? - perl

I've got too little RAM to finish a calculation because of a large hash. Is there a drop-in Perl module that would let me use the hash without keeping it all in RAM? I expect it to top out around 4GB, and I've got a bit less than 2GB available for the script. I don't think processing time or disk I/O would be an issue.

You can use dbmopen to connect a hash to a DBM file. DBM files are not particularly sophisticated, but they handle flat hashes of simple keys and values just fine.
For anything more sophisticated, I would recommend using SQLite.
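A rough sketch with Perl's built-in dbmopen; the file name "bigdata" is made up, and which DBM library actually backs it depends on your build (see AnyDBM_File):

use strict;
use warnings;

# Tie %big_hash to an on-disk DBM file; entries live on disk, not in RAM.
my %big_hash;
dbmopen(%big_hash, "bigdata", 0644)
    or die "Cannot open DBM file 'bigdata': $!";

$big_hash{"some key"} = "some value";   # written straight to disk
print $big_hash{"some key"}, "\n";

dbmclose(%big_hash);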

You may try the DB_File module (or a similar module).
Memory usage hints: https://www.perlmonks.org/?node_id=146377
Take a look at AnyDBM_File for a list of similar modules, with a rudimentary comparison.
The $hash{$key1,$key2} syntax can be used to turn a multi-level hash into a flat (single-level) hash.
See $SUBSCRIPT_SEPARATOR ($;) in perlvar for details.
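A minimal sketch combining DB_File with the flattened-key trick; the file name is made up:

use strict;
use warnings;
use DB_File;
use Fcntl;

# Tie %h to a Berkeley DB file on disk instead of keeping it in RAM.
tie my %h, 'DB_File', 'big_hash.db', O_RDWR|O_CREAT, 0666, $DB_BTREE
    or die "Cannot open big_hash.db: $!";

# Emulate $h{$key1}{$key2} with a flat key: Perl joins the parts
# with $; (the subscript separator) when you write $h{$key1,$key2}.
my ($key1, $key2) = ('alpha', 'beta');
$h{$key1,$key2} = 42;

print $h{$key1,$key2}, "\n";   # prints 42

untie %h;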

Related

Perl "Out of memory!" when processing a large batch job

A few others and I are now the happy maintainers of a few legacy batch jobs written in Perl. About 30k lines of code, split across maybe 10-15 Perl files.
We have a lot of long-term fixes for improving how the batch process works, but in the short term, we have to keep the lights on for the various other projects that depend on the output of these batch jobs.
At the core of the main part of these batch jobs is a hash that is loaded up with a bunch of data collected from various data files in a bunch of directories. When these were first written, everything fit nicely into memory - no more than 100MB or so. Things of course grew over the years, and the hash now grows up to what the box can handle (8GB), leaving us with a nice message from Perl:
Out of memory!
This is, of course, a poor design for a batch job, and we have a clear (long-term) roadmap to improve the process.
I have two questions however:
What kind of short-term options can we look at, short of throwing more memory at the machine? Any OS settings that can be tweaked? Perl runtime/compile flags that can be set?
I'd also like to understand WHY Perl crashes with the "Out of memory!" error, as opposed to using the swap space that is available on the machine.
For reference, this is running on a Sun SPARC M3000 running Solaris 10 with 8 cores, 8 GB RAM, 10 GB swap space.
The reason throwing more memory at the machine is not really an ideal solution is mostly because of the hardware it's running on. Buying more memory for these Sun boxes is crazy expensive compared to the x86 world, and we probably won't be keeping these around much longer than another year.
The long-term solution is of course refactoring a lot of the codebase, and moving to Linux on x86.
There aren't really any generally applicable methods of reducing a program's memory footprint; it takes someone familiar with Perl to scan the code and find something relevant to your specific situation.
You may find that storing your hash as a disk-based database helps. The most general way is to use Tie::Hash::DBD, which will let you use any database that DBI supports, but it won't help with hashes whose values are references, such as nested hashes. (As ThisSuitIsBlackNot has commented, DBM::Deep overcomes even this obstacle.)
I presume your Perl code is crashing at startup? If you have a memory leak then it should be simpler to find the cause. Alternatively, it may be obvious to you that the initial population of the hash is wasteful, in that it is storing data that will never be used. If you show that part of your code then I am sure someone will be able to assist.
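Since DBM::Deep handles nested structures, here is a minimal sketch of it; the file name and keys are made up:

use strict;
use warnings;
use DBM::Deep;

# The whole structure is stored in one file on disk, nested hashes included.
my $db = DBM::Deep->new('batch.db');

$db->{jobs}{20240101}{status} = 'done';   # multi-level keys work
print $db->{jobs}{20240101}{status}, "\n";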
Try using the 64-bit version of the interpreter. I had the same issue with the "Out of memory!" message. In my case, 32-bit Strawberry Perl ate 2 GB of RAM before terminating. The 64-bit interpreter can use much more: it ate the rest of my 16 GB and then started to swap like hell, but I got a result.

Perl DBM vs. Storable

For my current project I need to store a little database on disk that I read once when my program runs and write once.
I have looked into Perl's DBM functionality, and from what I understand it provides merely a hash that is stored on disk, with every read and write going directly to disk.
My question is: could I not simply use Storable or one of the related modules to achieve the same thing (a persistent hash) with far less file I/O overhead? (The hashes will never be too large to fit into memory easily.)
Regards
Nick
SQLite is fast becoming the standard for simple on-disk databases. And in Perl you can just use DBD::SQLite and you're good to go.
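A minimal sketch of that, assuming DBD::SQLite is installed; the file name and table layout are made up:

use strict;
use warnings;
use DBI;

# SQLite keeps everything in a single file; no separate server needed.
my $dbh = DBI->connect("dbi:SQLite:dbname=app.db", "", "",
                       { RaiseError => 1, AutoCommit => 1 });

$dbh->do("CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)");

my $ins = $dbh->prepare("INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)");
$ins->execute("answer", 42);

my ($v) = $dbh->selectrow_array(
    "SELECT v FROM kv WHERE k = ?", undef, "answer");
print "$v\n";

$dbh->disconnect;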
Since the previous answers didn't really answer your actual question, "yes, you can"... with the following caveats:
Storable isn't really suited to concurrent access.
You will need to roll your own "atomic" update (i.e. write to a tmp file, then rename it into place; see the sketch below).
If performance isn't really an issue, you could also use Data::Dumper (with the resulting file being somewhat human readable).
You could splat the contents to CSV.
I often use Dumper when there is only going to be a single task accessing the file - and it gives me a way to read/modify the contents if I see fit.
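A minimal sketch of that write-to-tmp-then-rename pattern with Storable; the file names are made up:

use strict;
use warnings;
use Storable qw(nstore retrieve);
use File::Temp qw(tempfile);

my %db = ( last_run => time, count => 42 );

# Write to a temporary file in the same directory, then rename it over
# the real file, so a crash mid-write never leaves a half-written database.
my ($fh, $tmp) = tempfile('db-XXXXXX', DIR => '.', UNLINK => 0);
close $fh;
nstore(\%db, $tmp) or die "nstore failed: $!";
rename $tmp, 'db.storable' or die "rename failed: $!";

# Reading it back on the next run:
my $loaded = retrieve('db.storable');
print $loaded->{count}, "\n";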

Looking for a Perl module to store a Hash structure in shared RAM

I'd like to store a data structure persistently in RAM and have it accessible from pre-forked web server processes in Perl.
Ideally I would like it to behave like memcached but without the need for a separate daemon. Any ideas?
Use Cache::FastMmap and all you need is a file. It uses mmap to provide a shared in-memory cache for IPC, which means it is quite fast. See the documentation for possible issues and caveats.
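A minimal sketch, assuming Cache::FastMmap is installed; the share_file path and keys are made up:

use strict;
use warnings;
use Cache::FastMmap;

# The cache lives in an mmap'ed file, so every pre-forked process that
# points at the same share_file sees the same data.
my $cache = Cache::FastMmap->new(
    share_file => '/tmp/app-cache',
    cache_size => '64m',
);

$cache->set('config', { retries => 3, timeout => 10 });   # values are serialized
my $config = $cache->get('config');
print $config->{retries}, "\n";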
IPC::SharedMem might fit the bill.
Mod_perl shares RAM on systems with properly implemented copy-on-write forking. Load your Perl hash in a BEGIN block of your mod_perl program, and all forked instances of the mod_perl program will share the memory, as long as there are no writes to the pages storing your hash. This doesn't work perfectly (some pages will get written to) but on my servers and data it decreases memory usage by 70-80%.
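A rough sketch of that idea as a module preloaded before the server forks; the package name, file path, and data format are all made up for illustration:

package My::SharedData;

use strict;
use warnings;

our %BIG;

# Populate the hash once, at compile time, in the parent process.
# Children forked afterwards share these pages copy-on-write as long
# as they only read from %BIG.
BEGIN {
    open my $fh, '<', '/path/to/data.txt' or die "Cannot open data file: $!";
    while (my $line = <$fh>) {
        chomp $line;
        my ($key, $value) = split /\t/, $line, 2;
        $BIG{$key} = $value;
    }
    close $fh;
}

1;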
Mod_perl also speeds up your server by eliminating the compile time for Perl on subsequent web requests. The downside of mod_perl is that you have to program carefully and avoid code that modifies global variables, since those variables, like your hash, are shared by all the mod_perl instances. It is worthwhile to learn enough Perl so that you don't need to change globals anyway!
The performance gains from mod_perl are fantastic, but mod_perl is not available in many shared hosts. It is easy to screw up, and hard to debug while you are learning it. I only use it when the performance improvements are appreciated enough by my customers to justify my development pain.

How can I efficiently group a large list of URLs by their host name in Perl?

I have a text file that contains over one million URLs. I have to process this file in order to assign the URLs to groups, based on host address:
{
'http://www.ex1.com' => ['http://www.ex1.com/...', 'http://www.ex1.com/...', ...],
'http://www.ex2.com' => ['http://www.ex2.com/...', 'http://www.ex2.com/...', ...]
}
My current basic solution takes about 600 MB of RAM to do this (size of file is about 300 MB). Could you provide some more efficient ways?
My current solution simply reads line by line, extracts host address by regex and puts the url into a hash.
EDIT
Here is my implementation (I've cut off irrelevant things):
use Storable qw(store);

while ($line = <STDIN>) {
    chomp($line);
    $line =~ /(http:\/\/.+?)(\/|$)/i;
    $host = $1;
    push @{$urls{$host}}, $line;
}
store \%urls, 'out.hash';
One approach that you could take is tying your URL hash to a DBM like BerkeleyDB. You can explicitly give it options for how much memory it can use.
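A minimal sketch of that using DB_File (a Berkeley DB front end); the file name and cache size are made up, and since DBM values are plain strings the per-host URLs are joined with newlines instead of being kept in an array reference:

use strict;
use warnings;
use DB_File;
use Fcntl;

# Cap the in-memory cache Berkeley DB uses; the tied hash itself lives on disk.
$DB_HASH->{cachesize} = 100 * 1024 * 1024;   # ~100 MB

tie my %urls, 'DB_File', 'urls.db', O_RDWR|O_CREAT, 0666, $DB_HASH
    or die "Cannot tie urls.db: $!";

while (my $line = <STDIN>) {
    chomp $line;
    next unless $line =~ m{^(http://[^/]+)}i;
    my $host = $1;
    $urls{$host} = defined $urls{$host} ? "$urls{$host}\n$line" : $line;
}

untie %urls;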
If you read 600MB from two files and store them in memory (in the hash) there is not much room for optimization in terms of memory use (short of compressing the data, which is probably not a viable option).
But depending on how you are going to use the data in the hash, it might be worth to consider storing the data in a database, and querying it for the information you need.
EDIT:
Based on the code you have posted, a quick optimization would be to not store the entire line but just the relative url. After all you already have the host name as a key in your hash.
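For example, a sketch of that change, keeping only the path since the host is already the key:

use strict;
use warnings;

my %urls;
while (my $line = <STDIN>) {
    chomp $line;
    next unless $line =~ m{^(http://[^/]+)(.*)}i;
    my ($host, $path) = ($1, $2);
    push @{ $urls{$host} }, $path eq '' ? '/' : $path;   # store only the path part
}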
Other than by storing your data structures to disk (tied DBM hash as suggested by Leon Timmermans, an SQL database such as SQLite3, etc.), you're not going to be able to reduce memory consumption much. 300M of actual data, plus the perl interpreter, plus the bytecode representation of your program, plus metadata on each of the extracted strings is going to add up to substantially more than 300M of total memory used if you keep it all in memory. If anything, I'm mildly surprised that it's only double the size of the input file.
One other thing to consider is that, if you're going to be processing the same file more than once, storing the parsed data structure on disk means that you'll never have to take the time to re-parse it on future runs of the program.
What exactly are you trying to achieve? If you are going for some complex analysis, storing the data in a database is a good idea. If the grouping is just an intermediary step, you might simply sort the text file and then process it sequentially, directly deriving the results you are looking for.
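A minimal sketch of the sort-then-stream approach, assuming the input has already been sorted (e.g. with the system sort) so that URLs from the same host are adjacent; process_group is a made-up placeholder for whatever analysis you need:

use strict;
use warnings;

my ($current_host, @group);

while (my $line = <STDIN>) {
    chomp $line;
    next unless $line =~ m{^(http://[^/]+)}i;
    my $host = $1;

    # When the host changes, the previous group is complete and can be
    # processed and discarded, so only one host's URLs are ever in memory.
    if (defined $current_host && $host ne $current_host) {
        process_group($current_host, \@group);
        @group = ();
    }
    $current_host = $host;
    push @group, $line;
}
process_group($current_host, \@group) if @group;

sub process_group {
    my ($host, $urls) = @_;
    printf "%s: %d URLs\n", $host, scalar @$urls;
}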

How to efficiently process 300+ Files concurrently in scala

I'm going to work on comparing around 300 binary files using Scala, byte by byte, 4MB each. However, judging from what I've already done, processing 15 files at the same time using java.BufferedInputStream took around 90 seconds on my machine, so I don't think my solution would scale well to a large number of files.
Ideas and suggestions are highly appreciated.
EDIT: The actual task is not just comparing the differences but processing those files in the same sequence order. Let's say I have to look at the i-th byte in every file at the same time, and then move on to the (i+1)-th.
Did you notice your hard drive slowly evaporating as you read the files? Reading that many files in parallel is not something mechanical hard drives are designed to do at full-speed.
If the files will always be this small (4MB is plenty small enough), I would read the entire first file into memory, and then compare each file with it in series.
I can't comment on solid-state drives, as I have no first-hand experience with their performance.
You are quite screwed, indeed.
Let's see... 300 * 4 MB = 1.2 GB. Does that fit your memory budget? If it does, by all means read them all into memory. But, to speed things up, you might try the following:
Read 512 KB of every file, sequentially. You might try reading from 2 to 8 at the same time -- perhaps through Futures, and see how well it scales. Depending on your I/O system, you may gain some speed by reading a few files at the same time, but I do not expect it to scale much. EXPERIMENT! BENCHMARK!
Process those 512 KB using Futures.
Go back to step 1, unless you are finished with the files.
Get the result back from the processing Futures.
On step number 1, by limiting the parallel reads you avoid thrashing your I/O subsystem. Push it as much as you can, maybe a bit less than that, but definitely not more than that.
By not reading all files on step number 1, you use some of the time spent reading these files doing useful CPU work. You may experiment with lowering the bytes read on step 1 as well.
Are the files exactly the same number of bytes? If they are not, the files can be compared simply via the File.length() method to determine a first-order guess of equality.
Of course you may be wanting to do a much deeper comparison than just "are these files the same?"
If you are just looking to see if they are the same I would suggest using a hashing algorithm like SHA1 to see if they match.
Here is some java source to make that happen
Many large systems that handle data use SHA1, including the NSA and Git.
It's simply more efficient to use a hash instead of a byte-by-byte compare. The hashes can also be stored for later, to see if the data has been altered.
Here is a talk by Linus Torvalds specifically about Git; it also mentions why he uses SHA1.
I would suggest using NIO if possible. Introduction To Java NIO and NIO2 seems like a decent guide to using NIO if you are not familiar with it. I would not suggest reading a file and doing a comparison byte by byte, if that is what you are currently doing. You can create a ByteBuffer to read in chunks of data from a file and then do comparisons from that.