How to sort a large file on two levels efficiently? - perl

I have a very large file, over 100GB (many billions of lines), and I would like to conduct a two-level sort as quickly as possible on a UNIX system with limited memory. This will be one step in a large Perl script, so I'd like to use Perl if possible.
So, how can I do this? My data looks like this:
A 129
B 192
A 388
D 148
D 911
A 117
...but for billions of lines. I need to sort first by letter and then by number. Would it be easier to use a UNIX sort, like...
sort -k1,2 myfile
Or can I do this all in Perl somehow? My system will have something like 16GB of RAM, but the file is about 100GB.
Thanks for any suggestions!

The UNIX sort utility can handle sorting data larger than your working 16GB of RAM by creating temporary working files on disk.
So I'd recommend simply using UNIX sort for this as you've suggested, invoking the option -T tmp_dir and making sure that tmp_dir has enough disk space to hold all of the temporary working files that will be created there.
By the way, this is discussed in a previous SO question.
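Since this step has to run from inside a larger Perl script anyway, one option is simply to shell out to sort from Perl. A minimal sketch, assuming GNU sort and placeholder file and tmp-dir paths:
#!/usr/bin/perl
use strict;
use warnings;

my $infile  = 'myfile';          # placeholder input path
my $outfile = 'myfile.sorted';   # placeholder output path
my $tmpdir  = '/scratch/tmp';    # must have enough space for sort's temp files

# LC_ALL=C avoids slow locale-aware collation; -k1,1 sorts the letter,
# -k2,2n sorts the number numerically within each letter.
$ENV{LC_ALL} = 'C';
my @cmd = ('sort', '-T', $tmpdir, '-k1,1', '-k2,2n', '-o', $outfile, $infile);
system(@cmd) == 0 or die "sort failed: $?";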

UNIX sort is the best option for sorting data of this scale. To speed it up, I would recommend compressing the temporary files with a fast compression algorithm such as LZO (usually distributed as lzop). Set a big sort buffer using the -S option. If you have a disk that is faster than the one holding your default /tmp, also set -T. Finally, since the second field should be sorted as a number, you have to declare it as a numeric sort key. So a line like this should give the best performance:
LC_ALL=C sort -S 90% --compress-program=lzop -k1,1 -k2n

I had the exact same issue!
After searching a lot, and since I did not want any dependency on the shell (UNIX) in order to keep it portable on Windows, I came up with the solution below:
#!/usr/bin/perl
use strict;
use warnings;
use File::Sort qw(sort_file);
my $src_dic_name = 'C:\STORAGE\PERSONAL\PROJECTS\perl\test.txt';
# Options mirror sort(1): k = key field, t = field separator, I = input file, o = output file.
sort_file({ k => 1, t => ' ', I => $src_dic_name, o => $src_dic_name . '.sorted' });
I know this is an old post, but I'm updating it with the solution so that it is easy to find.
The documentation for File::Sort is on CPAN.

Related

How to keep a big hash on disk instead of in RAM?

I've got too little RAM to finish a calculation because of a large hash. Is there a drop-in Perl module that would let me use the hash without keeping it all in RAM? I expect it to top out around 4GB, and I've got a bit less than 2GB available for the script. I don't think processing time or disk I/O would be an issue.
You can use dbmopen to open a hash connected to a DBM file. These are not particularly sophisticated, but they can handle shallow hashes of simple keys and values.
For anything more sophisticated, I would recommend using SQLite.
You may try the DB_File module (or similar modules).
Memory usage hints: https://www.perlmonks.org/?node_id=146377
Take a look at AnyDBM_File for the other similar modules that ship with Perl, along with a rudimentary comparison.
The $hash{$key1,$key2} syntax can be used to turn a multi-level hash into a flat (single-level) hash.
See $SUBSCRIPT_SEPARATOR ($;) in perlvar for details.
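For example, a minimal sketch of the tie approach with DB_File, using the multidimensional-key trick (the database file name is a placeholder):
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl;
use DB_File;

# Tie %h to an on-disk Berkeley DB file; entries are paged in and out
# instead of living entirely in RAM.
tie my %h, 'DB_File', 'big_hash.db', O_RDWR | O_CREAT, 0644, $DB_HASH
    or die "Cannot tie: $!";

# Emulated multidimensional keys: $h{$a, $b} really uses the single flat
# key join($;, $a, $b), so a two-level structure fits in a flat DBM file.
$h{'alpha', 42} = 'some value';
print $h{'alpha', 42}, "\n";

untie %h;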

Perl: performance hit with reading multiple files

I was wondering which is better in this case?
I have to read in thousands of files. I was thinking of either opening each file, reading it, and closing it, or cat-ing all the files into one file and reading that.
Suggestions? This is all in Perl.
It shouldn't make that much of a difference. This sounds like premature optimization to me.
If the time spent cat-ing all the files into one bigger file doesn't matter, reading the single file will be faster (but only if you read it sequentially, which is the default).
Of course, if the concatenation step is counted as part of the process, it will be much slower overall, because you have to read, write and then read again.
In general, reading one file of 1000M should be faster than reading 100 files of 10M, because for the 100 files you'll need to look up the metadata of each one.
As tchrist says, the performance difference might not be important. I think it depends on the kind of files (e.g. for a huge number of very small files the difference would be much bigger) and on the overall performance of your system and its storage.
Note that cat * can fail if the number of files makes the expanded argument list longer than the system's ARG_MAX limit, so sequential reads can actually be safer.
Also, consider using opendir and readdir instead of glob if all your files are located in the same dir.
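A minimal sketch of that sequential approach with opendir/readdir (the directory name is a placeholder):
#!/usr/bin/perl
use strict;
use warnings;

my $dir = 'data';   # placeholder: the directory holding the thousands of files

opendir(my $dh, $dir) or die "Cannot open $dir: $!";
for my $name (sort readdir $dh) {
    my $path = "$dir/$name";
    next unless -f $path;                 # skips . and .. as well
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    while (my $line = <$fh>) {
        # process $line here
    }
    close $fh;
}
closedir $dh;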
Just read the files sequentially. Perl's file i/o functions are pretty thin wrappers around native file i/o calls in the OS, so there isn't much point in fretting about performance from simple file i/o.

How can I efficiently group a large list of URLs by their host name in Perl?

I have a text file that contains over one million URLs. I have to process this file in order to assign the URLs to groups, based on host address:
{
'http://www.ex1.com' => ['http://www.ex1.com/...', 'http://www.ex1.com/...', ...],
'http://www.ex2.com' => ['http://www.ex2.com/...', 'http://www.ex2.com/...', ...]
}
My current basic solution takes about 600 MB of RAM to do this (the size of the file is about 300 MB). Could you suggest a more efficient approach?
My current solution simply reads the file line by line, extracts the host address with a regex, and pushes the URL into a hash.
EDIT
Here is my implementation (I've cut off irrelevant things):
use Storable qw(store);

my %urls;
while (my $line = <STDIN>) {
    chomp($line);
    next unless $line =~ /(http:\/\/.+?)(\/|$)/i;
    my $host = $1;
    push @{ $urls{$host} }, $line;   # "push #{...}" was a typo for "push @{...}"
}
store \%urls, 'out.hash';
One approach you could take is tying your URL hash to a DBM such as BerkeleyDB. You can explicitly give it options for how much memory it may use.
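A rough sketch of that idea with the BerkeleyDB module and an explicit cache size (the file name and cache size are assumptions; plain DBM values must be strings, so the URLs of each host are stored as one newline-joined string rather than an array reference):
#!/usr/bin/perl
use strict;
use warnings;
use BerkeleyDB;

# Tie the hash to an on-disk Berkeley DB file with a bounded cache, so
# memory use stays near the cache size instead of growing with the data.
tie my %urls, 'BerkeleyDB::Hash',
    -Filename  => 'urls.db',            # placeholder file name
    -Flags     => DB_CREATE,
    -Cachesize => 64 * 1024 * 1024      # 64MB cache, tune to taste
    or die "Cannot tie: $BerkeleyDB::Error";

while (my $line = <STDIN>) {
    chomp $line;
    next unless $line =~ m{^(http://[^/]+)}i;
    my $host = $1;
    $urls{$host} = defined $urls{$host} ? "$urls{$host}\n$line" : $line;
}
untie %urls;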
If you read the roughly 300MB of data and store it all in memory (in the hash), there is not much room for optimization in terms of memory use (short of compressing the data, which is probably not a viable option).
But depending on how you are going to use the data in the hash, it might be worth considering storing the data in a database and querying it for the information you need.
EDIT:
Based on the code you have posted, a quick optimization would be to store not the entire line but just the relative URL. After all, you already have the host name as a key in your hash.
Other than by storing your data structures to disk (tied DBM hash as suggested by Leon Timmermans, an SQL database such as SQLite3, etc.), you're not going to be able to reduce memory consumption much. 300M of actual data, plus the perl interpreter, plus the bytecode representation of your program, plus metadata on each of the extracted strings is going to add up to substantially more than 300M of total memory used if you keep it all in memory. If anything, I'm mildly surprised that it's only double the size of the input file.
One other thing to consider is that, if you're going to be processing the same file more than once, storing the parsed data structure on disk means that you'll never have to take the time to re-parse it on future runs of the program.
What exactly are you trying to achieve? If you are going for some complex analysis, storing the data in a database is a good idea. If the grouping is just an intermediary step, you might simply sort the text file and then process it sequentially, deriving the results you are looking for directly.
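For example, a minimal sketch of that sort-then-stream approach, assuming the input has already been sorted (so all URLs of one host are adjacent) and with process_group standing in as a hypothetical handler:
#!/usr/bin/perl
use strict;
use warnings;

# Reads pre-sorted URLs on STDIN; because all URLs of one host are
# adjacent, only the current group is ever held in memory.
my ($current_host, @group);

while (my $line = <STDIN>) {
    chomp $line;
    next unless $line =~ m{^(http://[^/]+)}i;
    my $host = $1;
    if (defined $current_host && $host ne $current_host) {
        process_group($current_host, \@group);
        @group = ();
    }
    $current_host = $host;
    push @group, $line;
}
process_group($current_host, \@group) if defined $current_host;

sub process_group {
    my ($host, $urls) = @_;
    print "$host: ", scalar(@$urls), " URLs\n";   # replace with the real work
}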

How can I search a large sorted file in Perl?

Can you suggest any CPAN modules for searching a large sorted file?
The file is structured data, about 15 million to 20 million lines, but I only need to find about 25,000 matching entries, so I don't want to load the whole file into a hash.
Thanks.
Perl is well-suited to doing this, without the need for an external module (from CPAN or elsewhere).
Some code:
while (<STDIN>) {
    if (/regular expression/) {
        # process each matched line here
        print;                      # e.g. just echo the matching lines
    }
}
You'll need to come up with your own regular expression to specify which lines you want to match in your file. Once you match, you need your own code to process each matched line.
Put the above code in a script file and run it with your file redirected to stdin.
A scan over the whole file may be the fastest way. You can also try File::Sorted, which will do a binary search for a given record. Locating one record in a file of 15-20 million lines should take roughly 25 seeks (the base-2 log of the line count). This means that searching for 25,000 records would need only around 600,000 seeks/comparisons, compared to examining all of the millions of rows for each naive lookup.
Disk IO being what it is, you may want to try the easy way first, but File::Sorted is a theoretical win.
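If you would rather avoid an extra CPAN dependency, the core Search::Dict module does a comparable binary search over a sorted file; a minimal sketch with placeholder file name and keys:
#!/usr/bin/perl
use strict;
use warnings;
use Search::Dict;

my $sorted_file = 'big_sorted.txt';                # placeholder: the large sorted file
my @wanted      = ('key00123', 'key04567');        # placeholder keys to look up

open(my $fh, '<', $sorted_file) or die "Cannot open $sorted_file: $!";
for my $key (@wanted) {
    # look() binary-searches the sorted file and positions the handle at
    # the first line that is >= $key.
    look($fh, $key, 0, 0);
    my $line = <$fh>;
    print $line if defined $line && index($line, $key) == 0;
}
close $fh;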
You don't want to search the file, so do what you can to avoid it. We don't know much about your problem, but here are some tricks I've used in previous problems, all of which try to do work ahead of time:
Break up the file into a database. That could be SQLite, even.
Pre-index the file based on the data that you want to search.
Cache the results from previous searches.
Run common searches ahead of time, automatically.
All of these trade storage space for speed. Some of these I would set up as overnight jobs so they were ready for people when they came into work.
You mention that you have structured data, but don't say any more. Is each line a complete record? How often does this file change?
Sounds like you really want a database. Consider SQLite, using Perl's DBI and DBD::SQLite modules.
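A minimal sketch of that route (the table layout, file names and "key<TAB>value" line format are assumptions): load the file into SQLite once, index the key column, then run the ~25,000 lookups as indexed queries.
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=records.db', '', '',
                       { RaiseError => 1, AutoCommit => 1 });

# One-time load: assume each line of the big file is "key<TAB>rest".
$dbh->do('CREATE TABLE IF NOT EXISTS records (k TEXT, v TEXT)');
$dbh->begin_work;                                  # one big transaction keeps the bulk insert fast
my $ins = $dbh->prepare('INSERT INTO records (k, v) VALUES (?, ?)');
open(my $fh, '<', 'big_file.txt') or die $!;       # placeholder file name
while (my $line = <$fh>) {
    chomp $line;
    my ($k, $v) = split /\t/, $line, 2;
    $ins->execute($k, $v);
}
close $fh;
$dbh->do('CREATE INDEX IF NOT EXISTS idx_k ON records (k)');
$dbh->commit;

# Later: each of the ~25,000 lookups is an indexed query instead of a file scan.
my $sel = $dbh->prepare('SELECT v FROM records WHERE k = ?');
$sel->execute('some_key');                         # placeholder key
while (my ($v) = $sel->fetchrow_array) {
    print defined $v ? "$v\n" : "\n";
}
$dbh->disconnect;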
When you process an input file with while ( <$filehandle> ), Perl takes the file one line at a time (one per iteration of the loop), so you don't need to worry about it clogging up your memory. Not so with a foreach loop over <$filehandle>, which slurps the whole file into memory. Use a regex or whatever else to find what you're looking for, and put that in a variable/array/hash or write it out to a new file.

How to efficiently process 300+ Files concurrently in scala

I'm going to compare around 300 binary files, byte by byte, using Scala; each file is 4MB. However, judging from what I've already done, processing 15 files at the same time using java.io.BufferedInputStream took around 90 seconds on my machine, so I don't think my solution will scale well to a large number of files.
Ideas and suggestions are highly appreciated.
EDIT: The actual task is not just comparing the differences but processing those files in the same sequence order. Let's say I have to look at byte i in every file at the same time, and then move on to byte i+1.
Did you notice your hard drive slowly evaporating as you read the files? Reading that many files in parallel is not something mechanical hard drives are designed to do at full-speed.
If the files will always be this small (4MB is plenty small enough), I would read the entire first file into memory, and then compare each file with it in series.
I can't comment on solid-state drives, as I have no first-hand experience with their performance.
You are quite screwed, indeed.
Let's see... 300 * 4 MB = 1.2 GB. Does that fit your memory budget? If it does, by all means read them all into memory. But, to speed things up, you might try the following:
Read 512 KB of every file, sequentially. You might try reading from 2 to 8 at the same time -- perhaps through Futures, and see how well it scales. Depending on your I/O system, you may gain some speed by reading a few files at the same time, but I do not expect it to scale much. EXPERIMENT! BENCHMARK!
Process those 512 KB using Futures.
Go back to step 1, unless you are finished with the files.
Get the result back from the processing Futures.
On step number 1, by limiting the parallel reads you avoid thrashing your I/O subsystem. Push it as much as you can, maybe a bit less than that, but definitely not more than that.
By not reading all files on step number 1, you use some of the time spent reading these files doing useful CPU work. You may experiment with lowering the bytes read on step 1 as well.
Are the files exactly the same number of bytes? If they are not, the files can be compared simply via the File.length() method to determine a first-order guess of equality.
Of course you may be wanting to do a much deeper comparison than just "are these files the same?"
If you are just looking to see whether they are the same, I would suggest using a hashing algorithm like SHA-1 to check if they match.
Here is some Java source to make that happen.
Many large systems that handle data use SHA-1, including the NSA and Git.
It is simply more efficient to use a hash instead of a byte-by-byte compare, and the hashes can also be stored for later to see whether the data has been altered.
Here is a talk by Linus Torvalds specifically about Git; it also mentions why he uses SHA-1.
I would suggest using NIO if possible. Introduction To Java NIO and NIO.2 seems like a decent guide to NIO if you are not familiar with it. I would not suggest reading a file and doing a comparison byte by byte, if that is what you are currently doing. You can create a ByteBuffer to read in chunks of data from a file and then do comparisons from that.