How can I search a large sorted file in Perl?

Can you suggest any CPAN modules for searching a large sorted file?
The file contains structured data, about 15 to 20 million lines, but I only need to find about 25,000 matching entries, so I don't want to load the whole file into a hash.
Thanks.

Perl is well-suited to doing this, without the need for an external module (from CPAN or elsewhere).
Some code:
while (<STDIN>) {
    if (/regular expression/) {
        # process each matched line here, e.g.:
        print;
    }
}
You'll need to come up with your own regular expression to specify which lines you want to match in your file. Once you match, you need your own code to process each matched line.
Put the above code in a script file and run it with your file redirected to stdin.

A scan over the whole file may be the fastest way. You can also try File::Sorted, which will do a binary search for a given record. Locating one record in a 25-million-line file should require roughly log2(25,000,000), or about 25, seeks per record. This means that to search for 25,000 records, you would only need around 0.6 million seeks/comparisons, compared to 25,000,000 reads to naively examine each row.
Disk IO being what it is, you may want to try the easy way first, but File::Sorted is a theoretical win.
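If you'd rather not pull in a module from outside the core, Search::Dict (which ships with Perl) does the same kind of binary search on a sorted text file. A minimal sketch, assuming the file is sorted on a leading key field; the file name and keys below are placeholders:

use strict;
use warnings;
use Search::Dict;                        # core module: binary search in a sorted text file

open my $fh, '<', 'big_sorted_file.txt' or die "open: $!";

my @wanted = ('key00042', 'key13337');   # in practice, your ~25,000 keys
for my $key (@wanted) {
    look $fh, $key;                      # position the handle at the first line ge $key
    while (defined(my $line = <$fh>)) {
        last unless $line =~ /^\Q$key\E/;   # stop once we are past this key
        print $line;                        # or collect it however you need
    }
}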

You don't want to search the file, so do what you can to avoid it. We don't know much about your problem, but here are some tricks I've used in previous problems, all of which try to do work ahead of time:
Break up the file into a database. That could be SQLite, even.
Pre-index the file based on the data that you want to search.
Cache the results from previous searches.
Run common searches ahead of time, automatically.
All of these trade storage space for speed. Some of these I would set up as overnight jobs so they were ready for people when they came into work.
You mention that you have structured data, but don't say any more. Is each line a complete record? How often does this file change?

Sounds like you really want a database. Consider SQLite, using Perl's DBI and DBD::SQLite modules.
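A minimal sketch of that approach, assuming a tab-separated key/value layout; the file, table, and column names are placeholders:

use strict;
use warnings;
use DBI;

# Load the records into SQLite once, then query it repeatedly.
my $dbh = DBI->connect('dbi:SQLite:dbname=records.db', '', '',
                       { RaiseError => 1, AutoCommit => 0 });
$dbh->do('CREATE TABLE IF NOT EXISTS records (rec_key TEXT PRIMARY KEY, rec_value TEXT)');

my $ins = $dbh->prepare('INSERT OR REPLACE INTO records (rec_key, rec_value) VALUES (?, ?)');
open my $fh, '<', 'big_sorted_file.txt' or die "open: $!";
while (my $line = <$fh>) {
    chomp $line;
    my ($key, $value) = split /\t/, $line, 2;
    $ins->execute($key, $value);
}
$dbh->commit;

# Later, any of the ~25,000 keys can be looked up via the primary-key index.
my $sel = $dbh->prepare('SELECT rec_value FROM records WHERE rec_key = ?');
$sel->execute('key00042');
my ($value) = $sel->fetchrow_array;
print defined $value ? "$value\n" : "not found\n";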

When you process an input file with while ( <$filehandle> ), it only reads the file one line at a time (one line per iteration of the loop), so you don't need to worry about it clogging up your memory. Not so with a for loop, which slurps the whole file into memory because the readline is evaluated in list context. Use a regex or whatever else to find what you're looking for and put that in a variable/array/hash, or write it out to a new file.
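A minimal illustration of the difference (the file name and pattern are placeholders):

use strict;
use warnings;

open my $fh, '<', 'data.txt' or die "open: $!";

# Line at a time: only the current line is held in memory.
while (my $line = <$fh>) {
    print $line if $line =~ /pattern/;
}

# By contrast, a list-context read pulls the entire file into memory at once
# (on a freshly opened handle):
#   for my $line (<$fh>) { ... }
#   my @all_lines = <$fh>;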

Related

How to hash a filename down to a small number or digit for output processing

I am not a Perl programmer, but I've inherited existing code that goes to a directory, finds all files in that folder and its subfolders (usually JPG or Office files) and then converts them into a single file that is used to load into a SQL Server database. The customer has about 500,000 of these files.
It takes about 45 mins to create the file and then another 45 mins for SQL to load the data. Crudely, it's doing about 150 files per second, which is reasonable, but time is the issue for the job. There are many reasons I don't want to use other techniques, so please don't suggest other options unless they are closely aligned with this process.
What I was considering is to improve speed by running something like 10 concurrent processes. Each process would be passed a different argument (0-9). Each process would go to the directory and find all files as it currently does, but for each file found it would hash or kludge the filename down to a single digit (0-9), and if that matched the supplied argument, the process would process that file and write it out to its own unique file stream.
Then I would have 10 output files at the end. I doubt that the SQL Server side could be improved as I would have to load to separate tables and then merge in the database and as these are BLOB objects, will not be fast.
So I am looking for some basic code or clues on what functions to use in Perl to take a variable (the file name, $File) and generate a single 0 to 9 value based on it. It could probably be done by getting the ASCII value of each char, adding these together to get a long number, then adding the individual digits of that number together until you eventually get a single-digit answer.
Any clues or suggested techniques?
Here's an easy one to implement, suggested in the unpack function documentation:
sub string_to_code {
    # convert an arbitrary string to a digit from 0-9
    my ($string) = @_;
    return unpack("%32W*", $string) % 10;   # %32W* sums the character values into a 32-bit checksum
}
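For example, a hypothetical call deciding whether this worker (passed a digit on the command line) should handle a given file:

my ($my_digit) = @ARGV;                          # e.g. run as: perl worker.pl 3
my $bucket = string_to_code('IMG_0001.jpg');     # deterministic value from 0 to 9
if ($bucket == $my_digit) {
    # this worker owns the file: process it and write to its own output stream
}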

Perl read file vs traverse array Performance

I need to test lines in a file against multiple values.
What are the difference in terms of time between opening a file and reading line by line each time vs opening the file once placing it in an array and traversing the array each time?
To expand upon what @mpacpec said in his comment, file IO is always slower than memory reads/writes. But there's more to the story. "Test lines in a file against multiple values" can be interpreted in a lot of ways, so without knowing more about what exactly you are trying to do, no one can tell you anything more specific. So the answer is, "It depends". It depends on the file size, what you're testing and how often, and how you're testing.
However, pragmatically speaking, based upon my understanding of what you've said, you'll have to read the whole file one way or another, and you'll have to test every line, one way or another. Do what's easiest to write/read/understand, and see if that's fast enough. If it isn't, you have a much more useful baseline from which to ask the question. Personally, I'd start with a linewise read and test loop and work from there, simply because I think that'd be easier and faster to write correctly.
Make it work, then make it fast :)
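A minimal sketch of that linewise read-and-test loop, with a placeholder file name and test patterns:

use strict;
use warnings;

my @tests = (qr/foo/, qr/bar\d+/, qr/^baz:/);    # whatever "multiple values" means in your case

open my $fh, '<', 'input.txt' or die "open: $!";
while (my $line = <$fh>) {
    for my $re (@tests) {
        if ($line =~ $re) {
            print "matched $re: $line";
            last;                                # stop after the first matching test
        }
    }
}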
Provided that in the former case you can do all the tests you need on each line (rather than re-reading the file each time), the two approaches should be roughly the same in speed and in I/O and CPU efficiency (ignoring second-order effects, such as whether the disk I/O gets interrupted by other processes more easily). However, the latter case - reading the whole file - may hit memory limits for large files, which may cause it to lose performance dramatically or even fail.
The main cost of processing the file line by line is loss of flexibility - for instance, if you need to cross-reference the lines, it would not be easy (whereas if they are all in memory, the code to do that would be simpler and faster).

Recover standard out from a failed Hadoop job

I'm running a large Hadoop streaming job where I process a large list of files with each file being processed as a single unit. To do this, my input to my streaming job is a single file with a list of all the file names on separate lines.
In general, this works well. However, I ran into an issue where I was partway through a large job (~36%) when Hadoop hit some problem files and, for some reason, that seemed to crash the entire job. If the job had completed successfully, standard out would have contained a line for each file as it was completed, along with some stats from my program that processes each file. However, with this failed job, when I try to look at the output that would have been sent to standard out, it is empty. I know that roughly 36% of the files were processed (because I'm saving the data to a database), but it's not easy for me to generate a list of which files were successfully processed and which ones remain. Is there any way to recover this logging to standard out?
One thing I can do is look at all of the log files for the completed/failed tasks, but this seems more difficult to me and I'm not sure how to go about retrieving the good/bad list of files this way.
Thanks for any suggestions.
Hadoop captures System.out data here:
/mnt/hadoop/logs/userlogs/task_id
However, I've found this unreliable, and Hadoop jobs don't usually use standard out for debugging; rather, the convention is to use counters.
For each of your documents, you can summarize document characteristics: length, number of normal ASCII chars, number of newlines.
Then you can have two counters: a counter for "good" files and a counter for "bad" files.
It would probably be pretty easy to notice that the bad files have something in common [no data, too much data, or maybe some non-printable chars].
Finally, you obviously will have to look at the results after the job is done running.
The problem, of course, with System.out statements is that the jobs running on various machines can't integrate their data. Counters get around this problem - they are easily integrated into a clear and accurate picture of the overall job.
Of course, the problem with counters is that the information content is entirely numeric, but, with a little creativity, you can easily find ways to quantitatively describe the data in a meaningful way.
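Since the question describes a streaming job, here is a rough sketch (in Perl, to match the rest of this page) of how a streaming mapper can bump counters; the group/counter names and the looks_ok() check are made up for illustration. Hadoop Streaming picks up lines of the form reporter:counter:group,counter,amount written to standard error.

use strict;
use warnings;

# Purely illustrative plausibility check for a record.
sub looks_ok {
    my ($line) = @_;
    return length($line) > 1 && $line !~ /[^\x09\x0A\x0D\x20-\x7E]/;
}

while (my $doc = <STDIN>) {
    if (looks_ok($doc)) {
        print STDERR "reporter:counter:FileStats,GoodFiles,1\n";
    }
    else {
        print STDERR "reporter:counter:FileStats,BadFiles,1\n";
    }
    print $doc;    # pass the record through as the mapper's output
}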
WORST CASE SCENARIO: YOU REALLY NEED TEXT DEBUGGING, and you don't want it in a temp file
In this case, you can use MultipleOutputs to write out ancillary files with other data in them. You can emit records to these files in the same way as you would for the part-r-0000* data.
In the end, I think you will find that, ironically, the restriction of having to use counters will increase the readability of your jobs: it is pretty intuitive, once you think about it, to debug using numerical counts rather than raw text - I find, quite often, that much of my debugging print statements, when cut down to their raw information content, are basically just counters...

Perl: performance hit with reading multiple files

I was wondering what is better in this case?
I have to read thousands of files. I was thinking of either opening each file, reading it, and closing it, or cat-ing all the files into one file and reading that.
Suggestions? This is all in Perl.
It shouldn't make that much of a difference. This sounds like premature optimization to me.
If the time taken to cat all the files into one bigger file doesn't matter, it will be faster (but only when reading the file sequentially, which is the default).
Of course, if the concatenation step is taken into account, it'll be much slower overall, because you have to read, write, and then read again.
In general, reading one file of 1000M should be faster than reading 100 files of 10M, because for the 100 files you'll need to look up the metadata for each one.
As tchrist says, the performance difference might not be important. I think it depends on the type of file (e.g. for a huge number of very small files it would differ much more) and on the overall performance of your system and its storage.
Note that cat * can fail if the number of files makes the expanded argument list exceed the shell's argument-length limit. So sequential reading can actually be safer.
Also, consider using opendir and readdir instead of glob if all your files are located in the same dir.
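A minimal sketch of the opendir/readdir approach, with a placeholder directory:

use strict;
use warnings;

my $dir = '/path/to/data';
opendir my $dh, $dir or die "opendir $dir: $!";
my @files = grep { -f "$dir/$_" } readdir $dh;   # skip '.', '..' and subdirectories
closedir $dh;

for my $name (@files) {
    open my $fh, '<', "$dir/$name" or die "open $name: $!";
    while (my $line = <$fh>) {
        # ...process each line...
    }
    close $fh;
}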
Just read the files sequentially. Perl's file I/O functions are pretty thin wrappers around native file I/O calls in the OS, so there isn't much point in fretting over the performance of simple file I/O.

How can I efficiently group a large list of URLs by their host name in Perl?

I have text file that contains over one million URLs. I have to process this file in order to assign URLs to groups, based on host address:
{
    'http://www.ex1.com' => ['http://www.ex1.com/...', 'http://www.ex1.com/...', ...],
    'http://www.ex2.com' => ['http://www.ex2.com/...', 'http://www.ex2.com/...', ...]
}
My current basic solution takes about 600 MB of RAM to do this (size of file is about 300 MB). Could you provide some more efficient ways?
My current solution simply reads line by line, extracts host address by regex and puts the url into a hash.
EDIT
Here is my implementation (I've cut off irrelevant things):
use Storable qw(store);

while ($line = <STDIN>) {
    chomp($line);
    $line =~ /(http:\/\/.+?)(\/|$)/i;   # capture "http://host" in $1
    $host = "$1";
    push @{$urls{$host}}, $line;        # group the full URL under its host
}
store \%urls, 'out.hash';
One approach that you could take is tying your URL hash to a DBM like BerkeleyDB. You can explicitly give it options for how much memory it can use.
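A minimal sketch of the tied-hash idea using DB_File (one of the Berkeley DB bindings; the BerkeleyDB module proper exposes similar knobs, including cache size). Note that plain DBM values must be strings, so the per-host URLs are stored as one tab-joined string here; MLDBM would let you keep real array references. The file name and field handling are assumptions:

use strict;
use warnings;
use DB_File;
use Fcntl qw(O_RDWR O_CREAT);

# Tie the hash to an on-disk DBM file so it doesn't have to fit in RAM.
tie my %urls, 'DB_File', 'urls.db', O_RDWR | O_CREAT, 0644, $DB_HASH
    or die "Cannot tie urls.db: $!";

while (my $line = <STDIN>) {
    chomp $line;
    next unless $line =~ m{^(http://[^/]+)}i;
    my $host = $1;
    $urls{$host} = defined $urls{$host} ? "$urls{$host}\t$line" : $line;
}

untie %urls;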
If you read 600MB from two files and store them in memory (in the hash) there is not much room for optimization in terms of memory use (short of compressing the data, which is probably not a viable option).
But depending on how you are going to use the data in the hash, it might be worth considering storing the data in a database and querying it for the information you need.
EDIT:
Based on the code you have posted, a quick optimization would be to not store the entire line but just the relative URL. After all, you already have the host name as a key in your hash.
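A sketch of that change against the loop from the question (the regex is adjusted slightly and is illustrative only):

if ($line =~ m{^(http://[^/]+)(/.*)?$}i) {
    my ($host, $path) = ($1, defined $2 ? $2 : '/');
    push @{ $urls{$host} }, $path;    # store only the part after the host
}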
Other than by storing your data structures to disk (tied DBM hash as suggested by Leon Timmermans, an SQL database such as SQLite3, etc.), you're not going to be able to reduce memory consumption much. 300M of actual data, plus the perl interpreter, plus the bytecode representation of your program, plus metadata on each of the extracted strings is going to add up to substantially more than 300M of total memory used if you keep it all in memory. If anything, I'm mildly surprised that it's only double the size of the input file.
One other thing to consider is that, if you're going to be processing the same file more than once, storing the parsed data structure on disk means that you'll never have to take the time to re-parse it on future runs of the program.
What exactly are you trying to achieve? If you are going for some complex analysis, storing to a database is a good idea; if the grouping is just an intermediary step, you might just sort the text file and then process it sequentially, directly deriving the results you are looking for.
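A rough sketch of that "sort first, then stream" idea: assuming the input has already been sorted so that URLs from the same host are adjacent (for example with the external sort utility), only one group ever needs to be held in memory at a time:

use strict;
use warnings;

my ($current_host, @group);
while (my $line = <STDIN>) {
    chomp $line;
    next unless $line =~ m{^(http://[^/]+)}i;
    my $host = $1;
    if (defined $current_host && $host ne $current_host) {
        print join("\t", $current_host, @group), "\n";   # emit the finished group
        @group = ();
    }
    $current_host = $host;
    push @group, $line;
}
print join("\t", $current_host, @group), "\n" if defined $current_host;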