Print unique lines of a 10GB file - perl

I have a 10GB file with 200 million lines. I need to get unique lines of this file.
My code:
while (<>) {
    chomp;
    $tmp{$_} = 1;
}
#print...
I only have 2GB memory. How can I solve this problem?

As I commented on David's answer, a database is the way to go, but a nice one might be DBM::Deep since it's pure Perl and easy to install and use; it's essentially a Perl hash tied to a file.
use DBM::Deep;

tie my %lines, 'DBM::Deep', 'data.db';

while (<>) {
    chomp;
    $lines{$_} = 1;
}
This is basically what you already had, but the hash is now a database tied to a file (here data.db) rather than kept in memory.
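For the elided printing step, here is a minimal sketch (assuming the data.db file built above); each() walks the tied hash one key at a time, which avoids pulling every key into memory the way keys() would:
use DBM::Deep;

tie my %lines, 'DBM::Deep', 'data.db';

# Walk the tied database one key at a time instead of loading all keys.
while ( my ( $line, undef ) = each %lines ) {
    print "$line\n";
}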

If you don't care about preserving order, I bet the following is faster than the previously posted solutions (e.g. DBM::Deep):
sort -u file

In most cases, you could store the line as a key in a hash. However, when you get this big, this really isn't very efficient. In this case, you'd be better off using a database.
One thing to try is Berkeley DB (BDB), which used to be included with Unix. It's now owned by Oracle.
Perl can use the BerkeleyDB module to talk with a BDB database. In fact, you can even tie a Perl hash to a BDB database. Once this is done, you can use normal Perl hashes to access and modify the database.
BDB is pretty robust. Bitcoin uses it, and so does SpamAssassin, so it is very possible that it can handle the type of database you have to create in order to find duplicate lines. If you already have BDB installed, writing a program to handle your task shouldn't take that long. If it doesn't work, you wouldn't have wasted too much time with this.
The only other thing I can think of is using an SQL database which would be slower and much more complex.
Addendum
Maybe I'm overthinking this...
I decided to try a simple hash. Here's my program:
#! /usr/bin/env perl

use strict;
use warnings;
use feature qw(say);
use autodie;

use constant DIR       => "/usr/share/dict";
use constant WORD_LIST => qw(words web2a propernames connectives);

my %word_hash;
for my $count (1 .. 100) {
    for my $file (WORD_LIST) {
        open my $file_fh, "<", DIR . "/$file";
        while (my $word = <$file_fh>) {
            chomp $word;
            $word_hash{"$file-$word-$count"} = $word;
        }
    }
}
The files read in contain a total of about 313,000 lines. I do this 100 times to get a hash with 31,300,000 keys in it. It is about as inefficient as it can be. Each and every key will be unique. The amount of memory will be massive. Yet...
It worked. It took about 10 minutes to run despite the massive inefficiencies in the program, and it maxed out at around 6 gigabytes. However, most of that was in virtual memory. Strangely, even though it was running, gobbling memory, and taking 98% of the CPU, my system didn't really slow down all that much. I guess the question really is what type of performance are you expecting? If taking about 10 minutes to run isn't that much of an issue for you, and you don't expect this program to be used that often, then maybe go for simplicity and use a simple hash.
I'm now downloading BDB from Oracle, compiling it, and installing it. I'll try the same program using BDB and see what happens.
Using a BDB Database
After doing the work, I think if you have MySQL installed, using Perl DBI would be easier. I had to:
Download Berkeley DB from Oracle, and you need an Oracle account. I didn't remember my password, and told it to email me. Never got the email. I spent 10 minutes trying to remember my email address.
Once downloaded, it has to be compiled. I found directions for compiling on the Mac and it seemed pretty straightforward.
Running CPAN crashed. It turns out that CPAN was looking for /usr/local/BerkeleyDB and it was installed as /usr/local/BerkeleyDB.5.3. Creating a symlink fixed the issue.
All told, it took about half an hour to get BerkeleyDB installed. Once installed, modifying my program was fairly straightforward:
#! /usr/bin/env perl

use strict;
use warnings;
use feature qw(say);
use autodie;

use BerkeleyDB;

use constant {
    DIR      => "/usr/share/dict",
    BDB_FILE => "bdb_file",
};
use constant WORD_LIST => qw(words web2a propernames connectives);

unlink BDB_FILE if -f BDB_FILE;

our %word_hash;
tie %word_hash, "BerkeleyDB::Hash",
    -Filename => BDB_FILE,
    -Flags    => DB_CREATE
    or die qq(Cannot create BerkeleyDB file ") . BDB_FILE . qq("\n);

for my $count (1 .. 10) {
    for my $file (WORD_LIST) {
        open my $file_fh, "<", DIR . "/$file";
        while (my $word = <$file_fh>) {
            chomp $word;
            $word_hash{"$file-$word-$count"} = $word;
        }
    }
}
All I had to do was add a few lines.
Running the program was a disappointment. It wasn't faster, but much, much slower. It took over 2 minutes while using a pure hash took a mere 13 seconds.
However, it used a lot less memory. While the old program gobbled gigabytes, the BDB version barely used a megabyte. Instead, it created a 20MB database file.
But, in these days of VM and cheap memory, did it accomplish anything? In the old days before virtual memory and good memory handling, a program would crash your computer if it used all of the memory (and memory was measured in megabytes and not gigabytes). Now, if your program wants more memory than is available, it simply is given virtual memory.
So, in the end, using a Berkeley database is not a good solution. Whatever I saved in programming time by using tie was wasted with the installation process. And, it was slow.
Using BDB simply used a disk file instead of memory. A modern OS will do the same, and it is faster. Why do the work when the OS will handle it for you?
The only reason to use a database is if your system really doesn't have the required resources. 200 million lines is a big file, but a modern OS will probably be okay with it. If your system really doesn't have the resources, use a SQL database on another system, and not a BDB database.

You might consider calculating a hash code for each line, and keeping track of (hash, position) mappings. You wouldn't need a complicated hash function (or even a large hash) for this; in fact, "smaller" is better than "more unique", if the primary concern is memory usage. Even a CRC, or summing up the chars' codes, might do. The point isn't to guarantee uniqueness at this stage -- it's just to narrow the candidate matches down from 200 million to a few dozen.
For each line, calculate the hash and see if you already have a mapping. If you do, then for each position that maps to that hash, read the line at that position and see if the lines match. If any of them do, skip that line. If none do, or you don't have any mappings for that hash, remember the (hash, position) and then print the line.
Note that I'm saying "position", not "line number". In order for this to work in less than a year, you'd almost certainly have to be able to seek right to a line rather than finding your way to line #1392499.
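A minimal sketch of this idea in Perl, assuming the input is a regular, seekable file named on the command line; the checksum is deliberately cheap (unpack's %32 feature) and collisions are resolved by seeking back and comparing the actual lines. Note that the checksum-to-positions map itself still has to fit in memory:
#!/usr/bin/perl
# Sketch of the hash-code + file-position idea (assumes the input
# is a regular, seekable file named on the command line).
use strict;
use warnings;

my $file = shift or die "usage: $0 FILE\n";
open my $in,  '<', $file or die "open $file: $!";
open my $chk, '<', $file or die "open $file: $!";   # second handle for look-backs

my %seen;   # cheap checksum => [ byte offsets of lines with that checksum ]

while (1) {
    my $pos  = tell $in;
    my $line = <$in>;
    last unless defined $line;
    chomp $line;

    # unpack's "%32" feature gives a 32-bit checksum: small, fast,
    # and deliberately not unique; collisions are resolved below.
    my $sum = unpack '%32C*', $line;

    my $dup = 0;
    for my $old_pos ( @{ $seen{$sum} || [] } ) {
        seek $chk, $old_pos, 0 or die "seek: $!";
        my $old = <$chk>;
        chomp $old;
        if ( $old eq $line ) { $dup = 1; last }
    }
    next if $dup;

    push @{ $seen{$sum} }, $pos;
    print "$line\n";
}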

If you don't care about time/IO constraints, nor disk constraints (e.g. you have 10 more GB space), you can do the following dumb algorithm:
1) Read the file (which sounds like it has 50 character lines). While scanning it, remember the longest line length $L.
2) Analyze the first 3 characters (if you know char #1 is identical - say "[" - analyze the 3 characters at a position N that is likely to be more diverse).
3) For each line whose 3 characters are $XYZ, append that line to the file 3char.$XYZ and keep a count of how many lines are in that file in a hash.
4) When your entire file is split up that way, you should have a whole bunch (if the files are A-Z only, then 26^3) of smaller files, and at most 4 files that are >2GB each.
5) Move the original file into "Processed" directory.
6) For each of the large files (>2GB), choose the next 3 character positions, and repeat steps #1-#5, with new files being 6char.$XYZABC
7) Lather, rinse, repeat. You will eventually end up with one of 2 outcomes:
8a) A bunch of smaller files each of which is under 2GB, all of which have mutually different strings, and each (due to its size) can be processed individually by standard "stash into a hash" solution in your question.
8b) Or, most of the files are smaller, but you have exhausted all $L characters while repeating step 7 for the >2GB files, and you still have between 1 and 4 large files. Guess what - since those up-to-4 large files have identical characters within a file in positions 1..$L, they can ALSO be processed using the "stash into a hash" method in your question, since they are not going to contain more than a few distinct lines despite their size!
Please note that this may require - at the worst possible distributions - 10GB * L / 3 disk space, but will ONLY require 20GB disk space if you change step #5 from "move" to "delete".
Voila. Done.
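A minimal sketch of one splitting pass (steps 2-4 above); the character position and the file-handle cache are assumptions, and with many buckets you may need to raise the open-file limit or close handles as you go:
#!/usr/bin/perl
# One splitting pass from the recipe above: bucket lines by the
# 3 characters at position $pos into files named 3char.XYZ.
use strict;
use warnings;

my $pos = 0;            # which 3-character slice to bucket on (assumption)
my %fh;                 # bucket name => open file handle
my %count;              # bucket name => number of lines written

while ( my $line = <> ) {
    my $key = substr $line, $pos, 3;
    $key =~ s/[^A-Za-z0-9]/_/g;          # keep bucket names filesystem-safe

    unless ( $fh{$key} ) {
        open $fh{$key}, '>>', "3char.$key"
            or die "open 3char.$key: $!";
    }
    print { $fh{$key} } $line;
    $count{$key}++;
}

# Report bucket sizes so oversized buckets (>2GB) can be re-split.
printf "%-10s %d lines\n", $_, $count{$_} for sort keys %count;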
As an alternate approach, consider hashing your lines. I'm not a hashing expert, but you should be able to compress a line into a hash that's a fraction of the line's size, IMHO.
If you want to be fancy about this, you will do a frequency analysis on character sequences on the first pass, and then do compression/encoding this way.

If you have multiple processors/cores, at least 15GB of free space, and fast enough storage, you could try this out. It will process the file in parallel.
split --lines=100000 -d -a 4 input.file
printf '%s\n' x* | xargs -n 1 -P10 -I SPLITTED_FILE sh -c 'sort -u SPLITTED_FILE > unique.SPLITTED_FILE'
sort -m -u unique.* > output.file
rm unique.* x*
The final sort -m -u merges the already-sorted chunk outputs and removes duplicates that landed in different split files; a plain cat would leave those in.

You could break your file into ten 1GB files, then read in one file at a time, sort its lines, and write it back out. Then open all ten files and merge them back into one file (making sure you merge them in the correct order). Next, open an output file to save the unique lines. Read the merged file one line at a time, keeping the last line for comparison. If the last line and the current line do not match, write out the last line and save the current line as the last line for comparison. Otherwise, get the next line from the merged file. That will give you a file containing all of the unique lines.
It may take a while to do this, but if you are limited on memory, then breaking the file up and working on parts of it will work.
It may be possible to do the comparison when writing out the file, but that would be a bit more complicated.
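A minimal sketch of the final dedup pass described above, assuming the merged, sorted file is called merged.txt; since duplicates are adjacent in sorted order, remembering only the previous line is enough:
#!/usr/bin/perl
# Final step of the sort-then-merge approach: the merged file is
# already sorted, so duplicates sit next to each other and a one-line
# memory of the previous line is enough (file names are assumptions).
use strict;
use warnings;

open my $in,  '<', 'merged.txt' or die "open merged.txt: $!";
open my $out, '>', 'unique.txt' or die "open unique.txt: $!";

my $last;
while ( my $line = <$in> ) {
    if ( !defined $last or $line ne $last ) {
        print {$out} $line;
        $last = $line;
    }
}
close $out or die "close unique.txt: $!";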

Why use Perl for this at all? POSIX shell:
sort | uniq
done, let's go drink beers.

Related

Perl : Parallel or multithread or Bloom-Faster or fork to fill hash from a file of 500 million lines

What is the best way to make this script run faster, e.g. with parallel runs?
#!/usr/bin/perl
use warnings;
use strict;
use threads;

open( R1, "<$ARGV[0]" ) || die "problem in opening $ARGV[0]: $!\n";
my %dict1 : shared;
my $i = 0;
while ( my $l = <R1> ) {
    chomp($l);
    $l =~ s/\s$//;
    $l =~ s/^\s//;
    if ( $l =~ /(.*)\s(.*)/ ) {
        $i++;
        #print $1, "\n";
        #my $t = threads->create($dict1{$1}++);
        $dict1{$1}++;
    }
}
print $i, "\n";
close R1;
You can make an array of $N elements which correspond to equal parts of the file:
my $N = 6;
my $step = int( $file_size/$N );
my @arr = map { ($_-1) * $step } 1 .. $N;
Then correct these numbers by seeking to those file positions (perldoc -f seek), reading the rest of the line (perldoc -f readline), and taking the corrected file position (perldoc -f tell).
Start $N threads where each already knows where to start its extracting job, and join their results at the end. However, you may find that reading from the media is the actual bottleneck, as @ikegami already pointed out.
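A minimal sketch of that boundary correction, assuming the file name is passed on the command line and $N = 6 as above:
#!/usr/bin/perl
# Turn N rough byte offsets into offsets that sit on line boundaries,
# as described above ($file and $N are assumptions for the sketch).
use strict;
use warnings;

my $file = shift or die "usage: $0 FILE\n";
my $N    = 6;

my $file_size = -s $file;
my $step      = int( $file_size / $N );
my @start     = map { ( $_ - 1 ) * $step } 1 .. $N;

open my $fh, '<', $file or die "open $file: $!";
for my $i ( 1 .. $#start ) {
    seek $fh, $start[$i], 0 or die "seek: $!";   # jump near the boundary
    scalar readline $fh;                         # discard the partial line
    $start[$i] = tell $fh;                       # real start of the chunk
}

# Each worker would process bytes from $start[$i] up to $start[$i+1];
# the last chunk runs to end of file.
print join( ", ", @start ), "\n";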
The most likely case is that you're being limited by the speed that your data can be read from the disk ("I/O bound"), not by processing time ("CPU bound"). If that's the case, then there is nothing you can do with threads or parallel execution to speed this up - if it has any effect at all, parallelization would slow you down by forcing the disk to jump back and forth between the parts of the file being read by the different processes/threads.
An easy way to test whether this is the case would be to open a shell and run the command cat my_data_file > /dev/null. This should tell you how long it takes to just read the file from disk, without actually doing anything to it. If it's roughly the same as the time it takes to run your program on my_data_file, then don't bother trying to optimize or speed it up. You can't. There are only two ways to improve performance in that situation:
Change the way your code works so that you don't need to read the entire file. If you're dealing with something that will run multiple times, indexing records in the file or using a database may help, but that doesn't do any good if it's a one-shot operation (since you'd still need to read the whole file once to create the index/database).
Use faster storage media.
If you're not I/O bound, the next most likely case is that you're memory bound - the data won't all fit in memory at once, causing the disk to thrash as it moves chunks of data into and out of virtual memory. Once again, parallelizing the process would make things worse, not better.
The solutions in this case are similar to before:
Change what you're doing so that you don't need all the data in memory at once. In this case, indexing or a database would probably be beneficial even for a one-off operation.
Buy more RAM.
Unless you're doing much heavier processing of the data than just the few regexes and stuffing it in a hash that you've shown, though, you're definitely not CPU bound and parallelization will not provide any benefit.

What is the fastest way to 'print' to file in perl?

I've been writing output from perl scripts to files for some time using code as below:
open( OUTPUT, ">:utf8", $output_file ) or die "Can't write new file: $!";
print OUTPUT "First line I want printed\n";
print OUTPUT "Another line I want printing\n";
close(OUTPUT);
This works, and is faster than my initial approach which used "say" instead of print (thank you NYTProf for enlightening me to that!)
However, my current script is looping over hundreds of thousands of lines and is taking many hours to run using this method and NYTProf is pointing the finger at my thousands of 'print' commands. So, the question is... Is there a faster way of doing this?
Other Info that's possibly relevant...
Perl Version: 5.14.2 (On Ubuntu)
Background of the script in question...
A number of '|' delimited flat files are being read into hashes; each file has some sort of primary key matching entries from one to another. I'm manipulating this data and then combining them into one file for import into another system.
The output file is around 3 Million lines, and the program starts to noticeably slow down after writing around 30,000 lines to said file. (A little reading around seemed to point towards running out of write buffer in other languages but I couldn't find anything about this with regard to perl?)
EDIT: I've now tried adding the line below, just after the open() statement, to disable print buffering, but the program still slows around the 30,000th line.
OUTPUT->autoflush(1);
I think you need to redesign the algorithm your program uses. File output speed isn't influenced by the amount of data that has been output, and it is far more likely that your program is reading and processing data but not releasing it.
Check the amount of memory used by your process to see if it increases inexorably
Beware of for (<$filehandle>) loops, which read whole files into memory at once (see the sketch after these points)
As I said in my comment, disable the relevant print statements to see how performance changes
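A minimal illustration of the for-versus-while point above (the file name is just a placeholder):
# for() slurps the whole file into a list before the loop starts,
# while() reads one line at a time.
use strict;
use warnings;

open my $fh, '<', 'big_file.txt' or die "open big_file.txt: $!";

# Memory-hungry: builds a list of every line first.
# for my $line (<$fh>) { ... }

# Memory-friendly: one line in memory at a time.
while ( my $line = <$fh> ) {
    # process $line ...
}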
Have you tried concatenating all the individual print calls into a single scalar and then printing that scalar all at once? I have a script that outputs an average of 20 lines of text for each input line. Using individual print statements, even sending the output to /dev/null, took a long time. But when I packed all the output (for a single input line) together, using things like:
$output .= "...";
$output .= sprintf("%s...", $var);
Then, just before leaving the line-processing subroutine, I print $output, emitting all the lines at once. The number of calls to print went from ~7.7M to about 386K - equal to the number of lines in the input data file. This shaved about 10% off of my total execution time.
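A minimal sketch of that batching idea (file names and the generated lines are placeholders, not the original script):
#!/usr/bin/perl
# Build one scalar per input record and call print once,
# instead of many separate prints.
use strict;
use warnings;

open my $in,  '<',      'input.txt'  or die "open input.txt: $!";
open my $out, '>:utf8', 'output.txt' or die "open output.txt: $!";

while ( my $line = <$in> ) {
    chomp $line;
    my $output = '';
    $output .= "First derived line for: $line\n";
    $output .= sprintf "Some formatted value: %s\n", uc $line;
    $output .= "---\n";
    print {$out} $output;      # one print per input record
}
close $out or die "close output.txt: $!";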

How do I properly format plain text data for a simple Perl dictionary app?

I have a very simple dictionary application that does search and display. It's built with the Win32::GUI module. I put all the plain text data needed for the dictionary under the __DATA__ section. The script itself is very small, but with everything under the __DATA__ section, its size reaches 30 MB. In order to share the work with my friends, I've packed the script into a stand-alone executable using the PP utility of the PAR::Packer module at the highest compression level (9), and now I have a single-file dictionary app of about 17MB.
But although I'm very comfortable with the idea of a single-file script, placing such a huge amount of text data under the script's DATA section does not feel right. For one thing, when I try opening the script in Padre (Notepad++ is okay), I receive an error like:
Can't open my script as the script is over the arbitrary file size limit which is currently 500000.
My questions:
Does it bring me any extra benefits except for the eliminating of Padre's file opening issue if I move everything under the DATA section to a separate text file?
If I do so, What should I do to reduce the size of the separate file? Zip it and uncompress it while doing search and display?
How do people normally format the text data needed for a dictionary application?
Any comments, ideas or suggestions? Thanks like always :)
If I do so, What should I do to reduce the size of the separate file? Zip it and uncompress it while doing search and display?
Well, it depends on WHY you want to reduce the size. If it is to minimize disk space usage (rather weird goal most of the time these days), then the zip/unzip is the way to go.
However if the goal is to minimize memory usage, then a better approach is to split up the dictionary data into smaller chunks (for example indexed by a first letter), and only load needed chunks.
How do people normally format the text data needed for a dictionary application?
IMHO the usual approach is what you get as the logical end of an approach mentioned above (partitioned and indexed data): using a back-end database, which allows you to only retrieve the data which is actually needed.
In your case, probably something simple like SQLite or Berkeley DB/DBM files should be OK.
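A minimal sketch of the SQLite route using DBI and DBD::SQLite; the table layout and file name are assumptions, not taken from the original app:
#!/usr/bin/perl
# Rough sketch of the SQLite route via DBI/DBD::SQLite.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( "dbi:SQLite:dbname=dict.db", "", "",
    { RaiseError => 1, AutoCommit => 1 } );

$dbh->do(q{
    CREATE TABLE IF NOT EXISTS entries (
        headword   TEXT PRIMARY KEY,
        definition TEXT NOT NULL
    )
});

# Look up only the entry that is needed, instead of holding the
# whole dictionary in the script's DATA section.
my $sth = $dbh->prepare("SELECT definition FROM entries WHERE headword = ?");
$sth->execute("example");
while ( my ($definition) = $sth->fetchrow_array ) {
    print "$definition\n";
}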
Does it bring me any extra benefits except for the eliminating of Padre's file opening issue if I move everything under the DATA section to a separate text file?
This depends somewhat on your usage... if it's a never-changing script used by 3 people, there may be no tangible benefits.
In general, it will make maintenance much easier (you can change the dictionary and the code logic independently - think virus definitions file vs. antivirus executable for real world example).
It will also decrease the process memory consumption if you go with the approaches I mentioned above.
Since you are using PAR::Packer already, why not move it to a separate file or module and include it in the PAR file?
The easy way (no extra commandline options to pp, it will see the use statement and do the right thing):
words.pl
#!/usr/bin/perl

use strict;
use warnings;

use Words;

for my $i (1 .. 2) {
    print "Run $i\n";
    while (defined(my $word = Words->next_word)) {
        print "\t$word\n";
    }
}
Words.pm
package Words;

use strict;
use warnings;

my $start = tell DATA
    or die "could not find current position: $!";

sub next_word {
    if (eof DATA) {
        seek DATA, $start, 0
            or die "could not seek: $!";
        return undef;
    }
    chomp(my $word = scalar <DATA>);
    return $word;
}

1;

__DATA__
a
b
c

Parallel computing in Perl

I want to parse through an 8 GB file to find some information. This is taking me more than 4 hours to finish. I have gone through the Parallel::ForkManager module for this, but it doesn't make much difference. What is a better way to implement this?
The following is the part of the code used to parse this jumbo file. I actually have a list of domains which I have to look up in an 8 GB zone file to find out which company each is hosted with.
unless ( open( FH, $file ) ) {
    print $LOG "Can't open '$file' $!";
    die "Can't open '$file' $!";
}

### Reading Zone file : $file
DOMAIN: while ( my $line = <FH> ) {

    # domain and the dns with whom it is currently hosted
    my ( $domain, undef, $new_host ) = split( /\s|\t/, $line );

    next if $seen{$domain};
    $seen{$domain} = 1;

    $domain .= ".$domain_type";
    $domain = lc($domain);

    # already in?
    if ( $moved_domains->{$domain} ) {

        # get the next domain if this one is on the same host; there is nothing to record
        if ( $new_host eq $moved_domains->{$domain}->{PointingHost} ) {
            next DOMAIN;
        }

        # moved out
        else {
            @INSERTS = ( $domain, $data_date, $new_host, $moved_domains->{$domain}->{Host} );
            log_this( $data_date, $populate, @INSERTS );
        }
        delete $moved_domains->{$domain};
    }

    # new to MovedDomain
    else {

        # is this any of our interested HOSTS
        my ($interested) = grep { $new_host =~ /\b$_\b/i } keys %HOST;

        # if not any of our interested DNS, NEXT!
        next DOMAIN if not $interested;
        @INSERTS = ( $domain, $data_date, $new_host, $HOST{$interested} );
        log_this( $data_date, $populate, @INSERTS );
    }
    next DOMAIN;
}
A basic line-by-line parsing pass through a 1GB file -- for example, running a regex or something -- takes just a couple of minutes on my 5-year-old Windows box. Even if the parsing work is more extensive, 4 hours sounds like an awfully long time for 8GB of data.
Are you sure that your code does not have a glaring inefficiency? Are you storing a lot of information during the parsing and bumping up against your RAM limits? CPAN has tools that will allow you to profile your code, notably Devel::NYTProf.
Before going through the hassle of parallelizing your code, make sure that you understand where the bottleneck is. If you explain what you are doing or, even better, provide code that illustrates the problem in a compact way, you might get better answers.
With the little information you've given: Parallel::ForkManager sounds like an appropriate tool. But you're likely to get better help if you give more detail about your problem.
Parallelizing is always a difficult problem. How much you can hope to gain depends a lot on the nature of the task. For example, are you looking for a specific line in the file? Or a specific fixed-size record? Or all the chunks that match a particular bit pattern? Do you process the file from beginning to end, or can you skip some parts, or do you do a lot of shuffling back and forth? etc.
Also is the 8GB file an absolute constraint, or might you be able to reorganize the data to make the information easier to find?
With the speeds you're giving, if you're just going through the file once, I/O is not the bottleneck, but it's close. It could be the bottleneck if other processes are accessing the disk at the same time. It may be worth fine-tuning your disk access patterns; this would be somewhat OS- and filesystem-dependent.

Fast algorithm to check membership of a pair of numbers in large (x,y) coordinates in Perl

I have a list of sorted coordinates (let's call it xycord.txt) that looks like this:
chr1 10003486 10043713
chr1 10003507 10043106
chr2 10003486 10043713
chr2 10003507 10043162
chr2 10003532 10042759
In reality this file is very large, with 10^7 lines.
What I want to do is, given another two-point coordinate pair, check whether it
falls between any of the coordinates in the xycord.txt file.
My current approach is super slow, because
there are also many other two-point coordinate pairs to check against this large xycord.txt file.
Is there a fast way to do it?
#!/usr/bin/perl -w

my $point_to_check_x = $ARGV[0] || '10003488';
my $point_to_check_y = $ARGV[1] || '10003489';
my $chrid            = $ARGV[2] || "chr1";

my %allxycordwithchr;

# skip file opening construct
while (<XYCORD_FILE>) {
    my ( $chr, $tx, $ty ) = split( /\s+/, $_ );
    push @{ $allxycordwithchr{$chr} }, $tx . "-" . $ty;
}

my @chosenchr_cord = @{ $allxycordwithchr{$chrid} };

for my $chro_cord (@chosenchr_cord) {
    my ( $repox, $repoy ) = split( "-", $chro_cord );
    my $stat = is_in_xycoordsfile( $repox, $repoy, $point_to_check_x, $point_to_check_y );
    if ( $stat eq "IN" ) {
        print "IN\n";
    }
}

sub is_in_xycoordsfile {
    my ( $x, $y, $xp, $yp ) = @_;
    if ( $xp >= $x && $yp <= $y ) {
        return "IN";
    }
    else {
        return "OUT";
    }
}
Update: I apologize for the correction; in my earlier posting I oversimplified the problem.
Actually, there is one more query field (e.g. chromosome name).
Hence the DB/RB-tree/SQL approaches may be infeasible in this case?
A few suggestions:
You could store your data in a database, such as MySQL or SQLite. You could then use a simple request such as:
"SELECT * FROM coordinates WHERE x<"+xp+" AND y>"+yp
Provided you have indexes on x and y, this should be super fast.
You could also take a look at R-Trees. I used R-trees a few years ago to store tens of thousands of city coordinates, and I could find the closest city from a given point in a fraction of a second. In your example, you are storing 1D ranges but I'm pretty sure R-trees would work fine too. You might find R-tree implementations for Perl here. Or you can use RectanglesContainingDot, which seems to do what you need.
You could cache the coordinates in memory: each number looks like it would take 4 bytes to store, so this would lead to about 80 MB of memory usage if you have 10^7 couples of numbers. That's what firefox uses on my machine! Of course if you do this, you need to have some sort of daemon running in order to avoid reloading the whole file every time you need to check coordinates.
You can mix solutions 2 & 3.
My preference is for solution 1: it has a good efficiency/complexity ratio.
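A minimal sketch of suggestion 1 using DBI with SQLite; the schema, file names, and the chr column (added per the question's update) are assumptions:
#!/usr/bin/perl
# Indexed range query with placeholders via DBI + DBD::SQLite.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( "dbi:SQLite:dbname=coords.db", "", "",
    { RaiseError => 1 } );

$dbh->do("CREATE TABLE IF NOT EXISTS coordinates (chr TEXT, x INTEGER, y INTEGER)");
$dbh->do("CREATE INDEX IF NOT EXISTS idx_chr_x_y ON coordinates (chr, x, y)");

# Is ($xp, $yp) inside any stored interval for this chromosome?
my ( $chrid, $xp, $yp ) = ( 'chr1', 10003488, 10003489 );
my $sth = $dbh->prepare(
    "SELECT 1 FROM coordinates WHERE chr = ? AND x <= ? AND y >= ? LIMIT 1"
);
$sth->execute( $chrid, $xp, $yp );
print $sth->fetchrow_array ? "IN\n" : "OUT\n";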
In addition to Udi Pasmon's good advice, you could also convert your large file to a DBM and then tie the DBM file to a hash for easy lookups.
Convert the file:
#!/usr/bin/perl

use strict;
use warnings;

use DB_File;

my $dbfile = "coords.db";

tie my %coords, "DB_File", $dbfile
    or die "could not open $dbfile: $!";

while (<>) {
    my ($x, $y) = split;
    $coords{"$x-$y"} = 1;
}
Check to see if arguments are members of the file:
#!/usr/bin/perl

use strict;
use warnings;

use DB_File;

my ($x, $y) = @ARGV;

tie my %coords, "DB_File", "coords.db"
    or die "could not open coords.db: $!";

print "IN\n" if $coords{"$x-$y"};
Try a binary search rather than a sequential search. There are two apparent options to do this:
Split the files to smaller files (xycord001.txt, xycord002.txt and so on). Now you can easily determine in which file to search, and the search is rather faster. The big con here is that if you need to add data to a file it might get messy.
Make a binary search over the file: Start at the middle, splitting the file into two logical parts. Decide in which part your coordinates might be, and look at the middle of that part. You'll rapidly (exponentially) reduce the size of the file you're looking in, until you're searching only one line. Read more about seeking into files; there is a Perl example about binary searching a file here.
EDIT: Generally, using a database or DB file is preferred; however, a binary file search is the quick-and-dirty way, especially if the script should run on different files on different machines (thanks @MiniQuark, @Chas. Owens)
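If you don't want to hand-roll the seek logic, Perl's core Search::Dict module implements exactly this kind of seek-based binary search over a sorted file. A minimal sketch, assuming the file has been sorted on its leading field:
#!/usr/bin/perl
# Search::Dict's look() positions the handle at the first line >= $key
# by binary-searching with seek (assumes the file is sorted on the
# leading field; file name and key are placeholders).
use strict;
use warnings;
use Search::Dict;

my ( $file, $key ) = ( 'xycord.sorted.txt', 'chr2' );

open my $fh, '<', $file or die "open $file: $!";
look( $fh, $key );                       # binary search by seeking

# Read forward from that position and use the lines that still match.
while ( my $line = <$fh> ) {
    last unless $line =~ /^chr2\b/;      # stop once past the wanted key
    print $line;
}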
If both inputs, or at least the large one, are sorted, you can try a variation of a merge-join between them.
If the lookup file (smaller file) isn't too large, then easiest is to just read it in, put it in a hash keyed by the name with sorted arrays of start-end pairs for value.
Then go through each row in the large file and look up the array of lookup values that could match it by its name. Go through each pair in the lookup array: if the lookup start is less than the input pair's start, discard that value, as it can no longer match anything. If the lookup start is past the input end, break the loop, as no further lookup values can match. If the lookup end is before the input end, you have a match, and you can add the input and lookup to the list of matches.
My Perl is rusty, so no Perl example code, but I threw together a quick and dirty Python implementation. On my arbitrary randomly generated dataset matching 10M rows to 10k lookup rows for 14k matches took 22s, matching to 100k lookups for 145k matches took 24s and matching to 1M lookups for 1.47M matches took 35s.
If the smaller file is too big to fit in memory at once, it can be loaded in batches of keys as the keys are encountered in the input file.
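For reference, a rough Perl sketch of the simple variant described above (small query file held in memory as name => sorted start-end pairs, large file streamed once); file names and formats are assumptions:
#!/usr/bin/perl
# Small query file in memory, big file streamed once.
use strict;
use warnings;

# queries.txt: "chr start end" -- the intervals we want to classify
open my $q, '<', 'queries.txt' or die "open queries.txt: $!";
my %lookup;                       # chr => [ [start, end], ... ] sorted by start
while (<$q>) {
    my ( $chr, $x, $y ) = split;
    push @{ $lookup{$chr} }, [ $x, $y ];
}
@$_ = sort { $a->[0] <=> $b->[0] } @$_ for values %lookup;

# xycord.txt: the big file of "chr start end" intervals
open my $big, '<', 'xycord.txt' or die "open xycord.txt: $!";
while (<$big>) {
    my ( $chr, $tx, $ty ) = split;
    my $queries = $lookup{$chr} or next;
    for my $pair (@$queries) {
        my ( $qx, $qy ) = @$pair;
        last if $qx > $ty;                    # sorted: nothing later can fit
        next if $qx < $tx;                    # starts before this interval
        print "IN: $qx-$qy within $chr $tx-$ty\n" if $qy <= $ty;
    }
}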
Restating your question, do you want to print all ranges in a file that contain the (x, y) pair and also have the same id? If that's the case, you don't need to parse the file and store it in memory.
while (<DATA>) {
    my ($chr, $tx, $ty) = split /\s+/;
    print "IN : $chr, $tx, $ty\n"
        if $chr eq $chrid
        && $point_to_check_x >= $tx
        && $point_to_check_y <= $ty;
}
OK, so let me clarify the problem, based on my understanding of your code. You have a file with a very large number of entries in it. Each entry includes a label "chr1", "chr2", etc. and two numbers, the first of which is less than the second. You then have a query which comprises a label and a number, and you wish to know whether there is a record in the large file which has the same label as the query, and has the two values such that one is less than the query number and the other is greater than it. Essentially, whether the number in the query is within the interval specified by the two numbers in the record.
Provided my understanding is correct, the first thing to notice is that any of the records which do not have the same label as the query have no part to play in the problem. So you can ignore them. Your program reads them all in, puts them in a hash and then doesn't look at most of the data. Sure, if you have several queries to do, you're going to need to keep data for each of the labels you're interested in, but you can throw the rest away as early as possible. This will keep your memory requirement down.
I would be tempted to take this a little further. Is there a possibility of breaking the huge file up into smaller files? It would seem to be a good idea to break it into files which have only certain labels in them. For example, you might have one file per label, or one file for all data with labels beginning with "a", or so on. This way you can open only those files which you know you're going to be interested in, and save yourself a lot of time and effort.
These changes alone may make enough difference for you. If you still need more performance, I would start thinking about how you are storing the records you're going to need. Storing them ordered on the lower (or higher) of the two numbers should cut down a bit the time taken to find what you're after, particularly if you store them in a binary search tree or a trie.
That should give you enough to work on.
PDL for genomic data processing
We processed a lot of files in the same format as you show in your question and found that PDL (documentation) is a very good tool for doing this. You need some time to learn it --- but it's definitely worth the effort (if you do genomics data processing): PDL can process huge files a few thousand times faster than MySQL.
Here are some hints where to go:
First of all, PDL is a language somewhat like Matlab --- but fully integrated with Perl. Read the documentation, do some examples. Consult a mathematician for advice on which features to use for what purpose.
PDL stores its data in plain C arrays. Learn about Inline::C and access this data directly from C if PDL doesn't do the job for you. To me, PDL and Inline::C seem like a perfect match: PDL for high-level operations; Inline::C for anything missing. Still, PDL is as fast as your best C because it does its work in C.
use PDL::IO::FastRaw --- store and access data in files on disk. I often write the files "by hand" (see below) and read them as memory mapped files (using PDL::IO::FastRaw::mapfraw, often with the flag ReadOnly=>1). This is the most efficient way to read data in Linux from the disk.
The format of the data files is trivial: just a sequence of C numbers. You can easily write such files in Perl with 'print FileHandle pack "i*", @data;'. Check 'perldoc -f pack'.
In my experience, just reading the input files line by line and printing them out in binary format is the slowest part of processing: Once you have them ready for PDL to 'mmap' them, processing is much faster.
I hope this advice helps --- even though not much code is given.
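As a small illustration of the "plain sequence of C numbers" file format mentioned in the hints above, here is a minimal sketch of writing integers with pack "i*" and reading them back (the file name is a placeholder):
#!/usr/bin/perl
# Write integers as raw C ints with pack, read them back with unpack.
use strict;
use warnings;

my @data = ( 10003486, 10043713, 10003507, 10043106 );

open my $out, '>:raw', 'coords.bin' or die "open coords.bin: $!";
print {$out} pack 'i*', @data;
close $out or die "close coords.bin: $!";

open my $in, '<:raw', 'coords.bin' or die "open coords.bin: $!";
my $bytes = do { local $/; <$in> };    # slurp the raw bytes
my @back  = unpack 'i*', $bytes;
print "@back\n";                       # same numbers, ready for PDL/mmap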