Perl: Jumping to lines in a huge text file

I have a very large text file (~4 GB).
It has the following structure:
S=1
3 lines of metadata of block where S=1
a number of lines of data of this block
S=2
3 lines of metadata of block where S=2
a number of lines of data of this block
S=4
3 lines of metadata of block where S=4
a number of lines of data of this block
etc.
I am writing a Perl program that reads in another file and, for each line of that file (each line contains a number), searches the huge file for the S-value equal to that number minus 1, and then analyzes the lines of data in the block belonging to that S-value.
The problem is that the text file is HUGE, so processing each line with a
foreach $line (...) { ... } loop
is very slow. As the S-values are strictly increasing, are there any methods to jump to a particular line of the required S-value?

are there any methods to jump to a particular line of the required S-value?
Yes, if the file does not change then create an index. This requires reading the file in its entirety once and noting the positions of all the S=# lines using tell. Store it in a DBM file with the key being the number and the value being the byte position in the file. Then you can use seek to jump to that point in the file and read from there.
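For illustration, a minimal sketch of that index approach (untested; it uses the core SDBM_File module as the DBM store, and assumes every block starts with a line matching /^S=\d+/ as in the sample):
use strict;
use warnings;
use Fcntl qw(:DEFAULT :seek);
use SDBM_File;

my ($big_file, $index_file) = ('huge.txt', 'huge_index');

# Pass 1: record the byte offset of every "S=<number>" line.
tie my %index, 'SDBM_File', $index_file, O_RDWR|O_CREAT, 0666
    or die "Can't tie $index_file: $!";
open my $fh, '<', $big_file or die "Can't open $big_file: $!";
my $pos = tell $fh;
while (my $line = <$fh>) {
    $index{$1} = $pos if $line =~ /^S=(\d+)/;
    $pos = tell $fh;              # offset of the next line
}

# Later: jump straight to one block.
my $want = 42;                    # hypothetical target S-value
if (defined(my $offset = $index{$want})) {
    seek $fh, $offset, SEEK_SET;
    my $header = <$fh>;           # the "S=42" line itself
    while (my $line = <$fh>) {
        last if $line =~ /^S=\d+/;    # reached the next block
        # ... analyze this line of the block ...
    }
}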
But if you're going to do that, you're better off exporting the data into a proper database such as SQLite. Write a program to insert the data into the database and add normal SQL indexes; this will probably be simpler than writing your own index. Then you can query the data efficiently using normal SQL, and make complex queries. If the file changes you can either redo the export, or use the normal INSERT and UPDATE SQL statements to update the database. And it will be easy for anyone who knows SQL to work with, as opposed to a bunch of custom indexing and search code.
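For example, a loader along those lines might look like this (an untested sketch; the database name, table layout, and the assumption that each block is small enough to store as one TEXT column are all mine):
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=blocks.db', '', '',
                       { RaiseError => 1, AutoCommit => 0 });
$dbh->do('CREATE TABLE IF NOT EXISTS blocks (s INTEGER PRIMARY KEY, body TEXT)');

my $ins = $dbh->prepare('INSERT INTO blocks (s, body) VALUES (?, ?)');
open my $fh, '<', 'huge.txt' or die $!;
my ($s, $body);
while (my $line = <$fh>) {
    if ($line =~ /^S=(\d+)/) {
        $ins->execute($s, $body) if defined $s;   # flush the previous block
        ($s, $body) = ($1, '');
    }
    else {
        $body .= $line;
    }
}
$ins->execute($s, $body) if defined $s;           # flush the last block
$dbh->commit;

# A lookup is then a single indexed query:
my ($text) = $dbh->selectrow_array(
    'SELECT body FROM blocks WHERE s = ?', undef, 41);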

I know the OP has already accepted an answer, but a method that's served me well is to slurp the file into an array after changing the "record separator" ($/).
If you do something like this (not tested, but this should be close):
$/ = "S=";
my #records=<fh>;
print $records[4];
The output should be the entire fifth record (the array starts at 0, but your data starts at 1), starting with the record number (5) on a line by itself (you might need to strip that out later), followed by all the remaining lines in that record.
It's very simple and fast, although it is a memory pig...

If the blocks of text are of the same length (in bytes or characters) you can calculate the position of the needed S-value in the file and seek there, then read. Otherwise, in principle you need to read lines to find the S value.
However, if there are only a few S-values to find you can estimate the needed position and seek there, then read enough to capture an S-value. Then analyze what you read to see how far off you are, and either seek again or read lines with <> to get to the S-value.
use warnings;
use strict;
use feature 'say';
use Fcntl qw(:seek);

my ($file, $s_target) = @ARGV;
die "Usage: $0 filename\n" if not $file or not -f $file;
$s_target //= 5;    # default: S=5

open my $fh, '<', $file or die $!;

my $est_text_len = 1024;
my $jump_by = $est_text_len * $s_target;    # to seek forward in file
my ($buff, $found);

seek $fh, $jump_by, SEEK_CUR;    # get in the vicinity

while (1) {
    my $rd = read $fh, $buff, $est_text_len;
    warn "error reading: $!" if not defined $rd;
    last if $rd == 0;

    while ($buff =~ /S=([0-9]+)/g) {
        my $s_val = $1;
        # Analyze $s_val and $buff:
        # (1) if overshot $s_target, adjust $jump_by and seek back
        # (2) if in front of $s_target, read with <> to get to it
        # (3) if $s_target is in $buff, extract the needed text
        if ($s_val == $s_target) {
            say "--> Found S=$s_val at pos ", pos $buff, " in buffer";

            seek $fh, - $est_text_len + pos($buff) + 1, SEEK_CUR;
            while (<$fh>) {
                last if /S=[0-9]+/;    # next block
                print $_;
            }
            $found = 1;
            last;
        }
    }
    last if $found;
}
Tested with your sample, enlarged and cleaned up (I changed the literal S=n inside the data lines, since it would otherwise match the search condition), with $est_text_len and $jump_by set to 100 and 20.
This is a sketch. A full implementation needs to handle over- and under-shooting the target, as outlined in the comments in the code. If text-block sizes don't vary much it can get in front of the needed S-value in a couple of seek-and-read operations, and then read with <> or use a regex as in the example.
Some comments
The "analysis" sketched above needs to be done carefully. For one, a buffer may contain multiple S-value lines. Also note that the code keeps reading if an S-value isn't in the buffer.
Once you are close enough and in front of $s_target, read lines with <> to get to it.
A read may return fewer bytes than requested, so that call should really go in a loop; see the sketch after these notes.
Change read to sysread for efficiency. In that case use sysseek, and don't mix it with <> (which is buffered).
The code above presumes one S-value to find; adjust for more. It absolutely assumes that the S-values are sorted.
This is clearly far more complex than reading lines, but it does run much faster with a very large file and only a few S-values to find. If there are many values then this may not help.
The foreach (<$fh>) indicated in the question would cause the whole file to be read first (to build the list for foreach to go through); use while (<$fh>) instead.
If the file doesn't change (or the same file needs to be searched many times) you can first process it once to build an index of exact positions of S-values. Thanks to Danny_ds for a comment.
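For example, a read loop that keeps going until it has the requested number of bytes might look like this (a sketch only; $fh is the already-open handle and the buffer size is a placeholder):
# Accumulate $want bytes in $buff, allowing for short reads.
my ($want, $buff) = (1024, '');
while (length($buff) < $want) {
    my $rd = read $fh, $buff, $want - length($buff), length($buff);
    die "error reading: $!" if not defined $rd;
    last if $rd == 0;    # end of file
}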

Binary search of a sorted list is an O(log N) operation. Something like this using seek:
open my $fh, '<', $big_file or die $!;

$target = 123_456_789;
$low  = 0;
$high = -s $big_file;

while ($high - $low > 0.01 * -s $big_file) {
    $mid = int( ($low + $high) / 2 );
    seek $fh, $mid, 0;
    while (<$fh>) {
        if (/^S=(\d+)/) {
            if ($1 < $target) { $low = $mid }
            else              { $high = $mid }
            last;
        }
    }
}

seek $fh, $low, 0;
while (<$fh>) {
    # now you are searching through the 1% of the file that contains
    # your target S
}

Sort the numbers in the second file. Then you can proceed through the huge file in order, processing each S-value's block as you reach it.
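A rough sketch of that idea (untested; the file names and the assumption that every block starts with a line matching /^S=\d+/ are taken from the question, the rest is mine):
use strict;
use warnings;

# Read the wanted numbers, subtract 1 as the question describes, and sort them.
open my $nums_fh, '<', 'numbers.txt' or die $!;
my @targets = sort { $a <=> $b } map { $_ - 1 } grep { /^\d+$/ } <$nums_fh>;

open my $fh, '<', 'huge.txt' or die $!;
my $next = shift @targets;
my $in_wanted_block = 0;
while (my $line = <$fh>) {
    if ($line =~ /^S=(\d+)/) {
        # Skip any targets that simply don't occur in the file.
        $next = shift @targets while defined $next && $next < $1;
        $in_wanted_block = (defined $next && $1 == $next);
        $next = shift @targets if $in_wanted_block;
        next;
    }
    next unless $in_wanted_block;
    # ... analyze this metadata/data line of a wanted block ...
}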

Related

To increase the performance of a script in Perl

I have 2 files here, newFile and LookupFile (both of which are huge files).
The contents of newFile will be searched for in LookupFile and further processing happens. This script is working fine; however, it is taking a long time to execute. Could you please let me know what can be done here to increase the performance? Could you please also let me know if we can convert the files into hashes to increase performance?
My files look like the following.
NewFile and LookupFile:
acl sourceipaddress subnet destinationipaddress subnet portnumber
.
.
Script:
#!/usr/bin/perl
use strict;
use warnings;
use File::Slurp::Tiny 'read_file';
use File::Copy;
use Data::Dumper;
use File::Copy qw(copy);
my %options = (
    LookupFile => {
        type     => "=s",
        help     => "File name",
        variable => 'gitFile',
        required => 1,
    },
    newFile => {
        type     => "=s",
        help     => "file containing the acl lines to checked for",
        variable => 'newFile',
        required => 1,
    },
);
$opts->addOptions(%options);
$opts->parse();
$opts->validate();

my $newFile    = $opts->getOption('newFile');
my $LookupFile = $opts->getOption('LookupFile');

my @LookupFile = read_file ("$LookupFile");
my @newFile    = read_file ("$newFile");
@LookupFile = split (/\n/,$LookupFile[0]);
@newLines   = split (/\n/,$newFile[0]);

open FILE1, "$newFile" or die "Could not open file: $! \n";

while(my $line = <FILE1>)
{
    chomp($line);
    my @columns = split(' ',$line);
    $var = @columns;
    my $fld1;
    my $cnt;
    my $fld2;
    my $fld3;
    my $fld4;
    my $fld5;
    my $dIP;
    my $sIP;
    my $sHOST;
    my $dHOST;
    if(....)
    if (....) further checks and processing
    )
The first thing to do before any optimization is to profile your code. Rather than guessing, this will tell you which lines are taking up the most time, and how often they're called. Devel::NYTProf is a good tool for the job.
This is a problem.
my @LookupFile = read_file ("$LookupFile");
my @newFile = read_file ("$newFile");
@LookupFile = split (/\n/,$LookupFile[0]);
@newLines = split (/\n/,$newFile[0]);
read_file reads the whole file in as one big string (it should be my $contents = read_file(...); using an array is awkward). Then the code splits the whole thing on newlines, copying everything in the file. This is very slow, hard on memory, and unnecessary.
Instead, use read_lines. This will split the file into lines as it reads, avoiding a costly copy.
my @lookups = read_lines($LookupFile);
my @new = read_lines($newFile);
The next problem is that $newFile is opened again and iterated through line by line.
open FILE1, "$newFile" or die "Could not open file: $! \n";
while(my $line = <FILE1>) {
This is a waste as you've already read that file into memory. Use one or the other. However, in general, it's better to work with files line-by-line than to slurp them all into memory.
The above will speed things up, but it doesn't get at the crux of the problem. This is likely the real problem...
The contents in newFile will be searched in LookupFile and further processing happens.
You didn't show what you're doing, but I'm going to imagine it looks something like this...
for my $line (@lines) {
    for my $thing (@lookups) {
        ...
    }
}
That is, for each line in one file, you're looking at every line in the other. This is what is known as an O(n^2) algorithm meaning that as you double the size of the files you quadruple the time.
If each file has 10 lines, it will take 100 (10^2) turns through the inner loop. If they have 100 lines, it will take 10,000 (100^2). With 1,000 lines it will take 1,000,000 times.
With O(n^2) as the sizes get bigger things get very slow very quickly.
Could you please let me know if we can convert files into hash to increase performance?
You've got the right idea. You could convert the lookup file to a hash to speed things up. Let's say they're both lists of words.
# input
foo
bar
biff
up
down
# lookup
foo
bar
baz
And you want to check if any lines in input match any lines in lookup.
First you'd read lookup in and turn it into a hash. Then you'd read input and check if each line is in the hash.
use strict;
use warnings;
use autodie;
use v5.10;
...
# Populate `%lookup`
my %lookup;
{
    open my $fh, $lookupFile;
    while(my $line = <$fh>) {
        chomp $line;
        $lookup{$line} = 1;
    }
}

# Check if any lines are in %lookup
open my $fh, $inputFile;
while(my $line = <$fh>) {
    chomp $line;
    print $line if $lookup{$line};
}
This way you only iterate through each file once. This is an O(n) algorithm, meaning it scales linearly, because hash lookups are basically instantaneous. If each file has 10 lines, it will only take 10 iterations of each loop. If they have 100 lines, it will only take 100 iterations of each loop. 1000 lines, 1000 iterations.
Finally, what you really want to do is skip all this and create a database for your data and search that. SQLite is a SQL database that requires no server, just a file. Put your data in there and perform SQL queries on it using DBD::SQLite.
While this means you have to learn SQL, and there is a cost to building and maintaining the database, this is fast and, most importantly, very flexible. SQLite can do all sorts of searches quickly without you having to write a bunch of extra code. SQL databases are very common, so learning SQL is a very good investment.
Since you're splitting the file up with my @columns = split(' ',$line); it's probably a file with many fields in it. That will likely map to a SQL table very well.
SQLite can even import files like that for you. See this answer for details on how to do that.
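If you load it yourself instead, a minimal DBD::SQLite loader might look like this (a sketch only; the table and column names, and the assumption that each line has exactly six whitespace-separated fields, are mine):
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=acl.db', '', '',
                       { RaiseError => 1, AutoCommit => 0 });
$dbh->do('CREATE TABLE IF NOT EXISTS acl (
    acl TEXT, src_ip TEXT, src_subnet TEXT,
    dst_ip TEXT, dst_subnet TEXT, port TEXT
)');
$dbh->do('CREATE INDEX IF NOT EXISTS acl_src ON acl (src_ip, dst_ip, port)');

my $ins = $dbh->prepare('INSERT INTO acl VALUES (?, ?, ?, ?, ?, ?)');
open my $fh, '<', 'LookupFile.txt' or die $!;
while (my $line = <$fh>) {
    chomp $line;
    my @fields = split ' ', $line;
    next unless @fields == 6;      # skip blank or malformed lines
    $ins->execute(@fields);
}
$dbh->commit;

# A lookup is then a single indexed query, e.g.
# SELECT * FROM acl WHERE src_ip = ? AND dst_ip = ? AND port = ?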

Remove duplicate lines in a file by substring, preserving order (Perl)

I'm trying to write a Perl script to deal with some 3+ GB text files that are structured like:
1212123x534534534534xx4545454x232322xx
0901001x876879878787xx0909918x212245xx
1212123x534534534534xx4545454x232323xx
1212133x534534534534xx4549454x232322xx
4352342xx23232xxx345545x45454x23232xxx
I want to perform two operations:
Count the number of delimiters per line and compare it to a static number (i.e. 5); lines that exceed that number should be output to file.control.
Remove duplicates from the file by substr($line, 0, 7) (the first 7 characters), but preserving order. I want that output in file.output.
I have coded this as a simple shell script (just bash), but it took too long to process; the same script calling Perl one-liners was quicker, but I'm interested in a way to do this purely in Perl.
The code I have so far is:
open $file_hndl_ot_control, '>', $FILE_OT_CONTROL;
open $file_hndl_ot_out, '>', $FILE_OT_OUTPUT;

# INPUT.
open $file_hndl_in, '<', $FILE_IN;
while ($line_in = <$file_hndl_in>)
{
    # Calculate n. of delimiters
    my $delim_cur_line = $line_in =~ y/"$delimiter"//;
    # print "$commas \n"
    if ( $delim_cur_line != $delim_amnt_per_line )
    {
        print {$file_hndl_ot_control} "$line_in";
    }

    # Remove duplicates by substr(0,7), maintain order
    my $substr_in = substr $line_in, 0, 11;
    print if not $lines{$substr_in}++;
}
And I want the file.output file to look like:
1212123x534534534534xx4545454x232322xx
0901001x876879878787xx0909918x212245xx
1212133x534534534534xx4549454x232322xx
4352342xx23232xxx345545x45454x23232xxx
and the file.control file to look like (assuming the delimiter control number is 6):
4352342xx23232xxx345545x45454x23232xxx
Could someone assist me? Thank you.
Edit: the code I tried:
my %seen;
my $delimiter = 'x';
my $delim_amnt_per_line = 5;

open(my $fh1, ">>", "outputcontrol.txt");
open(my $fh2, ">>", "outputoutput.txt");

while ( <> ) {
    my $count = ($_ =~ y/x//);
    print "$count \n";
    # print $_;
    if ( $count != $delim_amnt_per_line )
    {
        print fh1 $_;
    }
    my ($prefix) = substr $_, 0, 7;
    next if $seen{$prefix}++;
    print fh2;
}
I don't know if I'm supposed to post new code in here, but I tried the above, based on your example. What baffles me (I'm still very new to Perl) is that it doesn't output to either filehandle, but if I redirect from the command line just as you said, it works perfectly. The problem is that I need to output to 2 different files.
It looks like entries with the same seven-character prefix may appear anywhere in the file, so it's necessary to use a hash to keep track of which ones have already been encountered. With a 3 GB text file this may result in your Perl process running out of memory, in which case a different approach is necessary. Please give this a try and see if it comes in under the bar.
The tr/// operator (the same as y///) doesn't accept variables for its character list, so I've used eval to create a subroutine delimiters() that counts the number of occurrences of $delimiter in $_.
It's usually easiest to pass the input file as a parameter on the command line, and redirect the output as necessary. That way you can run your program on different files without editing the source, and that's how I've written this program. You should run it as
$ perl filter.pl my_input.file > my_output.file
use strict;
use warnings 'all';

my %seen;
my $delimiter = 'x';
my $delim_amnt_per_line = 5;

eval "sub delimiters { tr/$delimiter// }";

while ( <> ) {
    next if delimiters() == $delim_amnt_per_line;
    my ($prefix) = substr $_, 0, 7;
    next if $seen{$prefix}++;
    print;
}
output
1212123x534534534534xx4545454x232322xx
0901001x876879878787xx0909918x212245xx
1212133x534534534534xx4549454x232322xx
4352342xx23232xxx345545x45454x23232xxx

Building indexes for files in Perl

I'm new to Perl, and I've stumbled upon a problem:
My task is to create a simple way to access a line of a big file in Perl, the fastest way possible.
I created a file consisting of 5 million lines with, on each line, the number of the line.
I've then created my main program that will need to be able to print any content of a given line.
To do this, I'm using two methods I've found on the internet:
use Config qw( %Config );

my $off_t = $Config{lseeksize} > $Config{ivsize} ? 'F' : 'j';

my $file = "testfile.err";
open(FILE, "< $file") or die "Can't open $file for reading: $!\n";
open(INDEX, "+>$file.idx")
    or die "Can't open $file.idx for read/write: $!\n";

build_index(*FILE, *INDEX);
my $line = line_with_index(*FILE, *INDEX, 129);
print "$line";

sub build_index {
    my $data_file  = shift;
    my $index_file = shift;
    my $offset     = 0;

    while (<$data_file>) {
        print $index_file pack($off_t, $offset);
        $offset = tell($data_file);
    }
}

sub line_with_index {
    my $data_file   = shift;
    my $index_file  = shift;
    my $line_number = shift;

    my $size;     # size of an index entry
    my $i_offset; # offset into the index of the entry
    my $entry;    # index entry
    my $d_offset; # offset into the data file

    $size     = length(pack($off_t, 0));
    $i_offset = $size * ($line_number-1);
    seek($index_file, $i_offset, 0) or return;
    read($index_file, $entry, $size);
    $d_offset = unpack($off_t, $entry);
    seek($data_file, $d_offset, 0);
    return scalar(<$data_file>);
}
Those methods only sometimes work; I get a correct value about once out of ten tries on different sets of values, but most of the time I get "Use of uninitialized value $line in string at test2.pl line 10" (when looking for line 566 in the file), or the wrong numeric value. Moreover, the indexing seems to work fine for the first two hundred or so lines, but afterwards I get the error. I really don't know what I'm doing wrong.
I know you can use a basic loop that will parse each line, but I really need a way of accessing, at any given time, one line of a file without reparsing it all over again.
Edit: I've tried using a little tip found here: Reading a particular line by line number in a very large file
I've replaced the "N" template for pack with:
my $off_t = $Config{lseeksize} > $Config{ivsize} ? 'F' : 'j';
It makes the process work better, up until line 128, where instead of getting 128 I get a blank string. For 129, I get 3, which doesn't mean much.
Edit 2: Basically what I need is a mechanism that lets me read, say, the next 2 lines of a file that is already being read, while keeping the read "head" at the current line (and not 2 lines further on).
Thanks for your help !
Since you are writing binary data to the index file, you need to set the filehandle to binary mode, especially if you are in Windows:
open(INDEX, "+>$file.idx")
or die "Can't open $file.idx for read/write: $!\n";
binmode(INDEX);
Right now, when you perform something like this in Windows:
print $index_file pack("j", $offset);
Perl will convert any 0x0a's in the packed string to 0x0d0a's. Setting the filehandle to binmode will make sure line feeds are not converted to carriage return-line feeds.
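A quick way to see the effect (a small sketch of my own; it forces the :crlf layer so it shows the problem even when not on Windows):
use strict;
use warnings;

my $packed = pack('j', 10);            # the packed value contains the byte 0x0A
open my $out, '>:crlf', 'idx.bin' or die $!;
print {$out} $packed;
close $out;

# The file is one byte longer than the packed string,
# because the 0x0A was written as 0x0D 0x0A.
print length($packed), ' vs ', -s 'idx.bin', "\n";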

How can I remove non-unique lines from a large file with Perl?

Duplicate data removal using Perl, called via a batch file within Windows (a DOS window in Windows, launched by a batch file).
The batch file calls the Perl script which carries out the actions; I already have the batch file.
The script I have works for duplicate data removal as long as the data file is not too big.
The problem that needs resolving is with larger data files (2 GB or more): with files of this size a memory error occurs when trying to load the complete file into an array for duplicate data removal.
The memory error occurs in the subroutine at:
@contents_of_the_file = <INFILE>;
(A completely different method is acceptable so long as it solves this issue; please suggest one.)
The subroutine is:
sub remove_duplicate_data_and_file
{
    open(INFILE,"<" . $output_working_directory . $output_working_filename)
        or dienice ("Can't open $output_working_filename : INFILE :$!");
    if ($test ne "YES")
    {
        flock(INFILE,1);
    }
    @contents_of_the_file = <INFILE>;
    if ($test ne "YES")
    {
        flock(INFILE,8);
    }
    close (INFILE);
    ### TEST print "$#contents_of_the_file\n\n";

    @unique_contents_of_the_file = grep(!$unique_contents_of_the_file{$_}++, @contents_of_the_file);

    open(OUTFILE,">" . $output_restore_split_filename)
        or dienice ("Can't open $output_restore_split_filename : OUTFILE :$!");
    if ($test ne "YES")
    {
        flock(OUTFILE,1);
    }
    for($element_number=0;$element_number<=$#unique_contents_of_the_file;$element_number++)
    {
        print OUTFILE "$unique_contents_of_the_file[$element_number]\n";
    }
    if ($test ne "YES")
    {
        flock(OUTFILE,8);
    }
}
You are unnecessarily storing a full copy of the original file in @contents_of_the_file and -- if the amount of duplication is low relative to the file size -- nearly two other full copies in %unique_contents_of_the_file and @unique_contents_of_the_file. As ire_and_curses noted, you can reduce the storage requirements by making two passes over the data: (1) analyze the file, storing information about the line numbers of non-duplicate lines; and (2) process the file again to write non-dups to the output file.
Here is an illustration. I don't know whether I've picked the best module for the hashing function (Digest::MD5); perhaps others will comment on that. Also note the 3-argument form of open(), which you should be using.
use strict;
use warnings;
use Digest::MD5 qw(md5);

my (%seen, %keep_line_nums);
my $in_file  = 'data.dat';
my $out_file = 'data_no_dups.dat';

open (my $in_handle, '<', $in_file) or die $!;
open (my $out_handle, '>', $out_file) or die $!;

while ( defined(my $line = <$in_handle>) ){
    my $hashed_line = md5($line);
    $keep_line_nums{$.} = 1 unless $seen{$hashed_line};
    $seen{$hashed_line} = 1;
}

seek $in_handle, 0, 0;
$. = 0;
while ( defined(my $line = <$in_handle>) ){
    print $out_handle $line if $keep_line_nums{$.};
}

close $in_handle;
close $out_handle;
You should be able to do this efficiently using hashing. You don't need to store the data from the lines, just identify which ones are the same. So...
Don't slurp - Read one line at a time.
Hash the line.
Store the hashed line representation as a key in a Perl hash of lists. Store the line number as the first value of the list.
If the key already exists, append the duplicate line number to the list corresponding to that value.
At the end of this process, you'll have a data-structure identifying all the duplicate lines. You can then do a second pass through the file to remove those duplicates.
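A sketch of that structure, using Digest::MD5 as the hashing function as in the answer above (untested; the file names are placeholders):
use strict;
use warnings;
use Digest::MD5 'md5';

# Pass 1: map each line's digest to the list of line numbers where it occurs.
my %lines_for;
open my $in, '<', 'data.dat' or die $!;
while (my $line = <$in>) {
    push @{ $lines_for{ md5($line) } }, $.;
}

# Every line number after the first in each list is a duplicate.
my %is_dup;
for my $nums (values %lines_for) {
    my (undef, @dups) = @$nums;
    $is_dup{$_} = 1 for @dups;
}

# Pass 2: copy the file, skipping the duplicate line numbers.
seek $in, 0, 0;
$. = 0;
open my $out, '>', 'data_no_dups.dat' or die $!;
while (my $line = <$in>) {
    print $out $line unless $is_dup{$.};
}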
Perl does heroic things with large files, but 2GB may be a limitation of DOS/Windows.
How much RAM do you have?
If your OS doesn't complain, it may be best to read the file one line at a time, and write immediately to output.
I'm thinking of something using the diamond operator <> but I'm reluctant to suggest any code because on the occasions I've posted code, I've offended a Perl guru on SO.
I'd rather not risk it. I hope the Perl cavalry will arrive soon.
In the meantime, here's a link.
Here's a solution that works no matter how big the file is. But it doesn't use RAM exclusively, so it's slower than a RAM-based solution. You can also specify the amount of RAM you want this thing to use.
The solution uses a temporary file that the program treats as a database with SQLite.
#!/usr/bin/perl

use DBI;
use Digest::SHA 'sha1_base64';
use Modern::Perl;

my $input = shift;
my $temp  = 'unique.tmp';
my $cache_size_in_mb = 100;

unlink $temp if -f $temp;

my $cx = DBI->connect("dbi:SQLite:dbname=$temp");
$cx->do("PRAGMA cache_size = " . $cache_size_in_mb * 1000);
$cx->do("create table x (id varchar(86) primary key, line int unique)");
my $find   = $cx->prepare("select line from x where id = ?");
my $list   = $cx->prepare("select line from x order by line");
my $insert = $cx->prepare("insert into x (id, line) values(?, ?)");

open(FILE, $input) or die $!;

my ($line_number, $next_line_number, $line, $sha) = 1;

while ($line = <FILE>) {
    $line =~ s/\s+$//s;
    $sha = sha1_base64($line);
    unless ($cx->selectrow_array($find, undef, $sha)) {
        $insert->execute($sha, $line_number);
    }
    $line_number++;
}

seek FILE, 0, 0;

$list->execute;

$line_number = 1;
$next_line_number = $list->fetchrow_array;

while ($line = <FILE>) {
    $line =~ s/\s+$//s;
    if ($next_line_number == $line_number) {
        say $line;
        $next_line_number = $list->fetchrow_array;
        last unless $next_line_number;
    }
    $line_number++;
}

close FILE;
Well, you could use the in-place edit mode of command-line Perl.
perl -i~ -ne 'print unless $seen{$_}++' uberbigfilename
In the "completely different method" category, if you've got Unix commands (e.g. Cygwin):
cat infile | sort | uniq > outfile
This ought to work - no need for Perl at all - which may, or may not, solve your memory problem. However, you will lose the ordering of the infile (as outfile will now be sorted).
EDIT: An alternative solution that's better able to deal with large files may be to use the following algorithm:
Read INFILE line-by-line
Hash each line to a small hash (e.g. a hash# mod 10)
Append each line to a file unique to the hash number (e.g. tmp-1 to tmp-10)
Close INFILE
Open and sort each tmp-# to a new file sortedtmp-#
Mergesort sortedtmp-[1-10] (i.e. open all 10 files and read them simultaneously), skipping duplicates and writing each iteration to the end output file
This will be safer, for very large files, than slurping.
Parts 2 and 3 could be changed to use a random number instead of a hash number mod 10.
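A rough Perl sketch of the partitioning pass, steps 1 to 4 (untested; the bucket count, file names, and the crude per-line hash are placeholders):
use strict;
use warnings;

# Split INFILE into 10 bucket files by a stable hash of each line,
# so identical lines always land in the same tmp-N file.
my $buckets = 10;
my @out;
for my $i (0 .. $buckets - 1) {
    open $out[$i], '>', "tmp-$i" or die "Can't write tmp-$i: $!";
}

open my $in, '<', 'infile' or die $!;
while (my $line = <$in>) {
    my $sum = 0;
    $sum += ord($_) for split //, $line;   # crude hash; anything deterministic will do
    print { $out[ $sum % $buckets ] } $line;
}
close $_ for $in, @out;

# Each tmp-N file can then be sorted (step 5) and the sorted buckets
# merged while skipping duplicates (step 6).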
Here's a script BigSort that may help (though I haven't tested it):
# BigSort
#
# sort big file
#
# $1 input file
# $2 output file
#
# equ sort -t";" -k 1,1 $1 > $2
BigSort()
{
    if [ -s $1 ]; then
        rm $1.split.* > /dev/null 2>&1
        split -l 2500 -a 5 $1 $1.split.
        rm $1.sort > /dev/null 2>&1
        touch $1.sort1
        for FILE in `ls $1.split.*`
        do
            echo "sort $FILE"
            sort -t";" -k 1,1 $FILE > $FILE.sort
            sort -m -t";" -k 1,1 $1.sort1 $FILE.sort > $1.sort2
            mv $1.sort2 $1.sort1
        done
        mv $1.sort1 $2
        rm $1.split.* > /dev/null 2>&1
    else
        # work for empty file !
        cp $1 $2
    fi
}

How can I randomly sample the contents of a file?

I have a file with contents
abc
def
high
lmn
...
...
There are more than 2 million lines in the file.
I want to randomly sample lines from the file and output 50K lines. Any thoughts on how to approach this problem? I was thinking along the lines of Perl and its rand function (or a handy shell command would be neat).
Related (Possibly Duplicate) Questions:
Randomly Pick Lines From a File Without Slurping It With Unix
How can I get exactly n random lines from a file with Perl?
Assuming you basically want to output about 2.5% of all lines, this would do:
while (<$input>) { print if 0.025 > rand }
Shell way:
sort -R file | head -n 50000
From perlfaq5: "How do I select a random line from a file?"
Short of loading the file into a database or pre-indexing the lines in the file, there are a couple of things that you can do.
Here's a reservoir-sampling algorithm from the Camel Book:
srand;
rand($.) < 1 && ($line = $_) while <>;
This has a significant advantage in space over reading the whole file in. You can find a proof of this method in The Art of Computer Programming, Volume 2, Section 3.4.2, by Donald E. Knuth.
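The same idea extends to picking many lines at once. Here is a reservoir-sampling sketch for k lines (my own, not from perlfaq5); note that the sampled lines come out in reservoir order, not file order:
use strict;
use warnings;

my $k = 50_000;
my @sample;
while (<>) {
    if (@sample < $k) {
        push @sample, $_;                 # fill the reservoir first
    }
    elsif (rand($.) < $k) {
        $sample[ rand @sample ] = $_;     # replace a random element with probability k/$.
    }
}
print @sample;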
You can use the File::Random module which provides a function for that algorithm:
use File::Random qw/random_line/;
my $line = random_line($filename);
Another way is to use the Tie::File module, which treats the entire file as an array. Simply access a random array element.
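For example (a small sketch; note that Tie::File keeps an offset table in memory, so on a very large file this can still be costly):
use strict;
use warnings;
use Tie::File;

tie my @lines, 'Tie::File', 'big.txt' or die "Can't tie file: $!";
print $lines[ int rand @lines ], "\n";    # records come back without their newline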
Perl way:
Use CPAN. The module File::RandomLine does exactly what you need.
If you need to extract an exact number of lines:
use strict;
use warnings;

# Number of lines to pick and file to pick from
# Error checking omitted!
my ($pick, $file) = @ARGV;

open(my $fh, '<', $file)
    or die "Can't read file '$file' [$!]\n";

# count lines in file
my ($lines, $buffer);
while (sysread $fh, $buffer, 4096) {
    $lines += ($buffer =~ tr/\n//);
}

# limit number of lines to pick to number of lines in file
$pick = $lines if $pick > $lines;

# build list of N lines to pick, use a hash to prevent picking the
# same line multiple times
my %picked;
for (1 .. $pick) {
    my $n = int(rand($lines)) + 1;
    redo if $picked{$n}++;
}

# loop over file extracting selected lines
seek($fh, 0, 0);
while (<$fh>) {
    print if $picked{$.};
}

close $fh;