Storing large file in array is not possible? - perl

There is a process which stores a file in an array. Unfortunately, when the file is too big (say 800K lines, or more than 60 MB) an "Out of memory!" error is returned. Is there any solution to this? For example, the following code throws "Out of memory!".
#! /usr/bin/perl
die unless (open (INPUT, "input.txt"));
@file=<INPUT>; # It fails here
print "File stored in array\n"; # It never reaches here
$idx=0;
while ($idx < @file) {
$idx++;
}
print "The line count is = $idx\n";

I'd use Tie::File for that:
use Tie::File;
my @file;
tie @file, 'Tie::File', "input.txt";
print "File reflected in array\n";
print "The line count is ", scalar(@file);

Most of the time, you don't need to read in the whole file at once. The readline operator returns only one line at a time when called in scalar context:
1 while <INPUT>; # read a line, and discard it.
say "The line count is = $.";
The $. special variable holds the current line number of the last-read filehandle.
Edit: Line counting was just an example
Perl has no problem with large arrays, it just seems that your system doesn't have enough memory. Be aware that Perl arrays use more memory than C arrays, as each scalar allocates additional memory for flags etc., and because arrays grow in increasingly large steps.
If memory is an issue, you have to transform your algorithm from one that has to load a whole file into memory to one that only keeps one line at a time.
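A minimal sketch of that kind of streaming rewrite (assuming the input file is input.txt): only the current line is kept in memory while a per-line result is computed.
#!/usr/bin/perl
use strict;
use warnings;

# Sketch: process line by line instead of slurping the whole file.
open my $in, '<', 'input.txt' or die "Can't open input.txt: $!";
my $longest = 0;
while (my $line = <$in>) {
    chomp $line;
    $longest = length $line if length $line > $longest;
}
close $in;
print "Longest line: $longest characters\n";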
Example: Sorting a multi-gigabyte file. The normal approach print sort <$file> won't work here. Instead, we sort portions of the file, write them to tempfiles, and then switch between the tempfiles in a clever way to produce one sorted output:
use strict; use warnings; use autodie;
my $blocksize = shift @ARGV; # take lines per tempfile as command line arg
mkdir "/tmp/$$"; # $$ is the process ID variable
my $tempcounter = 0;
my @buffer;
my $save_buffer = sub {
$tempcounter++;
open my $tempfile, ">", "/tmp/$$/$tempcounter";
print $tempfile sort @buffer;
@buffer = ();
};
while (<>) {
push @buffer, $_;
$save_buffer->() if $. % $blocksize == 0;
}
$save_buffer->();
# open all files, read 1st line
my @head =
grep { defined $_->[0] }
map { open my $fh, "<", $_; [scalar(<$fh>), $fh] }
glob "/tmp/$$/*";
# sort the line-file pairs, pick least
while((my $least, @head) = sort { $a->[0] cmp $b->[0] } @head){
my ($line, $fh) = @$least; print $line;
# read next line
if (defined($line = <$fh>)){
push @head, [$line, $fh];
}
}
# clean up afterwards
END {
unlink $_ for glob "/tmp/$$/*";
rmdir "/tmp/$$";
}
Could be called like $ ./sort-large-file 10000 multi-gig-file.txt >sorted.txt.
This general approach can be applied to all kinds of problems. This is a “divide and conquer” strategy: If the problem is too big, solve a smaller problem, and then combine the pieces.

Can a Perl program know the line number where __DATA__ begins?

Is there a way to get the line number (and maybe filename) where a __DATA__ token was coded? Or some other way to know the actual line number in the original source file where a line of data read from the DATA filehandle came from?
Note that $. counts from 1 when reading from the DATA filehandle. So if the line number of the __DATA__ token were added to $. it would be what I'm looking for.
For example:
#!/usr/bin/perl
while (<DATA>) {
my $n = $. + WHAT??;
die "Invalid data at line $n\n" if /bad/;
}
__DATA__
something good
something bad
I want this to say "Invalid data at line 9", not "line 2" (which is what you get if $. is used by itself).
In systems that support /proc/<pid> virtual filesystems (e.g., Linux), you can do:
# find the file where <DATA> handle is read from
my $DATA_FILE = readlink("/proc/$$/fd/" . fileno(*DATA));
# find the line where DATA begins
open my $THIS, "<", $DATA_FILE;
my @THIS = <$THIS>;
my ($DATA_LINE) = grep { $THIS[$_] =~ /^__DATA__\b/ } 0 .. $#THIS;
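As a hedged usage sketch (assuming $DATA_LINE from the snippet above, i.e. the zero-based index of the __DATA__ line), the original loop could then compute the real source line like this:
while (<DATA>) {
    # __DATA__ sits at file line $DATA_LINE + 1, so data line $. maps to
    # source line $. + $DATA_LINE + 1
    my $n = $. + $DATA_LINE + 1;
    die "Invalid data at line $n\n" if /bad/;
}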
Files don't actually have lines; they're just sequences of bytes. The OS doesn't even offer the capability of getting a line from a file, so it has no concept of line numbers.
Perl, on the other hand, does keep track of a line number for each handle. It is accessed via $..
However, the Perl handle DATA is created from a file descriptor that has already been moved to the start of the data (it's the file descriptor that Perl itself uses to load and parse the file), so there's no record of how many lines have already been read. So line 1 of DATA is the first line after __DATA__.
To correct the line count, one must seek back to the start of the file, and read it line by line until the file handle is back at the same position it started.
#!/usr/bin/perl
use strict;
use warnings qw( all );
use Fcntl qw( SEEK_SET );
# Determines the line number at the current file position without using «$.».
# Corrects the value of «$.» and returns the line number.
# Sets «$.» to «1» and returns «undef» if unable to determine the line number.
# The handle is left pointing to the same position as when this was called, or this dies.
sub fix_line_number {
my ($fh) = @_;
( my $initial_pos = tell($fh) ) >= 0
or return undef;
seek($fh, 0, SEEK_SET)
or return undef;
$. = 1;
while (<$fh>) {
( my $pos = tell($fh) ) >= 0
or last;
if ($pos >= $initial_pos) {
if ($pos > $initial_pos) {
seek($fh, $initial_pos, SEEK_SET)
or die("Can't reset handle: $!\n");
}
return $.;
}
}
seek($fh, $initial_pos, SEEK_SET)
or die("Can't reset handle: $!\n");
$. = 1;
return undef;
}
my $prefix = fix_line_number(\*DATA) ? "" : "+";
while (<DATA>) {
printf "%s:%s: %s", __FILE__, "$prefix$.", $_;
}
__DATA__
foo
bar
baz
Output:
$ ./a.pl
./a.pl:48: foo
./a.pl:49: bar
./a.pl:50: baz
$ perl <( cat a.pl )
/dev/fd/63:+1: foo
/dev/fd/63:+2: bar
/dev/fd/63:+3: baz
Perl keeps track of the file and line at which each symbol is created. A symbol is normally created when the parser/compiler first encounters it. But if __DATA__ is encountered before DATA is otherwise created, this will create the symbol. We can take advantage of this to set the line number associated with the file handle in DATA.
For the case where the Package::DATA handle is not used in Package.pm itself, the line number of the __DATA__ token could be obtained via B::GV->LINE on the DATA handle:
$ cat Foo.pm
package Foo;
1;
__DATA__
good
bad
$ perl -I. -MFoo -MB -e '
my $ln = B::svref_2object(\*Foo::DATA)->LINE;
warn "__DATA__ at line $ln\n";
Foo::DATA->input_line_number($ln);
while(<Foo::DATA>){ die "no good" unless /good/ }
'
__DATA__ at line 4
no good at -e line 1, <DATA> line 6.
In the case where the DATA handle is referenced in the file itself, a possible kludge would be to use an @INC hook:
$ cat DH.pm
package DH;
unshift @INC, sub {
my ($sub, $fname) = @_;
for(@INC){
if(open my $fh, '<', my $fpath = "$_/$fname"){
$INC{$fname} = $fpath;
return \'', $fh, sub {
our (%ln, %pos);
if($_){ $pos{$fname} += length; ++$ln{$fname} }
}
}
}
};
$ cat Bar.pm
package Bar;
print while <DATA>;
1;
__DATA__
good
bad
$ perl -I. -MDH -MBar -e '
my $fn = "Bar.pm";
warn "__DATA__ at line $DH::ln{$fn} pos $DH::pos{$fn}\n";
seek Bar::DATA, $DH::pos{$fn}, 0;
Bar::DATA->input_line_number($DH::ln{$fn});
while (<Bar::DATA>){ die "no good" unless /good/ }
'
good
bad
__DATA__ at line 6 pos 47
no good at -e line 6, <DATA> line 8.
Just for the sake of completeness, in the case where you do have control over the file, it could all be done easily with:
print "$.: $_" while <DATA>;
BEGIN { our $ln = __LINE__ + 1; DATA->input_line_number($ln) }
__DATA__
...
You can also use the first B::GV solution, provided that you reference the DATA handle via an eval:
use B;
my ($ln, $data) = eval q{B::svref_2object(\*DATA)->LINE, \*DATA}; die $@ if $@;
$data->input_line_number($ln);
print "$.: $_" while <$data>;
__DATA__
...
None of these solutions assumes that the source file is seekable (except if you want to read the DATA more than once, as I did in the second example), or tries to reparse your files, etc.
Comparing the end of the file to itself in reverse might do what you want:
#!/usr/bin/perl
open my $f, "<", $0;
my @lines;
my @dataLines;
push @lines, $_ while <$f>;
close $f;
push @dataLines, $_ while <DATA>;
my @revLines = reverse @lines;
my @revDataLines = reverse @dataLines;
my $count = @lines;
my $offset=0;
$offset++ while ($revLines[$offset] eq $revDataLines[$offset]);
$count-=$offset;
print "__DATA__ section is at line $count\n";
__DATA__
Hello there
"Some other __DATA__
lkjasdlkjasdfklj
ljkasdf
Running gives an output of:
__DATA__ section is at line 19
The above script reads itself (using $0 for the file name) into the @lines array and reads the DATA file into the @dataLines array.
The arrays are reversed and then compared element by element until they are different. The number of matching lines is tracked in $offset, and this is subtracted from the $count variable, which is the number of lines in the file.
The result is the line number the DATA section starts at. Hope that helps.
Thank you @mosvy for the clever and general idea.
Below is a consolidated solution which works anywhere. It uses a symbolic reference instead of eval to avoid mentioning "DATA" at compile time, but otherwise uses the same ideas as mosvy.
The important point is that code in a package containing __DATA__ must not refer to the DATA symbol by name so that that symbol won't be created until the compiler sees the __DATA__ token. The way to avoid mentioning DATA is to use a filehandle ref created at run-time.
# Get the DATA filehandle for a package (default: the caller's),
# fixed so that "$." provides the actual line number in the
# original source file where the last-read line of data came
# from, rather than counting from 1.
#
# In scalar context, returns the fixed filehandle.
# In list context, returns ($fh, $filename)
#
# For this to work, a package containing __DATA__ must not
# explicitly refer to the DATA symbol by name, so that the
# DATA symbol (glob) will not yet be created when the compiler
# encounters the __DATA__ token.
#
# Therefore, use the filehandle ref returned by this
# function instead of DATA!
#
sub get_DATA_fh(;$) {
my $pkg = $_[0] // caller;
# Using a symbolic reference to avoid mentioning "DATA" at
# compile time, in case we are reading our own module's __DATA__
my $fh = do{ no strict 'refs'; *{"${pkg}::DATA"} };
use B;
$fh->input_line_number( B::svref_2object(\$fh)->LINE );
wantarray ? ($fh, B::svref_2object(\$fh)->FILE) : $fh
}
Usage examples:
my $fh = get_DATA_fh; # read my own __DATA__
while (<$fh>) { print "$. : $_"; }
or
my ($fh,$fname) = get_DATA_fh("Otherpackage");
while (<$fh>) {
print " $fname line $. : $_";
}

Split file Perl

I want to split parts of a file. Here is what the start of the file looks like (it continues in the same way):
Location Strand Length PID Gene
1..822 + 273 292571599 CDS001
906..1298 + 130 292571600 trxA
I want to split the Location column, subtract 822-1, do the same for every row, and add them all together. So for these two rows the value would be: (822-1) + (1298-906) = 1213
How?
My code right now (I don't get any output at all in the terminal, it just continues to process forever):
use warnings;
use strict;
my $infile = $ARGV[0]; # Reading infile argument
open my $IN, '<', $infile or die "Could not open $infile: $!, $?";
my $line2 = <$IN>;
my $coding = 0; # Initialize coding variable
while(my $line = $line2){ # reading the file line by line
# TODO Use split and do the calculations
my @row = split(/\.\./, $line);
my @row2 = split(/\D/, $row[1]);
$coding += $row2[0]- $row[0];
}
print "total amount of protein coding DNA: $coding\n";
So what I get from my code if I put:
print "$coding \n";
at the end of the while loop just to test is:
821
1642
And so the first number is correct (822-1) but the next number doesn't make any sense to me, it should be (1298-906). What I want in the end outside the loop:
print "total amount of protein coding DNA: $coding\n";
is the sum of all the subtractions of every line, i.e. 1213. But I don't get anything, just a terminal that keeps running forever.
As a one-liner:
perl -nE '$c += $2 - $1 if /^(\d+)\.\.(\d+)/; END { say $c }' input.txt
(Extracting the important part of that and putting it into your actual script should be easy to figure out).
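For what it's worth, a sketch of the same logic inside a script (same regex and running total as the one-liner; the input file name comes from the command line via <>):
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';

my $c = 0;
while (my $line = <>) {
    # add (end - start) of each Location range, e.g. 822 - 1
    $c += $2 - $1 if $line =~ /^(\d+)\.\.(\d+)/;
}
say $c;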
Explicitly opening the file makes your code more complicated than it needs to be. Perl will automatically open any files passed on the command line and allow you to read from them using the empty file input operator, <>. So your code becomes as simple as this:
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
my $total;
while (<>) {
my ($min, $max) = /(\d+)\.\.(\d+)/;
next unless $min and $max;
$total += $max - $min;
}
say $total;
If this code is in a file called adder and your input data is in add.dat, then you run it like this:
$ adder add.dat
1213
Update: And, to explain where you were going wrong...
You only ever read a single line from your file:
my $line2 = <$IN>;
And then you continually assign that same value to another variable:
while(my $line = $line2){ # reading the file line by line
The comment in this line is wrong. I'm not sure where you got that line from.
To fix your code, just remove the my $line2 = <$IN> line and replace your loop with:
while (my $line = <$IN>) {
# your code here
}

Perl sub skips foreach within which it is called

I'm having a problem with a subroutine that locates certain files and extracts some data out of them.
This subroutine is called inside a foreach loop, but whenever the call is made the loop skips to its next iteration. So I am wondering whether any of the next statements are somehow escaping from the subroutine to the foreach loop where it is called.
To my knowledge the sub looks solid, though, so I'm hoping someone can see something I'm missing.
sub FindKit{
opendir(DH, "$FindBin::Bin\\data");
my @kitfiles = readdir(DH);
closedir(DH);
my $nametosearch = $_[0];
my $numr = 1;
foreach my $kitfile (@kitfiles)
{
# skip . and .. and Thumbs.db and non-K-files
if($kitfile =~ /^\.$/) {shift @kitfiles; next;}
if($kitfile =~ /^\.\.$/) {shift @kitfiles; next;}
if($kitfile =~ /Thumbs\.db/) {shift @kitfiles; next;}
if($kitfile =~ /^[^K]/) {shift @kitfiles; next;}
# $kitfile is the file used on this iteration of the loop
open (my $fhkits,"<","data\\$kitfile") or die "$!";
while (<$fhkits>) {}
if ($. <= 1) {
print " Empty File!";
next;
}
seek($fhkits,0,0);
while (my $kitrow = <$fhkits>) {
if ($. == 0 && $kitrow =~ /Maakartikel :\s*(\S+)\s+Montagekit.*?($nametosearch)\s{3,}/g) {
close $fhkits;
return $1;
}
}
$numr++;
close $fhkits;
}
return 0;
}
To summarize comments, the refactored code:
use File::Glob ':bsd_glob';
sub FindKit {
my $nametosearch = $_[0];
my @kitfiles = glob "$FindBin::Bin/data/K*"; # files that start with K
foreach my $kitfile (@kitfiles)
{
open my $fhkits, '<', $kitfile or die "$!";
my $kitrow_first_line = <$fhkits>; # read first line
return if eof; # next read is end-of-file so it was just header
my ($result) = $kitrow_first_line =~
/Maakartikel :\s*(\S+)\s+Montagekit.*?($nametosearch)\s{3,}/;
return $result if $result;
}
return 0;
}
I use core File::Glob and enable :bsd_glob option, which can handle spaces in filenames. I follow the docs note to use "real slash" on Win32 systems.
I check whether there is only a header line using eof.†
I do not see how this can affect the calling code, other than by its return value. Also, I don't see how the posted code can make the caller skip a beat, either. That problem is unlikely to be in this sub.
Please let me know if I missed some point with the above rewrite.
† Previous version used to check whether there is just one (header) line by
1 while <$fhkits>; # check number of lines ...
return if $. == 1; # there was only one line, the header
Also correct, but eof is way better.
The thing that is almost certainly screwing you here is that you are shifting the list that you are iterating.
That's bad news, as you're deleting elements ... but in places you aren't necessarily thinking.
For example:
#!/usr/bin/env perl
use strict;
use warnings;
my @list = qw ( one two three );
my $count;
foreach my $value ( @list ) {
print "Iteration ", ++$count," value is $value\n";
if ( $value eq 'two' ) { shift @list; next };
}
print "@list";
How many times do you think that should iterate, and which values should end up in the array?
Because you shift you never process element 'three' and you delete element 'one'. That's almost certainly what's causing you problems.
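One safe alternative (a sketch, not part of the original answer) is to filter the list up front instead of modifying it while iterating:
#!/usr/bin/env perl
use strict;
use warnings;

my @list = qw( one two three );
# The loop list is built once and never modified in flight.
foreach my $value ( grep { $_ ne 'two' } @list ) {
    print "processing $value\n";
}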
You also:
open using a relative path, when your opendir used an absolute one.
skip a bunch of files, and then skip anything that doesn't start with K. Why not just search for things that do start with K?
read the file twice, and one pass is just to check whether it's empty. The perl file test -z will do this just fine.
you set $kitrow for each line in the file, but don't really use it for anything other than pattern matching. It'd probably work better using implicit variables.
You only actually do anything on the first line - so you don't ever need to iterate the whole file. ($numr seems to be discarded).
you use a global match, but only use one result. The g flag seems redundant here.
I'd suggest a big rewrite, and do something like this:
#!/usr/bin/env perl
use strict;
use warnings;
use FindBin;
sub FindKit{
my ($nametosearch) = @_;
my $numr = 1;
foreach my $kitfile (glob "$FindBin::Bin\\data\\K*" )
{
if ( -z $kitfile ) {
print "$kitfile is empty\n";
next;
}
# $kitfile is the file used on this iteration of the loop
open (my $fhkits,"<", $kitfile) or die "$!";
<$fhkits> =~ m/Maakartikel :\s*(\S+)\s+Montagekit.*?($nametosearch)\s{3,}/
and return $1;
}
return 0;
}
As a big fan of the Path::Tiny module (I always have it installed and use it in every project), my solution would be:
use strict;
use warnings;
use Path::Tiny;
my $found = FindKit('mykit');
print "$found\n";
sub FindKit {
my($nametosearch) = @_;
my $datadir = path($0)->realpath->parent->child('data');
die "$datadir doesn't exist" unless -d $datadir;
for my $file ($datadir->children( qr /^K/ )) {
next if -z $file; #skip empty
my #lines = $file->lines;
return $1 if $lines[0] =~ /Maakartikel :\s*(\S+)\s+Montagekit.*?($nametosearch)\s{3,}/;
}
return;
}
Some comments and still open issues:
Using Path::Tiny you can always use forward slashes in path names, regardless of the OS (UNIX/Windows), e.g. data/file will work on Windows too.
AFAIK FindBin is considered broken - so the above uses $0 and realpath ...
what if the Kit is in multiple files? The above always returns on the 1st found one (see the sketch after this list)
the my @lines = $file->lines; reads all lines - unnecessary - but on small files it isn't a big deal.
in reality this function returns the arg for the Maakartikel, so probably a better name would be find_articel_by_kit or find_articel :)
easy to switch to utf8 - just change the $file->lines to $file->lines_utf8;
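To illustrate the multiple-files point above, a sketch (a hypothetical FindAllKits variant; same use Path::Tiny as the answer, plus its count option so only the first line is read) that collects every match instead of returning on the first:
sub FindAllKits {
    my ($nametosearch) = @_;
    my $datadir = path($0)->realpath->parent->child('data');
    die "$datadir doesn't exist" unless -d $datadir;
    my @results;
    for my $file ($datadir->children( qr/^K/ )) {
        next if -z $file;                                   # skip empty files
        my ($first_line) = $file->lines( { count => 1 } );  # read only line 1
        push @results, $1
            if $first_line =~ /Maakartikel :\s*(\S+)\s+Montagekit.*?($nametosearch)\s{3,}/;
    }
    return @results;
}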

Perl: Manipulating while(<>) loop in file reading

My question is regarding the while loop that reads lines from files. The situation is that I want to store values from the next line, or the entire next line, while the while(<FILEHANDLE>) loop is working on the present line ($_). What is the way to address this problem? Is there a specific function or module that does this?
If you want to process four lines at a time and each set of lines is separated by @FCC then you need to change Perl's input record separator.
In your script put
$/="\#FCC"
This means that when you do (<>), each record you get in $_ is now four lines of your file.
use warnings;
use strict;
local $/="\@FCC";
while (<>) {
chomp;
#Each time we iterate, $_ is now all four lines of each record.
}
Edit
You'll need to backslash the @
You can read from <> anywhere, not just in the head of the loop, e.g.
while (my $line = <>) {
chomp $line;
my $another_line = <>;
chomp $another_line;
print "$line followed by $another_line\n";
}
Assuming your file is small-ish (perhaps less than 1 GB) you could just stuff it into an array and walk it:
use warnings;
use strict;
my @lines;
while (<>) {
chomp;
push @lines, $_;
}
my $num_lines = @lines; # use of array in scalar context gives length of array
# don't do last line (there is no next one)
$num_lines -= 1;
foreach (my $i = 0; $i < $num_lines; $i++) {
my $next_line = $i+1;
print "line $i plus $next_line:",$lines[$i],$lines[$i+1],"\n";
}
Note that the semantics of my solution are a bit different from the answer above. My solution would print out everything except the first and last lines twice; if you wanted everything to be printed once, the above solution might make more sense.
If you want to read n lines at a time from a file you can use Tie::File and use an array to reference n elements at a time, like this:
use strict;
use warnings;
use Tie::File;
my $filename = 'path_to_your_file';
tie my @array, 'Tie::File', $filename or die 'Unable to open file';
my $index = 0;
my $size = @array;
while (1) {
last if ($index > $size); # Be careful! Try to do a better check than this!
print @array[$index..$index+3];
print "----\n";
$index += 4;
}
(This is just an example, try to write better code)
As the documentation says, the file is not loaded into memory all at once, so it will work even for large files.
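As a possible tightening of the loop check flagged above (a sketch, using the same tied @array), one option is to clamp the slice to the last valid index:
my $i = 0;
while ($i <= $#array) {
    my $end = $i + 3 > $#array ? $#array : $i + 3;  # don't run past the end
    print @array[$i .. $end];
    print "----\n";
    $i += 4;
}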

Parsing the large files in Perl

I need to compare a big file (2 GB, containing 22 million lines) with another file. It takes too long to process using Tie::File, so I have done it with a plain while loop, but the problem remains. See my code below...
use strict;
use Tie::File;
# use warnings;
my @arr;
# tie @arr, 'Tie::File', 'title_Nov19.txt';
# open(IT,"<title_Nov19.txt");
# my @arr=<IT>;
# close(IT);
open(RE,">>res.txt");
open(IN,"<input.txt");
while(my $data=<IN>){
chomp($data);
print"$data\n";
my $occ=0;
open(IT,"<title_Nov19.txt");
while(my $line2=<IT>){
my $line=$line2;
chomp($line);
if($line=~m/\b$data\b/is){
$occ++;
}
}
print RE"$data\t$occ\n";
}
close(IT);
close(IN);
close(RE);
So please help me reduce the processing time...
Lots of things wrong with this.
Aside from the usual (lack of use strict, use warnings, use of 2-argument open(), not checking the open() result, use of global filehandles), the specific problem in your case is that you are opening/reading/closing the second file once for every single line of the first. This is going to be very slow.
I suggest you open the file title_Nov19.txt once, read all the lines into an array or hash or something, then close it; and then you can open the first file, input.txt and walk along that once, comparing to things in the array so you don't have to reopen that second file all the time.
Further, I suggest you read some basic articles on style etc., as your question is likely to gain more attention if it's actually written to vaguely modern standards.
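A rough sketch of that restructuring (assuming the same file names as the question; it caches title_Nov19.txt once and then walks input.txt a single time, quoting each term with \Q...\E):
#!/usr/bin/perl
use strict;
use warnings;

# Read the large title file into memory once.
open my $titles_fh, '<', 'title_Nov19.txt' or die "title_Nov19.txt: $!";
chomp( my @titles = <$titles_fh> );
close $titles_fh;

# Walk input.txt once, counting matches against the cached lines.
open my $in,  '<',  'input.txt' or die "input.txt: $!";
open my $out, '>>', 'res.txt'   or die "res.txt: $!";
while ( my $term = <$in> ) {
    chomp $term;
    my $occ = grep { /\b\Q$term\E\b/i } @titles;
    print {$out} "$term\t$occ\n";
}
close $in;
close $out;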
I tried to build a small example script with a better structure but I have to say, man, your problem description is really very unclear. It's important not to read the whole comparison file each time, as @LeoNerd explained in his answer. Then I use a hash to keep track of the match count:
#!/usr/bin/env perl
use strict;
use warnings;
# cache all lines of the comparison file
open my $comp_file, '<', 'input.txt' or die "input.txt: $!\n";
chomp (my @comparison = <$comp_file>);
close $comp_file;
# prepare comparison
open my $input, '<', 'title_Nov19.txt' or die "title_Nov19.txt: $!\n";
my %count = ();
# compare each line
while (my $title = <$input>) {
chomp $title;
# iterate comparison strings
foreach my $comp (@comparison) {
$count{$comp}++ if $title =~ /\b$comp\b/i;
}
}
# done
close $input;
# output (sorted by count)
open my $output, '>>', 'res.txt' or die "res.txt: $!\n";
foreach my $comp (@comparison) {
print $output "$comp\t$count{$comp}\n";
}
close $output;
Just to get you started... If someone wants to further work on this: these were my test files:
title_Nov19.txt
This is the foo title
Wow, we have bar too
Nothing special here but foo
OMG, the last title! And Foo again!
input.txt
foo
bar
And the result of the program was written to res.txt:
foo 3
bar 1
Here's another option using memowe's (thank you) data:
use strict;
use warnings;
use File::Slurp qw/read_file write_file/;
my %count;
my $regex = join '|', map { chomp; $_ = "\Q$_\E" } read_file 'input.txt';
for ( read_file 'title_Nov19.txt' ) {
my %seen;
!$seen{ lc $1 }++ and $count{ lc $1 }++ while /\b($regex)\b/ig;
}
write_file 'res.txt', map "$_\t$count{$_}\n",
sort { $count{$b} <=> $count{$a} } keys %count;
Numerically-sorted output to res.txt:
foo 3
bar 1
An alternation regex which quotes meta characters (\Q$_\E) is built and used, so only one pass against the large file's lines is needed. The hash %seen is used to ensure that the input words are only counted once per line.
Hope this helps!
Try this:
grep -i -c -w -f input.txt title_Nov19.txt > res.txt