Parsing huge text file in Perl - perl

I have a genome file of about 30 GB, similar to the one below:
>2RHet assembled 2006-03-27 md5sum:88c0ac39ebe4d9ef5a8f58cd746c9810
GAGAGGTGTGGAGAGGAGAGGAGAGGAGTGGTGAGGAGAGGAGAGGTGAG
GAGAGGAGAGGAGAGGAGAGGAATGGAGAGGAGAGGAGTCGAGAGGAGAG
GAGAGGAGTGGTGAGGAGAGGAGAGGAGTGGAGAGGAGACGTGAGGAGTG
GAGAGGAGAGTAGTGGAGAGGAGTGGAGAGGAGAGGAGAGGAGAGGACGG
ATTGTGTTGAGGACGGATTGTGTTACACTGATCGATGGCCGAGAACGAAC
I am trying to parse the file quickly, reading it character by character with the code below, but the characters are not being printed:
open (FH,"<:raw",'genome.txt') or die "cant open the file $!\n";
until ( eof(FH) ) {
$ch = getc(FH);
print "$ch\n";# not printing ch
}
close FH;

Your mistake is forgetting an eof:
until (eof FH) { ... }
But that is very unlikely to be the most efficient solution: Perl is slower than, say … C, so we want as few loop iterations as possible, and as much work done inside perl internals as we can get. This means that reading a file character by character is slow.
Also, use lexical variables (declared with my) instead of globals; this can lead to a performance increase.
Either pick a natural record delimiter (like \n), or read a certain number of bytes:
local $/ = \256; # read 256 bytes at a time.
while (<FH>) {
# do something with the bytes
}
(see perlvar)
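As a rough illustration of keeping the per-character work inside Perl's internals, here is a minimal sketch that reads line by line and lets tr/// do the counting. The helper name count_bases, and the assumption that the goal is to tally A/C/G/T, are mine and not part of the question.

```perl
use strict;
use warnings;

# Hypothetical helper: tally bases in a FASTA-style file, one line per iteration.
sub count_bases {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Can't open '$path': $!";
    my %count = (A => 0, C => 0, G => 0, T => 0);
    while (my $line = <$fh>) {
        next if $line =~ /^>/;    # skip FASTA header lines
        # tr/// with an empty replacement only counts; the scan itself
        # runs inside perl, not in a Perl-level loop over characters.
        $count{A} += ($line =~ tr/A//);
        $count{C} += ($line =~ tr/C//);
        $count{G} += ($line =~ tr/G//);
        $count{T} += ($line =~ tr/T//);
    }
    close $fh;
    return \%count;
}
```

Calling count_bases('genome.txt') would return a hash reference with one total per base; the loop runs once per line rather than once per character.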
You could also shed all the luxuries that open, readline and even getc do for you, and use sysopen and sysread for total control. However, that way lies madness.
# not tested; I will *not* use sysread.
use Fcntl;
use constant NUM_OF_CHARS => 1; # equivalent to getc; set higher maybe.
sysopen FH, "genome.txt", O_RDONLY or die;
my $char;
while (sysread FH, $char, NUM_OF_CHARS, 0) {
print($char .= "\n"); # appending should be better than concatenation.
}
If we have gone that far, using Inline::C is just a small and possibly preferable step.

Related

Perl, find a match and read next line in perl

I would like to use
myscript.pl targetfolder/*
to read some number from ASCII files.
myscript.pl
@list = <@ARGV>;
# Is the whole file or only 1st line is loaded?
foreach $file ( @list ) {
open (F, $file);
}
# is this correct to judge if there is still file to load?
while ( <F> ) {
match_replace()
}
sub match_replace {
# if I want to read the 5th line in downward, how to do that?
# if I would like to read multi lines in multi array[row],
# how to do that?
if ( /^\sName\s+/ ) {
$name = $1;
}
}
I would recommend a thorough read of perlintro - it will give you a lot of the information you need. Additional comments:
Always use strict and warnings. The first will enforce some good coding practices (like for example declaring variables), the second will inform you about potential mistakes. For example, one warning produced by the code you showed would be readline() on unopened filehandle F, giving you the hint that F is not open at that point (more on that below).
@list = <@ARGV>;: This is a bit tricky, I wouldn't recommend it - you're essentially using glob, and expanding targetfolder/* is something your shell should be doing, and if you're on Windows, I'd recommend Win32::Autoglob instead of doing it manually.
foreach ... { open ... }: You're not doing anything with the files once you've opened them - the loop to read from the files needs to be inside the foreach.
"Is the whole file or only 1st line is loaded?" open doesn't read anything from the file, it just opens it and provides a filehandle (which you've named F) that you then need to read from.
I'd strongly recommend you use the more modern three-argument form of open and check it for errors, as well as use lexical filehandles since their scope is not global, as in open my $fh, '<', $file or die "$file: $!";.
"is this correct to judge if there is still file to load?" Yes, while (<$filehandle>) is a good way to read a file line-by-line, and the loop will end when everything has been read from the file. You may want to use the more explicit form while (my $line = <$filehandle>), so that your variable has a name, instead of the default $_ variable - it does make the code a bit more verbose, but if you're just starting out that may be a good thing.
match_replace(): You're not passing any parameters to the sub. Even though this code might still "work", it's passing the current line to the sub through the global $_ variable, which is not a good practice because it will be confusing and error-prone once the script starts getting longer.
if (/^\sName\s+/){$name = $1;}: Since you've named the sub match_replace, I'm guessing you want to do a search-and-replace operation. In Perl, that's called s/search/replacement/, and you can read about it in perlrequick and perlretut. As for the code you've shown, you're using $1, but you don't have any "capture groups" ((...)) in your regular expression - you can read about that in those two links as well.
"if I want to read the 5th line in downward , how to do that ?" As always in Perl, There Is More Than One Way To Do It (TIMTOWTDI). One way is with the range operator .. - you can skip the first through fourth lines by saying next if 1..4; at the beginning of the while loop, this will test those line numbers against the special $. variable that keeps track of the most recently read line number.
"and if I would like to read multi lines in multi array[row], how to do that ?" One way is to use push to add the current line to the end of an array. Since keeping the lines of a file in an array can use up more memory, especially with large files, I'd strongly recommend making sure you think through the algorithm you want to use here. You haven't explained why you would want to keep things in an array, so I can't be more specific here.
So, having said all that, here's how I might have written that code. I've added some debugging code using Data::Dumper - it's always helpful to see the data that your script is working with.
#!/usr/bin/env perl
use warnings;
use strict;
use Data::Dumper; # for debugging
$Data::Dumper::Useqq=1;
for my $file (@ARGV) {
print Dumper($file); # debug
open my $fh, '<', $file or die "$file: $!";
while (my $line = <$fh>) {
next if 1..4;
chomp($line); # remove line ending
match_replace($line);
}
close $fh;
}
sub match_replace {
my ($line) = @_; # get argument(s) to sub
my $name;
if ( $line =~ /^\sName\s+(.*)$/ ) {
$name = $1;
}
print Data::Dumper->Dump([$line,$name],['line','name']); # debug
# ... do more here ...
}
The above code is explicitly looping over @ARGV and opening each file, and I did say above that more verbose code can be helpful in understanding what's going on. I just wanted to point out a nice feature of Perl, the "magic" <> operator (discussed in perlop under "I/O Operators"), which will automatically open the files in @ARGV and read lines from them. (There's just one small thing, if I want to use the $. variable and have it count the lines per file, I need to use the continue block I've shown below, this is explained in eof.) This would be a more "idiomatic" way of writing that first loop:
while (<>) { # reads line into $_
next if 1..4;
chomp; # automatically uses $_ variable
match_replace($_);
} continue { close ARGV if eof } # needed for $. (and range operator)

Size of a string in bytes (windows)

I'm attempting to have a progress indicator when processing a large file by counting the length of each string. Unfortunately, it's counting each line ending "\r\n" as a single character, therefore leading to a drift of my running total.
The following script demonstrates:
use strict;
use warnings;
use autodie;
my $file = 'length_vs_size.txt';
open my $fh, '>', $file;
my $length = 0;
while (<DATA>) {
$length += length;
print $fh $_;
}
close $fh;
my $size = -s $file;
print "Length = $length\n";
print "Size = $size\n";
__DATA__
11...chars
22...chars
33...chars
44...chars
55...chars
Using Strawberry Perl, this outputs:
Length = 55
Size = 60
As one would expect, when viewing the file in a hex editor, each line ending is actually "\r\n", taking two bytes. Therefore the total file size is 5 more than the length.
Is there a way to count the length of bytes of a string?
I've played around with the bytes pragma, and even a little bit of unpack, but no luck yet. I'm hoping for a generalized solution other than just adding 1 to my length call.
On Windows, files have the :crlf encoding layer enabled by default. On reading, this transforms \r\n to \n, and reverses this when writing. This means that scripts which assume Unix line endings won't break quite as often.
If you don't want this behaviour, remove any PerlIO layers by using the :raw pseudolayer:
binmode STDIN, ':raw'; # for one handle
or
use open IO => ':raw'; # for all handles
(of course, this is a simplification, and the actual behavior of :raw is explained in PerlIO)
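A minimal sketch of the same idea on the reading side, assuming the file on disk already contains "\r\n" line endings; the helper name raw_length_total is made up for illustration. With the handle opened in :raw mode no translation happens, so the summed lengths should match what -s reports:

```perl
use strict;
use warnings;

# Hypothetical helper: sum the byte length of every line read through
# a :raw handle, so "\r\n" is counted as the two bytes it occupies.
sub raw_length_total {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Can't open '$path': $!";
    my $total = 0;
    while (my $line = <$fh>) {
        $total += length $line;
    }
    close $fh;
    return $total;
}
```

For a progress indicator, the running total can then be divided by -s $path without drifting.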

Perl printing binary to files - cr lf

I am not a regular Perl programmer and I could not find anything about this in the forum or few books I have.
I am trying to write binary data to a file using the construct:
print filehandle $record
I note that all of my records are truncated when an x'0A' is encountered, so apparently Perl uses the LF as an end-of-record indicator. How can I write the complete records, using for example, a length specifier? I am worried about Perl tampering with other binary "non printables" as well.
thanks
Fritz
You want to use
open(my $fh, '<', $qfn) or die $!;
binmode($fh);
or
open(my $fh, '<:raw', $qfn) or die $!;
to prevent modifications. Same goes for output handles.
This "truncation at 0A" talk makes it sound like you're using readline and expect to do something other than read a line.
Well, actually, it can! You just need to tell readline you want it to read fixed-width records.
local $/ = \4096;
while (my $rec = <$fh>) {
...
}
The other alternative would be to use read.
while (1) {
my $rv = read($fh, my $rec, 4096);
die $! if !defined($rv);
last if !$rv;
...
}
binmode
open
read
readline (aka <> and <$fh>)
$/
Perl is not "tampering" with your writes. If your records are being truncated when they encounter a line feed, then that's a problem with the code that reads them, not the code that writes them. (Unless the format specifies that line feeds must be escaped, in which case the "problem" with the code writing the file is that it doesn't tamper with the data (by escaping line feeds) and instead writes exactly what you tell it to.)
Please provide a small (but runnable) code sample demonstrating your issue, ideally including both reading and writing, along with the actual result and the desired result, and we'll be able to give more specific help.
Note, however, that \n does not map directly to a single data byte (ASCII character) unless you're in binary mode. If the file is being read or written in text mode, \n could be just a CR, just a LF, or a CRLF, depending on the operating system it's being run under.
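If a length specifier is what you are after, one possible approach (a sketch, not the only way) is to prefix each record with its length using pack and to read it back with read. The helper names write_records and read_records and the 4-byte big-endian prefix are assumptions chosen for illustration:

```perl
use strict;
use warnings;

# Write each record as a 4-byte big-endian length followed by the raw bytes.
sub write_records {
    my ($path, @records) = @_;
    open my $out, '>:raw', $path or die "Can't write '$path': $!";
    print {$out} pack('N/a*', $_) for @records;
    close $out or die "Can't close '$path': $!";
}

# Read the records back using the length prefix instead of a line delimiter.
sub read_records {
    my ($path) = @_;
    open my $in, '<:raw', $path or die "Can't read '$path': $!";
    my @records;
    while (1) {
        my $got = read($in, my $header, 4);
        die "Read error: $!" unless defined $got;
        last if $got == 0;                        # clean end of file
        die "Truncated length header" if $got != 4;
        my $len = unpack 'N', $header;
        my $rec = '';
        if ($len > 0) {
            $got = read($in, $rec, $len);
            die "Truncated record" unless defined $got && $got == $len;
        }
        push @records, $rec;
    }
    close $in;
    return @records;
}
```

Because both handles are opened with :raw, bytes such as 0A and 0D pass through untouched, and a record may contain any byte value.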

get last few lines of file stored in variable

How could I get the last few lines of a file that is stored in a variable? On linux I would use the tail command if it was in a file.
1) How can I do this in perl if the data is in a file?
2) How can I do this if the content of the file is in a variable?
To read the end of a file, seek near the end of the file and begin reading. For example,
open my $fh, '<', $file;
seek $fh, -1000, 2;
my @lines = <$fh>;
close $fh;
print "Last 5 lines of $file are: ", @lines[-5 .. -1];
Depending on what is in the file or how many lines you want to look at, you may want to use a different magic number than -1000 above.
You could do something similar with a variable, either
open my $fh, '<', \$the_variable;
seek $fh, -1000, 2;
or just
open my $fh, '<', \substr($the_variable, -1000);
will give you an I/O handle that produces the last 1000 characters in $the_variable.
The File::ReadBackwards module on the CPAN is probably what you want. You can use it thus. This will print the last three lines in the file:
use File::ReadBackwards;
my $bw = File::ReadBackwards->new("some_file");
print reverse map { $bw->readline() } (1 .. 3);
Internally, it seek()s to near the end of the file and looks for line endings, so it should be fairly efficient with memory, even with very big files.
To some extent, that depends how big the file is, and how many lines you want. If it is going to be very big you need to be careful, because reading it all into memory will take a lot longer than just reading the last part of the file.
If it is small, the easiest way is probably to File::Slurp it into memory, split by record delimiters, and keep the last n records. In effect, something like:
# first line if not yet in a string
my $string = File::Slurp::read_file($filename);
my @lines = split(/\n/, $string);
print join("\n", @lines[-10..-1]);
If it is large, too large to fit into memory, you might be better off using file system operations directly. When I did this, I opened the file, used seek() to read the last 4k or so of the file, and repeated backwards until I had enough data to get the number of records I needed.
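The seek-and-read-backwards approach could look roughly like this. It is a sketch under my own assumptions: the helper name tail_lines, the default 4096-byte block, and "\n" as the record delimiter are illustrative choices.

```perl
use strict;
use warnings;

# Hypothetical helper: return the last $n lines of $path, reading
# backwards from the end in $block-byte steps (4096 by default).
sub tail_lines {
    my ($path, $n, $block) = @_;
    $block ||= 4096;
    open my $fh, '<:raw', $path or die "Can't open '$path': $!";
    my $pos = -s $fh;    # start at the end of the file
    my $buf = '';
    # Prepend blocks until the buffer holds more than $n newlines (so the
    # first wanted line is complete) or the start of the file is reached.
    while ($pos > 0 && ($buf =~ tr/\n//) <= $n) {
        my $size = $pos < $block ? $pos : $block;
        $pos -= $size;
        seek $fh, $pos, 0 or die "seek failed: $!";
        my $got = read $fh, my $chunk, $size;
        die "read failed: $!" unless defined $got;
        $buf = $chunk . $buf;
    }
    close $fh;
    my @lines = split /^/, $buf;    # each element keeps its trailing newline
    splice @lines, 0, @lines - $n if @lines > $n;
    return @lines;
}
```

Only the blocks near the end of the file are ever read, so memory use depends on the length of the last few lines rather than on the file size.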
Not a detailed answer, but the question could be a touch more specific.
I know this is an old question, but I found it while looking for a way to search for a pattern in the first and last k lines of a file.
For the tail part, in addition to seek (if the file is seekable), it saves some memory to use a rotating buffer, as follows (returns the last k lines, or less if fewer than $k are available):
my $i = 0; my @a;
while (<$fh>) {
$a[$i++ % $k] = $_;
}
my @tail = splice @a,0,$i % $k;
splice @a,@a,0,@tail;
return @a;
A lot has already been stated on the file side, but if it's already in a string, you can use the following regex:
my ($lines) = $str =~ /
(
(?:
(?:(?<=^)|(?<=\n)) # match beginning of line (separated due to variable lookbehind limitation)
[^\n]*+ # match the line
(?:\n|$) # match the end of the line
){0,5}+ # match at least 0 and at most 5 lines
)$ # match must be from end of the string
/sx; # s = treat string as single line
# x = allow whitespace and comments
This runs extremely fast. Benchmarking shows it to be 40-90% faster than the split/join method (the spread is due to the load on the machine at the time). This is presumably due to fewer memory manipulations. Something you might want to consider if speed is essential. Otherwise, it's just interesting.

How do I efficiently parse a CSV file in Perl?

I'm working on a project that involves parsing a large csv formatted file in Perl and am looking to make things more efficient.
My approach has been to split() the file by lines first, and then split() each line again by commas to get the fields. But this is suboptimal, since at least two passes over the data are required (once to split by lines, then once again for each line). This is a very large file, so cutting processing in half would be a significant improvement to the entire application.
My question is, what is the most time efficient means of parsing a large CSV file using only built in tools?
note: Each line has a varying number of tokens, so we can't just ignore lines and split by commas only. Also we can assume fields will contain only alphanumeric ASCII data (no special characters or other tricks). Also, I don't want to get into parallel processing, although it might work effectively.
edit
It can only involve built-in tools that ship with Perl 5.8. For bureaucratic reasons, I cannot use any third party modules (even if hosted on cpan)
another edit
Let's assume that our solution is only allowed to deal with the file data once it is entirely loaded into memory.
yet another edit
I just grasped how stupid this question is. Sorry for wasting your time. Voting to close.
The right way to do it -- by an order of magnitude -- is to use Text::CSV_XS. It will be much faster and much more robust than anything you're likely to do on your own. If you're determined to use only core functionality, you have a couple of options depending on speed vs robustness.
About the fastest you'll get for pure-Perl is to read the file line by line and then naively split the data:
my $file = 'somefile.csv';
my @data;
open(my $fh, '<', $file) or die "Can't read file '$file' [$!]\n";
while (my $line = <$fh>) {
chomp $line;
my @fields = split(/,/, $line);
push @data, \@fields;
}
This will fail if any fields contain embedded commas. A more robust (but slower) approach would be to use Text::ParseWords. To do that, replace the split with this:
my @fields = Text::ParseWords::parse_line(',', 0, $line);
Here is a version that also respects quotes (e.g. foo,bar,"baz,quux",123 -> "foo", "bar", "baz,quux", "123").
sub csvsplit {
my $line = shift;
my $sep = (shift or ',');
return () unless $line;
my @cells;
$line =~ s/\r?\n$//;
my $re = qr/(?:^|$sep)(?:"([^"]*)"|([^$sep]*))/;
while($line =~ /$re/g) {
my $value = defined $1 ? $1 : $2;
push @cells, (defined $value ? $value : '');
}
return @cells;
}
Use it like this:
while(my $line = <FILE>) {
my @cells = csvsplit($line); # or csvsplit($line, $my_custom_seperator)
}
As other people mentioned, the correct way to do this is with Text::CSV, and either the Text::CSV_XS back end (for FASTEST reading) or Text::CSV_PP back end (if you can't compile the XS module).
If you're allowed to get extra code locally (eg, your own personal modules) you could take Text::CSV_PP and put it somewhere locally, then access it via the use lib workaround:
use lib '/path/to/my/perllib';
use Text::CSV_PP;
Additionally, if there's no alternative to having the entire file read into memory and (I assume) stored in a scalar, you can still read it like a file handle, by opening a handle to the scalar:
my $data = stupid_required_interface_that_reads_the_entire_giant_file();
open my $text_handle, '<', \$data
or die "Failed to open the handle: $!";
And then read via the Text::CSV interface:
my $csv = Text::CSV->new ( { binary => 1 } )
or die "Cannot use CSV: ".Text::CSV->error_diag ();
while (my $row = $csv->getline($text_handle)) {
...
}
or the sub-optimal split on commas:
while (my $line = <$text_handle>) {
my @csv = split /,/, $line;
... # regular work as before.
}
With this method, the data is only copied a bit at a time out of the scalar.
You can do it in one pass if you read the file line by line. There is no need to read the whole thing into memory at once.
#(no error handling here!)
open FILE, $filename;
while (<FILE>) {
@csv = split /,/;
# now parse the csv however you want.
}
Not really sure if this is significantly more efficient though, Perl is pretty fast at string processing.
YOU NEED TO BENCHMARK YOUR IMPORT to see what is causing the slowdown. If for example, you are doing a db insertion that takes 85% of the time, this optimization won't work.
Edit
Although this feels like code golf, the general algorithm is to read the whole file or part of the file into a buffer.
Iterate byte by byte through the buffer until you find a CSV delimiter or a newline.
When you find a delimiter, increment your column count.
When you find a newline, increment your row count.
If you hit the end of your buffer, read more data from the file and repeat.
That's it. But reading a large file into memory is really not the best way; see my original answer for the normal way this is done.
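The steps above can be sketched as follows. This is only an illustration of the buffer-scanning idea under my own assumptions: the helper name scan_csv_counts and the 64 KB default buffer are made up, and the code merely counts delimiters and newlines rather than collecting fields.

```perl
use strict;
use warnings;

# Hypothetical helper: scan a file buffer by buffer, byte by byte,
# counting comma delimiters and newlines along the way.
sub scan_csv_counts {
    my ($path, $bufsize) = @_;
    $bufsize ||= 65536;
    open my $fh, '<:raw', $path or die "Can't open '$path': $!";
    my ($delims, $rows) = (0, 0);
    while (1) {
        my $got = read $fh, my $buf, $bufsize;
        die "read failed: $!" unless defined $got;
        last if $got == 0;                  # end of file
        for my $i (0 .. $got - 1) {
            my $ch = substr $buf, $i, 1;
            if    ($ch eq ',')  { $delims++ }
            elsif ($ch eq "\n") { $rows++ }
        }
    }
    close $fh;
    return ($delims, $rows);
}
```

In Perl a per-byte loop like this is likely to be slower than letting split do the scanning, which is one more reason to prefer the line-by-line version.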
Assuming that you have your CSV file loaded into $csv variable and that you do not need text in this variable after you successfully parsed it:
my $result=[[]];
while($csv=~s/(.*?)([,\n]|$)//s) {
push @{$result->[-1]}, $1;
push @$result, [] if $2 eq "\n";
last unless $2;
}
If you need to have $csv untouched:
local $_;
my $result=[[]];
foreach($csv=~/(?:(?<=[,\n])|^)(.*?)(?:,|(\n)|$)/gs) {
next unless defined $_;
if($_ eq "\n") {
push @$result, []; }
else {
push @{$result->[-1]}, $_; }
}
Answering within the constraints imposed by the question, you can still cut out the first split by slurping your input file into an array rather than a scalar:
open(my $fh, '<', $input_file_path) or die;
my @all_lines = <$fh>;
for my $line (@all_lines) {
chomp $line;
my @fields = split ',', $line;
process_fields(@fields);
}
And even if you can't install (the pure-Perl version of) Text::CSV, you may be able to get away with pulling up its source code on CPAN and copy/pasting the code into your project...