<> operator buffer size - perl

I have a file with very long lines that I need to process, and I found that the process gets stuck (or becomes really slow), either because the buffer is not big enough or because handling a very long line simply takes a while. Here's a code sample:
open FH, "<$fname" or die "...";
while (<FH>) {
    my @arr = split //, $_;
    pop @arr;
    pop @arr;
    ... for some "limited small portion of the string length" number of times ...
    pop @arr;
    if ($arr[-1] eq '0') {
        print "done!\n";
        last;
    }
    push @big_arr, join('', @arr);
}
The line processing is not "heavy".
I looked for something to solve this and came across PerlIO::buffersize, but it looks like it hasn't been maintained for a while now, and I don't want to use a module at version 0.001. How can I modify the <> operator buffer size? Or alternatively, is there any way to know the line length before reading it with <>?

It may be that what you need is this:
$/ - it can be set to a reference to a number, to read that many bytes from the file at a time:
Setting $/ to a reference to an integer, scalar containing an integer, or scalar that's convertible to an integer will attempt to read records instead of lines, with the maximum record size being the referenced integer number of characters.
Source: perlvar
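For instance, a minimal sketch (the file name is made up) of reading fixed-size records this way:

use strict;
use warnings;

# Hypothetical file name; read 64 KiB records instead of lines.
open my $fh, '<', 'big_file.txt' or die "Can't open big_file.txt: $!";

local $/ = \65536;    # each <$fh> now returns at most 64 KiB

my $total = 0;
while (my $record = <$fh>) {
    # A logical line may span two records, so any per-line logic
    # has to handle the record boundaries itself.
    $total += length $record;
}
close $fh;
print "Read $total characters\n";

Each read hands back at most that many characters, regardless of how long the lines are; whether that actually helps depends on what you do with the record afterwards.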

How can I modify the <> operator buffer size?
<> reads into a scalar that can grow to any size, so I think you are referring to the size of the buffer passed to the read system call.
Before 5.14, Perl read from file handles in 4 KiB chunks. 5.14 made this configurable, with a default of 8 KiB.
$ perl -e'print("x" x 9_999, "\n") for 1..2' >large_lines
$ strace 5.10.1t/bin/perl -e'my $line = <>' large_lines 2>&1 | grep read.*xxx
read(3, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4096) = 4096
read(3, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4096) = 4096
read(3, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4096) = 4096
$ strace 5.14.2t/bin/perl -e'my $line = <>' large_lines 2>&1 | grep read.*xxx
read(3, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 8192) = 8192
read(3, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 8192) = 8192
It can only be configured when perl is built, using the following command
./Configure -Accflags=-DPERLIOBUF_DEFAULT_BUFSIZ=8192
This applies to all buffered reading functions, including read, readline (for which <> is an alias), readpipe and eof, but not sysread.
Note that setting $/ to a reference to a number will cause readline (<>) to act as read, which is still buffered.
$ strace perl -e'$/ = \8193; my $block = <>' large_lines 2>&1 | grep read.*xxx
read(3, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 8192) = 8192
read(3, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 8192) = 8192
If you actually want to perform a single read system call, you need to use sysread.
$ strace perl -e'sysread(STDIN, $buf, 8193)' <large_lines 2>&1 | grep read.*xxx
read(0, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 8193) = 8193

Altering Perl's read buffer size is unlikely to make any significant difference to the speed of your program, and the impact you are seeing is much more likely to be a result of the longer read time from the disk drive itself. Take a look at Perl Read-Ahead I/O Buffering on perlmonks.org
Furthermore, implementing your own buffering by using read or setting the record separator $/ to a fixed size is more than likely to slow down your program, as you still have to separate what you have read into lines of data, but now you have to do it in Perl code instead of letting perl do it for you in C.
Note also that changing $/ to a fixed record size will still use Perl's standard (probably 8KB) buffer. The only difference is that the amount of data handed back to you will be determined by a byte count instead of the position of a separator string.
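To illustrate the point, here is a rough sketch (the file name and the 8 KiB record size are my own assumptions) of what do-it-yourself buffering with $/ set to a fixed size looks like; the inner loop that reassembles lines is exactly the work readline would otherwise do for you in C:

use strict;
use warnings;

my $fname = 'input.txt';    # hypothetical file name
open my $fh, '<', $fname or die "Can't open $fname: $!";

local $/ = \8192;           # fixed-size records, still read through perl's buffer

my ($pending, $count) = ('', 0);
while (my $chunk = <$fh>) {
    $pending .= $chunk;
    # Peel complete lines off the front of the buffer; the remainder
    # (a partial line) waits for the next record.
    while ($pending =~ s/\A(.*?\n)//) {
        my $line = $1;
        $count++;           # ... process $line here ...
    }
}
$count++ if length $pending;    # last line had no trailing newline
close $fh;
print "$count lines\n";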

Related

Is there a way to get the current file handle that would be used with the <> operator in perl?

I've seen that close ARGV can close the currently processed file, but it would seem that ARGV isn't actually a file handle, so I can't use it in a read call. Is there any way to get the current file handle, or am I going to have to explicitly open the files myself?
... but it would seem that ARGV isn't actually a file handle, so I can't use it in a read call
ARGV is a filehandle and it can be used with read.
To cite from perlvar:
... a plain filehandle corresponding to the last file opened by <>
So it is a filehandle and it can be used with read. But you need to use <> first so that the file actually gets opened. And it will not magically continue with the next file as <> would do.
To test simply do (UNIX shell syntax, you might need to adapt this for Windows):
perl -e '<>; read(ARGV, my $buf, 10); print $buf' file
The <> will open the given file and read the first line. The read then will read the next 10 bytes from the same file.
<> is short for readline( ARGV ).
The file handle used is ARGV.
However, readline has special code to open/reopen ARGV which read doesn't have.
You can, however, achieve a read using readline by manipulating $/.
$ echo abcdef | perl -Mv5.14 -e'local $/ = \2; $_ = <>; say "<<$_>>";'
<<ab>>
$ perl -Mv5.14 -e'local $/ = \2; $_ = <>; say "<<$_>>";' <( echo abcdef )
<<ab>>
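If you would rather not rely on the <> magic at all, you can also loop over @ARGV and open each file yourself, which gives read an ordinary lexical filehandle to work with. A minimal sketch (not tested against your setup):

use strict;
use warnings;

for my $file (@ARGV) {
    open my $fh, '<', $file or do { warn "Can't open $file: $!"; next };
    my ($bytes, $buf) = (0, '');
    # An ordinary lexical filehandle, so read() needs no <> magic.
    while (my $n = read($fh, $buf, 8192)) {
        $bytes += $n;       # ... process $buf here ...
    }
    close $fh;
    print "$file: $bytes bytes\n";
}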

Need serious help optimizing script for memory use

[I've changed the code below to reflect what I'm currently running after having implemented people's suggestions]
Let me preface this by saying that I'm not a programmer, but just someone who uses Perl to get certain text processing things done the best I can.
I've got a script that produces frequency lists. It essentially does the following:
Reads in lines from a file having the format $frequency \t $item. Any given $item may occur multiple times, with different values for $frequency.
Eliminates certain lines depending on the content of $item.
Sums the frequencies of all identical $items, regardless of their case, and merges these entries into one.
Performs a reverse natural sort on the resulting array.
Prints the results to an output file.
The script works perfectly well on input files of up to about 1 GB in size. However, I have files of up to 6 GB that I need to process, and this has proven impossible due to memory use. Though my machine has 32 GB of RAM, uses zRam, and has 64 GB of swap on SSD just for this purpose, the script will inevitably be killed by the Linux OOM service when combined memory use hits something around 70 GB (of the 92 GB total).
The real issue, of course, is the vast amount of memory my script is using. I could try adding even more swap, but I've increased it twice now and it just gets eaten up.
So I need to somehow optimize the script. And that's what I'm here asking for help with.
Below is the actual version of the script that I'm running now, with some hopefully useful comments retained.
I'd be enormously grateful if your comments and suggestions contained enough code to let me more or less drop it into the existing script. As I said above, I'm not a programmer by trade, and even something as apparently simple as piping the text being processed through some module or another would throw me for a serious curve.
Thanks in advance!
(By the way, I'm using Perl 5.22.1 x64 on Ubuntu 16.04 LTS x64.)
#!/usr/bin/env perl
use strict;
use warnings;
use warnings qw(FATAL utf8);
use Getopt::Long qw(:config no_auto_abbrev);
# DEFINE VARIABLES
my $delimiter = "\t";
my $split_char = "\t";
my $input_file_name = "";
my $output_file_name = "";
my $in_basename = "";
my $frequency = 0;
my $item = "";
# READ COMMAND LINE OPTIONS
GetOptions (
    "input|i=s"  => \$input_file_name,
    "output|o=s" => \$output_file_name,
);
# INSURE AN INPUT FILE IS SPECIFIED
if ( $input_file_name eq "" ) {
    die
      "\nERROR: You must provide the name of the file to be processed with the -i switch.\n";
}
# IF NO OUTPUT FILE NAME IS SPECIFIED, GENERATE ONE AUTOMATICALLY
if ( $output_file_name eq "" ) {
    # STRIP EXTENSION FROM INPUT FILE NAME
    $in_basename = $input_file_name;
    $in_basename =~ s/(.+)\.(.+)/$1/;
    # GENERATE OUTPUT FILE NAME FROM INPUT BASENAME
    $output_file_name = "$in_basename.output.txt";
}
# READ INPUT FILE
open( INPUTFILE, '<:encoding(utf8)', $input_file_name )
    or die "\nERROR: Can't open input file ($input_file_name): $!";
# PRINT INPUT AND OUTPUT FILE INFO TO TERMINAL
print STDOUT "\nInput file:\t$input_file_name";
print STDOUT "\nOutput file:\t$output_file_name";
print STDOUT "\n\n";
# PROCESS INPUT FILE LINE BY LINE
my %F;
while (<INPUTFILE>) {
    chomp;
    # PUT FREQUENCY IN $frequency AND THEN PUT ALL OTHER COLUMNS INTO $item
    ( $frequency, $item ) = split( /$split_char/, $_, 2 );
    # Skip lines with empty or undefined content, or spaces only in $item
    next if not defined $frequency or $frequency eq '' or not defined $item or $item =~ /^\s*$/;
    # PROCESS INPUT LINES
    $F{ lc($item) } += $frequency;
}
close INPUTFILE;
# OPEN OUTPUT FILE
open( OUTPUTFILE, '>:encoding(utf8)', "$output_file_name" )
    || die "\nERROR: The output file \($output_file_name\) couldn't be opened for writing!\n";
# PRINT OUT HASH WITHOUT SORTING
foreach my $item ( keys %F ) {
    print OUTPUTFILE $F{$item}, "\t", $item, "\n";
}
close OUTPUTFILE;
exit;
Below is some sample input from the source file. It's tab-separated, and the first column is $frequency, while all the rest together is $item.
2 útil volver a valdivia
8 útil volver la vista
1 útil válvula de escape
1 útil vía de escape
2 útil vía fax y
1 útil y a cabalidad
43 útil y a el
17 útil y a la
1 útil y a los
21 útil y a quien
1 útil y a raíz
2 útil y a uno
UPDATE: In my tests, a hash takes 2.5 times the memory that its data "alone" would take. However, the program size for me is consistently 3-4 times as large as its variables. This would turn a 6.3Gb data file into a ~15Gb hash, for a ~60Gb program, just as reported in comments.
So 6.3Gb of data comes to roughly a 60Gb process, so to say. This improved the starting situation enough to work for the current problem, but it is clearly not a solution. See the (updated) Another approach below for a way to run this processing without loading the whole hash.
There is nothing obvious to lead to an order-of-magnitude memory blowup. However, small errors and inefficiencies can add up, so let's first clean up. See other approaches at the end.
Here is a simple re-write of the core of the program, to try first.
# ... set filenames, variables
open my $fh_in, '<:encoding(utf8)', $input_file_name
    or die "\nERROR: Can't open input file ($input_file_name): $!";

my %F;
while (<$fh_in>) {
    chomp;
    s/^\s*//;    # trim leading space
    my ($frequency, $item) = split /$split_char/, $_, 2;
    # Skip lines with empty or undefined content, or spaces only in $item
    next if not defined $frequency or $frequency eq ''
         or not defined $item      or $item =~ /^\s*$/;
    # ... increment counters and aggregates and add to hash
    # (... any other processing?)
    $F{ lc($item) } += $frequency;
}
close $fh_in;

# Sort and print to file
# (Or better write: "value key-length key" and sort later. See comments)
open my $fh_out, '>:encoding(utf8)', $output_file_name
    or die "\nERROR: Can't open output file ($output_file_name): $!";

foreach my $item ( sort {
        $F{$b} <=> $F{$a} || length($b) <=> length($a) || $a cmp $b
    } keys %F )
{
    print $fh_out $F{$item}, "\t", $item, "\n";
}
close $fh_out;
A few comments, let me know if more is needed.
Always add $! to error-related prints, to see the actual error. See perlvar.
Use lexical filehandles (my $fh rather than IN), it's better.
If layers are specified in the three-argument open then layers set by open pragma are ignored, so there should be no need for use open ... (but it doesn't hurt either).
The sort here has to at least copy its input, and with multiple conditions more memory is needed.
That should take no more memory than 2-3 times the hash size. While initially I suspected a memory leak (or excessive data copying), by reducing the program to basics it was shown that the "normal" program size is the (likely) culprit. This can be tweaked by devising custom data structures and packing the data economically.
Of course, all this is fiddling if your files are going to grow larger and larger, as they tend to do.
Another approach is to write out the file unsorted and then sort using a separate program. That way you don't combine the possible memory swelling from processing with final sorting.
But even this pushes the limits, due to the greatly increased memory footprint as compared to the data, since a hash takes 2.5 times the data size and the whole program is yet another 3-4 times as large.
Then find an algorithm to write the data line by line to the output file. That is simple to do here, since by the shown processing we only need to accumulate frequencies for each item.
open my $fh_out, '>:encoding(utf8)', $output_file_name
    or die "\nERROR: Can't open output file ($output_file_name): $!";

my $cumulative_freq;
while (<$fh_in>) {
    chomp;
    s/^\s*//;    # trim leading space only
    my ($frequency, $item) = split /$split_char/, $_, 2;
    # Skip lines with empty or undefined content, or spaces only in $item
    next if not defined $frequency or $frequency eq ''
         or not defined $item      or $item =~ /^\s*$/;

    $cumulative_freq += $frequency;    # would-be hash value

    # Add a sort criterion, $item's length, helpful for later sorting
    say $fh_out $cumulative_freq, "\t", length $item, "\t", lc($item);
    #say $fh_out $cumulative_freq, "\t", lc($item);
}
close $fh_out;
close $fh_out;
Now we can use the system's sort, which is optimized for very large files. Since we wrote a file with all sorting columns, value key-length key, run in a terminal
sort -nr -k1,1 -k2,2 output_file_name | cut -f1,3- > result
The command sorts numerically by the first field and then by the second (and then by the third field itself), and reverses the order. This is piped into cut, which pulls out the first and third fields from STDIN (with tab as the default delimiter), which is the needed result.
A systemic solution is to use a database, and a very convenient one is DBD::SQLite.
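As a rough sketch of that route (untested; the database file name, table name and input file name are all made up), the aggregation can be pushed into SQLite so the item/frequency pairs never have to live in a Perl hash:

use strict;
use warnings;
use DBI;

# Hypothetical file and table names; the column layout matches the
# "$frequency \t $item" input described above.
my $dbh = DBI->connect('dbi:SQLite:dbname=freq.db', '', '',
    { RaiseError => 1, AutoCommit => 0, sqlite_unicode => 1 });

$dbh->do('CREATE TABLE IF NOT EXISTS freq (item TEXT PRIMARY KEY, n INTEGER)');

my $upd = $dbh->prepare('UPDATE freq SET n = n + ? WHERE item = ?');
my $ins = $dbh->prepare('INSERT INTO freq (item, n) VALUES (?, ?)');

open my $fh_in, '<:encoding(utf8)', 'input.txt' or die "Can't open input.txt: $!";
while (<$fh_in>) {
    chomp;
    my ($frequency, $item) = split /\t/, $_, 2;
    next if not defined $frequency or $frequency eq ''
         or not defined $item      or $item =~ /^\s*$/;
    # Try to bump an existing row; insert it if it wasn't there yet.
    my $hit = $upd->execute($frequency, lc $item);
    $ins->execute(lc $item, $frequency) if $hit == 0;
}
close $fh_in;
$dbh->commit;    # one transaction for the whole load keeps it fast

# Let SQLite produce the sorted output instead of an in-memory Perl sort.
binmode STDOUT, ':encoding(UTF-8)';
my $sth = $dbh->prepare('SELECT n, item FROM freq ORDER BY n DESC');
$sth->execute;
while (my ($n, $item) = $sth->fetchrow_array) {
    print "$n\t$item\n";
}
$dbh->disconnect;

Keeping AutoCommit off and committing once at the end avoids a transaction per row, which would otherwise dominate the run time.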
I used Devel::Size to see memory used by variables.
Sorting input requires keeping all input in memory, so you can't do everything in a single process.
However, sorting can be factored: you can easily sort your input into sortable buckets, then process the buckets, and produce the correct output by combining the outputs in reversed-sorted bucket order. The frequency counting can be done per bucket as well.
So just keep the program you have, but add something around it:
partition your input into buckets, e.g. by the first character or the first two characters
run your program on each bucket
concatenate the output in the right order
Your maximum memory consumption will be slightly more than what your original program takes on the biggest bucket. So if your partitioning is well chosen, you can arbitrarily drive it down.
You can store the input buckets and per-bucket outputs to disk, but you can even connect the steps directly with pipes (creating a subprocess for each bucket processor) - this will create a lot of concurrent processes, so the OS will be paging like crazy, but if you're careful, it won't need to write to disk.
A drawback of this way of partitioning is that your buckets may end up being very uneven in size. An alternative is to use a partitioning scheme that is guaranteed to distribute the input equally (e.g. by putting every nth line of input into the nth bucket) but that makes combining the outputs more complex.
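To make the partitioning step concrete, here is a minimal sketch (the file names and the bucket-by-first-character choice are my own assumptions, not tested on real data); the existing script is then run on each bucket file and the per-bucket outputs concatenated:

use strict;
use warnings;

# Bucket by the first character of $item, which keeps all occurrences
# of an item (regardless of case) in the same bucket.
open my $fh_in, '<:encoding(utf8)', 'input.txt' or die "Can't open input.txt: $!";

my %bucket_fh;
while (my $line = <$fh_in>) {
    my ($frequency, $item) = split /\t/, $line, 2;
    next if not defined $item or $item =~ /^\s*$/;

    my $key = lc substr $item, 0, 1;
    $key =~ s/[^a-z0-9]/_/g;          # keep bucket file names simple

    $bucket_fh{$key} //= do {
        open my $fh, '>:encoding(utf8)', "bucket_$key.txt"
            or die "Can't open bucket_$key.txt: $!";
        $fh;
    };
    print { $bucket_fh{$key} } $line;
}
close $fh_in;
close $_ for values %bucket_fh;

# Then run the original script on each bucket_*.txt and concatenate
# the per-bucket outputs in the desired order.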

workaround for SPLIT 1000 file limit?

I need to split a few large files into specifically sized smaller files, with 500-5000 smaller files output. I'm using split with a -b designation, so I'm using a manual workaround when reaching the split 1000-file limit. Is there another UNIX command or Perl one-liner that will accomplish this?
Are you sure about the 1000 file limit?
The original split had no such limit, and there's no limit for GNU or BSD version of split. Maybe you're confusing the suffix length with some sort of limit. On BSD, the suffix starts at .aaa and goes all of the way to .zzz which is over 17,000 files.
You can use the -a flag to adjust the suffix size if the three character suffix isn't enough.
$ split -a 5 $file
If I try to create lots of files, I get
$ perl -e'print "x"x5000' | split -b 1 && echo done.
split: output file suffixes exhausted
By default, the suffix length is two, which allows for 26² = 676 parts. Increasing it to three allows for 26³ = 17,576 parts.
$ perl -e'print "x"x5000' | split -b 1 -a 3 && echo done.
done.
One can control Perl's notion of an input record by setting $/:
Setting $/ to a reference to an integer, scalar containing an integer, or scalar that's convertible to an integer will attempt to read records instead of lines, with the maximum record size being the referenced integer number of characters. So this:
local $/ = \32768; # or \"32768", or \$var_containing_32768
open my $fh, "<", $myfile or die $!;
local $_ = <$fh>;
will read a record of no more than 32768 characters from $fh.
So to split a large file into smaller files no larger than 1024 bytes, one could use the following:
use strict;
use warnings;
$/ = \1024;
my $filename = 'A';
while (<>) {
    open my $fh, '>', ($filename++ . '.txt') or die $!;
    print $fh $_;
    close $fh or die $!;
}

Size of a string in bytes (windows)

I'm attempting to have a progress indicator when processing a large file by counting the length of each string. Unfortunately, it counts each line ending "\r\n" as a single character, which leads to drift in my running total.
The following script demonstrates:
use strict;
use warnings;
use autodie;
my $file = 'length_vs_size.txt';
open my $fh, '>', $file;
my $length = 0;
while (<DATA>) {
    $length += length;
    print $fh $_;
}
close $fh;
my $size = -s $file;
print "Length = $length\n";
print "Size = $size\n";
__DATA__
11...chars
22...chars
33...chars
44...chars
55...chars
Using Strawberry Perl, this outputs:
Length = 55
Size = 60
As one would expect, when viewing the file in a hex editor, each line ending is actually "\r\n", taking two bytes. Therefore the total file size is 5 more than the length.
Is there a way to count the length of a string in bytes?
I've played around with the bytes pragma, and even a little bit of unpack, but no luck yet. I'm hoping for a generalized solution other than just adding 1 to my length call.
On Windows, file handles have the :crlf layer enabled by default. On reading, this transforms \r\n to \n, and it reverses this when writing. This means that scripts which assume Unix line endings won't break quite as often.
If you don't want this behaviour, remove any PerlIO layers by using the :raw pseudolayer:
binmode STDIN, ':raw'; # for one handle
or
use open IO => ':raw'; # for all handles
(of course, this is a simplification, and the actual behavior of :raw is explained in PerlIO)
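Applied to the progress-indicator use case, a rough sketch (the input file name is made up) that opens the handle with :raw so the running byte count matches what -s reports:

use strict;
use warnings;

my $input = 'big_input.txt';            # hypothetical file name
my $size  = -s $input or die "Can't size $input\n";

# :raw disables the default :crlf layer, so length() counts the same
# bytes that -s counted and the running total doesn't drift.
open my $fh, '<:raw', $input or die "Can't open $input: $!";

my $read = 0;
while (my $line = <$fh>) {
    $read += length $line;
    $line =~ s/\r?\n\z//;               # do the CRLF/LF cleanup ourselves
    # ... process $line ...
    printf "\rProgress: %3d%%", 100 * $read / $size;
}
print "\n";
close $fh;

The trade-off is that the \r is now yours to strip, as done above.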

Parsing huge text file in Perl

I have a genome file of about 30 GB, similar to the sample below:
>2RHet assembled 2006-03-27 md5sum:88c0ac39ebe4d9ef5a8f58cd746c9810
GAGAGGTGTGGAGAGGAGAGGAGAGGAGTGGTGAGGAGAGGAGAGGTGAG
GAGAGGAGAGGAGAGGAGAGGAATGGAGAGGAGAGGAGTCGAGAGGAGAG
GAGAGGAGTGGTGAGGAGAGGAGAGGAGTGGAGAGGAGACGTGAGGAGTG
GAGAGGAGAGTAGTGGAGAGGAGTGGAGAGGAGAGGAGAGGAGAGGACGG
ATTGTGTTGAGGACGGATTGTGTTACACTGATCGATGGCCGAGAACGAAC
I am trying to parse the file and get my task done fast, using the code below to read it character by character, but the character is not getting printed:
open (FH,"<:raw",'genome.txt') or die "cant open the file $!\n";
until ( eof(FH) ) {
    $ch = getc(FH);
    print "$ch\n";    # not printing ch
}
close FH;
Your mistake is forgetting an eof:
until (eof FH) { ... }
But that is very unlikely to be the most efficient solution: Perl is slower than, say … C, so we want as few loop iterations as possible, and as much work done inside perl internals as we can get. This means that reading a file character by character is slow.
Also, use lexical variables (declared with my) instead of globals; this can lead to a performance increase.
Either pick a natural record delimiter (like \n), or read a certain number of bytes:
local $/ = \256; # read 256 bytes at a time.
while (<FH>) {
    # do something with the bytes
}
(see perlvar)
You could also shed all the luxuries that open, readline and even getc do for you, and use sysopen and sysread for total control. However, that way lies madness.
# not tested; I will *not* use sysread.
use Fcntl;
use constant NUM_OF_CHARS => 1; # equivalent to getc; set higher maybe.
sysopen FH, "genome.txt", O_RDONLY or die;
my $char;
while (sysread FH, $char, NUM_OF_CHARS, 0) {
    print($char .= "\n"); # appending should be better than concatenation.
}
If we have gone that far, using Inline::C is just a small and possibly preferable step.