workaround for SPLIT 1000 file limit? - perl

I need to split a few large files into specifically sized smaller files, with 500-5000 smaller files output. I'm using split with the -b option, and I've been resorting to a manual workaround whenever I hit split's 1000 file limit. Is there another UNIX command or Perl one-liner that will accomplish this?

Are you sure about the 1000 file limit?
The original split had no such limit, and there's no limit in the GNU or BSD versions of split either. Maybe you're confusing the suffix length with some sort of limit. On BSD, the suffix starts at .aaa and goes all the way to .zzz, which is over 17,000 files.
You can use the -a flag to adjust the suffix length if the three-character suffix isn't enough.
$ split -a 5 $file

If I try to create lots of files, I get
$ perl -e'print "x"x5000' | split -b 1 && echo done.
split: output file suffixes exhausted
By default, the suffix length is two, which allows for 26^2 = 676 parts. Increasing it to three allows for 26^3 = 17,576 parts:
$ perl -e'print "x"x5000' | split -b 1 -a 3 && echo done.
done.

One can control Perl's notion of an input record by setting $/:
Setting $/ to a reference to an integer, scalar containing an integer,
or scalar that's convertible to an integer will attempt to read
records instead of lines, with the maximum record size being the
referenced integer number of characters. So this:
local $/ = \32768; # or \"32768", or \$var_containing_32768
open my $fh, "<", $myfile or die $!;
local $_ = <$fh>;
will read a record of no more than 32768 characters from $fh.
So to split a large file into smaller files no larger than 1024 bytes, one could use the following:
use strict;
use warnings;

$/ = \1024;    # read records of at most 1024 bytes

my $filename = 'A';
while (<>) {
    # 'A' .. 'Z', 'AA', 'AB', ... via Perl's magic string increment
    open my $fh, '>', $filename++ . '.txt' or die $!;
    print $fh $_;
    close $fh or die $!;
}
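Since the question asks for a Perl one-liner: the same approach compresses into one line (a rough sketch; bigfile is a placeholder, and the 1024-byte record size and .txt naming are just the example values from above):
$ perl -ne 'BEGIN { $/ = \1024; $name = "A" } open my $fh, ">", $name++ . ".txt" or die $!; print $fh $_' bigfile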

Related

<> operator buffer size

I have a file with very long lines that I need to process, and the processing gets stuck (or becomes really slow), either because the buffer is not big enough or because handling a very long line simply takes a while. Here's a code sample:
open FH, "<$fname" or die "...";
while (<FH>) {
my #arr = split //, $_;
pop #arr;
pop #arr;
... for some "limited small portion of the string length" number of times ...
pop #arr;
if ($arr[-1] eq '0') {
print "done!\n";
last;
}
push #big_arr, join('', #arr);
}
The line processing is not "heavy".
I looked for something to solve it and came across PerlIO::buffersize, but it looks like it hasn't been maintained for a while and I don't want to use a module at version 0.001. How can I modify the <> operator buffer size? Or alternatively, is there any way to know the line length before reading it with <>?
It may be that what you need is this:
$/ - can be set to a reference to an integer, giving a maximum number of bytes to read from the file at a time.
Setting $/ to a reference to an integer, scalar containing an integer, or scalar that's convertible to an integer will attempt to read records instead of lines, with the maximum record size being the referenced integer number of characters.
Source: perlvar
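For example, a minimal sketch of what such a record read looks like (the file name is made up):
local $/ = \4096;                               # read records of at most 4096 bytes
open my $fh, '<', 'huge_line_file' or die $!;   # hypothetical file
while (my $record = <$fh>) {
    # $record holds up to 4096 characters, regardless of where the newlines are
}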
How can I modify the <> operator buffer size?
<> reads into a scalar that can grow to any size, so I think you are referring to the size of the buffer passed to the read system call.
Before 5.14, Perl read from file handles in 4 KiB chunks. 5.14 made this configurable, with a default of 8 KiB.
$ perl -e'print("x" x 9_999, "\n") for 1..2' >large_lines
$ strace 5.10.1t/bin/perl -e'my $line = <>' large_lines 2>&1 | grep read.*xxx
read(3, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4096) = 4096
read(3, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4096) = 4096
read(3, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4096) = 4096
$ strace 5.14.2t/bin/perl -e'my $line = <>' large_lines 2>&1 | grep read.*xxx
read(3, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 8192) = 8192
read(3, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 8192) = 8192
It can only be configured when perl is built, using the following command
./Configure -Accflags=-DPERLIOBUF_DEFAULT_BUFSIZ=8192
This applies to all buffered reading functions, including read, readline (for which <> is an alias), readpipe and eof, but not sysread.
Note that setting $/ to a reference to a number will cause readline (<>) to act as read, which is still buffered.
$ strace perl -e'$/ = \8193; my $block = <>' large_lines 2>&1 | grep read.*xxx
read(3, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 8192) = 8192
read(3, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 8192) = 8192
If you actually want to perform a single read system call, you need to use sysread.
$ strace perl -e'sysread(STDIN, $buf, 8193)' <large_lines 2>&1 | grep read.*xxx
read(0, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 8193) = 8193
Altering Perl's read buffer size is unlikely to make any significant difference to the speed of your program, and the impact you are seeing is much more likely to be a result of the longer read time from the disk drive itself. Take a look at Perl Read-Ahead I/O Buffering on perlmonks.org
Furthermore, implementing your own buffering by using read or setting the record separator $/ to a fixed size is more than likely to slow down your program, as you still have to separate what you have read into lines of data, but now you have to do it in Perl code instead of letting perl do it for you in C.
Note also that changing $/ to a fixed record size will still use Perl's standard, probably 8 KB, buffer. The only difference is that the amount of data handed back to you will be determined by a byte count instead of the position of a separator string.
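To illustrate those last two points, here is a rough sketch (not a recommendation) of what manual buffering with a fixed record size looks like; the leftover handling below is exactly the work perl normally does for you in C:
use strict;
use warnings;

local $/ = \65536;                        # 64 KiB records instead of lines
open my $fh, '<', 'large_lines' or die $!;

my $leftover = '';
while (my $chunk = <$fh>) {
    $chunk = $leftover . $chunk;
    my @lines = split /\n/, $chunk, -1;   # -1 keeps a trailing empty field
    $leftover = pop @lines;               # possibly incomplete last line
    for my $line (@lines) {
        # process $line here
    }
}
# process $leftover here if anything is left over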

Size of a string in bytes (windows)

I'm attempting to show a progress indicator while processing a large file by counting the length of each string. Unfortunately, length counts each "\r\n" line ending as a single character, which makes my running total drift.
The following script demonstrates:
use strict;
use warnings;
use autodie;
my $file = 'length_vs_size.txt';
open my $fh, '>', $file;
my $length = 0;
while (<DATA>) {
    $length += length;
    print $fh $_;
}
close $fh;
my $size = -s $file;
print "Length = $length\n";
print "Size = $size\n";
__DATA__
11...chars
22...chars
33...chars
44...chars
55...chars
Using Strawberry Perl, this outputs:
Length = 55
Size = 60
As one would expect, when viewing the file in a hex editor, each line ending is actually "\r\n", taking two bytes. Therefore the total file size is 5 more than the length.
Is there a way to count the length of a string in bytes?
I've played around with the bytes pragma, and even a little bit of unpack, but no luck yet. I'm hoping for a generalized solution other than just adding 1 to my length call.
On Windows, file handles have the :crlf layer enabled by default. On reading, it transforms \r\n to \n, and it reverses this when writing. This means that scripts which assume Unix line endings won't break quite as often.
If you don't want this behaviour, remove any PerlIO layers by using the :raw pseudolayer:
binmode STDIN, ':raw'; # for one handle
or
use open IO => ':raw'; # for all handles
(of course, this is a simplification, and the actual behavior of :raw is explained in PerlIO)
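Applied to the progress-indicator use case, a minimal sketch (the file name and percentage output are just placeholders): with the :crlf layer removed from the input handle, length() sees \r\n as two characters, so the running total matches the byte count reported by -s:
use strict;
use warnings;
use autodie;

my $file  = 'input.txt';          # hypothetical input file
my $total = -s $file;

open my $fh, '<:raw', $file;      # no CRLF translation on read
my $read = 0;
while (my $line = <$fh>) {
    $read += length $line;        # counts the \r and the \n separately
    printf "%.1f%%\r", 100 * $read / $total;
    # process $line here; note it still ends in "\r\n"
}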

How reliable is the -B file test?

When I open a SQLite database file there is a lot of readable text at the beginning of the file - how big is the chance that a SQLite file is wrongly filtered away by the -B file test?
#!/usr/bin/env perl
use warnings;
use strict;
use 5.10.1;
use File::Find;
my $dir = shift;
my $databases;
find( {
    wanted => sub {
        my $file = $File::Find::name;
        return if not -B $file;
        return if not -s $file;
        return if not -r $file;
        say $file;
        open my $fh, '<', $file or die "$file: $!";
        my $firstline = readline( $fh ) // '';
        close $fh or die $!;
        push @$databases, $file if $firstline =~ /\ASQLite\sformat/;
    },
    no_chdir => 1,
},
$dir );
say scalar @$databases;
The perlfunc man page has the following to say about -T and -B:
The -T and -B switches work as follows. The first block or so of the file is
examined for odd characters such as strange control codes or characters with
the high bit set. If too many strange characters (>30%) are found, it's a -B
file; otherwise it's a -T file. Also, any file containing a zero byte in the
first block is considered a binary file.
Of course you could now do a statistical analysis of a number of SQLite files, parse their "first block or so" for "odd characters", calculate the probability of their occurrence, and that would give you an idea of how likely it is that -B fails for SQLite files.
However, you could also go the easy route. Can it fail? Yes, it's a heuristic. And a bad one at that. So don't use it.
File type recognition on Unix is usually done by evaluating the file's content. And yes, there are people who've done all the work for you already: it's called libmagic (the thingy that yields the file command line tool). You can use it from Perl with e.g. File::MMagic.
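If you only care about SQLite files specifically, you can also skip heuristics entirely and check the 16-byte magic header that every SQLite 3 database starts with; a minimal sketch (the sub name is made up):
use strict;
use warnings;

sub looks_like_sqlite {
    my ($file) = @_;
    open my $fh, '<:raw', $file or return 0;
    read $fh, my $magic, 16;                      # SQLite 3 magic header
    return defined $magic && $magic eq "SQLite format 3\0";
}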
Well, all files are technically a collection of bytes, and thus binary. Beyond that, there is no accepted definition of binary, so it's impossible to evaluate -B's reliability unless you care to posit a definition by which it is to be evaluated.

Filter common values between text files

I'm a beginner at perl and I'm trying to filter a large text file with 1 column of ID names, each a few characters long and unique, e.g.:
Aghm
Tbc2
Popc
Ltr1
Iubr
Osv5
and filter this list against a second text file with some of the same ID names, e.g.:
Popc
Iubr
Trv7
Ybd8
I only want to find the common ID names and print into a new text file. In the example above I want to generate the list:
Popc
Iubr
How can I do this using a Perl script?
To put you on the right path: what you want to build is a Perl filter.
You could start by opening the first file, looping over its lines with the diamond operator (that is, <>), and writing the selected lines to the output file.
You should try to get a copy of the Perl Cookbook; chapter 7 deals with exactly this kind of task.
Given the ID file ids.txt and the filter file filter_ids.txt, this would write the desired result to filtered_ids.txt:
#!/usr/bin/perl
use strict;
use warnings;

# Build a lookup hash from the filter file, then keep only the matching lines.
open my $rh, '<', 'filter_ids.txt' or die "$!\n";
my %filter = map { $_ => 1 } <$rh>;

open $rh, '<', 'ids.txt' or die "$!\n";
open my $wh, '>', 'filtered_ids.txt' or die "$!\n";
print {$wh} grep { $filter{$_} } <$rh>;
close $wh;
Personally I'd rather do this with grep:
grep -f filter_ids.txt ids.txt > filtered_ids.txt
Result in either case:
flesk@flesk:~$ more filtered_ids.txt
Popc
Iubr

Why do I use up so much memory when I read a file into memory in Perl?

I have a text file that is 310MB in size (uncompressed). When using PerlIO::gzip to open the file and uncompress it into memory, this file easily fills 2GB of RAM before perl runs out of memory.
The file is opened as below:
open FOO, "<:gzip", "file.gz" or die $!;
my @lines = <FOO>;
Obviously, this is a super convenient way to open gzipped files easily in perl, but it takes up a ridiculous amount of space! My next step is to uncompress the file to disk, read its lines into @lines, operate on @lines, and compress it back. Does anyone have any idea why over 7 times as much memory is consumed when opening a zipped file? Does anyone have an alternate idea as to how I can uncompress this gzipped file into memory without it taking a ridiculous amount of memory?
You're reading all the content of the file into a @lines array. Of course that'll pull all of the uncompressed content into memory. What you might have wanted instead is to read from your handle line by line, keeping only one line at a time in memory:
open my $foo, '<:gzip', 'file.gz' or die $!;
while (my $line = <$foo>) {
    # process $line here
}
When you do:
my @lines = <FOO>;
you are creating an array with as many elements as there are lines in the file. At 100 characters per line, that's about 3.4 million array entries. There is overhead associated with each array entry, which means the memory footprint will be much larger than just the uncompressed size of the file.
You can avoid slurping and process the file line-by-line. Here is an example:
C:\Temp> dir file
2010/10/04 09:18 PM 328,000,000 file
C:\Temp> dir file.gz
2010/10/04 09:19 PM 1,112,975 file.gz
And, indeed,
#!/usr/bin/perl
use strict; use warnings;
use autodie;
use PerlIO::gzip;

open my $foo, '<:gzip', 'file.gz';
while ( my $line = <$foo> ) {
    print ".";
}
has no problems.
To get an idea of the memory overhead, note:
#!/usr/bin/perl
use strict; use warnings;
use Devel::Size qw( total_size );

my $x = 'x' x 100;
my @x = ('x' x 100);

printf "Scalar: %d\n", total_size( \$x );
printf "Array: %d\n", total_size( \@x );
Output:
Scalar: 136
Array: 256
With files this big I see only one solution: use the command line to uncompress the file, do your manipulation in Perl, then use the external tools again to compress it. :)
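One way to do the external-tool step (a rough sketch, assuming a gzip binary is on the PATH and using the file name from the question) is to pipe from gzip -dc, which streams the uncompressed data line by line without ever holding it all in memory:
use strict;
use warnings;

open my $gz, '-|', 'gzip', '-dc', 'file.gz' or die "can't run gzip: $!";
while (my $line = <$gz>) {
    # process $line here
}
close $gz or die "gzip exited with status $?";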