Size of a string in bytes (Windows) - Perl

I'm attempting to have a progress indicator when processing a large file by counting the length of each string. Unfortunately, each "\r\n" line ending is counted as a single character, so my running total drifts away from the actual file size.
The following script demonstrates:
use strict;
use warnings;
use autodie;
my $file = 'length_vs_size.txt';
open my $fh, '>', $file;
my $length = 0;
while (<DATA>) {
    $length += length;
    print $fh $_;
}
close $fh;
my $size = -s $file;
print "Length = $length\n";
print "Size = $size\n";
__DATA__
11...chars
22...chars
33...chars
44...chars
55...chars
Using Strawberry Perl, this outputs:
Length = 55
Size = 60
As one would expect, when viewing the file in a hex editor, each line ending is actually "\r\n", taking two bytes. Therefore the total file size is 5 more than the length.
Is there a way to count the length of a string in bytes?
I've played around with the bytes pragma, and even a little bit of unpack, but no luck yet. I'm hoping for a generalized solution other than just adding 1 to my length call.

On Windows, files have the :crlf PerlIO layer enabled by default. On reading, it transforms \r\n to \n, and it reverses this when writing. This means that scripts which assume Unix line endings won't break quite as often.
If you don't want this behaviour, remove any PerlIO layers by using the :raw pseudolayer:
binmode STDIN, ':raw'; # for one handle
or
use open IO => ':raw'; # for all handles
(of course, this is a simplification, and the actual behavior of :raw is explained in PerlIO)
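Applied to the script above, one minimal, untested sketch is to switch both the DATA handle and the output handle to raw mode, so that length() counts exactly the bytes that land on disk:

use strict;
use warnings;
use autodie;

my $file = 'length_vs_size.txt';
open my $fh, '>', $file;

binmode DATA;   # lines read from DATA now keep their literal \r\n
binmode $fh;    # and are written back out byte for byte

my $length = 0;
while (<DATA>) {
    $length += length;   # 12 per line: 10 characters plus \r\n
    print $fh $_;
}
close $fh;

print 'Length = ', $length, "\n";
print 'Size   = ', -s $file, "\n";   # both should now report 60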

Related

Setting diamond operator to work in binmode?

The diamond operator (<>) works in text mode by default; is it possible to switch it to binary mode? It seems that the binmode function accepts a handle only.
See perldoc perlopentut:
Binary Files
On certain legacy systems with what could charitably be called terminally
convoluted (some would say broken) I/O models, a file isn't a file--at
least, not with respect to the C standard I/O library. On these old
systems whose libraries (but not kernels) distinguish between text and
binary streams, to get files to behave properly you'll have to bend over
backwards to avoid nasty problems. On such infelicitous systems, sockets
and pipes are already opened in binary mode, and there is currently no way
to turn that off. With files, you have more options.
Another option is to use the "binmode" function on the appropriate handles
before doing regular I/O on them:
binmode(STDIN);
binmode(STDOUT);
while (<STDIN>) { print }
Passing "sysopen" a non-standard flag option will also open the file in
binary mode on those systems that support it. This is the equivalent of
opening the file normally, then calling "binmode" on the handle.
sysopen(BINDAT, "records.data", O_RDWR | O_BINARY)
|| die "can't open records.data: $!";
Now you can use "read" and "print" on that handle without worrying about
the non-standard system I/O library breaking your data. It's not a pretty
picture, but then, legacy systems seldom are. CP/M will be with us until
the end of days, and after.
On systems with exotic I/O systems, it turns out that, astonishingly
enough, even unbuffered I/O using "sysread" and "syswrite" might do sneaky
data mutilation behind your back.
while (sysread(WHENCE, $buf, 1024)) {
    syswrite(WHITHER, $buf, length($buf));
}
Depending on the vicissitudes of your runtime system, even these calls may
need "binmode" or "O_BINARY" first. Systems known to be free of such
difficulties include Unix, the Mac OS, Plan 9, and Inferno.
<> is a convenience. If it only iterated through filenames specified on the command line, you could use $ARGV within while (<>) to detect when a new file had been opened, binmode it, and then seek back to the beginning. Of course, this does not work in the presence of redirection (console input is a whole other story).
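For what it's worth, a rough sketch of that $ARGV-based idea might look like the following; it only works for real, seekable files named on the command line, which is exactly the limitation described above:

binmode STDOUT;
my $current = '';
while (<>) {
    if ($ARGV ne $current) {   # <> has just opened a new file
        $current = $ARGV;
        binmode ARGV;          # ARGV is the handle behind <>
        seek ARGV, 0, 0;       # rewind and re-read it in binary mode
        next;                  # discard the line that was read in text mode
    }
    print;
}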
A more robust solution is to detect whether @ARGV contains anything, open each file individually if it does, and default to reading from STDIN otherwise. A rudimentary implementation of this using an iterator could be:
#!/usr/bin/env perl
use strict;
use warnings;
use Carp qw( croak );
my $argv = sub {
    @_ or return sub {
        my $done;
        sub {
            $done and return;
            $done = 1;
            binmode STDIN;
            \*STDIN;
        }
    }->();
    my @argv = @_;
    sub {
        @argv or return;
        my $file = shift @argv;
        open my $fh, '<', $file
            or croak "Cannot open '$file': $!";
        binmode $fh;
        $fh;
    };
}->(@ARGV);

binmode STDOUT;

while (my $fh = $argv->()) {
    while (my $line = <$fh>) {
        print $line;
    }
}
Note:
C:\...\Temp> xxd test.txt
00000000: 7468 6973 2069 7320 6120 7465 7374 0a0a this is a test..
Without binmode:
C:\...\Temp> perl -e "print <>" test.txt | xxd
00000000: 7468 6973 2069 7320 6120 7465 7374 0d0a this is a test..
00000010: 0d0a ..
With the script above:
C:\...\Temp> perl argv.pl test.txt | xxd
00000000: 7468 6973 2069 7320 6120 7465 7374 0a0a this is a test..
Same results using perl ... < test.txt | xxd, or piping text through perl ...

workaround for SPLIT 1000 file limit?

I need to split a few large files into smaller files of a specific size, producing 500-5000 smaller files. I'm using split with a -b option, and I'm currently using a manual workaround whenever I hit split's 1000-file limit. Is there another UNIX command or Perl one-liner that will accomplish this?
Are you sure about the 1000 file limit?
The original split had no such limit, and there's no limit in the GNU or BSD versions of split. Maybe you're confusing the suffix length with some sort of limit. On BSD, the suffix starts at .aaa and goes all the way to .zzz, which is over 17,000 files.
You can use the -a flag to adjust the suffix size if the three character suffix isn't enough.
$ split -a 5 $file
If I try to create lots of files, I get
$ perl -e'print "x"x5000' | split -b 1 && echo done.
split: output file suffixes exhausted
By default, the suffix length is two, which allows for 26^2 = 676 parts. Increasing it to three allows for 26^3 = 17,576 parts.
$ perl -e'print "x"x5000' | split -b 1 -a 3 && echo done.
done.
One can control Perl's notion of an input record by setting $/:
Setting $/ to a reference to an integer, scalar containing an integer,
or scalar that's convertible to an integer will attempt to read
records instead of lines, with the maximum record size being the
referenced integer number of characters. So this:
local $/ = \32768; # or \"32768", or \$var_containing_32768
open my $fh, "<", $myfile or die $!;
local $_ = <$fh>;
will read a record of no more than 32768 characters from $fh.
So to split a large file into smaller files no larger than 1024 bytes, one could use the following:
use strict;
use warnings;
$/ = \1024;
my $filename = 'A';
while (<>) {
    open my $fh, '>', ($filename++ . '.txt') or die $!;
    print $fh $_;
    close $fh or die $!;
}
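If a one-liner is preferred, the same record-at-a-time idea can be squeezed onto the command line. Here is a rough, untested equivalent of the script above (bigfile is a placeholder name):

perl -e '$/ = \1024; my $name = "A"; while (<>) { open my $fh, ">", $name++ . ".txt" or die $!; print $fh $_; close $fh or die $! }' bigfile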

Parsing huge text file in Perl

I have a genome file of about 30 GB, similar to the sample below:
>2RHet assembled 2006-03-27 md5sum:88c0ac39ebe4d9ef5a8f58cd746c9810
GAGAGGTGTGGAGAGGAGAGGAGAGGAGTGGTGAGGAGAGGAGAGGTGAG
GAGAGGAGAGGAGAGGAGAGGAATGGAGAGGAGAGGAGTCGAGAGGAGAG
GAGAGGAGTGGTGAGGAGAGGAGAGGAGTGGAGAGGAGACGTGAGGAGTG
GAGAGGAGAGTAGTGGAGAGGAGTGGAGAGGAGAGGAGAGGAGAGGACGG
ATTGTGTTGAGGACGGATTGTGTTACACTGATCGATGGCCGAGAACGAAC
I am trying to parse the file quickly, reading it character by character with the code below, but the character is not getting printed:
open (FH,"<:raw",'genome.txt') or die "cant open the file $!\n";
until ( eof(FH) ) {
    $ch = getc(FH);
    print "$ch\n"; # not printing ch
}
close FH;
Your mistake is forgetting an eof:
until (eof FH) { ... }
But that is very unlikely to be the most efficient solution: Perl is slower than, say … C, so we want as few loop iterations as possible and as much work done inside Perl's internals as we can get. This means that reading a file character by character is slow.
Also, use lexical variables (declared with my) instead of globals; this can lead to a performance increase.
Either pick a natural record delimiter (like \n), or read a certain number of bytes:
local $/ = \256; # read 256 bytes at a time.
while (<FH>) {
    # do something with the bytes
}
(see perlvar)
You could also shed all the luxuries that open, readline and even getc do for you, and use sysopen and sysread for total control. However, that way lies madness.
# not tested; I will *not* use sysread.
use Fcntl;
use constant NUM_OF_CHARS => 1; # equivalent to getc; set higher maybe.
sysopen FH, "genome.txt", O_RDONLY or die;
my $char;
while (sysread FH, $char, NUM_OF_CHARS, 0) {
    print($char .= "\n"); # appending should be better than concatenation.
}
If we have gone that far, using Inline::C is just a small and possibly preferable further step.
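For illustration only, here is a hedged sketch of what that Inline::C step could look like. The task (counting G/C bases) is made up, since the question does not say what the per-character work actually is, and it assumes a working C compiler plus the Inline::C module:

use strict;
use warnings;
use Inline C => <<'END_C';
/* Per-character work done in C: count G/C bases in a chunk.
   This is the kind of loop that is cheap in C but slow in Perl. */
long gc_count(char *seq) {
    long n = 0;
    char *p;
    for (p = seq; *p; p++) {
        if (*p == 'G' || *p == 'C' || *p == 'g' || *p == 'c')
            n++;
    }
    return n;
}
END_C

$/ = \65536;   # hand the C function 64 KiB chunks at a time
open my $fh, '<:raw', 'genome.txt' or die "can't open genome.txt: $!";
my $gc = 0;
$gc += gc_count($_) while <$fh>;
print "GC bases: $gc\n";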

get last few lines of file stored in variable

How could I get the last few lines of a file that is stored in a variable? On Linux I would use the tail command if it were in a file.
1) How can I do this in Perl if the data is in a file?
2) How can I do this if the content of the file is in a variable?
To read the end of a file, seek near the end of the file and begin reading. For example,
open my $fh, '<', $file;
seek $fh, -1000, 2;
my @lines = <$fh>;
close $fh;
print "Last 5 lines of $file are: ", #lines[-5 .. -1];
Depending on what is in the file or how many lines you want to look at, you may want to use a different magic number than -1000 above.
You could do something similar with a variable, either
open my $fh, '<', \$the_variable;
seek $fh, -1000, 2;
or just
open my $fh, '<', \substr($the_variable, -1000);
will give you an I/O handle that produces the last 1000 characters in $the_variable.
The File::ReadBackwards module on the CPAN is probably what you want. You can use it thus. This will print the last three lines in the file:
use File::ReadBackwards;
my $bw = File::ReadBackwards->new("some_file");
print reverse map { $bw->readline() } (1 .. 3);
Internally, it seek()s to near the end of the file and looks for line endings, so it should be fairly efficient with memory, even with very big files.
To some extent, that depends on how big the file is and how many lines you want. If it is going to be very big, you need to be careful, because reading it all into memory will take a lot longer than just reading the last part of the file.
If it is small, the easiest way is probably to File::Slurp it into memory, split it on record delimiters, and keep the last n records. In effect, something like:
# first line if not yet in a string
my $string = File::Slurp::read_file($filename);
my @lines = split(/\n/, $string);
print join("\n", @lines[-10..-1]);
If it is large, too large to fit into memory, you might be better off using file system operations directly. When I did this, I opened the file, used seek() to read the last 4 KB or so, and repeated this, working backwards, until I had enough data to get the number of records I needed (a rough sketch follows below).
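A rough, untested sketch of that backwards, block-wise approach (the 4096-byte block size and the tail_lines name are just placeholders):

use strict;
use warnings;

sub tail_lines {
    my ($file, $want) = @_;
    open my $fh, '<:raw', $file or die "can't open $file: $!";
    my $block = 4096;
    my $pos   = -s $fh;       # start at the end of the file
    my $buf   = '';
    while ($pos > 0) {
        my $read = $pos >= $block ? $block : $pos;
        $pos -= $read;
        seek $fh, $pos, 0 or die "seek failed: $!";
        read $fh, my $chunk, $read;
        $buf = $chunk . $buf;
        # stop once enough line endings have been collected
        last if ($buf =~ tr/\n//) > $want;
    }
    my @lines = split /^/m, $buf;
    return @lines > $want ? @lines[-$want .. -1] : @lines;
}

print tail_lines('some_file.txt', 10);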
Not a detailed answer, but the question could be a touch more specific.
I know this is an old question, but I found it while looking for a way to search for a pattern in the first and last k lines of a file.
For the tail part, in addition to seek (if the file is seekable), it saves some memory to use a rotating buffer, as follows (this returns the last $k lines, or fewer if fewer than $k are available):
my $i = 0;
my @a;
while (<$fh>) {
    $a[$i++ % $k] = $_;
}
my @tail = splice @a, 0, $i % $k;
splice @a, @a, 0, @tail;
return @a;
A lot has already been stated on the file side, but if it's already in a string, you can use the following regex:
my ($lines) = $str =~ /
    (
        (?:
            (?:(?<=^)|(?<=\n)) # match beginning of line (separated due to variable lookbehind limitation)
            [^\n]*+            # match the line
            (?:\n|$)           # match the end of the line
        ){0,5}+                # match at least 0 and at most 5 lines
    )$                         # match must be at the end of the string
/sx;                           # s = treat string as single line
                               # x = allow whitespace and comments
This runs extremely fast. Benchmarking shows it between 40% and 90% faster than the split/join method (variable due to the current load on the machine), presumably because it does fewer memory manipulations. Something you might want to consider if speed is essential; otherwise, it's just interesting.

Unicode in Perl not working

I have some text files which I am trying to transform with a Perl script on Windows. The text files look normal in Notepad+, but all the regexes in my script were failing to match. Then I noticed that when I open the text files in Notepad+, the status bar says "UCS-2 Little Endia" (sic). I am assuming this corresponds to the encoding UCS-2LE. So I created "readFile" and "writeFile" subs in Perl, like so:
use PerlIO::encoding;
my $enc = ':encoding(UCS-2LE)';
sub readFile {
    my ($fName) = @_;
    open my $f, "<$enc", $fName or die "can't read $fName\n";
    local $/;
    my $txt = <$f>;
    close $f;
    return $txt;
}

sub writeFile {
    my ($fName, $txt) = @_;
    open my $f, ">$enc", $fName or die "can't write $fName\n";
    print $f $txt;
    close $f;
}
my $fName = 'someFile.txt';
my $txt = readFile $fName;
# ... transform $txt using s/// ...
writeFile $fName, $txt;
Now the regexes match (although less often than I expect), but the output contains long strings of Asian-looking characters interspersed with long strings of the correct text. Is my code wrong? Or perhaps Notepad+ is wrong about the encoding? How should I proceed?
OK, I figured it out. The problem was being caused by a disconnect between the encoding translation done by the "encoding..." parameter of the "open" call and the default CRLF translation done by Perl on Windows. What appeared to be happening was that LF was being translated to CRLF on output after the encoding had already been done, which threw off the "parity" of the 16-bit encoding for the following line. Once the next line was reached, the "parity" got put back. That would explain the "long strings of Asian-looking characters interspersed with long strings of the correct text": every other line was being messed up.
To correct it, I took out the encoding parameter in my "open" call and added a "binmode" call, as follows:
open my $f, $fName or die "can't read $fName\n";
binmode $f, ':raw:encoding(UCS-2LE)';
binmode apparently has a concept of "layered" I/O handling that is somewhat complicated.
One thing I can't figure out is how to get my CRLF translation back. If I leave out :raw or add :crlf, the "parity" problem returns. I've tried re-ordering as well and can't get it to work.
(I added this as a separate question: CRLF translation with Unicode in Perl)
I don't have the Notepad+ editor to check, but it may be a BOM problem, with your output encoding not including a BOM.
http://perldoc.perl.org/Encode/Unicode.html#Size%2c-Endianness%2c-and-BOM
Maybe you need to encode $txt using a byte order mark as described above.
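For example, a rough sketch of writing the BOM explicitly when saving the file (the UCS-2LE layer does not add one on its own; U+FEFF encoded as UCS-2LE yields the bytes FF FE, which is the little-endian BOM):

open my $f, '>:raw:encoding(UCS-2LE)', $fName or die "can't write $fName\n";
print {$f} "\x{FEFF}";   # byte order mark: encodes to FF FE
print {$f} $txt;
close $f;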