Perl: seek to and read bits, not bytes - perl

In Perl, I want to seek to the nth bit (not byte) of a file and then read the next m bits, returned as a list of 0s and 1s.
Is there any easy way to do this?
I realize I can write a subroutine wrapping regular seek and read, but was wondering if there's a easier solution.

bitseek would grab a group of bits at one time.
seek($fh, int($bit_num/8), SEEK_SET);
my $offset = $bit_num % 8;
read($fh, my $buf, ceil(($offset+$num_bits)/8));
I'm looking for bit addressing, not bit-by-bit reading.
vec($bits, $offset+$bit_num, 1);

If n is a multiple of m, and m is one of 1, 2, 4, 8, 16, 32, and on some platforms, 64, you can read the whole file into a string and use vec for this.
(Admittedly a fairly constraining case, but a common one.)
Barring that, you'll just have to do the math; in pseudo-code:
discard = n % 8;
startbyte = (n - discard) / 8
bits = m + discard
bytes = int( (bits + 7) / 8 )
seek to startbyte
read bytes into string
#list = split //, unpack "${bits}b", string
splice( #list, 0, $discard )
splice( #list, $m, #list )

I ended up writing something like what #ikegami and #ysth suggested. For reference:
=item seek_bits($fh, $start, $num)
Seek to bit (not byte) $start in filehandle $fh, and return the next
$num bits (as a list).
=cut
sub seek_bits {
my($fh, $start, $num) = #_;
# the byte where this bit starts and the offset
my($fbyte, $offset) = (floor($start/8), $start%8);
# the number of bytes to read ($offset does affect this)
my($nbytes) = ceil($num+$offset)/8;
seek($fh, $fbyte, SEEK_SET);
read($fh, my($data), $nbytes);
my(#ret) = split(//, unpack("B*", $data));
# return requested bits
return #ret[$offset-1..$offset+$num];
}

Related

Using big numbers in Perl

I have a scenario where I take 2 very big binary strings (having 100 characters) and I need to add them.
The issue is that I am getting the answer in the form 2.000xxxxxxxxxxe+2, whereas I need the precise answer, as another 100 character long string.
chomp($str1=<STDIN>);
chomp($str2=<STDIN>);
print "Str 1 is $str1\n";
print "Str 2 is $str2\n";
$t = $str1 + $str2;
print "Sum is $t\n";
Sample Input
1001101111101011011100101100100110111011111011000100111100111110111101011011011100111001100011111010
1001101111101011011100101100100110111011111011000100111100111110111101011011011100111001100011111010
Sample Output
Str1 is
1001101111101011011100101100100110111011111011000100111100111110111101011011011100111001100011111010
Str2 is
1001101111101011011100101100100110111011111011000100111100111110111101011011011100111001100011111010
Sum is
2.0022022220202e+099
As already suggested, you can use Math::BigInt core module,
use Math::BigInt;
# chomp($str1=<STDIN>);
# chomp($str2=<STDIN>);
# print "Str 1 is $str1\n";
# print "Str 2 is $str2\n";
my $t = Math::BigInt->new("0b$str1") + Math::BigInt->new("0b$str2");
print $t->as_bin;
In order to perform arithmetic on your strings, Perl converts them to floating-point numbers, which are inherently imprecise. If you want to avoid that, use Math::BigInt as already suggested ... or roll your own.
######## WARNING/GUARANTEE: This is practically certain to be
# slower, buggier, less portable, and less secure than Math::BigInt.
# In fact, I planted a security hole just to prove a point. Enjoy.
use strict;
use warnings;
sub addition {
my ($int1, $int2) = #_;
my #int1 = reverse split //, $int1;
my #int2 = reverse split //, $int2;
my $len = scalar(#int1>#int2 ? #int1 : #int2);
my #result;
my $carry = 0;
for (my $i=0; $i < $len; ++$i)
{
$int1[$i] |= 0;
$int2[$i] |= 0;
my $sum = $carry + $int1[$i] + $int2[$i];
if ($sum >= 10)
{
$carry = int($sum / 10);
$sum %= 10;
}
push #result, $sum;
}
push #result, $carry if $carry;
return join ('', reverse #result);
}

How to do bitwise operation in perl to get the count of longest sequence of 0s between two 1s

Given an integer I would like to print bit by bit in perl. For instance given a number 9, i would like to get
1
0
0
1
How do i achive this. Essentially what I am trying to do is, to get the number of longest 0s between two 1s. Meaning if the bitwise representation of a number is this
1000001001, I would like this perl function to return 5.
I would like to know whats the best way to code this in perl. Am totally new to perl.
With leading zeroes:
my #bits = reverse unpack '(a)*', unpack 'B*', pack 'J>', $int;
Without:
my #bits = reverse unpack '(a)*', sprintf '%b', $int;
Notes:
reverse is used to place the least significant bit in $bits[0].
unpack '(a)*' is used to split the string into individual bits.
Both work with signed and unsigned integers.
Both work with integers of the size (in bytes) given by perl -V:ivsize.
If you leave it as a string, you can take advantage of the regex engine to extract the sequences of zeroes.
use List::Util qw( max );
my $bin = sprintf '%b', $num;
my $longest = ( max map length, $bin =~ /1(0+)(?=1)/g ) || 0;
In C, you might do something like the following, but in Perl, it might be less efficient than the earlier solution:
my $longest = 0;
if ($num) {
# Cast to unsigned so that >> inserts zeroes even for neg nums.
$num = ~~$num;
# Skip zeros not between 1s.
$num >>= 1 while !($num & 1);
while (1) {
# Skip 1s.
$num >>= 1 while $num & 1;
last if !$num;
# Count 0s.
my $len = 0; ++$len, $num >>= 1 while !($num & 1);
$longest = $len if $longest < $len;
}
}

Decoding 3-byte integer in Perl

I'm reading a binary file format that starts out with 4 constant check bytes, followed by 3 octets that indicate how long the data portion of the record will be. I can decode this as follows:
read($fh, $x, 7) or do {
last if eof;
die "Can't read: $!";
};
my ($type, $l1, $l2, $l3) = unpack("a4 C3", $x);
my $length = $l1 << 16 | $l2 << 8 | $l3;
Is there a more direct way to read that 3-byte value, without intermediate variables? Something I'm missing in the pack specifications maybe? I haven't used pack very much except for hex encoding and other dilettantish pursuits.
You could insert a null byte into the string in order to be able to use the "N" format:
substr($x, 4, 0, "\0");
my ($type, $length) = unpack "a4 N", $x;
Edit: Or else unpack in two steps:
my ($type, $length) = unpack "a4 a3", $x;
$length = unpack "N", "\0" . $length;
my $type = unpack("a4", $x);
my $len = unpack("N", "\0".substr($x, 4));
or
my ($type, $plen) = unpack("a4 a3", $x);
my $len = unpack("N", "\0$plen");
No, unpack doesn't support 3-byte (or arbitrary length) integers, but you could use an unsigned 16-bit big-endian int to save a little work:
my ($type, $l1, $l23) = unpack("a4 Cn", $x);
my $length = $l1 << 16 | $l23;
Solution: Your method for getting the type is fine. However, I suggest that you unpack the length separately as a four-byte integer, then discard the first byte of those four bytes. This is more efficient even though it overlaps and discards the last byte of the type.
my $type = unpack("a4", $x);
my $length = unpack("x3N", $x); # skips the first 3 bytes of your original 7-byte string
$length = $length & 0xFFFFFF; # returns only the last 3 bytes of the four-byte integer

Fastest method for checking if a LF is at the end of a large scalar in Perl?

I've come up with the following to check the final character of a $scaler for a linefeed:
if( $buffer !~ /\n$/ ) {
if( substr( $buffer, -1, 1 ) !~ /\n/ ) {
if( substr( $buffer, -1, 1 ) ne '\n' ) {
Is there a faster method I could? The size of the $buffer scalar can get large and I've noticed that the larger it gets, the longer these conditionals take to run. I do have another scalar containing the length of $buffer, if that would help.
Thanks
The full code:
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl qw();
use Time::HiRes qw( gettimeofday tv_interval );
use constant BUFSIZE => 2 ** 21; # 2MB worked best for me, YMMV.
die "ERROR: Missing filename" if( !$ARGV[0] );
my $top = [gettimeofday];
sysopen( my $fh, $ARGV[0], Fcntl::O_RDONLY | Fcntl::O_BINARY ) or
die "ERROR: Unable to open $ARGV[0], because $!\n";
open my $output, ">", "/dev/null"; # for 'dummy' processing
my $size = -s $ARGV[0];
my $osiz = $size;
my( $buffer, $offset, $lnCtr ) = ( "", "", 0 );
while( $size ) {
my $read = sysread( $fh, $buffer, BUFSIZE, length($offset) );
$size -= $read;
my #lines = split /\n/, $buffer;
if( substr( $buffer, -1, 1 ) ne "\n" ) {
$offset = pop( #lines );
} else {
$offset = "";
}
for my $line ( #lines ) {
processLine( \$line );
$lnCtr++;
}
$buffer = $offset if( $offset );
}
close $fh;
print "Processed $lnCtr lines ($osiz bytes) in file: $ARGV[0] in ".
tv_interval( $top ).
" secs.\n";
print "Using a buffered read of ".BUFSIZE." bytes. - JLB\n";
sub processLine {
if( ref($_[0]) ) {
print $output ${$_[0]}."\n";
} else {
print $output $_[0]."\n";
}
return 0;
}
I think I've reached that 'point-of-diminishing returns' in my attempts of making this run any faster. It seems to now be able to read in data as fast as my RAID5 SSDs are able to fetch it. As you can see, there is a reason I didn't use chomp(), the input can contain hundreds of thousands of linefeeds, which I need to keep to be able to break the lines for processing.
./fastread.pl newdata.log
Processed 516670 lines (106642635 bytes) in file: newdata.log in 0.674738 secs.
Using a buffered read of 2097152 bytes. - JLB
Perl has two string storage formats.
One of the formats uses the same number of bytes (1) to store each possible character the string can contain. Because of that and because Perl keeps track of how many bytes is used by a string, the performance of substr($x, -1) on a string in this format does not depend on the the length of the string.
The problem with the aforementioned format is that it can only store a very limited range of characters. It could be used to store the Unicode code points "Eric" and "Éric", but not for "Ελλάδα". When necessary (and even when not necessary), Perl will automatically switch a string's storage format to the other format.
The second format can store any Unicode code point as a character. In fact, it can store any 32-bit or 64-bit value (depending on perl's build settings). The downside is that a variable number of bytes is used to store each character. So even though Perl knows the number of bytes used by the entire string, it doesn't know where any character but the first one starts.* To find the last character, it must scan the entire string.
That said, because of properties of the storage format, it would actually be quite easy to find the last char of a string in constant time.
use Inline C => <<'__END_OF_C__';
# O(1) version of substr($x,-1)
SV* last_char(SV* sv) {
STRLEN len;
const char* s = SvPV(sv, len);
if (!len)
return newSVpvn("", 0);
{
const U32 utf8 = SvUTF8(sv);
const char* p = s+len-1;
if (utf8) {
while (p != s && (*p & 0xC0) != 0xC0)
--p;
}
return newSVpvn_utf8(p, s+len-p, utf8);
}
}
__END_OF_C__
* — It does keep a cache of the couple of char position to byte position mappings.
You've shown code which can be cleaned up so you don't even need to check the last char for a newline.
sub processLine {
print $_[0] $_[1];
}
open(my $fh, '<:raw', $ARGV[0])
or die("Can't open $ARGV[0]: $!\n");
my $buffer = '';
my $lnCtr = 0;
while (1) {
my $rv = sysread($fh, $buffer, BUFSIZE, length($buffer));
die $! if !defined($rv);
last if !$rv;
while ($buffer =~ s/(.*\n)//) {
processLine($1);
++$lnCtr;
}
}
if (length($buffer)) {
processLine($output, $buffer);
++$lnCtr;
}
Notes:
No need for sysopen. open is simpler.
If you pass $buffer to sysread, it doesn't make sense to use length($offset).
As you can see, $offset and the copying thereof is completely unnecessary.
Passing a var to a sub does not copy it, so no need to pass a reference.
If processLine doesn't need the newline, use s/(.*)\n// instead.
Why are you concerned about speed? Is this piece of code in a part of your program that is measurably slow, perhaps profiled with Devel::NYTProf? If not, then I suggest you go with what is the clearest to read and the most idiomatic, which is probably
if( $buffer !~ /\n$/ )
Your final version:
if( substr( $buffer, -1, 1 ) ne '\n' )
would also be a fine choice except for your single-quoting the linefeed, thus giving you a two-character string consisting of a backslash and a lowercase n. Perhaps you're coming from C where single characters are single quoted and strings are double-quoted? You want
if( substr( $buffer, -1, 1 ) ne "\n" )
This version
if( substr( $buffer, -1, 1 ) !~ /\n/ )
is doing a regex match that it shouldn't be because it's checking a one-character string against a single-character regex. The next person to read the code will think that's strange and wonder why you'd do that. Also, back to that speed thing, it's slower to match a string against a regex than just compare against a single character for equality.
Here is a Benchmark:
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(:all);
my $buffer = 'abc'x10_000_000;
$buffer .= "\n";
my $count = -2;
cmpthese($count, {
'regex' => sub {
if ($buffer !~ /\n$/) { }
},
'substr + regex' => sub {
if (substr($buffer, -1, 1) !~ /\n$/) { }
},
'substr + ne' => sub {
if (substr($buffer, -1, 1) ne "\n") { }
},
'chomp' => sub {
if (chomp $buffer) { }
},
});
Output:
Rate substr + regex substr + ne regex chomp
substr + regex 6302468/s -- -11% -44% -70%
substr + ne 7072032/s 12% -- -37% -66%
regex 11294695/s 79% 60% -- -46%
chomp 20910531/s 232% 196% 85% --
chomp is certainly the fastest way.
I suspect perl is treating the string as utf-8 and has to iterate over the whole thing for some reason.
You could temporarily switch to byte semantics to see if the char on the end is a newline.
See docs for Perl's bytes pragma and perlunicode.
You can try chomp. Chomp will return the number of EOL characters removed from the end of a line:
if ( chomp $buffer ) {
print "You had an LF on the end of \$buffer";
}
Of course, chomp removes the NL characters it counted.

Writing (and reading) bits to binary files in Perl + EOF handling

I have two related problems (in Perl):
Write data to binary files, in the format: single bit flag followed by 8 bits
Read back the same format
I tried this (and other variations but for the life of me I can't figure this out):
binmode(OUT);
my $bit = pack("B1", '1');
my $byte = pack("H2", "02");
print OUT $bit . $byte;
Using a hex editor, I see I get 16 bits:
1000000000000020
What I want is 9 bits:
100000020
Also: Suppose I write out two of these patterns. That means I end up with 9 + 9 = 18 bits. I am not sure how to handle the last byte (padding?)
This is to compress and uncompress files, with space at premium. I was hoping there would be some simple idiomatic way to do this that I am not aware of.
Files are sequences of bytes. If you want to print out bits, you'll have to use some form of buffering.
my $bits = '';
$bits .= '1'; # Add 1 bit.
$bits .= unpack('B8', pack('C', 0x02)); # Add 8 bits.
$bits .= substr(unpack('B8', pack('C', 0x02)), -6); # Add 6 bits.
This prints as much as the buffer as possible:
my $len = ( length($bits) >> 3 ) << 3;
print($fh, pack('B*', substr($bits, 0, $len, '')));
You'll eventually need to pad the buffer so that you have a multiple of 8 bits in order to flush out the rest. You could simply pad with zeroes.
$bits .= "0" x ( -length($bits) % 8 );
If you're smart, though, you can come up with a padding scheme that can be used to indicate where the file actually ends. Remember, you can't rely on just the file length anymore. If you don't use a smart padding scheme, you'll have to use another method.
One example of a smart padding scheme would be:
$bits .= "0";
$bits .= "1" x ( -length($bits) % 8 );
Then, to unpad, remove all trailing 1 bits and the 0 bit before that.
You can use Bit::Vector to manage your bits and conversion with some more ease,
use Bit::Vector;
my $bit = Bit::Vector->new_Bin( 1, '1' );
my $byte = Bit::Vector->new_Bin( 8, '00000010' );
my $byte_9 = Bit::Vector->new_Bin( 9, '000000010' );
my $nineBits = Bit::Vector->new_Bin( 9, '100000000' );
my $carry = Bit::Vector->new_Bin( 9, '000000000' );
my $ORed = Bit::Vector->new_Bin( 9, '000000000' );
my $added = Bit::Vector->new_Bin( 9, '000000000' );
$ORed->Union($nineBits,$byte_9);
print "bit: 0x". $bit->to_Hex(). "\n";
print "byte 2: 0x". $byte->to_Hex(). "\n";
print "nineBits: 0x". $nineBits->to_Hex(). "\n";
print "nineBits: 0x". $nineBits->to_Bin(). "\n";
print "ORed bit and byte 0x". $ORed->to_Dec(). "\n";
open BINOUT, ">out.bin"
or die "\nCan't open out.bin for writing: $!\n";
binmode BINOUT;
print BINOUT pack ('B*', $ORed->to_Bin()) ."\n"
Here's the output
>perl bitstuff.pl
bit: 0x1
byte 2: 0x02
nineBits: 0x100
nineBits: 0x100000000
ORed bit and byte 0x-254
>cat out.bin
\201^#