Decoding 3-byte integer in Perl - perl

I'm reading a binary file format that starts out with 4 constant check bytes, followed by 3 octets that indicate how long the data portion of the record will be. I can decode this as follows:
read($fh, $x, 7) or do {
last if eof;
die "Can't read: $!";
};
my ($type, $l1, $l2, $l3) = unpack("a4 C3", $x);
my $length = $l1 << 16 | $l2 << 8 | $l3;
Is there a more direct way to read that 3-byte value, without intermediate variables? Something I'm missing in the pack specifications maybe? I haven't used pack very much except for hex encoding and other dilettantish pursuits.

You could insert a null byte into the string in order to be able to use the "N" format:
substr($x, 4, 0, "\0");
my ($type, $length) = unpack "a4 N", $x;
Edit: Or else unpack in two steps:
my ($type, $length) = unpack "a4 a3", $x;
$length = unpack "N", "\0" . $length;

my $type = unpack("a4", $x);
my $len = unpack("N", "\0".substr($x, 4));
or
my ($type, $plen) = unpack("a4 a3", $x);
my $len = unpack("N", "\0$plen");

No, unpack doesn't support 3-byte (or arbitrary length) integers, but you could use an unsigned 16-bit big-endian int to save a little work:
my ($type, $l1, $l23) = unpack("a4 Cn", $x);
my $length = $l1 << 16 | $l23;

Solution: Your method for getting the type is fine. However, I suggest that you unpack the length separately as a four-byte integer, then discard the first byte of those four bytes. This is more efficient even though it overlaps and discards the last byte of the type.
my $type = unpack("a4", $x);
my $length = unpack("x3N", $x); # skips the first 3 bytes of your original 7-byte string
$length = $length & 0xFFFFFF; # returns only the last 3 bytes of the four-byte integer

Related

Perl: seek to and read bits, not bytes

In Perl, I want to seek to the nth bit (not byte) of a file and then read the next m bits, returned as a list of 0s and 1s.
Is there any easy way to do this?
I realize I can write a subroutine wrapping regular seek and read, but was wondering if there's a easier solution.
bitseek would grab a group of bits at one time.
seek($fh, int($bit_num/8), SEEK_SET);
my $offset = $bit_num % 8;
read($fh, my $buf, ceil(($offset+$num_bits)/8));
I'm looking for bit addressing, not bit-by-bit reading.
vec($bits, $offset+$bit_num, 1);
If n is a multiple of m, and m is one of 1, 2, 4, 8, 16, 32, and on some platforms, 64, you can read the whole file into a string and use vec for this.
(Admittedly a fairly constraining case, but a common one.)
Barring that, you'll just have to do the math; in pseudo-code:
discard = n % 8;
startbyte = (n - discard) / 8
bits = m + discard
bytes = int( (bits + 7) / 8 )
seek to startbyte
read bytes into string
#list = split //, unpack "${bits}b", string
splice( #list, 0, $discard )
splice( #list, $m, #list )
I ended up writing something like what #ikegami and #ysth suggested. For reference:
=item seek_bits($fh, $start, $num)
Seek to bit (not byte) $start in filehandle $fh, and return the next
$num bits (as a list).
=cut
sub seek_bits {
my($fh, $start, $num) = #_;
# the byte where this bit starts and the offset
my($fbyte, $offset) = (floor($start/8), $start%8);
# the number of bytes to read ($offset does affect this)
my($nbytes) = ceil($num+$offset)/8;
seek($fh, $fbyte, SEEK_SET);
read($fh, my($data), $nbytes);
my(#ret) = split(//, unpack("B*", $data));
# return requested bits
return #ret[$offset-1..$offset+$num];
}

Perl, Decoding bits stored in a binary file

I'm decoding a fixed-width binary file in perl. One of the fields is 1 byte in length and the 8 bits are encoded such that:
The 7th Bit unused, The 0-4th bit is used to indicate Field1. The 5-6th Bit is used to indicate Field2.
For example the field 0x27 will give Field1 = 7, Field2 = 1.
I'm reading all the fields into a hash table called raw{} and then decoding the values into another hash called processed{}.
The code (only shown for this one byte) is
while (read(FILE, $buff, 559)) {
%raw = (); # Hash for first-pass extractions
%processed = (); # Hash for cleaned up values, for output
(
......
$raw{'Field12'},
......
)= unpack('
.....
H2
.....
h*',$buff);
$prcoessed{'Field1'} = unpack("B8", pack("H8", $raw{'Field12'})) & 0x1f;
$prcoessed{'Field2'} = (unpack("B8", pack("H8", $raw{'Field12'})) >> 5) & 0x3;
However, I don't get the desired values. I get $prcoessed{'Field1'} = 15 and $prcoessed{'Field2'} = 3. Where am I going wrong?
0x1f is a number, you need to transform it into a character.
Instead of packs and unpacks, you can use ord and chr:
#!/usr/bin/perl
use warnings;
use strict;
my $char = chr 0x27;
my $field1 = $char & chr 0x1f;
my $field2 = $char & chr 0x60;
$field2 = chr(ord($field2) >> 5);
print ord, "\n" for $field1, $field2;
The same script using pack and unpack:
my $char = pack 'C', 0x27;
my $field1 = $char & pack 'C', 0x1f;
my $field2 = $char & pack 'C', 0x60;
$field2 = pack 'C', unpack('C', $field2) >> 5;
print unpack('C', $_), "\n" for $field1, $field2;

Fastest method for checking if a LF is at the end of a large scalar in Perl?

I've come up with the following to check the final character of a $scaler for a linefeed:
if( $buffer !~ /\n$/ ) {
if( substr( $buffer, -1, 1 ) !~ /\n/ ) {
if( substr( $buffer, -1, 1 ) ne '\n' ) {
Is there a faster method I could? The size of the $buffer scalar can get large and I've noticed that the larger it gets, the longer these conditionals take to run. I do have another scalar containing the length of $buffer, if that would help.
Thanks
The full code:
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl qw();
use Time::HiRes qw( gettimeofday tv_interval );
use constant BUFSIZE => 2 ** 21; # 2MB worked best for me, YMMV.
die "ERROR: Missing filename" if( !$ARGV[0] );
my $top = [gettimeofday];
sysopen( my $fh, $ARGV[0], Fcntl::O_RDONLY | Fcntl::O_BINARY ) or
die "ERROR: Unable to open $ARGV[0], because $!\n";
open my $output, ">", "/dev/null"; # for 'dummy' processing
my $size = -s $ARGV[0];
my $osiz = $size;
my( $buffer, $offset, $lnCtr ) = ( "", "", 0 );
while( $size ) {
my $read = sysread( $fh, $buffer, BUFSIZE, length($offset) );
$size -= $read;
my #lines = split /\n/, $buffer;
if( substr( $buffer, -1, 1 ) ne "\n" ) {
$offset = pop( #lines );
} else {
$offset = "";
}
for my $line ( #lines ) {
processLine( \$line );
$lnCtr++;
}
$buffer = $offset if( $offset );
}
close $fh;
print "Processed $lnCtr lines ($osiz bytes) in file: $ARGV[0] in ".
tv_interval( $top ).
" secs.\n";
print "Using a buffered read of ".BUFSIZE." bytes. - JLB\n";
sub processLine {
if( ref($_[0]) ) {
print $output ${$_[0]}."\n";
} else {
print $output $_[0]."\n";
}
return 0;
}
I think I've reached that 'point-of-diminishing returns' in my attempts of making this run any faster. It seems to now be able to read in data as fast as my RAID5 SSDs are able to fetch it. As you can see, there is a reason I didn't use chomp(), the input can contain hundreds of thousands of linefeeds, which I need to keep to be able to break the lines for processing.
./fastread.pl newdata.log
Processed 516670 lines (106642635 bytes) in file: newdata.log in 0.674738 secs.
Using a buffered read of 2097152 bytes. - JLB
Perl has two string storage formats.
One of the formats uses the same number of bytes (1) to store each possible character the string can contain. Because of that and because Perl keeps track of how many bytes is used by a string, the performance of substr($x, -1) on a string in this format does not depend on the the length of the string.
The problem with the aforementioned format is that it can only store a very limited range of characters. It could be used to store the Unicode code points "Eric" and "Éric", but not for "Ελλάδα". When necessary (and even when not necessary), Perl will automatically switch a string's storage format to the other format.
The second format can store any Unicode code point as a character. In fact, it can store any 32-bit or 64-bit value (depending on perl's build settings). The downside is that a variable number of bytes is used to store each character. So even though Perl knows the number of bytes used by the entire string, it doesn't know where any character but the first one starts.* To find the last character, it must scan the entire string.
That said, because of properties of the storage format, it would actually be quite easy to find the last char of a string in constant time.
use Inline C => <<'__END_OF_C__';
# O(1) version of substr($x,-1)
SV* last_char(SV* sv) {
STRLEN len;
const char* s = SvPV(sv, len);
if (!len)
return newSVpvn("", 0);
{
const U32 utf8 = SvUTF8(sv);
const char* p = s+len-1;
if (utf8) {
while (p != s && (*p & 0xC0) != 0xC0)
--p;
}
return newSVpvn_utf8(p, s+len-p, utf8);
}
}
__END_OF_C__
* — It does keep a cache of the couple of char position to byte position mappings.
You've shown code which can be cleaned up so you don't even need to check the last char for a newline.
sub processLine {
print $_[0] $_[1];
}
open(my $fh, '<:raw', $ARGV[0])
or die("Can't open $ARGV[0]: $!\n");
my $buffer = '';
my $lnCtr = 0;
while (1) {
my $rv = sysread($fh, $buffer, BUFSIZE, length($buffer));
die $! if !defined($rv);
last if !$rv;
while ($buffer =~ s/(.*\n)//) {
processLine($1);
++$lnCtr;
}
}
if (length($buffer)) {
processLine($output, $buffer);
++$lnCtr;
}
Notes:
No need for sysopen. open is simpler.
If you pass $buffer to sysread, it doesn't make sense to use length($offset).
As you can see, $offset and the copying thereof is completely unnecessary.
Passing a var to a sub does not copy it, so no need to pass a reference.
If processLine doesn't need the newline, use s/(.*)\n// instead.
Why are you concerned about speed? Is this piece of code in a part of your program that is measurably slow, perhaps profiled with Devel::NYTProf? If not, then I suggest you go with what is the clearest to read and the most idiomatic, which is probably
if( $buffer !~ /\n$/ )
Your final version:
if( substr( $buffer, -1, 1 ) ne '\n' )
would also be a fine choice except for your single-quoting the linefeed, thus giving you a two-character string consisting of a backslash and a lowercase n. Perhaps you're coming from C where single characters are single quoted and strings are double-quoted? You want
if( substr( $buffer, -1, 1 ) ne "\n" )
This version
if( substr( $buffer, -1, 1 ) !~ /\n/ )
is doing a regex match that it shouldn't be because it's checking a one-character string against a single-character regex. The next person to read the code will think that's strange and wonder why you'd do that. Also, back to that speed thing, it's slower to match a string against a regex than just compare against a single character for equality.
Here is a Benchmark:
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(:all);
my $buffer = 'abc'x10_000_000;
$buffer .= "\n";
my $count = -2;
cmpthese($count, {
'regex' => sub {
if ($buffer !~ /\n$/) { }
},
'substr + regex' => sub {
if (substr($buffer, -1, 1) !~ /\n$/) { }
},
'substr + ne' => sub {
if (substr($buffer, -1, 1) ne "\n") { }
},
'chomp' => sub {
if (chomp $buffer) { }
},
});
Output:
Rate substr + regex substr + ne regex chomp
substr + regex 6302468/s -- -11% -44% -70%
substr + ne 7072032/s 12% -- -37% -66%
regex 11294695/s 79% 60% -- -46%
chomp 20910531/s 232% 196% 85% --
chomp is certainly the fastest way.
I suspect perl is treating the string as utf-8 and has to iterate over the whole thing for some reason.
You could temporarily switch to byte semantics to see if the char on the end is a newline.
See docs for Perl's bytes pragma and perlunicode.
You can try chomp. Chomp will return the number of EOL characters removed from the end of a line:
if ( chomp $buffer ) {
print "You had an LF on the end of \$buffer";
}
Of course, chomp removes the NL characters it counted.

Writing (and reading) bits to binary files in Perl + EOF handling

I have two related problems (in Perl):
Write data to binary files, in the format: single bit flag followed by 8 bits
Read back the same format
I tried this (and other variations but for the life of me I can't figure this out):
binmode(OUT);
my $bit = pack("B1", '1');
my $byte = pack("H2", "02");
print OUT $bit . $byte;
Using a hex editor, I see I get 16 bits:
1000000000000020
What I want is 9 bits:
100000020
Also: Suppose I write out two of these patterns. That means I end up with 9 + 9 = 18 bits. I am not sure how to handle the last byte (padding?)
This is to compress and uncompress files, with space at premium. I was hoping there would be some simple idiomatic way to do this that I am not aware of.
Files are sequences of bytes. If you want to print out bits, you'll have to use some form of buffering.
my $bits = '';
$bits .= '1'; # Add 1 bit.
$bits .= unpack('B8', pack('C', 0x02)); # Add 8 bits.
$bits .= substr(unpack('B8', pack('C', 0x02)), -6); # Add 6 bits.
This prints as much as the buffer as possible:
my $len = ( length($bits) >> 3 ) << 3;
print($fh, pack('B*', substr($bits, 0, $len, '')));
You'll eventually need to pad the buffer so that you have a multiple of 8 bits in order to flush out the rest. You could simply pad with zeroes.
$bits .= "0" x ( -length($bits) % 8 );
If you're smart, though, you can come up with a padding scheme that can be used to indicate where the file actually ends. Remember, you can't rely on just the file length anymore. If you don't use a smart padding scheme, you'll have to use another method.
One example of a smart padding scheme would be:
$bits .= "0";
$bits .= "1" x ( -length($bits) % 8 );
Then, to unpad, remove all trailing 1 bits and the 0 bit before that.
You can use Bit::Vector to manage your bits and conversion with some more ease,
use Bit::Vector;
my $bit = Bit::Vector->new_Bin( 1, '1' );
my $byte = Bit::Vector->new_Bin( 8, '00000010' );
my $byte_9 = Bit::Vector->new_Bin( 9, '000000010' );
my $nineBits = Bit::Vector->new_Bin( 9, '100000000' );
my $carry = Bit::Vector->new_Bin( 9, '000000000' );
my $ORed = Bit::Vector->new_Bin( 9, '000000000' );
my $added = Bit::Vector->new_Bin( 9, '000000000' );
$ORed->Union($nineBits,$byte_9);
print "bit: 0x". $bit->to_Hex(). "\n";
print "byte 2: 0x". $byte->to_Hex(). "\n";
print "nineBits: 0x". $nineBits->to_Hex(). "\n";
print "nineBits: 0x". $nineBits->to_Bin(). "\n";
print "ORed bit and byte 0x". $ORed->to_Dec(). "\n";
open BINOUT, ">out.bin"
or die "\nCan't open out.bin for writing: $!\n";
binmode BINOUT;
print BINOUT pack ('B*', $ORed->to_Bin()) ."\n"
Here's the output
>perl bitstuff.pl
bit: 0x1
byte 2: 0x02
nineBits: 0x100
nineBits: 0x100000000
ORed bit and byte 0x-254
>cat out.bin
\201^#

How can I do 64-bit hex/decimal arithmetic AND output a full number in HEX as string in Perl?

I need to do some arithmetic with large hexadecimal numbers below, but when I try to output I'm getting overflow error messages "Hexadecimal number > 0xffffffff non-portable", messages about not portable, or the maximum 32-bit hex value FFFFFFFF.
All of which imply that the standard language and output routines only cope with 32 bit values. I need 64-bit values and have done a lot of research, but I found nothing that BOTH enables the arithmetic AND outputs the large number in hex.
my $result = 0x00000200A0000000 +
( ( $id & 0xFFFFF ) * 2 ) + ( ( $id / 0x100000 ) * 0x40000000 );
So, for $id with the following values I should get $result:
$id = 0, $result = 0x00000200A0000000
$id = 1, $result = 0x00000200A0000002
$id = 2, $result = 0x00000200A0000004
How can I do this?
Here is my inconclusive research results, with reasons why:
How can I do 64-bit arithmetic in Perl?
How can I sum large hexadecimal values in Perl? Vague, answer not definitively precise and no example.
Integer overflow
non conclusive
Integer overflow
non conclusive
bigint
no info about assignment, arithmetic or output
bignum
examples not close to my problem.
How can I sprintf a big number in Perl?
example given is not enough info for me: doesn't deal with hex
assignment or arithmetic.
Re: secret code generator
Some examples using Fleximal, mentions to_str to output value of
variable but 1) I don't see how the
variable was assigned and 2) I get
error "Can't call method "to_str"
without a package or object
reference" when I run my code using
it.
String to Hex
Example of using Math::BigInt which
doesn't work for me - still get
overflow error.
Is there a 64-bit hex()?
Nearly there - but doesn't deal with
outputting the large number in hex,
it only talks of decimal.
CPAN Math:Fleximal
does the arithmetic, but there doesn't seem to be any means to actually
output the value still in hex
sprintf
Doesn't seem to be able to cope with
numbers greater than 32-bits, get the
saturated FFFFFFFF message.
Edit: Update - new requirement and supplied solution - please feel free to offer comments
Chas. Owens answer is still accepted and excellent (part 2 works for me, haven't tried the part 1 version for newer Perl, though I would invite others to confirm it).
However, another requirement was to be able to convert back from the result to the original id.
So I've written the code to do this, here's the full solution, including #Chas. Owens original solution, followed by the implementation for this new requirement:
#!/usr/bin/perl
use strict;
use warnings;
use bigint;
use Carp;
sub bighex {
my $hex = shift;
my $part = qr/[0-9a-fA-F]{8}/;
croak "$hex is not a 64-bit hex number"
unless my ($high, $low) = $hex =~ /^0x($part)($part)$/;
return hex("0x$low") + (hex("0x$high") << 32);
}
sub to_bighex {
my $decimal = shift;
croak "$decimal is not an unsigned integer"
unless $decimal =~ /^[0-9]+$/;
my $high = $decimal >> 32;
my $low = $decimal & 0xFFFFFFFF;
return sprintf("%08x%08x", $high, $low);
}
for my $id (0 ,1, 2, 0xFFFFF, 0x100000, 0x100001, 0x1FFFFF, 0x200000, 0x7FDFFFFF ) {
my $result = bighex("0x00000200A0000000");
$result += ( ( $id & 0xFFFFF ) * 2 ) + ( ( $id / 0x100000 ) * 0x40000000 );
my $clusterid = to_bighex($result);
# the convert back code here:
my $clusterid_asHex = bighex("0x".$clusterid);
my $offset = $clusterid_asHex - bighex("0x00000200A0000000");
my $index_small_units = ( $offset / 2 ) & 0xFFFFF;
my $index_0x100000_units = ( $offset / 0x40000000 ) * 0x100000;
my $index = $index_0x100000_units + $index_small_units;
print "\$id = ".to_bighex( $id ).
" clusterid = ".$clusterid.
" back to \$id = ".to_bighex( $index ).
" \n";
}
Try out this code at http://ideone.com/IMsp6.
#!/usr/bin/perl
use strict;
use warnings;
use bigint qw/hex/;
for my $id (0 ,1, 2) {
my $result = hex("0x00000200A0000000") +
( ( $id & 0xFFFFF ) * 2 ) + ( ( $id / 0x100000 ) * 0x40000000 );
printf "%d: %#016x\n", $id, $result;
}
The bigint pragma replaces the hex function with a version that can handle numbers that large. It also transparently makes the mathematical operators deal with big ints instead of the ints on the target platform.
Note, this only works in Perl 5.10 and later. If you are running an earlier version of Perl 5, you can try this:
#!/usr/bin/perl
use strict;
use warnings;
use bigint;
use Carp;
sub bighex {
my $hex = shift;
my $part = qr/[0-9a-fA-F]{8}/;
croak "$hex is not a 64-bit hex number"
unless my ($high, $low) = $hex =~ /^0x($part)($part)$/;
return hex("0x$low") + (hex("0x$high") << 32);
}
sub to_bighex {
my $decimal = shift;
croak "$decimal is not an unsigned integer"
unless $decimal =~ /^[0-9]+$/;
my $high = $decimal >> 32;
my $low = $decimal & 0xFFFFFFFF;
return sprintf("%08x%08x", $high, $low);
}
for my $id (0 ,1, 2) {
my $result = bighex("0x00000200A0000000");
$result += ( ( $id & 0xFFFFF ) * 2 ) + ( ( $id / 0x100000 ) * 0x40000000 );
print "$id ", to_bighex($result), "\n";
}
The comment by ysth is right. Short example of 64-bit arithmetics using Perl from Debian stretch without Math::BigInt aka "use bigint":
#!/usr/bin/perl -wwi
sub do_64bit_arith {
use integer;
my $x = ~2;
$x <<= 4;
printf "0x%08x%08x\n", $x>>32, $x;
}
do_64bit_arith();
exit 0;
The script prints 0xffffffffffffffffffffffffffffffd0.