Writing (and reading) bits to binary files in Perl + EOF handling - perl

I have two related problems (in Perl):
Write data to binary files, in the format: single bit flag followed by 8 bits
Read back the same format
I tried this (and other variations but for the life of me I can't figure this out):
binmode(OUT);
my $bit = pack("B1", '1');
my $byte = pack("H2", "02");
print OUT $bit . $byte;
Using a hex editor, I see I get 16 bits:
1000000000000020
What I want is 9 bits:
100000020
Also: Suppose I write out two of these patterns. That means I end up with 9 + 9 = 18 bits. I am not sure how to handle the last byte (padding?)
This is to compress and uncompress files, with space at premium. I was hoping there would be some simple idiomatic way to do this that I am not aware of.

Files are sequences of bytes. If you want to print out bits, you'll have to use some form of buffering.
my $bits = '';
$bits .= '1'; # Add 1 bit.
$bits .= unpack('B8', pack('C', 0x02)); # Add 8 bits.
$bits .= substr(unpack('B8', pack('C', 0x02)), -6); # Add 6 bits.
This prints as much as the buffer as possible:
my $len = ( length($bits) >> 3 ) << 3;
print($fh, pack('B*', substr($bits, 0, $len, '')));
You'll eventually need to pad the buffer so that you have a multiple of 8 bits in order to flush out the rest. You could simply pad with zeroes.
$bits .= "0" x ( -length($bits) % 8 );
If you're smart, though, you can come up with a padding scheme that can be used to indicate where the file actually ends. Remember, you can't rely on just the file length anymore. If you don't use a smart padding scheme, you'll have to use another method.
One example of a smart padding scheme would be:
$bits .= "0";
$bits .= "1" x ( -length($bits) % 8 );
Then, to unpad, remove all trailing 1 bits and the 0 bit before that.

You can use Bit::Vector to manage your bits and conversion with some more ease,
use Bit::Vector;
my $bit = Bit::Vector->new_Bin( 1, '1' );
my $byte = Bit::Vector->new_Bin( 8, '00000010' );
my $byte_9 = Bit::Vector->new_Bin( 9, '000000010' );
my $nineBits = Bit::Vector->new_Bin( 9, '100000000' );
my $carry = Bit::Vector->new_Bin( 9, '000000000' );
my $ORed = Bit::Vector->new_Bin( 9, '000000000' );
my $added = Bit::Vector->new_Bin( 9, '000000000' );
$ORed->Union($nineBits,$byte_9);
print "bit: 0x". $bit->to_Hex(). "\n";
print "byte 2: 0x". $byte->to_Hex(). "\n";
print "nineBits: 0x". $nineBits->to_Hex(). "\n";
print "nineBits: 0x". $nineBits->to_Bin(). "\n";
print "ORed bit and byte 0x". $ORed->to_Dec(). "\n";
open BINOUT, ">out.bin"
or die "\nCan't open out.bin for writing: $!\n";
binmode BINOUT;
print BINOUT pack ('B*', $ORed->to_Bin()) ."\n"
Here's the output
>perl bitstuff.pl
bit: 0x1
byte 2: 0x02
nineBits: 0x100
nineBits: 0x100000000
ORed bit and byte 0x-254
>cat out.bin
\201^#

Related

Right shift a binary no and get the shifted bits in a variable

I have a binary no say and I have a value in variable say value = 4.
I want to right shift the binary no by no of bits stored in "value" variable and then want to store the shifted bits in a variable and also want to save binary no obtained after right shift in another variable
Example:
binary_number = 110000001
value =4
then shifting no of bits in "value" to right (11000001 >> value)
Now I want to finally have two variables one containing the binary no after shift and one variable with shifted bits.
For above example the solution that I want is
right_shifted_binary = 11000
bits_shifted = 0001
I can not find a proper documentation for the problem as most of the problem are telling about arithmetic right shift.
Generate a bit mask based on $value and use the AND (&) operator:
#!/usr/bin/perl
use warnings;
use strict;
my $binary = 0b110000001;
my $value = 4;
# create mask with $value left-most bits 1
my $mask = ~(~0 << $value);
print "INPUT: ", unpack("B*", pack("N", $binary)), " ($binary)\n";
# right shift by $value bits
my $right_shifted_binary = $binary >> $value;
print "RIGHT: ", unpack("B*", pack("N", $right_shifted_binary)), " ($right_shifted_binary)\n";
# extract "remainder" of shift using mask
my $bits_shifted = $binary & $mask;
print "REMAINDER: ", unpack("B*", pack("N", $bits_shifted)), " ($bits_shifted)\n";
exit 0;
Test run:
$ perl dummy.pl
INPUT: 00000000000000000000000110000001 (385)
RIGHT: 00000000000000000000000000011000 (24)
REMAINDER: 00000000000000000000000000000001 (1)
# Proof
$ echo "24 * 16 + 1" | bc
385
If the binary number is given as string you can convert it to an integer first:
my $binary_string = "110000001";
my $binary = unpack("N", pack("B32", substr("0" x 32 . $binary_string, -32)));
But if it is already a string then the solution would be much simpler:
#!/usr/bin/perl
use warnings;
use strict;
my $binary_string = "110000001";
my $value = 4;
print "INPUT: $binary_string\n";
print "RIGHT: ", substr($binary_string, 0, -$value), "\n";
print "REMAINDER: ", substr($binary_string, -$value), "\n";
exit 0:
$ perl dummy.pl
INPUT: 110000001
RIGHT: 11000
REMAINDER: 0001

Perl: seek to and read bits, not bytes

In Perl, I want to seek to the nth bit (not byte) of a file and then read the next m bits, returned as a list of 0s and 1s.
Is there any easy way to do this?
I realize I can write a subroutine wrapping regular seek and read, but was wondering if there's a easier solution.
bitseek would grab a group of bits at one time.
seek($fh, int($bit_num/8), SEEK_SET);
my $offset = $bit_num % 8;
read($fh, my $buf, ceil(($offset+$num_bits)/8));
I'm looking for bit addressing, not bit-by-bit reading.
vec($bits, $offset+$bit_num, 1);
If n is a multiple of m, and m is one of 1, 2, 4, 8, 16, 32, and on some platforms, 64, you can read the whole file into a string and use vec for this.
(Admittedly a fairly constraining case, but a common one.)
Barring that, you'll just have to do the math; in pseudo-code:
discard = n % 8;
startbyte = (n - discard) / 8
bits = m + discard
bytes = int( (bits + 7) / 8 )
seek to startbyte
read bytes into string
#list = split //, unpack "${bits}b", string
splice( #list, 0, $discard )
splice( #list, $m, #list )
I ended up writing something like what #ikegami and #ysth suggested. For reference:
=item seek_bits($fh, $start, $num)
Seek to bit (not byte) $start in filehandle $fh, and return the next
$num bits (as a list).
=cut
sub seek_bits {
my($fh, $start, $num) = #_;
# the byte where this bit starts and the offset
my($fbyte, $offset) = (floor($start/8), $start%8);
# the number of bytes to read ($offset does affect this)
my($nbytes) = ceil($num+$offset)/8;
seek($fh, $fbyte, SEEK_SET);
read($fh, my($data), $nbytes);
my(#ret) = split(//, unpack("B*", $data));
# return requested bits
return #ret[$offset-1..$offset+$num];
}

Perl, Decoding bits stored in a binary file

I'm decoding a fixed-width binary file in perl. One of the fields is 1 byte in length and the 8 bits are encoded such that:
The 7th Bit unused, The 0-4th bit is used to indicate Field1. The 5-6th Bit is used to indicate Field2.
For example the field 0x27 will give Field1 = 7, Field2 = 1.
I'm reading all the fields into a hash table called raw{} and then decoding the values into another hash called processed{}.
The code (only shown for this one byte) is
while (read(FILE, $buff, 559)) {
%raw = (); # Hash for first-pass extractions
%processed = (); # Hash for cleaned up values, for output
(
......
$raw{'Field12'},
......
)= unpack('
.....
H2
.....
h*',$buff);
$prcoessed{'Field1'} = unpack("B8", pack("H8", $raw{'Field12'})) & 0x1f;
$prcoessed{'Field2'} = (unpack("B8", pack("H8", $raw{'Field12'})) >> 5) & 0x3;
However, I don't get the desired values. I get $prcoessed{'Field1'} = 15 and $prcoessed{'Field2'} = 3. Where am I going wrong?
0x1f is a number, you need to transform it into a character.
Instead of packs and unpacks, you can use ord and chr:
#!/usr/bin/perl
use warnings;
use strict;
my $char = chr 0x27;
my $field1 = $char & chr 0x1f;
my $field2 = $char & chr 0x60;
$field2 = chr(ord($field2) >> 5);
print ord, "\n" for $field1, $field2;
The same script using pack and unpack:
my $char = pack 'C', 0x27;
my $field1 = $char & pack 'C', 0x1f;
my $field2 = $char & pack 'C', 0x60;
$field2 = pack 'C', unpack('C', $field2) >> 5;
print unpack('C', $_), "\n" for $field1, $field2;

Decoding 3-byte integer in Perl

I'm reading a binary file format that starts out with 4 constant check bytes, followed by 3 octets that indicate how long the data portion of the record will be. I can decode this as follows:
read($fh, $x, 7) or do {
last if eof;
die "Can't read: $!";
};
my ($type, $l1, $l2, $l3) = unpack("a4 C3", $x);
my $length = $l1 << 16 | $l2 << 8 | $l3;
Is there a more direct way to read that 3-byte value, without intermediate variables? Something I'm missing in the pack specifications maybe? I haven't used pack very much except for hex encoding and other dilettantish pursuits.
You could insert a null byte into the string in order to be able to use the "N" format:
substr($x, 4, 0, "\0");
my ($type, $length) = unpack "a4 N", $x;
Edit: Or else unpack in two steps:
my ($type, $length) = unpack "a4 a3", $x;
$length = unpack "N", "\0" . $length;
my $type = unpack("a4", $x);
my $len = unpack("N", "\0".substr($x, 4));
or
my ($type, $plen) = unpack("a4 a3", $x);
my $len = unpack("N", "\0$plen");
No, unpack doesn't support 3-byte (or arbitrary length) integers, but you could use an unsigned 16-bit big-endian int to save a little work:
my ($type, $l1, $l23) = unpack("a4 Cn", $x);
my $length = $l1 << 16 | $l23;
Solution: Your method for getting the type is fine. However, I suggest that you unpack the length separately as a four-byte integer, then discard the first byte of those four bytes. This is more efficient even though it overlaps and discards the last byte of the type.
my $type = unpack("a4", $x);
my $length = unpack("x3N", $x); # skips the first 3 bytes of your original 7-byte string
$length = $length & 0xFFFFFF; # returns only the last 3 bytes of the four-byte integer

Perl + Word: Degree sign shows up preceded by A-circumflex

I'm generating a Word document in Perl, and I'd like to include the degree symbol (°) in the text I generate. If I generate the code like so:
$cell .= qq/\xB0/;
This works, and generates (for a value of $cell of 55): 55°
However, perlcritic complains at me when I do this and suggests I use this construction instead:
$cell .= qq/\N{DEGREE SIGN}/;
This does not work; it generates: 55°
Looking through my code in perl -d, I see that running the following code:
my $cell = 55;
$cell .= qq/\N{DEGREE SIGN}/; # the PBP way
print sprintf("%x\n", ord($_)) for split //, $cell;
my $cell = 55;
$cell .= qq/\xB0/; # the non-PBP way
print sprintf("%x\n", ord($_)) for split //, $cell;
results in:
35
35
b0
I'm outputting text to the Word document using Win32::OLE:
my #column_headings = #{ shift $args->{'data'} };
my #rows = #{ $args->{'data'} };
my $word = Win32::OLE->new( 'Word.Application', 'Quit' );
my $doc = $word->Documents->Add();
my $select = $word->Selection;
$csv->combine(#column_headings);
$select->InsertAfter( $csv->string );
$select->InsertParagraphAfter;
for my $row (#rows) {
$csv->combine( #{$row} );
$select->InsertAfter( $csv->string );
$select->InsertParagraphAfter;
}
my $table =
$select->ConvertToTable( { 'Separator' => wdSeparateByCommas } );
$table->Rows->First->Range->Font->{'Bold'} = 1;
$table->Rows->First->Range->ParagraphFormat->{'Alignment'} =
wdAlignParagraphCenter;
#{ $table->Rows->First->Borders(wdBorderBottom) }{qw/LineStyle LineWidth/}
= ( wdLineStyleDouble, wdLineWidth100pt );
$doc->SaveAs( { 'Filename' => Cwd::getcwd . '/test.doc' } );
What can I do to get rid of the extraneous Â?
Of course, you are suffering from encoding issues. The degree sign is U+00B0, but this serializes to UTF-8 C2 B0, which renders as ° — if this multi byte character is correctly decoded as utf-8. If you were decoding the bytes as a single-byte encoding (say … cp1252), then the bytes would be considered seperate, and would display  °.
Now clearly, the solution is either to tell Perl to transform the unicode string to a byte string of cp1252 chars (the horror!). You will find the my $bytestring = Encode::encode("cp1252", $string) function interesting here.
Or you tell the document that it will consider itself UTF-8. I don't know how you would do that, but there has to be an option somewhere. This would actually be preferable, as there are thousands of characters that (unlike the °) don't fit into cp1252. Like the degree Celsius ℃ (U+2103) or degree Fahrenheit ℉ (U+2109) characters ;-)