In Perl, can I treat a string as a byte array?

In Perl, is it appropriate to use a string as a byte array containing 8-bit data? All the documentation I can find on this subject focuses on 7-bit strings.
For instance, if I read some data from a binary file into $data
my $data;
open FILE, "<", $filepath;
binmode FILE;
read FILE, $data, 1024;
and I want to get the first byte out, is substr($data,0,1) appropriate? (again, assuming it is 8-bit data)
I come from a mostly C background, and I am used to passing a char pointer to a read() function. My problem might be that I don't understand what the underlying representation of a string is in Perl.

The bundled documentation for the read command, reproduced here, provides a lot of information that is relevant to your question.
read FILEHANDLE,SCALAR,LENGTH,OFFSET
read FILEHANDLE,SCALAR,LENGTH
Attempts to read LENGTH characters of data into variable SCALAR
from the specified FILEHANDLE. Returns the number of
characters actually read, 0 at end of file, or undef if there
was an error (in the latter case $! is also set). SCALAR will
be grown or shrunk so that the last character actually read is
the last character of the scalar after the read.
An OFFSET may be specified to place the read data at some place
in the string other than the beginning. A negative OFFSET
specifies placement at that many characters counting backwards
from the end of the string. A positive OFFSET greater than the
length of SCALAR results in the string being padded to the
required size with "\0" bytes before the result of the read is
appended.
The call is actually implemented in terms of either Perl's or
system's fread() call. To get a true read(2) system call, see
"sysread".
Note the characters: depending on the status of the filehandle,
either (8-bit) bytes or characters are read. By default all
filehandles operate on bytes, but for example if the filehandle
has been opened with the ":utf8" I/O layer (see "open", and the
"open" pragma, open), the I/O will operate on UTF-8 encoded
Unicode characters, not bytes. Similarly for the ":encoding"
pragma: in that case pretty much any characters can be read.
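As a minimal sketch of the documented behaviour (return value, end of file, and OFFSET), the loop below reads a hypothetical 10-byte payload in chunks of 4. The in-memory filehandle and the variable names are assumptions made for illustration, not part of the question.

```perl
use strict;
use warnings;

# Hypothetical 10-byte payload, exposed through an in-memory filehandle
# so the sketch runs without a real file.
my $payload = join '', map { chr } 0x80 .. 0x89;
open my $fh, '<', \$payload or die "open: $!";
binmode $fh;                      # no layers: read() counts bytes

my @sizes;
my $buffer = '';
while (1) {
    # OFFSET = current length, so each chunk is appended to $buffer.
    my $got = read($fh, $buffer, 4, length($buffer));
    die "read: $!" unless defined $got;
    last if $got == 0;            # 0 means end of file
    push @sizes, $got;
}
close $fh;
```

The sizes collected in @sizes are the per-call return values of read, and $buffer ends up holding the whole payload.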

See perldoc -f pack and perldoc -f unpack for how to treat strings as byte arrays.
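A short sketch of what that looks like in practice, assuming a hypothetical six-byte sample in $data: the C template converts between a string and a list of integers in the range 0..255, and H* renders the string as hex digits.

```perl
use strict;
use warnings;

my $data = "\x89PNG\x0d\x0a";          # hypothetical sample bytes

my @bytes = unpack 'C*', $data;        # string -> list of integers 0..255
my $first = $bytes[0];                 # value of the first byte
my $again = pack 'C*', @bytes;         # list of integers -> string
my $hex   = unpack 'H*', $data;        # whole string as hex digits
```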

You probably want to use sysopen and sysread if you want to read bytes from a binary file.
See also perlopentut.
Whether this is appropriate or necessary depends on what exactly you are trying to do.
#!/usr/bin/perl -l
use strict; use warnings;
use autodie;
use Fcntl;
sysopen my $bin, 'test.png', O_RDONLY;
sysread $bin, my $header, 4;
print map { sprintf '%02x', ord($_) } split //, $header;
Output:
C:\Temp> t
89504e47

Strings are strings of "characters", which are bigger than a byte.[1] You can store bytes in them and manipulate them as though they are characters, taking substrs of them and so on, and so long as you're just manipulating entities in memory, everything is pretty peachy. The data storage is weird, but that's mostly not your problem.[2]
When you try to read and write from files, the fact that your characters might not map to bytes becomes important and interesting. Not to mention annoying. This annoyance is actually made a bit worse by Perl trying to do what you want in the common case: If all the characters in the string fit into a byte and you happen to be on a non-Windows OS, you don't actually have to do anything special to read and write bytes. Perl will complain, however, if you have stored a non-byte-sized character and try to write it without giving it a clue about what to do with it.
This is getting a little far afield, largely because encoding is a large and confusing topic. Let me leave it off there with some references: Look at Encode(3perl), open(3perl), perldoc open, and perldoc binmode for lots of hilarious and gory details.
So the summary answer is "Yes, you can treat strings as though they contained bytes if they do in fact contain bytes, which you can assure by only reading and writing bytes.".
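One way to check that a string really does contain only bytes is to look for any character above 255. The helper name is_byte_string is made up for this sketch; it is not a Perl built-in.

```perl
use strict;
use warnings;

# Hypothetical helper: true if every character of the string fits in a byte.
sub is_byte_string {
    my ($string) = @_;
    return $string !~ /[^\x00-\xFF]/;
}

my $bytes = "\x00\x84\xFF";            # only values 0..255
my $text  = "caf\x{E9} \x{263A}";      # U+263A does not fit in 8 bits
```

Here is_byte_string($bytes) is true and is_byte_string($text) is false, so only the first string can safely be written to a handle that has no encoding layer.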
1: Or pedantically, "which can express a larger range of values than a byte, though they are stored as bytes when that is convenient". I think.
2: For the record, strings in Perl are internally represented by a data structure called a 'PV' which in addition to a character pointer knows things like the length of the string and the current value of pos.[3]
3: Well, it will start storing the current value of pos if it starts being interesting. See also
use Devel::Peek;
my $x = "bluh bluh bluh bluh";
Dump($x);
$x =~ /bluh/mg;
Dump($x);
$x =~ /bluh/mg;
Dump($x);

It might help more if you tell us what you are trying to do with the byte array. There are various ways to work with binary data, and each lends itself to a different set of tools.
Do you want to convert the data into a Perl array? If so, pack and unpack are a good start. split could also come in handy.
Do you want to access individual elements of the string without unpacking it? If so, substr is fast and will do the trick for 8-bit data. If you want other bit depths, take a look at the vec function, which treats a string as a bit vector.
Do you want to scan the string and convert certain bytes to other bytes? Then the s/// or tr/// constructs might be useful.
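The second and third options can be sketched like this, using a hypothetical four-byte string. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $data = "\x00\x00\x84\x00";             # hypothetical sample bytes

# Individual elements without unpacking the whole string.
my $third_via_substr = ord substr($data, 2, 1);   # byte at offset 2
my $third_via_vec    = vec($data, 2, 8);          # same byte, as an 8-bit element

# Counting and translating bytes with tr///.
my $zero_count = ($data =~ tr/\x00//);            # count NUL bytes, $data unchanged
(my $patched = $data) =~ tr/\x00/\xFF/;           # copy with NUL replaced by 0xFF
```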

Allow me just to post a small example about treating string as binary array - since I myself found it difficult to believe that something called "substr" would handle null bytes; but seemingly it does - below is a snippet of a perl debugger terminal session (with both string and array/list approaches):
$ perl -d
Loading DB routines from perl5db.pl version 1.32
Editor support available.
Enter h or `h h' for help, or `man perldebug' for more help.
^D
Debugged program terminated. Use q to quit or R to restart,
use o inhibit_exit to avoid stopping after program termination,
h q, h R or h o to get additional info.
DB<1> $str="\x00\x00\x84\x00"
DB<2> print $str
�
DB<3> print unpack("H*",$str) # show content of $str as hex via `unpack`
00008400
DB<4> $str2=substr($str,2,2)
DB<5> print unpack("H*",$str2)
8400
DB<6> $str2=substr($str,1,3)
DB<7> print unpack("H*",$str2)
008400
[...]
DB<30> @stra=split('',$str); print @stra # convert string to array (by splitting at empty string)
�
DB<31> print unpack("H*",$stra[3]) # print indiv. elems. of array as hex
00
DB<32> print unpack("H*",$stra[2])
84
DB<33> print unpack("H*",$stra[1])
00
DB<34> print unpack("H*",$stra[0])
00
DB<35> print unpack("H*",join('',@stra[1..3])) # print only portion of array/list via indexes (using the range [two dots] operator)
008400

Related

Why does opening a file in utf 8 mode change the behaviour of seek?

Here's a simple text file with no special characters, called utf8.txt with the following content.
foo bar baz
one two three
The new line follows the Unix convention (one byte), so the entire size of the file is 26 = 11 + 1 + 13 + 1 (11 = foo bar baz, 13 = one two three).
If I read the file with the following perl script
use warnings;
use strict;
open (my $f, '<', 'utf8.txt');
<$f>;
seek($f, -4, 1);
my $xyz = <$f>;
print "$xyz<";
it prints
baz
<
This is expected, since the seek command goes back four characters, the new line and the three belonging to baz.
If I now change the open statement to
open (my $f, '<:encoding(UTF-8)', 'utf8.txt');
the output changes to
baz
<
that is, the seek command goes back five characters (or it goes back four characters but skips the new line).
Is this behaviour expected? Is there a flag or something to turn this behaviour off?
Edit
As per Andrzej A. Filip's suggestion, when I add print join("+",PerlIO::get_layers($f)),"\n"; just after the open statement, it prints in the "normal" open case: unix+crlf and in the open...encoding case: unix+crlf+encoding(utf-8-strict)+utf8.
For those looking for a TL;DR, seek and tell work in bytes. seek should always be okay if it uses values returned by tell
The documentation for Perl's seek operator is rather clumsy but it has this
seek FILEHANDLE,POSITION,WHENCE
The values for WHENCE are 0 to set the new position in bytes to POSITION ...
and
Note the in bytes: even if the filehandle has been set to operate on characters (for example by using the :encoding(utf8) open layer), tell() will return byte offsets, not character offsets (because implementing that would render seek() and tell() rather slow).
While this alludes to the problem it isn't stated explicitly
seek and tell use and return byte offsets within the file, regardless of any other PerlIO layer. That means they work on similar terms to sysread which is independent of Perl's streaming IO, although seek and tell respect Perl's buffering whereas sysread does not
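A minimal sketch of the safe pattern, under the assumption that you only ever seek to offsets previously returned by tell. The temporary file and its contents are made up for the example.

```perl
use strict;
use warnings;
use Fcntl qw(:seek);
use File::Temp qw(tempfile);

# Write a small UTF-8 file; the path comes from File::Temp, so nothing
# here refers to a real file from the question.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out, ':encoding(UTF-8)';
print {$out} "h\x{E9}llo\n", "w\x{F6}rld\n";
close $out or die "close: $!";

open my $in, '<:encoding(UTF-8)', $path or die "open: $!";
my $first  = <$in>;
my $mark   = tell($in);           # byte offset where the second line starts
my $second = <$in>;

seek($in, $mark, SEEK_SET) or die "seek: $!";
my $second_again = <$in>;         # the same line, read a second time
close $in;
```

Because $mark came from tell, it is already a byte offset, so no character-to-byte arithmetic is needed.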
It isn't just :utf8 or :encoding layers that confuse what units you may expect: the Windows :crlf layer also has an effect because it converts CR LF pairs to LF before streaming input and after output. That clearly causes a discrepancy for every line of text, but as far as I can tell this isn't mentioned in Perl's documentation; Linux and OSX being the pushy ugly sisters of pretty much every other Perl platform
Let's look at your code. I've run this code (it's identical to the code in your question, I promise) on my Windows 10 and Windows 7 systems, and even booted a VM with Windows 98 to try the same thing
use warnings;
use strict;
open (my $f, '<', 'utf8.txt');
print join("+",PerlIO::get_layers($f)),"\n";
<$f>;
seek($f, -4, 1);
my $xyz = <$f>;
print "$xyz<";
All of them output this
unix+crlf
az
which is what I expected, and not what you say you get. This is central since we're talking about single-byte offsets
Your file contains this
foo bar baz\r\none two three
The first read takes us to 13 characters from the start. Perl has read foo bar baz\r\n and removed the CR, handing foo bar baz\n to the program, which it discards. Fine
Now you seek($f, -4, 1)
That third parameter 1 is SEEK_CUR, which means you want to move the current read pointer relative to the current position.
Please don't use magic numbers. Perl is pretty much exposing the underlying C file library to you here and you need to be responsible with it. Passing 1 as the third parameter is arcane and irresponsible. No one who reads your code will know what you have written
Do this
use Fcntl ':seek';
and then you can write more intelligible code like this
seek($f, -4, SEEK_CUR);
as it gives the rest of us a chance to understand your code. At least people can google SEEK_CUR, whereas trying the same with 1 would be worse than fruitless.
So you're seeking to 13 bytes, add -4 which is 9. That's just after the b of baz, and so I get az
That's what all my runs of your code produced on all of those different Windows machines. I have to think that the problem is with your code control and not with Perl, except for the issue with CRLF
I hope that this explained some anomalies for you, but please check your code and your results.

Why does PerlIO::encoding insert an additional utf8 layer?

The documentation for PerlIO says:
:encoding Use :encoding(ENCODING) either in open() or binmode() to
install a layer that transparently does character set and encoding
transformations, for example from Shift-JIS to Unicode. Note that
under stdio an :encoding also enables :utf8 . See PerlIO::encoding for
more information.
Here is a test script:
use feature qw(say);
use strict;
use warnings;
my $fn = 'test.txt';
for my $mode ('>', '>:encoding(utf8)' ) {
open( my $fh, $mode, $fn);
say join ' ', (PerlIO::get_layers($fh));
close $fh;
}
Output is:
unix perlio
unix perlio encoding(utf8) utf8
Why do I get the additional utf8 layer here?
For reasons that require knowledge of Perl internals.
When you store the number 4 in a scalar, it could be stored as a signed integer, an unsigned integer or a floating point number. You don't know which is used, and you don't have any reason to care which one is used. Perl will automatically convert as needed.
It's the same situation for strings. There are two storage formats for them. Your name is the perfect example. "Håkon Hægland" can be stored as
48.E5.6B.6F.6E.20.48.E6.67.6C.61.6E.64
or as
48.C3.A5.6B.6F.6E.20.48.C3.A6.67.6C.61.6E.64
A flag called UTF8 indicates the choice of storage format. This is transparent to the user (or at least should be).
$ perl -Mutf8 -E'
$_ = "Håkon Hægland";
utf8::downgrade( $d = $_ ); # Converts to the first format mentioned above.
utf8::upgrade( $u = $_ ); # Converts to the second format mentioned above.
say $d eq $u ? "eq" : "ne";
'
eq
While it's transparent to you, it's far from transparent to Perl itself. Whenever you manipulate a string, Perl has to check in which storage format it's stored. For example, if you concatenate two strings, Perl has to make sure they use the same storage format before performing the concatenation, converting one if necessary.
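To see the flag itself rather than just the equality, utf8::is_utf8 reports which storage format a scalar currently uses. This is only a sketch; the name is written with \x escapes so the result does not depend on how the source file is encoded.

```perl
use strict;
use warnings;

my $name = "H\x{E5}kon H\x{E6}gland";    # escapes avoid depending on source encoding

my $d = $name; utf8::downgrade($d);      # one byte per character internally
my $u = $name; utf8::upgrade($u);        # UTF-8 internally, UTF8 flag set

my $flag_d = utf8::is_utf8($d) ? 1 : 0;  # storage format of $d
my $flag_u = utf8::is_utf8($u) ? 1 : 0;  # storage format of $u
```

The two scalars still compare equal and have the same length in characters; only the internal flag differs.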
It's also not transparent to PerlIO. PerlIO, like the rest of Perl, has to deal with the bytes in the string buffer rather than what you see at the Perl level. Sometimes, those bytes are destined to be the string buffer of scalars with the UTF8 flag cleared, and sometimes, those bytes are destined to be the string buffer of scalars with the UTF8 flag set. PerlIO needs to track that. Rather than carrying a flag along from layer to layer, PerlIO adds a :utf8 layer when the scalars obtained by reading from the handle need to have the UTF8 flag set.
So, :encoding converts the bytes that form
Håkon Hægland
from the specified encoding to
48.C3.A5.6B.6F.6E.20.48.C3.A6.67.6C.61.6E.64
And :utf8 causes the scalar to have the UTF8 flag set, causing the resulting scalar to contain
U+0048.00E5.006B.006F.006E.0020.0048.00E6.0067.006C.0061.006E.0064

Perl ord and chr working with unicode

To my horror I've just found out that chr doesn't work with Unicode, although it does something. The man page is anything but clear
Returns the character represented by that NUMBER in the character set. For example, chr(65) is "A" in either ASCII or Unicode, and chr(0x263a) is a Unicode smiley face.
Indeed I can print a smiley using
perl -e 'print chr(0x263a)'
but things like chr(0x00C0) do not work. I see that my perl v5.10.1 is a bit ancient, but when I paste various strange letters in the source code, everything's fine.
I've tried funny things like use utf8 and use encoding 'utf8', I haven't tried funny things like use v5.12 and use feature 'unicode_strings' as they don't work with my version, I was fooling around with Encode::decode to find out that I need no decoding as I have no byte array to decode. I've read much more documentation than ever before, and found quite a few interesting things but nothing helpful. It looks like a sort of the Unicode Bug but there's no usable solution given. Moreover I don't care about the whole string semantics, all I need is a trivial function.
So how can I convert a number into a string consisting of the single character corresponding with it, so that for example real_chr(0xC0) eq 'À' holds?
The first answer I've got explains quite everything about IO, but I still don't understand why
#!/usr/bin/perl -w
use strict;
use utf8;
use encoding 'utf8';
print chr(0x00C0) eq 'À' ? 'eq1' : 'ne1', " - ", chr(0x263a) eq '☺' ? 'eq1' : 'ne1', "\n";
print 'À' =~ /\w/ ? "match1" : "no_match1", " - ", chr(0x00C0) =~ /\w/ ? "match2" : "no_match2", "\n";
prints
ne1 - eq1
match1 - no_match2
It means that the manually entered 'À' differs from chr(0x00C0). Moreover, the former is a word constituent character (correct!) while the latter is not (but should be!).
First,
perl -le'print chr(0x263A);'
is buggy. Perl even tells you as much:
Wide character in print at -e line 1.
That doesn't qualify as "working". So while they differ in how they fail to provide what you want, neither of the following gives you what you want:
perl -le'print chr(0x263A);'
perl -le'print chr(0x00C0);'
To properly output the UTF-8 encoding of those Unicode code points, you need to tell Perl to encode the Unicode code points with UTF-8.
$ perl -le'use open ":std", ":encoding(UTF-8)"; print chr(0x263A);'
☺
$ perl -le'use open ":std", ":encoding(UTF-8)"; print chr(0x00C0);'
À
Now on to the "why".
File handles can only transmit bytes, so unless you tell it otherwise, Perl file handles expect bytes. That means the string you provide to print cannot contain anything but bytes, or in other words, it cannot contain characters over 255. The output is exactly what you provide:
$ perl -e'print map chr, 0x00, 0x65, 0xC0, 0xF0' | od -t x1
0000000 00 65 c0 f0
0000004
This is useful. This is different than what you want, but that doesn't make it wrong. If you want something different, you just need to tell Perl what you want.
By adding an :encoding layer, the handle now expects a string of Unicode characters, or as I call it, "text". The layer tells Perl how to convert the text into bytes.
$ perl -e'
use open ":std", ":encoding(UTF-8)";
print map chr, 0x00, 0x65, 0xC0, 0xF0, 0x263a;
' | od -t x1
0000000 00 65 c3 80 c3 b0 e2 98 ba
0000011
You're right that chr doesn't know or care about Unicode. Like length, substr, ord and reverse, chr implements a basic string function, not a Unicode function. That doesn't mean it can't be used to work with text strings. As you've seen, the problem wasn't with chr but with what you did with the string after you built it.
A character is an element of a string, and a character is a number. That means a string is just a sequence of numbers. Whether you treat those numbers as Unicode code points (text), packed IP addresses or temperature measurements is entirely up to you and the functions to which you pass the strings.
Here are a few examples of operators that do assign meaning to the strings they receive as operands:
m// expects a string of Unicode code points.
connect expects a sequence of bytes that represent a sockaddr_in structure.
print with a handle without :encoding expects a sequence of bytes.
print with a handle with :encoding expects a sequence of Unicode code points.
etc
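The same distinction can be made explicit without any filehandle by calling Encode directly. This is a sketch under the assumption that UTF-8 is the encoding you want: $text is a string of code points and $bytes is the string of bytes that represents it.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $text  = chr(0xC0) . chr(0x263A);     # two characters: U+00C0 and U+263A
my $bytes = encode('UTF-8', $text);      # the UTF-8 bytes for those characters
my $back  = decode('UTF-8', $bytes);     # the two characters again

my $text_length  = length $text;         # counts characters
my $bytes_length = length $bytes;        # counts bytes
my $bytes_as_hex = unpack 'H*', $bytes;  # hex digits of the encoded form
```

Both are ordinary Perl strings; what differs is the meaning you assign to the numbers in them.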
So how can I convert a number into a string consisting of the single character corresponding with it, so that for example real_chr(0xC0) eq 'À' holds?
chr(0xC0) eq 'À' does hold. Did you remember to tell Perl you encoded your source code using UTF-8 by using use utf8;? If you didn't tell Perl, Perl actually sees a two-character string on the RHS.
Regarding the question you've added:
There are problems with the encoding pragma. I recommend against using it. Instead, use
use open ':std', ':encoding(UTF-8)';
That'll fix one of the problems. The other problem you are encountering is with
chr(0x00C0) =~ /\w/
It's a known bug that's intentionally left broken for backwards compatibility reasons. That is, unless you request a more recent version of the language as follows:
use 5.014; # use 5.012; *might* suffice.
A workaround that works as far back as 5.8:
my $x = chr(0x00C0);
utf8::upgrade($x);
$x =~ /\w/

Convert raw hex to readable hex in perl?

I have written a little perl app that reads the serial port. When I run my little script I receive data but it's written in unreadable signs; it shows like *I??. However if I do
perl test.pl | hexdump
I get the required data. And the hex data makes sense to me. Does anyone know how I can get this output using perl without using hexdump?
Right now I use print ($data) to print my data.
"Raw hex" doesn't mean anything; what you've got is a string of bytes that you want to convert to a textual representation. To do that you can use unpack. For example,
my $bytes = read_from_serial_port();
my $hex = unpack 'H*', $bytes;
H puts the high nybble of each byte first, so the byte 0x2A comes out as 2a; use h instead if you want the low nybble first (a2).
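To make the difference between the two templates concrete, here is a small sketch with a hypothetical four-byte string standing in for the serial data. The last line shows one way to get space-separated output.

```perl
use strict;
use warnings;

my $bytes = "\x2a\x49\xff\x00";          # hypothetical stand-in for serial data

my $high_first = unpack 'H*', $bytes;    # high nybble of each byte first
my $low_first  = unpack 'h*', $bytes;    # low nybble of each byte first

# One two-digit group per byte, separated by spaces.
my $spaced = join ' ', map { sprintf '%02x', $_ } unpack 'C*', $bytes;
```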

How can I access the nth byte of a binary scalar in Perl?

Thanks to everyone in advance.
I'd like to access the nth byte of a binary scalar. For example you could get all the file data in one scalar variable...
Imagine that the binary data is collected into scalar...
open(SOURCE, "<", "wl.jpg");
my $thisByteData = undef;
while(<SOURCE>){$thisByteData .= $_;}
close SOURCE;
$thisByteData is raw binary data. When I use length($thisByteData) I get the byte count back, so Perl does know how big it is. My question is how can I access the Nth byte?
Side note: My function is going to receive this binary scalar, its in my function that I want to access the Nth byte. The help regarding how to collect this data is appreciated but not what I'm looking for. Whichever way the other programmer wants to collect the binary data is up to them, my job is to get the Nth byte when its passed to me :)
Again thanks so much for the help to all!
Thanks to @muteW who has gotten me further than ever. I guess I'm not understanding unpack(...) correctly.
print(unpack("N1", $thisByteData));
print(unpack("x N1", $thisByteData));
print(unpack("x0 N1", $thisByteData));
Is returning the following:
4292411360
3640647680
4292411360
I would assume those 3 lines would all access the same (first) byte. Using no "x" just an "x" and "x$pos" is giving unexpected results.
I also tried this...
print(unpack("x0 N1", $thisByteData));
print(unpack("x1 N1", $thisByteData));
print(unpack("x2 N1", $thisByteData));
Which returns... the same thing as the last test...
4292411360
3640647680
4292411360
I'm definitely missing something about how unpack works.
If I do this...
print(oct("0x". unpack("x0 H2", $thisByteData)));
print(oct("0x". unpack("x1 H2", $thisByteData)));
print(oct("0x". unpack("x2 H2", $thisByteData)));
I get what I was expecting...
255
216
255
Can't unpack give this to me itself without having to use oct()?
As a side note: I think I'm getting the 2's complement of these byte integers when using "x$pos N1". I'm expecting these as the first 3 bytes.
255
216
255
Thanks again for the help to all.
Special thanks to @brian d foy and @muteW ... I now know how to access the Nth byte of my binary scalar using unpack(...). I have a new problem to solve now, which isn't related to this question. Again thanks for all the help guys!
This gave me the desired result...
print(unpack("x0 C1", $thisByteData));
print(unpack("x1 C1", $thisByteData));
print(unpack("x2 C1", $thisByteData));
unpack(...) has a ton of options so I recommend that anyone else who reads this read the pack/unpack documentation to get the byte data result of their choice. I also didn't try using the Tie options @brian mentioned, I wanted to keep the code as simple as possible.
If you have the data in a string and you want to get to a certain byte, use substr, as long as you are treating the string as bytes to start with.
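A minimal sketch of that approach, with a hypothetical helper nth_byte (0-based) and a sample string built from the leading bytes that the outputs in the question imply. It also shows why N gave such large numbers: N reads four bytes as one big-endian 32-bit integer, whereas C reads a single byte.

```perl
use strict;
use warnings;

# Hypothetical helper; $n counts from 0. Returns undef when out of range.
sub nth_byte {
    my ($data, $n) = @_;
    return undef if $n < 0 || $n >= length($data);
    return ord(substr($data, $n, 1));
}

my $jpeg_start  = "\xff\xd8\xff\xe0\x00";  # leading bytes implied by the question's output
my @first_three = map { nth_byte($jpeg_start, $_) } 0 .. 2;
my $via_unpack  = unpack 'x1 C', $jpeg_start;   # skip one byte, read one byte
my $four_bytes  = unpack 'N', $jpeg_start;      # four bytes as one 32-bit integer
```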
However, you can read it directly from the file without all this string nonsense people have been filling your head with. :) Open the file with sysopen and the right options, use sysseek to put yourself where you want, and read what you need with sysread. (Use sysseek rather than seek here: seek works on Perl's buffered I/O, which sysread bypasses.)
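Roughly, that could look like the sketch below. The helper name byte_at and the temporary file are assumptions made so the example is self-contained; the positioning is done with sysseek because it pairs with sysread.

```perl
use strict;
use warnings;
use Fcntl qw(O_RDONLY :seek);
use File::Temp qw(tempfile);

# Create a throwaway file so the sketch does not depend on wl.jpg.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "\xff\xd8\xff\xe0";
close $out or die "close: $!";

# Hypothetical helper: fetch the byte at offset $n using unbuffered I/O.
# Returns undef if the offset is at or past the end of the file.
sub byte_at {
    my ($file, $n) = @_;
    sysopen(my $fh, $file, O_RDONLY) or die "sysopen $file: $!";
    binmode $fh;
    defined sysseek($fh, $n, SEEK_SET) or die "sysseek: $!";
    my $got = sysread($fh, my $byte, 1);
    die "sysread: $!" unless defined $got;
    close $fh;
    return $got ? ord($byte) : undef;
}
```

Calling byte_at($path, 1) returns the value of the second byte in the file.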
You skip all the workarounds for the stuff that open and readline are trying to do for you. If you're just going to turn off all of their features, don't even use them.
Since you already have the file contents in $thisByteData you could use pack/unpack to access the n-th byte.
sub getNthByte {
my ($pos) = @_;
return unpack("x$pos b1", $thisByteData);
}
#x$pos - treats $pos bytes as null bytes(effectively skipping over them)
#b1 - returns the next byte as a bit string
Read through the pack documentation to get a sense of the parameters you can use in the template to get different return values.
EDIT -
Your comment below shows that you are missing the high-order nybble ('f') of the first byte. I am not sure why this is happening but here is an alternative method that works, in the meantime I'll have a further look into unpack's behavior.
sub getNthByte {
my ($pos) = @_;
return unpack("x[$pos]H2", $binData);
}
(my $hex = unpack("H*", $binData)) =~ s/(..)/$1 /g;
#To convert the entire data in one go
Using this the output for the first four bytes are - 0xff 0xd8 0xff 0xe0 which matches the documentation.
I think the correct answer involves pack/unpack, but this might also work:
use bytes;
while( $bytestring =~ /(.)/g ){
my $byte = $1;
...
}
"use bytes" ensures that you never see characters -- but if you have a character string and are processing it as bytes, you are doing something wrong. Perl's internal character encoding is undefined, so the data you see in the string under "use bytes" is nearly meaningless.
The Perl built-in variable $/ (or $INPUT_RECORD_SEPARATOR if you're using the English module) controls Perl's idea of a "line". By default it is set to "\n", so lines are separated by newline characters (duh), but you can change this to any other string. Or change it to a reference to a number:
$/ = \1;
while(<FILE>) {
# read file
}
Setting it to a reference to a number will tell Perl that a "line" is that number of bytes.
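As a sketch of record-sized reads, the example below uses an in-memory filehandle over a hypothetical five-byte string and a record size of 2, so the last record comes back short. Localizing $/ inside a block keeps the change from leaking into other code.

```perl
use strict;
use warnings;

my $data = "\x01\x02\x03\x04\x05";      # hypothetical sample bytes
open my $fh, '<', \$data or die "open: $!";
binmode $fh;

my @records;
{
    local $/ = \2;                      # each "line" is at most 2 bytes
    while (my $record = <$fh>) {
        push @records, unpack 'H*', $record;   # keep each record as hex digits
    }
}
close $fh;
```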
Now, what exactly are you trying to do? There's probably a number of modules that will do what you're trying to do, and possibly more efficiently. If you're just trying to learn how to do it, go ahead, but if you have a specific task in mind, consider not reinventing the wheel (unless you want to).
EDIT: Thanks to jrockway in the comments...
If you have Unicode data, this may not read one byte, but one character, but if this happens, you should be able to use bytes; to turn off automatic byte-to-character translation.
Now, you say you want to read the data all at once and then pass it to a function. Let's do this:
my $data;
{
local $/;
$data = <FILE>;
}
Or this:
my $data = join("", <FILE>);
Or some will suggest the File::Slurp module, but I think it's a bit overkill. However, let's get an entire file into an array of bytes:
use bytes;
...
my @data = split(//, join("", <FILE>));
And then we have an array of bytes that we can pass to a function. Like?
Without knowing much more about what you're trying to do with your data, something like this will iterate over the bytes in the file:
open(SOURCE, "<", "wl.jpg") or die "Can't open wl.jpg: $!";
binmode SOURCE; # raw bytes, no newline translation
my $byte;
while(read SOURCE, $byte, 1) {
# Do something with the contents of $byte
}
close SOURCE;
Be careful with the concatenation used in your example; you may end up with newline conversions, which is definitely not what you want to happen while reading binary files. (It's also inefficient to continually expand the scalar while reading it.) This is the idiomatic way to schlep an entire file into a Perl scalar:
open(SOURCE, "<", "wl.jpg") or die "Can't open wl.jpg: $!";
binmode SOURCE; # avoid newline translation on platforms that do it
local $/ = undef;
my $big_binary_data = <SOURCE>;
close SOURCE;