Convert raw hex to readable hex in perl?

I have written a little perl app to read the serial port. When I run my little script I receive data, but it's printed as unreadable characters; it shows up like *I??. However if I do
perl test.pl | hexdump
I get the required data. And the hex data makes sense to me. Does anyone know how I can get this output using perl without using hexdump?
Right now I use print ($data) to print my data.

"Raw hex" doesn't mean anything; what you've got is a string of bytes that you want to convert to a textual representation. To do that you can use unpack. For example,
my $bytes = read_from_serial_port();
my $hex = unpack 'H*', $bytes;
H puts the high nybble of each byte first, which is the conventional order (the byte 0xC0 prints as c0). Use h instead of H if you want the low nybble first.
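As a minimal sketch of the unpack approach, the snippet below prints a byte string both as one run of hex digits and as space-separated byte pairs. The sample bytes are a hypothetical stand-in for whatever the serial port returns, and it assumes the high-nybble-first order (H) is the one wanted.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the bytes read from the serial port.
my $bytes = "\x2A\x49\x00\xFF";

# 'H*' yields two hex digits per byte, high nybble first.
my $hex = unpack 'H*', $bytes;

# '(H2)*' returns one two-digit string per byte, so they can be joined with spaces.
my $spaced = join ' ', unpack '(H2)*', $bytes;

print "$hex\n";     # 2a4900ff
print "$spaced\n";  # 2a 49 00 ff
```

The grouped template '(H2)*' needs Perl 5.8 or later; on older versions a sprintf over ord values would do the same job.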

Related

Perl ord and chr working with unicode

To my horror I've just found out that chr doesn't work with Unicode, although it does something. The man page is all but clear
Returns the character represented by that NUMBER in the character set. For example, chr(65) is "A" in either ASCII or Unicode, and chr(0x263a) is a Unicode smiley face.
Indeed I can print a smiley using
perl -e 'print chr(0x263a)'
but things like chr(0x00C0) do not work. I see that my perl v5.10.1 is a bit ancient, but when I paste various strange letters in the source code, everything's fine.
I've tried funny things like use utf8 and use encoding 'utf8', I haven't tried funny things like use v5.12 and use feature 'unicode_strings' as they don't work with my version, I was fooling around with Encode::decode to find out that I need no decoding as I have no byte array to decode. I've read much more documentation than ever before, and found quite a few interesting things but nothing helpful. It looks like a sort of the Unicode Bug but there's no usable solution given. Moreover I don't care about the whole string semantics, all I need is a trivial function.
So how can I convert a number into a string consisting of the single character corresponding with it, so that for example real_chr(0xC0) eq 'À' holds?
The first answer I've got explains quite everything about IO, but I still don't understand why
#!/usr/bin/perl -w
use strict;
use utf8;
use encoding 'utf8';
print chr(0x00C0) eq 'À' ? 'eq1' : 'ne1', " - ", chr(0x263a) eq '☺' ? 'eq1' : 'ne1', "\n";
print 'À' =~ /\w/ ? "match1" : "no_match1", " - ", chr(0x00C0) =~ /\w/ ? "match2" : "no_match2", "\n";
prints
ne1 - eq1
match1 - no_match2
It means that the manually entered 'À' differs from chr(0x00C0). Moreover, the former is a word constituent character (correct!) while the latter is not (but should be!).
First,
perl -le'print chr(0x263A);'
is buggy. Perl even tells you as much:
Wide character in print at -e line 1.
That doesn't qualify as "working". So while they differ in how they fail, neither of the following gives you what you want:
perl -le'print chr(0x263A);'
perl -le'print chr(0x00C0);'
To properly output the UTF-8 encoding of those Unicode code points, you need to tell Perl to encode the Unicode code points with UTF-8.
$ perl -le'use open ":std", ":encoding(UTF-8)"; print chr(0x263A);'
☺
$ perl -le'use open ":std", ":encoding(UTF-8)"; print chr(0x00C0);'
À
Now on to the "why".
File handles can only transmit bytes, so unless you tell it otherwise, a Perl file handle expects bytes. That means the string you provide to print cannot contain anything but bytes, or in other words, it cannot contain characters over 255. The output is exactly what you provide:
$ perl -e'print map chr, 0x00, 0x65, 0xC0, 0xF0' | od -t x1
0000000 00 65 c0 f0
0000004
This is useful. This is different than what you want, but that doesn't make it wrong. If you want something different, you just need to tell Perl what you want.
By adding an :encoding layer, the handle now expects a string of Unicode characters, or as I call it, "text". The layer tells Perl how to convert the text into bytes.
$ perl -e'
use open ":std", ":encoding(UTF-8)";
print map chr, 0x00, 0x65, 0xC0, 0xF0, 0x263a;
' | od -t x1
0000000 00 65 c3 80 c3 b0 e2 98 ba
0000011
You're right that chr doesn't know or care about Unicode. Like length, substr, ord and reverse, chr implements a basic string function, not a Unicode function. That doesn't mean it can't be used to work with text strings. As you've seen, the problem wasn't with chr but with what you did with the string after you built it.
A character is an element of a string, and a character is a number. That means a string is just a sequence of numbers. Whether you treat those numbers as Unicode code points (text), packed IP addresses or temperature measurements is entirely up to you and the functions to which you pass the strings.
Here are a few examples of operators that do assign meaning to the strings they receive as operands:
m// expects a string of Unicode code points.
connect expects a sequence of bytes that represent a sockaddr_in structure.
print with a handle without :encoding expects a sequence of bytes.
print with a handle with :encoding expects a sequence of Unicode code points.
etc
So how can I convert a number into a string consisting of the single character corresponding with it, so that for example real_chr(0xC0) eq 'À' holds?
chr(0xC0) eq 'À' does hold. Did you remember to tell Perl you encoded your source code using UTF-8 by using use utf8;? If you didn't tell Perl, Perl actually sees a two-character string on the RHS.
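To illustrate the two-character claim without depending on how this file itself is saved, here is a small sketch that writes the UTF-8 bytes of U+00C0 as escapes. The variable names are made up for illustration; the byte string stands in for what Perl sees when a UTF-8-encoded literal appears in source without use utf8, and decoding it stands in for what use utf8 does to such a literal.

```perl
use strict;
use warnings;
use Encode qw(decode);

# What a UTF-8-encoded literal for U+00C0 looks like to Perl without `use utf8`:
# two separate characters, 0xC3 and 0x80.
my $literal_without_pragma = "\xC3\x80";

# Decoding yields the single character U+00C0, as `use utf8` would have.
my $literal_with_pragma = decode('UTF-8', $literal_without_pragma);

my $raw_matches     = chr(0xC0) eq $literal_without_pragma ? 1 : 0;
my $decoded_matches = chr(0xC0) eq $literal_with_pragma    ? 1 : 0;

printf "lengths: %d vs %d\n",
    length $literal_without_pragma, length $literal_with_pragma;  # lengths: 2 vs 1
printf "raw eq: %d, decoded eq: %d\n",
    $raw_matches, $decoded_matches;                               # raw eq: 0, decoded eq: 1
```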
Regarding the question you've added:
There are problems with the encoding pragma. I recommend against using it. Instead, use
use open ':std', ':encoding(UTF-8)';
That'll fix one of the problems. The other problem you are encountering is with
chr(0x00C0) =~ /\w/
It's a known bug that's intentionally left broken for backwards compatibility reasons. That is, unless you request a more recent version of the language as follows:
use 5.014; # use 5.012; *might* suffice.
A workaround that works as far back as 5.8:
my $x = chr(0x00C0);
utf8::upgrade($x);
$x =~ /\w/

How to Make Perl Properly Interpret Unicode Bytes?

I have a really crappy file full of unicode bytes that I'm trying to clean up. Some examples from the file are as follows:
ブラック
roler coaster
digital social party
big bellie
cornacopia
\xd0\xb7\xd1\x83\xd0\xb1\xd0\xbd\xd0\xb0\xd1\x8f \xd1\x89\xd0\xb5\xd1\x82\xd0\xba\xd0\xb0
Now, what I'd like to do is convert all those ugly byte points into real unicode text. So, the above would be output as:
ブラック
roler coaster
digital social party
big bellie
cornacopia
зубная щетка
I've been banging my head against how to do this in Perl for like an hour now, and I'm out of good ideas. If you have one, I'd love to hear it.
It's UTF-8
$ perl -E'
use open ":std", ":locale";
use Encode qw( decode );
$_ = q{\xd0\xb7\xd1\x83\xd0\xb1\xd0\xbd\xd0\xb0\xd1\x8f }.
q{\xd1\x89\xd0\xb5\xd1\x82\xd0\xba\xd0\xb0};
s/\\x(..)/chr hex $1/seg;
$_ = decode("UTF-8", $_);
say;
'
зубная щетка

In Perl, can I treat a string as a byte array?

In Perl, is it appropriate to use a string as a byte array containing 8-bit data? All the documentation I can find on this subject focuses on 7-bit strings.
For instance, if I read some data from a binary file into $data
my $data;
open FILE, "<", $filepath;
binmode FILE;
read FILE, $data, 1024;
and I want to get the first byte out, is substr($data,1,1) appropriate? (again, assuming it is 8-bit data)
I come from a mostly C background, and I am used to passing a char pointer to a read() function. My problem might be that I don't understand what the underlying representation of a string is in Perl.
The bundled documentation for the read command, reproduced here, provides a lot of information that is relevant to your question.
read FILEHANDLE,SCALAR,LENGTH,OFFSET
read FILEHANDLE,SCALAR,LENGTH
Attempts to read LENGTH characters of data into variable SCALAR
from the specified FILEHANDLE. Returns the number of
characters actually read, 0 at end of file, or undef if there
was an error (in the latter case $! is also set). SCALAR will
be grown or shrunk so that the last character actually read is
the last character of the scalar after the read.
An OFFSET may be specified to place the read data at some place
in the string other than the beginning. A negative OFFSET
specifies placement at that many characters counting backwards
from the end of the string. A positive OFFSET greater than the
length of SCALAR results in the string being padded to the
required size with "\0" bytes before the result of the read is
appended.
The call is actually implemented in terms of either Perl's or
system's fread() call. To get a true read(2) system call, see
"sysread".
Note the characters: depending on the status of the filehandle,
either (8-bit) bytes or characters are read. By default all
filehandles operate on bytes, but for example if the filehandle
has been opened with the ":utf8" I/O layer (see "open", and the
"open" pragma, open), the I/O will operate on UTF-8 encoded
Unicode characters, not bytes. Similarly for the ":encoding"
pragma: in that case pretty much any characters can be read.
See perldoc -f pack and perldoc -f unpack for how to treat strings as byte arrays.
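A short sketch of that pack/unpack route, assuming the data is already in a scalar (the sample bytes here are invented for illustration): unpack with the C template turns the string into a list of byte values, and pack turns the list back into a string.

```perl
use strict;
use warnings;

# Invented sample data: the first six bytes of a PNG signature.
my $data = "\x89PNG\x0D\x0A";

# 'C*' unpacks every byte as an unsigned 8-bit integer.
my @bytes = unpack 'C*', $data;

# Array indexes are zero-based, so the first byte is element 0.
my $first = $bytes[0];

# pack with the same template rebuilds the original string.
my $rebuilt = pack 'C*', @bytes;

print join(',', @bytes), "\n";                               # 137,80,78,71,13,10
print $rebuilt eq $data ? "round trip ok\n" : "mismatch\n";  # round trip ok
```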
You probably want to use sysopen and sysread if you want to read bytes from binary file.
See also perlopentut.
Whether this is appropriate or necessary depends on what exactly you are trying to do.
#!/usr/bin/perl -l
use strict; use warnings;
use autodie;
use Fcntl;
sysopen my $bin, 'test.png', O_RDONLY;
sysread $bin, my $header, 4;
print map { sprintf '%02x', ord($_) } split //, $header;
Output:
C:\Temp> t
89504e47
Strings are strings of "characters", which are bigger than a byte.1 You can store bytes in them and manipulate them as though they are characters, taking substrs of them and so on, and so long as you're just manipulating entities in memory, everything is pretty peachy. The data storage is weird, but that's mostly not your problem.2
When you try to read and write from files, the fact that your characters might not map to bytes becomes important and interesting. Not to mention annoying. This annoyance is actually made a bit worse by Perl trying to do what you want in the common case: If all the characters in the string fit into a byte and you happen to be on a non-Windows OS, you don't actually have to do anything special to read and write bytes. Perl will complain, however, if you have stored a non-byte-sized character and try to write it without giving it a clue about what to do with it.
This is getting a little far afield, largely because encoding is a large and confusing topic. Let me leave it off there with some references: Look at Encode(3perl), open(3perl), perldoc open, and perldoc binmode for lots of hilarious and gory details.
So the summary answer is "Yes, you can treat strings as though they contained bytes if they do in fact contain bytes, which you can assure by only reading and writing bytes.".
1: Or pedantically, "which can express a larger range of values than a byte, though they are stored as bytes when that is convenient". I think.
2: For the record, strings in Perl are internally represented by a data structure called a 'PV' which in addition to a character pointer knows things like the length of the string and the current value of pos.3
3: Well, it will start storing the current value of pos if it starts being interesting. See also
use Devel::Peek;
my $x = "bluh bluh bluh bluh";
Dump($x);
$x =~ /bluh/mg;
Dump($x);
$x =~ /bluh/mg;
Dump($x);
It might help more if you tell us what you are trying to do with the byte array. There are various ways to work with binary data, and each lends itself to a different set of tools.
Do you want to convert the data into a Perl array? If so, pack and unpack are a good start. split could also come in handy.
Do you want to access individual elements of the string without unpacking it? If so, substr is fast and will do the trick for 8-bit data. If you want other bit depths, take a look at the vec function, which treats a string as a bit vector.
Do you want to scan the string and convert certain bytes to other bytes? Then the s/// or tr/// constructs might be useful.
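The three options above can be sketched on one small byte string. This is only an illustration with made-up sample data; it assumes the string holds plain 8-bit bytes.

```perl
use strict;
use warnings;

# Made-up sample: four bytes, two of them NUL.
my $str = "\x00\x00\x84\x00";

# substr offsets are zero-based, so the third byte sits at offset 2.
my $third = ord substr($str, 2, 1);

# vec with 8 bits per element addresses the same byte by index.
my $same = vec($str, 2, 8);

# vec with 16 bits per element reads big-endian words; element 1 is bytes 2..3.
my $word = vec($str, 1, 16);

# tr/// rewrites bytes in place; here every NUL in a copy becomes 0xFF.
(my $patched = $str) =~ tr/\x00/\xFF/;
my $patched_hex = unpack 'H*', $patched;

print "$third $same $word $patched_hex\n";  # 132 132 33792 ffff84ff
```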
Allow me just to post a small example about treating string as binary array - since I myself found it difficult to believe that something called "substr" would handle null bytes; but seemingly it does - below is a snippet of a perl debugger terminal session (with both string and array/list approaches):
$ perl -d
Loading DB routines from perl5db.pl version 1.32
Editor support available.
Enter h or `h h' for help, or `man perldebug' for more help.
^D
Debugged program terminated. Use q to quit or R to restart,
use o inhibit_exit to avoid stopping after program termination,
h q, h R or h o to get additional info.
DB<1> $str="\x00\x00\x84\x00"
DB<2> print $str
�
DB<3> print unpack("H*",$str) # show content of $str as hex via `unpack`
00008400
DB<4> $str2=substr($str,2,2)
DB<5> print unpack("H*",$str2)
8400
DB<6> $str2=substr($str,1,3)
DB<7> print unpack("H*",$str2)
008400
[...]
DB<30> @stra=split('',$str); print @stra # convert string to array (by splitting at empty string)
�
DB<31> print unpack("H*",$stra[3]) # print indiv. elems. of array as hex
00
DB<32> print unpack("H*",$stra[2])
84
DB<33> print unpack("H*",$stra[1])
00
DB<34> print unpack("H*",$stra[0])
00
DB<35> print unpack("H*",join('',@stra[1..3])) # print only portion of array/list via indexes (using the range [two dots] operator)
008400

How can I access the nth byte of a binary scalar in Perl?

Thanks to everyone in advance.
I'd like to access the nth byte of a binary scalar. For example you could get all the file data in one scalar variable...
Imagine that the binary data is collected into scalar...
open(SOURCE, "<", "wl.jpg");
my $thisByteData = undef;
while(<SOURCE>){$thisByteData .= $_;}
close SOURCE;
$thisByteData is raw binary data. When I use length($thisByteData) I get the byte count back, so Perl does know how big it is. My question is how can I access the Nth byte?
Side note: My function is going to receive this binary scalar; it's in my function that I want to access the Nth byte. The help regarding how to collect this data is appreciated but not what I'm looking for. Whichever way the other programmer wants to collect the binary data is up to them, my job is to get the Nth byte when it's passed to me :)
Again thanks so much for the help to all!
Thanks to @muteW who has gotten me further than ever. I guess I'm not understanding unpack(...) correctly.
print(unpack("N1", $thisByteData));
print(unpack("x N1", $thisByteData));
print(unpack("x0 N1", $thisByteData));
Is returning the following:
4292411360
3640647680
4292411360
I would assume those 3 lines would all access the same (first) byte. Using no "x" just an "x" and "x$pos" is giving unexpected results.
I also tried this...
print(unpack("x0 N1", $thisByteData));
print(unpack("x1 N1", $thisByteData));
print(unpack("x2 N1", $thisByteData));
Which returns... the same thing as the last test...
4292411360
3640647680
4292411360
I'm definitely missing something about how unpack works.
If I do this...
print(oct("0x". unpack("x0 H2", $thisByteData)));
print(oct("0x". unpack("x1 H2", $thisByteData)));
print(oct("0x". unpack("x2 H2", $thisByteData)));
I get what I was expecting...
255
216
255
Can't unpack give this to me itself without having to use oct()?
As a side note: I think I'm getting the 2's complement of these byte integers when using "x$pos N1". I'm expecting these as the first 3 bytes.
255
216
255
Thanks again for the help to all.
Special thanks to @brian d foy and @muteW ... I now know how to access the Nth byte of my binary scalar using unpack(...). I have a new problem to solve now, which isn't related to this question. Again thanks for all the help guys!
This gave me the desired result...
print(unpack("x0 C1", $thisByteData));
print(unpack("x1 C1", $thisByteData));
print(unpack("x2 C1", $thisByteData));
unpack(...) has a ton of options so I recommend that anyone else who reads this read the pack/unpack documentation to get the byte data result of their choice. I also didn't try using the Tie options @brian mentioned, I wanted to keep the code as simple as possible.
If you have the data in a string and you want to get to a certain byte, use substr, as long as you are treating the string as bytes to start with.
However, you can read it directly from the file without all this string nonsense people have been filling your head with. :) Open the file with sysopen and the right options, use seek to put yourself where you want, and read what you need with sysread.
You skip all the workarounds for the stuff that open and readline are trying to do for you. If you're just going to turn off all of their features, don't even use them.
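A rough sketch of that file-based route follows. The helper name nth_byte_of_file is hypothetical, the sample file is created on the fly so the snippet is self-contained, and sysseek is used rather than seek because it is the unbuffered counterpart that pairs with sysread.

```perl
use strict;
use warnings;
use Fcntl qw(O_RDONLY SEEK_SET);
use File::Temp qw(tempfile);

# Create a small sample file holding four JPEG-like header bytes.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "\xFF\xD8\xFF\xE0";
close $tmp or die "close failed: $!";

# Hypothetical helper: return the byte at zero-based offset $pos as an integer,
# or undef if the offset is at or past the end of the file.
sub nth_byte_of_file {
    my ($file, $pos) = @_;
    sysopen my $fh, $file, O_RDONLY or die "Cannot open $file: $!";
    binmode $fh;  # harmless on Unix, avoids newline translation elsewhere
    defined sysseek($fh, $pos, SEEK_SET) or die "Cannot seek in $file: $!";
    my $got = sysread $fh, my $byte, 1;
    close $fh;
    return undef unless $got;
    return ord $byte;
}

my $second = nth_byte_of_file($path, 1);
print "$second\n";  # 216
```

Nothing beyond the requested byte is read, which matters mostly for large files; for small ones, substr or unpack on an in-memory string is simpler.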
Since you already have the file contents in $thisByteData you could use pack/unpack to access the n-th byte.
sub getNthByte {
my ($pos) = @_;
return unpack("x$pos b1", $thisByteData);
}
# x$pos - skips over $pos bytes
# b1 - returns one bit of the next byte as a bit string (b8 would return all eight bits, lowest bit first)
Read through the pack documentation to get a sense of the parameters you can use in the template to get different return values.
EDIT -
Your comment below shows that you are missing the high-order nybble ('f') of the first byte. I am not sure why this is happening but here is an alternative method that works, in the meantime I'll have a further look into unpack's behavior.
sub getNthByte {
my ($pos) = @_;
return unpack("x[$pos]H2", $binData);
}
(my $hex = unpack("H*", $binData)) =~ s/(..)/$1 /g;
#To convert the entire data in one go
Using this the output for the first four bytes are - 0xff 0xd8 0xff 0xe0 which matches the documentation.
I think the correct answer involves pack/unpack, but this might also work:
use bytes;
while( $bytestring =~ /(.)/g ){
my $byte = $1;
...
}
"use bytes" ensures that you never see characters -- but if you have a character string and are processing it as bytes, you are doing something wrong. Perl's internal character encoding is undefined, so the data you see in the string under "use bytes" is nearly meaningless.
The Perl built-in variable $/ (or $INPUT_RECORD_SEPARATOR if you use English) controls Perl's idea of a "line". By default it is set to "\n", so lines are separated by newline characters (duh), but you can change this to any other string. Or change it to a reference to a number:
$/ = \1;
while(<FILE>) {
# read file
}
Setting it to a reference to a number will tell Perl that a "line" is that number of bytes.
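Here is a small sketch of that record-size trick. To keep it self-contained it reads from an in-memory filehandle over made-up sample bytes instead of a real file; with a real binary file you would also want binmode on the handle.

```perl
use strict;
use warnings;

# Made-up sample bytes standing in for file contents.
my $data = "\xFF\xD8\xFF\xE0";

# An in-memory filehandle lets the sketch run without touching the disk.
open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";

my @values;
{
    # A reference to an integer makes each "line" exactly that many bytes.
    local $/ = \1;
    while (my $chunk = <$fh>) {
        push @values, ord $chunk;
    }
}
close $fh;

print "@values\n";  # 255 216 255 224
```

The while (my $chunk = <$fh>) form tests definedness rather than truth, so a chunk holding "0" or a NUL byte does not end the loop early.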
Now, what exactly are you trying to do? There's probably a number of modules that will do what you're trying to do, and possibly more efficiently. If you're just trying to learn how to do it, go ahead, but if you have a specific task in mind, consider not reinventing the wheel (unless you want to).
EDIT: Thanks to jrockway in the comments...
If you have Unicode data, this may not read one byte, but one character, but if this happens, you should be able to use bytes; to turn off automatic byte-to-character translation.
Now, you say you want to read the data all at once and then pass it to a function. Let's do this:
my $data;
{
local $/;
$data = <FILE>;
}
Or this:
my $data = join("", <FILE>);
Or some will suggest the File::Slurp module, but I think it's a bit overkill. However, let's get an entire file into an array of bytes:
use bytes;
...
my @data = split(//, join("", <FILE>));
And then we have an array of bytes that we can pass to a function. Like?
Without knowing much more about what you're trying to do with your data, something like this will iterate over the bytes in the file:
open(SOURCE, "wl.jpg");
my $byte;
while(read SOURCE, $byte, 1) {
# Do something with the contents of $byte
}
close SOURCE;
Be careful with the line-by-line reading and concatenation used in your example; without binmode you may end up with newline conversions, which is definitely not what you want to happen while reading binary files. (It's also inefficient to continually expand the scalar while reading it.) This is the idiomatic way to schlep an entire file into a Perl scalar:
open(SOURCE, "<", "wl.jpg");
binmode SOURCE; # no newline translation on platforms that do it
local $/ = undef;
my $big_binary_data = <SOURCE>;
close SOURCE;

How can I convert an input file to UTF-8 encoding in Perl?

I already know how to convert the non-UTF-8-encoded content of a file to UTF-8 line by line, using something like the following code:
# outfile.txt is in GB-2312 encode
open my $filter,"<",'c:/outfile.txt';
while(<$filter>){
#convert each line of outfile.txt to UTF-8 encoding
$_ = Encode::decode("gb2312", $_);
...}
But I think Perl can directly encode the whole input file to UTF-8 format, so I've tried something like
#outfile.txt is in GB-2312 encode
open my $filter,"<:utf8",'c:/outfile.txt';
(Perl says something like: utf8 "\xD4" does not map to Unicode)
and
open my $filter,"<",'c:/outfile.txt';
$filter = Encode::decode("gb2312", $filter);
(Perl says "readline() on unopened filehandle!")
They don't work. But is there some way to directly convert the input file to UTF-8 encode?
Update:
Looks like things are not as simple as I thought. I now can convert the input file to UTF-8 code in a roundabout way. I first open the input file and then encode the content of it to UTF-8 and then output to a new file and then open the new file for further processing. This is the code:
open my $filter,'<:encoding(gb2312)','c:/outfile.txt';
open my $filter_new, '+>:utf8', 'c:/outfile_new.txt';
print $filter_new $_ while <$filter>;
while (<$filter_new>){
...
}
But this is too much work, and it is even more troublesome than simply decoding the content of $filter line by line.
I think I misunderstood your question. I think what you want to do is read a file in a non-UTF-8 encoding, then play with the data as UTF-8 in your program. That's something much easier. After you read the data with the right encoding, Perl represents it internally as UTF-8. So, just do what you have to do.
When you write it back out, use whatever encoding you want to save it as. However, you don't have to put it back in a file to use it.
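A minimal sketch of that read-then-use flow, under the assumption that the input really is GB2312. The sample file is written on the fly with the GB2312 bytes for two Chinese characters (U+4E2D and U+6587), and the names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

# Write a sample GB2312-encoded file: two characters plus a newline.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "\xD6\xD0\xCE\xC4\n";
close $tmp or die "close failed: $!";

# The :encoding layer decodes on read, so $line holds characters, not bytes.
open my $in, '<:encoding(gb2312)', $path or die "Cannot open $path: $!";
my $line = <$in>;
close $in;
chomp $line;

my $char_count = length $line;                       # counts characters
my $utf8_hex   = unpack 'H*', encode('UTF-8', $line);  # bytes if saved as UTF-8

print "$char_count\n";  # 2
print "$utf8_hex\n";    # e4b8ade69687
```

In a real program the explicit encode call is usually unnecessary: opening the output handle with '>:encoding(UTF-8)' performs the same conversion when you print.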
old answer
The Perl I/O layers only read the data assuming it's already properly encoded. It's not going to convert encoding for you. By telling open to use utf8, you're telling it that it already is utf8.
You have to use the Encode module just as you've shown (unless you want to write your own I/O layer). You can convert bytes to UTF-8, or if you know the encoding, you can convert from one encoding to another. Since it looks like you already know the encoding, you might want the from_to() function.
If you're just starting out with Perl and Unicode, go through Juerd's Perl Unicode Advice before you do anything.
The :encoding layer will return UTF-8, suitable for perl's use. That is, perl will recognize each character as a character, even if they are multiple bytes. Depending on what you are going to do next with the data, this may be adequate.
But if you are doing something with the data where perl will try to downgrade it from utf8, you either need to tell perl not to (for instance, doing a binmode(STDOUT, ":utf8") to tell perl that output to stdout should be utf8), or you need to have perl treat your utf8 as binary data (interpreting each byte separately, and knowing nothing about the utf8 characters.)
To do that, all you need is to apply an additional layer to your open:
open my $foo, "<:encoding(gb2312):bytes", ...;
Note that the output of the following will be the same:
perl -we'open my $foo, "<:encoding(gb2312):bytes", "foo"; $bar = <$foo>; print $bar'
perl -CO -we'open my $foo, "<:encoding(gb2312)", "foo"; $bar = <$foo>; print $bar'
but in one case, perl knows that data read is utf8 (and so length($bar) will report the number of utf8 characters) and has to be explicitly told (by -CO) that STDOUT will accept utf8, and in the other, perl makes no assumptions about the data (and so length($bar) will report the number of bytes), and just prints it out as is.