I have a really crappy file full of unicode bytes that I'm trying to clean up. Some examples from the file are as follows:
ブラック
roler coaster
digital social party
big bellie
cornacopia
\xd0\xb7\xd1\x83\xd0\xb1\xd0\xbd\xd0\xb0\xd1\x8f \xd1\x89\xd0\xb5\xd1\x82\xd0\xba\xd0\xb0
Now, what I'd like to do is convert all those ugly byte points into real unicode text. So, the above would be output as:
ブラック
roler coaster
digital social party
big bellie
cornacopia
зубная щетка
I've been banging my head against how to do this in Perl for like an hour now, and I'm out of good ideas. If you have one, I'd love to hear it.
It's UTF-8
$ perl -E'
use open ":std", ":locale";
use Encode qw( decode );
$_ = q{\xd0\xb7\xd1\x83\xd0\xb1\xd0\xbd\xd0\xb0\xd1\x8f }.
q{\xd1\x89\xd0\xb5\xd1\x82\xd0\xba\xd0\xb0};
s/\\x(..)/chr hex $1/seg;
$_ = decode("UTF-8", $_);
say;
'
зубная щетка
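For a whole file, the same two steps can be wrapped in a small helper. This is only a minimal sketch, assuming each line holds raw bytes (real UTF-8 possibly mixed with literal \xNN escapes). The names unescape_line and @samples are made up for illustration; only decode comes from Encode.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: turn literal \xNN escapes into the bytes they name,
# then decode the whole byte string as UTF-8 to get a text string.
sub unescape_line {
    my ($bytes) = @_;
    $bytes =~ s/\\x([0-9A-Fa-f]{2})/chr hex $1/ge;
    return decode('UTF-8', $bytes);
}

# Stand-in data; in a real script these would be lines read from a
# handle opened without any :encoding layer.
my @samples = ( 'roler coaster', '\xd0\xb7\xd1\x83\xd0\xb1' );
my @decoded = map { unescape_line($_) } @samples;

binmode STDOUT, ':encoding(UTF-8)';   # encode text on the way out
print "$_\n" for @decoded;
```

Lines without escapes pass through unchanged, so the helper can be applied to every line of the file.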
Related
I hate to ask a question that's undoubtedly been answered a dozen times before, but I find encoding issues confusing and am having a hard time matching up other people's q/a with my own problem.
I'm pulling information from a json file online, and my perl script isn't handling unicode escape characters properly.
Script looks like this:
use LWP::Simple;
use JSON;
my $url = ______;
my $json = get($url);
my $data = decode_json($json);
foreach my $i (0 .. $#{ $data->{People} }) {
print "$data->{People}[$i]{first_name} $data->{People}[$i]{last_name}\n";
}
It encounters jsons that look like this: "first_name":"F\u00e9lix","last_name":"Cat" and prints them like this: FΘlix Cat
I'm sure there's a trivial fix here, but I'm stumped. I'd really appreciate any help you can provide.
You didn't tell Perl how to encode the output. You need to add
use open ':std', ':encoding(XXX)';
where XXX is the encoding the terminal expects.
On unix boxes, you normally need
use open ':std', ':encoding(UTF-8)';
On Windows boxes, you normally need
use Win32 qw( );
use open ':std', ':encoding(cp'.Win32::GetConsoleOutputCP().')';
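To see that decode_json already did its job and only the output step was missing, here is a small sketch. It assumes the core module JSON::PP as a stand-in for JSON, and the inline $json string is a made-up sample modelled on the one in the question.

```perl
use strict;
use warnings;
use JSON::PP qw( decode_json );
use Encode qw( encode );

# Made-up sample; the single quotes keep \u00e9 as a literal JSON escape.
my $json = '{"People":[{"first_name":"F\u00e9lix","last_name":"Cat"}]}';
my $data = decode_json($json);

# decode_json returns text: the second character is code point U+00E9.
my $name  = $data->{People}[0]{first_name};
my $bytes = encode('UTF-8', $name);   # what a UTF-8 terminal expects

printf "%d characters, %d bytes\n", length($name), length($bytes);
# prints "5 characters, 6 bytes"
```

The use open line above does this encoding step for you on every print to STDOUT.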
To my horror I've just found out that chr doesn't work with Unicode, although it does something. The man page is anything but clear:
Returns the character represented by that NUMBER in the character set. For example, chr(65) is "A" in either ASCII or Unicode, and chr(0x263a) is a Unicode smiley face.
Indeed I can print a smiley using
perl -e 'print chr(0x263a)'
but things like chr(0x00C0) do not work. I see that my perl v5.10.1 is a bit ancient, but when I paste various strange letters in the source code, everything's fine.
I've tried funny things like use utf8 and use encoding 'utf8'. I haven't tried use v5.12 or use feature 'unicode_strings', as they don't work with my version. I fooled around with Encode::decode, only to find out that I need no decoding, as I have no byte array to decode. I've read much more documentation than ever before and found quite a few interesting things, but nothing helpful. It looks like a variant of the Unicode Bug, but no usable solution is given. Moreover, I don't care about whole-string semantics; all I need is a trivial function.
So how can I convert a number into a string consisting of the single character corresponding with it, so that for example real_chr(0xC0) eq 'À' holds?
The first answer I've got explains quite everything about IO, but I still don't understand why
#!/usr/bin/perl -w
use strict;
use utf8;
use encoding 'utf8';
print chr(0x00C0) eq 'À' ? 'eq1' : 'ne1', " - ", chr(0x263a) eq '☺' ? 'eq1' : 'ne1', "\n";
print 'À' =~ /\w/ ? "match1" : "no_match1", " - ", chr(0x00C0) =~ /\w/ ? "match2" : "no_match2", "\n";
prints
ne1 - eq1
match1 - no_match2
It means that the manually entered 'À' differs from chr(0x00C0). Moreover, the former is a word constituent character (correct!) while the latter is not (but should be!).
First,
perl -le'print chr(0x263A);'
is buggy. Perl even tells you as much:
Wide character in print at -e line 1.
That doesn't qualify as "working". So while they differ in how they fail, neither of the following gives you what you want:
perl -le'print chr(0x263A);'
perl -le'print chr(0x00C0);'
To properly output the UTF-8 encoding of those Unicode code points, you need to tell Perl to encode the code points with UTF-8.
$ perl -le'use open ":std", ":encoding(UTF-8)"; print chr(0x263A);'
☺
$ perl -le'use open ":std", ":encoding(UTF-8)"; print chr(0x00C0);'
À
Now on to the "why".
File handles can only transmit bytes, so unless you tell it otherwise, Perl expects the strings you print to be bytes. That means the string you provide to print cannot contain anything but bytes, or in other words, it cannot contain characters over 255. The output is exactly what you provide:
$ perl -e'print map chr, 0x00, 0x65, 0xC0, 0xF0' | od -t x1
0000000 00 65 c0 f0
0000004
This is useful. This is different than what you want, but that doesn't make it wrong. If you want something different, you just need to tell Perl what you want.
By adding an :encoding layer, the handle now expects a string of Unicode characters, or as I call it, "text". The layer tells Perl how to convert the text into bytes.
$ perl -e'
use open ":std", ":encoding(UTF-8)";
print map chr, 0x00, 0x65, 0xC0, 0xF0, 0x263a;
' | od -t x1
0000000 00 65 c3 80 c3 b0 e2 98 ba
0000011
You're right that chr doesn't know or care about Unicode. Like length, substr, ord and reverse, chr implements a basic string function, not a Unicode function. That doesn't mean it can't be used to work with text strings. As you've seen, the problem wasn't with chr but with what you did with the string after you built it.
A character is an element of a string, and a character is a number. That means a string is just a sequence of numbers. Whether you treat those numbers as Unicode code points (text), packed IP addresses or temperature measurements is entirely up to you and the functions to which you pass the strings.
Here are a few examples of operators that do assign meaning to the strings they receive as operands:
m// expects a string of Unicode code points.
connect expects a sequence of bytes that represents a sockaddr_in structure.
print with a handle without :encoding expects a sequence of bytes.
print with a handle with :encoding expects a sequence of Unicode code points.
etc
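The "a string is just a sequence of numbers" point can be checked directly. A minimal sketch; the variable names are mine, and encode from Encode stands in for what an :encoding(UTF-8) layer does on output:

```perl
use strict;
use warnings;
use Encode qw( encode );

# Build a string from three numbers; chr attaches no meaning to them.
my $s = join '', map { chr } 0x65, 0xC0, 0x263A;

# ord gives the same numbers back, one per character.
my @numbers = map { ord } split //, $s;

# Only here are the numbers treated as Unicode code points.
my $utf8_hex = unpack 'H*', encode('UTF-8', $s);

printf "%d characters: %s\n", length($s),
    join(' ', map { sprintf 'U+%04X', $_ } @numbers);
print "UTF-8 bytes: $utf8_hex\n";   # UTF-8 bytes: 65c380e298ba
```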
So how can I convert a number into a string consisting of the single character corresponding with it, so that for example real_chr(0xC0) eq 'À' holds?
chr(0xC0) eq 'À' does hold. Did you remember to tell Perl you encoded your source code using UTF-8 by using use utf8;? If you didn't tell Perl, Perl actually sees a two-character string on the RHS.
Regarding the question you've added:
There are problems with the encoding pragma. I recommend against using it. Instead, use
use open ':std', ':encoding(UTF-8)';
That'll fix one of the problems. The other problem you are encountering is with
chr(0x00C0) =~ /\w/
It's a known bug that's intentionally left broken for backwards compatibility reasons. That is, unless you request a more recent version of the language as follows:
use 5.014; # use 5.012; *might* suffice.
A workaround that works as far back as 5.8:
my $x = chr(0x00C0);
utf8::upgrade($x);
$x =~ /\w/
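To convince yourself that upgrading changes only the internal storage and not the string, here is a small sketch. The variable names are mine, and the /u regex flag shown at the end needs 5.14 or later, so it won't help on 5.10.1.

```perl
use strict;
use warnings;

my $plain    = chr(0xC0);
my $upgraded = chr(0xC0);
utf8::upgrade($upgraded);    # changes the internal storage only

# Both are still the same one-character string.
my $same = $plain eq $upgraded ? 1 : 0;

# The upgraded copy gets Unicode rules in the regex engine.
my $upgraded_is_word = $upgraded =~ /\w/ ? 1 : 0;

# On 5.14+, /u asks for Unicode rules regardless of internal storage.
my $u_flag_is_word = $plain =~ /\w/u ? 1 : 0;

print "same=$same upgraded=$upgraded_is_word u_flag=$u_flag_is_word\n";
# prints "same=1 upgraded=1 u_flag=1"
```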
I have written a little Perl app that reads the serial port. When I run my little script I receive data, but it's written in unreadable signs; it shows up like *I??. However if I do
perl test.pl | hexdump
I get the required data. And the hex data makes sense to me. Does anyone know how I can get this output using perl without using hexdump?
Right now I use print ($data) to print my data.
"Raw hex" doesn't mean anything; what you've got is a string of bytes that you want to convert to a textual representation. To do that you can use unpack. For example,
my $bytes = read_from_serial_port();
my $hex = unpack 'H*', $bytes;
H puts the high nybble of each byte first, which is the order hexdump shows. Use h instead of H if you want the low nybble first.
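If you want the output spaced out byte by byte, roughly like a hex dump, here is a minimal sketch. The $bytes value is made-up stand-in data, since I don't have your serial port.

```perl
use strict;
use warnings;

# Stand-in for the bytes read from the serial port.
my $bytes = "\x2a\x49\xff\x00";

# One long hex string, high nybble of each byte first.
my $hex = unpack 'H*', $bytes;

# One two-digit group per byte, separated by spaces.
my $spaced = join ' ', map { sprintf '%02x', $_ } unpack 'C*', $bytes;

print "$hex\n";      # 2a49ff00
print "$spaced\n";   # 2a 49 ff 00
```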
For example, I want to create a file called sample.bin and put a number, like 255, so that 255 is saved in the file as little-endian, FF 00. Or 3826 to F2 0E.
I tried using binmode, as the perldoc said.
The Perl pack function will return "binary" data according to a template.
open(my $out, '>:raw', 'sample.bin') or die "Unable to open: $!";
print $out pack('s<', 255);
close($out);
In the above example, the 's' tells it to output a short (16 bits), and the '<' forces it to little-endian mode.
In addition, ':raw' in the call to open tells it to put the filehandle into binary mode on platforms where that matters (it is equivalent to using binmode). The PerlIO manual page has a little more information on doing I/O in different formats.
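You can check the byte order without opening the file in a hex editor. A quick sketch using the two numbers from the question; it uses the 'v' template (unsigned 16-bit, always little-endian), which is my substitution for 's<' and also works on Perls older than 5.10.

```perl
use strict;
use warnings;

# 'v' packs an unsigned 16-bit value, least significant byte first.
my $packed_255  = pack 'v', 255;
my $packed_3826 = pack 'v', 3826;

# Show the resulting bytes as hex, in file order.
my $hex_255  = unpack 'H*', $packed_255;
my $hex_3826 = unpack 'H*', $packed_3826;

print "$hex_255 $hex_3826\n";   # ff00 f20e
```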
You can use pack to generate your binary data. For complex structures, Convert::Binary::C is particularly nice.
CBC parses C header files (either from a directory or from a variable in your script). It uses the information from the headers to pack or unpack binary data.
Of course, if you want to use this module, it helps to know some C.
CBC gives you the ability to specify the endianness and sizes for your C types, and you can even specify functions to convert between native Perl types and the data in the binary file. I've used this feature to handle encoding and decoding fixed point numbers.
For your very basic example you'd use:
use strict;
use warnings;
use IO::File;
use Convert::Binary::C;
my $c = Convert::Binary::C->new('ByteOrder' => 'LittleEndian');
my $packed = $c->pack( 'short int', 0xFF );
print $packed;
my $fh = IO::File->new( 'outfile', '>' )
or die "Unable to open outfile - $!\n";
$fh->binmode;
$fh->print( $packed );
CBC doesn't really get to shine in this example, since it is just working with a single short int. If you need to handle complex structures that may have typedefs pulled from several different C headers, you will be very happy to have this tool on hand.
Since you are new to Perl, I'll suggest that you always use strict and use warnings. Also, you can use diagnostics to get more detailed explanations for error messages. Both this site and Perlmonks have lots of good information for beginners and many very smart, skilled people willing to help you.
BTW, if you decide to go the pack route, check out the pack tutorial, it helps clarify the somewhat mystifying pack documentation.
Yes, use binmode
For your entertainment (if not education) my very first attempt at creating a binary file included binmode STDOUT and the following:
sub output_word {
$word = $_[0];
$lsb = $word % 256;
$msb = int($word/256);
print OUT chr($lsb) . chr($msb);
return $word;
}
FOR PITY'S SAKE DON'T USE THIS CODE! It comes from a time when I didn't know any better.
It could be argued I still don't, but it's reproduced here to show that you can control the order of the bytes, even with brain-dead methods, and because I need to 'fess up.
A better method would be to use pack as Adam Batkin suggested.
I think I committed the atrocity above in Perl 4. It was a long time ago. I wish I could forget it...
I already know how to convert the non-UTF-8 content of a file to UTF-8 line by line, using something like the following code:
# outfile.txt is in GB-2312 encode
open my $filter,"<",'c:/outfile.txt';
while(<$filter>){
#convert each line of outfile.txt to UTF-8 encoding
$_ = Encode::decode("gb2312", $_);
...}
But I think Perl can directly encode the whole input file to UTF-8 format, so I've tried something like
#outfile.txt is in GB-2312 encode
open my $filter,"<:utf8",'c:/outfile.txt';
(Perl says something like "utf8 "\xD4" does not map to Unicode" )
and
open my $filter,"<",'c:/outfile.txt';
$filter = Encode::decode("gb2312", $filter);
(Perl says "readline() on unopened filehandle")
They don't work. But is there some way to directly convert the input file to UTF-8?
Update:
Looks like things are not as simple as I thought. I can now convert the input file to UTF-8 in a roundabout way: I open the input file, write its content as UTF-8 to a new file, and then read the new file for further processing. This is the code:
open my $filter,'<:encoding(gb2312)','c:/outfile.txt';
open my $filter_new, '+>:utf8', 'c:/outfile_new.txt';
print $filter_new $_ while <$filter>;
seek $filter_new, 0, 0; # rewind before reading back what was just written
while (<$filter_new>){
...
}
But this is too much work, and it is even more troublesome than simply decoding the content of $filter line by line.
I think I misunderstood your question. I think what you want to do is read a file in a non-UTF-8 encoding, then play with the data as UTF-8 in your program. That's something much easier. After you read the data with the right encoding, Perl represents it internally as UTF-8. So, just do what you have to do.
When you write it back out, use whatever encoding you want to save it as. However, you don't have to put it back in a file to use it.
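Here is a minimal sketch of that round trip, assuming a throwaway temp file stands in for your c:/outfile.txt. The four sample bytes are the GB2312 encoding of the two characters U+4E2D and U+6587.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );
use Encode qw( encode );

# Write two GB2312-encoded characters to a temp file as raw bytes.
my ($tmp_fh, $tmp_name) = tempfile( UNLINK => 1 );
binmode $tmp_fh;
print {$tmp_fh} "\xD6\xD0\xCE\xC4\n";
close $tmp_fh or die "Unable to close $tmp_name: $!";

# The :encoding layer decodes while reading; no second file is needed.
open my $in, '<:encoding(gb2312)', $tmp_name
    or die "Unable to open $tmp_name: $!";
my $line = <$in>;
close $in;
chomp $line;

my $char_count = length $line;                         # 2 characters
my $utf8_hex   = unpack 'H*', encode('UTF-8', $line);  # e4b8ade69687

print "$char_count characters, UTF-8 bytes: $utf8_hex\n";
```

To save the text as UTF-8, print $line to a handle opened with '>:encoding(UTF-8)' instead of calling encode yourself.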
old answer
The Perl I/O layers only read the data assuming it's already properly encoded. It's not going to convert encoding for you. By telling open to use utf8, you're telling it that it already is utf8.
You have to use the Encode module just as you've shown (unless you want to write your own I/O layer). You can convert bytes to UTF-8, or if you know the encoding, you can convert from one encoding to another. Since it looks like you already know the encoding, you might want the from_to() function.
If you're just starting out with Perl and Unicode, go through Juerd's Perl Unicode Advice before you do anything.
The :encoding layer will return UTF-8, suitable for perl's use. That is, perl will recognize each character as a character, even if they are multiple bytes. Depending on what you are going to do next with the data, this may be adequate.
But if you are doing something with the data where perl will try to downgrade it from utf8, you either need to tell perl not to (for instance, doing a binmode(STDOUT, ":utf8") to tell perl that output to stdout should be utf8), or you need to have perl treat your utf8 as binary data (interpreting each byte separately, and knowing nothing about the utf8 characters.)
To do that, all you need is to apply an additional layer to your open:
open my $foo, "<:encoding(gb2312):bytes", ...;
Note that the output of the following will be the same:
perl -we'open my $foo, "<:encoding(gb2312):bytes", "foo"; $bar = <$foo>; print $bar'
perl -CO -we'open my $foo, "<:encoding(gb2312)", "foo"; $bar = <$foo>; print $bar'
but in the second case, perl knows that the data read is utf8 (and so length($bar) will report the number of utf8 characters) and has to be explicitly told (by -CO) that STDOUT will accept utf8, while in the first case, perl makes no assumptions about the data (and so length($bar) will report the number of bytes), and just prints it out as is.