Why does encoding, then decoding strings make Arabic characters lose their context?


I'm (belatedly) testing the Unicode waters for the first time, and I don't understand why encoding and then decoding an Arabic string separates the word into its individual characters.
In the example below, the word "ﻟﻠﺒﻴﻊ" consists of 5 individual letters: "ع","ي","ب","ل","ل", written right to left. Depending on the surrounding context (adjacent letters), the letters change form.
use strict;
use warnings;
use utf8;
binmode( STDOUT, ':utf8' );
use Encode qw< encode decode >;
my $str = 'ﻟﻠﺒﻴﻊ'; # "For sale"
my $enc = encode( 'UTF-8', $str );
my $dec = decode( 'UTF-8', $enc );
my $decoded = pack 'U0W*', map +ord, split //, $enc;
print "Original string : $str\n"; # ل ل ب ي ع
print "Decoded string 1: $dec\n"; # ل ل ب ي ع
print "Decoded string 2: $decoded\n"; # ل ل ب ي ع
ADDITIONAL INFO
When pasting the string to this post, the rendering is reversed so it looks like "ﻊﻴﺒﻠﻟ". I'm reversing it manually to get it to look 'right'. The correct hexdump is given below:
$ echo "ﻟﻠﺒﻴﻊ" | hexdump
0000000 bbef ef8a b4bb baef ef92 a0bb bbef 0a9f
0000010
The output of the Perl script (per ikegami's request):
$ perl unicode.pl | od -t x1
0000000 4f 72 69 67 69 6e 61 6c 20 73 74 72 69 6e 67 20
0000020 3a 20 d8 b9 d9 8a d8 a8 d9 84 d9 84 0a 44 65 63
0000040 6f 64 65 64 20 73 74 72 69 6e 67 20 31 3a 20 d8
0000060 b9 d9 8a d8 a8 d9 84 d9 84 0a 44 65 63 6f 64 65
0000100 64 20 73 74 72 69 6e 67 20 32 3a 20 d8 b9 d9 8a
0000120 d8 a8 d9 84 d9 84 0a
0000127
And if I just print $str:
$ perl unicode.pl | od -t x1
0000000 4f 72 69 67 69 6e 61 6c 20 73 74 72 69 6e 67 20
0000020 3a 20 d8 b9 d9 8a d8 a8 d9 84 d9 84 0a
0000035
Finally (per ikegami's request):
$ grep 'For sale' unicode.pl | od -t x1
0000000 6d 79 20 24 73 74 72 20 3d 20 27 d8 b9 d9 8a d8
0000020 a8 d9 84 d9 84 27 3b 20 20 23 20 22 46 6f 72 20
0000040 73 61 6c 65 22 20 0a
0000047
Perl details
$ perl -v
This is perl, v5.10.1 (*) built for x86_64-linux-gnu-thread-multi
(with 53 registered patches, see perl -V for more detail)
Outputting to file reverses the string: "ﻊﻴﺒﻠﻟ"
QUESTIONS
I have several:
How can I preserve the context of each character while printing?
Why is the original string printed out to screen as individual letters, even though it hasn't been 'processed'?
When printing to file, the word is reversed (I'm guessing this is due to the script's right-to-left nature). Is there a way I can prevent this from happening?
Why does the following not hold true: $str !~ /\P{Bidi_Class: Right_To_Left}/;

Source code returned by StackOverflow (as fetched using wget):
... ef bb 9f ef bb a0 ef ba 92 ef bb b4 ef bb 8a ...
U+FEDF ARABIC LETTER LAM INITIAL FORM
U+FEE0 ARABIC LETTER LAM MEDIAL FORM
U+FE92 ARABIC LETTER BEH MEDIAL FORM
U+FEF4 ARABIC LETTER YEH MEDIAL FORM
U+FECA ARABIC LETTER AIN FINAL FORM
perl output I get from the source code returned by StackOverflow:
... ef bb 9f ef bb a0 ef ba 92 ef bb b4 ef bb 8a 0a
... ef bb 9f ef bb a0 ef ba 92 ef bb b4 ef bb 8a 0a
... ef bb 9f ef bb a0 ef ba 92 ef bb b4 ef bb 8a 0a
U+FEDF ARABIC LETTER LAM INITIAL FORM
U+FEE0 ARABIC LETTER LAM MEDIAL FORM
U+FE92 ARABIC LETTER BEH MEDIAL FORM
U+FEF4 ARABIC LETTER YEH MEDIAL FORM
U+FECA ARABIC LETTER AIN FINAL FORM
U+000A LINE FEED
So I get exactly what's in the source, as I should.
perl output you got:
... d8 b9 d9 8a d8 a8 d9 84 d9 84 0a
... d8 b9 d9 8a d8 a8 d9 84 d9 84 0a
... d8 b9 d9 8a d8 a8 d9 84 d9 84 0a
U+0639 ARABIC LETTER AIN
U+064A ARABIC LETTER YEH
U+0628 ARABIC LETTER BEH
U+0644 ARABIC LETTER LAM
U+0644 ARABIC LETTER LAM
U+000A LINE FEED
OK, so you could have a buggy Perl (one that reverses and changes Arabic characters, and only those), but it's far more likely that your source file doesn't contain what you think it does. You need to check which bytes make up your source file.
echo output you got:
ef bb 8a ef bb b4 ef ba 92 ef bb a0 ef bb 9f 0a
U+FECA ARABIC LETTER AIN FINAL FORM
U+FEF4 ARABIC LETTER YEH MEDIAL FORM
U+FE92 ARABIC LETTER BEH MEDIAL FORM
U+FEE0 ARABIC LETTER LAM MEDIAL FORM
U+FEDF ARABIC LETTER LAM INITIAL FORM
U+000A LINE FEED
There are significant differences in what you got from perl and from echo, so it's no surprise they show up differently.
Output inspected using:
$ perl -Mcharnames=:full -MEncode=decode_utf8 -E'
say sprintf("U+%04X %s", $_, charnames::viacode($_))
for unpack "W*", decode_utf8 pack "H*", $ARGV[0] =~ s/\s//gr;
' '...'
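(Don't forget to swap the bytes of hexdump: its default format prints two-byte words in host byte order, so on a little-endian machine each pair appears reversed.)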
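The relationship between the two byte sequences above can be checked directly. The following is a minimal sketch, assuming the byte strings are exactly the ones quoted in this answer; the variable names are made up for illustration. It uses NFKC normalization (from the core Unicode::Normalize module) to fold each presentation form back to its base letter, then compares the result with the script's output read in the opposite order.

```perl
use strict;
use warnings;
use Encode qw< decode >;
use Unicode::Normalize qw< NFKC >;

# Bytes quoted above for the page source (presentation forms, U+FExx).
my $from_page   = decode( 'UTF-8', pack( 'H*', 'efbb9fefbba0efba92efbbb4efbb8a' ) );

# Bytes quoted above for the script's output (base letters, U+06xx).
my $from_script = decode( 'UTF-8', pack( 'H*', 'd8b9d98ad8a8d984d984' ) );

# NFKC replaces each presentation form with the base letter it stands for.
my $folded = NFKC($from_page);

printf "U+%04X\n", ord($_) for split //, $folded;

# Compare against the script's string with its characters in reverse order.
my $same_letters = $folded eq scalar reverse $from_script;
print $same_letters
    ? "same letters, opposite order\n"
    : "different letters\n";
```

If the comparison holds, the source file contains the plain letters in the opposite order rather than the presentation forms shown on the page, which would account for the isolated-looking glyphs.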

Maybe something odd with your shell? If I redirect the output to a file, the result will be the same. Please try this out:
use strict;
use warnings;
use utf8;
binmode( STDOUT, ':utf8' );
use Encode qw< encode decode >;
my $str = 'ﻟﻠﺒﻴﻊ'; # "For sale"
my $enc = encode( 'UTF-8', $str );
my $dec = decode( 'UTF-8', $enc );
my $decoded = pack 'U0W*', map +ord, split //, $enc;
open(F1,'>',"original.txt") or die;
open(F2,'>',"decoded.txt") or die;
open(F3,'>',"decoded2.txt") or die;
binmode(F1, ':utf8');binmode(F2, ':utf8');binmode(F3, ':utf8');
print F1 "$str\n"; # ل ل ب ي ع
print F2 "$dec\n"; # ل ل ب ي ع
print F3 "$decoded\n";

Related

How to read currency symbol in a perl script

I have a Perl script that reads data from a .csv file containing several different currency symbols. When we read that file and write the content back out, I can see it is printing
Get <A3>50 or <80>50 daily
Actual value is
Get £50 or €50 daily
With the dollar sign it works fine, but with any other currency symbol it does not.
I tried
open my $in, '<:encoding(UTF-8)', 'input-file-name' or die $!;
open my $out, '>:encoding(latin1)', 'output-file-name' or die $!;
while ( <$in> ) {
print $out $_;
}
$ od -t x1 input-file-name
0000000 47 65 74 20 c2 a3 35 30 20 6f 72 20 e2 82 ac 35
0000020 30 20 64 61 69 6c 79 0a
0000030
od -t x1 output-file-name
0000000 47 65 74 20 a3 35 30 20 6f 72 20 5c 78 7b 32 30
0000020 61 63 7d 35 30 20 64 61 69 6c 79 0a
0000034
but that does not help either. The output I am getting is
Get \xA350 or \x8050 daily
od -t x1 output-file-name
0000000 47 65 74 20 a3 35 30 20 6f 72 20 5c 78 7b 32 30
0000020 61 63 7d 35 30 20 64 61 69 6c 79 0a
0000034

Unicode Code Point   Glyph  UTF-8     Input File  ISO-8859-1  Output File
U+00A3 POUND SIGN    £      C2 A3     C2 A3       A3          A3
U+20AC EURO SIGN     €      E2 82 AC  E2 82 AC    N/A         5C 78 7B 32 30 61 63 7D
("LATIN1" is an alias for "ISO-8859-1".)
There are no problems with the input file.
£ is correctly encoded in your input file.
€ is correctly encoded in your input file.
As for the output file,
£ is correctly encoded in your output file.
€ isn't found in the latin1 charset, so \x{20ac} is used instead.
Your program is working as expected.
You say you see <A3> instead of £. That's probably because the program you are using is expecting a file encoded using UTF-8, but you provided a file encoded using ISO-8859-1.
You also say you see <80> instead of €. But there's no way you'd see that for the file you provided.
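The behaviour described above can be reproduced without any files. This is a hedged sketch, not the asker's program: it assumes the output layer's fallback matches Encode's FB_PERLQQ (which is what the `\x{20ac}` in the output file suggests), and the two extra encodings, ISO-8859-15 and cp1252, are only illustrations of charsets that do contain the euro sign.

```perl
use strict;
use warnings;
use Encode qw< encode >;

# The decoded text, as it exists inside the program after reading the UTF-8 input.
my $text = "Get \x{A3}50 or \x{20AC}50 daily";

# Assumption: same fallback as seen in the output file.
# LEAVE_SRC keeps $text intact across repeated encode() calls.
my $check = Encode::FB_PERLQQ() | Encode::LEAVE_SRC();

my %encoded;
for my $name ( 'ISO-8859-1', 'ISO-8859-15', 'cp1252' ) {
    $encoded{$name} = encode( $name, $text, $check );

    # Show the resulting bytes in hex, like od -t x1 does.
    my $hex = join ' ', map { sprintf '%02x', ord($_) } split //, $encoded{$name};
    printf "%-11s %s\n", $name, $hex;
}
```

In cp1252 the euro sign is the single byte 80, so a file shown as <80>50 was probably written with that encoding (or a similar one) rather than with latin1.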

perl replace non UTF-8 characters or binary contents with whitespace

I have a file with non-ascii characters.
$ org od -t c -t x1 -A d tmp.txt
0000000 S o - c a l l e d 217 204 l a b
53 6f 2d 63 61 6c 6c 65 64 f4 8f b1 84 6c 61 62
0000016 e l e d 217 204 p a t t e r n s
65 6c 65 64 f4 8f b1 84 70 61 74 74 65 72 6e 73
0000032 217 204 c a n b e 217 204 u s
f4 8f b1 84 63 61 6e 20 62 65 f4 8f b1 84 75 73
0000048 e d 217 204 w i t h 217 204 s i
65 64 f4 8f b1 84 77 69 74 68 f4 8f b1 84 73 69
0000064 n g l e , 217 204 d o u b l e
6e 67 6c 65 2c 20 f4 8f b1 84 64 6f 75 62 6c 65
0000080 , 217 204 a n d 217 204 t r i
2c 20 f4 8f b1 84 61 6e 64 f4 8f b1 84 74 72 69
0000096 p l e 217 204 b l a n k s .
70 6c 65 f4 8f b1 84 62 6c 61 6e 6b 73 2e
As you can see, \x{f4}\x{8f}\x{b1}\x{84} has several occurrences. I want to replace \x{f4}\x{8f}\x{b1}\x{84} with whitespace. According to this, I try:
s/\x{f4}\x{8f}\x{b1}\x{84}/ /g;
tr/\x{f4}\x{8f}\x{b1}\x{84}/ /;
It doesn't work.
But if I remove these two lines from the script:
use utf8;
use open qw( :std :encoding(UTF-8) );
It works. Why?
I suspect that it is because perl only deals with characters, but \x{f4}\x{8f}\x{b1}\x{84} is not regarded as a character. Is there a way to remove \x{f4}\x{8f}\x{b1}\x{84} or any other binary contents or non UTF-8 characters with perl?

While the file may contain "\x{f4}\x{8f}\x{b1}\x{84}", your string contains "\x{10FC44}" — "\N{U+10FC44}" if you prefer — because you decoded what you read. As such, you'd need
tr/\N{U+10FC44}/ /
It's a private-use Code Point. To replace all 137,468 private-use Code Points, you can use
s/\p{General_Category=Private_Use}/ /g
General_Category can be abbreviated to Gc.
Private_Use can be abbreviated to Co.
General_Category= can be omitted.
So these are equivalent:
s/\p{Gc=Private_Use}/ /g
s/\p{Private_Use}/ /g
s/\p{Co}/ /g
Co makes me think of "control", so maybe it's best to avoid that one. (Control characters are identified by the Control aka Cc general category.)
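To see the difference between the bytes in the file and the character in the decoded string, here is a small sketch. It assumes a short sample built from the bytes in the question's dump; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw< decode >;

# A fragment of the file's bytes, as shown in the od output.
my $bytes = "So-called\xF4\x8F\xB1\x84labeled";

# Decoding turns the four bytes F4 8F B1 84 into one character, U+10FC44.
my $chars = decode( 'UTF-8', $bytes );

# Collect the code points of all private-use characters found.
my @private = map { sprintf 'U+%04X', ord($_) }
              $chars =~ /(\p{General_Category=Private_Use})/g;

# Replace every private-use character with a space.
( my $clean = $chars ) =~ s/\p{General_Category=Private_Use}/ /g;

print "private-use characters: @private\n";
print "$clean\n";
```

A pattern made of the four byte values cannot match the decoded string, because that string no longer contains those four elements; it contains a single character in their place.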

dd: copy N first bytes of each M-sized block

Given a binary input file made of data blocks which are M bytes long, how can I extract the first N bytes of each block using dd?
For example, with M=10 and N=8, the data could look like this:
$ M=10
$ head -c $(( M * 5 )) /dev/urandom \
| tee inputfile.bin \
| hexdump -e '"%07.7_Ax\n"' -e "\"%07.7_ax \" ${M}/1 \"%02x \" \"\n\""
0000000 c0 07 5d 59 dc 03 2e 38 49 c4
000000a ca ad 44 6d 09 61 2b 6c 7c ba
0000014 c4 96 c6 73 8b ed 42 cf d9 9c
000001e 49 b7 bb ea 32 dc 35 6a 5c d8
0000028 55 15 a0 aa d5 aa 60 2c 30 de
0000032
and I would like to extract this from the input:
$ N=8
$ hexdump -e '"%07.7_Ax\n"' -e "\"%07.7_ax \" ${N}/1 \"%02x \" \"\n\"" output.bin
0000000 c0 07 5d 59 dc 03 2e 38
0000008 ca ad 44 6d 09 61 2b 6c
0000010 c4 96 c6 73 8b ed 42 cf
0000018 49 b7 bb ea 32 dc 35 6a
0000020 55 15 a0 aa d5 aa 60 2c
0000028

Identifying an unknown data encoding?

I'm trying to understand an undocumented API I have discovered, and I can't get past the format of the data that is being returned.
Here is an example of what I get back when I perform a GET on the url I'm looking at:
A+uZL4258wXdnWztlEPJNXtdl3Tu4hRITtW2AUwQHUK5c6BATSBU/XsQEVIttCpI7wrW/oXWiBloT8+cdtUWBag3mzk3cLohKPvi7PWpf7jqCSbjNGh+5Iv5Gb8by2k31kp62sfwZ+i8r/3TA6nGrnJb6edOB7d0c6F34RTFRrrZSeJtiWYXAJ5JeD3yJY+C
At first I thought this was base64 encoded, but that just gives me back gibberish:
echo -n "<above snippet>" | base64 -D
?/???ݝl?C?5Vy??????,?8?s?#M T?{R-?*H?
???ֈhOϜv??7?97p?!(??????? &?4h~???i7?Jz???g輯???Ʈr[??N?ts?w??F??I?m?f?Ix=?%?
When I strip the URL down to just the domain, I get a website with cyrillic text. Maybe the data could be converted to cyrillic somehow?
Does this data format look familiar to you?
I'll continue to keep trying and report back if I make any progress.

This is definitely base64, because of the / and + characters.
When you decode that string using base64, you get this hexdump:
00000000 03 eb 99 2f 8d b9 f3 05 dd 9d 6c ed 94 43 c9 35 |.../......l..C.5|
00000010 7b 5d 97 74 ee e2 14 48 4e d5 b6 01 4c 10 1d 42 |{].t...HN...L..B|
00000020 b9 73 a0 40 4d 20 54 fd 7b 10 11 52 2d b4 2a 48 |.s.@M T.{..R-.*H|
00000030 ef 0a d6 fe 85 d6 88 19 68 4f cf 9c 76 d5 16 05 |........hO..v...|
00000040 a8 37 9b 39 37 70 ba 21 28 fb e2 ec f5 a9 7f b8 |.7.97p.!(.......|
00000050 ea 09 26 e3 34 68 7e e4 8b f9 19 bf 1b cb 69 37 |..&.4h~.......i7|
00000060 d6 4a 7a da c7 f0 67 e8 bc af fd d3 03 a9 c6 ae |.Jz...g.........|
00000070 72 5b e9 e7 4e 07 b7 74 73 a1 77 e1 14 c5 46 ba |r[..N..ts.w...F.|
00000080 d9 49 e2 6d 89 66 17 00 9e 49 78 3d f2 25 8f 82 |.I.m.f...Ix=.%..|
This just looks like 144 bytes of random data. And whenever you call this API URL again, you get a different string, although it starts with the same few characters.
Perhaps you should ask the maintainers of that website how to use their API. Maybe this string is some session ID that you should use in further calls.
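The decoding step can be repeated in a few lines. This is a sketch using the core MIME::Base64 module and the string from the question, assuming it was copied without changes; it only reports the length and the leading bytes and makes no claim about what the data means.

```perl
use strict;
use warnings;
use MIME::Base64 qw< decode_base64 >;

# The response body from the question, split only to keep the lines short.
my $b64 = join '',
    'A+uZL4258wXdnWztlEPJNXtdl3Tu4hRITtW2AUwQHUK5c6BATSBU/XsQEVIttCpI',
    '7wrW/oXWiBloT8+cdtUWBag3mzk3cLohKPvi7PWpf7jqCSbjNGh+5Iv5Gb8by2k3',
    '1kp62sfwZ+i8r/3TA6nGrnJb6edOB7d0c6F34RTFRrrZSeJtiWYXAJ5JeD3yJY+C';

my $raw = decode_base64($b64);

# Length of the decoded data and its first four bytes in hex.
my $len  = length $raw;
my $head = unpack 'H8', $raw;

print "decoded $len bytes, starting with $head\n";
```

The byte count and the leading bytes can be compared with the hexdump above to confirm that the same data was decoded.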

Why does Perl MIME::Base64 insert CR characters before LF characters on decoding Base64-encoded strings when they aren't present in the original data?

Why does the Perl MIME::Base64 module on decoding Base64-encoded strings insert CR characters before LF characters when they are not present in the original data?
Input: a binary described by the following hex string,
14 15 6A 48 E4 15 6A 32 E5 48 46 13 A5 E3 88 43 18 A6 84 E3 51 3A 8A 0A 1A 3E E6 84 A6 1A 16 E8 46 84 A1 2E A3 5E 84 8A 4E 1A 35 E1 35 1E 84 A9 8E 46 54 44
This encodes to the Base64-encoded string:
FBVqSOQVajLlSEYTpeOIQximhONROooKGj7mhKYaFuhGhKEuo16Eik4aNeE1HoSpjkZURA==
My Perl script for decoding is
use MIME::Base64;
my $bin = decode_base64('FBVqSOQVajLlSEYTpeOIQximhONROooKGj7mhKYaFuhGhKEuo16Eik4aNeE1HoSpjkZURA==');
open FH, ">test.bin" or die $!;
print FH $bin;
close FH;
Output: the resulting file 'test.bin' has the following hex string representation,
14 15 6A 48 E4 15 6A 32 E5 48 46 13 A5 E3 88 43 18 A6 84 E3 51 3A 8A 0D 0A 1A 3E E6 84 A6 1A 16 E8 46 84 A1 2E A3 5E 84 8A 4E 1A 35 E1 35 1E 84 A9 8E 46 54 44
Note the additional '0D' byte that has been inserted before '0A' (right after '8A'), where it was not present in the original data.
I'm using Perl v5.14.2 on Windows 7.

Since you're on Windows, you will need to open that filehandle in binary mode to prevent your line-endings from being munged.
open FH, ">test.bin" or die $!;
binmode FH;
You can do that all at once using IO layers, and also using a lexical filehandle which is better practice than a package symbol like FH:
open my $fh, '>:raw', 'test.bin' or die $!;
print { $fh } $bin;
For more, check out
perldoc perlio
perldoc perlopentut
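Putting the pieces together, a minimal sketch of the whole round trip might look like this. It assumes a scratch file from the core File::Temp module instead of test.bin, purely so the example cleans up after itself, and it compares the file size with the decoded length to show that no bytes were added.

```perl
use strict;
use warnings;
use MIME::Base64 qw< decode_base64 >;
use File::Temp qw< tempfile >;

my $bin = decode_base64(
    'FBVqSOQVajLlSEYTpeOIQximhONROooKGj7mhKYaFuhGhKEuo16Eik4aNeE1HoSpjkZURA=='
);

# Scratch file (an assumption of this sketch; the question writes test.bin).
my ( $fh, $path ) = tempfile( UNLINK => 1 );

# Binary mode: no LF to CRLF translation on Windows.
binmode $fh;
print {$fh} $bin;
close $fh or die "close failed: $!";

# The file should contain exactly as many bytes as were decoded.
my $written = -s $path;
print "decoded ", length($bin), " bytes, wrote $written bytes\n";
```

On Windows, dropping the binmode call should make the written size one byte larger for this input, since the data contains a single 0A byte.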
