Console does not show 4 bytes utf-8 - unicode

I'm using Ubuntu. My console (using Python or bash) does not print utf8 chars of 4 bytes, only 3 bytes or less.
Example (bash):
eran#eran-MS-7788:~/Dropbox/help$ echo -e "\xee\x8c\xb0"
# This prints the v sign
eran#eran-MS-7788:~/Dropbox/help$ echo -e "\xf0\x9f\x8f\x80"
# This should print the basketball sign but does not print so
I get the same from the Python console.
Any way to solve this?

Most so called Unicode fonts only implement characters present in Unicode BMP (Basic Multiligual Plane) - from U+0000 to U+FFFF, which all can be represented using just 3 bytes using UTF-8.
There are very few fonts that implement characters beyond BMP: that is, from U+010000 to U+1FFFFF (which are encoded using 4 bytes in UTF-8), and I would think that your Linux console or terminal is no exception. Because of this, you are not likely to see character you expect. Over time, it is possible that support for characters beyond BMP would become better, but it will probably take a long time. You can read more about this here.

Related

understanding different character encodings

When I save a text document in UTF-8 that's basically saying: Computer, use the codepage for UTF8 that's installed somewhere on your computer to figure out, how to turn the 1's and 0's to characters, right?
When I save this content:
激光
äüß
#§
in ISO-8895-1, it becomes this (on Linux, using Kate editor):
æ¿å
äüÃ
#§
What is not displayed here is that in the first and second row that are some weird squares displayed instead of characters (can be seen in developer tools).
So my understanding is that this means that the combination of 0's and 1's that represent 激 in utf-8 is mapped to æ in ISO-8895-1, right? And the weird squares > < happen because there is no mapping for that binary number in the ISO-8895-1 character set so the computer defaults to some other encoding.
Is that correct?
Yes, sort of correct.
If you store a file as UTF-8, it usually gets a special byte combination that indicates its type of encoding at the beginning of the file. I think, Kate (don't know this editor) doesn't recognize this and just displays the file as something else. So basically, your file is still correct, but was just visualized in a wrong way.
The weird squares are another indicator, that Kate doesn't recognize those leading bytes, cause usually editors hide them from the user and just use the information to display the file correctly.
You have it pretty much right. The character U+6FC0 (激) for example is encoded with 3 bytes in UTF-8: 0xE6 0xBF 0x80.
If you interpret these bytes in ISO-8859-1, you get the characters æ¿. Depending on the version of ISO-8859-1, 0x80 is either not mapped to a character at all, or is mapped to a non-printable control character, that's why you can see only two characters for the three bytes.
If you use Windows-1252 instead of ISO-8859-1 you'll see æ¿€.

troubles with understanding how ASCII works

I have few questions about ASCII encoding:
How come there is 127 characters definable in ASCII coding? It should be 7-bit, which means 2 to the power of 7 which equals 128. Where is the one character missing?
When I save a textfile as a textfile.txt it should be saved in ASCII coding right? But when I write like 10 characters into the file it has 10 bytes, which is 80 bits, shouldnt it be 70 ?
How do I save a file to a kind of ASCII code which has 7 bits per character?
Do some softwares still use the ASCII encoding for storing the information ?
1) Ascii has 128 value, but they are enumerated from 0-127 like most computer arrays. 0 means null.
2) Either the ascii is fit into 8 bit, or you are using the extended 8 bit version.
3) Define your own program that writes to a bytestream, then you can check the bytes yourself.
4) Most readable text are encoded using UTF, but things that only need the basic characters, such as computer code, can still use ascii.
1) How come there is 127 characters definable in ASCII coding ? It
should be 7-bit, which means 2 to the power of 7 which equals
128...where is the one character missing
The NUL character, whose ASCII code is 0. That's the one you missed.
2) When I save a textfile as a textfile.txt it should be saved in
ASCII coding right ? But when I write like 10 characters into the file
it has 10 bytes, which is 80 bits, shouldnt it be 70 ?
Storage systems (and main memory) tend to use a byte as the mininum piece of information to store, so a file full of standard ASCII characters waste one bit per character. Non english users give thanks for that, because it allowed to extend ASCII to 8 bits, giving codes to store accentuated vocals and things like that.
3) How do I save a file to a kind of ASCII code which has 7 bits per
character ?
Just make sure all your file contents are ASCII standard. You will not, however, recover those missing bits. A compression algorithm might take advantage of that to squeeze a text file a little, though.
4) Do some softweres still use the ASCII encoding for storing the
information ?
The vast majority of software use ASCII even to encode things that are not ASCII by themselves. Notable examples: e-mail, and the HTML source of this very page you are reading.
Stepping back a bit…
ASCII is an archaic, nearly obsolete character set. That said, nearly all character sets in use are a superset of ASCII and have compatible encodings. For example, Unicode has the UTF-8 encoding which maps the bytes for the first 128 characters the same as ASCII. Windows-1252 has 251 characters with the first 128 the same as ASCII.
Many modern programming environments use Unicode (at least for their source code and/or strings): Java, .NET, XML, HTML, ….
So, if you are reading a file, don't assume that it is ASCII. And, if you are writing a file, you could loose data if your programming environment uses Unicode and you force the output to be ASCII. In either case, the intended character set and encoding should be known by readers and writers.

How can I display Extended ASCII Codes characters in Perl?

How to display 192 character symbol ( └ ) in perl ?
What you want is to be able to print unicode, and the answer is in perldoc perluniintro.
You can use \x{nnnn} where n is the hex identifier, or you can do \N{...} with the name:
perl -E 'say "\x{2514}"; use charnames; say "\N{BOX DRAWINGS LIGHT UP AND RIGHT}"'
To use exactly these codes your terminal must support Code Page 437, which contains frames. Alternatively you can use derived CP850 with less boxing characters.
Such boxing characters also exist as Unicode Block Elements. The char which you want in perl is noted as \N{U+2514}. More details in perlunicode
That looks like the Code page 437 encoding. Perl is probably just outputting bytes that you give it. And your terminal is probably expecting UTF8.
So you need to decode it to Unicode, then re-encode it in UTF-8.
EDIT: Correct encoding.
As usual, Jon Skeet nails it: the 192 code is in the "extended ASCII" range. I suggest you follow #Douglas Leeder's advice, but I'm not sure which encoding www.LookupTables.com is giving you; ISO-8859-1 thinks 192 maps to "À", and Mac OS Roman thinks its "¿".
Is there a solution that works on ALL characters?
The user says they wanted to use an latin-1 extended charset character — so let's try an example from this block! So, if they wanted the Æ character, they would run...
print "\x{00C6}";
Output: �
Full Testable, Online Demo
TDLR Character Encoding Modes in Perl
So, wait, what just happened there? You'll notice that other ways of invoking UTF-8, such as char(...), \N{U+...}, and even unpack(...) also have the same issue. That's right -- the problem isn't with any of these functions, but an underlying character abstraction layer. In this case, you'll want to indicate this layer early in your code..
use open qw( :std :encoding(UTF-8) );
print "\x{00C6}";
Output: Æ
Now I can spell 'Ælf' correctly!
Full Testable, Online Demo
Why did that happen?
There is a note within the PerlDoc regarding the chr() function....
Note that characters from 128 to 255 (inclusive) are by default internally not encoded as UTF-8 for backward compatibility reasons.
For this reason, this special block needs to have that special use open to indicate std encoding.

Perl strings internals

How do perl strings represented internally? What encoding is used? How do I handle different encodings properly?
I've been using perl for quite a long time, but it didn't include a lot of string handling in different encodings, and when I encountered a minor problem that had something to do with encodings I usually resorted to some shamanic actions.
Until this moment I thought about perl strings as sequences of bytes, which did fit pretty well for my tasks. Now I need to do some processing of UTF-8 encoded file and here starts trouble.
First, I read file into string like this:
open(my $in, '<', $ARGV[0]) or die "cannot open file $ARGV[0] for reading";
binmode($in, ':utf8');
my $contents;
{
local $/;
$contents = <$in>;
}
close($in);
then simply print it:
print $contents;
And I get two things: a warning Wide character in print at <scriptname> line <n> and a garbage in console. So I can conclude that perl strings have a concept of "character" that can be "wide" or not, but when printed these "wide" characters are represented in console as multiple bytes, not as single "character".
(I wonder now why did all my previous experience with binary files worked quite how I expected it to work without any "character" issues).
Why then I see garbage in console? If perl stores strings as character in some known encoding, I don't think there is a big problem to find out console encoding and print text properly. (I use Windows, BTW).
If perl stores strings as variable-width character sequences (e.g. using same UTF-8 encoding), why is it done this way? From my C experience handling strings is PAIN.
Update.
I use two computers for testing, one runs Windows 7 x64 with English language pack installed, but with Russian regional settings (so I have cp866 as OEM codepage and cp1251 as ANSI) with ActivePerl 5.10.1 x64; another runs Windows XP 32 bit Russian localization with Cygwin Perl 5.10.0.
Thanks to links, now I have much more solid understanding on what's going on and how things should be done.
Setting utf8 before reading from the file is good, it automagically decodes the bytes into the internal encoding. (Which is also UTF-8 but you don't need to know, and shouldn't rely on.)
Before printing you need to encode the characters back to bytes.
use Encode;
utf8::encode($contents);
There is also a two argument form of encode, for other encodings than unicode. (That sentence echoes too much, doesn't it?)
Here is a good reference. (Would have been more, but it's my first post.) Check out perlunitut too, and the unicode article on Joel on Software.
http://www.ahinea.com/en/tech/perl-unicode-struggle.html
Oh, and it must use multi-byte strings, because otherwise it's just not unicode.
Perl strings are stored internally in one of two encodings, either a 8-bit byte oriented native encoding, or UTF-8. For backwards comparability the assumption is that all I/O and strings are in native encoding, unless otherwise specified. Native encoding is usually 8-bit ASCII, but this can be changed with use locale.
In your sample you call binmode on your input handle changing it to use :utf8 semantics. One effect of this is that all strings read from this handle will be encoded as UTF-8. print writes to STDOUT by default, and STDOUT defaults to expecting native encoded characters.
Perl in an attempt to do the right thing will allow a UTF-8 string to be sent to a native encoded output, but if there is no encoding attached to that handle then it has to guess how to output multi-byte characters and it will almost certainly guess wrong. That is what the warning means, a multi-byte character was sent to a stream only expecting single byte characters and the result was that the character was probably damaged in translation.
Depending on what you want to accomplish you can use the Encode module mentioned by dylan to convert the UTF-8 data to a single byte character set that can be printed safely or if you know that whatever is attached to STDOUT can handle UTF-8 you can use binmode(STDOUT, ':utf8'); to tell Perl you want any data sent to STDOUT to be sent as UTF-8.
You should mention your actual Windows and Perl versions as this really depends on your used versions and installed language packages.
Otherwise have a look at the PerlUnicode manual first -
Perl uses logically-wide characters to represent strings internally.
it will confirm your statements.
Windows does not fully install all UTF8 character- thus this is might be the reason for your issue. You may need to install an additional language package.

What multi-byte character set starts with 0x7F and is 4 bytes long?

I'm trying to get some legacy code to display Chinese characters properly. One character encoding I'm trying to work with starts with a 0x7F and is 4 bytes long (including the 0x7F byte). Does anyone know what kind of encoding this is and where I can find information for it? Thanks..
UPDATE:
I've also had to work with some Japanese encoding that starts every character with a 0xE3 and is three bytes long. It displays on my computer properly if I choose the Japanese locale in Windows, however, it doesn't display properly in our application. However, if any other locale other than Japanese is selected, I cannot even view the filenames properly. So I'm guessing this encoding is not Unicode. Anyone know what it is? Is it ANSI? Is it Shift JIS?
For the Chinese one, I've tested it with Unicode and UTF-8 characters and I'm getting the same pattern; 0x7F followed by three bytes. Are Unicode and UTF-8 the same?
One character encoding I'm trying to work with starts with a 0x7F and is 4 bytes long
What are the other bytes? Do you have any Latin text in this encoding?
If it's “0x7f 0x... 0x00 0x00” you are looking at UTF-32LE. It could also be two UTF-16 (either LE or BE) characters.
Most East Asian encodings use 0x80-0xFF as lead bytes for non-ASCII characters; there is none I know of that would use a leading 0x7F as anything other than an ASCII delete.
ETA:
are there supposed to be Byte Order Marks?
There doesn't need to be a BOM if there is an out-of-band way of signalling that the encoding is ‘UTF-32LE’ (possibly one that is lost before it gets to you).
I've also had to work with some Japanese encoding that starts every character with a 0xE3 and is three bytes long.
That's surely UTF-8. Sequence 0xE3 0x... 0x... would result in a character between U+3000 and U+4000, which is where the hiragana/katakana live.
It displays on my computer properly if I choose the Japanese locale in Windows, however, it doesn't display properly in our application.
Then chances are your application is is one of the regrettable horde of non-Unicode-compliant apps, still using ‘A’(*) versions of the Win32 interfaces inside of the ‘W’-suffixed ones. Whether you can read in the string according to its real encoding is moot: a non-Unicode-compliant app will never be able to display an East Asian ideograph on a Western locale.
(*: named for “ANSI”, which is Windows's misleading term for “whatever the system codepage is set to at the moment”. That's why changing your locale affected it.)
ETA(2):
OK, cracked it. It's not any standardised encoding I've met before, but it's relatively easy to decipher if you assume the premise that Unicode code points are being encoded.
0x00-0x7E: plain ASCII
0x7F A B C: Unicode character
The character encoded in a Unicode escape can be calculated by taking the index in a key string of A, B and C and adding together:
A*0x1000 + B*0x40 + C
That is, it's a base-64 character set, but it's not the usual Base64 standard. A little experimentation gives a key string of:
.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz
The ‘.’ and ‘_’ characters are guesses, since none of the characters you posted uses them. We'd need more data to find out the exact string.
So, for example:
0x7F 3 u g
A=4 B=58 C=44
4*0x1000 + 58*0x40 + 44 = 0x4EAC
U+4EAC = 京
ETA(3):
Yeah, it should be easy to create a native Unicode string by sucking out each code point manually and joining as a character. Not quite sure what's available on whatever platform you're using, but any Unicode-capable platform should be able to make a string from codepoints simply (and hopefully without having to manually re-encode to UTF-16LE bytes).
I figured it must be Unicode codepoints by noticing that the three example characters had first escape-characters in the same general range, and in the same numerical order as their Unicode codepoints. The other two characters seemed to change randomly, so it was very likely a big-endian encoding of the code point, and probably a base-64 encoding as 6 is as many bits as you can get out of readable ASCII.
Standard Base64 itself starts with letters, which would put something starting with a number too far up to be in the Basic Multilingual Plane. So I started guessing with ‘0123456789ABCDEFG...’ which would be the other obvious choice of key string. That got resulting numbers that were close to the code points for the given characters, but a bit too low. Inserting an extra character at the start of the key string (so digit ‘0’ doesn't map to number 0) got one of the characters right and the other two very close; the one that was right had no lower-case letters, so to change only the lower-case letters I inserted another character between the upper and lower cases. This came up with the right numbers.
It's not guaranteed that this is actually right, but (apart from the arbitrary choice of inserted characters) it's very likely to be it.
You might want to look at chinese character encoding page on Wikipedia. The only encoding in there that I can see that is always 4 bytes is UTF-32.
GB 18030 is the current standard Chinese character set, but it can be 1 to 4 bytes long.
Try chardet. It does a good job of guessing the character encoding of a string of bytes.
Are Unicode and UTF-8 the same?
No. UTF-8 is just one way to represent Unicode characters as a sequence of bytes. Unicode is the full standard, assigning numeric and human-readable identifiers to each character, as well as lots of metadata about the characters.
It might be a valid unicode encoding, such as a utf-8 or UTF16 surrogate pair.
Yes, the Chinese one is UTF-8, a implementation (encoding) of Unicode.
The UTF-8 is 1 byte long for ASCII characters and up to 4 bytes for others.