What's Unicode/ASCII's relevance to machine code? - unicode

Even though machine language varies according to, well, machine, as far as I've found out, Unicode/ASCII has specific values for characters(this whole concept is still a bit confusing). So, basically, is the binary value for the character, let's say, 'A' in Linux different to that of 'A' in Windows? If different machines understand different sequences of 1s and 0s, shouldn't 'A's 1s and 0s differ according to machine(even though Unicode has set values for each character--I think)?
P.S. I'm kind of new to programming and don't even know if this is the right place to ask this question.(If it isn't, sorry!)

Linux and Windows are different operating systems, which can very well run on the same machine (hardware). ASCII and Unicode (and the Unicode encodings like UTF-8) are standards independent of any specific operating system or machine. These standards define how data should be expressed, and that is independent of any specific implementation of that standard. ASCII in Windows is exactly the same as ASCII in Linux, because ASCII has been defined the way it is and different systems must make their implementation conform to that standard if they want to be interoperable.
Now, different hardware architectures may use big-endian vs. little-endian architectures, in which case the actual bytes may be processed in a different order internally. But that is merely an implementation detail; ASCII will still be ASCII.

Machines don't "understand" characters. They process bytes, made of 0 and 1s.

Related

What determines how strings are encoded in memory?

Say we have a file that is Latin-1 encoded and that we use a text editor to read in that file into memory. My questions are then:
How will those character strings be represented in memory? Latin-1, UTF-8, UTF-16 or something else?
What determines how those strings are represented in memory? Is it the application, the programming language the application was written in, the OS or the hardware?
As a follow-up question:
How do applications then save files to encoding schemes that use different character sets? F.e. converting UTF-8 to UTF-16 seems fairly intuitive to me as I assume you just decode to the Unicode codepoint, then encode to the target encoding. But what about going from UTF-8 to Shift-JIS which has a different character set?
Operating system
Windows
1993: Windows adopted Unicode 1.0 with NT 3.1 - back then Unicode was what is nowadays known as UCS-2. That Windows version also introduced NTFS (New Technology File System), which also stores every filename in UCS-2 like manner (16 bit codepoints).
2000: With NT 5.0 (aka Windows 2000) there was a shift/improvement from UCS-2 to UTF-16 - both OS and encoding became available in this year.
Since then nothing has changed. Internally, Windows uses 16 bit codepoints for almost 30 years already, and thanks to UTF-16 also newest codepoints such as Emojis are supported. Its API works the same way, with compatibility functions for byte-wise encodings merely being stubs that convert the input to UTF-16. See also
What unicode encoding (UTF-8, UTF-16, other) does Windows use for its Unicode data types?
"Windows uses UTF-16 as its internal encoding", what exactly does this mean?
Why does Windows use UTF-16LE?
Is it safe to assume all Windows platforms will be in UCS-2 LE
Unix: most distributions use UTF-8 by default, because it's most backward compatible while being future proof enough.
Programming language
Depends on their age or on their compiler: while languages themselves are not necessarily bound to an OS the compiler which produces the binaries might treat things differently as per OS.
Pascal: based in 1970 the String was just an array of bytes, not even necessarily meaning text. And for text ASCII or one of the other single-byte encodings could easily be dealt with.
Delphi: adopted as per Windows WideString, dealing with 16 bit per character, to perfectly make use of the WinAPI and its Unicode support. Later additions also emerged the UTF8String, which works with bytes again, but not necessarily only one byte per character. But also creations such as UCS4String are available since 2009, eating 4 bytes per character.
Free Pascal: stays with the old String but always defaults to UTF-8 encoding. While this always needs conversion when using the WinAPI it is also more platform independent. Several other String (compatibilty) types also exist, each with different memory usage.
ECMAScript (JavaScript): as per standard an engine should use UTF-16 for texts. See also JavaScript strings - UTF-16 vs UCS-2?
Java: engines must support a minimum of encodings, including UTF-16, thus internal String handling/memory usage may differ. See also What is the Java's internal represention for String? Modified UTF-8? UTF-16?
Application/program
Depends on the platform/OS. While the in-memory consumption of text is strongly influenced by the programming language compiler and the data types used there, using libraries (which could have been produced by entirely other compilers and programming languages) can mix this.
Strictly speaking the binary file format also has its strict encodings: on Windows the PE (used in EXE, DLL, etc.) has resource Strings in 16 bit characters again. So while f.e. the Free Pascal Compiler can (as per language) make heavy use of UTF-8 it will still build an EXE file with UTF-16 metadata in it.
Programs that deal with text (such as editors) will most likely hold any encoding "as is" in memory for the sake of performance, surely with compromises such as temporarily duplicating parts into Strings of 32 bit per character, just to quickly search through it, let alone supporting Unicode normalization.
Conversion
The most common approach is to use a common denominator:
Either every input is decoded into 32 bit characters which are then encoded into the target. Costs the most memory, but makes it easy to deal with.
In the WinAPI you either convert to UTF-16 via MultiByteToWideChar(), or from UTF-16 via WideCharToMultiByte(). To go from UTF-8 to Shift-JIS you'd make a sidestep from UTF-8 to UTF-16, then from UTF-16 to Shift-JIS. Support for all the encodings shift as per version and localized installation, there's not really a guarantee for all of them.
External libraries specialized on encodings alone can do this, like iconv - these support many encodings unbound to the OS support.

Why can't we store Unicode directly?

I read some article about Unicode and UTF-8.
The Unicode standard describes how characters are represented by code points. A code point is an integer value, usually denoted in base 16. In the standard, a code point is written using the notation U+12CA to mean the character with value 0x12ca (4,810 decimal). The Unicode standard contains a lot of tables listing characters and their corresponding code points:
Strictly, these definitions imply that it’s meaningless to say ‘this is character U+12CA‘. U+12CA is a code point, which represents some particular character; in this case, it represents the character ‘ETHIOPIC SYLLABLE WI’. In informal contexts, this distinction between code points and characters will sometimes be forgotten.
To summarize the previous section: a Unicode string is a sequence of code points, which are numbers from 0 through 0x10FFFF (1,114,111 decimal). This sequence needs to be represented as a set of bytes (meaning, values from 0 through 255) in memory. The rules for translating a Unicode string into a sequence of bytes are called an encoding.
I wonder why we have to encode U+12CA to UTF-8 or UTF-16 instead of saving the binary of 12CA in the disk directly. I think the reason is:
Unicode is not Self-synchronizing code, so if
10 represent A
110 represent B
10110 represent C
When I see 10110 in the disk we can't tell it's A and B or just C.
Unicode uses much more space instead of UTF-8 or UTF-16.
Am I right?
Read about Unicode, UTF-8 and the UTF-8 everywhere website.
There are more than a million Unicode code-points (you mentionned 1,114,111...). So you need at least 21 bits to be able to separate all of them (since 221 > 1114111).
So you can store Unicode characters directly, if you represent each of them by a wide enough integral type. In practice, that type would be some 32 bits integer (because it is not convenient to handle 3-bytes i.e. 24 bits integers). This is called UCS-4 and some systems or software do already handle their Unicode string in such a format.
Notice also that displaying Unicode strings is quite difficult, because of the variety of human languages (and also since Unicode has combining characters). Some need to be displayed right to left (Arabic, Hebrew, ....), others left to right (English, French, Spanish, German, Russian ...), and some top to down (Chinese, ...). A library displaying Unicode strings should be capable of displaying a string containing English, Chinese and Arabic words.... Then you see that decoding UTF-8 is the easy part of Unicode string displaying (and storing UCS-4 strings won't help much).
But, since English is the dominant language in IT technology (for economical reasons), it is very often cheaper to keep strings in UTF8 form. If most of the strings handled by your system are English (or in some other European language using the Latin alphabet), it is cheaper and it takes less space to keep them in UTF-8.
I guess than when China will become a dominant power in IT, things might change (or maybe not).
(I have no idea of the most common encoding used today on Chinese supercomputers or smartphones; I guess it is still UTF-8)
In practice, use a library (perhaps libunistring or Glib in C), to process UTF-8 strings and another one (e.g. pango and GTK in C) to display them. You'll find many Unicode related libraries in various programming languages.
I wonder why we have to encode U+12CA to UTF-8 or UTF-16 instead of saving the binary of 12CA in the disk directly.
How do you write 12CA to a disk directly? It is a bigger value than a byte can hold, so you need to write at least two bytes. Do you write 12 followed by CA? You just encoded it in UTF-16BE. That's what an encoding is...a definition of how to write an abstract number as bytes.
Other reading:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Pragmatic Unicode
For good and specific reasons, Unicode doesn't specify any particular encoding. If it makes sense for your scenario, you can specify your own.
Because Unicode doesn't specify any serialization, there is no way to "directly" store Unicode, just like you can't "directly" store a mathematical number or a flow chart to implement a program you designed. The question isn't really well-defined.
There are a number of existing serialization formats (encodings) so it is very likely that it makes the most sense to use an existing one unless your requirements are significantly different than what any existing encoding provides; even then, is it really worth the cost?
A stream of bits is just a stream of bits. Conventionally, we chop them up into groups of 8 and call that a "byte" and the latter half of your question is really "if it's not a byte, how can you tell which bits belong to which symbol?" There are many ways to do that, but the common ones generally define a sequence of some particular length (8, 16, and 32 are often convenient for reasons of compatibility with bus width on modern computers etc) but again, if you really wanted to, you could come up with something different. Huffman trees come to mind as one way to implement a way to communicate a structure of variable length (and is used for precisely that in many compression algorithms).
Consider one situation, even if you can directly save unicode binary into disk and close the file, what happens when you open the file again? It's just a bunch of binary, you don't know how many bytes a char occupied right, which means, if '🥶'(U+129398) and 'A' are the content of your file, then if you take it 1 byte for a char, then '🥶' can't be decoded correctly, which takes 2 bytes, then instead 1 emoji you see, you get two, which is U+63862 and U+65536 unicode char.

Why are there different encoding types?

This is a noob question, but I wanna know why there are different encoding types and what are their differences (ie. ASCII, utf-8 and 16, base64, etc.)
Reasons are many I believe but the main point is: "How many characters you need to display (encode)?" If you live in US for example, you could go pretty far with ASCII. But in many counties we need characters like ä, å, ü etc. (If SO was ASCII only or you try to read this text as ASCII encoded text, you'd see some weird characters in the places of ä, å and ü.) Think also the China, Japan, Thailand and other "exotic" countires. Those weird figures on photos you may have seen around the world just might be letters, not pretty pictures.
As for the differences between different encoding types you need to see their specification. Here's something for UTF-8.
http://www.unicode.org/standard/standard.html
http://www.utf-8.com/
http://en.wikipedia.org/wiki/UTF-8#Compared_to_other_multi-byte_encodings
I'm not familiar with UTF-16. Here's some information about the differences.
http://en.wikipedia.org/wiki/Unicode
http://en.wikipedia.org/wiki/Unicode_plane
Base64 is used when there is a need to encode binary data that needs to be stored and transferred over media that are designed to deal with textual data. If you've ever made somesort of email system with PHP, you've probably encountered Base64.
http://en.wikipedia.org/wiki/Base64
http://www.phpeveryday.com/articles/PHP-Email-Using-Embedded-Images-in-HTML-Email-P113.html
Is short: To support computer program's user interface localizations to many different languages. (Programming languages still mainly consist of characters found in ASCII encoding, althought it's possible for example in Java to use UTF-8 encoding in variable names, and the source code file is usually stored as something else than ASCII encoded text, for example UTF-8 encoding.)
In short vol.2: Always when different people are trying to solve some problem from a specific point of view (or even without a point of view if it's even possible), results may be quite different. Quote from Joel's unicode article (link below): "Because bytes have room for up to eight bits, lots of people got to thinking, "gosh, we can use the codes 128-255 for our own purposes." The trouble was, lots of people had this idea at the same time, and they had their own ideas of what should go where in the space from 128 to 255."
Thanks to Joachim and tchrist for all the info and discussion. Here's two articles I just read. (Both links are on the page I linked to earlier.) I'd forgotten most of the stuff from Joel's article since I last read it a few years back. Good introduction to the subject I hope. Mark Davis goes a little deeper.
http://www.joelonsoftware.com/articles/Unicode.html
http://www.icu-project.org/docs/papers/forms_of_unicode/
The real reason why there are so many variants is that the Unicode consortium came along too late.
In The Beginning memory and storage was expensive and using more than 8 (or sometimes only 7) bit of memory to store a single character was considered excessive. Thus pretty much all text was stored using 7 or 8 bit per character. Clearly, 8 bit are not enough memory to represent the characters of all human languages. It's barely enough to represent most characters used in a single language (and for some languages even that's not possible). Therefore many different character encodings where designed to allow different languages (English, German, Greek, Russian, ...) to encode their texts in 8 bits per characters. After all a single text file (and usually even a single computer system) will only ever used in a single language, right?
This led to a situation where there was no single agreed-upon mapping of characters to numbers of any kind. Many different, incompatible solutions where produced and no real central control existed. Some computer systems used ASCII, others used EBCDIC (or more precisely: one of the many variations of EBCDIC), ISO-8859-* (or one of its many derivatives) or any of a big list of encodings that are hardly heard about now.
Finally, the Unicode Consortium stepped up to the task to produce that single mapping (together with lots of auxiliary data that's useful but outside of the bounds of this answer).
When the Unicode consortium finally produced a fairly comprehensive list of characters that a computer might represent (together with a number of encoding schemes to encode them to binary data, depending on your concrete needs), the other character encoding schemes were already widely used. This slowed down the adoption of Unicode and its encodings (UTF-8, UTF-16) considerably.
These days, if you want to represent text, your best bet is to use one of the few encodings that can represent all Unicode characters. UTF-8 and UTF-16 together should suffice for 99% of all use cases, UTF-32 covers almost all the others. And just to be clear: all the UTF-* encodings can encode all valid Unicode characters. But due to the fact that UTF-8 and UTF-16 are variable-width encodings, they might not be ideal for all use cases. Unless you need to be able to interact with a legacy system that can't handle those encodings, there is rarely a reason to choose anything else these days.
The main reason is to be able to show more characters. When the internet was in it's infancy, noone really planned ahead thinking that one day there would be people using it from all countries and all languages around the world. So a small character set was good enough. Gradually it was revealed to be limited and English-centric, thus the demand for bigger character sets.

Are the microprocessors 'encoding format' specific?

A computer system is based on binary system. Data/instructions are encoded in binary. Encoding can be carried out in many formats - ASCII, UNICODE etc.
Is a microprocessor made for a chosen 'encoding format' ? if yes, how would it become compatible to other encoding formats? wouldn't there be a performance penalty in that case?
when we create a program, how its encoding format is chosen?
ASCII and UNICODE are encoding of text data and have nothing about binary data.
No, all microprocessors know about is binary numbers - they don't have a clue about the meaning of those numbers. That meaning is provided by us and by our tools used to build programs. For example, if you compile a C++ program using Visual Studio, it will use multi-byte characters, but the CPU doesn't know that.
One area where the microprocessor architecture does matter is endianness—for example, when you try to read a UTF-16LE encoding file on a big-endian machine, you have to swap the individual bytes of each code unit to get the expected 16-bit integer. This is an issue for all encoding forms whose code unit is wider than one byte. See section 2.6 of the second chapter of the Unicode standard for a more in-depth discussion. The processor itself still works with individual integer numbers, but as a library developer, you have to deal with the mapping from files (i.e., byte sequences) to memory arrays (i.e., code unit sequences).

Why does anyone use an encoding other than UTF-8? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 8 years ago.
Improve this question
I want to know why any developer would need to use an encoding other than UTF-8.
Wikipedia lists advantages and disadvantages of UTF-8 as compared to a variety of other encodings:
http://en.wikipedia.org/wiki/UTF-8#Advantages_and_disadvantages
The most important disadvantages are IMHO that UTF-8 might use significantly more space especially in Asian languages such as Chinese, Japanese or Hindi and that not all code points have the same size which makes measurements more difficult and many string operations such as search inefficient.
Well, some do it because their tools are archaic or flawed. Some do it because they don't see a need to support anything other than ASCII. Some do it because they don't know any better.
Those are the usual excuses for not using Unicode.
As for not using UTF-8 specifically there are different reasons. Some systems, like Windows1 (and stemming from that, .NET) and Java came to be in a time where Unicode was a strict 16-bit code. Therefore, there was really only one encoding: UCS-2, encoding code points directly as 16-bit words.
Later Unicode was expanded to 21 bits because 65536 code points weren't enough anymore. This caused encodings such as UTF-32 and UTF-16 to appear. For systems previously working with UCS-2 the transition to UTF-16 was the easiest and most sensible choice. Windows did that transition back in Ye Olde Days of Windows 2000.
So while I think that nearly all application nowadays should support Unicode I don't think it is entirely necessary for them to specifically use UTF-8. There are historic reasons for that and no real benefit in converting existing systems from UTF-16 to UTF-8.
1 NT.
In UTF-8 code points between 0800 and FFFF take up three bytes in UTF-8 but only two in UTF-16. See the wikipedia comparison for more details, but basically if text heavily uses code points in this range (say, if it's Chinese), UTF-8 files will be larger than UTF-16 files with the same content.
UTF-8 is very efficient at encoding plain English text (same as ASCII). If your user base is likely to be mostly, say, Chinese, you will be much better off using UTF-16.
For more information, see The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.
Because outside the English-speaking world, people have been using various encodings that predate Unicode and are tailored for their respective languages for decades. These language-specific encodings have become ingrained everywhere and are pretty much a standard. If you want to have any hope of interfacing with legacy systems, you have to use them, so all systems have to support them and usually use them as default even if they by now support UTF-8 as well. There may even be multiple legacy encodings traditionally used for different purposes.
Examples:
ISO-8859-1 in western Europe - actually outdated there as well, as you need ISO-8859-15 for the Euro sign
ISO-2022-JP in Japan for emails, Shift JIS for websites
Big5 in Taiwan
GB2312 in China
The last two examples show that encodings can even be a political issue.
Sometimes they are restricted due to historical/unsupported reasons (I'm developing on Windows using Zend Studio on a Samba share on a Linux box: and something in that mix means I keep reverting to Cp1512 instead of UTF8).
Sometimes you don't need to use UTF-8 (for example when storing a md5 hash in a database: you only need the hexadecimal range 0-9 A-F: why make it a UTF-8 field, which will take at least a byte extra storage instead of normal ASCII).
Sometimes it's just laziness learning the UTF-8 functions for a particular language.
Because they do not know better.
The only valid criticism to utf-8 is that encodings for common Asian languages are oversized from other encodings.
UTF-8 is superior because
It is ASCII compatible. Most known and tried string operations do not need adaptation.
It is Unicode. Anything that isn't Unicode shouldn't even be considered in this day and age. If you have important data in encoding X, spend two minutes on Google and write a conversion function. Even if you have to interface with sourceless legacy app Z, you can run your communications through a pipe so that your logic stays in the 21st century.
UTF-16 isn't fixed length either and assuming it is like many do, will only cause terrible bugs.
Additionally Unicode is very complex and it is almost certain than any fixed-size algorithm adapted from ASCII will yield bad results even in UTF-32.
Say you have this UTF-16 string.
[0][1][2][F|3] [4] [5]
And you want to insert a character with code 8 between [3] and [4]
you would do insert(5,8)
If you don't check for characters outside BMP(serially as in UTF-8 as you cannot know how many double sized characters you have) you get:
[0][1][2][F|8][3][4][5]
Two new garbage characters. So much for your fixed size encoding.
You can of course disallow such characters altogether, but then when your code interfaces with the real world, you might find your program saves the profile for this user who lives in rm -Rf / in .profile instead of [Classical Chinese Proverb].profile.
Or just an angry user that cannot write his thesis on Classical Chinese Proverbs with your software.
One legitimate reason is when you need to deal with legacy documents, software or hardware that are not Unicode compatible.
Another legitimate reason is that you need to use a programming language / libraries that do not support UTF8 / Unicode well ... or at all.
Other answers mention that UTF-16 is more compact than UTF-8 for Asian languages / characters.
And of course there are reasons like short-sightedness, ignorance, laziness ... and deadlines.
Its also worth remembering that in some circumstances (where a non-latin set of characters are needed) UTF-8 can actually bloat larger than the 16 bit Unicode encoding. In those cases ucs-2 or utf-16 would be a better choice.
The reasons for using non-Unicode 8-bit character sets / encodings are all back compatibility of some kind, and/or inertia. For that matter, the most frequent reasons for using UTF-8 are compatibility with standards like XML that mandate or prefer UTF-8.
Differences in the number of bytes you think text will take up in different encodings, especially in storage, are mostly theoretical. In real world situations, compatibility requirements are more important. If compression is used, the size differences go away anyway. Even if compression is not used, total text size is hard to predict and is rarely a deciding factor.
When converting legacy code that used non-Unicode 8-bit encodings, using UTF-16 can be a tool for making sure all code has been converted, because mismatches can be caught as compile-time type errors. Many languages, runtimes and libraries like Javascript, JVM, .NET, ICU use 16-bit strings and UTF-16, even though storage and Internet protocols are usually 8-bit.
Imagine all files to consider are in GB2312 (China mainland standard). Then you might choose GB18030 as Unicode encoding instead. They are compatible the same way as all ASCII is UTF-8. That is useful in China mainland!
You might decide even quicker when you find out that both mentioned GB-standards are required in your IT-product by law (as far as I have heard), if you want to ship in China (mainland).
Another upside is that GB2312, and as such GB18030 as well, are also ASCII compatible.
It is algorithmically not so robust, though. – So if you have no political reasons or any GB2312 legacy, it makes no sense to use it. But if you do, here you got your answer.
Related to the subject, when using MySQL, as if it wasn't complex enough, you get the option the choose which kind of UTF-8 collation you want to use. So what would you use?
UTF-8 general ci
or
UTF-8 unicode ci?
(I tend to use the UTF-8 variant that is used for the database connection)
Because you sometimes want to operate easily on codepoints -- then you'd choose f.e. UCS-2 or UCS-4.
Many APIs require other Unicode encodings - mostly UTF-16. For instance, Java, .NET, Win32.
At my previous employer we used iso-8859-1 for some of our ASP pages to match the collation of our SQL Server, which as you can guess was not Unicode. I wanted to change the collation, but the manager said to wait till we upgrade our SQL Server to do it. Needless to say it never happened - I haven't been with them for a little over a year now, so I don't know if they finally did it.
Unicode certainly is a good place to work from in most cases, but a developer should be familiar with many different types of character encoding. Certainly ASCII might be used if the set of characters is limited.
What if you're a developer and receiving data from a source that doesn't send UTF-8? There could be lots of interface issues if you don't understand your input.
Joel's article on the must-knows for character encoding is good and worth reading.