A computer system is based on the binary system. Data and instructions are encoded in binary. Encoding can be carried out in many formats: ASCII, Unicode, etc.
Is a microprocessor made for a chosen 'encoding format'? If yes, how would it become compatible with other encoding formats? Wouldn't there be a performance penalty in that case?
When we create a program, how is its encoding format chosen?
ASCII and Unicode are encodings of text data and say nothing about binary data in general.
No, all microprocessors know about is binary numbers - they don't have a clue about the meaning of those numbers. That meaning is provided by us and by our tools used to build programs. For example, if you compile a C++ program using Visual Studio, it will use multi-byte characters, but the CPU doesn't know that.
One area where the microprocessor architecture does matter is endianness. For example, when you read a UTF-16LE-encoded file on a big-endian machine, you have to swap the two bytes of each code unit to get the expected 16-bit integer. This is an issue for all encoding forms whose code unit is wider than one byte. See section 2.6 of the second chapter of the Unicode standard for a more in-depth discussion. The processor itself still works with individual integers, but as a library developer, you have to deal with the mapping from files (i.e., byte sequences) to memory arrays (i.e., code unit sequences).
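As a rough illustration of the byte swapping described above, here is a minimal Python sketch. The sample bytes and variable names are made up for the example; it only shows that the same bytes yield different 16-bit integers depending on the byte order you assume.

```python
# The four bytes of "hi" as they would be stored in a UTF-16LE file
# (low byte of each 16-bit code unit first).
raw = bytes([0x68, 0x00, 0x69, 0x00])

# Reading each byte pair as little-endian gives the intended code units.
little = [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]

# Reading the same pairs as big-endian gives byte-swapped values.
big = [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]

print([hex(u) for u in little])  # ['0x68', '0x69']
print([hex(u) for u in big])     # ['0x6800', '0x6900']
print(raw.decode("utf-16-le"))   # hi
```

The CPU never sees "text" here, only integers; the choice of byte order is made by the code that reads the file.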
Related
Say we have a file that is Latin-1 encoded and that we use a text editor to read in that file into memory. My questions are then:
How will those character strings be represented in memory? Latin-1, UTF-8, UTF-16 or something else?
What determines how those strings are represented in memory? Is it the application, the programming language the application was written in, the OS or the hardware?
As a follow-up question:
How do applications then save files to encoding schemes that use different character sets? For example, converting UTF-8 to UTF-16 seems fairly intuitive to me, as I assume you just decode to the Unicode code point, then encode to the target encoding. But what about going from UTF-8 to Shift-JIS, which has a different character set?
Operating system
Windows
1993: Windows adopted Unicode 1.0 with NT 3.1. Back then, Unicode was what is now known as UCS-2. That Windows version also introduced NTFS (New Technology File System), which likewise stores every filename in a UCS-2-like manner (16-bit code units).
2000: With NT 5.0 (aka Windows 2000) there was a shift from UCS-2 to UTF-16; both the OS and the encoding became available that year.
Nothing has changed since then. Windows has used 16-bit code units internally for almost 30 years, and thanks to UTF-16 even the newest code points, such as emoji, are supported. Its API works the same way, with the compatibility functions for byte-wise encodings merely being stubs that convert the input to UTF-16. See also
What unicode encoding (UTF-8, UTF-16, other) does Windows use for its Unicode data types?
"Windows uses UTF-16 as its internal encoding", what exactly does this mean?
Why does Windows use UTF-16LE?
Is it safe to assume all Windows platforms will be in UCS-2 LE
Unix: most distributions use UTF-8 by default, because it's most backward compatible while being future proof enough.
Programming language
Depends on their age or on their compiler: while languages themselves are not necessarily bound to an OS, the compiler that produces the binaries might treat things differently per OS.
Pascal: dating from 1970, its String was just an array of bytes, not even necessarily meaning text. Text in ASCII or one of the other single-byte encodings could easily be handled that way.
Delphi: adopted WideString to match Windows, using 16 bits per character to make full use of the WinAPI and its Unicode support. Later additions include UTF8String, which works with bytes again, though not necessarily with only one byte per character. Types such as UCS4String have also been available since 2009, taking 4 bytes per character.
Free Pascal: stays with the old String but always defaults to UTF-8 encoding. While this always requires conversion when using the WinAPI, it is also more platform independent. Several other String (compatibility) types also exist, each with different memory usage.
ECMAScript (JavaScript): as per standard an engine should use UTF-16 for texts. See also JavaScript strings - UTF-16 vs UCS-2?
Java: engines must support a minimum of encodings, including UTF-16, thus internal String handling/memory usage may differ. See also What is the Java's internal represention for String? Modified UTF-8? UTF-16?
Application/program
Depends on the platform/OS. While the in-memory consumption of text is strongly influenced by the programming language compiler and the data types used there, using libraries (which could have been produced by entirely other compilers and programming languages) can mix this.
Strictly speaking, the binary file format also has its own fixed encodings: on Windows, the PE format (used in EXE, DLL, etc.) stores resource Strings in 16-bit characters again. So while, for example, the Free Pascal Compiler can (as per the language) make heavy use of UTF-8, it will still build an EXE file with UTF-16 metadata in it.
Programs that deal with text (such as editors) will most likely hold any encoding "as is" in memory for the sake of performance, surely with compromises such as temporarily duplicating parts into Strings of 32 bit per character, just to quickly search through it, let alone supporting Unicode normalization.
Conversion
The most common approach is to use a common denominator:
Either every input is decoded into 32 bit characters which are then encoded into the target. Costs the most memory, but makes it easy to deal with.
In the WinAPI you either convert to UTF-16 via MultiByteToWideChar(), or from UTF-16 via WideCharToMultiByte(). To go from UTF-8 to Shift-JIS you'd take a detour from UTF-8 to UTF-16, then from UTF-16 to Shift-JIS. Which encodings are supported varies by Windows version and localized installation; there is no real guarantee that all of them are available.
External libraries that specialize in encodings can do this, like iconv. These support many encodings independently of what the OS supports.
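The "common denominator" idea above can be sketched in Python, where a str is already a sequence of code points. The helper name transcode is made up for this example, and it relies on Python's built-in shift_jis codec rather than the WinAPI or iconv.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Decode bytes into a str (code points), then encode into the target."""
    text = data.decode(source)
    return text.encode(target)

utf8_bytes = "\u65e5\u672c\u8a9e".encode("utf-8")        # Japanese text, 9 bytes
sjis_bytes = transcode(utf8_bytes, "utf-8", "shift_jis")  # 6 bytes
round_trip = sjis_bytes.decode("shift_jis")

# A character that the target character set lacks cannot be converted
# unless you pick a policy (fail, replace, skip, ...).
try:
    transcode("\U0001F600".encode("utf-8"), "utf-8", "shift_jis")
    emoji_converted = True
except UnicodeEncodeError:
    emoji_converted = False

print(len(utf8_bytes), len(sjis_bytes), emoji_converted)  # 9 6 False
```

So the different character set is not a problem for characters that both sides share; for the rest, the application has to decide what to do, for example by passing errors="replace" to encode().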
I read some article about Unicode and UTF-8.
The Unicode standard describes how characters are represented by code points. A code point is an integer value, usually denoted in base 16. In the standard, a code point is written using the notation U+12CA to mean the character with value 0x12ca (4,810 decimal). The Unicode standard contains a lot of tables listing characters and their corresponding code points:
Strictly, these definitions imply that it’s meaningless to say ‘this is character U+12CA‘. U+12CA is a code point, which represents some particular character; in this case, it represents the character ‘ETHIOPIC SYLLABLE WI’. In informal contexts, this distinction between code points and characters will sometimes be forgotten.
To summarize the previous section: a Unicode string is a sequence of code points, which are numbers from 0 through 0x10FFFF (1,114,111 decimal). This sequence needs to be represented as a set of bytes (meaning, values from 0 through 255) in memory. The rules for translating a Unicode string into a sequence of bytes are called an encoding.
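To make the quoted definition concrete, here is a small Python sketch that takes the code point U+12CA from the text and shows the byte sequences that three encodings produce for it. The variable names are just for illustration.

```python
ch = "\u12ca"        # ETHIOPIC SYLLABLE WI
code_point = ord(ch)
print(hex(code_point))  # 0x12ca

# The same code point, translated to bytes by three different encodings.
encoded = {name: ch.encode(name).hex() for name in ("utf-8", "utf-16-be", "utf-32-be")}
for name, hex_bytes in encoded.items():
    print(name, hex_bytes)
# utf-8 e18b8a
# utf-16-be 12ca
# utf-32-be 000012ca
```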
I wonder why we have to encode U+12CA to UTF-8 or UTF-16 instead of saving the binary of 12CA in the disk directly. I think the reason is:
Unicode is not a self-synchronizing code, so if
10 represent A
110 represent B
10110 represent C
When I see 10110 on the disk, I can't tell whether it is A and B or just C.
Unicode uses much more space than UTF-8 or UTF-16.
Am I right?
Read about Unicode, UTF-8 and the UTF-8 everywhere website.
There are more than a million Unicode code points (you mentioned 1,114,111...). So you need at least 21 bits to be able to distinguish all of them (since 2^21 = 2,097,152 > 1,114,111).
So you can store Unicode characters directly if you represent each of them with a wide enough integral type. In practice, that type would be a 32-bit integer (because it is not convenient to handle 3-byte, i.e. 24-bit, integers). This is called UCS-4, and some systems or software already handle their Unicode strings in such a format.
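The 21-bit figure and the cost of a fixed 32-bit representation can be checked with a few lines of Python. This is only a sketch; "utf-32-le" is used here as a stand-in for UCS-4 without a byte order mark.

```python
MAX_CODE_POINT = 0x10FFFF  # highest Unicode code point (1,114,111)

bits_needed = MAX_CODE_POINT.bit_length()
print(bits_needed)  # 21

text = "hello"
as_utf32 = text.encode("utf-32-le")  # fixed 4 bytes per code point
as_utf8 = text.encode("utf-8")       # 1 byte per ASCII character

print(len(as_utf32), len(as_utf8))  # 20 5
```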
Notice also that displaying Unicode strings is quite difficult, because of the variety of human languages (and also since Unicode has combining characters). Some need to be displayed right to left (Arabic, Hebrew, ...), others left to right (English, French, Spanish, German, Russian, ...), and some top to bottom (Chinese, ...). A library displaying Unicode strings should be capable of displaying a string containing English, Chinese and Arabic words. Then you see that decoding UTF-8 is the easy part of displaying Unicode strings (and storing UCS-4 strings won't help much).
But, since English is the dominant language in IT (for economic reasons), it is very often cheaper to keep strings in UTF-8 form. If most of the strings handled by your system are English (or in some other European language using the Latin alphabet), it is cheaper and takes less space to keep them in UTF-8.
I guess that when China becomes a dominant power in IT, things might change (or maybe not).
(I have no idea of the most common encoding used today on Chinese supercomputers or smartphones; I guess it is still UTF-8)
In practice, use a library (perhaps libunistring or GLib in C) to process UTF-8 strings and another one (e.g. Pango and GTK in C) to display them. You'll find many Unicode-related libraries in various programming languages.
I wonder why we have to encode U+12CA to UTF-8 or UTF-16 instead of saving the binary of 12CA in the disk directly.
How do you write 12CA to a disk directly? It is a bigger value than a byte can hold, so you need to write at least two bytes. Do you write 12 followed by CA? You just encoded it in UTF-16BE. That's what an encoding is...a definition of how to write an abstract number as bytes.
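The point about "12 followed by CA" can be checked directly. In this Python sketch, int.to_bytes is used as the "write the number naively" step, and the result is compared with what the UTF-16 codecs produce.

```python
value = 0x12CA

# "Just write the number" still forces a choice of width and byte order.
naive_big = value.to_bytes(2, "big")        # bytes 12 CA
naive_little = value.to_bytes(2, "little")  # bytes CA 12

# Those two choices are exactly UTF-16BE and UTF-16LE for this code point.
print(naive_big == "\u12ca".encode("utf-16-be"))     # True
print(naive_little == "\u12ca".encode("utf-16-le"))  # True
```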
Other reading:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Pragmatic Unicode
For good and specific reasons, Unicode doesn't specify any particular encoding. If it makes sense for your scenario, you can specify your own.
Because Unicode doesn't specify any serialization, there is no way to "directly" store Unicode, just like you can't "directly" store a mathematical number or a flow chart to implement a program you designed. The question isn't really well-defined.
There are a number of existing serialization formats (encodings) so it is very likely that it makes the most sense to use an existing one unless your requirements are significantly different than what any existing encoding provides; even then, is it really worth the cost?
A stream of bits is just a stream of bits. Conventionally, we chop them up into groups of 8 and call each group a "byte", and the latter half of your question is really "if it's not a byte, how can you tell which bits belong to which symbol?" There are many ways to do that, but the common ones generally define units of some particular length (8, 16, and 32 are often convenient for reasons of compatibility with bus width on modern computers, etc.). But again, if you really wanted to, you could come up with something different. Huffman trees come to mind as one way to communicate a structure of variable length (and they are used for precisely that in many compression algorithms).
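UTF-8 is one concrete answer to the "which bits belong to which symbol" question: the first byte of each sequence tells you how many bytes follow. The Python sketch below uses made-up helper names and does no validation of continuation bytes, so treat it as an illustration rather than a real decoder.

```python
def utf8_sequence_length(lead: int) -> int:
    """Return the length of a UTF-8 sequence, judged by its first byte."""
    if lead < 0x80:
        return 1            # 0xxxxxxx: ASCII
    if 0xC2 <= lead <= 0xDF:
        return 2            # 110xxxxx
    if 0xE0 <= lead <= 0xEF:
        return 3            # 1110xxxx
    if 0xF0 <= lead <= 0xF4:
        return 4            # 11110xxx
    raise ValueError(f"not a valid UTF-8 lead byte: {lead:#x}")

def split_utf8(data: bytes) -> list:
    """Split a UTF-8 byte string into one chunk per code point."""
    chunks, i = [], 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        chunks.append(data[i:i + n])
        i += n
    return chunks

sample = "A\u00e9\u20ac\U0001F600".encode("utf-8")  # 1-, 2-, 3- and 4-byte characters
lengths = [len(chunk) for chunk in split_utf8(sample)]
print(lengths)  # [1, 2, 3, 4]
```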
Consider one situation: even if you could save the raw Unicode value to disk and close the file, what happens when you open the file again? It is just a bunch of bytes, and you don't know how many bytes each character occupies. Say the file contains '🥶' (U+1F976, decimal 129398) and 'A', and the emoji's value was written as the four bytes 00 01 F9 76. If you read 1 byte per character, '🥶' can't be decoded correctly because it needs more than one byte. If you read 2 bytes per character, then instead of one emoji you get two values, 0x0001 and 0xF976 (63862), which are two unrelated code points.
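Here is that situation as a small Python sketch. It assumes, purely for illustration, that the emoji's value was written as a 4-byte big-endian integer and then read back two bytes at a time.

```python
code_point = ord("\U0001F976")    # the cold face emoji, U+1F976
raw = code_point.to_bytes(4, "big")  # bytes 00 01 F9 76

# Reading with the wrong width (2 bytes per character) splits the value.
halves = [int.from_bytes(raw[i:i + 2], "big") for i in (0, 2)]

print(code_point)  # 129398
print(halves)      # [1, 63862]
```

Without an agreed rule for how many bytes make up a character, the reader has no way to know that the two halves belong together.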
The DataMatrix article on Wikipedia mentions that it supports only ASCII by default. It also mentions a special mode for Base256 encoding, which should be able to represent arbitrary byte values.
However, all the barcode generator libraries that I have tried so far (Onbarcode and Barcodelib) only accept data as a string and show errors for characters beyond ASCII. There is also no way to pass a byte[], which would be required for Base256 mode.
Is there a barcode generator library that supports Base256 mode? (preferably commercial library with support)
Converting the unicode string into Base64 and decoding from base64 after the data is scanned would be one approach, but is there anything else?
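The Base64 workaround mentioned above can be sketched as follows in Python. The function names are hypothetical, and the barcode generation and scanning steps are left out; the sketch only shows the text-to-ASCII round trip that would wrap them.

```python
import base64

def to_ascii_payload(text: str) -> str:
    """Encode arbitrary Unicode text as an ASCII-only Base64 string."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")

def from_ascii_payload(payload: str) -> str:
    """Reverse of to_ascii_payload, applied after the barcode is scanned."""
    return base64.b64decode(payload).decode("utf-8")

original = "Gr\u00fc\u00dfe"  # German greeting with non-ASCII letters
payload = to_ascii_payload(original)
restored = from_ascii_payload(payload)

print(all(ord(c) < 128 for c in payload))  # True
print(restored == original)                # True
```

One trade-off to keep in mind: Base64 grows the data by about a third, which matters for a barcode with limited capacity.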
It is possible, although it has some pitfalls:
1) It depends on which language you're writing your app in (there are different bindings for different DM libraries across programming languages).
For example, there is a pretty common library in *nix-related environments (almost all barcode scanners/generators on Maemo/MeeGo/Tizen, some WinPhone apps, KDE thingies, and so on use it) called [libdmtx][1]. As far as I tested, it encodes and decodes messages containing Unicode pretty well, but it doesn't properly mark the encoded message ("Hey, other readers, it is Unicode here!"), so other libraries, such as [ZXing][2], as well as many proprietary scanners, decode those Unicode messages as ASCII.
As far as I discussed with the [ZXing][2] author, the proper mark would probably be an ECI segment (0d241 byte as the first codeword, followed by a "0d26" byte (for UTF-8)). However, that is a theoretical solution, based on the one for QR codes and not standardized in any way for DataMatrix (and neither [libdmtx][1] nor [ZXing][2] supports encoding with such markings yet, although there are some steps in that direction).
So, TL;DR: If you plan to use the generated codes (with Unicode messages) only between apps that you're writing, you can freely use [libdmtx][1] for both encoding and decoding on both sides and it will work fine :) If not, try to look for [zxing][2] ports in your language (and make sure that the port supports encoding).
1: github.com/dmtx/libdmtx
2: github.com/zxing/zxing
Could anyone give me a concise definitions of
Unicode
UTF7
UTF8
UTF16
UTF32
Codepages
How they differ from Ascii/Ansi/Windows 1252
I'm not after wikipedia links or incredible detail, just some brief information on how and why the huge variations in Unicode have come about and why you should care as a programmer.
This is a good start: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
If you want a really brief introduction:
Unicode in 5 Minutes
Or if you are after one-liners:
Unicode: a mapping of characters to integers ("code points") in the range 0 through 1,114,111; covers pretty much all written languages in use
UTF7: an encoding of code points into a byte stream with the high bit clear; in general do not use
UTF8: an encoding of code points into a byte stream where each character may take one, two, three or four bytes to represent; should be your primary choice of encoding
UTF16: an encoding of code points into a word stream (16-bit units) where each character may take one or two words (two or four bytes) to represent
UTF32: an encoding of code points into a stream of 32-bit units where each character takes exactly one unit (four bytes); sometimes used for internal representation
Codepages: a system in DOS and Windows whereby characters are assigned to integers, and an associated encoding; each covers only a subset of languages. Note that these assignments are generally different than the Unicode assignments
ASCII: a very common assignment of characters to integers, and the direct encoding into bytes (all high bit clear); the assignment is a subset of Unicode, and the encoding a subset of UTF-8
ANSI: a standards body
Windows 1252: A commonly used codepage; it is similar to ISO-8859-1, or Latin-1, but not the same, and the two are often confused
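Two of the one-liners above are easy to check in Python: that ASCII text has the same bytes in UTF-8, and that Windows-1252 and Latin-1 really do differ. The byte 0x80 is picked here as one example of where they disagree.

```python
text = "plain ASCII"
same_bytes = text.encode("ascii") == text.encode("utf-8")
print(same_bytes)  # True

byte = b"\x80"
in_cp1252 = byte.decode("cp1252")   # the euro sign, U+20AC
in_latin1 = byte.decode("latin-1")  # a C1 control character, U+0080

print(f"U+{ord(in_cp1252):04X}", f"U+{ord(in_latin1):04X}")  # U+20AC U+0080
```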
Why do you care? Because without knowing the character set and encoding in use, you don't really know what characters a given byte stream represents. For example, the byte 0xDE could encode
Þ (LATIN CAPITAL LETTER THORN)
ﬁ (LATIN SMALL LIGATURE FI)
ή (GREEK SMALL LETTER ETA WITH TONOS)
or 13 other characters, depending on the encoding and character set used.
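The three readings of 0xDE listed above can be reproduced with Python's built-in codecs. The choice of cp1252, mac_roman and iso8859_7 is my guess at which character sets produce those three characters; other encodings give yet other results.

```python
byte = b"\xde"

decoded = {codec: byte.decode(codec) for codec in ("cp1252", "mac_roman", "iso8859_7")}
for codec, ch in decoded.items():
    print(codec, f"U+{ord(ch):04X}")
# cp1252 U+00DE
# mac_roman U+FB01
# iso8859_7 U+03AE
```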
As well as the oft-referenced Joel one, I have my own article which looks at it from a .NET-centric viewpoint, just for variety...
Yeah, I have some insight, and although it might be wrong, it has helped me to understand it.
Let's just take some text. It's stored in the computer's RAM as a series of bytes, and the codepage is simply the mapping table between the bytes and the characters you and I read. So something like Notepad comes along with its codepage and translates the bytes to your screen, and you see a bunch of garbage, upside-down question marks, etc. This does not mean your data is garbled, only that the application reading the bytes is not using the correct codepage. Some applications are smarter than others at detecting the correct codepage, and some streams of bytes start with a BOM (Byte Order Mark), which can signal which Unicode encoding is in use.
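As a rough sketch of how an application might use a BOM, the Python snippet below checks the start of a byte string against the BOM constants in the standard codecs module. The function name sniff_bom is made up, and this is a heuristic: most files have no BOM at all, and a BOM only identifies Unicode encodings, not legacy codepages.

```python
import codecs

# UTF-32 must be checked before UTF-16, because the UTF-32LE BOM
# starts with the same two bytes as the UTF-16LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes):
    """Return the encoding suggested by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

print(sniff_bom(codecs.BOM_UTF8 + b"hi"))                   # utf-8
print(sniff_bom(codecs.BOM_UTF16_BE + b"\x00h\x00i"))       # utf-16-be
print(sniff_bom(b"hi"))                                     # None
```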
UTF-7, UTF-8, UTF-16, etc. are, loosely speaking, all just different codepages using different formats.
The same file stored as bytes using different codepages will have a different file size because the bytes are stored differently.
They also don't really differ from Windows-1252, as that's just another codepage.
For a better, smarter answer, try one of the links.
Here, read this wonderful explanation from Joel himself.
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Others have already pointed out good enough references to begin with. I'm not listing a true Dummy's guide, but rather some pointers from the Unicode Consortium page. You'll find some more nitty-gritty reasons for the usage of different encodings at the Unicode Consortium pages.
The Unicode FAQ is a good enough place to answer some (not all) of your queries.
A more succinct answer on why Unicode exists, is present in the Newcomer's section of the Unicode website itself:
Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.
As far as the technical reasons for usage of UTF-8, UTF-16 or UTF-32 are concerned, the answer lies in the Technical Introduction to Unicode:
UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of transforming all Unicode characters into a variable length encoding of bytes. It has the advantages that the Unicode characters corresponding to the familiar ASCII set have the same byte values as ASCII, and that Unicode characters transformed into UTF-8 can be used with much existing software without extensive software rewrites.
UTF-16 is popular in many environments that need to balance efficient access to characters with economical use of storage. It is reasonably compact and all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units.
UTF-32 is popular where memory space is no concern, but fixed width, single code unit access to characters is desired. Each Unicode character is encoded in a single 32-bit code unit when using UTF-32.
All three encoding forms need at most 4 bytes (or 32-bits) of data for each character.
A general rule of thumb is to use UTF-8 when the predominant languages supported by your application are spoken west of the Indus river, UTF-16 for the opposite (east of the Indus), and UTF-32 when you are concerned about handling characters with uniform storage.
By the way UTF-7 is not a Unicode standard and was designed primarily for use in mail applications.
I'm not after wikipedia links or incredible detail, just some brief information on how and why the huge variations in Unicode have come about and why you should care as a programmer.
First of all, there aren't "variations of unicode". Unicode is a standard, the standard, to assign code points (integers) to characters. UTF8 is the most popular way to represent those integers as bytes!
Why should you care as a programmer?
It's fun to understand this!
If you don't have basic understanding of encodings, you can easily produce buggy code.
Example: You receive a ByteArray myByteArray from somewhere and you know it represents characters. You then run myByteArray.toString() and you get the string Hello. Your program works! One day after shipping your code, your German customer calls: "We have a problem, äöü are not displayed correctly!" You start debugging the code, feeling pretty lost without a basic understanding of encodings. With that understanding, however, you know that the error was probably this: when running myByteArray.toString(), your program assumed the bytes were encoded with the default system encoding. But maybe they weren't! Maybe they were UTF8 and your system is LATIN-SOMETHING, and so you should have run myByteArray.toString("UTF8") instead!
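The bug described above is easy to reproduce. In this Python sketch, bytes.decode plays the role of the myByteArray.toString() call from the example, and Latin-1 stands in for the "LATIN-SOMETHING" system default; both are assumptions made for illustration.

```python
# Bytes that really are UTF-8 (this is what the customer sent).
my_bytes = "\u00e4\u00f6\u00fc".encode("utf-8")  # the letters a, o, u with umlauts

# Decoding with the wrong encoding does not fail, it just yields garbage.
wrong = my_bytes.decode("latin-1")

# Decoding with the encoding the sender actually used gives the text back.
right = my_bytes.decode("utf-8")

print(len(wrong), len(right))  # 6 3
```

Note that the wrong decode raises no error at all, which is why this kind of bug tends to surface only when the first non-ASCII input arrives.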
Resources:
I would NOT recommend Joel's article as suggested by others. It's a long article with a lot of irrelevant information. I read it a couple of years back and the essence of it didn't stick to my brain since there are so many unimportant details.
As already mentioned, http://wiki.secondlife.com/wiki/Unicode_In_5_Minutes is a great place to go to grasp the essence of Unicode.
If you want to actually understand variable length encodings like UTF8 I'd recommend https://www.tsmean.com/articles/encoding/unicode-and-utf-8-tutorial-for-dummies/.
What is the difference between the Unicode, UTF8, UTF7, UTF16, UTF32, ASCII, and ANSI encodings?
In what way are these helpful for programmers?
Going down your list:
"Unicode" isn't an encoding, although unfortunately, a lot of documentation imprecisely uses it to refer to whichever Unicode encoding that particular system uses by default. On Windows and Java, this often means UTF-16; in many other places, it means UTF-8. Properly, Unicode refers to the abstract character set itself, not to any particular encoding.
UTF-16: 2 bytes per "code unit". This is the native format of strings in .NET, and generally in Windows and Java. Values outside the Basic Multilingual Plane (BMP) are encoded as surrogate pairs. These used to be relatively rarely used, but now many consumer applications will need to be aware of non-BMP characters in order to support emojis.
UTF-8: Variable length encoding, 1-4 bytes per code point. ASCII values are encoded as ASCII using 1 byte.
UTF-7: Usually used for mail encoding. Chances are if you think you need it and you're not doing mail, you're wrong. (That's just my experience of people posting in newsgroups etc - outside mail, it's really not widely used at all.)
UTF-32: Fixed width encoding using 4 bytes per code point. This isn't very efficient, but makes life easier outside the BMP. I have a .NET Utf32String class as part of my MiscUtil library, should you ever want it. (It's not been very thoroughly tested, mind you.)
ASCII: Single byte encoding only using the bottom 7 bits. (Unicode code points 0-127.) No accents etc.
ANSI: There's no one fixed ANSI encoding - there are lots of them. Usually when people say "ANSI" they mean "the default locale/codepage for my system" which is obtained via Encoding.Default, and is often Windows-1252 but can be other locales.
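To see the code unit counts from the list above, here is a short Python sketch. The helper code_units is a made-up name; it counts how many units of a given size each character needs, and then unpacks the surrogate pair that UTF-16 uses for a non-BMP character.

```python
import struct

def code_units(text: str, encoding: str, unit_bytes: int) -> int:
    """Number of code units needed to store text in the given encoding."""
    return len(text.encode(encoding)) // unit_bytes

samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # ASCII, accented letter, euro sign, emoji
counts = {
    ch: (
        code_units(ch, "utf-8", 1),
        code_units(ch, "utf-16-le", 2),
        code_units(ch, "utf-32-le", 4),
    )
    for ch in samples
}
for ch, triple in counts.items():
    print(f"U+{ord(ch):04X}", triple)
# U+0041 (1, 1, 1)
# U+00E9 (2, 1, 1)
# U+20AC (3, 1, 1)
# U+1F600 (4, 2, 1)

# The emoji lies outside the BMP, so UTF-16 stores it as a surrogate pair.
high, low = struct.unpack("<2H", "\U0001F600".encode("utf-16-le"))
print(hex(high), hex(low))  # 0xd83d 0xde00
```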
There's more on my Unicode page and tips for debugging Unicode problems.
The other big resource of code is unicode.org which contains more information than you'll ever be able to work your way through - possibly the most useful bit is the code charts.
Some reading to get you started on character encodings: Joel on Software:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
By the way - ASP.NET has nothing to do with it. Encodings are universal.