While working on Unicode, I came across "wctomb" API which is used to convert wide character to multibyte equivalent, my query is that UTF-8, UTF 16 both are considered to be mutibyte encoding because of variable length, so in which encoding wctomb gives the output?
The "wide" and "multibyte" encoding formats of wctomb (and other related functions) are platform-dependent, and are likely to also be locale-dependent (see setlocale and LC_CTYPE). There is no portable way to detect or control which encoding is used.
Related
I am a very stupid Programme Manager and I have a client requesting us to send in either ASCII or ANSI encoding format.
Our programmers has used Unicode (UTF-16), so my question is if Unicode (UTF-16) is compatible with ASCII or ANSI? Or am I understanding this incorrectly? Are we to change encoding or?
We haven't tried anything yet.
In short: ASCII encoding contains 128 characters. ANSI encoding contains 256 characters. UTF-16 encoding has the capacity for 1,112,064 character codes. There is some nuance such as the bytes used to store each character, but I don't think that is relevant here.
You can certainly convert a UTF-16 document down to ANSI or ASCII encoding, but any characters that are beyond their specification will be lost (probably converted to the 128th or 256th character, respectively, or some sort of null character).
For you, as a manager, there are some questions. At minimum:
Why does the client need this particular encoding? Can it be accommodated in some other way?
Are any characters in your data beyond the scope of ASCII/ANSI. Most (all?) programming languages provide a method to retrieve an integer representation of a character and determine if it is beyond the range of the desired encoding. This could be leveraged to discover how many instances exist of a character not compatible with the desired encoding.
I cannot understand some key elements of encoding:
Is ASCII only a character or it also has its encoding scheme algorithm ?
Does other windows code pages such as Latin1 have their own encoding algorithm ?
Are UTF7, 8, 16, 32 the only encoding algorithms ?
Does the UTF alghoritms are used only with the UNICODE set ?
Given the ASCII text: Hello World, if I want to convert it into Latin1 or BIG5, which encoding algorithms are being used in this process ? More specifically, does Latin1/Big5 use their own encoding alghoritm or I have to use a UTF alghoritm ?
1: Ascii is just an encoding — a really simple encoding. It's literally just the positive end of a signed byte (0...127) mapped to characters and control codes.
Refer to https://www.ascii.codes/ to see the full set and inspect the characters.
There are definitely encoding algorithms to convert ascii strings to and from strings in other encodings, but there is no compression/decompression algorithm required to write or read ascii strings like there is for utf8 or utf16, if that's what you're implying.
2: LATIN-1 is also not a compressed (usually called 'variable width') encoding, so there's no algorithm needed to get in and out of it.
See https://kb.iu.edu/d/aepu for a nice description of LATIN-1 conceptually and of each character in the set. Like a lot of encodings, its first 128 slots are just ascii. Like ascii, it's 1 byte in size, but it's an unsigned byte, so after the last ascii character (DEL/127), LATIN1 adds another 128 characters.
As with any conversion from one string encoding to another, there is an algorithm specifically tailored to that conversion.
3: Again, unicode encodings are just that — encodings. But they're all compressed except for utf32. So unless you're working with utf32 there is always a compression/decompression step required to write and read them.
Note: When working with utf32 strings there is one nonlinear oddity that has to be accounted for... combining characters. Technically that is yet another type of compression since they save space by not giving a codepoint to every possible combination of uncombined character and combining character. They "precombine" a few, but they would run out of slots very quickly if they did them all.
4: Yes. The compression/decompression algorithms for the compressed unicode encodings are just for those encodings. They would not work for any other encoding.
Think of it like zip/unzip. Unzipping anything other than a zipped file or folder would of course not work. That goes for things that are not compressed in the first place and also things that are compressed but using another compression algorithm (e.g.: rar).
I recently wrote the utf8 and utf16 compression/decompression code for a new cross-platform library being developed, and I can tell you quite confidently if you feed a Big5-encoded string into my method written specifically for decompressing utf8... not only would it not work, it might very well crash.
Re: your "Hello World" question... Refer to my answer to your second question about LATIN-1. No conversion is required to go from ascii to LATIN-1 because the first 128 characters (0...127) of LATIN-1 are ascii. If you're converting from LATIN-1 to ascii, the same is true for the lower half of LATIN-1, but if any of the characters beyond 127 are in the string, it would be what's called a "lossy"/partial conversion or an outright failure, depending on your tolerance level for lossiness. In your example, however, all of the characters in "Hello World" have the exact same values in both encodings, so it would convert perfectly, without loss, in either direction.
I know practically nothing about Big5, but regardless, don't use utf-x algos for other encodings. Each one of those is written very specifically for 1 particular encoding (or in the case of conversion: pair of encodings).
If you're curious about utf8/16 compression/decompression algorithms, the unicode website is where you should start (watch out though. they don't use the compression/decompression metaphor in their documentation):
http://unicode.org
You probably won't need anything else.
... except maybe a decent codepoint lookup tool: https://www.unicode.codes/
You can roll your own code based on the unicode documentation, or use the official unicode library:
http://site.icu-project.org/home
Hope this helps.
In general, most encoding schemes like ASCII or Latin-1 are simply big tables mapping characters to specific byte sequences. There may or may not be some specific algorithm how the creators came up with those specific character⟷byte associations, but there's generally not much more to it than that.
One of the innovations of Unicode specifically is the indirection of assigning each character a unique number first and foremost, and worrying about how to encode that number into bytes secondarily. There are a number of encoding schemes for how to do this, from the UCS and GB 18030 encodings to the most commonly used UTF-8/UTF-16 encodings. Some are largely defunct by now like UCS-2. Each one has their pros and cons in terms of space tradeoffs, ease of processing and transportability (e.g. UTF-7 for safe transport over 7-bit system like email). Unless otherwise noted, they can all encode the full set of current Unicode characters.
To convert from one encoding to another, you pretty much need to map bytes from one table to another. Meaning, if you look at the EBCDIC table and the Windows 1250 table, the characters 0xC1 and 0x41 respectively both seem to represent the same character "A", so when converting between the two encodings, you'd map those bytes as equivalent. Yes, that means there needs to be one such mapping between each possible encoding pair.
Since that is obviously rather laborious, modern converters virtually always go through Unicode as a middleman. This way each encoding only needs to be mapped to the Unicode table, and the conversion can be done with encoding A → Unicode code point → encoding B. In the end you just want to identify which characters look the same/mean the same, and change the byte representation accordingly.
A character encoding is a mapping from a sequence of characters to a sequence of bytes (in the past there were also encodings to a sequence of bits - they are falling out of fashion). Usually this mapping is one-to-one but not necessarily onto. This means there may be byte sequences that don't correspond to a character sequence in this encoding.
The domain of the mapping defines which characters can be encoded.
Now to your questions:
ASCII is both, it defines 128 characters (some of them are control codes) and how they are mapped to the byte values 0 to 127.
Each encoding may define its own set of characters and how they are mapped to bytes
no, there are others as well ASCII, ISO-8859-1, ...
Unicode uses a two step mapping: first the characters are mapped to (relatively) small integers called "code points", then these integers are mapped to a byte sequence. The first part is the same for all UTF encodings, the second step differs. Unicode has the ambition to contain all characters. This means, most characters are in the "UNICODE set".
Every character in the world has been assigned a unicode value [ numbered from 0 to ...]. It is actually an unique value. Now, it depends on an individual that how he wants to use that unicode value. He can even use it directly or can use some known encoding schemes like utf8, utf16 etc. Encoding schemes map that unicode value into some specific bit sequence [ can vary from 1 byte to 4 bytes or may be 8 in future if we get to know about all the languages of universe/aliens/multiverse ] so that it can be uniquely identified in the encoding scheme.
For example ASCII is an encoding scheme which only encodes 128 characters out of all characters. It uses one byte for every character which is equivalent to utf8 representation. GSM7 is one other format which uses 7 bit per character to encode 128 characters from unicode character list.
Utf8:
It uses 1 byte for characters whose unicode value is till 127.
Beyond this it has its own way of representing the unicode values.
Uses 2 byte for Cyrillic then 3 bytes for Hindi characters.
Utf16:
It uses 2 byte for characters whose unicode value is till 127.
and it also uses 2 byte for Cyrillic, Hindi characters.
All the utf encoding schemes fixes initial bits in specific pattern [ eg: 110|restbits] and rest bits [eg: initialbits|11001] takes the unicode value to make a unique representation.
Wikipedia on utf8, utf16, unicode will make it clear.
I coded an utf translator which converts incoming utf8 text across all languages into its equivalent utf16 text.
What is the best way to (losslessly) convert Unicode to a lower-order byte encoding (8 bits), in a language inspecific way? I want a format that is standard, i.e. has widespread library support for conversion both directions.
If I were using Python, I would use repr:
In [1]: x = u"Российская Федерация"
In [2]: repr(x)
Out[2]: "u'\\xd0\\xa0\\xd0\\xbe\\xd1\\x81\\xd1\\x81\\xd0\\xb8\\xd0\\xb9\\xd1\\x81\\xd0\\xba\\xd0\\xb0\\xd1\\x8f \\xd0\\xa4\\xd0\\xb5\\xd0\\xb4\\xd0\\xb5\\xd1\\x80\\xd0\\xb0\\xd1\\x86\\xd0\\xb8\\xd1\\x8f'"
However, I'm looking for a format that has good library support for converting the second string back to the first, in a variety of languages.
Out[2]: "u'\xd0\xa0\xd0\xbe\xd1\x81\xd1\x81\xd0\xb8\xd0\xb9\xd1\x81\xd0\xba\xd0\xb0\xd1\x8f \xd0\xa4\xd0\xb5\xd0\xb4\xd0\xb5\xd1\x80\xd0\xb0\xd1\x86\xd0\xb8\xd1\x8f'"
If that's what you see, your terminal is set up wrong, it's treating UTF-8 input as being ISO-8859-1 (or cp1252 in the case of the Windows console, which isn't possible to set up right).
The proper Python repr of Российская Федерация would be the Unicode literal:
u'\u0420\u043e\u0441\u0441\u0438\u0439\u0441\u043a\u0430\u044f \u0424\u0435\u0434\u0435\u0440\u0430\u0446\u0438\u044f'
Which as it happens is pretty close to the JavaScript/JSON string literal
"\u0420\u043e\u0441\u0441\u0438\u0439\u0441\u043a\u0430\u044f \u0424\u0435\u0434\u0435\u0440\u0430\u0446\u0438\u044f"
If you want a 7-bit-safe (ASCII) representation of a Unicode string, JSON is a reasonable choice of format. Get it by using json.dumps() though rather than hacking the Python repr, since there are some subtle inconsistencies between the two formats.
Other well-understood ASCII representations you could try might include URL-encoding (%D0%A0%D0%BE...) and XML character escapes (<value>Рос...</value>).
If you only need an arbitrary binary representation that doesn't need to be 7-bit safe, as Max mentioned, just .encode('utf-8').
UTF-8, UTF-16 and UTF-32 are all standard. Perhaps UTF-8 is most common on the Internet; UTF-16 is used internally by Windows and Java. Any language with Unicode support will have encoding and decoding functions for all of them. In Python you can use the .encode method of unicode strings and .decode method of string to convert between them.
If you need something that's 7-bit clean (no 8th bits set), there's also UTF-7.
This question already has answers here:
What is the difference between UTF-8 and Unicode?
(18 answers)
Closed 6 years ago.
Consider:
Is it true that unicode=utf16?
Many are saying Unicode is a standard, not an encoding, but most editors support save as Unicode encoding actually.
As Rasmus states in his article "The difference between UTF-8 and Unicode?":
If asked the question, "What is the difference between UTF-8 and
Unicode?", would you confidently reply with a short and precise
answer? In these days of internationalization all developers should be
able to do that. I suspect many of us do not understand these concepts
as well as we should. If you feel you belong to this group, you should
read this ultra short introduction to character sets and encodings.
Actually, comparing UTF-8 and Unicode is like comparing apples and
oranges:
UTF-8 is an encoding - Unicode is a character
set
A character set is a list of characters with unique numbers (these
numbers are sometimes referred to as "code points"). For example, in
the Unicode character set, the number for A is 41.
An encoding on the other hand, is an algorithm that translates a
list of numbers to binary so it can be stored on disk. For example
UTF-8 would translate the number sequence 1, 2, 3, 4 like this:
00000001 00000010 00000011 00000100
Our data is now translated into binary and can now be saved to
disk.
All together now
Say an application reads the following from the disk:
1101000 1100101 1101100 1101100 1101111
The app knows this data represent a Unicode string encoded with
UTF-8 and must show this as text to the user. First step, is to
convert the binary data to numbers. The app uses the UTF-8 algorithm
to decode the data. In this case, the decoder returns this:
104 101 108 108 111
Since the app knows this is a Unicode string, it can assume each
number represents a character. We use the Unicode character set to
translate each number to a corresponding character. The resulting
string is "hello".
Conclusion
So when somebody asks you "What is the difference between UTF-8 and
Unicode?", you can now confidently answer short and precise:
UTF-8 (Unicode Transformation Format) and Unicode cannot be compared. UTF-8 is an encoding
used to translate numbers into binary data. Unicode is a character set
used to translate characters into numbers.
most editors support save as ‘Unicode’ encoding actually.
This is an unfortunate misnaming perpetrated by Windows.
Because Windows uses UTF-16LE encoding internally as the memory storage format for Unicode strings, it considers this to be the natural encoding of Unicode text. In the Windows world, there are ANSI strings (the system codepage on the current machine, subject to total unportability) and there are Unicode strings (stored internally as UTF-16LE).
This was all devised in the early days of Unicode, before we realised that UCS-2 wasn't enough, and before UTF-8 was invented. This is why Windows's support for UTF-8 is all-round poor.
This misguided naming scheme became part of the user interface. A text editor that uses Windows's encoding support to provide a range of encodings will automatically and inappropriately describe UTF-16LE as “Unicode”, and UTF-16BE, if provided, as “Unicode big-endian”.
(Other editors that do encodings themselves, like Notepad++, don't have this problem.)
If it makes you feel any better about it, ‘ANSI’ strings aren't based on any ANSI standard, either.
It's not that simple.
UTF-16 is a 16-bit, variable-width encoding. Simply calling something "Unicode" is ambiguous, since "Unicode" refers to an entire set of standards for character encoding. Unicode is not an encoding!
http://en.wikipedia.org/wiki/Unicode#Unicode_Transformation_Format_and_Universal_Character_Set
and of course, the obligatory Joel On Software - The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) link.
There's a lot of misunderstanding being displayed here. Unicode isn't an encoding, but the Unicode standard is devoted primarily to encoding anyway.
ISO 10646 is the international character set you (probably) care about. It defines a mapping between a set of named characters (e.g., "Latin Capital Letter A" or "Greek small letter alpha") and a set of code points (a number assigned to each -- for example, 61 hexadecimal and 3B1 hexadecimal for those two respectively; for Unicode code points, the standard notation would be U+0061 and U+03B1).
At one time, Unicode defined its own character set, more or less as a competitor to ISO 10646. That was a 16-bit character set, but it was not UTF-16; it was known as UCS-2. It included a rather controversial technique to try to keep the number of necessary characters to a minimum (Han Unification -- basically treating Chinese, Japanese and Korean characters that were quite a bit alike as being the same character).
Since then, the Unicode consortium has tacitly admitted that that wasn't going to work, and now concentrate primarily on ways to encode the ISO 10646 character set. The primary methods are UTF-8, UTF-16 and UCS-4 (aka UTF-32). Those (except for UTF-8) also have LE (little endian) and BE (big-endian) variants.
By itself, "Unicode" could refer to almost any of the above (though we can probably eliminate the others that it shows explicitly, such as UTF-8). Unqualified use of "Unicode" probably happens the most often on Windows, where it will almost certainly refer to UTF-16. Early versions of Windows NT adopted Unicode when UCS-2 was current. After UCS-2 was declared obsolete (around Win2k, if memory serves), they switched to UTF-16, which is the most similar to UCS-2 (in fact, it's identical for characters in the "basic multilingual plane", which covers a lot, including all the characters for most Western European languages).
UTF-16 and UTF-8 are both encodings of Unicode. They are both Unicode; one is not more Unicode than the other.
Don't let an unfortunate historical artifact from Microsoft confuse you.
The development of Unicode was aimed
at creating a new standard for mapping
the characters in a great majority of
languages that are being used today,
along with other characters that are
not that essential but might be
necessary for creating the text. UTF-8
is only one of the many ways that you
can encode the files because there are
many ways you can encode the
characters inside a file into Unicode.
Source:
http://www.differencebetween.net/technology/difference-between-unicode-and-utf-8/
In addition to Trufa's comment, Unicode explicitly isn't UTF-16. When they were first looking into Unicode, it was speculated that a 16-bit integer might be enough to store any code, but in practice that turned out not to be the case. However, UTF-16 is another valid encoding of Unicode - alongside the 8-bit and 32-bit variants - and I believe is the encoding that Microsoft use in memory at runtime on the NT-derived operating systems.
Let's start from keeping in mind that data is stored as bytes; Unicode is a character set where characters are mapped to code points (unique integers), and we need something to translate these code points data into bytes. That's where UTF-8 comes in so called encoding – simple!
It's weird. Unicode is a standard, not an encoding. As it is possible to specify the endianness I guess it's effectively UTF-16 or maybe 32.
Where does this menu provide from?
How does the Chinese GB18030 code set differ from Unicode?
What special techniques are required for handling GB18030?
Are there any (open source) libraries for handling GB18030?
As per the Wikipedia article on GB18030, "GB18030 can be be considered a Unicode Transformation Format (i.e. an encoding of all Unicode code points) that maintains compatibility with a legacy character set." That is, all Unicode characters can be encoded in GB18030, but they will be encoded with different byte sequences than would be generated with UTF-8 or UTF-16. Handling the GB18030 encoding doesn't require any more special techniques than are required for any other non-Unicode encoding.
The ICU project is an open source library (for C or Java) that has full support for many different encodings, including GB18030. Information on converting between different encodings with ICU can be found here.
What special techniques are required for handling GB18030?
The biggest thing to be aware of is that, unlike UTF-8, GB18030 allows ASCII bytes to occur within the encoding of a multi-byte character. (For example, 'ß' is encoded as the bytes 81 30 89 38, which contains the ASCII encoding of '0' and '8'.) This means that you can't use a simple byte-oriented find/index function.