Closed. This question is off-topic. It is not currently accepting answers.
Want to improve this question? Update the question so it's on-topic for Stack Overflow.
Closed 9 years ago.
Improve this question
We've two encodings available for Chinese characters, GB18030 and Big5 for Chinese Simplified and Chinese Traditional respectively.
How many byte(s)/octet(s) a single Chinese character would take in each encoding?
Going by Wikipedia:
GB_18030 - Guójiā Biāozhǔn (国家标准) is a 4 octets(bytes) encoding scheme. Hence, every character should take 4 octets. Same is said on GB18030 - New Chinese Encoding Standard
Big-5 or Big5 is a 2 octets(bytes) encoding scheme. Here every character takes 2 octets.
Related
Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 2 years ago.
Improve this question
I know it exist ISO-8859-9/Latin-5 or ISO-8859-15/Latin-9, but recently I had to manage some messages encoded with ISO-8859-9/Latin-9 format.
What does it exactly mean?
There is ISO-8859-9 which is called Latin-5.
And there is ISO-8859-15 which is called Latin-9.
Yes, it is confusing. In my opinion it's simplest to always only use the ISO-8859-n moniker. That avoids potential confusions.
So "ISO-8859-9/Latin-9" is probably a typo (or someone wrongly thought that the suffix is identical for the "ISO-8859-" and the "Latin-" prefix).
Depending on the source of the data, you can guess which one they meant. ISO-8859-9 is used for Turkish text and ISO-8859-15 is basically the modern replacement for ISO-8859-1 (covering most of Western Europe, mostly used because it has the € symbol).
Source: ISO/IEC 8859 Wiki page.
Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 9 years ago.
Improve this question
I have this character.
–
How to convert this character to unicode?
Sorry if it is a silly question.
It's not a silly question, character encoding can be tricky to get your head around. I highly recommend reading The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) (I'm sure you can guess the topic).
Unicode itself isn't an encoding, it's a very long list of characters and code points. What I'm guessing you want to do is display the dash character in some way. Where are you wanting to display or store the data? If it's in a browser, then that representation should work as that's the HTML encoded version. If you want to store it in a database then you'll need to convert that encoded version to a string and then convert that string to whatever encoding the database is using.
Take a look at this source has the encoding in different formats
http://www.fileformat.info/info/unicode/char/2013/index.htm
but each language has its own rules on how to write this in a string/char literal
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about a specific programming problem, a software algorithm, or software tools primarily used by programmers. If you believe the question would be on-topic on another Stack Exchange site, you can leave a comment to explain where the question may be able to be answered.
Closed 8 years ago.
Improve this question
Utf-8 is " is a variable-width encoding that can represent every character in the Unicode character set" (wikipedia), unicode is "standard for the consistent encoding, representation and handling of text" (wikipedia). They're difference things. Why does windows notepad give possibility to save document in unicode and utf-8? How can I compare two difference things?
To simplify,
Unicode says what number should represent each character.
UTF-8 says how to arange the bits to form different strings of unicode values.
According to this thread, what Unicode means in notepad is UTF-16 Little Endian (UTF-16LE) which is another way arranging the bits in order to form strings of Unicode values.
Closed. This question is off-topic. It is not currently accepting answers.
Want to improve this question? Update the question so it's on-topic for Stack Overflow.
Closed 10 years ago.
Improve this question
What does it mean when I save a text file as "Unicode" in notepad? is it Utf-8, Utf-16 or Utf-32? Thanks in advance.
In Notepad, as in Windows software in general, “Unicode” as an encoding name means UTF-16 Little Endian (UTF-16LE). (I first thought it’s not real UTF-16, because Notepad++ recognizes it as UCS-2 and shows the content as garbage, but re-checking with BabelPad, I concluded that Notepad can encode even non-BMP characters correctly.)
Similarly, “Unicode big endian” means UTF-16 Big Endian. And “ANSI” means the system’s native legacy encoding, e.g. the 8-bit windows-1252 encoding in Western versions of Windows.
All of these formats are "Unicode". But usually editors on Mac and Windows mean UTF-8 with that because it is ASCII compatible below code 128 IIRC. UTF-8 can represent more codes than just 256 (which fits in a single byte of 8 bits) by using a special character which means that the following byte also belongs to the same character.
If you look at the output in terminal, say with vi, and if you see a space between every two characters then you are looking at UTF-16 because there every two bytes make up one character. What you should see is that the characters don't have spaces between them, that's an indication for UTF-8.
Closed. This question is off-topic. It is not currently accepting answers.
Want to improve this question? Update the question so it's on-topic for Stack Overflow.
Closed 11 years ago.
Improve this question
What languages does UTF-8 support?
And how many languages does the UTF-8 support?
See the page Supported Scripts on unicode.org. UTF-8 supports all Unicode characters.
Note that Unicode defines character encodings, not languages.
The Unicode Standard encodes scripts rather than languages per se. ...
UTF-8 is suppose to represent any Unicode character.
http://en.wikipedia.org/wiki/UTF-8