Corrupted Japanese words become Chinese words [closed] - unicode

I have a problem where a song's details become gibberish Chinese text. For example:
トランスルーセント becomes 僩儔儞僗儖乕僙儞僩
This usually happens to downloaded songs. I analyzed the Unicode code points and they seem to differ by around 8k. What is changing them? My friend downloaded the same file with no problem.

The sequence of bytes:
83 67 83 89 83 93 83 58 83 8b 81 5b 83 5a 83 93 83 67
Can be interpreted using the Shift-JIS encoding (on Windows, code page 932) as “トランスルーセント”, or using the GB encoding (on Windows, code page 936) as “僩儔儞僗儖乕僙儞僩”. If a Windows machine encounters a series of bytes like that without any signalling to tell it which encoding is in use, it will choose its “default code page”, which depends on the setting in the Control Panel Regional Options “Language for non-Unicode applications” field. If set to Japanese you see “トランスルーセント”, if Chinese you get “僩儔儞僗儖乕僙儞僩”, if Western European you get “ƒgƒ‰ƒ“ƒXƒ‹�[ƒZƒ“ƒg” (classic mojibake).
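For example, a minimal Python sketch (not part of the original answer) shows how the same bytes decode differently under each code page mentioned above:

    # The byte sequence from the question, decoded three ways.
    data = bytes.fromhex("83 67 83 89 83 93 83 58 83 8b 81 5b 83 5a 83 93 83 67")

    print(data.decode("cp932"))                     # Shift-JIS (Japanese) -> トランスルーセント
    print(data.decode("cp936"))                     # GBK (Chinese)        -> 僩儔儞僗儖乕僙儞僩
    print(data.decode("cp1252", errors="replace"))  # Western European     -> the classic mojibake shown above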
To avoid this happening as an application author, you should use Unicode strings and Unicode-safe encodings like UTF-8 and UTF-16 to store data. To avoid this happening as an end-user, you should use applications and formats that support Unicode. If you are downloading a random MP3, of course, you don't get much say in what the application that encoded it did, and you will have to put up with it.
It's not clear what exact sequence of events you are describing in your question and what you are comparing that differs. If you are comparing MP3 files be aware that some highly antisocial media player applications decide to write to the ID3 tags when they play a file, which may change it in arbitrary ways.

BOCU-1 for internal encoding of strings [closed]

Some languages/platforms, like Java, JavaScript, Windows, .NET, and KDE, use UTF-16. Others prefer UTF-8.
What is the reason that no language/platform uses BOCU-1? What is the rationale for JEP 254 and its .NET equivalent?
Is the reason that BOCU-1 is patented? Are there any technical reasons as well?
Edit
My question is not about Java specifically. By JEP 254, I mean compact UTF-16 as described in that proposal. My question is: since BOCU-1 is compact for almost any Unicode string, why doesn't any language/platform use it internally instead of UTF-16 or UTF-8? Such usage would improve cache performance for any string, not just ASCII or Latin-1 ones.
It might also help with non-Latin programming-language support in formats like the Language Server Index Format (LSIF).
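To make the size argument concrete, here is a minimal sketch (Python; not part of the original post, and BOCU-1 itself has no codec in the standard library, so only UTF-8 and UTF-16 are compared):

    samples = {
        "Latin": "Hello, world",
        "Japanese": "トランスルーセント",
        "Devanagari": "नमस्ते दुनिया",
    }
    for label, text in samples.items():
        print(label,
              "chars:", len(text),
              "UTF-8 bytes:", len(text.encode("utf-8")),
              "UTF-16 bytes:", len(text.encode("utf-16-le")))

UTF-8 is compact for Latin text but costs three bytes per character for most Indic and CJK scripts, which is the asymmetry the question points at.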
What is the reason that no language/platform uses BOCU-1?
That question is far too broad in scope for Stack Overflow, and a concise answer is impossible.
However, in the specific case of Java note that someone raised the possibility of Java adopting BOCU-1 as an RFE (Request For Enhancement) in 2002. See JDK-4787935 (str) Reducing the memory footprint for Strings.
That bug was closed with a resolution of "Won't Fix" ten years later:
"Although this is a very interesting proposal, it is highly unlikely that BOCU or any other multi-byte encoding for internal use would be adopted. Furthermore, this comes down to a space-time tradeoff with unclear long-term consequences. Given the length of time this proposal has lingered, it seems appropriate to close it as will not fix".
What is the rationale for JEP 254...?
There is a section of JEP 254 titled "Motivation" which explains that, and in particular it states "most String objects contain only Latin-1 characters". However, if that does not satisfy you, raise a separate question.
Ensure that it is on topic for Stack Overflow by reviewing What topics can I ask about here? first. Two of the people who reviewed JEP 254 (Aleksey Shipilev and Brian Goetz) respond here on SO, so you may get an authoritative answer.
What is the rationale for ... JEP 254 equivalent for Dotnet?
Again, raise this as a separate SO question.
Is the reason that BOCU-1 is patented?
That question is specifically off topic here: "Legal questions, including questions about copyright or licensing, are off-topic for Stack Overflow", though Wikipedia notes "BOCU-1 is the only Unicode compression scheme described on the Unicode Web site that is known to be encumbered with intellectual property restrictions".
Are there any technical reasons also?
A very important non-technical reason is that the HTML5 specification explicitly forbids the use of BOCU-1!...
Avoid these encodings
The HTML5 specification calls out a number of encodings that you should avoid...
Documents must also not use CESU-8, UTF-7, BOCU-1, or SCSU encodings, since they... were never intended for Web content and the HTML5 specification forbids browsers from recognising them.
Of course that invites the question of why HTML 5 forbids the use of BOCU-1, and the only technical reason I can find for that is that this Mozilla documentation on HTML's <meta> element states:
Authors must not use CESU-8, UTF-7, BOCU-1 and/or SCSU as cross-site scripting attacks with these encodings have been demonstrated.
See this GitHub link for more details on the XSS vulnerability with BOCU-1.
Also note that, in line with the HTML5 specification, all the major browsers specifically do not support BOCU-1.

What is the machine encoding of 4 -- Is it 011 0100 (ASCII) or 0100 (binary)? [closed]

I have a file "num.txt" which contains only the number 4.
With xxd num.txt, I found that the number is encoded as its ASCII code, 0x34, that is 011 0100. Why is the number not simply encoded as its binary form, 0100?
[Edit] My question is really: why is 4 encoded in ASCII rather than in its binary form?
What you have is the character '4', which is code point 0x34 in ASCII (and Unicode, for that matter).
In ASCII, code point 4 is EOT (end of transmission), commonly entered as CTRL-D. See, for example, any ASCII table.
As to your edit:
Why is 4 encoded in ASCII rather than in its binary form?
The answer to that is that it's a text file. Whatever has created it has decided it wants to store the values as textual rather than binary information. It's really that simple :-)
If you want to go deeper into that particular question, you're going to have to ask the person who developed the software that creates the file, I'm afraid.
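A minimal Python sketch (not from the original answers) makes the text-versus-binary distinction concrete:

    text = "4"
    print(text.encode("ascii"))   # b'4'    -> one byte, 0x34: the ASCII digit '4'
    print(bytes([4]))             # b'\x04' -> one byte, 0x04: the raw value 4 (ASCII EOT)
    print(int(text) + 1)          # 5       -> parsing the text back into a number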
011 0100 isn't 34. It's 0x34. 0x34 is the ASCII encoding of the digit '4'.

What is Unicode, and how does encoding work? [closed]

A few hours ago I was reading a C programming book. While reading it I came across the terms character encoding and Unicode, so I started googling for information about Unicode. I learned that the Unicode character set has characters from every language, and that UTF-8, UTF-16 and UTF-32 can encode the characters listed in the Unicode character set.
But I was not able to understand how it works.
Does Unicode depend on the operating system?
How is it related to software and programs?
Is UTF-8 a piece of software that was installed on my computer along with the operating system?
Or is it related to hardware?
And how does a computer encode things?
I find it very confusing. Please answer in detail.
I am new to these things, so please keep that in mind when you answer.
Thank you.
I have written about this extensively in What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text. Here are some highlights:
encodings are plentiful; an encoding defines how a "character" like "A" can be represented as bits and bytes
most encodings only specify this for a small number of selected characters, for example all (or at least most) characters needed to write English or Czech; single-byte encodings typically support a set of up to 256 characters
Unicode is one large standard effort which has catalogued and specified a number ⟷ character relationship for virtually all characters and symbols of every major language in use, which is well over a hundred thousand characters
UTF-8, UTF-16 and UTF-32 are different encoding schemes specifying how to encode this ginormous catalogue of numbers to bytes, each with different size tradeoffs
software needs to specifically support Unicode and its UTF-* encodings, just like it needs to support any other kind of specialized encoding; most of the work is done by the OS these days which exposes supporting functions to an application
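To illustrate those size tradeoffs, here is a minimal Python sketch (not part of the original answer) encoding the same short string with all three UTF schemes:

    s = "Aä€😀"  # one character each from ASCII, Latin-1, the BMP, and beyond the BMP
    for name in ("utf-8", "utf-16-le", "utf-32-le"):
        encoded = s.encode(name)
        print(f"{name}: {len(encoded)} bytes -> {encoded.hex(' ')}")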

Unicode file in Notepad [closed]

What does it mean when I save a text file as "Unicode" in Notepad? Is it UTF-8, UTF-16 or UTF-32? Thanks in advance.
In Notepad, as in Windows software in general, “Unicode” as an encoding name means UTF-16 Little Endian (UTF-16LE). (I first thought it’s not real UTF-16, because Notepad++ recognizes it as UCS-2 and shows the content as garbage, but re-checking with BabelPad, I concluded that Notepad can encode even non-BMP characters correctly.)
Similarly, “Unicode big endian” means UTF-16 Big Endian. And “ANSI” means the system’s native legacy encoding, e.g. the 8-bit windows-1252 encoding in Western versions of Windows.
All of these formats are "Unicode". But many editors on Mac and Windows default to UTF-8, because it is ASCII-compatible for code points below 128. UTF-8 can represent far more than the 256 codes that fit in a single 8-bit byte by using multi-byte sequences: a lead byte signals how many continuation bytes follow and belong to the same character.
If you look at the output in a terminal, say with vi, and you see what appears to be a gap (actually a zero byte) between every two characters, you are looking at UTF-16, where basic Latin characters take two bytes each, one of which is zero. If the characters run together without such gaps, that is an indication of UTF-8.
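Notepad writes a byte-order mark (BOM) at the start of the file for each of these choices, so a minimal sketch (Python; the helper name guess_encoding and the file name example.txt are hypothetical) can tell them apart:

    BOMS = [
        (b"\xff\xfe\x00\x00", "UTF-32 LE"),
        (b"\x00\x00\xfe\xff", "UTF-32 BE"),
        (b"\xff\xfe",         "UTF-16 LE (Notepad's 'Unicode')"),
        (b"\xfe\xff",         "UTF-16 BE (Notepad's 'Unicode big endian')"),
        (b"\xef\xbb\xbf",     "UTF-8 with BOM"),
    ]

    def guess_encoding(path):
        with open(path, "rb") as f:
            head = f.read(4)
        for bom, name in BOMS:          # longer BOMs are checked first
            if head.startswith(bom):
                return name
        return "no BOM (ANSI, or UTF-8 without BOM)"

    print(guess_encoding("example.txt"))  # hypothetical file name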

Does the Unicode Consortium intend to make UTF-16 run out of characters? [closed]

The current version of UTF-16 is only capable of encoding 1,112,064 different numbers (code points): 0x0-0x10FFFF.
Does the Unicode Consortium intend to make UTF-16 run out of characters?
That is, will it ever assign a code point > 0x10FFFF?
If not, why would anyone write the code for a UTF-8 parser to accept 5- or 6-byte sequences? It would add unnecessary instructions to their function.
Isn't 1,112,064 enough? Do we actually need MORE characters? I mean: how quickly are we running out?
As of 2011 we have assigned 109,449 characters and set aside 137,468 code points for private use (6,400 + 131,068),
leaving room for over 860,000 unused code points; plenty for CJK Extension E (~10,000 characters) and 85 more sets just like it, so that in the event of contact with the Ferengi culture, we should be ready.
In November 2003 the IETF restricted UTF-8 to end at U+10FFFF with RFC 3629, in order to match the constraints of the UTF-16 character encoding: a UTF-8 parser should not accept 5- or 6-byte sequences that would overflow the UTF-16 set, nor 4-byte sequences that decode to anything greater than 0x10FFFF.
Please add edits here listing character sets that threaten the size of the Unicode code-point limit, if they are over 1/3 the size of CJK Extension E (~10,000 characters):
CJK Extension E (~10,000 characters)
Ferengi culture characters (~5,000 characters)
At present, the Unicode standard doesn't define any characters above U+10FFFF, so you would be fine to code your app to reject characters above that point.
Predicting the future is hard, but I think you're safe for the near term with this strategy. Honestly, even if Unicode extends past U+10FFFF in the distant future, it almost certainly won't be for mission critical glyphs. Your app might not be compatible with the new Ferengi fonts that come out in 2063, but you can always fix it when it actually becomes an issue.
Cutting to the chase:
It is indeed intentional that the encoding system only supports code points up to U+10FFFF
It does not appear that there is any real risk of running out any time soon.
There is no reason to write a UTF-8 parser that supports 5- or 6-byte sequences, except to support legacy systems that actually used them. The current official UTF-8 specification does not allow 5- or 6-byte sequences, in order to accommodate 100% lossless conversions to and from UTF-16. If there is ever a time that Unicode has to support new code points above U+10FFFF, there will be plenty of time to devise new encoding formats for the higher bit counts. Or maybe by the time that happens, memory and computational power will be sufficient that everyone will just switch to UTF-32 for everything; a fixed 32-bit code unit could address up to 0xFFFFFFFF, over 4 billion code points.
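As a concrete check, a minimal Python sketch (not part of either answer) shows how a modern, RFC 3629-conformant implementation enforces the U+10FFFF ceiling:

    # The last valid code point encodes to four bytes...
    print(chr(0x10FFFF).encode("utf-8").hex(" "))   # f4 8f bf bf

    # ...anything beyond it is rejected outright...
    try:
        chr(0x110000)
    except ValueError as e:
        print("rejected:", e)

    # ...and so are the legacy 5-byte sequences from the original UTF-8 design.
    try:
        b"\xf8\x88\x80\x80\x80".decode("utf-8")
    except UnicodeDecodeError as e:
        print("rejected:", e.reason)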