This single character:
ਠਂ
gives two characters when copied and pasted into Sublime Text 2 (on Windows 7), in a document saved as UTF-8. Why?
I can imagine it's an encoding problem, but which one?
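A quick way to see what was actually pasted is to print each code point; a minimal Python sketch (the names shown in the comments assume the character survived copy/paste intact):

import unicodedata

s = "ਠਂ"  # the "single character" from the question
for ch in s:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# Expected output, assuming the text survived copy/paste intact:
# U+0A20 GURMUKHI LETTER TTHA
# U+0A02 GURMUKHI SIGN BINDI

The second code point is a combining mark, which is why the pair renders as a single glyph but counts as two characters.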
Related
I found that the letters adjacent to the Japanese tilde render small if they are Chinese characters, and large if they are kana. I'd like to ask: is this a feature of the language, or a bug in VS Code? See below:
~つ
~着
~羽
~番
~足
~度
~キロ(メートル)
Using the search function I can confirm it is the same character, so I'm confused.
Moreover, my file's encoding is UTF-8, so there should be no strange character-set errors.
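To double-check what the search suggests, here is a minimal Python sketch that prints the code point of the leading tilde on each line (the strings below are retyped from the examples above; paste the real lines from the file to be sure):

lines = ["~つ", "~着", "~羽", "~番", "~足", "~度", "~キロ(メートル)"]
for line in lines:
    print(f"U+{ord(line[0]):04X}", line)
# If every line prints the same code point for the tilde, the size difference
# comes from font fallback during rendering, not from the text itself.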
I was looking into Matt Parker's problem of finding five 5-letter words that have 25 distinct letters in total (link if you are interested, not really relevant to the issue), but I wanted to do it using French words (and so I had to use the French alphabet).
So I downloaded the following file from GitHub, containing one French word per line, encoded in UTF-8, with the accents. But something I had never seen before occurred: there seem to be two ways (maybe more) to encode accents in UTF-8.
When I opened the file in VSCode (the same holds in Windows' Notepad), every accent displayed as it should, but:
when I select a letter with an accent, it shows that 2 letters are selected
when I write a letter with an accent using a french keyboard, and then select it, it shows that only 1 letter is selected (so the two appear to be distinct, though they display exactly identically)
I then tried the following code in Python:
# both strings typed from the keyboard, prints True
print("é" == "é")
# first string copied from the downloaded text file, second typed from the keyboard, prints False
print("é" == "é")
It displays identically, but is not identical!?
I then tried to change the encoding of the file, but it either changed nothing or removed the accents: the word "abaissé" might, for example, show as "abaisse�".
After testing, it seems the file encodes an accented letter as a separate accent character adjacent to the letter it modifies. But when I then read the file character by character (I did it in Rust, for example, but I'm pretty sure it's the same in other languages), they come out as two characters: I first get the accent and then the letter. And Rust chars are by definition valid Unicode scalar values, so if they were a single Unicode char, Rust would read them as a single Unicode char. That's not the case.
I'm honestly stuck, so if you know, I would be really grateful for your help.
Why do these combinations of characters display as 1 character when they should display as 2 distinct ones?
I know this is not a bug in UTF-8, but it makes handling the data very hard. So how do I recombine them, since the accent shouldn't be decoupled from the letter in the first place?
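What the file contains are decomposed sequences: a base letter followed by a separate combining accent. Recombining them into single precomposed code points is what Unicode NFC normalization does; a minimal Python sketch:

import unicodedata

s = "e\u0301"                     # 'é' stored as two code points: 'e' + combining acute accent
print(len(s))                     # 2

nfc = unicodedata.normalize("NFC", s)
print(len(nfc), nfc == "\u00e9")  # 1 True -- recombined into the precomposed U+00E9

Normalizing both the file contents and the keyboard input to the same form (NFC or NFD) before comparing makes equality checks like the ones above behave as expected.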
This may sound like a stupid question. I typed some Chinese characters into an empty text file in the VS Code text editor (default UTF-8). Then I saved the file in an encoding for Japanese, Shift JIS, which apparently doesn't cover all the characters I had typed in.
However, before I closed the file, all the Chinese characters were displayed properly in VS Code. Now, after closing the file and reopening it using the Shift JIS encoding, several characters are displayed as a question mark (?). I guess these are the Chinese characters not covered by the Japanese encoding?
What happened in the process? Is there any way I can 'get back' the Chinese characters that are now shown as ?? I don't really understand how encoding works in this scenario...
Not all encodings cover all characters. (Unicode encodings, in principle, do, but even they don't have quite everything yet.) If you save some text in an encoding which does not include all characters in that text, something has to give.
Options:
you get an error message,
nothing saves at all,
the characters which cannot be included are silently dropped,
the characters which cannot be included are converted to some other character (such as the question mark).
Once that conversion is done, the data is lost, and cannot be recovered. Why not use UTF-8 or another Unicode encoding? (GB 18030 might be the best for large amounts of Chinese text.)
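The question-mark conversion is what happened here, and it is easy to reproduce. A minimal Python sketch, using as an example a simplified Chinese character (们) that, as far as I know, Shift JIS cannot represent:

text = "日本们"                   # 们 is a simplified character with no Shift JIS mapping
data = text.encode("shift_jis", errors="replace")
print(data)                       # b'\x93\xfa\x96{?' -- the unencodable 们 became a literal '?'
print(data.decode("shift_jis"))   # 日本? -- the original character is gone and cannot be recovered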
When I save a text document in UTF-8, that's basically saying: Computer, use the codepage for UTF-8 that's installed somewhere on your computer to figure out how to turn the 1s and 0s into characters, right?
When I save this content:
激光
äüß
#§
in ISO-8859-1, it becomes this (on Linux, using the Kate editor):
æ¿å
äüÃ
#§
What is not shown here is that in the first and second rows some weird squares are displayed instead of characters (they can be seen in the developer tools).
So my understanding is that this means the combination of 0s and 1s that represents 激 in UTF-8 is mapped to æ in ISO-8859-1, right? And the weird squares appear because there is no mapping for those byte values in the ISO-8859-1 character set, so the computer falls back to something else.
Is that correct?
Yes, sort of correct.
If you store a file as UTF-8, it sometimes gets a special byte combination at the beginning of the file (a byte order mark, or BOM) that indicates its encoding. I think Kate (I don't know this editor) doesn't recognize this and just displays the file as something else. So basically, your file is still correct, but was just visualized in the wrong way.
The weird squares are another indicator that Kate doesn't recognize those leading bytes, because editors usually hide them from the user and just use the information to display the file correctly.
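For reference, the special byte combination being described is the optional UTF-8 byte order mark (BOM), the three bytes EF BB BF; a quick Python illustration:

data = "\ufefftext".encode("utf-8")   # a leading U+FEFF becomes the UTF-8 BOM
print(data)                           # b'\xef\xbb\xbftext'
print(data.decode("utf-8-sig"))       # 'text' -- the utf-8-sig codec strips the BOM

Note, though, that the BOM is optional in UTF-8, and many editors, particularly on Linux, don't write one at all.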
You have it pretty much right. The character U+6FC0 (激) for example is encoded with 3 bytes in UTF-8: 0xE6 0xBF 0x80.
If you interpret these bytes in ISO-8859-1, you get the characters æ¿. Depending on the version of ISO-8859-1, 0x80 is either not mapped to a character at all, or is mapped to a non-printable control character; that's why you can see only two characters for the three bytes.
If you use Windows-1252 instead of ISO-8859-1 you'll see æ¿€.
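This round trip can be reproduced directly; a minimal Python sketch:

raw = "激".encode("utf-8")
print(raw)                    # b'\xe6\xbf\x80'
s1 = raw.decode("latin-1")
print(s1, len(s1))            # æ¿ plus an invisible U+0080 control character, length 3
print(raw.decode("cp1252"))   # æ¿€ -- Windows-1252 maps byte 0x80 to the euro sign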
I use UTF-8 as the default encoding for newly created files in both Notepad++ and Sublime Text 2.
Create a new file in Notepad++ containing only ASCII characters, save it, and close it.
Reopen it in Notepad++ and check the 'Encoding' menu: it says 'Encode in ANSI'. Then I add some non-ASCII characters (e.g. Chinese) to the file and save it; it's still in ANSI encoding but displayed correctly (also correct in the default Windows Notepad). But when I open the file with Sublime Text 2, garbled text appears.
When I do the same thing in Sublime Text 2, the file is converted to UTF-8 automatically as soon as non-ASCII characters are entered.
So why do Notepad++ and Sublime Text 2 behave differently? Why can Notepad++ display non-ASCII characters in ANSI encoding correctly?
ANSI is not an encoding and is a very ambiguous term. It usually means Windows-1252 or the active OS code page, which for you is probably "ANSI/OEM Simplified Chinese (PRC, Singapore); Chinese Simplified (GB2312)".
Sublime Text 2 cannot detect encodings other than UTF-8, UTF-16, and ASCII. The default fallback encoding in this case is Windows-1252, not the active system code page.
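The mismatch is reproducible in Python. A sketch, assuming the active code page is the Simplified Chinese one (GBK/GB2312) mentioned above:

text = "中文"
ansi = text.encode("gbk")        # what Notepad++ saves as "ANSI" on a Simplified Chinese system
print(ansi)                      # b'\xd6\xd0\xce\xc4'
print(ansi.decode("cp1252"))     # ÖÐÎÄ -- what Sublime's Windows-1252 fallback displays

ASCII-only bytes are identical in GBK, Windows-1252, and UTF-8, which is why the freshly saved ASCII-only file is reported as 'ANSI': there is nothing in the bytes to distinguish the encodings.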