Why can Notepad++ display non-ASCII characters in ANSI encoding correctly but Sublime Text 2 cannot?

I use UTF-8 as the default encoding for newly created files in both Notepad++ and Sublime Text 2.
Create a new file in Notepad++ containing only ASCII characters, save it and close it.
Reopen it in Notepad++ and check the 'Encoding' menu: it says 'Encode in ANSI'. Then I add some non-ASCII characters (e.g. Chinese) to the file and save it. It is still in ANSI encoding but is displayed correctly (also correctly in the default Windows Notepad); but when I open the file in Sublime Text 2, I get garbled text (mojibake).
When I do the same thing in Sublime Text 2, the file is converted to UTF-8 automatically as soon as non-ASCII characters are entered.
So why do Notepad++ and Sublime Text 2 behave differently? Why can Notepad++ display non-ASCII characters in ANSI encoding correctly?

'ANSI' is not an encoding and is a very ambiguous term. It usually means Windows-1252 or the active OS code page, which for you is probably 'ANSI/OEM Simplified Chinese (PRC, Singapore); Chinese Simplified (GB2312)'.
Sublime Text 2 cannot detect encodings other than UTF-8, UTF-16 and ASCII. Its default fallback encoding in this case is Windows-1252, not the active system code page.
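To make the difference concrete, here is a minimal C# sketch (my illustration, not from either editor's source) that decodes the same GB2312 bytes the way each editor effectively does. On .NET Core/5+ it assumes the System.Text.Encoding.CodePages package; on .NET Framework the RegisterProvider line can be dropped.

using System;
using System.Text;

class AnsiDemo
{
    static void Main()
    {
        // Needed on .NET Core/5+ so legacy code pages like GB2312 are available;
        // built in on .NET Framework.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding gb2312 = Encoding.GetEncoding("GB2312");
        byte[] bytes = gb2312.GetBytes("中文"); // what "ANSI" means on a Simplified Chinese system

        // Notepad++ decodes with the active code page and recovers the text:
        Console.WriteLine(gb2312.GetString(bytes)); // 中文

        // Sublime Text 2 falls back to Windows-1252 and produces mojibake:
        Console.WriteLine(Encoding.GetEncoding(1252).GetString(bytes)); // ÖÐÎÄ
    }
}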

Related

Auto-replace Unicode \uxxxx escapes in a text file with their UTF-8 equivalents

I've been given a huge properties file created in Eclipse with ISO-8859-1 encoding, and all the Greek characters in it are in Unicode-escape format (e.g. \u03bc\u03af\u03b1\u0020\u03bc\u03ad\u03c1\u03b1). It works fine, but I want the actual file to be human-readable.
I converted the file to UTF-8, but the characters remained as they were. Is there a way to automatically convert the contents of the file to UTF-8, either from inside Eclipse or via an external tool?
This can be done with the AnyEdit Tools plugin:
Once it is installed, hit Ctrl+A in the editor (to select all the text containing ASCII and Unicode-escaped characters), then right-click and choose Convert > From Unicode Notation.
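If a scriptable route is preferable to the plugin, a small C# sketch along these lines would do the same conversion (the file names are placeholders, and it assumes plain \uXXXX escapes rather than doubled backslashes):

using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

class UnescapeProperties
{
    static void Main()
    {
        // Read the ISO-8859-1 properties file (file names here are placeholders).
        string text = File.ReadAllText("messages.properties", Encoding.GetEncoding("ISO-8859-1"));

        // Turn every \uXXXX escape into the character it denotes.
        string decoded = Regex.Replace(text, @"\\u([0-9a-fA-F]{4})",
            m => ((char)Convert.ToInt32(m.Groups[1].Value, 16)).ToString());

        // Save as UTF-8 so the Greek text is stored, and displayed, literally.
        File.WriteAllText("messages.utf8.properties", decoded, new UTF8Encoding(false));
    }
}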

Chinese in Japanese encoding

This may sound like a stupid question. I typed some Chinese characters into an empty text file in the VS Code text editor (default UTF-8). Then I saved the file in an encoding for Japanese, Shift JIS, which apparently doesn't cover all the characters I had typed.
However, before I closed the file, all the Chinese characters were displayed properly in VS Code. After I closed the file and reopened it using the Shift JIS encoding, several characters are displayed as a question mark '?'. I guess these are the Chinese characters not covered by the Japanese encoding?
What happened in the process? Is there any way I can get back the Chinese characters that are now shown as '?'? I don't really understand how encoding works in this scenario...
Not all encodings cover all characters. (Unicode encodings, in principle, do, but even they don't have quite everything yet.) If you save some text in an encoding which does not include all characters in that text, something has to give.
Options:
you get an error message,
nothing saves at all,
the characters which cannot be included are silently dropped,
the characters which cannot be included are converted to some other character (such as the question mark).
Once that conversion is done, the data is lost and cannot be recovered. Why not use UTF-8 or another Unicode encoding? (GB 18030 might be the best choice for large amounts of Chinese text.)
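What VS Code did corresponds to the last option above: at save time, every character Shift JIS cannot represent was replaced with '?'. A minimal C# sketch (my illustration, not what VS Code actually runs) reproduces the effect; the provider registration is only needed on .NET Core/5+:

using System;
using System.Text;

class LossySave
{
    static void Main()
    {
        // Needed on .NET Core/5+ so the Shift JIS code page is available.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding shiftJis = Encoding.GetEncoding("shift_jis");

        string original = "汉字テスト"; // Simplified Chinese 汉 is not in Shift JIS
        byte[] saved = shiftJis.GetBytes(original);  // unmappable chars become '?'
        string reloaded = shiftJis.GetString(saved);

        Console.WriteLine(reloaded); // ?字テスト — the '?' is all that was written to disk
    }
}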

Recursive directory listing of Unicode file names

If I use dir /s /b > list.txt, all Unicode characters in file names, like äöüß, are broken or missing: instead of ä I get '', ü just disappears, and so on...
Yes, I know, Unicode characters aren't a good way to name files, but the files weren't named by me.
Is there a method to get the file names listed intact?
The default console code page usually supports only a small subset of Unicode. US Windows defaults to code page 437, which supports only 256 characters.
If you open a Unicode command prompt (cmd /u), then when you redirect to a file, the file will be encoded in UTF-16LE, which supports all Unicode characters. Notepad should display the content as long as its font supports the glyphs used.
Switching to an encoding that covers the full Unicode code point set, such as UTF-8 (chcp 65001), and then redirecting to a file works as well.
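If running a short program is acceptable instead of the shell redirect, a C# sketch like the following (the folder path is a placeholder) writes a comparable recursive listing of files with an explicit, lossless UTF-8 encoding:

using System.IO;
using System.Text;

class ListFiles
{
    static void Main()
    {
        // Rough equivalent of `dir /s /b > list.txt`, but with an explicit UTF-8
        // encoding so no file-name characters are lost (path is a placeholder).
        var names = Directory.EnumerateFiles(@"C:\some\folder", "*", SearchOption.AllDirectories);
        File.WriteAllLines("list.txt", names, new UTF8Encoding(false));
    }
}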

Retrieving Unicode text from Notepad which was saved as an ANSI text file

Yesterday I wrote some text in a Notepad file which was full of Unicode characters, and saved the file as ANSI. Notepad gave me some warning, which I clicked OK on without reading it fully, and then I closed Notepad.
Today, when I opened the same file in Notepad again, it is full of '???' signs. I now understand that this happened because I saved Unicode data as ANSI text. Is there a way to retrieve this text, maybe using some hex editor or so?
No. Certain characters cannot be encoded in certain encodings. '風' cannot be encoded at all in ISO-8859 or any other single-byte encoding, for example. Likewise, each ANSI encoding can only encode a certain subset of all possible characters; it is simply not possible to store characters in an encoding that does not define them.
So they're gone. You had better pull out a backup.
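A hex editor won't help, because the replacement happened before the bytes reached the disk. This C# sketch (my illustration; the provider registration is only needed on .NET Core/5+) shows what actually gets stored:

using System;
using System.Text;

class GoneForGood
{
    static void Main()
    {
        // Needed on .NET Core/5+ so Windows-1252 is available.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // '風' is not representable in Windows-1252, so the encoder emits the
        // literal byte 0x3F ('?'). That single byte is all a hex editor would
        // find; nothing of the original character survives on disk.
        byte[] onDisk = Encoding.GetEncoding(1252).GetBytes("風");
        Console.WriteLine(BitConverter.ToString(onDisk)); // 3F
    }
}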

Creating files with French characters and encoding

Hi, I am creating a file like so.
FileStream temp = File.Create( this.FileName );
Then putting data in the file like so.
this.Writer = new StreamWriter( this.Stream );
this.Writer.WriteLine( strMessage );
That code is encapsulated in a class hierarchy but that is the meat and potatoes of it.
My problem is this: MSDN says that the default encoding when creating a file this way is UTF-8. When I write a French character such as é, TextPad interprets the file as a UTF-8 file, but Notepad++ says it's "ANSI as UTF8" (or maybe it is an ANSI file that Notepad++ is reading as UTF-8). When I create a file the same way without the French character, both TextPad and Notepad++ read the file as an ANSI file, even though according to MSDN it should still be a UTF-8 file.
Which program should be trusted, Notepad++ or TextPad? Notepad++ seems to be more consistent, but it is still the opposite of what MSDN says it should be. My problem is that we create files that get sent off to another company, and depending on whether there are French characters, the encoding seems to keep changing.
Or is there a better way to determine the encoding of a file? I've read about byte order marks and preambles, but as far as I understand, neither is guaranteed to be there.
We initially thought that all the files we were building were ANSI. Also, please note that both ANSI and UTF-8 should handle the French characters appropriately, as the characters are part of both character sets.
Strictly speaking, 'ANSI' here means the active Windows code page (Windows-1252 on Western systems), and every such code page is a superset of US-ASCII.
If there are no characters in the file outside the ASCII set, then the file is simultaneously valid ASCII, valid ANSI and valid UTF-8, and there is no way to distinguish them, because all three encode ASCII characters with identical bytes. So your program can write it as UTF-8, and any other program would be equally correct in reading it as ASCII (ANSI) or as UTF-8.
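A quick byte comparison makes this tangible. The following C# sketch (my illustration) also shows why adding é tips the editors' guesses; note that a StreamWriter constructed from a bare stream, as in the question, writes BOM-less UTF-8, which is byte-for-byte identical to ANSI until the first non-ASCII character appears:

using System;
using System.Linq;
using System.Text;

class SameBytes
{
    static void Main()
    {
        byte[] asAscii = Encoding.ASCII.GetBytes("plain ASCII");
        byte[] asUtf8 = new UTF8Encoding(false).GetBytes("plain ASCII"); // BOM-less UTF-8

        // Identical byte sequences: an editor has nothing to tell them apart by.
        Console.WriteLine(asAscii.SequenceEqual(asUtf8)); // True

        // Only a non-ASCII character makes the encodings diverge: UTF-8 needs
        // two bytes for 'é', while Windows-1252 stores it as the single byte E9.
        Console.WriteLine(BitConverter.ToString(new UTF8Encoding(false).GetBytes("é"))); // C3-A9
    }
}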