Difference between text file encodings in Android - Unicode

What is the difference between different text file encoding for my Android project, such as:
UTF-8
UTF-16BE
UTF-16LE
UTF-16
ISO-8859-1
US-ASCII
For example, for displaying Korean, I know I should use UTF-8. But when should I use the others?

See the Wikipedia article on character encoding for an overview of the different encodings and how they differ: http://en.wikipedia.org/wiki/Character_encoding
Usually UTF-8 works fine for cross-platform and multi-language use: http://en.wikipedia.org/wiki/UTF-8
But Korean versions of Windows also use Unified Hangul Code:
Unified Hangul Code (UHC) extends Wansung Code by adding the missing 8,822 Hangul characters, and is designed for smooth migration to Unicode Version 2.0. All Wansung code points map directly to the same UHC code points (but not vice versa). UHC also provides round trip mapping with Unicode Version 2.0. UHC is used in Korean versions of Windows 95 and Windows NT.
There is the iconv command under Linux (and the libiconv library for the C programming language) for converting between encodings.
iconv -l
lists all encodings that iconv supports.
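For programmatic conversion, a minimal libiconv sketch in C++ could look like the following. This is my own illustration, not part of the original answer; the function name convert and the encoding names "UTF-8" and "EUC-KR" are just examples, and error handling is reduced to the essentials.

#include <iconv.h>
#include <stdexcept>
#include <string>

// Convert a string from one encoding to another with libiconv.
std::string convert(const std::string& in, const char* from, const char* to) {
    iconv_t cd = iconv_open(to, from);              // note: target encoding first
    if (cd == (iconv_t)-1)
        throw std::runtime_error("conversion not supported");

    std::string out(in.size() * 4, '\0');           // generous worst-case output buffer
    char* inbuf   = const_cast<char*>(in.data());
    char* outbuf  = &out[0];
    size_t inleft = in.size(), outleft = out.size();

    size_t rc = iconv(cd, &inbuf, &inleft, &outbuf, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        throw std::runtime_error("conversion failed");

    out.resize(out.size() - outleft);               // trim to the bytes actually written
    return out;
}

// Example: std::string euckr = convert(utf8Text, "UTF-8", "EUC-KR");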

Related

What determines how strings are encoded in memory?

Say we have a file that is Latin-1 encoded and that we use a text editor to read that file into memory. My questions are then:
How will those character strings be represented in memory? Latin-1, UTF-8, UTF-16 or something else?
What determines how those strings are represented in memory? Is it the application, the programming language the application was written in, the OS or the hardware?
As a follow-up question:
How do applications then save files to encoding schemes that use different character sets? For example, converting UTF-8 to UTF-16 seems fairly intuitive to me, as I assume you just decode to the Unicode code point and then encode to the target encoding. But what about going from UTF-8 to Shift-JIS, which has a different character set?
Operating system
Windows
1993: Windows adopted Unicode 1.0 with NT 3.1 - back then Unicode was what is nowadays known as UCS-2. That Windows version also introduced NTFS (New Technology File System), which stores every filename in a UCS-2-like manner (16-bit code units).
2000: With NT 5.0 (aka Windows 2000) there was a shift/improvement from UCS-2 to UTF-16 - both OS and encoding became available in this year.
Since then nothing has changed. Internally, Windows has used 16-bit code units for almost 30 years, and thanks to UTF-16 even the newest code points, such as emojis, are supported. Its API works the same way, with compatibility functions for byte-wise encodings merely being stubs that convert the input to UTF-16. See also
What unicode encoding (UTF-8, UTF-16, other) does Windows use for its Unicode data types?
"Windows uses UTF-16 as its internal encoding", what exactly does this mean?
Why does Windows use UTF-16LE?
Is it safe to assume all Windows platforms will be in UCS-2 LE
Unix: most distributions use UTF-8 by default, because it is the most backward compatible while still being future-proof.
Programming language
Depends on their age or on their compiler: while languages themselves are not necessarily bound to an OS, the compiler that produces the binaries might treat things differently per OS.
Pascal: created around 1970, its String was just an array of bytes, not even necessarily meaning text. For text, ASCII or one of the other single-byte encodings could easily be dealt with.
Delphi: adopted Windows' WideString, using 16 bits per character, to make full use of the WinAPI and its Unicode support. Later additions introduced UTF8String, which works with bytes again, but not necessarily only one byte per character. Types such as UCS4String, using 4 bytes per character, have also been available since 2009.
Free Pascal: stays with the old String but always defaults to UTF-8 encoding. While this always needs conversion when using the WinAPI, it is also more platform independent. Several other String (compatibility) types also exist, each with different memory usage.
ECMAScript (JavaScript): per the standard, an engine should use UTF-16 for text. See also JavaScript strings - UTF-16 vs UCS-2?
Java: runtimes must support a minimum set of encodings, including UTF-16, so internal String handling/memory usage may differ. See also What is the Java's internal represention for String? Modified UTF-8? UTF-16?
Application/program
Depends on the platform/OS. While the in-memory representation of text is strongly influenced by the programming language, its compiler, and the data types used, using libraries (which could have been produced by entirely different compilers and programming languages) can mix this up.
Strictly speaking, the binary file format also has its own fixed encodings: on Windows, the PE format (used in EXE, DLL, etc.) stores resource strings in 16-bit characters again. So while, for example, the Free Pascal compiler can (per the language) make heavy use of UTF-8, it will still build an EXE file with UTF-16 metadata in it.
Programs that deal with text (such as editors) will most likely hold any encoding "as is" in memory for the sake of performance, with compromises such as temporarily duplicating parts into strings of 32 bits per character just to search through them quickly, let alone supporting Unicode normalization.
Conversion
The most common approach is to use a common denominator:
Either every input is decoded into 32-bit code points, which are then encoded into the target encoding. This costs the most memory, but makes it easy to deal with.
In the WinAPI you either convert to UTF-16 via MultiByteToWideChar(), or from UTF-16 via WideCharToMultiByte(). To go from UTF-8 to Shift-JIS you'd make a sidestep from UTF-8 to UTF-16, then from UTF-16 to Shift-JIS (see the sketch after this list). Which encodings are supported shifts per Windows version and localized installation; there's not really a guarantee for all of them.
External libraries specialized in encodings alone, such as iconv, can do this too - they support many encodings independent of what the OS supports.
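As a rough illustration of that sidestep (my addition, not from the original answer), the two WinAPI calls can be chained like this. Code page 932 is Windows' number for Shift-JIS, and a real program should check every return value:

#include <windows.h>
#include <string>

// UTF-8 -> UTF-16 -> Shift-JIS using only the WinAPI.
std::string Utf8ToShiftJis(const std::string& utf8) {
    // Step 1: UTF-8 -> UTF-16
    int wlen = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), nullptr, 0);
    std::wstring wide(wlen, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), &wide[0], wlen);

    // Step 2: UTF-16 -> Shift-JIS (code page 932)
    int slen = WideCharToMultiByte(932, 0, wide.data(), (int)wide.size(), nullptr, 0, nullptr, nullptr);
    std::string sjis(slen, '\0');
    WideCharToMultiByte(932, 0, wide.data(), (int)wide.size(), &sjis[0], slen, nullptr, nullptr);
    return sjis;
}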

Looking for C++ equivalent of _wfindfirst for char16_t

I have filenames using char16_t characters:
char16_t Text[2560] = u"ThisIsTheFileName.txt";
char16_t const* Filename = Text;
How can I check if the file exists already? I know that I can do so for wchar_t using _wfindfirst(). But I need char16_t here.
Is there an equivalent function to _wfindfirst() for char16_t?
Background for this is that I need to work with Unicode characters and want my code to work on Linux (32-bit) as well as on other platforms (16-bit).
_findfirst() is the counterpart to _wfindfirst().
However, both _findfirst() and _wfindfirst() are specific to Windows. _findfirst() accepts ANSI (outdated legacy stuff). _wfindfirst() accepts UTF-16 in the form of wchar_t (which is not exactly the same thing as char16_t).
ANSI and UTF-16 are generally not used on Linux, and _findfirst()/_wfindfirst() are not provided by the gcc compiler.
Linux uses UTF-8 as its Unicode format. You can use access() to check for file existence and permissions, or use opendir()/readdir()/closedir() as the equivalent of _findfirst().
If you have a UTF-16 filename from Windows, you can convert the name to UTF-8, and use the UTF-8 name in Linux. See How to convert UTF-8 std::string to UTF-16 std::wstring?
Consider using std::filesystem in C++17 or higher.
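A minimal sketch of that approach (my own, assuming C++17 and that the file name really arrives as char16_t): std::filesystem::path accepts a UTF-16 string and converts it to the platform's native filename encoding, so the same code works on Windows and Linux.

#include <filesystem>
#include <iostream>

int main() {
    const char16_t* filename = u"ThisIsTheFileName.txt";
    std::filesystem::path p(filename);   // UTF-16 -> native encoding handled by path

    if (std::filesystem::exists(p))      // true if the file already exists
        std::cout << "file exists\n";
    else
        std::cout << "file not found\n";
}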
Note that a Windows or Linux executable being 32-bit or 64-bit has nothing to do with the character set. Some very old systems are 16-bit, but you will probably not come across them.

Save file with ANSI encoding in VS Code

I have a text file that needs to be in ANSI mode. It specifies only that: ANSI. Notepad++ has an option to convert to ANSI and that does the trick.
In VS Code I don't find this encoding option. So I read up on it and it looks like ANSI doesn't really exist and should actually be called Windows-1252.
However, there's a difference between Notepad++'s ANSI encoding and VS Code's Windows-1252. It picks different codepoints for characters such as an accented uppercase e (É), as is evident from the document.
When I let VS Code guess the encoding of the document converted to ANSI by Notepad++, however, it still guesses Windows-1252.
So the questions are:
Is there something like pure ANSI?
How can VS Code convert to it?
Check out the upcoming VS Code 1.48 (July 2020) and its browser support.
It also solves issue 33720, where you previously had to force the encoding for the whole project, for instance with:
"files.encoding": "utf8",
"files.autoGuessEncoding": false
Now you can set the encoding file by file:
Text file encoding support
All of the text file encodings of the desktop version of VSCode are now supported for web as well.
As you can see, the encoding list includes ISO 8859-1, which is the closest standard to what "ANSI encoding" could represent.
The Windows code page 1252 was created based on ISO 8859-1 but is not completely identical to it.
It differs from the IANA's ISO-8859-1 by using displayable characters rather than control characters in the 80 to 9F (hex) range
From Wikipedia:
Historically, the phrase "ANSI Code Page" was used in Windows to refer to non-DOS encodings; the intention was that most of these would be ANSI standards such as ISO-8859-1.
Even though Windows-1252 was the first and by far most popular code page named so in Microsoft Windows parlance, the code page has never been an ANSI standard.
Microsoft explains, "The term ANSI as used to signify Windows code pages is a historical reference, but is nowadays a misnomer that continues to persist in the Windows community."

How many Unicode encodings are there, and are all of them still in use?

I know of the following Unicode encodings:
UTF-7
UTF-8
UTF-16
UTF-32
UCS-2
Are there more Unicode encodings? And are all of the Unicode encodings still in use, or are some of them now obsolete?
There is one Unicode (though there are different versions of it).
You can define any kind of encoding yourself; this does not matter much.
There are UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE as official encoding schemes. Also official, i.e. described in the Unicode standard, are UTF-8, UTF-16, and UTF-32.
UCS-2 was the old Unicode encoding (like UTF-16, but able to encode only code points below 65536), so it is now obsolete (replaced by UTF-16, which can encode all Unicode code points, including the newer ones). UTF-7 is also obsolete.
There are also the April Fools' encodings UTF-9 and UTF-18.
Some applications use UTF-8-sig (which is UTF-8 with an initial BOM).
In mail, you will probably use UTF-8 + Base64, or some other double encoding.
MySQL uses utf8mb3 and utf8mb4, which specify UTF-8 but also how many bytes to reserve (3 or 4) per SQL CHAR.
Python 3 uses (internally; you will probably never see it) a flexible representation: 1, 2, or 4 bytes per character (Latin-1, UCS-2, or UCS-4) according to the largest code point in the string (and the chosen width is stored together with the string length, outside the "true string"). So this is also a sort of encoding.
We have 21 bits to describe any Unicode code point. We are then free to choose any encoding (as long as we can get back to the original code point). UTF-8, UTF-16, and UTF-32 are just the most common ones (and the ones described in the Unicode standard).
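A small C++17 illustration (mine, not from the answer) of how the same code point looks in the three common encoding forms: U+1F600 needs four UTF-8 bytes, a UTF-16 surrogate pair, and a single UTF-32 code unit - which is also exactly what UCS-2 cannot represent.

#include <iostream>

int main() {
    const char     utf8[]  = u8"\U0001F600";  // 4 bytes: F0 9F 98 80 (C++17; the literal type changes to char8_t in C++20)
    const char16_t utf16[] = u"\U0001F600";   // 2 code units: D83D DE00 (surrogate pair)
    const char32_t utf32[] = U"\U0001F600";   // 1 code unit: 0001F600

    std::cout << "UTF-8  bytes:      " << sizeof(utf8) - 1 << '\n';                       // 4
    std::cout << "UTF-16 code units: " << sizeof(utf16) / sizeof(char16_t) - 1 << '\n';   // 2
    std::cout << "UTF-32 code units: " << sizeof(utf32) / sizeof(char32_t) - 1 << '\n';   // 1
}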

Why can iconv convert precomposed form but not decomposed form of "É" (from UTF-8 to CP1252)

I use the iconv library to interface from a modern input source that uses UTF-8 to a legacy system that uses Latin1, aka CP1252 (superset of ISO-8859-1).
The interface recently failed to convert the French string "Éducation", where the "É" was encoded as hex 45 CC 81. Note that the destination encoding does have an "É" character, encoded as C9.
Why does iconv fail converting that "É"? I checked that the iconv command-line tool that's available with Mac OS X 10.7.3 says it cannot convert, and that the Perl iconv module fails too.
This is all the more puzzling that the precomposed form of the "É" character (encoded as C3 89) converts just fine.
Is this a bug with iconv or did I miss something?
Note that I also have the same issue if I try to convert from UTF-16 (where "É" is encoded as 00 C9 composed or 00 45 03 01 decomposed).
Unfortunately, iconv indeed doesn't deal with decomposed characters in UTF-8, except for the version installed on Mac OS X.
When dealing with Mac file names, you can use iconv with the "utf8-mac" character set option. It also takes into account a few idiosyncrasies of the Mac decomposed form.
However, non-Mac versions of iconv or libiconv don't support this, and I could not find the sources used on the Mac that provide this support.
I agree with you that iconv should be able to deal with both NFC and NFD forms of UTF8, but until someone patches the sources we have to detect this manually and deal with it before passing stuff to iconv.
Faced with this annoying problem, I used Perl's Unicode::Normalize module as suggested by Jukka.
#!/usr/bin/perl
use Encode qw/decode_utf8 encode_utf8/;
use Unicode::Normalize;
while (<>) {
    print encode_utf8( NFC(decode_utf8 $_) );
}
Use a normalizer (in this case, to Normalization Form C) before calling iconv.
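If the surrounding code is C++ rather than Perl, the same pre-pass can be sketched with ICU (an assumption on my part - the original answer only uses Perl); link against ICU's common library:

#include <unicode/normalizer2.h>
#include <unicode/unistr.h>
#include <string>

// Normalize a UTF-8 string to NFC before handing it to iconv.
std::string toNFC(const std::string& utf8) {
    UErrorCode status = U_ZERO_ERROR;
    const icu::Normalizer2* nfc = icu::Normalizer2::getNFCInstance(status);

    icu::UnicodeString s = icu::UnicodeString::fromUTF8(utf8);
    icu::UnicodeString composed = nfc->normalize(s, status);   // "E" + U+0301 becomes "É"

    std::string out;
    composed.toUTF8String(out);   // back to UTF-8, now in precomposed (NFC) form
    return out;
}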
A program that deals with character encodings (different representations of characters or, more exactly, code points, as sequences of bytes) and converting between them should be expected to treat precomposed and decomposed forms as distinct. The decomposed É is two code points and as such distinct from the precomposed É, which is one code point.