How does computer display a character on the screen with the correct encoding? - unicode

I'm interested in the encoding of the character in the computer.
When I open my xxx.c with visual studio code, how does the VS code detect the encoding of my file and interprets these "01" sequence. Further on, how the visual studio code (or even the computer system) display the character on the screen acorrding to my "01" sequence file and the character encoding?
Thank you!
I also uses Chinese during my projects. Sometimes, the file encoding really drive my crazy. Sometimes,my correct utf-8 file created by edit A for example, was destroyed by some text editor B that interpret it as GBK file, and edit A can never get it back correct.
I searched a lot, but the most answers seems to be too abstract or irrelevant. I want to figure out how the software and the computer system( or operating system) cooperate together to make this simple but important job done!

First things first, "can never get it back": Always Use Source Code Control
"How the software and the computer system (or operating system) cooperate together to make this simple but important job done!": They don't that's the problem!
Short history: Many decades ago people used small character sets. The idea was a system would always use the same one. Simple. Every time a text file was transferred between systems, it would be immediately transcribed to the local character encoding. Then came the globalization of file exchanges and systems needed to hold text files in different encodings. There was no general way of recording what the encoding was. In 1991 came the huge character set Unicode. Languages (VB4, Java), operating system APIs (Win32), file systems (NTFS), … began adopting it. However, its encodings (UTF-8, UTF-16) are just yet more possibilities for which encoding a text file uses. Many programs that read text files either rely on the old system of a system default encoding or guess ("detect").
In the programming world, some languages require source files to use a specific encoding (say UTF-8); In others, tools default to specific encoding (say UTF-8). In most cases, the toolset provided with a C or C++ implementation will have a consistent set of rules. If you also use an IDE or other form of project system, you can set the encoding for the entire project and in some cases specific files.
So, the only solution is to only use tools that work for you and to properly configure them. If it hurts, stop doing it.
Aside: On the topic of programming and default character encodings, be careful not to get tricked with various language libraries' use of the system default character encoding—unless that is exactly what's needed. Otherwise, you are giving your users the same problem that you are encountering. (In Java, just avoid it with explicit arguments. In C and C++ libraries, encoding is combined into Locales. But note that many systems initialize a program to use default character encoding.

Related

ECL: The filesystem does not accept filenames with extended characters

how do you open a file of which name contains UTF-8 character?
For example:
(open "~/a/你好.txt")
give this:
The filesystem does not accept filenames with extended characters: "~/a/你好.txt"
I'm using ecl 16.1.3 from emerge from gentoo.
Meantime, sbcl can open the file.
I'm pretty sure ECL simply does not support general unicode filenames on Unix or Linux, however they get encoded in the underlying filesystem (I also don't know how that happens with *nix nowadays, although I guess there must be a standard now).
The specific error you're seeing originates here, in pathname.d. If you then look in unixfsys.d you'll see that ECL_NAMESTRING_FORCE_BASE_STRING is one of the flags passed to ecl_namestring all over the place, and this isn't conditionalized by anything.
So at the very least you would need to compile ECL from scratch, and more probably it simply does not support general unicode filenames at all.

Can you write Perl 6 scripts using an encoding that is not utf8?

Perl 5 has the encoding pragma or the Filter::Encoding module, however, I have not found anything similar in Perl 6. I guess eventually source filters will be created, but for the time being, can you use other encodings in Perl 6 scripts?
You cannot write your Perl 6 script in anything except utf8. I don't think there will ever be any other encoding you will be allowed to write your script in, as utf8 is basically the universal standard. Benefits like not having endianess and being back compatible with ASCII are some reasons it has become the standard and not things like utf16 or utf32.
Maybe there was a time before when such a thing may have been useful, but today I do not see that being the case. All text editors in common usage I know of default to utf8, and having files in multiple formats makes it more difficult to share your Perl 6 programs with others. There are plenty of reasons to want to use other encodings external to Perl 6 (writing to files, reading files etc.) but I don't see adding filters as smart move.
Rakudo currently supports an --encoding= option, so you might in theory be able to write a script in a different character encoding, and call it with perl6 --encoding=utf16 yourscript.p6. But in my experiments, I haven't managed to get it working with anything except utf8, and even if it worked, specifying --encoding on the command line would be a big no go for me.
So the operational answer is: currently no.
(And I don't think anybody else has asked for it yet...)

Using Perl, is there a difference between Win32API::File::MoveFile and CORE::rename on MSWin32?

I see that Win32API::File supports MoveFile(). However, I'm not sure how CORE::rename() is implemented in such a fashion that it should matter. Could someone juxtapose the difference -- specifically for the Win32 Environment -- between
CORE::rename()
File::Copy::move()
and, Win32API::File::MoveFile()
rename is implemented in a broken fashion since forever; move too, since it uses rename.
Win32::Unicode::File exposes MoveFileW from windows.h as moveW, and apparently handles encoding in a sane fashion, whereas Win32API::File leaves that to the user AFAICS from existing example code.
Related: How do I copy a file with a UTF-8 filename to another UTF-8 filename in Perl on Windows?

Unicode characters (Cyrillic) with Intel Fortran

Does anyone have any experience with using Unicode in Fortran? How does one pass Cyrillic characters, and open files with Cyrillic characters in their names?
Details:
I have a Fortran executable that needs to read parameters from a control file. Some of these parameters are in Cyrillic (e.g., file paths).
The executable calls a C++ DLL. Some of the parameters to these calls need to be in Cyrillic.
I am using the latest Intel Fortran.
I'm looking for any source of information, or small examples as to how to do so.
As already indicated, Fortran 2003 has a Unicode character type. Exactly what features will work with that character type ... I don't know ... filenames? I don't see mention of Unicode in the release notes for the Intel Fortran compiler. In 2006 Intel indicated that this feature was a low priority (http://software.intel.com/en-us/forums/showthread.php?t=51751). You might ask on the Intel forums ... probably an Intel representative will answer about the capabilities of the Intel compiler. If the Intel Fortran compiler can't handle Unicode yet, you might need to do this I/O in another language.
While I've not done anything similar, so have no personal experience on the matter, simply googling "fortran unicode" shows a few interesting results.
Apparently, gfortran has some moderate support for it (for an example scroll a bit down). Also, Tobian Burnus's answer in this thread sheds some more light on the matter - it seems that there is progress on that field, in F2003 and the (upcoming) F2013 standard, but for now, it doesn't really present one of the priorities.
If you want to open unicode files, this will not help. However, by default, the Intel Fortran compiler cannot even open files in a unicode folder. The documentation doesn't make it clear, but the /fpscomp:general compiler flag will allow you to work within unicode folders.

Why does Perl lose foreign characters on Windows; can this be fixed (if so, how)?

Note below how ã changes to a. NOTE2: Before you blame this on CMD.EXE and Windows pipe weirdness, see Experiment 2 below which gets a similar problem using File::Find.
The particular problem I'm trying to fix involves working with image files stored on a local drive, and manipulating the file names which may contain foreign characters. The two experiments shown below are intermediate debugging steps.
The ã character is common in latin languages. e.g. http://pt.wikipedia.org/wiki/Cão
Experiment 1
Look closely, note how cão becomes cao.
Experiment 2
Here I tried using File::Find instead of piped input, in case the issue was with the Windows implementation of the | shell operator. The issue actually gets worse, as the ~a becomes Pi:
Debugging update:
I tried some of the tricks listed at http://perldoc.perl.org/perlunicode.html,
e.g. use utf8, use feature 'unicode_strings', etc, to no avail.
Environment and Version Info
The OS is Windows 7, 64-bit.
The Perl is:
This is perl 5, version 12, subversion 2 (v5.12.2) built for MSWin32-x64-multi-thread
(with 8 registered patches, see perl -V for more detail)
Copyright 1987-2010, Larry Wall
Binary build 1202 [293621] provided by ActiveState http://www.ActiveState.com
Built Sep 6 2010 22:53:42
Perl, as with many other scripting languages, is built on the C runtime.
On Windows, the standard MS C runtime for narrow (byte) characters uses an encoding which defaults to the Windows system encoding (‘ANSI code page’) for IO activities such as opening files or writing to the console.
The ANSI code page is always a locale-specific encoding: usually single-byte, but multi-byte in some locales (eg China, Japan etc). It is never UTF-8 or anything else capable of reproducing the whole of Unicode; which characters Perl IO can cope with is dependent on the Windows locale (“language for non-Unicode programs” setting).
Whilst console apps can be given UTF-8 using the chcp 65001 command, there are a number of serious inconsistencies which come up with doing this. This causes difficulty for a lot of tools on Windows and is something Microsoft really needs to fix, but so far their attitude is that Unicode Equals UTF-16; everyone who wants Unicode to work must use the widechar interfaces.
So you won't currently be able to deal with files that use non-ASCII filenames reliably in Perl on Windows. Sorry.
You could try Python (which added special Windows-only filename handling to get around this problem in version 2.3 onwards; see PEP 277), or one of the Unicode-aware Windows Scripting Host languages. Either way, getting Unicode out to the console on Windows still has more pitfalls.
The following 3 liner works as expected on my newly minted ActivePerl 5.12.2:
use utf8;
open($file, '>:encoding(UTF-8)', "output.txt") or die $!;
print $file "さっちゃん";
I think the culprit is cmd.exe.