How do you open a file whose name contains UTF-8 characters?
For example:
(open "~/a/你好.txt")
gives this:
The filesystem does not accept filenames with extended characters: "~/a/你好.txt"
I'm using ECL 16.1.3, installed via emerge on Gentoo.
Meanwhile, SBCL can open the file.
I'm pretty sure ECL simply does not support general unicode filenames on Unix or Linux, however they get encoded in the underlying filesystem (I also don't know how that happens with *nix nowadays, although I guess there must be a standard now).
The specific error you're seeing originates here, in pathname.d. If you then look in unixfsys.d you'll see that ECL_NAMESTRING_FORCE_BASE_STRING is one of the flags passed to ecl_namestring all over the place, and this isn't conditionalized by anything.
So at the very least you would need to compile ECL from scratch, and more probably it simply does not support general unicode filenames at all.
Related
Perl 5 has the encoding pragma or the Filter::Encoding module, however, I have not found anything similar in Perl 6. I guess eventually source filters will be created, but for the time being, can you use other encodings in Perl 6 scripts?
You cannot write your Perl 6 script in anything except utf8. I don't think there will ever be any other encoding you will be allowed to write your script in, as utf8 is basically the universal standard. Benefits like not having endianness issues and being backward compatible with ASCII are some of the reasons it has become the standard, rather than utf16 or utf32.
Maybe there was a time when such a thing would have been useful, but today I do not see that being the case. All text editors in common usage I know of default to utf8, and having files in multiple formats makes it more difficult to share your Perl 6 programs with others. There are plenty of reasons to want to use other encodings external to Perl 6 (writing to files, reading files, etc.), but I don't see adding filters as a smart move.
Rakudo currently supports an --encoding= option, so you might in theory be able to write a script in a different character encoding, and call it with perl6 --encoding=utf16 yourscript.p6. But in my experiments, I haven't managed to get it working with anything except utf8, and even if it worked, specifying --encoding on the command line would be a big no go for me.
So the operational answer is: currently no.
(And I don't think anybody else has asked for it yet...)
I'm interested in the encoding of the character in the computer.
When I open my xxx.c with Visual Studio Code, how does VS Code detect the encoding of my file and interpret this "01" sequence? Further, how does VS Code (or even the computer system) display the characters on the screen according to my "01" sequence file and the character encoding?
Thank you!
I also use Chinese in my projects. Sometimes the file encoding really drives me crazy. Sometimes a correct UTF-8 file created by editor A, for example, is destroyed by some text editor B that interprets it as a GBK file, and editor A can never get it back correct.
I searched a lot, but most answers seem to be too abstract or irrelevant. I want to figure out how the software and the computer system (or operating system) cooperate together to make this simple but important job done!
First things first, "can never get it back": Always Use Source Code Control
"How the software and the computer system (or operating system) cooperate together to make this simple but important job done!": They don't, and that's the problem!
Short history: Many decades ago people used small character sets. The idea was a system would always use the same one. Simple. Every time a text file was transferred between systems, it would be immediately transcribed to the local character encoding. Then came the globalization of file exchanges and systems needed to hold text files in different encodings. There was no general way of recording what the encoding was. In 1991 came the huge character set Unicode. Languages (VB4, Java), operating system APIs (Win32), file systems (NTFS), … began adopting it. However, its encodings (UTF-8, UTF-16) are just yet more possibilities for which encoding a text file uses. Many programs that read text files either rely on the old system of a system default encoding or guess ("detect").
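To make "guess" concrete, here is a minimal sketch of one common heuristic: look for a byte order mark, then check whether the bytes are well-formed UTF-8, and otherwise give up and fall back to a default. The function names and the returned labels are made up for illustration; this is not a description of how VS Code or any particular editor actually does it.

```cpp
#include <cstddef>
#include <string>

// Returns true if `bytes` is well-formed UTF-8. Rejects truncated sequences,
// overlong forms, UTF-16 surrogates, and code points above U+10FFFF.
bool is_valid_utf8(const std::string& bytes) {
    // Smallest code point that legitimately needs a sequence of each length.
    static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 0;
        unsigned long cp = 0;
        if (b0 < 0x80)                { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else return false;              // stray continuation or invalid lead byte
        if (i + len > n) return false;  // sequence cut off at end of input
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (len > 1 && cp < min_cp[len]) return false;   // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        if (cp > 0x10FFFF) return false;                 // beyond Unicode
        i += len;
    }
    return true;
}

// Hypothetical guesser: BOM first, then UTF-8 validity, else "unknown".
// Pure ASCII reports as "utf-8" because ASCII is a subset of UTF-8.
std::string guess_encoding(const std::string& bytes) {
    if (bytes.compare(0, 3, "\xEF\xBB\xBF") == 0) return "utf-8-bom";
    if (bytes.compare(0, 2, "\xFF\xFE") == 0) return "utf-16le";
    if (bytes.compare(0, 2, "\xFE\xFF") == 0) return "utf-16be";
    if (is_valid_utf8(bytes)) return "utf-8";
    return "unknown";  // caller falls back to a system or user default
}
```

Note the asymmetry: the GBK bytes for 你好 (C4 E3 BA C3) are not valid UTF-8, so a UTF-8 check can reject them, but in the other direction a guess is much weaker, because many byte sequences are acceptable in several legacy encodings at once. That is why a wrong guess followed by a save can damage a file.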
In the programming world, some languages require source files to use a specific encoding (say UTF-8); in others, tools default to a specific encoding (say UTF-8). In most cases, the toolset provided with a C or C++ implementation will have a consistent set of rules. If you also use an IDE or other form of project system, you can set the encoding for the entire project and in some cases for specific files.
So, the only solution is to only use tools that work for you and to properly configure them. If it hurts, stop doing it.
Aside: On the topic of programming and default character encodings, be careful not to get tricked by various language libraries' use of the system default character encoding, unless that is exactly what's needed. Otherwise, you are giving your users the same problem that you are encountering. (In Java, just avoid it with explicit arguments. In C and C++ libraries, encoding is combined into Locales. But note that many systems initialize a program to use the default character encoding.)
Let's say you have used the new std::filesystem (or std::experimental::filesystem) code to hunt down a file. You have a path variable that contains the full pathname to this file.
How do you open that file?
That may sound silly, but consider the obvious answer:
std::filesystem::path my_path = ...;
std::ifstream stream(my_path.c_str(), std::ios::binary);
This is not guaranteed to work. Why? Because on Windows for example, path::string_type is std::wstring. So path::c_str will return a const wchar_t*. And std::ifstream can only take paths with a const char* type.
Now it turns out that this code will actually function in VS. Why? Because Visual Studio has a library extension that does permit this to work. But that's non-standard behavior and therefore not portable. For example, I have no idea if GCC on Windows provides the same feature.
You could try this:
std::filesystem::path my_path = ...;
std::ifstream stream(my_path.string().c_str(), std::ios::binary);
Only Windows confounds us again. Because if my_path happened to contain Unicode characters, then now you're reliant on setting the Windows ANSI locale stuff correctly. And even that won't necessarily save you if the path happens to have characters from multiple languages that cannot exist in the same ANSI locale.
Boost Filesystem actually had a similar problem. But they extended their version of iostreams to support paths directly.
Am I missing something here? Did the committee add a cross-platform filesystem library without adding a cross-platform way to open files in it?
Bo Persson pointed out that this is the subject of a standard library defect report. This defect has been resolved, and C++17 will ship requiring implementations where path::value_type is not char to have their file stream types take const filesystem::path::value_type* in addition to the usual const char* versions.
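In practice, with a C++17 standard library you do not need c_str() or string() at all, because the file stream constructors also accept a std::filesystem::path directly. A minimal sketch follows; read_all is a made-up helper name, and throwing on failure is just one possible error policy.

```cpp
#include <filesystem>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Reads a whole file as raw bytes. The path object itself is passed to the
// stream constructor, so the library uses the path's native representation
// and no narrowing to a char-based string is written by hand.
std::string read_all(const std::filesystem::path& p) {
    std::ifstream stream(p, std::ios::binary);
    if (!stream) {
        throw std::runtime_error("cannot open file");
    }
    return std::string(std::istreambuf_iterator<char>(stream),
                       std::istreambuf_iterator<char>());
}
```

Calling read_all(my_path) with the my_path from the question replaces both of the earlier attempts. This needs to be compiled as C++17 or later.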
I see that Win32API::File supports MoveFile(). However, I'm not sure how CORE::rename() is implemented in such a fashion that it should matter. Could someone juxtapose the difference -- specifically for the Win32 Environment -- between
CORE::rename()
File::Copy::move()
and, Win32API::File::MoveFile()
rename has been implemented in a broken fashion since forever; move too, since it uses rename.
Win32::Unicode::File exposes MoveFileW from windows.h as moveW, and apparently handles encoding in a sane fashion, whereas Win32API::File leaves that to the user AFAICS from existing example code.
Related: How do I copy a file with a UTF-8 filename to another UTF-8 filename in Perl on Windows?
Does anyone have any experience with using Unicode in Fortran? How does one pass Cyrillic characters, and open files with Cyrillic characters in their names?
Details:
I have a Fortran executable that needs to read parameters from a control file. Some of these parameters are in Cyrillic (e.g., file paths).
The executable calls a C++ DLL. Some of the parameters to these calls need to be in Cyrillic.
I am using the latest Intel Fortran.
I'm looking for any source of information, or small examples as to how to do so.
As already indicated, Fortran 2003 has a Unicode character type. Exactly what features will work with that character type ... I don't know ... filenames? I don't see mention of Unicode in the release notes for the Intel Fortran compiler. In 2006 Intel indicated that this feature was a low priority (http://software.intel.com/en-us/forums/showthread.php?t=51751). You might ask on the Intel forums ... probably an Intel representative will answer about the capabilities of the Intel compiler. If the Intel Fortran compiler can't handle Unicode yet, you might need to do this I/O in another language.
While I've not done anything similar, so have no personal experience on the matter, simply googling "fortran unicode" shows a few interesting results.
Apparently, gfortran has some moderate support for it (for an example scroll a bit down). Also, Tobias Burnus's answer in this thread sheds some more light on the matter - it seems that there is progress in that field, in F2003 and the (upcoming) F2013 standard, but for now, it isn't really one of the priorities.
If you want to open unicode files, this will not help. However, by default, the Intel Fortran compiler cannot even open files in a unicode folder. The documentation doesn't make it clear, but the /fpscomp:general compiler flag will allow you to work within unicode folders.