Unicode paths with MATLAB

Given the following code, which attempts to create two folders in the current MATLAB folder:
%%
u_path1 = native2unicode([107, 97, 116, 111, 95, 111, 117, 116, 111, 117], 'UTF-8'); % 'kato_outou'
u_path2 = native2unicode([233 129 142, 230 184 161, 229 191 156, 231 173 148], 'UTF-8'); % '過渡応答'
mkdir(u_path1);
mkdir(u_path2);
the first mkdir call succeeds while the second fails with the error message "The filename, directory name, or volume label syntax is incorrect". However, creating the folders manually in the "Current Folder" GUI panel ([right click] ⇒ New Folder ⇒ [paste name]) works without a problem. This kind of glitch appears in most of MATLAB's low-level I/O functions (dir, fopen, copyfile, movefile etc.), and I'd like to use all of these functions.
The environment is:
Win7 Enterprise (32 bit, NTFS)
MATLAB R2012a
thus the filesystem supports Unicode characters in paths, and MATLAB can store true Unicode strings (and not "fake" them).
The official mkdir documentation elegantly{1} avoids the issue by stating that the correct syntax for calling the function is:
mkdir('folderName')
which suggests that the only officially supported call is the one that uses a string literal for the folder name argument, not a string variable. That would also suggest the eval route, which I'm testing to see whether it works as I write this post.
I wonder if there is a way to circumvent these limitations. I would be interested in solutions that:
don't rely on undocumented/unsupported MATLAB stuff;
don't involve system-wide changes (e.g. changing the operating system's locale settings);
may rely on non-native MATLAB libraries, as long as the resulting handles/objects can be converted to native MATLAB objects and manipulated as such;
may rely on manipulations of the paths that would render them usable by the standard MATLAB functions, even if Windows-specific (e.g. short-name paths; see the sketch after this list).
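For illustration, here is a hypothetical sketch of the short-name idea using the Scripting.FileSystemObject COM server. COM strings are UTF-16 BSTRs, so the Unicode name survives the call; the folder must already exist (created, e.g., via the GUI or via Java as discussed below), and this is Windows-only:
% Hypothetical sketch: map a Unicode folder name to its ASCII-only 8.3 alias
fso = actxserver('Scripting.FileSystemObject');
f = fso.GetFolder(fullfile(pwd, u_path2)); % u_path2 from the snippet above
shortPath = f.ShortPath;                   % 8.3 short name, plain ASCII
delete(fso);
dir(shortPath);                            % now usable by standard MATLAB functions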
Later edit
What I'm looking for are implementations for the following functions, which will shadow the originals in the code that is already written:
function listing = dir(folder);
function [status,message,messageid] = mkdir(folder1,folder2);
function [status,message,messageid] = movefile(source,destination,flag);
function [status,message,messageid] = copyfile(source,destination,flag);
function [fileID, message] = fopen(filename, permission, machineformat, encoding);
function status = fclose(fileID);
function [A, count] = fread(fileID, sizeA, precision, skip, machineformat);
function count = fwrite(fileID, A, precision, skip, machineformat);
function status = feof(fileID);
function status = fseek(fileID, offset, origin);
function [C,position] = textscan(fileID, varargin); %'This one is going to be funny'
Not all the output types need to be interchangeable with the original MATLAB functions, but they need to be consistent between function calls (e.g. the fileID shared by fopen and fclose). I am going to update this declaration list with implementations as soon as I get/write them.
{1} for very loose meanings of the word "elegant".
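As an illustration of what such a shadow could look like, here is a minimal sketch built on java.io.File (Java strings are UTF-16, which sidesteps the codepage-based API; the message text and identifier below are placeholders, not the built-in's):
function [status, message, messageid] = mkdir(folder)
%MKDIR Sketch of a Unicode-tolerant shadow for the built-in mkdir.
    jFile = java.io.File(folder);
    if ~jFile.isAbsolute()
        jFile = java.io.File(fullfile(pwd, folder)); % resolve relative names against pwd
    end
    if jFile.mkdirs() % also creates missing parent folders
        status = 1; message = ''; messageid = '';
    else
        status = 0;
        message = sprintf('Could not create folder "%s"', folder);
        messageid = 'shadow:mkdir:OSError'; % placeholder identifier
    end
end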

Some useful information on how MATLAB handles filenames (and characters in general) is available in the comments of this UndocumentedMatlab post (especially those by Steve Eddins, who works at MathWorks). In short:
"MathWorks began to convert all the string handling in the MATLAB code base to UTF-16 .... and we have approached it incrementally"
--Steve Eddins, December 2014.
This statement implies that the newer the MATLAB version, the more of its features support UTF-16. It also means that if updating your MATLAB version is an option, that may be the easiest solution to the problem.
Below is a list of functions that were tested by users on different platforms, according to the functionality that was requested in the question:
The following command creates a directory with UTF-16 characters in its name ("תיקיה", Hebrew for "folder", in this example) from within MATLAB:
java.io.File(fullfile(pwd, native2unicode( ...
    [255 254 234 5 217 5 231 5 217 5 212 5], 'UTF-16'))).mkdir(); % the leading 255 254 is the UTF-16 BOM (FF FE)
Tested on:
Windows 7 with MATLAB R2015a by Dev-iL
OSX Yosemite (10.10.4) with MATLAB R2014b by Matt
The following commands also seem to create directories successfully:
mkdir(native2unicode([255 254 234 5 217 5 231 5 217 5 212 5],'utf-16'));
mkdir(native2unicode([215,170,215,153,215,167,215,153,215,148],'utf-8'));
Tested on:
Windows 10 with MATLAB R2015a having feature('DefaultCharacterSet') => windows-1255 by Dev-iL
OSX Yosemite (10.10.4) with MATLAB R2014b by Matt
The value of feature('DefaultCharacterSet') has no influence here, because the encoding is explicitly specified in the call to native2unicode.
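Since MATLAB char arrays hold UTF-16 code units, the same folder name can also be written without byte arrays; a sketch (the numeric values are the code points of 'תיקיה'):
folderName = char([1514 1497 1511 1497 1492]); % U+05EA U+05D9 U+05E7 U+05D9 U+05D4, i.e. 'תיקיה'
mkdir(folderName);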
The following commands successfully open a file having unicode characters both in its name and as its content:
fid = fopen([native2unicode([255,254,231,5,213,5,209,5,229,5],'utf-16') '.txt']);
txt = textscan(fid,'%s');
Tested on:
Windows 10 with MATLAB R2015a having feature('DefaultCharacterSet') => windows-1255 by Dev-iL. Note: the scanned text appears correctly in the Variables view. The text file can be edited and saved from the MATLAB editor with the Unicode characters intact.
OSX Yosemite (10.10.4) with MATLAB R2014b by Matt
If feature('DefaultCharacterSet') is set to utf-8 before using textscan, the output of celldisp(txt) is displayed correctly. The same applies to the Variables view.
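Alternatively, the content encoding can be fixed per file instead of session-wide, since fopen accepts an explicit encoding argument; a sketch reusing the file name from above:
% 'r' = read, 'n' = native machine format, 'UTF-8' = content encoding
fname = [native2unicode([255,254,231,5,213,5,209,5,229,5],'utf-16') '.txt'];
fid = fopen(fname, 'r', 'n', 'UTF-8');
txt = textscan(fid, '%s');
fclose(fid);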

Try to use UTF-16 if you are on Windows, because NTFS uses UTF-16 for filename encoding, and Windows has two sets of APIs: those that work with so-called 'Windows codepages' (1250, 1251, 1252 etc.) and use C's char data type, and those that use wchar_t. The latter type has a size of 2 bytes on Windows, which is enough to store UTF-16 code units.
The reason your first call worked is that the first 128 code points of the Unicode Standard are encoded in UTF-8 exactly like the 128 ASCII characters (done deliberately for backwards compatibility). UTF-8 uses 1-byte code units (instead of the 2-byte code units of UTF-16), and software such as MATLAB usually does not process filenames: it just stores byte sequences and passes them to the OS APIs. The second call probably failed because Windows rejected the UTF-8 byte sequences: interpreted through the codepage API, some of the byte values are prohibited in filenames. On POSIX-conformant operating systems, most APIs are byte-oriented, and the standard pretty much prevents you from using wide multibyte encodings (e.g., UTF-16, UTF-32) in APIs: you have to use char* APIs and encodings with 1-byte code units:
POSIX.1-2008 places only the following requirements on the encoded values of the characters in the portable character set:
...
The encoded values associated with <period>, <slash>, <newline>, and <carriage-return> shall be invariant across all locales supported by the implementation.
The encoded values associated with the members of the portable character set are each represented in a single byte. Moreover, if the value is stored in an object of C-language type char, it is guaranteed to be positive (except the NUL, which is always zero).
Not all POSIX-conformant operating systems validate filenames beyond checking for the period and the slash, so you can pretty much store garbage in filenames. Mac OS X, being a POSIX system, uses byte-oriented (char*) APIs, but the underlying HFS+ stores filenames as UTF-16 in NFD (Normalization Form D), so some processing is done at the OS level before a filename is saved.
Windows does not perform any Unicode normalization and stores filenames in whatever form they are passed, either as UTF-16 (provided NTFS is used) or as Windows codepages (not sure how this is handled at the filesystem level; probably by conversion).
So, how does this relate to MATLAB? Well, it is cross-platform and has to deal with many issues because of that. One of them is that Windows has char APIs for Windows codepages and certain forbidden characters in filenames, while other OSes do not. MathWorks could implement system-dependent checks, but that would be much harder to test and support (much code churn, I guess).
My best advice is to use UTF-16 on Windows, implement platform-dependent checks, or stick to ASCII if you need portability.
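The ASCII/UTF-8 overlap mentioned above can be checked directly in MATLAB:
% The first 128 Unicode code points are byte-identical in UTF-8 and ASCII
isequal(native2unicode(uint8(0:127), 'UTF-8'), char(0:127)) % returns true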

Related

How does the VBScript FileSystemObject encode characters?

I have this VBScript code:
Set fs = CreateObject("Scripting.FileSystemObject")
Set ts = fs.OpenTextFile("tmp.txt", 2, True)
For i = 128 To 255
    s = Chr(i)
    If LenB(s) <> 2 Then
        WScript.Echo i
        WScript.Quit
    End If
    ts.Write s
Next
ts.Close
On my system, each integer is converted to a double byte character: there are no numbers in that range that cannot be represented by a character, and no number requires more than 2 bytes.
But when I look at the file, I find only 127 bytes.
This answer: https://stackoverflow.com/a/31436726/1335492 suggests that the FSO creates Unicode files and inserts a BOM. But the file contains only 127 bytes and no byte order mark.
How does the FSO decide how to encode text? Which encoding allows 8-bit single-byte characters? Which encodings do not include 255 8-bit single-byte characters?
(Answers about how FSO reads characters may also be interesting, but that's not what I'm specifically asking here)
Edit: I've limited my question to the high-bit characters, to make it clear what the question is. (Answers about the low-bit characters may also be interesting, but that's not what I'm specifically asking here)
Short Answer:
The file system object maps "Unicode" to "ASCII" using the code page associated with the System Locale. (Chr and ChrW use the User Locale.)
Application:
There may be silent transposition errors between the System code page and the Thread (user) code page. There may also be coding and decoding errors if code points are missing from a code page, or, as with Japanese and UTF-8, the code pages contain multi-byte characters.
VBScript provides no native method to detect the User, Thread, or System code page. The Thread (user) code page may be inferred from the Locale set by SetLocale or returned by GetLocale (there is a list here: https://www.science.co.il/language/Locale-codes.php), but there does not appear to be any MS documentation. On Win2K+, WMI may be used to query the System code page. The CHCP command queries and changes the OEM codepage, which is neither the User nor the System code page.
The system code page may be spoofed by an application manifest. There is no way for an application (such as cscript or wscript) or a script (such as VBScript or JScript) to change the code page of its parent system, except by creating a new process with a new manifest, or by rebooting the system after making a registry change.
In detail:
s = Chr(i)
'creates a Unicode string, using the Thread Locale Codepage.
Code points that do not exist as characters are mapped to control characters: 127 becomes U+007F (a standard Unicode control character), 128 becomes U+20AC (the Euro sign), and 129 becomes U+0081 (a code point in a Unicode control-character region). In VBScript, the Thread Locale can be set and read with SetLocale and GetLocale.
CreateObject("Scripting.FileSystemObject").OpenTextFile(strOutFile, 2, True).Write s
'creates a 'code page' string, using the System Locale Codepage.
There are two ways Windows can handle Unicode values it can't map: it can either map them to a default character or return an error. "Scripting.FileSystemObject" uses the error setting and throws an exception.
In More Detail:
The Thread Locale is, by default, the User Locale, which is the date and time format setting in the "Region and Language" control panel applet (called different things in different versions of Windows). It has an associated code page. According to MS internationalization expert Michka (Michael Kaplan, RIP), the reason it has a code page is so that months and days of the week can be written with the appropriate characters, and it should not be used for any other purpose.
The ASP-classic people clearly had other ideas, since Response.CodePage is thread-locale, and can be controlled by vbscript GetLocale and SetLocale amongst other methods. If the User Locale is changed, all processes are notified, and any thread that is using the default value updates. (I haven't tested what happens to a thread currently using a non-default value).
The System Locale is also called "Language for non-Unicode programs" and is also found in the "Region and Language" applet, but requires a reboot to change. This is the value used internally by windows ("The System") to map between the "A" API and the "W" API. Changing this has no effect on the language of the Windows GUI (That is not a "non-Unicode program")
Assuming that the "Time and Date" setting matches the "Language for non-Unicode programs", any Chr(i) that can create a valid Unicode code point (see "mapping errors" below), will map back exactly from Unicode to "code page". Note that this does work for code points that are "control characters": also note that it doesn't work the other way: UTF-CodePage-UTF doesn't always round-trip exactly. Famously (Character,Modifer)-CodePage-(Complex Character) does not round-trip correctly, where Unicode defines more than one way of constructing a language character representation.
If the "Time and Date" does not match the "Language for non-Unicode programs", any translation could take place, for example U+0101 is 0xE0 on cp28594 and 0xE2 on cp28603: Chr(224) would go through U+0101 to be written as 226.
Even if there are no transposition errors, if the "Time and Date" does not match the "Language for non-Unicode programs", the program may fail when translating to the System Locale: if the Unicode code point does not have a matching code page code point, there will be an exception from the FileSystemObject.
There may also be mapping errors at Chr(i), going from code page to Unicode. Code page 1041 (Japanese) is a double-byte code page (probably Shift JIS). 0x81 is (only) the first byte of a double-byte pair. To be consistent with other code pages, 0x81 should map to the control character U+0081, but when given 81 and code page 1041, Windows assumes that the next byte in the buffer, or in the BSTR, is the second byte of the double-byte pair (I've not determined whether the mistake is made before or after the conversion). Chr(&H81) is mapped to U+xx81 (81,xx). When I did it, I got U+4581, a CJK Unified Ideograph (Brasenia purpurea): it's not mapped by code page 1041.
Mapping errors at Chr(i) do not cause VBScript exceptions at the point of creation. If the UTF-16 code point created is invalid or not on the System Locale code page, there will be a FileSystemObject exception at .write. This particular problem can be avoided by using ChrW(i) instead of Chr(i). On code page 1041, ChrW(129) becomes the Unicode control character U+0081 instead of xx81.
Background:
A program can map between Unicode and "codepage" using any installed code page: the Windows functions MultiByteToWideChar and WideCharToMultiByte take [UINT CodePage] as their first parameter. That mechanism is used internally in Windows to map the "A" API to the "W" API, for example GetAddressByNameA and GetAddressByNameW. Windows is "W" (wide, 16-bit) internally; "A" strings are mapped to "W" strings on call, and back from "W" to "A" on return. When Windows does the mapping, it uses the code page associated with the "System Locale", also called "Language for non-Unicode programs".
The Windows API function WriteFile writes bytes, not characters, so it's not an "A" or "W" function. Any program that uses it has to handle the conversion between strings and bytes itself. The C function fwrite writes fixed-size items, so it can handle 16-bit characters, but it has no notion of variable-length code points like those of UTF-8 or UTF-16: again, any program that uses fwrite has to handle the conversion between strings and words itself.
The C++ function fwrite can handle UTF, and the compiler function _fwrite does magic that depends on the compiler. Presumably, on Windows, the MultiByteToWideChar and WideCharToMultiByte APIs are used if code page translation is required.
The "A" code pages and the "A" API were called "ANSI" or "ASCII" or "OEM", and started out with 8-bit characters, then grew to double-byte characters, and have now grown to UTF-8 (1..4 bytes per code point). The "W" API started out with 16-bit characters and then grew to UTF-16 (one or two 16-bit code units per code point). Both are multi-word character encodings: for the "A" API and code pages the word length is 8 bits; for the "W" API and UTF-16 the word length is 16 bits. Because they are both multi-word mappings, and because "byte", "word", "char" and "character" mean different things in different contexts, and because "W" and particularly "A" mean different things than they did years ago, I've just used "A" and "W" and "code page" and "Unicode".
"OEM" is the code page associated with another locale: The Console I/O API. It is per-process (it's a thread locale), it can be changed dynamically (using the CHCP command) and its default value is set at installation: there is no GUI provided to change the value stored in the registry. Most console programs don't use the console I/O API, and as written, use either the system locale, or the user locale, or, (sometimes inadvertently), a mixture of both.
The System Locale can be spoofed by using a manifest and there was a WinXP utility called "AppLocale" that did the same thing.
The FSO decides how to encode text when the file is opened. Use the format argument as follows:
Set ts = fs.OpenTextFile("tmp.txt", 2, True, -1) ' fourth argument: -1 = TristateTrue (Unicode)
Resource: OpenTextFile Method
Syntax
object.OpenTextFile(filename[, iomode[, create[, format]]])
Arguments
object - Required. Always the name of a FileSystemObject.
filename - Required. String expression that identifies the file to open.
iomode - Optional. Can be one of three constants: ForReading, ForWriting, or ForAppending.
create - Optional. Boolean value that indicates whether a new file can be created if the specified filename doesn't exist. The value is True if a new file is created, False if it isn't created. If omitted, a new file isn't created.
format - Optional. One of three Tristate values used to indicate the format of the opened file:
TristateTrue = -1 to open the file as Unicode,
TristateFalse = 0 to open the file as ASCII,
TristateUseDefault = -2 to open the file as the system default.
If omitted, the file is opened as ASCII.
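For readers coming from the MATLAB question above, the same FSO behaviour can be reproduced through COM; a sketch (Windows only, file name and written character are arbitrary examples):
fso = actxserver('Scripting.FileSystemObject');               % COM server
ts = fso.OpenTextFile(fullfile(pwd,'tmp.txt'), 2, true, -1);  % -1 = TristateTrue: UTF-16LE with BOM
ts.Write(char(1514)); % U+05EA, Hebrew letter Tav
ts.Close();
delete(fso);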

Normalize filenames to NFC or not (Unicode)

I wrote an application that prefers NFC. But when I get a filename from OS X, it's normalized as NFD. As far as I know I shouldn't convert the data, as mentioned here:
http://www.win.tue.nl/~aeb/linux/uc/nfc_vs_nfd.html
[...](Not because something is wrong with NFD, or this version of NFD,
but because one should never change data. Filenames must not be
normalized.)[...]
When I compare the filename with the user input (which is in NFC), I have to use a comparison function that takes Unicode equivalence into account. But that could be much slower than necessary. Wouldn't it be better to normalize the filename to NFC instead? It would improve the speed a lot when just a memory compare is involved.
The accuracy of the advice you link to depends on the filesystem in question.
The 'standard' Linux file systems do not prescribe an encoding for filenames (they are treated as raw bytes), so assuming they are UTF-8 and normalising them is an error and may cause problems.
On the other hand, the default filesystem on Mac OS X (HFS+) forces all filenames to be valid UTF-16 in a variant of NFD. If you need to compare file paths, you should do so in a similar form, ideally using the APIs provided by the system, as its NFD variant is tied to an older version of Unicode.
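In the MATLAB context of this page, a middle ground is to leave the stored name alone and normalize both sides to NFC only for the comparison; a sketch using the JVM bundled with MATLAB (java.text.Normalizer, available since Java 6):
nfd = ['a' char(776)]; % 'a' + U+0308 combining diaeresis, as HFS+ might return it
nfcForm = javaMethod('valueOf', 'java.text.Normalizer$Form', 'NFC');
nfc = char(java.text.Normalizer.normalize(nfd, nfcForm)); % single character U+00E4 'ä'
strcmp(nfc, char(228)) % a plain comparison now works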

Why does Matlab only accept a small set of characters in script filenames?

Matlab requires that script file names be limited to 63 characters
>> namelengthmax
ans =
63
and these 63 characters must come from a small character set that excludes - and other symbols.
Why does Matlab limit the filenames and is there a workaround?
The comment from beaker answers part of your question. Because script files can also be function names, you are restricted in the characters they can contain.
For example, if you had a file (function) named foot-ball.m, then when you call it in an instruction, Matlab couldn't differentiate between:
a = foot-ball ;
meaning a call to a function named foot-ball.m (actually impossible), and
a = foot-ball ;
meaning the assignment to the variable "a" of the result of function foot.m minus the result of function ball.m.
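In other words, a file can be called as a function only if its name is a valid MATLAB identifier, which can be checked directly:
isvarname('foot_ball') % true:  letters, digits and underscore only, and short enough
isvarname('foot-ball') % false: '-' would parse as a minus operator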
As to the maximum length, there is no workaround (yet) to my knowledge (until Matlab lifts the restriction).
Remember that your operating system also limits the length of a file name (and of the full path). On Windows it is 256+4 characters (MAX_PATH = 260). So I guess limiting a file name to 63 characters just leaves about 193 characters for the rest of the full path, a budget that can be used up faster than you might think.
If your file name were 255 characters long, you would have no choice but to put the file directly into c:\, or the operating system couldn't access it (so Matlab couldn't call it, obviously).
Use the instruction len = namelengthmax to get the actual maximum length on your system. You can read more about it in the documentation page Specify File Names.
Or read about a similar problem from another user: Extending the maximum length of MATLAB function names. Note that this user couldn't bypass the length limit; he had to find another way to fit all the information he wanted into the maximum file name length.

Get MATLAB Engine to return unicode

The MATLAB Engine is a C interface to MATLAB. It provides a function engEvalString() which takes some MATLAB code as a C string (char *), evaluates it, then returns MATLAB's output as a C string again.
I need to be able to pass unicode data to MATLAB through engEvalString() and to retrieve the output as unicode. How can I do this? I don't care about the particular encoding (UTF-8, UTF-16, etc.), any will do. I can adapt my program.
More details:
To give a concrete example, if I send the following string, encoded as, say, UTF-8,
s='Paul Erdős'
I would like to get back the following output, encoded again as UTF-8:
s =
Paul Erdős
I hoped to achieve this by sending feature('DefaultCharacterSet', 'UTF-8') (reference) before doing anything else, and this worked fine when working with MATLAB R2012b on OS X. It also works fine with R2013a on Ubuntu Linux. It does not work on R2013a on OS X though. Instead of the character ő in the output of engEvalString(), I get character code 26, which is supposed to mean "I don't know how to represent this". However, if I retrieve the contents of the variable s by other means, I see that MATLAB does correctly store the character ő in the string. This means that it's only the output that didn't work, but MATLAB did interpret the UTF-8 input correctly. If I test this on Windows with R2013a, neither input, nor output works correctly. (Note that the Windows and the Mac/Linux implementations of the MATLAB Engine are different.)
The question is: how can I get unicode input/output working on all platforms (Win/Mac/Linux) with engEvalString()? I need this to work in R2013a, and preferably also in R2012b.
If people are willing to experiment, I can provide some test C code. I'm not posting that yet because it's a lot of work to distill a usable small example from my code that makes it possible to experiment with different encodings.
UPDATE:
I learned about feature('locale') which returns some locale-related data. On Linux, where everything works correctly, all encodings it returns are UTF-8. But not on OS X / Windows. Is there any way I could set the various encodings returned by feature('locale')?
UPDATE 2:
Here's a small test case: download. The zip file contains a MATLAB Engine C program, which reads a file, passes it to engEvalString(), then writes the output to another file. There's a sample file included with the following contents:
feature('DefaultCharacterSet', 'UTF-8')
feature('DefaultCharacterSet')
s='中'
The (last part of the) output I expect is
>>
s =
中
This is what I get with R2012b on OS X. However, R2013a on OS X gives me character code 26 instead of the character 中. The outputs produced by R2012b and R2013a are included in the zip file.
How can I get the expected output with R2013a on all three platforms (Windows, OS X, Linux)?
I strongly urge you to use engPutVariable, engGetVariable, and Matlab's eval instead. What you're trying to do with engEvalString will not work with many unicode strings due to embedded NULL (\0) characters, among other problems. Due to how the Windows COM interface works, the Matlab engine can't really support unicode in interpreted strings. I can't speculate about how the engine works on other platforms.
Your other question had an answer about using mxCreateString_UTF16. Wasn't that sufficient?
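If engEvalString cannot be avoided, one workaround (a sketch, not an engine feature) is to keep the evaluated command itself pure ASCII and ship the non-ASCII payload as numeric bytes:
% This command string contains only 7-bit ASCII, so it survives engEvalString;
% the UTF-8 payload is decoded on the MATLAB side.
bytes = uint8([80 97 117 108 32 69 114 100 197 145 115]); % 'Paul Erdős' in UTF-8
s = native2unicode(bytes, 'UTF-8');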

What encoding Win32 API functions expect?

For example, the MessageBox function has an LPCTSTR-typed argument for the text and caption, which is a pointer to wchar_t when _UNICODE is defined, or a pointer to char when _MBCS is defined.
How does the MessageBox function interpret those strings? In which encoding?
Only explanation I managed to find is this:
http://msdn.microsoft.com/en-us/library/cwe8bzh0(VS.90).aspx
But it doesn't say anything about the encoding, just that in the case of _UNICODE one character takes up one wchar (which is 16-bit on Windows), and in the case of _MBCS one or two chars (8-bit).
So are those some Microsoft versions of UTF-8 and UTF-16 that ignore anything that has to be encoded in 3 or 4 bytes in the case of UTF-8, and anything that has to be encoded in 4 bytes in the case of UTF-16? And is there a way to show anything outside the Basic Multilingual Plane of Unicode with MessageBox?
There are normally two different implementations of each function:
MessageBoxA, which accepts ANSI strings
MessageBoxW, which accepts Unicode strings
Here, 'ANSI' means the multi-byte code page currently assigned to the process. This varies according to the user's preferences and locale setting, although Win32 API functions such as WideCharToMultiByte can be counted on to do the right conversion, and the GetACP function will tell you the code page in use. MSDN explains the ANSI code page and how it interacts with Unicode.
'Unicode' generally means UCS-2; that is, support for characters above 0xFFFF isn't consistent. I haven't tried this, but UI functions such as MessageBox in recent versions (> Windows 2000) should support characters outside the BMP.
The ...A functions are obsolete and only wrap the ...W functions. The former were required for compatibility with Windows 9x, but since that is no longer used, you should avoid them at all costs and use the ...W functions exclusively. They require UTF-16 strings, the only native Windows encoding. All modern Windows versions should support non-BMP characters quite well (provided there is a font that has these characters, of course).