Get MATLAB Engine to return unicode - matlab

The MATLAB Engine is a C interface to MATLAB. It provides a function engEvalString() which takes some MATLAB code as a C string (char *), evaluates it, then returns MATLAB's output as a C string again.
I need to be able to pass unicode data to MATLAB through engEvalString() and to retrieve the output as unicode. How can I do this? I don't care about the particular encoding (UTF-8, UTF-16, etc.), any will do. I can adapt my program.
More details:
To give a concrete example, if I send the following string, encoded as, say, UTF-8,
s='Paul Erdős'
I would like to get back the following output, encoded again as UTF-8:
s =
Paul Erdős
I hoped to achieve this by sending feature('DefaultCharacterSet', 'UTF-8') (reference) before doing anything else, and this worked fine when working with MATLAB R2012b on OS X. It also works fine with R2013a on Ubuntu Linux. It does not work on R2013a on OS X though. Instead of the character ő in the output of engEvalString(), I get character code 26, which is supposed to mean "I don't know how to represent this". However, if I retrieve the contents of the variable s by other means, I see that MATLAB does correctly store the character ő in the string. This means that it's only the output that didn't work, but MATLAB did interpret the UTF-8 input correctly. If I test this on Windows with R2013a, neither input, nor output works correctly. (Note that the Windows and the Mac/Linux implementations of the MATLAB Engine are different.)
The question is: how can I get unicode input/output working on all platforms (Win/Mac/Linux) with engEvalString()? I need this to work in R2013a, and preferably also in R2012b.
If people are willing to experiment, I can provide some test C code. I'm not posting that yet because it's a lot of work to distill a usable small example from my code that makes it possible to experiment with different encodings.
UPDATE:
I learned about feature('locale') which returns some locale-related data. On Linux, where everything works correctly, all encodings it returns are UTF-8. But not on OS X / Windows. Is there any way I could set the various encodings returned by feature('locale')?
UPDATE 2:
Here's a small test case: download. The zip file contains a MATLAB Engine C program, which reads a file, passes it to engEvalString(), then writes the output to another file. There's a sample file included with the following contents:
feature('DefaultCharacterSet', 'UTF-8')
feature('DefaultCharacterSet')
s='中'
The (last part of the) output I expect is
>>
s =
中
This is what I get with R2012b on OS X. However, R2013a on OS X gives me character code 26 instead of the character 中. Outputs produced by R2012b and R2013a are included in the zip file.
How can I get the expected output with R2013a on all three platforms (Windows, OS X, Linux)?

I strongly urge you to use engPutVariable, engGetVariable, and Matlab's eval instead. What you're trying to do with engEvalString will not work with many unicode strings due to embedded NULL (\0) characters, among other problems. Due to how the Windows COM interface works, the Matlab engine can't really support unicode in interpreted strings. I can't speculate about how the engine works on other platforms.
Your other question had an answer about using mxCreateString_UTF16. Wasn't that sufficient?

Related

German Umlaut in Matlab function not displayed properly

I run some MATLAB functions and, after upgrading to a new computer, MATLAB 2017b is not able to do string comparisons containing umlauts, because all the umlauts are displayed as ?. See below:
strcmp(factor_struct.conditions(i,k),'gr?sser als')
Changing ? to ö does not work, as MATLAB seems unable to display this character properly.
Is there a setting to change in order to be able to read that type of character?
I managed to solve the problem by changing the OS language/regional format settings, indeed.

Freetype unicode on Windows

I'm using Freetype 2.5.3 on a portable OpenGL application.
My issue is that I can't get Unicode characters to render on my Windows machine, while I get them correctly on my other systems (Lubuntu, OS X, Android).
I'm using the famous arialuni.ttf (23 MB), so I'm pretty sure it contains everything. In fact, I had this working in my previous Windows installation (Win7), then re-installed Win7 from another source, and now Unicode is not working right.
Specifically, when I draw a string, only the Latin characters are rendered while the non-Latin ones get skipped. I dug deeper and found that the character codes are not what they should be in the wstring. For example, I'm using some Greek letters in the string, like γ, which I know should have the code point 947.
My engine just iterates over the wstring characters and feeds each code point into another vector that holds texture coordinates, so I can draw the glyph.
The problem is that on my Windows 7 machine the wstring does not give me 947 for γ; instead it gives me 179. In addition, the character Ά comes back as two characters with code 206 (??) instead of one with code 902.
I simply iterate over the wstring, like:
for (size_t c = 0, sz = wtext.size(); c < sz; c++) {
    uint32_t ch = wtext[c]; // code point
    ...
}
This is only happening on my newly installed Win7; it worked before on another Win7 system, along with all my Linux machines. Now it's broken on this one, and also on my XP virtual machine.
I don't use any wide formatting functions for this, just:
wstring wtext = L"blΆh";
In addition, I can see my glyphs being rendered correctly in my OpenGL texture, so it's not a font issue either. My font generator uses the Greek range of roughly 900-950 code points to collect the glyphs.
I add the code points per language with this:
FT_UInt gindex; // glyph index; FT_Get_*_Char writes it through the pointer
FT_ULong charcode = FT_Get_First_Char(face, &gindex);
while (gindex != 0) {
    // ... record charcode ...
    charcode = FT_Get_Next_Char(face, charcode, &gindex);
}
Not sure why, but I fixed it by saving the source file as UTF-8 with BOM, rather than plain UTF-8 (which I had by default).

Normalize filenames to NFC or not (Unicode)

I wrote an application that prefers NFC. When I get a filename from OS X, though, it's normalized as NFD. As far as I know I shouldn't convert the data, as mentioned here:
http://www.win.tue.nl/~aeb/linux/uc/nfc_vs_nfd.html
[...](Not because something is wrong with NFD, or this version of NFD,
but because one should never change data. Filenames must not be
normalized.)[...]
When I compare the filename with the user input (which is in NFC), I have to implement a comparison function that takes Unicode equivalence into account. But that could be much slower than necessary. Wouldn't it be better to normalize the filename to NFC instead? It would improve speed a lot, since only a memory compare would be involved.
The accuracy of the advice you link to depends on the filesystem in question.
The 'standard' Linux file systems do not prescribe an encoding for filenames (they are treated as raw bytes), so assuming they are UTF-8 and normalising them is an error and may cause problems.
On the other hand, the default filesystem on Mac OS X (HFS+) forces all filenames to be valid UTF-16 in a variant of NFD. If you need to compare file paths, you should do so in a similar form, ideally using the APIs provided by the system, since its NFD form is tied to an older version of Unicode.

Unicode paths with MATLAB

Given the following code that attempts to create 2 folders in the current MATLAB path:
%%
u_path1 = native2unicode([107, 97, 116, 111, 95, 111, 117, 116, 111, 117], 'UTF-8'); % 'kato_outou'
u_path2 = native2unicode([233 129 142, 230 184 161, 229 191 156, 231 173 148], 'UTF-8'); % '過渡応答'
mkdir(u_path1);
mkdir(u_path2);
the first mkdir call succeeds while the second fails with the error message "The filename, directory name, or volume label syntax is incorrect". However, creating the folders manually in the "Current Folder" GUI panel ([right click]⇒New Folder⇒[paste name]) encounters no problem. This kind of glitch appears in most of MATLAB's low-level I/O functions (dir, fopen, copyfile, movefile, etc.), and I'd like to use all of these functions.
The environment is:
Win7 Enterprise (32 bit, NTFS)
MATLAB R2012a
thus the filesystem supports Unicode chars in path, and MATLAB can store true Unicode strings (and not "fake" them).
The mkdir official documentation elegantly{1} avoids the issue by stating that the correct syntax for calling the function is:
mkdir('folderName')
which suggests that the only officially supported call for the function is the one that uses string literals for folder name argument, and not string variables. That would also suggest the eval way—which I'm testing to see if it's working as I write this post.
I wonder if there is a way to circumvent these limitations. I would be interested in solutions that:
don't rely on undocumented/unsupported MATLAB stuff;
don't involve system-wide changes (e.g changing operating system's locale info);
may rely eventually on non-native MATLAB libraries, as long the resulting handles/objects can be converted to MATLAB native objects and manipulated as such;
may rely eventually on manipulations of the paths that would render them usable by the standard MATLAB functions, even if Windows specific (e.g. short-name paths).
Later edit
What I'm looking for are implementations for the following functions, which will shadow the originals in the code that is already written:
function listing = dir(folder);
function [status,message,messageid] = mkdir(folder1,folder2);
function [status,message,messageid] = movefile(source,destination,flag);
function [status,message,messageid] = copyfile(source,destination,flag);
function [fileID, message] = fopen(filename, permission, machineformat, encoding);
function status = fclose(fileID);
function [A, count] = fread(fileID, sizeA, precision, skip, machineformat);
function count = fwrite(fileID, A, precision, skip, machineformat);
function status = feof(fileID);
function status = fseek(fileID, offset, origin);
function [C,position] = textscan(fileID, varargin); %'This one is going to be funny'
Not all the output types need to be interchangeable with the original MATLAB functions, but they need to be consistent between function calls (e.g. the fileID shared by fopen and fclose). I am going to update this declaration list with implementations as soon as I get/write them.
{1} for very loose meanings of the word "elegant".
Some useful information on how MATLAB handles filenames (and characters in general) is available in the comments of this UndocumentedMatlab post (especially those by Steve Eddins, who works at MathWorks). In short:
"MathWorks began to convert all the string handling in the MATLAB code base to UTF-16 .... and we have approached it incrementally"
--Steve Eddins, December 2014.
This statement implies that the newer the version of MATLAB, the more features support UTF-16. This in turn means that if a possibility to update your version of MATLAB exists, it may be an easy solution to your problem.
Below is a list of functions that were tested by users on different platforms, according to the functionality that was requested in the question:
The following command creates a directory with UTF-16 characters in its name ("תיקיה" in Hebrew, in this example) from within MATLAB:
java.io.File(fullfile(pwd,native2unicode(...
[255 254 234 5 217 5 231 5 217 5 212 5],'UTF-16'))).mkdir();
Tested on:
Windows 7 with MATLAB R2015a by Dev-iL
OSX Yosemite (10.10.4) with MATLAB R2014b by Matt
The following commands also seem to create directories successfully:
mkdir(native2unicode([255 254 234 5 217 5 231 5 217 5 212 5],'utf-16'));
mkdir(native2unicode([215,170,215,153,215,167,215,153,215,148],'utf-8'));
Tested on:
Windows 10 with MATLAB R2015a having feature('DefaultCharacterSet') => windows-1255 by Dev-iL
OSX Yosemite (10.10.4) with MATLAB R2014b by Matt
The value of feature('DefaultCharacterSet') has no influence here because the encoding is explicitly defined in the command native2unicode.
The following commands successfully open a file having unicode characters both in its name and as its content:
fid = fopen([native2unicode([255,254,231,5,213,5,209,5,229,5],'utf-16') '.txt']);
txt = textscan(fid,'%s');
Tested on:
Windows 10 with MATLAB R2015a having feature('DefaultCharacterSet') => windows-1255 by Dev-iL. Note: the scanned text appears correctly in the Variables view. The text file can be edited and saved from the MATLAB editor with UTF characters intact.
OSX Yosemite (10.10.4) with MATLAB R2014b by Matt
If feature('DefaultCharacterSet') is set to utf-8 before using textscan, the output of celldisp(txt) is displayed correctly. The same applies to the Variables view.
Try to use UTF-16 if you are on Windows, because NTFS uses UTF-16 for filename encoding, and Windows has two sets of APIs: those that work with so-called 'Windows code pages' (1250, 1251, 1252, etc.) and use C's char data type, and those that use wchar_t. The latter type has a size of 2 bytes on Windows, which is enough to store UTF-16 code units.
The reason your first call worked is that the first 128 code points of the Unicode Standard are encoded in UTF-8 identically to the 128 ASCII characters (done on purpose, for backwards compatibility). UTF-8 uses 1-byte code units (instead of the 2-byte code units of UTF-16), and software such as MATLAB usually does not process filenames at all; it just stores byte sequences and passes them on to the OS APIs. The second call probably failed because the UTF-8 byte sequence went through Windows' char (code-page) API, where some of those byte values are invalid or prohibited in filenames. On POSIX-conformant operating systems most APIs are byte-oriented, and the standard pretty much prevents you from using encodings with multi-byte code units (e.g. UTF-16, UTF-32) for filenames; you have to use char* APIs and encodings with 1-byte code units:
POSIX.1-2008 places only the following requirements on the encoded
values of the characters in the portable character set:
...
The encoded values associated with <period>, <slash>, <newline>, and <carriage-return> shall be invariant across all locales supported by the implementation.
The encoded values associated with the members of the portable character set are each represented in a single byte. Moreover, if the value is stored in an object of C-language type char, it is guaranteed to be positive (except the NUL, which is always zero).
Not all POSIX-conformant operating systems validate filenames beyond checking for the period and slash characters, so you can pretty much store garbage in them. Mac OS X, as a POSIX system, uses byte-oriented (char*) APIs, but the underlying HFS+ stores filenames as UTF-16 in NFD (Normalization Form D), so some processing is done at the OS level before a filename is saved.
Windows does not perform any type of Unicode normalization and stores filenames in whatever form they are passed in UTF-16 (provided NTFS is used) or Windows Codepages (not sure how they handle this on the filesystem level - probably by conversion).
So, how does this relate to MATLAB? Well, it is cross-platform and has to deal with many issues because of that. One of them is that Windows has char APIs tied to Windows code pages and forbids certain characters in filenames, while other OSes do not. MATLAB could implement system-dependent checks, but that would be much harder to test and support (much code churn, I guess).
My best advice is to use UTF-16 on Windows and implement platform-dependent checks, or stick to ASCII if you need portability.

wxTextCtrl OSX mutated vowel

I am using wxMac 2.8 in a non-Unicode build. I try to read a file with umlauts ("ü") into a wxTextCtrl. When I do, the data gets interpreted in the current encoding, but it is a multibyte string. I narrowed the problem down to this:
text_ctrl->Clear();
text_ctrl->SetValue("üüüäääööößßß");
This is the result:
üüüäääööößßß
Note that the character count has doubled: printing the string in gdb displays "\303\274" and similar for each original character. Typing "ü" or similar into the text control is no problem. I tried various wxMBConv methods, but the result is always the same. Is there a way to solve this?
Best regards,
If you use anything but 7-bit ASCII, you must use a Unicode build of wxWidgets. Just do yourself a favour and switch to it. If you have too much existing code that was written for the "ANSI" build of wxWidgets 2.8 and earlier and doesn't compile with a Unicode build, use wxWidgets 2.9 instead, where it will compile and work as intended.
It sounds like your text editor (for program source code) is in a different encoding from the running program.
Suppose for example that your text entry control and the rest of your program are (correctly) using UTF-8. Now if your text editor is using some other encoding, then a string that looks fine on screen will actually contain garbage bytes.
Assuming you are in a position to help create a pure-UTF8 world, then you should:
1) Encode UTF-8 directly into the string literals using escapes, e.g. "\303" or "\xc3". That's annoying to do, but it means you just don't have to worry about your text editor (or the editor settings of other developers).
2) Then check that the program is using UTF-8 everywhere.