I run a MATLAB function and, after upgrading to a new computer, MATLAB R2017b is not able to do string comparisons containing umlauts, because all the umlauts are displayed as ?. See below:
strcmp(factor_struct.conditions(i,k),'gr?sser als')
Changing ? back to ö does not work, as MATLAB seems unable to display this character properly.
Is there a setting to change in order to be able to read these kinds of characters?
I managed to solve the problem by changing the OS language/region format, as suspected.
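A minimal sketch of what goes wrong, shown in Python purely for illustration (the original code is MATLAB): once the source is read in the wrong encoding, ö (U+00F6) is replaced by ?, and the literal can no longer match the stored condition string:

```python
# 'ö' is code point U+00F6 (246); after mangling it becomes '?' (U+003F),
# and the comparison fails even though the strings look similar on screen.
assert ord("ö") == 0xF6
assert "grösser als" == "gr\u00f6sser als"   # the intended literal
assert "grösser als" != "gr?sser als"        # the mangled literal
```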
I'm using FreeType 2.5.3 in a portable OpenGL application.
My issue is that I can't get non-Latin (Unicode) glyphs to render on my Windows machine, while they render correctly on my other systems (Lubuntu, OS X, Android).
I'm using the well-known arialuni.ttf (23 MB), so I'm pretty sure it contains everything. In fact, I had this working on my previous Windows installation (Win7), then re-installed Win7 from another source, and now Unicode is not working right.
Specifically, when I draw a string, only the Latin characters are rendered while the rest are skipped. I dug deeper and found that the character codes in the wstring are not what they should be. For example, the string contains Greek letters like γ, which should have the code point 947.
My engine just iterates over the wstring characters and maps each code point to another vector that holds texture coordinates, so I can draw the glyph.
The problem is that on my Windows 7 machine the wstring does not give me 947 for a γ; instead it gives me 179. In addition, the character Ά comes back as two characters with code 206 (??) instead of one with code 902.
It's just a simple iteration over the wstring, like:
for (size_t c = 0, sz = wtext.size(); c < sz; c++) {
    uint32_t ch = wtext[c]; // code point of the c-th character
    ...
}
This only happens on my newly installed Win7; it worked before on another Win7 system, as well as on all my Linux machines. Now it's broken here, and also in my XP virtual machine.
I don't use any wide formatting functions for this, just:
wstring wtext = L"blΆh";
In addition, I can see my glyphs being rendered correctly in my OpenGL texture, so it's not a font issue either. My font generator uses the Greek range of code points (~900-950) to collect the glyphs.
I add the code points per language with this:
FT_UInt gindex; // glyph index; becomes 0 when there are no more characters
FT_ULong charcode = FT_Get_First_Char(face, &gindex); // returns the character code
while (gindex != 0) {
    ... // record charcode here
    charcode = FT_Get_Next_Char(face, charcode, &gindex);
}
Not sure why, but I fixed it by saving the source file as UTF-8 with BOM, rather than plain UTF-8 (which I had by default). Without the BOM, the compiler apparently interpreted the UTF-8 source as a single-byte codepage, so each byte of a multi-byte character ended up as a separate character in the wide string literal.
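This fix matches the numbers in the question: γ (U+03B3, code point 947) encodes to the two UTF-8 bytes 206 and 179, so a compiler that misreads UTF-8 source as a single-byte codepage (which the BOM prevents) sees exactly those values. A quick check, shown in Python for illustration:

```python
gamma = "γ"  # GREEK SMALL LETTER GAMMA, U+03B3
assert ord(gamma) == 947
assert list(gamma.encode("utf-8")) == [206, 179]  # the bytes seen in the wstring

# Misreading those UTF-8 bytes as a single-byte encoding (e.g. Latin-1)
# yields two separate characters with exactly those code points:
misread = gamma.encode("utf-8").decode("latin-1")
assert [ord(c) for c in misread] == [206, 179]

# Likewise Ά (U+0386, code point 902) becomes a two-byte sequence led by 206:
assert ord("Ά") == 902
assert list("Ά".encode("utf-8"))[0] == 206
```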
I wrote an application that prefers NFC. When I get a filename from OS X, however, it is normalized as NFD. As far as I know, I shouldn't convert the data, as mentioned here:
http://www.win.tue.nl/~aeb/linux/uc/nfc_vs_nfd.html
[...] Not because something is wrong with NFD, or this version of NFD, but because one should never change data. Filenames must not be normalized. [...]
When I compare the filename with the user input (which is in NFC), I have to implement a corresponding comparison function that takes Unicode equivalence into account. But that could be much slower than necessary. Wouldn't it be better to normalize the filename to NFC instead? It would improve the speed a lot, since then only a memory compare is involved.
The accuracy of the advice you link to depends on the filesystem in question.
The 'standard' Linux filesystems do not prescribe an encoding for filenames (they are treated as raw bytes), so assuming they are UTF-8 and normalising them is an error and may cause problems.
On the other hand, the default filesystem on Mac OS X (HFS+) requires all filenames to be valid UTF-16 in a variant of NFD. If you need to compare file paths, you should do so in a similar form, ideally using the APIs provided by the system, as its NFD variant is tied to an older version of Unicode.
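A minimal sketch of the trade-off, using Python's unicodedata module for illustration: normalize only your comparison keys, leaving the stored filename bytes untouched, and canonically equivalent strings then compare equal:

```python
import unicodedata

nfc = "\u00e9"                           # 'é' as one precomposed code point
nfd = unicodedata.normalize("NFD", nfc)  # 'e' + U+0301 combining acute accent
assert nfc != nfd                        # a raw memory compare fails
assert len(nfc) == 1 and len(nfd) == 2   # different lengths, same rendered text

# Normalize a *copy* for the comparison; the original filename stays as-is:
assert unicodedata.normalize("NFC", nfd) == nfc
```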
Given the following code that attempts to create 2 folders in the current MATLAB path:
%%
u_path1 = native2unicode([107, 97, 116, 111, 95, 111, 117, 116, 111, 117], 'UTF-8'); % 'kato_outou'
u_path2 = native2unicode([233 129 142, 230 184 161, 229 191 156, 231 173 148], 'UTF-8'); % '過渡応答'
mkdir(u_path1);
mkdir(u_path2);
the first mkdir call succeeds while the second fails with the error message "The filename, directory name, or volume label syntax is incorrect". However, creating the folders manually in the "Current Folder" GUI panel ([right click]⇒New Folder⇒[paste name]) poses no problem. This kind of glitch appears in most of MATLAB's low-level I/O functions (dir, fopen, copyfile, movefile, etc.), and I'd like to use all of these functions.
The environment is:
Win7 Enterprise (32 bit, NTFS)
MATLAB R2012a
thus the filesystem supports Unicode characters in paths, and MATLAB can store true Unicode strings (and not "fake" them).
The official mkdir documentation elegantly{1} avoids the issue by stating that the correct syntax for calling the function is:
mkdir('folderName')
which suggests that the only officially supported call is one that uses a string literal for the folder-name argument, not a string variable. That would also suggest the eval way, which I'm testing as I write this post.
I wonder if there is a way to circumvent these limitations. I would be interested in solutions that:
don't rely on undocumented/unsupported MATLAB stuff;
don't involve system-wide changes (e.g changing operating system's locale info);
may eventually rely on non-native MATLAB libraries, as long as the resulting handles/objects can be converted to native MATLAB objects and manipulated as such;
may rely eventually on manipulations of the paths that would render them usable by the standard MATLAB functions, even if Windows specific (e.g. short-name paths).
Later edit
What I'm looking for are implementations for the following functions, which will shadow the originals in the code that is already written:
function listing = dir(folder);
function [status,message,messageid] = mkdir(folder1,folder2);
function [status,message,messageid] = movefile(source,destination,flag);
function [status,message,messageid] = copyfile(source,destination,flag);
function [fileID, message] = fopen(filename, permission, machineformat, encoding);
function status = fclose(fileID);
function [A, count] = fread(fileID, sizeA, precision, skip, machineformat);
function count = fwrite(fileID, A, precision, skip, machineformat);
function status = feof(fileID);
function status = fseek(fileID, offset, origin);
function [C,position] = textscan(fileID, varargin); %'This one is going to be funny'
Not all the output types need to be interchangeable with those of the original MATLAB functions, but they need to be consistent between function calls (e.g., the fileID between fopen and fclose). I am going to update this declaration list with implementations as soon as I get/write them.
{1} for very loose meanings of the word "elegant".
Some useful information on how MATLAB handles filenames (and characters in general) is available in the comments of this UndocumentedMatlab post (especially those by Steve Eddins, who works at MathWorks). In short:
"MathWorks began to convert all the string handling in the MATLAB code base to UTF-16 .... and we have approached it incrementally"
--Steve Eddins, December 2014.
This statement implies that the newer the version of MATLAB, the more features support UTF-16. This in turn means that if you have the option of updating your version of MATLAB, doing so may be an easy solution to your problem.
Below is a list of functions that were tested by users on different platforms, according to the functionality that was requested in the question:
The following command creates a directory with Unicode characters in its name ("תיקיה" in Hebrew, in this example) from within MATLAB:
java.io.File(fullfile(pwd,native2unicode(...
[255 254 234 5 217 5 231 5 217 5 212 5],'UTF-16'))).mkdir();
Tested on:
Windows 7 with MATLAB R2015a by Dev-iL
OSX Yosemite (10.10.4) with MATLAB R2014b by Matt
The following commands also seem to create directories successfully:
mkdir(native2unicode([255 254 234 5 217 5 231 5 217 5 212 5],'utf-16'));
mkdir(native2unicode([215,170,215,153,215,167,215,153,215,148],'utf-8'));
Tested on:
Windows 10 with MATLAB R2015a having feature('DefaultCharacterSet') => windows-1255 by Dev-iL
OSX Yosemite (10.10.4) with MATLAB R2014b by Matt
The value of feature('DefaultCharacterSet') has no influence here, because the encoding is explicitly specified in the native2unicode call.
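For reference, the two byte sequences used in the mkdir commands above decode to the same Hebrew string; a quick check, with Python's decode as a stand-in for native2unicode (note the leading 255 254, i.e. FF FE, which is the UTF-16 little-endian BOM):

```python
utf8_bytes  = bytes([215, 170, 215, 153, 215, 167, 215, 153, 215, 148])
utf16_bytes = bytes([255, 254, 234, 5, 217, 5, 231, 5, 217, 5, 212, 5])  # FF FE = BOM
assert utf8_bytes.decode("utf-8") == utf16_bytes.decode("utf-16") == "תיקיה"
```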
The following commands successfully open a file having unicode characters both in its name and as its content:
fid = fopen([native2unicode([255,254,231,5,213,5,209,5,229,5],'utf-16') '.txt']);
txt = textscan(fid,'%s');
Tested on:
Windows 10 with MATLAB R2015a having feature('DefaultCharacterSet') => windows-1255 by Dev-iL. Note: the scanned text appears correctly in the Variables view. The text file can be edited and saved from the MATLAB editor with UTF characters intact.
OSX Yosemite (10.10.4) with MATLAB R2014b by Matt
If feature('DefaultCharacterSet') is set to utf-8 before using textscan, the output of celldisp(txt) is displayed correctly. The same applies to the Variables view.
Try to use UTF-16 if you are on Windows, because NTFS uses UTF-16 for filename encoding, and Windows has two sets of APIs: those that work with so-called 'Windows codepages' (1250, 1251, 1252, etc.) and use C's char data type, and those that use wchar_t. The latter type has a size of 2 bytes on Windows, which is enough to store a UTF-16 code unit.
The reason your first call worked is that the first 128 code points of the Unicode Standard are encoded in UTF-8 identically to the 128 ASCII characters (this was done on purpose, for backwards compatibility). UTF-8 uses 1-byte code units (instead of the 2-byte code units of UTF-16), and software such as MATLAB usually does not process filenames; it just needs to store byte sequences and pass them to the OS APIs.
The second call probably failed because the UTF-8 byte sequences representing the code points are filtered out by Windows, since some byte values are prohibited in filenames.
On POSIX-conformant operating systems, most APIs are byte-oriented, and the standard pretty much prevents you from using encodings with multi-byte code units (e.g., UTF-16, UTF-32) in APIs; you have to use char* APIs and encodings with 1-byte code units:
POSIX.1-2008 places only the following requirements on the encoded values of the characters in the portable character set:
...
The encoded values associated with <period>, <slash>, <newline>, and <carriage-return> shall be invariant across all locales supported by the implementation.
The encoded values associated with the members of the portable character set are each represented in a single byte. Moreover, if the value is stored in an object of C-language type char, it is guaranteed to be positive (except the NUL, which is always zero).
Not all POSIX-conformant operating systems validate filenames beyond the slash separator (and NUL), so you can pretty much store garbage in them. Mac OS X, as a POSIX system, uses byte-oriented (char*) APIs, but the underlying HFS+ uses UTF-16 in NFD (Normalization Form D), so some processing is done at the OS level before a filename is stored.
Windows does not perform any kind of Unicode normalization and stores filenames in whatever form they are passed, either in UTF-16 (provided NTFS is used) or in Windows codepages (I am not sure how these are handled at the filesystem level; probably by conversion).
So, how does this relate to MATLAB? Well, it is cross-platform and has to deal with many issues because of that. One of them is that Windows has char APIs tied to Windows codepages, plus certain characters forbidden in filenames, while other OSes do not. MathWorks could implement system-dependent checks, but that would be much harder to test and support (much code churn, I guess).
My best advice is to use UTF-16 on Windows and implement platform-dependent checks, or to stick to ASCII if you need portability.
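The ASCII/UTF-8 backwards-compatibility point above is easy to verify; a small sketch in Python (used only for illustration), with the folder names from the question:

```python
ascii_name = "kato_outou"
# For pure-ASCII names the UTF-8 bytes are identical to the ASCII bytes,
# which is why the first mkdir call succeeds regardless of encoding handling:
assert ascii_name.encode("utf-8") == ascii_name.encode("ascii")

jp_name = "過渡応答"
utf8 = jp_name.encode("utf-8")
# Every byte of the Japanese name lies outside the ASCII range, so any layer
# that re-interprets the bytes in a single-byte codepage will mangle it:
assert len(utf8) == 12 and all(b >= 0x80 for b in utf8)
```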
I am using wxMac 2.8 in a non-Unicode build. I try to read a file containing umlauts ("ü") into a wxTextCtrl. When I do, the data gets interpreted in the current encoding, but it is a multibyte string. I narrowed the problem down to this:
text_ctrl->Clear();
text_ctrl->SetValue("üüüäääööößßß");
This is the result:
üüüäääööößßß
Note that the character count has doubled; printing the string in gdb displays "\303\274" and similar for each original character. Typing "ü" or similar into the text control is no problem. I tried various wxMBConv methods, but the result is always the same. Is there a way to solve this?
If you use anything other than 7-bit ASCII, you must use a Unicode build of wxWidgets. Just do yourself a favour and switch to it. If you have too much existing code that was written for the "ANSI" build of wxWidgets 2.8 and earlier and doesn't compile with a Unicode build, use wxWidgets 2.9 instead, where it will compile, and work as intended.
It sounds like your text editor (for the program's source code) uses a different encoding from the running program.
Suppose, for example, that your text-entry control and the rest of your program are (correctly) using UTF-8. Now, if your text editor is using some other encoding, a string that looks fine on screen will actually contain garbage bytes.
Assuming you are in a position to help create a pure-UTF-8 world, you should:
1) Encode UTF-8 directly into the string literals using escapes, e.g. "\303" or "\xc3". That's annoying to do, but it means you just don't have to worry about your text editor (or the editor settings of other developers).
2) Then check that the program is using UTF-8 everywhere.