Special character comparison behaves differently between Windows and Linux - Unicode

I had to compare sequences of Japanese characters for sorting purposes. On Windows I'm using the CompareStringW function and on Linux I'm using the wcscasecmp function to compare. When comparing the UTF-16 character sequences below, the two functions disagree.
String 1 - {0x65e5, 0x672c}
String 2 - {0xff7a, 0xff8a}
On Windows, CompareStringW returns CSTR_GREATER_THAN, meaning String 1 is greater. Whereas on Linux, wcscasecmp returns -39317, meaning String 2 is greater.
The Windows function is called with flags (NORM_IGNORECASE | NORM_IGNOREKANATYPE | NORM_IGNOREWIDTH | SORT_STRINGSORT).
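For reference, here is a minimal sketch of the two calls (the locale argument and the null-terminated buffers are my assumptions; adapt as needed). The key difference is that wcscasecmp simply subtracts case-folded code-point values, while CompareStringW consults the locale's collation weights, so the two can legitimately disagree:

    #include <cstdio>
    #ifdef _WIN32
    #include <windows.h>
    #else
    #include <wchar.h>
    #endif

    int main()
    {
    #ifdef _WIN32
        // String 1 is "日本"; String 2 is halfwidth katakana "コハ".
        const WCHAR s1[] = {0x65e5, 0x672c, 0};
        const WCHAR s2[] = {0xff7a, 0xff8a, 0};
        // Returns CSTR_LESS_THAN (1), CSTR_EQUAL (2) or CSTR_GREATER_THAN (3),
        // based on the locale's collation weights.
        int r = CompareStringW(LOCALE_USER_DEFAULT,
                               NORM_IGNORECASE | NORM_IGNOREKANATYPE |
                               NORM_IGNOREWIDTH | SORT_STRINGSORT,
                               s1, -1, s2, -1);
        std::printf("CompareStringW: %d\n", r);
    #else
        // On Linux wchar_t is 32 bits; these BMP code points fit directly.
        const wchar_t s1[] = {0x65e5, 0x672c, 0};
        const wchar_t s2[] = {0xff7a, 0xff8a, 0};
        // wcscasecmp effectively returns towlower(*s1) - towlower(*s2) at the
        // first difference: 0x65e5 - 0xff7a = -39317.
        std::printf("wcscasecmp: %d\n", wcscasecmp(s1, s2));
    #endif
        return 0;
    }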
Can anyone kindly guide me to understand why the behavior differs between the platforms? And is there any way to get the same behavior across platforms?

Related

Get MATLAB Engine to return unicode

The MATLAB Engine is a C interface to MATLAB. It provides a function engEvalString() which takes some MATLAB code as a C string (char *), evaluates it, then returns MATLAB's output as a C string again.
I need to be able to pass unicode data to MATLAB through engEvalString() and to retrieve the output as unicode. How can I do this? I don't care about the particular encoding (UTF-8, UTF-16, etc.), any will do. I can adapt my program.
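For context, here is a minimal sketch of the call sequence involved (no error handling; how the engine is started and the buffer size are simplifications):

    #include <cstdio>
    #include "engine.h"  // MATLAB Engine C API

    int main()
    {
        char output[4096] = {0};
        Engine *ep = engOpen(nullptr);  // start MATLAB with the default command
        if (!ep) return 1;

        // Capture MATLAB's textual output into our buffer.
        engOutputBuffer(ep, output, sizeof(output) - 1);
        engEvalString(ep, "feature('DefaultCharacterSet', 'UTF-8')");
        engEvalString(ep, "s='Paul Erd\xC5\x91s'");  // the UTF-8 bytes for ő

        std::printf("%s\n", output);  // ideally echoes back: s = Paul Erdős
        engClose(ep);
        return 0;
    }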
More details:
To give a concrete example, if I send the following string, encoded as, say, UTF-8,
s='Paul Erdős'
I would like to get back the following output, encoded again as UTF-8:
s =
Paul Erdős
I hoped to achieve this by sending feature('DefaultCharacterSet', 'UTF-8') before doing anything else, and this worked fine with MATLAB R2012b on OS X. It also works fine with R2013a on Ubuntu Linux. It does not work with R2013a on OS X, though. Instead of the character ő in the output of engEvalString(), I get character code 26, which is supposed to mean "I don't know how to represent this". However, if I retrieve the contents of the variable s by other means, I see that MATLAB does correctly store the character ő in the string. This means that only the output is broken; MATLAB did interpret the UTF-8 input correctly. If I test this on Windows with R2013a, neither input nor output works correctly. (Note that the Windows and the Mac/Linux implementations of the MATLAB Engine are different.)
The question is: how can I get unicode input/output working on all platforms (Win/Mac/Linux) with engEvalString()? I need this to work in R2013a, and preferably also in R2012b.
If people are willing to experiment, I can provide some test C code. I'm not posting that yet because it's a lot of work to distill a usable small example from my code that makes it possible to experiment with different encodings.
UPDATE:
I learned about feature('locale'), which returns some locale-related data. On Linux, where everything works correctly, all the encodings it returns are UTF-8; not so on OS X and Windows. Is there any way to set the various encodings returned by feature('locale')?
UPDATE 2:
Here's a small test case: download. The zip file contains a MATLAB Engine C program, which reads a file, passes it to engEvalString(), then writes the output to another file. There's a sample file included with the following contents:
feature('DefaultCharacterSet', 'UTF-8')
feature('DefaultCharacterSet')
s='中'
The (last part of the) output I expect is
>>
s =
中
This is what I get with R2012b on OS X. However, R2013a on OS X gives me character code 26 instead of the character 中. The outputs produced by R2012b and R2013a are included in the zip file.
How can I get the expected output with R2013a on all three platforms (Windows, OS X, Linux)?
I strongly urge you to use engPutVariable, engGetVariable, and MATLAB's eval instead. What you're trying to do with engEvalString will not work with many Unicode strings due to embedded NUL (\0) characters, among other problems. Due to how the Windows COM interface works, the MATLAB engine can't really support Unicode in interpreted strings. I can't speculate about how the engine works on other platforms.
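A rough sketch of that approach (put_utf16 is my name, not an Engine API, and the variable name "s" is arbitrary). mxCreateCharArray allocates 16-bit mxChar elements, so the text never passes through the interpreted-string path:

    #include <cstring>
    #include "engine.h"  // also declares the mx* functions via matrix.h

    // Hand a UTF-16 string to MATLAB as the variable "s" without routing
    // it through engEvalString's interpreted-string path.
    void put_utf16(Engine *ep, const mxChar *text, size_t len)
    {
        mwSize dims[2] = {1, static_cast<mwSize>(len)};
        mxArray *arr = mxCreateCharArray(2, dims);   // 16-bit char array
        std::memcpy(mxGetChars(arr), text, len * sizeof(mxChar));
        engPutVariable(ep, "s", arr);                // copies into MATLAB
        mxDestroyArray(arr);                         // safe after the copy
    }

Reading results back works the same way in reverse, via engGetVariable and mxGetChars.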
Your other question had an answer about using mxCreateString_UTF16. Wasn't that sufficient?

Linux Sort vs Perl String Comparison

Because I was dealing with very large files, I sorted my base and candidate files before comparing them to see what lines were missing from the other. I did this to avoid keeping the records in memory. The sorting was done by using the Linux command-line tool, sort.
In my Perl script, I would look at whether the string in the line was lt, gt, or eq to the line in the other file, advancing the pointers in the file where necessary. However, I hit a problem when I noticed that my string comparison thought the strings in the base file were lt a string in the candidate file which contained special characters.
Is there a surefire way of making sure my Linux sort and Perl string comparisons are using the same type of string comparator?
The sort command uses the current locale, as specified by the environment variable LC_ALL, to determine the sort order for characters. Usually the easiest way to fix sorting issues is to manually set this to the C locale, which treats each 8-bit byte as a single character and compares by simple numeric value. In most shells this can be done as a one-off just for a single command by prefixing it like so:
LC_ALL=C sort < infile > outfile
This will also solve similar problems for some other text-processing programs. (E.g. I recall problems working with CSV files on a German person's computer -- this was traced back to the fact that Germans use a comma instead of a decimal point. Putting LC_ALL=C in front of the relevant commands fixed that issue too.)
[EDIT] Although Perl can be directed to treat some strings as Unicode, by default it still treats input and output as streams of 8-bit bytes, so the above approach should produce an order that is the same as Perl's sort() function. (Thanks to Ven'Tatsu for this nugget.)
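The same distinction is easy to demonstrate from C: strcoll compares according to the current locale's collation (which is what sort uses by default), while strcmp compares raw bytes (which is what LC_ALL=C sort and Perl's default sort do). A sketch, assuming the en_US.UTF-8 locale is installed:

    #include <cstdio>
    #include <cstring>
    #include <clocale>

    int main()
    {
        // Byte-wise, "apple" > "Banana" because 'a' (0x61) > 'B' (0x42);
        // a linguistic locale instead orders upper and lower case together.
        const char *a = "apple";
        const char *b = "Banana";

        std::setlocale(LC_COLLATE, "C");
        std::printf("C locale:     %d\n", std::strcoll(a, b));  // > 0

        std::setlocale(LC_COLLATE, "en_US.UTF-8");  // assumed installed
        std::printf("UTF-8 locale: %d\n", std::strcoll(a, b));  // < 0

        std::printf("byte-wise:    %d\n", std::strcmp(a, b));   // > 0
        return 0;
    }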

wxWidgets and Unicode

I want to use Korean translations in my quite large wxWidgets application. The application uses the wxWidgets translation framework, which is based on gettext.
I have working translations for French, German and Russian. I want to go Unicode anyway, but here are my questions:
1. Does my application need Unicode support to display Korean and Japanese languages?
2. If so - just for interest - why does Russian work without it, since Russian uses a Cyrillic letterset?
3. I have thousands of string literals. Do I have to prepend each and every one of them with 'L'? ( wxString foo("foo") --> wxString foo(L"foo") )
4. If so, did someone build a regex or sed or perl script to do this in ca. 500 .cpp files? ( pleeze! =) )
5. Will this change in wxWidgets 3.0?
6. A general Unicode question: I use these string literals in many descriptive and many technical ways - as displayed text, as parts of GLSL shaders, and in XML. Those APIs take char* / const char* arguments, so my internal wxString representation should not matter in these areas. Theory and practice: is this true? Any experiences to share?
7. I do some text processing (comparing, string finding, etc.) - are there any logical differences between Unicode and ANSI?
8. Is there any remarkable performance impact in using Unicode?
Thank you!
Wendy
Addressing some of your questions…
1. Does my application need Unicode support to display Korean and Japanese languages?
2. If so - just for interest - why does Russian work without it, since Russian uses a Cyrillic letterset?
Russian fits in a single-byte charset, just like western European languages (though it is a different charset). Korean and Japanese (and Chinese) don't. There are many workarounds for this, but the most elegant I know of to date is to use Unicode so that you don't need to rebuild your application for each locale; just change its message catalog.
6. A general Unicode question: I use these string literals in many descriptive and many technical ways - as displayed text, as parts of GLSL shaders, and in XML. Those APIs take char* / const char* arguments, so my internal wxString representation should not matter in these areas. Theory and practice: is this true? Any experiences to share?
Only strings that are going to be shown to (non-technical) users need to be localized, so they're the only ones that have to be in Unicode. The most common approach is to use UTF-8 (which is a particular way of encoding Unicode) as that means that ASCII strings – the most common type passed around inside programs – are exactly the same, which simplifies things a lot. The down-side is that you no longer have cheap indexing into the string as not all characters are the same number of bytes long. That can be anything from a non-issue to a right royal hindering PITA, depending on what the program is doing.
7. I do some text processing (comparing, string finding, etc.) - are there any logical differences between Unicode and ANSI?
Comparisons work fine, as does simple string finding. Other operations (e.g., getting the 20th character of a string, or working out how many characters into a string you've found a substring) are nasty because you've not got constant character widths. The nastiness can be mitigated by using wide characters, but they're less nice to use for external data (they introduce potential problems with endianness unless you go into working with byte-order marks, and that's another matter right there).
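To make that concrete, here is what "give me the n-th character" turns into with UTF-8: a linear scan that skips continuation bytes (utf8_index is a hypothetical helper, not a wxWidgets API):

    #include <cstddef>

    // Return a pointer to the start of the n-th character (0-based) of a
    // UTF-8 string, or nullptr if the string is shorter than that.
    // Continuation bytes match the bit pattern 10xxxxxx.
    const char *utf8_index(const char *s, std::size_t n)
    {
        for (; *s; ++s) {
            if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80) {
                if (n == 0)
                    return s;  // found the start of the n-th character
                --n;
            }
        }
        return nullptr;
    }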
8. Is there any remarkable performance impact in using Unicode?
Depends on exactly what you do. With UTF-8, if you're mostly dealing with ASCII text in reality then you get very little in the way of performance problems for most operations. With wide characters, you take more memory for every character, which naturally has performance implications (but which might be acceptable because it does mean you've got constant-time indexing).
1. There's a Korean .po file on http://www.wxwidgets.org/about/i18n.php for wxWidgets' own strings. If your application displays wxWidgets' own strings correctly when using that file, then it does not need Unicode support to display Korean and Japanese languages.
2. ISO-8859-5 is an 8-bit character set with Cyrillic letters.
3. Only if 1. does not yield the correct result. But if you want the string to be translated, you should have used _() anyway.
4. I don't know.
5. wxWidgets 3.0 will not have separate Unicode and ANSI builds. 2.9.1 doesn't have them, either.
6. It depends on how you use the arguments. C and C++ functions usually operate on the representation of strings and are unaware of any particular character encoding. In particular, what you perceive to be a character and what the program considers a character might be different things.
7. See 6.
8. I do not know, but many toolkits use UTF-16 or UTF-32 instead of UTF-8 because those schemes are simpler. It's a size-speed tradeoff.
1. Does my application need Unicode support to display Korean and Japanese languages?
Thanks to Oswald, I found out that you can have a Korean translation without using Unicode in your wxWidgets application: change the settings for non-Unicode-aware programs (under Windows, at least). But I still have to check whether this is enough for a whole application.
3. I have thousands of string literals. Do I have to prepend each and every one of them with 'L'? ( wxString foo("foo") --> wxString foo(L"foo") )
If you use Unicode with wxWidgets prior to 3.0, you have to. But do not use 'L' under wxWidgets; use wxT("foo").
4. If so, did someone build a regex or sed or perl script to do this in ca. 500 .cpp files?
I did, at least a search and replace under Visual Studio:
Search: {"([^"]*)"}
Replace: wxT(\1)
But be careful! It will replace all string literals, including #include "file.h", which becomes #include wxT("file.h").
5. Will this change in wxWidgets 3.0?
Yes. See the answer quoted above.

What encoding do Win32 API functions expect?

For example, the MessageBox function has an LPCTSTR-typed argument for its text and caption, which is a pointer to wchar_t or a pointer to char when _UNICODE or _MBCS is defined, respectively.
How does the MessageBox function interpret those strings? As which encoding?
Only explanation I managed to find is this:
http://msdn.microsoft.com/en-us/library/cwe8bzh0(VS.90).aspx
But it doesn't say anything about the encoding - just that in the case of _UNICODE one character takes up one wchar_t (which is 16-bit on Windows), and in the case of _MBCS one or two chars (8-bit each).
So are those some Microsoft versions of UTF-8 and UTF-16 that ignore anything that has to be encoded in three or four bytes in the case of UTF-8, or four bytes in the case of UTF-16? And is there a way to show anything outside the Basic Multilingual Plane of Unicode with MessageBox?
There are normally two different implementations of each function:
MessageBoxA, which accepts ANSI strings
MessageBoxW, which accepts Unicode strings
Here, 'ANSI' means the multi-byte code page currently assigned to the process. This varies according to the user's preferences and locale setting, although Win32 API functions such as WideCharToMultiByte can be counted on to do the right conversion, and the GetACP function will tell you the code page in use. MSDN explains the ANSI code page and how it interacts with Unicode.
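As a sketch, feeding 'ANSI' text to a wide-character function looks roughly like this (the fixed-size buffer is a simplification; production code would first call MultiByteToWideChar with a zero-length output buffer to learn the required size):

    #include <windows.h>

    // Show text that arrived in the process's active ('ANSI') code page
    // by converting it to UTF-16 and calling the wide-character API.
    void ShowAnsiText(const char *ansi)
    {
        WCHAR wide[256];
        MultiByteToWideChar(CP_ACP, 0, ansi, -1, wide, 256);
        MessageBoxW(nullptr, wide, L"Converted", MB_OK);
    }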
'Unicode' generally means UCS-2; that is, support for characters above 0xFFFF isn't consistent. I haven't tried this, but UI functions such as MessageBox in recent versions (> Windows 2000) should support characters outside the BMP.
The ...A functions are obsolete and merely wrap the ...W functions. The former were required for compatibility with Windows 9x, but since that is no longer used, you should avoid them at all costs and use the ...W functions exclusively. They require UTF-16 strings, the only native Windows encoding. All modern Windows versions should support non-BMP characters quite well (provided a font that has those characters is available, of course).
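An easy way to test non-BMP support yourself (whether the glyph actually renders depends on the font the dialog uses):

    #include <windows.h>

    int main()
    {
        // U+1D11E (musical symbol G clef) lies outside the BMP; in UTF-16
        // it is encoded as the surrogate pair 0xD834 0xDD1E.
        const WCHAR text[] = {0xD834, 0xDD1E, 0};
        MessageBoxW(nullptr, text, L"Non-BMP test", MB_OK);
        return 0;
    }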

Does Lua support Unicode?

Based on the link below, I'm confused as to whether the Lua programming language supports Unicode.
http://lua-users.org/wiki/LuaUnicode
It appears it does, but with limitations. I simply don't understand: are the limitations significant, or not a big deal?
You can certainly store Unicode strings in Lua, as UTF-8. You can use them as you would any string.
However Lua doesn't provide any default support for higher-level "unicode aware" operations on such strings—e.g., counting string length in characters, converting lower-to-upper-case, etc. Whether this lack is meaningful for you really depends on what you intend to do with these strings.
Possible approaches, depending on your use:
If you just want to input/output/store strings, and generally use them as "whole units" (for table indexing etc), you may not need any special handling at all. In this case, you just treat these strings as binary blobs.
Due to UTF-8's clever design, some types of string manipulation can be done on strings containing UTF-8 and will yield the correct result without taking any special care.
For instance, you can append strings, split them apart before/after ASCII characters, etc. As an example, if you have the string "開発.txt" and you search for "." in that string using string.find (string_var, ".", 1, true) (the final true requests a plain-text find, since an unescaped "." in a Lua pattern would match any character), and then split it using the normal string.sub function into "開発" and ".txt", those result strings will be correct UTF-8 strings even though you're not using any kind of "unicode-aware" algorithm.
Similarly, you can do case-conversions on only the ASCII characters in strings (those with the high bit zero), and treat the rest of the strings as binary without screwing them up.
Some utf8-aware operations are so simple that it's easy to just write one's own functions to do them.
For instance, to calculate the length in Unicode characters of a string, just count the number of bytes with the high bit zero (ASCII characters) and the number of bytes with the top two bits 11 ("leading bytes" for non-ASCII characters); the length is the sum of those two (see the sketch after this list).
For more complex operations—e.g., case-conversion on non-ASCII characters, etc.—you'll probably have to use a Lua unicode library, such as those on the (previously mentioned) Lua-users Unicode page
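As a sketch of the length computation described above (shown in C++ rather than Lua, but the logic translates directly):

    #include <cstddef>

    // Count the characters in a UTF-8 string: every byte that is NOT a
    // continuation byte (bit pattern 10xxxxxx) starts a character, i.e.
    // ASCII bytes (high bit 0) plus lead bytes (top two bits 11).
    std::size_t utf8_len(const char *s)
    {
        std::size_t n = 0;
        for (; *s; ++s)
            if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80)
                ++n;
        return n;
    }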
Lua does not have any built-in support for Unicode (other than accepting any byte value in strings). The library slnunicode has a lot of Unicode string functions, however. For example unicode.utf8.len.
(note: this answer is completely stolen from grom's comment on another question - I just think it deserves its own answer)
If you want a short answer, it is 'yes and no' as put on the linked site.
Lua supports Unicode in the sense that specifying, storing and querying arbitrary byte values in strings is supported, so you can store any Unicode-encoded string in a Lua string.
What is not supported is iteration by Unicode character, and there is no standard function for string length in Unicode characters, etc. So the higher-level kind of Unicode support (like what is available in Python: length, lower-to-upper-case conversion, encoding in arbitrary codings, etc.) is not available.
Lua 5.3 has now been released. It comes with a basic UTF-8 library.
You can use the utf8 library for things concerning the UTF-8 encoding, like getting the length of a UTF-8 string in characters (not in bytes, as string.len does), matching individual characters (not bytes), etc.
It doesn't provide support beyond the encoding itself; it cannot tell you, for example, whether a character is a Chinese character.
It supports it in the sense that you can use Unicode in Lua strings. It depends specifically on what you're planning to do, but most of the limitations can be fairly easily worked around by extending Lua with your own functions.