There is a third-party library that only accepts char* filenames, e.g. 3rdlib_func_name(char* file_name). Everything goes wrong when I provide a filename in Chinese or Japanese.
Is there any way to make this library open Unicode filenames? The program runs on Windows.
Thanks for your reply.
We had a similar problem. Luckily there's a solution, though it's kinda tricky.
If the file/directory already exists, you can use the GetShortPathName function. The resulting "short" path name is guaranteed not to contain non-Latin characters.
1) Call GetShortPathNameW (the Unicode version) to get the "short" path string.
2) Convert the short path to an ANSI string with WideCharToMultiByte.
3) Give the resulting ANSI string to the stupid 3rd-party lib.
Now, if the file/directory doesn't exist yet, you cannot obtain its short pathname. In that case, create it first.
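A minimal sketch of those three steps, assuming the file already exists; the helper name to_short_ansi_path and the example path are made up for illustration:

#include <windows.h>
#include <string>

// Convert an existing Unicode path to its 8.3 "short" form, then to an ANSI string.
// Returns an empty string on failure (e.g. the file does not exist yet).
std::string to_short_ansi_path(const std::wstring& unicode_path)
{
    wchar_t short_path[MAX_PATH];
    DWORD len = GetShortPathNameW(unicode_path.c_str(), short_path, MAX_PATH);
    if (len == 0 || len >= MAX_PATH)
        return std::string();

    char ansi_path[MAX_PATH];
    int n = WideCharToMultiByte(CP_ACP, 0, short_path, -1,
                                ansi_path, sizeof(ansi_path), NULL, NULL);
    return n > 0 ? std::string(ansi_path) : std::string();
}

The call would then be something like 3rdlib_func_name(const_cast<char*>(to_short_ansi_path(L"C:\\データ\\報告.doc").c_str()));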
No, there isn't, unless you can recompile it from modified source (a major undertaking). You might have better luck feeding the third-party library short filenames, like AHDF76~4.DOC; these names use only ASCII characters. See GetShortPathName.
You can try converting the string to the local code page:
setlocale(LC_ALL,"Japanese_Japan.932");
std::string file_name = convert_to_codepage_932(utf16_file_name); // helper sketched below
3rdlib_func_name(const_cast<char*>(file_name.c_str()));
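convert_to_codepage_932 above is pseudocode; a minimal sketch of such a helper built on WideCharToMultiByte (932 is the Windows code page for Shift-JIS):

#include <windows.h>
#include <string>

// Convert a UTF-16 string to code page 932 (Shift-JIS). Characters that
// don't exist in that code page are silently replaced by a default character.
std::string convert_to_codepage_932(const std::wstring& utf16)
{
    int len = WideCharToMultiByte(932, 0, utf16.c_str(), -1, NULL, 0, NULL, NULL);
    if (len <= 0)
        return std::string();
    std::string out(len, '\0');  // len includes the terminating NUL
    WideCharToMultiByte(932, 0, utf16.c_str(), -1, &out[0], len, NULL, NULL);
    out.resize(len - 1);         // drop the embedded NUL
    return out;
}

Note that this only helps when every character of the filename actually exists in the chosen code page; a Cyrillic filename, for instance, cannot survive a round trip through code page 932.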
Otherwise? Blame Windows for not supporting UTF-8 ;-)
I am using wxMac 2.8 in a non-Unicode build. I try to read a file with umlauts ("ü") into a wxTextCtrl. When I do, the data gets interpreted in the current encoding, but it is a multibyte string. I narrowed the problem down to this:
text_ctrl->Clear();
text_ctrl->SetValue("üüüäääööößßß");
This is the result:
Ã¼Ã¼Ã¼Ã¤Ã¤Ã¤Ã¶Ã¶Ã¶ÃŸÃŸÃŸ
Note that the character count has doubled; printing the string in gdb displays "\303\274" and similar for each original character. Typing "ü" or similar into the text control is no problem. I tried various wxMBConv methods, but the result is always the same. Is there a way to solve this?
Best regards,
If you use anything but 7-bit ASCII, you must use a Unicode build of wxWidgets. Just do yourself a favour and switch to it. If you have too much existing code that was written for the "ANSI" build of wxWidgets 2.8 and earlier and doesn't compile with a Unicode build, use wxWidgets 2.9 instead, where it will compile and work as intended.
It sounds like your text editor (for program source code) is in a different encoding from the running program.
Suppose for example that your text entry control and the rest of your program are (correctly) using UTF-8. Now if your text editor is using some other encoding, then a string that looks fine on screen will actually contain garbage bytes.
Assuming you are in a position to help create a pure-UTF-8 world, you should:
1) Encode UTF-8 directly into the string literals using escapes, e.g. "\303" or "\xc3", as in the sketch after this list. That's annoying to do, but it means you just don't have to worry about your text editor (or the editor settings of other developers).
2) Then check that the program is using UTF-8 everywhere.
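As a sketch of point 1), assuming a Unicode build of wxWidgets: "ü" is U+00FC, which is the byte pair 0xC3 0xBC in UTF-8, so the literal can be spelled out byte by byte:

// "ü" (U+00FC) written as explicit UTF-8 bytes, immune to editor encoding settings.
const char *u_umlaut = "\xc3\xbc";
text_ctrl->SetValue(wxString(u_umlaut, wxConvUTF8)); // decode those bytes as UTF-8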
I need to be able to use binaries with Cyrillic characters in them. I tried just writing <<"абвгд">> but I got a badarg error.
How can I work with Cyrillic (or unicode) strings in Erlang?
If you want to input the above expression in the Erlang shell, please read the unicode module's user manual.
The functions unicode:characters_to_binary and unicode:characters_to_list are each other's inverse. The following is an example:
(emacs@yus-iMac.local)37> io:getopts().
[{expand_fun,#Fun<group.0.33302583>},
 {echo,true},
 {binary,false},
 {encoding,unicode}]
(emacs@yus-iMac.local)40> A = unicode:characters_to_binary("上海").
<<228,184,138,230,181,183>>
(emacs@yus-iMac.local)41> unicode:characters_to_list(A).
[19978,28023]
(emacs@yus-iMac.local)45> io:format("~s~n",[ unicode:characters_to_list(A,utf8)]).
** exception error: bad argument
     in function  io:format/3
        called as io:format(<0.30.0>,"~s~n",[[19978,28023]])
(emacs@yus-iMac.local)46> io:format("~ts~n",[ unicode:characters_to_list(A,utf8)]).
上海
ok
If you want to use unicode:characters_to_binary("上海") directly in the source code, it is a little more complex. Try it first to see the difference.
The Erlang compiler will interpret the code as ISO-8859-1 encoded text, which limits you to Latin characters. Although you may be able to bang in some ISO-8859-1 characters whose byte representation happens to match the Unicode bytes you want, this is not a very good idea.
You want to make sure your editor reads and writes ISO-8859-1, and you want to avoid using literals as much as possible. Source these strings from files.
I have a UILabel which I change through code. However, when I create an NSString with the characters æ, ø, å (Danish), I get an input conversion warning. The code looks like this:
NSString *label = [[NSString alloc] initWithFormat:@"Prøv igen"];
And the warning I get is this: warning: input conversion stopped due to an input byte that does not belong to the input codeset UTF-8. I understand that "ø" is probably not valid UTF-8 in my source file, but what should I do? Can anyone give me a hint about how to solve this?
Regards
Bjarke
Your source code is not saved as UTF-8, but most likely as something like ISO-8859-1.
Just open the file and re-save it as UTF-8 - and while you're at it, you should probably also make that the default. Exactly how to do that depends on what editor you're using.
Make sure your file text encoding is set to UTF-8, not Western (ISO) or something else. You can use the Xcode file info inspector to do this.
http://developer.apple.com/library/mac/#documentation/DeveloperTools/Conceptual/XcodeWorkspace/050-File_Management/file_management.html%23//apple_ref/doc/uid/TP40002677-BABICEHI
Make sure it says Unicode (UTF-8) for the File Encoding. If it asks you, tell it to reinterpret your file with the new encoding. Also, you may want to delete the problematic text and reinput it to get it to work.
I had the same problem, but my source code files were already UTF-8 encoded, so I fixed it in a different way.
In your case, it would have been something like
NSString *label = [NSString stringWithUTF8String:"Prøv igen"];
I hope this will be helpful for others who stumble on this question.
When reading a text file that was created somewhere outside my app, the encoding used is unknown. My app has been using NSUnicodeStringEncoding (which is the same as NSUTF16StringEncoding), so it has problems reading anything but UTF-16 encoded files.
Is there a way I can guess the encoding of a file? My priority is to be able to read UTF8 files and then all other files.
Is iterating through the available encodings and checking whether the resulting string's length is greater than zero really a good approach?
Thanks in advance.
Ignacio
Apple's documentation has some guidance on how to proceed: String Programming Guide: Reading data with an unknown encoding:
If you are forced to guess the encoding (and note that in the absence of explicit information, it is a guess):
1) Try stringWithContentsOfFile:usedEncoding:error: or initWithContentsOfFile:usedEncoding:error: (or the URL-based equivalents). These methods try to determine the encoding of the resource and, if successful, return the encoding used by reference.
2) If (1) fails, try to read the resource by specifying UTF-8 as the encoding.
3) If (2) fails, try an appropriate legacy encoding.
"Appropriate" here depends a bit on circumstances; it might be the default C string encoding, it might be ISO or Windows Latin 1, or something else, depending on where your data is coming from.
If the file is properly constructed, you can read the first four bytes and see whether they form a BOM (Byte Order Mark):
http://en.wikipedia.org/wiki/Byte-order_mark
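For illustration, a minimal sketch of that check in C++ (the helper name sniff_bom is made up). Keep in mind that most UTF-8 files carry no BOM at all, so "no BOM" does not mean "not Unicode":

#include <cstdio>
#include <cstring>
#include <string>

// Inspect the first bytes of a file for a Unicode BOM.
// The UTF-32LE pattern must be tested before UTF-16LE, since it starts the same way.
std::string sniff_bom(const char *path)
{
    unsigned char b[4] = {0};
    std::FILE *f = std::fopen(path, "rb");
    if (!f) return "unreadable";
    size_t n = std::fread(b, 1, 4, f);
    std::fclose(f);
    if (n >= 4 && !std::memcmp(b, "\xFF\xFE\x00\x00", 4)) return "UTF-32LE";
    if (n >= 4 && !std::memcmp(b, "\x00\x00\xFE\xFF", 4)) return "UTF-32BE";
    if (n >= 3 && !std::memcmp(b, "\xEF\xBB\xBF", 3)) return "UTF-8";
    if (n >= 2 && !std::memcmp(b, "\xFF\xFE", 2)) return "UTF-16LE";
    if (n >= 2 && !std::memcmp(b, "\xFE\xFF", 2)) return "UTF-16BE";
    return "unknown";
}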
I'm trying to make some of my code a bit more friendly to non-pure-ASCII systems, and I was wondering whether there is a particular character encoding used for NEEDED entries in ELF binaries. Or is it unstandardized and based on the creating system's filesystem encoding (or even just whatever bytes were passed to the tool that created the binary)? If so, is there any place in the binary that specifies the encoding (assuming the current system's encoding wouldn't work well for my usage)? Or are non-ASCII names pretty much banned, or is it something else?
The ELF format specifies NEEDED fields as "null-terminated strings" and says nothing more about the encoding, which pretty much implies 8-bit ASCII strings.
I personally don't see any point in complicating the executable file format specification in a way that provides no additional value for the final product or the development process: the user won't see library names, so they won't care about localization thereof. You may try to use UTF-8, but the actual file system encoding is not guaranteed to be UTF-8. To be sure, you need to know how your target linker handles those strings.
As far as I know, the standard Unix way of dealing with non-ASCII characters is to encode them as UTF-8.
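As an illustration of "just bytes", here is a minimal C++ sketch for a Unix-like system (the library name is made up; build with -ldl). The dynamic loader opens and compares these names as opaque byte strings, much as it does for DT_NEEDED entries, so a UTF-8 name works precisely when the bytes match what is on disk:

#include <dlfcn.h>
#include <cstdio>

int main()
{
    // Hypothetical library with a Cyrillic, UTF-8-encoded file name.
    // No encoding conversion happens anywhere between here and open(2).
    void *handle = dlopen("./libпример.so", RTLD_NOW);
    std::printf("%s\n", handle ? "loaded" : dlerror());
    return 0;
}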