I would like some way of doing essentially the following:
if supports_unicode {
    print!("some unicode");
} else {
    print!("ascii");
}
Is there any way in Rust to check whether the output supports Unicode?
Update
I found a way to check if the device supports unicode, but it doesn't check if the current output is set to the correct encoding, nor does it check if the font supports the full range of unicode characters. If you're curious, it uses the crate locale-codes 0.3.0, and the code is
locale_codes::codeset::all_names().contains(&String::from("UTF-8"))
But, as I said, this doesn't solve my problem
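For comparison, here is a minimal sketch of another partial heuristic: reading the POSIX locale variables (LC_ALL, LC_CTYPE, LANG) and checking whether the first non-empty one names UTF-8. This is an assumption on my part, not something from the question. The helper names are hypothetical, the variables are often unset on Windows, and the check says nothing about fonts or about what the terminal actually renders.

```rust
use std::env;

/// Returns true if a locale string such as "en_US.UTF-8" names a UTF-8 codeset.
fn locale_is_utf8(locale: &str) -> bool {
    let upper = locale.to_ascii_uppercase();
    upper.contains("UTF-8") || upper.contains("UTF8")
}

/// Hypothetical helper: takes the first non-empty value in POSIX priority
/// order (LC_ALL, then LC_CTYPE, then LANG) and checks it for UTF-8.
fn locale_suggests_utf8(
    lc_all: Option<&str>,
    lc_ctype: Option<&str>,
    lang: Option<&str>,
) -> bool {
    [lc_all, lc_ctype, lang]
        .iter()
        .copied()
        .flatten()
        .find(|value| !value.is_empty())
        .map(locale_is_utf8)
        .unwrap_or(false)
}

fn main() {
    let lc_all = env::var("LC_ALL").ok();
    let lc_ctype = env::var("LC_CTYPE").ok();
    let lang = env::var("LANG").ok();

    // Only a hint: a UTF-8 locale does not guarantee the glyph is displayed.
    let supports_unicode =
        locale_suggests_utf8(lc_all.as_deref(), lc_ctype.as_deref(), lang.as_deref());

    if supports_unicode {
        println!("some unicode: \u{1D465}");
    } else {
        println!("ascii");
    }
}
```

Like the locale-codes check, this can only tell you what the environment claims, so an ASCII fallback flag for the user is probably still worth having.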
Also, if you want, here is a more specific example of the problem I've been having. In the VSCode integrated terminal (Windows 10 x64, VSCode 1.47), if I run a Rust program that prints the character 𝑥 (U+1D465), I get a variety of results, such as:
It actually prints the correct character
It prints �
It prints nothing at all
It prints 𝐵 (U+1D435)
I hope this example helps.
Related
I have a problem with my VS Code. When trying to modify a file that contains special characters like "á", "ñ", "ó" etc., the special characters are replaced with a question mark. (See image below.)
This used to be easy to fix from within Visual Studio Code by changing the file encoding to "Windows 1252", which worked for me at first. But now, even when I switch to that encoding, the question marks are still there.
The files that you opened before you changed the encoding have been automatically overwritten, and the original characters were replaced with the unknown-character character.
I'm just trying to pick up D having come from C++. I'm sure it's something very basic, but I can't find any documentation to help me. I'm trying to print the character à, which is U+00E0. I am trying to assign this character to a variable and then use write() to output it to the console.
I'm told by this website that U+00E0 is encoded as 0xC3 0xA0 in UTF-8, 0x00E0 in UTF-16 and 0x000000E0 in UTF-32.
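Those three encodings of U+00E0 can be checked independently. Here is a small sketch in Rust (not D, and not part of the original question) that encodes the character and prints the code units; the helper names are made up for illustration.

```rust
/// Encodes one character as UTF-8 and returns the bytes.
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

/// Encodes one character as UTF-16 and returns the 16-bit code units.
fn utf16_units(c: char) -> Vec<u16> {
    let mut buf = [0u16; 2];
    c.encode_utf16(&mut buf).to_vec()
}

fn main() {
    let c = '\u{00E0}'; // à

    println!("UTF-8:  {:02X?}", utf8_bytes(c));
    println!("UTF-16: {:04X?}", utf16_units(c));
    // A char cast to u32 is the code point, which is also its UTF-32 value.
    println!("UTF-32: {:08X}", c as u32);
}
```

The point to take away is that the single byte 0xE0 is the Latin-1 value of à, not its UTF-8 encoding, so "\xE0" in a UTF-8 string is not the same thing as "\u00E0".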
Note that for everything I've tried, I've tried replacing string with char[] and wstring with wchar[]. I've also tried with and without the w or d suffixes after wide strings.
These methods return the compiler error, "Invalid trailing code unit":
string str = "à";
wstring str = "à"w;
dstring str = "à"d;
These methods print a totally different character (Ò U+00D2):
string str = "\xE0";
string str = hexString!"E0";
And all these methods print what looks like ˧á (note á ≠ à!), which is UTF-16 0x2E7 0x00E1:
string str = "\xC3\xA0";
wstring str = "\u00E0"w;
dstring str = "\U000000E0"d;
Any ideas?
I confirmed it works on my Windows box, so gonna type this up as an answer now.
In the source code, if you copy/paste the characters directly, make sure your editor is saving it in utf8 encoding. The D compiler insists on it, so if it gives a compile error about a utf thing, that's probably why. I have never used c:b but an old answer on the web said edit->encodings... it is a setting somewhere in the editor regardless.
Or, you can replace the characters in your source code with \uxxxx in the strings. Do NOT use the hexstring thing, that is for binary bytes, but your example of "\u00E0" is good, and will work for any type of string (not just wstring like in your example).
Then, on the output side, it depends on your target because the program just outputs bytes, and it is up to the recipient program to interpret it correctly. Since you said you are on Windows, the key is to set the console code page to utf-8 so it knows what you are trying to do. Indeed, the same C function can be called from D too. Leading to this program:
import core.sys.windows.windows;
import std.stdio;
void main() {
    // 65001 is the Windows code page identifier for UTF-8.
    SetConsoleOutputCP(65001);
    writeln("Hi \u00E0");
}
printing it successfully. On older Windows versions, you might need to change your font to see the character too (as opposed to the generic box it shows because some fonts don't have all the characters), but on my Windows 10 box, it just worked with the default font.
BTW, technically the console code page is a shared setting (after running the program and it exits, you can still hit properties on your console window and see the change reflected there) and you should perhaps set it back when your program exits. You could get that at startup with the get function ( https://learn.microsoft.com/en-us/windows/console/getconsoleoutputcp ), store it in a local var, and set it back on exit. You could write auto ccp = GetConsoleOutputCP(); SetConsoleOutputCP(65001); scope(exit) SetConsoleOutputCP(ccp); right at startup - the scope exit will run when the function exits, so doing it in main would be kinda convenient. Just add some error checking if you want.
The Microsoft docs don't say anything about setting it back, so it probably doesn't actually matter, but still I wanna mention it just in case. But also the knowledge that it is shared and persists can help in debugging - if it works after you comment it out, it isn't because the code isn't necessary, it is just because it was set previously and not unset yet!
Note that running it from an IDE might not be exactly the same, because IDEs often pipe the output instead of running it right out to the Windows console. If that happens, lemme know and we can type up some stuff about that for future readers too. But you can also open your own copy of the console (run the program outside the IDE) and it should show correctly for you.
D source code needs to be encoded as UTF-8.
My guess is that you're putting a UTF-16 character into the UTF-8 source file.
E.g.
import std.stdio;
void main() {
    writeln(cast(char)0xC3, cast(char)0xA0);
}
Will output as UTF-8 the character you seek.
Which you can then hard code like so:
import std.stdio;
void main() {
    string str = "à";
    writeln(str);
}
I am using wxMac 2.8 in a non-Unicode build. I try to read a file containing umlauts such as "ü" into a wxTextCtrl. When I do, the data gets interpreted in the current encoding, but it is a multibyte string. I narrowed the problem down to this:
text_ctrl->Clear();
text_ctrl->SetValue("üüüäääööößßß");
This is the result:
üüüäääööößßß
Note that the character count has doubled - printing the string in gdb displays "\303\274" and similar per original char. Typing "ü" or similar into the textctrl is no problem. I tried various wxMBConv methods but the result is always the same. Is there a way to solve this?
Best regards,
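The doubled character count is what you would expect if each UTF-8 byte is shown as its own single-byte character. Here is a minimal sketch in Rust (not wxWidgets code) that reproduces the effect for "ü", assuming Latin-1 as the single-byte encoding; the actual encoding in the non-Unicode build may differ, and the function name is made up for illustration.

```rust
/// Reinterprets every UTF-8 byte of `s` as one Latin-1 (ISO-8859-1) character,
/// which is how a byte-oriented control would display multibyte text.
fn utf8_read_as_latin1(s: &str) -> String {
    s.bytes().map(char::from).collect()
}

fn main() {
    let original = "ü"; // stored as the two bytes 0xC3 0xBC (octal \303 \274)
    let garbled = utf8_read_as_latin1(original);

    println!("{} -> {}", original, garbled);
    println!(
        "characters before: {}, after: {}",
        original.chars().count(),
        garbled.chars().count()
    );
}
```

This matches the "\303\274" that gdb reports: the bytes are correct UTF-8, they are just being displayed one byte at a time.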
If you use anything but 7 bit ASCII, you must use Unicode build of wxWidgets. Just do yourself a favour and switch to it. If you have too much existing code that was written for "ANSI" build of wxWidgets 2.8 and earlier and doesn't compile with Unicode build, use wxWidgets 2.9 instead where it will compile -- and work as intended.
It sounds like your text editor (for program source code) is in a different encoding from the running program.
Suppose for example that your text entry control and the rest of your program are (correctly) using UTF-8. Now if your text editor is using some other encoding, then a string that looks fine on screen will actually contain garbage bytes.
Assuming you are in a position to help create a pure-UTF8 world, then you should:
1) Encode UTF-8 directly into the string literals using escapes, e.g. "\303" or "\xc3". That's annoying to do, but it means you just don't have to worry about your text editor (or the editor settings of other developers).
2) Then check that the program is using UTF-8 everywhere.
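To illustrate the two steps, here is a rough sketch in Rust rather than C++: the literal is spelled with byte escapes so the source file encoding cannot affect it, and the bytes are then validated as UTF-8 before use. The constant and function names are hypothetical.

```rust
use std::str;

/// "ü" written with explicit byte escapes (0xC3 0xBC), so the bytes are the
/// same no matter which encoding the source file is saved in.
const U_UMLAUT_UTF8: &[u8] = b"\xc3\xbc";

/// Returns the text if the bytes are valid UTF-8, otherwise None.
fn decode(bytes: &[u8]) -> Option<&str> {
    str::from_utf8(bytes).ok()
}

fn main() {
    match decode(U_UMLAUT_UTF8) {
        Some(text) => println!("{}", text),
        None => println!("not valid UTF-8"),
    }
}
```

A lone 0xC3, or a Latin-1 byte such as 0xFC, fails the check, which is a cheap way to catch text that is not UTF-8 where you expected it to be.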
How do I let my Eclipse use \uXXXX symbols?
Should I change the font?
Eclipse will never use \u escapes for display in the console window. That's just not in its repertoire.
However, that's probably not what you want.
If you have coded some Java with a \u escape in the source, your first problem is to configure the run / debug configuration to use an appropriate encoding for the console window. UTF-8 is usually the right answer. Then, you need to select an appropriate font in the Eclipse preferences for the particular character you've chosen. However, whatever you do, "\uxxxx" will never be what comes out. What you will get is the character specified by your Unicode escape.
If you're just trying to see unicode output in the console, make sure the font you're using supports unicode and that the output encoding is set to UTF-8.
When running this in my pretty vanilla install of Eclipse:
System.out.println("\u0CA0_\u0CA0");
I get this as expected in the Eclipse console output:
ಠ_ಠ
I'm stuck with the following problem:
I have a script which retrieves the title from the Firefox window:
tell application "Firefox"
    if the (count of windows) is not 0 then
        set window_name to name of front window
    end if
end tell
It works well as long as the title contains only English characters, but when the title contains non-ASCII characters (Cyrillic in my case) it produces UTF-8 garbage. I analyzed this garbage a bit, and it seems that my Cyrillic characters are converted to UTF-8 without regard to the code page: instead of using the Cyrillic code page for the conversion, it uses no code page at all, so I end up with UTF-8 text whose characters differ from those in the window title.
My question is: How can I retrieve the window title in UTF-8 directly, without any conversion?
I can achieve this goal by using AXAPI but I want to achieve this by AppleScript because AXAPI needs some option turned on in the system.
UPD:
It works fine in the AppleScript Editor. But I'm compiling it through the C++ code via OSACompile->OSAExecute->OSADisplay
I don't know the guts of the AppleScript Editor so maybe it has some inside information about how to encode the characters
I found the answer while writing the update. Sometimes asking a question helps you understand it better :)
So for future searchers: if you want a Unicode result from the script execution, pass typeUnicodeText to OSADisplay; the result AEDesc will then contain the text as UTF-16LE.
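Once you have the raw bytes out of the AEDesc, they still need decoding. Here is a minimal sketch of decoding UTF-16LE bytes, written in Rust rather than the C++ used in the question, and assuming the bytes have already been copied into a slice; the function name is made up for illustration.

```rust
/// Decodes UTF-16LE bytes into a String.
/// Returns None for an odd byte count or for invalid surrogate sequences.
fn decode_utf16le(bytes: &[u8]) -> Option<String> {
    if bytes.len() % 2 != 0 {
        return None;
    }
    // Combine each pair of bytes into one 16-bit unit, low byte first.
    let units: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
        .collect();
    String::from_utf16(&units).ok()
}

fn main() {
    // U+041F and U+0440 (Cyrillic "Пр") as UTF-16LE bytes.
    let raw = [0x1Fu8, 0x04, 0x40, 0x04];
    match decode_utf16le(&raw) {
        Some(title) => println!("{}", title),
        None => println!("not valid UTF-16LE"),
    }
}
```

From there the String is UTF-8 internally, so converting the window title to UTF-8 is just a matter of taking its bytes.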