Param value encoding

I have a benchmark in sbt-jmh (a "wrapper" for JMH) with a parameter that contains non-ASCII characters. It looks like this:
@Param(Array("1000", "１000"))
In Java, the equivalent would be:
@Param({"1000", "１000"})
Note that the second string "１000" starts with a fullwidth digit one, code point U+FF11.
The file is encoded in UTF-8. My platform is Windows 8.1, and the platform encoding is cp1252.
My build.sbt contains scalacOptions ++= List("-encoding", "UTF-8")
I expect very similar benchmark results for both params, but I'm seeing drastically different results that seem to imply the second string isn't processed properly.
How can I make sure the benchmark uses the correct string as a parameter?

This was a bug in JMH 1.17 and is fixed in JMH 1.18.
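For reference, a hedged sketch of what such a benchmark can look like (class, field and method names are invented; the fullwidth digit is written as a unicode escape purely so this snippet does not itself depend on the file encoding):

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class ParamEncodingBench {

    // Second value starts with FULLWIDTH DIGIT ONE (U+FF11), written as a
    // unicode escape so the compiled constant is independent of the source encoding.
    @Param({"1000", "\uFF11000"})
    public String value;

    @Benchmark
    public int measureLength() {
        return value.length();
    }
}

Written this way, both parameter values compile to the intended strings even if the build and the platform disagree about the source encoding.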

Related

Apache POI 3.9 generated Excel XSSF has ? in place of special characters like á (Spanish) on WebLogic

I am working on an app in which I have to generate Excel files with XSSFCellStyle, etc. I am using Apache POI 3.9.
In some field I am doing this:
cell.setCellValue(myString);
myString may contain special characters like ñ and á, which are Spanish. These characters may come from i18n.properties or be hardcoded as plain Strings:
myString = "ññññññññ";
All is well on my local machine with Tomcat 8, but on the WebLogic server the generated Excel shows ? in place of these characters.
I read somewhere that on WebLogic servers the default charset is UTF-8. My local environment is Spanish (cp1252), and in Eclipse Luna the workspace charset is also cp1252, so that may be the reason, but I am not sure. Should I change it in Preferences > Workspace, or via the JVM parameter -Dfile.encoding=UTF-8?
I also read about Apache POI's encoding handling: the API is supposed to handle it all, so I should not have to worry about it. All I can do is set the font charset, like this:
font.setCharSet(FontCharset.DEFAULT);
But I cannot see UTF-8 there. In the source code I see:
/**
* Charset represents the basic set of characters associated with a font (that it can display), and
* corresponds to the ANSI codepage (8-bit or DBCS) of that character set used by a given language.
*
* @author Gisella Bronzetti
*/
public enum FontCharset {
ANSI(0),
DEFAULT(1),
SYMBOL(2),
MAC(77),
SHIFTJIS(128),
HANGEUL(129),
JOHAB(130),
GB2312(134),
CHINESEBIG5(136),
GREEK(161),
TURKISH(162),
VIETNAMESE(163),
HEBREW(177),
ARABIC(178),
BALTIC(186),
RUSSIAN(204),
THAI(222),
EASTEUROPE(238),
OEM(255);
There is no UTF-8 in that list, nor a WESTEUROPE. So how can I set it?
Thanks @centic for the hint. I finally found my solution:
Change the JVM encoding by setting the parameter -Dfile.encoding=UTF-8, or
Change text file encoding to UTF-8 for my project in Eclipse.
After that switch, the special characters hard-coded in Strings showed up as ?; I fixed them manually, saved the .java files as UTF-8 (now the default), recompiled the project, and rebuilt my WAR. Now everything is fine on WebLogic.
I think the problem is that on my local machine the .java files are encoded as cp1252, per the JVM and Eclipse settings, and Tomcat, running in the same environment, also uses cp1252 (both inherited from my Windows 7), so everything matches. WebLogic, however, only accepts UTF-8 and will only decode my WAR/class files as UTF-8, so data produced under cp1252 is not recognized.
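If changing the project/JVM encoding is not an option, one alternative is to write the special characters as unicode escapes, so the compiled string no longer depends on the encoding javac assumes for the source file. A minimal illustrative sketch (class name, file name, sheet name and sample text are invented):

import java.io.FileOutputStream;

import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

public class SpanishCellDemo {
    public static void main(String[] args) throws Exception {
        XSSFWorkbook wb = new XSSFWorkbook();
        Sheet sheet = wb.createSheet("demo");
        Cell cell = sheet.createRow(0).createCell(0);
        // ñ and á written as unicode escapes, so the compiled string is correct
        // no matter which encoding javac assumes for this .java file.
        cell.setCellValue("Espa\u00F1ol: \u00E1, \u00F1");
        FileOutputStream out = new FileOutputStream("demo.xlsx");
        wb.write(out);
        out.close();
    }
}

This only addresses strings hard-coded in .java files; values coming from i18n.properties still depend on how those files are read, which is why the encoding fix above is the more general solution.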

Must UTF-8 binaries include /utf8 in the binary literal in Erlang?

In Erlang, when defining a UTF-8 binary string, I need to specify the encoding in the binary literal, like this:
Star = <<"★"/utf8>>.
> <<226,152,133>>
io:format("~ts~n", [Star]).
> ★
> ok
But if the /utf8 encoding is omitted, the Unicode characters are not handled correctly:
Star1 = <<"★">>.
> <<5>>
io:format("~ts~n", [Star1]).
> ^E
> ok
Is there a way that I can create literal binary strings like this without having to specify /utf8 in every binary I create? My code has quite a few binaries like this and things have become quite cluttered. Is there a way to set some sort of default encoding for binaries?
This is a result of Erlang treating strings as lists of integers. When you enter <<"★">>, what Erlang actually sees is <<[9733]>>, i.e. a list containing a single integer, and each integer in a binary construction becomes a segment with the default size of 8 bits. 9733 is therefore truncated to its low byte, which is 5, exactly the <<5>> you observed.
The /utf8 type specifier tells Erlang that the segment is meant to be UTF-8 encoded, so each code point is expanded to its full UTF-8 byte sequence instead of being truncated to a single byte.
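As a cross-check of those byte values from outside Erlang, here is a tiny, purely illustrative Java snippet (class name invented) that prints the UTF-8 encoding of U+2605:

import java.nio.charset.StandardCharsets;

public class StarBytes {
    public static void main(String[] args) {
        // U+2605 BLACK STAR encoded as UTF-8 gives the same bytes Erlang shows.
        byte[] utf8 = "\u2605".getBytes(StandardCharsets.UTF_8);
        for (byte b : utf8) {
            System.out.println(b & 0xFF); // prints 226, 152, 133
        }
    }
}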

wxTextCtrl OS X umlauts

I am using wxMac 2.8 in a non-Unicode build. I am trying to read a file containing umlauts such as "ü" into a wxTextCtrl. When I do, the data gets interpreted in the current encoding, but it is a multibyte string. I narrowed the problem down to this:
text_ctrl->Clear();
text_ctrl->SetValue("üüüäääööößßß");
This is the result:
üüüäääööößßß
Note that the character count has doubled - printing the string in gdb displays "\303\274" and similar for each original character. Typing "ü" or similar into the text control is no problem. I tried various wxMBConv methods but the result is always the same. Is there a way to solve this?
If you use anything but 7-bit ASCII, you must use a Unicode build of wxWidgets. Just do yourself a favour and switch to it. If you have too much existing code that was written for the "ANSI" build of wxWidgets 2.8 and earlier and doesn't compile with a Unicode build, use wxWidgets 2.9 instead, where it will compile -- and work as intended.
It sounds like your text editor (for program source code) is in a different encoding from the running program.
Suppose for example that your text entry control and the rest of your program are (correctly) using UTF-8. Now if your text editor is using some other encoding, then a string that looks fine on screen will actually contain garbage bytes.
Assuming you are in a position to help create a pure-UTF8 world, then you should:
1) Encode UTF-8 directly into the string literals using escapes, e.g. "\303" or "\xc3". That's annoying to do, but it means you just don't have to worry about your text editor (or the editor settings of other developers).
2) Then check that the program is using UTF-8 everywhere.

Command-line arguments as bytes instead of strings in Python 3

I'm writing a Python 3 program that gets the names of files to process from command-line arguments. I'm confused about the proper way to handle different encodings.
I think I'd rather consider filenames as bytes and not strings, since that avoids the danger of using an incorrect encoding. Indeed, some of my file names use an incorrect encoding (latin1 when my system locale uses utf-8), but that doesn't prevent tools like ls from working. I'd like my tool to be resilient to that as well.
I have two problems: the command-line arguments are given to me as strings (I use argparse), and I want to report errors to the user as strings.
I've successfully adapted my code to use bytes, and my tool can handle files whose names are invalid in the current default encoding, as long as it reaches them by recursing through the filesystem, because I convert the arguments to bytes early and use bytes when calling filesystem functions. When I receive an invalid filename as an argument, however, it is handed to me as a Unicode string with strange characters like \udce8. I do not know what these are, and trying to encode the string always fails, whether with utf-8 or with the corresponding (wrong) encoding (latin1 here).
The other problem is error reporting. I expect users of my tool to parse my stdout (hence the need to preserve filenames exactly), but when reporting errors on stderr I'd rather encode them in utf-8, replacing invalid sequences with appropriate "invalid/question mark" characters.
So,
1) Is there a better, completely different way to do it? (Yes, fixing the filenames is planned, but I'd still like my tool to be robust.)
2) How do I get the command line arguments in their original binary form (not pre-decoded for me), knowing that for invalid sequences re-encoding the decoded argument will fail, and
3) How do I tell the utf-8 codec to replace invalid, undecodable sequences with some invalid mark rather than dying on me ?
"When I receive a filename argument which is invalid, however, it is handed to me as a unicode string with strange characters like \udce8."
Those are surrogate characters. The low 8 bits are the original invalid byte.
See PEP 383: Non-decodable Bytes in System Character Interfaces.
Don't go against the grain: filenames are strings, not bytes.
You shouldn't use bytes when you should use a string. A bytes object is a sequence of integers; a string is a sequence of characters. They are different concepts. What you're doing is like using an integer when you should use a boolean.
(Aside: Python stores all strings in memory as Unicode; all strings are stored the same way. An encoding specifies how Python converts on-disk bytes into this in-memory format.)
Your operating system stores filenames as strings under a specific encoding. I'm surprised you say that some filenames have different encodings; as far as I know, the filename encoding is system-wide. Functions like open default to the default system filename encoding, for example.

Printing Unicode from Scala interpreter

When using the Scala interpreter (i.e. running the command 'scala' on the command line), I am not able to print Unicode characters correctly. Of course a-z, A-Z, etc. are printed correctly, but, for example, € or ƒ is printed as a ?.
print(8364.toChar)
results in ? instead of €.
Probably I'm doing something wrong. My terminal supports utf-8 characters, and even when I pipe the output to a separate file and open it in a text editor, ? is displayed.
This is all happening on Mac OS X (Snow Leopard, 10.6.2) with Scala 2.8 (nightly build) and Java 1.6.0_17.
I found the cause of the problem, and a solution to make it work as it should.
As I already suspected after posting my question, reading Calum's answer, and hitting encoding issues on the Mac in another (Java) project, the cause of the problem is the default encoding used by Mac OS X. When you start the Scala interpreter, it uses the default encoding for the platform. On Mac OS X this is MacRoman; on Windows it is probably CP1252. You can check this by typing the following command in the Scala interpreter:
scala> System.getProperty("file.encoding");
res3: java.lang.String = MacRoman
According to the scala help text, it is possible to provide Java properties using the -D option. However, this did not work for me. I ended up setting the environment variable
JAVA_OPTS="-Dfile.encoding=UTF-8"
After running scala, the result of the previous command will give the following result:
scala> System.getProperty("file.encoding")
res0: java.lang.String = UTF-8
Now, printing special characters works as expected:
print(0x20AC.toChar)
€
So it is not a bug in Scala, but an issue with default encodings. In my opinion, it would be better if UTF-8 were used by default on all platforms. While searching for an answer as to whether this is being considered, I came across a discussion on the Scala mailing list about this issue. In the first message it is proposed to use UTF-8 by default on Mac OS X when file.encoding reports MacRoman, since UTF-8 is the default charset on Mac OS X (which keeps me wondering why file.encoding defaults to MacRoman; probably this is an inheritance from Mac OS before OS X was released?). I don't think this proposal will make it into Scala 2.8, since Martin Odersky wrote that it is probably best to keep things as they are in Java (i.e. honour the file.encoding property).
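The same file.encoding dependence can be reproduced outside the REPL. A small illustrative Java check (class name invented) prints the active default and then writes the Euro sign through an explicitly UTF-8 stream, which displays correctly as long as the terminal itself expects UTF-8, as in the question:

import java.io.PrintStream;

public class EncodingCheck {
    public static void main(String[] args) throws Exception {
        // Whatever the platform default resolved to (MacRoman, Cp1252, UTF-8, ...)
        System.out.println(System.getProperty("file.encoding"));

        // System.out encodes with the platform default; wrapping it in an
        // explicitly UTF-8 PrintStream keeps the Euro sign's bytes intact.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println('\u20AC');
    }
}

Run it with and without -Dfile.encoding=UTF-8: the first line changes while the second line stays correct.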
OK, at least part, if not all, of your problem here is that 128 is not the Unicode code point for the Euro sign. 128 (or 0x80, since hex seems to be the norm) is U+0080 <control>, i.e. it is not a printable character, so it's not surprising your terminal is having trouble printing it.
Euro's codepoint is 0x20AC (or in decimal 8364), and that appears to work for me (I'm on Linux, on a nightly of 2.8):
scala> print(0x20AC.toChar)
€
Another fun test is to print the Unicode snowman character:
scala> print(0x2603.toChar)
☃
128 showing up as € is apparently an extended character from one of the Windows code pages (in Cp1252, 0x80 is the Euro sign).
I got the other character you mentioned to work too:
scala> 'ƒ'.toInt
res8: Int = 402
scala> 402.toChar
res9: Char = ƒ
On Windows, in the command line (cmd), run:
set JAVA_OPTS="-Dfile.encoding=UTF-8"
chcp 65001
The second command, chcp 65001, switches the console code page to UTF-8.
If you don't want to type "chcp 65001" every time, you can change/add a value in the Windows Registry like this:
Run command regedit
find the key [HKEY_CURRENT_USER\Software\Microsoft\Command Processor]
New => String value
Name = "AutoRun", Data = "chcp 65001" (without quotes)
(see https://superuser.com/a/482117/454417)
I use Windows 10 and Scala 2.11.8.