I'm using ReasonReact with bsb -init myapp -theme react-hooks. I run my project on macOS Catalina. When building or starting my project, BuckleScript compiles my UTF-8 *.re files into US-ASCII. This results in badly encoded accented characters. I cannot figure out why. Thanks for helping me out.
It's not clear from the question whether you use unicode characters only in string literals, or in identifiers.
If the former, BuckleScript provides syntax for unicode string literals, which should be translated correctly:
let unicode = {js|你好, 世界|js};
If you use unicode in identifiers, however, the compiler unfortunately does not support that. It's an internal limitation inherited from the OCaml compiler.
As a test I created a file called Hello.java and the contents are as follows:
public class Hello {
    public static void main(String[] args) {
        System.out.println("Hello world!");
    }
}
I saved this file with UTF-8 encoding.
Anyway, compiling and running the program was no problem. This file was 103 bytes long.
I then saved the file with UTF-16 BE encoding. This time the file was 206 bytes long, since UTF-16 (usually) needs more space, so no surprise here.
Tried compiling the file from my terminal and I got all these errors:
Hello.java:4: error: illegal character: '\u0000'
}
^
So does javac work only with UTF-8 encoded source files? Is that like a standard?
javac -version
javac 1.8.0_45
Also, I only know Java, but let's say you are running Python code or any interpreted programming language. (Sorry if I am mistaken in thinking Python is interpreted, if it is not.) Would the encoding be a problem? If not, would it have any effect on performance?
OK, so the word "true" is a reserved keyword (for a given programming language), but in what encoding is it reserved? ASCII/UTF-8 only?
How "true" is stored on the hard drive or in memory depends on the encoding the file is saved in, so must a programming language always expect to work with a particular encoding for source files?
Regarding javac, you can set the encoding with the -encoding parameter. Internally, Java handles strings as UTF-16, so the compiler will convert everything to that.
The compiler must know the encoding so it can process the source code. It doesn't matter which compiler, interpreter or language it is, just as people can't take text in some random language and assume it's German.
Keywords aren't reserved in any specific encoding; they are keywords. You can't have two ways of writing a single word no matter what encoding you use: the word itself is the same.
The programming language doesn't care about the encoding; the compiler/interpreter does.
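To make this concrete, here is a minimal plain-Java sketch (my own illustration, not from the answer above) of why the decoder has to be told the charset: the very same bytes come out as different text depending on the encoding they are decoded with. For the UTF-16 BE file from the question you would tell javac explicitly, e.g. javac -encoding UTF-16BE Hello.java.

import java.nio.charset.StandardCharsets;

// Sketch: encode a string containing non-ASCII characters as UTF-8 bytes,
// then decode those bytes with the right and the wrong charset.
// The literal uses \u escapes so this file itself stays plain ASCII.
public class EncodingDemo {
    public static void main(String[] args) {
        String text = "Gr\u00FC\u00DFe";                           // "Grüße"
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);  // ü and ß take two bytes each

        String decodedRight = new String(utf8Bytes, StandardCharsets.UTF_8);
        String decodedWrong = new String(utf8Bytes, StandardCharsets.ISO_8859_1);

        System.out.println(decodedRight); // Grüße
        System.out.println(decodedWrong); // mojibake: each UTF-8 byte shows up as its own Latin-1 character
    }
}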
I am using wxMac 2.8 in a non-Unicode build. I try to read a file with umlauts ("ü") into a wxTextCtrl. When I do, the data gets interpreted in the current encoding, but it is a multibyte string. I narrowed the problem down to this:
text_ctrl->Clear();
text_ctrl->SetValue("üüüäääööößßß");
This is the result:
üüüäääööößßß
Note that the character count has doubled; printing the string in gdb displays "\303\274" and similar for each original character. Typing "ü" or similar into the text control is no problem. I tried various wxMBConv methods, but the result is always the same. Is there a way to solve this?
Best regards,
If you use anything but 7-bit ASCII, you must use a Unicode build of wxWidgets. Just do yourself a favour and switch to it. If you have too much existing code that was written for the "ANSI" build of wxWidgets 2.8 and earlier and doesn't compile with a Unicode build, use wxWidgets 2.9 instead, where it will compile -- and work as intended.
It sounds like your text editor (for program source code) is in a different encoding from the running program.
Suppose for example that your text entry control and the rest of your program are (correctly) using UTF-8. Now if your text editor is using some other encoding, then a string that looks fine on screen will actually contain garbage bytes.
Assuming you are in a position to help create a pure-UTF-8 world, then you should:
1) Encode UTF-8 directly into the string literals using escapes, e.g. "\303" or "\xc3". That's annoying to do, but it means you just don't have to worry about your text editor (or the editor settings of other developers).
2) Then check that the program is using UTF-8 everywhere.
I'm parsing PHP code using an ANTLR grammar and the ANTLR Ruby target. One of the source files I have to parse actually contains translations, some of them making heavy use of Unicode characters. The grammar seems to hang on one character from the "supplementary plane", namely U+10430.
I had a similar problem in the past, due to the fact that the Ruby ANTLR target is quite old and was not Unicode compliant (well, Ruby was not, at the time). We had to bump getMaxCharValue in RubyTarget.java from 0xFF (ASCII) to 0xFFFF (Unicode) to solve it. Now it seems that even this range is insufficient. Unicode states that characters outside this range may be represented using two UTF-16 code units, but how does ANTLR manage this? Would bumping getMaxCharValue again help (it did once, but I'm no fan of the "try it and see" approach)?
Thanks!
The reference Java target for ANTLR can only parse characters in the supplementary plane by using a UTF-16 surrogate pair in the grammar and using a UTF-16 encoding for your input stream. Other targets are created by members of the community and may or may not (as you saw with the Ruby target) support the same range of characters.
Since there is no way to represent anything past 0xFFFE in the grammar itself, you'll be limited to the UTF-16 encoding even if you modify a target to support characters above 0xFF.
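As a plain-Java illustration of what that surrogate pair looks like (my own sketch, not part of the answer above): U+10430 is in the Deseret block, outside the Basic Multilingual Plane, so UTF-16 represents it as the two code units 0xD801 and 0xDC30, and that pair is what a UTF-16-based grammar or input stream has to carry.

public class SurrogateDemo {
    public static void main(String[] args) {
        int codePoint = 0x10430;                      // a Deseret character, outside the BMP
        char[] utf16 = Character.toChars(codePoint);  // the surrogate pair {0xD801, 0xDC30}
        System.out.printf("high surrogate: U+%04X%n", (int) utf16[0]);
        System.out.printf("low surrogate:  U+%04X%n", (int) utf16[1]);

        String s = new String(utf16);
        System.out.println("UTF-16 code units: " + s.length());                      // 2
        System.out.println("code points:       " + s.codePointCount(0, s.length())); // 1
    }
}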
I need to be able to use binaries with Cyrillic characters in them. I tried just writing <<"абвгд">> but I got a badarg error.
How can I work with Cyrillic (or unicode) strings in Erlang?
If you want to input the above expression in the Erlang shell, please read the unicode module user manual.
The functions characters_to_binary and characters_to_list are inverses of each other. The following is an example:
(emacs@yus-iMac.local)37> io:getopts().
[{expand_fun,#Fun<group.0.33302583>},
 {echo,true},
 {binary,false},
 {encoding,unicode}]
(emacs@yus-iMac.local)40> A = unicode:characters_to_binary("上海").
<<228,184,138,230,181,183>>
(emacs@yus-iMac.local)41> unicode:characters_to_list(A).
[19978,28023]
(emacs@yus-iMac.local)45> io:format("~s~n",[ unicode:characters_to_list(A,utf8)]).
** exception error: bad argument
     in function  io:format/3
        called as io:format(<0.30.0>,"~s~n",[[19978,28023]])
(emacs@yus-iMac.local)46> io:format("~ts~n",[ unicode:characters_to_list(A,utf8)]).
上海
ok
If you want to use unicode:characters_to_binary("上海") directly in the source code, it is a little more complex. You can try it first to see the difference.
The Erlang compiler will interpret the code as ISO-8859-1 encoded text, which limits you to Latin characters. Although you may be able to bang in some ISO characters that may have the same byte representation as you want in Unicode, this is not a very good idea.
You want to make sure your editor reads and writes ISO-8859-1, and you want to avoid using literals as much as possible. Source these strings from files.
I have no idea how to add multibyte encoding support and very little knowledge of multibyte languages.
I am working on a search engine, and my application scans code in all programming languages.
Some source code might have CJK text in its comments section.
For simplicity's sake, I take Java as the source-code sample; my application is also in Java.
First, I want to write test cases to see whether to-be-indexed source code contains CJK text and whether it is encoded correctly by my application.
I want my tests to fail if support is not included, so that it can be added in the future.
But I have no idea how to test it:
how to enter CJK in input samples for a unit test, and what the output would be on the Java application console.
The presence of a Byte Order Mark might be of use, but BOMs are optional. There are other methods for determining the encoding when UTF is used. This may be of use: Java : How to determine the correct charset encoding of a stream.
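As a rough sketch of one way to set up such a test (assuming JUnit 4 and java.nio; the class and test names are made up, and a plain explicit-UTF-8 read stands in for whatever your indexer actually does): embed the CJK sample with \u escapes so the test source itself stays ASCII, write it to a temporary file as UTF-8, and assert that it round-trips unchanged.

import static org.junit.Assert.assertEquals;

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import org.junit.Test;

// Hypothetical round-trip test: the CJK sample is written with \u escapes so the
// test file itself stays plain ASCII regardless of editor settings.
public class CjkEncodingTest {

    // "你好世界" ("hello world") expressed as Unicode escapes.
    private static final String CJK_SAMPLE = "\u4F60\u597D\u4E16\u754C";

    @Test
    public void cjkCommentSurvivesUtf8RoundTrip() throws Exception {
        Path source = Files.createTempFile("Sample", ".java");
        String javaSource = "public class Sample {\n    // " + CJK_SAMPLE + "\n}\n";
        Files.write(source, javaSource.getBytes(StandardCharsets.UTF_8));

        // Read the file back the way the indexer would; the charset must be explicit,
        // otherwise the platform default decides and the test becomes flaky.
        String roundTripped = new String(Files.readAllBytes(source), StandardCharsets.UTF_8);

        assertEquals(javaSource, roundTripped);
    }
}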