What is the protocol / relationship between encodings and programming languages? - encoding

As a test I created a file called Hello.java and the contents are as follows:
public class Hello{
public static void main(String[] args){
System.out.println("Hello world!");
}
}
I saved this file with UTF-8 encoding.
Anyway, compiling and running the problem was no problem. This file was 103 bytes long.
I then saved the file with UTF-16 BE encoding. This time the file was 206 bytes long, since well UTF-16 (usually) needs more space, so no surprise here.
Tried compiling the file from my terminal and I got all these errors:
Hello.java:4: error: illegal character: '\u0000'
}
^
So does javac work only with UTF-8 encoded source files? Is that like a standard?
javac -version
javac 1.8.0_45
Also, I only know Java but lets say you are running Python code or any interpreted programming language. (Sorry if I am mistaken by thinking Python is interpreted if it is not..) Would the encoding be a problem? If not, would it have any effect on performance?
Ok so the word "true" is a reserved keyword (for a given programming language..) but in what encoding is it reserved? ASCII - UTF-8 only?
How "true" is stored in the hard drive or in memory depends on the encoding the file is saved in, so must a programming language expect always to work with a particular encoding for source files?

Regarding javac, you can set the encoding with -encoding parameter. Internally Java handles strings in UTF-16 so the compiler will convert everything to that.
The compiler must know the encoding so it can process the source codes. It doesn't matter what compiler, interpreter or language it is. Just like people can't just take random language text and assume it's German.
Keywords aren't reserves in any specific encoding. They are keywords. You can't have two ways of writing a single word no matter what encoding you use. The words are the same.
Programming language doesn't care about encoding. Compiler/interpreter does.

Related

Bucklescript is compiling utf8 ReasonML files into us-ascii

I'm using ReasonReact with bsb -init myapp -theme react-hooks. I run my project on MacOS Catalina. When building or starting my project, Bucklescript is compiling my utf8 *.re files into us-ascii. This results into bad encoded accentuated characters. I cannot figure out why. Thanks for helping me out.
It's not clear from the question whether you use unicode characters only in string literals, or in identifiers.
If the former, BuckleScript provides syntax for unicode string literals, which should be translated correctly:
let unicode = {js|你好, 世界|js};
If you use unicode in identifiers, however, the compiler unfortunately does not support that. It's an internal limitation inherited from the OCaml compiler.

How to detect file encoding in Octave?

I am working with many XML files and some of them are UTF-8 while most are ANSI.
In the UTF-8 files, the XML header states:
<?xml version="1.0" encoding="ISO8859-1" ?>
However that information is wrong.
The problem this generates is that I use unicode2native to generate correct XLS files, which generates bad output when the file is UTF-8 encoded.
How can I detect which is the real encoding of each file programmatically?
To manually locate them with the help of a text editor is not a feasible option, as there are hundreds of files, and my solution must work with more files which I don't have access.
There's no easy way to do this generally: because a given file might be a valid sequence in multiple encodings, detecting the character encoding requires using heuristics that are aware of natural language features, such as character frequencies, common words, and so on.
Octave doesn't have direct support for this. So you'll need to use an external program or library. Options include ICU4C, compact_enc_det, chardet, juniversalchardet, and others. chardet would probably be the easiest for you to use, since you can just install it and call it as an external command, instead of building a custom program or oct-file using a library. Or juniversalchardet, since if you have a Java-enabled Octave build, it's easy to pull in and use Java libraries from Octave code.
If it's really true that your input files are all either ANSI (Windows 1252/ISO 8859-1) or UTF-8, and no other encodings, you might be able to get away with just checking each file's contents to see if it's a valid UTF-8 string, and assume that any that are not valid UTF-8 are ANSI. Only certain byte sequences are valid UTF-8 encodings, so there's a good chance that the ANSI-encoded files are not valid UTF-8. I think you can check whether a file is valid UTF-8 in pure Octave by doing utf8_bytes = unicode2native(file_contents, 'UTF-8') on it, and seeing if the utf8_bytes output is identical to just casting file_contents directly to uint8. If that doesn't work, you can fall back to using Java's character encoding support (and that you can do with Java Standard Library stuff on any Java-enabled Octave build, without having to load an external JAR file).
And if all your input files are either UTF-8 or strictly 7-bit ASCII, then you can just treat them all as UTF-8, because 7-bit ASCII is a valid subset of UTF-8.
Palliative solution that I found for Windows 10, while I can't find a proper way to do this in pure Octave:
[~, output] = system(['file --mime-encoding "', fileAddress, '"']);
encoding = strsplit(output)(columns(strsplit(output, ' '))){1};
if strcmp('utf-8', encoding)
sheet(1, 1) = {strcat('', unicode2native(myText, 'ISO-8859-1'))};
else
sheet(1, 1) = {myText};
endif

(Tcl) what character encoding set should I use?

So I'm trying to open and parse some old Visual Studio compilation log files with Tcl; my only problem is the files are in a strange encoding. Upon examining them with Notepad++ it seems they are in the 'UCS-2 Little Endian' encoding. Two questions:
Is there any command in Tcl that allows me to look at the character encoding of a file? I know there is encoding system which tells me the system encoding.
Using encoding names Tcl tells me the available encoding names are the following list:
cp860 cp861 cp862 cp863 tis-620 cp864 cp865 cp866 gb12345 gb2312-raw cp949 cp950 cp869 dingbats ksc5601 macCentEuro cp874 macUkraine jis0201 gb2312 euc-cn euc-jp macThai iso8859-10 jis0208 iso2022-jp macIceland iso2022 iso8859-13 jis0212 iso8859-14 iso8859-15 cp737 iso8859-16 big5 euc-kr macRomania macTurkish gb1988 iso2022-kr macGreek ascii cp437 macRoman iso8859-1 iso8859-2 iso8859-3 macCroatian koi8-r iso8859-4 ebcdic iso8859-5 cp1250 macCyrillic iso8859-6 cp1251 macDingbats koi8-u iso8859-7 cp1252 iso8859-8 cp1253 iso8859-9 cp1254 cp1255 cp850 cp1256 cp932 identity cp1257 cp852 macJapan cp1258 shiftjis utf-8 cp855 cp936 symbol cp775 unicode cp857
Given this, what would be the appropriate name to use in the fconfigure -encoding command to read these UCS-2 Little Endianencoded files and convert them to UTF-8 for use? If I understand the fconfigure command correctly, I need to specify the encoding type of the source file rather than what I want it to be; I just don't know which of the options in the above list corresponds to UCS-2 Little Endian. After reading a little bit, I see that UCS-2 is a predecessor of the UTF-16 character encoding, but that option isn't here either.
Thanks!
I'm afraid, currently there's no way to do it just by using fconfigure -encoding ?something?: the unicode encoding has rather moot meaning, and there's a feature request to create explicit support for UTF-16 variants.
What you could do about it?
Since unicode in Tcl running on Windows should mean UTF-16 with native endianness1 (little-endian on Wintel), if your solution is supposed to be a quick and dirty one, just try using -encoding unicode and see if that helps.
If you're targeting at some more bullet-proof or future-proof of cross-platform solution, I'd switch the channel to binary more, read the contents in chunks of two bytes at a time, and then use
binary scan $twoBytes s n
to scan the sequence of two bytes in $twoBytes as an 16-bit integer into a variable named "n", followed by something like
set c [format %c $n]
to produce a unicode character out of the number in $n, and assign it to a variable.
This way supposedly requires a bit more trickery to get correctly:
You might check the very first character obtained from the stream to see if it's a byte-order-mark, and drop it if it is.
If you need to process the stream in a line-wise manner, you'd have to implement a little state machine that would handle the CR&plus;LF sequences correctly.
When doing your read $channelId 2, to get the next character, you should check that it returned not just 0 or 2, but also 1 — in case the file happens to be corrupted, — and handle this.
The UCS-2 encoding differs from UTF-16 in that the latter might contain the so-called surrogate pairs, and hence it is not a fixed-length encoding. Hence handling an UTF-16 stream properly implies also detecting those surrogate pairs. On the other hand, I hardly beleive a compilation log produced by MSVS might contain them, so I'd just assume it's encoded in UCS-2LE.
1 The true story is that the only thing Tcl guarantees about textual strings it handles (that is, those obtained by maniputating text, not via binary format or encoding convertto or reading a stream in binary mode) is that they're Unicode (or, rather, the "BMP" part of it).
But technically, the interpreter might switch the internal representation of any string between the UTF-8 encoding it uses by default and some fixed-length encoding which is what is referred to by that name "unicode". The "problem" is that no part of Tcl documentation specifies that internal fixed-length encoding because you're required to explicitly convert any text you output or read to/from some specific encoding — either via configuring the stream or using encoding convertfrom and encoding convertto or using binary format and binary scan, and the interpreter will do the right thing no matter which precise encoding it's currently using for your source string value — it's all transparent. Moreover, the next release of the "standard" Tcl interpreter might decide to drop this internal feature completely, or, say, use 32-bit or 64-bit integers for that internal fixed-length encoding. Whatever "non-standard" interpreters do (like Jacl etc) are also up to them. In other words, this feature is internal and is not a part of the documented contract about the interpreter's behaviour. And by the way, the "standard" encoding for Tcl strings (UTF-8) is not specified as such either — it's just an implementation detail.
In Tcl v8.6.8 I could solve the same issue with fconfigure channelId -encoding unicode.

Erlang and binary with Cyrillic

I need to be able to use binaries with Cyrillic characters in them. I tried just writing <<"абвгд">> but I got a badarg error.
How can I work with Cyrillic (or unicode) strings in Erlang?
If you want to input the above expression in erlang shell, please read unicode module user manual.
Function character_to_binary, and character_to_list are both reversable function. The following are an example:
(emacs#yus-iMac.local)37> io:getopts().
[{expand_fun,#Fun<group.0.33302583>},
{echo,true},
{binary,false},
{encoding,unicode}]
(emacs#yus-iMac.local)40> A = unicode:characters_to_binary("上海").
<<228,184,138,230,181,183>>
(emacs#yus-iMac.local)41> unicode:characters_to_list(A).
[19978,28023]
(emacs#yus-iMac.local)45> io:format("~s~n",[ unicode:characters_to_list(A,utf8)]).
** exception error: bad argument
in function io:format/3
called as io:format(<0.30.0>,"~s~n",[[19978,28023]])
(emacs#yus-iMac.local)46> io:format("~ts~n",[ unicode:characters_to_list(A,utf8)]).
上海
ok
If you want to use unicode:characters_to_binary("上海"). directly in the source code, it is a little more complex. You can try it firstly to find difference.
The Erlang compiler will interpret the code as ISO-8859-1 encoded text, which limits you to Latin characters. Although you may be able to bang in some ISO characters that may have the same byte representation as you want in Unicode, this is not a very good idea.
You want to make sure your editor reads and writes ISO-8859-1, and you want to avoid using literals as much as possible. Source these strings from files.

Multibyte encoding in java

i have no idea of how to add multibyte encoding support and very little knowledge on multibyte languages.
Being working on a search engine, my application scans code in all programming languages.
Some sourcecode might have CJK encoding in their comments section.
For easiness sake, i take java as source-code sample and my application is also in java.
First thing, i want to write test cases to see if to-be-indexed source-code has CJK encoding and if it is encoded by my application.
I want my tests to fail if support not included so that can be added in future.
But i have no idea how to test it ,
how to entre CJK in input samples for unit test and what would be output in Java application console.
The presence of a Byte Order Mark might be of use, but they are optional. There are other methods for determining the encoding when UTF is used. This may be of use: Java : How to determine the correct charset encoding of a stream.