ocaml: reading unicode from a file, printing to console - unicode

I'm looking for a simple example in Ocaml of reading Unicode from a text file, and print it to the console.
I checked several packages that purport to support Unicode in Ocaml, but these do not include any mechanism for output. Thanks!

Once your string encoding and the encoding expected by your console match, you can just print normally. In other words,
let () = Printf.printf "∑ₖ |ψₖ⟩⟨ψₖ| = 𝕀 "
should work as expected

Related

Encoding from ANSI when having non-latin letters

I have a very old program (not a server or something on the internet) that I think it use the ANSI (Windows-1252) encoding.
The problem is that some inputs to this program are written in Arabic.
However, when I am trying to read the result, the Arabic words are written with very wired character. For example the input: "نور" is converted to "äæÑ".
The program output should contain a combination of English words and Arabic words.
E.x. It outputs "Name äæÑ" while the correct output should be something like "Name نور".
In general, the English words are correct and readable with both UTF-8 and ANSI. But the Arabic words are read for example as "���" with UTF-8 and as "äæÑ" with ANSI.
I understand that this is because ANSI doesn't have support to non-Latin letters.
but what should I do now? How can I convert them to Arabic again?
Note: I know the exact input and the exact output that this program should produce.
Note2: I don't have the source code of this program. I just want to convert the output file of this program to have the correct words or encoding.
I solved this problem now by typing in the terminal:
iconv -f WINDOWS-1256 -t utf8 < my_File.ged > result.ged
I tried to write code in java that do a similar thing but it wasn't really working with giving my the result I wanted.
I have also tried the previous terminal command but using WINDOWS-1252 instead of WINDOWS-1256 but it wasn't working. So, I guess it is good to try different encoding until it is working

D Unicode string literals: can't print specific Unicode character

I'm just trying to pick up D having come from C++. I'm sure it's something very basic, but I can't find any documentation to help me. I'm trying to print the character à, which is U+00E0. I am trying to assign this character to a variable and then use write() to output it to the console.
I'm told by this website that U+00E0 is encoded as 0xC3 0xA0 in UTF-8, 0x00E0 in UTF-16 and 0x000000E0 in UTF-32.
Note that for everything I've tried, I've tried replacing string with char[] and wstring with wchar[]. I've also tried with and without the w or d suffixes after wide strings.
These methods return the compiler error, "Invalid trailing code unit":
string str = "à";
wstring str = "à"w;
dstring str = "à"d;
These methods print a totally different character (Ò U+00D2):
string str = "\xE0";
string str = hexString!"E0";
And all these methods print what looks like ˧á (note á ≠ à!), which is UTF-16 0x2E7 0x00E1:
string str = "\xC3\xA0";
wstring str = "\u00E0"w;
dstring str = "\U000000E0"d;
Any ideas?
I confirmed it works on my Windows box, so gonna type this up as an answer now.
In the source code, if you copy/paste the characters directly, make sure your editor is saving it in utf8 encoding. The D compiler insists on it, so if it gives a compile error about a utf thing, that's probably why. I have never used c:b but an old answer on the web said edit->encodings... it is a setting somewhere in the editor regardless.
Or, you can replace the characters in your source code with \uxxxx in the strings. Do NOT use the hexstring thing, that is for binary bytes, but your example of "\u00E0" is good, and will work for any type of string (not just wstring like in your example).
Then, on the output side, it depends on your target because the program just outputs bytes, and it is up to the recipient program to interpret it correctly. Since you said you are on Windows, the key is to set the console code page to utf-8 so it knows what you are trying to do. Indeed, the same C function can be called from D too. Leading to this program:
import core.sys.windows.windows;
import std.stdio;
void main() {
SetConsoleOutputCP(65001);
writeln("Hi \u00E0");
}
printing it successfully. On older Windows versions, you might need to change your font to see the character too (as opposed to the generic box it shows because some fonts don't have all the characters), but on my Windows 10 box, it just worked with the default font.
BTW, technically the console code page a shared setting (after running the program and it exits, you can still hit properties on your console window and see the change reflected there) and you should perhaps set it back when your program exits. You could get that at startup with the get function ( https://learn.microsoft.com/en-us/windows/console/getconsoleoutputcp ), store it in a local var, and set it back on exit. You could auto ccp = GetConsoleOutputCP(); SetConsoleOutputCP(65005;) scope(exit) SetConsoleOutputCP(ccp); right at startup - the scope exit will run when the function exits, so doing it in main would be kinda convenient. Just add some error checking if you want.
The Microsoft docs don't say anything about setting it back, so it probably doesn't actually matter, but still I wanna mention it just in case. But also the knowledge that it is shared and persists can help in debugging - if it works after you comment it, it isn't because the code isn't necessary, it is just because it was set previously and not unset yet!
Note that running it from an IDE might not be exactly the same, because IDEs often pipe the output instead of running it right out to the Windows console. If that happens, lemme know and we can type up some stuff about that for future readers too. But you can also open your own copy of the console (run the program outside the IDE) and it should show correctly for you.
D source code needs to be encoded as UTF-8.
My guess is that you're putting a UTF-16 character into the UTF-8 source file.
E.g.
import std.stdio;
void main() {
writeln(cast(char)0xC3, cast(char)0xA0);
}
Will output as UTF-8 the character you seek.
Which you can then hard code like so:
import std.stdio;
void main() {
string str = "à";
writeln(str);
}

PyGame: Proper use of Unicode

My goal is to create a program, with which the user can learn Bible verses by getting shown a problem and solving it through input (e.g. "Quote vers Gen 3:15"). As the Bible translation, I have to work with, is German, it contains a ton of umlauts, which are never showing properly.
My PyGame file's header:
#!/usr/bin/python
# -*- coding: utf-8 -*-
Later on, I list the three German umlauts:
u'ö'.encode('utf-8')
u'ä'.encode('utf-8')
u'ü'.encode('utf-8')
The txt-file is parsed by this function:
def load_list(listname):
fullname = os.path.join("daten", listname + ".txt")
with codecs.open(fullname, "r", "utf-8-sig") as name:
lines = name.readlines()
for x in range(0, len(lines)):
lines[x] = lines[x].strip("\n")
lines[x] = lines[x].strip("\r")
print lines
I'm aware, that I could combine the two lines with the strip-commands, but that's not the topic here.
How can I get my PyGame to display the umlauts from the text-file correctly as well also display the user input's umlauts correctly? I checked hundreds of suggestions, I can't get anything really working here.
Any help is highly appreciated, before I lose my sane mind (well, as I'm sitting here, coding games, I probably did already anyway :D )
I'll try to summarize:
Printing something else than a string or unicode opject triggers that object's __repr__() method. If it is a sequence, this applies to the contained elements as well, causing any non-ascii character to be escaped with \xXX (or \uXXXX) notation. Note the difference between print 'text' and print ['text']: in the latter case, the string's quotes will be printed as well (besides the brackets of course). Use str.join() for concatenating lists of strings in order to control the way the output looks.
It's a good idea to always explicitely decode input (as you do by using codecs) and encode the output (which is not done in the code snippets in your question).
The source file encoding (the # coding: utf8 line in the header) has nothing to do with encoding of input and output. It only enables you to type non-ascii character in string literals (= characters inside quotes in the source file), instead of using \xXX escapes.
Hope that makes some things clearer. There's a lot that can go wrong that looks like an encoding error, and it's not always easy to find out what's actually happening.

How to convert UNICODE Hebrew appears as Gibberish in VBScript?

I am gathering information from a HEBREW (WINDOWS-1255 / UTF-8 encoding) website using vbscript and WinHttp.WinHttpRequest.5.1 object.
For Example :
Set objWinHttp = CreateObject("WinHttp.WinHttpRequest.5.1")
...
'writes the file as unicode (can't use Ascii)
Set Fileout = FSO.CreateTextFile("c:\temp\myfile.xml", true, true)
....
Fileout.WriteLine(objWinHttp.responsetext)
When Viewing the file in notepad / notepad++, I see Hebrew as Gibrish / Gibberish.
For example :
äìëåú - äøá àáøäí éåñó - îåøùú
I need a vbscript function to return Hebrew correctly, the function should be similar to the following http://www.pixiesoft.com/flip/ choosing the 2nd radio button and press convert button , you will see Hebrew correctly.
Your script is correctly fetching the byte stream and saving it as-is. No problems there.
Your problem is that the local text editor doesn't know that it's supposed to read the file as cp1255, so it tries the default on your machine of cp1252. You can't save the file locally as cp1252, so that Notepad will read it correctly, because cp1252 doesn't include any Hebrew characters.
What is ultimately going to be reading the file or byte stream, that will need to pick up the Hebrew correctly? If it does not support cp1255, you will need to find an encoding that is supported by that tool, and convert the cp1255 string to that encoding. Suggest you might try UTF-8 or UTF-16LE (the encoding Windows misleadingly calls 'Unicode'.)
Converting text between encodings in VBScript/JScript can be done as a side-effect of an ADODB stream. See the example in this answer.
Thanks to Charming Bobince (that posted the answer), I am now able to see HEBREW correctly (saving a windows-1255 encoding to a txt file (notpad)) by implementing the following :
Function ConvertFromUTF8(sIn)
Dim oIn: Set oIn = CreateObject("ADODB.Stream")
oIn.Open
oIn.CharSet = "X-ANSI"
oIn.WriteText sIn
oIn.Position = 0
oIn.CharSet = "WINDOWS-1255"
ConvertFromUTF8 = oIn.ReadText
oIn.Close
End Function

Working out file encoding: I know the string, know the character, what is the encoding?

I'm adding data from a csv file into a database. If I open the CSV file, some of the entries contain bullet points - I can see them. file says it is encoded as ISO-8859.
$ file data_clean.csv
data_clean.csv: ISO-8859 English text, with very long lines, with CRLF, LF line terminators
I read it in as follows and convert it from ISO-8859-1 to UTF-8, which my database requires.
row = [unicode(x.decode("ISO-8859-1").strip()) for x in row]
print row[4]
description = row[4].encode("UTF-8")
print description
This gives me the following:
'\xa5 Research and insight \n\xa5 Media and communications'
¥ Research and insight
¥ Media and communications
Why is the \xa5 bullet character converting as a yen symbol?
I assume because I'm reading it in as the wrong encoding, but what is the right encoding in this case? It isn't cp1252 either.
More generally, is there a tool where you can specify (i) string (ii) known character, and find out the encoding?
I don't know of any general tool, but this Wikipedia page (linked from the page on codepage 1252) shows that A5 is a bullet point in the Mac OS Roman codepage.
More generally, is there a tool where
you can specify (i) string (ii) known
character, and find out the encoding?
You can easily write one in Python.
(Examples use 3.x syntax.)
import encodings
ENCODINGS = set(encodings._aliases.values()) - {'mbcs', 'tactis'}
def _decode(data, encoding):
try:
return data.decode(encoding)
except UnicodeError:
return None
def possible_encodings(encoded, decoded):
return {enc for enc in ENCODINGS if _decode(encoded, enc) == decoded}
So if you know that your bullet point is U+2022, then
>>> possible_encodings(b'\xA5', '\u2022')
{'mac_iceland', 'mac_roman', 'mac_turkish', 'mac_latin2', 'mac_cyrillic'}
You could try
iconv -f latin1 -t utf8 data_clean.csv
if you know it is indeed iso-latin-1
Although in iso-latin-1 \xA5 is indeed a ¥
Edit: Actually this seems to be a problem on Mac, using Word or similar and Arial (?) and printing or converting to PDF. Some issues about fonts and what not. Maybe you need to explicitly massage the file first. Sounds familiar?
http://forums.quark.com/p/14849/61253.aspx
http://www.macosxhints.com/article.php?story=2003090403110643