How to print unicode to console in Eiffel?

Evidently, in Python:
print u'\u0420\u043e\u0441\u0441\u0438\u044f'
outputs:
Россия
How do I do this in Eiffel?

An example on the front page of eiffel.org (at the time of writing) suggests the following code:
io.put_string_32 ("[
Hello!
¡Hola!
Bonjour!
こんにちは!
Здравствуйте!
Γειά σου!
]")
It has been supported since EiffelStudio 20.05. (I tested the example with EiffelStudio 22.05.)
In this particular case, using print instead of io.put_string_32 works as well. However, in some boundary cases, when all character code points in the string are below 256, you may need to specify the string type explicitly:
print ({STRING_32} "Zürich") -- All character code points are below 256.
Of course, you can write the character code points explicitly:
io.put_string_32 ("%/0x041f/%/0x0440/%/0x0438/%/0x0432/%/0x0435/%/0x0442/%/0x0021/")
The output code page defaults to UTF-8 for text files and the current console code page for CONSOLE (the standard object type of io). If needed, the default encoding can be changed to any other encoding:
io.standard_default.set_encoding ({SYSTEM_ENCODINGS}.console_encoding)

On my Linux OS, this code prints exactly what you want:
make
        -- Initialisation of `Current'
    local
        l_utf_converter: UTF_CONVERTER
        l_string: STRING_32
    do
        -- Build "Россия" code point by code point.
        create l_string.make_empty
        l_string.append_code (0x420)
        l_string.append_code (0x43e)
        l_string.append_code (0x441)
        l_string.append_code (0x441)
        l_string.append_code (0x438)
        l_string.append_code (0x44f)
        -- Convert UTF-32 to UTF-8 before writing to the console.
        print (l_utf_converter.utf_32_string_to_utf_8_string_8 (l_string))
    end
In a nutshell, STRING_32 uses UTF-32 and the Linux console uses UTF-8. The UTF_CONVERTER class can be used to convert UTF-32 to UTF-8.
I do not know if it is possible on a Windows console.

Related

Trying to upload specific characters in Python 3 using Windows Powershell

I'm running this code in Windows PowerShell, and it includes this file called languages.txt, where I'm trying to convert between bytes and strings:
Here is languages.txt:
Afrikaans
አማርኛ
Аҧсшәа
العربية
Aragonés
Arpetan
Azərbaycanca
Bamanankan
বাংলা
Bân-lâm-gú
Беларуская
Български
Boarisch
Bosanski
Буряад
Català
Чӑвашла
Čeština
Cymraeg
Dansk
Deutsch
Eesti
Ελληνικά
Español
Esperanto
فارسی
Français
Frysk
Gaelg
Gàidhlig
Galego
한국어
Հայերեն
हिन्दी
Hrvatski
Ido
Interlingua
Italiano
עברית
ಕನ್ನಡ
Kapampangan
ქართული
Қазақша
Kreyòl ayisyen
Latgaļu
Latina
Latviešu
Lëtzebuergesch
Lietuvių
Magyar
Македонски
Malti
मराठी
მარგალური
مازِرونی
Bahasa Melayu
Монгол
Nederlands
नेपाल भाषा
日本語
Norsk bokmål
Nouormand
Occitan
Oʻzbekcha/ўзбекча
ਪੰਜਾਬੀ
پنجابی
پښتو
Plattdüütsch
Polski
Português
Română
Romani
Русский
Seeltersk
Shqip
Simple English
Slovenčina
کوردیی ناوەندی
Српски / srpski
Suomi
Svenska
Tagalog
தமிழ்
ภาษาไทย
Taqbaylit
Татарча/tatarça
తెలుగు
Тоҷикӣ
Türkçe
Українська
اردو
Tiếng Việt
Võro
文言
吴语
ייִדיש
中文
Then, here is the code I used:
import sys

script, input_encoding, error = sys.argv

def main(language_file, encoding, errors):
    line = language_file.readline()
    if line:
        print_line(line, encoding, errors)
        return main(language_file, encoding, errors)

def print_line(line, encoding, errors):
    next_lang = line.strip()
    raw_bytes = next_lang.encode(encoding, errors=errors)
    cooked_string = raw_bytes.decode(encoding, errors=errors)
    print(raw_bytes, "<===>", cooked_string)

languages = open("languages.txt", encoding="utf-8")
main(languages, input_encoding, error)
Here's the output (credit: Learn Python 3 the Hard Way by Zed A. Shaw):
I don't know why it doesn't upload the characters and shows question blocks instead. Can anyone help me?
The first string which fails is አማርኛ. The first character, አ, is Unicode code point U+12A0. In UTF-8, that is b'\xe1\x8a\xa0'. So, that part is obviously fine. The file really is UTF-8.
Printing did not raise an exception, so your output encoding can handle all of the characters. Everything is fine.
The only remaining reason I see for it to fail is that the font used in the console does not support all of the characters.
If it is just for play, you should not worry about it. Consider it working correctly.
On the other hand, I would suggest changing some things in your code:
You are running main recursively for each line. There is absolutely no need for that, and it would hit the recursion depth limit on a longer file. Use a for loop instead:
for line in language_file:
    print_line(line, encoding, errors)
You are opening the file as UTF-8, so reading from it automatically decodes UTF-8 into Unicode; then you encode it back into raw_bytes and decode it again into cooked_string, which is the same as line. It would be a better exercise to read the file as raw binary, split it on newlines, and then decode. Then you'd have a clearer picture of what is going on, as sketched below.
with open("languages.txt", 'rb') as f:
    raw_file_contents = f.read()
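Continuing that sketch (the splitting and per-line decoding below are my addition; errors="replace" is an assumption, so malformed bytes show up as U+FFFD instead of raising):
# Split the raw bytes on newlines, then decode each line explicitly.
for raw_line in raw_file_contents.split(b"\n"):
    cooked = raw_line.decode("utf-8", errors="replace")
    print(raw_line, "<===>", cooked)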

Trouble understanding C# URL decode with Unicode character(s) in PowerShell

I'm currently working on something that requires me to pass a Base64 string to a PowerShell script. But while decoding the string back to the original, I'm getting unexpected results: I need to use UTF-7 during decoding, and I don't understand why. Would someone know why?
The Mozilla documentation would suggest that it's insufficient to use Base64 if you have Unicode characters in your string. Thus you need to use a workaround that consists of using encodeURIComponent and a replace. I don't really get why the replace is needed and shortened it to btoa(escape('✓ à la mode')) to encode the string. The result of that operation would be JXUyNzEzJTIwJUUwJTIwbGElMjBtb2Rl.
Using PowerShell to decode the string back to the original, I need to first undo the Base64 encoding. In order to do that, System.Convert can be used (which results in a byte array), and its output can be converted to a UTF-8 string using System.Text.Encoding. Together this looks like the following:
$bytes = [System.Convert]::FromBase64String($inputstring)
$utf8string = [System.Text.Encoding]::UTF8.GetString($bytes)
What's left to do is URL decode the whole thing. As it is a UTF-8 string, I'd expect only to need to run the URL decode without any further parameters. But if you do that, you end up with an accented a that looks like � in a file or ? on the console. To get the actual original string, it's necessary to tell the URL decode to use UTF-7 as the character set. It's nice that this works, but I don't really get why it's necessary, since the string should be UTF-8 and UTF-8 certainly supports an accented a. See the last two lines of the entire script for what I mean. With those two lines you will end up with one line that has the garbled text and one that has the original text, in the same file, encoded as UTF-8.
Entire PowerShell script:
Add-Type -AssemblyName System.Web
$inputstring = "JXUyNzEzJTIwJUUwJTIwbGElMjBtb2Rl"
$bytes = [System.Convert]::FromBase64String($inputstring)
$utf8string = [System.Text.Encoding]::UTF8.GetString($bytes)
[System.Web.HttpUtility]::UrlDecode($utf8string) | Out-File -Encoding utf8 C:\temp\output.txt
[System.Web.HttpUtility]::UrlDecode($utf8string, [System.Text.UnicodeEncoding]::UTF7) | Out-File -Append -Encoding utf8 C:\temp\output.txt
Clarification:
The problem isn't the conversion of the Base64 to UTF-8. The problem is some inconsistent behavior of the UrlDecode of C#. If you run escape('✓ à la mode') in your browser, you will end up with the following string: %u2713%20%E0%20la%20mode. So we have a %u-escaped Unicode code point for the check mark and a two-digit %E0 escape (the Latin-1 code point) for the à. If we use this directly in UrlDecode, we end up with the same error. My current assumption would be that it's an issue with the encoding of the PowerShell window and pasting characters into it.
Turns out it actually isn't all that strange; it's just that, for what I want to do, it's advantageous to use a newer function. I'm still not sure why it works if you use the UTF-7 encoding. But anyway, as an explanation:
... The hexadecimal form for characters, whose code unit value is 0xFF or less, is a two-digit escape sequence: %xx. For characters with a greater code unit, the four-digit format %uxxxx is used.
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/escape
As TesselatingHecksler pointed out, What is the proper way to URL encode Unicode characters? indicates that the %u format was never formally standardized. A newer function for escaping characters exists, though: encodeURIComponent.
The encodeURIComponent() function encodes a Uniform Resource Identifier (URI) component by replacing each instance of certain characters by one, two, three, or four escape sequences representing the UTF-8 encoding of the character (will only be four escape sequences for characters composed of two "surrogate" characters).
The output of this function actually works with the C# implementation of UrlDecode without supplying an additional encoding of UTF-7.
The Mozilla article originally linked, about Base64 encoding for UTF-8 strings, modifies the whole process in a way that allows you to just call the Base64 decode function in order to get the whole string. This is realized by converting the URL-encoded version of the string to bytes.
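For illustration, here is a small Python sketch of the difference (my addition, not part of the original answer): a standard URL decoder only understands %XX byte escapes, so the non-standard %uXXXX form passes through untouched, while a lone %E0 decodes to a byte that is not valid UTF-8.
import urllib.parse

# escape('✓ à la mode') in a browser produces the non-standard %u form:
legacy = "%u2713%20%E0%20la%20mode"
# %u2713 is left as-is; %E0 becomes the byte 0xE0, invalid as UTF-8 on its
# own, so the default error handling replaces it with U+FFFD.
print(urllib.parse.unquote(legacy))    # %u2713 � la mode

# encodeURIComponent('✓ à la mode') produces UTF-8 byte escapes instead:
modern = urllib.parse.quote("✓ à la mode")
print(modern)                          # %E2%9C%93%20%C3%A0%20la%20mode
print(urllib.parse.unquote(modern))    # ✓ à la mode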

Python 3 CGI: how to output raw bytes

I decided to use Python 3 for making my website, but I encountered a problem with Unicode output.
It seems like plain print(html) # html is a str should be working, but it's not. I get UnicodeEncodeError: 'ascii' codec can't encode characters[...]: ordinal not in range(128). This must be because the web server doesn't support Unicode output.
The next thing I tried was print(html.encode('utf-8')), but I got something like the repr output of the byte string: it is placed inside b'...' and all the escape characters are in raw form (e.g. \n and \xd0\x9c).
Please show me the correct way to output a Unicode (str) string as a raw UTF-8 encoded bytes string in Python 3.1
The problem here is that your stdout isn't attached to an actual terminal and will use the ASCII encoding by default. Therefore you need to write to sys.stdout.buffer, which is the "raw" binary output of sys.stdout. This can be done in various ways; the most common one seems to be:
import codecs, sys
writer = codecs.getwriter('utf8')(sys.stdout.buffer)
And then use writer. In a CGI script you may be able to replace sys.stdout with the writer:
sys.stdout = codecs.getwriter('utf8')(sys.stdout.buffer)
That might actually work so you can print normally. Try it!
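A minimal self-contained sketch of that approach (the header lines and the sample html value are my own illustration):
import codecs
import sys

# Rewrap the binary stdout so print() emits UTF-8 regardless of the locale.
sys.stdout = codecs.getwriter('utf8')(sys.stdout.buffer)

html = "<p>\u0420\u043e\u0441\u0441\u0438\u044f</p>"  # "<p>Россия</p>"

print("Content-Type: text/html; charset=utf-8")
print()
print(html)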

Perl: Problem with changing encoding in the middle of reading a file

I am using Perl to load some 'macro' files. These macros can, however, be encoded in various encodings, so there is a directive defined for users writing their macros (e.g.
#encoding iso-8859-2
at the beginning of the macro).
Every time this directive is encountered in the macro, a function that sets the encoding is called; it looks something like this:
sub change_encoding {
    my ($file_handle, $encoding) = @_;
    $file_handle->flush();
    binmode($file_handle);  # get rid of IO layers
    binmode($file_handle, ":encoding($encoding)");
}
The problem is that when I read the macro using the standard
while ($line = <$file_handle>) {
    process_macro($line);
}
I got messages saying "utf8 "\xXY" does not map to Unicode", but only if characters with diacritics are near the #encoding directive. I tried several examples, and I was able to have half of the string with \xXY codes and the other half with correctly decoded characters, like here:
sub macro5_fn {
    print "\xBElu\xBBou\xE8k\xFD k\xF9\xF2 úpěl ďábelské ódy\n";
}
If I put more comments before the function, all the characters are OK:
sub macro5_fn {
    print "žluťoučký kůň úpěl ďábelské ódy\n";
}
Simply said, the number of correctly decoded characters depends on the distance of these characters from the #encoding directive; the ones that are close are not decoded correctly.
It seems to me that this is an issue of Perl and PerlIO (not) flushing the buffer. Or am I doing something wrong?
Thank you for your answers.
The problem is that <> reads more than just one line, so the next line or so is being interpreted under the old encoding before you ever see the #encoding directive for the new.
Your best bet is probably to read the file in binary mode and use the Encode module to decode each line from the current encoding.
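The same idea, sketched in Python for illustration (this is my sketch, not the answer's code; process_macro stands in for the question's handler). Reading in binary mode means there is no decoding layer with a read-ahead buffer, so the encoding can change cleanly between lines:
def load_macros(path, process_macro, default_encoding="utf-8"):
    encoding = default_encoding
    with open(path, "rb") as f:  # binary mode: no buffered decoding layer
        for raw_line in f:
            line = raw_line.decode(encoding)
            if line.startswith("#encoding"):
                encoding = line.split()[1]  # e.g. "#encoding iso-8859-2"
            else:
                process_macro(line)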

How can I get Perl to detect bad UTF-8 sequences?

I'm running Perl 5.10.0 and Postgres 8.4.3, and inserting strings into a database, which is behind DBIx::Class.
These strings should be in UTF-8, and therefore my database is running in UTF-8. Unfortunately, some of these strings are bad, containing malformed UTF-8, so when I run it I'm getting an exception:
DBI Exception: DBD::Pg::st execute failed: ERROR: invalid byte sequence for encoding "UTF8": 0xb5
I thought that I could simply ignore the invalid ones and worry about the malformed UTF-8 later, so using this code, it should flag and ignore the bad titles:
if (not utf8::valid($title)) {
    $title = "Invalid UTF-8";
}
$data->title($title);
$data->update();
However, Perl seems to think that the strings are valid, but running it still throws the exception.
How can I get Perl to detect the bad UTF-8?
First off, please follow the documentation - the utf8 module should only be used in the 'use utf8;' form to indicate that your source code is UTF-8 instead of Latin-1. Don't use any of the utf8 functions.
Perl makes the distinction between bytes and UTF-8 strings. In byte mode, Perl doesn't know or care what encoding you are using, and will use Latin-1 if you print it. Take for example the Euro sign (€). In UTF-8 this is 3 bytes, 0xE2, 0x82, 0xAC. If you print the length of these bytes, Perl will return 3. Again, it doesn't care about the encoding. It can be any bytes or any encoding, legal or illegal.
If you use the Encode module and call Encode::decode('UTF-8', $bytes), you will get a new string which has the so-called UTF8 flag set. Perl now knows your string is in UTF-8, and will return a length of 1 for the Euro sign.
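For comparison, the same bytes-versus-characters distinction in Python (my illustration, not from the original answer):
euro_bytes = "\u20ac".encode("utf-8")    # b'\xe2\x82\xac'
print(len(euro_bytes))                   # 3: raw bytes, no notion of encoding
print(len(euro_bytes.decode("utf-8")))   # 1: one decoded character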
The problem is that utf8::valid only applies to the second type of string. Your strings are probably in the first form, byte mode, and utf8::valid just returns true for anything in byte form. This is documented in the perldoc.
The solution is to get Perl to decode your byte strings as UTF-8, and detect any errors. This can be done with FB_CROAK as brian d foy explains:
use Encode qw(decode FB_CROAK);

my $ustring =
    eval { decode( 'UTF-8', $byte_string, FB_CROAK ) }
    or die "Could not decode string: $@";
You can then catch that error and skip those invalid strings.
Or if you know your code is mostly UTF-8 with a few invalid sequences here and there, you can use:
my $ustring = decode( 'UTF-8', $byte_string );
which uses the default mode of FB_DEFAULT, replacing invalid characters with U+FFFD, the Unicode REPLACEMENT CHARACTER (diamond with question mark in it).
You can then pass the string directly to your database driver in most cases. Some drivers may require you to re-encode the string back to byte form first:
my $byte_string = encode('UTF-8', $ustring);
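The same strict-versus-replace behaviour, sketched in Python for comparison (my illustration):
bad = b"caf\xb5"  # 0xb5 is not a valid UTF-8 sequence here

# Strict decoding raises on malformed input, like Perl's FB_CROAK:
try:
    text = bad.decode("utf-8")  # errors="strict" is the default
except UnicodeDecodeError:
    text = "Invalid UTF-8"

# Replacement decoding substitutes U+FFFD, like Perl's FB_DEFAULT:
text = bad.decode("utf-8", errors="replace")  # 'caf\ufffd'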
There are also regexes online that you can use to check for valid UTF-8 sequences before calling decode (check other Stack Overflow answers). If you use those regexes, you don't need to do any encoding or decoding.
Finally, please use UTF-8 rather than utf8 in your calls to decode. The latter is more lax and lets some invalid UTF-8 sequences (such as code points outside the Unicode range) through.
How are you getting your strings? Are you sure that Perl thinks that they are UTF-8 already? If they aren't decoded yet (that is, octets interpreted as some encoding), you need to do that yourself:
use Encode qw(decode FB_CROAK);
my $ustring =
    eval { decode( 'utf8', $byte_string, FB_CROAK ) }
    or die "Could not decode string: $@";
Better yet, if you know that your source of strings is already UTF-8, you need to read that source as UTF-8. Look at the code you have that gets the strings to see if you are doing that properly.
As the documentation for utf8::valid points out, it returns true if the string is marked as UTF-8 and it's valid UTF-8, or if the string isn't UTF-8 at all. Although it's impossible to tell without seeing the code in context and knowing what the data is, most likely what you want isn't the "valid utf8" check at all; probably you just need to do
$data->title( Encode::encode("UTF-8", $title) )