IPC::Open3 converting character encoding - perl

I am observing strange behaviour with IPC::Open3 arguments as part of a script.
I give a string containing ISO-8859-15. Just before open3() is called (literally the statement before) the string is correct (verified with print and Data::Dumper).
However once the subprocess is started the arguments are now UTF-8 encoded. I have verified this using the desired executable (freebcp) and a wrapper script. I ended up writing a wrapper script which converts all the arguments back to ISO-8859-15.
What causes this behaviour? LANG is set to en_AU.ISO-8859-15. It works correctly on other hosts. I cannot find any reference to binmode()

I has a string containing ISO-8859-15. Just before open3() is called (literally the statement before) the string is correct (verified with print and Data::Dumper).
However once the subprocess is started the arguments are now UTF-8 encoded.
LANG is set to en_AU.ISO-8859-15.
Perl5 by default doesn't do any encoding conversion: the strings treated as dumb byte arrays.
That, until you tell Perl that the strings contain the Unicode, for example by calling decode(), or reading string from a file handle that has encoding layer attached (via binmode(), or via open() flags, or via use open with :encoding/:locale, or via command line with -C switch.)
Since you have the string in ISO-8859-15, but it is outputted in UTF-8, that means that the Perl is aware of the encoding of your string. Somewhere somehow you have told Perl the encoding of the string, and it has converted it to the Unicode, which is internally represented using the UTF-8. The UTF-8 which now seems to be printed to the open3() file handles.
As a possible solution, before outputting the strings, you should try to explicitly convert the strings into the desired encoding.
P.S. Using the utf8::is_utf8() function, you can try to debug/find when/how your strings get converted into the Unicode, and whether they are really Unicode.

Related

(Tcl) what character encoding set should I use?

So I'm trying to open and parse some old Visual Studio compilation log files with Tcl; my only problem is the files are in a strange encoding. Upon examining them with Notepad++ it seems they are in the 'UCS-2 Little Endian' encoding. Two questions:
Is there any command in Tcl that allows me to look at the character encoding of a file? I know there is encoding system which tells me the system encoding.
Using encoding names Tcl tells me the available encoding names are the following list:
cp860 cp861 cp862 cp863 tis-620 cp864 cp865 cp866 gb12345 gb2312-raw cp949 cp950 cp869 dingbats ksc5601 macCentEuro cp874 macUkraine jis0201 gb2312 euc-cn euc-jp macThai iso8859-10 jis0208 iso2022-jp macIceland iso2022 iso8859-13 jis0212 iso8859-14 iso8859-15 cp737 iso8859-16 big5 euc-kr macRomania macTurkish gb1988 iso2022-kr macGreek ascii cp437 macRoman iso8859-1 iso8859-2 iso8859-3 macCroatian koi8-r iso8859-4 ebcdic iso8859-5 cp1250 macCyrillic iso8859-6 cp1251 macDingbats koi8-u iso8859-7 cp1252 iso8859-8 cp1253 iso8859-9 cp1254 cp1255 cp850 cp1256 cp932 identity cp1257 cp852 macJapan cp1258 shiftjis utf-8 cp855 cp936 symbol cp775 unicode cp857
Given this, what would be the appropriate name to use in the fconfigure -encoding command to read these UCS-2 Little Endianencoded files and convert them to UTF-8 for use? If I understand the fconfigure command correctly, I need to specify the encoding type of the source file rather than what I want it to be; I just don't know which of the options in the above list corresponds to UCS-2 Little Endian. After reading a little bit, I see that UCS-2 is a predecessor of the UTF-16 character encoding, but that option isn't here either.
Thanks!
I'm afraid, currently there's no way to do it just by using fconfigure -encoding ?something?: the unicode encoding has rather moot meaning, and there's a feature request to create explicit support for UTF-16 variants.
What you could do about it?
Since unicode in Tcl running on Windows should mean UTF-16 with native endianness1 (little-endian on Wintel), if your solution is supposed to be a quick and dirty one, just try using -encoding unicode and see if that helps.
If you're targeting at some more bullet-proof or future-proof of cross-platform solution, I'd switch the channel to binary more, read the contents in chunks of two bytes at a time, and then use
binary scan $twoBytes s n
to scan the sequence of two bytes in $twoBytes as an 16-bit integer into a variable named "n", followed by something like
set c [format %c $n]
to produce a unicode character out of the number in $n, and assign it to a variable.
This way supposedly requires a bit more trickery to get correctly:
You might check the very first character obtained from the stream to see if it's a byte-order-mark, and drop it if it is.
If you need to process the stream in a line-wise manner, you'd have to implement a little state machine that would handle the CR+LF sequences correctly.
When doing your read $channelId 2, to get the next character, you should check that it returned not just 0 or 2, but also 1 — in case the file happens to be corrupted, — and handle this.
The UCS-2 encoding differs from UTF-16 in that the latter might contain the so-called surrogate pairs, and hence it is not a fixed-length encoding. Hence handling an UTF-16 stream properly implies also detecting those surrogate pairs. On the other hand, I hardly beleive a compilation log produced by MSVS might contain them, so I'd just assume it's encoded in UCS-2LE.
1 The true story is that the only thing Tcl guarantees about textual strings it handles (that is, those obtained by maniputating text, not via binary format or encoding convertto or reading a stream in binary mode) is that they're Unicode (or, rather, the "BMP" part of it).
But technically, the interpreter might switch the internal representation of any string between the UTF-8 encoding it uses by default and some fixed-length encoding which is what is referred to by that name "unicode". The "problem" is that no part of Tcl documentation specifies that internal fixed-length encoding because you're required to explicitly convert any text you output or read to/from some specific encoding — either via configuring the stream or using encoding convertfrom and encoding convertto or using binary format and binary scan, and the interpreter will do the right thing no matter which precise encoding it's currently using for your source string value — it's all transparent. Moreover, the next release of the "standard" Tcl interpreter might decide to drop this internal feature completely, or, say, use 32-bit or 64-bit integers for that internal fixed-length encoding. Whatever "non-standard" interpreters do (like Jacl etc) are also up to them. In other words, this feature is internal and is not a part of the documented contract about the interpreter's behaviour. And by the way, the "standard" encoding for Tcl strings (UTF-8) is not specified as such either — it's just an implementation detail.
In Tcl v8.6.8 I could solve the same issue with fconfigure channelId -encoding unicode.

How to achieve fool proof unicode handling in Perl CGI?

So I have a mysql database which is serving an old wordpress db I backed up. I am writing some simple perl scripts to serve the wordpress articles (I don't want a wordpress install).
Wordpress for whatever reason stored all quotes as unicode chars, all ... as unicode chars, all double dashes, all apostrophes, there is unicode nbsp all over the place -- it is a mess (this is why I do not a wordpress install).
In my test environment, which is Linux Mint 17.1 Perl 5.18.2 Mysql 5.5, things mostly work fine when I have Content-type line being served up with "charset=utf-8" (except apostrophes just simply never decode properly no matter what combination I of things I try). Omitting the charset causes all unicode characters to break (except apostrophes now work). This is OK, with the exception of the apostrophes, I under stand what is going on and I have a handle on the data.
Now on my production environment, which is a VM, is Linux CentOS 6.5 Perl 5.10.1 Mysql 5.6.22, and here things do not work at all. Whether or not I include the "charset=utf-8" in the Content-type there is no difference, no unicode charatcers work correctly (including apostrophes). Maybe it has to do with the lower version of Perl? Does anyone have any insight?
Apart from this very specific case, does anyone know of a fool-proof Perl idiom for handling unicode which comes from the DB? (I'm not sure where in the pipeline things are going wrong, but I have a suspicion it is at the DB-driver level)
One of the problems is that my data is very inconsistent and dirty. I could parse the entire DB and scrub all unicode and re-import it -- the point is I want to avoid that. I want a one-size fits all collection of Perl scripts for reading wordpress databases.
Dealing with Perl and UTF-8 has been a pain to me. After a good amount of time i learned that there is no "fool proof unicode handling" in Perl ... but there is an unicode handling that can be of help:
The Encode module.
As the perlunifaq says (http://perldoc.perl.org/perlunifaq.html):
When should I decode or encode?
Whenever you're communicating text with anything that is external to
your perl process, like a database, a text file, a socket, or another
program. Even if the thing you're communicating with is also written
in Perl.
So we do this to every UTF-8 text string sent to our Perl process:
my $perl_str = decode('utf8',$myExt_str);
And this to every text string sent from Perl to anything external to our Perl process:
my $ext_str = encode('utf8',$perl_str);
...
Now that's a lot of encoding/decoding when we retrieve or send data from/to a mysql or postgresql database. But fear not, because there is a way to tell Perl that EVERY TEXT STRING from/to a database are utf8. Additionally we tell the database that every text string should be treated as UTF-8. The only downside is that you need to be sure that every text string is UTF-8 encoded... but that's another story:
# For MySQL:
# This requires DBD::mysql version 4 or greater
use DBI;
my $dbh = DBI->connect ('dbi:mysql:test_db',
$username,
$password,
{mysql_enable_utf8 => 1}
);
Ok, now we have the text strings from our database in utf8 and the database knows all our text strings should be treated as UTF-8... But what about anything else? We need to tell Perl (AND CGI) that EVERY TEXT STRING we write in our process is utf8 AND tell other other processes to treat our text strings as UTF-8 as well:
use utf8;
use CGI '-utf8';
my $cgi = new CGI;
$cgi->charset('UTF-8');
UPDATED!
What is a "wide character"?
This is a term used both for characters with an ordinal value greater
than 127, characters with an ordinal value greater than 255, or any
character occupying more than one byte, depending on the context. The
Perl warning "Wide character in ..." is caused by a character with an
ordinal value greater than 255.
With no specified encoding layer, Perl
tries to fit things in ISO-8859-1 for backward compatibility reasons.
When it can't, it emits this warning (if warnings are enabled), and
outputs UTF-8 encoded data instead. To avoid this warning and to avoid
having different output encodings in a single stream, always specify
an encoding explicitly, for example with a PerlIO layer:
# The next line is required to avoid the "Wide character in print" warning
# AND to avoid having different output encodings in a single stream.
binmode STDOUT, ":encoding(UTF-8)";
...
Even with all of this sometimes you need to encode('utf8',$perl_str) . That's why i learn there is no fool proof unicode handling in Perl. Please read the perlunifaq (http://perldoc.perl.org/perlunifaq.html)
I hope this helps.

encoding accents in soap lite

I have a perl script using soap::lite that calls a webservice written in Net.
The call works, but the problem is that I need to pass a parameter like
SOAP::Data->name('x' => 'àò??\a')->type('string')
And the resulting XML is something like
<x>\xc3\x83\xc2\xa0\xc3\x83\xc2\xb2??\\a</x>
The accented letters are replaced and also the \ becomes '\\'.
I need the parameter to be exactly how is written.
The encoding is utf-8.
When you have Unicode literals in your Perl source, you must use utf8; and save the file in UTF-8 encoding.

Command-line arguments as bytes instead of strings in python3

I'm writing a python3 program, that gets the names of files to process from command-line arguments. I'm confused regarding what is the proper way to handle different encodings.
I think I'd rather consider filenames as bytes and not strings, since that avoids the danger of using an incorrect encoding. Indeed, some of my file names use an incorrect encoding (latin1 when my system locale uses utf-8), but that doesn't prevent tools like ls from working. I'd like my tool to be resilient to that as well.
I have two problems: the command-line arguments are given to me as strings (I use argparse), and I want to report errors to the user as strings.
I've successfuly adapted my code to use binaries, and my tool can handle files whose name are invalid in the current default encoding, as long as it is by recursing trough the filesystem, because I convert the arguments to binaries early, and use binaries when calling fs functions. When I receive a filename argument which is invalid, however, it is handed to me as a unicode string with strange characters like \udce8. I do not know what these are, and trying to encode it always fail, be it with utf8 or with the corresponding (wrong) encoding (latin1 here).
The other problem is for reporting errors. I expect users of my tool to parse my stdout (hence wanting to preserve filenames), but when reporting errors on stderr I'd rather encode it in utf-8, replacing invalid sequences with appropriate "invalid/question mark" characters.
So,
1) Is there a better, completely different way to do it ? (yes, fixing the filenames is planned, but i'd still like my tool to be robust)
2) How do I get the command line arguments in their original binary form (not pre-decoded for me), knowing that for invalid sequences re-encoding the decoded argument will fail, and
3) How do I tell the utf-8 codec to replace invalid, undecodable sequences with some invalid mark rather than dying on me ?
When I receive a filename argument
which is invalid, however, it is
handed to me as a unicode string with
strange characters like \udce8.
Those are surrogate characters. The low 8 bits is the original invalid byte.
See PEP 383: Non-decodable Bytes in System Character Interfaces.
Don't go against the grain: filenames are strings, not bytes.
You shouldn't use a bytes when you should use a string. A bytes is a tuple of integers. A string is a tuple of characters. They are different concepts. What you're doing is like using an integer when you should use a boolean.
(Aside: Python stores all strings in-memory under Unicode; all strings are stored the same way. Encoding specifies how Python converts the on-file bytes into this in-memory format.)
Your operating system stores filenames as strings under a specific encoding. I'm surprised you say that some filenames have different encodings; as far as I know, the filename encoding is system-wide. Functions like open default to the default system filename encoding, for example.

Perl strings internals

How do perl strings represented internally? What encoding is used? How do I handle different encodings properly?
I've been using perl for quite a long time, but it didn't include a lot of string handling in different encodings, and when I encountered a minor problem that had something to do with encodings I usually resorted to some shamanic actions.
Until this moment I thought about perl strings as sequences of bytes, which did fit pretty well for my tasks. Now I need to do some processing of UTF-8 encoded file and here starts trouble.
First, I read file into string like this:
open(my $in, '<', $ARGV[0]) or die "cannot open file $ARGV[0] for reading";
binmode($in, ':utf8');
my $contents;
{
local $/;
$contents = <$in>;
}
close($in);
then simply print it:
print $contents;
And I get two things: a warning Wide character in print at <scriptname> line <n> and a garbage in console. So I can conclude that perl strings have a concept of "character" that can be "wide" or not, but when printed these "wide" characters are represented in console as multiple bytes, not as single "character".
(I wonder now why did all my previous experience with binary files worked quite how I expected it to work without any "character" issues).
Why then I see garbage in console? If perl stores strings as character in some known encoding, I don't think there is a big problem to find out console encoding and print text properly. (I use Windows, BTW).
If perl stores strings as variable-width character sequences (e.g. using same UTF-8 encoding), why is it done this way? From my C experience handling strings is PAIN.
Update.
I use two computers for testing, one runs Windows 7 x64 with English language pack installed, but with Russian regional settings (so I have cp866 as OEM codepage and cp1251 as ANSI) with ActivePerl 5.10.1 x64; another runs Windows XP 32 bit Russian localization with Cygwin Perl 5.10.0.
Thanks to links, now I have much more solid understanding on what's going on and how things should be done.
Setting utf8 before reading from the file is good, it automagically decodes the bytes into the internal encoding. (Which is also UTF-8 but you don't need to know, and shouldn't rely on.)
Before printing you need to encode the characters back to bytes.
use Encode;
utf8::encode($contents);
There is also a two argument form of encode, for other encodings than unicode. (That sentence echoes too much, doesn't it?)
Here is a good reference. (Would have been more, but it's my first post.) Check out perlunitut too, and the unicode article on Joel on Software.
http://www.ahinea.com/en/tech/perl-unicode-struggle.html
Oh, and it must use multi-byte strings, because otherwise it's just not unicode.
Perl strings are stored internally in one of two encodings, either a 8-bit byte oriented native encoding, or UTF-8. For backwards comparability the assumption is that all I/O and strings are in native encoding, unless otherwise specified. Native encoding is usually 8-bit ASCII, but this can be changed with use locale.
In your sample you call binmode on your input handle changing it to use :utf8 semantics. One effect of this is that all strings read from this handle will be encoded as UTF-8. print writes to STDOUT by default, and STDOUT defaults to expecting native encoded characters.
Perl in an attempt to do the right thing will allow a UTF-8 string to be sent to a native encoded output, but if there is no encoding attached to that handle then it has to guess how to output multi-byte characters and it will almost certainly guess wrong. That is what the warning means, a multi-byte character was sent to a stream only expecting single byte characters and the result was that the character was probably damaged in translation.
Depending on what you want to accomplish you can use the Encode module mentioned by dylan to convert the UTF-8 data to a single byte character set that can be printed safely or if you know that whatever is attached to STDOUT can handle UTF-8 you can use binmode(STDOUT, ':utf8'); to tell Perl you want any data sent to STDOUT to be sent as UTF-8.
You should mention your actual Windows and Perl versions as this really depends on your used versions and installed language packages.
Otherwise have a look at the PerlUnicode manual first -
Perl uses logically-wide characters to represent strings internally.
it will confirm your statements.
Windows does not fully install all UTF8 character- thus this is might be the reason for your issue. You may need to install an additional language package.