I have a Perl script using SOAP::Lite that calls a web service written in .NET.
The call works, but the problem is that I need to pass a parameter like
SOAP::Data->name('x' => 'àò??\a')->type('string')
And the resulting XML is something like
<x>\xc3\x83\xc2\xa0\xc3\x83\xc2\xb2??\\a</x>
The accented letters are mangled, and the \ becomes \\.
I need the parameter to arrive in the XML exactly as written.
The encoding is UTF-8.
When you have Unicode literals in your Perl source, you must use utf8; and save the file in UTF-8 encoding.
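A minimal sketch of the fix (the surrounding SOAP call is omitted; only the parameter construction matters here):

use strict;
use warnings;
use utf8;        # declare that this source file is UTF-8, so the literal below is decoded to characters once
use SOAP::Lite;

# With use utf8 in effect the accented letters are character data,
# so they are serialized once instead of being double-encoded.
my $param = SOAP::Data->name('x' => 'àò??\a')->type('string');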
I am observing strange behaviour with IPC::Open3 arguments as part of a script.
I pass a string containing ISO-8859-15 text. Just before open3() is called (literally the statement before) the string is correct (verified with print and Data::Dumper).
However, once the subprocess is started, the arguments are now UTF-8 encoded. I have verified this with the target executable (freebcp) and with a wrapper script; I ended up writing a wrapper that converts all the arguments back to ISO-8859-15.
What causes this behaviour? LANG is set to en_AU.ISO-8859-15. The script works correctly on other hosts, and I cannot find any reference to binmode().
By default, Perl 5 doesn't do any encoding conversion: strings are treated as dumb byte arrays.
That holds until you tell Perl that a string contains Unicode, for example by calling decode(), or by reading it from a file handle that has an encoding layer attached (via binmode(), via open() flags, via use open with :encoding/:locale, or via the -C command-line switch).
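For illustration, two common ways a string ends up flagged as character data (the variable names $bytes and $file here are made up):

use Encode qw(decode);
my $chars = decode('iso-8859-15', $bytes);        # explicit decode: bytes -> characters

open(my $fh, '<:encoding(iso-8859-15)', $file);   # or attach an encoding layer,
my $line = <$fh>;                                 # so lines read here arrive as characters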
Since you have the string in ISO-8859-15 but it comes out in UTF-8, Perl must be aware of your string's encoding. Somewhere, somehow, you have told Perl the encoding of the string, and it has converted the string to Unicode, which Perl represents internally using UTF-8. That internal UTF-8 is what now appears to be printed to the open3() file handles.
As a possible solution: before handing the strings to open3(), explicitly convert them to the desired encoding.
P.S. With the utf8::is_utf8() function you can debug when and how your strings get converted to Unicode, and whether they really are Unicode.
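Putting both suggestions together, a sketch (assuming the arguments live in a hypothetical @args before the open3() call):

use Encode qw(encode);

for my $arg (@args) {
    next unless utf8::is_utf8($arg);      # debugging aid: only flagged strings need fixing
    $arg = encode('iso-8859-15', $arg);   # characters -> ISO-8859-15 bytes
}
# now pass @args to open3() as plain bytes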
I need to get a string from <STDIN>, written in mixed Latin and Russian encodings, and turn it into a URL:
$search_url = "http://searchengine.com/search?text=" . uri_escape($query);
But this process goes wrong and produces mojibake (a jumble of weird letters). What can I do in Perl to fix it?
Before you can get started, there are a few things you need to know.
You'll need to know the encoding of your input. "Latin" and "Russian" aren't (character) encodings.
If you're dealing with multiple encodings, you'll need to know what is encoded using which encoding. "It's a mix" isn't good enough.
You'll need to know the encoding the site expects the query to use. This should be the same encoding as the page that contains the search form.
Then it's just a matter of decoding the input using the correct encoding and encoding the query using the correct encoding. That's the easy part: the Encode module provides decode and encode functions to do just that.
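As a sketch, assuming the input arrives as UTF-8 and the search engine also expects UTF-8 (substitute the encodings you determined above):

use Encode qw(decode encode);
use URI::Escape qw(uri_escape);

my $raw = <STDIN>;
chomp $raw;
my $query = decode('UTF-8', $raw);                    # input bytes -> characters
my $search_url = 'http://searchengine.com/search?text='
               . uri_escape(encode('UTF-8', $query)); # characters -> output bytes -> %XX escapes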
How do I URI escape Japanese characters in Perl?
The URI::Escape module can handle Japanese characters just like any other unsafe or special characters in URIs.
Generally, when you're looking for some functionality in Perl, especially something as common as escaping parts of URIs, consult http://search.cpan.org first. URI::Escape would probably have been at the very top of the results for any of the keywords in your question.
You need to mention what encoding the Japanese characters are in.
If you are using UTF-8 and also using Perl's built-in Unicode encoding, then you can use this:
use utf8;
use URI::Escape qw/uri_escape_utf8/;
my $escaped = uri_escape_utf8("チャオ");
If your Japanese characters are encoded in a format like EUC-JP or Shift-JIS, you need to specify what kind of URI escaping you require. A standard call like
my $escaped = uri_escape("ハロー");
will give you something which is URI-encoded but not necessarily meaningful to the other end. For example, if you are building a URL for WWWJDIC, the URI to look up 渮 looks like this for UTF-8:
http://www.csse.monash.edu.au/~jwb/cgi-bin/wwwjdic.cgi?1MMJ%E6%B8%AE
But for EUC-JP the same page looks like this:
http://www.csse.monash.edu.au/~jwb/cgi-bin/wwwjdic.cgi?1MKJ%DE%D1
Perl will do either one for you, but you need to be specific about your starting point.
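For instance, a sketch of producing the EUC-JP variant from Perl character data (the UTF-8 case is just uri_escape_utf8 as shown above):

use utf8;
use Encode qw(encode);
use URI::Escape qw(uri_escape);

my $eucjp_escaped = uri_escape(encode('euc-jp', 'ハロー'));   # EUC-JP bytes, then %XX-escaped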
How are Perl strings represented internally? What encoding is used? How do I handle different encodings properly?
I've been using Perl for quite a long time, but my work hasn't involved much string handling across different encodings, and when I ran into a minor encoding problem I usually resorted to some shamanic actions.
Until this moment I thought of Perl strings as sequences of bytes, which fit my tasks pretty well. Now I need to do some processing of a UTF-8 encoded file, and here the trouble starts.
First, I read the file into a string like this:
open(my $in, '<', $ARGV[0]) or die "cannot open file $ARGV[0] for reading: $!";
binmode($in, ':utf8');   # decode the file's bytes as UTF-8
my $contents;
{
    local $/;            # slurp mode: read the whole file at once
    $contents = <$in>;
}
close($in);
then simply print it:
print $contents;
And I get two things: a Wide character in print at <scriptname> line <n> warning, and garbage in the console. So I can conclude that Perl strings have a concept of a "character" that can be "wide" or not, and that when printed, these "wide" characters are represented in the console as multiple bytes, not as a single "character".
(I wonder now why all my previous work with binary files behaved exactly as I expected, without any "character" issues.)
Why, then, do I see garbage in the console? If Perl stores strings as characters in some known encoding, it shouldn't be hard to find out the console encoding and print the text properly. (I use Windows, BTW.)
If Perl stores strings as variable-width character sequences (e.g. using the same UTF-8 encoding), why is it done this way? From my C experience, handling such strings is PAIN.
Update.
I use two computers for testing: one runs Windows 7 x64 with the English language pack installed but with Russian regional settings (so I have cp866 as the OEM codepage and cp1251 as ANSI) and ActivePerl 5.10.1 x64; the other runs 32-bit Russian-localized Windows XP with Cygwin Perl 5.10.0.
Thanks to the links, I now have a much more solid understanding of what's going on and how things should be done.
Setting the :utf8 layer before reading from the file is good: it automagically decodes the bytes into Perl's internal encoding. (Which also happens to be UTF-8, but you don't need to know that and shouldn't rely on it.)
Before printing you need to encode the characters back to bytes.
use Encode;
utf8::encode($contents);   # in place: characters -> UTF-8 bytes
There is also a two-argument form of encode, for encodings other than UTF-8. (That sentence echoes too much, doesn't it?)
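For instance, to print to a Latin-1 terminal instead (the encoding name is just an example):

use Encode;
print encode('iso-8859-1', $contents);   # characters -> Latin-1 bytes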
Here is a good reference. (Would have linked more, but it's my first post.) Check out perlunitut too, and the Unicode article on Joel on Software.
http://www.ahinea.com/en/tech/perl-unicode-struggle.html
Oh, and Perl must use multi-byte strings internally, because otherwise it's just not Unicode.
Perl strings are stored internally in one of two encodings: either an 8-bit byte-oriented native encoding, or UTF-8. For backwards compatibility the assumption is that all I/O and strings are in the native encoding unless otherwise specified. The native encoding is usually 8-bit ASCII, but this can be changed with use locale.
In your sample you call binmode on your input handle, switching it to :utf8 semantics. One effect of this is that all strings read from this handle will be marked as UTF-8. print writes to STDOUT by default, and STDOUT defaults to expecting natively encoded characters.
Perl, in an attempt to do the right thing, will allow a UTF-8 string to be sent to a natively encoded output, but if no encoding is attached to that handle it has to guess how to output multi-byte characters, and it will almost certainly guess wrong. That is what the warning means: a multi-byte character was sent to a stream expecting only single-byte characters, and the character was probably damaged in translation.
Depending on what you want to accomplish, you can use the Encode module mentioned by dylan to convert the UTF-8 data to a single-byte character set that can be printed safely, or, if you know that whatever is attached to STDOUT can handle UTF-8, you can use binmode(STDOUT, ':utf8'); to tell Perl to send any data to STDOUT as UTF-8.
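A sketch of both options (cp866 matches the asker's Russian console codepage; adjust to yours):

# Option 1: whatever reads STDOUT understands UTF-8.
binmode(STDOUT, ':utf8');
print $contents;

# Option 2 (instead of option 1): convert to a single-byte
# codepage the console understands, cp866 here:
# use Encode qw(encode);
# print encode('cp866', $contents);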
You should mention your actual Windows and Perl versions, as this really depends on the versions you use and the language packs you have installed.
Otherwise, have a look at the perlunicode manual page first; it confirms your statement:
Perl uses logically-wide characters to represent strings internally.
Windows does not install support for all UTF-8 characters by default, which might be the reason for your issue. You may need to install an additional language pack.
I'm modifying a mature CGI application written in Perl, and the question of content encoding has come up. The browser reports the content as iso-8859-1 encoded, and the application declares iso-8859-1 as the charset in the HTTP headers, but it never seems to actually do any encoding. None of the encoding techniques described in the perldoc tutorials (Encode, encoding, open) are used in the code, so I'm a little confused as to how the document is actually being encoded.
As mentioned, the application is quite mature and likely predates many of the current encoding methods. Does anyone know of any legacy or deprecated techniques I should be looking for? What encoding does Perl assume or default to when the developer gives no direction?
Thanks
By default Perl treats strings as byte sequences, so if you read from a file and print the data to STDOUT, it will produce the same byte sequence. If your templates are Latin-1, your output will also be Latin-1.
If you use a string in text-string context (with uc, lc and so on), Perl assumes Latin-1 semantics, unless the string has been decoded before.
More on Perl, charsets and encodings
Perl will not assume anything; the browser is choosing that encoding, usually based on guesswork. If none of the encoding techniques is used, the documents are output directly, just as they were written.
You can specify the charset in the HTTP Content-Type header.
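A minimal sketch, assuming the script uses CGI.pm (as many mature CGI applications do):

use CGI;
my $q = CGI->new;
print $q->header(-type => 'text/html', -charset => 'iso-8859-1');   # emits Content-Type: text/html; charset=iso-8859-1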
The first place I'd look is the server configuration. If you aren't setting the charset in the Content-Type header in the program, you're likely picking up the server's default.
Run the script separately from the server to see its actual output. When the server gets the output from a CGI program (one that's not nph), it fixes up the header, filling in anything it thinks is missing, before sending it to the client.
If the browser reports the content as iso-8859-1, maybe your Perl script didn't output the correct headers to specify the charset?