How can I save Perl/Expect output that contains mixed ASCII content?

I have a Perl script that uses the Expect module to log in to a remote system. I'm getting the final output of the interaction with the before method:
$exp->before();
I'm saving this to a text file. When I use cat on the file it outputs fine in the terminal, but when I open the text file in an editor or try to process it, the formatting is bizarre:
[H[2J[1;19HCIRCULATION ACTIVITY by TERMINAL (Nov 6,14)[11;1H
Is there a better way to save the output?
When I run enca it's identified as:
7bit ASCII characters
Surrounded by/intermixed with non-text data

You can remove the non-ASCII characters:
$str1 =~ s/[^[:ascii:]]//g;
print "$str1\n";
Note, though, that ANSI escape sequences are themselves plain ASCII (ESC is 0x1B), so this strips the binary noise but not the escape codes.

I was able to remove the ANSI escape codes from my output by using the Text::ANSI::Util module's ta_strip() function:
use Text::ANSI::Util qw(ta_strip);
my $ansi_string = $exp->before();
my $clean_string = ta_strip($ansi_string);
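If installing a module isn't an option, a plain regex can strip the most common CSI-style sequences (a rough sketch; other escape types would need additional patterns):
my $str = $exp->before();
$str =~ s/\e\[[0-9;?]*[A-Za-z]//g;   # ESC [ params letter, e.g. \e[2J or \e[1;19H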

Related

Trouble understanding C# URL decode with Unicode character(s) in PowerShell

I'm currently working on something that requires me to pass a Base64 string to a PowerShell script. But when decoding the string back to the original I get unexpected results: I need to use UTF-7 during decoding, and I don't understand why. Would someone know why?
The Mozilla documentation suggests that Base64 alone is insufficient if your string contains Unicode characters, so you need a workaround consisting of encodeURIComponent and a replace. I don't really get why the replace is needed, so I shortened it to btoa(escape('✓ à la mode')) to encode the string. The result of that operation is JXUyNzEzJTIwJUUwJTIwbGElMjBtb2Rl.
Using PowerShell to decode the string back to the original, I first need to undo the Base64 encoding. System.Convert can do that (yielding a byte array), and its output can be converted to a UTF-8 string using System.Text.Encoding. Together this looks like the following:
$bytes = [System.Convert]::FromBase64String($inputstring)
$utf8string = [System.Text.Encoding]::UTF8.GetString($bytes)
What's left is to URL decode the whole thing. As it is a UTF-8 string, I'd expect to run the URL decode without any further parameters, but if you do, you end up with an accented à that looks like � in a file or ? on the console. To get the actual original string it's necessary to tell the URL decode to use UTF-7 as the character set. It's nice that this works, but I don't really get why it's necessary, since the string should be UTF-8, and UTF-8 certainly supports an accented à. See the last two lines of the entire script for what I mean: with those two lines you end up with one line of garbled text and one line of the original text in the same file, encoded as UTF-8.
Entire PowerShell script:
Add-Type -AssemblyName System.Web
$inputstring = "JXUyNzEzJTIwJUUwJTIwbGElMjBtb2Rl"
$bytes = [System.Convert]::FromBase64String($inputstring)
$utf8string = [System.Text.Encoding]::UTF8.GetString($bytes)
[System.Web.HttpUtility]::UrlDecode($utf8string) | Out-File -Encoding utf8 C:\temp\output.txt
[System.Web.HttpUtility]::UrlDecode($utf8string, [System.Text.UnicodeEncoding]::UTF7) | Out-File -Append -Encoding utf8 C:\temp\output.txt
Clarification:
The problem isn't the conversion of the Base64 to UTF-8. The problem is some inconsistent behavior of C#'s UrlDecode. If you run escape('✓ à la mode') in your browser you end up with the following string: %u2713%20%E0%20la%20mode. So we have a %uXXXX representation of the check mark and a two-digit %E0 escape for the à. If we use this directly in UrlDecode we end up with the same error. My current assumption is that it's an issue with the encoding of the PowerShell window and pasting characters into it.
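To see the same mismatch outside PowerShell, here is a small Perl illustration (a sketch using URI::Escape and Encode): a standard URL decoder only understands %XX, so %u2713 passes through untouched, and the lone 0xE0 byte produced by %E0 is not valid UTF-8 on its own:
use URI::Escape qw(uri_unescape);
use Encode qw(decode);
my $bytes = uri_unescape('%u2713%20%E0%20la%20mode');
# $bytes is now "%u2713 \xE0 la mode"; decoding it as UTF-8 replaces
# the malformed \xE0 with U+FFFD (or dies under Encode::FB_CROAK)
my $garbled = decode('UTF-8', $bytes);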
Turns out it actually isn't all that strange; it's just that, for what I want to do, it's advantageous to use a newer function. I'm still not sure why it works if you use the UTF-7 encoding, but as an explanation:
... The hexadecimal form for characters, whose code unit value is 0xFF or less, is a two-digit escape sequence: %xx. For characters with a greater code unit, the four-digit format %uxxxx is used.
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/escape
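Going by that description, the legacy escape() format can be undone by hand. A hypothetical Perl helper (unescape_js is my own name, not a library function):
sub unescape_js {
    my ($s) = @_;
    $s =~ s/%u([0-9A-Fa-f]{4})/chr(hex($1))/ge;   # four-digit form %uXXXX
    $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;    # two-digit form %XX
    return $s;
}
my $decoded = unescape_js('%u2713%20%E0%20la%20mode');   # "✓ à la mode"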
As TesselatingHecksler pointed out, What is the proper way to URL encode Unicode characters? indicates that the %u format was never formally standardized. A newer function to escape characters exists, though: encodeURIComponent.
The encodeURIComponent() function encodes a Uniform Resource Identifier (URI) component by replacing each instance of certain characters by one, two, three, or four escape sequences representing the UTF-8 encoding of the character (will only be four escape sequences for characters composed of two "surrogate" characters).
The output of this function actually works with the C# implementation of UrlDecode without supplying an additional encoding of UTF-7.
The originally linked Mozilla article about Base64 encoding of UTF-8 strings modifies the whole process so that you can just call the Base64 decode function to get the whole string back. This is achieved by converting the URL-encoded version of the string to bytes.
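For what it's worth, Perl's URI::Escape can reproduce the encodeURIComponent flavour of escaping, which makes the round trip unambiguous (a sketch):
use URI::Escape qw(uri_escape_utf8 uri_unescape);
use Encode qw(decode);
my $escaped   = uri_escape_utf8("\x{2713} \x{E0} la mode");   # %E2%9C%93%20%C3%A0%20la%20mode
my $roundtrip = decode('UTF-8', uri_unescape($escaped));      # back to the original characters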

Perl write to file returns huge weird stacktrace

I have the following problem: when I try to save a file that contains a semicolon in its name, it returns a huge and weird stack trace of the characters on the page. I've tried to escape, trim, and replace those semicolons, but the result is still the same. I use the following regex:
$value =~ s/([^a-zA-Z0-9_\-.]|;)/uc sprintf("%%%02x",ord($1))/eg;
(I've even added the |; part separately.)
So, when I open the file for writing and call the print function, it outputs lots of weird stuff, like this:
PK!}�3y�[Content_Types].xml ���/�h9\�?�0���cz��:� �s_����o���>�T�� (it is a huge one, this is just a part of it).
Is there any way I could avoid this?
Thank you in advance!
EDIT:
Just interested: what is the PK responsible for in this string? I mean, I can understand that those chars are just the contents of the file, but what is PK? And why does it show the content type?
EDIT 2.0:
I'm uploading a .docx file - when the name doesn't contain a semicolon it all works fine. This is the code for saving the file:
open (QSTR, ">", $dest_file) or die "can't open output file: $dest_file";
print QSTR $value;
close (QSTR);
EDIT 3.0
This is a .cgi script that is called after posting some data to the server. It has to save some info about the uploaded file to a temp file (name, contents, size) as key-value pairs. Any file name that contains a semicolon causes this error.
EDIT 4.0
Found the cause:
CGI's param function treats the semicolon as a parameter delimiter when parsing the upload! Is there any way to escape it in the file header?
The PK in the file header means it is a ZIP-style compressed file (PK are the initials of Phil Katz, author of PKZIP), and .docx files are ZIP archives; the [Content_Types].xml you see is just the name of the first file stored inside.
One guess: the ; is not a valid character in file names at the destination?
Your regexp has a redundancy: the ; is already excluded by the negated character class, so the |; alternative adds nothing:
$value =~ s/([^a-zA-Z0-9_\-.]|;)/uc sprintf("%%%02x",ord($1))/eg;
Try this instead:
# replace every invalid char (including ;) with an underscore
$value =~ s/[^a-zA-Z0-9_.\-]/_/g;
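As for the question in EDIT 4.0: CGI.pm splits query strings on both & and ;, so the robust fix is to percent-encode the semicolon before the value reaches the parameter parser. A sketch with URI::Escape, assuming $filename holds the raw name:
use URI::Escape qw(uri_escape);
my $safe_name = uri_escape($filename, ';');   # only ';' is escaped, becoming '%3B'
# send $safe_name in the request; CGI's param() will then return the value intact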

Filtering microsoft 1252 characters out of an ASCII text file opened in utf8 mode in Perl

I have a reasonably sized flat-file database of text documents, mostly saved in 8859 format, which have been collected through a web form (using Perl scripts). Until recently I was handling the common 1252 characters (curly quotes, apostrophes, etc.) with a simple set of regexes:
$line=~s/\x91/\&\#8216\;/g; # smart apostrophe left
$line=~s/\x92/\&\#8217\;/g; # smart apostrophe right
... etc.
However, since I decided I ought to move to Unicode and converted all my scripts to read and write UTF-8 (which works a treat for all new material), the regexes for these (existing) 1252 characters no longer work, and my Perl HTML output literally prints the four characters '\x92', '\x93', etc. At least that's how it appears in a browser in UTF-8 mode; downloading (FTP, not HTTP) and opening in a text editor (TextPad) is different - a single undefined character remains - and opening the output file in Firefox's default (no Content-Type header) 8859 mode renders the correct character.
The new utf8 pragmas at the start of the script are:
use CGI qw(-utf8);
use open IO => ':utf8';
I understand this is because UTF-8 mode makes those characters two bytes instead of one, and that it applies to chars in the 0x80 to 0xff range. I've read the Wikibooks article on this but was none the wiser about how to filter them. Ideally I know I ought to resave all the documents in UTF-8 (since the flat-file database now contains a mixture of 8859 and UTF-8), but I'd need some kind of filter in the first place to do that anyway.
And I could be wrong about the two-byte internal storage, since the article seemed to imply that Perl handles strings very differently depending on the circumstances.
If anybody could provide me with a regex solution I would be very grateful. Or some other method - I've been tearing my hair out for weeks on this with various failed attempts and hacks. There are only about six 1252 characters that commonly need replacing, and with a filter method I could resave the whole flippin' lot in UTF-8 and forget there ever was a 1252...
Encoding::FixLatin was specifically written to help fix data broken in the same manner as yours.
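A minimal usage sketch (assuming $mixed_bytes holds the raw bytes read from the flat file):
use Encoding::FixLatin qw(fix_latin);
my $chars = fix_latin($mixed_bytes);   # valid UTF-8 passes through; stray 8859-1/CP1252 bytes are upgraded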
Ikegami already mentioned the Encoding::FixLatin module.
Another way to do it, if you know that each string will be either UTF-8 or CP1252, but not a mixture of both, is to read it as a binary string and do:
unless ( utf8::decode($string) ) {   # decodes in place; fails unless the bytes are valid UTF-8
    require Encode;
    $string = Encode::decode( cp1252 => $string );   # fall back to CP1252
}
Compared to Encoding::FixLatin, this has two small advantages: a slightly lower chance of misinterpreting CP1252 text as UTF-8 (because the entire string must be valid UTF-8) and the possibility of replacing CP1252 with some other fallback encoding. A corresponding disadvantage is that this code could fall back to CP1252 on strings that are not entirely valid UTF-8 for some other reason, such as because they were truncated in the middle of a multi-byte character.
You could also use Encode.pm's support for fallback.
use Encode qw[decode];

my $octets = "\x91 Foo \xE2\x98\xBA \x92";
my $string = decode('UTF-8', $octets, sub {
    my ($ordinal) = @_;   # ordinal value of the malformed byte
    return decode('Windows-1252', pack('C', $ordinal));
});
printf "<%s>\n",
    join ' ', map { sprintf 'U+%.4X', ord $_ } split //, $string;
Output:
<U+2018 U+0020 U+0046 U+006F U+006F U+0020 U+263A U+0020 U+2019>
Did you recode the data files? If not, opening them as UTF-8 won't work. You can simply open them as
open $filehandle, '<:encoding(cp1252)', $filename or die ...;
and everything (tm) should work.
If you did recode, something seems to have gone wrong, and you need to analyze what it is and fix it. I recommend using hexdump to find out what is actually in a file. Text consoles and editors sometimes lie to you; hexdump never lies.
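If hexdump isn't available, a crude Perl stand-in will do (a sketch; 'suspect.txt' is a placeholder file name):
open my $fh, '<:raw', 'suspect.txt' or die "open: $!";
while ( read $fh, my $chunk, 16 ) {
    printf '%02x ', ord($_) for split //, $chunk;   # 16 bytes per line, in hex
    print "\n";
}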

Python 3 CGI: how to output raw bytes

I decided to use Python 3 for making my website, but I encountered a problem with Unicode output.
It seems like a plain print(html) (where html is a str) should work, but it's not. I get UnicodeEncodeError: 'ascii' codec can't encode characters[...]: ordinal not in range(128). This must be because the webserver doesn't support Unicode output.
The next thing I tried was print(html.encode('utf-8')), but I got something like the repr output of the byte string: it is wrapped in b'...' and all the escape characters are in raw form (e.g. \n and \xd0\x9c).
Please show me the correct way to output a Unicode (str) string as a raw UTF-8 encoded byte string in Python 3.1.
The problem here is that your stdout isn't attached to an actual terminal and will use the ASCII encoding by default. Therefore you need to write to sys.stdout.buffer, which is the "raw" binary output of sys.stdout. This can be done in various ways; the most common one seems to be:
import codecs, sys
writer = codecs.getwriter('utf8')(sys.stdout.buffer)
and then use writer. In a CGI script you may be able to replace sys.stdout with the writer:
sys.stdout = codecs.getwriter('utf8')(sys.stdout.buffer)
which might actually work, so you can print normally. Try that!

Why does Perl's LWP give me a different encoding than the original website?

Let's say I have this code:
use strict;
use LWP qw ( get );
my $content = get ( "http://www.msn.co.il" );
print STDERR $content;
The error log shows something like "\xd7\x9c\xd7\x94\xd7\x93\xd7\xa4\xd7\xa1\xd7\x94", which I'm guessing is UTF-16?
The website declares its encoding with
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1255">
so why do these characters appear and not the windows-1255 chars?
And another weird thing is that I have two servers:
the first server returns CP1255 chars and I can simply convert them to utf8,
while the current server gives me these chars and I can't do anything with them...
Is there any configuration file in Apache/Perl/some module that is messing up the encoding? Forcing something?
The result on my website on the second server is that the Perl file and the headers are all utf8, so when I write non-English text, the content from the example above shows OK (even though it's weird UTF chars), but my own static text looks like "×ס'××ר××:"
One more thing that I tested...
Through Perl:
my $content = `curl "http://www.anglo-saxon.co.il"`;
I get utf8 encoding.
Through Bash:
curl "http://www.anglo-saxon.co.il"
and here I get CP1255 (Windows-1255) encoding...
Also, when I run the script in bash it gives CP1255, and when I run it through the web it's UTF-8 again...
I fixed the problem by converting the content from utf8 to what it's supposed to be, and then back to utf8:
use Text::Iconv;
my $to_cp1255 = Text::Iconv->new("utf8", "CP1255");
$content = $to_cp1255->convert($content);
my $to_utf8 = Text::Iconv->new("CP1255", "utf8");
$content = $to_utf8->convert($content);
All of this manual encoding and decoding is unnecessary. The HTML is lying to you when it says that the page is encoded in windows-1255; the server says it's serving UTF-8, and it is. Blame Microsoft HTML-generation tools.
Anyway, since the server does return the correct encoding, this works:
my $response = LWP::UserAgent->new->get("http://www.msn.co.il/");
my $content = $response->decoded_content;
$content is now a Perl character string, ready for whatever you need. If you want to convert it to some other encoding, calling Encode::encode on it is appropriate; do not use Encode::decode, as it has already been decoded once.
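For example, to get CP1255 bytes back out of the decoded string (a sketch):
use Encode qw(encode);
my $cp1255_bytes = encode('cp1255', $content);   # Perl characters -> CP1255-encoded bytes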
http://www.msn.co.il is in UTF-8, and indicates that properly. The string "\xd7\x9c\xd7\x94\xd7\x93\xd7\xa4\xd7\xa1\xd7\x94" is also proper UTF-8 (להדפסה). I don't see the problem.
I think your second problem is due to mixing different encodings (UTF-8 and Windows-1255). You might want to encode/decode your strings properly.
First, note that you should import get from LWP::Simple. Second, everything works fine with:
#!/usr/bin/perl
use strict; use warnings;
use LWP::Simple qw ( getstore );
getstore 'http://www.msn.co.il', 'test.html';
which indicates to me that the problem is the encoding of the filehandle to which you are sending the output.
The string with the hex values that you gave is a UTF-8 encoding. You are getting this because Perl ‘likes to’ use UTF-8 when it deals with strings. LWP::Simple's get() automatically decodes the content from the server, which includes undoing any Content-Encoding as well as decoding the declared charset into Perl's internal character string.
You could dig into the internals and take control of the character encoding yourself (see HTTP::Message's decoded_content, which HTTP::Response inherits, and which you can reach from LWP::UserAgent's get). But it may be easier to re-encode the data in your desired encoding with something like
use Encode;
...;
my $cp1255_bytes = encode('CP1255', decode('UTF-8', $utf8_bytes));
The mixed readable/garbage characters you see come from mixing multiple, incompatible encodings in the same stream. Probably the stream is labeled as UTF-8 but you are putting CP1255-encoded characters into it. You either need to label the stream as CP1255 and put only CP1255-encoded data into it, or label it as UTF-8 and put only UTF-8-encoded data into it. Remind yourself that bytes are not characters and convert between them appropriately.
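In practice that means labelling the handle once and then printing only character data through it; a minimal sketch:
binmode STDOUT, ':encoding(UTF-8)';   # declare the stream's encoding once
print "\x{05DC}\x{05D4}\x{05D3}\x{05E4}\x{05E1}\x{05D4}\n";   # להדפסה, printed as proper UTF-8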