Decoding URI related entities with Perl - perl

I may not even be referring to this the proper way so my apologies in advance. Our server logs are constantly showing us an encoded style of attack. An example is below....
http://somecompany.com/script.pl?var=%20%D1%EB........ (etc etc)
I am familiar with encoding and decoding HTML entities using Perl (using HTML::Entities) but I am not even sure how to refer to this style of decoding. I'd love to be able to write a script to decode these URI encodings (?). Is there a module that anyone knows of that can point me in the right direction?
Nikki

Use the URI::Escape module to escape and unescape URI-encoded strings.
Example:
use strict;
use warnings;
use URI::Escape;
my $uri = "http://somecompany.com/script.pl?var=%20%D1%EB";
my $decoded = uri_unescape( $uri );
print $decoded, "\n";

There are online resources such as http://www.albionresearch.com/misc/urlencode.php for doing quick encoding/decoding of a string.
Programmatically, you can do this:
use URI::Escape;
my $str = uri_unescape("%20%D1%EB");
print $str . "\n";
or simply:
perl -MURI::Escape -wle'print uri_unescape("%20%D1%EB");'

Related

Perl drop down menus and Unicode

I've been going around on this for some time now and can't quite get it. This is Perl 5 on Ubuntu. I have a drop down list on my web page:
$output .= start_form . "Student: " . popup_menu(-name=>'student', -values=>['', @students], -labels=>\%labels, -onChange=>'Javascript:submit()') . end_form;
It's just a set of names in the form "Last, First" that are coming from a SQL Server table. The labels are created from the SQL columns like so:
$labels{uc($record->{'id'})} = $record->{'lastname'} . ", " . $record->{'firstname'};
The issue is that the drop down isn't displaying some Unicode characters correctly. For instance, "Søren" shows up in the drop down as "SÃ¸ren". I have in my header:
use utf8;
binmode(STDOUT, ":utf8");
...and I've also played around with various takes on the "decode( )" function, to no avail. To me, the funny thing is that if I pull $labels into a test script and print the list to the console, the names appear just fine! So what is it about the drop down that is causing this? Thank you in advance.
EDIT:
This is the relevant functionality, which I've stripped down to this script that runs in the console and yields the correct results for three entries that have Unicode characters:
#!/usr/bin/perl
use DBI;
use lib '/home/web/library';
use mssql_util;
use Encode;
binmode(STDOUT, ":utf8");
$query = "[SQL query here]";
$dbh = &connect;
$sth = $dbh->prepare($query);
$result = $sth->execute();
while ($record = $sth->fetchrow_hashref())
{
    if ($record->{'id'})
    {
        $labels{uc($record->{'id'})} = Encode::decode('UTF-8', $record->{'lastname'} . ", " . $record->{'nickname'} . " (" . $record->{'entryid'} . ")");
    }
}
$sth->finish();
print "$labels{'ST123'}\n";
print "$labels{'ST456'}\n";
print "$labels{'ST789'}\n";
The difference in the production script is that, instead of printing to the console as above, it prints to HTTP:
$my_output = "<p>$labels{'ST123'}</p><br>
<p>$labels{'ST456'}</p><br>
<p>$labels{'ST789'}</p>";
$template =~ s/\$body/$my_output/;
print header(-cookie=>$cookie) . $template;
This gives, e.g., strings like "ZoÃ«" and "SÃ¸ren" on the page. BUT, if I remove binmode(STDOUT, ":utf8"); from the top of the production script, then the strings appear just fine on the page (i.e. I get "Zoë" and "Søren").
I believe that the binmode( ) line is necessary when writing UTF-8 to output, and yet removing it here produces the correct results. What gives?
Problem #1: Decoding inputs
53.C3.B8.72.65.6E is the UTF-8 encoding of Søren. When you instruct Perl to encode it all over again (by printing it to a handle with the :utf8 layer), you are producing garbage.
You need to decode your inputs ($record->{id}, $record->{lastname}, $record->{firstname}, etc.)! This will transform the UTF-8 bytes 53.C3.B8.72.65.6E ("encoded text") into the Unicode code points 53.F8.72.65.6E ("decoded text").
In this form, you will be able to use uc, regex matches, etc. You will also be able to print them out to a handle with an encoding layer (e.g. :encoding(UTF-8), or the improper :utf8).
You let on that these inputs come from a database. Most DBDs have a flag that causes strings to be decoded. For example, if it's a MySQL database, you should pass mysql_enable_utf8mb4 => 1 to connect.
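As a rough sketch (assuming, as in the question, that the columns come back from DBI as raw UTF-8 bytes and that %labels and the column names are the ones shown above), decoding each field before building the labels looks something like this:
use Encode qw(decode);
while ( my $record = $sth->fetchrow_hashref() )
{
    next unless $record->{'id'};
    # Turn the raw UTF-8 bytes from the database into Perl characters
    my $id    = decode( 'UTF-8', $record->{'id'} );
    my $last  = decode( 'UTF-8', $record->{'lastname'} );
    my $first = decode( 'UTF-8', $record->{'firstname'} );
    # uc() and friends now operate on characters, not bytes
    $labels{ uc($id) } = "$last, $first";
}
Print %labels through a handle with an :encoding(UTF-8) layer and the names come out intact.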
Problem #2: Communicating encoding
If you're going to output UTF-8, don't tell the browser it's ISO-8859-1!
$ perl -e'use CGI qw( :standard ); print header()'
Content-Type: text/html; charset=ISO-8859-1
Fixed:
$ perl -e'use CGI qw( :standard ); print header( -type => "text/html; charset=UTF-8" )'
Content-Type: text/html; charset=UTF-8
Hard to give a definitive solution as you don't give us much useful information. But here are some pointers that might help.
use utf8 only tells Perl that your source code is encoded as UTF-8. It does nothing useful here.
Reading perldoc perlunitut would be a good start.
Do you know how your database tables are encoded?
Do you know whether your database connection is configured to automatically decode data coming from the database into Perl characters?
What encoding are you telling the browser that you have encoded your HTTP response in?

MIME::Base64::decode_base64 wrong characters

I'm having some trouble using Perl's MIME::Base64::decode_base64.
Here is my code:
#!/usr/bin/perl
use MIME::Base64;
$string_to_decrypt="lVvfrx23jX7vX3HghyJGxo4oivqBIg";
$content=MIME::Base64::decode_base64($string_to_decrypt);
open(WRITE,">/home/laurent/decrypted.txt");
print WRITE $content;
close(WRITE);
exit;
Using an online decoder (like https://www.base64decode.org/), the result should be:
[߯·~ï_qà"FÆ(ú"
But in my file, I get:
<95>[߯^]·<8d>~ï_qà<87>"FÆ<8e>(<8a>ú<81>"
I don't know how to get rid of:
<95>, ^], <8d>,<87> ....
Thanks
Laurent
This is clearly not text, so it's no surprise it doesn't render properly when printed as text. base64decode.org actually produces the same correct result as decode_base64, which is the following bytes:
95.5B.DF.AF.1D.B7.8D.7E.EF.5F.71.E0.87.22.46.C6.8E.28.8A.FA.81.22
You can use either of the following to remove the characters you identified, but that is most definitely the wrong thing to do.
$content =~ tr/\x1D\x87\x8D\x95//d;
-or-
$content =~ s/[\x1D\x87\x8D\x95]//g;
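If the goal is just to confirm what decode_base64 produced, a safer approach than printing binary data as text is to dump the bytes as hex and write them through a raw (binary) filehandle. A small sketch (the output filename simply mirrors the question's path):
use MIME::Base64 qw(decode_base64);
my $content = decode_base64("lVvfrx23jX7vX3HghyJGxo4oivqBIg");
# Show the bytes as dotted hex, e.g. 95.5B.DF.AF...
print join('.', map { sprintf '%02X', ord } split //, $content), "\n";
# Write the bytes untouched, with no text layer applied
open my $fh, '>:raw', '/home/laurent/decrypted.bin' or die "open: $!";
print {$fh} $content;
close $fh;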

Decode unicode escape characters with perl

I hate to ask a question that's undoubtedly been answered a dozen times before, but I find encoding issues confusing and am having a hard time matching up other people's q/a with my own problem.
I'm pulling information from a json file online, and my perl script isn't handling unicode escape characters properly.
Script looks like this:
use LWP::Simple;
use JSON;
my $url = ______;
my $json = get($url);
my $data = decode_json($json);
foreach my $i (0 .. $#{ $data->{People} }) {
    print "$data->{People}[$i]{first_name} $data->{People}[$i]{last_name}\n";
}
It encounters JSON that looks like this: "first_name":"F\u00e9lix","last_name":"Cat" and prints it like this: FΘlix Cat
I'm sure there's a trivial fix here, but I'm stumped. I'd really appreciate any help you can provide.
You didn't tell Perl how to encode the output. You need to add
use open ':std', ':encoding(XXX)';
where XXX is the encoding the terminal expects.
On unix boxes, you normally need
use open ':std', ':encoding(UTF-8)';
On Windows boxes, you normally need
use Win32 qw( );
use open ':std', ':encoding(cp'.Win32::GetConsoleOutputCP().')';

How to use unicode in perl CGI param

I have a Perl CGI script accepting unicode characters as one of the params.
The url is of the form
.../worker.pl?text="some_unicode_chars"&...
In the perl script, I pass the $text variable to a shell script:
system "a.sh \"$text\" out_put_file";
If I hardcode the text in the Perl script, it works well. However, the output makes no sense when $text comes from the web via CGI.
my $q = CGI->new;
my $text = $q->param('text');
I suspect the encoding is causing the problem. UTF-8 has caused me so much trouble. Can anyone help?
Perhaps this will help. From Perl Programming/Unicode UTF-8:
By default, CGI.pm does not decode your form parameters. You can use
the -utf8 pragma, which will treat (and decode) all parameters as
UTF-8 strings, but this will fail if you have any binary file upload
fields. A better solution involves overriding the param method:
(example follows)
[Wrong - see Correction] Here's documentation for the utf-8 pragma. Since uploading binary data does not appear to be a concern for you, use of the utf-8 pragma appears to be the most straightforward approach.
Correction: Per the comment from @Slaven, do not confuse the general Perl utf8 pragma with the -utf8 pragma that has been defined for use with CGI.pm:
-utf8
This makes CGI.pm treat all parameters as UTF-8 strings. Use this with
care, as it will interfere with the processing of binary uploads. It
is better to manually select which fields are expected to return utf-8
strings and convert them using code like this:
use Encode;
my $arg = decode utf8=>param('foo');
Follow Up: duleshi, you ask: But I still don't understand the difference between decode in Encode and utf8::decode. How do the Encode and utf8 modules differ?
From the documentation for the utf8 pragma:
Note that this function does not handle arbitrary encodings. Therefore
Encode is recommended for the general purposes; see also Encode.
Put another way, the Encode module works with many different encodings (including UTF-8), whereas the utf8 functions work only with the UTF-8 encoding.
Here is a Perl program that demonstrates the equivalence of the two approaches to encoding and decoding UTF-8. (Also see the live demo.)
#!/usr/bin/perl
use strict;
use warnings;
use utf8; # allows 'ñ' to appear in the source code
use Encode;
my $word = "Español"; # the 'ñ' is permitted because of the 'use utf8' pragma
# Convert the string to its UTF-8 equivalent.
my $utf8_word = Encode::encode("UTF-8", $word);
# Use 'utf8::decode' to convert the string back to internal form.
my $word_again_via_utf8 = $utf8_word;
utf8::decode($word_again_via_utf8); # converts in-place
# Use 'Encode::decode' to convert the string back to internal form.
my $word_again_via_Encode = Encode::decode("UTF-8", $utf8_word);
# Do the two conversion methods produce the same result?
# Prints 'Yes'.
print $word_again_via_utf8 eq $word_again_via_Encode ? "Yes\n" : "No\n";
# Do we get back the original internal string after converting both ways?
# Prints 'Yes'.
print $word eq $word_again_via_Encode ? "Yes\n" : "No\n";
If you're passing UTF-8 data around in the parameters list, then you definitely want to be URI encoding them using the URI::Escape module. This will convert any extended characters to percent values which are easily printable and readable. On the receiving end you will then need to URI decode them before continuing.
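For example (a sketch; the literal name is just an illustration), URI::Escape's uri_escape_utf8 percent-encodes a character string as UTF-8 bytes, and the receiving end reverses it with uri_unescape followed by a decode:
use utf8;                    # the source below contains a literal "ø"
use URI::Escape qw(uri_escape_utf8 uri_unescape);
use Encode qw(decode);
# Sender: character string -> percent-encoded UTF-8 bytes
my $param = uri_escape_utf8("Søren");                  # "S%C3%B8ren"
# Receiver: undo the percent-encoding, then decode the bytes
my $text  = decode( 'UTF-8', uri_unescape($param) );   # back to "Søren"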

Unicode string mess in perl

I have an external module that returns some strings. I am not sure how exactly the strings are returned. I don't really know how Unicode strings work, or why.
The module should return, for example, the Czech word "být", meaning "to be". If I display the string returned by the module with Data::Dumper, I see it as b\x{fd}t.
However, if I try to print it with print $s, I get a "Wide character in print" warning, and ? instead of ý.
If I try Encode::decode(whatever, $s);, the resulting string cannot be printed anyway (always with the "Wide character" warning, sometimes with mangled characters, sometimes right), no matter what I put in whatever.
If I try Encode::encode("utf-8", $s);, the resulting string CAN be printed without the problems or error message.
If I use use encoding 'utf8';, printing works without any need of encoding/decoding. However, if I use the IO::CaptureOutput or Capture::Tiny module, it starts shouting "Wide character" again.
I have a few questions, mostly about what exactly happens. (I tried to read the perldocs, but I did not come away much wiser.)
Why can't I print the string right after getting it from the module?
Why can't I print the string, decoded by "decode"? What exactly "decode" did?
What exactly "encode" did, and why there was no problem in printing it after encoding?
What exactly does use encoding do? Why is the default encoding different from utf-8?
What do I have to do, if I want to print the scalars without any problems, even when I want to use one of the capturing modules?
edit: Some people tell me to use -C or binmode or PERL_UNICODE. That is great advice. However, somehow, both of the capturing modules magically destroy the UTF8-ness of STDOUT. That seems to be more of a bug in the modules, but I am not really sure.
edit2: OK, the best solution was to dump the modules and write the "capturing" myself (with much less flexibility).
Because you are outputting a string in Perl's internal form (utf8) to a non-Unicode filehandle.
The decode function decodes a sequence of bytes assumed to be in ENCODING into Perl's internal form (utf8). Your input seems to be already decoded.
The encode() function encodes a string from Perl's internal form into ENCODING.
The encoding pragma allows you to write your script in any encoding you like. String literals are automatically converted to perl's internal form.
Make sure Perl knows which encoding your data comes in, and which encoding it should go out in.
See also perluniintro, perlunicode, Encode module, binmode() function.
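A minimal illustration of those points (assuming the module really does hand back an already-decoded character string, which is what the Data::Dumper output b\x{fd}t suggests): put an encoding layer on STDOUT before printing.
use strict;
use warnings;
# Pretend this came from the external module: already-decoded characters
my $s = "b\x{fd}t";                      # "být" in Perl's internal form
binmode STDOUT, ':encoding(UTF-8)';      # tell Perl how to encode on output
print $s, "\n";                          # prints "být", no warning
Encoding by hand with Encode::encode('UTF-8', $s) before printing would also avoid the problem, because print then sees plain bytes.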
I recommend reading the Unicode chapter of my book Effective Perl Programming. We put together all the docs we could find and explained Unicode in Perl much more coherently than I've seen anywhere else.
This program works fine for me:
#!perl
use utf8;
use 5.010;
binmode STDOUT, ':utf8';
my $string = return_string();
say $string;
sub return_string { 'být' }
Additionally, Capture::Tiny works just fine for me:
#!perl
use utf8;
use 5.010;
use Capture::Tiny qw(capture);
binmode STDOUT, ':utf8';
my( $stdout, $stderr ) = capture {
system( $^X, '/Users/brian/Desktop/czech.pl' );
};
say "STDOUT is [$stdout]";
IO::CaptureOutput seems to have some problems though:
#!perl
use utf8;
use 5.010;
use IO::CaptureOutput qw(capture);
binmode STDOUT, ':utf8';
capture {
system( $^X, '/Users/brian/Desktop/czech.pl' );
} \my $stdout, \my $stderr;
say "STDOUT is [$stdout]";
For this I get:
STDOUT is [bÃ½t
]
However, that's easy to fix. Don't use that module. :)
You should also look at the PERL_UNICODE environment variable, which is the same as using the -C option. That allows you to set STDIN/STDOUT/STDERR (and @ARGV) to be UTF-8 without having to alter your scripts.
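For instance (a sketch; czech.pl stands in for your own script), either of these makes the standard handles UTF-8 without editing the source:
$ PERL_UNICODE=S perl czech.pl   # S = treat STDIN, STDOUT and STDERR as UTF-8
$ perl -CS czech.pl              # the same thing, via the -C switch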