MIME::Base64::decode_base64 wrong characters - perl

I'm having some trouble using Perl's MIME::Base64::decode_base64.
Here is my code:
#!/usr/bin/perl
use MIME::Base64;
$string_to_decrypt="lVvfrx23jX7vX3HghyJGxo4oivqBIg";
$content=MIME::Base64::decode_base64($string_to_decrypt);
open(WRITE,">/home/laurent/decrypted.txt");
print WRITE $content;
close(WRITE);
exit;
According to an online decoder (like https://www.base64decode.org/), the result should be:
[߯·~ï_qà"FÆ(ú"
But in my file, I get:
<95>[߯^]·<8d>~ï_qà<87>"FÆ<8e>(<8a>ú<81>"
I don't know how to get rid of:
<95>, ^], <8d>,<87> ....
Thanks
Laurent

This is clearly not text, so it's no surprise it doesn't render properly when printed as text. base64decode.org actually produces the same correct result as decode_base64, which is the following bytes:
95.5B.DF.AF.1D.B7.8D.7E.EF.5F.71.E0.87.22.46.C6.8E.28.8A.FA.81.22
You can use either of the following to remove the characters you identified, but that is most definitely the wrong thing to do.
$content =~ tr/\x1D\x87\x8D\x95//d;
-or-
$content =~ s/[\x1D\x87\x8D\x95]//g;
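To check what decode_base64 actually returned, it can help to dump the bytes as hex instead of printing them as text. This is only a minimal sketch; the variable names are illustrative and not taken from the original post.

```perl
use strict;
use warnings;
use MIME::Base64 qw(decode_base64);

# The Base64 string from the question.
my $encoded = "lVvfrx23jX7vX3HghyJGxo4oivqBIg";
my $bytes   = decode_base64($encoded);

# Render every byte as two hex digits, separated by dots.
my $hex = join '.', map { sprintf '%02X', ord } split //, $bytes;
print "$hex\n";
# 95.5B.DF.AF.1D.B7.8D.7E.EF.5F.71.E0.87.22.46.C6.8E.28.8A.FA.81.22
```

If the bytes are then written to a file, calling binmode on the output handle keeps Perl from applying any newline translation to them.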

Perl drop down menus and Unicode

I've been going around on this for some time now and can't quite get it. This is Perl 5 on Ubuntu. I have a drop down list on my web page:
$output .= start_form . "Student: " . popup_menu(-name=>'student', -values=>['', @students], -labels=>\%labels, -onChange=>'Javascript:submit()') . end_form;
It's just a set of names in the form "Last, First" that are coming from a SQL Server table. The labels are created from the SQL columns like so:
$labels{uc($record->{'id'})} = $record->{'lastname'} . ", " . $record->{'firstname'};
The issue is that the drop down isn't displaying some Unicode characters correctly. For instance, "Søren" shows up in the drop down as "SÃ¸ren". I have in my header:
use utf8;
binmode(STDOUT, ":utf8");
...and I've also played around with various takes on the "decode( )" function, to no avail. To me, the funny thing is that if I pull $labels into a test script and print the list to the console, the names appear just fine! So what is it about the drop down that is causing this? Thank you in advance.
EDIT:
This is the relevant functionality, which I've stripped down to this script that runs in the console and yields the correct results for three entries that have Unicode characters:
#!/usr/bin/perl
use DBI;
use lib '/home/web/library';
use mssql_util;
use Encode;
binmode(STDOUT, ":utf8");
$query = "[SQL query here]";
$dbh = &connect;
$sth = $dbh->prepare($query);
$result = $sth->execute();
while ($record = $sth->fetchrow_hashref())
{
if ($record->{'id'})
{
$labels{uc($record->{'id'})} = Encode::decode('UTF-8', $record->{'lastname'} . ", " . $record->{'nickname'} . " (" . $record->{'entryid'} . ")");
}
}
$sth->finish();
print "$labels{'ST123'}\n";
print "$labels{'ST456'}\n";
print "$labels{'ST789'}\n";
The difference in what the production script is doing is that instead of printing to the console like above, it's printing to HTTP:
$my_output = "<p>$labels{'ST123'}</p><br>
<p>$labels{'ST456'}</p><br>
<p>$labels{'ST789'}</p>";
$template =~ s/\$body/$my_output/;
print header(-cookie=>$cookie) . $template;
This gives strings like "ZoÃ«" and "SÃ¸ren" on the page. BUT, if I remove binmode(STDOUT, ":utf8"); from the top of the production script, then the strings appear just fine on the page (i.e. I get "Zoë" and "Søren").
I believe that the binmode( ) line is necessary when writing UTF-8 to output, and yet removing it here produces the correct results. What gives?
Problem #1: Decoding inputs
53.C3.B8.72.65.6E is the UTF-8 encoding for Søren. When you instruct Perl to encode it all over again (by printing it to handle with the :utf8 layer), you are producing garbage.
You need to decode your inputs ($record->{id}, $record->{lastname}, $record->{firstname}, etc)! This will transform the UTF-8 bytes 53.C3.B8.72.65.6E ("encoded text") into the Unicode Code Points 53.F8.72.65.6E ("decoded text").
In this form, you will be able to use uc, regex matches, etc. You will also be able to print them out to a handle with an encoding layer (e.g. :encoding(UTF-8), or the improper :utf8).
You let on that these inputs come from a database. Most DBD have a flag that causes strings to be decoded. For example, if it's a MySQL database, you should pass mysql_enable_utf8mb4 => 1 to connect.
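The difference between decoding the input and encoding it a second time can be seen in a small sketch like the one below. It uses the core Encode module; hex_dump is a hypothetical helper written for this illustration, and the byte string stands in for whatever the database driver returns.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: show each character's ordinal as hex, dot-separated.
sub hex_dump { join '.', map { sprintf '%02X', ord } split //, shift }

# "Søren" as UTF-8 bytes, as it might arrive undecoded from the database.
my $bytes = "\x53\xC3\xB8\x72\x65\x6E";

# Decoding turns the two bytes C3 B8 into the single code point U+00F8.
my $text = decode('UTF-8', $bytes);

# Encoding the undecoded bytes treats C3 and B8 as two separate characters,
# which is what an output layer does to data that was never decoded.
my $double = encode('UTF-8', $bytes);

print hex_dump($text),   "\n";   # 53.F8.72.65.6E
print hex_dump($double), "\n";   # 53.C3.83.C2.B8.72.65.6E
```

The second line of output is the byte sequence a browser renders as "SÃ¸ren" when it reads the page as UTF-8.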
Problem #2: Communicating encoding
If you're going to output UTF-8, don't tell the browser it's ISO-8859-1!
$ perl -e'use CGI qw( :standard ); print header()'
Content-Type: text/html; charset=ISO-8859-1
Fixed:
$ perl -e'use CGI qw( :standard ); print header( -type => "text/html; charset=UTF-8" )'
Content-Type: text/html; charset=UTF-8
Hard to give a definitive solution as you don't give us much useful information. But here are some pointers that might help.
use utf8 only tells Perl that your source code is encoded as UTF-8. It does nothing useful here.
Reading perldoc perlunitut would be a good start.
Do you know how your database tables are encoded?
Do you know whether your database connection is configured to automatically decode data coming from the database into Perl characters?
What encoding are you telling the browser that you have encoded your HTTP response in?

Encode module and inverted commas

I am scraping a web page, and extracting a specific section from it. That section includes inverted commas (’, character 146). I'm trying to print my extracted data to a text file, but it's giving me â€™ instead of the inverted comma. I have tried the following:
$content =~ s/’/'/g;
my $invComma = chr 146;
$content =~ s/$invComma/'/g;
$content =~ s/\x{0092}/'/g;
None of it has worked. I can't decode('UTF-8', $content) because it has wide characters. When I try to encode('UTF-8', $content) the â€™ changes to Ã¢â‚¬â„¢ instead. I have already tried use utf8 as well, to no effect.
I know that my text file viewer can display inverted commas, because I printed one to a test file and opened it. The problem is therefore in my script.
What am I doing wrong, and how do I fix it?
UPDATE: I am able to do $content =~ s/â€™/'/g to replace it with a simple apostrophe, but I still don't know why nothing else works. I'd also like a fix that actually solves the problem, instead of just solving one of the symptoms.
UPDATE 2: I have been informed by hobbs that the character is actually U+2019 RIGHT SINGLE QUOTATION MARK and changed my regex to use chr 0x2019 which now works.
The character you're trying to replace is only 0x92 / 146 in the Windows-1252 encoding. Perl uses Unicode, where that character is U+2019 RIGHT SINGLE QUOTATION MARK, aka "\x{2019}", chr(0x2019), or chr(8217).
Start by finding out what $content contains. You can use the following:
use Data::Dumper;
local $Data::Dumper::Useqq = 1;
warn(Dumper($content));
If you get the following, $content is decoded
$VAR1 = "...\x{2019}...";
Any of the following will work.
use utf8; # Source code is encoded using UTF-8.
$content =~ s/’/'/g;
$content =~ s/\x{2019}/'/g;
$content =~ s/\N{U+2019}/'/g;
$content =~ s/\N{RIGHT SINGLE QUOTATION MARK}/'/g;
If you get the following, $content is encoded using UTF-8.
$VAR1 = "...\342\200\231...";
Start by decoding the value of $content using either of the following:
utf8::decode($content) or die;
use Encode qw( decode_utf8 );
$content = decode_utf8($content);
Then use any of the solutions for decoded content (above).
If you get the following, $content is encoded using cp1252.
$VAR1 = "...\222...";
Start by decoding the value of $content.
use Encode qw( decode );
$content = decode("cp1252", $content);
Then use any of the solutions for decoded content (above).
By the way, â€™ is what the UTF-8 encoding of ’ (E2 80 99) would look like if decoded as cp1252.
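That last observation can be reproduced with the Encode module. The sketch below is only an illustration (the variable names are made up); it encodes U+2019 as UTF-8 and then deliberately decodes those bytes with the wrong encoding.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $quote      = "\x{2019}";                    # RIGHT SINGLE QUOTATION MARK
my $utf8_bytes = encode('UTF-8', $quote);       # bytes E2 80 99
my $misread    = decode('cp1252', $utf8_bytes); # wrong decoder on purpose

# List the code points that result from the mistaken decode.
my $points = join ' ', map { sprintf 'U+%04X', ord } split //, $misread;
print "$points\n";   # U+00E2 U+20AC U+2122
```

Those three code points are â, € and ™, which is the three-character sequence reported in the question.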
The problem was not in my script, it was in my editor. The script works properly, and the question is based on false pretenses. I was using gVim on Windows, which did not play nicely with Unicode. My script was properly decoding the content, but when I opened the output file in gVim, it mangled the text and displayed it incorrectly. My attempts to use regular expressions to change the characters failed because I was using the wrong codepoint - it wasn't 0x92, it was 0x2019. This was another failing of gVim. Thanks to hobbs and ikegami for helping me figure this out.

Perl split a string on a repeating pattern in the most efficient manner?

I'd like to split a string that has a certain repeating pattern, for example:
$string = "GGGGG-SOMETHING-ELSE-GGG-LAST";
to
@array=(-SOMETHING-ELSE-,-LAST);
my attempt so far as a perl newbie
split(/G{2,}/,$string);
Unfortunately this seems to split only on GG, not on the greedy GGGGG or GGG patterns that I had hoped for, which should result in 2 array elements.
No, this seems to (mostly) work as intended. The following code:
use strict;
use warnings;
$_="GGGGG-SOMETHING-ELSE-GGG-LAST";
my @a=split(/G{2,}/,$_);
print join(",",@a) . "\n";
produces the output:
,-SOMETHING-ELSE-,-LAST
The issue is that there's a first element that's the empty string. So, to fix that, you can do something like:
use strict;
use warnings;
$_="GGGGG-SOMETHING-ELSE-GGG-LAST";
my @a=grep{$_ ne ""}(split(/G{2,}/,$_));
print join(",",@a) . "\n";
And this produces what you want:
-SOMETHING-ELSE-,-LAST
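The empty first element comes from how split treats empty fields: a leading empty field is kept, while trailing empty fields are dropped unless a negative limit is given. A small sketch with made-up input strings:

```perl
use strict;
use warnings;

# Separator at the start: the leading empty field is kept.
my @lead  = split /G{2,}/, "GGG-A-GGG-B";        # ("", "-A-", "-B")

# Separator at the end: the trailing empty field is dropped by default.
my @trail = split /G{2,}/, "-A-GGG-B-GGG";       # ("-A-", "-B-")

# A negative limit keeps trailing empty fields too.
my @kept  = split /G{2,}/, "-A-GGG-B-GGG", -1;   # ("-A-", "-B-", "")

print scalar(@lead), " ", scalar(@trail), " ", scalar(@kept), "\n";   # 3 2 3
```

This is why only the leading empty string needs to be filtered out in the example above.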
I just checked your code in my machine and it works fine:
$string = "GGGGG-SOMETHING-ELSE-GGG-LAST";
print join(':', split(/G{2,}/,$string));
returns:
:-SOMETHING-ELSE-:-LAST
The version of perl I'm using is: v5.10.1
Can you please add some more info about how you are running it?

Why can't I use the map function to create a good hash from a simple data file in Perl?

The post is updated. Please kindly jump to the Solution part, if you've already read the posted question. Thanks!
Here's the minimized code to exhibit my problem:
The input data file for the test has been saved with Windows' built-in Notepad in UTF-8 encoding.
It has the following three lines:
abacus æbәkәs
abalone æbәlәuni
abandon әbændәn
The Perl script file has also been saved with Windows' built-in Notepad in UTF-8 encoding.
It contains the following code:
#!perl -w
use Data::Dumper;
use strict;
use autodie;
open my $in,'<',"./hash_test.txt";
open my $out,'>',"./hash_result.txt";
my %hash = map {split/\t/,$_,2} <$in>;
print $out Dumper(\%hash),"\n";
print $out "$hash{abacus}";
print $out "$hash{abalone}";
print $out "$hash{abandon}";
In the output, the hash table seems to be okay:
$VAR1 = {
'abalone' => 'æbәlәuni
',
'abandon' => 'әbændәn',
'abacus' => 'æbәkәs
'
};
But it is actually not, because I only get two values instead of three:
æbәlәuni
әbændәn
Perl gives the following warning message:
Use of uninitialized value $hash{"abacus"} in string at C:\test2.pl line 11, <$in> line 3.
where's the problem? Can someone kindly explain? Thanks.
The Solution
Millions of thanks to all of you guys :) Now finally the culprit is found and the problem becomes fixable :)
As @Sinan insightfully pointed out, I'm now 100% sure that the culprit for the problem I described above is the three-byte BOM, which Notepad added to my data file when it was saved as UTF-8 and which Perl does not strip on its own. Although many suggested that I should use "<:utf8" and ">:utf8" to read and write files, these UTF-8 settings by themselves do not solve the problem. Instead they may cause some other problems.
To really solve the problem, all I actually need is to add one line of code to force Perl to ignore the BOM:
#!perl -w
use Data::Dumper;
use strict;
use autodie;
open my $in,'<',"./hash_test.txt";
open my $out,'>',"./hash_result.txt";
seek $in,3,0; # force Perl to ignore the BOM!
my %hash = map {split/\t/,$_,2} <$in>;
print $out Dumper(\%hash);
print $out $hash{abacus};
print $out $hash{abalone};
print $out $hash{abandon};
Now, the output is exactly what I expected:
$VAR1 = {
'abalone' => 'æbәlәuni
',
'abandon' => 'әbændәn',
'abacus' => 'æbәkәs
'
};
æbәkәs
æbәlәuni
әbændәn
Please note the script is saved as UTF-8 encoding and the code does not have to include any utf-8 labels because the input file and the output file are both pre-saved as UTF-8 encoding.
Finally, thanks again to all of you. And thank you, @Sinan, for the insightful guidance. Without your help, I would have stayed in the dark for God knows how long.
Note
To clarify a little more, if I use:
open my $in,'<:utf8',"./hash_test.txt";
open my $out,'>:utf8',"./hash_result.txt";
my %hash = map {split/\t/,$_,2} <$in>;
print $out Dumper(\%hash);
print $out $hash{abacus};
print $out $hash{abalone};
print $out $hash{abandon};
The output is this:
$VAR1 = {
'abalone' => "\x{e6}b\x{4d9}l\x{4d9}uni
",
'abandon' => "\x{4d9}b\x{e6}nd\x{4d9}n",
"\x{feff}abacus" => "\x{e6}b\x{4d9}k\x{4d9}s
"
};
æbәlәuni
әbændәn
And the warning message:
Use of uninitialized value in print at C:\hash_test.pl line 13, <$in> line 3.
I find the warning message a little suspicious. It tells you that the $in filehandle is at line 3 when it should be at line 4 after having read the last line.
When I tried your code, I saved the input file using GVim which is configured on my system to save as UTF-8, I did not see the problem. Now that I tried it with Notepad, looking at the output file, I see:
"\x{feff}abacus" => "\x{e6}b\x{4d9}k\x{4d9}s
"
where \x{feff} is the BOM.
In your Dumper output, there is a spurious blank before abacus (where you had not specified :utf8 for the output handle).
As I had mentioned originally (lost to the umpteen edits on this post — thanks for the reminder hobbs), specify '<:utf8' when you are opening the input file.
If you want to read/write UTF8 files, you should make sure that you are actually reading them in as UTF8.
#! /usr/bin/env perl
use Data::Dumper;
open my $in, '<:utf8', "hash_test.txt";
open my $out, '>:utf8', "hash_result.txt";
my %hash = map { chomp; split ' ', $_, 2 } <$in>;
print $out Dumper(\%hash),"\n";
print $out "$hash{abacus}\n";
print $out "$hash{abalone}\n";
print $out "$hash{abandon}\n";
If you want it to be more robust, it is recommended to use :encoding(utf8) instead of :utf8, for reading a file.
open my $in, '<:encoding(utf8)', "hash_test.txt";
Read PerlIO for more information.
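Reading through a decoding layer still leaves the BOM in place as the character U+FEFF at the start of the first line, so it has to be removed explicitly. The following is a rough sketch under a few assumptions: the sample file is generated on the fly with File::Temp so the example is self-contained, and only two of the three entries are written.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Build a small tab-separated sample file that starts with a BOM,
# similar to what Notepad writes when saving as UTF-8.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp, ':encoding(UTF-8)';
print {$tmp} "\x{FEFF}abacus\t\x{e6}b\x{4d9}k\x{4d9}s\n";
print {$tmp} "abalone\t\x{e6}b\x{4d9}l\x{4d9}uni\n";
close $tmp or die "close failed: $!";

# Read it back as decoded text and strip the BOM from the first line only.
open my $in, '<:encoding(UTF-8)', $path or die "Cannot open $path: $!";
my %hash;
while (my $line = <$in>) {
    $line =~ s/^\x{FEFF}// if $. == 1;
    chomp $line;
    my ($word, $ipa) = split /\t/, $line, 2;
    $hash{$word} = $ipa;
}
close $in;

my $status = exists $hash{abacus} ? 'found' : 'missing';
print "abacus $status\n";   # abacus found
```

Compared with seek $in,3,0 this does not depend on the BOM being present, so the same code also works for files saved without one.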
I think your answer may be sitting right in front of you. The output from Data::Dumper which you posted is:
$VAR1 = {
'abalone' => 'æbәlәuni
',
'abandon' => 'әbændәn',
'abacus' => 'æbәkәs
'
};
Notice the character between the ' and abacus? You tried to access the third value via $hash{abacus}. This is incorrect because of that character before abacus in the Dumper() hash. You could try plugging it into a loop which should take care of it:
foreach my $k (keys %hash) {
print $out $hash{$k};
}
split/\s/ instead of split/\t/
Works For Me. Are you sure your example matches your actual code and data?

Why does this base64 string comparison in Perl fail?

I am trying to compare an encode_base64('test') to the string variable containing the base64 string of 'test'. The problem is it never validates!
use MIME::Base64 qw(encode_base64);
if (encode_base64("test") eq "dGVzdA==")
{
print "true";
}
Am I forgetting anything?
Here's a link to a Perlmonks page which says "Beware of the newline at the end of the encode_base64() encoded strings".
So the simple 'eq' may fail.
To suppress the newline, say encode_base64("test", "") instead.
When you do a string comparison and it fails unexpectedly, print the strings to see what is actually in them. I put brackets around the value to see any extra whitespace:
use MIME::Base64;
$b64 = encode_base64("test");
print "b64 is [$b64]\n";
if ($b64 eq "dGVzdA==") {
print "true";
}
This is a basic debugging technique using the best debugger ever invented. Get used to using it a lot. :)
Also, sometimes you need to read the documentation for things a couple of times to catch the important parts. In this case, MIME::Base64 tells you that encode_base64 takes two arguments. The second argument is the line ending and defaults to a newline. If you don't want a newline on the end of the string you need to give it another line ending, such as the empty string:
encode_base64("test", "")
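Putting the two calls side by side shows the effect on the comparison. This is just a short sketch with illustrative variable names:

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64);

my $default = encode_base64("test");       # ends with the default "\n"
my $no_eol  = encode_base64("test", "");   # empty line ending

print "default matches\n" if $default eq "dGVzdA==";   # not printed
print "no-eol matches\n"  if $no_eol  eq "dGVzdA==";   # printed
```

Calling chomp on the default result would work as well for short inputs, but passing the empty string also avoids the line breaks that encode_base64 inserts into longer output.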
Here's an interesting tip: use Perl's wonderful and well-loved testing modules for debugging. Not only will that give you a head start on testing, but sometimes they'll make your debugging output a lot faster. For example:
#!/usr/bin/perl
use strict;
use warnings;
use Test::More 0.88;
BEGIN { use_ok 'MIME::Base64' => qw(encode_base64) }
is( encode_base64("test"), "dGVzdA==", q{"test" encodes okay} );
done_testing;
Run that script, with perl or with prove, and it won't just tell you that it didn't match, it will say:
# Failed test '"test" encodes okay'
# at testbase64.pl line 6.
# got: 'dGVzdA==
# '
# expected: 'dGVzdA=='
and sharp-eyed readers will notice that the difference between the two is indeed the newline. :)