Net::IMAP::Simple get not getting the entire message - perl

I'm pretty noobish in perl, so maybe I did something stupid.
My issue is that I'm currently using a script to get all the messages contained by an email box through the IMAP protocol, using Net::IMAP::Simple in PERL, but it does not gives me the entire body of the messages.
My entire code looks like:
use strict;
use Net::IMAP::Simple;
my $imap = Net::IMAP::Simple->new('xxxxxxxxxxxxxx') or die 'Impossible to connect '.$!;
$imap->login('xxxxxxxx', 'xxxxxxxxx') or die 'Login Error!';
my $nbmsg = $imap->select('INBOX') or die 'Impossible to reach this folder !';
print 'You have '.$nbmsg." messages in this folder !\n\n";
my $index = $imap->list() or die 'Impossible to list these messages !';
foreach my $msgnum (keys %$index) {
#if(!$imap->seen($msgnum)) {
my $msg = $imap->get($msgnum) or die 'Impossible to retrieve this message'.$msgnum.' !';
print $msg."\n\n";
# }
}
$imap->quit() or die 'quitting issue !';
And everytime that it is retrieving an email, it is giving me the first characters (which in my case are cryptics useless metadata generated by the bot that sending the messages), but not the entire body.
EDIT: Here is the body part displayed in the output:
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: BASE64
Q2UgbWVzc2FnZSBhIMOpdMOpIGfDqW7DqXLDqSBhdXRvbWF0aXF1ZW1lbnQgcGFyIGwnaW1wcmlt
YW50ZSBtdWx0aWZvbmN0aW9ucyBYZXJveCBYRVJPWF83ODMwLgogICBFbXBsYWNlbWVudCBkdSBz
eXN0w6htZSA6IAogICBBZHJlc3NlIElQIHN5c3TDqG1lIDogMTkyLjE2OC4xLjIwMAogICBBZHJl
c3NlIE1BQyBzeXN0w6htZSA6IDlDOjkzOjRFOjMzOjM1OkJECiAgIE1vZMOobGUgc3lzdMOobWUg
OiBYZXJveCBXb3JrQ2VudHJlIDc4MzAKICAgTnVtw6lybyBkZSBzw6lyaWUgc3lzdMOobWUgOiAz
OTEyNzk4ODk0CgpMJ0Fzc2lzdGFudCBkZSBjb21wdGV1ciBhIGVudm95w6kgbGUgcmVsZXbDqSBz
dWl2YW50IGF1IHNlcnZldXIgZGUgY29tbXVuaWNhdGlvbiBYZXJveCBTTWFydCBlU29sdXRpb25z
IGxlICAxNC8xMS8xNiAgIDA5OjI0OiAKICBUaXJhZ2VzIGVuIG5vaXIgOiAxMzIwNwogIFRpcmFn
ZXMgZW4gY291bGV1ciA6IDkyNjg3CiAgVG91cyBsZXMgdGlyYWdlcyA6IDEwNTg5NA==
It is always ending by this "==" btw, which is making me think that the module is shortening the output.
I looked after some details about it in the CPAN documentation but sadly didn't find anything.

Your messages are encoded as Base64. It's perfectly normal for emails to have that MIME type, though not required. You need to decode them. A good way to do that is to use MIME::Base64. Note that the == is part of the Base64 string. It's a padding to make the string have the right length.
use strict;
use warnings;
use MIME::Base64 'decode_base64';
my $decoded_msg = decode_base64($msg_body);
However, you need to get the body out of those message objects. The documentation is vague about that, it doesn't say what those objects are, and get only returns the raw message.
I suggest you install Data::Printer and use it to dump one of your $msg objects. That dump will include the internals of the object (which is likely a hash reference), and all the methods the object has. It's possible this object includes an accessor to get the already decoded content. If that's the case, you don't need to decode yourself. If not, grab the body out and decode it with decode_base64.
Update: I read the code, and it creates Net::IMAP::Simple::_message objects in the get method. There is a package definition at the top of the code. It's a bit complex, but it's obvious. It uses the arrayref of lines as the data structure behind the object, so I was wrong above.
q( package Net::IMAP::Simple::_message;
use overload fallback=>1, '""' => sub { local $"=""; "#{$_[0]}" };
sub new { bless $_[1] })
And further down:
return wantarray ? #lines : Net::IMAP::Simple::_message->new(\#lines)
So to get the body, you need to get rid of the header string. Once you've dumped out the object, you should see how many elements at the beginning of the array are the header and the empty line. I assume index 0 is the header line, and index 1 is the empty line. If you don't care about those, you can just throw them away.
This will change the object.
shift #$msg; # get rid of header
shift #$msg; # get rid of empty line
my $decoded_msg = decode_base64("$msg");

Related

Perl parsing email body without parts using MIME::Parser

I have a perl script that uses MIME::Email to parse emails received from stdin, but it doesn't work on emails without parts. I have no ability to modify the emails before they are sent.
I'd like to be able to identify the significant part of the email, regardless of whether it's HTML or text, and store it in a buffer for processing later. Many of these emails are from a mailing list that are somehow generated automatically.
Sometimes they seem to just have one "Content-Type:" header with no boundaries.
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Other times they have multiple text/plain parts, where one is the body of the email and another is a signature.
There are a few other header lines after this, but then the body is just displayed without any boundary markers.
This is my post from two years ago showing how I was able to eventually figure out how to parse most emails with parts
Parsing email with Email::MIME and multipart/mixed with subparts
use strict;
use MIME::Parser;
use MIME::Entity;
use Email::MIME;
use Email::Simple;
my $parser = MIME::Parser->new;
$parser->extract_uuencode(1);
$parser->extract_nested_messages(1);
$parser->output_to_core(1);
my $buf;
while(<STDIN> ){
$buf .= $_;
}
my $entity = $parser->parse_data($buf);
$entity->dump_skeleton;
my $num_parts = $entity->parts;
for (my $i=0; $i < $num_parts; $i++) {
my $part = $entity->parts($i);
my $content_type = $part->mime_type;
my $body = $part->as_string;
print "body: $body\n";
}
The body text is never printed. Only the following from dump_skeleton:
Content-type: text/plain
Effective-type: text/plain
Body-file: NONE
Subject: Security update
I'd really like the ability to modify my existing script (shown in the previous stackexchange post) to be able to print emails like this without any boundaries as well.
Is this poor formatting? I've been unable to locate any examples of a library that can be used to just print the body, subject, and other basic headers of an email reliably without sophisticated steps to analyze the whole message by parts.
I know mimeexplode can do it, but I can't figure out how. I need to store the mail body in a buffer to manipulate, so using a command-line program like mimeexplode would be a roundabout way of doing it anyway.
It is not fully clear for me what you are trying to achieve since you only post code but not the intention behind it in sufficient detail. But you are using parts to inspect the message which is clearly documented to return the parts of a multipart/* or similar (i.e. message/rfc822) and does not handle single messages:
... returns the array of all sub parts, returning the empty array if there are none (e.g., if this is a single part message, or a degenerate multipart). In a scalar context, this returns you the number of parts.
If you want to just get all parts including standalone "parts" (i.e. a single entity which is not part of anything) just use parts_DFS as in the following example, which prints the body for all entities which have a non-zero body:
use MIME::Parser;
my $parser = MIME::Parser->new;
my $entity = $parser->parse(\*STDIN);
for my $part ($entity->parts_DFS) {
defined(my $body = $part->bodyhandle) or next; # has no body, likely multipart or similar
print "body: ".$body->as_string."\n";
}
EDIT: given you've updated question you are not looking for all parts but for the main text part. It is not easy to determine what the actual main part is but you might try to use the first text/* part which is inline. This would probably look something like this:
use MIME::Parser;
my $parser = MIME::Parser->new;
my $entity = $parser->parse(\*STDIN);
for my $part ($entity->parts_DFS) {
defined(my $body = $part->bodyhandle) or next; # has no body, likely multipart or similar
if (my $disp = $part->head->get('content-disposition')) {
next if $disp !~ m{inline}i;
}
print "body: ".$body->as_string."\n";
last;
}

What is the preferred method of accessing WWW::Mechanize responses?

Are both of these versions OK or is one of them to prefer?
#!/usr/bin/env perl
use strict;
use warnings;
use WWW::Mechanize;
my $mech = WWW::Mechanize->new();
my $content;
# 1
$mech->get( 'http://www.kernel.org' );
$content = $mech->content;
print $content;
# 2
my $res = $mech->get( 'http://www.kernel.org' );
$content = $res->content;
print $content;
They are both acceptable. The second one seems cleaner to me because it returns a proper HTTP::Response object which you can query and call methods on, and also means that if you make another Mechanize request, you'll still have access to the old HTTP response. With your first approach, each time you make a request, the content method will change to something new, which sounds error-prone.
Btw, for either method, you should check $response->is_success or $mech->success before accessing the content, as the request may have failed.
The content() method is sometimes more convenient:
$mech->content(...)
Returns the content that the mech uses internally for the last page fetched. Ordinarily this is the same as $mech->response()->content(), but this may differ for HTML documents if "update_html" is overloaded, and/or extra named arguments are passed to content():
$mech->content( format => 'text' )
Returns a text-only version of the page, with all HTML markup stripped. This feature requires HTML::TreeBuilder to be installed, or a fatal error will be thrown.
$mech->content( base_href => [$base_href|undef] )
Returns the HTML document, modified to contain a mark-up in the header. $base_href is $mech->base() if not specified. This is handy to pass the HTML to e.g. HTML::Display.
$mech->content is specifically there so you can bypass having to get the result response. The simpler it is, the better.

How can I detect the file type of image at a URL?

How to find the image file type in Perl form website URL?
For example,
$image_name = "logo";
$image_path = "http://stackoverflow.com/content/img/so/".$image_name
From this information how to find the file type that . here the example it should display
"png"
http://stackoverflow.com/content/img/so/logo.png .
Supposer if it has more files like SO web site . it should show all file types
If you're using LWP to fetch the image, you can look at the content-type header returned by the HTTP server.
Both WWW::Mechanize and LWP::UserAgent will give you an HTTP::Response object for any GET request. So you can do something like:
use strict;
use warnings;
use WWW::Mechanize;
my $mech = WWW::Mechanize->new;
$mech->get( "http://stackoverflow.com/content/img/so/logo.png" );
my $type = $mech->response->headers->header( 'Content-Type' );
You can't easily tell. The URL doesn't necessarily reflect the type of the image.
To get the image type you have to make a request via HTTP (GET, or more efficiently, HEAD), and inspect the Content-type header in the HTTP response.
Well, https://stackoverflow.com/content/img/so/logo is a 404. If it were not, then you could use
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
my ($content_type) = head "https://stackoverflow.com/content/img/so/logo.png";
print "$content_type\n" if defined $content_type;
__END__
As Kent Fredric points out, what the web server tells you about content type need not match the actual content sent by the web server. Keep in mind that File::MMagic can also be fooled.
#!/usr/bin/perl
use strict;
use warnings;
use File::MMagic;
use LWP::UserAgent;
my $mm = File::MMagic->new;
my $ua = LWP::UserAgent->new(
max_size => 1_000 * 1_024,
);
my $res = $ua->get('https://stackoverflow.com/content/img/so/logo.png');
if ( $res->code eq '200' ) {
print $mm->checktype_contents( $res->content );
}
else {
print $res->status_line, "\n";
}
__END__
You really can't make assumptions about content based on URL, or even content type headers.
They're only guides to what is being sent.
A handy trick to confuse things that use suffix matching to identify file-types is doing this:
http://example.com/someurl?q=foo#fakeheheh.png
And if you were to arbitrarily permit that image to be added to the page, it might in some cases be a doorway to an attack of some sorts if the browser followed it. ( For example, http://really_awful_bank.example.com/transfer?amt=1000000;from=123;to=123 )
Content-type based forgery is not so detrimental, but you can do nasty things if the person who controls the name works out how you identify things and sends different content types for HEAD requests as it does for GET requests.
It could tell the HEAD request that it's an Image, but then tell the GET request that its a application/javascript and goodness knows where that will lead.
The only way to know for certain what it is is downloading the file and then doing MAGIC based identification, or more (i.e., try to decode the image). Then all you have to worry about is images that are too large, and specially crafted images that could trip vulnerabilities in computers that are not yet patched for that vulnerability.
Granted all of the above is extreme paranoia, but if you know the rare possibilities you can make sure they can't happen :)
From what i understand you're not worried about the content type of an image you already know the the name+extension for, you want to find the extension for an image you know the base name of.
In order to do that you'd have to test all the image extensions you wanted individually and store which ones resolved and which ones didn't. For example both https://stackoverflow.com/content/img/so/logo.png and https://stackoverflow.com/content/img/so/logo.gif could exist. They don't in this exact situation but on some arbitrary server you could have multiple images with the same base name but different extensions. Unfortunately there's no way to get a list of available extensions of a file in a remote web directory by supplying its base name without looping through the possibilities.

Transparently Handling GZip Encoded content with WWW::Mechanize

I am using WWW::Mechanize and currently handling HTTP responses with the 'Content-Encoding: gzip' header in my code by first checking the response headers and then using IO::Uncompress::Gunzip to get the uncompressed content.
However I would like to do this transparently so that WWW::Mechanize methods like form(), links() etc work on and parse the uncompressed content. Since WWW::Mechanize is a sub-class of LWP::UserAgent, I would prefer to use the LWP::UA::handlers to do this.
While I have been partly successful (I can print the uncompressed content for example), I am unable to do this transparently in a way that I can call
$mech->forms();
In summary: How do I "replace" the content inside the $mech object so that from that point onwards, all WWW::Mechanize methods work as if the Content-Encoding never happened?
I would appreciate your attention and help.
Thanks
WWW::Mechanize::GZip, I think.
It looks to me like you can replace it by using the $res->content( $bytes ) member.
By the way, I found this stuff by looking at the source of LWP::UserAgent, then HTTP::Response, then HTTP::Message.
It is built in with UserAgent and thus Mechanize. One MAJOR caveat to save you some hair
-To debug, make sure you check for error $# after the call to decoded_content.
$html = $r->decoded_content;
die $# if $#;
Better yet, look through the source of HTTP::Message and make sure all the support packages are there
In my case, decoded_content returned undef while content is raw binary, and I went on a wild goose chase. UserAgent will set the error flag on failure to decode, but Mechanize will just ignore it (It doesn't check or log the incidence as its own error/warning).
In my case $# sez: "Can't find IO/HTML.pm .. It was eval'ed
After having to dive into the source, I find out the built-in decoding process is long, meticulous, and arduous, covering just about every scenario and making tons of guesses (Thank you Gisle!).
if you are paranoid, explicitly set the default header to be used with every request at new()
$browser = new WWW::Mechanize('default_headers' => HTTP::Headers->new('Accept-Encoding'
=> scalar HTTP::Message::decodable()));

How can I get the entire request body with CGI.pm?

I'm trying to write a Perl CGI script to handle XML-RPC requests, in which an XML document is sent as the body of an HTTP POST request.
The CGI.pm module does a great job at extracting named params from an HTTP request, but I can't figure out how to make it give me the entire HTTP request body (i.e. the XML document in the XML-RPC request I'm handling).
If not CGI.pm, is there another module that would be able to parse this information out of the request? I'd prefer not to have to extract this information "by hand" from the environment variables. Thanks for any help.
You can get the raw POST data by using the special parameter name POSTDATA.
my $q = CGI->new;
my $xml = $q->param( 'POSTDATA' );
Alternatively, you could read STDIN directly instead of using CGI.pm, but then you lose all the other useful stuff that CGI.pm does.
The POSTDATA trick is documented in the excellent CGI.pm docs here.
Right, one could use POSTDATA, but that only works if the request Content-Type has not been set to 'multipart/form-data'.
If it is set to 'multipart/form-data', CGI.pm does its own content processing and POSTDATA is not initialized.
So, other options include $cgi->query_string and/or $cgi->Dump.
The $cgi->query_string returns the contents of the POST in a GET format (param=value&...), and there doesn't seem to be a way to simply get the contents of the POST STDIN as they were passed in by the client.
So to get the actual content of the standard input of a POST request, if modifying CGI.pm is an option for you, you could modify around line 620 to save the content of #lines somewhere in a variable, such as:
$self->{standard_input} = join '', #lines;
And then access it through $cgi->{standard_input}.
To handle all cases, including those when Content-Type is multipart/form-data, read (and put back) the raw data, before CGI does.
use strict;
use warnings;
use IO::Handle;
use IO::Scalar;
STDIN->blocking(1); # ensure to read everything
my $cgi_raw = '';
{
local $/;
$cgi_raw = <STDIN>;
my $s;
tie *STDIN, 'IO::Scalar', \$s;
print STDIN $cgi_raw;
tied(*STDIN)->setpos(0);
}
use CGI qw /:standard/;
...