Perl parsing email body without parts using MIME::Parser

Perl parsing email body without parts using MIME::Parser - perl

I have a perl script that uses MIME::Email to parse emails received from stdin, but it doesn't work on emails without parts. I have no ability to modify the emails before they are sent.
I'd like to be able to identify the significant part of the email, regardless of whether it's HTML or text, and store it in a buffer for processing later. Many of these emails are from a mailing list that are somehow generated automatically.
Sometimes they seem to just have one "Content-Type:" header with no boundaries.
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Other times they have multiple text/plain parts, where one is the body of the email and another is a signature.
There are a few other header lines after this, but then the body is just displayed without any boundary markers.
This is my post from two years ago showing how I was able to eventually figure out how to parse most emails with parts
Parsing email with Email::MIME and multipart/mixed with subparts
use strict;
use MIME::Parser;
use MIME::Entity;
use Email::MIME;
use Email::Simple;
my $parser = MIME::Parser->new;
$parser->extract_uuencode(1);
$parser->extract_nested_messages(1);
$parser->output_to_core(1);
my $buf;
while(<STDIN> ){
$buf .= $_;
}
my $entity = $parser->parse_data($buf);
$entity->dump_skeleton;
my $num_parts = $entity->parts;
for (my $i=0; $i < $num_parts; $i++) {
my $part = $entity->parts($i);
my $content_type = $part->mime_type;
my $body = $part->as_string;
print "body: $body\n";
}
The body text is never printed. Only the following from dump_skeleton:
Content-type: text/plain
Effective-type: text/plain
Body-file: NONE
Subject: Security update
I'd really like the ability to modify my existing script (shown in the previous stackexchange post) to be able to print emails like this without any boundaries as well.
Is this poor formatting? I've been unable to locate any examples of a library that can be used to just print the body, subject, and other basic headers of an email reliably without sophisticated steps to analyze the whole message by parts.
I know mimeexplode can do it, but I can't figure out how. I need to store the mail body in a buffer to manipulate, so using a command-line program like mimeexplode would be a roundabout way of doing it anyway.

It is not fully clear for me what you are trying to achieve since you only post code but not the intention behind it in sufficient detail. But you are using parts to inspect the message which is clearly documented to return the parts of a multipart/* or similar (i.e. message/rfc822) and does not handle single messages:
... returns the array of all sub parts, returning the empty array if there are none (e.g., if this is a single part message, or a degenerate multipart). In a scalar context, this returns you the number of parts.
If you want to just get all parts including standalone "parts" (i.e. a single entity which is not part of anything) just use parts_DFS as in the following example, which prints the body for all entities which have a non-zero body:
use MIME::Parser;
my $parser = MIME::Parser->new;
my $entity = $parser->parse(\*STDIN);
for my $part ($entity->parts_DFS) {
defined(my $body = $part->bodyhandle) or next; # has no body, likely multipart or similar
print "body: ".$body->as_string."\n";
}
EDIT: given you've updated question you are not looking for all parts but for the main text part. It is not easy to determine what the actual main part is but you might try to use the first text/* part which is inline. This would probably look something like this:
use MIME::Parser;
my $parser = MIME::Parser->new;
my $entity = $parser->parse(\*STDIN);
for my $part ($entity->parts_DFS) {
defined(my $body = $part->bodyhandle) or next; # has no body, likely multipart or similar
if (my $disp = $part->head->get('content-disposition')) {
next if $disp !~ m{inline}i;
}
print "body: ".$body->as_string."\n";
last;
}

Related

Net::IMAP::Simple get not getting the entire message

I'm pretty noobish in perl, so maybe I did something stupid.
My issue is that I'm currently using a script to get all the messages contained by an email box through the IMAP protocol, using Net::IMAP::Simple in PERL, but it does not gives me the entire body of the messages.
My entire code looks like:
use strict;
use Net::IMAP::Simple;
my $imap = Net::IMAP::Simple->new('xxxxxxxxxxxxxx') or die 'Impossible to connect '.$!;
$imap->login('xxxxxxxx', 'xxxxxxxxx') or die 'Login Error!';
my $nbmsg = $imap->select('INBOX') or die 'Impossible to reach this folder !';
print 'You have '.$nbmsg." messages in this folder !\n\n";
my $index = $imap->list() or die 'Impossible to list these messages !';
foreach my $msgnum (keys %$index) {
#if(!$imap->seen($msgnum)) {
my $msg = $imap->get($msgnum) or die 'Impossible to retrieve this message'.$msgnum.' !';
print $msg."\n\n";
# }
}
$imap->quit() or die 'quitting issue !';
And everytime that it is retrieving an email, it is giving me the first characters (which in my case are cryptics useless metadata generated by the bot that sending the messages), but not the entire body.
EDIT: Here is the body part displayed in the output:
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: BASE64
Q2UgbWVzc2FnZSBhIMOpdMOpIGfDqW7DqXLDqSBhdXRvbWF0aXF1ZW1lbnQgcGFyIGwnaW1wcmlt
YW50ZSBtdWx0aWZvbmN0aW9ucyBYZXJveCBYRVJPWF83ODMwLgogICBFbXBsYWNlbWVudCBkdSBz
eXN0w6htZSA6IAogICBBZHJlc3NlIElQIHN5c3TDqG1lIDogMTkyLjE2OC4xLjIwMAogICBBZHJl
c3NlIE1BQyBzeXN0w6htZSA6IDlDOjkzOjRFOjMzOjM1OkJECiAgIE1vZMOobGUgc3lzdMOobWUg
OiBYZXJveCBXb3JrQ2VudHJlIDc4MzAKICAgTnVtw6lybyBkZSBzw6lyaWUgc3lzdMOobWUgOiAz
OTEyNzk4ODk0CgpMJ0Fzc2lzdGFudCBkZSBjb21wdGV1ciBhIGVudm95w6kgbGUgcmVsZXbDqSBz
dWl2YW50IGF1IHNlcnZldXIgZGUgY29tbXVuaWNhdGlvbiBYZXJveCBTTWFydCBlU29sdXRpb25z
IGxlICAxNC8xMS8xNiAgIDA5OjI0OiAKICBUaXJhZ2VzIGVuIG5vaXIgOiAxMzIwNwogIFRpcmFn
ZXMgZW4gY291bGV1ciA6IDkyNjg3CiAgVG91cyBsZXMgdGlyYWdlcyA6IDEwNTg5NA==
It is always ending by this "==" btw, which is making me think that the module is shortening the output.
I looked after some details about it in the CPAN documentation but sadly didn't find anything.

Your messages are encoded as Base64. It's perfectly normal for emails to have that MIME type, though not required. You need to decode them. A good way to do that is to use MIME::Base64. Note that the == is part of the Base64 string. It's a padding to make the string have the right length.
use strict;
use warnings;
use MIME::Base64 'decode_base64';
my $decoded_msg = decode_base64($msg_body);
However, you need to get the body out of those message objects. The documentation is vague about that, it doesn't say what those objects are, and get only returns the raw message.
I suggest you install Data::Printer and use it to dump one of your $msg objects. That dump will include the internals of the object (which is likely a hash reference), and all the methods the object has. It's possible this object includes an accessor to get the already decoded content. If that's the case, you don't need to decode yourself. If not, grab the body out and decode it with decode_base64.
Update: I read the code, and it creates Net::IMAP::Simple::_message objects in the get method. There is a package definition at the top of the code. It's a bit complex, but it's obvious. It uses the arrayref of lines as the data structure behind the object, so I was wrong above.
q( package Net::IMAP::Simple::_message;
use overload fallback=>1, '""' => sub { local $"=""; "#{$_[0]}" };
sub new { bless $_[1] })
And further down:
return wantarray ? #lines : Net::IMAP::Simple::_message->new(\#lines)
So to get the body, you need to get rid of the header string. Once you've dumped out the object, you should see how many elements at the beginning of the array are the header and the empty line. I assume index 0 is the header line, and index 1 is the empty line. If you don't care about those, you can just throw them away.
This will change the object.
shift #$msg; # get rid of header
shift #$msg; # get rid of empty line
my $decoded_msg = decode_base64("$msg");

Cancel Download using WWW::Mechanize in Perl

I have written a Perl script which would check a list of URLs and connect to them by sending a GET request.
Now, let's say that one of these URLs has a file which is very big in size, for instance, has a size > 100 MB.
When a request is sent to download this file using this:
$mech=WWW::Mechanize->new();
$url="http://somewebsitename.com/very_big_file.txt"
$mech->get($url)
Once the GET request is sent, it will start downloading the file. I want this to be cancelled using WWW::Mechanize. How can I do that?
I checked the documentation of this Perl Module here:
http://metacpan.org/pod/WWW::Mechanize
However, I could not find a method which would help me do this.
Thanks.

Aborting a GET request
Using the :content_cb option, you can provide a callback function to get() that will be executed for each chunk of response content received from the server. You can set* the chunk size (in bytes) using the :read_size_hint option. These options are documented in LWP::UserAgent (get() in WWW::Mechanize is just an overloaded version of the same method in LWP::UserAgent).
The following request will be aborted after reading 1024 bytes of response content:
use WWW::Mechanize;
sub callback {
my ($data, $response, $protocol) = #_;
die "Too much data";
}
my $mech = WWW::Mechanize->new;
my $url = 'http://www.example.com';
$mech->get($url, ':content_cb' => \&callback, ':read_size_hint' => 1024);
print $mech->response()->header('X-Died');
Output:
Too much data at ./mechanize line 12.
Note that the die in the callback does not cause the program itself to die; it simply sets the X-Died header in the response object. You can add the appropriate logic to your callback to determine under what conditions a request should be aborted.
Don't even fetch URL if content is too large
Based on your comments, it sounds like what you really want is to never send a request in the first place if the content is too large. This is quite different from aborting a GET request midway through, since you can fetch the Content-Length header with a HEAD request and perform different actions based on the value:
my #urls = qw(http://www.example.com http://www.google.com);
foreach my $url (#urls) {
$mech->head($url);
if ($mech->success) {
my $length = $mech->response()->header('Content-Length') // 0;
next if $length > 1024;
$mech->get($url);
}
}
Note that according to the HTTP spec, applications should set the Content-Length header. This does not mean that they will (hence the default value of 0 in my code example).
* According to the documentation, the "protocol module which will try to read data from the server in chunks of this size," but I don't think it's guaranteed.

CGI Perl Echo back POSTed application/x-www-form-urlencoded Content

I need a simple CGI based Perl script to receive a POST (directly, not from another HTML page) with Content-Type being application/x-www-form-urlencoded and to echo back
I received: (encoded string)
(and if possible)
decoded, the string is: (decoded string)
I am new to CGI Perl, and this is a one off request for testing a product (I'm a sysadmin. not a programmer). I intend to learn Perl more deeply in the future, but in this case I'm hoping for a gimme.

To start off, I will quickly skim some of the basics.
Following is the package for PERL/CGI application:
use CGI;
To create CGI object:
my $web = CGI->new;
Make sure you set and then write HTTP headers to outstream, before flushing out any CGI data to outstream. Otherwise you would end up in 500 error.
To set the headers:
print $web->header();
print $web->header('application/x-www-form-urlencoded');
To receive any post data from HTML, say for example,
http://example.com?POSTDATA=helloworld
you may use param() function:
my $data = $web->param('POSTDATA');
scalar $data would be set with "helloworld".
It is advisable to check if $web->param('POSTDATA') is defined before you assign to a scalar.

Sending a video from Perl to a client over HTTP

I am currently making a perl script that will convert a file to webm/ogg/mp4 format and then send it back to the user but in embed video. It all works except that I can not send an EOF so the HTML5 video player knows what the end is and so he can correctly use the file (like going to a specific time and even knowing when the file has ended (now it just stops but you can't do anything anymore with the video.
Start-code:
elsif ($path =~ /^\/((\w|\d){11})\.webm$/ig) {
print "HTTP/1.0 200 OK\r\n";
$handler = \&resp_youtubemovie;
$handler->($cgi,$1);
Function to send webm file
sub resp_youtubemovie {
my $cgi = shift;
my $youtubeID = shift;
return if !ref $cgi;
open(movie,"<$youtubeID.webm");
local($/);
$movie = <movie>;
close(movie);
print "Content-type: movie/webm\n";
print $movie;
}
I've already tried with a while loop and a buffer but that doesn't work either, I've also tried to change the HTTP status code to 206 Partial Content because I wiresharked some other video streaming websites used it but it didn't matter. So anyone an idea how to open a movie file and stream it correctly?

Rather than doing this by hand, a framework like Dancer can take care of this. This will save you many, many, many headaches. It also allows you to take advantage of the Plack/PSGI superglue which figures out how to talk to web servers for you.
use Dancer;
get qr{/(\w{11}\.webm)$}i => sub {
my($video_file) = splat;
return send_file(
$video_file,
streaming => 1,
);
}
Using Dancer routes you should be able to adapt your existing code pretty easily especially if its a big if/elsif matching against various paths. Dancer does a very good job making simple things simple, it also gives you a huge amount of control over the exact HTTP response if you need it.
A few notes...
The content-type for webm is video/webm which may be the source of your problems. Dancer should just get it right. If not you can tell send_file the content type explicitly.
(\w|\d){11} is better written as \w{11} since \w includes \d.

You must use the 206 Partial Content HTTP status and you must also send:
The Accept-Range: bytes header.
A Content-Range: 0-2048/123456 header where you send the starting and ending byte index of the content followed by the total byte length of the content. The client will be sending you the byte ranges it wants in the request header. The client may send multiple byte ranges in a single request, in which case you'd also need to send the content with multipart word boundaries.
Finally, to get back to your question, if the client requests a byte range that isn't satisfiable then you send a 416 HTTP status and close the connection.

How can I get the entire request body with CGI.pm?

I'm trying to write a Perl CGI script to handle XML-RPC requests, in which an XML document is sent as the body of an HTTP POST request.
The CGI.pm module does a great job at extracting named params from an HTTP request, but I can't figure out how to make it give me the entire HTTP request body (i.e. the XML document in the XML-RPC request I'm handling).
If not CGI.pm, is there another module that would be able to parse this information out of the request? I'd prefer not to have to extract this information "by hand" from the environment variables. Thanks for any help.

You can get the raw POST data by using the special parameter name POSTDATA.
my $q = CGI->new;
my $xml = $q->param( 'POSTDATA' );
Alternatively, you could read STDIN directly instead of using CGI.pm, but then you lose all the other useful stuff that CGI.pm does.
The POSTDATA trick is documented in the excellent CGI.pm docs here.

Right, one could use POSTDATA, but that only works if the request Content-Type has not been set to 'multipart/form-data'.
If it is set to 'multipart/form-data', CGI.pm does its own content processing and POSTDATA is not initialized.
So, other options include $cgi->query_string and/or $cgi->Dump.
The $cgi->query_string returns the contents of the POST in a GET format (param=value&...), and there doesn't seem to be a way to simply get the contents of the POST STDIN as they were passed in by the client.
So to get the actual content of the standard input of a POST request, if modifying CGI.pm is an option for you, you could modify around line 620 to save the content of #lines somewhere in a variable, such as:
$self->{standard_input} = join '', #lines;
And then access it through $cgi->{standard_input}.

To handle all cases, including those when Content-Type is multipart/form-data, read (and put back) the raw data, before CGI does.
use strict;
use warnings;
use IO::Handle;
use IO::Scalar;
STDIN->blocking(1); # ensure to read everything
my $cgi_raw = '';
{
local $/;
$cgi_raw = <STDIN>;
my $s;
tie *STDIN, 'IO::Scalar', \$s;
print STDIN $cgi_raw;
tied(*STDIN)->setpos(0);
}
use CGI qw /:standard/;
...

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Perl parsing email body without parts using MIME::Parser - perl

Related

Net::IMAP::Simple get not getting the entire message

Cancel Download using WWW::Mechanize in Perl

CGI Perl Echo back POSTed application/x-www-form-urlencoded Content

Sending a video from Perl to a client over HTTP

How can I get the entire request body with CGI.pm?

Categories

Resources