How to extract e-mail data into R? - email

How could I export my e-mail database from Gmail (or Thunderbird) into R?
Just as there are the RGoogleDocs and twitteR packages, is there a gmailR package, or a standard format for exporting emails into statistics packages?
Tal

You need to install the edeR package and load it with library(edeR) first. You may need to install 64-bit Java manually on Windows 8, and you may need to enable IMAP access in Gmail.
dat3 <- extractKeyword(username = "YOURLOGIN@gmail.com",
                       password = "YouRPaSS",
                       kw = "adsense",
                       nmail = 5)
This will download 5 emails with keyword 'adsense'.

Standard email (on a Unix system) is stored either as an mbox file (a single file containing several messages) or as a maildir setup, where each mail is a separate file in a directory.
Either way, it's ASCII text. This is why the MUA (mail user agent, i.e. your mail reader) is orthogonal to the MTA (mail transport agent, i.e. mail server software such as exim, qmail, postfix, ...). The server may use a network protocol like POP3 or IMAP to serve the mail files to the client, in which case the client (which may be Gmail or Thunderbird) no longer sees the underlying files. So you may need to learn how to export your mail from whichever backend you use and then read it.
So far this has nothing to do with R or programming, unless you now feel you must extend R with POP3 or IMAP facilities to connect to a (remote) mail server.
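To make the difference between the two storage layouts concrete, here is a minimal sketch using Python's standard mailbox module (not R). The file names and addresses are placeholders invented for the example; it only illustrates that an mbox is one file holding many messages while a maildir is a directory tree holding one file per message.

```python
import mailbox
import tempfile
from email.message import EmailMessage
from pathlib import Path


def make_message(subject, body):
    """Build a small message; the addresses are placeholders."""
    msg = EmailMessage()
    msg["From"] = "alice@example.org"
    msg["To"] = "bob@example.org"
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def subjects(box):
    """Return the Subject header of every message in a mailbox object."""
    return [message["Subject"] for message in box]


workdir = Path(tempfile.mkdtemp())

# mbox: all messages live in one text file, separated by "From " lines.
mbox = mailbox.mbox(str(workdir / "inbox.mbox"))
mbox.add(make_message("first", "hello"))
mbox.add(make_message("second", "world"))
mbox.flush()

# Maildir: a directory with cur/, new/ and tmp/ subdirectories,
# and one file per message.
maildir = mailbox.Maildir(str(workdir / "inbox_maildir"), create=True)
maildir.add(make_message("third", "one file per mail"))

mbox_subjects = subjects(mbox)
maildir_subjects = subjects(maildir)
maildir_files = sorted(p.name for p in (workdir / "inbox_maildir").iterdir())
mbox.close()

print(mbox_subjects, maildir_subjects, maildir_files)
```

Both objects expose the same iteration interface, which is what lets a reader stay independent of the storage format.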

There is now an R package for extracting email data. It is still in the testing phase, but anyone can install it from GitHub; the package is named edeR. At present it can extract email data from IMAP-enabled Gmail accounts.

Gmail and Thunderbird are not the same... you can set up your Gmail account in Thunderbird, then export each email to an ASCII file, then write an R batch script that takes each file and imports it into R as an object, and so on... you get the point. =)
Usually I try to avoid "the pedestrian approach"... but I get the impression that you're inclined to use R as a "general purpose" programming language... Python or Java, on the other hand, can be quite efficient for this, so you can write a script (or ask someone to write it for you) that will "bring" you the data in the desired format, and then crunch it in R. R has matured a lot, and it's no longer solely a tool for statistical analysis, but it's still a good idea to use a widely known programming language to prepare your data.
So there... Roll up your sleeves and dive into Python (Java, C... whatever you feel like diving into)!
P.S.
I reckon that this has something to do with your previous post about word clouds...
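As a rough sketch of the "prepare the data elsewhere, then crunch it in R" idea, the following Python script reads an mbox export and produces a CSV that R could load with read.csv(). The column choice (from, date, subject), the file name and the sample message are assumptions made for the example, not anything prescribed by the answer.

```python
import csv
import io
import mailbox
import tempfile
from email.message import EmailMessage
from email.utils import parseaddr, parsedate_to_datetime
from pathlib import Path


def mbox_to_rows(path):
    """Extract sender address, ISO date and subject from each message."""
    rows = []
    box = mailbox.mbox(str(path), create=False)
    try:
        for msg in box:
            raw_date = msg.get("Date")
            try:
                date = parsedate_to_datetime(raw_date).isoformat() if raw_date else ""
            except (TypeError, ValueError):
                date = ""  # unparseable Date header: leave the cell empty
            rows.append({
                "from": parseaddr(str(msg.get("From", "")))[1],
                "date": date,
                "subject": str(msg.get("Subject", "")),
            })
    finally:
        box.close()
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["from", "date", "subject"],
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Demo on a throwaway mbox containing one made-up message.
demo_path = Path(tempfile.mkdtemp()) / "export.mbox"
demo_box = mailbox.mbox(str(demo_path))
sample = EmailMessage()
sample["From"] = "Alice <alice@example.org>"
sample["Date"] = "Mon, 01 Jan 2024 10:00:00 +0000"
sample["Subject"] = "hello"
sample.set_content("test body")
demo_box.add(sample)
demo_box.close()

rows = mbox_to_rows(demo_path)
csv_text = rows_to_csv(rows)
print(csv_text)
```

On the R side, the resulting file would then be an ordinary data frame after read.csv(), with one row per message.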

Once you have exported your e-mails in mbox format to your PC, you can make use of both the tm and tm.plugin.mail packages in R. The latter makes it possible to import your e-mails into R.
require("tm")
require("tm.plugin.mail")
Then, to convert your e-mails from mbox format (i.e., several mails in a single file) to eml format (i.e., every mail in its own file), use convert_mbox_eml(mbox, dir). In the example below, the mbox argument is "yourmails.mbox", the location of the mbox file, and the output directory is "your_mails".
convert_mbox_eml("yourmails.mbox", "your_mails")
You can then read in the mail documents and inspect them with the following R commands.
mails <- VCorpus(DirSource("your_mails/"), readerControl = list(reader = readMail))
inspect(mails)

Related

Zimbra emails from Postfix queue to eml format

I'm trying to recover some undelivered emails from the Zimbra Postfix 'deferred' queue by converting them to eml format.
First, it turned out that all the files in the queue are some kind of binary. They can be viewed to some extent with the less command, but that display is not clean enough to process them again for delivery.
Are there any converters for this kind of message, or a specification against which I could develop my own converter?
OK, I have solved my issue. It turned out that Zimbra uses a slightly modified Postfix, so I was able to convert the queued messages using the command below (Postfix 2.7+)
postcat -bh file_name_from_queue > file_name.eml
or
postcat -qbh queue_id > file_name.eml
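If several queued messages need converting, the second form of the command could be wrapped in a small script. The sketch below is in Python and only assembles and runs the postcat -qbh call quoted above; the queue IDs are made-up placeholders, and it assumes postcat is on the PATH and that the script has permission to read the queue.

```python
import subprocess


def postcat_command(queue_id):
    """Build the argument list for `postcat -qbh <queue_id>` (flags as quoted above)."""
    return ["postcat", "-qbh", queue_id]


def export_queue_entry(queue_id, out_path):
    """Write one queued message to out_path; needs Postfix 2.7+ and queue access."""
    with open(out_path, "wb") as out:
        subprocess.run(postcat_command(queue_id), stdout=out, check=True)


# Hypothetical queue IDs, e.g. copied by hand from `postqueue -p` output.
queue_ids = ["3F2A91C0D5", "9B7E44A012"]
commands = [postcat_command(queue_id) for queue_id in queue_ids]
print(commands)

# On the mail server itself one would then call, for each ID:
#     export_queue_entry(queue_id, queue_id + ".eml")
```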

Does Mail::Box::Manager handle Qmail maildir format?

It's not obvious which formats this module supports. There is no mention of 'qmail' or 'maildir'.
http://search.cpan.org/~markov/Mail-Box-2.120/lib/Mail/Box/Manager.pod
It has functions to move between folders, which sounds like maildir, but then it also hints that it encapsulates Mail::Box::Mbox, which sounds like the single file mbox format.
Here is a description of maildir format: http://wiki.dovecot.org/MailboxFormat/Maildir, and description of both: http://www.postfix.org/virtual.8.html
http://search.cpan.org/~markov/Mail-Box-2.120/lib/Mail/Box-Overview.pod
Each folder maintains a list of messages. Much effort is made to hide differences between folder types and kinds of messages. Your program can be used for MBOX, MH, Maildir, and POP3 folders with no change at all (as long as you stick to the rules).

Splitting Emails with MIME::Parser

I got handed 4GB of emails concatenated into a single file and the suggestion that MIME::Parser could split the individual emails back out again. All my attempts to date end up with the parser just copying the original file without extracting any of the emails. So: Is this even something that MIME::Parser can handle? My code is very basic:
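use IO::File;      # also exports O_RDONLY from Fcntl
use MIME::Parser;

my $file = IO::File->new("somefile", O_RDONLY) or die "Cannot open somefile: $!";
my $parser = MIME::Parser->new;
$parser->output_dir("somedir");
my $entity = $parser->parse($file);   # treats the whole input as ONE message
$file->close;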
Below is a link to the sample data that some have requested. These are all SPAM and phishing emails. DO NOT CLICK ANY OF THE LINKS. Enjoy: Pastebin of 4KB of emails.
MIME::Parser is for reading a single mail to get at the attachments etc. It can be used to extract mails which are attached inside another mail as message/rfc822, but it is not intended to extract mails from some kind of archive in which lots of mails are concatenated.
It is not clear what format your single file with mails has. But if it comes from a UNIX system or from a Thunderbird installation, it might simply be in the classical mbox format, and there are several tools to split mbox files into separate messages. Apart from several Perl modules, there are also other tools like git-mailsplit which help you extract the mails from mbox format.
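For illustration, here is how such a split might look with Python's standard mailbox module rather than Perl. This is a sketch that assumes the file really is in mbox format (messages separated by "From " lines); the file and directory names are placeholders, and it has not been tried on a 4GB archive.

```python
import email
import mailbox
import tempfile
from email.message import EmailMessage
from pathlib import Path


def split_mbox(mbox_path, out_dir):
    """Write every message in an mbox file to its own numbered .eml file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    box = mailbox.mbox(str(mbox_path), create=False)
    try:
        for index, message in enumerate(box, start=1):
            target = out_dir / f"{index:06d}.eml"
            # as_bytes() serializes headers and body without the
            # mbox "From " separator line.
            target.write_bytes(message.as_bytes())
            written.append(target)
    finally:
        box.close()
    return written


# Demo: build a tiny two-message mbox, then split it again.
workdir = Path(tempfile.mkdtemp())
source = mailbox.mbox(str(workdir / "archive.mbox"))
for subject in ("one", "two"):
    msg = EmailMessage()
    msg["From"] = "sender@example.org"
    msg["Subject"] = subject
    msg.set_content("body of " + subject)
    source.add(msg)
source.close()

files = split_mbox(workdir / "archive.mbox", workdir / "split")
names = [f.name for f in files]
first_subject = email.message_from_bytes(files[0].read_bytes())["Subject"]
print(names, first_subject)
```

Each resulting .eml file holds exactly one message, which is the kind of input a single-message parser such as MIME::Parser expects.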

How can I open multiple attachments of the same name in an email, then move the sender of the attachment to a spreadsheet?

I have an internship and was recently assigned the tedious task of cleaning the email lists. My employer has sent me a series of emails with email bounces as attachments, many at a time, all with the same name. I have been thinking about how to do this most efficiently, as I'd like to avoid just clicking through them one by one. My thought was to create a macro using AutoHotkey's language, but I feel like a batch file or some sort of Perl script might do the same thing. Could anybody give me an idea as to how to do this, specifically with a batch file? Thanks in advance!
Mail::DeliveryStatus::BounceParser parses bouncing email addresses out of delivery report messages.
If you don't know any Perl, then I recommend that you first convert the mailbox into a format that stores each email in a separate text file, like MH or similar.
At that point, you can simply run the command grep _pattern_ * | sed -e 's/:.*//' | sort | uniq > _list_ in that directory to obtain a list of all files matching _pattern_. You may inspect/edit this file _list_ to verify that the desired results were obtained.
You may then create another directory, junk or whatever, and move all the files listed in _list_ into junk with a command like perl -ne 'chomp; rename($_, "junk/$_");' < _list_.
If you'll need this regularly, then you could automate it further, likely using Perl alone, but a one-off task will probably involve more messing about with getting the right message list.
Alternatively, you could load all the emails into a single folder in a sane mail reader, like Mac OS X's Mail.app, and simply use its search, select all, and move/delete commands.
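For bounces that follow the standard delivery status notification layout (a multipart/report message with a message/delivery-status part), the failed addresses can also be pulled out with Python's standard email package. The sketch below is in Python rather than Perl or batch; the sample bounce is invented for the example, and real bounces that do not follow this layout would need a more tolerant parser such as the Perl module mentioned above.

```python
import email

# Invented sample bounce in the standard multipart/report layout.
SAMPLE_BOUNCE = b"""\
From: MAILER-DAEMON@mx.example.org
To: list-owner@example.org
Subject: Undelivered Mail Returned to Sender
MIME-Version: 1.0
Content-Type: multipart/report; report-type=delivery-status; boundary="BOUND"

--BOUND
Content-Type: text/plain

The message could not be delivered.

--BOUND
Content-Type: message/delivery-status

Reporting-MTA: dns; mx.example.org

Final-Recipient: rfc822; gone@example.com
Action: failed
Status: 5.1.1

--BOUND--
"""


def failed_recipients(raw_bytes):
    """Return addresses from Final-Recipient fields whose Action is 'failed'."""
    message = email.message_from_bytes(raw_bytes)
    addresses = []
    # walk() also visits the header blocks inside message/delivery-status,
    # which the parser exposes as nested message objects.
    for part in message.walk():
        recipient = part.get("Final-Recipient")
        if recipient is None:
            continue
        if part.get("Action", "").strip().lower() != "failed":
            continue
        # The field looks like "rfc822; user@example.com".
        addresses.append(recipient.split(";", 1)[-1].strip())
    return addresses


bounced = failed_recipients(SAMPLE_BOUNCE)
print(bounced)
```

The resulting list could then be written out with the csv module to get a file that opens in a spreadsheet.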

Is there a difference between the Outlook .MSG and .OFT file formats?

This question is somewhat of a long shot, but I've spent hours on it to no avail. I have some code that generates an email file on a webserver, and allows the user to download that email and open it in Outlook. From here, they can make various manual changes to the email before they send it to a bunch of people.
Right now, I generate a .OFT file, which is basically an email template. What I want to do is generate a .MSG file, which is an actual email. From a binary point of view, it seems these file formats are identical. They have the same Stream IDs and properties and stuff.
My approach was to first create a blank email message in Outlook and then just save it to a file called Base.oft. In my code, I open the document and modify Stream ID __substg1.0_1013001E, which is the ID for the HTML email body. I then save the file and write it out to the client. This works perfectly.
I tried the same approach with the MSG format. I created a blank email message, saved it as Base.msg, and modify the same Stream ID. If I look at the resulting file, the new body is actually in there and saved. However, if I open the email, the body is still blank.
What's even weirder is if I type in a body in Outlook and save that to the base file, I can see that body under stream 0_1013001E. If I then modify that stream with a different body, I can verify the new body is indeed saved in the file, but if I open the message in Outlook, I see the old, original body. It's as if the email body is stored in a different place in the file for the .MSG format, however I've looked through each stream and cannot find anything else that looks like it could be an email body.
Perhaps .MSG files are encrypted, or their bodies are stored in some proprietary binary format unlike .OFT files? Hopefully someone has some insight on this, as I scoured the Internet and found basically nothing on these formats.
Update:
It seems the .MSG format stores the body in Stream ID __substg1.0_10090102, which is encoded in some binary form (not sure what). If I delete the stream (or set it to a single \0), the file becomes corrupt.
First of all, to find more information on this and related topics, move away from raw substream numbers and google for the corresponding MAPI properties. For example, 1013 is PR_HTML and 1009 is PR_RTF_COMPRESSED. MAPI has ways of synching the body from one format to the other.
See this article on MSDN for a good overview of all content-related MAPI properties (i.e. the different "streams" inside the .MSG file).
To write PR_RTF_COMPRESSED, wrap the stream with WrapCompressedRTFStream. On the other hand, in your particular situation you might want to avoid MAPI dependencies in your code, so maybe you're better off finding PR_STORE_SUPPORT_MASK and setting the STORE_UNCOMPRESSED_RTF bit. This will allow you to use straight RTF in the PR_RTF_COMPRESSED substream. Or Outlook's fancy HTML-wrapped-in-RTF, if you are feeling brave.
None of this stuff is for the faint of heart, but seeing how you are already handling raw .MSG substream writing, I'm guessing it would be feasible.
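As a small aid for mapping raw substream names to MAPI properties, here is a sketch in Python that splits a name such as __substg1.0_1013001E into its property ID and property type. The lookup tables are a tiny illustrative subset: the 1013 and 1009 names follow the answer above, while PR_BODY and the type names are added from the MAPI property definitions.

```python
import re

# Illustrative subset only; 0x1013 and 0x1009 are named as in the answer above.
PROPERTY_NAMES = {
    0x1000: "PR_BODY",
    0x1009: "PR_RTF_COMPRESSED",
    0x1013: "PR_HTML",
}
PROPERTY_TYPES = {
    0x001E: "PT_STRING8",
    0x001F: "PT_UNICODE",
    0x0102: "PT_BINARY",
}

# "__substg1.0_" followed by 4 hex digits of property ID and 4 of property type.
_SUBSTG = re.compile(r"^__substg1\.0_([0-9A-Fa-f]{4})([0-9A-Fa-f]{4})$")


def describe_substream(name):
    """Return (property name, property type) for a .MSG substream name."""
    match = _SUBSTG.match(name)
    if match is None:
        raise ValueError(f"not a property substream name: {name!r}")
    prop_id = int(match.group(1), 16)
    prop_type = int(match.group(2), 16)
    return (
        PROPERTY_NAMES.get(prop_id, f"0x{prop_id:04X}"),
        PROPERTY_TYPES.get(prop_type, f"0x{prop_type:04X}"),
    )


html_stream = describe_substream("__substg1.0_1013001E")
rtf_stream = describe_substream("__substg1.0_10090102")
print(html_stream, rtf_stream)
```

Seen this way, the two streams from the question are the HTML body stored as a string and the compressed RTF body stored as binary.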
When it comes to the format, there is no difference.
The only difference is that OFT files have CLSID_TemplateMessage ({0006F046-0000-0000-C000-000000000046}) as the storage class (WriteClassStg), while MSG files use CLSID_MailMessage ({00020D0B-0000-0000-C000-000000000046}).
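To check which storage class a given file carries without going through the COM APIs, one could read the CLSID of the root storage straight from the compound file. The Python sketch below assumes the layout described in the compound file binary format specification (sector shift at header offset 30, first directory sector at offset 48, CLSID at byte 80 of the root directory entry). It is demonstrated on a synthetic buffer only and has not been tried against real Outlook files.

```python
import struct
import uuid

CFB_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")
CLSID_TEMPLATE_MESSAGE = uuid.UUID("0006F046-0000-0000-C000-000000000046")  # .oft
CLSID_MAIL_MESSAGE = uuid.UUID("00020D0B-0000-0000-C000-000000000046")      # .msg


def root_clsid_offset(data):
    """Locate the 16-byte CLSID field of the root storage entry."""
    if data[:8] != CFB_SIGNATURE:
        raise ValueError("not a compound file")
    (sector_shift,) = struct.unpack_from("<H", data, 30)
    (first_dir_sector,) = struct.unpack_from("<I", data, 48)
    sector_size = 1 << sector_shift
    # Sector n starts at (n + 1) * sector_size; the root entry is the first
    # directory entry, and its CLSID sits at byte 80 of that entry.
    return (first_dir_sector + 1) * sector_size + 80


def read_root_clsid(data):
    """Return the root storage CLSID as a uuid.UUID."""
    offset = root_clsid_offset(data)
    return uuid.UUID(bytes_le=bytes(data[offset:offset + 16]))


def with_root_clsid(data, clsid):
    """Return a copy of the file contents with the root CLSID replaced."""
    offset = root_clsid_offset(data)
    patched = bytearray(data)
    patched[offset:offset + 16] = clsid.bytes_le
    return bytes(patched)


# Synthetic stand-in for a file: a 512-byte header plus one directory sector.
fake = bytearray(1024)
fake[:8] = CFB_SIGNATURE
struct.pack_into("<H", fake, 30, 9)   # sector size 2**9 = 512 bytes
struct.pack_into("<I", fake, 48, 0)   # directory starts in sector 0
fake[512 + 80:512 + 96] = CLSID_TEMPLATE_MESSAGE.bytes_le

original_clsid = read_root_clsid(bytes(fake))
patched_clsid = read_root_clsid(with_root_clsid(bytes(fake), CLSID_MAIL_MESSAGE))
print(original_clsid, patched_clsid)
```

Note that GUIDs are stored with their first three fields in little-endian order, which is why the sketch uses bytes_le rather than bytes.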