Extracting the body of an email from mbox file, decoding it to plain text regardless of Charset and Content Transfer Encoding

Extracting the body of an email from mbox file, decoding it to plain text regardless of Charset and Content Transfer Encoding - email

I am trying to use Python 3 to extract the body of email messages from a thunderbird mbox file. It is an IMAP account.
I would like to have the text part of the body of the email available to process as a unicode string. It should 'look like' the email does in Thunderbird, and not contain escaped characters such as \r\n =20 etc.
I think that it is the Content Transfer Encodings that I don't know how to decode or remove.
I receive emails with a variety of different Content Types, and different Content Transfer Encodings.
This is my current attempt :
import mailbox
import quopri,base64
def myconvert(encoded,ContentTransferEncoding):
if ContentTransferEncoding == 'quoted-printable':
result = quopri.decodestring(encoded)
elif ContentTransferEncoding == 'base64':
result = base64.b64decode(encoded)
mboxfile = 'C:/Users/Username/Documents/Thunderbird/Data/profile/ImapMail/server.name/INBOX'
for msg in mailbox.mbox(mboxfile):
if msg.is_multipart(): #Walk through the parts of the email to find the text body.
for part in msg.walk():
if part.is_multipart(): # If part is multipart, walk through the subparts.
for subpart in part.walk():
if subpart.get_content_type() == 'text/plain':
body = subpart.get_payload() # Get the subpart payload (i.e the message body)
for k,v in subpart.items():
if k == 'Content-Transfer-Encoding':
cte = v # Keep the Content Transfer Encoding
elif subpart.get_content_type() == 'text/plain':
body = part.get_payload() # part isn't multipart Get the payload
for k,v in part.items():
if k == 'Content-Transfer-Encoding':
cte = v # Keep the Content Transfer Encoding
print(body)
print('Body is of type:',type(body))
body = myconvert(body,cte)
print(body)
But this fails with :
Body is of type: <class 'str'>
Traceback (most recent call last):
File "C:/Users/David/Documents/Python/test2.py", line 31, in <module>
body = myconvert(body,cte)
File "C:/Users/David/Documents/Python/test2.py", line 6, in myconvert
result = quopri.decodestring(encoded)
File "C:\Python32\lib\quopri.py", line 164, in decodestring
return a2b_qp(s, header=header)
TypeError: 'str' does not support the buffer interface

Here is some code that does the job, it prints errors instead of crashing for those messages where it would fail. I hope that it may be useful. Note that if there is a bug in Python 3, and that is fixed, then the lines .get_payload(decode=True) may then return a str object instead of a bytes object. I ran this code today on 2.7.2 and on Python 3.2.1.
import mailbox
def getcharsets(msg):
charsets = set({})
for c in msg.get_charsets():
if c is not None:
charsets.update([c])
return charsets
def handleerror(errmsg, emailmsg,cs):
print()
print(errmsg)
print("This error occurred while decoding with ",cs," charset.")
print("These charsets were found in the one email.",getcharsets(emailmsg))
print("This is the subject:",emailmsg['subject'])
print("This is the sender:",emailmsg['From'])
def getbodyfromemail(msg):
body = None
#Walk through the parts of the email to find the text body.
if msg.is_multipart():
for part in msg.walk():
# If part is multipart, walk through the subparts.
if part.is_multipart():
for subpart in part.walk():
if subpart.get_content_type() == 'text/plain':
# Get the subpart payload (i.e the message body)
body = subpart.get_payload(decode=True)
#charset = subpart.get_charset()
# Part isn't multipart so get the email body
elif part.get_content_type() == 'text/plain':
body = part.get_payload(decode=True)
#charset = part.get_charset()
# If this isn't a multi-part message then get the payload (i.e the message body)
elif msg.get_content_type() == 'text/plain':
body = msg.get_payload(decode=True)
# No checking done to match the charset with the correct part.
for charset in getcharsets(msg):
try:
body = body.decode(charset)
except UnicodeDecodeError:
handleerror("UnicodeDecodeError: encountered.",msg,charset)
except AttributeError:
handleerror("AttributeError: encountered" ,msg,charset)
return body
#mboxfile = 'C:/Users/Username/Documents/Thunderbird/Data/profile/ImapMail/server.name/INBOX'
print(mboxfile)
for thisemail in mailbox.mbox(mboxfile):
body = getbodyfromemail(thisemail)
print(body[0:1000])

This script seems to return all messages correctly:
def getcharsets(msg):
charsets = set({})
for c in msg.get_charsets():
if c is not None:
charsets.update([c])
return charsets
def getBody(msg):
while msg.is_multipart():
msg=msg.get_payload()[0]
t=msg.get_payload(decode=True)
for charset in getcharsets(msg):
t=t.decode(charset)
return t
Former answer from acd often returns only some footer of the real message.
(
at least in the GMANE email messagens I am opening for this toolbox:
https://pypi.python.org/pypi/gmane
)
cheers

Related

Parsing email message headers in Go

How I can read some headers from an email message in Go?
Usually I would use ReadMIMEHeader(), but sadly not everybody has read all the relevant RFCs and for some messages I get output like:
malformed MIME header line: name="7DDA4_foo_9E5D72.zip"
I narrowed the culprit to be
Content-Type: application/x-zip-compressed; x-unix-mode=0600;
name="7DDA4_foo_9E5D72.zip"
instead of
Content-Type: application/x-zip-compressed; x-unix-mode=0600;
name="7DDA4_foo_9E5D72.zip"
in the source of the message.
Go Playground example
What is the correct way of parsing the headers correctly, regardless if indented or not?

Given that the message is malformed, I would fix it through a separate piece of code that reformats the message:
func fixBrokenMime(r_ io.Reader, w io.WriteCloser) {
r := bufio.NewScanner(bufio.NewReader(r_))
for r.Scan() {
line := r.Text()
if len(line) > 0 && line[0] != ' ' && strings.IndexByte(line, ':') < 0 {
line = " " + line
}
w.Write([]byte(line+"\n"))
}
w.Close()
}
Playground: http://play.golang.org/p/OZsXT7pmtN
Obviously, you may want a different heuristic. I assumed that a line that is not indented and doesn't contain ":", must be indented.

Check out https://github.com/sendgrid/go-gmime (disclaimer, I work with SendGrid, but did not put together anything in the lib)

cgi.parse_multipart function throws TypeError in Python 3

I'm trying to make an exercise from Udacity's Full Stack Foundations course. I have the do_POST method inside my subclass from BaseHTTPRequestHandler, basically I want to get a post value named message submitted with a multipart form, this is the code for the method:
def do_POST(self):
try:
if self.path.endswith("/Hello"):
self.send_response(200)
self.send_header('Content-type', 'text/html')
self.end_headers
ctype, pdict = cgi.parse_header(self.headers['content-type'])
if ctype == 'multipart/form-data':
fields = cgi.parse_multipart(self.rfile, pdict)
messagecontent = fields.get('message')
output = ""
output += "<html><body>"
output += "<h2>Ok, how about this?</h2>"
output += "<h1>{}</h1>".format(messagecontent)
output += "<form method='POST' enctype='multipart/form-data' action='/Hello'>"
output += "<h2>What would you like to say?</h2>"
output += "<input name='message' type='text'/><br/><input type='submit' value='Submit'/>"
output += "</form></body></html>"
self.wfile.write(output.encode('utf-8'))
print(output)
return
except:
self.send_error(404, "{}".format(sys.exc_info()[0]))
print(sys.exc_info() )
The problem is that the cgi.parse_multipart(self.rfile, pdict) is throwing an exception: TypeError: can't concat bytes to str, the implementation was provided in the videos for the course, but they're using Python 2.7 and I'm using python 3, I've looked for a solution all afternoon but I could not find anything useful, what would be the correct way to read data passed from a multipart form in python 3?

I've came across here to solve the same problem like you have.
I found a silly solution for that.
I just convert 'boundary' item in the dictionary from string to bytes with an encoding option.
ctype, pdict = cgi.parse_header(self.headers['content-type'])
pdict['boundary'] = bytes(pdict['boundary'], "utf-8")
if ctype == 'multipart/form-data':
fields = cgi.parse_multipart(self.rfile, pdict)
In my case, It seems work properly.

To change the tutor's code to work for Python 3 there are three error messages you'll have to combat:
If you get these error messages
c_type, p_dict = cgi.parse_header(self.headers.getheader('Content-Type'))
AttributeError: 'HTTPMessage' object has no attribute 'getheader'
or
boundary = pdict['boundary'].decode('ascii')
AttributeError: 'str' object has no attribute 'decode'
or
headers['Content-Length'] = pdict['CONTENT-LENGTH']
KeyError: 'CONTENT-LENGTH'
when running
c_type, p_dict = cgi.parse_header(self.headers.getheader('Content-Type'))
if c_type == 'multipart/form-data':
fields = cgi.parse_multipart(self.rfile, p_dict)
message_content = fields.get('message')
this applies to you.
Solution
First of all change the first line to accommodate Python 3:
- c_type, p_dict = cgi.parse_header(self.headers.getheader('Content-Type'))
+ c_type, p_dict = cgi.parse_header(self.headers.get('Content-Type'))
Secondly, to fix the error of 'str' object not having any attribute 'decode', it's because of the change of strings being turned into unicode strings as of Python 3, instead of being equivalent to byte strings as in Python 3, so add this line just under the above one:
p_dict['boundary'] = bytes(p_dict['boundary'], "utf-8")
Thirdly, to fix the error of not having 'CONTENT-LENGTH' in pdict just add these lines before the if statement:
content_len = int(self.headers.get('Content-length'))
p_dict['CONTENT-LENGTH'] = content_len
Full solution on my Github:
https://github.com/rSkogeby/web-server

I am doing the same course and was running into the same problem. Instead of getting it to work with cgi I am now using the parse library. This was shown in the same course just a few lessons earlier.
from urllib.parse import parse_qs
length = int(self.headers.get('Content-length', 0))
body = self.rfile.read(length).decode()
params = parse_qs(body)
messagecontent = params["message"][0]
And you have to get rid of the enctype='multipart/form-data' in your form.

In my case I used cgi.FieldStorage to extract file and name instead of cgi.parse_multipart
form = cgi.FieldStorage(
fp=self.rfile,
headers=self.headers,
environ={'REQUEST_METHOD':'POST',
'CONTENT_TYPE':self.headers['Content-Type'],
})
print('File', form['file'].file.read())
print('Name', form['name'].value)

Another hack solution is to edit the source of the cgi module.
At the very beginning of the parse_multipart (around the 226th line):
Change the usage of the boundary to str(boundary)
...
boundary = b""
if 'boundary' in pdict:
boundary = pdict['boundary']
if not valid_boundary(boundary):
raise ValueError('Invalid boundary in multipart form: %r'
% (boundary,))
nextpart = b"--" + str(boundary)
lastpart = b"--" + str(boundary) + b"--"
...

creating an attachment for Google App Engine

Now my webapp is nearly finished I would like to email the result to myself in the form of an attachment. The content of the attachment is the content of the variable "Gcode" this is one big string that looks like:
"G00X12.46Y18.51
G01X12.46Y21.62
G01X13.79Y21.62
G00X10.00Y23.00"
When trying to send the mail I get an error that says: TypeError: 'list' object is not callable. I found the attachment has to be a bytestring. Is this the problem?
class send_mail(webapp2.RequestHandler):
def post(self):
material = self.request.get('material')
thickness = self.request.get('thickness')
fileName = self.request.get('fileName')
Gcode = self.request.get('Gcode')
sheetSize = self.request.get('sheetSize')
recieve_address = 'me#hotmail.com'
attachments=[(fileName+'.nc', 'Gcode')]
if not mail.is_email_valid(user_address):
Gcode = ''
else:
confirmation_url = Gcode
sender_address = "theLaserLab.com <theLaserLab#gmail.com>"
subject = "Order content"
body = "The attached file: %s.nc. Is supposed to cut from a sheet of %s that is %smm thick and has a %s dimension." % (fileName, material, thickness, sheetSize)
mail.send_mail(sender_address, recieve_address, subject, body, attachments)

Converting ISO-8859-1 to UTF-8 for MultipartFormData in Play2 + Scala when parsing email from Sendgrid

I have hooked up my Play2+Scala application to Sendgrid Parse Api and I'm really struggling in decoding and encoding the content of the email.
Since the emails could be in different encodings Sendgrid provides us with a JSON object charsets:
{"to":"UTF-8","cc":"UTF-8","subject":"UTF-8","from":"UTF-8","text":"iso-8859-1","html":"iso-8859-1"}
In my test case "text" is "Med Vänliga Hälsningar Jakobs Webshop"
If I extract that from the multipart request and print it out:
Logger.info(request.body.dataParts.get("text").get)
I get:
Med V?nliga H?lsningar Jakobs Webshop
Ok so with the given info from Sendgrid let's fix the string so that it is UTF-8.
def parseMail = Action(parse.multipartFormData) {
request => {
val inputBuffer = request.body.dataParts.get("text").map {
v => ByteBuffer.wrap(v.head.getBytes())
}
val fromCharset = Charset.forName("ISO-8859-1")
val toCharset = Charset.forName("UTF-8")
val data = fromCharset.decode(inputBuffer.get)
Logger.info(""+data)
val outputBuffer = toCharset.encode(data)
val text = new String(outputBuffer.array())
// Save stuff to MongoDB instance
}
This results in:
Med Vï¿½nliga Hï¿½lsningar Jakobs Webshop
So this is very strange. This should work.
I wonder what actually happens in the body parser parse.multipartFormData and the datapart handler:
def handleDataPart: PartHandler[Part] = {
case headers # PartInfoMatcher(partName) if !FileInfoMatcher.unapply(headers).isDefined =>
Traversable.takeUpTo[Array[Byte]](DEFAULT_MAX_TEXT_LENGTH)
.transform(Iteratee.consume[Array[Byte]]().map(bytes => DataPart(partName, new String(bytes, "utf-8")))(play.core.Execution.internalContext))
.flatMap { data =>
Cont({
case Input.El(_) => Done(MaxDataPartSizeExceeded(partName), Input.Empty)
case in => Done(data, in)
})
}(play.core.Execution.internalContext)
}
When consuming the data a new String is created with the encoding utf-8:
.transform(Iteratee.consume[Array[Byte]]().map(bytes => DataPart(partName, new String(bytes, "utf-8")))(play.core.Execution.internalContext))
Does this mean that my ISO-8859-1 encoded string text is encoded with utf-8 when parsed? If so, how should I create my parser to decode and then encode my params according to the provided JSON object charsets?
Clearly I'm doing something wrong but I can't figure it out!

You'll need to copy the implementation of the parse.multipartFormData function, changing the decodings from utf-8 to iso-8859-1, and use that in your Action.
The problem is that play decodes everything with UTF-8 by default, and there is no way to change that, other than implementing your own parser.

Have you tried changing the default encoding to UTF-8?
See this question for details: Printing Unicode from Scala interpreter

How do I generate binary RFC822-style headers in Python 3.2?

How do I convince email.generator.Generator to use binary in Python 3.2? This seems like precisely the use case for the policy framework that was introduced in Python 3.3, but I would like my code to run in 3.2.
from email.parser import Parser
from email.generator import Generator
from io import BytesIO, StringIO
data = "Key: \N{SNOWMAN}\r\n\r\n"
message = Parser().parse(StringIO(data))
with open("/tmp/rfc882test", "w") as out:
Generator(out, maxheaderlen=0).flatten(message)
Fails with UnicodeEncodeError: 'ascii' codec can't encode character '\u2603' in position 0: ordinal not in range(128).

Your data is not a valid RFC2822 header, which I suspect misleads you. It's a Unicode string, but RFC2822 is always only ASCII. To have non-ASCII characters you need to encode them with a character set and either base64 or quoted-printable encoding.
Hence, valid code would be this:
from email.parser import Parser
from email.generator import Generator
from io import BytesIO, StringIO
data = "Key: =?utf8?b?4piD?=\r\n\r\n"
message = Parser().parse(StringIO(data))
with open("/tmp/rfc882test", "w") as out:
Generator(out, maxheaderlen=0).flatten(message)
Which of course avoids the error completely.
The question is how to generate such headers as =?utf8?b?4piD?= and the answer lies in the email.header module.
I made this example with:
>>> from email import header
>>> header.Header('\N{SNOWMAN}', 'utf8').encode()
'=?utf8?b?4piD?='
To handle files that have a Key: Value format the email module is the wrong solution. Handling such files are easy enough without the email module, and you will not have to work around the restrictions of RF2822. For example:
# -*- coding: UTF-8 -*-
import io
import sys
if sys.version_info > (3,):
def u(s): return s
else:
def u(s): return s.decode('unicode-escape')
def parse(infile):
res = {}
payload = ''
for line in infile:
key, value = line.strip().split(': ',1)
if key in res:
raise ValueError(u("Key {0} appears twice").format(key))
res[key] = value
return res
def generate(outfile, data):
for key in data:
outfile.write(u("{0}: {1}\n").format(key, data[key]))
if __name__ == "__main__":
# Ensure roundtripping:
data = {u('Key'): u('Value'), u('Foo'): u('Bar'), u('Frötz'): u('Öpöpöp')}
with io.open('/tmp/outfile.conf', 'wt', encoding='UTF8') as outfile:
generate(outfile, data)
with io.open('/tmp/outfile.conf', 'rt', encoding='UTF8') as infile:
res = parse(infile)
assert data == res
That code took 15 minutes to write, and works in both Python 2 and Python 3. If you want line continuations etc that's easy to add as well.
Here is a more complete one that supports comments etc.

A useful solution comes from http://mail.python.org/pipermail/python-dev/2010-October/104409.html :
from email.parser import Parser
from email.generator import BytesGenerator
# How do I get surrogateescape from a BytesIO/StringIO?
data = "Key: \N{SNOWMAN}\r\n\r\n" # write this to headers.txt
headers = open("headers.txt", "r", encoding="ascii", errors="surrogateescape")
message = Parser().parse(headers)
with open("/tmp/rfc882test", "wb") as out:
BytesGenerator(out, maxheaderlen=0).flatten(message)
This is for a program that wants to read and write a binary Key: value file without caring about the encoding. To consume the headers as decoded text without being able to write them back out with Generator(), Parser().parse(open("headers.txt", "r", encoding="utf-8")) should be sufficient.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Extracting the body of an email from mbox file, decoding it to plain text regardless of Charset and Content Transfer Encoding - email

Related

Parsing email message headers in Go

cgi.parse_multipart function throws TypeError in Python 3

creating an attachment for Google App Engine

Converting ISO-8859-1 to UTF-8 for MultipartFormData in Play2 + Scala when parsing email from Sendgrid

How do I generate binary RFC822-style headers in Python 3.2?

Categories

Resources