I have an app that is storing images in a Windows Azure Block Blob. I'm adding meta data to each blob that gets uploaded. The metadata may include some special characters. For instance, the registered trademark symbol (®). How do I add this value to meta data in Windows Azure?
Currently, when I try, I get a 400 (Bad Request) error anytime I try to upload a file that uses a special character like this.
Thank you!
You might use HttpUtility to encode/decode the string:
blob.Metadata["Description"] = HttpUtility.HtmlEncode(model.Description);
Description = HttpUtility.HtmlDecode(blob.Metadata["Description"]);
http://lvbernal.blogspot.com/2013/02/metadatos-de-azure-vs-caracteres.html
The supported characters in the blob metadata must be ASCII characters. To work around this you can either escape the string ( percent encode), base64 encode etc.
joe
HttpUtility.HtmlEncode may not work; if Unicode characters are in your string (i.e. ’), it will fail. So far, I have found Uri.EscapeDataString does handle this edge case and others. However, there are a number of characters that get encoded unnecessarily, such as space (' '=chr(32)=%20).
I mapped the illegal ascii characters metadata will not accept and built this to restore the characters:
static List<string> illegals = new List<string> { "%1", "%2", "%3", "%4", "%5", "%6", "%7", "%8", "%A", "%B", "%C", "%D", "%E", "%F", "%10", "%11", "%12", "%13", "%14", "%15", "%16", "%17", "%18", "%19", "%1A", "%1B", "%1C", "%1D", "%1E", "%1F", "%7F", "%80", "%81", "%82", "%83", "%84", "%85", "%86", "%87", "%88", "%89", "%8A", "%8B", "%8C", "%8D", "%8E", "%8F", "%90", "%91", "%92", "%93", "%94", "%95", "%96", "%97", "%98", "%99", "%9A", "%9B", "%9C", "%9D", "%9E", "%9F", "%A0", "%A1", "%A2", "%A3", "%A4", "%A5", "%A6", "%A7", "%A8", "%A9", "%AA", "%AB", "%AC", "%AD", "%AE", "%AF", "%B0", "%B1", "%B2", "%B3", "%B4", "%B5", "%B6", "%B7", "%B8", "%B9", "%BA", "%BB", "%BC", "%BD", "%BE", "%BF", "%C0", "%C1", "%C2", "%C3", "%C4", "%C5", "%C6", "%C7", "%C8", "%C9", "%CA", "%CB", "%CC", "%CD", "%CE", "%CF", "%D0", "%D1", "%D2", "%D3", "%D4", "%D5", "%D6", "%D7", "%D8", "%D9", "%DA", "%DB", "%DC", "%DD", "%DE", "%DF", "%E0", "%E1", "%E2", "%E3", "%E4", "%E5", "%E6", "%E7", "%E8", "%E9", "%EA", "%EB", "%EC", "%ED", "%EE", "%EF", "%F0", "%F1", "%F2", "%F3", "%F4", "%F5", "%F6", "%F7", "%F8", "%F9", "%FA", "%FB", "%FC", "%FD", "%FE" };
private static string MetaDataEscape(string value)
{
//CDC%20Guideline%20for%20Prescribing%20Opioids%20Module%206%3A%20%0Ahttps%3A%2F%2Fwww.cdc.gov%2Fdrugoverdose%2Ftraining%2Fdosing%2F
var x = HttpUtility.HtmlEncode(value);
var sz = value.Trim();
sz = Uri.EscapeDataString(sz);
for (int i = 1; i < 255; i++)
{
var hex = "%" + i.ToString("X");
if (!illegals.Contains(hex))
{
sz = sz.Replace(hex, Uri.UnescapeDataString(hex));
}
}
return sz;
}
The result is:
Before ==> "1080x1080 Facebook Images"
Uri.EscapeDataString =>
"1080x1080%20Facebook%20Images"
After => "1080x1080 Facebook
Images"
I am sure there is a more efficient way, but the hit seems negligible for my needs.
How do I convince email.generator.Generator to use binary in Python 3.2? This seems like precisely the use case for the policy framework that was introduced in Python 3.3, but I would like my code to run in 3.2.
from email.parser import Parser
from email.generator import Generator
from io import BytesIO, StringIO
data = "Key: \N{SNOWMAN}\r\n\r\n"
message = Parser().parse(StringIO(data))
with open("/tmp/rfc882test", "w") as out:
Generator(out, maxheaderlen=0).flatten(message)
Fails with UnicodeEncodeError: 'ascii' codec can't encode character '\u2603' in position 0: ordinal not in range(128).
Your data is not a valid RFC2822 header, which I suspect misleads you. It's a Unicode string, but RFC2822 is always only ASCII. To have non-ASCII characters you need to encode them with a character set and either base64 or quoted-printable encoding.
Hence, valid code would be this:
from email.parser import Parser
from email.generator import Generator
from io import BytesIO, StringIO
data = "Key: =?utf8?b?4piD?=\r\n\r\n"
message = Parser().parse(StringIO(data))
with open("/tmp/rfc882test", "w") as out:
Generator(out, maxheaderlen=0).flatten(message)
Which of course avoids the error completely.
The question is how to generate such headers as =?utf8?b?4piD?= and the answer lies in the email.header module.
I made this example with:
>>> from email import header
>>> header.Header('\N{SNOWMAN}', 'utf8').encode()
'=?utf8?b?4piD?='
To handle files that have a Key: Value format the email module is the wrong solution. Handling such files are easy enough without the email module, and you will not have to work around the restrictions of RF2822. For example:
# -*- coding: UTF-8 -*-
import io
import sys
if sys.version_info > (3,):
def u(s): return s
else:
def u(s): return s.decode('unicode-escape')
def parse(infile):
res = {}
payload = ''
for line in infile:
key, value = line.strip().split(': ',1)
if key in res:
raise ValueError(u("Key {0} appears twice").format(key))
res[key] = value
return res
def generate(outfile, data):
for key in data:
outfile.write(u("{0}: {1}\n").format(key, data[key]))
if __name__ == "__main__":
# Ensure roundtripping:
data = {u('Key'): u('Value'), u('Foo'): u('Bar'), u('Frötz'): u('Öpöpöp')}
with io.open('/tmp/outfile.conf', 'wt', encoding='UTF8') as outfile:
generate(outfile, data)
with io.open('/tmp/outfile.conf', 'rt', encoding='UTF8') as infile:
res = parse(infile)
assert data == res
That code took 15 minutes to write, and works in both Python 2 and Python 3. If you want line continuations etc that's easy to add as well.
Here is a more complete one that supports comments etc.
A useful solution comes from http://mail.python.org/pipermail/python-dev/2010-October/104409.html :
from email.parser import Parser
from email.generator import BytesGenerator
# How do I get surrogateescape from a BytesIO/StringIO?
data = "Key: \N{SNOWMAN}\r\n\r\n" # write this to headers.txt
headers = open("headers.txt", "r", encoding="ascii", errors="surrogateescape")
message = Parser().parse(headers)
with open("/tmp/rfc882test", "wb") as out:
BytesGenerator(out, maxheaderlen=0).flatten(message)
This is for a program that wants to read and write a binary Key: value file without caring about the encoding. To consume the headers as decoded text without being able to write them back out with Generator(), Parser().parse(open("headers.txt", "r", encoding="utf-8")) should be sufficient.
I am trying to use Python 3 to extract the body of email messages from a thunderbird mbox file. It is an IMAP account.
I would like to have the text part of the body of the email available to process as a unicode string. It should 'look like' the email does in Thunderbird, and not contain escaped characters such as \r\n =20 etc.
I think that it is the Content Transfer Encodings that I don't know how to decode or remove.
I receive emails with a variety of different Content Types, and different Content Transfer Encodings.
This is my current attempt :
import mailbox
import quopri,base64
def myconvert(encoded,ContentTransferEncoding):
if ContentTransferEncoding == 'quoted-printable':
result = quopri.decodestring(encoded)
elif ContentTransferEncoding == 'base64':
result = base64.b64decode(encoded)
mboxfile = 'C:/Users/Username/Documents/Thunderbird/Data/profile/ImapMail/server.name/INBOX'
for msg in mailbox.mbox(mboxfile):
if msg.is_multipart(): #Walk through the parts of the email to find the text body.
for part in msg.walk():
if part.is_multipart(): # If part is multipart, walk through the subparts.
for subpart in part.walk():
if subpart.get_content_type() == 'text/plain':
body = subpart.get_payload() # Get the subpart payload (i.e the message body)
for k,v in subpart.items():
if k == 'Content-Transfer-Encoding':
cte = v # Keep the Content Transfer Encoding
elif subpart.get_content_type() == 'text/plain':
body = part.get_payload() # part isn't multipart Get the payload
for k,v in part.items():
if k == 'Content-Transfer-Encoding':
cte = v # Keep the Content Transfer Encoding
print(body)
print('Body is of type:',type(body))
body = myconvert(body,cte)
print(body)
But this fails with :
Body is of type: <class 'str'>
Traceback (most recent call last):
File "C:/Users/David/Documents/Python/test2.py", line 31, in <module>
body = myconvert(body,cte)
File "C:/Users/David/Documents/Python/test2.py", line 6, in myconvert
result = quopri.decodestring(encoded)
File "C:\Python32\lib\quopri.py", line 164, in decodestring
return a2b_qp(s, header=header)
TypeError: 'str' does not support the buffer interface
Here is some code that does the job, it prints errors instead of crashing for those messages where it would fail. I hope that it may be useful. Note that if there is a bug in Python 3, and that is fixed, then the lines .get_payload(decode=True) may then return a str object instead of a bytes object. I ran this code today on 2.7.2 and on Python 3.2.1.
import mailbox
def getcharsets(msg):
charsets = set({})
for c in msg.get_charsets():
if c is not None:
charsets.update([c])
return charsets
def handleerror(errmsg, emailmsg,cs):
print()
print(errmsg)
print("This error occurred while decoding with ",cs," charset.")
print("These charsets were found in the one email.",getcharsets(emailmsg))
print("This is the subject:",emailmsg['subject'])
print("This is the sender:",emailmsg['From'])
def getbodyfromemail(msg):
body = None
#Walk through the parts of the email to find the text body.
if msg.is_multipart():
for part in msg.walk():
# If part is multipart, walk through the subparts.
if part.is_multipart():
for subpart in part.walk():
if subpart.get_content_type() == 'text/plain':
# Get the subpart payload (i.e the message body)
body = subpart.get_payload(decode=True)
#charset = subpart.get_charset()
# Part isn't multipart so get the email body
elif part.get_content_type() == 'text/plain':
body = part.get_payload(decode=True)
#charset = part.get_charset()
# If this isn't a multi-part message then get the payload (i.e the message body)
elif msg.get_content_type() == 'text/plain':
body = msg.get_payload(decode=True)
# No checking done to match the charset with the correct part.
for charset in getcharsets(msg):
try:
body = body.decode(charset)
except UnicodeDecodeError:
handleerror("UnicodeDecodeError: encountered.",msg,charset)
except AttributeError:
handleerror("AttributeError: encountered" ,msg,charset)
return body
#mboxfile = 'C:/Users/Username/Documents/Thunderbird/Data/profile/ImapMail/server.name/INBOX'
print(mboxfile)
for thisemail in mailbox.mbox(mboxfile):
body = getbodyfromemail(thisemail)
print(body[0:1000])
This script seems to return all messages correctly:
def getcharsets(msg):
charsets = set({})
for c in msg.get_charsets():
if c is not None:
charsets.update([c])
return charsets
def getBody(msg):
while msg.is_multipart():
msg=msg.get_payload()[0]
t=msg.get_payload(decode=True)
for charset in getcharsets(msg):
t=t.decode(charset)
return t
Former answer from acd often returns only some footer of the real message.
(
at least in the GMANE email messagens I am opening for this toolbox:
https://pypi.python.org/pypi/gmane
)
cheers