How to retrieve an IMAP body with the right charset? - email

I'm trying to issue an IMAP FETCH command to retrieve the message body. I need to pass/use the correct charset, otherwise the response comes back with garbled special characters.
How can I make the IMAP request/command take into account the charset I received in the BODYSTRUCTURE result?

The IMAP server will send non-ASCII message body data as an 8-bit "literal", which is essentially an array of bytes. You can't make the FETCH command use a specific charset, as far as I know.
It's up to the IMAP library or your application to decode the byte array into a string representation using the appropriate charset you received in the BODYSTRUCTURE response.
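In other words, you decode the literal yourself. A minimal sketch of that decode step, where the sample bytes and charset stand in for a real FETCH result and the charset parsed from BODYSTRUCTURE:

```python
# Raw body literal as the server would send it (sample UTF-8 bytes here;
# in practice this comes straight out of the FETCH response).
raw_literal = b"Naro\xc4\x8dilo \xc5\xa1t."

# Charset as parsed from the BODYSTRUCTURE response, e.g. ("CHARSET" "utf-8")
charset = "utf-8"

# The decode step your application (or IMAP library) must perform itself
body = raw_literal.decode(charset)
print(body)  # Naročilo št.
```

The server never transcodes the literal for you; the same bytes arrive regardless of what charset you would prefer, so this decode always happens on the client side.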

Related

How can I decode email contents that are differently "Content-Transfer-Encoding" encoded?

I'm reading emails with imaplib, and found that some email contents are encoded as base64 and some as 7bit.
I tried to decode them using the 'Content-Transfer-Encoding' value.
But on top of that, some messages have the 'Content-Transfer-Encoding' header on the message object itself, whereas others have it on message.get_payload()[0].
I can handle the cases I've seen so far, but I suspect there are more cases I haven't encountered yet.
Is there a better way to decode email contents, no matter how they are encoded?
Thanks :)
When using get_payload(), I added the decode=True option so that it automatically decodes the payload if needed.
Then isinstance(content, bytes) tells you whether you still have to decode the bytes to unicode.
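A sketch of that approach with the standard email module (the sample base64 message is made up for illustration):

```python
import email

# A made-up message with a base64 Content-Transfer-Encoding
raw = (b"Content-Type: text/plain; charset=utf-8\r\n"
       b"Content-Transfer-Encoding: base64\r\n"
       b"\r\n"
       b"aGVsbG8gd29ybGQ=\r\n")

msg = email.message_from_bytes(raw)

# decode=True undoes base64 / quoted-printable / 7bit transparently
payload = msg.get_payload(decode=True)

# If we got bytes back, decode them using the part's declared charset
if isinstance(payload, bytes):
    payload = payload.decode(msg.get_content_charset() or "utf-8")
print(payload)  # hello world
```

For multipart messages, you would apply the same logic to each part returned by msg.walk() instead of to the message object itself.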

How does an email client read the content-type headers for encoding?

It is possible to send an email with different content types: text/html, text/plain, MIME multipart, etc. It is also possible to use different encodings, including (according to the RFCs) for header fields: us-ascii, UTF-8, etc.
How do you solve the chicken and egg problem? The content-type header is just one of several headers. If the headers can be any encoding, how does a mail server or client know how to read the content-type header if it does not know what encoding the headers themselves are in?
I could see how it would work if, for example, the first line had to be the Content-Type header and had to be in a pre-agreed encoding (e.g. ASCII), but that is not the case.
How do you parse a stream of bytes whose encoding is embedded as a string inside that very same stream?
Headers are defined to be in ASCII. They can be in UTF-8 if agreed to out of band, such as via the SMTP or IMAP UTF-8 capability extensions.
Internationalization in headers is performed via "encoded words", where the encoding is part of the header data. (This looks like a string such as =?iso8859-1?q?sample_header_data?=.) See RFC 2047.
The Content-Type header does not apply to the headers themselves, only to the body content.
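Python's standard library can decode such encoded words directly, for example:

```python
from email.header import decode_header, make_header

raw = "=?iso8859-1?q?sample_header_data?="

# decode_header yields (bytes, charset) pairs; make_header joins and
# decodes them into a single unicode header value
decoded = str(make_header(decode_header(raw)))
print(decoded)  # sample header data
```

Note how the charset (iso8859-1) and transfer encoding (q, i.e. quoted-printable-like) travel inside the header value itself, which is what resolves the chicken-and-egg problem: the wrapper syntax is plain ASCII.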

Need to find the requests equivalent of urlopen() from urllib2

I am currently trying to modify a script to use the requests library instead of the urllib2 library. I haven't really used it before, and I am looking to do the equivalent of urlopen("http://www.example.org").read(), so I tried requests.get("http://www.example.org").text.
This works fine with normal everyday HTML, however when I fetch from this URL (https://gtfsrt.api.translink.com.au/Feed/SEQ) it doesn't seem to work.
So I wrote the code below to print out the responses from the same URL using both the requests and urllib2 libraries.
import urllib2
import requests

# urllib2 request
request = urllib2.Request("https://gtfsrt.api.translink.com.au/Feed/SEQ")
result = urllib2.urlopen(request)

# requests request
result2 = requests.get("https://gtfsrt.api.translink.com.au/Feed/SEQ")
print result2.encoding

# urllib2: write to file (opening with "w" already truncates any existing file)
text_file = open("Output.txt", "w")
text_file.write(result.read())
text_file.close()

# requests: write to file
text_file = open("Output2.txt", "w")
text_file.write(result2.text)
text_file.close()
The urlopen().read() works fine, but requests.get().text doesn't work for this URL. I suspect it has something to do with encoding, but I don't know what. Any thoughts?
Note: the supplied URL is a feed in the Google protocol buffer format; once I receive the message I pass the feed to a Google library that interprets it.
Your issue is that you're making the requests module interpret binary content in the response as text.
A response from the requests library has two main ways to access the body of the response:
Response.content - will return the response body as a bytestring
Response.text - will decode the response body as text and return unicode
Since protocol buffers are a binary format, you should use result2.content in your code instead of result2.text.
Response.content will return the body of the response as-is, in bytes. For binary content this is exactly what you want. For text content that contains non-ASCII characters, this means the content must have been encoded by the server into a bytestring using a particular encoding that is indicated by either an HTTP header or a <meta charset="..." /> tag. In order to make sense of those bytes, they therefore need to be decoded after receiving using that charset.
Response.text is a convenience property that does exactly this for you. It assumes the response body is text, looks at the response headers to find the encoding, and decodes it for you, returning unicode.
But if your response doesn't contain text, this is the wrong method to use. Binary content doesn't contain characters, because it's not text, so the whole concept of character encoding does not make any sense for binary content - it's only applicable to text composed of characters. (That's also why you're seeing response.encoding == None - it's just bytes, there is no character encoding involved).
See Response Content and Binary Response Content in the requests documentation for more details.
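A minimal illustration of why the text path mangles binary data (the sample bytes stand in for a protobuf payload):

```python
# Sample bytes standing in for a binary protobuf payload (made up here)
binary = bytes([0x0a, 0x0d, 0x00, 0xff, 0x10])

# Response.content path: the bytes as-is, nothing lost
kept = binary

# Response.text path: decode as text, then re-encode when saving the file.
# 0xff is not a valid UTF-8 byte, so it gets replaced and the payload breaks.
corrupted = binary.decode("utf-8", errors="replace").encode("utf-8")

print(kept == binary)       # True
print(corrupted == binary)  # False
```

This is exactly what happens when result2.text is written to Output2.txt: the protobuf parser then receives bytes that no longer match what the server sent.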

PayPal IPN wrong encoding issue

I have a problem with IPN from PayPal. I'm sending them data UTF-8 encoded (e.g. "Naročilo št" as the item name), and when somebody pays they send me a response which is malformed: "Naro�~D�~Milo �~E¡t." (the response claims to be UTF-8 encoded). Then when I try to validate that payment, I get that it's invalid.
I've tried changing the "buy button" encoding in my PayPal profile, but it's not working (I still get responses with wrongly encoded characters). Does anybody have an idea how to fix this issue? I would rather avoid transforming item names to plain ASCII or something similar.
Solution by OP.
According to this
When the Servlet container receives the request, it always passes the request parameters to your program decoded in ISO-8859-1 encoding (e.g. the browser encoded in UTF-8, but the container decoded in ISO-8859-1). Thus your servlet or JSP will always receive garbage for characters outside ISO-8859-1.
so the solution to my problem was acquiring request parameters that way:
String value = request.getParameter("mytext");
try {
    value = new String(value.getBytes("8859_1"), "UTF-8");
} catch (java.io.UnsupportedEncodingException e) {
    System.err.println(e);
}
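The same repair can be sketched in Python, assuming the container mis-decoded UTF-8 bytes as ISO-8859-1 (the sample string is made up):

```python
# Simulate what the container did: UTF-8 bytes wrongly decoded as ISO-8859-1
original = "Naro\u010dilo \u0161t."            # "Naročilo št."
garbled = original.encode("utf-8").decode("iso-8859-1")

# The fix: re-encode as ISO-8859-1 to recover the original bytes,
# then decode them with the charset they were actually written in
fixed = garbled.encode("iso-8859-1").decode("utf-8")
print(fixed == original)  # True
```

The trick works because ISO-8859-1 maps every byte value 0x00-0xFF to a character, so the round-trip through it is lossless and the original byte sequence can always be recovered.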

Handling diacritics in SIP headers

Following the SIMPLE specification of OMA, when sending a SIP INVITE for chat we can use a header named Subject.
Typically, this header contains the first message sent by a user to his contact.
My question is: this message can contain diacritics, so how should I encode them? Is there a standard definition on how to do this?
You should encode them as UTF-8, as specified in the SIP RFC (RFC 3261). There are a few SIP headers where UTF-8 is not allowed and US-ASCII with escaping rules is mandated, but the Subject header is not one of those.
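As a sketch, emitting such a Subject line as raw UTF-8 bytes on the wire (the message text is made up for illustration):

```python
# A first chat message containing diacritics (made-up example)
subject = "\u017divjo, kako si?"   # "Živjo, kako si?"

# SIP header lines are CRLF-terminated; non-ASCII text goes out as UTF-8
header_line = ("Subject: " + subject + "\r\n").encode("utf-8")
print(header_line)
```

No encoded-word wrapping is needed here, since the Subject header value may carry UTF-8 directly.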