I want to get the body of an email. The email is HTML and is stored in part. I use this code:
print('charset =', part.get_content_charset())
html = part.get_payload(decode=True)
print ('type =', type(html))
print('text =', html)
The result is:
charset = utf-8
type = <class 'bytes'>
text = b'...<font face="DejaVu Sans Mono">\\u044d\\u0442\\u043e html<br>\n...
I want normal text, not \u044d\u0442\u043e.
Python 3.4.1 (v3.4.1:c0e311e010fc, May 18 2014, 10:38:22) [MSC v.1600 32 bit (Intel)] on win32
Your data is a byte array. You have to decode the bytes to a string:
html.decode('utf-8')
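As a minimal sketch of that advice, the helper below decodes a part with the charset it declares instead of hard-coding UTF-8. The names decode_part and unescape_literal, and the stand-in message, are made up for illustration. The second helper is an assumption about the data: the output in the question shows a literal backslash followed by u044d inside the bytes, so the text may still contain escape sequences after decoding and would then need a second pass.

```python
from email.message import EmailMessage


def decode_part(part, fallback="utf-8"):
    """Return the part's body as str, using its declared charset."""
    payload = part.get_payload(decode=True)  # bytes, transfer encoding undone
    charset = part.get_content_charset() or fallback
    return payload.decode(charset, errors="replace")


def unescape_literal(text):
    """Turn literal backslash-u sequences left in the text into characters."""
    return text.encode("latin-1", "backslashreplace").decode("unicode_escape")


# Stand-in message for the demo; in the question, `part` comes from a parsed email.
part = EmailMessage()
part.set_content("<font>это html</font><br>", subtype="html", charset="utf-8")

html = decode_part(part)
print(type(html), html)
print(unescape_literal("\\u044d\\u0442\\u043e html"))
```

Note that unescape_literal also converts any other backslash sequence in the text, so it is only appropriate when the body is known to contain such escapes.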
Here I have a simple, exemplary code in MS Visual Studio:
#include <string>
#include <iostream>
using namespace std;
int main()
{
    cout << static_cast<int>('ą') << endl; // -71
    return 0;
}
Why does this print -71, as if MS Visual Studio were using Windows-1250, when as far as I know it uses UTF-8?
Your source file is saved in Windows-1250, not UTF-8, so the byte stored between the two single quotes is 0xB9 (see Windows-1250 table). 0xB9 taken as a signed 8-bit value is -71.
Save your file in UTF-8 encoding and you'll get a different answer. I get 50309, which is 0xC485. Since UTF-8 is a multibyte encoding, it would be better to use modern C++ to output the bytes of an explicit UTF-8 string, use UTF-8 source encoding, and tell the compiler explicitly that the source encoding is UTF-8:
test.c - saved in UTF-8 encoding and compiled with /utf-8 switch in MSVS:
#include <string>
#include <iostream>
#include <cstdint>
using namespace std;
int main()
{
    string s {u8"ą马"};
    for(auto c : s)
        cout << hex << static_cast<int>(static_cast<uint8_t>(c)) << endl;
    return 0;
}
Output:
c4
85
e9
a9
ac
Note that C4 85 are the correct UTF-8 bytes for ą, and E9 A9 AC are the correct bytes for Chinese 马 (horse).
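The byte values quoted in this answer can be cross-checked outside the compiler. The following is a small Python sketch (Python rather than C++, purely as a calculator) that encodes the same characters in both encodings and reinterprets the Windows-1250 byte as a signed 8-bit value.

```python
import struct

# 'ą' stored as a single Windows-1250 byte, as in the original source file.
cp1250_bytes = "ą".encode("cp1250")
signed_value = struct.unpack("b", cp1250_bytes)[0]  # read as signed 8-bit
print(cp1250_bytes.hex(), signed_value)  # b9 -71

# The same character in UTF-8 takes two bytes.
utf8_bytes = "ą".encode("utf-8")
print(utf8_bytes.hex(), int.from_bytes(utf8_bytes, "big"))  # c485 50309

# The Chinese character from the second example takes three bytes.
horse_bytes = "马".encode("utf-8")
print(horse_bytes.hex())  # e9a9ac
```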
I have a set of .msg files stored on the E:/ drive that I have to read and extract some information from. For that I am using the code below in Python 3.6:
from email.parser import Parser
p = Parser()
headers = p.parse(open('E:/Ratan/msg_files/Test1.msg', encoding='Latin-1'))
print('To: %s' % headers['To'])
print('From: %s' % headers['From'])
print('Subject: %s' % headers['subject'])
The output I get is:
To: None
From: None
Subject: None
I am not getting the actual values of the To, From and Subject fields.
Any thoughts why it is not printing the actual values?
Please download my sample msg file from this link:
drive.google.com/file/d/1pwWWG3BgsMKwRr0WmP8GqzG3WX4GmEy6/view
Here is a demonstration of how to use some of python's standard email libraries.
You didn't show us your input file in the question, and the g-drive URL is a dead link.
The code below looks just like yours and works fine, so I don't know what is odd about your environment, modulo some Windows 'rb' binary open nonsense, CRLFs, or the Latin1 encoding.
I threw in .upper() but it does nothing beyond showing that the API is case insensitive.
#! /usr/bin/env python3
from email.parser import Parser
from pathlib import Path
import mailbox
def extract_messages(maildir, mbox_file, k=2, verbose=False):
    # Split the mbox archive into one text file per message.
    for n, message in enumerate(mailbox.mbox(mbox_file)):
        with open(maildir / f'{n}.txt', 'w') as fout:
            fout.write(str(message))
    hdrs = 'From Date Subject In-Reply-To References Message-ID'.split()
    p = Parser()
    # Parse the first few files back and show their headers.
    for i in range(min(k, n)):
        with open(maildir / f'{i}.txt') as fin:
            msg = p.parse(fin)
            print([len(msg[hdr.upper()] or '')
                   for hdr in hdrs])
            # Loop names chosen so they do not shadow the parameter k.
            for name, value in msg.items():
                print(name, value)
            print('')
            if verbose:
                print(msg.get_payload())
if __name__ == '__main__':
    # from https://mail.python.org/pipermail/python-dev/
    maildir = Path('/tmp/py-dev/')
    extract_messages(maildir, maildir / '2018-January.txt')
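A shorter sketch of the two points above, using made-up sample text rather than the asker's file: header lookup is case insensitive, and a lookup returns None when the parser finds no such header. The second input is only an assumption about what could produce the None values in the question, namely a file whose first line is not an RFC 822 style header line.

```python
from email.parser import Parser

# Hypothetical well-formed message text.
raw = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Hello\n"
    "\n"
    "Body text\n"
)
msg = Parser().parsestr(raw)
print(msg["subject"], msg["SUBJECT"])  # same header, any case
print(msg["X-Missing"])                # None: header not present

# Input that does not start with a header line: the parser treats
# everything as body, so every header lookup returns None.
odd = Parser().parsestr("not a header line\nTo: bob@example.com\n")
print(odd["To"])
```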
I have this script to encrypt and decrypt text.
Why is it that when converting the decrypted text byte array to ASCII there is a space in between each character?
#Encrypt:
$unencryptedData = "passwordToEncrypt"
$pfxPassword = "P#ssw0rd1"
$certLocation = "D:\Ava\CA\Scripts\Encryption\PFXfiles\f-signed.pfx"
$cert = New-Object 'System.Security.Cryptography.X509Certificates.X509Certificate2'($certLocation, $pfxPassword, [System.Security.Cryptography.X509Certificates.X509KeyStorageFlags]::Exportable)
$publicKey = $cert.PublicKey.Key.ToXmlString($false)
$privateKey = $cert.PrivateKey.ToXmlString($true)
$unencryptedDataAsByteArray = [System.Text.Encoding]::Unicode.GetBytes($unencryptedData)
$keySize = 16384
$rsaProvider = New-Object System.Security.Cryptography.RSACryptoServiceProvider($keySize)
$rsaProvider.FromXmlString($publicKey)
$encryptedDataAsByteArray = $rsaProvider.Encrypt($unencryptedDataAsByteArray, $false)
$encryptedDataAsString = [System.Convert]::ToBase64String($encryptedDataAsByteArray)
Write-Host "Encrypted password = $encryptedDataAsString"
#Decrypt:
$rsaProvider.FromXmlString($privateKey)
$encryptedDataAsByteArray = [System.Convert]::FromBase64String($encryptedDataAsString)
$decryptedDataAsByteArray = $rsaProvider.Decrypt($encryptedDataAsByteArray, $false)
$decryptedDataAsString = [System.Text.Encoding]::ASCII.GetString($decryptedDataAsByteArray)
###### "p a s s w o r d T o E n c r y p t " ######
#$decryptedDataAsString = [System.Text.Encoding]::Unicode.GetString($decryptedDataAsByteArray)
###### "passwordToEncrypt" ######
Write-Host "Decrypted password = $decryptedDataAsString"
Consult Character Encodings in the .NET Framework. [System.Text.Encoding]::Unicode is UTF-16LE, so the character A is encoded as the 16-bit value 0x0041, bytes 0x41 0x00. [System.Text.Encoding]::ASCII is an 8-bit encoding, so when you decode 0x41 0x00 with ASCII you get the characters A and NUL (not a space).
You have to decode your byte array with the same encoding you encoded it in.
In the line:
$unencryptedDataAsByteArray = [System.Text.Encoding]::Unicode.GetBytes($unencryptedData)
Here you encode the string as UTF-16 (which .NET calls Unicode), so the byte array holds 2 bytes for every character in the string. When it is later decrypted, it is still 2 bytes per character.
You need to undo the steps in reverse order. First, decode the decrypted bytes as Unicode. Then, if you need ASCII, use one of the .NET Encoding.Convert methods.
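The effect is independent of the encryption step and can be reproduced with plain encode and decode calls. The sketch below uses Python rather than PowerShell, with "utf-16-le" standing in for [System.Text.Encoding]::Unicode.

```python
text = "passwordToEncrypt"

# Equivalent of [System.Text.Encoding]::Unicode.GetBytes(): two bytes per character.
data = text.encode("utf-16-le")
print(len(text), len(data))  # 17 34

# Decoding with the wrong encoding keeps every high byte as a NUL character.
wrong = data.decode("ascii")
print(repr(wrong[:6]))  # 'p\x00a\x00s\x00'

# Decoding with the encoding used for encoding restores the text.
right = data.decode("utf-16-le")
print(right)  # passwordToEncrypt
```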
I am capturing raw packets using Python's sockets:
import socket
s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.ntohs(0x0003))
while True:
    message = s.recv(4096)
    test = []
    print(len(message))
    print(repr(message))
I assumed that the packet returned would be in hex string format; however, print(repr(message)) gives me something like this:
b'\x00\x1b\xac\x00Gd\x00\x14\xd1+\x1f\x19\x05\n\x124VxC!UUUU\x00\x00\x00\x00\xcd\xcc\xcc=\xcd\xccL>\x9a\x99\x99>\xcd\xcc\xcc>\x00\x00\x00?\x9a\x......'
which has weird non hex characters like !UUUU or =. What encoding is this, and how do I decode the packet?
I know what the packet looks like ahead of time for now, since I'm the one generating the packets using winpcapy:
from ctypes import *
from winpcapy import *
import sys
import zlib
import binascii
import time
from ChanPackets import base, FrMessage, FrTodSync, FrChanConfig, FlChan, RlChan
while (1):
    now = time.time()
    errbuf = create_string_buffer(PCAP_ERRBUF_SIZE)
    fp = pcap_t
    deviceName = b'\\Device\\NPF_{8F5BD2E9-253F-4659-8256-B3BCD882AFBC}'
    fp = pcap_open_live(deviceName, 65536, 1, 1000, errbuf)
    if not bool(fp):
        print ("\nUnable to open the adapter. %s is not supported by WinPcap\n" % deviceName)
        sys.exit(2)
    # FrMessage is a custom class that creates the packet
    test = FrMessage('00:1b:ac:00:47:64', '00:14:d1:2b:1f:19', 0x12345678, 0x4321, 0x55555555, list(i/10 for i in range(320)))
    # test.get_Raw_Packet() returns a c_bytes array needed for winpcap to send the packet
    if (pcap_sendpacket(fp, test.get_Raw_Packet(), test.packet_size) != 0):
        print ("\nError sending the packet: %s\n" % pcap_geterr(fp))
        sys.exit(3)
    elapsed = time.time() - now
    if elapsed < 0.02 and elapsed > 0:
        time.sleep(0.02 - elapsed)
    pcap_close(fp)
Note: I would like to get an array of hex values representing each byte
> What encoding is this, and how do I decode the packet?
What you see is the representation of bytes object in Python. As you might have guessed \xab represents byte 0xab (171).
> which has weird non hex characters like !UUUU or =
Printable ASCII characters represent themselves i.e., instead of \x55 the representation contains just U.
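This can be checked directly. In the short sketch below, a bytes object is built from explicit numeric values (chosen to match characters seen in the question's output), and its representation shows the printable ones as plain characters while the underlying values stay the same.

```python
# Byte values chosen to match characters seen in the question's output.
pkt = bytes([0x00, 0x1B, 0xAC, 0x00, 0x47, 0x64, 0x21, 0x55, 0x3D])

print(repr(pkt))   # b'\x00\x1b\xac\x00Gd!U='
print(list(pkt))   # the same data as integers
print(pkt[7], hex(pkt[7]), chr(pkt[7]))  # 85 0x55 U
```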
What you have is a sequence of bytes. How to decode them depends on your application. For example, to decode a data packet that contains an Ethernet frame, you could use scapy (Python 2):
>>> b = '\x00\x02\x157\xa2D\x00\xae\xf3R\xaa\xd1\x08\x00E\x00\x00C\x00\x01\x00\x00#\x06x<\xc0\xa8\x05\x15B#\xfa\x97\x00\x14\x00P\x00\x00\x00\x00\x00\x00\x00\x00P\x02 \x00\xbb9\x00\x00GET /index.html HTTP/1.0 \n\n'
>>> c = Ether(b)
>>> c.hide_defaults()
>>> c
<Ether dst=00:02:15:37:a2:44 src=00:ae:f3:52:aa:d1 type=0x800 |
<IP ihl=5L len=67 frag=0 proto=tcp chksum=0x783c src=192.168.5.21 dst=66.35.250.151 |
<TCP dataofs=5L chksum=0xbb39 options=[] |
<Raw load='GET /index.html HTTP/1.0 \n\n' |>>>>
> I would like to get an array of hex values representing each byte
You could use binascii.hexlify():
>>> pkt = b'\x00\x1b\xac\x00Gd\x00'
>>> import binascii
>>> binascii.hexlify(pkt)
b'001bac00476400'
Or, if you want a list with string hex values:
>>> hexvalue = binascii.hexlify(pkt).decode()
>>> [hexvalue[i:i+2] for i in range(0, len(hexvalue), 2)]
['00', '1b', 'ac', '00', '47', '64', '00']
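In Python 3 the same list can also be built by iterating over the bytes object, since each element is already an integer. The sketch below does that and, as an assumption about the layout, unpacks the first 14 bytes of the question's capture as a standard Ethernet header (destination MAC, source MAC, 2-byte type field) with struct. The helper name mac is made up for illustration.

```python
import struct

# First 14 bytes of the capture shown in the question.
pkt = b'\x00\x1b\xac\x00Gd\x00\x14\xd1+\x1f\x19\x05\n'

# One two-digit hex string per byte.
hex_bytes = [f"{b:02x}" for b in pkt]
print(hex_bytes)

# Assumed layout: 6-byte destination, 6-byte source, big-endian 16-bit type.
dst, src, ethertype = struct.unpack("!6s6sH", pkt[:14])


def mac(raw):
    """Format raw address bytes as colon-separated hex."""
    return ":".join(f"{b:02x}" for b in raw)


print(mac(dst), mac(src), hex(ethertype))
```

The two addresses printed this way match the ones passed to FrMessage in the sending script.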
In Python, raw packets can be decoded with scapy classes such as IP(), TCP() and UDP().
import sys
import socket
from scapy.all import *
s = socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_TCP)
while 1:
    packet = s.recvfrom(2000)
    packet = packet[0]
    ip = IP(packet)
    ip.show()
This appears to be a rare gem: where to find documentation on the structure of Apple Mail's .emlx files (and their partial variants, and the meaning of the directory structures). The docs do not appear to exist on Apple's site, nor can I find any reasonable mention of it via Google.
The point of this is the creation of a bash/ruby/python/insert-script-language-here script to convert a mess of these files into something usable/pliable, like Maildir or Mbox. The ultimate goal is to migrate a snapshot of a user's /Library/Mail store into an existing Dovecot setup, which uses a form of Maildir.
Yes, I am aware of this program but it does not address the solution I am after. Converting 20 mailboxes by hand and manually inserting them into an existing installation will require more hours than just writing a script that digests the messages into something else and then automatically stores them where they should be. Never mind that there are potentially a half-dozen more users that will require this procedure. So it's worth my time to script it up.
Please vote to close the duplicate of this question while it is pending deletion, instead of voting for this question to close. For some reason, there are occasional posting glitches when using Chrome as a browser.
FOLLOW-UP: It appears that the format really is undocumented, and that most sources have reverse-engineered it. If I have time I will attempt to do so myself; and if I'm successful, I will post a second follow-up with the details of my findings.
Some more information documenting the emlx format.
The message is composed of:
a byte count for the message on the first line
a MIME dump of the message
an XML plist
The XML plist contains certain keys such as:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>date-sent</key>
<real>1362211252</real>
<key>flags</key>
<integer>8590195713</integer>
<key>original-mailbox</key>
<string>imap://****@127.0.0.1:143/mail/2013/03</string>
<key>remote-id</key>
<string>252</string>
<key>subject</key>
<string>Re: Foobar</string>
</dict>
The flags have been described by jwz and represent a 30-bit integer:
bit(s)  meaning                     mask
0       read                        1 << 0
1       deleted                     1 << 1
2       answered                    1 << 2
3       encrypted                   1 << 3
4       flagged                     1 << 4
5       recent                      1 << 5
6       draft                       1 << 6
7       initial (no longer used)    1 << 7
8       forwarded                   1 << 8
9       redirected                  1 << 9
10-15   attachment count            3F << 10 (6 bits)
16-22   priority level              7F << 16 (7 bits)
23      signed                      1 << 23
24      is junk                     1 << 24
25      is not junk                 1 << 25
26-28   font size delta             7 << 26 (3 bits)
29      junk mail level recorded    1 << 29
30      highlight text in toc       1 << 30
31      (unused)
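The table translates fairly directly into shifts and masks. The following is a minimal sketch, assuming the bit layout listed above is accurate. The function name decode_flags and the dictionary keys are made up for illustration, and the sample value is constructed by hand rather than taken from a real message.

```python
# Single-bit flags from the table: name -> bit position.
FLAG_BITS = {
    "read": 0, "deleted": 1, "answered": 2, "encrypted": 3, "flagged": 4,
    "recent": 5, "draft": 6, "initial": 7, "forwarded": 8, "redirected": 9,
    "signed": 23, "is_junk": 24, "is_not_junk": 25,
    "junk_level_recorded": 29, "highlight_in_toc": 30,
}


def decode_flags(value):
    """Split an emlx flags integer into named fields per the table."""
    fields = {name: bool((value >> bit) & 1) for name, bit in FLAG_BITS.items()}
    fields["attachment_count"] = (value >> 10) & 0x3F  # bits 10-15
    fields["priority_level"] = (value >> 16) & 0x7F    # bits 16-22
    fields["font_size_delta"] = (value >> 26) & 0x7    # bits 26-28
    return fields


# Hand-built value: read + answered + two attachments.
sample = (1 << 0) | (1 << 2) | (2 << 10)
decoded = decode_flags(sample)
print(decoded["read"], decoded["answered"], decoded["attachment_count"])
```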
I sent myself a simple message and removed some details, so you can see the full data structure of an emlx file.
875
X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on ******.*********.***
X-Spam-Level:
X-Spam-Status: No, score=-3.2 required=4.2 tests=BAYES_00,RP_MATCHES_RCVD,
    SPF_PASS,TVD_SPACE_RATIO autolearn=ham version=3.3.2
Received: from [127.0.0.1] (******.*********.*** [***.**.**.**])
    by ******.*********.*** (8.14.5/8.14.5) with ESMTP id r2TN8m4U099571
    for <****@*********.***>; Fri, 29 Mar 2013 19:08:48 -0400 (EDT)
    (envelope-from ****@*********.***)
Subject: very simple
From: Karl Dubost <****@*********.***>
Content-Type: text/plain; charset=us-ascii
Message-Id: <4E83618E-BB56-404F-8595-87352648ADC7@*********.***>
Date: Fri, 29 Mar 2013 19:09:06 -0400
To: Karl Dubost <****@*********.***>
Content-Transfer-Encoding: 7bit
Mime-Version: 1.0 (Apple Message framework v1283)
X-Mailer: Apple Mail (2.1283)

message Foo
--
Karl Dubost
http://www.la-grange.net/karl/
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>date-sent</key>
<real>1364598546</real>
<key>flags</key>
<integer>8590195713</integer>
<key>original-mailbox</key>
<string>imap://********@127.0.0.1:11143/mail/2013/03</string>
<key>remote-id</key>
<string>41147</string>
<key>subject</key>
<string>very simple</string>
</dict>
</plist>
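Based on the three-part layout described above (byte count, MIME message, XML plist), a reader can be sketched with only the Python standard library. This is a rough sketch under that assumption, not a complete implementation. The function name parse_emlx and the sample data are made up for illustration, and partial .emlx variants are not handled.

```python
import email
import plistlib


def parse_emlx(data):
    """Split emlx bytes into (email message, plist dict)."""
    first_line, _, rest = data.partition(b"\n")
    length = int(first_line.strip())            # byte count of the message
    message = email.message_from_bytes(rest[:length])
    metadata = plistlib.loads(rest[length:].strip())
    return message, metadata


# Hand-built sample following the layout shown above.
body = b"Subject: very simple\nFrom: someone@example.com\n\nmessage Foo\n"
plist = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<plist version="1.0">\n<dict>\n'
    b"<key>flags</key>\n<integer>8590195713</integer>\n"
    b"<key>subject</key>\n<string>very simple</string>\n"
    b"</dict>\n</plist>\n"
)
sample = str(len(body)).encode() + b"\n" + body + plist

msg, meta = parse_emlx(sample)
print(msg["Subject"], meta["flags"])
```

For a file on disk, the bytes would come from open(path, "rb").read().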
Here is an emlx2mbox converter in ruby: Mailbox Converter.
I don't think it was written from any documentation of the spec, but it has undergone multiple updates, so hopefully evolved to handle at least some of the quirks of the format. The source code is about 250 lines long, and it looks readable and well-commented.
As of 2020, Python has a lightweight emlx library.
pip install emlx
and then
>>> import emlx
>>> m = emlx.read("12345.emlx")
>>> m.headers
{'Subject': 'Re: Emlx library ✉️',
'From': 'Michael <michael@example.com>',
'Date': 'Thu, 30 Jan 2020 20:25:43 +0100',
'Content-Type': 'text/plain; charset=utf-8',
...}
>>> m.headers['Subject']
'Re: Emlx library ✉️'
>>> m.plist
{'color': '000000',
'conversation-id': 12345,
'date-last-viewed': 1580423184,
'flags': {...},
...}
>>> m.flags
{'read': True, 'answered': True, 'attachment_count': 2}
I am using mailcore2 to parse .eml messages. To make this work with .emlx, I just had to remove the first line (containing a number). The message itself is equipped with the length of the message so the XML block at the end does not need to be removed.
Here is how I did it in Objective-C/Cocoa (MCOMessageParser comes from the mailcore2 framework):
-(Documents *)ParseEmlMessageforPath: (NSString*)fullpath filename:(NSString*)filename{
    NSLog(@"fullpath = %@", fullpath);
    NSError * error;
    error = nil;
    NSData *fileContents = [NSData dataWithContentsOfFile:fullpath options:NSDataReadingMappedIfSafe error:&error];
    if (error) {
        [[NSApplication sharedApplication] presentError:error];
    }
    MCOMessageParser * parser;
    if (fileContents) {
        if ([[fullpath pathExtension] isEqualToString:@"emlx"]) {
            NSData * linefeed = [(NSString*)@"\n" dataUsingEncoding:NSUTF8StringEncoding ];
            NSInteger filelength = [fileContents length];
            NSRange xx = NSMakeRange(0, 20);
            NSRange pos = [fileContents rangeOfData:linefeed options:0 range:xx] ;
            if (pos.location != NSNotFound) {
                NSData *subcontent = [fileContents subdataWithRange:(NSRange){pos.location+1, filelength-(pos.location)-1}];
                parser = [MCOMessageParser messageParserWithData:subcontent];
            } else {
                return nil;
            }
        } else {
            parser = [MCOMessageParser messageParserWithData:fileContents];
        }
And there you go....
The original emlx2mbox Ruby script was written a long time ago. I have updated it to run in a modern Ruby environment. Please check it out at https://github.com/imdatsolak/elmx2mbox