Why is mail command adding extra character (">") to the email? - email

Why is mail command adding extra character (">") to the email?
Here is the text entry:
From the ...
Here is the text in Email:
>From the ...

Because From  (that is From with F in upper-case, rom in lower, followed by a space character) in the beginning of the line marks the beginning of a new message in the mbox format. The mbox format is just really one text file with messages appended to each other.
Quoting from mbox (Wikipedia):
mboxo and mboxrd locate the message start by scanning for From lines that are found before the email message headers. If a From string occurs at the beginning of a line in either the header or the body of a message (a mail standard violation for the former, but not for the latter), the email message must be modified before the message is stored in an mbox mailbox file or the line will be taken as a message boundary. To avoid misinterpreting a From string at the beginning of the line in the email body as the beginning of a new email, some systems "From-munge" the message, typically by prepending a greater-than sign:
>From my point of view...

That same practice has been introduced in another Linux tool: Git (2.10, June 2016), which teached format-patch and mailsplit (hence "am") how a line that happens to begin with "From " in the e-mail message is quoted with ">", so that these lines can be restored to their original shape.
See commit d9925d1, commit c88098d, commit 9f23e04 (05 Jun 2016) by Eric Wong (ele828).
(Merged by Junio C Hamano -- gitster -- in commit e25a4de, 06 Jul 2016)
pretty: support "mboxrd" output format
Signed-off-by: Eric Wong
This output format prevents format-patch output from breaking readers if somebody copy+pasted an mbox into a commit message.
Unlike the traditional "mboxo" format, "mboxrd" is designed to be fully-reversible.
"mboxrd" also gracefully degrades to showing extra ">" in existing "mboxo" readers.
This degradation is preferable to breaking message splitting completely, a problem I've seen in "mboxcl" due to having multiple, non-existent, or inaccurate Content-Length headers.
"mboxcl2" is a non-starter since it's inherits the problems of "mboxcl" while being completely incompatible with existing tooling based around mailsplit.
Documentation in Git 2.27 (Q2 2020):
See commit 88eaf36 (16 Apr 2020) by Emma Brooks (``).
(Merged by Junio C Hamano -- gitster -- in commit e08387d, 28 Apr 2020)
Documentation: explain "mboxrd" pretty format
Signed-off-by: Emma Brooks
Acked-by: Eric Wong
The "mboxrd" pretty format was introduced in 9f23e04061 (pretty: support "mboxrd" output format, 2016-06-05, Git 2.16.0) but wasn't mentioned in the documentation.
mboxrd
Like 'email', but lines in the commit message starting with "From " (preceded by zero or more ">") are quoted with ">" so they aren't confused as starting a new commit.

Related

Converting email metadata RFC 5322 dates to Org mode dates in doom emacs

I have about 200 old emails, as *.eml files, that I want to concatenate into one *.org file so that I can use the information in Org mode. Each file has the string "Date: " followed by a timestamp in the RFC 5322 date format, i.e.
Date: Tue, 23 Apr 2019 13:31:18 -0400
I know the UNIX date command can convert the date part of that string to RFC 3339 date format, i.e. the command:
date --rfc-3339='ns' --date='Tue, 23 Apr 2019 13:31:18 -0400'
would give the result:
2019-04-23 13:31:18.000000000-04:00
I guess I could do all the conversion with awk in one go, but my awk is rusty, and I've been having trouble getting it right.
I'd really like to convert all of these dates to Org mode dates with one command, either using a vi command or a doom emacs command.
Any suggestions?
You could build up a regular expression for use with M-x query-replace-regexp which invokes a shell command to use the date command you mention above. The trick is the \, replacement option which allows you to execute any Emacs LISP code, as described in the Regexp Replacement section of the Emacs info manual.

postfix header_checks.pcre wrongly blocking IPHONE emails

I have a postfix/dovecot mail server which has been working fine for a year or so but today one user came to me with his iPhone and said he couldn't send emails.
It turns out the emails were being rejected by my header_checks.pcre which I set up as per the example in http://www.postfix.org/header_checks.5.html
The error I got was something like:
Apr 30 09:48:28 mail06 postfix/cleanup[28849]: 53893A00CD: reject:
header Content-Type:
image/png;??name=email_logo.png;??x-apple-part-url="part22.05080008.04000601#mydomain.com"
from unknown[112.134.156.178]; from=
to= proto=ESMTP helo=<[192.168.1.12]>: 5.7.1
Attachment name
"email_logo.png;??x-apple-part-url="part22.05080008.04000601#mydomain.com"
may not end with ".com"
So it seems that the iPhone mail app was appending an "x-apple-part-url" suffix to the attachment name and the PCRE was mistakenly blocking this as a .com instead of allowing through a .png.
Does anyone know how I can safely modify the PCRE in http://www.postfix.org/header_checks.5.html to avoid this happening?
So far as I know ".com" is still a viable extension for Windows malware. The problem is that the second .* in the example PCRE in the Postfix documentation is spanning two parameters as if the .com ended the name or filename parameter.
According to RFC 2045, value := token / quoted-string. This means you need to cater for both the quoted and unquoted cases by providing appropriate character classes. You could split into two rules or, to save multiple lists of extensions, do something like:
/etc/postfix/header_checks.pcre:
/^Content-(Disposition|Type).*name\s*=\s*
("(?:[^"]|\\")*|[^();:,\/<>\#\"?=<>\[\]\ ]*)
((?:\.|=2E)(
ade|adp|asp|bas|bat|chm|cmd|com|cpl|crt|dll|exe|
hlp|ht[at]|
inf|ins|isp|jse?|lnk|md[betw]|ms[cipt]|nws|
\{[[:xdigit:]]{8}(?:-[[:xdigit:]]{4}){3}-[[:xdigit:]]{12}\}|
ops|pcd|pif|prf|reg|sc[frt]|sh[bsm]|swf|
vb[esx]?|vxd|ws[cfh])(\?=)?"?)\s*(;|$)/x
REJECT Attachment name $2$3 may not end with ".$4"
The new second line of the rule distinguishes between the quoted and unquoted cases and any closing quotation mark is absorbed into $3.
BTW I'd probably stick .mso, .xl, .ocx (obscure MS extensions) and .jar in there too. Obviously this check is useful against malware floods but doesn't substitute for an up-to-date antivirus or more detailed spam analysis.

How to understand these email header fields? ("From" field prefixed with a "greater" sign)

I have a raw email with headers that look like this:
From xxxx#xxxx Fri Apr 25 22:46:08 2003
>From xxxx#mxxxx Wed Feb 19 20:06:07 2003
Envelope-to: yyyy#xxxx
...
Date: Wed, 19 Feb 2003 22:05:59 +0500
From: "Actual Author" <xxxx#xxxx>
I don't know how to interpret the first two lines, and the initial reading of RFC2822 has left me without a clue. They don't look like normal headers and manage to confuse Python 2.7 email parser (fine if I remove the > sign at the start of the second line). I have the same email body in Apple mail's cache, and it seems fine, so the input is clearly correct.
What's that header format? (From <email> <date>\r\n)
Why is the second one prefixed with > (greater sign)?
What you have is a mail in mbox format, where the first "From" line marks the start of the message. The second line (>From) seems to be caused by the escaping strategy of mbox known as From quoting - has this message been double-encoded as mbox?

Quoted Printable Email showing equal signs in certain email clients

I'm generating emails. They Show up fine for me in gmail and Outlook 2010. However, my client sees the = sign that gets added to the end of lines by the quoted-printable formatting. It also eats the character on the next line, but then displaying the equal sign.
Example:
line that en=
ds like this
shows up like
line that en=s like this
(Note: The EOL character in my emails is just LF. No CR.)
I'm confirming what outlook version my client is using, but I think it's 2007. The email headers from her appear to come through Exchange 6.5.
My emails are created in php using the HtmlMimeMail5 library. They are multipart emails, with the applicable section sent with:
Content-Type: text/html; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
It appears I could just make sure nothing in my email reaches the line wrap at 76 characters, but that seems like the wrong way to solve the problem. Should the EOL character be different? (In emails from the client, the EOL character is simply a LF) Any other ideas?
I do not know what the PHP library does, but in the end MIME mail must contain CR LF line endings. Obviously the client notices that = is not followed by a proper CR LF sequence, so it assumes that it is not a soft line break, but a character encoded in two hex digits, therefore it reads the next two bytes. It should notice that the next two bytes are not valid hex digits, so its behavior is wrong too, but we have to admit that at that point it does not have a chance to display something useful. They opted for the garbage in, garbage out approach.

Historical reason behind different line ending at different platforms

Why did DOS/Windows and Mac decide to use \r\n and \r for line ending instead of \n? Was it just a result of trying to be "different" from Unix?
And now that Mac OS X is Unix (-like), did Apple switch to \n from \r?
DOS inherited CR-LF line endings (what you're calling \r\n, just making the ascii characters explicit) from CP/M. CP/M inherited it from the various DEC operating systems which influenced CP/M designer Gary Kildall.
CR-LF was used so that the teletype machines would return the print head to the left margin (CR = carriage return), and then move to the next line (LF = line feed).
The Unix guys handled that in the device driver, and when necessary translated LF to CR-LF on output to devices that needed it.
And as you guessed, Mac OS X now uses LF.
Really adding to #Mark Harrison...
The people who tell you that Unix is "just outputting the text the programmer specified" whereas DOS is broken are plain wrong. There are also claims that it's stupid for DOS to flag EOF when it sees an EOF character, raising the question of what exactly that EOF character is for.
There is no one true convention for text file line endings - only platform-specific conventions. After all, even CR-LF, CR and LF aren't the only line end conventions to ever be used, and ASCII was never even the one and only character set. The problem is the C standard library and runtime, which didn't abstract away this platform-dependent detail. Other third generation languages (such as Pascal and even Basic) managed it, at least to some degree. Because of this, when C compilers were written for other platforms, runtime library hacks were needed to achieve compatibility with existing source code and books.
In fact, it's Unix and Multics that originally needed string translation for console I/O, since users usually sat at an ASCII terminal that required CR LF line ends. This translation was done in a device driver, though - the goal was to abstract away the device-specifics, assuming that it was better to adopt one convention and stick to it for stored text files.
The C text I/O hack is similar in principle to what CygWin does now, hacking Linux runtimes to work as well as can be expected on Windows. There's a real history of hacking things about to turn them into Unix-alikes - but then there's also Wine, turning Linux into Windows. Oddly enough, you can read some misplaced line-end criticism of Windows in the CygWin FAQ (Internet Archive link added 2013 - the page no longer exists). Maybe it's just their sense of humour, since they are basically doing what they are criticising, but on a much grander scale ;-)
The C++ standard library (whatever platform its implemented on) avoids this issue using iostreams, which abstract away line ends. For output, that suits me fine. For input, I need more control, so I either interpret character-by-character or else use a scanner generator.
[EDIT It turns out that the struck-out claim above isn't true, and never was. The std::endl literally translates to a \n and a flush. The \n is exactly the same \n you get in C - it tends to get called "new line", but it's actually an ASCII line feed character, which then gets translated by the runtime if necessary. Funny how false assumptions can get so ingrained you never question them - basically, C++ had no choice to do what C did (other than adding more layers on top) for compatibility reasons, and that should always have been obvious.]
The biggest slice of blame from my POV is with C, but C isn't the only project to fail to anticipate its move to other platforms. Blaming Bill Gates is just nuts - all he did was buy and polish a variant of the then popular CP/M. Really, it's just history - the same reason why we don't know what character codes 128 to 255 refer to in most text files. Given the ease of coping with all three line end conventions, it's odd that some developers still insist on that "my platforms convention is the one true way, and I shall force it on you like it or not" attitude.
Also - will the Unicode line separator codepoint U+2028 replace all these conventions in future text files? ;-)
It's interesting to note the CRLF is pretty much the internet standard. That is, pretty much every standard internet protocol that is line oriented uses CRLF. SMTP, POP, IMAP, NNTP, etc.. The body of email consists of lines terminated by CRLF.
According to Wikipedia: in the beginning, the program had to put in extra CR characters before the LF to slow the program down so the printer had time to keep up - and CP/M and then later Windows used this method. But Multics's printer driver put in extra characters automatically so the program didn't have to - and Unix developer from that. But none of that explains why the early Mac didn't do that (they do now that they are based on Unix).
https://en.wikipedia.org/wiki/Newline#History:
The sequence CR+LF was commonly used on many early computer systems that had adopted Teletype machines—typically a Teletype Model 33 ASR—as a console device, because this sequence was required to position those printers at the start of a new line. The separation of newline into two functions concealed the fact that the print head could not return from the far right to the beginning of the next line in time to print the next character. Any character printed after a CR would often print as a smudge in the middle of the page while the print head was still moving the carriage back to the first position. "The solution was to make the newline two characters: CR to move the carriage to column one, and LF to move the paper up."[1] In fact, it was often necessary to send extra characters—extraneous CRs or NULs—which are ignored but give the print head time to move to the left margin. Many early video displays also required multiple character times to scroll the display.
On such systems, applications had to talk directly to the Teletype machine and follow its conventions since the concept of device drivers hiding such hardware details from the application was not yet well developed. Therefore, text was routinely composed to satisfy the needs of Teletype machines. Most minicomputer systems from DEC used this convention. CP/M also used it in order to print on the same terminals that minicomputers used. From there MS-DOS (1981) adopted CP/M's CR+LF in order to be compatible, and this convention was inherited by Microsoft's later Windows operating system.
The Multics operating system began development in 1964 and used LF alone as its newline. Multics used a device driver to translate this character to whatever sequence a printer needed (including extra padding characters), and the single byte was more convenient for programming. What seems like a more obvious[citation needed] choice—CR—was not used, as CR provided the useful function of overprinting one line with another to create boldface and strikethrough effects. Perhaps more importantly, the use of LF alone as a line terminator had already been incorporated into drafts of the eventual ISO/IEC 646 standard. Unix followed the Multics practice, and later Unix-like systems followed Unix. This created conflicts between Windows and Unix-like OSes, whereby files composed on one OS cannot be properly formatted or interpreted by another OS (for example a UNIX shell script written in a Windows text editor like Notepad).