What is this Perl string encoded in?

I'm using Mail::IMAPClient to retrieve mail headers from an IMAP server. It works great. But when the header contains any character other than [a-zA-Z0-9], I'm served strings that look like this:
Subject : Un message en =?UTF-8?B?ZnJhbsOnYWlzIMOgIGxhIGNvbg==?= (original string : "Un message en français à la con")
Body :
=C3=A9aeio=C3=B9=C3=A8=C3=A8 (original string : éaeioùèè)
What is this strange format? Is it the famous "Perl internal string" format?
What is the safest way of handling human-language text coming from IMAP servers?

The body encoding is Quoted-Printable; the header (subject) encoding is MIME "encoded-word" encoding ("B" type for base64). The best way to deal with both of them is to pass the email into a module that's capable of dealing with MIME, such as Email::MIME or the older and buggier MIME::Lite.
For example:
use Email::MIME;
# $message was retrieved from IMAP
my $mime = Email::MIME->new($message);
my $subject = $mime->header('Subject'); # automatically decoded
my $body = $mime->body_str; # also automatically decoded
However if you need to deal with them outside of the context of an entire message, there are also modules like Encode::MIME::Header and MIME::QuotedPrint.
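To see what the two encodings actually do, here is a minimal sketch in Python (not Perl), using only the standard library. The helper name decode_subject is made up for illustration, and the sample strings are the ones from the question. It is a demonstration of the decoding steps, not a replacement for a full MIME parser.

```python
import quopri
from email.header import decode_header


def decode_subject(raw_subject):
    """Decode RFC 2047 encoded-words in a header value into one str."""
    parts = []
    for chunk, charset in decode_header(raw_subject):
        if isinstance(chunk, bytes):
            # Unlabelled chunks come back with charset None; assume ASCII.
            parts.append(chunk.decode(charset or "ascii"))
        else:
            # A header without any encoded-word is returned as str.
            parts.append(chunk)
    return "".join(parts)


# Header: MIME encoded-word, "B" (base64) variant, UTF-8 charset.
subject = decode_subject("Un message en =?UTF-8?B?ZnJhbsOnYWlzIMOgIGxhIGNvbg==?=")

# Body: quoted-printable bytes, which are then interpreted as UTF-8.
body = quopri.decodestring(b"=C3=A9aeio=C3=B9=C3=A8=C3=A8").decode("utf-8")

print(subject)
print(body)
```

Note that the charset of the body is assumed to be UTF-8 here; in a real message it comes from the Content-Type header of the part.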

The body is quoted-printable encoded. It is a standard encoding used in email. It has nothing to do with Perl's internal string format.

How to put some text into procmail forwarded e-mail?

For a couple of days, I've been trying to write a procmail script.
I want to forward messages and inject some text into the message contents.
What I want to accomplish:
someone sends me an e-mail with the word "weather" in the subject
the e-mail is forwarded to the address "mymail@somedomain.com"
every forwarded e-mail gets some added text in its contents
But so far, no success.
In .procmail.log, there's a message "procmail: Missing action"
SHELL=/bin/bash
VERBOSE=off
LOGFILE=/home/test/.procmail.log
LOGDATE_=`/bin/date +%Y-%m-%d`
:0
* ^Subject:.*weather
:0 bfw
| echo "This is injected text" ; echo "" ; cat
:0 c
! mymail@somedomain.com
When I looked at the e-mail source, I saw that the text is injected.
But it is in the wrong place ...
Take a look:
MIME-Version: 1.0
Content-Type: multipart/mixed;
boundary="------------148F3F0AD3D65DD3F3498ACA"
Content-Language: pl
Status:
X-EsetId: 37303A29AA1D9F60667466
This is injected text
This is a multi-part message in MIME format.
--------------148F3F0AD3D65DD3F3498ACA
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
CONTENT CONTENT CONTENT
*********************************************************
The injected text should be placed where the content is. Now it is above it ...
You don't explain your code, but it looks like you are attempting to use multiple actions under a single condition. Use braces for that.
:0
* ^Subject:.*weather
{
:0 bfw
| echo "This is injected text" ; echo "" ; cat
:0 c
! mymail@somedomain.com
}
Just to summarize, every recipe must have a header line (the :0 and possible flags) and an action. The conditions are optional, and there can be more than one. A block of further recipes is one form of action so that satisfies these requirements (the other action types are saving to a folder, piping to a command, or forwarding to an email address).
To inject text at the top of the first MIME body part of a multipart message, you need to do some MIME parsing. Procmail unfortunately has no explicit support for MIME, but if you know that the incoming message will always have a particular structure, you might get away with something fairly simple.
:0
* ^Subject:.*weather
{
:0fbw
* ^Mime-version: 1\.0
* ^Content-type: multipart/
| awk '/^Content-type: text\/plain;/&&!s {n=s=1} \
n&&/^$/{n=0; p=1} \
1; \
p{ print "This is injected text.\n"; p=0 }'
:0 c
! mymail@somedomain.com
}
The body (which contains all the MIME body parts, with their headers and everything) is passed to a simple Awk script, which finds the first empty line after (what we optimistically assume is) the first text/plain MIME body part header, and injects the text there. (Awk is case-sensitive, so the regex text might need to be adapted or generalized, and I have assumed the whitespace in the input message is completely regular. For a production system, these simplifying assumptions are unrealistic.)
If you need full MIME support (for example, the input message may or may not be multipart, or may contain nested multiparts), my recommendation would be to write the injection code in some modern scripting language with proper MIME support libraries; Python would be my pick, though it is still (even after the email library update in 3.6) slightly cumbersome and clumsy.

%40 converted into @ on GET

I am passing my variables to a URL using the GET method, as follows:
http://www.mysite.com/demo.php?sid=123121&email_id=stevemartin144%40gmail.com
When I print $_GET in demo.php, it displays the parameters as follows:
email_id stevemartin144@gmail.com
sid 123121
Instead of the above output, I want the parameters exactly as I passed them:
email_id stevemartin144%40gmail.com
sid 123121
I don't want %40 to be converted into @.
Please suggest a solution.
Thanks in advance.
"%40" in a URL means "@". If you want a "%" to mean "%", you need to URL encode it to "%25".
URL encoding is just a transport encoding. If you feed in "@", its transport encoded version is "%40", but the recipient will get "@" again. If you want to feed in "%40" and have the recipient receive "%40", you need to URL encode it to "%2540".
If the recipient correctly receives "@" but you want to use the URL encoded version for whatever reason, you can also have the recipient urlencode it again.
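The round trip described above can be illustrated with a short sketch. It uses Python's urllib.parse rather than PHP, purely to show the mechanism; the variable names are made up for the example, and parse_qs stands in for what the receiving side does when it fills $_GET.

```python
from urllib.parse import parse_qs, quote

address = "stevemartin144@gmail.com"

# Encoding once is what a URL needs; the receiver decodes it back to "@".
once = quote(address, safe="")
# Encoding twice turns the "%" itself into "%25", so one decode leaves "%40".
twice = quote(once, safe="")

received_single = parse_qs("sid=123121&email_id=" + once)
received_double = parse_qs("sid=123121&email_id=" + twice)

print(once, received_single["email_id"][0])
print(twice, received_double["email_id"][0])
```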
Notes:
Online converter:
Replace each special character with its percent-encoded hexadecimal value.
For a list of character codes, refer to https://unicode-table.com or http://unicodelookup.com
Local converter using Python:
For reference, the password "p@s#w:E" is percent-encoded as follows:
@ = %40
$ = %24
# = %23
: = %3A
p@s#w:E = p%40s%23w%3AE
Input:
[root@localhost ~]# python3 -c "import sys, urllib.parse as enc; print(enc.quote_plus(sys.argv[1]))" "p@s#w:E"
Output:
p%40s%23w%3AE

converting base64 encoded mail subject to text

I set out to write a simple procmail recipe that would forward the mail to me if it found the text "Unprovisioned" in the subject.
:0:
* ^Subject:.*Unprovisioned.*
! me@test.com
Unfortunately the subject field in the mail message coming from the mail server was in MIME encoded-word syntax.
The form is: "=?charset?encoding?encoded text?=".
Subject: =?UTF-8?B?QURWSVNPUlk6IEJNRFMgMTg0NSwgTkVXIFlPUksgLSBVbnByb3Zpc2lvbmVkIENvbm4gQQ==?=
=?UTF-8?B?bGVydA==?=
The above subject uses the UTF-8 charset and base64 encoding, with the text folded onto two lines. So I was wondering if there are any mechanisms/scripts/utilities to parse this and convert it to a plain string so that I could apply my procmail filter. Of course I can write a Perl script to parse this and perform the required validations, but I am looking to avoid that if possible.
Use Encode::MIME::Header, which ships with Perl and is accessed directly through Encode:
use Encode qw(encode decode);
my $header_text = decode('MIME-Header', $header);
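To show what that decoding does to the folded subject from the question, here is a small sketch of the same step in Python's standard library (an illustration in another language, not the Perl module above). The two encoded-words are decoded separately and joined without the folding whitespace, after which an ordinary pattern match works.

```python
import re
from email.header import decode_header, make_header

# The folded Subject value from the question: two encoded-words, the second
# on a continuation line that starts with whitespace.
raw_subject = (
    "=?UTF-8?B?QURWSVNPUlk6IEJNRFMgMTg0NSwgTkVXIFlPUksgLSBVbnByb3Zpc2lvbmVkIENvbm4gQQ==?=\n"
    " =?UTF-8?B?bGVydA==?="
)

decoded_subject = str(make_header(decode_header(raw_subject)))
matches = re.search(r"Unprovisioned", decoded_subject) is not None

print(decoded_subject)
print(matches)
```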

Why does Perl's LWP give me a different encoding than the original website?

Let's say I have this code:
use strict;
use LWP qw ( get );
my $content = get ( "http://www.msn.co.il" );
print STDERR $content;
The error log shows something like "\xd7\x9c\xd7\x94\xd7\x93\xd7\xa4\xd7\xa1\xd7\x94",
which I'm guessing is UTF-16?
The website declares its encoding with
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1255">
so why do these characters appear instead of the windows-1255 characters?
Another weird thing is that I have two servers:
the first server returns CP1255 characters, which I can simply convert to UTF-8,
and the current server gives me these characters and I can't do anything with them ...
Is there any configuration file in Apache/Perl/a module that is messing up the encoding,
forcing something ...?
The result on my website on the second server is that the Perl file and the headers are all UTF-8, so when I write text that isn't in English characters, the content from the example above displays fine (even though it's weird UTF characters), but my own static text looks like "×ס'××ר××:"
One more thing that I tested is ...
Through perl:
my $content = `curl "http://www.anglo-saxon.co.il"`;
I get utf8 encoding.
Through Bash:
curl "http://www.anglo-saxon.co.il"
and here I get CP1255 (Windows-1255) encoding ...
Also,
when I run the script in bash, it gives CP1255, and when I run it through the web, it's UTF-8 again ...
I fixed the problem by converting the content from UTF-8 to what it is supposed to be, and then back to UTF-8:
use Text::Iconv;
my $converter = Text::Iconv->new("utf8", "CP1255");
$content=$converter->convert($content);
$converter = Text::Iconv->new("CP1255", "utf8");
$content=$converter->convert($content);
All of this manual encoding and decoding is unnecessary. The HTML is lying to you when it says that the page is encoded in windows-1255; the server says it's serving UTF-8, and it is. Blame Microsoft HTML-generation tools.
Anyway, since the server does return the correct encoding, this works:
use LWP::UserAgent;
my $res = LWP::UserAgent->new->get("http://www.msn.co.il/");
my $content = $res->decoded_content;
$content is now a perl character string, ready to do whatever you need. If you want to convert it to some other encoding, then calling Encode::encode on it is appropriate; do not use Encode::decode as it's already been decoded once.
http://www.msn.co.il is in UTF-8, and indicates that properly. The string "\xd7\x9c\xd7\x94\xd7\x93\xd7\xa4\xd7\xa1\xd7\x94" is also proper UTF-8 (להדפסה). I don't see the problem.
I think your second problem is due to you mixing different encodings (UTF-8 and Windows-1252). You might want to encode/decode your strings properly.
First, note that you should import get from LWP::Simple. Second, everything works fine with:
#!/usr/bin/perl
use strict; use warnings;
use LWP::Simple qw ( getstore );
getstore 'http://www.msn.co.il', 'test.html';
which indicates to me that the problem is the encoding of the filehandle to which you are sending the output.
The string with the hex values that you gave appears to be a UTF-8 encoding. You are getting this because Perl ‘likes to’ use UTF-8 when it deals with strings. The LWP::Simple->get() method automatically decodes the content from the server which includes undoing any Content-Encoding as well as converting to UTF-8.
You could dig into the internals and get a version that does not change the character encoding (see HTTP::Message's decoded_content, which is used by HTTP::Response's decoded_content, which you can get from LWP::UserAgent's get). But it may be easier to re-encode the data in your desired encoding with something like
use Encode;
...;
$cp1255_bytes = encode('CP1255', decode('UTF_8', $utf8_bytes));
The mixed readable/garbage characters you see are due to mixing multiple, incompatible encodings in the same stream. Probably the stream is labeled as UTF-8 but you are putting CP1255 encoded characters into it. You either need to label the stream as CP1255 and put only CP1255-encoded data into it, or label it as UTF-8 and put only UTF-8-encoded data into it. Remind yourself that bytes are not characters and convert between them appropriately.
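The bytes-versus-characters distinction can be checked directly on the byte string from the question. The following is a small sketch in Python (chosen only for illustration; the variable names are made up): the same six Hebrew characters are twelve bytes in UTF-8 and six bytes in CP1255, and reading the UTF-8 bytes with a single-byte Western codec produces the "×..." garbage.

```python
# The bytes from the error log, exactly as quoted in the question.
utf8_bytes = b"\xd7\x9c\xd7\x94\xd7\x93\xd7\xa4\xd7\xa1\xd7\x94"

# Bytes -> characters: decode with the encoding the bytes are really in.
text = utf8_bytes.decode("utf-8")

# Characters -> bytes: encode into whatever the output stream is labeled as.
cp1255_bytes = text.encode("cp1255")

# Decoding with the wrong codec does not fail, it just yields mojibake.
mojibake = utf8_bytes.decode("latin-1")

print(len(utf8_bytes), len(text), len(cp1255_bytes))
print(text)
```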

Decode a UTF-8 email header

I have an email subject of the form:
=?utf-8?B?T3.....?=
The body of the email is utf-8 base64 encoded - and has decoded fine.
I am currently using Perl's Email::MIME module to decode the email.
What is the meaning of the =?utf-8 delimiter and how do I extract information from this string?
The encoded-word tokens (as per RFC 2047) can occur in values of some headers. They are parsed as follows:
=?<charset>?<encoding>?<data>?=
Charset is UTF-8 in this case, the encoding is B which means base64 (the other option is Q which means Quoted Printable).
To read it, first decode the base64, then treat it as UTF-8 characters.
Also read the various Internet Mail RFCs for more detail, mainly RFC 2047.
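To make the parsing steps concrete, here is a deliberately simplified sketch in Python that takes an encoded-word apart by hand. The names ENCODED_WORD and decode_encoded_words are made up for this example, and it ignores parts of RFC 2047 (length limits, where encoded-words may legally appear, error handling), so treat it as an illustration of the format rather than a conforming decoder.

```python
import base64
import quopri
import re

# =?<charset>?<encoding>?<data>?=
ENCODED_WORD = re.compile(r"=\?([^?]+)\?([BbQq])\?([^?]*)\?=")


def _decode_word(match):
    charset, encoding, data = match.groups()
    if encoding.upper() == "B":
        raw = base64.b64decode(data)
    else:
        # "Q" is like quoted-printable, except that "_" stands for a space.
        raw = quopri.decodestring(data.encode("ascii"), header=True)
    # Only now are the bytes interpreted in the declared charset.
    return raw.decode(charset)


def decode_encoded_words(value):
    # Whitespace between two adjacent encoded-words is not part of the text.
    value = re.sub(r"\?=\s+=\?", "?==?", value)
    return ENCODED_WORD.sub(_decode_word, value)


print(decode_encoded_words("=?ISO-8859-1?Q?Keld_J=F8rn_Simonsen?= <keld@dkuug.dk>"))
```

In practice a library decoder is the safer choice; the sketch only shows why the order is "undo B or Q first, then apply the charset".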
Since you are using Perl, Encode::MIME::Header could be of use:
SYNOPSIS
use Encode qw/encode decode/;
$utf8 = decode('MIME-Header', $header);
$header = encode('MIME-Header', $utf8);
ABSTRACT
This module implements RFC 2047 Mime Header Encoding. There are 3 variant encoding names; MIME-Header, MIME-B and MIME-Q. The difference is described below
              decode()          encode()
MIME-Header   Both B and Q      =?UTF-8?B?....?=
MIME-B        B only; Q croaks  =?UTF-8?B?....?=
MIME-Q        Q only; B croaks  =?UTF-8?Q?....?=
I think that the Encode module handles that with the MIME-Header encoding, so try this:
use Encode qw(decode);
my $decoded = decode("MIME-Header", $encoded);
Check out RFC2047. The 'B' means that the part between the last two '?'s is base64-encoded. The 'utf-8' naturally means that the decoded data should be interpreted as UTF-8.
MIME::Words from MIME-tools works well for this too. I ran into some issues with Encode and found that MIME::Words succeeded on some strings where Encode did not.
use MIME::Words qw(:all);
$decoded = decode_mimewords(
'To: =?ISO-8859-1?Q?Keld_J=F8rn_Simonsen?= <keld@dkuug.dk>',
);
This is a standard extension for charset labeling of headers, specified in RFC2047.