Charsets and "preferred MIME name" - forms

The HTTP Accept-Charset header contains atoms of acceptable character sets, and the charset= field in a MIME Content-Type header contains an atom naming the character set of the following data.
My question is the following: must these atoms match the preferred MIME name or the charset name, or can they match any alias of a charset?
("Alias" and "preferred MIME name" here are as used in the IANA registry at http://www.iana.org/assignments/character-sets.)
I'm planning to use iconv to convert to the platform-native wide UTF encoding, and I don't want the table entries to take the form (iconv_alias, { list-of-aliases }) per charset; rather, a simple (alias, iconv_alias) 2-tuple per alias.
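In other words, a flat table like this rough Python sketch (the entries shown are just a few illustrative samples from the registry, and the iconv names are assumptions about what the local iconv accepts):

# Sketch only: a flat alias -> iconv_alias mapping, one entry per IANA
# alias, instead of one (iconv_alias, { list-of-aliases }) entry per charset.
CHARSET_TABLE = {
    "us-ascii":       "ASCII",        # preferred MIME name
    "ansi_x3.4-1968": "ASCII",        # an alias of US-ASCII
    "iso-8859-1":     "ISO-8859-1",
    "latin1":         "ISO-8859-1",   # an alias
    "utf-8":          "UTF-8",
}

def iconv_name(atom: str) -> str:
    """Resolve a charset atom from Accept-Charset or a Content-Type charset=."""
    return CHARSET_TABLE[atom.strip().lower()]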


How to extract email body and attachment

I am trying to extract a message from a multi-part email body or from an attachment, so I used :0B to try each option, like the following:
msgID=""
#extract message in the attachment if it's plain text
:0B
* ^Content-Disposition: *attachment.*(($)[a-z0-9].*)*($)($)\/[a-z0-9+]+=*
{msgID="$MATCH"}
#extract message in the body if it's there
:0EB
* ^()\/[a-z]+[0-9]+[^\+]
{msgID = "$MATCH"}
But msgID got the same message from the body, which was inline image code. What's wrong with it, and does anyone know a better condition to filter it?
I also need to detect whether the sub-header in the body is text and base64-encoded, and then decode it. How do I express that with a regex:
:0B
* ^Content-Type:text/html;
* ^Content-Location:text_0.txt
* ^Content-Transfer-Encoding:base64
* ^Content-Disposition: *attachment.*(($)[a-z0-9].*)*($)($)\/[a-z0-9+]+=*
{ msgID=`printf '%s' "$MATCH" | base64 -d` }
It always complains no match: ^Content-Type:text/html;
I'm guessing you are trying to say that there are two types of incoming messages. One looks something like this:
From: Sender <there@example.net>
To: You <AmyX@example.com>
Subject: plain text
ohmigod0
And the other is a complex MIME multipart with the same contents:
From: Sender <there@example.net>
To: Amy X <AmyX@example.com>
Subject: MIME complexity
MIME-Version: 1.0
Content-Type: multipart/related; boundary=12345
--12345
Content-type: text/plain; charset="us-ascii"
Content-transfer-encoding: base64
Content-disposition: attachment; filename="text_0.txt"
Content-location: text_0.txt
b2htaWdvZDA=
--12345--
If this is correct, you would want to create a recipe to handle the more complex case first, because it has more features -- if your regex hits, it's unlikely to be a false positive. If not, fall back to the simpler pattern, and assume there will never be any false positives on this (perhaps because this account only receives email from a single system).
# extract message in the attachment if this is a MIME message
:0B
* ^Content-Disposition: *attachment.*(($)[a-z0-9].*)*($)($)\/[a-z0-9+]+=*
{ msgID="$MATCH" } # hafta have spaces inside the braces
:0EB # else, do this: assume the first non-empty body line is msgID
* ^()\/[a-z]+[0-9]+[^\+]
{ msgID="$MATCH" } # still need spaces inside braces;
# ... and, as pointed out many times before, cannot have spaces
# around the equals sign
The regular expression for the attachment is an oversimplification, but I already showed you how to cope with a complex MIME message in a previous question of yours -- if you have multiple cases (for example, a base64-encoded attachment, or just a plain-text attachment, or no MIME), I would arrange them from more complex (meaning more features in the regex) to simpler, falling back successively to regexes with a higher chance of false positives. You can chain :0E ("else") cases for as long as you like -- if a regex succeeds and the following recipes are :0E recipes, they will all be skipped.
In response to your update, there are two problems with your attempt. The first, as you note, is that the first regex doesn't match. You have no space after the colon, and I'm guessing there is one in the message you are matching against. You need to understand that every character in a regex needs to match exactly, with the exception of regex metacharacters, which have special meaning. You would typically see something like this in many Procmail recipes:
* ^Content-Type:[ ]*text/html;
where the spaces between the square brackets are a space and a tab. The character class (the stuff in the square brackets) matches either character once, and the asterisk * says to repeat this pattern zero or more times. This allows for arbitrary spacing after the colon. The square brackets and the star are metacharacters. (This is very basic stuff which should be in any Procmail introduction you may have read.)
Your other problem is that each regex is applied in isolation. So your recipe says, if the Content-Type header appears anywhere in the body, and the Content-Location header appears anywhere else (typically, in another MIME header somewhere) etc. In other words, your recipe is very prone to false positives. This is why the rule I proposed earlier is so complex: It looks for these headers in sequence, in a single block, that is, in a single MIME header (though there is nothing to actually make sure that the context is a MIME body part header; more on that in a bit).
Because we want to ensure that there are four different headers, in any order, the regex for this is going to be huge: ABCD|ABDC|ACBD|ACDB|ADBC|ADCB|BACD|... where A is the Content-Type header regex, B is the Content-Location regex, etc. You could cheat a little bit and craft a single regex which matches a sequence of four matches of the same header-identifying regex -- this is unlikely to cause any false positives (there is no sane reason to have two copies of the same header) and simplifies the code significantly, though it's still complex. Pay attention here: We want to create a single regex which matches any one out of these four headers.
^Content-(Type:[ ]*text/plain;|\
Location:[ ]*text_0\.txt|\
Transfer-Encoding:[ ]*base64|\
Disposition:[ ]*attachment)
... followed by any header, repeated four times, followed by the MIME body part (which you had after the Content-Disposition header, slightly out of context, but not incorrectly per se).
(Your code has text/html but if the attachment isn't HTML, as suggested by the format and the filename, it should be text/plain; so I'm going with that instead.)
Before we go there, I'll point out that MIME parsing in Procmail is not done a lot, precisely because it tends to explode into enormously complex regular expressions. MIME has a lot of options, and you need each regex to allow for omission or inclusion of each optional element. There are options for how to encode things (base64, or quoted-printable, or not encoded at all) and options to include or omit quotes around many elements, and options to use a multipart message with one or more body parts or just put the data in the body, like in my constructed first example message (which is still technically a MIME message; its implied content type is text/plain; charset="us-ascii" and the default content transfer encoding is 7bit, which conveniently happens to be what email before MIME always had to look like).
So unless you are in this because (a) you really, really want to learn the deepest secrets of Procmail or (b) you are on a very constrained system where you have to because there is nothing else you can use, I would seriously suggest that you move to a language with a proper MIME parser. A Python script which decodes this would be just half a dozen lines or so, and you get everything normalized and decoded nicely for you with no need for you to reinvent quoted-printable decoding or character set translation. (You can still call the Python script from Procmail if you like.)
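To give a sense of scale, here is a rough sketch of such a script (my own illustration, built only around the two message shapes constructed above; it treats any text/plain part with a filename as the attachment):

import email
import sys

# Read the raw message on stdin, e.g. piped from a Procmail recipe.
msg = email.message_from_binary_file(sys.stdin.buffer)

msg_id = None
for part in msg.walk():
    # The attachment case: a text/plain body part with a filename.
    # The email library has already undone base64/quoted-printable for us.
    if part.get_content_type() == "text/plain" and part.get_filename():
        msg_id = part.get_payload(decode=True).decode(
            part.get_content_charset("us-ascii"))
        break
else:
    if not msg.is_multipart():
        # The simple case: take the plain (non-MIME) body itself.
        msg_id = msg.get_payload(decode=True).decode(
            msg.get_content_charset("us-ascii"))

print(msg_id)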
I'll also point out here that a proper MIME parser would extract the boundary= parameter from the top-level headers in a multipart message, and make sure any matching on body part headers only occurs immediately after a boundary separator. The following Procmail code does not do that, so we could get a false positive if a message contains a match somewhere else than in the MIME body part headers (such as, for example, if a bounce message contains a fragment of the MIME headers of the bounced message; in this case, you would like for the recipe not to match, but it will).
:0B
* ^(Content-(Type:[ ]*text/plain;|\
Location:[ ]*text_0\.txt|\
Transfer-Encoding:[ ]*base64|\
Disposition:[ ]*attachment).*(($)[a-z0-9].*)*)($)\
(Content-(Type:[ ]*text/plain;|\
Location:[ ]*text_0\.txt|\
Transfer-Encoding:[ ]*base64|\
Disposition:[ ]*attachment).*(($)[a-z0-9].*)*)($)\
(Content-(Type:[ ]*text/plain;|\
Location:[ ]*text_0\.txt|\
Transfer-Encoding:[ ]*base64|\
Disposition:[ ]*attachment).*(($)[a-z0-9].*)*)($)\
(Content-(Type:[ ]*text/plain;|\
Location:[ ]*text_0\.txt|\
Transfer-Encoding:[ ]*base64|\
Disposition:[ ]*attachment).*(($)[a-z0-9].*)*)($)\
($)\/[a-z0-9/+]+=*
{ msgID=`printf '%s' "$MATCH" | base64 -d` }
:0BE
* ^^\/[a-z]+[0-9]*[^\+]
{ msgID="$MATCH" }
(Unfortunately, Procmail's regex engine doesn't have the {4} repetition operator, so we have to repeat the regex literally four times!)
As noted before, Procmail, unfortunately, doesn't know anything about MIME. As far as Procmail is concerned, the top-level headers are headers, and everything else is body. There have been attempts to write MIME libraries or extensions for Procmail, but they don't tend to reduce complexity, just shuffle it around.

Fetching unicode data from PostgreSQL Erlang

I'm trying to fetch data from PostgreSQL with Erlang.
Here's my code that gets data from the DB. However, I have Cyrillic data in the 'status' column, and this Cyrillic data is not being fetched correctly.
I tried using UserInfo = io_lib:format("~tp ~n",[UserInfoQuery]); however, this doesn't seem to work, because it crashes the app.
UserInfoQuery = odbc_queries:get_user_info(LServer,LUser),
UserInfo = io_lib:format("~p",[UserInfoQuery]),
?DEBUG("UserInfo: ~p",[UserInfo]),
StringForUserInfo = lists:flatten(UserInfo),
get_user_info(LServer, Id) ->
    ejabberd_odbc:sql_query(
      LServer,
      [<<"select * from users "
         "where email_hash='">>, Id, "';"]).
Here's the data that is fetched from the DB:
{selected,[<<"username">>,<<"password">>,<<"created_at">>,
<<"id">>,<<"email_hash">>,<<"status">>],
[{<<"admin">>,<<"admin">>,<<"2014-05-13 12:40:30.757433">>,
<<"1">>,<<"adminhash">>,
<<209,139,209,132,208,178,208,176,209,139,209,132,208,
178,208,176>>}]}
Question:
How can I extract data from a column? For example, only the data from the 'status' column?
How can I extract data in unicode from the DB? Should I fetch the data from the DB and then use io_lib:format("~tp~n") on it? Is there a better way to do it?
Additional question: is there any way to get the string in human-readable form, so that StringForUserInfo = 'ыфваыфва' from RowUnicode?
I tried this:
{selected, _, [Row]} = UserInfoQuery,
RowUnicode = io_lib:format("~tp~n", [Row]),
?DEBUG("RowUnicode: ~p",[RowUnicode]),
StringForUserInfo = lists:flatten(RowUnicode),
Error:
bad argument in call to erlang:iolist_size([123,60,60,34,97,100,109,105,110,34,
62,62,44,60,60,34,97,100,109,105,110,34,62,62,44,60,60,34,50,...])
The Erlang ODBC driver fetched the status column from your database perfectly well. Indeed, PostgreSQL encodes your data in UTF-8, and the value you get is UTF-8 encoded.
Status = <<209,139,209,132,208,178,208,176,209,139,209,132,208,178,208,176>>.
This is a binary representing the string ыфваыфва in UTF-8.
You can directly use UTF-8 encoded binaries in your code. If you want to use unicode code points instead of UTF-8 bytes, you can convert the binary to a list of integers (a string in Erlang parlance). Just use unicode:characters_to_list/1, which in your case will yield the list [1099,1092,1074,1072,1099,1092,1074,1072]. This is a list representation of the same string. Unicode character 1099 (16#044B in hex) is ы (CYRILLIC SMALL LETTER YERU, cf. the Cyrillic excerpt unicode chart).
Erlang can handle unicode texts in the two representations above: lists of unicode characters as integers and binaries of UTF-8 encoded characters.
Let's examine a smaller example, string "ы". This string is composed of unicode character 044B CYRILLIC SMALL LETTER YERU, and it can be encoded as a binary as <<209,139>> or as a list as [16#044B] (= [1099]).
Historically, lists of integers as well as binaries were Latin-1 (ISO-8859-1) encoded. Unicode and ISO-8859-1 have the same values from 0 to 255, but UTF-8 transformation only matches ISO-8859-1 for characters in the 0-127 range. For this reason, Erlang's ~s format argument has a unicode translation modifier, ~ts. The following line will not work as expected:
io:format("~s", [<<209,139>>]).
It will output two characters, 00D1 (LATIN CAPITAL LETTER N WITH TILDE) and 008B (PARTIAL LINE FORWARD). This is because <<209,139>> is interpreted as a Latin-1 string and not as a UTF-8 encoded string.
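Incidentally, you can reproduce this exact mismatch outside Erlang; in Python, the same two bytes decode as follows:

data = bytes([209, 139])        # the UTF-8 encoding of "ы"
print(data.decode("utf-8"))     # ы  -- one character, U+044B (1099)
print(data.decode("latin-1"))   # Ñ followed by the U+008B control character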
The following line will fail:
io:format("~s", [[1099]]).
This is because [1099] is not a valid Latin-1 string.
Instead, you should write:
io:format("~ts", [<<209,139>>]),
io:format("~ts", [[1099]]).
Erlang's ~p format argument also has a unicode translation modifier, ~tp. However, ~tp will not do what you are looking for alone. Whether you use ~p or ~tp, by default, io_lib:format/2 will format the Status UTF-8 encoded binary above as:
<<209,139,209,132,208,178,208,176,209,139,209,132,208,178,208,176>>
Indeed, the t modifier only means the argument shall accept unicode input. If you do use ~p, when formatting a string or a binary, Erlang will determine whether it could be represented as a Latin-1 string, since input may be Latin-1 encoded. This heuristic allows Erlang to properly distinguish lists of integers from strings, most of the time. To see the heuristic at work, you can try something like:
io:format("~p\n~p\n", [[69,114,108,97,110,103], [1,2,3,4,5,6]]).
The heuristic detects that [69,114,108,97,110,103] actually is "Erlang", while [1,2,3,4,5,6] is just, well, a list of integers.
If you do use ~tp, Erlang will expect strings or binaries to be unicode-encoded, and then apply the default identification heuristic, which currently (R17) happens to be Latin-1 as well. Since your string cannot be represented in Latin-1, Erlang will display it as a list of integers. Fortunately, you can switch to the Unicode heuristic by passing +pc unicode to Erlang on the command line, and this will produce what you are looking for.
$ erl +pc unicode
So a solution to your problem is to pass +pc unicode and to use ~tp.
I don't understand why io:format("~tp") doesn't work, but you can extract the row and column you need and print that with io:format("~ts"):
> {selected, _, [Row]} = UserInfoQuery.
> io:format("~ts~n", [element(6, Row)]).
ыфваыфва
ok

C# To transform Facebook Response to proper encoded string

I am using a regular StreamReader to get the response from the Facebook Graph API:
https://graph.facebook.com/XXXX?access_token=&fields=id,name,about,address,last_name
I am reading the response stream, yet it returns
{"id":"XXXXX","name":"K\u0131r\u0131nt\u0131 Reklam"...}
My code is below - I unsuccessfully tried explicitly using UTF-8 and "iso-8859-9" (Turkish) encodings and setting accept-charset headers. I read Joel's famous article about encodings. It looks like the chars '\', 'u', '0', '1', '3', '1' are each coming as literal characters from Facebook - I thought this would have been 2 bytes for the value 0131 in UTF-8. I am confused. I expect this string to be "Kırıntı Reklam".
I could simply find/replace those strings - yet it would be far from elegant and maintainable. How should I properly process or convert the facebook graph api response for strings with accents?
using (WebResponse response = request.GetResponse())
{
    using (Stream dataStream = response.GetResponseStream())
    {
        if (dataStream != null)
        {
            using (StreamReader reader = new StreamReader(dataStream))
            {
                responseFromServer = reader.ReadToEnd();
            }
        }
    }
}
Thank you in advance
tldr; use a JSON library - I like Json.NET - and don't worry about it.
The JSON shown is valid JSON where \uABCD in a JSON string represents a UTF-16 encoded character[1]. The internal JSON character escaping format is useful to avoid having to deal with Unicode stream encoding issues - it allows JSON to be represented entirely in ASCII/7-bit-clean characters (which is a subset of UTF-8).
Using a conforming JSON library to parse the JSON with such escapes would restore the JSON into an appropriate object-graph, of which some values will be properly-decoded String values. The library is responsible for understanding JSON and converting/reading it as appropriate - this includes correctly handling any such \u escape sequences.
The stream itself (the JSON text) should use the encoding that the server declares, that a BOM indicates, or that has been pre-negotiated: but really, it's just UTF-8 here. This is how the JSON text is encoded, but it has no bearing on the escape sequences found inside JSON strings.
[1] Per RFC 4627, The application/json Media Type for JavaScript Object Notation (JSON):
Any character may be escaped. If the character is in the Basic
Multilingual Plane (U+0000 through U+FFFF), then it may be
represented as a six-character sequence: a reverse solidus, followed
by the lowercase letter u, followed by four hexadecimal digits that
encode the character's code point. The hexadecimal letters A through
F can be upper or lowercase. So, for example, a string containing
only a single reverse solidus character may be represented as
"\u005C".
Alternatively, there are two-character sequence escape
representations of some popular characters. So, for example, a
string containing only a single reverse solidus character may be
represented more compactly as "\\".
To escape an extended character that is not in the Basic Multilingual
Plane, the character is represented as a twelve-character sequence,
encoding the UTF-16 surrogate pair. So, for example, a string
containing only the G clef character (U+1D11E) may be represented as
"\uD834\uDD1E"
For the doubters, here is a LINQPad example. This uses JSON.Net and imports the Newtonsoft.Json.Linq namespace.
var json = @"{""name"":""K\u0131r\u0131nt\u0131 Reklam""}";
json.Dump(); // -> {"name":"K\u0131r\u0131nt\u0131 Reklam"}
var name = JObject.Parse(json)["name"].ToString();
(name == "Kırıntı Reklam").Dump(); // -> true
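The behavior isn't specific to .NET; any conforming JSON parser performs the same unescaping. A quick sanity check in Python:

import json

raw = r'{"name":"K\u0131r\u0131nt\u0131 Reklam"}'
print(json.loads(raw)["name"])   # -> Kırıntı Reklam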

How to encode newlines in vCard 4.0 parameter values: ^n or \n?

The vCard 4.0 RFC 6350 says that newlines in property parameter values must be encoded as \n (at least for the LABEL parameter of the ADR property):
The property can also include a "LABEL" parameter to present a
delivery address label for the address. Its value is a plain-text
string representing the formatted address. Newlines are encoded
as \n, as they are for property values.
ADR;GEO="geo:12.3457,78.910";LABEL="Mr. John Q. Public, Esq.\n
Mail Drop: TNE QB\n123 Main Street\nAny Town, CA 91921-1234\n
U.S.A.":;;123 Main Street;Any Town;CA;91921-1234;U.S.A.
However vCard 4.0 RFC 6350 is updated by 'Parameter Value Encoding in iCalendar and vCard' RFC 6868, which says:
formatted text line breaks are encoded into ^n (U+005E, U+006E)
GEO;X-ADDRESS="Pittsburgh Pirates^n115 Federal St^nPitt
sburgh, PA 15212":geo:40.446816,-80.00566
which shows ^n being used.
So how do I encode newlines in vCard 4.0 parameter values: as \n or as ^n?
Look at the actual grammar:
param-value = *SAFE-CHAR / DQUOTE *QSAFE-CHAR DQUOTE
vCard/iCalendar (unfortunately) does not support generic escaping in property attribute values. As mentioned in RFC 6868:
The \-escaping mechanism used for property text values is not defined
for use with parameter values
(which is the whole point of RFC 6868).
LABEL is special and explicitly spec'ed to support \n:
The property can also include a "LABEL" parameter to present a
delivery address label for the address. Its value is a plain-text
string representing the formatted address. Newlines are encoded as
\n, as they are for property values.
This is just for LABEL.
To answer your question: "How do I encode newlines in vCard 4.0 parameter values as \n or as ^n?"
You first look at whether the value of the parameter is spec'ed in a special way, like LABEL. If so, encode it as described for that parameter. If it isn't, encode it via the ^-escaping mechanism (^n).
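As a minimal sketch of that decision in Python (the LABEL special case is hard-coded for illustration; a real encoder would consult the spec for each parameter it emits):

def encode_param_value(name: str, value: str) -> str:
    """Encode newlines in a vCard 4.0 parameter value."""
    if name.upper() == "LABEL":
        # LABEL is explicitly spec'ed (RFC 6350) to use \n.
        return value.replace("\n", "\\n")
    # Otherwise apply the generic RFC 6868 ^-escaping.
    # Escape "^" itself first so we don't re-escape our own output.
    return (value.replace("^", "^^")
                 .replace("\n", "^n")
                 .replace('"', "^'"))

# encode_param_value("LABEL", "123 Main Street\nAny Town")
#   -> '123 Main Street\\nAny Town'  (a literal backslash and "n")
# encode_param_value("X-ADDRESS", "Pittsburgh Pirates\n115 Federal St")
#   -> 'Pittsburgh Pirates^n115 Federal St'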

Decode an UTF8 email header

I have an email subject of the form:
=?utf-8?B?T3.....?=
The body of the email is utf-8 base64 encoded - and has decoded fine.
I am currently using Perl's Email::MIME module to decode the email.
What is the meaning of the =?utf-8 delimiter and how do I extract information from this string?
The encoded-word tokens (as per RFC 2047) can occur in values of some headers. They are parsed as follows:
=?<charset>?<encoding>?<data>?=
The charset is UTF-8 in this case; the encoding is B, which means Base64 (the other option is Q, which means Quoted-Printable).
To read it, first decode the base64, then treat it as UTF-8 characters.
Also read the various Internet Mail RFCs for more detail, mainly RFC 2047.
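If it helps to see those two steps spelled out (outside Perl), here is a small Python sketch; the token is a made-up complete example, since the one in your subject line is truncated:

import base64
from email.header import decode_header, make_header

token = "=?utf-8?B?0YvRhNCy0LA=?="   # hypothetical example token

# By hand: split =?<charset>?<encoding>?<data>?= and undo each layer.
charset, encoding, data = token[2:-2].split("?")
assert encoding.upper() == "B"       # B means base64
print(base64.b64decode(data).decode(charset))   # -> ыфва

# Or let the standard library handle the whole RFC 2047 format:
print(str(make_header(decode_header(token))))   # -> ыфва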
Since you are using Perl, Encode::MIME::Header could be of use:
SYNOPSIS
use Encode qw/encode decode/;
$utf8 = decode('MIME-Header', $header);
$header = encode('MIME-Header', $utf8);
ABSTRACT
This module implements RFC 2047 Mime
Header Encoding. There are 3 variant
encoding names; MIME-Header, MIME-B
and MIME-Q. The difference is
described below
             decode()          encode()
MIME-Header  Both B and Q      =?UTF-8?B?....?=
MIME-B       B only; Q croaks  =?UTF-8?B?....?=
MIME-Q       Q only; B croaks  =?UTF-8?Q?....?=
I think that the Encode module handles that with the MIME-Header encoding, so try this:
use Encode qw(decode);
my $decoded = decode("MIME-Header", $encoded);
Check out RFC2047. The 'B' means that the part between the last two '?'s is base64-encoded. The 'utf-8' naturally means that the decoded data should be interpreted as UTF-8.
MIME::Words from MIME-tools works well for this too. I ran into some issues with Encode and found MIME::Words succeeded on some strings where Encode did not.
use MIME::Words qw(:all);
$decoded = decode_mimewords(
'To: =?ISO-8859-1?Q?Keld_J=F8rn_Simonsen?= <keld@dkuug.dk>',
);
This is a standard extension for charset labeling of headers, specified in RFC2047.