DataWeave Multipart/form-data input encoding not recognized in Mule4 - encoding

Is there an option to set encoding by part in multipart/form-data received to listener?
I have this kind of multipart to be received:
----------------------------180928595588258919887097
Content-Disposition: form-data; name="qualifyResult"; filename="json1.json"
Content-Type: application/json
{
"json1": "1"
}
----------------------------180928595588258919887097
Content-Disposition: form-data; name="raceOneResult"; filename="json2.json"
Content-Type: application/json
{
"json2": "2"
}
----------------------------180928595588258919887097--
Both JSON files are in UTF-16, so when I try to store file data to variable - using the following DataWeave script:
%dw 2.0
output application/json
---
payload.parts.'qualifyResult'.content
It returns this kind of error:
org.mule.runtime.core.api.expression.ExpressionRuntimeException: "Unexpected character 'ÿ' at qualifyResult#[1:1] (line:column), expected false or true or null or {...} or [...] or number but was , while reading `qualifyResult` as Json.
1|
^" evaluating expression: "%dw 2.0
output application/json
---
payload.parts.'qualifyResult'.content".
So I think the DataWeave just tries to read the data using UTF-8 encoding instead of UTF-16.
I also tried to set encoding on the Set variable component where I use the DataWeave (as following snippet shows), but this didn't change anything.
<set-variable value="#[%dw 2.0
output application/json
---
payload.parts.'qualifyResult'.content]" doc:name="qualifyResult" doc:id="34800576-0539-4f95-8977-8353a255d83d" variableName="qualifyResult" encoding="UTF-16"/>
How to handle the inbound file data in correct encoding?
Thanks for all valid answers.

It doesn't look like problem with the encoding. It seems from the character ÿ mentioned in the error that it is part of a Byte Order Marker (BOM) at the beginning of the data to indicate the UTF encoding. DataWeave doesn't currently support BOMs and thinks it is garbage, so it throws an error. If you can't avoid that payload being sent with a BOM you may need some Java method or a script to remove it before trying to parse it with DataWeave.

Related

Python requests says it's UTF-8, so why are there still unicode characters?

Using requests to query the DarkSky API says it returns UTF-8 encoded document, but string is defaulting to ASCII with error. If I explicitly encode as UTF-8, there are no errors, but string contains extra characters and raw unicode. What's going on? I've set my py file to use UTF-8 encoding in Sublime.
# Fetch weather data from DarkSky, parse resulting JSON
try:
url = "https://api.darksky.net/forecast/" + API_KEY + "/" + LAT + "," + LONG + "?exclude=[minutely,hourly,alerts,flags]&units=us"
response = requests.get(url)
data = response.json()
print(response.headers['content-type'])
print(response.encoding)
which returns:
application/json; charset=utf-8
d_summary = data['daily']['summary']
print("Daily Summary: ", d_summary.encode('utf-8'))
which returns: Daily Summary: b'No precipitation throughout the week, with temperatures rising to 82\xc2\xb0F on Tuesday.'
What's going on with the extra characters in front and quoted substring with unicode text?
I don't see any problem here. Decoding the JSON doesn't cause an error, and encoding to UTF-8 produces a byte string literal repr b'...' as expected. Top-bit-set bytes are expected to look like \xXX in byte string literals.
string is defaulting to ASCII with error
What do you mean by that? Please show us the actual problem.
My guess is you are trying to print non-ASCII characters to the terminal on Windows and getting UnicodeEncodeError. If so that's because the Windows Console is broken and can't print Unicode properly. PEP 528 works around the problem in Python 3.6.

How to extract email body and attachment

I am trying to extract a message rom multi-part email body or from attachment, so I used :0B to try each option like the following:
msgID=""
#extract message in the attachment if it's plain text
:0B
* ^Content-Disposition: *attachment.*(($)[a-z0-9].*)*($)($)\/[a-z0-9+]+=*
{msgID="$MATCH"}
#extract message in the body if it's there
:0EB
* ^()\/[a-z]+[0-9]+[^\+]
{msgID = "$MATCH"}
But msgID got the same message from the body which was inline image code, what's wrong with it, who know the better condition to filter it?
I also need to detect if the sub-header in the body is text and base64 encoded, then decode it, how to stipulate it with regex:
:0B
* ^Content-Type:text/html;
* ^Content-Location:text_0.txt
* ^Content-Transfer-Encoding:base64
* ^Content-Disposition: *attachment.*(($)[a-z0-9].*)*($)($)\/[a-z0-9+]+=*
{ msgID= msgId =`printf '%s' "$MATCH" | base64 -d` }
It always complains no match: ^Content-Type:text/html;
I'm guessing you are trying to say, there are two types of incoming messages. One looks something like this:
From: Sender <there#example.net>
To: You <AmyX#example.com>
Subject: plain text
ohmigod0
And the other is a complex MIME multipart with the same contents:
From: Sender <there#example.net>
To: Amy X <AmyX#example.com>
Subject: MIME complexity
MIME-Version: 1.0
Content-Type: multipart/related; boundary=12345
--12345
Content-type: text/plain; charset="us-ascii"
Content-transfer-encoding: base64
Content-disposition: attachment; filename="text_0.txt"
Content-location: text_0.txt
b2htaWdvZDA=
--12345--
If this is correct, you would want to create a recipe to handle the more complex case first, because it has more features -- if your regex hits, it's unlikely to be a false positive. If not, fall back to the simpler pattern, and assume there will never be any false positives on this (perhaps because this account only receives email from a single system).
# extract message in the attachment if this is a MIME message
:0B
* ^Content-Disposition: *attachment.*(($)[a-z0-9].*)*($))($)\/[a-z0-9+]+=*
{ msgID="$MATCH" } # hafta have spaces inside the braces
:0EB # else, do this: assume the first non-empty body line is msgID
* ^()\/[a-z]+[0-9]+[^\+]
{ msgID="$MATCH" } # still need spaces inside braces;
# ... and, as pointed out many times before, cannot have spaces
# around the equals sign
The regular expression for the attachment is an oversimplification, but I already showed you how to cope with a complex MIME message in a previous question of yours -- if you have multiple cases (for example, base64-encoded attachment, or just a plain-text attachment, or no MIME), I would arrange them from more-complex (meaning more features in the regex) and fall back successively to simpler regexes, with higher chance of false positives. You can chain :0E ("else") cases for as long as you like -- if a regex succeeds and the following recipes are :0E recipes, they will all be skipped.
In response to your update, there are two problems with your attempt. The first, as you note, is that the first regex doesn't match. You have no space after the colon, and I'm guessing there is one in the message you are matching against. You need to understand that every character in a regex needs to match exactly, with the exception of regex metacharacters, which have special meaning. You would typically see something like this in many Procmail recipes:
* ^Content-Type:[ ]*text/html;
where the spaces between the square brackets are a space and a tab. The character class (the stuff in the square brackets) matches either character once, and the asterisk * says to repeat this pattern zero or more times. This allows for arbitrary spacing after the colon. The square brackets and the star are metacharacters. (This is very basic stuff which should be in any Procmail introduction you may have read.)
Your other problem is that each regex is applied in isolation. So your recipe says, if the Content-Type header appears anywhere in the body, and the Content-Location header appears anywhere else (typically, in another MIME header somewhere) etc. In other words, your recipe is very prone to false positives. This is why the rule I proposed earlier is so complex: It looks for these headers in sequence, in a single block, that is, in a single MIME header (though there is nothing to actually make sure that the context is a MIME body part header; more on that in a bit).
Because we want to ensure that there are four different headers, in any order, the regex for this is going to be huge: ABCD|ACDB|ACDB|ABDC|ADCB|BACD|... where A is the Content-Type header regex, B is the Content-Location regex, etc. You could cheat a little bit and craft a single regex which matches a sequence of four matches of the same header-identifying regex -- this is unlikely to cause any false positives (there is no sane reason to have two copies of the same header) and simplifies the code significantly, though it's still complex. Pay attention here: We want to create a single regex which matches any one out of these four headers.
^Content-(Type:[ ]text/plain;|\
Location:[ ]*text_0\.txt|\
Transfer-Encoding:[ ]*base64|\
Disposition:[ ]*attachment)
... followed by any header, repeated four times, followed by the MIME body part (which you had after the Content-Disposition header, slightly out of context, but not incorrectly per se).
(Your code has text/html but if the attachment isn't HTML, as suggested by the format and the filename, it should be text/plain; so I'm going with that instead.)
Before we go there, I'll point out that MIME parsing in Procmail is not done a lot, precisely because it tends to explode into enormously complex regular expressions. MIME has a lot of options, and you need each regex to allow for omission or inclusion of each optional element. There are options for how to encode things (base64, or quoted-printable, or not encoded at all) and options to include or omit quotes around many elements, and options to use a multipart message with one or more body parts or just put the data in the body, like in my constructed first example message (which is still technically a MIME message; its implied content type is text/plain; charset="us-ascii" and the default content transfer encoding is 7bit, which conveniently happens to be what email before MIME always had to look like).
So unless you are in this because (a) you really, really want to learn the deepest secrets of Procmail or (b) you are on a very constrained system where you have to because there is nothing else you can use, I would seriously suggest that you move to a language with a proper MIME parser. A Python script which decodes this would be just half a dozen lines or so, and you get everything normalized and decoded nicely for you with no need for you to reinvent quoted-printable decoding or character set translation. (You can still call the Python script from Procmail if you like.)
I'll also point out here that a proper MIME parser would extract the boundary= parameter from the top-level headers in a multipart message, and make sure any matching on body part headers only occurs immediately after a boundary separator. The following Procmail code does not do that, so we could get a false positive if a message contains a match somewhere else than in the MIME body part headers (such as, for example, if a bounce message contains a fragment of the MIME headers of the bounced message; in this case, you would like for the recipe not to match, but it will).
:0B
* ^(Content-(Type:[ ]text/plain;|\
Location:[ ]*text_0\.txt|\
Transfer-Encoding:[ ]*base64|\
Disposition:[ ]*attachment).*(($)[a-z0-9].*)*)($)\
(Content-(Type:[ ]text/plain;|\
Location:[ ]*text_0\.txt|\
Transfer-Encoding:[ ]*base64|\
Disposition:[ ]*attachment).*(($)[a-z0-9].*)*)($)\
(Content-(Type:[ ]text/plain;|\
Location:[ ]*text_0\.txt|\
Transfer-Encoding:[ ]*base64|\
Disposition:[ ]*attachment).*(($)[a-z0-9].*)*)($)\
(Content-(Type:[ ]text/plain;|\
Location:[ ]*text_0\.txt|\
Transfer-Encoding:[ ]*base64|\
Disposition:[ ]*attachment).*(($)[a-z0-9].*)*)($)\
($)\/[a-z0-9/+]+=*
{ msgid=`printf '%s' "$MATCH" | base64 -d` }
:0BE
* ^^\/[a-z]+[0-9]*[^\+]
{ msgid="$MATCH" }
(Unfortunately, Procmail's regex engine doesn't have the {4} repetition operator, so we have to repeat the regex literally four times!)
As noted before, Procmail, unfortunately, doesn't know anything about MIME. As far as Procmail is concerned, the top-level headers are headers, and everything else is body. There have been attempts to write MIME libraries or extensions for Procmail, but they don't tend to reduce complexity, just shuffle it around.

Encoding DOCX to be Sent via email

I am writing a little Perl app to send emails internally where I work, but I am having troubles with attachments. Now, I can hear all of you screaming "use MIME::Lite", but it aint that easy - management rules tell me I am unable to use anything from the cpan...
Anyhow, I am using MIME::Base64 to encode any attachements I am sending. I am using Net::SMTP for the email side of things.
encoding is done as such (straight out of the Perldoc page):
my $encoded = "";
use MIME::Base64 qw(encode_base64);
open (FILE, "7857084216_9816ae9bec_b.jpg") or die "$!\n";
while (read(FILE, $buf, 60*57)) {
$encoded .= encode_base64($buf);
}
The relevant Net::SMTP code is as follows:
use Net::SMTP;
$smtp = Net::SMTP->new("mailserver",
Hello => 'somedomain.com.au',
Timeout => 60,
Debug => 0,
);
$smtp->mail(<services\#somedomain.com.au>);
$smtp->to(<me\#somedomain.com.au>);
$smtp->cc(<another\#somedomain.com.au>);
$smtp->data();
$smtp->datasend("Subject: testing 123\n");
$smtp->datasend("To: me\#somedomain.com.au\n");
$smtp->datasend("CC: another\#somedomain.com.au\n");
$smtp->datasend("MIME-Version: 1.0\n");
$smtp->datasend("Content-type: multipart/mixed;\n\tboundary=\"$boundary\"\n");
$smtp->datasend("--$boundary\n");
$smtp->datasend("Content-type: text/html; charset=utf-8\n");
$smtp->datasend("\n");
$smtp->datasend("<p> Test Email!</p>\n");
$smtp->datasend("\n");
$smtp->datasend("--$boundary\n");
$smtp->datasend("Content-type: image/jpeg; name=\"7857084216_9816ae9bec_b.jpg\"\n");
$smtp->datasend("Content-ID: \n");
$smtp->datasend("Content-Disposition: attachment; filename=\"7857084216_9816ae9bec_b.jpg\"\n");
$smtp->datasend("Content-transfer-encoding: base64\n");
$smtp->datasend("\n");
$smtp->datasend("$encoded");
$smtp->datasend("\n");
$smtp->datasend("--$boundary--\n");
$smtp->dataend;
$smtp->quit;
At this moment, I am directing these emails through to myself and reading them using Outlook 2010
When sending images (such as jpegs), I am not having any issues at all - everything seems to decode OK byte for byte as far as I can see.
When I am sending docx type files with just plain text, everything seems fine.
But, when I am sending docx files with images inserted, the files are corrupting.
Not being an expert in mail sending and attaching, I am at a bit of a loss. How should I be encoding docx files for attaching to emails? Any help would be appreciated!
Also forgot to mention, I have attempted to set the content-type accordingly: Content-type: application/vnd.openxmlformats-officedocument.wordprocessingml.document
I've seen this happen before and usually comes down to just the incorrect mime type or mime structure. You mention you tried adding the mime type, so that might indicate that the mime structure might be a little off. I've seen this with xslx and csv files. CSV would come in find since the decoder assumes text, but if you don't have the right Mime for binary data it will always attempt to convert to ascii. I realize this around an xslx vs docx, but I think the same principles apply.
Here is a snip of an sample email I'm generating (Using Mime:Lite), but might help.
From: me#example.com
To: example#example.com
Subject: Example
Content-Transfer-Encoding: binary
Content-Type: multipart/mixed; boundary="_----------=_13433203262384120"
--_----------=_13433203262384120
Content-Transfer-Encoding: quoted-printable
Content-Type: text/html
Content-Disposition: inline
<body>Hello,<br>
<p>EMAIL BODY</p>
<p>Thanks,<br> Blah</body>
--_----------=_13433203262384120
Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet; name="sample.xlsx"
Content-Disposition: attachment; filename="sample.xlsx"
Content-Transfer-Encoding: base64
UEsDBBQAAAAIAAFk+kDm/EEvVgEAACQFAAATAAAAW0NvbnRlbnRfVHlwZXNdLnhtbMVUS28CIRC+
N+l/IFybXdRD0zSuHvo4tia1P4DC6BJZIAxa/fed3a32EesjmvQCgfleQ8j0h8vKsgVENN4VvJt3
OAOnvDZuWvDX8WN2wxkm6bS03kHBV4B8OLi86I9XAZAR22HBy5TCrRCoSqgk5j6Ao8rEx0omOsap
CFLN5BREr9O5Fsq7BC5lqdbgg/49TOTcJvawpOs2SQSLnN21wNqr4DIEa5RMVBcLp3+5ZJ8OOTEb
DJYm4BUBOBNbLZrSnw5r4jM9TjQa2EjG9CQrggnt1Sj6gIII+W6ZLUH9ZGIUkMa8IkoOdSINOgsk
CTEZ+Eq901z5CMe7r5+pZh9qubQC08oCntwshghSYwmQKpu3ovusE30qaNfuyQEamX2O7z7O3ryf
.... <Snipped>
ZRQyppbEwDToT079O3NfFSgofVlmcbvM34PqkQQdCmrLiWYxleokviz25OPYvpRw/s64avTwn+uh
SSg4ctedMMaTkj47g3IXX1BLAQIUAxQAAAAIAAFk+kDm/EEvVgEAACQFAAATAAAAAAAAAAEAAAC2
gQAAAABbQ29udGVudF9UeXBlc10ueG1sUEsBAhQDFAAAAAgAAWT6QPWzn3yRAQAAQAMAABAAAAAA
AAAAAQAAALaBhwEAAGRvY1Byb3BzL2FwcC54bWxQSwECFAMUAAAACAABZPpAkg6xSFoBAAC2AgAA
EQAAAAAAAAABAAAAtoFGAwAAZG9jUHJvcHMvY29yZS54bWxQSwECFAMUAAAACAABZPpAyfySzFpy
BAAcqRgAFAAAAAAAAAABAAAAtoHPBAAAeGwvc2hhcmVkU3RyaW5ncy54bWxQSwECFAMUAAAACAAB
ZPpARQtWMQgCAABaBQAADQAAAAAAAAABAAAAtoFbdwQAeGwvc3R5bGVzLnhtbFBLAQIUAxQAAAAI
AABk+kCVKtbcsAEAAFgDAAAPAAAAAAAAAAEAAAC2gY55BAB4bC93b3JrYm9vay54bWxQSwECFAMU
AAAACAABZPpA8XOrpLUFAABVGwAAEwAAAAAAAAABAAAAtoFrewQAeGwvdGhlbWUvdGhlbWUxLnht
bFBLAQIUAxQAAAAIAO1j+kD+7E73UA8AAM1zAAAYAAAAAAAAAAEAAAC2gVGBBAB4bC93b3Jrc2hl
ZXRzL3NoZWV0MS54bWxQSwECFAMUAAAACAAAZPpA7/olcLO/MgAHxsABGAAAAAAAAAABAAAAtoHX
kAQAeGwvd29ya3NoZWV0cy9zaGVldDIueG1sUEsBAhQDFAAAAAgAAWT6QCFo34b0AAAATgMAABoA
AAAAAAAAAQAAALaBwFA3AHhsL19yZWxzL3dvcmtib29rLnhtbC5yZWxzUEsBAhQDFAAAAAgAAWT6
QIB6g4TsAAAAUQIAAAsAAAAAAAAAAQAAALaB7FE3AF9yZWxzLy5yZWxzUEsFBgAAAAALAAsAxgIA
AAFTNwAAAA==
--_----------=_13433203262384120--
I have found the issue and it does appear that the 'content-type' does seem to make a difference. According to what I found, I replaced all the Content-type calls, that include Base64 encoded data with the following:
Content-type: application/octet-stream;
This seemed to have sent it straight through, no issues with either Outlook 2010 and 2003. I have tested it against binary files, plain text, ect and all seems to work OK in my current environment.
Cheers

Charsets and "preferred MIME name"

The HTTP Accept-Encoding header contains atoms of acceptable character sets, and the charset= field in a MIME Content-Type header contains an atom of the character set for the following data.
My question is the following: must these atoms match the preferred MIME encoding name or charset name, or can they match any alias of a charset?
Alias and preferred MIME encoding as used by http://www.iana.org/assignments/character-sets.
I'm planning on using iconv to convert to platform-native wide UTF, and I don't want to make the entry in the form of an array of fields in the form of (iconv_alias, { list-of-aliases }) per charset. Rather, a simple (alias, iconv_alias) 2-tuple.

Zend Framework and UTF-8 characters (æøå)

I use Zend Framework and I have problem with JSON and UTF-8.
Output
\u00c3\u00ad\u00c4\u008d
íÄ
I use...
JavaScript (jQuery)
contentType : "application/json; charset=utf-8",
dataType : "json"
Zend Framework
$view->setEncoding('UTF-8');
$view->headMeta()->appendHttpEquiv('Content-Type', 'text/html;charset=utf-8');
header('Content-Type: application/json; charset=utf-8');
utf8_encode();
Zend_Json::encode
Database
resources.db.params.charset = "utf8"
resources.db.params.driver_options.1002 = "SET NAMES utf8"
resources.db.isDefaultTableAdapter = true
Collation
utf8_unicode_ci
Type
MyISAM
Server
PHP Version 5.2.6
What did I do wrong? Thank you for your reply!
utf8_encode();
If you've got UTF-8 strings from your database and UTF-8 strings from your browser, then you don't need to utf8_encode any more. You've already got UTF-8 strings; calling this function again will just give you the UTF-8 representation of what you'd get if you read UTF-8 bytes as ISO-8859-1 by mistake.
Pass your untouched UTF-8 strings straight to the JSON encoder.
I think this question is some how related to yours
my problem was when encoding some [ Arabic , Hebrew or Chinese as you might see ]
turns out that unicode notation understood by javascript/ecmascript like what did you see
I hope that explain to you in details