Mule ESB Groovy Unicode output issue - unicode

I have this little chain of components in my Mule ESB project:
<set-payload value="Получена заявка ##[sessionVars['ticketID']]" doc:name="Set SMS Text"/>
<scripting:transformer doc:name="Send SMS" ignoreBadInput="true">
<scripting:script engine="Groovy"><![CDATA[
new File("/tmp/groovy.out").withWriter { out ->
out.println message.payload
}
]]></scripting:script>
</scripting:transformer>
When message passes this chain in /tmp/groovy.out I can see "Џолучена заЯвка #4041" instead of expected "Получена заявка #4041" ("Получена заявка" - Russian words), i.e. there is a problem with unicode chars output and there are no problems with ASCII chars.
When I check /tmp/groovy.out with HEX editor I see that all Russain chars has 1-byte lenght (in unicode that must be 2-bytes length), i.e. output of my Groovy component is not unicode.
There is no problem with unicode output to Mule log when I user Echo and Logger components. Also in SMTP component everything is perfect: I successfully receive letters in unicode from Mule.
Can you help me with unicode output to file with Groovy in Mule ESB?

Can you try:
new File("/tmp/groovy.out").withWriter( 'UTF-8' ) { out ->

Related

Import nodeset in node-opcua server opens file ascii encoded

I create the node-opcua addressspace by usage of nodeset.xml files. I fill the server_options.nodeset_filename array with filenames to load. This works fine.
Now I wanted to load the nodeset "Opc.Ua.Ijt.Tightening.NodeSet2.xml" from companion specification for tightening (https://github.com/OPCFoundation/UA-Nodeset/blob/v1.04/IJT/Tightening/Opc.Ua.Ijt.Tightening.NodeSet2.xml) and recognized that some descriptions are cut if connecting to the server and reading with an opcua client. For example UAVariable NodeId="ns=1;i=6094" contains field 'Error' with description '0 – OK ...'.
The '-' in '0 - OK' is utf-8 encoded character in the nodeset xml.
After some investigation I found fs.readFile(xmlFile, "ascii", (err, xmlData: string) in export function readNodeSet2XmlFile https://github.com/node-opcua/node-opcua/blob/master/packages/node-opcua-address-space/source_nodejs/generate_address_space.ts#:~:text=fs.readFile(xmlFile%2C%20%22ascii%22%2C%20(err%2C%20xmlData%3A%20string)
The OPC UA specification tells 'All String values are encoded as a sequence of UTF-8 characters'.
Questions:
Does node-opcua really read nodeset xml 'ascii' encoded or is it my
wrong interpretation?
Is there a way to force node-opcua to use 'utf-8' encoding when reading nodesets?
this issue is now fixed in node-opcua. THere is not problem to read nodeset2.xml containing UTF-8 special characters anymore.

WiX Installer heat.exe and non-ascii filenames

I added a file in my WiX script with the character " î " in the path name. Light.exe will complain:
A string was provided with characters that are not available in the specified database code page '1252'
The character in question is 0xEE in Windows-1252 encoding, that is, 0x00EE Unicode or 0xC3AE in UTF-8. These files are in a wxs file generated by heat.exe, and this xml is encoded as UTF-8.
I assume the error message comes from the fact that it tries to input the character in UTF encoding while the database is 1252? Since UTF isn't really supported by Windows Installer (as described in the WiX documentation), should I be using input xml encoded in 1252 or iso-8859? If so, can I tell heat.exe to use another encoding for its output?
My question is similar to this one:
Leveraging heat.exe and harvest already localized file names and including them to msi using wix but the difference is that in that case the characters are "true" non-ansi charcaters, in my case the character can be encoded correctly in 1252, but it seems the conversion from utf-8 input files does not work.
The WiX toolset verifies codepages like so (roughly):
encoding = Encoding.GetEncoding(codepage, new EncoderExceptionFallback(),
new DecoderExceptionFallback());
writer = new StreamWriter(idtPath, false, encoding);
try
{
// GetBytes will throw an exception if any character doesn't
// match our current encoding
rowBytes = writer.Encoding.GetBytes(rowString);
}
catch (EncoderFallbackException)
{
rowBytes = convertEncoding.GetBytes(rowString);
messageHandler.OnMessage(WixErrors.InvalidStringForCodepage(
row.SourceLineNumbers,
writer.Encoding.WindowsCodePage));
}
It is possible that NETFX is not translating that "i" correctly. Explicitly setting the codepage on your XML may help. To do that from heat, you can try to use an XSLT (I've never tried changing the XML doc codepage via XSL but seems possible) or post-edit the document.

MSXML.DOMDocument.4.0 loadXML with Chinese Unicode characters

Currently, I'm trying to use the MSXML loadXML method in ASP to load XML string which may contain Unicode Chinese characters like
𠮢 (U+20BA2) 4bytes
and the xml string looks like
<City>City</City><Name>𠮢</Name>
So, in my code, I could see the xml string comes in right, but the loadXML returns an an error message like
Invalid unicode characters, & #55362;&#57250
Can someone please tell me what I can do to resolve this issue?
Thanks,
Edited
The code looks like this
Set objDoc = CreateObject("MSXML2.DOMDocument")
objDoc.async = false
objDoc.setProperty "SelectionLanguage", "XPath"
objDoc.validateOnParse = false
objDoc.loadXML(strXml)
I suggest posting the exact code, XML source and error message you are getting. I cannot reproduce an error by parsing <element>𠮢</element> in MSXML 4.0 SP3; this works fine.
I certainly do get a parseError with reason "Invalid unicode character" by trying to parse <element>𠮢</element>, because that's not well-formed XML. If you do have this in your markup then you need to fix the serialiser that produced it because neither MSXML nor any standards-compliant XML parser will load it.
If 𠮢 is turned into a character reference it must be 𠮢 (or 𠮢). Code units 55362 and 57250 are 'surrogates', reserved for encoding astral plane characters in UTF-16. They can't be included in an XML document.
𠮢 is the entity encoded form of 0xD842 0xDFA2, which is the UTF-16 encoded form of the Unicode 𠮢 character. Make sure that the XML is completely UTF-16 encoded, not mixed single-byte ASCII and multi-byte UTF-16.

How did SourceForge maim this Unicode character?

A little encoding puzzle for you.
A comment on a SourceForge tracker item contains the character U+2014, EM DASH, which is rendered by the web interface as — like it should.
In the XML export, however, it shows up as:
—
Decoding the entities, that results in these code points:
U+00E2 U+20AC U+201D
I.e. the characters —. The XML should have been —, the decimal representation of 0x2014, so this is probably a bug in the SF.net exporter.
Now I'm looking to reverse the process, but I can't find a way to get the above output from this Unicode character, no matter what erroneous encoding/decoding sequence I try. Any idea what happened here and how to reverse the process?
The the XML output is incorrectly been encoded using CP1252. To revert this, convert — to bytes using CP1252 encoding and then convert those bytes back to string/char using UTF-8 encoding.
Java based evidence:
String s = "—";
System.out.println(new String(s.getBytes("CP1252"), "UTF-8")); // —
Note that this assumes that the stdout console uses by itself UTF-8 to display the character.
In .Net, Encoding.UTF8.GetString(Encoding.GetEncoding(1252).GetBytes("—")) returns —.
SourceForge converted it to UTF8, interpreted the each of the bytes as characters in CP1252, then saved the characters as three separate entities using the actual Unicode codepoints for those characters.

"’" showing on page instead of " ' "

’ is showing on my page instead of '.
I have the Content-Type set to UTF-8 in both my <head> tag and my HTTP headers:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
In addition, my browser is set to Unicode (UTF-8):
So what's the problem, and how can I fix it?
So what's the problem,
It's a ’ (RIGHT SINGLE QUOTATION MARK - U+2019) character which is being decoded as CP-1252 instead of UTF-8. If you check the encodings table, then you see that this character is in UTF-8 composed of bytes 0xE2, 0x80 and 0x99. If you check the CP-1252 code page layout, then you'll see that each of those bytes stand for the individual characters â, € and ™.
and how can I fix it?
Use UTF-8 instead of CP-1252 to read, write, store, and display the characters.
I have the Content-Type set to UTF-8 in both my <head> tag and my HTTP headers:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
This only instructs the client which encoding to use to interpret and display the characters. This doesn't instruct your own program which encoding to use to read, write, store, and display the characters in. The exact answer depends on the server side platform / database / programming language used. Do note that the one set in HTTP response header has precedence over the HTML meta tag. The HTML meta tag would only be used when the page is opened from local disk file system instead of from HTTP.
In addition, my browser is set to Unicode (UTF-8):
This only forces the client which encoding to use to interpret and display the characters. But the actual problem is that you're already sending ’ (encoded in UTF-8) to the client instead of ’. The client is correctly displaying ’ using the UTF-8 encoding. If the client was misinstructed to use, for example ISO-8859-1, you would likely have seen ââ¬â¢ instead.
I am using ASP.NET 2.0 with a database.
This is most likely where your problem lies. You need to verify with an independent database tool what the data looks like.
If the ’ character is there, then you aren't connecting to the database correctly. You need to tell the database connector to use UTF-8.
If your database contains ’, then it's your database that's messed up. Most probably the tables aren't configured to use UTF-8. Instead, they use the database's default encoding, which varies depending on the configuration. If this is your issue, then usually just altering the table to use UTF-8 is sufficient. If your database doesn't support that, you'll need to recreate the tables. It is good practice to set the encoding of the table when you create it.
You're most likely using SQL Server, but here is some MySQL code (copied from this article):
CREATE DATABASE db_name CHARACTER SET utf8;
CREATE TABLE tbl_name (...) CHARACTER SET utf8;
If your table is however already UTF-8, then you need to take a step back. Who or what put the data there. That's where the problem is. One example would be HTML form submitted values which are incorrectly encoded/decoded.
Here are some more links to learn more about the problem:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), from our own Joel.
Unicode - How to get the characters right?, with more concise and practical information, solutions are targeted on Java environments.
How to setup your PHP site to use UTF8, targeted on PHP environments.
Ensure the browser and editor are using UTF-8 encoding instead of ISO-8859-1/Windows-1252.
Or use ’.
’ (Unicode codepoint U+2019 RIGHT SINGLE QUOTATION MARK) is encoded in UTF-8 as bytes:
0xE2 0x80 0x99.
’ (Unicode codepoints U+00E2 U+20AC U+2122) is encoded in UTF-8 as bytes:
0xC3 0xA2 0xE2 0x82 0xAC 0xE2 0x84 0xA2.
These are the bytes your browser is actually receiving in order to produce ’ when processed as UTF-8.
That means that your source data is going through two charset conversions before being sent to the browser:
The source ’ character (U+2019) is first encoded as UTF-8 bytes:
0xE2 0x80 0x99
those individual bytes were then being mis-interpreted and decoded to Unicode codepoints U+00E2 U+20AC U+2122 by one of the Windows-125X charsets (1252, 1254, 1256, and 1258 all map 0xE2 0x80 0x99 to U+00E2 U+20AC U+2122), and then those codepoints are being encoded as UTF-8 bytes:
0xE2 -> U+00E2 -> 0xC3 0xA2
0x80 -> U+20AC -> 0xE2 0x82 0xAC
0x99 -> U+2122 -> 0xE2 0x84 0xA2
You need to find where the extra conversion in step 2 is being performed and remove it.
This sometimes happens when a string is converted from Windows-1252 to UTF-8 twice.
We had this in a Zend/PHP/MySQL application where characters like that were appearing in the database, probably due to the MySQL connection not specifying the correct character set. We had to:
Ensure Zend and PHP were communicating with the database in UTF-8 (was not by default)
Repair the broken characters with several SQL queries like this...
UPDATE MyTable SET
MyField1 = CONVERT(CAST(CONVERT(MyField1 USING latin1) AS BINARY) USING utf8),
MyField2 = CONVERT(CAST(CONVERT(MyField2 USING latin1) AS BINARY) USING utf8);
Do this for as many tables/columns as necessary.
You can also fix some of these strings in PHP if necessary. Note that because characters have been encoded twice, we actually need to do a reverse conversion from UTF-8 back to Windows-1252, which confused me at first.
mb_convert_encoding('’', 'Windows-1252', 'UTF-8'); // returns ’
I have some documents where … was showing as … and ê was showing as ê. This is how it got there (python code):
# Adam edits original file using windows-1252
windows = '\x85\xea'
# that is HORIZONTAL ELLIPSIS, LATIN SMALL LETTER E WITH CIRCUMFLEX
# Beth reads it correctly as windows-1252 and writes it as utf-8
utf8 = windows.decode("windows-1252").encode("utf-8")
print(utf8)
# Charlie reads it *incorrectly* as windows-1252 writes a twingled utf-8 version
twingled = utf8.decode("windows-1252").encode("utf-8")
print(twingled)
# detwingle by reading as utf-8 and writing as windows-1252 (it's really utf-8)
detwingled = twingled.decode("utf-8").encode("windows-1252")
assert utf8==detwingled
To fix the problem, I used python code like this:
with open("dirty.html","rb") as f:
dt = f.read()
ct = dt.decode("utf8").encode("windows-1252")
with open("clean.html","wb") as g:
g.write(ct)
(Because someone had inserted the twingled version into a correct UTF-8 document, I actually had to extract only the twingled part, detwingle it and insert it back in. I used BeautifulSoup for this.)
It is far more likely that you have a Charlie in content creation than that the web server configuration is wrong. You can also force your web browser to twingle the page by selecting windows-1252 encoding for a utf-8 document. Your web browser cannot detwingle the document that Charlie saved.
Note: the same problem can happen with any other single-byte code page (e.g. latin-1) instead of windows-1252.
You have a mismatch in your character encoding; your string is encoded in one encoding (UTF-8) and whatever is interpreting this page is using another (say ASCII).
Always specify your encoding in your http headers and make sure this matches your framework's definition of encoding.
Sample http header:
Content-Type text/html; charset=utf-8
Setting encoding in asp.net
<configuration>
<system.web>
<globalization
fileEncoding="utf-8"
requestEncoding="utf-8"
responseEncoding="utf-8"
culture="en-US"
uiCulture="de-DE"
/>
</system.web>
</configuration>
Setting encoding in jsp
If your content type is already UTF8 , then it is likely the data is already arriving in the wrong encoding. If you are getting the data from a database, make sure the database connection uses UTF-8.
If this is data from a file, make sure the file is encoded correctly as UTF-8. You can usually set this in the "Save as..." Dialog of the editor of your choice.
If the data is already broken when you view it in the source file, chances are that it used to be a UTF-8 file but was saved in the wrong encoding somewhere along the way.
If someone gets this error on WordPress website, you need to change wp-config db charset:
define('DB_CHARSET', 'utf8mb4_unicode_ci');
instead of:
define('DB_CHARSET', 'utf8mb4');
If the other answers haven't helped, you might want to check whether your database is actually storing the mojibake characters. I was viewing the text in utf-8, but I was still seeing the mojibake and it turned out that, due to a database upgrade, the text had been permanently "mojibaked".
In this case, one option is to "fix" the text with Python's ftfy package (or JavaScript verion here).
You must have copy/paste text from Word Document. Word document use Smart Quotes. You can replace it with Special Character (’) or simply type in your HTML editor (').
I'm sure this will solve your problem.
In DBeaver (or other editors) the script file you're working can prompt to save as UTF8 and that will change the char:
–
into
–
or
–
The same thing happened to me with the '–' character (long minus sign).
I used this simple replace so resolve it:
htmlText = htmlText.Replace('–', '-');