Encoding special chars in XSLT output - unicode

I have built a set of scripts, part of which transform XML documents from one vocabulary to a subset of the document in another vocabulary.
For reasons that are opaque to me, but apparently non-negotiable, the target platform (Java-based) requires the output document to have 'encoding="UTF-8"' in the XML declaration, but some special characters within text nodes must be encoded with their hex unicode value - e.g. '”' must be replaced with '”' and so forth. I have not been able to acquire a definitive list of which chars must be encoded, but it does not appear to be as simple as "all non-ASCII".
Currently, I have a horrid mess of VBScript using ADODB to directly check each line of the output file after processing, and replace characters where necessary. This is painfully slow, and unsurprisingly some characters get missed (and are consequently nuked by the target platform).
While I could waste time "refining" the VBScript, the long-term aim is to get rid of that entirely, and I'm sure there must be a faster and more accurate way of achieving this, ideally within the XSLT stage itself.
Can anyone suggest any fruitful avenues of investigation?
(edit: I'm not convinced that character maps are the answer - I've looked at them before, and unless I'm mistaken, since my input could conceivably contain any unicode character, I would need to have a map containing all of them except the ones I don't want encoded...)

<xsl:output encoding="us-ascii"/>
Tells the serialiser that it has to produce ASCII-compatible output. That should force it to produce character references for all non-ASCII characters in text content and attribute values. (Should there be non-ASCII in other places like tag or attribute names, serialisation will fail.)

Well with XSLT 2.0 you have tagged your post with you can use a character map, see http://www.w3.org/TR/xslt20/#character-maps.

Related

The list of unicode unusual characters

Where can I get the complete list of all unicode characters that doesn't behave as simple characters. Examples: character 0x0363 (won't be printed without another one before), character 0x0084 (does weird things when printed). I need just a raw list of such unusual characters to replace them with something harmless to avoid unwanted output effects. Regular characters (those who not in this list) should use exactly one character place when printed (= cursor moved +1 to the right), should not depend on previous or next characters, and should not affect printing style in any way.
Edit because of multiple comments:
I have some unicode string, usually consists of "usual" characters like 0x20-0x7E or cyrillic letters. Also, there are a lot of other unicode characters that are usual and may be safely assumed as having strlen() = 1. The string is printed on the terminal and I should know the resulting position of the cursor. I don't want to use some complex and non-stable libraries to do that, i want to have simplest possible logic to do that. Every problematic character may be replaced with U+0xFFFD or something like "<U+0363>" (ASCII string with its index instead of character itself). I want to have a list of "possibly-problematic" characters to replace. It is acceptable to have some non-problematic characters in this list too, but not much.
There is no simple algorithm for this. You'll likely need a complex, but extremely stable library: libicu, or something based on it. Basically every other library that does this kind of work is based on libicu, which is maintained by the Unicode organization.
If you don't want to use the official library (or something based on their library), you'll need to parse the Unicode Character Database yourself. In particular, you need to look at Character Properties, and parse the files in the UCD.
I believe you're asking for Bidi_Class (i.e. "direction") to be Left_To_Right, Canonical_Combining_Class to be Not_Reordered, and Joining_Type to be Non_Joining.
You probably also want to check the General_Category and avoid M* (Marks) and C* (Other).
This should work for some Emoji, but this whole approach will break a lot of emoji that look simple and are not. Most famously: ❀️, which is two "characters," not one. You may want to filter out Emoji. As a simple starting point, you may want to restrict yourself to the Basic Multilingual Plane (BMP), which are code points 0000-FFFF. Anything above this range is, almost by definition, rare or unusual. The BMP does include some emoji, but most emoji (and all new emoji) are outside the range.
Remember that the glyphs for single characters can still have radically different widths, even in nominally fixed-width fonts. For example, π’ˆ™ (U+12219 CUNEIFORM SIGN LUGAL OPPOSING LUGAL) is a completely "normal" character in the way you're describing. It is left-to-right. It doesn't depend on or influence characters around it (it's non-combining and non-joining). Its "length in characters" is 1. Its glyph is also extremely wide in most fonts and breaks a lot of layout. I don't know anything in the Unicode database that would warn you of this, since "glyph width" is entirely a function of fonts, not characters, and Unicode explicitly does not consider fonts. (That said, most of the most problematic characters are outside the BMP. Probably the most common exception is Η„, but many fixed-width fonts have a narrow glyph for it: Η„.)
Let's write some cuneiform in a fixed-width font.
Normally, every character should line up with a character above.
Here: π’ˆ™. See how these characters don't align correctly?
Not only is it a very wide glyph, but its width is not even a multiple.
At least not in my font (Mac Safari 15.0).
But Η„ is ok.
Also remember that there are multiple ways to encode the same "character." For example, Γ© can be a "simple" character (U+00E9), or it can be two characters (U+0065, U+0301). So in some cases Γ© may print in your scheme, and in others it won't. I suspect this is fine for your problem, but if it isn't, you're going to need to apply a normalization form (likely NFC).

How do I create a character set like ASCII?

I'm curious about the way that in the past it was implemented and I want to get information about how can I implement a character set of my own.
ASCII (American Standard Code for Information Interchange) was the "original" characterset, and remains the basis for most text data. ASCII is actually a 7-bit code (the numeric values range from 0 to 127) with the most significant bit of a byte indicating if the rest of the byte refers to ASCII (if zero) or the current Codepage.
Extra (non-ascii) characters were then added to these codepages, and the user's computer would load a specific codepage to use. Unfortunately this meant that you needed to load the correct codepage before viewing a file or the wrong characters would appear.
We have now moved on, and most systems use Unicode which is a variable character length (rather than the single-byte characters used previously) which can contain thousands upon thousands of characters, allowing for a single encoding to cater for what would have been multiple codepages using the ASCII+Codepage method of old.
That's the brief history; As to how to create your own characterset, I'm not sure what you are trying to achieve - You can create your own fonts, but if you're talking about an actual characterset (i.e. characters that do not already exist) then you'll have to get your characterset added to a standard such as Unicode so that other computers can make use of your new characters, which would be a considerable amount of work (and I have no idea how you'd even go about it) -- It's worth considering, however, that almost every character in existence already exists in Unicode so you may want to review what's already been done before you try and take on a mammoth undertaking such as creating an entirely new characterset.

JMeter CSV Data Set is corrupting Japanese strings stored as proper UTF-8, I get Question Marks instead

I read in search terms from a simple text file to send to a search engine.
It works fine in English, but gives me ???? for any Japanese text.
Text with mixed English and Japanese does show the English text, so I know it's reading it.
What I'm seeing:
Input text:
Snow Leopard γ‚’γ‚€γƒ³γ‚ΉγƒˆγƒΌγƒ«γ™γ‚‹ε ΄εˆγ€ζ–°γ—γ„
Turns into:
Snow Leopard ???????????????
This is in my POST field of an HTTP.
If I set JMeter to encode the data, it just puts in the percent sequence for question marks.
About the Data:
The CSV file is very simple in
structure.
There's only one field / one column,
which I name TERM, and later use as
${TERM}
I don't really need full CSV because it's only one string per line.
There's no commas or quotes.
It's UTF-8 and when I run the Unix "file" command on the file, it says UTF-8 text.
I've also verified UTF-8 in command line and graphical mode on two machines.
Interesting note:
An interesting coincidence that I noticed: if there are 15 Japanese characters then I get 15 question marks, so at some point it's being seen as full characters and not just bytes.
JMeter CSV Dataset Config:
Filename: japanese-searches.csv
File encoding: UTF-8 (also tried without)
Variable names: TERM
Delimiter: ,
Allow Quoted Data: False (I also tried True, different, but still wrong)
Recycle at EOF: True
Stop at EOF: False
Staring mode: All threads
A few things I've tried:
- Tried Allow quoted Data. It changed to other strange characters.
- Added -Dfile.encoding=UTF-8
- Tried encoding the POST stage, but it just turned into a bunch of %nn for question marks
And I'm not sure how "debug" just after the each line of the CSV is read in. I think it's corrupted right away, but I'm not sure.
If it's only mangled when I reference it, then instead of ${TERM} perhaps there's some other "to bytes" function call. I'll start checking into that. I haven't done anything with the JMeter functions yet.
Edited Dec 24:
Tweaks:
Changed formatting and added bullet
points for more clarity.
Clarified that the file is UTF-8, and have verified that.
A new theory:
Is it possible that the Japanese characters are making it through, and the issue is that EVERY SINGLE place that shows them maps them to a "?" at DISPLAY TIME only. So even though I've checked in a bunch of places, they all have a display issue just in the UI?
Is there a way in JMeter to see the numeric value of a character or string? Actually, to tell JMeter to display the list of Unicode code points?
I'll look at my last log files... although I suppose even the server logs could mis-mapped the characters.
Also, perhaps when doing variable expansion inside of the text field that I POST, where I reference the ${TERM}, maybe at that point it also maps to question marks, but that the corruption happens at that later point. If that happened, AND it was mis-displayed in the UI, then it might lead to a false conclusion.
What I'd really like to do is pause JMeter after the first CSV record, just after that line is loaded, and look at it with a "data scope" or byte editor or something. Not sure if this is possible.
Found the issue, there was another place the UTF-8 had to be specified.
In the HTTP Request, to the right of the Method, you have to also set Content Encoding to UTF-8
Yes, in hindsight, this seems obvious, but there were a number of reasons I didn't think this was needed. Some of my incorrect assumptions might be helpful for others who are debugging, so here goes - I would have thought that:
1: Once text has made it into Java as Unicode, it stays as Unicode, and goes in and out by UTF-8. Obviously not in this case.
2: I sort of thought HTTP defaulted to UTF-8 unless you say otherwise, but maybe I'm just used to XML, but probably not a good practice to assume that, and maybe HTTP defaults to ISO-Latin1 or something, or even if there's a spec, maybe folks don't follow it.
3: And if I don't specific it, I'd think the "do no harm" approach would be to pass the characters on, and let the receiver on the other end deal with it. Wrong again!
(OK, so points 1, 2 and 3 overlap a bit)
4: Even though my HTTP Request POST, I did still try the Encode checkbox. I certainly thought that would have encoded it, but all I got was the repeating % hex for question marks, so seemed to me that the data was already corrupted at that point. Wrong again. I suspect WITHIN the HTTP phase, there's TWO character transitions, first from Unicode to whatever encoding it thinks you have, and THEN a second encoding into the %signs, and my data was mis-encoded at the first step.
5: And I would have thought JMeter would say something or warn, but from my reading, apparently it's not helpful in that respect. You can do logging or whatever.
And the "?" is Java's way of reporting a problem BY default, this started in the Java 1.4x timeframe. In my Java code I prefer to set encoding errors to report as an exception, but again, not the default, and not what JMeter does.
So I learned my lesson.
The HINT that the Unicode was at least starting out OK was that the number of question marks equaled the number of Japanese characters, instead of having 2 or 3 times as many question marks. If the length of "???" matches your Japanese (or Chinese) string, then Java DID see actual Unicode characters at some point along the journey. Whereas if you see 3 times as many ?'s as input text, then Java always saw them as bytes or ints or whatever, and NEVER as valid codepoints.
Came across this topic when searching for solution to use parameters from csv file that contained some columns written in Hebrew.
I used Excel 2007 to create a 1000 lines data for user registrations. The first and the last names had to be in Hebrew.
I exported the file to "Unicode text" file. It became tab delimited.
"Unicode Text" saves in UTF-16 LE (Little Endian), not in UTF-8. That is important.
I opened the result in Notepad++. I could see the Hebrew letters properly. The Notepad++ has the "Encoding" menu item, where you can check the encoding or change it. So I changed the Little Endian to UTF-8.
Then I replaced tabs with commas (just selected the tab and pasted it into the Find box.
The parameters were substituted ok, but after running the script I saw the following:
In the "View Results Tree" listener I opened the "Result" tab of the "Http Request".
The parameters were substituted, but the HTTP view tab (on the bottom) of the Request showed me some gibberish.
But when I looked at the Raw view, I saw that the request parameters actually contained strings like %D7%A9%D7%A8%D7%9E%D7%95%D7%98%D7%94 that when taken in pairs (%D7 %A9) corersponded properly to Hebrew letters.
To my mind, the JMeter has a bug and can not properly display the unicode chars. But it sends (POSTs) them out ok.
Hope I am right and hope it will help someone.
You can try to use "SHIFT-JIS" in Content encoding (it's nearby Method selection). Then you should uncheck "Encode?" for parameter that included Japanese.
Hope it works you.

How do I recover a document that has been sent through the character encoding wringer?

Until recently, my blog used mismatched character encoding settings for PHP and MySQL. I have since fixed the underlying problem, but I still have a ton of text that is filled with garbage. For instance, Γ― has become ï.
Is there software that can use pattern recognition and statistics to automatically discover broken text and fix it?
For example, it looks like U+00EF (UTF-8 0xC3 0xAF) has become U+00C3 U+00AF (UTF-8 0xC3 0x83 0xC2 0xAF). In other words, the hexadecimal encoding has been used for the code points. This pattern has happened to (seemingly random) non-ASCII characters across my site.
The example you cite looks like good old utf8-over-latin1. You might quickly try out a query like:
select convert(convert(the_problem_column using binary) using utf8)
to see if it irons out the problem.
An encoding conversion along those lines should work as long as all of your data went through the same sequence of encoding transformations, and as long as none of those transformations were lossy - you're just reversing the effect of some of those transformations.
If you can't rely on the data having gone through the same set of encoding transformations, then it's a matter of scanning through the data for garbage characters and replacing them with the intended character, which is risky because it depends on somebody's definition of what was garbage and what was intended.
Some discussion in this answer on how you might do that kind of repair using handmade scripts. I don't know of a tool that's aware of the full range of natural languages and encodings, that takes a more advanced statistical approach in spotting possible problems, and that recommends the exact transformation to fix the problem - something like that would be useful.
You probably want to look into regex, http://en.wikipedia.org/wiki/Regular_expression.
Using this you can then search out and replace the characters in question.
Here is the MySQL regex documentation, http://dev.mysql.com/doc/refman/5.1/en/regexp.html.

Detect presence of a specific charset

I need a way to detect whether a file contains characters from a certain charset.
Specifically, I want to detect the presence of UTF8-encoded cyrillic characters in a series of files. Is there a tool to do this?
Thanks
If you are looking for ready solution, you might want to try Enca.
However, if you only want to detect presence of what can be possibly decoded as UTF-8 Cyrillic characters (without any complete UTF-8 validity checks), you just have to grep for something like /(\xD0[\x81\x90-\xBF]|\xD1[\x80-\x8F\x91]){n,}/ (this exact regexp is for n subsequent UTF8-encoded Russian Cyrillic characters). For additional check that the whole file contains only valid UTF-8 data you can use something like isutf8(1).
Both methods have their good and bad sides and may sometimes give wrong results.
IIRC the ICU library has code that does character set detection. Though it's basically a best effort guess.
Edit: I did remember correctly, check out this paper / tutorial