Interpretation of Greek characters by FOP - unicode

Can you please help me interpret the Greek Characters with HTML display as HTML= & #8062; and Hex value 01F7E
Details of these characters can be found on the below URL
http://www.isthisthingon.org/unicode/index.php?page=01&subpage=F&hilite=01F7E
When I run this character in Apache FOP, they give me an ArrayIndexOut of Bounds Exception
Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
at org.apache.fop.text.linebreak.LineBreakUtils.getLineBreakPairProperty(LineBreakUtils.java:668)
at org.apache.fop.text.linebreak.LineBreakStatus.nextChar(LineBreakStatus.java:117)
When I looked into the FOP Code, I was unable to understand the need for lineBreakProperties[][] Array in LineBreakUtils.java.
I also noticed that FOP fails for all the Greek characters mentioned on the above page which are non-displayable with the similar error.
What are these special characters ?
Why is their no display for these characters are these Line Breaks or TAB’s ?
Has anyone solved a similar issue with FOP ?

The U+1F7E code point is part of the Greek Extended Unicode block. But it is does not represent any actual character; it is a reserved but unassigned code point. Here is the chart from Unicode 6.0: http://www.unicode.org/charts/PDF/U1F00.pdf.
So the errors you are getting are perhaps not so surprising.

I ran a FO file that included the following <fo:block> through both FOP 0.95 and FOP 1.0:
<fo:block>Unassigned code point: ὾</fo:block>
I did get the same java.lang.ArrayIndexOutOfBoundsException that you are seeing.
When using an adjacent "real" character, there was no error:
<fo:block>Assigned code point: ώ</fo:block>
So it seems like you have to ensure that your datastream does not contain non-characters like U+1F7E.

Answer from Apache
At first glance, this seems like a minor oversight in the implementation of Unicode linebreaking in FOP. This does not take into account the possibility that a given codepoint is not assigned a 'class' in linebreaking context. (=
U+1F7E does not appear in the file
http://www.unicode.org/Public/UNIDATA/LineBreak.txt, which is used as a basis to generate those arrays in LineBreakUtils.java)
On the other hand, one could obviously raise the question why you so desperately need to have an unassigned codepoint in your output. Are you absolutely sure you need this? If yes, then can you elaborate on the exact reason? (i.e. What exactly is this unassigned codepoint used for?)
The most straightforward 'fix' seems to be roughly as follows:
Index: src/java/org/apache/fop/text/linebreak/LineBreakStatus.java
--- src/java/org/apache/fop/text/linebreak/LineBreakStatus.java (revision
1054383)
+++ src/java/org/apache/fop/text/linebreak/LineBreakStatus.java (working
copy)
## -87,6 +87,7 ##
/* Initial conversions */
switch (currentClass) {
+ case 0: // Unassigned codepoint: consider as AL?
case LineBreakUtils.LINE_BREAK_PROPERTY_AI:
case LineBreakUtils.LINE_BREAK_PROPERTY_SG:
case LineBreakUtils.LINE_BREAK_PROPERTY_XX:
What this does, is assign the class 'AL' or 'Alphabetic' to any codepoint that has not been assigned a class by Unicode. This means it will be treated as a regular letter.
Now, the reason why I am asking the question whether you are sure you know what you're doing, is that this may turn out to be undesirable. Perhaps the character in question needs to be treated as a space rather than a letter.
Unicode does not define U+1F7E other than as a 'reserved' character, so it makes sense that Unicode cannot say what should happen with this character in the context of linebreaking...
That said, it is also wrong of FOP to crash in this case, so the bug is definitely genuine.

Related

Encoding special chars in XSLT output

I have built a set of scripts, part of which transform XML documents from one vocabulary to a subset of the document in another vocabulary.
For reasons that are opaque to me, but apparently non-negotiable, the target platform (Java-based) requires the output document to have 'encoding="UTF-8"' in the XML declaration, but some special characters within text nodes must be encoded with their hex unicode value - e.g. '”' must be replaced with '”' and so forth. I have not been able to acquire a definitive list of which chars must be encoded, but it does not appear to be as simple as "all non-ASCII".
Currently, I have a horrid mess of VBScript using ADODB to directly check each line of the output file after processing, and replace characters where necessary. This is painfully slow, and unsurprisingly some characters get missed (and are consequently nuked by the target platform).
While I could waste time "refining" the VBScript, the long-term aim is to get rid of that entirely, and I'm sure there must be a faster and more accurate way of achieving this, ideally within the XSLT stage itself.
Can anyone suggest any fruitful avenues of investigation?
(edit: I'm not convinced that character maps are the answer - I've looked at them before, and unless I'm mistaken, since my input could conceivably contain any unicode character, I would need to have a map containing all of them except the ones I don't want encoded...)
<xsl:output encoding="us-ascii"/>
Tells the serialiser that it has to produce ASCII-compatible output. That should force it to produce character references for all non-ASCII characters in text content and attribute values. (Should there be non-ASCII in other places like tag or attribute names, serialisation will fail.)
Well with XSLT 2.0 you have tagged your post with you can use a character map, see http://www.w3.org/TR/xslt20/#character-maps.

Simplified Chinese Unicode table

Where can I find a Unicode table showing only the simplified Chinese characters?
I have searched everywhere but cannot find anything.
UPDATE :
I have found that there is another encoding called GB 2312 -
http://en.wikipedia.org/wiki/GB_2312
- which contains only simplified characters.
Surely I can use this to get what I need?
I have also found this file which maps GB2312 to Unicode -
http://cpansearch.perl.org/src/GUS/Unicode-UTF8simple-1.06/gb2312.txt
- but I'm not sure if it's accurate or not.
If that table isn't correct maybe someone could point me to one that is, or maybe just a table of the GB2312 characters and some way to convert them?
UPDATE 2 :
This site also provides a GB/Unicode table and even a Java program to generate a file
with all the GB characters as well as the Unicode equivalents :
http://www.herongyang.com/gb2312/
The Unihan database contains this information in the file Unihan_Variants.txt. For example, a pair of traditional/simplified characters are:
U+673A kTraditionalVariant U+6A5F
U+6A5F kSimplifiedVariant U+673A
In the above case, U+6A5F is 機, the traditional form of 机 (U+673A).
Another approach is to use the CC-CEDICT project, which publishes a dictionary of Chinese characters and compounds (both traditional and simplified). Each entry looks something like:
宕機 宕机 [dang4 ji1] /to crash (of a computer)/Taiwanese term for 當機|当机[dang4 ji1]/
The first column is traditional characters, and the second column is simplified.
To get all the simplified characters, read this text file and make a list of every character that appears in the second column. Note that some characters may not appear by themselves (only in compounds), so it is not sufficient to look at single-character entries.
The OP doesn't indicate which language they're using, but if you're using Ruby, I've written a small library that can distinguish between simplified and traditional Chinese (plus Korean and Japanese as a bonus). As suggested in Greg's answer, it relies on a distilled version of Unihan_Variants.txt to figure out which chars are exclusively simplified and which are exclusively traditional.
https://github.com/jpatokal/script_detector
Sample:
p string
=> "我的氣墊船充滿了鱔魚."
> string.chinese?
=> true
> string.traditional_chinese?
=> true
> string.simplified_chinese?
=> false
But as the Unicode FAQ duly warns, this requires sizable fragments of text to work reliably, and will give misleading results for short strings. Consider the Japanese for Tokyo:
p string
=> "東京"
> string.chinese?
=> true
> string.traditional_chinese?
=> true
> string.japanese?
=> false
Since both characters happen to also be valid traditional Chinese, and there are no exclusively Japanese characters, it's not recognized correctly.
I'm not sure if that's easily done. The Han ideographs are unified in Unicode, so it's not immediately obvious how to do it. But the Unihan database (http://www.unicode.org/charts/unihan.html) might have the data you need.
Here is a regex of all simplified Chinese characters I made. For some reason Stackoverflow is complaining, so it's linked in a pastebin below.
https://pastebin.com/xw4p7RVJ
You'll notice that this list features ranges rather than each individual character, but also that these are utf-8 characters, not escaped representations. It's served me well in one iteration or another since around 2010. Hopefully everyone else can make some use of it now.
If you don't want the simplified chars (I can't imagine why, it's not come up once in 9 years), iterate over all the chars from ['一-龥'] and try to build a new list. Or run two regex's, one to check it is Chinese, but is not simplified Chinese
According to wikipedia simplified Chinese v. traditional, kanji, or other formats is left up to the font rendering in many cases. So while you could have a selection of simplified Chinese codepoints, this list would not be at all complete since many characters are no longer distinct.
I don't believe that there's a table with only simplified code points. I think they're all lumped together in the CJK range of 0x4E00 through 0x9FFF

JMeter CSV Data Set is corrupting Japanese strings stored as proper UTF-8, I get Question Marks instead

I read in search terms from a simple text file to send to a search engine.
It works fine in English, but gives me ???? for any Japanese text.
Text with mixed English and Japanese does show the English text, so I know it's reading it.
What I'm seeing:
Input text:
Snow Leopard をインストールする場合、新しい
Turns into:
Snow Leopard ???????????????
This is in my POST field of an HTTP.
If I set JMeter to encode the data, it just puts in the percent sequence for question marks.
About the Data:
The CSV file is very simple in
structure.
There's only one field / one column,
which I name TERM, and later use as
${TERM}
I don't really need full CSV because it's only one string per line.
There's no commas or quotes.
It's UTF-8 and when I run the Unix "file" command on the file, it says UTF-8 text.
I've also verified UTF-8 in command line and graphical mode on two machines.
Interesting note:
An interesting coincidence that I noticed: if there are 15 Japanese characters then I get 15 question marks, so at some point it's being seen as full characters and not just bytes.
JMeter CSV Dataset Config:
Filename: japanese-searches.csv
File encoding: UTF-8 (also tried without)
Variable names: TERM
Delimiter: ,
Allow Quoted Data: False (I also tried True, different, but still wrong)
Recycle at EOF: True
Stop at EOF: False
Staring mode: All threads
A few things I've tried:
- Tried Allow quoted Data. It changed to other strange characters.
- Added -Dfile.encoding=UTF-8
- Tried encoding the POST stage, but it just turned into a bunch of %nn for question marks
And I'm not sure how "debug" just after the each line of the CSV is read in. I think it's corrupted right away, but I'm not sure.
If it's only mangled when I reference it, then instead of ${TERM} perhaps there's some other "to bytes" function call. I'll start checking into that. I haven't done anything with the JMeter functions yet.
Edited Dec 24:
Tweaks:
Changed formatting and added bullet
points for more clarity.
Clarified that the file is UTF-8, and have verified that.
A new theory:
Is it possible that the Japanese characters are making it through, and the issue is that EVERY SINGLE place that shows them maps them to a "?" at DISPLAY TIME only. So even though I've checked in a bunch of places, they all have a display issue just in the UI?
Is there a way in JMeter to see the numeric value of a character or string? Actually, to tell JMeter to display the list of Unicode code points?
I'll look at my last log files... although I suppose even the server logs could mis-mapped the characters.
Also, perhaps when doing variable expansion inside of the text field that I POST, where I reference the ${TERM}, maybe at that point it also maps to question marks, but that the corruption happens at that later point. If that happened, AND it was mis-displayed in the UI, then it might lead to a false conclusion.
What I'd really like to do is pause JMeter after the first CSV record, just after that line is loaded, and look at it with a "data scope" or byte editor or something. Not sure if this is possible.
Found the issue, there was another place the UTF-8 had to be specified.
In the HTTP Request, to the right of the Method, you have to also set Content Encoding to UTF-8
Yes, in hindsight, this seems obvious, but there were a number of reasons I didn't think this was needed. Some of my incorrect assumptions might be helpful for others who are debugging, so here goes - I would have thought that:
1: Once text has made it into Java as Unicode, it stays as Unicode, and goes in and out by UTF-8. Obviously not in this case.
2: I sort of thought HTTP defaulted to UTF-8 unless you say otherwise, but maybe I'm just used to XML, but probably not a good practice to assume that, and maybe HTTP defaults to ISO-Latin1 or something, or even if there's a spec, maybe folks don't follow it.
3: And if I don't specific it, I'd think the "do no harm" approach would be to pass the characters on, and let the receiver on the other end deal with it. Wrong again!
(OK, so points 1, 2 and 3 overlap a bit)
4: Even though my HTTP Request POST, I did still try the Encode checkbox. I certainly thought that would have encoded it, but all I got was the repeating % hex for question marks, so seemed to me that the data was already corrupted at that point. Wrong again. I suspect WITHIN the HTTP phase, there's TWO character transitions, first from Unicode to whatever encoding it thinks you have, and THEN a second encoding into the %signs, and my data was mis-encoded at the first step.
5: And I would have thought JMeter would say something or warn, but from my reading, apparently it's not helpful in that respect. You can do logging or whatever.
And the "?" is Java's way of reporting a problem BY default, this started in the Java 1.4x timeframe. In my Java code I prefer to set encoding errors to report as an exception, but again, not the default, and not what JMeter does.
So I learned my lesson.
The HINT that the Unicode was at least starting out OK was that the number of question marks equaled the number of Japanese characters, instead of having 2 or 3 times as many question marks. If the length of "???" matches your Japanese (or Chinese) string, then Java DID see actual Unicode characters at some point along the journey. Whereas if you see 3 times as many ?'s as input text, then Java always saw them as bytes or ints or whatever, and NEVER as valid codepoints.
Came across this topic when searching for solution to use parameters from csv file that contained some columns written in Hebrew.
I used Excel 2007 to create a 1000 lines data for user registrations. The first and the last names had to be in Hebrew.
I exported the file to "Unicode text" file. It became tab delimited.
"Unicode Text" saves in UTF-16 LE (Little Endian), not in UTF-8. That is important.
I opened the result in Notepad++. I could see the Hebrew letters properly. The Notepad++ has the "Encoding" menu item, where you can check the encoding or change it. So I changed the Little Endian to UTF-8.
Then I replaced tabs with commas (just selected the tab and pasted it into the Find box.
The parameters were substituted ok, but after running the script I saw the following:
In the "View Results Tree" listener I opened the "Result" tab of the "Http Request".
The parameters were substituted, but the HTTP view tab (on the bottom) of the Request showed me some gibberish.
But when I looked at the Raw view, I saw that the request parameters actually contained strings like %D7%A9%D7%A8%D7%9E%D7%95%D7%98%D7%94 that when taken in pairs (%D7 %A9) corersponded properly to Hebrew letters.
To my mind, the JMeter has a bug and can not properly display the unicode chars. But it sends (POSTs) them out ok.
Hope I am right and hope it will help someone.
You can try to use "SHIFT-JIS" in Content encoding (it's nearby Method selection). Then you should uncheck "Encode?" for parameter that included Japanese.
Hope it works you.

Japanese COBOL Code: rules for G literals and identifiers?

We are processing IBMEnterprise Japanese COBOL source code.
The rules that describe exactly what is allowed in G type literals,
and what are allowed for identifiers are unclear.
The IBM manual indicates that a G'....' literal
must have a SHIFT-OUT as the first character inside the quotes,
and a SHIFT-IN as the last character before the closing quote.
Our COBOL lexer "knows" this, but objects to G literals
found in real code. Conclusion: the IBM manual is wrong,
or we are misreading it. The customer won't let us see the code,
so it is pretty difficult to diagnose the problem.
EDIT: Revised/extended below text for clarity:
Does anyone know the exact rules of G literal formation,
and how they (don't) match what the IBM reference manuals say?
The ideal answer would a be regular expression for the G literal.
This is what we are using now (coded by another author, sigh):
#token non_numeric_literal_quote_g [STRING]
"<G><squote><ShiftOut> (
(<NotLineOrParagraphSeparatorNorShiftInNorShiftOut>|<squote><squote>|<ShiftOut>)
(<NotLineOrParagraphSeparator>|<squote><squote>)
| <ShiftIn> ( <NotLineOrParagraphSeparatorNorApostropheNorShiftInNorShiftOut>|
<ShiftIn>|<ShiftOut>)
| <squote><squote>
)* <ShiftIn><squote>"
where <name> is a macro that is another regular expression. Presumably they
are named well enough so you can guess what they contain.
Here is the IBM Enterprise COBOL Reference.
Chapter 3 "Character Strings", subheading "DBCS literals" page 32 is relevant reading.
I'm hoping that by providing the exact reference, an experienced IBMer can tell us how we misread it :-{ I'm particularly unclear on what the phrase "DBCS-characters" means
when it says "one or more characters in the range X'00...X'FF for either byte"
How can DBCS-characters be anything but pairs of 8-bit character codes?
The existing RE matches 3 types of pairs of characters if you examine it.
One answer below suggests that the <squote><squote> pairing is wrong.
OK, I might believe that, but that means the RE would only reject
literal strings containing single <squote>s. I don't believe that's
the problem we are having as we seem to trip over every instance of a G literal.
Similarly, COBOL identifiers can apparantly be composed
with DBCS characters. What is allowed for an identifier, exactly?
Again a regular expression would be ideal.
EDIT2: I'm beginning to think the problem might not be the RE.
We are reading Shift-JIS encoded text. Our reader converts that
text to Unicode as it goes. But DBCS characters are really
not Shift-JIS; rather, they are binary-coded data. Likely
what is happening is the that DBCS data is getting translated
as if it were Shift-JIS, and that would muck up the ability
to recognize "two bytes" as a DBCS element. For instance,
if a DBCS character pair were :81 :1F, a ShiftJIS reader
would convert this pair into a single Unicode character,
and its two-byte nature is then lost. If you can't count pairs,
you can't find the end quote. If you can't find the end quote,
you can't recognize the literal. So the problem would appear
to be that we need to switch input-encoding modes in the middle
of the lexing process. Yuk.
Try to add a single quote in your rule to see if it passes by making this change,
<squote><squote> => <squote>{1,2}
If I remember it correctly, one difference between N and G literals is that G allows single quote. Your regular expression doesn't allow that.
EDIT: I thought you got all other DBCS literals working and just having issues with G-string so I just pointed out the difference between N and G. Now I took a closer look at your RE. It has problems. In the Cobol I used, you can mix ASCII with Japanese, for example,
G"ABC<ヲァィ>" <> are Shift-out/shift-in
You RE assumes the DBCS only. I would loose this restriction and try again.
I don't think it's possible to handle G literals entirely in regular expression. There is no way to keep track of matching quotes and SO/SI with a finite state machine alone. Your RE is so complicated because it's trying to do the impossible. I would just simplify it and take care of mismatching tokens manually.
You could also face encoding issues. The code could be in EBCDIC (Katakana) or UTF-16, treating it as ASCII will not work. SO/SI sometimes are converted to 0x1E/0x1F on Windows.
I am just trying to help you shoot in the dark without seeing the actual code :)
Does <NotLineOrParagraphSeparatorNorApostropheNorShiftInNorShiftOut> also include single and double quotation marks, or just apostrophes? That would be a problem, as it would consume the literal closing character sequence >' ...
I would check the definition of all other macros to make sure. The only obvious problem that I can see is the <squote><squote> that you already seem to be aware of.

How to detect malformed UTF characters

I want to detect and replace malformed UTF-8 characters with blank space using a Perl script while loading the data using SQL*Loader. How can I do this?
Consider Python. It allows to extend codecs with user-defined error handlers, so you can replace undecodable bytes with anything you want.
import codecs
codecs.register_error('spacer', lambda ex: (u' ', ex.start + 1))
s = 'spam\xb0\xc0eggs\xd0bacon'.decode('utf8', 'spacer')
print s.encode('utf8')
This prints:
spam eggs bacon
EDIT: (Removed bit about SQL Loader as it seems to no longer be relevant.)
One problem is going to be working out what counts as the "end" of a malformed UTF-8 character. It's easy to say what's illegal, but it may not be obvious where the next legal character starts.
RFC 3629 describes the structure of UTF-8 characters. If you take a look at that, you'll see that it's pretty straightforward to find invalid characters, AND that the next character boundary is always easy to find (it's a character < 128, or one of the "long character" start markers, with leading bits of 110, 1110, or 11110).
But BKB is probably correct - the easiest answer is to let perl do it for you, although I'm not sure what Perl does when it detects the incorrect utf-8 with that filter in effect.