How do I use FXML and properties files with non-Latin characters? - unicode

I need to create i18n properties files for non-Latin languages (simplified Chinese, Japanese Kanji, etc.). With the Swing portion of our product, we use Java properties files with the raw UTF-8 characters in them, which NetBeans automatically converts to 8859-1 for us, and it works fine. With JavaFX, this strategy isn't working. Our strategy matches this answer precisely, but it doesn't seem to work in this case.
In my investigation into the problem, I discovered this old article indicating that I need to use native2ascii to convert the characters in the properties file; that still doesn't work.
In order to eliminate as many variables as possible, I created a sample FXML project to illustrate the problem. There are three internationalized labels in Japanese Kanji. The first label has the text in the FXML document. The second loads the raw unescaped character from the properties file. The third loads the escaped Unicode (matching native2ascii output).
jp_test.properties
btn.one=閉じる
btn.two=\u00e9\u2013\u2030\u00e3\ufffd\u02dc\u00e3\u201a\u2039
jp_test.fxml
<?xml version="1.0" encoding="UTF-8"?>
<?import java.lang.*?>
<?import java.util.*?>
<?import javafx.scene.control.*?>
<?import javafx.scene.layout.*?>
<?import javafx.scene.paint.*?>
<?scenebuilder-preview-i18n-resource jp_test.properties?>
<AnchorPane id="AnchorPane" maxHeight="-Infinity" maxWidth="-Infinity" minHeight="-Infinity" minWidth="-Infinity" prefHeight="147.0" prefWidth="306.0" xmlns:fx="http://javafx.com/fxml">
<children>
<Label layoutX="36.0" layoutY="33.0" text="閉じる" />
<Label layoutX="36.0" layoutY="65.0" text="%btn.one" />
<Label layoutX="36.0" layoutY="97.0" text="%btn.two" />
<Label layoutX="132.0" layoutY="33.0" text="Static Label" textFill="RED" />
<Label layoutX="132.0" layoutY="65.0" text="Properties File Unescaped" textFill="RED" />
<Label layoutX="132.0" layoutY="97.0" text="Properties File Escaped" textFill="RED" />
</children>
</AnchorPane>
Result
As you can see, the third label is not rendered correctly.
Environment:
Java 7 u21, u27, u45, u51, 32-bit and 64-bit. (JavaFX 2.2.3-2.2.45)
Windows 7 Enterprise, Professional 64-bit.
UPDATE
I've verified that the properties file is ISO 8859-1.

Most IDEs (NetBeans at least) handle properties files with Unicode escaping by default. If you create the properties file in NetBeans and enter the Japanese text there, the text is automatically stored as \uXXXX escapes. To see this, open the properties file in Notepad(++); you will see that the Japanese characters are escaped.
The escaped equivalent of "閉じる" is "\u9589\u3058\u308b", whereas "\u00e9\u2013\u2030\u00e3\ufffd\u02dc\u00e3\u201a\u2039" decodes to "é–‰ã�˜ã‚‹". So the program output in the picture is correct. Additionally, if you reopen the jp_test.properties file in NetBeans, the escaped text is displayed in decoded form.
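A minimal sketch of the point above, using only the JDK. The class and method names are hypothetical, and the use of windows-1252 as the wrongly assumed charset is an assumption that happens to reproduce the escapes from the question. It escapes the label text, simulates reading its UTF-8 bytes in the wrong charset, and loads an escaped value back through java.util.Properties.

```java
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

class EscapeDemo {
    // Escape every non-ASCII UTF-16 unit as a backslash-u sequence,
    // similar to what native2ascii produces.
    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c < 128) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    // Simulate the mistake: UTF-8 bytes decoded as if they were windows-1252
    // (assumed here to be the platform default on the affected machine).
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1252"));
    }

    // Parse a properties snippet and return the value stored under the key.
    static String loadValue(String propertiesText, String key) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        String label = "\u9589\u3058\u308b"; // the label text from the question
        System.out.println(escape(label));            // the correct escapes
        System.out.println(escape(misdecode(label))); // the escapes from the question
        System.out.println(loadValue("btn.two=" + escape(label), "btn.two").equals(label));
    }
}
```

The second printed line should match the btn.two value in the question, which suggests the conversion was run with a non-UTF-8 default charset.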
EDIT: as per comment,
Why does it do this?
It may be because you are omitting the -encoding parameter of native2ascii, and your system's default charset is not UTF-8. That would explain the output.
Also, why is it that Java and Swing have no problems with our properties files as they are,
but FXML can't handle it?
That cannot be the case, because FXML is processed by Java like everything else. The only difference may be "using the system charset" versus "overriding the charset in some configuration place".
Anyway, I suggest passing native2ascii the -encoding parameter that matches the encoding of the input files. More specifically, convert the properties files to UTF-8 first, then do the rest. If you are using NetBeans as your IDE, there is no need for native2ascii.

Properties files should be ISO 8859-1 encoded, not UTF-8.
Characters can be escaped using \uXXXX.
Tools such as NetBeans are doing this by default, AFAIK.
http://docs.oracle.com/javase/7/docs/api/java/util/Properties.html
http://docs.oracle.com/javase/tutorial/i18n/text/convertintro.html
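If the files must stay in raw UTF-8, one possible workaround is to decode them explicitly with a Reader instead of relying on the stream-based default. The sketch below is an illustration under that assumption, not the approach used in the question; the class and method names are hypothetical, and the file content is simulated with an in-memory byte array. The resulting ResourceBundle could then be passed to FXMLLoader.load(URL, ResourceBundle).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Properties;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

class Utf8BundleDemo {
    // Build a bundle from raw UTF-8 bytes by decoding them explicitly.
    static ResourceBundle fromUtf8(byte[] utf8Properties) {
        try {
            return new PropertyResourceBundle(new InputStreamReader(
                    new ByteArrayInputStream(utf8Properties), StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Load the same bytes through the InputStream overload, which is
    // specified to decode as ISO 8859-1.
    static String viaStream(byte[] bytes, String key) {
        Properties props = new Properties();
        try {
            props.load(new ByteArrayInputStream(bytes));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        String label = "\u9589\u3058\u308b";
        byte[] file = ("btn.one=" + label + "\n").getBytes(StandardCharsets.UTF_8);
        System.out.println(fromUtf8(file).getString("btn.one").equals(label)); // true
        System.out.println(viaStream(file, "btn.one").equals(label));          // false
    }
}
```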

Related

having problems opening DITA files in OxygenXML which contain special characters

I am having problems opening files which contain special characters like é, è, ë, ê, à, á, ö, etc. The error message I get from OxygenXML is:
File encoding (UTF8) does not support all characters from the current file.
To ignore these errors or to replace invalid characters follow the link below to change the "Encoding errors handling" option value from REPORT to IGNORE or REPLACE.
The strange thing is: when I alter the file (by swapping the 'ó' for an 'o', for instance), I can import the files both in OxygenXML and in FontoXML. Afterwards I can correct them again and save the file. But I don't see a difference between the original file and the altered file.
This is the original file
<p id="id-9f3a1788-a751-4f48-ed9c-9e19447ad3b0">Ze is zó zenuwachtig, dat ze bijna aan de ... moet .</p>
And this is the saved corrected file (from FontoXML, in this case - just to show the added instructions):
<p id="id-9f3a1788-a751-4f48-ed9c-9e19447ad3b0">Ze is
z<?fontoxml-change-addition-start author-id="erik.verhaar" change-id="6f6bb382-3d43-4c5b-b35f-f857d729cf22" timestamp="1627473671530"?>ó<?fontoxml-change-addition-end change-id="6f6bb382-3d43-4c5b-b35f-f857d729cf22"?><?fontoxml-change-deletion author-id="erik.verhaar" change-id="0296c77c-863b-421f-bf5c-c0901c7a2751" text="ó" timestamp="1627473669483"?>
zenuwachtig, dat ze bijna aan de ... moet .</p>
What is the difference between the original ó and the corrected one? And how can I change my original files so they can be imported in OxygenXML?
Thanks!!
Text files (XML, for example) are saved on disk as bytes, but they are edited and presented as characters. An encoding takes care of converting bytes to characters when the document is opened (sometimes multiple bytes are converted to a single character), and the same encoding converts characters back to bytes when the document is saved.
There are many encodings, but with the most popular ones (like UTF-8), characters belonging to the 0-127 ASCII range, like a-z and A-Z, are saved as a single byte. Characters outside that range, for example e-acute (é), usually get saved as multiple bytes, depending on the encoding used for saving.
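To make the byte counts concrete, here is a small sketch (hypothetical class and method names) that prints how many bytes a character takes in UTF-8 versus a single-byte code page. windows-1252 is used only as an example of such a code page.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ByteCountDemo {
    // Number of bytes the text occupies in the given encoding.
    static int byteCount(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    // Space-separated uppercase hex dump of the encoded bytes.
    static String hex(String text, Charset charset) {
        StringBuilder out = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("%02X", b & 0xff));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        String oAcute = "\u00f3"; // the accented o from the question
        System.out.println(hex("z", StandardCharsets.UTF_8));    // 7A
        System.out.println(hex(oAcute, StandardCharsets.UTF_8)); // C3 B3
        System.out.println(hex(oAcute, cp1252));                 // F3
    }
}
```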
When an XML document is opened, Oxygen attempts to determine which encoding to use for reading it. If the XML document has a heading like this:
<?xml version='1.0' encoding='UTF-8'?>
Oxygen uses the encoding specified in the heading. If the XML document lacks the heading, Oxygen falls back to UTF-8. Basically, Oxygen implements the XML specification when it comes to detecting the encoding of the XML file:
https://www.w3.org/TR/xml/#sec-guessing
In your case Oxygen detected the encoding as UTF-8 and started to use UTF-8 to convert bytes to characters. It encountered a sequence of bytes which were not encoded using UTF-8. Oxygen does not continue loading the file because in such cases you may end up with corrupt content when saving it back.
In my opinion, the other editing tool you used to create the XML files was not XML-aware: it did not actually save the XML as UTF-8, even though the heading in the XML document specified this.
We do not actually know which encoding that other editing tool used to save the XML. One thing you could try is to reopen the XML document in that other editing tool and change its encoding heading declaration from:
<?xml version='1.0' encoding='UTF-8'?>
to:
<?xml version='1.0' encoding='CP1250'?>
because I suspect that the other editing tool actually saved the XML document using the default platform encoding, which on Windows is usually a legacy code page such as CP1250 (or CP1252 on Western European systems).
Then save the XML document in the other editing tool and try to reopen it in Oxygen. If it works, change its heading encoding declaration back to UTF-8 and save the XML document in Oxygen so that it is properly saved using the UTF-8 encoding.
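The same check and repair can also be done outside any editor. The following sketch (hypothetical class and method names) decodes strictly, so invalid byte sequences are reported rather than silently replaced, and then re-encodes the bytes as UTF-8. It assumes the file was really written in windows-1252; if the actual encoding is different, the charset would have to be changed accordingly.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class EncodingRepair {
    // True when the bytes decode as UTF-8 without any malformed sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Decode bytes written in a legacy code page and re-encode them as UTF-8.
    static byte[] toUtf8(byte[] bytes, Charset actualEncoding) {
        return new String(bytes, actualEncoding).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Charset legacy = Charset.forName("windows-1252"); // assumed real encoding
        // Simulate a file that declares UTF-8 but was saved in the legacy code page.
        byte[] onDisk = "<p>Ze is z\u00f3 zenuwachtig</p>".getBytes(legacy);
        System.out.println(isValidUtf8(onDisk));                 // false
        System.out.println(isValidUtf8(toUtf8(onDisk, legacy))); // true
    }
}
```

For real files, the byte arrays would typically come from java.nio.file.Files.readAllBytes and be written back with Files.write.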
This older set of slides I made about XML encoding might also be useful to you:
https://www.oxygenxml.com/events/2018/large_xml_documents.pdf

VSCode: Delete all occurrences of xml tag pair including differing contents

I'm working in a kml (xml) file in VSCode. There are 267 instances of the <description></description> tags with the same contents schema but different contents. I would like a fast way to delete all of the instances of <description> including the contents instead of manually deleting each one. I'm not married to VSCode if Notepad++ or another editor will do what I'm trying to do.
Use one command/macro to delete both of these (plus 265 more)
<description><![CDATA[<center><table><tr><th colspan='2' align='center'>
<em>Attributes</em></th></tr><tr bgcolor="#E3E3F3">
<th>NAME</th>
<td>Anderson</td>
</tr><tr bgcolor="#E3E3F3">
</tr></table></center>]]>
</description>
<description><![CDATA[<center><table><tr><th colspan='2' align='center'>
<em>Attributes</em></th></tr><tr bgcolor="#E3E3F3">
<th>NAME</th>
<td>Billingsly</td>
</tr><tr bgcolor="#F00000">
</tr></table></center>]]>
</description>
Thank you, Paul
You can use this regex in vscode find/replace:
\n?<description>[\S\s\n]*?<\/description>\n?
and replace with nothing. The \n?'s at the beginning and end are there if you want to delete the lines the tags occur on as well - see how it works, you can remove those if you don't care about empty lines where your deleted content used to be.
Obviously, if you have malformed input, like unmatched <description> or </description> tags the regex won't work.
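For anyone who prefers to script this instead of using an editor, here is a rough Java equivalent of the same idea. The class and method names are hypothetical. This sketch drops only the line break after the closing tag, because consuming a line break on both sides would join the neighbouring lines.

```java
import java.util.regex.Pattern;

class DescriptionStripper {
    // Lazily match from an opening tag to the nearest closing tag, across
    // line breaks, plus one optional trailing line break.
    private static final Pattern DESCRIPTION =
            Pattern.compile("<description>[\\s\\S]*?</description>\\n?");

    // Remove every description element, including its contents.
    static String strip(String kml) {
        return DESCRIPTION.matcher(kml).replaceAll("");
    }

    public static void main(String[] args) {
        String kml = "<Placemark>\n"
                + "<description><![CDATA[<td>Anderson</td>]]>\n"
                + "</description>\n"
                + "<name>A</name>\n"
                + "</Placemark>";
        System.out.println(strip(kml));
    }
}
```

As with the editor regex, this only works on well-formed input where every opening tag has a matching closing tag.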

Apache FOP insert special character [duplicate]

I am maintaining a program which uses the Apache FOP for printing PDF documents. There have been a couple complaints about the Chinese characters coming up as "####". I have found an existing thread out there about this problem and done some research on my side.
http://apache-fop.1065347.n5.nabble.com/Chinese-Fonts-td10789.html
I do have the uming.ttf font files installed on my system. Unlike the person in this thread, I am still getting the "####".
From this point forward, has anyone seen a work around that would allow you to print complex characters in a PDF document using Apache FOP?
Three steps must be taken for Chinese characters to show correctly in a PDF file created with FOP (the same applies to any character not available in the default font and, more generally, to using a non-default font).
Let us use this simple fo example to show the warnings produced by FOP when something is wrong:
<?xml version="1.0" encoding="UTF-8"?>
<fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">
<fo:layout-master-set>
<fo:simple-page-master master-name="one">
<fo:region-body />
</fo:simple-page-master>
</fo:layout-master-set>
<fo:page-sequence master-reference="one">
<fo:flow flow-name="xsl-region-body">
<!-- a block of chinese text -->
<fo:block>博洛尼亚大学中国学生的毕业论文</fo:block>
</fo:flow>
</fo:page-sequence>
</fo:root>
Processing this input, FOP gives several warnings similar to this one:
org.apache.fop.events.LoggingEventListener processEvent
WARNING: Glyph "?" (0x535a) not available in font "Helvetica".
...
Without any explicit font-family indication in the FO file, FOP defaults to using Helvetica, which is one of the Base-14 fonts (fonts that are available everywhere, so there is no need to embed them).
Each font supports a set of characters, assigning a visible glyph to each of them; when a font does not support a character, the above warning is produced, and the PDF shows "#" instead of the missing glyph.
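Because the console often cannot show the character itself (hence the "?" in the warning), the hexadecimal value in parentheses is the useful part: it is the Unicode code point of the character without a glyph. The small helper below (hypothetical names, plain JDK, not part of FOP) converts between text and such codes so that warnings can be matched to the input.

```java
import java.util.ArrayList;
import java.util.List;

class GlyphCodes {
    // Format one code point in the style used by the warning, e.g. 0x535a.
    static String hex(int codePoint) {
        return String.format("0x%04x", codePoint);
    }

    // List the hex code of every code point in the text, in order.
    static List<String> codes(String text) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            out.add(hex(cp));
            i += Character.charCount(cp);
        }
        return out;
    }

    // Turn a hex code from a warning back into the character it denotes.
    static String fromCode(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        // First two characters of the sample block, written as escapes.
        System.out.println(codes("\u535a\u6d1b")); // [0x535a, 0x6d1b]
        System.out.println(fromCode(0x535a).equals("\u535a")); // true
    }
}
```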
Step 1: set font-family in the FO file
If the default font doesn't support the characters of our text (or we simply want to use a different font), we must use the font-family property to state the desired one.
The value of font-family is inherited, so if we want to use the same font for the whole document we can set the property on the fo:page-sequence; if we need a special font just for some paragraphs or words, we can set font-family on the relevant fo:block or fo:inline.
So, our input becomes (using a font I have as example):
<?xml version="1.0" encoding="UTF-8"?>
<fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">
<fo:layout-master-set>
<fo:simple-page-master master-name="one">
<fo:region-body />
</fo:simple-page-master>
</fo:layout-master-set>
<fo:page-sequence master-reference="one">
<fo:flow flow-name="xsl-region-body">
<!-- a block of chinese text -->
<fo:block font-family="SimSun">博洛尼亚大学中国学生的毕业论文</fo:block>
</fo:flow>
</fo:page-sequence>
</fo:root>
But now we get a new warning, in addition to the old ones!
org.apache.fop.events.LoggingEventListener processEvent
WARNING: Font "SimSun,normal,400" not found. Substituting with "any,normal,400".
org.apache.fop.events.LoggingEventListener processEvent
WARNING: Glyph "?" (0x535a) not available in font "Times-Roman".
...
FOP doesn't know how to map "SimSun" to a font file, so it defaults to a generic Base-14 font (Times-Roman) which does not support our Chinese characters, and the PDF still shows "#".
Step 2: configure font mapping in FOP's configuration file
Inside FOP's folder, the file conf/fop.xconf is an example configuration; we can directly edit it or make a copy to start from.
The configuration file is an XML file, and we have to add the font mappings inside /fop/renderers/renderer[@mime = 'application/pdf']/fonts/ (there is a renderer section for each possible output mime type, so check that you are inserting your mapping in the right one):
<?xml version="1.0"?>
<fop version="1.0">
...
<renderers>
<renderer mime="application/pdf">
...
<fonts>
<!-- specific font mapping -->
<font kerning="yes" embed-url="/Users/furini/Library/Fonts/SimSun.ttf" embedding-mode="subset">
<font-triplet name="SimSun" style="normal" weight="normal"/>
</font>
<!-- "bulk" font mapping -->
<directory>/Users/furini/Library/Fonts</directory>
</fonts>
...
</renderer>
...
</renderers>
</fop>
each font element points to a font file
each font-triplet entry identifies a combination of font-family + font-style (normal, italic, ...) + font-weight (normal, bold, ...) mapped to the font file in the parent font element
using directory elements it is also possible to automatically configure all the font files inside the indicated folders (but this takes some time if the folders contain a lot of fonts)
If we have a complete file set with specific versions of the desired font (normal, italic, bold, light, bold italic, ...) we can map each file to the precise font triplet, thus producing a very sophisticated PDF.
On the opposite end of the spectrum, we can map all the triplets to the same font file, if it's all we have available: in the output all text will appear the same, even if parts of it were marked as italic or bold in the FO file.
Note that we don't need to register all possible font triplets; if one is missing, FOP will use the font registered for a "similar" one (for example, if we don't map the triplet "SimSun,italic,400" FOP will use the font mapped to "SimSun,normal,400", warning us about the font substitution).
We are not done yet, as without the next and last step nothing changes when we process our input file.
Step 3: tell FOP to use the configuration file
If we are calling FOP from the command line, we use the -c option to point to our configuration file, for example:
$ fop -c /path/to/our/fop.xconf input.fo input.pdf
From Java code we can use (see also FOP's site):
fopFactory.setUserConfig(new File("/path/to/our/fop.xconf"));
Now, at last, the PDF should correctly use the desired fonts and appear as expected.
If instead FOP terminates abruptly with an error like this:
org.apache.fop.cli.Main startFOP
SEVERE: Exception org.apache.fop.apps.FOPException: Failed to resolve font with embed-url '/Users/furini/Library/Fonts/doesNotExist.ttf'
it means that FOP could not find the font file, and the font configuration needs to be checked again; typical causes are
a typo in the font url
insufficient privileges to access the font file

Sitecore: Valid item names

How do I expand the list of valid characters in item names, to include æøåÆØÅ?
By default, the valid characters seem to be defined by this rule in web.config:
<setting name="ItemNameValidation" value="^[\w\*\$][\w\s\-\$]*(\(\d{1,}\)){0,1}$" />
changing the regex to :
<setting name="ItemNameValidation" value="^[\wæøåÆØÅ\*\$][\wæøåÆØÅ\s\-\$]*(\(\d{1,}\)){0,1}$" />
Should in theory allow the characters, but that just "kills" the sitecore.
Edit:
A regex that allows dots works perfectly, like this:
<setting name="ItemNameValidation" value="^[\w\*\$][\w\.\s\-\$]*(\(\d{1,}\)){0,1}$" />
So I am allowed to change some aspects of it, just not for the æøå characters?!?!?
Note:
- Using æøå in item names is for some reason possible from the "Page Editor", when creating and saving new content items, but it is not possible to do the same from the "Content Editor"!
- We are using SC v6.6.0 (rev. 120918).
Cause of error was not saving the file as UTF-8
Make sure your config file is saved as "UTF-8"
A bit late, but adding as an answer :)

UIWebView, quote characters with Arial font not showing up correctly

I have some .html with the font defined as:
<font color="white" face="Arial">
I have no other style applied to my tag. In it, when I display data like:
<b> “Software” </b>
or
<b>“Software”</b>
they both display characters I do not want in the UIWebView. It looks like this on a black background:
How do I avoid that? If I don't use font face="arial", it works fine.
This is an encoding issue. Make sure you use the same encoding everywhere. UTF8 is probably the best choice.
You can put a line
<meta http-equiv="content-type" content="text/html;charset=UTF-8" />
in your html to tell UIWebView about the encoding.
To be precise, â€œ is what you get when you take the UTF-8 encoding of “ and interpret it as ISO-8859-1 (in practice, Windows-1252). So your data is encoded in UTF-8, which is good, and you just need to set the content type to UTF-8 instead of ISO-8859-1 (e.g. using the <meta> tag above).
You shouldn’t generally use the curly quote characters themselves; character encodings will always mess you up somehow. No idea why it works correctly when you don’t use Arial (though that suggests a great idea: don’t use Arial), but your best bet is to use the HTML entities &ldquo; and &rdquo; instead.
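If the HTML is generated by a program, both points can be illustrated in a few lines. This is a sketch with hypothetical names, written in Java only for illustration (the app itself would do the equivalent in its own language): one method replaces curly double quotes with the named entities &ldquo; and &rdquo; so the markup stays pure ASCII, and the other reproduces the garbled output by decoding UTF-8 bytes as windows-1252.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CurlyQuotes {
    // Replace curly double quotes with named HTML entities.
    static String toEntities(String html) {
        return html.replace("\u201c", "&ldquo;").replace("\u201d", "&rdquo;");
    }

    // Show what UTF-8 bytes look like when a viewer assumes windows-1252.
    static String misread(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        String html = "<b>\u201cSoftware\u201d</b>";
        System.out.println(toEntities(html)); // <b>&ldquo;Software&rdquo;</b>
        // The opening quote misread: a-circumflex, euro sign, oe ligature.
        System.out.println(misread("\u201c").equals("\u00e2\u20ac\u0153")); // true
    }
}
```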