Having problems opening DITA files in OxygenXML which contain special characters

I am having problems opening files which contain special characters like é, è, ë, ê, à, á, ö, etc. The error message I get from OxygenXML is:
File encoding (UTF8) does not support all characters from the current file.
To ignore these errors or to replace invalid characters follow the link below to change the "Encoding errors handling" option value from REPORT to IGNORE or REPLACE.
The strange thing is: when I alter the file (by swapping the 'ó' for an 'o', for instance), I can import the files both in OxygenXML and in FontoXML. Afterwards I can correct them again and save the file. But I don't see a difference between the original file and the altered file.
This is the original file:
<p id="id-9f3a1788-a751-4f48-ed9c-9e19447ad3b0">Ze is zó zenuwachtig, dat ze bijna aan de ... moet .</p>
And this is the saved corrected file (from FontoXML, in this case - just to show the added instructions):
<p id="id-9f3a1788-a751-4f48-ed9c-9e19447ad3b0">Ze is
z<?fontoxml-change-addition-start author-id="erik.verhaar" change-id="6f6bb382-3d43-4c5b-b35f-f857d729cf22" timestamp="1627473671530"?>ó<?fontoxml-change-addition-end change-id="6f6bb382-3d43-4c5b-b35f-f857d729cf22"?><?fontoxml-change-deletion author-id="erik.verhaar" change-id="0296c77c-863b-421f-bf5c-c0901c7a2751" text="ó" timestamp="1627473669483"?>
zenuwachtig, dat ze bijna aan de ... moet .</p>
What is the difference between the original ó and the corrected one? And how can I change my original files so they can be imported in OxygenXML?
Thanks!!

Text files (XML, for example) are saved on disk as bytes but edited and presented as characters. An encoding takes care of converting bytes to characters (sometimes multiple bytes map to a single character) when the document is opened, and the same encoding converts the characters back to bytes when the document is saved.
There are many encodings, but with the most popular ones (like UTF-8), characters in the 0-127 ASCII range, like a-z and A-Z, are saved as a single byte. Characters outside that range, for example e-acute (é), usually get saved as multiple bytes, depending on the encoding used for saving.
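To make this concrete, here is a small standalone Java sketch (not Oxygen code) that prints the bytes a character is saved as under two different encodings:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingBytes {
    public static void main(String[] args) {
        printBytes("a", StandardCharsets.UTF_8);      // 61 - one byte, ASCII range
        printBytes("é", StandardCharsets.UTF_8);      // C3 A9 - two bytes
        printBytes("é", StandardCharsets.ISO_8859_1); // E9 - one byte in Latin-1
    }

    // Print the raw bytes a string is saved as in the given encoding
    static void printBytes(String s, Charset cs) {
        StringBuilder hex = new StringBuilder();
        for (byte b : s.getBytes(cs)) {
            hex.append(String.format("%02X ", b));
        }
        System.out.println("\"" + s + "\" as " + cs + ": " + hex);
    }
}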
When an XML document is opened, Oxygen attempts to work out which encoding to use for reading it. If the XML document has a heading like this:
<?xml version='1.0' encoding='UTF-8'?>
Oxygen uses the encoding specified in the heading. If the XML document lacks the heading, Oxygen falls back to UTF-8. Basically, Oxygen implements the XML specification when it comes to detecting the encoding of an XML file:
https://www.w3.org/TR/xml/#sec-guessing
In your case Oxygen detected the encoding as UTF-8 and started to use UTF-8 to convert bytes to characters. It then encountered a sequence of bytes that is not valid UTF-8. Oxygen does not continue loading the file, because in such cases you could end up with corrupted content when saving it back.
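For illustration, the REPORT/IGNORE/REPLACE choice mentioned in the error message maps naturally onto Java's CodingErrorAction; here is a sketch of the general mechanism (an analogy, not Oxygen's actual implementation):

import java.nio.ByteBuffer;
import java.nio.charset.*;

public class DecodeDemo {
    public static void main(String[] args) throws Exception {
        // 0xF3 is "ó" in CP1250/CP1252 but is not a valid
        // stand-alone byte in UTF-8.
        byte[] bytes = {'z', (byte) 0xF3, ' ', 'z', 'e'};

        CharsetDecoder report = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT);
        try {
            report.decode(ByteBuffer.wrap(bytes));
        } catch (MalformedInputException e) {
            System.out.println("REPORT: loading refused, like Oxygen's error");
        }

        CharsetDecoder replace = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE);
        // REPLACE substitutes the Unicode replacement character U+FFFD
        System.out.println("REPLACE: " + replace.decode(ByteBuffer.wrap(bytes)));
    }
}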
In my opinion the other editor tool you used to create the XML files was not XML-aware: it did not actually save the XML as UTF-8 even though the heading in the XML document specified this.
We do not actually know which encoding that other editing tool used to save the XML. One thing you could try is to reopen the XML document in that tool and change its encoding declaration from:
<?xml version='1.0' encoding='UTF-8'?>
to:
<?xml version='1.0' encoding='CP1250'?>
because I suspect that the other editing tool actually saved the XML document using the default platform encoding, which on Windows is usually CP1250.
Then save the XML document in the other editing tool and try to re-open it in Oxygen. If that works, change the encoding declaration back to UTF-8 and save the XML document in Oxygen in order to properly write it using the UTF-8 encoding.
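If the other tool cannot re-save the file, the conversion can also be done outside any editor. A minimal Java sketch, assuming the file really is CP1250 (verify that first; the file names here are made up):

import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;

public class Cp1250ToUtf8 {
    public static void main(String[] args) throws IOException {
        Path in = Paths.get("topic.dita");       // hypothetical input file
        Path out = Paths.get("topic-utf8.dita"); // hypothetical output file
        // Read the raw bytes as CP1250, then write the characters as UTF-8.
        String text = new String(Files.readAllBytes(in), Charset.forName("windows-1250"));
        Files.write(out, text.getBytes(StandardCharsets.UTF_8));
        // Remember to keep encoding="UTF-8" in the XML declaration afterwards.
    }
}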
This older set of slides I made about XML encoding might also be useful to you:
https://www.oxygenxml.com/events/2018/large_xml_documents.pdf

Related

How should a properly UTF-8 encoded file look in Notepad++

I am integrating data using some flat files. I'm getting the flat files delivered by FTP as .csv-files out of MS SQL exports from a business partner.
I asked him to encode it as UTF-8 (just using the standard I thought).
Now I can see in his files that a lot of codes such as &#233; (which I took for UTF-8 bytes) show up as plain text when I open them in Notepad++ (or in my "ETL" tool).
Before I ask him to fix it and deliver proper UTF-8, I would like to understand the issue and whether my claim is actually correct.
Shouldn't special characters be shown as special characters when I open them in Notepad++ and not as plain text UTF-8 codes?
Any help is much appreciated :))
Cheers
Martin
&#233; is an HTML entity (the numeric character reference for é). For some reason the text is HTML-formatted, which I wouldn't count as "plaintext"/flat files. The file may or may not be encoded in UTF-8 in addition to that; we don't know from the information given.
A file containing "special characters" (meaning non-ASCII characters) encoded in UTF-8 opened in a text editor which correctly interprets the file as UTF-8 looks exactly like the text it should look like, e.g.:
正式名称は、ISO/IEC 10646では “UCS Transformation Format 8”、Unicodeでは “Unicode Transformation Format-8” という。両者はISO/IEC 10646とUnicodeのコード重複範囲で互換性がある。RFCにも仕様がある。
Put this in a file, save it as UTF-8, open it in another application as UTF-8, and this is what the text should look like.
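If you want to verify this yourself programmatically, here is a small Java sketch that writes text as UTF-8 and reads it back (the file name is made up):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;

public class Utf8RoundTrip {
    public static void main(String[] args) throws IOException {
        Path file = Paths.get("sample-utf8.txt"); // hypothetical path
        String original = "Boîte, déjà, 正式名称";
        Files.write(file, original.getBytes(StandardCharsets.UTF_8));
        // Reading with the same charset gives back the identical text;
        // reading with the wrong one would produce mojibake instead.
        String roundTrip = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        System.out.println(original.equals(roundTrip)); // true
    }
}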

AppleScript: Save Word documents as plain text while retaining accents

I'm trying to save Word documents as plain text docs. Currently, the accents sometimes turn into other symbols (usually the same ones; for example, é turns into a theta). Other times it works fine. How do I prevent this?
Currently using the line:
save as active document file name FullDocPath file format format Unicode text
When I encounter this error, I can save the document using the dialog (selecting Western Mac OS Roman encoding), and that fixes the problem.
The applescript Word dictionary mentions:
[text encoding unsigned integer] : Text encoding to use when saving out as text file
I have no idea if this is the piece I'm missing or how to utilize it (is there a set integer that designates Western Mac OS Roman encoding?)
Anyone have any ideas?
Try:
set wordDoc to choose file
do shell script "textutil -convert txt " & quoted form of POSIX path of (wordDoc as text)
Check out StefanK's solution using textutil
This is in response to your comment beginning "Thanks Stefan and bibadiak"
The problem with .txt file formats is that there is no universally used way to specify the encoding of a file inside the file itself, so either the application has to guess, or you have to know the encoding and the application has to let you tell it.
AFAIK if you do not specify an output encoding when you use textutil to convert from .doc or .docx format to text, you get UTF-8. But Mac Word just does not seem to recognise that when you try to open it, either programmatically or in the UI.
So I think you need to do some mix of the following:
a. save in, and work with, a format that uses a 16-bit Unicode encoding. Word should recognise that, certainly if the BOM is preserved (see the sketch after this list)
b. save to UTF and work with UTF elsewhere, but use textutil to do the conversion back to (say) .docx before you re-open the document in Mac Word
c. if all your characters can be encoded using Mac OS Roman, use e.g.
textutil -convert txt -encoding 30
to save, ensure you work only with that character set, and re-open with Word. (30 is the value of the Apple NSStringEncoding constant NSMacOSRomanStringEncoding.) I think textutil will fail to convert documents that contain characters outside the Mac OS Roman set.
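As a side note on option a: if you generate the text file yourself, Java's "UTF-16" charset writes the byte order mark for you, which is the signature that lets Word detect the encoding. A minimal sketch (the file name is made up):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;

public class Utf16WithBom {
    public static void main(String[] args) throws IOException {
        // Java's "UTF-16" encoder prepends a BOM (0xFE 0xFF) on output,
        // which is what Word relies on when opening a .txt file.
        Files.write(Paths.get("accents.txt"), // hypothetical output path
                "café, déjà vu".getBytes(StandardCharsets.UTF_16));
    }
}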

How is this file encoded?

I have a .csv file generated by Excel that I got from my customer. My software has to open and parse it in Java. I'm using universalchardet, but it did not detect the encoding from the first 1,000 bytes of the file.
Within these 1,000 first bytes, there is a sequence that should be read as Boîte, however I cannot find the correct encoding to use to convert this file to UTF-8 strings in java.
In the file, Boîte is encoded as 42,6F,94,74,65 (read using a hex editor). B, o, t and e use the standard Latin encoding with one byte per character. The î is also encoded as only one byte, 0x94.
I don't know how to guess this charset, none of my searches online yielded relevant results.
I also tried to use file on that file:
$ file export.csv
/Users/bicou/Desktop/export.csv: Non-ISO extended-ASCII text, with CR line terminators
However, when I looked at an extended-ASCII chart, the value 0x94 stands for ö.
Have you got other ideas for guessing the encoding of that file?
This was Mac OS Roman encoding. When using the following Java code, the text was properly decoded:
InputStreamReader isr = new InputStreamReader(new FileInputStream(targetFileName), "MacRoman");
I don't know how to delete my own question. I don't think it is useful anymore...
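For anyone stuck on a similar mystery file, one way to narrow the guess is to decode the known byte sequence with a few candidate charsets and see which one produces the expected word. A quick Java sketch (the candidate list is just my guess set):

import java.nio.charset.Charset;

public class GuessEncoding {
    public static void main(String[] args) {
        // The bytes of "Boîte" as seen in the hex editor: 42 6F 94 74 65
        byte[] bytes = {0x42, 0x6F, (byte) 0x94, 0x74, 0x65};
        for (String name : new String[]{"MacRoman", "windows-1252", "IBM437"}) {
            System.out.println(name + " -> " + new String(bytes, Charset.forName(name)));
        }
        // MacRoman     -> Boîte   (0x94 = î, the right answer here)
        // windows-1252 -> Bo”te   (0x94 = right double quote)
        // IBM437       -> Boöte   (0x94 = ö, the "extended ASCII" reading)
    }
}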

GWT: Character encoding umlauts

I want to set a text in a label:
labelDemnaechst.setText(" Demnächst fällig:");
In the application's output, characters like "ä" are displayed incorrectly.
How can I display them correctly?
GWT assumes all source files are encoded in UTF-8 (and this is why you see lÃ¶schen if you open them with an editor that interprets them as Windows-1252 or ISO-8859-1).
GWT also generates UTF-8-encoded files, so your web server should serve them as UTF-8 (or don't specify an encoding at all, to let browsers "sniff" it) and your HTML host page should be served in UTF-8 too (best is to put a <meta charset=utf-8> at the beginning of your <head>).
If a \u00E4 in Java is displayed correctly in the browser, then you're probably in the first case, where your source files are not encoded in UTF-8.
See http://code.google.com/webtoolkit/doc/latest/FAQ_Troubleshooting.html#International_characters_don't_display_correctly
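You can reproduce the symptom in plain Java: take the UTF-8 bytes of the text and decode them as Windows-1252, which is exactly what a wrongly-configured editor or compiler does (a sketch, not GWT code):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        byte[] utf8 = "Demnächst fällig".getBytes(StandardCharsets.UTF_8);
        // Decoding UTF-8 bytes as Windows-1252 turns each "ä" into "Ã¤"
        System.out.println(new String(utf8, Charset.forName("windows-1252")));
        // prints: DemnÃ¤chst fÃ¤llig
    }
}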
Well, you have to encode your special characters as Unicode escapes. You can find a list of the corresponding Unicode characters here.
Your example would look like this:
labelDemnaechst.setText("Demn\u00E4chst f\u00E4llig:");
Hope this helps, if no one has a better solution.
Appendix:
Thanks Thomas for your tip; you really have to change the format in which Eclipse saves its source files. By default it uses something like Cp1252. If you change it to UTF-8, your example works correctly (so Demnächst is written as intended).
You can change the save encoding by right-clicking the file --> Properties.
To get UTF-8 encoding for your entire workspace, go to Window -> Preferences. In the pop-up, start typing "encoding"; you should get Content Types, Workspace, CSS Files, HTML Files, and XML Files as results. Under Content Types you can type UTF-8 in the Default encoding text box; for the other elements you can simply select the encoding in their respective list boxes.
Then check the encoding for your project in Project -> Properties -> Resource.
Detailed instruction with pictures can be found here:
http://stijndewitt.wordpress.com/2010/05/05/unicode-utf-8-in-eclipse-java/
Cheers
What I did:
opened the file with Notepad (via Windows Explorer),
and saved it with the UTF-8 option instead of the proposed ANSI.
Encoding the project to UTF-8 didn't work (for me)
Cheerio
Use the ISO-8859-1 (Western Europe) character set instead of UTF-8.

UTF8 charset, diacritical elements, conversion problems - and Zend Framework form escaping

I am writing a webapp in ZF and am having serious issues with UTF-8. It uses multilingual content through Zend Form, and it seems that ZF heavily escapes all of these characters and basically just won't show a field if it contains diacritical characters such as 'é'; and if I use the HTML entity equivalent, e.g. &eacute;, it gets escaped so that the user will see '&eacute;'.
Zend Form allows for having non escaped data, but trying to use this is confusing, and it seems it'd need to be used all over the place.
So, I have been told that if the page and the text is in UTF8, no conversion to htmlentities is required. Is this true?
And if the last question is true, then how do I convert the source text to UTF-8? I am comfortable with setting up Apache to send a default UTF-8 charset header and with adding the charset meta tag to the HTML, but doing this I still get messed-up encoding. I have also tried opening the translation CSV file in TextWrangler on OS X as UTF-8, but that has done nothing.
Thanks!
L
'é' and if I use the HTML entity equivalent e.g. &eacute; it gets escaped so that the user will see '&eacute;'.
This I don't understand. Can you show an example of how it is displayed, as opposed to how it should be displayed?
So, I have been told that if the page and the text is in UTF8, no conversion to htmlentities is required. Is this true?
Yup. In more detail: If the data you're displaying and the encoding of the HTML page are both UTF-8, the multi-byte special characters will be displayed correctly.
And if the last question is true, then how do I convert the source text to UTF8?
Advanced editors and IDEs enable you to define what encoding the source file is saved in. You would need to open the file in its current encoding (with special characters being displayed correctly) and save it as UTF-8.
If the content is messed up when you have the right content-type header and/or meta tag specified, then the content is not UTF-8 yet. If you don't get it sorted, post an example of what it looks like here.
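The '&eacute;' the user ends up seeing is classic double escaping: the view escaper encodes the '&' of text that was already an entity. A tiny Java sketch of the mechanism (a hand-rolled escaper standing in for Zend Form's, which this does not reproduce exactly):

public class DoubleEscapeDemo {
    // Minimal HTML escaper, standing in for the framework's view escaping
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        System.out.println(escape("é"));        // é passes through untouched
        System.out.println(escape("&eacute;")); // becomes &amp;eacute; -> browser shows "&eacute;"
    }
}

So if the page and the data are both UTF-8, send the raw 'é' and let the escaper leave it alone; pre-converting to entities is what triggers the double escape.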