How can I preserve the ASCII encoding when loading XML with XDocument.Parse?

When I load an XML document with encoding="us-ascii" using XDocument.Parse, the resulting document converts the ASCII character references into UTF-8 characters.
For example, my XML file contains (not the full file)
<data>&#233; some other stuff</data>
When this is loaded, it converts to
<data>é some other stuff</data>
How can I prevent this? I need to be able to later select descendants and re-write them to another file, but when doing so, they are converted.
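A minimal VB.NET sketch of one workaround that is often suggested for this (not taken from the original post; output.xml is just a placeholder name): serialize through an XmlWriter whose target encoding is ASCII, so the writer escapes anything outside the ASCII range as character references again.
Imports System.Text
Imports System.Xml
Imports System.Xml.Linq

Module AsciiRoundTrip
    Sub Main()
        ' Parsing resolves the character reference into the actual character "é" in memory.
        Dim doc = XDocument.Parse("<data>&#233; some other stuff</data>")
        Console.WriteLine(doc.Root.Value)

        ' An ASCII target encoding forces the serializer to escape non-ASCII characters,
        ' so the text node is written back as &#xE9; rather than as raw UTF-8 bytes.
        Dim settings As New XmlWriterSettings With {.Encoding = Encoding.ASCII}
        Using writer = XmlWriter.Create("output.xml", settings)
            doc.Save(writer)
        End Using
    End Sub
End Module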

Related

What are the characters shown on a file after forcefully changing the extension?

Recently I changed the extension of an .apk file to .txt and despite this, I was able to open it on Notepad with some random characters, that weren't available on the keyboard in the file. org/antlr/runtime/ANTLRFileStream.class…TmOÓP=w[×QËÀ)ê|A…ÑETÔ¢NP¢™ãË—º•Q3ZÓcüþ¿j",£ß4ñGÏmÇñ˽Ïs{žçœçeûùëóW ±¨á0F5d0ÖA˔‹LÈã’ŠËR˜PqEƒ†Iy\•ØkÒºÞÁЂ´¦TL«˜H­95{ÙÚ°2K/­×–Y³Üªù(ð·:%œv\'¸!Гû÷óðª#¢èUܵä¸öòæÆÛ_±^ÔÂt^Ùª­Z¾#ýæc"XwêKž_5-7¨ù¦¿éΆmÞZ^Y*ÍS “ÛÖ¹µ¹7eûUàxn]%µ‘Ð^TÊvË^…kžUˆ;u_àTw<sÁ}µDL%ÛªØ>ùÄš#º…Rø˜¨;o)\,0ǚԞ݇ؓ‡àΪ<ò6ýr³¥GsÃ횪EOÌ_…É =è•Ç¬Ž#8ª£½ú^fùõ˜Ž›¸%pü IT{`Á2þ¶<Š:î`NÇ<î긇A˜èÿïˆ8Ç0Q¥»¨#- Ze7srRÉšíVƒõÐ]0rí&tÀ”O´‡[Y±K ö¬H›¯Ü %÷¬8Ì) r+åšW·ÑÏF†¿,bd—i%h³­ˆá8½YÄiª‘
Not just this one; converting many other extensions like .jar, .xapk, etc. shows these characters as well. Can anyone please explain what these characters are based on, and how the OS decides what characters to show for an unsupported file?
Is there a way to get the original content through this data?
Let's say you created a text editor which can write and save text files as well as open them. You also defined the encoding that will be used to save the text as binary (all files, when saved, are binary). So your encoding looks something like the following:
Your encoding           Emacs encoding
TEXT   BINARY           TEXT   BINARY
A      01000001         ă      01000001
B      01000010         Ћ      01000010
...                     ...
Z      01011010         Ϡ      01011010
Let's say you create a file with 'ABZ' as its contents. When saved, this file contains the value 010000010100001001011010. When you open the file with your text editor, the editor finds 010000010100001001011010 as the file contents and, using the encoding above, knows that it is 'ABZ', hence it prints 'ABZ' on the screen.
Now let's say you open the same file using Emacs. Since Emacs uses its own encoding, it displays "ăЋϠ". There is nothing wrong with Emacs; it just doesn't know that the data was written using your custom encoding.
So the point is that every file is written in a specific format; for example, the APK format can only be correctly understood by the Android system. When you try to open an APK file in a text editor, it just tries to make sense of the binary data in the same way Emacs does in the example above.
Is there a way to get the original content through this data?
If you know the original encoding the data was written with, then you can read the contents of the file using that same encoding.
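As a quick illustration of that point (a generic VB.NET sketch, not specific to .apk files): the very same bytes decoded with two different encodings come out as different text.
Imports System.Text

Module SameBytesDifferentText
    Sub Main()
        ' These two bytes are the UTF-8 encoding of "é".
        Dim bytes() As Byte = {&HC3, &HA9}

        ' Decoded as UTF-8 they display as "é"; decoded as ISO-8859-1 they display as "Ã©".
        Console.WriteLine(Encoding.UTF8.GetString(bytes))
        Console.WriteLine(Encoding.GetEncoding("ISO-8859-1").GetString(bytes))
    End Sub
End Module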

Having problems opening DITA files in OxygenXML which contain special characters

I am having problems opening files which contain special characters like é, è, ë, ê, à, á, ö, etc. The error message I get from OxygenXML is:
File encoding (UTF8) does not support all characters from the current file.
To ignore these errors or to replace invalid characters follow the link below to change the "Encoding errors handling" option value from REPORT to IGNORE or REPLACE.
The strange thing is: when I alter the file (by swapping the 'ó' for an 'o', for instance), I can import the file both in OxygenXML and in FontoXML. Afterwards I can correct it again and save the file. But I don't see a difference between the original file and the altered file.
This is the original file
<p id="id-9f3a1788-a751-4f48-ed9c-9e19447ad3b0">Ze is zó zenuwachtig, dat ze bijna aan de ... moet .</p>
And this is the saved corrected file (from FontoXML, in this case - just to show the added instructions):
<p id="id-9f3a1788-a751-4f48-ed9c-9e19447ad3b0">Ze is
z<?fontoxml-change-addition-start author-id="erik.verhaar" change-id="6f6bb382-3d43-4c5b-b35f-f857d729cf22" timestamp="1627473671530"?>ó<?fontoxml-change-addition-end change-id="6f6bb382-3d43-4c5b-b35f-f857d729cf22"?><?fontoxml-change-deletion author-id="erik.verhaar" change-id="0296c77c-863b-421f-bf5c-c0901c7a2751" text="ó" timestamp="1627473669483"?>
zenuwachtig, dat ze bijna aan de ... moet .</p>
What is the difference between the original ó and the corrected one? And how can I change my original files so they can be imported in OxygenXML?
Thanks!!
Text files (XML, for example) are saved on disk using bytes; they are edited and presented using characters. An encoding takes care of converting bytes to characters (sometimes multiple bytes are converted to a single character) when the document is opened, and again of converting characters to bytes when the document is saved.
There are many encodings, but with the most popular ones (like UTF-8) characters in the 0-127 ASCII range, like a-z and A-Z, are usually saved as a single byte. Characters outside of that range, for example e-acute (é), usually get saved as multiple bytes, depending on the encoding used for saving.
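To make that concrete, here is a small sketch (in VB.NET, purely for illustration; the encodings named are just examples) of how many bytes the same characters take in different encodings:
Imports System.Text

Module ByteCounts
    Sub Main()
        ' "a" is in the ASCII range and takes one byte in UTF-8.
        Console.WriteLine(Encoding.UTF8.GetByteCount("a"))                       ' 1
        ' "é" is outside the ASCII range: two bytes in UTF-8,
        ' but a single byte in a one-byte-per-character encoding such as ISO-8859-1.
        Console.WriteLine(Encoding.UTF8.GetByteCount("é"))                       ' 2
        Console.WriteLine(Encoding.GetEncoding("ISO-8859-1").GetByteCount("é"))  ' 1
    End Sub
End Module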
When an XML document is opened, Oxygen attempts to work out which encoding to use for reading it. If the XML document has a heading like this:
<?xml version='1.0' encoding='UTF-8'?>
Oxygen uses the encoding specified in the heading. If the XML document lacks the heading, Oxygen falls back to UTF-8. Basically Oxygen implements the XML specification when it comes to detecting the encoding of the XML file:
https://www.w3.org/TR/xml/#sec-guessing
In your case Oxygen detected the encoding as UTF-8 and started to use UTF-8 to convert bytes to characters. It encountered a sequence of bytes which were not encoded using UTF-8. Oxygen does not continue loading the file because in such cases you may end up with corrupt content when saving it back.
In my opinion the other editing tool you used to create the XML files was not XML-aware; it did not actually save the XML as UTF-8 even though the heading in the XML document specified this.
We do not actually know what encoding that other editing tool used to save the XML. One thing you could try would be to reopen the XML document in that other editing tool and change its encoding declaration in the heading from:
<?xml version='1.0' encoding='UTF-8'?>
to:
<?xml version='1.0' encoding='CP1250'?>
because I suspect that the other editing tool actually saved the XML document using the default platform encoding, which on Windows is usually CP1250.
Then save the XML document in the other editing tool and try to re-open it in Oxygen. If it works, change its heading encoding declaration back to UTF-8 and save the XML document in Oxygen in order to properly save it using the UTF-8 encoding.
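If re-saving by hand in the other tool is not practical, the same repair can also be scripted. A rough sketch (VB.NET, just for illustration; the file name is a placeholder and windows-1250 is only the guess from above; on .NET Core/.NET 5+ the windows-1250 code page requires the System.Text.Encoding.CodePages package to be registered):
Imports System.IO
Imports System.Text

Module ReEncodeToUtf8
    Sub Main()
        ' Make the legacy Windows code pages available.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)

        ' Read the file with the encoding we suspect was actually used,
        ' then write it back as UTF-8 so the bytes match the encoding declaration.
        Dim text = File.ReadAllText("topic.dita", Encoding.GetEncoding("windows-1250"))
        File.WriteAllText("topic.dita", text, New UTF8Encoding(False)) ' UTF-8 without a BOM
    End Sub
End Module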
This older set of slides I made about XML encoding might also be useful to you:
https://www.oxygenxml.com/events/2018/large_xml_documents.pdf

How to programmatically get the fragments called in an RTFtemplate?

I need to programmatically find the fragments that are called by each rtftemplate.
So, for example in the figure, I would need to get the "GlossaryTermsAcronyms" fragment for the H2_terms_acronyms template.
I can't seem to find any query or script solution to do this. But this should be possible, right?
Unfortunately that is (almost) impossible.
The information is stored in the t_documents.bincontent column. It is binary encoded RTF.
Somewhere in that RTF there should be a reference to the template fragments that are used.
If you can figure out how to decode the bincontent to get to the actual RTF code of your template, you might have a chance.
Binary fields in EA are usually stored as a zipped text file.
If the field is included in an XML file (or an XML string in the database), it will be base64 encoded.
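A rough, untested sketch of that decoding path (VB.NET; it assumes you have already read the column value into a byte array yourself, that the blob is plain zlib-compressed text as described, and that ZLibStream is available, i.e. .NET 6+; if the field turns out to be a real .zip archive you would reach for ZipArchive instead):
Imports System.IO
Imports System.IO.Compression
Imports System.Text

Module BinContentDecoder
    ' Blob read straight from the bincontent column.
    Function DecodeBinContent(raw As Byte()) As String
        Using zipped As New MemoryStream(raw),
              inflater As New ZLibStream(zipped, CompressionMode.Decompress),
              plain As New MemoryStream()
            inflater.CopyTo(plain)
            ' RTF is essentially 7-bit ASCII, so ASCII decoding should be safe.
            Return Encoding.ASCII.GetString(plain.ToArray())
        End Using
    End Function

    ' Value taken from an XML export instead: base64-decode it first.
    Function DecodeFromXmlExport(base64 As String) As String
        Return DecodeBinContent(Convert.FromBase64String(base64))
    End Function
End Module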

Writing CR+LF into Open XML from a Database

I'm trying to take some data stored in a database and populate a Word template's Content Controls with it using the Open XML SDK. The data contains paragraphs and so there are carriage return and line feed characters in it. The data is stored in the database as nvarchar.
When I open the generated document, the CR+LF combination shows up as a question mark with a box around it (I'm not sure of the name of this character). It is actually two sequences back to back, so CR+LF CR+LF equals two of these strange characters:
If I unzip the .docx, take the Custom XML part and do a hex dump, I can clearly see 0d0a 0d0a, so the CR+LF is there; Word is just rendering it oddly.
I've tried enforcing UTF-8 encoding in my XmlWriter's settings, but that didn't seem to help:
Imports System.IO
Imports System.Text
Imports System.Xml

Dim docStream As New MemoryStream
Dim settings As New XmlWriterSettings()
settings.Encoding = New UTF8Encoding(False) ' UTF-8 without a byte-order mark
Dim docWriter As XmlWriter = XmlWriter.Create(docStream, settings)
Does anyone know how I can get Word to render these characters correctly when written to a .docx through the Open XML SDK?
To bind to a Word 2013 rich text control, your XML element has to contain a complete docx. See [MS-DOCX]:
the data stored in the XML element will be an escaped string comprised of a flattened WordprocessingML document representing the formatted data in the structured document tag range.
Earlier versions couldn't bind a rich text control.
Things should work, though (with CR/LF, not w:br), if you bind to a plain text control with multiline set to true.
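For reference, the relevant piece of such a control's properties might look roughly like the following (a hand-written sketch, with placeholder xpath and store item values, not output taken from an actual document):
<w:sdt>
  <w:sdtPr>
    <w:dataBinding w:xpath="/data[1]/notes[1]"
                   w:storeItemID="{11111111-2222-3333-4444-555555555555}"/>
    <!-- w:text marks a plain text control; multiLine lets the bound CR/LF pairs render as line breaks -->
    <w:text w:multiLine="true"/>
  </w:sdtPr>
  <w:sdtContent>
    <w:p><w:r><w:t xml:space="preserve"></w:t></w:r></w:p>
  </w:sdtContent>
</w:sdt>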

Does Apache Tika do character set conversion?

I'm using org.apache.tika.Tika.parseToString() to convert documents into plain text (i.e., unformatted text) files. My application potentially needs to convert documents that don't use a Unicode character set. For instance, some documents may be encoded in the Chinese GB2312 character set. It would be great if Tika re-coded the output into UTF-8. This would require Tika to reference a mapping between many different character sets and Unicode in order to convert the characters.
Does Tika convert the non-Unicode character set text into Unicode as the output of parseToString()? There are a lot of character sets out there so I would be impressed if Tika did this for more than a few character sets.
Update: I was able to create a couple different files with some non-Latin charsets (GB2312 (Chinese) and KOI8-R (Russian)). Tika.parseToString() couldn't even detect the charset or encoding. I opened an issue on the Tika bug tracker here: https://issues.apache.org/jira/browse/TIKA-1262
When talking about character sets in Apache Tika, you need to consider two kinds of files differently. One kind is basically just plain text; the other is the more complex types (including binary ones).
With the more complex files, Tika mostly uses third-party libraries, and these libraries are responsible for returning Java Strings. The exact way of doing that depends on the file format in question - sometimes the file format will include encoding information, other times it will be fixed in what it supports. Either way, Tika gets Java Strings, and returns to you a Java String. How you choose to encode that for output is up to you. (For Windows users especially, check the encoding of your terminal, and the font used. There have been lots of "Tika Encoding Problems" which were actually people failing to correctly set the default Java encoding on output, or failing to have a Unicode-capable terminal!)
With plain text files, there's no encoding information in the file; all we have is a bunch of bytes. Here, Apache Tika uses one of a number of EncodingDetector instances to do the detection. These use hints, n-grams, language detection etc. to try to work out the most likely encoding of the file, based on the information given, the pattern of bytes in the file, and so on.
The definition of EncodingDetector is held in the Tika-Core jar, but most of the implementations are held in the Tika-Parsers jar (and loaded by the service loader method, just like Detectors and Parsers). The main ones are here in SVN. If you check there, you'll see the main list of encodings that Tika can detect.
One final thing: the encoding detection is only performed on files that are text files; it isn't done on the binary file types. Depending on how you call Tika, you might need to tweak that and/or provide a hint that it's a text file, so that the EncodingDetector logic gets triggered.
This answer actually comes from a JIRA user on the Tika project. https://issues.apache.org/jira/browse/TIKA-1262
It turns out that if you tell Tika that the file extension is '.txt', it will treat the file as plain text, attempt to detect the encoding, and decode it to Unicode.
An easy way to do this is to pass an empty Metadata object to TikaInputStream.get(). This will fill out the resourceName field of the Metadata object. Then pass this object to parseToString(). With the resourceName field set to a file name that ends with .txt, the parser knows to treat the file as plain text and will do encoding detection to try to discover how to decode it. The string returned from parseToString() is a Java UTF-16 String object. When written to a file you can see that it is Unicode and uses the UCS charset.
Tika tika = new Tika();
Metadata metadata = new Metadata();
TikaInputStream reader = TikaInputStream.get(new File(filepath), metadata); // fills in metadata's resourceName
String contents = tika.parseToString(reader, metadata);                     // detects the encoding, returns a Java String
So far this has worked for text files encoded in either GB2312/GB18030 or KOI8-R. This is the expected behavior and it's perfect! I don't know what other charsets/encodings it can handle.