Parsing XML processing instructions in PySpark

I am trying to parse an XML file that contains processing instructions using the Databricks spark-xml library.
Example XML:
<books>
<?SOURCE sample_file?>
<?DATE 12/01/2022?>
<book>
<title>Spark Tutorial</title>
<desc>Spark Tutorial for beginners</desc>
<author>John C</author>
<details>
<price>1234</price>
<pagecount>1000</pagecount>
<chapters>
<chapter>C1</chapter>
<chapter>C2</chapter>
<chapter>C3</chapter>
</chapters>
</details>
</book>
<book>
<title>Scala</title>
<desc>Scala Tutorial for beginners</desc>
<author>John C</author>
<details>
<price>599</price>
<pagecount>1000</pagecount>
<chapters>
<chapter>C10</chapter>
<chapter>C20</chapter>
<chapter>C30</chapter>
</chapters>
</details>
</book>
</books>
Is there any way to parse the XML processing instructions SOURCE and DATE?
I can read the other XML tag values, but I am not able to read the processing instructions.
I tried the lxml library and was able to read the processing instructions, but I cannot do the same with the spark-xml library.
Thanks in advance.
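One possible workaround, offered as a minimal sketch rather than a spark-xml feature: read the processing instructions separately with Python's standard library and attach them to the DataFrame afterwards (for example as constant columns via pyspark.sql.functions.lit). The SAMPLE string and the processing_instructions helper are hypothetical names used only for illustration, and on a cluster the file contents would first have to be read on the driver.

```python
from xml.dom import minidom

# Shortened stand-in for the example file above (assumption: same layout).
SAMPLE = """<books>
<?SOURCE sample_file?>
<?DATE 12/01/2022?>
<book><title>Spark Tutorial</title></book>
</books>"""


def processing_instructions(xml_text):
    """Return {target: data} for every processing instruction in the document."""
    doc = minidom.parseString(xml_text)
    found = {}
    stack = [doc]
    while stack:
        node = stack.pop()
        if node.nodeType == node.PROCESSING_INSTRUCTION_NODE:
            found[node.target] = node.data
        # Push children in reverse so they are visited in document order.
        stack.extend(reversed(node.childNodes))
    return found


pis = processing_instructions(SAMPLE)
print(pis)
```

The resulting dictionary is keyed by the instruction target (SOURCE, DATE), so each value can be added to every row of the parsed book records if that is what is needed.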

How To Read XML File from Azure Data Lake In Synapse Notebook without Using Spark

I have an XML file stored in Azure Data Lake which I need to read from Synapse notebook. But when I read this using spark-xml library, I get this error:
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: `d:col`
Sample xml looks like this:
<m:properties>
<d:FileSystemObjectType m:type="Edm.Int32">0</d:FileSystemObjectType>
<d:Id m:type="Edm.Int32">10</d:Id>
<d:Modified m:type="Edm.DateTime">2021-03-25T15:35:17Z</d:Modified>
<d:Created m:type="Edm.DateTime">2021-03-25T15:35:17Z</d:Created>
<d:ID m:type="Edm.Int32">10</d:ID>
<d:Title m:null="true" />
<d:Description m:type="Edm.String">Test</d:Description>
<d:PurposeCode m:type="Edm.Int32">1</d:PurposeCode>
</m:properties>
Notice that there are tags for both d:Id and d:ID, which cause the duplicate-column error.
I found this documentation, which states that column names differing only in case are treated as duplicates: https://learn.microsoft.com/en-us/azure/databricks/kb/sql/dupe-column-in-metadata
However, I cannot modify the XML and have to read it as is. Is there a workaround so I can still read the XML?
Or is there a way to read the XML without using Spark? I am thinking of loading and parsing the file with the scala.xml.XML library, but when I attempt this, I get an error:
abfss:/<container>@<adls>.dfs.core.windows.net/<directory>/<xml_file>.xml (No such file or directory)
Code snippet below:
import scala.xml.XML
val xml = XML.loadFile("abfss://<container>@<adls>.dfs.core.windows.net/<directory>/<xml_file>.xml")
Note: the error message really did show abfss:/ with a single slash, whereas the path passed as the parameter has //.
Thanks.
I found a way to make Spark case sensitive, and I am now able to read the XML successfully:
spark.conf.set("spark.sql.caseSensitive", "true")
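On the second part of the question (reading the XML without Spark): a plain XML parser keeps d:Id and d:ID apart, because XML names are case sensitive. Below is a minimal sketch using Python's standard library. The namespace URIs are an assumption, since the sample omits the declarations, and the file contents would first have to be fetched from the lake, because local-file APIs such as XML.loadFile look for the path on the local file system and do not understand abfss:// URIs.

```python
import xml.etree.ElementTree as ET

# Assumption: the m: and d: prefixes are bound to these OData namespaces;
# the sample in the question does not show the declarations.
SAMPLE = """<m:properties
    xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata"
    xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices">
  <d:Id m:type="Edm.Int32">10</d:Id>
  <d:ID m:type="Edm.Int32">10</d:ID>
  <d:Title m:null="true" />
  <d:Description m:type="Edm.String">Test</d:Description>
</m:properties>"""

root = ET.fromstring(SAMPLE)

record = {}
for child in root:
    # ElementTree reports tags as "{namespace-uri}LocalName"; keep the local part.
    local_name = child.tag.split("}", 1)[-1]
    record[local_name] = child.text  # None for empty elements such as d:Title

print(record)
```

Because the keys are ordinary case-sensitive strings, Id and ID end up as two separate entries.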

How to import this dita 1.3 xml?

Could somebody help a poor developer with upgrading to DITA 1.3? :)
I need to make DITA-OT work with a newer version of the XML files I was given (example below). I need to adjust something in the library, but I have no clue where to start. For this example I have replaced the problematic bit with //FOOBAR/.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE concept PUBLIC "-//FOOBAR//DTD DITA Concept//EN" "file:///D:/InfoShare/Web/Author/ASP/DocTypes/dita-sdl/1.3/dtd/technicalContent/dtd/sdlConcept.dtd">
<?ish ishref="GUID-874B737D-F63A-48C3-887A-571C38D5ED5A" version="1" lang="en-us"?>
<concept xml:lang="en-us" id="xs_help_me_contextually_please" rev="for Desktop" product="Foobar product">
<title id="GUID-F92ED443-BE97-44C7-AB36-726B2A76ECF9">New DITA declaration topic without any new elements</title>
<shortdesc id="GUID-8D7A677D-6782-4A65-96B4-F7F4B3CB5CCD">
<ph>Short description of the topic.</ph>
</shortdesc>
<prolog>
<metadata>
<category>
Content area
<keyword>Templates</keyword>
</category>
<keywords>
<indexterm id="GUID-32379B47-E4F9-4E00-A8A7-383584241D88">indexterm</indexterm>
</keywords>
</metadata>
</prolog>
<conbody>
<p id="GUID-A2466389-DC06-4052-A0EE-8684F3C3D7D3">
<ph>Text here.</ph>
</p>
</conbody>
</concept>
If I change FOOBAR to OASIS, then it seems to work; at least it does not give any error. The command that I'm running is:
dita -i=/app/dita/in/foobar.ditamap -f=xhtml -o=/app/dita/out
The error it gives:
[gen-list] [DOTJ079E][ERROR] File 'file:/app/dita/in/xs_help_me_contextually_please.xml' could not be loaded. Ensure that grammar files for this document type are referenced and installed properly. Cannot load file: /D:/InfoShare/Web/Author/ASP/DocTypes/dita-sdl/1.3/dtd/technicalContent/dtd/sdlConcept.dtd (No such file or directory)
[move-meta] I/O error reported by XML parser processing file:/tmp/temp20191106165059386/in/xs_help_me_contextually_please.xml: /tmp/temp20191106165059386/in/xs_help_me_contextually_please.xml (No such file or directory)
[move-meta] file:/app/dita/in/foobar.ditamap:3:327: [DOTX026W][WARN]: Unable to retrieve linktext from target: 'xs_help_me_contextually_please.xml'. Using navigation title as fallback.
I should also add technicalContent/dtd/sdlConcept.dtd (which I was also given) somewhere in the library, but I am not sure where. I tried putting it in plugins/org.oasis-open.dita.v1_3 and thought it worked, but when I removed the file and had //OASIS/ in the source XML, it did not give any error either.
How can it all work if the path file:///D:/InfoShare/Web/Author/ASP/Doc... does not exist on the system where the import happens (a Docker container)? Is it just informational?
I am very confused by all of this.
Thank you in advance!
It's hard to help you given what you have provided, but I can add some clarifying information:
You are working with DITA source that either is or has been stored in the SDL CCMS. Depending on the age of the SDL product, it has different names: Trisoft, SDL Live Content, SDL Tridion Docs.
DITA 1.3 is backwards compatible with all previous versions of DITA, so you should not have to adjust any DITA source files. But -- if the DITA source uses different DTDs -- as any content stored in the SDL product does, you'll need those DTDs, as they are different than the OASIS DTDs that ship with DITA-OT.
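To check which DTD a given topic actually declares, the public and system identifiers can be read straight from its DOCTYPE. This is a minimal sketch using Python's standard library; TOPIC is a shortened, hypothetical stand-in for the file in the question. The parser used here does not fetch the external DTD, so the non-existent D:/ path is harmless in this snippet. Toolchains that use XML catalogs can map the public identifier to a locally installed DTD, in which case the system path would not need to exist.

```python
from xml.dom import minidom

# Shortened stand-in for the topic in the question (assumption: same DOCTYPE).
TOPIC = """<!DOCTYPE concept PUBLIC "-//FOOBAR//DTD DITA Concept//EN" "file:///D:/InfoShare/Web/Author/ASP/DocTypes/dita-sdl/1.3/dtd/technicalContent/dtd/sdlConcept.dtd">
<concept id="example"><title>Example</title></concept>"""

doc = minidom.parseString(TOPIC)
doctype = doc.doctype

# The public identifier is the name a catalog can map to a local file;
# the system identifier is the literal location written in the document.
print("public id:", doctype.publicId)
print("system id:", doctype.systemId)
```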
Hope this helps a little; you also might have better luck posting on the dita-users list at Yahoo!
Best,
Kris

Spring Batch: writing a complex XML output

I have to design a Spring Batch job that reads from a database and writes the data to XML. The output format is as follows.
Please suggest a Spring Batch configuration for the reader and writer.
<Report>
<ContentLocation>I0001</ContentLocation>
<Header documentId="Doc1">
<Mark>e-mark</Mark>
<EndDate>2014-04-30 00:00:00</EndDate>
<Type>109</Type>
<Business>
<Id>123456789</Id>
<LegalName>Company</LegalName>
<LegalAddress>12345 Main St. JamesTown CA 92869</LegalAddress>
<LegalPhoneNumber>567-678-8909</LegalPhoneNumber>
</Business>
</Header>
<ITD documentId="34">
<Client>
<Name>Client1</Name>
<Address>Address1</Address>
</Client>
<Associate>
<Id>1</Id>
<Department>Finance</Department>
</Associate>
<Associate>
<Id>2</Id>
<Department>Accounts</Department>
</Associate>
</ITD>
</Report>
In your case I suggest this Mkyong tutorial; it is detailed and very simple.
http://www.mkyong.com/spring-batch/spring-batch-example-mysql-database-to-xml/
Also, if the data retrieved from the database is represented as entities in Java, you can use Spring Batch with XStream. I can help; just let me know which way you want to go.
See ya!

Objective C: Parse malformed XML

I'm writing an iPhone app that downloads data from the Flickr API. At the moment there seems to be no way to limit how many comments it downloads, and although I'd like to get maybe 8 or 10, it sometimes sends me hundreds. I have subclassed ASIHTTPRequest so that it will only download a set amount of bytes (for example, it will stop downloading after receiving 1024 bytes of comment data).
Now, the information I'd like to parse is all there (the comment data contains things like a user id, the text, etc.). However, since it is cut off before the end, the XML is malformed, and my current solution (using ObjectiveFlickr's XML parser) is unable to parse the XML. Is there a way to handle badly formed XML, a la the way old web browsers handled HTML, and only extract the data that is well-formed?
Here is some sample data:
<?xml version="1.0" encoding="utf-8" ?>
<rsp stat="ok">
<comments photo_id="5692627867">
<comment id="49862655-5692627867-72157626659891768" author="29114051@N05" authorname="eαse*" iconserver="4046" iconfarm="5" datecreate="1304689286" permalink="http://www.flickr.com/photos/kikicchi/5692627867/#comment72157626659891768">lovely lovely lovelyyyyy!!! ♥♥♥♥♥♥♥♥♥</comment>
<comment id="49862655-5692627867-72157626535581359" author="49946698@N06" authorname="RandomPics Art" iconserver="4017" iconfarm="5" datecreate="1304692593" permalink="http://www.flickr.com/photos/kikicchi/5692627867/#comment72157626535581359">nice! like soft colors and tones!</comment>
<comment id="49862655-5692627867-72157626660240896" author="49907977@N06" authorname="kiki_chi" iconserver="4014" iconfarm="5" datecreate="1304693051" permalink="http://www.flickr.com/photos/kikicchi/5692627867/#comment72157626660240896">&gt;eαse*
&gt;RandomPics Art
Thank you:) :) :)</comment>
<comment id="49862655-5692627867-72157626660761230" author="41717031@N08" authorname="petia.bourova" iconserver="4082" iconfarm="5" datecreate="1304698244" permalink="http://www.flickr.com/photos/kikicchi/5692627867/#comment72157626660761230">Thank you!Very nice photo!I like The coulers very,very much!!!</comment>
<comment id="49862655-5692627867-72157626661258700" author="31540474@N08" authorname="Leentje32" iconserver="4067" iconfarm="5" datecreate="1304703576" permalink="http://www.flickr.com/photos/kikicchi/5692627867/#comment72157626661258700">Aww so lovely!! Beautiful capture.</comment>
<comment id="49862655-5692627867-72157626662413410" author="61373986@N06" authorname="My NIKON And Me" iconserver="5310" iconfarm="6" datecreate="1304716098" permalink="http://www.flickr.com/photos/kikicchi/5692627867/#comment72157626662413410">lovely image!!!</comment>
<comment id="49862655-5692627867-72157626663408864" author="7652657@N02" authorname="Majlee" iconserver="3130" iconfarm="4" datecreate="1304728344" permalink="http://www.flickr.com/photos/kikicchi/5692627867/#comment72157626663408864">This is just adorable !</comment>
<comment id="49862655-5692627867-72157626663519092" author="15613254@N05" authorname="mr_jyoti" iconserver="4011" iconfarm="5" datecreate="1304729940" permalink="http://www.flickr.com/photos/kikicchi/5692627867/#comment72157626663519092">Cool shot. Nice bokey.</comment>
<comment id="49862655-5692627867-72157626663642456" author="16327396@N03" authorname="my beanie hat rocks" iconserver="2550" iconfarm="3" datecreate="1304731810" permalink="http://www.flickr.com/photos/kikicchi/5692627867/#comment72157626663642456">Maybe she could cheer this fella up!!
<a href="http://www.flickr.com/photos/weasteman/5652855802/in/photostream">www.flickr.com/photos/weasteman/5652855802/in/photostream</a>
=D</comment>
I am not sure about the exact XML format, but it looks simple. In that case you can try to find the last complete tag in the data and add the missing closing tags manually. It should not take more than a simple string search and replace.
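The idea is language-agnostic; here is a minimal sketch of it in Python (not Objective-C), assuming the rsp/comments/comment nesting shown in the sample. The repair_truncated helper, its parameter names, and the TRUNCATED test string are hypothetical and only illustrate the string search and replace: cut the data after the last complete comment, then append the closing tags of the parents that are still open.

```python
import xml.etree.ElementTree as ET


def repair_truncated(xml_text, item_close="</comment>", closers="</comments></rsp>"):
    """Keep everything up to the last complete item, then close the open parents."""
    cut = xml_text.rfind(item_close)
    if cut == -1:
        raise ValueError("no complete item found in the truncated data")
    return xml_text[:cut + len(item_close)] + closers


# Made-up truncated response with the same nesting as the Flickr sample.
TRUNCATED = (
    '<rsp stat="ok"><comments photo_id="5692627867">'
    '<comment id="1" authorname="a">first</comment>'
    '<comment id="2" authorname="b">second</comment>'
    '<comment id="3" authorname="c">cut off in the mid'
)

root = ET.fromstring(repair_truncated(TRUNCATED))
texts = [comment.text for comment in root.iter("comment")]
print(texts)
```

The partially downloaded third comment is dropped, and the repaired string is well-formed, so any ordinary XML parser can handle it.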

iPhone: repeating Elements in XML

I am new to XML parsing and am parsing the following XML. There are tutorials for XML that has unique attributes, but this XML has repeating attributes.
<?xml version="1.0" encoding="utf-8"?>
<start>
<Period periodType="A" fYear="2005" endCalYear="2005" endMonth="3">
<ConsEstimate type="High">
<ConsValue dateType="CURR">-8.9919</ConsValue>
</ConsEstimate>
<ConsEstimate type="Low">
<ConsValue dateType="CURR">-13.1581</ConsValue>
</ConsEstimate>
</Period>
<Period periodType="A" fYear="2006" endCalYear="2006" endMonth="3">
<ConsEstimate type="High">
<ConsValue dateType="CURR">-100.000</ConsValue>
</ConsEstimate>
<ConsEstimate type="Low">
<ConsValue dateType="CURR">-13.1581</ConsValue>
</ConsEstimate>
</Period>
</start>
I need to fetch the low and high values based on the years 2005 and 2006.
I agree with SB's comment: if you want to handle XML data structures, you should know at least the basics.
A good tutorial I can recommend is the W3Schools XML Tutorial.
Once you have done that, you should know that there are several ways to parse XML files. For flat files I recommend the TBXML library; it is really fast and easy to handle within your code.
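Whatever parser you pick, the logic for repeating elements is the same: loop over each Period, read its fYear attribute, and collect the ConsEstimate children by their type attribute. Here is a minimal sketch of that logic in Python with the standard library (not TBXML or Objective-C). SAMPLE is a copy of the XML from the question, and the estimates dictionary is just an illustrative name.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<start>
<Period periodType="A" fYear="2005" endCalYear="2005" endMonth="3">
<ConsEstimate type="High"><ConsValue dateType="CURR">-8.9919</ConsValue></ConsEstimate>
<ConsEstimate type="Low"><ConsValue dateType="CURR">-13.1581</ConsValue></ConsEstimate>
</Period>
<Period periodType="A" fYear="2006" endCalYear="2006" endMonth="3">
<ConsEstimate type="High"><ConsValue dateType="CURR">-100.000</ConsValue></ConsEstimate>
<ConsEstimate type="Low"><ConsValue dateType="CURR">-13.1581</ConsValue></ConsEstimate>
</Period>
</start>"""

root = ET.fromstring(SAMPLE)

# Map each fiscal year to its {"High": value, "Low": value} estimates.
estimates = {}
for period in root.findall("Period"):
    year = period.get("fYear")
    estimates[year] = {
        estimate.get("type"): float(estimate.findtext("ConsValue"))
        for estimate in period.findall("ConsEstimate")
    }

print(estimates["2005"]["High"], estimates["2005"]["Low"])
print(estimates["2006"]["High"], estimates["2006"]["Low"])
```

With an event-based parser the same result is obtained by remembering the current fYear and type while walking the elements.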