Spring Batch: writing a complex XML output - spring-batch

I have to design a Spring Batch job which reads from a database and writes the data to XML; the output format is as follows.
Please suggest the Spring Batch configuration for the reader and writer.
<Report>
<ContentLocation>I0001</ContentLocation>
<Header documentId="Doc1">
<Mark>e-mark</Mark>
<EndDate>2014-04-30 00:00:00</EndDate>
<Type>109</Type>
<Business>
<Id>123456789</Id>
<LegalName>Company</LegalName>
<LegalAddress>12345 Main St. JamesTown CA 92869</LegalAddress>
<LegalPhoneNumber>567-678-8909</LegalPhoneNumber>
</Business>
</Header>
<ITD documentId="34">
<Client>
<Name>Client1</Name>
<Address>Address1</Address>
</Client>
<Associate>
<Id>1</Id>
<Department>Finance</Department>
</Associate>
<Associate>
<Id>2</Id>
<Department>Accounts</Department>
</Associate>
</ITD>
</Report>

In your case I suggest this Mkyong tutorial; it is detailed and very simple:
http://www.mkyong.com/spring-batch/spring-batch-example-mysql-database-to-xml/
Also, if the data you retrieve from the database is represented as Java entities, you can use Spring Batch with XStream. I can help you, just let me know which way you want to go.
See ya!
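If you go the XStream route, here is a minimal sketch of what the reader/writer bean configuration could look like. Everything below is illustrative: the SQL, the reportRowMapper bean, and the com.example.report classes are placeholders for your own code, and StaxEventItemWriter writes each item produced by your RowMapper as a fragment under the <Report> root tag.
<!-- Reader: streams rows from the database; your RowMapper builds the objects to be marshalled -->
<bean id="reportReader" class="org.springframework.batch.item.database.JdbcCursorItemReader">
  <property name="dataSource" ref="dataSource" />
  <property name="sql" value="SELECT ... FROM report_source" />
  <property name="rowMapper" ref="reportRowMapper" />
</bean>
<!-- Writer: StAX-based XML writer; rootTagName produces the enclosing <Report> element -->
<bean id="reportWriter" class="org.springframework.batch.item.xml.StaxEventItemWriter">
  <property name="resource" value="file:target/report.xml" />
  <property name="rootTagName" value="Report" />
  <property name="marshaller" ref="reportMarshaller" />
</bean>
<!-- Marshaller: XStream mapping of element names to your (placeholder) domain classes -->
<bean id="reportMarshaller" class="org.springframework.oxm.xstream.XStreamMarshaller">
  <property name="aliases">
    <map>
      <entry key="Header" value="com.example.report.Header" />
      <entry key="ITD" value="com.example.report.Itd" />
    </map>
  </property>
</bean>
Attributes such as documentId need extra XStream configuration (attribute aliases), and nested elements like Business and Associate map onto nested fields or collections of those classes.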

Related

How To Read XML File from Azure Data Lake In Synapse Notebook without Using Spark

I have an XML file stored in Azure Data Lake which I need to read from a Synapse notebook. But when I read it using the spark-xml library, I get this error:
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: `d:col`
Sample xml looks like this:
<m:properties>
<d:FileSystemObjectType m:type="Edm.Int32">0</d:FileSystemObjectType>
<d:Id m:type="Edm.Int32">10</d:Id>
<d:Modified m:type="Edm.DateTime">2021-03-25T15:35:17Z</d:Modified>
<d:Created m:type="Edm.DateTime">2021-03-25T15:35:17Z</d:Created>
<d:ID m:type="Edm.Int32">10</d:ID>
<d:Title m:null="true" />
<d:Description m:type="Edm.String">Test</d:Description>
<d:PurposeCode m:type="Edm.Int32">1</d:PurposeCode>
</m:properties>
Notice there are tags for d:Id and d:ID which are causing the duplicate error.
Found this documentation that states that although they differ in case, they are considered duplicates: https://learn.microsoft.com/en-us/azure/databricks/kb/sql/dupe-column-in-metadata
But I cannot modify the XML and have to read it as is. Is there a workaround so I can still read the XML?
Or is there a way to read the XML without using Spark? I'm thinking of using the scala.xml.XML library to load and parse the file. But when I attempt this, I get an error:
abfss:/<container>#<adls>.dfs.core.windows.net/<directory>/<xml_file>.xml (No such file or directory)
Code snippet below:
import scala.xml.XML
val xml = XML.loadFile("abfss://<container>#<adls>.dfs.core.windows.net/<directory>/<xml_file>.xml")
Note: the error really only displayed abfss:/ as opposed to the path in the parameter, which has //.
Thanks.
Found a way to set Spark to be case sensitive, and it is now able to read the XML successfully:
spark.conf.set("spark.sql.caseSensitive", "true")
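For reference, a rough PySpark sketch of that workaround (the equivalent calls work the same from Scala), assuming the spark-xml package is available on the pool. The path placeholders are kept from the question; note that the standard abfss URI uses @ between the container and the storage account:
# make schema resolution case sensitive so d:Id and d:ID stay distinct
spark.conf.set("spark.sql.caseSensitive", "true")

# read with spark-xml; rowTag is the element name exactly as it appears in the file
df = (spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "m:properties")
      .load("abfss://<container>@<adls>.dfs.core.windows.net/<directory>/<xml_file>.xml"))

df.printSchema()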

Parsing XML processing instructions in PySpark

I am trying to parse an XML file that has processing instructions, using Databricks spark-xml.
Example XML
<books>
<?SOURCE sample_file?>
<?DATE 12/01/2022?>
<book>
<title>Spark Tutorial</title>
<desc>Spark Tutorial for beginners</desc>
<author>John C</author>
<details>
<price>1234</price>
<pagecount>1000</pagecount>
<chapters>
<chapter>C1</chapter>
<chapter>C2</chapter>
<chapter>C3</chapter>
</chapters>
</details>
</book>
<book>
<title>Scala</title>
<desc>Scala Tutorial for beginners</desc>
<author>John C</author>
<details>
<price>599</price>
<pagecount>1000</pagecount>
<chapters>
<chapter>C10</chapter>
<chapter>C20</chapter>
<chapter>C30</chapter>
</chapters>
</details>
</book>
</books>
Is there any way to parse those XML processing instructions, SOURCE and DATE?
I can read the other XML tag values but am not able to read the processing instructions.
I tried the lxml library and was able to read the processing instructions, but I am not able to do the same using the spark-xml library.
Thanks in advance.
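As far as I know, spark-xml does not expose the processing instructions, so one workaround is to pull them out separately with lxml (which already worked for you) and attach them to the DataFrame as literal columns. A rough sketch, assuming the file is reachable on a local or mounted path (lxml cannot open abfss URIs directly); the file path and column names are only illustrative:
from lxml import etree
from pyspark.sql import functions as F

path = "/tmp/books.xml"

# grab every processing instruction, e.g. <?SOURCE sample_file?> -> {"SOURCE": "sample_file"}
tree = etree.parse(path)
pis = {pi.target: (pi.text or "").strip() for pi in tree.xpath("//processing-instruction()")}

# read the repeating <book> elements with spark-xml as usual
df = (spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "book")
      .load(path))

# attach the document-level PI values to every row
df = (df.withColumn("source_pi", F.lit(pis.get("SOURCE")))
        .withColumn("date_pi", F.lit(pis.get("DATE"))))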

Mirth -> HL7 to XML conversion question

I'm new to Mirth Connect and I'm struggling with converting HL7 into XML. Suppose my HL7 messages have repeating segments, like ORC in ORM messages; how do I iterate over them?
Below is my code:
tmp['Messages']['orderList']['order'][count]['provider']=msg['ORC'][count]['ORC.10']['ORC.10.1'].toString();
But it is throwing an error:
TypeError: Cannot read property "provider" from undefined.
Please help me to proceed further.
It's failing because your count is higher than the number of elements returned by tmp['Messages']['orderList']['order'], so it is returning undefined. The short answer is that you need to add another order node to tmp['Messages']['orderList'] before you can access it. It's hard to say how best to do that without seeing more of your code, requirements, outbound template, etc... Most frequently I build the node first, and then use appendChild to add it.
A simple example would be:
var tmp = <xml>
<Messages>
<orderList />
</Messages>
</xml>;
var prov = 12345;
var nextOrder = <order>
<provider>{prov}</provider>
</order>;
tmp.Messages.orderList.appendChild(nextOrder);
After which, tmp will look like:
<xml>
<Messages>
<orderList>
<order>
<provider>12345</provider>
</order>
</orderList>
</Messages>
</xml>
The technology you are using to work with XML is called E4X, and it runs on the Mozilla Rhino JavaScript engine. Here are a couple of resources that might help you:
https://web.archive.org/web/20181120184304/https://wso2.com/project/mashup/0.2/docs/e4xquickstart.html
https://developer.mozilla.org/en-US/docs/Archive/Web/E4X/Processing_XML_with_E4X

Vertica-Tableau error Multiple commands cannot be active

We have a dataset in Vertica, and Tableau queries the data (4 billion records) from Vertica for a dashboard, as described below.
All the lists and graphs are separate worksheets in Tableau and use the same connection to the Vertica DB. Each list is a column in the DB, sorted in descending order of the count of items in the dataset's respective column. The graphs are the same as the lists but calculated in a slightly different manner. Start Date and End Date form the date range for the data to be queried, like a data connection filter that restricts the query to a fixed window, for example the past week, last month, etc.
But I get this error:
[Vertica][VerticaDSII] (10) An error occurred during query preparation: Multiple commands cannot be active on the same connection. Consider increasing ResultBufferSize or fetching all results before initiating another command.
Is there any workaround for this issue, or any better way to do this?
You'll need a TDC file that specifies a particular ODBC connection-string option to get around the issue.
The guidance from Vertica was to add an ODBC connect-string parameter with the value "ResultBufferSize=0". This apparently forces the result buffer to be unlimited, preventing the error. This is simple enough to accomplish when building a connection string manually or working with a DSN, but Vertica is one of Tableau's native connectors. So how do you tell the native connector to do something else with its connection?
Native Connections in Tableau can be customized using TDC files
“Native connectors” still connect through the vendor’s ODBC drivers, and can be customized just the same as an “Other Databases” / ODBC connection. In the TDC files themselves, “ODBC” connections are referred to as “Generic ODBC”, which is a much more accurate way to think about the difference.
The full guide to TDC customizations, with all of the options, is available here, although it is pretty dense reading. One thing that isn't provided is an example of customizing a "native connector". The basic structure of a TDC file is this:
<?xml version='1.0' encoding='utf-8' ?>
<connection-customization class='genericodbc' enabled='true' version='7.7'>
<vendor name='' />
<driver name='' />
<customizations>
</customizations>
</connection-customization>
When using “Generic ODBC”, the class is “genericodbc” and then the vendor and driver name must be specified so that Tableau can know when the TDC file should be applied. It’s much simpler for a native connector — you just use the native connector name in all three places. The big list of native connector names is at the end of this article. Luckily for us, Vertica is simply referred to as “vertica”. So our Vertica TDC framework will look like:
<?xml version='1.0' encoding='utf-8' ?>
<connection-customization class='vertica' enabled='true' version='7.7'>
<vendor name='vertica' />
<driver name='vertica' />
<customizations>
</customizations>
</connection-customization>
This is a good start, but we need some actual customization tags to cause anything to happen. Per the documentation, to add additional elements to the ODBC connection string, we use a tag named ‘odbc-connect-string-extras‘. This would look like
<customization name='odbc-connect-string-extras' value='ResultBufferSize=0;' />
One important thing we discovered was that all ODBC connection extras need to be in this single tag. Because we wanted to turn on load balancing in the Vertica cluster, there was a second recommended parameter: ConnectionLoadBalance=1. To get both of these parameters in place, the correct method is:
<customization name='odbc-connect-string-extras' value='ResultBufferSize=0;ConnectionLoadBalance=1;' />
There is a whole set of other customizations you can put in place to see how they affect performance. Make sure you understand how each customization option is worded: if it starts with 'SUPPRESS', then giving a 'yes' value turns the feature off; other times you set the value to 'no' to turn the feature off. Some of the other ones we tried were:
<customization name='CAP_SUPPRESS_DISCOVERY_QUERIES' value='yes' />
<customization name='CAP_ODBC_METADATA_SUPPRESS_PREPARED_QUERY' value='yes' />
<customization name='CAP_ODBC_METADATA_SUPPRESS_SELECT_STAR' value='yes' />
<customization name='CAP_ODBC_METADATA_SUPPRESS_EXECUTED_QUERY' value='yes' />
<customization name='CAP_ODBC_METADATA_SUPRESS_SQLSTATISTICS_API' value='yes' />
<customization name= 'CAP_CREATE_TEMP_TABLES' value='no' />
<customization name= 'CAP_SELECT_INTO' value='no' />
<customization name= 'CAP_SELECT_TOP_INTO' value='no' />
The first set were mostly about reducing the number of queries for metadata detection, while the second set tells Tableau not to use TEMP tables.
The best way to see the results of these customizations is to change the TDC file and restart Tableau Desktop. Once you are satisfied with the changes, move the TDC file to your Tableau Server and restart it.
Where to put the TDC files
Per the documentation:
"For Tableau Desktop on Windows: Documents\My Tableau Repository\Datasources
For Tableau Server: Program Files\Tableau\Tableau Server\\bin
Note: The file must be saved using a .tdc extension, but the name does not matter."
If you are running a Tableau Server cluster, the .tdc file must be placed on every worker node in the bin folder so that the vizqlserver process can find it. I’ve also highlighted the biggest issue of all — you should edit these using a real text editor like Notepad++ or SublimeText rather than Notepad, because Notepad likes to save things with a hidden .TXT ending, and the TDC file will only be recognized if the ending is really .tdc, not .tdc.txt.
Restarting Tableau resolved my issue, which was giving the same error.

iPhone: repeating Elements in XML

I am new to XML parsing. I am parsing the following XML. There are tutorials for XML with unique attributes, but this XML has repeating attributes.
<?xml version="1.0" encoding="utf-8"?>
<start>
<Period periodType="A" fYear="2005" endCalYear="2005" endMonth="3">
<ConsEstimate type="High">
<ConsValue dateType="CURR">-8.9919</ConsValue>
</ConsEstimate>
<ConsEstimate type="Low">
<ConsValue dateType="CURR">-13.1581</ConsValue>
</ConsEstimate>
</Period>
<Period periodType="A" fYear="2006" endCalYear="2006" endMonth="3">
<ConsEstimate type="High">
<ConsValue dateType="CURR">-100.000</ConsValue>
</ConsEstimate>
<ConsEstimate type="Low">
<ConsValue dateType="CURR">-13.1581</ConsValue>
</ConsEstimate>
</Period>
</start>
I need to fetch the low and high values based on the years 2005 and 2006.
I agree with SB's comment; if you want to handle XML data structures, you should know at least the basic stuff.
A good tutorial I can recommend is the W3Schools XML tutorial.
Once you have done that, you should know that there are several ways to parse XML files. For flat files I recommend the TBXML library; it is really fast and easy to handle in your code.