Where is the Project Open Data data.xml format defined?

I would like to convert a dataset into the Project Open Data data.xml format so that I can ingest it via the DKAN harvester. I know there is a data.json format, but I prefer to use data.xml since the data is already XML and I am comfortable transforming it.
I can't seem to find where the data.xml schema/format is defined and would be grateful for pointers.

The data.json file is created by the Open Data Schema Map module in DKAN.
You have a couple of options:
Create a new endpoint at "/admin/config/services/odsm/add/api", using XML as the output format and "Project Open Data v1.1" as the schema.
You can do the same thing in code using the hook_open_data_schema_apis_defaults() hook that Open Data Schema Map uses. Copy the declaration for "data_json" into your own module, change the endpoint to 'endpoint' => 'data.xml', and set the format to xml.
You could also use hook_open_data_schema_map_results_alter() in a custom module to alter the schema to use "data.xml" and the format to XML.

Related

How To Read XML File from Azure Data Lake In Synapse Notebook without Using Spark

I have an XML file stored in Azure Data Lake which I need to read from a Synapse notebook. But when I read it using the spark-xml library, I get this error:
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: `d:col`
The sample XML looks like this:
<m:properties>
<d:FileSystemObjectType m:type="Edm.Int32">0</d:FileSystemObjectType>
<d:Id m:type="Edm.Int32">10</d:Id>
<d:Modified m:type="Edm.DateTime">2021-03-25T15:35:17Z</d:Modified>
<d:Created m:type="Edm.DateTime">2021-03-25T15:35:17Z</d:Created>
<d:ID m:type="Edm.Int32">10</d:ID>
<d:Title m:null="true" />
<d:Description m:type="Edm.String">Test</d:Description>
<d:PurposeCode m:type="Edm.Int32">1</d:PurposeCode>
</m:properties>
Notice there are tags for d:Id and d:ID, which are causing the duplicate error.
I found this documentation, which states that although they differ in case, they are considered duplicates: https://learn.microsoft.com/en-us/azure/databricks/kb/sql/dupe-column-in-metadata
But I cannot modify the XML and have to read it as is. Is there a workaround so I can still read the XML?
Or is there a way to read the XML without using Spark? I'm thinking of using the scala.xml.XML library to load and parse the file. But when I attempt this, I get an error:
abfss:/<container>#<adls>.dfs.core.windows.net/<directory>/<xml_file>.xml (No such file or directory)
Code snippet below:
import scala.xml.XML
val xml = XML.loadFile("abfss://<container>#<adls>.dfs.core.windows.net/<directory>/<xml_file>.xml")
Note: the error really only displayed abfss:/ as opposed to the path in the parameter, which has //.
Thanks.
Found a way: set Spark to be case sensitive, and it can now read the XML successfully:
spark.conf.set("spark.sql.caseSensitive", "true")
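For reference, a minimal sketch of the whole read with that setting in place, written in PySpark (the equivalent configuration works from Scala). The rowTag value and the abfss path placeholders here are assumptions, and the spark-xml library must be installed on the Spark pool:
spark.conf.set("spark.sql.caseSensitive", "true")
# "m:properties" is a guess at the row tag, based on the sample above
df = (spark.read.format("xml")
      .option("rowTag", "m:properties")
      .load("abfss://<container>@<account>.dfs.core.windows.net/<directory>/<xml_file>.xml"))
df.printSchema()  # d:Id and d:ID should now survive as separate columns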

Provide JSON Schema for a JSON file like settings JSON in Visual Studio Code

When editing the settings JSON file, Visual Studio Code suggests property names. I am assuming it's using a JSON Schema internally.
It's a very nice feature and could be useful for editing other JSON files that have a JSON Schema, but I can't find an option for adding a schema to a JSON file. If it's an internal feature, where is it located in the source code, and can we build an extension that surfaces it so we can provide a JSON Schema for editing a JSON file? Using the $schema property doesn't seem to be working.
I found out it's possible to provide those JSON Schemas. In settings.json there are a couple of examples, for Bower and package.json; they are under "json.schemas".
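For illustration, a minimal entry of that shape in settings.json (the file pattern and schema path below are made-up placeholders, not VS Code defaults):
"json.schemas": [
  {
    "fileMatch": ["/my-config.json"],
    "url": "./schemas/my-config.schema.json"
  }
]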
We can also do it this way: in your .json file, just after the first opening curly bracket, add a $schema property whose value is the path to the schema file:
{
  "$schema": "../node_modules/sp-build-tasks/schema/v1/sppp.json",
  ...start typing here...
}

Talend Component for Creating an XSD from an XML File

I have an XML file with data in it.
I am looking for a component in Talend which can create an XSD file for me from the XML input.
There are online utilities to do this (freeformatter.com/xsd-generator), but is there a component in Talend?
In Talend there is no XSD output component.
If what you need is a schema, you can load the XML under Metadata and get the XML schema from there; XML and XSD usually complement each other in the same file definition.
Once done, you can use the XSD generation of a tool such as Oxygen.
You can try the tAdvancedFileOutputXML component of Talend.
Here you can use the following advanced setting:
Advanced settings >> Create associated XSD file.
You can find the details about the component here:
https://help.talend.com/display/TalendOpenStudioComponentsReferenceGuide521EN/22.1+tAdvancedFileOutputXML

Perl XML parser to extract the default values of optional attributes from the XSD file

I need to parse an XML file through a Perl script. While parsing the file, I need to validate it against the corresponding XSD (schema definition) file. The XSD file contains some optional attributes with default values provided. I have to parse the XML file in such a way that I can fetch, from the XSD, the default values for all the attributes that are not present in the XML. The issue is that I couldn't find any appropriate parser in Perl for this job. I could find a parser that is able to validate the file, but it couldn't get the default values of the attributes that are not present in the XML.
Can you please suggest a parser/module that can help me get the required data.
Note: I need the values of the attributes that are not present in the XML to make a complete record for database insertion.
Thanks for your time,
You might want to take a look at XML::Compile. It appears to me that the default values feature is supported, but I've never actually used the module at all.
You could also change the database schema so all fields have defaults.

What is the best file parsing solution for converting files?

I am looking for the best solution for custom file parsing for our enterprise import routines. I basically want to change one file format into a standard file format and have one routine that imports that data into the database. I need to be able to create custom scripts for each client, since it's difficult to get customers to comply with a standard or template format. I have looked at PowerShell and IronPython to do this so far, but I am not sure this is the route I want to go. I have also looked at tools such as Talend, a drag-and-drop style tool, which may or may not give me the flexibility I want. We are a .NET shop and have written custom code to do this in the past, but I need something that is quicker to create than coding custom parsing functions each time we get a new file format.
Depending on the complexity and variability of your work, you should consider an ETL tool like SSIS (SQL Server Integration Services).
Python is wonderful for this kind of thing; that's why we use it. Each new customer transfer is a new adventure, and Python gives us the flexibility to respond quickly.
Edit: all Python scripts that read files are "custom file parsers". Without an actual file format, it's not possible to provide a detailed example, but the skeleton looks like this:
with open( "some file", "r" ) as source:
    for line in source:
        process( line )
That's about all there is to a "custom file parser". If you're parsing .csv or .xml files, Python has modules for that. If you're parsing fixed-format files, you'd use string slicing operations. If you're parsing other files (X12? JSON? YAML?), you'll need appropriate parsers.
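CSV.
A minimal sketch, not part of the original answer, assuming a process() callback that accepts one dict per row:
import csv
def process_csv( aFile ):
    with open( aFile, newline="" ) as source:
        # DictReader handles quoting and keys each row by the header line
        for record in csv.DictReader( source ):
            process( record )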
Tab-Delim.
from collections import namedtuple
RecordLayout = namedtuple('RecordLayout',['field1','field2','field3',...])
def process( aLine ):
    # strip the newline, split on tabs, and unpack into the named tuple
    record = RecordLayout( *aLine.rstrip('\n').split('\t') )
    ...
Fixed Layout.
from collections import namedtuple
RecordLayout = namedtuple('RecordLayout',['field1','field2','field3',...])
def process( aLine ):
    # slice the fixed-width columns out of the line
    fields = ( aLine[:10], aLine[10:20], aLine[20:30], ... )
    record = RecordLayout( *fields )
    ...