Create a DataFrame from an XML file, turning the parent row tag into a column - PySpark

I have the following xml:
<?xml version="1.0" encoding="UTF-8"?>
<Tables>
<Table Name="T1">
<Field Order="1" ColumnName="c1" Length="1" Type="C" />
<Field Order="2" ColumnName="c2" Length="50" Type="C" />
<Field Order="3" ColumnName="c3" Length="2" Type="C" />
</Table>
<Table Name="T2">
<Field Order="1" ColumnName="c1" Length="9" Type="I" />
<Field Order="2" ColumnName="c2" Length="120" Type="C" />
</Table>
</Tables>
And I want to transform it into the following DataFrame using PySpark:
TableName  Order  ColumnName  Length  Type
T1         1      c1          1       C
T1         2      c2          50      C
T1         3      c3          2       C
T2         1      c1          9       I
T2         2      c2          120     C
How can I accomplish this using PySpark?
I tried:
df = spark.read.format('com.databricks.spark.xml').option("rowTag", "Field").load(path)
But with a solution like this I lose the Table Name.
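One way that should keep the Table Name (an untested sketch, assuming the spark-xml package and its default "_" attribute prefix): read with rowTag set to Table instead of Field, so each row carries the Name attribute, then explode the nested Field array.
from pyspark.sql.functions import col, explode

# Read one row per <Table> so the Name attribute survives.
df = (spark.read.format('com.databricks.spark.xml')
      .option("rowTag", "Table")
      .load(path))

# spark-xml maps XML attributes to columns prefixed with "_" by default,
# so each row has _Name plus a Field array of structs.
result = (df
          .select(col("_Name").alias("TableName"),
                  explode(col("Field")).alias("f"))
          .select("TableName",
                  col("f._Order").alias("Order"),
                  col("f._ColumnName").alias("ColumnName"),
                  col("f._Length").alias("Length"),
                  col("f._Type").alias("Type")))
result.show()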

Related

XPath: extract field name and "column" name from JDO mapping

This is my first time dealing with XPath and XML data. I have the XPath query below, which I pieced together from some Stack Overflow answers. I want to extract all the field and column names:
with t(x) as (
values
('<?xml version="1.0" encoding="UTF-8"?>
<mapping>
<package name="mypackage">
<class name="mytable">
<jdbc-class-map type="base" pk-column="id" table="public.mytable" />
<jdbc-version-ind type="version-number" column="version" />
<jdbc-class-ind type="myclass" column="jdoclass" />
<field name="majorVersion">
<jdbc-field-map type="value" column="majorversion" />
</field>
<field name="minorVersion">
<jdbc-field-map type="value" column="minorversion" />
</field>
<field name="patchVersion">
<jdbc-field-map type="value" column="patchversion" />
</field>
<field name="version">
<jdbc-field-map type="value" column="version0" />
</field>
<field name="webAddress">
<jdbc-field-map type="value" column="webaddress" />
</field>
</class>
</package>
</mapping>'::xml)
)
select
unnest(xpath('./package/class/field/text()', x)) as "fieldname",
unnest(xpath('./package/class/field/jdbc-field-map/text()', x)) as "columns"
from t
The above query returns fieldname empty and columns as null. I understand there is some problem with the XPath.
I expect to see the field name and column lists:
fieldName columns
--------------------------
majorversion majorversion
minorversion minorversion
...
Your query returns nothing because text() selects text nodes, but the values here live in attributes, which XPath addresses with @. If you want to turn XML into a "table", this is typically done much more easily using xmltable():
select info.*
from t
cross join xmltable('/mapping/package/class/field' passing x
columns fieldname text path '@name',
"column" text path './jdbc-field-map/@column') as info
I was able to achieve the result with:
with myTempTable(myXmlColumn) as (
values ('<?xml version="1.0" encoding="UTF-8"?>
<mapping>
<package name="mypackage">
<class name="mytable">
<jdbc-class-map type="base" pk-column="id" table="public.mytable" />
<jdbc-version-ind type="version-number" column="version" />
<jdbc-class-ind type="myclass" column="jdoclass" />
<field name="majorVersion">
<jdbc-field-map type="value" column="majorversion" />
</field>
<field name="minorVersion">
<jdbc-field-map type="value" column="minorversion" />
</field>
<field name="patchVersion">
<jdbc-field-map type="value" column="patchversion" />
</field>
<field name="version">
<jdbc-field-map type="value" column="version0" />
</field>
<field name="webAddress">
<jdbc-field-map type="value" column="webaddress" />
</field>
</class>
</package>
</mapping>'::xml))
SELECT
unnest(xpath('//package/class/field/jdbc-field-map/@column', myTempTable.myXmlColumn))::text AS columns,
unnest(xpath('//package/class/field/@name', myTempTable.myXmlColumn))::text AS fieldName
FROM myTempTable
result
columns fieldName
--------------------------
"majorversion" "majorVersion"
"minorversion" "minorVersion"
"patchversion" "patchVersion"
"version0" "version"
"webaddress" "webAddress"

Error: text index required for $text query

I would like to know how to add text indexes to my XML ODM configuration to solve this problem and search by name.
<document name="App\Document\Doc" db="db" collection="collection"
repository-class="App\Repository\DocRepository">
<id field-name="id" strategy="INCREMENT" type="int"/>
<field field-name="code" name="code" type="string"/>
<field field-name="name" name="name" type="string"/>
<field field-name="type" name="type" type="string"/>
<indexes>
???
</indexes>
</document>
After digging into some code, I found that this works:
<document name="App\Document\Doc" db="db" collection="collection"
repository-class="App\Repository\DocRepository">
<id field-name="id" strategy="INCREMENT" type="int"/>
<field field-name="code" name="code" type="string"/>
<field field-name="name" name="name" type="string"/>
<field field-name="type" name="type" type="string"/>
<indexes>
<index name="fts">
<key name="code" order="text" />
<key name="name" order="text" />
<key name="type" order="text" />
</index>
</indexes>
</document>
However, using the "order" attribute to declare a text index seems counterintuitive.
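For reference, this mapping should be equivalent to creating a compound text index by hand; a minimal pymongo sketch (connection details are hypothetical):
from pymongo import MongoClient, TEXT

client = MongoClient("mongodb://localhost:27017")  # hypothetical host
coll = client["db"]["collection"]

# The ODM mapping above should produce an index like this one:
# a compound text index named "fts" over code, name and type.
coll.create_index([("code", TEXT), ("name", TEXT), ("type", TEXT)], name="fts")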

Data type not working in Solr

I want to fetch records, including a date field, from Cassandra into Solr. The following is my code:
in dataconfig.xml:
<entity name="artist" query="SELECT artist_id, name, email, total_jobs, created FROM artist_list">
<field column="artist_id" template="ARTIST_${artist.artist_id}" name="id"/>
<field column="created" name="artist_created" />
</entity>
in schema.xml:
<fieldType name="tdate" class="solr.TrieDoubleField" omitNorms="true" />
<field name="artist_created" type="tdate" indexed="false" stored="true"/>
But the result did not contain the created field. Can anyone tell me what the problem is?
You are defining the tdate data type as solr.TrieDoubleField, which is a numeric type. That's why the result doesn't contain the artist_created data.
Change your schema to:
<fieldType name="date" class="solr.TrieDateField" precisionStep="0" positionIncrementGap="0"/>
<field name="artist_created" type="date" indexed="false" stored="true"/>

Getting a JSONParseException when indexing fields from MongoDB collection in SOLR using DataImportHandler

I am seeing this exception while trying to index data from a MongoDB collection:
Exception while processing: products document : SolrInputDocument(fields: []):org.apache.solr.handler.dataimport.DataImportHandlerException: com.mongodb.util.JSONParseException:
{idStr,name,code,description,price,brand,size,color}
^
at org.apache.solr.handler.dataimport.MongoEntityProcessor.initQuery(MongoEntityProcessor.java:46)
at org.apache.solr.handler.dataimport.MongoEntityProcessor.nextRow(MongoEntityProcessor.java:54)
at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:244)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:476)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:415)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:481)
at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:462)
Caused by: com.mongodb.util.JSONParseException:
{idStr,name,code,description,price,brand,size,color}
^
at com.mongodb.util.JSONParser.parseString(JSON.java:387)
Following is my data-source-config file, in the dataimport directory of my core's conf folder:
<dataConfig>
<dataSource name="mymongodb" type="MongoDataSource" database="mongodb://*.*.*.*/testdb" />
<document name="data">
<entity
name="products"
processor="MongoEntityProcessor"
query="{idStr,name,code,description,price,brand,size,color}"
collection="products"
datasource="mymongodb"
transformer="MongoMapperTransformer" >
<field column="idstr" name="idstr" mongoField="idStr"/>
<field column="name" name="name" mongoField="name"/>
<field column="code" name="code" mongoField="code"/>
<field column="description" name="description" mongoField="description"/>
<field column="price" name="price" mongoField="price"/>
<field column="brand" name="brand" mongoField="brand"/>
<field column="size" name="size" mongoField="size"/>
<field column="color" name="color" mongoField="color"/>
<entity
name="categories"
processor="MongoEntityProcessor"
query="{'idStr':'${categories.idstr}'}"
collection="categories"
datasource="mymongodb"
transformer="MongoMapperTransformer">
<field column="type" name="type" mongoField="type"/>
</entity>
</entity>
</document>
</dataConfig>
I am trying to join the idStr field of the categories collection with the idStr of the products collection (field name => idstr) and get the above fields (name, description, ... from products and the type field from categories).
Any comments or solutions on this exception would be really appreciated.
Your Solr field is declared as idstr, but you are referencing it in the query attribute of dataConfig as idStr (camel-case difference).
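That said, the JSONParseException itself points at the query attribute: {idStr,name,...} is not a valid JSON document, because MongoDB expects a filter document, with field selection supplied separately as a projection. For illustration, the equivalent query in pymongo (connection string is hypothetical):
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # hypothetical host
products = client["testdb"]["products"]

# An empty filter selects all documents; the second argument is the
# projection that limits which fields are returned.
cursor = products.find(
    {},
    {"idStr": 1, "name": 1, "code": 1, "description": 1,
     "price": 1, "brand": 1, "size": 1, "color": 1},
)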
I was able to resolve this ...
Following is the working configuration in the data-source-config file:
<entity
name="products"
query="select idStr,name,code,description,price,brand,size,color from products">
<field name="prodidStr" column="idStr" />
<field name="name" column="name" />
<field name="code" column="name" />
<field name="description" column="description" />
<field name="price" column="price" />
<field name="brand" column="brand" />
<field name="size" column="size" />
<field name="color" column="color" />
<entity
name="categories"
dataSource="mongod"
query="select idStr,ancestors from categories where idStr = '${products.idStr}'">
<field name="catidStr" column="idStr" />
<field name="ancestors" column="ancestors" />
</entity>
</entity>

How to make a many2one field auto-complete in Odoo (formerly OpenERP)?

We have a partner_id field in the sale order form. I want to make that field auto-complete when the user starts typing a value; any suggestions are welcome.
My code is:
<?xml version="1.0" encoding="utf-8"?>
<openerp>
<data>
<record id="air_odoo_sale_order_view" model="ir.ui.view">
<field name="name">air.odoo.sale.order.view</field>
<field name="model">sale.order</field>
<field name="inherit_id" ref="sale.view_order_form" />
<field name="arch" type="xml">
<field name="partner_id" position="attributes">
<attribute name="group">base.group_manager,base.group_sale_salesman</attribute>
</field>
</field>
</record>
</data>
</openerp>