spark structured streaming for XML files - apache-spark-xml

I am trying to parse XML files with the Databricks spark-xml package (spark-xml_2.11 from com.databricks) using Structured Streaming (spark.readStream...).
When I perform the readStream operation, it fails with an "Unsupported operation" error for readStream.
Please advise whether there are plans to support this, or suggest another way to achieve XML streaming.
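For reference, one commonly suggested workaround is to stream the raw files as text and parse each one with spark-xml's from_xml helper (available in newer spark-xml releases). A minimal sketch, where the input directory, the one-document-per-file assumption, and the record schema are all assumptions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructType}
import com.databricks.spark.xml.functions.from_xml

val spark = SparkSession.builder.appName("xml-stream").getOrCreate()
import spark.implicits._

// Assumed schema of one XML document; adjust to your files.
val schema = new StructType()
  .add("id", StringType)
  .add("value", StringType)

// Stream each new file as a single string (one XML document per file),
// then parse it with from_xml instead of using spark-xml as a source format.
val raw = spark.readStream
  .option("wholetext", "true") // one row per file
  .text("/data/incoming/xml/") // hypothetical input directory

val parsed = raw
  .select(from_xml($"value", schema).as("doc"))
  .select("doc.*")

parsed.writeStream.format("console").start().awaitTermination()
```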

Related

Spark Parquet Examine Metadata

One of Parquet's key features is its metadata, including custom metadata.
However, I have been completely unable to read this metadata from Spark.
I have Parquet files that contain file-level metadata describing the data within them. How can I access that metadata from Spark?
I'm currently using Scala for my Spark applications and reading the files into a DataFrame with spark.read.parquet.
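For what it's worth, spark.read.parquet does not surface the file-level key/value metadata on the DataFrame; one way to get at it is to drop down to the Parquet Hadoop API and read a footer directly. A rough sketch, assuming parquet-hadoop 1.9+ on the classpath and a hypothetical file path:

```scala
import scala.collection.JavaConverters._
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

// Open one Parquet footer and print the custom key/value metadata
// that the producing application wrote into it.
val conf = new Configuration()
val inputFile = HadoopInputFile.fromPath(new Path("/data/part-00000.parquet"), conf)
val reader = ParquetFileReader.open(inputFile)
try {
  val keyValueMeta = reader.getFooter.getFileMetaData.getKeyValueMetaData
  keyValueMeta.asScala.foreach { case (k, v) => println(s"$k -> $v") }
} finally {
  reader.close()
}
```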

How to use spark streaming to get data from HBASE table using scala

I am trying to identify a solution for reading data from an HBase table using Spark Streaming and writing it to another HBase table.
I found numerous samples on the internet that create a DStream to get data from HDFS files, but I was unable to find any examples that read from HBase tables.
For example, if I have an HBase table 'SAMPLE' with columns 'name' and 'activeStatus', how can I retrieve new data from the table SAMPLE based on the activeStatus column using Spark Streaming?
Any examples of retrieving data from an HBase table using Spark Streaming are welcome.
Regards,
Adarsh K S
You can connect to HBase from Spark in multiple ways:
Hortonworks Spark HBase connector (SHC):
https://github.com/hortonworks-spark/shc
Unicredit hbase-rdd: https://github.com/unicredit/hbase-rdd
Hortonworks SHC reads HBase directly into a DataFrame using a user-defined catalog, whereas hbase-rdd reads it as an RDD that can be converted to a DataFrame with the toDF method. hbase-rdd also has a bulk-write option (writing HFiles directly), which is preferred for massive writes.
What you need is a library that enables Spark to interact with HBase. Hortonworks' SHC is such an extension:
https://github.com/hortonworks-spark/shc
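For completeness, a minimal SHC read sketch. The table and column names come from the question above; the column family "cf" and the "active" filter value are assumptions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

// User-defined catalog mapping the HBase table 'SAMPLE' to DataFrame columns.
// The column family "cf" is an assumption; adjust to your table layout.
val catalog =
  s"""{
     |  "table": {"namespace": "default", "name": "SAMPLE"},
     |  "rowkey": "key",
     |  "columns": {
     |    "rowkey":       {"cf": "rowkey", "col": "key",          "type": "string"},
     |    "name":         {"cf": "cf",     "col": "name",         "type": "string"},
     |    "activeStatus": {"cf": "cf",     "col": "activeStatus", "type": "string"}
     |  }
     |}""".stripMargin

val spark = SparkSession.builder.appName("hbase-to-df").getOrCreate()

val df = spark.read
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()

// Filter on activeStatus, as asked in the question.
df.filter(df("activeStatus") === "active").show()
```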

How to parse EDIFACT file data using apache spark?

Can someone suggest how to parse EDIFACT-format data using Apache Spark?
I have a requirement where EDIFACT data is written to an AWS S3 bucket every day, and I am trying to find the best way to convert this data to a structured format using Apache Spark.
If you have your invoices in EDIFACT format, you can read each of them as one string per invoice using RDDs; you will then have an RDD[String] that represents the distributed invoice collection. Take a look at https://github.com/CenPC434/java-tools: with this you can convert the EDIFACT strings to XML. The repo https://github.com/databricks/spark-xml shows how to use the XML format as an input source to create DataFrames and perform multiple queries, aggregations, etc.
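A rough outline of that pipeline, where edifactToXml stands in for whatever conversion call the java-tools library actually exposes (check its README), and the paths and row tag are assumptions:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("edifact-to-df").getOrCreate()
val sc = spark.sparkContext

// One invoice per file: wholeTextFiles yields (path, content) pairs.
val invoices = sc.wholeTextFiles("s3a://my-bucket/edifact/*.edi").values

// Hypothetical wrapper around the CenPC434/java-tools converter.
def edifactToXml(edi: String): String = ??? // call the library's EDIFACT-to-XML API here

// Convert, persist the XML, and let spark-xml build a DataFrame from it.
invoices.map(edifactToXml).saveAsTextFile("s3a://my-bucket/invoices-xml/")

val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "Invoice") // assumed row tag
  .load("s3a://my-bucket/invoices-xml/")

df.printSchema()
```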

Custom record reader for PST file format in Spark Scala

I am working with PST files. I have written custom record readers for MapReduce programs for various input formats, but this time it is going to be Spark.
I am not finding any clues or documentation on implementing record readers in Spark.
Can somebody help with this? Is it possible to implement this functionality in Spark?
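One possible approach, sketched below: Spark can reuse a Hadoop InputFormat directly via SparkContext.newAPIHadoopFile, so a custom record reader written for MapReduce can be plugged in unchanged. PstInputFormat here is a hypothetical skeleton standing in for that existing code, and the input path is an assumption:

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.{InputSplit, JobContext, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical skeleton: plug in the PST RecordReader you wrote for MapReduce.
class PstInputFormat extends FileInputFormat[LongWritable, Text] {
  override def createRecordReader(split: InputSplit,
                                  context: TaskAttemptContext): RecordReader[LongWritable, Text] =
    ??? // return your existing PST record reader here

  override def isSplitable(context: JobContext, file: Path): Boolean =
    false // PST files can't be split at arbitrary byte offsets
}

val sc = new SparkContext(new SparkConf().setAppName("pst-reader"))

// Spark reuses the Hadoop InputFormat; each record comes back as a (key, value) pair.
val records = sc.newAPIHadoopFile(
  "hdfs:///data/mailboxes/*.pst", // hypothetical input path
  classOf[PstInputFormat],
  classOf[LongWritable],
  classOf[Text]
)

records.map { case (_, v) => v.toString }.take(10).foreach(println)
```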

Sending data from my spark code to redshift

I have Spark code written in Scala. My code reads an XML file and extracts all the information in it. The goal is to store that information in Redshift tables.
Is it possible to send the data directly from my Scala Spark code to Redshift without using S3?
Cheers!
If you're using Spark SQL, you can read your XML data into a DataFrame using spark-xml and then write it into Redshift tables using spark-redshift.
You can also take a look at this question.
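A minimal end-to-end sketch of that answer; note that spark-redshift stages data through a tempdir in S3, so strictly speaking it does not avoid S3, it only hides it. The row tag, paths, table name, and credentials are placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("xml-to-redshift").getOrCreate()

// Read the XML into a DataFrame with spark-xml ("record" row tag is assumed).
val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "record")
  .load("/data/input.xml")

// spark-redshift writes via COPY, staging the data in the S3 tempdir below.
df.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://cluster.example.com:5439/dev?user=u&password=p")
  .option("dbtable", "public.my_table")
  .option("tempdir", "s3n://my-bucket/tmp/")
  .option("forward_spark_s3_credentials", "true") // one of several credential mechanisms
  .mode("append")
  .save()
```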
You can do row-level inserts using pre-prepared SQL statements in your Python/Java code, but it will be extremely inefficient if you are going to insert more than a few records.
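If avoiding S3 entirely is a hard requirement, Spark's plain JDBC writer is one option; a sketch, reusing the df from the block above, with placeholder connection details. It issues row-level inserts, so it carries exactly the inefficiency just described:

```scala
import java.util.Properties

val props = new Properties()
props.setProperty("user", "my_user")         // placeholder credentials
props.setProperty("password", "my_password")
props.setProperty("driver", "com.amazon.redshift.jdbc42.Driver") // Redshift JDBC driver

// Row-level inserts over JDBC: no S3 staging, but slow beyond a few records.
df.write
  .mode("append")
  .jdbc("jdbc:redshift://cluster.example.com:5439/dev", "public.my_table", props)
```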