Custom record reader for PST file format in Spark Scala - scala

I am working with PST files. I have written custom record readers for MapReduce programs for different input formats before, but this time it needs to be Spark.
I cannot find any clues or documentation on implementing record readers in Spark.
Can somebody help with this? Is it possible to implement this functionality in Spark?
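For what it's worth, Spark can reuse a Hadoop InputFormat (and therefore its RecordReader) directly through SparkContext.newAPIHadoopFile, so the MapReduce approach carries over. Below is a rough sketch of what that could look like; PstInputFormat and PstRecordReader are skeletal placeholders, and the actual PST parsing (e.g. with a library such as java-libpst) is left as a TODO.

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{BytesWritable, Text}
import org.apache.hadoop.mapreduce.{InputSplit, JobContext, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, FileSplit}
import org.apache.spark.{SparkConf, SparkContext}

// Skeleton input format: one whole PST file per split; the record reader is where
// a real PST parser would emit one record per mail message.
class PstInputFormat extends FileInputFormat[Text, BytesWritable] {
  override protected def isSplitable(context: JobContext, file: Path): Boolean = false
  override def createRecordReader(split: InputSplit,
                                  context: TaskAttemptContext): RecordReader[Text, BytesWritable] =
    new PstRecordReader
}

class PstRecordReader extends RecordReader[Text, BytesWritable] {
  private var done = false
  private val key = new Text()
  private val value = new BytesWritable()

  override def initialize(split: InputSplit, context: TaskAttemptContext): Unit = {
    val file = split.asInstanceOf[FileSplit].getPath
    key.set(file.toString)
    // TODO: open `file` with a PST parsing library and iterate over its messages here.
  }
  override def nextKeyValue(): Boolean =
    if (done) false else { done = true; true }   // placeholder: emits a single empty record per file
  override def getCurrentKey: Text = key
  override def getCurrentValue: BytesWritable = value
  override def getProgress: Float = if (done) 1.0f else 0.0f
  override def close(): Unit = ()
}

object PstReaderSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pst-reader"))
    val messages = sc.newAPIHadoopFile(
      "hdfs:///data/mail/archive.pst",            // assumed input path
      classOf[PstInputFormat], classOf[Text], classOf[BytesWritable],
      sc.hadoopConfiguration)
    println(s"records read: ${messages.count()}")
    sc.stop()
  }
}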

Related

How to convert InfluxDB Line Protocol to Parquet in NiFi

I have InfluxDB Line Protocol records coming into NiFi via a ConsumeKafka processor, which are then merged into flowfiles containing 10,000 records each. Now I'd like to convert them to Parquet and store them in HDFS, with the end goal of building Impala tables for the end user. Is there a way to convert Line Protocol to something consumable by the PutParquet processor, or another way to convert to Parquet files?
I did find a custom influxlineprotocolreader processor; however, there is very little information and no examples (that I've found) on how to use it, so I'm not sure whether it fits this use case.
Alternatively, I can use Spark to do the conversion and write Parquet files, but I was hoping to do everything in NiFi if at all possible, especially since I haven't found many resources on doing such a conversion in Spark either (I'm new to both Spark and NiFi).
There is nothing out of the box in NiFi that understands InfluxDB line protocol. You would have to implement something that converts it to a known format such as JSON or Avro, and then you could go to Parquet; or, if you implemented an InfluxDbRecordReader, you could use ConvertRecord with that and a Parquet writer to go directly between the two.
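For the Spark alternative mentioned in the question, here is a minimal sketch, assuming simple line protocol with no escaped spaces or commas and a timestamp present on every line (paths are made up):

import org.apache.spark.sql.SparkSession

// Parses lines like: weather,city=SF temp=21.5,humidity=0.6 1556813561098000000
case class Point(measurement: String, tags: Map[String, String],
                 fields: Map[String, String], timestamp: Long)

object LineProtocolToParquet {
  def parse(line: String): Point = {
    val Array(head, fieldPart, ts) = line.trim.split(" ", 3)
    val headParts = head.split(",")
    val tags   = headParts.drop(1).map { kv => val Array(k, v) = kv.split("=", 2); k -> v }.toMap
    val fields = fieldPart.split(",").map { kv => val Array(k, v) = kv.split("=", 2); k -> v }.toMap
    Point(headParts.head, tags, fields, ts.trim.toLong)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("lineprotocol-to-parquet").getOrCreate()
    import spark.implicits._

    spark.read.textFile("hdfs:///landing/influx/*.txt")   // assumed landing directory
      .filter(_.trim.nonEmpty)
      .map(parse)
      .write.mode("append")
      .parquet("hdfs:///warehouse/influx_points")          // Impala tables can be pointed at this location

    spark.stop()
  }
}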

How to parse EDIFACT file data using Apache Spark?

Can someone suggest how to parse EDIFACT format data using Apache Spark?
I have a requirement where EDIFACT data is written to an AWS S3 bucket every day. I am trying to find the best way to convert this data into a structured format using Apache Spark.
If you have your invoices in EDIFACT format, you can read each one of them as one string per invoice using RDDs. You will then have an RDD[String] representing the distributed invoice collection. Take a look at https://github.com/CenPC434/java-tools; with this you can convert the EDIFACT strings to XML. This repo https://github.com/databricks/spark-xml shows how to use XML as an input source to create DataFrames and perform multiple queries, aggregations, etc.
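A rough sketch of that pipeline, assuming one EDIFACT invoice per S3 object; EdifactToXml.convert is a hypothetical stand-in for the CenPC434/java-tools converter (its exact API is not verified here), and the rowTag value is assumed:

import org.apache.spark.sql.SparkSession

object EdifactPipeline {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("edifact-to-df").getOrCreate()
    val sc = spark.sparkContext

    // One (path, content) pair per S3 object, i.e. one EDIFACT invoice per element.
    val invoices = sc.wholeTextFiles("s3a://my-bucket/edifact/2018-06-01/*")

    // EdifactToXml.convert is a hypothetical wrapper around the java-tools EDIFACT-to-XML translation.
    val xmlStrings = invoices.map { case (_, edifact) => EdifactToXml.convert(edifact) }

    // Persist the XML, then let spark-xml infer a schema and build a DataFrame.
    xmlStrings.saveAsTextFile("s3a://my-bucket/xml/2018-06-01")
    val df = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "Invoice")            // assumed root element name
      .load("s3a://my-bucket/xml/2018-06-01")

    df.printSchema()
    spark.stop()
  }
}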

Difference between file write using Spark and Scala, and the advantages?

DF().write
.format("com.databricks.spark.csv")
.save("filepath/selectedDataset.csv")
vs
scala.tools.nsc.io.File("/Users/saravana-6868/Desktop/hello.txt").writeAll("String")
In the above code, I write a file using both a DataFrame and plain Scala. What is the difference between the two approaches?
The first piece of code uses the Spark API to write the DataFrame to a file in CSV format. With it you can write to HDFS or the local file system, and you can even repartition and parallelize the write. The second piece of code uses the Scala API, which can only write to the local file system, and you cannot parallelize it. The first piece of code leverages the whole cluster to do its work; the second does not.
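For example (a sketch using the built-in CSV writer of Spark 2.x and made-up paths):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("write-comparison").getOrCreate()
val df = spark.range(1000000).toDF("id")        // stand-in for the question's DataFrame

// Spark writes one part-file per partition, in parallel across the executors.
df.repartition(8)
  .write
  .option("header", "true")
  .csv("hdfs:///output/selectedDataset")        // a directory of part files, not a single CSV

// scala.tools.nsc.io.File, by contrast, runs only on the driver and produces
// one file on the driver's local filesystem.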

Data Analysis Scala on Spark

I am new to Scala, and I have to use Scala and Spark's SQL, MLlib and GraphX in order to perform some analysis on a huge data set. The analyses I want to do are:
Customer Lifetime Value (CLV)
Centrality measures (degree, eigenvector, edge-betweenness, closeness)
The data is in a CSV file (60 GB, 3 years of transactional data) located on a Hadoop cluster.
My question is: what is the optimal approach to access the data and perform the above calculations?
Should I load the data from the CSV file into a DataFrame and work on the DataFrame? or
Should I load the data from the CSV file, convert it into an RDD and then work on the RDD? or
Is there any other approach to access the data and perform the analyses?
Thank you so much in advance for your help.
DataFrames give you SQL-like syntax to work with the data, whereas RDDs give you Scala-collection-like methods for data manipulation.
One extra benefit of DataFrames is that the underlying Spark engine will optimise your queries, just like a SQL query optimiser. This is not available for RDDs.
As you are new to Scala, it is highly recommended to use the DataFrame API initially and pick up the RDD API later, based on your requirements.
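To illustrate the difference between the two APIs on a toy aggregation (column names and data are made up):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder.appName("df-vs-rdd").getOrCreate()
import spark.implicits._

val df = Seq(("c1", 100.0), ("c2", 50.0), ("c1", 25.0)).toDF("customer", "amount")

// DataFrame: declarative, SQL-like; Catalyst can optimise the plan.
val totalsDf = df.groupBy("customer").agg(sum("amount").as("total"))

// RDD: collection-style transformations; no query optimisation.
val totalsRdd = df.rdd
  .map(row => (row.getString(0), row.getDouble(1)))
  .reduceByKey(_ + _)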
You can use the Databricks CSV reader API, which is easy to use and returns a DataFrame. It automatically infers data types. If the file has a header, it can automatically use that as the schema; otherwise you can construct a schema using StructType.
https://github.com/databricks/spark-csv
Update:
If you are using Spark 2.0, it supports the CSV data source by default; see the link below.
https://spark.apache.org/releases/spark-release-2-0-0.html#new-features
See this link for how to use it; there is also a short usage sketch below.
https://github.com/databricks/spark-csv/issues/367
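Here is the promised sketch showing both options on Spark 2.0+ (paths, column names and types are assumptions):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("csv-load").getOrCreate()

// Option 1: let Spark use the header row and infer column types.
val inferred = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///data/transactions/*.csv")

// Option 2: supply an explicit schema (faster, no extra inference pass over the data).
val schema = StructType(Seq(
  StructField("customer_id", StringType, nullable = false),
  StructField("amount", DoubleType, nullable = true),
  StructField("ts", TimestampType, nullable = true)))

val typed = spark.read
  .option("header", "true")
  .schema(schema)
  .csv("hdfs:///data/transactions/*.csv")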

Sending data from my Spark code to Redshift

I have Spark code written in Scala. My code reads an XML file and extracts all the info in it. The goal is to store the info from the XML in Redshift tables.
Is it possible to send the data directly from my Scala Spark code to Redshift without using S3?
Cheers!
If you're using Spark SQL, you can read your XML data into a DataFrame using spark-xml and then write it into Redshift tables using spark-redshift.
You can also take a look at this question.
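A rough sketch of that route (row tag, connection details and paths are placeholders; note that the spark-redshift connector itself stages data through the S3 directory given in tempdir):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("xml-to-redshift").getOrCreate()

// Parse the XML into a DataFrame with spark-xml.
val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "record")                           // assumed row element name
  .load("hdfs:///input/data.xml")

// Write to Redshift with spark-redshift; credentials and S3 access configuration omitted.
df.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://host:5439/mydb?user=user&password=pass")
  .option("dbtable", "my_schema.my_table")
  .option("tempdir", "s3a://my-bucket/redshift-tmp")    // the connector requires an S3 staging dir
  .mode("append")
  .save()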
You can do row-level inserts using prepared SQL statements in your Python/Java code, but it will be extremely inefficient if you are going to insert more than a few records.
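For completeness, the row-level approach looks roughly like this with plain JDBC from Scala (connection details and table are placeholders); each executeUpdate is a separate network round trip, which is why it does not scale:

import java.sql.DriverManager

val conn = DriverManager.getConnection(
  "jdbc:redshift://host:5439/mydb", "user", "pass")     // placeholder connection details
try {
  val stmt = conn.prepareStatement("INSERT INTO my_table (id, name) VALUES (?, ?)")
  val rows = Seq((1, "alice"), (2, "bob"))               // e.g. a small collected result
  rows.foreach { case (id, name) =>
    stmt.setInt(1, id)
    stmt.setString(2, name)
    stmt.executeUpdate()                                 // one round trip per row
  }
  stmt.close()
} finally {
  conn.close()
}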