How to parse EDIFACT file data using Apache Spark? - scala

Can someone suggest how to parse EDIFACT-format data using Apache Spark?
I have a requirement where EDIFACT data is written to an AWS S3 bucket every day. I am trying to find the best way to convert this data to a structured format using Apache Spark.

If you have your invoices in EDIFACT format, you can read each one as a single String per invoice using RDDs. You will then have an RDD[String] representing the distributed invoice collection. Take a look at https://github.com/CenPC434/java-tools, which you can use to convert the EDIFACT strings to XML. This repo, https://github.com/databricks/spark-xml, shows how to use XML as an input source to create DataFrames and perform multiple queries, aggregations, etc.
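A minimal sketch of that flow, assuming one EDIFACT invoice per S3 object; the bucket paths, the converter entry point (`EdifactToXml.convert`) and the `rowTag` value are placeholders, since the actual class names depend on the java-tools release you use:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("EdifactToDataFrame").getOrCreate()

// One EDIFACT invoice per file: wholeTextFiles yields (path, content) pairs,
// giving one full invoice per RDD element.
val edifact = spark.sparkContext
  .wholeTextFiles("s3a://my-bucket/edifact/*")      // hypothetical bucket/prefix
  .values

// Convert each EDIFACT message to XML. EdifactToXml.convert is a placeholder
// for whatever entry point the java-tools converter exposes.
val xml = edifact.map(message => EdifactToXml.convert(message))

// Persist the XML and read it back with spark-xml to get a structured DataFrame.
xml.saveAsTextFile("s3a://my-bucket/edifact-xml/")  // hypothetical output path
val invoices = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "Invoice")                      // adjust to the tag the converter emits
  .load("s3a://my-bucket/edifact-xml/")

invoices.printSchema()
```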

Related

Is there a way to validate each row of a Spark DataFrame?

My setup:
I have a text file containing multiple JSON entries.
I want to access each JSON entry and verify its key-value pairs.
Is there a way to do this using PySpark?
I tried loading the text file into a Spark session and validating its schema using the dataframe.schema() function, but I recently learnt that dataframe.schema() does data sampling and doesn't validate all the records in the DataFrame.
You're probably better off using a framework like Deequ (https://github.com/awslabs/python-deequ) to test your dataset.
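The link above is the Python port (PyDeequ); Deequ's native API is Scala (https://github.com/awslabs/deequ). A minimal sketch of row-level checks with the Scala API, where the column names and constraints are placeholder assumptions:

```scala
import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}

// df is the DataFrame built from the JSON entries; the columns below are hypothetical.
val result = VerificationSuite()
  .onData(df)
  .addCheck(
    Check(CheckLevel.Error, "row-level checks")
      .isComplete("id")                                // no nulls in the key column
      .isNonNegative("amount")                         // numeric sanity check
      .isContainedIn("status", Array("OPEN", "CLOSED")))
  .run()

if (result.status != CheckStatus.Success) {
  println("Data quality checks failed; inspect result.checkResults for details.")
}
```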

Spark Elasticsearch connector fails to parse dates as dates; they are parsed as long

I am pushing HDFS Parquet data to Elasticsearch using the ES Spark connector.
I have two date columns that I am unable to parse as dates; they keep being ingested as long. They have the following formats:
basic_ordinal_date: YYYYddd
epoch_millis: in milliseconds; e.g. 1555498747861
I tried the following
Defining ingest pipelines
Defining mappings
Defining dynamic mappings
Defining index_templates with mapping
Converting my date columns to string so Elastic does some pattern matching
Depending on the method, either I get errors and my Spark job fails, or the documents are pushed but the columns in question remain parsed as long.
How should I proceed to have my ordinal dates and epoch millis parsed as dates?
Thank you in advance for your help
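One possible approach (not from this thread) is to convert the columns to proper date/timestamp types on the Spark side before indexing, so the connector maps them as dates; the column names and index below are assumptions:

```scala
import org.apache.spark.sql.functions._

val prepared = df
  // basic_ordinal_date stored as a number like 2019107 (year + day-of-year)
  .withColumn("ordinal_date", to_date(col("ordinal_date").cast("string"), "yyyyDDD"))
  // epoch_millis stored as a long; divide to seconds and cast to timestamp
  .withColumn("event_time", (col("event_time") / 1000).cast("timestamp"))

prepared.write
  .format("org.elasticsearch.spark.sql")
  .option("es.resource", "my-index")    // hypothetical index name
  .save()
```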

How to convert InfluxDB Line Protocol to Parquet in NiFi

I have InfluxDB Line Protocol records coming into NiFi via a ConsumeKafka processor, which are then merged into flowfiles containing 10,000 records. Now I'd like to get them converted to Parquet and stored in HDFS, with an end goal of building Impala tables for the end user. Is there a way to convert Line Protocol to something consumable by the PutParquet processor, or another way to convert to Parquet files?
I did find a custom influxlineprotocolreader processor; however, there's very little information and no examples (that I've found) on how to use it, so I'm not sure if it fits this use case.
Alternatively, I can use Spark to do the conversion and write Parquet files, but I was hoping to do everything in NiFi if at all possible, especially since I haven't found many resources on doing such a conversion in Spark either (I'm new to both Spark and NiFi).
There is nothing out of the box in NiFi that understands InfluxDB line protocol. You would have to implement something that converts it to a known format like JSON, Avro, etc., and then you could go to Parquet; or, if you implemented an InfluxDbRecordReader, you could use ConvertRecord with that and a Parquet record writer to go directly between the two.
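For the Spark route mentioned in the question, a minimal sketch that parses the basic line-protocol shape (measurement[,tags] fields timestamp) and writes Parquet; the paths are hypothetical and the naive parser ignores escaped commas, escaped spaces, and quoted field values:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("LineProtocolToParquet").getOrCreate()
import spark.implicits._

case class Point(measurement: String, tags: Map[String, String],
                 fields: Map[String, String], timestamp: Long)

// Naive parser: splits on spaces/commas/equals, so it does not handle
// escaped delimiters or quoted string field values.
def parseLine(line: String): Point = {
  val Array(head, fieldPart, ts) = line.trim.split(" ", 3)
  val headParts = head.split(",")
  val tags = headParts.drop(1).map(_.split("=", 2)).map(kv => kv(0) -> kv(1)).toMap
  val fields = fieldPart.split(",").map(_.split("=", 2)).map(kv => kv(0) -> kv(1)).toMap
  Point(headParts.head, tags, fields, ts.toLong)
}

val points = spark.read.textFile("hdfs:///landing/line-protocol/*")   // hypothetical input
  .map(parseLine)

points.write.mode("overwrite").parquet("hdfs:///warehouse/points/")   // hypothetical output
```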

Storing & reading custom metadata in parquet files using Spark / Scala

I know Parquet files store metadata, but is it possible to add custom metadata to a Parquet file using Scala (preferably) and Spark?
The idea is that I store many similarly structured Parquet files in Hadoop storage, but each has a uniquely named source (a String field, also present as a column in the Parquet file). I'd like to access this information without the overhead of actually reading the Parquet data, and possibly even remove this redundant column from the Parquet file.
I really don't want to put this info in a filename, so my best option right now is just to read the first row of each Parquet file and use the source column as a String field.
It works, but I was just wondering if there is a better way.
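A minimal sketch of one workaround, attaching the source name to a column's schema metadata; Spark persists the schema, including this metadata, in the Parquet footer, so reading it back does not scan the row data. The column name, value, and path are assumptions:

```scala
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.MetadataBuilder

// Attach the source name to the column's metadata before writing.
val meta = new MetadataBuilder().putString("source", "supplier-A").build()  // hypothetical value
val withMeta = df.withColumn("source", col("source").as("source", meta))
withMeta.write.parquet("hdfs:///data/invoices/supplier-A/")                 // hypothetical path

// Reading back: .schema only needs the footer, not the row data.
val schema = spark.read.parquet("hdfs:///data/invoices/supplier-A/").schema
val source = schema("source").metadata.getString("source")
```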

Data Analysis Scala on Spark

I am new to Scala, and I have to use Scala and Spark's SQL, MLlib and GraphX to perform some analysis on a huge data set. The analyses I want to do are:
Customer lifetime value (CLV)
Centrality measures (degree, eigenvector, edge-betweenness, closeness)
The data is in a CSV file (60 GB, three years of transactional data) located on a Hadoop cluster.
My question is: what is the optimal approach to access the data and perform the above calculations?
Should I load the data from the CSV file into a DataFrame and work on the DataFrame? or
Should I load the data from the CSV file, convert it into an RDD, and then work on the RDD? or
Is there any other approach to access the data and perform the analyses?
Thank you so much in advance for your help.
DataFrames give you SQL-like syntax to work with the data, whereas RDDs give you Scala-collection-like methods for data manipulation.
One extra benefit of DataFrames is that the underlying Spark engine will optimise your queries, just like SQL query optimisation; this is not available for RDDs.
As you are new to Scala, it's highly recommended to use the DataFrame API initially and to pick up the RDD API later based on your requirements.
You can use the Databricks CSV reader API, which is easy to use and returns a DataFrame. It automatically infers data types. If the file has a header it can automatically use that as the schema; otherwise you can construct a schema using StructType.
https://github.com/databricks/spark-csv
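A minimal sketch with the spark-csv package on Spark 1.x (the HDFS path is a placeholder):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// spark-csv package (com.databricks:spark-csv); infers column types from the data.
val transactions = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")                // use the header row as column names
  .option("inferSchema", "true")           // sample the data to infer types
  .load("hdfs:///data/transactions.csv")   // hypothetical path
```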
Update:
If you are using Spark 2.0, CSV is supported as a built-in data source by default; please see the link below.
https://spark.apache.org/releases/spark-release-2-0-0.html#new-features
See this link for how to use it.
https://github.com/databricks/spark-csv/issues/367
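With Spark 2.0+ the same read works through the built-in CSV source, without the external package (the path is again a placeholder):

```scala
val transactions = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///data/transactions.csv")   // hypothetical path

transactions.printSchema()
```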