My activity setup:
I have a text file containing multiple JSON entries.
I want to access each JSON entry and verify its key-value pairs.
Is there a way to do this using PySpark?
I tried to load the text file by reading it into a Spark session and validating its schema using the dataframe.schema() function. But I recently learnt that dataframe.schema() relies on data sampling and doesn't validate all the records in the dataframe.
You're probably better off using a framework like Deequ (https://github.com/awslabs/python-deequ) to test your dataset.
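A minimal sketch with python-deequ might look like the following. It assumes an existing SparkSession named spark, one JSON object per line in the text file, and placeholder column names (id, status); the Deequ jar also has to be available to the Spark session.
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

# Read the text file as newline-delimited JSON; each line becomes one row
df = spark.read.json("entries.txt")

# Declare the rules every record's key/value pairs must satisfy
check = Check(spark, CheckLevel.Error, "json entry checks")
result = (VerificationSuite(spark)
          .onData(df)
          .addCheck(check
                    .isComplete("id")                                  # id is never null
                    .isUnique("id")                                    # no duplicate ids
                    .isContainedIn("status", ["ACTIVE", "INACTIVE"]))  # only allowed values
          .run())

# Constraints are evaluated over every record, not a sample
VerificationResult.checkResultsAsDataFrame(spark, result).show()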
I'm able to write pytest functions by manually supplying column names and values to create a DataFrame and passing it to the production code to check all the transformed field values in a Palantir Foundry code repository.
Instead of manually passing column names and their respective values, I want to store all the required data in a dataset, import that dataset into the pytest function, fetch all the required values, and pass them to the production code to check all the transformed field values.
Is there any way to accept a dataset as input to the test function in a Palantir code repository?
You can't read from a dataset at the time of running the tests. However, perhaps you can create the test dataset, save it as a CSV, and then in the test read the CSV using the test's Spark session. The code for that would look like spark.read.csv("path").
This will add file I/O time and slow your test down, so my recommendation is to just create the DataFrame using test data that is already in memory. The code for that would be spark.createDataFrame(data).
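A minimal sketch of that in-memory approach, assuming a hypothetical production transform add_totals (replace the import with your own function) and a plain local Spark session for self-containment:
import pytest
from pyspark.sql import SparkSession

from myproject.transforms import add_totals  # hypothetical import of the production transform

@pytest.fixture(scope="session")
def spark():
    # Plain local session for the sketch; your test harness may already provide one
    return SparkSession.builder.master("local[1]").appName("tests").getOrCreate()

def test_add_totals(spark):
    # Build the input from in-memory test data instead of reading a dataset
    input_df = spark.createDataFrame(
        [("A", 10), ("B", 20)],
        ["customer_id", "amount"],
    )
    output_df = add_totals(input_df)
    assert "total" in output_df.columns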
Can someone suggest how to parse EDIFACT-format data using Apache Spark?
I have a requirement where EDIFACT data is written to an AWS S3 bucket every day. I am trying to find the best way to convert this data to a structured format using Apache Spark.
If you have your invoices in EDIFACT format, you can read each one of them as one string per invoice using RDDs. You will then have an RDD[String] that represents the distributed invoice collection. Take a look at https://github.com/CenPC434/java-tools: with it you can convert the EDIFACT strings to XML. The repo https://github.com/databricks/spark-xml shows how to use XML as an input source to create DataFrames and perform queries, aggregations, etc.
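A rough PySpark sketch of that pipeline is below, with hypothetical bucket paths and row tag; edifact_to_xml stands in for a wrapper around the java-tools converter (being a Java library, that step may be easier to call directly from Scala/Java).
# Assumes an existing SparkSession `spark`; one string per invoice file
invoices = spark.sparkContext.wholeTextFiles("s3a://my-bucket/edifact/").values()

# edifact_to_xml is a hypothetical wrapper around the CenPC434/java-tools converter
xml_strings = invoices.map(edifact_to_xml)
xml_strings.saveAsTextFile("s3a://my-bucket/edifact-xml/")

# spark-xml reads the XML into a structured DataFrame for queries and aggregations
df = (spark.read.format("xml")            # short name registered by the spark-xml package
          .option("rowTag", "Invoice")    # hypothetical row tag
          .load("s3a://my-bucket/edifact-xml/"))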
I am playing around with Apache Spark with the Azure CosmosDB connectors in Scala and was wondering if anyone had examples or insight on how I would write my DataFrame back to a collection in my CosmosDB. Currently I am able to connect to my one collection and return the data and manipulate it but I want to write the results back to a different collection inside the same database.
I created a writeConfig that contains my EndPoint, MasterKey, Database, and the Collection that I want to write to.
I then tried writing it to the collection using the following line.
manipulatedData.toJSON.write.mode(SaveMode.Overwrite).cosmosDB(writeConfig)
This runs fine and does not display any errors but nothing is showing up in my collection.
I went through the documentation I could find at https://github.com/Azure/azure-cosmosdb-spark but did not have much luck with finding any examples of writing data back to the database.
If there is an easier way to write to DocumentDB/Cosmos DB than what I am doing, I am open to any options.
Thanks for any help.
You can save to Cosmos DB directly from a Spark DataFrame, just as you noted. You may not need to use toJSON; for example:
// Import SaveMode so you can Overwrite, Append, ErrorIfExists, Ignore
import org.apache.spark.sql.{Row, SaveMode, SparkSession}
// Create a new DataFrame `df` which has slightly modified flights information
// i.e. change the delay value to -999
val df = spark.sql("select -999 as delay, distance, origin, date, destination from c limit 5")
// Save to Cosmos DB (using Append in this case)
// Ensure the baseConfig contains a Read-Write Key
// The key provided in our examples is a Read-Only Key
df.write.mode(SaveMode.Append).cosmosDB(baseConfig)
As for the documentation, you are correct in that the save function should have been better called out. I've created "Include in User Guide / sample scripts how to save to Cosmos DB" (#91) to address this.
As for saving but seeing no error: by any chance is your config using the read-only key instead of the read-write key? I just created "Saving to CosmosDB using read-only key has no error" (#92) calling out the same issue.
I know Parquet files store metadata, but is it possible to add custom metadata to a Parquet file using Scala (preferably) with Spark?
The idea is that I store many similarly structured Parquet files in Hadoop storage, each with a uniquely named source (a String field, also present as a column in the Parquet file). I'd like to access this information without the overhead of actually reading the Parquet data, and possibly even remove this redundant column from the Parquet file.
I really don't want to put this info in a filename, so my best option right now is just to read the first row of each Parquet file and use the source column as a String field.
It works, but I was just wondering if there is a better way.
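For reference, the workaround described above looks roughly like this in PySpark (the path is hypothetical and the Scala DataFrame API is equivalent): selecting only the source column before taking one row lets Parquet's column pruning skip the rest of the file.
# Assumes an existing SparkSession `spark`
source = (spark.read.parquet("hdfs:///data/part-0001.parquet")
               .select("source")      # column pruning: only this column is read
               .first()["source"])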
I am new to Scala, and I have to use Scala and Spark SQL, MLlib and GraphX in order to perform some analyses on a huge data set. The analyses I want to do are:
Customer lifetime value (CLV)
Centrality measures (degree, eigenvector, edge-betweenness, closeness)
The data is in a CSV file (60 GB, 3 years of transactional data) located in a Hadoop cluster.
My question is about the optimal approach to access the data and perform the above calculations:
Should I load the data from the CSV file into a DataFrame and work on the DataFrame? Or
Should I load the data from the CSV file, convert it into an RDD, and then work on the RDD? Or
Is there any other approach to access the data and perform the analyses?
Thank you so much in advance for your help.
DataFrames give you SQL-like syntax to work with the data, whereas RDDs give you Scala-collection-like methods for data manipulation.
One extra benefit of DataFrames is that the underlying Spark engine will optimise your queries, much like a SQL query optimiser does. This is not available with RDDs.
As you are new to Scala, it is highly recommended to start with the DataFrame API and pick up the RDD API later as the need arises.
You can use the Databricks CSV reader API, which is easy to use and returns a DataFrame. It automatically infers data types. If the file has a header, it can automatically use that as the schema; otherwise you can construct a schema using StructType.
https://github.com/databricks/spark-csv
Update:
If you are using Spark 2.0, it supports the CSV data source by default; please see the link below.
https://spark.apache.org/releases/spark-release-2-0-0.html#new-features
See this link for how to use it:
https://github.com/databricks/spark-csv/issues/367
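As a rough illustration (shown in PySpark syntax for consistency with the earlier sketches; the Scala DataFrameReader calls carry the same names), with a placeholder path and columns:
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Option 1: use the header row for column names and let Spark infer the types
df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("hdfs:///data/transactions/"))

# Option 2: supply an explicit schema, which avoids the extra inference pass
schema = StructType([
    StructField("customer_id", StringType(), True),
    StructField("amount", DoubleType(), True),
])
df = spark.read.option("header", "true").schema(schema).csv("hdfs:///data/transactions/")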