How to convert wrapped array to dataset in spark scala?

Hi, I am new to Spark Scala. I have this structure in a JSON file which I need to convert to a dataset, but I am unable to do so because of the nested data.
I tried something like the following, which I got from another post, but it does not work. Can someone please suggest a solution?
spark.read.json(path).map(r=>r.getAs[mutable.WrappedArray[String]]("readings"))

Your JSON format is not valid for Spark to convert into a dataframe directly: each JSON record that should become a dataframe/dataset row must be on a single line.
So the first step is to read the JSON file and convert it into valid (single-line) JSON. You can use the wholeTextFiles api and a couple of replacements.
val rdd = sc.wholeTextFiles("path to your json text file")
val validJson = rdd.map(_._2.replace(" ", "").replace("\n", ""))
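(As an aside, if you happen to be on Spark 2.2 or later, which is an assumption about your version, the reader's multiLine option can replace this manual flattening.)
// Spark 2.2+ only: let the JSON reader handle multi-line records directly
val multiLineDf = spark.read.option("multiLine", true).json("path to your json text file")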
The second step is to convert the valid JSON data into a dataframe or dataset. Here I am using a dataframe:
val dataFrame = sqlContext.read.json(validJson)
which should give you
+--------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|did |readings |
+--------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|d7cc92c24be32d5d419af1277289313c|[[WrappedArray([aa1111111111111111c1111111111111112222222222e,AppleiOS,-46,49,ITU++], [09dfs1111111111111c1111111111111112222222222e,AppleiOS,-50,45,ITU++]),1506770544]]|
+--------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
root
|-- did: string (nullable = true)
|-- readings: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- clients: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- cid: string (nullable = true)
| | | | |-- clientOS: string (nullable = true)
| | | | |-- rssi: long (nullable = true)
| | | | |-- snRatio: long (nullable = true)
| | | | |-- ssid: string (nullable = true)
| | |-- ts: long (nullable = true)
Now selecting the WrappedArray is an easy step:
dataFrame.select("readings.clients")
which should give you
+------------------------------------------------------------------------------------------------------------------------------------------------------------+
|clients |
+------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[WrappedArray([aa1111111111111111c1111111111111112222222222e,AppleiOS,-46,49,ITU++], [09dfs1111111111111c1111111111111112222222222e,AppleiOS,-50,45,ITU++])]|
+------------------------------------------------------------------------------------------------------------------------------------------------------------+
I hope the answer is helpful
Updated
Dataframes and datasets are almost the same, except that datasets are type-safe (they are backed by encoders) and can be optimized better than dataframes.
Long story short, you can change the dataframe into a dataset by creating case classes. For your case you would need three case classes:
case class client(cid: String, clientOS: String, rssi: Long, snRatio: Long, ssid: String)
case class reading(clients: Array[client], ts: Long)
case class dataset(did: String, readings: Array[reading])
And then cast the dataframe to dataset as
val dataSet = sqlContext.read.json(validJson).as[dataset]
You should have dataset in your hand :)
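One detail worth noting (assuming a Spark version with the SQL implicits available): the .as[dataset] conversion needs the implicit encoders in scope, roughly like this:
import sqlContext.implicits._   // brings the encoders needed by .as[dataset] into scope

val dataSet = sqlContext.read.json(validJson).as[dataset]
// typed access to the nested fields, e.g. collect all client ids
dataSet.flatMap(_.readings.flatMap(_.clients.map(_.cid))).show()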

You cannot create a Dataset with the following code:
spark.read.json(path).map(r => r.getAs[WrappedArray[String]]("readings"))
Check the schema of the clients type for the DataFrame created by reading the JSON:
spark.read.json(path).printSchema
root
|-- did: string (nullable = true)
|-- readings: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- clients: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- cid: string (nullable = true)
| | | | |-- clientOS: string (nullable = true)
| | | | |-- rssi: long (nullable = true)
| | | | |-- snRatio: long (nullable = true)
| | | | |-- ssid: string (nullable = true)
| | |-- ts: long (nullable = true)
You can get a scala.collection.mutable.WrappedArray object with the code below (note that getAs only casts, so the elements are actually Rows at runtime):
spark.read.json(path).first.getAs[WrappedArray[Row]]("readings")
If you need to create a dataframe, use the below:
spark.read.json(path).select("readings.clients")
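If what you actually need is one row per client rather than the nested array, a hedged sketch using explode (column names taken from the schema above) would be:
import org.apache.spark.sql.functions._

spark.read.json(path)
  .select(col("did"), explode(col("readings")).as("reading"))
  .select(col("did"), col("reading.ts"), explode(col("reading.clients")).as("client"))
  .select("did", "ts", "client.*")
  .show(false)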

Related

how to explode a dataframe schema in databricks

I have a schema that should be exploded; below is the schema:
|-- CaseNumber: string (nullable = true)
|-- Customers: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Contacts: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- FirstName: string (nullable = true)
| | | | |-- LastName: string (nullable = true)
I want my schema to be like this,
|-- CaseNumber: string (nullable = true)
|-- FirstName: string (nullable = true)
|-- LastName: string (nullable = true)
or
+-----------+----------+---------+
| CaseNumber| FirstName| LastName|
+-----------+----------+---------+
|          1|        aa|       bb|
|          2|        cc|       dd|
+-----------+----------+---------+
I am new to Databricks, any help would be appreciated. Thanks.
Here is one way to solve it without using the explode command: define case classes that mirror the schema, cast the dataframe to a typed dataset, and flatten it with flatMap.
case class Contact(FirstName: String, LastName: String)
case class Customer(Contacts: Array[Contact])
case class MyCase(CaseNumber: String, Customers: Array[Customer])

import spark.implicits._
val dataset = dataframe.as[MyCase]   // dataframe is the DF from the question
// one output row per first/last name, repeating the case number for each contact
dataset.flatMap { mycase =>
  for {
    customer <- mycase.Customers
    contact  <- customer.Contacts
  } yield (mycase.CaseNumber, contact.FirstName, contact.LastName)
}
I think you can still do explode(customersFlat.contacts). I did something like this a while ago, so forgive my syntax and let me know whether this works:
df.select(col("caseNumber"), explode(col("customersFlat.contacts")).as("contacts"))
  .select("caseNumber", "contacts.firstName", "contacts.lastName")

Adding a new column to a DataFrame with a complex column (Array&lt;Map&lt;String,String&gt;&gt;)

I am loading a Dataframe from an external source with the following schema:
|-- A: string (nullable = true)
|-- B: timestamp (nullable = true)
|-- C: long (nullable = true)
|-- METADATA: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- M_1: integer (nullable = true)
| | |-- M_2: string (nullable = true)
| | |-- M_3: string (nullable = true)
| | |-- M_4: string (nullable = true)
| | |-- M_5: double (nullable = true)
| | |-- M_6: string (nullable = true)
| | |-- M_7: double (nullable = true)
| | |-- M_8: boolean (nullable = true)
| | |-- M_9: boolean (nullable = true)
|-- E: string (nullable = true)
Now I need to add a new column, METADATA_PARSED, with column type Array of the following case class:
case class META_DATA_COL(M_1: String, M_2: String, M_3: String, M_10: String)
My approach here, based on examples, is to create a UDF and pass in the METADATA column. But since it is of a complex type, I am having a lot of trouble parsing it.
On top of that, in the UDF I need to do some string manipulation for the "new" field M_10 as well, so I need to access each of the elements in the metadata column.
What would be the best way to approach this issue? I attempted to convert the source dataframe (+METADATA) to a case class; but that did not work as it was translated back to spark WrappedArray types upon entering the UDF.
You can use something like this:
import org.apache.spark.sql.functions._
val tempdf = df.select(
explode( col("METADATA")).as("flat")
)
val processedDf = tempdf.select( col("flat.M_1"),col("flat.M_2"),col("flat.M_3"))
Now write a UDF:
def processudf = udf((col1: Int, col2: String, col3: String) => s"$col1 $col2 $col3" /* placeholder for the real processing */)
This should help; I can provide more help if you share more details on the processing.
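If you specifically need the parsed values to stay together as an array column (METADATA_PARSED, as the question asks), a hedged sketch of a UDF that receives the complex column as Seq[Row] and returns case class instances might look like the following; the M_10 derivation is only a placeholder:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._

case class META_DATA_COL(M_1: String, M_2: String, M_3: String, M_10: String)

val parseMetadata = udf((metadata: Seq[Row]) => metadata.map { m =>
  val m1 = Option(m.getAs[Integer]("M_1")).map(_.toString).orNull
  val m2 = m.getAs[String]("M_2")
  val m3 = m.getAs[String]("M_3")
  META_DATA_COL(m1, m2, m3, s"${m2}_derived")   // placeholder string manipulation for M_10
})

val withParsed = df.withColumn("METADATA_PARSED", parseMetadata(col("METADATA")))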

How can I perform ETL on a Spark Row and return it to a dataframe?

I'm currently using Scala Spark for some ETL and have a base dataframe that has the following schema:
|-- round: string (nullable = true)
|-- Id : string (nullable = true)
|-- questions: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- tag: string (nullable = true)
| | |-- bonusQuestions: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- difficulty : string (nullable = true)
| | |-- answerOptions: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- followUpAnswers: array (nullable = true)
| | | |-- element: string (containsNull = true)
|-- school: string (nullable = true)
I only need to perform ETL on rows where the round type is primary (there are 2 types, primary and secondary). However, I need both types of rows in my final table.
I'm stuck doing the ETL, which should work as follows:
If tag is non-bonus, then bonusQuestions should be set to null and difficulty should be null.
I'm currently able to access most fields of the DF like
val round = tr.getAs[String]("round")
Next, I'm able to get the questions array using
val questionsArray = tr.getAs[Seq[StructType]]("questions")
and can iterate using for (question <- questionsArray) {...}. However, I cannot access struct fields like question.bonusQuestions or question.tag, which returns an error:
error: value tag is not a member of org.apache.spark.sql.types.StructType
Spark represents each StructType element as a GenericRowWithSchema, or more generally as a Row. So instead of Seq[StructType] you have to use Seq[Row]:
val questionsArray = tr.getAs[Seq[Row]]("questions")
and in the loop for (question <- questionsArray) {...} you can get the data of Row as
for (question <- questionsArray) {
val tag = question.getAs[String]("tag")
val bonusQuestions = question.getAs[Seq[String]]("bonusQuestions")
val difficulty = question.getAs[String]("difficulty")
val answerOptions = question.getAs[Seq[String]]("answerOptions")
val followUpAnswers = question.getAs[Seq[String]]("followUpAnswers")
}
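If you then need to push the transformed rows back into a dataframe, one hedged option is to work with a typed Dataset instead of raw Rows; the case class and dataframe names below are mine, not from your code, and assume df is the base dataframe:
case class Question(tag: String, bonusQuestions: Seq[String], difficulty: String,
                    answerOptions: Seq[String], followUpAnswers: Seq[String])
case class Record(round: String, Id: String, questions: Seq[Question], school: String)

import spark.implicits._
val transformed = df.as[Record].map { r =>
  if (r.round == "primary") {
    // null out bonusQuestions and difficulty for non-bonus questions, as described above
    val cleaned = r.questions.map { q =>
      if (q.tag == "non-bonus") q.copy(bonusQuestions = null, difficulty = null) else q
    }
    r.copy(questions = cleaned)
  } else r
}
val resultDf = transformed.toDF()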
I hope the answer is helpful

Spark SQL data frame

Data structure:
{"Emp":{"Name":"John", "Sal":"2000", "Address":[{"loc":"Sanjose","Zip":"222"},{"loc":"dayton","Zip":"333"}]}}
Now I want to load the data into a data frame and append Zip to loc. The loc column name should stay the same (loc). The transformed data should look like this:
{"Emp":{"Name":"John", "Sal":"2000", "Address":[{"loc":"Sanjose222","Zip":"222"},{"loc":"dayton333","Zip":"333"}]}}
No RDDs. I need a data frame operation to achieve this, preferably with the withColumn function. How can I do this?
Given a data structure as
val jsonString = """{"Emp":{"Name":"John","Sal":"2000","Address":[{"loc":"Sanjose","Zip":"222"},{"loc":"dayton","Zip":"333"}]}}"""
You can convert it to a dataframe as
val df = spark.read.json(sc.parallelize(jsonString::Nil))
which would give you
+-----------------------------------------------------+
|Emp |
+-----------------------------------------------------+
|[WrappedArray([222,Sanjose], [333,dayton]),John,2000]|
+-----------------------------------------------------+
//root
// |-- Emp: struct (nullable = true)
// | |-- Address: array (nullable = true)
// | | |-- element: struct (containsNull = true)
// | | | |-- Zip: string (nullable = true)
// | | | |-- loc: string (nullable = true)
// | |-- Name: string (nullable = true)
// | |-- Sal: string (nullable = true)
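(Side note, assuming Spark 2.2 or later: the RDD-based json overload is deprecated there, and reading from a Dataset[String] gives the same result.)
import spark.implicits._
val df2 = spark.read.json(Seq(jsonString).toDS)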
Now, to get the desired output, you need to split the Emp struct column into separate columns and pass the Address array column to a udf function:
import org.apache.spark.sql.functions._
def attachZipWithLoc = udf((array: Seq[Row])=> array.map(row => address(row.getAs[String]("loc")+row.getAs[String]("Zip"), row.getAs[String]("Zip"))))
df.select($"Emp.*")
.withColumn("Address", attachZipWithLoc($"Address"))
.select(struct($"Name".as("Name"), $"Sal".as("Sal"), $"Address".as("Address")).as("Emp"))
where the address used in the udf is a case class
case class address(loc: String, Zip: String)
which should give you
+-----------------------------------------------------------+
|Emp |
+-----------------------------------------------------------+
|[John,2000,WrappedArray([Sanjose222,222], [dayton333,333])]|
+-----------------------------------------------------------+
//root
// |-- Emp: struct (nullable = false)
// | |-- Name: string (nullable = true)
// | |-- Sal: string (nullable = true)
// | |-- Address: array (nullable = true)
// | | |-- element: struct (containsNull = true)
// | | | |-- loc: string (nullable = true)
// | | | |-- Zip: string (nullable = true)
Now, to get the JSON, you can just use .toJSON and you should get
+-----------------------------------------------------------------------------------------------------------------+
|value |
+-----------------------------------------------------------------------------------------------------------------+
|{"Emp":{"Name":"John","Sal":"2000","Address":[{"loc":"Sanjose222","Zip":"222"},{"loc":"dayton333","Zip":"333"}]}}|
+-----------------------------------------------------------------------------------------------------------------+

Partitioning by column in Apache Spark to S3

I have a use-case where we want to read files from S3 which contain JSON. Then, based on a particular JSON node value, we want to group the data and write it to S3.
I am able to read the data but cannot find a good example of how to partition the data based on a JSON key and then upload it to S3. Can anyone provide an example or point me to a tutorial that can help me with this use-case?
I have got the schema of my data after creating the dataframe:
root
|-- customer: struct (nullable = true)
| |-- customerId: string (nullable = true)
|-- experiment: string (nullable = true)
|-- expiryTime: long (nullable = true)
|-- partitionKey: string (nullable = true)
|-- programId: string (nullable = true)
|-- score: double (nullable = true)
|-- startTime: long (nullable = true)
|-- targetSets: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- featured: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- data: struct (nullable = true)
| | | | | |-- asinId: string (nullable = true)
| | | | |-- pk: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | |-- reason: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- recommended: array (nullable = true)
| | | |-- element: string (containsNull = true)
I want to partition the data based on a random hash of the customerId column. But when I do this:
df.write.partitionBy("customerId").save("s3/bucket/location/to/save");
It gives the error:
org.apache.spark.sql.AnalysisException: Partition column customerId not found in schema StructType(StructField(customer,StructType(StructField(customerId,StringType,true)),true), StructField(experiment,StringType,true), StructField(expiryTime,LongType,true), StructField(partitionKey,StringType,true), StructField(programId,StringType,true), StructField(score,DoubleType,true), StructField(startTime,LongType,true), StructField(targetSets,ArrayType(StructType(StructField(featured,ArrayType(StructType(StructField(data,StructType(StructField(asinId,StringType,true)),true), StructField(pk,StringType,true), StructField(type,StringType,true)),true),true), StructField(reason,ArrayType(StringType,true),true), StructField(recommended,ArrayType(StringType,true),true)),true),true));
Please let me know how I can access the customerId column.
Let's take an example dataset, sample.json:
{"CUST_ID":"115734","CITY":"San Jose","STATE":"CA","ZIP":"95106"}
{"CUST_ID":"115728","CITY":"Allentown","STATE":"PA","ZIP":"18101"}
{"CUST_ID":"115730","CITY":"Allentown","STATE":"PA","ZIP":"18101"}
{"CUST_ID":"114728","CITY":"San Mateo","STATE":"CA","ZIP":"94401"}
{"CUST_ID":"114726","CITY":"Somerset","STATE":"NJ","ZIP":"8873"}
Now start hacking it with Spark
val jsonDf = spark.read
.format("json")
.load("path/of/sample.json")
jsonDf.show()
+---------+-------+-----+-----+
| CITY|CUST_ID|STATE| ZIP|
+---------+-------+-----+-----+
| San Jose| 115734| CA|95106|
|Allentown| 115728| PA|18101|
|Allentown| 115730| PA|18101|
|San Mateo| 114728| CA|94401|
| Somerset| 114726| NJ| 8873|
+---------+-------+-----+-----+
Then partition the dataset by the column "ZIP" and write it to S3:
jsonDf.write
.partitionBy("ZIP")
.save("s3/bucket/location/to/save")
// one-liner authentication to S3 (embedding the keys in the URI)
//.save(s"s3n://$accessKey:$secretKey@$bucketName/location/to/save")
Note: in order for this code to run successfully, the S3 access and secret keys have to be configured properly. Check this answer for Spark/Hadoop integration with S3.
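A hedged sketch of wiring up the credentials through the Hadoop configuration (property names assume the s3a connector and that hadoop-aws is on the classpath; adjust to your environment):
// configure the S3 credentials on the SparkContext's Hadoop configuration
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", accessKey)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", secretKey)

jsonDf.write
  .partitionBy("ZIP")
  .save("s3a://bucketName/location/to/save")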
Edit: resolution for "Partition column customerId not found in schema" (as per the comment):
customerId exists inside the customer struct, so extract customerId into a top-level column first and then partition:
df.withColumn("customerId", $"customer.customerId")
.drop("customer")
.write.partitionBy("customerId")
.save("s3/bucket/location/to/save")