Fetch selected values of a JSON key into a DataFrame in Spark Scala - scala

The structure of the JSON looks like below.
|-- destination: struct (nullable = true)
| |-- activity: string (nullable = true)
| |-- id: string (nullable = true)
| |-- destination_class: array (nullable = true)
|-- Health: struct (nullable = true)
| |-- sample: struct (nullable = true)
| | |-- id: string (nullable = true)
| | |-- name: string (nullable = true)
|-- Marks: struct (nullable = true)
| |-- exam_score: double (nullable = true)
|-- sourceID: string (nullable = true)
|-- unique_exam_fields: struct (nullable = true)
| |-- indOrigin: string (nullable = true)
| |-- compo: string (nullable = true)
How can I select only a few fields from each object?
I am trying to bring the below fields into a DataFrame:
from destination -- id and activity
from Health -- id and name
from Marks -- exam_score
Code I tried:
val DF = spark.read.json("D:/data.json")
but the above code brings all fields.
Expected output -- the DataFrame should look like:
destination_id|activity|Health_id|Name|Exam_score
Please help.

You can use the dot notation to access the nested structures and then give the columns an alias:
import org.apache.spark.sql.functions.col

df.select(col("destination.id").as("destination_id"),
    col("destination.activity").as("activity"),
    col("Health.sample.id").as("Health_id"),
    col("Health.sample.name").as("Name"),
    col("Marks.exam_score").as("Exam_score"))
  .show()
prints
+--------------+--------+---------+----+----------+
|destination_id|activity|Health_id|Name|Exam_score|
+--------------+--------+---------+----+----------+
| b| a| c| d| e|
| b1| a1| c1| d1| e1|
+--------------+--------+---------+----+----------+

Option 1: Load the complete file & select the required columns like below.
Add all required columns inside a Seq & then use those columns inside selectExpr:
val columns = Seq(
"destination.id as destination_id",
"destination.activity as activity",
"Health.sample.id as health_id",
"Health.sample.name as name",
"Marks.exam_score as exam_score"
)
df.selectExpr(columns:_*)
Option 2: Create a StructType with the required columns & apply the schema before loading the file data.
val schema = // Your required columns in schema
val DF = spark.read.schema(schema).json("D:/data.json")
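A sketch of what that schema could look like for the fields in the question (types follow the printed schema above; any fields omitted here are simply not loaded):

```scala
import org.apache.spark.sql.types._

// Hypothetical partial schema: only the fields we want to keep.
val schema = StructType(Seq(
  StructField("destination", StructType(Seq(
    StructField("id", StringType),
    StructField("activity", StringType)
  ))),
  StructField("Health", StructType(Seq(
    StructField("sample", StructType(Seq(
      StructField("id", StringType),
      StructField("name", StringType)
    )))
  ))),
  StructField("Marks", StructType(Seq(
    StructField("exam_score", DoubleType)
  )))
))
```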

Related

Adding new column for DataFrame with complex column (Array&lt;Map&lt;String,String&gt;&gt;)

I am loading a Dataframe from an external source with the following schema:
|-- A: string (nullable = true)
|-- B: timestamp (nullable = true)
|-- C: long (nullable = true)
|-- METADATA: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- M_1: integer (nullable = true)
| | |-- M_2: string (nullable = true)
| | |-- M_3: string (nullable = true)
| | |-- M_4: string (nullable = true)
| | |-- M_5: double (nullable = true)
| | |-- M_6: string (nullable = true)
| | |-- M_7: double (nullable = true)
| | |-- M_8: boolean (nullable = true)
| | |-- M_9: boolean (nullable = true)
|-- E: string (nullable = true)
Now, I need to add new column, METADATA_PARSED, with column type Array and the following case class:
case class META_DATA_COL(M_1: String, M_2: String, M_3: String, M_10: String)
My approach here, based on examples is to create a UDF and pass in the METADATA column. But since it is of a complex type I am having a lot of trouble parsing it.
On top of that, in the UDF, for the "new" variable M_10 I need to do some string manipulation as well. So I need to access each of the elements in the metadata column.
What would be the best way to approach this issue? I attempted to convert the source dataframe (+METADATA) to a case class; but that did not work as it was translated back to spark WrappedArray types upon entering the UDF.
You can use something like this:
import org.apache.spark.sql.functions._
val tempdf = df.select(
explode( col("METADATA")).as("flat")
)
val processedDf = tempdf.select( col("flat.M_1"),col("flat.M_2"),col("flat.M_3"))
Now write a udf:
def processudf = udf((col1: Int, col2: String, col3: String) => /* do the processing */)
This should help; I can provide more help if you share more details on the processing.
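As a sketch of what that udf body could look like (the M_10 rule here, concatenating M_2 and M_3, is purely a placeholder assumption; keeping the per-element logic in a plain function makes it testable outside Spark):

```scala
// Target shape for the parsed metadata; mirrors the case class in the question.
case class MetaDataCol(M_1: String, M_2: String, M_3: String, M_10: String)

// Plain function holding the per-element logic. The M_10 derivation is a
// placeholder assumption: replace it with your real string manipulation.
def buildMeta(m1: Int, m2: String, m3: String): MetaDataCol =
  MetaDataCol(m1.toString, m2, m3, m2 + "_" + m3)

// In Spark you would then wrap it, roughly:
// val processMeta = udf((m1: Int, m2: String, m3: String) => buildMeta(m1, m2, m3))
// processedDf.withColumn("META_PARSED", processMeta(col("M_1"), col("M_2"), col("M_3")))
```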

Spark SQL data frame

Data structure:
{"Emp":{"Name":"John", "Sal":"2000", "Address":[{"loc":"Sanjose","Zip":"222"},{"loc":"dayton","Zip":"333"}]}}
Now I want to load the data into a data frame and want to append zip to loc. The loc column name should be same (loc). The transformed data should be like this:
{"Emp":{"Name":"John", "Sal":"2000", "Address":[{"loc":"Sanjose222","Zip":"222"},{"loc":"dayton333","Zip":"333"}]}}
No RDDs. I need a data frame operation to achieve this, preferably with the withColumn function. How can I do this?
Given a data structure as
val jsonString = """{"Emp":{"Name":"John","Sal":"2000","Address":[{"loc":"Sanjose","Zip":"222"},{"loc":"dayton","Zip":"333"}]}}"""
You can convert it to a dataframe as
val df = spark.read.json(sc.parallelize(jsonString::Nil))
which would give you
+-----------------------------------------------------+
|Emp |
+-----------------------------------------------------+
|[WrappedArray([222,Sanjose], [333,dayton]),John,2000]|
+-----------------------------------------------------+
//root
// |-- Emp: struct (nullable = true)
// | |-- Address: array (nullable = true)
// | | |-- element: struct (containsNull = true)
// | | | |-- Zip: string (nullable = true)
// | | | |-- loc: string (nullable = true)
// | |-- Name: string (nullable = true)
// | |-- Sal: string (nullable = true)
Now, to get the desired output, you need to expand the Emp struct column into separate columns and pass the Address array column through a udf function to get your desired result:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row

def attachZipWithLoc = udf((array: Seq[Row]) =>
  array.map(row => address(row.getAs[String]("loc") + row.getAs[String]("Zip"), row.getAs[String]("Zip"))))
df.select($"Emp.*")
.withColumn("Address", attachZipWithLoc($"Address"))
.select(struct($"Name".as("Name"), $"Sal".as("Sal"), $"Address".as("Address")).as("Emp"))
where address used in the udf is a case class
case class address(loc: String, Zip: String)
which should give you
+-----------------------------------------------------------+
|Emp |
+-----------------------------------------------------------+
|[John,2000,WrappedArray([Sanjose222,222], [dayton333,333])]|
+-----------------------------------------------------------+
//root
// |-- Emp: struct (nullable = false)
// | |-- Name: string (nullable = true)
// | |-- Sal: string (nullable = true)
// | |-- Address: array (nullable = true)
// | | |-- element: struct (containsNull = true)
// | | | |-- loc: string (nullable = true)
// | | | |-- Zip: string (nullable = true)
Now, to get the json, you can just use .toJSON and you should get
+-----------------------------------------------------------------------------------------------------------------+
|value |
+-----------------------------------------------------------------------------------------------------------------+
|{"Emp":{"Name":"John","Sal":"2000","Address":[{"loc":"Sanjose222","Zip":"222"},{"loc":"dayton333","Zip":"333"}]}}|
+-----------------------------------------------------------------------------------------------------------------+

How to convert wrapped array to dataset in spark scala?

Hi, I am new to Spark Scala. I have this structure in a json file which I need to convert to a dataset. I am unable to do this because of the nested data.
I tried to do something like this, which I got from some post, but it does not work. Can someone please suggest a solution?
spark.read.json(path).map(r=>r.getAs[mutable.WrappedArray[String]]("readings"))
Your JSON format is invalid for Spark to convert into a dataframe: each JSON record that is to become a dataframe/dataset row should be on a single line.
So the first step is to read the json file and convert it into valid json format. You can use the wholeTextFiles api and some replacements.
val rdd = sc.wholeTextFiles("path to your json text file")
val validJson = rdd.map(_._2.replace(" ", "").replace("\n", ""))
The second step is to convert the valid json data into a dataframe or dataset. Here I am using a dataframe:
val dataFrame = sqlContext.read.json(validJson)
which should give you
+--------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|did |readings |
+--------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|d7cc92c24be32d5d419af1277289313c|[[WrappedArray([aa1111111111111111c1111111111111112222222222e,AppleiOS,-46,49,ITU++], [09dfs1111111111111c1111111111111112222222222e,AppleiOS,-50,45,ITU++]),1506770544]]|
+--------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
root
|-- did: string (nullable = true)
|-- readings: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- clients: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- cid: string (nullable = true)
| | | | |-- clientOS: string (nullable = true)
| | | | |-- rssi: long (nullable = true)
| | | | |-- snRatio: long (nullable = true)
| | | | |-- ssid: string (nullable = true)
| | |-- ts: long (nullable = true)
Now selecting the WrappedArray is an easy step:
dataFrame.select("readings.clients")
which should give you
+------------------------------------------------------------------------------------------------------------------------------------------------------------+
|clients |
+------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[WrappedArray([aa1111111111111111c1111111111111112222222222e,AppleiOS,-46,49,ITU++], [09dfs1111111111111c1111111111111112222222222e,AppleiOS,-50,45,ITU++])]|
+------------------------------------------------------------------------------------------------------------------------------------------------------------+
I hope the answer is helpful
Updated
Dataframes and datasets are almost the same, except that datasets are type-safe (using encoders) and can be better optimized than dataframes.
Long story short, you can change the dataframe to a dataset by creating case classes. For your case you would need three case classes:
case class client(cid: String, clientOS: String, rssi: Long, snRatio: Long, ssid: String)
case class reading(clients: Array[client], ts: Long)
case class dataset(did: String, readings: Array[reading])
And then cast the dataframe to the dataset (note the implicits import, which provides the encoders for the case classes):
import sqlContext.implicits._
val dataSet = sqlContext.read.json(validJson).as[dataset]
You should have dataset in your hand :)
You cannot create a Dataset with the following code:
spark.read.json(path).map(r => r.getAs[WrappedArray[String]]("readings"))
Check the schema of clients type for the DF created upon reading the JSON.
spark.read.json(path).printSchema
root
|-- did: string (nullable = true)
|-- readings: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- clients: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- cid: string (nullable = true)
| | | | |-- clientOS: string (nullable = true)
| | | | |-- rssi: long (nullable = true)
| | | | |-- snRatio: long (nullable = true)
| | | | |-- ssid: string (nullable = true)
| | |-- ts: long (nullable = true)
You can get a scala.collection.mutable.WrappedArray object with the below code (the struct elements come back as Rows):
spark.read.json(path).first.getAs[WrappedArray[Row]]("readings")
If you need to create the dataframe, use the below:
spark.read.json(path).select("readings.clients")

Update Schema for DataFrame in Apache Spark

I have a DataFrame with the following schema
root
|-- col_a: string (nullable = false)
|-- col_b: string (nullable = false)
|-- col_c_a: string (nullable = false)
|-- col_c_b: string (nullable = false)
|-- col_d: string (nullable = false)
|-- col_e: string (nullable = false)
|-- col_f: string (nullable = false)
now I want to convert the Schema for this data frame to something like this.
root
|-- col_a: string (nullable = false)
|-- col_b: string (nullable = false)
|-- col_c: struct (nullable = false)
| |-- col_c_a: string (nullable = false)
| |-- col_c_b: string (nullable = false)
|-- col_d: string (nullable = false)
|-- col_e: string (nullable = false)
|-- col_f: string (nullable = false)
I am able to do this with a map transformation by explicitly fetching the value of each column from the row type, but this is a very complex process and does not look good. So,
is there any way I can achieve this?
Thanks
There is an in-built struct function with the definition:
def struct(cols: Column*): Column
You can use it like:
df.show
+---+---+
| a| b|
+---+---+
| 1| 2|
| 2| 3|
+---+---+
df.withColumn("struct_col", struct($"a", $"b")).show
+---+---+----------+
| a| b|struct_col|
+---+---+----------+
| 1| 2| [1,2]|
| 2| 3| [2,3]|
+---+---+----------+
The schema of the new dataframe being:
|-- a: integer (nullable = false)
|-- b: integer (nullable = false)
|-- struct_col: struct (nullable = false)
| |-- a: integer (nullable = false)
| |-- b: integer (nullable = false)
In your case, you can do something like:
df.withColumn("col_c" , struct($"col_c_a", $"col_c_b") ).drop($"col_c_a").drop($"col_c_b")

Spark/Scala: join dataframes when id is nested in an array of structs

I'm using Spark's MLlib DataFrame ALS functionality on Spark 2.2.0. I had to run my userId and itemId fields through a StringIndexer to get things going.
The method 'recommendForAllUsers' returns the following schema
root
|-- recommendations: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- itemIdIndex: long (nullable = true)
| | |-- rating: double (nullable = true)
|-- userIdIndex: string (nullable = true)
This is perfect for my needs (I would love not to flatten it), but I need to replace userIdIndex and itemIdIndex with their actual values.
For userIdIndex this was ok (I couldn't simply reverse it with IndexToString, as the ALS fitting seems to erase the link between index and value):
df.join(df2, df2("userIdIndex")===df("userIdIndex"), "left")
.select(df2("userId"), df("recommendations"))
where df2 looks like this:
+------------------+--------------------+----------+-----------+-----------+
| userId| itemId| rating|userIdIndex|itemIdIndex|
+------------------+--------------------+----------+-----------+-----------+
|glorified-consumer| item-22302| 3.0| 15.0| 4.0|
the result is this schema:
root
|-- userId: string (nullable = true)
|-- recommendations: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- itemIdIndex: integer (nullable = true)
| | |-- rating: float (nullable = true)
QUESTION: how can I do the same for itemIdIndex, which is inside an array of structs?
You can explode the array so that only the struct remains:
val tempdf2 = df2.withColumn("recommendations", explode('recommendations))
which should leave you with schema as
root
|-- userId: string (nullable = true)
|-- recommendations: struct (nullable = true)
| |-- itemIdIndex: string (nullable = true)
| |-- rating: string (nullable = true)
Do the same for df (the first dataframe):
val tempdf1 = df.withColumn("recommendations", explode('recommendations))
Then after that you can join them as
tempdf1.join(tempdf2, tempdf1("recommendations.itemIdIndex") === tempdf2("recommendations.itemIdIndex"))