How to explode a DataFrame schema in Databricks - Scala

I have a schema that needs to be exploded; here it is:
|-- CaseNumber: string (nullable = true)
|-- Customers: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Contacts: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- FirstName: string (nullable = true)
| | | | |-- LastName: string (nullable = true)
I want my schema to be like this,
|-- CaseNumber: string (nullable = true)
|-- FirstName: string (nullable = true)
|-- LastName: string (nullable = true)
or
+-----------+-----------+----------+
| CaseNumber| FirstName | LastName |
+-----------+-----------+----------+
| 1         | aa        | bb       |
| 2         | cc        | dd       |
+-----------+-----------+----------+
I am new to Databricks; any help would be appreciated. Thanks.

Here is one way to solve it without using the explode function:
case class Contact(FirstName: String, LastName: String)
case class Customer(Contacts: Seq[Contact])
case class MyCase(CaseNumber: String, Customers: Seq[Customer])

// Convert the DataFrame to a typed Dataset (requires import spark.implicits._
// and field names that match the schema above).
val dataset = dataframe.as[MyCase]

dataset.flatMap { myCase =>
  // one output row per contact's first and last name, repeating the case number - basically a loop
  for {
    customer <- myCase.Customers
    contact  <- customer.Contacts
  } yield (myCase.CaseNumber, contact.FirstName, contact.LastName)
}

I think you can still do explode(customersFlat.contacts). I wrote something like this a while ago, so forgive my syntax and let me know whether this works:
df.select("caseNumber",explode("customersFlat.contacts").as("contacts").select("caseNumber","contacts.firstName","contacts.lastName")

Related

How do I check if a column is present in a Spark DataFrame

I am trying to write logic that returns an empty column if a column does not exist in the DataFrame.
The schema changes very frequently; sometimes the whole struct is missing (temp1) or the array inside the struct is missing (suffix).
The schema looks like this:
root
|-- id: string (nullable = true)
|-- temp: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- temp1: struct (nullable = true)
| | | |-- code: string (nullable = true)
| | | |-- code1: array (nullable = true)
| | | | |-- element: string (containsNull = true)
| | |-- temp2: struct (nullable = true)
| | | |-- name1: array (nullable = true)
| | | | |-- element: string (containsNull = true)
| | | |-- suffix: array (nullable = true)
| | | | |-- element: string (containsNull = true)
|-- timestamp: timestamp (nullable = true)
Or like this:
root
|-- id: string (nullable = true)
|-- temp: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- temp2: struct (nullable = true)
| | | |-- name1: array (nullable = true)
| | | | |-- element: string (containsNull = true)
|-- timestamp: timestamp (nullable = true)
When I try the logic below against the second schema, I get an exception saying the struct is not found:
def has_Column(df: DataFrame, path: String) = Try(df(path)).isSuccess

df.withColumn("id", col("id"))
  .withColumn("tempLn", explode(col("temp")))
  .withColumn("temp1_code1",
    when(lit(has_Column(df, "tempLn.temp1.code1")),
      concat_ws(" ", col("tempLn.temp1.code1"))).otherwise(lit("").cast("string")))
  .withColumn("temp2_suffix",
    when(lit(has_Column(df, "tempLn.temp2.suffix")),
      concat_ws(" ", col("tempLn.temp2.suffix"))).otherwise(lit("").cast("string")))
Error:
org.apache.spark.sql.AnalysisException: No such struct field temp1;
You need to do the existence check outside the select/withColumn/etc. methods. Because you reference the column in the "then" branch of the when expression, Spark tries to resolve it during the analysis of the query, before the condition is ever evaluated.
So you'll need to test like this:
if (has_Column(df, "tempLn.temp2.suffix"))
  df.withColumn("temp2_suffix", concat_ws(" ", col("tempLn.temp2.suffix")))
else
  df.withColumn("temp2_suffix", lit(""))
To do it for multiple columns you can use foldLeft like this:
val df1 = Seq(
  ("tempLn.temp1.code1", "temp1_code1"),
  ("tempLn.temp2.suffix", "temp2_suffix")
).foldLeft(df) {
  case (acc, (field, newCol)) =>
    if (has_Column(acc, field))
      acc.withColumn(newCol, concat_ws(" ", col(field)))
    else
      acc.withColumn(newCol, lit(""))
}
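
Putting the pieces together, a minimal end-to-end sketch under one assumption not stated explicitly above: the explode into tempLn has to be applied before the existence checks, otherwise the tempLn.* paths can never resolve.
import scala.util.Try
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, concat_ws, explode, lit}

def has_Column(df: DataFrame, path: String): Boolean = Try(df(path)).isSuccess

// Explode first so that the nested tempLn.* paths exist on the frame being checked.
val exploded = df.withColumn("tempLn", explode(col("temp")))

val result = Seq(
  ("tempLn.temp1.code1", "temp1_code1"),
  ("tempLn.temp2.suffix", "temp2_suffix")
).foldLeft(exploded) { case (acc, (field, newCol)) =>
  if (has_Column(acc, field)) acc.withColumn(newCol, concat_ws(" ", col(field)))
  else acc.withColumn(newCol, lit(""))
}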

Exploding nested df columns in Spark Scala

The column is named 'col1' and has the form:
col1: array (nullable = true)
| |-- A1: struct (containsNull = true)
| | |-- B0: struct (nullable = true)
| | | |-- B01: string (nullable = true)
| | | |-- B02: string (nullable = true)
| | |-- B1: string (nullable = true)
| | |-- B2: string (nullable = true)
| | |-- B3: string (nullable = true)
| | |-- B4: string (nullable = true)
| | |-- B5: string (nullable = true)
I am trying two things; first, to fetch the value B2. Code:
val explodeDF = test_df.explode($"col1") { case Row(col1_details: Array[String]) =>
  col1_details.map { col1_details =>
    val firstName = col1_details(2).asInstanceOf[String]
    val lastName = col1_details(3).asInstanceOf[String]
    val email = col1_details(4).asInstanceOf[String]
    val salary = col1_details(5).asInstanceOf[String]
    notes_details(firstName, lastName, email, salary)
  }
}
Error:
error: too many arguments for method apply: (index: Int)Char in class StringOps
col1_details(firstName, lastName, email, salary)
I have tried various snippets and have been getting different errors. Any suggestions on what the mistake is would be highly helpful.
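
No answer is reproduced here, but as a hedged sketch of the usual DataFrame route (assuming the DataFrame is test_df and col1 is an array of structs carrying the B0-B5 fields shown above), B2 can be reached with explode and dot notation rather than a typed pattern match:
import org.apache.spark.sql.functions.{col, explode}

// Explode the array of structs, then reach into the struct with dot notation.
val exploded = test_df.select(explode(col("col1")).as("c"))
val b2Values = exploded.select(col("c.B2"))

b2Values.show()
If A1 is itself a field of the element struct rather than the element's name, the path would be c.A1.B2 instead.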

Spark SQL data frame

Data structure:
{"Emp":{"Name":"John", "Sal":"2000", "Address":[{"loc":"Sanjose","Zip":"222"},{"loc":"dayton","Zip":"333"}]}}
Now I want to load the data into a DataFrame and append Zip to loc. The loc column name should stay the same (loc). The transformed data should look like this:
{"Emp":{"Name":"John", "Sal":"2000", "Address":[{"loc":"Sanjose222","Zip":"222"},{"loc":"dayton333","Zip":"333"}]}}
No RDDs. I need a data frame operation to achieve this, preferably with the withColumn function. How can I do this?
Given a data structure as
val jsonString = """{"Emp":{"Name":"John","Sal":"2000","Address":[{"loc":"Sanjose","Zip":"222"},{"loc":"dayton","Zip":"333"}]}}"""
You can convert it to a DataFrame as follows:
val df = spark.read.json(sc.parallelize(jsonString::Nil))
which would give you
+-----------------------------------------------------+
|Emp |
+-----------------------------------------------------+
|[WrappedArray([222,Sanjose], [333,dayton]),John,2000]|
+-----------------------------------------------------+
//root
// |-- Emp: struct (nullable = true)
// | |-- Address: array (nullable = true)
// | | |-- element: struct (containsNull = true)
// | | | |-- Zip: string (nullable = true)
// | | | |-- loc: string (nullable = true)
// | |-- Name: string (nullable = true)
// | |-- Sal: string (nullable = true)
Now, to get the desired output, you need to split the Emp struct column into separate columns and pass the Address array column to a udf function, as shown below:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
// Rebuild the Address array, appending each element's Zip to its loc.
def attachZipWithLoc = udf((array: Seq[Row]) =>
  array.map(row => address(row.getAs[String]("loc") + row.getAs[String]("Zip"), row.getAs[String]("Zip"))))

df.select($"Emp.*")
  .withColumn("Address", attachZipWithLoc($"Address"))
  .select(struct($"Name".as("Name"), $"Sal".as("Sal"), $"Address".as("Address")).as("Emp"))
where address, used inside the udf, is a case class:
case class address(loc: String, Zip: String)
which should give you
+-----------------------------------------------------------+
|Emp |
+-----------------------------------------------------------+
|[John,2000,WrappedArray([Sanjose222,222], [dayton333,333])]|
+-----------------------------------------------------------+
//root
// |-- Emp: struct (nullable = false)
// | |-- Name: string (nullable = true)
// | |-- Sal: string (nullable = true)
// | |-- Address: array (nullable = true)
// | | |-- element: struct (containsNull = true)
// | | | |-- loc: string (nullable = true)
// | | | |-- Zip: string (nullable = true)
Now, to get the JSON back, you can just use .toJSON, and you should get
+-----------------------------------------------------------------------------------------------------------------+
|value |
+-----------------------------------------------------------------------------------------------------------------+
|{"Emp":{"Name":"John","Sal":"2000","Address":[{"loc":"Sanjose222","Zip":"222"},{"loc":"dayton333","Zip":"333"}]}}|
+-----------------------------------------------------------------------------------------------------------------+
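
As an aside, on Spark 2.4 or later the same transformation can be sketched without a udf by using the transform higher-order function; this is an alternative to the answer above, not what it uses, and the column names are assumed to match the schema shown:
import org.apache.spark.sql.functions.{col, expr, struct}

// Rebuild the Address array in SQL, concatenating loc and Zip for each element.
val result = df.select("Emp.*")
  .withColumn("Address",
    expr("transform(Address, a -> named_struct('loc', concat(a.loc, a.Zip), 'Zip', a.Zip))"))
  .select(struct(col("Name"), col("Sal"), col("Address")).as("Emp"))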

How to convert a wrapped array to a dataset in Spark Scala?

Hi, I am new to Spark and Scala. I have this structure in a JSON file which I need to convert to a Dataset. I am unable to do this because of the nested data.
I tried something like the following, which I got from another post, but it does not work. Can someone please suggest a solution?
spark.read.json(path).map(r=>r.getAs[mutable.WrappedArray[String]]("readings"))
Your JSON is not in the format Spark expects when converting it to a DataFrame: each JSON record that becomes a DataFrame/Dataset row must sit on a single line.
So the first step is to read the JSON file and convert it into valid (single-line) JSON. You can use the wholeTextFiles API and some replacements:
val rdd = sc.wholeTextFiles("path to your json text file")
val validJson = rdd.map(_._2.replace(" ", "").replace("\n", ""))
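As a side note, a sketch assuming Spark 2.2 or later: the multiLine reader option parses a multi-line JSON document directly, skipping the wholeTextFiles round trip:
// Assumes Spark 2.2+: read a JSON document that spans multiple lines.
val dataFrame = spark.read.option("multiLine", true).json("path to your json text file")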
The second step is to convert the valid JSON data into a DataFrame or Dataset. Here I am using a DataFrame:
val dataFrame = sqlContext.read.json(validJson)
which should give you
+--------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|did |readings |
+--------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|d7cc92c24be32d5d419af1277289313c|[[WrappedArray([aa1111111111111111c1111111111111112222222222e,AppleiOS,-46,49,ITU++], [09dfs1111111111111c1111111111111112222222222e,AppleiOS,-50,45,ITU++]),1506770544]]|
+--------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
root
|-- did: string (nullable = true)
|-- readings: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- clients: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- cid: string (nullable = true)
| | | | |-- clientOS: string (nullable = true)
| | | | |-- rssi: long (nullable = true)
| | | | |-- snRatio: long (nullable = true)
| | | | |-- ssid: string (nullable = true)
| | |-- ts: long (nullable = true)
Now selecting the WrappedArray is an easy step:
dataFrame.select("readings.clients")
which should give you
+------------------------------------------------------------------------------------------------------------------------------------------------------------+
|clients |
+------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[WrappedArray([aa1111111111111111c1111111111111112222222222e,AppleiOS,-46,49,ITU++], [09dfs1111111111111c1111111111111112222222222e,AppleiOS,-50,45,ITU++])]|
+------------------------------------------------------------------------------------------------------------------------------------------------------------+
I hope the answer is helpful
Updated
DataFrames and Datasets are almost the same, except that Datasets are type safe (through the encoders used) and can be better optimized than DataFrames.
Long story short, you can turn the DataFrame into a Dataset by creating case classes. For your case you would need three case classes:
case class client(cid: String, clientOS: String, rssi: Long, snRatio: Long, ssid: String)
case class reading(clients: Array[client], ts: Long)
case class dataset(did: String, readings: Array[reading])
And then cast the DataFrame to a Dataset as:
val dataSet = sqlContext.read.json(validJson).as[dataset]
You should then have a Dataset in your hands :)
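
For illustration, a small sketch of what the typed Dataset buys you, starting from the same validJson as above and assuming the case classes and import spark.implicits._ are in scope (the chosen fields are purely illustrative):
import spark.implicits._

val dataSet = sqlContext.read.json(validJson).as[dataset]

// Flatten to one row per client, keeping the device id.
val clients = dataSet.flatMap(d =>
  d.readings.flatMap(r => r.clients.map(c => (d.did, c.cid, c.rssi))).toSeq)
clients.show(false)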
You cannot create a Dataset with the following code:
spark.read.json(path).map(r => r.getAs[WrappedArray[String]]("readings"))
Check the schema of the clients type for the DataFrame created from reading the JSON:
spark.read.json(path).printSchema
root
|-- did: string (nullable = true)
|-- readings: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- clients: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- cid: string (nullable = true)
| | | | |-- clientOS: string (nullable = true)
| | | | |-- rssi: long (nullable = true)
| | | | |-- snRatio: long (nullable = true)
| | | | |-- ssid: string (nullable = true)
| | |-- ts: long (nullable = true)
You can get the scala.collection.mutable.WrappedArray object with the code below:
spark.read.json(path).first.getAs[WrappedArray[(String,String,Long,Long,String)]]("readings")
If you need to create the DataFrame, use the below:
spark.read.json(path).select("readings.clients")
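If the goal is a typed Dataset rather than a DataFrame, a minimal sketch (reusing the client case class from the previous answer and assuming import spark.implicits._) would be:
import org.apache.spark.sql.functions.{col, explode}
import spark.implicits._

// Explode readings, then clients, and map the leaf structs onto the client case class.
val clientsDs = spark.read.json(path)
  .select(explode(col("readings")).as("reading"))
  .select(explode(col("reading.clients")).as("client"))
  .select("client.*")
  .as[client]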

Explode array in Apache Spark DataFrame

I am trying to flatten the schema of an existing DataFrame with nested fields. The structure of my DataFrame is something like this:
root
|-- Id: long (nullable = true)
|-- Type: string (nullable = true)
|-- Uri: string (nullable = true)
|-- Type: array (nullable = true)
| |-- element: string (containsNull = true)
|-- Gender: array (nullable = true)
| |-- element: string (containsNull = true)
Type and Gender can contain an array of elements, a single element, or a null value.
I tried to use the following code:
var resDf = df.withColumn("FlatType", explode(df("Type")))
But in the resulting DataFrame I lose the rows that had null values in the Type column. For example, if I have 10 rows and Type is null in 7 of them and non-null in 3, after the explode the resulting DataFrame has only three rows.
How can I keep rows with null values but explode array of values?
I found a kind of workaround but am still stuck in one place. For standard types we can do the following:
def customExplode(df: DataFrame, field: String, colType: String): org.apache.spark.sql.Column = {
  var exploded = None: Option[org.apache.spark.sql.Column]
  colType.toLowerCase() match {
    case "string" =>
      val avoidNull = udf((column: Seq[String]) =>
        if (column == null) Seq[String](null) else column)
      exploded = Some(explode(avoidNull(df(field))))
    case "boolean" =>
      val avoidNull = udf((xs: Seq[Boolean]) =>
        if (xs == null) Seq[Boolean]() else xs)
      exploded = Some(explode(avoidNull(df(field))))
    case _ =>
      exploded = Some(explode(df(field)))
  }
  exploded.get
}
And after that just use it like this:
val explodedField = customExplode(resultDf, fieldName, fieldTypeMap(field))
resultDf = resultDf.withColumn(newName, explodedField)
However, I have a problem with struct types, for the following kind of structure:
|-- Address: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- AddressType: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- DEA: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- Number: array (nullable = true)
| | | | | |-- element: string (containsNull = true)
| | | | |-- ExpirationDate: array (nullable = true)
| | | | | |-- element: timestamp (containsNull = true)
| | | | |-- Status: array (nullable = true)
| | | | | |-- element: string (containsNull = true)
How can we process that kind of schema when DEA is null?
Thank you in advance.
P.S. I tried to use lateral views, but the result is the same.
Maybe you can try using when:
val resDf = df.withColumn("FlatType", when(df("Type").isNotNull, explode(df("Type"))))
As shown in the when function's documentation, the value null is inserted for the values that do not match the conditions.
I think what you want is to use explode_outer instead of explode.
See the Apache Spark docs for explode and explode_outer.
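A minimal sketch of that suggestion, assuming Spark 2.2+ (where explode_outer was introduced) and the df from the question:
import org.apache.spark.sql.functions.{col, explode_outer}

// explode_outer keeps rows whose array is null or empty, emitting a null FlatType for them.
val resDf = df.withColumn("FlatType", explode_outer(col("Type")))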