Spark nested complex dataframe - scala

I am trying to get this complex data into a normal dataframe format.
My data schema:
root
|-- column_names: array (nullable = true)
| |-- element: string (containsNull = true)
|-- values: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
|-- id: array (nullable = true)
| |-- element: string (containsNull = true)
My data file (JSON format):
{"column_names":["2_col_name","3_col_name"],"id":["a","b","c","d","e"],"values":[["2_col_1",1],["2_col_2",2],["2_col_3",9],["2_col_4",10],["2_col_5",11]]}
I am trying to convert the above data into this format:
+----------+----------+----------+
|1_col_name|2_col_name|3_col_name|
+----------+----------+----------+
| a| 2_col_1| 1|
| b| 2_col_2| 2|
| c| 2_col_3| 9|
| d| 2_col_4| 10|
| e| 2_col_5| 11|
+----------+----------+----------+
I tried using the explode function on id and values but got a different output, shown below:
+---+-------------+
| id| values|
+---+-------------+
| a| [2_col_1, 1]|
| a| [2_col_2, 2]|
| a| [2_col_3, 9]|
| a|[2_col_4, 10]|
| a|[2_col_5, 11]|
| b| [2_col_1, 1]|
| b| [2_col_2, 2]|
| b| [2_col_3, 9]|
| b|[2_col_4, 10]|
+---+-------------+
only showing top 9 rows
Not sure where I am going wrong.

You can use the arrays_zip + inline functions to flatten, then pivot the column names:
val df1 = df.select(
    $"column_names",
    expr("inline(arrays_zip(id, values))")
  ).select(
    $"id".as("1_col_name"),
    expr("inline(arrays_zip(column_names, values))")
  )
  .groupBy("1_col_name")
  .pivot("column_names")
  .agg(first("values"))
df1.show
//+----------+----------+----------+
//|1_col_name|2_col_name|3_col_name|
//+----------+----------+----------+
//|e |2_col_5 |11 |
//|d |2_col_4 |10 |
//|c |2_col_3 |9 |
//|b |2_col_2 |2 |
//|a |2_col_1 |1 |
//+----------+----------+----------+
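Note that arrays_zip and inline require Spark 2.4+. If the inner values always appear in the order given by column_names, you can also skip the pivot and select positions directly; a minimal sketch (assumes spark.implicits._ and org.apache.spark.sql.functions.expr are in scope, and that the column order is fixed):
val df2 = df
  .select(expr("inline(arrays_zip(id, values))"))
  .select(
    $"id".as("1_col_name"),
    $"values".getItem(0).as("2_col_name"),
    $"values".getItem(1).as("3_col_name")
  )
df2.show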

Related

Drop rows in Spark which don't follow schema

Currently, the schema for my table is:
root
|-- product_id: integer (nullable = true)
|-- product_name: string (nullable = true)
|-- aisle_id: string (nullable = true)
|-- department_id: string (nullable = true)
I want to apply the below schema on the above table and delete all rows that do not follow it:
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val productsSchema = StructType(Seq(
  StructField("product_id", IntegerType, nullable = true),
  StructField("product_name", StringType, nullable = true),
  StructField("aisle_id", IntegerType, nullable = true),
  StructField("department_id", IntegerType, nullable = true)
))
Use the "DROPMALFORMED" mode while loading the data; it ignores corrupted records.
spark.read.format("json")
.option("mode", "DROPMALFORMED")
.option("header", "true")
.schema(productsSchema)
.load("sample.json")
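If you want to inspect which records would be dropped before discarding them, one option is PERMISSIVE mode with a corrupt-record column; a sketch (assumes the types import above; note that whether a plain type mismatch is treated as malformed can vary between Spark versions, so verify on your data):
import org.apache.spark.sql.functions.col

val schemaWithCorrupt = productsSchema.add(StructField("_corrupt_record", StringType, nullable = true))

val inspected = spark.read
  .schema(schemaWithCorrupt)
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .json("sample.json")

inspected.where(col("_corrupt_record").isNotNull).show(false) // records Spark considered malformed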
If the data does not match the schema, Spark will put null as the value in that column. We then just have to filter out the rows where every column is null.
The filter below keeps only rows that have at least one non-null column.
scala> "cat /tmp/sample.json".! // JSON File Data, one row is not matching with schema.
{"product_id":1,"product_name":"sampleA","aisle_id":"AA","department_id":"AAD"}
{"product_id":2,"product_name":"sampleBB","aisle_id":"AAB","department_id":"AADB"}
{"product_id":3,"product_name":"sampleCC","aisle_id":"CC","department_id":"CCC"}
{"product_id":3,"product_name":"sampledd","aisle_id":"dd","departmentId":"ddd"}
{"name","srinivas","age":29}
res100: Int = 0
scala> schema.printTreeString
root
|-- aisle_id: string (nullable = true)
|-- department_id: string (nullable = true)
|-- product_id: long (nullable = true)
|-- product_name: string (nullable = true)
scala> val df = spark.read.schema(schema).option("badRecordsPath", "/tmp/badRecordsPath").format("json").load("/tmp/sample.json") // Load the JSON data; if a row does not match the schema, all of its columns come back as null.
df: org.apache.spark.sql.DataFrame = [aisle_id: string, department_id: string ... 2 more fields]
scala> df.show(false)
+--------+-------------+----------+------------+
|aisle_id|department_id|product_id|product_name|
+--------+-------------+----------+------------+
|AA |AAD |1 |sampleA |
|AAB |AADB |2 |sampleBB |
|CC |CCC |3 |sampleCC |
|dd |null |3 |sampledd |
|null |null |null |null |
+--------+-------------+----------+------------+
scala> df.filter(df.columns.map(c => s"${c} is not null").mkString(" or ")).show(false) // Keep rows that have at least one non-null column.
+--------+-------------+----------+------------+
|aisle_id|department_id|product_id|product_name|
+--------+-------------+----------+------------+
|AA |AAD |1 |sampleA |
|AAB |AADB |2 |sampleBB |
|CC |CCC |3 |sampleCC |
|dd |null |3 |sampledd |
+--------+-------------+----------+------------+
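The same filter can also be written with Column expressions instead of building a SQL string; a small sketch:
import org.apache.spark.sql.functions.col

// keep rows that have at least one non-null column (same logic as the string filter above)
val atLeastOneNonNull = df.columns.map(c => col(c).isNotNull).reduce(_ || _)
df.filter(atLeastOneNonNull).show(false)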
Also check out the na.drop functions on DataFrame: you can drop rows based on null values, on a minimum number of non-null values per row, and also based on specific columns that contain nulls.
scala> sc.parallelize(Seq((1,"a","a"),(1,"a","a"),(2,"b","b"),(3,"c","c"),(4,"d","d"),(4,"d",null))).toDF
res7: org.apache.spark.sql.DataFrame = [_1: int, _2: string ... 1 more field]
scala> res7.show()
+---+---+----+
| _1| _2| _3|
+---+---+----+
| 1| a| a|
| 1| a| a|
| 2| b| b|
| 3| c| c|
| 4| d| d|
| 4| d|null|
+---+---+----+
// drop a row if any null is found
scala> res7.na.drop.show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
| 1| a| a|
| 1| a| a|
| 2| b| b|
| 3| c| c|
| 4| d| d|
+---+---+---+
// drops rows that have fewer than 3 non-null values (minNonNulls = 3)
scala> res7.na.drop(minNonNulls = 3).show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
| 1| a| a|
| 1| a| a|
| 2| b| b|
| 3| c| c|
| 4| d| d|
+---+---+---+
// drops nothing here, since every row has at least 2 non-null values
scala> res7.na.drop(minNonNulls = 2).show()
+---+---+----+
| _1| _2| _3|
+---+---+----+
| 1| a| a|
| 1| a| a|
| 2| b| b|
| 3| c| c|
| 4| d| d|
| 4| d|null|
+---+---+----+
// drops rows based on nulls in the `_3` column
scala> res7.na.drop(Seq("_3")).show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
| 1| a| a|
| 1| a| a|
| 2| b| b|
| 3| c| c|
| 4| d| d|
+---+---+---+
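For reference, na.drop also takes a how argument:
res7.na.drop("any") // drop a row if it contains any null (same as na.drop())
res7.na.drop("all") // drop a row only when every column is null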

Lateral view explode strange behaviour

I am concatenating two array columns and converting them back to an array. Now when I apply explode, nothing happens. Using Spark 2.3. Is there anything weird here?
df = spark.createDataFrame([(1,25,['A','B','B','C'],['A','B','B','C']),(1,20,['A','A','B','C'],['A','B','B','C']),(1,20,['A','C','B','C'],['A','B','B','C']),(2,26,['X','Y','Z','C'],['A','B','B','C'])],['id','age','one','two'])
+---+---+------------+------------+
| id|age| one| two|
+---+---+------------+------------+
| 1| 25|[A, B, B, C]|[A, B, B, C]|
| 1| 20|[A, A, B, C]|[A, B, B, C]|
| 1| 20|[A, C, B, C]|[A, B, B, C]|
| 2| 26|[X, Y, Z, C]|[A, B, B, C]|
+---+---+------------+------------+
>>> df.createOrReplaceTempView('df')
>>> df2 = spark.sql('''select id,age, array(concat_ws(',', one, two)) as three from df''')
>>> df2.show()
+---+---+-----------------+
| id|age| three|
+---+---+-----------------+
| 1| 25|[A,B,B,C,A,B,B,C]|
| 1| 20|[A,A,B,C,A,B,B,C]|
| 1| 20|[A,C,B,C,A,B,B,C]|
| 2| 26|[X,Y,Z,C,A,B,B,C]|
+---+---+-----------------+
>>> df2.createOrReplaceTempView('df2')
>>> spark.sql('''select id, age, four from df2 lateral view explode(three) tbl as four''').show() # not exploding
+---+---+---------------+
| id|age| four|
+---+---+---------------+
| 1| 25|A,B,B,C,A,B,B,C|
| 1| 20|A,A,B,C,A,B,B,C|
| 1| 20|A,C,B,C,A,B,B,C|
| 2| 26|X,Y,Z,C,A,B,B,C|
+---+---+---------------+
Please note that I can make it work by
>>> df2 = spark.sql('''select id,age, split(concat_ws(',', one, two),',') as three from df''')
But just wondering why the first approach is not working.
concat_ws creates a single string column and not an array:
from pyspark.sql import functions as F

df.select(F.size(df.one)).show()
df2.select(F.size(df2.three)).show()
Output:
+---------+
|size(one)|
+---------+
| 4|
| 4|
| 4|
| 4|
+---------+
+-----------+
|size(three)|
+-----------+
| 1|
| 1|
| 1|
| 1|
+-----------+
That means your array has just one element:
df2.select(df2.three.getItem(0)).show()
df2.select(df2.three.getItem(1)).show()
df2.printSchema()
Output:
+---------------+
| three[0]|
+---------------+
|A,B,B,C,A,B,B,C|
|A,A,B,C,A,B,B,C|
|A,C,B,C,A,B,B,C|
|X,Y,Z,C,A,B,B,C|
+---------------+
+--------+
|three[1]|
+--------+
| null|
| null|
| null|
| null|
+--------+
root
|-- id: long (nullable = true)
|-- age: long (nullable = true)
|-- three: array (nullable = false)
| |-- element: string (containsNull = false)
So what you should actually use is concat, on Spark >= 2.4:
df3 = spark.sql('''select id,age, concat(one, two) as three from df''')
df3.show(truncate=False)
df3.printSchema()
df3.select(df3.three.getItem(0)).show()
df3.select(df3.three.getItem(1)).show()
Output:
+---+---+------------------------+
|id |age|three |
+---+---+------------------------+
|1 |25 |[A, B, B, C, A, B, B, C]|
|1 |20 |[A, A, B, C, A, B, B, C]|
|1 |20 |[A, C, B, C, A, B, B, C]|
|2 |26 |[X, Y, Z, C, A, B, B, C]|
+---+---+------------------------+
root
|-- id: long (nullable = true)
|-- age: long (nullable = true)
|-- three: array (nullable = true)
| |-- element: string (containsNull = true)
+--------+
|three[0]|
+--------+
| A|
| A|
| A|
| X|
+--------+
+--------+
|three[1]|
+--------+
| B|
| A|
| C|
| Y|
+--------+
Concatenating two arrays with Spark < 2.4 requires a UDF (check this answer for an example).
An example way to do this using a UDF:
from pyspark.sql.types import ArrayType, StringType

arraycat = F.udf(lambda x, y: x + y, ArrayType(StringType()))
df = df.withColumn("combined", arraycat("one", "two"))
df = df.withColumn("combined", F.explode("combined"))

How to perform aggregation (sum) on different columns and group the result based on another column of a spark dataframe?

Using Scala Spark, I read a table in Postgres and formed a dataframe, locationDF, which contains data related to locations in the below format.
val opts = Map("url" -> "databaseurl","dbtable" -> "locations")
val locationDF = spark.read.format("jdbc").options(opts).load()
locationDF.printSchema()
root
|-- locn_id: integer (nullable = true)
|-- start_date: string (nullable = true)
|-- work_min: double (nullable = true)
|-- coverage: double (nullable = true)
|-- speed: double (nullable = true)
Initial Data:
+-------------+----------+-------------------+----------------+------------------+
| locn_id|start_date| work_min| coverage| speed|
+-------------+----------+-------------------+----------------+------------------+
| 3|2012-02-22| 53.62948333333333| 13.644|3.9306276263070457|
| 7|2012-02-22|0.11681666666666667| 0.0| 0.0|
| 1|2012-02-21| 22.783333333333335| 2.6| 8.762820512820513|
| 1|2012-01-21| 23.033333333333335| 2.6| 8.85897435897436|
| 1|2012-01-21| 44.98533333333334| 6.99| 6.435670004768718|
| 4|2012-02-21| 130.34788333333333| 54.67| 2.384267117858667|
| 2|2012-01-21| 94.61035| 8.909|10.619637445280052|
| 1|2012-02-21| 0.0| 0.0| 0.0|
| 1|2012-02-21| 29.3377| 4.579| 6.407010264249837|
| 1|2012-01-21| 59.13276666666667| 8.096| 7.303948451910409|
| 2|2012-03-21| 166.41843333333333| 13.048|12.754325056202738|
| 1|2012-03-21| 14.853183333333334| 2.721| 5.458722283474213|
| 9|2012-03-21| 1.69895| 0.845|2.0105917159763314|
+-------------+----------+-------------------+----------------+------------------+
I am trying to compute the sum of work_min (and convert it into hours), the sum of coverage, and the average speed for each particular year and month, and form another dataframe.
To do that, I separated the month and year from the date column start_date as below, getting two columns, year and month, out of it.
locationDF.withColumn("year", date_format(to_date($"start_date"),
"yyyy").cast(("Integer"))).withColumn("month",
date_format(to_date($"start_date"), "MM").cast(("Integer")))
+-------------+----------+-------------------+----------------+------------------+----+-----+
| locn_id|start_date| work_min| coverage| speed|year|month|
+-------------+----------+-------------------+----------------+------------------+----+-----+
| 3|2012-02-22| 53.62948333333333| 13.644|3.9306276263070457|2012| 2|
| 7|2012-02-22|0.11681666666666667| 0.0| 0.0|2012| 2|
| 1|2012-02-21| 22.783333333333335| 2.6| 8.762820512820513|2012| 2|
| 1|2012-01-21| 23.033333333333335| 2.6| 8.85897435897436|2012| 1|
| 1|2012-01-21| 44.98533333333334| 6.99| 6.435670004768718|2012| 1|
| 4|2012-02-21| 130.34788333333333| 54.67| 2.384267117858667|2012| 2|
| 2|2012-01-21| 94.61035| 8.909|10.619637445280052|2012| 1|
| 1|2012-02-21| 0.0| 0.0| 0.0|2012| 2|
| 1|2012-02-21| 29.3377| 4.579| 6.407010264249837|2012| 2|
| 1|2012-01-21| 59.13276666666667| 8.096| 7.303948451910409|2012| 1|
| 2|2012-03-21| 166.41843333333333| 13.048|12.754325056202738|2012| 3|
| 1|2012-03-21| 14.853183333333334| 2.721| 5.458722283474213|2012| 3|
| 9|2012-03-21| 1.69895| 0.845|2.0105917159763314|2012| 3|
+-------------+----------+-------------------+----------------+------------------+----+-----+
But I don't understand how to perform the aggregations all at once: a sum on the two separate columns work_min and coverage, and the average value of the speed column for that particular month, obtaining the result below.
+----+-----+-------------+------------+-----------------+
|year|month|sum_work_mins|sum_coverage| avg_speed|
+----+-----+-------------+------------+-----------------+
|2012| 1|221.7617833 | 26.595 |11.07274342031118|
|2012| 2|236.2152166 | 75.493 |7.161575173745354|
|2012| 3|182.9705666 | 16.614 |6.741213018551094|
+----+-----+-------------+------------+-----------------+
Could anyone let me know how I can achieve that?
I think you are looking for this
scala> dfd.groupBy("year","month").agg(sum("work_min").as("sum_work_min"),sum("coverage").as("sum_coverage"),avg("speed").as("avg_speed")).show
+----+-----+------------------+------------------+-----------------+
|year|month| sum_work_min| sum_coverage| avg_speed|
+----+-----+------------------+------------------+-----------------+
|2012| 1|221.76178333333334|26.595000000000002|8.304557565233385|
|2012| 2| 236.2152166666667| 75.493|3.580787586872677|
|2012| 3|182.97056666666666| 16.614|6.741213018551094|
+----+-----+------------------+------------------+-----------------+
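The question also mentions converting work_min into hours; a minimal sketch of the same aggregation with that conversion added (assumes org.apache.spark.sql.functions._ is imported and dfd is the dataframe with the year and month columns from above):
dfd.groupBy("year", "month")
  .agg(
    (sum("work_min") / 60).as("sum_work_hours"), // minutes -> hours
    sum("coverage").as("sum_coverage"),
    avg("speed").as("avg_speed")
  )
  .orderBy("year", "month")
  .show()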
hope it helps you.

Create a new column from one of the values available in other columns as an array of key-value pairs

I have extracted some data from Hive into a dataframe, which is in the format shown below.
+-------+----------------+----------------+----------------+----------------+
| NUM_ID|            SIG1|            SIG2|            SIG3|            SIG4|
+-------+----------------+----------------+----------------+----------------+
|XXXXX01|[{15695605310...|[{15695605310...|[{15695605310...|[{15695605310...|
|XXXXX02|[{15695604780...|[{15695604780...|[{15695604780...|[{15695604780...|
|XXXXX03|[{15695605310...|[{15695605310...|[{15695605310...|[{15695605310...|
|XXXXX04|[{15695605310...|[{15695605310...|[{15695605310...|[{15695605310...|
|XXXXX05|[{15695605310...|[{15695605310...|[{15695605310...|[{15695605310...|
|XXXXX06|[{15695605340...|[{15695605340...|[{15695605340...|[{15695605340...|
|XXXXX07|[{15695605310...|[{15695605310...|[{15695605310...|[{15695605310...|
|XXXXX08|[{15695605310...|[{15695605310...|[{15695605310...|[{15695605310...|
If we take only one signal it will be as below.
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|
[{1569560537000,3.7825},{1569560481000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|
[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560527000,34.7825}]|
[{1569560535000,34.7825},{1569560479000,34.7825},{1569560487000,34.7825}]
For each NUM_ID, each SIG column will have an array of E and V pairs.
The schema for the above data is
fromHive.printSchema
root
|-- NUM_ID: string (nullable = true)
|-- SIG1: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = true)
|-- SIG2: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = true)
|-- SIG3: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = true)
|-- SIG4: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = true)
My requirement is to get all the E values from all the columns for a particular NUM_ID and create them as a new column, with the corresponding signal values in other columns, as shown below.
+-------+-------------+-------+-------+-------+-------+
| NUM_ID| E| SIG1_V| SIG2_V| SIG3_V| SIG4_V|
+-------+-------------+-------+-------+-------+-------+
|XXXXX01|1569560531000|33.7825|34.7825| null|96.3354|
|XXXXX01|1569560505000| null| null|35.5501| null|
|XXXXX01|1569560531001|73.7825| null| null| null|
|XXXXX02|1569560505000|34.7825| null|35.5501|96.3354|
|XXXXX02|1569560531000|33.7825|34.7825|35.5501|96.3354|
|XXXXX02|1569560505001|73.7825| null| null| null|
|XXXXX02|1569560502000| null| null|35.5501|96.3354|
|XXXXX03|1569560531000|73.7825| null| null| null|
|XXXXX03|1569560505000|34.7825| null|35.5501|96.3354|
|XXXXX03|1569560509000| null|34.7825|35.5501|96.3354|
The E values from all four signal columns, for a particular NUM_ID, should be taken as a single column without duplicates, and the V values for the corresponding E should be populated in different columns. If a signal does not have an E-V pair for a particular E, then that column should be null, as shown above.
Thanks in advance. Any lead appreciated.
For better understanding, below is the sample structure for the input and expected output.
INPUT:
+-------+-----------------+-----------------+-----------------+-----------------+
| NUM_ID|             SIG1|             SIG2|             SIG3|             SIG4|
+-------+-----------------+-----------------+-----------------+-----------------+
|XXXXX01|[{E1,V1},{E2,V2}]|[{E1,V3},{E3,V4}]|[{E4,V5},{E5,V6}]|[{E5,V7},{E2,V8}]|
|XXXXX02|[{E7,V1},{E8,V2}]|[{E1,V3},{E3,V4}]|[{E1,V5},{E5,V6}]|[{E9,V7},{E8,V8}]|
|XXXXX03|[{E1,V1},{E2,V2}]|[{E1,V3},{E3,V4}]|[{E4,V5},{E5,V6}]|[{E5,V7},{E2,V8}]|
+-------+-----------------+-----------------+-----------------+-----------------+
OUTPUT EXPECTED:
+-------+---+--------+-------+-------+-------+
| NUM_ID| E| SIG1_V| SIG2_V| SIG3_V| SIG4_V|
+-------+---+-------+-------+-------+-------+
|XXXXX01| E1| V1| V3| null| null|
|XXXXX01| E2| V2| null| null| V8|
|XXXXX01| E3| null| V4| null| null|
|XXXXX01| E4| null| null| V5| null|
|XXXXX01| E5| null| null| V6| V7|
|XXXXX02| E1| null| V3| V5| null|
|XXXXX02| E3| null| V4| null| null|
|XXXXX02| E5| null| null| V6| null|
|XXXXX02| E7| V1| null| null| null|
|XXXXX02| E8| V2| null| null| V7|
|XXXXX02| E9| null|34.7825| null| V8|
Input CSV file is as below:
NUM_ID|SIG1|SIG2|SIG3|SIG4 XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.{col, explode, split, struct, udf}

val df = spark.read.format("csv").option("header","true").option("delimiter", "|").load("path .csv")
df.show(false)
+-------+----------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+
|NUM_ID |SIG1 |SIG2 |SIG3 |SIG4 |
+-------+----------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|
+-------+----------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+
// UDF to generate column E
def UDF_E: UserDefinedFunction = udf((r: Row) => {
  val SigColumn = "SIG1,SIG2,SIG3,SIG4"
  val colList = SigColumn.split(",").toList
  val rr = "[\\}],[\\{]".r
  var out = ""
  colList.foreach { x =>
    val a = (rr replaceAllIn(r.getAs(x).toString, "|")).replaceAll("\\[\\{", "").replaceAll("\\}\\]", "")
    val b = a.split("\\|").map(x => x.split(",")(0)).toSet
    out = out + "," + b.mkString(",")
  }
  val out1 = out.replaceFirst(s""",""", "").split(",").toSet.mkString(",")
  out1
})
// UDF to generate column value with Signal
def UDF_V: UserDefinedFunction = udf((E: String, SIG: String) => {
  val Signal = SIG.replaceAll("\\{", "\\(").replaceAll("\\}", "\\)").replaceAll("\\[", "").replaceAll("\\]", "")
  val SigMap = "(\\w+),([\\w 0-9 .]+)".r.findAllIn(Signal).matchData.map(i => { (i.group(1), i.group(2)) }).toMap
  var out = ""
  if (SigMap.keys.toList.contains(E)) {
    out = SigMap(E).toString
  }
  out
})
// New DataFrame with column "E"
val df1 = df
  .withColumn("E", UDF_E(struct(df.columns map col: _*)))
  .withColumn("E", explode(split(col("E"), ",")))
df1.show(false)
+-------+----------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+-------------+
|NUM_ID |SIG1 |SIG2 |SIG3 |SIG4 |E |
+-------+----------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+-------------+
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560483000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560497000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560475000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560489000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560535000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560531000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560513000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560537000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560491000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560521000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560505000|
+-------+----------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+-------------+
// Final DataFrame
val df2 = df1
  .withColumn("SIG1_V", UDF_V(col("E"), col("SIG1")))
  .withColumn("SIG2_V", UDF_V(col("E"), col("SIG2")))
  .withColumn("SIG3_V", UDF_V(col("E"), col("SIG3")))
  .withColumn("SIG4_V", UDF_V(col("E"), col("SIG4")))
  .drop("SIG1", "SIG2", "SIG3", "SIG4")
df2.show()
+-------+-------------+-------+-------+-------+-------+
| NUM_ID| E| SIG1_V| SIG2_V| SIG3_V| SIG4_V|
+-------+-------------+-------+-------+-------+-------+
|XXXXX01|1569560475000| 3.7812| | | |
|XXXXX01|1569560483000| 3.7812| | |34.7825|
|XXXXX01|1569560489000| |34.7825| | |
|XXXXX01|1569560491000|34.7875| | | |
|XXXXX01|1569560497000| |34.7825| | |
|XXXXX01|1569560505000| | |34.7825| |
|XXXXX01|1569560513000| | |34.7825| |
|XXXXX01|1569560521000| | |34.7825| |
|XXXXX01|1569560531000| 3.7825|34.7825|34.7825|34.7825|
|XXXXX01|1569560535000| | | |34.7825|
|XXXXX01|1569560537000| | 3.7825| | |
+-------+-------------+-------+-------+-------+-------+
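One last detail: the expected output uses null where a signal has no value for an E, while the UDFs above return an empty string. A small sketch to turn those blanks into proper nulls (when is an additional import beyond the ones used above):
import org.apache.spark.sql.functions.when

val sigCols = Seq("SIG1_V", "SIG2_V", "SIG3_V", "SIG4_V")
val df3 = sigCols.foldLeft(df2) { (acc, c) =>
  acc.withColumn(c, when(col(c) =!= "", col(c))) // rows failing the condition become null
}
df3.show(false)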

DataFrame explode list of JSON objects

I have JSON data in the following format:
{
  "date": 100,
  "userId": 1,
  "data": [
    {
      "timeStamp": 101,
      "reading": 1
    },
    {
      "timeStamp": 102,
      "reading": 2
    }
  ]
}
{
  "date": 200,
  "userId": 1,
  "data": [
    {
      "timeStamp": 201,
      "reading": 3
    },
    {
      "timeStamp": 202,
      "reading": 4
    }
  ]
}
I read it into Spark SQL:
val df = sqlContext.read.json(...)
df.printSchema
// root
// |-- date: double (nullable = true)
// |-- userId: long (nullable = true)
// |-- data: array (nullable = true)
// | |-- element: struct (containsNull = true)
// | | |-- timeStamp: double (nullable = true)
// | | |-- reading: double (nullable = true)
I would like to transform it in order to have one row per reading. To my understanding, every transformation should produce a new DataFrame, so the following should work:
import org.apache.spark.sql.functions.explode
val exploded = df
.withColumn("reading", explode(df("data.reading")))
.withColumn("timeStamp", explode(df("data.timeStamp")))
.drop("data")
exploded.printSchema
// root
// |-- date: double (nullable = true)
// |-- userId: long (nullable = true)
// |-- timeStamp: double (nullable = true)
// |-- reading: double (nullable = true)
The resulting schema is correct, but I get every value twice:
exploded.show
// +-----------+-----------+-----------+-----------+
// | date| userId| timeStamp| reading|
// +-----------+-----------+-----------+-----------+
// | 100| 1| 101| 1|
// | 100| 1| 101| 1|
// | 100| 1| 102| 2|
// | 100| 1| 102| 2|
// | 200| 1| 201| 3|
// | 200| 1| 201| 3|
// | 200| 1| 202| 4|
// | 200| 1| 202| 4|
// +-----------+-----------+-----------+-----------+
My feeling is that there is something about the lazy evaluation of the two explodes that I don't understand.
Is there a way to get the above code to work? Or should I use a different approach all together?
The resulting schema is correct, but I get every value twice
While the schema is correct, the output you've provided doesn't reflect the actual result. In practice you'll get the Cartesian product of timeStamp and reading for each input row.
My feeling is that there is something about the lazy evaluation
No, it has nothing to do with lazy evaluation. The way you use explode is just wrong. To understand what is going on, let's trace the execution for date equal to 100:
val df100 = df.where($"date" === 100)
step by step. The first explode generates two rows, one for reading 1 and one for reading 2:
val df100WithReading = df100.withColumn("reading", explode(df("data.reading")))
df100WithReading.show
// +------------------+----+------+-------+
// | data|date|userId|reading|
// +------------------+----+------+-------+
// |[[1,101], [2,102]]| 100| 1| 1|
// |[[1,101], [2,102]]| 100| 1| 2|
// +------------------+----+------+-------+
The second explode generates two rows (timeStamp equal to 101 and 102) for each row from the previous step:
val df100WithReadingAndTs = df100WithReading
.withColumn("timeStamp", explode(df("data.timeStamp")))
df100WithReadingAndTs.show
// +------------------+----+------+-------+---------+
// | data|date|userId|reading|timeStamp|
// +------------------+----+------+-------+---------+
// |[[1,101], [2,102]]| 100| 1| 1| 101|
// |[[1,101], [2,102]]| 100| 1| 1| 102|
// |[[1,101], [2,102]]| 100| 1| 2| 101|
// |[[1,101], [2,102]]| 100| 1| 2| 102|
// +------------------+----+------+-------+---------+
If you want correct results, explode data and select afterwards:
val exploded = df.withColumn("data", explode($"data"))
.select($"userId", $"date",
$"data".getItem("reading"), $"data".getItem("timestamp"))
exploded.show
// +------+----+-------------+---------------+
// |userId|date|data[reading]|data[timestamp]|
// +------+----+-------------+---------------+
// | 1| 100| 1| 101|
// | 1| 100| 2| 102|
// | 1| 200| 3| 201|
// | 1| 200| 4| 202|
// +------+----+-------------+---------------+
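A small variant of the same approach with explicit aliases, so the exploded columns keep their original field names (a sketch; assumes spark.implicits._ is in scope):
val tidy = df
  .withColumn("data", explode($"data"))
  .select(
    $"date",
    $"userId",
    $"data.reading".as("reading"),
    $"data.timeStamp".as("timeStamp")
  )
tidy.printSchema()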