I have extracted some data from Hive into a DataFrame, in the format shown below.
+-------+----------------+----------------+----------------+----------------+
| NUM_ID|            SIG1|            SIG2|            SIG3|            SIG4|
+-------+----------------+----------------+----------------+----------------+
|XXXXX01|[{15695605310...|[{15695605310...|[{15695605310...|[{15695605310...|
|XXXXX02|[{15695604780...|[{15695604780...|[{15695604780...|[{15695604780...|
|XXXXX03|[{15695605310...|[{15695605310...|[{15695605310...|[{15695605310...|
|XXXXX04|[{15695605310...|[{15695605310...|[{15695605310...|[{15695605310...|
|XXXXX05|[{15695605310...|[{15695605310...|[{15695605310...|[{15695605310...|
|XXXXX06|[{15695605340...|[{15695605340...|[{15695605340...|[{15695605340...|
|XXXXX07|[{15695605310...|[{15695605310...|[{15695605310...|[{15695605310...|
|XXXXX08|[{15695605310...|[{15695605310...|[{15695605310...|[{15695605310...|
Expanded, a single row (one NUM_ID) looks like this:
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|
[{1569560537000,3.7825},{1569560481000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|
[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560527000,34.7825}]|
[{1569560535000,34.7825},{1569560479000,34.7825},{1569560487000,34.7825}]
For each NUM_ID, each SIG column holds an array of (E, V) pairs.
The schema for the above data is:
fromHive.printSchema
root
|-- NUM_ID: string (nullable = true)
|-- SIG1: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = true)
|-- SIG2: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = true)
|-- SIG3: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = true)
|-- SIG4: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = true)
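For anyone who wants to reproduce this schema locally, here is a minimal sketch with made-up values (the EV case class and the literals are placeholders, not the real data):

import spark.implicits._

// Hypothetical reproduction of the schema above with made-up values
case class EV(E: Long, V: Double)
val fromHive = Seq(
  ("XXXXX01",
    Seq(EV(1569560531000L, 3.7825), EV(1569560475000L, 3.7812)),
    Seq(EV(1569560537000L, 3.7825)),
    Seq(EV(1569560505000L, 34.7825)),
    Seq(EV(1569560535000L, 34.7825)))
).toDF("NUM_ID", "SIG1", "SIG2", "SIG3", "SIG4")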
My requirement is to take all the E values from all the SIG columns for a particular NUM_ID and turn them into a new column, with the corresponding signal values in separate columns, as shown below.
+-------+-------------+-------+-------+-------+-------+
| NUM_ID| E| SIG1_V| SIG2_V| SIG3_V| SIG4_V|
+-------+-------------+-------+-------+-------+-------+
|XXXXX01|1569560531000|33.7825|34.7825| null|96.3354|
|XXXXX01|1569560505000| null| null|35.5501| null|
|XXXXX01|1569560531001|73.7825| null| null| null|
|XXXXX02|1569560505000|34.7825| null|35.5501|96.3354|
|XXXXX02|1569560531000|33.7825|34.7825|35.5501|96.3354|
|XXXXX02|1569560505001|73.7825| null| null| null|
|XXXXX02|1569560502000| null| null|35.5501|96.3354|
|XXXXX03|1569560531000|73.7825|   null|   null|   null|
|XXXXX03|1569560505000|34.7825| null|35.5501|96.3354|
|XXXXX03|1569560509000| null|34.7825|35.5501|96.3354|
The E values from all four signal columns for a particular NUM_ID should be combined into a single column without duplicates, and the V value for each corresponding E should be populated in a separate column per signal. If a signal has no E-V pair for a particular E, that cell should be null, as shown above.
Thanks in advance; any lead is appreciated.
For better understanding, below is a sample structure for the input and expected output.
INPUT:
+-------+-----------------+-----------------+-----------------+-----------------+
| NUM_ID|             SIG1|             SIG2|             SIG3|             SIG4|
+-------+-----------------+-----------------+-----------------+-----------------+
|XXXXX01|[{E1,V1},{E2,V2}]|[{E1,V3},{E3,V4}]|[{E4,V5},{E5,V6}]|[{E5,V7},{E2,V8}]|
|XXXXX02|[{E7,V1},{E8,V2}]|[{E1,V3},{E3,V4}]|[{E1,V5},{E5,V6}]|[{E9,V7},{E8,V8}]|
|XXXXX03|[{E1,V1},{E2,V2}]|[{E1,V3},{E3,V4}]|[{E4,V5},{E5,V6}]|[{E5,V7},{E2,V8}]|
+-------+-----------------+-----------------+-----------------+-----------------+
EXPECTED OUTPUT:
+-------+---+-------+-------+-------+-------+
| NUM_ID|  E| SIG1_V| SIG2_V| SIG3_V| SIG4_V|
+-------+---+-------+-------+-------+-------+
|XXXXX01| E1|     V1|     V3|   null|   null|
|XXXXX01| E2|     V2|   null|   null|     V8|
|XXXXX01| E3|   null|     V4|   null|   null|
|XXXXX01| E4|   null|   null|     V5|   null|
|XXXXX01| E5|   null|   null|     V6|     V7|
|XXXXX02| E1|   null|     V3|     V5|   null|
|XXXXX02| E3|   null|     V4|   null|   null|
|XXXXX02| E5|   null|   null|     V6|   null|
|XXXXX02| E7|     V1|   null|   null|   null|
|XXXXX02| E8|     V2|   null|   null|     V8|
|XXXXX02| E9|   null|   null|   null|     V7|
+-------+---+-------+-------+-------+-------+
The input CSV file is as below:
NUM_ID|SIG1|SIG2|SIG3|SIG4
XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions._

val df = spark.read.format("csv")
  .option("header", "true")
  .option("delimiter", "|")
  .load("path.csv")
df.show(false)
+-------+----------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+
|NUM_ID |SIG1 |SIG2 |SIG3 |SIG4 |
+-------+----------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|
+-------+----------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+
// UDF to generate column E: since the CSV read leaves SIG1..SIG4 as plain
// strings, this collects the distinct E values across all four SIG columns
// of a row and returns them as a comma-separated string.
def UDF_E: UserDefinedFunction = udf((r: Row) => {
  val colList = "SIG1,SIG2,SIG3,SIG4".split(",").toList
  val rr = "[\\}],[\\{]".r   // matches the "},{" between pairs
  var out = ""
  colList.foreach { x =>
    // strip the surrounding [{ ... }] and separate the pairs with "|"
    val a = rr.replaceAllIn(r.getAs[String](x), "|")
      .replaceAll("\\[\\{", "").replaceAll("\\}\\]", "")
    // keep only the E (first) element of each pair
    val b = a.split("\\|").map(_.split(",")(0)).toSet
    out = out + "," + b.mkString(",")
  }
  // drop the leading comma and de-duplicate across columns
  out.replaceFirst(",", "").split(",").toSet.mkString(",")
})
// UDF to look up the V value of one SIG string column for a given E;
// returns the empty string when the signal has no pair for that E.
def UDF_V: UserDefinedFunction = udf((e: String, sig: String) => {
  // turn "[{E,V},{E,V}]" into "(E,V),(E,V)" so the pairs can be matched
  val signal = sig.replaceAll("\\{", "(").replaceAll("\\}", ")")
    .replaceAll("\\[", "").replaceAll("\\]", "")
  // build a Map of E -> V from the pairs
  val sigMap = "(\\w+),([\\w 0-9 .]+)".r.findAllIn(signal).matchData
    .map(i => (i.group(1), i.group(2))).toMap
  sigMap.getOrElse(e, "")
})
// New DataFrame with column "E": one row per distinct E value of each input row
val df1 = df
  .withColumn("E", UDF_E(struct(df.columns map col: _*)))
  .withColumn("E", explode(split(col("E"), ",")))
df1.show(false)
+-------+----------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+-------------+
|NUM_ID |SIG1 |SIG2 |SIG3 |SIG4 |E |
+-------+----------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+-------------+
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560483000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560497000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560475000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560489000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560535000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560531000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560513000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560537000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560491000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560521000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560505000|
+-------+----------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+-------------+
// Final DataFrame: one V column per signal, original SIG columns dropped
val df2 = df1
  .withColumn("SIG1_V", UDF_V(col("E"), col("SIG1")))
  .withColumn("SIG2_V", UDF_V(col("E"), col("SIG2")))
  .withColumn("SIG3_V", UDF_V(col("E"), col("SIG3")))
  .withColumn("SIG4_V", UDF_V(col("E"), col("SIG4")))
  .drop("SIG1", "SIG2", "SIG3", "SIG4")
df2.show()
+-------+-------------+-------+-------+-------+-------+
| NUM_ID| E| SIG1_V| SIG2_V| SIG3_V| SIG4_V|
+-------+-------------+-------+-------+-------+-------+
|XXXXX01|1569560475000| 3.7812| | | |
|XXXXX01|1569560483000| 3.7812| | |34.7825|
|XXXXX01|1569560489000| |34.7825| | |
|XXXXX01|1569560491000|34.7875| | | |
|XXXXX01|1569560497000| |34.7825| | |
|XXXXX01|1569560505000| | |34.7825| |
|XXXXX01|1569560513000| | |34.7825| |
|XXXXX01|1569560521000| | |34.7825| |
|XXXXX01|1569560531000| 3.7825|34.7825|34.7825|34.7825|
|XXXXX01|1569560535000| | | |34.7825|
|XXXXX01|1569560537000| | 3.7825| | |
+-------+-------------+-------+-------+-------+-------+
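For completeness: if the data is read directly from Hive with the array-of-struct schema shown in the question (rather than from CSV as plain strings), the same result can be obtained without UDFs. A minimal sketch, assuming the DataFrame is named fromHive as in the question: explode each SIG column into (NUM_ID, E, SIGx_V) rows and full-outer-join the four results, so the missing cells come out as true nulls instead of the empty strings above.

import org.apache.spark.sql.functions._

// Explode each SIG column to one (NUM_ID, E, SIGx_V) row per pair
val sigCols = Seq("SIG1", "SIG2", "SIG3", "SIG4")
val perSig = sigCols.map { s =>
  fromHive
    .select(col("NUM_ID"), explode(col(s)).as("pair"))
    .select(col("NUM_ID"), col("pair.E").as("E"), col("pair.V").as(s + "_V"))
}
// The full outer join on (NUM_ID, E) collects all distinct E values per NUM_ID
val result = perSig.reduce(_.join(_, Seq("NUM_ID", "E"), "full_outer"))
result.orderBy("NUM_ID", "E").show(false)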
I have two DataFrames:
val df1 = sc.parallelize(Seq((123, 2.23, 1.12), (234, 2.45, 0.12), (456, 1.112, 0.234))).toDF("objid", "ra", "dec")
val df2 = sc.parallelize(Seq((4567, 123, "name1", "val1"), (2322, 456, "name2", "val2"), (3324, 555, "name3", "val3"), (5556, 123, "name4", "val4"), (3345, 123, "name5", "val5"))).toDF("specid", "objid", "name", "value")
They look like this:
df1.show()
+-----+-----+-----+
|objid| ra| dec|
+-----+-----+-----+
| 123| 2.23| 1.12|
| 234| 2.45| 0.12|
| 456|1.112|0.234|
+-----+-----+-----+
df2.show()
+------+-----+-----+-----+
|specid|objid| name|value|
+------+-----+-----+-----+
| 4567| 123|name1| val1|
| 2322| 456|name2| val2|
| 3324| 555|name3| val3|
| 5556| 123|name4| val4|
| 3345| 123|name5| val5|
+------+-----+-----+-----+
Now I want to nest df2 inside df1 as a nested column, so that the schema looks like this:
val new_schema = df1.schema.add("specs", df2.schema)
new_schema: org.apache.spark.sql.types.StructType = StructType(StructField(objid,IntegerType,false), StructField(ra,DoubleType,false), StructField(dec,DoubleType,false), StructField(specs,StructType(StructField(specid,IntegerType,false), StructField(objid,IntegerType,false), StructField(name,StringType,true), StructField(value,StringType,true)),true))
I want to do it this way because there is a one-to-many relationship between df1 and df2: there can be more than one spec per objid. And I am not going to join only these two tables; there are about 50 tables that I ultimately want to join into one mega table. Most of those tables have one-to-many relationships, and I am looking for a way to avoid a lot of duplicate rows and null cells in the final join result.
The ultimate result would look something like:
+-----+-----+-----+------------------------+
|objid|   ra|  dec|         specs          |
|     |     |     | specid |  name  | value |
+-----+-----+-----+--------+--------+-------+
|  123| 2.23| 1.12|   4567 | name1  |  val1 |
|     |     |     |   5556 | name4  |  val4 |
|     |     |     |   3345 | name5  |  val5 |
+-----+-----+-----+--------+--------+-------+
|  234| 2.45| 0.12|        |        |       |
+-----+-----+-----+--------+--------+-------+
|  456|1.112|0.234|   2322 | name2  |  val2 |
+-----+-----+-----+--------+--------+-------+
I was trying to add the column to df1 using .withColumn but ran into errors.
What I actually want to do is select all the columns from df2 where df2.objid = df1.objid, and make the matching rows a new column in df1. I am not sure that is the best approach, though, and even if it is, I am not sure how to do it.
Could someone please tell me how to do this?
As far as I know, you cannot have a DataFrame inside another DataFrame (the same is true for RDDs).
What you need is a join between the two DataFrames. You can perform different types of joins to combine the rows of the two DataFrames (this is how you "nest" the df2 columns inside df1).
You need to join the two DataFrames on the column objid, like below:
val join = df1.join(df2, "objid")
join.printSchema()
output:
root
|-- objid: integer (nullable = false)
|-- ra: double (nullable = false)
|-- dec: double (nullable = false)
|-- specid: integer (nullable = false)
|-- name: string (nullable = true)
|-- value: string (nullable = true)
and when we say
join.show()
the output will be
+-----+-----+-----+------+-----+-----+
|objid| ra| dec|specid| name|value|
+-----+-----+-----+------+-----+-----+
| 456|1.112|0.234| 2322|name2| val2|
| 123| 2.23| 1.12| 4567|name1| val1|
+-----+-----+-----+------+-----+-----+
For more details on the available join types, see the Spark SQL documentation.
Update:
I think you are looking for something like this:
df1.join(df2, df1("objid") === df2("objid"), "left_outer").show()
and the output is:
+-----+-----+-----+------+-----+-----+-----+
|objid| ra| dec|specid|objid| name|value|
+-----+-----+-----+------+-----+-----+-----+
| 456|1.112|0.234| 2322| 456|name2| val2|
| 234| 2.45| 0.12| null| null| null| null|
| 123| 2.23| 1.12| 4567| 123|name1| val1|
| 123| 2.23| 1.12| 5556| 123|name4| val4|
| 123| 2.23| 1.12| 3345| 123|name5| val5|
+-----+-----+-----+------+-----+-----+-----+
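If you really do want one row per objid with the spec rows nested inside (as in the sketch in the question), the closest you can get is a nested array of structs rather than a nested DataFrame. A minimal sketch, using the df1 and df2 defined above: collect_list aggregates df2 down to one row per objid, and joining on Seq("objid") also avoids the duplicate objid column seen in the output above.

import org.apache.spark.sql.functions._

// Aggregate df2 into one array-of-structs row per objid
val specs = df2
  .groupBy("objid")
  .agg(collect_list(struct(col("specid"), col("name"), col("value"))).as("specs"))

// Left join keeps objids with no specs (their "specs" column is null)
val nested = df1.join(specs, Seq("objid"), "left_outer")
nested.printSchema() // specs: array of struct<specid, name, value>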