Append new column to spark DF based on logic - scala

I need to add a new column to the DF below based on other columns. Here is the DF schema:
scala> a.printSchema()
root
|-- ID: decimal(22,0) (nullable = true)
|-- NAME: string (nullable = true)
|-- AMOUNT: double (nullable = true)
|-- CODE: integer (nullable = true)
|-- NAME1: string (nullable = true)
|-- code1: array (nullable = true)
| |-- element: integer (containsNull = true)
|-- revised_code: string (nullable = true)
Now I want to add a column, say flag, according to the conditions below:
1 => if code == revised_code, then flag is P
2 => if code != revised_code, then flag is I
3 => if both code and revised_code are null, then no flag.
This is the UDF that I am trying, but it gives I for both case 1 and case 3.
def tagsUdf =
udf((code: String, revised_code: String) =>
if (code == null && revised_code == null ) ""
else if (code == revised_code) "P" else "I")
tagsUdf(col("CODE"), col("revised_code"))
Can anyone please point out what mistake I am making?
I/P DF
+-----+----+------------+
|NAME |CODE|revised_code|
+-----+----+------------+
|amz  |null|null        |
|Watch|null|5812        |
|Watch|null|5812        |
|Watch|5812|5812        |
|amz  |null|null        |
|amz  |9999|4352        |
+-----+----+------------+
Schema:
root
|-- MERCHANT_NAME: string (nullable = true)
|-- CODE: integer (nullable = true)
|-- revised_mcc: string (nullable = true)
O/P DF
+-----+----+------------+----+
|NAME |CODE|revised_code|flag|
+-----+----+------------+----+
|amz  |null|null        |null|
|Watch|null|5812        |I   |
|Watch|null|5812        |I   |
|Watch|5812|5812        |P   |
|amz  |null|null        |null|
|amz  |9999|4352        |I   |
+-----+----+------------+----+

You don't need a UDF for that. A nested call to the built-in when function should do the trick.
import org.apache.spark.sql.functions._
df.withColumn("CODE", col("CODE").cast("string"))
.withColumn("flag", when(((isnull(col("CODE")) || col("CODE") === "null") && (isnull(col("revised_code")) || col("revised_code") === "null")), "").otherwise(when(col("CODE") === col("revised_code"), "P").otherwise("I")))
.show(false)
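Here, the CODE column is cast to StringType before the when logic is applied, so that CODE and revised_code have the same data type when they are compared.
Note: CODE is an IntegerType column, so it can hold a real null but never the string "null"; the === "null" checks only matter if revised_code contains that literal text.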
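To make the three rules concrete, here is a minimal sketch in plain Python (not Spark code) that acts as a truth table for the flag. The function name flag is hypothetical, None stands in for a Spark null, and the sketch returns None for the "no flag" case to match the expected output in the question, whereas the when expression above returns an empty string.

```python
def flag(code, revised_code):
    """Return None when both inputs are missing, "P" on a match, else "I"."""
    if code is None and revised_code is None:
        return None
    # Compare as strings, mirroring the cast of CODE to string.
    if (code is not None and revised_code is not None
            and str(code) == str(revised_code)):
        return "P"
    return "I"


# Rows taken from the input table in the question: (CODE, revised_code).
rows = [(None, None), (None, "5812"), (5812, "5812"), (9999, "4352")]
flags = [flag(code, revised) for code, revised in rows]
```

A row where only one side is missing falls through to "I", which is what the expected output shows for the Watch rows with a null CODE.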

Related

Spark Scala - How to explode a column into multiple rows in spark scala

I have a dataframe like this
+------+-------------+
|A     |Devices      |
+------+-------------+
|house1|[100,101,102]|
|house1|[103,104]    |
+------+-------------+
And I want to explode the column 'Devices' into multiple rows. My final dataframe should look like this
+------+-------+
|A     |Devices|
+------+-------+
|house1|100    |
|house1|101    |
|house1|102    |
|house1|103    |
|house1|104    |
+------+-------+
The schema of the table is
root
|-- A: String (nullable = true)
|-- Devices: array (nullable = true)
| |-- element: String (containsNull = true)
I tried doing this but it is showing error (UnresolvedAttribute in $"Devices")
Df.withColumn("c", explode(split($"Devices","\\,")))

Df.select(col("A"), explode(col("devices")))
Using this I was able to get the required result.

PySpark: Perform logical AND on timestamp

I have a table consisting of main ID, subID and two timestamps (start-end).
+-------+------+----------------+----------------+
|main_id|sub_id|start_timestamp |end_timestamp   |
+-------+------+----------------+----------------+
|1      |1     |2021/06/01 19:00|2021/06/01 19:10|
|1      |2     |2021/06/01 19:01|2021/06/01 19:10|
|1      |3     |2021/06/01 19:01|2021/06/01 19:05|
|1      |3     |2021/06/01 19:07|2021/06/01 19:09|
+-------+------+----------------+----------------+
My goal is to pick all the records with the same mainID (different subIDs) and perform a logical AND on the timestamp column (the goal is to find periods, where all the subIDs were active).
+-------+----------------+----------------+
|main_id|global_start    |global_stop     |
+-------+----------------+----------------+
|1      |2021/06/01 19:01|2021/06/01 19:05|
|1      |2021/06/01 19:07|2021/06/01 19:09|
+-------+----------------+----------------+
Thanks!

Partial solution
This kind of logic is probably really difficult to implement in pure Spark; the built-in functions are not enough for it.
Also, the expected output has 2 lines, whereas a simple group by main_id would output only one line, so the logic behind it is not trivial.
I would advise you to group your data by main_id and process it with Python, using a UDF.
from pyspark.sql import functions as F

# Aggregate the data by main_id
df2 = (
    df.groupby("main_id", "sub_id")
    .agg(
        F.collect_list(F.struct("start_timestamp", "end_timestamp")).alias("timestamps")
    )
    .groupby("main_id")
    .agg(F.collect_list(F.struct("sub_id", "timestamps")).alias("data"))
)
df2.show(truncate=False)
+-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|main_id|data |
+-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1 |[[3, [[2021/06/01 19:01, 2021/06/01 19:05], [2021/06/01 19:07, 2021/06/01 19:09]]], [1, [[2021/06/01 19:00, 2021/06/01 19:10]]], [2, [[2021/06/01 19:01, 2021/06/01 19:10]]]]|
+-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
df2.printSchema()
root
|-- main_id: long (nullable = true)
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- sub_id: long (nullable = true)
| | |-- timestamps: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- start_timestamp: string (nullable = true)
| | | | |-- end_timestamp: string (nullable = true)
Once this step is done, you can process the data column with Python and perform your logical AND.
@F.udf  # add the return type your function needs
def process(data):
    """
    data is a complex type (see df2.printSchema()):
    list(dict(
        "sub_id": value_of_sub_id,
        "timestamps": list(dict(
            "start_timestamp": value_of_start_timestamp,
            "end_timestamp": value_of_end_timestamp,
        ))
    ))
    """
    ...  # implement the "logical AND" here.


df2.select(
    "main_id",
    process(F.col("data"))
)
I hope this can help you or others to build a final solution.
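One way the "logical AND" left open in process could be filled in is an interval intersection across the sub_ids. The sketch below is plain Python and independent of Spark. The helper names intersect_two and logical_and are hypothetical, and it assumes that timestamps are zero-padded "YYYY/MM/DD HH:MM" strings (so string order matches chronological order) and that the intervals of a single sub_id do not overlap each other.

```python
from functools import reduce


def intersect_two(a, b):
    """Intersect two lists of (start, end) intervals, each sorted by start."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:  # keep only non-empty overlaps
            out.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def logical_and(intervals_by_sub_id):
    """Return the periods during which every sub_id is active."""
    groups = [sorted(v) for v in intervals_by_sub_id.values()]
    return reduce(intersect_two, groups) if groups else []


# The sample rows from the question, keyed by sub_id.
sample = {
    1: [("2021/06/01 19:00", "2021/06/01 19:10")],
    2: [("2021/06/01 19:01", "2021/06/01 19:10")],
    3: [("2021/06/01 19:01", "2021/06/01 19:05"),
        ("2021/06/01 19:07", "2021/06/01 19:09")],
}
result = logical_and(sample)
```

For the sample data this yields the two periods from the expected output, 19:01 to 19:05 and 19:07 to 19:09. Inside a real UDF the data argument arrives as a list of Row objects, so it would first have to be converted into the dictionary shape used here.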

Spark nested complex dataframe

I am trying to get the complex data into normal dataframe format
My data schema:
root
|-- column_names: array (nullable = true)
| |-- element: string (containsNull = true)
|-- values: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
|-- id: array (nullable = true)
| |-- element: string (containsNull = true)
My Data File(JSON Format):
{"column_names":["2_col_name","3_col_name"],"id":["a","b","c","d","e"],"values":[["2_col_1",1],["2_col_2",2],["2_col_3",9],["2_col_4",10],["2_col_5",11]]}
I am trying to convert above data into this format:
+----------+----------+----------+
|1_col_name|2_col_name|3_col_name|
+----------+----------+----------+
| a| 2_col_1| 1|
| b| 2_col_2| 2|
| c| 2_col_3| 9|
| d| 2_col_4| 10|
| e| 2_col_5| 11|
+----------+----------+----------+
I tried using explode function on id and values but got different output as below:
+---+-------------+
| id| values|
+---+-------------+
| a| [2_col_1, 1]|
| a| [2_col_2, 2]|
| a| [2_col_3, 9]|
| a|[2_col_4, 10]|
| a|[2_col_5, 11]|
| b| [2_col_1, 1]|
| b| [2_col_2, 2]|
| b| [2_col_3, 9]|
| b|[2_col_4, 10]|
+---+-------------+
only showing top 9 rows
I am not sure what I am doing wrong.

You can use the arrays_zip and inline functions to flatten the data, then pivot on the column names:
import org.apache.spark.sql.functions.{expr, first}
val df1 = df.select(
    $"column_names",
    expr("inline(arrays_zip(id, values))")
  ).select(
    $"id".as("1_col_name"),
    expr("inline(arrays_zip(column_names, values))")
  )
  .groupBy("1_col_name")
  .pivot("column_names")
  .agg(first("values"))
df1.show
//+----------+----------+----------+
//|1_col_name|2_col_name|3_col_name|
//+----------+----------+----------+
//|e |2_col_5 |11 |
//|d |2_col_4 |10 |
//|c |2_col_3 |9 |
//|b |2_col_2 |2 |
//|a |2_col_1 |1 |
//+----------+----------+----------+
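The difference between the two approaches can be seen without Spark. Exploding id and values independently produces every combination of the two, while zipping pairs them by position, which is what the question asks for. The following is a minimal plain Python sketch of that positional pairing on the sample record; the function name flatten and the id_name parameter are hypothetical, and the numeric values stay integers here although the Spark schema stores them as strings.

```python
record = {
    "column_names": ["2_col_name", "3_col_name"],
    "id": ["a", "b", "c", "d", "e"],
    "values": [["2_col_1", 1], ["2_col_2", 2], ["2_col_3", 9],
               ["2_col_4", 10], ["2_col_5", 11]],
}


def flatten(rec, id_name="1_col_name"):
    """Pair each id with its values by position, then label the values."""
    rows = []
    for row_id, vals in zip(rec["id"], rec["values"]):
        row = {id_name: row_id}
        # Second zip: attach each column name to the value at the same index.
        row.update(zip(rec["column_names"], vals))
        rows.append(row)
    return rows


rows = flatten(record)
```

The first zip plays the role of arrays_zip(id, values) and the second that of arrays_zip(column_names, values); the pivot step in the Spark version then turns the column names into actual columns.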

Spark dataframe replace null values for different data types in scala

A similar question has answers on SO, but this requirement is a little different:
na.fill in Spark DataFrame Scala
I have a sample dataframe like the one below. Each column in the dataframe has a different data type,
so using df.na.fill does not replace all the null values.
+-------------------------+-------------------------+-------------------------+
| date | id | name |
+-------------------------+-------------------------+-------------------------+
| 2000-01-01 | NULL | ABC |
| NULL | 123 | NULL |
| NULL | NULL | CDE |
+-------------------------+-------------------------+-------------------------+
So irrespective of the data type of the column, all the NULLs should be replaced by an empty string.
The number and names of the columns in the dataframe will keep changing.
The NULLs should be replaced based on the data types of the input dataframe.
Schema of this dataframe:
root
|-- date: timestamp (nullable = false)
|-- id: integer (nullable = true)
|-- points_redeemed: string (nullable = false)
Expected Result:
+-------------------------+-------------------------+-------------------------+
| date | id | name |
+-------------------------+-------------------------+-------------------------+
| 2000-01-01 | | ABC |
| | 123 | |
| | | CDE |
+-------------------------+-------------------------+-------------------------+
Can someone advise?

Use foldLeft on dataframe columns and check if the value is literal NULL or .isNull or length(trim(col)) === 0 then replace with "" else don't change the value.
Example:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
val df= Seq(("2000-01-01","NULL","ABC"),("NULL","123","NULL")).
toDF("date","id","name").
withColumn("date",col("date").cast("timestamp")).
withColumn("id",col("id").cast("int"))
df.printSchema
//root
// |-- date: timestamp (nullable = true)
// |-- id: integer (nullable = true)
// |-- name: string (nullable = true)
df.show()
//+--------------------+----+----+
//| date| id|name|
//+--------------------+----+----+
//|2000-01-01 00:00:...|null| ABC|
//| null| 123|NULL|
//+--------------------+----+----+
// Use when/otherwise to detect NULL-like values and replace them with "".
val df2 = df.columns.foldLeft(df)((df, c) => {
  df.withColumn(c, when((lower(col(c)) === lit("null")) || col(c).isNull || (length(trim(col(c))) === 0), lit("")).otherwise(col(c)))
})
df2.show()
//+-------------------+---+----+
//| date| id|name|
//+-------------------+---+----+
//|2000-01-01 00:00:00| | ABC|
//| |123| |
//+-------------------+---+----+
You can also add more checks in when statement if needed!
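The per-value rule inside the when expression can be written out on its own. This is a minimal plain Python sketch, not Spark code; the function name blank_if_null and the sample rows are hypothetical, and None stands in for a Spark null.

```python
def blank_if_null(value):
    """Return "" for None, blank strings and the literal text "null"."""
    if value is None:
        return ""
    text = str(value)
    if text.strip() == "" or text.lower() == "null":
        return ""
    return value


rows = [
    {"date": "2000-01-01", "id": None, "name": "ABC"},
    {"date": None, "id": 123, "name": "NULL"},
]
# Apply the same rule to every column, whatever its type.
cleaned = [{k: blank_if_null(v) for k, v in row.items()} for row in rows]
```

Unlike the Spark version, where mixing "" with the original values turns every affected column into a string column, this sketch leaves non-null values in their original Python type.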

Create a new column from one of the value available in another columns as an array of Key Value pair

I have extracted some data from hive to dataframe, which is in the below shown format.
+-------+----------------+----------------+----------------+----------------+
| NUM_ID|            SIG1|            SIG2|            SIG3|            SIG4|
+-------+----------------+----------------+----------------+----------------+
|XXXXX01|[{15695605310...|[{15695605310...|[{15695605310...|[{15695605310...|
|XXXXX02|[{15695604780...|[{15695604780...|[{15695604780...|[{15695604780...|
|XXXXX03|[{15695605310...|[{15695605310...|[{15695605310...|[{15695605310...|
|XXXXX04|[{15695605310...|[{15695605310...|[{15695605310...|[{15695605310...|
|XXXXX05|[{15695605310...|[{15695605310...|[{15695605310...|[{15695605310...|
|XXXXX06|[{15695605340...|[{15695605340...|[{15695605340...|[{15695605340...|
|XXXXX07|[{15695605310...|[{15695605310...|[{15695605310...|[{15695605310...|
|XXXXX08|[{15695605310...|[{15695605310...|[{15695605310...|[{15695605310...|
If we take only one signal it will be as below.
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|
[{1569560537000,3.7825},{1569560481000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|
[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560527000,34.7825}]|
[{1569560535000,34.7825},{1569560479000,34.7825},{1569560487000,34.7825}]
For each NUM_ID , each SIG column will have an array of E and V pairs.
The schema for the above data is
fromHive.printSchema
root
|-- NUM_ID: string (nullable = true)
|-- SIG1: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = true)
|-- SIG2: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = true)
|-- SIG3: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = true)
|-- SIG4: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = true)
My requirement is to get all the E values from all the columns for a particular NUM_ID and create a new column from them, with the corresponding signal values in other columns, as shown below.
+-------+-------------+-------+-------+-------+-------+
| NUM_ID| E| SIG1_V| SIG2_V| SIG3_V| SIG4_V|
+-------+-------------+-------+-------+-------+-------+
|XXXXX01|1569560531000|33.7825|34.7825| null|96.3354|
|XXXXX01|1569560505000| null| null|35.5501| null|
|XXXXX01|1569560531001|73.7825| null| null| null|
|XXXXX02|1569560505000|34.7825| null|35.5501|96.3354|
|XXXXX02|1569560531000|33.7825|34.7825|35.5501|96.3354|
|XXXXX02|1569560505001|73.7825| null| null| null|
|XXXXX02|1569560502000| null| null|35.5501|96.3354|
|XXXXX03|1569560531000|73.7825| null| null| null|
|XXXXX03|1569560505000|34.7825| null|35.5501|96.3354|
|XXXXX03|1569560509000| null|34.7825|35.5501|96.3354|
The E values from all four signal columns for a particular NUM_ID should be taken as a single column without duplicates, and the V values for the corresponding E should be populated in separate columns. If a signal does not have an E-V pair for a particular E, then that column should be null, as shown above.
Thanks in advance. Any lead is appreciated.
For better understanding, below is the sample structure for the input and the expected output.
INPUT:
+-------+-----------------+-----------------+-----------------+-----------------+
| NUM_ID|             SIG1|             SIG2|             SIG3|             SIG4|
+-------+-----------------+-----------------+-----------------+-----------------+
|XXXXX01|[{E1,V1},{E2,V2}]|[{E1,V3},{E3,V4}]|[{E4,V5},{E5,V6}]|[{E5,V7},{E2,V8}]|
|XXXXX02|[{E7,V1},{E8,V2}]|[{E1,V3},{E3,V4}]|[{E1,V5},{E5,V6}]|[{E9,V7},{E8,V8}]|
|XXXXX03|[{E1,V1},{E2,V2}]|[{E1,V3},{E3,V4}]|[{E4,V5},{E5,V6}]|[{E5,V7},{E2,V8}]|
+-------+-----------------+-----------------+-----------------+-----------------+
OUTPUT EXPECTED:
+-------+---+-------+-------+-------+-------+
| NUM_ID|  E| SIG1_V| SIG2_V| SIG3_V| SIG4_V|
+-------+---+-------+-------+-------+-------+
|XXXXX01| E1|     V1|     V3|   null|   null|
|XXXXX01| E2|     V2|   null|   null|     V8|
|XXXXX01| E3|   null|     V4|   null|   null|
|XXXXX01| E4|   null|   null|     V5|   null|
|XXXXX01| E5|   null|   null|     V6|     V7|
|XXXXX02| E1|   null|     V3|     V5|   null|
|XXXXX02| E3|   null|     V4|   null|   null|
|XXXXX02| E5|   null|   null|     V6|   null|
|XXXXX02| E7|     V1|   null|   null|   null|
|XXXXX02| E8|     V2|   null|   null|     V7|
|XXXXX02| E9|   null|34.7825|   null|     V8|
+-------+---+-------+-------+-------+-------+

Input CSV file is as below:
NUM_ID|SIG1|SIG2|SIG3|SIG4
XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions._
val df = spark.read.format("csv").option("header","true").option("delimiter", "|").load("path .csv")
df.show(false)
+-------+----------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+
|NUM_ID |SIG1 |SIG2 |SIG3 |SIG4 |
+-------+----------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|
+-------+----------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+
//UDF to generate column E: the distinct E values across all SIG columns, comma-separated
def UDF_E: UserDefinedFunction = udf((r: Row) => {
  val SigColumn = "SIG1,SIG2,SIG3,SIG4"
  val colList = SigColumn.split(",").toList
  val rr = "[\\}],[\\{]".r
  var out = ""
  colList.foreach { x =>
    // The CSV columns are read as strings, so the type is stated explicitly
    val a = (rr replaceAllIn(r.getAs[String](x), "|")).replaceAll("\\[\\{", "").replaceAll("\\}\\]", "")
    val b = a.split("\\|").map(x => x.split(",")(0)).toSet
    out = out + "," + b.mkString(",")
  }
  val out1 = out.replaceFirst(s""",""", "").split(",").toSet.mkString(",")
  out1
})
//UDF to generate column value with Signal: looks up the V for a given E, or "" if absent
def UDF_V: UserDefinedFunction = udf((E: String, SIG: String) => {
  val Signal = SIG.replaceAll("\\{", "\\(").replaceAll("\\}", "\\)").replaceAll("\\[", "").replaceAll("\\]", "")
  val SigMap = "(\\w+),([\\w 0-9 .]+)".r.findAllIn(Signal).matchData.map(i => { (i.group(1), i.group(2)) }).toMap
  var out = ""
  if (SigMap.keys.toList.contains(E)) {
    out = SigMap(E).toString
  }
  out
})
//new DataFrame with Column "E"
val df1 = df.withColumn("E", UDF_E(struct(df.columns map col: _*))).withColumn("E", explode(split(col("E"), ",")))
df1.show(false)
+-------+----------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+-------------+
|NUM_ID |SIG1 |SIG2 |SIG3 |SIG4 |E |
+-------+----------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+-------------+
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560483000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560497000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560475000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560489000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560535000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560531000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560513000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560537000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560491000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560521000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560505000|
+-------+----------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+-------------+
//Final DataFrame
val df2 = df1.withColumn("SIG1_V", UDF_V(col("E"),col("SIG1"))).withColumn("SIG2_V", UDF_V(col("E"),col("SIG2"))).withColumn("SIG3_V", UDF_V(col("E"),col("SIG3"))).withColumn("SIG4_V", UDF_V(col("E"),col("SIG4"))).drop("SIG1","SIG2","SIG3","SIG4")
df2.show()
+-------+-------------+-------+-------+-------+-------+
| NUM_ID| E| SIG1_V| SIG2_V| SIG3_V| SIG4_V|
+-------+-------------+-------+-------+-------+-------+
|XXXXX01|1569560475000| 3.7812| | | |
|XXXXX01|1569560483000| 3.7812| | |34.7825|
|XXXXX01|1569560489000| |34.7825| | |
|XXXXX01|1569560491000|34.7875| | | |
|XXXXX01|1569560497000| |34.7825| | |
|XXXXX01|1569560505000| | |34.7825| |
|XXXXX01|1569560513000| | |34.7825| |
|XXXXX01|1569560521000| | |34.7825| |
|XXXXX01|1569560531000| 3.7825|34.7825|34.7825|34.7825|
|XXXXX01|1569560535000| | | |34.7825|
|XXXXX01|1569560537000| | 3.7825| | |
+-------+-------------+-------+-------+-------+-------+
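The reshaping itself is independent of the string parsing: collect the distinct E values across all signals, then look each one up per signal. Below is a minimal plain Python sketch of that step for a single NUM_ID, using the symbolic XXXXX01 row from the question. The function name pivot_signals is hypothetical, missing values become None rather than the empty string the UDF above returns, and if a signal repeats an E the last pair wins.

```python
def pivot_signals(signals):
    """signals maps a signal name to a list of (E, V) pairs.

    Returns one row per distinct E, sorted by E, with None where a
    signal has no value for that E.
    """
    lookups = {name: dict(pairs) for name, pairs in signals.items()}
    all_e = sorted({e for pairs in signals.values() for e, _ in pairs})
    return [
        {"E": e, **{f"{name}_V": lookups[name].get(e) for name in signals}}
        for e in all_e
    ]


# The XXXXX01 row from the symbolic input table in the question.
sample = {
    "SIG1": [("E1", "V1"), ("E2", "V2")],
    "SIG2": [("E1", "V3"), ("E3", "V4")],
    "SIG3": [("E4", "V5"), ("E5", "V6")],
    "SIG4": [("E5", "V7"), ("E2", "V8")],
}
rows = pivot_signals(sample)
```

This gives five rows, E1 through E5, matching the XXXXX01 block of the expected output in the question.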
