Join two spark dataframes by evaluating an expression - scala

I have two spark dataframes
userMemberShipDF:
+----+----------------+
|user|membership_array|
+----+----------------+
|a1  |s1, s2, s3      |
|a2  |s4, s6          |
|a3  |s5, s4, s3      |
|a4  |s1, s3, s4, s5  |
|a5  |s2, s4, s6      |
|a6  |s3, s7, s1      |
|a7  |s1, s4, s6      |
+----+----------------+
and categoryDF:
+-----------+---------------------+----------+--------+
|category_id|membership_expression|start_date|duration|
+-----------+---------------------+----------+--------+
|c1         |s1 || s2             |2022-05-01|30      |
|c2         |s4 && s6 && !s2      |2022-06-20|50      |
|c3         |s3 && s4             |2022-06-10|60      |
+-----------+---------------------+----------+--------+
The resultant dataframe should contain the columns user, category_id, start_date and duration.
I already have a function that takes the membership_expression from the second dataframe along with the membership_array from the first dataframe and evaluates it to true or false.
For example, membership_expression = s1 || s2 would match users a1, a4, a5, a6 and a7, while the expression s4 && s6 && !s2 would only match a2, and so on.
I wanted to join both dataframes based on whether this expression evaluates to true or false. I looked at Spark's join, and it only accepts columns as the join condition, not this kind of boolean expression.
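For illustration, the evaluator behaves roughly like the sketch below (the real implementation is not shown here; this sketch assumes expressions are flat combinations of ||, && and ! over membership tokens, with no parentheses):
// Sketch only: illustrates the contract of the existing evaluator.
object CategoryEvaluator {
  def evaluateMemberShipExpression(expression: String, memberships: Seq[String]): Boolean = {
    val owned = memberships.map(_.trim).toSet
    expression.split("\\|\\|").exists { orTerm =>   // at least one OR-group must hold
      orTerm.split("&&").forall { andTerm =>        // every AND-term in the group must hold
        val t = andTerm.trim
        if (t.startsWith("!")) !owned.contains(t.drop(1).trim)
        else owned.contains(t)
      }
    }
  }
}
For example, evaluateMemberShipExpression("s4 && s6 && !s2", Seq("s4", "s6")) returns true, while the same expression against Seq("s2", "s4", "s6") returns false.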
So I have tried the approach below:
val matchedUserSegments = userMemberShipDF
  .map { r =>
    // categoryDF is broadcasted
    val category_items_set = categoryDF.value.flatMap { fl =>
      if (CategoryEvaluator.evaluateMemberShipExpression(fl.membership_expression, r.membership_array)) {
        Some(fl.category_id)
      } else {
        None
      }
    }
    (r.user_id, category_items_set)
  }
  .toDF("user_id", "category_items_set")
I then exploded the resultant dataframe on category_items_set and joined it back to categoryDF to obtain the desired output table.
I understand I am doing the operations twice, but I could not find a better way of calculating everything while iterating through both dataframes just once.
Please suggest an efficient way of doing this.
I have a lot of data and the Spark job is taking more than 24 hours to finish. Thanks.
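For clarity, a single-pass variant of what I am after would look roughly like the sketch below (untested; it assumes the broadcast category rows also carry start_date and duration, as in categoryDF above). It emits the category metadata directly from the flatMap, so the later explode and join are not needed:
// Sketch only: emit (user, category_id, start_date, duration) in one pass.
val matched = userMemberShipDF
  .flatMap { r =>
    categoryDF.value.collect {
      case c if CategoryEvaluator.evaluateMemberShipExpression(c.membership_expression, r.membership_array) =>
        (r.user_id, c.category_id, c.start_date, c.duration)
    }
  }
  .toDF("user", "category_id", "start_date", "duration")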

PS: To keep things simple, I've not included start_date and duration and have also limited the sample user rows to a1, a2, a3, a4. The output shown here may not exactly match your expected output; but if you use the full data, I'm sure the output will match.
import pyspark.sql.functions as F

userMemberShipDF = spark.createDataFrame([
    ("a1", ["s1", "s2", "s3"]),
    ("a2", ["s4", "s6"]),
    ("a3", ["s5", "s4", "s3"]),
    ("a4", ["s1", "s3", "s4", "s5"]),
], ["user", "membership_array"])
Convert each membership s1, s2, s3 etc. into an individual column and mark it true if the user has that membership:
userMemberShipDF = userMemberShipDF.withColumn("membership_individual", F.explode("membership_array"))
+----+----------------+---------------------+
|user|membership_array|membership_individual|
+----+----------------+---------------------+
| a1| [s1, s2, s3]| s1|
| a1| [s1, s2, s3]| s2|
| a1| [s1, s2, s3]| s3|
| a2| [s4, s6]| s4|
| a2| [s4, s6]| s6|
| a3| [s5, s4, s3]| s5|
| a3| [s5, s4, s3]| s4|
| a3| [s5, s4, s3]| s3|
| a4|[s1, s3, s4, s5]| s1|
| a4|[s1, s3, s4, s5]| s3|
| a4|[s1, s3, s4, s5]| s4|
| a4|[s1, s3, s4, s5]| s5|
+----+----------------+---------------------+
userMemberShipDF = userMemberShipDF.groupBy("user").pivot("membership_individual").agg(F.count("*").isNotNull()).na.fill(False)
+----+-----+-----+-----+-----+-----+-----+
|user| s1| s2| s3| s4| s5| s6|
+----+-----+-----+-----+-----+-----+-----+
| a3|false|false| true| true| true|false|
| a4| true|false| true| true| true|false|
| a2|false|false|false| true|false| true|
| a1| true| true| true|false|false|false|
+----+-----+-----+-----+-----+-----+-----+
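As an aside: if the membership universe is known up front, the same boolean columns can also be built without explode + pivot. A sketch in Scala (the question's language), assuming a Scala-side userMemberShipDF with the same schema and memberships s1..s7:
import org.apache.spark.sql.functions._

// Sketch: one boolean column per known membership, via array_contains.
val memberships = Seq("s1", "s2", "s3", "s4", "s5", "s6", "s7")
val oneHot = memberships.foldLeft(userMemberShipDF) { (df, m) =>
  df.withColumn(m, array_contains(col("membership_array"), m))
}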
In category data, replace ||, &&, ! with or, and, not:
categoryDF = spark.createDataFrame([
    ("c1", "s1 || s2"),
    ("c2", "s4 && s6 && !s2"),
    ("c3", "s3 && s4"),
], ["category_id", "membership_expression"])

categoryDF = categoryDF.withColumn("membership_expression", F.regexp_replace("membership_expression", r"\|\|", " or "))
categoryDF = categoryDF.withColumn("membership_expression", F.regexp_replace("membership_expression", r"&&", " and "))
categoryDF = categoryDF.withColumn("membership_expression", F.regexp_replace("membership_expression", r"!", " not "))
+-----------+---------------------+
|category_id|membership_expression|
+-----------+---------------------+
|c1         |s1 or s2             |
|c2         |s4 and s6 and not s2 |
|c3         |s3 and s4            |
+-----------+---------------------+
Cross join user and category data to evaluate each user against each category:
resultDF_sp = categoryDF.crossJoin(userMemberShipDF)
+-----------+---------------------+----+-----+-----+-----+-----+-----+-----+
|category_id|membership_expression|user|s1   |s2   |s3   |s4   |s5   |s6   |
+-----------+---------------------+----+-----+-----+-----+-----+-----+-----+
|c1         |s1 or s2             |a3  |false|false|true |true |true |false|
|c1         |s1 or s2             |a4  |true |false|true |true |true |false|
|c1         |s1 or s2             |a2  |false|false|false|true |false|true |
|c1         |s1 or s2             |a1  |true |true |true |false|false|false|
|c2         |s4 and s6 and not s2 |a3  |false|false|true |true |true |false|
|c2         |s4 and s6 and not s2 |a4  |true |false|true |true |true |false|
|c2         |s4 and s6 and not s2 |a2  |false|false|false|true |false|true |
|c2         |s4 and s6 and not s2 |a1  |true |true |true |false|false|false|
|c3         |s3 and s4            |a3  |false|false|true |true |true |false|
|c3         |s3 and s4            |a4  |true |false|true |true |true |false|
|c3         |s3 and s4            |a2  |false|false|false|true |false|true |
|c3         |s3 and s4            |a1  |true |true |true |false|false|false|
+-----------+---------------------+----+-----+-----+-----+-----+-----+-----+
Evaluate membership_expression
Ahhh! This part is not elegant
Spark provides the expr function to evaluate SQL expressions using column values, but this only works if the expression is a static string:
resultDF_sp.select(F.expr("s1 or s2"))
But if the expression "s1 or s2" is itself a column value (like the membership_expression column above), there is no way to evaluate it this way. Passing a Column results in the error Column is not iterable:
resultDF_sp.select(F.expr(F.col("membership_expression")))
There are several questions on Stack Overflow about this, but all of them suggest parsing the expression and writing an evaluator to evaluate the parsed expression manually:
How to evaluate expressions that are the column values?
Use a column value as column name
Fortunately, pandas can evaluate an expression held as a column value, using the other columns of the row as parameters.
So, the part I don't like, but have no choice about, is to convert the dataframe to pandas, evaluate the expression there, and convert back to Spark (if someone can suggest how to achieve this in Spark, I'll be happy to include it in an edit):
resultDF_pd = resultDF_sp.toPandas()

def evaluate_expr(row_series):
    df = row_series.to_frame().transpose().infer_objects()
    return df.eval(df["membership_expression"].values[0]).values[0]

resultDF_pd["is_matching_user"] = resultDF_pd.apply(lambda row: evaluate_expr(row), axis=1)
resultDF_sp = spark.createDataFrame(resultDF_pd[["category_id", "user", "is_matching_user"]])
+-----------+----+----------------+
|category_id|user|is_matching_user|
+-----------+----+----------------+
| c1| a3| false|
| c1| a4| true|
| c1| a2| false|
| c1| a1| true|
| c2| a3| false|
| c2| a4| false|
| c2| a2| true|
| c2| a1| false|
| c3| a3| true|
| c3| a4| true|
| c3| a2| false|
| c3| a1| false|
+-----------+----+----------------+
At last, filter the matching users:
resultDF_sp = resultDF_sp.filter("is_matching_user")
+-----------+----+----------------+
|category_id|user|is_matching_user|
+-----------+----+----------------+
| c1| a4| true|
| c1| a1| true|
| c2| a2| true|
| c3| a3| true|
| c3| a4| true|
+-----------+----+----------------+
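As an aside, since the question itself is in Scala: because the distinct expressions come from the small category table, one pure-Spark way to avoid the pandas round trip is to collect the expression strings to the driver and fold them into a single chained when/expr column. A sketch, assuming a Scala-side categoryDF (with the expressions already rewritten to SQL syntax) and a crossDF shaped like the cross-joined dataframe above:
import org.apache.spark.sql.functions._

// Collect the (few) distinct expression strings to the driver.
val expressions = categoryDF
  .select("membership_expression")
  .distinct()
  .collect()
  .map(_.getString(0))

// For each known expression string, evaluate it with expr() against the
// pivoted boolean columns; chain the cases with when/otherwise.
val isMatch = expressions.foldLeft(lit(false)) { (acc, e) =>
  when(col("membership_expression") === e, expr(e)).otherwise(acc)
}

val matched = crossDF
  .withColumn("is_matching_user", isMatch)
  .filter(col("is_matching_user"))
This works because expr only ever sees literal strings known on the driver; the per-row membership_expression column is used only for the equality check.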

Related

Column wise comparison between Spark Dataframe using Spark core

The example below has 3 columns, but I am looking for a column-wise comparison between two dataframes with N columns.
The sample has 5 rows and 3 columns, with EMPID as the primary key.
How can I do this comparison in Spark core?
InputDF1:
|EMPID |Dept     |Salary
------------------------
|1     |HR       |100
|2     |IT       |200
|3     |Finance  |250
|4     |Accounts |200
|5     |IT       |150
InputDF2:
|EMPID |Dept     |Salary
------------------------
|1     |HR       |100
|2     |IT       |200
|3     |FIN      |250
|4     |Accounts |150
|5     |IT       |150
Expected Result DF:
|EMPID |Dept     |Dept     |status |Salary |Salary |status
----------------------------------------------------------
|1     |HR       |HR       |TRUE   |100    |100    |TRUE
|2     |IT       |IT       |TRUE   |200    |200    |TRUE
|3     |Finance  |FIN      |FALSE  |250    |250    |TRUE
|4     |Accounts |Accounts |TRUE   |200    |150    |FALSE
|5     |IT       |IT       |TRUE   |150    |150    |TRUE
You can do a join using the EMPID and compare the resulting columns:
val result = df1.alias("df1").join(
  df2.alias("df2"), "EMPID"
).select(
  $"EMPID",
  $"df1.Dept", $"df2.Dept",
  ($"df1.Dept" === $"df2.Dept").as("status"),
  $"df1.Salary", $"df2.Salary",
  ($"df1.Salary" === $"df2.Salary").as("status")
)
result.show
+-----+--------+--------+------+------+------+------+
|EMPID| Dept| Dept|status|Salary|Salary|status|
+-----+--------+--------+------+------+------+------+
| 1| HR| HR| true| 100| 100| true|
| 2| IT| IT| true| 200| 200| true|
| 3| Finance| FIN| false| 250| 250| true|
| 4|Accounts|Accounts| true| 200| 150| false|
| 5| IT| IT| true| 150| 150| true|
+-----+--------+--------+------+------+------+------+
Note that you may wish to rename the columns, because duplicate column names cannot be queried unambiguously later.
You can use a join and then iterate over df.columns to select the desired output columns:
val df_final = df1.alias("df1")
.join(df2.alias("df2"), "EMPID")
.select(
Seq(col("EMPID")) ++
df1.columns.filter(_ != "EMPID")
.flatMap(c =>
Seq(
col(s"df1.$c").as(s"df1_$c"),
col(s"df2.$c").as(s"df2_$c"),
(col(s"df1.$c") === col(s"df2.$c")).as(s"status_$c")
)
): _*
)
df_final.show
//+-----+--------+--------+-----------+----------+----------+-------------+
//|EMPID|df1_Dept|df2_Dept|status_Dept|df1_Salary|df2_Salary|status_Salary|
//+-----+--------+--------+-----------+----------+----------+-------------+
//| 1| HR| HR| true| 100| 100| true|
//| 2| IT| IT| true| 200| 200| true|
//| 3| Finance| FIN| false| 250| 250| true|
//| 4|Accounts|Accounts| true| 200| 150| false|
//| 5| IT| IT| true| 150| 150| true|
//+-----+--------+--------+-----------+----------+----------+-------------+
You could also do it the way shown below:
// Source data
val df = Seq((1,"HR",100),(2,"IT",200),(3,"Finance",250),(4,"Accounts",200),(5,"IT",150)).toDF("EMPID","Dept","Salary")
val df1 = Seq((1,"HR",100),(2,"IT",200),(3,"Fin",250),(4,"Accounts",150),(5,"IT",150)).toDF("EMPID","Dept","Salary")
// Joins and other operations
val finalDF = df.as("d").join(df1.as("d1"), Seq("EMPID"), "inner")
  .withColumn("DeptStatus", $"d.Dept" === $"d1.Dept")
  .withColumn("SalaryStatus", $"d.Salary" === $"d1.Salary")
  .selectExpr("EMPID", "d.Dept", "d1.Dept", "DeptStatus as Status",
              "d.Salary", "d1.Salary", "SalaryStatus as Status")
display(finalDF)
The output is the same as the expected result shown above.

Doing multiple column value look up after joining with lookup dataset

I am using spark-sql-2.4.1v. How do I do various joins depending on the value of a column?
I need to get the map_val lookup values for the given value columns, as shown below.
Sample data:
val data = List(
("20", "score", "school", "2018-03-31", 14 , 12),
("21", "score", "school", "2018-03-31", 13 , 13),
("22", "rate", "school", "2018-03-31", 11 , 14),
("21", "rate", "school", "2018-03-31", 13 , 12)
)
val df = data.toDF("id", "code", "entity", "date", "value1", "value2")
df.show
+---+-----+------+----------+------+------+
| id| code|entity| date|value1|value2|
+---+-----+------+----------+------+------+
| 20|score|school|2018-03-31| 14| 12|
| 21|score|school|2018-03-31| 13| 13|
| 22| rate|school|2018-03-31| 11| 14|
| 21| rate|school|2018-03-31| 13| 12|
+---+-----+------+----------+------+------+
Lookup dataset rateDs:
val rateDs = List(
("21","2018-01-31","2018-06-31", 12 ,"C"),
("21","2018-01-31","2018-06-31", 13 ,"D")
).toDF("id","start_date","end_date", "map_code","map_val")
rateDs.show
+---+----------+----------+--------+-------+
| id|start_date| end_date|map_code|map_val|
+---+----------+----------+--------+-------+
| 21|2018-01-31|2018-06-31| 12| C|
| 21|2018-01-31|2018-06-31| 13| D|
+---+----------+----------+--------+-------+
Joining with lookup table for map_val column based on start_date and end_date:
val resultDs = df.filter(col("code").equalTo(lit("rate"))).join(rateDs ,
(
df.col("date").between(rateDs.col("start_date"), rateDs.col("end_date"))
.and(rateDs.col("id").equalTo(df.col("id")))
//.and(rateDs.col("mapping_value").equalTo(df.col("mean")))
)
, "left"
)
//.drop("start_date")
//.drop("end_date")
resultDs.show
+---+----+------+----------+------+------+----+----------+----------+--------+-------+
| id|code|entity| date|value1|value2| id|start_date| end_date|map_code|map_val|
+---+----+------+----------+------+------+----+----------+----------+--------+-------+
| 21|rate|school|2018-03-31| 13| 12| 21|2018-01-31|2018-06-31| 13| D|
| 21|rate|school|2018-03-31| 13| 12| 21|2018-01-31|2018-06-31| 12| C|
+---+----+------+----------+------+------+----+----------+----------+--------+-------+
The expected output should be:
+---+----+------+----------+------+------+----+----------+----------+--------+-------+
| id|code|entity| date|value1|value2| id|start_date| end_date|map_code|map_val|
+---+----+------+----------+------+------+----+----------+----------+--------+-------+
| 21|rate|school|2018-03-31| D | C | 21|2018-01-31|2018-06-31| 13| D|
| 21|rate|school|2018-03-31| D | C | 21|2018-01-31|2018-06-31| 12| C|
+---+----+------+----------+------+------+----+----------+----------+--------+-------+
Please let me know if any more details are needed.
Try this:
Create a lookup map per id before the join, and use it to replace the values.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val newRateDS = rateDs.withColumn("lookUpMap",
  map_from_entries(collect_list(struct(col("map_code"), col("map_val"))).over(Window.partitionBy("id")))
)
newRateDS.show(false)
/**
* +---+----------+----------+--------+-------+------------------+
* |id |start_date|end_date |map_code|map_val|lookUpMap |
* +---+----------+----------+--------+-------+------------------+
* |21 |2018-01-31|2018-06-31|12 |C |[12 -> C, 13 -> D]|
* |21 |2018-01-31|2018-06-31|13 |D |[12 -> C, 13 -> D]|
* +---+----------+----------+--------+-------+------------------+
*/
val resultDs = df.filter(col("code").equalTo(lit("rate"))).join(broadcast(newRateDS) ,
rateDs("id") === df("id") && df("date").between(rateDs("start_date"), rateDs("end_date"))
//.and(rateDs.col("mapping_value").equalTo(df.col("mean")))
, "left"
)
resultDs.withColumn("value1", expr("coalesce(lookUpMap[value1], value1)"))
.withColumn("value2", expr("coalesce(lookUpMap[value2], value2)"))
.show(false)
/**
* +---+----+------+----------+------+------+----+----------+----------+--------+-------+------------------+
* |id |code|entity|date |value1|value2|id |start_date|end_date |map_code|map_val|lookUpMap |
* +---+----+------+----------+------+------+----+----------+----------+--------+-------+------------------+
* |22 |rate|school|2018-03-31|11 |14 |null|null |null |null |null |null |
* |21 |rate|school|2018-03-31|D |C |21 |2018-01-31|2018-06-31|13 |D |[12 -> C, 13 -> D]|
* |21 |rate|school|2018-03-31|D |C |21 |2018-01-31|2018-06-31|12 |C |[12 -> C, 13 -> D]|
* +---+----+------+----------+------+------+----+----------+----------+--------+-------+------------------+
*/

How to find the max length unique rows from a dataframe with spark?

I am trying to find the unique rows (based on id) that have the maximum length value in a Spark dataframe. Each column has a value of string type.
The dataframe is like:
+---+----+----+----+----+
|id |A   |B   |C   |D   |
+---+----+----+----+----+
|1  |toto|tata|titi|    |
|1  |toto|tata|titi|tutu|
|2  |bla |blo |    |    |
|3  |b   |c   |    |d   |
|3  |b   |c   |a   |d   |
+---+----+----+----+----+
The expectation is:
+---+----+----+----+----+
|id |A   |B   |C   |D   |
+---+----+----+----+----+
|1  |toto|tata|titi|tutu|
|2  |bla |blo |    |    |
|3  |b   |c   |a   |d   |
+---+----+----+----+----+
I can't figure out how to do this easily in Spark...
Thanks in advance
Note: This approach takes care of any addition or deletion of columns in the DataFrame, without requiring code changes.
It can be done by first computing the length of all columns concatenated together (except the first column), then keeping only the rows with the maximum length per id.
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
val output = input.withColumn("rowLength", length(concat(input.columns.toList.drop(1).map(col): _*)))
  .withColumn("maxLength", max($"rowLength").over(Window.partitionBy($"id")))
  .filter($"rowLength" === $"maxLength")
  .drop("rowLength", "maxLength")
scala> df.show
+---+----+----+----+----+
| id| A| B| C| D|
+---+----+----+----+----+
| 1|toto|tata|titi| |
| 1|toto|tata|titi|tutu|
| 2| bla| blo| | |
| 3| b| c| | d|
| 3| b| c| a| d|
+---+----+----+----+----+
scala> df.groupBy("id").agg(
         concat_ws("", collect_set(col("A"))).alias("A"),
         concat_ws("", collect_set(col("B"))).alias("B"),
         concat_ws("", collect_set(col("C"))).alias("C"),
         concat_ws("", collect_set(col("D"))).alias("D")
       ).show
+---+----+----+----+----+
| id| A| B| C| D|
+---+----+----+----+----+
| 1|toto|tata|titi|tutu|
| 2| bla| blo| | |
| 3| b| c| a| d|
+---+----+----+----+----+
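A possible generalization of the above (a sketch, assuming id is the only key column): build the same collect_set/concat_ws aggregation from df.columns, so that added or removed value columns need no code change, in the spirit of the note on the first answer.
import org.apache.spark.sql.functions._

// Sketch: derive the aggregation list from the dataframe's own columns.
val aggCols = df.columns.filter(_ != "id").map(c => concat_ws("", collect_set(col(c))).alias(c))
val result = df.groupBy("id").agg(aggCols.head, aggCols.tail: _*)
result.show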

How to join two DataFrames and change column for missing values?

val df1 = sc.parallelize(Seq(
  ("a1",10,"ACTIVE","ds1"),
  ("a1",20,"ACTIVE","ds1"),
  ("a2",50,"ACTIVE","ds1"),
  ("a3",60,"ACTIVE","ds1"))
).toDF("c1","c2","c3","c4")

val df2 = sc.parallelize(Seq(
  ("a1",10,"ACTIVE","ds2"),
  ("a1",20,"ACTIVE","ds2"),
  ("a1",30,"ACTIVE","ds2"),
  ("a1",40,"ACTIVE","ds2"),
  ("a4",20,"ACTIVE","ds2"))
).toDF("c1","c2","c3","c5")
df1.show()
// +---+---+------+---+
// | c1| c2| c3| c4|
// +---+---+------+---+
// | a1| 10|ACTIVE|ds1|
// | a1| 20|ACTIVE|ds1|
// | a2| 50|ACTIVE|ds1|
// | a3| 60|ACTIVE|ds1|
// +---+---+------+---+
df2.show()
// +---+---+------+---+
// | c1| c2| c3| c5|
// +---+---+------+---+
// | a1| 10|ACTIVE|ds2|
// | a1| 20|ACTIVE|ds2|
// | a1| 30|ACTIVE|ds2|
// | a1| 40|ACTIVE|ds2|
// | a4| 20|ACTIVE|ds2|
// +---+---+------+---+
My requirement is: I need to join both dataframes.
My output dataframe should have all the records from df1, plus the records from df2 that are not in df1, but only for matching values of "c1". The records I pull from df2 should be updated to INACTIVE in column "c3".
In this example the only matching value of "c1" is a1, so I need to pull the c2=30 and c2=40 records from df2 and make them INACTIVE.
Here is the expected output:
df_output.show()
// +---+---+--------+---+
// | c1| c2| c3 | c4|
// +---+---+--------+---+
// | a1| 10|ACTIVE |ds1|
// | a1| 20|ACTIVE |ds1|
// | a2| 50|ACTIVE |ds1|
// | a3| 60|ACTIVE |ds1|
// | a1| 30|INACTIVE|ds1|
// | a1| 40|INACTIVE|ds1|
// +---+---+--------+---+
Can anyone help me do this?
First, a small thing. I use different names for the columns in df2:
val df2 = sc.parallelize(...).toDF("d1","d2","d3","d4")
No big deal, but this made things easier for me to reason about.
Now for the fun stuff. I am going to be a bit verbose for the sake of clarity:
val join = df1
.join(df2, df1("c1") === df2("d1"), "inner")
.select($"d1", $"d2", $"d3", lit("ds1").as("d4"))
.dropDuplicates
Here I do the following:
Inner join between df1 and df2 on the c1 and d1 columns
Select the df2 columns and simply "hardcode" ds1 in the last column to replace ds2
Drop duplicates
This basically just filters out everything in df2 that does not have a corresponding key in c1 in df1.
Next I diff:
val diff = join
.except(df1)
.select($"d1", $"d2", lit("INACTIVE").as("d3"), $"d4")
This is a basic set operation that finds everything in join that is not in df1. These are the items to deactivate, so I select all the columns but replace the third with a hardcoded INACTIVE value.
All that's left is to put them all together:
df1.union(diff)
This simply combines df1 with the table of deactivated values we calculated earlier to produce the final result:
+---+---+--------+---+
| c1| c2| c3| c4|
+---+---+--------+---+
| a1| 10| ACTIVE|ds1|
| a1| 20| ACTIVE|ds1|
| a2| 50| ACTIVE|ds1|
| a3| 60| ACTIVE|ds1|
| a1| 30|INACTIVE|ds1|
| a1| 40|INACTIVE|ds1|
+---+---+--------+---+
And again, you don't need all these intermediate values; I was just verbose to help trace through the process.
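For example, collapsed into a single chain (a sketch of the same steps, assuming the usual spark.implicits._ import for $):
import org.apache.spark.sql.functions.lit

// Same logic as above, written as one expression.
val result = df1.union(
  df1.join(df2, df1("c1") === df2("d1"), "inner")
    .select($"d1", $"d2", $"d3", lit("ds1").as("d4"))
    .dropDuplicates
    .except(df1)
    .select($"d1", $"d2", lit("INACTIVE").as("d3"), $"d4")
)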
Here is a dirty solution:
from pyspark.sql import functions as F

# find the rows from df2 that have a matching key c1 in df1
df3 = df1.join(df2, df1.c1 == df2.c1)\
    .select(df2.c1, df2.c2, df2.c3, df2.c5.alias('c4'))\
    .dropDuplicates()
df3.show()
+---+---+------+---+
| c1| c2| c3| c4|
+---+---+------+---+
| a1| 10|ACTIVE|ds2|
| a1| 20|ACTIVE|ds2|
| a1| 30|ACTIVE|ds2|
| a1| 40|ACTIVE|ds2|
+---+---+------+---+
# Union df3 with df1 and change columns c3 and c4 if c4 value is 'ds2'
df1.union(df3).dropDuplicates(['c1','c2'])\
    .select('c1','c2',
        F.when(df1.c4=='ds2','INACTIVE').otherwise('ACTIVE').alias('c3'),
        F.when(df1.c4=='ds2','ds1').otherwise('ds1').alias('c4')
    )\
    .orderBy('c1','c2')\
    .show()
+---+---+--------+---+
| c1| c2| c3| c4|
+---+---+--------+---+
| a1| 10| ACTIVE|ds1|
| a1| 20| ACTIVE|ds1|
| a1| 30|INACTIVE|ds1|
| a1| 40|INACTIVE|ds1|
| a2| 50| ACTIVE|ds1|
| a3| 60| ACTIVE|ds1|
+---+---+--------+---+
Enjoyed the challenge and here is my solution.
val c1keys = df1.select("c1").distinct
val df2_in_df1 = df2.join(c1keys, Seq("c1"), "inner")
val df2inactive = df2_in_df1.join(df1, Seq("c1", "c2"), "leftanti").withColumn("c3", lit("INACTIVE"))
scala> df1.union(df2inactive).show
+---+---+--------+---+
| c1| c2| c3| c4|
+---+---+--------+---+
| a1| 10| ACTIVE|ds1|
| a1| 20| ACTIVE|ds1|
| a2| 50| ACTIVE|ds1|
| a3| 60| ACTIVE|ds1|
| a1| 30|INACTIVE|ds2|
| a1| 40|INACTIVE|ds2|
+---+---+--------+---+

Splitting a row into multiple rows in spark-shell

I have imported data into a Spark dataframe in spark-shell. The data is filled in like this:
Col1 | Col2 | Col3 | Col4
A1   | 11   | B2   | a|b;1;0xFFFFFF
A1   | 12   | B1   | 2
A2   | 12   | B2   | 0xFFF45B
Here in Col4 the values are of different kinds, and I want to separate them (suppose "a|b" is of type alphabets, "1" or "2" is of type digits, and "0xFFFFFF" or "0xFFF45B" is of type hexadecimal):
So, the output should be:
Col1 | Col2 | Col3 | alphabets | digits | hexadecimal
A1   | 11   | B2   | a         | 1      | 0xFFFFFF
A1   | 11   | B2   | b         | 1      | 0xFFFFFF
A1   | 12   | B1   |           | 2      |
A2   | 12   | B2   |           |        | 0xFFF45B
Hope I've made my query clear; I am using spark-shell. Thanks in advance.
Edit after getting this answer about how to use a backreference in regexp_replace.
You can use regexp_replace with a backreference, then split twice and explode. It is, imo, cleaner than my original solution.
val df = List(
  ("A1", "11", "B2", "a|b;1;0xFFFFFF"),
  ("A1", "12", "B1", "2"),
  ("A2", "12", "B2", "0xFFF45B")
).toDF("Col1", "Col2", "Col3", "Col4")

val regExStr = "^([A-z|]+)?;?(\\d+)?;?(0x.*)?$"

val res = df
  .withColumn("backrefReplace",
    split(regexp_replace('Col4, regExStr, "$1;$2;$3"), ";"))
  .select('Col1, 'Col2, 'Col3,
    explode(split('backrefReplace(0), "\\|")).as("letter"),
    'backrefReplace(1).as("digits"),
    'backrefReplace(2).as("hexadecimal")
  )
+----+----+----+------+------+-----------+
|Col1|Col2|Col3|letter|digits|hexadecimal|
+----+----+----+------+------+-----------+
| A1| 11| B2| a| 1| 0xFFFFFF|
| A1| 11| B2| b| 1| 0xFFFFFF|
| A1| 12| B1| | 2| |
| A2| 12| B2| | | 0xFFF45B|
+----+----+----+------+------+-----------+
You still need to replace the empty strings with null, though...
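For example, one way to do that cleanup (a sketch, using the res dataframe from above):
import org.apache.spark.sql.functions._

// Turn empty strings into nulls in the three extracted columns.
// when() without otherwise() yields null when the condition does not hold.
val cleaned = Seq("letter", "digits", "hexadecimal").foldLeft(res) { (df, c) =>
  df.withColumn(c, when(col(c) =!= "", col(c)))
}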
Previous Answer (somebody might still prefer it):
Here is a solution that sticks to DataFrames but is also quite messy. You can first use regexp_extract three times (possible to do less with backreference?), and finally split on "|" and explode. Note that you need a coalesce for explode to return everything (you still might want to change the empty strings in letter to null in this solution).
val res = df
  .withColumn("alphabets", regexp_extract('Col4, "(^[A-z|]+)?", 1))
  .withColumn("digits", regexp_extract('Col4, "^([A-z|]+)?;?(\\d+)?;?(0x.*)?$", 2))
  .withColumn("hexadecimal", regexp_extract('Col4, "^([A-z|]+)?;?(\\d+)?;?(0x.*)?$", 3))
  .withColumn("letter",
    explode(
      split(
        coalesce('alphabets, lit("")),
        "\\|"
      )
    )
  )
res.show
+----+----+----+--------------+---------+------+-----------+------+
|Col1|Col2|Col3| Col4|alphabets|digits|hexadecimal|letter|
+----+----+----+--------------+---------+------+-----------+------+
| A1| 11| B2|a|b;1;0xFFFFFF| a|b| 1| 0xFFFFFF| a|
| A1| 11| B2|a|b;1;0xFFFFFF| a|b| 1| 0xFFFFFF| b|
| A1| 12| B1| 2| null| 2| null| |
| A2| 12| B2| 0xFFF45B| null| null| 0xFFF45B| |
+----+----+----+--------------+---------+------+-----------+------+
Note: The regexp part could be so much better with backreference, so if somebody knows how to do it, please comment!
Not sure this is doable while staying 100% with DataFrames; here's a (somewhat messy?) solution using RDDs for the split itself:
import org.apache.spark.sql.functions._
import sqlContext.implicits._
// we switch to RDD to perform the split of Col4 into 3 columns
val rddWithSplitCol4 = input.rdd.map { r =>
  val indexToValue = r.getAs[String]("Col4").split(';').map {
    case s if s.startsWith("0x") => 2 -> s
    case s if s.matches("\\d+") => 1 -> s
    case s => 0 -> s
  }
  val newCols: Array[String] = indexToValue.foldLeft(Array.fill[String](3)("")) {
    case (arr, (index, value)) => arr.updated(index, value)
  }
  (r.getAs[String]("Col1"), r.getAs[Int]("Col2"), r.getAs[String]("Col3"), newCols(0), newCols(1), newCols(2))
}
// switch back to Dataframe and explode alphabets column
val result = rddWithSplitCol4
  .toDF("Col1", "Col2", "Col3", "alphabets", "digits", "hexadecimal")
  .withColumn("alphabets", explode(split(col("alphabets"), "\\|")))
result.show(truncate = false)
// +----+----+----+---------+------+-----------+
// |Col1|Col2|Col3|alphabets|digits|hexadecimal|
// +----+----+----+---------+------+-----------+
// |A1 |11 |B2 |a |1 |0xFFFFFF |
// |A1 |11 |B2 |b |1 |0xFFFFFF |
// |A1 |12 |B1 | |2 | |
// |A2 |12 |B2 | | |0xFFF45B |
// +----+----+----+---------+------+-----------+