PySpark window functions: preceding and following event

I have the following dataframe in pyspark:
+--------------------+-------------------+---------+-----------------------+-----------+
|device_id           |order_creation_time|order_id |status_check_time      |status_code|
+--------------------+-------------------+---------+-----------------------+-----------+
|67a-05df-4ca5-af6ajn|2022-11-26 23:54:41|105785113|2022-11-26 23:55:33.858|200        |
|67a-05df-4ca5-af6ajn|2022-11-26 23:54:41|105785113|2022-11-26 23:55:13.1  |200        |
|67a-05df-4ca5-af6ajn|2022-11-26 23:54:41|105785113|2022-11-26 23:54:57.682|200        |
|67a-05df-4ca5-af6ajn|2022-11-26 23:54:41|105785113|2022-11-26 23:54:36.676|200        |
|67a-05df-4ca5-af6ajn|2022-11-26 23:54:41|105785113|2022-11-26 23:54:21.293|200        |
+--------------------+-------------------+---------+-----------------------+-----------+
I need to get the status_check_time immediately preceding and the one immediately following the order_creation_time.
The order_creation_time column is always constant within the same order_id (so each order_id has exactly one order_creation_time).
In this case, the output should be:
+--------------------+-------------------+---------+---------------------------+-----------------------+
|device_id           |order_creation_time|order_id |previous_status_check_time |next_status_check_time |
+--------------------+-------------------+---------+---------------------------+-----------------------+
|67a-05df-4ca5-af6ajn|2022-11-26 23:54:41|105785113|2022-11-26 23:54:36.676    |2022-11-26 23:54:57.682|
+--------------------+-------------------+---------+---------------------------+-----------------------+
I was trying to use lag and lead functions, but I'm not getting the desired output:
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import lag, lead

ss = (
    SparkSession.
    builder.
    appName("test").
    master("local[2]").
    getOrCreate()
)

data = [
    {"device_id": "67a-05df-4ca5-af6ajn", "order_creation_time": "2022-11-26 23:54:41", "order_id": "105785113", "status_check_time": "2022-11-26 23:55:33.858", "status_code": 200},
    {"device_id": "67a-05df-4ca5-af6ajn", "order_creation_time": "2022-11-26 23:54:41", "order_id": "105785113", "status_check_time": "2022-11-26 23:55:13.1", "status_code": 200},
    {"device_id": "67a-05df-4ca5-af6ajn", "order_creation_time": "2022-11-26 23:54:41", "order_id": "105785113", "status_check_time": "2022-11-26 23:54:57.682", "status_code": 200},
    {"device_id": "67a-05df-4ca5-af6ajn", "order_creation_time": "2022-11-26 23:54:41", "order_id": "105785113", "status_check_time": "2022-11-26 23:54:36.676", "status_code": 200},
    {"device_id": "67a-05df-4ca5-af6ajn", "order_creation_time": "2022-11-26 23:54:41", "order_id": "105785113", "status_check_time": "2022-11-26 23:54:21.293", "status_code": 200}
]

df = ss.createDataFrame(data)

windowSpec = Window.partitionBy("device_id").orderBy("status_check_time")

(
    df.withColumn(
        "previous_status_check_time", lag("status_check_time").over(windowSpec)
    ).withColumn(
        "next_status_check_time", lead("status_check_time").over(windowSpec)
    ).show(truncate=False)
)
Any ideas on how to fix this?

We can calculate the difference between the two timestamps in seconds and retain the ones closest to zero on each side: the maximum of the negative differences gives the check immediately before the order, and the minimum of the non-negative differences gives the one immediately after. Wrapping ts_diff and status_check_time in a struct makes max/min compare on ts_diff first while still carrying the timestamp along.
from pyspark.sql import functions as func

# `data_sdf` is the question's dataframe; the (string) time columns are cast to
# timestamp first so that the subsequent cast to long (epoch seconds) works.
data_sdf. \
    withColumn('ts_diff',
               func.col('status_check_time').cast('timestamp').cast('long') -
               func.col('order_creation_time').cast('timestamp').cast('long')). \
    groupBy([k for k in data_sdf.columns if k != 'status_check_time']). \
    agg(func.max(func.when(func.col('ts_diff') < 0, func.struct('ts_diff', 'status_check_time'))).status_check_time.alias('previous_status_check_time'),
        func.min(func.when(func.col('ts_diff') >= 0, func.struct('ts_diff', 'status_check_time'))).status_check_time.alias('next_status_check_time')
        ). \
    show(truncate=False)
# +--------------------+-------------------+---------+-----------+--------------------------+-----------------------+
# |device_id           |order_creation_time|order_id |status_code|previous_status_check_time|next_status_check_time |
# +--------------------+-------------------+---------+-----------+--------------------------+-----------------------+
# |67a-05df-4ca5-af6ajn|2022-11-26 23:54:41|105785113|200        |2022-11-26 23:54:36.676   |2022-11-26 23:54:57.682|
# +--------------------+-------------------+---------+-----------+--------------------------+-----------------------+
The timestamp difference (ts_diff) step produces the following:
# +--------------------+-------------------+---------+-----------------------+-----------+-------+
# |device_id           |order_creation_time|order_id |status_check_time      |status_code|ts_diff|
# +--------------------+-------------------+---------+-----------------------+-----------+-------+
# |67a-05df-4ca5-af6ajn|2022-11-26 23:54:41|105785113|2022-11-26 23:55:33.858|200        |52     |
# |67a-05df-4ca5-af6ajn|2022-11-26 23:54:41|105785113|2022-11-26 23:55:13.1  |200        |32     |
# |67a-05df-4ca5-af6ajn|2022-11-26 23:54:41|105785113|2022-11-26 23:54:57.682|200        |16     |
# |67a-05df-4ca5-af6ajn|2022-11-26 23:54:41|105785113|2022-11-26 23:54:36.676|200        |-5     |
# |67a-05df-4ca5-af6ajn|2022-11-26 23:54:41|105785113|2022-11-26 23:54:21.293|200        |-20    |
# +--------------------+-------------------+---------+-----------------------+-----------+-------+
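For completeness, here is a minimal window-based sketch of the same idea, closer in spirit to the original lag/lead attempt. This is an assumption-based alternative (not the code above): it assumes the question's df and that the string time columns can be cast to timestamps, and it conditionally aggregates over the whole order partition instead of looking at neighbouring rows.
from pyspark.sql import Window
from pyspark.sql import functions as F

# Every row must see all status checks of its order,
# so the frame is unbounded on both sides.
w = Window.partitionBy("order_id").rowsBetween(
    Window.unboundedPreceding, Window.unboundedFollowing
)

check_ts = F.col("status_check_time").cast("timestamp")
order_ts = F.col("order_creation_time").cast("timestamp")

(
    df
    # latest check strictly before the order creation time
    .withColumn("previous_status_check_time",
                F.max(F.when(check_ts < order_ts, check_ts)).over(w))
    # earliest check at or after the order creation time
    .withColumn("next_status_check_time",
                F.min(F.when(check_ts >= order_ts, check_ts)).over(w))
    .select("device_id", "order_creation_time", "order_id",
            "previous_status_check_time", "next_status_check_time")
    .distinct()
    .show(truncate=False)
)
The conditional max/min mirrors what the groupBy solution above does, just expressed as window aggregates.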

Related

How to perform a conditional join with a time column in Spark Scala

I am looking for help in joining 2 DFs with a conditional join on time columns, using Spark Scala.
DF1:
+-------------------+--------+--------+
|time_1             |revision|review_1|
+-------------------+--------+--------+
|2022-04-05 08:32:00|1       |abc     |
|2022-04-05 10:15:00|2       |abc     |
|2022-04-05 12:15:00|3       |abc     |
|2022-04-05 09:00:00|1       |xyz     |
|2022-04-05 20:20:00|2       |xyz     |
+-------------------+--------+--------+
DF2:
+-------------------+--------+-------+
|time_2             |review_2|value  |
+-------------------+--------+-------+
|2022-04-05 08:30:00|abc     |value_1|
|2022-04-05 09:48:00|abc     |value_2|
|2022-04-05 15:40:00|abc     |value_3|
|2022-04-05 08:00:00|xyz     |value_4|
|2022-04-05 09:00:00|xyz     |value_5|
|2022-04-05 10:00:00|xyz     |value_6|
|2022-04-05 11:00:00|xyz     |value_7|
|2022-04-05 12:00:00|xyz     |value_8|
+-------------------+--------+-------+
Desired Output DF:
+-------------------+--------+--------+-------+
|time_1             |revision|review_1|value  |
+-------------------+--------+--------+-------+
|2022-04-05 08:32:00|1       |abc     |value_1|
|2022-04-05 10:15:00|2       |abc     |value_2|
|2022-04-05 12:15:00|3       |abc     |null   |
|2022-04-05 09:00:00|1       |xyz     |value_6|
|2022-04-05 20:20:00|2       |xyz     |null   |
+-------------------+--------+--------+-------+
As in row 4 of the final output (where time_1 = 2022-04-05 09:00:00): if multiple values match during the join, only the latest one (in time) should be taken.
Furthermore, if a row of DF1 has no match in the join, its value column should be null.
Here we need to join on 2 columns across the two DFs:
review_1 === review_2 &&
time_1 === time_2 (condition: time_1 should be within +1/-1 hour of time_2; if multiple records match, show the latest value, as with value_6 above)
Here is the code necessary to join the DataFrames.
I have commented the code to explain the logic.
TL;DR
import org.apache.spark.sql.expressions.Window
val SECONDS_IN_ONE_HOUR = 60 * 60
val window = Window.partitionBy("time_1").orderBy(col("time_2").desc)
df1WithEpoch
.join(df2WithEpoch,
df1WithEpoch("review_1") === df2WithEpoch("review_2")
&& (
// if `time_1` is in between `time_2` and `time_2` - 1 hour
(df1WithEpoch("epoch_time_1") >= df2WithEpoch("epoch_time_2") - SECONDS_IN_ONE_HOUR)
// if `time_1` is in between `time_2` and `time_2` + 1 hour
&& (df1WithEpoch("epoch_time_1") <= df2WithEpoch("epoch_time_2") + SECONDS_IN_ONE_HOUR)
),
// LEFT OUTER is necessary to get back `2022-04-05 12:15:00` and `2022-04-05 20:20:00` which have no join to `df2` in the time window
"left_outer"
)
.withColumn("row_num", row_number().over(window))
.filter(col("row_num") === 1)
// select only the columns we care about
.select("time_1", "revision", "review_1", "value")
// order by to give the results in the same order as in the Question
.orderBy(col("review_1"), col("revision"))
.show(false)
Full breakdown
Let's start off with your DataFrames: df1 and df2 in code:
val df1 = List(
("2022-04-05 08:32:00", 1, "abc"),
("2022-04-05 10:15:00", 2, "abc"),
("2022-04-05 12:15:00", 3, "abc"),
("2022-04-05 09:00:00", 1, "xyz"),
("2022-04-05 20:20:00", 2, "xyz")
).toDF("time_1", "revision", "review_1")
df1.show(false)
gives:
+-------------------+--------+--------+
|time_1 |revision|review_1|
+-------------------+--------+--------+
|2022-04-05 08:32:00|1 |abc |
|2022-04-05 10:15:00|2 |abc |
|2022-04-05 12:15:00|3 |abc |
|2022-04-05 09:00:00|1 |xyz |
|2022-04-05 20:20:00|2 |xyz |
+-------------------+--------+--------+
val df2 = List(
("2022-04-05 08:30:00", "abc", "value_1"),
("2022-04-05 09:48:00", "abc", "value_2"),
("2022-04-05 15:40:00", "abc", "value_3"),
("2022-04-05 08:00:00", "xyz", "value_4"),
("2022-04-05 09:00:00", "xyz", "value_5"),
("2022-04-05 10:00:00", "xyz", "value_6"),
("2022-04-05 11:00:00", "xyz", "value_7"),
("2022-04-05 12:00:00", "xyz", "value_8")
).toDF("time_2", "review_2", "value")
df2.show(false)
gives:
+-------------------+--------+-------+
|time_2 |review_2|value |
+-------------------+--------+-------+
|2022-04-05 08:30:00|abc |value_1|
|2022-04-05 09:48:00|abc |value_2|
|2022-04-05 15:40:00|abc |value_3|
|2022-04-05 08:00:00|xyz |value_4|
|2022-04-05 09:00:00|xyz |value_5|
|2022-04-05 10:00:00|xyz |value_6|
|2022-04-05 11:00:00|xyz |value_7|
|2022-04-05 12:00:00|xyz |value_8|
+-------------------+--------+-------+
Next we need new columns which we can do the date range check on (where time is represented as a single number, making math operations easy):
// add a new column, temporarily, which contains the time in
// epoch format: with this adding/subtracting an hour can easily be done.
val df1WithEpoch = df1.withColumn("epoch_time_1", unix_timestamp(col("time_1")))
val df2WithEpoch = df2.withColumn("epoch_time_2", unix_timestamp(col("time_2")))
df1WithEpoch.show()
df2WithEpoch.show()
gives:
+-------------------+--------+--------+------------+
| time_1|revision|review_1|epoch_time_1|
+-------------------+--------+--------+------------+
|2022-04-05 08:32:00| 1| abc| 1649147520|
|2022-04-05 10:15:00| 2| abc| 1649153700|
|2022-04-05 12:15:00| 3| abc| 1649160900|
|2022-04-05 09:00:00| 1| xyz| 1649149200|
|2022-04-05 20:20:00| 2| xyz| 1649190000|
+-------------------+--------+--------+------------+
+-------------------+--------+-------+------------+
| time_2|review_2| value|epoch_time_2|
+-------------------+--------+-------+------------+
|2022-04-05 08:30:00| abc|value_1| 1649147400|
|2022-04-05 09:48:00| abc|value_2| 1649152080|
|2022-04-05 15:40:00| abc|value_3| 1649173200|
|2022-04-05 08:00:00| xyz|value_4| 1649145600|
|2022-04-05 09:00:00| xyz|value_5| 1649149200|
|2022-04-05 10:00:00| xyz|value_6| 1649152800|
|2022-04-05 11:00:00| xyz|value_7| 1649156400|
|2022-04-05 12:00:00| xyz|value_8| 1649160000|
+-------------------+--------+-------+------------+
and finally to join:
import org.apache.spark.sql.expressions.Window
val SECONDS_IN_ONE_HOUR = 60 * 60
val window = Window.partitionBy("time_1").orderBy(col("time_2").desc)
df1WithEpoch
.join(df2WithEpoch,
df1WithEpoch("review_1") === df2WithEpoch("review_2")
&& (
// if `time_1` is in between `time_2` and `time_2` - 1 hour
(df1WithEpoch("epoch_time_1") >= df2WithEpoch("epoch_time_2") - SECONDS_IN_ONE_HOUR)
// if `time_1` is in between `time_2` and `time_2` + 1 hour
&& (df1WithEpoch("epoch_time_1") <= df2WithEpoch("epoch_time_2") + SECONDS_IN_ONE_HOUR)
),
// LEFT OUTER is necessary to get back `2022-04-05 12:15:00` and `2022-04-05 20:20:00` which have no join to `df2` in the time window
"left_outer"
)
.withColumn("row_num", row_number().over(window))
.filter(col("row_num") === 1)
// select only the columns we care about
.select("time_1", "revision", "review_1", "value")
// order by to give the results in the same order as in the Question
.orderBy(col("review_1"), col("revision"))
.show(false)
gives:
+-------------------+--------+--------+-------+
|time_1 |revision|review_1|value |
+-------------------+--------+--------+-------+
|2022-04-05 08:32:00|1 |abc |value_1|
|2022-04-05 10:15:00|2 |abc |value_2|
|2022-04-05 12:15:00|3 |abc |null |
|2022-04-05 09:00:00|1 |xyz |value_6|
|2022-04-05 20:20:00|2 |xyz |null |
+-------------------+--------+--------+-------+

Dividing a set of columns by their averages in PySpark

I have to divide a set of columns in a pyspark.sql.dataframe by their respective column averages, but I am not able to find a correct way to do it. Below is a sample of the data and my present code.
Input Data
columns = ["Col1", "Col2", "Col3","Name"]
data = [("1","40", "56" , "john jones"),
("2", "45", "55", "tracey smith"),
("3", "33", "23", "amy sanders")]
df = spark.createDataFrame(data=data,schema=columns)
Col1 Col2 Col3 Name
1 40 56 john jones
2 45 55 tracey smith
3 33 23 amy sanders
Expected Output
Col1 Col2 Col3 Name
0.5 1.02 1.25 john jones
1 1.14 1.23 tracey smith
1.5 0.84 0.51 amy sanders
My function as of now (not working):
#function to divide few columns by the column average and overwrite the column
def avg_scaling(df):
    # List of columns which have to be scaled by their average
    col_list = ['col1', 'col2', 'col3']
    for i in col_list:
        df = df.withcolumn(i, col(i)/df.select(f.avg(df[i])))
    return df
new_df = avg_scaling(df)
You can make use of a Window here, partitioned on a pseudo column, and compute the average over that window.
The code goes like this:
columns = ["Col1", "Col2", "Col3","Name"]
data = [("1","40", "56" , "john jones"),
("2", "45", "55", "tracey smith"),
("3", "33", "23", "amy sanders")]
df = spark.createDataFrame(data=data,schema=columns)
df.show()
+----+----+----+------------+
|Col1|Col2|Col3| Name|
+----+----+----+------------+
| 1| 40| 56| john jones|
| 2| 45| 55|tracey smith|
| 3| 33| 23| amy sanders|
+----+----+----+------------+
from pyspark.sql import Window
from pyspark.sql import functions as F

def avg_scaling(df, cols_to_scale):
    w = Window.partitionBy(F.lit(1))
    for col in cols_to_scale:
        df = df.withColumn(f"{col}", F.round(F.col(col) / F.avg(col).over(w), 2))
    return df
new_df = avg_scaling(df, ["Col1", 'Col2', 'Col3'])
new_df.show()
+----+----+----+------------+
|Col1|Col2|Col3| Name|
+----+----+----+------------+
| 0.5|1.02|1.25| john jones|
| 1.5|0.84|0.51| amy sanders|
| 1.0|1.14|1.23|tracey smith|
+----+----+----+------------+
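If pulling every row into a single window partition is a concern on larger data, a possible alternative is to compute the averages once with agg and divide by them as literals. This is only a sketch under the same assumptions as the answer above; avg_scaling_by_agg is a hypothetical helper name, not part of the original solution.
from pyspark.sql import functions as F

def avg_scaling_by_agg(df, cols_to_scale):
    # Hypothetical helper: one small aggregation row holding the average of each column to scale.
    avgs = df.agg(*[F.avg(c).alias(c) for c in cols_to_scale]).first()
    # Divide each column by its average; keep the remaining columns unchanged.
    scaled = [F.round(F.col(c) / F.lit(avgs[c]), 2).alias(c) for c in cols_to_scale]
    others = [F.col(c) for c in df.columns if c not in cols_to_scale]
    return df.select(*scaled, *others)

avg_scaling_by_agg(df, ["Col1", "Col2", "Col3"]).show()
This trades the single-partition window for one extra (small) aggregation job and avoids shuffling all rows into one partition.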

Conditional Concatenation in Spark

I have a dataframe with the below structure:
+----------+------+------+----------------+--------+------+
| date|market|metric|aggregator_Value|type |rank |
+----------+------+------+----------------+--------+------+
|2018-08-05| m1| 16 | m1|median | 1 |
|2018-08-03| m1| 5 | m1|median | 2 |
|2018-08-01| m1| 10 | m1|mean | 3 |
|2018-08-05| m2| 35 | m2|mean | 1 |
|2018-08-03| m2| 25 | m2|mean | 2 |
|2018-08-01| m2| 5 | m2|mean | 3 |
+----------+------+------+----------------+--------+------+
In this dataframe the rank column is calculated over the date ordering within groupings of the market column, like this:
val w_rank = Window.partitionBy("market").orderBy(desc("date"))
val outputDF2=outputDF1.withColumn("rank",rank().over(w_rank))
I want to extract the concatenated values of the metric column in the output dataframe for the rank = 1 rows, with the condition that if type = "median" in the rank = 1 row, then all the metric values for that market are concatenated; otherwise, if type = "mean" in the rank = 1 row, then only the previous 2 metric values are concatenated. Like this:
+----------+------+------+----------------+--------+---------+
| date|market|metric|aggregator_Value|type |result |
+----------+------+------+----------------+--------+---------+
|2018-08-05| m1| 16 | m1|median |10|5|16 |
|2018-08-05| m2| 35 | m1|mean |25|35 |
+----------+------+------+----------------+--------+---------+
How can I achieve this?
You could nullify the metric column according to the specified condition and apply collect_list followed by concat_ws to get the wanted result, as shown below:
val df = Seq(
("2018-08-05", "m1", 16, "m1", "median", 1),
("2018-08-03", "m1", 5, "m1", "median", 2),
("2018-08-01", "m1", 10, "m1", "mean", 3),
("2018-08-05", "m2", 35, "m2", "mean", 1),
("2018-08-03", "m2", 25, "m2", "mean", 2),
("2018-08-01", "m2", 5, "m2", "mean", 3)
).toDF("date", "market", "metric", "aggregator_value", "type", "rank")
val win_desc = Window.partitionBy("market").orderBy(desc("date"))
val win_asc = Window.partitionBy("market").orderBy(asc("date"))
df.
withColumn("rank1_type", first($"type").over(win_desc.rowsBetween(Window.unboundedPreceding, 0))).
withColumn("cond_metric", when($"rank1_type" === "mean" && $"rank" > 2, null).otherwise($"metric")).
withColumn("result", concat_ws("|", collect_list("cond_metric").over(win_asc))).
where($"rank" === 1).
show
// +----------+------+------+----------------+------+----+----------+-----------+-------+
// | date|market|metric|aggregator_value| type|rank|rank1_type|cond_metric| result|
// +----------+------+------+----------------+------+----+----------+-----------+-------+
// |2018-08-05| m1| 16| m1|median| 1| median| 16|10|5|16|
// |2018-08-05| m2| 35| m2| mean| 1| mean| 35| 25|35|
// +----------+------+------+----------------+------+----+----------+-----------+-------+

Combine two datasets based on value

I have the following two datasets:
val dfA = Seq(
("001", "10", "Cat"),
("001", "20", "Dog"),
("001", "30", "Bear"),
("002", "10", "Mouse"),
("002", "20", "Squirrel"),
("002", "30", "Turtle"),
).toDF("Package", "LineItem", "Animal")
val dfB = Seq(
("001", "", "X", "A"),
("001", "", "Y", "B"),
("002", "", "X", "C"),
("002", "", "Y", "D"),
("002", "20", "X" ,"E")
).toDF("Package", "LineItem", "Flag", "Category")
I need to join them with specific conditions:
a) There is always a row in dfB with the X flag and an empty LineItem, which provides the default Category for the Package from dfA
b) When a LineItem is specified in dfB, the default Category should be overwritten with the Category associated with this LineItem
Expected output:
+---------+----------+----------+----------+
| Package | LineItem | Animal | Category |
+---------+----------+----------+----------+
| 001 | 10 | Cat | A |
+---------+----------+----------+----------+
| 001 | 20 | Dog | A |
+---------+----------+----------+----------+
| 001 | 30 | Bear | A |
+---------+----------+----------+----------+
| 002 | 10 | Mouse | C |
+---------+----------+----------+----------+
| 002 | 20 | Squirrel | E |
+---------+----------+----------+----------+
| 002 | 30 | Turtle | C |
+---------+----------+----------+----------+
I spent some time on it today, but I don't have an idea of how it could be accomplished. I'd appreciate your assistance.
Thanks!
You can use two joins + a when clause:
val dfC = dfA
.join(dfB, dfB.col("Flag") === "X" && dfA.col("LineItem") === dfB.col("LineItem") && dfA.col("Package") === dfB.col("Package"))
.select(dfA.col("Package").as("priorPackage"), dfA.col("LineItem").as("priorLineItem"), dfB.col("Category").as("priorCategory"))
.as("dfC")
val dfD = dfA
.join(dfB, dfB.col("LineItem") === "" && dfB.col("Flag") === "X" && dfA.col("Package") === dfB.col("Package"), "left_outer")
.join(dfC, dfA.col("LineItem") === dfC.col("priorLineItem") && dfA.col("Package") === dfC.col("priorPackage"), "left_outer")
.select(
dfA.col("package"),
dfA.col("LineItem"),
dfA.col("Animal"),
when(dfC.col("priorCategory").isNotNull, dfC.col("priorCategory")).otherwise(dfB.col("Category")).as("Category")
)
dfD.show()
This should work for you:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
val dfA = Seq(
("001", "10", "Cat"),
("001", "20", "Dog"),
("001", "30", "Bear"),
("002", "10", "Mouse"),
("002", "20", "Squirrel"),
("002", "30", "Turtle")
).toDF("Package", "LineItem", "Animal")
val dfB = Seq(
("001", "", "X", "A"),
("001", "", "Y", "B"),
("002", "", "X", "C"),
("002", "", "Y", "D"),
("002", "20", "X" ,"E")
).toDF("Package", "LineItem", "Flag", "Category")
val result = {
dfA.as("a")
.join(dfB.where('Flag === "X").as("b"), $"a.Package" === $"b.Package" and ($"a.LineItem" === $"b.LineItem" or $"b.LineItem" === ""), "left")
.withColumn("anyRowsInGroupWithBLineItemDefined", first(when($"b.LineItem" =!= "", lit(true)), ignoreNulls = true).over(Window.partitionBy($"a.Package", $"a.LineItem")).isNotNull)
.where(!$"anyRowsInGroupWithBLineItemDefined" or ($"anyRowsInGroupWithBLineItemDefined" and $"b.LineItem" =!= ""))
.select($"a.Package", $"a.LineItem", $"a.Animal", $"b.Category")
}
result.orderBy($"a.Package", $"a.LineItem").show(false)
// +-------+--------+--------+--------+
// |Package|LineItem|Animal |Category|
// +-------+--------+--------+--------+
// |001 |10 |Cat |A |
// |001 |20 |Dog |A |
// |001 |30 |Bear |A |
// |002 |10 |Mouse |C |
// |002 |20 |Squirrel|E |
// |002 |30 |Turtle |C |
// +-------+--------+--------+--------+
The "tricky" part is calculating whether or not there are any rows with LineItem defined in dfB for a given Package, LineItem in dfA. You can see how I perform this calculation in anyRowsInGroupWithBLineItemDefined which involves the use of a window function. Other than that, it's just a normal SQL programming exercise.
I also want to note that this code should be more efficient than the other solution, as here we only shuffle the data twice (during the join and during the window function) and only read each dataset once.

How to split columns with inconsistent data in Spark

I was trying to join two dataframes and create a new column with the values of the attribute dynamically (or at least trying to do that).
I have to split the columns from formulaTable, create additional columns, and then join the result with the attribute table.
However, I am not able to split the columns dynamically properly.
I have two questions, which I have kept in bold in the following steps.
Currently, the data in my formulaTable looks like this.
val attributeFormulaDF = Seq("A0004*A0003","A0003*A0005").toDF("AttributeFormula")
So data is like
+----------------+
|AttributeFormula|
+----------------+
|A0004*A0003 |
|A0003*A0005 |
+----------------+
Attribute data is like this.
val attrValTransposedDF = Seq(
("2007", 201801, "DEL", "A0003", "NA", "ATTRIB_DAY1_VALUE", "801"),
("2007", 201801, "DEL", "A0003", "NA", "ATTRIB_DAY2_VALUE", "802"),
("2007", 201801, "DEL", "A0003", "NA", "ATTRIB_DAY3_VALUE", "803"),
("2007", 201801, "DEL", "A0003", "NA", "ATTRIB_DAY4_VALUE", "804"),
("2007", 201801, "DEL", "A0003", "NA", "ATTRIB_DAY5_VALUE", "805"),
("2007", 201801, "DEL", "A0003", "NA", "ATTRIB_DAY6_VALUE", "736"),
("2007", 201801, "DEL", "A0003", "NA", "ATTRIB_DAY7_VALUE", "1007"),
("2007", 201801, "DEL", "A0004", "NA", "ATTRIB_DAY1_VALUE", "901"),
("2007", 201801, "DEL", "A0004", "NA", "ATTRIB_DAY2_VALUE", "902"),
("2007", 201801, "DEL", "A0004", "NA", "ATTRIB_DAY3_VALUE", "903"),
("2007", 201801, "DEL", "A0004", "NA", "ATTRIB_DAY4_VALUE", "904"),
("2007", 201801, "DEL", "A0004", "NA", "ATTRIB_DAY5_VALUE", "905"),
("2007", 201801, "DEL", "A0004", "NA", "ATTRIB_DAY6_VALUE", "936"),
("2007", 201801, "DEL", "A0004", "NA", "ATTRIB_DAY7_VALUE", "9007"))
.toDF("Store_Number", "Attribute_Week_Number", "Department_Code", "Attribute_Code", "Attribute_General_Value", "Day", "Value")
.select("Attribute_Code", "Day", "Value")
So data is like
+--------------+-----------------+-----+
|Attribute_Code|Day |Value|
+--------------+-----------------+-----+
|A0003 |ATTRIB_DAY1_VALUE|801 |
|A0003 |ATTRIB_DAY2_VALUE|802 |
|A0003 |ATTRIB_DAY3_VALUE|803 |
|A0003 |ATTRIB_DAY4_VALUE|804 |
|A0003 |ATTRIB_DAY5_VALUE|805 |
|A0003 |ATTRIB_DAY6_VALUE|736 |
|A0003 |ATTRIB_DAY7_VALUE|1007 |
|A0004 |ATTRIB_DAY1_VALUE|901 |
|A0004 |ATTRIB_DAY2_VALUE|902 |
|A0004 |ATTRIB_DAY3_VALUE|903 |
|A0004 |ATTRIB_DAY4_VALUE|904 |
|A0004 |ATTRIB_DAY5_VALUE|905 |
|A0004 |ATTRIB_DAY6_VALUE|936 |
|A0004 |ATTRIB_DAY7_VALUE|9007 |
+--------------+-----------------+-----+
Now I am splitting it based on *:
val firstDF = attributeFormulaDF.select("AttributeFormula")
val rowVal = firstDF.first.mkString.split("\\*").length
val columnSeq = (0 until rowVal).map(i => col("temp").getItem(i).as(s"col$i"))
val newDFWithSplitColumn = firstDF.withColumn("temp", split(col("AttributeFormula"), "\\*"))
.select(col("*") +: columnSeq :_*).drop("temp")
I have referred to this stackOverFlow post (Split 1 column into 3 columns in spark scala)
So the split columns look like this:
+----------------+-----+-----+
|AttributeFormula|col0 |col1 |
+----------------+-----+-----+
|A0004*A0003 |A0004|A0003|
|A0003*A0005 |A0003|A0005|
+----------------+-----+-----+
Question 1: if my AttributeFormula (which is just a string) can have any number of attributes, how will I split it dynamically?
e.g.:
+-----------------+
|AttributeFormula |
+-----------------+
|A0004 |
|A0004*A0003 |
|A0003*A0005 |
|A0003*A0004 |
|A0003*A0004*A0005|
+-----------------+
So I need something like this:
+-----------------+-----+-----+-----+
|AttributeFormula |col0 |col1 |col2 |
+-----------------+-----+-----+-----+
|A0004            |A0004|null |null |
|A0003*A0005      |A0003|A0005|null |
|A0003*A0004      |A0003|A0004|null |
|A0003*A0004*A0005|A0003|A0004|A0005|
+-----------------+-----+-----+-----+
Then I join the attributeFormula with the attribute values to get the formula values column.
val joinColumnCondition = newDFWithSplitColumn.columns
.withFilter(_.startsWith("col"))
.map(col(_) === attrValTransposedDF("Attribute_Code"))
// using zipWithIndex to keep the Value columns separate and to avoid an ambiguity error while joining
val dataFrameList = joinColumnCondition.zipWithIndex.map {
i =>
newDFWithSplitColumn.join(attrValTransposedDF, i._1)
.withColumnRenamed("Value", s"Value${i._2}")
.drop("Attribute_Code")
}
val combinedDataFrame = dataFrameList.reduce(_.join(_, Seq("Day","AttributeFormula"),"LEFT"))
val toBeConcatColumn = combinedDataFrame.columns.filter(_.startsWith("Value"))
combinedDataFrame
.withColumn("AttributeFormulaValues", concat_ws("*", toBeConcatColumn.map(c => col(c)): _*))
.select("Day","AttributeFormula","AttributeFormulaValues")
So my final output looks like this.
+-----------------+----------------+----------------------+
|Day |AttributeFormula|AttributeFormulaValues|
+-----------------+----------------+----------------------+
|ATTRIB_DAY7_VALUE|A0004*A0003 |9007*1007 |
|ATTRIB_DAY6_VALUE|A0004*A0003 |936*736 |
|ATTRIB_DAY5_VALUE|A0004*A0003 |905*805 |
|ATTRIB_DAY4_VALUE|A0004*A0003 |904*804 |
|ATTRIB_DAY3_VALUE|A0004*A0003 |903*803 |
|ATTRIB_DAY2_VALUE|A0004*A0003 |902*802 |
|ATTRIB_DAY1_VALUE|A0004*A0003 |901*801 |
|ATTRIB_DAY7_VALUE|A0003 |1007 |
|ATTRIB_DAY6_VALUE|A0003 |736 |
|ATTRIB_DAY5_VALUE|A0003 |805 |
|ATTRIB_DAY4_VALUE|A0003 |804 |
|ATTRIB_DAY3_VALUE|A0003 |803 |
|ATTRIB_DAY2_VALUE|A0003 |802 |
|ATTRIB_DAY1_VALUE|A0003 |801 |
+-----------------+----------------+----------------------+
This code works fine if I have only a fixed AttributeFormula (i.e. it relates to question 1).
Question 2: how can I avoid using the list of DataFrames and the reduce function?
For Question 1, here is a possible solution:
Given that you have a dataframe with formulas:
val attributeFormulaDF = Seq("A0004*A0003","A0003*A0005", "A0003*A0004*A0005").toDF("formula")
you can split it and form an array
val splitFormula = attributeFormulaDF.select(col("formula"), split(col("formula"), "\\*").as("split"))
After that you select the maximum array size
val maxSize = splitFormula.select(max(size(col("split")))).first().getInt(0)
Now the interesting part: based on the max size you can start generating columns, each one associated with an element of the split array:
val enhancedFormula = (0 until(maxSize)).foldLeft(splitFormula)( (df, i) => {
df.withColumn(s"col_${i}", expr(s"split[${i}]"))
})
Here is the output
+-----------------+--------------------+-----+-----+-----+
| formula| split|col_0|col_1|col_2|
+-----------------+--------------------+-----+-----+-----+
| A0004*A0003| [A0004, A0003]|A0004|A0003| null|
| A0003*A0005| [A0003, A0005]|A0003|A0005| null|
|A0003*A0004*A0005|[A0003, A0004, A0...|A0003|A0004|A0005|
+-----------------+--------------------+-----+-----+-----+
I think this can easily be reused for question 2.