How to split a row into multiple rows on the basis of date using Spark Scala?

I have a DataFrame that contains rows like the ones below, and I need to split this data into a month-wise series based on pa_start_date and pa_end_date, adding new period start and end date columns.
Input DataFrame df:
p_id pa_id p_st_date p_end_date pa_start_date pa_end_date
p1 pa1 2-Jan-18 5-Dec-18 2-Mar-18 8-Aug-18
p1 pa2 3-Jan-18 8-Dec-18 6-Mar-18 10-Nov-18
p1 pa3 1-Jan-17 1-Dec-17 9-Feb-17 20-Apr-17
Expected output:
p_id pa_id p_st_date p_end_date pa_start_date pa_end_date period_start_date period_end_date
p1 pa1 2-Jan-18 5-Dec-18 2-Mar-18 8-Aug-18 2-Mar-18 31-Mar-18
p1 pa1 2-Jan-18 5-Dec-18 2-Mar-18 8-Aug-18 1-Apr-18 30-Apr-18
p1 pa1 2-Jan-18 5-Dec-18 2-Mar-18 8-Aug-18 1-May-18 31-May-18
p1 pa1 2-Jan-18 5-Dec-18 2-Mar-18 8-Aug-18 1-Jun-18 30-Jun-18
p1 pa1 2-Jan-18 5-Dec-18 2-Mar-18 8-Aug-18 1-Jul-18 31-Jul-18
p1 pa1 2-Jan-18 5-Dec-18 2-Mar-18 8-Aug-18 1-Aug-18 31-Aug-18
p1 pa2 3-Jan-18 8-Dec-18 6-Mar-18 10-Nov-18 6-Mar-18 31-Mar-18
p1 pa2 3-Jan-18 8-Dec-18 6-Mar-18 10-Nov-18 1-Apr-18 30-Apr-18
p1 pa2 3-Jan-18 8-Dec-18 6-Mar-18 10-Nov-18 1-May-18 31-May-18
p1 pa2 3-Jan-18 8-Dec-18 6-Mar-18 10-Nov-18 1-Jun-18 30-Jun-18
p1 pa2 3-Jan-18 8-Dec-18 6-Mar-18 10-Nov-18 1-Jul-18 31-Jul-18
p1 pa2 3-Jan-18 8-Dec-18 6-Mar-18 10-Nov-18 1-Aug-18 31-Aug-18
p1 pa2 3-Jan-18 8-Dec-18 6-Mar-18 10-Nov-18 1-Sep-18 30-Sep-18
p1 pa2 3-Jan-18 8-Dec-18 6-Mar-18 10-Nov-18 1-Oct-18 31-Oct-18
p1 pa2 3-Jan-18 8-Dec-18 6-Mar-18 10-Nov-18 1-Nov-18 30-Nov-18
p1 pa3 1-Jan-17 1-Dec-17 9-Feb-17 20-Apr-17 9-Feb-17 28-Feb-17
p1 pa3 1-Jan-17 1-Dec-17 9-Feb-17 20-Apr-17 1-Mar-17 31-Mar-17
p1 pa3 1-Jan-17 1-Dec-17 9-Feb-17 20-Apr-17 1-Apr-17 30-Apr-17
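(For quick experimentation, the same input can also be built in memory. A small sketch using toDF in spark-shell; the DataFrame is named sample here so it does not clash with the code below.)
import spark.implicits._

// Sketch: the sample input above as an in-memory DataFrame.
val sample = Seq(
  ("p1", "pa1", "2-Jan-18", "5-Dec-18", "2-Mar-18", "8-Aug-18"),
  ("p1", "pa2", "3-Jan-18", "8-Dec-18", "6-Mar-18", "10-Nov-18"),
  ("p1", "pa3", "1-Jan-17", "1-Dec-17", "9-Feb-17", "20-Apr-17")
).toDF("p_id", "pa_id", "p_st_date", "p_end_date", "pa_start_date", "pa_end_date")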

I have solved it by creating a UDF as below.
This UDF creates an array of dates (one per month, inclusive of the start and end dates) when pa_start_date and the number of months between pa_start_date and pa_end_date are passed as parameters.
import java.sql.Date
import org.joda.time.LocalDate
import org.apache.spark.sql.functions.udf

// Builds one date string per month, starting at `d` and stepping one month at a time.
def udfFunc: ((Date, Long) => Array[String]) = {
  (d, l) => {
    var t = LocalDate.fromDateFields(d)
    val dates: Array[String] = new Array[String](l.toInt)
    for (i <- 0 until l.toInt) {
      dates(i) = t.toString("YYYY-MM-dd")
      t = t.plusMonths(1)
    }
    dates
  }
}
val my_udf = udf(udfFunc)
And the final DataFrame is created as below.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df = ss.read.format("csv").option("header", true).load(path)
  .select($"p_id", $"pa_id", $"p_st_date", $"p_end_date", $"pa_start_date", $"pa_end_date",
    my_udf(to_date(col("pa_start_date"), "dd-MMM-yy"),
      ceil(months_between(to_date(col("pa_end_date"), "dd-MMM-yy"), to_date(col("pa_start_date"), "dd-MMM-yy"))))
      .alias("udf")) // array of month dates from the UDF
  .withColumn("after_divide", explode($"udf")) // one row per generated date
  .withColumn("period_end_date", date_format(last_day($"after_divide"), "dd-MMM-yy")) // last day of that month
  .drop("udf")
  .withColumn("row_number", row_number() over (Window.partitionBy("p_id", "pa_id", "p_st_date", "p_end_date", "pa_start_date", "pa_end_date").orderBy(col("after_divide").asc))) // helper column for calculating `period_start_date` below
  .withColumn("period_start_date", date_format(when(col("row_number").isin(1), $"after_divide").otherwise(trunc($"after_divide", "month")), "dd-MMM-yy")) // first row keeps pa_start_date, the rest start on the 1st
  .drop("after_divide")
  .drop("row_number") // drop the helper columns not needed in the output
And here is the output.
+----+-----+---------+----------+-------------+-----------+---------------+-----------------+
|p_id|pa_id|p_st_date|p_end_date|pa_start_date|pa_end_date|period_end_date|period_start_date|
+----+-----+---------+----------+-------------+-----------+---------------+-----------------+
| p1| pa3| 1-Jan-17| 1-Dec-17| 9-Feb-17| 20-Apr-17| 28-Feb-17| 09-Feb-17|
| p1| pa3| 1-Jan-17| 1-Dec-17| 9-Feb-17| 20-Apr-17| 31-Mar-17| 01-Mar-17|
| p1| pa3| 1-Jan-17| 1-Dec-17| 9-Feb-17| 20-Apr-17| 30-Apr-17| 01-Apr-17|
| p1| pa2| 3-Jan-18| 8-Dec-18| 6-Mar-18| 10-Nov-18| 31-Mar-18| 06-Mar-18|
| p1| pa2| 3-Jan-18| 8-Dec-18| 6-Mar-18| 10-Nov-18| 30-Apr-18| 01-Apr-18|
| p1| pa2| 3-Jan-18| 8-Dec-18| 6-Mar-18| 10-Nov-18| 31-May-18| 01-May-18|
| p1| pa2| 3-Jan-18| 8-Dec-18| 6-Mar-18| 10-Nov-18| 30-Jun-18| 01-Jun-18|
| p1| pa2| 3-Jan-18| 8-Dec-18| 6-Mar-18| 10-Nov-18| 31-Jul-18| 01-Jul-18|
| p1| pa2| 3-Jan-18| 8-Dec-18| 6-Mar-18| 10-Nov-18| 31-Aug-18| 01-Aug-18|
| p1| pa2| 3-Jan-18| 8-Dec-18| 6-Mar-18| 10-Nov-18| 30-Sep-18| 01-Sep-18|
| p1| pa2| 3-Jan-18| 8-Dec-18| 6-Mar-18| 10-Nov-18| 31-Oct-18| 01-Oct-18|
| p1| pa2| 3-Jan-18| 8-Dec-18| 6-Mar-18| 10-Nov-18| 30-Nov-18| 01-Nov-18|
| p1| pa1| 2-Jan-18| 5-Dec-18| 2-Mar-18| 8-Aug-18| 31-Mar-18| 02-Mar-18|
| p1| pa1| 2-Jan-18| 5-Dec-18| 2-Mar-18| 8-Aug-18| 30-Apr-18| 01-Apr-18|
| p1| pa1| 2-Jan-18| 5-Dec-18| 2-Mar-18| 8-Aug-18| 31-May-18| 01-May-18|
| p1| pa1| 2-Jan-18| 5-Dec-18| 2-Mar-18| 8-Aug-18| 30-Jun-18| 01-Jun-18|
| p1| pa1| 2-Jan-18| 5-Dec-18| 2-Mar-18| 8-Aug-18| 31-Jul-18| 01-Jul-18|
| p1| pa1| 2-Jan-18| 5-Dec-18| 2-Mar-18| 8-Aug-18| 31-Aug-18| 01-Aug-18|
+----+-----+---------+----------+-------------+-----------+---------------+-----------------+
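As an aside, on Spark 2.4+ the same splitting can be sketched without a UDF using the built-in sequence function. This is an untested sketch; input stands for the raw CSV loaded with the six original columns (the name is mine):
import spark.implicits._
import org.apache.spark.sql.functions._

// Sketch (Spark 2.4+): one row per month via sequence() over month starts, no UDF needed.
val noUdf = input
  .withColumn("pa_start", to_date($"pa_start_date", "dd-MMM-yy"))
  .withColumn("pa_end", to_date($"pa_end_date", "dd-MMM-yy"))
  .withColumn("month_start",
    explode(sequence(trunc($"pa_start", "month"), trunc($"pa_end", "month"), expr("interval 1 month"))))
  // the first period starts at pa_start_date, every later one on the 1st of its month
  .withColumn("period_start_date",
    date_format(when($"month_start" < $"pa_start", $"pa_start").otherwise($"month_start"), "dd-MMM-yy"))
  .withColumn("period_end_date", date_format(last_day($"month_start"), "dd-MMM-yy"))
  .drop("pa_start", "pa_end", "month_start")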

Here is how I did it using an RDD flatMap and a helper date-generation function.
I kept the data in a file:
/tmp/pdata.csv
p_id,pa_id,p_st_date,p_end_date,pa_start_date,pa_end_date
p1,pa1,2-Jan-18,5-Dec-18,2-Mar-18,8-Aug-18
p1,pa2,3-Jan-18,8-Dec-18,6-Mar-18,10-Nov-18
p1,pa3,1-Jan-17,1-Dec-17,9-Feb-17,20-Apr-17
Spark Scala code:
import org.apache.spark.{ SparkConf, SparkContext }
import org.apache.spark.sql.functions.broadcast
import org.apache.spark.sql.types._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import scala.collection.mutable.ListBuffer
import java.util.{GregorianCalendar, Date}
import java.util.Calendar
val ipt = spark.read.format("com.databricks.spark.csv").option("header","true").option("inferSchema","true").load("/tmp/pdata.csv")
val format = new java.text.SimpleDateFormat("dd-MMM-yy")
format.format(new java.util.Date()) // quick sanity check of the date format
// For a start and end date, emit one "period_start,period_end" string per month in between.
def generateDates(startdate: Date, enddate: Date): ListBuffer[String] = {
  var dateList = new ListBuffer[String]()
  var calendar = new GregorianCalendar()
  calendar.setTime(startdate)
  val monthName = Array("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sept", "Oct", "Nov", "Dec")
  // first period: from the actual start date to the last day of that month
  dateList += (calendar.get(Calendar.DAY_OF_MONTH)) + "-" + monthName(calendar.get(Calendar.MONTH)) + "-" + (calendar.get(Calendar.YEAR)) + "," +
    (calendar.getActualMaximum(Calendar.DAY_OF_MONTH)) + "-" + monthName(calendar.get(Calendar.MONTH)) + "-" + (calendar.get(Calendar.YEAR))
  calendar.add(Calendar.MONTH, 1)
  // remaining periods: from the 1st to the last day of each following month
  while (calendar.getTime().before(enddate)) {
    dateList += "01-" + monthName(calendar.get(Calendar.MONTH)) + "-" + (calendar.get(Calendar.YEAR)) + "," +
      (calendar.getActualMaximum(Calendar.DAY_OF_MONTH)) + "-" + monthName(calendar.get(Calendar.MONTH)) + "-" + (calendar.get(Calendar.YEAR))
    calendar.add(Calendar.MONTH, 1)
  }
  dateList
}
val oo = ipt.rdd.map(x => (x(0).toString(), x(1).toString(), x(2).toString(), x(3).toString(), x(4).toString(), x(5).toString()))
oo.flatMap(pp => {
  var allDates = new ListBuffer[(String, String, String, String, String, String, String)]()
  for (x <- generateDates(format.parse(pp._5), format.parse(pp._6))) {
    allDates += ((pp._1, pp._2, pp._3, pp._4, pp._5, pp._6, x))
  }
  allDates
}).collect().foreach(println)
In the flatMap I call the helper function to build the concatenated period dates and append them, together with the original columns, to a ListBuffer.
I used the monthName array to render the month abbreviation in your output format.
The output came out as below:
(p1,pa1,2-Jan-18,5-Dec-18,2-Mar-18,8-Aug-18,2-Mar-2018,31-Mar-2018)
(p1,pa1,2-Jan-18,5-Dec-18,2-Mar-18,8-Aug-18,01-Apr-2018,30-Apr-2018)
(p1,pa1,2-Jan-18,5-Dec-18,2-Mar-18,8-Aug-18,01-May-2018,31-May-2018)
(p1,pa1,2-Jan-18,5-Dec-18,2-Mar-18,8-Aug-18,01-Jun-2018,30-Jun-2018)
(p1,pa1,2-Jan-18,5-Dec-18,2-Mar-18,8-Aug-18,01-Jul-2018,31-Jul-2018)
(p1,pa1,2-Jan-18,5-Dec-18,2-Mar-18,8-Aug-18,01-Aug-2018,31-Aug-2018)
(p1,pa2,3-Jan-18,8-Dec-18,6-Mar-18,10-Nov-18,6-Mar-2018,31-Mar-2018)
(p1,pa2,3-Jan-18,8-Dec-18,6-Mar-18,10-Nov-18,01-Apr-2018,30-Apr-2018)
(p1,pa2,3-Jan-18,8-Dec-18,6-Mar-18,10-Nov-18,01-May-2018,31-May-2018)
(p1,pa2,3-Jan-18,8-Dec-18,6-Mar-18,10-Nov-18,01-Jun-2018,30-Jun-2018)
(p1,pa2,3-Jan-18,8-Dec-18,6-Mar-18,10-Nov-18,01-Jul-2018,31-Jul-2018)
(p1,pa2,3-Jan-18,8-Dec-18,6-Mar-18,10-Nov-18,01-Aug-2018,31-Aug-2018)
(p1,pa2,3-Jan-18,8-Dec-18,6-Mar-18,10-Nov-18,01-Sept-2018,30-Sept-2018)
(p1,pa2,3-Jan-18,8-Dec-18,6-Mar-18,10-Nov-18,01-Oct-2018,31-Oct-2018)
(p1,pa2,3-Jan-18,8-Dec-18,6-Mar-18,10-Nov-18,01-Nov-2018,30-Nov-2018)
(p1,pa3,1-Jan-17,1-Dec-17,9-Feb-17,20-Apr-17,9-Feb-2017,28-Feb-2017)
(p1,pa3,1-Jan-17,1-Dec-17,9-Feb-17,20-Apr-17,01-Mar-2017,31-Mar-2017)
(p1,pa3,1-Jan-17,1-Dec-17,9-Feb-17,20-Apr-17,01-Apr-2017,30-Apr-2017)
I am happy to explain more if anyone has doubts; also, I may have read the file in a clumsy way, and that part can certainly be improved as well.
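If you need the result as a DataFrame rather than printed tuples, the flatMap output can be converted with toDF. A rough sketch; the split on "," separates the concatenated start/end pair:
import spark.implicits._

// Sketch: same flatMap, but split the "start,end" string and build a DataFrame.
val outDF = oo.flatMap { pp =>
  generateDates(format.parse(pp._5), format.parse(pp._6)).map { x =>
    val Array(ps, pe) = x.split(",")
    (pp._1, pp._2, pp._3, pp._4, pp._5, pp._6, ps, pe)
  }
}.toDF("p_id", "pa_id", "p_st_date", "p_end_date", "pa_start_date", "pa_end_date",
  "period_start_date", "period_end_date")
outDF.show(false)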

Related

How to perform conditional join with time column in spark scala

I am looking for help joining two DataFrames with a conditional join on time columns, using Spark Scala.
DF1:
time_1               revision  review_1
2022-04-05 08:32:00  1         abc
2022-04-05 10:15:00  2         abc
2022-04-05 12:15:00  3         abc
2022-04-05 09:00:00  1         xyz
2022-04-05 20:20:00  2         xyz
DF2:
time_2               review_2  value
2022-04-05 08:30:00  abc       value_1
2022-04-05 09:48:00  abc       value_2
2022-04-05 15:40:00  abc       value_3
2022-04-05 08:00:00  xyz       value_4
2022-04-05 09:00:00  xyz       value_5
2022-04-05 10:00:00  xyz       value_6
2022-04-05 11:00:00  xyz       value_7
2022-04-05 12:00:00  xyz       value_8
Desired output DF:
time_1               revision  review_1  value
2022-04-05 08:32:00  1         abc       value_1
2022-04-05 10:15:00  2         abc       value_2
2022-04-05 12:15:00  3         abc       null
2022-04-05 09:00:00  1         xyz       value_6
2022-04-05 20:20:00  2         xyz       null
As in row 4 of the final output (where time_1 = 2022-04-05 09:00:00), if multiple values match during the join, then only the latest one in time should be taken.
Furthermore, if there is no match for a row of DF1 in the join, its value column should be null.
Here we need to join on two columns across the two DataFrames:
review_1 === review_2 &&
time_1 === time_2 (condition: time_1 should be within +1/-1 hour of time_2; if multiple records match, show the latest value, as with value_6 above)
Here is the code necessary to join the DataFrames:
I have commented the code so as to explain the logic.
TL;DR
import org.apache.spark.sql.expressions.Window

val SECONDS_IN_ONE_HOUR = 60 * 60
val window = Window.partitionBy("time_1").orderBy(col("time_2").desc)

df1WithEpoch
  .join(df2WithEpoch,
    df1WithEpoch("review_1") === df2WithEpoch("review_2")
      && (
        // `time_1` is at or after `time_2` - 1 hour
        (df1WithEpoch("epoch_time_1") >= df2WithEpoch("epoch_time_2") - SECONDS_IN_ONE_HOUR)
        // and `time_1` is at or before `time_2` + 1 hour
        && (df1WithEpoch("epoch_time_1") <= df2WithEpoch("epoch_time_2") + SECONDS_IN_ONE_HOUR)
      ),
    // LEFT OUTER is necessary to get back `2022-04-05 12:15:00` and `2022-04-05 20:20:00`, which have no match in `df2` within the time window
    "left_outer"
  )
  .withColumn("row_num", row_number().over(window))
  .filter(col("row_num") === 1)
  // select only the columns we care about
  .select("time_1", "revision", "review_1", "value")
  // order the results in the same order as in the question
  .orderBy(col("review_1"), col("revision"))
  .show(false)
Full breakdown
Let's start off with your DataFrames: df1 and df2 in code:
val df1 = List(
("2022-04-05 08:32:00", 1, "abc"),
("2022-04-05 10:15:00", 2, "abc"),
("2022-04-05 12:15:00", 3, "abc"),
("2022-04-05 09:00:00", 1, "xyz"),
("2022-04-05 20:20:00", 2, "xyz")
).toDF("time_1", "revision", "review_1")
df1.show(false)
gives:
+-------------------+--------+--------+
|time_1 |revision|review_1|
+-------------------+--------+--------+
|2022-04-05 08:32:00|1 |abc |
|2022-04-05 10:15:00|2 |abc |
|2022-04-05 12:15:00|3 |abc |
|2022-04-05 09:00:00|1 |xyz |
|2022-04-05 20:20:00|2 |xyz |
+-------------------+--------+--------+
val df2 = List(
("2022-04-05 08:30:00", "abc", "value_1"),
("2022-04-05 09:48:00", "abc", "value_2"),
("2022-04-05 15:40:00", "abc", "value_3"),
("2022-04-05 08:00:00", "xyz", "value_4"),
("2022-04-05 09:00:00", "xyz", "value_5"),
("2022-04-05 10:00:00", "xyz", "value_6"),
("2022-04-05 11:00:00", "xyz", "value_7"),
("2022-04-05 12:00:00", "xyz", "value_8")
).toDF("time_2", "review_2", "value")
df2.show(false)
gives:
+-------------------+--------+-------+
|time_2 |review_2|value |
+-------------------+--------+-------+
|2022-04-05 08:30:00|abc |value_1|
|2022-04-05 09:48:00|abc |value_2|
|2022-04-05 15:40:00|abc |value_3|
|2022-04-05 08:00:00|xyz |value_4|
|2022-04-05 09:00:00|xyz |value_5|
|2022-04-05 10:00:00|xyz |value_6|
|2022-04-05 11:00:00|xyz |value_7|
|2022-04-05 12:00:00|xyz |value_8|
+-------------------+--------+-------+
Next we need new columns on which we can do the date range check (with time represented as a single number, making the math easy):
// add a new column, temporarily, which contains the time in
// epoch format: with this adding/subtracting an hour can easily be done.
val df1WithEpoch = df1.withColumn("epoch_time_1", unix_timestamp(col("time_1")))
val df2WithEpoch = df2.withColumn("epoch_time_2", unix_timestamp(col("time_2")))
df1WithEpoch.show()
df2WithEpoch.show()
gives:
+-------------------+--------+--------+------------+
| time_1|revision|review_1|epoch_time_1|
+-------------------+--------+--------+------------+
|2022-04-05 08:32:00| 1| abc| 1649147520|
|2022-04-05 10:15:00| 2| abc| 1649153700|
|2022-04-05 12:15:00| 3| abc| 1649160900|
|2022-04-05 09:00:00| 1| xyz| 1649149200|
|2022-04-05 20:20:00| 2| xyz| 1649190000|
+-------------------+--------+--------+------------+
+-------------------+--------+-------+------------+
| time_2|review_2| value|epoch_time_2|
+-------------------+--------+-------+------------+
|2022-04-05 08:30:00| abc|value_1| 1649147400|
|2022-04-05 09:48:00| abc|value_2| 1649152080|
|2022-04-05 15:40:00| abc|value_3| 1649173200|
|2022-04-05 08:00:00| xyz|value_4| 1649145600|
|2022-04-05 09:00:00| xyz|value_5| 1649149200|
|2022-04-05 10:00:00| xyz|value_6| 1649152800|
|2022-04-05 11:00:00| xyz|value_7| 1649156400|
|2022-04-05 12:00:00| xyz|value_8| 1649160000|
+-------------------+--------+-------+------------+
and finally to join:
import org.apache.spark.sql.expressions.Window

val SECONDS_IN_ONE_HOUR = 60 * 60
val window = Window.partitionBy("time_1").orderBy(col("time_2").desc)

df1WithEpoch
  .join(df2WithEpoch,
    df1WithEpoch("review_1") === df2WithEpoch("review_2")
      && (
        // `time_1` is at or after `time_2` - 1 hour
        (df1WithEpoch("epoch_time_1") >= df2WithEpoch("epoch_time_2") - SECONDS_IN_ONE_HOUR)
        // and `time_1` is at or before `time_2` + 1 hour
        && (df1WithEpoch("epoch_time_1") <= df2WithEpoch("epoch_time_2") + SECONDS_IN_ONE_HOUR)
      ),
    // LEFT OUTER is necessary to get back `2022-04-05 12:15:00` and `2022-04-05 20:20:00`, which have no match in `df2` within the time window
    "left_outer"
  )
  .withColumn("row_num", row_number().over(window))
  .filter(col("row_num") === 1)
  // select only the columns we care about
  .select("time_1", "revision", "review_1", "value")
  // order the results in the same order as in the question
  .orderBy(col("review_1"), col("revision"))
  .show(false)
gives:
+-------------------+--------+--------+-------+
|time_1 |revision|review_1|value |
+-------------------+--------+--------+-------+
|2022-04-05 08:32:00|1 |abc |value_1|
|2022-04-05 10:15:00|2 |abc |value_2|
|2022-04-05 12:15:00|3 |abc |null |
|2022-04-05 09:00:00|1 |xyz |value_6|
|2022-04-05 20:20:00|2 |xyz |null |
+-------------------+--------+--------+-------+
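For what it's worth, the ±1 hour condition can also be written directly with an interval instead of the epoch helper columns. A sketch, keeping the same window deduplication step as above and assuming time_1/time_2 are castable to timestamps:
import org.apache.spark.sql.functions._

// Sketch: express the +/- 1 hour range with interval arithmetic, no epoch columns.
val joined = df1.join(
  df2,
  df1("review_1") === df2("review_2") &&
    to_timestamp(col("time_1")).between(
      to_timestamp(col("time_2")) - expr("INTERVAL 1 HOUR"),
      to_timestamp(col("time_2")) + expr("INTERVAL 1 HOUR")),
  "left_outer"
)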

select sliced columns with fixed ones

I have the following df:
KSCHL01  VTEXT01        KWERT01  KSCHL02  VTEXT02              KWERT02  KSCHL03  VTEXT03       KWERT03  id
ZBTB     Tarif de base  4455.00  ZBFA     Brut facturé         4455.00  ZBN      Brut Négocié  3645.00  1
ZBT      Brut Tarif.    222.75   ZFIF     Remises fin d'ordre  0.00     ZMAJ     Majorations   0.00     2
I may have more than 13 columns.
I want to transform every slice of 3 columns into a row, to get this EXPECTED OUTPUT:
id KSCHL VTEXT KWERT
1 ZBTB Tarif de base 4455.00
1 ZBFA Brut facturé 4455.00
1 ZBN Brut Négocié 3645.00
2 ZBT Brut Tarif. 222.75
2 ZFIF Remises fin d'ordre 0.00
2 ZMAJ Majorations 0.00
I did this:
for (i <- 0 to df.columns.length - 4 by 3) {
  var temp = df.select(df.columns.slice(i, i + 3).map(col(_)): _*)
  val columns = temp.columns
  val regex = """[0-9]"""
  val replacingColumns = columns.map(regex.r.replaceAllIn(_, "")) // delete all digits in the column names
  val resultDF = replacingColumns.zip(columns).foldLeft(temp) { (tempdf, name) => tempdf.withColumnRenamed(name._2, name._1) }
  res = res.union(resultDF) // append the slice to the final DF
}
which gives me this:
KSCHL VTEXT KWERT
ZBTB Tarif de base 4455.00
ZBFA Brut facturé 4455.00
ZBN Brut Négocié 3645.00
ZBT Brut Tarif. 222.75
ZFIF Remises fin d'ordre 0.00
ZMAJ Majorations 0.00
How can I add the id column to every slice in order to have it as a column like in the desired output? I tried:
temp = temp.withColumn("id", df.id)
but I had this error:
error: value id in class Dataset cannot be accessed in org.apache.spark.sql.DataFrame
Thank you.
Here is how you can rewrite the code; adjust the range dynamically based on the number of columns.
val range = (1 to 3).map(r => if (r < 10) s"0$r" else s"$r")
val structQuery = $"id" +: range.map(n =>
struct($"KSCHL$n".as("KSCHL"), $"VTEXT$n".as("VTEXT"), $"KWERT$n".as("KWERT")).as(s"struct$n")
)
df.select(structQuery: _*)
.withColumn("new", explode(array(range.map(r => col(s"struct$r")): _*)))
.select("id", "new.*")
.show(false)
Output:
+---+-----+-------------------+------+
|id |KSCHL|VTEXT |KWERT |
+---+-----+-------------------+------+
|1 |ZBTB |Tarif de base |4455.0|
|1 |ZBFA |Brut facturé |4455.0|
|1 |ZBN |Brut Négocié |3645.0|
|2 |ZBT |Brut Tarif. |222.75|
|2 |ZFIF |Remises fin d'ordre|0.0 |
|2 |ZMAJ |Majorations |0.0 |
+---+-----+-------------------+------+
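For the code above, one way to derive the range dynamically instead of hard-coding (1 to 3), assuming the numbered columns always come in KSCHL/VTEXT/KWERT triples (a sketch; use dynamicRange in place of range):
// Sketch: pull the numeric suffixes out of the KSCHL* column names.
val dynamicRange = df.columns
  .filter(_.startsWith("KSCHL"))
  .map(_.stripPrefix("KSCHL"))
  .sorted
  .toSeq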
Check the code below.
scala> df.show(false)
+-------+-------------+-------+-------+-------------------+-------+-------+------------+-------+---+--------------+
|KSCHL01|VTEXT01 |KWERT01|KSCHL02|VTEXT02 |KWERT02|KSCHL03|VTEXT03 |KWERT03|id |KSCHL04 |
+-------+-------------+-------+-------+-------------------+-------+-------+------------+-------+---+--------------+
|ZBTB |Tarif de base|4455.00|ZBFA |Brut facturé |4455.00|ZBN |Brut Négocié|3645.00|1 |sample KSCHL03|
|ZBT |Brut Tarif. |222.75 |ZFIF |Remises fin d'ordre|0.00 |ZMAJ |Majorations |0.00 |2 |sample KSCHL03|
+-------+-------------+-------+-------+-------------------+-------+-------+------------+-------+---+--------------+
scala> val singleColumns = df.columns.filter(c => c.filter(_.isDigit).length == 0).map(col)
singleColumns: Array[org.apache.spark.sql.Column] = Array(id)
scala> val multipleColumns = df.columns.filter(c => c.filter(_.isDigit).length != 0).map(c => (c.filterNot(_.isDigit),c,c.filter(_.isDigit)))
multipleColumns: Array[(String, String, String)] = Array((KSCHL,KSCHL01,01), (VTEXT,VTEXT01,01), (KWERT,KWERT01,01), (KSCHL,KSCHL02,02), (VTEXT,VTEXT02,02), (KWERT,KWERT02,02), (KSCHL,KSCHL03,03), (VTEXT,VTEXT03,03), (KWERT,KWERT03,03), (KSCHL,KSCHL04,04))
scala> val distinctColumns = multipleColumns.map(_._1).distinct
distinctColumns: Array[String] = Array(KSCHL, VTEXT, KWERT)
scala> :paste
// Entering paste mode (ctrl-D to finish)
val colExpr = array(
multipleColumns
.groupBy(_._3)
.map(k => struct(
k._2.map(c => col(c._2).as(c._1)) ++
distinctColumns.filter(c => k._2.filter(_._1 == c).length == 0).map(c => lit("").as(c)) ++
singleColumns:_*
).as("data"))
.toSeq:_*
).as("array_data")
// Exiting paste mode, now interpreting.
colExpr: org.apache.spark.sql.Column = array(named_struct(NamePlaceholder(), KSCHL03 AS `KSCHL`, NamePlaceholder(), VTEXT03 AS `VTEXT`, NamePlaceholder(), KWERT03 AS `KWERT`, NamePlaceholder(), id) AS `data`, named_struct(NamePlaceholder(), KSCHL02 AS `KSCHL`, NamePlaceholder(), VTEXT02 AS `VTEXT`, NamePlaceholder(), KWERT02 AS `KWERT`, NamePlaceholder(), id) AS `data`, named_struct(NamePlaceholder(), KSCHL01 AS `KSCHL`, NamePlaceholder(), VTEXT01 AS `VTEXT`, NamePlaceholder(), KWERT01 AS `KWERT`, NamePlaceholder(), id) AS `data`, named_struct(NamePlaceholder(), KSCHL04 AS `KSCHL`, VTEXT, AS `VTEXT`, KWERT, AS `KWERT`, NamePlaceholder(), id) AS `data`) AS `array_data`
scala> :paste
// Entering paste mode (ctrl-D to finish)
val finalDF = df
.select(colExpr)
.withColumn("array_data",explode_outer($"array_data"))
.select("array_data.*")
// Exiting paste mode, now interpreting.
finalDF.show(false)
+--------------+-------------------+-------+---+
|KSCHL |VTEXT |KWERT |id |
+--------------+-------------------+-------+---+
|ZBN |Brut Négocié |3645.00|1 |
|ZBFA |Brut facturé |4455.00|1 |
|ZBTB |Tarif de base |4455.00|1 |
|sample KSCHL03| | |1 |
|ZMAJ |Majorations |0.00 |2 |
|ZFIF |Remises fin d'ordre|0.00 |2 |
|ZBT |Brut Tarif. |222.75 |2 |
|sample KSCHL03| | |2 |
+--------------+-------------------+-------+---+
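If the incomplete KSCHL04 group is not wanted in the result, those placeholder rows can be filtered out afterwards; a small sketch:
// Sketch: drop rows coming from a column group that had no VTEXT/KWERT counterpart.
finalDF.filter($"VTEXT" =!= "" || $"KWERT" =!= "").show(false)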

Spark DataFrame for-if loop taking a long time

I have a Spark DataFrame (df) of words with start and end times, and I have to convert it into sentence rows like the result shown in the answers below.
Basically, it should detect a new sentence whenever it finds a full stop (".") and start another row.
I have written the following code for this:
val spark = SparkSession.builder.appName("elasticSpark").master("local[*]").config("spark.scheduler.mode", "FAIR").getOrCreate()
val count = df.count.toInt
var emptyDF = Seq.empty[(Int, Int, String)].toDF("start_time", "end_time", "Sentences")
var b = 0
for (a <- 1 to count) {
  if (d9.select("words").head(a)(a - 1).toSeq.head == "." || a == (count - 1)) {
    val myList1 = d9.select("words").head(a).toArray.map(_.getString(0))
    val myList = d9.select("words").head(a).toArray.map(_.getString(0)).splitAt(b)._2
    val text = myList.mkString(" ")
    val end_time = d9.select("end_time").head(a)(a - 1).toSeq.head.toString.toInt
    val start_time = d9.select("start_time").head(a)(b).toSeq.head.toString.toInt
    val df1 = spark.sparkContext.parallelize(Seq(start_time)).toDF("start_time")
    val df2 = spark.sparkContext.parallelize(Seq(end_time)).toDF("end_time")
    val df3 = spark.sparkContext.parallelize(Seq(text)).toDF("Sentences")
    val df4 = df1.crossJoin(df2).crossJoin(df3)
    emptyDF = emptyDF.union(df4).toDF
    b = a
  }
}
Though it gives the correct output, it takes ages to complete the iteration, and I have 117 other DataFrames on which I need to run this.
Is there any other way to tune this code or to achieve the above operation? Any help will be deeply appreciated.
scala> import org.apache.spark.sql.expressions.Window
scala> df.show(false)
+----------+--------+--------+
|start_time|end_time|words |
+----------+--------+--------+
|132 |135 |Hi |
|135 |135 |, |
|143 |152 |I |
|151 |152 |am |
|159 |169 |working |
|194 |197 |on |
|204 |211 |hadoop |
|211 |211 |. |
|218 |222 |This |
|226 |229 |is |
|234 |239 |Spark |
|245 |249 |DF |
|253 |258 |coding |
|258 |258 |. |
|276 |276 |I |
+----------+--------+--------+
scala> val w = Window.orderBy("start_time", "end_time")
scala> df.withColumn("temp", sum(when(lag(col("words"), 1).over(w) === ".", lit(1)).otherwise(lit(0))).over(w))
.groupBy("temp").agg(min("start_time").alias("start_time"), max("end_time").alias("end_time"),concat_ws(" ",collect_list(trim(col("words")))).alias("sentenses"))
.drop("temp")
.show(false)
+----------+--------+-----------------------------+
|start_time|end_time|sentenses |
+----------+--------+-----------------------------+
|132 |211 |Hi , I am working on hadoop .|
|218 |258 |This is Spark DF coding . |
|276 |276 |I |
+----------+--------+-----------------------------+
Here is my try. You can use a window to separate the sentences by counting the number of "." occurrences in the following rows.
import org.apache.spark.sql.expressions.Window
val w = Window.orderBy("start_time").rowsBetween(Window.currentRow, Window.unboundedFollowing)
val df = Seq((132, 135, "Hi"),
(135, 135, ","),
(143, 152, "I"),
(151, 152, "am"),
(159, 169, "working"),
(194, 197, "on"),
(204, 211, "hadoop"),
(211, 211, "."),
(218, 222, "This"),
(226, 229, "is"),
(234, 239, "Spark"),
(245, 249, "DF"),
(253, 258, "coding"),
(258, 258, "."),
(276, 276, "I")).toDF("start_time", "end_time", "words")
df.withColumn("count", count(when(col("words") === ".", true)).over(w))
.groupBy("count")
.agg(min("start_time").as("start_time"), max("end_time").as("end_time"), concat_ws(" ", collect_list("words")).as("Sentences"))
.drop("count").show(false)
This will give you the result below; note that it leaves spaces between the words and the "," or "." characters:
+----------+--------+-----------------------------+
|start_time|end_time|Sentences |
+----------+--------+-----------------------------+
|132 |211 |Hi , I am working on hadoop .|
|218 |258 |This is Spark DF coding . |
|276 |276 |I |
+----------+--------+-----------------------------+
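If the stray space before "," and "." matters, it can be cleaned up on the aggregated column. A sketch, assuming the grouped result above is assigned to a val named grouped (the name is mine):
// Sketch: remove the space that concat_ws leaves before punctuation.
val cleaned = grouped.withColumn("Sentences", regexp_replace(col("Sentences"), "\\s+([.,])", "$1"))
cleaned.show(false)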
Here is my approach using a UDF without a window function.
val df = Seq((123, 245, "Hi"), (123, 245, "."), (123, 245, "Hi"), (123, 246, "I"), (123, 245, ".")).toDF("start", "end", "words")

var count = 0
var flag = false
val counterUdf = udf((dot: String) => {
  if (flag) {
    count += 1
    flag = false
  }
  if (dot == ".")
    flag = true
  count
})

val df1 = df.withColumn("counter", counterUdf(col("words")))
val df2 = df1.groupBy("counter").agg(min("start").alias("start"), max("end").alias("end"), concat_ws(" ", collect_list("words")).alias("sentence")).drop("counter")
df2.show()
+-----+---+--------+
|start|end|sentence|
+-----+---+--------+
| 123|246| Hi I .|
| 123|245| Hi .|
+-----+---+--------+
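One caveat, hedged: count and flag live in the UDF's closure, so the numbering is only predictable when the rows are processed in order within a single partition. A sketch of forcing that precondition (this would replace the df1 line above):
// Sketch: force one ordered partition so the mutable counter sees the rows in order.
val df1 = df
  .coalesce(1)
  .sortWithinPartitions("start")
  .withColumn("counter", counterUdf(col("words")))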

What is the efficient way to create Spark DataFrame in Scala with array type columns from another DataFrame that does not have an array column?

Suppose, I have the following the dataframe:
id | col1 | col2
-----------------
x | p1 | a1
-----------------
x | p2 | b1
-----------------
y | p2 | b2
-----------------
y | p2 | b3
-----------------
y | p3 | c1
The distinct values from col1, which are (p1, p2, p3), along with id, will be used as columns for the final DataFrame. Here, the id y has two col2 values (b2 and b3) for the same col1 value p2, so p2 will be treated as an array type column.
Therefore, the final dataframe will be
id | p1 | p2 | p3
--------------------------------
x | a1 | [b1] | null
--------------------------------
y | null |[b2, b3]| c1
How can I achieve the second dataframe efficiently from the first dataframe?
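(For a quick repro of the input above in spark-shell, a sketch:)
import spark.implicits._

// Sketch: the example input as an in-memory DataFrame.
val df = Seq(
  ("x", "p1", "a1"),
  ("x", "p2", "b1"),
  ("y", "p2", "b2"),
  ("y", "p2", "b3"),
  ("y", "p3", "c1")
).toDF("id", "col1", "col2")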
You are basically looking for table pivoting; for your case, group by id, pivot col1 as headers, and aggregate col2 as a list using the collect_list function:
df.groupBy("id").pivot("col1").agg(collect_list("col2")).show
+---+----+--------+----+
| id| p1| p2| p3|
+---+----+--------+----+
| x|[a1]| [b1]| []|
| y| []|[b2, b3]|[c1]|
+---+----+--------+----+
If it's guaranteed that there's at most one value in p1 and p3 for each id, you can convert those columns to String type by getting the first item of the array:
df.groupBy("id").pivot("col1").agg(collect_list("col2"))
.withColumn("p1", $"p1"(0)).withColumn("p3", $"p3"(0))
.show
+---+----+--------+----+
| id| p1| p2| p3|
+---+----+--------+----+
| x| a1| [b1]|null|
| y|null|[b2, b3]| c1|
+---+----+--------+----+
If you need to convert the column types dynamically, i.e. only use array type column types when you have to:
// get array Type columns
val arrayColumns = df.groupBy("id", "col1").agg(count("*").as("N"))
.where($"N" > 1).select("col1").distinct.collect.map(row => row.getString(0))
// arrayColumns: Array[String] = Array(p2)
// aggregate / pivot data frame
val aggDf = df.groupBy("id").pivot("col1").agg(collect_list("col2"))
// aggDf: org.apache.spark.sql.DataFrame = [id: string, p1: array<string> ... 2 more fields]
// get string columns
val stringColumns = aggDf.columns.filter(x => x != "id" && !arrayColumns.contains(x))
// use foldLeft on string columns to convert the columns to string type
stringColumns.foldLeft(aggDf)((df, x) => df.withColumn(x, col(x)(0))).show
+---+----+--------+----+
| id| p1| p2| p3|
+---+----+--------+----+
| x| a1| [b1]|null|
| y|null|[b2, b3]| c1|
+---+----+--------+----+

How to encode string values into numeric values in Spark DataFrame

I have a DataFrame with two columns:
df =
Col1 Col2
aaa bbb
ccc aaa
I want to encode String values into numeric values. I managed to do it in this way:
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
val indexer1 = new StringIndexer()
.setInputCol("Col1")
.setOutputCol("Col1Index")
.fit(df)
val indexer2 = new StringIndexer()
.setInputCol("Col2")
.setOutputCol("Col2Index")
.fit(df)
val indexed1 = indexer1.transform(df)
val indexed2 = indexer2.transform(df)
val encoder1 = new OneHotEncoder()
.setInputCol("Col1Index")
.setOutputCol("Col1Vec")
val encoder2 = new OneHotEncoder()
.setInputCol("Col2Index")
.setOutputCol("Col2Vec")
val encoded1 = encoder1.transform(indexed1)
encoded1.show()
val encoded2 = encoder2.transform(indexed2)
encoded2.show()
The problem is that aaa is encoded in different ways in two columns.
How can I encode my DataFrame in order to get the new one correctly encoded, e.g.:
df_encoded =
Col1 Col2
1 2
3 1
Train a single StringIndexer on both columns:
val df = Seq(("aaa", "bbb"), ("ccc", "aaa")).toDF("col1", "col2")
val indexer = new StringIndexer().setInputCol("col").fit(
df.select("col1").toDF("col").union(df.select("col2").toDF("col"))
)
and apply a copy of it to each column:
import org.apache.spark.ml.param.ParamMap

val result = Seq("col1", "col2").foldLeft(df) { (df, col) =>
  indexer
    .copy(new ParamMap()
      .put(indexer.inputCol, col)
      .put(indexer.outputCol, s"${col}_idx"))
    .transform(df)
}
result.show
// +----+----+--------+--------+
// |col1|col2|col1_idx|col2_idx|
// +----+----+--------+--------+
// | aaa| bbb| 0.0| 1.0|
// | ccc| aaa| 2.0| 0.0|
// +----+----+--------+--------+
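If you also want the one-hot vectors from your original snippet, the shared index columns can then be fed to OneHotEncoder. A sketch using the Spark 2.x transformer API that the question uses (in Spark 3, OneHotEncoder is an estimator and needs an extra fit step):
import org.apache.spark.ml.feature.OneHotEncoder

// Sketch (Spark 2.x API, as in the question): one-hot encode the shared indices.
val encoded = Seq("col1_idx", "col2_idx").foldLeft(result) { (df, c) =>
  new OneHotEncoder().setInputCol(c).setOutputCol(c.replace("_idx", "_vec")).transform(df)
}
encoded.show()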
You can also build your own transformer; the example below is my PySpark code.
First, train a StringIndexer model, here called clf:
sindex_pro = StringIndexer(inputCol='StringCol', outputCol='StringCol_c', stringOrderType="frequencyDesc", handleInvalid="keep").fit(province_df)
Then define a custom transformer that wraps clf:
from pyspark.sql.functions import col
from pyspark.ml import Transformer
from pyspark.sql import DataFrame

class SelfSI(Transformer):
    def __init__(self, clf, col_name):
        super(SelfSI, self).__init__()
        self.clf = clf
        self.col_name = col_name

    def rename_col(self, df, invers=False):
        or_name = 'StringCol'
        col_name = self.col_name
        if invers:
            df = df.withColumnRenamed(or_name, col_name)
            or_name = col_name + '_c'
            col_name = 'StringCol_c'
        df = df.withColumnRenamed(col_name, or_name)
        return df

    def _transform(self, df: DataFrame) -> DataFrame:
        df = self.rename_col(df)
        df = self.clf.transform(df)
        df = self.rename_col(df, invers=True)
        return df
Define the model with the column name you need to transform:
pro_si = SelfSI(sindex_pro, 'pro_name')
pro_si.transform(df_or)
# or as a pipeline (requires: from pyspark.ml import Pipeline)
model = Pipeline(stages=[pro_si, pro_si2]).fit(df_or)
model.transform(df_or)
The result looks like:
+-------------+---------+---------------+-----------+
|province_name|city_name|province_name_c|city_name_c|
+-------------+---------+---------------+-----------+
| 河北| 保定| 23.0| 18.0|
| 河北| 张家| 23.0| 213.0|
| 河北| 承德| 23.0| 126.0|
| 河北| 沧州| 23.0| 6.0|
| 河北| 廊坊| 23.0| 26.0|
| 北京| 北京| 13.0| 107.0|
| 天津| 天津| 10.0| 85.0|
| 河北| 石家| 23.0| 185.0|