How to iterate over pairs in a column in Scala - scala

I have a data frame like this, imported from a parquet file:
| Store_id | Date_d_id |
| 0 | 23-07-2017 |
| 0 | 26-07-2017 |
| 0 | 01-08-2017 |
| 0 | 25-08-2017 |
| 1 | 01-01-2016 |
| 1 | 04-01-2016 |
| 1 | 10-01-2016 |
What I am trying to achieve next is to loop through each customer's dates in pairs and get the day difference between consecutive dates. Here is what it should look like:
| Store_id | Date_d_id | Day_diff |
| 0 | 23-07-2017 | null |
| 0 | 26-07-2017 | 3 |
| 0 | 01-08-2017 | 6 |
| 0 | 25-08-2017 | 24 |
| 1 | 01-01-2016 | null |
| 1 | 04-01-2016 | 3 |
| 1 | 10-01-2016 | 6 |
And finally, I would like to reduce the data frame to the average day difference per customer:
| Store_id | avg_diff |
| 0 | 7.75 |
| 1 | 3 |
I am very new to Scala and I don't even know where to start. Any help is highly appreciated! Thanks in advance.
Also, I am using a Zeppelin notebook.

One approach would be to use lag(Date) over a Window partition and a UDF to calculate the difference in days between consecutive rows, followed by grouping the DataFrame for the average difference in days. Note that Date_d_id is converted to yyyy-MM-dd format for proper String ordering within the Window partitions:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val df = Seq(
  (0, "23-07-2017"),
  (0, "26-07-2017"),
  (0, "01-08-2017"),
  (0, "25-08-2017"),
  (1, "01-01-2016"),
  (1, "04-01-2016"),
  (1, "10-01-2016")
).toDF("Store_id", "Date_d_id")

// UDF returning the day difference between two yyyy-MM-dd date strings
def daysDiff = udf(
  (d1: String, d2: String) => {
    import java.time.LocalDate
    import java.time.temporal.ChronoUnit.DAYS
    DAYS.between(LocalDate.parse(d1), LocalDate.parse(d2))
  }
)

val df2 = df.
  withColumn("Date_ymd",
    regexp_replace($"Date_d_id", """(\d+)-(\d+)-(\d+)""", "$3-$2-$1")).
  withColumn("Prior_date_ymd",
    lag("Date_ymd", 1).over(Window.partitionBy("Store_id").orderBy("Date_ymd"))).
  withColumn("Days_diff",
    when($"Prior_date_ymd".isNotNull, daysDiff($"Prior_date_ymd", $"Date_ymd")).
    otherwise(0L))
df2.show
// +--------+----------+----------+--------------+---------+
// |Store_id| Date_d_id| Date_ymd|Prior_date_ymd|Days_diff|
// +--------+----------+----------+--------------+---------+
// | 1|01-01-2016|2016-01-01| null| 0|
// | 1|04-01-2016|2016-01-04| 2016-01-01| 3|
// | 1|10-01-2016|2016-01-10| 2016-01-04| 6|
// | 0|23-07-2017|2017-07-23| null| 0|
// | 0|26-07-2017|2017-07-26| 2017-07-23| 3|
// | 0|01-08-2017|2017-08-01| 2017-07-26| 6|
// | 0|25-08-2017|2017-08-25| 2017-08-01| 24|
// +--------+----------+----------+--------------+---------+
val resultDF = df2.groupBy("Store_id").agg(avg("Days_diff").as("Avg_diff"))
resultDF.show
// +--------+--------+
// |Store_id|Avg_diff|
// +--------+--------+
// | 1| 3.0|
// | 0| 8.25|
// +--------+--------+
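For reference, the same result can be reached without a UDF or the string reshuffling by parsing the column into a real date and letting datediff do the work. A minimal sketch, assuming Spark 2.2+ for the two-argument to_date (names like Date_parsed and byStore are just illustrative):

val byStore = Window.partitionBy("Store_id").orderBy("Date_parsed")

val df3 = df.
  withColumn("Date_parsed", to_date($"Date_d_id", "dd-MM-yyyy")).
  withColumn("Days_diff", datediff($"Date_parsed", lag("Date_parsed", 1).over(byStore)))

// Treat the first (null) row of each store as a 0-day gap to reproduce Avg_diff above
df3.groupBy("Store_id").agg(avg(coalesce($"Days_diff", lit(0))).as("Avg_diff")).show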

You can use the lag function over a Window to get the previous date, then do some manipulation to get the final dataframe that you require.
First of all, the Date_d_id column needs to be converted to a timestamp for the sorting to work correctly:
import org.apache.spark.sql.functions._
val timestampedDf = df.withColumn("Date_d_id",
  from_unixtime(unix_timestamp($"Date_d_id", "dd-MM-yyyy")))
which should give you a dataframe like
+--------+-------------------+
|Store_id| Date_d_id|
+--------+-------------------+
| 0|2017-07-23 00:00:00|
| 0|2017-07-26 00:00:00|
| 0|2017-08-01 00:00:00|
| 0|2017-08-25 00:00:00|
| 1|2016-01-01 00:00:00|
| 1|2016-01-04 00:00:00|
| 1|2016-01-10 00:00:00|
+--------+-------------------+
Then you can apply the lag function over the window and finally get the date difference as
import org.apache.spark.sql.expressions._
val windowSpec = Window.partitionBy("Store_id").orderBy("Date_d_id")
val laggeddf = timestampedDf.withColumn("Day_diff",
  when(lag("Date_d_id", 1).over(windowSpec).isNull, null)
    .otherwise(datediff($"Date_d_id", lag("Date_d_id", 1).over(windowSpec))))
laggeddf should be
+--------+-------------------+--------+
|Store_id|Date_d_id |Day_diff|
+--------+-------------------+--------+
|0 |2017-07-23 00:00:00|null |
|0 |2017-07-26 00:00:00|3 |
|0 |2017-08-01 00:00:00|6 |
|0 |2017-08-25 00:00:00|24 |
|1 |2016-01-01 00:00:00|null |
|1 |2016-01-04 00:00:00|3 |
|1 |2016-01-10 00:00:00|6 |
+--------+-------------------+--------+
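As a side note, datediff already returns null whenever either of its arguments is null, so the when/isNull guard above is optional; a minimal equivalent, reusing the same windowSpec:

val laggeddf2 = timestampedDf.withColumn("Day_diff",
  datediff($"Date_d_id", lag("Date_d_id", 1).over(windowSpec)))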
Now the final step is to use groupBy and aggregation to find the average:
laggeddf.groupBy("Store_id")
.agg(avg("Day_diff").as("avg_diff"))
which should give you
+--------+--------+
|Store_id|avg_diff|
+--------+--------+
| 0| 11.0|
| 1| 4.5|
+--------+--------+
Now if you want the rows with a null Day_diff to still count toward the denominator (treating them as a 0-day gap in the average), you can do
laggeddf.groupBy("Store_id")
.agg((sum("Day_diff")/count($"Day_diff".isNotNull)).as("avg_diff"))
which should give you
+--------+--------+
|Store_id|avg_diff|
+--------+--------+
| 0| 8.25|
| 1| 3.0|
+--------+--------+
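Equivalently, you could coalesce the null gaps to 0 before averaging; a small sketch on the same laggeddf, which should give the same 8.25 / 3.0 figures:

laggeddf.groupBy("Store_id")
  .agg(avg(coalesce($"Day_diff", lit(0))).as("avg_diff"))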
I hope the answer is helpful

Related

Scala spark to filter out reoccurring zero values

I created a dataframe in spark with the following schema:
root
|-- user_id: string (nullable = true)
|-- rate: decimal(32,16) (nullable =true)
|-- date: timestamp (nullable =true)
|-- type: string (nullable = true)
The data in this schema looks like this:
+----------+----------+-------------+---------+
| user_id| rate |date | type |
+----------+----------+-------------+---------+
| XO_121 | 10 |2020-04-20 | A |
| XO_121 | 10 |2020-04-21 | A |
| XO_121 | 30 |2020-04-22 | A |
| XO_121 |0 |2020-04-23 | A |
| XO_121 |0 |2020-04-24 | A |
| XO_121 |0 |2020-04-25 | A |
| XO_121 |0 |2020-04-26 | A |
| XO_121 |5 |2020-04-27 | A |
| XO_121 |0 |2020-04-28 | A |
| XO_121 |0 |2020-04-29 | A |
| XO_121 |1 |2020-04-30 | A |
I want to save space, so I want to skip rows where rate is zero, keeping only the first occurrence of each run of zeros. Other duplicate rates are allowed, as you can see with 10, and the date order needs to be preserved. After applying the filter my data should look like this:
+----------+----------+-------------+---------+
| user_id| rate |date | type |
+----------+----------+-------------+---------+
| XO_121 | 10 |2020-04-20 | A |
| XO_121 | 10 |2020-04-21 | A |
| XO_121 | 30 |2020-04-22 | A |
| XO_121 |0 |2020-04-23 | A |
| XO_121 |5 |2020-04-27 | A |
| XO_121 |0 |2020-04-28 | A |
| XO_121 |1 |2020-04-30 | A |
I'm new to Spark, so I just want to find a way to filter this. I tried a rank-based approach but that didn't work. Can anybody provide a solution to this problem?
Data preparation:
val df = Seq(
  ("XO_121","10","2020-04-20"), ("XO_121","10","2020-04-21"), ("XO_121","30","2020-04-22"),
  ("XO_121","0","2020-04-23"), ("XO_121","0","2020-04-24"), ("XO_121","0","2020-04-25"),
  ("XO_121","0","2020-04-26"), ("XO_121","5","2020-04-27"), ("XO_121","0","2020-04-28"),
  ("XO_121","0","2020-04-29"), ("XO_121","1","2020-04-30")
).toDF("user_id","rate","date")
Get the previous value of rate, and filter out each record where "rate" === "0" && "previous_rate" === "0":
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val winSpec = Window.partitionBy("user_id").orderBy("date")
val finalDf = df.withColumn("previous_rate", lag("rate", 1).over(winSpec))
.filter( !($"rate" === "0" && $"previous_rate" === "0"))
.drop("previous_rate")
Output :
scala> finalDf.show
+-------+----+----------+
|user_id|rate| date|
+-------+----+----------+
| XO_121| 10|2020-04-20|
| XO_121| 10|2020-04-21|
| XO_121| 30|2020-04-22|
| XO_121| 0|2020-04-23|
| XO_121| 5|2020-04-27|
| XO_121| 0|2020-04-28|
| XO_121| 1|2020-04-30|
+-------+----+----------+
Now you can apply orderBy($"date") or orderBy($"user_id", $"date"), whichever is applicable for you.
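The data preparation above uses string-typed rates for brevity; with the original decimal(32,16) rate from the question's schema the same idea applies with a numeric comparison. A sketch under that assumption (it presumes df carries a numeric rate column):

val finalDfNumeric = df.withColumn("previous_rate", lag("rate", 1).over(winSpec))
  .filter(!($"rate" === 0 && $"previous_rate" === 0))
  .drop("previous_rate")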
You can use row_number() instead of Rank as below
from pyspark.sql import functions as F
from pyspark.sql.window import Window as W

_w = W.partitionBy("col2").orderBy("col1")
df = df.withColumn("rnk", F.row_number().over(_w))
df = df.filter(F.col("rnk") == F.lit("1"))
df.show()
+------+----+---+
| col1|col2|rnk|
+------+----+---+
|XO_121| 0| 1|
|XO_121| 10| 1|
|XO_121| 30| 1|
|XO_121| 20| 1|
|XO_121| 40| 1|
+------+----+---+
Also, you can use first() in case you know that only the value 0 is repeated
df = df.groupBy("col1","col2").agg(F.first("col2").alias("col2")).orderBy("col2")
df.show()
+------+----+----+
| col1|col2|col2|
+------+----+----+
|XO_121| 0| 0|
|XO_121| 10| 10|
|XO_121| 20| 20|
|XO_121| 30| 30|
|XO_121| 50| 50|
+------+----+----+

how to explode a spark dataframe

I exploded a nested schema but I am not getting what I want.
Before exploding, it looks like this:
df.show()
+----------+----------------------------------------------------------+
|CaseNumber|SourceId                                                  |
+----------+----------------------------------------------------------+
|         0|[{"id":"1","type":"Sku"},{"id":"22","type":"ContractID"}] |
|         1|[{"id":"3","type":"Sku"},{"id":"24","type":"ContractID"}] |
+----------+----------------------------------------------------------+
I want it to be like this:
+----------+---+----------+
|CaseNumber|Sku|ContractId|
+----------+---+----------+
|         0|  1|        22|
|         1|  3|        24|
+----------+---+----------+
Here is one way using the built-in get_json_object function:
import org.apache.spark.sql.functions.get_json_object
val df = Seq(
(0, """[{"id":"1","type":"Sku"},{"id":"22","type":"ContractID"}]"""),
(1, """[{"id":"3","type":"Sku"},{"id":"24","type":"ContractID"}]"""))
.toDF("CaseNumber", "SourceId")
df.withColumn("sku", get_json_object($"SourceId", "$[0].id").cast("int"))
.withColumn("ContractId", get_json_object($"SourceId", "$[1].id").cast("int"))
.drop("SourceId")
.show
// +----------+---+----------+
// |CaseNumber|sku|ContractId|
// +----------+---+----------+
// | 0| 1| 22|
// | 1| 3| 24|
// +----------+---+----------+
UPDATE
After our discussion we realised that the data is of type array<struct<id:string,type:string>> and not a simple string. Here is the solution for the new schema:
df.withColumn("sku", $"SourceIds".getItem(0).getField("id"))
.withColumn("ContractId", $"SourceIds".getItem(1).getField("id"))

subtract two columns with null in spark dataframe

I'm new to Spark. I have a dataframe df:
+----------+------------+-----------+
| Column1 | Column2 | Sub |
+----------+------------+-----------+
| 1 | 2 | 1 |
+----------+------------+-----------+
| 4 | null | null |
+----------+------------+-----------+
| 5 | null | null |
+----------+------------+-----------+
| 6 | 8 | 2 |
+----------+------------+-----------+
When subtracting the two columns, one column has nulls, so the resulting column also comes out null:
df.withColumn("Sub", col("Column1") - col("Column2"))
Expected output should be:
+----------+------------+-----------+
| Column1 | Column2 | Sub |
+----------+------------+-----------+
| 1 | 2 | 1 |
+----------+------------+-----------+
| 4 | null | 4 |
+----------+------------+-----------+
| 5 | null | 5 |
+----------+------------+-----------+
| 6 | 8 | 2 |
+----------+------------+-----------+
I don't want to replace Column2 with 0; it should stay null.
Can someone help me on this?
You can use the when function as
import org.apache.spark.sql.functions._
df.withColumn("Sub", when(col("Column1").isNull(), lit(0)).otherwise(col("Column1")) - when(col("Column2").isNull(), lit(0)).otherwise(col("Column2")))
You should have the final result as
+-------+-------+----+
|Column1|Column2| Sub|
+-------+-------+----+
| 1| 2|-1.0|
| 4| null| 4.0|
| 5| null| 5.0|
| 6| 8|-2.0|
+-------+-------+----+
You can coalesce nulls to zero on both columns and then do the subtraction:
val df = Seq[(Option[Int], Option[Int])](
  (Some(1), Some(2)),
  (Some(4), None),
  (Some(5), None),
  (Some(6), Some(8))
).toDF("A", "B")
df.withColumn("Sub", abs(coalesce($"A", lit(0)) - coalesce($"B", lit(0)))).show
+---+----+---+
| A| B|Sub|
+---+----+---+
| 1| 2| 1|
| 4|null| 4|
| 5|null| 5|
| 6| 8| 2|
+---+----+---+
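If you would rather reproduce the asker's expected output exactly (keep Column1 when Column2 is null, absolute difference otherwise) without coalescing to 0, a small variant of the same idea, using the A/B dataframe above:

df.withColumn("Sub", when($"B".isNull, $"A").otherwise(abs($"A" - $"B"))).show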

UDAF to merge rows that are first ordered in a Spark DataSet/DataFrame

Let's say we have a dataset/dataframe in Spark which has 3 columns:
ID, Word, Timestamp
I want to write a UDAF function where I can do something like this
df.show()
ID | Word | Timestamp
1 | I | "2017-1-1 00:01"
1 | am | "2017-1-1 00:02"
1 | Chris | "2017-1-1 00:03"
2 | I | "2017-1-1 00:01"
2 | am | "2017-1-1 00:02"
2 | Jessica | "2017-1-1 00:03"
val df_merged = df.groupBy("ID")
  .sort("ID", "Timestamp")
  .agg(custom_agg("ID", "Word", "Timestamp"))
df_merged.show
ID | Words | StartTime | EndTime |
1 | "I am Chris" | "2017-1-1 00:01" | "2017-1-1 00:03" |
1 | "I am Jessica" | "2017-1-1 00:01" | "2017-1-1 00:03" |
The question is: how can I ensure that the Words column is merged in the right order inside my UDAF?
Here is a solution with Spark 2's groupByKey (used with an untyped Dataset). The advantage of groupByKey is that you have access to the group (you get an Iterator[Row] in mapGroups):
df.groupByKey(r => r.getAs[Int]("ID"))
  .mapGroups{ case (id, rows) =>
    val sorted = rows
      .toVector
      .map(r => (r.getAs[String]("Word"), r.getAs[java.sql.Timestamp]("Timestamp")))
      .sortBy(_._2.getTime)
    (id,
     sorted.map(_._1).mkString(" "),
     sorted.map(_._2).head,
     sorted.map(_._2).last
    )
  }.toDF("ID", "Words", "StartTime", "EndTime")
Sorry, I don't use Scala; I hope you can read it. A Window function can do what you want:
import pyspark.sql.functions as f
from pyspark.sql.window import Window

df = df.withColumn('Words', f.collect_list(df['Word']).over(
    Window.partitionBy(df['ID']).orderBy('Timestamp')
          .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)))
Output:
+---+-------+-----------------+----------------+
| ID| Word| Timestamp| Words|
+---+-------+-----------------+----------------+
| 1| I|2017-1-1 00:01:00| [I, am, Chris]|
| 1| am|2017-1-1 00:02:00| [I, am, Chris]|
| 1| Chris|2017-1-1 00:03:00| [I, am, Chris]|
| 2| I|2017-1-1 00:01:00|[I, am, Jessica]|
| 2| am|2017-1-1 00:02:00|[I, am, Jessica]|
| 2|Jessica|2017-1-1 00:03:00|[I, am, Jessica]|
+---+-------+-----------------+----------------+
Then group the above data:
df = df.groupBy(df['ID'], df['Words']).agg(
f.min(df['Timestamp']).alias('StartTime'), f.max(df['Timestamp']).alias('EndTime'))
df = df.withColumn('Words', f.concat_ws(' ', df['Words']))
Output:
+---+------------+-----------------+-----------------+
| ID| Words| StartTime| EndTime|
+---+------------+-----------------+-----------------+
| 1| I am Chris|2017-1-1 00:01:00|2017-1-1 00:03:00|
| 2|I am Jessica|2017-1-1 00:01:00|2017-1-1 00:03:00|
+---+------------+-----------------+-----------------+
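Since the question is tagged Scala, here is roughly the same window-plus-collect_list idea in the Scala API; a sketch that assumes Timestamp is already a timestamp-typed (or at least sortable) column:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.partitionBy("ID").orderBy("Timestamp")
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

df.withColumn("Words", concat_ws(" ", collect_list($"Word").over(w)))
  .groupBy("ID", "Words")
  .agg(min("Timestamp").as("StartTime"), max("Timestamp").as("EndTime"))
  .show(false)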

how to output multiple (key,value) in spark map function

The format of the input data is like below:
+--------------------+-------------+--------------------+
| StudentID| Right | Wrong |
+--------------------+-------------+--------------------+
| studentNo01 | a,b,c | x,y,z |
+--------------------+-------------+--------------------+
| studentNo02 | c,d | v,w |
+--------------------+-------------+--------------------+
And the format of the output is like below:
+--------------------+---------+
| key | value|
+--------------------+---------+
| studentNo01,a | 1 |
+--------------------+---------+
| studentNo01,b | 1 |
+--------------------+---------+
| studentNo01,c | 1 |
+--------------------+---------+
| studentNo01,x | 0 |
+--------------------+---------+
| studentNo01,y | 0 |
+--------------------+---------+
| studentNo01,z | 0 |
+--------------------+---------+
| studentNo02,c | 1 |
+--------------------+---------+
| studentNo02,d | 1 |
+--------------------+---------+
| studentNo02,v | 0 |
+--------------------+---------+
| studentNo02,w | 0 |
+--------------------+---------+
Right means 1, Wrong means 0.
I want to process this data using the Spark map function or a UDF, but I don't know how to deal with it. Can you help me, please? Thank you.
Use split and explode twice and do the union
val df = List(
  ("studentNo01", "a,b,c", "x,y,z"),
  ("studentNo02", "c,d", "v,w")
).toDF("StudentID", "Right", "Wrong")
+-----------+-----+-----+
|  StudentID|Right|Wrong|
+-----------+-----+-----+
|studentNo01|a,b,c|x,y,z|
|studentNo02|  c,d|  v,w|
+-----------+-----+-----+
val pair = (
  df.select('StudentID, explode(split('Right, ",")))
    .select(concat_ws(",", 'StudentID, 'col).as("key"))
    .withColumn("value", lit(1))
).unionAll(
  df.select('StudentID, explode(split('Wrong, ",")))
    .select(concat_ws(",", 'StudentID, 'col).as("key"))
    .withColumn("value", lit(0))
)
+-------------+-----+
| key|value|
+-------------+-----+
|studentNo01,a| 1|
|studentNo01,b| 1|
|studentNo01,c| 1|
|studentNo02,c| 1|
|studentNo02,d| 1|
|studentNo01,x| 0|
|studentNo01,y| 0|
|studentNo01,z| 0|
|studentNo02,v| 0|
|studentNo02,w| 0|
+-------------+-----+
You can convert it to an RDD as follows:
val rdd = pair.rdd.map(r => (r.getString(0), r.getInt(1)))
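Since the question asks specifically about the map function, a flatMap over the rows produces the same (key, value) pairs directly; a sketch that assumes the usual import spark.implicits._ is in scope for the tuple encoder:

import spark.implicits._

// Each row fans out into one (StudentID,letter) -> 1/0 pair per letter
val pairDs = df.flatMap { row =>
  val id = row.getString(0)
  val rights = row.getString(1).split(",").map(v => (s"$id,$v", 1))
  val wrongs = row.getString(2).split(",").map(v => (s"$id,$v", 0))
  (rights ++ wrongs).toSeq
}.toDF("key", "value")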