How to make a transform of return type DataFrame => DataFrame which will give product of 2 column as value and column1_column2 as name - scala

Input
+-------+-------+----+-------
| id | a | b | c
+-------+-------+----+-------
| 1 | 1 | 0 | 1
+-------+-------+----+-------
output
+-------+-------+----+-------+-------+-------+----+-------
| id | a | b | c | a_b | a_c | b_c
+-------+-------+----+-------+-------+-------+----+-------
| 1 | 1 | 0 | 1 | 0 | 1 | 0
+-------+-------+----+-------+-------+-------+----+-------
basically I have a sequence of pair which contains Seq((a,b),(a,c),(b,c))
and thier values will be col(a)*col(b) , col(a)*col(c) col(b)*col(c) for new column
Like I know how to add them in dataFrame but not able to make a transform of return type DataFrame => DataFrame

Is this what you what?
Take a look at the API page. You will save yourself sometime :)
val df = Seq((1, 1, 0, 1))
.toDF("id", "a", "b", "c")
.withColumn("a_b", $"a" * $"b")
.withColumn("a_c", $"a" * $"c")
.withColumn("b_c", $"b" * $"c")
output ============
+---+---+---+---+---+---+---+
| id| a| b| c|a_b|a_c|b_c|
+---+---+---+---+---+---+---+
| 1| 1| 0| 1| 0| 1| 0|
+---+---+---+---+---+---+---+

Related

Add column elements to a Dataframe Scala Spark

I have two dataframes, and I want to add one to all row of the other one.
My dataframes are like:
id | name | rate
1 | a | 3
1 | b | 4
1 | c | 1
2 | a | 2
2 | d | 4
name
a
b
c
d
e
And I want a result like this:
id | name | rate
1 | a | 3
1 | b | 4
1 | c | 1
1 | d | null
1 | e | null
2 | a | 2
2 | b | null
2 | c | null
2 | d | 4
2 | e | null
How can I do this?
It seems it's more than a simple join.
val df = df1.select("id").distinct().crossJoin(df2).join(
df1,
Seq("name", "id"),
"left"
).orderBy("id", "name")
df.show
+----+---+----+
|name| id|rate|
+----+---+----+
| a| 1| 3|
| b| 1| 4|
| c| 1| 1|
| d| 1|null|
| e| 1|null|
| a| 2| 2|
| b| 2|null|
| c| 2|null|
| d| 2| 4|
| e| 2|null|
+----+---+----+

how to explode a spark dataframe

I exploded a nested schema but I am not getting what I want,
before exploded it looks like this:
df.show()
+----------+----------------------------------------------------------+
|CaseNumber| SourceId |
+----------+----------------------------------------------------------+
| 0 |[{"id":"1","type":"Sku"},{"id":"22","type":"ContractID"}] |
+----------|----------------------------------------------------------|
| 1 |[{"id":"3","type":"Sku"},{"id":"24","type":"ContractID"}] |
+---------------------------------------------------------------------+
I want it to be like this
+----------+-------------------+
| CaseNumber| Sku | ContractId |
+----------+-------------------+
| 0 | 1 | 22 |
+----------|------|------------|
| 1 | 3 | 24 |
+------------------------------|
Here is one way using the build-in get_json_object function:
import org.apache.spark.sql.functions.get_json_object
val df = Seq(
(0, """[{"id":"1","type":"Sku"},{"id":"22","type":"ContractID"}]"""),
(1, """[{"id":"3","type":"Sku"},{"id":"24","type":"ContractID"}]"""))
.toDF("CaseNumber", "SourceId")
df.withColumn("sku", get_json_object($"SourceId", "$[0].id").cast("int"))
.withColumn("ContractId", get_json_object($"SourceId", "$[1].id").cast("int"))
.drop("SourceId")
.show
// +----------+---+----------+
// |CaseNumber|sku|ContractId|
// +----------+---+----------+
// | 0| 1| 22|
// | 1| 3| 24|
// +----------+---+----------+
UPDATE
After our discussion we realised that the mentioned data is of array<struct<id:string,type:string>> type and not a simple string. Next is the solution for the new schema:
df.withColumn("sku", $"SourceIds".getItem(0).getField("id"))
.withColumn("ContractId", $"SourceIds".getItem(1).getField("id"))

How to iterate over pairs in a column in Scala

I have a data frame like this, imported from a parquet file:
| Store_id | Date_d_id |
| 0 | 23-07-2017 |
| 0 | 26-07-2017 |
| 0 | 01-08-2017 |
| 0 | 25-08-2017 |
| 1 | 01-01-2016 |
| 1 | 04-01-2016 |
| 1 | 10-01-2016 |
What I am trying to achieve next is to loop through each customer's date in pair and get the day difference. Here is what it should look like:
| Store_id | Date_d_id | Day_diff |
| 0 | 23-07-2017 | null |
| 0 | 26-07-2017 | 3 |
| 0 | 01-08-2017 | 6 |
| 0 | 25-08-2017 | 24 |
| 1 | 01-01-2016 | null |
| 1 | 04-01-2016 | 3 |
| 1 | 10-01-2016 | 6 |
And finally, I will like to reduce the data frame to the average day difference by customer:
| Store_id | avg_diff |
| 0 | 7.75 |
| 1 | 3 |
I am very new to Scala and I don't even know where to start. Any help is highly appreciated! Thanks in advance.
Also, I am using Zeppelin notebook
One approach would be to use lag(Date) over Window partition and a UDF to calculate the difference in days between consecutive rows, then follow by grouping the DataFrame for the average difference in days. Note that Date_d_id is converted to yyyy-mm-dd format for proper String ordering within the Window partitions:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val df = Seq(
(0, "23-07-2017"),
(0, "26-07-2017"),
(0, "01-08-2017"),
(0, "25-08-2017"),
(1, "01-01-2016"),
(1, "04-01-2016"),
(1, "10-01-2016")
).toDF("Store_id", "Date_d_id")
def daysDiff = udf(
(d1: String, d2: String) => {
import java.time.LocalDate
import java.time.temporal.ChronoUnit.DAYS
DAYS.between(LocalDate.parse(d1), LocalDate.parse(d2))
}
)
val df2 = df.
withColumn( "Date_ymd",
regexp_replace($"Date_d_id", """(\d+)-(\d+)-(\d+)""", "$3-$2-$1")).
withColumn( "Prior_date_ymd",
lag("Date_ymd", 1).over(Window.partitionBy("Store_id").orderBy("Date_ymd"))).
withColumn( "Days_diff",
when($"Prior_date_ymd".isNotNull, daysDiff($"Prior_date_ymd", $"Date_ymd")).
otherwise(0L))
df2.show
// +--------+----------+----------+--------------+---------+
// |Store_id| Date_d_id| Date_ymd|Prior_date_ymd|Days_diff|
// +--------+----------+----------+--------------+---------+
// | 1|01-01-2016|2016-01-01| null| 0|
// | 1|04-01-2016|2016-01-04| 2016-01-01| 3|
// | 1|10-01-2016|2016-01-10| 2016-01-04| 6|
// | 0|23-07-2017|2017-07-23| null| 0|
// | 0|26-07-2017|2017-07-26| 2017-07-23| 3|
// | 0|01-08-2017|2017-08-01| 2017-07-26| 6|
// | 0|25-08-2017|2017-08-25| 2017-08-01| 24|
// +--------+----------+----------+--------------+---------+
val resultDF = df2.groupBy("Store_id").agg(avg("Days_diff").as("Avg_diff"))
resultDF.show
// +--------+--------+
// |Store_id|Avg_diff|
// +--------+--------+
// | 1| 3.0|
// | 0| 8.25|
// +--------+--------+
You can use lag function to get the previous date over Window function, then do some manipulation to get the final dataframe that you require
first of all the Date_d_id column need to be converted to include timestamp for sorting to work correctly
import org.apache.spark.sql.functions._
val timestapeddf = df.withColumn("Date_d_id", from_unixtime(unix_timestamp($"Date_d_id", "dd-MM-yyyy")))
which should give your dataframe as
+--------+-------------------+
|Store_id| Date_d_id|
+--------+-------------------+
| 0|2017-07-23 00:00:00|
| 0|2017-07-26 00:00:00|
| 0|2017-08-01 00:00:00|
| 0|2017-08-25 00:00:00|
| 1|2016-01-01 00:00:00|
| 1|2016-01-04 00:00:00|
| 1|2016-01-10 00:00:00|
+--------+-------------------+
then you can apply the lag function over window function and finally get the date difference as
import org.apache.spark.sql.expressions._
val windowSpec = Window.partitionBy("Store_id").orderBy("Date_d_id")
val laggeddf = timestapeddf.withColumn("Day_diff", when(lag("Date_d_id", 1).over(windowSpec).isNull, null).otherwise(datediff($"Date_d_id", lag("Date_d_id", 1).over(windowSpec))))
laggeddf should be
+--------+-------------------+--------+
|Store_id|Date_d_id |Day_diff|
+--------+-------------------+--------+
|0 |2017-07-23 00:00:00|null |
|0 |2017-07-26 00:00:00|3 |
|0 |2017-08-01 00:00:00|6 |
|0 |2017-08-25 00:00:00|24 |
|1 |2016-01-01 00:00:00|null |
|1 |2016-01-04 00:00:00|3 |
|1 |2016-01-10 00:00:00|6 |
+--------+-------------------+--------+
now the final step is to use groupBy and aggregation to find the average
laggeddf.groupBy("Store_id")
.agg(avg("Day_diff").as("avg_diff"))
which should give you
+--------+--------+
|Store_id|avg_diff|
+--------+--------+
| 0| 11.0|
| 1| 4.5|
+--------+--------+
Now if you want to neglect the null Day_diff then you can do
laggeddf.groupBy("Store_id")
.agg((sum("Day_diff")/count($"Day_diff".isNotNull)).as("avg_diff"))
which should give you
+--------+--------+
|Store_id|avg_diff|
+--------+--------+
| 0| 8.25|
| 1| 3.0|
+--------+--------+
I hope the answer is helpful

subtract two columns with null in spark dataframe

I new to spark, I have dataframe df:
+----------+------------+-----------+
| Column1 | Column2 | Sub |
+----------+------------+-----------+
| 1 | 2 | 1 |
+----------+------------+-----------+
| 4 | null | null |
+----------+------------+-----------+
| 5 | null | null |
+----------+------------+-----------+
| 6 | 8 | 2 |
+----------+------------+-----------+
when subtracting two columns, one column has null so resulting column also resulting as null.
df.withColumn("Sub", col(A)-col(B))
Expected output should be:
+----------+------------+-----------+
| Column1 | Column2 | Sub |
+----------+------------+-----------+
| 1 | 2 | 1 |
+----------+------------+-----------+
| 4 | null | 4 |
+----------+------------+-----------+
| 5 | null | 5 |
+----------+------------+-----------+
| 6 | 8 | 2 |
+----------+------------+-----------+
I don't want to replace the column2 to replace with 0, it should be null only.
Can someone help me on this?
You can use when function as
import org.apache.spark.sql.functions._
df.withColumn("Sub", when(col("Column1").isNull(), lit(0)).otherwise(col("Column1")) - when(col("Column2").isNull(), lit(0)).otherwise(col("Column2")))
you should have final result as
+-------+-------+----+
|Column1|Column2| Sub|
+-------+-------+----+
| 1| 2|-1.0|
| 4| null| 4.0|
| 5| null| 5.0|
| 6| 8|-2.0|
+-------+-------+----+
You can coalesce nulls to zero on both columns and then do the subtraction:
val df = Seq((Some(1), Some(2)),
(Some(4), null),
(Some(5), null),
(Some(6), Some(8))
).toDF("A", "B")
df.withColumn("Sub", abs(coalesce($"A", lit(0)) - coalesce($"B", lit(0)))).show
+---+----+---+
| A| B|Sub|
+---+----+---+
| 1| 2| 1|
| 4|null| 4|
| 5|null| 5|
| 6| 8| 2|
+---+----+---+

Design a saturated sum with window functions

I have, in a pyspark dataframe, a column with values 1, -1 and 0, representing the events "startup", "shutdown" and "other" of an engine. I want to build a column with the state of the engine, with 1 when the engine is on and 0 when it's off, something like:
+---+-----+-----+
|seq|event|state|
+---+-----+-----+
| 1 | 1 | 1 |
| 2 | 0 | 1 |
| 3 | -1 | 0 |
| 4 | 0 | 0 |
| 5 | 0 | 0 |
| 6 | 1 | 1 |
| 7 | -1 | 0 |
+---+-----+-----+
If 1s and -1s are always alternated, this can easily be done with a window function such as
sum('event').over(Window.orderBy('seq'))
It can happen, though, that I have some spurious 1s or -1s, which I want to ignore if I am already in state 1 or 0, respectively. I want thus to be able to do something like:
+---+-----+-----+
|seq|event|state|
+---+-----+-----+
| 1 | 1 | 1 |
| 2 | 0 | 1 |
| 3 | 1 | 1 |
| 4 | -1 | 0 |
| 5 | 0 | 0 |
| 6 | -1 | 0 |
| 7 | 1 | 1 |
+---+-----+-----+
That would require a "saturated" sum function that never goes above 1 or below 0, or some other approach which I am not able to imagine at the moment.
Does somebody have any ideas?
You can achieve the desired result using the last function to fill-forwards the most recent state-change.
from pyspark.sql import functions as F
from pyspark.sql.window import Window
df = (spark.createDataFrame([
(1, 1),
(2, 0),
(3, -1),
(4, 0)
], ["seq", "event"]))
w = Window.orderBy('seq')
#replace zeros with nulls so they can be ignored easily.
df = df.withColumn('helperCol',F.when(df.event != 0,df.event))
#fill statechanges forward in a new column.
df = df.withColumn('state',F.last(df.helperCol,ignorenulls=True).over(w))
#replace -1 values with 0
df = df.replace(-1,0,['state'])
df.show()
This produces:
+---+-----+---------+-----+
|seq|event|helperCol|state|
+---+-----+---------+-----+
| 1| 1| 1| 1|
| 2| 0| null| 1|
| 3| -1| -1| 0|
| 4| 0| null| 0|
+---+-----+---------+-----+
helperCol need not be added to the dataframe, I have only included it to make the process more readable.