I am trying to read a column and, if that column contains at least one "Y", fill a new column with "Y" for every row; otherwise fill it with "N". For example, when any row has a "Y", the whole HasChanged column should be "Y":
+----------+-----+----------+
|      Date|Value|HasChanged|
+----------+-----+----------+
|2020-12-14|    N|         Y|
|2020-12-14|    Y|         Y|
|2020-12-14|    N|         Y|
|2020-12-14|    N|         Y|
+----------+-----+----------+
And when no row has a "Y", HasChanged should be "N" everywhere:
+----------+-----+----------+
|      Date|Value|HasChanged|
+----------+-----+----------+
|2020-12-14|    N|         N|
|2020-12-14|    N|         N|
|2020-12-14|    N|         N|
|2020-12-14|    N|         N|
+----------+-----+----------+
I am trying with this:
val df1 = df.withColumn("HasChanged", when(col("Value") === "Y", lit("Y")).otherwise("N"))
But it only changes the rows where there is a Y, and what I want is to change the entire column. How can I do it?
You don't actually need when here; you can just use the max function, since "Y" > "N" in string ordering:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df1 = df.withColumn("HasChanged", max(col("Value")).over(Window.orderBy()))
df1.show
//+----------+-----+----------+
//| Date|Value|HasChanged|
//+----------+-----+----------+
//|2020-12-14| N| Y|
//|2020-12-14| Y| Y|
//|2020-12-14| N| Y|
//|2020-12-14| N| Y|
//+----------+-----+----------+
You need to check whether Value = "Y" in any row. You can do that by taking the maximum of the comparison boolean over a window, which is true if at least one row is true, and false only if every row is false.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df1 = df.withColumn(
  "HasChanged",
  when(max($"Value" === "Y").over(Window.orderBy()), "Y").otherwise("N")
)
df1.show
+----------+-----+----------+
| Date|Value|HasChanged|
+----------+-----+----------+
|2020-12-14| N| Y|
|2020-12-14| Y| Y|
|2020-12-14| N| Y|
|2020-12-14| N| Y|
+----------+-----+----------+
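A side note on both snippets above: a window with no partitionBy moves every row into a single partition (Spark logs a warning about this). If that matters for your data size, an equivalent approach is to aggregate the flag once and cross-join it back onto every row; a minimal sketch, assuming the same df with a string "Value" column:
// Compute the flag once with an aggregate, then attach it to every row,
// avoiding the global (unpartitioned) window.
import org.apache.spark.sql.functions._

val flag = df.agg(max(when(col("Value") === "Y", "Y").otherwise("N")).as("HasChanged"))
val df1 = df.crossJoin(broadcast(flag))
df1.show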
I am trying to find the unique rows (based on id) that have the maximum-length values in a Spark dataframe. Every column is of string type.
The dataframe is like:
+---+----+----+----+----+
| id|   A|   B|   C|   D|
+---+----+----+----+----+
|  1|toto|tata|titi|    |
|  1|toto|tata|titi|tutu|
|  2| bla| blo|    |    |
|  3|   b|   c|    |   d|
|  3|   b|   c|   a|   d|
+---+----+----+----+----+
The expectation is:
+---+----+----+----+----+
| id|   A|   B|   C|   D|
+---+----+----+----+----+
|  1|toto|tata|titi|tutu|
|  2| bla| blo|    |    |
|  3|   b|   c|   a|   d|
+---+----+----+----+----+
I can't figure out how to do this easily in Spark...
Thanks in advance
Note: this approach handles any addition or deletion of columns in the DataFrame without code changes.
The idea is to concatenate all columns except the first (id), take the length of the result, and then filter out every row except the one with the maximum length per id.
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
val output = input
  .withColumn("rowLength", length(concat(input.columns.toList.drop(1).map(col): _*)))
  .withColumn("maxLength", max($"rowLength").over(Window.partitionBy($"id")))
  .filter($"rowLength" === $"maxLength")
  .drop("rowLength", "maxLength")
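One caveat worth noting: concat returns null as soon as any of its inputs is null, which would exclude such rows from the length comparison. If your data can contain nulls rather than empty strings (an assumption, since the sample only shows blanks), concat_ws skips nulls; a sketch of the same pipeline with that change:
// Variant using concat_ws, which ignores null columns instead of nulling the whole row.
// Assumes the same `input` DataFrame with "id" as its first column.
val output = input
  .withColumn("rowLength", length(concat_ws("", input.columns.drop(1).map(col): _*)))
  .withColumn("maxLength", max($"rowLength").over(Window.partitionBy($"id")))
  .filter($"rowLength" === $"maxLength")
  .drop("rowLength", "maxLength")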
scala> df.show
+---+----+----+----+----+
| id| A| B| C| D|
+---+----+----+----+----+
| 1|toto|tata|titi| |
| 1|toto|tata|titi|tutu|
| 2| bla| blo| | |
| 3| b| c| | d|
| 3| b| c| a| d|
+---+----+----+----+----+
scala> df.groupBy("id").agg(
         concat_ws("", collect_set(col("A"))).alias("A"),
         concat_ws("", collect_set(col("B"))).alias("B"),
         concat_ws("", collect_set(col("C"))).alias("C"),
         concat_ws("", collect_set(col("D"))).alias("D")
       ).show
+---+----+----+----+----+
| id| A| B| C| D|
+---+----+----+----+----+
| 1|toto|tata|titi|tutu|
| 2| bla| blo| | |
| 3| b| c| a| d|
+---+----+----+----+----+
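If the set of columns grows or changes, the same aggregation can be built programmatically over every non-id column; a sketch, assuming "id" is the grouping column:
// Build the concat_ws/collect_set aggregation for every column except "id".
import org.apache.spark.sql.functions._

val aggExprs = df.columns.filterNot(_ == "id")
  .map(c => concat_ws("", collect_set(col(c))).alias(c))

df.groupBy("id").agg(aggExprs.head, aggExprs.tail: _*).show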
In Scala with Spark 2.4, I would like to filter the values inside the arrays of a column.
From
+---+------------+
| id| letter|
+---+------------+
| 1|[x, xxx, xx]|
| 2|[yy, y, yyy]|
+---+------------+
To
+---+-------+
| id| letter|
+---+-------+
| 1|[x, xx]|
| 2|[yy, y]|
+---+-------+
I thought of using explode + filter
val res = Seq(("1", Array("x", "xxx", "xx")), ("2", Array("yy", "y", "yyy"))).toDF("id", "letter")
res.withColumn("tmp", explode(col("letter"))).filter(length(col("tmp")) < 3).drop(col("letter")).show()
And I'm getting
+---+---+
| id|tmp|
+---+---+
| 1| x|
| 1| xx|
| 2| yy|
| 2| y|
+---+---+
How do I zip/groupBy back by id?
Or maybe there is a better, more optimised solution?
You can filter the array without explode() in Spark 2.4:
res.withColumn("letter", expr("filter(letter, x -> length(x) < 3)")).show()
Output:
+---+-------+
| id| letter|
+---+-------+
| 1|[x, xx]|
| 2|[yy, y]|
+---+-------+
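For reference, if you later move to Spark 3.0+, the same higher-order function is also exposed directly in the Scala DataFrame API (it is not available in 2.4, where the expr form above is needed); a hedged sketch:
// Spark 3.0+ only: filter() takes a Column and a lambda over array elements.
import org.apache.spark.sql.functions.{col, filter, length}

res.withColumn("letter", filter(col("letter"), x => length(x) < 3)).show()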
In Spark 2.4+, higher-order functions (filter) are the way to go; alternatively, use collect_list:
res.withColumn("tmp",explode(col("letter")))
.filter(length(col("tmp")) < 3)
.drop(col("letter"))
// aggregate back
.groupBy($"id")
.agg(collect_list($"tmp").as("letter"))
.show()
gives:
+---+-------+
| id| letter|
+---+-------+
| 1|[x, xx]|
| 2|[yy, y]|
+---+-------+
As this introduces a shuffle, it's better to use a UDF for that:
def filter_arr(maxLength: Int) = udf((arr: Seq[String]) => arr.filter(str => str.size <= maxLength))

res
  .select($"id", filter_arr(maxLength = 2)($"letter").as("letter"))
  .show()
gives:
+---+-------+
| id| letter|
+---+-------+
| 1|[x, xx]|
| 2|[yy, y]|
+---+-------+
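One caveat with the UDF route: if the array column can ever be null (an assumption, not shown in the sample), the function receives a null Seq and throws. A null-safe sketch of the same UDF:
// Null-safe version of the same UDF: returns null when the input array is null.
import org.apache.spark.sql.functions.{col, udf}

def filterArrSafe(maxLength: Int) = udf { arr: Seq[String] =>
  Option(arr).map(_.filter(_.length <= maxLength)).orNull
}

res.select(col("id"), filterArrSafe(maxLength = 2)(col("letter")).as("letter")).show()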
I want to replace existing columns with new values on a condition: if the value of another column (col_ff) is "abc", the column keeps its value, otherwise it should become null or blank.
The query I'm using:
import pyspark.sql.functions as F

for i in df.columns:
    if i[4:] != 'ff':
        new_df = df.withColumn(i, F.when(df.col_ff == "abc", df[i])
                                   .otherwise(None))
It gives the result as per the logic, but only for the last column it encounters in the loop.
df:
+------+----+-----+-------+
| col1 |col2|col3 | col_ff|
+------+----+-----+-------+
| a | a | d | abc |
| a | b | c | def |
| b | c | b | abc |
| c | d | a | def |
+------+----+-----+-------+
required output:
+------+----+-----+-------+
| col1 |col2|col3 | col_ff|
+------+----+-----+-------+
| a | a | d | abc |
| null |null|null | def |
| b | c | b | abc |
| null |null|null | def |
+------+----+-----+-------+
The problem in your code is that you're overwriting new_df with the original DataFrame df in each iteration of the loop. You can fix it by first setting new_df = df outside of the loop, and then performing the withColumn operations on new_df inside the loop.
For example, if df were the following:
df.show()
#+----+----+----+------+
#|col1|col2|col3|col_ff|
#+----+----+----+------+
#| a| a| d| abc|
#| a| b| c| def|
#| b| c| b| abc|
#| c| d| a| def|
#+----+----+----+------+
Change your code to:
import pyspark.sql.functions as F

new_df = df
for i in df.columns:
    if i[4:] != 'ff':
        new_df = new_df.withColumn(i, F.when(F.col("col_ff") == "abc", F.col(i)))
Notice here that I removed the .otherwise(None) part because when will return null by default if the condition is not met.
You could also do the same using functools.reduce:
from functools import reduce  # for python3

new_df = reduce(
    lambda df, i: df.withColumn(i, F.when(F.col("col_ff") == "abc", F.col(i))),
    [i for i in df.columns if i[4:] != "ff"],
    df
)
In both cases the result is the same:
new_df.show()
#+----+----+----+------+
#|col1|col2|col3|col_ff|
#+----+----+----+------+
#| a| a| d| abc|
#|null|null|null| def|
#| b| c| b| abc|
#|null|null|null| def|
#+----+----+----+------+
I have a data frame like this, imported from a parquet file:
| Store_id | Date_d_id |
| 0 | 23-07-2017 |
| 0 | 26-07-2017 |
| 0 | 01-08-2017 |
| 0 | 25-08-2017 |
| 1 | 01-01-2016 |
| 1 | 04-01-2016 |
| 1 | 10-01-2016 |
What I am trying to achieve next is to go through each customer's dates in pairs and get the day difference. Here is what it should look like:
| Store_id | Date_d_id | Day_diff |
| 0 | 23-07-2017 | null |
| 0 | 26-07-2017 | 3 |
| 0 | 01-08-2017 | 6 |
| 0 | 25-08-2017 | 24 |
| 1 | 01-01-2016 | null |
| 1 | 04-01-2016 | 3 |
| 1 | 10-01-2016 | 6 |
And finally, I will like to reduce the data frame to the average day difference by customer:
| Store_id | avg_diff |
| 0 | 7.75 |
| 1 | 3 |
I am very new to Scala and I don't even know where to start. Any help is highly appreciated! Thanks in advance.
Also, I am using a Zeppelin notebook.
One approach would be to use lag(Date) over a Window partition and a UDF to calculate the difference in days between consecutive rows, followed by grouping the DataFrame for the average difference in days. Note that Date_d_id is converted to yyyy-mm-dd format for proper String ordering within the Window partitions:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val df = Seq(
  (0, "23-07-2017"),
  (0, "26-07-2017"),
  (0, "01-08-2017"),
  (0, "25-08-2017"),
  (1, "01-01-2016"),
  (1, "04-01-2016"),
  (1, "10-01-2016")
).toDF("Store_id", "Date_d_id")

def daysDiff = udf((d1: String, d2: String) => {
  import java.time.LocalDate
  import java.time.temporal.ChronoUnit.DAYS
  DAYS.between(LocalDate.parse(d1), LocalDate.parse(d2))
})

val df2 = df.
  withColumn("Date_ymd",
    regexp_replace($"Date_d_id", """(\d+)-(\d+)-(\d+)""", "$3-$2-$1")).
  withColumn("Prior_date_ymd",
    lag("Date_ymd", 1).over(Window.partitionBy("Store_id").orderBy("Date_ymd"))).
  withColumn("Days_diff",
    when($"Prior_date_ymd".isNotNull, daysDiff($"Prior_date_ymd", $"Date_ymd")).
      otherwise(0L))
df2.show
// +--------+----------+----------+--------------+---------+
// |Store_id| Date_d_id| Date_ymd|Prior_date_ymd|Days_diff|
// +--------+----------+----------+--------------+---------+
// | 1|01-01-2016|2016-01-01| null| 0|
// | 1|04-01-2016|2016-01-04| 2016-01-01| 3|
// | 1|10-01-2016|2016-01-10| 2016-01-04| 6|
// | 0|23-07-2017|2017-07-23| null| 0|
// | 0|26-07-2017|2017-07-26| 2017-07-23| 3|
// | 0|01-08-2017|2017-08-01| 2017-07-26| 6|
// | 0|25-08-2017|2017-08-25| 2017-08-01| 24|
// +--------+----------+----------+--------------+---------+
val resultDF = df2.groupBy("Store_id").agg(avg("Days_diff").as("Avg_diff"))
resultDF.show
// +--------+--------+
// |Store_id|Avg_diff|
// +--------+--------+
// | 1| 3.0|
// | 0| 8.25|
// +--------+--------+
You can use the lag function over a Window to get the previous date, then do some manipulation to get the final dataframe that you require.
First of all, the Date_d_id column needs to be converted to a timestamp for the sorting to work correctly:
import org.apache.spark.sql.functions._
val timestapeddf = df.withColumn("Date_d_id", from_unixtime(unix_timestamp($"Date_d_id", "dd-MM-yyyy")))
which should give your dataframe as
+--------+-------------------+
|Store_id| Date_d_id|
+--------+-------------------+
| 0|2017-07-23 00:00:00|
| 0|2017-07-26 00:00:00|
| 0|2017-08-01 00:00:00|
| 0|2017-08-25 00:00:00|
| 1|2016-01-01 00:00:00|
| 1|2016-01-04 00:00:00|
| 1|2016-01-10 00:00:00|
+--------+-------------------+
Then you can apply the lag function over the window and finally get the date difference:
import org.apache.spark.sql.expressions._

val windowSpec = Window.partitionBy("Store_id").orderBy("Date_d_id")
val laggeddf = timestapeddf.withColumn("Day_diff",
  when(lag("Date_d_id", 1).over(windowSpec).isNull, null)
    .otherwise(datediff($"Date_d_id", lag("Date_d_id", 1).over(windowSpec))))
laggeddf should be
+--------+-------------------+--------+
|Store_id|Date_d_id |Day_diff|
+--------+-------------------+--------+
|0 |2017-07-23 00:00:00|null |
|0 |2017-07-26 00:00:00|3 |
|0 |2017-08-01 00:00:00|6 |
|0 |2017-08-25 00:00:00|24 |
|1 |2016-01-01 00:00:00|null |
|1 |2016-01-04 00:00:00|3 |
|1 |2016-01-10 00:00:00|6 |
+--------+-------------------+--------+
Now the final step is to use groupBy and aggregation to find the average:
laggeddf.groupBy("Store_id")
.agg(avg("Day_diff").as("avg_diff"))
which should give you
+--------+--------+
|Store_id|avg_diff|
+--------+--------+
| 0| 11.0|
| 1| 4.5|
+--------+--------+
The avg above ignores the rows where Day_diff is null. If you instead want to include those rows in the count (effectively treating the first difference of each store as 0), you can do
laggeddf.groupBy("Store_id")
.agg((sum("Day_diff")/count($"Day_diff".isNotNull)).as("avg_diff"))
which should give you
+--------+--------+
|Store_id|avg_diff|
+--------+--------+
| 0| 8.25|
| 1| 3.0|
+--------+--------+
I hope the answer is helpful
I have a dataframe df1:
+------+--------+--------+--------+
| Name | value1 | value2 | value3 |
+------+--------+--------+--------+
| A | 100 | null | 200 |
| B | 10000 | 300 | 10 |
| c | null | 10 | 100 |
+------+--------+--------+--------+
and a second dataframe df2:
+------+------+
| Col1 | col2 |
+------+------+
| X | 1000 |
| Y | 2002 |
| Z | 3000 |
+------+------+
I want to read the values value1, value2 and value3 from table1 and apply conditions to table2 as new columns:
cond1: when name = A, flag Y if col2 > value1, otherwise N
cond2: when name = B, flag Y if col2 > value2, otherwise N
cond3: when name = C, flag Y if col2 > value1 and col2 > value3, otherwise N
Source code:
df2.withColumn("cond1", when($"col2" > value1, lit("Y")).otherwise(lit("N")))
df2.withColumn("cond2", when($"col2" > value2, lit("Y")).otherwise(lit("N")))
df2.withColumn("cond3", when($"col2" > value1 && $"col2" > value3, lit("Y")).otherwise(lit("N")))
Expected output:
+------+------+-------+-------+-------+
| Col1 | col2 | cond1 | cond2 | cond3 |
+------+------+-------+-------+-------+
| X    | 1000 | Y     | Y     | Y     |
| Y    | 2002 | N     | Y     | Y     |
| Z    | 3000 | Y     | Y     | Y     |
+------+------+-------+-------+-------+
If I understand your question correctly, you can join the two dataframes and create the condition columns as shown below. A couple of notes:
1) With the described conditions, null in df1 is replaced with Int.MinValue for simplified integer comparison.
2) Since df1 is small, a broadcast join is used to minimize sorting/shuffling for better performance.
val df1 = Seq(
  ("A", 100, Int.MinValue, 200),
  ("B", 10000, 300, 10),
  ("C", Int.MinValue, 10, 100)
).toDF("Name", "value1", "value2", "value3")

val df2 = Seq(
  ("A", 1000),
  ("B", 2002),
  ("C", 3000),
  ("A", 5000),
  ("A", 150),
  ("B", 250),
  ("B", 12000),
  ("C", 50)
).toDF("Col1", "col2")

val df3 = df2.join(broadcast(df1), df2("Col1") === df1("Name")).select(
  df2("Col1"),
  df2("col2"),
  when(df2("col2") > df1("value1"), "Y").otherwise("N").as("cond1"),
  when(df2("col2") > df1("value2"), "Y").otherwise("N").as("cond2"),
  when(df2("col2") > df1("value1") && df2("col2") > df1("value3"), "Y").otherwise("N").as("cond3")
)
df3.show
+----+-----+-----+-----+-----+
|Col1| col2|cond1|cond2|cond3|
+----+-----+-----+-----+-----+
| A| 1000| Y| Y| Y|
| B| 2002| N| Y| N|
| C| 3000| Y| Y| Y|
| A| 5000| Y| Y| Y|
| A| 150| Y| Y| N|
| B| 250| N| N| N|
| B|12000| Y| Y| Y|
| C| 50| Y| Y| N|
+----+-----+-----+-----+-----+
You can create a rowNo column in both dataframes as below:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
val tempdf1 = df1.withColumn("rowNo", row_number().over(Window.orderBy("Name")))
val tempdf2 = df2.withColumn("rowNo", row_number().over(Window.orderBy("Col1")))
Then you can join them with the created column as below
val joinedDF = tempdf2.join(tempdf1, Seq("rowNo"), "left")
Finally, you can use select and the when function to get the final dataframe:
joinedDF.select($"Col1",
$"col2",
when($"col2">$"value1" || $"value1".isNull, "Y").otherwise("N").as("cond1"),
when($"col2">$"value2" || $"value2".isNull, "Y").otherwise("N").as("cond2"),
when(($"col2">$"value1" && $"col2">$"value3") || $"value3".isNull, "Y").otherwise("N").as("cond3"))
You should have your desired dataframe as:
+----+----+-----+-----+-----+
|Col1|col2|cond1|cond2|cond3|
+----+----+-----+-----+-----+
|X |1000|Y |Y |Y |
|Y |2002|N |Y |Y |
|Z |3000|Y |Y |Y |
+----+----+-----+-----+-----+
I hope the answer is helpful