Use Window to count lines with an if condition in Scala

I hope that you might help me :-)
I have a dataframe of posted adverts.
For each advert id, I want to count the number of adverts posted by the same email in the 2 months preceding that one.
I created the dataframe below to explain things better:
import org.apache.spark.sql.functions._
import spark.implicits._

var df = sc.parallelize(Array(
  (1,  "2017-06-29 10:53:53.0", "boulanger.fr", "2017-06-28", "2017-04-29"),
  (2,  "2017-07-05 10:48:57.0", "patissier.fr", "2017-07-04", "2017-05-05"),
  (3,  "2017-06-28 10:31:42.0", "boulanger.fr", "2017-08-16", "2017-06-17"),
  (4,  "2017-08-21 17:31:12.0", "patissier.fr", "2017-08-20", "2017-06-21"),
  (5,  "2017-07-28 11:22:42.0", "boulanger.fr", "2017-08-22", "2017-06-23"),
  (6,  "2017-08-23 17:03:43.0", "patissier.fr", "2017-08-22", "2017-06-23"),
  (7,  "2017-08-24 16:08:07.0", "boulanger.fr", "2017-08-23", "2017-06-24"),
  (8,  "2017-08-31 17:20:43.0", "patissier.fr", "2017-08-30", "2017-06-30"),
  (9,  "2017-09-04 14:35:38.0", "boulanger.fr", "2017-09-03", "2017-07-04"),
  (10, "2017-09-07 15:10:34.0", "patissier.fr", "2017-09-06", "2017-07-07")
)).toDF("id_advert", "creation_date", "email", "date_minus1", "date_minus2m")

df = df.withColumn("date_minus1", to_date(unix_timestamp($"date_minus1", "yyyy-MM-dd").cast("timestamp")))
df = df.withColumn("date_minus2m", to_date(unix_timestamp($"date_minus2m", "yyyy-MM-dd").cast("timestamp")))
df = df.withColumn("date_creation", unix_timestamp($"creation_date", "yyyy-MM-dd HH:mm:ss").cast("timestamp"))
date_minus1 = the day before the advert was posted
date_minus2m = 2 months before the advert was posted
I want to count the number of adverts, with the same email, between those 2 dates...
What I want as a result is:
+---------+--------------+
|id_advert|nb_prev_advert|
+---------+--------------+
|        6|             2|
|        3|             3|
|        5|             3|
|        9|             2|
|        4|             1|
|        8|             3|
|        7|             3|
|       10|             3|
+---------+--------------+
I managed to do that with an awful self-join of the dataframe, but since I have millions of lines it takes almost 2 hours to run...
I am sure we can do something like:
val w = Window.partitionBy("id_advert").orderBy("creation_date").rowsBetween(-50000000, -1)
and use it to go across the dataframe, counting only the rows where
the email of the row = the email of the current row
date_minus2m of the row < creation date of the current row < date_minus1 of the row
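A minimal sketch of that idea, under two assumptions that are mine rather than the original poster's: the window is partitioned by email instead of id_advert, and the 2-month lower bound is approximated as 60 days (a range frame needs a fixed numeric offset):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Order by the creation time in seconds so rangeBetween can express
// "between 60 days before and 1 day before the current advert".
val wSketch = Window
  .partitionBy("email")
  .orderBy(unix_timestamp(col("creation_date"), "yyyy-MM-dd HH:mm:ss"))
  .rangeBetween(-60L * 24 * 60 * 60, -1L * 24 * 60 * 60)

val nbPrev = df.withColumn("nb_prev_advert", count("id_advert").over(wSketch))
nbPrev.select("id_advert", "nb_prev_advert").show()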

Adding this as a separate answer because it is different.
Input:
df.select("*").orderBy("email","creation_date").show()
+---------+--------------------+------------+----+
|id_advert| creation_date| email|sold|
+---------+--------------------+------------+----+
| 1|2015-06-29 10:53:...|boulanger.fr| 1|
| 5|2015-07-28 11:22:...|boulanger.fr| 0|
| 3|2017-06-28 10:31:...|boulanger.fr| 1|
| 7|2017-08-24 16:08:...|boulanger.fr| 1|
| 9|2017-09-04 14:35:...|boulanger.fr| 1|
| 10|2012-09-07 15:10:...|patissier.fr| 0|
| 8|2014-08-31 17:20:...|patissier.fr| 1|
| 2|2016-07-05 10:48:...|patissier.fr| 1|
| 4|2017-08-21 17:31:...|patissier.fr| 0|
| 6|2017-08-23 17:03:...|patissier.fr| 0|
+---------+--------------------+------------+----+
Now you define your window spec as something like this
val w = Window
  .partitionBy("email")
  .orderBy(col("creation_date").cast("timestamp").cast("long"))
  .rangeBetween(-60*24*60*60, -1)
And the main query will be:
df
  .select(
    col("*"),
    count("email").over(w).alias("all_prev_mail_advert"),
    sum("sold").over(w).alias("all_prev_sold_mail_advert")
  )
  .orderBy("email", "creation_date")
  .show()
Output:
+---------+--------------------+------------+----+--------------------+-------------------------+
|id_advert| creation_date| email|sold|all_prev_mail_advert|all_prev_sold_mail_advert|
+---------+--------------------+------------+----+--------------------+-------------------------+
| 1|2015-06-29 10:53:...|boulanger.fr| 1| 0| null|
| 5|2015-07-28 11:22:...|boulanger.fr| 0| 1| 1|
| 3|2017-06-28 10:31:...|boulanger.fr| 1| 0| null|
| 7|2017-08-24 16:08:...|boulanger.fr| 1| 1| 1|
| 9|2017-09-04 14:35:...|boulanger.fr| 1| 1| 1|
| 10|2012-09-07 15:10:...|patissier.fr| 0| 0| null|
| 8|2014-08-31 17:20:...|patissier.fr| 1| 0| null|
| 2|2016-07-05 10:48:...|patissier.fr| 1| 0| null|
| 4|2017-08-21 17:31:...|patissier.fr| 0| 0| null|
| 6|2017-08-23 17:03:...|patissier.fr| 0| 1| 0|
+---------+--------------------+------------+----+--------------------+-------------------------+
Explanation:
We define a window over the last two months, partitioned by email. The count over this window gives all the previous adverts for the same email.
To get all the previous sold adverts, we simply sum the sold column over the same window. As sold is 1 for a sold item, the sum gives the count of all sold items in the window.
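If the unsold count is also needed (as asked in the follow-up below), one possible sketch is to derive it from the two columns above; the coalesce is there because sum returns null over an empty window, and the column names are only illustrative:

import org.apache.spark.sql.functions._

val withCounts = df.select(
  col("*"),
  count("email").over(w).alias("all_prev_mail_advert"),
  coalesce(sum("sold").over(w), lit(0)).alias("all_prev_sold_mail_advert")
).withColumn(
  "all_prev_unsold_mail_advert",
  col("all_prev_mail_advert") - col("all_prev_sold_mail_advert")
)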

Here is the answer using a Window with a range.
Create a window spec with a range covering the past sixty days up to (but excluding) the current row:
val w = Window
  .partitionBy(col("email"))
  .orderBy(col("creation_date").cast("timestamp").cast("long"))
  .rangeBetween(-60*86400, -1)
Then select it over your data frame
df
  .select(col("*"), count("email").over(w).alias("trailing_count"))
  .orderBy("email", "creation_date") // only for display purposes
  .show()
Note: your expected output might be wrong. First, there should be at least one advert with a count of zero, because some advert has to be the starting row for each email. Also, the count for id_advert 3 seems wrong.
Input Data :
df.select("id_advert","creation_date","email").orderBy("email", "creation_date").show()
+---------+--------------------+------------+
|id_advert| creation_date| email|
+---------+--------------------+------------+
| 3|2017-06-28 10:31:...|boulanger.fr|
| 1|2017-06-29 10:53:...|boulanger.fr|
| 5|2017-07-28 11:22:...|boulanger.fr|
| 7|2017-08-24 16:08:...|boulanger.fr|
| 9|2017-09-04 14:35:...|boulanger.fr|
| 2|2017-07-05 10:48:...|patissier.fr|
| 4|2017-08-21 17:31:...|patissier.fr|
| 6|2017-08-23 17:03:...|patissier.fr|
| 8|2017-08-31 17:20:...|patissier.fr|
| 10|2017-09-07 15:10:...|patissier.fr|
+---------+--------------------+------------+
Output:
+---------+--------------------+------------+-------------+--------------+
|id_advert| creation_date| email|date_creation|trailing_count|
+---------+--------------------+------------+-------------+--------------+
| 3|2017-06-28 10:31:...|boulanger.fr| 1498645902| 0|
| 1|2017-06-29 10:53:...|boulanger.fr| 1498733633| 1|
| 5|2017-07-28 11:22:...|boulanger.fr| 1501240962| 2|
| 7|2017-08-24 16:08:...|boulanger.fr| 1503590887| 3|
| 9|2017-09-04 14:35:...|boulanger.fr| 1504535738| 2|
| 2|2017-07-05 10:48:...|patissier.fr| 1499251737| 0|
| 4|2017-08-21 17:31:...|patissier.fr| 1503336672| 1|
| 6|2017-08-23 17:03:...|patissier.fr| 1503507823| 2|
| 8|2017-08-31 17:20:...|patissier.fr| 1504200043| 3|
| 10|2017-09-07 15:10:...|patissier.fr| 1504797034| 3|
+---------+--------------------+------------+-------------+--------------+

As it is impossible to format a comment properly I will use the answer button, but this is actually more a question than an answer.
I simplified the problem, thinking that with your answer I might be able to do what I want, but I am not sure I understand your answer correctly...
How does it work? To me:
if I do .rangeBetween(-3, -1) I use a window that looks from 3 lines before the current line to one line before it. But here it seems that rangeBetween refers to the orderBy value and not to the total number of lines...???
if I do partitionBy(col("email")) I should get one line per email, but here I still get one line per id_advert...
What I really want to do is count, for each advert, the number of sold items and the number of unsold items posted by the same email in the 2 months preceding the advert's post date.
Is there an easy way to use what you did and apply it to my real issue?
My dataframe looks like this:
var df = sc.parallelize(Array(
  (1,  "2015-06-29 10:53:53.0", "boulanger.fr", 1),
  (2,  "2016-07-05 10:48:57.0", "patissier.fr", 1),
  (3,  "2017-06-28 10:31:42.0", "boulanger.fr", 1),
  (4,  "2017-08-21 17:31:12.0", "patissier.fr", 0),
  (5,  "2015-07-28 11:22:42.0", "boulanger.fr", 0),
  (6,  "2017-08-23 17:03:43.0", "patissier.fr", 0),
  (7,  "2017-08-24 16:08:07.0", "boulanger.fr", 1),
  (8,  "2014-08-31 17:20:43.0", "patissier.fr", 1),
  (9,  "2017-09-04 14:35:38.0", "boulanger.fr", 1),
  (10, "2012-09-07 15:10:34.0", "patissier.fr", 0)
)).toDF("id_advert", "creation_date", "email", "sold")
For each id_advert I would like to have 2 lines: one for the number of sold items and one for the number of unsold items...
Thank you in advance!!! If it is not possible for you to answer I will do it in a dirtier way ;-).
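One possible sketch of that reshaping (my assumption, not taken from the accepted answer): reuse the same 60-day window per email, derive the unsold count from the total and the sold count, and then unpivot with explode so that each id_advert yields one row per status. The status labels and column names are only illustrative.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Same 60-day window per email as in the answers above.
val w = Window
  .partitionBy("email")
  .orderBy(col("creation_date").cast("timestamp").cast("long"))
  .rangeBetween(-60L * 86400, -1)

val counted = df
  .withColumn("prev_total", count("email").over(w))
  .withColumn("prev_sold", coalesce(sum("sold").over(w), lit(0)))
  .withColumn("prev_unsold", col("prev_total") - col("prev_sold"))

// Unpivot: one row per (id_advert, status) pair.
val twoLines = counted.select(
  col("id_advert"),
  explode(array(
    struct(lit("sold").as("status"), col("prev_sold").as("nb_prev_advert")),
    struct(lit("unsold").as("status"), col("prev_unsold").as("nb_prev_advert"))
  )).as("s")
).select(col("id_advert"), col("s.status"), col("s.nb_prev_advert"))
twoLines.show()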

Related

Calculate number of columns with missing values per each row in PySpark

Let's say we have the following data set:
columns = ['id', 'dogs', 'cats']
values = [(1, 2, 0), (2, None, None), (3, None, 9)]
df = spark.createDataFrame(values, columns)
df.show()
+----+----+----+
| id|dogs|cats|
+----+----+----+
| 1| 2| 0|
| 2|null|null|
| 3|null| 9|
+----+----+----+
I would like to calculate the number ("miss_nb") and the percentage ("miss_pt") of columns with missing values per row, and get the following table:
+----+-------+-------+
| id|miss_nb|miss_pt|
+----+-------+-------+
| 1| 0| 0.00|
| 2| 2| 0.67|
| 3| 1| 0.33|
+----+-------+-------+
The number of columns should be arbitrary (not a fixed list).
How can I do this?
Thanks!

Get last n items in pyspark

For a dataset like -
+---+------+----------+
| id| item| timestamp|
+---+------+----------+
| 1| apple|2022-08-15|
| 1| peach|2022-08-15|
| 1| apple|2022-08-15|
| 1|banana|2022-08-14|
| 2| apple|2022-08-15|
| 2|banana|2022-08-14|
| 2|banana|2022-08-14|
| 2| water|2022-08-14|
| 3| water|2022-08-15|
| 3| water|2022-08-14|
+---+------+----------+
Can I use pyspark functions directly to get the last three items the user purchased in the past 5 days? I know a udf can do that, but I am wondering if any existing function can achieve this.
My expected output is like below, but anything similar is okay too.
id last_three_item
1 [apple, peach, apple]
2 [water, banana, apple]
3 [water, water]
Thanks!
You can use pandas_udf for this.
import pyspark.sql.functions as f
from pyspark.sql.types import ArrayType, StringType

@f.pandas_udf(returnType=ArrayType(StringType()), functionType=f.PandasUDFType.GROUPED_AGG)
def pudf_get_top_3(x):
    return x.head(3).to_list()

sdf \
    .orderBy("timestamp") \
    .groupby("id") \
    .agg(pudf_get_top_3("item").alias("last_three_item")) \
    .show()

Pyspark: How to build sum of a column(which contains negative and positive values) with stop at 0

I think an example says more than the description.
The right-hand column "sum" is the one I am looking for.
to_count|sum
-------------
-1 |0
+1 |1
-1 |0
-1 |0
+1 |1
+1 |2
-1 |1
+1 |2
. |.
. |.
I tried to rebuild that with several groupings, comparing lead and lag, but that only works until the first time the sum would end in a negative value.
Summing the positive and negative values separately also ends in a different final result.
It would be great if anyone has a good idea how to solve this in PySpark!
I would use pandas_udf here:
import math
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql.functions import pandas_udf, PandasUDFType

pdf = pd.DataFrame({'g': [1]*8, 'id': range(8), 'value': [-1, 1, -1, -1, 1, 1, -1, 1]})
df = spark.createDataFrame(pdf)
df = df.withColumn('cumsum', F.lit(math.inf))

@pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
def _calc_cumsum(pdf):
    # Sort by id, then run a cumulative sum that is floored at 0.
    pdf.sort_values(by=['id'], inplace=True, ascending=True)
    cumsums = []
    prev = 0
    for v in pdf['value'].values:
        prev = max(prev + v, 0)
        cumsums.append(prev)
    pdf['cumsum'] = cumsums
    return pdf

df = df.groupby('g').apply(_calc_cumsum)
df.show()
The results:
+---+---+-----+------+
| g| id|value|cumsum|
+---+---+-----+------+
| 1| 0| -1| 0.0|
| 1| 1| 1| 1.0|
| 1| 2| -1| 0.0|
| 1| 3| -1| 0.0|
| 1| 4| 1| 1.0|
| 1| 5| 1| 2.0|
| 1| 6| -1| 1.0|
| 1| 7| 1| 2.0|
+---+---+-----+------+
Please look at the picture first: it shows a test dataset (the first 3 columns) and the calculation steps.
The column "flag" is now in another format. We also checked our data source and realized that we only have to handle 1 and -1 entries. We mapped 1 to 0 and -1 to 1. Now it works as expected, as you can see in the "result" column.
The code is this:
w1 = Window.partitionBy('group').orderBy('order')
df_0 = tst.withColumn('edge_det', F.when((F.col('flag') == 0) & (F.lag('flag', default=1).over(w1) == 1), 1).otherwise(0))
df_0 = df_0.withColumn('edge_cyl', F.sum('edge_det').over(w1))
df1 = df_0.withColumn('condition', F.when(F.col('edge_cyl') == 0, 0).otherwise(F.when(F.col('flag') == 1, -1).otherwise(1)))
df1 = df1.withColumn('cond_sum', F.sum('condition').over(w1))
cond = (F.col('cond_sum') >= 0) | (F.col('condition') == 1)
df2 = df1.withColumn('new_cond', F.when(cond, F.col('condition')).otherwise(0))
df3 = df2.withColumn("result", F.sum('new_cond').over(w1))

Apply function on all rows of dataframe [duplicate]

This question already has answers here:
Process all columns / the entire row in a Spark UDF
(2 answers)
Closed 3 years ago.
I want to apply a function on all rows of DataFrame.
Example:
|A |B |C |
|1 |3 |5 |
|6 |2 |0 |
|8 |2 |7 |
|0 |9 |4 |
Myfunction(df)
Myfunction(df: DataFrame):{
//Apply sum of columns on each row
}
Wanted output:
1+3+5 = 9
6+2+0 = 8
...
How can that be done in Scala please? I followed this but had no luck.
It's simple. You don't need to write any function for this; all you need to do is create a new column by summing up the columns you want.
scala> df.show
+---+---+---+
| A| B| C|
+---+---+---+
| 1| 2| 3|
| 1| 2| 4|
| 1| 2| 5|
+---+---+---+
scala> df.withColumn("sum",col("A")+col("B")+col("C")).show
+---+---+---+---+
| A| B| C|sum|
+---+---+---+---+
| 1| 2| 3| 6|
| 1| 2| 4| 7|
| 1| 2| 5| 8|
+---+---+---+---+
Edited:
Well, you can run a map function on each row and get the sum using the row index / field name.
scala> df.map(x=>x.getInt(0) + x.getInt(1) + x.getInt(2)).toDF("sum").show
+---+
|sum|
+---+
| 6|
| 7|
| 8|
+---+
scala> df.map(x=>x.getInt(x.fieldIndex("A")) + x.getInt(x.fieldIndex("B")) + x.getInt(x.fieldIndex("C"))).toDF("sum").show
+---+
|sum|
+---+
| 6|
| 7|
| 8|
+---+
Map is the solution if you want to apply a function to every row of a dataframe. For every Row, you can return a tuple, and a new RDD is made.
This is perfect when working with a Dataset or RDD, but not ideal for a DataFrame. For your use case and for DataFrame, I would recommend just adding a column and using column objects to do what you want.
// Using expr
df.withColumn("TOTAL", expr("A+B+C"))
// Using columns
df.withColumn("TOTAL", col("A") + col("B") + col("C"))
// Using dynamic selection of all columns
df.withColumn("TOTAL", df.columns.map(col).reduce((c1, c2) => c1 + c2))
In that case, you'll be very interested in this question.
UDF is also a good solution and is better explained here.
If you don't want to keep the source columns, you can replace .withColumn(name, value) with .select(value.alias(name)).
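For instance, a minimal sketch of that last variant, keeping only the computed total (column names follow the example above):

import org.apache.spark.sql.functions._

// Only the computed column survives; A, B and C are dropped.
val totals = df.select((col("A") + col("B") + col("C")).alias("TOTAL"))
totals.show()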

How to append column values in Spark SQL?

I have the below table:
+-------+---------+---------+
|movieId|movieName| genre|
+-------+---------+---------+
| 1| example1| action|
| 1| example1| thriller|
| 1| example1| romance|
| 2| example2|fantastic|
| 2| example2| action|
+-------+---------+---------+
What I am trying to achieve is to append the genre values together where the id and name are the same, like this:
+-------+---------+---------------------------+
|movieId|movieName| genre |
+-------+---------+---------------------------+
| 1| example1| action|thriller|romance |
| 2| example2| action|fantastic |
+-------+---------+---------------------------+
Use groupBy and collect_list to get a list of all items with the same movie name. Then combine these into a single string using concat_ws (if the order is important, first use sort_array). A small example with the given sample dataframe:
val df2 = df.groupBy("movieId", "movieName")
  .agg(collect_list($"genre").as("genre"))
  .withColumn("genre", concat_ws("|", sort_array($"genre")))
Gives the result:
+-------+---------+-----------------------+
|movieId|movieName|genre |
+-------+---------+-----------------------+
|1 |example1 |action|thriller|romance|
|2 |example2 |action|fantastic |
+-------+---------+-----------------------+