extract day from the first week of year - date

I am trying to extract, in PySpark, the date of the Sunday of every given week in a year. Week and year are in the format yyyyww. This works for every week except the first one, for which I get a null value. This is the sample code and result.
columns = ['id', 'week_year']
vals = [
(1, 201952),
(2, 202001),
(3, 202002),
(4, 201901),
(5, 201902)
]
df = spark.createDataFrame(vals, columns)
+---+---------+
| id|week_year|
+---+---------+
| 1| 201952|
| 2| 202001|
| 3| 202002|
| 4| 201901|
| 5| 201902|
+---+---------+
df = df.withColumn("day", to_timestamp(concat(df.week_year, lit("-Sunday")), 'yyyyww-E'))
As a result I got
+---+---------+-------------------+
| id|week_year| day|
+---+---------+-------------------+
| 1| 201952|2019-12-22 00:00:00|
| 2| 202001| null|
| 3| 202002|2020-01-05 00:00:00|
| 4| 201901| null|
| 5| 201902|2019-01-06 00:00:00|
+---+---------+-------------------+
Do you have an idea why it does not work for the first week? It also seems strange to me that 5 January and 6 January fall in the second week rather than the first.

If you look at the calendar for 2020, the year starts on a Wednesday, which lands in the middle of week 1, and that first week does not contain a Sunday. The same goes for 2019. That is why 2020-01-05 ends up in the second week.
Hope this helps!
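You can double-check the weekday of 1 January with plain Python, no Spark needed:
import datetime

# 1 January 2020 is a Wednesday and 1 January 2019 is a Tuesday, so the Sunday
# of the Sunday-to-Saturday week containing each of them falls in the previous
# year, and parsing 'yyyyww-E' with "Sunday" finds no match in week 1.
print(datetime.date(2020, 1, 1).strftime("%A"))  # Wednesday
print(datetime.date(2019, 1, 1).strftime("%A"))  # Tuesday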

Related

pySpark: add current month to column name

I have a self-written function that takes a dataframe and returns the whole dataframe plus a new column. That new column must not have a fixed name; instead, the current month should be part of the column name, e.g. "forecast_august2022".
I tried it like this:
.withColumnRenamed(
    old_columnname,
    new_columnname
)
But I do not know how to build the new column name from a fixed prefix (forecast_) concatenated with the current month. Any ideas?
You can define a variable at the start holding the current month and year and use it in an f-string when adding the column with withColumn:
from pyspark.sql import functions as F
import datetime

mydate = datetime.datetime.now()
month_nm = mydate.strftime("%B%Y")  # gives you July2022 for today

dql1 = spark.range(3).toDF("ID")
dql1.withColumn(f"forecast_{month_nm}", F.lit(0)).show()
#output
+---+-----------------+
| ID|forecast_July2022|
+---+-----------------+
| 0| 0|
| 1| 0|
| 2| 0|
+---+-----------------+
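If your own function already returns the new column under a fixed placeholder name, the withColumnRenamed route from the question works just as well. A minimal sketch, where the placeholder column name "forecast" and the lit(0) stand-in for your function are assumptions:
import datetime
from pyspark.sql import functions as F

month_nm = datetime.datetime.now().strftime("%B%Y")  # e.g. July2022

# "forecast" stands in for whatever fixed name your own function produces
df_out = dql1.withColumn("forecast", F.lit(0)) \
             .withColumnRenamed("forecast", f"forecast_{month_nm}")
df_out.show()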
You can do it like this:
>>> import datetime
>>> from pyspark.sql.functions import col
>>> # provide month number
>>> month_num = "3"
>>> datetime_object = datetime.datetime.strptime(month_num, "%m")
>>> full_month_name = datetime_object.strftime("%B")
>>> df.withColumn(("newcol" + "_" + full_month_name + "2022"), col('period')).show()
+------+-------+------+----------------+
|period|product|amount|newcol_March2022|
+------+-------+------+----------------+
| 20191| prod1| 30| 20191|
| 20192| prod1| 30| 20192|
| 20191| prod2| 20| 20191|
| 20191| prod3| 60| 20191|
| 20193| prod1| 30| 20193|
| 20193| prod2| 30| 20193|
+------+-------+------+----------------+

Scala Spark use Window function to find max value

I have a data set that looks like this:
+------------------------+-----+
| timestamp| zone|
+------------------------+-----+
| 2019-01-01 00:05:00 | A|
| 2019-01-01 00:05:00 | A|
| 2019-01-01 00:05:00 | B|
| 2019-01-01 01:05:00 | C|
| 2019-01-01 02:05:00 | B|
| 2019-01-01 02:05:00 | B|
+------------------------+-----+
For each hour I need to count which zone had the most rows and end up with a table that looks like this:
+-----+-----+-----+
| hour| zone| max |
+-----+-----+-----+
| 0| A| 2|
| 1| C| 1|
| 2| B| 2|
+-----+-----+-----+
My instructions say that I need to use the Window function along with "group by" to find my max count.
I've tried a few things but I'm not sure if I'm close. Any help would be appreciated.
You can use two subsequent window functions to get your result:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

df
  .withColumn("hour", hour($"timestamp"))
  .withColumn("cnt", count("*").over(Window.partitionBy($"hour", $"zone")))
  .withColumn("rnb", row_number().over(Window.partitionBy($"hour").orderBy($"cnt".desc)))
  .where($"rnb" === 1)
  .select($"hour", $"zone", $"cnt".as("max"))
You can use windowing functions together with a group by on dataframes.
In your case you can use the rank() over (partition by) window function.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

// first, group by hour and zone
val df_group = data_tms
  .select(hour(col("timestamp")).as("hour"), col("zone"))
  .groupBy(col("hour"), col("zone"))
  .agg(count("zone").as("max"))

// second, rank by hour ordered by max in descending order
val df_rank = df_group
  .select(col("hour"),
          col("zone"),
          col("max"),
          rank().over(Window.partitionBy(col("hour")).orderBy(col("max").desc)).as("rank"))

// filter by col rank = 1
df_rank
  .select(col("hour"),
          col("zone"),
          col("max"))
  .where(col("rank") === 1)
  .orderBy(col("hour"))
  .show()
/*
+----+----+---+
|hour|zone|max|
+----+----+---+
| 0| A| 2|
| 1| C| 1|
| 2| B| 2|
+----+----+---+
*/
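If you need the same result in PySpark rather than Scala, here is a sketch of the equivalent two-step approach (counts per hour and zone via groupBy, then row_number over an hour-partitioned window to keep the top zone). It assumes the input dataframe is named df and has the timestamp and zone columns shown above:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("hour").orderBy(F.col("cnt").desc())

result = (df
    .withColumn("hour", F.hour("timestamp"))
    .groupBy("hour", "zone")
    .agg(F.count("*").alias("cnt"))          # rows per hour/zone
    .withColumn("rnb", F.row_number().over(w))
    .where(F.col("rnb") == 1)                # zone with the most rows per hour
    .select("hour", "zone", F.col("cnt").alias("max")))

result.show()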

Why is the date_format() returning wrong week in Pyspark?

I was trying to get the week of the month from a date column in a PySpark dataframe. I am using the following expression to get the week: date_format(to_date("my_date_col","yyyy-MM-dd"), "W"), taken from https://www.datasciencemadesimple.com/get-week-number-from-date-in-pyspark/#:~:text=In%20order%20to%20get%20Week,we%20use%20weekofmonth()%20function.
Oddly, this seems to work for every week except the first week of August 2020!
from pyspark.sql.functions import col, month, date_format, to_date

base.filter(col("acct_cycle_cut_dt").between("2020-08-01","2020-08-07")\
).select("acct_cycle_cut_dt",month("acct_cycle_cut_dt"),\
date_format(to_date("acct_cycle_cut_dt","yyyy-MM-dd"), "W")\
).limit(4).show()
+-----------------+------------------------+----------------------------------------------------------+
|acct_cycle_cut_dt|month(acct_cycle_cut_dt)|date_format(to_date(`acct_cycle_cut_dt`, 'yyyy-MM-dd'), W)|
+-----------------+------------------------+----------------------------------------------------------+
| 2020-08-02| 8| 2|
| 2020-08-07| 8| 2|
| 2020-08-07| 8| 2|
| 2020-08-07| 8| 2|
+-----------------+------------------------+----------------------------------------------------------+
base.filter(col("acct_cycle_cut_dt").between("2020-07-01","2020-07-07")\
).select("acct_cycle_cut_dt",month("acct_cycle_cut_dt"),\
date_format(to_date("acct_cycle_cut_dt","yyyy-MM-dd"), "W")\
).limit(4).show()
+-----------------+------------------------+----------------------------------------------------------+
|acct_cycle_cut_dt|month(acct_cycle_cut_dt)|date_format(to_date(`acct_cycle_cut_dt`, 'yyyy-MM-dd'), W)|
+-----------------+------------------------+----------------------------------------------------------+
| 2020-07-03| 7| 1|
| 2020-07-03| 7| 1|
| 2020-07-02| 7| 1|
| 2020-07-02| 7| 1|
+-----------------+------------------------+----------------------------------------------------------+
That is the correct result, it is not wrong.
from pyspark.sql.functions import *
df.withColumn('date', to_timestamp('date', 'yyyy-MM-dd')) \
.withColumn('month', month('date')) \
.withColumn('week', date_format('date', 'W')) \
.show(10, False)
+-------------------+-----+----+
|date |month|week|
+-------------------+-----+----+
|2020-08-01 00:00:00|8 |1 |
|2020-08-02 00:00:00|8 |2 |
|2020-08-03 00:00:00|8 |2 |
|2020-08-04 00:00:00|8 |2 |
|2020-08-05 00:00:00|8 |2 |
|2020-08-06 00:00:00|8 |2 |
|2020-08-07 00:00:00|8 |2 |
|2020-08-08 00:00:00|8 |2 |
|2020-08-09 00:00:00|8 |3 |
|2020-08-10 00:00:00|8 |3 |
+-------------------+-----+----+
You can even check this in the calendar,
where the 1st of August really is in the first week of August and the 2nd of August is in the second week.
In August 2020 the month starts on a Saturday. When you have the date "2020-08-02" or "2020-08-07", both dates have only one Sunday before them in that month. Your script will return this kind of seemingly incorrect result whenever a month starts on or just before a weekend.
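You can confirm the month boundaries with plain Python:
import datetime

# August 2020 starts on a Saturday, so week 1 of August (as counted by 'W')
# contains only 2020-08-01; July 2020 starts on a Wednesday, so 1-4 July are
# all in week 1, matching the output above.
print(datetime.date(2020, 8, 1).strftime("%A"))  # Saturday
print(datetime.date(2020, 7, 1).strftime("%A"))  # Wednesday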

Use Window to count lines with if condition in scala

I hope that you might help me :-)
I have a dataframe of posted adverts.
For each advert id, I want to count the number of adverts posted by the same email in the 2 months preceding this one.
I created the dataframe below to explain things better:
var df = sc.parallelize(Array(
  (1,  "2017-06-29 10:53:53.0", "boulanger.fr", "2017-06-28", "2017-04-29"),
  (2,  "2017-07-05 10:48:57.0", "patissier.fr", "2017-07-04", "2017-05-05"),
  (3,  "2017-06-28 10:31:42.0", "boulanger.fr", "2017-08-16", "2017-06-17"),
  (4,  "2017-08-21 17:31:12.0", "patissier.fr", "2017-08-20", "2017-06-21"),
  (5,  "2017-07-28 11:22:42.0", "boulanger.fr", "2017-08-22", "2017-06-23"),
  (6,  "2017-08-23 17:03:43.0", "patissier.fr", "2017-08-22", "2017-06-23"),
  (7,  "2017-08-24 16:08:07.0", "boulanger.fr", "2017-08-23", "2017-06-24"),
  (8,  "2017-08-31 17:20:43.0", "patissier.fr", "2017-08-30", "2017-06-30"),
  (9,  "2017-09-04 14:35:38.0", "boulanger.fr", "2017-09-03", "2017-07-04"),
  (10, "2017-09-07 15:10:34.0", "patissier.fr", "2017-09-06", "2017-07-07"))
).toDF("id_advert", "creation_date", "email", "date_minus1", "date_minus2m")

df = df.withColumn("date_minus1", to_date(unix_timestamp($"date_minus1", "yyyy-MM-dd").cast("timestamp")))
df = df.withColumn("date_minus2m", to_date(unix_timestamp($"date_minus2m", "yyyy-MM-dd").cast("timestamp")))
df = df.withColumn("date_creation", unix_timestamp($"creation_date", "yyyy-MM-dd HH:mm:ss").cast("timestamp"))
date_minus1 = the day before the advert was posted
date_minus2m = 2 months before the advert was posted
I want to count the number of adverts, with the same email, between those two dates...
What I want as a result is:
+---------+----------------+
|id_advert|nb_prev_advert |
+---------+----------------+
|6 |2 |
|3 |3 |
|5 |3 |
|9 |2 |
|4 |1 |
|8 |3 |
|7 |3 |
|10 |3 |
+---------+----------------+
I managed to do that with an awful join of the dataframe with itself, but as I have millions of lines it took almost 2 hours to run...
I am sure we can do something like:
val w = Window.partitionBy("id_advert").orderBy("creation_date").rowsBetween(-50000000, -1)
and use it to go across the dataframe, counting only the rows where:
the email of the row = the email of the current row
date_minus2m of the row < creation date of the current row < date_minus1 of the row
Adding this as a different answer because it is different.
Input:
df.select("*").orderBy("email","creation_date").show()
+---------+--------------------+------------+----+
|id_advert| creation_date| email|sold|
+---------+--------------------+------------+----+
| 1|2015-06-29 10:53:...|boulanger.fr| 1|
| 5|2015-07-28 11:22:...|boulanger.fr| 0|
| 3|2017-06-28 10:31:...|boulanger.fr| 1|
| 7|2017-08-24 16:08:...|boulanger.fr| 1|
| 9|2017-09-04 14:35:...|boulanger.fr| 1|
| 10|2012-09-07 15:10:...|patissier.fr| 0|
| 8|2014-08-31 17:20:...|patissier.fr| 1|
| 2|2016-07-05 10:48:...|patissier.fr| 1|
| 4|2017-08-21 17:31:...|patissier.fr| 0|
| 6|2017-08-23 17:03:...|patissier.fr| 0|
+---------+--------------------+------------+----+
Now you define your window spec as something like this:
val w = Window
  .partitionBy("email")
  .orderBy(col("creation_date").cast("timestamp").cast("long"))
  .rangeBetween(-60*24*60*60, -1)
And the main query will be:
df.select(
    col("*"),
    count("email").over(w).alias("all_prev_mail_advert"),
    sum("sold").over(w).alias("all_prev_sold_mail_advert")
  )
  .orderBy("email", "creation_date")
  .show()
Output:
+---------+--------------------+------------+----+--------------------+-------------------------+
|id_advert| creation_date| email|sold|all_prev_mail_advert|all_prev_sold_mail_advert|
+---------+--------------------+------------+----+--------------------+-------------------------+
| 1|2015-06-29 10:53:...|boulanger.fr| 1| 0| null|
| 5|2015-07-28 11:22:...|boulanger.fr| 0| 1| 1|
| 3|2017-06-28 10:31:...|boulanger.fr| 1| 0| null|
| 7|2017-08-24 16:08:...|boulanger.fr| 1| 1| 1|
| 9|2017-09-04 14:35:...|boulanger.fr| 1| 1| 1|
| 10|2012-09-07 15:10:...|patissier.fr| 0| 0| null|
| 8|2014-08-31 17:20:...|patissier.fr| 1| 0| null|
| 2|2016-07-05 10:48:...|patissier.fr| 1| 0| null|
| 4|2017-08-21 17:31:...|patissier.fr| 0| 0| null|
| 6|2017-08-23 17:03:...|patissier.fr| 0| 1| 0|
+---------+--------------------+------------+----+--------------------+-------------------------+
Explanation:
We define a window over the last two months, partitioned by email. The count over this window gives all the previous adverts for the same email.
To get all the previously sold adverts we simply sum the sold column over the same window. Since sold is 1 for a sold item, the sum gives the count of all sold items in this window.
Here is the answer using a Window with a range.
Create a window spec with a range between the current row and the past sixty days:
val w = Window
.partitionBy(col("email"))
.orderBy(col("creation_date").cast("timestamp").cast("long"))
.rangeBetween(-60*86400,-1)
Then select it over your data frame
df
.select(col("*"),count("email").over(w).alias("trailing_count"))
.orderBy("email","creation_date") //using this for display purpose
.show()
Note: Your expected output might be wrong. For one, there should be at least one zero per email, because some advert has to be the starting row for that email. Also, the count for advert id 3 seems wrong.
Input Data :
df.select("id_advert","creation_date","email").orderBy("email", "creation_date").show()
+---------+--------------------+------------+
|id_advert| creation_date| email|
+---------+--------------------+------------+
| 3|2017-06-28 10:31:...|boulanger.fr|
| 1|2017-06-29 10:53:...|boulanger.fr|
| 5|2017-07-28 11:22:...|boulanger.fr|
| 7|2017-08-24 16:08:...|boulanger.fr|
| 9|2017-09-04 14:35:...|boulanger.fr|
| 2|2017-07-05 10:48:...|patissier.fr|
| 4|2017-08-21 17:31:...|patissier.fr|
| 6|2017-08-23 17:03:...|patissier.fr|
| 8|2017-08-31 17:20:...|patissier.fr|
| 10|2017-09-07 15:10:...|patissier.fr|
+---------+--------------------+------------+
Output:
+---------+--------------------+------------+-------------+--------------+
|id_advert| creation_date| email|date_creation|trailing_count|
+---------+--------------------+------------+-------------+--------------+
| 3|2017-06-28 10:31:...|boulanger.fr| 1498645902| 0|
| 1|2017-06-29 10:53:...|boulanger.fr| 1498733633| 1|
| 5|2017-07-28 11:22:...|boulanger.fr| 1501240962| 2|
| 7|2017-08-24 16:08:...|boulanger.fr| 1503590887| 3|
| 9|2017-09-04 14:35:...|boulanger.fr| 1504535738| 2|
| 2|2017-07-05 10:48:...|patissier.fr| 1499251737| 0|
| 4|2017-08-21 17:31:...|patissier.fr| 1503336672| 1|
| 6|2017-08-23 17:03:...|patissier.fr| 1503507823| 2|
| 8|2017-08-31 17:20:...|patissier.fr| 1504200043| 3|
| 10|2017-09-07 15:10:...|patissier.fr| 1504797034| 3|
+---------+--------------------+------------+-------------+--------------+
As it is impossible to structure a comment properly, I will use the answer button, but this is actually more a question than an answer.
I simplified the problem, thinking that with your answer I might be able to do what I want, but I am not sure I understand your answer correctly...
How does it work? To me:
if I do .rangeBetween(-3, -1) I use a window that looks from 3 lines before the current line to one line before the current line. But here it seems that rangeBetween refers to the orderBy variable and not to the total number of lines...???
if I do partitionBy(col("email")) I should get one line per email, but here I still get one line per advert_id...
What I really want to do is count, respectively, the number of sold items and the number of unsold items in the 2 months preceding the advert post date, by the same email.
Is there an easy way to take what you did and apply it to my real issue?
My dataframe looks like this:
var df = sc.parallelize(Array(
  (1,  "2015-06-29 10:53:53.0", "boulanger.fr", 1),
  (2,  "2016-07-05 10:48:57.0", "patissier.fr", 1),
  (3,  "2017-06-28 10:31:42.0", "boulanger.fr", 1),
  (4,  "2017-08-21 17:31:12.0", "patissier.fr", 0),
  (5,  "2015-07-28 11:22:42.0", "boulanger.fr", 0),
  (6,  "2017-08-23 17:03:43.0", "patissier.fr", 0),
  (7,  "2017-08-24 16:08:07.0", "boulanger.fr", 1),
  (8,  "2014-08-31 17:20:43.0", "patissier.fr", 1),
  (9,  "2017-09-04 14:35:38.0", "boulanger.fr", 1),
  (10, "2012-09-07 15:10:34.0", "patissier.fr", 0))
).toDF("id_advert", "creation_date", "email", "sold")
For each id_advert I would like to have 2 lines: one for the number of sold items and one for the number of unsold items...
Thank you in advance!!! If it is not possible for you to answer, I will do it in a dirtier way ;-).

Merging and aggregating dataframes using Spark Scala

I have a dataset that, after transformation using Spark Scala (1.6.2), gives me the following two dataframes.
DF1:
|date | country | count|
| 1872| Scotland| 1|
| 1873| England | 1|
| 1873| Scotland| 1|
| 1875| England | 1|
| 1875| Scotland| 2|
DF2:
| date| country | count|
| 1872| England | 1|
| 1873| Scotland| 1|
| 1874| England | 1|
| 1875| Scotland| 1|
| 1875| Wales | 1|
Now, from the above two dataframes, I want to aggregate by date per country, like the following output. I tried using union and joining but was not able to get the desired results.
Expected output from the two dataframes above:
| date| country | count|
| 1872| England | 1|
| 1872| Scotland| 1|
| 1873| Scotland| 2|
| 1873| England | 1|
| 1874| England | 1|
| 1875| Scotland| 3|
| 1875| Wales | 1|
| 1875| England | 1|
Kindly help me get to a solution.
The best way is to perform a union and then a groupBy on the two columns; then, with sum, you can specify which column to add up:
df1.unionAll(df2)
.groupBy("date", "country")
.sum("count")
Output:
+----+--------+----------+
|date| country|sum(count)|
+----+--------+----------+
|1872|Scotland| 1|
|1875| England| 1|
|1873| England| 1|
|1875| Wales| 1|
|1872| England| 1|
|1874| England| 1|
|1873|Scotland| 2|
|1875|Scotland| 3|
+----+--------+----------+
Using the DataFrame API, you can use a unionAll followed by a groupBy to achieve this.
DF1.unionAll(DF2)
.groupBy("date", "country")
.agg(sum($"count").as("count"))
This will first put all rows from the two dataframes into a single dataframe. Then, by grouping on the date and country columns, it is possible to get the aggregate sum of the count column by date per country, as asked. The as("count") part renames the aggregated column to count.
Note: In newer Spark versions (read version 2.0+), unionAll is deprecated and is replaced by union.
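For reference, a sketch of the same union-then-groupBy aggregation written in PySpark (Spark 2.0+ syntax; use unionAll instead of union on 1.6). df1 and df2 are assumed to hold the date, country and count columns shown above:
from pyspark.sql import functions as F

result = (df1.union(df2)
    .groupBy("date", "country")
    .agg(F.sum("count").alias("count")))

result.show()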