What is an efficient way to make a new dataframe (PySpark)?

I have a dataframe like:
| date | ID | count |
|20170101| 258 | 1003 |
|20170102| 258 | 13 |
|20170103| 258 | 1 |
|20170104| 258 | 108 |
|20170109| 258 | 25 |
| ... | ... | ... |
|20170101| 2813 | 503 |
|20170102| 2813 | 139 |
| ... | ... | ... |
|20170101| 4963 | 821 |
|20170102| 4963 | 450 |
| ... | ... | ... |
Some data is missing from my dataframe.
For example, dates 20170105 ~ 20170108 are missing for ID 258 here.
A missing row means nothing was recorded for that day (i.e. count == 0).
But I'd like to add the rows whose count is 0 too, like this:
| date | ID | count |
|20170101| 258 | 1003 |
|20170102| 258 | 13 |
|20170103| 258 | 1 |
|20170104| 258 | 108 |
|20170105| 258 | 0 |
|20170106| 258 | 0 |
|20170107| 258 | 0 |
|20170108| 258 | 0 |
|20170109| 258 | 25 |
| ... | ... | ... |
|20170101| 2813 | 503 |
|20170102| 2813 | 139 |
| ... | ... | ... |
|20170101| 4963 | 821 |
|20170102| 4963 | 450 |
| ... | ... | ... |
A dataframe is immutable, so if I want to add zero-count rows to it,
I have to make a new dataframe.
But even if I have a date range (20170101 ~ 20171231) and an ID list, I cannot run a for loop over the dataframe.
How can I make a new dataframe?
P.S. What I already tried was to build a complete dataframe, compare the two dataframes to produce another dataframe holding only the zero-count rows, and finally union the "original dataframe" with the "0 counted dataframe". I think this is a long and inefficient process. Please recommend a more efficient solution.

from pyspark.sql.functions import unix_timestamp, from_unixtime, struct, datediff, lead, col, explode, lit, udf
from pyspark.sql.window import Window
from pyspark.sql.types import ArrayType, DateType
from datetime import timedelta

# sample data (sc is an existing SparkContext)
df = sc.parallelize([
    ['20170101', 258, 1003],
    ['20170102', 258, 13],
    ['20170103', 258, 1],
    ['20170104', 258, 108],
    ['20170109', 258, 25],
    ['20170101', 2813, 503],
    ['20170102', 2813, 139],
    ['20170101', 4963, 821],
    ['20170102', 4963, 450]]).\
    toDF(('date', 'ID', 'count')).\
    withColumn("date", from_unixtime(unix_timestamp('date', 'yyyyMMdd')).cast('date'))

# d is a struct (date, diff); returns the dates strictly between date and the next observed date
def date_list_fn(d):
    return [d[0] + timedelta(days=x) for x in range(1, d[1])]

date_list_udf = udf(date_list_fn, ArrayType(DateType()))
w = Window.partitionBy('ID').orderBy('date')

# dataframe having missing dates, one row per missing date with count 0
df_missing = df.withColumn("diff", datediff(lead('date').over(w), 'date')).\
    filter(col("diff") > 1).\
    withColumn("date_list", date_list_udf(struct("date", "diff"))).\
    withColumn("date_list", explode(col("date_list"))).\
    select(col("date_list").alias("date"), "ID", lit(0).alias("count"))

# final dataframe by combining sample data with missing date dataframe
final_df = df.union(df_missing).sort(col("ID"), col("date"))
Sample data:
| date| ID|count|
|2017-01-01| 258| 1003|
|2017-01-02| 258| 13|
|2017-01-03| 258| 1|
|2017-01-04| 258| 108|
|2017-01-09| 258| 25|
|2017-01-01|2813| 503|
|2017-01-02|2813| 139|
|2017-01-01|4963| 821|
|2017-01-02|4963| 450|
Output is:
| date| ID|count|
|2017-01-01| 258| 1003|
|2017-01-02| 258| 13|
|2017-01-03| 258| 1|
|2017-01-04| 258| 108|
|2017-01-05| 258| 0|
|2017-01-06| 258| 0|
|2017-01-07| 258| 0|
|2017-01-08| 258| 0|
|2017-01-09| 258| 25|
|2017-01-01|2813| 503|
|2017-01-02|2813| 139|
|2017-01-01|4963| 821|
|2017-01-02|4963| 450|
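To see what the UDF and the lead/datediff step compute, here is a minimal plain-Python sketch of the same logic for a single ID. It needs no Spark; the helper name missing_dates and the hard-coded list are only for illustration.

```python
from datetime import date, timedelta

def missing_dates(current, days_to_next):
    """Dates strictly between `current` and the next observed date."""
    return [current + timedelta(days=x) for x in range(1, days_to_next)]

# Observed dates for ID 258, already sorted (the role of orderBy('date') in the window).
observed = [date(2017, 1, 1), date(2017, 1, 2), date(2017, 1, 3),
            date(2017, 1, 4), date(2017, 1, 9)]

gaps = []
for current, nxt in zip(observed, observed[1:]):  # pair each row with its "lead"
    diff = (nxt - current).days                   # same role as datediff(lead(date), date)
    if diff > 1:                                  # same role as filter(diff > 1)
        gaps.extend(missing_dates(current, diff))

print(gaps)
```

Note that this only fills gaps between two observed dates of the same ID. Dates before an ID's first row or after its last row (for example the rest of 2017 in the question's range) are not generated by this approach.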


Retrieve column value given a column of column names (spark / scala)

I have a dataframe like the following:
|best_col |A |B | C |<many more columns> |
| A | 14 | 26 | 32 | ... |
| C | 13 | 17 | 96 | ... |
| B | 23 | 19 | 42 | ... |
I want to end up with a DataFrame like this:
|best_col |A |B | C |<many more columns> | result |
| A | 14 | 26 | 32 | ... | 14 |
| C | 13 | 17 | 96 | ... | 96 |
| B | 23 | 19 | 42 | ... | 19 |
Essentially, I want to add a column result that will choose the value from the column specified in the best_col column. best_col only contains column names that are present in the DataFrame. Since I have dozens of columns, I want to avoid using a bunch of when statements to check when col(best_col) === A etc. I tried doing col(col("best_col").toString()), but this didn't work. Is there an easy way to do this?
Using map_filter introduced in Spark 3.0:
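import org.apache.spark.sql.functions._
import spark.implicits._ // assumes a SparkSession named spark, as in spark-shell
val df = Seq(
  ("A", 14, 26, 32),
  ("C", 13, 17, 96),
  ("B", 23, 19, 42),
).toDF("best_col", "A", "B", "C")
// map from each column's value to a flag: is this the column named in best_col?
df.withColumn("result", map(df.columns.tail.flatMap(c => Seq(col(c), lit(col("best_col") === lit(c)))): _*))
  // keep only the entry whose flag is true
  .withColumn("result", map_filter(col("result"), (a, b) => b))
  // the single remaining key is the wanted value
  .withColumn("result", map_keys(col("result"))(0))
  .show()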
|best_col| A| B| C|result|
| A| 14| 26| 32| 14|
| C| 13| 17| 96| 96|
| B| 23| 19| 42| 19|
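The intended row-wise semantics can be checked with a small plain-Python sketch. It is only an illustration of the lookup, not Spark code, and the function name pick_best is made up. Unlike the map above, which uses the cell values as keys, this sketch keys the lookup by column name, so two columns holding the same value in one row cannot collide.

```python
rows = [
    {"best_col": "A", "A": 14, "B": 26, "C": 32},
    {"best_col": "C", "A": 13, "B": 17, "C": 96},
    {"best_col": "B", "A": 23, "B": 19, "C": 42},
]

def pick_best(row):
    """Return the value of the column named in row['best_col']."""
    candidates = {name: value for name, value in row.items() if name != "best_col"}
    return candidates[row["best_col"]]

results = [pick_best(r) for r in rows]
print(results)  # [14, 96, 19]
```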

PySpark Relate multiple rows

I have a problem where I need to relate rows to each other. I have tried many things but I am now completely stuck. I have tried partitioning, lag and groupBys, but nothing works.
Each row below an ID 26 row should relate to the MPAN of that ID 26 row:
ID | MPAN | Value
26 | 12345678 | Hello
27 | 99900234 | Bye
30 | 77563820 | Help
33 | 89898937 | Stuck
26 | 54877273 | Need a genius
29 | 54645643 | So close
30 | 22222222 | Thanks
Expected result:
ID | MPAN | Value | Relation
26 | 12345678 | Hello | NULL
27 | 99900234 | Bye | 12345678
30 | 77563820 | Help | 12345678
33 | 89898937 | Stuck | 12345678
26 | 54877273 | Need a genius | NULL
29 | 54645643 | So close | 54877273
30 | 22222222 | Thanks | 54877273
The code below only handles the previous row; it does not look back to the ID 26 record:
df = spark.read.load('abfss://Files/', format='parquet')
df = df.withColumn("identity", F.monotonically_increasing_id())
win = Window.orderBy("identity")
condition = F.col("Prop_0") != '026'
df = df.withColumn("FlagY", F.when(condition, mpanlookup))
As I said in my comment, you need a column to maintain the order. In your example, you used monotonically_increasing_id to create that "ordering" column, but that is not reliable because, according to the documentation:
The function is non-deterministic because its result depends on partition IDs.
Assuming you have a proper "ordering" column:
|idx| ID| MPAN| Value|
| 1| 26|12345678|Hello |
| 2| 27|99900234|Bye |
| 3| 30|77563820|Help |
| 4| 33|89898937|Stuck |
| 5| 26|54877273|Need a genius|
| 6| 29|54645643|So close |
| 7| 30|22222222|Thanks |
you can simply do that with the last function:
from pyspark.sql import functions as F, Window
df.withColumn("Relation", F.last(F.when(F.col("ID") == 26, F.col("MPAN")), ignorenulls=True).over(
    Window.orderBy("idx"))).show()
|idx| ID| MPAN| Value|Relation|
| 1| 26|12345678|Hello |12345678|
| 2| 27|99900234|Bye |12345678|
| 3| 30|77563820|Help |12345678|
| 4| 33|89898937|Stuck |12345678|
| 5| 26|54877273|Need a genius|54877273|
| 6| 29|54645643|So close |54877273|
| 7| 30|22222222|Thanks |54877273|
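What last(..., ignorenulls=True) does over an ordered window is a forward fill: each row takes the most recent non-null value seen so far. A minimal plain-Python sketch of that behaviour on the sample rows follows. The function name relate and the tuple layout are assumptions made for the illustration.

```python
# (idx, ID, MPAN) tuples from the sample table above
rows = [
    (1, 26, 12345678),
    (2, 27, 99900234),
    (3, 30, 77563820),
    (4, 33, 89898937),
    (5, 26, 54877273),
    (6, 29, 54645643),
    (7, 30, 22222222),
]

def relate(rows, anchor_id=26):
    """Carry the MPAN of the latest anchor row forward, in idx order."""
    last_seen = None
    out = []
    for idx, id_, mpan in sorted(rows):  # sorting by idx plays the role of Window.orderBy("idx")
        if id_ == anchor_id:             # the when(ID == 26, MPAN) part
            last_seen = mpan
        out.append((idx, id_, mpan, last_seen))
    return out

relations = [r[3] for r in relate(rows)]
print(relations)
```

As in the Spark output, the ID 26 rows receive their own MPAN here. If they should be NULL as in the question, one option in Spark is to wrap the expression in F.when(F.col("ID") != 26, ...), which yields null where the condition is false.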

How to enrich dataframe by adding columns in specific condition in pyspark?

I have two different dataframes:
|user_id| movie_id|timestep|
| 100 | 1000 |20200728|
| 101 | 1001 |20200727|
| 101 | 1002 |20200726|
|movie_id| title | genre |
| 1000 |Toy Story|Adventure|Animation|Chil..|
| 1001 | Jumanji |Adventure|Children|Fantasy|
| 1002 | Iron Man|Action|Adventure|Sci-Fi |
How can I get a dataframe in the following format, so that I can build each user's taste profile and compare users by their similarity score?
|user_id| Action |Adventure|Animation|Children|Drama|
| 100 | 0 | 1 | 1 | 1 | 0 |
| 101 | 1 | 2 | 0 | 1 | 0 |
First, you need to split your "genre" column.
from pyspark.sql import functions as F
# escape | with a backslash because split takes a regex (raw string avoids an invalid escape warning)
movies = movies.withColumn("genre", F.explode(F.split("genre", r'\|')))
then you join
user_movie = users.join(movies, on='movie_id')
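and you pivot
user_movie.groupBy("user_id").pivot("genre").agg(F.count("*")).fillna(0).show()
|user_id|Action|Adventure|Animation|Children|Fantasy|Sci-Fi|
| 100| 0| 1| 1| 1| 0| 0|
| 101| 1| 2| 0| 1| 1| 1|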
FYI: the Drama column does not appear because no movie in the movies dataframe has the "Drama" genre. With your full data, you will get one column for each genre.
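The split, join and pivot steps amount to counting (user, genre) pairs. A minimal plain-Python sketch of that counting, useful for checking the expected numbers without Spark, could look like this. The Toy Story genre string is truncated in the question, so "Children" as its third genre is an assumption.

```python
from collections import Counter, defaultdict

users = [(100, 1000), (101, 1001), (101, 1002)]  # (user_id, movie_id)
movies = {
    1000: "Adventure|Animation|Children",  # third genre assumed, the source shows "Chil.."
    1001: "Adventure|Children|Fantasy",
    1002: "Action|Adventure|Sci-Fi",
}

profile = defaultdict(Counter)
for user_id, movie_id in users:                # the join on movie_id
    for genre in movies[movie_id].split("|"):  # str.split is literal, no regex escaping needed
        profile[user_id][genre] += 1           # the count per (user, genre)

# One column per genre, 0 where a user has no movie of that genre (the pivot + fillna(0))
genres = sorted({g for counts in profile.values() for g in counts})
table = {u: [profile[u][g] for g in genres] for u in sorted(profile)}

print(genres)
print(table)
```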

Time series with scala and spark. Rolling window

I'm trying to work on the following exercise using Scala and spark.
Given a file containing two columns: a time in seconds and a value
| seconds | value |
| 225 | 1,5 |
| 245 | 0,5 |
| 300 | 2,4 |
| 319 | 1,2 |
| 320 | 4,6 |
and given a value V to be used for the rolling window this output should be created:
Example with V=20
| seconds | value | num_row_in_window |sum_values_in_windows |
| 225 | 1,5 | 1 | 1,5 |
| 245 | 0,5 | 2 | 2 |
| 300 | 2,4 | 1 | 2,4 |
| 319 | 1,2 | 2 | 3,6 |
| 320 | 4,6 | 3 | 8,2 |
num_row_in_window is the number of rows contained in the current window and
sum_values_in_windows is the sum of the values contained in the current window.
I've been trying with the sliding function or using the sql api but it's a bit unclear to me which is the best solution to tackle this problem considering that I'm a spark/scala novice.
This is a perfect application for window-functions. By using rangeBetween you can set your sliding window to 20s. Note that in the example below no partitioning is specified (no partitionBy). Without a partitioning, this code will not scale:
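import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import ss.implicits._ // ss is the SparkSession

val df = Seq(
  (225, 1.5),
  (245, 0.5),
  (300, 2.4),
  (319, 1.2),
  (320, 4.6)
).toDF("seconds", "value")

// frame: all rows whose seconds lie in [current - 20, current]
val window = Window.orderBy($"seconds").rangeBetween(-20L, 0L) // add partitioning here

df
  .withColumn("num_row_in_window", sum(lit(1)).over(window))
  .withColumn("sum_values_in_window", sum($"value").over(window))
  .show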
|seconds|value|num_row_in_window|sum_values_in_window|
| 225| 1.5| 1| 1.5|
| 245| 0.5| 2| 2.0|
| 300| 2.4| 1| 2.4|
| 319| 1.2| 2| 3.6|
| 320| 4.6| 3| 8.2|
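The frame defined by rangeBetween(-20, 0) is value-based: for a row at time s it contains every row whose seconds fall in [s - 20, s]. A minimal plain-Python sketch of that rule, with a made-up function name rolling, reproduces the counts and sums. It is quadratic and only meant to illustrate the semantics, not to replace the window function.

```python
def rolling(rows, v):
    """For each (seconds, value) row, count and sum the rows within the last v seconds."""
    out = []
    for s, _ in rows:
        in_window = [val for t, val in rows if s - v <= t <= s]  # inclusive on both ends
        out.append((s, len(in_window), sum(in_window)))
    return out

rows = [(225, 1.5), (245, 0.5), (300, 2.4), (319, 1.2), (320, 4.6)]
result = rolling(rows, 20)

for seconds, count, total in result:
    print(seconds, count, round(total, 1))
```

The row at 245 includes 225 because the lower bound is inclusive (245 - 20 = 225), which matches the expected output in the question.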

How to iterate over pairs in a column in Scala

I have a data frame like this, imported from a parquet file:
| Store_id | Date_d_id |
| 0 | 23-07-2017 |
| 0 | 26-07-2017 |
| 0 | 01-08-2017 |
| 0 | 25-08-2017 |
| 1 | 01-01-2016 |
| 1 | 04-01-2016 |
| 1 | 10-01-2016 |
What I am trying to achieve next is to loop through each customer's dates in pairs and get the day difference. Here is what it should look like:
| Store_id | Date_d_id | Day_diff |
| 0 | 23-07-2017 | null |
| 0 | 26-07-2017 | 3 |
| 0 | 01-08-2017 | 6 |
| 0 | 25-08-2017 | 24 |
| 1 | 01-01-2016 | null |
| 1 | 04-01-2016 | 3 |
| 1 | 10-01-2016 | 6 |
And finally, I would like to reduce the data frame to the average day difference by customer:
| Store_id | avg_diff |
| 0 | 7.75 |
| 1 | 3 |
I am very new to Scala and I don't even know where to start. Any help is highly appreciated! Thanks in advance.
Also, I am using Zeppelin notebook
One approach would be to use lag(Date) over Window partition and a UDF to calculate the difference in days between consecutive rows, then follow by grouping the DataFrame for the average difference in days. Note that Date_d_id is converted to yyyy-mm-dd format for proper String ordering within the Window partitions:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val df = Seq(
(0, "23-07-2017"),
(0, "26-07-2017"),
(0, "01-08-2017"),
(0, "25-08-2017"),
(1, "01-01-2016"),
(1, "04-01-2016"),
(1, "10-01-2016")
).toDF("Store_id", "Date_d_id")
def daysDiff = udf(
  (d1: String, d2: String) => {
    import java.time.LocalDate
    import java.time.temporal.ChronoUnit.DAYS
    DAYS.between(LocalDate.parse(d1), LocalDate.parse(d2))
  }
)
val df2 = df.
  withColumn( "Date_ymd",
    regexp_replace($"Date_d_id", """(\d+)-(\d+)-(\d+)""", "$3-$2-$1")).
  withColumn( "Prior_date_ymd",
    lag("Date_ymd", 1).over(Window.partitionBy("Store_id").orderBy("Date_ymd"))).
  withColumn( "Days_diff",
    // the first row of each store has no prior date; it gets 0, as the output below shows
    when($"Prior_date_ymd".isNotNull, daysDiff($"Prior_date_ymd", $"Date_ymd")).otherwise(0L))
df2.show
// +--------+----------+----------+--------------+---------+
// |Store_id| Date_d_id| Date_ymd|Prior_date_ymd|Days_diff|
// +--------+----------+----------+--------------+---------+
// | 1|01-01-2016|2016-01-01| null| 0|
// | 1|04-01-2016|2016-01-04| 2016-01-01| 3|
// | 1|10-01-2016|2016-01-10| 2016-01-04| 6|
// | 0|23-07-2017|2017-07-23| null| 0|
// | 0|26-07-2017|2017-07-26| 2017-07-23| 3|
// | 0|01-08-2017|2017-08-01| 2017-07-26| 6|
// | 0|25-08-2017|2017-08-25| 2017-08-01| 24|
// +--------+----------+----------+--------------+---------+
val resultDF = df2.groupBy("Store_id").agg(avg("Days_diff").as("Avg_diff"))
// +--------+--------+
// |Store_id|Avg_diff|
// +--------+--------+
// | 1| 3.0|
// | 0| 8.25|
// +--------+--------+
You can use the lag function to get the previous date over a Window, then do some manipulation to get the final dataframe that you require.
First of all, the Date_d_id column needs to be converted to a timestamp for sorting to work correctly:
import org.apache.spark.sql.functions._
val timestapeddf = df.withColumn("Date_d_id", from_unixtime(unix_timestamp($"Date_d_id", "dd-MM-yyyy")))
which should give your dataframe as
|Store_id| Date_d_id|
| 0|2017-07-23 00:00:00|
| 0|2017-07-26 00:00:00|
| 0|2017-08-01 00:00:00|
| 0|2017-08-25 00:00:00|
| 1|2016-01-01 00:00:00|
| 1|2016-01-04 00:00:00|
| 1|2016-01-10 00:00:00|
then you can apply the lag function over window function and finally get the date difference as
import org.apache.spark.sql.expressions._
val windowSpec = Window.partitionBy("Store_id").orderBy("Date_d_id")
val laggeddf = timestapeddf.withColumn("Day_diff", when(lag("Date_d_id", 1).over(windowSpec).isNull, null).otherwise(datediff($"Date_d_id", lag("Date_d_id", 1).over(windowSpec))))
laggeddf should be
|Store_id|Date_d_id |Day_diff|
|0 |2017-07-23 00:00:00|null |
|0 |2017-07-26 00:00:00|3 |
|0 |2017-08-01 00:00:00|6 |
|0 |2017-08-25 00:00:00|24 |
|1 |2016-01-01 00:00:00|null |
|1 |2016-01-04 00:00:00|3 |
|1 |2016-01-10 00:00:00|6 |
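now the final step is to use groupBy and aggregation to find the average (avg skips null values)
laggeddf.groupBy("Store_id").agg(avg("Day_diff").as("avg_diff"))
which should give you
|Store_id|avg_diff|
| 0| 11.0|
| 1| 4.5|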
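Now if you want the null Day_diff rows to count as 0 in the average instead, one way is to fill them before aggregating
laggeddf.na.fill(0, Seq("Day_diff")).groupBy("Store_id").agg(avg("Day_diff").as("avg_diff"))
which should give you
|Store_id|avg_diff|
| 0| 8.25|
| 1| 3.0|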
I hope the answer is helpful
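The two sets of averages in these answers (11.0 / 4.5 versus 8.25 / 3.0) differ only in how the first row of each store, which has no previous date, is treated. A minimal plain-Python sketch with made-up helper names shows both conventions on the sample data.

```python
from datetime import datetime

rows = [
    (0, "23-07-2017"), (0, "26-07-2017"), (0, "01-08-2017"), (0, "25-08-2017"),
    (1, "01-01-2016"), (1, "04-01-2016"), (1, "10-01-2016"),
]

def day_diffs(rows):
    """Per store: None for the first date, then days since the previous date."""
    by_store = {}
    for store, d in rows:
        by_store.setdefault(store, []).append(datetime.strptime(d, "%d-%m-%Y").date())
    diffs = {}
    for store, dates in by_store.items():
        dates.sort()  # real dates sort chronologically, unlike dd-MM-yyyy strings
        diffs[store] = [None] + [(b - a).days for a, b in zip(dates, dates[1:])]
    return diffs

diffs = day_diffs(rows)

# Convention 1: skip the missing first difference (what an average that ignores nulls does)
avg_skip_null = {
    s: sum(x for x in d if x is not None) / sum(x is not None for x in d)
    for s, d in diffs.items()
}
# Convention 2: count the missing first difference as 0
avg_null_as_zero = {s: sum(x or 0 for x in d) / len(d) for s, d in diffs.items()}

print(diffs)
print(avg_skip_null)
print(avg_null_as_zero)
```

Which convention is right depends on whether the first visit should contribute to the average. Neither reproduces the 7.75 shown in the question for store 0.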