Forward-fill missing data in PySpark not working - pyspark

I have a simple dataset as shown under.
| id| name| country| languages|
|1 | Bob| USA| Spanish|
|2 | Angelina| France| null|
|3 | Carl| Brazil| null|
|4 | John| Australia| English|
|5 | Anne| Nepal| null|
I am trying to impute the null values in languages with the last non-null value, using pyspark.sql.window to create a window over the preceding rows, but nothing happens. The column that is supposed to have its nulls filled, temp_filled_spark, remains unchanged, i.e. it is a copy of the original languages column.
import sys
from pyspark.sql import Window
from pyspark.sql.functions import last
window = Window.partitionBy('name').orderBy('country').rowsBetween(-sys.maxsize, 0)
filled_column = last(df['languages'], ignorenulls=True).over(window)
df = df.withColumn('temp_filled_spark', filled_column)
df.orderBy('name', 'country').show(100)
I expect the output column to be:
|temp_filled_spark|
| Spanish|
| Spanish|
| Spanish|
| English|
| English|
Could anybody help point out the mistake?

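Partitioning by name puts every row in its own partition (each name appears only once), so the window never sees an earlier non-null value. Instead, we can create a window that treats the entire dataframe as one partition:
from pyspark.sql import Window
from pyspark.sql import functions as F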
>>> df1.show()
+---+--------+---------+---------+
| id| name| country|languages|
+---+--------+---------+---------+
| 1| Bob| USA| Spanish|
| 2|Angelina| France| null|
| 3| Carl| Brazil| null|
| 4| John|Australia| English|
| 5| Anne| Nepal| null|
+---+--------+---------+---------+
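>>> # order by id so that "last non-null value" is well defined; ordering by a constant leaves the row order unspecified
>>> w = Window.partitionBy(F.lit(1)).orderBy('id').rowsBetween(Window.unboundedPreceding, Window.currentRow)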
>>> df1.select("*",F.last('languages',True).over(w).alias('newcol')).show()
+---+--------+---------+---------+-------+
| id| name| country|languages| newcol|
+---+--------+---------+---------+-------+
| 1| Bob| USA| Spanish|Spanish|
| 2|Angelina| France| null|Spanish|
| 3| Carl| Brazil| null|Spanish|
| 4| John|Australia| English|English|
| 5| Anne| Nepal| null|English|
+---+--------+---------+---------+-------+
Hope this helps!
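To see why the choice of partition matters, here is a minimal plain-Python sketch (not Spark code) that mimics last(..., ignorenulls=True) over a window running from the start of the partition to the current row. The helper name forward_fill and the tuple layout are assumptions made for illustration only.

```python
from itertools import groupby

# Rows from the question: (id, name, country, languages)
rows = [
    (1, "Bob", "USA", "Spanish"),
    (2, "Angelina", "France", None),
    (3, "Carl", "Brazil", None),
    (4, "John", "Australia", "English"),
    (5, "Anne", "Nepal", None),
]


def forward_fill(rows, partition_key, order_key, value_index=3):
    """Hypothetical helper: last non-null value seen so far within each partition."""
    filled = {}
    ordered = sorted(rows, key=lambda r: (partition_key(r), order_key(r)))
    for _, group in groupby(ordered, key=partition_key):
        last_seen = None  # reset at every partition boundary
        for row in group:
            if row[value_index] is not None:
                last_seen = row[value_index]
            filled[row[0]] = last_seen
    # Report results in the original row order (by id)
    return [filled[r[0]] for r in rows]


# One partition per name: every partition holds a single row, so nothing is filled.
by_name = forward_fill(rows, partition_key=lambda r: r[1], order_key=lambda r: r[2])

# One partition for the whole dataset, ordered by id: nulls take the previous value.
single_partition = forward_fill(rows, partition_key=lambda r: 1, order_key=lambda r: r[0])

print(by_name)
print(single_partition)
```

The first call reproduces the unchanged column from the question, the second reproduces the expected output.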

Related

PySpark - Getting the latest date less than another given date

I need some help. I have two dataframes, one has a few dates and the other has my significant data, catalogued by date.
It goes something like this:
First df, with the relevant data
+------+----------+---------------+
| id| test_date| score|
+------+----------+---------------+
| 1|2021-03-31| 94|
| 1|2021-01-31| 93|
| 1|2020-12-31| 100|
| 1|2020-06-30| 95|
| 1|2019-10-31| 58|
| 1|2017-10-31| 78|
| 2|2020-01-31| 79|
| 2|2018-03-31| 66|
| 2|2016-05-31| 77|
| 3|2021-05-31| 97|
| 3|2020-07-31| 100|
| 3|2019-07-31| 99|
| 3|2019-06-30| 98|
| 3|2018-07-31| 91|
| 3|2018-02-28| 86|
| 3|2017-11-30| 82|
+------+----------+---------------+
Second df, with the dates
+--------------+--------------+--------------+
| eval_date_1| eval_date_2| eval_date_3|
+--------------+--------------+--------------+
| 2021-01-31| 2020-10-31| 2019-06-30|
+--------------+--------------+--------------+
Needed DF
+------+--------------+---------+--------------+---------+--------------+---------+
| id| eval_date_1| score_1 | eval_date_2| score_2 | eval_date_3| score_3 |
+------+--------------+---------+--------------+---------+--------------+---------+
| 1| 2021-01-31| 93| 2020-10-31| 95| 2019-06-30| 78|
| 2| 2021-01-31| 79| 2020-10-31| 79| 2019-06-30| 66|
| 3| 2021-01-31| 100| 2020-10-31| 100| 2019-06-30| 98|
+------+--------------+---------+--------------+---------+--------------+---------+
So, for instance, for the first id, the needed df takes the scores from the second, fourth and sixth rows of the first df. Those are the most recent test dates that are equal to or earlier than each eval_date in the second df.
Assuming df is your main dataframe and df_date is the one which contains only dates.
from functools import reduce
from pyspark.sql import functions as F, Window as W
df_final = reduce(
    lambda a, b: a.join(b, on="id"),
    (
        df.join(
            F.broadcast(df_date.select(f"eval_date_{i}")),
            on=F.col(f"eval_date_{i}") >= F.col("test_date"),
        )
        .withColumn(
            "rnk",
            F.row_number().over(W.partitionBy("id").orderBy(F.col("test_date").desc())),
        )
        .where("rnk=1")
        .select("id", f"eval_date_{i}", "score")
        for i in range(1, 4)
    ),
)
df_final.show()
+---+-----------+-----+-----------+-----+-----------+-----+
| id|eval_date_1|score|eval_date_2|score|eval_date_3|score|
+---+-----------+-----+-----------+-----+-----------+-----+
| 1| 2021-01-31| 93| 2020-10-31| 95| 2019-06-30| 78|
| 3| 2021-01-31| 100| 2020-10-31| 100| 2019-06-30| 98|
| 2| 2021-01-31| 79| 2020-10-31| 79| 2019-06-30| 66|
+---+-----------+-----+-----------+-----+-----------+-----+
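The "latest test_date on or before the eval date" rule can also be checked outside Spark. The following is a plain-Python sketch using the data from the question; the helper name score_as_of is hypothetical, and it relies on ISO-formatted date strings sorting chronologically.

```python
from bisect import bisect_right
from collections import defaultdict

# (id, test_date, score) rows from the question
tests = [
    (1, "2021-03-31", 94), (1, "2021-01-31", 93), (1, "2020-12-31", 100),
    (1, "2020-06-30", 95), (1, "2019-10-31", 58), (1, "2017-10-31", 78),
    (2, "2020-01-31", 79), (2, "2018-03-31", 66), (2, "2016-05-31", 77),
    (3, "2021-05-31", 97), (3, "2020-07-31", 100), (3, "2019-07-31", 99),
    (3, "2019-06-30", 98), (3, "2018-07-31", 91), (3, "2018-02-28", 86),
    (3, "2017-11-30", 82),
]
eval_dates = ["2021-01-31", "2020-10-31", "2019-06-30"]

# Group the (test_date, score) pairs per id
history_by_id = defaultdict(list)
for id_, test_date, score in tests:
    history_by_id[id_].append((test_date, score))


def score_as_of(history, eval_date):
    """Hypothetical helper: score of the latest test on or before eval_date, else None."""
    ordered = sorted(history)  # ascending by date
    dates = [d for d, _ in ordered]
    pos = bisect_right(dates, eval_date)  # number of tests with date <= eval_date
    return ordered[pos - 1][1] if pos else None


result = {
    id_: [score_as_of(history, d) for d in eval_dates]
    for id_, history in history_by_id.items()
}
print(result)
```

Each list holds the scores for eval_date_1, eval_date_2 and eval_date_3 in that order, matching the rows of the needed DF.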

How to perform one to many mapping on spark scala dataframe column using flatmaps

I am specifically looking for a flatMap solution for mocking data in a Spark Scala dataframe: I want to duplicate rows using a one-to-many mapping inside flatMap.
My given data is something like this
|id |name|marks|
+---+----+-----+
|1 |ABCD|12 |
|2 |CDEF|12 |
|3 |FGHI|14 |
+---+----+-----+
and my expectation after doing 1 to 3 mapping of the id column will be something like this
|id |name|marks|
+---+----+-----+
|1 |ABCD|12 |
|2 |CDEF|12 |
|3 |FGHI|14 |
|2 |null|null |
|3 |null|null |
|1 |null|null |
|2 |null|null |
|1 |null|null |
|3 |null|null |
+---+----+-----+
Please feel free to let me know if there is any clarification required on the requirement part
Thanks in advance!!!
I see that you are attempting to generate data with a requirement of re-using values in the ID column.
You can just select the ID column and generate random values and do a union back to your original dataset.
For example:
val data = Seq((1,"asd",15), (2,"asd",20), (3,"test",99)).toDF("id","testName","marks")
+---+--------+-----+
| id|testName|marks|
+---+--------+-----+
| 1| asd| 15|
| 2| asd| 20|
| 3| test| 99|
+---+--------+-----+
import org.apache.spark.sql.functions.{concat, lit, rand}
import org.apache.spark.sql.types._
val newRecords = data.select("id")
  .withColumn("testName", concat(lit("name_"), (rand() * 10).cast(IntegerType).cast(StringType)))
  .withColumn("marks", (rand() * 100).cast(IntegerType))
+---+--------+-----+
| id|testName|marks|
+---+--------+-----+
| 1| name_2| 35|
| 2| name_9| 20|
| 3| name_3| 7|
+---+--------+-----+
val result = data.union(newRecords)
+---+--------+-----+
| id|testName|marks|
+---+--------+-----+
| 1| asd| 15|
| 2| asd| 20|
| 3| test| 99|
| 1| name_2| 35|
| 2| name_9| 20|
| 3| name_3| 7|
+---+--------+-----+
You can run the randomisation portion of the code in a loop and union all the generated dataframes.
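For reference, the one-to-many shape the question asks for can be sketched in plain Python, where chain.from_iterable plays the role of flatMap. This only illustrates the mapping itself, not Spark code; the helper name expand and the copies parameter are assumptions.

```python
from itertools import chain

# (id, name, marks) rows from the question
rows = [(1, "ABCD", 12), (2, "CDEF", 12), (3, "FGHI", 14)]


def expand(row, copies=3):
    """Hypothetical helper: emit the original row plus id-only duplicates, `copies` rows in total."""
    id_ = row[0]
    return [row] + [(id_, None, None)] * (copies - 1)


# flatMap equivalent: map every row to a list, then flatten the lists
result = list(chain.from_iterable(expand(r) for r in rows))

for r in result:
    print(r)
```

The rows come out grouped by id rather than in the shuffled order shown in the question, but the set of rows is the same.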

Spark adding indexes to dataframe and append other dataset that doesn't have index

I have a dataset that has column userid and index values.
+---------+--------+
| userid | index|
+---------+--------+
| user1| 1|
| user2| 2|
| user3| 3|
| user4| 4|
| user5| 5|
| user6| 6|
| user7| 7|
| user8| 8|
| user9| 9|
| user10| 10|
+---------+--------+
I want to append a new data frame to it and add an index to the newly added rows.
The userid is unique, and the existing data frame will not contain any of the user ids from the second data frame:
+----------+
| userid |
+----------+
| user11|
| user21|
| user41|
| user51|
| user64|
+----------+
The expected output with newly added userid and index
+---------+--------+
| userid | index|
+---------+--------+
| user1| 1|
| user2| 2|
| user3| 3|
| user4| 4|
| user5| 5|
| user6| 6|
| user7| 7|
| user8| 8|
| user9| 9|
| user10| 10|
| user11| 11|
| user21| 12|
| user41| 13|
| user51| 14|
| user64| 15|
+---------+--------+
Is it possible to achieve this by passing in the max index value and starting the index for the second Dataframe from that value?
If the userid has some ordering, then you can use the row_number function. Even if it does not, you can add an id using monotonically_increasing_id(). For now I assume that userid can be ordered. Then you can do this:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
df_merge = df1.select('userid').union(df2.select('userid'))
w=Window.orderBy('userid')
df_result = df_merge.withColumn('indexid',F.row_number().over(w))
EDIT: After the discussion in the comments.
#%% Test data and imports
import pyspark.sql.functions as F
from pyspark.sql import Window
df = sqlContext.createDataFrame([('a',100),('ab',50),('ba',300),('ced',60),('d',500)],schema=['userid','index'])
df1 = sqlContext.createDataFrame([('fgh',100),('ff',50),('fe',300),('er',60),('fi',500)],schema=['userid','dummy'])
#%%
#%% Merge the two dataframes, with a null column as the index
df1 = df1.withColumn('index', F.lit(None).cast('long'))
df_merge = df.select(df.columns).union(df1.select(df.columns))
#%% Define a window that puts the newly added rows last and orders them by userid
#%% The user ids, even though random strings, can be ordered
# If possible add a partition column here, otherwise all your data ends up in one partition; consider salting
w = Window.orderBy(F.col('index').asc_nulls_last(), F.col('userid'))
#%% For the newly added rows, define the index as the maximum existing index
#%% plus the running row count over the whole window (existing rows included)
df_final = df_merge.withColumn(
    "index_new",
    F.when(~F.col('index').isNull(), F.col('index')).otherwise(
        F.last(F.col('index'), ignorenulls=True).over(w) + F.sum(F.lit(1)).over(w)
    ),
)
#%% If the number of rows in the main dataframe is huge, then add an offset in the above line
df_final.show()
+------+-----+---------+
|userid|index|index_new|
+------+-----+---------+
| ab| 50| 50|
| ced| 60| 60|
| a| 100| 100|
| ba| 300| 300|
| d| 500| 500|
| er| null| 506|
| fe| null| 507|
| ff| null| 508|
| fgh| null| 509|
| fi| null| 510|
+------+-----+---------+
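If consecutive indexes are wanted, as in the expected output of the question, the idea of starting from the current maximum can be sketched in plain Python as follows. This is an illustration of the numbering logic only, under the assumption that the new user ids are assigned in sorted order.

```python
# Existing (userid, index) pairs and the new user ids, taken from the question
existing = [(f"user{i}", i) for i in range(1, 11)]
new_ids = ["user11", "user21", "user41", "user51", "user64"]

# Start numbering right after the largest index already in use
start = max(index for _, index in existing) + 1

# Assumption: new ids are numbered in sorted order so the result is deterministic
appended = existing + [
    (userid, index) for index, userid in enumerate(sorted(new_ids), start=start)
]

for userid, index in appended[-5:]:
    print(userid, index)
```

In Spark terms this corresponds to adding the maximum existing index to a row_number computed over the new rows only.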

How to find the max length unique rows from a dataframe with spark?

I am trying to find the unique rows (based on id) that have the maximum length value in a Spark dataframe. Each Column has a value of string type.
The dataframe is like:
+-----+---+----+---+---+
|id | A | B | C | D |
+-----+---+----+---+---+
|1 |toto|tata|titi| |
|1 |toto|tata|titi|tutu|
|2 |bla |blo | | |
|3 |b | c | | d |
|3 |b | c | a | d |
+-----+---+----+---+---+
The expectation is:
+-----+---+----+---+---+
|id | A | B | C | D |
+-----+---+----+---+---+
|1 |toto|tata|titi|tutu|
|2 |bla |blo | | |
|3 |b | c | a | d |
+-----+---+----+---+---+
I can't figure how to do this using Spark easily...
Thanks in advance
Note: This approach takes care of any addition/deletion of columns to the DataFrame, without the need of code change.
It can be done by first computing, for each row, the length of the concatenation of all columns except the first, and then filtering out every row except those with the maximum length for their id.
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
val output = input.withColumn("rowLength", length(concat(input.columns.toList.drop(1).map(col): _*)))
  .withColumn("maxLength", max($"rowLength").over(Window.partitionBy($"id")))
  .filter($"rowLength" === $"maxLength")
  .drop("rowLength", "maxLength")
scala> df.show
+---+----+----+----+----+
| id| A| B| C| D|
+---+----+----+----+----+
| 1|toto|tata|titi| |
| 1|toto|tata|titi|tutu|
| 2| bla| blo| | |
| 3| b| c| | d|
| 3| b| c| a| d|
+---+----+----+----+----+
scala> df.groupBy("id").agg(concat_ws("",collect_set(col("A"))).alias("A"),concat_ws("",collect_set(col("B"))).alias("B"),concat_ws("",collect_set(col("C"))).alias("C"),concat_ws("",collect_set(col("D"))).alias("D")).show
+---+----+----+----+----+
| id| A| B| C| D|
+---+----+----+----+----+
| 1|toto|tata|titi|tutu|
| 2| bla| blo| | |
| 3| b| c| a| d|
+---+----+----+----+----+
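The "longest row per id" rule used by the window-based approach can be sketched in plain Python for comparison. This is only an illustration with the data from the question; unlike the Spark filter, which keeps every row tied for the maximum, this sketch keeps the first such row per id.

```python
# (id, A, B, C, D) rows from the question, with empty strings for blanks
rows = [
    (1, "toto", "tata", "titi", ""),
    (1, "toto", "tata", "titi", "tutu"),
    (2, "bla", "blo", "", ""),
    (3, "b", "c", "", "d"),
    (3, "b", "c", "a", "d"),
]

best = {}  # id -> (total length, row)
for row in rows:
    # Total length of all columns except the id
    total = sum(len(value) for value in row[1:])
    if row[0] not in best or total > best[row[0]][0]:
        best[row[0]] = (total, row)

result = [row for _, row in best.values()]

for row in result:
    print(row)
```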

scala filtering out rows in a joined df based on 2 columns with same values - best way

I'm comparing 2 dataframes, column by column.
I created 2 smaller dataframes from the parent dataframes, based on the join columns and the comparison columns:
Created 1st dataframe:
val df1_subset = df1.select(subset_cols.head, subset_cols.tail: _*)
+----------+---------+-------------+
|first_name|last_name|loyalty_score|
+----------+---------+-------------+
| tom | cruise| 66|
| blake | lively| 66|
| eva| green| 44|
| brad| pitt| 99|
| jason| momoa| 34|
| george | clooney| 67|
| ed| sheeran| 88|
| lionel| messi| 88|
| ryan| reynolds| 45|
| will | smith| 67|
| null| null| |
+----------+---------+-------------+
Created 2nd Dataframe:
val df1_1_subset = df1_1.select(subset_cols.head, subset_cols.tail: _*)
+----------+---------+-------------+
|first_name|last_name|loyalty_score|
+----------+---------+-------------+
| tom | cruise| 34|
| brad| pitt| 78|
| eva| green| 56|
| tom | cruise| 99|
| jason| momoa| 34|
| george | clooney| 67|
| george | clooney| 88|
| lionel| messi| 88|
| ryan| reynolds| 45|
| will | smith| 67|
| kyle| jenner| 56|
| celena| gomez| 2|
+----------+---------+-------------+
Then I joined the 2 subsets with a full outer join to get the following:
val df_subset_joined = df1_subset.join(df1_1_subset, joinColsArray, "full_outer")
Joined Subset
+----------+---------+-------------+-------------+
|first_name|last_name|loyalty_score|loyalty_score|
+----------+---------+-------------+-------------+
| will | smith| 67| 67|
| george | clooney| 67| 67|
| george | clooney| 67| 88|
| blake | lively| 66| null|
| celena| gomez| null| 2|
| eva| green| 44| 56|
| null| null| | null|
| jason| momoa| 34| 34|
| ed| sheeran| 88| null|
| lionel| messi| 88| 88|
| kyle| jenner| null| 56|
| tom | cruise| 66| 34|
| tom | cruise| 66| 99|
| brad| pitt| 99| 78|
| ryan| reynolds| 45| 45|
+----------+---------+-------------+-------------+
Then I tried to filter out the rows whose values are the same in both comparison columns (loyalty_score in this example) by using column positions:
df_subset_joined.filter(_c2 != _c3).show
But that didn't work. I'm getting the following error:
Error:(174, 33) not found: value _c2
df_subset_joined.filter(_c2 != _c3).show
What is the most efficient way for me to get a joined dataframe where I only see the rows that do not match in the comparison columns?
I would like to keep this dynamic so hard coding column names is not an option.
Thank you for helping me understand this.
You need to work with aliases and make use of the null-safe comparison operator <=> (https://spark.apache.org/docs/latest/api/sql/index.html#_9), see also https://stackoverflow.com/a/54067477/1138523
val df_subset_joined = df1_subset.as("a").join(df1_1_subset.as("b"), joinColsArray, "full_outer")
df_subset_joined.filter(!($"a.loyalty_score" <=> $"b.loyalty_score")).show
EDIT: for dynamic column names, you can use string interpolation
import org.apache.spark.sql.functions.col
val xxx : String = ???
df_subset_joined.filter(!(col(s"a.$xxx") <=> col(s"b.$xxx"))).show
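Why the null-safe operator matters here can be shown with a small plain-Python model of SQL comparison semantics. The function names sql_equals and null_safe_equals are hypothetical, and the score pairs are a few rows taken from the joined subset above, with None standing in for null.

```python
def sql_equals(a, b):
    """Model of SQL '=': the result is unknown (None) when either side is NULL."""
    if a is None or b is None:
        return None
    return a == b


def null_safe_equals(a, b):
    """Model of SQL '<=>': NULL <=> NULL is true, NULL <=> value is false."""
    if a is None or b is None:
        return a is None and b is None
    return a == b


# (left loyalty_score, right loyalty_score) pairs from the joined subset
pairs = [(67, 67), (66, None), (None, 2), (44, 56), (34, 34)]

# A filter keeps a row only when its condition is true, so rows where '='
# yields unknown are dropped even though the two sides clearly differ.
plain_mismatches = [p for p in pairs if sql_equals(*p) is False]

# With the null-safe comparison the rows that exist on one side only are kept.
null_safe_mismatches = [p for p in pairs if not null_safe_equals(*p)]

print(plain_mismatches)
print(null_safe_mismatches)
```

The plain comparison silently loses the rows that are present in only one of the two dataframes, which are usually the ones a comparison is meant to surface.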