Pyspark remove duplicates base 2 columns - pyspark

I have the next df in pyspark:
+---------+----------+--------+-----+----------+------+
|firstname|middlename|lastname| ncf| date|salary|
+---------+----------+--------+-----+----------+------+
| James| | V|36636|2021-09-03| 3000| remove
| Michael| Rose| |40288|2021-09-10| 4000|
| Robert| |Williams|42114|2021-08-03| 4000|
| Maria| Anne| Jones|39192|2021-05-13| 4000|
| Jen| Mary| Brown| |2020-09-03| -1|
| James| | Smith|36636|2021-09-03| 3000| remove
| James| | Smith|36636|2021-09-04| 3000|
+---------+----------+--------+-----+----------+------+
I need remove rows where ncf and date were equal. The df result will be:
+---------+----------+--------+-----+----------+------+
|firstname|middlename|lastname| ncf| date|salary|
+---------+----------+--------+-----+----------+------+
| Michael| Rose| |40288|2021-09-10| 4000|
| Robert| |Williams|42114|2021-08-03| 4000|
| Maria| Anne| Jones|39192|2021-05-13| 4000|
| Jen| Mary| Brown| |2020-09-03| -1|
| James| | Smith|36636|2021-09-04| 3000|
+---------+----------+--------+-----+----------+------+

dropDuplicates method helps with removing duplicates with in a subset of columns.
df.dropDuplicates(['ncf', 'date'])

You can use window functions to count if there are two or more rows with your conditions
from pyspark.sql import functions as F
from pyspark.sql import Window as W
df.withColumn('duplicated', F.count('*').over(W.partitionBy('ncf', 'date').orderBy(F.lit(1))) > 1)
# +---------+----------+--------+-----+----------+------+----------+
# |firstname|middlename|lastname| ncf| date|salary|duplicated|
# +---------+----------+--------+-----+----------+------+----------+
# | Jen| Mary| Brown| |2020-09-03| -1| false|
# | James| | V|36636|2021-09-03| 3000| true|
# | James| | Smith|36636|2021-09-03| 3000| true|
# | Michael| Rose| |40288|2021-09-10| 4000| false|
# | Robert| |Williams|42114|2021-08-03| 4000| false|
# | James| | Smith|36636|2021-09-04| 3000| false|
# | Maria| Anne| Jones|39192|2021-05-13| 4000| false|
# +---------+----------+--------+-----+----------+------+----------+
You now can use duplicated to filter rows as desired.

Related

PySpark - Getting the latest date less than another given date

I need some help. I have two dataframes, one has a few dates and the other has my significant data, catalogued by date.
It goes something like this:
First df, with the relevant data
+------+----------+---------------+
| id| test_date| score|
+------+----------+---------------+
| 1|2021-03-31| 94|
| 1|2021-01-31| 93|
| 1|2020-12-31| 100|
| 1|2020-06-30| 95|
| 1|2019-10-31| 58|
| 1|2017-10-31| 78|
| 2|2020-01-31| 79|
| 2|2018-03-31| 66|
| 2|2016-05-31| 77|
| 3|2021-05-31| 97|
| 3|2020-07-31| 100|
| 3|2019-07-31| 99|
| 3|2019-06-30| 98|
| 3|2018-07-31| 91|
| 3|2018-02-28| 86|
| 3|2017-11-30| 82|
+------+----------+---------------+
Second df, with the dates
+--------------+--------------+--------------+
| eval_date_1| eval_date_2| eval_date_3|
+--------------+--------------+--------------+
| 2021-01-31| 2020-10-31| 2019-06-30|
+--------------+--------------+--------------+
Needed DF
+------+--------------+---------+--------------+---------+--------------+---------+
| id| eval_date_1| score_1 | eval_date_2| score_2 | eval_date_3| score_3 |
+------+--------------+---------+--------------+---------+--------------+---------+
| 1| 2021-01-31| 93| 2020-10-31| 95| 2019-06-30| 78|
| 2| 2021-01-31| 79| 2020-10-31| 79| 2019-06-30| 66|
| 3| 2021-01-31| 100| 2020-10-31| 100| 2019-06-30| 98|
+------+--------------+---------+--------------+---------+--------------+---------+
So, for instance, for the first id, the needed df takes the scores from the second, fourth and sixth rows from the first df. Those are the most updated dates that stay equal to or below the eval_date on the second df.
Assuming df is your main dataframe and df_date is the one which contains only dates.
from functools import reduce
from pyspark.sql import functions as F, Window as W
df_final = reduce(
lambda a, b: a.join(b, on="id"),
(
df.join(
F.broadcast(df_date.select(f"eval_date_{i}")),
on=F.col(f"eval_date_{i}") >= F.col("test_date"),
)
.withColumn(
"rnk",
F.row_number().over(W.partitionBy("id").orderBy(F.col("test_date").desc())),
)
.where("rnk=1")
.select("id", f"eval_date_{i}", "score")
for i in range(1, 4)
),
)
df_final.show()
+---+-----------+-----+-----------+-----+-----------+-----+
| id|eval_date_1|score|eval_date_2|score|eval_date_3|score|
+---+-----------+-----+-----------+-----+-----------+-----+
| 1| 2021-01-31| 93| 2020-10-31| 95| 2019-06-30| 78|
| 3| 2021-01-31| 100| 2020-10-31| 100| 2019-06-30| 98|
| 2| 2021-01-31| 79| 2020-10-31| 79| 2019-06-30| 66|
+---+-----------+-----+-----------+-----+-----------+-----+

Create new column from existing Dataframe

I am having a Dataframe and trying to create a new column from existing columns based on following condition.
Group data by column named event_type
Only filter those rows where column source has value train and call it X.
Values for new column are X.sum / X.length
Here is input Dataframe
+-----+-------------+----------+--------------+------+
| id| event_type| location|fault_severity|source|
+-----+-------------+----------+--------------+------+
| 6597|event_type 11|location 1| -1| test|
| 8011|event_type 15|location 1| 0| train|
| 2597|event_type 15|location 1| -1| test|
| 5022|event_type 15|location 1| -1| test|
| 5022|event_type 11|location 1| -1| test|
| 6852|event_type 11|location 1| -1| test|
| 6852|event_type 15|location 1| -1| test|
| 5611|event_type 15|location 1| -1| test|
|14838|event_type 15|location 1| -1| test|
|14838|event_type 11|location 1| -1| test|
| 2588|event_type 15|location 1| 0| train|
| 2588|event_type 11|location 1| 0| train|
+-----+-------------+----------+--------------+------+
and i want following output.
+--------------+------------+-----------+
| | event_type | PercTrain |
+--------------+------------+-----------+
|event_type 11 | 7888 | 0.388945 |
|event_type 35 | 6615 | 0.407105 |
|event_type 34 | 5927 | 0.406783 |
|event_type 15 | 4395 | 0.392264 |
|event_type 20 | 1458 | 0.382030 |
+--------------+------------+-----------+
I have tried this code but this throws error
EventSet.withColumn("z" , when($"source" === "train" , sum($"source") / length($"source"))).groupBy("fault_severity").count().show()
Here EventSet is input dataframe
Python code that gives desired output is
event_type_unq['PercTrain'] = event_type.pivot_table(values='source',index='event_type',aggfunc=lambda x: sum(x=='train')/float(len(x)))
I guess you want to obtain the percentage of train values. So, here is my code,
val df2 = df.select($"event_type", $"source").groupBy($"event_type").pivot($"source").agg(count($"source")).withColumn("PercTrain", $"train" / ($"train" + $"test")).show
and gives the result as follows:
+-------------+----+-----+------------------+
| event_type|test|train| PercTrain|
+-------------+----+-----+------------------+
|event_type 11| 4| 1| 0.2|
|event_type 15| 5| 2|0.2857142857142857|
+-------------+----+-----+------------------+
Hope to be helpful.

scala filtering out rows in a joined df based on 2 columns with same values - best way

Im comparing 2 dataframes.
I choose to compare them column by column
I created 2 smaller dataframes from the parent dataframes.
based on join columns and the comparison columns:
Created 1st dataframe:
val df1_subset = df1.select(subset_cols.head, subset_cols.tail: _*)
+----------+---------+-------------+
|first_name|last_name|loyalty_score|
+----------+---------+-------------+
| tom | cruise| 66|
| blake | lively| 66|
| eva| green| 44|
| brad| pitt| 99|
| jason| momoa| 34|
| george | clooney| 67|
| ed| sheeran| 88|
| lionel| messi| 88|
| ryan| reynolds| 45|
| will | smith| 67|
| null| null| |
+----------+---------+-------------+
Created 2nd Dataframe:
val df1_1_subset = df1_1.select(subset_cols.head, subset_cols.tail: _*)
+----------+---------+-------------+
|first_name|last_name|loyalty_score|
+----------+---------+-------------+
| tom | cruise| 34|
| brad| pitt| 78|
| eva| green| 56|
| tom | cruise| 99|
| jason| momoa| 34|
| george | clooney| 67|
| george | clooney| 88|
| lionel| messi| 88|
| ryan| reynolds| 45|
| will | smith| 67|
| kyle| jenner| 56|
| celena| gomez| 2|
+----------+---------+-------------+
Then I joined the 2 subsets
I joined these as a full outer join to get the following:
val df_subset_joined = df1_subset.join(df1_1_subset, joinColsArray, "full_outer")
Joined Subset
+----------+---------+-------------+-------------+
|first_name|last_name|loyalty_score|loyalty_score|
+----------+---------+-------------+-------------+
| will | smith| 67| 67|
| george | clooney| 67| 67|
| george | clooney| 67| 88|
| blake | lively| 66| null|
| celena| gomez| null| 2|
| eva| green| 44| 56|
| null| null| | null|
| jason| momoa| 34| 34|
| ed| sheeran| 88| null|
| lionel| messi| 88| 88|
| kyle| jenner| null| 56|
| tom | cruise| 66| 34|
| tom | cruise| 66| 99|
| brad| pitt| 99| 78|
| ryan| reynolds| 45| 45|
+----------+---------+-------------+-------------+
Then I tried to filter out the elements that are same in both comparison columns (loyalty_scores in this example) by using column positions
df_subset_joined.filter(_c2 != _c3).show
But that didnt work. Im getting the following error:
Error:(174, 33) not found: value _c2
df_subset_joined.filter(_c2 != _c3).show
What is the most efficient way for me to get a joined dataframe, where I only see the rows that do not match in the comparison columns.
I would like to keep this dynamic so hard coding column names is not an option.
Thank you for helping me understand this.
you need wo work with aliases and make us of the null-safe comparison operator (https://spark.apache.org/docs/latest/api/sql/index.html#_9), see also https://stackoverflow.com/a/54067477/1138523
val df_subset_joined = df1_subset.as("a").join(df1_1_subset.as("b"), joinColsArray, "full_outer")
df_subset_joined.filter(!($"a.loyality_score" <=> $"b.loyality_score")).show
EDIT: for dynamic column names, you can use string interpolation
import org.apache.spark.sql.functions.col
val xxx : String = ???
df_subset_joined.filter(!(col(s"a.$xxx") <=> col(s"b.$xxx"))).show

Apache Spark group by combining types and sub types

I have this dataset in spark,
val sales = Seq(
("Warsaw", 2016, "facebook","share",100),
("Warsaw", 2017, "facebook","like",200),
("Boston", 2015,"twitter","share",50),
("Boston", 2016,"facebook","share",150),
("Toronto", 2017,"twitter","like",50)
).toDF("city", "year","media","action","amount")
I can now group this by city and media like this,
val groupByCityAndYear = sales
.groupBy("city", "media")
.count()
groupByCityAndYear.show()
+-------+--------+-----+
| city| media|count|
+-------+--------+-----+
| Boston|facebook| 1|
| Boston| twitter| 1|
|Toronto| twitter| 1|
| Warsaw|facebook| 2|
+-------+--------+-----+
But, how can I do combine media and action together in one column, so the expected output should be,
+-------+--------+-----+
| Boston|facebook| 1|
| Boston| share | 2|
| Boston| twitter| 1|
|Toronto| twitter| 1|
|Toronto| like | 1|
| Warsaw|facebook| 2|
| Warsaw|share | 1|
| Warsaw|like | 1|
+-------+--------+-----+
Combine media and action columns as array column, explode it, then do groupBy count:
sales.select(
$"city", explode(array($"media", $"action")).as("mediaAction")
).groupBy("city", "mediaAction").count().show()
+-------+-----------+-----+
| city|mediaAction|count|
+-------+-----------+-----+
| Boston| share| 2|
| Boston| facebook| 1|
| Warsaw| share| 1|
| Boston| twitter| 1|
| Warsaw| like| 1|
|Toronto| twitter| 1|
|Toronto| like| 1|
| Warsaw| facebook| 2|
+-------+-----------+-----+
Or assuming media and action doesn't intersect (the two columns don't have common elements):
sales.groupBy("city", "media").count().union(
sales.groupBy("city", "action").count()
).show
+-------+--------+-----+
| city| media|count|
+-------+--------+-----+
| Boston|facebook| 1|
| Boston| twitter| 1|
|Toronto| twitter| 1|
| Warsaw|facebook| 2|
| Boston| share| 2|
| Warsaw| share| 1|
| Warsaw| like| 1|
|Toronto| like| 1|
+-------+--------+-----+

Forward-fill missing data in PySpark not working

I have a simple dataset as shown under.
| id| name| country| languages|
|1 | Bob| USA| Spanish|
|2 | Angelina| France| null|
|3 | Carl| Brazil| null|
|4 | John| Australia| English|
|5 | Anne| Nepal| null|
I am trying to impute the null values in languages with the last non-null value using pyspark.sql.window to create a window over certain rows but nothing is happening. The column which is supposed to be have null values filled, temp_filled_spark, remains unchanged i.e a copy of original languages column.
from pyspark.sql import Window
from pyspark.sql.functions import last
window = Window.partitionBy('name').orderBy('country').rowsBetween(-sys.maxsize, 0)
filled_column = last(df['languages'], ignorenulls=True).over(window)
df = df.withColumn('temp_filled_spark', filled_column)
df.orderBy('name', 'country').show(100)
I expect the output column to be:
|temp_filled_spark|
| Spanish|
| Spanish|
| Spanish|
| English|
| English|
Could anybody help pointing out the mistake?
we can create window considering entire dataframe as one partition as,
from pyspark.sql import functions as F
>>> df1.show()
+---+--------+---------+---------+
| id| name| country|languages|
+---+--------+---------+---------+
| 1| Bob| USA| Spanish|
| 2|Angelina| France| null|
| 3| Carl| Brazil| null|
| 4| John|Australia| English|
| 5| Anne| Nepal| null|
+---+--------+---------+---------+
>>> w = Window.partitionBy(F.lit(1)).orderBy(F.lit(1)).rowsBetween(-sys.maxsize, 0)
>>> df1.select("*",F.last('languages',True).over(w).alias('newcol')).show()
+---+--------+---------+---------+-------+
| id| name| country|languages| newcol|
+---+--------+---------+---------+-------+
| 1| Bob| USA| Spanish|Spanish|
| 2|Angelina| France| null|Spanish|
| 3| Carl| Brazil| null|Spanish|
| 4| John|Australia| English|English|
| 5| Anne| Nepal| null|English|
+---+--------+---------+---------+-------+
Hope this helps.!