Create a new column based on date checking - scala

I have two dataframes in Scala:
df1 =
ID Field1
df2 =
PK start_date_time
1 2016-10-11 11:55:23
2 2016-10-12 12:25:00
3 2016-10-12 16:20:00
I also have a variable start_date with the format yyyy-MM-dd equal to 2016-10-11.
I need to create a new column check in df1 based on the following condition: If PK is equal to ID AND the year, month and day of start_date_time are equal to start_date, then check is equal to 1, otherwise 0.
The result should be this one:
df1 =
ID Field1 check
1 AAA 1
2 BBB 0
4 CCC 0
In my previous question I had two dataframes and it was suggested to use joining and filtering. However, that won't work in this case. My initial idea was to use a udf, but I am not sure how to make it work here.

You can combine join and withColumn for this case: first join df1 with df2 on the ID column, then use the when/otherwise syntax to set the check column:
import org.apache.spark.sql.functions.{lit, to_date, when}
val df2_date = df2.withColumn("date", to_date(df2("start_date_time"))).withColumn("check", lit(1)).select($"PK".as("ID"), $"date", $"check")
df1.join(df2_date, Seq("ID"), "left").withColumn("check", when($"date" === "2016-10-11", $"check").otherwise(0)).drop("date").show
| ID|Field1|check|
| 1| AAA| 1|
| 2| BBB| 0|
| 4| CCC| 0|
Another option is to filter df2 first and then join it back to df1 on the ID column:
val df2_date = (df2.withColumn("date", to_date(df2("start_date_time"))).
filter($"date" === "2016-10-11").
withColumn("check", lit(1)).
select($"PK".as("ID"), $"date", $"check"))
df1.join(df2_date, Seq("ID"), "left").drop("date").na.fill(0).show
| ID|Field1|check|
| 1| AAA| 1|
| 2| BBB| 0|
| 4| CCC| 0|
If your date looks like 2016-OCT-11, you can convert it to a java.sql.Date for the comparison as follows:
val format = new java.text.SimpleDateFormat("yyyy-MMM-dd")
val parsed = format.parse("2016-OCT-11")
val date = new java.sql.Date(parsed.getTime())
// date: java.sql.Date = 2016-10-11
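The join-then-flag logic of the two Spark snippets above can be traced in plain Python. This is only a sketch of the rule, not Spark code: the lists and the dict stand in for df1 and df2, the Field1 values are taken from the expected output, and PK is assumed to be unique in df2.

```python
from datetime import datetime

# Hypothetical in-memory stand-ins for df1 and df2 from the question.
df1 = [(1, "AAA"), (2, "BBB"), (4, "CCC")]
df2 = {
    1: "2016-10-11 11:55:23",
    2: "2016-10-12 12:25:00",
    3: "2016-10-12 16:20:00",
}
start_date = "2016-10-11"


def check(row_id):
    """Return 1 if df2 has this ID and its timestamp falls on start_date."""
    ts = df2.get(row_id)  # None plays the role of the null a left join produces
    if ts is None:
        return 0
    day = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").date()
    return 1 if day.isoformat() == start_date else 0


result = [(row_id, field, check(row_id)) for row_id, field in df1]
print(result)
```

ID 4 has no match in df2, so it gets 0 just like the unmatched row filled by na.fill(0) in the second Spark variant.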


Calculate sequences of constantly increasing dates Spark

I have a dataframe in Spark with a name column and a date column. I would like to find all continuous sequences of consecutive dates (day after day) for each name and calculate their durations. The output should contain the name, the start date of the sequence, and the duration of the sequence in days.
How can I do this with Spark functions?
A consecutive sequence of dates example:
I came up with the following solution, but it calculates the overall number of days for each name and does not split them into sequences:
val result = allDataDf
I also tried using ranks, but for some reason the count column contains only 1s:
val names = Window
val result = allDataDf
.select($"name", $"date", rank over names as "rank")
.groupBy($"name", $"date", $"rank")
.agg(count($"*") as "count")
The output looks like this:
|stationName| date|rank|count|
| NAME|2019-03-24| 1| 1|
| NAME|2019-03-25| 2| 1|
| NAME|2019-03-27| 3| 1|
| NAME|2019-03-28| 4| 1|
| NAME|2019-01-29| 5| 1|
| NAME|2019-03-30| 6| 1|
| NAME|2019-03-31| 7| 1|
| NAME|2019-04-02| 8| 1|
| NAME|2019-04-05| 9| 1|
| NAME|2019-04-07| 10| 1|
Finding consecutive dates is fairly easy in SQL. You could do it with a query like:
WITH s AS (
  SELECT stationName, date,
    date_add(date, -(row_number() over (partition by stationName order by date))) as discriminator
  FROM stations
)
SELECT stationName,
  MIN(date) as start,
  COUNT(1) AS duration
FROM s GROUP BY stationName, discriminator
Fortunately, we can use SQL in Spark. Let's check if it works (I used different dates):
val df = Seq(
("NAME1", "2019-03-22"),
("NAME1", "2019-03-23"),
("NAME1", "2019-03-24"),
("NAME1", "2019-03-25"),
("NAME1", "2019-03-27"),
("NAME1", "2019-03-28"),
("NAME2", "2019-03-27"),
("NAME2", "2019-03-28"),
("NAME2", "2019-03-30"),
("NAME2", "2019-03-31"),
("NAME2", "2019-04-04"),
("NAME2", "2019-04-05"),
("NAME2", "2019-04-06")
).toDF("stationName", "date")
.withColumn("date", date_format(col("date"), "yyyy-MM-dd"))
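// Register the dataframe so the query can refer to it as "stations"
df.createOrReplaceTempView("stations")
val result = spark.sql(
  """
    |WITH s AS (
    |  SELECT
    |    stationName,
    |    date,
    |    date_add(date, -(row_number() over (partition by stationName order by date)) + 1) as discriminator
    |  FROM stations
    |)
    |SELECT
    |  stationName,
    |  MIN(date) as start,
    |  COUNT(1) AS duration
    |FROM s GROUP BY stationName, discriminator
  """.stripMargin)
result.show()
It seems to output the correct dataset: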
|stationName| start|duration|
| NAME1|2019-03-22| 4|
| NAME1|2019-03-27| 2|
| NAME2|2019-03-27| 2|
| NAME2|2019-03-30| 2|
| NAME2|2019-04-04| 3|
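The reason the discriminator works: within a run of consecutive days, the date and the row number both grow by one per row, so their difference stays constant and changes only when a day is skipped. A minimal plain-Python sketch of that idea (not Spark code; the function name sequences is made up for illustration, and dates are assumed to be distinct per name):

```python
from datetime import date, timedelta
from itertools import groupby

rows = [
    ("NAME1", "2019-03-22"), ("NAME1", "2019-03-23"), ("NAME1", "2019-03-24"),
    ("NAME1", "2019-03-25"), ("NAME1", "2019-03-27"), ("NAME1", "2019-03-28"),
    ("NAME2", "2019-03-27"), ("NAME2", "2019-03-28"), ("NAME2", "2019-03-30"),
    ("NAME2", "2019-03-31"), ("NAME2", "2019-04-04"), ("NAME2", "2019-04-05"),
    ("NAME2", "2019-04-06"),
]


def sequences(rows):
    """Return (name, start, duration) for every run of consecutive days."""
    out = []
    ordered = sorted((name, date.fromisoformat(d)) for name, d in rows)
    for name, group in groupby(ordered, key=lambda r: r[0]):
        days = [d for _, d in group]
        # date minus position: constant inside a run of consecutive days
        keyed = [(d - timedelta(days=i), d) for i, d in enumerate(days)]
        for _, run in groupby(keyed, key=lambda k: k[0]):
            run_days = [d for _, d in run]
            out.append((name, run_days[0].isoformat(), len(run_days)))
    return out


for row in sequences(rows):
    print(row)
```

With the sample dates above this yields the same five (name, start, duration) rows as the SQL result.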

Pyspark: Delete rows on column condition after groupBy

This is my input dataframe:
id val
1 Y
1 N
2 a
2 b
3 N
Result should be:
id val
1 Y
2 a
2 b
3 N
I want to group by the id column and, for every id that has both Y and N in val, remove the row where val is "N".
Please help me resolve this issue, as I am a beginner with PySpark.
You can first identify the affected ids with a filter for val=="Y" and then join that dataframe back to the original one. Finally, keep the rows where the joined column is null (the id has no Y) and, for ids that do have a Y, the rows where val is not "N". PySpark should be able to handle the self-join even if there are a lot of rows.
The example is shown below:
from pyspark.sql.functions import col
df_new = spark.createDataFrame([
(1, "Y"), (1, "N"), (1,"X"), (1,"Z"),
(2,"a"), (2,"b"), (3,"N")
], ("id", "val"))
df_Y = df_new.filter(col("val")=="Y").withColumnRenamed("val","val_Y").withColumnRenamed("id","id_Y")
df_new = df_new.join(df_Y, df_new["id"]==df_Y["id_Y"],how="left")
df_new.filter((col("val_Y").isNull()) | ((col("val_Y")=="Y") & ~(col("val")=="N"))).select("id","val").show()
The result is the output you want:
| id|val|
| 1| X|
| 1| Y|
| 1| Z|
| 3| N|
| 2| a|
| 2| b|
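The rule that the join and filter implement can be sketched in plain Python on the question's original five rows. This is an illustration of the logic only, not PySpark code; the names ids_with_y and kept are made up for the example.

```python
rows = [(1, "Y"), (1, "N"), (2, "a"), (2, "b"), (3, "N")]

# ids that have at least one "Y" row (the role of df_Y in the answer)
ids_with_y = {row_id for row_id, val in rows if val == "Y"}

# drop an "N" row only when its id also has a "Y" row
kept = [
    (row_id, val)
    for row_id, val in rows
    if not (val == "N" and row_id in ids_with_y)
]
print(kept)
```

Id 3 keeps its "N" row because it has no "Y" row, which corresponds to the isNull branch of the Spark filter.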

Add a New column in pyspark Dataframe (alternative of .apply in pandas DF)

I have a pyspark.sql.DataFrame df:
id col1
1 abc
2 bcd
3 lal
4 bac
I want to add one more column, flag, to df such that flag is 'odd' if id is an odd number and 'even' if it is even.
The final output should be:
id col1 flag
1 abc odd
2 bcd even
3 lal odd
4 bac even
I tried:
def myfunc(num):
    if num % 2 == 0:
        flag = 'EVEN'
    else:
        flag = 'ODD'
    return flag
df['new_col'] = df['id'].map(lambda x: myfunc(x))
df['new_col'] = df['id'].apply(lambda x: myfunc(x))
It gave me the error: TypeError: 'Column' object is not callable
How do I use .apply (as I do with a pandas dataframe) in PySpark?
PySpark doesn't provide apply; the alternative is to build the new column with withColumn:
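from pyspark.sql import functions as F
df = sqlContext.createDataFrame([
    (1, "abc"), (2, "bcd"), (3, "lal"), (4, "bac")
], ["id", "col1"])
df.show()
| id|col1|
| 1| abc|
| 2| bcd|
| 3| lal|
| 4| bac|
# Flag even ids as "Even" and all other ids as "odd"
df = df.withColumn("flag",
    F.when(F.col("id")%2 == 0, F.lit("Even")).otherwise(F.lit("odd")))
df.show()
| id|col1|flag|
| 1| abc| odd|
| 2| bcd|Even|
| 3| lal| odd|
| 4| bac|Even|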

I have a dataframe df as mentioned below:
**customers** **product** **val_id** **rule_name** **rule_id** **priority**
1 A 1 ABC 123 1
3 Z r ERF 789 2
2 B X ABC 123 2
2 B X DEF 456 3
1 A 1 DEF 456 2
I want to create a new dataframe df2 that contains each customer id only once. Since rule_name and rule_id differ between rows for the same customer, I want to pick, for each customer, the record with the highest priority (that is, the lowest priority number). My final outcome should be:
**customers** **product** **val_id** **rule_name** **rule_id** **priority**
1 A 1 ABC 123 1
3 Z r ERF 789 2
2 B X ABC 123 2
Can anyone please help me achieve this using Spark Scala? Any help will be appreciated.
You basically want to select rows with extreme values in a column. This is a really common issue, so there's even a whole tag greatest-n-per-group. Also see this question SQL Select only rows with Max Value on a Column which has a nice answer.
Here's an example for your specific case.
Note that this could select multiple rows for a customer, if there are multiple rows for that customer with the same (minimum) priority value.
This example is in PySpark, but it should be straightforward to translate to Scala.
from pyspark.sql import functions as F
# Find the best (lowest) priority for each customer. This DF has only two columns.
cusPriDF = df.groupBy("customers").agg( F.min(df["priority"]).alias("priority") )
# now join back to choose only those rows and get all columns back
bestRowsDF = df.join(cusPriDF, on=["customers","priority"], how="inner")
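To see what the group-then-join-back pattern does, including the tie behaviour noted above, here is a plain-Python sketch on the question's rows. It is not Spark code; the function name best_rows is invented for the illustration.

```python
# (customers, product, val_id, rule_name, rule_id, priority)
rows = [
    (1, "A", "1", "ABC", 123, 1),
    (3, "Z", "r", "ERF", 789, 2),
    (2, "B", "X", "ABC", 123, 2),
    (2, "B", "X", "DEF", 456, 3),
    (1, "A", "1", "DEF", 456, 2),
]


def best_rows(rows):
    """Keep the rows whose priority equals the minimum for their customer."""
    best = {}  # customer -> lowest priority seen (the groupBy/min step)
    for row in rows:
        customer, priority = row[0], row[-1]
        best[customer] = min(priority, best.get(customer, priority))
    # the inner join on (customer, priority)
    return [row for row in rows if row[-1] == best[row[0]]]


for row in best_rows(rows):
    print(row)
```

If two rows for one customer share the minimum priority, both survive the filter, just as both would survive the inner join.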
To create df2, first order df by priority and then take the first row for each customer, like this:
import org.apache.spark.sql.functions.first
val columns = df.columns.filterNot(_ == "customers").map(col => first(col).as(col))
val df2 = df.orderBy("priority").groupBy("customers").agg(columns.head, columns.tail:_*)
df2.show
It would give you the expected output:
| customers| product| val_id| rule_name| rule_id| priority|
| 1| A| 1| ABC| 123| 1|
| 3| Z| r| ERF| 789| 2|
| 2| B| X| ABC| 123| 2|
Corey beat me to it, but here's the Scala version:
val df = Seq(
val priorities = df.groupBy("customers").agg( min(df.col("priority")).alias("priority"))
val top_rows = df.join(priorities, Seq("customers","priority"), "inner")
| 1| 1| A| 1| ABC| 123|
| 3| 2| Z| r| ERF| 789|
| 2| 2| B| X| ABC| 123|
You will have to group the dataframe by customers and take the min of the priority column, then inner join the original dataframe with the aggregated dataframe and select the required columns.
val aggregatedDF = dataframe.groupBy("customers").agg(min("priority").as("priority_1"))
  .withColumnRenamed("customers", "customers_1")
val finalDF = dataframe.join(aggregatedDF, dataframe("customers") === aggregatedDF("customers_1") && dataframe("priority") === aggregatedDF("priority_1"))
  .select("customers", "product", "val_id", "rule_name", "rule_id", "priority")
finalDF.show
You should then have the desired result.

How to merge duplicate rows using expressions in Spark Dataframes

How can I merge two dataframes and remove duplicates by comparing columns?
I have two dataframes with the same column names:
| name| date|duration|
| bob|2015-01-13| 4|
|alice|2015-04-23| 10|
| name| date|duration|
| bob|2015-01-12| 3|
|alice2|2015-04-13| 10|
What I am trying to do is merge the two dataframes so that each name appears only once, applying two conditions:
1. For the same name, the duration is the sum of the durations.
2. For the same name, the final date is the latest date.
The final output will be:
| name| date|duration|
| bob|2015-01-13| 7|
| alice|2015-04-23| 10|
|alice2|2015-04-13| 10|
I followed the following method.
//Take union of 2 dataframe
val df =a.unionAll(b)
//group and take sum
val grouped =df.groupBy("name").agg($"name",sum("duration"))
val j=df.join(grouped,"name").drop("duration").withColumnRenamed("sum(duration)", "duration")
and I got
| name| date|duration|
| bob|2015-01-13| 7|
| alice|2015-04-23| 10|
| bob|2015-01-12| 7|
|alice2|2015-04-23| 10|
How can I now remove the duplicates by comparing dates?
Would it be possible to do this by running SQL queries after registering the dataframe as a table?
I am a beginner in Spark SQL and I feel like my way of approaching this problem is odd. Is there a better way to do this kind of data processing?
You can compute max(date) in the same groupBy(). There is no need to join the grouped result back to df.
// In 1.3.x, in order for the grouping column "name" to show up,
// it must be included explicitly in the agg() call.
val grouped = df.groupBy("name").agg($"name",sum("duration"), max("date"))
// In 1.4+, grouping column "name" is included automatically.
val grouped = df.groupBy("name").agg(sum("duration"), max("date"))
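As a quick check of what the single groupBy computes, here is a plain-Python sketch of the same aggregation on the question's two tables. It is not Spark code; it assumes the dates are ISO yyyy-MM-dd strings, so comparing them as strings matches chronological order.

```python
a = [("bob", "2015-01-13", 4), ("alice", "2015-04-23", 10)]
b = [("bob", "2015-01-12", 3), ("alice2", "2015-04-13", 10)]

merged = {}  # name -> (latest date, total duration)
for name, day, duration in a + b:  # a + b plays the role of the union
    latest, total = merged.get(name, (day, 0))
    merged[name] = (max(latest, day), total + duration)

for name, (latest, total) in merged.items():
    print(name, latest, total)
```

Each name appears once, with the summed duration and the latest date, which is what sum("duration") and max("date") return per group.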