Remove duplicates in Pair RDD based on Values - scala

I have an RDD with multiple rows which looks like below.
val row = [(String, String), (String, String, String)]
The value is a sequence of Tuples. In the tuple, the last String is a timestamp and the second one is category. I want to filter this sequence based on maximum timestamp for each category.
(A,B) Id Category Timestamp
-------------------------------------------------------
(123,abc) 1 A 2016-07-22 21:22:59+0000
(234,bcd) 2 B 2016-07-20 21:21:20+0000
(123,abc) 1 A 2017-07-09 21:22:59+0000
(345,cde) 4 C 2016-07-05 09:22:30+0000
(456,def) 5 D 2016-07-21 07:32:06+0000
(234,bcd) 2 B 2015-07-20 21:21:20+0000
I want one row for each category, namely the row with the max timestamp for that category. The result I am looking to get is:
(A,B) Id Category Timestamp
-------------------------------------------------------
(234,bcd) 2 B 2016-07-20 21:21:20+0000
(123,abc) 1 A 2017-07-09 21:22:59+0000
(345,cde) 4 C 2016-07-05 09:22:30+0000
(456,def) 5 D 2016-07-21 07:32:06+0000

Given input dataframe as
+---------+---+--------+------------------------+
|(A,B) |Id |Category|Timestamp |
+---------+---+--------+------------------------+
|[123,abc]|1 |A |2016-07-22 21:22:59+0000|
|[234,bcd]|2 |B |2016-07-20 21:21:20+0000|
|[123,abc]|1 |A |2017-07-09 21:22:59+0000|
|[345,cde]|4 |C |2016-07-05 09:22:30+0000|
|[456,def]|5 |D |2016-07-21 07:32:06+0000|
|[234,bcd]|2 |B |2015-07-20 21:21:20+0000|
+---------+---+--------+------------------------+
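You can do the following to get the result dataframe you require. Note that first after orderBy relies on the row order surviving the groupBy shuffle, which Spark does not guarantee, so the Window version further down is the safer option.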
import org.apache.spark.sql.functions._
val requiredDataframe = df.orderBy($"Timestamp".desc).groupBy("Category").agg(first("(A,B)").as("(A,B)"), first("Id").as("Id"), first("Timestamp").as("Timestamp"))
You should have the requiredDataframe as
+--------+---------+---+------------------------+
|Category|(A,B) |Id |Timestamp |
+--------+---------+---+------------------------+
|B |[234,bcd]|2 |2016-07-20 21:21:20+0000|
|D |[456,def]|5 |2016-07-21 07:32:06+0000|
|C |[345,cde]|4 |2016-07-05 09:22:30+0000|
|A |[123,abc]|1 |2017-07-09 21:22:59+0000|
+--------+---------+---+------------------------+
You can do the same by using Window function as below
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val windowSpec = Window.partitionBy("Category").orderBy($"Timestamp".desc)
df.withColumn("rank", rank().over(windowSpec)).filter($"rank" === lit(1)).drop("rank")
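Since the question mentions a pair RDD, the same "keep the latest row per category" idea can also be sketched without DataFrames. The following is a minimal sketch on plain Scala collections (no Spark), assuming a hypothetical Record case class and assuming all timestamps share one format and offset, so that comparing the strings matches chronological order. On an actual RDD the analogous step would be keying by category and reducing with reduceByKey.

```scala
// Hypothetical record type for illustration; fields follow the table in the question.
final case class Record(ab: (String, String), id: Int, category: String, timestamp: String)

val records: Seq[Record] = Seq(
  Record(("123", "abc"), 1, "A", "2016-07-22 21:22:59+0000"),
  Record(("234", "bcd"), 2, "B", "2016-07-20 21:21:20+0000"),
  Record(("123", "abc"), 1, "A", "2017-07-09 21:22:59+0000"),
  Record(("345", "cde"), 4, "C", "2016-07-05 09:22:30+0000"),
  Record(("456", "def"), 5, "D", "2016-07-21 07:32:06+0000"),
  Record(("234", "bcd"), 2, "B", "2015-07-20 21:21:20+0000")
)

// Keep, for every category, the record with the greatest timestamp.
// Assumption: same format and offset everywhere, so string order is chronological order.
def latestPerCategory(rows: Seq[Record]): Map[String, Record] =
  rows.groupBy(_.category).map { case (category, group) =>
    category -> group.maxBy(_.timestamp)
  }

val latest: Map[String, Record] = latestPerCategory(records)
```

With the sample data, latest holds one record per category, and category A keeps the 2017 row.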

Related

Spark window function with condition on current row

I am trying to count for a given order_id how many orders there were in the past 365 days which had a payment. And this is not the problem: I use the window function.
Where it gets tricky for me is: I don't want to count orders in this time window where the payment_date is after order_date of the current order_id.
Currently, I have something like this:
val window: WindowSpec = Window
.partitionBy("customer_id")
.orderBy("order_date")
.rangeBetween(-365*days, -1)
and
df.withColumn("paid_order_count", count("*") over window)
which would count all orders for the customer within the last 365 days before his current order.
How can I now incorporate a condition for the counting that takes the order_date of the current order into account?
Example:
+---------+-----------+-------------+------------+
|order_id |order_date |payment_date |customer_id |
+---------+-----------+-------------+------------+
|1 |2017-01-01 |2017-01-10 |A |
|2 |2017-02-01 |2017-02-10 |A |
|3 |2017-02-02 |2017-02-20 |A |
The resulting table should look like this:
+---------+-----------+-------------+------------+-----------------+
|order_id |order_date |payment_date |customer_id |paid_order_count |
+---------+-----------+-------------+------------+-----------------+
|1 |2017-01-01 |2017-01-10 |A |0 |
|2 |2017-02-01 |2017-02-10 |A |1 |
|3 |2017-02-02 |2017-02-20 |A |1 |
For order_id = 3 the paid_order_count should not be 2 but 1 as order_id = 2 is paid after order_id = 3 is placed.
I hope that I explained my problem well and look forward to your ideas!
Very good question!!!
A couple of remarks. A row-based frame (rowsBetween) is defined by the number of rows in it and not by values, so a 365-row frame would be problematic in these cases:
1. The customer does not have orders every single day, so a 365-row window might contain rows with an order_date well over one year ago.
2. The customer has more than one order per day, which will mess with the one-year coverage.
3. A combination of 1 and 2.
rangeBetween, in contrast, is based on the values of the ordering column, but with plain numeric offsets it may not work directly on Date and Timestamp columns (this depends on the Spark version), so the dates are converted to epoch seconds first.
To solve it, you can use a window function that collects a list, plus a UDF:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val df = spark.sparkContext.parallelize(Seq(
(1, "2017-01-01", "2017-01-10", "A")
, (2, "2017-02-01", "2017-02-10", "A")
, (3, "2017-02-02", "2017-02-20", "A")
)
).toDF("order_id", "order_date", "payment_date", "customer_id")
.withColumn("order_date_ts", to_timestamp($"order_date", "yyyy-MM-dd").cast("long"))
.withColumn("payment_date_ts", to_timestamp($"payment_date", "yyyy-MM-dd").cast("long"))
// df.printSchema()
// df.show(false)
val window = Window.partitionBy("customer_id").orderBy("order_date_ts").rangeBetween(Window.unboundedPreceding, -1)
val count_filtered_dates = udf( (days: Int, top: Long, array: Seq[Long]) => {
val bottom = top - days.toLong * 60 * 60 * 24 // timestamps cast to long are in seconds; this is the instant `days` days before top
array.count(v => v >= bottom && v < top)
}
)
val res = df.withColumn("paid_orders", collect_list("payment_date_ts") over window)
.withColumn("paid_order_count", count_filtered_dates(lit(365), $"order_date_ts", $"paid_orders"))
res.show(false)
Output:
+--------+----------+------------+-----------+-------------+---------------+------------------------+----------------+
|order_id|order_date|payment_date|customer_id|order_date_ts|payment_date_ts|paid_orders |paid_order_count|
+--------+----------+------------+-----------+-------------+---------------+------------------------+----------------+
|1 |2017-01-01|2017-01-10 |A |1483228800 |1484006400 |[] |0 |
|2 |2017-02-01|2017-02-10 |A |1485907200 |1486684800 |[1484006400] |1 |
|3 |2017-02-02|2017-02-20 |A |1485993600 |1487548800 |[1484006400, 1486684800]|1 |
+--------+----------+------------+-----------+-------------+---------------+------------------------+----------------+
Converting the dates to epoch seconds (long values) makes the lists more memory efficient.
This is the easiest code to implement, but not the most optimal, as the lists take up some memory. A custom UDAF would be best but requires more coding; I might do that later. This will still work if you have thousands of orders per customer.
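To make the counting rule itself easier to check, here is a minimal sketch of the same logic on plain Scala collections (no Spark), assuming a hypothetical Order case class. For each order it looks at the same customer's earlier orders and counts those whose payment date lies in the half-open interval from 365 days before the order date up to, but not including, the order date. It is quadratic and only meant to illustrate the condition, not to replace the window approach.

```scala
import java.time.LocalDate

// Hypothetical order type for illustration; fields mirror the example table.
final case class Order(orderId: Int, orderDate: LocalDate, paymentDate: LocalDate, customerId: String)

val orders: Seq[Order] = Seq(
  Order(1, LocalDate.parse("2017-01-01"), LocalDate.parse("2017-01-10"), "A"),
  Order(2, LocalDate.parse("2017-02-01"), LocalDate.parse("2017-02-10"), "A"),
  Order(3, LocalDate.parse("2017-02-02"), LocalDate.parse("2017-02-20"), "A")
)

// For each order: count earlier orders of the same customer whose payment date
// falls in [orderDate - days, orderDate).
def paidOrderCounts(all: Seq[Order], days: Long): Map[Int, Int] =
  all.map { current =>
    val bottom = current.orderDate.minusDays(days)
    val count = all.count { other =>
      other.customerId == current.customerId &&
      other.orderDate.isBefore(current.orderDate) &&   // only earlier orders
      !other.paymentDate.isBefore(bottom) &&           // paid within the look-back period
      other.paymentDate.isBefore(current.orderDate)    // and paid before the current order
    }
    current.orderId -> count
  }.toMap

val counts: Map[Int, Int] = paidOrderCounts(orders, 365L)
```

For the three sample orders this yields 0, 1 and 1, matching the paid_order_count column in the expected table.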

Update data from two Data Frames Scala-Spark

I have two Data Frames:
DF1:
ID | Col1 | Col2
1 a aa
2 b bb
3 c cc
DF2:
ID | Col1 | Col2
1 ab aa
2 b bba
4 d dd
How can I join these two DFs and the result should be:
Result:
1 ab aa
2 b bba
3 c cc
4 d dd
My code is:
val df = DF1.join(DF2, Seq("ID"), "outer")
.select($"ID",
when(DF1("Col1").isNull, lit(0)).otherwise(DF1("Col1")).as("Col1"),
when(DF1("Col2").isNull, lit(0)).otherwise(DF2("Col2")).as("Col2"))
.orderBy("ID")
And it works, but I don't want to specify each column, because I have large files.
So, is there any way to update the dataframe (and add the new records from the second DF) without specifying each column?
A simple leftanti join of df1 with df2 and merging of the result into df2 should get your desired output as
df2.union(df1.join(df2, Seq("ID"), "leftanti")).orderBy("ID").show(false)
which should give you
+---+----+----+
|ID |Col1|Col2|
+---+----+----+
|1 |ab |aa |
|2 |b |bba |
|3 |c |cc |
|4 |d |dd |
+---+----+----+
The solution doesn't match the logic you have in your code, but it generates the expected result. Note that union matches columns by position, so both dataframes need the same column order.
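The leftanti-plus-union idea can be read as an "upsert": take every row of the newer data, then add the old rows whose ID does not occur in it. A minimal sketch on plain Scala collections (no Spark), assuming a hypothetical Rec case class and a hypothetical upsert helper:

```scala
// Hypothetical record type for illustration; one field per column in the example.
final case class Rec(id: Int, col1: String, col2: String)

val oldRows = Seq(Rec(1, "a", "aa"), Rec(2, "b", "bb"), Rec(3, "c", "cc"))
val newRows = Seq(Rec(1, "ab", "aa"), Rec(2, "b", "bba"), Rec(4, "d", "dd"))

def upsert(old: Seq[Rec], updates: Seq[Rec]): Seq[Rec] = {
  val updatedIds = updates.map(_.id).toSet
  // "left anti" part: old rows whose id does not occur in the updates
  val untouched = old.filterNot(r => updatedIds.contains(r.id))
  // "union" part: all updates plus the untouched old rows, ordered by id
  (updates ++ untouched).sortBy(_.id)
}

val merged: Seq[Rec] = upsert(oldRows, newRows)
```

Because whole rows are carried over, no individual column has to be named, which is the point of the approach.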

Spark DataFrame Summary

Say I have a Spark SQL DataFrame like so:
name gender grade
-----------------
Joe M 3
Sue F 2
Pam F 3
Gil M 2
Lon F 3
Kim F 3
Zoe F 2
I want to create a report of single values like so:
numMales numFemales numGrade2 numGrade3
---------------------------------------
2 5 3 4
What is the best way to do this? I know how to get one of these individually like so:
val numMales = dataDF.where($"gender" === "M").count
But I don't really know how to put this into a DataFrame, or how to combine all the results.
Use of when, sum and struct inbuilt functions should give you your desired result
import org.apache.spark.sql.functions._
dataDF.select(struct(sum(when(col("gender")==="M", 1)).as("numMales"), sum(when(col("gender")==="F", 1)).as("numFemales")).as("genderCounts"),
struct(sum(when(col("grade")===2, 1)).as("numGrade2"), sum(when(col("grade")===3, 1)).as("numGrade3")).as("gradeCounts"))
.select(col("genderCounts.*"), col("gradeCounts.*"))
.show(false)
which should give you
+--------+----------+---------+---------+
|numMales|numFemales|numGrade2|numGrade3|
+--------+----------+---------+---------+
|2 |5 |3 |4 |
+--------+----------+---------+---------+
You can explode and pivot:
import org.apache.spark.sql.functions._
val cols = Seq("gender", "grade")
df
.select(explode(array(cols map (c => concat(lit(c), col(c))): _*)))
.groupBy().pivot("col").count.show
// +-------+-------+------+------+
// |genderF|genderM|grade2|grade3|
// +-------+-------+------+------+
// | 5| 2| 3| 4|
// +-------+-------+------+------+
I'd say that you need to .groupBy().count() your dataframe separately by each column, then combine the answers into a new dataframe.
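The "count per column, then combine" suggestion can be sketched on plain Scala collections (no Spark). The Student case class, the columns list and the label format (column name followed by value, as in the pivot output above) are assumptions for illustration:

```scala
// Hypothetical record type for illustration; fields mirror the example table.
final case class Student(name: String, gender: String, grade: Int)

val students = Seq(
  Student("Joe", "M", 3), Student("Sue", "F", 2), Student("Pam", "F", 3),
  Student("Gil", "M", 2), Student("Lon", "F", 3), Student("Kim", "F", 3),
  Student("Zoe", "F", 2)
)

// One (label, extractor) pair per column to summarise.
val columns: Seq[(String, Student => String)] = Seq(
  "gender" -> ((s: Student) => s.gender),
  "grade"  -> ((s: Student) => s.grade.toString)
)

// Count rows per distinct value of each column, then merge the per-column counts into one map.
def summary(rows: Seq[Student]): Map[String, Int] =
  columns.flatMap { case (label, extract) =>
    rows.groupBy(extract).map { case (value, group) => s"$label$value" -> group.size }
  }.toMap

val report: Map[String, Int] = summary(students)
```

Adding another column to the report only requires another entry in columns.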

How to create pairs of nodes in Spark?

I have the following DataFrame in Spark and Scala:
group nodeId date
1 1 2016-10-12T12:10:00.000Z
1 2 2016-10-12T12:00:00.000Z
1 3 2016-10-12T12:05:00.000Z
2 1 2016-10-12T12:30:00.000Z
2 2 2016-10-12T12:35:00.000Z
I need to group records by group, sort them in ascending order by date and make pairs of sequential nodeId. Also, date should be converted to Unix epoch.
This can be better explained with the expected output:
group nodeId_1 nodeId_2 date
1 2 3 2016-10-12T12:00:00.000Z
1 3 1 2016-10-12T12:05:00.000Z
2 1 2 2016-10-12T12:30:00.000Z
This is what I did so far:
df
.groupBy("group")
.agg($"nodeId",$"date")
.orderBy(asc("date"))
But I don't know how to create pairs of nodeId.
You can use a Window function with the built-in lead function to create the pairs, and the built-in to_utc_timestamp function to convert the date string to a UTC timestamp (if you need epoch seconds rather than a timestamp, unix_timestamp is the function to look at). Finally, you have to filter out the unpaired rows as you don't require them in the output.
The following program implements the explanation above. I have used comments for clarity.
import org.apache.spark.sql.expressions._
def windowSpec = Window.partitionBy("group").orderBy("date") //defining window function grouping by group and ordering by date
import org.apache.spark.sql.functions._
df.withColumn("date", to_utc_timestamp(col("date"), "Asia/Kathmandu")) //converting the date to epoch datetime you can choose other timezone as required
.withColumn("nodeId_2", lead("nodeId", 1).over(windowSpec)) //using window for creating pairs
.filter(col("nodeId_2").isNotNull) //filtering out the unpaired rows
.select(col("group"), col("nodeId").as("nodeId_1"), col("nodeId_2"), col("date")) //selecting as required final dataframe
.show(false)
You should get the final dataframe as required
+-----+--------+--------+-------------------+
|group|nodeId_1|nodeId_2|date |
+-----+--------+--------+-------------------+
|1 |2 |3 |2016-10-12 12:00:00|
|1 |3 |1 |2016-10-12 12:05:00|
|2 |1 |2 |2016-10-12 12:30:00|
+-----+--------+--------+-------------------+
I hope the answer is helpful.
Note that I used Asia/Kathmandu as the timezone to get the correct date; choose another timezone as required.
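What lead does within each group can be sketched on plain Scala collections (no Spark): sort by date, pair every row with its successor, and drop the last row. The Visit and NodePair case classes are hypothetical, and java.time.Instant is used here as one possible way to get Unix epoch seconds from the ISO date strings.

```scala
import java.time.Instant

// Hypothetical record types for illustration; fields mirror the example tables.
final case class Visit(group: Int, nodeId: Int, date: String)
final case class NodePair(group: Int, nodeId1: Int, nodeId2: Int, epochSeconds: Long)

val visits = Seq(
  Visit(1, 1, "2016-10-12T12:10:00.000Z"),
  Visit(1, 2, "2016-10-12T12:00:00.000Z"),
  Visit(1, 3, "2016-10-12T12:05:00.000Z"),
  Visit(2, 1, "2016-10-12T12:30:00.000Z"),
  Visit(2, 2, "2016-10-12T12:35:00.000Z")
)

// Within each group: sort by instant, pair every row with the next one
// (the role lead(..., 1) plays), and drop the last row, which has no successor.
def sequentialPairs(rows: Seq[Visit]): Seq[NodePair] =
  rows.groupBy(_.group).toSeq.sortBy(_._1).flatMap { case (group, members) =>
    val sorted = members.sortBy(v => Instant.parse(v.date).getEpochSecond)
    sorted.zip(sorted.drop(1)).map { case (a, b) =>
      NodePair(group, a.nodeId, b.nodeId, Instant.parse(a.date).getEpochSecond)
    }
  }

val pairs: Seq[NodePair] = sequentialPairs(visits)
```

For the sample data this gives the pairs (2, 3) and (3, 1) for group 1 and (1, 2) for group 2, each carrying the date of the first node as epoch seconds.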
If I understand your requirement correctly, you can use a self-join on group and a < inequality condition on nodeId:
val df = Seq(
(1, 1, "2016-10-12T12:10:00.000Z"),
(1, 2, "2016-10-12T12:00:00.000Z"),
(1, 3, "2016-10-12T12:05:00.000Z"),
(2, 1, "2016-10-12T12:30:00.000Z"),
(2, 2, "2016-10-12T12:35:00.000Z")
).toDF("group", "nodeId", "date")
df.as("df1").join(
df.as("df2"),
$"df1.group" === $"df2.group" && $"df1.nodeId" < $"df2.nodeId"
).select(
$"df1.group", $"df1.nodeId", $"df2.nodeId",
when($"df1.date" < $"df2.date", $"df1.date").otherwise($"df2.date").as("date")
)
// +-----+------+------+------------------------+
// |group|nodeId|nodeId|date |
// +-----+------+------+------------------------+
// |1 |1 |3 |2016-10-12T12:05:00.000Z|
// |1 |1 |2 |2016-10-12T12:00:00.000Z|
// |1 |2 |3 |2016-10-12T12:00:00.000Z|
// |2 |1 |2 |2016-10-12T12:30:00.000Z|
// +-----+------+------+------------------------+

Scala how to match two dfs and, if they match, update the key in the first df

I have data in two dataframes:
selectedPersonDF:
ID key
1
2
3
4
5
selectedDetailsDF:
first second third key
--------------------------
1 9 9 777
9 8 8 878
8 10 10 765
10 12 19 909
11 2 20 708
Code :
val personDF = spark.read.option("header", "true").option("inferSchema", "false").csv("person.csv")
val detailsDF = spark.read.option("header", "true").option("inferSchema", "false").csv("details.csv")
val selectedPersonDF = personDF.select(col("ID"), col("key"))
val selectedDetailsDF = detailsDF.select(col("first"), col("second"), col("third"), col("key"))
selectedPersonDF.show()
selectedDetailsDF.show()
I have to match the selectedPersonDF ID column against all the columns of selectedDetailsDF (first, second, third). If any of those columns matches the person's ID, then we have to take the key value from selectedDetailsDF and update the key column in selectedPersonDF.
Expected output (in selectedPersonDF):
ID key
1 777
2 708
3
4
5
After removing the first row from the persons df, since it matched with the details df, the remaining data should be stored in another df.
You can use a left join with an || condition check, as
val finalDF = selectedPersonDF.join(selectedDetailsDF.withColumnRenamed("key", "key2"), $"ID" === $"first" || $"ID" === $"second" || $"ID" === $"third", "left")
.select($"ID", $"key2".as("key"))
finalDF.show(false)
so finalDF should give you
+---+----+
|ID |key |
+---+----+
|1 |777 |
|2 |708 |
|3 |null|
|4 |null|
|5 |null|
+---+----+
We can call .na.fill("") on above dataframe (key column has to be StringType()) to get
+---+---+
|ID |key|
+---+---+
|1 |777|
|2 |708|
|3 | |
|4 | |
|5 | |
+---+---+
After that you can use filter to separate the dataframe into matching and non-matching rows, depending on whether the key column holds a value or is empty
val notMatchingDF = finalDF.filter($"key" === "")
val matchingDF = finalDF.except(notMatchingDF)
Update for the case where the column names of selectedDetailsDF are unknown, except for the key column
If the column names of the second dataframe are unknown, then you will have to form an array column of the unknown columns as
val columnsToCheck = (selectedDetailsDF.columns.toSet - "key").toList
import org.apache.spark.sql.functions._
val tempSelectedDetailsDF = selectedDetailsDF.select(array(columnsToCheck.map(col): _*).as("array"), col("key").as("key2"))
Now tempSelectedDetailsDF dataframe has two columns: combined column of all the unknown columns as an array column and the key column renamed as key2.
After that you would need a udf function for checking the condition while joining
val arrayContains = udf((array: Seq[String], value: String) => array.contains(value))
And then you join the dataframes using the call to the defined udf function as
val finalDF = selectedPersonDF.join(tempSelectedDetailsDF, arrayContains($"array", $"ID"), "left")
.select($"ID", $"key2".as("key"))
.na.fill("")
Rest of the process is already defined above.
I hope the answer is helpful and understandable.
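The matching rule behind the left join can be sketched on plain Scala collections (no Spark). The Detail case class and the matchKeys helper are hypothetical, and all values are kept as strings because the CSV files are read with inferSchema set to false. As with a left join, an ID that matched several detail rows would produce several output rows.

```scala
// Hypothetical record type for illustration; fields mirror the details table.
final case class Detail(first: String, second: String, third: String, key: String)

val personIds = Seq("1", "2", "3", "4", "5")
val details = Seq(
  Detail("1", "9", "9", "777"),
  Detail("9", "8", "8", "878"),
  Detail("8", "10", "10", "765"),
  Detail("10", "12", "19", "909"),
  Detail("11", "2", "20", "708")
)

// Left join on "ID equals any of first/second/third": every id is kept, with one
// output row per matching detail row, or a single row with None if nothing matches.
def matchKeys(ids: Seq[String], rows: Seq[Detail]): Seq[(String, Option[String])] =
  ids.flatMap { id =>
    val keys = rows.filter(d => Seq(d.first, d.second, d.third).contains(id)).map(_.key)
    if (keys.isEmpty) Seq(id -> Option.empty[String]) else keys.map(k => id -> Option(k))
  }

val joined = matchKeys(personIds, details)
// Split into matched and unmatched rows, as the filter/except step does.
val matched = joined.filter(_._2.isDefined)
val unmatched = joined.filter(_._2.isEmpty)
```

Here matched holds IDs 1 and 2 with keys 777 and 708, and unmatched holds IDs 3, 4 and 5.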