How to drop duplicates using conditions [duplicate] - scala

Marked as a duplicate of "How to select the first row of each group?"
How can I delete duplicates while keeping the minimum value of level for each duplicated (item_id, country_id) pair? I have the following DataFrame df:
+-----------+----------+---------------+
|item_id |country_id|level |
+-----------+----------+---------------+
| 312330| 13535670| 82|
| 312330| 13535670| 369|
| 312330| 13535670| 376|
| 319840| 69731210| 127|
| 319840| 69730600| 526|
| 311480| 69628930| 150|
| 311480| 69628930| 138|
| 311480| 69628930| 405|
+-----------+----------+---------------+
The expected output:
+-----------+----------+---------------+
|item_id |country_id|level |
+-----------+----------+---------------+
| 312330| 13535670| 82|
| 319840| 69731210| 127|
| 319840| 69730600| 526|
| 311480| 69628930| 138|
+-----------+----------+---------------+
I know how to delete duplicates without conditions using dropDuplicates, but I don't know how to do it for my particular case.
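For reference, unconditional deduplication on the key columns would be just something like
df.dropDuplicates("item_id", "country_id")
which keeps an arbitrary row per pair rather than the one with the minimum level.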

One method is to use orderBy (ascending by default), then groupBy with the first aggregation:
import org.apache.spark.sql.functions.first
df.orderBy("level").groupBy("item_id", "country_id").agg(first("level").as("level")).show(false)
You can make the ordering explicit with .asc for ascending and .desc for descending, as below:
df.orderBy($"level".asc).groupBy("item_id", "country_id").agg(first("level").as("level")).show(false)
You can also do the operation with a window and the row_number function, as below; unlike the groupBy/first approach, this does not depend on the input ordering being preserved through the aggregation, so it is the more reliable option:
import org.apache.spark.sql.expressions.Window
val windowSpec = Window.partitionBy("item_id", "country_id").orderBy($"level".asc)
import org.apache.spark.sql.functions.row_number
df.withColumn("rank", row_number().over(windowSpec)).filter($"rank" === 1).drop("rank").show()

Related

Merge rows from one pair of columns into another

Here's a link to an example of what I want to achieve: https://community.powerbi.com/t5/Desktop/Append-Rows-using-Another-columns/m-p/401836. Basically, I need to merge all the rows of a pair of columns into another pair of columns. How can I do this in Spark Scala?
(The input and output examples were attached as images in the original post.)
Correct me if I'm wrong, but as I understand it you have a dataframe with two pairs of date/cost columns and you want one pair appended underneath the other, right?
For instance with this input (only two rows for simplicity)
df.show
+----+----------+-----------+----------+---------+
|name| date1| cost1| date2| cost2|
+----+----------+-----------+----------+---------+
| A|2013-03-25|19923245.06| | |
| B|2015-06-04| 4104660.00|2017-10-16|392073.48|
+----+----------+-----------+----------+---------+
With just a couple of selects and a union you can achieve what you want:
df.select("name", "date1", "cost1")
.union(df.select("name", "date2", "cost2"))
.withColumnRenamed("date1", "date")
.withColumnRenamed("cost1", "cost")
+----+----------+-----------+
|name| date| cost|
+----+----------+-----------+
| A|2013-03-25|19923245.06|
| B|2015-06-04| 4104660.00|
| A| | |
| B|2017-10-16| 392073.48|
+----+----------+-----------+
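If the rows that came from empty date2/cost2 values are not wanted, they can be filtered out afterwards; a minimal sketch not in the original answer, assuming the empty cells are nulls or blank strings:

import org.apache.spark.sql.functions.{col, trim}

// Drop the rows whose date ended up null or blank after the union.
val merged = df.select("name", "date1", "cost1")
  .union(df.select("name", "date2", "cost2"))
  .withColumnRenamed("date1", "date")
  .withColumnRenamed("cost1", "cost")
  .filter(col("date").isNotNull && trim(col("date")) =!= "")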

Apache spark aggregation: aggregate column based on another column value

I am not sure if I am asking this correctly, and maybe that is why I haven't found the right answer so far. Anyway, if it turns out to be a duplicate I will delete this question.
I have following data:
id | last_updated | count
__________________________
1 | 20190101 | 3
1 | 20190201 | 2
1 | 20190301 | 1
I want to group this data by the "id" column, take the max value of "last_updated", and for the "count" column keep the value from the row where "last_updated" is at its max. So the result should look like this:
id | last_updated | count
__________________________
1 | 20190301 | 1
So I imagine it would look something like this:
df
.groupBy("id")
.agg(max("last_updated"), ... ("count"))
Is there any function I can use to get "count" based on the "last_updated" column?
I am using Spark 2.4.0.
Thanks for any help.
You have two options; the first is the better one, in my understanding.
OPTION 1
Use a window partitioned by id and create a column with the max value of last_updated over that window. Then keep the rows where last_updated equals that max, drop the original column, and finally rename the max column as desired:
val w = Window.partitionBy("id")
df.withColumn("max", max("last_updated").over(w))
.where("max = last_updated")
.drop("last_updated")
.withColumnRenamed("max", "last_updated")
OPTION 2
You can perform a join with the original dataframe after grouping
df.groupBy("id")
.agg(max("last_updated").as("last_updated"))
.join(df, Seq("id", "last_updated"))
QUICK EXAMPLE
INPUT
df.show
+---+------------+-----+
| id|last_updated|count|
+---+------------+-----+
| 1| 20190101| 3|
| 1| 20190201| 2|
| 1| 20190301| 1|
+---+------------+-----+
OUTPUT
Option 1
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val w = Window.partitionBy("id")
df.withColumn("max", max("last_updated").over(w))
.where("max = last_updated")
.drop("last_updated")
.withColumnRenamed("max", "last_updated")
+---+-----+------------+
| id|count|last_updated|
+---+-----+------------+
| 1| 1| 20190301|
+---+-----+------------+
Option 2
df.groupBy("id")
.agg(max("last_updated").as("last_updated"))
.join(df, Seq("id", "last_updated")).show
+---+------------+-----+
| id|last_updated|count|
+---+------------+-----+
|  1|    20190301|    1|
+---+------------+-----+
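As a third idiom not covered in the original answer, you can aggregate the max of a struct whose first field is last_updated; a minimal sketch, assuming the same df:

import org.apache.spark.sql.functions.{col, max, struct}

// Structs compare field by field, so the max struct per id is the one with the
// latest last_updated, and its count comes along with it.
df.groupBy("id")
  .agg(max(struct(col("last_updated"), col("count"))).as("latest"))
  .select(col("id"), col("latest.last_updated"), col("latest.count"))
  .show()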

Is there a better way to go about this process of trimming my spark DataFrame appropriately?

In the following example, I want to take only the x Ids with the highest counts for each query, where x is determined by a variable called howMany.
For the following example, given this Dataframe:
+------+--+-----+
|query |Id|count|
+------+--+-----+
|query1|11|2 |
|query1|12|1 |
|query2|13|2 |
|query2|14|1 |
|query3|13|2 |
|query4|12|1 |
|query4|11|1 |
|query5|12|1 |
|query5|11|2 |
|query5|14|1 |
|query5|13|3 |
|query6|15|2 |
|query6|16|1 |
|query7|17|1 |
|query8|18|2 |
|query8|13|3 |
|query8|12|1 |
+------+--+-----+
I would like to get the following dataframe if howMany is 2.
+------+-------+-----+
|query |Ids |count|
+------+-------+-----+
|query1|[11,12]|2 |
|query2|[13,14]|2 |
|query3|[13] |2 |
|query4|[12,11]|1 |
|query5|[11,13]|2 |
|query6|[15,16]|2 |
|query7|[17] |1 |
|query8|[18,13]|2 |
+------+-------+-----+
I then want to remove the count column, but that is trivial.
I have a way to do this, but I think it defeats the purpose of Scala altogether and wastes a lot of runtime. Being new, I am unsure about the best way to go about this.
My current method is to first get a distinct list of the query column and create an iterator. Second, I loop through the list using the iterator and trim the dataframe to only the current query using df.select($"eachColumnName"...).where($"query".equalTo(iter.next())). I then .limit(howMany) and then groupBy($"query").agg(collect_list($"Id").as("Ids")). Lastly, I append each of these one by one to an empty dataframe and return the newly built dataframe.
df.select($"query").distinct().rdd.map(r => r(0).asInstanceOf[String]).collect().toList
val iter = queries.toIterator
while (iter.hasNext) {
middleDF = df.select($"query", $"Id", $"count").where($"query".equalTo(iter.next()))
queryDF = middleDF.sort(col("count").desc).limit(howMany).select(col("query"), col("Ids")).groupBy(col("query")).agg(collect_list("Id").as("Ids"))
emptyDF.union(queryDF) // Assuming emptyDF is made
}
emptyDF
I would do this using window functions to get the rank, then groupBy to aggregate:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val howMany = 2
val newDF = df
.withColumn("rank",row_number().over(Window.partitionBy($"query").orderBy($"count".desc)))
.where($"rank"<=howMany)
.groupBy($"query")
.agg(
collect_list($"Id").as("Ids"),
max($"count").as("count")
)
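To then remove the count column, as the question mentions, it is just something like:

val result = newDF.drop("count")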

dataframe spark scala take the (MAX-MIN) for each group

I have a dataframe from a processing step that looks like this:
+---------+------+-----------+
|Time |group |value |
+---------+------+-----------+
| 28371| 94| 906|
| 28372| 94| 864|
| 28373| 94| 682|
| 28374| 94| 574|
| 28383| 95| 630|
| 28384| 95| 716|
| 28385| 95| 913|
I would like to take (the value at the max Time minus the value at the min Time) for each group, to get this result:
+------+-----------+
|group | value |
+------+-----------+
| 94| -332|
| 95| 283|
Thank you in advance for the help
df.groupBy("groupCol").agg(max("value")-min("value"))
Based on the question edit by the OP, here is a way to do this in PySpark. The idea is to compute the row numbers in ascending and descending order of time per group and use those values for subtraction.
from pyspark.sql import Window
from pyspark.sql import functions as func
w_asc = Window.partitionBy(df.groupCol).orderBy(df.time)
w_desc = Window.partitionBy(df.groupCol).orderBy(func.desc(df.time))
df = df.withColumn('rnum_asc', func.row_number().over(w_asc)) \
.withColumn('rnum_desc', func.row_number().over(w_desc))
df.groupBy(df.groupCol) \
.agg((func.max(func.when(df.rnum_desc==1,df.value))-func.max(func.when(df.rnum_asc==1,df.value))).alias('diff')).show()
This is simpler if the first_value window function is available. A generic way to solve it in SQL is:
select distinct groupCol,diff
from (
select t.*
,first_value(val) over(partition by groupCol order by time) -
first_value(val) over(partition by groupCol order by time desc) as diff
from tbl t
) t
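For completeness, here is a Scala sketch of the same idea, assuming the column names Time, group, and value from the question:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, max, row_number, when}

// Rank rows per group by Time ascending and descending, then subtract the
// value at the earliest Time from the value at the latest Time.
val wAsc = Window.partitionBy("group").orderBy(col("Time").asc)
val wDesc = Window.partitionBy("group").orderBy(col("Time").desc)

df.withColumn("rnum_asc", row_number().over(wAsc))
  .withColumn("rnum_desc", row_number().over(wDesc))
  .groupBy("group")
  .agg((max(when(col("rnum_desc") === 1, col("value"))) -
        max(when(col("rnum_asc") === 1, col("value")))).as("value"))
  .show()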

trim dataframe in spark using last appearance of value in the column

I have a dataframe that I want to trim at the last appearance of the value Good in the PDP column. Only the rows up to and including that last Good (row 5 here) should be kept; everything after row 5 does not matter.
+------+----+
|custId| PDP|
+------+----+
| 1001| New|
| 1002|Good|
| 1003| New|
| 1004| New|
| 1005|Good|
| 1006| New|
| 1007| New|
| 1008| New|
| 1009| New|
+------+----+
What I need is this dataframe, since the last Good action happened on row 5:
+------+----+
|custId| PDP|
+------+----+
| 1001| New|
| 1002|Good|
| 1003| New|
| 1004| New|
| 1005|Good|
+------+----+
You can try:
df
.filter($"PDP" === "Good") // Filter good
.select(max("custId").alias("maxId")) // Find max id
.crossJoin(df)
.where($"custId" <= $"maxId") // Filter records with id <= lastGoodId
.drop("maxId") // Remove obsolete column
You have to find the index of the last row with Good in the PDP column, and then keep only the rows up to and including that index.
custId
If your custId column contains increasing ids in sorted order, then you can do the following:
import org.apache.spark.sql.functions._
val maxIdToFilter = df.filter(lower(col("PDP")) === "good").select(max(col("custId").cast("long"))).first().getLong(0)
df.filter(col("custId") <= maxIdToFilter).show(false)
monotonically_increasing_id
If your custId is not sorted in increasing order, then you can use the following logic (note that this relies on the DataFrame's current row order, so it only makes sense if the rows are already in the intended order):
import org.apache.spark.sql.functions._
val dfWithRow = df.withColumn("rowNo", monotonically_increasing_id())
val maxIdToFilter = dfWithRow.filter(lower(col("PDP")) === "good").select(max("rowNo")).first().getLong(0)
dfWithRow.filter(col("rowNo") <= maxIdToFilter).drop("rowNo").show(false)
I hope the answer is helpful