Taking sum in spark-scala based on a condition - scala

I have a data frame like this. How can I take the sum of the Sales column where Rank is greater than 3, per 'M'?
+---+-----+----+
| M|Sales|Rank|
+---+-----+----+
| M1| 200| 1|
| M1| 175| 2|
| M1| 150| 3|
| M1| 125| 4|
| M1| 90| 5|
| M1| 85| 6|
| M2| 1001| 1|
| M2| 500| 2|
| M2| 456| 3|
| M2| 345| 4|
| M2| 231| 5|
| M2| 123| 6|
+---+-----+----+
Expected Output --
+---+-----+----+---------------+
| M|Sales|Rank|SumGreaterThan3|
+---+-----+----+---------------+
| M1| 200| 1| 300|
| M1| 175| 2| 300|
| M1| 150| 3| 300|
| M1| 125| 4| 300|
| M1| 90| 5| 300|
| M1| 85| 6| 300|
| M2| 1001| 1| 699|
| M2| 500| 2| 699|
| M2| 456| 3| 699|
| M2| 345| 4| 699|
| M2| 231| 5| 699|
| M2| 123| 6| 699|
+---+-----+----+---------------+
I have tried a sum over a window like this:
df.withColumn("SumGreaterThan3", sum("Sales").over(Window.partitionBy(col("M")))) // But this gives the total sum of Sales per M, not just the rows with Rank > 3.
To replicate the same DF-
val df = Seq(
  ("M1",200,1),
  ("M1",175,2),
  ("M1",150,3),
  ("M1",125,4),
  ("M1",90,5),
  ("M1",85,6),
  ("M2",1001,1),
  ("M2",500,2),
  ("M2",456,3),
  ("M2",345,4),
  ("M2",231,5),
  ("M2",123,6)
).toDF("M","Sales","Rank")

The partition alone is enough to define the window function. You also need a conditional sum, combining sum and when:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{sum, when}

val w = Window.partitionBy("M")
df.withColumn("SumGreaterThan3", sum(when('Rank > 3, 'Sales).otherwise(0)).over(w)).show
This will give you the expected result.
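An equivalent approach, shown here as a minimal sketch on the same df (the sums helper name is just for illustration), is to aggregate only the qualifying rows and join the total back; it avoids the window function at the cost of an extra join:

import org.apache.spark.sql.functions.sum

// Sum Sales per M over rows with Rank > 3, then join that total back onto every row
val sums = df.filter($"Rank" > 3)
  .groupBy("M")
  .agg(sum("Sales").as("SumGreaterThan3"))

df.join(sums, Seq("M"), "left").show

Note that any M with no Rank > 3 rows would get null here, whereas the window version gives 0 thanks to otherwise(0).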

Related

Add another column after groupBy and agg

I have a df that looks like this:
+-----+-------+-----+
|docId|vocabId|count|
+-----+-------+-----+
| 3| 3| 600|
| 2| 3| 702|
| 1| 2| 120|
| 2| 5| 200|
| 2| 2| 500|
| 3| 1| 100|
| 3| 5| 2000|
| 3| 4| 122|
| 1| 3| 1200|
| 1| 1| 1000|
+-----+-------+-----+
I want to output the max count for each vocabId along with the docId it belongs to. I did this:
val wordCounts = docwords.groupBy("vocabId").agg(max($"count") as ("count"))
and got this:
+-------+----------+
|vocabId| count |
+-------+----------+
| 1| 1000|
| 3| 1200|
| 5| 2000|
| 4| 122|
| 2| 500|
+-------+----------+
How do I add the docId at the front?
It should look something like this (the order is not important):
+-----+-------+-----+
|docId|vocabId|count|
+-----+-------+-----+
| 2| 2| 500|
| 3| 5| 2000|
| 3| 4| 122|
| 1| 3| 1200|
| 1| 1| 1000|
+-----+-------+-----+
You can do a self join with docwords on count and vocabId, something like below:
val wordCounts = docwords.groupBy("vocabId").agg(max($"count") as ("count")).join(docwords,Seq("vocabId","count"))
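If you prefer to avoid the join, a window-function sketch does the same thing (assuming ties on the max count should all be kept, which matches the self-join behaviour; maxCount is just an illustrative helper column):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.max

// Keep every row whose count equals the maximum count within its vocabId group
val w = Window.partitionBy("vocabId")
val wordCounts = docwords
  .withColumn("maxCount", max($"count").over(w))
  .filter($"count" === $"maxCount")
  .drop("maxCount")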

Unable to get the result from the window function

+---------------+--------+
|YearsExperience| Salary|
+---------------+--------+
| 1.1| 39343.0|
| 1.3| 46205.0|
| 1.5| 37731.0|
| 2.0| 43525.0|
| 2.2| 39891.0|
| 2.9| 56642.0|
| 3.0| 60150.0|
| 3.2| 54445.0|
| 3.2| 64445.0|
| 3.7| 57189.0|
| 3.9| 63218.0|
| 4.0| 55794.0|
| 4.0| 56957.0|
| 4.1| 57081.0|
| 4.5| 61111.0|
| 4.9| 67938.0|
| 5.1| 66029.0|
| 5.3| 83088.0|
| 5.9| 81363.0|
| 6.0| 93940.0|
| 6.8| 91738.0|
| 7.1| 98273.0|
| 7.9|101302.0|
| 8.2|113812.0|
| 8.7|109431.0|
| 9.0|105582.0|
| 9.5|116969.0|
| 9.6|112635.0|
| 10.3|122391.0|
| 10.5|121872.0|
+---------------+--------+
I want to find the highest salary from the above data, which is 122391.0.
My Code
val top= Window.partitionBy("id").orderBy(col("Salary").desc)
val res= df1.withColumn("top", rank().over(top))
Result
+---------------+--------+---+---+
|YearsExperience| Salary| id|top|
+---------------+--------+---+---+
| 1.1| 39343.0| 0| 1|
| 1.3| 46205.0| 1| 1|
| 1.5| 37731.0| 2| 1|
| 2.0| 43525.0| 3| 1|
| 2.2| 39891.0| 4| 1|
| 2.9| 56642.0| 5| 1|
| 3.0| 60150.0| 6| 1|
| 3.2| 54445.0| 7| 1|
| 3.2| 64445.0| 8| 1|
| 3.7| 57189.0| 9| 1|
| 3.9| 63218.0| 10| 1|
| 4.0| 55794.0| 11| 1|
| 4.0| 56957.0| 12| 1|
| 4.1| 57081.0| 13| 1|
| 4.5| 61111.0| 14| 1|
| 4.9| 67938.0| 15| 1|
| 5.1| 66029.0| 16| 1|
| 5.3| 83088.0| 17| 1|
| 5.9| 81363.0| 18| 1|
| 6.0| 93940.0| 19| 1|
| 6.8| 91738.0| 20| 1|
| 7.1| 98273.0| 21| 1|
| 7.9|101302.0| 22| 1|
| 8.2|113812.0| 23| 1|
| 8.7|109431.0| 24| 1|
| 9.0|105582.0| 25| 1|
| 9.5|116969.0| 26| 1|
| 9.6|112635.0| 27| 1|
| 10.3|122391.0| 28| 1|
| 10.5|121872.0| 29| 1|
+---------------+--------+---+---+
I have also tried partitioning by Salary and ordering by id, but the result was the same.
As you can see, 122391.0 appears near the bottom of the output, but it should come in the first position given the ordering I applied.
Can anybody please help?
Are you sure you need a window function here? The window you defined partitions the data by id, which I assume is unique, so each group produced by the window will only have one row. It looks like you want a window over the entire dataframe, which means you don't actually need one. If you just want to add a column with the max, you can get the max using an aggregation on your original dataframe and cross join with it:
val maxDF = df1.agg(max("salary").as("top"))
val res = df1.crossJoin(maxDF)
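If you only need the top value or the top row rather than an extra column, a couple of minimal alternatives (assuming the column is named Salary; topSalary is just an illustrative name):

import org.apache.spark.sql.functions.{col, max}

// The single highest salary as a plain value
val topSalary = df1.agg(max("Salary")).first().getDouble(0)   // 122391.0

// Or the whole row holding the highest salary
df1.orderBy(col("Salary").desc).limit(1).show()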

Perform merge/insert on two spark dataframes with different schemas?

I have two Spark dataframes, DF and DF1, with different schemas.
DF:-
val DF = Seq(("1","acv","34","a","1"),("2","fbg","56","b","3"),("3","rty","78","c","5")).toDF("id","name","age","DBName","test")
+---+----+---+------+----+
| id|name|age|DBName|test|
+---+----+---+------+----+
| 1| acv| 34| a| 1|
| 2| fbg| 56| b| 3|
| 3| rty| 78| c| 5|
+---+----+---+------+----+
DF1:-
val DF1 = Seq(("1","gbj","67","a","5"),("2","gbj","67","a","7"),("2","jku","88","b","8"),("4","jku","88","b","7"),("5","uuu","12","c","9")).toDF("id","name","age","DBName","col1")
+---+----+---+------+----+
| id|name|age|DBName|col1|
+---+----+---+------+----+
| 1| gbj| 67| a| 5|
| 2| gbj| 67| a| 7|
| 2| jku| 88| b| 8|
| 4| jku| 88| b| 7|
| 5| uuu| 12| c| 9|
+---+----+---+------+----+
I want to merge DF1 with DF based on the values of id and DBName. If the id and DBName already exist in DF, the record should be updated; if they don't exist, a new record should be added. So the resulting data frame should look like this:
+---+----+---+------+----+----+
| id|name|age|DBName|test|col1|
+---+----+---+------+----+----+
|  5| uuu| 12|     c|NULL|   9|
|  2| jku| 88|     b|NULL|   8|
|  4| jku| 88|     b|NULL|   7|
|  1| gbj| 67|     a|NULL|   5|
|  3| rty| 78|     c|   5|NULL|
|  2| gbj| 67|     a|NULL|   7|
+---+----+---+------+----+----+
What I have tried so far:
val updatedDF = DF.as("a").join(DF1.as("b"), $"a.id" === $"b.id" && $"a.DBName" === $"b.DBName", "outer").select(DF.columns.map(c => coalesce($"b.$c", $"b.$c") as c): _*)
Error:-
org.apache.spark.sql.AnalysisException: cannot resolve '`b.test`' given input columns: [b.DBName, a.DBName, a.name, b.age, a.id, a.age, b.id, a.test, b.name];;
You're selecting non-existent columns, and also there is a typo in the coalesce. You can follow the example below to fix your issue:
val updatedDF = DF.as("a").join(
    DF1.as("b"),
    $"a.id" === $"b.id" && $"a.DBName" === $"b.DBName",
    "outer"
  ).select(
    DF.columns.dropRight(1).map(c => coalesce($"b.$c", $"a.$c") as c)
      :+ col(DF.columns.last)
      :+ col(DF1.columns.last)
      :_*
  )
updatedDF.show
+---+----+---+------+----+----+
| id|name|age|DBName|test|col1|
+---+----+---+------+----+----+
| 5| uuu| 12| c|null| 9|
| 2| jku| 88| b| 3| 8|
| 4| jku| 88| b|null| 7|
| 1| gbj| 67| a| 1| 5|
| 3| rty| 78| c| 5|null|
| 2| gbj| 67| a|null| 7|
+---+----+---+------+----+----+
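A slightly more general sketch of the same idea, in case the shared columns are not known up front (assumption: the join keys are still id and DBName, and each dataframe may carry extra columns of its own; shared, onlyDF, onlyDF1 are illustrative names):

import org.apache.spark.sql.functions.coalesce

val shared  = DF.columns.intersect(DF1.columns)   // id, name, age, DBName
val onlyDF  = DF.columns.diff(shared)             // test
val onlyDF1 = DF1.columns.diff(shared)            // col1

val merged = DF.as("a").join(
    DF1.as("b"),
    $"a.id" === $"b.id" && $"a.DBName" === $"b.DBName",
    "outer"
  ).select(
    shared.map(c => coalesce($"b.$c", $"a.$c").as(c)) ++
    onlyDF.map(c => $"a.$c") ++
    onlyDF1.map(c => $"b.$c"): _*
  )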

How to take row_number() based on a condition in spark with scala

I have the below data frame -
+----+-----+---+
| val|count| id|
+----+-----+---+
| a| 10| m1|
| b| 20| m1|
|null| 30| m1|
| b| 30| m2|
| c| 40| m2|
|null| 50| m2|
+----+-----+---+
created by -
val df1 = Seq(
  ("a","10","m1"),
  ("b","20","m1"),
  (null,"30","m1"),
  ("b","30","m2"),
  ("c","40","m2"),
  (null,"50","m2")
).toDF("val","count","id")
I am trying to create a rank with the help of row_number() and a window function, as below.
df1.withColumn("rannk_num", row_number() over Window.partitionBy("id").orderBy("count")).show
+----+-----+---+---------+
| val|count| id|rannk_num|
+----+-----+---+---------+
| a| 10| m1| 1|
| b| 20| m1| 2|
|null| 30| m1| 3|
| b| 30| m2| 1|
| c| 40| m2| 2|
|null| 50| m2| 3|
+----+-----+---+---------+
But I have to exclude the records with null values in the val column from the ranking.
Expected output --
+----+-----+---+---------+
| val|count| id|rannk_num|
+----+-----+---+---------+
| a| 10| m1| 1|
| b| 20| m1| 2|
|null| 30| m1| NULL|
| b| 30| m2| 1|
| c| 40| m2| 2|
|null| 50| m2| NULL|
+----+-----+---+---------+
Wondering if this is possible with minimal change. Also, there can be any number of values in the val and count columns.
Filter those rows with null val, assign them a null row number, and union back to the original dataframe.
val df1 = Seq(
  ("a","10","m1"),
  ("b","20","m1"),
  (null,"30","m1"),
  ("b","30","m2"),
  ("c","40","m2"),
  (null,"50","m2")
).toDF("val","count","id")
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lit, row_number}

df1.filter("val is not null").withColumn(
  "rannk_num", row_number() over Window.partitionBy("id").orderBy("count")
).union(
  df1.filter("val is null").withColumn("rannk_num", lit(null))
).show
+----+-----+---+---------+
| val|count| id|rannk_num|
+----+-----+---+---------+
| a| 10| m1| 1|
| b| 20| m1| 2|
| b| 30| m2| 1|
| c| 40| m2| 2|
|null| 30| m1| null|
|null| 50| m2| null|
+----+-----+---+---------+
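An alternative sketch that keeps a single pass and preserves the original row order is to partition on both id and the null-ness of val, then blank out the row numbers of the null rows:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{row_number, when}

// Null-val rows fall into their own partitions, so they never disturb the numbering
val w = Window.partitionBy($"id", $"val".isNull).orderBy("count")

df1.withColumn("rannk_num", when($"val".isNotNull, row_number().over(w))).show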

PySpark: counting rows based on current row value

I have a DataFrame with a column "Speed". Can I efficiently add a column containing, for each row, the number of rows in the DataFrame whose "Speed" is within +2 of that row's "Speed"?
results = spark.createDataFrame(
    [[1], [2], [3], [4], [5],
     [4], [5], [4], [5], [6],
     [5], [6], [1], [3], [8],
     [2], [5], [6], [10], [12]],
    ['Speed'])
results.show()
+-----+
|Speed|
+-----+
| 1|
| 2|
| 3|
| 4|
| 5|
| 4|
| 5|
| 4|
| 5|
| 6|
| 5|
| 6|
| 1|
| 3|
| 8|
| 2|
| 5|
| 6|
| 10|
| 12|
+-----+
You could use a window function:
from pyspark.sql import Window
from pyspark.sql import functions as F

# Order the window by Speed and look at the range [0, +2]
w = Window.orderBy('Speed').rangeBetween(0, 2)
# Count the rows whose Speed falls between the current row's Speed and Speed + 2
results = results.withColumn('count+2', F.count('Speed').over(w)).orderBy('Speed')
results.show()
+-----+-------+
|Speed|count+2|
+-----+-------+
|    1|      6|
|    1|      6|
|    2|      7|
|    2|      7|
|    3|     10|
|    3|     10|
|    4|     11|
|    4|     11|
|    4|     11|
|    5|      8|
|    5|      8|
|    5|      8|
|    5|      8|
|    5|      8|
|    6|      4|
|    6|      4|
|    6|      4|
|    8|      2|
|   10|      2|
|   12|      1|
+-----+-------+
Note: the window function counts the current row itself. You can correct this by subtracting 1 from the count:
results = results.withColumn('count+2',F.count('Speed').over(w)-1).orderBy('Speed')
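For the Scala API used elsewhere on this page, a minimal equivalent sketch (assuming the same results DataFrame and the same [0, +2] range) would be:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count}

// Range frame from the current row's Speed up to Speed + 2
val w = Window.orderBy("Speed").rangeBetween(0, 2)

results
  .withColumn("count+2", count(col("Speed")).over(w) - 1)  // -1 excludes the row itself
  .orderBy("Speed")
  .show()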