Consider items of the same value when deciding rank - scala

In spark, I would like to count how values are less or equal to other values. I tried to accomplish this via ranking but rank produces
[1,2,2,2,3,4] -> [1,2,2,2,5,6]
while what I would like is
[1,2,2,2,3,4] -> [1,4,4,4,5,6]
I can accomplish this by ranking, grouping by the rank and then modifying the rank value based on how many items are in the group. But this is kind of clunky and it's inefficient. Is there a better way to do this?
Edit: Added minimal example of what I'm trying to accomplish
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.rank
import org.apache.spark.sql.expressions.Window
object Question extends App {
val spark = SparkSession.builder.appName("Question").master("local[*]").getOrCreate()
import spark.implicits._
val win = Window.orderBy($"nums".asc)
Seq(1, 2, 2, 2, 3, 4)
.toDF("nums")
.select($"nums", rank.over(win).alias("rank"))
.as[(Int, Int)]
.groupByKey(_._2)
.mapGroups((rank, nums) => (rank, nums.toList.map(_._1)))
.map(x => (x._1 + x._2.length - 1, x._2))
.flatMap(x => x._2.map(num => (num, x._1)))
.toDF("nums", "rank")
.show(false)
}
Output:
+----+----+
|nums|rank|
+----+----+
|1 |1 |
|2 |4 |
|2 |4 |
|2 |4 |
|3 |5 |
|4 |6 |
+----+----+

Use window functions
scala> val df = Seq(1, 2, 2, 2, 3, 4).toDF("nums")
df: org.apache.spark.sql.DataFrame = [nums: int]
scala> df.createOrReplaceTempView("tbl")
scala> spark.sql(" with tab1(select nums, rank() over(order by nums) rk, count(*) over(partition by nums) cn from tbl) select nums, rk+cn-1 as rk2 from tab1 ").show(false)
18/11/28 02:20:55 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+----+---+
|nums|rk2|
+----+---+
|1 |1 |
|2 |4 |
|2 |4 |
|2 |4 |
|3 |5 |
|4 |6 |
+----+---+
scala>
Note that the df doesn't partition on any column, so spark complains of moving all data to single partition.
EDIT1:
scala> spark.sql(" select nums, rank() over(order by nums) + count(*) over(partition by nums) -1 as rk2 from tbl ").show
18/11/28 23:20:09 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+----+---+
|nums|rk2|
+----+---+
| 1| 1|
| 2| 4|
| 2| 4|
| 2| 4|
| 3| 5|
| 4| 6|
+----+---+
scala>
EDIT2:
The equivalent df version
scala> val df = Seq(1, 2, 2, 2, 3, 4).toDF("nums")
df: org.apache.spark.sql.DataFrame = [nums: int]
scala> import org.apache.spark.sql.expressions._
import org.apache.spark.sql.expressions._
scala> df.withColumn("rk2", rank().over(Window orderBy 'nums)+ count(lit(1)).over(Window.partitionBy('nums)) - 1 ).show(false)
2018-12-01 11:10:26 WARN WindowExec:66 - No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+----+---+
|nums|rk2|
+----+---+
|1 |1 |
|2 |4 |
|2 |4 |
|2 |4 |
|3 |5 |
|4 |6 |
+----+---+
scala>

So, a friend pointed out that if I just calculate the rank in descending order and then for each rank do (max_rank + 1) - current_rank. This is a much more efficient implementation.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.rank
import org.apache.spark.sql.expressions.Window
object Question extends App {
val spark = SparkSession.builder.appName("Question").master("local[*]").getOrCreate()
import spark.implicits._
val win = Window.orderBy($"nums".desc)
val rankings = Seq(1, 2, 2, 2, 3, 4)
.toDF("nums")
.select($"nums", rank.over(win).alias("rank"))
.as[(Int, Int)]
val maxElement = rankings.select("rank").as[Int].reduce((a, b) => if (a > b) a else b)
rankings
.map(x => x.copy(_2 = maxElement - x._2 + 1))
.toDF("nums", "rank")
.orderBy("rank")
.show(false)
}
Output
+----+----+
|nums|rank|
+----+----+
|1 |1 |
|2 |4 |
|2 |4 |
|2 |4 |
|3 |5 |
|4 |6 |
+----+----+

Related

How to efficiently map over DF and use combination of outputs?

Given a DF, let's say I have 3 classes each with a method addCol that will use the columns in the DF to create and append a new column to the DF (based on different calculations).
What is the best way to get a resulting df that will contain the original df A and the 3 added columns?
val df = Seq((1, 2), (2,5), (3, 7)).toDF("num1", "num2")
def addCol(df: DataFrame): DataFrame = {
df.withColumn("method1", col("num1")/col("num2"))
}
def addCol(df: DataFrame): DataFrame = {
df.withColumn("method2", col("num1")*col("num2"))
}
def addCol(df: DataFrame): DataFrame = {
df.withColumn("method3", col("num1")+col("num2"))
}
One option is actions.foldLeft(df) { (df, action) => action.addCol(df))}. The end result is the DF I want -- with columns num1, num2, method1, method2, and method3. But from my understanding this will not make use of distributed evaluation, and each addCol will happen sequentially. What is the more efficient way to do this?
Efficient way to do this is using select.
select is faster than the foldLeft if you have very huge data - Check this post
You can build required expressions & use that inside select, check below code.
scala> df.show(false)
+----+----+
|num1|num2|
+----+----+
|1 |2 |
|2 |5 |
|3 |7 |
+----+----+
scala> val colExpr = Seq(
$"num1",
$"num2",
($"num1"/$"num2").as("method1"),
($"num1" * $"num2").as("method2"),
($"num1" + $"num2").as("method3")
)
Final Output
scala> df.select(colExpr:_*).show(false)
+----+----+-------------------+-------+-------+
|num1|num2|method1 |method2|method3|
+----+----+-------------------+-------+-------+
|1 |2 |0.5 |2 |3 |
|2 |5 |0.4 |10 |7 |
|3 |7 |0.42857142857142855|21 |10 |
+----+----+-------------------+-------+-------+
Update
Return Column instead of DataFrame. Try using higher order functions, Your all three function can be replaced with below one function.
scala> def add(
num1:Column, // May be you can try to use variable args here if you want.
num2:Column,
f: (Column,Column) => Column
): Column = f(num1,num2)
For Example, varargs & while invoking this method you need to pass required columns at the end.
def add(f: (Column,Column) => Column,cols:Column*): Column = cols.reduce(f)
Invoking add function.
scala> val colExpr = Seq(
$"num1",
$"num2",
add($"num1",$"num2",(_ / _)).as("method1"),
add($"num1", $"num2",(_ * _)).as("method2"),
add($"num1", $"num2",(_ + _)).as("method3")
)
Final Output
scala> df.select(colExpr:_*).show(false)
+----+----+-------------------+-------+-------+
|num1|num2|method1 |method2|method3|
+----+----+-------------------+-------+-------+
|1 |2 |0.5 |2 |3 |
|2 |5 |0.4 |10 |7 |
|3 |7 |0.42857142857142855|21 |10 |
+----+----+-------------------+-------+-------+

Add values to a dataframe against some particular ID in Spark Scala

I have the following dataframe:
ID Name City
1 Ali swl
2 Sana lhr
3 Ahad khi
4 ABC fsd
And a list of values like (1,2,1):
val nums: List[Int] = List(1, 2, 1)
I want to add these values against ID == 3. So that DataFrame looks like:
ID Name City newCol newCol2 newCol3
1 Ali swl null null null
2 Sana lhr null null null
3 Ahad khi 1 2 1
4 ABC fsd null null null
I wonder if it is possible? Any help will be appreciated. Thanks
Yes, Its possible.
Use when for populating matched values & otherwise for not matched values.
I have used zipWithIndex for making column names unique.
Please check below code.
scala> import org.apache.spark.sql.functions._
scala> val df = Seq((1,"Ali","swl"),(2,"Sana","lhr"),(3,"Ahad","khi"),(4,"ABC","fsd")).toDF("id","name","city") // Creating DataFrame with given sample data.
df: org.apache.spark.sql.DataFrame = [id: int, name: string ... 1 more field]
scala> val nums = List(1,2,1) // List values.
nums: List[Int] = List(1, 2, 1)
scala> val filterData = List(3,4)
scala> spark.time{ nums.zipWithIndex.foldLeft(df)((df,c) => df.withColumn(s"newCol${c._2}",when($"id".isin(filterData:_*),c._1).otherwise(null))).show(false) } // Used zipWithIndex to make column names unique.
+---+----+----+-------+-------+-------+
|id |name|city|newCol0|newCol1|newCol2|
+---+----+----+-------+-------+-------+
|1 |Ali |swl |null |null |null |
|2 |Sana|lhr |null |null |null |
|3 |Ahad|khi |1 |2 |1 |
|4 |ABC |fsd |1 |2 |1 |
+---+----+----+-------+-------+-------+
Time taken: 43 ms
scala>
Firstly you can convert it to DataFrame with single array column and then "decompose" the array column into columns as follows:
import org.apache.spark.sql.functions.{col, lit}
import spark.implicits._
val numsDf =
Seq(nums)
.toDF("nums")
.select(nums.indices.map(i => col("nums")(i).alias(s"newCol$i")): _*)
After that you can use outer join for joining data to numsDf with ID == 3 condition as follows:
val resultDf = data.join(numsDf, data.col("ID") === lit(3), "outer")
resultDf.show() will print:
+---+----+----+-------+-------+-------+
| ID|Name|City|newCol0|newCol1|newCol2|
+---+----+----+-------+-------+-------+
| 1| Ali| swl| null| null| null|
| 2|Sana| lhr| null| null| null|
| 3|Ahad| khi| 1| 2| 3|
| 4| ABC| fsd| null| null| null|
+---+----+----+-------+-------+-------+
Make sure you have added spark.sql.crossJoin.crossJoin.enabled = true option to the spark session:
val spark = SparkSession.builder()
...
.config("spark.sql.crossJoin.enabled", value = true)
.getOrCreate()

isin throws stackoverflow error in withcolumn function in spark

I am using spark 2.3 in my scala application. I have a dataframe which create from spark sql that name is sqlDF in the sample code which I shared. I have a string list that has the items below
List[] stringList items
-9,-8,-7,-6
I want to replace all values that match with this lists item in all columns in dataframe to 0.
Initial dataframe
column1 | column2 | column3
1 |1 |1
2 |-5 |1
6 |-6 |1
-7 |-8 |-7
It must return to
column1 | column2 | column3
1 |1 |1
2 |-5 |1
6 |0 |1
0 |0 |0
For this I am itarating the query below for all columns (more than 500) in sqlDF.
sqlDF = sqlDF.withColumn(currColumnName, when(col(currColumnName).isin(stringList:_*), 0).otherwise(col(currColumnName)))
But getting the error below, by the way if I choose only one column for iterating it works, but if I run the code above for 500 columns iteration it fails
Exception in thread "streaming-job-executor-0"
java.lang.StackOverflowError at
scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:57)
at
scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:52)
at
scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:229)
at
scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
at scala.collection.immutable.List.map(List.scala:285) at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:333)
at
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
What is the thing that I am missing?
Here is a different approach applying left anti join between columnX and X where X is your list of items transferred into a dataframe. The left anti join will return all the items not present in X, the results we concatenate them all together through an outer join (which can be replaced with left join for better performance, this though will exclude records with all zeros i.e id == 3) based on the id assigned with monotonically_increasing_id:
import org.apache.spark.sql.functions.{monotonically_increasing_id, col}
val df = Seq(
(1, 1, 1),
(2, -5, 1),
(6, -6, 1),
(-7, -8, -7))
.toDF("c1", "c2", "c3")
.withColumn("id", monotonically_increasing_id())
val exdf = Seq(-9, -8, -7, -6).toDF("x")
df.columns.map{ c =>
df.select("id", c).join(exdf, col(c) === $"x", "left_anti")
}
.reduce((df1, df2) => df1.join(df2, Seq("id"), "outer"))
.na.fill(0)
.show
Output:
+---+---+---+---+
| id| c1| c2| c3|
+---+---+---+---+
| 0| 1| 1| 1|
| 1| 2| -5| 1|
| 3| 0| 0| 0|
| 2| 6| 0| 1|
+---+---+---+---+
foldLeft works perfect for your case here as below
val df = spark.sparkContext.parallelize(Seq(
(1, 1, 1),
(2, -5, 1),
(6, -6, 1),
(-7, -8, -7)
)).toDF("a", "b", "c")
val list = Seq(-7, -8, -9)
val resultDF = df.columns.foldLeft(df) { (acc, name) => {
acc.withColumn(name, when(col(name).isin(list: _*), 0).otherwise(col(name)))
}
}
Output:
+---+---+---+
|a |b |c |
+---+---+---+
|1 |1 |1 |
|2 |-5 |1 |
|6 |-6 |1 |
|0 |0 |0 |
+---+---+---+
I would suggest you to broadcast the list of String :
val stringList=sc.broadcast(<Your List of List[String]>)
After that use this :
sqlDF = sqlDF.withColumn(currColumnName, when(col(currColumnName).isin(stringList.value:_*), 0).otherwise(col(currColumnName)))
Make sure your currColumnName also is in String Format. Comparison should be String to String

How can I do map reduce on spark dataframe group by conditional columns?

My spark dataframe looks like this:
+------+------+-------+------+
|userid|useid1|userid2|score |
+------+------+-------+------+
|23 |null |dsad |3 |
|11 |44 |null |4 |
|231 |null |temp |5 |
|231 |null |temp |2 |
+------+------+-------+------+
I want to do the calculation for each pair of userid and useid1/userid2 (whichever is not null).
And if it's useid1, I multiply the score by 5, if it's userid2, I multiply the score by 3.
Finally, I want to add all score for each pair.
The result should be:
+------+--------+-----------+
|userid|useid1/2|final score|
+------+--------+-----------+
|23 |dsad |9 |
|11 |44 |20 |
|231 |temp |21 |
+------+------+-------------+
How can I do this?
For the groupBy part, I know dataframe has the groupBy function, but I don't know if I can use it conditionally, like if userid1 is null, groupby(userid, userid2), if userid2 is null, groupby(userid, useid1).
For the calculation part, how to multiply 3 or 5 based on the condition?
The below solution will help to solve your problem.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val groupByUserWinFun = Window.partitionBy("userid","useid1/2")
val finalScoreDF = userDF.withColumn("useid1/2", when($"userid1".isNull, $"userid2").otherwise($"userid1"))
.withColumn("finalscore", when($"userid1".isNull, $"score" * 3).otherwise($"score" * 5))
.withColumn("finalscore", sum("finalscore").over(groupByUserWinFun))
.select("userid", "useid1/2", "finalscore").distinct()
using when method in spark SQL, select userid1 or 2 and multiply with values based on the condition
Output:
+------+--------+----------+
|userid|useid1/2|finalscore|
+------+--------+----------+
| 11 | 44| 20.0|
| 23 | dsad| 9.0|
| 231| temp| 21.0|
+------+--------+----------+
Group by will work:
val original = Seq(
(23, null, "dsad", 3),
(11, "44", null, 4),
(231, null, "temp", 5),
(231, null, "temp", 2)
).toDF("userid", "useid1", "userid2", "score")
// action
val result = original
.withColumn("useid1/2", coalesce($"useid1", $"userid2"))
.withColumn("score", $"score" * when($"useid1".isNotNull, 5).otherwise(3))
.groupBy("userid", "useid1/2")
.agg(sum("score").alias("final score"))
result.show(false)
Output:
+------+--------+-----------+
|userid|useid1/2|final score|
+------+--------+-----------+
|23 |dsad |9 |
|231 |temp |21 |
|11 |44 |20 |
+------+--------+-----------+
coalesce will do the needful.
df.withColumn("userid1/2", coalesce(col("useid1"), col("useid1")))
basically this function return first non-null value of the order
documentation :
COALESCE(T v1, T v2, ...)
Returns the first v that is not NULL, or NULL if all v's are NULL.
needs an import import org.apache.spark.sql.functions.coalesce

spark 2.3.1 insertinto remove value of fields and change it to null

I just upgrade my spark cluster from 2.2.1 to 2.3.1 in order to enjoy the feature of overwrite specific partitions. see link.
But ....
From some reason when I am testing it I get a very strange behavior see code:
import org.apache.spark.SparkConf
import org.apache.spark.sql.{SaveMode, SparkSession}
case class MyRow(partitionField: Int, someId: Int, someText: String)
object ExampleForStack2 extends App{
val sparkConf = new SparkConf()
sparkConf.set("spark.sql.sources.partitionOverwriteMode","dynamic")
sparkConf.setMaster(s"local[2]")
val spark = SparkSession.builder().config(sparkConf).getOrCreate()
val list1 = List(
MyRow(1, 1, "someText")
,MyRow(2, 2, "someText2")
)
val list2 = List(
MyRow(1, 1, "someText modified")
,MyRow(3, 3, "someText3")
)
val df = spark.createDataFrame(list1)
val df2 = spark.createDataFrame(list2)
df2.show(false)
df.write.partitionBy("partitionField").option("path","/tmp/tables/").saveAsTable("my_table")
df2.write.mode(SaveMode.Overwrite).insertInto("my_table")
spark.sql("select * from my_table").show(false)
}
And output:
+--------------+------+-----------------+
|partitionField|someId|someText |
+--------------+------+-----------------+
|1 |1 |someText modified|
|3 |3 |someText3 |
+--------------+------+-----------------+
+------+---------+--------------+
|someId|someText |partitionField|
+------+---------+--------------+
|2 |someText2|2 |
|1 |someText |1 |
|3 |3 |null |
|1 |1 |null |
+------+---------+--------------+
Why I get those nulls ?
It seems that fields were moved ? but why?
Thanks
Ok found it, insert into is based on fields position. see documentation
Unlike saveAsTable, insertInto ignores the column names and just uses position-based resolution. For example:
scala> Seq((1, 2)).toDF("i", "j").write.mode("overwrite").saveAsTable("t1")
scala> Seq((3, 4)).toDF("j", "i").write.insertInto("t1")
scala> Seq((5, 6)).toDF("a", "b").write.insertInto("t1")
scala> sql("select * from t1").show
+---+---+
| i| j|
+---+---+
| 5| 6|
| 3| 4|
| 1| 2|
+---+---+
Because it inserts data to an existing table, format or options will
be ignored.
Moreover I am using dynamic partition which should appear as the last field. So the solution is to move the dynamic partitions to the end of the dataframe, which means in my case:
df2.select("someId", "someText","partitionField").write.mode(SaveMode.Overwrite).insertInto("my_table")