Improving performance of distinct + groupByKey on Spark - scala

I am trying to learn Spark and came up with this problem, but my solution doesn't seem to perform well. I was hoping someone could educate me on how I can improve it. The problem I have is as follows.
I have a few million tuples (e.g. (A, B), (A, C), (B, C), etc.) with possible duplicates (both key and value). What I would like to do is group the tuples by key AND, to make it more interesting, limit the length of the grouped values to some arbitrary number (let's say 3).
So, for example, if I have:
[(A, B), (A, C), (A, D), (A, E), (B, C)]
I would expect the output to be:
[(A, [B, C, D]), (A, [E]), (B, [C])]
If any list of values gets longer than 3, it is split, and you see the same key listed multiple times, as shown above with (A, [E]). Hopefully this makes sense.
The solution I came up with was:
val myTuples: Array[(String, String)] = ...
sparkContext.parallelize(myTuples)
.distinct() // to delete duplicates
.groupByKey() // to group up the tuples by key
.flatMapValues(values => values.grouped(3)) // split up values in groups of 3
.repartition(sparkContext.defaultParallelism)
.collect()
My solution works, but is there a more efficient way to do this? I hear that groupByKey is very inefficient. Any help would be greatly appreciated.
Also, is there a good number I should choose for partitions? I noticed that distinct takes a partition parameter, but I'm not sure what I should pass.
Thanks!

You need to re-formulate your problem slightly, as you aren't actually grouping by a single key here; in your example above, you output multiple rows for "A". In the code below, I add a column that we can additionally group by (it increments every 3 records), and use collect_list, a Spark SQL function, to produce the arrays you are looking for. Note that by sticking entirely to Spark SQL, you get many optimisations from Spark (through "Catalyst", its query optimiser).
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val data = List(("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"), ("B", "C")).toDF("key","value")
val data2 = data.withColumn("index", floor(
(row_number().over(Window.partitionBy("key").orderBy("value"))-1)/3)
)
data2.show
+---+-----+-----+
|key|value|index|
+---+-----+-----+
| B| C| 0|
| A| B| 0|
| A| C| 0|
| A| D| 0|
| A| E| 1|
+---+-----+-----+
data2.groupBy("key","index").agg(collect_list("value")).show
+---+-----+-------------------+
|key|index|collect_list(value)|
+---+-----+-------------------+
| B| 0| [C]|
| A| 0| [B, C, D]|
| A| 1| [E]|
+---+-----+-------------------+
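If you then need the result back on the driver in the original tuple shape, a hedged sketch (continuing the snippet above; the "values" alias is just illustrative, and spark.implicits._ is assumed to be in scope as it is in spark-shell) could look like:
// group as above and pull the result back as plain Scala tuples
val grouped = data2
  .groupBy("key", "index")
  .agg(collect_list("value").as("values"))
// the encoder for (String, Seq[String]) comes from spark.implicits._;
// tuples are mapped to columns by position
val result: Array[(String, Seq[String])] = grouped
  .select("key", "values")
  .as[(String, Seq[String])]
  .collect()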

Related

How to copy the "first" row of a spark data frame to another data frame? Why does my minimal example fail?

Basic Problem :
I want to copy the "first row" of a Spark Dataframe sdf to another Spark dataframe sdfEmpty.
I do not understand what goes wrong in the following code.
Hence I am looking for a solution and an explanation of what fails in my minimal example.
A minimal example :
// create a spark data frame
import org.apache.spark.sql._
val sdf = Seq(
(1, "a"),
(2, "b"),
(3, "b")
).toDF("A", "B")
sdf.show()
+---+---+
| A| B|
+---+---+
| 1| a|
| 2| b|
| 3| b|
+---+---+
// create an empty spark data frame to store the row
// declare it as var, such that I can change it later
var sdfEmpty = spark.createDataFrame(sc.emptyRDD[Row], sdf.schema)
sdfEmpty.show()
+---+---+
| A| B|
+---+---+
+---+---+
// take the "first" row of sdf as a spark data frame
val row = sdf.limit(1)
// combine the two spark data frames
sdfEmpty = sdfEmpty.union(row)
As row is:
row.show()
+---+---+
| A| B|
+---+---+
| 1| a|
+---+---+
the expected result for sdfEmpty is:
+---+---+
| A| B|
+---+---+
| 1| a|
+---+---+
But I get:
sdfEmpty.show()
+---+---+
| A| B|
+---+---+
| 2| b|
+---+---+
Question:
What confused me is the following: using val row = sdf.limit(1) I thought I had created a permanent/unchangeable/well-defined object, so that when I print it once and then add it to something, I get the same result.
Remark (thanks a lot to Daniel for his remarks):
I know that in the distributed world of Scala there is no well-defined notion of a "first row". I put it there for simplicity and I hope that people struggling with something similar will "accidentally" use the term "first".
What I try to achieve is the following: (in a simplified example)
I have a data frame with 2 columns A and B. Column A is partially ordered and column B is totally ordered.
I want to filter the data w.r.t. the columns. So the idea is a kind of divide and conquer: split the data frame into pieces such that, within each piece, both columns are totally ordered, and then filter as usual (and do the obvious iterations).
To achieve this I need to pick a well-defined row and split the data w.r.t. row.A. But as the minimal example shows, my commands do not produce a well-defined object.
Thanks a lot
Spark is distributed, so the notion of 'first' is not something we can rely on. Depending on partitioning we can get a different result when calling limit or first.
To have consistent results your data has to have an underlying order which we can use - which makes a lot of sense, since unless there is a logical ordering to your data, we can't really say what it means to take the first row.
Assuming you want to take the first row with respect to column A, you can just run orderBy("A").first() (*). Although if column A has more than one row with the same smallest value, there is no guarantee which row you will get.
(* I assume the Scala API has the same naming as Python, so please correct me if they are named differently.)
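A minimal hedged sketch of that variant, reusing sdf and sdfEmpty from the question (the expected output assumes the example data shown above):
// pin the chosen row down by ordering on "A" before taking one row
val firstByA = sdf.orderBy("A").limit(1)
sdfEmpty.union(firstByA).show()
// +---+---+
// |  A|  B|
// +---+---+
// |  1|  a|
// +---+---+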
@Christian you can achieve this result by using the take function.
take(num): take the first num elements of the RDD. It works by first scanning one partition, and using the results from that partition to estimate the number of additional partitions needed to satisfy the limit.
Here is the code snippet.
scala> import org.apache.spark.sql.types._
scala> val sdf = Seq(
(1, "a"),
(2, "b"),
(3, "b")
).toDF("A", "B")
scala> import org.apache.spark.sql._
scala> var sdfEmpty = spark.createDataFrame(sc.emptyRDD[Row], sdf.schema)
scala> var first1 =sdf.rdd.take(1)
scala> val first_row = spark.createDataFrame(sc.parallelize(first1), sdf.schema)
scala> sdfEmpty.union(first_row).show
+---+---+
| A| B|
+---+---+
| 1| a|
+---+---+
For more about the take() and first() functions, see the Spark documentation. Let me know if you have any query related to this.
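For comparison, a small hedged sketch of first(), which returns a single Row rather than an Array (it would need the same createDataFrame wrapping as above before it could be unioned):
// first() is essentially take(1)(0): it returns one Row from the RDD
val firstRow: Row = sdf.rdd.first()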
I am posting this answer as it contains the solution suggested by Daniel. Once I am through the literature provided by mahesh gupta, or some more testing, I'll update this answer and give remarks on the runtimes of the different approaches in "real life".
Basic Problem :
I want to copy the "first row" of a Spark Dataframe sdf to another Spark dataframe sdfEmpty.
In the distributed world of Spark there is no well-defined notion of "first", but something similar can be achieved thanks to orderBy, or by pinning the chosen row down as below.
A minimal working example :
// create a spark data frame
import org.apache.spark.sql._
val sdf = Seq(
(1, "a"),
(2, "b"),
(3, "b")
).toDF("A", "B")
sdf.show()
+---+---+
| A| B|
+---+---+
| 1| a|
| 2| b|
| 3| b|
+---+---+
// create an empty spark data frame to store the row
// declare it as var, such that I can change it later
var sdfEmpty = spark.createDataFrame(sc.emptyRDD[Row], sdf.schema)
sdfEmpty.show()
+---+---+
| A| B|
+---+---+
+---+---+
// take the "first" row of sdf by collecting it to the driver (this pins it down)
val firstRow = sdf.limit(1).collect()
// turn the collected row back into a one-row spark data frame
val row = spark.createDataFrame(sc.parallelize(firstRow.toSeq), sdf.schema)
// combine the two spark data frames
sdfEmpty = sdfEmpty.union(row)
The row is:
row.show()
+---+---+
| A| B|
+---+---+
| 1| a|
+---+---+
and the result for sdfEmpty is:
+---+---+
| A| B|
+---+---+
| 1| a|
+---+---+
Remark: Explanation given by Daniel (see comments above): .limit(n) is a transformation - it does not get evaluated until an action such as show or collect runs. Hence, depending on the context, it can return a different value. To use the result of .limit consistently, one can .collect it to the driver and use it as a local variable.

Spark - group and aggregate only several smallest items

In short
I have the cartesian product (cross join) of two dataframes and a function which gives a score for each element of this product. I now want to get a few "best matched" elements of the second DF for every member of the first DF.
In details
What follows is a simplified example as my real code is somewhat bloated with additional fields and filters.
Given two sets of data, each having some id and value:
// simple rdds of tuples
val rdd1 = sc.parallelize(Seq(("a", 31),("b", 41),("c", 59),("d", 26),("e",53),("f",58)))
val rdd2 = sc.parallelize(Seq(("z", 16),("y", 18),("x",3),("w",39),("v",98), ("u", 88)))
// convert them to dataframes:
val df1 = spark.createDataFrame(rdd1).toDF("id1", "val1")
val df2 = spark.createDataFrame(rdd2).toDF("id2", "val2")
and some function which for pair of the elements from the first and second dataset gives their "matching score":
def f(a:Int, b:Int):Int = (a * a + b * b * b) % 17
// convert it to udf
val fu = udf((a:Int, b:Int) => f(a, b))
we can create the product of two sets and calculate score for every pair:
val dfc = df1.crossJoin(df2)
val r = dfc.withColumn("rez", fu(col("val1"), col("val2")))
r.show
+---+----+---+----+---+
|id1|val1|id2|val2|rez|
+---+----+---+----+---+
| a| 31| z| 16| 8|
| a| 31| y| 18| 10|
| a| 31| x| 3| 2|
| a| 31| w| 39| 15|
| a| 31| v| 98| 13|
| a| 31| u| 88| 2|
| b| 41| z| 16| 14|
| c| 59| z| 16| 12|
...
And now we want to have this result grouped by id1:
r.groupBy("id1").agg(collect_set(struct("id2", "rez")).as("matches")).show
+---+--------------------+
|id1| matches|
+---+--------------------+
| f|[[v,2], [u,8], [y...|
| e|[[y,5], [z,3], [x...|
| d|[[w,2], [x,6], [v...|
| c|[[w,2], [x,6], [v...|
| b|[[v,2], [u,8], [y...|
| a|[[x,2], [y,10], [...|
+---+--------------------+
But really we want to retain only a few (say 3) of the "matches", those having the best score (say, the least score).
The question is
How to get the "matches" sorted and reduced to the top-N elements? Probably it is something about collect_list and sort_array, though I don't know how to sort by an inner field.
Is there a way to ensure optimization in the case of large input DFs - e.g. choosing the minimums directly while aggregating? I know it could be done easily if I wrote the code without Spark - keeping a small array or priority queue for every id1 and adding each element where it belongs, possibly dropping some previously added ones.
E.g. it's OK that the cross join is a costly operation, but I want to avoid wasting memory on results most of which I'm going to drop in the next step. My real use case deals with DFs of fewer than 1 million entries, so the cross join is still viable, but as we want to select only the 10-20 top matches for each id1 it seems quite desirable not to keep unnecessary data between steps.
To start, we need to keep only the first n rows per group. To do this we partition the DF by 'id1' and sort each group by 'rez' (ascending, since the smallest score is the best match). We use this window to add a row-number column to the DF, so that a where clause can keep the first n rows. Then you can continue with the same code you wrote, grouping by 'id1' and collecting the list - only now each group already contains just its best-scoring rows.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val n = 3
val w = Window.partitionBy($"id1").orderBy($"rez")
val res = r.withColumn("rn", row_number.over(w))
  .where($"rn" <= n)
  .groupBy("id1")
  .agg(collect_set(struct("id2", "rez")).as("matches"))
A second option that might be better because you won't need to group the DF twice:
import org.apache.spark.sql.Row
val sortTakeUDF = udf { (xs: Seq[Row], n: Int) =>
  xs.sortBy(_.getAs[Int]("rez")).take(n).map { case Row(id: String, rez: Int) => (id, rez) }
}
r.groupBy("id1").agg(sortTakeUDF(collect_set(struct("id2", "rez")), lit(n)).as("matches"))
Here we create a UDF that takes the array column and an integer value n. The UDF sorts the array by 'rez' and returns only the first n elements (those with the smallest scores).
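A further hedged sketch, not part of the answer above and assuming Spark 2.4+ (where slice is available): sort_array on an array of structs orders by the struct fields in declaration order, so putting rez first in the struct sorts the matches by score, and slice keeps the n smallest without a UDF:
// slice, sort_array, struct and collect_list come from the functions._ import above
val topN = r
  .groupBy("id1")
  .agg(slice(sort_array(collect_list(struct($"rez", $"id2"))), 1, n).as("matches"))
topN.show(truncate = false)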

Scala: Any better way to join two DataFrames by the relationship from the third one

I have two DataFrames and want to outer join them, but the join mapping is in another DataFrame.
Right now I am using the approach below. It works, but I hope there is a more efficient way, since I have >1,000,000 rows.
val ta = sc.parallelize(Array(
(1,1,1),
(2,2,2)
)).toDF("A", "B", "C")
scala> ta.show
+---+---+---+
| A| B| C|
+---+---+---+
| 1| 1| 1|
| 2| 2| 2|
+---+---+---+
val tb = sc.parallelize(Array(
(2,1)
)).toDF("C", "D")
scala> tb.show
+---+---+
| C| D|
+---+---+
| 2| 1|
+---+---+
val tc = sc.parallelize(Array(
(1,1,1),
(2,2,2)
)).toDF("D", "E", "F")
scala> tc.show
+---+---+---+
| D| E| F|
+---+---+---+
| 1| 1| 1|
| 2| 2| 2|
+---+---+---+
scala> val tmp = ta.join(tb, Seq("C"), "left_outer")
tmp: org.apache.spark.sql.DataFrame = [C: int, A: int, B: int, D: int]
scala> tmp.show
+---+---+---+----+
| C| A| B| D|
+---+---+---+----+
| 1| 1| 1|null|
| 2| 2| 2| 1|
+---+---+---+----+
scala> tmp.join(tc, Seq("D"), "outer").show
+----+----+----+----+----+----+
| D| C| A| B| E| F|
+----+----+----+----+----+----+
|null| 1| 1| 1|null|null|
| 1| 2| 2| 2| 1| 1|
| 2|null|null|null| 2| 2|
+----+----+----+----+----+----+
As Umberto noted, a good reference on how to improve performance of your joins is Holden Karau and Rachel Warren's High Performance Spark > Chapter 4. Joins (SQL & Core).
From the standpoint of your code, running it as you noted or the SQL equivalent (as noted below) should result in about the same performance.
// Create initial tables
val ta = sc.parallelize(Array(
(1,1,1),
(2,2,2)
)).toDF("A", "B", "C")
val tb = sc.parallelize(Array(
(2,1)
)).toDF("C", "D")
val tc = sc.parallelize(Array(
(1,1,1),
(2,2,2)
)).toDF("D", "E", "F")
// _.createOrReplaceTempView
ta.createOrReplaceTempView("ta")
tb.createOrReplaceTempView("tb")
tc.createOrReplaceTempView("tc")
// SQL Query
spark.sql("""
  select tc.D, ta.A, ta.B, ta.C, tc.E, tc.F
  from ta
  left outer join tb
    on tb.C = ta.C
  full outer join tc
    on tc.D = tb.D
""")
The reason is that the Spark SQL Catalyst optimizer takes the DataFrame query and builds up an optimized logical plan. A number of physical plans are developed, the Spark SQL engine's cost optimizer chooses the best one, and the code to produce the RDDs is generated from it.
That said, the key concern when you're working with a lot of rows that use up a lot of memory is the partitioning. For example, if you can ensure that the mapping DataFrame (tc) has the same or a similar partitioning scheme as the other DataFrames (ta, tb), you can get a co-located join (this is Figure 4-3 within High Performance Spark > Chapter 4. Joins).
If your three DataFrames (ta, tb, tc) all have different partitioning, the keys will not have a 1-to-1 match between partitions; this results in a shuffle join (Figure 4-2 within the same chapter), which can be more costly.
Basically, the concern is less about the query itself and more about the partitioning schemes of your DataFrames. But before experimenting too much with those partitioning schemes, experiment with your queries to see if the default Spark SQL / DataFrame execution takes care of the partitioning by itself.
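To experiment with that, a hedged sketch that repartitions each side on its join key before joining (the partition count 200 is purely illustrative, and ta, tb, tc are the frames defined above):
import org.apache.spark.sql.functions.col
// put matching keys into the same partitions before each join executes
val taByC = ta.repartition(200, col("C"))
val tbByC = tb.repartition(200, col("C"))
val tmp2  = taByC.join(tbByC, Seq("C"), "left_outer")
val tcByD = tc.repartition(200, col("D"))
val result = tmp2.repartition(200, col("D")).join(tcByD, Seq("D"), "outer")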

Spark find key/value pairs with key equals to other values and join

If we have the following key value pairs:
[T,V] [V,W] [A,B] [B,C]
I need the result to be
[T,V] [V,W] [T,W] [A,B] [B,C] [A,C]
So basically to generate [T,W] from [T,V] and [V,W] and append to the existing set
I'm not sure how to do this in Spark with Scala; please help.
val df = sc.parallelize(
Array(("T","V"),("V","W"),("A","B"),("B","C"))
).toDF("key","value")
df.show
+---+-----+
|key|value|
+---+-----+
| T| V|
| V| W|
| A| B|
| B| C|
+---+-----+
df.join(
df.toDF("keyR", "valueR"),
$"value" === $"keyR"
).explode($"key",$"value",$"keyR",$"valueR"){row => Seq(
(row.getString(0), row.getString(1)),
(row.getString(2), row.getString(3)),
(row.getString(0), row.getString(3))
)}.select($"_1" as "key", $"_2" as "value").show
+---+-----+
|key|value|
+---+-----+
| A| B|
| B| C|
| A| C|
| T| V|
| V| W|
| T| W|
+---+-----+
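Since DataFrame.explode is deprecated in Spark 2.x, a hedged alternative sketch of the same idea with a typed Dataset (reusing df from above and assuming spark.implicits._ is in scope, as in spark-shell):
// self-join on value == key, then emit the left pair, the right pair and the
// transitive (left key, right value) pair for every match
val joined = df.join(df.toDF("keyR", "valueR"), $"value" === $"keyR")
val result = joined
  .as[(String, String, String, String)]
  .flatMap { case (k, v, kR, vR) => Seq((k, v), (kR, vR), (k, vR)) }
  .toDF("key", "value")
  .distinct()
result.show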
Using purely Scala collection functions (on Set) - no Spark involved:
val ex = Set("T" -> "V", "V" -> "W", "A" -> "B", "B" -> "C")
val keysEquallingValues = ex.flatMap { tuple =>
ex.find(t => tuple._2 == t._1).map(t => tuple -> t)
}
val r = ex ++ keysEquallingValues.map(pair => pair._1._1 -> pair._2._2)
Explanation:
ex is your example input Set
We flatMap over it, using an expression that returns an Option[((String, String), (String, String))] - i.e. if the condition "is there a tuple with a key equal to the current value?" is true, we get a Some containing a pair of the two tuples (!) that satisfy the condition.
Using flatMap and Option like this allows us to drop non-matching cases (like a filter) while simultaneously transforming the contents of the collection in a single pass.
Finally, we cherry-pick the key of the first tuple and the value of the second to get the desired combination, and add it to your original Set.
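For the example input this yields the original four pairs plus the two derived ones:
println(r)
// Set((T,V), (V,W), (A,B), (B,C), (T,W), (A,C))  -- element order within the Set may differ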

What is going wrong with `unionAll` of Spark `DataFrame`?

Using Spark 1.5.0 and given the following code, I expect unionAll to union DataFrames based on their column name. In the code, I'm using some FunSuite for passing in SparkContext sc:
object Entities {
case class A (a: Int, b: Int)
case class B (b: Int, a: Int)
val as = Seq(
A(1,3),
A(2,4)
)
val bs = Seq(
B(5,3),
B(6,4)
)
}
class UnsortedTestSuite extends SparkFunSuite {
configuredUnitTest("The truth test.") { sc =>
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val aDF = sc.parallelize(Entities.as, 4).toDF
val bDF = sc.parallelize(Entities.bs, 4).toDF
aDF.show()
bDF.show()
aDF.unionAll(bDF).show
}
}
Output:
+---+---+
| a| b|
+---+---+
| 1| 3|
| 2| 4|
+---+---+
+---+---+
| b| a|
+---+---+
| 5| 3|
| 6| 4|
+---+---+
+---+---+
| a| b|
+---+---+
| 1| 3|
| 2| 4|
| 5| 3|
| 6| 4|
+---+---+
Why does the result contain intermixed "b" and "a" columns, instead of aligning columns based on column names? Sounds like a serious bug!?
It doesn't look like a bug at all. What you see is standard SQL behavior, and every major RDBMS, including PostgreSQL, MySQL, Oracle and MS SQL, behaves exactly the same. You'll find SQL Fiddle examples linked with the names.
To quote PostgreSQL manual:
In order to calculate the union, intersection, or difference of two queries, the two queries must be "union compatible", which means that they return the same number of columns and the corresponding columns have compatible data types
Column names, excluding the first table in the set operation, are simply ignored.
This behavior comes directly from relational algebra, where the basic building block is a tuple. Since tuples are ordered, a union of two sets of tuples is equivalent (ignoring duplicate handling) to the output you get here.
If you want to match using names you can do something like this
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
def unionByName(a: DataFrame, b: DataFrame): DataFrame = {
val columns = a.columns.toSet.intersect(b.columns.toSet).map(col).toSeq
a.select(columns: _*).unionAll(b.select(columns: _*))
}
To check both names and types, it should be enough to replace columns with:
a.dtypes.toSet.intersect(b.dtypes.toSet).map{case (c, _) => col(c)}.toSeq
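A hedged usage sketch with the frames from the question (the resulting column order depends on the set intersection, so it may vary):
// align aDF and bDF on their shared column names before unioning,
// e.g. B(5, 3) contributes a = 3, b = 5 instead of being appended positionally
unionByName(aDF, bDF).show()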
This is addressed in Spark 2.3, which adds unionByName support on Dataset.
https://issues.apache.org/jira/browse/SPARK-21043
There are no issues/bugs here - if you look at your case class B closely, it becomes clear.
Case class A --> you specified the order (a, b), and
case class B --> you specified the order (b, a) ---> so this is the expected result given that order:
case class A (a: Int, b: Int)
case class B (b: Int, a: Int)
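Following that observation, a hedged sketch for the Spark 1.5-era API that reorders bDF's columns before the positional union:
// select bDF's columns in aDF's order so unionAll lines them up correctly
aDF.unionAll(bDF.select("a", "b")).show()
// +---+---+
// |  a|  b|
// +---+---+
// |  1|  3|
// |  2|  4|
// |  3|  5|
// |  4|  6|
// +---+---+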
thanks,
Subbu
Use unionByName:
Excerpt from the documentation:
def unionByName(other: Dataset[T]): Dataset[T]
The difference between this function and union is that this function resolves columns by name (not by position):
val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
val df2 = Seq((4, 5, 6)).toDF("col1", "col2", "col0")
df1.unionByName(df2).show
// output:
// +----+----+----+
// |col0|col1|col2|
// +----+----+----+
// | 1| 2| 3|
// | 6| 4| 5|
// +----+----+----+
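Applied to the frames from the question, a hedged sketch (this requires Spark 2.3+, whereas the question uses 1.5.0):
// unionByName resolves bDF's columns by name, so B(5, 3) contributes a = 3, b = 5
aDF.unionByName(bDF).show()
// +---+---+
// |  a|  b|
// +---+---+
// |  1|  3|
// |  2|  4|
// |  3|  5|
// |  4|  6|
// +---+---+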
As discussed in SPARK-9813, it seems like as long as the data types and number of columns are the same across frames, the unionAll operation should work. Please see the comments for additional discussion.