Spark migrate sql window function to RDD for better performance - scala

A function should be executed for multiple columns in a data frame
def handleBias(df: DataFrame, colName: String, target: String = target) = {
val w1 = Window.partitionBy(colName)
val w2 = Window.partitionBy(colName, target)
df.withColumn("cnt_group", count("*").over(w2))
.withColumn("pre2_" + colName, mean(target).over(w1))
.withColumn("pre_" + colName, coalesce(min(col("cnt_group") / col("cnt_foo_eq_1")).over(w1), lit(0D)))
.drop("cnt_group")
}
This can be written nicely as shown above in spark-SQL and a for loop. However this is causing a lot of shuffles (spark apply function to columns in parallel).
A minimal example:
val df = Seq(
(0, "A", "B", "C", "D"),
(1, "A", "B", "C", "D"),
(0, "d", "a", "jkl", "d"),
(0, "d", "g", "C", "D"),
(1, "A", "d", "t", "k"),
(1, "d", "c", "C", "D"),
(1, "c", "B", "C", "D")
).toDF("TARGET", "col1", "col2", "col3TooMany", "col4")
val columnsToDrop = Seq("col3TooMany")
val columnsToCode = Seq("col1", "col2")
val target = "TARGET"
val targetCounts = df.filter(df(target) === 1).groupBy(target)
.agg(count(target).as("cnt_foo_eq_1"))
val newDF = df.join(broadcast(targetCounts), Seq(target), "left")
val result = (columnsToDrop ++ columnsToCode).toSet.foldLeft(newDF) {
(currentDF, colName) => handleBias(currentDF, colName)
}
result.drop(columnsToDrop: _*).show
How can I formulate this more efficient using RDD API? aggregateByKeyshould be a good idea but is still not very clear to me how to apply it here to substitute the window functions.
(provides a bit more context / bigger example https://github.com/geoHeil/sparkContrastCoding)
edit
Initially, I started with Spark dynamic DAG is a lot slower and different from hard coded DAG which is shown below. The good thing is, each column seems to run independent /parallel. The downside is that the joins (even for a small dataset of 300 MB) get "too big" and lead to an unresponsive spark.
handleBiasOriginal("col1", df)
.join(handleBiasOriginal("col2", df), df.columns)
.join(handleBiasOriginal("col3TooMany", df), df.columns)
.drop(columnsToDrop: _*).show
def handleBiasOriginal(col: String, df: DataFrame, target: String = target): DataFrame = {
val pre1_1 = df
.filter(df(target) === 1)
.groupBy(col, target)
.agg((count("*") / df.filter(df(target) === 1).count).alias("pre_" + col))
.drop(target)
val pre2_1 = df
.groupBy(col)
.agg(mean(target).alias("pre2_" + col))
df
.join(pre1_1, Seq(col), "left")
.join(pre2_1, Seq(col), "left")
.na.fill(0)
}
This image is with spark 2.1.0, the images from Spark dynamic DAG is a lot slower and different from hard coded DAG are with 2.0.2
The DAG will be a bit simpler when caching is applied
df.cache
handleBiasOriginal("col1", df). ...
What other possibilities than window functions do you see to optimize the SQL?
At best it would be great if the SQL was generated dynamically.

The main point here is to avoid unnecessary shuffles. Right now your code shuffles twice for each column you want to include and the resulting data layout cannot be reused between columns.
For simplicity I assume that target is always binary ({0, 1}) and all remaining columns you use are of StringType. Furthermore I assume that the cardinality of the columns is low enough for the results to be grouped and handled locally. You can adjust these methods to handle other cases but it requires more work.
RDD API
Reshape data from wide to long:
import org.apache.spark.sql.functions._
val exploded = explode(array(
(columnsToDrop ++ columnsToCode).map(c =>
struct(lit(c).alias("k"), col(c).alias("v"))): _*
)).alias("level")
val long = df.select(exploded, $"TARGET")
aggregateByKey, reshape and collect:
import org.apache.spark.util.StatCounter
val lookup = long.as[((String, String), Int)].rdd
// You can use prefix partitioner (one that depends only on _._1)
// to avoid reshuffling for groupByKey
.aggregateByKey(StatCounter())(_ merge _, _ merge _)
.map { case ((c, v), s) => (c, (v, s)) }
.groupByKey
.mapValues(_.toMap)
.collectAsMap
You can use lookup to get statistics for individual columns and levels. For example:
lookup("col1")("A")
org.apache.spark.util.StatCounter =
(count: 3, mean: 0.666667, stdev: 0.471405, max: 1.000000, min: 0.000000)
Gives you data for col1, level A. Based on the binary TARGET assumption this information is complete (you get count / fractions for both classes).
You can use lookup like this to generate SQL expressions or pass it to udf and apply it on individual columns.
DataFrame API
Convert data to long as for RDD API.
Compute aggregates based on levels:
val stats = long
.groupBy($"level.k", $"level.v")
.agg(mean($"TARGET"), sum($"TARGET"))
Depending on your preferences you can reshape this to enable efficient joins or convert to a local collection and similarly to the RDD solution.

Using aggregateByKey
A simple explanation on aggregateByKey can be found here. Basically you use two functions: One which works inside a partition and one which works between partitions.
You would need to do something like aggregate by the first column and build a data structure internally with a map for every element of the second column to aggregate and collect data there (of course you could do two aggregateByKey if you want).
This will not solve the case of doing multiple runs on the code for each column you want to work with (you can do use aggregate as opposed to aggregateByKey to work on all data and put it in a map but that will probably give you even worse performance). The result would then be one line per key, if you want to move back to the original records (as window function does) you would actually need to either join this value with the original RDD or save all values internally and flatmap
I do not believe this would provide you with any real performance improvement. You would be doing a lot of work to reimplement things that are done for you in SQL and while doing so you would be losing most of the advantages of SQL (catalyst optimization, tungsten memory management, whole stage code generation etc.)
Improving the SQL
What I would do instead is attempt to improve the SQL itself.
For example, the result of the column in the window function appears to be the same for all values. Do you really need a window function? You can instead do a groupBy instead of a window function (and if you really need this per record you can try to join the results. This might provide better performance as it would not necessarily mean shuffling everything twice on every step).

Related

spark scala: Performance degrade with simple UDF over large number of columns

I have a dataframe with 100 million rows and ~ 10,000 columns. The columns are of two types, standard (C_i) followed by dynamic (X_i). This dataframe was obtained after some processing, and the performance was fast. Now only 2 steps remain:
Goal:
A particular operation needs to be done on every X_i using identical subset of C_i columns.
Convert each of X-i column into FloatType.
Difficulty:
Performance degrades terribly with increasing number of columns.
After a while, only 1 executor seems to work (%CPU use < 200%), even on a sample data with 100 rows and 1,000 columns. If I push it to 1,500 columns, it crashes.
Minimal code:
import spark.implicits._
import org.apache.spark.sql.types.FloatType
// sample_udf
val foo = (s_val: String, t_val: String) => {
t_val + s_val.takeRight(1)
}
val foos_udf = udf(foo)
spark.udf.register("foos_udf", foo)
val columns = Seq("C1", "C2", "X1", "X2", "X3", "X4")
val data = Seq(("abc", "212", "1", "2", "3", "4"),("def", "436", "2", "2", "1", "8"),("abc", "510", "1", "2", "5", "8"))
val rdd = spark.sparkContext.parallelize(data)
var df = spark.createDataFrame(rdd).toDF(columns:_*)
df.show()
for (cols <- df.columns.drop(2)) {
df = df.withColumn(cols, foos_udf(col("C2"),col(cols)))
}
df.show()
for (cols <- df.columns.drop(2)) {
df = df.withColumn(cols,col(cols).cast(FloatType))
}
df.show()
Error on 1,500 column data:
Exception in thread "main" java.lang.StackOverflowError
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.isStreaming(LogicalPlan.scala:37)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$isStreaming$1.apply(LogicalPlan.scala:37)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$isStreaming$1.apply(LogicalPlan.scala:37)
at scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:93)
at scala.collection.immutable.List.exists(List.scala:84)
...
Thoughts:
Perhaps var could be replaced, but the size of the data is close to 40% of the RAM.
Perhaps for loop for dtype casting could be causing degradation of performance, though I can't see how, and what are the alternatives. From searching on internet, I have seen people suggesting foldLeft based approach, but that apparently still gets translated to for loop internally.
Any inputs on this would be greatly appreciated.
A faster solution was to call UDF on row itself rather than calling on each column. As Spark stores data as rows, the earlier approach was exhibiting terrible performance.
def my_udf(names: Array[String]) = udf[String,Row]((r: Row) => {
val row = Array.ofDim[String](names.length)
for (i <- 0 until row.length) {
row(i) = r.getAs(i)
}
...
}
...
val df2 = df1.withColumn(results_col,my_udf(df1.columns)(struct("*"))).select(col(results_col))
Type casting can be done as suggested by Riccardo
not sure if this will fix the performance on your side with 10000~ columns, but I was able to run it locally with 1500 using the following code.
I addressed points #1 and #2, which may have had some impact on performance. One note, to my understanding foldLeft should be a pure recursive function without an internal for loop, so it might have an impact on performance in this case.
Also, the two for loops can be simplified into a single for loop that I refactored as foldLeft.
We might also get a performance increase if we replace the udf with a spark function.
import spark.implicits._
import org.apache.spark.sql.types.FloatType
import org.apache.spark.sql.functions._
// sample_udf
val foo = (s_val: String, t_val: String) => {
t_val + s_val.takeRight(1)
}
val foos_udf = udf(foo)
spark.udf.register("foos_udf", foo)
val numberOfColumns = 1500
val numberOfRows = 100
val colNames = (1 to numberOfColumns).map(s => s"X$s")
val colValues = (1 to numberOfColumns).map(_.toString)
val columns = Seq("C1", "C2") ++ colNames
val schema = StructType(columns.map(field => StructField(field, StringType)))
val rowFields = Seq("abc", "212") ++ colValues
val listOfRows = (1 to numberOfRows).map(_ => Row(rowFields: _*))
val listOfRdds = spark.sparkContext.parallelize(listOfRows)
val df = spark.createDataFrame(listOfRdds, schema)
df.show()
val newDf = df.columns.drop(2).foldLeft(df)((df, colName) => {
df.withColumn(colName, foos_udf(col("C2"), col(colName)) cast FloatType)
})
newDf.show()
Hope this helps!
*** EDIT
Found a way better solution that circumvents loops. Simply make a single expression with SelectExpr, this way sparks casts all columns in one go without any kind of recursion. From my previous example:
instead of doing fold left, just replace it with these lines. I just tested it with 10k columns 100 rows in my local computer, lasted a few seconds
val selectExpression = Seq("C1", "C2") ++ colNames.map(s => s"cast($s as float)")
val newDf = df.selectExpr(selectExpression:_*)

Spark, applying filters on DataFrame(or RDD) multiple times without redundant evaluations

I have a Spark DataFrame that needs heavy evaluations for the chaining of the parent RDDs.
val df: DataFrame[(String, Any)] = someMethodCalculatingDF()
val out1 = df.filter(_._1 == "Key1").map(_._2).collect()
val out2 = df.filter(_._1 == "Key2").map(_._2)
out1 is a very small data ( one or two Rows in each partition) and collected for further use.
out2 is a Dataframe and will be used to generate another RDD that will be materialized later.
So, df will be evaluated twice, which is heavy.
Caching could be a solution, but in my application, it wont be, because the data could be really really BIG. The memory would be overflowed.
Is there any genius :) who could suggest another way bypassing the redundant evaluations?
It's actually a scenario which occurs in our cluster on a daily basis. From our experience this methodology work for us the best.
When we need to use same calculated dataframe twice(on different branches) we do as follows:
Calculation phase is heavy and resulted in a rather small dataframe -> cache it.
Calculation phase is light resulted in a big dataframe -> let it calculate twice.
Calculation is heavy resulted in a big data frame -> write it to disk(HDFS or S3) split the job on splitting point to two different batch processing job. In this you don't repeat the heavy calculation and you don't shred your cache(which will either way probably use the disk).
Calculation phase is light resulting in a small Dataframe. Your life is good and you can go home :).
I'm not familiar with dataset API so will write a solution using RDD api.
val rdd: RDD[(String, Int)] = ???
//First way
val both: Map[String, Iterable[Int]] = rdd.filter(e => e._1 == "Key1" || e._1 == "Key2")
.groupByKey().collectAsMap()
//Second way
val smallCached = rdd.filter(e => e._1 == "Key1" || e._1 == "Key2").cache()
val out1 = smallCached.filter(_._1 == "Key1").map(_._2).collect()
val out2 = smallCached.filter(_._1 == "Key2").map(_._2).collect()

Spark RowMatrix columnSimilarities preserve original index

I have the following Scala Spark DataFrame df of (String, Array[Double]): Note id is of type String (A base64 hash)
id, values
"a", [0.5, 0.6]
"b", [0.1, 0.2]
...
The dataset is quite large (45k) and I would like to perform a pairwise cosine similarity using org.apache.spark.mllib.linalg.distributed.RowMatrix for performance. This works, but I am not able to identify the pairwise similarities as the indexes have turned into integers (output columns i and j). How do I use IndexedRowMatrix to preserve the original indexes?
val rows = df.select("values")
.rdd
.map(_.getAs[org.apache.spark.ml.linalg.Vector](0))
.map(org.apache.spark.mllib.linalg.Vectors.fromML)
val mat = new RowMatrix(rows)
val simsEstimate = mat.columnSimilarities()
Ideally, the end result should look something like this:
id_x, id_y, similarity
"a", "b", 0.9
"b", "c", 0.8
...
columnSimilarities() compute similarities between columns of the RowMatrix, not among rows, so "ids" you have are meaningless in this context and indices are indices in each feature vector.
Additionally these methods are designed for long, narrow and data, so an obvious approach - encode id with StringIndexer, create IndedxedRowMatrix, transpose, compute similarities, and go back (with IndexToString) simply won't do.
Your best bet here is to take crossJoin
df.as("a").crossJoin(df.as("b")).where($"a.id" <= $"b.id").select(
$"a.id" as "id_x", $"b.id" as "id_y", cosine_similarity($"a.values", $b.values")
)
where
val cosine_similarity = udf((xs: Array[Double], ys: Array[Double]) => ???)
is something you have to implement yourself.
Alternatively you can explode the data:
import org.apache.spark.sql.functions.posexplode
val long = ds.select($"id", posexplode($"values")).toDF("item", "feature", "value")
and then use method shown in Spark Scala - How to group dataframe rows and apply complex function to the groups? to compute similarity.

Spark DataFrame: Is order of withColumn guaranteed?

Given the following code:
dataFrame
.withColumn("A", myUdf1($"x")) // withColumn1 from x
.withColumn("B", myUdf2($"y")) // withColumn2 from y
Is it guaranteed that withColumn1 will execute before withColumn2?
A better example:
dataFrame
.withColumn("A", myUdf1($"x")) // withColumn1 from x
.withColumn("B", myUdf2($"A")) // withColumn2 from A!!
Note that withColumn2 operates on A that is calculated from withColumn1.
I'm asking because I'm having inconsistent results over multiple runs of the same code and I started to think that this could be the source of the issue.
EDIT: Added more detailed code sample
val result = dataFrame
.groupBy("key")
.agg(
collect_list($"itemList").as("A"), // all items
collect_list(when($"click".isNotNull, $"itemList")).as("B") // subset of A
)
// create sparse item vector from all list of items A
.withColumn("vectorA", aggToSparseUdf($"A"))
// create sparse item vector from all list of items B (subset of A)
.withColumn("vectorB", aggToSparseUdf($"B"))
// calculate ratio vector B / A
.withColumn("ratio", divideVectors($"vectorB", $"vectorA"))
val keys: Seq[String] = result.head.getAs[Seq[String]]("key")
val values: Seq[SparseVector] = result.head.getAs[Seq[SparseVector]]("ratio")
It IS guaranteed that for each specific record in dataFrame, myUdf1 will be applied before myUdf2; However:
It is NOT guaranteed that myUdf1 will be applied to all records of dataFrame before myUdf2 is applied to any record - in other words, myUdf2 might be applied to some records before myUdf1 has been applied to other records
This is true because Spark would likely combine both operations together into a single stage, and execute this stage (applying myUdf1 and myUdf2) on each record of each partition.
This shouldn't pose any problem if your UDFs are "purely functional", or "idempotent", or cause no side effects - and they should be, because Spark assumes all transformations are such. If they weren't, Spark wouldn't be able to optimize execution by "combining" transformations together, running them in parallel on different partitions, retrying transformations etc.
EDIT: if you want to force UDF1 to be completely applied before applying UDF2 to any record, you'd have to force them into separate stages - this can be done, for example, by repartitioning the DataFrame:
// sample data:
val dataFrame = Seq("A", "V", "D").toDF("x")
// two UDFs with "side effects" (printing to console):
val myUdf1 = udf[String, String](x => {
println("In UDF1")
x.toLowerCase
})
val myUdf2 = udf[String, String](x => {
println("In UDF2")
x.toUpperCase
})
// repartitioning between UDFs
dataFrame
.withColumn("A", myUdf1($"x"))
.repartition(dataFrame.rdd.partitions.length + 1)
.withColumn("B", myUdf2($"A"))
.show()
// prints:
// In UDF1
// In UDF1
// In UDF1
// In UDF2
// In UDF2
// In UDF2
NOTE that this isn't bullet-proof either - if, for example, there are failures and retries, order can be once again non deterministic.

Selecting every 3rd element from a huge RDD [duplicate]

I'm looking for a way to split an RDD into two or more RDDs. The closest I've seen is Scala Spark: Split collection into several RDD? which is still a single RDD.
If you're familiar with SAS, something like this:
data work.split1, work.split2;
set work.preSplit;
if (condition1)
output work.split1
else if (condition2)
output work.split2
run;
which resulted in two distinct data sets. It would have to be immediately persisted to get the results I intend...
It is not possible to yield multiple RDDs from a single transformation*. If you want to split a RDD you have to apply a filter for each split condition. For example:
def even(x): return x % 2 == 0
def odd(x): return not even(x)
rdd = sc.parallelize(range(20))
rdd_odd, rdd_even = (rdd.filter(f) for f in (odd, even))
If you have only a binary condition and computation is expensive you may prefer something like this:
kv_rdd = rdd.map(lambda x: (x, odd(x)))
kv_rdd.cache()
rdd_odd = kv_rdd.filter(lambda kv: kv[1]).keys()
rdd_even = kv_rdd.filter(lambda kv: not kv[1]).keys()
It means only a single predicate computation but requires additional pass over all data.
It is important to note that as long as an input RDD is properly cached and there no additional assumptions regarding data distribution there is no significant difference when it comes to time complexity between repeated filter and for-loop with nested if-else.
With N elements and M conditions number of operations you have to perform is clearly proportional to N times M. In case of for-loop it should be closer to (N + MN) / 2 and repeated filter is exactly NM but at the end of the day it is nothing else than O(NM). You can see my discussion** with Jason Lenderman to read about some pros-and-cons.
At the very high level you should consider two things:
Spark transformations are lazy, until you execute an action your RDD is not materialized
Why does it matter? Going back to my example:
rdd_odd, rdd_even = (rdd.filter(f) for f in (odd, even))
If later I decide that I need only rdd_odd then there is no reason to materialize rdd_even.
If you take a look at your SAS example to compute work.split2 you need to materialize both input data and work.split1.
RDDs provide a declarative API. When you use filter or map it is completely up to Spark engine how this operation is performed. As long as the functions passed to transformations are side effects free it creates multiple possibilities to optimize a whole pipeline.
At the end of the day this case is not special enough to justify its own transformation.
This map with filter pattern is actually used in a core Spark. See my answer to How does Sparks RDD.randomSplit actually split the RDD and a relevant part of the randomSplit method.
If the only goal is to achieve a split on input it is possible to use partitionBy clause for DataFrameWriter which text output format:
def makePairs(row: T): (String, String) = ???
data
.map(makePairs).toDF("key", "value")
.write.partitionBy($"key").format("text").save(...)
* There are only 3 basic types of transformations in Spark:
RDD[T] => RDD[T]
RDD[T] => RDD[U]
(RDD[T], RDD[U]) => RDD[W]
where T, U, W can be either atomic types or products / tuples (K, V). Any other operation has to be expressed using some combination of the above. You can check the original RDD paper for more details.
** https://chat.stackoverflow.com/rooms/91928/discussion-between-zero323-and-jason-lenderman
*** See also Scala Spark: Split collection into several RDD?
As other posters mentioned above, there is no single, native RDD transform that splits RDDs, but here are some "multiplex" operations that can efficiently emulate a wide variety of "splitting" on RDDs, without reading multiple times:
http://silex.freevariable.com/latest/api/#com.redhat.et.silex.rdd.multiplex.MuxRDDFunctions
Some methods specific to random splitting:
http://silex.freevariable.com/latest/api/#com.redhat.et.silex.sample.split.SplitSampleRDDFunctions
Methods are available from open source silex project:
https://github.com/willb/silex
A blog post explaining how they work:
http://erikerlandson.github.io/blog/2016/02/08/efficient-multiplexing-for-spark-rdds/
def muxPartitions[U :ClassTag](n: Int, f: (Int, Iterator[T]) => Seq[U],
persist: StorageLevel): Seq[RDD[U]] = {
val mux = self.mapPartitionsWithIndex { case (id, itr) =>
Iterator.single(f(id, itr))
}.persist(persist)
Vector.tabulate(n) { j => mux.mapPartitions { itr => Iterator.single(itr.next()(j)) } }
}
def flatMuxPartitions[U :ClassTag](n: Int, f: (Int, Iterator[T]) => Seq[TraversableOnce[U]],
persist: StorageLevel): Seq[RDD[U]] = {
val mux = self.mapPartitionsWithIndex { case (id, itr) =>
Iterator.single(f(id, itr))
}.persist(persist)
Vector.tabulate(n) { j => mux.mapPartitions { itr => itr.next()(j).toIterator } }
}
As mentioned elsewhere, these methods do involve a trade-off of memory for speed, because they operate by computing entire partition results "eagerly" instead of "lazily." Therefore, it is possible for these methods to run into memory problems on large partitions, where more traditional lazy transforms will not.
One way is to use a custom partitioner to partition the data depending upon your filter condition. This can be achieved by extending Partitioner and implementing something similar to the RangePartitioner.
A map partitions can then be used to construct multiple RDDs from the partitioned RDD without reading all the data.
val filtered = partitioned.mapPartitions { iter => {
new Iterator[Int](){
override def hasNext: Boolean = {
if(rangeOfPartitionsToKeep.contains(TaskContext.get().partitionId)) {
false
} else {
iter.hasNext
}
}
override def next():Int = iter.next()
}
Just be aware that the number of partitions in the filtered RDDs will be the same as the number in the partitioned RDD so a coalesce should be used to reduce this down and remove the empty partitions.
If you split an RDD using the randomSplit API call, you get back an array of RDDs.
If you want 5 RDDs returned, pass in 5 weight values.
e.g.
val sourceRDD = val sourceRDD = sc.parallelize(1 to 100, 4)
val seedValue = 5
val splitRDD = sourceRDD.randomSplit(Array(1.0,1.0,1.0,1.0,1.0), seedValue)
splitRDD(1).collect()
res7: Array[Int] = Array(1, 6, 11, 12, 20, 29, 40, 62, 64, 75, 77, 83, 94, 96, 100)