cache and persist datasets - scala

I'd like to use an org.apache.flink.api.scala.DataSet object several times:
print the number of rows using count(),
writing to a neo4j database,
converting to a Gelly graph object,
etc.
With each of these actions, Flink completely recalculates the value of the DataSet instead of caching it. I can't find any cache() or persist() function like in Spark.
This does have a huge impact on my application with ~1.000.000 data with many joins / coGroup usages etc.: The runtime seems to increase by a factor of 3, which is several hours! So how can I cache or persist datasets and reduce the runtime significantly?
I'm using the newest Flink release 1.3.2, and Scala 2.11.
Example:
package dummy
import org.apache.flink.api.scala._
import org.apache.flink.graph.scala.Graph
import org.apache.flink.graph.{Edge, Vertex}
import org.apache.logging.log4j.scala.Logging
object Trials extends Logging {
def main(args: Array[String]) {
val env = ExecutionEnvironment.getExecutionEnvironment
// some dataset which could be huge in reality
val dataSet = env.fromElements((1, 436), (2, 235), (3, 67), (4, 51), (5, 15), (6, 62), (7, 155))
// some complex joins, coGroup functions etc.
val joined = dataSet.cross(dataSet).filter(tuple => (tuple._1._2 + tuple._2._2) % 7 == 0)
// log the number of rows --> performs the join above
logger.info(f"results contains ${joined.count()} rows")
// convert to Gelly graph format
val graph = Graph.fromDataSet(
dataSet.map(nodeTuple => new Vertex[Long, Long](nodeTuple._1, nodeTuple._2)),
joined.map(edgeTuple => new Edge[Long, String](edgeTuple._1._1, edgeTuple._2._1, "someValue")),
env
)
// do something with the graph
logger.info("get number of vertices")
val numberOfVertices = graph.numberOfVertices()
logger.info("get number of edges")
val numberOfEdges = graph.numberOfEdges() // --> performs the join again!
logger.info(f"the graph has ${numberOfVertices} vertices and ${numberOfEdges} edges")
}
}
Required libs: log4j-core, log4j-api-scala_2.11, flink-core, flink-scala_2.11, flink-gelly-scala_2.10

I think, in case you need to perform multiple operations on the same stream, it's worth using the side outputs - https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/side_output.html.
Once you have performed some complex joins, coGroup functions etc. and got a joined dataset, you can collect the values to different side outputs - the one which will later on calculate count, the other which will do the other job.

Related

What is the alternative and faster way to look up an element in an RDD

I am new in Scala and Spark. This is a simple example of my whole code:
package trouble.something
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
object Stack {
def ExFunc2(looku: RDD[(Int, List[(Double, Int)])], ke: Int): Seq[List[(Double, Int)]] = {
val y: Seq[List[(Double, Int)]] = looku.lookup(ke)
val g = y.map{x =>
x
/* some functions here
.
.
*/
}
g
}
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setMaster("local[*]").setAppName("toy")
val sc = new SparkContext(conf)
val pi: RDD[(Int, List[(Double, Int)])] = sc.parallelize(Seq((1, List((9.0, 3), (7.0, 2))), (2, List((7.0, 1), (1.0, 3))), (3, List((1.0, 2), (9.0, 1)))))
val res = ExFunc2(pi, 1)
println(res)
}
}
I am running a large enough data, and I need faster performance. By looking at Spark's web UI and a software profiler. The most consuming time is lookup() function:
val y: Seq[List[(Double, Int)]] = looku.lookup(ke)
What is an alternative and way to lookup an element in an RDD rather than lookup() function?
There is a discussion related to this problem Spark: Fastest way to look up an element in an RDD. However, it does not give me any idea.
You should not have performance issues with the lookup function if you use and scale it carefully.
def lookup(key: K): Seq[V]
Return the list of values in the RDD for key key. This operation is done efficiently if the RDD has a known partitioner by only searching the partition that the key maps to.
By default functions which generate a PairRdd use the HashPartitioner. So check what your spark.default.parallelism value is set to, since this is the number of partitions that the HashPartitioner will default to. You can tune that parameter to match the # of executors * # of cores per executor you are using.
You should confirm that your PairRdd does in fact have a known partitioner, and if it does not, use partitionBy to create one, or modify your existing code to use a HashPartitioner when the PairRdd is created.
let parallelismFactor = # of executors * # of cores per executor
Then if the lookup function is still too slow, you will need to increase the parallelismFactor you are using. Now spark will know which partition to lookup in, and as you increase the parallelismFactor, you will reduce the size of each partition, which will increase the speed of the lookup.
Keep in mind that you may wish to have many times more partitions then executors * cores, you will have to benchmark your use case yourself, trying values from 1-10 times more partitions then executors * cores.

rdd.sortByKey gives wrong result

I copied sortByKey's body and renamed to sortByKey2, but they give different results. Why the first result is wrong here? This was run in eclipse. I restarted eclipse and still got the wrong result.
package test.spark
import org.apache.spark.sql.SparkSession
object RddTests {
var spark = SparkSession.builder().appName("rdd-test").master("local[*]")
.enableHiveSupport()
.getOrCreate()
val sc = spark.sparkContext
def main(args: Array[String]) {
//mapValues
//combineWithKey
//foldByKey
sortByKey
sortByKey2
}
def sortByKey() {
val people = List(("Mobin", 2), ("Mobin", 1), ("Lucy", 2), ("Amy", 1), ("Lucy", 3), ("Lucy", 1))
val rdd = sc.parallelize(people)
val sortByKeyRDD = rdd.sortByKey()
println;println("sortByKeyRDD")
sortByKeyRDD.foreach(println)
}
def sortByKey2() {
val people = List(("Mobin", 2), ("Mobin", 1), ("Lucy", 2), ("Amy", 1), ("Lucy", 3), ("Lucy", 1))
val rdd = sc.parallelize(people)
val sortByKeyRDD = rdd.sortByKey()
println;println("sortByKeyRDD2")
sortByKeyRDD.foreach(println)
}
}
The output is:
[Stage 0:> (0 + 0) / 4]
sortByKeyRDD
(Mobin,2)
(Mobin,1)
(Amy,1)
(Lucy,2)
(Lucy,3)
(Lucy,1)
sortByKeyRDD2
(Amy,1)
(Mobin,2)
(Mobin,1)
(Lucy,2)
(Lucy,3)
(Lucy,1)
foreach does not guarantee that the elements will be processed in any particular order. If you do sortByKeyRDD.collect.foreach(println) you will see the results in order, although this assumes that your data will fit in driver memory.
As noted in the sortByKey documentation:
Calling collect or save on the resulting RDD will return or output an ordered list of records
[EDIT] Using toLocalIterator instead of collect limits the driver memory requirement to the largest single partition. Thanks to user8371915 for pointing that out in a comment.
It is important to understand here how methods like foreach() or sortByKey() works in spark.
When you try to sort your data and like to print the output using foreach(System.out::println), the driver distribute this method to each partition (i.e. node in case of cluster OR multiple threads in case of single machine). So each partition execute the foreach locally. This means you will not see the output that you want to see.
Possible solution that people suggest, though not right solution in Bigdata,
sortByKeyRDD.coalesce(1).foreach(System.out::println);
or
sortByKeyRDD.collect().forEach(System.out::println);
Above solution is just for understanding purpose, I do not recommend to use it. If your data is large, it might give you out of memory exception as it try to collect all the data at driver for printing the output.

Spark migrate sql window function to RDD for better performance

A function should be executed for multiple columns in a data frame
def handleBias(df: DataFrame, colName: String, target: String = target) = {
val w1 = Window.partitionBy(colName)
val w2 = Window.partitionBy(colName, target)
df.withColumn("cnt_group", count("*").over(w2))
.withColumn("pre2_" + colName, mean(target).over(w1))
.withColumn("pre_" + colName, coalesce(min(col("cnt_group") / col("cnt_foo_eq_1")).over(w1), lit(0D)))
.drop("cnt_group")
}
This can be written nicely as shown above in spark-SQL and a for loop. However this is causing a lot of shuffles (spark apply function to columns in parallel).
A minimal example:
val df = Seq(
(0, "A", "B", "C", "D"),
(1, "A", "B", "C", "D"),
(0, "d", "a", "jkl", "d"),
(0, "d", "g", "C", "D"),
(1, "A", "d", "t", "k"),
(1, "d", "c", "C", "D"),
(1, "c", "B", "C", "D")
).toDF("TARGET", "col1", "col2", "col3TooMany", "col4")
val columnsToDrop = Seq("col3TooMany")
val columnsToCode = Seq("col1", "col2")
val target = "TARGET"
val targetCounts = df.filter(df(target) === 1).groupBy(target)
.agg(count(target).as("cnt_foo_eq_1"))
val newDF = df.join(broadcast(targetCounts), Seq(target), "left")
val result = (columnsToDrop ++ columnsToCode).toSet.foldLeft(newDF) {
(currentDF, colName) => handleBias(currentDF, colName)
}
result.drop(columnsToDrop: _*).show
How can I formulate this more efficient using RDD API? aggregateByKeyshould be a good idea but is still not very clear to me how to apply it here to substitute the window functions.
(provides a bit more context / bigger example https://github.com/geoHeil/sparkContrastCoding)
edit
Initially, I started with Spark dynamic DAG is a lot slower and different from hard coded DAG which is shown below. The good thing is, each column seems to run independent /parallel. The downside is that the joins (even for a small dataset of 300 MB) get "too big" and lead to an unresponsive spark.
handleBiasOriginal("col1", df)
.join(handleBiasOriginal("col2", df), df.columns)
.join(handleBiasOriginal("col3TooMany", df), df.columns)
.drop(columnsToDrop: _*).show
def handleBiasOriginal(col: String, df: DataFrame, target: String = target): DataFrame = {
val pre1_1 = df
.filter(df(target) === 1)
.groupBy(col, target)
.agg((count("*") / df.filter(df(target) === 1).count).alias("pre_" + col))
.drop(target)
val pre2_1 = df
.groupBy(col)
.agg(mean(target).alias("pre2_" + col))
df
.join(pre1_1, Seq(col), "left")
.join(pre2_1, Seq(col), "left")
.na.fill(0)
}
This image is with spark 2.1.0, the images from Spark dynamic DAG is a lot slower and different from hard coded DAG are with 2.0.2
The DAG will be a bit simpler when caching is applied
df.cache
handleBiasOriginal("col1", df). ...
What other possibilities than window functions do you see to optimize the SQL?
At best it would be great if the SQL was generated dynamically.
The main point here is to avoid unnecessary shuffles. Right now your code shuffles twice for each column you want to include and the resulting data layout cannot be reused between columns.
For simplicity I assume that target is always binary ({0, 1}) and all remaining columns you use are of StringType. Furthermore I assume that the cardinality of the columns is low enough for the results to be grouped and handled locally. You can adjust these methods to handle other cases but it requires more work.
RDD API
Reshape data from wide to long:
import org.apache.spark.sql.functions._
val exploded = explode(array(
(columnsToDrop ++ columnsToCode).map(c =>
struct(lit(c).alias("k"), col(c).alias("v"))): _*
)).alias("level")
val long = df.select(exploded, $"TARGET")
aggregateByKey, reshape and collect:
import org.apache.spark.util.StatCounter
val lookup = long.as[((String, String), Int)].rdd
// You can use prefix partitioner (one that depends only on _._1)
// to avoid reshuffling for groupByKey
.aggregateByKey(StatCounter())(_ merge _, _ merge _)
.map { case ((c, v), s) => (c, (v, s)) }
.groupByKey
.mapValues(_.toMap)
.collectAsMap
You can use lookup to get statistics for individual columns and levels. For example:
lookup("col1")("A")
org.apache.spark.util.StatCounter =
(count: 3, mean: 0.666667, stdev: 0.471405, max: 1.000000, min: 0.000000)
Gives you data for col1, level A. Based on the binary TARGET assumption this information is complete (you get count / fractions for both classes).
You can use lookup like this to generate SQL expressions or pass it to udf and apply it on individual columns.
DataFrame API
Convert data to long as for RDD API.
Compute aggregates based on levels:
val stats = long
.groupBy($"level.k", $"level.v")
.agg(mean($"TARGET"), sum($"TARGET"))
Depending on your preferences you can reshape this to enable efficient joins or convert to a local collection and similarly to the RDD solution.
Using aggregateByKey
A simple explanation on aggregateByKey can be found here. Basically you use two functions: One which works inside a partition and one which works between partitions.
You would need to do something like aggregate by the first column and build a data structure internally with a map for every element of the second column to aggregate and collect data there (of course you could do two aggregateByKey if you want).
This will not solve the case of doing multiple runs on the code for each column you want to work with (you can do use aggregate as opposed to aggregateByKey to work on all data and put it in a map but that will probably give you even worse performance). The result would then be one line per key, if you want to move back to the original records (as window function does) you would actually need to either join this value with the original RDD or save all values internally and flatmap
I do not believe this would provide you with any real performance improvement. You would be doing a lot of work to reimplement things that are done for you in SQL and while doing so you would be losing most of the advantages of SQL (catalyst optimization, tungsten memory management, whole stage code generation etc.)
Improving the SQL
What I would do instead is attempt to improve the SQL itself.
For example, the result of the column in the window function appears to be the same for all values. Do you really need a window function? You can instead do a groupBy instead of a window function (and if you really need this per record you can try to join the results. This might provide better performance as it would not necessarily mean shuffling everything twice on every step).

Spark UDF called more than once per record when DF has too many columns

I'm using Spark 1.6.1 and encountering a strange behaviour: I'm running an UDF with some heavy computations (a physics simulations) on a dataframe containing some input data, and building up a result-Dataframe containing many columns (~40).
Strangely, my UDF is called more than once per Record of my input Dataframe in this case (1.6 times more often), which I find unacceptable because its very expensive. If I reduce the number of columns (e.g. to 20), then this behavior disappears.
I managed to write down a small script which demonstrates this:
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions.udf
object Demo {
case class Result(a: Double)
def main(args: Array[String]): Unit = {
val sc = new SparkContext(new SparkConf().setAppName("Demo").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val numRuns = sc.accumulator(0) // to count the number of udf calls
val myUdf = udf((i:Int) => {numRuns.add(1);Result(i.toDouble)})
val data = sc.parallelize((1 to 100), numSlices = 5).toDF("id")
// get results of UDF
var results = data
.withColumn("tmp", myUdf($"id"))
.withColumn("result", $"tmp.a")
// add many columns to dataframe (must depend on the UDF's result)
for (i <- 1 to 42) {
results=results.withColumn(s"col_$i",$"result")
}
// trigger action
val res = results.collect()
println(res.size) // prints 100
println(numRuns.value) // prints 160
}
}
Now, is there a way to solve this without reducing the number of columns?
I can't really explain this behavior - but obviously the query plan somehow chooses a path where some of the records are calculated twice. This means that if we cache the intermediate result (right after applying the UDF) we might be able to "force" Spark not to recompute the UDF. And indeed, once caching is added it behaves as expected - UDF is called exactly 100 times:
// get results of UDF
var results = data
.withColumn("tmp", myUdf($"id"))
.withColumn("result", $"tmp.a").cache()
Of course, caching has its own costs (memory...), but it might end up beneficial in your case if it saves many UDF calls.
We had this same problem about a year ago and spent a lot of time till we finally figured out what was the problem.
We also had a very expensive UDF to calculate and we found out that it gets calculated again and again for every time we refer to its column. Its just happened to us again a few days ago, so I decided to open a bug on this:
SPARK-18748
We also opened a question here then, but now I see the title wasn't so good:
Trying to turn a blob into multiple columns in Spark
I agree with Tzach about somehow "forcing" the plan to calculate the UDF. We did it uglier, but we had to, because we couldn't cache() the data - it was too big:
val df = data.withColumn("tmp", myUdf($"id"))
val results = sqlContext.createDataFrame(df.rdd, df.schema)
.withColumn("result", $"tmp.a")
update:
Now I see that my jira ticket was linked to another one: SPARK-17728, which still didn't really handle this issue the right way, but it gives one more optional work around:
val results = data.withColumn("tmp", explode(array(myUdf($"id"))))
.withColumn("result", $"tmp.a")
In newer spark verion (2.3+) we can mark UDFs as non-deterministic: https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/expressions/UserDefinedFunction.html#asNondeterministic():org.apache.spark.sql.expressions.UserDefinedFunction
i.e. use
val myUdf = udf(...).asNondeterministic()
This makes sure the UDF is only called once

Spark program performance - GC & Task Deserialization & Concurrent execution

I have a cluster of 4 machines, 1 master and three workers, each with 128G memory and 64 cores. I'm using Spark 1.5.0 in stand alone mode. My program reads data from Oracle tables using JDBC, then does ETL, manipulating data, and does machine learning tasks like k-means.
I have a DataFrame (myDF.cache()) which is join results with two other DataFrames, and cached. The DataFrame contains 27 million rows and the size of data is around 1.5G. I need to filter the data and calculate 24 histogram as follows:
val h1 = myDF.filter("pmod(idx, 24) = 0").select("col1").histogram(arrBucket)
val h2 = myDF.filter("pmod(idx, 24) = 1").select("col1").histogram(arrBucket)
// ......
val h24 = myDF.filter("pmod(idx, 24) = 23").select("col1").histogram(arrBucket)
Problems:
Since my DataFrame is cached, I expect the filter, select, and histogram is very fast. However, the actual time is about 7 seconds for each calculation, which is not acceptable. From UI, it show the GC time takes 5 seconds and Task Deserialization Time 4 seconds. I've tried different JVM parameters but cannot improve further. Right now I'm using
-Xms25G -Xmx25G -XX:MaxPermSize=512m -XX:+UseG1GC -XX:MaxGCPauseMillis=200 \
-XX:ParallelGCThreads=32 \
-XX:ConcGCThreads=8 -XX:InitiatingHeapOccupancyPercent=70
What puzzles me is that the size of data is nothing compared with available memory. Why does GC kick in every time filter/select/histogram running? Is there any way to reduce the GC time and Task Deserialization Time?
I have to do parallel computing for h[1-24], instead of sequential. I tried Future, something like:
import scala.concurrent.{Await, Future, blocking}
import scala.concurrent.ExecutionContext.Implicits.global
val f1 = Future{myDF.filter("pmod(idx, 24) = 1").count}
val f2 = Future{myDF.filter("pmod(idx, 24) = 2").count}
val f3 = Future{myDF.filter("pmod(idx, 24) = 3").count}
val future = for {c1 <- f1; c2 <- f2; c3 <- f3} yield {
c1 + c2 + c3
}
val summ = Await.result(future, 180 second)
The problem is that here Future only means jobs are submitted to the scheduler near-simultaneously, not that they end up being scheduled and run simultaneously. Future used here doesn't improve performance at all.
How to make the 24 computation jobs run simultaneously?
A couple of things you can try:
Don't compute pmod(idx, 24) all over again. Instead you can simply compute it once:
import org.apache.spark.sql.functions.{pmod, lit}
val myDfWithBuckets = myDF.withColumn("bucket", pmod($"idx", lit(24)))
Use SQLContext.cacheTable instead of cache. It stores table using compressed columnar storage which can be used to access only required columns and as stated in the Spark SQL and DataFrame Guide "will automatically tune compression to minimize memory usage and GC pressure".
myDfWithBuckets.registerTempTable("myDfWithBuckets")
sqlContext.cacheTable("myDfWithBuckets")
If you can, cache only the columns you actually need instead of projecting each time.
It is not clear for me what is the source of a histogram method (do you convert to RDD[Double] and use DoubleRDDFunctions.histogram?) and what is the argument but if you want to compute all histograms at the same time you can try to groupBy bucket and apply histogram once for example using histogram_numeric UDF:
import org.apache.spark.sql.functions.callUDF
val n: Int = ???
myDfWithBuckets
.groupBy($"bucket")
.agg(callUDF("histogram_numeric", $"col1", lit(n)))
If you use predefined ranges you can obtain a similar effect using custom UDF.
Notes
how to extract values computed by histogram_numeric? First lets create a small helper
import org.apache.spark.sql.Row
def extractBuckets(xs: Seq[Row]): Seq[(Double, Double)] =
xs.map(x => (x.getDouble(0), x.getDouble(1)))
now we can map using pattern matching as follows:
import org.apache.spark.rdd.RDD
val histogramsRDD: RDD[(Int, Seq[(Double, Double)])] = histograms.map{
case Row(k: Int, hs: Seq[Row #unchecked]) => (k, extractBuckets(hs)) }