rdd.sortByKey gives wrong result - scala

I copied sortByKey's body and renamed it sortByKey2, but they give different results. Why is the first result wrong here? This was run in Eclipse; I restarted Eclipse and still got the wrong result.
package test.spark

import org.apache.spark.sql.SparkSession

object RddTests {
  var spark = SparkSession.builder().appName("rdd-test").master("local[*]")
    .enableHiveSupport()
    .getOrCreate()
  val sc = spark.sparkContext

  def main(args: Array[String]) {
    //mapValues
    //combineWithKey
    //foldByKey
    sortByKey
    sortByKey2
  }

  def sortByKey() {
    val people = List(("Mobin", 2), ("Mobin", 1), ("Lucy", 2), ("Amy", 1), ("Lucy", 3), ("Lucy", 1))
    val rdd = sc.parallelize(people)
    val sortByKeyRDD = rdd.sortByKey()
    println; println("sortByKeyRDD")
    sortByKeyRDD.foreach(println)
  }

  def sortByKey2() {
    val people = List(("Mobin", 2), ("Mobin", 1), ("Lucy", 2), ("Amy", 1), ("Lucy", 3), ("Lucy", 1))
    val rdd = sc.parallelize(people)
    val sortByKeyRDD = rdd.sortByKey()
    println; println("sortByKeyRDD2")
    sortByKeyRDD.foreach(println)
  }
}
The output is:
[Stage 0:> (0 + 0) / 4]
sortByKeyRDD
(Mobin,2)
(Mobin,1)
(Amy,1)
(Lucy,2)
(Lucy,3)
(Lucy,1)
sortByKeyRDD2
(Amy,1)
(Mobin,2)
(Mobin,1)
(Lucy,2)
(Lucy,3)
(Lucy,1)

foreach does not guarantee that the elements will be processed in any particular order. If you do sortByKeyRDD.collect.foreach(println) you will see the results in order, although this assumes that your data will fit in driver memory.
As noted in the sortByKey documentation:
Calling collect or save on the resulting RDD will return or output an ordered list of records
[EDIT] Using toLocalIterator instead of collect limits the driver memory requirement to the largest single partition. Thanks to user8371915 for pointing that out in a comment.
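A minimal sketch of both approaches against the question's sortByKeyRDD:
// Brings the entire sorted RDD to the driver; only safe if it fits in driver memory.
sortByKeyRDD.collect().foreach(println)

// Streams the result partition by partition, preserving the sort order;
// the driver only needs room for the largest single partition.
sortByKeyRDD.toLocalIterator.foreach(println)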

It is important to understand here how methods like foreach() or sortByKey() work in Spark.
When you sort your data and then print the output using foreach(System.out::println), the driver distributes this function to each partition (i.e. to each node in a cluster, or to multiple threads on a single machine). Each partition then executes the foreach locally, which means you will not see the output in the order you expect.
Possible solutions that people suggest, though not the right approach for big data:
sortByKeyRDD.coalesce(1).foreach(System.out::println);
or
sortByKeyRDD.collect().forEach(System.out::println);
The solutions above are just for illustration; I do not recommend using them. If your data is large, they may throw an out-of-memory exception, because they try to collect all the data on the driver just to print the output.
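If the goal is only to inspect the result, a bounded alternative (an added sketch, not part of the original answer) is to pull just a few records; on a sorted RDD, take returns them in key order without collecting everything:
// Print only the first 10 records, in sorted order, without collecting the whole RDD.
sortByKeyRDD.take(10).foreach(println)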

Serialization on rdd vs dataframe Spark

EX1. This, with an RDD, gives a serialization error as we expect, with or without the enclosing object, val num being the culprit, fine:
object Example {
  val r = 1 to 1000000 toList
  val rdd = sc.parallelize(r, 3)
  val num = 1
  val rdd2 = rdd.map(_ + num)
  rdd2.collect
}
Example
EX2. Using a DataFrame in a similar fashion, however, does not. Why is that, as it looks sort of the same? What am I missing here?
object Example {
  import spark.implicits._
  import org.apache.spark.sql.functions._

  val n = 1
  val df = sc.parallelize(Seq(
    ("r1", 1, 1),
    ("r2", 6, 4),
    ("r3", 4, 1),
    ("r4", 1, 2)
  )).toDF("ID", "a", "b")
  df.repartition(3).withColumn("plus1", $"b" + n).show(false)
}
Example
The reason is not entirely clear to me for the DataFrame case; I would expect similar behaviour. It looks like Datasets circumvent some issues, but I may well be missing something.
Running on Databricks normally surfaces plenty of serialization issues, so I don't think the environment is affecting things; it's handy for testing.
The reason is simple and more fundamental than distinction between RDD and Dataset:
The first piece of code evaluates a function
_ + num
so the closure has to be computed, serialized, and evaluated.
The second piece of code doesn't. The expression
$"b" + n
is just a value (a Column expression), so no closure computation and subsequent serialization is required.
If this is still not clear you can think about it this way:
The former piece of code tells Spark how to do something.
The latter piece of code tells Spark what to do. The actual code that is executed is generated in a different scope.
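As a rough illustration (an added sketch, assuming spark.implicits._ is in scope): a Column expression like this can be built and printed without anything being serialized or executed, because it is only a description of the computation:
import org.apache.spark.sql.Column

val n = 1
val expr: Column = $"b" + n  // just an expression tree describing "b plus 1"
println(expr)                // prints the expression itself; no closure was serialized, nothing ran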
If your Dataset code were closer to its RDD counterpart, for example:
object Example {
  import spark.implicits._
  val num = 1
  spark.range(1000).map(_ + num).collect
}
or
object Example {
  import spark.implicits._
  import org.apache.spark.sql.functions._
  val num = 1
  val f = udf((x: Int) => x + num)
  spark.range(1000).select(f($"id")).collect
}
it would fail with a serialization exception, the same as the RDD version does.

What is the alternative and faster way to look up an element in an RDD

I am new to Scala and Spark. This is a simple example of my whole code:
package trouble.something

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Stack {
  def ExFunc2(looku: RDD[(Int, List[(Double, Int)])], ke: Int): Seq[List[(Double, Int)]] = {
    val y: Seq[List[(Double, Int)]] = looku.lookup(ke)
    val g = y.map { x =>
      x
      /* some functions here
      .
      .
      */
    }
    g
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("toy")
    val sc = new SparkContext(conf)

    val pi: RDD[(Int, List[(Double, Int)])] = sc.parallelize(Seq((1, List((9.0, 3), (7.0, 2))), (2, List((7.0, 1), (1.0, 3))), (3, List((1.0, 2), (9.0, 1)))))

    val res = ExFunc2(pi, 1)
    println(res)
  }
}
I am running on fairly large data and I need faster performance. Looking at Spark's web UI and a software profiler, the most time-consuming part is the lookup() function:
val y: Seq[List[(Double, Int)]] = looku.lookup(ke)
What is an alternative, faster way to look up an element in an RDD, rather than the lookup() function?
There is a discussion related to this problem, Spark: Fastest way to look up an element in an RDD. However, it does not give me any ideas.
You should not have performance issues with the lookup function if you use and scale it carefully.
def lookup(key: K): Seq[V]
Return the list of values in the RDD for key key. This operation is done efficiently if the RDD has a known partitioner by only searching the partition that the key maps to.
By default functions which generate a PairRdd use the HashPartitioner. So check what your spark.default.parallelism value is set to, since this is the number of partitions that the HashPartitioner will default to. You can tune that parameter to match the # of executors * # of cores per executor you are using.
You should confirm that your PairRdd does in fact have a known partitioner, and if it does not, use partitionBy to create one, or modify your existing code to use a HashPartitioner when the PairRdd is created.
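For example, a minimal sketch reusing the question's pi RDD (the partition count here is just an illustration; tune it as described below):
import org.apache.spark.HashPartitioner

// Give the pair RDD a known partitioner and keep it in memory, so each
// lookup() only scans the single partition that the key hashes to.
val partitioned = pi.partitionBy(new HashPartitioner(sc.defaultParallelism)).cache()

partitioned.lookup(1) // now touches only one partition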
let parallelismFactor = # of executors * # of cores per executor
Then if the lookup function is still too slow, you will need to increase the parallelismFactor you are using. Spark now knows which partition to look in, and as you increase the parallelismFactor you reduce the size of each partition, which speeds up the lookup.
Keep in mind that you may wish to have many times more partitions than executors * cores; you will have to benchmark your use case yourself, trying values from 1 to 10 times more partitions than executors * cores.

cache and persist datasets

I'd like to use an org.apache.flink.api.scala.DataSet object several times:
printing the number of rows using count(),
writing to a Neo4j database,
converting to a Gelly graph object,
etc.
With each of these actions, Flink completely recalculates the value of the DataSet instead of caching it. I can't find any cache() or persist() function like in Spark.
This has a huge impact on my application with ~1,000,000 records and many joins / coGroup usages etc.: the runtime seems to increase by a factor of 3, which is several hours! So how can I cache or persist datasets and reduce the runtime significantly?
I'm using the newest Flink release 1.3.2, and Scala 2.11.
Example:
package dummy

import org.apache.flink.api.scala._
import org.apache.flink.graph.scala.Graph
import org.apache.flink.graph.{Edge, Vertex}
import org.apache.logging.log4j.scala.Logging

object Trials extends Logging {
  def main(args: Array[String]) {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // some dataset which could be huge in reality
    val dataSet = env.fromElements((1, 436), (2, 235), (3, 67), (4, 51), (5, 15), (6, 62), (7, 155))

    // some complex joins, coGroup functions etc.
    val joined = dataSet.cross(dataSet).filter(tuple => (tuple._1._2 + tuple._2._2) % 7 == 0)

    // log the number of rows --> performs the join above
    logger.info(f"results contains ${joined.count()} rows")

    // convert to Gelly graph format
    val graph = Graph.fromDataSet(
      dataSet.map(nodeTuple => new Vertex[Long, Long](nodeTuple._1, nodeTuple._2)),
      joined.map(edgeTuple => new Edge[Long, String](edgeTuple._1._1, edgeTuple._2._1, "someValue")),
      env
    )

    // do something with the graph
    logger.info("get number of vertices")
    val numberOfVertices = graph.numberOfVertices()

    logger.info("get number of edges")
    val numberOfEdges = graph.numberOfEdges() // --> performs the join again!

    logger.info(f"the graph has ${numberOfVertices} vertices and ${numberOfEdges} edges")
  }
}
Required libs: log4j-core, log4j-api-scala_2.11, flink-core, flink-scala_2.11, flink-gelly-scala_2.10
I think that, in case you need to perform multiple operations on the same stream, it is worth using side outputs: https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/side_output.html.
Once you have performed some complex joins, coGroup functions etc. and obtained a joined dataset, you can send the values to different side outputs: one which will later calculate the count, and another which will do the other job. A rough sketch follows.
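This untested sketch is only my reading of that suggestion, using the DataStream API (the tag name and counting logic are made up, and note the question's code uses the DataSet API, where side outputs are not available):
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.util.Collector

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Hypothetical tag for a second consumer of the same records.
val countTag = OutputTag[Long]("row-count")

val joined: DataStream[(Int, Int)] = env.fromElements((1, 436), (2, 235), (3, 67))

val mainOutput = joined.process(new ProcessFunction[(Int, Int), (Int, Int)] {
  override def processElement(value: (Int, Int),
                              ctx: ProcessFunction[(Int, Int), (Int, Int)]#Context,
                              out: Collector[(Int, Int)]): Unit = {
    out.collect(value)       // main output: continues to the graph-building part
    ctx.output(countTag, 1L) // side output: feeds a separate counting job
  }
})

val countStream = mainOutput.getSideOutput(countTag) // count the elements here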

Spark UDF called more than once per record when DF has too many columns

I'm using Spark 1.6.1 and encountering strange behaviour: I'm running a UDF with some heavy computations (a physics simulation) on a DataFrame containing some input data, and building up a result DataFrame containing many columns (~40).
Strangely, my UDF is called more than once per record of my input DataFrame in this case (1.6 times as often), which I find unacceptable because it's very expensive. If I reduce the number of columns (e.g. to 20), then this behaviour disappears.
I managed to write down a small script which demonstrates this:
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions.udf

object Demo {

  case class Result(a: Double)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("Demo").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val numRuns = sc.accumulator(0) // to count the number of udf calls

    val myUdf = udf((i: Int) => { numRuns.add(1); Result(i.toDouble) })

    val data = sc.parallelize((1 to 100), numSlices = 5).toDF("id")

    // get results of UDF
    var results = data
      .withColumn("tmp", myUdf($"id"))
      .withColumn("result", $"tmp.a")

    // add many columns to dataframe (must depend on the UDF's result)
    for (i <- 1 to 42) {
      results = results.withColumn(s"col_$i", $"result")
    }

    // trigger action
    val res = results.collect()
    println(res.size) // prints 100

    println(numRuns.value) // prints 160
  }
}
Now, is there a way to solve this without reducing the number of columns?
I can't really explain this behaviour, but obviously the query plan somehow chooses a path where some of the records are calculated twice. This means that if we cache the intermediate result (right after applying the UDF), we might be able to "force" Spark not to recompute the UDF. And indeed, once caching is added it behaves as expected: the UDF is called exactly 100 times:
// get results of UDF
var results = data
  .withColumn("tmp", myUdf($"id"))
  .withColumn("result", $"tmp.a").cache()
Of course, caching has its own costs (memory...), but it might end up beneficial in your case if it saves many UDF calls.
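If memory is a concern, a possible variation (my addition, not part of the original answer) is to persist with a disk-backed storage level instead of the default cache():
import org.apache.spark.storage.StorageLevel

// Same idea as cache(), but partitions that do not fit in memory spill to disk.
var results = data
  .withColumn("tmp", myUdf($"id"))
  .withColumn("result", $"tmp.a")
  .persist(StorageLevel.MEMORY_AND_DISK)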
We had this same problem about a year ago and spent a lot of time until we finally figured out what the problem was.
We also had a very expensive UDF to calculate, and we found out that it gets calculated again and again every time we refer to its column. It just happened to us again a few days ago, so I decided to open a bug on this:
SPARK-18748
We also opened a question here then, but now I see the title wasn't so good:
Trying to turn a blob into multiple columns in Spark
I agree with Tzach about somehow "forcing" the plan to calculate the UDF. We did it in an uglier way, but we had to, because we couldn't cache() the data; it was too big:
val df = data.withColumn("tmp", myUdf($"id"))
val results = sqlContext.createDataFrame(df.rdd, df.schema)
  .withColumn("result", $"tmp.a")
update:
Now I see that my Jira ticket was linked to another one: SPARK-17728, which still didn't really handle this issue the right way, but it gives one more optional workaround:
val results = data.withColumn("tmp", explode(array(myUdf($"id"))))
  .withColumn("result", $"tmp.a")
In newer Spark versions (2.3+) we can mark UDFs as non-deterministic: https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/expressions/UserDefinedFunction.html#asNondeterministic():org.apache.spark.sql.expressions.UserDefinedFunction
i.e. use
val myUdf = udf(...).asNondeterministic()
This makes sure the UDF is only called once per record.
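A minimal sketch against the question's code (Spark 2.3+; everything else stays the same):
// Marking the UDF non-deterministic stops the optimizer from duplicating
// or re-evaluating it, so it should run once per input row.
val myUdf = udf((i: Int) => { numRuns.add(1); Result(i.toDouble) }).asNondeterministic()

var results = data
  .withColumn("tmp", myUdf($"id"))
  .withColumn("result", $"tmp.a")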

Scala distributed execution of function objects

Given the following function objects,
val f : Int => Double = (i:Int) => i + 0.1
val g1 : Double => Double = (x:Double) => x*10
val g2 : Double => Double = (x:Double) => x/10
val h : (Double,Double) => Double = (x:Double,y:Double) => x+y
and, for instance, 3 remote servers or nodes (IP xxx.xxx.xxx.1, IP 2 and IP 3), how can the execution of this program be distributed,
val fx = f(1)
val g1x = g1( fx )
val g2x = g2( fx )
val res = h ( g1x, g2x )
so that
fx is computed in IP 1,
g1x is computed in IP 2,
g2x is computed in IP 3,
res is computed in IP 1
Can Scala Akka or Apache Spark provide a simple approach to this?
Update
RPC (Remote Procedure Call) with Finagle, as suggested by @pkinsky, may be a feasible choice.
Consider load-balancing policies as a mechanism for selecting a node for execution, at the very least an "any free available node" policy.
I can speak for Apache Spark. It can do what you are looking for with the code below. But it's not designed for this kind of parallel computation. It is designed for parallel computation where you also have a large amount of parallel data distributed on many machines. So the solution looks a bit silly, as we distribute a single integer across a single machine for example (for f(1)).
Also, Spark is designed to run the same computation on all the data. So running g1() and g2() in parallel goes a bit against the design. (It's possible, but not elegant, as you see.)
// Distribute the input (1) across 1 machine.
val rdd1 = sc.parallelize(Seq(1), numSlices = 1)
// Run f() on the input, collect the results and take the first (and only) result.
val fx = rdd1.map(f(_)).collect.head
// The next stage's input will be (1, fx), (2, fx) distributed across 2 machines.
val rdd2 = sc.parallelize(Seq((1, fx), (2, fx)), numSlices = 2)
// Run g1() on one machine, g2() on the other.
val gxs = rdd2.map {
  case (1, x) => g1(x)
  case (2, x) => g2(x)
}.collect
val g1x = gxs(0)
val g2x = gxs(1)
// Same deal for h() as for f(). The input is (g1x, g2x), distributed to 1 machine.
val rdd3 = sc.parallelize(Seq((g1x, g2x)), numSlices = 1)
val res = rdd3.map { case (g1x, g2x) => h(g1x, g2x) }.collect.head
You can see that Spark code is based around the concept of RDDs. An RDD is like an array, except it's partitioned across multiple machines. sc.parallelize() creates such a parallel collection from a local collection. For example rdd2 in the above code will be created from the local collection Seq((1, fx), (2, fx)) and split across two machines. One machine will have Seq((1, fx)), the other will have Seq((2, fx)).
Next we do a transformation on the RDD. map is a common transformation that creates a new RDD of the same length by applying a function to each element. (Same as Scala's map.) The map we run on rdd2 will replace (1, x) with g1(x) and (2, x) with g2(x). So on one machine it will cause g1() to run, while on the other g2() will run.
Transformations are evaluated lazily; they only run when you want to access the results. The methods that access the results are called actions. The most straightforward example is collect, which downloads the contents of the entire RDD from the cluster to the local machine. (It is exactly the opposite of sc.parallelize().)
You can try and see all this if you download Spark, start bin/spark-shell, and copy your function definitions and the above code into the shell.