Spark RDD equivalent to Scala collections partition - scala

This is a minor issue with one of my Spark jobs that doesn't seem to cause any problems -- yet it annoys me every time I see it and I fail to come up with a better solution.
Say I have a Scala collection like this:
val myStuff = List(Try(2/2), Try(2/0))
I can partition this list into successes and failures with partition:
val (successes, failures) = myStuff.partition(_.isSuccess)
Which is nice. The implementation of partition only traverses the source collection once to build the two new collections. However, using Spark, the best equivalent I have been able to devise is this:
val myStuff: RDD[Try[???]] = sourceRDD.map(someOperationThatMayFail)
val successes: RDD[???] = myStuff.collect { case Success(v) => v }
val failures: RDD[Throwable] = myStuff.collect { case Failure(ex) => ex }
Aside from the difference of unpacking the Try (which is fine), this also requires traversing the data twice, which is annoying.
Is there any better Spark alternative that can split an RDD without multiple traversals? i.e. something with a signature like this, where partition has the behaviour of the Scala collections' partition rather than RDD partitioning:
val (successes: RDD[Try[???]], failures: RDD[Try[???]]) = myStuff.partition(_.isSuccess)
For reference, I previously used something like the below to solve this. The potentially failing operation is de-serializing some data from a binary format, and the failures have become interesting enough that they need to be processed and saved as an RDD rather than something logged.
def someOperationThatMayFail(data: Array[Byte]): Option[MyDataType] = {
  try {
    Some(deserialize(data))
  } catch {
    case e: MyDeserializationError =>
      logger.error(e)
      None
  }
}

There might be other solutions, but here you go:
Setup:
import scala.util._
val myStuff = List(Try(2/2), Try(2/0))
val myStuffInSpark = sc.parallelize(myStuff)
Execution:
val myStuffInSparkPartitioned = myStuffInSpark.aggregate((List[Try[Int]](), List[Try[Int]]()))(
  (accum, curr) => if (curr.isSuccess) (curr :: accum._1, accum._2) else (accum._1, curr :: accum._2),
  (first, second) => (first._1 ++ second._1, first._2 ++ second._2)
)
Let me know if you need an explanation
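If the two results need to stay distributed as RDDs rather than being aggregated on the driver as above, one common workaround (a sketch, not a single-pass solution) is to cache the intermediate RDD so that the two collect passes at least do not recompute the potentially failing operation:

import scala.util.{Failure, Success, Try}
import org.apache.spark.rdd.RDD

// Cache the mapped RDD so neither pass below recomputes the deserialization.
// MyDataType and deserialize are the question's types; the data is still scanned twice,
// but only from the cached blocks.
val myStuff: RDD[Try[MyDataType]] = sourceRDD.map(data => Try(deserialize(data)))
myStuff.cache()

val successes: RDD[MyDataType] = myStuff.collect { case Success(v) => v }
val failures:  RDD[Throwable]  = myStuff.collect { case Failure(ex) => ex }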

Related

Is there a Scala pattern for collecting error messages from a sequence of independent computations?

Say I have a series of independent computations that need to be combined at the end:
def makeSauce(): Try[Sauce] = Success(Sauce("marinara"))
def makeCrust(): Try[Crust] = Failure(new RuntimeException("stale bread"))
def makeCheese(): Try[Cheese] = Failure(new IllegalArgumentException("moldy cheese"))
for {
  sauce <- makeSauce()
  crust <- makeCrust()
  cheese <- makeCheese()
} yield {
  Pizza(sauce, crust, cheese)
}
But if any of my computations fail, I want a comprehensive error message describing all of the failures, not just the first. The above code doesn't do what I want; it only shows me the first failure ("stale bread"). Instead, it should throw a super-error message combining both of the others, which could be a sequence of exception objects, sequence of string error messages, or even just one giant concatenated string.
Alternatively, instead of throwing the exception, it could be composed into a top-level Failure object:
(for {
  sauce <- makeSauce()
  crust <- makeCrust()
  cheese <- makeCheese()
} yield (sauce, crust, cheese)) match {
  case Success((sauce, crust, cheese)) => Pizza(sauce, crust, cheese)
  case Failure(e) => throw e
}
The for-comprehension and the Try here are just for illustration; a working solution doesn't have to use for or Try. However, obviously the above syntax is expressive and clean, and almost does what I want, so ideally the solution will stick to it as closely as possible.
How can I achieve something like this in Scala?
Validated is what you are looking for. It is in Cats for example.
import cats.data.Validated
import cats.implicits.catsSyntaxTuple3Semigroupal

def valid(input: String): Validated[String, Int] =
  if (input.toIntOption.isEmpty) Validated.Invalid(s"$input is not a number.")
  else Validated.Valid(input.toInt)

def sumThreeNumbers(one: String, two: String, three: String): Validated[String, Int] =
  (valid(one), valid(two), valid(three)).mapN((oneInt, twoInt, threeInt) => oneInt + twoInt + threeInt)

sumThreeNumbers("1", "2", "3")
sumThreeNumbers("1", "two", "xxx")

// results
// val res0: cats.data.Validated[String,Int] = Valid(6)
// val res1: cats.data.Validated[String,Int] = Invalid(two is not a number.xxx is not a number.)
scala.util.Try behaves in a fail-fast manner: in a for-comprehension it executes sequentially and halts at the first failure.
What you need is a structure that aggregates all failures like cats.data.Validated or scalaz.Validation.
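Applied to the pizza example above, a minimal sketch (assuming Cats is on the classpath, and reusing makeSauce, makeCrust, makeCheese and Pizza from the question) lifts each Try into a ValidatedNel so that all failures accumulate:

import scala.util.Try
import cats.data.{Validated, ValidatedNel}
import cats.implicits._

// Lift a Try into a ValidatedNel[Throwable, A] so errors accumulate in a NonEmptyList
def lift[A](t: Try[A]): ValidatedNel[Throwable, A] =
  Validated.fromTry(t).toValidatedNel

val pizza: ValidatedNel[Throwable, Pizza] =
  (lift(makeSauce()), lift(makeCrust()), lift(makeCheese()))
    .mapN((sauce, crust, cheese) => Pizza(sauce, crust, cheese))
// Invalid(NonEmptyList(java.lang.RuntimeException: stale bread, java.lang.IllegalArgumentException: moldy cheese))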

Serialising temp collections created in Spark executors during task execution

I'm trying to find an effective way of writing collections created inside tasks to the output files of the job. For example, if we iterate over an RDD using foreach, we can create data structures that are local to the executor, e.g. the ListBuffer arr in the following code snippet. My problem is: how do I serialise arr and write it to a file?
(1) Should I use the FileWriter API, or will Spark's saveAsTextFile work?
(2) What are the advantages of using one over the other?
(3) Is there a better way of achieving the same?
PS: The reason I am using foreach instead of map is that I might not be able to transform all my RDD rows, and I want to avoid getting null values in the output.
val dataSorted: RDD[(Int, Int)] = <Some Operation>

val arr: ListBuffer[(String, String)] = ListBuffer[(String, String)]()

dataSorted.foreach {
  case (e, r) =>
    if (e.id > 1000) {
      arr += (("a", "b"))
    }
}
Thanks,
Devj
You should not use driver-side variables but Accumulators - there are articles about them with code examples here and here; this question may also be helpful - it contains a simplified code example of a custom AccumulatorParam.
Write your own accumulator that is able to add (String, String) pairs, or use the built-in CollectionAccumulator. It is an implementation of AccumulatorV2, the new accumulator API introduced in Spark 2.
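For example, a minimal sketch using the built-in CollectionAccumulator (assuming Spark 2.x and the dataSorted RDD from the question) could look like this:

import scala.collection.JavaConverters._

// CollectionAccumulator gathers elements added on executors and exposes them on the driver
val acc = sc.collectionAccumulator[(String, String)]("matchedPairs")

dataSorted.foreach { case (e, r) =>
  // the question's snippet tests e.id; with RDD[(Int, Int)] the key itself is compared here
  if (e > 1000) acc.add(("a", "b"))
}

// Read the accumulated values on the driver once the action has completed
val arr: List[(String, String)] = acc.value.asScala.toList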
Another way is to use Spark's built-in filter and map functions - thanks @ImDarrenG for suggesting flatMap, but I think filter and map will be easier:
val result: Array[(String, String)] = someRDD
  .filter(x => x._1 > 1000) // keep only good rows
  .map(x => ("a", "b"))
  .collect() // convert to an array on the driver
The Spark API saves you some file handling code but essentially achieves the same thing.
The exception is if you are not using, say, HDFS and do not want your output file to be partitioned (spread across the executors' file systems). In this case you will need to collect the data to the driver and use FileWriter to write to a single file or files, and how you achieve that will depend on how much data you have. If you have more data than the driver has memory, you will need to handle it differently.
As mentioned in another answer, you're creating an array in your driver, while adding items from your executors, which will not work in a cluster environment. Something like this might be a better way to map your data and handle nulls:
val outputRDD = dataSorted.flatMap {
  case (e, r) =>
    if (e.id > 1000) {
      Some(("a", "b"))
    } else {
      None
    }
}
// save outputRDD to file/s here using the appropriate method...
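To make that last comment concrete, here is a small sketch of the two options discussed above (saveAsTextFile for a partitioned output, or collect plus FileWriter for a single local file; both paths are hypothetical):

import java.io.{BufferedWriter, FileWriter}

// Option 1: let Spark write the output, partitioned across executors (e.g. to HDFS)
outputRDD.map { case (a, b) => s"$a,$b" }.saveAsTextFile("hdfs:///tmp/output")

// Option 2: collect to the driver and write a single local file
// (only safe when the collected data fits in driver memory)
val rows: Array[(String, String)] = outputRDD.collect()
val writer = new BufferedWriter(new FileWriter("/tmp/output.txt"))
try {
  rows.foreach { case (a, b) => writer.write(s"$a,$b\n") }
} finally {
  writer.close()
}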

In Apache Spark, how to make an RDD/DataFrame operation lazy?

Assuming that I would like to write a function foo that transforms a DataFrame:
object Foo {
  def foo(source: DataFrame): DataFrame = {
    ...complex iterative algorithm with a stopping condition...
  }
}
since the implementation of foo has many "actions" (collect, reduce, etc.), calling foo immediately triggers the expensive execution.
This is not a big problem; however, since foo only converts one DataFrame into another, by convention it would be better to allow lazy execution: the implementation of foo should run only when the resulting DataFrame or its derivative(s) are used on the driver (through another action).
So far, the only way to reliably achieve this is to write the whole implementation into a SparkPlan and superimpose it onto the DataFrame's execution, which is very error-prone and involves lots of boilerplate code. What is the recommended way to do this?
It is not exactly clear to me what you are trying to achieve, but Scala itself provides at least a few tools which you may find useful:
lazy vals:
val rdd = sc.range(0, 10000)
lazy val count = rdd.count // Nothing is executed here
// count: Long = <lazy>
count // count is evaluated only when it is actually used
// Long = 10000
call-by-name (denoted by => in the function definition):
def foo(first: => Long, second: => Long, takeFirst: Boolean): Long =
  if (takeFirst) first else second
val rdd1 = sc.range(0, 10000)
val rdd2 = sc.range(0, 10000)
foo(
{ println("first"); rdd1.count },
{ println("second"); rdd2.count },
true // Only first will be evaluated
)
// first
// Long = 10000
Note: In practice you should create local lazy binding to make sure that arguments are not evaluated on every access.
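A small sketch of that pattern, using the same foo as above with local lazy vals so each by-name argument is forced at most once:

def foo(first: => Long, second: => Long, takeFirst: Boolean): Long = {
  lazy val f = first  // evaluated at most once, and only if accessed
  lazy val s = second
  if (takeFirst) f else s
}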
infinite lazy collections like Stream
import org.apache.spark.mllib.random.RandomRDDs._
val initial = normalRDD(sc, 1000000L, 10)
// Infinite stream of RDDs and actions, and nothing blows up :)
val stream: Stream[RDD[Double]] = Stream(initial).append(
  stream.map {
    case rdd if !rdd.isEmpty =>
      val mu = rdd.mean
      rdd.filter(_ > mu)
    case _ => sc.emptyRDD[Double]
  }
)
Some subset of these should be more than enough to implement complex lazy computations.

Distributing a loop to different machines of a cluster in Spark

Here is a for loop that I'm running in my code:
for (x <- 0 to vertexArray.length - 1) {
  for (y <- 0 to vertexArray.length - 1) {
    breakable {
      if (x.equals(y)) {
        break
      } else {
        var d1 = vertexArray(x)._2._2
        var d2 = vertexArray(y)._2._2
        val ps = new Period(d1, d2)
        if (ps.getMonths() == 0 && ps.getYears() == 0 && Math.abs(ps.toStandardHours().getHours()) <= 5) {
          edgeArray += Edge(vertexArray(x)._1, vertexArray(y)._1, Math.abs(ps.toStandardHours().getHours()))
        }
      }
    }
  }
}
I want to speed up the running time of this code by distributing it across multiple machines in a cluster. I'm using Scala in IntelliJ IDEA with Spark. How would I implement this type of code to work on multiple machines?
As already stated by Mariano Kamp, Spark is probably not a good choice here and there are much better options out there. On top of that, any approach which has to work on relatively large data and requires O(N^2) time is simply unacceptable. So the first thing you should do is focus on choosing a suitable algorithm, not a platform.
Still, it is possible to translate it to Spark. A naive approach which directly reflects your code would be to use a Cartesian product:
def check(v1: T, v2: T): Option[U] = {
  if (v1 == v2) {
    None
  } else {
    // rest of your logic, Some[U] if all tests passed
    // None otherwise
    ???
  }
}

val vertexRDD = sc.parallelize(vertexArray)

vertexRDD.cartesian(vertexRDD)
  .map { case (v1, v2) => check(v1, v2) }
  .filter(_.isDefined)
  .map(_.get)
If vertexArray is small you could use flatMap with a broadcast variable:
val vertexBd = sc.broadcast(vertexArray)

vertexRDD.flatMap(v1 =>
  vertexBd.value.map(v2 => check(v1, v2)).filter(_.isDefined).map(_.get)
)
Another improvement is to perform a proper join. The obvious condition is year and month:
def toPair(v: T): ((Int, Int), T) = ??? // Return ((year, month), vertex)

val vertexPairs = vertexRDD.map(toPair)

vertexPairs.join(vertexPairs)
  .map { case ((_, _), (v1, v2)) => check(v1, v2) } // check should be simplified here
  .filter(_.isDefined)
  .map(_.get)
Of course this can be achieved with a broadcast variable as well. You simply have to group vertexArray by (year, month) pair and broadcast Map[(Int, Int), T].
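A rough sketch of that broadcast variant, under the same placeholder types (T, toPair and check are the helpers defined above; the grouped lookup yields a Seq of vertices per key):

// Group the vertices by (year, month) and broadcast the lookup table,
// so each vertex is only checked against candidates sharing its own key.
val byMonth: Map[(Int, Int), Seq[T]] =
  vertexArray.map(toPair).groupBy(_._1).mapValues(_.map(_._2).toSeq).toMap

val byMonthBd = sc.broadcast(byMonth)

val matched = vertexRDD.flatMap { v1 =>
  byMonthBd.value.getOrElse(toPair(v1)._1, Seq.empty)
    .flatMap(v2 => check(v1, v2))
}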
From here you can improve further by avoiding naive checks: partition the data and traverse it sorted by timestamp:
def sortPartitionByDatetime(iter: Iterator[U]): Iterator[U] = ???

def yieldMatching(iter: Iterator[U]): Iterator[V] = {
  // flatMap keeping track of values in an open window
  ???
}

vertexPairs
  .partitionBy(new HashPartitioner(n))
  .mapPartitions(sortPartitionByDatetime)
  .mapPartitions(yieldMatching)
or using a DataFrame with window function and range clause.
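As a rough illustration of that DataFrame route (assuming Spark 2.x and a hypothetical vertexDF with columns "id" and "ts", where ts is a timestamp), a range frame of plus or minus five hours within each (year, month) partition collects the candidate neighbours for every row:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Sketch only: the window orders by epoch seconds so rangeBetween can express "+/- 5 hours"
val w = Window
  .partitionBy(year(col("ts")), month(col("ts")))
  .orderBy(unix_timestamp(col("ts")))
  .rangeBetween(-5 * 3600, 5 * 3600)

// Every row gets the ids of all vertices within five hours in the same year and month
val withNeighbours = vertexDF.withColumn("neighbours", collect_list(col("id")).over(w))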
Note:
All types are simply placeholders. In the future, please try to provide type information. Right now all I can tell is that there are some tuples and dates involved.
Welcome to Stack Overflow. Unfortunately this is not the right approach ;(
Spark is not a tool to parallelize tasks, but to parallelize data.
So you need to think how you can distribute/parallelize/partition your data, then compute the individual partitions, then consolidate the results as a last step.
Also you need to read up on Spark in general. A simple answer here cannot get you started. This is just the wrong format.
Start here: http://spark.apache.org/docs/latest/programming-guide.html

Splitting a scalaz-stream process into two child streams

Using scalaz-stream is it possible to split/fork and then rejoin a stream?
As an example, let's say I have the following code:
val streamOfNumbers: Process[Task, Int] = Process.emitAll(1 to 10)
val sumOfEvenNumbers = streamOfNumbers.filter(isEven).fold(0)(add)
val sumOfOddNumbers = streamOfNumbers.filter(isOdd).fold(0)(add)

sumOfEvenNumbers.zip(sumOfOddNumbers).to(someEffectfulFunction)
With scalaz-stream, in this example the results would be as you expect - a tuple of numbers from 1 to 10 passed to a sink.
However if we replace streamOfNumbers with something that requires IO, it will actually execute the IO action twice.
Using a Topic I'm able to create a pub/sub process that duplicates elements in the stream correctly, however it does not buffer - it simply consumes the entire source as fast as possible, regardless of the pace at which the sinks consume it.
I can wrap this in a bounded Queue, however the end result feels a lot more complex than it needs to be.
Is there a simpler way of splitting a stream in scalaz-stream without duplicate IO actions from the source?
Also, to clarify: the previous answer deals with the "splitting" requirement. The solution to your specific issue may not require splitting streams at all:
val streamOfNumbers: Process[Task, Int] = Process.emitAll(1 to 10)

val oddOrEven: Process[Task, Int \/ Int] = streamOfNumbers.map {
  case even if even % 2 == 0 => right(even)
  case odd => left(odd)
}
val summed = oddOrEven.pipeW(sump1).pipeO(sump1)
val evenSink: Sink[Task,Int] = ???
val oddSink: Sink[Task,Int] = ???
summed
.drainW(evenSink)
.to(oddSink)
You can perhaps still use a topic and just ensure that the child processes subscribe before you push to the topic.
However, please note this solution does not have any bounds on it, i.e. if you push too fast, you may encounter an OOM error.
def split[A](source: Process[Task, A]): Process[Task, (Process[Task, A], Process[Task, A])] = {
  val topic = async.topic[A]
  val sub1 = topic.subscribe
  val sub2 = topic.subscribe
  merge.mergeN(Process(emit(sub1 -> sub2), (source to topic.publish).drain))
}
I likewise needed this functionality. My situation was quite a bit trickier, which prevented me from working around it in this manner.
Thanks to Daniel Spiewak's response in this thread, I was able to get the following to work. I improved on his solution by adding onHalt so my application would exit once the Process completed.
def split[A](p: Process[Task, A], limit: Int = 10): Process[Task, (Process[Task, A], Process[Task, A])] = {
  val left = async.boundedQueue[A](limit)
  val right = async.boundedQueue[A](limit)

  val enqueue = p.observe(left.enqueue).observe(right.enqueue).drain.onHalt { cause =>
    Process.await(Task.gatherUnordered(Seq(left.close, right.close))) { _ => Halt(cause) }
  }
  val dequeue = Process((left.dequeue, right.dequeue))

  enqueue merge dequeue
}