How to parallelize several apache spark rdds? - scala

I have the next code:
sc.parquetFile("some large parquet file with bc").registerTempTable("bcs")
sc.parquetFile("some large parquet file with imps").registerTempTable("imps")
val bcs = sc.sql("select * from bcs")
val imps = sc.sql("select * from imps")
I want to do:
bcs.map(x => wrapBC(x)).collect
imps.map(x => wrapIMP(x)).collect
but when I do this, it's running not async. I can to do it with Future, like that:
val bcsFuture = Future { bcs.map(x => wrapBC(x)).collect }
val impsFuture = Future { imps.map(x => wrapIMP(x)).collect }
val result = for {
bcs <- bcsFuture
imps <- impsFuture
} yield (bcs, imps)
Await.result(result, Duration.Inf) //this return (Array[Bc], Array[Imp])
I want to do this without Future, how can I do it?

Update This was originally composed before the question was updated. Given those updates, I agree with #stholzm's answer to use cartesian in this case.
There do exist a limited number of actions which will produce a FutureAction[A] for an RDD[A] and be executed in the background. These are available on the AsyncRDDActions class, and so long as you import SparkContext._ any RDD will can be implicitly converted to an AysnchRDDAction as needed. For your specific code example that would be:
bcs.map(x => wrapBC(x)).collectAsync
imps.map(x => wrapIMP(x)).collectAsync
In additionally to evaluating the DAG up to action in the background, the FutureAction produced has the cancel method to attempt to end processing early.
Caveat
This may not do what you think it does. If the intent is to get data from both sources and then combine them you're more likely to want to join or group the RDDs instead. For that you can look at the functions available in PairRDDFunctions, again available on RDDs through implicit conversion.
If the intention isn't to have the data graphs interact then so far in my experience then running batches concurrently might only serve to slow down both, though that may be a consequence of how the cluster is configured. If the resource manager is set up to give each execution stage a monopoly on the cluster in FIFO order (the default in standalone and YARN modes, I believe; I'm not sure about Mesos) then each of the asynchronous collects will contend with each other for that monopoly, run their tasks, then contend again for the next execution stage.
Compare this to using a Future to wrap blocking calls to downstream services or database, for example, where either the resources in question are completely separate or generally have enough resource capacity to handle multiple requests in parallel without contention.

Update: I misunderstood the question. The desired result is not the cartesian product Array[(Bc, Imp)].
But I'd argue that it does not matter how long the single map calls take because as soon as you add other transformations, Spark tries to combine them in an efficient way. As long as you only chain transformations on RDDs, nothing happens on the data. When you finally apply an action then the execution engine will figure out a way to produce the requested data.
So my advice would be to not think so much about the intermediate steps and avoid collect as much as possible because it will fetch all the data to the driver program.
It seems you are building a cartesian product yourself. Try cartesian instead:
val bc = bcs.map(x => wrapBC(x))
val imp = imps.map(x => wrapIMP(x))
val result = bc.cartesian(imp).collect
Note that collect is called on the final RDD and no longer on intermediate results.

You can use union for solve this problem. For example:
bcs.map(x => wrapBC(x).asInstanceOf[Any])
imps.map(x => wrapIMP(x).asInstanceOf[Any])
val result = (bcs union imps).collect()
val bcsResult = result collect { case bc: Bc => bc }
val impsResult = result collect { case imp: Imp => imp }
If you want to use sortBy or another operations, you can use inheritance of trait or main class.

Related

What's the right way to "log and skip" validated transformations in spark-streaming

I have a spark-streaming application where I want to do some data transformations before my main operation, but the transformation involves some data validation.
When the validation fails, I want to log the failure cases, and then proceed on with the rest.
Currently, I have code like this:
def values: DStream[String] = ???
def validate(element: String): Either[String, MyCaseClass] = ???
val validationResults = values.map(validate)
validationResults.foreachRDD { rdd =>
rdd.foreach {
case Left(error) => logger.error(error)
case _ =>
}
}
val validatedValues: DStream[MyCaseClass] =
validationResults.mapPartitions { partition =>
partition.collect { case Right(record) => record }
}
This currently works, but it feels like I'm doing something wrong.
Questions
As far as I understand, this will perform the validation twice, which is potentially wasteful.
Is it correct to use values.map(validation).persist() to solve that problem?
Even if I persist the values, it still iterates and pattern matches on all the elements twice. It feels like there should be some method I can use to solve this. On a regular scala collection, I might use some of the cats TraverseFilter api, or with fs2.Stream an evalMapFilter. What DStream api can I use for that? Maybe something with mapPartitions?
I would say that the best way to tackle this is to take advantage that the stdlib flatMap accepts Option
def values: DStream[String] = ???
def validate(element: String): Either[String, MyCaseClass] = ???
val validatedValues: DStream[MyCaseClass] =
values.map(validate).flatMap {
case Left(error) =>
logger.error(error)
None
case Right(record) =>
Some(record)
}
You can also be a little bit more verbose using mapPartitions which should be a little bit more efficient.
The 'best' option here depends a bit on the rest of your spark job and your version of spark.
Ideally you'd pick a mechanism natively understood by catalyst. The spark3 dataset observe listener may be what you're looking for there. (I haven't seen many examples of using this in the wild but it seems like this was the motivation behind such a thing.)
In pure spark sql, one option is to add a new column with the results of validation, e.g. a column named invalid_reason which is NULL if the record is valid or some [enumerated] string containing the reason the column failed validation. At this point, you likely want to persist/cache the dataset before doing a groupBy/count/collect/log operation, then filter where invalid_reason is null on the persisted dataframe and continue on the rest of the processing.
tl;dr: consider adding a validation column rather than just applying a 'validate' function. You then 'fork' processing here: log the records which have the invalid column specified, process the rest of your job on the records which don't. It does add some volume to your dataframe, but doesn't require processing the same records twice.

Spark httpclient job does not parallel well [duplicate]

I build a RDD from a list of urls, and then try to fetch datas with some async http call.
I need all the results before doing other calculs.
Ideally, I need to make the http calls on differents nodes for scaling considerations.
I did something like this:
//init spark
val sparkContext = new SparkContext(conf)
val datas = Seq[String]("url1", "url2")
//create rdd
val rdd = sparkContext.parallelize[String](datas)
//httpCall return Future[String]
val requests = rdd.map((url: String) => httpCall(url))
//await all results (Future.sequence may be better)
val responses = requests.map(r => Await.result(r, 10.seconds))
//print responses
response.collect().foreach((s: String) => println(s))
//stop spark
sparkContext.stop()
This work, but Spark job never finish !
So I wonder what is are the best practices for dealing with Future using Spark (or Future[RDD]).
I think this use case looks pretty common, but didn't find any answer yet.
Best regards
this use case looks pretty common
Not really, because it simply doesn't work as you (probably) expect. Since each task operates on standard Scala Iterators these operations will be squashed together. It means that all operations will be blocking in practice. Assuming you have three URLs ["x", "y", "z"] you code will be executed in a following order:
Await.result(httpCall("x", 10.seconds))
Await.result(httpCall("y", 10.seconds))
Await.result(httpCall("z", 10.seconds))
You can easily reproduce the same behavior locally. If you want to execute your code asynchronously you should handle this explicitly using mapPartitions:
rdd.mapPartitions(iter => {
??? // Submit requests
??? // Wait until all requests completed and return Iterator of results
})
but this is relatively tricky. There is no guarantee all data for a given partition fits into memory so you'll probably need some batching mechanism as well.
All of that being said I couldn't reproduce the problem you've described to is can be some configuration issue or a problem with httpCall itself.
On a side note allowing a single timeout to kill whole task doesn't look like a good idea.
I couldnt find an easy way to achieve this. But after several iteration of retries this is what I did and its working for a huge list of queries. Basically we used this to do a batch operation for a huge query into multiple sub queries.
// Break down your huge workload into smaller chunks, in this case huge query string is broken
// down to a small set of subqueries
// Here if needed to optimize further down, you can provide an optimal partition when parallelizing
val queries = sqlContext.sparkContext.parallelize[String](subQueryList.toSeq)
// Then map each one those to a Spark Task, in this case its a Future that returns a string
val tasks: RDD[Future[String]] = queries.map(query => {
val task = makeHttpCall(query) // Method returns http call response as a Future[String]
task.recover {
case ex => logger.error("recover: " + ex.printStackTrace()) }
task onFailure {
case t => logger.error("execution failed: " + t.getMessage) }
task
})
// Note:: Http call is still not invoked, you are including this as part of the lineage
// Then in each partition you combine all Futures (means there could be several tasks in each partition) and sequence it
// And Await for the result, in this way you making it to block untill all the future in that sequence is resolved
val contentRdd = tasks.mapPartitions[String] { f: Iterator[Future[String]] =>
val searchFuture: Future[Iterator[String]] = Future sequence f
Await.result(searchFuture, threadWaitTime.seconds)
}
// Note: At this point, you can do any transformations on this rdd and it will be appended to the lineage.
// When you perform any action on that Rdd, then at that point,
// those mapPartition process will be evaluated to find the tasks and the subqueries to perform a full parallel http requests and
// collect those data in a single rdd.
If you dont want to perform any transformation on the content like parsing the response payload, etc. Then you could use foreachPartition instead of mapPartitions to perform all those http calls immediately.
I finally made it using scalaj-http instead of Dispatch.
Call are synchronous, but this match my use case.
I think the Spark Job never finish using Dispatch because the Http connection was not closed properly.
Best Regards
This wont work.
You cannot expect the request objects be distributed and responses collected over a cluster by other nodes. If you do then the spark calls for future will never end. The futures will never work in this case.
If your map() make sync(http) requests then please collect responses within the same action/transformation call and then subject the results(responses) to further map/reduce/other calls.
In your case, please rewrite logic collect the responses for each call in sync and remove the notion of futures then all should be fine.

Iterator of InputStream: How to close the InputStreams?

I have an Iterator[InputStream] which i map over to retrieve the individual results:
val streams: Iterator[InputStream[CustomType]] = retrieveStreams()
val results: Iterator[MyResultType] = streams flatMap (c => transformToResult(c))
This works as expected, meaning I can retrieve values of type MyResultType from the results iterator. The only problem I have is that the individual InputStreams are never being closed. Is there any way to do this?
There is no magic way to close it, or at least to guarantee that it will get closed. Thus you have to close each stream yourself. Take a look at the Loan Pattern which makes it less error prone: Loaner Pattern in Scala.
In your case you don't have a single resource to release but rather a collection of resources, so adjust your custom loan pattern accordingly.
Since you are dealing with Iterator you might have unlimited supply of InputStreams, in that case your transformToResult function would have to close the stream or something else at the element level.
It could look something like this:
val streams: Iterator[InputStream[CustomType]] = retrieveStreams()
val results: Iterator[MyResultType] =
streams flatMap (c => yourLoaner(c)(transformToResult))

Do I need to I save intermediate subsets of data while building decision tree on spark recursively?

I am building a Decision Tree on Scala/Spark (on a 50 node cluster). Since my dataset is somewhat big (~ 2TB), I want to parallelise it.
My code looks like this
def buildTree(data: RDD[Array[Double]], numInstances: Int): Node = {
// Base case
if (numInstances < minInstances) {
return new Node(isLeaf = true)
}
/*
* Find best split for all columns in data
*/
val leftRDD = data.filter(leftSplitCriteria)
val rightRDD = data.filter(rightSplitCriteria)
val subset = Seq(leftRDD, rightRDD)
val counts = Seq(numLeft, numRight)
val children = (0 until 2).map(i =>
(i,subset(i),counts(i)))
.par.map(x => {buildTree(x._2,x._3)})
return new Node(children(0), children(1), Split)
}
My questions are
Scala being a lazy language, doesn't immediately compute the output of map/filter operation. So while building a new Node, do all the filters of parents, and parents of parents, are stacked up (and recursively applied)?
What would be the best approach to build the tree in parallel? Should I cache/save the dataset in the intermediate steps?
While running this code, is it sufficient to just provide num-executers, or would it make a difference if I give executor-cores, driver-cores etc.?
I assume that the numLeft is computed using leftRDD.count() and counting is an action and will force the computation of all the dependent RDDs.
You will actually make more than once the filtering in this case, once for the count and another time for each children dependence. You should cache your RDD to avoid double computation and you only need the last one so you can unpersist the previous one at every stage.
See Apache Spark Method returning an RDD (with Tail Recursion) for more explanation
Side note: Spark uses the lazy evaluation model, I think we don't say scala is a lazy language.
I ended up parallelising split finding at each level by features.
Refer
http://zhanpengfang.github.io/418home.html
http://tullo.ch/articles/speeding-up-decision-tree-training/

Graphx: I've got NullPointerException inside mapVertices

I want to use graphx. For now I just launchs it locally.
I've got NullPointerException in these few lines. First println works well, and second one fails.
..........
val graph: Graph[Int, Int] = Graph(users, relationships)
println("graph.inDegrees = " + graph.inDegrees.count) // this line works well
graph.mapVertices((id, v) => {
println("graph.inDegrees = " + graph.inDegrees.count) // but this one fails
42 // doesn't mean anything
}).vertices.collect
And it does not matter which method of 'graph' object I call. But 'graph' is not null inside 'mapVertices'.
Exception failure in TID 2 on host localhost:
java.lang.NullPointerException
org.apache.spark.graphx.impl.GraphImpl.mapReduceTriplets(GraphImpl.scala:168)
org.apache.spark.graphx.GraphOps.degreesRDD(GraphOps.scala:72)
org.apache.spark.graphx.GraphOps.inDegrees$lzycompute(GraphOps.scala:49)
org.apache.spark.graphx.GraphOps.inDegrees(GraphOps.scala:48)
ololo.MyOwnObject$$anonfun$main$1.apply$mcIJI$sp(Twitter.scala:42)
Reproduced using GraphX 2.10 on Spark 1.0.2. I'll give you a workaround and then explain what I think is happening. This works for me:
val c = graph.inDegrees.count
graph.mapVertices((id, v) => {
println("graph.inDegrees = " + c)
}).vertices.collect
In general, Spark gets prickly when you try to access an entire RDD or other distributed object (like a Graph) in code that's intended to execute in parallel on a single partition, like the function you're passing into mapVertices. But it's also usually a bad idea even when you can get it to work. (As a separate matter, as you've seen, when it doesn't work it tends to result in really unhelpful behavior.)
The vertices of a Graph are represented as an RDD, and the function you pass into mapVertices runs locally in the appropriate partitions, where it is given access to local vertex data: id and v. You really don't want the entire graph to be copied to each partition. In this case you just need to broadcast a scalar to each partition, so pulling it out solved the problem and the broadcast is really cheap.
There are tricks in the Spark APIs for accessing more complex objects in such a situation, but if you use them carelessly they will destroy your performance because they'll tend to introduce lots of communication. Often people are tempted to use them because they don't understand the computation model, rather than because they really need to, although that does happen too.
Spark does not support nested RDDs or user-defined functions that refer to other RDDs, hence the NullPointerException; see this thread on the spark-users mailing list. In this case, you're attempting to call count() on a Graph (which performs an action on a Spark RDD) from inside of a mapVertices() transformation, leading to a NullPointerException when mapVertices() attempts to access data structures that are only callable by the Spark driver.
In a nutshell, only the Spark driver can launch new Spark jobs; you can't call actions on RDDs from inside of other RDD actions.
See https://stackoverflow.com/a/23793399/590203 for another example of this issue.