Say I have a spark job that looks like following:
def loadTable1() {
  val table1 = sqlContext.jsonFile("s3://textfiledirectory/")
  table1.cache().registerTempTable("table1")
}

def loadTable2() {
  val table2 = sqlContext.jsonFile("s3://testfiledirectory2/")
  table2.cache().registerTempTable("table2")
}

def loadAllTables() {
  loadTable1()
  loadTable2()
}

loadAllTables()
How do I parallelize this Spark job so that both tables are created at the same time?
You don't need to parallelize it. The RDD/DF creation operations don't do anything. These data structures are lazy, so any actual calculation will only happen when you start using them. And when a Spark calculation does happen, it will be automatically parallelized (partition-by-partition). Spark will distribute the work across the executors. So you would not generally gain anything by introducing further parallelism.
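For example, a minimal sketch (the count query is only an illustration): cache() is itself lazy, so the cache is populated the first time an action touches the table, in parallel across the executors.

loadAllTables() // registers the temp tables; nothing is cached yet
sqlContext.sql("SELECT COUNT(*) FROM table1").collect() // the distributed work happens here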
Use Futures!
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

implicit val ec = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(10))

def loadAllTables() {
  Future { loadTable1() }
  Future { loadTable2() }
}
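Note that loadAllTables now returns before either load finishes. If you need to block until both tables are registered, one sketch (the timeout is arbitrary):

import scala.concurrent.Await
import scala.concurrent.duration._

def loadAllTables() {
  val f1 = Future { loadTable1() }
  val f2 = Future { loadTable2() }
  Await.result(f1.zip(f2), 10.minutes)
}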
You can do this with standard Scala threading mechanisms. Personally I'd build a list of (path, table name) pairs and map over it in parallel, as sketched below. You could also look at Futures or plain threads.
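A minimal sketch of that approach, assuming the parallel collections shipped with Scala up to 2.12 (the paths are the ones from the question):

val tables = List(
  ("s3://textfiledirectory/", "table1"),
  ("s3://testfiledirectory2/", "table2")
)
tables.par.foreach { case (path, name) =>
  sqlContext.jsonFile(path).cache().registerTempTable(name)
}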
I don't understand what is wrong with the code below. It works fine, and the HashMap typeMap gets updated, when my input data frame is not partitioned. But when the code runs against a partitioned data frame, typeMap stays empty and is never updated. What is wrong with this code? Thanks for all your help.
var typeMap = new mutable.HashMap[String, (String, Array[String])]

case class Combiner(/* ... */ mapTypes: mutable.HashMap[String, (String, Array[String])]) {

  def execute() {
    <...>
    val combinersResult = dfInput.rdd.aggregate(combiners.toArray)(incrementCount, mergeCount)
  }

  def updateTypes(arr: Array[String], tempMapTypes: mutable.HashMap[String, (String, Array[String])]): Unit = {
    <...>
    typeMap ++= tempMapTypes
  }

  def incrementCount(combiners: Array[Combiner], row: Row): Array[Combiner] = {
    for (i <- 0 until row.length) {
      val array = getMyType(row(i), tempMapTypes)
      combiners(i).updateTypes(array, tempMapTypes)
    }
    combiners
  }
It is a really bad idea to use mutable values in distributed computing. With Spark in particular, RDD operations are shipped from the driver to the executors and are executed in parallel on all the different machines in the cluster. Updates made to your mutable.HashMap are never sent back to the driver, so you are stuck with the empty map that got constructed on the driver in the first place.
So you need to rethink your data structures completely, preferring immutability, and remember that operations running on the executors are independent and parallel.
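A minimal sketch of the immutable alternative, with typesForRow standing in (hypothetically) for your per-row type extraction: let the aggregation build and return the merged map, so the result comes back to the driver as a return value instead of through a side effect:

val typeMap: Map[String, (String, Array[String])] =
  dfInput.rdd.aggregate(Map.empty[String, (String, Array[String])])(
    (acc, row) => acc ++ typesForRow(row), // seqOp: runs on the executors
    (left, right) => left ++ right         // combOp: merges the partial maps
  )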
I'm new to using Spark and Scala, but I have to solve the following problem:
I have one ORC file containing rows which I have to check against a condition coming from a hash map.
I build the hash map (filename → timestamp) with 120,000 entries this way (getTimestamp returns an Option[Long]):
val tgzFilesRDD = sc.textFile("...")
val fileNameTimestampRDD = tgzFilesRDD.map(itr => {
  (itr, getTimestamp(itr))
})
val fileNameTimestamp = fileNameTimestampRDD.collect.toMap
And retrieve the RDD with 6 million entries like this:
val sessionDataDF = sqlContext.read.orc("...")
case class SessionEvent(archiveName: String, eventTimestamp: Long)
val sessionEventsRDD = sessionDataDF.as[SessionEvent].rdd
And do the check:
val sessionEventsToReport = sessionEventsRDD.filter(se => {
  val timestampFromFile = fileNameTimestamp.getOrElse(se.archiveName, None)
  se.eventTimestamp < timestampFromFile.getOrElse[Long](Long.MaxValue)
})
Is this the right and performant way to do it? Is caching recommended?
Will the Map fileNameTimestamp get shuffled to the cluster nodes where the partitions are processed?
fileNameTimestamp will get serialized for each task, and with 120,000 entries, it may be quite expensive. You should broadcast large objects and reference the broadcast variables:
val fileNameTimestampBC = sc.broadcast(fileNameTimestampRDD.collect.toMap)
Now only one copy of this object will be shipped to each worker. There is also no need to drop down to the RDD API, since the Dataset API has a filter method:
val sessionEvents = sessionDataDF.as[SessionEvent]
val sessionEventsToReport = sessionEvents.filter(se => {
  val timestampFromFile = fileNameTimestampBC.value.getOrElse(se.archiveName, None)
  se.eventTimestamp < timestampFromFile.getOrElse[Long](Long.MaxValue)
})
The fileNameTimestamp map you collected exists on the Spark driver. For it to be referenced efficiently in a query like this, the worker nodes need access to it, and that is what broadcasting provides.
In essence, you have rediscovered the broadcast hash join: you are left-joining sessionEventsRDD with tgzFilesRDD to gain access to the optional timestamp, and then filtering accordingly.
When using RDDs, you need to code the joining strategy explicitly. The DataFrame/Dataset API has a query optimiser that can make the choice for you, and you can also explicitly ask the API to use the broadcast-join technique behind the scenes, as sketched below. You can find examples for both approaches here.
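For instance, a minimal sketch of the explicit DataFrame broadcast join (column names here are illustrative, and the Option is collapsed to a sentinel so the pairs fit in a DataFrame):

import org.apache.spark.sql.functions.{broadcast, coalesce, lit}
import sqlContext.implicits._

val timestampsDF = fileNameTimestampRDD
  .map { case (name, ts) => (name, ts.getOrElse(Long.MaxValue)) }
  .toDF("archiveName", "cutoff")

val sessionEventsToReport = sessionDataDF
  .join(broadcast(timestampsDF), Seq("archiveName"), "left")
  .where($"eventTimestamp" < coalesce($"cutoff", lit(Long.MaxValue)))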
Let me know if this is clear enough :)
It's quite unfortunate that take on RDD is a strict operation instead of a lazy one, but I won't get into why I think that's a regrettable design here and now.
My question is whether this is a suitable implementation of a lazy take for RDD. It seems to work, but I might be missing some non-obvious problem with it.
def takeRDD[T: scala.reflect.ClassTag](rdd: RDD[T], num: Long): RDD[T] =
  new RDD[T](rdd.context, List(new OneToOneDependency(rdd))) {
    // An unfortunate consequence of the way the RDD AST is designed
    var doneSoFar = 0L
    def isDone = doneSoFar >= num

    override def getPartitions: Array[Partition] = rdd.partitions

    // Should I do this? Doesn't look like I need to
    // override val partitioner = self.partitioner

    override def compute(split: Partition, ctx: TaskContext): Iterator[T] = new Iterator[T] {
      val inner = rdd.compute(split, ctx)

      override def hasNext: Boolean = !isDone && inner.hasNext

      override def next: T = {
        doneSoFar += 1
        inner.next
      }
    }
  }
Answer to your question
No, this doesn't work. There's no way to have a variable which can be seen and updated concurrently across a Spark cluster, and that's exactly what you're trying to use doneSoFar as. If you try this, then when you run compute (in parallel across many nodes), you:
a) serialize the takeRDD in the task, because you reference the class variable doneSoFar. This means that you write the class to bytes and make a new instance in each JVM (executor)
b) update doneSoFar in compute, which updates the local instance on each executor JVM. You'll take a number of elements from each partition equal to num.
It's possible this will work in Spark local mode due to some of the JVM properties there, but it certainly will not work when running Spark in cluster mode.
Why take is an action, not a transformation
RDDs are distributed, so subsetting to an exact number of elements is an inefficient operation -- it can't be done totally in parallel, since each shard needs information about the other shards (like whether it should be computed at all). take is great for bringing distributed data back into local memory.
rdd.sample is a similar operation that stays in the distributed world, and can be run in parallel easily.
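For a quick contrast (the fraction and seed here are arbitrary):

// sample stays distributed and runs per partition, in parallel
val subset = rdd.sample(withReplacement = false, fraction = 0.01, seed = 42L)
// take is an action: it pulls (up to) n elements back to the driver
val firstTen = rdd.take(10)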
I have a situation where an underlying function operates significantly more efficiently when given batches to work on. I have existing code like this:
// subjects: RDD[Subject]
val subjects = Subject.load(job, sparkContext, config)
val classifications = subjects.flatMap(subject => classify(subject)).reduceByKey(_ + _)
classifications.saveAsTextFile(config.output)
The classify method works on single elements but would be more efficient operating on groups of elements. I considered using coalesce to split the RDD into chunks and act on each chunk as a group; however, there are two problems with this:
I'm not sure how to return the mapped RDD.
classify doesn't know in advance how big the groups should be and it varies based on the contents of the input.
Sample code showing how classify could be called in an ideal situation (the output is kludgy, since it can't spill for very large inputs):
def classifyRdd(subjects: RDD[Subject]): RDD[(String, Long)] = {
  val classifier = new Classifier
  subjects.foreach(subject => classifier.classifyInBatches(subject))
  classifier.classifyRemaining
  classifier.results
}
This way classifyInBatches can have code like this internally:
def classifyInBatches(subject: Subject) {
  if (!internals.canAdd(subject)) {
    partialResults.add(internals.processExisting)
  }
  internals.add(subject) // Assumption: at least one will fit.
}
What can I do in Apache Spark that will allow behavior somewhat like this?
Try using the mapPartitions method, which allows your map function to consume a partition as an iterator and produce an iterator of output.
You should be able to write something like this:
subjectsRDD.mapPartitions { subjects =>
  val classifier = new Classifier
  subjects.foreach(subject => classifier.classifyInBatches(subject))
  classifier.classifyRemaining
  classifier.results // must be (or be converted to) an Iterator[(String, Long)]
}
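Note that the Classifier is constructed inside mapPartitions, so each task gets its own instance on the executor instead of one being serialized from the driver, and the function's result must be an Iterator. The reduceByKey(_ + _) and saveAsTextFile(config.output) from your original pipeline can then follow unchanged.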
I have written a Scala (2.9.1-1) application that needs to process several million rows from a database query. I am converting the ResultSet to a Stream using the technique shown in the answer to one of my previous questions:
class Record(...)
val resultSet = statement.executeQuery(...)
new Iterator[Record] {
  def hasNext = resultSet.next()
  def next = new Record(resultSet.getString(1), resultSet.getInt(2), ...)
}.toStream.foreach { record => ... }
and this has worked very well.
Since the body of the foreach closure is very CPU-intensive, and as a testament to the practicality of functional programming, adding a .par before the foreach makes the closures run in parallel with no other effort, as long as the body of the closure is thread-safe (mine is written in a functional style with no mutable data except printing to a thread-safe log).
However, I am worried about memory consumption. Is the .par causing the entire result set to load in RAM, or does the parallel operation load only as many rows as it has active threads? I've allocated 4G to the JVM (64-bit with -Xmx4g), but in the future I will be running it on even more rows and worry that I'll eventually get an out-of-memory error.
Is there a better pattern for doing this kind of parallel processing in a functional manner? I've been showing this application to my co-workers as an example of the value of functional programming and multi-core machines.
If you look at the scaladoc of Stream, you will notice that the definition class of par is the Parallelizable trait... and if you look at the source code of this trait, you will see that it takes each element from the original collection and puts it into a combiner; thus, every row will be loaded into a ParSeq:
def par: ParRepr = {
  val cb = parCombiner
  for (x <- seq) cb += x
  cb.result
}

/** The default `par` implementation uses the combiner provided by this method
 *  to create a new parallel collection.
 *
 *  @return a combiner for the parallel collection of type `ParRepr`
 */
protected[this] def parCombiner: Combiner[A, ParRepr]
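In other words, calling par on a Stream forces the whole stream before any parallel work begins. A quick illustration (don't actually evaluate the commented line):

val s = Stream.from(1) // an infinite lazy stream
// s.par              // would never return: par copies every element into a combiner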
A possible solution is to parallelize your computation explicitly, for example with actors. You can take a look at this example from the akka documentation, which might be helpful in your context.
The new akka stream library is the fix you're looking for:
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Sink, Source}

def iterFromQuery(): Iterator[Record] = {
  val resultSet = statement.executeQuery(...)
  new Iterator[Record] {
    def hasNext = resultSet.next()
    def next = new Record(...)
  }
}

def cpuIntensiveFunction(record: Record) = {
  ...
}

implicit val actorSystem = ActorSystem()
implicit val materializer = ActorMaterializer()
implicit val execContext = actorSystem.dispatcher

val poolSize = 10 // number of Records in memory at once

// fromIterator creates a fresh iterator each time the stream is materialized
val stream =
  Source.fromIterator(() => iterFromQuery())
    .runWith(Sink.foreachParallel(poolSize)(cpuIntensiveFunction))

stream onComplete { _ => actorSystem.terminate() }
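This directly addresses the memory worry: the source is an iterator pulled on demand, so backpressure keeps roughly poolSize records (plus small internal buffers) in memory at once, and the full result set is never materialized.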