Scala hashmap not getting appended

I don't understand what is wrong with the code below. It works fine, and the hashmap typeMap gets updated, when my input data frame is not partitioned. But when the same code runs in a partitioned environment, typeMap is always empty and never updated. What is wrong with this code? Thanks for all your help.
var typeMap = new mutable.HashMap[String, (String, Array[String])]

case class Combiner(/* other fields elided */ mapTypes: mutable.HashMap[String, (String, Array[String])]) {

  def execute() {
    <...>
    val combinersResult = dfInput.rdd.aggregate(combiners.toArray)(incrementCount, mergeCount)
  }

  def updateTypes(arr: Array[String], tempMapTypes: mutable.HashMap[String, (String, Array[String])]): Unit = {
    <...>
    typeMap ++= tempMapTypes
  }

  def incrementCount(combiners: Array[Combiner], row: Row): Array[Combiner] = {
    for (i <- 0 until row.length) {
      val array = getMyType(row(i), tempMapTypes)
      combiners(i).updateTypes(array, tempMapTypes)
    }
    combiners
  }

It is a really bad idea to use mutable values in distributed computing. With Spark in particular, RDD operations are shipped from the driver to the executors and are executed in parallel on all the different machines in the cluster. Updates made to your mutable.HashMap are never sent back to the driver, so you are stuck with the empty map that got constructed on the driver in the first place.
So you need to rethink your data structures, preferring immutability, and remember that operations running on the executors are independent and parallel.
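For illustration, here is a minimal sketch of that idea (dfInput comes from the question; keyOf and typeOf are hypothetical placeholders for the elided per-row logic): build the per-key type map with aggregate itself, so the merged result comes back to the driver as the return value instead of through a shared mutable map.

import org.apache.spark.sql.Row

// Hypothetical stand-ins for the elided per-row logic above.
def keyOf(row: Row): String = row.getString(0)
def typeOf(row: Row): (String, Array[String]) = ("someType", Array.empty[String])

// Each task builds its own partial map (seqOp), partial maps are merged pairwise (combOp),
// and the final merged map is returned to the driver as the result of aggregate.
val typeMap: Map[String, (String, Array[String])] =
  dfInput.rdd.aggregate(Map.empty[String, (String, Array[String])])(
    (acc, row) => acc + (keyOf(row) -> typeOf(row)), // seqOp: runs on the executors
    (left, right) => left ++ right                   // combOp: merges partial results
  )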

Related

How to update the "bytes written" count in custom Spark data source?

I created a Spark Data Source that uses the "older" DataSource V1 API to write data in a specific binary format that our measuring devices and some software require, i.e. my DefaultSource extends CreatableRelationProvider.
In the appropriate createRelation method I call my own custom method to write the data from the DataFrame passed in. I am doing this with the help of Hadoop's FileSystem API, initialized from the Hadoop Configuration one can pull out of the supplied DataFrame:
def createRelation(sqlContext: SQLContext,
                   mode: SaveMode,
                   parameters: Map[String, String],
                   data: DataFrame): BaseRelation = {
  val path = ... // get from parameters; the real code has more preparation here, checking save mode etc.
  MyCustomWriter.write(data, path)
  EchoingRelation(data) // small class that just wraps the data frame into a BaseRelation with TableScan
}
In MyCustomWriter I then do all sorts of things, and in the end I save the data as a side effect of map, mapPartitions and foreachPartition calls on the executors of the cluster, like this:
val confBytes = conf.toByteArray // implicit I wrote turning Hadoop Writables into a byte array, as Configuration isn't serializable

data.
  select(...).
  where(...).
  // much more
  as[Foo].
  mapPartitions { it =>
    val conf = confBytes.toWritable[Configuration] // vice versa of toByteArray
    val writeResult = customWriteRecords(it, conf) // writes data to disk using the Hadoop FS API
    writeResult.iterator
  }.
  // do more stuff
// do more stuff
While this approach works fine, I notice that when running this, the Output column in the Spark job UI is not updated. Is it somehow possible to propagate this information or do I have to wrap the data in Writables and use a Hadoop FileOutputFormat approach instead?
I found a hacky approach.
Inside a RDD/DF operation you can get OutputMetrics:
val metrics = TaskContext.get().taskMetrics().outputMetrics
This has the fields bytesWritten and recordsWritten. However, the setters are package-private to org.apache.spark.executor, so I created a "breakout object" in that package:
package org.apache.spark.executor

object OutputMetricsBreakout {
  def setRecordsWritten(outputMetrics: OutputMetrics,
                        recordsWritten: Long): Unit =
    outputMetrics.setRecordsWritten(recordsWritten)

  def setBytesWritten(outputMetrics: OutputMetrics,
                      bytesWritten: Long): Unit =
    outputMetrics.setBytesWritten(bytesWritten)
}
Then I can use:
val myBytesWritten = ... // calculate written bytes
OutputMetricsBreakout.setBytesWritten(metrics, myBytesWritten + metrics.bytesWritten)
This is a hack, but it is the only "simple" way I could come up with.
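For completeness, a usage sketch of where those calls go. writePartitions and its RDD[Array[Byte]] input are hypothetical, not the actual customWriteRecords from the question; the point is only that the metrics update has to happen inside the executor-side closure:

import org.apache.spark.TaskContext
import org.apache.spark.executor.OutputMetricsBreakout
import org.apache.spark.rdd.RDD

// Hypothetical writer: each partition writes its records and then reports its counts.
def writePartitions(rdd: RDD[Array[Byte]]): Unit =
  rdd.foreachPartition { it =>
    var bytes = 0L
    var records = 0L
    it.foreach { record =>
      // ... write the record with the Hadoop FileSystem API ...
      bytes += record.length
      records += 1
    }
    // Still inside the task, i.e. on the executor: update this task's output metrics
    // so the totals show up in the "Output" column of the Spark UI.
    val metrics = TaskContext.get().taskMetrics().outputMetrics
    OutputMetricsBreakout.setBytesWritten(metrics, metrics.bytesWritten + bytes)
    OutputMetricsBreakout.setRecordsWritten(metrics, metrics.recordsWritten + records)
  }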

Is this a suitable way to implement a lazy `take` on RDD?

It's quite unfortunate that take on RDD is a strict operation instead of lazy, but I won't get into why I think that's a regrettable design here and now.
My question is whether this is a suitable implementation of a lazy take for RDD. It seems to work, but I might be missing some non-obvious problem with it.
def takeRDD[T: scala.reflect.ClassTag](rdd: RDD[T], num: Long): RDD[T] =
  new RDD[T](rdd.context, List(new OneToOneDependency(rdd))) {
    // An unfortunate consequence of the way the RDD AST is designed
    var doneSoFar = 0L
    def isDone = doneSoFar >= num

    override def getPartitions: Array[Partition] = rdd.partitions

    // Should I do this? Doesn't look like I need to
    // override val partitioner = self.partitioner

    override def compute(split: Partition, ctx: TaskContext): Iterator[T] = new Iterator[T] {
      val inner = rdd.compute(split, ctx)

      override def hasNext: Boolean = !isDone && inner.hasNext

      override def next: T = {
        doneSoFar += 1
        inner.next
      }
    }
  }
Answer to your question
No, this doesn't work. There's no way to have a variable that can be seen and updated concurrently across a Spark cluster, and that's exactly what you're trying to use doneSoFar as. If you try this, then when compute runs (in parallel across many nodes), you:
a) serialize the takeRDD in the task, because you reference the class variable doneSoFar. This means the class is written to bytes and a new instance is made in each executor JVM.
b) update doneSoFar in compute, which updates the local instance on each executor JVM. You'll end up taking a number of elements from each partition equal to num.
It's possible this will work in Spark local mode due to some of the JVM properties there, but it CERTAINLY will not work when running Spark in cluster mode.
Why take is an action, not transformation
RDDs are distributed, and so subsetting to an exact number of elements is an inefficient operation -- it can't be done totally in parallel, since each shard needs information about the other shards (like whether it should be computed at all). Take is great for bringing distributed data back into local memory.
rdd.sample is a similar operation that stays in the distributed world, and can be run in parallel easily.
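For contrast, a minimal sketch of the two operations (assuming sc is an existing SparkContext):

// sample stays in the distributed world and is computed in parallel on the executors;
// take pulls an exact number of elements back to the driver.
val rdd = sc.parallelize(1 to 1000000)

val sampled = rdd.sample(withReplacement = false, fraction = 0.01) // still an RDD, lazy
val first10: Array[Int] = rdd.take(10)                             // local Array on the driver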

How to get truly atomic update for TrieMap.getOrElseUpdate

As I understand it, TrieMap.getOrElseUpdate is still not truly atomic; the fix only made the returned result consistent (before it, different callers could get different instances), so the updater function may still be called several times, as the documentation (for 2.11.7) says:
Note: This method will invoke op at most once. However, op may be invoked without the result being added to the map if a concurrent process is also trying to add a value corresponding to the same key k.
*I've checked that manually on 2.11.7: in practice op runs "at least once", i.e. it can be invoked more than once.
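For illustration, a minimal sketch of such a check (a hypothetical test that races many threads on one key; op can run more than once even though only one value is stored):

import scala.collection.concurrent.TrieMap
import java.util.concurrent.atomic.AtomicInteger

val m = new TrieMap[String, Int]()
val invocations = new AtomicInteger()

// Many threads racing on the same key: only one value ends up in the map,
// but the by-name op argument may be evaluated more than once.
(0 until 100000).par.foreach { _ =>
  m.getOrElseUpdate("key", invocations.incrementAndGet())
}
println(invocations.get()) // occasionally greater than 1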
How can I guarantee a one-time call (if I use TrieMap for factories)?
I think this solution should work for my requirements:
import scala.collection.concurrent.TrieMap
import java.util.concurrent.atomic.AtomicInteger

trait LazyComp { val get: Int }

val map = new TrieMap[String, LazyComp]()
val count = new AtomicInteger() // just for the test, you don't need it

def getSingleton(key: String) = {
  val v = new LazyComp {
    lazy val get = {
      // compute something
      count.incrementAndGet() // just for the test, you don't need it
    }
  }
  map.putIfAbsent(key, v).getOrElse(v).get
}
I believe lazy val actually uses synchronized internally, and the code inside get should be kept safe from exceptions.
However, performance could be improved in the future: SIP-20.
Test:
scala> (0 to 10000000).par.map(_ => getSingleton("zzz")).last
res8: Int = 1
P.S. Java has a computeIfAbsent method on ConcurrentHashMap which I could use as well.
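For comparison, a minimal sketch of that ConcurrentHashMap variant (getSingletonJava is a hypothetical name; written with an explicit java.util.function.Function so it also compiles on 2.11 without SAM conversion):

import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicInteger
import java.util.function.{Function => JFunction}

val jmap = new ConcurrentHashMap[String, Int]()
val jcount = new AtomicInteger() // just for the test

def getSingletonJava(key: String): Int =
  // computeIfAbsent evaluates the mapping function at most once per absent key;
  // other callers for the same key block until the computation completes.
  jmap.computeIfAbsent(key, new JFunction[String, Int] {
    override def apply(k: String): Int = jcount.incrementAndGet()
  })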

Memory consumption of a parallel Scala Stream

I have written a Scala (2.9.1-1) application that needs to process several million rows from a database query. I am converting the ResultSet to a Stream using the technique shown in the answer to one of my previous questions:
class Record(...)
val resultSet = statement.executeQuery(...)
new Iterator[Record] {
def hasNext = resultSet.next()
def next = new Record(resultSet.getString(1), resultSet.getInt(2), ...)
}.toStream.foreach { record => ... }
and this has worked very well.
Since the body of the foreach closure is very CPU intensive, and as a testament to the practicality of functional programming, if I add a .par before the foreach the closures run in parallel with no other effort, apart from making sure the body of the closure is thread safe (it is written in a functional style with no mutable data, except for printing to a thread-safe log).
However, I am worried about memory consumption. Is the .par causing the entire result set to load in RAM, or does the parallel operation load only as many rows as it has active threads? I've allocated 4G to the JVM (64-bit with -Xmx4g) but in the future I will be running it on even more rows and worry that I'll eventually get an out-of-memory.
Is there a better pattern for doing this kind of parallel processing in a functional manner? I've been showing this application to my co-workers as an example of the value of functional programming and multi-core machines.
If you look at the scaladoc of Stream, you will notice that the definition class of par is the Parallelizable trait... and, if you look at the source code of this trait, you will notice that it takes each element from the original collection and puts it into a combiner; thus, you will load every row into a ParSeq:
def par: ParRepr = {
  val cb = parCombiner
  for (x <- seq) cb += x
  cb.result
}

/** The default `par` implementation uses the combiner provided by this method
 *  to create a new parallel collection.
 *
 *  @return a combiner for the parallel collection of type `ParRepr`
 */
protected[this] def parCombiner: Combiner[A, ParRepr]
A possible solution is to parallelize your computation explicitly, using actors for example. You can take a look at this example from the akka documentation, which might be helpful in your context.
The new akka stream library is the fix you're looking for:
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Source, Sink}

def iterFromQuery(): Iterator[Record] = {
  val resultSet = statement.executeQuery(...)
  new Iterator[Record] {
    def hasNext = resultSet.next()
    def next = new Record(...)
  }
}

def cpuIntensiveFunction(record: Record) = {
  ...
}

implicit val actorSystem = ActorSystem()
implicit val materializer = ActorMaterializer()
implicit val execContext = actorSystem.dispatcher

val poolSize = 10 // number of Records in memory at once

val stream =
  Source.fromIterator(() => iterFromQuery())
    .runWith(Sink.foreachParallel(poolSize)(cpuIntensiveFunction))

stream onComplete { _ => actorSystem.shutdown() }

Scala Infinite Iterator OutOfMemory

I'm playing around with Scala's lazy iterators, and I've run into an issue. What I'm trying to do is read in a large file, do a transformation, and then write out the result:
object FileProcessor {
  def main(args: Array[String]) {
    val inSource = Source.fromFile("in.txt")
    val outSource = new PrintWriter("out.txt")
    try {
      // this "basic" lazy iterator works fine
      // val iterator = inSource.getLines

      // ...but this one, which incorporates my process method,
      // throws OutOfMemoryExceptions
      val iterator = process(inSource.getLines.toSeq).iterator
      while (iterator.hasNext) outSource.println(iterator.next)
    } finally {
      inSource.close()
      outSource.close()
    }
  }

  // processing in this case just means upper-casing every line
  private def process(contents: Seq[String]) = contents.map(_.toUpperCase)
}
So I'm getting an OutOfMemoryException on large files. I know you can run afoul of Scala's lazy Streams if you keep references to the head of the Stream around, so in this case I'm careful to convert the result of process() to an iterator and throw away the Seq it initially returns.
Does anyone know why this still causes O(n) memory consumption? Thanks!
Update
In response to fge and huynhjl, it seems like the Seq might be the culprit, but I don't know why. As an example, the following code works fine (and I'm using Seq all over the place). This code does not produce an OutOfMemoryException:
object FileReader {
  def main(args: Array[String]) {
    val inSource = Source.fromFile("in.txt")
    val outSource = new PrintWriter("out.txt")
    try {
      writeToFile(outSource, process(inSource.getLines.toSeq))
    } finally {
      inSource.close()
      outSource.close()
    }
  }

  @scala.annotation.tailrec
  private def writeToFile(outSource: PrintWriter, contents: Seq[String]) {
    if (!contents.isEmpty) {
      outSource.println(contents.head)
      writeToFile(outSource, contents.tail)
    }
  }

  private def process(contents: Seq[String]) = contents.map(_.toUpperCase)
}
As hinted by fge, modify process to take an iterator and remove the .toSeq. inSource.getLines is already an iterator.
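A minimal sketch of that change, keeping the rest of FileProcessor as in the question:

// process now consumes and produces an Iterator, so no earlier lines are retained.
private def process(contents: Iterator[String]): Iterator[String] =
  contents.map(_.toUpperCase)

// In main, with inSource and outSource opened as before:
val iterator = process(inSource.getLines) // no .toSeq, no extra .iterator
while (iterator.hasNext) outSource.println(iterator.next)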
Converting to a Seq will cause the items to be remembered: I think it converts the iterator into a Stream, which retains all the items.
Edit: OK, it's more subtle. By calling iterator on the result of process you are doing the equivalent of Iterator.toSeq.iterator, and this can cause an out-of-memory error:
scala> Iterator.continually(1).toSeq.iterator.take(300*1024*1024).size
java.lang.OutOfMemoryError: Java heap space
It may be the same issue as reported here: https://issues.scala-lang.org/browse/SI-4835. Note my comment at the end of the bug; this is from personal experience.