I use Spark Streaming to process imported data. The imported data is stored in a DStream. Further, I have the classes Creation and Update, which hold a Foo object.
One of the tasks I want to accomplish is change detection.
So I join two RDDs (one holds the batch of data currently being processed, the other holds the current state). currentState is empty initially.
// Keys made explicit so the joins type-check; Key is a placeholder for whatever the Foos are keyed by.
// Creation and Update both expose the wrapped Foo as .foo
val stream: DStream[(Key, Foo)]
var currentState: RDD[(Key, Foo)] // empty at the start

val changes = stream
  .transform { batch =>
    val joined = batch.leftOuterJoin(currentState).map {
      case (key, (objectNew, Some(objectOld))) => (key, Update(objectNew))
      case (key, (objectNew, None))            => (key, Creation(objectNew))
    }
    currentState = currentState.fullOuterJoin(joined).map {
      case (key, (Some(foo), None)) => (key, foo)
      case (key, (_, Some(change))) => (key, change.foo)
    }
    joined
  }.cache()
Afterwards I filter out the updates.
changes.filter { case (_, change) => !change.isInstanceOf[Update] }
I now import the same data twice. Since the state is empty initially, the result set of the first import consists of Creation objects, while the second one results only in Update objects. So, after the filter, the second result set is empty.
In this case I notice a massive performance decrease. It works fine if I leave out the filter.
I can't imagine this is intended behavior but maybe it is a problem with Spark Computation Internals. Can anyone explain why this happens?
I'm trying to learn Scala and having a good bit of fun, but I'm running into this classic problem. It reminds me a lot of nested callback hell in the early days of NodeJS.
Here's my program in pseudocode:
A task to fetch a list of S3 Buckets.
After task one completes I want to batch the processing of buckets in groups of ten.
For each batch:
Get every bucket's region.
Filter out buckets that are not in the desired region.
List all the objects in each bucket.
println everything
At one point I wind up with the type: Task[Iterator[Task[List[Bucket]]]]
Essentially:
The outer Task is the initial step that lists all the S3 buckets, and the inner Iterator[Task[List[Bucket]]] comes from trying to batch Tasks that return lists.
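Roughly, this is how the type arises; a sketch with made-up names, since my real code is more involved:

import monix.eval.Task

type Bucket = String // simplified placeholder

def listBuckets: Task[List[Bucket]] = Task(???)                       // task one
def processBatch(batch: List[Bucket]): Task[List[Bucket]] = Task(???) // region check + listing

val nested: Task[Iterator[Task[List[Bucket]]]] =
  listBuckets.map(buckets => buckets.grouped(10).map(processBatch))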
I would hope there's some way to remove/flatten the outer Task to get to Iterator[Task[List[Bucket]]].
When I try to break down my processing into steps the deep nesting causes me to do many nested maps. Is this the right thing to do or is there a better way to handle this nesting?
In this particular case, I would suggest something like FS2 with Monix as F:
import cats.implicits._
import monix.eval._, monix.execution._
import monix.execution.Scheduler.Implicits.global // implicit Scheduler needed by runToFuture below
import fs2._
// use your own types here
type BucketName = String
type BucketRegion = String
type S3Object = String
// use your own implementations as well
val fetchS3Buckets: Task[List[BucketName]] = Task(???)
val bucketRegion: BucketName => Task[BucketRegion] = _ => Task(???)
val listObject: BucketName => Task[List[S3Object]] = _ => Task(???)
Stream.evalSeq(fetchS3Buckets)
  .parEvalMap(10) { name =>
    // checking region, filtering and listing on batches of 10
    bucketRegion(name).flatMap {
      case "my-region" => listObject(name)
      case _           => Task.pure(List.empty)
    }
  }
  .foldMonoid // combines List[S3Object] together
  .compile.lastOrError // turns into Task with result
  .map(list => println(s"Result: $list"))
  .onErrorHandle { case error: Throwable => println(error) }
  .runToFuture // or however you handle it
FS2 underneath uses cats.effect.IO or Monix Task, or whatever you want, as long as it provides the Cats Effect type classes. It gives you a nice, functional DSL for designing streams of data, so you can use reactive streams without Akka Streams.
There is one little problem here: we print all results at once, which might be a bad idea if there are more of them than memory can handle. We could do the printing in batches instead (I wasn't sure whether that is what you wanted or not), or make filtering and printing separate steps:
Stream.evalSeq(fetchS3Buckets)
  .parEvalMap(10) { name =>
    bucketRegion(name).map(name -> _)
  }
  .collect { case (name, "my-region") => name }
  .parEvalMap(10) { name =>
    listObject(name).map(list => println(s"Result: $list"))
  }
  .compile
  .drain
While none of that is impossible in bare Monix, FS2 makes such operations much easier to write and maintain, so you should be able to implement your flow with much less effort.
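For comparison, here is a rough, untested sketch of what the same flow might look like in bare Monix, reusing the stubs above (I am assuming a Monix version that provides Task.parTraverseN; older releases used different names for it):

val program: Task[Unit] =
  fetchS3Buckets.flatMap { names =>
    Task.parTraverseN(10)(names) { name => // at most 10 buckets processed concurrently
      bucketRegion(name).flatMap {
        case "my-region" => listObject(name)
        case _           => Task.pure(List.empty[S3Object])
      }
    }
  }.map(lists => println(s"Result: ${lists.flatten}"))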
A simple scenario here: I am using Akka Streams to read from Kafka and write to an external system, in my case Cassandra.
The Akka Streams (reactive-kafka) library equips me with backpressure and other nifty things to make this possible.
With Kafka as the Source and Cassandra as the Sink, I receive batches of events through Kafka which are, for example, Cassandra queries that must be executed sequentially (e.g. an INSERT, an UPDATE and a DELETE, in that order).
I cannot use mapAsync and execute all the statements at once; Futures are eager, and there is a chance that the DELETE or UPDATE gets executed before the INSERT.
I am forced to use Cassandra's execute as opposed to executeAsync, which is non-blocking.
There is no way to make this a completely async solution, but is there a more elegant way to do it?
For example: make the Future lazy and sequential and offload it to a different execution context of sorts.
mapAsync gives a parallelism option as well.
Can Monix Task be of help here?
This is a general design question: what approaches can one take?
UPDATE:
Flow[In].mapAsync(3)(input => {
  input match {
    case INSERT => // do insert - returns Future
    case UPDATE => // do update - returns Future
    case DELETE => // do delete - returns Future
  }
})
The scenario is a little more complex. There could be thousands of inserts, updates and deletes coming in order for specific keys (in Kafka).
I would ideally want to execute the three futures of a single key in sequence. I believe Monix's Task can help?
If you process things with parallelism of 1, they will get executed in strict sequence, which will solve your problem.
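For example, a minimal sketch; the command type and the execute function are placeholders for your own code:

import akka.{Done, NotUsed}
import akka.stream.scaladsl.Flow
import scala.concurrent.Future

sealed trait DbCommand
case object INSERT extends DbCommand
case object UPDATE extends DbCommand
case object DELETE extends DbCommand

def execute(cmd: DbCommand): Future[Done] = ??? // e.g. wraps the driver's executeAsync

// mapAsync(1): at most one Future in flight, so commands complete strictly in arrival order
val strictlySequential: Flow[DbCommand, Done, NotUsed] =
  Flow[DbCommand].mapAsync(1)(execute)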
But that's not interesting. If you want, you can run operations for different keys in parallel - if processing for different keys is independent, which, I assume from your description, is possible. To do this, you have to buffer the incoming values and then regroup them. Let's see some code:
import monix.reactive.Observable
import scala.concurrent.duration._
import monix.eval.Task
// Your domain logic - I'll use these stubs
trait Event
trait Acknowledgement // whatever your DB functions return, if you need it
def toKey(e: Event): String = ???
def processOne(event: Event): Task[Acknowledgement] = Task.deferFuture {
  event match {
    case _ => ??? // insert/update/delete
  }
}

// Monix Task.traverse is strictly sequential, which is what you need
def processMany(evs: Seq[Event]): Task[Seq[Acknowledgement]] =
  Task.traverse(evs)(processOne)

def processEventStreamInParallel(source: Observable[Event]): Observable[Acknowledgement] =
  source
    // Process a bunch of events, but don't wait too long for the whole 100. Fine-tune for your data source
    .bufferTimedAndCounted(2.seconds, 100)
    .concatMap { batch =>
      Observable
        .fromIterable(batch.groupBy(toKey).values) // Standard collection methods FTW
        .mapAsync(3)(processMany) // processing up to 3 different keys in parallel - 3 is arbitrary, tune it to your DB throughput
        .flatMap(Observable.fromIterable) // flattening it back
    }
The concatMap operator here will ensure that your chunks are processed sequentially as well. So even if one buffer has key1 -> insert, key1 -> update and the other has key1 -> delete, that causes no problems. In Monix, this is the same as flatMap, but in other Rx libraries flatMap might be an alias for mergeMap which has no ordering guarantee.
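As a toy illustration of that ordering guarantee (using the Observable imported above):

// concatMap waits for each inner Observable to complete before starting the next,
// so the output order below is deterministic
val ordered: Observable[String] =
  Observable.fromIterable(1 to 3)
    .concatMap(i => Observable.fromIterable(Seq(s"$i-insert", s"$i-update", s"$i-delete")))
// emits: 1-insert, 1-update, 1-delete, 2-insert, 2-update, 2-delete, ...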
This can be done with Futures too, though there's no standard "sequential traverse", so you have to roll your own, something like:
def processMany(evs: Seq[Event]): Future[Seq[Acknowledgement]] =
  evs.foldLeft(Future.successful(Vector.empty[Acknowledgement])) { (acksF, ev) =>
    for {
      acks <- acksF
      next <- processOne(ev) // a Future-returning processOne, with an implicit ExecutionContext in scope
    } yield acks :+ next
  }
You can use akka-streams subflows to group by key, then merge the substreams if you want to do something with what you get from your database operations:
def databaseOp(input: In): Future[Out] = input match {
  case INSERT => ...
  case UPDATE => ...
  case DELETE => ...
}

val databaseFlow: Flow[In, Out, NotUsed] =
  Flow[In].groupBy(Int.MaxValue, _.key).mapAsync(1)(databaseOp).mergeSubstreams
Note that the order from the input source won't be kept in the output, as it would be with a plain mapAsync, but all operations on the same key will still be in order.
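A hypothetical wiring of that flow, assuming kafkaSource is the Source[In, _] from your reactive-kafka consumer and an implicit materializer is in scope:

import akka.stream.scaladsl.Sink

kafkaSource
  .via(databaseFlow)
  .runWith(Sink.ignore) // or a Sink that does something with the Out values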
You are looking for Future.flatMap:
def doSomething: Future[Unit]
def doSomethingElse: Future[Unit]
val result = doSomething.flatMap { _ => doSomethingElse }
This executes the first function and then, when its Future is satisfied, starts the second one. The result is a new Future that completes when the second Future is satisfied.
The result of the first future is passed into the function you give to .flatMap, so the second function can depend on the result of the first one. For example:
def getUserID: Future[Int]
def getUser(id: Int): Future[User]
val userName: Future[String] = getUserID.flatMap(getUser).map(_.name)
You can also write this as a for-comprehension:
for {
  id <- getUserID
  user <- getUser(id)
} yield user.name
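The for-comprehension is just syntactic sugar for the same flatMap/map chain:

getUserID.flatMap(id => getUser(id).map(user => user.name))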
I am currently trying to track requests per minute in a Spark application to use them in another transformation. However, the code below will never yield a value other than the initially set value of 0 when the variable is used in the transformation:
var rpm: Long = 0

val requestsPerMinute = stream.countByWindow(Seconds(60), Seconds(5)).foreachRDD(rdd => {
  rdd.foreach(x => {
    rpm = x
  })
})

stream.foreachRDD { rdd =>
  rdd.foreach(x => {
    // do something including parameter rpm
  })
}
I assume it has something to do with parallelization. What I also tried was using an RDD or a Broadcast variable instead of the plain variable, but that resulted in the code not being executed.
What is the recommended way to achieve this in Spark Streaming?
EDIT:
The incoming objects are timestamped, if that helps with anything.
In Spark Streaming, there are two levels of execution:
The scheduling of operations, executed in the driver and,
The distributed computation on RDDs, executed in the cluster
There are two operations that provide access to both levels: transform and foreachRDD. In these operations we have access to the driver's context and a reference to an RDD, which we can use to apply computations on the distributed data.
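As a rough sketch of which level each part runs on (stream being the DStream from the question):

stream.foreachRDD { rdd =>
  // this block runs on the driver, once per batch interval
  val count = rdd.count() // triggers a distributed job; the result comes back to the driver

  rdd.foreach { x =>
    // this closure is serialized and runs on the executors,
    // so updates it makes to driver-side variables are never seen by the driver
  }
}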
In the specific case of the question, to update a local variable, the operation must be executed in the driver's context:
var rpm: Long = 0 // the driver-local variable from the question

val requestsPerMinute = stream.countByWindow(Seconds(60), Seconds(5))

requestsPerMinute.foreachRDD { rdd =>
  val computedRPM = rdd.collect()(0) // this gets the data locally, on the driver
  rpm = computedRPM
}
In the original case:
rdd.foreach(x => {
rpm = x
})
the closure f: Long => Unit = { x => rpm = x } is serialized and executed on the cluster. The side effects are applied in the remote context and are lost after the operation finishes. At the driver level, the value of the variable never changes.
Also, note that it is not a good idea to use side-effecting functions for remote execution.
I have 2 RDDs that are pulled in with the following code:
val fileA = sc.textFile("fileA.txt")
val fileB = sc.textFile("fileB.txt")
I then Map and Reduce it by key:
val countsB = fileB.flatMap(line => line.split("\n"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

val countsA = fileA.flatMap(line => line.split("\n"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
I now want to find and remove all keys in countsB if the key exists in countsA.
I have tried something like:
countsB.keys.foreach(b => {
  if (countsB.collect().exists(_ == b)) {
    countsB.collect().drop(countsB.collect().indexOf(b))
  }
})
but it doesn't seem like it removes them by the key.
There are 3 issues with your suggested code:
You are collecting the RDDs, which means they are not RDDs anymore; they are copied into the driver application's memory as plain Scala collections, so you lose Spark's parallelism and risk OutOfMemory errors if your dataset is large
When calling drop on an immutable Scala collection (or an RDD), you don't change the original collection; you get a new collection with those records dropped, so you can't expect the original collection to change
You cannot access an RDD within a function passed to any of the RDD's higher-order methods (e.g. foreach in this case) - any function passed to these methods is serialized and sent to the workers, and RDDs are (intentionally) not serializable - it makes no sense to fetch them into driver memory, serialize them, and send them back to the workers, since the data is already distributed on the workers!
To solve all these - when you want to use one RDD's data to transform/filter another one, you usually want to use some type of join. In this case you can do:
// left join, and keep only records for which there was NO match in countsA:
countsB.leftOuterJoin(countsA).collect { case (key, (valueB, None)) => (key, valueB) }
NOTE that this collect that I'm using here isn't the collect you used - this one takes a PartialFunction as an argument, and behaves like a combination of map and filter, and most importantly: it doesn't copy all data into driver memory.
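To make the distinction concrete, using the countsB from the question (an RDD of (String, Int) pairs):

// action: copies the whole RDD into the driver's memory as a local Array
val local = countsB.collect() // Array[(String, Int)]

// transformation: stays distributed, keeps only the pairs matched by the partial function
val frequent = countsB.collect { case (word, count) if count > 1 => word } // RDD[String]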
EDIT: as The Archetypal Paul commented - you have a much shorter and nicer option - subtractByKey:
countsB.subtractByKey(countsA)
When I call the map function of an RDD it is not being applied. It works as expected for a scala.collection.immutable.List, but not for an RDD. Here is some code to illustrate:
val list = List("a", "d", "c", "d")

list.map(l => {
  println("mapping list")
})

val tm = sc.parallelize(list)

tm.map(m => {
  println("mapping RDD")
})
The result of the above code is:
mapping list
mapping list
mapping list
mapping list
But notice "mapping RDD" is not printed to screen. Why is this occurring ?
This is part of a larger issue where I am trying to populate a HashMap from an RDD :
def getTestMap(dist: RDD[(String)]) = {
  var testMap = new java.util.HashMap[String, String]()
  dist.map(m => {
    println("populating map")
    testMap.put(m, m)
  })
  testMap
}
val testM = getTestMap(tm)
println(testM.get("a"))
This code prints null
Is this due to lazy evaluation?
Lazy evaluation might be part of this, if map is the only operation you are executing. Spark will not schedule execution until an action (in Spark terms) is requested on the RDD lineage.
When you execute an action, the println will happen, but not on the driver where you are expecting it; rather, on the executor running that closure. Try looking into the logs of the workers.
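For example, adding an action to the code from the question makes the closure run, though the output ends up wherever the partitions are processed (in local mode that happens to be the same console):

val mapped = tm.map { m =>
  println("mapping RDD") // runs on the executors once an action is triggered
  m
}
mapped.count() // an action: this is what actually schedules the job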
A similar thing is happening with the hashMap population in the second part of the question. The same piece of code is executed on each partition, on separate workers, and serialized back to the driver. Given that closures are 'cleaned' by Spark, testMap is probably being removed from the serialized closure, resulting in null. Note that if it were only due to the map not being executed, the hashmap should be empty, not null.
If you want to transfer the data of the RDD to another structure, you need to do that in the driver. Therefore you need to force Spark to deliver all the data to the driver. That's the function of rdd.collect().
This should work for your case. Be aware that all the RDD data should fit in the memory of your driver:
import scala.collection.JavaConverters._
def getTestMap(dist: RDD[(String)]) = dist.collect.map(m => (m , m)).toMap.asJava
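For instance, reusing the RDD from the first part of the question:

val tm = sc.parallelize(List("a", "d", "c", "d"))
val testM = getTestMap(tm)
println(testM.get("a")) // now prints "a" instead of null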