In this case:
val dStream: DStream[_] = ...
dStream.foreachRDD(a => ...)
dStream.foreachRDD(b => ...)
Do the foreach methods:
run in parallel?
run in sequence, but without a specific order?
run foreachRDD(a => ...) before foreachRDD(b => ...)?
I want to know because I want to commit the Kafka offsets after a database insert. (And the db connector only gives a "foreach" insert.)
val dStream: DStream[_] = ...().cache()
dStream.toDb // consume the stream
dStream.foreachRDD(b => /* commit offsets */ ...) // consume the stream, but after the db insert
In the Spark UI it looks like there is an order, but I'm not sure it's reliable.
Edit: if foreachRDD(a => ...) fails, is foreachRDD(b => ...) still executed?
DStream.foreach is deprecated since Spark 0.9.0. You want the equivalent DStream.foreachRDD to begin with.
Stages in the Spark DAG are executed sequentially, as one transformation's output is usually also the input for the next transformation in the graph, but this isn't the case in your example.
What happens is that internally the RDD is divided into partitions. Each partition is run on a different worker that is available to the cluster manager. In your example, DStream.foreach(a => ...) will execute before DStream.foreach(b => ...), but the execution within each foreach will run in parallel with respect to the internal RDD being iterated.
I want to know that because I want to commit kafka offset after a
database insert.
DStream.foreachRDD is an output transformation, meaning it causes Spark to materialize the graph and begin execution. You can safely assume that the insertion into the database will end prior to executing your second foreach, but keep in mind that your first foreach will be updating your database in parallel, for each partition in the RDD.
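To make that per-partition parallelism concrete, here is a minimal sketch of what such a write can look like (createConnection and insert are hypothetical stand-ins for whatever your db connector provides):

// Hypothetical sketch: one connection per partition, records written in parallel across partitions
dStream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val conn = createConnection()               // assumed helper; opened on the executor, not the driver
    try records.foreach(r => insert(conn, r))   // assumed helper doing the actual insert
    finally conn.close()
  }
}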
Multiple DStream.foreachRDD calls are not guaranteed to execute sequentially, at least as of spark-streaming 2.4.0. Look at this code in the JobScheduler class:
class JobScheduler(val ssc: StreamingContext) extends Logging {

  // Use of ConcurrentHashMap.keySet later causes an odd runtime problem due to Java 7/8 diff
  // https://gist.github.com/AlainODea/1375759b8720a3f9f094
  private val jobSets: java.util.Map[Time, JobSet] = new ConcurrentHashMap[Time, JobSet]
  private val numConcurrentJobs = ssc.conf.getInt("spark.streaming.concurrentJobs", 1)
  private val jobExecutor =
    ThreadUtils.newDaemonFixedThreadPool(numConcurrentJobs, "streaming-job-executor")
The jobExecutor is a thread pool, and if "spark.streaming.concurrentJobs" is set to a number greater than 1, output operations can execute in parallel when enough Spark executors are available. So make sure your settings are correct to elicit the behavior you need.
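If the real goal is just "commit the Kafka offsets only after the batch has been written to the database", an approach that does not rely on the ordering of two separate output operations is to do both inside a single foreachRDD. A minimal sketch, assuming the spark-streaming-kafka-0-10 direct stream API and a hypothetical writeBatchToDb helper:

import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

// stream: the direct Kafka input stream; writeBatchToDb: hypothetical partition-wise DB writer
def consumeAndCommit(stream: InputDStream[ConsumerRecord[String, String]],
                     writeBatchToDb: Iterator[ConsumerRecord[String, String]] => Unit): Unit =
  stream.foreachRDD { rdd =>
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    rdd.foreachPartition(writeBatchToDb)                             // blocks until the whole batch is written
    stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)  // commit only after the insert above
  }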
A simple scenario here. I am using Akka Streams to read from Kafka and write to an external store, in my case Cassandra.
The Akka Streams (reactive-kafka) library equips me with backpressure and other nifty things to make this possible.
With Kafka as the Source and Cassandra as the Sink, I get a bunch of events through Kafka which are, for example, Cassandra queries that are supposed to be executed sequentially (e.g. an INSERT, an UPDATE and a DELETE, in that order).
I cannot just use mapAsync and execute the statements, because Future is eager and there is a chance that the DELETE or UPDATE gets executed before the INSERT.
I am forced to use Cassandra's execute as opposed to executeAsync, which is non-blocking.
There is no way to make a completely async solution to this issue, but is there a more elegant way to do it?
For example: make the Future lazy and sequential and offload it to a different execution context of sorts.
mapAsync gives a parallelism option as well.
Can Monix Task be of help here?
This is a general design question; what are the approaches one can take?
UPDATE:
Flow[In].mapAsync(3)(input => {
  input match {
    case INSERT => // do insert - returns Future
    case UPDATE => // do update - returns Future
    case DELETE => // do delete - returns Future
  }
})
The scenario is a little more complex. There could be thousands of inserts, updates and deletes coming in order for specific key(s) (in Kafka).
I would ideally want to execute the 3 futures of a single key in sequence. I believe Monix's Task can help?
If you process things with parallelism of 1, they will get executed in strict sequence, which will solve your problem.
But that's not interesting. If you want, you can run operations for different keys in parallel - if processing for different keys is independent, which, I assume from your description, is possible. To do this, you have to buffer the incoming values and then regroup them. Let's see some code:
import monix.reactive.Observable
import scala.concurrent.duration._
import monix.eval.Task

// Your domain logic - I'll use these stubs
trait Event
trait Acknowledgement // whatever your DB functions return, if you need it

def toKey(e: Event): String = ???

def processOne(event: Event): Task[Acknowledgement] = Task.deferFuture {
  event match {
    case _ => ??? // insert/update/delete
  }
}

// Monix Task.traverse is strictly sequential, which is what you need
def processMany(evs: Seq[Event]): Task[Seq[Acknowledgement]] =
  Task.traverse(evs)(processOne)

def processEventStreamInParallel(source: Observable[Event]): Observable[Acknowledgement] =
  source
    // Process a bunch of events, but don't wait too long for whole 100. Fine-tune for your data source
    .bufferTimedAndCounted(2.seconds, 100)
    .concatMap { batch =>
      Observable
        .fromIterable(batch.groupBy(toKey).values) // Standard collection methods FTW
        .mapAsync(3)(processMany) // processing up to 3 different keys in parallel - tho 3 is not necessary, probably depends on your DB throughput
        .flatMap(Observable.fromIterable) // flattening it back
    }
The concatMap operator here will ensure that your chunks are processed sequentially as well. So even if one buffer has key1 -> insert, key1 -> update and the other has key1 -> delete, that causes no problems. In Monix, this is the same as flatMap, but in other Rx libraries flatMap might be an alias for mergeMap which has no ordering guarantee.
This can be done with Futures too, though there's no standard "sequential traverse", so you have to roll your own, something like:
// Note: this variant assumes a processOne that returns a Future and an implicit ExecutionContext in scope
def processMany(evs: Seq[Event]): Future[Seq[Acknowledgement]] =
  evs.foldLeft(Future.successful(Vector.empty[Acknowledgement])) { (acksF, ev) =>
    for {
      acks <- acksF
      next <- processOne(ev)
    } yield acks :+ next
  }
You can use akka-streams subflows to group by key, then merge the substreams if you want to do something with what you get back from your database operations:
def databaseOp(input: In): Future[Out] = input match {
  case INSERT => ...
  case UPDATE => ...
  case DELETE => ...
}

val databaseFlow: Flow[In, Out, NotUsed] =
  Flow[In].groupBy(Int.MaxValue, _.key).mapAsync(1)(databaseOp).mergeSubstreams
Note that the order from the input source won't be kept in the output, as it would be with a plain mapAsync, but all operations on the same key will still be in order.
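For completeness, a sketch of how this flow could be wired up (assuming Akka 2.6+, where the implicit ActorSystem provides the materializer; in practice the source would be the reactive-kafka consumer rather than a List):

import akka.Done
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.Future

implicit val system: ActorSystem = ActorSystem("db-writer")

// events is a placeholder for whatever In values arrive from Kafka
def run(events: List[In]): Future[Done] =
  Source(events)
    .via(databaseFlow)     // per-key order preserved, different keys processed concurrently
    .runWith(Sink.ignore)  // swap for a real sink if the Out values are needed downstream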
You are looking for Future.flatMap:
def doSomething: Future[Unit]
def doSomethingElse: Future[Unit]
val result = doSomething.flatMap { _ => doSomethingElse }
This executes the first function and then, when its Future is satisfied, starts the second one. The result is a new Future that completes when the second Future is satisfied.
The result of the first future is passed into the function you give to .flatMap, so the second function can depend on the result of the first one. For example:
def getUserID: Future[Int]
def getUser(id: Int): Future[User]
val userName: Future[String] = getUserID.flatMap(getUser).map(_.name)
You can also write this as a for-comprehension:
for {
id <- getUserID
user <- getUser(id)
} yield user.name
I am currently trying to track requests per minute in a Spark application, to use the count in another transformation. However, the code below never results in a value other than the originally set value of 0 when the variable is used in the transformation:
var rpm: Long = 0

val requestsPerMinute = stream.countByWindow(Seconds(60), Seconds(5)).foreachRDD(rdd => {
  rdd.foreach(x => {
    rpm = x
  })
})

stream.foreachRDD { rdd =>
  rdd.foreach(x => {
    // do something including parameter rpm
  })
}
I assume it has something to do with parallelization. What I also tried was to use an RDD or a Broadcast instead of the plain variable, but that resulted in the code not being executed.
What is the recommended way to achieve this in SparkStreaming?
EDIT:
The incoming objects are timestamped, if that helps with anything.
In Spark Streaming, there are two levels of execution:
The scheduling of operations, executed in the driver and,
The distributed computation on RDDs, executed in the cluster
There are two operations that provide access to both levels: transform and foreachRDD. In these operations, we have access to the driver's context and a reference to an RDD that we can use to apply computations on.
In the specific case of the question, to update a local variable, the operation must be executed in the driver's context:
val requestsPerMinute = stream.countByWindow(Seconds(60), Seconds(5))
requestsPerMinute.foreachRDD { rdd =>
  val computedRPM = rdd.collect()(0) // this gets the data locally
  rpm = computedRPM
}
In the original case:
rdd.foreach(x => {
  rpm = x
})
the closure f: Long => Unit = { x => rpm = x } is serialized and executed on the cluster. The side effects are applied in the remote context and lost after the operation finishes. At the driver level, the value of the variable never changes.
Also, note that it is not a good idea to use side-effecting functions for remote execution.
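Putting the two levels together, a minimal sketch of the whole pattern (assuming the windowed counts are Longs and that a slightly stale rpm value per batch is acceptable):

// Driver-side mutable state; updated once per batch by the foreachRDD below
@volatile var rpm: Long = 0L

val requestsPerMinute = stream.countByWindow(Seconds(60), Seconds(5))
requestsPerMinute.foreachRDD { rdd =>
  // runs in the driver; the RDD holds at most one element (the window count)
  rdd.collect().headOption.foreach(count => rpm = count)
}

stream.foreachRDD { rdd =>
  val currentRpm = rpm   // read in the driver, captured by the closure below
  rdd.foreach { x =>
    // use x together with currentRpm on the executors
  }
}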
I am running a Spark Scala program that performs text scanning on an input file. I am trying to achieve parallelism by using rdd.mapPartitions. Inside the mapPartitions block I perform a few checks and call the map function to achieve parallel execution for each partition. Inside the map function I call a custom method where I perform the scanning and send the results back.
Now, the code works fine when I submit it using --master local[*], but it does not work when I submit it using --master yarn-cluster. It runs without any error, but the call never gets inside the mapPartitions block itself. I verified this by placing a few println statements.
Please help me with your suggestions.
Here is the sample code:
def main(args: Array[String]) {
  val inputRdd = sc.textFile(inputFile, 2)
  val resultRdd = inputRdd.mapPartitions { iter =>
    println("Inside scanning method..")
    var scanEngine = ScanEngine.getInstance();
    ...
    ....
    ....
    var mapresult = iter.map { y =>
      line = y
      val last = line.lastIndexOf("|");
      message = line.substring(last + 1, line.length());
      getResponse(message)
    }
  }
  val finalRdd = sc.parallelize(resultRdd.map(x => x.trim()))
  finalRdd.coalesce(1, true).saveAsTextFile(hdfsOutpath)
}

def getResponse(input: String): String = {
  var result = "";
  val rList = new ListBuffer[String]();
  try {
    // logic here
  }
  return result;
}
If your evidence of it working is seeing "Inside scanning method.." printed out, it won't show up when run on the cluster, because that code is executed by the workers, not the driver.
You're going to have to go over the code in forensic detail, with an open mind and try to find why the job has no output. Usually when a job works on local mode but not on a cluster it is because of some subtlety in where the code is executed, or where output is recorded.
There's too much clipped code to provide a more specific answer.
Spark achieves parallelism using the map function as well as mapPartitions. The number of partitions determines the amount of parallelism, but each partition will execute independently whether or not you use the mapPartitions function.
There are only a few reasons to use mapPartitions over map; e.g. when there is a high initialization cost for a function that you then call multiple times, such as doing some NLP task on text.
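To illustrate that initialization-cost case with code in the spirit of the question (ScanEngine.getInstance is taken from the question; scan is an assumed method name):

// Sketch: pay the engine setup once per partition, then reuse it for every line
val responses = inputRdd.mapPartitions { lines =>
  val scanEngine = ScanEngine.getInstance()   // assumed costly to build; done once per partition
  lines.map { line =>
    val last    = line.lastIndexOf("|")
    val message = line.substring(last + 1)
    scanEngine.scan(message)                  // hypothetical scanning call returning a String
  }
}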
I use Spark Streaming to process imported data. The imported data is stored in a DStream. Further, I have the classes Creation and Update, which each hold a Foo object.
One of the tasks I want to accomplish is change detection.
So I join two RDDs (one holds the batch of data being processed, the other holds the current state). currentState is empty initially.
val stream: DStream[Foo]
var currentState: RDD[Foo]

val changes = stream
  .transform { batch =>
    // joined batch: existing objects become Updates, new ones become Creations
    val newChanges = batch.leftOuterJoin(currentState).map {
      case (objectNew, Some(objectOld)) => Update(objectNew)
      case (objectNew, None)            => Creation(objectNew)
    }
    // carry the new state forward for the next batch
    currentState = currentState.fullOuterJoin(newChanges).map {
      case (Some(foo), None)  => foo
      case (_, Some(change))  => change.foo
    }
    newChanges
  }.cache()
Afterwards I filter out the updates.
changes.filter(c => !c.isInstanceOf[Update])
I now import the same data twice. Since the state is empty initially, the result set of the first import consists of Creation objects, while the second one results only in Update objects. So the second time, the filtered result set is empty.
In this case I notice a massive performance decrease. It works fine if I leave out the filter.
I can't imagine this is intended behavior, but maybe it's a problem with Spark's computation internals. Can anyone explain why this happens?
When I call the map function of an RDD, it is not being applied. It works as expected for a scala.collection.immutable.List, but not for an RDD. Here is some code to illustrate:
val list = List("a", "d", "c", "d")
list.map(l => {
  println("mapping list")
})

val tm = sc.parallelize(list)
tm.map(m => {
  println("mapping RDD")
})
The result of the above code is:
mapping list
mapping list
mapping list
mapping list
But notice "mapping RDD" is not printed to screen. Why is this occurring ?
This is part of a larger issue where I am trying to populate a HashMap from an RDD :
def getTestMap(dist: RDD[(String)]) = {
  var testMap = new java.util.HashMap[String, String]();
  dist.map(m => {
    println("populating map")
    testMap.put(m, m)
  })
  testMap
}
val testM = getTestMap(tm)
println(testM.get("a"))
This code prints null.
Is this due to lazy evaluation?
Lazy evaluation might be part of this, if map is the only operation you are executing. Spark will not schedule execution until an action (in Spark terms) is requested on the RDD lineage.
When you execute an action, the println will happen, but not on the driver where you are expecting it; rather, it happens on the worker executing that closure. Try looking into the logs of the workers.
A similar thing is happening on the hashMap population in the 2nd part of the question. The same piece of code will be executed on each partition, on separate workers and will be serialized back to the driver. Given that closures are 'cleaned' by Spark, probably testMap is being removed from the serialized closure, resulting in a null. Note that if it was only due to the map not being executed, the hashmap should be empty, not null.
If you want to transfer the data of the RDD to another structure, you need to do that in the driver. Therefore you need to force Spark to deliver all the data to the driver. That's the function of rdd.collect().
This should work for your case. Be aware that all the RDD data should fit in the memory of your driver:
import scala.collection.JavaConverters._

def getTestMap(dist: RDD[(String)]) = dist.collect.map(m => (m, m)).toMap.asJava
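With that version, the usage from the question behaves as expected; a quick sanity check, reusing tm from above:

val testM = getTestMap(tm)
println(testM.get("a"))   // prints "a" instead of null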