Below is my Code:
class Data(val x:Double=0.0,val y:Double=0.0) {
var cluster = 0;
}
var dataList = new ArrayBuffer[Data]()
val data = sc.textFile("Path").map(line => line.split(",")).map(userRecord => (userRecord(3), userRecord(4)))
data.foreach(a => dataList += new Data(a._1.toDouble, a._2.toDouble))
When I do
dataList.size
I get 0 as the output.
But there are more than 4k records in data.
Now when I try using take:
data.take(10).foreach(a => dataList += new Data(a._1.toDouble, a._2.toDouble))
I do get data in dataList, but I want all of my data in dataList.
Please help.
The problem is that the code inside your foreach runs on the distributed workers, not in the driver process where you inspect dataList.size. Use RDD.collect() to bring the results back to the driver:
val dataList = data
  .map(a => new Data(a._1.toDouble, a._2.toDouble))
  .collect()
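Note that collect() materializes the whole dataset on the driver, so this only works while everything fits in driver memory; for anything larger, keep the data as an RDD, as shown in the answer below.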
The problem is related to where your code is executed. Every operation made inside a transformation, i.e. map, flatMap, reduce and so on, is not performed in the main thread (or in the driver node) but in the worker nodes. These nodes run in different threads (or hosts) than the driver node.
Every object that is not stored inside an RDD and that is used in a worker node lives only in that worker's memory space. Your dataList object is therefore freshly created in each worker, and the driver node cannot retrieve any information from these remote objects.
The code in the main program and in the so-called actions, i.e. foreach, collect, take and so on, is executed in the main thread or driver node. So when you run
data.take(10).foreach(a => dataList += new Data(a._1.toDouble, a._2.toDouble))
the take method brings the first 10 elements of the RDD back from the workers. The rest of the code is executed in the driver node, and the magic works.
If you want to build an RDD of Data objects, you have to apply the transformation directly to the original RDD. Try something similar to the following:
val dataList: RDD[Data] =
data.map(a => new Data(a._1.toDouble, a._2.toDouble))
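You can then keep working with it on the cluster, for example (a minimal follow-up sketch):
dataList.count() // runs as a distributed action and should report the full ~4k records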
Also have a look at this post: A new way to err, Apache Spark.
Hope it helps.
Related
Suppose I have a method to be executed once on every worker node.
The following code is what I have come up with to achieve this goal, but it seems that the method is executed twice on the same worker node (there are one master and two worker nodes altogether).
val noOfExecs = sparkSession.sparkContext.getExecutorMemoryStatus.keys.size
val results = sparkSession.sparkContext
.parallelize(0 until noOfExecs, noOfExecs)
.map { _ =>
new SomeClass().doSomething()
}
.cache()
results.count()
How can I make sure that the method is executed only once on every worker node?
Maybe you've confused yourself in drawing the conclusion. Why do you say the method is executed twice on the same worker node?
A few things need to be clarified about Spark:
noOfExecs, computed via sparkSession.sparkContext.getExecutorMemoryStatus.keys.size, returns the total number of executors plus the driver, which is 3 if you have two workers/executors.
Breaking your code into chunks: first, a data set is parallelized out to the Spark cluster; it is basically an array/range of integers (0, 1, 2). Note that you cannot really control which integer is sent to which worker.
Then you map over those integers, so there are 3 values in the data set spread across the 2 workers, and you ask each worker to do something. (Below I've modified the code to print to the worker's stdout, so when you check the worker log you know which data was processed on that worker.)
The cache() and results.count() parts are just noise here.
The function you pass to map is executed once per element, and Spark distributes those elements across the workers; you do not have to ensure this yourself, Spark does.
See below; you should be able to check the workers' logs. In my test, one worker's log has this:
on worker, this method is executed for data number: 1, happy sharing
and the other worker's log has this:
on worker, this method is executed for data number: 0, happy sharing
on worker, this method is executed for data number: 2, happy sharing
Below is your code, modified.
class SomeClass()
{
def doSomething(x:Int) = {
println(s"on worker, this method is executed for data number: $x, happy sharing")
}
}
// below returns 3 for a cluster with 1 driver and 2 executors/workers
val driverAndWorkers = spark.sparkContext.getExecutorMemoryStatus
val noOfExecs = driverAndWorkers.keys.size
// below is basically 0, 1, 2
val data = 0 until noOfExecs
val rddOfInt = spark.sparkContext.parallelize(data, noOfExecs) // the second argument can be removed; for this question it does not matter how the RDD is partitioned
val results = rddOfInt
.map { x =>
new SomeClass().doSomething(x)
}
.cache()
results.count()
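As a side note, if what you actually want is one invocation per partition rather than per element, a hedged sketch using foreachPartition (not part of the answer above, and assuming the same SomeClass and rddOfInt) could look like this:
// foreachPartition gives one callback per partition, so with noOfExecs partitions
// the method runs at most once per partition (roughly once per executor slot
// when the partitions are spread evenly across the workers)
rddOfInt.foreachPartition { iter =>
  if (iter.hasNext) new SomeClass().doSomething(iter.next())
}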
I define a receiver to read data from Redis.
Part of the receiver's code, simplified:
class MyReceiver extends Receiver (StorageLevel.MEMORY_ONLY){
override def onStart() = {
while(!isStopped) {
val res = readMethod()
if (res != null) store(res.toIterator)
// using res.foreach(r => store(r)) the performance is almost the same
}
}
}
My streaming workflow:
val ssc = new StreamingContext(spark.sparkContext, new Duration(50))
val myReceiver = new MyReceiver()
val s = ssc.receiverStream(myReceiver)
s.foreachRDD{ r =>
r.persist()
if (!r.isEmpty) {
// some short operations, about 1 s in total
// note this line ######1
}
}
I have a producer which produces much faster than the consumer, so there are plenty of records in Redis now; I tested with 10000. I debugged, and all records can be read quickly by readMethod() above once they are in Redis. However, in each microbatch I only get about 30 records. (If store were fast enough it should get all 10000.)
To check this suspicion, I added Thread.sleep(10000) at ######1 above. Each microbatch still gets about 30 records, and each microbatch's processing time increases by 10 seconds. And if I increase the Duration to 200 ms, val ssc = new StreamingContext(spark.sparkContext, new Duration(200)), it gets about 120 records.
Does all of this mean Spark Streaming only generates the RDD once per Duration, and that the store method is temporarily stopped once the RDD is handed to the main workflow? That would be a great waste if true. I want it to keep generating RDDs (store) while the main workflow is running.
Any ideas?
I cannot leave a comment simply because I don't have enough reputation. Is it possible that the property spark.streaming.receiver.maxRate is set somewhere in your code?
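For reference, a hedged sketch of where that property is usually checked and set. The value 600 below is only a guess, but a limit of 600 records/second would give roughly 30 records per 50 ms batch and 120 per 200 ms batch, which matches the numbers in the question:
import org.apache.spark.SparkConf

// check whether a receiver rate limit is already configured
println(spark.sparkContext.getConf.get("spark.streaming.receiver.maxRate", "not set"))

// if a limit is wanted, it would normally be set when building the context, e.g.:
val conf = new SparkConf().set("spark.streaming.receiver.maxRate", "600")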
I am writing a custom Spark sink. In my addBatch method I use foreachPartitionAsync which, if I'm not wrong, only makes the driver work asynchronously, returning a future.
val work: FutureAction[Unit] = rdd.foreachPartitionAsync { rows =>
val sourceInfo: StreamSourceInfo = serializeRowsAsInputStream(schema, rows)
val ackIngestion = Future {
  ingestRows(sourceInfo)
} andThen {
  case Success(ingestion) => ackIngestionDone(partitionId, ingestion)
}
Await.result(ackIngestion, timeOut) // I would like to remove this line..
}
work onSuccess {
case _ => // move data from temporary table, report success of all workers
}
work onFailure{
//delete tmp data
case t => throw t.getCause
}
I can't find a way to run the worker nodes without blocking on the Await call: if I remove it, a success is reported to the work future object even though the futures didn't really finish.
Is there a way to report to the driver that all the workers have finished
their asynchronous jobs?
Note: I looked at the foreachPartitionAsync function and it has only one implementation, which expects a function returning Unit (I would have expected another overload returning a Future, or maybe a CountDownLatch..).
I have a heavy-load flow of user data. I want to determine whether a user is new by its id. To reduce calls to the db, I would rather maintain in-memory state of previously seen users.
val users = mutable.Set[String]()
//init the state from db
users ++= db.getAllUsersIds()
val source: Source[User, NotUsed]
val dbSink: Sink[User, NotUsed] //goes to db
//if the user is added to the set it will return true
val usersFilter = Flow[User].filter(user => users.add(user.id))
Now I can create a graph:
source ~> usersFilter ~> dbSink
My problem is that the mutable state is shared and unsafe. Is there an option to maintain the state within the flow?
There are two ways of doing this.
If you are getting a stream of records and you want to deduplicate the stream (because some ids have already been processed), you can do it as described here:
http://janschulte.com/2016/03/08/deduplicate-akka-stream/
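One way to keep the state inside the flow (not necessarily what the linked post does, just a sketch assuming a User type with an id field as in the question) is statefulMapConcat, which creates the mutable set per materialization so it is never shared:
import akka.NotUsed
import akka.stream.scaladsl.Flow

case class User(id: String) // stand-in for the question's User type

val dedupeFlow: Flow[User, User, NotUsed] =
  Flow[User].statefulMapConcat { () =>
    // this set lives inside the stage, one instance per materialization
    val seen = scala.collection.mutable.Set.empty[String]
    user => if (seen.add(user.id)) user :: Nil else Nil // pass new ids through, drop duplicates
  }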
The other way of doing this is via database lookups where you check if the ID already exists.
val alreadyExists: Flow[User, User, NotUsed] = {
// build a cache of known ids
val knownIdList = ... // query database and get list of IDs
Flow[User].filterNot(user => knownIdList.contains(user.id))
}
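Wired into the graph from the question, either flow is used the same way, here with the linear DSL (a sketch, assuming an implicit materializer is in scope):
source.via(alreadyExists).to(dbSink).run()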
I have the following algorithm in Scala:
1. Do an initial call to the db to initialize the cursor
2. Get 1000 entities from the db (returns a Future)
3. For every entity, make one additional request to the database and get the modified entity (returns a Future)
4. Transform the original entity
5. Put the transformed entity into the Future callback from step 3
6. Wait for all Futures
In Scala it is something like:
val client = ...
val size = 1000
val init = client.firstSearch(size) //request over network
val initResult = Await.result(init, 30.seconds)
var cursorId:String = initResult.getCursorId
while (!cursorId.isEmpty) {
val futures = client.grabWithSize(cursorId).map{ response =>
response.getAllResults.map{ result =>
val grabbedOne:Future[Entity] = client.grabOneEntity(result.id) //request over network
val resultMap:Map[String,Any] = buildMap(result)
val transformed:Map[String,Any] = transform(resultMap) //no future here
grabbedOne.map{grabbedOne=>
buildMap(grabbedOne) == transformed
}
}
Future.sequence(futures).map(_ => response.getNewCursorId)
}
}
def buildMap(...):Map[String,Any] //sync call
I noticed that if I increase size, say by a factor of two, every iteration of the while loop becomes about 1.5 times slower. But I do not see my CPU being loaded any more; it stays near zero, yet the time still increases by ~1.5x. Why? I have set up:
implicit val ec = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(1024))
I think that not all the Futures are executed in parallel. But why? And how do I fix it?
From your code, I see that the Futures don't block each other; it's more likely the database is the bottleneck.
Is it possible to do a SQL join so that you make O(1) rather than O(n) database calls? (If you're using Slick, have a look at the queries section about joins.)
If the CPU load is low, it's probably that the connection pool is maxed out; you'd need to increase it for both the database and the network.
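To illustrate the join idea, a hedged Slick sketch (the table and column names entities, details and entityId are hypothetical, only to show one query replacing the per-entity grabOneEntity round trips):
// assuming e.g. slick.jdbc.PostgresProfile.api._ is imported and `db` is a Database
val joinedQuery = for {
  (e, d) <- entities join details on (_.id === _.entityId)
} yield (e, d)

val allRows = db.run(joinedQuery.result) // one round trip instead of one per entity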