How can this be done concurrently in Scala?

So I have this chunk of code
var map = scala.collection.mutable.Map[String, mutable.MutableList[String]]()
dbs.foreach { db =>
  val resultList = getTables(hive, db)
  map += (db -> resultList)
}
What this does is loop through a list of dbs, run a show tables in db query for each db, then add db -> tables to a map. How can this be done concurrently, since there is about a 5-second wait time for each Hive query to return?
Updated code --
def getAllTablesConcurrent(hive: JdbcHive, dbs: mutable.MutableList[String]): Map[String, mutable.MutableList[String]] = {
  implicit val context: ExecutionContext = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(10))
  val futures = dbs.map { db =>
    Future(db -> getTables(hive, db))
  }
  val map = Await.result(Future.sequence(futures), Duration(10, TimeUnit.SECONDS)).toMap
  map
}

Don't use vars and mutable state, especially if you want concurrency.
val result: Future[Map[String, Seq[String]]] = Future
  .traverse(dbs) { name =>
    Future(name -> getTables(hive, name))
  }
  .map(_.toMap)
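For completeness, a minimal sketch of consuming that Future at the call site (blocking, as in the question's updated code; in a fully asynchronous program you would map or onComplete instead, and the 30-second timeout below is just an illustrative value):

import scala.concurrent.Await
import scala.concurrent.duration._

// Block only at the edge of the program, with an explicit timeout.
val tablesByDb: Map[String, Seq[String]] = Await.result(result, 30.seconds)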

If you want more control (how long you want to wait, how many threads you want to use, what happens if all your threads are busy, etc.) you can use a ThreadPoolExecutor together with Future:
implicit val context: ExecutionContext = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(10))

val dbs = List("db1", "db2", "db3")
val futures = dbs.map { name =>
  Future(name -> getTables(hive, name))
}
val result = Await.result(Future.sequence(futures), Duration(TIMEOUT, TimeUnit.MILLISECONDS)).toMap
Just remember not to create a new ExecutionContext every time you need one.
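A minimal sketch of that, assuming the JdbcHive and getTables from the question: create the pool once, reuse it for every call, and shut it down when the application exits.

import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object HiveQueryPool {
  // One shared pool for all Hive lookups; do not recreate this per call.
  private val pool = Executors.newFixedThreadPool(10)
  implicit val context: ExecutionContext = ExecutionContext.fromExecutor(pool)

  def getAllTables(hive: JdbcHive, dbs: Seq[String]): Map[String, Seq[String]] = {
    val futures = dbs.map(db => Future(db -> getTables(hive, db)))
    Await.result(Future.sequence(futures), 60.seconds).toMap
  }

  // Call once on application shutdown.
  def shutdown(): Unit = pool.shutdown()
}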

You can use .par on any Scala collection to perform the next transformation in parallel (using the default parallelism, which depends on the number of cores).
Also, it is easier and cleaner to map into an (immutable) map instead of updating a mutable one.
val result = dbs.par.map(db => db -> getTables(hive, db)).toMap
To have more control over the number of concurrent threads used, see https://docs.scala-lang.org/overviews/parallel-collections/configuration.html
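For example, a sketch of capping the parallel collection at 10 worker threads via its task support (the exact ForkJoinPool class to import differs between Scala versions; the one below matches 2.12+):

import java.util.concurrent.ForkJoinPool
import scala.collection.parallel.ForkJoinTaskSupport

val parDbs = dbs.par
parDbs.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(10)) // limit to 10 worker threads
val result = parDbs.map(db => db -> getTables(hive, db)).toMap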

Related

Spark accumulator empty when used in UDF

I was working on optimizing my Spark process and was trying to use a UDF with an accumulator. I have gotten the accumulator to work on its own, and was looking to see if I would get any speedup using a UDF. But instead, when I wrap the accumulator in the UDF, it remains empty. Am I doing something wrong in particular? Is there something going on with lazy execution where, even with my .count, it is still not executing?
Input:
0,[0.11,0.22]
1,[0.22,0.33]
Output:
(0,0,0.11),(0,1,0.22),(1,0,0.22),(1,1,0.33)
Code
val accum = new MapAccumulator2d()
val session = SparkSession.builder().getOrCreate()
session.sparkContext.register(accum)

// Does not work - empty accumulator
val rowAccum = udf((itemId: Int, item: mutable.WrappedArray[Float]) => {
  val map = item
    .zipWithIndex
    .map(ff => ((itemId, ff._2), ff._1.toDouble))
    .toMap
  accum.add(map)
  itemId
})
dataFrame.select(rowAccum(col("itemId"), col("jaccardList"))).count

// Works
dataFrame.foreach(f => {
  val map = f.getAs[mutable.WrappedArray[Float]](1)
    .zipWithIndex
    .map(ff => ((f.getInt(0), ff._2), ff._1.toDouble))
    .toMap
  accum.add(map)
})
val list = accum.value.toList.map(f => (f._1._1, f._1._2, f._2))
Looks like the only issue here is using count to "trigger" the lazily-evaluated UDF: Spark is "smart" enough to realize that the select operation can't change the result of count and therefore doesn't really execute the UDF. Choosing a different operation (e.g. collect) shows that the UDF works and updates the accumulator.
Here's a (more concise) example:
val accum = sc.longAccumulator
val rowAccum = udf((itemId: Int) => { accum.add(itemId); itemId })
val dataFrame = Seq(1,2,3,4,5).toDF("itemId")
dataFrame.select(rowAccum(col("itemId"))).count() // won't trigger UDF
println(s"RESULT: ${accum.value}") // prints 0
dataFrame.select(rowAccum(col("itemId"))).collect() // triggers UDF
println(s"RESULT: ${accum.value}") // prints 15

inserting a list of objects not working in slick

I am using Slick 3.1.1.
I want to insert a list of objects into the DB (Postgres).
I have written the following code, which works:
for (user <- x.userList) {
  val users = userClass(user.name, user.id)
  val userAction = DBAccess.userTable.insertOrUpdate(users)
  val f = DBAccess.db.run(DBIO.seq(userAction))
  Await.result(f, Duration.Inf)
}
However, this runs multiple DB queries, so I was looking for some way to make only a single db.run call.
So I wrote something like below
val userQuery = TableQuery[dbuserTable]
for (user <- x.userList) {
  val users = userClass(user.name, user.id)
  userQuery += users
}
val f = DBAccess.db.run(DBIO.seq(userQuery.result))
Await.result(f, Duration.Inf)
However this second piece does not write to DB. Can someone point out where I am going wrong?
I know this is old, but since I just stumbled on it I'll give an updated answer.
You can use ++= to insert a list.
val usersTable = TableQuery[dbuserTable]
val listToInsert = x.userList.map(user => userClass(user.name, user.id))
val action = usersTable ++= listToInsert
DBAccess.db.run(action)
This will issue only one request that inserts everything at once.
+= doesn't mutate your userQuery. Every iteration of your for loop creates an insert action and then promptly discards it.
Try accumulating the insert actions instead of discarding them (NB use of yield):
val usersTable = TableQuery[dbuserTable]
val inserts = for (user <- x.userList) yield {
  val userRow = userClass(user.name, user.id)
  usersTable += userRow
}
DBAccess.db.run(DBIO.seq(inserts: _*))
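If you also want the whole batch to succeed or fail as one unit, either variant can additionally be wrapped in a transaction; a sketch, assuming the usual driver/profile api._ import is in scope and reusing the names from above:

import scala.concurrent.Await
import scala.concurrent.duration.Duration

// Run the batch insert inside a single transaction.
val insertAll = (usersTable ++= listToInsert).transactionally
Await.result(DBAccess.db.run(insertAll), Duration.Inf)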

Spark : How to use mapPartition and create/close connection per partition

So, I want to do certain operations on my Spark DataFrame, write them to a DB, and create another DataFrame at the end. It looks like this:
import sqlContext.implicits._

val newDF = myDF.mapPartitions(iterator => {
  val conn = new DbConnection
  iterator.map(row => {
    addRowToBatch(row)
    convertRowToObject(row)
  })
  conn.writeTheBatchToDB()
  conn.close()
}).toDF()
This gives me an error, as mapPartitions expects a return type of Iterator[NotInferedR], but here it is Unit. I know this is possible with foreachPartition, but I'd like to do the mapping as well. Doing it separately would be an overhead (an extra Spark job). What to do?
Thanks!
In most cases, eagerly consuming the iterator will result in execution failure, if not a slowdown of jobs. So what I've done is check whether the iterator is already empty and then do the cleanup routines.
rdd.mapPartitions(itr => {
  val conn = new DbConnection
  itr.map(data => {
    val yourActualResult = ??? // do something with your data and conn here
    if (itr.isEmpty) conn.close // last element of the partition: close the connection
    yourActualResult
  })
})
I thought this was a Spark problem at first, but it was actually a Scala one: http://www.scala-lang.org/api/2.12.0/scala/collection/Iterator.html#isEmpty:Boolean
The last expression in the anonymous function implementation must be the return value:
import sqlContext.implicits._

val newDF = myDF.mapPartitions(iterator => {
  val conn = new DbConnection
  // using toList to force eager computation - make it happen now, while the connection is open
  val result = iterator.map(/* the same... */).toList
  conn.writeTheBatchToDB()
  conn.close()
  result.iterator
}).toDF()

How to consume grouped sub streams with mapAsync in akka streams

I need to do something really similar to this https://github.com/typesafehub/activator-akka-stream-scala/blob/master/src/main/scala/sample/stream/GroupLogFile.scala
My problem is that I have an unknown number of groups, and if the parallelism of the mapAsync is less than the number of groups I get an error in the last sink:
Tearing down
SynchronousFileSink(/Users/sam/dev/projects/akka-streams/target/log-ERROR.txt)
due to upstream error
(akka.stream.impl.StreamSubscriptionTimeoutSupport$$anon$2)
I tried to put a buffer in the middle as suggested in the pattern guide of akka streams http://doc.akka.io/docs/akka-stream-and-http-experimental/1.0/scala/stream-cookbook.html
groupBy {
  case LoglevelPattern(level) => level
  case other => "OTHER"
}.buffer(1000, OverflowStrategy.backpressure).
  // write lines of each group to a separate file
  mapAsync(parallelism = 2) { ...
but with the same result
Expanding on jrudolph's comment which is completely correct...
You do not need a mapAsync in this instance. As a basic example, suppose you have a source of tuples
import akka.stream.scaladsl.{Source, Sink}

def data() = List(("foo", 1),
                  ("foo", 2),
                  ("bar", 1),
                  ("foo", 3),
                  ("bar", 2))

val originalSource = Source(data)
You can then perform a groupBy to create a Source of Sources
def getID(tuple : (String, Int)) = tuple._1
//a Source of (String, Source[(String, Int),_])
val groupedSource = originalSource groupBy getID
Each one of the grouped Sources can be processed in parallel with just a map, no need for anything fancy. Here is an example of each grouping being summed in an independent stream:
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer

implicit val actorSystem = ActorSystem()
implicit val mat = ActorMaterializer()
import actorSystem.dispatcher

def getValues(tuple: (String, Int)) = tuple._2

// does not have to be a def, we can re-use the same sink over-and-over
val sumSink = Sink.fold[Int, Int](0)(_ + _)

// a Source of (String, Future[Int])
val sumSource =
  groupedSource map { case (id, src) =>
    id -> { src map getValues runWith sumSink } // calculate sum in an independent stream
  }
Now all of the "foo" numbers are being summed in parallel with all of the "bar" numbers.
mapAsync is used when you have an encapsulated function that returns a Future[T] and you're trying to emit a T instead, which is not the case in your question. Further, mapAsync involves waiting for results, which is not reactive...
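For contrast, a minimal sketch of where mapAsync would fit: a stage that calls an asynchronous function returning a Future (lookupScore below is hypothetical) and emits the unwrapped results while keeping at most 4 calls in flight, reusing the materializer and dispatcher from the example above:

import scala.concurrent.Future

// hypothetical async lookup returning a Future
def lookupScore(name: String): Future[Int] = Future.successful(name.length)

Source(List("foo", "bar", "baz"))
  .mapAsync(parallelism = 4)(lookupScore) // emits Int values, preserving upstream order
  .runWith(Sink.foreach(println))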

working with two RDDs apache spark

I am using Calliope, i.e. a Spark plugin, to connect with Cassandra. I have created 2 RDDs, which look like
class A
val persistLevel = org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK
val cas1 = CasBuilder.cql3.withColumnFamily("cassandra_keyspace", "cassandra_coulmn_family 1")
val sc1 = new SparkContext("local", "name it any thing ")
var rdd1 = sc1.cql3Cassandra[SCALACLASS_1](cas1)
var rddResult1 = rdd1.persist(persistLevel)

class B
val cas2 = CasBuilder.cql3.withColumnFamily("cassandra_keyspace", "cassandra_coulmn_family 2")
var rdd2 = sc1.cql3Cassandra[SCALACLASS_2](cas2)
var rddResult2 = rdd2.persist(persistLevel)
Somehow the following code, which creates a new RDD using the other two, is not working. Is it possible that we cannot iterate over two RDDs together?
Here is the code snippet which is not working -
case class Report(id: Long, anotherId: Long)
var reportRDD = rddResult2.flatMap(f => {
  val buf = List[Report]()
  **rddResult1.collect().toList**.foldLeft(buf)((k, v) => {
    val buf1 = new ListBuffer[Report]
    buf ++ v.INSTANCE_VAR_FROM_SCALACLASS_1.foldLeft(buf1)((ik, iv) => {
      buf1 += Report(f.INSTANCE_VAR_FROM_SCALACLASS_1, iv.INSTANCE_VAR_FROM_SCALACLASS_2)
    })
  })
})
while if I replace the bold part and initialize a val for it outside, like -
val collection = rddResult1.collect().toList

var reportRDD = rddResult2.flatMap(f => {
  val buf = List[Report]()
  **collection**.foldLeft(buf)((k, v) => {
    val buf1 = new ListBuffer[Report]
    buf ++ v.INSTANCE_VAR_FROM_SCALACLASS_1.foldLeft(buf1)((ik, iv) => {
      buf1 += Report(f.INSTANCE_VAR_FROM_SCALACLASS_1, iv.INSTANCE_VAR_FROM_SCALACLASS_2)
    })
  })
})
it works. Is there any explanation?
You are mixing a transformation with an action. The closure of rdd2.flatMap is executed on the workers, while rdd1.collect is an 'action' in Spark lingo and delivers data back to the driver. So, informally, you could say that the data is not there when you try to flatMap over it. (I don't know enough of the internals yet to pinpoint the exact root cause.)
If you want to operate on both RDDs distributedly, you should join them using one of the join functions (join, leftOuterJoin, rightOuterJoin, cogroup).
E.g.
val mappedRdd1 = rdd1.map(x => (x.id, x))
val mappedRdd2 = rdd2.map(x => (x.customerId, x))
val joined = mappedRdd1.join(mappedRdd2)
joined.flatMap(...reporting logic..).collect
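A sketch of how that might look end-to-end, reusing the Report class from the question, with made-up case classes and field names standing in for SCALACLASS_1/SCALACLASS_2 and their INSTANCE_VAR placeholders (everything named below is hypothetical):

case class Left1(id: Long, items: List[Long])    // stands in for SCALACLASS_1
case class Right2(customerId: Long, value: Long) // stands in for SCALACLASS_2

val mappedRdd1 = rdd1.map(x => (x.id, x))         // RDD[(Long, Left1)]
val mappedRdd2 = rdd2.map(x => (x.customerId, x)) // RDD[(Long, Right2)]

// The reporting logic runs distributedly on the executors; no collect inside the closure.
val reportRDD = mappedRdd1.join(mappedRdd2).flatMap { case (_, (a, b)) =>
  a.items.map(item => Report(item, b.value))
}
reportRDD.collect()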
You can operate on RDDs in the application. But you cannot operate on RDDs in the executors (the worker nodes). The executors cannot give commands to drive the cluster. The code inside flatMap runs on the executors.
In the first case, you try to operate on an RDD in the executor. I reckon you would get a NotSerializableException as you cannot even send the RDD object to the executors. In the second case, you pull the RDD contents to the application, and then send this simple List to the executors. (Lambda captures are automatically serialized.)