Spark: How to use mapPartitions and create/close a connection per partition - Scala

So, I want to do certain operations on my Spark DataFrame, write them to a DB, and create another DataFrame at the end. It looks like this:
import sqlContext.implicits._
val newDF = myDF.mapPartitions(
  iterator => {
    val conn = new DbConnection
    iterator.map(
      row => {
        addRowToBatch(row)
        convertRowToObject(row)
      })
    conn.writeTheBatchToDB()
    conn.close()
  })
  .toDF()
This gives me an error because mapPartitions expects a return type of Iterator[NotInferedR], but here it is Unit. I know this is possible with foreachPartition, but I'd like to do the mapping as well. Doing it separately would be an overhead (an extra Spark job). What can I do?
Thanks!

In most cases, eagerly consuming the iterator will result in execution failure, if not a slowdown of jobs. So what I did was check whether the iterator is already empty and then run the cleanup routines.
rdd.mapPartitions(itr => {
  val conn = new DbConnection
  itr.map(data => {
    val yourActualResult = ??? // do something with your data and conn here
    if (itr.isEmpty) conn.close() // close the connection
    yourActualResult
  })
})
I thought this was a Spark problem at first, but it turned out to be a Scala one: http://www.scala-lang.org/api/2.12.0/scala/collection/Iterator.html#isEmpty:Boolean
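To see why this works outside Spark, here is a minimal plain-Scala sketch of the same trick (the println is a hypothetical stand-in for closing the connection): Iterator.map is lazy, so the check runs only as each element is consumed, and the source iterator reports empty while the last element is being processed.
// Plain Scala illustration of the lazy-iterator trick above
val source = Iterator(1, 2, 3)
val mapped = source.map { x =>
  val result = x * 2 // stand-in for "do something with your data and conn here"
  // Once the last element has been pulled from `source`, it is empty,
  // so this is the point where conn.close() would run.
  if (source.isEmpty) println("cleanup would run here")
  result
}
// Nothing above has executed yet; consuming `mapped` drives the work.
mapped.foreach(println) // prints 2, 4, "cleanup would run here", 6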

The last expression in the anonymous function implementation must be the return value:
import sqlContext.implicits._
val newDF = myDF.mapPartitions(
  iterator => {
    val conn = new DbConnection
    // using toList to force eager computation - make it happen now when connection is open
    val result = iterator.map(/* the same... */).toList
    conn.writeTheBatchToDB()
    conn.close()
    result.iterator
  }
).toDF()
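For reference, here is a sketch of the whole pattern with the question's helpers plugged in; DbConnection, addRowToBatch, convertRowToObject and writeTheBatchToDB are assumed to exist exactly as in the question. Note that toList buffers the entire partition in memory before the batch is written, so very large partitions may need a different strategy.
import sqlContext.implicits._
// Sketch only: DbConnection and the helper functions come from the question.
val newDF = myDF.mapPartitions { iterator =>
  val conn = new DbConnection
  // Materialize eagerly so every row is batched while the connection is still open.
  val result = iterator.map { row =>
    addRowToBatch(row)      // side effect: queue the row into the batch
    convertRowToObject(row) // value that flows into the new DataFrame
  }.toList
  conn.writeTheBatchToDB()
  conn.close()
  result.iterator // mapPartitions must return an Iterator
}.toDF()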

Related

How can this be done concurrently in Scala

So I have this chunk of code
dbs.foreach({
  var map = scala.collection.mutable.Map[String, mutable.MutableList[String]]()
  db =>
    val resultList = getTables(hive, db)
    map += (db -> resultList)
})
What this does is loop through a list of dbs, do a "show tables in db" call for each db, and then add db -> tables to a map. How can this be done concurrently, given that there is about a 5 second wait for each Hive query to return?
Updated code:
def getAllTablesConcurrent(hive: JdbcHive, dbs: mutable.MutableList[String]): Map[String, mutable.MutableList[String]] = {
  implicit val context: ExecutionContext = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(10))
  val futures = dbs.map { db =>
    Future(db, getTables(hive, db))
  }
  val map = Await.result(Future.sequence(futures), Duration(10, TimeUnit.SECONDS)).toMap
  map
}
Don't use vars and mutable state, especially if you want concurrency.
val result: Future[Map[String, Seq[String]]] = Future
  .traverse(dbs) { name =>
    Future(name -> getTables(hive, name))
  }
  .map(_.toMap)
If you want more control (how long to wait, how many threads to use, what happens if all your threads are busy, etc.), you can use a ThreadPoolExecutor together with Future:
implicit val context: ExecutionContext = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(10))
val dbs = List("db1", "db2", "db3")
val futures = dbs.map { name =>
  Future(name, getTables(hive, name))
}
val result = Await.result(Future.sequence(futures), Duration(TIMEOUT, TimeUnit.MILLISECONDS)).toMap
Just remember not to create a new ExecutionContext every time you need it.
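Putting both suggestions together, a self-contained sketch might look like the following; the getTables stub here is only a stand-in for the question's Hive call, and the fixed thread pool is created once rather than per invocation.
import java.util.concurrent.{Executors, TimeUnit}
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration
object TableLister {
  // Created once, not per call, so the same pool is reused across invocations.
  implicit val context: ExecutionContext =
    ExecutionContext.fromExecutor(Executors.newFixedThreadPool(10))
  // Stand-in for the question's getTables(hive, db); replace with the real Hive call.
  def getTables(db: String): Seq[String] = Seq(s"$db.table1", s"$db.table2")
  def getAllTablesConcurrent(dbs: Seq[String]): Map[String, Seq[String]] = {
    val futureMap = Future
      .traverse(dbs) { name => Future(name -> getTables(name)) }
      .map(_.toMap)
    // Blocking at the edge for illustration; keep composing the Future where possible.
    Await.result(futureMap, Duration(10, TimeUnit.SECONDS))
  }
}
Calling TableLister.getAllTablesConcurrent(List("db1", "db2", "db3")) then runs up to 10 lookups in parallel and returns the combined map.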
You can use .par on any Scala collection to perform the next transformation in parallel (using the default parallelism, which depends on the number of cores).
Also, it is easier and cleaner to map into an (immutable) Map instead of updating a mutable one:
val result = dbs.par.map(db => db -> getTables(hive, db)).toMap
To have more control over the number of concurrent threads used, see https://docs.scala-lang.org/overviews/parallel-collections/configuration.html
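As a rough sketch of what that configuration looks like (assuming a Scala version where parallel collections are available, e.g. 2.12, and using a hypothetical getTables stub in place of the Hive call), you can assign a custom task support to the parallel collection:
import java.util.concurrent.ForkJoinPool
import scala.collection.parallel.ForkJoinTaskSupport
val dbs = List("db1", "db2", "db3")
def getTables(db: String): Seq[String] = Seq(s"$db.t1", s"$db.t2") // stand-in for the Hive call
val parDbs = dbs.par
// Limit this parallel collection to 4 worker threads instead of the default.
parDbs.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(4))
val result = parDbs.map(db => db -> getTables(db)).toMap.seq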

Unable to convert a ResultSet into a List in Cassandra datastax driver

In the following Cassandra code, I am querying a database and expect multiple values. The function takes an id and should return Option[List[M]], where M is my model. I have a function rowToModel(row: Row): MyModel which can take a row from the ResultSet and convert it into an instance of my model.
My issue is that the List I am returning is always empty even though the ResultSet has data. I checked this by adding debug prints in rowToModel.
def getRowsByPartitionKeyId(id: I): Option[List[M]] = {
  val whereClause = whereConditions(tablename, id);
  val resultSet = session.execute(whereClause) //resultSet is an iterator
  val it = resultSet.iterator();
  val resultList: List[M] = List();
  if (it.hasNext) {
    while (it.hasNext) {
      val item: M = rowToModel(it.next())
      resultList.:+(item)
    }
    Some(resultList) //THIS IS ALWAYS List()
  }
  else
    None
}
I suspect that because resultList is a val, its value is not getting changed in the while loop. I probably should use yield or something else, but I don't know what or how.
Solved it by converting the Java iterator to a Scala one and then using toList:
import collection.JavaConverters._
val it = resultSet.iterator();
if (it.hasNext) {
  val resultSetAsList: List[Row] = asScalaIterator(it).toList
  val resultSetAsModelList = resultSetAsList.map((row: Row) => rowToModel(row))
  Some(resultSetAsModelList)
} else
  None
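The original loop was also dropping its results because resultList.:+(item) returns a new List rather than mutating resultList. As an alternative sketch, assuming the DataStax 3.x driver (where ResultSet.all() returns a java.util.List<Row>) and reusing the question's own session, whereConditions, tablename and rowToModel, the whole method can be reduced to:
import scala.collection.JavaConverters._
import com.datastax.driver.core.Row
def getRowsByPartitionKeyId(id: I): Option[List[M]] = {
  val resultSet = session.execute(whereConditions(tablename, id))
  // all() drains the ResultSet into a java.util.List[Row]; asScala wraps it for Scala.
  val models = resultSet.all().asScala.toList.map(rowToModel)
  if (models.isEmpty) None else Some(models)
}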

EsHadoopException: Could not write all entries for bulk operation Spark Streaming

I want to traverse the stream of data, run a query on it and return the results, which should be written into Elasticsearch. I tried to use the mapPartitions method for creating the connection to the database; however, I get the following error, which indicates that the partition returns None to the RDD (I guess some action should be added after the transformations):
org.elasticsearch.hadoop.EsHadoopException: Could not write all entries for bulk operation [10/10]. Error sample (first [5] error messages)
What can be changed in the code to get the data into the RDD and send it to Elasticsearch without any trouble?
Also, I had a variant of the solution to this problem using flatMap in foreachRDD; however, it creates a connection to the database for each RDD, which is not efficient in terms of performance.
This is the code for streaming data processing:
wordsArrays.foreachRDD(rdd => {
  rdd.mapPartitions { part => {
      val neo4jConfig = neo4jConfigurations.getNeo4jConfig(args(1))
      part.map(data => {
        val recommendations = execNeo4jSearchQuery(neo4jConfig, data)
        val calendarTime = Calendar.getInstance.getTime
        val recommendationsMap = convertDataToMap(recommendations, calendarTime)
        recommendationsMap
      })
    }
  }
}.saveToEs("rdd-timed/output")
)
The problem was that I tried to convert the iterator directly into an Array, although it holds multiple rows of my records. That is why Elasticsearch was not able to map this collection of records to the defined single-record schema.
Here is the code that works properly:
wordsArrays.foreachRDD(rdd => {
  rdd.mapPartitions { partition => {
      val neo4jConfig = neo4jConfigurations.getNeo4jConfig(args(1))
      val result = partition.map(data => {
        val recommendations = execNeo4jSearchQuery(neo4jConfig, data)
        val calendarTime = Calendar.getInstance.getTime
        convertDataToMap(recommendations, calendarTime)
      }).toList.flatten
      result.iterator
    }
  }.saveToEs("rdd-timed/output")
})

Spark accumulator empty when used in UDF

I was working on optimizing my Spark process and was trying to use a UDF with an accumulator. I have gotten the accumulator to work on its own, and was looking to see if I would get any speedup using a UDF. But instead, when I wrap the accumulator in the UDF, it remains empty. Am I doing something wrong in particular? Is there something going on with lazy execution where, even with my .count, it is still not executing?
Input:
0,[0.11,0.22]
1,[0.22,0.33]
Output:
(0,0,0.11),(0,1,0.22),(1,0,0.22),(1,1,0.33)
Code
val accum = new MapAccumulator2d()
val session = SparkSession.builder().getOrCreate()
session.sparkContext.register(accum)
// Does not work - empty accumulator
val rowAccum = udf((itemId: Int, item: mutable.WrappedArray[Float]) => {
  val map = item
    .zipWithIndex
    .map(ff => {
      ((itemId, ff._2), ff._1.toDouble)
    }).toMap
  accum.add(map)
  itemId
})
dataFrame.select(rowAccum(col("itemId"), col("jaccardList"))).count
// Works
dataFrame.foreach(f => {
  val map = f.getAs[mutable.WrappedArray[Float]](1)
    .zipWithIndex
    .map(ff => {
      ((f.getInt(0), ff._2), ff._1.toDouble)
    }).toMap
  accum.add(map)
})
val list = accum.value.toList.map(f => (f._1._1, f._1._2, f._2))
Looks like the only issue here is using count to "trigger" the lazily-evaluated UDF: Spark is "smart" enough to realize that the select operation can't change the result of count and therefore doesn't really execute the UDF. Choosing a different operation (e.g. collect) shows that the UDF works and updates the accumulator.
Here's a (more concise) example:
val accum = sc.longAccumulator
val rowAccum = udf((itemId: Int) => { accum.add(itemId); itemId })
val dataFrame = Seq(1,2,3,4,5).toDF("itemId")
dataFrame.select(rowAccum(col("itemId"))).count() // won't trigger UDF
println(s"RESULT: ${accum.value}") // prints 0
dataFrame.select(rowAccum(col("itemId"))).collect() // triggers UDF
println(s"RESULT: ${accum.value}") // prints 15

Working with two RDDs in Apache Spark

I am using Calliope, i.e. the Spark plugin to connect with Cassandra. I have created 2 RDDs which look like this:
class A
val persistLevel = org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK
val cas1 = CasBuilder.cql3.withColumnFamily("cassandra_keyspace", "cassandra_coulmn_family 1")
val sc1 = new SparkContext("local", "name it any thing ")
var rdd1 = sc.cql3Cassandra[SCALACLASS_1](cas1)
var rddResult1 = rdd1.persist(persistLevel)
class B
val cas2 = CasBuilder.cql3.withColumnFamily("cassandra_keyspace", "cassandra_coulmn_family 2")
var rdd2 = sc1.cql3Cassandra[SCALACLASS_2](cas2)
var rddResult2 = rdd2.persist(persistLevel)
Somehow the following code, which creates a new RDD using the other 2, is not working. Is it possible that we cannot iterate over 2 RDDs together?
Here is the code snippet which is not working:
case class Report(id: Long, anotherId: Long)
var reportRDD = rddResult2.flatMap(f => {
  val buf = List[Report]()
  **rddResult1.collect().toList**.foldLeft(buf)((k, v) => {
    val buf1 = new ListBuffer[Report]
    buf ++ v.INSTANCE_VAR_FROM_SCALACLASS_1.foldLeft(buf1)((ik, iv) => {
      buf1 += Report(f.INSTANCE_VAR_FROM_SCALACLASS_1, iv.INSTANCE_VAR_FROM_SCALACLASS_2)
    })
  })
})
while if I replace the bold part with a val initialized beforehand, like this:
val collection = rddResult1.collect().toList
var reportRDD = rddResult2.flatMap(f => {
  val buf = List[Report]()
  **collection**.foldLeft(buf)((k, v) => {
    val buf1 = new ListBuffer[Report]
    buf ++ v.INSTANCE_VAR_FROM_SCALACLASS_1.foldLeft(buf1)((ik, iv) => {
      buf1 += Report(f.INSTANCE_VAR_FROM_SCALACLASS_1, iv.INSTANCE_VAR_FROM_SCALACLASS_2)
    })
  })
})
it works. Is there any explanation?
You are mixing a transformation with an action. The closure of rdd2.flatMap is executed on the workers, while rdd1.collect is an 'action' in Spark lingo and delivers data back to the driver. So, informally, you could say that the data is not there when you try to flatMap over it. (I don't know enough of the internals yet to pinpoint the exact root cause.)
If you want to operate on both RDDs distributedly, you should join them using one of the join functions (join, leftOuterJoin, rightOuterJoin, cogroup).
E.g.
val mappedRdd1 = rdd1.map(x=> (x.id,x))
val mappedRdd2 = rdd2.map(x=> (x.customerId, x))
val joined = mappedRdd1.join(mappedRdd2)
joined.flatMap(...reporting logic..).collect
You can operate on RDDs in the application. But you cannot operate on RDDs in the executors (the worker nodes). The executors cannot give commands to drive the cluster. The code inside flatMap runs on the executors.
In the first case, you try to operate on an RDD in the executor. I reckon you would get a NotSerializableException as you cannot even send the RDD object to the executors. In the second case, you pull the RDD contents to the application, and then send this simple List to the executors. (Lambda captures are automatically serialized.)
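As a follow-up, a common refinement of that second approach is to broadcast the collected list, so it is shipped to each executor once instead of being serialized into every task closure. A minimal sketch with hypothetical stand-ins for the question's two RDDs:
import org.apache.spark.{SparkConf, SparkContext}
case class Report(id: Long, anotherId: Long)
val sc = new SparkContext(new SparkConf().setAppName("broadcast-sketch").setMaster("local[*]"))
// Hypothetical stand-ins for rddResult1 and rddResult2 from the question.
val rdd1 = sc.parallelize(Seq(1L, 2L, 3L))
val rdd2 = sc.parallelize(Seq(10L, 20L))
// Collect the smaller dataset on the driver and broadcast it once.
val broadcastList = sc.broadcast(rdd1.collect().toList)
// Executors read the broadcast value instead of a closure-captured List.
val reportRDD = rdd2.flatMap { anotherId =>
  broadcastList.value.map(id => Report(id, anotherId))
}
reportRDD.collect().foreach(println)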