What part of this code will execute on the Spark driver?

val formatter: DateTimeFormatter = DateTimeFormatter.ofPattern("yyyy/MM")
def getEventCountOnWeekdaysPerMonth(data: RDD[(LocalDateTime, Long)]): Array[(String, Long)] = {
val result = data
.filter(e => e._1.getDayOfWeek.getValue < DayOfWeek.SATURDAY.getValue)
.map(mapDateTime2Date)
.reduceByKey(_ + _)
.collect()
result
.map(e => (e._1.format(formatter), e._2))
}
private def mapDateTime2Date(v: (LocalDateTime, Long)): (LocalDate, Long) = {
(v._1.toLocalDate.withDayOfMonth(1), v._2)
}
In the code above, the data stored in "result" is sent to the driver during execution because of collect.
Will the mapping on "result" take place on the driver, or will the executors also store "result", perform the mapping, and hold it until the next action is called?

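Actually, nothing is executed here because there are only method declarations (apart from the formatter). If you call getEventCountOnWeekdaysPerMonth, the filter, map and reduceByKey steps run on the executors, collect brings the reduced data to the driver, and only this part is executed on the driver: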
result
.map(e => (e._1.format(formatter), e._2))
This is because result is a plain scala array.

This executes on the driver, so the next action is not relevant. collect brings the result set to the driver, and any further processing happens there. You would need makeRDD or an equivalent (such as parallelize) to turn the array back into an RDD so that map runs on the executors.
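To make the driver-side step concrete, here is a minimal sketch in plain Scala with no Spark involved. The hand-built array `collected` is a hypothetical stand-in for what collect() would return; the object name is also only for illustration.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

object DriverSideFormatting {
  val formatter: DateTimeFormatter = DateTimeFormatter.ofPattern("yyyy/MM")

  // Hypothetical stand-in for the Array that collect() hands back to the driver.
  val collected: Array[(LocalDate, Long)] = Array(
    (LocalDate.of(2020, 1, 1), 12L),
    (LocalDate.of(2020, 2, 1), 7L)
  )

  // Array.map is an ordinary, eager Scala collection call:
  // it runs in the current JVM, which in a Spark job is the driver.
  val formatted: Array[(String, Long)] =
    collected.map { case (month, count) => (month.format(formatter), count) }

  def main(args: Array[String]): Unit =
    formatted.foreach(println)
}
```

Because `collected` is an Array and not an RDD, nothing here is lazy and no further action is needed to trigger the mapping.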

Related

How to retrieve value from the output of a scala Future?

I am trying to query a table, store values of the query in a Scala Map & return the same map.
To do that, I came up with the following code:
def getBounds(incLogIdMap:scala.collection.mutable.Map[String, String]): Future[scala.collection.mutable.Map[String, String]] = Future {
var boundsMap = scala.collection.mutable.Map[String, String]()
incLogIdMap.keys.foreach(table => if(!incLogIdMap(table).contains("INVALID")) {
val minMax = s"select max(cast(to_char(update_tms,'yyyyddmmhhmmss') as bigint)) maxTms, min(cast(to_char(update_tms,'yyyyddmmhhmmss') as bigint)) minTms from queue.${table} where key_ids in (${incLogIdMap(table)})"
val boundsDF = spark.read.format("jdbc").option("url", commonParams.getGpConUrl()).option("dbtable", s"(${minMax}) as ctids")
.option("user", commonParams.getGpUserName()).option("password", commonParams.getGpPwd()).load()
val maxTms = boundsDF.select("minTms").head.getLong(0).toString + "," + boundsDF.select("maxTms").head.getLong(0).toString
boundsMap += (table -> maxTms)
}
)
boundsMap
}
To receive the value from the method getBounds, I used onComplete as below:
val tmsobj = new MinMaxVals(spark, commonParams)
val boundsMap = tmsobj.getBounds(incLogIds)
boundsMap.onComplete({
case Success(value) =>
case Failure(value) =>
})
I have coded in Scala before but I am new to Futures in Scala. Could anyone let me know how I can retrieve the value returned by getBounds into val boundsMap?
You can use Await (not the best approach):
val boundsMap = Await.result(tmsobj.getBounds(incLogIds),Duration.Inf)
Or use the value only when you need
val boundsMap = tmsobj.getBounds(incLogIds)
boundsMap.map(value => Smth_To_Do(value))
Accessing a value from a Future is not recommended, as it defeats the purpose of asynchronous computation. However, there may be cases where you are dealing with legacy code or some other situation where fetching the value from the future is the way forward. To deal with such situations, there are two approaches:
Using await that will block the thread
Await.result(getBounds, 10 seconds)
Here Await.result waits up to 10 seconds for the getBounds future to complete. If it completes within this time, you get the value; otherwise an exception is thrown. The biggest drawback of this method is that it blocks the current thread of execution.
Using a callback method onComplete as you have used
getBounds onComplete {
case Success(someOption) => myMethod(someOption)
case Failure(t) => println("Error")
}
onComplete registers a callback function that is executed whenever the future completes. This is comparatively safer than Await, because it does not block the calling thread.
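The two approaches can be compared side by side in a small sketch. It assumes a hypothetical `fetchBounds` that stands in for getBounds and returns a hard-coded map; the timeout is written as `10.seconds` to avoid postfix operator syntax.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object FutureValueSketch {
  // Hypothetical stand-in for getBounds: produces a map asynchronously.
  def fetchBounds(): Future[Map[String, String]] =
    Future(Map("orders" -> "20190101000000,20191231235959"))

  // Non-blocking: derive a new Future from the result instead of extracting it.
  def tableCount(): Future[Int] = fetchBounds().map(_.size)

  // Blocking: wait at most 10 seconds for the result;
  // a TimeoutException is thrown if the future is not complete by then.
  def boundsBlocking(): Map[String, String] =
    Await.result(fetchBounds(), 10.seconds)

  def main(args: Array[String]): Unit = {
    println(boundsBlocking())
    println(Await.result(tableCount(), 10.seconds))
  }
}
```

Staying inside the Future with map (or flatMap) for as long as possible, and blocking at most once at the outer edge of the program, is usually the less surprising design.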
You can refer to Accessing value returned by scala futures for further details.
I hope that this answers your question.

Scala - how to get and print the contents of Either?

I'm processing data where some records may be corrupted. So I decided to explore the data and used Either to divide valid and invalid records.
I figured out how to count the number of each kind of record, and I now get the output for failedCount and successCount successfully.
But I have a problem with printing out each invalid (Left) sale record. What could be wrong with my approach?
I don't get any output when printing out failedSales.
def filterSales(rawSales: RDD[Sale]): RDD[(String, Sale)] = {
val filteredSales = rawSales
.map(sale => {
val saleOption = Try(sale.id -> sale)
saleOption match {
case Success(successSale) => Right(successSale)
case Failure(e) => Left(s"Corrupted sale: $sale;", e)
}
})
val failedCount: Long = filteredSales.filter(_.isLeft).count()
val successCount: Long = filteredSales.filter(_.isRight).count()
println("FAILED SALES COUNT: " + failedCount)
println("SUCCESS SALES COUNT: " + successCount)
// Problem here
val failedSales: RDD[Either.LeftProjection[(String, Throwable), (String, Sale)]] = filteredSales.map(_.left)
println("FAILED SALES: ")
// Doesn't produce any output
failedSales.foreach(println)
}
When you call foreach(fn) on an RDD, the function fn (println in your case) is executed on the worker nodes where the RDD is distributed. So the printing does happen, but on the executors rather than in the driver program whose output you are looking at.
If you have a small data set then you could collect() the RDD so the data is returned to your driver and you can println that.
If you have large data, you could saveAsTextFile() so it gets written to HDFS and you can download from there.
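Once the data is on the driver, the Left values still have to be picked out. The sketch below uses a plain Seq as a hypothetical stand-in for the collected records (the sample values are invented). It relies on the fact that `.left` only wraps every element in a projection, Rights included, whereas `collect` with a partial function keeps and unwraps just the Left payloads.

```scala
object LeftValuesSketch {
  // Hypothetical stand-in for records already brought back to the driver.
  val records: Seq[Either[String, (String, Int)]] = Seq(
    Right("a1" -> 10),
    Left("Corrupted sale: ???"),
    Right("b2" -> 20)
  )

  // Keep only the Left payloads and unwrap them in one step.
  val failed: Seq[String] = records.collect { case Left(message) => message }

  def main(args: Array[String]): Unit =
    failed.foreach(println)
}
```

RDD also has a `collect` overload that takes a partial function and returns another RDD, so the same filtering can be done before the data is brought to the driver.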

Using contains in scala - exception

I am encountering this error:
java.lang.ClassCastException: scala.collection.immutable.$colon$colon cannot be cast to [Ljava.lang.Object;
whenever I try to use "contains" to find if a string is inside an array. Is there a more appropriate way of doing this? Or, am I doing something wrong? (I am fairly new to Scala)
Here is the code:
val matches = Set[JSONObject]()
val config = new SparkConf()
val sc = new SparkContext("local", "SparkExample", config)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val ebay = sqlContext.read.json("/Users/thomassquires/Downloads/products.json")
val catalogue = sqlContext.read.json("/Users/thomassquires/Documents/catalogue2.json")
val eins = ebay.map(item => (item.getAs[String]("ID"), Option(item.getAs[Set[Row]]("itemSpecifics"))))
.filter(item => item._2.isDefined)
.map(item => (item._1 , item._2.get.find(x => x.getAs[String]("k") == "EAN")))
.filter(x => x._2.isDefined)
.map(x => (x._1, x._2.get.getAs[String]("v")))
.collect()
def catEins = catalogue.map(r => (r.getAs[String]("_id"), Option(r.getAs[Array[String]]("item_model_number")))).filter(r => r._2.isDefined).map(r => (r._1, r._2.get)).collect()
def matched = for(ein <- eins) yield (ein._1, catEins.filter(z => z._2.contains(ein._2)))
The exception occurs on the last line. I have tried a few different variants.
My data structure is one List[Tuple2[String, String]] and one List[Tuple2[String, Array[String]]] . I need to find the zero or more matches from the second list that contain the string.
Thanks
Long story short (there is still part that eludes me here*) you're using wrong types. getAs is implemented as fieldIndex (String => Int) followed by get (Int => Any) followed by asInstanceOf.
Since Spark stores array column data as WrappedArray rather than as Array or Set, calls like getAs[Array[String]] or getAs[Set[Row]] are not valid. If you want specific types you should use either getAs[Seq[T]] or getSeq[T] and convert your data to the desired type with toSet / toArray.
* See Why wrapping a generic method call with Option defers ClassCastException?
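The failure mode can be reproduced without Spark. In this sketch, `getAs` is a simplified, hypothetical stand-in for Row.getAs (just an unchecked cast), and `stored` plays the role of an array column value that is a Seq at runtime.

```scala
import scala.util.Try

object ErasedCastSketch {
  // Simplified, hypothetical stand-in for Row.getAs: an unchecked cast
  // to whatever type the caller asks for.
  def getAs[T](value: Any): T = value.asInstanceOf[T]

  // Stand-in for an array column value: a Seq at runtime, not an Array.
  val stored: Any = Seq("123", "456")

  // Asking for Seq[String] matches the runtime type;
  // convert afterwards if an Array is really needed.
  val asArray: Array[String] = getAs[Seq[String]](stored).toArray

  // Asking for Array[String] compiles, but fails with a ClassCastException
  // as soon as the value is actually used as an array.
  val wrongType: Try[Int] = Try(getAs[Array[String]](stored).length)

  def main(args: Array[String]): Unit = {
    println(asArray.mkString(", "))
    println(wrongType.isFailure)
  }
}
```

Because the type parameter is erased, the cast inside `getAs` cannot check anything; the check happens only at the point where the caller treats the result as an Array.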

How to get a result from Enumerator/Iteratee?

I am using play2 and reactivemongo to fetch a result from mongodb. Each item of the result needs to be transformed to add some metadata. Afterwards I need to apply some sorting to it.
To deal with the transformation step I use enumerate():
def ideasEnumerator = collection.find(query)
.options(QueryOpts(skipN = page))
.sort(Json.obj(sortField -> -1))
.cursor[Idea]
.enumerate()
Then I create an Iteratee as follows:
val processIdeas: Iteratee[Idea, Unit] =
Iteratee.foreach[Idea] { idea =>
resolveCrossLinks(idea) flatMap { idea =>
addMetaInfo(idea.copy(history = None))
}
}
Finally I feed the Iteratee:
ideasEnumerator(processIdeas)
And now I'm stuck. Every example I saw does some println inside foreach, but seems not to care about a final result.
So when all documents are returned and transformed how do I get a Sequence, a List or some other datatype I can further deal with?
Change the signature of your Iteratee from Iteratee[Idea, Unit] to Iteratee[Idea, Seq[A]], where A is the element type of the result. The first type parameter of Iteratee is the input type and the second is the output type. In your case you gave the output type as Unit, so nothing is returned.
Take a look at the code below. It may not compile, but it shows the basic usage.
ideasEnumerator.run(
Iteratee.fold(List.empty[MyObject]) { (accumulator, next) =>
accumulator + resolveCrossLinks(next) flatMap { next =>
addMetaInfo(next.copy(history = None))
}
}
) // returns Future[List[MyObject]]
As you can see, an Iteratee is simply a state machine. Just extract that Iteratee part and assign it to a val:
val iteratee = Iteratee.fold(List.empty[MyObject]) { (accumulator, next) =>
accumulator + resolveCrossLinks(next) flatMap { next =>
addMetaInfo(next.copy(history = None))
}
}
and feel free to use it wherever you need to convert from your Idea to List[MyObject].
With the help of your answers I ended up with
val processIdeas: Iteratee[Idea, Future[Vector[Idea]]] =
Iteratee.fold(Future(Vector.empty[Idea])) { (accumulator: Future[Vector[Idea]], next:Idea) =>
resolveCrossLinks(next) flatMap { next =>
addMetaInfo(next.copy(history = None))
} flatMap (ideaWithMeta => accumulator map (acc => acc :+ ideaWithMeta))
}
val ideas = collection.find(query)
.options(QueryOpts(page, perPage))
.sort(Json.obj(sortField -> -1))
.cursor[Idea]
.enumerate(perPage).run(processIdeas)
This later needs an ideas.flatMap(identity) to flatten the resulting Future of Futures, but I'm fine with that, and I think everything looks idiomatic and elegant.
The performance gained compared to creating a list and iterate over it afterwards is negligible though.
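The accumulator pattern used in that fold does not depend on Iteratees. As a rough sketch using only the standard library, the same idea can be written with foldLeft over a plain collection; `Idea` and `addMeta` here are simplified, hypothetical stand-ins for the real model and for resolveCrossLinks/addMetaInfo.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object FoldWithFuturesSketch {
  // Simplified, hypothetical model class.
  final case class Idea(title: String, meta: Option[String] = None)

  // Hypothetical asynchronous transformation of a single item.
  def addMeta(idea: Idea): Future[Idea] =
    Future(idea.copy(meta = Some("checked")))

  // Thread a Future[Vector[Idea]] accumulator through the fold and append
  // each transformed item; items are processed one after another.
  def process(ideas: Seq[Idea]): Future[Vector[Idea]] =
    ideas.foldLeft(Future.successful(Vector.empty[Idea])) { (accF, next) =>
      for {
        acc  <- accF
        done <- addMeta(next)
      } yield acc :+ done
    }

  def main(args: Array[String]): Unit =
    println(Await.result(process(Seq(Idea("a"), Idea("b"))), 5.seconds))
}
```

Folding into a single Future[Vector[Idea]] this way avoids a nested Future, since the accumulator itself is the only Future being returned.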

How to convert Future[BSONDocument] to list?

The code sends a request to MongoDB using ReactiveMongo and returns Future[BSONDocument] but my code handles lists of data, so I need to get the value of Future[BSONDocument] and then turn it into a list.
How do I do that preferably without blocking?
Update:
I am using ReactiveMongo RawCommand
def findLByDistance()(implicit ec: ExecutionContext) = db.command(RawCommand(
BSONDocument(
"aggregate" -> collName,
"pipeline" -> BSONArray(BSONDocument(
"$geoNear" -> BSONDocument(
"near" -> BSONArray(44.72,25.365),
"distanceField" -> "location.distance",
"maxDistance" -> 0.08,
"uniqueDocs" -> true)))
)))
And the result comes out in Future[BSONDocument]. For some simple queries I used default query builder which allowed for simple conversion
def findLimitedEvents()(implicit ec: ExecutionContext) =
collection.find(BSONDocument.empty)
.query(BSONDocument("tags" -> "lazy"))
.options(QueryOpts().batchSize(10))
.cursor.collect[List](10, true)
Basically, I need the RawCommand output type to match the one used previously.
Not sure about your exact use-case (showing some more code would help), but it might be useful to convert from List[Future[BSONDocument]] to a single Future[List[BSONDocument]], which you can then more easily call onSuccess or map on. You can do that via:
val futures: List[Future[A]] = List(f1, f2, f3)
val futureOfThose: Future[List[A]] = Future.sequence(futures)
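A self-contained version of that conversion, using Int values as a hypothetical stand-in for the documents, might look like this:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object SequenceSketch {
  // Three independent futures; Int stands in for the real document type.
  val futures: List[Future[Int]] = List(Future(1), Future(2), Future(3))

  // One future that completes when all of them have completed,
  // keeping the results in the order of the original list.
  val combined: Future[List[Int]] = Future.sequence(futures)

  def main(args: Array[String]): Unit =
    // Blocking here only so the demo prints before the JVM exits.
    println(Await.result(combined, 5.seconds))
}
```

If any of the input futures fails, the combined future fails as well, so a recover step may be needed when partial results are acceptable.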
You cannot "get" the value of a future without blocking. If you want to wait for a Future to complete, then you must block.
What you can do is map a Future into another Future:
val futureDoc: Future[BSONDocument] = ...
val futureList = futureDoc map { doc => docToList(doc) }
Eventually, you'll hit a point where you've combined, mapped, recovered, etc. all your futures and want something to happen with the result. This is where you either block, or establish a handler to do something with the eventual result:
val futureThing: Future[Thing] = ...
//this next bit will be executed at some later time,
//probably on a different thread
futureThing onSuccess {
case thing => doWhateverWith(thing)
}
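Put together, the map-then-handle flow might be sketched as follows. `Doc` and `docToList` are hypothetical stand-ins for BSONDocument and the conversion logic. The sketch uses foreach rather than onSuccess, because onSuccess was deprecated in Scala 2.12 and removed in 2.13.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object MapFutureSketch {
  // Hypothetical stand-in for BSONDocument.
  final case class Doc(values: List[String])

  // Hypothetical conversion from a document to the list the caller wants.
  def docToList(doc: Doc): List[String] = doc.values

  val futureDoc: Future[Doc] = Future(Doc(List("a", "b")))

  // Still a Future: the conversion runs once the document is available.
  val futureList: Future[List[String]] = futureDoc.map(docToList)

  def main(args: Array[String]): Unit = {
    // Handler for the successful case; runs at some later time,
    // probably on a different thread.
    futureList.foreach(items => println(s"got $items"))
    // Blocking here only so the demo does not exit before the future completes.
    Await.ready(futureList, 5.seconds)
  }
}
```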