How to time scala map functions in an aggregative fashion? - scala

I am having a hard time coming up with a solution to time individual functions in a Scala map operation. Let's say I have the following line in my code:
val foo = data.map(d => func1(d).func2())
where data is a Seq of N elements. How would I go about timing how much time my program has spent in func1 in total and in func2 in total? Since it is a map operation, these functions will be called N times, so each measurement should be added to a cumulative total.
How can I do this without breaking the Scala syntax?
NB: I want to end up with totalTime_inFunc1 and totalTime_inFunc2.

Let's say func2() returns YourType. Then you need to return a tuple (YourType, Long, Long) from the function inside map, where the second tuple element is the execution time of func1 and the third element is the execution time of func2. After that, you can easily get the total execution times from the seq of tuples using sum:
val fooWithTime = {
  data.map(d => {
    def now = System.currentTimeMillis // def, so it is re-evaluated on each call
    val beforeFunc1 = now
    val func1Result = func1(d)
    val func1Time = now - beforeFunc1
    val beforeFunc2 = now
    val result = func1Result.func2()
    (result, func1Time, now - beforeFunc2)
  })
}
val foo = fooWithTime.map(_._1)
val totalTimeFunc1 = fooWithTime.map(_._2).sum
val totalTimeFunc2 = fooWithTime.map(_._3).sum
Also, you can easily use your preferred method of measuring execution time instead of System.currentTimeMillis.
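For example, a minimal sketch of the same idea using System.nanoTime instead (func1 and func2 here are hypothetical stand-ins for the question's functions; func2 is modelled as a plain function on func1's result):
def func1(d: Int): String = d.toString // stand-in for the real func1
def func2(s: String): Int = s.length   // stand-in for the real func2

val data = Seq(1, 2, 3)

val timed = data.map { d =>
  val t0 = System.nanoTime()
  val r1 = func1(d)  // time spent in func1
  val t1 = System.nanoTime()
  val r2 = func2(r1) // time spent in func2
  val t2 = System.nanoTime()
  (r2, t1 - t0, t2 - t1)
}

val foo = timed.map(_._1)
val totalTime_inFunc1 = timed.map(_._2).sum // nanoseconds spent in func1 in total
val totalTime_inFunc2 = timed.map(_._3).sum // nanoseconds spent in func2 in total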

Look at closures. You can declare your functions so that they refer to variables in the enclosing scope, pass them to map, and have them increment those variables.
Edit
code with closures
object closure {
  var time1 = 0L
  var time2 = 0L
  def time[R](block: => R)(time: Int): R = {
    val t0 = System.nanoTime()
    val result = block // call-by-name
    val t1 = System.nanoTime()
    if (time == 1)
      time1 += t1 - t0
    else
      time2 += t1 - t0
    result
  }
  def fun1(i: Int): Int = {
    time { i + 1 }(1)
  }
  def fun2(i: Int): Int = {
    time { i + 2 }(2)
  }
}
val data = List(1,2,3,4,5,6,7,1,2,3,4,5,6,7,1,2,3,4,5,6,7,1,2,3,4,5,6,7,1,2,3,4,5,6,7)
val foo = data.map(d => closure.fun2(closure.fun1(d)))
closure.time1 // res4: Long = 22976
closure.time2 // res5: Long = 25438
Edit 2
object closure {
  var time1 = 0L
  var time2 = 0L
  def time[R](block: => R)(time: Int): R = {
    val t0 = System.nanoTime()
    val result = block // call-by-name
    val t1 = System.nanoTime()
    if (time == 1)
      time1 += t1 - t0
    else
      time2 += t1 - t0
    result
  }
  val data = List(1,2,3,4,5,6,7,1,2,3,4,5,6,7,1,2,3,4,5,6,7,1,2,3,4,5,6,7,1,2,3,4,5,6,7)
  val test = new test // assumes a user-defined class `test` providing fun1, whose result provides fun2
  val foo = data.map(d => {
    val fun1Result = time { test.fun1(d) }(1)
    time { fun1Result.fun2(d) }(2)
  })
}

val s = Seq(1, 2, 3)
val (mappedSeq, totalTime) = s.map(x => {
  // call your map functions on x here and time them; x is the current element
  // (the 5.5 below is a placeholder for the measured time)
  val timing = 5.5
  // then return a tuple with the mapped element and the time taken for the map function
  (x, timing)
}).foldLeft((Seq.empty[Int], 0d))((accumulator, pair) => (accumulator._1 :+ pair._1, accumulator._2 + pair._2))
println(totalTime)
println(mappedSeq.mkString(", "))
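A minimal sketch of the same fold-based approach with a real measurement instead of the 5.5 placeholder (doubleIt is a hypothetical function standing in for the mapped operation):
def doubleIt(x: Int): Int = x * 2 // hypothetical function to time

val (results, totalMillis) = Seq(1, 2, 3).map { x =>
  val start = System.currentTimeMillis()
  val mapped = doubleIt(x)
  val elapsed = (System.currentTimeMillis() - start).toDouble
  (mapped, elapsed)
}.foldLeft((Seq.empty[Int], 0d)) { (acc, pair) =>
  (acc._1 :+ pair._1, acc._2 + pair._2)
}

println(s"total time: $totalMillis ms, results: ${results.mkString(", ")}")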

Related

Is there a performance penalty for scala functions with implicit parameters?

The code below shows consistent results on my machine: the code without implicits performs better
case class Config(appName: String, timeout: Int)

def appendWithImplicit(str: StringBuilder)(implicit config: Config): String = {
  val temp = new StringBuilder()
  temp.append(str).append(config.appName)
  temp.toString()
}

def appendNoImplicit(str: StringBuilder, appName: String, timeout: Int): String = {
  val temp = new StringBuilder()
  temp.append(str).append(appName)
  temp.toString()
}

implicit val config = Config("myapp", 100)
val iterations = 100000

var start = System.currentTimeMillis()
for (i <- 1 to iterations)
  appendWithImplicit(new StringBuilder(100, "foo"))
var end = System.currentTimeMillis()
println("method with implicit " + (end - start))

start = System.currentTimeMillis()
for (i <- 1 to iterations)
  appendNoImplicit(new StringBuilder(100, "foo"), config.appName, config.timeout)
end = System.currentTimeMillis()
println("method no implicit " + (end - start))
Is it something specific to my methods, or is it something related to implicits in Scala?
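One thing worth ruling out before blaming implicits is JIT warm-up: the loop measured first also pays the JIT compilation cost. A minimal sketch of a fairer comparison, assuming the two append methods and the config/iterations values defined above, warms both paths up before timing them:
def measure(label: String)(body: => Unit): Unit = {
  val start = System.nanoTime()
  body
  println(s"$label took ${(System.nanoTime() - start) / 1000000} ms")
}

// warm-up: let the JIT compile both code paths before timing them
for (_ <- 1 to 10000) {
  appendWithImplicit(new StringBuilder(100, "foo"))
  appendNoImplicit(new StringBuilder(100, "foo"), config.appName, config.timeout)
}

measure("with implicit") {
  for (_ <- 1 to iterations) appendWithImplicit(new StringBuilder(100, "foo"))
}
measure("no implicit") {
  for (_ <- 1 to iterations) appendNoImplicit(new StringBuilder(100, "foo"), config.appName, config.timeout)
}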

How to persist the list which we made dynamically from dataFrame in scala spark

def getAnimalName(dataFrame: DataFrame): List[String] = {
  dataFrame.select("animal").
    filter(col("animal").isNotNull && col("animal").notEqual("")).
    rdd.map(r => r.getString(0)).distinct().collect.toList
}
I am basically calling this function twice, to get the list for different purposes. I just want to know: is there a way to retain the list in memory, so we don't have to call the same function again and again and only have to generate the list once in Scala Spark?
Try something like the code below; you can also check the performance using the time function.
The code explanation is inline.
import org.apache.spark.rdd
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{DataFrame, functions}

object HandleCachedDF {

  var cachedAnimalDF: rdd.RDD[String] = _

  def main(args: Array[String]): Unit = {
    val spark = Constant.getSparkSess
    val df = spark.read.json("src/main/resources/hugeTest.json") // Load your Dataframe
    val df1 = time[rdd.RDD[String]] {
      getAnimalName(df)
    }
    val resultList = df1.collect().toList
    val df2 = time {
      getAnimalName(df)
    }
    val resultList1 = df2.collect().toList
    println(resultList.equals(resultList1))
  }

  def getAnimalName(dataFrame: DataFrame): rdd.RDD[String] = {
    if (cachedAnimalDF == null) { // Check if this is the first initialization of your RDD
      cachedAnimalDF = dataFrame.select("animal").
        filter(functions.col("animal").isNotNull && col("animal").notEqual("")).
        rdd.map(r => r.getString(0)).distinct().cache() // Cache your RDD
    }
    cachedAnimalDF // Return your cached RDD
  }

  def time[R](block: => R): R = { // Compute the time taken by the function to execute
    val t0 = System.nanoTime()
    val result = block // call-by-name
    val t1 = System.nanoTime()
    println("Elapsed time: " + (t1 - t0) + "ns")
    result
  }
}
You would have to persist or cache at this point, keeping a reference to the persisted RDD:
val animalNames = dataFrame.select("animal").
  filter(col("animal").isNotNull && col("animal").notEqual("")).
  rdd.map(r => r.getString(0)).distinct().persist
and then call the function as follows
def getAnimalName(animals: RDD[String]): List[String] = {
  animals.collect.toList
}
as many times as you need without repeating the process.
I hope it helps.
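Since the result is collected to the driver as a plain List anyway, another option (an assumption on my part, not from the answers above) is simply to memoise it on the driver, e.g. with a lazy val, so the Spark job runs only once:
// computed at most once, on first access; reuses the question's original getAnimalName
lazy val animalNames: List[String] = getAnimalName(dataFrame)

// both uses read the same in-memory list; no second Spark job is triggered
val forPurposeA = animalNames
val forPurposeB = animalNames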

Compilation error not found: value x but I declared it already, what's wrong

result.map { res =>
  val totaldocs: Int = res.value
  // do something with this number
}
//val totaldocs = 60
val totalpages: Int = (totaldocs / ipp) + 1
I get the compilation error "not found: value x" even though I declared it already. What is wrong with my implementation? Sorry, I am new to the Play framework and the Scala programming language.
I would say this line is the problem:
val totalpages:Int = (totaldocs/ipp)+1
because totaldocs is only defined inside the map scope
maybe you want something like:
private def getTotalPages(query: BSONDocument, ipp: Int)(implicit ec: ExecutionContext) = {
  val key = collectionName + ":" + BSONDocument.pretty(query)
  Logger.debug("Query key = " + key)
  val command = Count(query)
  val result: Future[CountResult] = collection.runCommand(command)
  result.map { res =>
    val totaldocs: Int = res.value
    // do something with this number
    val totalpages: Int = (totaldocs / ipp) + 1
    Logger.debug(s"Total docs $totaldocs, Total pages $totalpages, Items per page, $ipp")
    totalpages
  }
}
but now it will return a Future[Int] and you will have to deal with the Future at the call site.
Note: this is just one solution, depending on your code it may not be the most adequate one
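For illustration, a minimal sketch of how the caller could handle that Future, assuming getTotalPages as defined above and an implicit ExecutionContext in scope:
getTotalPages(query, ipp).map { totalpages =>
  // non-blocking: use the page count once the count command completes
  Logger.debug(s"Got $totalpages pages")
}

// or, only where blocking is acceptable (e.g. in tests):
// Await.result(getTotalPages(query, ipp), 5.seconds)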

Iterating over cogrouped RDD

I have used the cogroup function and obtained the following RDD:
org.apache.spark.rdd.RDD[(Int, (Iterable[(Int, Long)], Iterable[(Int, Long)]))]
Before the map operation, the joined object looks like this:
RDD[(Int, (Iterable[(Int, Long)], Iterable[(Int, Long)]))]
(-2095842000,(CompactBuffer((1504999740,1430096464017), (613904354,1430211912709), (-1514234644,1430288363100), (-276850688,1430330412225)),CompactBuffer((-511732877,1428682217564), (1133633791,1428831320960), (1168566678,1428964645450), (-407341933,1429009306167), (-1996133514,1429016485487), (872888282,1429031501681), (-826902224,1429034491003), (818711584,1429111125268), (-1068875079,1429117498135), (301875333,1429121399450), (-1730846275,1429131773065), (1806256621,1429135583312))))
(352234000,(CompactBuffer((1350763226,1430006650167), (-330160951,1430320010314)),CompactBuffer((2113207721,1428994842593), (-483470471,1429324209560), (1803928603,1429426861915))))
Now I want to do the following:
val globalBuffer = ListBuffer[Double]()
val joined = data1.cogroup(data2).map(x => {
  val listA = x._2._1.toList
  val listB = x._2._2.toList
  for (tupleB <- listB) {
    val localResults = ListBuffer[Double]()
    val itemToTest = Set(tupleB._1)
    val tempList = ListBuffer[(Int, Double)]()
    for (tupleA <- listA) {
      val tValue = someFunctionReturnDouble(tupleB._2, tupleA._2)
      val i = (tupleA._1, tValue)
      tempList += i
    }
    val sortList = tempList.sortWith(_._2 > _._2).slice(0, 20).map(i => i._1)
    val intersect = sortList.toSet.intersect(itemToTest)
    if (intersect.size > 0)
      localResults += 1.0
    else localResults += 0.0
    val normalized = sum(localResults.toList) / localResults.size
    globalBuffer += normalized
  }
})
//method sum
def sum(xs: List[Double]): Double = {//do the sum}
At the end of this I was expecting joined to be a collection of double values, but when I looked at it, it was Unit. Also, I feel this is not the Scala way of doing it. How do I obtain globalBuffer as the final result?
Hmm, if I understood your code correctly, it could benefit from these improvements:
val joined = data1.cogroup(data2).map(x => {
  val listA = x._2._1.toList
  val listB = x._2._2.toList
  val localResults = listB.map {
    case (intBValue, longBValue) =>
      val itemToTest = intBValue // it's always one element
      val tempList = listA.map {
        case (intAValue, longAValue) =>
          (intAValue, someFunctionReturnDouble(longBValue, longAValue))
      }
      val sortList = tempList.sortBy(-_._2).slice(0, 20).map(i => i._1)
      if (sortList.toSet.contains(itemToTest)) { 1.0 } else { 0.0 }
      // no real need to convert to a set for 20 elements, by the way
  }
  sum(localResults) / localResults.size
})
Transformations of RDDs are not going to modify globalBuffer. Copies of globalBuffer are made and sent out to each of the workers, but any modifications to these copies on the workers will never modify the globalBuffer that exists on the driver (the one you have defined outside the map on the RDD.) Here's what I do (with a few additional modifications):
val joined = data1.cogroup(data2) map { x =>
  val iterA = x._2._1
  val iterB = x._2._2
  var count, positiveCount = 0
  val tempList = ListBuffer[(Int, Double)]()
  for (tupleB <- iterB) {
    tempList.clear
    for (tupleA <- iterA) {
      val tValue = someFunctionReturnDouble(tupleB._2, tupleA._2)
      tempList += ((tupleA._1, tValue))
    }
    val sortList = tempList.sortWith(_._2 > _._2).iterator.take(20)
    if (sortList.exists(_._1 == tupleB._1)) positiveCount += 1
    count += 1
  }
  positiveCount.toDouble / count
}
At this point you can obtain a local copy of the proportions by using joined.collect.
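For example, a minimal sketch of pulling those proportions back to the driver (replacing the mutable globalBuffer from the question):
// joined is an RDD[Double] of per-key proportions; collect brings them to the driver
val proportions: Array[Double] = joined.collect()
val overallAverage = proportions.sum / proportions.length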

Task not serializable Flink

I am trying to do the PageRank basic example in Flink with a little bit of modification (only in reading the input file, everything else is the same). I am getting the error "Task not serializable", and below is part of the error output:
at org.apache.flink.api.scala.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:179)
at org.apache.flink.api.scala.ClosureCleaner$.clean(ClosureCleaner.scala:171)
Below is my code
object hpdb {

  def main(args: Array[String]) {
    val env = ExecutionEnvironment.getExecutionEnvironment

    val maxIterations = 10000
    val DAMPENING_FACTOR: Double = 0.85
    val EPSILON: Double = 0.0001
    val outpath = "/home/vinoth/bigdata/assign10/pagerank.csv"

    val links = env.readCsvFile[Tuple2[Long, Long]]("/home/vinoth/bigdata/assign10/ppi.csv",
      fieldDelimiter = "\t", includedFields = Array(1, 4)).as('sourceId, 'targetId).toDataSet[Link] // source and target

    val pages = env.readCsvFile[Tuple1[Long]]("/home/vinoth/bigdata/assign10/ppi.csv",
      fieldDelimiter = "\t", includedFields = Array(1)).as('pageId).toDataSet[Id] // pageId

    val noOfPages = pages.count()

    val pagesWithRanks = pages.map(p => Page(p.pageId, 1.0 / noOfPages))

    val adjacencyLists = links
      // initialize lists: ._1 is the source id and ._2 is the target id
      .map(e => AdjacencyList(e.sourceId, Array(e.targetId)))
      // concatenate lists
      .groupBy("sourceId").reduce {
        (l1, l2) => AdjacencyList(l1.sourceId, l1.targetIds ++ l2.targetIds)
      }

    // start iteration
    val finalRanks = pagesWithRanks.iterateWithTermination(maxIterations) {
      // ** the output shows the error here **
      currentRanks =>
        val newRanks = currentRanks
          // distribute ranks to target pages
          .join(adjacencyLists).where("pageId").equalTo("sourceId") {
            (page, adjacent, out: Collector[Page]) =>
              for (targetId <- adjacent.targetIds) {
                out.collect(Page(targetId, page.rank / adjacent.targetIds.length))
              }
          }
          // collect ranks and sum them up
          .groupBy("pageId").aggregate(SUM, "rank")
          // apply dampening factor
          // ** the output shows the error here **
          .map { p =>
            Page(p.pageId, (p.rank * DAMPENING_FACTOR) + ((1 - DAMPENING_FACTOR) / pages.count()))
          }

        // terminate if no rank update was significant
        val termination = currentRanks.join(newRanks).where("pageId").equalTo("pageId") {
          (current, next, out: Collector[Int]) =>
            // check for significant update
            if (math.abs(current.rank - next.rank) > EPSILON) out.collect(1)
        }

        (newRanks, termination)
    }

    val result = finalRanks

    // emit result
    result.writeAsCsv(outpath, "\n", " ")

    env.execute()
  }
}
Any help in the right direction is highly appreciated. Thank you.
The problem is that you reference the DataSet pages from within a MapFunction. This is not possible, since a DataSet is only the logical representation of a data flow and cannot be accessed at runtime.
What you have to do to solve this problem is to compute the count once outside the user function, e.g. val pagesCount = pages.count, and refer to this pagesCount variable inside your MapFunction.
What pages.count actually does is trigger the execution of the data flow graph so that the number of elements in pages can be counted; the result is then returned to your program.
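For illustration, a minimal sketch of the fix applied to the dampening step from the question (a fragment, not a full program: only the count and the map call change, everything else stays as posted):
val pagesCount = pages.count() // runs the count job once; pagesCount is a plain Long on the driver

// inside iterateWithTermination, reference the captured Long instead of the DataSet:
.map { p =>
  Page(p.pageId, (p.rank * DAMPENING_FACTOR) + ((1 - DAMPENING_FACTOR) / pagesCount))
}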