Scala TypeTags and performance

There are some answers around for equivalent questions about Java, but is Scala reflection (2.11, TypeTags) really slow? There is a long narrative write-up at http://docs.scala-lang.org/overviews/reflection/overview.html, but the answer to this question is hard to extract from it.
I see a lot of advice floating around about avoiding reflection, some of it perhaps predating the improvements in 2.11, but if it performs well it looks like it could address the debilitating aspect of the JVM's type erasure for Scala code.
Thanks!

Let's measure it.
I've created a simple class C with one method. All this method does is sleep for 10 ms.
Let's invoke this method
via reflection
directly
and see which is faster and by how much.
I've created three tests.
Test 1. Invoke via reflection. The execution time includes all the work necessary to set up reflection:
create the runtimeMirror, reflect the class, create the declaration for the method, and, at the last step, execute the method.
Test 2. Leave this preparation stage out of the measurement, since it can be reused.
We measure only the time of invoking the method via reflection.
Test 3. Invoke the method directly.
Results:
Reflection from start : job done in 2561ms got 101 (~1.5 s total extra spent redoing setup on every execution)
Invoke method reflection: job done in 1093ms got 101 (< 1 ms of setup per execution)
No reflection: job done in 1087ms got 101 (< 1 ms of setup per execution)
Conclusion:
The setup phase increases execution time dramatically, but there is no need to perform setup on every execution; like class initialization, it can be done once. So if you use reflection the right way (with a separate init stage), it shows reasonable performance and can be used in production.
Source code:
class C {
  def x = {
    Thread.sleep(10)
    1
  }
}
import org.scalatest.FunSpec

class XYZTest extends FunSpec {

  def withTime[T](procName: String, f: => T): T = {
    val start = System.currentTimeMillis()
    val r = f
    val end = System.currentTimeMillis()
    print(s"$procName job done in ${end - start}ms")
    r
  }

  describe("SomeTest") {
    it("rebuild each time") {
      val s = withTime("Reflection from start : ", (0 to 100).map { x =>
        val ru = scala.reflect.runtime.universe
        val m = ru.runtimeMirror(getClass.getClassLoader)
        val im = m.reflect(new C)
        val methodX = ru.typeOf[C].declaration(ru.TermName("x")).asMethod
        val mm = im.reflectMethod(methodX)
        mm().asInstanceOf[Int]
      }).sum
      println(s" got $s")
    }

    it("invoke each time") {
      val ru = scala.reflect.runtime.universe
      val m = ru.runtimeMirror(getClass.getClassLoader)
      val im = m.reflect(new C)
      val s = withTime("Invoke method reflection: ", (0 to 100).map { x =>
        val methodX = ru.typeOf[C].declaration(ru.TermName("x")).asMethod
        val mm = im.reflectMethod(methodX)
        mm().asInstanceOf[Int]
      }).sum
      println(s" got $s")
    }

    it("invoke directly") {
      val c = new C()
      val s = withTime("No reflection: ", (0 to 100).map { x =>
        c.x
      }).sum
      println(s" got $s")
    }
  }
}
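As a minimal sketch of the "separate init stage" idea (CachedX and callX are names introduced here for illustration, not part of the test above), the expensive reflective setup can be hoisted into fields that are initialized once:
import scala.reflect.runtime.{universe => ru}

object CachedX {
  // Done once, at initialization: runtime mirror and method symbol lookup.
  private val mirror  = ru.runtimeMirror(getClass.getClassLoader)
  private val methodX = ru.typeOf[C].declaration(ru.TermName("x")).asMethod

  // Per call: only instance reflection and the invocation itself remain.
  def callX(c: C): Int =
    mirror.reflect(c).reflectMethod(methodX)().asInstanceOf[Int]
}

// Usage: CachedX.callX(new C) // per the timings above, close to a direct call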


When to use def and val on Gatling scenarios and chains

I come from a Java background and am taking over a Gatling project where I noticed what seems to me to be some inconsistency in what is a val versus a def method. The examples below illustrate that, and I was wondering if there is any guidance on the best usage for these within the Gatling context, please.
Below are some examples where I'm not sure what should be used. I'm assuming a Switch makes sense inside a method, but I'm not sure about the others.
private def teacherViewResources: ChainBuilder =
  exec(viewResourcesFlow)
    .randomSwitch(
      70.0 -> pause(1, 2).exec(teacherLaunchResource),
      10.0 -> pause(1, 2).exec(teacherAssignResource),
      20.0 -> pause(1, 2).exec(teacherResourcesNext)
    )

private def teacherLaunchResource: ChainBuilder =
  exec(launchResourcesFlow)

val rootTeacherScenario = scenario("Root Teacher Scenario " + currentScenario.toString)
  .doIfOrElse(currentScenario == PossibleScenarios.BRANCH)(
    feed(userFeederTeacher).during(EXECUTION_TIME_SEC) {
      exec(teacherBranching)
    }
    //For use with atOnceUsers for debugging
    //feed(userFeederTeacher).exec(simulationTeacherBranching)
  )(
    exec { session =>
      logger.debug("Invalid teacher scenario chosen")
      session
    }
  )

val loginFlowWithExit = exec(loginFlow).exitHereIfFailed

val teacherBranching = group("teacherBranching") {
  exec(loginFlow)
    .exec(session => sessionSetSessionVariable(session))
    .exec(execFlaggedScenario(teacherDashboard)) // First method to run for a teacher
    .exec(logout())
}
Many thanks.
val is evaluated once, while def is evaluated on every call.
Remember that Gatling DSL components are just builders, not what is executed while your test is running.
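To see that difference concretely, here is a minimal plain-Scala sketch (no Gatling involved):
val v = { println("v: evaluated now, once"); 1 }
def d = { println("d: evaluated on every call"); 1 }

d + d // prints the d line twice; the v line was printed once, at definition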
Everything that doesn't take a parameter could be a val; you just have to make sure you don't end up using forward references, e.g.:
broken:
val foo = exec(???).exec(bar) // here, bar is still null because it's populated later in the code
val bar = exec(???)
correct:
val bar = exec(???)
val foo = exec(???).exec(bar) // fine because bar is already populated

Cats Writer Vector is empty

I wrote this simple program in my attempt to learn how Cats Writer works
import cats.data.Writer
import cats.syntax.applicative._
import cats.syntax.writer._
import cats.instances.vector._

object WriterTest extends App {
  type Logged2[A] = Writer[Vector[String], A]

  Vector("started the program").tell
  val output1 = calculate1(10)
  val foo = new Foo()
  val output2 = foo.calculate2(20)
  val (log, sum) = (output1 + output2).pure[Logged2].run
  println(log)
  println(sum)

  def calculate1(x: Int): Int = {
    Vector("came inside calculate1").tell
    val output = 10 + x
    Vector(s"Calculated value ${output}").tell
    output
  }
}

class Foo {
  def calculate2(x: Int): Int = {
    Vector("came inside calculate 2").tell
    val output = 10 + x
    Vector(s"calculated ${output}").tell
    output
  }
}
The program works and the output is
> run-main WriterTest
[info] Compiling 1 Scala source to /Users/Cats/target/scala-2.11/classes...
[info] Running WriterTest
Vector()
50
[success] Total time: 1 s, completed Jan 21, 2017 8:14:19 AM
But why is the vector empty? Shouldn't it contain all the strings on which I used the "tell" method?
Each time you call tell on one of your Vectors, you create a Writer[Vector[String], Unit]. However, you never actually do anything with those Writers; you just discard them. Furthermore, you call pure to create your final Writer, which simply creates a Writer with an empty Vector. You have to combine the writers together in a chain that carries your value and messages along.
type Logged[A] = Writer[Vector[String], A]

val (log, sum) = (for {
  _ <- Vector("started the program").tell
  output1 <- calculate1(10)
  foo = new Foo()
  output2 <- foo.calculate2(20)
} yield output1 + output2).run

def calculate1(x: Int): Logged[Int] = for {
  _ <- Vector("came inside calculate1").tell
  output = 10 + x
  _ <- Vector(s"Calculated value ${output}").tell
} yield output

class Foo {
  def calculate2(x: Int): Logged[Int] = for {
    _ <- Vector("came inside calculate2").tell
    output = 10 + x
    _ <- Vector(s"calculated ${output}").tell
  } yield output
}
Note the use of for notation. The definition of calculate1 is really:
def calculate1(x: Int): Logged[Int] = Vector("came inside calculate1").tell.flatMap { _ =>
  val output = 10 + x
  Vector(s"Calculated value ${output}").tell.map { _ => output }
}
flatMap is the monadic bind operation, which means it knows how to take two monadic values (in this case Writers) and join them together to get a new one. Here, it makes a Writer containing the concatenation of the two logs and the value of the one on the right.
Note that there are no side effects. There is no global state through which Writer can remember all your calls to tell. Instead, you make many Writers and join them together with flatMap to get one big one at the end.
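As a small self-contained sketch of that joining behaviour (the strings and numbers here are my own):
import cats.data.Writer
import cats.instances.vector._

val w1 = Writer(Vector("step 1"), 10)
val w2 = Writer(Vector("step 2"), 20)

// flatMap/map concatenate the logs and combine the values:
val combined = w1.flatMap(a => w2.map(b => a + b))
println(combined.run) // (Vector(step 1, step 2), 30)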
The problem with your example code is that you're not using the result of the tell method.
If you take a look at its signature, you'll see this:
final class WriterIdSyntax[A](val a: A) extends AnyVal {
  def tell: Writer[A, Unit] = Writer(a, ())
}
It is clear that tell returns a Writer[A, Unit] result, which is immediately discarded because you didn't assign it to a value.
The proper way to use a Writer (and any monad in Scala) is through its flatMap method. It would look similar to this:
println(
  Vector("started the program").tell.flatMap { _ =>
    15.pure[Logged2].flatMap { i =>
      Writer(Vector("ended program"), i)
    }
  }
)
The code above, when executed, will give you this:
WriterT((Vector(started the program, ended program),15))
As you can see, both messages and the int are stored in the result.
Now, this is a bit ugly, and Scala actually provides a better way to write it: for-comprehensions. A for-comprehension is a bit of syntactic sugar that allows us to write the same code this way:
println(
  for {
    _ <- Vector("started the program").tell
    i <- 15.pure[Logged2]
    _ <- Vector("ended program").tell
  } yield i
)
Now, going back to your example, what I would recommend is for you to change the return types of calculate1 and calculate2 to Writer[Vector[String], Int], and then try to make your application compile using what I wrote above.

In Apache Spark, how to make an RDD/DataFrame operation lazy?

Assuming that I would like to write a function foo that transforms a DataFrame:
object Foo {
  def foo(source: DataFrame): DataFrame = {
    ...complex iterative algorithm with a stopping condition...
  }
}
Since the implementation of foo contains many "actions" (collect, reduce, etc.), calling foo immediately triggers the expensive execution.
This is not a big problem; however, since foo only converts one DataFrame into another, by convention it would be better to allow lazy execution: the implementation of foo should run only if the resulting DataFrame or its derivative(s) are used on the driver (through another "action").
So far, the only way I have found to reliably achieve this is to write the whole implementation into a SparkPlan and superimpose it into the DataFrame's SparkExecution; this is very error-prone and involves lots of boilerplate code. What is the recommended way to do this?
It is not exactly clear to me what you are trying to achieve, but Scala itself provides at least a few tools which you may find useful:
lazy vals:
val rdd = sc.range(0, 10000)
lazy val count = rdd.count // Nothing is executed here
// count: Long = <lazy>
count // count is evaluated only when it is actually used
// Long = 10000
call-by-name (denoted by => in the function definition):
def foo(first: => Long, second: => Long, takeFirst: Boolean): Long =
  if (takeFirst) first else second

val rdd1 = sc.range(0, 10000)
val rdd2 = sc.range(0, 10000)

foo(
  { println("first"); rdd1.count },
  { println("second"); rdd2.count },
  true // Only first will be evaluated
)
// first
// Long = 10000
Note: In practice you should create a local lazy binding to make sure that the arguments are not evaluated on every access, as in the sketch below.
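Concretely, that note amounts to something like this:
def foo(first: => Long, second: => Long, takeFirst: Boolean): Long = {
  lazy val f = first  // each argument is now evaluated at most once...
  lazy val s = second // ...and only if it is actually needed
  if (takeFirst) f else s
}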
infinite lazy collections like Stream:
import org.apache.spark.mllib.random.RandomRDDs._
import org.apache.spark.rdd.RDD

val initial = normalRDD(sc, 1000000L, 10)

// Infinite stream of RDDs and actions, and nothing blows up :)
// (lazy is required because the definition refers to itself)
lazy val stream: Stream[RDD[Double]] = Stream(initial).append(
  stream.map {
    case rdd if !rdd.isEmpty =>
      val mu = rdd.mean
      rdd.filter(_ > mu)
    case _ => sc.emptyRDD[Double]
  }
)
Some subset of these should be more than enough to implement complex lazy computations.
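Applied to the original question, the lightest-weight of these is a lazy binding around foo at the call site; a minimal sketch, assuming Foo.foo as declared in the question:
import org.apache.spark.sql.DataFrame

// Sketch: nothing inside foo runs when `result` is declared;
// the whole iterative algorithm fires on the first access of `result`.
def run(source: DataFrame): DataFrame = {
  lazy val result = Foo.foo(source) // not evaluated here
  // ... other driver-side work ...
  result // first access triggers foo's actions
}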

dynamically parse a string and return a function in scala using reflection and interpreters

I am trying to dynamically interpret code given as a String.
Eg:
val myString = "def f(x:Int):Int=x+1"
I'm looking for a method that will build the real function out of it:
Eg:
val myIncrementFunction = myDarkMagicFunctionThatWillBuildMyFunction(myString)
println(myIncrementFunction(3))
will print 4
Use case: I want to use some simple functions from that interpreted code later in my code. For example, they can provide something like def fun(x: Int): Int = x + 1 as a string; I then use the interpreter to compile/execute that code, and I'd like to be able to use this fun(x) in a map, for example.
The problem is that the function's type is unknown to me, and this is one of the big problems, because I need to cast back from IMain.
I've read about reflection, the type system and such, and after some googling I reached this point. I also checked Twitter's util-eval, but I can't see much from the docs, and the examples in their tests are pretty much the same thing.
If I know the type, I can do something like:
val settings = new Settings
val imain = new IMain(settings)
val res = imain.interpret("def f(x:Int):Int=x+1; val ret = f _")
val myF = imain.valueOfTerm("ret").get.asInstanceOf[Function[Int, Int]]
println(myF(2))
which works correctly and prints 3, but I am blocked by the problem I mentioned above: I don't know the type of the function, and this example works only because I cast to the type I used when defining the test function to see how IMain works.
Do you know any method by which I could achieve this functionality?
I'm a newbie, so please excuse me if I've made any mistakes.
Thanks
OK, I managed to achieve the functionality I wanted. I am still looking to improve this code, but this snippet does what I want.
I used the Scala toolbox and quasiquotes:
import scala.reflect.runtime.universe.{Quasiquote, runtimeMirror}
import scala.tools.reflect.ToolBox

object App {
  def main(args: Array[String]): Unit = {
    val mirror = runtimeMirror(getClass.getClassLoader)
    val tb = ToolBox(mirror).mkToolBox()

    val data = Array(1, 2, 3)
    println("Data before function applied on it")
    println(data.mkString(","))

    println("Please enter the map function you want:")
    val function = scala.io.StdIn.readLine()

    val functionWrapper = "object FunctionWrapper { " + function + "}"
    val functionSymbol = tb.define(tb.parse(functionWrapper).asInstanceOf[tb.u.ImplDef])

    // Map each element using the user-specified function
    val dataAfterFunctionApplied = data.map(x => tb.eval(q"$functionSymbol.function($x)"))

    println("Data after function applied on it")
    println(dataAfterFunctionApplied.mkString(","))
  }
}
And here is the result in the terminal:
Data before function applied on it
1,2,3
Please enter the map function you want:
def function(x: Int): Int = x + 2
Data after function applied on it
3,4,5
Process finished with exit code 0
I wanted to elaborate on the previous answer and the comment on it, and benchmark the two approaches:
import scala.reflect.runtime.universe.{Quasiquote, runtimeMirror}
import scala.tools.reflect.ToolBox

object Runtime {
  def time[R](block: => R): R = {
    val t0 = System.nanoTime()
    val result = block // call-by-name
    val t1 = System.nanoTime()
    println("Elapsed time: " + (t1 - t0) + " ns")
    result
  }

  def main(args: Array[String]): Unit = {
    val mirror = runtimeMirror(getClass.getClassLoader)
    val tb = ToolBox(mirror).mkToolBox()

    val data = Array(1, 2, 3)
    println(s"Data before function applied on it: '${data.toList}")
    val function = "def apply(x: Int): Int = x + 2"
    println(s"Function: '$function'")
    println("#######################")

    println(".... with tb.eval")
    val functionWrapper = "object FunctionWrapper { " + function + "}"
    // This takes around 1 sec!
    val functionSymbol = time { tb.define(tb.parse(functionWrapper).asInstanceOf[tb.u.ImplDef]) }
    // This takes around 0.5 sec (one eval per element)!
    val result = time { data.map(x => tb.eval(q"$functionSymbol.apply($x)")) }
    println(s"Data after function applied on it: '${result.toList}'")

    println(".... without tb.eval")
    // This takes around 0.5 sec, but only once!
    val func = time { tb.eval(q"$functionSymbol.apply _").asInstanceOf[Int => Int] }
    // This takes well under a millisecond!
    val result2 = time { data.map(func) }
    println(s"Data after function applied on it: '${result2.toList}'")
  }
}
If we execute the code above, we see the following output:
Data before function applied on it: 'List(1, 2, 3)
Function: 'def apply(x: Int): Int = x + 2'
#######################
.... with tb.eval
Elapsed time: 716542980 ns
Elapsed time: 661386581 ns
Data after function applied on it: 'List(3, 4, 5)'
.... without tb.eval
Elapsed time: 394119232 ns
Elapsed time: 85713 ns
Data after function applied on it: 'List(3, 4, 5)'
This is just to emphasize the importance of evaluating once to extract a function and then applying that function to the data, rather than calling tb.eval again for every element, as the comment on the previous answer indicates.
You can use the twitter-util library to do this; check the test file:
https://github.com/twitter/util/blob/b0696d0/util-eval/src/test/scala/com/twitter/util/EvalTest.scala
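Based on that test file, usage is roughly like the sketch below (treat it as an assumption and check the linked tests for the exact API):
import com.twitter.util.Eval

// Sketch: Eval compiles the source string and casts the result
// of the last expression to the requested type parameter.
val eval = new Eval()
val inc = eval[Int => Int]("(x: Int) => x + 1")
println(inc(3)) // 4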
If you need to use IMain, perhaps because you want to use the interpreter with your own custom settings, you can do something like this:
a. First create a class meant to hold your result:
class ResHolder(var value: Any)
b. Create a container object to hold the result and interpret the code into that object:
import javax.script.ScriptException
import scala.tools.nsc.Settings
import scala.tools.nsc.interpreter.IMain
import scala.tools.nsc.interpreter.Results.{Error, Incomplete, Success}

val settings = new Settings()
val writer = new java.io.StringWriter()
val interpreter = new IMain(settings, new java.io.PrintWriter(writer))
val code = "def f(x:Int):Int=x+1"

// Create a container object to hold the result and bind it in the interpreter
val holder = new ResHolder(null)
interpreter.bind("$result", holder.getClass.getName, holder) match {
  case Success    =>
  case Error      => throw new ScriptException("error in: binding '$result' value\n" + writer)
  case Incomplete => throw new ScriptException("incomplete in: binding '$result' value\n" + writer)
}

val ir = interpreter.interpret("$result.value = " + code)

// Return the cast value or throw an exception based on the result
ir match {
  case Success =>
    val any = holder.value
    any.asInstanceOf[(Int) => Int]
  case Error      => throw new ScriptException("error in: '" + code + "'\n" + writer)
  case Incomplete => throw new ScriptException("incomplete in: '" + code + "'\n" + writer)
}

ParSeq.fill running sequentially?

I am trying to initialize an array in Scala using parallelization. However, when using the ParSeq.fill method, the performance doesn't seem to be any better than sequential initialization (Seq.fill). If I do the same task but initialize the collection with map, it is much faster.
To show my point, I set up the following example:
import scala.collection.parallel.immutable.ParSeq
import scala.util.Random

object Timer {
  def apply[A](f: => A): (A, Long) = {
    val s = System.nanoTime
    val ret = f
    (ret, System.nanoTime - s)
  }
}

object ParallelBenchmark extends App {
  def randomIsPrime: Boolean = {
    val n = Random.nextInt(1000000)
    (2 until n).exists(i => n % i == 0)
  }

  val seqSize = 100000

  val (_, timeSeq) = Timer { Seq.fill(seqSize)(randomIsPrime) }
  println(f"Time Seq:\t\t $timeSeq")

  val (_, timeParFill) = Timer { ParSeq.fill(seqSize)(randomIsPrime) }
  println(f"Time Par Fill:\t $timeParFill")

  val (_, timeParMap) = Timer { (0 until seqSize).par.map(_ => randomIsPrime) }
  println(f"Time Par map:\t $timeParMap")
}
And the result is:
Time Seq: 32389215709
Time Par Fill: 32730035599
Time Par map: 17270448112
Clearly showing that the fill method is not running in parallel.
The parallel collections library in Scala can only parallelize existing collections; fill hasn't been implemented yet (and may never be). Your method of using a Range to generate a cheap placeholder collection is probably your best option if you want to see a speed boost.
The underlying method called by ParSeq.fill builds the collection sequentially, which is why it shows no speedup.
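Until then, a parallel fill can be approximated the same way the question's third timing does; parFill below is a hypothetical helper name, not a standard library method:
// Sketch: emulate a parallel fill via a cheap parallel range.
def parFill[A](n: Int)(elem: => A) = (0 until n).par.map(_ => elem)

// Usage: parFill(seqSize)(randomIsPrime) // parallel, like the Par map timing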