Scala performance with functional constructs

I am currently profiling the performance of an application written in Scala, and I am wondering whether functional constructs can be used without paying a performance penalty. On the one hand, I love functional programming for its elegance and conciseness; on the other hand, I am scared about the resulting performance. I found one particularly good example.
I have a string with a million characters and I need to sum each digit. A typical functional approach would be like this:
val sum = value.map(_.asDigit).sum.toString
However, this beautiful, concise, functional approach takes 0.98s (almost a second). The imperative equivalent looks like this:
var sum = 0
for (digit <- value)
  sum += digit.asDigit
This imperative approach, on the other hand, takes only 0.022s (2.24% of the above time) - it's around 50 times faster...
I am sure the problem arises because Scala generates a new intermediate collection in the first approach and then iterates over it again to compute the sum.
Is it just a bad idea to rely on functional constructs? I mean, they are beautiful - I love them - but they are 50 times slower...
P.S.
I have also tried something else.
val sum = value.foldLeft(0)((sum, value) => sum + value.asDigit)
This functional approach, which is a little less concise and probably even harder to read than the imperative approach, takes 0.085s. It's harder to read AND still 4 times slower...

First of all: are you sure that you have properly benchmarked the two versions? Just measuring the execution time with something like System.nanoTime will not give reliable results. See this funny and insightful blog post by JVM performance guru Aleksey Shipilёv.
Here is a benchmark using the excellent Thyme Scala benchmarking library:
val value = "1234567890" * 100000
def sumf = value.map(_.asDigit).sum
def sumi = { var sum = 0; for(digit <- value) sum += digit.asDigit; sum }
val th = ichi.bench.Thyme.warmed(verbose = println)
scala> th.pbenchOffWarm("Functional vs. Imperative")(th.Warm(sumf))(th.Warm(sumi))
Benchmark comparison (in 6.654 s): Functional vs. Imperative
Significantly different (p ~= 0)
Time ratio: 0.36877 95% CI 0.36625 - 0.37129 (n=20)
First 40.25 ms 95% CI 40.15 ms - 40.34 ms
Second 14.84 ms 95% CI 14.75 ms - 14.94 ms
res3: Int = 4500000
So yes, the imperative version is faster, but not by nearly as much as you measured. In many, many situations the performance difference will be completely irrelevant. And for those few situations where the performance difference does matter, Scala gives you the opportunity to write imperative code. All in all, I think Scala is doing pretty well.
By the way: your second approach is almost as fast as the imperative version when properly benchmarked:
def sumf2 = value.foldLeft(0)(_ + _.asDigit)
scala> th.pbenchOffWarm("Functional2 vs. Imperative")(th.Warm(sumf2))(th.Warm(sumi))
Benchmark comparison (in 3.886 s): Functional2 vs. Imperative
Significantly different (p ~= 0)
Time ratio: 0.89560 95% CI 0.88823 - 0.90297 (n=20)
First 16.95 ms 95% CI 16.85 ms - 17.04 ms
Second 15.18 ms 95% CI 15.08 ms - 15.27 ms
res17: Int = 4500000
Update, due to a suggestion from @Odomontois: Note that if you really want to optimize this, you have to make sure that the chars of the string are not being boxed. Here is an imperative version that is not very nice to look at, but also almost as fast as possible. It uses the cfor macro from Spire, but a while loop would work just as well.
def sumi3 = {
  var sum = 0
  cfor(0)(_ < value.length, _ + 1) { i =>
    sum += value(i).asDigit
  }
  sum
}
scala> th.pbenchOffWarm("Imperative vs. optimized Imperative")(th.Warm(sumi))(th.Warm(sumi3))
Benchmark comparison (in 4.401 s): Imperative vs. optimized Imperative
Significantly different (p ~= 0)
Time ratio: 0.08925 95% CI 0.08880 - 0.08970 (n=20)
First 15.10 ms 95% CI 15.04 ms - 15.16 ms
Second 1.348 ms 95% CI 1.344 ms - 1.351 ms
res9: Int = 4500000
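Since the update above notes that a plain while loop would work just as well as cfor, here is a rough, unbenchmarked sketch of that variant (same idea: index into the string directly so the chars are never boxed):
def sumi4 = {
  var sum = 0
  var i = 0
  // charAt returns an unboxed Char, so no boxing occurs in this loop
  while (i < value.length) {
    sum += value.charAt(i).asDigit
    i += 1
  }
  sum
}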
Premature optimization disclaimer:
Unless you are absolutely sure that a) a piece of code is a performance bottleneck and b) the imperative version is much faster, I would always prefer the most readable version over the fastest. Scala 2.12 will come with a new optimizer that will make a lot of the overhead of functional style much smaller, since it can do advanced optimizations such as closure inlining in many cases.
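If you want to experiment with that optimizer once 2.12 is available, it is enabled through compiler flags in the build. A minimal build.sbt sketch; the version number and exact flag spellings are assumptions and changed between early 2.12 minor releases:
// build.sbt sketch - enable the Scala 2.12 inliner (flag names as of 2.12.3+)
scalaVersion := "2.12.4"
scalacOptions ++= Seq(
  "-opt:l:inline",        // turn on the new optimizer, including closure inlining
  "-opt-inline-from:**"   // allow inlining from anywhere on the classpath
)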

Related

Using Scastie and ScalaFiddle to evaluate code execution time

I'm trying to reduce the execution time of a small Scala routine, say, concatenation of strings. Since I'm too lazy to set up a local environment, I'm using online Scala compilers, but I found that the comparison result differs between Scastie and ScalaFiddle with the following code:
// routine 1
var startT1 = System.nanoTime()
(1 until 100 * 1000).foreach { x =>
  val sb = new StringBuilder("a")
  sb.append("b").append("c").append("d").append("e").append("f")
}
println(System.nanoTime() - startT1)

// routine 2
var startT2 = System.nanoTime()
(1 until 100 * 1000).foreach { x =>
  val arr = Array[Char]('a', 'b', 'c', 'd', 'e', 'f')
}
println(System.nanoTime() - startT2)
In ScalaFiddle routine 1 is faster, but in Scastie routine 2 is faster.
I've read this article https://medium.com/@otto.chrons/what-makes-scalafiddle-so-fast-9a3edf33ed4d, so it seems that ScalaFiddle actually runs JavaScript instead of Scala. But the remaining question is: can I really use Scastie for execution-time benchmarks?
Short answer: NO, I don't think you can rely on ANY online tools like Scastie or ScalaFiddle to verify performance.
There are a thousand and more reasons why a benchmark will show X millis of execution for some operation, and 99% of those reasons are the running environment: hardware, operating system, CPU architecture, load on the machine, the Scala compiler used, the JVM used, etc. And we don't know how the environment changes between runs on Scastie, for instance, so you can get totally different numbers and not know why; hence the benchmark results won't be reliable.
If you would like results you can rely on at least a bit, take a look at JMH (https://openjdk.java.net/projects/code-tools/jmh/) and its sbt helper plugin (https://github.com/ktoso/sbt-jmh), and run the benchmark in a known environment.
And along with the benchmark results, please post the details of the environment where they were run.
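To make that concrete, here is a minimal, hypothetical sbt-jmh sketch for the two routines from the question (the class name and project layout are my assumptions; run it with something like sbt "jmh:run ConcatBenchmark" in a project with enablePlugins(JmhPlugin)):
package bench

import org.openjdk.jmh.annotations._

// Returning the result from each method keeps the JIT from eliminating the work.
@State(Scope.Thread)
class ConcatBenchmark {

  @Benchmark
  def stringBuilderConcat: String = {
    val sb = new StringBuilder("a")
    sb.append("b").append("c").append("d").append("e").append("f")
    sb.toString
  }

  @Benchmark
  def charArray: Array[Char] =
    Array('a', 'b', 'c', 'd', 'e', 'f')
}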

Using intermediate variables inside a Spark map

Does creating intermediate variables inside a map or a flatMap in Spark result in worse performance?
Here are two versions of some code that are supposed to do the same thing.
v1:
val x = someRDD.flatMap { case (id, row) =>
  if (row.flag.isDefined)
    Some((id, (Some(row.a.get), Some(row.b.get),
      if (someFunction(row.c.get)) 1 else 0, 1)))
  else
    Some((id, (Some(row.a.get), None,
      if (someFunction(row.c.get)) 1 else 0, 1)))
}
v2:
val x = someRdd.flatMap { case (id, row) =>
  val a = Some(row.a.get)
  val b = if (row.flag.isDefined) Some(row.b.get) else None
  val c = if (someFunction(row.c.get)) 1 else 0
  Some((id, (a, b, c, 1)))
}
The difference is that v1 avoids creating the intermediate variables that v2 creates.
Does v2 have worse performance compared to v1? Does the creation of the a, b, c vals require a later garbage collection step (e.g. due to the cleanup needed on the root objects) that makes it much slower?
Obviously, this is data dependent and detailed profiling is necessary to definitively answer the question but I wanted to know if, in general, using intermediate variables leads to worse performance.
I feel that from a code readability aspect, v2 is much better but if we defer to v1 would it be premature optimization?
There will probably be no difference at all for primitive values (like your c variable); the compiler is smart enough to optimize them. For reference types, creating a value formally does result in more garbage to collect, so theoretically yes, this might affect performance. However, in practice you most likely won't be able to notice a performance difference (unless you create a lot of temporary objects, e.g. hundreds or thousands of large arrays) - there are JIT optimizations which might kick in here, and garbage collection is quite efficient these days, especially when handling lots of short-lived objects.
The best answer would be to profile your job, and not to attempt improvements like this in advance. I personally would treat optimizations like this as the very last step, after everything else stops helping. In the majority of cases, you can achieve much more impressive performance improvements by optimizing the plan of your job, e.g. by removing unnecessary shuffles or ensuring that your partitions are evenly sized.

Spark performance for Scala vs Python

I prefer Python over Scala. But, as Spark is natively written in Scala, I was expecting my code to run faster in the Scala version than in the Python version, for obvious reasons.
With that assumption, I thought I would learn and write the Scala version of some very common preprocessing code for about 1 GB of data. The data is picked from the SpringLeaf competition on Kaggle. Just to give an overview, it contains 1936 dimensions and 145232 rows, and is composed of various types, e.g. int, float, string, boolean. I am using 6 cores out of 8 for Spark processing; that's why I used minPartitions=6 so that every core has something to process.
Scala Code
val input = sc.textFile("train.csv", minPartitions = 6)
val input2 = input.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) iter.drop(1) else iter
}
val delim1 = "\001"

def separateCols(line: String): Array[String] = {
  val line2 = line.replaceAll("true", "1")
  val line3 = line2.replaceAll("false", "0")
  val vals: Array[String] = line3.split(",")
  for ((x, i) <- vals.view.zipWithIndex) {
    vals(i) = "VAR_%04d".format(i) + delim1 + x
  }
  vals
}

val input3 = input2.flatMap(separateCols)

def toKeyVal(line: String): (String, String) = {
  val vals = line.split(delim1)
  (vals(0), vals(1))
}

val input4 = input3.map(toKeyVal)

def valsConcat(val1: String, val2: String): String = {
  val1 + "," + val2
}

val input5 = input4.reduceByKey(valsConcat)
input5.saveAsTextFile("output")
Python Code
input = sc.textFile('train.csv', minPartitions=6)
DELIM_1 = '\001'

def drop_first_line(index, itr):
    if index == 0:
        return iter(list(itr)[1:])
    else:
        return itr

input2 = input.mapPartitionsWithIndex(drop_first_line)

def separate_cols(line):
    line = line.replace('true', '1').replace('false', '0')
    vals = line.split(',')
    vals2 = ['VAR_%04d%s%s' % (e, DELIM_1, val.strip('"'))
             for e, val in enumerate(vals)]
    return vals2

input3 = input2.flatMap(separate_cols)

def to_key_val(kv):
    key, val = kv.split(DELIM_1)
    return (key, val)

input4 = input3.map(to_key_val)

def vals_concat(v1, v2):
    return v1 + ',' + v2

input5 = input4.reduceByKey(vals_concat)
input5.saveAsTextFile('output')
Scala Performance
Stage 0 (38 mins), Stage 1 (18 sec)
Python Performance
Stage 0 (11 mins), Stage 1 (7 sec)
The two versions produce different DAG visualizations (which is why the pictures show different stage 0 operations for Scala (map) and Python (reduceByKey)).
But essentially both pieces of code try to transform the data into an RDD of (dimension_id, string of list of values) and save it to disk. The output will be used to compute various statistics for each dimension.
Performance-wise, the Scala code on real data like this seems to run 4 times slower than the Python version.
The good news for me is that this gives me good motivation to stay with Python. The bad news is that I don't quite understand why.
The original answer discussing the code can be found below.
First of all, you have to distinguish between different types of API, each with its own performance considerations.
RDD API
(pure Python structures with JVM based orchestration)
This is the component that will be most affected by the performance of the Python code and the details of the PySpark implementation. While Python performance is rather unlikely to be a problem, there are at least a few factors you have to consider:
Overhead of JVM communication. Practically all data that comes to and from the Python executor has to be passed through a socket and a JVM worker. While this is a relatively efficient local communication, it is still not free.
Process-based executors (Python) versus thread based (single JVM multiple threads) executors (Scala). Each Python executor runs in its own process. As a side effect, it provides stronger isolation than its JVM counterpart and some control over executor lifecycle but potentially significantly higher memory usage:
interpreter memory footprint
footprint of the loaded libraries
less efficient broadcasting (each process requires its own copy of a broadcast)
Performance of the Python code itself. Generally speaking, Scala is faster than Python, but it will vary from task to task. Moreover, you have multiple options including JITs like Numba, C extensions (Cython) or specialized libraries like Theano. Finally, if you don't use ML / MLlib (or simply the NumPy stack), consider using PyPy as an alternative interpreter. See SPARK-3094.
PySpark configuration provides the spark.python.worker.reuse option which can be used to choose between forking a Python process for each task and reusing an existing process. The latter option seems to be useful to avoid expensive garbage collection (this is more an impression than a result of systematic tests), while the former (default) is optimal in case of expensive broadcasts and imports.
Reference counting, used as the first line garbage collection method in CPython, works pretty well with typical Spark workloads (stream-like processing, no reference cycles) and reduces the risk of long GC pauses.
MLlib
(mixed Python and JVM execution)
Basic considerations are pretty much the same as before with a few additional issues. While basic structures used with MLlib are plain Python RDD objects, all algorithms are executed directly using Scala.
It means an additional cost of converting Python objects to Scala objects and the other way around, increased memory usage and some additional limitations we'll cover later.
As of now (Spark 2.x), the RDD-based API is in a maintenance mode and is scheduled to be removed in Spark 3.0.
DataFrame API and Spark ML
(JVM execution with Python code limited to the driver)
These are probably the best choice for standard data processing tasks. Since Python code is mostly limited to high-level logical operations on the driver, there should be no performance difference between Python and Scala.
A single exception is the usage of row-wise Python UDFs, which are significantly less efficient than their Scala equivalents. While there is some chance for improvements (there has been substantial development in Spark 2.0.0), the biggest limitation is the full roundtrip between the internal representation (JVM) and the Python interpreter. If possible, you should favor a composition of built-in expressions (example). Python UDF behavior has been improved in Spark 2.0.0, but it is still suboptimal compared to native execution.
This has improved significantly with the introduction of vectorized UDFs (SPARK-21190 and further extensions), which use Arrow streaming for efficient data exchange with zero-copy deserialization. For most applications their secondary overheads can simply be ignored.
Also be sure to avoid unnecessary passing data between DataFrames and RDDs. This requires expensive serialization and deserialization, not to mention data transfer to and from Python interpreter.
It is worth noting that Py4J calls have pretty high latency. This includes simple calls like:
from pyspark.sql.functions import col
col("foo")
Usually, it shouldn't matter (overhead is constant and doesn't depend on the amount of data) but in the case of soft real-time applications, you may consider caching/reusing Java wrappers.
GraphX and Spark DataSets
As of now (Spark 2.1), neither one provides a PySpark API, so you can say that PySpark is infinitely worse than Scala.
GraphX
In practice, GraphX development stopped almost completely and the project is currently in maintenance mode, with related JIRA tickets closed as won't fix. The GraphFrames library provides an alternative graph processing library with Python bindings.
Dataset
Subjectively speaking, there is not much place for statically typed Datasets in Python, and even if there were, the current Scala implementation is too simplistic and doesn't provide the same performance benefits as DataFrame.
Streaming
From what I've seen so far, I would strongly recommend using Scala over Python. It may change in the future if PySpark gets support for structured streams, but right now the Scala API seems to be much more robust, comprehensive and efficient. My experience is quite limited.
Structured streaming in Spark 2.x seems to reduce the gap between the languages, but for now it is still in its early days. Nevertheless, the RDD-based API is already referenced as "legacy streaming" in the Databricks documentation (date of access 2017-03-03), so it is reasonable to expect further unification efforts.
Non-performance considerations
Feature parity
Not all Spark features are exposed through PySpark API. Be sure to check if the parts you need are already implemented and try to understand possible limitations.
It is particularly important when you use MLlib and similar mixed contexts (see Calling Java/Scala function from a task). To be fair, some parts of the PySpark API, like mllib.linalg, provide a more comprehensive set of methods than Scala.
API design
The PySpark API closely reflects its Scala counterpart and as such is not exactly Pythonic. It means that it is pretty easy to map between languages but at the same time, Python code can be significantly harder to understand.
Complex architecture
PySpark data flow is relatively complex compared to pure JVM execution, which makes it much harder to reason about PySpark programs or to debug them. Moreover, at least a basic understanding of Scala and the JVM in general is pretty much a must-have.
Spark 2.x and beyond
The ongoing shift towards the Dataset API, with the RDD API frozen, brings both opportunities and challenges for Python users. While the high-level parts of the API are much easier to expose in Python, the more advanced features are pretty much impossible to use directly.
Moreover, native Python functions continue to be second-class citizens in the SQL world. Hopefully this will improve in the future with Apache Arrow serialization (current efforts target data collection, but UDF serde is a long-term goal).
For projects depending strongly on a Python codebase, pure Python alternatives like Dask or Ray could be an interesting option.
It doesn't have to be one vs. the other
The Spark DataFrame (SQL, Dataset) API provides an elegant way to integrate Scala/Java code in a PySpark application. You can use DataFrames to expose data to native JVM code and read back the results. I've explained some options somewhere else, and you can find a working example of a Python-Scala roundtrip in How to use a Scala class inside Pyspark.
It can be further augmented by introducing User Defined Types (see How to define schema for custom type in Spark SQL?).
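As an illustration of the DataFrame roundtrip pattern above, here is a hypothetical Scala-side helper (object name and column logic are made up) that a PySpark application could reach through the JVM gateway, passing a DataFrame in and reading a DataFrame back:
package interop

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, length}

// Hypothetical helper: everything here runs on the JVM, so no Python roundtrip
// is involved for the per-row work.
object StringStats {
  def withLength(df: DataFrame, column: String): DataFrame =
    df.withColumn(s"${column}_len", length(col(column)))
}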
What is wrong with code provided in the question
(Disclaimer: Pythonista point of view. Most likely I've missed some Scala tricks)
First of all, there is one part of your code which doesn't make sense at all. If you already have (key, value) pairs created using zipWithIndex or enumerate, what is the point of creating a string just to split it right afterwards? flatMap doesn't work recursively, so you can simply yield tuples and skip the following map altogether.
Another part I find problematic is reduceByKey. Generally speaking, reduceByKey is useful if applying an aggregate function can reduce the amount of data that has to be shuffled. Since you simply concatenate strings, there is nothing to gain here. Ignoring low-level stuff, like the number of references, the amount of data you have to transfer is exactly the same as for groupByKey.
Normally I wouldn't dwell on that, but as far as I can tell it is a bottleneck in your Scala code. Joining strings on the JVM is a rather expensive operation (see for example: Is string concatenation in scala as costly as it is in Java?). It means that something like _.reduceByKey((v1: String, v2: String) => v1 + ',' + v2), which is equivalent to input4.reduceByKey(valsConcat) in your code, is not a good idea.
If you want to avoid groupByKey you can try to use aggregateByKey with StringBuilder. Something similar to this should do the trick:
rdd.aggregateByKey(new StringBuilder)(
  (acc, e) => {
    if (!acc.isEmpty) acc.append(",").append(e)
    else acc.append(e)
  },
  (acc1, acc2) => {
    if (acc1.isEmpty | acc2.isEmpty) acc1.addString(acc2)
    else acc1.append(",").addString(acc2)
  }
)
but I doubt it is worth all the fuss.
Keeping the above in mind, I've rewritten your code as follows:
Scala:
val input = sc.textFile("train.csv", 6).mapPartitionsWithIndex {
  (idx, iter) => if (idx == 0) iter.drop(1) else iter
}

val pairs = input.flatMap(line => line.split(",").zipWithIndex.map {
  case ("true", i) => (i, "1")
  case ("false", i) => (i, "0")
  case p => p.swap
})

val result = pairs.groupByKey.map {
  case (k, vals) =>
    val valsString = vals.mkString(",")
    s"$k,$valsString"
}

result.saveAsTextFile("scalaout")
Python:
def drop_first_line(index, itr):
    if index == 0:
        return iter(list(itr)[1:])
    else:
        return itr

def separate_cols(line):
    line = line.replace('true', '1').replace('false', '0')
    vals = line.split(',')
    for (i, x) in enumerate(vals):
        yield (i, x)

input = (sc
    .textFile('train.csv', minPartitions=6)
    .mapPartitionsWithIndex(drop_first_line))

pairs = input.flatMap(separate_cols)

result = (pairs
    .groupByKey()
    .map(lambda kv: "{0},{1}".format(kv[0], ",".join(kv[1]))))

result.saveAsTextFile("pythonout")
Results
In local[6] mode (Intel(R) Xeon(R) CPU E3-1245 V2 @ 3.40GHz) with 4GB memory per executor it takes (n = 3):
Scala - mean: 250.00s, stdev: 12.49
Python - mean: 246.66s, stdev: 1.15
I am pretty sure that most of that time is spent on shuffling, serializing, deserializing and other secondary tasks. Just for fun, here's naive single-threaded code in Python that performs the same task on this machine in less than a minute:
def go():
    with open("train.csv") as fr:
        lines = [
            line.replace('true', '1').replace('false', '0').split(",")
            for line in fr]
    return zip(*lines[1:])

Efficiency/scalability of parallel collections in Scala (graphs)

So I've been working with parallel collections in Scala for a graph project I'm working on. I've got the basics of the graph class defined; it currently uses a scala.collection.mutable.HashMap where the key is Int and the value is ListBuffer[Int] (an adjacency list). (EDIT: this has since been changed to ArrayBuffer[Int].)
I had done a similar thing a few months ago in C++, with a std::vector<std::vector<int> >.
What I'm trying to do now is run a metric between all pairs of vertices in the graph, so in C++ I did something like this:
// myVec = std::vector<int> of vertices
for (std::vector<int>::iterator iter = myVec.begin(); iter != myVec.end(); ++iter) {
  for (std::vector<int>::iterator iter2 = myVec.begin();
       iter2 != myVec.end(); ++iter2) {
    /* Run algorithm between *iter and *iter2 */
  }
}
I did the same thing in Scala, parallelized, (or tried to) by doing this:
// vertexList is a List[Int] (NOW CHANGED TO Array[Int] - see below)
vertexList.par.foreach(u =>
  vertexList.foreach(v =>
    /* Run algorithm between u and v */
  )
)
The C++ version is clearly single-threaded, the Scala version has .par so it's using parallel collections and is multi-threaded on 8 cores (same machine). However, the C++ version processed 305,570 pairs in a span of roughly 3 days, whereas the Scala version so far has only processed 23,573 in 17 hours.
Assuming I did my math correctly, the single-threaded C++ version is roughly 3x faster than the Scala version. Is Scala really that much slower than C++, or am I completely mis-using Scala (I only recently started - I'm about 300 pages into Programming in Scala)?
Thanks!
-kstruct
EDIT: To use a while loop, do I do something like this?
// Where vertexList is an Array[Int]
vertexList.par.foreach(u =>
  while (i <- 0 until vertexList.length) {
    /* Run algorithm between u and vertexList(i) */
  }
}
If you guys mean use a while loop for the entire thing, is there an equivalent of .par.foreach for whiles?
EDIT2 Wait a second, that code isn't even right - my bad. How would I parallelize this using while loops? If I have some var i that keeps track of the iteration, then wouldn't all threads be sharing that i?
From your comments, I see that you're updating a shared mutable HashMap at the end of each algorithm run. And if you're randomizing your walks, a shared Random is also a contention point.
I recommend two changes:
Use .map and .flatMap to return an immutable collection instead of modifying a shared collection.
Use a ThreadLocalRandom (from either Akka or Java 7) to reduce contention on the random number generator; see the sketch after this list.
Check the rest of your algorithm for further possible contention points.
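For the ThreadLocalRandom point, here is a minimal sketch of what that might look like inside a task (picking a random vertex index is just an illustrative assumption about your walk logic):
import java.util.concurrent.ThreadLocalRandom

// Each thread transparently gets its own generator, so parallel tasks never
// contend on a shared Random instance.
val next = vertexList(ThreadLocalRandom.current().nextInt(vertexList.length))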
You may try running the inner loop in parallel, too. But without knowing your algorithm, it's hard to know if that will help or hurt. Fortunately, running all combinations of parallel and sequential collections is very simple; just switch out pVertexList and vertexList in the code below.
Something like this:
val pVertexList = vertexList.par
val allResult = for {
  u <- pVertexList
  v <- pVertexList
} yield {
  /* Run algorithm between u and v */
  ((u -> v) -> result)
}
The value allResult will be a ParVector[((Int, Int), Int)]. You may call .toMap on it to convert that into a Map.
Why mutable? I don't think there's a good parallel mutable map in Scala 2.9.x -- particularly because just such a data structure was added to the upcoming Scala 2.10.
On the other hand... you have a List[Int]? Don't use that, use a Vector[Int]. Also, are you sure you aren't wasting time elsewhere, doing the conversions from your mutable maps and buffers into immutable lists? Scala data structures are different from C++'s, so you might well be incurring complexity problems elsewhere in the code.
Finally, I think dave might be onto something when he asks about contention. If you have contention, parallelism might well make things slower. How much faster/slower does it run if you do not make it parallel? If making it not parallel makes it faster, then you most likely do have contention issues.
I'm not completely sure about it, but I think foreach loops in foreach loops are rather slow, because lots of objects get created. See: http://scala-programming-language.1934581.n4.nabble.com/for-loop-vs-while-loop-performance-td1935856.html
Try rewriting it using a while loop.
Also, Lists are only efficient for head access; Arrays are probably faster.
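To make the while-loop suggestion concrete, and to address the EDIT2 concern about a shared counter, here is a rough sketch; the index variable is declared inside the closure, so each parallel task gets its own copy:
// Outer loop stays parallel; inner loop is a plain while with a task-local index.
vertexList.par.foreach { u =>
  var i = 0                       // local to this task, never shared across threads
  while (i < vertexList.length) {
    val v = vertexList(i)
    /* Run algorithm between u and v */
    i += 1
  }
}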

Writing a time function in Haskell

I’m new to Haskell and I’d like to be able to time the runtime of a given function call or snippet of code.
In Clojure I can use ‘time’:
user=> (time (apply * (range 2 10000)))
"Elapsed time: 289.795 msecs"
2846259680917054518906413212119868890148051...
In Scala, I can define the function myself:
scala> def time[T](code : => T) = {
| val t0 = System.nanoTime : Double
| val res = code
| val t1 = System.nanoTime : Double
| println("Elapsed time " + (t1 - t0) / 1000000.0 + " msecs")
| res
| }
time: [T](=> T)T
scala> time((1 to 10000).foldLeft(1:BigInt)(_*_))
Elapsed time 274.292224 msecs
res0: BigInt = 284625968091705451...
How can I write the equivalent of my Scala function or Clojure's ‘time’ in Haskell? The System.TimeIt module I've found on Hackage is not general enough, because it works only if an IO computation is being measured. So timeIt(4 + 4) wouldn't work, only timeIt(print $ 4 + 4), which gets annoying fast. Besides, I really want to see how Haskell handles the general case.
Thank you!
Please look at using the standard libraries for this:
Timing computations in Haskell
Criterion, possibly the best open source benchmarking/timing library in existence
About Criterion
Just use criterion.
A note on evaluation depth: laziness means you need to decide how much evaluation you want to have during your timing run. Typically you'll want to reduce your code to normal form. The NFData typeclass lets you do this via the rnf method. If evaluating to the outermost constructor is ok, use seq on your pure code to force its evaluation.
Haskell is lazily evaluated. If your expression doesn't have some side effect (as encoded in the IO monad or the like), then the program doesn't need to actually resolve the expression to a value, and so won't.
To get meaningful numbers out of this, you might try timing print 4 and print expr and take the difference, in order to remove the overhead of string formatting and IO.
Lazy means Lazy. Time is only relevant when inside a monad like IO.
Time has NO meaning in the expression "4 + 4" - or in any other mathematical equation. The answer simply IS. The "answer" to any other pure computation is already predetermined the instant that the computation is specified.
Unfortunately, this is the "answer" to your question. An answer that, in fact, existed before you even posed your question. It existed in 1998 when the language was finally defined. The fact that it took me a year to write this doesn't matter ;-)
OK, enough of that nonsense!!!! (But if the above is too annoying, then just forget about Haskell.)
If the Criterion package is too much pain, just write a test case and use +RTS to test it.
If you want to be really cool, create your own monad - one that times the execution of your algorithm and hands the result back tupled with the algorithm's return value. Good luck. We're all counting on you!