Difference between reduce and foldLeft/fold in functional programming (particularly Scala and Scala APIs)? - scala

Why do Scala and frameworks like Spark and Scalding have both reduce and foldLeft? So then what's the difference between reduce and fold?

reduce vs foldLeft
A big difference, not clearly mentioned in any other Stack Overflow answer on this topic, is that reduce should be given a commutative monoid, i.e. an operation that is both commutative and associative. This means the operation can be parallelized.
This distinction is very important for Big Data / MPP / distributed computing, and the entire reason why reduce even exists. The collection can be chopped up and the reduce can operate on each chunk, then the reduce can operate on the results of each chunk - in fact the level of chunking need not stop one level deep. We could chop up each chunk too. This is why summing integers in a list is O(log N) if given an infinite number of CPUs.
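As a small illustration of the chunk-then-combine idea (a sketch, not how any particular framework implements it):
val xs = (1 to 1000).toVector
val chunks = xs.grouped(100).toVector          // chop the collection into chunks
val chunkSums = chunks.map(_.reduce(_ + _))    // reduce each chunk (these could run in parallel)
val total = chunkSums.reduce(_ + _)            // reduce the per-chunk results
assert(total == xs.sum)                        // 500500, same as a single sequential pass
Associativity is what lets each chunk be reduced independently; commutativity is what lets the chunk results be combined in any order.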
If you just look at the signatures there is no reason for reduce to exist, because you can achieve everything you can with reduce with a foldLeft. The functionality of foldLeft is greater than the functionality of reduce.
But you cannot parallelize a foldLeft, so its runtime is always O(N) (even if you feed in a commutative monoid). This is because it's assumed the operation is not a commutative monoid and so the cumulated value will be computed by a series of sequential aggregations.
foldLeft does not assume commutativity nor associativity. It's associativity that gives the ability to chop up the collection, and it's commutativity that makes cumulating easy because order is not important (so it doesn't matter in which order you aggregate the results from each of the chunks). Strictly speaking, commutativity is not necessary for parallelization (consider distributed sorting algorithms, for example); it just makes the logic easier because you don't need to give your chunks an ordering.
If you have a look at the Spark documentation for reduce it specifically says "... commutative and associative binary operator"
http://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.rdd.RDD
Here is proof that reduce is NOT just a special case of foldLeft
scala> import scala.collection.parallel.immutable.ParSeq
scala> val intParList: ParSeq[Int] = (1 to 100000).map(_ => scala.util.Random.nextInt()).par
scala> timeMany(1000, intParList.reduce(_ + _))
Took 462.395867 milli seconds
scala> timeMany(1000, intParList.foldLeft(0)(_ + _))
Took 2589.363031 milli seconds
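(timeMany is not shown in the question; a minimal sketch of such a helper, assuming it simply runs the expression the given number of times and reports the total wall-clock time, could be:)
// Hypothetical timing helper assumed by the REPL session above:
// run `body` n times and print the total elapsed time in milliseconds.
def timeMany[A](n: Int, body: => A): Unit = {
  val start = System.nanoTime()
  (1 to n).foreach(_ => body)
  println(s"Took ${(System.nanoTime() - start) / 1e6} milli seconds")
}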
reduce vs fold
Now this is where it gets a little closer to the FP / mathematical roots, and a little trickier to explain. Reduce is defined formally as part of the MapReduce paradigm, which deals with orderless collections (multisets); fold is formally defined in terms of recursion (see catamorphism) and thus assumes a structure / sequence to the collections.
There is no fold method in Scalding because under the (strict) MapReduce programming model we cannot define fold: chunks do not have an ordering, and fold only requires associativity, not commutativity.
Put simply, reduce works without an order of cumulation, while fold requires an order of cumulation, and it is that order of cumulation, which necessitates a zero value, NOT the mere existence of the zero value, that distinguishes them. Strictly speaking reduce should work on an empty collection, because its zero value can be deduced by taking an arbitrary value x and then solving x op y = x, but that doesn't work with a non-commutative operation, as there can exist left and right identity values that are distinct (i.e. x op y != y op x). Of course Scala doesn't bother to work out what this zero value is, as that would require doing some mathematics (which is probably uncomputable), so it just throws an exception.
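For example, on the standard library collections:
// fold has an explicit zero value, so it handles the empty case;
// reduce has no zero to fall back on and throws instead.
List.empty[Int].fold(0)(_ + _)    // 0
List.empty[Int].reduce(_ + _)     // throws java.lang.UnsupportedOperationException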
It seems (as is often the case in etymology) that this original mathematical meaning has been lost, since the only obvious difference in programming is the signature. The result is that reduce has become a synonym for fold, rather than preserving its original meaning from MapReduce. Now these terms are often used interchangeably and behave the same in most implementations (ignoring empty collections). This weirdness is exacerbated by peculiarities, like the one in Spark, that we shall now address.
So Spark does have a fold, but the order in which sub-results (one for each partition) are combined is (at the time of writing) the same order in which tasks complete, and is thus non-deterministic. Thanks to #CafeFeed for pointing out that fold uses runJob; after reading through the code I realised that it is indeed non-deterministic. Further confusion is created by Spark having a treeReduce but no treeFold.
Conclusion
There is a difference between reduce and fold even when applied to non-empty sequences. The former is defined as part of the MapReduce programming paradigm on collections with arbitrary order (http://theory.stanford.edu/~sergei/papers/soda10-mrc.pdf), and one ought to assume operators are commutative in addition to being associative to get deterministic results. The latter is defined in terms of catamorphisms and requires that the collections have a notion of sequence (or are defined recursively, like linked lists), and thus does not require commutative operators.
In practice due to the unmathematical nature of programming, reduce and fold tend to behave in the same way, either correctly (like in Scala) or incorrectly (like in Spark).
Extra: My Opinion On the Spark API
My opinion is that confusion would be avoided if use of the term fold was completely dropped in Spark. At least Spark does have a note in its documentation:
This behaves somewhat differently from fold operations implemented for
non-distributed collections in functional languages like Scala.

If I am not mistaken, even though the Spark API does not require it, fold also requires f to be commutative, because the order in which the partitions are aggregated is not guaranteed.
For example in the following code only the first print out is sorted:
import org.apache.spark.{SparkConf, SparkContext}

object FoldExample extends App {

  val conf = new SparkConf()
    .setMaster("local[*]")
    .setAppName("Simple Application")
  implicit val sc = new SparkContext(conf)

  val range = ('a' to 'z').map(_.toString)
  val rdd = sc.parallelize(range)

  println(range.reduce(_ + _))
  println(rdd.reduce(_ + _))
  println(rdd.fold("")(_ + _))
}
Print out:
abcdefghijklmnopqrstuvwxyz
abcghituvjklmwxyzqrsdefnop
defghinopjklmqrstuvabcwxyz

fold in Apache Spark is not the same as fold on non-distributed collections. In fact it requires a commutative function to produce deterministic results:
This behaves somewhat differently from fold operations implemented for non-distributed
collections in functional languages like Scala. This fold operation may be applied to
partitions individually, and then fold those results into the final result, rather than
apply the fold to each element sequentially in some defined ordering. For functions
that are not commutative, the result may differ from that of a fold applied to a
non-distributed collection.
This has been shown by Mishael Rosenthal and suggested by Make42 in his comment.
It's been suggested that the observed behavior is related to HashPartitioner, when in fact parallelize doesn't shuffle and doesn't use HashPartitioner.
import org.apache.spark.sql.SparkSession

/* Note: standalone (non-local) mode */
val master = "spark://...:7077"

val spark = SparkSession.builder.master(master).getOrCreate()
val sc = spark.sparkContext

/* Note: deterministic order */
val rdd = sc.parallelize(Seq("a", "b", "c", "d"), 4).sortBy(identity[String])
require(rdd.collect.sliding(2).forall { case Array(x, y) => x < y })

/* Note: all possible permutations */
require(Seq.fill(1000)(rdd.fold("")(_ + _)).toSet.size == 24)
Explained:
Structure of fold for RDD
def fold(zeroValue: T)(op: (T, T) => T): T = withScope {
  var jobResult: T
  val cleanOp: (T, T) => T
  val foldPartition: Iterator[T] => T
  val mergeResult: (Int, T) => Unit
  sc.runJob(this, foldPartition, mergeResult)
  jobResult
}
is the same as structure of reduce for RDD:
def reduce(f: (T, T) => T): T = withScope {
  val cleanF: (T, T) => T
  val reducePartition: Iterator[T] => Option[T]
  var jobResult: Option[T]
  val mergeResult: (Int, Option[T]) => Unit
  sc.runJob(this, reducePartition, mergeResult)
  jobResult.getOrElse(throw new UnsupportedOperationException("empty collection"))
}
where runJob is performed with disregard for partition order, resulting in the need for a commutative function.
foldPartition and reducePartition are equivalent in terms of order of processing and effectively (by inheritance and delegation) implemented by reduceLeft and foldLeft on TraversableOnce.
Conclusion: fold on RDD cannot depend on order of chunks and needs commutativity and associativity.
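If a deterministic, order-dependent fold is genuinely needed, one workaround (a sketch, and only viable when the data fits on the driver) is to bring the data back and fold locally:
// Sidestep the non-deterministic partition merge by folding on the driver.
val ordered = rdd.sortBy(identity[String]).collect()   // Array("a", "b", "c", "d")
val deterministic = ordered.foldLeft("")(_ + _)        // always "abcd"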

One other difference for Scalding is the use of combiners in Hadoop.
Imagine your operation is a commutative monoid; with reduce it will be applied on the map side as well, instead of shuffling/sorting all data to the reducers. With foldLeft this is not the case.
pipe.groupBy('product) {
  _.reduce('price -> 'total) { (sum: Double, price: Double) => sum + price }
  // reduce is .mapReduceMap in disguise
}

pipe.groupBy('product) {
  _.foldLeft('price -> 'total)(0.0) { (sum: Double, price: Double) => sum + price }
}
It is always good practice to define your operations as a monoid in Scalding.
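The monoid contract itself is small: a zero element and an associative plus. A hand-rolled sketch of the idea (not the actual Scalding/Algebird trait, just an illustration):
// Minimal monoid sketch: zero is the identity element, plus must be associative.
trait MyMonoid[T] {
  def zero: T
  def plus(a: T, b: T): T
}

// Summing prices is a commutative monoid, so it can be combined map-side.
object PriceSum extends MyMonoid[Double] {
  val zero = 0.0
  def plus(a: Double, b: Double): Double = a + b
}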

Related

Behavior of scala fold in parallel collections

Let's run the following line of code several times:
Set(1,2,3,4,5,6,7).par.fold(0)(_ - _)
The results are quite interesting:
scala> Set(1,2,3,4,5,6,7).par.fold(0)(_ - _)
res10: Int = 8
scala> Set(1,2,3,4,5,6,7).par.fold(0)(_ - _)
res11: Int = 20
However, clearly it should behave like its sequential version:
scala> Set(1,2,3,4,5,6,7).fold(0)(_ - _)
res15: Int = -28
I understand that operation - is non-associative on integers and that's the reason behind such behavior, but my question is quite simple: doesn't it mean that fold should not be parallelized in .par implementation of collections?
When you look at the standard library documentation, you see that fold is nondeterministic here:
Folds the elements of this sequence using the specified associative binary operator.
The order in which operations are performed on elements is unspecified and may be nondeterministic.
As an alternative, there's foldLeft:
Applies a binary operator to a start value and all elements of this collection or iterator, going left to right.
Note: might return different results for different runs, unless the underlying collection type is ordered or the operator is associative and commutative.
As Set is not an ordered collection, there's no canonical order in which the elements could be folded, so the standard library allows itself to be nondeterministic even for foldLeft. If you used an ordered sequence here, foldLeft would be deterministic.
The scaladoc does say:
The order in which the elements are reduced is unspecified and may be nondeterministic.
So, as you stated, a binary operation applied in ParSet#fold that is not associative is not guaranteed to produce a deterministic result. The above warning is all you get.
Does that mean ParSet#fold (and cousins) should not be parallelized? Not exactly. If your binary operation is commutative and you don't care about non-determinism of side-effects (not that a fold should have any), then there isn't a problem. However, you are hit with the caveat of needing to tread carefully around parallel collections.
Whether or not it is correct is more a matter of opinion. One could argue that if a method can result in accidental non-determinism, then it should not exist in a language or library. But the alternative is to clip out functionality, so that ParSet would be missing functionality that is present in most of the other collection implementations. You could use the same line of thinking to suggest removing Stream#foreach to prevent people from accidentally triggering infinite loops on infinite streams, but should you?
It is useful to parallelize fold operation with high workloads, however, to guarantee a deterministic output from calling of collection.par.fold(z)(f), the following conditions must hold:
1. f(f(a, b), c) == f(a, f(b, c)) // Associativity
2. f(z, a) == f(a, z) == a, where z is the neutral element for f (like 0 for sum, and 1 for multiplication).
Fabian's answer suggests using foldLeft instead. Although this is deterministic, using .par with it won't really parallelize anything, because foldLeft is sequential by nature.
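As a quick sanity check (a sketch; on Scala 2.12 or earlier .par needs no extra module), addition satisfies both conditions, so the parallel fold is deterministic:
// + is associative and commutative, and 0 is its identity,
// so the parallel fold gives the same result on every run.
val xs = Set(1, 2, 3, 4, 5, 6, 7)
assert(Seq.fill(100)(xs.par.fold(0)(_ + _)).toSet == Set(28))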

Scala: Is there any reason to prefer `filter+map` over `collect`?

Can there be any reason to prefer filter+map:
list.filter (i => aCondition(i)).map(i => fun(i))
over collect? :
list.collect(case i if aCondition(i) => fun(i))
The one with collect (a single pass) looks faster and cleaner to me. So I would always go for collect.
Most of Scala's collections eagerly apply operations and (unless you're using a macro library that does this for you) will not fuse operations. So filter followed by map will usually create two collections (and even if you use Iterator or somesuch, the intermediate form will be transiently created, albeit only an element at a time), whereas collect will not.
On the other hand, collect uses a partial function to implement the joint test, and partial functions are slower than predicates (A => Boolean) at testing whether something is in the collection.
Additionally, there can be cases where it is simply clearer to read one than the other and you don't care about performance or memory usage differences of a factor of 2 or so. In that case, use whichever is clearer. Generally if you already have the functions named, it's clearer to read
xs.filter(p).map(f)
xs.collect{ case x if p(x) => f(x) }
but if you are supplying the closures inline, collect generally looks cleaner
xs.filter(x => foo(x, x)).map(x => bar(x, x))
xs.collect{ case x if foo(x, x) => bar(x, x) }
even though it's not necessarily shorter, because you only refer to the variable once.
Now, how big is the difference in performance? That varies, but if we consider a collection like this:
val v = Vector.tabulate(10000)(i => ((i%100).toString, (i%7).toString))
and you want to pick out the second entry based on filtering the first (so the filter and map operations are both really easy), then we get the following table.
Note: one can get lazy views into collections and perform the operations there. You don't always get your original collection type back, but you can always use to (e.g. toVector) to convert to the right collection type; so xs.view.filter(p).map(f).toVector would, because of the view, not create an intermediate collection. That is tested below also. It has also been suggested that one can write xs.flatMap(x => if (p(x)) Some(f(x)) else None) and that this is efficient. That is not so; it's also tested below. And one can avoid the partial function by explicitly creating a builder: val vb = Vector.newBuilder[String]; xs.foreach(x => if (p(x)) vb += f(x)); vb.result, and the results for that are also listed below.
In the table below, three conditions have been tested: filter out nothing, filter out half, filter out everything. The times have been normalized to filter/map (100% = same time as filter/map, lower is better). Error bounds are around +- 3%.
Performance of different filter/map alternatives
====================== Vector ========================
filter/map   collect   view filt/map   flatMap   builder
   100%        44%          64%         440%       30%      filter out none
   100%        60%          76%         605%       42%      filter out half
   100%       112%         103%        1300%       74%      filter out all
Thus, filter/map and collect are generally pretty close (with collect winning when you keep a lot), flatMap is far slower under all situations, and creating a builder always wins. (This is true specifically for Vector. Other collections may have somewhat different characteristics, but the trends for most will be similar because the differences in operations are similar.) Views in this test tend to be a win, but they don't always work seamlessly (and they aren't really better than collect except for the empty case).
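For reference, the alternatives benchmarked above can be spelled out concretely (a sketch; the predicate and mapping function here are just stand-ins):
val v = Vector.tabulate(10000)(i => ((i % 100).toString, (i % 7).toString))
def p(x: (String, String)) = x._1.length == 1              // stand-in predicate
def f(x: (String, String)) = x._2                          // stand-in mapping

val a = v.filter(p).map(f)                                 // two passes, one intermediate vector
val b = v.collect { case x if p(x) => f(x) }               // one pass via a partial function
val c = v.view.filter(p).map(f).toVector                   // lazy view, no intermediate
val d = v.flatMap(x => if (p(x)) Some(f(x)) else None)     // allocates an Option per element
val e = {                                                  // explicit builder
  val vb = Vector.newBuilder[String]
  v.foreach(x => if (p(x)) vb += f(x))
  vb.result()
}
assert(a == b && b == c && c == d && d == e)               // all produce the same vector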
So, bottom line: prefer filter then map if it aids clarity when speed doesn't matter, or prefer it for speed when you're filtering out almost everything but still want to keep things functional (so don't want to use a builder); and otherwise use collect.
I guess this is rather opinion based, but given the following definitions:
scala> val l = List(1,2,3,4)
l: List[Int] = List(1, 2, 3, 4)
scala> def f(x: Int) = 2*x
f: (x: Int)Int
scala> def p(x: Int) = x%2 == 0
p: (x: Int)Boolean
Which of the two do you find nicer to read:
l.filter(p).map(f)
or
l.collect{ case i if p(i) => f(i) }
(Note that I had to fix your syntax above, as you need braces and a case clause to add the if condition).
I personally find the filter+map much nicer to read and understand. It's all a matter of the exact syntax that you use, but given p and f, you don't have to write anonymous functions when using filter or map, while you do need them when using collect.
You also have the possibility to use flatMap:
l.flatMap(i => if(p(i)) Some(f(i)) else None)
Which is likely to be the most efficient among the 3 solutions, but I find it less nice than map and filter.
Overall, it's very difficult to say which one would be faster, as it depends a lot of which optimizations end up being performed by scalac and then the JVM. All 3 should be pretty close, and definitely not a factor in deciding which one to use.
One case where filter/flatMap looks cleaner than collect is when you want to flatten the mapped results:
def getList(x: Int) = {
  List.range(x, 0, -1)
}

val xs = List(1, 2, 3, 4)

// Using filter and flatMap
xs.filter(_ % 2 == 0).flatMap(getList)

// Using collect and flatten
xs.collect { case x if x % 2 == 0 => getList(x) }.flatten

Explanation of fold method of spark RDD

I am running Spark-1.4.0 pre-built for Hadoop-2.4 (in local mode) to calculate the sum of squares of a DoubleRDD. My Scala code looks like
sc.parallelize(Array(2., 3.)).fold(0.0)((p, v) => p+v*v)
And it gave a surprising result 97.0.
This is quite counter-intuitive compared to the Scala version of fold
Array(2., 3.).fold(0.0)((p, v) => p+v*v)
which gives the expected answer 13.0.
It seems quite likely that I have made some tricky mistake in the code due to a lack of understanding. I have read that the function used in RDD.fold() should be commutative, otherwise the result may depend on the partitioning, etc. For example, if I change the number of partitions to 1,
sc.parallelize(Array(2., 3.), 1).fold(0.0)((p, v) => p+v*v)
the code will give me 169.0 on my machine!
Can someone explain what exactly is happening here?
Well, it is actually pretty well explained by the official documentation:
Aggregate the elements of each partition, and then the results for all the partitions, using a given associative and commutative function and a neutral "zero value". The function op(t1, t2) is allowed to modify t1 and return it as its result value to avoid object allocation; however, it should not modify t2.
This behaves somewhat differently from fold operations implemented for non-distributed collections in functional languages like Scala. This fold operation may be applied to partitions individually, and then fold those results into the final result, rather than apply the fold to each element sequentially in some defined ordering. For functions that are not commutative, the result may differ from that of a fold applied to a non-distributed collection.
To illustrate, let's try to simulate what is going on step by step:
val rdd = sc.parallelize(Array(2., 3.))
val byPartition = rdd.mapPartitions(
  iter => Array(iter.fold(0.0)((p, v) => (p + v * v))).toIterator).collect()
It gives us something similar to Array[Double] = Array(0.0, 0.0, 0.0, 4.0, 0.0, 0.0, 0.0, 9.0), and
byPartition.reduce((p, v) => (p + v * v))
returns 97
An important thing to note is that the results can differ from run to run, depending on the order in which the partitions are combined.
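As an aside (a sketch), the intended sum of squares can be computed safely by squaring first, so that the combining operation is a true commutative monoid with 0.0 as its identity, or by using aggregate, which keeps the within-partition and merge operations separate:
// Square first, then combine with +, whose zero really is an identity.
sc.parallelize(Array(2.0, 3.0)).map(v => v * v).fold(0.0)(_ + _)            // 13.0

// Or use aggregate, which distinguishes the per-partition op from the merge op.
sc.parallelize(Array(2.0, 3.0)).aggregate(0.0)((p, v) => p + v * v, _ + _)  // 13.0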

why use foldLeft instead of procedural version?

So in reading this question it was pointed out that instead of the procedural code:
def expand(exp: String, replacements: Traversable[(String, String)]): String = {
  var result = exp
  for ((oldS, newS) <- replacements)
    result = result.replace(oldS, newS)
  result
}
You could write the following functional code:
def expand(exp: String, replacements: Traversable[(String, String)]): String = {
  replacements.foldLeft(exp) {
    case (result, (oldS, newS)) => result.replace(oldS, newS)
  }
}
I would almost certainly write the first version because coders familiar with either procedural or functional styles can easily read and understand it, while only coders familiar with functional style can easily read and understand the second version.
But setting readability aside for the moment, is there something that makes foldLeft a better choice than the procedural version? I might have thought it would be more efficient, but it turns out that the implementation of foldLeft is actually just the procedural code above. So is it just a style choice, or is there a good reason to use one version or the other?
Edit: Just to be clear, I'm not asking about other functions, just foldLeft. I'm perfectly happy with the use of foreach, map, filter, etc. which all map nicely onto for-comprehensions.
Answer: There are really two good answers here (provided by delnan and Dave Griffith) even though I could only accept one:
Use foldLeft because there are additional optimizations, e.g. using a while loop which will be faster than a for loop.
Use fold if it ever gets added to regular collections, because that will make the transition to parallel collections trivial.
It's shorter and clearer - yes, you need to know what a fold is to understand it, but when you're programming in a language that's 50% functional, you should know these basic building blocks anyway. A fold is exactly what the procedural code does (repeatedly applying an operation), but it's given a name and generalized. And while it's only a small wheel you're reinventing, it's still a reinvention.
And in case the implementation of foldLeft should ever get some special perk - say, extra optimizations - you get that for free, without updating countless methods.
Other than a distaste for mutable variables (even mutable locals), the basic reason to use fold in this case is clarity, with occasional brevity. Most of the wordiness of the fold version is because you have to use an explicit function definition with a destructuring bind. If each element in the list is used precisely once in the fold operation (a common case), this can be simplified to the short form. Thus the classic definition of the sum of a collection of numbers
collection.foldLeft(0)(_+_)
is much simpler and shorter than any equivalent imperative construct.
One additional meta-reason to use functional collection operations, although not directly applicable in this case, is to enable a move to using parallel collection operations if needed for performance. Fold can't be parallelized, but often fold operations can be turned into commutative-associative reduce operations, and those can be parallelized. With Scala 2.9, changing something from non-parallel functional to parallel functional utilizing multiple processing cores can sometimes be as easy as dropping a .par onto the collection you want to execute parallel operations on.
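For example (a sketch; parallel collections are bundled with Scala up to 2.12 and live in a separate module afterwards):
// + is associative and commutative, so the parallel reduce is safe to use.
val xs = (1 to 10000).toVector
assert(xs.par.reduce(_ + _) == xs.reduce(_ + _))   // both are 50005000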
One word I haven't seen mentioned here yet is declarative:
Declarative programming is often defined as any style of programming that is not imperative. A number of other common definitions exist that attempt to give the term a definition other than simply contrasting it with imperative programming. For example:
A program that describes what computation should be performed and not how to compute it
Any programming language that lacks side effects (or more specifically, is referentially transparent)
A language with a clear correspondence to mathematical logic.
These definitions overlap substantially.
Higher-order functions (HOFs) are a key enabler of declarativity, since we only specify the what (e.g. "using this collection of values, multiply each value by 2, sum the result") and not the how (e.g. initialize an accumulator, iterate with a for loop, extract values from the collection, add to the accumulator...).
Compare the following:
// Sugar-free Scala (Still better than Java<5)
def sumDoubled1(xs: List[Int]) = {
  var sum = 0                    // Initialized correctly?
  for (i <- 0 until xs.size) {   // Fenceposts?
    sum = sum + (xs(i) * 2)      // Correct value being extracted?
                                 // Value extraction and +/* smashed together
  }
  sum                            // Correct value returned?
}

// Iteration sugar (similar to Java 5)
def sumDoubled2(xs: List[Int]) = {
  var sum = 0
  for (x <- xs)            // We don't need to worry about fenceposts or
    sum = sum + (x * 2)    // value extraction anymore; that's progress
  sum
}

// Verbose Scala
def sumDoubled3(xs: List[Int]) = xs.map((x: Int) => x*2).                 // the doubling
                                    reduceLeft((x: Int, y: Int) => x+y)   // the addition

// Idiomatic Scala
def sumDoubled4(xs: List[Int]) = xs.map(_*2).reduceLeft(_+_)
//                                      ^ the doubling  ^
//                                                       \ the addition
Note that our first example, sumDoubled1, is already more declarative than (most would say superior to) C/C++/Java<5 for loops, because we haven't had to micromanage the iteration state and termination logic, but we're still vulnerable to off-by-one errors.
Next, in sumDoubled2, we're basically at the level of Java>=5. There are still a couple things that can go wrong, but we're getting pretty good at reading this code-shape, so errors are quite unlikely. However, don't forget that a pattern that's trivial in a toy example isn't always so readable when scaled up to production code!
With sumDoubled3, desugared for didactic purposes, and sumDoubled4, the idiomatic Scala version, the iteration, initialization, value extraction and choice of return value are all gone.
Sure, it takes time to learn to read the functional versions, but we've drastically foreclosed our options for making mistakes. The "business logic" is clearly marked, and the plumbing is chosen from the same menu that everyone else is reading from.
It is worth pointing out that there is another way of calling foldLeft which takes advantage of:
The ability to use (almost) any Unicode symbol in an identifier
The feature that if a method name ends with a colon :, and is called infix, then the target and parameter are switched
For me this version is much clearer, because I can see that I am folding the expr value into the replacements collection
def expand(expr: String, replacements: Traversable[(String, String)]): String = {
  (expr /: replacements) { case (r, (o, n)) => r.replace(o, n) }
}

Use-cases for Streams in Scala

In Scala there is a Stream class that is very much like an iterator. The topic Difference between Iterator and Stream in Scala? offers some insights into the similarities and differences between the two.
Seeing how to use a stream is pretty simple but I don't have very many common use-cases where I would use a stream instead of other artifacts.
The ideas I have right now:
If you need to make use of an infinite series. But this does not seem like a common use-case to me so it doesn't fit my criteria. (Please correct me if it is common and I just have a blind spot)
If you have a series of data where each element needs to be computed but that you may want to reuse several times. This is weak because I could just load it into a list which is conceptually easier to follow for a large subset of the developer population.
Perhaps there is a large set of data or a computationally expensive series and there is a high probability that the items you need will not require visiting all of the elements. But in this case an Iterator would be a good match unless you need to do several searches, in that case you could use a list as well even if it would be slightly less efficient.
There is a complex series of data that needs to be reused. Again a list could be used here. Although in this case both cases would be equally difficult to use and a Stream would be a better fit since not all elements need to be loaded. But again not that common... or is it?
So have I missed any big uses? Or is it a developer preference for the most part?
Thanks
The main difference between a Stream and an Iterator is that the latter is mutable and "one-shot", so to speak, while the former is not. Iterator has a better memory footprint than Stream, but the fact that it is mutable can be inconvenient.
Take this classic prime number generator, for instance:
def primeStream(s: Stream[Int]): Stream[Int] =
  Stream.cons(s.head, primeStream(s.tail filter { _ % s.head != 0 }))
val primes = primeStream(Stream.from(2))
It can easily be written with an Iterator as well, but an Iterator won't keep the primes computed so far.
So, one important aspect of a Stream is that you can pass it to other functions without having it duplicated first, or having to generate it again and again.
As for expensive computations/infinite lists, these things can be done with Iterator as well. Infinite lists are actually quite useful; you just don't realize it because you've never had them, so you've seen algorithms that are more complex than strictly necessary, just to deal with enforced finite sizes.
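To make the one-shot vs. memoized contrast concrete (a small sketch; Stream was renamed LazyList in Scala 2.13):
// An Iterator is mutable and single-pass; reading it consumes it.
val it = Iterator.from(2)
(it.next(), it.next(), it.next())   // (2, 3, 4): those elements are gone now

// A Stream memoizes what it has already computed, so it can be re-read.
val st = Stream.from(2)
(st(0), st(1), st(2))               // (2, 3, 4)
(st(0), st(1), st(2))               // (2, 3, 4) again, served from the memoized prefix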
In addition to Daniel's answer, keep in mind that Stream is useful for short-circuiting evaluations. For example, suppose I have a huge set of functions that take String and return Option[String], and I want to keep executing them until one of them works:
val stringOps = List(
  (s: String) => if (s.length > 10) Some(s.length.toString) else None,
  (s: String) => if (s.length == 0) Some("empty") else None,
  (s: String) => if (s.indexOf(" ") >= 0) Some(s.trim) else None
);
Well, I certainly don't want to execute the entire list, and there isn't any handy method on List that says, "treat these as functions and execute them until one of them returns something other than None". What to do? Perhaps this:
def transform(input: String, ops: List[String => Option[String]]) = {
  ops.toStream.map(_(input)).find(_.isDefined).getOrElse(None)
}
This takes a list and treats it as a Stream (which doesn't actually evaluate anything), then defines a new Stream that is the result of applying the functions (but that doesn't evaluate anything yet either). It then searches for the first element which is defined; here, magically, the Stream looks back and realizes it has to apply the map and get the right data from the original list. Finally it unwraps the result from Option[Option[String]] to Option[String] using getOrElse.
Here's an example:
scala> transform("This is a really long string",stringOps)
res0: Option[String] = Some(28)
scala> transform("",stringOps)
res1: Option[String] = Some(empty)
scala> transform(" hi ",stringOps)
res2: Option[String] = Some(hi)
scala> transform("no-match",stringOps)
res3: Option[String] = None
But does it work? If we put a println into our functions so we can tell if they're called, we get
val stringOps = List(
  (s: String) => { println("1"); if (s.length > 10) Some(s.length.toString) else None },
  (s: String) => { println("2"); if (s.length == 0) Some("empty") else None },
  (s: String) => { println("3"); if (s.indexOf(" ") >= 0) Some(s.trim) else None }
);
// (transform is the same)
scala> transform("This is a really long string",stringOps)
1
res0: Option[String] = Some(28)
scala> transform("no-match",stringOps)
1
2
3
res1: Option[String] = None
(This is with Scala 2.8; 2.7's implementation will sometimes overshoot by one, unfortunately. And note that you do accumulate a long list of None as your failures accrue, but presumably this is inexpensive compared to your true computation here.)
I could imagine that if you poll some device in real time, a Stream is more convenient.
Think of a GPS tracker which returns the current position when you ask it. You can't precompute the location where you will be in 5 minutes. You might use it for a few minutes only, to update a path in OpenStreetMap, or you might use it for an expedition over six months in a desert or the rain forest.
Or a digital thermometer or other kinds of sensors which repeatedly return new data as long as the hardware is alive and turned on; a log file filter could be another example.
Stream is to Iterator as immutable.List is to mutable.List. Favouring immutability prevents a class of bugs, occasionally at the cost of performance.
scalac itself isn't immune to these problems: http://article.gmane.org/gmane.comp.lang.scala.internals/2831
As Daniel points out, favouring laziness over strictness can simplify algorithms and make it easier to compose them.