Scala/functional way of doing things - scala

I am using scala to write up a spark application that reads data from csv files using dataframes (none of these details matter really, my question can be answered by anyone who is good at functional programming)
I'm used to sequential programming and its taking a while to think of things in the functional way.
I basically want to read to columns (a,b) from a csv file and keep track of those rows where b < 0.
I implemented this but its pretty much how I would do it Java and I would like to utilize Scala's features instead:
val ValueDF = fileDataFrame.select("colA", "colB")
val ValueArr = ValueDF.collect()
for ( index <- 0 until (ValueArr.length)){
var row = ValueArr(index)
var A = row(0).toString()
var B = row(1).toString().toDouble
if (B < 0){
//write A and B somewhere
}
}
Converting the dataframe to an array defeats the purpose of distributed computation.
So how could I possibly get the same results but instead of forming an array and traversing through it, I would rather want to perform some transformations of the data frame itself (such as map/filter/flatmap etc).
I should get going soon hopefully, just need some examples to wrap my head around it.

You are doing basically a filtering operation (ignore if not (B < 0)) and mapping (from each row, get A and B / do something with A and B).
You could write it like this:
val valueDF = fileDataFrame.select("colA", "colB")
val valueArr = valueDF.collect()
val result = valueArr.filter(_(1).toString().toDouble < 0).map{row => (row(0).toString(), row(1).toString().toDouble)}
// do something with result
You also can do first the mapping and then the filtering:
val result = valueArr.map{row => (row(0).toString(), row(1).toString().toDouble)}.filter(_._2 < 0)
Scala also offers more convenient versions for this kind of operations (thanks Sascha Kolberg), called withFilter and collect. withFilter has the advantage over filter that it doesn't create a new collection, saving you one pass, see this answer for more details. With collect you also map and filter in one pass, passing a partial function which allows to do pattern matching, see e.g. this answer.
In your case collect would look like this:
val valueDF = fileDataFrame.select("colA", "colB")
val valueArr = valueDF.collect()
val result = valueArr.collect{
case row if row(1).toString().toDouble < 0) => (row(0).toString(), row(1).toString().toDouble)
}
// do something with result
(I think there's a more elegant way to express this but that's left as an exercise ;))
Also, there's a lightweight notation called "sequence comprehensions". With this you could write:
val result = for (row <- valueArr if row(1).toString().toDouble < 0) yield (row(0).toString(), row(1).toString().toDouble)
Or a more flexible variant:
val result = for (row <- valueArr) yield {
val double = row(1).toString().toDouble
if (double < 0) {
(row(0).toString(), double)
}
}
Alternatively, you can use foldLeft:
val valueDF = fileDataFrame.select("colA", "colB")
val valueArr = valueDF.collect()
val result = valueArr.foldLeft(Seq[(String, Double)]()) {(s, row) =>
val a = row(0).toString()
val b = row(1).toString().toDouble
if (b < 0){
s :+ (a, b) // append tuple with A and B to results sequence
} else {
s // let results sequence unmodified
}
}
// do something with result
All of these are considered functional... which one you prefer is for the most part a matter of taste. The first 2 examples (filter/map, map/filter) do have a performance disadvantage compared to the rest because they iterate through the sequence twice.
Note that in FP it's very important to minimize side effects / isolate them from the main logic. I/O ("write A and B somewhere") is a side effect. So you normally will write your functions such that they don't have side effects - just input -> output logic without affecting or retrieving data from the surroundings. Once you have a final result, you can do side effects. In this concrete case, once you have result (which is a sequence of A and B tuples), you can loop through it and print it. This way you can for example change easily the way to print (you may want to print to the console, send to a remote place, etc.) without touching the main logic.
Also you should prefer immutable values (val) wherever possible, which is safer. Even in your loop, row, A and B are not modified so there's no reason to use var.
(Btw, I corrected the values names to start with lower case, see conventions).

Related

Scala: IndexedSeq.newBuilder vs. ArrayBuffer

I had accepted that building an IndexedSeq in a loop should use an ArrayBuffer, followed by a conversion to a Vector via ".toVector()".
In an example profiled showed the CPU hotspot was in this section, and so I tried an alternative: use IndexedSeq.newBuilder() followed by conversion to immutable via ".result()".
This change gave a significance performance improvement. The code looks almost the same. So it seems using IndexedSeq.newBuilder() is best practice. Is this correct? The example method is shown below, with the ArrayBuffer difference commented out.
def interleave[T](a: IndexedSeq[T], b: IndexedSeq[T]): IndexedSeq[T] = {
val al = a.length
val bl = b.length
val buffer = IndexedSeq.newBuilder[T]
//---> val buffer = new ArrayBuffer[T](al + bl)
val commonLength = Math.min(al, bl)
val aExtra = al - commonLength
val bExtra = bl - commonLength
var i = 0
while (i < commonLength) {
buffer += a(i)
buffer += b(i)
i += 1
}
if (aExtra > 0) {
while (i < al) {
buffer += a(i)
i += 1
}
} else if (bExtra > 0) {
while (i < bl) {
buffer += b(i)
i += 1
}
}
buffer.result()
//---> buffer.toVector()
}
As to which is best practice, I guess it depends upon your requirements. Both approaches are acceptable and understandable. All things being equal, in this particular case, I would favor the IndexedSeq.newBuilder over ArrayBuilder (since the latter targets the creation of an Array, while the former's result is a Vector).
Just one point on benchmarking: this is a real art form, due to caching, JIT & HotSpot performance, garbage collection, etc. One piece of software you might consider using to do this is ScalaMeter. You will need to write both versions of the function to populate the final vector, and ScalaMeter will give you accurate statistics on both. ScalaMeter allows the code to warm-up before taking measurements, and can also look at memory requirements as well as CPU time.
In this example informal testing did not deceive, but ScalaMeter does provide a clearer picture of performance. Building the result in an ArrayBuffer (top orange line) is definitely slower than the more direct newBuilder (blue line).
Returning the ArrayBuffer as an IndexedSeq is the fastest (green line), but of course it does not give you the true protection of an immutable collection.
Building the intermediate result in an Array (red line) is intermediate between ArrayBuffer and newBuilder.
The "zipAll" collection method allows the interleave to be done in a more functional style:
def interleaveZipAllBuilderPat[T](a: IndexedSeq[T], b: IndexedSeq[T]): IndexedSeq[T] = {
a.zipAll(b, null, null).foldLeft(Vector.newBuilder[T]) { (z, tp) =>
tp match {
case ((x:T, null)) => z += x
case ((x:T,y:T)) => z += x += y
}
}.result()
}
The slowest are the functional method, with the top two almost the same and these differ only in that one does the pattern match, and the other an if statement, so the pattern is not slow.
Functional is marginally worse than the direct loop method if an ArrayBuffer is used to accumulate the result, but the direct loop using the newBuilder is significantly faster.
If "zipAll" could return a builder, and if the builder were iterable, the functional style could be faster - no need to produce the immutable result if the next step just requires an iteration over elements.
So for me newBuilder is the clear winner.

How to write an efficient groupBy-size filter in Scala, can be approximate

Given a List[Int] in Scala, I wish to get the Set[Int] of all Ints which appear at least thresh times. I can do this using groupBy or foldLeft, then filter. For example:
val thresh = 3
val myList = List(1,2,3,2,1,4,3,2,1)
myList.foldLeft(Map[Int,Int]()){case(m, i) => m + (i -> (m.getOrElse(i, 0) + 1))}.filter(_._2 >= thresh).keys
will give Set(1,2).
Now suppose the List[Int] is very large. How large it's hard to say but in any case this seems wasteful as I don't care about each of the Ints frequencies, and I only care if they're at least thresh. Once it passed thresh there's no need to check anymore, just add the Int to the Set[Int].
The question is: can I do this more efficiently for a very large List[Int],
a) if I need a true, accurate result (no room for mistakes)
b) if the result can be approximate, e.g. by using some Hashing trick or Bloom Filters, where Set[Int] might include some false-positives, or whether {the frequency of an Int > thresh} isn't really a Boolean but a Double in [0-1].
First of all, you can't do better than O(N), as you need to check each element of your initial array at least once. You current approach is O(N), presuming that operations with IntMap are effectively constant.
Now what you can try in order to increase efficiency:
update map only when current counter value is less or equal to threshold. This will eliminate huge number of most expensive operations — map updates
try faster map instead of IntMap. If you know that values of the initial List are in fixed range, you can use Array instead of IntMap (index as the key). Another possible option will be mutable HashMap with sufficient initail capacity. As my benchmark shows it actually makes significant difference
As #ixx proposed, after incrementing value in the map, check whether it's equal to 3 and in this case add it immediately to result list. This will save you one linear traversing (appears to be not that significant for large input)
I don't see how any approximate solution can be faster (only if you ignore some elements at random). Otherwise it will still be O(N).
Update
I created microbenchmark to measure the actual performance of different implementations. For sufficiently large input and output Ixx's suggestion regarding immediately adding elements to result list doesn't produce significant improvement. However similar approach could be used to eliminate unnecessary Map updates (which appears to be the most expensive operation).
Results of benchmarks (avg run times on 1000000 elems with pre-warming):
Authors solution:
447 ms
Ixx solution:
412 ms
Ixx solution2 (eliminated excessive map writes):
150 ms
My solution:
57 ms
My solution involves using mutable HashMap instead of immutable IntMap and includes all other possible optimizations.
Ixx's updated solution:
val tuple = (Map[Int, Int](), List[Int]())
val res = myList.foldLeft(tuple) {
case ((m, s), i) =>
val count = m.getOrElse(i, 0) + 1
(if (count <= 3) m + (i -> count) else m, if (count == thresh) i :: s else s)
}
My solution:
val map = new mutable.HashMap[Int, Int]()
val res = new ListBuffer[Int]
myList.foreach {
i =>
val c = map.getOrElse(i, 0) + 1
if (c == thresh) {
res += i
}
if (c <= thresh) {
map(i) = c
}
}
The full microbenchmark source is available here.
You could use the foldleft to collect the matching items, like this:
val tuple = (Map[Int,Int](), List[Int]())
myList.foldLeft(tuple) {
case((m, s), i) => {
val count = (m.getOrElse(i, 0) + 1)
(m + (i -> count), if (count == thresh) i :: s else s)
}
}
I could measure a performance improvement of about 40% with a small list, so it's definitely an improvement...
Edited to use List and prepend, which takes constant time (see comments).
If by "more efficiently" you mean the space efficiency (in extreme case when the list is infinite), there's a probabilistic data structure called Count Min Sketch to estimate the frequency of items inside it. Then you can discard those with frequency below your threshold.
There's a Scala implementation from Algebird library.
You can change your foldLeft example a bit using a mutable.Set that is build incrementally and at the same time used as filter for iterating over your Seq by using withFilter. However, because I'm using withFilteri cannot use foldLeft and have to make do with foreach and a mutable map:
import scala.collection.mutable
def getItems[A](in: Seq[A], threshold: Int): Set[A] = {
val counts: mutable.Map[A, Int] = mutable.Map.empty
val result: mutable.Set[A] = mutable.Set.empty
in.withFilter(!result(_)).foreach { x =>
counts.update(x, counts.getOrElse(x, 0) + 1)
if (counts(x) >= threshold) {
result += x
}
}
result.toSet
}
So, this would discard items that have already been added to the result set while running through the Seq the first time, because withFilterfilters the Seqin the appended function (map, flatMap, foreach) rather than returning a filtered Seq.
EDIT:
I changed my solution to not use Seq.count, which was stupid, as Aivean correctly pointed out.
Using Aiveans microbench I can see that it is still slightly slower than his approach, but still better than the authors first approach.
Authors solution
377
Ixx solution:
399
Ixx solution2 (eliminated excessive map writes):
110
Sascha Kolbergs solution:
72
Aivean solution:
54

Efficiently iterate over one Set, and then another, in one for loop in Scala

I want to iterate over all the elements of one Set, and then all the elements of another Set, using a single loop. (I don't care about duplicates, because I happen to know the two Sets are disjoint.)
The reason I want to do it in a single loop is because I have some additional code to measure progress, which requires it to be in a single loop.
This doesn't work in general, because it might intermix the two Sets arbitrarily:
for(x <- firstSet ++ secondSet) {
...
}
This works, but builds 3 intermediate Seqs in memory, so it's far too inefficient in terms of both time and space usage:
for(x <- firstSet.toSeq ++ secondSet.toSeq) {
...
}
for(x <- firstSet.toIterator ++ secondSet.toIterator) {
...
}
This doesn't build any intermediate data structures, so I think it's the most efficient way.
If you just want a traversal, and you want maximum performance, this is the best way even though it is ugly:
val s1 = Set(1,2,3)
val s2 = Set(4,5,6)
val block : Int => Unit = x => { println(x) }
s1.foreach(block)
s2.foreach(block)
Since this is pretty ugly, you can just define a class for it:
def traverse[T](a:Traversable[T], b:Traversable[T]) : Traversable[T] =
new Traversable[T] {
def foreach[U](f:T=>U) { a.foreach(f); b.foreach(f) }
}
And then use it like this:
for(x<-traverse(s1, s2)) println(x)
However, unless this is extremely performance-critical, the solution posted by Robin Green is better. The overhead is creation of two iterators and concatenation of them. If you have deeper nested data structures, concatenating iterators can be quite expensive though. For example a tree iterator that is defined by concatenating the iterators of the subtrees will be painfully slow, whereas a tree traversable where you just call foreach on each subtree will be close to optimal.

Is Scala idiomatic coding style just a cool trap for writing inefficient code?

I sense that the Scala community has a little big obsession with writing "concise", "cool", "scala idiomatic", "one-liner" -if possible- code. This is immediately followed by a comparison to Java/imperative/ugly code.
While this (sometimes) leads to easy to understand code, it also leads to inefficient code for 99% of developers. And this is where Java/C++ is not easy to beat.
Consider this simple problem: Given a list of integers, remove the greatest element. Ordering does not need to be preserved.
Here is my version of the solution (It may not be the greatest, but it's what the average non-rockstar developer would do).
def removeMaxCool(xs: List[Int]) = {
val maxIndex = xs.indexOf(xs.max);
xs.take(maxIndex) ::: xs.drop(maxIndex+1)
}
It's Scala idiomatic, concise, and uses a few nice list functions. It's also very inefficient. It traverses the list at least 3 or 4 times.
Here is my totally uncool, Java-like solution. It's also what a reasonable Java developer (or Scala novice) would write.
def removeMaxFast(xs: List[Int]) = {
var res = ArrayBuffer[Int]()
var max = xs.head
var first = true;
for (x <- xs) {
if (first) {
first = false;
} else {
if (x > max) {
res.append(max)
max = x
} else {
res.append(x)
}
}
}
res.toList
}
Totally non-Scala idiomatic, non-functional, non-concise, but it's very efficient. It traverses the list only once!
So, if 99% of Java developers write more efficient code than 99% of Scala developers, this is a huge
obstacle to cross for greater Scala adoption. Is there a way out of this trap?
I am looking for practical advice to avoid such "inefficiency traps" while keeping implementation clear ans concise.
Clarification: This question comes from a real-life scenario: I had to write a complex algorithm. First I wrote it in Scala, then I "had to" rewrite it in Java. The Java implementation was twice as long, and not that clear, but at the same time it was twice as fast. Rewriting the Scala code to be efficient would probably take some time and a somewhat deeper understanding of scala internal efficiencies (for vs. map vs. fold, etc)
Let's discuss a fallacy in the question:
So, if 99% of Java developers write more efficient code than 99% of
Scala developers, this is a huge obstacle to cross for greater Scala
adoption. Is there a way out of this trap?
This is presumed, with absolutely no evidence backing it up. If false, the question is moot.
Is there evidence to the contrary? Well, let's consider the question itself -- it doesn't prove anything, but shows things are not that clear.
Totally non-Scala idiomatic, non-functional, non-concise, but it's
very efficient. It traverses the list only once!
Of the four claims in the first sentence, the first three are true, and the fourth, as shown by user unknown, is false! And why it is false? Because, contrary to what the second sentence states, it traverses the list more than once.
The code calls the following methods on it:
res.append(max)
res.append(x)
and
res.toList
Let's consider first append.
append takes a vararg parameter. That means max and x are first encapsulated into a sequence of some type (a WrappedArray, in fact), and then passed as parameter. A better method would have been +=.
Ok, append calls ++=, which delegates to +=. But, first, it calls ensureSize, which is the second mistake (+= calls that too -- ++= just optimizes that for multiple elements). Because an Array is a fixed size collection, which means that, at each resize, the whole Array must be copied!
So let's consider this. When you resize, Java first clears the memory by storing 0 in each element, then Scala copies each element of the previous array over to the new array. Since size doubles each time, this happens log(n) times, with the number of elements being copied increasing each time it happens.
Take for example n = 16. It does this four times, copying 1, 2, 4 and 8 elements respectively. Since Java has to clear each of these arrays, and each element must be read and written, each element copied represents 4 traversals of an element. Adding all we have (n - 1) * 4, or, roughly, 4 traversals of the complete list. If you count read and write as a single pass, as people often erroneously do, then it's still three traversals.
One can improve on this by initializing the ArrayBuffer with an initial size equal to the list that will be read, minus one, since we'll be discarding one element. To get this size, we need to traverse the list once, though.
Now let's consider toList. To put it simply, it traverses the whole list to create a new list.
So, we have 1 traversal for the algorithm, 3 or 4 traversals for resize, and 1 additional traversal for toList. That's 4 or 5 traversals.
The original algorithm is a bit difficult to analyse, because take, drop and ::: traverse a variable number of elements. Adding all together, however, it does the equivalent of 3 traversals. If splitAt was used, it would be reduced to 2 traversals. With 2 more traversals to get the maximum, we get 5 traversals -- the same number as the non-functional, non-concise algorithm!
So, let's consider improvements.
On the imperative algorithm, if one uses ListBuffer and +=, then all methods are constant-time, which reduces it to a single traversal.
On the functional algorithm, it could be rewritten as:
val max = xs.max
val (before, _ :: after) = xs span (max !=)
before ::: after
That reduces it to a worst case of three traversals. Of course, there are other alternatives presented, based on recursion or fold, that solve it in one traversal.
And, most interesting of all, all of these algorithms are O(n), and the only one which almost incurred (accidentally) in worst complexity was the imperative one (because of array copying). On the other hand, the cache characteristics of the imperative one might well make it faster, because the data is contiguous in memory. That, however, is unrelated to either big-Oh or functional vs imperative, and it is just a matter of the data structures that were chosen.
So, if we actually go to the trouble of benchmarking, analyzing the results, considering performance of methods, and looking into ways of optimizing it, then we can find faster ways to do this in an imperative manner than in a functional manner.
But all this effort is very different from saying the average Java programmer code will be faster than the average Scala programmer code -- if the question is an example, that is simply false. And even discounting the question, we have seen no evidence that the fundamental premise of the question is true.
EDIT
First, let me restate my point, because it seems I wasn't clear. My point is that the code the average Java programmer writes may seem to be more efficient, but actually isn't. Or, put another way, traditional Java style doesn't gain you performance -- only hard work does, be it Java or Scala.
Next, I have a benchmark and results too, including almost all solutions suggested. Two interesting points about it:
Depending on list size, the creation of objects can have a bigger impact than multiple traversals of the list. The original functional code by Adrian takes advantage of the fact that lists are persistent data structures by not copying the elements right of the maximum element at all. If a Vector was used instead, both left and right sides would be mostly unchanged, which might lead to even better performance.
Even though user unknown and paradigmatic have similar recursive solutions, paradigmatic's is way faster. The reason for that is that he avoids pattern matching. Pattern matching can be really slow.
The benchmark code is here, and the results are here.
def removeOneMax (xs: List [Int]) : List [Int] = xs match {
case x :: Nil => Nil
case a :: b :: xs => if (a < b) a :: removeOneMax (b :: xs) else b :: removeOneMax (a :: xs)
case Nil => Nil
}
Here is a recursive method, which only iterates once. If you need performance, you have to think about it, if not, not.
You can make it tail-recursive in the standard way: giving an extra parameter carry, which is per default the empty List, and collects the result while iterating. That is, of course, a bit longer, but if you need performance, you have to pay for it:
import annotation.tailrec
#tailrec
def removeOneMax (xs: List [Int], carry: List [Int] = List.empty) : List [Int] = xs match {
case a :: b :: xs => if (a < b) removeOneMax (b :: xs, a :: carry) else removeOneMax (a :: xs, b :: carry)
case x :: Nil => carry
case Nil => Nil
}
I don't know what the chances are, that later compilers will improve slower map-calls to be as fast as while-loops. However: You rarely need high speed solutions, but if you need them often, you will learn them fast.
Do you know how big your collection has to be, to use a whole second for your solution on your machine?
As oneliner, similar to Daniel C. Sobrals solution:
((Nil : List[Int], xs(0)) /: xs.tail) ((p, x)=> if (p._2 > x) (x :: p._1, p._2) else ((p._2 :: p._1), x))._1
but that is hard to read, and I didn't measure the effective performance. The normal pattern is (x /: xs) ((a, b) => /* something */). Here, x and a are pairs of List-so-far and max-so-far, which solves the problem to bring everything into one line of code, but isn't very readable. However, you can earn reputation on CodeGolf this way, and maybe someone likes to make a performance measurement.
And now to our big surprise, some measurements:
An updated timing-method, to get the garbage collection out of the way, and have the hotspot-compiler warm up, a main, and many methods from this thread, together in an Object named
object PerfRemMax {
def timed (name: String, xs: List [Int]) (f: List [Int] => List [Int]) = {
val a = System.currentTimeMillis
val res = f (xs)
val z = System.currentTimeMillis
val delta = z-a
println (name + ": " + (delta / 1000.0))
res
}
def main (args: Array [String]) : Unit = {
val n = args(0).toInt
val funs : List [(String, List[Int] => List[Int])] = List (
"indexOf/take-drop" -> adrian1 _,
"arraybuf" -> adrian2 _, /* out of memory */
"paradigmatic1" -> pm1 _, /**/
"paradigmatic2" -> pm2 _,
// "match" -> uu1 _, /*oom*/
"tailrec match" -> uu2 _,
"foldLeft" -> uu3 _,
"buf-=buf.max" -> soc1 _,
"for/yield" -> soc2 _,
"splitAt" -> daniel1,
"ListBuffer" -> daniel2
)
val r = util.Random
val xs = (for (x <- 1 to n) yield r.nextInt (n)).toList
// With 1 Mio. as param, it starts with 100 000, 200k, 300k, ... 1Mio. cases.
// a) warmup
// b) look, where the process gets linear to size
funs.foreach (f => {
(1 to 10) foreach (i => {
timed (f._1, xs.take (n/10 * i)) (f._2)
compat.Platform.collectGarbage
});
println ()
})
}
I renamed all the methods, and had to modify uu2 a bit, to fit to the common method declaration (List [Int] => List [Int]).
From the long result, i only provide the output for 1M invocations:
scala -Dserver PerfRemMax 2000000
indexOf/take-drop: 0.882
arraybuf: 1.681
paradigmatic1: 0.55
paradigmatic2: 1.13
tailrec match: 0.812
foldLeft: 1.054
buf-=buf.max: 1.185
for/yield: 0.725
splitAt: 1.127
ListBuffer: 0.61
The numbers aren't completly stable, depending on the sample size, and a bit varying from run to run. For example, for 100k to 1M runs, in steps of 100k, the timing for splitAt was as follows:
splitAt: 0.109
splitAt: 0.118
splitAt: 0.129
splitAt: 0.139
splitAt: 0.157
splitAt: 0.166
splitAt: 0.749
splitAt: 0.752
splitAt: 1.444
splitAt: 1.127
The initial solution is already pretty fast. splitAt is a modification from Daniel, often faster, but not always.
The measurement was done on a single core 2Ghz Centrino, running xUbuntu Linux, Scala-2.8 with Sun-Java-1.6 (desktop).
The two lessons for me are:
always measure your performance improvements; it is very hard to estimate it, if you don't do it on a daily basis
it is not only fun, to write functional code - sometimes the result is even faster
Here is a link to my benchmarkcode, if somebody is interested.
First of all, the behavior of the methods you presented is not the same. The first one keeps the element ordering, while the second one doesn't.
Second, among all the possible solution which could be qualified as "idiomatic", some are more efficient than others. Staying very close to your example, you can for instance use tail-recursion to eliminate variables and manual state management:
def removeMax1( xs: List[Int] ) = {
def rec( max: Int, rest: List[Int], result: List[Int]): List[Int] = {
if( rest.isEmpty ) result
else if( rest.head > max ) rec( rest.head, rest.tail, max :: result)
else rec( max, rest.tail, rest.head :: result )
}
rec( xs.head, xs.tail, List() )
}
or fold the list:
def removeMax2( xs: List[Int] ) = {
val result = xs.tail.foldLeft( xs.head -> List[Int]() ) {
(acc,x) =>
val (max,res) = acc
if( x > max ) x -> ( max :: res )
else max -> ( x :: res )
}
result._2
}
If you want to keep the original insertion order, you can (at the expense of having two passes, rather than one) without any effort write something like:
def removeMax3( xs: List[Int] ) = {
val max = xs.max
xs.filterNot( _ == max )
}
which is more clear than your first example.
The biggest inefficiency when you're writing a program is worrying about the wrong things. This is usually the wrong thing to worry about. Why?
Developer time is generally much more expensive than CPU time — in fact, there is usually a dearth of the former and a surplus of the latter.
Most code does not need to be very efficient because it will never be running on million-item datasets multiple times every second.
Most code does need to bug free, and less code is less room for bugs to hide.
The example you gave is not very functional, actually. Here's what you are doing:
// Given a list of Int
def removeMaxCool(xs: List[Int]): List[Int] = {
// Find the index of the biggest Int
val maxIndex = xs.indexOf(xs.max);
// Then take the ints before and after it, and then concatenate then
xs.take(maxIndex) ::: xs.drop(maxIndex+1)
}
Mind you, it is not bad, but you know when functional code is at its best when it describes what you want, instead of how you want it. As a minor criticism, if you used splitAt instead of take and drop you could improve it slightly.
Another way of doing it is this:
def removeMaxCool(xs: List[Int]): List[Int] = {
// the result is the folding of the tail over the head
// and an empty list
xs.tail.foldLeft(xs.head -> List[Int]()) {
// Where the accumulated list is increased by the
// lesser of the current element and the accumulated
// element, and the accumulated element is the maximum between them
case ((max, ys), x) =>
if (x > max) (x, max :: ys)
else (max, x :: ys)
// and of which we return only the accumulated list
}._2
}
Now, let's discuss the main issue. Is this code slower than the Java one? Most certainly! Is the Java code slower than a C equivalent? You can bet it is, JIT or no JIT. And if you write it directly in assembler, you can make it even faster!
But the cost of that speed is that you get more bugs, you spend more time trying to understand the code to debug it, and you have less visibility of what the overall program is doing as opposed to what a little piece of code is doing -- which might result in performance problems of its own.
So my answer is simple: if you think the speed penalty of programming in Scala is not worth the gains it brings, you should program in assembler. If you think I'm being radical, then I counter that you just chose the familiar as being the "ideal" trade off.
Do I think performance doesn't matter? Not at all! I think one of the main advantages of Scala is leveraging gains often found in dynamically typed languages with the performance of a statically typed language! Performance matters, algorithm complexity matters a lot, and constant costs matters too.
But, whenever there is a choice between performance and readability and maintainability, the latter is preferable. Sure, if performance must be improved, then there isn't a choice: you have to sacrifice something to it. And if there's no lost in readability/maintainability -- such as Scala vs dynamically typed languages -- sure, go for performance.
Lastly, to gain performance out of functional programming you have to know functional algorithms and data structures. Sure, 99% of Java programmers with 5-10 years experience will beat the performance of 99% of Scala programmers with 6 months experience. The same was true for imperative programming vs object oriented programming a couple of decades ago, and history shows it didn't matter.
EDIT
As a side note, your "fast" algorithm suffer from a serious problem: you use ArrayBuffer. That collection does not have constant time append, and has linear time toList. If you use ListBuffer instead, you get constant time append and toList.
For reference, here's how splitAt is defined in TraversableLike in the Scala standard library,
def splitAt(n: Int): (Repr, Repr) = {
val l, r = newBuilder
l.sizeHintBounded(n, this)
if (n >= 0) r.sizeHint(this, -n)
var i = 0
for (x <- this) {
(if (i < n) l else r) += x
i += 1
}
(l.result, r.result)
}
It's not unlike your example code of what a Java programmer might come up with.
I like Scala because, where performance matters, mutability is a reasonable way to go. The collections library is a great example; especially how it hides this mutability behind a functional interface.
Where performance isn't as important, such as some application code, the higher order functions in Scala's library allow great expressivity and programmer efficiency.
Out of curiosity, I picked an arbitrary large file in the Scala compiler (scala.tools.nsc.typechecker.Typers.scala) and counted something like 37 for loops, 11 while loops, 6 concatenations (++), and 1 fold (it happens to be a foldRight).
What about this?
def removeMax(xs: List[Int]) = {
val buf = xs.toBuffer
buf -= (buf.max)
}
A bit more ugly, but faster:
def removeMax(xs: List[Int]) = {
var max = xs.head
for ( x <- xs.tail )
yield {
if (x > max) { val result = max; max = x; result}
else x
}
}
Try this:
(myList.foldLeft((List[Int](), None: Option[Int]))) {
case ((_, None), x) => (List(), Some(x))
case ((Nil, Some(m), x) => (List(Math.min(x, m)), Some(Math.max(x, m))
case ((l, Some(m), x) => (Math.min(x, m) :: l, Some(Math.max(x, m))
})._1
Idiomatic, functional, traverses only once. Maybe somewhat cryptic if you are not used to functional-programming idioms.
Let's try to explain what is happening here. I will try to make it as simple as possible, lacking some rigor.
A fold is an operation on a List[A] (that is, a list that contains elements of type A) that will take an initial state s0: S (that is, an instance of a type S) and a function f: (S, A) => S (that is, a function that takes the current state and an element from the list, and gives the next state, ie, it updates the state according to the next element).
The operation will then iterate over the elements of the list, using each one to update the state according to the given function. In Java, it would be something like:
interface Function<T, R> { R apply(T t); }
class Pair<A, B> { ... }
<State> State fold(List<A> list, State s0, Function<Pair<A, State>, State> f) {
State s = s0;
for (A a: list) {
s = f.apply(new Pair<A, State>(a, s));
}
return s;
}
For example, if you want to add all the elements of a List[Int], the state would be the partial sum, that would have to be initialized to 0, and the new state produced by a function would simply add the current state to the current element being processed:
myList.fold(0)((partialSum, element) => partialSum + element)
Try to write a fold to multiply the elements of a list, then another one to find extreme values (max, min).
Now, the fold presented above is a bit more complex, since the state is composed of the new list being created along with the maximum element found so far. The function that updates the state is more or less straightforward once you grasp these concepts. It simply puts into the new list the minimum between the current maximum and the current element, while the other value goes to the current maximum of the updated state.
What is a bit more complex than to understand this (if you have no FP background) is to come up with this solution. However, this is only to show you that it exists, can be done. It's just a completely different mindset.
EDIT: As you see, the first and second case in the solution I proposed are used to setup the fold. It is equivalent to what you see in other answers when they do xs.tail.fold((xs.head, ...)) {...}. Note that the solutions proposed until now using xs.tail/xs.head don't cover the case in which xs is List(), and will throw an exception. The solution above will return List() instead. Since you didn't specify the behavior of the function on empty lists, both are valid.
Another option would be:
package code.array
object SliceArrays {
def main(args: Array[String]): Unit = {
println(removeMaxCool(Vector(1,2,3,100,12,23,44)))
}
def removeMaxCool(xs: Vector[Int]) = xs.filter(_ < xs.max)
}
Using Vector instead of List, the reason is that Vector is more versatile and has a better general performance and time complexity if compared to List.
Consider the following collections operations:
head, tail, apply, update, prepend, append
Vector takes an amortized constant time for all operations, as per Scala docs:
"The operation takes effectively constant time, but this might depend on some assumptions such as maximum length of a vector or distribution of hash keys"
While List takes constant time only for head, tail and prepend operations.
Using
scalac -print
generates:
package code.array {
object SliceArrays extends Object {
def main(args: Array[String]): Unit = scala.Predef.println(SliceArrays.this.removeMaxCool(scala.`package`.Vector().apply(scala.Predef.wrapIntArray(Array[Int]{1, 2, 3, 100, 12, 23, 44})).$asInstanceOf[scala.collection.immutable.Vector]()));
def removeMaxCool(xs: scala.collection.immutable.Vector): scala.collection.immutable.Vector = xs.filter({
((x$1: Int) => SliceArrays.this.$anonfun$removeMaxCool$1(xs, x$1))
}).$asInstanceOf[scala.collection.immutable.Vector]();
final <artifact> private[this] def $anonfun$removeMaxCool$1(xs$1: scala.collection.immutable.Vector, x$1: Int): Boolean = x$1.<(scala.Int.unbox(xs$1.max(scala.math.Ordering$Int)));
def <init>(): code.array.SliceArrays.type = {
SliceArrays.super.<init>();
()
}
}
}
Another contender. This uses a ListBuffer, like Daniel's second offering, but shares the post-max tail of the original list, avoiding copying it.
def shareTail(xs: List[Int]): List[Int] = {
var res = ListBuffer[Int]()
var maxTail = xs
var first = true;
var x = xs
while ( x != Nil ) {
if (x.head > maxTail.head) {
while (!(maxTail.head == x.head)) {
res += maxTail.head
maxTail = maxTail.tail
}
}
x = x.tail
}
res.prependToList(maxTail.tail)
}

Scala, make my loop more functional

I'm trying to reduce the extent to which I write Scala (2.8) like Java. Here's a simplification of a problem I came across. Can you suggest improvements on my solutions that are "more functional"?
Transform the map
val inputMap = mutable.LinkedHashMap(1->'a',2->'a',3->'b',4->'z',5->'c')
by discarding any entries with value 'z' and indexing the characters as they are encountered
First try
var outputMap = new mutable.HashMap[Char,Int]()
var counter = 0
for(kvp <- inputMap){
val character = kvp._2
if(character !='z' && !outputMap.contains(character)){
outputMap += (character -> counter)
counter += 1
}
}
Second try (not much better, but uses an immutable map and a 'foreach')
var outputMap = new immutable.HashMap[Char,Int]()
var counter = 0
inputMap.foreach{
case(number,character) => {
if(character !='z' && !outputMap.contains(character)){
outputMap2 += (character -> counter)
counter += 1
}
}
}
Nicer solution:
inputMap.toList.filter(_._2 != 'z').map(_._2).distinct.zipWithIndex.toMap
I find this solution slightly simpler than arjan's:
inputMap.values.filter(_ != 'z').toSeq.distinct.zipWithIndex.toMap
The individual steps:
inputMap.values // Iterable[Char] = MapLike(a, a, b, z, c)
.filter(_ != 'z') // Iterable[Char] = List(a, a, b, c)
.toSeq.distinct // Seq[Char] = List(a, b, c)
.zipWithIndex // Seq[(Char, Int)] = List((a,0), (b,1), (c,2))
.toMap // Map[Char, Int] = Map((a,0), (b,1), (c,2))
Note that your problem doesn't inherently involve a map as input, since you're just discarding the keys. If I were coding this, I'd probably write a function like
def buildIndex[T](s: Seq[T]): Map[T, Int] = s.distinct.zipWithIndex.toMap
and invoke it as
buildIndex(inputMap.values.filter(_ != 'z').toSeq)
First, if you're doing this functionally, you should use an immutable map.
Then, to get rid of something, you use the filter method:
inputMap.filter(_._2 != 'z')
and finally, to do the remapping, you can just use the values (but as a set) withzipWithIndex, which will count up from zero, and then convert back to a map:
inputMap.filter(_._2 != 'z').values.toSet.zipWithIndex.toMap
Since the order of values isn't going to be preserved anyway*, presumably it doesn't matter that the order may have been shuffled yet again with the set transformation.
Edit: There's a better solution in a similar vein; see Arjan's. Assumption (*) is wrong, since it was a LinkedHashMap. So you do need to preserve order, which Arjan's solution does.
i would create some "pipeline" like this, but this has a lot of operations and could be probably shortened. These two List.map's could be put in one, but I think you've got a general idea.
inputMap
.toList // List((5,c), (1,a), (2,a), (3,b), (4,z))
.sorted // List((1,a), (2,a), (3,b), (4,z), (5,c))
.filterNot((x) => {x._2 == 'z'}) // List((1,a), (2,a), (3,b), (5,c))
.map(_._2) // List(a, a, b, c)
.zipWithIndex // List((a,0), (a,1), (b,2), (c,3))
.map((x)=>{(x._2+1 -> x._1)}) // List((1,a), (2,a), (3,b), (4,c))
.toMap // Map((1,a), (2,a), (3,b), (4,c))
performing these operation on lists keeps ordering of elements.
EDIT: I misread the OP question - thought you wanted run length encoding. Here's my take on your actual question:
val values = inputMap.values.filterNot(_ == 'z').toSet.zipWithIndex.toMap
EDIT 2: As noted in the comments, use toSeq.distinct or similar if preserving order is important.
val values = inputMap.values.filterNot(_ == 'z').toSeq.distinct.zipWithIndex.toMap
In my experience I have found that maps and functional languages do not play nice. You'll note that all answers so far in one way or another in involve turning the map into a list, filtering the list, and then turning the list back into a map.
I think this is due to maps being mutable data structures by nature. Consider that when building a list, that the underlying structure of the list does not change when you append a new element and if a true list then an append is a constant O(1) operation. Whereas for a map the internal structure of a map can vastly change when a new element is added ie. when the load factor becomes too high and the add algorithm resizes the map. In this way a functional language cannot just create a series of a values and pop them into a map as it goes along due to the possible side effects of introducing a new key/value pair.
That said, I still think there should be better support for filtering, mapping and folding/reducing maps. Since we start with a map, we know the maximum size of the map and it should be easy to create a new one.
If you're wanting to get to grips with functional programming then I'd recommending steering clear of maps to start with. Stick with the things that functional languages were designed for -- list manipulation.