When working with large collections, we usually hear the term "lazy evaluation". I want to better demonstrate the difference between strict and lazy evaluation, so I tried the following example - getting the first two even numbers from a list:
scala> var l = List(1, 47, 38, 53, 51, 67, 39, 46, 93, 54, 45, 33, 87)
l: List[Int] = List(1, 47, 38, 53, 51, 67, 39, 46, 93, 54, 45, 33, 87)
scala> l.filter(_ % 2 == 0).take(2)
res0: List[Int] = List(38, 46)
scala> l.toStream.filter(_ % 2 == 0).take(2)
res1: scala.collection.immutable.Stream[Int] = Stream(38, ?)
I noticed that when I'm using toStream, I'm getting Stream(38, ?). What does the "?" mean here? Does this have something to do with lazy evaluation?
Also, what are some good examples of lazy evaluation, and when should I use it and why?
One benefit of using lazy collections is to "save" memory, e.g. when mapping to large data structures. Consider this:
val r = (1 to 10000)
  .map(_ => Seq.fill(10000)(scala.util.Random.nextDouble))
  .map(_.sum)
  .sum
And using lazy evaluation:
val r = (1 to 10000).toStream
  .map(_ => Seq.fill(10000)(scala.util.Random.nextDouble))
  .map(_.sum)
  .sum
The first statement will generate 10000 Seqs of size 10000 and keep them all in memory, while in the second case only one Seq at a time needs to exist in memory, so it needs far less memory and, in practice, often finishes faster as well.
Another use-case is when only a part of the data is actually needed. I often use lazy collections together with take, takeWhile etc
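For example, a minimal sketch of that pattern (the numbers are arbitrary):
val firstTwoBigSquares = (1 to 1000000).toStream
  .map(n => n * n)   // evaluated lazily, one element at a time
  .filter(_ > 50)
  .take(2)
  .toList            // forces evaluation: List(64, 81)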
Let's take a real-life scenario: instead of having a list, you have a big log file, and you want to extract the first 10 lines that contain "Success".
The straightforward solution would be to read the file line by line and, once you hit a line that contains "Success", print it and continue to the next line.
But since we love functional programming, we don't want to use the traditional loops. Instead, we want to achieve our goal by composing functions.
First attempt:
Source.fromFile("log_file").getLines.toList.filter(_.contains("Success")).take(10)
Let's try to understand what actually happened here:
we read the whole file into memory
we filtered the relevant lines
we took the first 10 elements
If we try to print Source.fromFile("log_file").getLines.toList, we will get the whole file, which is obviously a waste, since not all lines are relevant for us.
Why did we get all the lines and only then perform the filtering? Because List is a strict data structure: when we call toList, it is evaluated immediately, and only after the whole data is in memory is the filtering applied.
Luckily, Scala provides lazy data structures, and Stream is one of them:
Source.fromFile("log_file").getLines.toStream.filter(_.contains("Success")).take(10)
In order to demonstrate the difference, let's try:
Source.fromFile("log_file").getLines.toStream
Now we get something like:
scala.collection.immutable.Stream[String] = Stream(That's the first line, ?)
toStream evaluates only one element - the first line of the file. The next element is represented by a "?", which indicates that the stream has not evaluated it yet; that's because toStream is lazy, and the next item is evaluated only when it is actually used.
Now, after we apply the filter function, the stream will keep reading lines until it reaches the first line that contains "Success":
> val res = Source.fromFile("log_file").getLines.toStream.filter(_.contains("Success"))
scala.collection.immutable.Stream[String] = Stream(First line contains Success!, ?)
Now we apply the take function. Still no action is performed, but the stream now knows it should pick at most 10 matching lines, and it won't evaluate them until we use the result:
res.take(10) foreach println
Finally, when we run this, we get the first 10 lines that contain "Success", as expected.
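Putting it all together (a sketch: "log_file" is just a placeholder path, and the import is what Source needs):
import scala.io.Source
val firstTenSuccesses = Source.fromFile("log_file").getLines.toStream
  .filter(_.contains("Success"))
  .take(10)
firstTenSuccesses foreach println   // the remaining lines are read and filtered only here, lazily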
Related
Let's say I have a list of numerics:
val list = List(4,12,3,6,9)
For every element in the list, I need to find the rolling sum, i.e. the final output should be:
List(4, 16, 19, 25, 34)
Is there any transformation that allows us to take as input two elements of the list (the current and the previous) and compute based on both?
Something like map(initial)((curr,prev) => curr+prev)
I want to achieve this without maintaining any shared global state.
EDIT: I would like to be able to do the same kinds of computation on RDDs.
You may use scanLeft:
list.scanLeft(0)(_ + _).tail
scanLeft(0)(_ + _) emits the running totals starting from the initial 0, i.e. List(0, 4, 16, 19, 25, 34), and tail drops that leading zero to give List(4, 16, 19, 25, 34).
The cumSum method below should work for any RDD[N], where N has an implicit Numeric[N] available, e.g. Int, Long, BigInt, Double, etc.
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD
def cumSum[N : Numeric : ClassTag](rdd: RDD[N]): RDD[N] = {
  val num = implicitly[Numeric[N]]
  val nPartitions = rdd.partitions.length
  // First pass: total up every partition except the last, then scan those totals
  // so that partitionCumSums(i) is the sum of all elements in partitions 0 until i.
  val partitionCumSums = rdd.mapPartitionsWithIndex((index, iter) =>
      if (index == nPartitions - 1) Iterator.empty
      else Iterator.single(iter.foldLeft(num.zero)(num.plus))
    ).collect
    .scanLeft(num.zero)(num.plus)
  // Second pass: scan each partition locally, offsetting every running sum by the
  // cumulative total of all preceding partitions.
  rdd.mapPartitionsWithIndex((index, iter) =>
    if (iter.isEmpty) iter
    else {
      val start = num.plus(partitionCumSums(index), iter.next)
      iter.scanLeft(start)(num.plus)
    }
  )
}
It should be fairly straightforward to generalize this method to any associative binary operator with a "zero" (i.e. any monoid.) It is the associativity that is key for the parallelization. Without this associativity you're generally going to be stuck with running through the entries of the RDD in a serial fashion.
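For completeness, a small usage sketch (sc.parallelize and the partition count are just an illustration, assuming an existing SparkContext sc):
val rdd = sc.parallelize(List(4, 12, 3, 6, 9), numSlices = 2)
cumSum(rdd).collect()   // expected: Array(4, 16, 19, 25, 34)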
I don't know what functionality is supported by Spark RDDs, so I am not sure if this satisfies your conditions, because I don't know if zipWithIndex is supported (if the answer is not helpful, please let me know in a comment and I will delete my answer):
list.zipWithIndex.map{x => list.take(x._2+1).sum}
This code works for me; it sums up the elements. It takes the index of each list element and then sums the first n+1 elements of the list, where n is that index (note the +1 in take, since zipWithIndex starts at 0).
When printing it, I get the following:
List(4, 16, 19, 25, 34)
1) Is it possible to iterate through an Array using while loop in Scala?
2) How to find the numbers that are greater than 50 using reduce loop?
val reduce_left_list=List(12,34,54,50,82,34,78,90,3,45,43,1,2343,234)
val greatest_num=reduce_left_list.reduceLeft((x:Int)=> { for(line <- reduce_left_list) line > 50)
1) Is it possible to iterate through an Array using while loop in Scala?
That depends on your definition of "iterate through an array". You can certainly do the same thing that you would do in C, for example: take an integer, increase it by 1 in every iteration of the loop, stop when it equals the size of the array, and use that integer as an index into the array:
val anArray = Array('A, 'B, 'C, 'D)
var i = 0
val s = anArray.size
while (i < s) {
  println(anArray(i))
  i += 1
}
// 'A
// 'B
// 'C
// 'D
But I wouldn't call this "iterating through an array". You are iterating through integers, not the array.
And besides, why would you want to do that, if you can just tell the array to iterate itself?
anArray foreach println
// 'A
// 'B
// 'C
// 'D
If you absolutely insist on juggling indices yourself (but again, why would you want to?), there are much better ways available than using a while loop. You could, for example, iterate over a Range:
(0 until s) foreach (i ⇒ println(anArray(i)))
Or written using a for comprehension:
for (i ← 0 until s) println(anArray(i))
Loops are never idiomatic in Scala. While Scala does allow side-effects, it is generally idiomatic to avoid them and strive for referential transparency. Albert Einstein is quoted as saying "Insanity is doing the same thing and expecting a different result", but that's exactly what we expect a loop to do: the loop executes the same code over and over, but we expect it to do a different thing every time (or at least once, namely, stop the loop). According to Einstein, loops are insane, and who are we to defy Einstein?
Seriously, though: loops cannot work without side-effects, but the Scala community tries to avoid side-effects, so the Scala community tries to avoid loops.
2) How to find the numbers that are greater than 50 using reduce loop?
There is no such thing as a "reduce loop" in Scala. I assume you mean the reduce method.
The answer is: No. The types don't line up. reduce returns a value of the same type as the element type of the collection, but you want to return a collection of elements.
You can, however, use a fold, more precisely, a right fold:
(reduce_left_list :\ List.empty[Int])((el, acc) => if (el > 50) el :: acc else acc)
//=> List(54, 82, 78, 90, 2343, 234)
You can also use a left fold if you reverse the result afterwards:
(List.empty[Int] /: reduce_left_list)((acc, el) => if (el > 50) el :: acc else acc) reverse
//=> List(54, 82, 78, 90, 2343, 234)
If you try appending the element to the result instead, your runtime will be quadratic instead of linear:
(List.empty[Int] /: reduce_left_list)((acc, el) => if (el > 50) acc :+ el else acc)
//=> List(54, 82, 78, 90, 2343, 234)
However, saying that "you can do this using a left/right fold" is tautological: left/right fold is universal, which means that anything you can do with a collection, can be done with a left/right fold. Which means that using a left/right fold is not very intention-revealing: since a left/right fold can do anything, seeing a left/right fold in the code doesn't tell the reader anything about what's going on.
So, whenever possible, you should use a more specialized operation with a more intention-revealing name. In this particular case, you want to filter out some particular elements that satisfy a predicate. And the Scala collections API actually has a method that filters, and it is called (surprise!) filter:
reduce_left_list filter (_ > 50)
//=> List(54, 82, 78, 90, 2343, 234)
Alternatively, you can use withFilter instead:
reduce_left_list withFilter (_ > 50)
The difference is that filter returns a new list, whereas withFilter returns an instance of FilterMonadic, a lazy view of the existing list that applies the predicate only when you subsequently call map, flatMap, or foreach on it.
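For example, the withFilter result only does its work once you chain map, flatMap, or foreach onto it (reusing the list from the question):
reduce_left_list withFilter (_ > 50) map (_ * 2)
//=> List(108, 164, 156, 180, 4686, 468)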
Maybe you want to try filter:
List(12,34,54,50,82,34,78,90,3,45,43,1,2343,234).filter(_ > 50)
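//=> List(54, 82, 78, 90, 2343, 234)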
I am trying to parse and concatenate two columns at the same time using the following expression:
val part: RDD[String] = sc.textFile("hdfs://xxx:8020/user/sample_head.csv")
  .map { line =>
    val row = line split ','
    (row(1), row(2)).toString
  }
which returns something like:
Array((AAA,111), (BBB,222),(CCC,333))
But how could I directly get:
Array(AAA, 111 , BBB, 222, CCC, 333)
Your toString() on a tuple really doesn't make much sense to me. Can you explain why you want to create strings from tuples and then split them again later?
If you are willing to map each row into a list of elements instead of a stringified tuple of elements, you could rewrite
(row(1), row(2)).toString
to
List(row(1), row(2))
and simply flatten the resulting list:
val list = List("0,aaa,111", "1,bbb,222", "2,ccc,333")
val tuples = list.map{ line =>
val row = line split ','
List(row(1), row(2))}
val flattenedTuples = tuples.flatten
println(flattenedTuples) // prints List(aaa, 111, bbb, 222, ccc, 333)
Note that what you are trying to achieve involves flattening and can be done with flatMap, but not with just map. You can either flatMap directly or do a map followed by flatten as I showed above (Spark RDDs do support flatMap). Also, as you can see, I used a List as a more idiomatic Scala data structure, but it's easily convertible to an Array and vice versa.
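If you do want to run this on the RDD itself, the same shape works with flatMap; a minimal sketch (sc.parallelize is just a stand-in for however you actually load the data):
val rdd = sc.parallelize(List("0,aaa,111", "1,bbb,222", "2,ccc,333"))
val flattened = rdd.flatMap { line =>
  val row = line split ','
  List(row(1), row(2))
}
flattened.collect()   // Array(aaa, 111, bbb, 222, ccc, 333)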
I'm new to Scala and I want to write a higher-order function (say "partition2") that takes a list of integers and a function that returns either true or false. The output would be a list of values for which the function is true and a list of values for which the function is false. I'd like to implement this using a fold. I know something like this would be a really straightforward way to do this:
val (passed, failed) = List(49, 58, 76, 82, 88, 90) partition ( _ > 60 )
I'm wondering how this same logic could be applied using a fold.
You can start by thinking about what you want your accumulator to look like. In many cases it'll have the same type as the thing you want to end up with, and that works here—you can use two lists to keep track of the elements that passed and failed. Then you just need to write the cases and add the element to the appropriate list:
List(49, 58, 76, 82, 88, 90).foldRight((List.empty[Int], List.empty[Int])) {
  case (i, (passed, failed)) if i > 60 => (i :: passed, failed)
  case (i, (passed, failed))           => (passed, i :: failed)
}
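For the example list this evaluates to (List(76, 82, 88, 90), List(49, 58)): the first list holds the values that passed the predicate, the second the ones that failed.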
I'm using a right fold here because prepending to a list is nicer than the alternative, but you could easily rewrite it to use a left fold.
You can do this:
List(49, 58, 76, 82, 88, 90).foldLeft((Vector.empty[Int], Vector.empty[Int])) {
  case ((passed, failed), x) =>
    if (x > 60) (passed :+ x, failed)
    else (passed, failed :+ x)
}
Basically you have two accumulators, and as you visit each element, you add it to the appropriate accumulator.
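For the same input this yields (Vector(76, 82, 88, 90), Vector(49, 58)). Appending with :+ keeps the elements in encounter order, so no reverse is needed at the end; that is the trade-off against the List-and-foldRight version above.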
In scala, given a sorted map, tree or list, what is the most efficient way to return the next larger value of a non-existing key? Additionally, is it possible to get an "iterator/cursor" starting at this element?
Edit:
I'm happy with any interpretation of "efficiently", e.g. "runtime", "memory usage", "clarity" or "taking the smallest possible amount of programmer time to implement and maintain" (thanks Kevin Wright).
If you use a SortedMap, then you can call range on it. Well, sort of: range was broken up to Scala 2.8.1 if you add and/or remove elements from the map afterwards. It should be OK if you avoid those operations, and it has been fixed for subsequent Scala versions as well.
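For example, a minimal sketch with an immutable SortedMap (the keys and values are made up; from and iteratorFrom are the Scala 2.12-era method names):
import scala.collection.immutable.SortedMap
val m = SortedMap(2 -> "a", 3 -> "b", 5 -> "c", 7 -> "d", 11 -> "e")
m.from(6)              // SortedMap(7 -> d, 11 -> e): everything at or above the key 6
m.from(6).headOption   // Some((7,d)): the next larger entry for the non-existing key 6
m.iteratorFrom(6)      // an iterator/cursor starting at that entry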
Defining "efficiently" as "taking the smallest possible amount of programmer time to implement and maintain"...
For a Sequence:
val s = Seq(2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41)
val overSixteen = s dropWhile (_ < 16)
For a Map (use scala.collection.immutable.SortedMap so the keys are iterated in ascending order):
val m = SortedMap(2->"a", 3->"b", 5->"c", 7->"d", 11->"e", 13->"f")
val overSix = m dropWhile (_._1 < 6)
If you prefer an Iterator, just call .iterator on the resulting collection, or you can use .view before dropWhile if you're especially interested in lazy behaviour.
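For instance, with the Seq s from above:
s.iterator dropWhile (_ < 16)            // an Iterator positioned at 17
(s.view dropWhile (_ < 16)).headOption   // Some(17), evaluated lazily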