I am trying to understand functional programming using Scala, so the question is very basic. To start off, I have a trait which looks something like this:
trait DebugLogger {
  def time(stageName: String)(func: => Unit): Unit = {
    val currentTime = System.currentTimeMillis()
    println(s"Stage ${stageName} started at ${currentTime}")
    func
    println(s"Stage ${stageName} completed.. Took ${(System.currentTimeMillis() - currentTime) / 1000.0} seconds")
  }
}
Now I have an object that uses it, which looks something like this:
object GeneralRecap extends App with DebugLogger {
  val aCondition: Boolean = true
  val list1 = Seq(1, 2, 4, 4)
  val list2 = Seq('a', 'b', 'c', 'd')

  time("Time taken in for loop") {
    val a1 = for (i <- list1; j <- list2) yield i * j
    println(a1)
  }

  time("Time taken in flatmap") {
    val c = list1 flatMap (number => list2.map(value => number * value))
    println(c)
  }
}
I assumed the bytecode generated for both blocks would be the same, and therefore that both would take the same time to run. However, to my surprise, this is the output I ended up with:
Stage Time taken in for loop started at 1661398450618
List(97, 98, 99, 100, 194, 196, 198, 200, 388, 392, 396, 400, 388, 392, 396, 400)
Stage Time taken in for loop completed.. Took 0.011 seconds
Stage Time taken in flatmap started at 1661398450629
List(97, 98, 99, 100, 194, 196, 198, 200, 388, 392, 396, 400, 388, 392, 396, 400)
Stage Time taken in flatmap completed.. Took 0.001 seconds
So the flatMap way takes about a tenth of the for-comprehension time. Considering both versions are O(n²), I would have assumed they should take the same time. Is there any reason why the first one takes more time than the other?
There are all sorts of problems with that time method. Firstly, it includes the time for the first println in the total, and secondly there may well be string-formatting code executed before the second timestamp is captured.
This is a much better version:
def time(stageName: String)(func: => Unit): Unit = {
  println(s"Starting ${stageName}")
  val startTime = System.currentTimeMillis()
  func
  val endTime = System.currentTimeMillis()
  val elapsed = (endTime - startTime) / 1000.0
  println(s"Stage ${stageName} completed. Took $elapsed seconds")
}
More importantly, you can't measure Scala performance from a single run. The JVM performs a lot of optimisation while a program runs, which means the first pass is usually significantly slower than later ones.
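As a rough illustration (a sketch only; the names timeAverage, warmupRuns and measuredRuns are mine, not part of the trait above), you could warm the code up first and then average over several runs:

def timeAverage(stageName: String, warmupRuns: Int = 5, measuredRuns: Int = 10)(func: => Unit): Unit = {
  // let the JIT optimise the hot path before measuring anything
  (1 to warmupRuns).foreach(_ => func)
  val start = System.nanoTime()
  (1 to measuredRuns).foreach(_ => func)
  val avgMillis = (System.nanoTime() - start) / 1e6 / measuredRuns
  println(s"Stage ${stageName} averaged ${avgMillis} ms over ${measuredRuns} runs")
}

For anything serious, use a dedicated benchmarking harness such as JMH (via sbt-jmh), which takes care of warm-up, dead-code elimination and statistical reporting for you.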
Given these:
case class Foo(amount: Int, running: Int)
val seed = List(Foo(10, 10), Foo(5, 15), Foo(10, 25))
val next = List(20, 10, 15)
How do I map next into List(Foo(20, 45), Foo(10, 55), Foo(15, 70)) the Scala way? As you can see, it continues the running total.
Here are a couple of different approaches, both scanLeft-based:
val initial = seed.map(_.amount).sum
next
  .zip(next.scanLeft(initial)(_ + _).tail)  // drop the seed total itself so each amount pairs with its own running total
  .map((Foo.apply _).tupled)
and

val initial = Foo(0, seed.map(_.amount).sum)
next
  .scanLeft(initial) {
    case (Foo(_, total), n) => Foo(n, total + n)
  }
  .tail
You might also consider solving it using recursion.
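For completeness, a recursive sketch (the helper names accumulate and loop are arbitrary) could look like this:

import scala.annotation.tailrec

def accumulate(start: Int, amounts: List[Int]): List[Foo] = {
  @tailrec
  def loop(total: Int, rest: List[Int], acc: List[Foo]): List[Foo] = rest match {
    case Nil       => acc.reverse
    case n :: tail => loop(total + n, tail, Foo(n, total + n) :: acc)
  }
  loop(start, amounts, Nil)
}

// accumulate(seed.map(_.amount).sum, next)
// => List(Foo(20,45), Foo(10,55), Foo(15,70))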
When working with large collections, we usually hear the term "lazy evaluation". I want to better demonstrate the difference between strict and lazy evaluation, so I tried the following example - getting the first two even numbers from a list:
scala> var l = List(1, 47, 38, 53, 51, 67, 39, 46, 93, 54, 45, 33, 87)
l: List[Int] = List(1, 47, 38, 53, 51, 67, 39, 46, 93, 54, 45, 33, 87)
scala> l.filter(_ % 2 == 0).take(2)
res0: List[Int] = List(38, 46)
scala> l.toStream.filter(_ % 2 == 0).take(2)
res1: scala.collection.immutable.Stream[Int] = Stream(38, ?)
I noticed that when I'm using toStream, I'm getting Stream(38, ?). What does the "?" mean here? Does this have something to do with lazy evaluation?
Also, what are some good examples of lazy evaluation? When should I use it, and why?
One benefit of using lazy collections is to "save" memory, e.g. when mapping over large data structures. Consider this:
val r = (1 to 10000)
.map(_ => Seq.fill(10000)(scala.util.Random.nextDouble))
.map(_.sum)
.sum
And using lazy evaluation:
val r = (1 to 10000).toStream
.map(_ => Seq.fill(10000)(scala.util.Random.nextDouble))
.map(_.sum)
.sum
The first statement will generate 10000 Seqs of size 10000 and keep them all in memory, while in the second case only one Seq at a time needs to exist in memory, therefore it's much faster...
Another use case is when only a part of the data is actually needed. I often use lazy collections together with take, takeWhile, etc.
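For instance, a small sketch (the value name firstSmallSquares is made up) where only as many elements as takeWhile needs are ever computed:

val firstSmallSquares = (1 to 1000000).toStream
  .map(n => n * n)          // evaluated lazily, element by element
  .takeWhile(_ < 100)
  .toList
// => List(1, 4, 9, 16, 25, 36, 49, 64, 81)

With a strict List, the map would square all one million numbers before takeWhile ever ran.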
Let's take a real-life scenario: instead of having a list, you have a big log file from which you want to extract the first 10 lines that contain "Success".
The straightforward solution would be to read the file line by line and, once you have a line that contains "Success", print it and continue to the next line.
But since we love functional programming, we don't want to use the traditional loops. Instead, we want to achieve our goal by composing functions.
First attempt:
Source.fromFile("log_file").getLines.toList.filter(_.contains("Success")).take(10)
Let's try to understand what actually happened here:
we read the whole file into a List
we filtered the relevant lines
we took the first 10 elements
If we try to print Source.fromFile("log_file").getLines.toList, we will get the whole file, which is obviously a waste, since not all lines are relevant for us.
Why did we get all the lines, and only then perform the filtering? Because List is a strict data structure: when we call toList it is evaluated immediately, and only once the whole data is in memory is the filter applied.
Luckily, Scala provides lazy data structures, and Stream is one of them:
Source.fromFile("log_file").getLines.toStream.filter(_.contains("Success")).take(10)
In order to demonstrate the difference, let's try:
Source.fromFile("log_file").getLines.toStream
Now we get something like:
scala.collection.immutable.Stream[String] = Stream(That's the first line, ?)
toStream evaluates only one element: the first line in the file. The next element is represented by a "?", which indicates that the stream has not yet evaluated it; that's because toStream is lazy, and the next item is evaluated only when it is used.
Now, after we apply the filter function, it will keep reading lines until it gets the first line that contains "Success":
> var res = Source.fromFile("log_file").getLines.toStream.filter(_.contains("Success"))
scala.collection.immutable.Stream[String] = Stream(First line contains Success!, ?)
Now we apply the take function. Still no action is performed, but the stream knows it should pick 10 lines, so nothing is evaluated until we use the result:
res.take(10) foreach println
Finally, when we force the stream by printing it with foreach, we get the first 10 lines that contain "Success", as we expected.
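To make the laziness easy to see in the REPL, here is a small sketch with a side-effecting map: only the head is evaluated when the mapped stream is built, and the second element is evaluated only when take(2) is forced.

scala> val s = Stream(1, 2, 3, 4).map { n => println(s"evaluating $n"); n * 10 }
evaluating 1
s: scala.collection.immutable.Stream[Int] = Stream(10, ?)

scala> s.take(2).toList
evaluating 2
res0: List[Int] = List(10, 20)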
Problem at Hand
I wrote an attempt at an improved bi-gram generator working over lines, taking full stops and the like into account. The results are as wanted. It does not use mapPartitions; it is as per below.
import org.apache.spark.mllib.rdd.RDDFunctions._
val wordsRdd = sc.textFile("/FileStore/tables/natew5kh1478347610918/NGram_File.txt",10)
val wordsRDDTextSplit = wordsRdd
  .map(line => line.trim.split(" "))
  .flatMap(x => x)
  .map(x => x.toLowerCase())
  .map(x => x.replaceAll(",{1,}", ""))
  .map(x => x.replaceAll("!{1,}", "."))
  .map(x => x.replaceAll("\\?{1,}", "."))
  .map(x => x.replaceAll("\\.{1,}", "."))
  .map(x => x.replaceAll("\\W+", "."))
  .filter(_ != ".")
  .filter(_ != "")
val x = wordsRDDTextSplit.collect() // need to do this due to lazy evaluation etc. I think, need collect()
val y = for ( Array(a,b,_*) <- x.sliding(2).toArray)
yield (a, b)
val z = y.filter(x => !(x._1 contains ".")).map(x => (x._1.replaceAll("\\.{1,}",""), x._2.replaceAll("\\.{1,}","")))
I have some questions:
The results are as expected; no data is missed. But can I convert such an approach to a mapPartitions approach, and would I not lose some data? Many say that this is the case, because each partition we would be processing holds only a subset of all the words and hence misses the relationship at the boundary of the split, i.e. the next and the previous word. With a large file split across partitions, I can see from the map point of view that this could occur as well. Correct?
However, if you look at the code above (no mapPartitions attempt), it always works regardless of how much I parallelize it, whether 10 or 100 partitions are specified, even with words that are consecutive across different partitions. I checked this with mapPartitionsWithIndex. This I am not clear on. OK, a reduce on (x, y) => x + y is well understood.
Thanks in advance. I must be missing some elementary point in all this.
Output & Results
z: Array[(String, String)] = Array((hello,how), (how,are), (are,you), (you,today), (i,am), (am,fine), (fine,but), (but,would), (would,like), (like,to), (to,talk), (talk,to), (to,you), (you,about), (about,the), (the,cat), (he,is), (is,not), (not,doing), (doing,so), (so,well), (what,should), (should,we), (we,do), (please,help), (help,me), (hi,there), (there,ged))
mapped: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[669] at mapPartitionsWithIndex at :123
Partition Assignment
res13: Array[String] = Array(hello -> 0, how -> 0, are -> 0, you -> 0, today. -> 0, i -> 0, am -> 32, fine -> 32, but -> 32, would -> 32, like -> 32, to -> 32, talk -> 60, to -> 60, you -> 60, about -> 60, the -> 60, cat. -> 60, he -> 60, is -> 60, not -> 96, doing -> 96, so -> 96, well. -> 96, what -> 96, should -> 122, we -> 122, do. -> 122, please -> 122, help -> 122, me. -> 122, hi -> 155, there -> 155, ged. -> 155)
Maybe Spark is just really smart, smarter than I thought initially. Or maybe not? I saw some stuff on partition preservation, some of it contradictory IMHO.
Does map vs mapValues mean the former destroys the partitioning, and hence single-partition processing?
You can use mapPartitions in place of any of the maps used to create wordsRDDTextSplit, but I don't really see any reason to. mapPartitions is most useful when you have a high initialization cost that you don't want to pay for every record in the RDD.
Whether you use map or mapPartitions to create wordsRDDTextSplit, your sliding window doesn't operate on anything until you create the local data structure x.
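To illustrate that initialization-cost point, here is a sketch (ExpensiveParser is a hypothetical class standing in for anything costly to construct, e.g. a compiled regex set or a loaded model):

class ExpensiveParser {
  def parse(word: String): String = word.toLowerCase  // stand-in for real work
}

// with map: one parser is constructed for every single record
val perRecord = wordsRdd.map { line =>
  val parser = new ExpensiveParser
  parser.parse(line)
}

// with mapPartitions: one parser per partition, reused for every record in it
val perPartition = wordsRdd.mapPartitions { lines =>
  val parser = new ExpensiveParser
  lines.map(parser.parse)
}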
I have a flat file which contains several million lines like the one below:
59, 254, 2016-09-09T00:00, 1, 6, 3, 40, 18, 0
I want to process this file in batches of X rows at a time, so I wrote this code:
import scala.io.Source

def func(x: Int) = {
  for {
    batches <- Source.fromFile("./foo.txt").getLines().sliding(x, x)
  } yield batches.map("(" + _ + ")").mkString(",")
}
func(2).foreach(println)
This code produces exactly the output I want: the function walks through the entire file, taking 2 rows at a time and batching them into one string.
(59, 828, 2016-09-09T00:00, 0, 8, 2, 52, 0, 0),(59, 774, 2016-09-09T00:00, 0, 10, 2, 51, 0, 0)
But when I see Scala pros write code, everything happens inside the for comprehension and you just yield the last thing from your comprehension.
So, in order to be a Scala pro, I changed my code to:
for {
  batches <- Source.fromFile("./foo.txt").getLines().sliding(2, 2)
  line <- batches.map("(" + _ + ")").mkString(",")
} yield line
This produces one character per line and not the output I expected. Why did the code's behavior totally change? At least on reading, they look the same to me.
In the line line <- batches.map("(" + _ + ")").mkString(","), the right-hand side produces a String (the result of mkString), and the loop iterates over this string. When you iterate over a string, the individual items are characters, so in your case line is going to be a character. What you actually want is not to iterate over that string, but to assign it to the variable name line, which you can do by replacing the <- with =: line = batches.map("(" + _ + ")").mkString(",").
By the way, sliding(2,2) can be more clearly written as grouped(2).
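Putting the two suggestions together (the binding name joined is arbitrary), a sketch of the comprehension you were aiming for would be:

import scala.io.Source

def func(x: Int) = {
  for {
    batch <- Source.fromFile("./foo.txt").getLines().grouped(x)
    joined = batch.map("(" + _ + ")").mkString(",")
  } yield joined
}

func(2).foreach(println)

Here joined is bound with =, so the comprehension does not iterate over its characters, and grouped(x) replaces sliding(x, x).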
#dhg has given the explanation; here's my suggestion on how this could be done in another way:
for {
  batches <- Source.fromFile("./foo.txt").getLines().sliding(2, 2)
  batch = batches.map("(" + _ + ")")
} yield batch.mkString(",")
so batch would be a traversable consisting of the 2 parenthesized lines
I'm new to Scala and I want to write a higher-order function (say "partition2") that takes a list of integers and a function that returns either true or false. The output would be a list of values for which the function is true and a list of values for which the function is false. I'd like to implement this using a fold. I know something like this would be a really straightforward way to do this:
val (passed, failed) = List(49, 58, 76, 82, 88, 90) partition ( _ > 60 )
I'm wondering how this same logic could be applied using a fold.
You can start by thinking about what you want your accumulator to look like. In many cases it'll have the same type as the thing you want to end up with, and that works here—you can use two lists to keep track of the elements that passed and failed. Then you just need to write the cases and add the element to the appropriate list:
List(49, 58, 76, 82, 88, 90).foldRight((List.empty[Int], List.empty[Int])) {
  case (i, (passed, failed)) if i > 60 => (i :: passed, failed)
  case (i, (passed, failed))           => (passed, i :: failed)
}
I'm using a right fold here because prepending to a list is nicer than the alternative, but you could easily rewrite it to use a left fold.
You can do this:
List(49, 58, 76, 82, 88, 90).foldLeft((Vector.empty[Int], Vector.empty[Int])) {
  case ((passed, failed), x) =>
    if (x > 60) (passed :+ x, failed)
    else (passed, failed :+ x)
}
Basically you have two accumulators, and as you visit each element, you add it to the appropriate accumulator.
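If you want to wrap this up as the higher-order function described in the question (the signature below is just one possible shape), a sketch could be:

def partition2(xs: List[Int])(p: Int => Boolean): (List[Int], List[Int]) =
  xs.foldRight((List.empty[Int], List.empty[Int])) {
    case (i, (passed, failed)) =>
      if (p(i)) (i :: passed, failed) else (passed, i :: failed)
  }

val (passed, failed) = partition2(List(49, 58, 76, 82, 88, 90))(_ > 60)
// passed == List(76, 82, 88, 90), failed == List(49, 58)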