Scala's collect inefficient in Spark?

I am currently starting to learn to use Spark with Scala. The problem I am working on requires me to read a file, split each line on a certain character, filter the lines where one of the columns matches a predicate, and finally remove a column. So the basic, naive implementation is a map, then a filter, then another map.
This means going through the collection three times, which seemed quite unreasonable to me. So I tried replacing them with a single collect (the collect that takes a partial function as an argument). Much to my surprise, this made it run much slower. I tried it locally on regular Scala collections; as expected, the collect version is much faster there.
So why is that? My idea is that the map, filter, and map are not applied as three separate passes, but rather fused into one operation; in other words, when an action forces evaluation, every element of the list is checked and the pending operations are executed in a single pass. Is that right? But even so, why does collect perform so badly?
EDIT: a code example to show what I want to do:
The naive way:
sc.textFile(...).map(l => {
  val s = l.split(" ")
  (s(0), s(1))
}).filter(_._2.contains("hello")).map(_._1)
The collect way:
sc.textFile(...).collect {
  case l if l.split(" ")(1).contains("hello") => l.split(" ")(0)
}

The answer lies in the implementation of collect:
/**
 * Return an RDD that contains all matching values by applying `f`.
 */
def collect[U: ClassTag](f: PartialFunction[T, U]): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  filter(cleanF.isDefinedAt).map(cleanF)
}
As you can see, it's the same sequence of filter->map, but less efficient in your case.
In Scala, both the isDefinedAt and apply methods of a PartialFunction evaluate the if guard of the pattern.
So, in your "collect" example, split is performed twice for each input element: once when filter calls isDefinedAt, and again when map calls apply, which re-evaluates the guard before running the body (and matching lines pay for a further split in the case body itself).
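If you want a single pass with a single split per line, one alternative (a sketch, not from the original answer) is flatMap with an Option:
sc.textFile(...).flatMap { l =>
  // split once; keep the first column only when the second column matches
  val s = l.split(" ")
  if (s(1).contains("hello")) Some(s(0)) else None
}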

Related

How does LazyList.fill(n) actually work in Scala?

I'm trying to understand how LazyList.fill actually works. I implemented retry logic using LazyList.fill(n), but it seems like it is not working as expected.
import scala.util.Try

def retry[T](n: Int)(block: => T): Try[T] = {
  val lazyList = LazyList.fill(n)(Try(block))
  lazyList find (_.isSuccess) getOrElse lazyList.head
}
Considering the above piece of code, I am trying to execute block with retry logic. If the execution succeeds, return the result from block; otherwise retry until it succeeds, for a maximum of n attempts.
Is it that LazyList evaluates the first element, and if the predicate returns true, it skips the evaluation of the remaining elements in the list?
As I already mentioned in the comment, this is exactly what a LazyList is supposed to do.
The elements of a LazyList are materialized/computed only when there is demand from an actual consumer.
And the find method of LazyList respects this laziness. It is clearly stated in the documentation as well: https://www.scala-lang.org/api/2.13.x/scala/collection/immutable/LazyList.html#find(p:A=%3EBoolean):Option[A]
def find(p: (A) => Boolean): Option[A]
// Finds the first element of the lazy list satisfying a predicate, if any.
// Note: may not terminate for infinite-sized collections.
// This method does not evaluate any elements further than the first element matching the predicate.
So, if the first element succeeds, find will stop at the first element itself.
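A quick way to see this (a sketch; flaky and the attempts counter are illustrative stand-ins, using the retry method defined above):
var attempts = 0
def flaky(): Int = {
  attempts += 1
  if (attempts < 2) throw new RuntimeException("boom") else 42
}

val result = retry(5)(flaky())
// result == Success(42) and attempts == 2:
// the remaining three elements of the LazyList were never evaluated.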
Also, if you are writing a retry method then you probably want to stop at the first success. Why would you want to continue evaluating the block even after a success?
You might want to better clarify your exact requirements to get a more helpful answer.

Scala break out of map

I need to break out of a Seq map when a condition is met, something like this, where foo would return a list of objects whose size depends on how long it takes to find the targetId:
def foo(ids: Seq[String], targetId: String) = ids.map(id => getObject(id)).until(id == targetId)
Obviously the until method does not exist, but I am looking for something that does the equivalent.
No need to create intermediate stream/iterator/view.
Just call takeWhile first:
ids.takeWhile(_ != targetId).map(getObject)
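For illustration, with a stubbed-out getObject (the stub is an assumption; the question does not show its implementation):
def getObject(id: String): String = { println(s"fetching $id"); s"obj-$id" }

Seq("a", "b", "target", "c").takeWhile(_ != "target").map(getObject)
// fetching a
// fetching b
// returns List(obj-a, obj-b); getObject is never called for "target" or "c"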
There are 2 ways I use:
1) replace map with a recursive call that processes things in a certain way. Pretty handy if there are some complex side effects.
2) use Stream or Iterator with takeWhile to evaluate its elements lazily and terminate once the condition is met. I would go with this variant since it is close to the first option but much more concise; a sketch follows below.
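A sketch of the second option (getObject is assumed from the question):
// Iterator is lazy: getObject runs only for ids before targetId,
// and traversal stops as soon as targetId is seen.
def foo(ids: Seq[String], targetId: String) =
  ids.iterator.takeWhile(_ != targetId).map(getObject).toSeq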
Since the RDD I was playing with was small, I achieved the same using take(n).

Lazily generate partial sums in Scala

I want to produce a lazy list of partial sums and stop when I have found a "suitable" sum. For example, I want to do something like the following:
import scala.util.Random

val str = Stream.continually {
  val i = Random.nextInt
  println("generated " + i)
  List(i)
}

val z = str
  .take(5)
  .scanLeft(List[Int]())(_ ++ _)
  .find(l => !l.forall(_ > 0))
This produces output like the following:
generated -354822103
generated 1841977627
z: Option[List[Int]] = Some(List(-354822103))
This is nice because I've avoided producing the entire list of lists before finding a suitable list. However, it's suboptimal because I generated one extra random number that I don't need (i.e., the second, positive number in this test run). I know I can hand-code a solution to do what I want, but is there a way to use the core Scala collection library to achieve this result without writing my own recursion?
The above example is just a toy, but the real application involves heavy-duty network traffic for each "retry" as I build up a map until the map is "complete".
EDIT: Note that even substituting take(1) for find(...) results in the generation of a random number, even though the returned value List() does not depend on the number. Does anyone know why the number is being generated in this case? I would think scanLeft does not need to fetch an element of its receiver in this case.
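For comparison, the hand-coded recursion alluded to above might look like this (a sketch; it generates exactly one number per step and stops as soon as the accumulated list is suitable):
import scala.annotation.tailrec
import scala.util.Random

@tailrec
def findSuitable(remaining: Int, acc: List[Int]): Option[List[Int]] =
  if (!acc.forall(_ > 0)) Some(acc)  // suitable: a non-positive value appeared
  else if (remaining == 0) None      // attempt budget exhausted, no suitable prefix
  else {
    val i = Random.nextInt
    println("generated " + i)
    findSuitable(remaining - 1, acc :+ i)
  }

findSuitable(5, Nil)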

summing a transformation of a list of numbers in scala

I frequently need to sum the transformation of a list of numbers in Scala. One way to do this of course is:
list.map(transform(_)).sum
However, this allocates an intermediate list when no allocation is required. An alternative is to fold over the list.
list.foldLeft(0.0) { (total, x) => total + transform(x) }
I find the first expression far easier to write than the second expression. Is there a method I can use that has the ease of the first with the efficiency of the second? Or am I better off writing my own implicit method?
list.mapSum(transform(_))
You can use a view to make your transformer methods (map, filter, ...) lazy.
So for example in your case of a method called transform, you would write
list.view.map(transform).sum
(note you can optionally omit the (_))
This operation is called foldMap, and you can find it in Scalaz.
list foldMap transform
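If a Scalaz dependency is overkill, a hand-rolled version of the mapSum extension imagined in the question could look like this (a sketch; the generic Numeric signature is an assumption):
// Extension method: sums f(x) over the list without building
// an intermediate collection.
implicit class MapSumOps[A](val xs: List[A]) extends AnyVal {
  def mapSum[B](f: A => B)(implicit num: Numeric[B]): B =
    xs.foldLeft(num.zero)((total, x) => num.plus(total, f(x)))
}

List(1.0, 2.0, 3.0).mapSum(math.sqrt)  // ≈ 4.146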

Use-cases for Streams in Scala

In Scala there is a Stream class that is very much like an iterator. The topic Difference between Iterator and Stream in Scala? offers some insights into the similarities and differences between the two.
Seeing how to use a stream is pretty simple, but I don't have very many common use-cases where I would use a stream instead of other constructs.
The ideas I have right now:
If you need to make use of an infinite series. But this does not seem like a common use-case to me so it doesn't fit my criteria. (Please correct me if it is common and I just have a blind spot)
If you have a series of data where each element needs to be computed but that you may want to reuse several times. This is weak because I could just load it into a list which is conceptually easier to follow for a large subset of the developer population.
Perhaps there is a large set of data or a computationally expensive series, and there is a high probability that the items you need will not require visiting all of the elements. But in this case an Iterator would be a good match, unless you need to do several searches, in which case you could use a list as well, even if it would be slightly less efficient.
There is a complex series of data that needs to be reused. Again, a list could be used here, although in this case both options would be about equally difficult to use, and a Stream would be a better fit since not all elements need to be loaded. But again, not that common... or is it?
So have I missed any big uses? Or is it a developer preference for the most part?
Thanks
The main difference between a Stream and an Iterator is that the latter is mutable and "one-shot", so to speak, while the former is not. Iterator has a better memory footprint than Stream, but the fact that it is mutable can be inconvenient.
Take this classic prime number generator, for instance:
def primeStream(s: Stream[Int]): Stream[Int] =
  Stream.cons(s.head, primeStream(s.tail filter { _ % s.head != 0 }))

val primes = primeStream(Stream.from(2))
It can easily be written with an Iterator as well, but an Iterator won't keep the primes computed so far.
So, one important aspect of a Stream is that you can pass it to other functions without having it duplicated first, or having to generate it again and again.
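For instance (a small sketch): because a Stream memoizes, repeated traversals reuse the values computed so far instead of regenerating them:
primes.take(5).toList  // List(2, 3, 5, 7, 11), computed on the first traversal
primes.take(5).toList  // List(2, 3, 5, 7, 11), served from the memoized cells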
As for expensive computations/infinite lists, these things can be done with Iterator as well. Infinite lists are actually quite useful -- you just don't know it because you haven't had them, so you have seen algorithms that are more complex than strictly necessary just to deal with enforced finite sizes.
In addition to Daniel's answer, keep in mind that Stream is useful for short-circuiting evaluations. For example, suppose I have a huge set of functions that take String and return Option[String], and I want to keep executing them until one of them works:
val stringOps = List(
  (s: String) => if (s.length > 10) Some(s.length.toString) else None,
  (s: String) => if (s.length == 0) Some("empty") else None,
  (s: String) => if (s.indexOf(" ") >= 0) Some(s.trim) else None
)
Well, I certainly don't want to execute the entire list, and there isn't any handy method on List that says, "treat these as functions and execute them until one of them returns something other than None". What to do? Perhaps this:
def transform(input: String, ops: List[String => Option[String]]) = {
  ops.toStream.map(_(input)).find(_.isDefined).getOrElse(None)
}
This takes a list and treats it as a Stream (which doesn't actually evaluate anything yet), then defines a new Stream that is the result of applying the functions (which doesn't evaluate anything either), then searches for the first result which is defined -- and here, magically, it looks back and realizes it has to apply the map and get the right data from the original list -- and then unwraps the Option[Option[String]] to Option[String] using getOrElse.
Here's an example:
scala> transform("This is a really long string",stringOps)
res0: Option[String] = Some(28)
scala> transform("",stringOps)
res1: Option[String] = Some(empty)
scala> transform(" hi ",stringOps)
res2: Option[String] = Some(hi)
scala> transform("no-match",stringOps)
res3: Option[String] = None
But does it work? If we put a println into our functions so we can tell if they're called, we get
val stringOps = List(
  (s: String) => { println("1"); if (s.length > 10) Some(s.length.toString) else None },
  (s: String) => { println("2"); if (s.length == 0) Some("empty") else None },
  (s: String) => { println("3"); if (s.indexOf(" ") >= 0) Some(s.trim) else None }
)
// (transform is the same)
scala> transform("This is a really long string",stringOps)
1
res0: Option[String] = Some(28)
scala> transform("no-match",stringOps)
1
2
3
res1: Option[String] = None
(This is with Scala 2.8; 2.7's implementation will sometimes overshoot by one, unfortunately. And note that you do accumulate a long list of None as your failures accrue, but presumably this is inexpensive compared to your true computation here.)
I could imagine that if you poll some device in real time, a Stream is more convenient.
Think of a GPS tracker, which returns the current position when you ask it. You can't precompute the location where you will be in 5 minutes. You might use it for a few minutes only, to update a path in OpenStreetMap, or you might use it for an expedition over six months in a desert or the rain forest.
Or a digital thermometer or other kinds of sensors which repeatedly return new data, as long as the hardware is alive and turned on - a log file filter could be another example.
Stream is to Iterator as immutable.List is to mutable.List. Favouring immutability prevents a class of bugs, occasionally at the cost of performance.
scalac itself isn't immune to these problems: http://article.gmane.org/gmane.comp.lang.scala.internals/2831
As Daniel points out, favouring laziness over strictness can simplify algorithms and make it easier to compose them.