Lazily generate partial sums in Scala - scala

I want to produce a lazy list of partial sums and stop when I have found a "suitable" sum. For example, I want to do something like the following:
import scala.util.Random

val str = Stream.continually {
  val i = Random.nextInt()
  println("generated " + i)
  List(i)
}
val z = str
  .take(5)
  .scanLeft(List[Int]())(_ ++ _)
  .find(l => !l.forall(_ > 0))
This produces output like the following:
generated -354822103
generated 1841977627
z: Option[List[Int]] = Some(List(-354822103))
This is nice because I've avoided producing the entire list of lists before finding a suitable list. However, it's suboptimal because I generated one extra random number that I don't need (i.e., the second, positive number in this test run). I know I can hand-code a solution to do what I want, but is there a way to use the core Scala collection library to achieve this result without writing my own recursion?
The above example is just a toy, but the real application involves heavy-duty network traffic for each "retry" as I build up a map until the map is "complete".
EDIT: Note that even substituting take(1) for find(...) results in the generation of a random number, even though the returned value List() does not depend on that number. Does anyone know why the number is being generated in this case? I would think scanLeft should not need to fetch an element of the underlying iterable in this case.
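(For reference, a sketch of the same pipeline on Scala 2.13's LazyList, which, unlike Stream, is lazy in its head as well as its tail, so scanLeft should not need to force an element just to produce the initial List():)

import scala.util.Random

val z = LazyList
  .continually {
    val i = Random.nextInt()
    println("generated " + i)
    List(i)
  }
  .take(5)
  .scanLeft(List[Int]())(_ ++ _)
  .find(l => !l.forall(_ > 0))
// numbers should now be generated only as find demands them; no extra draw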

Related

Scala immutable list internal implementation

Suppose I have a huge list with elements from 1 to 1 million.
val initialList = List(1,2,3,.....1 million)
and
val myList = List(1,2,3)
Now, when I apply an operation such as foldLeft on myList, giving initialList as the starting value, such as
val output = myList.foldLeft(initialList)(_ :+ _)
// result ==>> List(1,2,3,.....1 million, 1 , 2 , 3)
Now here comes my question: both lists being immutable, the intermediate lists that were produced were
List(1,2,3,.....1 million, 1)
List(1,2,3,.....1 million, 1 , 2)
List(1,2,3,.....1 million, 1 , 2 , 3)
Because of immutability, a new list is created each time and the old one is discarded. So isn't this operation a performance killer in Scala, since a list of 1 million elements has to be copied every time to create the new list?
Please correct me if I am wrong as I am trying to understand the internal implementation of an immutable list.
Thanks in advance.
Yup, this is a performance killer, but it is the cost of having immutable structures (which are amazing and safe, and make programs much less buggy). That's why you should avoid appending to a list if you can. There are many tricks that can avoid this issue (for example, try to use accumulators).
For example:
Instead of:
val initialList = List(1,2,3,.....1 million)
val myList = List(1,2,3,...,100)
val output = myList.foldLeft(initialList)(_ :+ _)
You can write:
val initialList = List(1,2,3,.....1 million)
val myList = List(1,2,3,...,100)
val output = List(initialList,myList).flatten
flatten is implemented so that the first list is copied only once, instead of being copied for every single append.
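The accumulator trick mentioned above usually looks like this sketch (buildSquares is just an illustrative name): prepend in constant time inside the loop and reverse once at the end:

// builds List(1, 4, 9, ..., n*n) without O(n) appends:
// O(1) prepends plus one final O(n) reversal
def buildSquares(n: Int): List[Int] = {
  @annotation.tailrec
  def loop(i: Int, acc: List[Int]): List[Int] =
    if (i > n) acc.reverse
    else loop(i + 1, (i * i) :: acc)
  loop(1, Nil)
}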
P.S.
At least adding an element to the front of a list works fast (O(1)), because the old list can be shared. Picture several lists that share a common (b, c, d) tail:
Memory sharing means the computer keeps only one copy of that (b, c, d) tail. But if you want to append an element to the end of one of those lists, you cannot modify it in place, because you would destroy all the other lists sharing that tail. That's why the first list has to be copied.
Appending to a List is not a good idea because List has linear cost for appending. So, if you can,
either prepend to the List (List has constant-time prepend),
or choose another collection that is efficient for appending, such as a Queue.
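For instance, a minimal Queue sketch (reusing initialList and myList from the question; Queue.from assumes Scala 2.13):

import scala.collection.immutable.Queue

// enqueue is amortized O(1), unlike List's O(n) append
val output = myList.foldLeft(Queue.from(initialList))(_ enqueue _).toList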
For a list of the performance characteristics per operation on most Scala collections, see:
https://docs.scala-lang.org/overviews/collections/performance-characteristics.html
Note that, depending on your requirements, you may also build your own smarter collection, such as a chained iterable, for example.

views in collections in scala

I understand that a view is a lightweight collection and that it is lazy. I would like to understand what makes a view lightweight.
Say I have a list of 1000 random numbers. I'd like to find the even numbers in this list and pick only the first 10 of them. I believe using a view here is better because we can avoid creating an intermediate list, especially since I'll pick only the first 10 even numbers. Initially, I thought the optimization was achieved because the function I use in the filter method would not get executed until the method force is called, but I believe this isn't correct. I am struggling to understand what makes using the view better in this scenario. Or have I picked a wrong example?
val r = scala.util.Random
val l: List[Int] = List.tabulate(1000)(x => r.nextInt())
// without a view, I'll get an intermediate list. The function x % 2 == 0 will run on each element of l
val l1 = l.filter(x => (x % 2 == 0))
// this gives the size of l1. I got 508, but yours could be different depending on the random numbers generated in your case
l1.size
// pick the first 10 even numbers
val l2 = l1.take(10)
// using a view. I thought that x % 2 == 0 would not be executed right now
val lv1 = l.view.filter(x => (x % 2 == 0))
lv1: scala.collection.SeqView[Int,List[Int]] = SeqViewF(...)
lv1.size // this is the same as l1's size, so my assumption that x % 2 == 0 would not be executed seems wrong; otherwise lv1.size would not match l1.size
val lv2 = lv1.take(10).force
Question 1 - if I use a view, how is the processing optimised?
Question 2 - lv1 is of type SeqViewF. The F seems related to filter, but what does it mean?
Question 3 - what do the elements of lv1 look like (the elements of l1, for example, are integers)?
You wrote:
lv1.size // this is the same as l1's size, so my assumption that x % 2 == 0 would not be executed seems wrong; otherwise lv1.size would not match l1.size
Your assumption is actually correct; it's just that your means of measuring the difference is faulty.
val l: List[Int] = List.fill(10)(util.Random.nextInt) // ten random Ints
// print every Int that gets tested in the filter
val lv1 = l.view.filter { x => println(x); x % 2 == 0 } // no lines printed
lv1.size // ten Ints sent to STDOUT
So, as you see, taking the size of your view also forces its completion.
Yeah, that's not a very fitting example. What you are doing is better done with an iterator: list.iterator.filter(_ % 2 == 0).take(10). This doesn't create intermediate collections and does not scan the list past the first 10 even elements (a view wouldn't either; it's just a bit of an overcomplication for this case).
A view is a sequence of delayed operations. It has a reference to the collection, and a bunch of operations to be applied when it is forced. The way the operations are recorded is rather complicated, and not really important. You guessed right: SeqViewF means a view of a sequence with a filter applied. If you map over it, you'll get a SeqViewFM, etc.
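For instance (a REPL sketch against the pre-2.13 collection library, where these encoded names appear; 2.13 reworked the view classes):

scala> val lv2 = l.view.filter(_ % 2 == 0).map(_ + 1)
lv2: scala.collection.SeqView[Int,Seq[_]] = SeqViewFM(...)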
When would this be needed?
One example is when you need to "massage" a sequence that you are passing somewhere else. Suppose you have a function that somehow combines the elements of a sequence you pass in:
def combine(s: Seq[Int]) = s.iterator.zipWithIndex.map {
  case (x, i) if i % 2 == 0 => x
  case (x, _) => -x
}.sum
Now suppose you have a huge stream of numbers, and you want to combine only the even ones while dropping the others. You can use your existing function for that:
val result = combine(stream.view.filter(_ % 2 == 0))
Of course, if combine's parameter had been declared as an Iterator to begin with, you would not need the view at all, but that is not always possible; sometimes you just have to use some standard interface that wants a sequence.
Here is a fancier example that also takes advantage of the fact that the elements are computed on access:
def notifyUsers(users: Seq[User]) = users
  .iterator
  .filter(_.needsNotification)
  .foreach(_.notify)

timer.schedule(60 seconds) { notifyUsers(userIDs.view.map(getUser)) }
So, I have some ids of the users that may need to be notified of some external events. I have them stored in userIDs.
Every minute a task runs, that finds all users that need to be notified, and sends a notification to each of them.
Here is the trick: notifyUsers takes a collection of User as a parameter. But what we are really passing in is a view, composed of the initial set of user ids and a .map operation that gets the User object for each of them. As a result, every time the task runs, a new User object is obtained for each id (perhaps from the database), so if the needsNotification flag gets changed, the new value is picked up.
Surely, I could change notifyUsers to receive the list of ids and do getUser on its own instead, but that wouldn't be as neat. First, this way it is easier to unit-test: I can just pass a list of test objects directly in, without bothering to mock out getUser. And second, a generic utility like this is more useful: a User could be a trait, for example, that could represent many different domain objects.

Scala's collect inefficient in Spark?

I am currently starting to learn to use Spark with Scala. The problem I am working on requires me to read a file, split each line on a certain character, then filter the lines where one of the columns matches a predicate, and finally remove a column. So the basic, naive implementation is a map, then a filter, then another map.
This means going through the collection 3 times, which seemed quite unreasonable to me. So I tried replacing them with a single collect (the collect that takes a partial function as an argument). And much to my surprise, this made it run much slower. I tried it locally on regular Scala collections; as expected, the collect way is much faster there.
So why is that? My idea is that the map, filter, and map are not applied sequentially, but rather mixed into one operation; in other words, when an action forces evaluation, every element of the list is checked and the pending operations are executed. Is that right? But even so, why does collect perform so badly?
EDIT: a code example to show what I want to do:
The naive way:
sc.textFile(...).map(l => {
  val s = l.split(" ")
  (s(0), s(1))
}).filter(_._2.contains("hello")).map(_._1)
The collect way:
sc.textFile(...).collect {
  case s if s.split(" ")(1).contains("hello") => s.split(" ")(0)
}
The answer lies in the implementation of collect:
/**
 * Return an RDD that contains all matching values by applying `f`.
 */
def collect[U: ClassTag](f: PartialFunction[T, U]): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  filter(cleanF.isDefinedAt).map(cleanF)
}
As you can see, it's the same filter -> map sequence, but less efficient in your case.
In Scala, both the isDefinedAt and apply methods of a PartialFunction evaluate the guard (the if part).
So, in your "collect" example, the split is performed twice for each input element.
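A sketch of one way around the duplicated work (same assumptions as the question's code, with the path elided the same way): do the expensive split once in a map, and keep only a cheap guard in the partial function:

sc.textFile(...)                // path elided as in the question
  .map(_.split(" "))            // the expensive split happens once per line
  .collect { case cols if cols(1).contains("hello") => cols(0) }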

Avoid testing duplicate values with ScalaTest forAll

I'm playing with property-based testing on ScalaTest and I had the following code:
val myStrings = Gen.oneOf("hi", "hello")
forAll(myStrings) { s: String =>
  println(s"String tested: $s")
}
When I run the forAll code, I've noticed that the same value is tried more than once, e.g.
String tested: hi
String tested: hello
String tested: hi
String tested: hello
String tested: hi
String tested: hello
...
I was wondering if, given the code above, there is a way for each value in oneOf to be tried only once. In other words, to get ScalaTest not to use the same value twice.
Even if I used other generators, such as Gen.alphaStr, I'd like to find a way to avoid testing the same String twice. The reason I'm interested in this is that each test runs against a server running in a different process, so there's a bit of cost involved, and I'd like to avoid testing the same thing twice.
What you're trying to do seems to go against ScalaCheck's ideology (see Note 1); however, it's kind of possible (with high probability) by reducing the number of samples:
scala> forAll(oneOf("a", "b")){i => println(i); true}.check(Test.Parameters.default.withMinSuccessfulTests(2))
a
b
+ OK, passed 2 tests.
Note that you can still sometimes get a/a or b/b, as ScalaCheck is built on randomness and a statistical approach. If you need to always check all combinations, you probably don't need ScalaCheck:
scala> assert(Set("a", "b").forall(_ => true))
Basically, Gen allows you to create an infinite collection that represents a distribution of input values. The more values you generate, the better the sampling. So if you have N possible states, you can't guarantee that they won't repeat in an infinite collection.
The only way to do exactly what you want is to explicitly check for duplicates before calling the service. You can use something like Option(concurrentHashMap.putIfAbsent(value, value)).isEmpty for that. Keep in mind there is a risk of OOM, so be careful about the number of generated values and maybe even add an explicit check.
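A sketch of that dedup idea against the question's code (callServer stands in for the real per-test work and is purely illustrative):

import java.util.concurrent.ConcurrentHashMap

val seen = new ConcurrentHashMap[String, String]()

forAll(myStrings) { s: String =>
  // putIfAbsent returns null only the first time a key is inserted
  if (Option(seen.putIfAbsent(s, s)).isEmpty) {
    callServer(s) // hit the expensive server only for values not seen before
  }
}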
Note 1) What ScalaCheck is for is reducing the number of combinations from the maximum (usually far more than the default of 100 tests) to some value that still gives you a good check. So ScalaCheck is useful when the set of possible inputs is really huge, and in that case the probability of repetition is really small.
P.S.
Talking about oneOf (from scaladoc):
def oneOf[T](t0: T, t1: T, tn: T*): Gen[T]
Picks a random value from a list
See also (examples are a bit outdated): How can I reduce the number of test cases ScalaCheck generates?
I would aim to increase the entropy of the values. Using random sentences will increase it a lot, although it does not (theoretically) fix the issue.
val genWord = Gen.oneOf("hi", "hello")

def sentenceOf(words: Int): Gen[String] =
  Gen.listOfN(words, genWord).map(_.mkString(" "))
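Used, for example, like this (a sketch; with two base words and three-word sentences there are 2^3 = 8 possible values, so repeats become rarer):

forAll(sentenceOf(3)) { s: String =>
  println(s"String tested: $s")
}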

How to correctly get the current loop count from an Iterator in scala

I am looping over the following lines from a csv file to parse them. I want to identify the first line, since it's the header. What's the best way of doing this instead of keeping a counter in a var?
var counter = 0
for (line <- lines) {
  println(CsvParser.parse(line, counter))
  counter += 1
}
I know there has got to be a better way to do this; I'm a newbie to Scala.
Try zipWithIndex:
for (line <- lines.zipWithIndex) {
  println(CsvParser.parse(line._1, line._2))
}
#tenshi suggested the following improvement with pattern matching:
for ((line, count) <- lines.zipWithIndex) {
  println(CsvParser.parse(line, count))
}
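Since the goal is to spot the header, here is one more sketch (assuming lines is a strict collection such as a List; CsvParser is the question's parser):

val it = lines.iterator
// consume the header before the loop, then parse the rest with its index
val header = if (it.hasNext) Some(it.next()) else None
for ((line, count) <- it.zipWithIndex) {
  println(CsvParser.parse(line, count + 1)) // + 1 because index 0 was the header
}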
I totally agree with the given answer; still, I have to point out something important, and initially I planned to put it in a simple comment.
But it would be quite long, so let me set it out as a variant answer.
It's perfectly true that the zip* methods are helpful for building tables out of lists, but they have the counterpart that they traverse the lists in order to build them.
A common recommendation is therefore to sequence the operations required on the lists in a view, so that they are all combined and applied only when producing a result is required. "Producing a result" means the return value isn't another Iterable; foreach, for instance, is such an operation.
Now, talking about the first answer: if lines is the list of lines in a very big file (or even an enumeratee over it), zipWithIndex will go through all of them and produce a table (an Iterable of tuples). Then the for-comprehension will go through the same number of items again.
In the end, you've increased the running time by n, where n is the length of lines, and added a memory footprint of roughly m + n*16 bytes, where m is the lines' footprint.
Proposition
lines.view.zipWithIndex map Function.tupled(CsvParser.parse) foreach println
A few words more (I promise): lines.view will create something like a scala.collection.SeqView that holds all the further "mapping" functions that produce a new Iterable, as zipWithIndex and map are.
Moreover, I think the expression is more elegant because it follows the reader and the logic:
"For lines, create a view that will zip each item with its index; the result has to be mapped through the parser, whose result must be printed".
HTH.