Scala: mixing a view and a strict collection in a for expression

This piece of Scala code mixes a view with a strict List in a for expression:
val list = List.range(1, 4)
def compute(n: Int) = {
println("Computing "+n)
n * 2
}
val view = for (n <- list.view; k<-List(1,2)) yield compute(n)
val x = view(0)
The output is:
Computing 1
Computing 1
Computing 2
Computing 2
Computing 3
Computing 3
Computing 1
Computing 1
I expected only the last two lines ("Computing 1") in the output. Why does it compute all the values eagerly? And why does it then recompute the values?

Arguably, access by index forces the view to be computed. Also, notice that you flatMap the list with something that is not lazy (the inner List(1,2) is not a view).
Compare the following:
// 0) Your example
val v0 = List.range(1, 4).view.flatMap(n => List(1,2).map(k => compute(n)))
v0(0) // Computing 1
// Computing 1
// Computing 2
// Computing 2
// Computing 3
// Computing 3
// Computing 1
// Computing 1
v0(0) // Computing 1
// Computing 1
// 1) Your example, but access by head and not by index
val v1 = List.range(1, 4).view.flatMap(n => List(1,2).map(k => compute(n)))
v1.head // Computing 1
// Computing 1
// 2) Do not mix views and strict lists
val v2 = List.range(1, 4).view.flatMap(n => List(1,2).view.map(k => compute(n)))
v2(0) // Computing 1
Regarding example 0, notice that views are not like streams; while streams do cache their results, lazy views do not (they just compute lazily, i.e., by-need, on access). It seems that indexed-access requires computing the entire list, and then another computation is needed to actually access the element by index.
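The caching difference between views and streams can be checked with a small counter. This is only a sketch: the object and method names are made up for illustration, and it assumes Scala 2.12-style Stream (in 2.13, Stream still exists but is deprecated in favour of LazyList).

```scala
object ViewVsStreamSketch {
  private var calls = 0
  private def compute(n: Int): Int = { calls += 1; n * 2 }

  // Number of compute calls caused by reading the head of a view twice.
  def headTwiceOnView(): Int = {
    calls = 0
    val v = List(1, 2, 3).view.map(compute)
    val first = v.head   // computes the first element
    val again = v.head   // computes it again: views do not cache
    calls
  }

  // Number of compute calls caused by reading the head of a stream twice.
  def headTwiceOnStream(): Int = {
    calls = 0
    val s = Stream(1, 2, 3).map(compute)
    val first = s.head   // already evaluated and memoized
    val again = s.head   // served from the cache
    calls
  }
}
```

Under these assumptions headTwiceOnView() should return 2 and headTwiceOnStream() should return 1.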
You may ask why indexed access in example 2 does not compute the entire list. This requires an understanding of how things work underneath; in particular, the following stack-trace excerpts show how the method calls in example 0 differ from those in example 2:
Example 0
java.lang.Exception scala.collection.SeqViewLike$FlatMapped.$anonfun$index$1(SeqViewLike.scala:75)
at scala.collection.SeqViewLike$FlatMapped.index(SeqViewLike.scala:74)
at scala.collection.SeqViewLike$FlatMapped.index$(SeqViewLike.scala:71)
at scala.collection.SeqViewLike$$anon$5.index$lzycompute(SeqViewLike.scala:197)
at scala.collection.SeqViewLike$$anon$5.index(SeqViewLike.scala:197)
at scala.collection.SeqViewLike$FlatMapped.length(SeqViewLike.scala:84)
at scala.collection.SeqViewLike$FlatMapped.length$(SeqViewLike.scala:84)
at scala.collection.SeqViewLike$$anon$5.length(SeqViewLike.scala:197)
at scala.collection.SeqViewLike$FlatMapped.apply(SeqViewLike.scala:86)
at scala.collection.SeqViewLike$FlatMapped.apply$(SeqViewLike.scala:85)
at scala.collection.SeqViewLike$$anon$5.apply(SeqViewLike.scala:197)
at scala.collection.immutable.List.foreach(List.scala:389)
Computing 1
Example 2
java.lang.Exception scala.runtime.java8.JFunction1$mcII$sp.apply(JFunction1$mcII$sp.java:12)
at scala.collection.SeqViewLike$Mapped.apply(SeqViewLike.scala:67)
at scala.collection.SeqViewLike$Mapped.apply$(SeqViewLike.scala:67)
at scala.collection.SeqViewLike$$anon$4.apply(SeqViewLike.scala:196)
at scala.collection.SeqViewLike$FlatMapped.apply(SeqViewLike.scala:88)
at scala.collection.SeqViewLike$FlatMapped.apply$(SeqViewLike.scala:85)
at scala.collection.SeqViewLike$$anon$5.apply(SeqViewLike.scala:197)
at scala.collection.immutable.List.foreach(List.scala:389)
Computing 1
In particular, you see that example 0 results in a call to FlatMapped.length (which needs to evaluate the entire list).

view here is a SeqView[Int,Seq[_]], which is immutable and recomputes every item when iterated over.
You could access just the first element by explicitly using the .iterator:
# view.iterator.next
Computing 1
Computing 1
res11: Int = 2
Or explicitly make it a List (eg. if you need to reuse many entries):
# val view2: List[Int] = view.toList
Computing 1
Computing 1
Computing 2
Computing 2
Computing 3
Computing 3
view2: List[Int] = List(2, 2, 4, 4, 6, 6)
# view2(0)
res13: Int = 2

Related

How to make stateful API's pure

I'm learning functional programming partially by reading the book Functional Programming in Scala, a.k.a. The Red Book, and I've run into my first real blocker. In Chapter 6, the book uses the example of a random number generator to illustrate how to handle state changes without relying on side effects. Then the book goes on to demonstrate the patterns you would typically encounter as well as some of the tangents you might take in making a functional stateful API. My problem comes when trying to understand the following code:
type Rand[+A] = RNG => (A, RNG)
def map[A, B](s: Rand[A])(f: A => B): Rand[B] =
rng => {
val (nxt, nxtrng) = s(rng)
(f(nxt), nxtrng)
}
def nonNegativeLessThan(n: Int): Rand[Int] =
map(nonNegativeIntv2) { i =>
val mod = i % n
if (i + (n - 1) - mod >= 0) mod else nonNegativeLessThan(n)(???)
}
I hope this is enough context to get an idea of what the code does. This comes directly from the book, page 86. How does the if statement in the method nonNegativeLessThan filter out values of i that are greater than the largest multiple of n in a 32-bit integer? Doesn't the recursive call in the else clause return a value of type Rand[Int]? In general, I'm confused about what's going on in the code that is bolded. This is a pretty good book, so I'm happy with how things are going so far and I feel I've learned a lot about functional programming. My apologies if this question is ill-formed or if there is an issue with the formatting. This is my first post on Stack Overflow! Thank you to those who take a look at this; I hope it helps someone who has the same problem.
How does the if statement in the method nonNegativeLessThan filter out values of i that are greater than the largest multiple of n in a 32-bit integer?
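Note that i - mod is the largest multiple of n that is less than or equal to i. If the block of n values starting at that multiple does not fit entirely below Int.MaxValue, then i + (n - 1) - mod overflows and yields a negative number. The subsequent >= 0 is then false.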
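The overflow can be checked by hand. The sketch below isolates the condition from the question in a helper (the name accepted is made up here) and tries it with n = 1000000000, for which the last complete block of n values ends at 1999999999.

```scala
object OverflowCheckSketch {
  // Same condition as in the question: true when i can be used,
  // false when i falls into the incomplete last block and must be retried.
  def accepted(i: Int, n: Int): Boolean = {
    val mod = i % n
    i + (n - 1) - mod >= 0
  }

  val n = 1000000000

  val lastGood: Boolean   = accepted(1999999999, n)   // true: the block 1000000000..1999999999 fits
  val firstBad: Boolean   = accepted(2000000000, n)   // false: 2000000000 + 999999999 wraps to a negative Int
  val maxValueOk: Boolean = accepted(Int.MaxValue, n) // false for the same reason
}
```

Intermediate sums may wrap around, but the final result equals the true value modulo 2^32, so the check only comes out negative when the true value of i + (n - 1) - mod exceeds Int.MaxValue.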
Doesn't the recursive call in the else clause return a value of type Rand[Int]?
Well, nonNegativeLessThan(n) is indeed of type Rand[Int]. However, the code says nonNegativeLessThan(n)(???), that is, it applies nonNegativeLessThan to n, and then it applies the resulting value (of type Rand[Int], which is a function type) to ???, which would yield a pair containing an Int. Or rather, it would do that if ??? ever yielded a real value, which it doesn't: evaluating ??? throws NotImplementedError.
The problem here is that you would have to pass the state of the RNG instead of ???, but the map function doesn't let you access that state. You'll need flatMap to solve this – presumably that's what the book is going to discuss next.
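A minimal sketch of such a flatMap-based version follows. The RNG trait, SimpleRNG, unit and nonNegativeInt are written here as stand-ins so the example is self-contained; they are assumptions and may differ in detail from the book's definitions.

```scala
object RandSketch {
  trait RNG { def nextInt: (Int, RNG) }

  // Assumed stand-in generator: a simple linear congruential generator.
  final case class SimpleRNG(seed: Long) extends RNG {
    def nextInt: (Int, RNG) = {
      val newSeed = (seed * 0x5DEECE66DL + 0xBL) & 0xFFFFFFFFFFFFL
      ((newSeed >>> 16).toInt, SimpleRNG(newSeed))
    }
  }

  type Rand[+A] = RNG => (A, RNG)

  // Returns a constant without touching the generator state.
  def unit[A](a: A): Rand[A] = rng => (a, rng)

  val nonNegativeInt: Rand[Int] = rng => {
    val (i, next) = rng.nextInt
    (if (i < 0) -(i + 1) else i, next)
  }

  // Unlike map, f returns a Rand[B], which is then run with the updated state.
  def flatMap[A, B](s: Rand[A])(f: A => Rand[B]): Rand[B] = rng => {
    val (a, next) = s(rng)
    f(a)(next)
  }

  def nonNegativeLessThan(n: Int): Rand[Int] =
    flatMap(nonNegativeInt) { i =>
      val mod = i % n
      // Retry with the *next* state instead of needing ??? as an argument.
      if (i + (n - 1) - mod >= 0) unit(mod) else nonNegativeLessThan(n)
    }
}
```

For example, RandSketch.nonNegativeLessThan(6)(RandSketch.SimpleRNG(42L)) returns a pair of a value in 0 until 6 and the next generator state, and the same seed always gives the same pair.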
Imagine a small domain of integers, for example 1 to 10. The Rand type represents a random number in that domain, and the generator should not be skewed: no number should have a greater probability of being generated than the others.
If you want to generate a number that is non-negative and less than four, you can use the map function of the Rand type, which transforms the result of a state action. First it generates a non-negative number, then it transforms the result via map, applying the anonymous function _ % n to obtain a number less than n.
def nonNegativeLessThan(n: Int): Rand[Int] =
map(nonNegativeInt) { _ % n }
It uses the modulo operation:
scala> 1 % 4
res0: Int = 1
scala> 2 % 4
res1: Int = 2
scala> 3 % 4
res2: Int = 3
scala> 4 % 4
res3: Int = 0
scala> 5 % 4
res4: Int = 1
scala> 6 % 4
res5: Int = 2
scala> 7 % 4
res6: Int = 3
scala> 8 % 4
res7: Int = 0
scala> 9 % 4
res8: Int = 1
scala> 10 % 4
res9: Int = 2
As you can see, the largest multiple of 4 in this domain is 8, so if the generated non-negative number is 9 or 10, we have a problem: the probability of getting a 1 or a 2 is greater than that of getting a 3 or a 0. For that reason, the other implementation checks that the number first generated by nonNegativeInt is not greater than the largest multiple of n in the domain (the 32-bit Int domain in the book), so that the generator is unbiased.
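The skew in this small domain can be counted directly. The following sketch (object and value names are illustrative) groups the values 1 to 10 by their remainder modulo 4:

```scala
object ModuloBiasSketch {
  // For each remainder r, how many values in 1..10 satisfy value % 4 == r.
  val counts: Map[Int, Int] =
    (1 to 10).groupBy(_ % 4).map { case (r, xs) => r -> xs.size }
  // counts == Map(0 -> 2, 1 -> 3, 2 -> 3, 3 -> 2)
}
```

Remainders 1 and 2 are hit three times each, while 0 and 3 are hit only twice, which is exactly the bias the retry logic removes.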
And yes, that book is amazing.

How to trick Scala map method to produce more than one output per each input item?

A quite complex algorithm is applied to a list of Spark Dataset rows (the list was obtained using groupByKey and flatMapGroups). Most rows are transformed 1:1 from input to output, but some scenarios require more than one output per input. The input row schema can change at any time. map() fits the requirements quite well for the 1:1 transformation, but is there a way to use it to produce 1:n output?
The only workaround I found relies on the foreach method, which has the unpleasant overhead of creating the initial empty list (remember, unlike the simplified example below, the real-life list structure changes randomly).
My original problem is too complex to share here, but this example demonstrates the concept. Let's have a list of integers. Each should be transformed into its square value and if the input is even it should also transform into one half of the original value:
val X = Seq(1, 2, 3, 4, 5)
val y = X.map(x => x * x) //map is intended for 1:1 transformation so it works great here
val z = X.map(x => for(n <- 1 to 5) (n, x * x)) //this attempt FAILS: a for without yield returns Unit, so z holds five Unit values
// this work-around works, but newX definition is problematic
var newX = List[Int]() //in reality defining as head of the input list and dropping result's tail at the end
val za = X.foreach(x => {
newX = x*x :: newX
if(x % 2 == 0) newX = (x / 2) :: newX
})
newX
Is there a better way than foreach construct?
.flatMap produces any number of outputs from a single input.
val X = Seq(1, 2, 3, 4, 5)
X.flatMap { x =>
if (x % 2 == 0) Seq(x*x, x / 2) else Seq(x*x)
}
// res: Seq[Int] = List(1, 4, 1, 9, 16, 2, 25)
flatMap in more detail
In X.map(f), f is a function that maps each input to a single output. By contrast, in X.flatMap(g), the function g maps each input to a sequence of outputs. flatMap then takes all the sequences produced (one for each element in X) and concatenates them.
The neat thing is that .flatMap works not just for sequences, but also for other container-like types. For an Option, for instance, Option(x).flatMap(g) allows g to return an Option. Similarly, Future(x).flatMap(g) allows g to return a Future.
Whenever the number of elements you return depends on the input, you should think of flatMap.
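As a small illustration of the Option case, here is a sketch in which each step may or may not produce a value. The helper half is hypothetical and only serves the example.

```scala
object OptionFlatMapSketch {
  // Produces a result only for even inputs.
  def half(x: Int): Option[Int] =
    if (x % 2 == 0) Some(x / 2) else None

  // 8 -> 4 -> 2: both steps succeed.
  val twice: Option[Int] = Some(8).flatMap(half).flatMap(half)   // Some(2)

  // 6 -> 3, and 3 is odd, so the second step yields nothing.
  val stuck: Option[Int] = Some(6).flatMap(half).flatMap(half)   // None
}
```

The shape is the same as with sequences: the function returns zero or one results per input, and flatMap flattens them into a single Option.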

Scala: Side effects with collection transformations

I am trying to understand views in Scala via this link http://docs.scala-lang.org/overviews/collections/views.html.
I didn't understand what it means for collection transformations to have or not have side effects.
Thank you
A side effect here means that the code you run inside a collection transformation closes over some external state, or otherwise affects anything other than the result of the transformation. Example:
val l = List(1, 2, 3, 4).view.map(x => {println(x); x + 1})
When you execute this code it will print nothing, because the view delays the execution of map. Furthermore, each time you iterate over this view, map is executed again, so the values are printed more often than desired.
var counter = 0
val ll = for (i <- List(1, 2, 3, 4).view)
yield { counter += 1; i + 1}
println(counter) // 0
println(ll.toList) // this executes .force internally
println(counter) // 4
This behaves in the same way, but it is even more unexpected. counter only increases once the view is iterated, and since ll is lazy, that iteration may happen much deeper in the code, leaving counter equal to 0 until then.
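One way to keep such a side effect from running repeatedly is to force the view once and reuse the strict result. The sketch below (names are illustrative) compares traversing the view twice with traversing a forced List twice:

```scala
object ForceOnceSketch {
  // Returns (calls when the view is traversed twice,
  //          calls when the view is forced once and the List is traversed twice).
  def counts(): (Int, Int) = {
    var counter = 0

    val ll = List(1, 2, 3, 4).view.map { i => counter += 1; i + 1 }
    val first = ll.toList    // runs the function for all 4 elements
    val second = ll.toList   // runs it for all 4 elements again
    val viaView = counter

    counter = 0
    val strict = List(1, 2, 3, 4).view.map { i => counter += 1; i + 1 }.toList
    val sumA = strict.sum    // no further calls: strict is an ordinary List
    val sumB = strict.sum
    (viaView, counter)
  }
}
```

With these definitions counts() should return (8, 4).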
Scala has immutable collections (everything in scala.collection.immutable). These collection types have no operations that modify them in place, only operations that return modified copies.
So for example this
Set(1) + 2
will give you a new Set containing 1 and 2, not modify the first set. The same holds for transformations such as map, flatMap, filter, etc.
Views do not change anything about that. The only difference between a view and the collection it is based on is that (most) operations on views are lazy, i.e. intermediate results are not computed.
val l1 = List(1,2)
val l2 = List(1,2).map(x => x + 1) // a new List(2,3) is computed here
l2.foreach(println) // the elements of l2 are just printed
With views:
val v2 = l1.view.map(x => x + 1) // nothing is computed here
v2.foreach(println) // the values are computed step by step

transforming from native matrix format, scalding

This question is related to the question "Transforming matrix format, scalding".
But now I want to do the reverse operation. I can do it like this:
Tsv(in, ('row, 'col, 'v))
.read
.groupBy('row) { _.sortBy('col).mkString('v, "\t") }
.mapTo(('row, 'v) -> ('c)) { res : (Long, String) =>
val (row, v) = res
v }
.write(Tsv(out))
But here we get a problem with zeros. As we know, Scalding skips fields with zero values. So, for example, take this matrix:
1 0 8
4 5 6
0 8 9
In Scalding format it is:
1 1 1
1 3 8
2 1 4
2 2 5
2 3 6
3 2 8
3 3 9
Using the function I wrote above, we only get:
1 8
4 5 6
8 9
And that's incorrect. So how can I deal with it? I see two possible options:
Find a way to add the zeros (I don't actually know how to insert data)
Write my own operations on my own matrix format (not preferable, because I'm interested in the Scalding matrix operations and don't want to write all of them myself)
Maybe there are some methods that let me avoid skipping zeros in the matrix?
Scalding stores a sparse representation of the data. If you want to output a dense matrix (first of all, that won't scale, because the rows will be bigger than can fit in memory at some point), you will need to enumerate all the rows and columns:
// First, I highly suggest you use the TypedPipe api, as it is easier to get
// big jobs right generally
val mat = // has your matrix in 'row1, 'col1, 'val1
def zero: V = // the zero of your value type
val rows = IterableSource(0 to 1000, 'row)
val cols = IterableSource(0 to 2000, 'col)
rows.crossWithTiny(cols)
.leftJoinWithSmaller(('row, 'col) -> ('row1, 'col1), mat)
.map('val1 -> 'val1) { v: V =>
if(v == null) // this value should be 0 in your type:
zero
else
v
}
.groupBy('row) {
_.toList[(Int, V)](('col, 'val1) -> 'cols)
}
.map('cols -> 'cols) { cols: List[(Int, V)] =>
cols.sortBy(_._1).map(_._2).mkString("\t")
}
.write(TypedTsv[(Int, String)]("output"))
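To make the idea concrete without a cluster, here is a plain in-memory Scala sketch of the same steps: enumerate every (row, col) pair, look up the stored value, and fall back to zero. It deliberately does not use the Scalding API; the function name and the 1-based indexing are assumptions taken from the example in the question.

```scala
object DenseRowsSketch {
  // Sparse (row, col, value) triples from the question; zero entries are absent.
  val sparse: Seq[(Int, Int, Int)] = Seq(
    (1, 1, 1), (1, 3, 8),
    (2, 1, 4), (2, 2, 5), (2, 3, 6),
    (3, 2, 8), (3, 3, 9)
  )

  // Enumerates all cells, filling the missing ones with `zero`,
  // and renders each row as a tab-separated line.
  def toDenseLines(entries: Seq[(Int, Int, Int)], rows: Int, cols: Int, zero: Int = 0): Seq[String] = {
    val lookup: Map[(Int, Int), Int] =
      entries.map { case (r, c, v) => (r, c) -> v }.toMap
    (1 to rows).map { r =>
      (1 to cols).map(c => lookup.getOrElse((r, c), zero)).mkString("\t")
    }
  }
}
```

DenseRowsSketch.toDenseLines(DenseRowsSketch.sparse, 3, 3) yields the three lines "1 0 8", "4 5 6" and "0 8 9" (tab-separated), matching the original matrix.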

Using Streams for iteration in Scala

SICP says that iterative processes (e.g. Newton method of square root calculation, "pi" calculation, etc.) can be formulated in terms of Streams.
Does anybody use streams in Scala to model iterations?
Here is one way to produce the stream of approximations of pi:
val naturals = Stream.from(0) // 0, 1, 2, ...
val odds = naturals.map(_ * 2 + 1) // 1, 3, 5, ...
val oddInverses = odds.map(1.0d / _) // 1/1, 1/3, 1/5, ...
val alternations = Stream.iterate(1)(-_) // 1, -1, 1, ...
val products = (oddInverses zip alternations)
.map(ia => ia._1 * ia._2) // 1/1, -1/3, 1/5, ...
// Computes a stream representing the cumulative sum of another one
def sumUp(s : Stream[Double], acc : Double = 0.0d) : Stream[Double] =
Stream.cons(s.head + acc, sumUp(s.tail, s.head + acc))
val pi = sumUp(products).map(_ * 4.0) // Approximations of pi.
Now, say you want the 200th iteration:
scala> pi(200)
resN: Double = 3.1465677471829556
...or the 300000th:
scala> pi(300000)
resN : Double = 3.14159598691202
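The Newton square-root iteration mentioned in the question can be modelled the same way. This is a sketch under assumptions: the starting guess of 1.0, the tolerance, and the names are choices made here, and it expects x > 0.

```scala
object NewtonSqrtSketch {
  // Infinite stream of successively better guesses for sqrt(x).
  def guesses(x: Double): Stream[Double] =
    Stream.iterate(1.0)(g => (g + x / g) / 2)

  // Walks the stream until two consecutive guesses differ by at most tol.
  def sqrtApprox(x: Double, tol: Double = 1e-12): Double = {
    val gs = guesses(x)
    gs.zip(gs.tail)
      .dropWhile { case (a, b) => math.abs(a - b) > tol }
      .head._2
  }
}
```

The stream describes the whole iteration, and the stopping rule is applied separately by the consumer, which is the separation SICP is after.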
Streams are extremely useful when you are doing a sequence of recursive calculations and a single result depends on previous results, such as calculating pi. Here's a simpler example: consider the classic recursive algorithm for calculating Fibonacci numbers (1, 2, 3, 5, 8, 13, ...):
def fib(n: Int) : Int = n match {
case 0 => 1
case 1 => 2
case _ => fib(n - 1) + fib(n - 2)
}
The main point is that this code, while very simple, is extremely inefficient. fib(100) almost crashed my computer! Each recursion branches into two calls, so you end up calculating the same values many times.
Streams allow you to do dynamic programming in a recursive fashion, where once a value is calculated, it is reused every time it is needed again. To implement the above using streams:
val naturals: Stream[Int] = Stream.cons(0, naturals.map{_ + 1})
val fibs : Stream[Int] = naturals.map{
case 0 => 1
case 1 => 2
case n => fibs(n - 1) + fibs( n - 2)
}
fibs(1) //2
fibs(2) //3
fibs(3) //5
fibs(100) //1445263496
Whereas the recursive solution runs in O(2^n) time, the Streams solution runs in O(n^2) time, because each indexed lookup such as fibs(n - 1) walks the stream from its head. Since you only need the last 2 generated members, you can easily optimize this using Stream.drop so that the stream size doesn't overflow memory.
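A possible shape of that optimization is sketched below. It carries the last two members along as a pair and uses drop to reach the n-th one, so each step is constant work. The names are illustrative, and BigInt is used here because the Int versions above overflow long before n = 100.

```scala
object FibSketch {
  // Stream of consecutive pairs (fib(k), fib(k + 1)), starting from (1, 2).
  // Defined as a def so that no long-lived reference pins the whole stream in memory.
  def pairs: Stream[(BigInt, BigInt)] =
    Stream.iterate((BigInt(1), BigInt(2))) { case (a, b) => (b, a + b) }

  // n-th member of the sequence 1, 2, 3, 5, 8, 13, ...
  def fib(n: Int): BigInt = pairs.drop(n).head._1
}
```

For instance, FibSketch.fib(3) is 5 and FibSketch.fib(10) is 144, and fib(100) is computed in a linear number of steps.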