I've recently run into a bug in my code in which iterating over multiple streams causes them to iterate only through the first item. I converted my streams to buffers (I wasn't even aware that the implementation of the function I was calling returns a stream) and the problem was fixed. I found this hard to believe, so I created a minimal verifiable example:
def f(as: Seq[String], bs: Seq[String]): Unit =
  for {
    a <- as
    b <- bs
  } yield println((a, b))
val seq = Seq(1, 2, 3).map(_.toString)
f(seq, seq)
println()
val stream = Stream.iterate(1)(_ + 1).map(_.toString).take(3)
f(stream, stream)
This is a function that prints every combination of its inputs, invoked once with the Seq [1, 2, 3] and once with the Stream [1, 2, 3].
The result with the seq is:
(1,1)
(1,2)
(1,3)
(2,1)
(2,2)
(2,3)
(3,1)
(3,2)
(3,3)
And the result with the stream is:
(1,1)
I've only been able to replicate this when iterating through multiple generators; iterating through a single stream seems to work fine.
So my questions are: why does this happen, and how can I avoid this kind of glitch? That is, short of using .toBuffer or .to[Vector] before every multi-generator iteration?
Thanks.
The manner in which you're using the for-comprehension (with the println in the yield) is a bit strange and probably not what you want to do. If you really just want to print out the entries, use foreach instead of yield. This will force lazy sequences like Stream, i.e.
def f_strict(as: Seq[String], bs: Seq[String]): Unit = {
  for {
    a <- as
    b <- bs
  } println((a, b))
}
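For example, calling it with the stream from your snippet prints all nine pairs, because a for-comprehension without a yield desugars to foreach, which traverses the whole stream:

f_strict(stream, stream)
// (1,1)
// (1,2)
// ...
// (3,3)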
The reason you're getting the strange behavior with your f is that Streams are lazy: elements are only computed (and then memoized) as needed. Since you never use the Stream created by f (necessarily so, because your f returns Unit), only its head ever gets computed, which is why you're seeing the single (1,1). If you were instead to have it return the sequence it generated (which will have type Seq[Unit]), i.e.
def f_new(as: Seq[String], bs: Seq[String]): Seq[Unit] = {
  for {
    a <- as
    b <- bs
  } yield println((a, b))
}
then you'll get the following behavior, which should hopefully help elucidate what's going on:
val xs = Stream(1, 2, 3)
val result = f_new(xs.map(_.toString), xs.map(_.toString))
//prints out (1, 1) as a result of evaluating the head of the resulting Stream
result.foreach(aUnit => {})
//prints out the other elements as the rest of the entries of Stream are computed, i.e.
//(1,2)
//(1,3)
//(2,1)
//...
result.foreach(aUnit => {})
//prints nothing this time: the elements of the Stream have already been
//computed and memoized, so they are not computed again.
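If you would rather keep the yield, another option is to make the laziness explicit in the result type and force it once; memoization then guarantees nothing is recomputed afterwards. A minimal sketch using Scala 2's Stream#force:

val pairs: Stream[Unit] =
  for {
    a <- Stream(1, 2, 3).map(_.toString)
    b <- Stream(1, 2, 3).map(_.toString)
  } yield println((a, b))
// building the stream evaluates its head, printing (1,1)

pairs.force // evaluates the remaining elements, printing the other eight pairs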
Related
I have an RDD with strings like this (ordered in a specific way):
["A","B","C","D"]
And another RDD with lists like this:
["C","B","F","K"],
["B","A","Z","M"],
["X","T","D","C"]
I would like to order the elements in each list in the second RDD based on the order in which they appear in the first RDD. The order of the elements that do not appear in the first list is not of concern.
From the above example, I would like to get an RDD like this:
["B","C","F","K"],
["A","B","Z","M"],
["C","D","X","T"]
I know I am supposed to use a broadcast variable to broadcast the first RDD as I process each list in the second RDD. But I am very new to Spark/Scala (and functional programming in general) so I am not sure how to do this.
I am assuming that the first RDD is small, since you talk about broadcasting it. In that case you are right: broadcasting the ordering is a good way to solve your problem.
// generating data
val ordering_rdd = sc.parallelize(Seq("A","B","C","D"))
val other_rdd = sc.parallelize(Seq(
  Seq("C","B","F","K"),
  Seq("B","A","Z","M"),
  Seq("X","T","D","C")
))
// let's start by collecting the ordering onto the driver
val ordering = ordering_rdd.collect()
// Let's broadcast the list:
val ordering_br = sc.broadcast(ordering)
// Finally, let's use the ordering to sort your records:
val result = other_rdd
  .map(_.sortBy(x => {
    val index = ordering_br.value.indexOf(x)
    if (index == -1) Int.MaxValue else index
  }))
Note that indexOf returns -1 if the element is not found in the list. If we left it at that, all the non-found elements would end up at the beginning. I understand that you want them at the end, so I replace -1 with a large value.
Printing the result:
scala> result.collect().foreach(println)
List(B, C, F, K)
List(A, B, Z, M)
List(C, D, X, T)
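As a side note, indexOf does a linear scan of the broadcast array for every element being sorted. If the ordering grows large, the same idea works with a precomputed index map; a sketch reusing the variables defined above:

val index_br = sc.broadcast(ordering.zipWithIndex.toMap)
val result2 = other_rdd.map(_.sortBy(x => index_br.value.getOrElse(x, Int.MaxValue)))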
I have a sequence of values of type A that I want to transform to a sequence of type B.
Some of the elements of type A can be converted to a B on their own; however, some other elements need to be combined with the immediately previous element to produce a B.
I see it as a small state machine with two states: the first handles the transformation from A to B when only the current A is needed, or saves the A and moves to the second state when the next element is needed; the second combines the saved A with the new A to produce a B and then goes back to state one.
I'm trying to use scalaz's Iteratees but I fear I'm overcomplicating it, and I'm forced to return a dummy B when the input has reached EOF.
What's the most elegant solution to do it?
What about invoking the sliding() method on your sequence?
You might have to put a dummy element at the head of the sequence so that the first element (the real head) is evaluated/converted correctly.
If you map() over the result from sliding(2) then map will "see" every element with its predecessor.
val input: Seq[A] = ??? // real data here (no dummy values)
val output: Seq[B] = (dummy +: input).sliding(2).flatMap(a2b).toSeq

def a2b(arg: Seq[A]): Seq[B] = {
  // arg holds 2 elements: the previous element and the current one;
  // return a Seq of zero or more B's
  ???
}
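To make this concrete, here is a minimal hypothetical instance of the idea; the Complete/Incomplete types and the merge rule are invented for illustration, and it assumes two merge-needing elements never appear back to back:

sealed trait A { def s: String }
case class Complete(s: String) extends A   // convertible on its own
case class Incomplete(s: String) extends A // must merge with the element after it

case class B(value: String)

val dummy: A = Complete("") // sentinel so the real head gets its own window

// Decide per window (previous, current): the current element is either
// consumed by the Incomplete before it, emitted on its own, or deferred
// to the next window (when it is itself Incomplete).
def a2b(window: Seq[A]): Seq[B] = window match {
  case Seq(Incomplete(p), cur) => Seq(B(p + cur.s)) // merge the saved prefix
  case Seq(_, Complete(s))     => Seq(B(s))         // stands alone
  case _                       => Seq.empty         // wait for the next window
}

val input: Seq[A] = Seq(Complete("a"), Incomplete("b"), Complete("c"))
val output: Seq[B] = (dummy +: input).sliding(2).flatMap(a2b).toSeq
// output contains B(a) and B(bc)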
Taking a stab at it:
Partition your list into two lists. The first is the one you can directly convert and the second is the one that you need to merge.
scala> val l = List("String", 1, 4, "Hello")
l: List[Any] = List(String, 1, 4, Hello)
scala> val (string, int) = l partition { case s:String => true case _ => false}
string: List[Any] = List(String, Hello)
int: List[Any] = List(1, 4)
Replace the logic in the partition block with whatever you need.
After you have the two lists, you can do whatever you need to with your second using something like this
scala> string ::: int.collect{case i:Integer => i}.sliding(2).collect{
| case List(a, b) => a+b.toString}.toList
res4: List[Any] = List(String, Hello, 14)
You would replace the aggregation step with whatever your combining function is (note that a+b.toString here concatenates the digits into "14" rather than summing them, since + applied to a value and a String is string concatenation).
Hopefully this is helpful.
For example, how do you merge two Streams of sorted Integers? I thought it was very basic, but I just found that it's not trivial at all. The version below is not tail-recursive, and it will stack-overflow when the Streams are large.
def merge(as: Stream[Int], bs: Stream[Int]): Stream[Int] = {
  (as, bs) match {
    case (Stream.Empty, bss) => bss
    case (ass, Stream.Empty) => ass
    case (a #:: ass, b #:: bss) =>
      if (a < b) a #:: merge(ass, bs)
      else b #:: merge(as, bss)
  }
}
We may want to turn it into a tail-recursive one by introducing an accumulator. However, if we prepend the accumulator, we only get a stream in reversed order; and if we append the accumulator with concatenation (#:::), it is no longer lazy (it becomes strict).
What could be the solution here? Thanks
Turning a comment into an answer: there's nothing wrong with your merge.
It's not recursive at all: any one call to merge returns a new Stream without making any further call to merge. a #:: merge(ass, bs) returns a stream with first element a, where merge(ass, bs) will only be called to evaluate the rest of the stream when required.
So
val m = merge(Stream.from(1,2), Stream.from(2, 2))
//> m : Stream[Int] = Stream(1, ?)
m.drop(10000000).take(1)
//> res0: scala.collection.immutable.Stream[Int] = Stream(10000001, ?)
works just fine. No stack overflow.
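The key point is that #:: takes its tail by name, so constructing a cons cell never triggers the recursive call to merge. A minimal illustration with Scala 2's Stream:

def boom: Stream[Int] = sys.error("never evaluated")

val s = 1 #:: boom // no error: the tail is not evaluated here
s.head             // 1; only forcing s.tail would throw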
I have two lists that I zip, and I go through the zipped result calling a function. This function returns a List of Strings as a response. I now want to collect all the responses I get, and I do not want to use some sort of buffer that collects the responses on each iteration.
seq1.zip(seq2).foreach((x: (Obj1, Obj1)) => {
  callMethod(x._1, x._2) // This method returns a Seq of Strings when called
})
What I want to avoid is creating a ListBuffer and collecting into it. Any clues on how to do this functionally?
Why not use map() to transform each input into a corresponding output? Here's map() operating in a simple scenario:
scala> val l = List(1,2,3,4,5)
scala> l.map( x => x*2 )
res60: List[Int] = List(2, 4, 6, 8, 10)
so in your case it would look something like:
seq1.zip(seq2).map((x: (Obj1, Obj1)) => callMethod(x._1, x._2))
Given that your function returns a Seq of Strings, you could use flatMap() to flatten the results into one sequence.
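In your case that would look something like this (a sketch, assuming callMethod(a, b) returns the Seq[String] you describe):

val responses: Seq[String] =
  seq1.zip(seq2).flatMap { case (a, b) => callMethod(a, b) }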
Is it possible to use yield as an iterator without evaluation of every value?
It is a common situation: it is easy to implement a complex list generation, but then you need to convert it into an Iterator because you don't need some of the results...
Sure. Actually, there are three options for non-strictness, which I list below. For the examples, assume:
val list = List.range(1, 10)

def compute(n: Int) = {
  println("Computing " + n)
  n * 2
}
Stream. A Stream is a lazily evaluated list. It will compute values on demand, but it will not recompute values once they have been computed. It is most useful if you'll reuse parts of the stream many times. For example, running the code below will print "Computing 1", "Computing 2" and "Computing 3", one time each.
val stream = for (n <- list.toStream) yield compute(n)
val third = stream(2)
println("%d %d" format (third, stream(2)))
A view. A view is a composition of operations over a base collection. When examining a view, each element examined is computed on demand. It is most useful if you'll randomly access the view but only ever look at a small part of it. For example, running the code below will print "Computing 3" two times, and nothing else (well, besides the result).
val view = for (n <- list.view) yield compute(n)
val third = view(2)
println("%d %d" format (third, view(2)))
Iterator. An Iterator is something that is used to lazily walk through a collection; one can think of it as a "one-shot" collection, so to speak. It will neither recompute nor store any elements: once an element has been "computed", it cannot be used again. It is a bit trickier to use because of that, but it is the most efficient one given these constraints. The example needs to be written differently, because Iterator does not support indexed access (and a view would perform badly if written this way). The code below prints "Computing 1" through "Computing 6", and it prints two different numbers at the end, since each drop(2).next advances the iterator past three more elements.
val iterator = for (n <- list.iterator) yield compute(n)
val third = iterator.drop(2).next
println("%d %d" format (third, iterator.drop(2).next))
Use views if you want lazy evaluation; see Views.
The Scala 2.8 Collections API is a fantastic read if you're going to use the Scala collections a lot.
I have a List...
scala> List(1, 2, 3)
res0: List[Int] = List(1, 2, 3)
And a function...
scala> def foo(i : Int) : String = { println("Eval: " + i); i.toString + "Foo" }
foo: (i: Int)String
And now I'll use a for-comprehension with an Iterator...
scala> for { i <- res0.iterator } yield foo(i)
res2: Iterator[java.lang.String] = non-empty iterator
You can use a for-comprehension on any type that has map, flatMap and filter (or withFilter) methods. You could also use a view:
scala> for { i <- res0.view } yield foo(i)
res3: scala.collection.SeqView[String,Seq[_]] = SeqViewM(...)
Evaluation is non-strict in either case...
scala> res3.head
Eval: 1
res4: String = 1Foo