Stream of random values with Rng library - scala

I am looking through Rng sources to see how they generate a stream of random values.
def stream[AA >: A](s: Size): Rng[EphemeralStream[AA]] =
list(s) map (EphemeralStream(_: _*))
First of all I am confused with this signature. I need a generator of an infinite stream that yields random values lazily. So I do not need the size argument.
I am confused with the implementation too. The generator seems to create a list and then convert it to the stream. However the whole point of the stream is to be lazy. What am I missing ?

I don't know about Stream, but we can do something like this explicitly with scalaz-stream Process:
def infiniteRandomProcess: Process[Rng, Int] =
Process.await(Rng.int)({
i => Process.emit(i) ++ infiniteRandomProcess
})
Then you transform the lazy Process until you get something you're happy to run (or runLog or so on), and you end up with an Rng value (or perhaps a monad transformer that combines this with e.g. Task).
The API might allow converting this to a Stream; there's a lot of it and I tend to stay around the shallow end.

Related

Equivalent of Iterator.continually for an Iterable?

I need to produce an java.lang.Iterable[T] where the supply of T is some long running operation. In addition, after T is supplied it is wrapped and further computation is made to prepare for the next iteration.
Initially I thought I could do this with Iterator.continually. However, calling toIterable on the result actually creates a Stream[T] - with the problem being that the head is eagerly evaluated, which I don't want.
How can I either:
Create an Iterable[T] from a supplying function or
Convert an Iterator[T] into an Iterable[T] without using Stream?
In Scala 2.13, you can use LazyList:
LazyList.continually(1)
Unlike Stream, LazyList is also lazy in its head.
Because java.lang.Iterable is a very simple API, it's trivial to go from a scala.collection.Iterator to it.
case class IterableFromIterator[T](override val iterator:java.util.Iterator[T]) extends java.lang.Iterable[T]
val iterable:java.lang.Iterable[T] = IterableFromIterator(Iterator.continually(...).asJava)
Note this contradicts the expectation that iterable.iterator() produces a fresh Iterator each time; instead, iterable.iterator() can only be called once.

Lazy eager map evaluation

There are basically two options to evaluate a map in Scala.
Lazy evaluation computers the function that is passed as a parameter when the next value is needed. IF the function takes one hour to execute then it's one hour to wait when the value is needed. (e.g. Stream and Iterator)
Eager evaluation computes the function when the map is defined. It produces a new list (Vector or whatever) and stores the results, making the program to be busy in that time.
With Future we can obtain the list (Seq or whatever) in a separate thread, this means that our thread doesn't block, but the results have to be stored.
So I did something different, please check it here.
This was a while ago so I don't remember whether I tested it. The point is to have a map that applies concurrently (non-blocking) and kind of eagerly to a set of elements, filling a buffer (the size of the number of cores in the computer, and not more). This means that:
The invocation of the map doesn't block the current thread.
Obtaining an element doesn't block the current thread (in case there was time to calculate it before and store the result in the buffer).
Infinite lists can be handled because we only prefetch a few results (about 8, depending on the number of cores).
So this all sounds very good and you may be wondering what's the problem. The problem is that this solution is not particularly elegant IMHO. Assuming the code I shared works in Java and or Scala, to iterate over the elements in the iterable produced by the map I would only need to write:
new CFMap(whateverFunction).apply(whateverIterable)
However what I would like to write is something like:
whateverIterable.bmap(whateverFunction)
As it is usual in Scala (the 'b' is for buffered), or perhaps something like:
whateverThing.toBuffered.map(whateverFunction)
Either of them works for me. So the question is, how can I do this in an idiomatic way in Scala? Some options:
Monads: create a new collection "Buffered" so that I can use the toBuffered method (that should be added to the previous ones as an implicit) and implement map and everything else for this Buffered thing (sounds like quite some work).
Implicits: create an implicit method that transforms the usual collections or the superclass of them (I'm not sure about which one should it be, Iterable maybe?) to something else to which I can apply the .bmap method and obtain something from it, probably an iterable.
Other: there are probably many options I have not considered so far. It's possible that some library does already implement this (I'd be actually surprised of the opposite, I can't believe nobody thought of this before). Using something that has already been done is usually a good idea.
Please let me know if something is unclear.
What you're looking for is the "pimp-my-library" pattern. Check it out:
object CFMapExtensions {
import sanity.commons.functional.CFMap
import scala.collection.JavaConversions._
implicit class IterableExtensions[I](i: Iterable[I]) {
def bmap[O](f: Function1[I, O]): Iterable[O] = new CFMap(f).apply(asJavaIterable(i))
}
implicit class JavaIterableExtensions[I](i: java.lang.Iterable[I]) {
def bmap[O](f: Function1[I, O]): Iterable[O] = new CFMap(f).apply(i)
}
// Add an implicit conversion to a java function.
import java.util.function.{Function => JFunction}
implicit def toJFunction[I, O](f: Function1[I, O]): JFunction[I, O] = {
new JFunction[I, O]() {
def apply(t: I): O = f(t)
}
}
}
object Test extends App {
import CFMapExtensions._
List(1,2,3,4).bmap(_ + 5).foreach(println)
}

Processing multiple files in parallel with scalaz streams

I'm trying to use scalaz-stream to process multiple files simultaneously, applying a single function to each line in the files, across all the files.
For concreteness, suppose I have a function that takes a list of strings
def f(lines: Seq[String]): Something = ???
And a couple of files, the first:
foo1
foo2
foo3
the second:
bar1
bar2
bar3
The result of the whole process should be:
List(
f(Seq("foo1", "bar1")),
f(Seq("foo2", "bar2")),
f(Seq("foo3", "bar3"))
)
(or more likely written directly into some other file)
The number of files is not known beforehand, and the number of lines may vary between the different files, but I'm okay with padding (at runtime) the ends of the shorter files with a default values, or cutting out the longer files.
So essentially, I need a way to combine a Seq[Process[Task, String]] (obtained via something like io.linesR) into a single Process[Task, Seq[String]].
What would be the simplest way to achieve that?
Or, more generally, how do I combine n instances of Process[F, I] into a single instance Process[F, Seq[I]]?
I'm sure there's some standard combinator for this purpose, but I wasn't able to figure it out...
Thanks.
This combinator doesn't exist yet, but you could add it. I think it will be something like:
def zipN[F[_], A](xs: Seq[Process[F,A]]): Process[F,Seq[A]] =
if (xs.isEmpty) Process.halt
else xs.map(_ map (Vector(_))).reduceLeft(_.zipWith(_)(_ ++ _))
You could also add zipAllN, which takes a value to pad the sequences with (and which uses zipAll, and alignN, which allows streams to 'drop out' of the output process when they are exhausted. (So the output sequence may get shorter.)
I would suggest you implement it as a 'balanced' reduce rather than a left or right reduce, since it will be more efficient that way.
Please do submit a pull request + tests if you end up implementing this for real!

Scala Iterable Memory Leaks

I recently started playing with Scala and ran across the following. Below are 4 different ways to iterate through the lines of a file, do some stuff, and write the result to another file. Some of these methods work as I would think (though using a lot of memory to do so) and some eat memory to no end.
The idea was to wrap Scala's getLines Iterator as an Iterable. I don't care if it reads the file multiple times - that's what I expect it to do.
Here's my repro code:
class FileIterable(file: java.io.File) extends Iterable[String] {
override def iterator = io.Source.fromFile(file).getLines
}
// Iterator
// Option 1: Direct iterator - holds at 100MB
def lines = io.Source.fromFile(file).getLines
// Option 2: Get iterator via method - holds at 100MB
def lines = new FileIterable(file).iterator
// Iterable
// Option 3: TraversableOnce wrapper - holds at 2GB
def lines = io.Source.fromFile(file).getLines.toIterable
// Option 4: Iterable wrapper - leaks like a sieve
def lines = new FileIterable(file)
def values = lines
.drop(1)
//.map(l => l.split("\t")).map(l => l.reduceLeft(_ + "|" + _))
//.filter(l => l.startsWith("*"))
val writer = new java.io.PrintWriter(new File("out.tsv"))
values.foreach(v => writer.println(v))
writer.close()
The file it's reading is ~10GB with 1MB lines.
The first two options iterate the file using a constant amount of memory (~100MB). This is what I would expect. The downside here is that an iterator can only be used once and it's using Scala's call-by-name convention as a psuedo-iterable. (For reference, the equivalent c# code uses ~14MB)
The third method calls toIterable defined in TraverableOnce. This one works, but it uses about 2GB to do the same work. No idea where the memory is going because it can't cache the entire Iterable.
The fourth is the most alarming - it immediately uses all available memory and throws an OOM exception. Even weirder is that it does this for all of the operations I've tested: drop, map, and filter. Looking at the implementations, none of them seem to maintain much state (though the drop looks a little suspect - why does it not just count the items?). If I do no operations, it works fine.
My guess is that somewhere it's maintaining references to each of the lines read, though I can't imagine how. I've seen the same memory usage when passing Iterables around in Scala. For example if I take case 3 (.toIterable) and pass that to a method that writes an Iterable[String] to a file, I see the same explosion.
Any ideas?
Note how the ScalaDoc of Iterable says:
Implementations of this trait need to provide a concrete method with
signature:
def iterator: Iterator[A]
They also need to provide a method newBuilder which creates a builder
for collections of the same kind.
Since you don't provide an implementation for newBuilder, you get the default implementation, which uses a ListBuffer and thus tries to fit everything into memory.
You might want to implement Iterable.drop as
def drop(n: Int) = iterator.drop(n).toIterable
but that would break with the representation invariance of the collection library (i.e. iterator.toIterable returns a Stream, while you want List.drop to return a List etc - thus the need for the Builder concept).

Use-cases for Streams in Scala

In Scala there is a Stream class that is very much like an iterator. The topic Difference between Iterator and Stream in Scala? offers some insights into the similarities and differences between the two.
Seeing how to use a stream is pretty simple but I don't have very many common use-cases where I would use a stream instead of other artifacts.
The ideas I have right now:
If you need to make use of an infinite series. But this does not seem like a common use-case to me so it doesn't fit my criteria. (Please correct me if it is common and I just have a blind spot)
If you have a series of data where each element needs to be computed but that you may want to reuse several times. This is weak because I could just load it into a list which is conceptually easier to follow for a large subset of the developer population.
Perhaps there is a large set of data or a computationally expensive series and there is a high probability that the items you need will not require visiting all of the elements. But in this case an Iterator would be a good match unless you need to do several searches, in that case you could use a list as well even if it would be slightly less efficient.
There is a complex series of data that needs to be reused. Again a list could be used here. Although in this case both cases would be equally difficult to use and a Stream would be a better fit since not all elements need to be loaded. But again not that common... or is it?
So have I missed any big uses? Or is it a developer preference for the most part?
Thanks
The main difference between a Stream and an Iterator is that the latter is mutable and "one-shot", so to speak, while the former is not. Iterator has a better memory footprint than Stream, but the fact that it is mutable can be inconvenient.
Take this classic prime number generator, for instance:
def primeStream(s: Stream[Int]): Stream[Int] =
Stream.cons(s.head, primeStream(s.tail filter { _ % s.head != 0 }))
val primes = primeStream(Stream.from(2))
It can be easily be written with an Iterator as well, but an Iterator won't keep the primes computed so far.
So, one important aspect of a Stream is that you can pass it to other functions without having it duplicated first, or having to generate it again and again.
As for expensive computations/infinite lists, these things can be done with Iterator as well. Infinite lists are actually quite useful -- you just don't know it because you didn't have it, so you have seen algorithms that are more complex than strictly necessary just to deal with enforced finite sizes.
In addition to Daniel's answer, keep in mind that Stream is useful for short-circuiting evaluations. For example, suppose I have a huge set of functions that take String and return Option[String], and I want to keep executing them until one of them works:
val stringOps = List(
(s:String) => if (s.length>10) Some(s.length.toString) else None ,
(s:String) => if (s.length==0) Some("empty") else None ,
(s:String) => if (s.indexOf(" ")>=0) Some(s.trim) else None
);
Well, I certainly don't want to execute the entire list, and there isn't any handy method on List that says, "treat these as functions and execute them until one of them returns something other than None". What to do? Perhaps this:
def transform(input: String, ops: List[String=>Option[String]]) = {
ops.toStream.map( _(input) ).find(_ isDefined).getOrElse(None)
}
This takes a list and treats it as a Stream (which doesn't actually evaluate anything), then defines a new Stream that is a result of applying the functions (but that doesn't evaluate anything either yet), then searches for the first one which is defined--and here, magically, it looks back and realizes it has to apply the map, and get the right data from the original list--and then unwraps it from Option[Option[String]] to Option[String] using getOrElse.
Here's an example:
scala> transform("This is a really long string",stringOps)
res0: Option[String] = Some(28)
scala> transform("",stringOps)
res1: Option[String] = Some(empty)
scala> transform(" hi ",stringOps)
res2: Option[String] = Some(hi)
scala> transform("no-match",stringOps)
res3: Option[String] = None
But does it work? If we put a println into our functions so we can tell if they're called, we get
val stringOps = List(
(s:String) => {println("1"); if (s.length>10) Some(s.length.toString) else None },
(s:String) => {println("2"); if (s.length==0) Some("empty") else None },
(s:String) => {println("3"); if (s.indexOf(" ")>=0) Some(s.trim) else None }
);
// (transform is the same)
scala> transform("This is a really long string",stringOps)
1
res0: Option[String] = Some(28)
scala> transform("no-match",stringOps)
1
2
3
res1: Option[String] = None
(This is with Scala 2.8; 2.7's implementation will sometimes overshoot by one, unfortunately. And note that you do accumulate a long list of None as your failures accrue, but presumably this is inexpensive compared to your true computation here.)
I could imagine, that if you poll some device in real time, a Stream is more convenient.
Think of an GPS tracker, which returns the actual position if you ask it. You can't precompute the location where you will be in 5 minutes. You might use it for a few minutes only to actualize a path in OpenStreetMap or you might use it for an expedition over six months in a desert or the rain forest.
Or a digital thermometer or other kinds of sensors which repeatedly return new data, as long as the hardware is alive and turned on - a log file filter could be another example.
Stream is to Iterator as immutable.List is to mutable.List. Favouring immutability prevents a class of bugs, occasionally at the cost of performance.
scalac itself isn't immune to these problems: http://article.gmane.org/gmane.comp.lang.scala.internals/2831
As Daniel points out, favouring laziness over strictness can simplify algorithms and make it easier to compose them.