Processing multiple files in parallel with scalaz streams

I'm trying to use scalaz-stream to process multiple files simultaneously, applying a single function to each group of corresponding lines across all the files.
For concreteness, suppose I have a function that takes a list of strings
def f(lines: Seq[String]): Something = ???
And a couple of files, the first:
foo1
foo2
foo3
the second:
bar1
bar2
bar3
The result of the whole process should be:
List(
  f(Seq("foo1", "bar1")),
  f(Seq("foo2", "bar2")),
  f(Seq("foo3", "bar3"))
)
(or more likely written directly into some other file)
The number of files is not known beforehand, and the number of lines may vary between the different files, but I'm okay with padding (at runtime) the ends of the shorter files with a default value, or truncating the longer files.
So essentially, I need a way to combine a Seq[Process[Task, String]] (obtained via something like io.linesR) into a single Process[Task, Seq[String]].
What would be the simplest way to achieve that?
Or, more generally, how do I combine n instances of Process[F, I] into a single instance Process[F, Seq[I]]?
I'm sure there's some standard combinator for this purpose, but I wasn't able to figure it out...
Thanks.

This combinator doesn't exist yet, but you could add it. I think it will be something like:
def zipN[F[_], A](xs: Seq[Process[F, A]]): Process[F, Seq[A]] =
  if (xs.isEmpty) Process.halt
  else xs.map(_ map (Vector(_))).reduceLeft(_.zipWith(_)(_ ++ _))
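Applied to the original question, a hedged usage sketch (the file names and the function f stand in for whatever you actually have):

import scalaz.concurrent.Task
import scalaz.stream._

// Zip the line streams of several files into a stream of rows,
// then apply f to each row.
val paths = Seq("foo.txt", "bar.txt")
val lines: Seq[Process[Task, String]] = paths.map(p => io.linesR(p))
val rows: Process[Task, Seq[String]] = zipN(lines)
val results = rows.map(f) // one f(Seq(...)) per row of corresponding lines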
You could also add zipAllN, which takes a value to pad the sequences with (and which uses zipAll), and alignN, which allows streams to 'drop out' of the output process when they are exhausted (so the output sequence may get shorter).
I would suggest you implement it as a 'balanced' reduce rather than a left or right reduce, since the tree of zipWith calls will then have logarithmic rather than linear depth, which is more efficient.
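A minimal sketch of that balanced variant (the name zipBalancedN and the splitting strategy are my own, not part of scalaz-stream):

def zipBalancedN[F[_], A](xs: Seq[Process[F, A]]): Process[F, Seq[A]] =
  if (xs.isEmpty) Process.halt
  else if (xs.size == 1) xs.head.map(Vector(_))
  else {
    // Split in half and recurse, so each element passes through
    // O(log n) zipWith calls instead of O(n).
    val (left, right) = xs.splitAt(xs.size / 2)
    zipBalancedN(left).zipWith(zipBalancedN(right))(_ ++ _)
  }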
Please do submit a pull request + tests if you end up implementing this for real!

Related

Avoid testing duplicate values with ScalaTest forAll

I'm playing with property-based testing on ScalaTest and I had the following code:
val myStrings = Gen.oneOf("hi", "hello")
forAll(myStrings) { s: String =>
  println(s"String tested: $s")
}
When I run the forAll code, I've noticed that the same value is tried more than once, e.g.
String tested: hi
String tested: hello
String tested: hi
String tested: hello
String tested: hi
String tested: hello
...
I was wondering if, given the code above, there is a way for each value in oneOf to be tried only once. In other words, to get ScalaTest not to use the same value twice.
Even if I used other generators, such as Gen.alphaStr, I'd like to find a way to avoid testing the same String twice. The reason I'm interested in doing this is because each test runs against a server running in a different process, and hence there's a bit of cost involved, so I'd like to avoid testing the same thing twice.
What you're trying to do seems to go against the ScalaCheck ideology (see Note 1); however, it's kind of possible (with high probability) by reducing the number of samples:
scala> forAll(oneOf("a", "b")){i => println(i); true}.check(Test.Parameters.default.withMinSuccessfulTests(2))
a
b
+ OK, passed 2 tests.
Note that you can still get aa/bb sometimes, as ScalaCheck is built on randomness and a statistical approach. If you need to always check all combinations, you probably don't need ScalaCheck:
scala> assert(Set("a", "b").forall(_ => true))
Basically, Gen allows you to create an infinite collection that represents a distribution of input values. The more values you generate, the better the sampling. So if you have N possible states, you can't guarantee that they won't repeat in an infinite collection.
The only way to do exactly what you want is to explicitly check for duplicates before calling the service. You can use something like Option(seen.putIfAbsent(value, value)).isEmpty on a ConcurrentHashMap for that. Keep in mind there is a risk of OOM, so be careful about the number of generated values and maybe even add an explicit check.
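A hedged sketch of that deduplication idea (the map and its scope are my own choices, reusing the question's myStrings generator):

import java.util.concurrent.ConcurrentHashMap

// Tracks values already sent to the server; unbounded, so watch memory.
val seen = new ConcurrentHashMap[String, String]()

forAll(myStrings) { s: String =>
  // putIfAbsent returns null only on the first insertion of a key
  if (Option(seen.putIfAbsent(s, s)).isEmpty) {
    println(s"String tested: $s") // run the expensive check once per value
  }
}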
Note 1) What ScalaCheck is for is reducing the number of combinations from the maximum (which is typically far more than 100) to some value that still gives you a good check. So ScalaCheck is useful when the set of possible inputs is really huge, and in that case the probability of repetitions is really small.
P.S.
Talking about oneOf (from scaladoc):
def oneOf[T](t0: T, t1: T, tn: T*): Gen[T]
Picks a random value from a list
See also (examples are a bit outdated): How can I reduce the number of test cases ScalaCheck generates?
I would aim to increase the entropy of the values. Using random sentences will increase it a lot, although it doesn't (theoretically) fix the issue.
val genWord = Gen.oneOf("hi", "hello")

def sentenceOf(words: Int): Gen[String] =
  Gen.listOfN(words, genWord).map(_.mkString(" "))
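Used with forAll it might look like this (a sketch; the sentence length of 5 is arbitrary):

forAll(sentenceOf(5)) { s: String =>
  println(s"String tested: $s")
}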

Stream of random values with Rng library

I am looking through Rng sources to see how they generate a stream of random values.
def stream[AA >: A](s: Size): Rng[EphemeralStream[AA]] =
  list(s) map (EphemeralStream(_: _*))
First of all I am confused with this signature. I need a generator of an infinite stream that yields random values lazily. So I do not need the size argument.
I am confused by the implementation too. The generator seems to create a list and then convert it to the stream. However, the whole point of the stream is to be lazy. What am I missing?
I don't know about Stream, but we can do something like this explicitly with scalaz-stream Process:
def infiniteRandomProcess: Process[Rng, Int] =
  Process.await(Rng.int)({
    i => Process.emit(i) ++ infiniteRandomProcess
  })
Then you transform the lazy Process until you get something you're happy to run (or runLog or so on), and you end up with an Rng value (or perhaps a monad transformer that combines this with e.g. Task).
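For instance, a hedged sketch of taking a finite prefix (this assumes Rng has the Monad and Catchable instances that scalaz-stream's runLog needs):

// Take ten values and collect them; no randomness is evaluated
// until the resulting Rng value is itself run.
val tenInts: Rng[IndexedSeq[Int]] =
  infiniteRandomProcess.take(10).runLog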
The API might allow converting this to a Stream; there's a lot of it and I tend to stay around the shallow end.

Add streams in Scala

I see at least two different implementations:
def add_streams(s1: Stream[Int], s2: Stream[Int]): Stream[Int] =
  Stream.cons(s1.head + s2.head, add_streams(s1.tail, s2.tail))

def add_streams(s1: Stream[Int], s2: Stream[Int]): Stream[Int] =
  (s1 zip s2) map { case (x, y) => x + y }
I guess the last one is more efficient since it is not recursive.
Is it correct? How would you code such a function?
The first version is broken as it doesn't check for the end of a Stream. (The streams needn't be of different length for this to happen.) Given that, the zip version is the one to prefer.
First of all: your implementations have different behavior when either of the streams is finite. The first will crash with a NoSuchElementException, while the second will just truncate the longer stream.
I find the latter much more expressive and elegant, anyway, although I doubt the performance difference would be noticeable in most cases.
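If you'd rather pad than truncate when the lengths differ, zipAll gives a variant (a sketch; the zero padding and the name are my choices):

// Pads the shorter stream with 0, so the result is as long as the longer one.
def addStreamsPadded(s1: Stream[Int], s2: Stream[Int]): Stream[Int] =
  s1.zipAll(s2, 0, 0) map { case (x, y) => x + y }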

Scala: selecting function returning Option versus PartialFunction

I'm a relative Scala beginner and would like some advice on how to proceed on an implementation that seems like it can be done either with a function returning Option or with PartialFunction. I've read all the related posts I could find (see bottom of question), but these seem to involve the technical details of using PartialFunction or converting one to the other; I am looking for an answer of the type "if the circumstances are X,Y,Z, then use A else B, but also consider C".
My example use case is a path search between locations using a library of path finders. Say the locations are of type L, a path is of type P, and the desired path-search result would be an Iterable[P]. The path search result should be assembled by asking all the path finders (in something like Google Maps these might be Bicycle, Car, Walk, Subway, etc.) for their path suggestions, which may or may not be defined for a particular start/end location pair.
There seem to be two ways to go about this:
(a) define a path finder as f: (L,L) => Option[P] and then get the result via something like finders.map( _.apply(l1,l2) ).filter( _.isDefined ).map( _.get )
(b) define a path finder as f: PartialFunction[(L,L),P] and then get the result via something like finders.filter( _.isDefinedAt( (l1,l2) ) ).map( _.apply( (l1,l2) ) )
It seems like using a function returning Option[P] would avoid double evaluation of results, so for an expensive computation this may be preferable unless one caches the results. It also seems like using Option one can have an arbitrary input signature, whereas PartialFunction expects a single argument. But I am particularly interested in hearing from someone with practical experience about less immediate, more "bigger picture" considerations, such as the interaction with the Scala library. Would using a PartialFunction have significant benefits in making available certain methods of the collections API that might pay off in other ways? Would such code generally be more concise?
Related but different questions:
Inverse of PartialFunction's lift method
Is the PartialFunction design inefficient?
How to convert X => Option[R] to PartialFunction[X,R]
Is there a nicer way of lifting a PartialFunction in Scala?
costly computation occuring in both isDefined and Apply of a PartialFunction
It's not all that well known, but since 2.8 Scala has had a collect method defined on its collections. collect is similar to filter, but takes a partial function and has the semantics you describe.
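As a sketch with the question's types (note the guard re-evaluates isDefinedAt before apply, so this shows the shape rather than a performance win):

// Keep only the finders defined at (l1, l2) and apply them, in one pass.
val paths: Seq[P] = finders.collect {
  case pf if pf.isDefinedAt((l1, l2)) => pf((l1, l2))
}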
It feels like Option might fit your use case better.
My interpretation is that partial functions work well when combined over input ranges. So if f is defined over (SanDiego, Irvine) and g is defined over (Paris, London), then you can get a function that is defined over the combined input (SanDiego, Irvine) and (Paris, London) by doing f orElse g, as sketched below.
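A hedged sketch of that composition (the two finders are hypothetical):

// Each finder covers its own region; orElse merges their domains.
val localFinder: PartialFunction[(L, L), P] = ???
val europeFinder: PartialFunction[(L, L), P] = ???
val combinedFinder = localFinder orElse europeFinder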
But in your case, it seems, things happen for a given (l1,l2) location tuple, and then you do some work...
If you find yourself writing a lot of {case (L,M) => ... case (P,Q) => ...} then it may be a sign that partial functions are a better fit.
Otherwise options work well with the rest of the collections and can be used like this instead of your (a) proposal:
val processedPaths = for {
  f <- finders
  p <- f(l1, l2)
} yield process(p)
Within the for comprehension p is lifted into a Traversable, so you don't even have to call filter, isDefined, or get to skip the finders without results.

Use-cases for Streams in Scala

In Scala there is a Stream class that is very much like an iterator. The topic Difference between Iterator and Stream in Scala? offers some insights into the similarities and differences between the two.
Seeing how to use a stream is pretty simple but I don't have very many common use-cases where I would use a stream instead of other artifacts.
The ideas I have right now:
If you need to make use of an infinite series. But this does not seem like a common use-case to me so it doesn't fit my criteria. (Please correct me if it is common and I just have a blind spot)
If you have a series of data where each element needs to be computed but that you may want to reuse several times. This is weak because I could just load it into a list which is conceptually easier to follow for a large subset of the developer population.
Perhaps there is a large set of data or a computationally expensive series and there is a high probability that the items you need will not require visiting all of the elements. But in this case an Iterator would be a good match unless you need to do several searches, in that case you could use a list as well even if it would be slightly less efficient.
There is a complex series of data that needs to be reused. Again a list could be used here. Although in this case both cases would be equally difficult to use and a Stream would be a better fit since not all elements need to be loaded. But again not that common... or is it?
So have I missed any big uses? Or is it a developer preference for the most part?
Thanks
The main difference between a Stream and an Iterator is that the latter is mutable and "one-shot", so to speak, while the former is not. Iterator has a better memory footprint than Stream, but the fact that it is mutable can be inconvenient.
Take this classic prime number generator, for instance:
def primeStream(s: Stream[Int]): Stream[Int] =
  Stream.cons(s.head, primeStream(s.tail filter { _ % s.head != 0 }))
val primes = primeStream(Stream.from(2))
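For example, in the REPL:

scala> primes.take(10).toList
res0: List[Int] = List(2, 3, 5, 7, 11, 13, 17, 19, 23, 29)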
It can easily be written with an Iterator as well, but an Iterator won't keep the primes computed so far.
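For contrast, a hedged sketch of an Iterator version (simple trial division rather than the filtering scheme above); being one-shot, every fresh traversal recomputes from scratch:

// One-shot: once consumed, the computed primes are gone.
def primeIterator: Iterator[Int] =
  Iterator.from(2).filter(n => (2 until n).forall(n % _ != 0))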
So, one important aspect of a Stream is that you can pass it to other functions without having it duplicated first, or having to generate it again and again.
As for expensive computations/infinite lists, these things can be done with Iterator as well. Infinite lists are actually quite useful -- you just don't know it because you didn't have it, so you have seen algorithms that are more complex than strictly necessary just to deal with enforced finite sizes.
In addition to Daniel's answer, keep in mind that Stream is useful for short-circuiting evaluations. For example, suppose I have a huge set of functions that take String and return Option[String], and I want to keep executing them until one of them works:
val stringOps = List(
  (s: String) => if (s.length > 10) Some(s.length.toString) else None,
  (s: String) => if (s.length == 0) Some("empty") else None,
  (s: String) => if (s.indexOf(" ") >= 0) Some(s.trim) else None
)
Well, I certainly don't want to execute the entire list, and there isn't any handy method on List that says, "treat these as functions and execute them until one of them returns something other than None". What to do? Perhaps this:
def transform(input: String, ops: List[String => Option[String]]) = {
  ops.toStream.map(_(input)).find(_.isDefined).getOrElse(None)
}
This takes a list and treats it as a Stream (which doesn't actually evaluate anything), then defines a new Stream that is a result of applying the functions (but that doesn't evaluate anything either yet), then searches for the first one which is defined--and here, magically, it looks back and realizes it has to apply the map, and get the right data from the original list--and then unwraps it from Option[Option[String]] to Option[String] using getOrElse.
Here's an example:
scala> transform("This is a really long string",stringOps)
res0: Option[String] = Some(28)
scala> transform("",stringOps)
res1: Option[String] = Some(empty)
scala> transform(" hi ",stringOps)
res2: Option[String] = Some(hi)
scala> transform("no-match",stringOps)
res3: Option[String] = None
But does it work? If we put a println into our functions so we can tell if they're called, we get
val stringOps = List(
  (s: String) => { println("1"); if (s.length > 10) Some(s.length.toString) else None },
  (s: String) => { println("2"); if (s.length == 0) Some("empty") else None },
  (s: String) => { println("3"); if (s.indexOf(" ") >= 0) Some(s.trim) else None }
)
// (transform is the same)
scala> transform("This is a really long string",stringOps)
1
res0: Option[String] = Some(28)
scala> transform("no-match",stringOps)
1
2
3
res1: Option[String] = None
(This is with Scala 2.8; 2.7's implementation will sometimes overshoot by one, unfortunately. And note that you do accumulate a long list of None as your failures accrue, but presumably this is inexpensive compared to your true computation here.)
I could imagine that if you poll some device in real time, a Stream is more convenient.
Think of a GPS tracker that returns the actual position when you ask it. You can't precompute the location where you will be in 5 minutes. You might use it for a few minutes only, to update a path in OpenStreetMap, or you might use it for an expedition over six months in a desert or the rain forest.
Or a digital thermometer or other kinds of sensors which repeatedly return new data, as long as the hardware is alive and turned on - a log file filter could be another example.
Stream is to Iterator as immutable.List is to mutable.List. Favouring immutability prevents a class of bugs, occasionally at the cost of performance.
scalac itself isn't immune to these problems: http://article.gmane.org/gmane.comp.lang.scala.internals/2831
As Daniel points out, favouring laziness over strictness can simplify algorithms and make it easier to compose them.