Scala filter collection by future result

I have a class called Source with a lazy val isValid that performs some time-consuming network checks. I've made it return Future[Boolean] so that it won't block the main thread:
lazy val isValid: Future[Boolean] = doSomeChecks()
Now I want to check a few Sources and discard the invalid ones.
Here's some pseudo code:
val sources = Seq(Source1, Source2,...)
val validOnes: Seq[Source] = sources.filter(_.isValid)
I was thinking of some method that would transform Future[Boolean] returned from isValid into some other Future that would resolve with whole Source object after validation checks finish.
What is the best way to do this?

No need to return the Source itself from isValid. Simply append the source itself to each result of an isValid request.
Once the results of validity checks are there, filter those sources that
are valid, and throw away the validity-boolean:
val sources: Seq[Source] = ...
val validSources: Future[Seq[Source]] = (for {
  checkedSources <- Future.sequence(
    sources.map(s => s.isValid.map(b => (s, b)))
  )
} yield checkedSources.filter(_._2).map(_._1))
Here, s.isValid returns a Future[Boolean]. This future is mapped with b => (s, b), so that s.isValid.map(b => (s, b)) is a Future[(Source, Boolean)]. Future.sequence then transforms a sequence of futures into a single future of a sequence. Once that sequence has been computed, it can be filtered by the second component, and then the second component can be discarded.
I'm not sure whether this is "the best" way to do this, because it's not clear to me how long the validity check remains sufficiently "fresh": it might well be that half of the results of isValid checks are outdated by the time all results are collected into the checkedSources sequence.
EDIT (1): Even shorter:
val validSources: Future[Seq[Source]] =
  Future.sequence(sources.map(_.isValid)).map {
    bs => (sources zip bs).filter(_._2).map(_._1)
  }
This should work too.
(code unchecked with compiler, rough sketch off the top of my head; Please provide minimal compilable example if you want a compilable solution)
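For anyone who wants something that compiles, here is a minimal sketch of the same idea. The Source class below is a hypothetical stand-in (its `reachable` flag fakes the network check), and `FilterByFutureDemo`, `validSources` and `run` are names made up for this example. It uses Future.traverse, which combines the map and the Future.sequence step in one call.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical stand-in for the asker's Source; `reachable` fakes the network check.
final case class Source(name: String, reachable: Boolean) {
  lazy val isValid: Future[Boolean] = Future(reachable)
}

object FilterByFutureDemo {
  // Pair every source with its check result, then keep the sources whose check was true.
  def validSources(sources: Seq[Source]): Future[Seq[Source]] =
    Future
      .traverse(sources)(s => s.isValid.map(ok => (s, ok)))
      .map(_.collect { case (s, true) => s })

  // Blocks only for demonstration purposes; real code would keep working with the Future.
  def run(): Seq[String] = {
    val sources = Seq(
      Source("a", reachable = true),
      Source("b", reachable = false),
      Source("c", reachable = true)
    )
    Await.result(validSources(sources), 5.seconds).map(_.name)
  }
}
```

Note that the resulting future fails if any single isValid future fails. If a failed check should simply count as "invalid", one option is to add `.recover { case _ => false }` to each check before pairing it with its source.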

Related

Akka Streams: validation of elements being streamed

I'm new to Akka Streams and I'm wondering how to implement some kind of mid-stream validation. Example:
FileIO
.fromPath(file)
.via(Framing.delimiter(...))
.map(_.utf8String)
.map(_.split("\t", -1))
.validate(arr => arr.length == 10) // or similar
...
I assumed that this scenario is so common that there must be predefined functionality for validating a stream on the fly. However, I wasn't able to find anything about it. Am I on the wrong track here, and is validation something that should not be done this way in Akka Streams?
In my particular scenario, I'm processing a file line by line. If even a single line is invalid, it does not make sense to continue and the processing should be aborted.
I'd probably create a type to represent the constraints, then you can do the assertions when creating instances of that type, as well as know downstream which constraints have been applied.
Example:
object LineItem {
  // Makes it possible to provide the validation before allocating the item
  def apply(string: String): LineItem = {
    require(string.length == 10)
    new LineItem(string) // Call the companion-accessible constructor
  }
}
// private[LineItem] makes sure that `new` only works from companion object
final case class LineItem private[LineItem](string: String)
You could use .takeWhile. This will process all elements before the invalid item, and not process any items after it.
FileIO
.fromPath(file)
.via(Framing.delimiter(...))
.map(_.utf8String)
.map(_.split("\t", -1))
.takeWhile(arr => arr.length == 10)
...
I agree with @Stephen that takeWhile is what you need. You can use it with the inclusive flag set to true if you want the failing element to be passed downstream as well.
Also, if you want to make your stream the most expressive, you can have the validation flow producing Either[ValidationError, String].
The example below is a bit clunky, I would prefer to use the graphDSL and partition, but hopefully you get the idea.
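val errorSink: Sink[TooManyElements, _] = ???
val sink: Sink[Array[String], _] = ???
FileIO
  .fromPath(file)
  .via(Framing.delimiter(...))
  .map(_.utf8String.split("\t", -1))
  .map {
    case arr if arr.length > 10 => Left(TooManyElements(arr.length))
    case arr => Right(arr)
  }
  // stop at the first failure, but still emit it so that it reaches errorSink
  .takeWhile(_.isRight, inclusive = true)
  // unwrap the Left so the element type matches errorSink
  .alsoTo(Flow[Either[TooManyElements, Array[String]]].collect { case Left(e) => e }.to(errorSink))
  // unwrap the Right so the element type matches sink
  .collect { case Right(arr) => arr }
  .to(sink)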

Use only the value from IO monad without precedent IO actions

While working on a home project, I encountered an interesting effect which now seems obvious to me, but I still don't see a way around it.
Here is the gist (I am using Scalaz, but Haskell would probably give the same result):
def askAndReadResponse(question: String): IO[String] = {
  putStrLn(question) >> readLn
}
def core: IO[String] = {
  val answer: IO[String] = askAndReadResponse("enter something")
  val cond: IO[Boolean] = answer map {_.length > 2}
  IO.ioMonad.ifM(cond, answer, core)
}
When I try to get an input from core, askAndReadResponse is evaluated twice: once for the condition and once more inside ifM (so the message and readLn run one more time than necessary).
What I need is just the validated value (to print it later, for instance).
Is there an elegant way to do this, in particular to pass along the result of an IO without its preceding IO actions, so that askAndReadResponse is not executed twice?
You can sequence the effects using monadic binding with flatMap:
def core: IO[String] = askAndReadResponse("enter something").flatMap {
  case response if response.length > 2 => response.point[IO]
  case response => core
}
This lets you take the result of one computation (the user entering text after being prompted) and use it in subsequent computations (the calculation about whether to return or loop, and the result if returning).
ifM just isn't going to be useful in your case—it would only work here if your condition and your successful branch were independent computations.
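To make the difference visible without pulling in Scalaz, here is a small sketch with a hand-rolled IO type. Everything in it (`IO`, `IODemo`, `Console`, `ifM`, `pure`) is made up for illustration and is not Scalaz's API. The fake console counts how often the "prompt" effect actually runs.

```scala
// Minimal hand-rolled IO, for illustration only (this is not Scalaz's IO).
final class IO[A](val unsafeRun: () => A) {
  def map[B](f: A => B): IO[B] = new IO(() => f(unsafeRun()))
  def flatMap[B](f: A => IO[B]): IO[B] = new IO(() => f(unsafeRun()).unsafeRun())
}

object IODemo {
  def pure[A](a: A): IO[A] = new IO(() => a)

  // ifM runs `cond`, then runs whichever branch was selected.
  def ifM[A](cond: IO[Boolean], ifTrue: => IO[A], ifFalse: => IO[A]): IO[A] =
    cond.flatMap(c => if (c) ifTrue else ifFalse)

  // Fake console: every run of `ask` consumes one scripted answer and counts as one prompt.
  final class Console(answers: List[String]) {
    private var remaining = answers
    var prompts = 0
    val ask: IO[String] = new IO(() => {
      prompts += 1
      val head = remaining.head
      remaining = remaining.tail
      head
    })
  }

  // Same shape as the question: `ask` appears in the condition AND in the branch.
  def coreWithIfM(c: Console): IO[String] =
    ifM(c.ask.map(_.length > 2), c.ask, coreWithIfM(c))

  // Same shape as the answer: the value produced by `ask` is bound once and reused.
  def coreWithFlatMap(c: Console): IO[String] =
    c.ask.flatMap(r => if (r.length > 2) pure(r) else coreWithFlatMap(c))
}
```

With the scripted answers "abc" and "xyz", the ifM version prompts twice and returns "xyz" (the second read), while the flatMap version prompts once and returns "abc". An IO value is a description of an effect, so mentioning it twice runs it twice; only binding its result gives you the value without the effect.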

Converting thunk to sequence upon iteration

I have a server API that returns a list of things, and does so in chunks of, let's say, 25 items at a time. With every response, we get a list of items, and a "token" that we can use for the following server call to return the next 25, and so on.
Please note that we're using a client library that has been written in stodgy old mutable Java, and doesn't lend itself nicely to all of Scala's functional compositional patterns.
I'm looking for a way to return a lazily evaluated sequence of all server items, by doing a server call with the latest token whenever the local list of items has been exhausted. What I have so far is:
def fetchFromServer(uglyStateObject: StateObject): Seq[Thing] = {
  val results = server.call(uglyStateObject)
  uglyStateObject.update(results.token())
  results.asScala.toList ++ (
    if (results.moreAvailable()) fetchFromServer(uglyStateObject)
    else List()
  )
}
However, this function does eager evaluation. What I'm looking for is to have ++ concatenate a "strict sequence" and a "lazy sequence", where a thunk will be used to retrieve the next set of items from the server. In effect, I want something like this:
results.asScala.toList ++ Seq.lazy(() => fetchFromServer(uglyStateObject))
Except I don't know what to use in place of Seq.lazy.
Things I've seen so far:
SeqView, but I've seen comments that it shouldn't be used because it re-evaluates all the time?
Streams, but they seem like the abstraction is supposed to generate elements at a time, whereas I want to generate a bunch of elements at a time.
What should I use?
I also suggest taking a look at scalaz-stream. Here is a small example of how it might look:
import scalaz.stream._
import scalaz.concurrent.Task
// Returns updated state + fetched data
def fetchFromServer(uglyStateObject: StateObject): (StateObject, Seq[Thing]) = ???
// Initial state
val init: StateObject = new StateObject
val p: Process[Task, Thing] = Process.repeatEval[Task, Seq[Thing]] {
  var state = init
  Task(fetchFromServer(state)) map {
    case (s, seq) =>
      state = s
      seq
  }
} flatMap Process.emitAll
As a matter of fact, in the meantime I already found a slightly different answer that I find more readable (indeed using Streams):
def fetchFromServer(uglyStateObject: StateObject): Stream[Thing] = {
  val results = server.call(uglyStateObject)
  uglyStateObject.update(results.token())
  // #::: takes its right-hand side lazily, so the next call happens only when needed
  results.asScala.toStream #::: (
    if (results.moreAvailable()) fetchFromServer(uglyStateObject)
    else Stream.empty
  )
}
Thanks everyone for
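On Scala 2.13 and later, Stream is deprecated in favour of LazyList, and the same page-at-a-time laziness can be sketched with LazyList.unfold. The paged "server" below is entirely made up (`Page`, `PagingDemo`, `fetchPage` and the integer tokens are assumptions standing in for the Java client), and the `calls` counter only exists to show how many round trips were made.

```scala
// Hypothetical page type standing in for the Java client's response:
// a chunk of items plus an optional token for the next chunk.
final case class Page(items: List[Int], next: Option[Int])

object PagingDemo {
  var calls = 0 // counts simulated server round trips

  // Fake server holding the numbers 0 until 100, served in pages of 25.
  def fetchPage(token: Int): Page = {
    calls += 1
    val items = (token until math.min(token + 25, 100)).toList
    Page(items, if (token + 25 < 100) Some(token + 25) else None)
  }

  // One unfold step per page; flatten turns the lazy pages into a lazy sequence of items.
  def allItems: LazyList[Int] =
    LazyList.unfold[List[Int], Option[Int]](Some(0)) {
      case Some(token) =>
        val page = fetchPage(token)
        Some((page.items, page.next))
      case None => None
    }.flatten
}
```

Building `PagingDemo.allItems` performs no call at all, and `allItems.take(30).toList` only fetches the pages needed to produce those 30 items. Each unfold step produces a whole chunk, which addresses the concern that streams generate one element at a time.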

Iteratees in Scala that use lazy evaluation or fusion?

I have heard that iteratees are lazy, but how lazy exactly are they? Alternatively, can iteratees be fused with a postprocessing function, so that an intermediate data structure does not have to be built?
Can I in my iteratee for example build a 1 million item Stream[Option[String]] from a java.io.BufferedReader, and then subsequently filter out the Nones, in a compositional way, without requiring the entire Stream to be held in memory? And at the same time guarantee that I don't blow the stack? Or something like that - it doesn't have to use a Stream.
I'm currently using Scalaz 6 but if other iteratee implementations are able to do this in a better way, I'd be interested to know.
Please provide a complete solution, including closing the BufferedReader and calling unsafePerformIO, if applicable.
Here's a quick iteratee example using the Scalaz 7 library that demonstrates the properties you're interested in: constant memory and stack usage.
The problem
First assume we've got a big text file with a string of decimal digits on each line, and we want to find all the lines that contain at least twenty zeros. We can generate some sample data like this:
val w = new java.io.PrintWriter("numbers.txt")
val r = new scala.util.Random(0)
(1 to 1000000).foreach(_ =>
w.println((1 to 100).map(_ => r.nextInt(10)).mkString)
)
w.close()
Now we've got a file named numbers.txt. Let's open it with a BufferedReader:
val reader = new java.io.BufferedReader(new java.io.FileReader("numbers.txt"))
It's not excessively large (~97 megabytes), but it's big enough for us to see easily whether our memory use is actually staying constant while we process it.
Setting up our enumerator
First for some imports:
import scalaz._, Scalaz._, effect.IO, iteratee.{ Iteratee => I }
And an enumerator (note that I'm changing the IoExceptionOrs into Options for the sake of convenience):
val enum = I.enumReader(reader).map(_.toOption)
Scalaz 7 doesn't currently provide a nice way to enumerate a file's lines, so we're chunking through the file one character at a time. This will of course be painfully slow, but I'm not going to worry about that here, since the goal of this demo is to show that we can process this large-ish file in constant memory and without blowing the stack. The final section of this answer gives an approach with better performance, but here we'll just split on line breaks:
val split = I.splitOn[Option[Char], List, IO](_.cata(_ != '\n', false))
And if the fact that splitOn takes a predicate that specifies where not to split confuses you, you're not alone. split is our first example of an enumeratee. We'll go ahead and wrap our enumerator in it:
val lines = split.run(enum).map(_.sequence.map(_.mkString))
Now we've got an enumerator of Option[String]s in the IO monad.
Filtering the file with an enumeratee
Next for our predicate—remember that we said we wanted lines with at least twenty zeros:
val pred = (_: String).count(_ == '0') >= 20
We can turn this into a filtering enumeratee and wrap our enumerator in that:
val filtered = I.filter[Option[String], IO](_.cata(pred, true)).run(lines)
We'll set up a simple action that just prints everything that makes it through this filter:
val printAction = (I.putStrTo[Option[String]](System.out) &= filtered).run
Of course we've not actually read anything yet. To do that we use unsafePerformIO:
printAction.unsafePerformIO()
Now we can watch the Some("0946943140969200621607610...")s slowly scroll by while our memory usage remains constant. It's slow and the error handling and output are a little clunky, but not too bad I think for about nine lines of code.
Getting output from an iteratee
That was the foreach-ish usage. We can also create an iteratee that works more like a fold—for example gathering up the elements that make it through the filter and returning them in a list. Just repeat everything above up until the printAction definition, and then write this instead:
val gatherAction = (I.consume[Option[String], IO, List] &= filtered).run
Kick that action off:
val xs: Option[List[String]] = gatherAction.unsafePerformIO().sequence
Now go get a coffee (it might need to be pretty far away). When you come back you'll either have a None (in the case of an IOException somewhere along the way) or a Some containing a list of 1,943 strings.
Complete (faster) example that automatically closes the file
To answer your question about closing the reader, here's a complete working example that's roughly equivalent to the second program above, but with an enumerator that takes responsibility for opening and closing the reader. It's also much, much faster, since it reads lines, not characters. First for imports and a couple of helper methods:
import java.io.{ BufferedReader, File, FileReader }
import scalaz._, Scalaz._, effect._, iteratee.{ Iteratee => I, _ }
def tryIO[A, B](action: IO[B]) = I.iterateeT[A, IO, Either[Throwable, B]](
  action.catchLeft.map(
    r => I.sdone(r, r.fold(_ => I.eofInput, _ => I.emptyInput))
  )
)
def enumBuffered(r: => BufferedReader) =
  new EnumeratorT[Either[Throwable, String], IO] {
    lazy val reader = r
    def apply[A] = (s: StepT[Either[Throwable, String], IO, A]) => s.mapCont(
      k =>
        tryIO(IO(reader.readLine())).flatMap {
          case Right(null) => s.pointI
          case Right(line) => k(I.elInput(Right(line))) >>== apply[A]
          case e => k(I.elInput(e))
        }
    )
  }
And now the enumerator:
def enumFile(f: File): EnumeratorT[Either[Throwable, String], IO] =
  new EnumeratorT[Either[Throwable, String], IO] {
    def apply[A] = (s: StepT[Either[Throwable, String], IO, A]) => s.mapCont(
      k =>
        tryIO(IO(new BufferedReader(new FileReader(f)))).flatMap {
          case Right(reader) => I.iterateeT(
            enumBuffered(reader).apply(s).value.ensuring(IO(reader.close()))
          )
          case Left(e) => k(I.elInput(Left(e)))
        }
    )
  }
And we're ready to go:
val action = (
  I.consume[Either[Throwable, String], IO, List] %=
  I.filter(_.fold(_ => true, _.count(_ == '0') >= 20)) &=
  enumFile(new File("numbers.txt"))
).run
Now the reader will be closed when the processing is done.
I should have read a little bit further... this is precisely what enumeratees are for. Enumeratees are defined in Scalaz 7 and Play 2, but not in Scalaz 6.
Enumeratees are for "vertical" composition (in the sense of "vertically integrated industry") while ordinary iteratees compose monadically in a "horizontal" way.

What is the proper monad or sequence comprehension to both map and carry state across?

I'm writing a programming language interpreter.
I need the right idiom to both evaluate a sequence of expressions into a sequence of their values and propagate state from one evaluation to the next as the evaluations take place. I'd like a functional programming idiom for this.
It's not a fold because the results come out like a map. It's not a map because of the state prop across.
What I have is this code which I'm using to try to figure this out. Bear with a few lines of test rig first:
// test rig
class MonadLearning extends JUnit3Suite {
val d = List("1", "2", "3") // some expressions to evaluate.
type ResType = Int
case class State(i : ResType) // trivial state for experiment purposes
val initialState = State(0)
// my stub/dummy "eval" function...obviously the real one will be...real.
def computeResultAndNewState(s : String, st : State) : (ResType, State) = {
val State(i) = st
val res = s.toInt + i
val newStateInt = i + 1
(res, State(newStateInt))
}
My current solution. Uses a var which is updated as the body of the map is evaluated:
def testTheVarWay() {
var state = initialState
val r = d.map {
s =>
{
val (result, newState) = computeResultAndNewState(s, state)
state = newState
result
}
}
println(r)
println(state)
}
I also have what I consider an unacceptable solution using foldLeft, which follows what I call the "bag it as you fold" idiom:
def testTheFoldWay() {
// This startFold thing, requires explicit type. That alone makes it muddy.
val startFold : (List[ResType], State) = (Nil, initialState)
val (r, state) = d.foldLeft(startFold) {
case ((tail, st), s) => {
val (r, ns) = computeResultAndNewState(s, st)
(tail :+ r, ns) // we want a constant-time append here, not O(N). Or could Cons on front and reverse later
}
}
println(r)
println(state)
}
I also have a couple of recursive variations (which are obvious, but also not clear or well motivated), one using streams which is almost tolerable:
def testTheStreamsWay() {
lazy val states = initialState #:: resultStates // there are states
lazy val args = d.toStream // there are arguments
lazy val argPairs = args zip states // put them together
lazy val resPairs : Stream[(ResType, State)] = argPairs.map{ case (d1, s1) => computeResultAndNewState(d1, s1) } // map across them
lazy val (results , resultStates) = myUnzip(resPairs)// Note .unzip causes infinite loop. Had to write my own.
lazy val r = results.toList
lazy val finalState = resultStates.last
println(r)
println(finalState)
}
But, I can't figure out anything as compact or clear as the original 'var' solution above, which I'm willing to live with, but I think somebody who eats/drinks/sleeps monad idioms is going to just say ... use this... (Hopefully!)
With the map-with-accumulator combinator (the easy way)
The higher-order function you want is mapAccumL. It's in Haskell's standard library, but for Scala you'll have to use something like Scalaz.
First the imports (note that I'm using Scalaz 7 here; for previous versions you'd import Scalaz._):
import scalaz._, syntax.std.list._
And then it's a one-liner:
scala> d.mapAccumLeft(initialState, computeResultAndNewState)
res1: (State, List[ResType]) = (State(3),List(1, 3, 5))
Note that I've had to reverse the order of your evaluator's arguments and the return value tuple to match the signatures expected by mapAccumLeft (state first in both cases).
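If you'd rather not add a dependency just for this, the combinator is small enough to sketch by hand on top of foldLeft. This is my own minimal version for illustration, not Scalaz's implementation. The names `MapAccumDemo` and `eval` are made up, and the state is a plain Int instead of the question's State class to keep the example self-contained.

```scala
object MapAccumDemo {
  // Thread a state S through the list while collecting one output per element.
  def mapAccumLeft[S, A, B](xs: List[A], init: S)(f: (S, A) => (S, B)): (S, List[B]) = {
    val (finalState, reversed) = xs.foldLeft((init, List.empty[B])) {
      case ((state, acc), x) =>
        val (next, out) = f(state, x)
        (next, out :: acc) // prepending is O(1); reverse once at the end
    }
    (finalState, reversed.reverse)
  }

  // Same toy evaluator as in the question: result = input + counter, then bump the counter.
  def eval(counter: Int, s: String): (Int, Int) = (counter + 1, s.toInt + counter)

  val result: (Int, List[Int]) = mapAccumLeft(List("1", "2", "3"), 0)(eval)
  // result == (3, List(1, 3, 5))
}
```

This is essentially the "bag it as you fold" code from the question, written once and given a name, which is a large part of why the call site becomes a one-liner.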
With the state monad (the slightly less easy way)
As Petr Pudlák points out in another answer, you can also use the state monad to solve this problem. Scalaz actually provides a number of facilities that make working with the state monad much easier than the version in his answer suggests, and they won't fit in a comment, so I'm adding them here.
First of all, Scalaz does provide a mapM—it's just called traverse (which is a little more general, as Petr Pudlák notes in his comment). So assuming we've got the following (I'm using Scalaz 7 again here):
import scalaz._, Scalaz._
type ResType = Int
case class Container(i: ResType)
val initial = Container(0)
val d = List("1", "2", "3")
def compute(s: String): State[Container, ResType] = State {
  case Container(i) => (Container(i + 1), s.toInt + i)
}
We can write this:
d.traverse[({type L[X] = State[Container, X]})#L, ResType](compute).run(initial)
If you don't like the ugly type lambda, you can get rid of it like this:
type ContainerState[X] = State[Container, X]
d.traverse[ContainerState, ResType](compute).run(initial)
But it gets even better! Scalaz 7 gives you a version of traverse that's specialized for the state monad:
scala> d.traverseS(compute).run(initial)
res2: (Container, List[ResType]) = (Container(3),List(1, 3, 5))
And as if that wasn't enough, there's even a version with the run built in:
scala> d.runTraverseS(initial)(compute)
res3: (Container, List[ResType]) = (Container(3),List(1, 3, 5))
Still not as nice as the mapAccumLeft version, in my opinion, but pretty clean.
What you're describing is a computation within the state monad. I believe that the answer to your question
It's not a fold because the results come out like a map. It's not a map because of the state prop across.
is that it's a monadic map using the state monad.
Values of the state monad are computations that read some internal state, possibly modify it, and return some value. It is often used in Haskell (see here or here).
For Scala, there is a trait in the ScalaZ library called State that models it (see also the source). There are utility methods in States for creating instances of State. Note that from the monadic point of view State is just a monadic value. This may seem confusing at first, because it's described by a function depending on a state. (A monadic function would be something of type A => State[B].)
Next, what you need is a monadic map function that computes the values of your expressions, threading the state through the computations. In Haskell, there is a library function mapM that does just that when specialized to the state monad.
In Scala, there is no such library function (if there is, please correct me). But it's possible to create one. To give a full example:
import scalaz._;
object StateExample
  extends App
  with States /* utility methods */
{
  // The context that is threaded through the state.
  // In our case, it just maps variables to integer values.
  class Context(val map: Map[String,Int]);
  // An example that returns the requested variable's value
  // and increases its value in the context.
  def eval(expression: String): State[Context,Int] =
    state((ctx: Context) => {
      val v = ctx.map.get(expression).getOrElse(0);
      (new Context(ctx.map + ((expression, v + 1)) ), v);
    });
  // Specialization of Haskell's mapM to our State monad.
  def mapState[S,A,B](f: A => State[S,B])(xs: Seq[A]): State[S,Seq[B]] =
    state((initState: S) => {
      var s = initState;
      // process the sequence, threading the state
      // through the computation
      val ys = for(x <- xs) yield { val r = f(x)(s); s = r._1; r._2 };
      // return the final state and the output result
      (s, ys);
    });
  // Example: Try to evaluate some variables, starting from an empty context.
  val expressions = Seq("x", "y", "y", "x", "z", "x");
  print( mapState(eval)(expressions) ! new Context(Map[String,Int]()) );
}
This way you can create simple functions that take some arguments and return State and then combine them into more complex ones by using State.map or State.flatMap (or perhaps better using for comprehensions), and then you can run the whole computation on a list of expressions by mapM.
See also http://blog.tmorris.net/posts/the-state-monad-for-scala-users/
Edit: See Travis Brown's answer, he described how to use the state monad in Scala much more nicely.
He also asks:
But why, when there's a standard combinator that does exactly what you need in this case?
(I ask this as someone who's been slapped for using the state monad when mapAccumL would do.)
It's because the original question asked:
It's not a fold because the results come out like a map. It's not a map because of the state prop across.
and I believe the proper answer is it is a monadic map using the state monad.
Using mapAccumL is surely faster, with less memory and CPU overhead. But the state monad captures the concept of what is going on, the essence of the problem. I believe in many (if not most) cases this is more important. Once we realize the essence of the problem, we can either use the high-level concepts to nicely describe the solution (perhaps sacrificing speed/memory a little) or optimize it to be fast (or perhaps even manage to do both).
On the other hand, mapAccumL solves this particular problem, but doesn't give us a broader answer. If we need to modify it a little, it might happen it won't work any more. Or, if the library starts to be complex, the code can start to be messy and we won't know how to improve it, how to make the original idea clear again.
For example, in the case of evaluating stateful expressions, the library can become complicated. But if we use the state monad, we can build the library around small functions, each taking some arguments and returning something like State[Context,Result]. These atomic computations can be combined into more complex ones using the flatMap method or for comprehensions, and finally we'll construct the desired task. The principle will stay the same across the whole library, and the final task will also be something that returns State[Context,Result].
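As a rough illustration of that building-block style, here is a sketch with a tiny hand-rolled state type. `MiniState`, `StateDemo`, `read`, `write`, `postIncrement` and `traverse` are all names invented for this example; a real project would more likely use the State that Scalaz provides.

```scala
// Minimal hand-rolled state monad, for illustration only.
final case class MiniState[S, A](run: S => (S, A)) {
  def map[B](f: A => B): MiniState[S, B] = MiniState[S, B] { s =>
    val (s1, a) = run(s)
    (s1, f(a))
  }
  def flatMap[B](f: A => MiniState[S, B]): MiniState[S, B] = MiniState[S, B] { s =>
    val (s1, a) = run(s)
    f(a).run(s1)
  }
}

object StateDemo {
  type Context = Map[String, Int]

  // Atomic step: read a variable, defaulting to 0.
  def read(name: String): MiniState[Context, Int] =
    MiniState[Context, Int](ctx => (ctx, ctx.getOrElse(name, 0)))

  // Atomic step: overwrite a variable.
  def write(name: String, value: Int): MiniState[Context, Unit] =
    MiniState[Context, Unit](ctx => (ctx.updated(name, value), ()))

  // Composite step built from the atomic ones: return the old value, then increment it.
  def postIncrement(name: String): MiniState[Context, Int] =
    for {
      old <- read(name)
      _   <- write(name, old + 1)
    } yield old

  // Monadic map over a list: the steps run left to right, threading the context.
  def traverse[A, B](xs: List[A])(f: A => MiniState[Context, B]): MiniState[Context, List[B]] =
    xs.foldRight(MiniState[Context, List[B]](ctx => (ctx, Nil))) { (x, acc) =>
      for { b <- f(x); bs <- acc } yield b :: bs
    }

  val (finalContext, values) =
    traverse(List("x", "y", "y", "x", "z", "x"))(postIncrement).run(Map.empty)
  // values == List(0, 0, 1, 1, 0, 2); finalContext == Map(x -> 3, y -> 2, z -> 1)
}
```

Note that `postIncrement` never mentions the context explicitly. The plumbing lives in flatMap, so larger evaluators can be assembled from such steps without each one having to pass the state along by hand.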
To conclude: I'm not saying using the state monad is the best solution, and certainly it's not the fastest one. I just believe it is most didactic for a functional programmer - it describes the problem in a clean, abstract way.
You could do this recursively:
def testTheRecWay(xs: Seq[String]) = {
  def innerTestTheRecWay(xs: Seq[String], priorState: State = initialState, result: Vector[ResType] = Vector()): Seq[ResType] = {
    xs match {
      // Seq() and +: match any Seq; Nil and :: would only match a List
      case Seq() => result
      case x +: tail =>
        val (res, newState) = computeResultAndNewState(x, priorState)
        innerTestTheRecWay(tail, newState, result :+ res)
    }
  }
  innerTestTheRecWay(xs)
}
Recursion is common practice in functional programming and is often easier to read, write and understand than loops. In fact, Scala has no loop constructs other than while (and do-while); for is just sugar for foreach/map/flatMap/withFilter, and operations such as fold, map and flatMap can all be expressed recursively.
This method is tail recursive, so the compiler will turn it into a loop that does not grow the stack, which makes it safe to use on long inputs. You can add the @annotation.tailrec annotation to have the compiler verify this: if the method is not tail recursive, compilation fails with an error.
edit: renamed inner method to avoid ambiguity