Related
I'm writing an app in Scala and I'm using Akka streams.
At one point, I need to filter out streams that have less than N elements, with N given. So, for example, with N=5:
Source(List(1,2,3)).via(myFilter) // => List()
Source(List(1,2,3,4)).via(myFilter) // => List()
will become empty streams, and
Source(List(1,2,3,4,5)).via(myFilter) // => List(1,2,3,4,5)
Source(List(1,2,3,4,5,6)).via(myFilter) // => List(1,2,3,4,5,6)
will be unchanged.
Of course, we can't know the number of elements in the stream until it's over, and waiting till the end before pushing it through might not be the best idea.
So, instead, I've thought about the following algorithm:
for the first N-1 elements, just buffer them, without passing further;
if the input stream finishes before reaching the Nth element, output an empty stream;
if the input stream reaches Nth element, output the buffered N-1 elements, then output the Nth element, then pass all the following elements that come.
However, I have no idea how to build a Flow element implementing it. Are there some built-in Akka elements I could use?
Edit:
Okay, so I played with it yesterday and I came up with something like that:
Flow[Int].
prefixAndTail(N).
flatMapConcat {
case (prefix, tail) if prefix.length == N =>
Source(prefix).concat(tail)
case _ =>
Source.empty[Int]
}
Will it do what I want?
Perhaps statefulMapConcat could help you:
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}
import akka.stream.{ActorMaterializer, Materializer}
import scala.collection.mutable.ListBuffer
import scala.concurrent.ExecutionContext
object StatefulMapConcatExample extends App {
implicit val system: ActorSystem = ActorSystem()
implicit val materializer: Materializer = ActorMaterializer()
implicit val ec: ExecutionContext = scala.concurrent.ExecutionContext.Implicits.global
def filterLessThen(threshold: Int): (Int) => List[Int] = {
var buffering = true
val buffer: ListBuffer[Int] = ListBuffer()
(elem: Int) =>
if (buffering) {
buffer += elem
if (buffer.size < threshold) {
Nil
} else {
buffering = false
buffer.toList
}
} else {
List(elem)
}
}
//Nil
Source(List(1, 2, 3)).statefulMapConcat(() => filterLessThen(5))
.runWith(Sink.seq).map(println)
//Nil
Source(List(1, 2, 3, 4)).statefulMapConcat(() => filterLessThen(5))
.runWith(Sink.seq).map(println)
//Vector(1,2,3,4,5)
Source(List(1, 2, 3, 4, 5)).statefulMapConcat(() => filterLessThen(5))
.runWith(Sink.seq).map(println)
//Vector(1,2,3,4,5,6)
Source(List(1, 2, 3, 4, 5, 6)).statefulMapConcat(() => filterLessThen(5))
.runWith(Sink.seq).map(println)
}
This may be one of those instances where a little "state" can go a long way. Even though the solution is not "purely functional", the updating state will be isolated and unreachable by the rest of the system. I think this is one of the beauties of scala: when an FP solution isn't obvious you can always revert to imperative in an isolated manner...
The completed Flow will be a combination of multiple sub-parts. The first Flow will just group your elements into sequences of size N:
val group : Int => Flow[Int, Seq[Int], _] =
(N) => Flow[Int] grouped N
Now for the non-functional part, a filter that will only allow the grouped Seq values through if the first sequence was the right size:
val minSizeRequirement : Int => Seq[Int] => Boolean =
(minSize) => {
var isFirst : Boolean = True
var passedMinSize : Boolean = False
(testSeq) => {
if(isFirst) {
isFirst = False
passedMinSize = testSeq.size >= minSize
passedMinSize
}
else
passedMinSize
}
}
}
val minSizeFilter : Int => Flow[Seq[Int], Seq[Int], _] =
(minSize) => Flow[Seq[Int]].filter(minSizeRequirement(minSize))
The last step is to convert the Seq[Int] values back into Int values:
val flatten = Flow[Seq[Int]].flatMapConcat(l => Source(l))
Finally, combine them all together:
val combinedFlow : Int => Flow[Int, Int, _] =
(minSize) =>
group(minSize)
.via(minSizeFilter(minSize))
.via(flatten)
I'm writing a small exercise app that calculates number of unique letters (incl Unicode) in a seq of strings, and I'm using aggregate for it, as I try to run in parallel
here's my code:
class Frequency(seq: Seq[String]) {
type FreqMap = Map[Char, Int]
def calculate() = {
val freqMap: FreqMap = Map[Char, Int]()
val pattern = "(\\p{L}+)".r
val seqop: (FreqMap, String) => FreqMap = (fm, s) => {
s.toLowerCase().foldLeft(freqMap){(fm, c) =>
c match {
case pattern(char) => fm.get(char) match {
case None => fm+((char, 1))
case Some(i) => fm.updated(char, i+1)
}
case _ => fm
}
}
}
val reduce: (FreqMap, FreqMap) => FreqMap =
(m1, m2) => {
m1 ++ m2.map { case (k, v) => k -> (v + m1.getOrElse(k, 0)) }
}
seq.par.aggregate(freqMap)(seqop, reduce)
}
}
and then the code that makes use of that
object Frequency extends App {
val text = List("abc", "abc", "abc", "abc", "abc", "abc", "abc", "abc", "abc");
def frequency(seq: Seq[String]):Map[Char, Int] = {
new Frequency(seq).calculate()
}
Console println frequency(seq=text)
}
though I supplied "abc" 9 times, the result is Map(a -> 8, b -> 8, c -> 8), as it is for any number of "abc"'s > 8
I've looked at this, and it seems like I'm using aggregate correctly
Any suggestions to make it work?
You're discarding already collected results (the first fm) in your seqop. You need to add these to the new results you're computing, e.g. like this:
def calculate() = {
val freqMap: FreqMap = Map[Char, Int]()
val pattern = "(\\p{L}+)".r
val reduce: (FreqMap, FreqMap) => FreqMap =
(m1, m2) => {
m1 ++ m2.map { case (k, v) => k -> (v + m1.getOrElse(k, 0)) }
}
val seqop: (FreqMap, String) => FreqMap = (fm, s) => {
val res = s.toLowerCase().foldLeft(freqMap){(fm, c) =>
c match {
case pattern(char) => fm.get(char) match {
case None => fm+((char, 1))
case Some(i) => fm.updated(char, i+1)
}
case _ => fm
}
}
// I'm reusing your existing combinator function here:
reduce(res,fm)
}
seq.par.aggregate(freqMap)(seqop, reduce)
}
Depending on how the parallel collections divide the work you discard some of it. In your case (9x "abc") it divides the thing in 8 parallel seqop operations which means you discard exactly one result set. This varies depending on numbers, if you run in with say 17x "abc" it runs in 13 parallel operations, discarding 4 result sets (on my machine anyway - I'm not familiar with the underlying code and how it divides the work, this probably depends on the used ExecutionContext/Threadpool and subsequently number of CPUs/cores and so on).
Generally parallel collections are a drop in replacement for sequential collections, meaning if you drop .par you should still get the same result, albeit usually slower. If you do this with your original code you get a result of 1, which tells you that it's not a parallelization problem. This is a good way to test if you're doing to right thing when using these.
And last but not least: This was harder to spot than usual for me because you use the same variable name twice and subsequently shadow fm. Not doing that would make the code more readable and mistakes such as this easier to spot.
I've got an ADT that's essentially a cross between Option and Try:
sealed trait Result[+T]
case object Empty extends Result[Nothing]
case class Error(cause: Throwable) extends Result[Nothing]
case class Success[T](value: T) extends Result[T]
(assume common combinators like map, flatMap etc are defined on Result)
Given an Iteratee[A, Result[B] called inner, I want to create a new Iteratee[Result[A], Result[B]] with the following behavior:
If the input is a Success(a), feed a to inner
If the input is an Empty, no-op
If the input is an Error(err), I want inner to be completely ignored, instead returning a Done iteratee with the Error(err) as its result.
Example Behavior:
// inner: Iteratee[Int, Result[List[Int]]]
// inputs:
1
2
3
// output:
Success(List(1,2,3))
// wrapForResultInput(inner): Iteratee[Result[Int], Result[List[Int]]]
// inputs:
Success(1)
Success(2)
Error(Exception("uh oh"))
Success(3)
// output:
Error(Exception("uh oh"))
This sounds to me like the job for an Enumeratee, but I haven't been able to find anything in the docs that looks like it'll do what I want, and the internal implementations are still voodoo to me.
How can I implement wrapForResultInput to create the behavior described above?
Adding some more detail that won't really fit in a comment:
Yes it looks like I was mistaken in my question. I described it in terms of Iteratees but it seems I really am looking for Enumeratees.
At a certain point in the API I'm building, there's a Transformer[A] class that is essentially an Enumeratee[Event, Result[A]]. I'd like to allow clients to transform that object by providing an Enumeratee[Result[A], Result[B]], which would result in a Transformer[B] aka an Enumeratee[Event, Result[B]].
For a more complex example, suppose I have a Transformer[AorB] and want to turn that into a Transformer[(A, List[B])]:
// the Transformer[AorB] would give
a, b, a, b, b, b, a, a, b
// but the client wants to have
a -> List(b),
a -> List(b, b, b),
a -> Nil
a -> List(b)
The client could implement an Enumeratee[AorB, Result[(A, List[B])]] without too much trouble using Enumeratee.grouped, but they are required to provide an Enumeratee[Result[AorB], Result[(A, List[B])] which seems to introduce a lot of complication that I'd like to hide from them if possible.
val easyClientEnumeratee = Enumeratee.grouped[AorB]{
for {
_ <- Enumeratee.dropWhile(_ != a) ><> Iteratee.ignore
headResult <- Iteratee.head.map{ Result.fromOption }
bs <- Enumeratee.takeWhile(_ == b) ><> Iteratee.getChunks
} yield headResult.map{_ -> bs}
val harderEnumeratee = ??? ><> easyClientEnumeratee
val oldTransformer: Transformer[AorB] = ... // assume it already exists
val newTransformer: Transformer[(A, List[B])] = oldTransformer.andThen(harderEnumeratee)
So what I'm looking for is the ??? to define the harderEnumeratee in order to ease the burden on the user who already implemented easyClientEnumeratee.
I guess the ??? should be an Enumeratee[Result[AorB], AorB], but if I try something like
Enumeratee.collect[Result[AorB]] {
case Success(ab) => ab
case Error(err) => throw err
}
the error will actually be thrown; I actually want the error to come back out as an Error(err).
Simplest implementation of such would be Iteratee.fold2 method, that could collect elements until something is happened.
Since you return single result and can't really return anything until you verify there is no errors, Iteratee would be enough for such a task
def listResults[E] = Iteratee.fold2[Result[E], Either[Throwable, List[E]]](Right(Nil)) { (state, elem) =>
val Right(list) = state
val next = elem match {
case Empty => (Right(list), false)
case Success(x) => (Right(x :: list), false)
case Error(t) => (Left(t), true)
}
Future(next)
} map {
case Right(list) => Success(list.reverse)
case Left(th) => Error(th)
}
Now if we'll prepare little playground
import scala.concurrent.ExecutionContext.Implicits._
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
val good = Enumerator.enumerate[Result[Int]](
Seq(Success(1), Empty, Success(2), Success(3)))
val bad = Enumerator.enumerate[Result[Int]](
Seq(Success(1), Success(2), Error(new Exception("uh oh")), Success(3)))
def runRes[X](e: Enumerator[Result[X]]) : Result[List[X]] = Await.result(e.run(listResults), 3 seconds)
we can verify those results
runRes(good) //res0: Result[List[Int]] = Success(List(1, 2, 3))
runRes(bad) //res1: Result[List[Int]] = Error(java.lang.Exception: uh oh)
I'm going through log file that is too big to fit into memory and collecting 2 type of expressions, what is better functional alternative to my iterative snippet below?
def streamData(file: File, errorPat: Regex, loginPat: Regex): List[(String, String)]={
val lines : Iterator[String] = io.Source.fromFile(file).getLines()
val logins: mutable.Map[String, String] = new mutable.HashMap[String, String]()
val errors: mutable.ListBuffer[(String, String)] = mutable.ListBuffer.empty
for (line <- lines){
line match {
case errorPat(date,ip)=> errors.append((ip,date))
case loginPat(date,user,ip,id) =>logins.put(ip, id)
case _ => ""
}
}
errors.toList.map(line => (logins.getOrElse(line._1,"none") + " " + line._1,line._2))
}
Here is a possible solution:
def streamData(file: File, errorPat: Regex, loginPat: Regex): List[(String,String)] = {
val lines = Source.fromFile(file).getLines
val (err, log) = lines.collect {
case errorPat(inf, ip) => (Some((ip, inf)), None)
case loginPat(_, _, ip, id) => (None, Some((ip, id)))
}.toList.unzip
val ip2id = log.flatten.toMap
err.collect{ case Some((ip,inf)) => (ip2id.getOrElse(ip,"none") + "" + ip, inf) }
}
Corrections:
1) removed unnecessary types declarations
2) tuple deconstruction instead of ulgy ._1
3) left fold instead of mutable accumulators
4) used more convenient operator-like methods :+ and +
def streamData(file: File, errorPat: Regex, loginPat: Regex): List[(String, String)] = {
val lines = io.Source.fromFile(file).getLines()
val (logins, errors) =
((Map.empty[String, String], Seq.empty[(String, String)]) /: lines) {
case ((loginsAcc, errorsAcc), next) =>
next match {
case errorPat(date, ip) => (loginsAcc, errorsAcc :+ (ip -> date))
case loginPat(date, user, ip, id) => (loginsAcc + (ip -> id) , errorsAcc)
case _ => (loginsAcc, errorsAcc)
}
}
// more concise equivalent for
// errors.toList.map { case (ip, date) => (logins.getOrElse(ip, "none") + " " + ip) -> date }
for ((ip, date) <- errors.toList)
yield (logins.getOrElse(ip, "none") + " " + ip) -> date
}
I have a few suggestions:
Instead of a pair/tuple, it's often better to use your own class. It gives meaningful names to both the type and its fields, which makes the code much more readable.
Split the code into small parts. In particular, try to decouple pieces of code that don't need to be tied together. This makes your code easier to understand, more robust, less prone to errors and easier to test. In your case it'd be good to separate producing your input (lines of a log file) and consuming it to produce a result. For example, you'd be able to make automatic tests for your function without having to store sample data in a file.
As an example and exercise, I tried to make a solution based on Scalaz iteratees. It's a bit longer (includes some auxiliary code for IteratorEnumerator) and perhaps it's a bit overkill for the task, but perhaps someone will find it helpful.
import java.io._;
import scala.util.matching.Regex
import scalaz._
import scalaz.IterV._
object MyApp extends App {
// A type for the result. Having names keeps things
// clearer and shorter.
type LogResult = List[(String,String)]
// Represents a state of our computation. Not only it
// gives a name to the data, we can also put here
// functions that modify the state. This nicely
// separates what we're computing and how.
sealed case class State(
logins: Map[String,String],
errors: Seq[(String,String)]
) {
def this() = {
this(Map.empty[String,String], Seq.empty[(String,String)])
}
def addError(date: String, ip: String): State =
State(logins, errors :+ (ip -> date));
def addLogin(ip: String, id: String): State =
State(logins + (ip -> id), errors);
// Produce the final result from accumulated data.
def result: LogResult =
for ((ip, date) <- errors.toList)
yield (logins.getOrElse(ip, "none") + " " + ip) -> date
}
// An iteratee that consumes lines of our input. Based
// on the given regular expressions, it produces an
// iteratee that parses the input and uses State to
// compute the result.
def logIteratee(errorPat: Regex, loginPat: Regex):
IterV[String,List[(String,String)]] = {
// Consumes a signle line.
def consume(line: String, state: State): State =
line match {
case errorPat(date, ip) => state.addError(date, ip);
case loginPat(date, user, ip, id) => state.addLogin(ip, id);
case _ => state
}
// The core of the iteratee. Every time we consume a
// line, we update our state. When done, compute the
// final result.
def step(state: State)(s: Input[String]): IterV[String, LogResult] =
s(el = line => Cont(step(consume(line, state))),
empty = Cont(step(state)),
eof = Done(state.result, EOF[String]))
// Return the iterate waiting for its first input.
Cont(step(new State()));
}
// Converts an iterator into an enumerator. This
// should be more likely moved to Scalaz.
// Adapted from scalaz.ExampleIteratee
implicit val IteratorEnumerator = new Enumerator[Iterator] {
#annotation.tailrec def apply[E, A](e: Iterator[E], i: IterV[E, A]): IterV[E, A] = {
val next: Option[(Iterator[E], IterV[E, A])] =
if (e.hasNext) {
val x = e.next();
i.fold(done = (_, _) => None, cont = k => Some((e, k(El(x)))))
} else
None;
next match {
case None => i
case Some((es, is)) => apply(es, is)
}
}
}
// main ---------------------------------------------------
{
// Read a file as an iterator of lines:
// val lines: Iterator[String] =
// io.Source.fromFile("test.log").getLines();
// Create our testing iterator:
val lines: Iterator[String] = Seq(
"Error: 2012/03 1.2.3.4",
"Login: 2012/03 user 1.2.3.4 Joe",
"Error: 2012/03 1.2.3.5",
"Error: 2012/04 1.2.3.4"
).iterator;
// Create an iteratee.
val iter = logIteratee("Error: (\\S+) (\\S+)".r,
"Login: (\\S+) (\\S+) (\\S+) (\\S+)".r);
// Run the the iteratee against the input
// (the enumerator is implicit)
println(iter(lines).run);
}
}
I am using the asynchronous I/O library of the playframework which uses Iteratees and Enumerators. I now have an Iterator[T] as data sink (for simplification say it's an Iterator[Byte] which stores its content into a file). This Iterator[Byte] is passed to the function which handles the writing.
But before writing I want to add some statistical information at the file begin (for simplification say it's one Byte), so I transfer the iterator the following way before passing it to the write function:
def write(value: Byte, output: Iteratee[Byte]): Iteratee[Byte] =
Iteratee.flatten(output.feed(Input.El(value)))
When I now read the stored file from the disk, I get an Enumerator[Byte] for it.
At first I want to read and remove the additional data and then I want to pass the rest of the Enumerator[Byte] to a function which handles the reading.
So I also need to transform the enumerator:
def read(input: Enumerator[Byte]): (Byte, Enumerator[Byte]) = {
val firstEnumeratorEntry = ...
val remainingEnumerator = ...
(firstEnumeratorEntry, remainingEnumerator)
}
But I have no idea, how to do this. How can I read some bytes from an Enumerator and get the remaining Enumerator?
Replacing Iteratee[Byte] with OutputStream and Enumerator[Byte] with InputStream, this would be very easy:
def write(value: Byte, output: OutputStream) = {
output.write(value)
output
}
def read(input: InputStream) = (input.read,input)
But I need the asynchronous I/O of the play framework.
I wonder if you can tackle your goal from another angle.
That function that would use the remaining enumerator, let's call it remaining, presumably it applies to an iteratee to do the processing of the remainder: remaining |>> iteratee yielding another iteratee. Let's call that resulting iteratee iteratee2... Can you check whether you can get a reference to iteratee2? If that's the case, then you can get and process the first byte using a first iteratee head, then combine head and iteratee2 through flatMap:
val head = Enumeratee.take[Byte](1) &>> Iteratee.foreach[Byte](println)
val processing = for { h <- head; i <- iteratee2 } yield (h, i)
Iteratee.flatten(processing).run
If you cannot get a hold of iteratee2 - which would be the case if your enumerator combines with an enumeratee that you did not implement - then this approach won't work.
Here is one way to achieve this by folding within the Iteratee and an appropriate (kind-of) State accumulator (a tuple here)
I go read the routes file, the first byte will be read as a Char and the other will be appended to a String as UTF-8 bytestrings.
def index = Action {
/*let's do everything asyncly*/
Async {
/*for comprehension for read-friendly*/
for (
i <- read; /*read the file */
(r:(Option[Char], String)) <- i.run /*"create" the related Promise and run it*/
) yield Ok("first : " + r._1.get + "\n" + "rest" + r._2) /* map the Promised result in a correct Request's Result*/
}
}
def read = {
//get the routes file in an Enumerator
val file: Enumerator[Array[Byte]] = Enumerator.fromFile(Play.getFile("/conf/routes"))
//apply the enumerator with an Iteratee that folds the data as wished
file(Iteratee.fold((None, ""):(Option[Char], String)) { (acc, b) =>
acc._1 match {
/*on the first chunk*/ case None => (Some(b(0).toChar), acc._2 + new String(b.tail, Charset.forName("utf-8")))
/*on other chunks*/ case x => (x, acc._2 + new String(b, Charset.forName("utf-8")))
}
})
}
EDIT
I found yet another way using Enumeratee but it needs to create 2 Enumerator s (one short lived). However is it a bit more elegant. We use a "kind-of" Enumeratee but the Traversal one which works at a finer level than Enumeratee (chunck level).
We use take 1 that will take only 1 byte and then close the stream. On the other one, we use drop that simply drops the first byte (because we're using a Enumerator[Array[Byte]])
Furthermore, now read2 has a signature much more closer than what you wished, because it returns 2 enumerators (not so far from Promise, Enumerator)
def index = Action {
Async {
val (first, rest) = read2
val enee = Enumeratee.map[Array[Byte]] {bs => new String(bs, Charset.forName("utf-8"))}
def useEnee(enumor:Enumerator[Array[Byte]]) = Iteratee.flatten(enumor &> enee |>> Iteratee.consume[String]()).run.asInstanceOf[Promise[String]]
for {
f <- useEnee(first);
r <- useEnee(rest)
} yield Ok("first : " + f + "\n" + "rest" + r)
}
}
def read2 = {
def create = Enumerator.fromFile(Play.getFile("/conf/routes"))
val file: Enumerator[Array[Byte]] = create
val file2: Enumerator[Array[Byte]] = create
(file &> Traversable.take[Array[Byte]](1), file2 &> Traversable.drop[Array[Byte]](1))
}
Actually we like Iteratees because they compose. So instead of creating multiple Enumerators from your original one, you rather compose the two Iteratees sequentially (read-first and read-rest), and feed it with your single Enumerator.
For this you need a sequential composition method, now I call it andThen. Here is a rough implementation. Note that returning the unconsumed input is a bit harsh, maybe could customize behavior with a typeclass based on the Input type. Also it doesn't handle passing the leftover stuff from the first iterator to the second one (Exercise :).
object Iteratees {
def andThen[E, A, B](a: Iteratee[E, A], b: Iteratee[E, B]): Iteratee[E, (A,B)] = new Iteratee[E, (A,B)] {
def fold[C](
done: ((A, B), Input[E]) => Promise[C],
cont: ((Input[E]) => Iteratee[E, (A, B)]) => Promise[C],
error: (String, Input[E]) => Promise[C]): Promise[C] = {
a.fold(
(ra, aleft) => b.fold(
(rb, bleft) => done((ra, rb), aleft /* could be magicop(aleft, bleft)*/),
(bcont) => cont(e => bcont(e) map (rb => (ra, rb))),
(s, err) => error(s, err)
),
(acont) => cont(e => andThen[E, A, B](acont(e), b)),
(s, err) => error(s, err)
)
}
}
}
Now you can just use the following:
object Application extends Controller {
def index = Action { Async {
val strings: Enumerator[String] = Enumerator("1","2","3","4")
val takeOne = Cont[String, String](e => e match {
case Input.El(e) => Done(e, Input.Empty)
case x => Error("not enough", x)
})
val takeRest = Iteratee.consume[String]()
val firstAndRest = Iteratees.andThen(takeOne, takeRest)
val futureRes = strings(firstAndRest) flatMap (_.run)
futureRes.map(x => Ok(x.toString)) // prints (1,234)
} }
}