Execute DAG-like operations in Scala Futures

I am working on a use case where I have to execute inter-dependent operations (defined as a Directed Acyclic Graph) using Scala Futures. Basically, every operation (say, a node of the DAG) will be executed in a Future, and the dependent nodes will be triggered (they should be in a Future too) once the current node's Future completes. This goes on until every node has finished processing or one of them fails. So far I have (minimal code):
def run(node: Node, result: Result): Unit = {
  val f: Future[(Node, Result)] = Future {
    // process current Node
    ...
  }
  f onComplete {
    case Success(x) =>
      val n = x._1 // Current Node
      val r = x._2 // Result of current Node
      if (!n.isLeaf()) {
        n.children.foreach { z =>
          run(z, r)
        }
      }
    case Failure(e) => throw e
  }
}
Is this the correct way to tackle this problem (calling another Future in a callback)? Also, I don't have a proper way to stop the other running Futures once one of the nodes fails processing.
Can this be solved using Future composition? If so, how can I achieve that?
Thanks,
Pravin

Here is a more functional approach: instead of using Unit as the result of run/the Future, we can use a generic type. Usually you want to work with the results of a Future functionally, rather than with its side effects.
I've added type annotations and descriptive variable names to make it easier to understand. I also added a few cases to show how it fails. You could also choose to recover rather than fail everything when a failure occurs; however, for this problem, if the child computation relies on the parent value, it is probably more reasonable to fail.
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.Try

case class Node[T](value: T, children: List[Node[T]])

object DagFuture extends App {
  def run[A, B](node: Node[A], result: B)(nodeEval: (Node[A], B) => B)(aggregator: List[B] => B): Future[B] = {
    val nodeResult: Future[B] = Future(nodeEval(node, result))
    val allResults: Future[List[B]] = nodeResult.flatMap(r => Future.sequence(nodeResult :: node.children.map(x => run(x, r)(nodeEval)(aggregator))))
    val finalResult: Future[B] = allResults.map(cl => aggregator(cl))
    finalResult
  }

  val debugSum = (l: List[Int]) => {
    println(s"aggregating: $l")
    l.sum
  }

  def debugNodeEval(f: (Node[Int], Int) => Int)(n: Node[Int], r: Int) = {
    val eval = Try { f(n, r) }
    println(s"node: $n, result: $r, eval: $eval")
    eval.get
  }

  val debugNodeEvalDefault = debugNodeEval((n, r) => n.value + r) _

  val singleNodeDag = Node(1, Nil)
  val multiNodeDag = Node(1, List(Node(20, Nil), Node(300, Nil)))

  println("\nSINGLE NODE DAG EXAMPLE:")
  val singleNodeFuture = run(singleNodeDag, 0)(debugNodeEvalDefault)(debugSum)
  val singleNodeResult = Await.result(singleNodeFuture, 5.seconds)
  println(s"Single node result: $singleNodeResult")

  println("\nDAG PATH LENGTH EXAMPLE:")
  val pathLengthFuture = run(multiNodeDag, 0)(debugNodeEvalDefault)(debugSum)
  val pathLengthResult = Await.result(pathLengthFuture, 5.seconds)
  println(s"Path length: $pathLengthResult")

  println("\nFAILED DAG ROOT NODE EXAMPLE:")
  val failedRootNodeFuture = run(multiNodeDag, 0)(debugNodeEval((n, r) => throw new Exception))(debugSum)
  val failedRootNodePromise = Await.ready(failedRootNodeFuture, 5.seconds)
  println(s"Failed root node: ${failedRootNodePromise.value}")

  println("\nFAILED DAG CHILD NODE EXAMPLE:")
  val failedChildNodeFuture = run(multiNodeDag, 0)(debugNodeEval((n, r) => if (n.value == 300) throw new Exception else n.value + r))(debugSum)
  val failedChildNodePromise = Await.ready(failedChildNodeFuture, 5.seconds)
  println(s"Failed child node: ${failedChildNodePromise.value}")
}
Prints this:
SINGLE NODE DAG EXAMPLE:
node: Node(1,List()), result: 0, eval: Success(1)
aggregating: List(1)
Single node result: 1
DAG PATH LENGTH EXAMPLE:
node: Node(1,List(Node(20,List()), Node(300,List()))), result: 0, eval: Success(1)
node: Node(20,List()), result: 1, eval: Success(21)
node: Node(300,List()), result: 1, eval: Success(301)
aggregating: List(301)
aggregating: List(21)
aggregating: List(1, 21, 301)
Path length: 323
FAILED DAG ROOT NODE EXAMPLE:
node: Node(1,List(Node(20,List()), Node(300,List()))), result: 0, eval: Failure(java.lang.Exception)
Failed root node: Some(Failure(java.lang.Exception))
FAILED DAG CHILD NODE EXAMPLE:
node: Node(1,List(Node(20,List()), Node(300,List()))), result: 0, eval: Success(1)
node: Node(20,List()), result: 1, eval: Success(21)
aggregating: List(21)
node: Node(300,List()), result: 1, eval: Failure(java.lang.Exception)
Failed child node: Some(Failure(java.lang.Exception))
TL;DR
def run[A, B](node: Node[A], result: B)(nodeEval: (Node[A], B) => B)(aggregator: Traversable[B] => B): Future[B] = {
  val nodeResult = Future(nodeEval(node, result))
  val allResults = nodeResult flatMap { r => Future.sequence(nodeResult :: node.children.map { x => run(x, r)(nodeEval)(aggregator) }) }
  allResults map aggregator
}
Loosely speaking, it's just a Future.flatMap(result => Future.sequence(children ...)). When the parent Future completes, its result is passed via flatMap to the children's computations. If the parent Future fails, the whole computation fails as well. Future.sequence combines a list of Futures into a single Future of a list of results. A child Future is a parent to its own children, and so on recursively, so the same failure mode applies at every level.
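Note that Future.sequence fails as soon as any Future in the list fails, but plain Scala Futures are not cancellable: sibling Futures that are already running will still execute to completion; only their results are discarded. A minimal demonstration of the fail-fast behaviour (not part of the original answer):
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.Try

// one failing Future poisons the combined Future produced by sequence
val fs = List(Future(1), Future[Int](throw new Exception("boom")), Future(3))
println(Try(Await.result(Future.sequence(fs), 1.second)))
// Failure(java.lang.Exception: boom)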

Related

Grouping a stream of elements into multiple streams

Let's say we have a case class MyCaseClass(name: String, value: Int). Given an fs2.Stream[F, MyCaseClass], I want to group elements with the same name:
val sourceStream: fs2.Stream[F, MyCaseClass] = //
val groupedSameNameStream: fs2.Stream[F, fs2.Stream[F, MyCaseClass]] = //
The reason I need to do this is that I want to apply an effectful transformation
val transform: MyCaseClass => F[Unit] = //
to all elements of a stream, and if one group fails the others should keep working.
Is something like this possible to do?
This is possible, with caveats.
It's relatively straightforward to do this if you accept having a Map with an unbounded number of keys, and an unbounded number of associated Queues for each.
We've used code based on a gist by GitHub user kiambogo in production (though ours has been tweaked), and it works fine:
import fs2.{Pipe, Stream}
import fs2.concurrent.Queue
import cats.implicits._
import cats.effect.Concurrent
import cats.effect.concurrent.Ref

def groupBy[F[_], A, K](selector: A => F[K])(implicit F: Concurrent[F]): Pipe[F, A, (K, Stream[F, A])] = { in =>
  Stream.eval(Ref.of[F, Map[K, Queue[F, Option[A]]]](Map.empty)).flatMap { st =>
    val cleanup = {
      import alleycats.std.all._
      st.get.flatMap(_.traverse_(_.enqueue1(None)))
    }

    (in ++ Stream.eval_(cleanup))
      .evalMap { el =>
        (selector(el), st.get).mapN { (key, queues) =>
          queues.get(key).fold {
            for {
              newQ <- Queue.unbounded[F, Option[A]]          // Create a new queue
              _    <- st.modify(x => (x + (key -> newQ), x)) // Update the ref of queues
              _    <- newQ.enqueue1(el.some)
            } yield (key -> newQ.dequeue.unNoneTerminate).some
          }(_.enqueue1(el.some) as None)
        }.flatten
      }
      .unNone
      .onFinalize(cleanup)
  }
}
If we assume an overhead of 64 bytes for each Map entry (I believe this is a large overestimate), then a cardinality of 100,000 unique keys gives us approximately 6.1 MiB, well within a reasonable size for a JVM process.
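A hedged usage sketch for the question's scenario (transformPerGroup is an illustrative name, not from the gist; it assumes the MyCaseClass and transform from the question):
import cats.effect.Concurrent
import fs2.Stream

def transformPerGroup[F[_]](
    sourceStream: Stream[F, MyCaseClass],
    transform: MyCaseClass => F[Unit]
)(implicit F: Concurrent[F]): Stream[F, Unit] =
  sourceStream
    .through(groupBy(c => F.pure(c.name)))
    .map { case (_, group) =>
      // if one group's transform fails, drop only that group; the rest keep running
      group.evalMap(transform).handleErrorWith(_ => Stream.empty)
    }
    .parJoinUnbounded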

Functional way of interrupting lazy iteration depending on timeout and comparison between previous and next: while, LazyList vs Stream

Background
I have the following scenario: I want to execute a method of a class from an external library, repeatedly, until a certain timeout condition and a result condition (compared to the previous result) are met. Furthermore, I want to collect the return values, even from the "failed" run (the run with the "failing" result condition that should interrupt further execution).
So far I have accomplished this by initializing an empty var result: Result and a var stop: Boolean, and using a while loop that runs while the conditions are true and modifies the outer state. I would like to get rid of this and use a functional approach.
Some context: each run is expected to take from 0 to 60 minutes, and the total iteration time is capped at 60 minutes. Theoretically there's no bound on how many times it executes in this period, but in practice it's generally 2-60 times.
The problem is that the runs take a long time, so I need to stop the execution. My idea is to use some kind of lazy Iterator or Stream coupled with scanLeft and Option.
Code
Boilerplate
This code isn't particularly relevant, but it is used in the approach samples below and provides identical, somewhat random pseudo-runtime results.
import scala.collection.mutable.ListBuffer
import scala.util.Random

val r = Random
r.setSeed(1)

val sleepingTimes: Seq[Int] = (1 to 601)
  .map(x => Math.pow(2, x).toInt * r.nextInt(100))
  .toList
  .filter(_ > 0)
  .sorted

val randomRes = r.shuffle((0 to 600).map(x => r.nextInt(10)).toList)

case class Result(val a: Int, val slept: Int)

class Lib() {
  def run(i: Int) = {
    println(s"running ${i}")
    Thread.sleep(sleepingTimes(i))
    Result(randomRes(i), sleepingTimes(i))
  }
}

case class Baz(i: Int, result: Result)

val lib = new Lib()
val timeout = 10 * 1000
While approach
val iteratorStart = System.currentTimeMillis()
val iterator = for {
  i <- (0 to 600).iterator
  if System.currentTimeMillis() < iteratorStart + timeout
  f = Baz(i, lib.run(i))
} yield f

val iteratorBuffer = ListBuffer[Baz]()
if (iterator.hasNext) iteratorBuffer.append(iterator.next())

var run = true
while (run && iterator.hasNext) {
  val next = iterator.next()
  run = iteratorBuffer.last.result.a < next.result.a
  iteratorBuffer.append(next)
}
Stream approach (Scala 2.12)
Full example
val streamStart = System.currentTimeMillis()
val stream = for {
  i <- (0 to 600).toStream
  if System.currentTimeMillis() < streamStart + timeout
} yield Baz(i, lib.run(i))

var last: Option[Baz] = None
val head = stream.headOption
val tail = if (stream.nonEmpty) stream.tail else stream

val streamVersion = (tail
  .scanLeft((head, true))((x, y) => {
    if (x._1.exists(_.result.a > y.result.a)) (Some(y), false)
    else (Some(y), true)
  })
  .takeWhile {
    case (baz, continue) =>
      if (!baz.eq(head)) last = baz
      continue
  }
  .map(_._1)
  .toList :+ last).flatten
LazyList approach (Scala 2.13)
Full example
val lazyListStart = System.currentTimeMillis()
val lazyList = for {
  i <- (0 to 600).to(LazyList)
  if System.currentTimeMillis() < lazyListStart + timeout
} yield Baz(i, lib.run(i))

var last: Option[Baz] = None
val head = lazyList.headOption
val tail = if (lazyList.nonEmpty) lazyList.tail else lazyList

val lazyListVersion = (tail
  .scanLeft((head, true))((x, y) => {
    if (x._1.exists(_.result.a > y.result.a)) (Some(y), false)
    else (Some(y), true)
  })
  .takeWhile {
    case (baz, continue) =>
      if (!baz.eq(head)) last = baz
      continue
  }
  .map(_._1)
  .toList :+ last).flatten
Result
Both approaches appear to yield the correct end result:
List(Baz(0,Result(4,170)), Baz(1,Result(5,208)))
and they interrupt execution as desired.
Edit: The desired outcome is to not execute the next iteration but still return the result of the iteration that caused the interruption. Thus the desired result is
List(Baz(0,Result(4,170)), Baz(1,Result(5,208)), Baz(2,Result(2,256)))
and lib.run(i) should only run 3 times.
This is achieved by the while approach as well as the LazyList approach, but not the Stream approach, which executes lib.run 4 times (bad!).
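(The difference comes from strictness: scala.collection.immutable.Stream is strict in its head and lazy only in its tail, while LazyList is lazy in both, so the Stream version forces one element ahead. A minimal illustration on Scala 2.13, not from the question:)
def loud(i: Int): Int = { println(s"evaluating $i"); i }

// Stream evaluates the head of each cons cell as soon as the cell is built
val s = Stream.from(1).map(loud)   // prints "evaluating 1" immediately
val l = LazyList.from(1).map(loud) // prints nothing yet
l.head                             // only now prints "evaluating 1"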
Question
Is there another stateless approach, which is hopefully more elegant?
Edit
I realized my examples were faulty: they did not return the "failing" result, which they should, and they kept executing beyond the stop condition. I rewrote the code and examples, but I believe the spirit of the question is the same.
I would use something higher level, like fs2 (or any other high-level streaming library, such as Monix Observables, Akka Streams, or ZIO ZStreams).
import cats.effect.{Concurrent, Timer}
import fs2.Stream
import scala.concurrent.duration.FiniteDuration

def runUntilOrTimeout[F[_]: Concurrent: Timer, A](work: F[A], timeout: FiniteDuration)
                                                 (stop: (A, A) => Boolean): Stream[F, A] = {
  val interrupt =
    Stream.sleep_(timeout)

  val run =
    Stream
      .repeatEval(work)
      .zipWithPrevious
      .takeThrough {
        case (Some(p), c) if stop(p, c) => false
        case _                          => true
      }
      .map { case (_, c) => c }

  run mergeHaltBoth interrupt
}
You can see it working here.
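A hedged usage sketch against the question's Lib (the cats-effect IO wiring and the AtomicInteger counter are assumptions, not part of the answer):
import cats.effect.{ContextShift, IO, Timer}
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.ExecutionContext
import scala.concurrent.duration._

implicit val cs: ContextShift[IO] = IO.contextShift(ExecutionContext.global)
implicit val timer: Timer[IO] = IO.timer(ExecutionContext.global)

// each evaluation of `work` performs the next library call
val counter = new AtomicInteger(0)
val work: IO[Baz] = IO { val i = counter.getAndIncrement(); Baz(i, lib.run(i)) }

// stop once the result is no longer strictly increasing; takeThrough keeps the
// element that triggered the stop, as the question requires
val results: List[Baz] =
  runUntilOrTimeout(work, 10.seconds)((prev, curr) => curr.result.a <= prev.result.a)
    .compile
    .toList
    .unsafeRunSync()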

Splitting a Monix Observable

I would like to write a split function for monix.reactive.Observable. It should split a source Observable[A] into a new pair (Observable[A], Observable[A]), based on a predicate evaluated against each element of the source. I would like the split to work independently of whether the source Observable is hot or cold: where the source is cold, the new pair of Observables should also be cold, and where the source is hot, the new pair of Observables will be hot. I would like to know whether such an implementation is possible and, if so, how (I have pasted a failing test case below).
The signature, as a method on an implicit class, would look like this (or similar):
/**
 * Split an observable by a predicate, placing values for which the predicate returns true
 * to the right (and values for which the predicate returns false to the left).
 * This is consistent with the convention adopted by Either.cond.
 */
def split(p: T => Boolean)(implicit scheduler: Scheduler, taskLike: TaskLike[Future]): (Observable[T], Observable[T]) = {
  splitEither[T, T](elem => Either.cond(p(elem), elem, elem))
}
Currently, I have a naive implementation that consumes the source elements and pushes them to a PublishSubject. The new pair of Observables is thus hot. My tests for a cold Observable are failing.
import monix.eval.TaskLike
import monix.execution.{Ack, Scheduler}
import monix.reactive.{Observable, Observer}
import monix.reactive.subjects.PublishSubject
import scala.concurrent.Future

object ObservableOps {
  implicit class ObservableExtensions[T](o: Observable[T]) {
    /**
     * Split an observable by a predicate, placing values for which the predicate returns true
     * to the right (and values for which the predicate returns false to the left).
     * This is consistent with the convention adopted by Either.cond.
     */
    def split(p: T => Boolean)(implicit scheduler: Scheduler, taskLike: TaskLike[Future]): (Observable[T], Observable[T]) = {
      splitEither[T, T](elem => Either.cond(p(elem), elem, elem))
    }

    /**
     * Split an observable into a pair of Observables, one left, one right, according
     * to a determinant function.
     */
    def splitEither[U, V](f: T => Either[U, V])(implicit scheduler: Scheduler, taskLike: TaskLike[Future]): (Observable[U], Observable[V]) = {
      val l = PublishSubject[U]()
      val r = PublishSubject[V]()
      o.subscribe(new Observer[T] {
        override def onNext(elem: T): Future[Ack] = {
          f(elem) match {
            case Left(u)  => l.onNext(u)
            case Right(v) => r.onNext(v)
          }
        }
        override def onError(ex: Throwable): Unit = {
          l.onError(ex)
          r.onError(ex)
        }
        override def onComplete(): Unit = {
          l.onComplete()
          r.onComplete()
        }
      })
      (l, r)
    }
  }
}
//////////
import ObservableOps._
import monix.execution.Scheduler.Implicits.global
import monix.reactive.Observable
import monix.reactive.subjects.PublishSubject
import org.scalatest.FlatSpec
import org.scalatest.Matchers._
import org.scalatest.concurrent.ScalaFutures._

class ObservableOpsSpec extends FlatSpec {
  val isEven: Int => Boolean = _ % 2 == 0

  "Observable Ops" should "split a cold observable" in {
    val o = Observable(1, 2, 3, 4, 5)
    val (l, r) = o.split(isEven)
    l.toListL.runToFuture.futureValue shouldBe List(1, 3, 5)
    r.toListL.runToFuture.futureValue shouldBe List(2, 4)
  }

  "Observable Ops" should "split a hot observable" in {
    val o = PublishSubject[Int]()
    val (l, r) = o.split(isEven)
    val lbuf = l.toListL.runToFuture
    val rbuf = r.toListL.runToFuture
    Observable.fromIterable(1 to 5).mapEvalF(i => o.onNext(i)).subscribe()
    o.onComplete()
    lbuf.futureValue shouldBe List(1, 3, 5)
    rbuf.futureValue shouldBe List(2, 4)
  }
}
I expect both test cases above to pass, but the "split a cold observable" case is failing.
Edit: working code
An implementation that passes both test cases is as follows:
import monix.execution.Scheduler
import monix.reactive.Observable

object ObservableOps {
  implicit class ObservableExtension[T](o: Observable[T]) {
    /**
     * Split an observable by a predicate, placing values for which the predicate returns true
     * to the right (and values for which the predicate returns false to the left).
     * This is consistent with the convention adopted by Either.cond.
     */
    def split(
      p: T => Boolean
    )(implicit scheduler: Scheduler): (Observable[T], Observable[T]) = {
      splitEither[T, T](elem => Either.cond(p(elem), elem, elem))
    }

    /**
     * Split an observable into a pair of Observables, one left, one right, according
     * to a determinant function.
     */
    def splitEither[U, V](
      f: T => Either[U, V]
    )(implicit scheduler: Scheduler): (Observable[U], Observable[V]) = {
      val oo = o.map(f)
      val l = oo.collect { case Left(u) => u }
      val r = oo.collect { case Right(v) => v }
      (l, r)
    }
  }
}

class ObservableOpsSpec extends FlatSpec {
  val isEven: Int => Boolean = _ % 2 == 0

  "Observable Ops" should "split a cold observable" in {
    val o = Observable(1, 2, 3, 4, 5)
    val o2 = o.publish
    val (l, r) = o2.split(isEven)
    val x = l.toListL.runToFuture
    val y = r.toListL.runToFuture
    o2.connect()
    x.futureValue shouldBe List(1, 3, 5)
    y.futureValue shouldBe List(2, 4)
  }

  "Observable Ops" should "split a hot observable" in {
    val o = PublishSubject[Int]()
    val (l, r) = o.split(isEven)
    val lbuf = l.toListL.runToFuture
    val rbuf = r.toListL.runToFuture
    Observable.fromIterable(1 to 5).mapEvalF(i => o.onNext(i)).subscribe()
    o.onComplete()
    lbuf.futureValue shouldBe List(1, 3, 5)
    rbuf.futureValue shouldBe List(2, 4)
  }
}
A cold observable, by definition, is lazily evaluated for each subscriber. You can't split it without either evaluating everything twice or converting it into a hot one.
If you don't mind evaluating everything twice, just use .filter two times.
If you don't mind converting to hot, do it with .publish (or .publish.refCount so you don't need to connect manually).
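A hedged sketch of these first two options (source, the predicate p, and the in-scope Scheduler are illustrative assumptions):
import monix.execution.Scheduler.Implicits.global
import monix.reactive.Observable

val source: Observable[Int] = Observable(1, 2, 3, 4, 5)
val p: Int => Boolean = _ % 2 == 0

// option 1: two independent subscriptions; the source is evaluated twice
val (coldLeft, coldRight) = (source.filter(x => !p(x)), source.filter(p))

// option 2: make it hot once and share a single subscription
val hot = source.publish.refCount
val (hotLeft, hotRight) = (hot.filter(x => !p(x)), hot.filter(p))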
If you want to preserve the cold/hot property and process the two pieces in parallel, there's a publishSelector method that lets you treat any observable like a hot one in a limited scope:
coldOrHot.publishSelector { totallyHot =>
  val s1 = totallyHot.filter(...).flatMap(...) // any processing
  val s2 = totallyHot.filter(...).mapEval(...) // any processing 2
  Observable(s1, s2).merge
}
Its limitation, apart from scope, is that the result of the inner lambda has to be another Observable (which is what publishSelector returns), so you can't have a helper with the signature you want. But the result will still be cold if the original was cold.
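A concrete, hedged version of that sketch (the predicates and formatting are illustrative, assuming Monix 3.x):
import monix.execution.Scheduler.Implicits.global
import monix.reactive.Observable

// stays cold if the input was cold; both branches share a single pass over the source
val processed: Observable[String] =
  Observable(1, 2, 3, 4, 5).publishSelector { totallyHot =>
    val evens = totallyHot.filter(_ % 2 == 0).map(i => s"even: $i")
    val odds  = totallyHot.filter(_ % 2 != 0).map(i => s"odd: $i")
    Observable(evens, odds).merge
  }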

Akka streams — filtering by the number of elements in stream

I'm writing an app in Scala and I'm using Akka streams.
At one point, I need to filter out streams that have fewer than N elements, for a given N. So, for example, with N=5:
Source(List(1,2,3)).via(myFilter) // => List()
Source(List(1,2,3,4)).via(myFilter) // => List()
will become empty streams, and
Source(List(1,2,3,4,5)).via(myFilter) // => List(1,2,3,4,5)
Source(List(1,2,3,4,5,6)).via(myFilter) // => List(1,2,3,4,5,6)
will be unchanged.
Of course, we can't know the number of elements in the stream until it completes, and waiting for the end before pushing anything downstream might not be the best idea.
So, instead, I've thought about the following algorithm:
for the first N-1 elements, just buffer them, without passing further;
if the input stream finishes before reaching the Nth element, output an empty stream;
if the input stream reaches Nth element, output the buffered N-1 elements, then output the Nth element, then pass all the following elements that come.
However, I have no idea how to build a Flow element implementing it. Are there some built-in Akka elements I could use?
Edit:
Okay, so I played with it yesterday and came up with something like this:
Flow[Int]
  .prefixAndTail(N)
  .flatMapConcat {
    case (prefix, tail) if prefix.length == N =>
      Source(prefix).concat(tail)
    case _ =>
      Source.empty[Int]
  }
Will it do what I want?
Perhaps statefulMapConcat could help you:
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}
import akka.stream.{ActorMaterializer, Materializer}
import scala.collection.mutable.ListBuffer
import scala.concurrent.ExecutionContext

object StatefulMapConcatExample extends App {
  implicit val system: ActorSystem = ActorSystem()
  implicit val materializer: Materializer = ActorMaterializer()
  implicit val ec: ExecutionContext = scala.concurrent.ExecutionContext.Implicits.global

  def filterLessThen(threshold: Int): (Int) => List[Int] = {
    var buffering = true
    val buffer: ListBuffer[Int] = ListBuffer()

    (elem: Int) =>
      if (buffering) {
        buffer += elem
        if (buffer.size < threshold) {
          Nil
        } else {
          buffering = false
          buffer.toList
        }
      } else {
        List(elem)
      }
  }

  // Vector()
  Source(List(1, 2, 3)).statefulMapConcat(() => filterLessThen(5))
    .runWith(Sink.seq).map(println)

  // Vector()
  Source(List(1, 2, 3, 4)).statefulMapConcat(() => filterLessThen(5))
    .runWith(Sink.seq).map(println)

  // Vector(1, 2, 3, 4, 5)
  Source(List(1, 2, 3, 4, 5)).statefulMapConcat(() => filterLessThen(5))
    .runWith(Sink.seq).map(println)

  // Vector(1, 2, 3, 4, 5, 6)
  Source(List(1, 2, 3, 4, 5, 6)).statefulMapConcat(() => filterLessThen(5))
    .runWith(Sink.seq).map(println)
}
This may be one of those instances where a little "state" can go a long way. Even though the solution is not "purely functional", the mutable state is isolated and unreachable by the rest of the system. I think this is one of the beauties of Scala: when an FP solution isn't obvious, you can always revert to imperative code in an isolated manner...
The completed Flow will be a combination of multiple sub-parts. The first Flow will just group your elements into sequences of size N:
val group: Int => Flow[Int, Seq[Int], _] =
  (N) => Flow[Int].grouped(N)
Now for the non-functional part, a filter that will only allow the grouped Seq values through if the first sequence was the right size:
val minSizeRequirement: Int => Seq[Int] => Boolean =
  (minSize) => {
    var isFirst: Boolean = true
    var passedMinSize: Boolean = false

    (testSeq) => {
      if (isFirst) {
        isFirst = false
        passedMinSize = testSeq.size >= minSize
        passedMinSize
      } else
        passedMinSize
    }
  }

val minSizeFilter: Int => Flow[Seq[Int], Seq[Int], _] =
  (minSize) => Flow[Seq[Int]].filter(minSizeRequirement(minSize))
The last step is to convert the Seq[Int] values back into Int values:
val flatten = Flow[Seq[Int]].flatMapConcat(l => Source(l))
Finally, combine them all together:
val combinedFlow: Int => Flow[Int, Int, _] =
  (minSize) =>
    group(minSize)
      .via(minSizeFilter(minSize))
      .via(flatten)
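A hedged smoke test of the combined flow (the ActorSystem wiring mirrors the statefulMapConcat example above and is an assumption, not part of the answer):
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Sink, Source}

implicit val system: ActorSystem = ActorSystem()
implicit val materializer: ActorMaterializer = ActorMaterializer()
import system.dispatcher

// expected: Vector() for the short stream, Vector(1, 2, 3, 4, 5, 6) for the long one
Source(1 to 4).via(combinedFlow(5)).runWith(Sink.seq).foreach(println)
Source(1 to 6).via(combinedFlow(5)).runWith(Sink.seq).foreach(println)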

Scala: how to traverse stream/iterator collecting results into several different collections

I'm going through a log file that is too big to fit into memory, collecting 2 types of expressions. What is a better functional alternative to my iterative snippet below?
def streamData(file: File, errorPat: Regex, loginPat: Regex): List[(String, String)] = {
  val lines: Iterator[String] = io.Source.fromFile(file).getLines()
  val logins: mutable.Map[String, String] = new mutable.HashMap[String, String]()
  val errors: mutable.ListBuffer[(String, String)] = mutable.ListBuffer.empty

  for (line <- lines) {
    line match {
      case errorPat(date, ip)           => errors.append((ip, date))
      case loginPat(date, user, ip, id) => logins.put(ip, id)
      case _                            => ""
    }
  }

  errors.toList.map(line => (logins.getOrElse(line._1, "none") + " " + line._1, line._2))
}
Here is a possible solution:
def streamData(file: File, errorPat: Regex, loginPat: Regex): List[(String, String)] = {
  val lines = Source.fromFile(file).getLines
  val (err, log) = lines.collect {
    case errorPat(inf, ip)      => (Some((ip, inf)), None)
    case loginPat(_, _, ip, id) => (None, Some((ip, id)))
  }.toList.unzip
  val ip2id = log.flatten.toMap
  err.collect { case Some((ip, inf)) => (ip2id.getOrElse(ip, "none") + " " + ip, inf) }
}
Corrections:
1) removed unnecessary type declarations
2) tuple deconstruction instead of the ugly ._1
3) left fold instead of mutable accumulators
4) used the more convenient operator-like methods :+ and +
def streamData(file: File, errorPat: Regex, loginPat: Regex): List[(String, String)] = {
  val lines = io.Source.fromFile(file).getLines()

  val (logins, errors) =
    ((Map.empty[String, String], Seq.empty[(String, String)]) /: lines) {
      case ((loginsAcc, errorsAcc), next) =>
        next match {
          case errorPat(date, ip)           => (loginsAcc, errorsAcc :+ (ip -> date))
          case loginPat(date, user, ip, id) => (loginsAcc + (ip -> id), errorsAcc)
          case _                            => (loginsAcc, errorsAcc)
        }
    }

  // more concise equivalent of
  // errors.toList.map { case (ip, date) => (logins.getOrElse(ip, "none") + " " + ip) -> date }
  for ((ip, date) <- errors.toList)
    yield (logins.getOrElse(ip, "none") + " " + ip) -> date
}
I have a few suggestions:
Instead of a pair/tuple, it's often better to use your own class. It gives meaningful names to both the type and its fields, which makes the code much more readable.
Split the code into small parts. In particular, try to decouple pieces of code that don't need to be tied together. This makes your code easier to understand, more robust, less prone to errors and easier to test. In your case it'd be good to separate producing your input (lines of a log file) and consuming it to produce a result. For example, you'd be able to make automatic tests for your function without having to store sample data in a file.
As an example and exercise, I tried to make a solution based on Scalaz iteratees. It's a bit longer (includes some auxiliary code for IteratorEnumerator) and perhaps it's a bit overkill for the task, but perhaps someone will find it helpful.
import java.io._;
import scala.util.matching.Regex
import scalaz._
import scalaz.IterV._

object MyApp extends App {
  // A type for the result. Having names keeps things
  // clearer and shorter.
  type LogResult = List[(String, String)]

  // Represents a state of our computation. Not only does it
  // give a name to the data, we can also put here
  // functions that modify the state. This nicely
  // separates what we're computing and how.
  sealed case class State(
    logins: Map[String, String],
    errors: Seq[(String, String)]
  ) {
    def this() = {
      this(Map.empty[String, String], Seq.empty[(String, String)])
    }

    def addError(date: String, ip: String): State =
      State(logins, errors :+ (ip -> date));

    def addLogin(ip: String, id: String): State =
      State(logins + (ip -> id), errors);

    // Produce the final result from accumulated data.
    def result: LogResult =
      for ((ip, date) <- errors.toList)
        yield (logins.getOrElse(ip, "none") + " " + ip) -> date
  }

  // An iteratee that consumes lines of our input. Based
  // on the given regular expressions, it produces an
  // iteratee that parses the input and uses State to
  // compute the result.
  def logIteratee(errorPat: Regex, loginPat: Regex): IterV[String, List[(String, String)]] = {
    // Consumes a single line.
    def consume(line: String, state: State): State =
      line match {
        case errorPat(date, ip)           => state.addError(date, ip);
        case loginPat(date, user, ip, id) => state.addLogin(ip, id);
        case _                            => state
      }

    // The core of the iteratee. Every time we consume a
    // line, we update our state. When done, compute the
    // final result.
    def step(state: State)(s: Input[String]): IterV[String, LogResult] =
      s(el = line => Cont(step(consume(line, state))),
        empty = Cont(step(state)),
        eof = Done(state.result, EOF[String]))

    // Return the iteratee waiting for its first input.
    Cont(step(new State()));
  }

  // Converts an iterator into an enumerator. This
  // should more likely be moved to Scalaz.
  // Adapted from scalaz.ExampleIteratee
  implicit val IteratorEnumerator = new Enumerator[Iterator] {
    @annotation.tailrec def apply[E, A](e: Iterator[E], i: IterV[E, A]): IterV[E, A] = {
      val next: Option[(Iterator[E], IterV[E, A])] =
        if (e.hasNext) {
          val x = e.next();
          i.fold(done = (_, _) => None, cont = k => Some((e, k(El(x)))))
        } else
          None;
      next match {
        case None           => i
        case Some((es, is)) => apply(es, is)
      }
    }
  }

  // main ---------------------------------------------------
  {
    // Read a file as an iterator of lines:
    // val lines: Iterator[String] =
    //   io.Source.fromFile("test.log").getLines();

    // Create our testing iterator:
    val lines: Iterator[String] = Seq(
      "Error: 2012/03 1.2.3.4",
      "Login: 2012/03 user 1.2.3.4 Joe",
      "Error: 2012/03 1.2.3.5",
      "Error: 2012/04 1.2.3.4"
    ).iterator;

    // Create an iteratee.
    val iter = logIteratee("Error: (\\S+) (\\S+)".r,
                           "Login: (\\S+) (\\S+) (\\S+) (\\S+)".r);

    // Run the iteratee against the input
    // (the enumerator is implicit).
    println(iter(lines).run);
  }
}