Apache Flink : Creating a Lagged Datastream - scala

I am just starting out with Apache Flink using Scala. Can someone please tell me how to create a lagged stream (lagged by k events or k units of time) from a data stream that I already have?
Basically, I want to implement an autoregressive model (linear regression of the stream on a time-lagged version of itself) on a data stream. So I need a method similar to the following pseudo code.
val ds : DataStream = ...
val laggedDS : DataStream = ds.map(lag _)
def lag(ds : DataStream, k : Time) : DataStream = { ... }
I expect the sample input and output like this if every event is spaced at 1 second interval and there is a 2 second lag.
Input : 1, 2, 3, 4, 5, 6, 7...
Output: NA, NA, 1, 2, 3, 4, 5...

Assuming I understood your requirements correctly, I would implement this as a FlatMapFunction with a FIFO queue. The queue buffers k events and emits the head whenever a new event arrives. If you need a fault-tolerant streaming application, the queue must be registered as state. Flink will then take care of checkpointing the state (i.e., the queue) and restoring it in case of a failure.
The FlatMapFunction could look like this:
class Lagger(val k: Int)
    extends FlatMapFunction[X, X]
    with Checkpointed[mutable.Queue[X]] {

  var fifo: mutable.Queue[X] = new mutable.Queue[X]()

  override def flatMap(value: X, out: Collector[X]): Unit = {
    // add new element to queue
    fifo.enqueue(value)
    if (fifo.size == k + 1) {
      // remove head element and emit
      out.collect(fifo.dequeue())
    }
  }

  // restore state
  override def restoreState(state: mutable.Queue[X]) = { fifo = state }

  // get state to checkpoint
  override def snapshotState(cId: Long, cTS: Long): mutable.Queue[X] = fifo
}
Returning elements with a time lag is more involved. This would require timer threads for the emission because the function is only called when a new element arrives.
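For what it's worth, the count-based queue logic can be tried without a Flink cluster. The following is only a minimal sketch in plain Scala, not Flink code: Lag, push and LagDemo are names made up for this illustration, and None stands in for the NA values in the question.

```scala
import scala.collection.mutable

// Hypothetical helper (not part of Flink): buffers k elements and
// returns the element that arrived k events earlier, if there is one.
final class Lag[A](k: Int) {
  private val fifo = mutable.Queue.empty[A]

  def push(value: A): Option[A] = {
    fifo.enqueue(value)                 // add new element to queue
    if (fifo.size == k + 1) Some(fifo.dequeue()) // emit the head once k + 1 are buffered
    else None                           // nothing to emit yet ("NA")
  }
}

object LagDemo {
  private val lag = new Lag[Int](2)

  // Feed the sample input 1..7 through a lag of two events.
  val out: Vector[Option[Int]] = (1 to 7).toVector.map(lag.push)

  def main(args: Array[String]): Unit = println(out)
}
```

With k = 2 and the input 1 to 7 this gives None, None, Some(1), Some(2), Some(3), Some(4), Some(5), which matches the sample output in the question.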


Are recursive computations with Apache Spark RDD possible?

I'm developing a chess engine using Scala and Apache Spark (and I need to stress that my sanity is not the topic of this question). My problem is that the Negamax algorithm is recursive in its essence, and when I try a naive approach:
class NegaMaxSparc(@transient val sc: SparkContext) extends Serializable {
  val movesOrdering = new Ordering[Tuple2[Move, Double]]() {
    override def compare(x: (Move, Double), y: (Move, Double)): Int =
      Ordering[Double].compare(x._2, y._2)
  }

  def negaMaxSparkHelper(game: Game, color: PieceColor, depth: Int, previousMovesPar: RDD[Move]): (Move, Double) = {
    val board = game.board
    if (depth == 0) {
      (null, NegaMax.evaluateDefault(game, color))
    } else {
      val moves = board.possibleMovesForColor(color)
      val movesPar = previousMovesPar.context.parallelize(moves)
      val moveMappingFunc = (m: Move) => { negaMaxSparkHelper(new Game(board.boardByMakingMove(m), color.oppositeColor, null), color.oppositeColor, depth - 1, movesPar) }
      val movesWithScorePar = movesPar.map(moveMappingFunc)
      val move = movesWithScorePar.min()(movesOrdering)
      (move._1, -move._2)
    }
  }

  def negaMaxSpark(game: Game, color: PieceColor, depth: Int): (Move, Double) = {
    if (depth == 0) {
      (null, NegaMax.evaluateDefault(game, color))
    } else {
      val movesPar = sc.parallelize(new Array[Move](0))
      negaMaxSparkHelper(game, color, depth, movesPar)
    }
  }
}

class NegaMaxSparkBot(val maxDepth: Int, sc: SparkContext) extends Bot {
  def nextMove(game: Game): Move = {
    val nms = new NegaMaxSparc(sc)
    nms.negaMaxSpark(game, game.colorToMove, maxDepth)._1
  }
}
I get:
org.apache.spark.SparkException: RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
The question is: can this algorithm be implemented recursively using Spark? If not, then what is the proper Spark-way to solve that problem?
Only the driver can launch computations on RDDs. The reason is that even though RDDs "feel" like regular collections of data, behind the scenes they are still distributed collections, so launching operations on them requires coordinating the execution of tasks on all remote slaves, which Spark hides from us most of the time.
So recursing from the slaves, i.e. launching new distributed tasks dynamically and directly from slaves, is not possible: only the driver can take care of such coordination.
Here's a possible alternative based on a simplification of your problem (if I understand it correctly). The idea is to successively build instances of Moves, each one representing the full sequence of Move from the initial state.
Each instance of Moves is able to transform itself into a set of Moves, each one corresponding to the same sequence of Move plus one possible next Move.
From there the driver just has to successively flatMap the Moves for as deep as we want, and the resulting RDD[Moves] will execute all operations in parallel for us.
The downside of the approach is that all depth levels are kept synchronized, i.e. we have to compute all moves at level n (i.e. the RDD[Moves] for level n) before going to the next one.
The code below is not tested, it probably has flaws and does not even compile, but hopefully it provides an idea on how to approach the problem.
/* one modification to the board */
case class Move(from: String, to: String)

case class PieceColor(color: String)

/* state of the game */
case class Board() {
  def possibleMovesForColor(color: PieceColor): Seq[Move] =
    Move("here", "there") :: Move("there", "over there") :: Move("there", "here") :: Nil

  // TODO: compute a new instance of board here, based on current + this move
  def update(move: Move): Board = new Board
}

/** Solution, i.e. a sequence of moves */
case class Moves(moves: Seq[Move], game: Board, color: PieceColor) {
  lazy val score = NegaMax.evaluateDefault(game, color)

  /** @return all valid next Moves */
  def nextPossibleMoves: Seq[Moves] =
    game.possibleMovesForColor(color).map { nextMove =>
      copy(moves = nextMove +: moves, game = game.update(nextMove))
    }
}

/** Driver code: negaMax: looks for the best next move from a given game state */
def negaMax(sc: SparkContext, game: Board, color: PieceColor, maxDepth: Int): Moves = {
  val initialSolution = Moves(Seq.empty[Move], game, color)
  val allPlays: RDD[Moves] =
    (1 to maxDepth).foldLeft(sc.parallelize(Seq(initialSolution))) {
      (rdd, _) => rdd.flatMap(_.nextPossibleMoves)
    }
  allPlays.reduce { case (m1, m2) => if (m1.score < m2.score) m1 else m2 }
}
This is a limitation that makes sense in terms of the implementation, but it can be a pain to work with.
You can try pulling the recursion out to the top level, into the "driver" code that creates and operates on RDDs. Something like:
def step(rdd: RDD[Move], limit: Int): RDD[Move] =
  if (0 == limit) rdd
  else {
    val newRdd = rdd.flatMap(...)
    step(newRdd, limit - 1)
  }
Alternately it's always possible to translate recursion into iteration, by managing the "stack" explicitly by hand (although it may result in more cumbersome code).
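To make the "explicit stack" idea concrete, here is a small sketch in plain Scala (no Spark involved). It enumerates all move sequences up to a given depth with a while loop instead of recursion. The nextMoves function is a hypothetical stand-in for a real move generator and simply offers the same three moves everywhere.

```scala
object ExplicitStackDemo {
  // Hypothetical move generator: every position offers the same three moves.
  def nextMoves(path: List[String]): List[String] = List("a", "b", "c")

  /** Enumerates all move sequences of length `depth` without recursion. */
  def allPaths(depth: Int): List[List[String]] = {
    // Each stack entry is a partial path, most recent move first.
    var stack: List[List[String]] = List(Nil)
    var done: List[List[String]] = Nil
    while (stack.nonEmpty) {
      val path = stack.head
      stack = stack.tail
      if (path.length == depth) done = path.reverse :: done   // complete sequence
      else stack = nextMoves(path).map(_ :: path) ::: stack   // push the children
    }
    done.reverse
  }

  def main(args: Array[String]): Unit = allPaths(2).foreach(println)
}
```

The stack plays the role of the call stack, so the same loop could run entirely in the driver while each step hands the actual work to an RDD operation.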

Splitting a scalaz-stream process into two child streams

Using scalaz-stream is it possible to split/fork and then rejoin a stream?
As an example, let's say I have the following code
val streamOfNumbers : Process[Task,Int] = Process.emitAll(1 to 10)
val sumOfEvenNumbers = streamOfNumbers.filter(isEven).fold(0)(add)
val sumOfOddNumbers = streamOfNumbers.filter(isOdd).fold(0)(add)
zip(sumOfEvenNumbers, sumOfOddNumbers).to(someEffectfulFunction)
With scalaz-stream, in this example the results would be as you expect - a tuple of numbers from 1 to 10 passed to a sink.
However if we replace streamOfNumbers with something that requires IO, it will actually execute the IO action twice.
Using a Topic I'm able to create a pub/sub process that duplicates elements in the stream correctly. However, it does not buffer: it simply consumes the entire source as fast as possible, regardless of the pace at which the sinks consume it.
I can wrap this in a bounded Queue, however the end result feels a lot more complex than it needs to be.
Is there a simpler way of splitting a stream in scalaz-stream without duplicate IO actions from the source?
Also, to clarify: the previous answer deals with the "splitting" requirement. Your specific issue may be solvable without splitting streams at all:
val streamOfNumbers : Process[Task,Int] = Process.emitAll(1 to 10)
val oddOrEven: Process[Task,Int\/Int] = streamOfNumbers.map {
  case even if even % 2 == 0 => right(even)
  case odd => left(odd)
}
val summed = oddOrEven.pipeW(sump1).pipeO(sump1)
val evenSink: Sink[Task,Int] = ???
val oddSink: Sink[Task,Int] = ???
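To illustrate the idea of tagging each element instead of splitting the stream, here is a sketch using plain Scala collections rather than scalaz-stream, so none of it is scalaz-stream API. The reads counter and the source iterator are hypothetical stand-ins for a source that performs IO on every element.

```scala
object SinglePassSums {
  // Counts how often the (pretend) IO source is actually read.
  var reads = 0

  // Hypothetical effectful source: each element costs one "read".
  def source: Iterator[Int] = Iterator.range(1, 11).map { n => reads += 1; n }

  /** Returns (sum of odd numbers, sum of even numbers) in a single pass. */
  def sums(): (Int, Int) = {
    // Tag odd numbers as Left and even numbers as Right, as in the answer above.
    val tagged: Iterator[Either[Int, Int]] =
      source.map(n => if (n % 2 == 0) Right(n) else Left(n))
    tagged.foldLeft((0, 0)) {
      case ((odd, even), Left(n))  => (odd + n, even)
      case ((odd, even), Right(n)) => (odd, even + n)
    }
  }

  def main(args: Array[String]): Unit = println(sums())
}
```

Because both sums are accumulated while walking the source once, each element is read exactly once, which is the property the question is after.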
You can perhaps still use a topic and just make sure that the child processes subscribe before you start pushing to the topic.
However, please note that this solution is unbounded, i.e. if you push too fast, you may run into an OOM error.
def split[A](source: Process[Task, A]): Process[Task, (Process[Task, A], Process[Task, A])] = {
  val topic = async.topic[A]
  val sub1 = topic.subscribe
  val sub2 = topic.subscribe
  merge.mergeN(Process(emit(sub1->sub2),(source to topic.publish).drain))
}
I likewise needed this functionality. My situation was quite a bit trickier, which prevented me from working around it in this manner.
Thanks to Daniel Spiewak's response in this thread, I was able to get the following to work. I improved on his solution by adding onHalt so my application would exit once the Process completed.
def split[A](p: Process[Task, A], limit: Int = 10): Process[Task, (Process[Task, A], Process[Task, A])] = {
  val left = async.boundedQueue[A](limit)
  val right = async.boundedQueue[A](limit)
  val enqueue = p.observe(left.enqueue).observe(right.enqueue).drain.onHalt { cause =>
    Process.await(Task.gatherUnordered(Seq(left.close, right.close))){ _ => Halt(cause) }
  }
  val dequeue = Process((left.dequeue, right.dequeue))
  enqueue merge dequeue
}

Factorial calculation using Scala actors

How can I compute the factorial using Scala actors?
And would it be more time-efficient than, for instance,
def factorial(n: Int): BigInt = (BigInt(1) to BigInt(n)).par.product
Many Thanks.
You have to split up your input into partial products. These partial products can then be calculated in parallel and are finally multiplied to get the final product.
This can be reduced to a broader class of problems: The so called Parallel prefix calculation. You can read up about it on Wikipedia.
Short version: When you calculate a*b*c*d with an associative operation _ * _, you can structure the calculation a*(b*(c*d)) or (a*b)*(c*d). With the second approach, you can then calculate a*b and c*d in parallel and then calculate the final result from these partial results. Of course you can do this recursively, when you have a bigger number of input values.
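The (a*b)*(c*d) grouping can be sketched with standard-library futures instead of actors. This is only an illustration of the splitting idea under my own assumptions: TreeReduce, reduce and the threshold parameter are made-up names, and string concatenation is used in the demo because it is associative but not commutative, so it shows that the order of the operands is preserved.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object TreeReduce {
  /** Reduces xs with an associative op by splitting it in half recursively.
    * Both halves are evaluated as independent futures. */
  def reduce[A](xs: Vector[A], threshold: Int = 2)(op: (A, A) => A): Future[A] = {
    require(xs.nonEmpty && threshold >= 1, "need at least one element and threshold >= 1")
    if (xs.length <= threshold) Future(xs.reduceLeft(op)) // small enough: compute directly
    else {
      val (l, r) = xs.splitAt(xs.length / 2)
      val left = reduce(l, threshold)(op)   // both futures are started here ...
      val right = reduce(r, threshold)(op)
      for (a <- left; b <- right) yield op(a, b) // ... and combined when both are done
    }
  }

  // Demo values, blocking only at the very end.
  def concatDemo: String =
    Await.result(reduce(Vector("a", "b", "c", "d", "e"))(_ + _), 10.seconds)

  def sumDemo: Int =
    Await.result(reduce((1 to 100).toVector)(_ + _), 10.seconds)

  def main(args: Array[String]): Unit = println((concatDemo, sumDemo))
}
```

Only associativity is needed for this to be correct, which is why the non-commutative string example still comes out in the original order.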
This sounds a little bit like a homework assignment. So I will provide a solution that has two properties:
1) It contains a small bug
2) It shows how to solve parallel prefix in general, without solving the problem directly
So you can see how the solution should be structured, but no one can use it to cheat on her homework.
Solution in detail
First I need a few imports
import akka.event.Logging
import java.util.concurrent.TimeUnit
import scala.concurrent.duration.FiniteDuration
import akka.actor._
Then I create some helper classes for the communication between the actors
case class Calculate[T](values : Seq[T], index : Int, parallelLimit : Int, fn : (T,T) => T)
trait CalculateResponse
case class CalculationResult[T](result : T, index : Int) extends CalculateResponse
case object Busy extends CalculateResponse
Instead of telling the receiver you are busy, the actor could also use the stash or implement its own queue for partial results. But in this case I think the sender should decide how many parallel calculations are allowed.
Now I create the actor:
class ParallelPrefixActor[T] extends Actor {
  val log = Logging(context.system, this)
  val subCalculation = Props(classOf[ParallelPrefixActor[BigInt]])
  val fanOut = 2

  def receive = waitForCalculation

  def waitForCalculation : Actor.Receive = {
    case c : Calculate[T] =>
      log.debug(s"Start calculation for ${c.values.length} values, segment nr. ${c.index}, from ${c.values.head} to ${c.values.last}")
      if (c.values.length < c.parallelLimit) {
        log.debug("Calculating result direct")
        val result = c.values.reduceLeft(c.fn)
        sender ! CalculationResult(result, c.index)
      } else {
        val groupSize: Int = Math.max(1, (c.values.length / fanOut) + Math.min(c.values.length % fanOut, 1))
        log.debug(s"Splitting calculation for ${c.values.length} values up to ${fanOut} children, ${groupSize} elements each, limit ${c.parallelLimit}")
        def segments=c.values.grouped(groupSize)
        log.debug("Starting children")
        segments.zipWithIndex.foreach{case (values, index) =>
          context.actorOf(subCalculation) ! c.copy(values = values, index = index)
        }
        val partialResults: Vector[T] = segments.map(_.head).to[Vector]
        log.debug(s"Waiting for ${partialResults.length} results (${partialResults.indices})")
        context.become(waitForResults(segments.length, partialResults, c, sender), discardOld = true)
      }
  }

  def waitForResults(outstandingResults : Int, partialResults : Vector[T], originalRequest : Calculate[T], originalSender : ActorRef) : Actor.Receive = {
    case c : Calculate[_] => sender ! Busy
    case r : CalculationResult[T] =>
      log.debug(s"Putting result ${r.result} on position ${r.index} in ${partialResults.length}")
      val updatedResults = partialResults.updated(r.index, r.result)
      log.debug("Killing sub-worker")
      sender ! PoisonPill
      if (outstandingResults==1) {
        log.debug("Calculating result from partial results")
        val result = updatedResults.reduceLeft(originalRequest.fn)
        originalSender ! CalculationResult(result, originalRequest.index)
        context.become(waitForCalculation, discardOld = true)
      } else {
        log.debug(s"Still waiting for ${outstandingResults-1} results")
        // For fanOut > 2 one could here already combine consecutive partial results
        context.become(waitForResults(outstandingResults-1, updatedResults, originalRequest, originalSender), discardOld = true)
      }
  }
}
Using parallel prefix calculation is not optimal. The actors calculating the product of the bigger numbers will do much more work than the actors calculating the product of the smaller numbers (e.g. when calculating 1 * ... * 100, it is faster to calculate 1 * ... * 10 than 90 * ... * 100). So it might be a good idea to shuffle the numbers, so that big numbers are mixed with small numbers. This works in this case because we use a commutative operation. Parallel prefix calculation in general only needs an associative operation to work.
In theory
Performance of the actor solution is worse than the "naive" solution (using parallel collections) for small amounts of data. The actor solution will shine when you make complex calculations or distribute your calculation to specialized hardware (e.g. a graphics card or FPGA) or across multiple machines. With actors you can control who does which calculation, and you can even restart "hanging" calculations. This can give a big speed-up.
On a single machine, the actor solution might help when you have a non-uniform memory architecture. You could then organize the actors in a way that pins memory to a certain processor.
Some measurement
I did some real performance measurement using a Scala worksheet in IntelliJ IDEA.
First I set up the actor system:
// Setup the actor system
val system = ActorSystem("root")
// Start one calculation actor
val calculationStart = Props(classOf[ParallelPrefixActor[BigInt]])
val calcolon = system.actorOf(calculationStart, "Calcolon-BigInt")
val inbox = Inbox.create(system)
Then I defined a helper method to measure time:
// Helper function to measure time
def time[A] (id : String)(f: => A) = {
  val start = System.nanoTime()
  val result = f
  val stop = System.nanoTime()
  println(s"""Time for "${id}": ${(stop-start)*1e-6d}ms""")
  result
}
And then I did some performance measurement:
// Test code
val limit = 10000
def testRange = (1 to limit).map(BigInt(_))
time("par product")(testRange.par.product)
val timeOut = FiniteDuration(240, TimeUnit.SECONDS)
inbox.send(calcolon, Calculate[BigInt]((1 to limit).map(BigInt(_)), 0, 10, _ * _))
time("actor product")(inbox.receive(timeOut))
time("par sum")(testRange.par.sum)
inbox.send(calcolon, Calculate[BigInt](testRange, 0, 5, _ + _))
time("actor sum")(inbox.receive(timeOut))
I got the following results
> Time for "par product": 134.38289ms
res0: scala.math.BigInt = 284625968091705451890641321211986889014805140170279923
Time for "actor product": 1310.217247ms
res2: Any = CalculationResult(28462596809170545189064132121198688901480514017027
> Time for "par sum": 6.488620999999999ms
res3: scala.math.BigInt = 50005000
> Time for "actor sum": 657.752832ms
res5: Any = CalculationResult(50005000,0)
You can easily see that the actor version is much slower than using parallel collections.

Scala View + Stream combo causing OutOfMemory Error. How do I replace it with a View?

I was looking at solving a very simple problem, Eratosthenes sieve, using idiomatic Scala, for learning purposes.
I've learned that a Stream caches its elements, so it is not very performant for determining the nth element: access is O(n) and every element visited is memoised, which makes it unsuitable for this situation.
def primes(nums: Stream[Int]): Stream[Int] = {
  Stream.cons(nums.head,
    primes((nums tail) filter (x => x % nums.head != 0)))
}

def ints(n: Int): Stream[Int] = {
  Stream.cons(n, ints(n + 1))
}

def nthPrime(n: Int): Int = {
  val prim = primes(ints(2)).view take n toList;
  return prim(n - 1);
}
The Integer stream is the problematic one. While the prime number filtering is done, the JVM runs out of memory. What is the correct way to achieve the same functionality without using Streams?
Basically take a view of primes from a view of ints and display the last element, without memoisation?
I have had similar cases where a stream was a good idea, but I did not need to store its values. In order to consume the stream without storing its values I created (what I called) ThrowAwayIterator:
class ThrowAwayIterator[T](var stream: Stream[T]) extends Iterator[T] {
  def hasNext: Boolean = stream.nonEmpty
  def next(): T = {
    val next = stream.head
    stream = stream.tail
    next
  }
}
Make sure that you do not store a reference to the instance of stream that is passed in.
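If memoisation is the only concern, another option is to avoid Stream altogether and work with Iterator, which never keeps already consumed elements. The sketch below is my own assumption of how that could look and deliberately swaps the technique: it uses plain trial division instead of the chained-filter sieve from the question. NthPrime, isPrime and nthPrime are illustrative names.

```scala
object NthPrime {
  /** Trial division up to sqrt(n); the Long multiplication avoids Int overflow. */
  def isPrime(n: Int): Boolean =
    n >= 2 && Iterator.from(2).takeWhile(d => d.toLong * d <= n).forall(n % _ != 0)

  /** Returns the n-th prime (1-based) without holding on to earlier candidates. */
  def nthPrime(n: Int): Int = {
    require(n >= 1, "n must be at least 1")
    Iterator.from(2).filter(isPrime).drop(n - 1).next()
  }

  def main(args: Array[String]): Unit = println(nthPrime(100))
}
```

Nothing here holds a reference to the head of a lazy sequence, so memory use stays constant no matter how large n gets.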

akka split task into smaller and fold results

The question is about the Akka actors library. I want to split one big task into smaller tasks and then fold their results into one 'big' result. This should make the computation faster, since the smaller tasks can be computed in parallel if they are independent.
Assume that we need to compute something like this. The function count2X is time-consuming, so calling it several times in one thread is not optimal.
def count2X(x: Int) = {
  x * 2
}
val sum = count2X(1) + count2X(2) + count2X(3)
And here goes the question.
How to dispatch tasks and collect results and then fold them, all using akka actors?
Is such functionality already provided by Akka or do I need to implement it myself? What are the best practices for such an approach?
Here is 'visual' interpretation of my question:
/-> [SMALL_TASK_1] -\
\-> [SMALL_TASK_1] -/
Below is my scaffold implementation with missing/bad implementation :)
case class Count2X(x: Int)
class Count2XActor extends Actor {
  def receive = {
    case Count2X(x) => count2X(x); // AND NOW WHAT ?
  }
}
case class CountSumOf2X(a: Int, b: Int, c: Int)
class SumOf2XActor extends Actor {
  val aCounter = context.actorOf(Props[Count2XActor])
  val bCounter = context.actorOf(Props[Count2XActor])
  val cCounter = context.actorOf(Props[Count2XActor])
  def receive = {
    case CountSumOf2X(a, b, c) => // AND NOW WHAT ? aCounter ! Count2X(a); bCounter ! Count2X(b); cCounter ! Count2X(c);
  }
}
val aSystem = ActorSystem("mySystem")
val actor = aSystem.actorOf(Props[SumOf2XActor])
actor ! CountSumOf2X(10, 20, 30)
Thanks for any help.
In Akka I would do something like this:
// requires akka.pattern.ask, an implicit akka.util.Timeout and an implicit ExecutionContext in scope
val a = (aCounter ? Count2X(10)).mapTo[Int]
val b = (bCounter ? Count2X(10)).mapTo[Int]
val c = (cCounter ? Count2X(10)).mapTo[Int]
Await.result(Future.sequence(Seq(a, b, c)).map(_.sum), 1.second)
I'm sure there is a better way. Here the three requests run in parallel, but the summing only starts after all the futures have completed, and Await blocks the calling thread until then. For a simple task that's OK, but in general you shouldn't block for that long.
Two things you could do:
1) Use Akka futures. These allow you to dispatch operations and fold on them in an asynchronous manner. Check out http://doc.akka.io/docs/akka/2.0.4/scala/futures.html for more information.
2) You can dispatch work to multiple "worker" actors and then have a "master" actor aggregate them, keeping track of which messages are pending/processed by storing information in the messages themselves. I have a simple stock quote example of this using Akka actors here: https://github.com/ryanlecompte/quotes
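The futures-based option can be sketched with standard-library futures alone, without any actors. This is a minimal illustration under my own assumptions rather than Akka-specific code: SplitAndFold and sumOf2X are made-up names, and count2X is the cheap placeholder from the question.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object SplitAndFold {
  // Placeholder for the time-consuming function from the question.
  def count2X(x: Int): Int = x * 2

  /** Dispatches one future per input and folds the results once all are done. */
  def sumOf2X(xs: Seq[Int]): Future[Int] =
    Future.traverse(xs)(x => Future(count2X(x))).map(_.sum)

  // Blocks only here, at the edge of the program.
  def result: Int = Await.result(sumOf2X(Seq(10, 20, 30)), 5.seconds)

  def main(args: Array[String]): Unit = println(result)
}
```

The small tasks run independently on the execution context, and the fold is expressed as a map on the combined future, so nothing blocks until the final Await.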