Apache Flink: Creating a Lagged DataStream - Scala

I am just starting out with Apache Flink using Scala. Can someone please tell me how to create a lagged stream (lagged by k events or k units of time) from the datastream that I currently have?
Basically, I want to implement an auto-regression model (linear regression of the stream against a time-lagged version of itself) on a data stream. So I need a method similar to the following pseudo code.
val ds : DataStream = ...
val laggedDS : DataStream = lag(ds, k)

def lag(ds : DataStream, k : Time) : DataStream = {
}
I expect sample input and output like the following, if events are spaced at 1-second intervals and there is a 2-second lag.
Input : 1, 2, 3, 4, 5, 6, 7...
Output: NA, NA, 1, 2, 3, 4, 5...

If I understand your requirements correctly, I would implement this as a FlatMapFunction with a FIFO queue. The queue buffers k events and emits the head whenever a new event arrives. In case you need a fault-tolerant streaming application, the queue must be registered as state. Flink will then take care of checkpointing the state (i.e., the queue) and restoring it in case of a failure.
The FlatMapFunction could look like this:
import org.apache.flink.api.common.functions.FlatMapFunction
import org.apache.flink.streaming.api.checkpoint.Checkpointed
import org.apache.flink.util.Collector
import scala.collection.mutable

// X is the type of the stream's events
class Lagger[X](val k: Int)
  extends FlatMapFunction[X, X]
  with Checkpointed[mutable.Queue[X]] {

  var fifo: mutable.Queue[X] = new mutable.Queue[X]()

  override def flatMap(value: X, out: Collector[X]): Unit = {
    // add new element to queue
    fifo.enqueue(value)
    if (fifo.size == k + 1) {
      // remove head element and emit
      out.collect(fifo.dequeue())
    }
  }

  // restore state
  override def restoreState(state: mutable.Queue[X]) = { fifo = state }

  // get state to checkpoint
  override def snapshotState(cId: Long, cTS: Long): mutable.Queue[X] = fifo
}
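For example, the Lagger could be applied like this (a rough, untested sketch; the Long event type and the sample values are just for illustration):

import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
val ds: DataStream[Long] = env.fromElements(1L, 2L, 3L, 4L, 5L, 6L, 7L)

// lag the stream by k = 2 events
val laggedDS: DataStream[Long] = ds.flatMap(new Lagger[Long](2))
laggedDS.print() // emits 1, 2, 3, 4, 5 as elements 3 to 7 arrive

env.execute("lagged-stream")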
Returning elements with a time lag is more involved. This would require timer threads for the emission because the function is only called when a new element arrives.

Related

Are recursive computations with Apache Spark RDD possible?

I'm developing a chess engine using Scala and Apache Spark (and I need to stress that my sanity is not the topic of this question). My problem is that the Negamax algorithm is recursive in its essence, and when I try the naive approach:
class NegaMaxSparc(@transient val sc: SparkContext) extends Serializable {
  val movesOrdering = new Ordering[Tuple2[Move, Double]]() {
    override def compare(x: (Move, Double), y: (Move, Double)): Int =
      Ordering[Double].compare(x._2, y._2)
  }

  def negaMaxSparkHelper(game: Game, color: PieceColor, depth: Int, previousMovesPar: RDD[Move]): (Move, Double) = {
    val board = game.board
    if (depth == 0) {
      (null, NegaMax.evaluateDefault(game, color))
    } else {
      val moves = board.possibleMovesForColor(color)
      val movesPar = previousMovesPar.context.parallelize(moves)

      val moveMappingFunc = (m: Move) => {
        negaMaxSparkHelper(new Game(board.boardByMakingMove(m), color.oppositeColor, null), color.oppositeColor, depth - 1, movesPar)
      }
      val movesWithScorePar = movesPar.map(moveMappingFunc)
      val move = movesWithScorePar.min()(movesOrdering)

      (move._1, -move._2)
    }
  }

  def negaMaxSpark(game: Game, color: PieceColor, depth: Int): (Move, Double) = {
    if (depth == 0) {
      (null, NegaMax.evaluateDefault(game, color))
    } else {
      val movesPar = sc.parallelize(new Array[Move](0))
      negaMaxSparkHelper(game, color, depth, movesPar)
    }
  }
}

class NegaMaxSparkBot(val maxDepth: Int, sc: SparkContext) extends Bot {
  def nextMove(game: Game): Move = {
    val nms = new NegaMaxSparc(sc)
    nms.negaMaxSpark(game, game.colorToMove, maxDepth)._1
  }
}
I get:
org.apache.spark.SparkException: RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
The question is: can this algorithm be implemented recursively using Spark? If not, then what is the proper Spark-way to solve that problem?
Only the driver can launch computations on RDDs. The reason is that even though RDDs "feel" like regular collections of data, behind the scenes they are still distributed collections, so launching operations on them requires coordinating the execution of tasks on all remote slaves, which Spark hides from us most of the time.
So recursing from the slaves, i.e. launching new distributed tasks dynamically directly from the slaves, is not possible: only the driver can take care of such coordination.
Here's a possible alternative based on a simplification of your problem (if I understand things correctly). The idea is to successively build instances of Moves, each one representing the full sequence of Move from the initial state.
Each instance of Moves is able to transform itself into a set of Moves, each one corresponding to the same sequence of Move plus one possible next Move.
From there the driver just has to successively flatMap the Moves for as deep as we want, and the resulting RDD[Moves] will execute all operations in parallel for us.
The downside of the approach is that all depth levels are kept synchronized, i.e. we have to compute all moves at level n (i.e. the RDD[Moves] for level n) before going to the next one.
The code below is untested and probably still has flaws, but hopefully it provides an idea of how to approach the problem.
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

/* one modification to the board */
case class Move(from: String, to: String)

case class PieceColor(color: String)

/* state of the game */
class Board {
  // TODO
  def possibleMovesForColor(color: PieceColor): Seq[Move] =
    Move("here", "there") :: Move("there", "over there") :: Move("there", "here") :: Nil

  // TODO: compute a new instance of board here, based on current + this move
  def update(move: Move): Board = new Board
}

/** Solution, i.e. a sequence of moves */
case class Moves(moves: Seq[Move], game: Board, color: PieceColor) {
  lazy val score = NegaMax.evaluateDefault(game, color)

  /** @return all valid next Moves */
  def nextPossibleMoves: Seq[Moves] =
    game.possibleMovesForColor(color).map { nextMove =>
      copy(moves = nextMove +: moves,
           game  = game.update(nextMove))
    }
}

/** Driver code: negaMax looks for the best next move from a given game state */
def negaMax(sc: SparkContext, game: Board, color: PieceColor, maxDepth: Int): Moves = {
  val initialSolution = Moves(Seq.empty[Move], game, color)

  val allPlays: RDD[Moves] =
    (1 to maxDepth).foldLeft(sc.parallelize(Seq(initialSolution))) {
      (rdd, _) => rdd.flatMap(_.nextPossibleMoves)
    }

  allPlays.reduce { case (m1, m2) => if (m1.score < m2.score) m1 else m2 }
}
This is a limitation that makes sense in terms of the implementation, but it can be a pain to work with.
You could try pulling the recursion out to the top level, into the "driver" code that creates and operates on RDDs. Something like:
def step(rdd: RDD[Move], limit: Int): RDD[Move] =
  if (limit == 0) rdd
  else {
    val newRdd = rdd.flatMap(...)
    step(newRdd, limit - 1)
  }
Alternatively, it's always possible to translate recursion into iteration by managing the "stack" explicitly by hand (although it may result in more cumbersome code), as sketched below.
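For illustration, a minimal sketch of that idea (GameState and childrenOf are made-up placeholders, not names from the question):

// Illustrative sketch only: GameState and childrenOf are placeholders.
case class GameState(description: String)
def childrenOf(s: GameState): Seq[GameState] = Seq.empty // stub

// Depth-first expansion with an explicit stack instead of recursion.
def exploreIteratively(root: GameState, maxDepth: Int): Vector[GameState] = {
  var leaves = Vector.empty[GameState]
  var stack = List((root, 0)) // the "call stack", managed by hand
  while (stack.nonEmpty) {
    val (state, depth) = stack.head
    stack = stack.tail
    if (depth == maxDepth) leaves :+= state
    else stack = childrenOf(state).map(c => (c, depth + 1)).toList ::: stack
  }
  leaves
}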

Splitting a scalaz-stream process into two child streams

Using scalaz-stream is it possible to split/fork and then rejoin a stream?
As an example, let's say I have the following function
val streamOfNumbers : Process[Task,Int] = Process.emitAll(1 to 10)
val sumOfEvenNumbers = streamOfNumbers.filter(isEven).fold(0)(add)
val sumOfOddNumbers = streamOfNumbers.filter(isOdd).fold(0)(add)
zip(sumOfEvenNumbers, sumOfOddNumbers).to(someEffectfulFunction)
With scalaz-stream, in this example the results would be as you expect - a tuple of numbers from 1 to 10 passed to a sink.
However if we replace streamOfNumbers with something that requires IO, it will actually execute the IO action twice.
Using a Topic I'm able to create a pub/sub process that duplicates elements in the stream correctly, however it does not buffer - it simply consumes the entire source as fast as possible, regardless of the pace at which the sinks consume it.
I can wrap this in a bounded Queue, however the end result feels a lot more complex than it needs to be.
Is there a simpler way of splitting a stream in scalaz-stream without duplicate IO actions from the source?
Also, to clarify: the previous answer deals with the "splitting" requirement. The solution to your specific issue may not need splitting streams at all:
val streamOfNumbers : Process[Task,Int] = Process.emitAll(1 to 10)
val oddOrEven: Process[Task,Int\/Int] = streamOfNumbers.map {
case even if even % 2 == 0 => right(even)
case odd => left(odd)
}
val summed = oddOrEven.pipeW(sump1).pipeO(sump1)
val evenSink: Sink[Task,Int] = ???
val oddSink: Sink[Task,Int] = ???
summed
.drainW(evenSink)
.to(oddSink)
You can perhaps still use a topic and just make sure that the child processes subscribe before you push to the topic.
However, please note this solution does not have any bounds on it, i.e. if you push too fast you may encounter an OOM error.
def split[A](source: Process[Task, A]): Process[Task, (Process[Task, A], Process[Task, A])] = {
  val topic = async.topic[A]
  val sub1 = topic.subscribe
  val sub2 = topic.subscribe
  merge.mergeN(Process(emit(sub1 -> sub2), (source to topic.publish).drain))
}
I likewise needed this functionality. My situation was quite a bit trickier, which kept me from working around it in this manner.
Thanks to Daniel Spiewak's response in this thread, I was able to get the following to work. I improved on his solution by adding onHalt so my application would exit once the Process completed.
def split[A](p: Process[Task, A], limit: Int = 10): Process[Task, (Process[Task, A], Process[Task, A])] = {
  val left = async.boundedQueue[A](limit)
  val right = async.boundedQueue[A](limit)

  val enqueue = p.observe(left.enqueue).observe(right.enqueue).drain.onHalt { cause =>
    Process.await(Task.gatherUnordered(Seq(left.close, right.close))) { _ => Halt(cause) }
  }
  val dequeue = Process((left.dequeue, right.dequeue))

  enqueue merge dequeue
}
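For what it's worth, a rough usage sketch of this split helper (untested, assuming a scalaz-stream 0.7.x-style API; the stdout sinks are only for illustration):

import scalaz.concurrent.Task
import scalaz.stream._

// Wire each half of the split stream to its own sink and run both sides.
val source: Process[Task, Int] = Process.emitAll(1 to 10)

val program: Process[Task, Unit] =
  split(source).flatMap { case (left, right) =>
    left.map(n => s"left: $n").to(io.stdOutLines)
      .merge(right.map(n => s"right: $n").to(io.stdOutLines))
  }

program.run.run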

Factorial calculation using Scala actors

How can I compute the factorial using Scala actors?
And would it prove more time efficient compared to, for instance,
def factorial(n: Int): BigInt = (BigInt(1) to BigInt(n)).par.product
Many Thanks.
Problem
You have to split up your input into partial products. These partial products can then be calculated in parallel. The partial products are then multiplied to get the final product.
This can be reduced to a broader class of problems: The so called Parallel prefix calculation. You can read up about it on Wikipedia.
Short version: When you calculate a*b*c*d with an associative operation _ * _, you can structure the calculation as a*(b*(c*d)) or as (a*b)*(c*d). With the second approach, you can calculate a*b and c*d in parallel and then compute the final result from these partial results. Of course you can do this recursively when you have a bigger number of input values, as sketched below.
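As a minimal, non-actor sketch of that split-and-combine idea (illustrative only; the names and the cutoff of 8 are made up):

import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import ExecutionContext.Implicits.global

// Recursively split the input and combine the halves with the
// associative operation *; the two halves run as parallel Futures.
def parallelProduct(xs: IndexedSeq[BigInt]): Future[BigInt] =
  if (xs.length <= 8) Future(xs.product)
  else {
    val (left, right) = xs.splitAt(xs.length / 2)
    val lf = parallelProduct(left)
    val rf = parallelProduct(right)
    for (l <- lf; r <- rf) yield l * r
  }

// factorial of 100
val fact100 = Await.result(parallelProduct((1 to 100).map(BigInt(_))), 10.seconds)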
Solution
Disclaimer
This sounds a little bit like a homework assignment. So I will provide a solution that has two properties:
It contains a small bug
It shows how to solve parallel prefix in general, without solving the problem directly
So you can see how the solution should be structured, but no one can use it to cheat on her homework.
Solution in detail
First I need a few imports
import akka.event.Logging
import java.util.concurrent.TimeUnit
import scala.concurrent.duration.FiniteDuration
import akka.actor._
Then I create some helper classes for the communication between the actors
case class Calculate[T](values : Seq[T], segment : Int, parallelLimit : Int, fn : (T,T) => T)
trait CalculateResponse
case class CalculationResult[T](result : T, index : Int) extends CalculateResponse
case object Busy extends CalculateResponse
Instead of telling the receiver you are busy, the actor could also use the stash or implement its own queue for partial results. But in this case I think the sender should decide how many parallel calculations are allowed.
Now I create the actor:
class ParallelPrefixActor[T] extends Actor {
  val log = Logging(context.system, this)
  val subCalculation = Props(classOf[ParallelPrefixActor[BigInt]])
  val fanOut = 2

  def receive = waitForCalculation

  def waitForCalculation : Actor.Receive = {
    case c : Calculate[T] =>
      log.debug(s"Start calculation for ${c.values.length} values, segment nr. ${c.index}, from ${c.values.head} to ${c.values.last}")
      if (c.values.length < c.parallelLimit) {
        log.debug("Calculating result direct")
        val result = c.values.reduceLeft(c.fn)
        sender ! CalculationResult(result, c.index)
      } else {
        val groupSize: Int = Math.max(1, (c.values.length / fanOut) + Math.min(c.values.length % fanOut, 1))
        log.debug(s"Splitting calculation for ${c.values.length} values up to ${fanOut} children, ${groupSize} elements each, limit ${c.parallelLimit}")
        def segments = c.values.grouped(groupSize)
        log.debug("Starting children")
        segments.zipWithIndex.foreach { case (values, index) =>
          context.actorOf(subCalculation) ! c.copy(values = values, index = index)
        }
        val partialResults: Vector[T] = segments.map(_.head).to[Vector]
        log.debug(s"Waiting for ${partialResults.length} results (${partialResults.indices})")
        context.become(waitForResults(segments.length, partialResults, c, sender), discardOld = true)
      }
  }

  def waitForResults(outstandingResults : Int, partialResults : Vector[T], originalRequest : Calculate[T], originalSender : ActorRef) : Actor.Receive = {
    case c : Calculate[_] => sender ! Busy
    case r : CalculationResult[T] =>
      log.debug(s"Putting result ${r.result} on position ${r.index} in ${partialResults.length}")
      val updatedResults = partialResults.updated(r.index, r.result)
      log.debug("Killing sub-worker")
      sender ! PoisonPill
      if (outstandingResults == 1) {
        log.debug("Calculating result from partial results")
        val result = updatedResults.reduceLeft(originalRequest.fn)
        originalSender ! CalculationResult(result, originalRequest.index)
        context.become(waitForCalculation, discardOld = true)
      } else {
        log.debug(s"Still waiting for ${outstandingResults - 1} results")
        // For fanOut > 2 one could here already combine consecutive partial results
        context.become(waitForResults(outstandingResults - 1, updatedResults, originalRequest, originalSender), discardOld = true)
      }
  }
}
Optimizations
Using parallel prefix calculation as-is is not optimal. The actors calculating the product of the bigger numbers will do much more work than the actors calculating the product of the smaller numbers (e.g. when calculating 1 * ... * 100, it is faster to calculate 1 * ... * 10 than 90 * ... * 100). So it might be a good idea to shuffle the numbers, so that big numbers are mixed with small numbers. This works in this case because we use a commutative operation; parallel prefix calculation in general only needs an associative operation.
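For example (a sketch, not part of the measured code below), the input could be shuffled before it is handed to the actor:

import scala.util.Random

// Randomize the order of the factors so each segment mixes small and
// large numbers (valid here because multiplication is commutative).
val factors: Seq[BigInt] = Random.shuffle((1 to 10000).map(BigInt(_)))
// `factors` can then be passed as the `values` of a Calculate message.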
Performance
In theory
Performance of the actor solution is worse than the "naive" solution (using parallel collections) for small amounts of data. The actor solution will shine when you do complex calculations or distribute your calculation over specialized hardware (e.g. a graphics card or an FPGA) or over multiple machines. With actors you can control who does which calculation, and you can even restart "hanging" calculations. This can give a big speed-up.
On a single machine, the actor solution might help when you have a non-uniform memory architecture. You could then organize the actors in a way that pins memory to a certain processor.
Some measurement
I did some real performance measurement using a Scala worksheet in IntelliJ IDEA.
First I set up the actor system:
// Setup the actor system
val system = ActorSystem("root")
// Start one calculation actor
val calculationStart = Props(classOf[ParallelPrefixActor[BigInt]])
val calcolon = system.actorOf(calculationStart, "Calcolon-BigInt")
val inbox = Inbox.create(system)
Then I defined a helper method to measure time:
// Helper function to measure time
def time[A] (id : String)(f: => A) = {
val start = System.nanoTime()
val result = f
val stop = System.nanoTime()
println(s"""Time for "${id}": ${(stop-start)*1e-6d}ms""")
result
}
And then I did some performance measurement:
// Test code
val limit = 10000
def testRange = (1 to limit).map(BigInt(_))
time("par product")(testRange.par.product)
val timeOut = FiniteDuration(240, TimeUnit.SECONDS)
inbox.send(calcolon, Calculate[BigInt]((1 to limit).map(BigInt(_)), 0, 10, _ * _))
time("actor product")(inbox.receive(timeOut))
time("par sum")(testRange.par.sum)
inbox.send(calcolon, Calculate[BigInt](testRange, 0, 5, _ + _))
time("actor sum")(inbox.receive(timeOut))
I got the following results
> Time for "par product": 134.38289ms
res0: scala.math.BigInt = 284625968091705451890641321211986889014805140170279923
079417999427441134000376444377299078675778477581588406214231752883004233994015
351873905242116138271617481982419982759241828925978789812425312059465996259867
065601615720360323979263287367170557419759620994797203461536981198970926112775
004841988454104755446424421365733030767036288258035489674611170973695786036701
910715127305872810411586405612811653853259684258259955846881464304255898366493
170592517172042765974074461334000541940524623034368691540594040662278282483715
120383221786446271838229238996389928272218797024593876938030946273322925705554
596900278752822425443480211275590191694254290289169072190970836905398737474524
833728995218023632827412170402680867692104515558405671725553720158521328290342
799898184493136...
Time for "actor product": 1310.217247ms
res2: Any = CalculationResult(28462596809170545189064132121198688901480514017027
992307941799942744113400037644437729907867577847758158840621423175288300423399
401535187390524211613827161748198241998275924182892597878981242531205946599625
986706560161572036032397926328736717055741975962099479720346153698119897092611
277500484198845410475544642442136573303076703628825803548967461117097369578603
670191071512730587281041158640561281165385325968425825995584688146430425589836
649317059251717204276597407446133400054194052462303436869154059404066227828248
371512038322178644627183822923899638992827221879702459387693803094627332292570
555459690027875282242544348021127559019169425429028916907219097083690539873747
452483372899521802363282741217040268086769210451555840567172555372015852132829
034279989818449...
> Time for "par sum": 6.488620999999999ms
res3: scala.math.BigInt = 50005000
> Time for "actor sum": 657.752832ms
res5: Any = CalculationResult(50005000,0)
You can easily see that the actor version is much slower than using parallel collections.

Scala View + Stream combo causing OutOfMemory Error. How do I replace it with a View?

I was looking at solving a very simple problem, Eratosthenes sieve, using idiomatic Scala, for learning purposes.
I've learned that a Stream caches its elements, so determining the nth element is an O(n) access with memoisation of the data, which makes it unsuitable for this situation.
def primes(nums: Stream[Int]): Stream[Int] = {
  Stream.cons(nums.head,
    primes((nums tail) filter (x => x % nums.head != 0)))
}

def ints(n: Int): Stream[Int] = {
  Stream.cons(n, ints(n + 1))
}

def nthPrime(n: Int): Int = {
  val prim = primes(ints(2)).view take n toList;
  return prim(n - 1)
}
The Integer stream is the problematic one: while the prime-number filtering is being done, the JVM runs out of memory. What is the correct way to achieve the same functionality without using Streams?
Basically, take a view of primes from a view of ints and display the last element, without memoisation?
I have had similar cases where a stream was a good idea, but I did not need to store its values. In order to consume the stream without storing its values I created (what I called) a ThrowAwayIterator:
class ThrowAwayIterator[T](var stream: Stream[T]) extends Iterator[T] {
  def hasNext: Boolean = stream.nonEmpty
  def next(): T = {
    val next = stream.head
    stream = stream.tail
    next
  }
}
Make sure that you do not store a reference to the instance of stream that is passed in.
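A quick usage sketch (my own, untested, reusing primes and ints from the question): because already-consumed elements are no longer referenced by the iterator, they can be garbage collected while iterating towards the nth prime.

// Iterate to the nth prime without retaining the whole stream.
def nthPrime(n: Int): Int = {
  val it = new ThrowAwayIterator(primes(ints(2)))
  it.drop(n - 1).next()
}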

akka split task into smaller and fold results

The question is about the Akka actors library. I want to split one big task into smaller tasks and then fold their results into one 'big' result. This should give me a faster computation, since the smaller tasks can be computed in parallel if they are independent.
Assume that we need to compute something like this. The function count2X is time consuming, so using it several times in one thread is not optimal.
//NOT OPTIMAL
def count2X(x: Int) = {
Thread.sleep(1000)
x * 2
}
val sum = count2X(1) + count2X(2) + count2X(3)
println(sum)
And here goes the question.
How do I dispatch the tasks, collect the results, and then fold them, all using Akka actors?
Is such functionality already provided by Akka or do I need to implement it myself? What are the best practices for such an approach?
Here is 'visual' interpretation of my question:
             /--> [SMALL_TASK_1] --\
[BIG_TASK] --+---> [SMALL_TASK_2] ---> [RESULT_FOLD]
             \--> [SMALL_TASK_3] --/
Below is my scaffold implementation, with the missing/bad parts marked :)
case class Count2X(x: Int)

class Count2XActor extends Actor {
  def receive = {
    case Count2X(x) => count2X(x) // AND NOW WHAT ?
  }
}

case class CountSumOf2X(a: Int, b: Int, c: Int)

class SumOf2XActor extends Actor {
  val aCounter = context.actorOf(Props[Count2XActor])
  val bCounter = context.actorOf(Props[Count2XActor])
  val cCounter = context.actorOf(Props[Count2XActor])

  def receive = {
    case CountSumOf2X(a, b, c) => // AND NOW WHAT ? aCounter ! Count2X(a); bCounter ! Count2X(b); cCounter ! Count2X(c)
  }
}

val aSystem = ActorSystem("mySystem")
val actor = aSystem.actorOf(Props[SumOf2XActor])
actor ! CountSumOf2X(10, 20, 30)
Thanks for any help.
In Akka I would do something like this:
// assumes akka.pattern.ask, akka.util.Timeout, scala.concurrent.duration._
// and an implicit ExecutionContext are in scope
implicit val timeout: Timeout = Timeout(1.second)
val a = (aCounter ? Count2X(10)).mapTo[Int]
val b = (bCounter ? Count2X(20)).mapTo[Int]
val c = (cCounter ? Count2X(30)).mapTo[Int]
Await.result(Future.sequence(Seq(a, b, c)).map(_.sum), 1.second)
I'm sure there is a better way - here you only start summing the results after all the Futures have completed in parallel. For a simple task that's OK, but generally you shouldn't block and wait like that.
Two things you could do:
1) Use Akka futures. These allow you to dispatch operations and fold over them in an asynchronous manner (see the sketch after this list). Check out http://doc.akka.io/docs/akka/2.0.4/scala/futures.html for more information.
2) You can dispatch work to multiple "worker" actors and then have a "master" actor aggregate them, keeping track of which messages are pending/processed by storing information in the messages themselves. I have a simple stock quote example of this using Akka actors here: https://github.com/ryanlecompte/quotes
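As a minimal sketch of option 1 (not from the original answer; it reuses count2X from the question and plain scala.concurrent Futures rather than the older Akka futures API):

import scala.concurrent.{ExecutionContext, Future}
import ExecutionContext.Implicits.global

// Dispatch the three computations in parallel and fold the results
// without blocking the caller.
val sumF: Future[Int] =
  Future.sequence(Seq(1, 2, 3).map(x => Future(count2X(x)))).map(_.sum)

sumF.foreach(println) // prints 12 once all three have completed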