Are recursive computations with Apache Spark RDD possible? - scala

I'm developing a chess engine using Scala and Apache Spark (and I need to stress that my sanity is not the topic of this question). My problem is that the Negamax algorithm is recursive in its essence, and when I try the naive approach:
class NegaMaxSparc(@transient val sc: SparkContext) extends Serializable {
val movesOrdering = new Ordering[Tuple2[Move, Double]]() {
override def compare(x: (Move, Double), y: (Move, Double)): Int =
Ordering[Double].compare(x._2, y._2)
}
def negaMaxSparkHelper(game: Game, color: PieceColor, depth: Int, previousMovesPar: RDD[Move]): (Move, Double) = {
val board = game.board
if (depth == 0) {
(null, NegaMax.evaluateDefault(game, color))
} else {
val moves = board.possibleMovesForColor(color)
val movesPar = previousMovesPar.context.parallelize(moves)
val moveMappingFunc = (m: Move) => { negaMaxSparkHelper(new Game(board.boardByMakingMove(m), color.oppositeColor, null), color.oppositeColor, depth - 1, movesPar) }
val movesWithScorePar = movesPar.map(moveMappingFunc)
val move = movesWithScorePar.min()(movesOrdering)
(move._1, -move._2)
}
}
def negaMaxSpark(game: Game, color: PieceColor, depth: Int): (Move, Double) = {
if (depth == 0) {
(null, NegaMax.evaluateDefault(game, color))
} else {
val movesPar = sc.parallelize(new Array[Move](0))
negaMaxSparkHelper(game, color, depth, movesPar)
}
}
}
class NegaMaxSparkBot(val maxDepth: Int, sc: SparkContext) extends Bot {
def nextMove(game: Game): Move = {
val nms = new NegaMaxSparc(sc)
nms.negaMaxSpark(game, game.colorToMove, maxDepth)._1
}
}
I get:
org.apache.spark.SparkException: RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
The question is: can this algorithm be implemented recursively using Spark? If not, then what is the proper Spark-way to solve that problem?

Only the driver can launch computations on RDDs. The reason is that even though RDDs "feel" like regular collections of data, behind the scenes they are still distributed collections, so launching operations on them requires coordinating the execution of tasks on all remote slaves, which Spark hides from us most of the time.
So recursing from the slaves, i.e. launching new distributed tasks dynamically directly from the slaves, is not possible: only the driver can take care of such coordination.
Here's a possible alternative for a simplified version of your problem (if I understand it correctly). The idea is to successively build instances of Moves, each one representing the full sequence of Move from the initial state.
Each instance of Moves is able to transform itself into a set of Moves, each one corresponding to the same sequence of Move plus one possible next Move.
From there the driver just has to successively flatMap the Moves for as deep as we want, and the resulting RDD[Moves] will execute all operations in parallel for us.
The downside of the approach is that all depth levels are kept synchronized, i.e. we have to compute all moves at level n (i.e. the RDD[Moves] for level n) before going to the next one.
The code below is not tested, it probably has flaws and does not even compile, but hopefully it provides an idea on how to approach the problem.
/* one modification to the board */
case class Move(from: String, to: String)
case class PieceColor(color: String)
/* state of the game */
case class Board() {
// TODO
def possibleMovesForColor(color: PieceColor): Seq[Move] =
Move("here", "there") :: Move("there", "over there") :: Move("there", "here") :: Nil
// TODO: compute a new instance of board here, based on current + this move
def update(move: Move): Board = new Board
}
/** Solution, i.e. a sequence of moves*/
case class Moves(moves: Seq[Move], game: Board, color: PieceColor) {
lazy val score = NegaMax.evaluateDefault(game, color)
/** @return all valid next Moves */
def nextPossibleMoves: Seq[Moves] =
game.possibleMovesForColor(color).map { nextMove =>
// note: color is left unchanged here, as in the original sketch; a real engine would switch sides
copy(moves = nextMove +: moves, game = game.update(nextMove))
}
}
/** Driver code: negaMax: looks for the best next move from a given game state */
def negaMax(sc: SparkContext, game: Board, color: PieceColor, maxDepth: Int): Moves = {
val initialSolution = Moves(Seq.empty[Move], game, color)
val allPlays: RDD[Moves] =
(1 to maxDepth).foldLeft(sc.parallelize(Seq(initialSolution))) {
(rdd, _) => rdd.flatMap(_.nextPossibleMoves) // the fold function takes (accumulator, depth); depth is unused
}
allPlays.reduce { case (m1, m2) => if (m1.score < m2.score) m1 else m2}
}
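To see the shape of the level-by-level expansion without a cluster, here is a minimal sketch on plain Scala collections. The names LevelExpansion, nextStates and expand are hypothetical, and a "state" is just a list of 0/1 choices standing in for a sequence of moves. The same foldLeft/flatMap pattern is what the driver applies to the RDD above.

```scala
object LevelExpansion {
  // Hypothetical stand-in for Moves.nextPossibleMoves: every state has two successors.
  def nextStates(path: List[Int]): List[List[Int]] =
    List(0, 1).map(choice => choice :: path)

  // Expand one level per fold step; the range element itself is ignored.
  def expand(depth: Int): List[List[Int]] =
    (1 to depth).foldLeft(List(List.empty[Int])) { (level, _) =>
      level.flatMap(nextStates)
    }
}
```

Each fold step replaces the whole current level with its successors, which is why all states at depth n must exist before depth n + 1 is computed.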

This is a limitation that makes sense in terms of the implementation, but it can be a pain to work with.
You can try pulling the recursion up to the top level, so that it lives only in the "driver" code that creates and operates on RDDs. Something like:
def step(rdd: RDD[Move], limit: Int): RDD[Move] =
if(0 == limit) rdd
else {
val newRdd = rdd.flatMap(...)
step(newRdd, limit - 1)
}
Alternatively, it's always possible to translate recursion into iteration by managing the "stack" explicitly by hand (although it may result in more cumbersome code).
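As a rough illustration of managing the stack by hand, here is a small sketch that walks a tree depth-first with a loop instead of recursion. The Tree, Leaf and Branch types and the sumLeaves function are made up for this example and have nothing to do with Spark or the chess engine.

```scala
object ExplicitStack {
  sealed trait Tree
  case class Leaf(value: Int) extends Tree
  case class Branch(children: List[Tree]) extends Tree

  // Sums all leaf values; the List plays the role of the call stack.
  def sumLeaves(root: Tree): Int = {
    var stack: List[Tree] = List(root)
    var total = 0
    while (stack.nonEmpty) {
      val node = stack.head
      stack = stack.tail
      node match {
        case Leaf(v)          => total += v
        case Branch(children) => stack = children ::: stack // push children, visit them next
      }
    }
    total
  }
}
```

The pending work that recursion would keep in stack frames is kept in an ordinary value instead, so the loop can run entirely in driver code.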

Related

Functional Breadth First Search in Scala with the State Monad

I'm trying to implement a functional Breadth First Search in Scala to compute the distances between a given node and all the other nodes in an unweighted graph. I've used a State monad for this, with the following signature:
case class State[S,A](run:S => (A,S))
Other functions such as map, flatMap, sequence, modify etc etc are similar to what you'd find inside a standard State Monad.
Here's the code:
case class Node(label: Int)
case class BfsState(q: Queue[Node], nodesList: List[Node], discovered: Set[Node], distanceFromSrc: Map[Node, Int]) {
val isTerminated = q.isEmpty
}
case class Graph(adjList: Map[Node, List[Node]]) {
def bfs(src: Node): (List[Node], Map[Node, Int]) = {
val initialBfsState = BfsState(Queue(src), List(src), Set(src), Map(src -> 0))
val output = bfsComp(initialBfsState)
(output.nodesList,output.distanceFromSrc)
}
@tailrec
private def bfsComp(currState:BfsState): BfsState = {
if (currState.isTerminated) currState
else bfsComp(searchNode.run(currState)._2)
}
private def searchNode: State[BfsState, Unit] = for {
node <- State[BfsState, Node](s => {
val (n, newQ) = s.q.dequeue
(n, s.copy(q = newQ))
})
s <- get
_ <- sequence(adjList(node).filter(!s.discovered(_)).map(n => {
modify[BfsState](s => {
s.copy(s.q.enqueue(n), n :: s.nodesList, s.discovered + n, s.distanceFromSrc + (n -> (s.distanceFromSrc(node) + 1)))
})
}))
} yield ()
}
Could you please advise on the following:
Should the State Transition on dequeue in the searchNode function be a member of BfsState itself?
How do I make this code more performant/concise/readable?
First off, I suggest moving all the private defs related to bfs into bfs itself. This is the convention for methods that are solely used to implement another.
Second, I suggest simply not using State for this matter. State (like most monads) is about composition. It is useful when you have many things that all need access to the same global state. In this case, BfsState is specialized to bfs, will likely never be used anywhere else (it might be a good idea to move the class into bfs too), and the State itself is always run, so the outer world never sees it. (In many cases, this is fine, but here the scope is too small for State to be useful.) It'd be much cleaner to pull the logic of searchNode into bfsComp itself.
Third, I don't understand why you need both nodesList and discovered, when you can just call _.toList on discovered once you've done your computation. I've left it in in my reimplementation, though, in case there's more to this code that you haven't displayed.
@tailrec
def bfsComp(old: BfsState): BfsState = {
if (old.q.isEmpty) old // You don't need isTerminated, I think
else {
val (currNode, newQ) = old.q.dequeue
val newState = old.copy(q = newQ)
val next = adjList(currNode)
.filterNot(old.discovered) // Set[T] <: T => Boolean and filterNot means you don't need to write !old.discovered(_)
.foldLeft(newState) { case (BfsState(q, nodes, discovered, distance), adjNode) =>
BfsState(
q.enqueue(adjNode),
adjNode :: nodes,
discovered + adjNode,
distance + (adjNode -> (distance(currNode) + 1))
)
}
bfsComp(next) // keep going until the queue is empty
}
}
def bfs(src: Node): (List[Node], Map[Node, Int]) = {
// I suggest moving BfsState and bfsComp into this method
val output = bfsComp(BfsState(Queue(src), List(src), Set(src), Map(src -> 0)))
(output.nodesList, output.distanceFromSrc)
// Could get rid of nodesList and say output.discovered.toList
}
In the event that you think you do have a good reason for using State here, here are my thoughts.
You use def searchNode. The point of a State is that it is pure and immutable, so it should be a val, or else you reconstruct the same State every use.
You write:
node <- State[BfsState, Node](s => {
val (n, newQ) = s.q.dequeue
(n, s.copy(q = newQ))
})
First off, Scala's syntax was designed so that you don't need to have both a () and {} surrounding an anonymous function:
node <- State[BfsState, Node] { s =>
// ...
}
Second, this doesn't look quite right to me. One benefit of using for-syntax is that the anonymous functions are hidden from you and there is minimal indentation. I'd just write it out
oldState <- get
(node, newQ) = oldState.q.dequeue
newState = oldState.copy(q = newQ)
Footnote: would it make sense to make Node an inner class of Graph? Just a suggestion.

ParSeq.fill running sequentially?

I am trying to initialize an array in Scala, using parallelization. However, when using the ParSeq.fill method, the performance doesn't seem to be any better than sequential initialization (Seq.fill). If I do the same task but initialize the collection with map, it is much faster.
To show my point, I set up the following example:
import scala.collection.parallel.immutable.ParSeq
import scala.util.Random
object Timer {
def apply[A](f: => A): (A, Long) = {
val s = System.nanoTime
val ret = f
(ret, System.nanoTime - s)
}
}
object ParallelBenchmark extends App {
def randomIsPrime: Boolean = {
val n = Random.nextInt(1000000)
(2 until n).exists(i => n % i == 0)
}
val seqSize = 100000
val (_, timeSeq) = Timer { Seq.fill(seqSize)(randomIsPrime) }
println(f"Time Seq:\t\t $timeSeq")
val (_, timeParFill) = Timer { ParSeq.fill(seqSize)(randomIsPrime) }
println(f"Time Par Fill:\t $timeParFill")
val (_, timeParMap) = Timer { (0 until seqSize).par.map(_ => randomIsPrime) }
println(f"Time Par map:\t $timeParMap")
}
And the result is:
Time Seq: 32389215709
Time Par Fill: 32730035599
Time Par map: 17270448112
Clearly showing that the fill method is not running in parallel.
The parallel collections library in Scala can only parallelize existing collections; a parallel fill hasn't been implemented yet (and may never be). Your method of using a Range to generate a cheap placeholder collection is probably your best option if you want to see a speed boost.
Here's the underlying method being called by ParSeq.fill, obviously not parallel.

What is the ideal collection for incremental (with multiple passings) filtering of collection?

I've seen many questions about Scala collections and could not decide.
This question was the most useful until now.
I think the core of the question is twofold:
1) Which are the best collections for this use case?
2) Which are the recommended ways to use them?
Details:
I am implementing an algorithm that iterates over all elements in a collection
searching for the one that matches a certain criterion.
After the search, the next step is to search again with a new criterion, but without the chosen element among the possibilities.
The idea is to create a sequence with all original elements ordered by the criterion (which changes at every new selection).
The original sequence doesn't really need to be ordered, but there can be duplicates (the algorithm will only pick one at a time).
Example with a small sequence of Ints (just to simplify):
object Foo extends App {
def f(already_selected: Seq[Int])(element: Int): Double =
// something more complex happens here,
// especially something that takes 'already_selected' into account
math.sqrt(element)
//call to the algorithm
val (result, ti) = Tempo.time(recur(Seq.fill(9900)(Random.nextInt), Seq()))
println("ti = " + ti)
//algorithm
def recur(collection: Seq[Int], already_selected: Seq[Int]): (Seq[Int], Seq[Int]) =
if (collection.isEmpty) (Seq(), already_selected)
else {
val selected = collection maxBy f(already_selected)
val rest = collection diff Seq(selected) //this part doesn't seem to be efficient
recur(rest, selected +: already_selected)
}
}
object Tempo {
def time[T](f: => T): (T, Double) = {
val s = System.currentTimeMillis
(f, (System.currentTimeMillis - s) / 1000d)
}
}
Try @inline and, as icn suggested, the approach from "How can I idiomatically 'remove' a single element from a list in Scala and close the gap?":
object Foo extends App {
@inline
def f(already_selected: Seq[Int])(element: Int): Double =
// something more complex happens here,
// especially something that takes 'already_selected' into account
math.sqrt(element)
//call to the algorithm
val (result, ti) = Tempo.time(recur(Seq.fill(9900)(Random.nextInt()), Seq()))
println("ti = " + ti)
//algorithm
@tailrec
def recur(collection: Seq[Int], already_selected: Seq[Int]): Seq[Int] =
if (collection.isEmpty) already_selected
else {
// index the current collection on every pass: positions shift after each removal
val (selected, i) = collection.zipWithIndex.maxBy { case (x, _) => f(already_selected)(x) }
val rest = collection.patch(i, Nil, 1) // removes exactly the element at position i
recur(rest, selected +: already_selected)
}
}
object Tempo {
def time[T](f: => T): (T, Double) = {
val s = System.currentTimeMillis
(f, (System.currentTimeMillis - s) / 1000d)
}
}

Scala View + Stream combo causing OutOfMemory Error. How do I replace it with a View?

I was looking at solving a very simple problem, Eratosthenes sieve, using idiomatic Scala, for learning purposes.
I've learned that a Stream caches its elements, so it is not so performant when determining the nth element: access is O(n) and the data is memoised, so it is not suitable for this situation.
def primes(nums: Stream[Int]): Stream[Int] = {
Stream.cons(nums.head,
primes((nums tail) filter (x => x % nums.head != 0)))
}
def ints(n: Int): Stream[Int] = {
Stream.cons(n, ints(n + 1))
};
def nthPrime(n: Int): Int = {
val prim = primes(ints(2)).view take n toList;
return prim(n - 1);
};
The Integer stream is the problematic one. While the prime number filtering is done, the JVM runs out of memory. What is the correct way to achieve the same functionality without using Streams?
Basically, take a view of primes from a view of ints and display the last element, without memoisation?
I have had similar cases where a stream was a good idea, but I did not need to store its values. In order to consume the stream without storing its values I created what I called a ThrowAwayIterator:
class ThrowAwayIterator[T](var stream: Stream[T]) extends Iterator[T] {
def hasNext: Boolean = stream.nonEmpty
def next(): T = {
val next = stream.head
stream = stream.tail
next
}
}
Make sure that you do not store a reference to the instance of stream that is passed in.
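If no Stream is needed at all, a plain Iterator avoids memoisation entirely. The sketch below is one possible way to do it, not the sieve from the question: it swaps in simple trial division, and the names IteratorPrimes, isPrime and nthPrime are chosen here for illustration.

```scala
object IteratorPrimes {
  // Trial division up to sqrt(n); the Long multiplication avoids Int overflow.
  def isPrime(n: Int): Boolean =
    n >= 2 && Iterator.from(2).takeWhile(d => d.toLong * d <= n).forall(d => n % d != 0)

  // Iterator elements are produced on demand and never cached.
  def nthPrime(n: Int): Int =
    Iterator.from(2).filter(isPrime).drop(n - 1).next()
}
```

Because nothing holds on to already consumed elements, memory use stays flat no matter how far the iterator is advanced.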

Functional Alternative to Game Loop

I'm just starting out with Scala and am trying a little toy program, in this case a text-based TicTacToe. I wrote a working version based on what I know about Scala, but noticed it was mostly imperative and my classes were mutable.
I'm going through and trying to implement some functional idioms and have managed to at least make the classes representing the game state immutable. However, I'm left with a class responsible for performing the game loop relying on mutable state and imperative loop as follows:
var board: TicTacToeBoard = new TicTacToeBoard
def start() {
var gameState: GameState = new XMovesNext
outputState(gameState)
while (!gameState.isGameFinished) {
val position: Int = getSelectionFromUser
board = board.updated(position, gameState.nextTurn)
gameState = getGameState(board)
outputState(gameState)
}
}
What would be a more idiomatic way to program what I'm doing imperatively in this loop?
Full source code is here https://github.com/whaley/TicTacToe-in-Scala/tree/master/src/main/scala/com/jasonwhaley/tictactoe
imho for Scala, the imperative loop is just fine. You can always write a recursive function to behave like a loop. I also threw in some pattern matching.
def start() {
def loop(board: TicTacToeBoard): Unit = board.state match { // a recursive method needs an explicit result type
case Finished => ()
case Unfinished(gameState) => {
gameState.output()
val position: Int = getSelectionFromUser()
loop(board.updated(position))
}
}
loop(new TicTacToeBoard)
}
Suppose we had a function whileSome : (a -> Option[a]) a -> (), which runs the input function until its result is None. That would strip away a little boilerplate.
def start() {
def step(board: TicTacToeBoard) = {
board.gameState.output()
val position: Int = getSelectionFromUser()
board.updated(position) // returns either Some(nextBoard) or None
}
whileSome(step, new TicTacToeBoard)
}
whileSome should be trivial to write; it is simply an abstraction of the former pattern. I'm not sure if it's in any common Scala libs, but in Haskell you could grab whileJust_ from monad-loops.
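For reference, one way whileSome could be written is sketched below. The enclosing object name Loops is arbitrary, and the signature simply follows the description above (a step function returning an Option, plus an initial value).

```scala
import scala.annotation.tailrec

object Loops {
  /** Applies `step` repeatedly, starting from `init`, until it returns None. */
  @tailrec
  def whileSome[A](step: A => Option[A], init: A): Unit =
    step(init) match {
      case Some(next) => whileSome(step, next)
      case None       => ()
    }
}
```

Since the recursive call is in tail position, the compiler turns it into a loop, so a long game does not grow the stack.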
You could implement it as a recursive method. Here's an unrelated example:
object Guesser extends App {
val MIN = 1
val MAX = 100
readLine("Think of a number between 1 and 100. Press enter when ready")
def guess(max: Int, min: Int) {
val cur = (max + min) / 2
readLine("Is the number "+cur+"? (y/n) ") match {
case "y" => println("I thought so")
case "n" => {
def smallerGreater() {
readLine("Is it smaller or greater? (s/g) ") match {
case "s" => guess(cur - 1, min)
case "g" => guess(max, cur + 1)
case _ => smallerGreater()
}
}
smallerGreater()
}
case _ => {
println("Huh?")
guess(max, min)
}
}
}
guess(MAX, MIN)
}
How about something like:
Stream.continually(processMove).takeWhile(!_.isGameFinished)
where processMove is a function that gets selection from user, updates board and returns new state.
I'd go with the recursive version, but here's a proper implementation of the Stream version:
var board: TicTacToeBoard = new TicTacToeBoard
def start() {
def initialBoard: TicTacToeBoard = new TicTacToeBoard
def initialGameState: GameState = new XMovesNext
def gameIterator = Stream.iterate(initialBoard -> initialGameState) _
def game: Stream[GameState] = {
val (moves, end) = gameIterator {
case (board, gameState) =>
val position: Int = getSelectionFromUser
val updatedBoard = board.updated(position, gameState.nextTurn)
(updatedBoard, getGameState(updatedBoard))
}.span { case (_, gameState) => !gameState.isGameFinished }
(moves ++ end.take(1)) map { case (_, gameState) => gameState }
}
game foreach outputState
}
This looks weirder than it should. Ideally, I'd use takeWhile, and then map it afterwards, but it won't work as the last case would be left out!
If the moves of the game could be discarded, then dropWhile followed by head would work. If I had the side effect (outputState) inside the Stream, I could go that route, but having a side effect inside a Stream is way worse than a var with a while loop.
So, instead, I use span which gives me both takeWhile and dropWhile but forces me to save the intermediate results -- which can be real bad if memory is a concern, as the whole game will be kept in memory because moves points to the head of the Stream. So I had to encapsulate all that inside another method, game. That way, when I foreach through the results of game, there won't be anything pointing to the Stream's head.
Another alternative would be to get rid of the other side effect you have: getSelectionFromUser. You can get rid of that with an Iteratee, and then you can save the last move and reapply it.
OR... you could write yourself a takeTo method and use that.
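A takeTo along those lines might look like the sketch below. It is written against Iterator rather than Stream to keep it independent of the collection version, and the object name TakeTo and the exact signature are assumptions made for this example.

```scala
object TakeTo {
  /** Like takeWhile, but also yields the first element that fails the predicate. */
  def takeTo[A](it: Iterator[A])(p: A => Boolean): Iterator[A] = new Iterator[A] {
    private var done = false
    def hasNext: Boolean = !done && it.hasNext
    def next(): A = {
      val a = it.next()
      if (!p(a)) done = true // emit this element, then stop
      a
    }
  }
}
```

Applied to the game, the predicate would be something like "the game is not finished", so the final, finished state is still emitted before the iteration stops.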