Recursive method call in Apache Spark - scala

I'm building a family tree from a database on Apache Spark, using a recursive search to find the ultimate parent (i.e. the person at the top of the family tree) for each person in the DB.
It is assumed that the first person returned when looking up an id is the correct parent:
val peopleById = peopleRDD.keyBy(f => f.id)

def findUltimateParentId(personId: String): String = {
  if ((personId == null) || (personId.length() == 0))
    return "-1"
  val personSeq = peopleById.lookup(personId)
  val person = personSeq(0)
  if (person.id == "0" || person.id == person.parentId) {
    return person.id
  } else {
    return findUltimateParentId(person.parentId)
  }
}

val ultimateParentIds = peopleRDD.foreach(f => findUltimateParentId(f.parentId))
This gives the following error:
"Caused by: org.apache.spark.SparkException: RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063."
I understand from reading other similar questions that the problem is that I'm calling findUltimateParentId from within the foreach loop; if I call the method from the shell with a person's id, it returns the correct ultimate parent id.
However, none of the other suggested solutions work for me, or at least I can't see how to implement them in my program. Can anyone help?

If I understood you correctly - here's a solution that would work for any size of input (although performance might not be great) - it performs N iterations over the RDD where N is the "deepest family" (largest distance from ancestor to child) in the input:
import org.apache.spark.rdd.RDD

// representation of input: each person has an ID and an optional parent ID
case class Person(id: Int, parentId: Option[Int])

// representation of result: each person is optionally attached its "ultimate" ancestor,
// or None if it had no parent id in the first place
case class WithAncestor(person: Person, ancestor: Option[Person]) {
  def hasGrandparent: Boolean = ancestor.exists(_.parentId.isDefined)
}

object RecursiveParentLookup {
  // requested method
  def findUltimateParent(rdd: RDD[Person]): RDD[WithAncestor] = {
    // all persons keyed by id
    val byId = rdd.keyBy(_.id).cache()

    // recursive function that "climbs" one generation at each iteration
    def climbOneGeneration(persons: RDD[WithAncestor]): RDD[WithAncestor] = {
      val cached = persons.cache()
      // find which persons can climb further up the family tree
      val haveGrandparents = cached.filter(_.hasGrandparent)
      if (haveGrandparents.isEmpty()) {
        cached // we're done, return result
      } else {
        val done = cached.filter(!_.hasGrandparent) // these are done, we'll return them as-is
        // for those who can climb - join with persons to find the grandparent
        // and attach it in place of the parent
        val withGrandparents = haveGrandparents
          .keyBy(_.ancestor.get.parentId.get) // grandparent id
          .join(byId)
          .values
          .map { case (withAncestor, grandparent) =>
            WithAncestor(withAncestor.person, Some(grandparent))
          }
        // call this method recursively on the result
        done ++ climbOneGeneration(withGrandparents)
      }
    }

    // call recursive method - start by attaching each person as its own ancestor,
    // if it has a parent at all:
    climbOneGeneration(rdd.map(p => WithAncestor(p, p.parentId.map(_ => p))))
  }
}
Here's a test to better understand how this works:
/**
 * Example input tree:
 *
 *     1        5
 *     |        |
 *  ---2---     6
 *  |     |
 *  3     4
 */
val person1 = Person(1, None)
val person2 = Person(2, Some(1))
val person3 = Person(3, Some(2))
val person4 = Person(4, Some(2))
val person5 = Person(5, None)
val person6 = Person(6, Some(5))

test("find ultimate parent") {
  val input = sc.parallelize(Seq(person1, person2, person3, person4, person5, person6))
  val result = RecursiveParentLookup.findUltimateParent(input).collect()
  result should contain theSameElementsAs Seq(
    WithAncestor(person1, None),
    WithAncestor(person2, Some(person1)),
    WithAncestor(person3, Some(person1)),
    WithAncestor(person4, Some(person1)),
    WithAncestor(person5, None),
    WithAncestor(person6, Some(person5))
  )
}
It should be easy to map your input into these Person objects, and to map the output WithAncestor objects into whatever it is you need. Note that this code assumes that if any person has parentId X, another person with that id actually exists in the input.
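For instance, here's a minimal sketch of that mapping (the DbPerson shape is hypothetical, mirroring the String-id records in the question, where "0" or a self-reference marks a root):

// hypothetical input record, mirroring the question's shape
case class DbPerson(id: String, parentId: String)

def toPerson(p: DbPerson): Person =
  Person(p.id.toInt,
    Option(p.parentId)
      .filter(pid => pid.nonEmpty && pid != "0" && pid != p.id) // roots get None
      .map(_.toInt))

// and back out: (person id, ultimate parent id), with roots mapping to themselves
def toResult(w: WithAncestor): (Int, Int) =
  (w.person.id, w.ancestor.map(_.id).getOrElse(w.person.id))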

Fixed this by using SparkContext.broadcast. The keyed RDD is collected as a map on the driver and broadcast to the executors, so the lookup no longer runs inside an RDD operation:

val peopleById = peopleRDD.keyBy(f => f.id)
val broadcastedPeople = sc.broadcast(peopleById.collectAsMap())

def findUltimateParentId(personId: String): String = {
  if ((personId == null) || (personId.length() == 0))
    return "-1"
  val personOption = broadcastedPeople.value.get(personId)
  if (personOption.isEmpty) {
    return "0"
  }
  val person = personOption.get
  if (person.id == "0" || person.id == person.parentId) {
    return person.id
  } else {
    return findUltimateParentId(person.parentId)
  }
}

val ultimateParentIds = peopleRDD.map(f => findUltimateParentId(f.parentId))
working great now!
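One possible refinement (a sketch, not part of the original fix): since each step is now a plain map lookup, the helper can be written tail-recursively, so very deep parent chains can't overflow the stack:

import scala.annotation.tailrec

// same lookup logic as above, in tail-recursive form (assumes the same broadcast map)
@tailrec
def findUltimateParentId(personId: String): String =
  if (personId == null || personId.isEmpty) "-1"
  else broadcastedPeople.value.get(personId) match {
    case None => "0"
    case Some(person) =>
      if (person.id == "0" || person.id == person.parentId) person.id
      else findUltimateParentId(person.parentId)
  }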

Related

Mapping over a collection that might return multiple values or a single value

I'm currently mapping over a collection for validation, and I need to return a single validation error or multiple validation errors:

val errors: Seq[Option[ProductErrors]] = products.map { p =>
  if (...) Some(ProductErrors(...))
  else if (...) Some(ProductErrors(...))
  else None
}
errors.flatten

So currently I am returning an Option[ProductErrors] per map iteration, but in some cases I need to return multiple ProductErrors. How can I achieve this?
e.g.
if (...) {
  val p1 = Some(ProductErrors(...))
  val p2 = Some(ProductErrors(...))
}
Return a Seq[ProductErrors] per element instead of an Option, then flatten (or use flatMap directly):

case class ProductErrors(msg: String = "anything")

val products = (1 to 10).toList

def convert(p: Int): Seq[ProductErrors] = {
  if (p < 5) Seq(ProductErrors("less than 5"))
  else if (p < 8 && p % 2 == 1) Seq(ProductErrors("element is odd"), ProductErrors("less than 8"))
  else Seq()
}

val errors = products.map(convert)
// errors.flatten.size
// val res8: Int = 8

// you can just use flatMap here
products.flatMap(convert).size // 8
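For comparison, the same flattening works if a branch stays single-valued and you keep returning Option (a small sketch using the same products list as above):

// Option flatMaps/flattens the same way as Seq: a None contributes nothing
def convertOpt(p: Int): Option[ProductErrors] =
  if (p < 5) Some(ProductErrors("less than 5")) else None

products.flatMap(convertOpt).size // 4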

sequencing (parallel) Observables in Outwatch (or zipping)

How should I render a list of Observables in Outwatch? And if I need a single Observable, how should I sequence/zip them as an applicative? Is it expected to render when I use the applicative(?) operation zip to transform a List[Observable] into an Observable[List] (i.e. when I don't need the Observables to be chained)?
val literals: Seq[Observable[VNode]] = handlers.map { i => i.map(li(_)) }
div(ol(children <-- Observable.zip[VNode, Seq[VNode]](literals)(identity)))
With the one answer below,

div(ol(
  (for (item <- literals) yield { child <-- item }): _*))

any one child is only rendered after every input has been entered by the user. How do I render each child as soon as the user enters any first input, without having to enter them all?
Full code follows:
import outwatch.dom._
import rxscalajs.Observable
import scala.scalajs.js.JSApp

object Outwatchstarter extends JSApp {
  def createInputMappedToStringHandler(s: Handler[String]) = input(inputString --> s)

  def main(): Unit = {
    val root = {
      val names = (0 until 2).map(_.toString) // when 0 until 1, this emits
      val handlers: Seq[Handler[String]] = names.map(name => createStringHandler())
      val inputNodes = handlers.map(createInputMappedToStringHandler)

      val notworkingformorethan1 = {
        val literals = handlers.map { i => i.map(li(_)) }
        val y: Observable[Seq[VNode]] = Observable.zip[VNode, Seq[VNode]](literals)(identity)
        div(ol(
          children <-- y
        ))
      }

      val list = List("What", "Is", "Up?").map(s => li(s))
      val lists = Observable.just(list)
      val workingList = ul(children <-- lists)

      div(
        div(inputNodes: _*),
        workingList,
        notworkingformorethan1)
    }
    OutWatch.render("#app", root)
  }
}
Nothing shows up when the list length is >1, but it does with a one-element list. I'm an html/scala-js and rx noob, and may be misunderstanding how Observables may (or may not) be applicative functors. I was looking for 'sequence' rather than 'zip'.
For-comprehensions work inside the OutWatch DOM DSL:

div(ol(
  (for (item <- literals) yield { child <-- item }): _*))

This also explains the rendering behaviour you saw: zip only emits its first Seq[VNode] once every source Observable has emitted at least once, so the combined children stream stays empty until all inputs have values. Binding each item individually lets each li render as soon as its own handler emits.
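Applied to the full example above, the broken block becomes a direct substitution, with no zip and no Observable[Seq[VNode]] needed:

val workingForAnyLength = {
  val literals = handlers.map { i => i.map(li(_)) }
  div(ol(
    (for (item <- literals) yield { child <-- item }): _*
  ))
}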

Scala - Recursive method is returning different values

I have implemented a calculation to obtain the node score of each node.
The scoring rule is: a child only counts when its children list is non-empty or its flag (hasInvitedSomeone) is true.
The iterative way works pretty well:
class TreeManager {

  def scoreNo(nodes: List[Node]): List[(String, Double)] = {
    nodes.headOption.map(node => {
      val ranking = node.key.toString -> scoreNode(Some(node)) :: scoreNo(nodes.tail)
      ranking ::: scoreNo(node.children)
    }).getOrElse(Nil)
  }

  def scoreNode(node: Option[Node], score: Double = 0, depth: Int = 0): Double = {
    node.map(n => {
      var nodeScore = score
      for (child <- n.children) {
        if (!child.children.isEmpty || child.hasInvitedSomeone == Some(true)) {
          nodeScore = scoreNode(Some(child), nodeScore + scala.math.pow(0.5, depth), depth + 1)
        }
      }
      nodeScore
    }).getOrElse(score)
  }
}
But after I refactored this piece of code to use recursion, the results are totally wrong:
class TreeManager {

  def scoreRecursive(nodes: List[Node]): List[(Int, Double)] = {
    def scoreRec(nodes: List[Node], score: Double = 0, depth: Int = 0): Double = nodes match {
      case Nil => score
      case n =>
        if (!n.head.children.isEmpty || n.head.hasInvitedSomeone == Some(true)) {
          score + scoreRec(n.tail, score + scala.math.pow(0.5, depth), depth + 1)
        } else {
          score
        }
    }
    nodes.headOption.map(node => {
      val ranking = node.key -> scoreRec(node.children) :: scoreRecursive(nodes.tail)
      ranking ::: scoreRecursive(node.children)
    }).getOrElse(Nil).sortWith(_._2 > _._2)
  }
}
The Node is a tree node, represented by the following class:

case class Node(key: Int,
                children: List[Node] = Nil,
                hasInvitedSomeone: Option[Boolean] = Some(false))
And here is the part that i'm running to check results:
object Main {
def main(bla:Array[String]) = {
val xx = new TreeManager
val values = List(
Node(10, List(Node(11, List(Node(13))),
Node(12,
List(
Node(14, List(
Node(15, List(Node(18))), Node(17, hasInvitedSomeone = Some(true)),
Node(16, List(Node(19, List(Node(20)))),
hasInvitedSomeone = Some(true))),
hasInvitedSomeone = Some(true))),
hasInvitedSomeone = Some(true))),
hasInvitedSomeone = Some(true)))
val resIterative = xx.scoreNo(values)
//val resRecursive = xx.scoreRec(values)
println("a")
}
}
The iterative way works (I've checked it), but I don't see why the recursive version returns wrong values.
Any idea?
Thanks in advance.
The recursive version never recurses into the children of the nodes, just into the tail, whereas the iterative version correctly both recurses on the children and iterates over the tail.
You'll notice your "iterative" version is also recursive, by the way.
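A minimal sketch of a fix along those lines (untested against the original data): mirror the iterative rule, so each qualifying child contributes pow(0.5, depth) and is then descended into at depth + 1:

def scoreRec(nodes: List[Node], depth: Int = 0): Double =
  nodes.collect {
    // a child scores only if it has children or has invited someone
    case child if child.children.nonEmpty || child.hasInvitedSomeone == Some(true) =>
      scala.math.pow(0.5, depth) + scoreRec(child.children, depth + 1)
  }.sum

Calling scoreRec(node.children) where the broken version was called should then match scoreNode(Some(node)).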

Apache Spark: dealing with Option/Some/None in RDDs

I'm mapping over an HBase table, generating one RDD element per HBase row. However, sometimes the row has bad data (throwing a NullPointerException in the parsing code), in which case I just want to skip it.
I have my initial mapper return an Option to indicate that it returns 0 or 1 elements, then filter for Some, then get the contained value:
// myRDD is RDD[(ImmutableBytesWritable, Result)]
val output = myRDD.
  map( tuple => getData(tuple._2) ).
  filter( { case Some(y) => true; case None => false } ).
  map( _.get ).
  // ... more RDD operations with the good data

def getData(r: Result) = {
  val key = r.getRow
  var id = "(unk)"
  var x = -1L
  try {
    id = Bytes.toString(key, 0, 11)
    x = Long.MaxValue - Bytes.toLong(key, 11)
    // ... more code that might throw exceptions
    Some( ( id, ( List(x),
      // more stuff ...
    ) ) )
  } catch {
    case e: NullPointerException => {
      logWarning("Skipping id=" + id + ", x=" + x + "; \n" + e)
      None
    }
  }
}
Is there a more idiomatic way to do this that's shorter? I feel like this looks pretty messy, both in getData() and in the map.filter.map dance I'm doing.
Perhaps a flatMap could work (generate 0 or 1 items in a Seq), but I don't want it to flatten the tuples I'm creating in the map function, just eliminate empties.
An alternative, and often overlooked, way would be using collect(PartialFunction pf), which is meant to 'select' or 'collect' specific elements of the RDD that are defined at the partial function.
The code would look like this:

import scala.util.{Try, Success}

val output = myRDD.map(tuple => getData(tuple._2)).collect { case Success(tuple) => tuple }

def getData(r: Result): Try[(String, List[Long])] = Try {
  val key = r.getRow
  val id = Bytes.toString(key, 0, 11)
  val x = Long.MaxValue - Bytes.toLong(key, 11)
  (id, List(x))
}
If you change your getData to return a scala.util.Try then you can simplify your transformations considerably. Something like this could work:
def getData(r: Result) = {
  val key = r.getRow
  var id = "(unk)"
  var x = -1L
  val tr = util.Try {
    id = Bytes.toString(key, 0, 11)
    x = Long.MaxValue - Bytes.toLong(key, 11)
    // ... more code that might throw exceptions
    ( id, ( List(x)
      // more stuff ...
    ) )
  }
  tr.failed.foreach(e => logWarning("Skipping id=" + id + ", x=" + x + "; \n" + e))
  tr
}
Then your transform could start like so:
myRDD.
  flatMap(tuple => getData(tuple._2).toOption)
If your Try is a Failure, it will be turned into a None via toOption and then removed as part of the flatMap logic. At that point, your next step in the transform will only be working with the successful cases: whatever the underlying type is that getData returns, without the Option wrapping.
If you are ok with dropping the data then you can just use mapPartitions. Here is a sample:
import scala.util._

val mixedData = sc.parallelize(List(1, 2, 3, 4, 0))
mixedData.mapPartitions(x => {
  val foo = for (y <- x) yield {
    Try(1 / y)
  }
  for (goodVals <- foo.partition(_.isSuccess)._1)
    yield goodVals.get
})
If you want to see the bad values, then you can use an accumulator or just log as you have been.
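For instance, a minimal accumulator sketch under the same setup (this assumes the Spark 1.x sc.accumulator API, which matches the era of this code; Spark 2+ replaces it with sc.longAccumulator):

import scala.util.{Try, Success, Failure}

val badRows = sc.accumulator(0L, "bad rows")
val cleaned = mixedData.mapPartitions { iter =>
  iter.flatMap { y =>
    Try(1 / y) match {
      case Success(v) => Some(v)
      case Failure(_) => badRows += 1L; None // count the bad value, then drop it
    }
  }
}
cleaned.count()        // force evaluation so the accumulator is populated
println(badRows.value) // 1, from the division by zero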
Your code would look something like this:

val output = myRDD.
  mapPartitions( tupleIter => getCleanData(tupleIter) )
  // ... more RDD operations with the good data

def getCleanData(iter: Iterator[???]) = {
  val triedData = getDataInTry(iter)
  for (goodVals <- triedData.partition(_.isSuccess)._1)
    yield goodVals.get
}

def getDataInTry(iter: Iterator[???]) = {
  for (r <- iter) yield {
    Try {
      val key = r._2.getRow
      var id = "(unk)"
      var x = -1L
      id = Bytes.toString(key, 0, 11)
      x = Long.MaxValue - Bytes.toLong(key, 11)
      // ... more code that might throw exceptions
      (id, List(x)) // the Try's value
    }
  }
}

How can I make this method more Scalalicious

I have a function that calculates the left and right node values for some collection of treeNodes, given a simple node.id / node.parentId association. It's very simple and works well enough... but, well, I am wondering if there is a more idiomatic approach. Specifically, is there a way to track the left/right values without using some externally tracked value, but still keep the tasty recursion?
/*
 * A tree node
 */
case class TreeNode(val id: String, val parentId: String) {
  var left: Int = 0
  var right: Int = 0
}

/*
 * a method to compute the left/right node values
 */
def walktree(node: TreeNode) = {
  /*
   * increment state for the inner function
   */
  var c = 0

  /*
   * A method to set the increment state
   */
  def increment = { c += 1; c } // poo

  /*
   * the tasty inner method
   * treeNodes is a List[TreeNode]
   */
  def walk(node: TreeNode): Unit = {
    node.left = increment
    /*
     * recurse on all direct descendants
     */
    treeNodes filter (_.parentId == node.id) foreach (walk(_))
    node.right = increment
  }
  walk(node)
}
walktree(someRootNode)
Edit -
The list of nodes is taken from a database. Pulling the nodes into a proper tree would take too much time, so I pull a flat list into memory; all I have is an association via node ids between parents and children.
Adding left/right node values allows me to get a snapshot of all children (and children's children) with a single SQL query.
The calculation needs to run very quickly in order to maintain data integrity should parent-child associations change (which they do very frequently).
In addition to using the awesome Scala collections I've also boosted speed by using parallel processing for some pre/post filtering on the tree nodes. I wanted to find a more idiomatic way of tracking the left/right node values. After looking at the answer from @dhg it got even better: using groupBy instead of a filter turns the algorithm (mostly?) linear instead of quadratic!
val treeNodeMap = treeNodes.groupBy(_.parentId).withDefaultValue(Nil)

def walktree(node: TreeNode) = {
  def walk(node: TreeNode, counter: Int): Int = {
    node.left = counter
    node.right =
      treeNodeMap(node.id)
        .foldLeft(counter + 1) {
          (result, curnode) => walk(curnode, result) + 1
        }
    node.right
  }
  walk(node, 1)
}
Your code appears to be calculating an in-order traversal numbering.
I think what you want to make your code better is a fold that carries the current value downward and passes the updated value upward. Note that it might also be worth it to do a treeNodes.groupBy(_.parentId) before walktree to prevent you from calling treeNodes.filter(...) every time you call walk.
val treeNodes = List(TreeNode("1", "0"), TreeNode("2", "1"), TreeNode("3", "1"))
val treeNodeMap = treeNodes.groupBy(_.parentId).withDefaultValue(Nil)

def walktree2(node: TreeNode) = {
  def walk(node: TreeNode, c: Int): Int = {
    node.left = c
    val newC =
      treeNodeMap(node.id) // get the children without filtering
        .foldLeft(c + 1)((c, child) => walk(child, c) + 1)
    node.right = newC
    newC
  }
  walk(node, 1)
}
And it produces the same result:
scala> walktree2(TreeNode("0","-1"))
scala> treeNodes.map(n => "(%s,%s)".format(n.left,n.right))
res32: List[String] = List((2,7), (3,4), (5,6))
That said, I would completely rewrite your code as follows:
case class TreeNode(        // class is now immutable; `walktree` returns a new tree
  id: String,
  value: Int,               // value to be set during `walktree`
  left: Option[TreeNode],   // recursively-defined structure
  right: Option[TreeNode])  // makes traversal much simpler

def walktree(node: TreeNode) = {
  def walk(nodeOption: Option[TreeNode], c: Int): (Option[TreeNode], Int) = {
    nodeOption match {
      case None => (None, c)  // if this child doesn't exist, do nothing
      case Some(node) =>      // if this child exists, recursively walk
        val (newLeft, cLeft) = walk(node.left, c)        // walk the left side
        val newC = cLeft + 1                             // update the value
        val (newRight, cRight) = walk(node.right, newC)  // walk the right side
        (Some(TreeNode(node.id, newC, newLeft, newRight)), cRight)
    }
  }
  walk(Some(node), 0)._1
}
Then you can use it like this:

walktree(
  TreeNode("1", -1,
    Some(TreeNode("2", -1,
      Some(TreeNode("3", -1, None, None)),
      Some(TreeNode("4", -1, None, None)))),
    Some(TreeNode("5", -1, None, None))))

To produce:

Some(TreeNode(1,4,
  Some(TreeNode(2,2,
    Some(TreeNode(3,1,None,None)),
    Some(TreeNode(4,3,None,None)))),
  Some(TreeNode(5,5,None,None))))
If I get your algorithm correctly:

def walktree(node: TreeNode, c: Int): Int = {
  node.left = c
  val c2 = treeNodes.filter(_.parentId == node.id).foldLeft(c + 1) {
    (cur, n) => walktree(n, cur)
  }
  node.right = c2 + 1
  c2 + 2
}

walktree(new TreeNode("", ""), 0)
Off-by-one errors are likely to occur.
A few random thoughts (better suited for http://codereview.stackexchange.com):

Try posting code that compiles... We have to guess that treeNodes is a sequence of TreeNode.

val is implicit for case class parameters:

case class TreeNode(val id: String, val parentId: String) {

Avoid explicit = and Unit for Unit functions:

def walktree(node: TreeNode) = {
def walk(node: TreeNode): Unit = {

Methods with side-effects should have ():

def increment = { c += 1; c }

This is terribly slow; consider storing the list of children in the actual node:

treeNodes filter (_.parentId == node.id) foreach (walk(_))

More concise syntax would be treeNodes foreach walk:

treeNodes foreach (walk(_))