Creating Seq after waiting for all results from map/foreach in Scala - scala

I am trying to loop over inputs and process them to produce scores.
Just for the first input, I want to do some processing that takes a while.
The function ends up returning just the values from the 'else' part. The 'if' part is done executing after the function returns the value.
I am new to Scala; I understand the behavior but am not sure how to fix it.
I've tried inputs.zipWithIndex.map instead of foreach but the result is the same.
def getscores(
  inputs: inputs
): Future[Seq[scoreInfo]] = {
  var scores: Seq[scoreInfo] = Seq()
  inputs.zipWithIndex.foreach {
    case (f, i) => {
      if (i == 0) {
        // long operation that returns Future[Option[scoreInfo]]
        getgeoscore(f).foreach(gso => {
          gso.foreach(score => {
            scores = scores.:+(score)
          })
        })
      } else {
        scores = scores.:+(
          scoreInfo(
            id = "",
            score = 5
          )
        )
      }
    }
  }
  Future {
    scores
  }
}

For what you need, I would drop the mutable variable, replace foreach with map to obtain an immutable list of Futures (using recover to handle exceptions), and then combine them with Future.sequence, as below:
def getScores(inputs: Inputs): Future[List[ScoreInfo]] = Future.sequence(
  inputs.zipWithIndex.map { case (input, idx) =>
    if (idx == 0)
      getGeoScore(input).map(_.getOrElse(defaultScore)).recover { case e => errorHandling(e) }
    else
      Future.successful(ScoreInfo("", 5))
  })
To capture/print the result, one way is to use onComplete:
getScores(inputs).onComplete(println)
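A self-contained sketch of the same idea follows, in case it helps to run it. The ScoreInfo case class, the defaultScore fallback, the String inputs and the sleeping getGeoScore stub are all assumptions made up for the demo; only the Future.sequence / map / recover structure comes from the answer above.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object SequenceSketch {
  final case class ScoreInfo(id: String, score: Int)

  // Hypothetical fallback used when the slow call yields None or fails.
  val defaultScore: ScoreInfo = ScoreInfo("default", 0)

  // Hypothetical stand-in for the long-running call from the question.
  def getGeoScore(input: String)(implicit ec: ExecutionContext): Future[Option[ScoreInfo]] =
    Future {
      Thread.sleep(200) // simulate slow work
      Some(ScoreInfo(input, 42))
    }

  def getScores(inputs: List[String])(implicit ec: ExecutionContext): Future[List[ScoreInfo]] =
    Future.sequence(
      inputs.zipWithIndex.map { case (input, idx) =>
        if (idx == 0)
          getGeoScore(input)
            .map(_.getOrElse(defaultScore))
            .recover { case _: Exception => defaultScore }
        else
          Future.successful(ScoreInfo("", 5))
      }
    )

  // Blocks only so the demo can return a plain value; real code would keep composing the Future.
  def demo(): List[ScoreInfo] = {
    implicit val ec: ExecutionContext = ExecutionContext.global
    Await.result(getScores(List("a", "b", "c")), 5.seconds)
  }

  def main(args: Array[String]): Unit = println(demo())
}
```

Because every element becomes a Future before Future.sequence is applied, the resulting list keeps the order of the inputs, and the combined Future completes only once the slow first element has finished.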

The part you're missing is a tricky element of concurrency: the order of execution across multiple futures is not guaranteed.
If your block here is long-running, it will take a while before the score is appended to scores:
// long operation that returns Future[Option[scoreInfo]]
getgeoscore(f).foreach(gso => {
  gso.foreach(score => {
    // stick a println("here") in here to see what happens, for demonstration purposes only
    scores = scores.:+(score)
  })
})
Since that executes concurrently, your getscores function simultaneously continues iterating over the rest of inputs in your zipWithIndex. That iteration, being trivial work, likely finishes well before the Future scheduled by the long-running getgeoscore(f) completes, so the function returns and execution moves on to whatever code follows your call to getscores:
import scala.util.{Failure, Success}

val futureScores: Future[Seq[scoreInfo]] = getscores(inputs)
futureScores.onComplete {
  case Success(scoreInfoSeq) => println(s"Here's the scores: ${scoreInfoSeq.mkString(",")}")
  case Failure(e) => println(s"getscores failed: $e")
}
// at this point the call to getgeoscore(f) could still be running and finish later, but you will never know
doSomeOtherWork()
Now to clean this up: since you can call zipWithIndex on your inputs parameter, I assume it is something like inputs: Seq[Input]. If all you want to do is operate on the first input, use head to retrieve just the first element, i.e. getgeoscore(inputs.head); you don't need the rest of the code you have there.
Also, as a note, if you are using Scala, get out of the habit of using mutable vars, especially when working with concurrency. Scala is built around supporting immutability, so if you find yourself wanting a var, try a val instead and look up how to use Scala's collections library to make it work.

In general, that is, when you have several concurrent futures, I would say Leo's answer describes the right way to do it. However, you want only the first element transformed by a long-running operation. So you can take the future returned by the respective function and append the other elements once the long-running call returns, by mapping over the future's result:
def getscores(inputs: Inputs): Future[Seq[ScoreInfo]] =
  getgeoscore(inputs.head)
    .map { optInfo =>
      // toList turns the Option into zero or one elements before the rest is appended
      optInfo.toList ++ inputs.tail.map(_ => ScoreInfo(id = "", score = 5))
    }
So you need neither zipWithIndex nor an additional future, and you don't have to join the results of several futures with sequence. Mapping the future just gives you a new future with the result transformed by the function passed to .map().
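Here is a runnable sketch of that mapping approach. The ScoreInfo case class, the String inputs and the sleeping getGeoScore stub are assumptions for the demo, and the headOption guard for an empty input is an addition that the answer does not cover.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object HeadThenTail {
  final case class ScoreInfo(id: String, score: Int)

  // Hypothetical stand-in for the long-running call.
  def getGeoScore(input: String)(implicit ec: ExecutionContext): Future[Option[ScoreInfo]] =
    Future {
      Thread.sleep(200) // simulate slow work
      Some(ScoreInfo(input, 42))
    }

  def getScores(inputs: Seq[String])(implicit ec: ExecutionContext): Future[Seq[ScoreInfo]] =
    inputs.headOption match {
      // Assumption: an empty input simply yields an empty result.
      case None => Future.successful(Seq.empty)
      case Some(first) =>
        getGeoScore(first).map { optInfo =>
          // Zero or one real score, followed by a fixed score for every remaining input.
          optInfo.toList ++ inputs.tail.map(_ => ScoreInfo("", 5))
        }
    }

  // Blocks only so the demo can return a plain value.
  def demo(inputs: Seq[String]): Seq[ScoreInfo] = {
    implicit val ec: ExecutionContext = ExecutionContext.global
    Await.result(getScores(inputs), 5.seconds)
  }

  def main(args: Array[String]): Unit = println(demo(Seq("a", "b", "c")))
}
```

Only one Future is ever created here, so there is nothing to sequence; the fixed scores are built inside the map callback once the slow result is available.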

Related

Return keyword inside the inline function in Scala

I've heard that you should not use the return keyword in Scala, because it might change the flow of the program, like this:
// this will return only 2 because of return keyword
List(1, 2, 3).map(value => return value * 2)
Here is my case: I have a recursive case class and use DFS to do some calculation on it. The maximum depth could be 5. Here is the model:
case class Line(
  lines: Option[Seq[Line]],
  balls: Option[Seq[Ball]],
  op: Option[String]
)
I'm using a DFS approach to search this recursive model. But at some point, if a special value exists in the data, I want to stop iterating over the remaining data and return the result directly instead. Here is an example:
Line(
  lines = Some(Seq(
    Line(None, Some(Seq(Ball(1), Ball(3))), Some("s")),
    Line(None, Some(Seq(Ball(5), Ball(2))), Some("d")),
    Line(None, Some(Seq(Ball(9))), None)
  )),
  balls = None,
  None
)
In this data, I want to return something like "NOT_OKAY" if I run into Ball(5), which means I do not need to perform any operation on Ball(2) and Ball(9) anymore. Otherwise, I will apply a calculation to each Ball(x) with the given operator.
I'm using this sort of DFS method:
def calculate(line: Line) = {
  // string here is the actual result that I want, Boolean just keeps if there is a data that I don't want
  def dfs(line: Line): (String, Boolean) = {
    line.balls.map{_.map { ball =>
      val result = someCalculationOnBall(ball)
      // return keyword here because i don't want to iterate values left in the balls
      if (result == "NOTREQUIRED") return ("NOT_OKAY", true)
      ("OKAY", false)
    }}.getOrElse(
      line.lines.map{_.map{ subLine =>
        val groupResult = dfs(subLine)
        // here is I'm using return because I want to return the result directly instead of iterating the values left in the lines
        if (groupResult._2) return ("NOT_OKAY", true)
        ("OKAY", false)
      }}
    )
  }
  .... rest of the thing
}
In this case, I'm using the return keyword inside the inline functions, which changes the behaviour of the inner map calls completely. I've read a bit about not using the return keyword in Scala, but couldn't determine whether it will cause a problem here, because in my case I don't want to do any further calculation once I run into a value that I don't want to see. I also couldn't find a functional way to get rid of the return keyword.
Is there any side effect, such as a stack exception, from using the return keyword here? I'm always open to alternative approaches. Thank you so much!

Filtering a collection of IO's: List[IO[Page]] scala

I am refactoring a scala http4s application to remove some pesky side effects causing my app to block. I'm replacing .unsafeRunSync with cats.effect.IO. The problem is as follows:
I have 2 lists: alreadyAccessible: IO[List[Page]] and pages: List[Page]
I need to filter out the pages that are not contained in alreadyAccessible.
Then map over the resulting list to "grant access" in the database to these pages (e.g. call another method that hits the database and returns an IO[Page]).
val addable: List[Page] = pages.filter(p => !alreadyAccessible.contains(p))
val added: List[Page] = addable.map((p: Page) => {
  pageModel.grantAccess(roleLst.head.id, p.id) match {
    case Right(p) => p
  }
})
This is close to what I want; however, it does not work because filter requires a function that returns a Boolean, but alreadyAccessible is of type IO[List[Page]], and you cannot take the list out of the IO monad. Since I can't remove data from the IO, maybe I can transform it instead:
val added: List[IO[Page]] = for(page <- pages) {
  val granted = alreadyAccessible.flatMap((aa: List[Page]) => {
    if (!aa.contains(page))
      pageModel.grantAccess(roleLst.head.id, page.id) match { case Right(p) => p }
    else null
  })
} yield granted
This unfortunately does not work and fails with the following error:
Error:(62, 7) ';' expected but 'yield' found.
} yield granted
I think I am somehow misusing the for comprehension syntax, but I don't understand why I cannot do what I'm doing.
I know there must be a straightforward solution to such a problem, so any input or advice is greatly appreciated. Thank you for your time in reading this!
granted is going to be an IO[List[Page]]. There's no particular point in having IO inside anything else unless you truly are going to treat the actions like values and reorder them/filter them etc.
val granted: IO[List[Page]] = for {
How do you compute it? Well, the first step is to execute alreadyAccessible to get the actual list. In fact, alreadyAccessible is misnamed. It is not the list of accessible pages; it is an action that gets the list of accessible pages. I would recommend you rename it getAlreadyAccessible.
alreadyAccessible <- getAlreadyAccessible
Then you filter pages with it
required = pages.filterNot(alreadyAccessible.contains)
Now, I cannot decipher what you're doing to these pages. I'm just going to assume you have some kind of function grantAccess: Page => IO[Page]. If you map this function over required, you will get a List[IO[Page]], which is not desirable. Instead, we should traverse with grantAccess, which will produce an IO[List[Page]] that executes each IO[Page] and then assembles all the results into a List[Page].
granted <- required.traverse(grantAccess)
And we're done
} yield granted

Should 'require' go inside or outside of the Future?

How do I replace my first conditional with the require function in the context of a Future? Should I wrap the entire inRange method in a Future, and if I do that, how do I handle the last Future so that it doesn't return a Future[Future[List[UserId]]], or is there a better way?
I have a block of code that looks something like this:
class RetrieveHomeownersDefault(depA: DependencyA, depB: DependencyB) extends RetrieveHomeowners {
  def inRange(range: GpsRange): Future[List[UserId]] = {
    // I would like to replace this conditional with `require(count >= 0, "The offset…`
    if (count < 0) {
      Future.failed(new IllegalArgumentException("The offset must be a positive integer."))
    } else {
      val retrieveUsers: Future[List[UserId]] = depA.inRange(range)
      for {
        userIds <- retrieveUsers
        homes <- depB.homesForUsers(userIds)
      } yield FilterUsers.withoutHomes(userIds, homes)
    }
  }
}
I started using the require function in other areas of my code, but when I tried to use it in the context of Futures I ran into some hiccups.
class RetrieveHomeownersDefault(depA: DependencyA, depB: DependencyB) extends RetrieveHomeowners {
  // Wrapped the entire method with Future, but is that the correct approach?
  def inRange(range: GpsRange): Future[List[UserId]] = Future {
    require(count >= 0, "The offset must be a positive integer.")
    val retrieveUsers: Future[List[UserId]] = depA.inRange(range)
    // Now I get Future[Future[List[UserId]]] error in the compiler.
    for {
      userIds <- retrieveUsers
      homes <- depB.homesForUsers(userIds)
    } yield FilterUsers.withoutHomes(userIds, homes)
  }
}
Any tips, feedback, or suggestions would be greatly appreciated. I'm just getting started with Futures and still having a tough time wrapping my head around many concepts.
Thanks a bunch!
Just remove the outer Future {...} wrapper. It's not necessary. There's no good reason for the require call to go inside the Future. It's actually better outside since then it will report immediately (in the same thread) to the caller that the argument is invalid.
By the way, the original code is wrong too. The Future.failed(...) is created but not returned. So essentially it didn't do anything.
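As a rough illustration of keeping require outside the Future, here is a simplified, self-contained sketch. Passing count as a parameter, using plain Int user ids and the two inline Future values are assumptions standing in for depA, depB and FilterUsers from the question.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import scala.util.Try

object RequireOutside {
  // Hypothetical simplification: `count` is a parameter and user ids are plain Ints.
  def inRange(count: Int)(implicit ec: ExecutionContext): Future[List[Int]] = {
    // Runs on the caller's thread, before any Future is created.
    require(count >= 0, "The offset must be a positive integer.")
    for {
      userIds <- Future(List(1, 2, 3)) // stands in for depA.inRange(range)
      homes   <- Future(Set(1, 3))     // stands in for depB.homesForUsers(userIds)
    } yield userIds.filter(homes.contains).take(count)
  }

  // Returns the successful result and whether the invalid call threw synchronously.
  def demo(): (List[Int], Boolean) = {
    implicit val ec: ExecutionContext = ExecutionContext.global
    val ok = Await.result(inRange(2), 5.seconds)
    val rejectedImmediately = Try(inRange(-1)).isFailure
    (ok, rejectedImmediately)
  }

  def main(args: Array[String]): Unit = println(demo())
}
```

The trade-off is that an invalid argument now surfaces as a thrown IllegalArgumentException rather than as a failed Future, so callers that only inspect the Future will not see it.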

Spark - how to handle with lazy evaluation in case of iterative (or recursive) function calls

I have a recursive function that needs to compare the results of the current call to the previous call to figure out whether it has reached a convergence. My function does not contain any action - it only contains map, flatMap, and reduceByKey. Since Spark does not evaluate transformations (until an action is called), my next iteration does not get the proper values to compare for convergence.
Here is a skeleton of the function -
def func1(sc: SparkContext, nodes: RDD[List[Long]], didConverge: Boolean, changeCount: Int): RDD[List[Long]] = {
  if (didConverge)
    nodes
  else {
    val currChangeCount = sc.accumulator(0, "xyz")
    val newNodes = performSomeOps(nodes, currChangeCount) // does a few map/flatMap/reduceByKey operations
    if (currChangeCount.value == changeCount) {
      func1(sc, newNodes, true, currChangeCount.value)
    } else {
      func1(sc, newNodes, false, currChangeCount.value)
    }
  }
}
performSomeOps only contains map, flatMap, and reduceByKey transformations. Since it does not contain any action, the code in performSomeOps does not execute, so currChangeCount never receives the actual count. This means the convergence check (currChangeCount.value == changeCount) is invalid. One way to overcome this is to force an action in each iteration by calling count, but that is unnecessary overhead.
I am wondering what I can do to force an action w/o much overhead or is there another way to address this problem?
I believe there is a very important thing you're missing here:
For accumulator updates performed inside actions only, Spark guarantees that each task’s update to the accumulator will only be applied once, i.e. restarted tasks will not update the value. In transformations, users should be aware of that each task’s update may be applied more than once if tasks or job stages are re-executed.
Because of that, accumulators cannot be reliably used for managing control flow and are better suited for job monitoring.
Moreover, executing an action is not unnecessary overhead. If you want to know the result of a computation, you have to perform it, unless of course the result is trivial. The cheapest action possible is:
rdd.foreach { case _ => }
but it won't address the problem you have here.
In general iterative computations in Spark can be structured as follows:
def func1(checkpointInterval: Int)(sc: SparkContext, nodes: RDD[List[Long]],
    didConverge: Boolean, changeCount: Int, iteration: Int): RDD[List[Long]] = {
  if (didConverge) nodes
  else {
    // Compute and cache new nodes (no accumulator is passed in this version)
    val newNodes = performSomeOps(nodes).cache()
    // Periodically checkpoint to avoid stack overflow
    if (iteration % checkpointInterval == 0) newNodes.checkpoint()
    /* Call a function which computes the value
       that determines control flow. This executes an action on newNodes.
    */
    val currChangeCount = computeChangeCount(newNodes)
    // Unpersist old nodes
    nodes.unpersist()
    func1(checkpointInterval)(
      sc, newNodes, currChangeCount == changeCount,
      currChangeCount, iteration + 1
    )
  }
}
I see that these map/flatMap/reduceByKey transformations are updating an accumulator. Therefore the only way to perform all updates is to execute all these functions and count is the easiest way to achieve that and gives the lowest overhead compared to other ways (cache + count, first or collect).
Previous answers put me on the right track to solve a similar convergence detection problem.
foreach is presented in the docs as:
foreach(func) : Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems.
It seems like instead of using rdd.foreach() as a cheap action to trigger accumulator increments placed in various transformations, it should be used to do the incrementing itself.
I'm unable to produce a Scala example, but here's a basic Java version, in case it still helps:
// Convergence is reached when two iterations
// return the same number of results
long previousCount = -1;
long currentCount = 0;
while (previousCount != currentCount){
    rdd = doSomethingThatUpdatesRdd(rdd);
    // Count entries in new rdd with foreach + accumulator
    rdd.foreach(tuple -> accumulator.add(1));
    // Update helper values
    previousCount = currentCount;
    currentCount = accumulator.sum();
    accumulator.reset();
}
// Convergence is reached

How to use reduce or fold to avoid mutable state

I have a mutable variable in my code that I want to eliminate by using some aggregation function. Unfortunately, I couldn't find a solution for the following pseudocode.
def someMethods(someArgs) = {
  var someMutableVariable = factory
  val resources = getResourcesForVariable(someMutableVariable)
  resources foreach (resource => {
    val localTempVariable = getSomeOtherVariable(resource)
    someMutableVariable = chooseBetterVariable(someMutableVariable, localTempVariable)
  })
  someMutableVariable
}
I have two places in my code where I need to build some variable, then compare it in a loop with other candidates and replace it whenever a newly found candidate is better.
If your resources variable supports it:
// cur is the "currently best" value and next is the next element of the list being folded over
resources.foldLeft(factory) { (cur, next) =>
  val local = getSomeOther(next)
  // Since this function returns the "best" of the two, you're solid
  chooseBetter(local, cur)
}
and then you don't have mutable state.
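To make the fold concrete, here is a small runnable sketch. The Candidate case class, the "lower cost is better" rule and the Int resources are invented for illustration; only the foldLeft shape mirrors the answer.

```scala
object FoldBest {
  final case class Candidate(name: String, cost: Int)

  // Hypothetical comparison: lower cost wins, and ties keep the current best.
  def chooseBetter(current: Candidate, other: Candidate): Candidate =
    if (other.cost < current.cost) other else current

  // Hypothetical stand-in for getSomeOtherVariable(resource).
  def candidateFor(resource: Int): Candidate =
    Candidate(s"resource-$resource", resource)

  // The accumulator carries the best candidate so far; no var is needed.
  def best(initial: Candidate, resources: Seq[Int]): Candidate =
    resources.foldLeft(initial) { (current, resource) =>
      chooseBetter(current, candidateFor(resource))
    }

  def main(args: Array[String]): Unit =
    println(best(Candidate("factory", 10), Seq(12, 7, 9)))
}
```

With an empty resources collection, foldLeft simply returns the initial value, which matches what the loop-based version would do.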