I have an expensive function that I want to run as few times as possible, with the following requirements:
I have several input values to try.
If the function returns a value below a given threshold, I don't want to try the other inputs.
If no result is below the threshold, I want to take the result with the minimal output.
I could not find a nice solution using Iterator's takeWhile/dropWhile, because I want the first matching element to be included. I ended up with the following solution:
import scala.collection.mutable
import scala.util.control.Breaks.{break, breakable}
val pseudoResult = Map("a" -> 0.6,"b" -> 0.2, "c" -> 1.0)
def expensiveFunc(s:String) : Double = {
pseudoResult(s)
}
val inputsToTry = Seq("a","b","c")
val inputIt = inputsToTry.iterator
val results = mutable.ArrayBuffer.empty[(String, Double)]
val earlyAbort = 0.5 // threshold
breakable {
while (inputIt.hasNext) {
val name = inputIt.next()
val res = expensiveFunc(name)
results += Tuple2(name,res)
if (res<earlyAbort) break()
}
}
println(results) // ArrayBuffer((a,0.6), (b,0.2))
val (name, bestResult) = results.minBy(_._2) // (b, 0.2)
If I set val earlyAbort = 0.1, the result should still be (b, 0.2), without evaluating all the cases again.
You can use a Stream to achieve what you are looking for. A Stream is a lazy collection that evaluates its elements on demand.
Here is the Scala Stream documentation.
You only need to do this:
val pseudoResult = Map("a" -> 0.6,"b" -> 0.2, "c" -> 1.0)
val earlyAbort = 0.5
def expensiveFunc(s: String): Double = {
println(s"Evaluating for $s")
pseudoResult(s)
}
val inputsToTry = Seq("a","b","c")
val results = inputsToTry.toStream.map(input => input -> expensiveFunc(input))
val finalResult = results.find { case (k, res) => res < earlyAbort }.getOrElse(results.minBy(_._2))
If find does not return a value, you can use the same stream to find the minimum, and the function is not evaluated again. This is because of memoization:
The Stream class also employs memoization such that previously computed values are converted from Stream elements to concrete values of type A
Note that this code fails if the original collection is empty. To support empty collections, replace minBy with sortBy(_._2).headOption and getOrElse with orElse:
val finalResultOpt = results.find { case (k, res) => res < earlyAbort }.orElse(results.sortBy(_._2).headOption)
And the output for this is:
Evaluating for a
Evaluating for b
finalResult: (String, Double) = (b,0.2)
finalResultOpt: Option[(String, Double)] = Some((b,0.2))
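On Scala 2.13 and later, Stream is deprecated in favour of LazyList, which is also memoized. A minimal sketch of the same idea under that assumption follows. The names firstBelowOrMin and calls are made up here for illustration, and minByOption is used so that an empty input yields None.

```scala
// Sketch assuming Scala 2.13+ (LazyList and minByOption do not exist before 2.13).
val pseudoResult = Map("a" -> 0.6, "b" -> 0.2, "c" -> 1.0)

// Hypothetical counter, only here to show how often the function runs.
var calls = 0
def expensiveFunc(s: String): Double = {
  calls += 1
  pseudoResult(s)
}

// Returns the first result below the threshold, else the overall minimum.
def firstBelowOrMin(inputs: Seq[String], threshold: Double): Option[(String, Double)] = {
  // Elements are computed on demand and memoized once computed.
  val results = inputs.to(LazyList).map(in => in -> expensiveFunc(in))
  results.find(_._2 < threshold).orElse(results.minByOption(_._2))
}
```

With a threshold that no result meets, find forces every element once and the fallback reuses the memoized values, so the function is not called again.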
The clearest, simplest thing to do is to fold over the input, passing forward only the current best result.
val inputIt :Iterator[String] = inputsToTry.iterator
val earlyAbort = 0.5 // threshold
inputIt.foldLeft(("",Double.MaxValue)){ case (low,name) =>
if (low._2 < earlyAbort) low
else Seq(low, (name, expensiveFunc(name))).minBy(_._2)
}
//res0: (String, Double) = (b,0.2)
It calls expensiveFunc() only as many times as needed, but it does walk through the entire input iterator. If that's still too onerous (lots of input), then I'd go with a tail-recursive method.
val inputIt :Iterator[String] = inputsToTry.iterator
val earlyAbort = 0.5 // threshold
def bestMin(low :(String,Double) = ("",Double.MaxValue)) :(String,Double) = {
if (inputIt.hasNext) {
val name = inputIt.next()
val res = expensiveFunc(name)
if (res < earlyAbort) (name, res)
else if (res < low._2) bestMin((name,res))
else bestMin(low)
} else low
}
bestMin() //res0: (String, Double) = (b,0.2)
You can also exit a foldLeft early with a non-local return.
Try the following:
val pseudoResult = Map("a" -> 0.6, "b" -> 0.2, "c" -> 1.0)
def expensiveFunc(s: String): Double = {
println(s"executed for ${s}")
pseudoResult(s)
}
val inputsToTry = Seq("a", "b", "c")
val earlyAbort = 0.5 // threshold
def doIt(): List[(String, Double)] = {
inputsToTry.foldLeft(List[(String, Double)]()) {
case (n, name) =>
val res = expensiveFunc(name)
if(res < earlyAbort) {
return n++List((name, res))
}
n++List((name, res))
}
}
val (name, bestResult) = doIt().minBy(_._2)
println(name)
println(bestResult)
The output:
executed for a
executed for b
b
0.2
As you can see, only a and b are evaluated, and not c.
This is one of the use-cases for tail-recursion:
import scala.annotation.tailrec
val pseudoResult = Map("a" -> 0.6,"b" -> 0.2, "c" -> 1.0)
def expensiveFunc(s:String) : Double = {
pseudoResult(s)
}
val inputsToTry = Seq("a","b","c")
val earlyAbort = 0.5 // threshold
@tailrec
def f(s: Seq[String], result: Map[String, Double] = Map()): Map[String, Double] = s match {
case Seq() => result // matches any empty Seq, not only List
case h +: t =>
val expensiveCalculation = expensiveFunc(h)
val intermediateResult = result + (h -> expensiveCalculation)
if(expensiveCalculation < earlyAbort) {
intermediateResult
} else {
f(t, intermediateResult)
}
}
val result = f(inputsToTry)
println(result) // Map(a -> 0.6, b -> 0.2)
val (name, bestResult) = result.minBy(_._2) // ("b", 0.2), reusing result instead of calling f again
If you implement takeUntil and use it, you'd still have to go through the list once more to get the lowest one if you don't find what you are looking for. Probably a better approach would be to have a function that combines find with reduceOption, returning early if something is found or else returning the result of reducing the collection to a single item (in your case, finding the smallest one).
The result is comparable with what you could achieve using a Stream, as highlighted in a previous reply, but avoids leveraging memoization, which can be cumbersome for very large collections.
A possible implementation could be the following:
import scala.annotation.tailrec
def findOrElse[A](it: Iterator[A])(predicate: A => Boolean,
orElse: (A, A) => A): Option[A] = {
@tailrec
def loop(elseValue: Option[A]): Option[A] = {
if (!it.hasNext) elseValue
else {
val next = it.next()
if (predicate(next)) Some(next)
else loop(Option(elseValue.fold(next)(orElse(_, next))))
}
}
loop(None)
}
Let's add our inputs to test this:
def f1(in: String): Double = {
println("calling f1")
Map("a" -> 0.6, "b" -> 0.2, "c" -> 1.0, "d" -> 0.8)(in)
}
def f2(in: String): Double = {
println("calling f2")
Map("a" -> 0.7, "b" -> 0.6, "c" -> 1.0, "d" -> 0.8)(in)
}
val inputs = Seq("a", "b", "c", "d")
As well as our helpers:
def apply[IN, OUT](in: IN, f: IN => OUT): (IN, OUT) =
in -> f(in)
def threshold[A](a: (A, Double)): Boolean =
a._2 < 0.5
def compare[A](a: (A, Double), b: (A, Double)): (A, Double) =
if (a._2 < b._2) a else b
We can now run this and see how it goes:
val r1 = findOrElse(inputs.iterator.map(apply(_, f1)))(threshold, compare)
val r2 = findOrElse(inputs.iterator.map(apply(_, f2)))(threshold, compare)
val r3 = findOrElse(Map.empty[String, Double].iterator)(threshold, compare)
r1 is Some(b, 0.2), r2 is Some(b, 0.6) and r3 is (reasonably) None. In the first case, since we use a lazy iterator and terminate early, we only invoke f1 twice.
You can look at the results and play with this code on Scastie.
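The takeUntil alternative mentioned at the start of this answer can be sketched as follows. takeUntil is not a standard library method; the helper, the scores map, and the evaluated log are all hypothetical names used for illustration. The sketch shows that evaluation stops at the first match, but picking the minimum still needs a second pass over the buffered results.

```scala
// Hypothetical helper: like takeWhile, but also yields the first element that satisfies p.
def takeUntil[A](it: Iterator[A])(p: A => Boolean): Iterator[A] = new Iterator[A] {
  private var done = false
  def hasNext: Boolean = !done && it.hasNext
  def next(): A = {
    val a = it.next()
    if (p(a)) done = true
    a
  }
}

val scores = Map("a" -> 0.6, "b" -> 0.2, "c" -> 1.0)
var evaluated = List.empty[String] // records which inputs were evaluated
def score(s: String): Double = {
  evaluated = evaluated :+ s
  scores(s)
}

// Evaluation stops after "b"; minBy is the extra pass over the buffered results.
val tried = takeUntil(Iterator("a", "b", "c").map(s => s -> score(s)))(_._2 < 0.5).toList
val best = tried.minBy(_._2)
```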
Considering a list of several million objects like:
case class Point(val name:String, val x:Double, val y:Double)
I need, for a given Point target, to pick the 10 other points which are closest to the target.
val target = Point("myPoint", 34, 42)
val points = List(...) // list of several million points
def distance(p1: Point, p2: Point) = ??? // return the distance between two points
val closest10 = points.sortWith((a, b) => {
distance(a, target) < distance(b, target)
}).take(10)
This method works but is very slow. The whole list is exhaustively sorted for each target request, whereas past the first 10 closest points I don't care about any kind of ordering. I don't even need the 10 closest points to be returned in the correct order.
Ideally, I'd be looking for a "return the first 10 and don't pay attention to the rest" kind of method.
A naive solution I can think of goes like this: sort by buckets of 1000, take the first bucket, sort it by buckets of 100, take the first bucket, sort it by buckets of 10, take the first bucket, done.
I guess this must be a very common problem in CS, so before rolling out my own solution based on this naive approach, I'd like to know of any state-of-the-art way of doing it, or whether some standard method already exists.
TL;DR how to get the first 10 items of an unsorted list, without having to sort the whole list?
Below is a bare-bones method, adapted from this SO answer, for picking the n smallest integers from a list (it can be enhanced to handle more complex data structures):
def nSmallest(n: Int, list: List[Int]): List[Int] = {
def update(l: List[Int], e: Int): List[Int] =
if (e < l.head) (e :: l.tail).sortWith(_ > _) else l
list.drop(n).foldLeft( list.take(n).sortWith(_ > _) )( update(_, _) )
}
nSmallest( 5, List(3, 2, 8, 2, 9, 1, 5, 5, 9, 1, 7, 3, 4) )
// res1: List[Int] = List(3, 2, 2, 1, 1)
Please note that the output is in reverse order.
I was looking at this and wondered if a PriorityQueue might be useful.
import scala.collection.mutable.PriorityQueue
case class Point(val name:String, val x:Double, val y:Double)
val target = Point("myPoint", 34, 42)
val points = List(...) //list of points
def distance(p1: Point, p2: Point) = ??? //distance between two points
//load points-priority-queue with first 10 points
val ppq = PriorityQueue(points.take(10):_*){
case (a,b) => distance(a,target) compare distance(b,target) //prioritize points
}
//step through everything after the first 10
points.drop(10).foldLeft(distance(ppq.head,target))((mxDst,nextPnt) =>
if (mxDst > distance(nextPnt,target)) {
ppq.dequeue() //drop current far point
ppq.enqueue(nextPnt) //load replacement point
distance(ppq.head,target) //return new max distance
} else mxDst)
val result: Seq[Point] = ppq.dequeueAll //10 closest points
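For experimenting with the idea above, here is a self-contained sketch of the same bounded max-heap approach. It assumes Euclidean distance, since the question leaves distance unspecified, and the function name closest is made up here. It stores each distance next to its point so that it is computed once per point.

```scala
import scala.collection.mutable

case class Point(name: String, x: Double, y: Double)

// Assumption: Euclidean distance; the question does not define it.
def distance(p1: Point, p2: Point): Double = math.hypot(p1.x - p2.x, p1.y - p2.y)

// Keeps at most n candidates in a max-heap keyed by distance to the target.
def closest(points: Iterable[Point], target: Point, n: Int): List[Point] =
  if (n <= 0) Nil
  else {
    val byDistance = Ordering.by[(Double, Point), Double](_._1)
    val heap = mutable.PriorityQueue.empty[(Double, Point)](byDistance)
    points.foreach { p =>
      val d = distance(p, target) // computed once per point
      if (heap.size < n) heap.enqueue((d, p))
      else if (d < heap.head._1) { // closer than the current farthest candidate
        heap.dequeue()
        heap.enqueue((d, p))
      }
    }
    heap.toList.sortBy(_._1).map(_._2) // sorting only n items is cheap
  }
```

Each point costs at most one heap update, so the work is roughly proportional to the number of points times log n, rather than a full sort of the list.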
Here is how it can be done with QuickSelect. I used an in-place QuickSelect. For every target point, we calculate the distance between each point and the target, then use QuickSelect to get the k-th smallest distance (the k-th order statistic). Whether this is faster than sorting depends on factors such as the number of points, the number of nearest points requested, and the number of targets. On my machine, for 3 million randomly generated points, 10 target points, and 10 nearest points per target, it is about 2 times faster than sorting:
Number of points: 3000000
Number of targets: 10
Number of nearest: 10
QuickSelect: 10737 ms.
Sort: 20763 ms.
Results from QuickSelect are valid
Code:
import scala.annotation.tailrec
import scala.concurrent.duration.Deadline
import scala.util.Random
case class Point(val name: String, val x: Double, val y: Double)
class NearestPoints(val points: Seq[Point]) {
private case class PointWithDistance(p: Point, d: Double) extends Ordered[PointWithDistance] {
def compare(that: PointWithDistance): Int = d.compareTo(that.d)
}
def distance(p1: Point, p2: Point): Double = {
Math.sqrt(Math.pow(p2.x - p1.x, 2) + Math.pow(p2.y - p1.y, 2))
}
def get(target: Point, n: Int): Seq[Point] = {
val pd = points.map(p => PointWithDistance(p, distance(p, target))).toArray
(1 to n).map(i => quickselect(i, pd).get.p)
}
// In-place QuickSelect from https://gist.github.com/mooreniemi/9e45d55c0410cad0a9eb6d62a5b9b7ae
def quickselect[T <% Ordered[T]](k: Int, xs: Array[T]): Option[T] = {
def randint(lo: Int, hi: Int): Int =
lo + scala.util.Random.nextInt((hi - lo) + 1)
@inline
def swap[T](xs: Array[T], i: Int, j: Int): Unit = {
val t = xs(i)
xs(i) = xs(j)
xs(j) = t
}
def partition[T <% Ordered[T]](xs: Array[T], l: Int, r: Int): Int = {
var pivotIndex = randint(l, r)
val pivotValue = xs(pivotIndex)
swap(xs, r, pivotIndex)
pivotIndex = l
var i = l
while (i <= r - 1) {
if (xs(i) < pivotValue) {
swap(xs, i, pivotIndex)
pivotIndex = pivotIndex + 1
}
i = i + 1
}
swap(xs, r, pivotIndex)
pivotIndex
}
@tailrec
def quickselect0[T <% Ordered[T]](xs: Array[T], l: Int, r: Int, k: Int): T = {
if (l == r) {
xs(l)
} else {
val pivotIndex = partition(xs, l, r)
k compare pivotIndex match {
case 0 => xs(k)
case -1 => quickselect0(xs, l, pivotIndex - 1, k)
case 1 => quickselect0(xs, pivotIndex + 1, r, k)
}
}
}
xs match {
case _ if xs.isEmpty => None
case _ if k < 1 || k > xs.length => None
case _ => Some(quickselect0(xs, 0, xs.size - 1, k - 1))
}
}
}
object QuickSelectVsSort {
def main(args: Array[String]): Unit = {
val rnd = new Random(42L)
val MAX_N: Int = 3000000
val NUM_OF_NEARESTS: Int = 10
val NUM_OF_TARGETS: Int = 10
println(s"Number of points: $MAX_N")
println(s"Number of targets: $NUM_OF_TARGETS")
println(s"Number of nearest: $NUM_OF_NEARESTS")
// Generate random points
val points = (1 to MAX_N)
.map(x => Point(x.toString, rnd.nextDouble, rnd.nextDouble))
// Generate target points
val targets = (1 to NUM_OF_TARGETS).map(x => Point(s"Target$x", rnd.nextDouble, rnd.nextDouble))
var start = Deadline.now
val np = new NearestPoints(points)
val viaQuickSelect = targets.map { case target =>
val nearest = np.get(target, NUM_OF_NEARESTS)
nearest
}
var end = Deadline.now
println(s"QuickSelect: ${(end - start).toMillis} ms.")
start = Deadline.now
val viaSort = targets.map { case target =>
val closest = points.sortWith((a, b) => {
np.distance(a, target) < np.distance(b, target)
}).take(NUM_OF_NEARESTS)
closest
}
end = Deadline.now
println(s"Sort: ${(end - start).toMillis} ms.")
// Validate
assert(viaQuickSelect.length == viaSort.length)
viaSort.zipWithIndex.foreach { case (p, idx) =>
assert(p == viaQuickSelect(idx))
}
println("Results from QuickSelect are valid")
}
}
For finding the top n elements in a list you can Quicksort it and terminate early. That is, terminate at the point where you know there are n elements that are bigger than the pivot. See my implementation in the Rank class of Apache Jackrabbit (in Java though), which does just that.
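The early-termination idea can be sketched with a partition-based selection. This is not the Jackrabbit Rank implementation, only a minimal illustration. The name smallestN is made up here, and it returns the n smallest elements; pass a reversed Ordering to get the n largest. Recursion continues only into the side of the pivot that can still contain part of the answer.

```scala
// Returns the n smallest elements of xs in no particular order.
// Sketch only: the middle element is the pivot, so adversarial inputs can be slow.
def smallestN[A](xs: Vector[A], n: Int)(implicit ord: Ordering[A]): Vector[A] =
  if (n <= 0) Vector.empty
  else if (n >= xs.length) xs
  else {
    val pivot = xs(xs.length / 2)
    val smaller = xs.filter(ord.lt(_, pivot))
    val equal = xs.filter(ord.equiv(_, pivot))
    if (n <= smaller.length)
      smallestN(smaller, n) // the answer lies entirely below the pivot
    else if (n <= smaller.length + equal.length)
      smaller ++ equal.take(n - smaller.length) // done: larger elements are never examined further
    else {
      val larger = xs.filter(ord.gt(_, pivot))
      smaller ++ equal ++ smallestN(larger, n - smaller.length - equal.length)
    }
  }
```

For the points question, an Ordering built with Ordering.by on the distance to the target could be passed as ord.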
I tried the code below, but my probability function returns only a single value instead of one value per list element. Why?
var l=List(2.2,3.1)
def sum(xs: List[Double]): Double=
{
if(xs.isEmpty) 0.0
else xs.head+sum(xs.tail)
}
var s=sum(l)
def probability_calculation( xs: List[Double], s:Double): List[Double]=
{
var p=List[Double]()
var R=2
if(xs.isEmpty) List()
else
{
p=p:::List(xs.head*R/s)
probability_calculation(xs.tail,s)
}
p
}
probability_calculation(l,s)
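You're re-initializing the list on every recursive call:
var p=List[Double]()
You also discard the result of the recursive call, so the final list contains only the value computed in the outermost call, which is the one for the first element.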
I don't see the need for the recursion, you could just map it:
def probability_calculation(xs: List[Double], s: Double): List[Double] = {
val R = 2
xs.map(x => x * R / s)
}
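If the recursive shape is wanted anyway, a minimal sketch that keeps the result of each recursive call could look like this. The camel-case name and the parameter r, which stands in for R, are choices made here for illustration.

```scala
// Sketch: each call prepends its own value to the result of the recursive call,
// instead of storing it in a fresh local list and discarding the recursion's result.
def probabilityCalculation(xs: List[Double], s: Double, r: Double = 2.0): List[Double] =
  xs match {
    case Nil          => Nil
    case head :: tail => (head * r / s) :: probabilityCalculation(tail, s, r)
  }

val values = List(2.2, 3.1)
val probabilities = probabilityCalculation(values, values.sum)
```

This version is not tail-recursive, so for very long lists the map-based version is the safer choice.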
A few comments on your approach:
Try to avoid vars in your function; they are the source of the problem you are facing.
Also, your sum function could be written as simply as:
val sum = List(2.2, 3.1).foldLeft(0.0)(_ + _)
Your probability function could be written as:
def probCalculation(xs: List[Double], s: Double): List[Double] = xs.map(x => x * 2 / s)
How can the following Scala function be refactored to use idiomatic best practices?
def getFilteredList(ids: Seq[Int],
idsMustBeInThisListIfItExists: Option[Seq[Int]],
idsMustAlsoBeInThisListIfItExists: Option[Seq[Int]]): Seq[Int] = {
var output = ids
if (idsMustBeInThisListIfItExists.isDefined) {
output = output.intersect(idsMustBeInThisListIfItExists.get)
}
if (idsMustAlsoBeInThisListIfItExists.isDefined) {
output = output.intersect(idsMustAlsoBeInThisListIfItExists.get)
}
output
}
Expected input/output:
val ids = Seq(1,2,3,4,5)
val output1 = getFilteredList(ids, None, Some(Seq(3,5))) // 3, 5
val output2 = getFilteredList(ids, None, None) // 1,2,3,4,5
val output3 = getFilteredList(ids, Some(Seq(1,2)), None) // 1,2
val output4 = getFilteredList(ids, Some(Seq(1)), Some(Seq(5))) // empty: no id is in both lists
Thank you for your time.
Here's a simple way to do this:
implicit class SeqAugmenter[T](val seq: Seq[T]) extends AnyVal {
def intersect(opt: Option[Seq[T]]): Seq[T] = {
opt.fold(seq)(seq intersect _)
}
}
def getFilteredList(ids: Seq[Int],
idsMustBeInThisListIfItExists: Option[Seq[Int]],
idsMustAlsoBeInThisListIfItExists: Option[Seq[Int]]
): Seq[Int] = {
ids intersect
idsMustBeInThisListIfItExists intersect
idsMustAlsoBeInThisListIfItExists
}
Yet another way without for comprehensions and implicits:
def getFilteredList(ids: Seq[Int],
idsMustBeInThisListIfItExists: Option[Seq[Int]],
idsMustAlsoBeInThisListIfItExists: Option[Seq[Int]]): Seq[Int] = {
val output1 = ids.intersect(idsMustBeInThisListIfItExists.getOrElse(ids))
val output2 = output1.intersect(idsMustAlsoBeInThisListIfItExists.getOrElse(output1))
output2
}
Another similar way, without implicits.
def getFilteredList[A](ids: Seq[A],
idsMustBeInThisListIfItExists: Option[Seq[A]],
idsMustAlsoBeInThisListIfItExists: Option[Seq[A]]): Seq[A] = {
val a = intersect(Some(ids), idsMustBeInThisListIfItExists)(ids)
val b = intersect(Some(a), idsMustAlsoBeInThisListIfItExists)(a)
b
}
def intersect[A](ma: Option[Seq[A]], mb: Option[Seq[A]])(default: Seq[A]) = {
(for {
a <- ma
b <- mb
} yield {
a.intersect(b)
}).getOrElse(default)
}
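One more possible shape, sketched here as an illustration: treat the optional lists as a sequence of filters and fold over the ones that are present. The varargs parameter mustBeIn is a name chosen for this sketch and changes the original signature, although call sites with two optional lists stay the same.

```scala
// Sketch: flatten drops the Nones, and foldLeft intersects ids with each remaining list in turn.
// `mustBeIn` is a hypothetical varargs parameter replacing the two named Option parameters.
def getFilteredList[A](ids: Seq[A], mustBeIn: Option[Seq[A]]*): Seq[A] =
  mustBeIn.flatten.foldLeft(ids)(_ intersect _)
```

Because the filters are applied one after another, two lists with no common element, such as Some(Seq(1)) and Some(Seq(5)), produce an empty result.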
Can you help me to avoid broadcasting of a large lookup table? I have a table with measurements:
Measurement Value
x1 5.1
x2 8.9
x1 9.1
x3 4.4
x2 2.1
...
And a list of pairs:
P1 P2
x1 x2
x2 x3
...
The task is to get all values for both elements of every pair and pass them to a magic function. This is how I solved it, by broadcasting the large measurements table:
case class Measurement(measurement: String, value: Double)
case class Candidate(c1: String, c2: String)
val measurements = Seq(Measurement("x1", 5.1), Measurement("x2", 8.9),
Measurement("x1", 9.1), Measurement("x3", 4.4))
val candidates = Seq(Candidate("x1", "x2"), Candidate("x2", "x3"))
// create data frames
val dfm = sqc.createDataFrame(measurements)
val dfc = sqc.createDataFrame(candidates)
// broadcast lookup table
val lookup = sc.broadcast(dfm.rdd.map(r => (r(0), r(1))).collect())
// udf: run magic test with every candidate
val magic: ((String, String) => Double) = (c1: String, c2: String) => {
val lt = lookup.value
val c1v = lt.filter(_._1 == c1).map(_._2).map(_.asInstanceOf[Double])
val c2v = lt.filter(_._1 == c2).map(_._2).map(_.asInstanceOf[Double])
new Foo().magic(c1v, c2v)
}
val sq1 = udf(magic)
val dfks = dfc.withColumn("magic", sq1(col("c1"), col("c2")))
As you can guess, I'm not very happy with this solution. For every pair I filter the lookup table twice, which is neither fast nor elegant. I'm using Spark 1.6.1.
An alternative would be to use RDDs and join. I'm not sure which is better in terms of performance, though.
case class Measurement(measurement: String, value: Double)
case class Candidate(c1: String, c2: String)
val measurements = Seq(Measurement("x1", 5.1), Measurement("x2", 8.9),
Measurement("x1", 9.1), Measurement("x3", 4.4))
val candidates = Seq(Candidate("x1", "x2"), Candidate("x2", "x3"))
val rdm = sc.parallelize(measurements).map(r => (r.measurement, r.value)).groupByKey().cache()
val rdc = sc.parallelize(candidates).map(r => (r.c1, r.c2)).cache()
val firstColJoin = rdc.join(rdm).values
val secondColJoin = firstColJoin.join(rdm).values
secondColJoin.map { case (c1v, c2v) => new Foo().magic(c1v, c2v) }
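To see what the two joins compute, the same steps can be mimicked on plain Scala collections. This is only a local illustration with no Spark involved: grouped stands in for the groupByKey result and joined for the two inner joins, and both names are made up here.

```scala
case class Measurement(measurement: String, value: Double)
case class Candidate(c1: String, c2: String)

val measurements = Seq(Measurement("x1", 5.1), Measurement("x2", 8.9),
                       Measurement("x1", 9.1), Measurement("x3", 4.4))
val candidates = Seq(Candidate("x1", "x2"), Candidate("x2", "x3"))

// Local stand-in for groupByKey: collect the values per measurement name once.
val grouped: Map[String, Seq[Double]] =
  measurements.groupBy(_.measurement).map { case (k, ms) => k -> ms.map(_.value) }

// Local stand-in for the two inner joins: a candidate with a missing key is dropped.
val joined: Seq[(Candidate, Seq[Double], Seq[Double])] =
  for {
    c   <- candidates
    c1v <- grouped.get(c.c1)
    c2v <- grouped.get(c.c2)
  } yield (c, c1v, c2v)
```

Each element of joined carries the two value collections that would be handed to the magic function, so every measurement is grouped once instead of being filtered out of the lookup table for every pair.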
Thank you for all the comments. I read them, did some research, and studied zero323's posts.
My current solution uses two joins and a UserDefinedAggregateFunction:
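import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types.{ArrayType, DoubleType, StructType}
import scala.collection.mutable.ArrayBuffer
object GroupValues extends UserDefinedAggregateFunction {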
def inputSchema = new StructType().add("x", DoubleType)
def bufferSchema = new StructType().add("buff", ArrayType(DoubleType))
def dataType = ArrayType(DoubleType)
def deterministic = true
def initialize(buffer: MutableAggregationBuffer) = {
buffer.update(0, ArrayBuffer.empty[Double])
}
def update(buffer: MutableAggregationBuffer, input: Row) = {
if (!input.isNullAt(0))
buffer.update(0, buffer.getSeq[Double](0) :+ input.getDouble(0))
}
def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
buffer1.update(0, buffer1.getSeq[Double](0) ++ buffer2.getSeq[Double](0))
}
def evaluate(buffer: Row) = buffer.getSeq[Double](0)
}
// join data for candidate one
val j1 = dfc.join(dfm, dfc("c1") === dfm("measurement"))
// aggregate all c1 values to an array
val j1v = j1.groupBy(col("c1"), col("c2")).agg(GroupValues(col("value"))
.alias("c1-values"))
// join data for candidate two
val j2 = j1v.join(dfm, j1v("c2") === dfm("measurement"))
// aggregate all c2 values to an array
val j2v = j2.groupBy(col("c1"), col("c2"), col("c1-values"))
.agg(GroupValues(col("value")).alias("c2-values"))
The next step would be to use collect_list instead of the UserDefinedAggregateFunction.