Efficient Scala idiomatic way to pick top 85 percent of sorted values?

Given a non-increasing list of numbers, I want to pick top 85% of values in the list. Here is how I am currently doing it.
scala> val a = Array(8.60, 6.85, 4.91, 3.45, 2.74, 2.06, 1.53, 0.35, 0.28, 0.12)
a: Array[Double] = Array(8.6, 6.85, 4.91, 3.45, 2.74, 2.06, 1.53, 0.35, 0.28, 0.12)
scala> val threshold = a.sum * 0.85
threshold: Double = 26.2565
scala> val successiveSums = a.tail.foldLeft(Array[Double](a.head)){ case (x,y) => x ++ Array(y + x.last) }
successiveSums: Array[Double] = Array(8.6, 15.45, 20.36, 23.81, 26.549999999999997, 28.609999999999996, 30.139999999999997, 30.49, 30.77, 30.89)
scala> successiveSums.takeWhile( x => x <= threshold )
res40: Array[Double] = Array(8.6, 15.45, 20.36, 23.81)
scala> val size = successiveSums.takeWhile( x => x <= threshold ).size
size: Int = 4
scala> a.take(size)
res41: Array[Double] = Array(8.6, 6.85, 4.91, 3.45)
I want to improve its performance and code size. Any suggestions?

On code size, consider this one-liner:
a.take( a.scanLeft(0.0)(_+_).takeWhile( _ <= a.sum * 0.85 ).size - 1 )
Here scanLeft accumulates the running sums.
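For instance, on the example array the scan yields the same successive sums as before, but with the seed 0.0 prepended, which is why the one-liner subtracts 1 from the size:
scala> a.scanLeft(0.0)(_+_)
res42: Array[Double] = Array(0.0, 8.6, 15.45, 20.36, 23.81, 26.549999999999997, 28.609999999999996, 30.139999999999997, 30.49, 30.77, 30.89)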
On performance, naming intermediate values helps avoid recomputing the same operations, namely
val threshold = a.sum * 0.85
val size = a.scanLeft(0.0)(_+_).takeWhile( _ <= threshold ).size - 1
a.take( size )

There is some room for improvement in elm's answer:
1) You don't need to compute the sum twice.
2) You can avoid creating an additional collection with the takeWhile method by using indexWhere instead.
val sums = a.scanLeft(0.0)(_ + _)
a.take(sums.indexWhere(_ > sums.last * 0.85) - 1)

There's no library method that will do exactly what you want. Generally if you want something that performs well, you'd use a tail-recursive method both to find the sum and to find the point where the 85th percentile of the total sum is crossed. Something like
@annotation.tailrec
def threshold(
  xs: Array[Double], thresh: Double,
  i: Int = 0, sum: Double = 0
): Array[Double] = {
  if (i >= xs.length) xs                  // exhausted the array: keep everything
  else {
    val next = sum + xs(i)
    if (next > thresh) xs.take(i)         // threshold crossed: keep the first i values
    else threshold(xs, thresh, i + 1, next)
  }
}
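Applied to the array from the question, this returns the same prefix:
threshold(a, a.sum * 0.85)  // Array(8.6, 6.85, 4.91, 3.45)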

Well in this case I would make a little use of mutable state. See the code below:
val a = Array(8.60, 6.85, 4.91, 3.45, 2.74, 2.06, 1.53, 0.35, 0.28, 0.12)
def f(a: Array[Double]) = {
  val toGet = a.sum * 0.85
  var sum = 0.0
  a.takeWhile(x => { sum += x; sum <= toGet })
}
println(f(a).deep) //Array(8.6, 6.85, 4.91, 3.45)
In my opinion it's acceptable because the function f has no externally observable side effects: the mutation is confined to a local variable.

Related

efficient computation of haversine distance between elements of collections

I have two collections. Each collection is comprised of a collection containing a latitude, longitude, and epoch.
val arr1 = Seq(
  Seq(34.464, -115.341, 1486220267.0), Seq(34.473, -115.452, 1486227821.0),
  Seq(35.572, -116.945, 1486217300.0), Seq(37.843, -115.874, 1486348520.0),
  Seq(35.874, -115.014, 1486349803.0), Seq(34.345, -116.924, 1486342752.0)
)
val arr2 = Seq(Seq(35.573, -116.945, 1486217300.0), Seq(34.853, -114.983, 1486347321.0))
I want to determine how many times elements of the two collections are within 0.5 miles of each other and have the same epoch. I have two functions:
def haversineDistance_single(pointA: (Double, Double), pointB: (Double, Double)): Double = {
  val deltaLat = math.toRadians(pointB._1 - pointA._1)
  val deltaLong = math.toRadians(pointB._2 - pointA._2)
  val a = math.pow(math.sin(deltaLat / 2), 2) +
    math.cos(math.toRadians(pointA._1)) * math.cos(math.toRadians(pointB._1)) *
    math.pow(math.sin(deltaLong / 2), 2)
  val greatCircleDistance = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
  3958.761 * greatCircleDistance
}

def location_time(col_2: Seq[Seq[Double]], col_1: Seq[Seq[Double]]): Int = {
  val arr = col_1.map(x => col_2.filter(y =>
    (haversineDistance_single((y(0), y(1)), (x(0), x(1))) <= .5) &
      (math.abs(y(2) - x(2)) <= 0)).flatten).filter(x => x.length > 0)
  arr.length
}
location_time(arr1, arr2) // = 1
My actual collections are very large; is there a more efficient way to compute this than my location_time function?
I would consider revising location_time from:
def location_time(col_mobile: Seq[Seq[Double]], col_laptop: Seq[Seq[Double]]): Int = {
  val arr = col_laptop.map(x => col_mobile.filter(y =>
    (haversineDistance_single((y(0), y(1)), (x(0), x(1))) <= .5) & (math.abs(y(2) - x(2)) <= 0)
  ).flatten).filter(x => x.length > 0)
  arr.length
}
to:
def location_time(col_mobile: Seq[Seq[Double]], col_laptop: Seq[Seq[Double]]): Int = {
  val arr = col_laptop.flatMap(x => col_mobile.filter(y =>
    math.abs(y(2) - x(2)) <= 0 && haversineDistance_single((y(0), y(1)), (x(0), x(1))) <= .5
  ))
  arr.length
}
Changes made:
Revised col_mobile.filter(y => ...) from:
filter(_ => costlyCond1 & lessCostlyCond2)
to:
filter(_ => lessCostlyCond2 && costlyCond1)
Assuming haversineDistance_single is more costly to run than math.abs, replacing & with && (see the difference between & and &&) and testing math.abs first might help the filtering performance; a toy demonstration of the short-circuit follows after this list.
Simplified map/filter/flatten/filter using flatMap, replacing:
col_laptop.map(x => col_mobile.filter(y => ...).flatten).filter(_.length > 0)
with:
col_laptop.flatMap( x => col_mobile.filter( y => ... ))
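Here is a minimal sketch of the short-circuit difference, using hypothetical cheap and costly predicates (these are not from the original code):
def cheap(x: Int): Boolean = x > 0
def costly(x: Int): Boolean = { println("costly called"); x % 2 == 0 }

// && short-circuits: costly is skipped whenever cheap is false.
Seq(-1, 2, 3).filter(x => cheap(x) && costly(x)) // prints "costly called" twice
// & always evaluates both sides.
Seq(-1, 2, 3).filter(x => cheap(x) & costly(x))  // prints "costly called" three times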
In case you have access to, say, an Apache Spark cluster, consider converting your collections (if they're really large) to RDDs and computing with transformations similar to the above.
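A rough sketch of what that could look like, assuming a running Spark application and reusing haversineDistance_single (locationTimeSpark and its argument layout are illustrative, not a drop-in implementation):
import org.apache.spark.rdd.RDD

// Broadcast the smaller collection and scan the large RDD in parallel;
// the count of matching (laptop, mobile) pairs mirrors the flatMap version above.
def locationTimeSpark(bigLaptop: RDD[Seq[Double]], smallMobile: Seq[Seq[Double]]): Long = {
  val mobileB = bigLaptop.sparkContext.broadcast(smallMobile)
  bigLaptop.flatMap { x =>
    mobileB.value.filter { y =>
      math.abs(y(2) - x(2)) <= 0 &&
        haversineDistance_single((y(0), y(1)), (x(0), x(1))) <= .5
    }
  }.count()
}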

Efficient way to select a subset of elements from a Seq in Scala

I have a sequence
val input = Seq(1,3,4,5,9,11...)
I want to randomly select a subset of it. What is the fastest way?
I currently implement it like this:
// ratio is the percentage of the subgroup from the whole group
def randomSelect(ratio: Double): Boolean = {
  val rr = scala.util.Random
  if (rr.nextFloat() < ratio) true else false
}

val ratio = 0.3
val result = input.map(x => (x, randomSelect(ratio))).filter(_._2).map(_._1)
So I first attach a true/false label to each element, filter out the false ones, and get back the subset of the sequence. Is there any faster or otherwise better way?
So there are basically two approaches to this:
select n elements at random
include or exclude each element with probability p
Your solution is the latter and can be simplified to:
l.filter(_ => r.nextFloat < p)
(I'm calling the list l, the Random instance r, and your ratio p from here on out.)
If you wanted to sample exactly n elements you could do:
r.shuffle(l).take(n)
I compared these selecting 200 elements from a 1000 element list:
scala> val first = time{
| l.map(x => (x, r.nextFloat < p)).filter(_._2).map(_._1)
| }
Elapsed time: 3249507ns
scala> val second = time {
| r.shuffle(l).take(200)
| }
Elapsed time: 10640432ns
scala> val third = time{
| l.filter(_ => r.nextFloat < p)}
Elapsed time: 1689009ns
Dropping your two extra maps appears to speed things up by about a third (which makes complete sense). The shuffle-and-take method is significantly slower, but does guarantee a fixed number of elements.
I borrowed the timing function from here if you want to do a more rigorous investigation (i.e. averaging over many trials rather than one).
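For reference, a minimal timing helper in the same spirit (a sketch; the linked original may differ):
// Evaluates the block once and prints the elapsed wall-clock time in nanoseconds.
def time[R](block: => R): R = {
  val t0 = System.nanoTime()
  val result = block
  println(s"Elapsed time: ${System.nanoTime() - t0}ns")
  result
}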
If your list isn't big, a simple filter as suggested by others should suffice:
list.filter(_ => Random.nextDouble < p)
In case you have a big list, the per-element call of Random could become the bottleneck. One approach to minimize the calls is to generate random gaps (0, 1, 2, ...) by which the data sampling will hop over the elements. Below is a simple implementation in Scala:
import scala.util.Random
import scala.math._

def gapSampling(list: List[Double], p: Double): List[Double] = {
  def randomGap(p: Double): Double = {
    val epsilon: Double = 1e-10
    val u = max(Random.nextDouble, epsilon)
    floor(log(u) / log(1 - p))
  }

  @scala.annotation.tailrec
  def samplingFcn(acc: List[Double], list: List[Double], p: Double): List[Double] = list match {
    case Nil => acc
    case _ =>
      val gap = randomGap(p).toInt
      val l = list.drop(gap + 1)
      val accNew = l.headOption match {
        case Some(e) => e :: acc
        case None => acc
      }
      samplingFcn(accNew, l, p)
  }

  samplingFcn(List[Double](), list, p).reverse
}
val list = (1 to 100).toList.map(_.toDouble)
gapSampling(list, 0.3)
// res1: List[Double] = List(
// 2.0, 5.0, 7.0, 14.0, 15.0, 18.0, 20.0, 25.0, 26.0, 28.0, 33.0,
// 35.0, 42.0, 43.0, 47.0, 48.0, 50.0, 55.0, 56.0, 59.0, 62.0,
// 69.0, 72.0, 75.0, 76.0, 79.0, 82.0, 93.0, 96.0, 97.0, 98.0
// )
The gap is an inverse-CDF draw from a geometric distribution with success probability p, so each element is still kept with probability p while Random is called only about once per kept element. More details about such gap sampling can be found here.

scala range split missing the last one

I use Scala's Range.by to split a range into an array, but it misses the last element for some particular bucket numbers, for example 100. I am puzzled; here is a demo:
object SplitDemo extends App {
  val min = 0.0
  val max = 7672.142857142857
  val bucketNum = 100

  def splitsBucket1(min: Double, max: Double, num: Int) =
    (min to max by ((max - min) / num)).toArray

  def splitsBucket2(min: Double, max: Double, num: Int): Array[Double] = {
    val rst = Array.fill[Double](num + 1)(0)
    rst(0) = min
    rst(num) = max
    val step = (max - min) / num
    for (i <- 1 until num) rst(i) = rst(i - 1) + step
    rst
  }

  val split1 = splitsBucket1(min, max, bucketNum)
  println(s"Split1 size = ${split1.size}, %s".format(split1.takeRight(4).mkString(",")))
  val split2 = splitsBucket2(min, max, bucketNum)
  println(s"Split2 size = ${split2.size}, %s".format(split2.takeRight(4).mkString(",")))
}
the output is following
Split1 size = 100,7365.257142857143,7441.978571428572,7518.700000000001,7595.421428571429
Split2 size = 101,7441.978571428588,7518.700000000017,7595.421428571446,7672.142857142857
When num = 100, split1 misses the last element, but split2 does not (which is what I expect). For other values of num, e.g. 130, split1 and split2 give the same result.
What causes the difference?
It's the usual floating point inaccuracy.
Look how the max comes out differently after dividing it and multiplying it back:
scala> 7672.142857142857 / 100 * 100
res1: Double = 7672.142857142858
And this number is larger than max, so it doesn't fit into the range:
scala> max / bucketNum * bucketNum > max
res2: Boolean = true
It's still more correct than adding the step 100 times, as in splitsBucket2:
scala> var result = 0.0
result: Double = 0.0
scala> for (_ <- 0 until 100) result += (max - min) / bucketNum
scala> result
res4: Double = 7672.142857142875
This is larger than both max and max / bucketNum * bucketNum. You avoid this in splitsBucket2 by explicitly assigning rst(num) = max, though.
You can try the following split implementation:
def splitsBucket3(min: Double, max: Double, num: Int): Array[Double] = {
  val step = (max - min) / num
  Array.tabulate(num + 1)(min + step * _)
}
It is guaranteed to have the correct number of elements, and has less numeric precision problems than splitsBucket2.
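For example, with the question's inputs the element count now comes out right by construction:
val split3 = splitsBucket3(0.0, 7672.142857142857, 100)
split3.size // 101: Array.tabulate(num + 1) always produces num + 1 elements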

when testing for convergence using successive approximation technique, why does this code divide by the guess twice?

While working through the Coursera class on Scala, I ran into the code below (from another question asked here by Sudipta Deb).
package src.com.sudipta.week2.coursera

import scala.math.abs
import scala.annotation.tailrec

object FixedPoint {
  println("Welcome to the Scala worksheet")  //> Welcome to the Scala worksheet

  val tolerance = 0.0001  //> tolerance : Double = 1.0E-4

  def isCloseEnough(x: Double, y: Double): Boolean = {
    abs((x - y) / x) / x < tolerance
  }  //> isCloseEnough: (x: Double, y: Double)Boolean

  def fixedPoint(f: Double => Double)(firstGuess: Double): Double = {
    @tailrec
    def iterate(guess: Double): Double = {
      val next = f(guess)
      if (isCloseEnough(guess, next)) next
      else iterate(next)
    }
    iterate(firstGuess)
  }  //> fixedPoint: (f: Double => Double)(firstGuess: Double)Double

  def myFixedPoint = fixedPoint(x => 1 + x / 2)(1)  //> myFixedPoint: => Double
  myFixedPoint  //> res0: Double = 1.999755859375

  def squareRoot(x: Double) = fixedPoint(y => (y + x / y) / 2)(1)  //> squareRoot: (x: Double)Double
  squareRoot(2)  //> res1: Double = 1.4142135623746899

  def calculateAverate(f: Double => Double)(x: Double) = (x + f(x)) / 2  //> calculateAverate: (f: Double => Double)(x: Double)Double

  def myNewSquareRoot(x: Double): Double = fixedPoint(calculateAverate(y => x / y))(1)  //> myNewSquareRoot: (x: Double)Double
  myNewSquareRoot(2)  //> res2: Double = 1.4142135623746899
}
My puzzlement concerns the isCloseEnough function.
I understand that for guesses which are large numbers, the difference between a guess and
the large value that the function returns could potentially be very big all the time, so we may never converge.
Conversely, if the guess is small, and if what f(x) produces is small then we will likely converge too quickly.
So dividing through by the guess like this:
def isCloseEnough(x: Double, y: Double): Boolean = {
abs((x - y) / x) / x < tolerance
}
makes perfect sense (here 'x' is the guess and 'y' is f_of_x).
My question is: why does the given solution divide by the guess TWICE?
Wouldn't that undo all the benefits of dividing through by the guess the first time ?
As an example, let's say that my current guess and the value actually returned by the function for my current x are as shown below:
import math.abs
var guess=.0000008f
var f_of_x=.00000079999f
And let's say my tolerance is
var tolerance=.0001
These numbers look pretty close, and indeed, if I divide through by x ONCE, I see that the result is less than my tolerance:
( abs(guess - f_of_x) / guess)
res3: Float = 1.2505552E-5
However, if I divide through by x TWICE, the result is much greater than my tolerance, which would suggest we need to keep iterating; that seems wrong since the guess and the observed f(x) are so close.
scala> ( abs(guess - f_of_x) / guess) / guess
res11: Float = 15.632331
Thanks in advance for any help you can provide.
You are completely right, it does not make sense. Further, the second division is outside of the absolute value rendering the inequality true for any negative x.
Perhaps someone got confused with testing for quadratic convergence.
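For comparison, a conventional relative-error test divides by the guess only once, inside the absolute value (a sketch of the usual form, not what the lecture shows):
def isCloseEnough(x: Double, y: Double): Boolean =
  abs((x - y) / x) < tolerance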

Scala: apply Map to a list of tuples

very simple question: I want to do something like this:
var arr1: Array[Double] = ...
var arr2: Array[Double] = ...
var arr3: Array[(Double,Double)] = arr1.zip(arr2)
arr3.foreach(x => {if (x._1 > treshold) {x._2 = x._2 * factor}})
I tried a lot of different syntax versions, but all of them failed. How can I solve this? It can't be very difficult...
Thanks!
There are multiple approaches to solve this; consider for instance the use of collect, which delivers a new collection arr4, as follows:
val arr4 = arr3.collect {
  case (x, y) if x > threshold => (x, y * factor)
  case v => v
}
Or with a for comprehension like this:
for ((x, y) <- arr3)
yield (x, if (x > threshold) y * factor else y)
I think you want to do something like
scala> val arr1 = Array(1.1, 1.2)
arr1: Array[Double] = Array(1.1, 1.2)
scala> val arr2 = Array(1.1, 1.2)
arr2: Array[Double] = Array(1.1, 1.2)
scala> val arr3 = arr1.zip(arr2)
arr3: Array[(Double, Double)] = Array((1.1,1.1), (1.2,1.2))
scala> arr3.filter(_._1> 1.1).map(_._2*2)
res0: Array[Double] = Array(2.4)
I think there are two problems:
You're using foreach, which returns Unit, where you want to use map, which returns an Array[B].
You're trying to update an immutable value, when you want to return a new, updated value. This is the difference between _._2 = _._2 * factor and _._2 * factor.
To filter the values not meeting the threshold:
arr1.zip(arr2).filter(_._1 > threshold).map(_._2 * factor)
To keep all values, but only multiply the ones meeting the threshold:
arr1.zip(arr2).map {
  case (x, y) if x > threshold => y * factor
  case (_, y) => y
}
You can do it with this,
arr3.map(x => if (x._1 > threshold) (x._1, x._2 * factor) else x)
How about this?
arr3.map { case (x1, x2) =>            // extract first and second value
  if (x1 > treshold) (x1, x2 * factor) // if the first value is greater than the threshold, 'change' x2
  else (x1, x2)                        // otherwise leave it as it is
}
Scala is generally functional, which means you do not change values but create new ones; for example, you do not write x._2 = …, since a tuple is immutable (you can't change it), but create a new tuple instead.
This will do what you need.
arr3.map(x => if(x._1 > treshold) (x._1, x._2 * factor) else x)
The key here is that you can return a tuple from the map lambda expression by wrapping the two values in (..).
Edit: if you want to change every element of the array in place, without creating a new array, then you need to assign back into it (tuples are immutable, so you replace the whole element):
arr3.indices.foreach(i => if (arr3(i)._1 > treshold) arr3(i) = (arr3(i)._1, arr3(i)._2 * factor))