efficient computation of haversine distance between elements of collections - scala

I have two collections. Each collection is made up of inner collections, each containing a latitude, a longitude, and an epoch.
val arr1 = Seq(Seq(34.464, -115.341, 1486220267.0), Seq(34.473, -115.452, 1486227821.0),
  Seq(35.572, -116.945, 1486217300.0), Seq(37.843, -115.874, 1486348520.0),
  Seq(35.874, -115.014, 1486349803.0), Seq(34.345, -116.924, 1486342752.0))
val arr2 = Seq(Seq(35.573, -116.945, 1486217300.0), Seq(34.853, -114.983, 1486347321.0))
I want to determine how many times the elements of the two collections are within .5 miles of each other and have the same epoch. I have two functions:
def haversineDistance_single(pointA: (Double, Double), pointB: (Double, Double)): Double = {
  val deltaLat = math.toRadians(pointB._1 - pointA._1)
  val deltaLong = math.toRadians(pointB._2 - pointA._2)
  val a = math.pow(math.sin(deltaLat / 2), 2) + math.cos(math.toRadians(pointA._1)) * math.cos(math.toRadians(pointB._1)) * math.pow(math.sin(deltaLong / 2), 2)
  val greatCircleDistance = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
  3958.761 * greatCircleDistance
}
def location_time(col_2: Seq[Seq[Double]], col_1: Seq[Seq[Double]]): Int = {
  val arr = col_1.map(x => col_2.filter(y =>
    (haversineDistance_single((y(0), y(1)), (x(0), x(1))) <= .5) &
      (math.abs(y(2) - x(2)) <= 0)).flatten).filter(x => x.length > 0)
  arr.length
}
location_time(arr1, arr2) // = 1
My actual collections are very large. Is there a more efficient way than my location_time function to compute this?

I would consider revising location_time from:
def location_time(col_mobile: Seq[Seq[Double]], col_laptop: Seq[Seq[Double]]): Int = {
val arr = col_laptop.map( x => col_mobile.filter( y =>
(haversineDistance_single((y(0), y(1)), (x(0), x(1))) <= .5) & (math.abs(y(2) - x(2)) <= 0)
).flatten
).filter(x => x.length > 0)
arr.length
}
to:
def location_time(col_mobile: Seq[Seq[Double]], col_laptop: Seq[Seq[Double]]): Int = {
val arr = col_laptop.flatMap( x => col_mobile.filter( y =>
((math.abs(y(2) - x(2)) <= 0 && haversineDistance_single((y(0), y(1)), (x(0), x(1))) <= .5))
)
)
arr.length
}
Changes made:
Revised col_mobile.filter(y => ...) from:
filter(_ => costlyCond1 & lessCostlyCond2)
to:
filter(_ => lessCostlyCond2 && costlyCond1)
Assuming haversineDistance_single is more costly to run than math.abs, replacing & with && (which short-circuits; see the difference between & and &&) and testing math.abs first might help the filtering performance. A small illustration of the short-circuit behaviour follows this list of changes.
Simplified map/filter/flatten/filter using flatMap, replacing:
col_laptop.map(x => col_mobile.filter(y => ...).flatten).filter(_.length > 0)
with:
col_laptop.flatMap( x => col_mobile.filter( y => ... ))
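A tiny illustration of that short-circuit difference, using hypothetical cheap and costly predicates standing in for the math.abs test and haversineDistance_single:
def cheap(x: Double): Boolean = { println("cheap"); x > 0 }
def costly(x: Double): Boolean = { println("costly"); math.sqrt(math.abs(x)) < 10 }

cheap(-1.0) & costly(-1.0)   // prints "cheap" then "costly": both sides are always evaluated
cheap(-1.0) && costly(-1.0)  // prints only "cheap": the right-hand side is skipped when the left is false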
In case you have access to, say, an Apache Spark cluster, consider converting your collections (if they're really large) to RDDs and computing with transformations similar to the above.
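For illustration, a minimal sketch of such an RDD version (the name locationTimeSpark is made up here), assuming an available SparkContext sc and reusing haversineDistance_single from above; it keys both collections by epoch so that only same-epoch pairs are ever compared:
import org.apache.spark.SparkContext

def locationTimeSpark(sc: SparkContext,
                      colMobile: Seq[Seq[Double]],
                      colLaptop: Seq[Seq[Double]]): Long = {
  // key each record by its epoch: (epoch, (lat, long))
  val mobile = sc.parallelize(colMobile).map(r => (r(2), (r(0), r(1))))
  val laptop = sc.parallelize(colLaptop).map(r => (r(2), (r(0), r(1))))

  laptop.join(mobile)                                                     // (epoch, (laptopPoint, mobilePoint))
    .filter { case (_, (l, m)) => haversineDistance_single(l, m) <= 0.5 }
    .map { case (epoch, (l, _)) => (epoch, l) }                           // one entry per matching laptop record
    .distinct()
    .count()
}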

Related

SCALA: Generating a list of Tuple2 objects meeting some criteria

I want to generate a list of Tuple2 objects. Each tuple (a, b) in the list should meet the conditions: a and b are both perfect squares, (b/30) < a < b,
and a > N and b > N (N can even be a BigInt).
I am trying to write a Scala function to generate the list of tuples meeting the above requirements.
This is my attempt. It works fine for Ints and Longs, but for BigInt I am facing a sqrt problem. Here is my approach:
scala> def genTups(N:Long) ={
| val x = for(s<- 1L to Math.sqrt(N).toLong) yield s*s;
| val y = x.combinations(2).map{ case Vector(a,b) => (a,b)}.toList
| y.filter(t=> (t._1*30/t._2)>=1)
| }
genTups: (N: Long)List[(Long, Long)]
scala> genTups(30)
res32: List[(Long, Long)] = List((1,4), (1,9), (1,16), (1,25), (4,9), (4,16), (4,25), (9,16), (9,25), (16,25))
I improved this using a BigInt square-root algorithm, as below:
def genTups(N1: BigInt, N2: BigInt) = {
  def sqt(n: BigInt): BigInt = {
    var a = BigInt(1)
    var b = (n >> 5) + BigInt(8)
    while ((b - a) >= 0) {
      var mid: BigInt = (a + b) >> 1
      if (mid * mid - n > 0) b = mid - 1
      else a = mid + 1
    }
    a - 1
  }
  val x = for (s <- sqt(N1) to sqt(N2)) yield s * s
  val y = x.combinations(2).map { case Vector(a, b) => (a, b) }.toList
  y.filter(t => (t._1 * 30 / t._2) >= 1)
}
I'd appreciate any help to improve my algorithm.
You can avoid sqrt in your algorithm by changing the way you calculate x to this:
val x = (BigInt(1) to N).map(x => x*x).takeWhile(_ <= N)
The final function is then:
def genTups(N: BigInt) = {
  val x = (BigInt(1) to N).map(x => x * x).takeWhile(_ <= N)
  val y = x.combinations(2).map { case Vector(a, b) if (a < b) => (a, b) }.toList
  y.filter(t => (t._1 * 30 / t._2) >= 1)
}
You can also re-write this as a single chain of operations like this:
def genTups(N: BigInt) =
(BigInt(1) to N)
.map(x => x * x)
.takeWhile(_ <= N)
.combinations(2)
.map { case Vector(a, b) if a < b => (a, b) }
.filter(t => (t._1 * 30 / t._2) >= 1)
.toList
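A quick check against the Long-based example earlier (the elements here are BigInts, shown the same way):
genTups(30)
// List((1,4), (1,9), (1,16), (1,25), (4,9), (4,16), (4,25), (9,16), (9,25), (16,25))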
In a quest for performance, I came up with this recursive version that appears to be significantly faster
def genTups(N1: BigInt, N2: BigInt) = {
  def sqt(n: BigInt): BigInt = {
    var a = BigInt(1)
    var b = (n >> 5) + BigInt(8)
    while ((b - a) >= 0) {
      var mid: BigInt = (a + b) >> 1
      if (mid * mid - n > 0) {
        b = mid - 1
      } else {
        a = mid + 1
      }
    }
    a - 1
  }

  @annotation.tailrec
  def loop(a: BigInt, rem: List[BigInt], res: List[(BigInt, BigInt)]): List[(BigInt, BigInt)] =
    rem match {
      case Nil => res
      case head :: tail =>
        val a30 = a * 30
        val thisRes = rem.takeWhile(_ <= a30).map(b => (a, b))
        loop(head, tail, thisRes.reverse ::: res)
    }

  val squares = (sqt(N1) to sqt(N2)).map(s => s * s).toList
  loop(squares.head, squares.tail, Nil).reverse
}
Each recursion of the loop adds all the matching pairs for a given value of a. The result is built in reverse because adding to the front of a long list is much faster than adding to the tail.
First, create a function to check whether a number is a perfect square or not.
def squareRootOfPerfectSquare(a: Int): Option[Int] = {
  val sqrt = math.sqrt(a)
  if (sqrt % 1 == 0)
    Some(sqrt.toInt)
  else
    None
}
Then create another function that will calculate the list of tuples according to the conditions mentioned above.
def generateTuples(n1: Int, n2: Int) = {
  for {
    b <- 1 to n2
    a <- 1 to n1 if b > a && squareRootOfPerfectSquare(b).isDefined && squareRootOfPerfectSquare(a).isDefined
  } yield (a, b)
}
Then, calling the function with parameters generateTuples(5,10), you will get the output:
res0: scala.collection.immutable.IndexedSeq[(Int, Int)] = Vector((1,4), (1,9), (4,9))
Hope that helps !!!

area under the curve programmatically in scala

I'm trying to solve for the area under the curve in Example 1 of: http://tutorial.math.lamar.edu/Classes/CalcI/AreaProblem.aspx
f(x) = x^3 - 5x^2 + 6x + 5 and the x-axis on [0, 4], with n = 5
The answer says it is: 25.12
but I'm getting slightly less: 23.78880035448074
What am I doing wrong?
Here's my code:
import scala.math.BigDecimal.RoundingMode
def summation(low: Int, up: Int, coe: List[Int], ex: List[Int]) = {
  def eva(coe: List[Int], ex: List[Int], x: Double) = {
    (for (i <- 0 until coe.size) yield coe(i) * math.pow(x, ex(i))).sum
  }
  @annotation.tailrec
  def build_points(del: Float, p: Int, xs: List[BigDecimal]): List[BigDecimal] = {
    if (p <= 0) xs map { x => x.setScale(3, RoundingMode.HALF_EVEN) }
    else build_points(del, p - 1, ((del * p): BigDecimal) :: xs)
  }
  val sub = 5
  val diff = (up - low).toFloat
  val deltaX = diff / sub
  val points = build_points(deltaX, sub, List(0.0f)); println(points)
  val middle_points =
    (for (i <- 0 until points.size - 1) yield (points(i) + points(i + 1)) / 2)
  (for (elem <- middle_points) yield deltaX * eva(coe, ex, elem.toDouble)).sum
}
val coe = List(1,-5,6,5)
val exp = List(3,2,1,0)
print(summation(0,4,coe,exp))
I'm guessing the problem is that build_points prepends each new point onto the seed list, so the 0.0 you pass in via List(0.0f) ends up at the tail: the points come out as List(0.800, 1.600, 2.400, 3.200, 4.000, 0.000) instead of List(0.000, 0.800, 1.600, 2.400, 3.200, 4.000). The last "midpoint" is then (4.0 + 0.0) / 2 = 2.0, and the true midpoint 0.4 of the first subinterval is never evaluated, which is why your sum comes out low. You need the six partition points in ascending order with 0.0 at the front, for example by recursing down to p = 0 and seeding with an empty list instead of List(0.0f).
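A minimal sketch of that fix, keeping everything else in summation as written; only the stopping condition and the seed list change:
@annotation.tailrec
def build_points(del: Float, p: Int, xs: List[BigDecimal]): List[BigDecimal] = {
  if (p < 0) xs map { x => x.setScale(3, RoundingMode.HALF_EVEN) }  // stop only after the p = 0 point has been prepended
  else build_points(del, p - 1, ((del * p): BigDecimal) :: xs)
}

val points = build_points(deltaX, sub, Nil)
// List(0.000, 0.800, 1.600, 2.400, 3.200, 4.000)
With the six points in ascending order the midpoints become 0.4, 1.2, 2.0, 2.8 and 3.6, and summation(0, 4, coe, exp) evaluates to roughly 25.12, matching the expected answer.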

How to avoid for loop with Spark?

I'm new to Spark and don't understand how the MapReduce mechanism works with Spark. I have one CSV file containing only doubles. What I want is to perform an operation (compute the Euclidean distance) between the first vector and the rest of the RDD, then iterate with the other vectors. Is there another way than this one? Maybe by using the cartesian product wisely...
val rdd = sc.parallelize(Array((1, Vectors.dense(1, 2)), (2, Vectors.dense(3, 4)), ...))
val array_vects = rdd.collect
val size = rdd.count
val emptyArray = Array((0, Vectors.dense(0))).tail
var rdd_rez = sc.parallelize(emptyArray)
for (ind <- 0 to size - 1) {
  val vector = array_vects(ind)._2
  val rest = rdd.filter(x => x._1 != ind)
  val rdd_dist = rest.map(x => (x._1, Vectors.sqdist(x._2, vector)))
  rdd_rez = rdd_rez ++ rdd_dist
}
Thank you for your support.
The distances (between all pairs of vectors) can be calculated using rdd.cartesian:
val rdd = sc.parallelize(Array((1,Vectors.dense(1,2)),
(2,Vectors.dense(3,4)),...))
val product = rdd.cartesian(rdd)
val result = product.filter{ case ((a, b), (c, d)) => a != c }
.map { case ((a, b), (c, d)) =>
(a, Vectors.sqdist(b, d)) }
I don't see why you were trying to do something like that. You can simply do it as follows.
val initialArray = Array((1, Vectors.dense(1, 2)), (2, Vectors.dense(3, 4)), ...)
val firstVector = initialArray(0)._2
val initialRdd = sc.parallelize(initialArray)
val euclideanRdd = initialRdd.map { case (i, vec) => (i, euclidean(firstVector, vec)) }
where we define a function euclidean that takes two dense vectors and returns the Euclidean distance between them.
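The euclidean helper itself is not shown; a minimal sketch, assuming the org.apache.spark.mllib.linalg vectors already used above:
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Euclidean distance as the square root of Spark's built-in squared distance
def euclidean(a: Vector, b: Vector): Double = math.sqrt(Vectors.sqdist(a, b))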

Efficient Scala idiomatic way to pick top 85 percent of sorted values?

Given a non-increasing list of numbers, I want to pick top 85% of values in the list. Here is how I am currently doing it.
scala> val a = Array(8.60, 6.85, 4.91, 3.45, 2.74, 2.06, 1.53, 0.35, 0.28, 0.12)
a: Array[Double] = Array(8.6, 6.85, 4.91, 3.45, 2.74, 2.06, 1.53, 0.35, 0.28, 0.12)
scala> val threshold = a.sum * 0.85
threshold: Double = 26.2565
scala> val successiveSums = a.tail.foldLeft(Array[Double](a.head)){ case (x,y) => x ++ Array(y + x.last) }
successiveSums: Array[Double] = Array(8.6, 15.45, 20.36, 23.81, 26.549999999999997, 28.609999999999996, 30.139999999999997, 30.49, 30.77, 30.89)
scala> successiveSums.takeWhile( x => x <= threshold )
res40: Array[Double] = Array(8.6, 15.45, 20.36, 23.81)
scala> val size = successiveSums.takeWhile( x => x <= threshold ).size
size: Int = 4
scala> a.take(size)
res41: Array[Double] = Array(8.6, 6.85, 4.91, 3.45)
I want to improve its
performance
code size
Any suggestions?
On code size, consider this one-liner:
a.take( a.scanLeft(0.0)(_+_).takeWhile( _ <= a.sum * 0.85 ).size - 1 )
Here scanLeft accumulates the running sums, starting from 0.0.
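For the example array above this produces (values rounded for readability):
a.scanLeft(0.0)(_ + _)
// Array(0.0, 8.6, 15.45, 20.36, 23.81, 26.55, 28.61, 30.14, 30.49, 30.77, 30.89)
// takeWhile(_ <= 26.2565) keeps the first five entries, so size - 1 = 4 and a.take(4) is returned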
On performance, naming the intermediate values may help avoid recomputing the same operations, namely
val threshold = a.sum * 0.85
val size = a.scanLeft(0.0)(_+_).takeWhile( _ <= threshold ).size - 1
a.take( size )
There is some space for improvement in elm's answer:
1) You don't need to compute the sum twice.
2) You can avoid creating an additional collection with the takeWhile method and use indexWhere instead.
val sums = a.scanLeft(0.0)(_ + _)
a.take(sums.indexWhere(_ > sums.last * 0.85) - 1)
There's no library method that will do exactly what you want. Generally if you want something that performs well, you'd use a tail-recursive method both to find the sum and to find the point where the 85th percentile of the total sum is crossed. Something like
def threshold(
  xs: Array[Double], thresh: Double,
  i: Int = 0, sum: Double = 0
): Array[Double] = {
  val next = sum + xs(i)
  if (next > thresh) xs.take(i)
  else threshold(xs, thresh, i + 1, next)
}
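A quick check against the example array, using the signature above:
threshold(a, a.sum * 0.85)
// Array(8.6, 6.85, 4.91, 3.45)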
Well in this case I would make a little use of mutable state. See the code below:
val a = Array(8.60, 6.85, 4.91, 3.45, 2.74, 2.06, 1.53, 0.35, 0.28, 0.12)
def f(a: Array[Double]) = {
val toGet = a.sum * 0.85
var sum = 0.0
a.takeWhile(x => {sum += x; sum <= toGet })
}
println(f(a).deep) //Array(8.6, 6.85, 4.91, 3.45)
In my opinion it's acceptable because the mutation is local and the function f has no externally visible side effects.

Scala: apply Map to a list of tuples

very simple question: I want to do something like this:
var arr1: Array[Double] = ...
var arr2: Array[Double] = ...
var arr3: Array[(Double,Double)] = arr1.zip(arr2)
arr3.foreach(x => {if (x._1 > treshold) {x._2 = x._2 * factor}})
I tried a lot of different syntax versions, but I failed with all of them. How can I solve this? It cannot be very difficult...
Thanks!
There are multiple approaches to solve this. Consider for instance the use of collect, which delivers a new collection arr4, as follows:
val arr4 = arr3.collect {
case (x, y) if x > threshold => (x ,y * factor)
case v => v
}
Or with a for comprehension like this:
for ((x, y) <- arr3)
yield (x, if (x > threshold) y * factor else y)
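For instance, with hypothetical values threshold = 1.1 and factor = 2.0 and a small arr3:
val threshold = 1.1
val factor = 2.0
val arr3 = Array((1.1, 1.1), (1.2, 1.2))

for ((x, y) <- arr3)
  yield (x, if (x > threshold) y * factor else y)
// Array((1.1,1.1), (1.2,2.4))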
I think you want to do something like
scala> val arr1 = Array(1.1, 1.2)
arr1: Array[Double] = Array(1.1, 1.2)
scala> val arr2 = Array(1.1, 1.2)
arr2: Array[Double] = Array(1.1, 1.2)
scala> val arr3 = arr1.zip(arr2)
arr3: Array[(Double, Double)] = Array((1.1,1.1), (1.2,1.2))
scala> arr3.filter(_._1> 1.1).map(_._2*2)
res0: Array[Double] = Array(2.4)
I think there are two problems:
You're using foreach, which returns Unit, where you want to use map, which returns an Array[B].
You're trying to update an immutable value, when you want to return a new, updated value. This is the difference between _._2 = _._2 * factor and _._2 * factor.
To filter the values not meeting the threshold:
arr1.zip(arr2).filter(_._1 > threshold).map(_._2 * factor)
To keep all values, but only multiply the ones meeting the threshold:
arr1.zip(arr2).map {
case (x, y) if x > threshold => y * factor
case (_, y) => y
}
You can do it with this,
arr3.map(x => if (x._1 > threshold) (x._1, x._2 * factor) else x)
How about this?
arr3.map { case(x1, x2) => // extract first and second value
if (x1 > treshold) (x1, x2 * factor) // if first value is greater than threshold, 'change' x2
else (x1, x2) // otherwise leave it as it is
}.toMap
Scala is generally functional, which means you do not change values but create new ones. For example, you do not write x._2 = …, since a tuple is immutable (you can't change it); instead you create a new tuple.
This will do what you need.
arr3.map(x => if(x._1 > treshold) (x._1, x._2 * factor) else x)
The key here is that you can return a tuple from the map lambda expression by putting the two values into (..).
Edit: If you want to change every element of the array in place, without creating a new array, then you need to write the new tuples back into it:
arr3.indices.foreach(i => if (arr3(i)._1 > treshold) arr3(i) = (arr3(i)._1, arr3(i)._2 * factor))