Failure to add keys to map in parallel - scala

I have the following code:
var res: GenMap[Point, GenSeq[Point]] = points.par.groupBy(point => findClosest(point, means))
means.par.foreach(mean => if(!res.contains(mean)) {
println("Map doesn't contain mean: " + mean)
res += mean -> GenSeq.empty[Point]
println("Map contains?: " + res.contains(mean))
})
That uses this case class:
case class Point(val x: Double, val y: Double, val z: Double)
Basically, the code groups the Point elements in points around the Point elements in means. The algorithm itself is not very important though.
My problem is that I am getting the following output:
Map doesn't contain mean: (0.44, 0.59, 0.73)
Map doesn't contain mean: (0.44, 0.59, 0.73)
Map doesn't contain mean: (0.1, 0.11, 0.11)
Map doesn't contain mean: (0.1, 0.11, 0.11)
Map contains?: true
Map contains?: true
Map contains?: false
Map contains?: true
Why would I ever get this?
Map contains?: false
I am checking if a key is in the res map. If it is not, then I'm adding it.
So how can it not be present in the map?
Is there an issue with parallelism?

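Your code has a race condition on this line:
res += mean -> GenSeq.empty[Point]
More than one thread reassigns res concurrently, so some entries can be lost: each thread reads the current map, builds a new map with its own key added, and writes it back, so a later write can overwrite an earlier one.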
This code solves the problem:
val closest = points.par.groupBy(point => findClosest(point, means))
val res = means.foldLeft(closest) {
case (map, mean) =>
if(map.contains(mean))
map
else
map + (mean -> GenSeq.empty[Point])
}
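The same fold can be sketched on plain immutable collections, which makes it easy to try without the parallel collections involved. This is only an illustration: FillMissingKeys and withEmptyGroups are hypothetical names, and the key and value types are left generic rather than tied to Point.

```scala
object FillMissingKeys {
  // Returns `grouped` plus an empty group for every key in `keys` that is missing.
  // The fold threads a single immutable map through the iteration, so no shared
  // variable is reassigned from several threads at once.
  def withEmptyGroups[K, V](grouped: Map[K, Seq[V]], keys: Seq[K]): Map[K, Seq[V]] =
    keys.foldLeft(grouped) { (acc, key) =>
      if (acc.contains(key)) acc else acc + (key -> Seq.empty[V])
    }
}
```

The expensive part (the groupBy) can still run in parallel; only the cheap step of adding the missing keys is sequential.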

Processing a point changes means and the result is sensitive to processing order, so the algorithm doesn't lend itself to parallel execution. If parallel execution is important enough to allow a change of algorithm, then it might be possible to find an algorithm that can be applied in parallel.
Using a known set of grouping points, such as grid square centres, means that the points can be allocated to their grouping points in parallel and grouped by their grouping points in parallel:
import scala.annotation.tailrec
import scala.collection.parallel.ParMap
import scala.collection.{GenMap, GenSeq, Map}
import scala.math._
import scala.util.Random
class ParallelPoint {
val rng = new Random(0)
val groups: Map[Point, Point] = (for {
i <- 0 to 100
j <- 0 to 100
k <- 0 to 100
}
yield {
val p = Point(10.0 * i, 10.0 * j, 10.0 * k)
p -> p
}
).toMap
val points: Array[Point] = (1 to 10000000).map(aaa => Point(rng.nextDouble() * 1000.0, rng.nextDouble() * 1000.0, rng.nextDouble() * 1000.0)).toArray
def findClosest(point: Point, groups: GenMap[Point, Point]): (Point, Point) = {
val x: Double = rint(point.x / 10.0) * 10.0
val y: Double = rint(point.y / 10.0) * 10.0
val z: Double = rint(point.z / 10.0) * 10.0
val mean: Point = groups(Point(x, y, z)) //.getOrElse(throw new Exception(s"$point out of range of mean ($x, $y, $z).") )
(mean, point)
}
@tailrec
private def total(points: GenSeq[Point]): Option[Point] = {
points.size match {
case 0 => None
case 1 => Some(points(0))
case _ => total((points(0) + points(1)) +: points.drop(2))
}
}
def mean(points: GenSeq[Point]): Option[Point] = {
total(points) match {
case None => None
case Some(p) => Some(p / points.size)
}
}
val startTime = System.currentTimeMillis()
println("starting test ...")
val res: ParMap[Point, GenSeq[Point]] = points.par.map(p => findClosest(p, groups)).groupBy(pp => pp._1).map(kv => kv._1 -> kv._2.map(v => v._2))
val groupTime = System.currentTimeMillis()
println(s"... grouped result after ${groupTime - startTime}ms ...")
points.par.foreach(p => if (! res(findClosest(p, groups)._1).exists(_ == p)) println(s"point $p not found"))
val checkTime = System.currentTimeMillis()
println(s"... checked grouped result after ${checkTime - startTime}ms ...")
val means: ParMap[Point, GenSeq[Point]] = res.map{ kv => mean(kv._2).get -> kv._2 }
val meansTime = System.currentTimeMillis()
println(s"... means calculated after ${meansTime - startTime}ms.")
}
object ParallelPoint {
def main(args: Array[String]): Unit = new ParallelPoint()
}
case class Point(x: Double, y: Double, z: Double) {
def +(that: Point): Point = {
Point(this.x + that.x, this.y + that.y, this.z + that.z)
}
def /(scale: Double): Point = Point(x/ scale, y / scale, z / scale)
}
The last step replaces the grouping point with the calculated mean of the grouped points as the map key. This processes 10 million points in about 30 seconds on my 2011 MBP.
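The reason this version parallelises is that the grouping point depends only on the point itself. A minimal sketch of that snapping step, using tuples instead of Point to stay self-contained (GridSnap, snap and snapPoint are hypothetical names, and cell is the assumed grid spacing):

```scala
object GridSnap {
  // Snaps a coordinate to the nearest multiple of `cell` (the grid spacing).
  // math.rint rounds exact halves to the even neighbour, so with cell = 10.0
  // the value 25.0 snaps to 20.0, not 30.0.
  def snap(v: Double, cell: Double): Double = math.rint(v / cell) * cell

  // Snaps each coordinate independently; no shared state is read or written.
  def snapPoint(p: (Double, Double, Double), cell: Double): (Double, Double, Double) =
    (snap(p._1, cell), snap(p._2, cell), snap(p._3, cell))
}
```

Because snapPoint is a pure function, it can be mapped over a parallel collection in any order without affecting the result.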

Related

Bubble sort of random integers in scala

I'm new to the Scala programming language. In this bubble sort I need to generate 10 random integers instead of writing them out by hand as in the code below.
Any suggestions?
object BubbleSort {
def bubbleSort(array: Array[Int]) = {
def bubbleSortRecursive(array: Array[Int], current: Int, to: Int): Array[Int] = {
println(array.mkString(",") + " current -> " + current + ", to -> " + to)
to match {
case 0 => array
case _ if(to == current) => bubbleSortRecursive(array, 0, to - 1)
case _ =>
if (array(current) > array(current + 1)) {
var temp = array(current + 1)
array(current + 1) = array(current)
array(current) = temp
}
bubbleSortRecursive(array, current + 1, to)
}
}
bubbleSortRecursive(array, 0, array.size - 1)
}
def main(args: Array[String]) {
val sortedArray = bubbleSort(Array(10,9,11,5,2))
println("Sorted Array -> " + sortedArray.mkString(","))
}
}
Try this:
import scala.util.Random
val sortedArray = (1 to 10).map(_ => Random.nextInt).toArray
You can use scala.util.Random to generate the values. The nextInt(n) method takes an exclusive upper bound, so the code sample below generates a sequence of 10 Int values from 0 to 99.
val r = scala.util.Random
for (i <- 1 to 10) yield r.nextInt(100)
You can find more info here or here
You can use it this way.
val solv1 = Random.shuffle( (1 to 100).toList).take(10)
val solv2 = Array.fill(10)(Random.nextInt)
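If reproducible input is useful while testing the sort, a seeded generator is one option. A small sketch, where RandomInput, randomInts and the seed parameter are names made up for this example:

```scala
import scala.util.Random

object RandomInput {
  // Returns `n` pseudo-random ints in the range [0, bound).
  // The same seed always produces the same array, which helps when debugging.
  def randomInts(n: Int, bound: Int, seed: Long): Array[Int] = {
    val rng = new Random(seed)
    Array.fill(n)(rng.nextInt(bound))
  }
}
```

The result could then be passed to the question's function, for example bubbleSort(RandomInput.randomInts(10, 100, 42L)).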

Is there any way I can rewrite this line of code in Scala?

I am trying to rewrite this line of Scala + Figaro code using my own function sum_, but I get errors.
val sum = Container(vars:_*).reduce(_+_)
It uses the reduce() method to calculate the sum. I want to rewrite this line, but I get errors because the elements have type Chain[Double, Int]:
import com.cra.figaro.language._
import com.cra.figaro.library.atomic.continuous.Uniform
import com.cra.figaro.language.{Element, Chain, Apply}
import com.cra.figaro.library.collection.Container
object sum {
def sum_(arr: Int*) :Int={
var i=0
var sum: Int =0
while (i < arr.length) {
sum += arr(i)
i += 1
}
return sum
}
def fillarray(): Int = {
scala.util.Random.nextInt(10) match{
case 0 | 1 | 2 => 3
case 3 | 4 | 5 | 6 => 4
case _ => 5
}
}
def main(args: Array[String]) {
val par = Array.fill(18)(fillarray())
val skill = Uniform(0.0, 8.0/13.0)
val shots = Array.tabulate(18)((hole: Int) => Chain(skill, (s:Double) =>
Select(s/8.0 -> (par(hole)-2),
s/2.0 -> (par(hole)-1),
s -> par(hole),
(4.0/5.0) * (1.0 - (13.0 * s)/8.0)-> (par(hole)+1),
(1.0/5.0) * (1.0 - (13.0 * s)/8.0) -> (par(hole)+2))))
val vars = for { i <- 0 until 18} yield shots(i)
//this line I want to rewrite
val sum1 = Container(vars:_*).reduce(_+_)
//My idea was to implement in this way the line above
val sum2 = sum_(vars)
}
}
If you want to use your function, you can do it like this:
val sum2 = sum_(vars.map(chain => chain.generateValue()):_*)
or
val sum2 = sum_(vars.map(_.generateValue()):_*)
but I'd recommend digging deeper into your library and the functional paradigm.
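The : _* in those calls is what lets a sequence be passed to a varargs parameter. A plain-Scala sketch without Figaro (VarargsSum, scores and total are hypothetical names used only for illustration):

```scala
object VarargsSum {
  // A varargs parameter is seen as a Seq[Int] inside the method.
  def sum_(xs: Int*): Int = xs.sum

  val scores: Seq[Int] = Vector(3, 4, 5)

  // Passing `scores` directly would not compile: a Seq[Int] is not an Int.
  // The `: _*` ascription expands the sequence into individual arguments.
  val total: Int = sum_(scores: _*)
}
```

This only works once the elements really are Int values, which is why the calls above first map each element to a plain value.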

Converting a For Loop to a Fold

I have to analyze an email corpus to see how many individual sentences are dominated by leet speak (e.g. lol, brb).
For each sentence I am doing the following:
val words = sentence.split(" ")
for (word <- words) {
if (validWords.contains(word)) {
score += 1
} else if (leetWords.contains(word)) {
score -= 1
}
}
Is there a better way to calculate the scores using Fold?
Not a great deal different, but another option.
val words = List("one", "two", "three")
val valid = List("one", "two")
val leet = List("three")
def check(valid: List[String], invalid: List[String])(words:List[String]): Int = words.foldLeft(0){
case (x, word) if valid.contains(word) => x + 1
case (x, word) if invalid.contains(word) => x - 1
case (x, _ ) => x
}
val checkValidOrLeet = check(valid, leet)(_)
val count = checkValidOrLeet(words)
If not limited to fold, using sum would be more concise.
sentence.split(" ")
.iterator
.map(word =>
if (validWords.contains(word)) 1
else if (leetWords.contains(word)) -1
else 0
).sum
Here's a way to do it with fold and partial application. Could still be more elegant, I'll continue to think on it.
val sentence = // ...your data....
val validWords = // ... your valid words...
val leetWords = // ... your leet words...
def checkWord(goodList: List[String], badList: List[String])(c: Int, w: String): Int = {
if (goodList.contains(w)) c + 1
else if (badList.contains(w)) c - 1
else c
}
val count = sentence.split(" ").foldLeft(0)(checkWord(validWords, leetWords))
print(count)
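A variant of the same fold, assuming the word lists can be held as Set[String], which makes each lookup a hash lookup instead of a linear scan. LeetScore and score are hypothetical names:

```scala
object LeetScore {
  // Folds over the words of a sentence: +1 for a valid word, -1 for a leet word,
  // unchanged for anything else.
  def score(sentence: String, validWords: Set[String], leetWords: Set[String]): Int =
    sentence.split("\\s+").foldLeft(0) { (acc, word) =>
      if (validWords(word)) acc + 1
      else if (leetWords(word)) acc - 1
      else acc
    }
}
```

Splitting on "\\s+" rather than " " is a small design choice here: it treats runs of whitespace as a single separator.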

I have an array of array and passing it to a function in scala but getting errors

I need to implement a distance search. My input is a CSV that looks like this:
Property_ID, lat, lon
123, 33.84, -118.39
234, 35.89, -119.48
345, 35.34, -119.39
I have a haversine function which takes two coordinates, (lat1, lon1) and (lat2, lon2), and returns the distance. Let's say:
val Distance: Double = haversine(x1:Double, x2:Double, y1:Double, y2:Double)
I need to find the distance between every pair of properties, so the output will look like this:
Property_ID1, Property_ID2, distance
123,123,0
123,234,0.1
123,345,0.6
234,234,0
234,123,0.1
234,345,0.7
345,345,0
345,123,0.6
345,234,0.7
How can I implement this in Scala?
import math._
object Haversine {
val R = 6372.8 //radius in km
def haversine(lat1:Double, lon1:Double, lat2:Double, lon2:Double)={
val dLat=(lat2 - lat1).toRadians
val dLon=(lon2 - lon1).toRadians
val a = pow(sin(dLat/2),2) + pow(sin(dLon/2),2) * cos(lat1.toRadians) * cos(lat2.toRadians)
val c = 2 * asin(sqrt(a))
R * c
}
def main(args: Array[String]): Unit = {
println(haversine(36.12, -86.67, 33.94, -118.40))
}
}
class SimpleCSVHeader(header:Array[String]) extends Serializable {
val index = header.zipWithIndex.toMap
def apply(array:Array[String], key:String):String = array(index(key))
}
val lat1=33.84
val lon1=-118.39
val csv = sc.textFile("file.csv")
val data = csv.map(line => line.split(",").map(elem => elem.trim))
val header = new SimpleCSVHeader(data.take(1)(0)) // we build our header with the first line
val rows = data.filter(line => header(line,"lat") != "lat") // filter the header out
// I will do the looping for all properties here but I am trying to get the map function right for one property at least
val distances = rows.map(x => haversine(x.take(1)(0).toDouble,x.take(1)(1).toDouble, lat1,lon1))
Now this should give me the distances for all the properties from (lat1, lon1). I know it's not right but I am not able to think from here.
I'd try to break it down into steps. Given data like:
val rows = List(Array("123", "33.84", "-118.39"),
Array("234", "35.89", "-119.48"),
Array("345", "35.34", "-119.39"))
First convert the types:
val typed = rows.map{ case Array(id, lat, lon) => (id, lat.toDouble, lon.toDouble)}
Then generate the combinations:
val combos = for {
a <- typed
b <- typed
} yield (a,b)
Then generate an output line for each combination:
combos.map{ case ((id1, lat1, lon1), (id2, lat2, lon2))
=> id1 + "," + id2 + "," + haversine(lat1, lon1, lat2, lon2)} foreach println
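Put together, the steps above might look like the following self-contained sketch on local collections. PairwiseDistances and pairwise are hypothetical names; haversine is copied from the question, and the header line is assumed to have been removed already.

```scala
import scala.math.{asin, cos, pow, sin, sqrt}

object PairwiseDistances {
  val R = 6372.8 // Earth radius in km, as in the question

  def haversine(lat1: Double, lon1: Double, lat2: Double, lon2: Double): Double = {
    val dLat = (lat2 - lat1).toRadians
    val dLon = (lon2 - lon1).toRadians
    val a = pow(sin(dLat / 2), 2) +
      pow(sin(dLon / 2), 2) * cos(lat1.toRadians) * cos(lat2.toRadians)
    2 * R * asin(sqrt(a))
  }

  // Parses "id, lat, lon" lines and yields one "id1,id2,distance" line
  // for every ordered pair of properties, including a property with itself.
  def pairwise(lines: Seq[String]): Seq[String] = {
    val typed = lines.map(_.split(",").map(_.trim)).collect {
      case Array(id, lat, lon) => (id, lat.toDouble, lon.toDouble)
    }
    for {
      (id1, lat1, lon1) <- typed
      (id2, lat2, lon2) <- typed
    } yield s"$id1,$id2,${haversine(lat1, lon1, lat2, lon2)}"
  }
}
```

On a Spark RDD the pairing step would need a different mechanism, since a for-comprehension over two RDDs does not work; the RDD cartesian method is one candidate, though it grows quadratically with the number of rows.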

How to get SSSP actual path by apache spark graphX?

I ran the single source shortest path (SSSP) example from the Spark site as follows:
graphx-SSSP pregel example
Code(scala):
object Pregel_SSSP {
def main(args: Array[String]) {
val sc = new SparkContext("local", "Allen Pregel Test", System.getenv("SPARK_HOME"), SparkContext.jarOfClass(this.getClass))
// A graph with edge attributes containing distances
val graph: Graph[Int, Double] =
GraphGenerators.logNormalGraph(sc, numVertices = 5).mapEdges(e => e.attr.toDouble)
graph.edges.foreach(println)
val sourceId: VertexId = 0 // The ultimate source
// Initialize the graph such that all vertices except the root have distance infinity.
val initialGraph = graph.mapVertices((id, _) => if (id == sourceId) 0.0 else Double.PositiveInfinity)
val sssp = initialGraph.pregel(Double.PositiveInfinity, Int.MaxValue, EdgeDirection.Out)(
// Vertex Program
(id, dist, newDist) => math.min(dist, newDist),
// Send Message
triplet => {
if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
} else {
Iterator.empty
}
},
//Merge Message
(a, b) => math.min(a, b))
println(sssp.vertices.collect.mkString("\n"))
}
}
sourceId: 0
Get the result:
(0,0.0)
(4,2.0)
(2,1.0)
(3,1.0)
(1,2.0)
But I need the actual paths, as follows:
=>
0 -> 0,0
0 -> 2,1
0 -> 3,1
0 -> 2 -> 4,2
0 -> 3 -> 1,2
How can I get the actual SSSP paths with Spark GraphX?
Can anybody give me a hint?
Thanks for your help!
You have to modify the algorithm so that it stores not only the shortest path length but also the actual path.
So instead of storing a Double as the vertex property, you should store a tuple: (Double, List[VertexId])
Maybe this code will be useful to you.
object Pregel_SSSP {
def main(args: Array[String]) {
val sc = new SparkContext("local", "Allen Pregel Test", System.getenv("SPARK_HOME"), SparkContext.jarOfClass(this.getClass))
// A graph with edge attributes containing distances
val graph: Graph[Int, Double] =
GraphGenerators.logNormalGraph(sc, numVertices = 5).mapEdges(e => e.attr.toDouble)
graph.edges.foreach(println)
val sourceId: VertexId = 0 // The ultimate source
// Initialize the graph such that all vertices except the root have distance infinity.
val initialGraph : Graph[(Double, List[VertexId]), Double] = graph.mapVertices((id, _) => if (id == sourceId) (0.0, List[VertexId](sourceId)) else (Double.PositiveInfinity, List[VertexId]()))
val sssp = initialGraph.pregel((Double.PositiveInfinity, List[VertexId]()), Int.MaxValue, EdgeDirection.Out)(
// Vertex Program
(id, dist, newDist) => if (dist._1 < newDist._1) dist else newDist,
// Send Message
triplet => {
if (triplet.srcAttr._1 < triplet.dstAttr._1 - triplet.attr ) {
Iterator((triplet.dstId, (triplet.srcAttr._1 + triplet.attr , triplet.srcAttr._2 :+ triplet.dstId)))
} else {
Iterator.empty
}
},
//Merge Message
(a, b) => if (a._1 < b._1) a else b)
println(sssp.vertices.collect.mkString("\n"))
}
}
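To see the bookkeeping in isolation, here is a sketch of the same idea without Spark. It is not GraphX code: it is a Bellman-Ford-style relaxation over a plain edge list, and PathSketch, State and shortestPaths are hypothetical names. Each vertex carries a (distance, path) pair, and a relaxation replaces both together.

```scala
object PathSketch {
  // vertex id -> (best known distance from the source, path that achieves it)
  type State = Map[Long, (Double, List[Long])]

  def shortestPaths(edges: Seq[(Long, Long, Double)], source: Long): State = {
    val vertexCount = edges.flatMap { case (s, d, _) => Seq(s, d) }.distinct.size

    // One pass over all edges: improve a destination whenever going through
    // the edge's source gives a strictly shorter distance.
    def relax(state: State): State =
      edges.foldLeft(state) { case (acc, (src, dst, w)) =>
        acc.get(src) match {
          case Some((d, path))
              if d + w < acc.get(dst).map(_._1).getOrElse(Double.PositiveInfinity) =>
            acc.updated(dst, (d + w, path :+ dst))
          case _ => acc
        }
      }

    val init: State = Map((source, (0.0, List(source))))
    // With non-negative weights, vertexCount - 1 passes are enough.
    (1 until math.max(vertexCount, 2)).foldLeft(init)((s, _) => relax(s))
  }
}
```

Vertices that cannot be reached from the source are simply absent from the result, whereas the Pregel version keeps them with an infinite distance and an empty path.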
This may be an outdated answer, but take a look at this solution: Find all paths in graph using Apache Spark