How can I avoid a for loop for a KNN search? - scala

My goal is to find the k nearest neighbours of each data point. I would like to avoid using a for loop with lookup and instead process every rdd_distance entry at once, but I can't figure out how to do this.
// parsedData: RDD[Object]
// Object has an id and a vector as attributes
// sqdist1 returns a Double
var rdd_distance = parsedData.cartesian(parsedData)
  .flatMap { case (x, y) =>
    if (x.get_id != y.get_id)
      Some((x.get_id, (y.get_id, sqdist1(x.get_vector, y.get_vector))))
    else None
  }
for (ind1 <- 1 to size) {
  val ind2 = ind1.toString
  val tab1 = rdd_distance.lookup(ind2)
  val rdd_knn0 = sc.parallelize(tab1)
  val tab_knn = rdd_knn0.takeOrdered(k)(Ordering[Double].on(x => x._2))
}
Is that possible without using a for loop with lookup?

This code answers your question (but it is inefficient when parsedData is large).
rdd_distance.groupByKey().map {
  case (x, iterable) =>
    x -> iterable.toSeq.sortBy(_._2).take(k)
}
So this is a more appropriate solution:
import org.apache.spark.mllib.rdd.MLPairRDDFunctions._
rdd_distance.topByKey(k)(Ordering.by(-_._2)) // because smaller is better.
Note that topByKey is included in Spark 1.4.0. If you use an earlier version, use this code instead: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala
The idea of topByKey is to use a BoundedPriorityQueue with aggregateByKey, which retains only the top k items.
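As an illustration of that idea, here is a rough sketch of the same thing with aggregateByKey directly, keeping a small sorted Seq of size k instead of Spark's internal BoundedPriorityQueue (it assumes the ids are Strings, as in your lookup code):
val knn = rdd_distance.aggregateByKey(Seq.empty[(String, Double)])(
  (acc, v) => (acc :+ v).sortBy(_._2).take(k), // fold one neighbour in, keep the k smallest distances
  (a, b) => (a ++ b).sortBy(_._2).take(k)      // merge per-partition results, keep the k smallest
)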

Related

Attempting to parallelize a nested loop in Scala

I am comparing two dataframes in Scala/Spark using a nested loop and an external jar.
for (nrow <- dfm.rdd.collect) {
  var mid = nrow.mkString(",").split(",")(0)
  var mfname = nrow.mkString(",").split(",")(1)
  var mlname = nrow.mkString(",").split(",")(2)
  var mlssn = nrow.mkString(",").split(",")(3)
  for (drow <- dfn.rdd.collect) {
    var nid = drow.mkString(",").split(",")(0)
    var nfname = drow.mkString(",").split(",")(1)
    var nlname = drow.mkString(",").split(",")(2)
    var nlssn = drow.mkString(",").split(",")(3)
    val fNameArray = Array(mfname, nfname)
    val lNameArray = Array(mlname, nlname)
    val ssnArray = Array(mlssn, nlssn)
    val fnamescore = Main.resultSet(fNameArray)
    val lnamescore = Main.resultSet(lNameArray)
    val ssnscore = Main.resultSet(ssnArray)
    val overallscore = (fnamescore + lnamescore + ssnscore) / 3
    if (overallscore >= .95) {
      println("MeditechID:".concat(mid)
        .concat(" MeditechFname:").concat(mfname)
        .concat(" MeditechLname:").concat(mlname)
        .concat(" MeditechSSN:").concat(mlssn)
        .concat(" NextGenID:").concat(nid)
        .concat(" NextGenFname:").concat(nfname)
        .concat(" NextGenLname:").concat(nlname)
        .concat(" NextGenSSN:").concat(nlssn)
        .concat(" FnameScore:").concat(fnamescore.toString)
        .concat(" LNameScore:").concat(lnamescore.toString)
        .concat(" SSNScore:").concat(ssnscore.toString)
        .concat(" OverallScore:").concat(overallscore.toString))
    }
  }
}
What I'm hoping to do is add some parallelism to the outer loop, so that I can create a thread pool of 5 and pull 5 records from the outer loop's collection to compare against the inner loop's collection, rather than doing this serially. The outcome would be that I can specify the number of threads and have 5 records from the outer loop's collection being processed at any given time against the collection in the inner loop. How would I go about doing this?
Let's start by analyzing what you are doing. You collect the data of dfm to the driver. Then, for each element you collect the data from dfn, transform it and compute a score for each pair of elements.
That's problematic in many ways. First, even without considering parallel computing, the transformations on the elements of dfn are performed as many times as dfm has elements. Also, you collect the data of dfn for every row of dfm. That's a lot of network communication (between the driver and the executors).
If you want to use Spark to parallelize your computations, you need to use the API (RDDs, SQL or Datasets). You seem to want to use RDDs to perform a cartesian product (this is O(N*M), so be careful, it may take a while).
Let's start by transforming the data before the cartesian product, so that the transformations are not performed more than once per element. Also, for clarity, let's define a case class to contain your data and a function that transforms your dataframes into RDDs of that case class.
case class X(id: String, fname: String, lname: String, lssn: String)

def toRDDofX(df: DataFrame) = {
  df.rdd.map(row => {
    // using pattern matching to convert the array to the case class X
    row.mkString(",").split(",") match {
      case Array(a, b, c, d) => X(a, b, c, d)
    }
  })
}
Then I use filter to keep only the tuples whose score is more than .95, but you could use map, foreach, etc. depending on what you intend to do.
val rddn = toRDDofX(dfn)
val rddm = toRDDofX(dfm)

rddn.cartesian(rddm).filter { case (xn, xm) =>
  val fNameArray = Array(xm.fname, xn.fname)
  val lNameArray = Array(xm.lname, xn.lname)
  val ssnArray = Array(xm.lssn, xn.lssn)
  val fnamescore = Main.resultSet(fNameArray)
  val lnamescore = Main.resultSet(lNameArray)
  val ssnscore = Main.resultSet(ssnArray)
  val overallscore = (fnamescore + lnamescore + ssnscore) / 3
  // and then, let's say we filter by score
  overallscore > .95
}
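If instead you want to reproduce the report lines of the original code rather than just filter, a sketch that keeps the scores with map before filtering would look like this (it assumes Main.resultSet is serializable, since the closure now runs on the executors):
val report = rddn.cartesian(rddm)
  .map { case (xn, xm) =>
    val fnamescore = Main.resultSet(Array(xm.fname, xn.fname))
    val lnamescore = Main.resultSet(Array(xm.lname, xn.lname))
    val ssnscore = Main.resultSet(Array(xm.lssn, xn.lssn))
    (xm, xn, (fnamescore + lnamescore + ssnscore) / 3)
  }
  .filter { case (_, _, overallscore) => overallscore >= .95 }
  .map { case (xm, xn, overallscore) =>
    s"MeditechID:${xm.id} NextGenID:${xn.id} OverallScore:$overallscore"
  }
report.collect().foreach(println) // collect only if the matching pairs fit in driver memory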
This is not the right way of iterating over a Spark dataframe. The major concern is the dfm.rdd.collect. If the dataframe is arbitrarily large, you would end up with an exception. This is due to the fact that the collect function essentially brings all the data to the driver node.
An alternative would be to use the foreach or map construct of the RDD.
dfm.rdd.foreach(x => {
  // your logic
})
Now you are trying to iterate over the second dataframe here. I am afraid that won't be possible. The elegant way is to join dfm and dfn and iterate over the resulting dataset to compute your function.
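A minimal sketch of that join idea, assuming Spark 2.1+ (for crossJoin), that the dataframes expose named columns fname, lname and ssn plus an id (the original code splits a comma-joined row instead), and that Main.resultSet returns a Double and is serializable so it can be wrapped in a UDF:
import org.apache.spark.sql.functions.{col, udf}

val score = udf((a: String, b: String) => Main.resultSet(Array(a, b)))

val matches = dfm.as("m").crossJoin(dfn.as("n"))
  .withColumn("overallscore",
    (score(col("m.fname"), col("n.fname")) +
     score(col("m.lname"), col("n.lname")) +
     score(col("m.ssn"), col("n.ssn"))) / 3)
  .filter(col("overallscore") >= 0.95)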

How can I use the last result from a scala map as input to the next function?

I'm working through some Project Euler questions to practice my Scala. For problem 7 I have to find the 10001st prime. I have a working solution, but I don't feel it's as functional as it could be.
def first_n_primes(n: Long): List[Long] = {
  var last_prime = 1L
  (1L to n).map(x => { last_prime = get_next_prime(x, last_prime); last_prime }).toList
}
Specifically, I feel there might be a way to get rid of the var last_prime, but I don't know how to use the result of the nth map evaluation as the input to the (n+1)th evaluation. How can I do this more functionally?
You are looking for scanLeft:
(1L to n).scanLeft(1L) { case (last, x) => get_next_prime(x, last) }
Or, if get_next_prime took its arguments in (lastPrime, x) order, just (1L to n).scanLeft(1L)(get_next_prime)
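To see how scanLeft threads each result into the next step, a tiny example:
(1 to 4).scanLeft(0)(_ + _) // Vector(0, 1, 3, 6, 10): each element is computed from the previous result
Note that the result also contains the seed value, so to get exactly n primes you would drop it with .tail before calling toList.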
Note however that this is not a very good algorithm for finding primes, because there is a lot of repetitive work that could be saved (to find the next prime, you need to re-discover all the previous ones).
This sort of task is better done in Scala with recursive streams:
lazy val primes: Stream[Long] = 2L #:: Stream.iterate(3L)(_ + 1).filter { n =>
  val stop = math.sqrt(n)
  primes.takeWhile { _ <= stop }.forall { k => n % k != 0 }
}
primes.take(n.toInt).toList // toInt because take expects an Int

Scala/functional way of doing things

I am using Scala to write a Spark application that reads data from CSV files using dataframes (none of these details really matter; my question can be answered by anyone who is good at functional programming).
I'm used to sequential programming and it's taking a while to think of things in the functional way.
I basically want to read two columns (a, b) from a CSV file and keep track of those rows where b < 0.
I implemented this, but it's pretty much how I would do it in Java, and I would like to utilize Scala's features instead:
val ValueDF = fileDataFrame.select("colA", "colB")
val ValueArr = ValueDF.collect()
for (index <- 0 until ValueArr.length) {
  var row = ValueArr(index)
  var A = row(0).toString()
  var B = row(1).toString().toDouble
  if (B < 0) {
    // write A and B somewhere
  }
}
Converting the dataframe to an array defeats the purpose of distributed computation.
So how could I get the same results, but instead of forming an array and traversing through it, perform transformations on the data frame itself (such as map/filter/flatMap etc.)?
I should get going soon hopefully, just need some examples to wrap my head around it.
You are basically doing a filtering operation (ignore the row if not B < 0) and a mapping (from each row, get A and B / do something with A and B).
You could write it like this:
val valueDF = fileDataFrame.select("colA", "colB")
val valueArr = valueDF.collect()
val result = valueArr.filter(_(1).toString().toDouble < 0).map{row => (row(0).toString(), row(1).toString().toDouble)}
// do something with result
You can also do the mapping first and then the filtering:
val result = valueArr.map{row => (row(0).toString(), row(1).toString().toDouble)}.filter(_._2 < 0)
Scala also offers more convenient versions of this kind of operation (thanks Sascha Kolberg), called withFilter and collect. withFilter has the advantage over filter that it doesn't create a new collection, saving you one pass; see this answer for more details. With collect you also map and filter in one pass, passing a partial function which allows you to do pattern matching; see e.g. this answer.
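For example, the filter/map version rewritten with withFilter (no intermediate array is built for the filtering step) would look like this:
val result = valueArr
  .withFilter(row => row(1).toString().toDouble < 0)
  .map(row => (row(0).toString(), row(1).toString().toDouble))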
In your case collect would look like this:
val valueDF = fileDataFrame.select("colA", "colB")
val valueArr = valueDF.collect()
val result = valueArr.collect {
  case row if row(1).toString().toDouble < 0 => (row(0).toString(), row(1).toString().toDouble)
}
// do something with result
(I think there's a more elegant way to express this but that's left as an exercise ;))
Also, there's a lightweight notation called "sequence comprehensions". With this you could write:
val result = for (row <- valueArr if row(1).toString().toDouble < 0) yield (row(0).toString(), row(1).toString().toDouble)
Or a more flexible variant:
val result = for {
  row <- valueArr
  double = row(1).toString().toDouble
  if double < 0
} yield (row(0).toString(), double)
Alternatively, you can use foldLeft:
val valueDF = fileDataFrame.select("colA", "colB")
val valueArr = valueDF.collect()
val result = valueArr.foldLeft(Seq[(String, Double)]()) { (s, row) =>
  val a = row(0).toString()
  val b = row(1).toString().toDouble
  if (b < 0) {
    s :+ ((a, b)) // append a tuple with A and B to the results sequence
  } else {
    s // leave the results sequence unmodified
  }
}
// do something with result
All of these are considered functional... which one you prefer is for the most part a matter of taste. The first 2 examples (filter/map, map/filter) do have a performance disadvantage compared to the rest because they iterate through the sequence twice.
Note that in FP it's very important to minimize side effects / isolate them from the main logic. I/O ("write A and B somewhere") is a side effect. So you normally will write your functions such that they don't have side effects - just input -> output logic without affecting or retrieving data from the surroundings. Once you have a final result, you can do side effects. In this concrete case, once you have result (which is a sequence of A and B tuples), you can loop through it and print it. This way you can for example change easily the way to print (you may want to print to the console, send to a remote place, etc.) without touching the main logic.
Also you should prefer immutable values (val) wherever possible, which is safer. Even in your loop, row, A and B are not modified so there's no reason to use var.
(Btw, I corrected the value names to start with lower case; see conventions.)
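Finally, since you mentioned wanting to transform the data frame itself rather than a collected array, the filtering can also be pushed into the dataframe before collecting. A sketch, assuming colB is numeric or castable to double:
import org.apache.spark.sql.functions.col

val result = fileDataFrame
  .select("colA", "colB")
  .filter(col("colB").cast("double") < 0)
  .collect()
This keeps the filtering distributed and only brings the matching rows to the driver.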

Iterate over middle third of map

I have a Scala Map[String, String] and I am comparing a part of the key to its value, but only for the middle third of the map. Since it is not easy to iterate using indices in a map, I came up with the following, but it does not work.
var i = 0
var j = 0
val mapSize = sortedMap.size / 3
for ((key, value) <- sortedMap) {
  j += 1
  if ((i < 3) && (key.split(' ').take(1).mkString == value) && (j > mapSize)) {
    Accuracy += 1
    i += 1
  }
}
You could use the method slice(from: Int, until: Int) and then only iterate over the middle third of the sorted map. Something like
val mapSize = sortedMap.size
for ((key, value) <- sortedMap.slice(mapSize/3, 2*mapSize/3)) {
...
}
Note that this is only reliable if the underlying map is sorted (as seems to be the case in your example). You also might have to adapt the index calculation a little bit, depending on what exactly you consider the middle third for maps whose size is not divisible by 3.
You could convert the map to a stream, modify the stream to remove the first third and the last third and then iterate through the remaining middle third.
val middle = sortedMap.toStream.drop(sortedMap.size / 3).dropRight(sortedMap.size / 3)
middle.foreach(println _) // replace println with your key test
For the key test, you could use a pattern match on the (key, value) pairs:
case (key, value) if key.split(' ')(0) == value => ...do something...
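Putting the two pieces together, and leaving out the i < 3 cap from your original loop, the accuracy count could be sketched as:
val third = sortedMap.size / 3
val accuracy = sortedMap.toStream
  .drop(third)
  .dropRight(third)
  .count { case (key, value) => key.split(' ')(0) == value }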

Succinct way of reading data from a file into an immutable 2-dimensional array in Scala

What I am looking for is a succinct way of ending up with an immutable two-dimensional array X and a one-dimensional array Y, without first scanning the file to find out the dimensions of the data.
The data, which consists of a header line followed by columnar double values, is in the following format:
X0, X1, X2, ...., Y
0.1, 1.2, -0.2, ..., 1.1
0.2, 0.5, 0.4, ..., -0.3
-0.5, 0.3, 0.3, ..., 0.1
I have the following code (so far) for getting lines from a file and tokenizing each comma-delimited line in order to get the samples. It currently doesn't fill in the X and Y arrays, nor does it assign num and dimx.
import scala.io.Source.fromFile

val X = Array.ofDim[Double](num, dimx)
val Y = new Array[Double](num)

def readDataFromFile(filename: String) {
  var firstTime = true
  val lines = fromFile(filename).getLines
  lines.foreach(line => {
    val tokens = line.split(",")
    if (firstTime) {
      tokens.foreach(token => ()) // get header titles and set dimx
      firstTime = false
    } else {
      println("data")
      tokens.foreach(token => ()) // blah, blah, blah...
    }
  })
}
Obviously this is an issue because, while I can detect and use dimx on-the-fly, I don't know num a priori. Also, the repeated tokens.foreach is not very elegant. I could first scan the file and determine the dimensions, but this seems like a nasty way to go. Is there a better way? Thanks in advance
There isn't anything built in that's going to tell you the size of your data. Why not have the method return your arrays instead of you declaring them outside? That way you can also handle error conditions better.
case class Hxy(headers: Array[String], x: Array[Array[Double]], y: Array[Double])

def readDataFromFile(name: String): Option[Hxy] = {
  val lines = io.Source.fromFile(name).getLines
  if (!lines.hasNext) None
  else {
    val header = lines.next.split(",").map(_.trim)
    try {
      val xy = lines.map(_.split(",").map(_.trim.toDouble)).toArray
      if (xy.exists(_.length != header.length)) None
      else Some(Hxy(header, xy.map(_.init), xy.map(_.last)))
    }
    catch { case nfe: NumberFormatException => None }
  }
}
Here, only if we have well-formed data do we get back the relevant arrays (helpfully packaged into a case class); otherwise, we get back None so we know that something went wrong.
(If you want to know why it didn't work, replace Option[Hxy] with something like Either[String,Hxy] and return Right(...) instead of Some(...) on success, Left(message) instead of None on failure.)
Edit: If you want the values (not just the array sizes) to be immutable, then you'd need to map everything to Vector somewhere along the way. I'd probably do it at the last step when you're placing the data into Hxy.
Array, as in Java, is mutable, so you can't have an immutable array; you need to choose between Array and immutability. One way to achieve your goal without foreaches and vars is similar to the following:
// simulate the lines for this example
val lines = List("X,Y,Z,","1,2,3","2,5.0,3.4")
val res = lines.map(_.split(",")).toArray
Use Array.newBuilder. I assume that the header has already been extracted.
val b = Array.newBuilder[Array[Double]]
lines.foreach { line => b += line.split(",").map(_.toDouble) }
val data = b.result
If you want to be immutable, take some immutable implementation of IndexedSeq (e.g. Vector) instead of Array; builders work on all collections.
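For instance, a sketch of a fully immutable variant with Vector, assuming as above that the header line has already been consumed from lines:
val data: Vector[Vector[Double]] = lines.map(_.split(",").map(_.trim.toDouble).toVector).toVector
val x = data.map(_.init) // every column except the last
val y = data.map(_.last) // the Y column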