MAX and MIN value of RDD[scala.collection.immutable.Map[String,Any]] - scala

In the following code I calculate the Euclidean distance from each document to its cluster centroid in a KMeans model.
The raw Euclidean distance does not mean much on its own, so I thought normalizing it to a scale from 0 to 1 would be better.
Unfortunately, I didn't figure out how to sort the org.apache.spark.rdd.RDD[scala.collection.immutable.Map[String,Any]] data type or how to get its maximum / minimum value.
In fact it is an RDD[Map[String,Double]], but I suppose it got converted to an RDD[Map[String,Any]] for some reason. Most approaches, e.g. takeOrdered, result in:
error: No implicit Ordering defined for scala.collection.immutable.Map[String,Any]
How can I teach Scala to sort the Any values of this Map?
Any hints are very much appreciated.
Thanks
val score = rdd.map { case (id, vector) => distToCentroid(id, vector, model_1) }

// Normalizing the data with the normalizeResult function.
// Problem: I need to find the max and min beforehand.

def distToCentroid(id: String, datum: Vector, model: KMeansModel) = {
  val cluster = model.predict(datum)
  val centroid = model.clusterCenters(cluster)
  val distance = math.sqrt(centroid.toArray.zip(datum.toArray).map(p => p._1 - p._2).map(d => d * d).sum)
  Map("id" -> id, "distance" -> distance)
}

def normalizeResult(max: Double, min: Double, x: Double) = {
  (x - min) / (max - min)
}

If I understand you right, you need the global min/max of the values stored inside the maps. If so, you can just flatten your RDD into an RDD[Double]:
// The maps hold a String id and a Double distance, so keep only the Double entries.
val values = rdd.flatMap(_.values.collect { case d: Double => d }).cache()
val min = values.min()
val max = values.max()
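With min and max in hand, the normalization itself is one more map over the distances. A minimal sketch, assuming the score RDD and normalizeResult from the question:

// Hypothetical follow-up: rescale each distance to [0, 1].
val normalized = score.map { m =>
  val d = m("distance").asInstanceOf[Double]   // Map[String, Any], so a cast is needed
  m.updated("distance", normalizeResult(max, min, d))
}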

The easiest way to do this would be to map the outputs directly into the correct format in the first instance.
def distToCentroid(id: String, datum: Vector, model: KMeansModel) = {
  val cluster = model.predict(datum)
  val centroid = model.clusterCenters(cluster)
  val distance = math.sqrt(centroid.toArray.zip(datum.toArray).map(p => p._1 - p._2).map(d => d * d).sum)
  // Updated output
  Map("id" -> id, "distance" -> distance.toDouble)
}
That should then allow you to either use the inbuilt min and max functions or use the function you have written.
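If you prefer the distances to stay strongly typed so that the built-in min and max work without any casting, one variation (my own sketch, not from the answers above; distToCentroid2 is an invented name, and rdd / model_1 are taken from the question) is to return a plain (id, distance) pair instead of a Map:

def distToCentroid2(id: String, datum: Vector, model: KMeansModel): (String, Double) = {
  val cluster = model.predict(datum)
  val centroid = model.clusterCenters(cluster)
  val distance = math.sqrt(centroid.toArray.zip(datum.toArray).map { case (a, b) => (a - b) * (a - b) }.sum)
  (id, distance)
}

val scores = rdd.map { case (id, vector) => distToCentroid2(id, vector, model_1) }
val distances = scores.values.cache()               // RDD[Double], so min()/max() work directly
val (minD, maxD) = (distances.min(), distances.max())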

Related

How to sum the number of Ints and the number of Floats within a List - Scala

I need to calculate the number of integers and floats I have in a Map of type Map[String, List[(Int, String, Float)]].
The data comes from reading a file; the content looks roughly like this, for example (there are a few more routes):
Cycle Route (City),1:City Centre :0.75f,2:Main Park :3.8f,3:Central Station:2.7f,4:Modern Art Museum,5:Garden Centre:2.4f,6:Music Centre:3.4f
The map is split so that the String is the name of the route and the List is the rest of the data.
I want to calculate the number of 'checkpoints' per route and the total distance of each route (the sum of the floats), then print out e.g. Oor Wullie Route has 6 checkpoints and total distance of 18.45km.
I am guessing I need to use a foldLeft, however I am unsure how to do so.
Here is an example of a simple fold I have done before, but I am not sure how to apply one to the above scenario:
val list1 = List.range(1, 20)

def sum(ls: List[Int]): Int = {
  ls.foldLeft(0) { _ + _ }
}
You could do this with a fold, but IMO it is unnecessary.
You know the number of checkpoints by simply taking the size of the list (assuming each entry in the list is a checkpoint).
To compute the total distance, you could do:
def getDistance(list: List[(Int, String, Float)]): Float =
  list
    .iterator   // mapping the iterator to avoid building an intermediate List instance
    .map(_._3)  // get the distance Float from the tuple
    .sum        // built-in method for collections of Numeric elements (e.g. Float)
And then get your printout like:
def summarize(routes: Map[String, List[(Int, String, Float)]]): Unit =
  for { (name, stops) <- routes } {
    val numStops = stops.size
    val distance = getDistance(stops)
    println(s"$name has $numStops stops and total distance of $distance km")
  }
If you really wanted to compute both numStops and distance via foldLeft, Luis's comment on your question is the way to do it.
edit - per Luis's request, putting his comment in here and cleaning it up a bit:
stops.foldLeft(0 -> 0.0f) {
  // note: "acc" is short for "accumulated"
  case ((accCount, accDistance), (_, _, distance)) =>
    (accCount + 1) -> (accDistance + distance)
}
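For a concrete check, here is a minimal sketch running that fold on an invented three-stop route (the data is made up for illustration):

val stops = List((1, "City Centre", 0.75f), (2, "Main Park", 3.8f), (3, "Central Station", 2.7f))

val (numStops, totalDistance) = stops.foldLeft(0 -> 0.0f) {
  case ((accCount, accDistance), (_, _, distance)) =>
    (accCount + 1) -> (accDistance + distance)
}

println(s"Route has $numStops checkpoints and total distance of $totalDistance km")
// prints something like: Route has 3 checkpoints and total distance of 7.25 km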

How to efficiently center (mean-shift) a spark RowMatrix?

I'm wondering what would be an efficient way of centering the data of a RowMatrix in Spark (for large inputs). Do libraries or functions already exist to do this?
So far I'm thinking of just defining a function and then using map to subtract the mean, but is this efficient?
I want to do this in order to afterwards perform an SVD (to do PCA) on the given matrix.
EDIT:
Here is something I found that does the mean shift by the previously mentioned method (using map):
// `mean` is a Vector of per-column means, computed elsewhere (see the note below the source link)
def subPairs = (vPair: (Double, Double)) => vPair._1 - vPair._2
def subMean = (v: Vector) => Vectors.dense(v.toArray.zip(mean.toArray).map(subPairs))
val stdData = rows.map(subMean)
source : https://github.com/apache/spark/pull/17907/commits/956ce87cd151a9b30d181618aad7ef2a7ee859dc
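The snippet above references a mean vector that is not defined in the excerpt; one common way to obtain it is via colStats from spark.mllib:

import org.apache.spark.mllib.stat.Statistics

// column-wise means of the RDD[Vector] backing the RowMatrix
val mean = Statistics.colStats(rows).mean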
Thanks in advance
Extract the rows:
val mat: RowMatrix = ???
val rows = mat.rows

Fit a StandardScalerModel:
import org.apache.spark.mllib.feature.StandardScaler

val scaler = new StandardScaler(withMean = true, withStd = false).fit(rows)

Scale:
scaler.transform(rows)
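Putting the pieces together, a rough sketch of the whole pipeline (my own assembly, not part of the answer): wrap the centered rows back into a RowMatrix and run the SVD; k = 10 is an arbitrary number of components.

import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val mat: RowMatrix = ???                       // your input matrix
val rows = mat.rows.cache()

val scaler = new StandardScaler(withMean = true, withStd = false).fit(rows)
val centered = new RowMatrix(scaler.transform(rows))

// PCA via SVD on the mean-shifted matrix; svd.V holds the principal directions
val svd = centered.computeSVD(10, computeU = false)
val principalComponents = svd.V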

How do I efficiently debug a long-running Spark application?

I have a Spark application which cleans and prepares a data set, and then applies a K-means clustering algorithm to it. Afterwards, some metrics of the resulting clusters are calculated.
Naturally, computing the K-means clusters takes a long time. When debugging the metric calculations I cannot iterate quickly on my code because the clusters are recomputed on every execution. How do I solve this?
Ideas I have are:
Writing a unit test for the metric calculation methods, but it would be cumbersome to mock the cluster data.
Serialising the computed K-means cluster data to disk (a sketch of this follows the code below).
Any help is appreciated
Code for reference:
def main(args: Array[String]): Unit = {
  // -- start of long-running execution
  val lines = sc.textFile("src/main/resources/stackoverflow/stackoverflow.csv")
  val raw = rawPostings(lines)
  val grouped = groupedPostings(raw)
  val scored = scoredPostings(grouped)
  val vectors = vectorPostings(scored)
  // assert(vectors.count() == 2121822, "Incorrect number of vectors: " + vectors.count())

  val means = kmeans(sampleVectors(vectors), vectors, debug = true)
  // -- end of long-running execution

  val results = clusterResults(means, vectors) // <-- this method operates on the result of the previous ops
  printResults(results)
}
Implementation of clusterResults:
def clusterResults(means: Array[(Int, Int)], vectors: RDD[(LangIndex, HighScore)]): Array[(String, Double, Int, Int)] = {
  // -- Note that means is quite expensive to compute
  val closest = vectors.map(p => (findClosest(p, means), p)) // -- (Int, (LangIndex, HighScore))
  val closestGrouped = closest.groupByKey()                  // -- (Int, Iter((LangIndex, HighScore)))
  vectors.take(3).foreach(println)

  val median = closestGrouped.mapValues { vs =>
    // #todo: what does groupBy(identity) do?
    // Predef.identity is equivalent to x => x
    val langId: Int = vs.map(_._1).groupBy(identity).maxBy(_._2.size)._1 // most common language in the cluster
    val langLabel: String = langs(langId / langSpread)
    // percent of the questions in the most common language (note the .toDouble to avoid integer division)
    val langPercent: Double = vs.map(_._1).count(_.equals(langId)).toDouble / vs.size
    val clusterSize: Int = vs.size
    // median of the scores (the original snippet was truncated here; this is one way to compute it)
    val scores = vs.map(_._2).toArray.sorted
    val medianScore: Int =
      if (scores.length % 2 == 1) scores(scores.length / 2)
      else (scores(scores.length / 2 - 1) + scores(scores.length / 2)) / 2
    (langLabel, langPercent, clusterSize, medianScore)
  }
  median.collect().map(_._2).sortBy(_._4)
}
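To avoid recomputing the clusters on every run, one way to act on the serialisation idea above is to checkpoint the k-means result with saveAsObjectFile. A minimal sketch, assuming a local filesystem path (meansPath is an invented name) and the same vectors / kmeans definitions as in the question:

val meansPath = "target/debug/means"   // hypothetical checkpoint location

val means: Array[(Int, Int)] =
  if (new java.io.File(meansPath).exists) {
    // fast path: reload the clusters computed by an earlier run
    sc.objectFile[(Int, Int)](meansPath).collect()
  } else {
    // slow path: compute once, then persist for the next iteration
    val computed = kmeans(sampleVectors(vectors), vectors, debug = true)
    sc.parallelize(computed.toSeq).saveAsObjectFile(meansPath)
    computed
  }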

Create a PriorityQueue that contains triples and dequeues the triple with the minimum third element in Scala?

I have a PriorityQueue in Scala that I define below. My goal is that when I call dequeue I get the triple with the minimum third element. I figured that using Ordering is the way to go, but I cannot seem to get it to work.
import scala.collection.mutable.{ListBuffer, PriorityQueue}

def orderByWeight(lst: (Int, Int, Int)) = lst._3
val pq = new PriorityQueue[(Int, Int, Int)]()(Ordering.by(orderByWeight))

var x = ListBuffer((0, 1, 2), (0, 2, 3), (0, 3, 4))
x.foreach(i => pq.enqueue(i))
I am confused about what my orderByWeight function should be. For the code above, the desired output of pq.dequeue is (0, 1, 2). Note that x is in random order. Any ideas?
If you want all the 3-tuples dequeued in order of smallest 3rd element to largest, I think this is all you need.
val pq = PriorityQueue[(Int, Int, Int)]()(Ordering.by(-_._3))
If you need an ordered output in case of 3rd-element ties, you can expand it.
var x = ListBuffer((0,1,2), (0,2,3), (0,3,4), (1,0,2))
val pq = PriorityQueue(x:_*)(Ordering[(Int, Int)].on(x => (-x._3, -x._2)))

Scala: groupBy (identity) of List Elements

I am developing an application that builds pairs of words in (tokenised) text and counts the number of times each pair occurs (pairs of identical words may occur multiple times; that's OK, as it will be evened out later in the algorithm).
When I use elements.groupBy(...), I want to group by the elements' content itself, so I wrote the following:
def self(x: (String, String)) = x

/**
 * Maps a collection of words to a map where the key is a pair of words and the
 * value is the number of times this pair occurs in the passed array.
 */
def producePairs(words: Array[String]): Map[(String, String), Double] = {
  var table = List[(String, String)]()
  words.foreach(w1 =>
    words.foreach(w2 =>
      table = table ::: List((w1, w2))))
  val grouppedPairs = table.groupBy(self)
  val size = int2double(grouppedPairs.size)
  return grouppedPairs.mapValues(_.length / size)
}
Now, I fully realise that this self() trick is a dirty hack. So I thought a little and came up with:
grouppedPairs = table groupBy (x => x)
This produces what I want. However, I still feel that I am clearly missing something and there should be an easier way of doing it. Any ideas at all, dear all?
Also, if you could help me improve the pairs extraction part, it would also help a lot; it looks very imperative, C++-ish right now. Many thanks in advance!
I'd suggest this:
def producePairs(words: Array[String]): Map[(String, String), Double] = {
  val table = for (w1 <- words; w2 <- words) yield (w1, w2)
  val grouppedPairs = table.groupBy(identity)
  val size = grouppedPairs.size.toDouble
  grouppedPairs.mapValues(_.length / size)
}
The for comprehension is much easier to read, and there is already a predefined function identity, which is a generalized version of your self.
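A quick usage sketch (the sample words are invented for illustration):

val words = Array("a", "b", "a")
val pairs = producePairs(words)
// 9 ordered pairs in total, 4 distinct ones; e.g. ("a", "a") occurs 4 times
pairs.foreach { case (pair, weight) => println(s"$pair -> $weight") }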
You are creating a list of pairs of all words against all words by iterating over the words twice, where I guess you just want the neighbouring pairs. The easiest fix is to use a sliding view instead.
def producePairs(words: Array[String]): Map[(String, String), Int] = {
  val pairs = words.sliding(2, 1).map(arr => arr(0) -> arr(1)).toList
  val grouped = pairs.groupBy(t => t)
  grouped.mapValues(_.size)
}
Another approach would be to fold over the list of pairs, summing them up. I am not sure, though, that this is more efficient:
def producePairs(words: Array[String]): Map[(String, String), Int] = {
  val pairs = words.sliding(2, 1).map(arr => arr(0) -> arr(1))
  pairs.foldLeft(Map.empty[(String, String), Int]) { (m, p) =>
    m + (p -> (m.getOrElse(p, 0) + 1))
  }
}
I see you are returning a relative number (Double). For simplicity I have just counted the occurrences, so you need to do the final division yourself. I think you want to divide by the total number of pairs (words.size - 1), and not by the number of unique pairs (grouped.size), so that the relative frequencies sum up to 1.0.
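For completeness, here is one way that final division could look (my own sketch building on the counting version above; producePairFrequencies is an invented name and it assumes the input has at least two words):

def producePairFrequencies(words: Array[String]): Map[(String, String), Double] = {
  val pairs = words.sliding(2, 1).map(arr => arr(0) -> arr(1)).toList
  val counts = pairs.groupBy(identity).map { case (p, ps) => p -> ps.size }
  val totalPairs = (words.length - 1).toDouble      // number of neighbouring pairs
  counts.map { case (p, c) => p -> c / totalPairs } // relative frequencies summing to 1.0
}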
An alternative approach, which is not of order O(num_words * num_words) but of order O(num_unique_words * num_unique_words) (or something like that):
def producePairs[T <% Traversable[String]](words: T): Map[(String, String), Double] = {
  val counts = words.groupBy(identity).map { case (w, ws) => w -> ws.size }
  val size = (counts.size * counts.size).toDouble
  for (w1 <- counts; w2 <- counts) yield {
    (w1._1, w2._1) -> ((w1._2 * w2._2) / size)
  }
}