For loop from 1 to max value in file - scala

I am new to Scala and Spark and want to run a for loop from 1 to the maximum value found in the file.
def main(args: Array[String]): Unit = {
  val conf = new SparkConf().setAppName("Shortest Path")
  val sc = new SparkContext(conf)
  val graph = sc.textFile(args(0)).map { line =>
    val a = line.split(",")
    Graph(a(0).toInt, a(1).toInt, a(2).toInt)
  }
  val distance = new ListBuffer[Long]
  distance += 0
  for (i <- 1 to 100000) {
    distance += Int.MaxValue
  }
}
Here, instead of 100000, I want to use the largest value that appears in the i or j field of any tuple in the graph.

Do you mean like this:
val iMax = graph.map{ g => g.i }.max
val jMax = graph.map{ g => g.j }.max
val theMax = math.max(iMax,jMax)
Alternatively, you could determine iMax and jMax in a single pass over the data with reduce or fold.
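A minimal sketch of the single-pass idea, written against an ordinary Seq so it runs without a cluster. The Graph case class here is hypothetical: the field names i and j follow the answer, while w is an assumed name for the third column. RDD also has map and reduce, so the body of maxEndpoint should carry over to the graph RDD largely unchanged.

```scala
object MaxEndpoint {
  // Hypothetical edge type: i and j follow the answer above, w is an assumed name.
  final case class Graph(i: Int, j: Int, w: Int)

  // One traversal: each edge contributes the larger of its two endpoints,
  // and reduce keeps the largest value seen so far.
  // Note: reduce throws on an empty collection.
  def maxEndpoint(edges: Seq[Graph]): Int =
    edges.map(g => math.max(g.i, g.j)).reduce((a, b) => math.max(a, b))
}
```

For example, MaxEndpoint.maxEndpoint(Seq(Graph(1, 4, 10), Graph(7, 2, 3), Graph(3, 5, 8))) evaluates to 7, which could then replace the hard-coded 100000.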

Related

Multiplication of "double" values in scala

I want to multiply two sparse matrices in Spark using Scala. I pass the input matrices as arguments and write the result to the path given by another argument.
Each matrix is a text file in which every element is represented as: row, column, value.
I am not able to multiply two Double values in Scala.
object MultiplySpark {
def main(args: Array[ String ]) {
val conf = new SparkConf().setAppName("Multiply")
conf.setMaster("local[2]")
val sc = new SparkContext(conf)
val M = sc.textFile(args(0)).flatMap(entry => {
val rec = entry.split(",")
val row = rec(0).toInt
val column = rec(1).toInt
val value = rec(2).toDouble
for {pointer <-1 until rec.length} yield ((row,column),value)
})
val N = sc.textFile(args(0)).flatMap(entry => {
val rec = entry.split(",")
val row = rec(0).toInt
val column = rec(1).toInt
val value = rec(2).toDouble
for {pointer <-1 until rec.length} yield ((row,column),value)
})
val Mmap = M.map( e => (e._2,e))
val Nmap = N.map( d => (d._2,d))
val MNjoin = Mmap.join(Nmap).map{ case (k,(e,d)) => e._2.toDouble+","+d._2.toDouble }
val result = MNjoin.reduceByKey( (a,b) => a*b)
.map(entry => {
((entry._1._1, entry._1._2), entry._2)
})
.reduceByKey((a, b) => a + b)
result.saveAsTextFile(args(2))
sc.stop()
How can I multiply double values in Scala?
Please note:
I tried a.toDouble * b.toDouble
Error is: Value * is not a member of Double Double
This reduceByKey would work if you had an RDD[((Int, Int), Double)] (or, more generally, an RDD[(SomeType, Double)]), but join gives you an RDD[((Int, Int), (Double, Double))]. So you are trying to multiply pairs of type (Double, Double), not Doubles.
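To make the shapes concrete, here is a sketch of the multiplication on plain Scala collections rather than RDDs; the object and method names are made up for illustration. The point it shows is that after pairing entries, the two Doubles have to be pulled out of the pair before * applies. In Spark, a pattern match such as case (key, (a, b)) => a * b on the output of join would play the same role.

```scala
object SparseMultiplySketch {
  // ((row, column), value), the same layout as the text files in the question.
  type Entry = ((Int, Int), Double)

  def multiply(m: Seq[Entry], n: Seq[Entry]): Map[(Int, Int), Double] = {
    // Index N by row so that M(i, k) can find every N(k, j).
    val nByRow = n.groupBy { case ((row, _), _) => row }
    val products = for {
      ((i, k), a) <- m
      ((_, j), b) <- nByRow.getOrElse(k, Seq.empty)
    } yield ((i, j), a * b) // a and b are plain Doubles here, so * is defined
    // Sum the partial products that fall on the same output cell.
    products.groupBy(_._1).map { case (cell, ps) => cell -> ps.map(_._2).sum }
  }
}
```

Note that M is matched on its column index and N on its row index, which is what matrix multiplication requires; joining both matrices on the same (row, column) key would give an element-wise product instead.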

Join per line two different RDDs in just one - Scala

I'm programming a K-means algorithm in Spark with Scala.
My model predicts which cluster each point belongs to.
Data
-6.59 -44.68
-35.73 39.93
47.54 -52.04
23.78 46.82
....
Load the data
val data = sc.textFile("/home/borja/flink/kmeans/points")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()
Cluster the data using KMeans (numClusters is 10 here)
val numClusters = 10
val numIterations = 100
val clusters = KMeans.train(parsedData, numClusters, numIterations)
Predict
val prediction = clusters.predict(parsedData)
However, I need to put the result and the points in the same file, in the following format:
[no header; cluster number (1, 2, 3, ..., 10), pointX, pointY]:
6 -6.59 -44.68
8 -35.73 39.93
10 47.54 -52.04
7 23.78 46.82
This file is the input to a Python script that plots the result nicely.
But my best effort so far produces only this
(note that the first numbers are wrong: 68, 384, ...):
var i = 0
val c = sc.parallelize(data.collect().map(x => {
val tuple = (i, x)
i += 1
tuple
}))
i = 0
val c2 = sc.parallelize(prediction.collect().map(x => {
val tuple = (i, x)
i += 1
tuple
}))
val result = c.join(c2)
result.take(5)
Result:
res94: Array[(Int, (String, Int))] = Array((68,(17.79 13.69,0)), (384,(-33.47 -4.87,8)), (440,(-4.75 -42.21,1)), (4,(-33.31 -13.11,6)), (324,(-39.04 -16.68,6)))
Thanks for your help! :)
I don't have a Spark cluster handy to test, but something like this should work:
val result = parsedData.map { v =>
  val cluster = clusters.predict(v)
  s"$cluster ${v(0)} ${v(1)}"
}
result.saveAsTextFile("/some/output/path")

Merging list of uneven length with default value for missing matches

I'm trying to pair up two lists in Scala so that elements without a match are paired with a default value. This is what I have so far, but each attempt falls short in some way.
How do I create List((a,a),(b,empty),(c,c))?
case class Test(id: Option[Int] = None)
val empty = Test()
val a = Test(Some(1))
val b = Test(Some(2))
val c = Test(Some(3))
val cache = List(a,b,c)
val delta = List(a,c)
// Trial 1
val newCache1 = cache.zipAll(delta, empty, empty)
// Trial 2
val newCache2 = for {
  c <- cache
  d <- delta
  if c.id == d.id
} yield (c, d)
// Trial 3
val newCache3 = for {
  c <- cache
  d <- delta
} yield if (c.id == d.id) (c, d) else (c, empty)
Turn your delta into a map keyed by id, then look up each cache element in it.
val deltaMap: Map[Int, Test] =
  delta.flatMap(x => x.id.map(id => id -> x)).toMap
val newCache: Seq[(Test, Test)] = cache.map { c =>
  c -> c.id.flatMap(deltaMap.get).getOrElse(empty)
}
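For reference, the same lookup wrapped in a self-contained object that can be pasted into a file and checked; the object and method names are made up for illustration. It also shows why Trial 1 falls short: zipAll pairs elements by position, so b ends up next to c.

```scala
object MergeWithDefault {
  final case class Test(id: Option[Int] = None)
  val empty: Test = Test()

  // Pair every cache element with the delta element that has the same id,
  // or with `empty` when delta has no such element (or the id is None).
  def pairWithDefault(cache: List[Test], delta: List[Test]): List[(Test, Test)] = {
    val deltaById: Map[Int, Test] = delta.flatMap(d => d.id.map(_ -> d)).toMap
    cache.map(c => c -> c.id.flatMap(deltaById.get).getOrElse(empty))
  }
}
```

With a, b, c defined as in the question, pairWithDefault(List(a, b, c), List(a, c)) gives List((a,a), (b,empty), (c,c)), whereas zipAll gives List((a,a), (b,c), (c,empty)).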

Spark Jaccard similarity computation by min hashing slow compared to trivial approach

Given two huge lists of values, I am trying to compute the Jaccard similarity between them in Spark using Scala.
Assume colHashed1 contains the first list of values and colHashed2 contains the second list.
Approach 1 (trivial approach):
val jSimilarity = colHashed1.intersection(colHashed2).distinct.count/(colHashed1.union(colHashed2).distinct.count.toDouble)
Approach 2 (using minHashing):
I have used the approach explained here.
import java.util.zip.CRC32
def getCRC32 (s : String) : Int =
{
val crc=new CRC32
crc.update(s.getBytes)
return crc.getValue.toInt & 0xffffffff
}
val maxShingleID = Math.pow(2,32)-1
def pickRandomCoeffs(kIn : Int) : Array[Int] =
{
var k = kIn
val randList = Array.fill(k){0}
while(k > 0)
{
// Get a random shingle ID.
var randIndex = (Math.random()*maxShingleID).toInt
// Ensure that each random number is unique.
while(randList.contains(randIndex))
{
randIndex = (Math.random()*maxShingleID).toInt
}
// Add the random number to the list.
k = k - 1
randList(k) = randIndex
}
return randList
}
val colHashed1 = list1Values.map(a => getCRC32(a))
val colHashed2 = list2Values.map(a => getCRC32(a))
val nextPrime = 4294967311L
val numHashes = 10
val coeffA = pickRandomCoeffs(numHashes)
val coeffB = pickRandomCoeffs(numHashes)
var signature1 = Array.fill(numHashes){0}
for (i <- 0 to numHashes-1)
{
// Evaluate the hash function.
val hashCodeRDD = colHashed1.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime))
// Track the lowest hash code seen.
signature1(i) = hashCodeRDD.min.toInt
}
var signature2 = Array.fill(numHashes){0}
for (i <- 0 to numHashes-1)
{
// Evaluate the hash function.
val hashCodeRDD = colHashed2.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime))
// Track the lowest hash code seen.
signature2(i) = hashCodeRDD.min.toInt
}
var count = 0
// Count the number of positions in the minhash signature which are equal.
for(k <- 0 to numHashes-1)
{
if(signature1(k) == signature2(k))
count = count + 1
}
val jSimilarity = count/numHashes.toDouble
Approach 1 always seems to outperform Approach 2 in terms of time. When I analyzed the code, I found that the min() call on the RDD in Approach 2 takes significant time, and it is called once per hash function for each of the two lists.
The intersection and union operations used in Approach 1 seem to run faster than the repeated min() calls.
I don't understand why minHashing does not help here. I expected it to be faster than the trivial approach. Is there anything I am doing wrong?
Sample data can be viewed here
Jaccard similarity with MinHash does not give consistent results:
import java.util.zip.CRC32
object Jaccard {
def getCRC32(s: String): Int = {
val crc = new CRC32
crc.update(s.getBytes)
return crc.getValue.toInt & 0xffffffff
}
def pickRandomCoeffs(kIn: Int, maxShingleID: Double): Array[Int] = {
var k = kIn
val randList = Array.ofDim[Int](k)
while (k > 0) {
// Get a random shingle ID.
var randIndex = (Math.random() * maxShingleID).toInt
// Ensure that each random number is unique.
while (randList.contains(randIndex)) {
randIndex = (Math.random() * maxShingleID).toInt
}
// Add the random number to the list.
k = k - 1
randList(k) = randIndex
}
return randList
}
def approach2(list1Values: List[String], list2Values: List[String]) = {
val maxShingleID = Math.pow(2, 32) - 1
val colHashed1 = list1Values.map(a => getCRC32(a))
val colHashed2 = list2Values.map(a => getCRC32(a))
val nextPrime = 4294967311L
val numHashes = 10
val coeffA = pickRandomCoeffs(numHashes, maxShingleID)
val coeffB = pickRandomCoeffs(numHashes, maxShingleID)
val signature1 = for (i <- 0 until numHashes) yield {
val hashCodeRDD = colHashed1.map(ele => (coeffA(i) * ele + coeffB(i)) % nextPrime)
hashCodeRDD.min.toInt // Track the lowest hash code seen.
}
val signature2 = for (i <- 0 until numHashes) yield {
val hashCodeRDD = colHashed2.map(ele => (coeffA(i) * ele + coeffB(i)) % nextPrime)
hashCodeRDD.min.toInt // Track the lowest hash code seen
}
val count = (0 until numHashes)
.map(k => if (signature1(k) == signature2(k)) 1 else 0)
.fold(0)(_ + _)
val jSimilarity = count / numHashes.toDouble
jSimilarity
}
// def approach1(list1Values: List[String], list2Values: List[String]) = {
// val colHashed1 = list1Values.toSet
// val colHashed2 = list2Values.toSet
//
// val jSimilarity = colHashed1.intersection(colHashed2).distinct.count / (colHashed1.union(colHashed2).distinct.count.toDouble)
// jSimilarity
// }
def approach1(list1Values: List[String], list2Values: List[String]) = {
val colHashed1 = list1Values.toSet
val colHashed2 = list2Values.toSet
val jSimilarity = (colHashed1 & colHashed2).size / (colHashed1 ++ colHashed2).size.toDouble
jSimilarity
}
def main(args: Array[String]) {
val list1Values = List("a", "b", "c")
val list2Values = List("a", "b", "d")
for (i <- 0 until 5) {
println(s"Iteration ${i}")
println(s" - Approach 1: ${approach1(list1Values, list2Values)}")
println(s" - Approach 2: ${approach2(list1Values, list2Values)}")
}
}
}
OUTPUT:
Iteration 0
- Approach 1: 0.5
- Approach 2: 0.5
Iteration 1
- Approach 1: 0.5
- Approach 2: 0.5
Iteration 2
- Approach 1: 0.5
- Approach 2: 0.8
Iteration 3
- Approach 1: 0.5
- Approach 2: 0.8
Iteration 4
- Approach 1: 0.5
- Approach 2: 0.4
Why are you using it?
It seems to me that the overhead of the minHashing approach simply outweighs its benefit in Spark, especially as numHashes increases.
Here are some observations about your code:
First, the while (randList.contains(randIndex)) check scans randList linearly for every candidate, so it gets slower as numHashes (which equals the size of randList) increases.
Second, you can save some time if you rewrite this code:
var signature1 = Array.fill(numHashes){0}
for (i <- 0 to numHashes-1)
{
// Evaluate the hash function.
val hashCodeRDD = colHashed1.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime))
// Track the lowest hash code seen.
signature1(i) = hashCodeRDD.min.toInt
}
var signature2 = Array.fill(numHashes){0}
for (i <- 0 to numHashes-1)
{
// Evaluate the hash function.
val hashCodeRDD = colHashed2.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime))
// Track the lowest hash code seen.
signature2(i) = hashCodeRDD.min.toInt
}
var count = 0
// Count the number of positions in the minhash signature which are equal.
for(k <- 0 to numHashes-1)
{
if(signature1(k) == signature2(k))
count = count + 1
}
into
var count = 0
for (i <- 0 to numHashes - 1)
{
val hashCodeRDD1 = colHashed1.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime))
val hashCodeRDD2 = colHashed2.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime))
val sig1 = hashCodeRDD1.min.toInt
val sig2 = hashCodeRDD2.min.toInt
if (sig1 == sig2) { count = count + 1 }
}
This merges the three loops into one. However, I am not sure it would give a large speedup.
One other suggestion, assuming the first approach still turns out to be much faster, is to use a property of sets to modify the first approach:
val colHashed1_dist = colHashed1.distinct
val colHashed2_dist = colHashed2.distinct
val intersect_cnt = colHashed1_dist.intersection(colHashed2_dist).distinct.count
val jSimilarity = intersect_cnt/(colHashed1_dist.count + colHashed2_dist.count - intersect_cnt).toDouble
That way, instead of computing the union, you reuse the intersection count, since |A ∪ B| = |A| + |B| - |A ∩ B|.
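The identity can be checked on small in-memory sets. This is only a sketch with made-up names; returning 0.0 when both inputs are empty is an assumed convention, not something the answer specifies.

```scala
object JaccardByCounts {
  // Jaccard similarity from three counts, without materialising the union.
  def jaccard[A](xs: Seq[A], ys: Seq[A]): Double = {
    val a = xs.toSet // plays the role of colHashed1.distinct
    val b = ys.toSet // plays the role of colHashed2.distinct
    val inter = (a & b).size
    val union = a.size + b.size - inter // inclusion-exclusion
    if (union == 0) 0.0 else inter.toDouble / union
  }
}
```

For the lists List("a", "b", "c") and List("a", "b", "d") the intersection has 2 elements and the union 3 + 3 - 2 = 4, so the result is 0.5.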
Actually, in the LSH approach you would calculate the minHash signature only once for each of your documents and then compare two signatures for each possible pair of documents. With the trivial approach you would perform a full comparison of the documents for each possible pair, which is roughly N^2/2 comparisons. Hence the extra cost of calculating the minHashes becomes negligible for a large enough number of documents.
You should actually compare the performance of the trivial approach:
val jSimilarity = colHashed1.intersection(colHashed2).distinct.count/(colHashed1.union(colHashed2).distinct.count.toDouble)
with the performance of the Jaccard similarity estimate (the last lines of your code):
var count = 0
// Count the number of positions in the minhash signature which are equal.
for(k <- 0 to numHashes-1)
{
if(signature1(k) == signature2(k))
count = count + 1
}
val jSimilarity = count/numHashes.toDouble
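A sketch of that split, on local collections and with hypothetical names throughout: signature is the expensive step that runs once per document, and estimate is the cheap step that runs once per pair. Unlike the code in the question, it uses String.hashCode masked to 31 bits and the prime 2^31 - 1 instead of CRC32 and 4294967311; that substitution is made only so that a * x always fits in a Long.

```scala
import scala.util.Random

object MinHashSketch {
  // Mersenne prime 2^31 - 1. With a, x and b all below 2^31, a * x + b fits in a Long.
  val Prime: Long = 2147483647L

  final case class HashFamily(a: Vector[Long], b: Vector[Long]) {
    def size: Int = a.length
  }

  // Draw the coefficients once and share them across all documents.
  def hashFamily(numHashes: Int, seed: Long): HashFamily = {
    require(numHashes > 0, "need at least one hash function")
    val rnd = new Random(seed)
    val a = Vector.fill(numHashes)(1L + rnd.nextInt((Prime - 1).toInt)) // 1 .. Prime - 1
    val b = Vector.fill(numHashes)(rnd.nextInt(Prime.toInt).toLong)     // 0 .. Prime - 1
    HashFamily(a, b)
  }

  // Expensive step, done once per document: the minimum of each hash function.
  def signature(doc: Set[String], family: HashFamily): Vector[Long] = {
    require(doc.nonEmpty, "signature of an empty set is undefined")
    val ids = doc.map(s => (s.hashCode & 0x7fffffff).toLong) // non-negative 31-bit ids
    Vector.tabulate(family.size) { i =>
      ids.iterator.map(x => (family.a(i) * x + family.b(i)) % Prime).min
    }
  }

  // Cheap step, done once per pair: the fraction of positions that agree.
  def estimate(sig1: Vector[Long], sig2: Vector[Long]): Double = {
    require(sig1.length == sig2.length, "signatures must use the same hash family")
    sig1.zip(sig2).count { case (x, y) => x == y }.toDouble / sig1.length
  }
}
```

The value returned by estimate is only an approximation of the true similarity, and it changes with the random coefficients, which is consistent with the varying Approach 2 numbers in the output above.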

Creating an RDD to collect the results of an iterative calculation

I would like to create an RDD that collects the results of an iterative calculation.
How can I use a loop (or any alternative) to replace the following code?
import org.apache.spark.mllib.random.RandomRDDs._
val n = 10
val step1 = normalRDD(sc, n, seed = 1 )
val step2 = normalRDD(sc, n, seed = (step1.max).toLong )
val result1 = step1.zip(step2)
val step3 = normalRDD(sc, n, seed = (step2.max).toLong )
val result2 = result1.zip(step3)
...
val step50 = normalRDD(sc, n, seed = (step49.max).toLong )
val result49 = result48.zip(step50)
(Creating the N step RDDs and zipping them together at the end would also be OK, as long as the 50 RDDs are created iteratively to respect the seed = (step(n-1).max) condition.)
A recursive function would work:
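import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.random.RandomRDDs.normalRDD
/**
 * The return type is an Option to handle the case of a user specifying
 * a non-positive number of steps. Each element of the result holds the
 * values of all steps at that position.
 */
def createZippedNormal(sc: SparkContext,
                       size: Long,
                       numSteps: Int): Option[RDD[Vector[Double]]] = {
  @scala.annotation.tailrec
  def accum(stepsLeft: Int,
            currRDD: RDD[Vector[Double]],
            seed: Long): RDD[Vector[Double]] = {
    if (stepsLeft <= 0) currRDD
    else {
      // Same size and partition count as currRDD, so zip lines up element by element.
      val newRDD = normalRDD(sc, size, currRDD.getNumPartitions, seed)
      accum(stepsLeft - 1, currRDD.zip(newRDD).map { case (row, x) => row :+ x }, newRDD.max().toLong)
    }
  }
  if (numSteps <= 0) None
  else {
    // The first step uses the fixed seed 1, as in the question.
    val first = normalRDD(sc, size, seed = 1L)
    Some(accum(numSteps - 1, first.map(Vector(_)), first.max().toLong))
  }
}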