How to combine the results of Spark computations in the following case? - scala

The question is to calculate the average of each of the columns for each class. The class number is given in the first column.
I am giving part of a test file for better clarity:
2 0.819039 -0.408442 0.120827
3 -0.063763 0.060122 0.250393
4 -0.304877 0.379067 0.092391
5 -0.168923 0.044400 0.074417
1 0.053700 -0.088746 0.228501
2 0.196758 0.035607 0.008134
3 0.006971 -0.096478 0.123718
4 0.084281 0.278343 -0.350414
So the task is to calculate, for each class:
1: avg(), avg(), avg()
2: avg(), avg(), avg()
...
I am very new to Scala. After a lot of juggling with the code I came up with the following:
val inputfile = sc.textFile("testfile.txt")
val myArray = inputfile.map { line =>
  line.split(" ").toList
}
var Avgmap: Map[String, List[Double]] = Map()
var countmap: Map[String, Int] = Map()
for (a <- myArray) {
  // println("Value of a: " + a + " " + a.size)
  if (!countmap.contains(a(0))) {
    countmap += (a(0) -> 0)
    Avgmap += (a(0) -> List.fill(a.size - 1)(1.0))
  }
  var c = countmap(a(0)) + 1
  val countmap2 = countmap + (a(0) -> c)
  countmap = countmap2
  var p = List[Double]()
  for (i <- 1 to a.size - 1) {
    var temp = (Avgmap(a(0))(i - 1) * (countmap(a(0)) - 1) + a(i).toDouble) / countmap(a(0))
    // println("i: " + i + " temp: " + temp)
    var q = p :+ temp
    p = q
  }
  val Avgmap2 = Avgmap + (a(0) -> p)
  Avgmap = Avgmap2
  println("--------------------------------------------------")
  println(countmap)
  println(Avgmap)
}
When I execute this code I seem to be getting the results in two halves of the dataset. Please help me combine them.
Edit: about the variables I am using: countmap keeps a record of classnumber -> number of vectors encountered so far. Similarly, Avgmap keeps the running average of each column for that key.

First, use the DataFrame API. Second, if all you want is the overall mean of each column, that is just one row:
df.select(df.columns.map(c => mean(col(c))) :_*).show
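If, as in the question, you need one averaged row per class, group by the first column. A minimal sketch, assuming Spark 2.x and hypothetical column names ("cls" for the class, "c1".."c3" for the values):
import org.apache.spark.sql.functions.{avg, col}

// Hypothetical column names; adjust to your schema.
val df = spark.read
  .option("sep", " ")
  .option("inferSchema", "true")
  .csv("testfile.txt")
  .toDF("cls", "c1", "c2", "c3")

df.groupBy(col("cls"))
  .agg(avg(col("c1")), avg(col("c2")), avg(col("c3")))
  .orderBy(col("cls"))
  .show()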

Related

Scala for loop to calculate sum of powers

I am very new to Scala and am trying to create a loop that will calculate the sum of powers (1^1 + 2^2 + ... + 10^10) without using an exponent operator.
I discovered that 1^1 through 9^9 calculate correctly. But for some reason 10^10 evaluates to 1410065409 in my current code and messes up my final output of the sum. What is causing this mathematical error?
My current code is:
var i = 1
var ex = 1
var sum = 0
while (i <= 10) {
  for (j <- 1 to i) {
    ex = ex * i
  }
  sum += ex
  ex = 1
  i += 1
}
println(s"The sum is $sum")
The culprit is Int overflow: 10^10 is 10,000,000,000, which exceeds Int.MaxValue (2,147,483,647), so the repeated Int multiplication wraps around. Work with Long instead. Here's how it's done in Scala:
List.tabulate(10)(n => List.fill(n+1)(n.toLong+1).product).sum
//res0: Long = 10405071317
Another option you have is to use Math.pow:
val result1 = 1.to(10).map(x => Math.pow(x, x)).sum
Please note that result1 is of type Double, and has the value 1.0405071317E10.
If you want to have it as a Long, you can do:
val result2 = 1.to(10).map(x => Math.pow(x, x).toLong).sum
Then result2 will have the value 10405071317.
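For completeness, a minimal sketch of the smallest fix to the original loop: make the accumulators Long so the products no longer wrap:
var i = 1
var ex = 1L   // Long accumulators avoid Int overflow
var sum = 0L
while (i <= 10) {
  for (j <- 1 to i) {
    ex = ex * i
  }
  sum += ex
  ex = 1L
  i += 1
}
println(s"The sum is $sum")  // The sum is 10405071317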

scala absolute min distance between words in a text

I'm trying to find the min distance between multiple words in a given text.
Let's suppose I have a string such as: "a b cat dog x y z n m p fox x dog b b cat"
Find the min distance covering all matches of the substrings (fox, dog, cat).
There are multiple occurrences of each substring in this text:
One near the beginning:
cat - 4
dog - 8
fox - 24
dist = 24 - 4 = 20
And one at the end of the string:
fox - 24
dog - 30
cat - 38
dist = 38 - 24 = 14
The min dist = 14.
This is the algorithm I came up with:
import scala.collection.mutable.ListBuffer

object MinKWindowSum {

  def main(args: Array[String]): Unit = {
    val document =
      """This Hello World is a huge text with thousands
      Java of Hello words and Scala other lines and World and many other Hello docs
      Words of World in many langs Hello and features
      Java Scala AXVX TXZX ASDQWE OWEQ World asb eere qwerer
      asdasd Scala Java Hello docs World KLKM NWQEW ZXCASD OPOOIK Scala ASDSA
      """
    println(getMinWindowSize(document, "Hello World Scala"))
  }

  def getMinWindowSize(str: String, s: String): Int = {
    /* Creates a list of tuples List[(String, Int)] which contains each keyword and its
       respective index found in the text, sorted in order by index. */
    val keywords = s.split(" ").toSet
    val idxs = keywords.map(k => (k -> ("(?i)\\Q" + k + "\\E").r.findAllMatchIn(str).map(_.start)))
      .map { case (keyword, itr) => itr.map((keyword, _)) }
      .flatMap(identity).toSeq
      .sortBy(_._2)

    // Calculates the min window on the next step.
    var min = Int.MaxValue
    var minI, minJ = -1
    // current window indexes and words
    var currIdxs = ListBuffer[Int]()
    var currWords = ListBuffer[String]()

    for (idx <- idxs) {
      // check if the word already exists in the window
      val idxOfWord = currWords.indexOf(idx._1)
      if (!currWords.isEmpty && idxOfWord != -1) {
        currWords = currWords.drop(idxOfWord + 1)
        currIdxs = currIdxs.drop(idxOfWord + 1)
      }
      currWords += idx._1
      currIdxs += idx._2
      // if all keys are present, check whether it is a new min window
      if (keywords.size == currWords.length) {
        val currMin = Math.abs(currIdxs.last - currIdxs.head)
        if (min > currMin) {
          min = currMin
          minI = currIdxs.head
          minJ = currIdxs.last
        }
      }
    }
    println("min = " + min + " ,i = " + minI + " j = " + minJ)
    min
  }
}
In the example above we try to find the min distance between all matches of "Hello World Scala"
The shortest window between the indexes is found between indexes:
i = 235, j = 257 --> min = 22
I was wondering whether there is a more idiomatic way of doing this, or a better approach in terms of efficiency, scalability, readability and simplicity?
Here is a slightly "more functional" alternative:
val document =
  """This Hello World is a huge text with thousands Java of Hello words and Scala other lines and World and many other Hello docs
  Words of World in many langs Hello and features Java Scala AXVX TXZX ASDQWE OWEQ World
  """
val WORDS = Set("Hello", "World", "Scala")

val minDistance = document.trim
  .split(" ")
  .foldLeft((List[(String, Int)](), Option.empty[Int], 0)) { // explicit tuple avoids deprecated auto-tupling
    case ((words, min, idx), word) if WORDS.contains(word) =>
      val newWords = (word, idx) :: words.filter(_._1 != word)
      if (newWords.map(_._1).toSet == WORDS) { // toSet on only 3 elmts
        val idxes = newWords.map(_._2)
        val dist = idxes.max - idxes.min
        val newMin = min match {
          case None => dist
          case Some(min) if min < dist => min
          case _ => dist
        }
        (newWords, Some(newMin), idx + word.length + 1)
      } else {
        (newWords, min, idx + word.length + 1)
      }
    case ((words, min, idx), word) =>
      (words, min, idx + word.length + 1)
  }
  ._2

println(minDistance)
which produces:
Some(38)
My approach starts with a similar premise but uses a tail-recursive helper method to search the indexed words.
def getMinWindowSize(str: String, s: String): Int = {
  val keywords = s.split("\\s+").toSet
  val re = "(?i)\\b(" + keywords.mkString("|") + ")\\b"
  val idxs = re.r.findAllMatchIn(str).map(w => w.start -> w.toString).toList

  // Walk the remaining matches, removing keywords as they are seen, and
  // return the index at which the last outstanding keyword is found.
  def dist(input: List[(Int, String)], keys: Set[String]): Option[Int] = input match {
    case Nil => None
    case (idx, word) :: rest =>
      if (keys(word) && keys.size == 1) Some(idx)
      else dist(rest, keys diff Set(word))
  }

  idxs.tails.collect {
    case (idx, word) :: rest => dist(rest, keywords diff Set(word)).map(_ - idx)
  }.flatten.reduceOption(_ min _).getOrElse(-1)
}
No mutable variables or data structures. I also used Option to help return a more meaningful value if no minimum window is possible.
Usage:
getMinWindowSize(document, "Hello World Scala") //res0: Int = 22
getMinWindowSize(document, "Hello World Scal") //res1: Int = -1

Spark Jaccard similarity computation by min hashing slow compared to trivial approach

Given 2 huge lists of values, I am trying to compute the Jaccard similarity between them in Spark using Scala.
Assume colHashed1 contains the first list of values and colHashed2 contains the second list.
Approach 1 (trivial approach):
val jSimilarity = colHashed1.intersection(colHashed2).distinct.count/(colHashed1.union(colHashed2).distinct.count.toDouble)
Approach 2 (using minHashing):
I have used the approach explained here.
import java.util.zip.CRC32

def getCRC32(s: String): Int = {
  val crc = new CRC32
  crc.update(s.getBytes)
  crc.getValue.toInt & 0xffffffff
}

val maxShingleID = Math.pow(2, 32) - 1

def pickRandomCoeffs(kIn: Int): Array[Int] = {
  var k = kIn
  val randList = Array.fill(k)(0)
  while (k > 0) {
    // Get a random shingle ID.
    var randIndex = (Math.random() * maxShingleID).toInt
    // Ensure that each random number is unique.
    while (randList.contains(randIndex)) {
      randIndex = (Math.random() * maxShingleID).toInt
    }
    // Add the random number to the list.
    k = k - 1
    randList(k) = randIndex
  }
  randList
}

val colHashed1 = list1Values.map(a => getCRC32(a))
val colHashed2 = list2Values.map(a => getCRC32(a))

val nextPrime = 4294967311L
val numHashes = 10

val coeffA = pickRandomCoeffs(numHashes)
val coeffB = pickRandomCoeffs(numHashes)

var signature1 = Array.fill(numHashes)(0)
for (i <- 0 to numHashes - 1) {
  // Evaluate the hash function.
  val hashCodeRDD = colHashed1.map(ele => (coeffA(i) * ele + coeffB(i)) % nextPrime)
  // Track the lowest hash code seen.
  signature1(i) = hashCodeRDD.min.toInt
}

var signature2 = Array.fill(numHashes)(0)
for (i <- 0 to numHashes - 1) {
  // Evaluate the hash function.
  val hashCodeRDD = colHashed2.map(ele => (coeffA(i) * ele + coeffB(i)) % nextPrime)
  // Track the lowest hash code seen.
  signature2(i) = hashCodeRDD.min.toInt
}

var count = 0
// Count the number of positions in the minhash signature which are equal.
for (k <- 0 to numHashes - 1) {
  if (signature1(k) == signature2(k))
    count = count + 1
}

val jSimilarity = count / numHashes.toDouble
Approach 1 always seems to outperform Approach 2 in terms of time. When I analyzed the code, the min() call on the RDD in Approach 2 takes significant time, and it is called once per hash function.
The intersection and union operations used in Approach 1 seem to work faster than the repeated min() calls.
I don't understand why minHashing does not help here. I expected minHashing to be faster than the trivial approach. Is there anything I am doing wrong?
Sample data can be viewed here
JaccardSimilarity with MinHash is not giving consistent results (which is expected: MinHash is a randomized estimator, and with only 10 hash functions the estimate has high variance):
import java.util.zip.CRC32

object Jaccard {

  def getCRC32(s: String): Int = {
    val crc = new CRC32
    crc.update(s.getBytes)
    crc.getValue.toInt & 0xffffffff
  }

  def pickRandomCoeffs(kIn: Int, maxShingleID: Double): Array[Int] = {
    var k = kIn
    val randList = Array.ofDim[Int](k)
    while (k > 0) {
      // Get a random shingle ID.
      var randIndex = (Math.random() * maxShingleID).toInt
      // Ensure that each random number is unique.
      while (randList.contains(randIndex)) {
        randIndex = (Math.random() * maxShingleID).toInt
      }
      // Add the random number to the list.
      k = k - 1
      randList(k) = randIndex
    }
    randList
  }

  def approach2(list1Values: List[String], list2Values: List[String]) = {
    val maxShingleID = Math.pow(2, 32) - 1

    val colHashed1 = list1Values.map(a => getCRC32(a))
    val colHashed2 = list2Values.map(a => getCRC32(a))

    val nextPrime = 4294967311L
    val numHashes = 10

    val coeffA = pickRandomCoeffs(numHashes, maxShingleID)
    val coeffB = pickRandomCoeffs(numHashes, maxShingleID)

    val signature1 = for (i <- 0 until numHashes) yield {
      val hashCodeRDD = colHashed1.map(ele => (coeffA(i) * ele + coeffB(i)) % nextPrime)
      hashCodeRDD.min.toInt // Track the lowest hash code seen.
    }

    val signature2 = for (i <- 0 until numHashes) yield {
      val hashCodeRDD = colHashed2.map(ele => (coeffA(i) * ele + coeffB(i)) % nextPrime)
      hashCodeRDD.min.toInt // Track the lowest hash code seen.
    }

    val count = (0 until numHashes)
      .map(k => if (signature1(k) == signature2(k)) 1 else 0)
      .fold(0)(_ + _)

    val jSimilarity = count / numHashes.toDouble
    jSimilarity
  }

  // def approach1(list1Values: List[String], list2Values: List[String]) = {
  //   val colHashed1 = list1Values.toSet
  //   val colHashed2 = list2Values.toSet
  //
  //   val jSimilarity = colHashed1.intersection(colHashed2).distinct.count / (colHashed1.union(colHashed2).distinct.count.toDouble)
  //   jSimilarity
  // }

  def approach1(list1Values: List[String], list2Values: List[String]) = {
    val colHashed1 = list1Values.toSet
    val colHashed2 = list2Values.toSet
    val jSimilarity = (colHashed1 & colHashed2).size / (colHashed1 ++ colHashed2).size.toDouble
    jSimilarity
  }

  def main(args: Array[String]) {
    val list1Values = List("a", "b", "c")
    val list2Values = List("a", "b", "d")
    for (i <- 0 until 5) {
      println(s"Iteration ${i}")
      println(s" - Approach 1: ${approach1(list1Values, list2Values)}")
      println(s" - Approach 2: ${approach2(list1Values, list2Values)}")
    }
  }
}
OUTPUT:
Iteration 0
- Approach 1: 0.5
- Approach 2: 0.5
Iteration 1
- Approach 1: 0.5
- Approach 2: 0.5
Iteration 2
- Approach 1: 0.5
- Approach 2: 0.8
Iteration 3
- Approach 1: 0.5
- Approach 2: 0.8
Iteration 4
- Approach 1: 0.5
- Approach 2: 0.4
Why are you using it at all?
It seems to me that the overhead cost of the minHashing approach simply outweighs its benefit in Spark, especially as numHashes increases.
Here are some observations I've found in your code:
First, the while (randList.contains(randIndex)) check will surely slow down your process as numHashes (which, by the way, equals the size of randList) increases.
Second, you can save some time if you rewrite this code:
var signature1 = Array.fill(numHashes)(0)
for (i <- 0 to numHashes - 1) {
  // Evaluate the hash function.
  val hashCodeRDD = colHashed1.map(ele => (coeffA(i) * ele + coeffB(i)) % nextPrime)
  // Track the lowest hash code seen.
  signature1(i) = hashCodeRDD.min.toInt
}

var signature2 = Array.fill(numHashes)(0)
for (i <- 0 to numHashes - 1) {
  // Evaluate the hash function.
  val hashCodeRDD = colHashed2.map(ele => (coeffA(i) * ele + coeffB(i)) % nextPrime)
  // Track the lowest hash code seen.
  signature2(i) = hashCodeRDD.min.toInt
}

var count = 0
// Count the number of positions in the minhash signature which are equal.
for (k <- 0 to numHashes - 1) {
  if (signature1(k) == signature2(k))
    count = count + 1
}
into
var count = 0
for (i <- 0 to numHashes - 1) {
  val hashCodeRDD1 = colHashed1.map(ele => (coeffA(i) * ele + coeffB(i)) % nextPrime)
  val hashCodeRDD2 = colHashed2.map(ele => (coeffA(i) * ele + coeffB(i)) % nextPrime)

  val sig1 = hashCodeRDD1.min.toInt
  val sig2 = hashCodeRDD2.min.toInt

  if (sig1 == sig2) { count = count + 1 }
}
This simplifies the three loops into one. However, I am not sure it would give a huge boost in computation time.
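If colHashed1 and colHashed2 are RDDs, each min call is still a separate Spark job, so a bigger win is to compute the whole signature in a single pass. A sketch, assuming RDD[Int] inputs and the coeffA/coeffB/nextPrime values from above (the toLong also avoids Int overflow in coeffA(i) * ele):
import org.apache.spark.rdd.RDD

// One pass per list: map each element to its vector of hash codes,
// then reduce with an elementwise minimum.
def signature(col: RDD[Int], coeffA: Array[Int], coeffB: Array[Int], nextPrime: Long): Array[Long] =
  col.map(ele => coeffA.indices.map(i => (coeffA(i).toLong * ele + coeffB(i)) % nextPrime).toArray)
    .reduce((a, b) => a.zip(b).map { case (x, y) => math.min(x, y) })

val matches = signature(colHashed1, coeffA, coeffB, nextPrime)
  .zip(signature(colHashed2, coeffA, coeffB, nextPrime))
  .count { case (s1, s2) => s1 == s2 }
val jSimilarity = matches / numHashes.toDouble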
One other suggestion, assuming the first approach still turns out to be much faster, is to use a set identity to modify it:
val colHashed1_dist = colHashed1.distinct
val colHashed2_dist = colHashed2.distinct
val intersect_cnt = colHashed1_dist.intersection(colHashed2_dist).distinct.count

val jSimilarity = intersect_cnt / (colHashed1_dist.count + colHashed2_dist.count - intersect_cnt).toDouble
With that, instead of computing the union separately, you can reuse the intersection count, since |A ∪ B| = |A| + |B| − |A ∩ B|.
Actually, in the LSH approach you calculate the minHash only once for each of your documents and then compare two minHashes for each possible pair of documents. In the trivial approach you perform a full comparison of documents for each possible pair, which is roughly N^2/2 comparisons. Hence the extra cost of calculating minHashes is negligible for a large enough number of documents.
You should actually compare the performance of the trivial approach:
val jSimilarity = colHashed1.intersection(colHashed2).distinct.count/(colHashed1.union(colHashed2).distinct.count.toDouble)
against the performance of just the Jaccard distance calculation (the last lines in your code):
var count = 0
// Count the number of positions in the minhash signature which are equal.
for (k <- 0 to numHashes - 1) {
  if (signature1(k) == signature2(k))
    count = count + 1
}
val jSimilarity = count / numHashes.toDouble
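To make that comparison fair at scale, the signatures should be computed once per document and reused across all pairs. A minimal self-contained sketch of the pairwise step (hypothetical doc IDs and helper names), which costs O(numHashes) per pair regardless of document size:
// Hypothetical setup: each document already reduced to its minhash signature.
val signatures: Map[String, Array[Long]] = Map(
  "doc1" -> Array(3L, 7L, 1L),
  "doc2" -> Array(3L, 9L, 1L),
  "doc3" -> Array(5L, 7L, 1L)
)

def estimatedJaccard(a: Array[Long], b: Array[Long]): Double =
  a.zip(b).count { case (x, y) => x == y }.toDouble / a.length

// Compare every unordered pair once.
for {
  (id1, s1) <- signatures
  (id2, s2) <- signatures if id1 < id2
} println(f"$id1 ~ $id2: ${estimatedJaccard(s1, s2)}%.2f")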

Counting by range

The following script can be used to count by key:
val nbr = List(1, 2, 2, 3, 3, 3, 4, 4, 4, 4)
val nbrPairsRDD = sc.parallelize(nbr).map(nbr => (nbr, 1))
val nbrCountsWithReduce = nbrPairsRDD
  .reduceByKey(_ + _)
  .collect()
nbrCountsWithReduce.foreach(println)
It returns:
(1,1)
(2,2)
(3,3)
(4,4)
How could it be modified to map by range rather than by absolute value, giving the following output for the two ranges 1:2 and 3:4:
(1:2,3)
(3:4,7)
One option is to convert the values to Double and use the histogram function:
val nbr = List(1,2,2,3,3,3,4,4,4,4)
val nbrPairsRDD = sc.parallelize(nbr).map(_.toDouble).histogram(2)
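For reference, RDD.histogram(bucketCount) returns a pair of (bucket boundaries, counts), and on this data the two even-width buckets line up with the expected ranges:
val (buckets, counts) = sc.parallelize(nbr).map(_.toDouble).histogram(2)
// buckets: Array(1.0, 2.5, 4.0), i.e. the buckets [1.0, 2.5) and [2.5, 4.0]
// counts:  Array(3, 7)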
One easy way that I can think of is to map each key to its range, for example:
// function to compute ranges
def computeRange(num: Int): String = {
  if (num < 3) "1:2"
  else if (num < 5) "3:4"
  else "invalid"
}

val nbrRangePairs = sc.parallelize(nbr)
  .map(n => (computeRange(n), 1))
  .reduceByKey(_ + _)
  .collect()
Here is a code snippet to compute aggregations by range:
val nbr = List(1, 2, 2, 3, 3, 3, 4, 4, 4, 4)
val nbrs = sc.parallelize(nbr)

// Note: mutating lb/ub inside the closure assumes the input arrives in
// sorted order; on a cluster each partition mutates its own copy of the
// variables, so this is fragile.
var lb = 1
var incr = 1
var ub = lb + incr

val nbrsMap = nbrs.map(rec => {
  if (rec > ub) {
    lb = rec
    ub = lb + incr
  }
  (lb.toString + ":" + ub.toString, 1)
})

nbrsMap.reduceByKey((acc, value) => acc + value).foreach(println)
(1:2,3)
(3:4,7)
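Since the mutable lb/ub only happen to work when the data arrives sorted, a safer sketch derives the range key purely from the value (assuming fixed-width buckets of 2 starting at 1):
// Order-independent: the key is a pure function of the value.
def rangeKey(n: Int, width: Int = 2): String = {
  val lb = ((n - 1) / width) * width + 1
  s"$lb:${lb + width - 1}"
}

sc.parallelize(nbr)
  .map(n => (rangeKey(n), 1))
  .reduceByKey(_ + _)
  .foreach(println)
// (1:2,3)
// (3:4,7)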

Strange results when using Scala collections

I have some tests with results that I can't quite explain.
The first test does a filter, map and reduce on a list containing 4 elements:
{
  val counter = new AtomicInteger(0)
  val l = List(1, 2, 3, 4)
  val filtered = l.filter { i =>
    counter.incrementAndGet()
    true
  }
  val mapped = filtered.map { i =>
    counter.incrementAndGet()
    i * 2
  }
  val reduced = mapped.reduce { (a, b) =>
    counter.incrementAndGet()
    a + b
  }
  println("counted " + counter.get + " and result is " + reduced)
  assert(20 == reduced)
  assert(11 == counter.get)
}
The counter is incremented 11 times as I expected: once for each element during filtering, once for each element during mapping and three times to add up the 4 elements.
Using wildcards the result changes:
{
  val counter = new AtomicInteger(0)
  val l = List(1, 2, 3, 4)
  val filtered = l.filter {
    counter.incrementAndGet()
    _ > 0
  }
  val mapped = filtered.map {
    counter.incrementAndGet()
    _ * 2
  }
  val reduced = mapped.reduce { (a, b) =>
    counter.incrementAndGet()
    a + b
  }
  println("counted " + counter.get + " and result is " + reduced)
  assert(20 == reduced)
  assert(5 == counter.get)
}
I can't work out how to use wildcards in the reduce (the code doesn't compile), but now the counter is only incremented 5 times!
So, question #1: Why do wildcards change the number of times the counter is called and how does that even work?
Then my second, related question. My understanding of views was that they would lazily execute the functions passed to the monadic methods, but the following code doesn't show that.
{
  val counter = new AtomicInteger(0)
  val l = Seq(1, 2, 3, 4).view
  val filtered = l.filter {
    counter.incrementAndGet()
    _ > 0
  }
  println("after filter: " + counter.get)
  val mapped = filtered.map {
    counter.incrementAndGet()
    _ * 2
  }
  println("after map: " + counter.get)
  val reduced = mapped.reduce { (a, b) =>
    counter.incrementAndGet()
    a + b
  }
  println("after reduce: " + counter.get)
  println("counted " + counter.get + " and result is " + reduced)
  assert(20 == reduced)
  assert(5 == counter.get)
}
The output is:
after filter: 1
after map: 2
after reduce: 5
counted 5 and result is 20
Question #2: How come the functions are being executed immediately?
I'm using Scala 2.10
You're probably thinking that
filter {
  println
  _ > 0
}
means
filter { i =>
  println
  i > 0
}
but Scala has other ideas. The reason is that
{ println; _ > 0 }
is a block that first prints something and then returns the _ > 0 function. So Scala interprets what you're doing as a funny way to specify the function, equivalent to:
val p = { println; (i: Int) => i > 0 }
filter(p)
which in turn is equivalent to
println
val temp = (i: Int) => i > 0 // Temporary name, forget we did this!
val p = temp
filter(p)
which, as you can imagine, doesn't quite work out the way you want: you only print (or, in your case, increment) once, at the beginning. Both your problems stem from this.
Make sure if you're using underscores to mean "fill in the parameter" that you only have a single expression! If you're using multiple statements, it's best to stick to explicitly named parameters.
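As a check (a minimal sketch), with explicitly named parameters the view really is lazy: nothing runs until reduce forces evaluation:
import java.util.concurrent.atomic.AtomicInteger

val counter = new AtomicInteger(0)
val mapped = Seq(1, 2, 3, 4).view
  .filter { i => counter.incrementAndGet(); i > 0 }
  .map { i => counter.incrementAndGet(); i * 2 }
println(counter.get) // 0, neither function has run yet
val reduced = mapped.reduce { (a, b) => counter.incrementAndGet(); a + b }
println(counter.get) // 11, i.e. 4 filters + 4 maps + 3 reductions
println(reduced)     // 20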