Absolute min distance between words in a text - Scala

I'm trying to find the min distance between multiple words in a given text.
Let's suppose I have a string such as: "a b cat dog x y z n m p fox x dog b b cat"
Find the min distance between all matches of the substrings (fox, dog, cat).
There are multiple occurrences of each substring in this text:
One at the beginning:
cat - 4
dog - 8
fox - 24
dist = 24 - 4 = 20
And one at the end of the string:
fox - 24
dog - 30
cat - 38
dist = 38 - 24 = 14
The min dist = 14.
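For a quick sanity check of those offsets, indexOf gives the first occurrence of each word (a throwaway snippet, separate from the solution below):
val text = "a b cat dog x y z n m p fox x dog b b cat"
Seq("cat", "dog", "fox").foreach(w => println(w + " -> " + text.indexOf(w)))
// cat -> 4, dog -> 8, fox -> 24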
This is the algorithm I came up with:
import scala.collection.mutable.ListBuffer

object MinKWindowSum {

  def main(args: Array[String]): Unit = {
    val document =
      """This Hello World is a huge text with thousands
Java of Hello words and Scala other lines and World and many other Hello docs
Words of World in many langs Hello and features
Java Scala AXVX TXZX ASDQWE OWEQ World asb eere qwerer
asdasd Scala Java Hello docs World KLKM NWQEW ZXCASD OPOOIK Scala ASDSA
"""
    println(getMinWindowSize(document, "Hello World Scala"))
  }

  def getMinWindowSize(str: String, s: String): Int = {
    /* Creates a list of tuples List[(String, Int)] which contains each keyword and its
       respective index found in the text, sorted in order by index.
     */
    val keywords = s.split(" ").toSet
    val idxs = keywords.map(k => (k -> ("(?i)\\Q" + k + "\\E").r.findAllMatchIn(str).map(_.start)))
      .map { case (keyword, itr) => itr.map((keyword, _)) }
      .flatMap(identity).toSeq
      .sortBy(_._2)

    // Calculates the min window on the next step.
    var min = Int.MaxValue
    var minI, minJ = -1

    // current window indexes and words
    var currIdxs = ListBuffer[Int]()
    var currWords = ListBuffer[String]()

    for (idx <- idxs) {
      // check if word exists in window already
      val idxOfWord = currWords.indexOf(idx._1)
      if (!currWords.isEmpty && idxOfWord != -1) {
        currWords = currWords.drop(idxOfWord + 1)
        currIdxs = currIdxs.drop(idxOfWord + 1)
      }
      currWords += idx._1
      currIdxs += idx._2
      // if all keys are present, check if it is a new min window
      if (keywords.size == currWords.length) {
        val currMin = Math.abs(currIdxs.last - currIdxs.head)
        if (min > currMin) {
          min = currMin
          minI = currIdxs.head
          minJ = currIdxs.last
        }
      }
    }
    println("min = " + min + " ,i = " + minI + " j = " + minJ)
    min
  }
}
In the example above we try to find the min distance between all matches of "Hello World Scala".
The shortest window is found between the indexes:
i = 235, j = 257 --> min = 22
I was wondering whether there is a more idiomatic way of doing this, or one that is better in terms of efficiency, scalability, readability and simplicity?

Here is a slightly "more functional" alternative:
val document =
  """This Hello World is a huge text with thousands Java of Hello words and Scala other lines and World and many other Hello docs
Words of World in many langs Hello and features Java Scala AXVX TXZX ASDQWE OWEQ World
"""

val WORDS = Set("Hello", "World", "Scala")

val minDistance = document.trim
  .split(" ")
  .foldLeft((List[(String, Int)](), None: Option[Int], 0)) {
    case ((words, min, idx), word) if WORDS.contains(word) =>
      val newWords = (word, idx) :: words.filter(_._1 != word)
      if (newWords.map(_._1).toSet == WORDS) { // toSet on only 3 elmts
        val idxes = newWords.map(_._2)
        val dist = idxes.max - idxes.min
        val newMin = min match {
          case None => dist
          case Some(min) if min < dist => min
          case _ => dist
        }
        (newWords, Some(newMin), idx + word.length + 1)
      }
      else {
        (newWords, min, idx + word.length + 1)
      }
    case ((words, min, idx), word) =>
      (words, min, idx + word.length + 1)
  }
  ._2

println(minDistance)
which produces:
Some(38)

My approach starts with a similar premise but uses a tail-recursive helper method to search the indexed words.
def getMinWindowSize(str :String, s :String) :Int = {
  val keywords = s.split("\\s+").toSet
  val re = "(?i)\\b(" + keywords.mkString("|") + ")\\b"
  val idxs = re.r.findAllMatchIn(str).map(w => w.start -> w.toString).toList

  def dist(input :List[(Int, String)], keys :Set[String]) :Option[Int] = input match {
    case Nil => None
    case (idx, word) :: rest =>
      if (keys(word) && keys.size == 1) Some(idx)
      else dist(rest, keys diff Set(word))
  }

  idxs.tails.collect{
    case (idx, word) :: rest => dist(rest, keywords diff Set(word)).map(_ - idx)
  }.flatten.reduceOption(_ min _).getOrElse(-1)
}
No mutable variables or data structures. I also used Option to help return a more meaningful value if no minimum window is possible.
Usage:
getMinWindowSize(document, "Hello World Scala") //res0: Int = 22
getMinWindowSize(document, "Hello World Scal") //res1: Int = -1

Related

Longest alphabetical order substring from a String in Scala

I need to write the below logic in Scala code.
I have a string, let's say 'abcsfdhdefghihqwtpqr'.
I need to print the longest substring from the above that is in alphabetical order.
For example, from the above string the substrings in alphabetical order are
abc, defghi, pqr and the longest is defghi, so the result will be defghi.
So how do I write this logic in Scala?
Below is the code I have written:
def main(args: Array[String]): Unit = {
  val setofletters: String = "aaakkcccccczz"
  /* 15 */
  val output: Int = runLongestIndex(setofletters)
  println("Longest run that first appeared in index:" + output)
}

def runLongestIndex(setofletters: String): Int = {
  var ctr: Int = 1
  var output: Int = 0
  var j: Int = 0
  for (i <- 0 until setofletters.length - 1) {
    j = i
    while (i < setofletters.length - 1 &&
           setofletters.charAt(i) == setofletters.charAt(i + 1)) {
      { i += 1; i - 1 }
      { ctr += 1; ctr - 1 }
    }
    if (ctr > output) {
      output = j
    }
    ctr = 1
  }
  output
}
but I am getting the error "+= is not a member of Int".
Can you help me change the code and resolve this error?
Your code uses many mutable variables and doesn't look very Scala-like at all.
Here's a different approach.
val str = "abcsfdhdefghihqwtpqr"
List.unfold(str) { s =>
  if (s.lengthIs > 1) {
    val pairs = s.sliding(2)
    val (a, b) = s.splitAt(pairs.indexWhere(p => p(0) > p(1)) + 1)
    if (a.isEmpty) Option(b, "")
    else Option(a, b)
  }
  else if (s.nonEmpty) Option(s, "")
  else None
}.maxBy(_.length) //res0: String = defghi
Note: unfold() is newly available with Scala 2.13.
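For reference, unfold builds a collection from a seed state: the function keeps returning Some((element, nextState)) to produce elements and None to stop. A tiny, unrelated illustration:
List.unfold(5)(n => if (n > 0) Some((n, n - 1)) else None)
//res: List[Int] = List(5, 4, 3, 2, 1)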
I don't have 2.11 installed to test with, but the following should work there.
val str = "abcsfdhdefghihqwtpqr"
assert(str.nonEmpty)
(str.head + str).sliding(2).foldRight("" :: Nil) {
  case (p, hd :: tl) =>
    if (p(0) > p(1)) "" :: p(1) + hd :: tl
    else p(1) + hd :: tl
  case _ => Nil // just to suppress the warning
}.maxBy(_.length) //res1: String = defghi
Explanation
sliding(2) - Pair up each letter with its neighbor: "ab","bc","cs", etc.
foldRight - Process each pair from right to left (end to start).
""::Nil - The accumulator will be a List[String] starting with a single element of an empty string. (Could also have been written as List("").)
case (p, hd::tl) - Put the current pair of letters to be processed into the variable p. Split the accumulator into its head and tail parts.
p(1) + hd :: tl - The 2nd letter of the pair is always added (pre-pended) to the current head of the accumulator. If the two letters are not in alphabetical order then a new, empty, head element is also added to the accumulator.
str.head+str - Because only the 2nd letter of each pair is being added to the accumulator, we have to make an adjustment so that the 1st letter of the original string is also included.
maxBy(_.length) - Pretty easy to understand. Comment this out to see the result of the foldRight operation.
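If I trace the foldRight above by hand (worth double-checking in a REPL), dropping the final maxBy should leave exactly the ascending runs of the string:
//res2: List[String] = List(abcs, f, dh, defghi, hqw, t, pqr)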

Scala: Loop through a file 20 bytes at a time and blank out the byte at the 3rd position

I have a code snippet in Java that loops through the file byte by byte and blanks out the byte at the 3rd position in every 20 bytes. This is done using a for-each loop.
Logic:
int pos = 0;
for (byte b : raw) {
    if (pos == 3) b = 32;
    pos++;
    if (pos == 20) pos = 0;
}
Since I am learning Scala, I would like to know if there is a better way of looping byte by byte in Scala.
I have read the file into a byte array as below in Scala:
val result = IOUtils.toByteArray(new FileInputStream (new File(fileDir)))
Thanks.
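For comparison, a literal byte-by-byte port of the Java logic above might look like this (a minimal sketch, assuming "3rd position in every 20 bytes" means index 3 within each 20-byte block):
// In-place, index-based port of the Java loop, applied to the Array[Byte] read above.
for (i <- result.indices) {
  if (i % 20 == 3) result(i) = 32
}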
Here is a diametrically opposite solution to that of Tzach Zohar:
def parallel(ba: Array[Byte], blockSize: Int = 2048): Unit = {
  val n = ba.size
  val numJobs = (n + blockSize - 1) / blockSize
  (0 until numJobs).par.foreach { i =>
    val startIdx = i * blockSize
    val endIdx = n min ((i + 1) * blockSize)
    var j = startIdx + ((3 - startIdx) % 20 + 20) % 20
    while (j < endIdx) {
      ba(j) = 32
      j += 20
    }
  }
}
You see a lot of mutable variables, scary imperative while-loops, and some strange tricks with modular arithmetic. That's actually not idiomatic Scala at all. But the interesting thing about this solution is that it processes blocks of the byte array in parallel. I've compared the time needed by this solution to your naive solution, using various block sizes:
Naive: 38.196
Parallel( 16): 11.676000
Parallel( 32): 7.260000
Parallel( 64): 4.311000
Parallel( 128): 2.757000
Parallel( 256): 2.473000
Parallel( 512): 2.462000
Parallel(1024): 2.435000
Parallel(2048): 2.444000
Parallel(4096): 2.416000
Parallel(8192): 2.420000
At least in this not very thorough microbenchmark (1000 repetitions on a 10MB array), the more-or-less efficiently implemented parallel version outperformed the for-loop in your question by a factor of about 15x.
The question is now: What do you mean by "better"?
My proposal was noticeably faster than your naive approach.
@TzachZohar's functional solution could generalize better should the code be moved to a cluster framework like Apache Spark.
I would usually prefer something closer to @TzachZohar's solution, because it's easier to read.
So, it depends on what you are optimizing for: performance? generality? readability? maintainability? For each notion of "better", you could get a different answer. I've tried to optimize for performance. @TzachZohar optimized for readability and maintainability. That led to two rather different solutions.
Full code of the microbenchmark, just in case someone is interested:
val array = Array.ofDim[Byte](10000000)

def naive(ba: Array[Byte]): Unit = {
  var pos = 0
  for (i <- 0 until ba.size) {
    if (pos == 3) ba(i) = 32
    pos += 1
    if (pos == 20) pos = 0
  }
}

def parallel(ba: Array[Byte], blockSize: Int): Unit = {
  val n = ba.size
  val numJobs = (n + blockSize - 1) / blockSize
  (0 until numJobs).par.foreach { i =>
    val startIdx = i * blockSize
    val endIdx = n min ((i + 1) * blockSize)
    var j = startIdx + ((3 - startIdx) % 20 + 20) % 20
    while (j < endIdx) {
      ba(j) = 32
      j += 20
    }
  }
}

def measureTime[U](repeats: Long)(block: => U): Double = {
  val start = System.currentTimeMillis
  var iteration = 0
  while (iteration < repeats) {
    iteration += 1
    block
  }
  val end = System.currentTimeMillis
  (end - start).toDouble / repeats
}

println("Basic sanity check (did I get the modulo arithmetic right?):")
{
  val testArray = Array.ofDim[Byte](50)
  naive(testArray)
  println(testArray.mkString("[", ",", "]"))
}

{
  for (blockSize <- List(3, 7, 13, 16, 17, 32)) {
    val testArray = Array.ofDim[Byte](50)
    parallel(testArray, blockSize)
    println(testArray.mkString("[", ",", "]"))
  }
}

val Reps = 1000
val naiveTime = measureTime(Reps)(naive(array))
println("Naive: " + naiveTime)

for (blockSize <- List(16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192)) {
  val parallelTime = measureTime(Reps)(parallel(array, blockSize))
  println("Parallel(%4d): %f".format(blockSize, parallelTime))
}
Here's one way to do this:
val updated = result.grouped(20).flatMap { arr =>
  if (arr.length > 3) arr.update(3, 32) // guard: the final chunk may be shorter than 4 bytes
  arr
}
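Note that grouped(20) yields an Iterator of array chunks and flatMap flattens them back into an Iterator[Byte], so updated has to be materialized (for example with .toArray) before it can be written back out as an Array[Byte]:
// Materialize the lazily-updated iterator into a fresh byte array.
val updatedBytes: Array[Byte] = updated.toArray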

Combine multiple sequential entries in Scala/Spark

I have an array of numbers separated by commas as shown:
a:{108,109,110,112,114,115,116,118}
I need the output something like this:
a:{108-110, 112, 114-116, 118}
I am trying to group the continuous numbers with "-" in between.
For example, 108,109,110 are continuous numbers, so I get 108-110. 112 is separate entry; 114,115,116 again represents a sequence, so I get 114-116. 118 is separate and treated as such.
I am doing this in Spark. I wrote the following code:
import scala.collection.mutable.ArrayBuffer

def Sample(x: String): ArrayBuffer[String] = {
  val x1 = x.split(",")
  var a: Int = 0
  var present = ""
  var next: Int = 0
  var yrTemp = ""
  var yrAr = ArrayBuffer[String]()
  var che: Int = 0
  var storeV = ""
  var p: Int = 0
  var q: Int = 0
  var count: Int = 1

  while (a < x1.length) {
    yrTemp = x1(a)

    if (x1.length == 1) {
      yrAr += x1(a)
    }
    else if (a < x1.length - 1) {
      present = x1(a)
      if (che == 0) {
        storeV = present
      }
      p = x1(a).toInt
      q = x1(a + 1).toInt
      if (p == q) {
        yrTemp = yrTemp
        che = 1
      }
      else if (p != q) {
        yrTemp = storeV + "-" + present
        che = 0
        yrAr += yrTemp
      }
    }
    else if (a == x1.length - 1) {
      present = x1(a)
      yrTemp = present
      che = 0
      yrAr += yrTemp
    }

    a = a + 1
  }
  yrAr
}

val SampleUDF = udf(Sample(_: String))
I am getting the output as follows:
a:{108-108, 109-109, 110-110, 112, 114-114, 115-115, 116-116, 118}
I am not able to figure out where I am going wrong. Can you please help me correct this? TIA.
Here's another way:
def rangeToString(a: Int, b: Int) = if (a == b) s"$a" else s"$a-$b"

def reduce(xs: Seq[Int], min: Int, max: Int, ranges: Seq[String]): Seq[String] = xs match {
  case y +: ys if (y - max <= 1) => reduce(ys, min, y, ranges)
  case y +: ys => reduce(ys, y, y, ranges :+ rangeToString(min, max))
  case Seq() => ranges :+ rangeToString(min, max)
}

def output(xs: Array[Int]) = reduce(xs, xs.head, xs.head, Vector()) //.toArray
Which you can test:
println(output(Array(108,109,110,112,114,115,116,118)))
// Vector(108-110, 112, 114-116, 118)
Basically this is a tail recursive function - i.e. you take your "variables" as the input, then it calls itself with updated "variables" on each loop. So here xs is your array, min and max are integers used to keep track of the lowest and highest numbers so far, and ranges is the output sequence of Strings that gets added to when required.
The first pattern (y being the first element, and ys being the rest of the sequence - because that's how the +: extractor works) is matched if there's at least one element (ys can be an empty list) and it follows on from the previous maximum.
The second is if it doesn't follow on, and needs to reset the minimum and add the completed range to the output.
The third case is where we've got to the end of the input and just output the result, rather than calling the loop again.
Internet karma points to anyone who can work out how to eliminate the duplication of ranges :+ rangeToString(min, max)!
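One possibility (my own sketch, not from the answer above) is to build the completed range once, before deciding whether to keep recursing or stop:
def reduce(xs: Seq[Int], min: Int, max: Int, ranges: Seq[String]): Seq[String] = xs match {
  case y +: ys if (y - max <= 1) => reduce(ys, min, y, ranges)
  case other =>
    val closed = ranges :+ rangeToString(min, max) // the completed range is emitted exactly once here
    other match {
      case y +: ys => reduce(ys, y, y, closed)
      case _       => closed
    }
}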
Here is a solution:
def combineConsecutive(s: String): Seq[String] = {
  val ints: List[Int] = s.split(',').map(_.toInt).toList.reverse
  ints
    .drop(1)
    .foldLeft(List(List(ints.head))) { (acc, e) =>
      if ((acc.head.head - e) <= 1)
        (e :: acc.head) :: acc.tail
      else
        List(e) :: acc
    }
    .map(group => if (group.size > 1) group.min + "-" + group.max else group.head.toString)
}
val in = "108,109,110,112,114,115,116,118"
val result = combineConsecutive(in)
println(result) // List(108-110, 112, 114-116, 118)
This solution partly uses code from this question: Grouping list items by comparing them with their neighbors

How to combine the results of Spark computations in the following case?

The task is to calculate the average of each of the columns corresponding to each class. The class number is given in the first column.
I am giving part of the test file for better clarity.
2 0.819039 -0.408442 0.120827
3 -0.063763 0.060122 0.250393
4 -0.304877 0.379067 0.092391
5 -0.168923 0.044400 0.074417
1 0.053700 -0.088746 0.228501
2 0.196758 0.035607 0.008134
3 0.006971 -0.096478 0.123718
4 0.084281 0.278343 -0.350414
So the task is to calculate, for each class:
1: avg(column 1), avg(column 2), avg(column 3)
... and similarly for every other class.
I am very new to Scala. After juggling with the code a lot, I came up with the following:
val inputfile = sc.textFile("testfile.txt")
val myArray = inputfile.map { line =>
  (line.split(" ").toList)
}

var Avgmap: Map[String, List[Double]] = Map()
var countmap: Map[String, Int] = Map()

for (a <- myArray) {
  //println( "Value of a: " + a + " " + a.size );
  if (!countmap.contains(a(0))) {
    countmap += (a(0) -> 0)
    Avgmap += (a(0) -> List.fill(a.size - 1)(1.0))
  }
  var c = countmap(a(0)) + 1
  val countmap2 = countmap + (a(0) -> c)
  countmap = countmap2

  var p = List[Double]()
  for (i <- 1 to a.size - 1) {
    var temp = (Avgmap(a(0))(i - 1) * (countmap(a(0)) - 1) + a(i).toDouble) / countmap(a(0))
    // println("i: "+i+" temp: "+temp)
    var q = p :+ temp
    p = q
  }
  val Avgmap2 = Avgmap + (a(0) -> p)
  Avgmap = Avgmap2

  println("--------------------------------------------------")
  println(countmap)
  println(Avgmap)
}
When I execute this code, I seem to be getting the results in two halves of the dataset. Please help me combine them.
Edit: about the variables I am using: countmap keeps a record of classnumber -> number of vectors encountered so far. Similarly, Avgmap keeps a record of the running average of each column corresponding to that key.
First, use the DataFrame API. Second, what you want is just one row:
df.select(df.columns.map(c => mean(col(c))) :_*).show
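If the goal is really one row of averages per class, as the question describes, rather than a single global row, a sketch along the same DataFrame lines could look like the following. The column names (class, c1, c2, c3) are assumptions about the schema, not something given in the question:
import org.apache.spark.sql.functions.avg

// Group by the class column and average the remaining (assumed) feature columns.
val perClassAvg = df.groupBy("class").agg(avg("c1"), avg("c2"), avg("c3"))
perClassAvg.show()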

Spark Jaccard similarity computation by min hashing slow compared to trivial approach

Given 2 huge lists of values, I am trying to compute the Jaccard similarity between them in Spark using Scala.
Assume colHashed1 contains the first list of values and colHashed2 contains the second list.
Approach 1 (trivial approach):
val jSimilarity = colHashed1.intersection(colHashed2).distinct.count/(colHashed1.union(colHashed2).distinct.count.toDouble)
Approach 2 (using minHashing):
I have used the approach explained here.
import java.util.zip.CRC32

def getCRC32(s: String): Int = {
  val crc = new CRC32
  crc.update(s.getBytes)
  return crc.getValue.toInt & 0xffffffff
}

val maxShingleID = Math.pow(2, 32) - 1

def pickRandomCoeffs(kIn: Int): Array[Int] = {
  var k = kIn
  val randList = Array.fill(k){0}
  while (k > 0) {
    // Get a random shingle ID.
    var randIndex = (Math.random() * maxShingleID).toInt
    // Ensure that each random number is unique.
    while (randList.contains(randIndex)) {
      randIndex = (Math.random() * maxShingleID).toInt
    }
    // Add the random number to the list.
    k = k - 1
    randList(k) = randIndex
  }
  return randList
}

val colHashed1 = list1Values.map(a => getCRC32(a))
val colHashed2 = list2Values.map(a => getCRC32(a))

val nextPrime = 4294967311L
val numHashes = 10

val coeffA = pickRandomCoeffs(numHashes)
val coeffB = pickRandomCoeffs(numHashes)

var signature1 = Array.fill(numHashes){0}
for (i <- 0 to numHashes - 1) {
  // Evaluate the hash function.
  val hashCodeRDD = colHashed1.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime))
  // Track the lowest hash code seen.
  signature1(i) = hashCodeRDD.min.toInt
}

var signature2 = Array.fill(numHashes){0}
for (i <- 0 to numHashes - 1) {
  // Evaluate the hash function.
  val hashCodeRDD = colHashed2.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime))
  // Track the lowest hash code seen.
  signature2(i) = hashCodeRDD.min.toInt
}

var count = 0
// Count the number of positions in the minhash signature which are equal.
for (k <- 0 to numHashes - 1) {
  if (signature1(k) == signature2(k))
    count = count + 1
}

val jSimilarity = count / numHashes.toDouble
Approach 1 always seems to outperform Approach 2 in terms of time. When I analyzed the code, the min() function call on the RDD in Approach 2 takes significant time, and that function is called many times depending on how many hash functions are used.
The intersection and union operations used in Approach 1 seem to work faster compared to the repeated min() function calls.
I don't understand why minHashing does not help here. I expected minHashing to be faster than the trivial approach. Is there anything I am doing wrong here?
Sample data can be viewed here
JaccardSimilarity with MinHash is not giving consistent results:
import java.util.zip.CRC32

object Jaccard {

  def getCRC32(s: String): Int = {
    val crc = new CRC32
    crc.update(s.getBytes)
    return crc.getValue.toInt & 0xffffffff
  }

  def pickRandomCoeffs(kIn: Int, maxShingleID: Double): Array[Int] = {
    var k = kIn
    val randList = Array.ofDim[Int](k)
    while (k > 0) {
      // Get a random shingle ID.
      var randIndex = (Math.random() * maxShingleID).toInt
      // Ensure that each random number is unique.
      while (randList.contains(randIndex)) {
        randIndex = (Math.random() * maxShingleID).toInt
      }
      // Add the random number to the list.
      k = k - 1
      randList(k) = randIndex
    }
    return randList
  }

  def approach2(list1Values: List[String], list2Values: List[String]) = {
    val maxShingleID = Math.pow(2, 32) - 1

    val colHashed1 = list1Values.map(a => getCRC32(a))
    val colHashed2 = list2Values.map(a => getCRC32(a))

    val nextPrime = 4294967311L
    val numHashes = 10

    val coeffA = pickRandomCoeffs(numHashes, maxShingleID)
    val coeffB = pickRandomCoeffs(numHashes, maxShingleID)

    val signature1 = for (i <- 0 until numHashes) yield {
      val hashCodeRDD = colHashed1.map(ele => (coeffA(i) * ele + coeffB(i)) % nextPrime)
      hashCodeRDD.min.toInt // Track the lowest hash code seen.
    }

    val signature2 = for (i <- 0 until numHashes) yield {
      val hashCodeRDD = colHashed2.map(ele => (coeffA(i) * ele + coeffB(i)) % nextPrime)
      hashCodeRDD.min.toInt // Track the lowest hash code seen
    }

    val count = (0 until numHashes)
      .map(k => if (signature1(k) == signature2(k)) 1 else 0)
      .fold(0)(_ + _)

    val jSimilarity = count / numHashes.toDouble
    jSimilarity
  }

  // def approach1(list1Values: List[String], list2Values: List[String]) = {
  //   val colHashed1 = list1Values.toSet
  //   val colHashed2 = list2Values.toSet
  //
  //   val jSimilarity = colHashed1.intersection(colHashed2).distinct.count / (colHashed1.union(colHashed2).distinct.count.toDouble)
  //   jSimilarity
  // }

  def approach1(list1Values: List[String], list2Values: List[String]) = {
    val colHashed1 = list1Values.toSet
    val colHashed2 = list2Values.toSet
    val jSimilarity = (colHashed1 & colHashed2).size / (colHashed1 ++ colHashed2).size.toDouble
    jSimilarity
  }

  def main(args: Array[String]) {
    val list1Values = List("a", "b", "c")
    val list2Values = List("a", "b", "d")

    for (i <- 0 until 5) {
      println(s"Iteration ${i}")
      println(s" - Approach 1: ${approach1(list1Values, list2Values)}")
      println(s" - Approach 2: ${approach2(list1Values, list2Values)}")
    }
  }
}
OUTPUT:
Iteration 0
- Approach 1: 0.5
- Approach 2: 0.5
Iteration 1
- Approach 1: 0.5
- Approach 2: 0.5
Iteration 2
- Approach 1: 0.5
- Approach 2: 0.8
Iteration 3
- Approach 1: 0.5
- Approach 2: 0.8
Iteration 4
- Approach 1: 0.5
- Approach 2: 0.4
Why are you using it, then?
It seems to me that the overhead cost of the minHashing approach simply outweighs its usefulness in Spark, especially as numHashes increases.
Here are some observations I've found in your code:
First, the while (randList.contains(randIndex)) part will surely slow down your process as numHashes (which, by the way, is equal to the size of randList) increases.
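One way around that repeated linear contains() scan (my own sketch, not part of the original code) is to collect the coefficients in a set, so the uniqueness check is effectively constant time:
import scala.util.Random
import scala.collection.mutable

def pickRandomCoeffsFast(k: Int, maxShingleID: Double): Array[Int] = {
  val seen = mutable.LinkedHashSet.empty[Int]
  while (seen.size < k) {
    seen += (Random.nextDouble() * maxShingleID).toInt // duplicates are simply ignored by the set
  }
  seen.toArray
}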
Second, you can save some time if you rewrite this code:
var signature1 = Array.fill(numHashes){0}
for (i <- 0 to numHashes - 1) {
  // Evaluate the hash function.
  val hashCodeRDD = colHashed1.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime))
  // Track the lowest hash code seen.
  signature1(i) = hashCodeRDD.min.toInt
}

var signature2 = Array.fill(numHashes){0}
for (i <- 0 to numHashes - 1) {
  // Evaluate the hash function.
  val hashCodeRDD = colHashed2.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime))
  // Track the lowest hash code seen.
  signature2(i) = hashCodeRDD.min.toInt
}

var count = 0
// Count the number of positions in the minhash signature which are equal.
for (k <- 0 to numHashes - 1) {
  if (signature1(k) == signature2(k))
    count = count + 1
}
into
var count = 0
for (i <- 0 to numHashes - 1) {
  val hashCodeRDD1 = colHashed1.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime))
  val hashCodeRDD2 = colHashed2.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime))

  val sig1 = hashCodeRDD1.min.toInt
  val sig2 = hashCodeRDD2.min.toInt

  if (sig1 == sig2) { count = count + 1 }
}
This method simplifies the three loops into one. However, I am not sure if that would give a huge boost in computational time.
One other suggestion I have, assuming that the first approach still turns out to be much faster, is to use the properties of sets to modify the first approach:
val colHashed1_dist = colHashed1.distinct
val colHashed2_dist = colHashed2.distinct
val intersect_cnt = colHashed1_dist.intersection(colHashed2_dist).distinct.count
val jSimilarity = intersect_cnt/(colHashed1_dist.count + colHashed2_dist.count - intersect_cnt).toDouble
With that, instead of computing the union explicitly, you can just reuse the value of the intersection, since |A ∪ B| = |A| + |B| - |A ∩ B|.
Actually, in the LSH approach you would calculate the minHash only once for each of your documents and then compare two minHashes for each possible pair of documents. In the case of the trivial approach you would perform a full comparison of documents for each possible pair, which is roughly N^2/2 comparisons. Hence the extra cost of calculating minHashes is negligible for a large enough number of documents.
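To put rough numbers on that (my own back-of-the-envelope, not from the answer): with N = 10,000 documents there are about N^2/2 = 5 * 10^7 pairs to compare. Computing a 10-value minhash signature once per document is on the order of 10^5 hash evaluations, after which each pair comparison touches only 10 integers instead of two full sets.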
You should actually compare the performance of the trivial approach:
val jSimilarity = colHashed1.intersection(colHashed2).distinct.count/(colHashed1.union(colHashed2).distinct.count.toDouble)
and performance of the Jaccard distance calculation (last lines in your code):
var count = 0
// Count the number of positions in the minhash signature which are equal.
for (k <- 0 to numHashes - 1) {
  if (signature1(k) == signature2(k))
    count = count + 1
}
val jSimilarity = count / numHashes.toDouble