How to get multiple adjacent data in a RDD with Scala Spark - scala

I have a RDD, the rdd's value is 0 or 1, and a limit is 4. When I map the RDD, if rdd's value is 1 then the values from the current position to the (current position+limit) are all 1 else there are 0 0 .
example.
input : 1,0,0,0,0,0,1,0,0
expected output : 1,1,1,1,0,0,1,1,1
This is what I have tried so far :
val rdd = sc.parallelize(Array(1, 0, 0, 0, 0, 0, 1, 0, 0))
val limit = 4
val resultlimit = rdd.mapPartitions(parIter => {
var result = new ArrayBuffer[Int]()
var resultIter = new ArrayBuffer[Int]()
while (parIter.hasNext) {
val iter = parIter.next()
resultIter.append(iter)
}
var i = 0
while (i < resultIter.length) {
result.append(resultIter(i))
if (resultIter(i) == 1) {
var j = 1
while (j + i < resultIter.length && j < limit) {
result.append(1)
j += 1
}
i += j
} else {
i += 1
}
}
result.toIterator
})
resultlimit.foreach(println)
The result of resultlimit is RDD:[1,1,1,1,0,0,1,1,1]
My quick and dirty approach is to first create an Array but that is so ugly and inefficient.
Is there any cleaner solution?

Plain and simple. Import RDDFunctions
import org.apache.spark.mllib.rdd.RDDFunctions._
Define a limit:
val limit: Int = 4
Perpend limit - 1 zeros to the first partition:
val extended = rdd.mapPartitionsWithIndex {
case (0, iter) => Seq.fill(limit - 1)(0).toIterator ++ iter
case (_, iter) => iter
}
Slide over the RDD:
val result = extended.sliding(limit).map {
slice => if (slice.exists(_ != 0)) 1 else 0
}
Check the result:
val expected = Seq(1,1,1,1,0,0,1,1,1)
require(expected == result.collect.toSeq)
On a side note, your current approach doesn't correct for partition boundaries, therefore result will vary depending on the source.

Following is an improved approach to your requirement. Three while loops are reduced to one for loop and two ArrayBuffers are reduced to one ArrayBuffer. So processing time and memory usage both are reduced.
val resultlimit= rdd.mapPartitions(parIter => {
var result = new ArrayBuffer[Int]()
var limit = 0
for (value <- parIter) {
if (value == 1) limit = 4
if (limit > 0) {
result.append(1)
limit -= 1
}
else {
result.append(value)
}
}
result.toIterator
})
Edited
Above solution is when you don't have a partition defined in the original rdd. But when a partition is defined as
val rdd = sc.parallelize(Array(1,1,0,0,0,0,1,0,0), 4)
We need to collect the rdds as above solution will get executed on each partitions.
So the following solution should work
var result = new ArrayBuffer[Int]()
var limit = 0
for (value <- rdd.collect()) {
if (value == 1) limit = 4
if (limit > 0) {
result.append(1)
limit -= 1
}
else {
result.append(value)
}
}
result.foreach(println)

Related

Scala: Problem with foldLeft with negative numbers in list

I am writing a Scala function that returns the sum of even elements in a list, minus sum of odd elements in a list. I cannot use mutables, recursion or for/while loops for my solution. The code below passes 2/3 tests, but I can't seem to figure out why it can't compute the last test correctly.
def sumOfEvenMinusOdd(l: List[Int]) : Int = {
if (l.length == 0) return 0
val evens = l.filter(_%2==0)
val odds = l.filter(_%2==1)
val evenSum = evens.foldLeft(0)(_+_)
val oddSum = odds.foldLeft(0)(_+_)
evenSum-oddSum
}
//BEGIN TESTS
val i1 = sumOfEvenMinusOdd(List(1,3,5,4,5,2,1,0)) //answer: -9
val i2 = sumOfEvenMinusOdd(List(2,4,5,6,7,8,10)) //answer: 18
val i3 = sumOfEvenMinusOdd(List(109, 19, 12, 1, -5, -120, -15, 30,-33,-13, 12, 19, 3, 18, 1, -1)) //answer -133
My code is outputting this:
defined function sumOfEvenMinusOdd
i1: Int = -9
i2: Int = 18
i3: Int = -200
I am extremely confused why these negative numbers are tripping up the rest of my code. I saw a post explaining the order of operations with foldLeft foldRight, but even changing to foldRight still yields i3: Int = -200. Is there a detail I'm missing? Any guidance / help would be greatly appreciated.
The problem isn't foldLeft or foldRight, the problem is the way you filter out odd values:
val odds = l.filter(_ % 2 == 1)
Should be:
val odds = l.filter(_ % 2 != 0)
The predicate _ % 2 == 1 will only yield true for positive elements. For example, the expression -15 % 2 is equal to -1, and not 1.
As as side note, we can also make this a bit more efficient:
def sumOfEvenMinusOdd(l: List[Int]): Int = {
val (evenSum, oddSum) = l.foldLeft((0, 0)) {
case ((even, odd), element) =>
if (element % 2 == 0) (even + element, odd) else (even, odd + element)
}
evenSum - oddSum
}
Or even better by accumulating the difference only:
def sumOfEvenMinusOdd(l: List[Int]): Int = {
l.foldLeft(0) {
case (diff, element) =>
diff + element * (if (element % 2 == 0) 1 else -1)
}
}
The problem is on the filter condition that you apply on list to find odd numbers.
the odd condition that you doesn't work for negative odd number because mod 2 return -1 for this kind of number.
number % 2 == 0 if number is even
number % 2 != 0 if number is odd
so if you change the filter conditions all works as expected.
Another suggestion:
Why you want use foldleft function for a simple sum operation when you can use directly the sum functions?
test("Test sum Of even minus odd") {
def sumOfEvenMinusOdd(l: List[Int]) : Int = {
val evensSum = l.filter(_%2 == 0).sum
val oddsSum = l.filter(_%2 != 0).sum
evensSum-oddsSum
}
assert(sumOfEvenMinusOdd(List.empty[Int]) == 0)
assert(sumOfEvenMinusOdd(List(1,3,5,4,5,2,1,0)) == -9) //answer: -9
assert(sumOfEvenMinusOdd(List(2,4,5,6,7,8,10)) == 18) //answer: 18
assert(sumOfEvenMinusOdd(List(109, 19, 12, 1, -5, -120, -15, 30,-33,-13, 12, 19, 3, 18, 1, -1)) == -133)
}
With this solution your function is more clear and you can remove the if on the funciton

Combine multiple sequential entries in Scala/Spark

I have an array of numbers separated by comma as shown:
a:{108,109,110,112,114,115,116,118}
I need the output something like this:
a:{108-110, 112, 114-116, 118}
I am trying to group the continuous numbers with "-" in between.
For example, 108,109,110 are continuous numbers, so I get 108-110. 112 is separate entry; 114,115,116 again represents a sequence, so I get 114-116. 118 is separate and treated as such.
I am doing this in Spark. I wrote the following code:
import scala.collection.mutable.ArrayBuffer
def Sample(x:String):ArrayBuffer[String]={
val x1 = x.split(",")
var a:Int = 0
var present=""
var next:Int = 0
var yrTemp = ""
var yrAr= ArrayBuffer[String]()
var che:Int = 0
var storeV = ""
var p:Int = 0
var q:Int = 0
var count:Int = 1
while(a < x1.length)
{
yrTemp = x1(a)
if(x1.length == 1)
{
yrAr+=x1(a)
}
else
if(a < x1.length - 1)
{
present = x1(a)
if(che == 0)
{
storeV = present
}
p = x1(a).toInt
q = x1(a+1).toInt
if(p == q)
{
yrTemp = yrTemp
che = 1
}
else
if(p != q)
{
yrTemp = storeV + "-" + present
che = 0
yrAr+=yrTemp
}
}
else
if(a == x1.length-1)
{
present = x1(a)
yrTemp = present
che = 0
yrAr+=yrTemp
}
a = a+1
}
yrAr
}
val SampleUDF = udf(Sample(_:String))
I am getting the output as follows:
a:{108-108, 109-109, 110-110, 112, 114-114, 115-115, 116-116, 118}
I am not able to figure out where I am going wrong. Can you please help me in correcting this. TIA.
Here's another way:
def rangeToString(a: Int, b: Int) = if (a == b) s"$a" else s"$a-$b"
def reduce(xs: Seq[Int], min: Int, max: Int, ranges: Seq[String]): Seq[String] = xs match {
case y +: ys if (y - max <= 1) => reduce(ys, min, y, ranges)
case y +: ys => reduce(ys, y, y, ranges :+ rangeToString(min, max))
case Seq() => ranges :+ rangeToString(min, max)
}
def output(xs: Array[Int]) = reduce(xs, xs.head, xs.head, Vector())//.toArray
Which you can test:
println(output(Array(108,109,110,112,114,115,116,118)))
// Vector(108-110, 112, 114-116, 118)
Basically this is a tail recursive function - i.e. you take your "variables" as the input, then it calls itself with updated "variables" on each loop. So here xs is your array, min and max are integers used to keep track of the lowest and highest numbers so far, and ranges is the output sequence of Strings that gets added to when required.
The first pattern (y being the first element, and ys being the rest of the sequence - because that's how the +: extractor works) is matched if there's at least one element (ys can be an empty list) and it follows on from the previous maximum.
The second is if it doesn't follow on, and needs to reset the minimum and add the completed range to the output.
The third case is where we've got to the end of the input and just output the result, rather than calling the loop again.
Internet karma points to anyone who can work out how to eliminate the duplication of ranges :+ rangeToString(min, max)!
here is a solution :
def combineConsecutive(s: String): Seq[String] = {
val ints: List[Int] = s.split(',').map(_.toInt).toList.reverse
ints
.drop(1)
.foldLeft(List(List(ints.head)))((acc, e) => if ((acc.head.head - e) <= 1)
(e :: acc.head) :: acc.tail
else
List(e) :: acc)
.map(group => if (group.size > 1) group.min + "-" + group.max else group.head.toString)
}
val in = "108,109,110,112,114,115,116,118"
val result = combineConsecutive(in)
println(result) // List(108-110, 112, 114-116, 118)
}
This solution partly uses code from this question: Grouping list items by comparing them with their neighbors

Spark Jaccard similarity computation by min hashing slow compared to trivial approach

Given 2 huge list of values, I am trying to compute jaccard similarity between them in Spark using Scala.
Assume colHashed1 contains the first list of values and colHashed2 contains the second list.
Approach 1(trivial approach):
val jSimilarity = colHashed1.intersection(colHashed2).distinct.count/(colHashed1.union(colHashed2).distinct.count.toDouble)
Approach 2(using minHashing):
I have used the approach explained here.
import java.util.zip.CRC32
def getCRC32 (s : String) : Int =
{
val crc=new CRC32
crc.update(s.getBytes)
return crc.getValue.toInt & 0xffffffff
}
val maxShingleID = Math.pow(2,32)-1
def pickRandomCoeffs(kIn : Int) : Array[Int] =
{
var k = kIn
val randList = Array.fill(k){0}
while(k > 0)
{
// Get a random shingle ID.
var randIndex = (Math.random()*maxShingleID).toInt
// Ensure that each random number is unique.
while(randList.contains(randIndex))
{
randIndex = (Math.random()*maxShingleID).toInt
}
// Add the random number to the list.
k = k - 1
randList(k) = randIndex
}
return randList
}
val colHashed1 = list1Values.map(a => getCRC32(a))
val colHashed2 = list2Values.map(a => getCRC32(a))
val nextPrime = 4294967311L
val numHashes = 10
val coeffA = pickRandomCoeffs(numHashes)
val coeffB = pickRandomCoeffs(numHashes)
var signature1 = Array.fill(numHashes){0}
for (i <- 0 to numHashes-1)
{
// Evaluate the hash function.
val hashCodeRDD = colHashed1.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime))
// Track the lowest hash code seen.
signature1(i) = hashCodeRDD.min.toInt
}
var signature2 = Array.fill(numHashes){0}
for (i <- 0 to numHashes-1)
{
// Evaluate the hash function.
val hashCodeRDD = colHashed2.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime))
// Track the lowest hash code seen.
signature2(i) = hashCodeRDD.min.toInt
}
var count = 0
// Count the number of positions in the minhash signature which are equal.
for(k <- 0 to numHashes-1)
{
if(signature1(k) == signature2(k))
count = count + 1
}
val jSimilarity = count/numHashes.toDouble
Approach 1 seems to outperform Approach 2 in terms of time always. When I analyzed the code, min() function call on the RDD in Approach 2 takes significant time and that function is called many times depending upon how many hash functions are used.
The intersection and union operations used in Approach 1 seems to work faster compared to the repeated min() function calls.
I don't understand why minHashing does not help here. I expected minHashing to work faster compared to trivial approach. Is there anything I am doing wrong here?
Sample data can be viewed here
JaccardSimilarity with MinHash is not giving consistent results:
import java.util.zip.CRC32
object Jaccard {
def getCRC32(s: String): Int = {
val crc = new CRC32
crc.update(s.getBytes)
return crc.getValue.toInt & 0xffffffff
}
def pickRandomCoeffs(kIn: Int, maxShingleID: Double): Array[Int] = {
var k = kIn
val randList = Array.ofDim[Int](k)
while (k > 0) {
// Get a random shingle ID.
var randIndex = (Math.random() * maxShingleID).toInt
// Ensure that each random number is unique.
while (randList.contains(randIndex)) {
randIndex = (Math.random() * maxShingleID).toInt
}
// Add the random number to the list.
k = k - 1
randList(k) = randIndex
}
return randList
}
def approach2(list1Values: List[String], list2Values: List[String]) = {
val maxShingleID = Math.pow(2, 32) - 1
val colHashed1 = list1Values.map(a => getCRC32(a))
val colHashed2 = list2Values.map(a => getCRC32(a))
val nextPrime = 4294967311L
val numHashes = 10
val coeffA = pickRandomCoeffs(numHashes, maxShingleID)
val coeffB = pickRandomCoeffs(numHashes, maxShingleID)
val signature1 = for (i <- 0 until numHashes) yield {
val hashCodeRDD = colHashed1.map(ele => (coeffA(i) * ele + coeffB(i)) % nextPrime)
hashCodeRDD.min.toInt // Track the lowest hash code seen.
}
val signature2 = for (i <- 0 until numHashes) yield {
val hashCodeRDD = colHashed2.map(ele => (coeffA(i) * ele + coeffB(i)) % nextPrime)
hashCodeRDD.min.toInt // Track the lowest hash code seen
}
val count = (0 until numHashes)
.map(k => if (signature1(k) == signature2(k)) 1 else 0)
.fold(0)(_ + _)
val jSimilarity = count / numHashes.toDouble
jSimilarity
}
// def approach1(list1Values: List[String], list2Values: List[String]) = {
// val colHashed1 = list1Values.toSet
// val colHashed2 = list2Values.toSet
//
// val jSimilarity = colHashed1.intersection(colHashed2).distinct.count / (colHashed1.union(colHashed2).distinct.count.toDouble)
// jSimilarity
// }
def approach1(list1Values: List[String], list2Values: List[String]) = {
val colHashed1 = list1Values.toSet
val colHashed2 = list2Values.toSet
val jSimilarity = (colHashed1 & colHashed2).size / (colHashed1 ++ colHashed2).size.toDouble
jSimilarity
}
def main(args: Array[String]) {
val list1Values = List("a", "b", "c")
val list2Values = List("a", "b", "d")
for (i <- 0 until 5) {
println(s"Iteration ${i}")
println(s" - Approach 1: ${approach1(list1Values, list2Values)}")
println(s" - Approach 2: ${approach2(list1Values, list2Values)}")
}
}
}
OUTPUT:
Iteration 0
- Approach 1: 0.5
- Approach 2: 0.5
Iteration 1
- Approach 1: 0.5
- Approach 2: 0.5
Iteration 2
- Approach 1: 0.5
- Approach 2: 0.8
Iteration 3
- Approach 1: 0.5
- Approach 2: 0.8
Iteration 4
- Approach 1: 0.5
- Approach 2: 0.4
Why are you using it?
It seems to me that the overhead cost for minHashing approach just outweighs its functionality in Spark. Especially as numHashes increases.
Here are some observations I've found in your code:
First, while (randList.contains(randIndex)) this part will surely slow down your process as numHashes (which is by the way equal to the size of randList) increases.
Second, You can save some time if you rewrite this code:
var signature1 = Array.fill(numHashes){0}
for (i <- 0 to numHashes-1)
{
// Evaluate the hash function.
val hashCodeRDD = colHashed1.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime))
// Track the lowest hash code seen.
signature1(i) = hashCodeRDD.min.toInt
}
var signature2 = Array.fill(numHashes){0}
for (i <- 0 to numHashes-1)
{
// Evaluate the hash function.
val hashCodeRDD = colHashed2.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime))
// Track the lowest hash code seen.
signature2(i) = hashCodeRDD.min.toInt
}
var count = 0
// Count the number of positions in the minhash signature which are equal.
for(k <- 0 to numHashes-1)
{
if(signature1(k) == signature2(k))
count = count + 1
}
into
var count = 0
for (i <- 0 to numHashes - 1)
{
val hashCodeRDD1 = colHashed1.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime))
val hashCodeRDD2 = colHashed2.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime))
val sig1 = hashCodeRDD1.min.toInt
val sig2 = hashCodeRDD2.min.toInt
if (sig1 == sig2) { count = count + 1 }
}
This method simplifies the three loops into one. However, I am not sure if that would give a huge boost in computational time.
One other suggestion I have, assuming that the first approach still turns out to be much faster is to use the property of sets to modify the first approach:
val colHashed1_dist = colHashed1.distinct
val colHashed2_dist = colHashed2.distinct
val intersect_cnt = colHashed1_dist.intersection(colHashed2_dist).distinct.count
val jSimilarity = intersect_cnt/(colHashed1_dist.count + colHashed2_dist.count - intersect_cnt).toDouble
with that, instead of getting the union, you can just reuse the value of the intersection.
Actually, in LSH apporach you would calculate minHash only once for each of your documents and then compare two minHases for each possible pair of documents. And in case of trivial approach you would perform full comparison of documents for each possible pair of documents. Which is roughly N^2/2 number of comparisons. Hence extra cost of calculating minHashes is negligible for large enough number of documents.
You should actually compare the performance of the trivial approach:
val jSimilarity = colHashed1.intersection(colHashed2).distinct.count/(colHashed1.union(colHashed2).distinct.count.toDouble)
and performance of the Jaccard distance calculation (last lines in your code):
var count = 0
// Count the number of positions in the minhash signature which are equal.
for(k <- 0 to numHashes-1)
{
if(signature1(k) == signature2(k))
count = count + 1
}
val jSimilarity = count/numHashes.toDouble

How to Make Prime Generator Code More Efficient

As a beginner in Scala, I got a problem in SPOJ:
Peter wants to generate some prime numbers for his cryptosystem. Help him! Your task is to generate all prime numbers between two given numbers!
Input
The input begins with the number t of test cases in a single line (t<=10). In each of the next t lines there are two numbers m and n (1 <= m <= n <= 1000000000, n-m<=100000) separated by a space.
Output
For every test case print all prime numbers p such that m <= p <= n, one number per line, test cases separated by an empty line.
Example
Input:
2
1 10
3 5
Output:
2
3
5
7
3
5
`
Warning: large Input/Output data, be careful with certain languages (though most should be OK if the algorithm is well designed)
Run environment and requirement:
Added by: Adam Dzedzej
Date: 2004-05-01
Time limit: 6s
Source limit: 50000B
Memory limit: 1536MB
Cluster: Cube (Intel Pentium G860 3GHz)
Languages: All except: NODEJS PERL 6 SCM chicken
My code is here:
import math.sqrt
object Pro_2 {
def main(args: Array[String]) {
// judge whether a Long is a prime or not
def isPrime(num: Long): Boolean = {
num match {
case num if (num < 2) => false
case _ => {
2 to (sqrt(num) toInt) forall (num % _ != 0)
}
}
}
// if a Long is a prime print it in console
def printIfIsPrime(num: Long) = {
if (isPrime(num))
println(num)
}
// get range
def getInput(times: Int) = {
for (input <- 0 until times) yield {
val range = readLine().split(" ")
(range(0) toLong, range(1) toLong)
}
}
val ranges = getInput(readInt())
for (time <- 0 until ranges.length) {
(ranges(time)._1 to ranges(time)._2).foreach(printIfIsPrime(_))
if (time != ranges.length - 1)
println()
}
}
}
When I run my code in SPOJ, I got a result: time limit exceeded
I need make my code more efficient, could you please help me?
Any help would be greatly appreciated.
isPrime could be written in a lower level style. Also you can increase the speed of the println method.
def main(args: Array[String]) {
val out = new java.io.PrintWriter(System.out, false)
def isPrime(n: Long): Boolean = {
if(n <= 3) return n > 1
else if(n%2 == 0 || n%3 == 0) return false
else {
var i = 5
while(i*i <= n) {
if(n%i == 0 || n%(i+2) == 0) return false
i += 6
}
return true
}
}
def printIfIsPrime(num: Long) = {
if (isPrime(num)) {
out.println(num)
}
}
...
val ranges = getInput(readInt())
for (time <- 0 until ranges.length) {
(ranges(time)._1 to ranges(time)._2).foreach(printIfIsPrime(_))
if (time != ranges.length - 1)
out.println()
}
out.flush()
}

Binary search not working

Below is a binary search algorithm but its not finding the value :
I don't think this algorithm is correct?
'theArray' is initialised to an array of 0's with item at position 7 equal to 4.
object various {
//O(log N)
def binarySerachForValue(value : Int) = {
var arraySize = 100
var theArray = new Array[Int](arraySize)
theArray(7) = 4
var timesThrough = 0
var lowIndex = 0
var highIndex = arraySize - 1
while(lowIndex <= highIndex){
var middleIndex = (highIndex + lowIndex) / 2
if(theArray(middleIndex) < value)
lowIndex = middleIndex + 1
else if(theArray(middleIndex) > value)
highIndex = middleIndex - 1
else {
println("Found match in index " + middleIndex)
lowIndex = highIndex + 1
}
timesThrough = timesThrough + 1
}
timesThrough
} //> binarySerachForValue: (value: Int)Int
binarySerachForValue(4) //> res0: Int = 7
}
Assuming your array is already properly sorted, you could write your search function a little more functionally using tail optimized recursion as follows:
def binarySearchForValue(value : Int, theArray:Array[Int]) = {
#tailrec
def doSearch(arr:Array[Int], index:Int = 0):Int = {
val middleIndex = arr.size / 2
val splits = arr.splitAt(middleIndex)
val totalIndex = middleIndex + index
arr(middleIndex) match{
case i if i == value => totalIndex
case i if i < value => doSearch(splits._2, totalIndex)
case _ => doSearch(splits._1 dropRight(1), totalIndex)
}
}
doSearch(theArray)
}
Note that this could also be accomplished slightly differently as follows:
def binarySearchForValue(value : Int, theArray:Array[Int]) = {
#tailrec
def doSearch(low:Int, high:Int):Int = {
val mid = (low + high) / 2
if(mid >= theArray.size) -1
else {
val currval = theArray(mid)
if (currval == value) mid
else if (currval < value) doSearch(mid+1, high)
else doSearch(low, mid - 1)
}
}
doSearch(0, theArray.size)
}
It looks like a proper implementation of the Binary Search Algorithm, but you are providing an array of 0's, with just one number at the index of 7. Binary Search usually takes an array of sorted values (although you can implement sorting as the first step).
Here is an example of why you need a sorted array first:
Searchfor(4)
theArray = [0,4,0,0,0]
First iteration, look at theArray(2), which equals 0. 0 < 4, so use the upperhalf(i.e. lower index = middleindex + 1
newArray = [0,0]
Then we iterate again and eventually exit the loop because we never found it. With a sorted list, your technique would work well.
With finding a single value in an array of 0's, your best bet is to just iterate through the array until you find it. Best of Luck.
loop should be like this:
while(lowIndex <= highIndex){
//note the lowIndex + other
var middleIndex = lowIndex + ((highIndex + lowIndex) / 2)
if(theArray(middleIndex) < value)
lowIndex = middleIndex + 1
else if(theArray(middleIndex) > value)
highIndex = middleIndex - 1
else return middleIndex
timesThrough = timesThrough + 1
}
// if loop finished and not returned middleIndex in last else, return -1 (not found)
return -1