How to add adjacent values in an Array[Double] in Scala

val values = array.sliding(2).map(x => x.reduce(_ + _) / 2).toArray
This works successfully, but if the array contains 10,000 or more values it takes a noticeable amount of time. Is there a faster way to compute the adjacent averages?

I think this should be faster:
val values = (for(i <- 0 until array.length - 1) yield ((array(i) + array(i + 1)) / 2)).toArray

Going low-level:
var i = 0
val valuesLength = array.length - 1
val values = new Array[Double](valuesLength)
while (i < valuesLength) {
values(i) = (array(i) + array(i + 1)) / 2
i += 1
}
Of course, you should only do this if this is actually a bottleneck in your program.
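If you do suspect a bottleneck, a quick way to decide is to time the variants against each other. A rough sketch (not from the original answers; the array size, the helper name time, and the lack of JIT warm-up are all simplifying assumptions):
import scala.util.Random

val array = Array.fill(100000)(Random.nextDouble())

// Very rough timing helper; a real benchmark would use JMH or at least warm up the JIT.
def time[A](label: String)(block: => A): A = {
  val start = System.nanoTime
  val result = block
  println(f"$label: ${(System.nanoTime - start) / 1e6}%.2f ms")
  result
}

time("sliding")(array.sliding(2).map(_.sum / 2).toArray)
time("while loop") {
  val out = new Array[Double](array.length - 1)
  var i = 0
  while (i < out.length) { out(i) = (array(i) + array(i + 1)) / 2; i += 1 }
  out
}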

Related

Scala for loop to calculate sum of powers

I am very new to Scala and am trying to create a loop that will calculate the sum of powers (1^1 + 2^2 + ... + 10^10) without using an exponent operator.
I discovered that 1^1 through 9^9 calculate correctly. But for some reason 10^10 evaluates to 1410065409 in my current code and messes up my final output of the sum. What is causing this mathematical error?
My current code is:
var i = 1
var ex = 1
var sum = 0
while (i <= 10)
{
for (j <- 1 to i)
{
ex = ex * i
}
sum += ex
ex = 1
i += 1
}
println(s"The sum is $sum")
The cause is integer overflow: ex is an Int, and 10^10 = 10,000,000,000 is far larger than Int.MaxValue (2,147,483,647), so the repeated multiplication wraps around. Work with Long (or BigInt) instead. Here's how it's done in Scala.
List.tabulate(10)(n => List.fill(n+1)(n.toLong+1).product).sum
//res0: Long = 10405071317
Another option is to use Math.pow:
val result1 = 1.to(10).map(x => Math.pow(x, x)).sum
Please note that result1 is of type Double, and has the value 1.0405071317E10.
If you want it as a Long, you can do:
val result2 = 1.to(10).map(x => Math.pow(x, x).toLong).sum
Then result2 will have the value 10405071317.
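If the exponents ever grow large enough to overflow even a Long, a BigInt variant (a sketch, not part of the original answers) sidesteps overflow entirely:
// Sum of n^n for n = 1 to 10 using BigInt, so no overflow is possible.
val bigSum = (1 to 10).map(n => BigInt(n).pow(n)).sum
// bigSum: BigInt = 10405071317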

Scala: loop through a file to read 20 bytes at a time and blank out the byte at the 3rd position

I have a code snippet in Java that loops through the file byte by byte and blanks out the byte at the 3rd position of every 20 bytes. This is done using a for-each loop.
logic:
for (byte b : raw) {
    if (i == 3) b = 32;
    if (i >= 20) i = 0;
    i++;
}
Since I am learning Scala, I would like to know if there is a better way of looping byte by byte in Scala.
I have read the file into a byte array in Scala as below:
val result = IOUtils.toByteArray(new FileInputStream (new File(fileDir)))
Thanks.
Here is a diametrically opposite solution to that of Tzach Zohar (whose answer appears further below):
def parallel(ba: Array[Byte], blockSize: Int = 2048): Unit = {
val n = ba.size
val numJobs = (n + blockSize - 1) / blockSize
(0 until numJobs).par.foreach { i =>
val startIdx = i * blockSize
val endIdx = n min ((i + 1) * blockSize)
var j = startIdx + ((3 - startIdx) % 20 + 20) % 20
while (j < endIdx) {
ba(j) = 32
j += 20
}
}
}
You see a lot of mutable variables, scary imperative while-loops, and some strange tricks with modular arithmetic. That's actually not idiomatic Scala at all. But the interesting thing about this solution is that it processes blocks of the byte array in parallel. I've compared the time needed by this solution to your naive solution, using various block sizes:
Naive: 38.196
Parallel( 16): 11.676000
Parallel( 32): 7.260000
Parallel( 64): 4.311000
Parallel( 128): 2.757000
Parallel( 256): 2.473000
Parallel( 512): 2.462000
Parallel(1024): 2.435000
Parallel(2048): 2.444000
Parallel(4096): 2.416000
Parallel(8192): 2.420000
At least in this not very thorough microbenchmark (1000 repetitions on a 10 MB array), the more-or-less efficiently implemented parallel version outperformed the for-loop in your question by a factor of roughly 15x.
The question is now: What do you mean by "better"?
My proposal was noticeably faster than your naive approach.
@TzachZohar's functional solution could generalize better should the code be moved to a cluster framework like Apache Spark.
I would usually prefer something closer to @TzachZohar's solution, because it's easier to read.
So, it depends on what you are optimizing for: performance? generality? readability? maintainability? For each notion of "better", you could get a different answer. I've tried to optimize for performance. @TzachZohar optimized for readability and maintainability. That led to two rather different solutions.
Full code of the microbenchmark, just in case someone is interested:
val array = Array.ofDim[Byte](10000000)
def naive(ba: Array[Byte]): Unit = {
var pos = 0
for (i <- 0 until ba.size) {
if (pos == 3) ba(i) = 32
pos += 1
if (pos == 20) pos = 0
}
}
def parallel(ba: Array[Byte], blockSize: Int): Unit = {
val n = ba.size
val numJobs = (n + blockSize - 1) / blockSize
(0 until numJobs).par.foreach { i =>
val startIdx = i * blockSize
val endIdx = n min ((i + 1) * blockSize)
var j = startIdx + ((3 - startIdx) % 20 + 20) % 20
while (j < endIdx) {
ba(j) = 32
j += 20
}
}
}
def measureTime[U](repeats: Long)(block: => U): Double = {
val start = System.currentTimeMillis
var iteration = 0
while (iteration < repeats) {
iteration += 1
block
}
val end = System.currentTimeMillis
(end - start).toDouble / repeats
}
println("Basic sanity check (did I get the modulo arithmetic right?):")
{
val testArray = Array.ofDim[Byte](50)
naive(testArray)
println(testArray.mkString("[", ",", "]"))
}
{
for (blockSize <- List(3, 7, 13, 16, 17, 32)) {
val testArray = Array.ofDim[Byte](50)
parallel(testArray, blockSize)
println(testArray.mkString("[", ",", "]"))
}
}
val Reps = 1000
val naiveTime = measureTime(Reps)(naive(array))
println("Naive: " + naiveTime)
for (blockSize <- List(16,32,64,128,256,512,1024,2048,4096,8192)) {
val parallelTime = measureTime(Reps)(parallel(array, blockSize))
println("Parallel(%4d): %f".format(blockSize, parallelTime))
}
Here's one way to do this (note that grouped copies the bytes into fresh arrays, so this builds a new array rather than mutating result in place, and the final group may be shorter than 4 bytes, hence the length guard):
val updated = result.grouped(20).flatMap { arr => if (arr.length > 3) arr.update(3, 32); arr }.toArray
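A quick self-check on a tiny input (a sketch for illustration only; the 50-byte size is an arbitrary assumption):
val raw = Array.fill[Byte](50)(1)
val updated = raw.grouped(20).flatMap { arr =>
  if (arr.length > 3) arr(3) = 32 // guard the short final group
  arr
}.toArray
// Every 20th byte starting at index 3 is now 32: indices 3, 23 and 43 here.
println(updated.zipWithIndex.filter(_._1 == 32).map(_._2).mkString(", "))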

Calculating the sum of integers from x to y with a while loop

I'm trying to write code in Scala to calculate the sum of elements from x to y using a while loop.
I initialize x and y, for instance:
val x = 1
val y = 10
Then I write a while loop to increment x:
while (x<y) x = x + 1
But println(x) gives 10, so I'm assuming the code basically does 1 + 1 + ... + 1 ten times, which is not what I want.
One option would be to find the sum via a range, converted to a list:
val x = 1
val y = 10
val sum = (x to y).toList.sum
println("sum = " + sum)
Output:
sum = 55
Here's how you would do it using a (yuck!) while loop with vars:
var x = 1 // Note that is a "var" not a "val"
val y = 10
var sum = 0 // Must be a "var"
while(x <= y) { // Note less than or equal to
sum += x
x += 1
}
println(s"Sum is $sum") // Sum = 55 (= 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10)
Here's another, more functional, approach using a recursive function. Note the complete lack of var types.
val y = 10
@scala.annotation.tailrec // Signifies add must be tail recursive
def add(x: Int, sum: Int): Int = {
// If x exceeds y, then return the current sum value.
if(x > y) sum
// Otherwise, perform another iteration adding 1 to x and x to sum.
else add(x + 1, sum + x)
}
// Start things off and get the result (55).
val result = add(1, 0)
println(s"Sum is $result") // Sum is 55
Here's a common functional approach that can be used with collections. Firstly, (x to y) becomes a Range of values between 1 and 10 inclusive. We then use the foldLeft higher-order function to sum the members:
val x = 1
val y = 10
val result = (x to y).foldLeft(0)(_ + _)
println(s"Sum is $result") // Sum is 55
The (0) is the initial sum value, and the (_ + _) adds the current sum to the current value. (This is Scala shorthand for ((sum: Int, i: Int) => sum + i)).
Finally, here's a simplified version of the elegant functional version that @TimBiegeleisen posted above. However, since a Range already implements a .sum member, there is no need to convert to a List first:
val x = 1
val y = 10
val result = (x to y).sum
println(s"Sum is $result") // Sum is 55
(sum can be thought of as being equivalent to the foldLeft example above, and is typically implemented in similar fashion.)
BTW, if you just want to sum values from 1 to 10, the following code does this very succinctly:
(1 to 10).sum
Although you can use Scala to write imperative code (which uses vars, while loops, etc. and which inherently leads to shared mutable state), I strongly recommend that you consider functional alternatives. Functional programming avoids the side-effects and complexities of shared mutable state and often results in simpler, more elegant code. Note that all but the first example are functional.
var x = 1
var y = 10
var temp = 0
while (x <= y) {
temp = temp+x
x = x + 1
}
println(temp)
This gives the required result (55). Note that the loop condition must be x <= y, otherwise the final value y is never added.
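For completeness (not part of the original answers), the sum of consecutive integers also has a closed form, sum = (x + y) * (y - x + 1) / 2, which needs no loop at all:
val x = 1
val y = 10
// Closed-form sum of the integers from x to y inclusive.
val result = (x + y) * (y - x + 1) / 2
println(s"Sum is $result") // Sum is 55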

Spark Jaccard similarity computation by min hashing slow compared to trivial approach

Given 2 huge lists of values, I am trying to compute the Jaccard similarity between them in Spark using Scala.
Assume colHashed1 contains the first list of values and colHashed2 contains the second list.
Approach 1(trivial approach):
val jSimilarity = colHashed1.intersection(colHashed2).distinct.count/(colHashed1.union(colHashed2).distinct.count.toDouble)
Approach 2(using minHashing):
I have used the approach explained here.
import java.util.zip.CRC32
def getCRC32 (s : String) : Int =
{
val crc=new CRC32
crc.update(s.getBytes)
return crc.getValue.toInt & 0xffffffff
}
val maxShingleID = Math.pow(2,32)-1
def pickRandomCoeffs(kIn : Int) : Array[Int] =
{
var k = kIn
val randList = Array.fill(k){0}
while(k > 0)
{
// Get a random shingle ID.
var randIndex = (Math.random()*maxShingleID).toInt
// Ensure that each random number is unique.
while(randList.contains(randIndex))
{
randIndex = (Math.random()*maxShingleID).toInt
}
// Add the random number to the list.
k = k - 1
randList(k) = randIndex
}
return randList
}
val colHashed1 = list1Values.map(a => getCRC32(a))
val colHashed2 = list2Values.map(a => getCRC32(a))
val nextPrime = 4294967311L
val numHashes = 10
val coeffA = pickRandomCoeffs(numHashes)
val coeffB = pickRandomCoeffs(numHashes)
var signature1 = Array.fill(numHashes){0}
for (i <- 0 to numHashes-1)
{
// Evaluate the hash function.
val hashCodeRDD = colHashed1.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime))
// Track the lowest hash code seen.
signature1(i) = hashCodeRDD.min.toInt
}
var signature2 = Array.fill(numHashes){0}
for (i <- 0 to numHashes-1)
{
// Evaluate the hash function.
val hashCodeRDD = colHashed2.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime))
// Track the lowest hash code seen.
signature2(i) = hashCodeRDD.min.toInt
}
var count = 0
// Count the number of positions in the minhash signature which are equal.
for(k <- 0 to numHashes-1)
{
if(signature1(k) == signature2(k))
count = count + 1
}
val jSimilarity = count/numHashes.toDouble
Approach 1 always seems to outperform Approach 2 in terms of time. When I analyzed the code, the min() call on the RDD in Approach 2 takes significant time, and it is called many times, depending on how many hash functions are used.
The intersection and union operations used in Approach 1 seem to work faster than the repeated min() calls.
I don't understand why minHashing does not help here. I expected minHashing to work faster than the trivial approach. Is there anything I am doing wrong here?
Sample data can be viewed here
Jaccard similarity with MinHash is not giving consistent results:
import java.util.zip.CRC32
object Jaccard {
def getCRC32(s: String): Int = {
val crc = new CRC32
crc.update(s.getBytes)
return crc.getValue.toInt & 0xffffffff
}
def pickRandomCoeffs(kIn: Int, maxShingleID: Double): Array[Int] = {
var k = kIn
val randList = Array.ofDim[Int](k)
while (k > 0) {
// Get a random shingle ID.
var randIndex = (Math.random() * maxShingleID).toInt
// Ensure that each random number is unique.
while (randList.contains(randIndex)) {
randIndex = (Math.random() * maxShingleID).toInt
}
// Add the random number to the list.
k = k - 1
randList(k) = randIndex
}
return randList
}
def approach2(list1Values: List[String], list2Values: List[String]) = {
val maxShingleID = Math.pow(2, 32) - 1
val colHashed1 = list1Values.map(a => getCRC32(a))
val colHashed2 = list2Values.map(a => getCRC32(a))
val nextPrime = 4294967311L
val numHashes = 10
val coeffA = pickRandomCoeffs(numHashes, maxShingleID)
val coeffB = pickRandomCoeffs(numHashes, maxShingleID)
val signature1 = for (i <- 0 until numHashes) yield {
val hashCodeRDD = colHashed1.map(ele => (coeffA(i) * ele + coeffB(i)) % nextPrime)
hashCodeRDD.min.toInt // Track the lowest hash code seen.
}
val signature2 = for (i <- 0 until numHashes) yield {
val hashCodeRDD = colHashed2.map(ele => (coeffA(i) * ele + coeffB(i)) % nextPrime)
hashCodeRDD.min.toInt // Track the lowest hash code seen
}
val count = (0 until numHashes)
.map(k => if (signature1(k) == signature2(k)) 1 else 0)
.fold(0)(_ + _)
val jSimilarity = count / numHashes.toDouble
jSimilarity
}
// def approach1(list1Values: List[String], list2Values: List[String]) = {
// val colHashed1 = list1Values.toSet
// val colHashed2 = list2Values.toSet
//
// val jSimilarity = colHashed1.intersection(colHashed2).distinct.count / (colHashed1.union(colHashed2).distinct.count.toDouble)
// jSimilarity
// }
def approach1(list1Values: List[String], list2Values: List[String]) = {
val colHashed1 = list1Values.toSet
val colHashed2 = list2Values.toSet
val jSimilarity = (colHashed1 & colHashed2).size / (colHashed1 ++ colHashed2).size.toDouble
jSimilarity
}
def main(args: Array[String]) {
val list1Values = List("a", "b", "c")
val list2Values = List("a", "b", "d")
for (i <- 0 until 5) {
println(s"Iteration ${i}")
println(s" - Approach 1: ${approach1(list1Values, list2Values)}")
println(s" - Approach 2: ${approach2(list1Values, list2Values)}")
}
}
}
OUTPUT:
Iteration 0
- Approach 1: 0.5
- Approach 2: 0.5
Iteration 1
- Approach 1: 0.5
- Approach 2: 0.5
Iteration 2
- Approach 1: 0.5
- Approach 2: 0.8
Iteration 3
- Approach 1: 0.5
- Approach 2: 0.8
Iteration 4
- Approach 1: 0.5
- Approach 2: 0.4
Why are you using it?
It seems to me that the overhead cost of the minHashing approach simply outweighs its benefit in Spark, especially as numHashes increases.
Here are some observations I've found in your code:
First, the while (randList.contains(randIndex)) check will surely slow down your process as numHashes (which, by the way, equals the size of randList) increases; a cheaper alternative is sketched below.
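One way around that (a sketch, not part of the original answer) is to collect the coefficients into a mutable Set, where the membership check is effectively constant time:
import scala.collection.mutable
import scala.util.Random

// Draw k distinct random coefficients; Set membership makes the duplicate check cheap.
def pickRandomCoeffs(k: Int, maxShingleID: Double): Array[Int] = {
  val seen = mutable.Set.empty[Int]
  while (seen.size < k) seen += (Random.nextDouble() * maxShingleID).toInt
  seen.toArray
}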
Second, you can save some time if you rewrite this code:
var signature1 = Array.fill(numHashes){0}
for (i <- 0 to numHashes-1)
{
// Evaluate the hash function.
val hashCodeRDD = colHashed1.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime))
// Track the lowest hash code seen.
signature1(i) = hashCodeRDD.min.toInt
}
var signature2 = Array.fill(numHashes){0}
for (i <- 0 to numHashes-1)
{
// Evaluate the hash function.
val hashCodeRDD = colHashed2.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime))
// Track the lowest hash code seen.
signature2(i) = hashCodeRDD.min.toInt
}
var count = 0
// Count the number of positions in the minhash signature which are equal.
for(k <- 0 to numHashes-1)
{
if(signature1(k) == signature2(k))
count = count + 1
}
into
var count = 0
for (i <- 0 to numHashes - 1)
{
val hashCodeRDD1 = colHashed1.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime))
val hashCodeRDD2 = colHashed2.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime))
val sig1 = hashCodeRDD1.min.toInt
val sig2 = hashCodeRDD2.min.toInt
if (sig1 == sig2) { count = count + 1 }
}
This merges the three loops into one. However, I am not sure whether it would give a huge boost in computation time.
One other suggestion, assuming that the first approach still turns out to be much faster, is to use the inclusion-exclusion property of sets (|A ∪ B| = |A| + |B| - |A ∩ B|) to modify the first approach:
val colHashed1_dist = colHashed1.distinct
val colHashed2_dist = colHashed2.distinct
val intersect_cnt = colHashed1_dist.intersection(colHashed2_dist).distinct.count
val jSimilarity = intersect_cnt/(colHashed1_dist.count + colHashed2_dist.count - intersect_cnt).toDouble
With that, instead of computing the union, you can just reuse the size of the intersection.
Actually, in the LSH approach you would calculate the minHash only once for each of your documents and then compare the two minHash signatures for each possible pair of documents. In the trivial approach, by contrast, you would perform a full comparison of the documents for each possible pair, which is roughly N^2/2 comparisons. Hence the extra cost of calculating minHashes is negligible for a large enough number of documents.
You should actually compare the performance of the trivial approach:
val jSimilarity = colHashed1.intersection(colHashed2).distinct.count/(colHashed1.union(colHashed2).distinct.count.toDouble)
and the performance of the Jaccard similarity calculation from the signatures (the last lines in your code):
var count = 0
// Count the number of positions in the minhash signature which are equal.
for(k <- 0 to numHashes-1)
{
if(signature1(k) == signature2(k))
count = count + 1
}
val jSimilarity = count/numHashes.toDouble
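To make that point concrete, here is a rough sketch (not from the original answers; the names and the plain-collection setting are assumptions) of computing one signature per document and reusing it for every pairwise comparison:
// Precompute one MinHash signature per document; comparing two signatures is then
// O(numHashes) per pair, independent of document size.
def signature(hashes: List[Int], coeffA: Array[Int], coeffB: Array[Int],
              nextPrime: Long, numHashes: Int): Array[Long] =
  Array.tabulate(numHashes)(i => hashes.map(h => (coeffA(i).toLong * h + coeffB(i)) % nextPrime).min)

def signatureSimilarity(s1: Array[Long], s2: Array[Long]): Double =
  s1.zip(s2).count { case (a, b) => a == b }.toDouble / s1.length

// Usage sketch: docs is assumed to be a Map from document id to its CRC32-hashed shingles.
// val sigs = docs.map { case (id, hashes) => id -> signature(hashes, coeffA, coeffB, nextPrime, numHashes) }
// for ((id1, sig1) <- sigs; (id2, sig2) <- sigs if id1 < id2)
//   println(s"$id1 ~ $id2: ${signatureSimilarity(sig1, sig2)}")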

Counting by range

The following script can be used to "count by" key:
val nbr = List(1,2,2,3,3,3,4,4,4,4)
val nbrPairsRDD = sc.parallelize(nbr).map(nbr => (nbr, 1))
val nbrCountsWithReduce = nbrPairsRDD
.reduceByKey(_ + _)
.collect()
nbrCountsWithReduce.foreach(println)
it returns:
(1,1)
(2,2)
(3,3)
(4,4)
How could it be modified to map by range rather than absolute values and give the following output if we had two ranges 1:2 and 3:4:
(1:2,3)
(3:4,7)
One option is to convert the values to Double and use the histogram function:
val nbr = List(1,2,2,3,3,3,4,4,4,4)
val nbrPairsRDD = sc.parallelize(nbr).map(_.toDouble).histogram(2)
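Note that histogram(2) returns a pair of bucket boundaries and counts rather than the (range, count) pairs asked for. A small follow-up step (a sketch; for this data the boundaries come out as 1.0, 2.5, 4.0) could reshape it, although the labels come from the numeric boundaries rather than the 1:2 / 3:4 strings in the question:
val (boundaries, counts) = sc.parallelize(nbr).map(_.toDouble).histogram(2)
// boundaries: Array(1.0, 2.5, 4.0), counts: Array(3, 7) for this data
val labelled = boundaries.sliding(2).zip(counts.iterator).map {
  case (Array(lo, hi), c) => (s"$lo:$hi", c)
}.toList
// List((1.0:2.5,3), (2.5:4.0,7))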
One easy way that I can think of is to map the keys to their ranges, for example:
// function to compute the range key for a value
def computeRange(num: Int): String = {
  if (num < 3) "1:2"
  else if (num < 5) "3:4"
  else "invalid"
}

val nbrRangePairs = sc.parallelize(nbr)
  .map(nbr => (computeRange(nbr), 1))
  .reduceByKey(_ + _)
  .collect()
Here is the code snippet to compute aggregations by range:
val nbr = List(1,2,2,3,3,3,4,4,4,4)
val nbrs = sc.parallelize(nbr)
var lb = 1
var incr = 1
var ub = lb + incr
val nbrsMap = nbrs.map(rec => {
if(rec > ub) {
lb = rec
ub = lb + incr
}
(lb.toString + ":" + ub.toString, 1)
})
nbrsMap.reduceByKey((acc, value) => acc + value).foreach(println)
(1:2,3)
(3:4,7)
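A variant worth noting (a sketch, not from the original answers): deriving the bucket key directly from the value avoids the mutable lb/ub state captured in the closure, which can behave unpredictably once the RDD is split across partitions:
val width = 2
val nbrRanges = sc.parallelize(nbr)
  .map { n =>
    val lo = ((n - 1) / width) * width + 1 // 1 for 1-2, 3 for 3-4, ...
    (s"$lo:${lo + width - 1}", 1)
  }
  .reduceByKey(_ + _)
nbrRanges.collect().foreach(println)
// (1:2,3)
// (3:4,7)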