Scala for loop to calculate sum of powers - scala

I am very new to Scala and am trying to create a loop that will calculate the sum of powers (1^1 + 2^2 + ... + 10^10) without using an exponent operator.
I discovered that 1^1 through 9^9 calculate correctly. But for some reason 10^10 evaluates to 1410065408 in my current code and messes up my final output of the sum. What is causing this mathematical error?
My current code is:
var i = 1
var ex = 1
var sum = 0
while (i <= 10)
{
for (j <- 1 to i)
{
ex = ex * i
}
sum += ex
ex = 1
i += 1
}
println(s"The sum is $sum")

Here's how it's done in Scala.
List.tabulate(10)(n => List.fill(n+1)(n.toLong+1).product).sum
//res0: Long = 10405071317

Another option is to use Math.pow:
val result1 = 1.to(10).map(x => Math.pow(x, x)).sum
Please note that result1 is of type Double and has the value 1.0405071317E10.
If you want it as a Long, you can do:
val result2 = 1.to(10).map(x => Math.pow(x, x).toLong).sum
Then result2 will have the value 10405071317.
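On the "what is causing this" part of the question: Int in Scala is a 32-bit signed integer with a maximum of 2147483647, so 10^10 does not fit and the multiplication silently wraps around modulo 2^32, which gives 1410065408. A minimal sketch of the original loop with Long accumulators follows; the object and method names (SumOfPowers, sumOfPowers) are made up for illustration.

```scala
object SumOfPowers {
  // Sum of i^i for i = 1..n, using repeated multiplication only.
  // Long accumulators avoid the 32-bit Int overflow seen at 10^10.
  def sumOfPowers(n: Int): Long = {
    var sum = 0L
    var i = 1
    while (i <= n) {
      var ex = 1L
      for (_ <- 1 to i) ex *= i // multiply i by itself i times
      sum += ex
      i += 1
    }
    sum
  }

  def main(args: Array[String]): Unit =
    println(s"The sum is ${sumOfPowers(10)}") // The sum is 10405071317
}
```

Long itself overflows a little further out (16^16 no longer fits), so for larger n BigInt would be the safer accumulator.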

Related

Getting specific incremental key values into an array

I have a list which I zipped with indices:
val fun_i_map_e = (list.indices zip list).toMap
Now, I want to get each key's value incremented by num:Int :
for (k<-0 until list.length by num)
for ((k,v) <- fun_i_map_e) {
bufferArray += v}
The idea here is something like this in Java:
for (k = 0; k <= list.length; k+= num){
//increment key k each time and store value into dynamic array }
However, I'm getting very random and complete trash output. I would appreciate if someone can help as I'm new in Scala.
You are almost there. All you need is to turn your for loop into a for comprehension with yield, as shown below:
val bufferArray = for (k <- 0 until list.length by num) yield fun_i_map_e(k)
I hope the answer is helpful
val list = List[Int](5, 6, 7, 8)
val map = list.indices.zip(list).toMap
val num: Int = 15
// Shift every key of the index map by num
val incrementedKeys = map.keys.map { k => k + num }
println("Original keys:")
println(map.keys)
println()
println(s"Keys incremented by $num:")
println(incrementedKeys)
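If the goal is the Java-style reading of the question (take the values at indices 0, num, 2*num, ...), the index map is not strictly needed. The sketch below assumes that reading; EveryNth and everyNth are hypothetical names, and num is assumed to be positive.

```scala
object EveryNth {
  // Keep the elements at indices 0, num, 2*num, ... (num must be > 0).
  // grouped(num) splits the list into chunks of num elements;
  // the head of each chunk is the element at a multiple of num.
  def everyNth[A](xs: List[A], num: Int): List[A] =
    xs.grouped(num).map(_.head).toList

  def main(args: Array[String]): Unit =
    println(everyNth(List(5, 6, 7, 8), 2)) // List(5, 7)
}
```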

Calculating the sum of integers from x to y with a while loop

I'm trying to write code in Scala to calculate the sum of the integers from x to y using a while loop.
I initialize x and y to, for instance:
val x = 1
val y = 10
Then I write a while loop to increment x:
while (x<y) x = x + 1
But println(x) gives the result 10 so I'm assuming the code basically does 1 + 1 + ... + 1 10 times, but that's not what I want.
One option would be to find the sum via a range, converted to a list:
val x = 1
val y = 10
val sum = (x to y).toList.sum
println("sum = " + sum)
Output:
sum = 55
Demo here:
Rextester
Here's how you would do it using a (yuck!) while loop with vars:
var x = 1 // Note that is a "var" not a "val"
val y = 10
var sum = 0 // Must be a "var"
while(x <= y) { // Note less than or equal to
sum += x
x += 1
}
println(s"Sum is $sum") // Sum is 55 (= 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10)
Here's another, more functional, approach using a recursive function. Note the complete lack of var types.
val y = 10
@scala.annotation.tailrec // Signifies add must be tail recursive
def add(x: Int, sum: Int): Int = {
// If x exceeds y, then return the current sum value.
if(x > y) sum
// Otherwise, perform another iteration adding 1 to x and x to sum.
else add(x + 1, sum + x)
}
// Start things off and get the result (55).
val result = add(1, 0)
println(s"Sum is $result") // Sum is 55
Here's a common functional approach that can be used with collections. Firstly, (x to y) becomes a Range of values between 1 and 10 inclusive. We then use the foldLeft higher-order function to sum the members:
val x = 1
val y = 10
val result = (x to y).foldLeft(0)(_ + _)
println(s"Sum is $result") // Sum is 55
The (0) is the initial sum value, and the (_ + _) adds the current sum to the current value. (This is Scala shorthand for ((sum: Int, i: Int) => sum + i)).
Finally, here's a simplified version of the elegant functional version that @TimBiegeleisen posted above. However, since a Range already implements a .sum member, there is no need to convert to a List first:
val x = 1
val y = 10
val result = (x to y).sum
println(s"Sum is $result") // Sum is 55
(sum can be thought of as being equivalent to the foldLeft example above, and is typically implemented in similar fashion.)
BTW, if you just want to sum values from 1 to 10, the following code does this very succinctly:
(1 to 10).sum
Although you can use Scala to write imperative code (which uses vars, while loops, etc. and which inherently leads to shared mutable state), I strongly recommend that you consider functional alternatives. Functional programming avoids the side-effects and complexities of shared mutable state and often results in simpler, more elegant code. Note that all of the examples above except the while loop are functional.
var x = 1
val y = 10
var temp = 0
while (x <= y) { // <= so that y itself is included in the sum
temp = temp + x
x = x + 1
}
println(temp) // 55
This gives the required result.
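As a cross-check for any of the loops above, the sum of the integers from x to y also has a closed form: (number of terms) * (first + last) / 2. A small sketch, with hypothetical names RangeSum and closedForm:

```scala
object RangeSum {
  // Sum of the integers x..y inclusive via the arithmetic-series formula.
  // Returns 0 for an empty range (x > y); Long avoids Int overflow.
  def closedForm(x: Int, y: Int): Long =
    if (x > y) 0L else (y.toLong - x + 1) * (x.toLong + y) / 2

  def main(args: Array[String]): Unit =
    println(closedForm(1, 10)) // 55
}
```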

How to combine the results of spark computations in the following case?

The question is to calculate average of each of the columns corresponding to each class. Class number is given in the first column.
I am giving a part of test file for better clarity.
2 0.819039 -0.408442 0.120827
3 -0.063763 0.060122 0.250393
4 -0.304877 0.379067 0.092391
5 -0.168923 0.044400 0.074417
1 0.053700 -0.088746 0.228501
2 0.196758 0.035607 0.008134
3 0.006971 -0.096478 0.123718
4 0.084281 0.278343 -0.350414
So the task is to calculate
1: avg(), avg(), avg()
.
.
.
I am very new to Scala. After juggling a lot with the code I came up with the following code
val inputfile = sc.textFile ("testfile.txt")
val myArray = inputfile.map { line =>
(line.split(" ").toList)
}
var Avgmap:Map[String,List[Double]] = Map()
var countmap:Map[String,Int] = Map()
for( a <- myArray ){
//println( "Value of a: " + a + " " + a.size );
if(!countmap.contains(a(0))){
countmap += (a(0) -> 0)
Avgmap += (a(0) -> List.fill(a.size-1)(1.0))
}
var c = countmap(a(0)) + 1
val countmap2 = countmap + (a(0) -> c)
countmap = countmap2
var p = List[Double]()
for( i <- 1 to a.size - 1) {
var temp = (Avgmap(a(0))(i-1)*(countmap(a(0)) - 1) + a(i).toDouble)/countmap(a(0))
// println("i: "+i+" temp: "+temp)
var q = p :+ temp
p = q
}
val Avgmap2 = Avgmap + (a(0) -> p)
Avgmap = Avgmap2;
println("--------------------------------------------------")
println(countmap)
println(Avgmap)
}
When I execute this code I seem to be getting the results in two halves of the dataset. Please help me in combining them.
Edit: About the variables I am using. countmap keeps record of classnumber -> number of vectors encountered. Similarly Avgmap keeps record of average so far of each columns corresponding to the key.
First, use the DataFrame API. Second, what you want is just one row:
import org.apache.spark.sql.functions.{col, mean}
df.select(df.columns.map(c => mean(col(c))): _*).show()
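The "two halves" in the question are most likely the input partitions: a for loop over an RDD runs as a foreach inside the tasks, and each task mutates its own copy of countmap and Avgmap, so the per-partition results are never merged. The per-class averaging itself is a group-then-average step. The sketch below shows that logic on a plain Scala collection only (no Spark); ClassAverages and averages are hypothetical names. With a DataFrame, the analogous step would be a groupBy on the class column followed by an average of the remaining columns.

```scala
object ClassAverages {
  // Each line is "class v1 v2 ..."; returns class -> per-column averages.
  def averages(lines: Seq[String]): Map[String, List[Double]] =
    lines
      .map(_.trim.split("\\s+").toList)   // tokenize each line
      .groupBy(_.head)                    // group rows by class label
      .map { case (cls, rows) =>
        // Turn the rows of one class into columns, then average each column.
        val columns = rows.map(_.tail.map(_.toDouble)).transpose
        cls -> columns.map(c => c.sum / c.size).toList
      }

  def main(args: Array[String]): Unit = {
    val sample = Seq("1 1.0 2.0", "1 3.0 4.0", "2 5.0 6.0")
    println(averages(sample))
  }
}
```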

Spark Jaccard similarity computation by min hashing slow compared to trivial approach

Given two huge lists of values, I am trying to compute the Jaccard similarity between them in Spark using Scala.
Assume colHashed1 contains the first list of values and colHashed2 contains the second list.
Approach 1(trivial approach):
val jSimilarity = colHashed1.intersection(colHashed2).distinct.count/(colHashed1.union(colHashed2).distinct.count.toDouble)
Approach 2(using minHashing):
I have used the approach explained here.
import java.util.zip.CRC32
def getCRC32 (s : String) : Int =
{
val crc=new CRC32
crc.update(s.getBytes)
return crc.getValue.toInt & 0xffffffff
}
val maxShingleID = Math.pow(2,32)-1
def pickRandomCoeffs(kIn : Int) : Array[Int] =
{
var k = kIn
val randList = Array.fill(k){0}
while(k > 0)
{
// Get a random shingle ID.
var randIndex = (Math.random()*maxShingleID).toInt
// Ensure that each random number is unique.
while(randList.contains(randIndex))
{
randIndex = (Math.random()*maxShingleID).toInt
}
// Add the random number to the list.
k = k - 1
randList(k) = randIndex
}
return randList
}
val colHashed1 = list1Values.map(a => getCRC32(a))
val colHashed2 = list2Values.map(a => getCRC32(a))
val nextPrime = 4294967311L
val numHashes = 10
val coeffA = pickRandomCoeffs(numHashes)
val coeffB = pickRandomCoeffs(numHashes)
var signature1 = Array.fill(numHashes){0}
for (i <- 0 to numHashes-1)
{
// Evaluate the hash function.
val hashCodeRDD = colHashed1.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime))
// Track the lowest hash code seen.
signature1(i) = hashCodeRDD.min.toInt
}
var signature2 = Array.fill(numHashes){0}
for (i <- 0 to numHashes-1)
{
// Evaluate the hash function.
val hashCodeRDD = colHashed2.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime))
// Track the lowest hash code seen.
signature2(i) = hashCodeRDD.min.toInt
}
var count = 0
// Count the number of positions in the minhash signature which are equal.
for(k <- 0 to numHashes-1)
{
if(signature1(k) == signature2(k))
count = count + 1
}
val jSimilarity = count/numHashes.toDouble
Approach 1 always seems to outperform Approach 2 in terms of time. When I analyzed the code, I found that the min() call on the RDD in Approach 2 takes significant time, and it is called once per hash function for each list.
The intersection and union operations used in Approach 1 seem to work faster than the repeated min() calls.
I don't understand why minHashing does not help here. I expected minHashing to be faster than the trivial approach. Is there anything I am doing wrong here?
Sample data can be viewed here
JaccardSimilarity with MinHash is not giving consistent results:
import java.util.zip.CRC32
object Jaccard {
def getCRC32(s: String): Int = {
val crc = new CRC32
crc.update(s.getBytes)
return crc.getValue.toInt & 0xffffffff
}
def pickRandomCoeffs(kIn: Int, maxShingleID: Double): Array[Int] = {
var k = kIn
val randList = Array.ofDim[Int](k)
while (k > 0) {
// Get a random shingle ID.
var randIndex = (Math.random() * maxShingleID).toInt
// Ensure that each random number is unique.
while (randList.contains(randIndex)) {
randIndex = (Math.random() * maxShingleID).toInt
}
// Add the random number to the list.
k = k - 1
randList(k) = randIndex
}
return randList
}
def approach2(list1Values: List[String], list2Values: List[String]) = {
val maxShingleID = Math.pow(2, 32) - 1
val colHashed1 = list1Values.map(a => getCRC32(a))
val colHashed2 = list2Values.map(a => getCRC32(a))
val nextPrime = 4294967311L
val numHashes = 10
val coeffA = pickRandomCoeffs(numHashes, maxShingleID)
val coeffB = pickRandomCoeffs(numHashes, maxShingleID)
val signature1 = for (i <- 0 until numHashes) yield {
val hashCodeRDD = colHashed1.map(ele => (coeffA(i) * ele + coeffB(i)) % nextPrime)
hashCodeRDD.min.toInt // Track the lowest hash code seen.
}
val signature2 = for (i <- 0 until numHashes) yield {
val hashCodeRDD = colHashed2.map(ele => (coeffA(i) * ele + coeffB(i)) % nextPrime)
hashCodeRDD.min.toInt // Track the lowest hash code seen
}
val count = (0 until numHashes)
.map(k => if (signature1(k) == signature2(k)) 1 else 0)
.fold(0)(_ + _)
val jSimilarity = count / numHashes.toDouble
jSimilarity
}
// def approach1(list1Values: List[String], list2Values: List[String]) = {
// val colHashed1 = list1Values.toSet
// val colHashed2 = list2Values.toSet
//
// val jSimilarity = colHashed1.intersection(colHashed2).distinct.count / (colHashed1.union(colHashed2).distinct.count.toDouble)
// jSimilarity
// }
def approach1(list1Values: List[String], list2Values: List[String]) = {
val colHashed1 = list1Values.toSet
val colHashed2 = list2Values.toSet
val jSimilarity = (colHashed1 & colHashed2).size / (colHashed1 ++ colHashed2).size.toDouble
jSimilarity
}
def main(args: Array[String]): Unit = {
val list1Values = List("a", "b", "c")
val list2Values = List("a", "b", "d")
for (i <- 0 until 5) {
println(s"Iteration ${i}")
println(s" - Approach 1: ${approach1(list1Values, list2Values)}")
println(s" - Approach 2: ${approach2(list1Values, list2Values)}")
}
}
}
OUTPUT:
Iteration 0
- Approach 1: 0.5
- Approach 2: 0.5
Iteration 1
- Approach 1: 0.5
- Approach 2: 0.5
Iteration 2
- Approach 1: 0.5
- Approach 2: 0.8
Iteration 3
- Approach 1: 0.5
- Approach 2: 0.8
Iteration 4
- Approach 1: 0.5
- Approach 2: 0.4
Why are you using it?
It seems to me that the overhead of the minHashing approach simply outweighs its benefit in Spark, especially as numHashes increases.
Here are some observations about your code:
First, the while (randList.contains(randIndex)) check will slow down your process as numHashes (which is also the size of randList) increases.
Second, you can save some time if you rewrite this code:
var signature1 = Array.fill(numHashes){0}
for (i <- 0 to numHashes-1)
{
// Evaluate the hash function.
val hashCodeRDD = colHashed1.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime))
// Track the lowest hash code seen.
signature1(i) = hashCodeRDD.min.toInt
}
var signature2 = Array.fill(numHashes){0}
for (i <- 0 to numHashes-1)
{
// Evaluate the hash function.
val hashCodeRDD = colHashed2.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime))
// Track the lowest hash code seen.
signature2(i) = hashCodeRDD.min.toInt
}
var count = 0
// Count the number of positions in the minhash signature which are equal.
for(k <- 0 to numHashes-1)
{
if(signature1(k) == signature2(k))
count = count + 1
}
into
var count = 0
for (i <- 0 to numHashes - 1)
{
val hashCodeRDD1 = colHashed1.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime))
val hashCodeRDD2 = colHashed2.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime))
val sig1 = hashCodeRDD1.min.toInt
val sig2 = hashCodeRDD2.min.toInt
if (sig1 == sig2) { count = count + 1 }
}
This merges the three loops into one. However, I am not sure whether that would give a huge boost in computation time.
One other suggestion, assuming the first approach still turns out to be much faster, is to use a property of sets to modify the first approach:
val colHashed1_dist = colHashed1.distinct
val colHashed2_dist = colHashed2.distinct
val intersect_cnt = colHashed1_dist.intersection(colHashed2_dist).distinct.count
val jSimilarity = intersect_cnt/(colHashed1_dist.count + colHashed2_dist.count - intersect_cnt).toDouble
With that, instead of computing the union, you can reuse the intersection count: the size of the union equals the sum of the two distinct counts minus the size of the intersection.
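The same identity can be checked on plain Scala sets, outside Spark. This is only an illustrative sketch; JaccardSets and jaccard are hypothetical names.

```scala
object JaccardSets {
  // Jaccard similarity using |A union B| = |A| + |B| - |A intersect B|,
  // so the union is never materialized.
  def jaccard[A](a: Set[A], b: Set[A]): Double = {
    val inter = (a & b).size
    val union = a.size + b.size - inter
    if (union == 0) 0.0 else inter / union.toDouble
  }

  def main(args: Array[String]): Unit =
    println(jaccard(Set("a", "b", "c"), Set("a", "b", "d"))) // 0.5
}
```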
Actually, in the LSH approach you would calculate the minHash signature only once for each of your documents and then compare two minHash signatures for each possible pair of documents. With the trivial approach you would perform a full comparison of the documents for each possible pair, which is roughly N^2/2 comparisons. Hence the extra cost of calculating minHashes is negligible for a large enough number of documents.
You should actually compare the performance of the trivial approach:
val jSimilarity = colHashed1.intersection(colHashed2).distinct.count/(colHashed1.union(colHashed2).distinct.count.toDouble)
and performance of the Jaccard distance calculation (last lines in your code):
var count = 0
// Count the number of positions in the minhash signature which are equal.
for(k <- 0 to numHashes-1)
{
if(signature1(k) == signature2(k))
count = count + 1
}
val jSimilarity = count/numHashes.toDouble
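To illustrate the split described above (signatures computed once per document, then cheap pairwise comparisons), here is a minimal non-Spark sketch. It is not the code from the question: MinHashSketch, coeffs, signature, estimate and the seed parameter are hypothetical, and it uses Long arithmetic with coefficients below 2^31 so that a * x + b cannot overflow. Documents are assumed to be non-empty sets.

```scala
import java.util.zip.CRC32
import scala.util.Random

object MinHashSketch {
  val prime = 4294967311L // same prime as in the question, just above 2^32

  // Unsigned 32-bit CRC as a Long in [0, 2^32).
  def crc32(s: String): Long = {
    val crc = new CRC32
    crc.update(s.getBytes("UTF-8"))
    crc.getValue
  }

  // k random (a, b) pairs; a and b stay below 2^31 so a * x + b fits in a Long.
  def coeffs(k: Int, seed: Long): Vector[(Long, Long)] = {
    val rng = new Random(seed)
    Vector.fill(k)((rng.nextInt(Int.MaxValue - 1) + 1L, rng.nextInt(Int.MaxValue).toLong))
  }

  // Computed once per document: the minimum of each hash function over the set.
  def signature(doc: Set[String], cs: Vector[(Long, Long)]): Vector[Long] = {
    val hashed = doc.map(crc32)
    cs.map { case (a, b) => hashed.map(x => (a * x + b) % prime).min }
  }

  // Cheap per-pair step: the fraction of signature positions that agree.
  def estimate(s1: Vector[Long], s2: Vector[Long]): Double =
    s1.zip(s2).count { case (x, y) => x == y } / s1.size.toDouble
}
```

With only a handful of hash functions the estimate is coarse and varies with the random coefficients, which is consistent with the fluctuating Approach 2 output shown earlier; more hash functions tighten the estimate at the cost of a longer signature.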

Scala - What type are the numbers in the List using x.toString.toList?

I have written a function in Scala that should calculate the sum of the squares of the digits of a number. Eg: 44 -> 32 (4^2 + 4^2 = 16 + 16 = 32)
Here it is:
def digitSum(x:BigInt) : BigInt = {
var sum = 0
val leng = x.toString.toList.length
var y = x.toString.toList
for (i<-0 until leng ) {
sum += y(i).toInt * y(i).toInt
}
return sum
}
However when I call the function let's say with digitSum(44) instead of 32 I get 5408.
Why is this happening? Does it have to do with the fact that the list contains Strings? If so, why does the .toInt method not work?
Thanks!
The answer to your question has already been covered in "Scala int value of String characters"; have a good read through it and you will have more information than required ;)
Also, looking at your code, it could benefit more from Scala's expressiveness and functional features. The same function can be written in the following manner:
def digitSum(x: BigInt) = x.toString
.map(_.asDigit)
.map(a => a * a)
.sum
In the future, try to avoid mutable variables and standard looping techniques where you can.
When you call toString and then toList you get a list of Chars, not Ints, and calling toInt on a Char returns its character code rather than the digit it represents. This is what it looks like in the REPL:
scala> "1".toList.map(_.toInt)
res0: List[Int] = List(49)
What you want is probably something like this:
def digitSum(x:BigInt) : BigInt = {
var sum = 0
val leng = x.toString.toList.length
var y = x.toString.toList
for (i<-0 until leng ) {
sum += (y(i).toInt - 48) * (y(i).toInt - 48) //Subtract out char base
}
sum
}
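To make the difference concrete: '4'.toInt is the character code 52, and 52 * 52 + 52 * 52 = 5408, which is exactly the wrong result from the question, whereas '4'.asDigit is 4. A small sketch comparing the two conversions follows; DigitSquares and both method names are made up for illustration.

```scala
object DigitSquares {
  // What the original code effectively computed: squares of character codes.
  def charCodeSquareSum(x: BigInt): Int =
    x.toString.map(_.toInt).map(c => c * c).sum

  // Intended behavior: squares of the decimal digits (sign ignored via abs).
  def digitSquareSum(x: BigInt): Int =
    x.abs.toString.map(_.asDigit).map(d => d * d).sum

  def main(args: Array[String]): Unit = {
    println(charCodeSquareSum(44)) // 5408
    println(digitSquareSum(44))    // 32
  }
}
```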