Spark iterative Kmeans not getting expected results? - scala

I am writing a naive implementation of Kmeans in Spark for my homework:
import breeze.linalg.{ Vector, DenseVector, squaredDistance }
import scala.math
def parse(line: String): Vector[Double] = {
DenseVector(line.split(' ').map(_.toDouble))
}
def closest_assign(p: Vector[Double], centres: Array[Vector[Double]]): Int = {
var bestIndex = 1
var closest = Double.PositiveInfinity
for (i <- 0 until centres.length) {
val tempDist = squaredDistance(p, centres(i))
if (tempDist < closest) {
closest = tempDist
bestIndex = i
}
}
bestIndex
}
val fileroot:String="/FileStore/tables/"
val file=sc.textFile(fileroot+"data.txt")
.map(parse _)
.cache()
val c1=sc.textFile(fileroot+"c1.txt")
.map(parse _)
.collect()
val c2=sc.textFile(fileroot+"c2.txt")
.map(parse _)
.collect()
val K=10
val MAX_ITER=20
var kPoints=c2
for(i<-0 until MAX_ITER){
val closest = file.map(p => (closest_assign(p, kPoints), (p, 1)))
val pointStats = closest.reduceByKey { case ((x1, y1), (x2, y2)) => (x1 + x2, y1 + y2) }
val newPoints = pointStats.map { pair =>
(pair._1, pair._2._1 * (1.0 / pair._2._2))
}.collectAsMap()
for (newP <- newPoints) {
kPoints(newP._1) = newP._2
}
val tempDist = closest
.map { x => squaredDistance(x._2._1, newPoints(x._1)) }
.fold(0) { _ + _ }
println(i+" time finished iteration (cost = " + tempDist + ")")
}
In theory tempDist should become smaller and smaller as the program runs but in reality it goes the other way around. Also I found c1 and c2 changes value after the for(i<-0 until MAX_ITER) loop. But c1 and c2 should be val values! Is the way I load c1 and c2 wrong? c1 and c2 are two different initial clusters for the data.

Related

SCALA: Generating a list of Tuple2 objects meeting some criteria

I want to generate a list of Tuple2 objects. Each tuple (a,b) in the list should meeting the conditions:a and b both are perfect squares,(b/30)<a<b
and a>N and b>N ( N can even be a BigInt)
I am trying to write a scala function to generate the List of Tuples meeting the above requirements?
This is my attempt..it works fine for Ints and Longs..But for BigInt there is sqrt problem I am facing..Here is my approach in coding as below:
scala> def genTups(N:Long) ={
| val x = for(s<- 1L to Math.sqrt(N).toLong) yield s*s;
| val y = x.combinations(2).map{ case Vector(a,b) => (a,b)}.toList
| y.filter(t=> (t._1*30/t._2)>=1)
| }
genTups: (N: Long)List[(Long, Long)]
scala> genTups(30)
res32: List[(Long, Long)] = List((1,4), (1,9), (1,16), (1,25), (4,9), (4,16), (4,25), (9,16), (9,25), (16,25))
Improved this using BigInt square-root algorithm as below:
def genTups(N1:BigInt,N2:BigInt) ={
def sqt(n:BigInt):BigInt = {
var a = BigInt(1)
var b = (n>>5)+BigInt(8)
while((b-a) >= 0) {
var mid:BigInt = (a+b)>>1
if(mid*mid-n> 0) b = mid-1
else a = mid+1
}; a-1 }
val x = for(s<- sqt(N1) to sqt(N2)) yield s*s;
val y = x.combinations(2).map{ case Vector(a,b) => (a,b)}.toList
y.filter(t=> (t._1*30/t._2)>=1)
}
I appreciate any help to improve in my algorithm .
You can avoid sqrt in you algorithm by changing the way you calculate x to this:
val x = (BigInt(1) to N).map(x => x*x).takeWhile(_ <= N)
The final function is then:
def genTups(N: BigInt) = {
val x = (BigInt(1) to N).map(x => x*x).takeWhile(_ <= N)
val y = x.combinations(2).map { case Vector(a, b) if (a < b) => (a, b) }.toList
y.filter(t => (t._1 * 30 / t._2) >= 1)
}
You can also re-write this as a single chain of operations like this:
def genTups(N: BigInt) =
(BigInt(1) to N)
.map(x => x * x)
.takeWhile(_ <= N)
.combinations(2)
.map { case Vector(a, b) if a < b => (a, b) }
.filter(t => (t._1 * 30 / t._2) >= 1)
.toList
In a quest for performance, I came up with this recursive version that appears to be significantly faster
def genTups(N1: BigInt, N2: BigInt) = {
def sqt(n: BigInt): BigInt = {
var a = BigInt(1)
var b = (n >> 5) + BigInt(8)
while ((b - a) >= 0) {
var mid: BigInt = (a + b) >> 1
if (mid * mid - n > 0) {
b = mid - 1
} else {
a = mid + 1
}
}
a - 1
}
#tailrec
def loop(a: BigInt, rem: List[BigInt], res: List[(BigInt, BigInt)]): List[(BigInt, BigInt)] =
rem match {
case Nil => res
case head :: tail =>
val a30 = a * 30
val thisRes = rem.takeWhile(_ <= a30).map(b => (a, b))
loop(head, tail, thisRes.reverse ::: res)
}
val squares = (sqt(N1) to sqt(N2)).map(s => s * s).toList
loop(squares.head, squares.tail, Nil).reverse
}
Each recursion of the loop adds all the matching pairs for a given value of a. The result is built in reverse because adding to the front of a long list is much faster than adding to the tail.
Firstly create a function to check if number if perfect square or not.
def squareRootOfPerfectSquare(a: Int): Option[Int] = {
val sqrt = math.sqrt(a)
if (sqrt % 1 == 0)
Some(sqrt.toInt)
else
None
}
Then, create another func that will calculate this list of tuples according to the conditions mentioned above.
def generateTuples(n1:Int,n2:Int)={
for{
b <- 1 to n2;
a <- 1 to n1 if(b>a && squareRootOfPerfectSquare(b).isDefined && squareRootOfPerfectSquare(a).isDefined)
} yield ( (a,b) )
}
Then on calling the function with parameters generateTuples(5,10)
you will get an output as
res0: scala.collection.immutable.IndexedSeq[(Int, Int)] = Vector((1,4), (1,9), (4,9))
Hope that helps !!!

type mismatch in scala when using reduce

Can anybody help me understand what's wrong with the code below?
case class Point(x: Double, y: Double)
def centroid(points: IndexedSeq[Point]): Point = {
val x = points.reduce(_.x + _.x)
val y = points.reduce(_.y + _.y)
val len = points.length
Point(x/len, y/len)
}
I get the error when I run it:
Error:(10, 30) type mismatch;
found : Double
required: A$A145.this.Point
val x = points.reduce(_.x + _.x)
^
reduce, in this case, takes a function of type (Point, Point) => Point and returns a Point.
One way to calculate the centroid:
case class Point(x: Double, y: Double)
def centroid(points: IndexedSeq[Point]): Point = {
val x = points.map(_.x).sum
val y = points.map(_.y).sum
val len = points.length
Point(x/len, y/len)
}
If you want to use reduce you need to reduce both x and y in a single pass like this
def centroid(points: IndexedSeq[Point]): Point = {
val p = points.reduce( (s, p) => Point(s.x + p.x, s.y + p.y) )
val len = points.length
Point(p.x/len, p.y/len)
}
If you want to compute x and y independently then use foldLeft rather than reduce like this
def centroid(points: IndexedSeq[Point]): Point = {
val x = points.foldLeft(0.0)(_ + _.x)
val y = points.foldLeft(0.0)(_ + _.y)
val len = points.length
Point(x/len, y/len)
}
This is perhaps clearer but does process the points twice so it may be marginally less efficient.

Scala Mismatch MonteCarlo

I try to implement a version of the Monte Carlo algorithm in Scala but i have a little problem.
In my first loop, i have a mismatch with Unit and Int, but I didn't know how to slove this.
Thank for your help !
import scala.math._
import scala.util.Random
import scala.collection.mutable.ListBuffer
object Main extends App{
def MonteCarlo(list: ListBuffer[Int]): List[Int] = {
for (i <- list) {
var c = 0.00
val X = new Random
val Y = new Random
for (j <- 0 until i) {
val x = X.nextDouble // in [0,1]
val y = Y.nextDouble // in [0,1]
if (x * x + y * y < 1) {
c = c + 1
}
}
c = c * 4
var p = c / i
var error = abs(Pi-p)
print("Approximative value of pi : $p \tError: $error")
}
}
var liste = ListBuffer (200, 2000, 4000)
MonteCarlo(liste)
}
A guy working usually with Python.
for loop does not return anything, so that's why your method returns Unit but expects List[Int] as return type is List[Int].
Second, you have not used scala interpolation correctly. It won't print the value of error. You forgot to use 's' before the string.
Third thing, if want to return list, you first need a list where you will accumulate all values of every iteration.
So i am assuming that you are trying to return error for all iterations. So i have created an errorList, which will store all values of error. If you want to return something else you can modify your code accordingly.
def MonteCarlo(list: ListBuffer[Int]) = {
val errorList = new ListBuffer[Double]()
for (i <- list) {
var c = 0.00
val X = new Random
val Y = new Random
for (j <- 0 until i) {
val x = X.nextDouble // in [0,1]
val y = Y.nextDouble // in [0,1]
if (x * x + y * y < 1) {
c = c + 1
}
}
c = c * 4
var p = c / i
var error = abs(Pi-p)
errorList += error
println(s"Approximative value of pi : $p \tError: $error")
}
errorList
}
scala> MonteCarlo(liste)
Approximative value of pi : 3.26 Error: 0.11840734641020667
Approximative value of pi : 3.12 Error: 0.02159265358979301
Approximative value of pi : 3.142 Error: 4.073464102067881E-4
res9: scala.collection.mutable.ListBuffer[Double] = ListBuffer(0.11840734641020667, 0.02159265358979301, 4.073464102067881E-4)

area under the curve programatically in scala

im trying to solve for the area under the curve of the example 1 of: http://tutorial.math.lamar.edu/Classes/CalcI/AreaProblem.aspx
f(x) = x^3 - 5x^2 + 6x + 5 and the x-axis n = 5
the answers says it is: 25.12
but i'm getting a slightly less: 23.78880035448074
what im i doing wrong??
here's my code:
import scala.math.BigDecimal.RoundingMode
def summation(low: Int, up: Int, coe: List[Int], ex: List[Int]) = {
def eva(coe: List[Int], ex: List[Int], x: Double) = {
(for (i <- 0 until coe.size) yield coe(i) * math.pow(x,ex(i))).sum
}
#annotation.tailrec
def build_points(del: Float, p: Int, xs : List[BigDecimal]): List[BigDecimal] = {
if(p <= 0 ) xs map { x => x.setScale(3, RoundingMode.HALF_EVEN)}
else build_points(del, p - 1, ((del * p):BigDecimal ):: xs)
}
val sub = 5
val diff = (up - low).toFloat
val deltaX = diff / sub
val points = build_points(deltaX, sub, List(0.0f)); println(points)
val middle_points =
(for (i <- 0 until points.size - 1) yield (points(i) + points(i + 1)) / 2)
(for (elem <- middle_points) yield deltaX * eva(coe,ex,elem.toDouble)).sum
}
val coe = List(1,-5,6,5)
val exp = List(3,2,1,0)
print(summation(0,4,coe,exp))
I'm guessing the problem is that the problem is build_points(deltaX, 5, List(0.0f)) returns a list with 6 elements instead of 5. The problem is that you are passing a list with one element in the beginning, where I'm guessing you wanted an empty list, like
build_points(deltaX, sub, Nil)

Counting by range

The following script can be used to "count by" keys
val nbr = List(1,2,2,3,3,3,4,4,4,4)
val nbrPairsRDD = sc.parallelize(nbr).map(nbr => (nbr, 1))
val nbrCountsWithReduce = nbrPairsRDD
.reduceByKey(_ + _)
.collect()
nbrCountsWithReduce.foreach(println)
it returns:
(1,1)
(2,2)
(3,3)
(4,4)
How could it be modified to map by range rather than absolute values and give the following output if we had two ranges 1:2 and 3:4:
(1:2,3)
(3:4,7)
One option is to convert the list into double and use the histogram function:
val nbr = List(1,2,2,3,3,3,4,4,4,4)
val nbrPairsRDD = sc.parallelize(nbr).map(_.toDouble).histogram(2)
One easy way that I can think of is to map the keys to individual ranges, for eg :
val nbrRangePairs = sc.parallelize(nbr)
.map(nbr => (computeRange(nbr), 1))
.reduceByKey(_ + _)
.collect()
// function to compute Ranges
def computeRange(num : int) : String =
{
if(num < 3)
return "1:2"
else if(num < 5)
return "2:3"
else
return "invalid"
}
Here is the code snippet to compute aggregations by range:
val nbr = List(1,2,2,3,3,3,4,4,4,4)
val nbrs = sc.parallelize(nbr)
var lb = 1
var incr = 1
var ub = lb + incr
val nbrsMap = nbrs.map(rec => {
if(rec > ub) {
lb = rec
ub = lb + incr
}
(lb.toString + ":" + ub.toString, 1)
})
nbrsMap.reduceByKey((acc, value) => acc + value).foreach(println)
(1:2,3)
(3:4,7)