Optimizing Flink transformation - Scala

I have the following method that computes the probability of a value in a DataSet:
/**
 * Compute the probabilities of each value in the given [[DataSet]].
 *
 * @param x single-column [[DataSet]]
 * @return sequence of probabilities, one per distinct value
 */
private[this] def probs(x: DataSet[Double]): Seq[Double] = {
  val counts = x.groupBy(_.doubleValue)
    .reduceGroup(_.size.toDouble)
    .name("X Probs")
    .collect
  val total = counts.sum
  counts.map(_ / total)
}
The problem is that when I submit my Flink job, which uses this method, Flink kills the job due to a task timeout. I am executing this method for each attribute of a DataSet with only 40,000 instances and 9 attributes.
Is there a way to make this code more efficient?
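One direction that may help (a rough sketch, not tested against the original code): since each call to probs triggers its own collect, and therefore its own job, the counts for all attributes could be computed in a single pass, assuming the data is first flattened into a DataSet[(attributeIndex, value)]:
// Sketch only: count (attribute, value) pairs once, collect a single result,
// and normalize locally. Assumes `data` is a DataSet[(Int, Double)] of
// (attributeIndex, value) pairs, which is not how the original code is structured.
private[this] def allProbs(data: DataSet[(Int, Double)]): Map[Int, Seq[Double]] = {
  val counts = data
    .map { case (attr, value) => (attr, value, 1L) }
    .groupBy(0, 1)     // group by (attribute, value)
    .sum(2)            // count occurrences of each pair
    .name("All Probs")
    .collect
  counts.groupBy(_._1).map { case (attr, rows) =>
    val total = rows.map(_._3).sum.toDouble
    attr -> rows.map(_._3 / total)
  }
}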
After a few tries, I made it work with mapPartition. This method is part of a class InformationTheory, which does some computations such as entropy and mutual information. So, for example, symmetrical uncertainty is computed like this:
/**
 * Computes 'symmetrical uncertainty' (SU) - a symmetric mutual information measure.
 *
 * It is defined as SU(X, Y) = 2 * (IG(X|Y) / (H(X) + H(Y)))
 *
 * @param xy [[DataSet]] with two features
 * @return SU value
 */
def symmetricalUncertainty(xy: DataSet[(Double, Double)]): Double = {
  val su = xy.mapPartitionWith {
    case in ⇒
      val x = in map (_._2)
      val y = in map (_._1)
      val mu = mutualInformation(x, y)
      val Hx = entropy(x)
      val Hy = entropy(y)
      Some(2 * mu / (Hx + Hy))
  }
  su.collect.head.head
}
With this, I can efficiently compute entropy, mutual information, etc. The catch is that it only works with a parallelism of 1; the problem resides in mapPartition.
Is there a way I could do something similar to what I am doing here with symmetricalUncertainty, but with any level of parallelism?

I finally did it. I don't know if it's the best solution, but it's working with n levels of parallelism:
def symmetricalUncertainty(xy: DataSet[(Double, Double)]): Double = {
  val su = xy.reduceGroup { in ⇒
    val invec = in.toVector
    val x = invec map (_._2)
    val y = invec map (_._1)
    val mu = mutualInformation(x, y)
    val Hx = entropy(x)
    val Hy = entropy(y)
    2 * mu / (Hx + Hy)
  }
  su.collect.head
}
You can check the entire code in InformationTheory.scala and its tests in InformationTheorySpec.scala.

Related

How to program a circle fit in Scala

I want to fit a circle to given 2D points in Scala.
Apache Commons Math has an example for this in Java, which I am trying to translate to Scala (without success, because my knowledge of Java is almost non-existent).
I took the example code from http://commons.apache.org/proper/commons-math/userguide/leastsquares.html (see the end of the page) and tried to translate it into Scala:
import org.apache.commons.math3.linear._
import org.apache.commons.math3.fitting._
import org.apache.commons.math3.fitting.leastsquares._
import org.apache.commons.math3.fitting.leastsquares.LeastSquaresOptimizer._
import org.apache.commons.math3._
import org.apache.commons.math3.geometry.euclidean.twod.Vector2D
import org.apache.commons.math3.util.Pair
import org.apache.commons.math3.fitting.leastsquares.LeastSquaresOptimizer.Optimum
def circleFitting: Unit = {
  val radius: Double = 70.0
  val observedPoints = Array(new Vector2D(30.0D, 68.0D), new Vector2D(50.0D, -6.0D), new Vector2D(110.0D, -20.0D), new Vector2D(35.0D, 15.0D), new Vector2D(45.0D, 97.0D))
  // the model function components are the distances to current estimated center,
  // they should be as close as possible to the specified radius
  val distancesToCurrentCenter = new MultivariateJacobianFunction() {
    //def value(point: RealVector): (RealVector, RealMatrix) = {
    def value(point: RealVector): Pair[RealVector, RealMatrix] = {
      val center = new Vector2D(point.getEntry(0), point.getEntry(1))
      val value: RealVector = new ArrayRealVector(observedPoints.length)
      val jacobian: RealMatrix = new Array2DRowRealMatrix(observedPoints.length, 2)
      for (i <- 0 to observedPoints.length) {
        var o = observedPoints(i)
        var modelI: Double = Vector2D.distance(o, center)
        value.setEntry(i, modelI)
        // derivative with respect to p0 = x center
        jacobian.setEntry(i, 0, (center.getX() - o.getX()) / modelI)
        // derivative with respect to p1 = y center
        jacobian.setEntry(i, 1, (center.getX() - o.getX()) / modelI)
      }
      new Pair(value, jacobian)
    }
  }
  // the target is to have all points at the specified radius from the center
  val prescribedDistances = Array.fill[Double](observedPoints.length)(radius)
  // least squares problem to solve : modeled radius should be close to target radius
  val problem: LeastSquaresProblem = new LeastSquaresBuilder().start(Array(100.0D, 50.0D)).model(distancesToCurrentCenter).target(prescribedDistances).maxEvaluations(1000).maxIterations(1000).build()
  val optimum: Optimum = new LevenbergMarquardtOptimizer().optimize(problem) //LeastSquaresOptimizer.Optimum
  val fittedCenter: Vector2D = new Vector2D(optimum.getPoint().getEntry(0), optimum.getPoint().getEntry(1))
  println("circle fitting was called!")
  println("CIRCLEFITTING: fitted center: " + fittedCenter.getX() + " " + fittedCenter.getY())
  println("CIRCLEFITTING: RMS: " + optimum.getRMS())
  println("CIRCLEFITTING: evaluations: " + optimum.getEvaluations())
  println("CIRCLEFITTING: iterations: " + optimum.getIterations())
}
This gives no compile errors, but crashes with:
Exception in thread "main" java.lang.NullPointerException
at org.apache.commons.math3.linear.EigenDecomposition.<init>(EigenDecomposition.java:119)
at org.apache.commons.math3.fitting.leastsquares.LeastSquaresFactory.squareRoot(LeastSquaresFactory.java:245)
at org.apache.commons.math3.fitting.leastsquares.LeastSquaresFactory.weightMatrix(LeastSquaresFactory.java:155)
at org.apache.commons.math3.fitting.leastsquares.LeastSquaresFactory.create(LeastSquaresFactory.java:95)
at org.apache.commons.math3.fitting.leastsquares.LeastSquaresBuilder.build(LeastSquaresBuilder.java:59)
at twoDhotScan.FittingFunctions$.circleFitting(FittingFunctions.scala:49)
at twoDhotScan.Main$.delayedEndpoint$twoDhotScan$Main$1(hotScan.scala:14)
at twoDhotScan.Main$delayedInit$body.apply(hotScan.scala:11)
at scala.Function0.apply$mcV$sp(Function0.scala:34)
at scala.Function0.apply$mcV$sp$(Function0.scala:34)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App.$anonfun$main$1$adapted(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:389)
at scala.App.main(App.scala:76)
at scala.App.main$(App.scala:74)
at twoDhotScan.Main$.main(hotScan.scala:11)
at twoDhotScan.Main.main(hotScan.scala)
I guess the problem is somewhere in the definition of the function distancesToCurrentCenter. I don't even know if this MultivariateJacobianFunction is supposed to be a real function, an object, or whatever.
After some long fiddling with the code, I got it running.
The NullPointerException was gone after I updated apache-commons-math3 from version 3.3 to version 3.6.1 in my build.sbt file. I don't know if I forgot a parameter or if it was a bug. There were also two bugs in the example on the apache-commons-math website: they used a .getX operator twice where it should have been a .getY.
So here is a running example for a circle fit with known radius:
import org.apache.commons.math3.analysis.{ MultivariateVectorFunction, MultivariateMatrixFunction }
import org.apache.commons.math3.fitting.leastsquares.LeastSquaresOptimizer.Optimum
import org.apache.commons.math3.fitting.leastsquares.{ MultivariateJacobianFunction, LeastSquaresProblem, LeastSquaresBuilder, LevenbergMarquardtOptimizer }
import org.apache.commons.math3.geometry.euclidean.twod.Vector2D
import org.apache.commons.math3.linear.{ Array2DRowRealMatrix, RealMatrix, RealVector, ArrayRealVector }

object Main extends App {
  val radius: Double = 20.0
  val pointsList: List[(Double, Double)] = List(
    (18.36921795, 10.71416674),
    (0.21196357, -22.46528791),
    (-4.153845171, -14.75588526),
    (3.784114125, -25.55910336),
    (31.32998899, 2.546924253),
    (34.61542186, -12.90323269),
    (19.30193011, -28.53185596),
    (16.05620863, 10.97209111),
    (31.67011956, -20.05020878),
    (19.91175561, -28.38748712))
  /*******************************************************************************
   ***** Random values on a circle with centerX=15, centerY=-9 and radius 20 *****
   *******************************************************************************/
  val observedPoints: Array[Vector2D] = (pointsList map { case (x, y) => new Vector2D(x, y) }).toArray

  val vectorFunktion: MultivariateVectorFunction = new MultivariateVectorFunction {
    def value(variables: Array[Double]): Array[Double] = {
      val center = new Vector2D(variables(0), variables(1))
      observedPoints map { p: Vector2D => Vector2D.distance(p, center) }
    }
  }

  val matrixFunction = new MultivariateMatrixFunction {
    def value(variables: Array[Double]): Array[Array[Double]] = {
      val center = new Vector2D(variables(0), variables(1))
      (observedPoints map { p: Vector2D => Array((center.getX - p.getX) / Vector2D.distance(p, center), (center.getY - p.getY) / Vector2D.distance(p, center)) })
    }
  }

  // the target is to have all points at the specified radius from the center
  val prescribedDistances = Array.fill[Double](observedPoints.length)(radius)
  // least squares problem to solve : modeled radius should be close to target radius
  val problem = new LeastSquaresBuilder().start(Array(100.0D, 50.0D)).model(vectorFunktion, matrixFunction).target(prescribedDistances).maxEvaluations(25).maxIterations(25).build
  val optimum: Optimum = new LevenbergMarquardtOptimizer().optimize(problem)
  val fittedCenter: Vector2D = new Vector2D(optimum.getPoint.getEntry(0), optimum.getPoint.getEntry(1))
  println("Results of the LeastSquaresBuilder:")
  println("CIRCLEFITTING: fitted center: " + fittedCenter.getX + " " + fittedCenter.getY)
  println("CIRCLEFITTING: RMS: " + optimum.getRMS)
  println("CIRCLEFITTING: evaluations: " + optimum.getEvaluations)
  println("CIRCLEFITTING: iterations: " + optimum.getIterations + "\n")
}
Tested on Scala version 2.12.6, compiled with sbt version 1.2.8
Does anybody know how to do this without a fixed radius?
After some research on circle fitting, I found a wonderful algorithm in the paper "Error analysis for circle fitting algorithms" by H. Al-Sharadqah and N. Chernov (available here: http://people.cas.uab.edu/~mosya/cl/).
I implemented it in Scala:
import org.apache.commons.math3.linear.{ Array2DRowRealMatrix, RealMatrix, RealVector, LUDecomposition, EigenDecomposition }

object circleFitFunction {

  def circleFit(dataXY: List[(Double, Double)]) = {

    def square(x: Double): Double = x * x
    def multiply(pair: (Double, Double)): Double = pair._1 * pair._2

    val n: Int = dataXY.length
    val (xi, yi) = dataXY.unzip
    //val S: Double = math.sqrt(((xi map square) ++ yi map square).sum / n)
    val zi: List[Double] = dataXY map { case (x, y) => x * x + y * y }
    val x: Double = xi.sum / n
    val y: Double = yi.sum / n
    val z: Double = ((xi map square) ++ (yi map square)).sum / n
    val zz: Double = (zi map square).sum / n
    val xx: Double = (xi map square).sum / n
    val yy: Double = (yi map square).sum / n
    val xy: Double = ((xi zip yi) map multiply).sum / n
    val zx: Double = ((zi zip xi) map multiply).sum / n
    val zy: Double = ((zi zip yi) map multiply).sum / n

    val N: RealMatrix = new Array2DRowRealMatrix(Array(
      Array(8 * z, 4 * x, 4 * y, 2),
      Array(4 * x, 1, 0, 0),
      Array(4 * y, 0, 1, 0),
      Array(2.0D, 0, 0, 0)))

    val M: RealMatrix = new Array2DRowRealMatrix(Array(
      Array(zz, zx, zy, z),
      Array(zx, xx, xy, x),
      Array(zy, xy, yy, y),
      Array(z, x, y, 1.0D)))

    val Ninverse = new LUDecomposition(N).getSolver().getInverse()
    val eigenValueProblem = new EigenDecomposition(Ninverse.multiply(M))
    // Get all eigenvalues
    // As we need only the smallest positive eigenvalue, all negative eigenvalues are replaced by Double.MaxValue
    val eigenvalues: Array[Double] = eigenValueProblem.getRealEigenvalues() map (lambda => if (lambda < 0) Double.MaxValue else lambda)
    // Now get the index of the smallest positive eigenvalue, to get the associated eigenvector
    val i: Int = eigenvalues.zipWithIndex.min._2
    val eigenvector: RealVector = eigenValueProblem.getEigenvector(i)

    val A = eigenvector.getEntry(0)
    val B = eigenvector.getEntry(1)
    val C = eigenvector.getEntry(2)
    val D = eigenvector.getEntry(3)

    val centerX: Double = -B / (2 * A)
    val centerY: Double = -C / (2 * A)
    val Radius: Double = math.sqrt((B * B + C * C - 4 * A * D) / (4 * A * A))
    val RMS: Double = (dataXY map { case (x, y) => (Radius - math.sqrt((x - centerX) * (x - centerX) + (y - centerY) * (y - centerY))) } map square).sum / n
    (centerX, centerY, Radius, RMS)
  }
}
I kept all the names from the paper (see chapters 4 and 8 and look for the Hyperfit algorithm) and tried to limit the matrix operations.
It's still not what I need, because this sort of algorithm (algebraic fit) has known issues with fitting partial circles (arcs) and perhaps large circles.
With my data, I once had the situation that it spit out completely wrong results, and I found out that I had an eigenvalue of -0.1...
The eigenvector of this value produced the right result, but it was sorted out because of the negative eigenvalue. So this one is not always stable (like so many other circle fitting algorithms).
But what a nice algorithm! It looks a bit like dark magic to me.
If someone does not need too much precision but needs a lot of speed (and has data from a full, not too big circle), this would be my choice.
The next thing I will try is to implement a Levenberg-Marquardt algorithm from the same page I mentioned above.
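For reference, a possible starting point for that geometric fit (a sketch only, reusing the commons-math least-squares API from the answer above and the observedPoints array defined there): treat the radius as a third free parameter, so each residual is the distance to the center minus the radius, and the target is all zeros.
import org.apache.commons.math3.fitting.leastsquares.{ LeastSquaresBuilder, LevenbergMarquardtOptimizer, MultivariateJacobianFunction }
import org.apache.commons.math3.geometry.euclidean.twod.Vector2D
import org.apache.commons.math3.linear.{ Array2DRowRealMatrix, ArrayRealVector, RealMatrix, RealVector }
import org.apache.commons.math3.util.Pair

// Sketch: parameters are (centerX, centerY, radius); residual_i = |p_i - center| - radius.
val freeRadiusModel = new MultivariateJacobianFunction {
  def value(point: RealVector): Pair[RealVector, RealMatrix] = {
    val center = new Vector2D(point.getEntry(0), point.getEntry(1))
    val r = point.getEntry(2)
    val value: RealVector = new ArrayRealVector(observedPoints.length)
    val jacobian: RealMatrix = new Array2DRowRealMatrix(observedPoints.length, 3)
    for (i <- observedPoints.indices) {
      val d = Vector2D.distance(observedPoints(i), center)
      value.setEntry(i, d - r)
      jacobian.setEntry(i, 0, (center.getX - observedPoints(i).getX) / d) // d(residual)/d(centerX)
      jacobian.setEntry(i, 1, (center.getY - observedPoints(i).getY) / d) // d(residual)/d(centerY)
      jacobian.setEntry(i, 2, -1.0)                                       // d(residual)/d(radius)
    }
    new Pair(value, jacobian)
  }
}
val freeRadiusProblem = new LeastSquaresBuilder()
  .start(Array(0.0, 0.0, 10.0)) // rough initial guess for (centerX, centerY, radius)
  .model(freeRadiusModel)
  .target(Array.fill[Double](observedPoints.length)(0.0))
  .maxEvaluations(1000)
  .maxIterations(1000)
  .build()
val freeRadiusOptimum = new LevenbergMarquardtOptimizer().optimize(freeRadiusProblem)
println("center: " + freeRadiusOptimum.getPoint.getEntry(0) + ", " + freeRadiusOptimum.getPoint.getEntry(1)
  + ", radius: " + freeRadiusOptimum.getPoint.getEntry(2))
This is the standard geometric least-squares formulation; it needs a sensible initial guess, and the algebraic fit above could provide the starting values for the center and radius.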

How can I divide an RDD into a specific number of RDDs

I have the below code, which generates an RDD from a text file:
val data = sparkContext.textFile(path)
val k = 3
How can I divide data into k unique RDDs?
You can use RDD.randomSplit, which will divide the existing RDD based on the weights passed as parameters and return an Array of RDDs.
Internally it works like this:
/**
 * Randomly splits this RDD with the provided weights.
 *
 * @param weights weights for splits, will be normalized if they don't sum to 1
 * @param seed random seed
 *
 * @return split RDDs in an array
 */
def randomSplit(weights: Array[Double], seed: Long = Utils.random.nextLong): Array[RDD[T]] = {
  require(weights.forall(_ >= 0),
    s"Weights must be nonnegative, but got ${weights.mkString("[", ",", "]")}")
  require(weights.sum > 0,
    s"Sum of weights must be positive, but got ${weights.mkString("[", ",", "]")}")
  withScope {
    val sum = weights.sum
    val normalizedCumWeights = weights.map(_ / sum).scanLeft(0.0d)(_ + _)
    normalizedCumWeights.sliding(2).map { x =>
      randomSampleWithRange(x(0), x(1), seed)
    }.toArray
  }
}
NOTE: the weights for the splits will be normalized if they don't sum to 1.
Based on the above behavior, I created a sample snippet like the one below, which was working:
def getDoubleWeights(numparts: Int): Array[Double] = {
  Array.fill[Double](numparts)(1.0d)
}
The caller would look like this:
val rddWithNumParts = yourRDD.randomSplit(getDoubleWeights(yourRDD.partitions.length))
This will divide the data uniformly into that number of RDDs.
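Applied to the question's example (a small sketch, assuming data and k = 3 from above), splitting with equal weights gives three RDDs:
// Equal weights, so each split gets roughly a third of the data.
val Array(rdd1, rdd2, rdd3) = data.randomSplit(Array(1.0, 1.0, 1.0))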
NOTE: the same applies to DataFrame.randomSplit below as well.
You can also convert the RDD into a DataFrame by giving it a schema, e.g. sqlContext.createDataFrame(rddOfRow, Schema).
Later you can call this method:
DataFrame[] randomSplit(double[] weights) - Randomly splits this DataFrame with the provided weights.
Another thought I had is dividing based on the number of partitions, i.e. RDD.mapPartitionsWithIndex(...).
For each partition you have an Iterator (which can be converted into an RDD); you can then have number of partitions = number of RDDs.
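A rough sketch of that idea (hypothetical code, assuming data from the question has first been repartitioned to k partitions): build one RDD per partition index by keeping only that partition's iterator.
// One RDD per partition index; every other partition contributes nothing.
val partitioned = data.repartition(k)
val rdds: Seq[RDD[String]] = (0 until k).map { idx =>
  partitioned.mapPartitionsWithIndex((i, iter) => if (i == idx) iter else Iterator.empty)
}
Note that each of the k resulting RDDs still scans all partitions of partitioned when evaluated, so randomSplit is usually the cheaper option.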

Linear operations with slices in Breeze

Is it somehow possible to do sliced updates on matrices in Breeze? The compiler cannot find an implicit value for parameter op.
This is with Breeze 0.11.2.
val idxs = Seq(0, 1)
val x = DenseMatrix.rand(3, 3)
val y = DenseMatrix.rand(3, 3)
x(idxs, idxs) += y(idxs, idxs) // can't find an implicit conversion for += here
Analogous code with DenseVectors works properly:
val xv = DenseVector.rand(3)
val yv = DenseVector.rand(3)
xv(idxs) += yv(idxs)
There is an ugly workaround that updates the matrix one column slice at a time:
val idxs = IndexedSeq(0, 1)
val x: DenseMatrix[Double] = DenseMatrix.zeros(3, 3)
val y = DenseMatrix.rand(3, 3)
for (r <- idxs) {
  val slx = x(::, r)
  val sly = y(::, r)
  slx(idxs) += sly(idxs)
}
It's an oversight. Please open an issue on github.
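In the meantime, another manual workaround (just a sketch, not a Breeze feature) is to update the selected entries element by element, using the idxs, x and y from the question:
// Element-wise update of the selected entries.
for (i <- idxs; j <- idxs) {
  x(i, j) += y(i, j)
}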

Large matrix operations: Multiplication in Scala/Apache Spark

I need to multiply two large matrices, X and Y. Typically X has ~500K rows and ~18K columns and Y has ~18K rows and ~18K columns. The matrix X is expected to be sparse and the matrix Y is expected to be sparse/dense. What is the ideal way of performing this multiplication in Scala/Apache Spark?
I got some code for you. It represents a matrix as an array of column vectors (which means each entry in the array is a column, not a row). It takes about 0.7 s to multiply two 1000*1000 matrices, 11 minutes for two 10,000*10,000 matrices, 1.5 hours for 20,000*20,000, and 30 hours for (500k*18k) times (18k*18k). But if you run it in parallel (by using the code that's commented out) it should run about 2 to 3 times faster (on a 4-core CPU). Remember that the number of columns in the first matrix always has to be the same as the number of rows in the second.
class Matrix(val columnVectors: Array[Array[Double]]) {
  val columns = columnVectors.size
  val rows = columnVectors.head.size

  def *(v: Array[Double]): Array[Double] = {
    val newValues = Array.ofDim[Double](rows)
    var col = 0
    while (col < columns) {
      val n = v(col)
      val column = columnVectors(col)
      var row = 0
      while (row < newValues.size) {
        newValues(row) += column(row) * n
        row += 1
      }
      col += 1
    }
    newValues
  }

  def *(other: Matrix): Matrix = {
    //do the calculation on only one cpu
    new Matrix(other.columnVectors.map(col => this * col))
    //do the calculation in parallel on all available cpus
    //new Matrix(other.columnVectors.par.map(col => this * col).toArray)
  }

  override def toString = {
    columnVectors.transpose.map(_.mkString(", ")).mkString("\n")
  }
}
Edit:
OK, here is a better version. I now store the row vectors in the matrix instead of the column vectors. That makes it easier to optimize the multiplication for the case where the first matrix is sparse.
I also added a lazy version of the matrix multiplication using iterators. Since the first matrix is 500k * 18k = 9 billion numbers, such a lazy version will allow you to do that multiplication without requiring much RAM. You just have to create an Iterator that can read the rows lazily, e.g. from a database, and then write the rows from the resulting iterator back.
import scala.collection.Iterator
import scala.util.{Random => rand}

def time[T](descr: String)(f: => T): T = {
  val start = System.nanoTime
  val r = f
  val end = System.nanoTime
  val time = (end - start) / 1e6
  println(descr + ": time = " + time + "ms")
  r
}

object Matrix {
  def mulLazy(m1: Iterator[Array[Double]], m2: Matrix): Iterator[Array[Double]] = {
    m1.grouped(8).map { group =>
      group.par.map(m2.mulRow).toIterator
    }.flatten
  }
}

class Matrix(val rowVectors: Array[Array[Double]]) {
  val columns = rowVectors.head.size
  val rows = rowVectors.size

  private def mulRow(otherRow: Array[Double]): Array[Double] = {
    val rowVectors = this.rowVectors
    val result = Array.ofDim[Double](columns)
    var i = 0
    while (i < otherRow.size) {
      val value = otherRow(i)
      if (value != 0) { //optimization for sparse matrix
        val row = rowVectors(i)
        var col = 0
        while (col < result.size) {
          result(col) += value * row(col)
          col += 1
        }
      }
      i += 1
    }
    result
  }

  def *(other: Matrix): Matrix = {
    new Matrix(rowVectors.par.map(other.mulRow).toArray)
  }

  def equals(other: Matrix): Boolean = {
    java.util.Arrays.deepEquals(this.rowVectors.asInstanceOf[Array[Object]], other.rowVectors.asInstanceOf[Array[Object]])
  }

  override def equals(other: Any): Boolean = {
    if (other.isInstanceOf[Matrix]) equals(other.asInstanceOf[Matrix]) else false
  }

  override def toString = {
    rowVectors.map(_.mkString(", ")).mkString("\n")
  }
}

def randMatrix(rows: Int, columns: Int): Matrix = {
  new Matrix((1 to rows).map(_ => Array.fill(columns)(rand.nextDouble * 100)).toArray)
}

def sparseRandMatrix(rows: Int, columns: Int, ratio: Double): Matrix = {
  new Matrix((1 to rows).map(_ => Array.fill(columns)(if (rand.nextDouble > ratio) 0 else rand.nextDouble * 100)).toArray)
}

val N = 2000
val m1 = sparseRandMatrix(N, N, 0.1) // only 10% of the numbers will be different from 0
val m2 = randMatrix(N, N)
val m3 = m1.rowVectors.toIterator

val m12 = time("m1 * m2")(m1 * m2)
val m32 = time("m3 * m2")(Matrix.mulLazy(m3, m2)) //doesn't take much time because the matrix multiplication is lazy

println(m32)
println("m12 == m32 = " + (new Matrix(m32.toArray) == m12))

SparkPi running slow with more than 1 slice

I am relatively new to Spark and have tried running the SparkPi example on a standalone three-machine cluster with 12 cores. What I fail to understand is that running this example with a single slice gives better performance than using 12 slices. The same was the case when I was using the parallelize function; the time scales almost linearly with each added slice. Please let me know if I'm doing anything wrong. The code snippet is given below:
val spark = new SparkContext("spark://telecom:7077", "SparkPi",
  System.getenv("SPARK_HOME"), List("target/scala-2.10/sparkpii_2.10-1.0.jar"))
val slices = 1
val n = 10000000 * slices
val count = spark.parallelize(1 to n, slices).map { i =>
  val x = random * 2 - 1
  val y = random * 2 - 1
  if (x * x + y * y < 1) 1 else 0
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * count / n)
spark.stop()
Update: the problem was with the random function; since it is a synchronized method, it couldn't scale to multiple cores.
The random function used in the SparkPi example is a synchronized method and can't scale to multiple cores. It's an easy enough example to deploy on your cluster, but don't use it to check Spark's performance and scalability.
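A sketch of one common workaround (not taken from the original example): create a separate random number generator per partition inside mapPartitions, so tasks never share a synchronized instance. It reuses spark, n and slices from the question's snippet.
// Per-partition RNG: no contention on a shared, synchronized Random.
val count = spark.parallelize(1 to n, slices).mapPartitions { iter =>
  val rng = new java.util.Random()
  iter.map { _ =>
    val x = rng.nextDouble() * 2 - 1
    val y = rng.nextDouble() * 2 - 1
    if (x * x + y * y < 1) 1 else 0
  }
}.reduce(_ + _)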
As Ahsan mentioned in his answer, the problem was with 'scala.math.random'.
I replaced it with 'org.apache.spark.util.random.XORShiftRandom', and now using multiple processors makes the Pi calculation run much faster.
Below is my code, which is a modified version of the SparkPi example from the Spark distribution:
// scalastyle:off println
package org.apache.spark.examples

import org.apache.spark.util.random.XORShiftRandom
import org.apache.spark._

/** Computes an approximation to pi */
object SparkPi {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Spark Pi").setMaster(args(0))
    val spark = new SparkContext(conf)
    val slices = if (args.length > 1) args(1).toInt else 2
    val n = math.min(100000000L * slices, Int.MaxValue).toInt // avoid overflow
    val rand = new XORShiftRandom()
    val count = spark.parallelize(1 until n, slices).map { i =>
      val x = rand.nextDouble * 2 - 1
      val y = rand.nextDouble * 2 - 1
      if (x*x + y*y < 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / n)
    spark.stop()
  }
}
// scalastyle:on println
When I run the program above using one core with the parameters 'local[1] 16', it takes about 60 seconds on my laptop. The same program using 8 cores ('local[*] 16') takes 17 seconds.