Apache Spark - Two-sample Kolmogorov-Smirnov Test - scala

I have two sets of data (let's call them d1, d2) in Spark. I would like to perform a Two-sample Kolmogorov-Smirnov test, to test wether their underlying poplation distribution function is different. Can MLLib's Statistics.kolmogorovSmirnovTest do this?
The documentation provides this example:
import org.apache.spark.mllib.stat.Statistics
val data: RDD[Double] = ... // an RDD of sample data
// perform a KS test using a cumulative distribution function of our making
val myCDF: Double => Double = ...
val testResult2 = Statistics.kolmogorovSmirnovTest(data, myCDF)
I tried computing the empirical cumulative distribution function of d2 (collecting it as Map) and comparing it with d1.
Statistics.kolmogorovSmirnovTest(d1, ecdf_map)
The test runs, but the results are wrong.
Am I doing something wrong? Is it possible to do this? Any ideas?
Thank you for the help!

In Spark Mllib KolmogorovSmirnovTest is one-sampled and two-sided. So if you want specificly two-sampled variant it's not possible within this library. However, you can still compare datasets by calculating empirical cumulative distribution function (I found a library to do that so I'll update this answer if the results will be any good) or using deviations from normal distribution. In this example I'll go with the later.
Comparing datasets by KST statistics against normal distribution
For the purposes of this testing I generated 3 distributions: 2 triangular that look similar and an exponential one to show big difference in stats.
Note:
I couldn't find any scientific papers describing this method as viable for distribution comparison so the idea is mostly empirical.
For every distribution you most definetely could find a mirrored one with the same global maximum distance between its CDF and normal distribution.
Next step was to get KS results against normal distribution with given mean and standart deviation. I visualized them to get a better picture:
As you can see, results (KS statistics and p-value) for triangual distributions are close to each other while exponential one is way off. As I stated in the note you could easily fool this method by mirroring dataset but for a real world data it could be ok.
import org.apache.spark._
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.stat.Statistics
import org.apache.commons.math3.distribution.{ ExponentialDistribution, TriangularDistribution }
import breeze.plot._
import breeze.linalg._
import breeze.numerics._
object Main {
def main( args: Array[ String ] ): Unit = {
val conf =
new SparkConf()
.setAppName( "SO Spark" )
.setMaster( "local[*]" )
.set( "spark.driver.host", "localhost" )
val sc = new SparkContext( conf )
// Create similar distributions
val triDist1 = new TriangularDistribution( -3, 5, 7 )
val triDist2 = new TriangularDistribution( -3, 7, 7 )
// Exponential distribution to show big difference
val expDist1 = new ExponentialDistribution( 0.6 )
// Sample data from the distributions and parallelize it
val n = 100000
val sampledTriDist1 = sc.parallelize( triDist1.sample( n ) )
val sampledTriDist2 = sc.parallelize( triDist2.sample( n ) )
val sampledExpDist1 = sc.parallelize( expDist1.sample( n ) )
// KS tests
val resultTriDist1 = Statistics
.kolmogorovSmirnovTest( sampledTriDist1,
"norm",
sampledTriDist1.mean,
sampledTriDist1.stdev )
val resultTriDist2 = Statistics
.kolmogorovSmirnovTest( sampledTriDist2,
"norm",
sampledTriDist2.mean,
sampledTriDist2.stdev )
val resultExpDist1 = Statistics
.kolmogorovSmirnovTest( sampledExpDist1,
"norm",
sampledExpDist1.mean,
sampledExpDist1.stdev )
// Results
val statsTriDist1 =
"Tri1: ( " +
resultTriDist1.statistic +
", " +
resultTriDist1.pValue +
" )"
val statsTriDist2 =
"Tri2: ( " +
resultTriDist2.statistic +
", " +
resultTriDist2.pValue +
" )"
val statsExpDist1 =
"Exp1: ( " +
resultExpDist1.statistic +
", " +
resultExpDist1.pValue +
" )"
println( statsTriDist1 )
println( statsTriDist2 )
println( statsExpDist1 )
// Visualize
val graphCanvas = Figure()
val mainPlot =
graphCanvas
.subplot( 0 )
mainPlot.legend = true
val x = linspace( 1, n, n )
mainPlot += plot( x,
sampledTriDist1.sortBy( x => x ).take( n ),
name = statsTriDist1 )
mainPlot += plot( x,
sampledTriDist2.sortBy( x => x ).take( n ),
name = statsTriDist2 )
mainPlot += plot( x,
sampledExpDist1.sortBy( x => x ).take( n ),
name = statsExpDist1 )
mainPlot.xlabel = "x"
mainPlot.ylabel = "sorted sample"
mainPlot.title = "KS results for 2 Triangular and 1 Exponential Distributions"
graphCanvas.saveas( "ks-sample.png", 300 )
sc.stop()
}
}

Related

Mean Squared Error (MSE) returns a huge number

I'm new in Scala and Spark in general. I'm using this code for Regression (based on this link Spark official site):
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.LinearRegressionModel
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
// Load and parse the data
val data = sc.textFile("Year100")
val parsedData = data.map { line =>
val parts = line.split(',')
LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()
// Building the model
val numIterations = 100
val stepSize = 0.00000001
val model = LinearRegressionWithSGD.train(parsedData, numIterations,stepSize )
// Evaluate model on training examples and compute training error
val valuesAndPreds = parsedData.map { point =>
val prediction = model.predict(point.features)
(point.label, prediction)
}
val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()
println("training Mean Squared Error = " + MSE)
The dataset that I'm using can be seen here: Pastebin link.
So my question is: why MSE equals as 889717.74 (which is a huge number)?
Edit: As the commentators suggested, I tried these:
1) I changed the step to default and the MSE now returns as NaN
2) If I try this constructor:
LinearRegressionWithSGD.train(parsedData, numIterations,stepSize,intercept=True) the spark-shell returns an error (error: not found:value True)
You've passed a tiny step size and capped the number of iterations at 100. The maximum value by which your parameters can change is 0.00000001 * 100 = 0.000001. Try using the default step size, I imagine that will fix it.

How to compute cumulative sum using Spark

I have an rdd of (String,Int) which is sorted by key
val data = Array(("c1",6), ("c2",3),("c3",4))
val rdd = sc.parallelize(data).sortByKey
Now I want to start the value for the first key with zero and the subsequent keys as sum of the previous keys.
Eg: c1 = 0 , c2 = c1's value , c3 = (c1 value +c2 value) , c4 = (c1+..+c3 value)
expected output:
(c1,0), (c2,6), (c3,9)...
Is it possible to achieve this ?
I tried it with map but the sum is not preserved inside the map.
var sum = 0 ;
val t = keycount.map{ x => { val temp = sum; sum = sum + x._2 ; (x._1,temp); }}
Compute partial results for each partition:
val partials = rdd.mapPartitionsWithIndex((i, iter) => {
val (keys, values) = iter.toSeq.unzip
val sums = values.scanLeft(0)(_ + _)
Iterator((keys.zip(sums.tail), sums.last))
})
Collect partials sums
val partialSums = partials.values.collect
Compute cumulative sum over partitions and broadcast it:
val sumMap = sc.broadcast(
(0 until rdd.partitions.size)
.zip(partialSums.scanLeft(0)(_ + _))
.toMap
)
Compute final results:
val result = partials.keys.mapPartitionsWithIndex((i, iter) => {
val offset = sumMap.value(i)
if (iter.isEmpty) Iterator()
else iter.next.map{case (k, v) => (k, v + offset)}.toIterator
})
Spark has buit-in supports for hive ANALYTICS/WINDOWING functions and the cumulative sum could be achieved easily using ANALYTICS functions.
Hive wiki ANALYTICS/WINDOWING functions.
Example:
Assuming you have sqlContext object-
val datardd = sqlContext.sparkContext.parallelize(Seq(("a",1),("b",2), ("c",3),("d",4),("d",5),("d",6)))
import sqlContext.implicits._
//Register as test table
datardd.toDF("id","val").createOrReplaceTempView("test")
//Calculate Cumulative sum
sqlContext.sql("select id,val, " +
"SUM(val) over ( order by id rows between unbounded preceding and current row ) cumulative_Sum " +
"from test").show()
This approach cause to below warning. In case executor runs outOfMemory, tune job’s memory parameters accordingly to work with huge dataset.
WARN WindowExec: No Partition Defined for Window operation! Moving
all data to a single partition, this can cause serious performance
degradation
I hope this helps.
Here is a solution in PySpark. Internally it's essentially the same as #zero323's Scala solution, but it provides a general-purpose function with a Spark-like API.
import numpy as np
def cumsum(rdd, get_summand):
"""Given an ordered rdd of items, computes cumulative sum of
get_summand(row), where row is an item in the RDD.
"""
def cumsum_in_partition(iter_rows):
total = 0
for row in iter_rows:
total += get_summand(row)
yield (total, row)
rdd = rdd.mapPartitions(cumsum_in_partition)
def last_partition_value(iter_rows):
final = None
for cumsum, row in iter_rows:
final = cumsum
return (final,)
partition_sums = rdd.mapPartitions(last_partition_value).collect()
partition_cumsums = list(np.cumsum(partition_sums))
partition_cumsums = [0] + partition_cumsums
partition_cumsums = sc.broadcast(partition_cumsums)
def add_sums_of_previous_partitions(idx, iter_rows):
return ((cumsum + partition_cumsums.value[idx], row)
for cumsum, row in iter_rows)
rdd = rdd.mapPartitionsWithIndex(add_sums_of_previous_partitions)
return rdd
# test for correctness by summing numbers, with and without Spark
rdd = sc.range(10000,numSlices=10).sortBy(lambda x: x)
cumsums, values = zip(*cumsum(rdd,lambda x: x).collect())
assert all(cumsums == np.cumsum(values))
I came across a similar problem and implemented #Paul 's solution. I wanted to do cumsum on a integer frequency table sorted by key(the integer), and there was a minor problem with np.cumsum(partition_sums), error being unsupported operand type(s) for +=: 'int' and 'NoneType'.
Because if the range is big enough, the probability of each partition having something is thus big enough(no None values). However, if the range is much smaller than count, and number of partitions remains the same, some of the partitions would be empty. Here comes the modified solution:
def cumsum(rdd, get_summand):
"""Given an ordered rdd of items, computes cumulative sum of
get_summand(row), where row is an item in the RDD.
"""
def cumsum_in_partition(iter_rows):
total = 0
for row in iter_rows:
total += get_summand(row)
yield (total, row)
rdd = rdd.mapPartitions(cumsum_in_partition)
def last_partition_value(iter_rows):
final = None
for cumsum, row in iter_rows:
final = cumsum
return (final,)
partition_sums = rdd.mapPartitions(last_partition_value).collect()
# partition_cumsums = list(np.cumsum(partition_sums))
#----from here are the changed lines
partition_sums = [x if x is not None else 0 for x in partition_sums]
temp = np.cumsum(partition_sums)
partition_cumsums = list(temp)
#----
partition_cumsums = [0] + partition_cumsums
partition_cumsums = sc.broadcast(partition_cumsums)
def add_sums_of_previous_partitions(idx, iter_rows):
return ((cumsum + partition_cumsums.value[idx], row)
for cumsum, row in iter_rows)
rdd = rdd.mapPartitionsWithIndex(add_sums_of_previous_partitions)
return rdd
#test on random integer frequency
x = np.random.randint(10, size=1000)
D = sqlCtx.createDataFrame(pd.DataFrame(x.tolist(),columns=['D']))
c = D.groupBy('D').count().orderBy('D')
c_rdd = c.rdd.map(lambda x:x['count'])
cumsums, values = zip(*cumsum(c_rdd,lambda x: x).collect())
you can want to try out with windows over using rowsBetween. hope still helpful.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val data = Array(("c1",6), ("c2",3),("c3",4))
val df = sc.parallelize(data).sortByKey().toDF("c", "v")
val w = Window.orderBy("c")
val r = df.select( $"c", sum($"v").over(w.rowsBetween(-2, -1)).alias("cs"))
display(r)

Spark Logistic regression and metrics

I want to run logistic regression 100 times with random splitting into test and training. I want to then save the performance metrics for individual runs and then later use them for gaining insight into the performance.
for (index <- 1 to 100) {
val splits = training_data.randomSplit(Array(0.90, 0.10), seed = index)
val training = splits(0).cache()
val test = splits(1)
logrmodel = train_LogisticRegression_model(training)
performLogisticRegressionRuns(logrmodel, test, index)
}
spark.stop()
}
def performLogisticRegressionRuns(model: LogisticRegressionModel, test: RDD[LabeledPoint], iterationcount: Int) {
private val sb = StringBuilder.newBuilder
// Compute raw scores on the test set. Once I cle
model.clearThreshold()
val predictionAndLabels = test.map { case LabeledPoint(label, features) =>
val prediction = model.predict(features)
(prediction, label)
}
val bcmetrics = new BinaryClassificationMetrics(predictionAndLabels)
// I am showing two sample metrics, but I am collecting more including recall, area under roc, f1 score etc....
val precision = bcmetrics.precisionByThreshold()
precision.foreach { case (t, p) =>
// If threshold is 0.5 as what we want, then get the precision and append it to the string. Idea is if score is <0.5 class 0, else class 1.
if (t == 0.5) {
println(s"Threshold is: $t, Precision is: $p")
sb ++= p.toString() + "\t"
}
}
val auROC = bcmetrics.areaUnderROC
sb ++= iteration + auPRC.toString() + "\t"
I want to save the performance results of each iteration in separate file. I tried this, but it does not work, any help with this will be great
val data = spark.parallelize(sb)
val filename = "logreg-metrics" + iterationcount.toString() + ".txt"
data.saveAsTextFile(filename)
}
I was able to resolve this, I did the following. I converted the String to a list.
val data = spark.parallelize(List(sb))
val filename = "logreg-metrics" + iterationcount.toString() + ".txt"
data.saveAsTextFile(filename)

Spark - LinearRegressionWithSGD on Coursera Machine Learning by Stanford University samples

Software Version: Apache Spark v1.3
Context: I've been trying to "translate" Octave/MATLAB code to Scala on Apache Spark. More precisely, I work on ex1data1.txt and ex1data2.txt from coursera practical part ex1. I've made such translation into Julia lang (it went smoothly) and now I've been struggling with Spark...without success.
Problem: Performance of my implementation on Spark is very poor. I cannot even say it works correctly. That's why for ex1data1.txt I added polynomial feature, and I also worked with: theta0 using setIntercept(true) and with extra non-normalized column of 1.0 values(in this case I set Intercept to false). I receive only silly results.
So, then I 've decided to start working with ex1data2.txt. Below you can find the code and the expected result. Of course Spark result is wrong.
Did you have similar experience? I will be grateful for your help.
The Scala code for the exd1data2.txt:
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.SquaredL2Updater
import org.apache.spark.mllib.regression.{LinearRegressionModel, LinearRegressionWithSGD, LabeledPoint}
import org.apache.spark.{SparkContext, SparkConf}
object MLibOnEx1data2 extends App {
val conf = new SparkConf()
conf.set("spark.app.name", "coursera ex1data2.txt test")
val sc = new SparkContext(conf)
val input = sc.textFile("hdfs:///ex1data2.txt")
val trainData = input.map { line =>
val parts = line.split(',')
val y = parts(2).toDouble
val features = Vectors.dense(parts(0).toDouble, parts(1).toDouble)
println(s"x = $features y = $y")
LabeledPoint(y, features)
}.cache()
// Building the model
val numIterations = 1500
val alpha = 0.01
// Scale the features
val scaler = new StandardScaler(withMean = true, withStd = true)
.fit(trainData.map(x => x.features))
val scaledTrainData = trainData.map{ td =>
val normFeatures = scaler.transform(td.features)
println(s"normalized features = $normFeatures")
LabeledPoint(td.label, normFeatures)
}.cache()
val tsize = scaledTrainData.count()
println(s"Training set size is $tsize")
val alg = new LinearRegressionWithSGD().setIntercept(true)
alg.optimizer
.setNumIterations(numIterations)
.setStepSize(alpha)
.setUpdater(new SquaredL2Updater)
.setRegParam(0.0) //regularization - off
val model = alg.run(scaledTrainData)
println(s"Theta is $model.weights")
val total1 = model.predict(scaler.transform(Vectors.dense(1650, 3)))
println(s"Estimate the price of a 1650 sq-ft, 3 br house = $total1 dollars") //it should give ~ $289314.620338
// Evaluate model on training examples and compute training error
val valuesAndPreds = scaledTrainData.map { point =>
val prediction = model.predict(point.features)
(point.label, prediction)
}
val MSE = ((valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()) / 2)
println("Training Mean Squared Error = " + MSE)
// Save and load model
val trySaveAndLoad = util.Try(model.save(sc, "myModelPath"))
.flatMap { _ => util.Try(LinearRegressionModel.load(sc, "myModelPath")) }
.getOrElse(-1)
println(s"trySaveAndLoad result is $trySaveAndLoad")
}
STDOUT result is:
Training set size is 47
Theta is (weights=[52090.291641275864,19342.034885388926],
intercept=181295.93717032953).weights
Estimate the price of a 1650 sq-ft, 3 br house = 153983.5541846754
dollars
Training Mean Squared Error = 1.5876093757127676E10
trySaveAndLoad result is -1
Well, after some digging I believe there is nothing here. First I saved content of the valuesAndPreds to text file:
valuesAndPreds.map{
case {x, y} => s"$x,$y"}.repartition(1).saveAsTextFile("results.txt")'
Rest of the code is written in R.
First lets create a model using closed form solution:
# Load data
df <- read.csv('results.txt/ex1data2.txt', header=FALSE)
# Scale features
df[, 1:2] <- apply(df[, 1:2], 2, scale)
# Build linear model
model <- lm(V3 ~ ., df)
For reference:
> summary(model)
Call:
lm(formula = V3 ~ ., data = df)
Residuals:
Min 1Q Median 3Q Max
-130582 -43636 -10829 43698 198147
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 340413 9637 35.323 < 2e-16 ***
V1 110631 11758 9.409 4.22e-12 ***
V2 -6650 11758 -0.566 0.575
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 66070 on 44 degrees of freedom
Multiple R-squared: 0.7329, Adjusted R-squared: 0.7208
F-statistic: 60.38 on 2 and 44 DF, p-value: 2.428e-13
Now prediction:
closedFormPrediction <- predict(model, df)
closedFormRMSE <- sqrt(mean((closedFormPrediction - df$V3)**2))
plot(
closedFormPrediction, df$V3,
ylab="Actual", xlab="Predicted",
main=paste("Closed form, RMSE: ", round(closedFormRMSE, 3)))
'
Now we can compare above to SGD results:
sgd <- read.csv('results.txt/part-00000', header=FALSE)
sgdRMSE <- sqrt(mean(sgd$V2 - sgd$V1)**2)
plot(
sgd$V2, sgd$V1, ylab="Actual",
xlab="Predicted", main=paste("SGD, RMSE: ", round(sgdRMSE, 3)))
Finally lets compare both:
plot(
sgd$V2, closedFormPrediction,
xlab="SGD", ylab="Closed form", main="SGD vs Closed form")
So, result are clearly not perfect but nothing seems to be completely off here.

SparkPi running slow with more than 1 slice

Relatively new on spark and have tried running SparkPi example on a standalone 12 core three machine cluster. What I'm failing to understand is, that running this example with a single slice gives better performance as compared to using 12 slices. Same was the case when I was using parallelize function. The time is scaling almost linearly with adding each slice. Please let me know if I'm doing anything wrong. The code snippet is given below:
val spark = new SparkContext("spark://telecom:7077", "SparkPi",
System.getenv("SPARK_HOME"), List("target/scala-2.10/sparkpii_2.10-1.0.jar"))
val slices = 1
val n = 10000000 * slices
val count = spark.parallelize(1 to n, slices).map {
i =>
val x = random * 2 - 1
val y = random * 2 - 1
if (x * x + y * y < 1) 1 else 0
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * count / n)
spark.stop()
Update: Problem was with random function, since it was a synchronized method, it couldn't scale to multiple cores.
The random function used in sparkpi example is a synchronized method and can't scale to multiple cores. It's an easy enough example to deploy on your cluster but don't use it to check Spark's performance and scalability.
As Ahsan mentioned in his answer, the problem was with 'scala.math.random'.
I have replaced it with 'org.apache.spark.util.random.XORShiftRandom', and now using multiple processors makes the Pi calculations to run much faster.
Below is my code, which is a modified version of SparkPi example from Spark distribution:
// scalastyle:off println
package org.apache.spark.examples
import org.apache.spark.util.random.XORShiftRandom
import org.apache.spark._
/** Computes an approximation to pi */
object SparkPi {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("Spark Pi").setMaster(args(0))
val spark = new SparkContext(conf)
val slices = if (args.length > 1) args(1).toInt else 2
val n = math.min(100000000L * slices, Int.MaxValue).toInt // avoid overflow
val rand = new XORShiftRandom()
val count = spark.parallelize(1 until n, slices).map { i =>
val x = rand.nextDouble * 2 - 1
val y = rand.nextDouble * 2 - 1
if (x*x + y*y < 1) 1 else 0
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * count / n)
spark.stop()
}
}
// scalastyle:on println
When I run the program above using one core with parameters 'local[1] 16' it takes about 60 seconds on my laptop. Same program using 8 cores ('local[*] 16') it takes 17 seconds.