Spark mllib.stat.Statistics - kolmogorovSmirnovTest CDF - scala

I am looking through the example HypothesisTestingKolmogorovSmirnovTestExample.scala for Spark and can't figure out the CDF aspect.
Their example:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD
val data: RDD[Double] = sc.parallelize(Seq(0.1, 0.15, 0.2, 0.3, 0.25)) // an RDD of sample data
val myCDF = Map(0.1 -> 0.2, 0.15 -> 0.6, 0.2 -> 0.05, 0.3 -> 0.05, 0.25 -> 0.1)
val testResult2 = Statistics.kolmogorovSmirnovTest(data, myCDF)
println(testResult2)
This returns:
Very strong presumption against null hypothesis: Sample follows theoretical distribution.
This makes sense. What doesn't make sense is when I try to get it not to reject the null:
val data: RDD[Double] = sc.parallelize(Seq(0.1, 0.15, 0.2, 0.3, 0.25)) // an RDD of sample data
val myCDF = Map(0.1 -> 0.1, 0.15 -> 0.15, 0.2 -> 0.2, 0.3 -> 0.3, 0.25 -> 0.25) //CDF matching the data distribution
val testResult2 = Statistics.kolmogorovSmirnovTest(data, myCDF)
println(testResult2)
This ALSO returns:
Very strong presumption against null hypothesis: Sample follows theoretical distribution.
What gives? The CDF and the data are the exact same distribution, are they not? Why would the Null be rejected? What am I assuming/doing wrong?

The scenario in which you can use the KS test:
The KS test is a goodness-of-fit test, run after fitting a distribution to the data. It tells you whether the distribution you identified for the data is plausible, and you validate that with the p-value.
If the p-value is > 0.05, the distribution you chose for the data is fine; if the p-value is < 0.05, you need to fit the data with a different distribution.
Rejecting the null means the p-value is < 0.05: the data does not fit the given distribution.
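For example (a minimal sketch, not from the answer above, assuming an existing SparkContext named sc): you can read the decision off result.pValue instead of parsing the summary string. This uses the named-distribution overload of kolmogorovSmirnovTest, which to my knowledge currently only supports "norm"; the 0.05 threshold is just the conventional significance level.
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD
val data: RDD[Double] = sc.parallelize(Seq(0.1, 0.15, 0.2, 0.3, 0.25))
// test the sample against a standard normal distribution (mean 0.0, stddev 1.0)
val result = Statistics.kolmogorovSmirnovTest(data, "norm", 0.0, 1.0)
if (result.pValue > 0.05)
  println(s"p = ${result.pValue}: cannot reject the null, the fitted distribution is plausible")
else
  println(s"p = ${result.pValue}, D = ${result.statistic}: reject the null, try a different distribution")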

Kolmogorov-Smirnov Test Statistic

Can someone explain why, if I calculate the KS test statistic manually, the result is different from what scipy.stats.kstest returns?
>>> import numpy as np
>>> from scipy import stats
>>> sample = np.array([1000, 2000, 2500, 3000, 5000])
>>> ecdf = np.array([0.2, 0.4, 0.6, 0.8, 1. ])
>>> cdf = stats.weibull_min(0.3, 100, 4000).cdf(sample)
>>> abs(ecdf - cdf).max()
0.3454961536273503
>>> stats.kstest(rvs=sample, cdf=stats.weibull_min(0.3, 100, 4000).cdf)
KstestResult(statistic=0.4722995454382698, pvalue=0.1534647709785294)
OK, I realized the mistake I made, so I will answer my own question. The KS statistic can't be calculated as abs(ecdf - cdf).max(), because the ECDF is right-continuous: it jumps at each sample point, so the supremum of the deviation can occur just before a jump as well as at it. The correct approach is to compare the theoretical CDF against the ECDF value on both sides of each jump:
>>> sample = np.array([1000, 2000, 2500, 3000, 5000])
>>> ecdf = np.array([0, 0.2, 0.4, 0.6, 0.8, 1. ])
>>> cdf = stats.weibull_min(0.3, 100, 4000).cdf(sample)
>>> max([(ecdf[1:] - cdf).max(), (cdf - ecdf[:-1]).max()])
0.4722995454382698

getting 'StructField' object has no attribute '_get_object_id' on BinaryClassificationMetrics

I was trying to get a binary classification report in PySpark and I ran into this error:
'StructField' object has no attribute '_get_object_id'
Here is my code:
%%spark
from pyspark.mllib.evaluation import BinaryClassificationMetrics
#from pyspark.mllib.evaluation import BinaryClassificationMetrics
predictionAndLabels = test_pred.rdd.map(lambda Row : (float(Row['label']) , Row['prediction']))
metrics = BinaryClassificationMetrics(predictionAndLabels)
Also, based on the documentation (link), it apparently does not support F1 measure, recall, etc. Any idea why, or how we can extract them without low-level coding?
I don't think you have to go that deep. Taking the example data from the BinaryClassificationMetrics documentation you linked, and assuming your threshold is a p = 0.5 cutoff, you can just do something like:
# f1 = 2 * precision * recall / (precision + recall)
# precision = tp / (tp + fp)
# recall = tp / (tp + fn)
from pyspark.sql.functions import col
scoreAndLabels = sc.parallelize([(0.1, 0.0), (0.1, 1.0), (0.4, 0.0), (0.6, 0.0), (0.6, 1.0), (0.6, 1.0), (0.8, 1.0)], 2)
df = scoreAndLabels.toDF()
threshold = 0.5
tp = df.where((col('_1') >= threshold) & (col('_2') == 1.0)).count()
fp = df.where((col('_1') >= threshold) & (col('_2') == 0.0)).count()
fn = df.where((col('_1') < threshold) & (col('_2') == 1.0)).count()
precision = tp / (tp+fp)
recall = tp / (tp+fn)
f1 = 2 * (precision * recall) / (precision + recall)
returns f1 = 0.75.
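With this sample data the counts come out to tp = 3, fp = 1 and fn = 1, so precision = recall = 3/4 = 0.75 and therefore f1 = 0.75.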

Element by Element Matrix Multiplication in Scala

I have an input mllib matrix like,
matrix1: org.apache.spark.mllib.linalg.Matrix =
1.0 0.0 2.0 1.0
0.0 3.0 1.0 1.0
2.0 1.0 0.0 0.0
The dimensions of matrix1 are 3 x 4.
I need to do an element-by-element matrix multiplication with another matrix; the two matrices will always have the same dimensions. Let us assume I have another matrix named matrix2 like
matrix2: org.apache.spark.mllib.linalg.Matrix =
3.0 0.0 2.0 1.0
1.0 9.0 5.0 1.0
2.0 5.0 0.0 0.0
with dimensions 3 x 4.
My resultant matrix should be,
result: org.apache.spark.mllib.linalg.Matrix =
3.0 0.0 4.0 1.0
0.0 27.0 5.0 1.0
4.0 5.0 0.0 0.0
How can I achieve this in Scala? (Note: the built-in multiply function of Spark MLlib performs standard matrix multiplication, not element-wise multiplication.)
Below is one way of doing it: iterate over both matrices column by column and multiply the corresponding elements. This solution assumes that both matrices have the same dimensions.
First, let's create the test matrices given in the question.
import org.apache.spark.mllib.linalg.{DenseMatrix, Matrix}
import scala.collection.mutable.ArrayBuffer
//creating the example matrices as per the question (DenseMatrix is column-major, so the arrays list the values column by column)
val m1: Matrix = new DenseMatrix(3, 4, Array(1.0, 0.0, 2.0, 0.0, 3.0, 1.0, 2.0, 1.0, 0.0, 1.0, 1.0, 0.0))
val m2: Matrix = new DenseMatrix(3, 4, Array(3.0, 1.0, 2.0, 0.0, 9.0, 5.0, 2.0, 5.0, 0.0, 1.0, 1.0, 0.0))
Now let's define a function that takes two matrices and returns their element-wise product.
//define a function to calculate element-wise multiplication
def elemWiseMultiply(m1: Matrix, m2: Matrix): Matrix = {
  val arr = new ArrayBuffer[Array[Double]]()
  val m1Itr = m1.colIter //operate on each column
  val m2Itr = m2.colIter
  while (m1Itr.hasNext) {
    //zip both the columns and then multiply element by element
    arr += m1Itr.next.toArray.zip(m2Itr.next.toArray).map { case (a, b) => a * b }
  }
  //return the resultant matrix (arr holds the columns in order)
  new DenseMatrix(m1.numRows, m1.numCols, arr.flatten.toArray)
}
You can then call this function to get the element-wise product.
//call the function on m1 and m2
elemWiseMultiply(m1, m2)
//output
//3.0 0.0 4.0 1.0
//0.0 27.0 5.0 1.0
//4.0 5.0 0.0 0.0
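As a side note, here is a slightly shorter sketch of the same idea (the helper name hadamard is mine, not part of the answer): Matrix.toArray returns the entries in column-major order, so for same-shaped matrices you can multiply the flat arrays and rebuild a DenseMatrix.
// element-wise (Hadamard) product via the flat column-major arrays
def hadamard(a: Matrix, b: Matrix): Matrix = {
  require(a.numRows == b.numRows && a.numCols == b.numCols, "dimensions must match")
  new DenseMatrix(a.numRows, a.numCols, a.toArray.zip(b.toArray).map { case (x, y) => x * y })
}
hadamard(m1, m2) // same 3 x 4 result as above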

How to calculate mean of function in a gaussian fit?

I'm using the Curve Fitting app in MATLAB. If I understand correctly, the "b1" coefficient in the left box is the mean of the function, i.e. the x point where y = 50%, and my x data is [-0.8 -0.7 -0.5 0 0.3 0.5 0.7], so why is this number so big (631) in this example?
General model Gauss1:
f(x) = a1*exp(-((x-b1)/c1)^2)
Coefficients (with 95% confidence bounds):
a1 = 3.862e+258 (-Inf, Inf)
b1 = 631.2 (-1.117e+06, 1.119e+06)
c1 = 25.83 (-2.287e+04, 2.292e+04)
Your data looks like a CDF, not a PDF. You can use this code for your solution:
xi=[-0.8,-0.7,-0.5, 0.0, 0.3, 0.5, 0.7];
yi= [0.2, 0.0, 0.2, 0.2, 0.5, 1.0, 1.0];
fun=@(v) normcdf(xi,v(1),v(2))-yi;
[v]=lsqnonlin(fun,[1,1]); %[1,2]
mu=v(1); sigma=v(2);
x=linspace(-1.5,1.5,100);
y=normcdf(x,mu,sigma);
figure(1);clf;plot(xi,yi,'x',x,y);
annotation('textbox',[0.2,0.7,0.1,0.1], 'String',sprintf('mu=%f\nsigma=%f',mu,sigma),'FitBoxToText','on','FontSize',16);
You will get: mu = 0.24537, sigma = 0.213.
And if you still want to fit a PDF, just change 'normcdf' to 'normpdf' in 'fun' (and in 'y').

Array in scala produced by x to y by z resulting in long decimals

I am trying to generate a numeric range of Doubles with val arrayOfDoubles = (0.0 to 1.0 by 0.1).toArray, but the result is not what I expected: I get something like Array(0.0, 0.1, 0.2, 0.30000000000000004, 0.4, 0.5, 0.6, 0.7, 0.7999999999999999, 0.8999999999999999, 0.9999999999999999). Why is it like this? I could use this code to get what I expect:
val roundedArray = for (x <- arrayOfDoubles) yield BigDecimal(x).setScale(1, BigDecimal.RoundingMode.HALF_UP).toDouble
Which results in Array[Double] = Array(0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0). But this looks really cumbersome and expensive, since BigDecimal converts the Double to a String and then parses it.
Is there a way I could get the NumericRange already rounded to one decimal place?
Thanks.
0 to 10 map (_ / 10.0)
should do the trick
Floating point arithmetic is not exact, i.e. some decimal numbers cannot be represented exactly, and are represented by the closest available numbers. See also Is floating point math broken?
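Applied to the original example, a minimal sketch (the trailing comment shows what the REPL prints for these values):
// build the range over integers, then scale: each element is the Double
// closest to the intended decimal and prints as expected
val arrayOfDoubles: Array[Double] = (0 to 10).map(_ / 10.0).toArray
// Array(0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0)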