Kolmogorov-Smirnov Test Statistic - scipy

Can someone explain why the KS test statistic I calculate manually is different from the one scipy.stats.kstest returns?
>>> sample = np.array([1000,2000,2500,3000,5000])
>>> ecdf = np.array([0.2, 0.4, 0.6, 0.8, 1. ])
>>> cdf = stats.weibull_min(0.3, 100, 4000).cdf(sample)
>>> abs(ecdf - cdf).max()
0.3454961536273503
>>> stats.kstest(rvs=sample, cdf=stats.weibull_min(0.3, 100, 4000).cdf)
KstestResult(statistic=0.4722995454382698, pvalue=0.1534647709785294)

OK, I realized the mistake I made, so I will answer my own question. The KS statistic can't be calculated as abs(ecdf - cdf).max(), because the ECDF is right-continuous with a jump at every sample point: the distance to the theoretical CDF has to be checked both just before and just after each jump. The correct approach is:
>>> sample = np.array([1000, 2000, 2500, 3000, 5000])
>>> ecdf = np.array([0, 0.2, 0.4, 0.6, 0.8, 1. ])
>>> cdf = stats.weibull_min(0.3, 100, 4000).cdf(sample)
>>> max([(ecdf[1:] - cdf).max(), (cdf - ecdf[:-1]).max()])
0.4722995454382698
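For a sorted sample of size n this is the usual two-sided statistic D = max over i of max(i/n - F(x_(i)), F(x_(i)) - (i-1)/n), which is exactly what the padded ecdf array above computes. A small sketch of the same calculation for an arbitrary sample (the helper function name is my own, not part of scipy):

import numpy as np
from scipy import stats

def ks_statistic(sample, cdf):
    # Compare the theoretical CDF with the ECDF just after each jump (i/n)
    # and just before it ((i-1)/n), and take the largest gap.
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    f = cdf(x)
    return max((np.arange(1, n + 1) / n - f).max(), (f - np.arange(n) / n).max())

ks_statistic([1000, 2000, 2500, 3000, 5000], stats.weibull_min(0.3, 100, 4000).cdf)
# 0.4722995454382698, the same value stats.kstest reports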

Related

getting 'StructField' object has no attribute '_get_object_id' on BinaryClassificationMetrics

I was trying to get the binary classification report in PySpark and I ran into this error:
'StructField' object has no attribute '_get_object_id'
Here is my code
%%spark
from pyspark.mllib.evaluation import BinaryClassificationMetrics
#from pyspark.mllib.evaluation import BinaryClassificationMetrics
predictionAndLabels = test_pred.rdd.map(lambda Row : (float(Row['label']) , Row['prediction']))
metrics = BinaryClassificationMetrics(predictionAndLabels)
Also, based on the documentation I linked, it apparently does not support F1 measure, recall, etc. Any idea why, or how we can extract them without low-level coding?
I don't think you have to go that deep. Taking the example data from the documentation you linked, and assuming your threshold is a p = 0.5 cutoff, you can just do something like:
# f1 = 2 * precision * recall / (precision + recall)
# precision = tp / (tp + fp)
# recall = tp / (tp + fn)
from pyspark.sql.functions import col
scoreAndLabels = sc.parallelize([(0.1, 0.0), (0.1, 1.0), (0.4, 0.0), (0.6, 0.0), (0.6, 1.0), (0.6, 1.0), (0.8, 1.0)], 2)
df = scoreAndLabels.toDF()
threshold = 0.5
tp = df.where((col('_1') >= threshold) & (col('_2') == 1.0)).count()
fp = df.where((col('_1') >= threshold) & (col('_2') == 0.0)).count()
fn = df.where((col('_1') < threshold) & (col('_2') == 1.0)).count()
precision = tp / (tp+fp)
recall = tp / (tp+fn)
f1 = 2 * (precision * recall) / (precision + recall)
returns f1 = 0.75.
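If you would rather not count tp, fp and fn by hand, one alternative (a sketch on my part, not something the linked documentation shows) is to binarize the scores at the same 0.5 cutoff and hand the result to MulticlassMetrics, which does expose per-label precision, recall and F1:

from pyspark.mllib.evaluation import MulticlassMetrics

scoreAndLabels = sc.parallelize([(0.1, 0.0), (0.1, 1.0), (0.4, 0.0), (0.6, 0.0), (0.6, 1.0), (0.6, 1.0), (0.8, 1.0)], 2)
# Turn each score into a hard 0/1 prediction at the 0.5 threshold (assumed cutoff)
predictionAndLabels = scoreAndLabels.map(lambda t: (1.0 if t[0] >= 0.5 else 0.0, t[1]))
metrics = MulticlassMetrics(predictionAndLabels)
print(metrics.precision(1.0), metrics.recall(1.0), metrics.fMeasure(1.0))
# 0.75 0.75 0.75, matching the manual computation above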

How to calculate mean of function in a gaussian fit?

I'm using the curve fitting app in MATLAB. If I understand correctly, the "b1" coefficient in the left box is the mean of the function, i.e. the x point where y = 50%. My x data is [-0.8 -0.7 -0.5 0 0.3 0.5 0.7], so why is this number so big in this example (631)?
General model Gauss1:
f(x) = a1*exp(-((x-b1)/c1)^2)
Coefficients (with 95% confidence bounds):
a1 = 3.862e+258 (-Inf, Inf)
b1 = 631.2 (-1.117e+06, 1.119e+06)
c1 = 25.83 (-2.287e+04, 2.292e+04)
Your data looks like a CDF, not a PDF. You can use this code for your solution:
xi=[-0.8,-0.7,-0.5, 0.0, 0.3, 0.5, 0.7];
yi= [0.2, 0.0, 0.2, 0.2, 0.5, 1.0, 1.0];
fun=@(v) normcdf(xi,v(1),v(2))-yi;
[v]=lsqnonlin(fun,[1,1]); %[1,2]
mu=v(1); sigma=v(2);
x=linspace(-1.5,1.5,100);
y=normcdf(x,mu,sigma);
figure(1);clf;plot(xi,yi,'x',x,y);
annotation('textbox',[0.2,0.7,0.1,0.1], 'String',sprintf('mu=%f\nsigma=%f',mu,sigma),'FitBoxToText','on','FontSize',16);
You will get mu = 0.24537, sigma = 0.213.
And if you still want to fit a PDF, just change 'normcdf' to 'normpdf' in 'fun' (and in 'y').
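If it helps, the same CDF fit can be reproduced outside MATLAB; here is a rough Python/scipy sketch mirroring the code above (not part of the original answer, just an equivalent check):

import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

xi = np.array([-0.8, -0.7, -0.5, 0.0, 0.3, 0.5, 0.7])
yi = np.array([0.2, 0.0, 0.2, 0.2, 0.5, 1.0, 1.0])
# Fit a normal CDF to the monotonically rising data, starting from mu=1, sigma=1
popt, _ = curve_fit(lambda x, mu, sigma: norm.cdf(x, mu, sigma), xi, yi, p0=[1.0, 1.0])
mu, sigma = popt  # should land close to the mu ~ 0.245, sigma ~ 0.213 reported above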

Array in scala produced by x to y by z resulting in long decimals

I am trying to generate a NumericRange of Double with val arrayOfDoubles = (0.0 to 1.0 by 0.1).toArray, but the result is not what I expected. It comes out as Array(0.0, 0.1, 0.2, 0.30000000000000004, 0.4, 0.5, 0.6, 0.7, 0.7999999999999999, 0.8999999999999999, 0.9999999999999999). Why is it like this? I could use this code to get what I expect:
val roundedArray = for (x <- arrayOfDoubles) yield BigDecimal(x).setScale(1, BigDecimal.RoundingMode.HALF_UP).toDouble
which results in Array[Double] = Array(0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0). But this looks really cumbersome and expensive, since BigDecimal converts the double to a String and then parses it back.
Is there a way I could get the NumericRange already rounded to one decimal place?
Thanks.
(0 to 10).map(_ / 10.0).toArray
should do the trick: each element is an exact Int divided by 10.0 once, so you get the Double closest to each tenth and they all print cleanly.
Floating point arithmetic is not exact, i.e. some decimal numbers cannot be represented exactly, and are represented by the closest available numbers. See also Is floating point math broken?
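The same artifact shows up in any language that stores these values as IEEE 754 doubles; a quick illustration in Python (purely to show the representation issue, nothing Scala-specific):

from decimal import Decimal

# The double nearest to 0.1 is slightly above it, so repeated 0.1 steps drift:
print(Decimal(0.1))     # 0.1000000000000000055511151231257827021181583404541015625
print(0.1 + 0.1 + 0.1)  # 0.30000000000000004, the same value seen in the Scala range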

Spark mllib.stat.Statistics - kolmogorovSmirnovTest CDF

I am looking through the Spark example HypothesisTestingKolmogorovSmirnovTestExample.scala and can't seem to figure out the CDF aspect.
Their example:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD
val data: RDD[Double] = sc.parallelize(Seq(0.1, 0.15, 0.2, 0.3, 0.25)) // an RDD of sample data
val myCDF = Map(0.1 -> 0.2, 0.15 -> 0.6, 0.2 -> 0.05, 0.3 -> 0.05, 0.25 -> 0.1)
val testResult2 = Statistics.kolmogorovSmirnovTest(data, myCDF)
println(testResult2)
This returns:
Very strong presumption against null hypothesis: Sample follows theoretical distribution.
This makes sense - what doesn't is when I try to have it not reject the Null:
val data: RDD[Double] = sc.parallelize(Seq(0.1, 0.15, 0.2, 0.3, 0.25)) // an RDD of sample data
val myCDF = Map(0.1 -> 0.1, 0.15 -> 0.15, 0.2 -> 0.2, 0.3 -> 0.3, 0.25 -> 0.25) //CDF matching the data distribution
val testResult2 = Statistics.kolmogorovSmirnovTest(data, myCDF)
println(testResult2)
This ALSO returns:
Very strong presumption against null hypothesis: Sample follows theoretical distribution.
What gives? The CDF and the data are the exact same distribution, are they not? Why would the Null be rejected? What am I assuming/doing wrong?
First, the scenario in which you use the KS test:
The KS test is a goodness-of-fit test, run after fitting a distribution to the data.
It tells you whether the distribution you identified for the data is correct or not, and you validate that with the p-value.
If the p-value is > 0.05, the distribution you chose for the data is fine; if the p-value is < 0.05, you need to fit the data with a different distribution.
Rejecting the null means the p-value is < 0.05: the data does not fit the given distribution.

how to sort multidimensional matrices along multiple columns

I have a tricky matrix manipulation issue that I could really use some help with.
I need to reorganize a series of 2d matrices so that they align most effectively across subjects. Each matrix has ~50 rows (which are the observations) and 13 columns (which designate the 'weight' of each observation on a series of 13 outcome measures). Based on the manner in which the data are created, there is no inherent meaning in the order of the rows, however I need to reorganize each matrix such that the rows contain meaning between subjects.
Specifically, I want to be able to reorder the matrices such that the specific pattern of weightings in a given row aligns with a similar pattern in the same row across a group of 20 subjects. To make matters worse, some subjects have missing rows, although all have between 45 and 50 rows.
As an example:
subject 1:
[ 0.1, 0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.4, 0.5, 0.5, 0.6, 0.6, 0.7;
0.9, 0.8, 0.8, 0.7, 0.7, 0.6, 0.6, 0.5, 0.5, 0.4, 0.4, 0.3, 0.3]
subject 2:
[ 0.8, 0.7, 0.7, 0.6, 0.6, 0.5, 0.5, 0.4, 0.4, 0.3, 0.3, 0.2, 0.2;
0.0, 0.0, 0.1, 0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.4, 0.5, 0.6, 0.7]
The problem: row 1 in subject 1 aligns best with row 2 in subject 2 (and vice versa), and I would like to reorder them accordingly [note: the real-life problem is much more convoluted than this].
I apologize ahead of time for how idiosyncratic this issue is, but I really appreciate any help that anyone can give.
Mac
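One way to illustrate the alignment being described, purely as a sketch (the approach and every name in it are mine, not something given in the question): compute the correlation between every pair of rows across two subjects and solve the resulting assignment problem, e.g. with scipy.

import numpy as np
from scipy.optimize import linear_sum_assignment

subj1 = np.array([[0.1, 0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.4, 0.5, 0.5, 0.6, 0.6, 0.7],
                  [0.9, 0.8, 0.8, 0.7, 0.7, 0.6, 0.6, 0.5, 0.5, 0.4, 0.4, 0.3, 0.3]])
subj2 = np.array([[0.8, 0.7, 0.7, 0.6, 0.6, 0.5, 0.5, 0.4, 0.4, 0.3, 0.3, 0.2, 0.2],
                  [0.0, 0.0, 0.1, 0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.4, 0.5, 0.6, 0.7]])

# Cross-correlation between every row of subj1 and every row of subj2
corr = np.corrcoef(subj1, subj2)[:len(subj1), len(subj1):]
# Pick the row pairing that maximizes total correlation (minimize the negative)
row_ind, col_ind = linear_sum_assignment(-corr)
aligned_subj2 = subj2[col_ind]  # subj2 reordered to line up row-for-row with subj1
print(col_ind)  # [1 0] for the toy example: row 1 of subject 1 matches row 2 of subject 2

With missing rows (45 to 50 per subject) the correlation matrix becomes rectangular, which linear_sum_assignment in recent scipy versions also handles; the extra rows simply stay unmatched.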