How to fit data to a normal distribution using Scala Breeze - scala

I am trying to fit data to a normal distribution using Scala Breeze. The Python SciPy way is:
from scipy.stats import norm
mu,std = norm.fit(time1)
I am looking for an equivalent way to do the same in Scala using Breeze.

Looking at the source code for norm.fit, it looks like if you call the function with only the data passed in (i.e. no other parameters), it simply returns the mean and standard deviation. We can accomplish the same in Breeze like so:
scala> import breeze.linalg.DenseVector
import breeze.linalg.DenseVector
scala> import breeze.stats.{mean, variance}
import breeze.stats.{mean, variance}
scala> val data = DenseVector(1d,2d,3d,4d)
data: breeze.linalg.DenseVector[Double] = DenseVector(1.0, 2.0, 3.0, 4.0)
scala> val mu = mean(data)
mu: Double = 2.5
scala> val samp_var = variance(data)
samp_var: Double = 1.6666666666666667
scala> val n = data.length.toDouble
n: Double = 4.0
scala> val pop_var = samp_var * (n-1)/(n)
pop_var: Double = 1.25
scala> val pop_std = math.sqrt(pop_var)
pop_std: Double = 1.118033988749895
We need to scale the sample variance by (n-1)/n to get the population variance, because norm.fit returns the maximum-likelihood estimate, which uses the population (biased) standard deviation. This matches the scipy result:
In [1]: from scipy.stats import norm
In [2]: mu, std = norm.fit([1,2,3,4])
In [3]: mu
Out[3]: 2.5
In [4]: std
Out[4]: 1.1180339887498949
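If you want an actual distribution object rather than just the two parameters, you can wrap the estimates in Breeze's Gaussian. A minimal sketch, using mu and pop_std as computed above (on recent Breeze versions you may need an implicit RandBasis in scope for sampling):
import breeze.stats.distributions.Gaussian

// Wrap the fitted parameters in a distribution object
val fitted = Gaussian(mu, pop_std)
fitted.pdf(2.5)    // density at x = 2.5
fitted.sample(10)  // draw 10 values from the fitted distribution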

Related

Interpolation with radial basis function in Julia

I have found a few radial basis function options in Julia, such as BasisExpansionFunction, Surrogates.jl, and ScatteredInterpolation.
However, I am unable to replicate the results from Python's scipy.interpolate.Rbf() function.
Python Example
from scipy.interpolate import Rbf
import numpy as np
xs = np.arange(10)
ys = xs**2 + np.sin(xs) + 1
interp_func = Rbf(xs, ys) # By default Rbf uses the multiquadric function
newarr = interp_func(np.arange(2.1, 3, 0.1))
print(newarr)
What is the correct approach to replicate the above example in Julia?
The first tutorial in Surrogates.jl shows how to build and evaluate a radial basis interpolant:
using Surrogates
using LinearAlgebra
f = x -> x[1]*x[2]
lb = [1.0,2.0]
ub = [10.0,8.5]
x = sample(50,lb,ub,SobolSample())
y = f.(x)
my_radial_basis = RadialBasis(x,y,lb,ub)
#I want an approximation at (1.0,1.4)
approx = my_radial_basis((1.0,1.4))

Spark ML LinearRegression prediction is a constant for all observations

I'm trying to build a simple linear regression model in Spark using Scala. To test the method, I'm trying to perform a single-variable regression using a test data set.
My data set is as follows:
x - integers from 1 to 100
y - random values generated from excel using the formula =RANDBETWEEN(-10,10)*RAND() + x_i
I've run a regression for this data set using the Python sklearn library, and it gives me the best fit line (with r2 = 0.98) for the data, as expected.
However, if I run a regression using spark my prediction has a constant value for all the x values in the dataset with an r2 value of 2e-16.
Why doesn't this code give me the best fit line as the prediction? What am I missing?
Here's the code I'm using
Python Code that works
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

x = np.array(df['x'])
y = np.array(df['y'])
x = x.reshape(-1,1)
y = y.reshape(-1,1)
clf = LinearRegression(normalize=True)
clf.fit(x,y)
y_predictions = clf.predict(x)
print(r2_score(y, y_predictions))
Here's a plot from the Python regression.
Scala code that gives a constant prediction
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.ml.regression.LinearRegression

val labelCol = "y"
val assembler = new VectorAssembler()
  .setInputCols(Array("x"))
  .setOutputCol("features")
val df2 = assembler.transform(df)
val labelIndexer = new StringIndexer().setInputCol(labelCol).setOutputCol("label")
val df3 = labelIndexer.fit(df2).transform(df2)
val regressor = new LinearRegression()
  .setMaxIter(10)
  .setRegParam(1.0)
  .setElasticNetParam(1.0)
val model = regressor.fit(df3)
val predictions = model.transform(df3)
val modelSummary = model.summary
println(s"r2 = ${modelSummary.r2}")
The issue was using the StringIndexer, which should not be used on numeric columns. In my case, instead of using the StringIndexer, I should've just renamed the y column to label. This fixes the problem.
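For reference, a minimal sketch of the corrected pipeline, assuming the original DataFrame df has numeric columns x and y (the regularization settings are omitted here for simplicity):
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression

// No StringIndexer: just rename the numeric target column to "label"
val assembler = new VectorAssembler()
  .setInputCols(Array("x"))
  .setOutputCol("features")
val train = assembler.transform(df).withColumnRenamed("y", "label")

val model = new LinearRegression().setMaxIter(10).fit(train)
println(s"r2 = ${model.summary.r2}")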

How to convert from org.apache.spark.mllib.linalg.SparseVector to org.apache.spark.ml.linalg.SparseVector?

How to convert from org.apache.spark.mllib.linalg.SparseVector to org.apache.spark.ml.linalg.SparseVector?
I am converting the code from mllib to the ml API.
import org.apache.spark.mllib.linalg.{DenseVector, Vector}
import org.apache.spark.ml.linalg.{DenseVector => NewDenseVector, Vector => NewVector}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.ml.feature.{LabeledPoint => NewLabeledPoint}
val labelPointData = limitedTable.rdd.map { row =>
  new NewLabeledPoint(convertToDouble(row.head), row(1).asInstanceOf[org.apache.spark.ml.linalg.SparseVector])
}
The statement row(1).asInstanceOf[org.apache.spark.ml.linalg.SparseVector]
is not working because of the following exception:
org.apache.spark.mllib.linalg.SparseVector cannot be cast to org.apache.spark.ml.linalg.SparseVector
How to overcome that?
I have found code converting from mllib to ml, but not vice versa.
It is possible to convert in both directions. First, let's create an mllib SparseVector:
import org.apache.spark.mllib.linalg.Vectors
val mllibVec: org.apache.spark.mllib.linalg.Vector = Vectors.sparse(3, Array(0, 1, 2), Array(1.0, 2.0, 3.0))
To convert to ML SparseVector, simply use asML:
val mlVec: org.apache.spark.ml.linalg.Vector = mllibVec.asML
To convert it back again, the easiest way is to use Vectors.fromML():
val mllibVec2: org.apache.spark.mllib.linalg.Vector = Vectors.fromML(mlVec)
In addition, in your code, instead of row(1).asInstanceOf[SparseVector] you could try row.getAs[SparseVector](1). Try reading the vector as an mllib vector, then convert it with asML and pass it into the ML-based LabeledPoint, i.e.:
val labelPointData = limitedTable.rdd.map { row =>
  NewLabeledPoint(convertToDouble(row.head), row.getAs[org.apache.spark.mllib.linalg.SparseVector](1).asML)
}
In PySpark, you can convert between the different vector types in this way:
import numpy as np
from pyspark.mllib.linalg import Vectors as mllib_vectors
from pyspark.ml.linalg import Vectors as ml_vectors
# Construct dense vectors in mllib and ml
v1 = mllib_vectors.dense([1.0, 1.0, 0, 0, 0])
v2 = ml_vectors.dense([1.0, 1.0, 0, 0, 0])
print('v1: %s' % v1)
print('v2: %s' % v2)
print(v1 == v2)
print(type(v1), type(v2))
# Convert vector to numpy array
arr1 = v1.toArray()
print('arr1: %s type: %s' % (arr1, type(arr1)))
# convert mllib vectors to ml vectors
v3 = ml_vectors.dense(arr1)
print('v3: %s' % v3)
print(type(v3))
# Convert ml dense vector to sparse vector
arr2 = v2.toArray()
print('arr2', arr2)
d = {i:arr2[i] for i in np.nonzero(arr2)[0]}
print('d', d)
v4 = ml_vectors.sparse(len(arr2), d)
print('v4: %s' % v4)
# Convert ml sparse vector to dense vector
v5 = ml_vectors.dense(v4.toArray())
print('v5: %s' % v5)
# Convert mllib dense vector to sparse vector
d1 = {i: arr1[i] for i in np.nonzero(arr1)[0]}
v6 = mllib_vectors.sparse(len(arr1), d1)
print('v6: %s' % v6)
# Convert ml sparse vector to mllib sparse vector
arr3 = v4.toArray()
d = {i:arr3[i] for i in np.nonzero(arr3)[0]}
v7 = mllib_vectors.sparse(len(arr3), d)
print('v7: %s' % v7)
The output is:
v1: [1.0,1.0,0.0,0.0,0.0]
v2: [1.0,1.0,0.0,0.0,0.0]
False
<class 'pyspark.mllib.linalg.DenseVector'> <class 'pyspark.ml.linalg.DenseVector'>
arr1: [1. 1. 0. 0. 0.] type: <class 'numpy.ndarray'>
v3: [1.0,1.0,0.0,0.0,0.0]
<class 'pyspark.ml.linalg.DenseVector'>
arr2 [1. 1. 0. 0. 0.]
d {0: 1.0, 1: 1.0}
v4: (5,[0,1],[1.0,1.0])
v5: [1.0,1.0,0.0,0.0,0.0]
v6: (5,[0,1],[1.0,1.0])
v7: (5,[0,1],[1.0,1.0])

Spark ML Linear Regression - What Hyper-parameters to Tune

I'm using the LinearRegression model in Spark ML for predicting price. It is a single-variable regression (x = time, y = price).
Assuming my data is clean, what are the usual steps to take to improve this model?
So far, I have tried tuning the regularization parameter using cross-validation, and got RMSE = 15 given stdev = 30.
Are there any other significant hyper-parameters I should care about? It seems Spark ML is not well documented for hyper-parameter tuning...
Update
I was able to tune the parameters using ParamGrid and cross-validation. However, is there any way to see what the fitted line looks like after correctly training a linear regression model? How can I know if the line is quadratic, cubic, etc.? It would be great if there were a way to visualize the fitted line together with all the training data points.
The link you provided points to the main hyperparameters:
.setRegParam(0.3) // lambda: the overall regularization strength
.setElasticNetParam(0.8) // alpha: the mix between L1 (alpha = 1) and L2 (alpha = 0)
You can perform a grid search to optimize their values, say for
lambda in 0 to 0.8
elasticNet in 0 to 1.0
This can be done by providing an array of ParamMaps to CrossValidator via its estimatorParamMaps parameter (see the sketch below):
val estimatorParamMaps: Param[Array[ParamMap]]
param for estimator param maps
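A minimal sketch of such a grid search, assuming a DataFrame trainingData with features and label columns (the grid values are illustrative):
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LinearRegression().setMaxIter(100)

// Grid over the two main hyper-parameters
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.0, 0.2, 0.4, 0.8))
  .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
  .build()

// 5-fold cross-validation, selecting the model with the lowest RMSE
val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new RegressionEvaluator().setMetricName("rmse"))
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(5)

val cvModel = cv.fit(trainingData)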
To answer your follow-up question, LinearRegression always produces a linear fit; it will never be quadratic or cubic. You can visualize it by predicting over a grid of x-values spanning your data range and drawing the results as a line plot, then plotting your training data on top of it.
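A minimal sketch of that approach, assuming a SparkSession named spark, a fitted model from above, and a single feature column x (all names here are illustrative):
import org.apache.spark.ml.feature.VectorAssembler
import spark.implicits._

// Predict over an evenly spaced grid of x-values; plot "x" vs
// "prediction" as a line in your plotting tool of choice.
val grid = (0 to 1000).map(_ / 10.0).toDF("x")
val assembler = new VectorAssembler()
  .setInputCols(Array("x"))
  .setOutputCol("features")
val fittedLine = model.transform(assembler.transform(grid))
  .select("x", "prediction")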
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
import org.apache.spark.sql.SparkSession

object PredictiveAnalysis {
  val session = SparkSession.builder().master("local").appName("PredictiveAnalysis").getOrCreate()

  def main(args: Array[String]): Unit = {
    val data = session.sparkContext.textFile("C:\\Users\\Test\\new_workspace\\PredictionAlgo\\src\\main\\resources\\data.txt")
    val parsedData = data.map { line =>
      val x: Array[String] = line.replace(",", " ").split(" ")
      val y = x.map(_.toDouble)
      val d = y.size - 1
      val c = Vectors.dense(y(0), y(d))
      LabeledPoint(y(0), c)
    }.cache()
    val numIterations = 100
    val stepSize = 0.00000001
    val model = LinearRegressionWithSGD.train(parsedData, numIterations, stepSize)
    val valuesAndPreds = parsedData.map { point =>
      val prediction = model.predict(point.features)
      (point.label, prediction)
    }
    valuesAndPreds.foreach { case (label, prediction) =>
      println(s"actual label: $label, predicted label: $prediction")
    }
    val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()
    println(s"training Mean Squared Error = $MSE")
  }
}

Spark 1.5 Elementwise Product

Spark 1.5 recently came out and has element-wise multiplication in Python (http://spark.apache.org/docs/latest/mllib-feature-extraction.html).
I have no problem applying the weighting/transforming vector (v2 in my code below) to a single Vector to produce a vector. However, when I try to apply it to an RDD[Vector], I get:
TypeError: Cannot convert type <type 'numpy.float64'> into Vector
Here's my code:
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.feature import ElementwiseProduct
v1 = sc.parallelize(Vectors.dense([2.0, 2.0, 2.0]))
v2 = Vectors.dense([0.0, 1.0, 2.0])
transformer = ElementwiseProduct(v2)
transformedData = transformer.transform(v1)
print transformedData.collect()
How do I produce an RDD[Vector] that is the Hadamard product of v1 and v2?
It turned out I needed to turn v1 into a RowMatrix:
mat = RowMatrix(v1)
So, for instance:
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix
from pyspark.mllib.feature import ElementwiseProduct

v1 = sc.parallelize([[2.0, 2.0, 2.0], [3.0, 3.0, 3.0]])
mat = RowMatrix(v1)
v2 = Vectors.dense([0.0, 1.0, 2.0])
transformer = ElementwiseProduct(v2)
transformedData = transformer.transform(mat.rows)
print transformedData.collect()
will print:
[DenseVector([0.0, 2.0, 4.0]), DenseVector([0.0, 3.0, 6.0])]
What I really need, though, is a way for v2 to also contain multiple vectors (one weighting vector per row) instead of a single vector, but so far that doesn't seem to exist.
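If you need a per-row weight vector, one workaround is to zip the two RDDs and multiply element-wise yourself. A sketch in Scala Spark (note that RDD.zip requires both RDDs to have the same number of partitions and the same number of elements per partition):
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Hadamard product of two equally partitioned RDDs of vectors
def hadamard(xs: RDD[Vector], ws: RDD[Vector]): RDD[Vector] =
  xs.zip(ws).map { case (x, w) =>
    Vectors.dense(x.toArray.zip(w.toArray).map { case (a, b) => a * b })
  }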