Spark 1.5 recently came out with element-wise multiplication in Python via ElementwiseProduct (http://spark.apache.org/docs/latest/mllib-feature-extraction.html).
I have no problem applying the weighting/transforming vector (v2 in my code below) to a single Vector to produce another Vector. However, when I try to apply it to an RDD[Vector], I get:
TypeError: Cannot convert type <type 'numpy.float64'> into Vector
Here's my code:
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.feature import ElementwiseProduct
v1 = sc.parallelize(Vectors.dense([2.0, 2.0, 2.0]))
v2 = Vectors.dense([0.0, 1.0, 2.0])
transformer = ElementwiseProduct(v2)
transformedData = transformer.transform(v1)
print transformedData.collect()
How do I produce an RDD[Vector] that is the Hadamard product of v1 and v2?
It turned out I needed to turn v1 into a RowMatrix (parallelizing a DenseVector gives an RDD of individual floats, not an RDD[Vector]):
mat = RowMatrix(v1)
So, for instance:
from pyspark.mllib.linalg.distributed import RowMatrix
v1 = sc.parallelize([[2.0, 2.0, 2.0], [3.0, 3.0, 3.0]])
mat = RowMatrix(v1)
v2 = Vectors.dense([0.0, 1.0, 2.0])
transformer = ElementwiseProduct(v2)
transformedData = transformer.transform(mat.rows)
print transformedData.collect()
will print:
[DenseVector([0.0, 2.0, 4.0]), DenseVector([0.0, 3.0, 6.0])]
What I really need, though, is a transformer where v2 can also be a collection of vectors (one weighting vector per row) rather than a single vector, but so far that doesn't seem to exist. A sketch of a workaround is below.
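If both sides live in RDDs, one possible workaround (a minimal sketch, not an existing MLlib transformer) is to zip the two RDDs and multiply the vectors row by row with numpy; this assumes the two RDDs have the same number of elements per partition, which zip() requires:
from pyspark.mllib.linalg import Vectors
# two RDDs of DenseVectors with matching row counts
rdd1 = sc.parallelize([Vectors.dense([2.0, 2.0, 2.0]), Vectors.dense([3.0, 3.0, 3.0])])
rdd2 = sc.parallelize([Vectors.dense([0.0, 1.0, 2.0]), Vectors.dense([1.0, 2.0, 3.0])])
# row-wise Hadamard product: convert each pair to numpy arrays, multiply, wrap back into DenseVectors
hadamard = rdd1.zip(rdd2).map(lambda vw: Vectors.dense(vw[0].toArray() * vw[1].toArray()))
print hadamard.collect()
# [DenseVector([0.0, 2.0, 4.0]), DenseVector([3.0, 6.0, 9.0])]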
I have found a few radial-basis-function options in Julia, such as BasisExpansionFunction, Surrogates.jl, and ScatteredInterpolation.
However, I am unable to replicate the results of Python's scipy.interpolate.Rbf() function.
Python Example
from scipy.interpolate import Rbf
import numpy as np
xs = np.arange(10)
ys = xs**2 + np.sin(xs) + 1
interp_func = Rbf(xs, ys)  # by default, Rbf uses the multiquadric function
newarr = interp_func(np.arange(2.1, 3, 0.1))
print(newarr)
What is the correct approach to replicate the above example in Julia?
The first tutorial in Surrogates.jl shows how to make and interpolate a radial basis function.
using Surrogates
using LinearAlgebra
f = x -> x[1]*x[2]
lb = [1.0,2.0]
ub = [10.0,8.5]
x = sample(50,lb,ub,SobolSample())
y = f.(x)
my_radial_basis = RadialBasis(x,y,lb,ub)
# I want an approximation at (1.0, 1.4)
approx = my_radial_basis((1.0,1.4))
I'd like to build a GP with marginalized hyperparameters.
I have seen that this is possible with the HMC sampler provided in GPflow, as shown in this notebook.
However, when I run the following code as a first step (note: this is on GPflow 0.5, an older version), the returned samples are negative, even though the lengthscales and variance need to be positive (negative values would be meaningless).
import numpy as np
from matplotlib import pyplot as plt
import gpflow
from gpflow import hmc
X = np.linspace(-3, 3, 20)
Y = np.random.exponential(np.sin(X) ** 2)
Y = (Y - np.mean(Y)) / np.std(Y)
k = gpflow.kernels.Matern32(1, lengthscales=.2, ARD=False)
m = gpflow.gpr.GPR(X[:, None], Y[:, None], k)
m.kern.lengthscales.prior = gpflow.priors.Gamma(1., 1.)
m.kern.variance.prior = gpflow.priors.Gamma(1., 1.)
# don't want the likelihood variance to be a hyperparameter here, so keep it fixed
m.likelihood.variance = 1e-6
m.likelihood.variance.fixed = True
m.optimize(maxiter=1000)
samples = m.sample(500)
print(samples)
Output:
[[-0.43764571 -0.22753325]
[-0.50418501 -0.11070128]
[-0.5932655 0.00821438]
[-0.70217714 0.05077999]
[-0.77745654 0.09362291]
[-0.79404456 0.13649446]
[-0.83989415 0.27118385]
[-0.90355789 0.29589641]
...
I don't know HMC sampling in much detail, but I would expect the sampled posterior hyperparameters to be positive. I've checked the code and it seems it may be related to the Log1pe transform, though I failed to figure it out myself.
Any hint on this?
It would be helpful if you specified which GPflow version you are using. From the output you posted it looks like you are on a really old version of GPflow (pre-1.0), and this is something that has actually been improved since.
What is happening here (in old GPflow) is that the sample() method returns a single S x P array, where S is the number of samples and P is the number of free parameters [e.g. for an M x M matrix parameter with a lower-triangular transform (such as the Cholesky factor of the covariance of the approximate posterior, q_sqrt), only M * (M + 1)/2 parameters are actually stored and optimised!]. These are the values in the unconstrained space, i.e. they can take any value whatsoever. Transforms (see the gpflow.transforms module) provide the mapping between this unconstrained value (between plus and minus infinity) and the constrained value (e.g. gpflow.transforms.positive for lengthscales and variances).
In old GPflow, the model provides a get_samples_df() method that takes the S x P array returned by sample() and returns a pandas DataFrame with columns for all the trainable parameters, which is what you want. Or, ideally, you would just use a recent version of GPflow, in which the HMC sampler directly returns the DataFrame!
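For example (a minimal sketch for the old, pre-1.0 GPflow API used in the question, reusing m and samples from above and the get_samples_df() method mentioned above):
sample_df = m.get_samples_df(samples)  # pandas DataFrame, one column per trainable parameter
print(sample_df.head())                # the lengthscales and variance columns are now positive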
As far as I know, pyspark offers a PCA API like this:
from pyspark.ml.feature import PCA
pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(data_frame)
In practice, however, I find the explained variance ratio is more widely used. For example, in sklearn:
from sklearn.decomposition import PCA
pca_fitter = PCA(n_components=0.85)
Does anyone know how to implement explained variance ratio in pyspark? Thanks!
From Spark 2.0 onwards, PCAModel includes an explainedVariance attribute; from the docs:
explainedVariance
Returns a vector of proportions of variance explained by each principal component.
New in version 2.0.0.
Here is an example with k=2 principal components and toy data, adapted from the documentation:
spark.version
# u'2.2.0'
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import PCA
data = [(Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),),
        (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
        (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
df = spark.createDataFrame(data,["features"])
pca = PCA(k=2, inputCol="features", outputCol="pca_features")
model = pca.fit(df)
model.explainedVariance
# DenseVector([0.7944, 0.2056])
i.e. from our k=2 principal components, the first one explains 79.44% of the variance, while the second one explains the remaining 20.56%.
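If you specifically want sklearn's n_components=0.85 behaviour, there is no built-in equivalent that I know of, but a minimal sketch (using a hypothetical helper pca_for_target_ratio, and assuming explainedVariance is reported as a fraction of the total variance) is to fit once with a large k, take the cumulative sum, and refit with the smallest k that reaches the target ratio:
import numpy as np
def pca_for_target_ratio(df, input_col="features", target=0.85, max_k=5):
    # fit with a large k first, then find the smallest k whose cumulative
    # explained variance reaches the target ratio
    full = PCA(k=max_k, inputCol=input_col, outputCol="pca_features").fit(df)
    cum = np.cumsum(full.explainedVariance.toArray())
    k = min(int(np.searchsorted(cum, target) + 1), max_k)
    return PCA(k=k, inputCol=input_col, outputCol="pca_features").fit(df)
model_85 = pca_for_target_ratio(df)  # reuses df and PCA from the example above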
How do I convert from org.apache.spark.mllib.linalg.SparseVector to org.apache.spark.ml.linalg.SparseVector?
I am converting code from the mllib to the ml API.
import org.apache.spark.mllib.linalg.{DenseVector, Vector}
import org.apache.spark.ml.linalg.{DenseVector => NewDenseVector, Vector => NewVector}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.ml.feature.{LabeledPoint => NewLabeledPoint}
val labelPointData = limitedTable.rdd.map { row =>
new NewLabeledPoint(convertToDouble(row.head), row(1).asInstanceOf[org.apache.spark.ml.linalg.SparseVector])
}
The statement row(1).asInstanceOf[org.apache.spark.ml.linalg.SparseVector]
is not working because of the following exception:
org.apache.spark.mllib.linalg.SparseVector cannot be cast to org.apache.spark.ml.linalg.SparseVector
How to overcome that?
I have found code for converting from mllib to ml, but not vice versa.
It is possible to convert in both directions. First, let's create an mllib SparseVector:
import org.apache.spark.mllib.linalg.Vectors
val mllibVec: org.apache.spark.mllib.linalg.Vector = Vectors.sparse(3, Array(0, 1, 2), Array(1.0, 2.0, 3.0))
To convert to ML SparseVector, simply use asML:
val mlVec: org.apache.spark.ml.linalg.Vector = mllibVec.asML
To convert it back again, the easiest way is to use Vectors.fromML():
val mllibVec2: org.apache.spark.mllib.linalg.Vector = Vectors.fromML(mlVec)
In addition, in your code, instead of row(1).asInstanceOf[SparseVector] you could try row.getAs[SparseVector](1). Read the vector as an mllib vector, convert it with asML, and pass it into the ML-based LabeledPoint, i.e.:
val labelPointData = limitedTable.rdd.map { row =>
NewLabeledPoint(convertToDouble(row.head), row.getAs[org.apache.spark.mllib.linalg.SparseVector](1).asML)
}
In pyspark, you can convert between the different vector types in this way:
import numpy as np
from pyspark.mllib.linalg import Vectors as mllib_vectors
from pyspark.ml.linalg import Vectors as ml_vectors
# Construct dense vectors in mllib and ml
v1 = mllib_vectors.dense([1.0, 1.0, 0, 0, 0])
v2 = ml_vectors.dense([1.0, 1.0, 0, 0, 0])
print('v1: %s' % v1)
print('v2: %s' % v2)
print(v1 == v2)
print(type(v1), type(v2))
# Convert vector to numpy array
arr1 = v1.toArray()
print('arr1: %s type: %s' % (arr1, type(arr1)))
# convert mllib vectors to ml vectors
v3 = ml_vectors.dense(arr1)
print('v3: %s' % v3)
print(type(v3))
# Convert ml dense vector to sparse vector
arr2 = v2.toArray()
print('arr2', arr2)
d = {i:arr2[i] for i in np.nonzero(arr2)[0]}
print('d', d)
v4 = ml_vectors.sparse(len(arr2), d)
print('v4: %s' % v4)
# Convert ml sparse vector to dense vector
v5 = ml_vectors.dense(v4.toArray())
print('v5: %s' % v5)
# Convert mllib dense vector to mllib sparse vector
d1 = {i: arr1[i] for i in np.nonzero(arr1)[0]}
v6 = mllib_vectors.sparse(len(arr1), d1)
print('v6: %s' % v6)
# Convert ml sparse vector to mllib sparse vector
arr3 = v4.toArray()
d = {i:arr3[i] for i in np.nonzero(arr3)[0]}
v7 = mllib_vectors.sparse(len(arr3), d)
print('v7: %s' % v7)
The output is:
v1: [1.0,1.0,0.0,0.0,0.0]
v2: [1.0,1.0,0.0,0.0,0.0]
False
<class 'pyspark.mllib.linalg.DenseVector'> <class 'pyspark.ml.linalg.DenseVector'>
arr1: [1. 1. 0. 0. 0.] type: <class 'numpy.ndarray'>
v3: [1.0,1.0,0.0,0.0,0.0]
<class 'pyspark.ml.linalg.DenseVector'>
arr2 [1. 1. 0. 0. 0.]
d {0: 1.0, 1: 1.0}
v4: (5,[0,1],[1.0,1.0])
v5: [1.0,1.0,0.0,0.0,0.0]
v6: (5,[0,1],[1.0,1.0])
v7: (5,[0,1],[1.0,1.0])
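On Spark 2.0+ there is also a shorter route that avoids the manual round trip through numpy arrays: mllib vectors expose an asML() method, and pyspark.mllib.linalg.Vectors.fromML() goes in the other direction (for whole DataFrame columns, pyspark.mllib.util.MLUtils offers convertVectorColumnsToML/convertVectorColumnsFromML). A minimal sketch reusing v1 and v2 from above:
v1_as_ml = v1.asML()                    # mllib DenseVector -> ml DenseVector
v2_as_mllib = mllib_vectors.fromML(v2)  # ml DenseVector -> mllib DenseVector
print(type(v1_as_ml), type(v2_as_mllib))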
I am trying to fit data to a normal distribution using Scala Breeze. The Python/SciPy way would be:
from scipy.stats import norm
mu,std = norm.fit(time1)
I am looking for the equivalent way to do the same in Scala using Breeze.
Looking at the source code for norm.fit, it appears that if you call the function with only the data passed in (i.e. no other parameters), it simply returns the mean and the maximum-likelihood standard deviation. We can accomplish the same in Breeze like so:
scala> import breeze.linalg._
import breeze.linalg._
scala> import breeze.stats._
import breeze.stats._
scala> val data = DenseVector(1d,2d,3d,4d)
data: breeze.linalg.DenseVector[Double] = DenseVector(1.0, 2.0, 3.0, 4.0)
scala> val mu = mean(data)
mu: Double = 2.5
scala> val samp_var = variance(data)
samp_var: Double = 1.6666666666666667
scala> val n = data.length.toDouble
n: Double = 4.0
scala> val pop_var = samp_var * (n-1)/(n)
pop_var: Double = 1.25
scala> val pop_std = math.sqrt(pop_var)
pop_std: Double = 1.118033988749895
We need to rescale the sample variance by (n-1)/n to get the population (maximum-likelihood) variance, since that is what scipy's norm.fit returns. This matches the scipy result:
In [1]: from scipy.stats import norm
In [2]: mu, std = norm.fit([1,2,3,4])
In [3]: mu
Out[3]: 2.5
In [4]: std
Out[4]: 1.1180339887498949
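To make the sample-vs-population distinction explicit on the Python side, note that norm.fit's std matches numpy's default (population, ddof=0) standard deviation rather than the ddof=1 sample one; a quick check:
import numpy as np
data = np.array([1.0, 2.0, 3.0, 4.0])
print(np.std(data, ddof=0))  # 1.118033988749895  (matches norm.fit)
print(np.std(data, ddof=1))  # 1.2909944487358056 (sample standard deviation)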