I am using Scala Spark. I have a DataFrame with 2 columns, each containing a Vector of the same size. I want to find the distance between corresponding elements of the 2 Vectors and put the results in a vector in another column of the DataFrame.
Example: [1, 3, 5, -2] - [-2, 5, 0, 1] = [3, 2, 5, 3]
I found the sqdist method, which gives me the sum of the squared distances between 2 Vectors, but how do I get the individual distance for each element of the vectors?
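One way to get the per-element distances is to take the absolute difference position by position. The sketch below shows only that core step in plain Python on ordinary lists. The function name `elementwise_distance` is hypothetical, and how it is wired into the Spark job (for example through a UDF over the two vector columns) is an assumption, not something the question specifies.

```python
def elementwise_distance(a, b):
    """Return |a[i] - b[i]| for each position of two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same size")
    return [abs(x - y) for x, y in zip(a, b)]


# The example from the question.
print(elementwise_distance([1, 3, 5, -2], [-2, 5, 0, 1]))  # [3, 2, 5, 3]
```

In Spark the same logic could be applied to the arrays obtained from the two vectors, with the result wrapped back into a dense vector for the new column.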
I have an IndexedRowMatrix of doubles. I want to compute the sum of each row of the matrix and save the results to a Vector. After that I want to broadcast this vector.
I am creating an RDD of Doubles, which contains the sums, but I cannot turn it into a vector.
So, the question basically is how to create the Vector I want from the IndexedRowMatrix.
Collect to the driver and construct a vector:
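import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

val sc: SparkContext = ???
val rdd: RDD[Double] = ???  // the row sums
// collect brings the sums to the driver as an Array[Double]
val vec: Vector = Vectors.dense(rdd.collect)
val broadcastVec = sc.broadcast(vec)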
References:
https://spark.apache.org/docs/2.1.0/mllib-data-types.html#local-vector
https://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables
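If the rows of the IndexedRowMatrix may not arrive in index order, it can help to place each sum at its row index before building the vector. The following is a minimal plain-Python sketch of that step, assuming the rows have been reduced to (index, values) pairs. The name `row_sums_by_index` and the `num_rows` parameter are hypothetical.

```python
def row_sums_by_index(indexed_rows, num_rows):
    """Sum each row and store the result at the row's own index.

    indexed_rows: iterable of (row_index, row_values) pairs, in any order.
    Rows that never appear keep a sum of 0.0.
    """
    sums = [0.0] * num_rows
    for index, values in indexed_rows:
        sums[index] = float(sum(values))
    return sums


# Rows deliberately listed out of order.
rows = [(2, [1.0, 1.0]), (0, [0.5, 2.5]), (1, [4.0, -1.0])]
print(row_sums_by_index(rows, 3))  # [3.0, 3.0, 2.0]
```

The resulting list is what would then be passed to Vectors.dense and broadcast.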
Is there a built in way to create a sparse vector from a dense vector in PySpark? The way I am doing this is the following:
Vectors.sparse(len(denseVector), [(i,j) for i,j in enumerate(denseVector) if j != 0 ])
That satisfies the [size, (index, data)] format. Seems kinda hacky. Is there a more efficient way to do it?
import scipy.sparse
from pyspark.ml.linalg import Vectors, _convert_to_vector, VectorUDT
from pyspark.sql.functions import udf, col
If you have just one dense vector this will do it:
def dense_to_sparse(vector):
    return _convert_to_vector(scipy.sparse.csc_matrix(vector.toArray()).T)
dense_to_sparse(densevector)
The trick here is that csc_matrix.shape[1] has to equal 1, so transpose the vector. Have a look at the source of _convert_to_vector: https://people.eecs.berkeley.edu/~jegonzal/pyspark/_modules/pyspark/mllib/linalg.html
The more likely scenario is you have a DF with a column of densevectors:
to_sparse = udf(dense_to_sparse, VectorUDT())
DF.withColumn("sparse", to_sparse(col("densevector")))
I'm not sure whether you're using mllib or ml. Anyway, you can convert like this:
import numpy as np
from pyspark.mllib.linalg import Vectors as mllib_vectors
from pyspark.ml.linalg import Vectors as ml_vectors
# Construct dense vectors in mllib and ml
v1 = mllib_vectors.dense([1.0, 1.0, 0, 0, 0])
v2 = ml_vectors.dense([1.0, 1.0, 0, 0, 0])
# Convert ml dense vector to sparse vector
arr2 = v2.toArray()
print('arr2', arr2)
d = {i:arr2[i] for i in np.nonzero(arr2)[0]}
print('d', d)
v4 = ml_vectors.sparse(len(arr2), d)
print('v4: %s' % v4)
# Convert mllib dense vector to sparse vector
arr1 = v1.toArray()
v6 = mllib_vectors.sparse(len(arr1), {i: arr1[i] for i in np.nonzero(arr1)[0]})
print('v6: %s' % v6)
I started using Breeze a few weeks ago and I am not able to do something that seems simple. I want to transform a Transpose into a DenseMatrix, for example:
val matrix = DenseMatrix((1.0, 3.5), (3.0, 2.0)) // DenseMatrix
val meanCols = mean(matrix(::, *)) // Transpose
val meanColsDM = meanCols.toDenseMatrix // Error: value toDenseMatrix is not a member of breeze.linalg.Transpose
I thought about creating a loop to transform the Transpose into an array to then create the DenseMatrix (1 row, 2 cols using the matrix from the example) but I wonder if there is a simpler way to obtain the same thing.
I need to do this in order to concatenate the column means with other matrices. I did not include that code in the example, as it is not the source of the problem.
meanCols is a Transpose[DenseVector[Double]], which is just a wrapper for a DenseVector[Double]. If you want the result in a matrix with one row and two columns, you can transpose it again with .t to get a DenseVector[Double] and then convert that to a matrix with .toDenseMatrix:
scala> import breeze.linalg._, breeze.stats.mean
import breeze.linalg._
import breeze.stats.mean
scala> val matrix = DenseMatrix((1.0, 3.5), (3.0, 2.0))
matrix: breeze.linalg.DenseMatrix[Double] =
1.0 3.5
3.0 2.0
scala> val meanCols = mean(matrix(::, *))
meanCols: breeze.linalg.Transpose[breeze.linalg.DenseVector[Double]] = ...
scala> val meanColsDM = meanCols.t.toDenseMatrix
meanColsDM: breeze.linalg.DenseMatrix[Double] = 2.0 2.75
Say you have two Sparse Vectors. As an example:
val vec1 = Vectors.sparse(2, Array(0), Array(1.0)) // [1, 0]
val vec2 = Vectors.sparse(2, Array(1), Array(1.0)) // [0, 1]
I want to concatenate these two vectors so that the result is equivalent to:
val vec3 = Vectors.sparse(4, Array(0, 3), Array(1.0, 1.0)) // [1, 0, 0, 1]
Does Spark have any such convenience method to do this?
If you have the data in a DataFrame, then VectorAssembler would be the right thing to use. For example:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.linalg import Vectors

dataset = spark.createDataFrame(
    [(0, Vectors.sparse(10, {0: 0.6931, 5: 0.0, 7: 0.5754, 9: 0.2877}), Vectors.sparse(10, {3: 0.2877, 4: 0.6931, 5: 0.0, 6: 0.6931, 8: 0.6931}))],
    ["label", "userFeatures1", "userFeatures2"])
assembler = VectorAssembler(
    inputCols=["userFeatures1", "userFeatures2"],
    outputCol="features")
output = assembler.transform(dataset)
output.select("features", "label").show(truncate=False)
You would get the following output for this:
+---------------------------------------------------------------------------+-----+
|features                                                                   |label|
+---------------------------------------------------------------------------+-----+
|(20,[0,7,9,13,14,16,18],[0.6931,0.5754,0.2877,0.2877,0.6931,0.6931,0.6931])|0    |
+---------------------------------------------------------------------------+-----+
I think there is a slight misunderstanding of SparseVectors here, so a short explanation: the first argument is the number of features (columns, or dimensions) of the data, each entry of the list in the second argument is the position of a feature, and each entry of the third list is the value at that position. Sparse vectors are therefore position-sensitive, and from my point of view your approach is incorrect.
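To make the three arguments concrete, here is a minimal plain-Python sketch that expands a (size, indices, values) triple into the dense list it stands for. The helper name `sparse_to_dense` is hypothetical and is not part of Spark.

```python
def sparse_to_dense(size, indices, values):
    """Expand a sparse (size, indices, values) triple into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = float(v)  # every position not listed stays 0.0
    return dense


print(sparse_to_dense(2, [0], [1]))        # [1.0, 0.0]
print(sparse_to_dense(4, [0, 3], [1, 1]))  # [1.0, 0.0, 0.0, 1.0]
```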
If you look closely, you are summing or combining two vectors that have the same dimensions, so the real result would be different. The first argument tells us that each vector has only 2 dimensions, so [1,0] + [0,1] => [1,1], and the correct representation would be Vectors.sparse(2, [0,1], [1,1]), not a four-dimensional vector.
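The sum described above can be sketched in plain Python on (indices, values) pairs. This is only an illustration of the arithmetic: `sparse_add` is a hypothetical helper, not a Spark method.

```python
def sparse_add(size, idx_a, val_a, idx_b, val_b):
    """Add two sparse vectors of the same size, position by position."""
    totals = {}
    for i, v in list(zip(idx_a, val_a)) + list(zip(idx_b, val_b)):
        totals[i] = totals.get(i, 0.0) + v
    # Keep only positions whose sum is non-zero, in ascending index order.
    indices = sorted(i for i in totals if totals[i] != 0)
    return size, indices, [totals[i] for i in indices]


# [1, 0] + [0, 1] stays two-dimensional.
print(sparse_add(2, [0], [1.0], [1], [1.0]))  # (2, [0, 1], [1.0, 1.0])
```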
On the other hand, if the two vectors cover different dimensions and you are trying to combine them into a higher-dimensional space, say four dimensions, then your operation might be valid. However, this functionality isn't provided by the SparseVector class, so you would have to write a function for it, something like this (a bit imperative, but I accept suggestions):
import org.apache.spark.mllib.linalg.SparseVector

def combine(v1: SparseVector, v2: SparseVector): SparseVector = {
  val size = v1.size + v2.size
  // Indices of v2 are shifted past the end of v1
  val maxIndex = v1.size
  val indices = v1.indices ++ v2.indices.map(e => e + maxIndex)
  val values = v1.values ++ v2.values
  new SparseVector(size, indices, values)
}
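The index shift that combine performs can be checked without Spark. Below is a minimal plain-Python sketch of the same idea on (size, indices, values) triples. The name `sparse_concat` is hypothetical.

```python
def sparse_concat(size_a, idx_a, val_a, size_b, idx_b, val_b):
    """Concatenate two sparse vectors by shifting the second one's indices."""
    shifted = [i + size_a for i in idx_b]  # move b past the end of a
    return size_a + size_b, list(idx_a) + shifted, list(val_a) + list(val_b)


# [1, 0] followed by [0, 1] gives [1, 0, 0, 1].
print(sparse_concat(2, [0], [1.0], 2, [1], [1.0]))  # (4, [0, 3], [1.0, 1.0])
```

Because every index of the second vector is larger than every index of the first after the shift, the combined index list stays sorted as long as both inputs were sorted.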
If your vectors are in different columns of a DataFrame, you can use VectorAssembler. You just need to call setInputCols with your 2 vector columns, and Spark will make your wish come true ;)