How to map over a Spark Vector? - scala

I have a mllib.linalg.Vector in Scala containing Double values in range of (-1; 1). I would like to multiply all of the values by, let's say, 100.
For example I'd like to convert [0.5, 0.3, -0.1] to [50, 30, -10].
How can I do it?

import org.apache.spark.mllib.linalg.*
val vec = org.apache.spark.mllib.linalg.Vectors.dense(0.5, 0.3, -0.1)
val vec2 = Vectors.dense(vec.toArray.map(_*100))

Related

Efficient way to extract a row and do cosine similarity

In the code below, I get a dense Matrix V after doing SVD. What I want is
Given a set of values(say 3,7,9).
I want to extract the 3,7 and 9th row of Matrix V.
I want to calculate cosine similarity of these 3 rows with each row of Matrix V
I need to add the three cosine similarities obtained for of each row.
I finally need the index of row which have the maximum summation.
val data = Array(
Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0))
val dataRDD = sc.parallelize(data)
val mat: RowMatrix = new RowMatrix(dataRDD)
// Compute the top 4 singular values and corresponding singular vectors.
val svd: SingularValueDecomposition[RowMatrix, Matrix] = mat.computeSVD(4, computeU = true)
val U: RowMatrix = svd.U // The U factor is a RowMatrix.
val s: Vector = svd.s // The singular values are stored in a local dense vector.
val V: Matrix = svd.V // The V factor is a local dense matrix.
Please advise an efficient method to do the same. I have been thinking of converting Matrix V to Indexed Row Matrix, But when I do use row iterator on V, How do I keep track of index of rows? Is there a better way to do it?

convert Dense Vector to Sparse Vector in PySpark

Is there a built in way to create a sparse vector from a dense vector in PySpark? The way I am doing this is the following:
Vectors.sparse(len(denseVector), [(i,j) for i,j in enumerate(denseVector) if j != 0 ])
That satisfies the [size, (index, data)] format. Seems kinda hacky. Is there a more efficient way to do it?
import scipy.sparse
from pyspark.ml.linalg import Vectors, _convert_to_vector, VectorUDT
from pyspark.sql.functions import udf, col
If you have just one dense vector this will do it:
def dense_to_sparse(vector):
return _convert_to_vector(scipy.sparse.csc_matrix(vector.toArray()).T)
dense_to_sparse(densevector)
The trick here is that csc_matrix.shape[1] has to equal 1, so transpose the vector. Have a look at the source of _convert_to_vector: https://people.eecs.berkeley.edu/~jegonzal/pyspark/_modules/pyspark/mllib/linalg.html
The more likely scenario is you have a DF with a column of densevectors:
to_sparse = udf(dense_to_sparse, VectorUDT())
DF.withColumn("sparse", to_sparse(col("densevector"))
I'm not sure whether you're using mllib or ml. Anyway, You can convert like this:
from pyspark.mllib.linalg import Vectors as mllib_vectors
from pyspark.ml.linalg import Vectors as ml_vectors
# Construct dense vectors in mllib and ml
v1 = mllib_vectors.dense([1.0, 1.0, 0, 0, 0])
v2 = ml_vectors.dense([1.0, 1.0, 0, 0, 0])
# Convert ml dense vector to sparse vector
arr2 = v2.toArray()
print('arr2', arr2)
d = {i:arr2[i] for i in np.nonzero(arr2)[0]}
print('d', d)
v4 = ml_vectors.sparse(len(arr2), d)
print('v4: %s' % v4)
# Convert mllib dense vector to sparse vector
v6 = ml_vectors.sparse(len(arr2), d)
print('v6: %s' % v6)

Reducing Block Matrices to their Sum

Is there an efficient way to reduce a block matrix to a sum of all of its values? I'm looking to calculate the Euclidean distance between two block matrices (d2, as defined in the response here https://math.stackexchange.com/questions/507742/distance-similarity-between-two-matrices).
As a follow up, there doesn't appear to be a simple way to subtract two block matrices. Is there any way to multiply each by a constant?
Edit: Found a workaround for subtraction. V, W, and H are the three matrices. The negOneBlock is a matrix of size V which only contain negative ones.
V.add((W.multiply(H)).multiply(negOneBlock))
Applying a sum for each block and then reducing should quite efficient.
import org.apache.spark.mllib.linalg.distributed._
def sum(mat: BlockMatrix) = mat.blocks.map(_._2.toArray.sum).sum
where
_.blocks
creates a RDD[((Int, Int), Matrix)],
_._2
extracts Matrix, and
toArray.sum
aggregates all values in the block. For data like:
val mat: BlockMatrix = new CoordinateMatrix(sc.parallelize(Seq(
MatrixEntry(0, 10, 1.0), MatrixEntry(10, 1024, 2.0),
MatrixEntry(3000, 10, 3.0))
)).toBlockMatrix(128, 128)
sum(mat)
we get expected result which 6.0.

Creating a DenseMatrix from a Transpose

I started using Breeze since a few weeks and I am not able to do something that seems simple. I want to transform a Transpose into a DenseMatrix, for example:
val matrix = DenseMatrix((1.0, 3.5), (3.0, 2.0)) // DenseMatrix
val meanCols = mean(matrix(::, *)) // Transpose
val meanColsDM = meanCols.toDenseMatrix // Error: value toDenseMatrix is not a member of breeze.linalg.Transpose
I thought about creating a loop to transform the Transpose into an array to then create the DenseMatrix (1 row, 2 cols using the matrix from the example) but I wonder if there is a simpler way to obtain the same thing.
I need to do this to then concatene the mean of the columns with other matrices, I did not put the code in the example as it is not the source of the problem.
meanCols is a Transpose[DenseVector[Double]], which is just a wrapper for a DenseVector[Double]. If you want the result in a matrix with one row and two columns, you can transpose it again with .t to get a DenseVector[Double] and then convert that to a matrix with .toDenseVector:
scala> import breeze.linalg._, breeze.stats.mean
import breeze.linalg._
import breeze.stats.mean
scala> val matrix = DenseMatrix((1.0, 3.5), (3.0, 2.0))
matrix: breeze.linalg.DenseMatrix[Double] =
1.0 3.5
3.0 2.0
scala> val meanCols = mean(matrix(::, *))
meanCols: breeze.linalg.Transpose[breeze.linalg.DenseVector[Double]] = ...
scala> val meanColsDM = meanCols.t.toDenseMatrix
meanColsDM: breeze.linalg.DenseMatrix[Double] = 2.0 2.75

Concatenate Sparse Vectors in Spark?

Say you have two Sparse Vectors. As an example:
val vec1 = Vectors.sparse(2, List(0), List(1)) // [1, 0]
val vec2 = Vectors.sparse(2, List(1), List(1)) // [0, 1]
I want to concatenate these two vectors so that the result is equivalent to:
val vec3 = Vectors.sparse(4, List(0, 2), List(1, 1)) // [1, 0, 0, 1]
Does Spark have any such convenience method to do this?
If you have the data in a DataFrame, then VectorAssembler would be the right thing to use. For example:
from pyspark.ml.feature import VectorAssembler
dataset = spark.createDataFrame(
[(0, Vectors.sparse(10, {0: 0.6931, 5: 0.0, 7: 0.5754, 9: 0.2877}), Vectors.sparse(10, {3: 0.2877, 4: 0.6931, 5: 0.0, 6: 0.6931, 8: 0.6931}))],
["label", "userFeatures1", "userFeatures2"])
assembler = VectorAssembler(
inputCols=["userFeatures1", "userFeatures2"],
outputCol="features")
output = assembler.transform(dataset)
output.select("features", "label").show(truncate=False)
You would get the following output for this:
+---------------------------------------------------------------------------+-----+
|features |label|
+---------------------------------------------------------------------------+-----+
|(20,[0,7,9,13,14,16,18], [0.6931,0.5754,0.2877,0.2877,0.6931,0.6931,0.6931])|0|
+---------------------------------------------------------------------------+-----+
I think you have a slight problem understanding SparseVectors. Therefore I will make a little explanation about them, the first argument is the number of features | columns | dimensions of the data, besides every entry of the List in the second argument represent the position of the feature, and the values in the the third List represent the value for that column, therefore SparseVectors are locality sensitive, and from my point of view your approach is incorrect.
If you pay more attention you are summing or combining two vectors that have the same dimensions, hence the real result would be different, the first argument tells us that the vector has only 2 dimensions, so [1,0] + [0,1] => [1,1] and the correct representation would be Vectors.sparse(2, [0,1], [1,1]), not four dimensions.
In the other hand if each vector has two different dimensions and you are trying to combine them and represent them in a higher dimensional space, let's say four then your operation might be valid, however this functionality isn't provided by the SparseVector class, and you would have to program a function to do that, something like (a bit imperative but I accept suggestions):
def combine(v1:SparseVector, v2:SparseVector):SparseVector = {
val size = v1.size + v2.size
val maxIndex = v1.size
val indices = v1.indices ++ v2.indices.map(e => e + maxIndex)
val values = v1.values ++ v2.values
new SparseVector(size, indices, values)
}
If your vectors represent different columns of a dataframe, you can use VectorAssembler. Just need to set setInputcols (your 2 vectors) and Spark will make your wish come true ;)