Say you have two SparseVectors. For example:
val vec1 = Vectors.sparse(2, Array(0), Array(1.0)) // [1, 0]
val vec2 = Vectors.sparse(2, Array(1), Array(1.0)) // [0, 1]
I want to concatenate these two vectors so that the result is equivalent to:
val vec3 = Vectors.sparse(4, Array(0, 3), Array(1.0, 1.0)) // [1, 0, 0, 1]
Does Spark have any such convenience method to do this?
If you have the data in a DataFrame, then VectorAssembler would be the right thing to use. For example:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.linalg import Vectors

dataset = spark.createDataFrame(
    [(0, Vectors.sparse(10, {0: 0.6931, 5: 0.0, 7: 0.5754, 9: 0.2877}),
      Vectors.sparse(10, {3: 0.2877, 4: 0.6931, 5: 0.0, 6: 0.6931, 8: 0.6931}))],
    ["label", "userFeatures1", "userFeatures2"])
assembler = VectorAssembler(
    inputCols=["userFeatures1", "userFeatures2"],
    outputCol="features")
output = assembler.transform(dataset)
output.select("features", "label").show(truncate=False)
You would get the following output for this:
+----------------------------------------------------------------------------+-----+
|features                                                                    |label|
+----------------------------------------------------------------------------+-----+
|(20,[0,7,9,13,14,16,18],[0.6931,0.5754,0.2877,0.2877,0.6931,0.6931,0.6931]) |0    |
+----------------------------------------------------------------------------+-----+
I think you have a slight misunderstanding of SparseVectors, so here is a short explanation. The first argument is the number of features (columns/dimensions) of the data, each entry in the second list is the position of a feature, and each value in the third list is the value for that position. SparseVectors are therefore position-sensitive, and from my point of view your approach is incorrect.
If you look more closely, you are summing or combining two vectors that have the same dimensions, so the real result would be different. The first argument tells us that each vector has only 2 dimensions, so [1,0] + [0,1] => [1,1], and the correct representation would be Vectors.sparse(2, [0,1], [1,1]), not four dimensions.
On the other hand, if each vector has two different dimensions and you are trying to combine them and represent them in a higher dimensional space, say four, then your operation might be valid. However, this functionality isn't provided by the SparseVector class, so you would have to write a function to do it, something like this (a bit imperative, but I accept suggestions):
import org.apache.spark.mllib.linalg.SparseVector

def combine(v1: SparseVector, v2: SparseVector): SparseVector = {
  val size = v1.size + v2.size
  val maxIndex = v1.size
  // Shift the second vector's indices past the end of the first one;
  // both index arrays are sorted, so the concatenation stays sorted.
  val indices = v1.indices ++ v2.indices.map(e => e + maxIndex)
  val values = v1.values ++ v2.values
  new SparseVector(size, indices, values)
}
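For example, applied to the vectors from the question (a quick, untested check; the cast is needed because Vectors.sparse returns a plain Vector):
import org.apache.spark.mllib.linalg.Vectors
val sv1 = Vectors.sparse(2, Array(0), Array(1.0)).asInstanceOf[SparseVector] // [1, 0]
val sv2 = Vectors.sparse(2, Array(1), Array(1.0)).asInstanceOf[SparseVector] // [0, 1]
combine(sv1, sv2) // (4,[0,3],[1.0,1.0]), i.e. [1, 0, 0, 1]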
If your vectors represent different columns of a DataFrame, you can use VectorAssembler. You just need to call setInputCols with your 2 vector columns and Spark will make your wish come true ;)
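A minimal Scala sketch of that, assuming your two vectors sit in columns named "v1" and "v2" of a DataFrame df (the column names are made up for illustration):
import org.apache.spark.ml.feature.VectorAssembler

val assembler = new VectorAssembler()
  .setInputCols(Array("v1", "v2"))
  .setOutputCol("features")
val assembled = assembler.transform(df)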
I have an RDD[Matrix[Double]] and want to convert it to RDD[Vector] (each row in the Matrix will be converted to a Vector).
I've seen a related answer, Convert Matrix to RowMatrix in Apache Spark using Scala, but that goes from one Matrix to an RDD of Vector, while my case is an RDD of Matrix.
Use flatMap with a function that converts a Matrix to a Seq[Vector]:
import org.apache.spark.mllib.linalg.{DenseVector, Matrix, Vector}
import org.apache.spark.rdd.RDD

// from https://stackoverflow.com/a/28172826/1206998
def toSeqOfVector(m: Matrix): Seq[Vector] = {
  // toArray is column-major, so grouping by numRows yields the columns
  val columns = m.toArray.grouped(m.numRows)
  val rows = columns.toSeq.transpose // Skip this if you want a column-major RDD.
  rows.map(row => new DenseVector(row.toArray))
}
val matrices: RDD[Matrix] = ??? // your input
val vectors: RDD[Vector] = matrices.flatMap(toSeqOfVector)
Note: I didn't test this code, but this is the principle
In the code below, I get a dense Matrix V after doing SVD. What I want is:
Given a set of values (say 3, 7, 9),
I want to extract the 3rd, 7th, and 9th rows of Matrix V.
I want to calculate the cosine similarity of these 3 rows with each row of Matrix V.
I need to add the three cosine similarities obtained for each row.
I finally need the index of the row which has the maximum sum.
import org.apache.spark.mllib.linalg.{Matrix, SingularValueDecomposition, Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val data = Array(
Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0))
val dataRDD = sc.parallelize(data)
val mat: RowMatrix = new RowMatrix(dataRDD)
// Compute the top 4 singular values and corresponding singular vectors.
val svd: SingularValueDecomposition[RowMatrix, Matrix] = mat.computeSVD(4, computeU = true)
val U: RowMatrix = svd.U // The U factor is a RowMatrix.
val s: Vector = svd.s // The singular values are stored in a local dense vector.
val V: Matrix = svd.V // The V factor is a local dense matrix.
Please advise an efficient method to do this. I have been thinking of converting Matrix V to an IndexedRowMatrix, but when I use a row iterator on V, how do I keep track of the row indices? Is there a better way to do it?
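One possible approach, as a rough, untested sketch: V is a local dense matrix, so this can be done on the driver. Matrix.rowIter (available since Spark 2.0) yields the rows in order, so zipWithIndex supplies exactly the row indices asked about:
def cosine(a: Array[Double], b: Array[Double]): Double = {
  val dot = a.zip(b).map { case (x, y) => x * y }.sum
  val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
  if (norms == 0.0) 0.0 else dot / norms
}

val rows = V.rowIter.map(_.toArray).toIndexedSeq // index i -> row i of V
val selected = Seq(3, 7, 9).map(rows)            // the example rows from above

// For every row of V, sum its cosine similarities with the selected rows,
// then keep the index of the row with the maximum sum.
val (bestIndex, _) = rows.zipWithIndex
  .map { case (row, i) => (i, selected.map(s => cosine(s, row)).sum) }
  .maxBy(_._2)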
Is there an efficient way to reduce a block matrix to a sum of all of its values? I'm looking to calculate the Euclidean distance between two block matrices (d2, as defined in the response here https://math.stackexchange.com/questions/507742/distance-similarity-between-two-matrices).
As a follow up, there doesn't appear to be a simple way to subtract two block matrices. Is there any way to multiply each by a constant?
Edit: I found a workaround for subtraction. V, W, and H are the three matrices, and negOneBlock is a matrix of the same size as V that contains only negative ones.
V.add((W.multiply(H)).multiply(negOneBlock))
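On the constant question: BlockMatrix has no scalar multiply as far as I know, but a hedged, untested workaround is to rebuild each block scaled (note that toArray densifies sparse blocks):
import org.apache.spark.mllib.linalg.Matrices
import org.apache.spark.mllib.linalg.distributed.BlockMatrix

def scale(mat: BlockMatrix, c: Double): BlockMatrix = {
  val scaled = mat.blocks.mapValues { m =>
    // toArray is column-major, which is what Matrices.dense expects
    Matrices.dense(m.numRows, m.numCols, m.toArray.map(_ * c))
  }
  new BlockMatrix(scaled, mat.rowsPerBlock, mat.colsPerBlock)
}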
Applying a sum to each block and then reducing should be quite efficient.
import org.apache.spark.mllib.linalg.distributed._
def sum(mat: BlockMatrix) = mat.blocks.map(_._2.toArray.sum).sum
where blocks creates an RDD[((Int, Int), Matrix)], _._2 extracts the Matrix, and toArray.sum aggregates all values in the block. For data like:
val mat: BlockMatrix = new CoordinateMatrix(sc.parallelize(Seq(
MatrixEntry(0, 10, 1.0), MatrixEntry(10, 1024, 2.0),
MatrixEntry(3000, 10, 3.0))
)).toBlockMatrix(128, 128)
sum(mat)
we get the expected result, which is 6.0.
I started using Breeze a few weeks ago and I am not able to do something that seems simple. I want to transform a Transpose into a DenseMatrix, for example:
val matrix = DenseMatrix((1.0, 3.5), (3.0, 2.0)) // DenseMatrix
val meanCols = mean(matrix(::, *)) // Transpose
val meanColsDM = meanCols.toDenseMatrix // Error: value toDenseMatrix is not a member of breeze.linalg.Transpose
I thought about writing a loop to transform the Transpose into an array and then create the DenseMatrix (1 row, 2 cols using the matrix from the example), but I wonder if there is a simpler way to obtain the same thing.
I need to do this in order to then concatenate the mean of the columns with other matrices; I did not put that code in the example as it is not the source of the problem.
meanCols is a Transpose[DenseVector[Double]], which is just a wrapper for a DenseVector[Double]. If you want the result in a matrix with one row and two columns, you can transpose it again with .t to get a DenseVector[Double] and then convert that to a matrix with .toDenseMatrix:
scala> import breeze.linalg._, breeze.stats.mean
import breeze.linalg._
import breeze.stats.mean
scala> val matrix = DenseMatrix((1.0, 3.5), (3.0, 2.0))
matrix: breeze.linalg.DenseMatrix[Double] =
1.0 3.5
3.0 2.0
scala> val meanCols = mean(matrix(::, *))
meanCols: breeze.linalg.Transpose[breeze.linalg.DenseVector[Double]] = ...
scala> val meanColsDM = meanCols.t.toDenseMatrix
meanColsDM: breeze.linalg.DenseMatrix[Double] = 2.0 2.75
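Since the stated goal is to concatenate the row of means with other matrices, a quick hedged follow-up (untested): Breeze's DenseMatrix.vertcat stacks matrices with matching column counts, e.g.
val stacked = DenseMatrix.vertcat(meanColsDM, matrix)
// 2.0  2.75
// 1.0  3.5
// 3.0  2.0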
I'm working on a classification model and I have a problem to create the correct form of data for the model.
In my dataset there are 3 columns with sums. I discretized these columns with the given Bucketizer. The rest of the columns are categorical with Strings as values. I used the StringIndexer to transform these features. Afterwards I select the best columns via ChiSqSelector. So far so good.
But now I want to transform the categorical features into dummy variables. I don't know how to do that because I already have the data in the form of LabeledPoints. Is there an easy way or an existing solution to transform the values from a set of vectors into dummy variables? Or does anyone have a suggestion to solve this problem in another way?
@zero323 The input for ChiSqSelector has to be an RDD[LabeledPoint]. My data has 25 features. I select the 15 best features, but for simplicity let's say I have the following LabeledPoints:
LabeledPoint(1, [1, 2, 3])
LabeledPoint(0, [2, 1, 3])
LabeledPoint(1, [1, 3, 1])
For example, ChiSqSelector selects only the best (first) feature, so my LabeledPoints are:
LabeledPoint(1, [1])
LabeledPoint(0, [2])
LabeledPoint(1, [1])
How can I encode the features from the feature vector as dummy variables, so that my LabeledPoints become:
LabeledPoint(1, [1, 0])
LabeledPoint(0, [0, 1])
LabeledPoint(1, [1, 0])
Hope that helps. Or do you need some code?
Edit:
My idea right now is something like this:
Convert the label and features from each LabeledPoint to a Row, convert the RDD to a DataFrame, and use the OneHotEncoder:
val data = chiData.map { r =>
  val label = r.label
  val feature1 = r.features.toArray(0)
  val feature2 = r.features.toArray(1)
  val feature3 = r.features.toArray(2)
  ....
  Row(label, feature1, feature2, feature3, ...)
}
//Convert RDD to DataFrame
//Use OneHotEncoder
//Create LabeledPoints again for use in Algorithms
But I think this is not the smartest way.
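For what it's worth, here is a rough, untested sketch of that idea for a single selected feature. It assumes a Spark 1.x-style setup (where ml transformers still work with mllib.linalg.Vector), the spark-shell implicits for toDF, and made-up column names:
import org.apache.spark.ml.feature.OneHotEncoder
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint

// chiData: RDD[LabeledPoint] coming out of ChiSqSelector
val df = chiData.map(lp => (lp.label, lp.features(0))).toDF("label", "feature1")

val encoded = new OneHotEncoder()
  .setInputCol("feature1")
  .setOutputCol("feature1Vec")
  .setDropLast(false) // keep a slot for every category, as in the example above
  .transform(df)

// back to LabeledPoints for the mllib algorithms
val dummyData = encoded.select("label", "feature1Vec").map { row =>
  LabeledPoint(row.getDouble(0), row.getAs[Vector]("feature1Vec"))
}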