I started using Breeze a few weeks ago and I am not able to do something that seems simple. I want to transform a Transpose into a DenseMatrix, for example:
val matrix = DenseMatrix((1.0, 3.5), (3.0, 2.0)) // DenseMatrix
val meanCols = mean(matrix(::, *)) // Transpose
val meanColsDM = meanCols.toDenseMatrix // Error: value toDenseMatrix is not a member of breeze.linalg.Transpose
I thought about creating a loop to transform the Transpose into an array to then create the DenseMatrix (1 row, 2 cols using the matrix from the example) but I wonder if there is a simpler way to obtain the same thing.
I need this so that I can then concatenate the column means with other matrices. I did not include that code in the example as it is not the source of the problem.
meanCols is a Transpose[DenseVector[Double]], which is just a wrapper for a DenseVector[Double]. If you want the result in a matrix with one row and two columns, you can transpose it again with .t to get a DenseVector[Double] and then convert that to a matrix with .toDenseMatrix:
scala> import breeze.linalg._, breeze.stats.mean
import breeze.linalg._
import breeze.stats.mean
scala> val matrix = DenseMatrix((1.0, 3.5), (3.0, 2.0))
matrix: breeze.linalg.DenseMatrix[Double] =
1.0 3.5
3.0 2.0
scala> val meanCols = mean(matrix(::, *))
meanCols: breeze.linalg.Transpose[breeze.linalg.DenseVector[Double]] = ...
scala> val meanColsDM = meanCols.t.toDenseMatrix
meanColsDM: breeze.linalg.DenseMatrix[Double] = 2.0 2.75
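Since the goal is to concatenate the column means with other matrices, here is a minimal sketch of that last step. It assumes the other matrix has the same number of columns and stacks the two with DenseMatrix.vertcat; the names meanRow and stacked are only illustrative.

```scala
import breeze.linalg._
import breeze.stats.mean

val matrix = DenseMatrix((1.0, 3.5), (3.0, 2.0))

// Column means as a 1 x 2 matrix, as in the answer above.
val meanRow = mean(matrix(::, *)).t.toDenseMatrix

// Stack the mean row under the original matrix (column counts must match).
val stacked = DenseMatrix.vertcat(matrix, meanRow)
```

DenseMatrix.horzcat works the same way for side-by-side concatenation, in which case the row counts must match instead.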
Related
In the code below, I get a dense Matrix V after doing SVD. What I want is
Given a set of values (say 3, 7, 9):
I want to extract the 3rd, 7th and 9th rows of Matrix V.
I want to calculate the cosine similarity of these 3 rows with each row of Matrix V.
I need to add the three cosine similarities obtained for each row.
Finally, I need the index of the row which has the maximum sum.
val data = Array(
Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0))
val dataRDD = sc.parallelize(data)
val mat: RowMatrix = new RowMatrix(dataRDD)
// Compute the top 4 singular values and corresponding singular vectors.
val svd: SingularValueDecomposition[RowMatrix, Matrix] = mat.computeSVD(4, computeU = true)
val U: RowMatrix = svd.U // The U factor is a RowMatrix.
val s: Vector = svd.s // The singular values are stored in a local dense vector.
val V: Matrix = svd.V // The V factor is a local dense matrix.
Please advise an efficient method to do this. I have been thinking of converting Matrix V to an IndexedRowMatrix, but when I use the row iterator on V, how do I keep track of the row indices? Is there a better way to do it?
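One possible approach, sketched under assumptions rather than taken from an accepted answer: V is a local matrix, so it can be handled on the driver with Breeze. The sketch assumes V has already been copied into a Breeze DenseMatrix (for example with new DenseMatrix(V.numRows, V.numCols, V.toArray), since both are column-major) and that row indices are 0-based. bestRow is a hypothetical helper name. It normalises each row to unit length, so the summed cosine similarity of a row against the selected rows is a single dot product with the sum of the selected unit rows.

```scala
import breeze.linalg._

// Hypothetical helper: index of the row of `v` whose summed cosine similarity
// to the rows listed in `selected` (0-based) is largest.
def bestRow(v: DenseMatrix[Double], selected: Seq[Int]): Int = {
  // Work on a copy and scale every non-zero row to unit length in place.
  val unit = v.copy
  for (i <- 0 until unit.rows) {
    val row = unit(i, ::).t // a view onto row i, sharing storage with `unit`
    val n = norm(row)
    if (n > 0.0) row /= n
  }
  // Sum of the selected unit rows; a dot product with this vector equals the
  // sum of the individual cosine similarities.
  val query = selected.map(i => unit(i, ::).t).reduce(_ + _)
  val scores = unit * query
  argmax(scores)
}
```

Note that each selected row has similarity 1 with itself, so the selected rows may need to be excluded from the result depending on what is wanted.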
I have an IndexedRowMatrix of doubles. I want to compute the sum of each row of the matrix and save the results to a Vector. After that I want to broadcast this vector.
I am creating an RDD of Doubles, which contains the sums, but I cannot turn it into a vector.
So, the question basically is how to create the Vector I want from the IndexedRowMatrix.
Collect to the driver and construct a vector:
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD
val sc: SparkContext = ???
val rdd: RDD[Double] = ???
val vec: Vector = Vectors.dense(rdd.collect)
val broadcastVec = sc.broadcast(vec)
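For completeness, a sketch of how the row sums themselves might be computed from the IndexedRowMatrix before collecting. It assumes a local SparkContext for illustration and that the row indices are contiguous from 0, so that position i in the resulting vector corresponds to row i. The example data is made up.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

// Local context for illustration only; in a real job `sc` already exists.
val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("row-sums"))

// Made-up example matrix; rows are deliberately given out of order.
val mat = new IndexedRowMatrix(sc.parallelize(Seq(
  IndexedRow(1L, Vectors.dense(3.0, 4.0)),
  IndexedRow(0L, Vectors.dense(1.0, 2.0)))))

// Sum each row, then sort by row index so the collected order is deterministic.
val sums: Array[Double] = mat.rows
  .map(r => (r.index, r.vector.toArray.sum))
  .sortByKey()
  .values
  .collect()

val vec: Vector = Vectors.dense(sums)
val broadcastVec = sc.broadcast(vec)
```

Sorting by index matters because an RDD gives no guarantee that rows arrive in index order.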
References:
https://spark.apache.org/docs/2.1.0/mllib-data-types.html#local-vector
https://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables
I'm struggling to find a way to quickly convert a DenseMatrix to a SparseMatrix.
I tried flattening the DenseMatrix to an array, converting it to a sparse matrix and then reshaping it, but this is not possible since there is no reshape function.
val dm = DenseMatrix((1,2,3),(0,0,0),(0,0,0))
val sm = CSCMatrix(dm.toArray)
sm.reshape(3,3)
error: value reshape is not a member of breeze.linalg.CSCMatrix[Int]
How about something like this:
val dm = DenseMatrix((1,2,3),(0,0,0),(0,0,0))
val sm = CSCMatrix.tabulate(dm.rows, dm.cols)(dm(_, _))
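An alternative sketch, in case explicit control over which entries get stored is wanted: CSCMatrix.Builder lets you add only the non-zero cells. This is an illustration of the same conversion, not a claim that it is faster than tabulate.

```scala
import breeze.linalg._

val dm = DenseMatrix((1, 2, 3), (0, 0, 0), (0, 0, 0))

// Add only the non-zero cells, so nothing else is stored explicitly.
val builder = new CSCMatrix.Builder[Int](rows = dm.rows, cols = dm.cols)
for (c <- 0 until dm.cols; r <- 0 until dm.rows if dm(r, c) != 0)
  builder.add(r, c, dm(r, c))

val sm: CSCMatrix[Int] = builder.result()
```

Converting back with sm.toDense is a quick way to check that the result matches the original matrix.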
With pandas/numpy, multiplying a 2x2 matrix by a two-element vector multiplies each column of the matrix by the corresponding element of the vector.
Ex. The following with numpy
>>> data = np.array([[1, 2], [3, 4]])
>>> data
array([[1, 2],
[3, 4]])
>>> data * [2, 4]
array([[ 2, 8],
[ 6, 16]])
How can this operation be done with spark/breeze? I tried unsuccessfully with new DenseVector(2, 2, Array(1,2,3,4)) * DenseVector(2, 4).
Spark DataFrames are not designed for linear algebra operations. Theoretically you can combine all columns using VectorAssembler and perform multiplications using ElementwiseProduct:
import org.apache.spark.ml.feature.ElementwiseProduct
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.ml.feature.VectorAssembler
val assembler = new VectorAssembler()
.setInputCols(Array("x1", "x2"))
.setOutputCol("xs")
val product = new ElementwiseProduct()
.setScalingVec(Vectors.dense(Array(2.0, 4.0)))
.setInputCol("xs")
.setOutputCol("xs_transformed")
val df = sc.parallelize(Seq((1.0, 2.0), (3.0, 4.0))).toDF("x1", "x2")
product.transform(assembler.transform(df)).select("xs_transformed").show
// +--------------+
// |xs_transformed|
// +--------------+
// | [2.0,8.0]|
// | [6.0,16.0]|
// +--------------+
but it is useful only for basic transformations.
In Breeze, this is done with the special broadcasting value *.
scala> import breeze.linalg._
import breeze.linalg._
scala> val dm = DenseMatrix((1,2), (3,4))
dm: breeze.linalg.DenseMatrix[Int] =
1 2
3 4
scala> dm(*, ::) :* DenseVector(2,4)
res0: breeze.linalg.DenseMatrix[Int] =
2 8
6 16
dm(*, ::) says "apply the operation to every row". Element-wise multiplication is :* (spelled *:* in newer Breeze versions), while matrix multiplication is *.
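To make the row/column distinction concrete, here is a small sketch contrasting the two broadcast directions. It assumes Breeze 0.13 or later, where the element-wise operator is written *:*; on older versions substitute :*.

```scala
import breeze.linalg._

val dm = DenseMatrix((1, 2), (3, 4))
val v = DenseVector(2, 4)

// Every row is multiplied element-wise by v (matches the NumPy example).
val byRow = dm(*, ::) *:* v

// Every column is multiplied element-wise by v instead.
val byCol = dm(::, *) *:* v
```

With these inputs byRow has rows (2, 8) and (6, 16), while byCol has rows (2, 4) and (12, 16).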
Say you have two Sparse Vectors. As an example:
val vec1 = Vectors.sparse(2, Array(0), Array(1.0)) // [1, 0]
val vec2 = Vectors.sparse(2, Array(1), Array(1.0)) // [0, 1]
I want to concatenate these two vectors so that the result is equivalent to:
val vec3 = Vectors.sparse(4, Array(0, 3), Array(1.0, 1.0)) // [1, 0, 0, 1]
Does Spark have any such convenience method to do this?
If you have the data in a DataFrame, then VectorAssembler would be the right thing to use. For example:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.linalg import Vectors

dataset = spark.createDataFrame(
    [(0,
      Vectors.sparse(10, {0: 0.6931, 5: 0.0, 7: 0.5754, 9: 0.2877}),
      Vectors.sparse(10, {3: 0.2877, 4: 0.6931, 5: 0.0, 6: 0.6931, 8: 0.6931}))],
    ["label", "userFeatures1", "userFeatures2"])
assembler = VectorAssembler(
    inputCols=["userFeatures1", "userFeatures2"],
    outputCol="features")
output = assembler.transform(dataset)
output.select("features", "label").show(truncate=False)
You would get the following output for this:
+---------------------------------------------------------------------------+-----+
|features |label|
+---------------------------------------------------------------------------+-----+
|(20,[0,7,9,13,14,16,18], [0.6931,0.5754,0.2877,0.2877,0.6931,0.6931,0.6931])|0|
+---------------------------------------------------------------------------+-----+
I think you have a slight misunderstanding of SparseVectors, so here is a short explanation. The first argument is the number of features (columns, or dimensions) of the data. Each entry in the second argument is the position of a feature, and the corresponding entry in the third argument is the value at that position. SparseVectors are therefore position sensitive, and from my point of view your approach is incorrect.
If you look closely, you are combining two vectors that have the same dimensions, so the real result would be different. The first argument tells us that each vector has only 2 dimensions, so [1,0] + [0,1] => [1,1], and the correct representation would be Vectors.sparse(2, [0,1], [1,1]), not four dimensions.
On the other hand, if each vector represents two different dimensions and you are trying to combine them in a higher-dimensional space, say four dimensions, then your operation might be valid. However, this functionality isn't provided by the SparseVector class, so you would have to write a function to do it, something like this (a bit imperative, but I accept suggestions):
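// or org.apache.spark.ml.linalg.SparseVector for the DataFrame-based API
import org.apache.spark.mllib.linalg.SparseVector

def combine(v1: SparseVector, v2: SparseVector): SparseVector = {
  val size = v1.size + v2.size
  // v2's indices are shifted so they start right after the end of v1
  val maxIndex = v1.size
  val indices = v1.indices ++ v2.indices.map(e => e + maxIndex)
  val values = v1.values ++ v2.values
  new SparseVector(size, indices, values)
}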
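The same idea extends to any number of vectors. Below is a sketch assuming mllib's SparseVector; concat is a hypothetical helper name, not a Spark API. It shifts each vector's indices by the combined size of the vectors before it.

```scala
import org.apache.spark.mllib.linalg.SparseVector

// Hypothetical helper: concatenate sparse vectors end to end.
def concat(vs: Seq[SparseVector]): SparseVector = {
  // offsets(i) is the position where vector i starts; offsets.last is the total size.
  val offsets = vs.scanLeft(0)(_ + _.size)
  val indices = vs.zip(offsets).flatMap { case (v, off) => v.indices.map(_ + off).toSeq }
  val values = vs.flatMap(_.values.toSeq)
  new SparseVector(offsets.last, indices.toArray, values.toArray)
}

val vec1 = new SparseVector(2, Array(0), Array(1.0)) // [1, 0]
val vec2 = new SparseVector(2, Array(1), Array(1.0)) // [0, 1]
val vec3 = concat(Seq(vec1, vec2))                   // [1, 0, 0, 1]
```

Because each input's indices are already sorted and the offsets only grow, the combined index array stays strictly increasing, which SparseVector requires.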
If your vectors represent different columns of a DataFrame, you can use VectorAssembler. You just need to call setInputCols (with your 2 vector columns) and Spark will make your wish come true ;)