Multiply elements in dataframe column by same value - scala

With pandas/NumPy, multiplying a 2x2 array by a length-2 vector broadcasts the vector across the rows, so each column of the 2x2 array is scaled by the corresponding element of the vector.
For example, with NumPy:
>>> data = np.array([[1, 2], [3, 4]])
>>> data
array([[1, 2],
       [3, 4]])
>>> data * [2, 4]
array([[ 2,  8],
       [ 6, 16]])
How can this operation be done with Spark/Breeze? I tried unsuccessfully with new DenseMatrix(2, 2, Array(1, 2, 3, 4)) * DenseVector(2, 4).

Spark DataFrames are not designed for linear algebra operations. In principle you can combine all the columns using VectorAssembler and perform the multiplication with ElementwiseProduct:
import org.apache.spark.ml.feature.ElementwiseProduct
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.mllib.linalg.Vectors

val assembler = new VectorAssembler()
  .setInputCols(Array("x1", "x2"))
  .setOutputCol("xs")

val product = new ElementwiseProduct()
  .setScalingVec(Vectors.dense(Array(2.0, 4.0)))
  .setInputCol("xs")
  .setOutputCol("xs_transformed")

val df = sc.parallelize(Seq((1.0, 2.0), (3.0, 4.0))).toDF("x1", "x2")

product.transform(assembler.transform(df)).select("xs_transformed").show
// +--------------+
// |xs_transformed|
// +--------------+
// |     [2.0,8.0]|
// |    [6.0,16.0]|
// +--------------+
but it is useful only for basic transformations.
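For a transformation this simple you may not need vectors at all; a minimal sketch (reusing the same df) that scales each column with plain column arithmetic:
// Scale each column by its factor directly; no assembly into vectors needed
val scaled = df.select((df("x1") * 2.0).as("x1"), (df("x2") * 4.0).as("x2"))
scaled.show
// +---+----+
// | x1|  x2|
// +---+----+
// |2.0| 8.0|
// |6.0|16.0|
// +---+----+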

In Breeze, this is done with the special broadcasting value *.
scala> import breeze.linalg._
import breeze.linalg._
scala> val dm = DenseMatrix((1,2), (3,4))
dm: breeze.linalg.DenseMatrix[Int] =
1  2
3  4
scala> dm(*, ::) :* DenseVector(2,4)
res0: breeze.linalg.DenseMatrix[Int] =
2  8
6  16
dm(*, ::) says "apply the operation to every row". Element-wise multiplication is :*, while matrix multiplication is *.
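For contrast, plain * without the broadcast is a true matrix-vector product, so the same matrix should give:
scala> dm * DenseVector(2, 4)
res1: breeze.linalg.DenseVector[Int] = DenseVector(10, 22)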

Related

Spark method for subtracting 2 vectors

I am using Scala Spark. I have a dataframe with 2 columns, each containing a Vector with the same cardinality/size. I want to find the distance between each pair of corresponding elements of the 2 Vectors and put the results in a vector in another column of the dataframe.
Example: [1, 3, 5, -2] - [-2, 5, 0, 1] = [3, 2, 5, 3]
I found the sqdist method, which gives me the sum of the squared distances between 2 Vectors, but how do I get the individual distances between the elements of the vectors?
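As far as I know there is no built-in element-wise distance for two vector columns, but a small UDF covers it; a minimal sketch, assuming ml vectors in a dataframe df with (hypothetical) columns vec1 and vec2:
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.{col, udf}

// Element-wise absolute difference of two equally sized vectors
// (df, vec1, vec2 are assumed names)
val elementwiseDist = udf { (a: Vector, b: Vector) =>
  Vectors.dense(a.toArray.zip(b.toArray).map { case (x, y) => math.abs(x - y) })
}

val withDist = df.withColumn("dist", elementwiseDist(col("vec1"), col("vec2")))
// [1.0,3.0,5.0,-2.0] and [-2.0,5.0,0.0,1.0] -> [3.0,2.0,5.0,3.0]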

How to convert RDD[Double] to Vector in Scala Spark

I have an IndexedRowMatrix of doubles. I want to compute the sum of each row of the matrix and save the results to a Vector. After that I want to broadcast this vector.
I am creating an RDD of Doubles, which contains the sums, but I cannot turn it into a vector.
So, the question basically is how to create the Vector I want from the IndexedRowMatrix.
Collect to the driver and construct a local vector (this assumes the row sums fit in driver memory):
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

val sc: SparkContext = ???
val rdd: RDD[Double] = ???

val vec: Vector = Vectors.dense(rdd.collect)
val broadcastVec = sc.broadcast(vec)
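To cover the first half of the question too, a sketch of producing the row sums from the IndexedRowMatrix (mat is a stand-in for your matrix; sortByKey keeps the sums in row order):
import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix

val mat: IndexedRowMatrix = ???
val rowSums: RDD[Double] = mat.rows
  .map(row => (row.index, row.vector.toArray.sum))
  .sortByKey()
  .values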
References:
https://spark.apache.org/docs/2.1.0/mllib-data-types.html#local-vector
https://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables

convert Dense Vector to Sparse Vector in PySpark

Is there a built-in way to create a sparse vector from a dense vector in PySpark? The way I am doing this is the following:
Vectors.sparse(len(denseVector), [(i,j) for i,j in enumerate(denseVector) if j != 0 ])
That satisfies the [size, (index, data)] format. Seems kinda hacky. Is there a more efficient way to do it?
import scipy.sparse
from pyspark.ml.linalg import Vectors, _convert_to_vector, VectorUDT
from pyspark.sql.functions import udf, col
If you have just one dense vector this will do it:
def dense_to_sparse(vector):
    return _convert_to_vector(scipy.sparse.csc_matrix(vector.toArray()).T)

dense_to_sparse(densevector)
The trick here is that csc_matrix.shape[1] has to equal 1, so transpose the vector. Have a look at the source of _convert_to_vector: https://people.eecs.berkeley.edu/~jegonzal/pyspark/_modules/pyspark/mllib/linalg.html
The more likely scenario is you have a DF with a column of densevectors:
to_sparse = udf(dense_to_sparse, VectorUDT())
DF.withColumn("sparse", to_sparse(col("densevector")))
I'm not sure whether you're using mllib or ml. Anyway, you can convert like this:
import numpy as np
from pyspark.mllib.linalg import Vectors as mllib_vectors
from pyspark.ml.linalg import Vectors as ml_vectors

# Construct dense vectors in mllib and ml
v1 = mllib_vectors.dense([1.0, 1.0, 0, 0, 0])
v2 = ml_vectors.dense([1.0, 1.0, 0, 0, 0])

# Convert ml dense vector to sparse vector
arr2 = v2.toArray()
print('arr2', arr2)
d = {i: arr2[i] for i in np.nonzero(arr2)[0]}
print('d', d)
v4 = ml_vectors.sparse(len(arr2), d)
print('v4: %s' % v4)

# Convert mllib dense vector to sparse vector
arr1 = v1.toArray()
d1 = {i: arr1[i] for i in np.nonzero(arr1)[0]}
v6 = mllib_vectors.sparse(len(arr1), d1)
print('v6: %s' % v6)
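For completeness, on the Scala side the conversion is built in, so no index bookkeeping is needed; a one-liner sketch with mllib vectors:
import org.apache.spark.mllib.linalg.Vectors

val dense = Vectors.dense(1.0, 1.0, 0, 0, 0)
val sparse = dense.toSparse // (5,[0,1],[1.0,1.0]); explicit zeros are dropped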

Creating a DenseMatrix from a Transpose

I started using Breeze a few weeks ago and I am not able to do something that seems simple. I want to transform a Transpose into a DenseMatrix, for example:
val matrix = DenseMatrix((1.0, 3.5), (3.0, 2.0)) // DenseMatrix
val meanCols = mean(matrix(::, *)) // Transpose
val meanColsDM = meanCols.toDenseMatrix // Error: value toDenseMatrix is not a member of breeze.linalg.Transpose
I thought about writing a loop to turn the Transpose into an array and then building the DenseMatrix from it (1 row, 2 cols using the matrix from the example), but I wonder if there is a simpler way to obtain the same thing.
I need to do this so I can then concatenate the mean of the columns with other matrices; I did not put that code in the example as it is not the source of the problem.
meanCols is a Transpose[DenseVector[Double]], which is just a wrapper for a DenseVector[Double]. If you want the result in a matrix with one row and two columns, you can transpose it again with .t to get a DenseVector[Double] and then convert that to a matrix with .toDenseMatrix:
scala> import breeze.linalg._, breeze.stats.mean
import breeze.linalg._
import breeze.stats.mean
scala> val matrix = DenseMatrix((1.0, 3.5), (3.0, 2.0))
matrix: breeze.linalg.DenseMatrix[Double] =
1.0 3.5
3.0 2.0
scala> val meanCols = mean(matrix(::, *))
meanCols: breeze.linalg.Transpose[breeze.linalg.DenseVector[Double]] = ...
scala> val meanColsDM = meanCols.t.toDenseMatrix
meanColsDM: breeze.linalg.DenseMatrix[Double] = 2.0 2.75
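Since the end goal is to concatenate the mean row with other matrices, DenseMatrix.vertcat can stack the one-row matrix onto any matrix with the same number of columns; a sketch reusing matrix from above as the (assumed) stacking target:
scala> DenseMatrix.vertcat(matrix, meanColsDM)
res0: breeze.linalg.DenseMatrix[Double] =
1.0  3.5
3.0  2.0
2.0  2.75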

Concatenate Sparse Vectors in Spark?

Say you have two Sparse Vectors. As an example:
val vec1 = Vectors.sparse(2, List(0), List(1)) // [1, 0]
val vec2 = Vectors.sparse(2, List(1), List(1)) // [0, 1]
I want to concatenate these two vectors so that the result is equivalent to:
val vec3 = Vectors.sparse(4, List(0, 3), List(1, 1)) // [1, 0, 0, 1]
Does Spark have any such convenience method to do this?
If you have the data in a DataFrame, then VectorAssembler would be the right thing to use. For example:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.linalg import Vectors

dataset = spark.createDataFrame(
    [(0, Vectors.sparse(10, {0: 0.6931, 5: 0.0, 7: 0.5754, 9: 0.2877}),
      Vectors.sparse(10, {3: 0.2877, 4: 0.6931, 5: 0.0, 6: 0.6931, 8: 0.6931}))],
    ["label", "userFeatures1", "userFeatures2"])

assembler = VectorAssembler(
    inputCols=["userFeatures1", "userFeatures2"],
    outputCol="features")

output = assembler.transform(dataset)
output.select("features", "label").show(truncate=False)
You would get the following output for this:
+---------------------------------------------------------------------------+-----+
|features                                                                   |label|
+---------------------------------------------------------------------------+-----+
|(20,[0,7,9,13,14,16,18],[0.6931,0.5754,0.2877,0.2877,0.6931,0.6931,0.6931])|0    |
+---------------------------------------------------------------------------+-----+
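For the Scala vectors in the question, a hedged equivalent (assuming a spark-shell session where toDF is available):
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors

val df = Seq(
  (Vectors.sparse(2, Array(0), Array(1.0)), Vectors.sparse(2, Array(1), Array(1.0)))
).toDF("v1", "v2")

val assembler = new VectorAssembler()
  .setInputCols(Array("v1", "v2"))
  .setOutputCol("features")

assembler.transform(df).select("features").show(false)
// (4,[0,3],[1.0,1.0]) == [1, 0, 0, 1]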
I think you have a slight misunderstanding of SparseVectors, so here is a short explanation. The first argument is the number of features / columns / dimensions of the data; each entry of the list in the second argument is the position (index) of a feature; and the values in the third list are the values at those positions. SparseVectors are therefore position-sensitive, and from my point of view your approach is incorrect.
If you look more closely, you are summing or combining two vectors that have the same dimensions, so the real result would be different: the first argument tells us that the vector has only 2 dimensions, hence [1,0] + [0,1] => [1,1], and the correct representation would be Vectors.sparse(2, [0,1], [1,1]), not four dimensions.
On the other hand, if each vector has two different dimensions and you are trying to combine them and represent them in a higher-dimensional space, say four, then your operation might be valid. However, this functionality isn't provided by the SparseVector class, and you would have to write a function to do it, something like this (a bit imperative, but I accept suggestions):
import org.apache.spark.mllib.linalg.SparseVector

def combine(v1: SparseVector, v2: SparseVector): SparseVector = {
  val size = v1.size + v2.size
  val maxIndex = v1.size
  val indices = v1.indices ++ v2.indices.map(e => e + maxIndex)
  val values = v1.values ++ v2.values
  new SparseVector(size, indices, values)
}
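A quick check of combine against the vectors from the question:
val vec1 = new SparseVector(2, Array(0), Array(1.0)) // [1, 0]
val vec2 = new SparseVector(2, Array(1), Array(1.0)) // [0, 1]
combine(vec1, vec2) // (4,[0,3],[1.0,1.0]), i.e. [1, 0, 0, 1]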
If your vectors represent different columns of a dataframe, you can use VectorAssembler. You just need to call setInputCols with your 2 vector columns and Spark will make your wish come true ;)