Adding Sparse Vectors in Apache Spark 3.0.0 (Scala)

I am trying to create a function like the following to add two org.apache.spark.ml.linalg.Vector instances, i.e. two sparse vectors.
Such a vector could look like this:
(28,[1,2,3,4,7,11,12,13,14,15,17,20,22,23,24,25],[0.13028398104008743,0.23648605632753023,0.7094581689825907,0.13028398104008743,0.23648605632753023,0.0,0.14218861229025295,0.3580566057240087,0.14218861229025295,0.13028398104008743,0.26056796208017485,0.0,0.14218861229025295,0.06514199052004371,0.13028398104008743,0.23648605632753023])
For example:
def add_vectors(x: org.apache.spark.ml.linalg.Vector, y: org.apache.spark.ml.linalg.Vector): org.apache.spark.ml.linalg.Vector = {
  ??? // implementation wanted
}
Let's look at a use case
val x = Vectors.sparse(2, List(0), List(1)) // [1, 0]
val y = Vectors.sparse(2, List(1), List(1)) // [0, 1]
I want the output to be
Vectors.sparse(2, List(0,1), List(1,1))
Here's another case where they share the same indices
val x = Vectors.sparse(2, List(1), List(1))
val y = Vectors.sparse(2, List(1), List(1))
The output should be
Vectors.sparse(2, List(1), List(2))
I've realized doing this is harder than it seems. I looked into one possible solution: converting the vectors to Breeze, adding them in Breeze, and then converting the result back to a Spark vector, as in Addition of two RDD[mllib.linalg.Vector]'s. So I tried implementing this:
def add_vectors(x: org.apache.spark.ml.linalg.Vector, y: org.apache.spark.ml.linalg.Vector) = {
  val dense_x = x.toDense
  val dense_y = y.toDense
  val bv1 = new DenseVector(dense_x.toArray)
  val bv2 = new DenseVector(dense_y.toArray)
  val vectout = Vectors.dense((bv1 + bv2).toArray)
  vectout
}
However, this gave me an error on the last line:
val vectout = Vectors.dense((bv1 + bv2).toArray)
Cannot resolve the overloaded method 'dense'.
I'm wondering why this error is occurring and how to fix it.

To answer my own question, I had to think about how sparse vectors are represented. A sparse vector requires three arguments: the number of dimensions, an array of indices, and an array of values. For example:
val indices: Array[Int] = Array(1,2)
val norms: Array[Double] = Array(0.5,0.3)
val num_int = 4
val vector: Vector = Vectors.sparse(num_int, indices, norms)
If I convert this SparseVector to an Array, I get the following:
val choiced_array = vector.toArray
println(choiced_array.mkString("[", ", ", "]"))
Output:
[0.0, 0.5, 0.3, 0.0]
This is the dense representation of the vector. So, once you convert the two vectors to arrays, you can add them element-wise with the following code:
val add: Array[Double] = (vector.toArray, vector_2.toArray).zipped.map(_ + _)
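(Side note: Tuple2.zipped is deprecated on Scala 2.13+; a version-agnostic sketch of the same element-wise addition uses a plain zip:)
// Same element-wise addition without .zipped (deprecated since Scala 2.13)
val add: Array[Double] = vector.toArray.zip(vector_2.toArray).map { case (a, b) => a + b }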
This gives you an array containing the element-wise sums. Next, to create the new sparse vector, you need to build the indices array used in the constructor:
var i = -1
val new_indices_pre = add.map( (element: Double) => {
  i = i + 1
  if (element > 0.0)
    i
  else
    -1
})
Then filter out all the -1 entries, which indicate a zero at that index:
val new_indices = new_indices_pre.filter(element => element != -1)
Also remember to keep only the non-zero values from the array holding the element-wise sums:
val final_add = add.filter(element => element > 0.0)
Lastly, we can build the new sparse vector:
Vectors.sparse(num_int, new_indices, final_add)
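Putting the steps above together, a sketch of the complete function could look like this (note that, just like the steps above, it drops any entry whose sum is not strictly positive):
import org.apache.spark.ml.linalg.{Vector, Vectors}
// Sketch assembled from the steps above: densify, add element-wise,
// then keep only the strictly positive entries and their indices
def add_vectors(x: Vector, y: Vector): Vector = {
  val added = x.toArray.zip(y.toArray).map { case (a, b) => a + b }
  val indices = added.zipWithIndex.collect { case (v, i) if v > 0.0 => i }
  val values = added.filter(_ > 0.0)
  Vectors.sparse(x.size, indices, values)
}
// add_vectors(Vectors.sparse(2, Array(0), Array(1.0)), Vectors.sparse(2, Array(1), Array(1.0)))
// => (2,[0,1],[1.0,1.0])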

Related

Avoid ListBuffer while preparing an element-wise multiplication of two SparseVectors

I'm trying to implement an element-wise multiplication of two ml.linalg.SparseVector instances (also called a Hadamard product).
A SparseVector represents a vector, but rather than taking up space for all the "0" values, they are omitted. The vector is represented as two lists: indices and values.
For example: SparseVector(indices: [0, 100, 100000], values: [0.25, 1, 0.8]) concisely represents an array of 100,000 elements, where only 3 values are non-zero.
I now need an element-wise multiplication of two of these, and there seems to be no built-in. Conceptually, it should be simple - any indices they don't have in common are dropped, and for the indices in common, the numbers are multiplied together.
For example: SparseVector(indices: [0, 500, 100000], values: [10, 1, 10]) when multiplied with the above should return: SparseVector(indices: [0, 100000], values: [2.5, 8])
Sadly, I've found no built-in for this. I have an approach for doing this in a single pass, but it isn't very Scala-y: it has to build up the lists in a loop as it discovers which indices are in common, and then grab the corresponding value for each index (which sits at the same ordinal position, but in a second array).
import org.apache.spark.ml.linalg._
import org.apache.spark.sql.functions.udf
import scala.collection.mutable.ListBuffer

// Return a new SparseVector whose values are the element-wise product (Hadamard product)
val multSparseVectors = udf((v1: SparseVector, v2: SparseVector) => {
  // val commonIndexes = v1.indices.intersect(v2.indices); // Missing scale factors are assumed to have a value of 0, so only common elements remain
  // TODO: No clear way to map common indices to the values that go with those indices. E.g. no "valueForIndex" method
  // new SparseVector(v1.size, commonIndexes, commonIndexes.map(i => v1.valueForIndex(i) * v2.valueForIndex(i)).toArray);
  val indices = ListBuffer[Int]() // TODO: Some way to do this without mutable lists?
  val values = ListBuffer[Double]()
  var v1Pos = 0 // Current position in SparseVector v1's indices (we will be making a single pass)
  var v2Pos = 0 // Current position in SparseVector v2's indices (we will be making a single pass)
  while (v1Pos < v1.indices.length && v2Pos < v2.indices.length) {
    // Advance our position in SparseVector 2 until we've matched or passed the current SparseVector 1 index
    while (v2Pos < v2.indices.length && v2.indices(v2Pos) < v1.indices(v1Pos))
      v2Pos += 1
    if (v2Pos < v2.indices.length && v1.indices(v1Pos) == v2.indices(v2Pos)) {
      indices += v1.indices(v1Pos)
      values += v1.values(v1Pos) * v2.values(v2Pos)
    }
    v1Pos += 1
  }
  new SparseVector(v1.size, indices.toArray, values.toArray)
})
spark.udf.register("multSparseVectors", multSparseVectors)
Can anyone think of a way to do this using a map or similar? My main goal is to avoid making repeated O(N) passes over the second vector to "look up" the position of a value in its indices list so that I can grab the corresponding values entry, because that would take O(K + N*2) time when I know an O(K + N) solution is possible.
I've come up with a solution by boiling this problem down to a more general one:
Finding the indices at which two arrays intersect
Given an answer to that more general question (where the two arrays v1.indices and v2.indices intersect), we can trivially use those positions to extract the new SparseVector indices, and the values from each vector to be multiplied together.
The solution is given below:
%scala
import scala.annotation.tailrec
import org.apache.spark.ml.linalg._
import org.apache.spark.sql.functions.udf
// This fanciness from https://stackoverflow.com/a/71928709/529618 finds the indices at which two lists intersect
@tailrec
def indicesOfIntersection(left: List[Int], right: List[Int], lidx: Int = 0, ridx: Int = 0, result: List[(Int, Int)] = Nil): List[(Int, Int)] = (left, right) match {
  case (Nil, _) | (_, Nil) => result.reverse
  case (l::tail, r::_) if l < r => indicesOfIntersection(tail, right, lidx+1, ridx, result)
  case (l::_, r::tail) if l > r => indicesOfIntersection(left, tail, lidx, ridx+1, result)
  case (l::ltail, r::rtail) => indicesOfIntersection(ltail, rtail, lidx+1, ridx+1, (lidx, ridx) :: result)
}
// Return a new SparseVector whose values are the element-wise product (Hadamard product)
val multSparseVectors = udf((v1: SparseVector, v2: SparseVector) => {
  val intersection = indicesOfIntersection(v1.indices.toList, v2.indices.toList)
  new SparseVector(v1.size,
    intersection.map { case (x1, _) => v1.indices(x1) }.toArray,
    intersection.map { case (x1, x2) => v1.values(x1) * v2.values(x2) }.toArray)
})
spark.udf.register("multSparseVectors", multSparseVectors)
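For reference, a minimal usage sketch (df, v1 and v2 are hypothetical names for a DataFrame with two vector columns):
import org.apache.spark.sql.functions.col
// Apply the UDF to two vector columns of a hypothetical DataFrame df
val withProduct = df.withColumn("hadamard", multSparseVectors(col("v1"), col("v2")))
// Or, since the UDF is registered, from SQL: SELECT multSparseVectors(v1, v2) FROM some_view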

Functional way to build a matrix from a list in Scala

I've asked this question on https://users.scala-lang.org/ but haven't got a concrete answer yet. I am given a vector v and I would like to construct a matrix m based on this vector according to the rules specified below. I would like to write the following code in a purely functional way, i.e. m = v.map(...) or similar. I can do it easily in a procedural way like this:
import scala.util.Random
val v = Vector.fill(50)(Random.nextInt(100))
println(v)
val m = Array.fill[Int](10, 10)(0)
def populateMatrix(x: Int): Unit = m(x/10)(x%10) += 1
v.map(x => populateMatrix(x))
m foreach { row => row foreach print; println }
In words, I am iterating through v, getting a pair of indices (i,j) from each v(k) and updating the matrix m at these positions, i.e., m(i)(j) += 1. But I am looking for a functional way. It is clear to me how to implement this in, e.g., Mathematica:
v=RandomInteger[{99}, 300]
m=SparseArray[{Rule[{Quotient[#, 10] + 1, Mod[#, 10] + 1}, 1]}, {10, 10}] & /@ v // Total // Normal
But how do I do it in Scala, which is a functional language too?
Your populateMatrix approach can be "reversed": map the vector into index tuples, group them, count the size of each group, and turn that into a map (index tuple -> count), which is then used to populate the corresponding positions in the array with Array.tabulate:
val v = Vector.fill(50)(Random.nextInt(100))
val values = v.map(i => (i/10, i%10))
  .groupBy(identity)
  .view
  .mapValues(_.size)
val result = Array.tabulate(10, 10)((i, j) => values.getOrElse((i, j), 0))
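A quick check (a sketch) that this produces the same counts as the mutable version:
// Print the 10x10 matrix of counts row by row
result.foreach(row => println(row.mkString(" ")))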

In Scala, how do I access a specific index in a tuple?

I am implementing a function that takes a random index and returns the element at that index of a tuple.
I know that for a tuple like val a = (1,2,3), I can write a._2 to get 2.
However, when I use a random index, val index = random_index (an integer smaller than the size of the tuple), a._index doesn't work.
You can use productElement; note that it is zero-based and has a return type of Any:
val a=(1,2,3)
a.productElement(1) // returns 2nd element
If you only know random_index at runtime, the best you can do is (as @GuruStron answered):
val a = (1,2,3)
val i = 1
val x = a.productElement(i)
x: Any // 2
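Since the result is typed as Any, getting a typed value back needs a cast or a pattern match; a sketch, assuming all elements are Int:
// Throws a ClassCastException at runtime if the element isn't actually an Int
val n: Int = a.productElement(i).asInstanceOf[Int]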
If you know random_index at compile time, you can do:
import shapeless.syntax.std.tuple._
val a = (1,2,3)
val x = a(1)
x: Int // 2 // not just Any
// a(4) // doesn't compile
val i = 1
// a(i) // doesn't compile
https://github.com/milessabin/shapeless/wiki/Feature-overview:-shapeless-2.0.0#hlist-style-operations-on-standard-scala-tuples
Although this a(1) looks pretty similar to the standard a._1.

Scala array.product function in a 2D array

I have a 2D array:
val arr = Array(Array(2,1), Array(3,1), Array(4,1))
I need to multiply all the inner first elements and sum all the inner second elements, to get as a result:
Array(24, 3)
I'm looking for a way to use map here, something like:
arr.map(a => Array(a(1stElement).product, a(2ndElement).sum)) // pseudocode
Any suggestions?
Regards.
The following works, but note that it is not safe: it throws a MatchError if arr contains any element that does not have exactly 2 elements. You should add the missing cases to the pattern match as appropriate for your use case.
val result = arr.fold(Array(1, 0)) {
  case (Array(x1, x2), Array(y1, y2)) => Array(x1 * y1, x2 + y2)
}
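A slightly safer sketch of the same idea uses foldLeft with a tuple accumulator and simply ignores rows that don't have exactly two elements:
// (product, sum) accumulator; malformed rows are skipped instead of throwing
val safeResult = arr.foldLeft((1, 0)) {
  case ((prod, sum), Array(a, b)) => (prod * a, sum + b)
  case (acc, _) => acc
}
// safeResult: (Int, Int) = (24,3)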
Update
As @Luis suggested, if you change your original Array[Array] to an Array of tuples, another implementation could look like this:
val arr = Array((2, 1), (3, 1), (4, 1))
val (prdArr, sumArr) = arr.unzip
val result = (prdArr.product, sumArr.sum)
println(result) // (24, 3)

How to declare a sparse Vector in Spark with Scala?

I'm trying to create a sparse Vector (the mllib.linalg.Vectors class, not the default one) but I can't understand how to use a Seq. I have a small test file with three numbers per line, which I convert to an RDD, split the text into doubles, and then group the lines by their first column.
Test file
1 2 4
1 3 5
1 4 8
2 7 5
2 8 4
2 9 10
Code
val data = sc.textFile("/home/savvas/DWDM/test.txt")
val data2 = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))
val grouped = data2.groupBy( _(0) )
This results in grouped having these values
(2.0,CompactBuffer([2.0,7.0,5.0], [2.0,8.0,4.0], [2.0,9.0,10.0]))
(1.0,CompactBuffer([1.0,2.0,4.0], [1.0,3.0,5.0], [1.0,4.0,8.0]))
But I can't seem to figure out the next step. I need to take each line of grouped and create a vector for it, so that each line of the new RDD has a vector with the third value of the CompactBuffer in the index specified by the second value. In short, I want my data in this example to look like this:
[0, 0, 0, 0, 0, 0, 5.0, 4.0, 10.0, 0]
[0, 4.0, 5.0, 8.0, 0, 0, 0, 0, 0, 0]
I know I need to use a sparse vector, and that there are three ways to construct it. I've tried using a Seq of tuple2(index, value), but I cannot understand how to create such a Seq.
One possible solution is something like the below. First, let's convert the data to the expected types:
import org.apache.spark.rdd.RDD
val pairs: RDD[(Double, (Int, Double))] = data.map(_.split(" ") match {
  case Array(label, idx, value) => (label.toDouble, (idx.toInt, value.toDouble))
})
Next, find the maximum index (the size of the vectors):
val nCols = pairs.map{case (_, (i, _)) => i}.max + 1
Then group and convert:
import org.apache.spark.mllib.linalg.SparseVector
def makeVector(xs: Iterable[(Int, Double)]) = {
  val (indices, values) = xs.toArray.sortBy(_._1).unzip
  new SparseVector(nCols, indices.toArray, values.toArray)
}
val transformed: RDD[(Double, SparseVector)] = pairs
  .groupByKey
  .mapValues(makeVector)
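A quick sanity check (a sketch, assuming the test file above) that the rows come out in the desired dense form:
// Densify just for display; prints each label with its dense vector
transformed.mapValues(_.toDense).collect().foreach { case (label, vec) => println(s"$label -> $vec") }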
Another way you can handle this, assuming that the first elements can be safely converted to and from an integer, is to use a CoordinateMatrix:
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
val entries: RDD[MatrixEntry] = data.map(_.split(" ") match {
  case Array(label, idx, value) =>
    MatrixEntry(label.toInt, idx.toInt, value.toDouble)
})
val transformed: RDD[(Double, SparseVector)] = new CoordinateMatrix(entries)
  .toIndexedRowMatrix
  .rows
  .map(row => (row.index.toDouble, row.vector.toSparse)) // toSparse so the value matches the SparseVector type annotation