Spark dataframe to sparse vector with null - scala

I am facing a problem when I try to assemble a vector from a DataFrame (some columns contain null values) in Scala. Unfortunately, VectorAssembler cannot handle null values.
What I could do is replace or fill the DataFrame's null values and then create a dense vector, but that is not what I want.
So I thought about converting my DataFrame rows to sparse vectors. But how can I achieve this? I have not found an option on VectorAssembler for producing a sparse vector.
EDIT: Actually I do not need null inside the sparse vector, but the missing entries shouldn't become a value like 0 (or anything else), as would be the case for a dense vector.
Do you have any suggestions?

You could do it manually like this:
import org.apache.spark.SparkException
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession
import scala.collection.mutable.ArrayBuilder

case class Row(a: Double, b: Option[Double], c: Double, d: Vector, e: Double)

val dataset = spark.createDataFrame(
  Seq(new Row(0, None, 3.0, Vectors.dense(4.0, 5.0, 0.5), 7.0),
      new Row(1, Some(2.0), 3.0, Vectors.dense(4.0, 5.0, 0.5), 7.0))
).toDF("id", "hour", "mobile", "userFeatures", "clicked")

val sparseVectorRDD = dataset.rdd.map { row =>
  val indices = ArrayBuilder.make[Int]
  val values = ArrayBuilder.make[Double]
  var cur = 0
  row.toSeq.foreach {
    case v: Double =>        // scalar column: store it at the current position
      indices += cur
      values += v
      cur += 1
    case vec: Vector =>      // vector column: copy its active entries, shifted by cur
      vec.foreachActive { case (i, v) =>
        indices += cur + i
        values += v
      }
      cur += vec.size
    case null =>             // null column: skip the position, no explicit entry is stored
      cur += 1
    case o =>
      throw new SparkException(s"$o of type ${o.getClass.getName} is not supported.")
  }
  Vectors.sparse(cur, indices.result(), values.result())
}
You can then convert the result back to a DataFrame if needed. Since Row fields are not statically typed, you have to handle the matching manually and cast to the appropriate type where necessary.
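If you do want a DataFrame again, here is a minimal hedged sketch (the column name features is arbitrary, and it assumes the same spark session as above):
import spark.implicits._

// Wrap each vector in a Tuple1 so the implicit product encoder (backed by the ml VectorUDT)
// can turn the RDD back into a single-column DataFrame
val sparseVectorDF = sparseVectorRDD.map(Tuple1.apply).toDF("features")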

Related

Avoid ListBuffer while preparing an element-wise multiplication of two SparseVectors

I'm trying to implement an element-wise multiplication of two ml.linalg.SparseVector instances (also called a Hadamard product).
A SparseVector represents a vector, but rather than having space taken up by all the "0" values, they are omitted. The vector is represented as two lists of Indices and Values.
For example: SparseVector(indices: [0, 100, 100000], values: [0.25, 1, 0.8]) concisely represents an array of 100,000 elements, where only 3 values are non-zero.
I now need an element-wise multiplication of two of these, and there seems to be no built-in. Conceptually, it should be simple - any indices they don't have in common are dropped, and for the indices in common, the numbers are multiplied together.
For example: SparseVector(indices: [0, 500, 100000], values: [10, 1, 10]) when multiplied with the above should return: SparseVector(indices: [0, 100000], values: [2.5, 8])
Sadly, I've found no built-in for this. I have an approach for doing it in a single pass, but it isn't very Scala-y: it has to build up the lists in a loop as it discovers which indices are in common, and then grab the corresponding values for each index (which sit at the same position, but in a second array).
import org.apache.spark.ml.linalg._
import org.apache.spark.sql.functions.udf
import scala.collection.mutable.ListBuffer

// Return a new SparseVector whose values are the element-wise product (Hadamard product)
val multSparseVectors = udf((v1: SparseVector, v2: SparseVector) => {
  // val commonIndexes = v1.indices.intersect(v2.indices); // Missing scale factors are assumed to have a value of 0, so only common elements remain
  // TODO: No clear way to map common indices to the values that go with those indices. E.g. no "valueForIndex" method
  // new SparseVector(v1.size, commonIndexes, commonIndexes.map(i => v1.valueForIndex(i) * v2.valueForIndex(i)).toArray);
  val indices = ListBuffer[Int]()  // TODO: Some way to do this without mutable lists?
  val values = ListBuffer[Double]()
  var v1Pos = 0  // Current position in SparseVector v1's indices (we will be making a single pass)
  var v2Pos = 0  // Current position in SparseVector v2's indices (we will be making a single pass)
  while (v1Pos < v1.indices.length && v2Pos < v2.indices.length) {
    // Advance our position in SparseVector 2 until we've matched or passed the current SparseVector 1 index
    while (v2Pos < v2.indices.length && v2.indices(v2Pos) < v1.indices(v1Pos))
      v2Pos += 1
    if (v2Pos < v2.indices.length && v1.indices(v1Pos) == v2.indices(v2Pos)) {
      indices += v1.indices(v1Pos)
      values += v1.values(v1Pos) * v2.values(v2Pos)
    }
    v1Pos += 1
  }
  new SparseVector(v1.size, indices.toArray, values.toArray)
})
spark.udf.register("multSparseVectors", multSparseVectors)
spark.udf.register("multSparseVectors", multSparseVectors)
Can anyone think of a way to do this using a map or similar? My main goal is to avoid making multiple O(N) passes over the second vector to "look up" the position of a value in the indices list just to grab the corresponding values entry, because that would take O(K + N²) time when I know an O(K + N) solution is possible.
I've come up with a solution by boiling this problem down to a more general one:
Finding the indices at which two arrays intersect
Given an answer to the above question (where the two arrays v1.indices and v2.indices intersect), we can trivially use those positions to extract the new SparseVector's indices, and the values from each vector to multiply together.
The solution is given below:
%scala
import scala.annotation.tailrec
import org.apache.spark.ml.linalg._
import org.apache.spark.sql.functions.udf

// This fanciness from https://stackoverflow.com/a/71928709/529618 finds the indices at which two lists intersect
@tailrec
def indicesOfIntersection(left: List[Int], right: List[Int], lidx: Int = 0, ridx: Int = 0, result: List[(Int, Int)] = Nil): List[(Int, Int)] = (left, right) match {
  case (Nil, _) | (_, Nil) => result.reverse
  case (l::tail, r::_) if l < r => indicesOfIntersection(tail, right, lidx+1, ridx, result)
  case (l::_, r::tail) if l > r => indicesOfIntersection(left, tail, lidx, ridx+1, result)
  case (l::ltail, r::rtail) => indicesOfIntersection(ltail, rtail, lidx+1, ridx+1, (lidx, ridx) :: result)
}

// Return a new SparseVector whose values are the element-wise product (Hadamard product)
val multSparseVectors = udf((v1: SparseVector, v2: SparseVector) => {
  val intersection = indicesOfIntersection(v1.indices.toList, v2.indices.toList)
  new SparseVector(v1.size,
    intersection.map { case (x1, _) => v1.indices(x1) }.toArray,
    intersection.map { case (x1, x2) => v1.values(x1) * v2.values(x2) }.toArray)
})
spark.udf.register("multSparseVectors", multSparseVectors)
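For completeness, here is a hedged usage sketch of the UDF on a DataFrame (the column names v1 and v2 and the sample vectors are made up for illustration):
import spark.implicits._

val df = Seq(
  (Vectors.sparse(5, Array(0, 3), Array(0.25, 0.8)),
   Vectors.sparse(5, Array(0, 2, 3), Array(10.0, 1.0, 10.0)))
).toDF("v1", "v2")

// Applies the Hadamard-product UDF column-wise; only the indices common to both vectors (0 and 3) survive
df.withColumn("product", multSparseVectors($"v1", $"v2")).show(false)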

dot product of dense vectors with nulls

I am using Spark with Scala and trying to do the following.
I have two dense vectors (created using Vectors.dense), and I need to find the dot product of these. How could I accomplish this?
Also, I am creating the vectors based on an input file which is comma separated. However, some values are missing. Is there an easy way to read these values as zero instead of null when I am creating the vectors?
For example:
input file: 3,1,,,2
created vector: 3,1,0,0,2
Spark vectors are just wrappers around arrays; internally they are converted to Breeze vectors for vector/matrix operations. You can do just that manually to get the dot product:
import org.apache.spark.mllib.linalg.{Vector, Vectors, DenseVector}
import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV}
val dv1: Vector = Vectors.dense(1.0, 0.0, 3.0)
val bdv1 = new BDV(dv1.toArray)
val dv2: Vector = Vectors.dense(2.0, 0.0, 0.0)
val bdv2 = new BDV(dv2.toArray)
scala> bdv1 dot bdv2
res3: Double = 2.0
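If you prefer not to drop down to Breeze, a hedged alternative with plain Scala collections gives the same number (it assumes both vectors have the same length; dot is just an illustrative name):
// Sum of pairwise products over the underlying arrays: 1.0*2.0 + 0.0*0.0 + 3.0*0.0 = 2.0
val dot = dv1.toArray.zip(dv2.toArray).map { case (a, b) => a * b }.sum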
For your second question, you can do something like this:
val v: String = "3,1,,,2"
scala> v.split("\\,").map(r => if (r == "") 0 else r.toInt)
res4: Array[Int] = Array(3, 1, 0, 0, 2)
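Combining the two parts, a hedged sketch that parses such a line directly into a dense vector, treating empty fields as 0.0 (the vec name is just for illustration; the -1 split limit keeps any trailing empty fields):
// Empty fields become 0.0 so the vector lines up with the original positions
val vec = Vectors.dense(v.split(",", -1).map(s => if (s.trim.isEmpty) 0.0 else s.toDouble))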

Create a SparseVector from the elements of RDD

Using Spark in Scala, I have an RDD of type RDD[((x: Int, y: Int), cov: Double)], where each element of the RDD represents an element of a matrix, with x representing the row, y representing the column, and cov representing the value of the element.
I need to create SparseVectors from the rows of this matrix. So I decided to first convert the rdd to RDD[(x: Int, (y: Int, cov: Double))] and then use groupByKey to put all elements of a specific row together like this:
val rdd2 = rdd.map{case ((x,y),cov) => (x, (y, cov))}.groupByKey()
Now I need to create the SparseVectors:
val N = 7 //Vector Size
val spvec = {(x: Int,y: Iterable[(Int, Double)]) => new SparseVector(N.toLong, Array(y.map(el => el._1.toInt)), Array(y.map(el => el._2.toDouble)))}
val vecs = rdd2.map(spvec)
However, this is the error that pops up.
type mismatch; found :Iterable[Int] required:Int
type mismatch; found :Iterable[Double] required:Double
I am guessing that y.map(el => el._1.toInt) returns an Iterable, so wrapping it in Array(...) does not produce the Array[Int] that is required. I would appreciate it if someone could help with how to do this.
The simplest solution is to convert to RowMatrix:
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

val rdd: RDD[((Int, Int), Double)] = ???

val vs: RDD[org.apache.spark.mllib.linalg.SparseVector] = new CoordinateMatrix(
  rdd.map { case ((x, y), cov) => MatrixEntry(x, y, cov) }
).toRowMatrix.rows.map(_.toSparse)
If you want to preserve row indices you can use toIndexedRowMatrix instead:
import org.apache.spark.mllib.linalg.distributed.IndexedRow

new CoordinateMatrix(
  rdd.map { case ((x, y), cov) => MatrixEntry(x, y, cov) }
).toIndexedRowMatrix.rows.map { case IndexedRow(i, vs) => (i, vs.toSparse) }
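If you would rather keep your original groupByKey approach, a hedged sketch of the fix is below: the indices and values have to be plain sorted arrays rather than Array(iterable), and SparseVector's size argument is an Int (N is the vector size from the question, vecsDirect is just an illustrative name):
import org.apache.spark.mllib.linalg.SparseVector

val N = 7  // vector size, as in the question
val vecsDirect = rdd
  .map { case ((x, y), cov) => (x, (y, cov)) }
  .groupByKey()
  .mapValues { ys =>
    // Sort by column index, then split into parallel index/value arrays
    val (indices, values) = ys.toArray.sortBy(_._1).unzip
    new SparseVector(N, indices.toArray, values.toArray)
  }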

Converting a [(Int, Seq[Double])] RDD to LabeledPoint

I have an RDD of the following format and would like to convert it into a LabeledPoint RDD in order to process it in MLlib:
Test: RDD[(Int, Seq[Double])] = Array((1, List(1.0, 3.0, 8.0)), (2, List(3.0, 3.0, 8.0)), (1, List(2.0, 3.0, 7.0)), (1, List(5.0, 5.0, 9.0)))
I tried with map
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
Test.map(x=> LabeledPoint(x._1, Vectors.sparse(x._2)))
but I get this error
mllib.linalg.Vector cannot be applied to (Seq[scala.Double])
So presumably the Seq element needs to be converted first but I don't know into what.
There are a few problems here:
label should be Double not Int
SparseVector requires number of elements, indices and values
none of the vector constructors accepts list of Double
your data looks dense not sparse
One possible solution:
val rdd = sc.parallelize(Array(
  (1, List(1.0, 3.0, 8.0)),
  (2, List(3.0, 3.0, 8.0)),
  (1, List(2.0, 3.0, 7.0)),
  (1, List(5.0, 5.0, 9.0))))

rdd.map { case (k, vs) =>
  LabeledPoint(k.toDouble, Vectors.dense(vs.toArray))
}
and another:
rdd.collect { case (k, v :: vs) =>
  LabeledPoint(k.toDouble, Vectors.dense(v, vs: _*))
}
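If the features really were sparse, the Vectors.sparse overload the question was reaching for takes the vector size plus a Seq of (index, value) pairs; a minimal hedged sketch:
rdd.map { case (k, vs) =>
  // Keep only the non-zero entries as (index, value) pairs
  val pairs = vs.zipWithIndex.collect { case (v, i) if v != 0.0 => (i, v) }
  LabeledPoint(k.toDouble, Vectors.sparse(vs.length, pairs))
}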
As you can see in LabeledPoint's documentation, its constructor receives a Double as the label and a Vector as the features (DenseVector or SparseVector). However, if you take a look at the constructors of both Vector implementations, they receive an Array, so you need to convert your Seq to an Array.
import org.apache.spark.mllib.linalg.{Vector, Vectors, DenseVector}
import org.apache.spark.mllib.regression.LabeledPoint
val rdd = sc.parallelize(Array(
  (1, Seq(1.0, 3.0, 8.0)),
  (2, Seq(3.0, 3.0, 8.0)),
  (1, Seq(2.0, 3.0, 7.0)),
  (1, Seq(5.0, 5.0, 9.0))))

val x = rdd.map {
  case (a: Int, b: Seq[Double]) => LabeledPoint(a, new DenseVector(b.toArray))
}

x.take(2).foreach(println)
// (1.0,[1.0,3.0,8.0])
// (2.0,[3.0,3.0,8.0])

How to declare a sparse Vector in Spark with Scala?

I'm trying to create a sparse Vector (the mllib.linalg.Vectors class, not the default one) but I can't understand how to use Seq. I have a small test file with three numbers per line, which I convert to an RDD, split the text into doubles, and then group the lines by their first column.
Test file
1 2 4
1 3 5
1 4 8
2 7 5
2 8 4
2 9 10
Code
val data = sc.textFile("/home/savvas/DWDM/test.txt")
val data2 = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))
val grouped = data2.groupBy( _(0) )
This results in grouped having these values
(2.0,CompactBuffer([2.0,7.0,5.0], [2.0,8.0,4.0], [2.0,9.0,10.0]))
(1.0,CompactBuffer([1.0,2.0,4.0], [1.0,3.0,5.0], [1.0,4.0,8.0]))
But I can't seem to figure out the next step. I need to take each line of grouped and create a vector for it, so that each row of the new RDD holds a vector whose entry at the index given by the second value of each CompactBuffer element is that element's third value. In short, I want my data from the example to look like this:
[0, 0, 0, 0, 0, 0, 5.0, 4.0, 10.0, 0]
[0, 4.0, 5.0, 8.0, 0, 0, 0, 0, 0, 0]
I know I need to use a sparse vector, and that there are three ways to construct it. I've tried using a Seq of tuple2(index, value), but I cannot understand how to create such a Seq.
One possible solution is something like the one below. First, let's convert the data to the expected types:
import org.apache.spark.rdd.RDD

val pairs: RDD[(Double, (Int, Double))] = data.map(_.split(" ") match {
  case Array(label, idx, value) => (label.toDouble, (idx.toInt, value.toDouble))
})
Next, find the maximum index (the size of the vectors):
val nCols = pairs.map{case (_, (i, _)) => i}.max + 1
group and convert:
import org.apache.spark.mllib.linalg.SparseVector

def makeVector(xs: Iterable[(Int, Double)]) = {
  val (indices, values) = xs.toArray.sortBy(_._1).unzip
  new SparseVector(nCols, indices.toArray, values.toArray)
}

val transformed: RDD[(Double, SparseVector)] = pairs
  .groupByKey
  .mapValues(makeVector)
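As an aside, the Seq-of-(index, value) form the question mentions also exists as a factory method; here is a hedged variant of makeVector using it (makeVector2 is just an illustrative name, nCols as above):
import org.apache.spark.mllib.linalg.Vectors

// Vectors.sparse also accepts the vector size plus a Seq of (index, value) pairs
def makeVector2(xs: Iterable[(Int, Double)]) =
  Vectors.sparse(nCols, xs.toSeq)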
Another way you can handle this, assuming that the first elements can be safely converted to and from integer, is to use CoordinateMatrix:
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

val entries: RDD[MatrixEntry] = data.map(_.split(" ") match {
  case Array(label, idx, value) =>
    MatrixEntry(label.toInt, idx.toInt, value.toDouble)
})

val transformed: RDD[(Double, SparseVector)] = new CoordinateMatrix(entries)
  .toIndexedRowMatrix
  .rows
  .map(row => (row.index.toDouble, row.vector.toSparse))  // toSparse so the element type matches SparseVector
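A quick hedged sanity check on either transformed RDD (toDense just makes the printed vectors easier to compare with the expected rows above):
transformed.sortByKey().collect().foreach { case (label, vec) =>
  println(s"$label -> ${vec.toDense}")
}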