Inserting values in row (Spark - Scala)

I want to create rows (for any given k) such that:
for k = 2, graph will be [Row(1,2), Row(3,4)]
for k = 3, graph will be [Row(1,2,3), Row(4,5,6), Row(7,8,9)]
I am new to Scala and don't know exactly how I can insert values into a Row like this.
import org.apache.spark.sql.Row
import scala.collection.mutable.ArrayBuffer

var graph = ArrayBuffer[Row]()
val k = 3
val k2 = k * k
for (a <- 1 to k2) {
  graph += Row(a) // this only appends single-element rows, not rows of k values
}

val k = 3
val range = ArrayBuffer.range(1, k * k + 1)
val grouped = range.grouped(k).map(group => Row.fromSeq(group)).toBuffer
// grouped: Buffer(Row(1,2,3), Row(4,5,6), Row(7,8,9)) for k = 3
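As a quick check (a minimal sketch, assuming the same imports as above; the data is just the 1..k*k range from the question), the same grouping expression reproduces the k = 2 layout from the question:
import org.apache.spark.sql.Row
import scala.collection.mutable.ArrayBuffer

val k = 2
val graph = ArrayBuffer.range(1, k * k + 1) // 1, 2, 3, 4
  .grouped(k)                               // groups of k consecutive values
  .map(group => Row.fromSeq(group))         // each group becomes one Row
  .toBuffer
// graph: Buffer([1,2], [3,4]), i.e. Row(1,2) and Row(3,4)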

Related

Add a vector to every column of a matrix, using Scala Breeze

I have a matrix M of size (L x N) and I want to add the same vector v of length L to every column of the matrix. Is there a way to do this using Scala Breeze?
I tried:
val H = DenseMatrix.zeros(L, N)
for (j <- 0 to L) {
  H(::, j) = M(::, j) + v
}
but this doesn't really fit Scala's immutability, since H is already defined and this therefore gives a reassignment-to-val error. Any suggestions appreciated!
To add a vector to all columns of a matrix, you don't need to loop over the columns; you can use Breeze's column broadcasting feature. For your example:
H(::, *) + v // assuming v is a Breeze DenseVector
should work.
import breeze.linalg._
val L = 3
val N = 2
val v = DenseVector(1.0,2.0,3.0)
val H = DenseMatrix.zeros[Double](L, N)
val result = H(::,*) + v
//result: breeze.linalg.DenseMatrix[Double] = 1.0 1.0
// 2.0 2.0
// 3.0 3.0
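Applied to the original question (a minimal sketch, with made-up example data for M and v; only the broadcast expression M(::, *) + v comes from the answer above), this builds the result in one immutable step, with no loop:
import breeze.linalg._

val M = DenseMatrix((1.0, 4.0), (2.0, 5.0), (3.0, 6.0)) // L x N with L = 3, N = 2
val v = DenseVector(10.0, 20.0, 30.0)                   // length L

// Add v to every column of M; the result is a new DenseMatrix.
val H = M(::, *) + v
// H: 11.0  14.0
//    22.0  25.0
//    33.0  36.0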

How can I normalize a matrix in Spark?

I need to divide each matrix element (i, j) by the square root of the product of the diagonal elements (i, i) and (j, j).
In other words, for all i and j I need to perform:
mat(i, j) = mat(i, j) / sqrt(mat(i, i) * mat(j, j))
So the matrix:
4 0 12
0 1 1
12 0 9
turns into:
1 0 2
0 1 1
2 0 1
What I have so far is a list of row/column index pairs with a weight that I convert into a CoordinateMatrix (and later RowMatrix). I extract the diagonal by filtering elements where row == column.
What's the best way to implement this elementwise division?
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry, RowMatrix}
import scala.math.sqrt
val pairs = Array((0,0,4.0), (0,2,12.0), (1,1,1.0), (2,0,12.0), (2,2,9.0))
val pairs_rdd = sc.parallelize(pairs)
val diagonal = pairs_rdd.filter(r => r._1 == r._2).map(r => (r._2, sqrt(r._3)))
val matrixEntries = pairs_rdd.map(r => MatrixEntry(r._1, r._2, r._3))
val coordinateMatrix: CoordinateMatrix = new CoordinateMatrix(matrixEntries)
val rowMatrix: RowMatrix = coordinateMatrix.toRowMatrix()
It seems none of the MLlib matrix helper classes can really assist here, so the only way out seems to be manually joining your matrix with the diagonal you've created (once by i, once by j):
import org.apache.spark.rdd.RDD

val diagonal: RDD[(Long, Double)] = pairs_rdd.filter(r => r._1 == r._2).map(r => (r._2.toLong, r._3))
val result = matrixEntries
  .keyBy(_.i).join(diagonal).values    // join by the i coordinate
  .keyBy(_._1.j).join(diagonal).values // join by the j coordinate
  .map { case ((e, di), dj) => MatrixEntry(e.i, e.j, e.value / sqrt(di * dj)) }
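To sanity-check the join (a minimal sketch, assuming the pairs_rdd and matrixEntries values defined above are in scope in a Spark shell), collecting the result reproduces the normalization from the question, e.g. entry (0, 2) becomes 12 / sqrt(4 * 9) = 2.0:
result.collect().sortBy(e => (e.i, e.j)).foreach { e =>
  println(s"(${e.i}, ${e.j}) -> ${e.value}")
}
// (0, 0) -> 1.0
// (0, 2) -> 2.0
// (1, 1) -> 1.0
// (2, 0) -> 2.0
// (2, 2) -> 1.0

// The normalized entries can be wrapped back into a CoordinateMatrix if needed.
val normalized = new CoordinateMatrix(result)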

Conditional slicing in Scala Breeze

I am trying to slice a DenseVector based on an elementwise boolean condition on another DenseVector:
import breeze.linalg.DenseVector
val x = DenseVector(1.0, 2.0, 3.0)
val y = DenseVector(10.0, 20.0, 30.0)
// I want a new DenseVector containing all elements of y where x > 1.5
// i.e. I want DenseVector(20.0, 30.0)
val newy = y(x :> 1.5) // does not give a DenseVector but a SliceVector
With Python/Numpy, I would just write y[x>1.5]
Using Breeze, you can use for comprehensions to filter DenseVectors:
val y = DenseVector(10.0, 20.0, 30.0)
val newY = for {
  v <- y
  if v > 1.5
} yield v
// or to write it in one line
val newY = for (v <- y if v > 1.5) yield v
The SliceVector resulting from y(x:>1.5) is just a view on the original DenseVector. To create a new DenseVector, use
val newy = y(x:>1.5).toDenseVector
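Putting the accepted approach together (a minimal sketch, using the corrected vectors from the question):
import breeze.linalg._

val x = DenseVector(1.0, 2.0, 3.0)
val y = DenseVector(10.0, 20.0, 30.0)

// x :> 1.5 builds an element-wise boolean mask; indexing y with it yields a
// view (SliceVector), and toDenseVector copies the selected elements out.
val newy = y(x :> 1.5).toDenseVector
// newy: DenseVector(20.0, 30.0)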

Is it possible to correctly calculate SVD on IndexedRowMatrix in Spark?

I've got an IndexedRowMatrix [m x n] which contains only X non-zero rows. I'm setting k = 3.
When I try to calculate the SVD on this object with computeU set to true, the dimensions of the U matrix are [m x n], while the correct dimensions are [m x k].
Why does this happen?
I've already tried converting the IndexedRowMatrix to a RowMatrix and then calculating the SVD. The result dimensions are [X x k], so it only calculates the result for the non-zero rows (the matrix drops indices, as described in the documentation).
Is it possible to convert this matrix while keeping the row indices?
val csv = sc.textFile("hdfs://spark/nlp/merged_sparse.csv").cache() // original file
val data = csv.mapPartitions(lines => {
  val parser = new CSVParser(' ')
  lines.map(line => {
    parser.parseLine(line)
  })
}).map(line => {
  MatrixEntry(line(0).toLong - 1, line(1).toLong - 1, line(2).toInt)
})
val coordinateMatrix: CoordinateMatrix = new CoordinateMatrix(data)
val indexedRowMatrix: IndexedRowMatrix = coordinateMatrix.toIndexedRowMatrix()
val rowMatrix: RowMatrix = indexedRowMatrix.toRowMatrix()
val svd: SingularValueDecomposition[RowMatrix, Matrix] = rowMatrix.computeSVD(3, computeU = true, 1e-9)
val U: RowMatrix = svd.U // The U factor is a RowMatrix.
val S: Vector = svd.s // The singular values are stored in a local dense vector.
val V: Matrix = svd.V // The V factor is a local dense matrix.
val indexedSvd: SingularValueDecomposition[IndexedRowMatrix, Matrix] = indexedRowMatrix.computeSVD(3, computeU = true, 1e-9)
val indexedU: IndexedRowMatrix = indexedSvd.U // The U factor is an IndexedRowMatrix.
val indexedS: Vector = indexedSvd.s // The singular values are stored in a local dense vector.
val indexedV: Matrix = indexedSvd.V // The V factor is a local dense matrix.
It looks like this is a bug in Spark MLlib. If you get the size of a row vector in your indexed matrix, it will correctly return 3 columns:
indexedU.rows.first().vector.size
I looked at the source and it looks like they're incorrectly copying the current number of columns from the indexed matrix:
val U = if (computeU) {
  val indexedRows = indices.zip(svd.U.rows).map { case (i, v) =>
    IndexedRow(i, v)
  }
  new IndexedRowMatrix(indexedRows, nRows, nCols) // nCols is incorrect here
} else {
  null
}
Looks like a prime candidate for a bugfix/pull request.
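Until that is fixed upstream, one possible workaround (a minimal sketch, assuming the indexedSvd value from the question; it only re-wraps the existing U rows with the column count stated explicitly, nothing is recomputed) is:
import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix

val k = 3
// The row vectors of U already have length k; only the matrix's recorded
// dimensions are wrong, so rebuild the wrapper with the correct column count.
val fixedU = new IndexedRowMatrix(indexedSvd.U.rows, indexedSvd.U.numRows(), k)
fixedU.numCols() // now reports k = 3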

How to access Vector components when the dimension is not known in advance

For a two-dimensional vector I use this snippet, but if I don't know the number of dimensions of the vector in advance, how can I adapt my code?
import org.apache.spark.mllib.linalg.Vectors

var vect = Vectors.dense(0, 0)
var rdd_vects = sc.parallelize(Array(vect, vect, ...))
var sum = rdd_vects.reduce { case (x, y) => Vectors.dense(x(0) + y(0), x(1) + y(1)) }
Thank you for your advice.
If all the vectors have the same dimension:
val sum = rdd_vects.reduce { (x, y) =>
  Vectors.dense((x.toArray, y.toArray).zipped.map(_ + _))
}
I think I found the answer myself. A Vector can be created with Vectors.dense(Array[Double]), so I run a for loop inside my reducer.
var sum = rdd_vect.reduce { (x, y) =>
  var tab_vect = Array.empty[Double] // accumulator for the summed components
  val x_size = x.size
  for (ind <- 0 until x_size) {
    val component_x = x(ind)
    val component_y = y(ind)
    val component_f = component_x + component_y
    tab_vect = tab_vect :+ component_f
  }
  Vectors.dense(tab_vect)
}
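As a quick usage check (a minimal sketch, assuming a Spark shell where sc is available; it reuses the zipped-map reducer from the first answer with some made-up vectors), the element-wise sum works without hard-coding the dimension:
import org.apache.spark.mllib.linalg.Vectors

val rdd_vects = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0),
  Vectors.dense(7.0, 8.0, 9.0)
))

// Element-wise sum, independent of the number of dimensions.
val sum = rdd_vects.reduce { (x, y) =>
  Vectors.dense((x.toArray, y.toArray).zipped.map(_ + _))
}
// sum: [12.0,15.0,18.0]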