Inserting values in row (Spark - Scala)

I want to create rows (for any given k) such that:
for k = 2, graph will be [Row(1,2), Row(3,4)]
for k = 3, graph will be [Row(1,2,3), Row(4,5,6), Row(7,8,9)]
I am new to Scala and don't know exactly how I can insert values into a Row like this.
import org.apache.spark.sql.Row
import scala.collection.mutable.ArrayBuffer

var graph = ArrayBuffer[Row]()
val k = 3
val k2 = k * k
for (a <- 1 to k2) {
  graph += Row(a) // this only appends single-element rows, not rows of k values
}

val k = 3
val range = ArrayBuffer.range(1, k * k + 1)
val grouped = range.grouped(k).map(group => Row.fromSeq(group)).toBuffer
// grouped: Buffer(Row(1,2,3), Row(4,5,6), Row(7,8,9)) for k = 3
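As a quick check (a minimal sketch, assuming the same imports as above; the data is just the 1..k*k range from the question), the same grouping expression reproduces the k = 2 layout from the question:
import org.apache.spark.sql.Row
import scala.collection.mutable.ArrayBuffer

val k = 2
val graph = ArrayBuffer.range(1, k * k + 1) // 1, 2, 3, 4
  .grouped(k)                               // groups of k consecutive values
  .map(group => Row.fromSeq(group))         // each group becomes one Row
  .toBuffer
// graph: Buffer([1,2], [3,4]), i.e. Row(1,2) and Row(3,4)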

Related

Add a vector to every column of a matrix, using Scala Breeze

I have a matrix M of size (L x N) and I want to add the same vector v of length L to every column of the matrix. Is there a way to do this using Scala Breeze?
I tried:
val H = DenseMatrix.zeros(L, N)
for (j <- 0 to L) {
  H(::, j) = M(::, j) + v
}
but this doesn't really fit Scala's immutability, since H is already defined and this therefore gives a reassignment-to-val error. Any suggestions appreciated!
To add a vector to all columns of a matrix, you don't need to loop over the columns; you can use Breeze's column broadcasting feature. For your example:
H(::, *) + v // assuming v is a Breeze DenseVector
should work.
import breeze.linalg._
val L = 3
val N = 2
val v = DenseVector(1.0,2.0,3.0)
val H = DenseMatrix.zeros[Double](L, N)
val result = H(::,*) + v
//result: breeze.linalg.DenseMatrix[Double] = 1.0 1.0
// 2.0 2.0
// 3.0 3.0
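Applied to the original question (a minimal sketch, with made-up example data for M and v; only the broadcast expression M(::, *) + v comes from the answer above), this builds the result in one immutable step, with no loop:
import breeze.linalg._

val M = DenseMatrix((1.0, 4.0), (2.0, 5.0), (3.0, 6.0)) // L x N with L = 3, N = 2
val v = DenseVector(10.0, 20.0, 30.0)                   // length L

// Add v to every column of M; the result is a new DenseMatrix.
val H = M(::, *) + v
// H: 11.0  14.0
//    22.0  25.0
//    33.0  36.0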

How can I normalize a matrix in Spark?

I need to divide each matrix element (i, j) by the square root of the product of the diagonal elements (i, i) and (j, j).
In other words, for all i and j I need to perform:
mat(i, j) = mat(i, j) / sqrt(mat(i, i) * mat(j, j))
So the matrix:
4 0 12
0 1 1
12 0 9
turns into:
1 0 2
0 1 1
2 0 1
What I have so far is a list of row/column index pairs with a weight that I convert into a CoordinateMatrix (and later RowMatrix). I extract the diagonal by filtering elements where row == column.
What's the best way to implement this elementwise division?
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry, RowMatrix}
import scala.math.sqrt
val pairs = Array((0,0,4.0), (0,2,12.0), (1,1,1.0), (2,0,12.0), (2,2,9.0))
val pairs_rdd = sc.parallelize(pairs)
val diagonal = pairs_rdd.filter(r => r._1 == r._2).map(r => (r._2, sqrt(r._3)))
val matrixEntries = pairs_rdd.map(r => MatrixEntry(r._1, r._2, r._3))
val coordinateMatrix: CoordinateMatrix = new CoordinateMatrix(matrixEntries)
val rowMatrix: RowMatrix = coordinateMatrix.toRowMatrix()
It seems none of the MLlib matrix helper classes can really assist here, so the only way out seems to be manually joining your matrix with the diagonal you've created (once by i, once by j):
import org.apache.spark.rdd.RDD

val diagonal: RDD[(Long, Double)] = pairs_rdd.filter(r => r._1 == r._2).map(r => (r._2.toLong, r._3))
val result = matrixEntries
  .keyBy(_.i).join(diagonal).values    // join by the i coordinate
  .keyBy(_._1.j).join(diagonal).values // join by the j coordinate
  .map { case ((e, di), dj) => MatrixEntry(e.i, e.j, e.value / sqrt(di * dj)) }
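To sanity-check the join (a minimal sketch, assuming the pairs_rdd and matrixEntries values defined above are in scope in a Spark shell), collecting the result reproduces the normalization from the question, e.g. entry (0, 2) becomes 12 / sqrt(4 * 9) = 2.0:
result.collect().sortBy(e => (e.i, e.j)).foreach { e =>
  println(s"(${e.i}, ${e.j}) -> ${e.value}")
}
// (0, 0) -> 1.0
// (0, 2) -> 2.0
// (1, 1) -> 1.0
// (2, 0) -> 2.0
// (2, 2) -> 1.0

// The normalized entries can be wrapped back into a CoordinateMatrix if needed.
val normalized = new CoordinateMatrix(result)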

Conditional slicing in Scala Breeze

I am trying to slice a DenseVector based on an elementwise boolean condition on another DenseVector:
import breeze.linalg.DenseVector
val x = DenseVector(1.0, 2.0, 3.0)
val y = DenseVector(10.0, 20.0, 30.0)
// I want a new DenseVector containing all elements of y where x > 1.5
// i.e. I want DenseVector(20.0, 30.0)
val newy = y(x :> 1.5) // does not give a DenseVector but a SliceVector
With Python/Numpy, I would just write y[x>1.5]
Using Breeze, you can use for comprehensions to filter DenseVectors:
val y = DenseVector(10.0, 20.0, 30.0)
val newY = for {
  v <- y
  if v > 1.5
} yield v
// or to write it in one line
val newY = for (v <- y if v > 1.5) yield v
The SliceVector resulting from y(x:>1.5) is just a view on the original DenseVector. To create a new DenseVector, use
val newy = y(x:>1.5).toDenseVector
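Putting the accepted approach together (a minimal sketch, using the corrected vectors from the question):
import breeze.linalg._

val x = DenseVector(1.0, 2.0, 3.0)
val y = DenseVector(10.0, 20.0, 30.0)

// x :> 1.5 builds an element-wise boolean mask; indexing y with it yields a
// view (SliceVector), and toDenseVector copies the selected elements out.
val newy = y(x :> 1.5).toDenseVector
// newy: DenseVector(20.0, 30.0)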

Is it possible to correctly calculate SVD on IndexedRowMatrix in Spark?

I've got an IndexedRowMatrix [m x n] which contains only X non-zero rows. I'm setting k = 3.
When I try to calculate the SVD on this object with computeU set to true, the dimensions of the U matrix are [m x n], while the correct dimensions are [m x k].
Why does this happen?
I've already tried converting the IndexedRowMatrix to a RowMatrix and then calculating the SVD. The result dimensions are [X x k], so it only calculates the result for the non-zero rows (the matrix drops indices, as described in the documentation).
Is it possible to convert this matrix while keeping the row indices?
val csv = sc.textFile("hdfs://spark/nlp/merged_sparse.csv").cache() // original file
val data = csv.mapPartitions(lines => {
  val parser = new CSVParser(' ')
  lines.map(line => {
    parser.parseLine(line)
  })
}).map(line => {
  MatrixEntry(line(0).toLong - 1, line(1).toLong - 1, line(2).toInt)
})
val coordinateMatrix: CoordinateMatrix = new CoordinateMatrix(data)
val indexedRowMatrix: IndexedRowMatrix = coordinateMatrix.toIndexedRowMatrix()
val rowMatrix: RowMatrix = indexedRowMatrix.toRowMatrix()
val svd: SingularValueDecomposition[RowMatrix, Matrix] = rowMatrix.computeSVD(3, computeU = true, 1e-9)
val U: RowMatrix = svd.U // The U factor is a RowMatrix.
val S: Vector = svd.s // The singular values are stored in a local dense vector.
val V: Matrix = svd.V // The V factor is a local dense matrix.
val indexedSvd: SingularValueDecomposition[IndexedRowMatrix, Matrix] = indexedRowMatrix.computeSVD(3, computeU = true, 1e-9)
val indexedU: IndexedRowMatrix = indexedSvd.U // The U factor is an IndexedRowMatrix.
val indexedS: Vector = indexedSvd.s // The singular values are stored in a local dense vector.
val indexedV: Matrix = indexedSvd.V // The V factor is a local dense matrix.
It looks like this is a bug in Spark MLlib. If you get the size of a row vector in your indexed matrix, it will correctly return 3 columns:
indexedU.rows.first().vector.size
I looked at the source and it looks like they're incorrectly copying the current number of columns from the indexed matrix:
val U = if (computeU) {
  val indexedRows = indices.zip(svd.U.rows).map { case (i, v) =>
    IndexedRow(i, v)
  }
  new IndexedRowMatrix(indexedRows, nRows, nCols) // nCols is incorrect here
} else {
  null
}
Looks like a prime candidate for a bugfix/pull request.
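Until that is fixed upstream, one possible workaround (a minimal sketch, assuming the indexedSvd value from the question; it only re-wraps the existing U rows with the column count stated explicitly, nothing is recomputed) is:
import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix

val k = 3
// The row vectors of U already have length k; only the matrix's recorded
// dimensions are wrong, so rebuild the wrapper with the correct column count.
val fixedU = new IndexedRowMatrix(indexedSvd.U.rows, indexedSvd.U.numRows(), k)
fixedU.numCols() // now reports k = 3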

How to access Vector components when the dimension is not known in advance

For a two-dimensional vector I use this snippet, but if I don't know the number of dimensions of the vector in advance, how can I adapt my code?
import org.apache.spark.mllib.linalg.Vectors

var vect = Vectors.dense(0, 0)
var rdd_vects = sc.parallelize(Array(vect, vect, ...))
var sum = rdd_vects.reduce { case (x, y) => Vectors.dense(x(0) + y(0), x(1) + y(1)) }
Thank you for your advice.
If all the vectors have the same dimension:
val sum = rdd_vects.reduce { (x, y) =>
  Vectors.dense((x.toArray, y.toArray).zipped.map(_ + _))
}
I think I found the answer myself. A Vector can be created with Vectors.dense(Array[Double]), so I run a for loop inside my reducer.
var sum = rdd_vect.reduce { (x, y) =>
  var tab_vect = Array.empty[Double] // accumulator for the summed components
  val x_size = x.size
  for (ind <- 0 until x_size) {
    val component_x = x(ind)
    val component_y = y(ind)
    val component_f = component_x + component_y
    tab_vect = tab_vect :+ component_f
  }
  Vectors.dense(tab_vect)
}
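As a quick usage check (a minimal sketch, assuming a Spark shell where sc is available; it reuses the zipped-map reducer from the first answer with some made-up vectors), the element-wise sum works without hard-coding the dimension:
import org.apache.spark.mllib.linalg.Vectors

val rdd_vects = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0),
  Vectors.dense(7.0, 8.0, 9.0)
))

// Element-wise sum, independent of the number of dimensions.
val sum = rdd_vects.reduce { (x, y) =>
  Vectors.dense((x.toArray, y.toArray).zipped.map(_ + _))
}
// sum: [12.0,15.0,18.0]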