Large matrix operations: Multiplication in Scala/Apache Spark - scala

I need to multiply two large matrices, X and Y. Typically X has ~500K rows and ~18K columns and Y has ~18K rows and ~18K columns. The matrix X is expected to be sparse and the matrix Y is expected to be sparse/dense. What is the ideal way of performing this multiplication in Scala/Apache Spark?

I got some code for you. It represents a matrix as an array of column vectors (which means each entry in the array is a column, not a row). It takes about 0.7s to multiply two 1000*1000 matrices. 11 minutes for two 10,000 * 10,000 matrices. 1.5 hours for 20,000 * 20,000 and 30 hours for (500k*18k) times (18k*18k). But if you run it in parallel (by using the code that's commented out) it should run about 2 to 3 times faster (on a 4 core cpu). But remember that the number of columns in the first matrix always has to be the same as the number of rows in the second.
class Matrix(val columnVectors: Array[Array[Double]]) {
val columns = columnVectors.size
val rows = columnVectors.head.size
def *(v: Array[Double]): Array[Double] = {
val newValues = Array.ofDim[Double](rows)
var col = 0
while(col < columns) {
val n = v(col)
val column = columnVectors(col)
var row = 0
while(row < newValues.size) {
newValues(row) += column(row) * n
row += 1
}
col += 1
}
newValues
}
def *(other: Matrix): Matrix = {
//do the calculation on only one cpu
new Matrix(other.columnVectors.map(col => this * col))
//do the calculation in parallel on all available cpus
//new Matrix(other.columnVectors.par.map(col => this * col).toArray)
}
override def toString = {
columnVectors.transpose.map(_.mkString(", ")).mkString("\n")
}
}
edit:
ok, here is a better version. I now store the row vectors in the matrix instead of the column vectors. That makes it easier to optimize the multiplication for the case where the first matrix is sparse.
Also I added a lazy version of the matrix multiplication using iterators. Since the first matrix is 500k * 18k = 9 billion numbers, such a lazy version will allow you to do that multiplication without requiring much ram. You just have to create an Iterator that can read the rows lazily e.g. from a data bank and then write the rows from the resulting iterator back.
import scala.collection.Iterator
import scala.util.{Random => rand}
def time[T](descr: String)(f: => T): T = {
val start = System.nanoTime
val r = f
val end = System.nanoTime
val time = (end - start)/1e6
println(descr + ": time = " + time + "ms")
r
}
object Matrix {
def mulLazy(m1: Iterator[Array[Double]], m2: Matrix): Iterator[Array[Double]] = {
m1.grouped(8).map { group =>
group.par.map(m2.mulRow).toIterator
}.flatten
}
}
class Matrix(val rowVectors: Array[Array[Double]]) {
val columns = rowVectors.head.size
val rows = rowVectors.size
private def mulRow(otherRow: Array[Double]): Array[Double] = {
val rowVectors = this.rowVectors
val result = Array.ofDim[Double](columns)
var i = 0
while(i < otherRow.size) {
val value = otherRow(i)
if(value != 0) { //optimization for sparse matrix
val row = rowVectors(i)
var col = 0
while(col < result.size) {
result(col) += value * row(col)
col += 1
}
}
i += 1
}
result
}
def *(other: Matrix): Matrix = {
new Matrix(rowVectors.par.map(other.mulRow).toArray)
}
def equals(other: Matrix): Boolean = {
java.util.Arrays.deepEquals(this.rowVectors.asInstanceOf[Array[Object]], other.rowVectors.asInstanceOf[Array[Object]])
}
override def equals(other: Any): Boolean = {
if(other.isInstanceOf[Matrix]) equals(other.asInstanceOf[Matrix]) else false
}
override def toString = {
rowVectors.map(_.mkString(", ")).mkString("\n")
}
}
def randMatrix(rows: Int, columns: Int): Matrix = {
new Matrix((1 to rows).map(_ => Array.fill(columns)(rand.nextDouble * 100)).toArray)
}
def sparseRandMatrix(rows: Int, columns: Int, ratio: Double): Matrix = {
new Matrix((1 to rows).map(_ => Array.fill(columns)(if(rand.nextDouble > ratio) 0 else rand.nextDouble * 100)).toArray)
}
val N = 2000
val m1 = sparseRandMatrix(N, N, 0.1) // only 10% of the numbers will be different from 0
val m2 = randMatrix(N, N)
val m3 = m1.rowVectors.toIterator
val m12 = time("m1 * m2")(m1 * m2)
val m32 = time("m3 * m2")(Matrix.mulLazy(m3, m2)) //doesn't take much time because the matrix multiplication is lazy
println(m32)
println("m12 == m32 = " + (new Matrix(m32.toArray) == m12))

Related

Scala code slows down as array size increases

I have written a Scala Spark application that is implementation of iterative algorithm. Element wise operation have to be performed in each iteration on arrays. Initially size of all collection was 1000. For 1000 size arrays, code works fine and faster. Now size of arrays is 1 million and this code (part of application) is taking too long time, slowing down whole application. This code is used inside mapPartitionWithIndex of Spark RDD. Here is code.
def ESt(arr: Array[Double], arr1: Array[Double], frq: Double): Array[Double] = {
val newArr = new Array[Double](arr.length)
var i = 0
while (i < arr.length) {
newArr(i) = (arr(i) - arr1(i)) * frq
i += 1
}
newArr
}
def ES(arr: Array[Double], arr1: Array[Double], arr2: Array[Double]): Array[Double] = {
val newArr =new Array[Double](arr.length)
var i = 0
while (i < arr.length) {
newArr(i) = arr(i) + arr1(i) + arr2(i)
i += 1
}
newArr
}
Here is usage of above two functions to create an Array.
val dim = 1000000 // one million
val number : Double = math.random
val xr1 : Array[Double] = Array of size dim
val xr2 : Array[Double] = Array of size dim
val xbest : Array[Double] = Array of size dim
val randomArray : Array[Double] = Array.fill(dim)(math.random)
var result : Array[Double] = ES(randomArray, ESt(xbest, randomArray, number), ESt(xr1, xr2, number))
I have lot of cores in cluster where I executes this code. How can I speed-up this computation using Scala functional or parallel programming power or using any other technique?

Scala Mismatch MonteCarlo

I try to implement a version of the Monte Carlo algorithm in Scala but i have a little problem.
In my first loop, i have a mismatch with Unit and Int, but I didn't know how to slove this.
Thank for your help !
import scala.math._
import scala.util.Random
import scala.collection.mutable.ListBuffer
object Main extends App{
def MonteCarlo(list: ListBuffer[Int]): List[Int] = {
for (i <- list) {
var c = 0.00
val X = new Random
val Y = new Random
for (j <- 0 until i) {
val x = X.nextDouble // in [0,1]
val y = Y.nextDouble // in [0,1]
if (x * x + y * y < 1) {
c = c + 1
}
}
c = c * 4
var p = c / i
var error = abs(Pi-p)
print("Approximative value of pi : $p \tError: $error")
}
}
var liste = ListBuffer (200, 2000, 4000)
MonteCarlo(liste)
}
A guy working usually with Python.
for loop does not return anything, so that's why your method returns Unit but expects List[Int] as return type is List[Int].
Second, you have not used scala interpolation correctly. It won't print the value of error. You forgot to use 's' before the string.
Third thing, if want to return list, you first need a list where you will accumulate all values of every iteration.
So i am assuming that you are trying to return error for all iterations. So i have created an errorList, which will store all values of error. If you want to return something else you can modify your code accordingly.
def MonteCarlo(list: ListBuffer[Int]) = {
val errorList = new ListBuffer[Double]()
for (i <- list) {
var c = 0.00
val X = new Random
val Y = new Random
for (j <- 0 until i) {
val x = X.nextDouble // in [0,1]
val y = Y.nextDouble // in [0,1]
if (x * x + y * y < 1) {
c = c + 1
}
}
c = c * 4
var p = c / i
var error = abs(Pi-p)
errorList += error
println(s"Approximative value of pi : $p \tError: $error")
}
errorList
}
scala> MonteCarlo(liste)
Approximative value of pi : 3.26 Error: 0.11840734641020667
Approximative value of pi : 3.12 Error: 0.02159265358979301
Approximative value of pi : 3.142 Error: 4.073464102067881E-4
res9: scala.collection.mutable.ListBuffer[Double] = ListBuffer(0.11840734641020667, 0.02159265358979301, 4.073464102067881E-4)

Spark Jaccard similarity computation by min hashing slow compared to trivial approach

Given 2 huge list of values, I am trying to compute jaccard similarity between them in Spark using Scala.
Assume colHashed1 contains the first list of values and colHashed2 contains the second list.
Approach 1(trivial approach):
val jSimilarity = colHashed1.intersection(colHashed2).distinct.count/(colHashed1.union(colHashed2).distinct.count.toDouble)
Approach 2(using minHashing):
I have used the approach explained here.
import java.util.zip.CRC32
def getCRC32 (s : String) : Int =
{
val crc=new CRC32
crc.update(s.getBytes)
return crc.getValue.toInt & 0xffffffff
}
val maxShingleID = Math.pow(2,32)-1
def pickRandomCoeffs(kIn : Int) : Array[Int] =
{
var k = kIn
val randList = Array.fill(k){0}
while(k > 0)
{
// Get a random shingle ID.
var randIndex = (Math.random()*maxShingleID).toInt
// Ensure that each random number is unique.
while(randList.contains(randIndex))
{
randIndex = (Math.random()*maxShingleID).toInt
}
// Add the random number to the list.
k = k - 1
randList(k) = randIndex
}
return randList
}
val colHashed1 = list1Values.map(a => getCRC32(a))
val colHashed2 = list2Values.map(a => getCRC32(a))
val nextPrime = 4294967311L
val numHashes = 10
val coeffA = pickRandomCoeffs(numHashes)
val coeffB = pickRandomCoeffs(numHashes)
var signature1 = Array.fill(numHashes){0}
for (i <- 0 to numHashes-1)
{
// Evaluate the hash function.
val hashCodeRDD = colHashed1.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime))
// Track the lowest hash code seen.
signature1(i) = hashCodeRDD.min.toInt
}
var signature2 = Array.fill(numHashes){0}
for (i <- 0 to numHashes-1)
{
// Evaluate the hash function.
val hashCodeRDD = colHashed2.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime))
// Track the lowest hash code seen.
signature2(i) = hashCodeRDD.min.toInt
}
var count = 0
// Count the number of positions in the minhash signature which are equal.
for(k <- 0 to numHashes-1)
{
if(signature1(k) == signature2(k))
count = count + 1
}
val jSimilarity = count/numHashes.toDouble
Approach 1 seems to outperform Approach 2 in terms of time always. When I analyzed the code, min() function call on the RDD in Approach 2 takes significant time and that function is called many times depending upon how many hash functions are used.
The intersection and union operations used in Approach 1 seems to work faster compared to the repeated min() function calls.
I don't understand why minHashing does not help here. I expected minHashing to work faster compared to trivial approach. Is there anything I am doing wrong here?
Sample data can be viewed here
JaccardSimilarity with MinHash is not giving consistent results:
import java.util.zip.CRC32
object Jaccard {
def getCRC32(s: String): Int = {
val crc = new CRC32
crc.update(s.getBytes)
return crc.getValue.toInt & 0xffffffff
}
def pickRandomCoeffs(kIn: Int, maxShingleID: Double): Array[Int] = {
var k = kIn
val randList = Array.ofDim[Int](k)
while (k > 0) {
// Get a random shingle ID.
var randIndex = (Math.random() * maxShingleID).toInt
// Ensure that each random number is unique.
while (randList.contains(randIndex)) {
randIndex = (Math.random() * maxShingleID).toInt
}
// Add the random number to the list.
k = k - 1
randList(k) = randIndex
}
return randList
}
def approach2(list1Values: List[String], list2Values: List[String]) = {
val maxShingleID = Math.pow(2, 32) - 1
val colHashed1 = list1Values.map(a => getCRC32(a))
val colHashed2 = list2Values.map(a => getCRC32(a))
val nextPrime = 4294967311L
val numHashes = 10
val coeffA = pickRandomCoeffs(numHashes, maxShingleID)
val coeffB = pickRandomCoeffs(numHashes, maxShingleID)
val signature1 = for (i <- 0 until numHashes) yield {
val hashCodeRDD = colHashed1.map(ele => (coeffA(i) * ele + coeffB(i)) % nextPrime)
hashCodeRDD.min.toInt // Track the lowest hash code seen.
}
val signature2 = for (i <- 0 until numHashes) yield {
val hashCodeRDD = colHashed2.map(ele => (coeffA(i) * ele + coeffB(i)) % nextPrime)
hashCodeRDD.min.toInt // Track the lowest hash code seen
}
val count = (0 until numHashes)
.map(k => if (signature1(k) == signature2(k)) 1 else 0)
.fold(0)(_ + _)
val jSimilarity = count / numHashes.toDouble
jSimilarity
}
// def approach1(list1Values: List[String], list2Values: List[String]) = {
// val colHashed1 = list1Values.toSet
// val colHashed2 = list2Values.toSet
//
// val jSimilarity = colHashed1.intersection(colHashed2).distinct.count / (colHashed1.union(colHashed2).distinct.count.toDouble)
// jSimilarity
// }
def approach1(list1Values: List[String], list2Values: List[String]) = {
val colHashed1 = list1Values.toSet
val colHashed2 = list2Values.toSet
val jSimilarity = (colHashed1 & colHashed2).size / (colHashed1 ++ colHashed2).size.toDouble
jSimilarity
}
def main(args: Array[String]) {
val list1Values = List("a", "b", "c")
val list2Values = List("a", "b", "d")
for (i <- 0 until 5) {
println(s"Iteration ${i}")
println(s" - Approach 1: ${approach1(list1Values, list2Values)}")
println(s" - Approach 2: ${approach2(list1Values, list2Values)}")
}
}
}
OUTPUT:
Iteration 0
- Approach 1: 0.5
- Approach 2: 0.5
Iteration 1
- Approach 1: 0.5
- Approach 2: 0.5
Iteration 2
- Approach 1: 0.5
- Approach 2: 0.8
Iteration 3
- Approach 1: 0.5
- Approach 2: 0.8
Iteration 4
- Approach 1: 0.5
- Approach 2: 0.4
Why are you using it?
It seems to me that the overhead cost for minHashing approach just outweighs its functionality in Spark. Especially as numHashes increases.
Here are some observations I've found in your code:
First, while (randList.contains(randIndex)) this part will surely slow down your process as numHashes (which is by the way equal to the size of randList) increases.
Second, You can save some time if you rewrite this code:
var signature1 = Array.fill(numHashes){0}
for (i <- 0 to numHashes-1)
{
// Evaluate the hash function.
val hashCodeRDD = colHashed1.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime))
// Track the lowest hash code seen.
signature1(i) = hashCodeRDD.min.toInt
}
var signature2 = Array.fill(numHashes){0}
for (i <- 0 to numHashes-1)
{
// Evaluate the hash function.
val hashCodeRDD = colHashed2.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime))
// Track the lowest hash code seen.
signature2(i) = hashCodeRDD.min.toInt
}
var count = 0
// Count the number of positions in the minhash signature which are equal.
for(k <- 0 to numHashes-1)
{
if(signature1(k) == signature2(k))
count = count + 1
}
into
var count = 0
for (i <- 0 to numHashes - 1)
{
val hashCodeRDD1 = colHashed1.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime))
val hashCodeRDD2 = colHashed2.map(ele => ((coeffA(i) * ele + coeffB(i)) % nextPrime))
val sig1 = hashCodeRDD1.min.toInt
val sig2 = hashCodeRDD2.min.toInt
if (sig1 == sig2) { count = count + 1 }
}
This method simplifies the three loops into one. However, I am not sure if that would give a huge boost in computational time.
One other suggestion I have, assuming that the first approach still turns out to be much faster is to use the property of sets to modify the first approach:
val colHashed1_dist = colHashed1.distinct
val colHashed2_dist = colHashed2.distinct
val intersect_cnt = colHashed1_dist.intersection(colHashed2_dist).distinct.count
val jSimilarity = intersect_cnt/(colHashed1_dist.count + colHashed2_dist.count - intersect_cnt).toDouble
with that, instead of getting the union, you can just reuse the value of the intersection.
Actually, in LSH apporach you would calculate minHash only once for each of your documents and then compare two minHases for each possible pair of documents. And in case of trivial approach you would perform full comparison of documents for each possible pair of documents. Which is roughly N^2/2 number of comparisons. Hence extra cost of calculating minHashes is negligible for large enough number of documents.
You should actually compare the performance of the trivial approach:
val jSimilarity = colHashed1.intersection(colHashed2).distinct.count/(colHashed1.union(colHashed2).distinct.count.toDouble)
and performance of the Jaccard distance calculation (last lines in your code):
var count = 0
// Count the number of positions in the minhash signature which are equal.
for(k <- 0 to numHashes-1)
{
if(signature1(k) == signature2(k))
count = count + 1
}
val jSimilarity = count/numHashes.toDouble

How to convert RowMatrix to BDM (Breeze Dense Matrix) and more questions

trying to convert a RowMatrix into BDM (Breeze Dense Matrix), not sure how to proceed
need to implement
def getDenseMatrix(A: RowMatrix): BDM[Double] = {
//write code here
}
additional questions:
how to convert a RowMatrix into a Matrix?
How to access a particular Row in the RowMatrix?
for(i <- 0 to (RowM.numCols().toInt-1)){
//How to access RowM.rows(i)
}
How to access a particular column in the RowMatrix?
for(i <- 0 to (RowM.numCols().toInt-1)){
//How to access RowM.rows.map(f=>f(i))
}
How to multiply 2 RowMatrices
note: RowMatrix has a API 'multiply' but it need the argument of type Matrix
say A and B are RowMatices, then AB = A.multiply(B), this will not work, as B
is a RowMatrix and not Matrix
And lastly how to convert a BDM to a RowMatrix?
Read the source code in rowMatrix, you want the source code, it is in private method. The code is below:
def toBreeze(mat:RowMatrix):BDM[Double] = {
val m = mat.numRows()
val n = mat.numCols()
val result = BDM.zeros[Double](m.toInt,n.toInt)
var i = 0
mat.rows.collect().foreach{Vector =>
Vector.foreachActive { case(index,value) =>
result(i,index) = value
}
i+=1
}
result
}
If you don't have foreachActive available use this:
def toBreeze(X: RowMatrix): BDM[Double]{
val m = X.numRows().toInt
val n = X.numCols().toInt
val mat = BDM.zeros[Double](m, n)
var i = 0
X.rows.collect().map{
case sp: SparseVector => (sp.indices, sp.values)
case dp: DenseVector => (Range(0,n).toArray, dp.values)
}.foreach {
case (indices, values) => indices.zip(values).foreach { case (j, v) =>mat(i, j) = v }
i += 1
}
mat
}
This is parallel up to a certain point and do the same.

Spark Logistic regression and metrics

I want to run logistic regression 100 times with random splitting into test and training. I want to then save the performance metrics for individual runs and then later use them for gaining insight into the performance.
for (index <- 1 to 100) {
val splits = training_data.randomSplit(Array(0.90, 0.10), seed = index)
val training = splits(0).cache()
val test = splits(1)
logrmodel = train_LogisticRegression_model(training)
performLogisticRegressionRuns(logrmodel, test, index)
}
spark.stop()
}
def performLogisticRegressionRuns(model: LogisticRegressionModel, test: RDD[LabeledPoint], iterationcount: Int) {
private val sb = StringBuilder.newBuilder
// Compute raw scores on the test set. Once I cle
model.clearThreshold()
val predictionAndLabels = test.map { case LabeledPoint(label, features) =>
val prediction = model.predict(features)
(prediction, label)
}
val bcmetrics = new BinaryClassificationMetrics(predictionAndLabels)
// I am showing two sample metrics, but I am collecting more including recall, area under roc, f1 score etc....
val precision = bcmetrics.precisionByThreshold()
precision.foreach { case (t, p) =>
// If threshold is 0.5 as what we want, then get the precision and append it to the string. Idea is if score is <0.5 class 0, else class 1.
if (t == 0.5) {
println(s"Threshold is: $t, Precision is: $p")
sb ++= p.toString() + "\t"
}
}
val auROC = bcmetrics.areaUnderROC
sb ++= iteration + auPRC.toString() + "\t"
I want to save the performance results of each iteration in separate file. I tried this, but it does not work, any help with this will be great
val data = spark.parallelize(sb)
val filename = "logreg-metrics" + iterationcount.toString() + ".txt"
data.saveAsTextFile(filename)
}
I was able to resolve this, I did the following. I converted the String to a list.
val data = spark.parallelize(List(sb))
val filename = "logreg-metrics" + iterationcount.toString() + ".txt"
data.saveAsTextFile(filename)