Efficient load CSV coordinate format (COO) input to local matrix spark - scala

I want to convert CSV coordinate format (COO) data into a local matrix. Currently I'm first converting them to CoordinateMatrix and then converting to LocalMatrix. But is there a better way to do this?
Example data:
0,5,5.486978435
0,3,0.438472867
0,0,6.128832321
0,7,5.295923198
0,1,7.738270234
Code:
var loadG = sqlContext.read.option("header", "false").csv("file.csv").rdd.map("mapfunctionCreatingMatrixEntryOutOfRow")
var G = new CoordinateMatrix(loadG)
var matrixG = G.toBlockMatrix().toLocalMatrix()

A LocalMatrix will be stored on a single machine and hence not make use of Spark's strengths. In other words, using Spark seems a bit wasteful, although still possible.
The easiest way to get the CSV file to a LocalMatrix is to first read the CSV with Scala, not Spark:
val entries = Source.fromFile("data.csv").getLines()
.map(_.split(","))
.map(a => (a(0).toInt, a(1).toInt, a(2).toDouble))
.toSeq
The SparseMatrix variant of the LocalMatrix has a method for reading COO formatted data. The number of rows and columns need to be specified to use this. Since the matrix is sparse this should in most cases be done by hand but it's possible to get the highest values in the data as follows:
val numRows = entries.map(_._1).max + 1
val numCols = entries.map(_._2).max + 1
Then create the matrix:
val matrixG = SparseMatrix.fromCOO(numRows, numCols, entries)
The matrix will be stored in CSC format on the machine. Printing the example input above will yield the following output:
1 x 8 CSCMatrix
(0,0) 6.128832321
(0,1) 7.738270234
(0,3) 0.438472867
(0,5) 5.486978435
(0,7) 5.295923198

Related

What's the simplest way to get a Spark DataFrame from arbitrary Array Data in Scala?

I've been breaking my head about this one for a couple of days now. It feels like it should be intuitively easy... Really hope someone can help!
I've built an org.nd4j.linalg.api.ndarray.INDArray of word occurrence from some semi-structured data like this:
import org.nd4j.linalg.factory.Nd4j
import org.nd4s.Implicits._
val docMap = collection.mutable.Map[Int,Map[Int,Int]] //of the form Map(phrase -> Map(phrasePosition -> word)
val words = ArrayBuffer("word_1","word_2","word_3",..."word_n")
val windows = ArrayBuffer("$phrase,$phrasePosition_1","$phrase,$phrasePosition_2",..."$phrase,$phrasePosition_n")
var matrix = Nd4j.create(windows.length*words.length).reshape(windows.length,words.length)
for (row <- matrix.shape(0)){
for(column <- matrix.shape(1){
//+1 to (row,column) if word occurs at phrase, phrasePosition indicated by window_n.
}
}
val finalmatrix = matrix.T.dot(matrix) // to get co-occurrence matrix
So far so good...
Downstream of this point I need to integrate the data into an existing pipeline in Spark, and use that implementation of pca etc, so I need to create a DataFrame, or at least an RDD. If I knew the number of words and/or windows in advance I could do something like:
case class Row(window : String, word_1 : Double, word_2 : Double, ...etc)
val dfSeq = ArrayBuffer[Row]()
for (row <- matrix.shape(0)){
dfSeq += Row(windows(row),matrix.get(NDArrayIndex.point(row), NDArrayIndex.all()))
}
sc.parallelize(dfSeq).toDF("window","word_1","word_2",...etc)
but the number of windows and words is determined at runtime. I'm looking for a WindowsxWords org.apache.spark.sql.DataFrame as output, input is a WindowsxWords org.nd4j.linalg.api.ndarray.INDArray
Thanks in advance for any help you can offer.
Ok, so after several days work it looks like the simple answer is: there isn't one. In fact, it looks like trying to use Nd4j in this context at all is a bad idea for several reasons:
It's (really) hard to get data out of the native INDArray format once you've put it in.
Even using something like guava, the .data() method brings everything on heap which will quickly become expensive.
You've got the added hassle of having to compile an assembly jar or use hdfs etc to handle the library itself.
I did also consider using Breeze which may actually provide a viable solution but carries some of the same problems and can't be used on distributed data structures.
Unfortunately, using native Spark / Scala datatypes, although easier once you know how, is - for someone like me coming from Python + numpy + pandas heaven at least - painfully convoluted and ugly.
Nevertheless, I did implement this solution successfully:
import org.apache.spark.mllib.linalg.{Vectors,Vector,Matrix,DenseMatrix,DenseVector}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
//first make a pseudo-matrix from Scala Array[Double]:
var rowSeq = Seq.fill(windows.length)(Array.fill(words.length)(0d))
//iterate through 'rows' and 'columns' to fill it:
for (row 0 until windows.length){
for (column 0 until words.length){
// rowSeq(row)(column) += 1 if word occurs at phrase, phrasePosition indicated by window_n.
}
}
//create Spark DenseMatrix
val rows : Array[Double] = rowSeq.transpose.flatten.toArray
val matrix = new DenseMatrix(windows.length,words.length,rows)
One of the main operations that I needed Nd4J for was matrix.T.dot(matrix) but it turns out that you can't multiply 2 matrices of Type org.apache.spark.mllib.linalg.DenseMatrix together, one of them (A) has to be a org.apache.spark.mllib.linalg.distributed.RowMatrix and - you guessed it - you can't call matrix.transpose() on a RowMatrix, only on a DenseMatrix! Since it's not really relevant to the question, I'll leave that part out, except to explain that what comes out of that step is a RowMatrix. Credit is also due here and here for the final part of the solution:
val rowMatrix : [RowMatrix] = transposeAndDotDenseMatrix(matrix)
// get DataFrame from RowMatrix via DenseMatrix
val newdense = new DenseMatrix(rowMatrix.numRows().toInt,rowMatrix.numCols().toInt,rowMatrix.rows.collect.flatMap(x => x.toArray)) // the call to collect() here is undesirable...
val matrixRows = newdense.rowIter.toSeq.map(_.toArray)
val df = spark.sparkContext.parallelize(matrixRows).toDF("Rows")
// then separate columns:
val df2 = (0 until words.length).foldLeft(df)((df, num) =>
df.withColumn(words(num), $"Rows".getItem(num)))
.drop("Rows")
Would love to hear improvements and suggestions on this, thanks.

How to give predicted and label columns in BinaryClassificationMetrics evaluation for Naive Bayes model

I have a confusion regarding BinaryClassificationMetrics (Mllib) inputs. As per Apache Spark 1.6.0, we need to pass predictedandlabel of Type (RDD[(Double,Double)]) from transformed DataFrame that having predicted, probability(vector) & rawPrediction(vector).
I have created RDD[(Double,Double)] from Predicted and label columns. After performing BinaryClassificationMetrics evaluation on NavieBayesModel, I'm able to retrieve ROC, PR etc. But the values are limited, I can't able plot the curve using the value generated from this. Roc contains 4 values and PR contains 3 value.
Is it the right way of preparing PredictedandLabel or do I need to use rawPrediction column or Probability column instead of Predicted column?
Prepare like this:
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}
val df = sqlContext.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
val predictions = new NaiveBayes().fit(df).transform(df)
val preds = predictions.select("probability", "label").rdd.map(row =>
(row.getAs[Vector](0)(0), row.getAs[Double](1)))
And evaluate:
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
new BinaryClassificationMetrics(preds, 10).roc
If predictions are only 0 or 1 number of buckets can be lower like in your case. Try more complex data like this:
val anotherPreds = df1.select(rand(), $"label").rdd.map(row => (row.getDouble(0), row.getDouble(1)))
new BinaryClassificationMetrics(anotherPreds, 10).roc

How to use RowMatrix.columnSimilarities (similarity search)

TL;DR; I am trying to train off of an existing data set (Seq[Words] with corresponding categories), and use that trained dataset to filter another dataset using category similarity.
I am trying to train a corpus of data and then use it for text analysis*. I've tried using NaiveBayes, but that seems to only work with the data you have, so it's predict algorithm will always return something, even if it doesn't match anything.
So, I am now trying to use TFIDF and passing that output into a RowMatrix and computing the similarities. But, I'm not sure how to run my query (one word for now). Here's what I've tried:
val rddOfTfidfFromCorpus : RDD[Vector]
val query = "word"
val tf = new HashingTF().transform(List(query))
val tfIDF = new IDF().fit(sc.makeRDD(List(tf))).transform(tf)
val mergedVectors = rddOfTfidfFromCorpus.union(sc.makeRDD(List(tfIDF)))
val similarities = new RowMatrix(mergedVectors).columnSimilarities(1.0)
Here is where I'm stuck (if I've even done everything right until here). I tried filtering the similarities i and j down to the parts of my query's TFIDF and end up with an empty collection.
The gist is that I want to train on a corpus of data and find what category it falls in. The above code is at least trying to get it down to one category and checking if I can get a prediction from that at least....
*Note that this is a toy example, so I only need something that works well enough
*I am using Spark 1.4.0
Using columnSimilarities doesn't make sense here. Since each column in your matrix represents a set of terms you'll get a matrix of similarities between tokens not documents. You could transpose the matrix and then use columnSimilarities but as far as I understand what you want is a similarity between query and corpus. You can express that using matrix multiplication as follows:
For starters you'll need an IDFModel you've trained on a corpus. Lets assume it is called idf:
import org.apache.spark.mllib.feature.IDFModel
val idf: IDFModel = ??? // Trained using corpus data
and a small helper:
def toBlockMatrix(rdd: RDD[Vector]) = new IndexedRowMatrix(
rdd.zipWithIndex.map{case (v, i) => IndexedRow(i, v)}
).toCoordinateMatrix.toBlockMatrix
First lets convert query to an RDD and compute TF:
val query: Seq[String] = ???
val queryTf = new HashingTF().transform(query)
Next we can apply IDF model and convert result to matrix:
val queryTfidf = idf.transform(queryTf)
val queryMatrix = toBlockMatrix(queryTfidf)
We'll need a corpus matrix as well:
val corpusMatrix = toBlockMatrix(rddOfTfidfFromCorpus)
If you multiple both we get a matrix with number of rows equal to the number of docs in the query and number of columns equal to the number of documents in the corpus.
val dotProducts = queryMatrix.multiply(corpusMatrix.transpose)
To get a proper cosine similarity you have to divide by a product of magnitudes but if you can handle that.
There are two problems here. First of all it is rather expensive. Moreover I am not sure if it really useful. To reduce cost you can apply some dimensionality reduction algorithm first but lets leave it for now.
Judging from a following statement
NaiveBayes (...) seems to only work with the data you have, so it's predict algorithm will always return something, even if it doesn't match anything.
I guess you want some kind of unsupervised learning method. The simplest thing you can try is K-means:
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
val numClusters: Int = ???
val numIterations = 20
val model = KMeans.train(rddOfTfidfFromCorpus, numClusters, numIterations)
val predictions = model.predict(queryTfidf)

How to convert an RDD to Vector in Spark

I have an RDD of type RDD[(Int,Double)] in which the first element of the pair is the index and the second is the value and I'd like to convert this RDD to a Vector to use for classification. Could someone help me with that?
I have the following code but it's not working
def vectorize(x:RDD[(Int,Double)], size: Int):Vector = {
val vec = Vectors.sparse(size,x)
}
Since org.apache.spark.mllib.linalg.Vector is a local data structure you have to collect your data.
def vectorize(x:RDD[(Int,Double)], size: Int):Vector = {
Vectors.sparse(size, x.collect)
}
Since there is no data distribution you have to be sure output will fit in a driver memory.
In general this operation is not particularly useful. If your data can be easily handled using local data structures then it probably shouldn't be stored inside RDD in the first place.

Imputing the dataset with mean of class label causing crash in filter operation

I have a csv file that contains numeric values.
val row = withoutHeader.map{
line => {
val arr = line.split(',')
for (h <- 0 until arr.length){
if(arr(h).trim == ""){
val abc = avgrdd.filter {case ((x,y),z) => x == h && y == arr(dependent_col_index).toDouble} //crashing here
arr(h) = //imputing with the value above
}
}
arr.mkString(",")
}
}
This is a snippet of the code where I am trying to impute the missing values with the mean of class labels.
avgrdd contains the average for the key value pairs where key is column index and the class label value. This avgrdd is calculated using the combiners which I see is calculating the results correctly.
dependent_col_index is the column containing the class labels.
The line with filter is crashing with the null pointer exception.
On removing this line the original array is the output (comma separated).
I am confused why the filter operation is causing a crash.
Please suggest on how to fix this issue.
Example
col1,dependent_col_index
4,1
8,0
,1
21,1
21,0
,1
25,1
,0
34,1
mean for class 1 is 84/4 = 21 and for class 0 is 29/2 = 14.5
Required Output
4,1
8,0
21,1
21,1
21,0
21,1
25,1
14.5,0
34,1
Thanks !!
You are trying to execute a RDD transformation inside of another RDD transformation. Remember that you cannot use RDD inside of another RDD transformation, this would cause an error.
The way to proceed is the following:
Transform the source RDD withoutHeader to the RDD of pairs <Class, Value> of the corrent type (Long in your case). Cache it
Calculate avgrdd on top of withoutHeader. This should be an RDD of pairs <Class, AvgValue>
Join withoutHeader RDD and avgrdd together - this way for each row you would have a structure <Class, <Value, AvgValue>>
Execute map on top of the result to replace missing Value with AvgValue
Another option might be to split the RDD in two parts on step 3 (one part - RDD with missing values, second one - RDD with non-missing values), join the avgrdd only with the RDD containing only missing values and after that make a union between this two parts. It would be faster if you have a small fraction of missing values