Spark Mllib .toBlockMatrix results in matrix of 0.0 - scala

I am trying to create a block matrix from an input data file. I have managed to read the data from the file and store it correctly in IndexedRowMatrix and CoordinateMatrix format.
When I use .toBlockMatrix on the CoordinateMatrix the result is a block matrix containing only 0.0 with the same dimensions as the CoordinateMatrix.
I am using Spark version 1.5.0-cdh5.5.0.
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix
import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix
import org.apache.spark.mllib.linalg.distributed.IndexedRow
import org.apache.spark.mllib.linalg.distributed.BlockMatrix
val conf = new SparkConf().setMaster("local").setAppName("Transpose");
val sc = new SparkContext(conf)
val dataRDD = sc.textFile("/user/cloudera/data/data.txt").map(line => Vectors.dense(line.split(" ").map(_.toDouble))).zipWithIndex.map(_.swap)
//Format of dataRDD is RDD[(Long, Vector)]
val rows = dataRDD.map{case(k,v) => IndexedRow(k,v)}
//Format of rows is RDD[IndexedRow]
val mat: IndexedRowMatrix = new IndexedRowMatrix(rows)
val coordMat: CoordinateMatrix = mat.toCoordinateMatrix()
val blockMat: BlockMatrix = coordMat.toBlockMatrix().cache()
The data file is simply two columns by sixty rows of integers:
140 123
141 310
310 381
480 321
... ...
Update:
I've done some investigating and discovered that groupByKey is not working correctly, which prevents the BlockMatrix from being formed properly. However, I still do not know why groupByKey, join, and groupBy always return an empty result.

I have solved the problem by removing the lines of code:
val conf = new SparkConf().setMaster("local").setAppName("Transpose")
val sc = new SparkContext(conf)
I found the answer in a comment by Farzad Nozarian on the question "Unable to count words using reduceByKey((v1,v2) => v1 + v2) scala function in spark". Presumably this works because the shell already provides a SparkContext as sc, so creating a second one is unnecessary.
As a side note, this might help people who are getting empty results for .groupByKey, .reduceByKey, .join, etc.
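The update above mentions that the conversion depends on a grouping step. As a rough illustration only (plain Scala on a local collection, not Spark's actual implementation), this is the kind of grouping involved: each coordinate entry is keyed by the block it falls into, and the entries are then grouped by that key. The names BlockSketch, Entry and toBlocks are hypothetical and are not part of MLlib.

```scala
// Plain-Scala illustration (no Spark). BlockSketch, Entry and toBlocks are
// hypothetical names used only for this sketch; they are not MLlib API.
object BlockSketch {
  final case class Entry(i: Long, j: Long, value: Double)

  // Maps (blockRow, blockCol) to the entries of that block, with indices
  // rewritten relative to the block's top-left corner.
  def toBlocks(entries: Seq[Entry], rowsPerBlock: Int, colsPerBlock: Int): Map[(Int, Int), Seq[Entry]] =
    entries
      .map { e =>
        val key = ((e.i / rowsPerBlock).toInt, (e.j / colsPerBlock).toInt)
        key -> Entry(e.i % rowsPerBlock, e.j % colsPerBlock, e.value)
      }
      .groupBy(_._1) // the step that yields nothing if grouping is broken
      .map { case (key, pairs) => key -> pairs.map(_._2) }

  def main(args: Array[String]): Unit = {
    // First two rows of the sample data file from the question.
    val entries = Seq(Entry(0, 0, 140.0), Entry(0, 1, 123.0), Entry(1, 0, 141.0), Entry(1, 1, 310.0))
    toBlocks(entries, rowsPerBlock = 1, colsPerBlock = 2).toSeq.sortBy(_._1).foreach(println)
  }
}
```

If the grouping step returns an empty result, no blocks are produced at all, which would be consistent with a matrix that reads back as all 0.0.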

Related

How to calculate rolling covariance matrix from a spark dataframe

I have a Spark 2.2.0 DataFrame of currency prices to which I add returns.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val spark = SparkSession.builder.getOrCreate()
val prices = spark.read.json("prices.json")
// make a window function and convert prices to returns
val window = Window.partitionBy("currency").orderBy("time")
val lagPrice = lag(col("close"), 1).over(window)
val percentReturn = col("close") / col("lastClose") - 1d
val logReturn = log(col("close") / col("lastClose"))
val returns = prices.withColumn("lastClose", lagPrice)
  .withColumn("return", percentReturn)
  .withColumn("logReturn", logReturn)
Now I want to calculate a rolling covariance matrix (like a moving average) of all currencies using a window function, but I cannot find any documentation or examples.
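To make the goal concrete, here is a minimal plain-Scala sketch of what a rolling covariance between two return series means. It runs on local sequences and is not a Spark window-function solution. The names RollingCovSketch, percentReturns, covariance and rollingCovariance are hypothetical, and the choice of sample covariance (dividing by n - 1) is an assumption.

```scala
// Plain-Scala illustration (no Spark); all names here are hypothetical.
object RollingCovSketch {
  // Percent returns, mirroring close / lastClose - 1 from the question.
  def percentReturns(closes: Seq[Double]): Seq[Double] =
    closes.sliding(2).collect { case Seq(prev, cur) => cur / prev - 1.0 }.toList

  // Sample covariance (divides by n - 1) of two equally long series.
  def covariance(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.length == ys.length && xs.length > 1, "need two series of equal length > 1")
    val mx = xs.sum / xs.length
    val my = ys.sum / ys.length
    xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum / (xs.length - 1)
  }

  // One covariance value per full window of `window` consecutive observations.
  def rollingCovariance(xs: Seq[Double], ys: Seq[Double], window: Int): Seq[Double] = {
    require(xs.length == ys.length, "series must have equal length")
    require(window > 1, "window must contain at least two observations")
    xs.sliding(window).zip(ys.sliding(window))
      .collect { case (wx, wy) if wx.length == window => covariance(wx, wy) }
      .toList
  }

  def main(args: Array[String]): Unit = {
    val a = percentReturns(Seq(100.0, 101.0, 99.0, 102.0, 103.0))
    val b = percentReturns(Seq(50.0, 50.5, 50.0, 51.0, 51.5))
    println(rollingCovariance(a, b, window = 3))
  }
}
```

A full covariance matrix for a window would apply covariance to every pair of currencies within that window.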

print CoordinateMatrix after using RowMatrix.columnSimilarities in Apache Spark

I am using spark mllib for one of my projects in which I need to calculate document similarities.
I first converted the documents to vectors using tf-idf transform of the mllib, then converted it into RowMatrix and used the columnSimilarities() method.
I referred to tf-idf documentation and used the DIMSUM implementation for cosine similarities.
This is the Scala code I execute in spark-shell:
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.feature.IDF
import org.apache.spark.mllib.linalg.distributed.RowMatrix
val documents = sc.textFile("test1").map(_.split(" ").toSeq)
val hashingTF = new HashingTF()
val tf = hashingTF.transform(documents)
tf.cache()
val idf = new IDF().fit(tf)
val tfidf = idf.transform(tf)
// now use the RowMatrix to compute cosineSimilarities
// which implements DIMSUM algorithm
val mat = new RowMatrix(tfidf)
val sim = mat.columnSimilarities() // returns a CoordinateMatrix
Now let's say my input file, test1 in this code block, is a simple file with 5 short documents (fewer than 10 terms each), one per line.
Since I am just testing this code, I would like to see the output of mat.columnSimilarities() which is in object sim.
I would like to see the similarity of 1st document vector with 2nd, 3rd and so on.
I referred to spark documentation for CoordinateMatrix which is the type of object returned by columnSimilarities method of RowMatrix class and referred by sim.
By going through more documentation, I figured I could convert the CoordinateMatrix to RowMatrix, then convert the rows of RowMatrix to arrays and then print like this println(sim.toRowMatrix().rows.toArray().mkString("\n")) .
But that gives some output which I couldn't understand.
Can anyone help? Any kind of resource links etc would help a lot!
Thanks!
You can try the following; there is no need to convert to row matrix format:
import org.apache.spark.mllib.linalg.distributed.MatrixEntry
// Format each entry as "row,col,similarity"
val transformedRDD = sim.entries.map{case MatrixEntry(row: Long, col: Long, value: Double) => Array(row, col, value).mkString(",")}
To retrieve the elements, invoke the following action:
transformedRDD.collect()

Cosine Similarity via DIMSUM in Spark

I have some very simple code to try cosine similarity:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, CoordinateMatrix, RowMatrix}
val rows= Array(((1,2,3,4,5),(1,2,3,4,5),(1,2,4,5,8),(3,4,1,2,7),(7,7,7,7,7)))
val mat = new RowMatrix(rows)
val simsPerfect = mat.columnSimilarities()
val simsEstimate = mat.columnSimilarities(0.8)
I ran this code on Amazon AWS, which has Spark 1.5, but I got the following message for the last two lines:
"Error: value columnSimilarities is not a member of org.apache.spark.rdd.RDD[(int,int)]"
Could you please help to resolve this issue?
I found the answer: I need to convert the matrix to an RDD. Here is the working code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, CoordinateMatrix, RowMatrix}
import org.apache.spark.rdd._
import org.apache.spark.mllib.linalg._
def matrixToRDD(m: Matrix): RDD[Vector] = {
  val columns = m.toArray.grouped(m.numRows)
  val rows = columns.toSeq.transpose // Skip this if you want a column-major RDD.
  val vectors = rows.map(row => new DenseVector(row.toArray))
  sc.parallelize(vectors)
}
val dm: Matrix = Matrices.dense(5, 5,Array(1,2,3,4,5,1,2,3,4,5,1,2,4,5,8,3,4,1,2,7,7,7,7,7,7))
val rows = matrixToRDD(dm)
val mat = new RowMatrix(rows)
val simsPerfect = mat.columnSimilarities()
val simsEstimate = mat.columnSimilarities(0.8)
println("Pairwise similarities are: " + simsPerfect.entries.collect.mkString(", "))
println("Estimated pairwise similarities are: " + simsEstimate.entries.collect.mkString(", "))
Cheers
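For a small matrix like this one, the exact similarities can be cross-checked locally. Below is a minimal plain-Scala sketch (no Spark) that computes the cosine similarity of every pair of columns i < j from the same column-major array passed to Matrices.dense. The names CosineSketch, columnsOf, cosine and columnSimilarities are hypothetical helpers for this sketch, not MLlib API.

```scala
// Plain-Scala cross-check (no Spark); all names here are hypothetical.
object CosineSketch {
  // Splits a column-major array into its columns (numRows values per column).
  def columnsOf(values: Array[Double], numRows: Int): IndexedSeq[Array[Double]] =
    values.grouped(numRows).toIndexedSeq

  // Cosine similarity: dot product divided by the product of the Euclidean norms.
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    dot / (normA * normB)
  }

  // Pairwise similarities for column pairs (i, j) with i < j.
  def columnSimilarities(values: Array[Double], numRows: Int): Seq[(Int, Int, Double)] = {
    val cols = columnsOf(values, numRows)
    for {
      i <- cols.indices
      j <- cols.indices if i < j
    } yield (i, j, cosine(cols(i), cols(j)))
  }

  def main(args: Array[String]): Unit = {
    // Same values, in the same column-major order, as in the answer above.
    val values = Array[Double](1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 4, 5, 8, 3, 4, 1, 2, 7, 7, 7, 7, 7, 7)
    columnSimilarities(values, numRows = 5).foreach(println)
  }
}
```

The first two columns are identical, so their similarity should come out as 1 up to floating-point rounding, which makes a quick sanity check against the simsPerfect output.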

Spark: Input a vector

I'm getting into Spark and I have problems with Vectors.
import org.apache.spark.mllib.linalg.{Vectors, Vector}
The input of my program is a text file which contains the output of an RDD[Vector]:
dataset.txt:
[-0.5069793074881704,-2.368342680619545,-3.401324690974588]
[-0.7346396928543871,-2.3407983487917448,-2.793949129209909]
[-0.9174226561793709,-0.8027635530022152,-1.701699021443242]
[0.510736518683609,-2.7304268743276174,-2.418865539558031]
So, what I try to do is:
val rdd = sc.textFile("/workingdirectory/dataset")
val data = rdd.map(s => Vectors.dense(s.split(',').map(_.toDouble)))
I get an error because it reads [0.510736518683609 as a number.
Is there any way to load the vectors stored in the text file directly, without the second line? How can I remove the "[" in the map stage?
I'm really new to Spark, so sorry if this is a very obvious question.
Given the input, the simplest thing you can do is to use Vectors.parse:
scala> import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.Vectors
scala> Vectors.parse("[-0.50,-2.36,-3.40]")
res14: org.apache.spark.mllib.linalg.Vector = [-0.5,-2.36,-3.4]
It also works with sparse representation:
scala> Vectors.parse("(10,[1,5],[0.5,-1.0])")
res15: org.apache.spark.mllib.linalg.Vector = (10,[1,5],[0.5,-1.0])
Combining it with your data, all you need is:
rdd.map(Vectors.parse)
If you expect malformed or empty lines, you can wrap it using Try:
import scala.util.Try
rdd.map(line => Try(Vectors.parse(line))).filter(_.isSuccess).map(_.get)
Here is one way to do it:
val rdd = sc.textFile("/workingdirectory/dataset")
val data = rdd.map {
  s =>
    // Strip the surrounding brackets, then split and convert to doubles
    val vect = s.replaceAll("\\[", "").replaceAll("\\]","").split(',').map(_.toDouble)
    Vectors.dense(vect)
}
I've just broken the map into several lines for readability.
Note: Remember, it's simply string processing on each line.
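Since this is just string processing, the per-line logic can be tried out without Spark. Here is a minimal plain-Scala sketch that strips the brackets and returns None for malformed or empty lines instead of throwing, in the spirit of the Try approach above. ParseSketch and parseDense are hypothetical names for this sketch.

```scala
import scala.util.Try

// Plain-Scala illustration (no Spark); ParseSketch and parseDense are hypothetical names.
object ParseSketch {
  // Parses a dense line such as "[1.0,-2.5,3.0]" (the brackets are optional).
  // Returns None when any field is not a number, including for empty lines.
  def parseDense(line: String): Option[Array[Double]] =
    Try(line.trim.stripPrefix("[").stripSuffix("]").split(',').map(_.trim.toDouble)).toOption

  def main(args: Array[String]): Unit = {
    val lines = Seq("[-0.5069793074881704,-2.368342680619545,-3.401324690974588]", "", "not a vector")
    // flatMap drops the None results, keeping only the lines that parsed.
    lines.flatMap(l => parseDense(l)).foreach(v => println(v.mkString("[", ",", "]")))
  }
}
```

On an RDD[String] the same function could be applied with flatMap, and the resulting arrays passed to Vectors.dense.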

Convert Rdd[Vector] to Rdd[Double]

How do I convert a CSV file to RDD[Double]? I get the error "cannot be applied to (org.apache.spark.rdd.RDD[Unit])" at this line:
val kd = new KernelDensity().setSample(rows)
My full code is here:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.stat.KernelDensity
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkContext, SparkConf}
class KdeAnalysis {
  val conf = new SparkConf().setAppName("sample").setMaster("local")
  val sc = new SparkContext(conf)
  val DATAFILE: String = "C:\\Users\\ajohn\\Desktop\\spark_R\\data\\mass_cytometry\\mass.csv"
  val rows = sc.textFile(DATAFILE).map {
    line => val values = line.split(',').map(_.toDouble)
      Vectors.dense(values)
  }.cache()
  // Construct the density estimator with the sample data and a standard deviation for the Gaussian
  // kernels
  val rdd : RDD[Double] = sc.parallelize(rows)
  val kd = new KernelDensity().setSample(rdd)
    .setBandwidth(3.0)
  // Find density estimates for the given values
  val densities = kd.estimate(Array(-1.0, 2.0, 5.0))
}
Since rows is an RDD[org.apache.spark.mllib.linalg.Vector], the following line cannot work:
val rdd : RDD[Double] = sc.parallelize(rows)
parallelize expects a Seq[T], and an RDD is not a Seq.
Even if this part worked as you expect, your input is simply wrong. A correct argument for KernelDensity.setSample is either RDD[Double] or JavaRDD[java.lang.Double]. It looks like it doesn't support multivariate data at the moment.
Regarding the question from the title, you can flatMap:
rows.flatMap(_.toArray)
or, even better, do it when you create rows:
val rows = sc.textFile(DATAFILE).flatMap(_.split(',').map(_.toDouble)).cache()
but I doubt it is really what you need.
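To see why the estimator wants a flat collection of doubles, here is a minimal plain-Scala sketch (no Spark) of a univariate Gaussian kernel density estimate. It assumes Gaussian kernels with the bandwidth used as their standard deviation, as the comment in the question's code describes. KdeSketch, flatten and gaussianKde are hypothetical names, and this is not MLlib's implementation.

```scala
// Plain-Scala illustration (no Spark); all names here are hypothetical.
object KdeSketch {
  // Flattens CSV lines into a single sequence of doubles (mirrors the flatMap above).
  def flatten(lines: Seq[String]): Seq[Double] =
    lines.flatMap(_.split(',').map(_.toDouble).toSeq)

  // Average of Gaussian densities centered on each sample value,
  // with `bandwidth` as the standard deviation of every kernel.
  def gaussianKde(sample: Seq[Double], bandwidth: Double, points: Seq[Double]): Seq[Double] = {
    require(sample.nonEmpty && bandwidth > 0, "need a non-empty sample and a positive bandwidth")
    val norm = 1.0 / (math.sqrt(2 * math.Pi) * bandwidth * sample.length)
    points.map { x =>
      sample.map { xi =>
        val z = (x - xi) / bandwidth
        math.exp(-0.5 * z * z)
      }.sum * norm
    }
  }

  def main(args: Array[String]): Unit = {
    val sample = flatten(Seq("1.0,2.0", "3.0,4.5"))
    println(gaussianKde(sample, bandwidth = 3.0, points = Seq(-1.0, 2.0, 5.0)))
  }
}
```

Note that flattening a multi-column CSV mixes all columns into one sample, which is why it may not be what the question actually needs.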
I have prepared this code; please check whether it helps:
val doubleRDD = rows.map(_.toArray).flatMap(x => x)