I am trying to create a block matrix from an input data file. I have managed to read the data from the file and store it correctly in IndexedRowMatrix and CoordinateMatrix format.
When I use .toBlockMatrix on the CoordinateMatrix, the result is a block matrix with the same dimensions as the CoordinateMatrix but containing only 0.0.
I am using Spark version 1.5.0-cdh5.5.0.
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix
import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix
import org.apache.spark.mllib.linalg.distributed.IndexedRow
import org.apache.spark.mllib.linalg.distributed.BlockMatrix
val conf = new SparkConf().setMaster("local").setAppName("Transpose");
val sc = new SparkContext(conf)
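// Parse each line into a dense vector, then attach a row index so each element is (index, vector)
val dataRDD = sc.textFile("/user/cloudera/data/data.txt").map(line => Vectors.dense(line.split(" ").map(_.toDouble))).zipWithIndex().map(_.swap)
//Format of dataRDD is RDD[(Long, Vector)]
val rows = dataRDD.map { case (k, v) => IndexedRow(k, v) }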
//Format of rows is RDD[IndexedRow]
val mat: IndexedRowMatrix = new IndexedRowMatrix(rows)
val coordMat: CoordinateMatrix = mat.toCoordinateMatrix()
val blockMat: BlockMatrix = coordMat.toBlockMatrix().cache()
The data file is simply sixty rows by two columns of integers:
140 123
141 310
310 381
480 321
... ...
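To check what the BlockMatrix should contain, the conversion chain above (text lines, then indexed rows, then coordinate entries) can be sketched in plain Scala without Spark. This is only an illustration: the Entry case class is a hypothetical local stand-in for MLlib's MatrixEntry, and toEntries is a made-up helper name.

```scala
object CoordinateEntriesSketch {
  // Hypothetical stand-in for MLlib's MatrixEntry(i, j, value); not the real class.
  final case class Entry(i: Long, j: Long, value: Double)

  // Parse whitespace-separated lines, index the rows, and emit one entry per cell.
  def toEntries(lines: Seq[String]): Seq[Entry] =
    lines.zipWithIndex.flatMap { case (line, i) =>
      line.trim.split("\\s+").zipWithIndex.map { case (token, j) =>
        Entry(i.toLong, j.toLong, token.toDouble)
      }.toSeq
    }

  def main(args: Array[String]): Unit = {
    // First two lines of the sample data shown above.
    val sample = Seq("140 123", "141 310")
    toEntries(sample).foreach(println)
  }
}
```

If the entries computed this way are nonzero but the distributed result is all zeros, the problem is likely in the Spark setup rather than in the parsing.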
I've done some investigating and discovered that groupByKey is not working correctly, which is what prevents the BlockMatrix from being formed. I still do not know, however, why groupByKey, join, and groupBy always return an empty result.

I have solved the problem by removing the lines of code:
val conf = new SparkConf().setMaster("local").setAppName("Transpose")
val sc = new SparkContext(conf)
I found the answer in a comment by Farzad Nozarian on the question "Unable to count words using reduceByKey((v1,v2) => v1 + v2) scala function in spark". Presumably the code was being run in spark-shell, which already provides a SparkContext as sc, so creating a second one caused the trouble.
As a side note, this might help people who are getting empty results from .groupByKey, .reduceByKey, .join, etc.


How to calculate rolling covariance matrix from a spark dataframe

I have a Spark 2.2.0 DataFrame of currency prices, to which I add the returns.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val spark = SparkSession.builder.getOrCreate()
val prices = spark.read.json("prices.json")
// make a window function and convert prices to returns
val window = Window.partitionBy("currency").orderBy("time")
val lagPrice = lag(col("close"), 1).over(window)
val percentReturn = col("close") / col("lastClose") - 1d
val logReturn = log(col("close") / col("lastClose"))
val returns = prices.withColumn("lastClose", lagPrice)
  .withColumn("return", percentReturn)
  .withColumn("logReturn", logReturn)
Now I want to calculate a rolling covariance matrix (like a moving average) of all currencies using a window function, but I cannot find any documentation or examples.
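To make the target quantity concrete, here is a minimal plain-Scala sketch of a rolling sample covariance between two aligned return series. It does not use Spark or a window function; the object name, the helper names, and the window length are assumptions chosen for illustration. A full rolling covariance matrix would apply the same computation to every pair of currencies.

```scala
object RollingCovarianceSketch {
  // Sample covariance of paired observations (divides by n - 1).
  def covariance(pairs: Seq[(Double, Double)]): Double = {
    val n = pairs.size
    require(n > 1, "need at least two observations")
    val meanX = pairs.map(_._1).sum / n
    val meanY = pairs.map(_._2).sum / n
    pairs.map { case (x, y) => (x - meanX) * (y - meanY) }.sum / (n - 1)
  }

  // One covariance per full window of `window` consecutive observations.
  def rolling(xs: Seq[Double], ys: Seq[Double], window: Int): List[Double] = {
    require(xs.size == ys.size, "series must be aligned on the same timestamps")
    xs.zip(ys).sliding(window).filter(_.size == window).map(covariance).toList
  }

  def main(args: Array[String]): Unit = {
    // Made-up returns for two hypothetical currencies.
    val a = Seq(0.01, 0.02, -0.01, 0.03, 0.00)
    val b = Seq(0.02, 0.01, -0.02, 0.02, 0.01)
    println(rolling(a, b, window = 3))
  }
}
```

The series must first be aligned on time, for example by pivoting the returns so that each currency becomes a column.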

print CoordinateMatrix after using RowMatrix.columnSimilarities in Apache Spark

I am using spark mllib for one of my projects in which I need to calculate document similarities.
I first converted the documents to vectors using tf-idf transform of the mllib, then converted it into RowMatrix and used the columnSimilarities() method.
I referred to tf-idf documentation and used the DIMSUM implementation for cosine similarities.
This is the Scala code I executed in spark-shell:
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.feature.IDF
import org.apache.spark.mllib.linalg.distributed.RowMatrix
val documents = sc.textFile("test1").map(_.split(" ").toSeq)
val hashingTF = new HashingTF()
val tf = hashingTF.transform(documents)
val idf = new IDF().fit(tf)
val tfidf = idf.transform(tf)
// now use the RowMatrix to compute cosineSimilarities
// which implements DIMSUM algorithm
val mat = new RowMatrix(tfidf)
val sim = mat.columnSimilarities() // returns a CoordinateMatrix
Now let's say my input file, test1 in this code block is a simple file with 5 short documents (less than 10 terms each), one on each row.
Since I am just testing this code, I would like to see the output of mat.columnSimilarities() which is in object sim.
I would like to see the similarity of 1st document vector with 2nd, 3rd and so on.
I referred to spark documentation for CoordinateMatrix which is the type of object returned by columnSimilarities method of RowMatrix class and referred by sim.
By going through more documentation, I figured I could convert the CoordinateMatrix to RowMatrix, then convert the rows of RowMatrix to arrays and then print like this println(sim.toRowMatrix().rows.toArray().mkString("\n")) .
But that gives some output which I couldn't understand.
Can anyone help? Any kind of resource links etc would help a lot!
You can try the following; there is no need to convert to row matrix format:
val transformedRDD = sim.entries.map{case MatrixEntry(row: Long, col:Long, sim:Double) => Array(row,col,sim).mkString(",")}
To retrieve the elements you can invoke an action on transformedRDD, such as collect().
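The formatting step can be tried out locally before running it on the cluster. The sketch below uses a hypothetical Entry case class that mirrors the (i, j, value) fields of MatrixEntry; it is not the MLlib class. Keep in mind that columnSimilarities compares the columns of the RowMatrix, so with one document per row the indices i and j refer to term features, not to documents.

```scala
object PrintEntriesSketch {
  // Hypothetical stand-in for MLlib's MatrixEntry(i, j, value).
  final case class Entry(i: Long, j: Long, value: Double)

  // Sort by (row, column) and render each entry as "i,j,value".
  def format(entries: Seq[Entry]): Seq[String] =
    entries.sortBy(e => (e.i, e.j)).map(e => s"${e.i},${e.j},${e.value}")

  def main(args: Array[String]): Unit = {
    // Made-up similarity entries for illustration.
    val sample = Seq(Entry(1L, 2L, 0.25), Entry(0L, 1L, 0.5))
    format(sample).foreach(println)
  }
}
```

String interpolation is used here so that the indices print as integers; building an Array from Long and Double values widens everything to Double first.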

Cosine Similarity via DIMSUM in Spark

I have some very simple code to try cosine similarity:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, CoordinateMatrix, RowMatrix}
val rows= Array(((1,2,3,4,5),(1,2,3,4,5),(1,2,4,5,8),(3,4,1,2,7),(7,7,7,7,7)))
val mat = new RowMatrix(rows)
val simsPerfect = mat.columnSimilarities()
val simsEstimate = mat.columnSimilarities(0.8)
I ran this code on Amazon AWS, which has Spark 1.5; however, I got the following message for the last two lines:
"error: value columnSimilarities is not a member of org.apache.spark.rdd.RDD[(Int, Int)]"
Could you please help to resolve this issue?
I found the answer. I need to convert the matrix to an RDD of vectors. Here is the right code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, CoordinateMatrix, RowMatrix}
import org.apache.spark.rdd._
import org.apache.spark.mllib.linalg._
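// Convert a local (column-major) Matrix into an RDD with one Vector per row
def matrixToRDD(m: Matrix): RDD[Vector] = {
  val columns = m.toArray.grouped(m.numRows)
  val rows = columns.toSeq.transpose // Skip this if you want a column-major RDD.
  val vectors = rows.map(row => new DenseVector(row.toArray))
  sc.parallelize(vectors)
}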
val dm: Matrix = Matrices.dense(5, 5,Array(1,2,3,4,5,1,2,3,4,5,1,2,4,5,8,3,4,1,2,7,7,7,7,7,7))
val rows = matrixToRDD(dm)
val mat = new RowMatrix(rows)
val simsPerfect = mat.columnSimilarities()
val simsEstimate = mat.columnSimilarities(0.8)
println("Pairwise similarities are: " + simsPerfect.entries.collect.mkString(", "))
println("Estimated pairwise similarities are: " + simsEstimate.entries.collect.mkString(", "))
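The two ideas in this answer, regrouping column-major values into rows and comparing columns by cosine similarity, can be checked locally in plain Scala. The sketch below is an illustration under assumptions: the object and helper names are invented, and it computes the exact cosine similarity directly rather than through MLlib's DIMSUM sampling.

```scala
object ColumnSimilaritySketch {
  // Regroup column-major values into rows, mirroring grouped(numRows) + transpose.
  def toRows(values: Array[Double], numRows: Int): Seq[Seq[Double]] =
    values.grouped(numRows).map(_.toList).toList.transpose

  // Extract column j from a row-oriented matrix.
  def column(rows: Seq[Seq[Double]], j: Int): Seq[Double] = rows.map(_(j))

  // Exact cosine similarity between two equally long vectors.
  def cosine(a: Seq[Double], b: Seq[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    dot / (normA * normB)
  }

  def main(args: Array[String]): Unit = {
    // Same values as the 5 x 5 matrix dm above, in column-major order.
    val values = Array[Double](1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 4, 5, 8, 3, 4, 1, 2, 7, 7, 7, 7, 7, 7)
    val rows = toRows(values, 5)
    rows.foreach(r => println(r.mkString(" ")))
    println(cosine(column(rows, 0), column(rows, 1)))
  }
}
```

Columns 0 and 1 of this matrix are identical, so their similarity should come out at (or within rounding error of) 1.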

Spark: Input a vector

I'm getting into Spark and I have problems with Vectors.
import org.apache.spark.mllib.linalg.{Vectors, Vector}
The input of my program is a text file which contains the output of an RDD[Vector].
So, what I try to do is:
val rdd = sc.textFile("/workingdirectory/dataset")
val data = rdd.map(s => Vectors.dense(s.split(',').map(_.toDouble)))
I get an error because it reads [0.510736518683609 as a number.
Is there any way to load the vectors stored in the text file directly, without the second line? How can I remove the "[" in the map stage?
I'm really new to Spark, sorry if it's a very obvious question.
Given the input the simplest thing you can do is to use Vectors.parse:
scala> import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.Vectors
scala> Vectors.parse("[-0.50,-2.36,-3.40]")
res14: org.apache.spark.mllib.linalg.Vector = [-0.5,-2.36,-3.4]
It also works with sparse representation:
scala> Vectors.parse("(10,[1,5],[0.5,-1.0])")
res15: org.apache.spark.mllib.linalg.Vector = (10,[1,5],[0.5,-1.0])
Combining it with your data, all you need is:
rdd.map(Vectors.parse)
If you expect malformed / empty lines you can wrap it using Try:
import scala.util.Try
rdd.map(line => Try(Vectors.parse(line))).filter(_.isSuccess).map(_.get)
Here is one way to do it:
val rdd = sc.textFile("/workingdirectory/dataset")
val data = rdd.map {
  s =>
    // Strip the surrounding brackets, then parse the comma-separated values
    val vect = s.replaceAll("\\[", "").replaceAll("\\]","").split(',').map(_.toDouble)
    Vectors.dense(vect)
}
I've just broken the map into lines for readability.
Note: Remember, it's simply string processing on each line.
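Because this is plain string processing, it can be tested without a cluster. The following sketch combines the bracket stripping with the Try-based filtering of bad lines; the object and function names are hypothetical, and it returns plain arrays rather than MLlib vectors.

```scala
import scala.util.{Success, Try}

object ParseVectorSketch {
  // Strip the surrounding brackets and parse the comma-separated values.
  def parseDense(line: String): Try[Array[Double]] = Try {
    line.trim.stripPrefix("[").stripSuffix("]").split(',').map(_.trim.toDouble)
  }

  // Keep only the lines that parse successfully.
  def parseAll(lines: Seq[String]): Seq[Array[Double]] =
    lines.map(parseDense).collect { case Success(values) => values }

  def main(args: Array[String]): Unit = {
    // Made-up input lines, including an empty and a malformed one.
    val lines = Seq("[0.5,-2.36,-3.4]", "", "not a vector")
    parseAll(lines).foreach(v => println(v.mkString("[", ",", "]")))
  }
}
```

On Spark, the same function body could be passed to rdd.map, wrapping the parsed array with Vectors.dense.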

Convert Rdd[Vector] to Rdd[Double]

How do I convert csv to Rdd[Double]? I have the error: cannot be applied to (org.apache.spark.rdd.RDD[Unit]) at this line:
val kd = new KernelDensity().setSample(rows)
My full code is here:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.stat.KernelDensity
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkContext, SparkConf}
class KdeAnalysis {
val conf = new SparkConf().setAppName("sample").setMaster("local")
val sc = new SparkContext(conf)
val DATAFILE: String = "C:\\Users\\ajohn\\Desktop\\spark_R\\data\\mass_cytometry\\mass.csv"
val rows = sc.textFile(DATAFILE).map {
  line => val values = line.split(',').map(_.toDouble)
    Vectors.dense(values)
}
// Construct the density estimator with the sample data and a standard deviation for the Gaussian
// kernels
val rdd : RDD[Double] = sc.parallelize(rows)
val kd = new KernelDensity().setSample(rdd)
// Find density estimates for the given values
val densities = kd.estimate(Array(-1.0, 2.0, 5.0))
}
Since rows is an RDD[org.apache.spark.mllib.linalg.Vector], the following line cannot work:
val rdd : RDD[Double] = sc.parallelize(rows)
parallelize expects Seq[T] and RDD is not a Seq.
Even if this part worked as you expect, your input is simply wrong. A correct argument for KernelDensity.setSample is either RDD[Double] or JavaRDD[java.lang.Double]. It looks like it doesn't support multivariate data at the moment.
Regarding the question from the title, you can flatMap the vectors into their values (rows.flatMap(_.toArray)),
or, even better, do it when you create rows:
val rows = sc.textFile(DATAFILE).flatMap(_.split(',').map(_.toDouble)).cache()
but I doubt it is really what you need.
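To see why the estimator wants a flat sample of doubles, here is a small plain-Scala sketch of univariate Gaussian kernel density estimation. It is an illustration only, not MLlib's implementation: the object and function names are invented, and the bandwidth argument plays the role of the standard deviation of the Gaussian kernels (which KernelDensity sets via setBandwidth).

```scala
object GaussianKdeSketch {
  // Flatten CSV lines into a single sequence of doubles, like the flatMap above.
  def flatten(lines: Seq[String]): Seq[Double] =
    lines.flatMap(_.split(',').map(_.trim.toDouble).toSeq)

  // Density estimate at each point: average of Gaussian kernels centred on the sample values.
  def estimate(sample: Seq[Double], bandwidth: Double, points: Seq[Double]): Seq[Double] = {
    require(sample.nonEmpty, "sample must not be empty")
    require(bandwidth > 0, "bandwidth must be positive")
    val norm = 1.0 / (sample.size * bandwidth * math.sqrt(2 * math.Pi))
    points.map { x =>
      norm * sample.map { xi =>
        val z = (x - xi) / bandwidth
        math.exp(-0.5 * z * z)
      }.sum
    }
  }

  def main(args: Array[String]): Unit = {
    // Made-up CSV content; every value is treated as one univariate observation.
    val sample = flatten(Seq("1.0,2.0,3.0", "4.0,5.0"))
    println(estimate(sample, bandwidth = 1.0, points = Seq(-1.0, 2.0, 5.0)))
  }
}
```

Note that flattening mixes all CSV columns into one sample, which is only meaningful if the columns measure the same quantity.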
I have prepared this code; please check whether it helps:
val doubleRDD = rows.map(_.toArray).flatMap(x => x)