Spark (Scala): combining three separate RDD[org.apache.spark.mllib.linalg.Vector] into a single RDD[Vector]

I have three separate RDD[mllib Vector] values and I need to combine them into a single RDD[mllib Vector].
val vvv = ... .map(x => (scaler.transform(Vectors.dense(x(0))), Vectors.dense((x(1)/bv_max_2).toArray), Vectors.dense((x(2)/bv_max_1).toArray)))
More info:
scaler is a StandardScaler.
bv_max_... is just a Breeze DenseVector used for normalizing (x/max(x)).
Now I need to merge them all into one vector.
I get ([1.],[2.],[3.]) or [[1.],[2.],[3.]],
but I need [1.,2.,3.] as one vector.

I finally found a solution, though I don't know whether it is the best one.
I had a 3-D data set and needed to apply x/max(x) normalization to two dimensions and StandardScaler to the third.
My problem was that I ended up with three separate Vectors per record, e.g.
[ [1.0], [4.0], [5.0] ]
[ [2.0], [5.0], [6.0] ] ... but I needed [1.0,4.0,5.0], which can be passed to KMeans.
I changed the code above to:
val vvv = ... .map(x => scaler.transform(Vectors.dense(x.days_d)).toArray ++ (x.freq_d/bv_max_freq).toArray ++ (x.food_d/bv_max_food).toArray).map(x => Vectors.dense(x(0), x(1), x(2)))
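The idea behind the fix is to scale each column on its own and then concatenate the pieces into one flat vector per record, instead of keeping three one-element vectors. A minimal sketch of that idea in plain Python (not Spark) follows. The sample rows are made up, and a simple z-score stands in for StandardScaler, so the numbers will not necessarily match what Spark produces.

```python
from statistics import mean, pstdev

# Hypothetical 3-column data set: (days, freq, food) per row.
rows = [(1.0, 4.0, 5.0), (2.0, 5.0, 6.0), (3.0, 8.0, 10.0)]

days = [r[0] for r in rows]
mu, sigma = mean(days), pstdev(days)   # z-score statistics for column 0
max_freq = max(r[1] for r in rows)     # x / max(x) divisor for column 1
max_food = max(r[2] for r in rows)     # x / max(x) divisor for column 2


def to_feature_vector(row):
    """Scale each column separately, then concatenate into ONE flat list."""
    scaled_days = [(row[0] - mu) / sigma]
    scaled_freq = [row[1] / max_freq]
    scaled_food = [row[2] / max_food]
    # List concatenation plays the role of ++ on the Scala arrays.
    return scaled_days + scaled_freq + scaled_food


vectors = [to_feature_vector(r) for r in rows]
```

Each element of vectors is a flat three-element list, which is the shape a clustering routine expects for one observation.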


K-means clustering error: only 0's may be mixed with negative subscripts

I am trying to do k-means clustering on the iris data in R. I want to use the KKZ method for seed selection (the starting points of the clusters).
If I don't standardize the data, I have no issues with the kkz command:
res<- kkz(x=iris[,1:4], k=3)
seed <- res$v # this gives me the cluster seeds based on KKZ method
k1 <- kmeans(iris[,1:4], seed, iter.max=1000)
However, when I scale the data first, the kkz command gives me an error:
dat <- center_scale(iris[1:4], mean_center = TRUE, sd_scale = TRUE) # scale iris data
res2 <- kkz(x=dat, k=3)
Error in x[-x[i, ], ] : only 0's may be mixed with negative subscripts
I think this is an array indexing issue, but I am not sure what it is or how to solve it.
kkz appears unable to handle input that mixes positive and negative values. Judging by the error message, it uses the data values themselves as subscripts in x[-x[i, ], ], and R does not allow positive and negative subscripts to be mixed. I had a lot of problems running it, for example:
# not ok
Error in x[-x[i, ], ] : only 0's may be mixed with negative subscripts
You don't really need to center your values, so you can do:
dat <- center_scale(iris[1:4], mean_center = FALSE, sd_scale = TRUE)
res2 <- kkz(x=dat, k=3)
I would be quite cautious about using this package until you figure out why it behaves this way.
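If the package keeps failing on centered data, the seeding step is simple enough to write by hand. Below is a minimal sketch in plain Python, assuming the usual description of KKZ initialization: the first seed is the point with the largest norm, and each further seed is the point farthest from its nearest already-chosen seed. It is not the package's implementation; the function name kkz_seeds and the sample data are made up for illustration.

```python
import math


def kkz_seeds(points, k):
    """Pick k seeds: largest-norm point first, then farthest-from-chosen points."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    origin = [0.0] * len(points[0])
    # First seed: the point with the largest Euclidean norm.
    seeds = [max(points, key=lambda p: dist(p, origin))]
    while len(seeds) < k:
        # Next seed: the point whose distance to its nearest seed is largest.
        seeds.append(max(points, key=lambda p: min(dist(p, s) for s in seeds)))
    return seeds


# Centered data with mixed signs; distances do not care about the sign.
data = [[-2.0, -1.0], [-1.5, -0.5], [0.0, 0.2], [1.0, 1.0], [3.0, 2.5]]
seeds = kkz_seeds(data, 3)
```

Because the selection relies only on distances, centering the data changes which point has the largest norm but cannot trigger an indexing error.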

Efficiently load CSV coordinate format (COO) input into a local matrix in Spark

I want to convert CSV coordinate format (COO) data into a local matrix. Currently I first convert it to a CoordinateMatrix and then to a LocalMatrix. Is there a better way to do this?
Example data:
var loadG = ... .option("header", "false").csv("file.csv") ... .map("mapfunctionCreatingMatrixEntryOutOfRow")
var G = new CoordinateMatrix(loadG)
var matrixG = G.toBlockMatrix().toLocalMatrix()
A LocalMatrix will be stored on a single machine and hence not make use of Spark's strengths. In other words, using Spark seems a bit wasteful, although still possible.
The easiest way to get the CSV file to a LocalMatrix is to first read the CSV with Scala, not Spark:
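import scala.io.Source
val entries = Source.fromFile("data.csv").getLines()
  .map(_.split(",")) // split each line into its row, column and value fields
  .map(a => (a(0).toInt, a(1).toInt, a(2).toDouble)).toList // materialize so entries can be traversed more than once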
The SparseMatrix variant of the local Matrix has a method for reading COO-formatted data. The number of rows and columns must be specified to use it. Since the matrix is sparse, in most cases these should be set by hand, but it is possible to derive them from the highest indices in the data as follows:
val numRows = entries.map(_._1).max + 1
val numCols = entries.map(_._2).max + 1
Then create the matrix:
val matrixG = SparseMatrix.fromCOO(numRows, numCols, entries)
The matrix will be stored in CSC format on the machine. Printing the example input above will yield the following output:
1 x 8 CSCMatrix
(0,0) 6.128832321
(0,1) 7.738270234
(0,3) 0.438472867
(0,5) 5.486978435
(0,7) 5.295923198
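To make the COO to CSC step concrete, here is a minimal sketch in plain Python of how coordinate triples map onto the three CSC arrays (column pointers, row indices, values). It illustrates the storage layout only and is not Spark's implementation; coo_to_csc is a made-up helper name, the entries are the ones from the printed output above, and duplicate coordinates are not merged here.

```python
def coo_to_csc(num_cols, entries):
    """Convert (row, col, value) triples to CSC arrays (col_ptrs, row_indices, values)."""
    # Sort by column, then row, so each column's entries are contiguous.
    ordered = sorted(entries, key=lambda e: (e[1], e[0]))
    col_ptrs = [0] * (num_cols + 1)
    for _, col, _ in ordered:
        col_ptrs[col + 1] += 1           # count the entries in each column
    for c in range(num_cols):
        col_ptrs[c + 1] += col_ptrs[c]   # prefix sum gives each column's start offset
    row_indices = [row for row, _, _ in ordered]
    values = [val for _, _, val in ordered]
    return col_ptrs, row_indices, values


entries = [(0, 0, 6.128832321), (0, 1, 7.738270234), (0, 3, 0.438472867),
           (0, 5, 5.486978435), (0, 7, 5.295923198)]
num_rows = max(e[0] for e in entries) + 1   # highest row index + 1
num_cols = max(e[1] for e in entries) + 1   # highest column index + 1
col_ptrs, row_indices, values = coo_to_csc(num_cols, entries)
```

Column j's entries sit in positions col_ptrs[j] up to (but not including) col_ptrs[j + 1], so empty columns show up as repeated pointer values.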

How can I zip two RDDs in PySpark?

I have been trying to merge the two RDDs below, averagePoints1 and kpoints2. It keeps throwing this error:
ValueError: Can not deserialize RDD with different number of items in pair: (2, 1)
I have tried many things without success. The two RDDs look identical and have the same number of partitions. My next step is to apply a Euclidean distance function to the two lists to measure the difference, so if anyone knows how to solve this error or can suggest a different approach, I would really appreciate it.
averagePoints1 = ... .map(lambda x: x[1])
[[34.48939954847243, -118.17286894440112],
[41.028994230117945, -120.46279399895184],
[37.41157578999635, -121.60431843383599],
[34.42627845075509, -113.87191272382309],
[39.00897622397381, -122.63680410846844]]
kpoints2 = sc.parallelize(kpoints,4)
[[34.0830381107, -117.960562808],
[38.8057258629, -120.990763316],
[38.0822414157, -121.956922473],
[33.4516748053, -116.592291648],
[38.1808762414, -122.246825578]]
a= [[34.48939954847243, -118.17286894440112],
[41.028994230117945, -120.46279399895184],
[37.41157578999635, -121.60431843383599],
[34.42627845075509, -113.87191272382309],
[39.00897622397381, -122.63680410846844]]
b= [[34.0830381107, -117.960562808],
[38.8057258629, -120.990763316],
[38.0822414157, -121.956922473],
[33.4516748053, -116.592291648],
[38.1808762414, -122.246825578]]
rdda = sc.parallelize(a)
rddb = sc.parallelize(b)
c = rdda.zip(rddb)
Check this answer: Combine two RDDs in pyspark
For future searchers, this is the solution I ended up using:
newSample=newCenters.collect() #new centers as a list
samples=zip(newSample,sample) #sample=> old centers
samples1=sc.parallelize(samples)
... .map(lambda (x,y):distanceSquared(x[1],y))
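RDD.zip assumes both RDDs have the same number of partitions and the same number of elements in each partition, which is why two RDDs that look identical can still fail to zip. With only a handful of cluster centers, collecting them and pairing them locally avoids that constraint. A minimal sketch of the local pairing in plain Python follows, using the first two centers from the question; distance_squared is a stand-in written for this example, not the distanceSquared helper from the post.

```python
def distance_squared(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


new_centers = [[34.48939954847243, -118.17286894440112],
               [41.028994230117945, -120.46279399895184]]
old_centers = [[34.0830381107, -117.960562808],
               [38.8057258629, -120.990763316]]

# zip() pairs items by position, so both lists must be in the same order.
pairs = list(zip(new_centers, old_centers))
shifts = [distance_squared(new, old) for new, old in pairs]
```

Summing shifts gives a single number for how far the centers moved, which is a common convergence check between k-means iterations.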

How to give predicted and label columns in BinaryClassificationMetrics evaluation for Naive Bayes model

I am confused about the inputs to BinaryClassificationMetrics (MLlib). As of Apache Spark 1.6.0, we need to pass predictedAndLabel of type RDD[(Double, Double)], built from the transformed DataFrame, which has predicted, probability (vector) and rawPrediction (vector) columns.
I created an RDD[(Double, Double)] from the predicted and label columns. After running the BinaryClassificationMetrics evaluation on the NaiveBayesModel, I am able to retrieve ROC, PR, etc., but the values are limited and I can't plot a curve from them: ROC contains 4 values and PR contains 3.
Is this the right way to prepare predictedAndLabel, or do I need to use the rawPrediction or probability column instead of the predicted column?
Prepare like this:
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.ml.classification.{NaiveBayes, NaiveBayesModel}
val df = sqlContext.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
val predictions = new NaiveBayes().fit(df).transform(df)
// score = first element of the probability vector, paired with the label
val preds = predictions.select("probability", "label").rdd.map(row =>
  (row.getAs[Vector](0)(0), row.getAs[Double](1)))
And evaluate:
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
new BinaryClassificationMetrics(preds, 10).roc
If the predictions are only 0 or 1, the number of buckets (and hence of curve points) can be lower, as in your case. Try more complex data like this:
val anotherPreds = ... .select(..., $"label").rdd.map(row => (row.getDouble(0), row.getDouble(1)))
new BinaryClassificationMetrics(anotherPreds, 10).roc
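The reason hard 0/1 predictions give so few curve points is that a ROC curve has one point per distinct score threshold. A minimal sketch in plain Python (not Spark's implementation) shows the effect. The function roc_points and the sample scores are made up for illustration, and the sketch assumes both classes are present.

```python
def roc_points(score_and_labels):
    """Return (FPR, TPR) points, one per distinct score threshold, starting at (0, 0)."""
    positives = sum(1 for _, y in score_and_labels if y == 1.0)
    negatives = len(score_and_labels) - positives
    points = [(0.0, 0.0)]
    for t in sorted({s for s, _ in score_and_labels}, reverse=True):
        tp = sum(1 for s, y in score_and_labels if s >= t and y == 1.0)
        fp = sum(1 for s, y in score_and_labels if s >= t and y != 1.0)
        points.append((fp / negatives, tp / positives))
    return points


labels = [1.0, 1.0, 0.0, 1.0, 0.0, 0.0]
hard = list(zip([1.0, 1.0, 1.0, 0.0, 0.0, 0.0], labels))   # 0/1 predictions
soft = list(zip([0.9, 0.8, 0.7, 0.4, 0.3, 0.1], labels))   # probability-like scores

hard_curve = roc_points(hard)   # two distinct scores, so only two thresholds
soft_curve = roc_points(soft)   # six distinct scores, so six thresholds
```

With continuous scores such as a class probability, every distinct value contributes a threshold, which is what makes the curve plottable.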

How to use RowMatrix.columnSimilarities (similarity search)

TL;DR: I am trying to train on an existing data set (Seq[Words] with corresponding categories) and use the trained model to filter another data set by category similarity.
I am trying to train on a corpus of data and then use it for text analysis*. I've tried using NaiveBayes, but that seems to only work with the data you have, so its predict algorithm will always return something, even if it doesn't match anything.
So I am now trying to use TF-IDF, passing the output into a RowMatrix and computing the similarities. But I'm not sure how to run my query (one word for now). Here's what I've tried:
val rddOfTfidfFromCorpus : RDD[Vector]
val query = "word"
val tf = new HashingTF().transform(List(query))
val tfIDF = new IDF().fit(sc.makeRDD(List(tf))).transform(tf)
val mergedVectors = rddOfTfidfFromCorpus.union(sc.makeRDD(List(tfIDF)))
val similarities = new RowMatrix(mergedVectors).columnSimilarities(1.0)
Here is where I'm stuck (if I've even done everything right until here). I tried filtering the similarities i and j down to the parts of my query's TFIDF and end up with an empty collection.
The gist is that I want to train on a corpus of data and find what category it falls in. The above code is at least trying to get it down to one category and checking if I can get a prediction from that at least....
*Note that this is a toy example, so I only need something that works well enough
*I am using Spark 1.4.0
Using columnSimilarities doesn't make sense here. Since each column in your matrix represents a set of terms, you'll get a matrix of similarities between tokens, not documents. You could transpose the matrix and then use columnSimilarities, but as far as I understand, what you want is the similarity between a query and the corpus. You can express that using matrix multiplication as follows:
For starters, you'll need an IDFModel trained on the corpus. Let's assume it is called idf:
import org.apache.spark.mllib.feature.IDFModel
val idf: IDFModel = ??? // Trained using corpus data
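and a small helper:
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}
def toBlockMatrix(rdd: RDD[Vector]) = new IndexedRowMatrix(rdd.zipWithIndex.map{case (v, i) => IndexedRow(i, v)}).toBlockMatrix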
First let's convert the query to an RDD and compute TF:
val query: Seq[String] = ???
val queryTf = new HashingTF().transform(sc.parallelize(Seq(query))) // RDD[Vector] with a single row
Next we can apply IDF model and convert result to matrix:
val queryTfidf = idf.transform(queryTf)
val queryMatrix = toBlockMatrix(queryTfidf)
We'll need a corpus matrix as well:
val corpusMatrix = toBlockMatrix(rddOfTfidfFromCorpus)
If you multiply the two, you get a matrix whose number of rows equals the number of documents in the query and whose number of columns equals the number of documents in the corpus.
val dotProducts = queryMatrix.multiply(corpusMatrix.transpose)
To get a proper cosine similarity you have to divide by the product of the magnitudes, but I think you can handle that.
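For reference, the normalization step amounts to dividing each dot product by the product of the two vector norms. A minimal sketch in plain Python on small dense lists follows (the vectors are made up, and this is not the distributed Spark version):

```python
import math


def cosine_similarity(u, v):
    """Dot product divided by the product of the two vector magnitudes."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0   # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


query = [0.0, 2.0, 0.0, 1.0]
corpus = [[0.0, 4.0, 0.0, 2.0], [3.0, 0.0, 1.0, 0.0]]
scores = [cosine_similarity(query, doc) for doc in corpus]
```

An equivalent route is to scale every TF-IDF vector to unit length before multiplying, in which case the plain dot products are already cosine similarities.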
There are two problems here. First of all, it is rather expensive. Moreover, I am not sure if it is really useful. To reduce the cost you can apply some dimensionality reduction algorithm first, but let's leave that for now.
Judging from the following statement:
NaiveBayes (...) seems to only work with the data you have, so its predict algorithm will always return something, even if it doesn't match anything.
I guess you want some kind of unsupervised learning method. The simplest thing you can try is K-means:
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
val numClusters: Int = ???
val numIterations = 20
val model = KMeans.train(rddOfTfidfFromCorpus, numClusters, numIterations)
val predictions = model.predict(queryTfidf)