I'm wondering what would be an efficient way of centering the data of a RowMatrix in Spark (for large inputs). Do libraries or functions already exist to do this?
So far I'm thinking of just defining a function and then using map to subtract the mean, but is this efficient?
I want to do this in order to afterward perform an SVD (to do PCA) on the given matrix.
EDIT :
Here I found something that does the mean shift by the previously mentioned method (using map); note that mean has to be available up front, e.g. computed via Statistics.colStats:
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.stat.Statistics
val mean: Vector = Statistics.colStats(rows).mean // column means, computed once
def subPairs = (vPair: (Double, Double)) => vPair._1 - vPair._2
def subMean = (v: Vector) => Vectors.dense(v.toArray.zip(mean.toArray).map(subPairs))
val stdData = rows.map(subMean)
source : https://github.com/apache/spark/pull/17907/commits/956ce87cd151a9b30d181618aad7ef2a7ee859dc
Thanks in advance
Extract rows:
val mat: RowMatrix = ???
val rows = mat.rows
Fit a StandardScalerModel:
import org.apache.spark.mllib.feature.StandardScaler
val scaler = new StandardScaler(withMean = true, withStd = false).fit(rows)
Scale:
scaler.transform(rows)
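From there, since the stated goal is PCA via SVD, the centered rows can be wrapped back into a RowMatrix. A minimal sketch of that last step (the number of singular values, k = 10, is a placeholder):
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val centered = new RowMatrix(scaler.transform(rows))
// compute the top-k singular values/vectors of the mean-centered matrix
val svd = centered.computeSVD(10, computeU = true)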
Related
I have a DataFrame where each row has 3 columns:
ID:Long, ratings1:Seq[Double], ratings2:Seq[Double]
For each row I need to compute the correlation between those Vectors.
I came up with the following solution, which seems to be inefficient (and, as Jarrod Roberson has mentioned, doesn't actually work, since the SparkContext cannot be used inside a transformation), as I have to create RDDs for each Seq:
val similarities = ratingPairs.map(row => {
  val ratings1 = sc.parallelize(row.getAs[Seq[Double]]("ratings1"))
  val ratings2 = sc.parallelize(row.getAs[Seq[Double]]("ratings2"))
  val corr: Double = Statistics.corr(ratings1, ratings2)
  Similarity(row.getAs[Long]("ID"), corr)
})
Is there a way to compute such correlations properly?
Let's assume you have a correlation function for arrays:
def correlation(arr1: Array[Double], arr2: Array[Double]): Double
(for potential implementations of that function, which is completely independent of Spark, you can ask a separate question or search online; there are some close-enough resources, e.g. Apache Commons Math's PearsonsCorrelation).
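For completeness, a minimal Pearson correlation sketch (an illustration only, assuming both arrays are non-empty, of equal length, and have non-zero variance):
def correlation(arr1: Array[Double], arr2: Array[Double]): Double = {
  val n = arr1.length
  val mean1 = arr1.sum / n
  val mean2 = arr2.sum / n
  // the 1/n normalization factors cancel between the covariance and the
  // standard deviations, so they are dropped
  val cov = arr1.zip(arr2).map { case (a, b) => (a - mean1) * (b - mean2) }.sum
  val std1 = math.sqrt(arr1.map(a => (a - mean1) * (a - mean1)).sum)
  val std2 = math.sqrt(arr2.map(b => (b - mean2) * (b - mean2)).sum)
  cov / (std1 * std2)
}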
Now, all that's left to do is to wrap this function with a UDF and use it:
import org.apache.spark.sql.functions._
import spark.implicits._
val corrUdf = udf {
  (arr1: Seq[Double], arr2: Seq[Double]) => correlation(arr1.toArray, arr2.toArray)
}
val result = df.select($"ID", corrUdf($"ratings1", $"ratings2") as "correlation")
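This way the whole computation happens inside a single DataFrame transformation on the executors, so no per-row RDDs (and no illegal use of the SparkContext inside a transformation) are involved.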
Scala Spark: how to avoid RDD shuffling in a join after a distributed matrix operation
I created a dense matrix as input to calculate the cosine distance between columns:
val rowMatrixIn = sc.textFile("input.csv").map { line =>
  val values = line.split(" ").map(_.toDouble)
  Vectors.dense(values)
}
I extracted the set of entries from the coordinate matrix after the cosine calculations:
val coMatrix = new RowMatrix(rowMatrixIn)
val similarRows = coMatrix.columnSimilarities()

// extract entries over a specific threshold
val rowIndices = similarRows.entries
  .filter { case MatrixEntry(row: Long, col: Long, sim: Double) => sim > someThreshold }
  .map { case MatrixEntry(row: Long, col: Long, sim: Double) => (col, sim) }
We have another RDD, rdd2(key, Val2).
I just want to join the two RDDs: rowIndices(key, Val) and rdd2(key, Val2).
val joinedRDD = rowIndices.join(rdd2)
This will result in a shuffle.
What are the best practices to follow in order to avoid the shuffle? Any suggestion on a better approach is much appreciated.
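One common pattern (a sketch, not from the original thread): pre-partition both pair RDDs with the same partitioner and persist them, so the join sees co-partitioned inputs and does not trigger a new shuffle. Note that partitionBy itself shuffles once, so this pays off mainly when the RDDs are reused across several joins.
import org.apache.spark.HashPartitioner

// the number of partitions (100) is a placeholder; size it to your data
val partitioner = new HashPartitioner(100)

val leftPart = rowIndices.partitionBy(partitioner).cache()
val rightPart = rdd2.partitionBy(partitioner).cache()

// both sides now share the same partitioner, so the join avoids a full shuffle
val joinedRDD = leftPart.join(rightPart)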
In the following code I calculate the Euclidean distance from each document to its cluster centroid in a KMeans clustering.
I feel like the raw Euclidean distance does not make much sense, so I thought normalizing it to a scale from 0 to 1 would be better.
Unfortunately, I didn't figure out how to sort the org.apache.spark.rdd.RDD[scala.collection.immutable.Map[String,Any]] data type or how to get its maximum/minimum value.
In fact it is an RDD[Map[String,Double]], but I suppose it got converted to an RDD[Map[String,Any]] for some reason. Most approaches, e.g. takeOrdered, result in:
error: No implicit Ordering defined for scala.collection.immutable.Map[String,Any]
How can I teach Scala how to sort the Any values of this Map?
Any hints are very much appreciated.
Thanks
val score = rdd.map { case (id, vector) => distToCentroid(id, vector, model_1) }
// Normalizing the data with the normalizeResult function.
// Problem: I need to find the max and min beforehand.
def distToCentroid(id: String, datum: Vector, model: KMeansModel) = {
  val cluster = model.predict(datum)
  val centroid = model.clusterCenters(cluster)
  val distance = math.sqrt(centroid.toArray.zip(datum.toArray).map(p => p._1 - p._2).map(d => d * d).sum)
  Map("id" -> id, "distance" -> distance)
}

def normalizeResult(max: Double, min: Double, x: Double) =
  (x - min) / (max - min)
If I understand you right, you need the global min/max of the values stored inside the maps. If so, you can flatten your RDD down to an RDD[Double]. Note that the maps also hold the String "id" values, so only the numeric "distance" entries should be kept:
val values = rdd.flatMap(_.collect { case ("distance", d: Double) => d }).cache()
val min = values.min()
val max = values.max()
The easiest way to do this would be to map the outputs directly into the correct format in the first instance.
def distToCentroid(id: String, datum: Vector, model: KMeansModel): (String, Double) = {
  val cluster = model.predict(datum)
  val centroid = model.clusterCenters(cluster)
  val distance = math.sqrt(centroid.toArray.zip(datum.toArray).map(p => p._1 - p._2).map(d => d * d).sum)
  // Updated output: a typed (id, distance) pair instead of a Map[String, Any]
  (id, distance)
}
That should then allow you to use the built-in min and max functions on the distances, or the normalizeResult function you have written.
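A sketch of the end-to-end normalization under that change (assuming distToCentroid now returns an (id, distance) pair as above):
val score = rdd.map { case (id, vector) => distToCentroid(id, vector, model_1) }.cache()

// global min/max over the distances only
val min = score.values.min()
val max = score.values.max()

// scale every distance to [0, 1]
val normalized = score.mapValues(d => normalizeResult(max, min, d))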
I'm using Spark's MLlib (v1.1.0) and Scala to do k-means clustering on a file with points (longitude and latitude).
My file contains 4 fields separated by comma (the last two are the longitude and latitude).
Here, it's an example of k-means clustering using Spark:
https://spark.apache.org/docs/1.1.0/mllib-clustering.html
What I want to do is to read the last two fields of my files, which are in a specific directory in HDFS, transform them into an RDD<Vector>, and use this method of the KMeans class:
train(RDD<Vector> data, int k, int maxIterations)
This is my code:
val data = sc.textFile("/user/test/location/*")
val parsedData = data.map(s => Vectors.dense(s.split(',').map(fields => (fields(2).toDouble,fields(3).toDouble))))
But when I run it in spark-shell I get the following error:
error: overloaded method value dense with alternatives: (values:
Array[Double])org.apache.spark.mllib.linalg.Vector (firstValue:
Double,otherValues: Double*)org.apache.spark.mllib.linalg.Vector
cannot be applied to (Array[(Double, Double)])
So, I don't know how to transform my Array[(Double, Double)] into Array[Double]. Maybe there is another way to read the two fields and convert them into an RDD<Vector>. Any suggestions?
The previous suggestion using flatMap was based on the assumption that you wanted to map over the elements of the array given by the .split(",") - and offered to satisfy the types by using Array instead of Tuple2.
The argument received by the .map/.flatMap functions is an element of the original collection, so it should be named 'field' (singular) for clarity. Calling fields(2) selects the 3rd character of each of the elements of the split - hence the source of the confusion.
If what you're after is the 3rd and 4th elements of the .split(",") array, converted to Double:
s.split(",").drop(2).take(2).map(_.toDouble)
or, if you want all BUT the first two fields converted to Double (if there may be more than 2):
s.split(",").drop(2).map(_.toDouble)
There are two 'factory' methods for dense Vectors:
def dense(values: Array[Double]): Vector
def dense(firstValue: Double, otherValues: Double*): Vector
while the type produced above is Array[Tuple2[Double,Double]] and hence does not type-match:
(Extracting the logic above:)
val parseLineToTuple: String => Array[(Double, Double)] = s => s.split(',').map(fields => (fields(2).toDouble, fields(3).toDouble))
What is needed here is to create a new Array[Double] out of the input String, splitting once and picking the 3rd and 4th fields (again focusing only on the specific parsing logic):
val parseLineToArray: String => Array[Double] = s => {
  val fields = s.split(",")
  Array(fields(2).toDouble, fields(3).toDouble)
}
Integrating that in the original code should solve the issue:
val data = sc.textFile("/user/test/location/*")
val vectors = data.map(s => Vectors.dense(parseLineToArray(s)))
(You can of course inline that code; I separated it here to focus on the issue at hand.)
val parsedData = data.map { s =>
  val fields = s.split(",")
  Vectors.dense(fields(2).toDouble, fields(3).toDouble)
}
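(To close the loop on the original goal, a minimal sketch of feeding the parsed vectors into KMeans.train; k = 4 and maxIterations = 20 are placeholder values:)
import org.apache.spark.mllib.clustering.KMeans

parsedData.cache() // KMeans is iterative, so caching the input pays off
val model = KMeans.train(parsedData, 4, 20)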
I have a CSV file with the following format :
product_id1,product_title1
product_id2,product_title2
product_id3,product_title3
product_id4,product_title4
product_id5,product_title5
[...]
The product_idX is an integer and the product_titleX is a String. Example:
453478692, Apple iPhone 4 8Go
I'm trying to create the TF-IDF from my file so I can use it for a Naive Bayes Classifier in MLlib.
I am using Spark for Scala so far, using the tutorials I have found on the official page and the Berkeley AmpCamp 3 and 4.
So I'm reading the file :
val file = sc.textFile("offers.csv")
Then I'm mapping it into an RDD[Array[String]]:
val tuples = file.map(line => line.split(",")).cache
and after that I'm transforming the arrays into a pair RDD[(Int, String)]:
val pairs = tuples.map(line => (line(0).toInt, line(1)))
But I'm stuck here and I don't know how to create the Vector from it to turn it into TFIDF.
Thanks
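(For reference: as of Spark 1.1, MLlib ships HashingTF and IDF transformers that cover exactly this. A minimal sketch, assuming whitespace tokenization of the titles:)
import org.apache.spark.mllib.feature.{HashingTF, IDF}

// tokenize each title into a sequence of terms
val documents = pairs.map { case (_, title) => title.split(" ").toSeq }

val hashingTF = new HashingTF()
val tf = hashingTF.transform(documents)
tf.cache() // reused twice: once to fit the IDF model, once to transform

val tfidf = new IDF().fit(tf).transform(tf)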
To do this myself (using pyspark), I first started by creating two data structures out of the corpus. The first is a key-value structure of
document_id, [token_ids]
The second is an inverted index like
token_id, [document_ids]
I'll call those corpus and inv_index respectively.
To get tf we need to count the number of occurrences of each token in each document. So
from collections import Counter

def wc_per_row(row):
    cnt = Counter()
    for word in row:
        cnt[word] += 1
    return cnt.items()

tf = corpus.map(lambda (x, y): (x, wc_per_row(y)))
The df is simply the length of each term's inverted index. From that we can calculate the idf.
df = inv_index.map(lambda (x, y): (x, len(y)))
num_documents = tf.count()

# At this step you can also apply some filters to make sure to keep
# only terms within a 'good' range of df.
from math import log10
idf = df.map(lambda (k, v): (k, 1. + log10(float(num_documents) / v))).collect()
Now we just have to do a join on the term_id:
def calc_tfidf(tf_tuples, idf_tuples):
    return [(k1, v1 * v2)
            for (k1, v1) in tf_tuples
            for (k2, v2) in idf_tuples if k1 == k2]

tfidf = tf.map(lambda (k, v): (k, calc_tfidf(v, idf)))
This isn't a particularly performant solution, though. Calling collect to bring idf into the driver program so that it's available for the join seems like the wrong thing to do.
And of course, it requires first tokenizing and creating a mapping from each unique token in the vocabulary to some token_id.
If anyone can improve on this, I'm very interested.