How to save a PCA object in Spark Scala?

I'm doing PCA on my data, following the guide at https://spark.apache.org/docs/latest/mllib-dimensionality-reduction
The relevant code is the following:
import org.apache.spark.mllib.feature.PCA
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
val data: RDD[LabeledPoint] = sc.parallelize(Seq(
  new LabeledPoint(0, Vectors.dense(1, 0, 0, 0, 1)),
  new LabeledPoint(1, Vectors.dense(1, 1, 0, 1, 0)),
  new LabeledPoint(1, Vectors.dense(1, 1, 0, 0, 0)),
  new LabeledPoint(0, Vectors.dense(1, 0, 0, 0, 0)),
  new LabeledPoint(1, Vectors.dense(1, 1, 0, 0, 0))))
// Compute the top 5 principal components.
val pca = new PCA(5).fit(data.map(_.features))
// Project vectors to the linear space spanned by the top 5 principal
// components, keeping the label
val projected = data.map(p => p.copy(features = pca.transform(p.features)))
This code performs PCA on the data. However, I can't find any example code or documentation explaining how to save and load the fitted PCA object for future use. Could someone give me an example based on the above code?

It seems that the mllib version of PCA does not support saving the model to disk. You can save the pc matrix of the resulting PCAModel instead. However, consider using the spark.ml version: its PCA estimator produces a PCAModel that can be serialized to disk and included in a Spark ML Pipeline.
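If you stay with mllib, one option is to persist just the pc matrix, which is a local DenseMatrix, via plain Java serialization. A minimal sketch, reusing the question's pca and data values with a placeholder file path; projecting a vector again is just pc transposed times the feature vector:
import java.io.{FileOutputStream, ObjectOutputStream, FileInputStream, ObjectInputStream}
import org.apache.spark.mllib.linalg.{DenseMatrix, DenseVector}
// persist the local pc matrix of the fitted mllib PCAModel (path is only a placeholder)
val oos = new ObjectOutputStream(new FileOutputStream("/tmp/pca_pc.bin"))
oos.writeObject(pca.pc)
oos.close()
// later: load it back and re-project feature vectors with pc^T * x
val ois = new ObjectInputStream(new FileInputStream("/tmp/pca_pc.bin"))
val pc = ois.readObject().asInstanceOf[DenseMatrix]
ois.close()
val projectedAgain = data.map(p =>
  p.copy(features = pc.transpose.multiply(new DenseVector(p.features.toArray))))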

Example code based on @EmiCareOfCell44's answer, using PCA and PCAModel from org.apache.spark.ml.feature:
import org.apache.spark.ml.feature.{PCA, PCAModel}
import org.apache.spark.ml.linalg.Vectors
val data = Array(
  Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
)
val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")
val pca = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(3)
  .fit(df)
val result = pca.transform(df).select("pcaFeatures")
result.show(false)
// save the model
val savePath = "xxxx"
pca.save(savePath)
// load the saved model
val pca_loaded = PCAModel.load(savePath)
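Since the answer above mentions Spark ML Pipelines: the same fitted step can also be persisted as a stage of a PipelineModel. A minimal sketch, reusing the df defined above and a placeholder path:
import org.apache.spark.ml.{Pipeline, PipelineModel}
// wrap the PCA estimator in a Pipeline so the whole fitted pipeline,
// including the PCAModel stage, is saved and loaded in one call
val pipeline = new Pipeline().setStages(Array(
  new PCA().setInputCol("features").setOutputCol("pcaFeatures").setK(3)))
val pipelineModel = pipeline.fit(df)
pipelineModel.write.overwrite().save("/tmp/pca-pipeline")
val reloaded = PipelineModel.load("/tmp/pca-pipeline")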

Related

Shuffling elements of an RDD[List[Double]] in Spark

In a program I am developing with Spark 2.3 in Scala, I have an RDD[List[Double]]. Every List[Double] has the same size. I can't figure out how to perform a transformation that, given the RDD
[1.0, 1.5, 4.0, 3.0],
[2.3, 5.6, 3.4, 9.0],
[4.5, 2.0, 1.0, 5.7]
transform it in the RDD
[2.3, 2.0, 1.0, 3.0],
[1.0, 5.6, 4.0, 5.7],
[4.5, 1.5, 3.4, 9.0]
where the elements are shuffled among the lists, each keeping its original position within its list.
For example, the first element of the first list is moved to the first position of the second list, the second element of the first list is moved to the second position of the third list, and so on.
Thanks a lot.
One approach to shuffling column-wise would be to break the dataset down into individual single-column DataFrames, shuffle each of them with orderBy(rand), and then piece them back together.
To join the shuffled DataFrames, RDD's zipWithIndex is applied to each of them to create row-identifying ids. Note that monotonically_increasing_id won't cut it, as it doesn't guarantee generating the same list of ids needed for the final join. Hence this is rather expensive, due to the required transformations between RDD and DataFrame.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
val rdd0 = sc.parallelize(Seq(
  List(1.0, 1.5, 4.0, 3.0),
  List(2.3, 5.6, 3.4, 9.0),
  List(4.5, 2.0, 1.0, 5.7)
))
// rdd0: org.apache.spark.rdd.RDD[List[Double]] = ...

// one column per list element
val rdd = rdd0.map{ case x: Seq[Double] => (x(0), x(1), x(2), x(3)) }
val df = rdd.toDF("c1", "c2", "c3", "c4")

// shuffle each column independently and tag every row with a join id
val shuffledDFs = df.columns.filter(_.startsWith("c")).map{ c =>
  val subDF = df.select(c)
  val subRDD = subDF.orderBy(rand).rdd.zipWithIndex.map{
    case (row: Row, id: Long) => Row.fromSeq(row.toSeq :+ id)
  }
  spark.createDataFrame( subRDD,
    StructType(subDF.schema.fields :+ StructField("idx", LongType, false))
  )
}

shuffledDFs.reduce( _.join(_, Seq("idx")) ).show
// +---+---+---+---+---+
// |idx| c1| c2| c3| c4|
// +---+---+---+---+---+
// | 0|2.3|2.0|4.0|9.0|
// | 1|1.0|5.6|3.4|3.0|
// | 2|4.5|1.5|1.0|5.7|
// +---+---+---+---+---+
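If the result is needed back as an RDD[List[Double]] like the input, a small follow-up sketch (assuming the joined columns come back in c1..c4 order after dropping idx):
// drop the idx column and convert each Row back to a List[Double]
val shuffledRDD = shuffledDFs.reduce(_.join(_, Seq("idx")))
  .drop("idx")
  .rdd
  .map(row => row.toSeq.map(_.asInstanceOf[Double]).toList)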

How to print RowMatrix in Scala/Spark?

How can I view/print to screen a small RowMatrix in Scala?
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
val A = new RowMatrix(sparkContext.parallelize(Seq(
  Vectors.dense(1, 2, 3),
  Vectors.dense(4, 5, 6))))
I figured it's just
A.rows.collect
FYI: beware of the matrix size; collect brings all the rows to the driver.
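To print one row per line rather than dumping the raw Array that collect returns, a small variation:
// collect the rows to the driver (only safe for small matrices) and print them
A.rows.collect().foreach(println)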

implicit recommendation with Spark ML and data frames

I am trying to use the new ML library with Spark and DataFrames to build a recommender with implicit ratings.
My code:
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import Row
from pyspark.ml.recommendation import ALS
sc = SparkContext()
sqlContext = SQLContext(sc)
# create the dataframe (user x item)
df = sqlContext.createDataFrame(
    [(0, 0), (0, 1), (1, 1), (1, 2), (2, 1), (2, 2)],
    ["user", "item"])
als = ALS() \
    .setRank(10) \
    .setImplicitPrefs(True)
model = als.fit(df)
print "Rank %i " % model.rank
model.userFactors.orderBy("id").collect()
test = sqlContext.createDataFrame([(0, 2), (1, 0), (2, 0)], ["user", "item"])
predictions = sorted(model.transform(test).collect(), key=lambda r: r[0])
for p in predictions: print p
However, I run into this error:
pyspark.sql.utils.AnalysisException: cannot resolve 'rating' given input columns user, item;
So I'm not sure how to define the data frame.
It appears you are trying to use (user, product) tuples, but you need (user, product, rating) triplets. Even for implicit ratings, you do need the ratings. You can use a constant like 1.0 if they are all the same.
I am confused because the MLlib API has a separate API call for implicit ratings:
http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html
val alpha = 0.01
val lambda = 0.01
val model = ALS.trainImplicit(ratings, rank, numIterations, lambda, alpha)
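For the DataFrame-based API used in the question, implicit mode is instead enabled on the estimator via setImplicitPrefs. A minimal Scala sketch with the constant 1.0 ratings suggested above (the column names follow the question):
import org.apache.spark.ml.recommendation.ALS
import spark.implicits._
// (user, item, rating) triplets with a constant implicit rating of 1.0
val ratings = Seq((0, 0, 1.0), (0, 1, 1.0), (1, 1, 1.0),
                  (1, 2, 1.0), (2, 1, 1.0), (2, 2, 1.0))
  .toDF("user", "item", "rating")
val als = new ALS()
  .setRank(10)
  .setImplicitPrefs(true)
  .setUserCol("user")
  .setItemCol("item")
  .setRatingCol("rating")
val model = als.fit(ratings)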

Convert local Vectors to RDD[Vector]

I'm new to Spark and Scala and I'm trying to read its documentation on MLlib.
The tutorial on http://spark.apache.org/docs/1.4.0/mllib-data-types.html,
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix
val rows: RDD[Vector] = ... // an RDD of local vectors
// Create a RowMatrix from an RDD[Vector].
val mat: RowMatrix = new RowMatrix(rows)
// Get its size.
val m = mat.numRows()
val n = mat.numCols()
does not show how to construct an RDD[Vector] (variable rows) from a list of local vectors.
So, for example, I have executed the following in spark-shell (as part of my exploration):
val v0: Vector = Vectors.dense(1.0, 0.0, 3.0)
val v1: Vector = Vectors.sparse(3, Array(1), Array(2.5))
val v2: Vector = Vectors.sparse(3, Seq((0, 1.5),(1, 1.8)))
which, if 'merged', would look like this matrix:
1.0 0.0 3.0
0.0 2.5 0.0
1.5 1.8 0.0
So, how do I transform Vectors v0, v1, v2 to rows?
You can achieve this with SparkContext's parallelize method. Since you have already created the vectors, all that's required is to put them in a sequence and parallelize it, as shown below.
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
val v0 = Vectors.dense(1.0, 0.0, 3.0)
val v1 = Vectors.sparse(3, Array(1), Array(2.5))
val v2 = Vectors.sparse(3, Seq((0, 1.5), (1, 1.8)))
val rows = sc.parallelize(Seq(v0, v1, v2))
val mat: RowMatrix = new RowMatrix(rows)
// Get its size.
val m = mat.numRows()
val n = mat.numCols()

simple matrix multiplication in Spark

I am struggling with some very basic Spark code. I would like to define a matrix x with 2 columns. This is what I have tried:
scala> val s = breeze.linalg.linspace(-3,3,5)
s: breeze.linalg.DenseVector[Double] = DenseVector(-3.0, -1.5, 0.0, 1.5, 3.0) // in this case I want s to be both column 1 and column 2 of x
scala> val ss = s.toArray ++ s.toArray
ss: Array[Double] = Array(-3.0, -1.5, 0.0, 1.5, 3.0, -3.0, -1.5, 0.0, 1.5, 3.0)
scala> import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg.distributed.RowMatrix
scala> val mat = new RowMatrix(ss, 5, 2)
<console>:17: error: type mismatch;
found : Array[Double]
required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
val mat = new RowMatrix(ss, 5, 2)
I do not understand how I can get the right transformation in order to pass the values to the distributed matrix ^
EDIT:
Maybe I have been able to solve:
scala> val s = breeze.linalg.linspace(-3,3,5)
s: breeze.linalg.DenseVector[Double] = DenseVector(-3.0, -1.5, 0.0, 1.5, 3.0)
scala> val ss = s.toArray ++ s.toArray
ss: Array[Double] = Array(-3.0, -1.5, 0.0, 1.5, 3.0, -3.0, -1.5, 0.0, 1.5, 3.0)
scala> val x = new breeze.linalg.DenseMatrix(5, 2, ss)
x: breeze.linalg.DenseMatrix[Double] =
-3.0 -3.0
-1.5 -1.5
0.0 0.0
1.5 1.5
3.0 3.0
scala> val xDist = sc.parallelize(x.toArray)
xDist: org.apache.spark.rdd.RDD[Double] = ParallelCollectionRDD[0] at parallelize at <console>:18
Something like this. This typechecks, but for some reason won't run in my Scala worksheet.
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.linalg.distributed._
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
val conf = new SparkConf().setAppName("spark-scratch").setMaster("local")
val sc= new SparkContext(conf)
// the values for the column in each row
val col = List(-3.0, -1.5, 0.0, 1.5, 3.0)
// make two rows of the column values, transpose it,
// make Vectors of the result
val t = List(col, col).transpose.map(r => Vectors.dense(r.toArray))
// make an RDD from the resultant sequence of Vectors, and
// make a RowMatrix from that.
val rm = new RowMatrix(sc.makeRDD(t))
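Since the question is ultimately about matrix multiplication: a RowMatrix can be multiplied on the right by a local Matrix. A small follow-up sketch, using a 2x2 identity matrix just to show the call:
import org.apache.spark.mllib.linalg.Matrices
// multiply the distributed 5x2 RowMatrix by a local 2x2 matrix
val b = Matrices.dense(2, 2, Array(1.0, 0.0, 0.0, 1.0))
val product = rm.multiply(b)
product.rows.collect().foreach(println)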