Implicit recommendation with Spark ML and DataFrames - pyspark

I am trying to use the new ML libraries with Spark and Dataframes for building a recommender with implicit ratings.
My code:
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import Row
from pyspark.ml.recommendation import ALS
sc = SparkContext()
sqlContext = SQLContext(sc)
# create the dataframe (user x item)
df = sqlContext.createDataFrame(
[(0, 0), (0, 1), (1, 1), (1, 2), (2, 1), (2, 2)],
["user", "item"])
als = ALS() \
.setRank(10) \
.setImplicitPrefs(True)
model = als.fit(df)
print "Rank %i " % model.rank
model.userFactors.orderBy("id").collect()
test = sqlContext.createDataFrame([(0, 2), (1, 0), (2, 0)], ["user", "item"])
predictions = sorted(model.transform(test).collect(), key=lambda r: r[0])
for p in predictions: print p
However, I run into this error:
pyspark.sql.utils.AnalysisException: cannot resolve 'rating' given input columns user, item;
So I am not sure how to define the DataFrame.

It appears you are trying to use (user, product) tuples, but you need (user, product, rating) triplets. Even for implicit ratings, you do need the ratings. You can use a constant like 1.0 if they are all the same.
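A minimal sketch of that fix, written in plain Python so it runs without a cluster: each (user, item) pair from the question gets a constant rating of 1.0. The name IMPLICIT_RATING is only illustrative, and the commented lines assume the sqlContext from the question and the default ALS column names (user, item, rating).

```python
# (user, item) pairs from the question
pairs = [(0, 0), (0, 1), (1, 1), (1, 2), (2, 1), (2, 2)]

# Illustrative constant: every observed interaction gets the same rating.
IMPLICIT_RATING = 1.0

# Turn each pair into a (user, item, rating) triplet.
triplets = [(user, item, IMPLICIT_RATING) for user, item in pairs]
columns = ["user", "item", "rating"]

# With a live SQLContext (assumed, as in the question) this would become:
#   df = sqlContext.createDataFrame(triplets, columns)
#   model = ALS(rank=10, implicitPrefs=True).fit(df)

for triplet in triplets:
    print(triplet)
```

With the third column present, ALS can resolve 'rating' and the AnalysisException from the question should no longer be raised.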

I am confused, because the MLlib API has a separate call for implicit feedback:
http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html
val alpha = 0.01
val lambda = 0.01
val model = ALS.trainImplicit(ratings, rank, numIterations, lambda, alpha)

Related

How to save PCA object in spark scala?

I'm doing PCA on my data and I read the guide from: https://spark.apache.org/docs/latest/mllib-dimensionality-reduction
The relevant code is as follows:
import org.apache.spark.mllib.feature.PCA
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
val data: RDD[LabeledPoint] = sc.parallelize(Seq(
new LabeledPoint(0, Vectors.dense(1, 0, 0, 0, 1)),
new LabeledPoint(1, Vectors.dense(1, 1, 0, 1, 0)),
new LabeledPoint(1, Vectors.dense(1, 1, 0, 0, 0)),
new LabeledPoint(0, Vectors.dense(1, 0, 0, 0, 0)),
new LabeledPoint(1, Vectors.dense(1, 1, 0, 0, 0))))
// Compute the top 5 principal components.
val pca = new PCA(5).fit(data.map(_.features))
// Project vectors to the linear space spanned by the top 5 principal
// components, keeping the label
val projected = data.map(p => p.copy(features = pca.transform(p.features)))
This code performs PCA on the data. However, I can't find example code or documentation explaining how to save and load the fitted PCA object for future use. Could someone give me an example based on the above code?
It seems that the MLlib version of PCA does not support saving the model to disk. You can save the pc matrix of the resulting PCAModel instead. However, it is better to use the Spark ML version: its PCA is an Estimator whose fitted model can be serialized and included in a Spark ML Pipeline.
Example code based on @EmiCareOfCell44's answer, using PCA and PCAModel from org.apache.spark.ml.feature:
import org.apache.spark.ml.feature.{PCA, PCAModel}
import org.apache.spark.ml.linalg.Vectors
val data = Array(
Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
)
val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")
val pca = new PCA()
.setInputCol("features")
.setOutputCol("pcaFeatures")
.setK(3)
.fit(df)
val result = pca.transform(df).select("pcaFeatures")
result.show(false)
// save the model
val savePath = "xxxx"
pca.save(savePath)
// load the saved model
val pca_loaded = PCAModel.load(savePath)

AttributeError: 'HashingTF' object has no attribute '_java_obj'

When I use pyspark.ml.Pipeline to create a pipeline, the following error occurs:
File "/opt/module/spark-2.4.3-bin-hadoop2.7/Pipeline.py", line 18, in <module>
hashingTF = HashingTF(ipnutCol=tokenizer.getOutputCol(),outputCol="features")
File "/opt/module/spark-2.4.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/__init__.py", line 110, in wrapper
TypeError: __init__() got an unexpected keyword argument 'ipnutCol'
Exception ignored in:
Traceback (most recent call last):
File "/opt/module/spark-2.4.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/ml/wrapper.py", line 40, in __del__
AttributeError: 'HashingTF' object has no attribute '_java_obj'
I guess the API has changed, but I am not certain.
# Build a machine learning pipeline
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml import Pipeline
# Create a SparkSession object
spark = SparkSession.builder.master("local").appName("WorldCount").getOrCreate()
# 1. prepare training documents from a list of (id, text, label) tuples
training = spark.createDataFrame([
(0, 'a b c d e spark', 1.0),
(1, 'b d', 0.0),
(2, 'spark f g h', 1.0),
(3, 'hadoop mapreduce', 0.0)
],['id','text','label'])
# 2. Define the PipelineStages of the pipeline.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(ipnutCol=tokenizer.getOutputCol(),outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)
# 3. Arrange the PipelineStages in processing order and create a Pipeline.
pipeline = Pipeline(stages=[tokenizer,hashingTF,lr])
# 4. Train the model
model = pipeline.fit(training)
# 5. Build the test data
test = spark.createDataFrame([
(4, 'spark i j k'),
(5, 'i m n'),
(6, 'spark hadoop spark'),
(7, 'apache hadoop')
],['id', 'text'])
# 6. Call transform() on the trained PipelineModel so that the test data
# passes through the fitted pipeline in order and produces predictions
prediction = model.transform(test)
selected = prediction.select('id','text','probability','prediction')
for row in selected.collect():
    rid, text, prob, prediction = row
    print('({},{}) -> prob = {}, prediction={}'.format(rid, text, str(prob),prediction))
(4, spark i j k) --> prob=[0.155543713844,0.844456286156], prediction=1.000000
(5, l m n) --> prob=[0.830707735211,0.169292264789], prediction=0.000000
(6, spark hadoop spark) --> prob=[0.0696218406195,0.93037815938], prediction=1.000000
(7, apache hadoop) --> prob=[0.981518350351,0.018481649649], prediction=0.000000
The keyword argument is misspelled: you wrote ipnutCol instead of inputCol, which is what this line reports:
TypeError: __init__() got an unexpected keyword argument 'ipnutCol'
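The second message in the traceback (the AttributeError about '_java_obj') is most likely a follow-on of the first one rather than a separate problem. The plain-Python sketch below reproduces the pattern under that assumption: FakeWrapper and build are hypothetical stand-ins, not PySpark classes. The constructor rejects the misspelled keyword before it can set the attribute, and the destructor then fails when the half-built object is discarded.

```python
class FakeWrapper(object):
    """Hypothetical stand-in for a wrapper that sets an attribute in
    __init__ and reads it again in __del__."""

    def __init__(self, inputCol=None, outputCol=None):
        # Only reached when all keyword arguments are valid.
        self._java_obj = object()

    def __del__(self):
        # Raises AttributeError if __init__ never ran; Python reports it
        # on stderr as "Exception ignored in: ..." and carries on.
        self._java_obj


def build(**kwargs):
    """Try to construct the wrapper and return the error if it fails."""
    try:
        return FakeWrapper(**kwargs)
    except TypeError as exc:
        return exc


bad = build(ipnutCol="words", outputCol="features")   # misspelled keyword
good = build(inputCol="words", outputCol="features")  # correct keyword

print(type(bad).__name__, "-", bad)
print(type(good).__name__)
```

Once the keyword is spelled correctly, the constructor completes and the destructor has nothing to complain about, so both messages go away together.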

convert a 2d list to RDD[vector] or JavaRDD[vector] scala

I have a 2d list of integers and I would like to convert it to either RDD[vector] or JavaRDD[vector] in order to use the predict method of the SVM model in spark MLlib.
I have tried the following, in order to convert it to rdd. But it seems that this is not what I need.
val tuppleSlides = encoded.iterator.sliding(10).toList
val rdd = sc.parallelize(tuppleSlides)
Any idea what the command is to convert it to the right type?
Thank you in advance.
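If you want to train a model with MLlib, you will need an RDD[LabeledPoint]. Given your 2D list of data and a list of labels, you can create your RDD[LabeledPoint] as shown below. The session imports the org.apache.spark.ml classes; for the RDD-based MLlib API (for example SVMWithSGD), import org.apache.spark.mllib.linalg.Vectors and org.apache.spark.mllib.regression.LabeledPoint instead.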
scala> val labels = List(1.0, -1.0)
labels: List[Double] = List(1.0, -1.0)
scala> val myData = List(List(1d,2d), List(3d,4d))
myData: List[List[Double]] = List(List(1.0, 2.0), List(3.0, 4.0))
scala> import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.linalg.Vectors
scala> import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.feature.LabeledPoint
scala> val vectors = myData.map(x => Vectors.dense(x.toArray))
vectors: List[org.apache.spark.ml.linalg.Vector] = List([1.0,2.0], [3.0,4.0])
scala> val labPts = labels.zip(vectors).map{case (l, fV) => LabeledPoint(l, fV)}
labPts: List[org.apache.spark.ml.feature.LabeledPoint] = List((1.0,[1.0,2.0]), (-1.0,[3.0,4.0]))
scala> val myRDD = sc.parallelize(labPts)
myRDD: org.apache.spark.rdd.RDD[org.apache.spark.ml.feature.LabeledPoint] = ParallelCollectionRDD[0] at parallelize at <console>:34

convert bipartite graph to adjacency matrix spark scala

I'm trying to convert an edge list that is in the following format:
data = [('a', 'developer'),
('b', 'tester'),
('b', 'developer'),
('c','developer'),
('c', 'architect')]
where the adjacency matrix will be in the form of
   developer  tester  architect
a  1          0       0
b  1          1       0
c  1          0       1
I want to store the matrix in the following format
1 0 0
1 1 0
1 0 1
I've tried it using GraphX:
def pageHash(title:String ) = title.toLowerCase.replace(" ","").hashCode.toLong
val edges: RDD[Edge[String]] = sc.textFile("/user/query.csv").map { line =>
val row = line.split(",")
Edge(pageHash(row(0)), pageHash(row(1)), "1")
}
val graph: Graph[Int, String] = Graph.fromEdges(edges, defaultValue = 1)
I'm able to create the graph, but I'm not able to convert it to an adjacency matrix representation.
One possible approach is something like this:
Convert RDD to DataFrame:
val rdd = sc.parallelize(Seq(
("a", "developer"), ("b", "tester"), ("b", "developer"),
("c","developer"), ("c", "architect")))
val df = rdd.toDF("row", "col")
Index columns:
import org.apache.spark.ml.feature.StringIndexer
val indexers = Seq("row", "col").map(x =>
new StringIndexer().setInputCol(x).setOutputCol(s"${x}_idx").fit(df)
)
Transform data and create RDD[MatrixEntry]:
import org.apache.spark.sql.functions.lit
import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, CoordinateMatrix}
val entries = indexers.foldLeft(df)((df, idx) => idx.transform(df))
  .select($"row_idx", $"col_idx", lit(1.0))
  .as[MatrixEntry] // Spark 1.6. For < 1.5 map manually
  .rdd
Create matrix
new CoordinateMatrix(entries)
This matrix can be further converted to any other type of distributed matrix including RowMatrix and IndexedRowMatrix.
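To see what the index-then-fill steps produce, here is a small plain-Python sketch of the same idea on the question's edge list. It is not Spark code: index_labels is a hypothetical helper that numbers labels in order of first appearance, which happens to match the layout in the question. StringIndexer orders labels by frequency by default, so the row and column order of the Spark result may differ from this one.

```python
edges = [("a", "developer"), ("b", "tester"), ("b", "developer"),
         ("c", "developer"), ("c", "architect")]


def index_labels(labels):
    """Assign 0-based indices to labels in order of first appearance."""
    index = {}
    for label in labels:
        index.setdefault(label, len(index))
    return index


row_index = index_labels(row for row, _ in edges)
col_index = index_labels(col for _, col in edges)

# (i, j, value) triples, the local analogue of MatrixEntry
entries = [(row_index[row], col_index[col], 1) for row, col in edges]

# Fill a dense matrix from the coordinate entries.
matrix = [[0] * len(col_index) for _ in row_index]
for i, j, value in entries:
    matrix[i][j] = value

for row in matrix:
    print(" ".join(str(value) for value in row))
```

The coordinate triples are the part that carries over to Spark; the dense fill at the end is only practical when the matrix fits in local memory.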

Convert local Vectors to RDD[Vector]

I'm new to Spark and Scala and I'm trying to read its documentation on MLlib.
The tutorial on http://spark.apache.org/docs/1.4.0/mllib-data-types.html,
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix
val rows: RDD[Vector] = ... // an RDD of local vectors
// Create a RowMatrix from an RDD[Vector].
val mat: RowMatrix = new RowMatrix(rows)
// Get its size.
val m = mat.numRows()
val n = mat.numCols()
does not show how to construct an RDD[Vector] (variable rows) from a list of local vectors.
So for example, I have executed (as part of my exploration) in spark-shell
val v0: Vector = Vectors.dense(1.0, 0.0, 3.0)
val v1: Vector = Vectors.sparse(3, Array(1), Array(2.5))
val v2: Vector = Vectors.sparse(3, Seq((0, 1.5),(1, 1.8)))
which if 'merged' will look like this matrix
1.0 0.0 3.0
0.0 2.5 0.0
1.5 1.8 0.0
So, how do I transform Vectors v0, v1, v2 to rows?
You can use SparkContext.parallelize, which turns a local sequence into an RDD. Since you have already created the vectors, all you need to do is put them in a Seq and parallelize it, as shown below.
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
val v0 = Vectors.dense(1.0, 0.0, 3.0)
val v1 = Vectors.sparse(3, Array(1), Array(2.5))
val v2 = Vectors.sparse(3, Seq((0, 1.5), (1, 1.8)))
val rows = sc.parallelize(Seq(v0, v1, v2))
val mat: RowMatrix = new RowMatrix(rows)
// Get its size.
val m = mat.numRows()
val n = mat.numCols()
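As a quick check of what the resulting matrix holds, the plain-Python sketch below expands the three vectors into dense rows without Spark. The helper to_dense is hypothetical and only mimics how a sparse vector of a given size and (index, value) pairs is read; it is not part of MLlib.

```python
def to_dense(size, pairs):
    """Expand (index, value) pairs into a dense list of the given size."""
    row = [0.0] * size
    for index, value in pairs:
        row[index] = value
    return row


v0 = [1.0, 0.0, 3.0]                    # Vectors.dense(1.0, 0.0, 3.0)
v1 = to_dense(3, zip([1], [2.5]))       # Vectors.sparse(3, Array(1), Array(2.5))
v2 = to_dense(3, [(0, 1.5), (1, 1.8)])  # Vectors.sparse(3, Seq((0, 1.5), (1, 1.8)))

rows = [v0, v1, v2]
num_rows, num_cols = len(rows), len(rows[0])

for row in rows:
    print(row)
print(num_rows, num_cols)
```

The dimensions computed here are the ones numRows() and numCols() would be expected to report for the RowMatrix built from the same three vectors.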