Spark MLib ALS input and output domains - pyspark

I'm using Spark MLib ALS and trying to use the trainImplicit() interface to feed it the number of an item purchased by a user as the implicit preference. I don't know how to validate my model though. My input is in the domain [1, inf), but the output predictions seem to be floats in (0, 1).
The usual kind of code:
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating
from pyspark.sql import HiveContext
from pyspark import SparkContext
sc = SparkContext(appName="Quantity Prediction Model")
hive = HiveContext(sc)
d = hive.sql("select o.user_id as user, l.product_id as product, sum(l.quantity) as qty from order_line l join order_order o ON l.order_id = o.id group by o.user_id, l.product_id")
d.write.save('user_product_qty')
ratings = d.rdd.map(tuple)
testdata = ratings.map(lambda t: (t[0], t[1]))
for rank in (4, 8, 12):
model = ALS.trainImplicit(ratings, rank, 10, alpha=0.01)
predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
# Error is pretty bad because output raitings aren't in the same domain as quantity
ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).mean()
print("Rank: {} MSE: {}".format(rank, MSE))
Extra credit: When using train() what is the input/output domain? Are "ratings" some how expected to be on a five point scale? This isn't documented anywhere.

RMSE is not an ideal metric for implicit ALS (the original paper suggests a more elaborate evaluation technique). However, you can still apply RMSE to evaluate implicit ALS training results if you map your input ratings to (-1;1) as well.
See https://github.com/apache/spark/pull/597 for details.
Finally, some code to get you started (inpsired by MovieLens example for ALS from Spark):
// RMSE
logger.info(s"Calculating RMSE on ${testingSet.count()} ratings")
def groupRatings(rs: RDD[MLlibRating]): RDD[((Int, Int), Double)] =
rs.map { r => ((r.user, r.product), r.rating) }
// When using implicit ALS we should treat actual and predicted
// ratings as confidence levels. See also apache/spark#597.
//
// Predicted ratings are clamped to [0;1]
def clampPredicted(r: Double): Double =
math.max(math.min(r, 1.0), 0.0)
def clampActual(r: Double): Double = if (r > 0.0) 1.0 else 0.0
def sqr(x: Double): Double = x * x
val ratingSquaredErrors =
groupRatings(alsModel.predict(testingSet.map { r => (r.user, r.product) }))
.join(groupRatings(testingSet))
.map { case (_, (predictedRating, actualRating)) =>
sqr(clampPredicted(predictedRating) - clampActual(actualRating)) }
val rmse = sqrt(ratingSquaredErrors.mean())
logger.info(s"RMSE: ${rmse}")

Related

Increase of hash tables in MinHashLSH, decreases accuracy and f1

I have used MinHashLSH with approximateSimilarityJoin with Scala and Spark 2.4 to find edges between a network. Link prediction based on document similarity. My problem is that while I am increasing the hash tables in the MinHashLSH, my accuracy and F1 score are decreasing. All that I have already read for this algorithm shows me that I have an issue.
I have tried a different number of hash tables and I have provided different numbers of Jaccard similarity thresholds but I have the same exact problem, the accuracy is decreasing rapidly. I have also tried different samplings of my dataset and nothing changed. My workflow goes on like this: I am concatenating all the text columns of my dataframe, which includes title, authors, journal and abstract and next I am tokenizing the concatenated column into words. Then I am using a CountVectorizer to transform this "bag of words" into vectors. Next, I am providing this column in MinHashLSH with some hash tables and finaly I am doing an approximateSimilarityJoin to find similar "papers" which are under my given threshold. My implementation is the following.
import org.apache.spark.ml.feature._
import org.apache.spark.ml.linalg._
import UnsupervisedLinkPrediction.BroutForce.join
import org.apache.log4j.{Level, Logger}
import org.apache.spark.ml.Pipeline
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf, when}
import org.apache.spark.sql.types._
object lsh {
def main(args: Array[String]): Unit = {
Logger.getLogger("org").setLevel(Level.ERROR) // show only errors
// val cores=args(0).toInt
// val partitions=args(1).toInt
// val hashTables=args(2).toInt
// val limit = args(3).toInt
// val threshold = args(4).toDouble
val cores="*"
val partitions=1
val hashTables=16
val limit = 1000
val jaccardDistance = 0.89
val master = "local["+cores+"]"
val ss = SparkSession.builder().master(master).appName("MinHashLSH").getOrCreate()
val sc = ss.sparkContext
val inputFile = "resources/data/node_information.csv"
println("reading from input file: " + inputFile)
println
val schemaStruct = StructType(
StructField("id", IntegerType) ::
StructField("pubYear", StringType) ::
StructField("title", StringType) ::
StructField("authors", StringType) ::
StructField("journal", StringType) ::
StructField("abstract", StringType) :: Nil
)
// Read the contents of the csv file in a dataframe. The csv file contains a header.
// var papers = ss.read.option("header", "false").schema(schemaStruct).csv(inputFile).limit(limit).cache()
var papers = ss.read.option("header", "false").schema(schemaStruct).csv(inputFile).limit(limit).cache()
papers.repartition(partitions)
println("papers.rdd.getNumPartitions"+papers.rdd.getNumPartitions)
import ss.implicits._
// Read the original graph edges, ground trouth
val originalGraphDF = sc.textFile("resources/data/Cit-HepTh.txt").map(line => {
val fields = line.split("\t")
(fields(0), fields(1))
}).toDF("nodeA_id", "nodeB_id").cache()
val originalGraphCount = originalGraphDF.count()
println("Ground truth count: " + originalGraphCount )
val nullAuthor = ""
val nullJournal = ""
val nullAbstract = ""
papers = papers.na.fill(nullAuthor, Seq("authors"))
papers = papers.na.fill(nullJournal, Seq("journal"))
papers = papers.na.fill(nullAbstract, Seq("abstract"))
papers = papers.withColumn("nonNullAbstract", when(col("abstract") === nullAbstract, col("title")).otherwise(col("abstract")))
papers = papers.drop("abstract").withColumnRenamed("nonNullAbstract", "abstract")
papers.show(false)
val filteredGt= originalGraphDF.as("g").join(papers.as("p"),(
$"g.nodeA_id" ===$"p.id") || ($"g.nodeB_id" ===$"p.id")
).select("g.nodeA_id","g.nodeB_id").distinct().cache()
filteredGt.show()
val filteredGtCount = filteredGt.count()
println("Filtered GroundTruth count: "+ filteredGtCount)
//TOKENIZE
val tokPubYear = new Tokenizer().setInputCol("pubYear").setOutputCol("pubYear_words")
val tokTitle = new Tokenizer().setInputCol("title").setOutputCol("title_words")
val tokAuthors = new RegexTokenizer().setInputCol("authors").setOutputCol("authors_words").setPattern(",")
val tokJournal = new Tokenizer().setInputCol("journal").setOutputCol("journal_words")
val tokAbstract = new Tokenizer().setInputCol("abstract").setOutputCol("abstract_words")
println("Setting pipeline stages...")
val stages = Array(
tokPubYear, tokTitle, tokAuthors, tokJournal, tokAbstract
// rTitle, rAuthors, rJournal, rAbstract
)
val pipeline = new Pipeline()
pipeline.setStages(stages)
println("Transforming dataframe\n")
val model = pipeline.fit(papers)
papers = model.transform(papers)
println(papers.count())
papers.show(false)
papers.printSchema()
val udf_join_cols = udf(join(_: Seq[String], _: Seq[String], _: Seq[String], _: Seq[String], _: Seq[String]))
val joinedDf = papers.withColumn(
"paper_data",
udf_join_cols(
papers("pubYear_words"),
papers("title_words"),
papers("authors_words"),
papers("journal_words"),
papers("abstract_words")
)
).select("id", "paper_data").cache()
joinedDf.show(5,false)
val vocabSize = 1000000
val cvModel: CountVectorizerModel = new CountVectorizer().setInputCol("paper_data").setOutputCol("features").setVocabSize(vocabSize).setMinDF(10).fit(joinedDf)
val isNoneZeroVector = udf({v: Vector => v.numNonzeros > 0}, DataTypes.BooleanType)
val vectorizedDf = cvModel.transform(joinedDf).filter(isNoneZeroVector(col("features"))).select(col("id"), col("features"))
vectorizedDf.show()
val mh = new MinHashLSH().setNumHashTables(hashTables)
.setInputCol("features").setOutputCol("hashValues")
val mhModel = mh.fit(vectorizedDf)
mhModel.transform(vectorizedDf).show()
vectorizedDf.createOrReplaceTempView("vecDf")
println("MinHashLSH.getHashTables: "+mh.getNumHashTables)
val dfA = ss.sqlContext.sql("select id as nodeA_id, features from vecDf").cache()
dfA.show(false)
val dfB = ss.sqlContext.sql("select id as nodeB_id, features from vecDf").cache()
dfB.show(false)
val predictionsDF = mhModel.approxSimilarityJoin(dfA, dfB, jaccardDistance, "JaccardDistance").cache()
println("Predictions:")
val predictionsCount = predictionsDF.count()
predictionsDF.show()
println("Predictions count: "+predictionsCount)
predictionsDF.createOrReplaceTempView("predictions")
val pairs = ss.sqlContext.sql("select datasetA.nodeA_id, datasetB.nodeB_id, JaccardDistance from predictions").cache()
pairs.show(false)
val totalPredictions = pairs.count()
println("Properties:\n")
println("Threshold: "+threshold+"\n")
println("Hahs tables: "+hashTables+"\n")
println("Ground truth: "+filteredGtCount)
println("Total edges found: "+totalPredictions +" \n")
println("EVALUATION PROCESS STARTS\n")
println("Calculating true positives...\n")
val truePositives = filteredGt.as("g").join(pairs.as("p"),
($"g.nodeA_id" === $"p.nodeA_id" && $"g.nodeB_id" === $"p.nodeB_id") || ($"g.nodeA_id" === $"p.nodeB_id" && $"g.nodeB_id" === $"p.nodeA_id")
).cache().count()
println("True Positives: "+truePositives+"\n")
println("Calculating false positives...\n")
val falsePositives = predictionsCount - truePositives
println("False Positives: "+falsePositives+"\n")
println("Calculating true negatives...\n")
val pairsPerTwoCount = (limit *(limit - 1)) / 2
val trueNegatives = (pairsPerTwoCount - truePositives) - falsePositives
println("True Negatives: "+trueNegatives+"\n")
val falseNegatives = filteredGtCount - truePositives
println("False Negatives: "+falseNegatives)
val truePN = (truePositives+trueNegatives).toFloat
println("TP + TN sum: "+truePN+"\n")
val sum = (truePN + falseNegatives+ falsePositives).toFloat
println("TP +TN +FP+ FN sum: "+sum+"\n")
val accuracy = (truePN/sum).toFloat
println("Accuracy: "+accuracy+"\n")
val precision = truePositives.toFloat / (truePositives+falsePositives).toFloat
val recall = truePositives.toFloat/(truePositives+falseNegatives).toFloat
val f1Score = 2*(recall*precision)/(recall+precision).toFloat
println("F1 score: "+f1Score+"\n")
ss.stop()
I forget to tell you that I am running this code in a cluster with 40 cores and 64g of RAM. Note that approximate similarity join (Spark's implementation) works with JACCARD DISTANCE and not with JACCARD INDEX. So I provide as a similarity threshold the JACCARD DISTANCE which for my case is jaccardDistance = 1 - threshold. (threshold = Jaccard Index ).
I was expecting to get higher accuracy and f1 score while I am increasing the hash tables. Do you have any idea about my issue?
Thank all of you in advance!
There are multiple visible problems here, and probably more hidden, so just to enumerate a few:
LSH is not really a classifier and attempt to evaluate it as one doesn't make much sense, even if you assume that text similarity is somehow a proxy for citation (which is big if).
If the problem was to be framed as classification problem it should be treated as multi-label classification (each paper can cite or be cited by multiple sources) not multi-class classification, hence simple accuracy is not meaningful.
Even if it was a classification and could be evaluated as such your calculations don't include actual negatives, which don't meet the threshold of the approxSimilarityJoin
Also setting threshold to 1 restricts joins to either exact matches or cases of hash collisions - hence preference towards LSH with higher collisions rates.
Additionally:
Text processing approach you took is rather pedestrian and prefers non-specific features (remember you don't optimize your actual goal, but text similarity).
Such approach, especially treating everything as equal, discards majority of useful information in the set primarily, but not limited to, temporal relationships..

Mean Squared Error (MSE) returns a huge number

I'm new in Scala and Spark in general. I'm using this code for Regression (based on this link Spark official site):
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.LinearRegressionModel
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
// Load and parse the data
val data = sc.textFile("Year100")
val parsedData = data.map { line =>
val parts = line.split(',')
LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()
// Building the model
val numIterations = 100
val stepSize = 0.00000001
val model = LinearRegressionWithSGD.train(parsedData, numIterations,stepSize )
// Evaluate model on training examples and compute training error
val valuesAndPreds = parsedData.map { point =>
val prediction = model.predict(point.features)
(point.label, prediction)
}
val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()
println("training Mean Squared Error = " + MSE)
The dataset that I'm using can be seen here: Pastebin link.
So my question is: why MSE equals as 889717.74 (which is a huge number)?
Edit: As the commentators suggested, I tried these:
1) I changed the step to default and the MSE now returns as NaN
2) If I try this constructor:
LinearRegressionWithSGD.train(parsedData, numIterations,stepSize,intercept=True) the spark-shell returns an error (error: not found:value True)
You've passed a tiny step size and capped the number of iterations at 100. The maximum value by which your parameters can change is 0.00000001 * 100 = 0.000001. Try using the default step size, I imagine that will fix it.

Retrieving not only top one predictions from Multiclass Regression with Spark [duplicate]

I'm running a Bernoulli Naive Bayes using code:
val splits = MyData.randomSplit(Array(0.75, 0.25), seed = 2L)
val training = splits(0).cache()
val test = splits(1)
val model = NaiveBayes.train(training, lambda = 3.0, modelType = "bernoulli")
My question is how can I get the probability of membership to class 0 (or 1) and count AUC. I want to get similar result to LogisticRegressionWithSGD or SVMWithSGD where I was using this code:
val numIterations = 100
val model = SVMWithSGD.train(training, numIterations)
model.clearThreshold()
// Compute raw scores on the test set.
val labelAndPreds = test.map { point =>
val prediction = model.predict(point.features)
(prediction, point.label)
}
// Get evaluation metrics.
val metrics = new BinaryClassificationMetrics(labelAndPreds)
val auROC = metrics.areaUnderROC()
Unfortunately this code isn't working for NaiveBayes.
Concerning the probabilities for Bernouilli Naive Bayes, here is an example :
// Building dummy data
val data = sc.parallelize(List("0,1 0 0", "1,0 1 0", "1,0 0 1", "0,1 0 1","1,1 1 0"))
// Transforming dummy data into LabeledPoint
val parsedData = data.map { line =>
val parts = line.split(',')
LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}
// Prepare data for training
val splits = parsedData.randomSplit(Array(0.75, 0.25), seed = 2L)
val training = splits(0).cache()
val test = splits(1)
val model = NaiveBayes.train(training, lambda = 3.0, modelType = "bernoulli")
// labels
val labels = model.labels
// Probabilities for all feature vectors
val features = parsedData.map(lp => lp.features)
model.predictProbabilities(features).take(10) foreach println
// For one specific vector, I'm taking the first vector in the parsedData
val testVector = parsedData.first.features
println(s"For vector ${testVector} => probability : ${model.predictProbabilities(testVector)}")
As for the AUC :
// Compute raw scores on the test set.
val labelAndPreds = test.map { point =>
val prediction = model.predict(point.features)
(prediction, point.label)
}
// Get evaluation metrics.
val metrics = new BinaryClassificationMetrics(labelAndPreds)
val auROC = metrics.areaUnderROC()
Concerning the inquiry from the chat :
val results = parsedData.map { lp =>
val probs: Vector = model.predictProbabilities(lp.features)
(for (i <- 0 to (probs.size - 1)) yield ((lp.label, labels(i), probs(i))))
}.flatMap(identity)
results.take(10).foreach(println)
// (0.0,0.0,0.59728640251696)
// (0.0,1.0,0.40271359748304003)
// (1.0,0.0,0.2546873180388961)
// (1.0,1.0,0.745312681961104)
// (1.0,0.0,0.47086939671877026)
// (1.0,1.0,0.5291306032812298)
// (0.0,0.0,0.6496075621805428)
// (0.0,1.0,0.3503924378194571)
// (1.0,0.0,0.4158585282373076)
// (1.0,1.0,0.5841414717626924)
and if you are only interested in the argmax classes :
val results = training.map { lp => val probs: Vector = model.predictProbabilities(lp.features)
val bestClass = probs.argmax
(labels(bestClass), probs(bestClass))
}
results.take(10) foreach println
// (0.0,0.59728640251696)
// (1.0,0.745312681961104)
// (1.0,0.5291306032812298)
// (0.0,0.6496075621805428)
// (1.0,0.5841414717626924)
Note: Works with Spark 1.5+
EDIT: (for Pyspark users)
It seems like some are having troubles getting probabilities using pyspark and mllib. Well that's normal, spark-mllib doesn't present that function for pyspark.
Thus you'll need to use the spark-ml DataFrame-based API :
from pyspark.sql import Row
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import NaiveBayes
df = spark.createDataFrame([
Row(label=0.0, features=Vectors.dense([0.0, 0.0])),
Row(label=0.0, features=Vectors.dense([0.0, 1.0])),
Row(label=1.0, features=Vectors.dense([1.0, 0.0]))])
nb = NaiveBayes(smoothing=1.0, modelType="bernoulli")
model = nb.fit(df)
model.transform(df).show(truncate=False)
# +---------+-----+-----------------------------------------+----------------------------------------+----------+
# |features |label|rawPrediction |probability |prediction|
# +---------+-----+-----------------------------------------+----------------------------------------+----------+
# |[0.0,0.0]|0.0 |[-1.4916548767777167,-2.420368128650429] |[0.7168141592920354,0.28318584070796465]|0.0 |
# |[0.0,1.0]|0.0 |[-1.4916548767777167,-3.1135153092103742]|[0.8350515463917526,0.16494845360824742]|0.0 |
# |[1.0,0.0]|1.0 |[-2.5902671654458262,-1.7272209480904837]|[0.29670329670329676,0.7032967032967034]|1.0 |
# +---------+-----+-----------------------------------------+----------------------------------------+----------+
You'll just need to select your prediction column and compute your AUC.
For more information about Naive Bayes in spark-ml, please refer to the official documentation here.

Spark - LinearRegressionWithSGD on Coursera Machine Learning by Stanford University samples

Software Version: Apache Spark v1.3
Context: I've been trying to "translate" Octave/MATLAB code to Scala on Apache Spark. More precisely, I work on ex1data1.txt and ex1data2.txt from coursera practical part ex1. I've made such translation into Julia lang (it went smoothly) and now I've been struggling with Spark...without success.
Problem: Performance of my implementation on Spark is very poor. I cannot even say it works correctly. That's why for ex1data1.txt I added polynomial feature, and I also worked with: theta0 using setIntercept(true) and with extra non-normalized column of 1.0 values(in this case I set Intercept to false). I receive only silly results.
So, then I 've decided to start working with ex1data2.txt. Below you can find the code and the expected result. Of course Spark result is wrong.
Did you have similar experience? I will be grateful for your help.
The Scala code for the exd1data2.txt:
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.SquaredL2Updater
import org.apache.spark.mllib.regression.{LinearRegressionModel, LinearRegressionWithSGD, LabeledPoint}
import org.apache.spark.{SparkContext, SparkConf}
object MLibOnEx1data2 extends App {
val conf = new SparkConf()
conf.set("spark.app.name", "coursera ex1data2.txt test")
val sc = new SparkContext(conf)
val input = sc.textFile("hdfs:///ex1data2.txt")
val trainData = input.map { line =>
val parts = line.split(',')
val y = parts(2).toDouble
val features = Vectors.dense(parts(0).toDouble, parts(1).toDouble)
println(s"x = $features y = $y")
LabeledPoint(y, features)
}.cache()
// Building the model
val numIterations = 1500
val alpha = 0.01
// Scale the features
val scaler = new StandardScaler(withMean = true, withStd = true)
.fit(trainData.map(x => x.features))
val scaledTrainData = trainData.map{ td =>
val normFeatures = scaler.transform(td.features)
println(s"normalized features = $normFeatures")
LabeledPoint(td.label, normFeatures)
}.cache()
val tsize = scaledTrainData.count()
println(s"Training set size is $tsize")
val alg = new LinearRegressionWithSGD().setIntercept(true)
alg.optimizer
.setNumIterations(numIterations)
.setStepSize(alpha)
.setUpdater(new SquaredL2Updater)
.setRegParam(0.0) //regularization - off
val model = alg.run(scaledTrainData)
println(s"Theta is $model.weights")
val total1 = model.predict(scaler.transform(Vectors.dense(1650, 3)))
println(s"Estimate the price of a 1650 sq-ft, 3 br house = $total1 dollars") //it should give ~ $289314.620338
// Evaluate model on training examples and compute training error
val valuesAndPreds = scaledTrainData.map { point =>
val prediction = model.predict(point.features)
(point.label, prediction)
}
val MSE = ((valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()) / 2)
println("Training Mean Squared Error = " + MSE)
// Save and load model
val trySaveAndLoad = util.Try(model.save(sc, "myModelPath"))
.flatMap { _ => util.Try(LinearRegressionModel.load(sc, "myModelPath")) }
.getOrElse(-1)
println(s"trySaveAndLoad result is $trySaveAndLoad")
}
STDOUT result is:
Training set size is 47
Theta is (weights=[52090.291641275864,19342.034885388926],
intercept=181295.93717032953).weights
Estimate the price of a 1650 sq-ft, 3 br house = 153983.5541846754
dollars
Training Mean Squared Error = 1.5876093757127676E10
trySaveAndLoad result is -1
Well, after some digging I believe there is nothing here. First I saved content of the valuesAndPreds to text file:
valuesAndPreds.map{
case {x, y} => s"$x,$y"}.repartition(1).saveAsTextFile("results.txt")'
Rest of the code is written in R.
First lets create a model using closed form solution:
# Load data
df <- read.csv('results.txt/ex1data2.txt', header=FALSE)
# Scale features
df[, 1:2] <- apply(df[, 1:2], 2, scale)
# Build linear model
model <- lm(V3 ~ ., df)
For reference:
> summary(model)
Call:
lm(formula = V3 ~ ., data = df)
Residuals:
Min 1Q Median 3Q Max
-130582 -43636 -10829 43698 198147
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 340413 9637 35.323 < 2e-16 ***
V1 110631 11758 9.409 4.22e-12 ***
V2 -6650 11758 -0.566 0.575
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 66070 on 44 degrees of freedom
Multiple R-squared: 0.7329, Adjusted R-squared: 0.7208
F-statistic: 60.38 on 2 and 44 DF, p-value: 2.428e-13
Now prediction:
closedFormPrediction <- predict(model, df)
closedFormRMSE <- sqrt(mean((closedFormPrediction - df$V3)**2))
plot(
closedFormPrediction, df$V3,
ylab="Actual", xlab="Predicted",
main=paste("Closed form, RMSE: ", round(closedFormRMSE, 3)))
'
Now we can compare above to SGD results:
sgd <- read.csv('results.txt/part-00000', header=FALSE)
sgdRMSE <- sqrt(mean(sgd$V2 - sgd$V1)**2)
plot(
sgd$V2, sgd$V1, ylab="Actual",
xlab="Predicted", main=paste("SGD, RMSE: ", round(sgdRMSE, 3)))
Finally lets compare both:
plot(
sgd$V2, closedFormPrediction,
xlab="SGD", ylab="Closed form", main="SGD vs Closed form")
So, result are clearly not perfect but nothing seems to be completely off here.

How to set preferences for ALS implicit feedback in Collaborative Filtering?

I am trying to use Spark MLib ALS with implicit feedback for collaborative filtering. Input data has only two fields userId and productId. I have no product ratings, just info on what products users have bought, that's all. So to train ALS I use:
def trainImplicit(ratings: RDD[Rating], rank: Int, iterations: Int): MatrixFactorizationModel
(http://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.mllib.recommendation.ALS$)
This API requires Rating object:
Rating(user: Int, product: Int, rating: Double)
On the other hand documentation on trainImplicit tells: Train a matrix factorization model given an RDD of 'implicit preferences' ratings given by users to some products, in the form of (userID, productID, preference) pairs.
When I set rating / preferences to 1 as in:
val ratings = sc.textFile(new File(dir, file).toString).map { line =>
val fields = line.split(",")
// format: (randomNumber, Rating(userId, productId, rating))
(rnd.nextInt(100), Rating(fields(0).toInt, fields(1).toInt, 1.0))
}
val training = ratings.filter(x => x._1 < 60)
.values
.repartition(numPartitions)
.cache()
val validation = ratings.filter(x => x._1 >= 60 && x._1 < 80)
.values
.repartition(numPartitions)
.cache()
val test = ratings.filter(x => x._1 >= 80).values.cache()
And then train ALSL:
val model = ALS.trainImplicit(ratings, rank, numIter)
I get RMSE 0.9, which is a big error in case of preferences taking 0 or 1 value:
val validationRmse = computeRmse(model, validation, numValidation)
/** Compute RMSE (Root Mean Squared Error). */
def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], n: Long): Double = {
val predictions: RDD[Rating] = model.predict(data.map(x => (x.user, x.product)))
val predictionsAndRatings = predictions.map(x => ((x.user, x.product), x.rating))
.join(data.map(x => ((x.user, x.product), x.rating)))
.values
math.sqrt(predictionsAndRatings.map(x => (x._1 - x._2) * (x._1 - x._2)).reduce(_ + _) / n)
}
So my question is: to what value should I set rating in:
Rating(user: Int, product: Int, rating: Double)
for implicit training (in ALS.trainImplicit method) ?
Update
With:
val alpha = 40
val lambda = 0.01
I get:
Got 1895593 ratings from 17471 users on 462685 products.
Training: 1136079, validation: 380495, test: 379019
RMSE (validation) = 0.7537217888106758 for the model trained with rank = 8 and numIter = 10.
RMSE (validation) = 0.7489005441881798 for the model trained with rank = 8 and numIter = 20.
RMSE (validation) = 0.7387672873747732 for the model trained with rank = 12 and numIter = 10.
RMSE (validation) = 0.7310003522283959 for the model trained with rank = 12 and numIter = 20.
The best model was trained with rank = 12, and numIter = 20, and its RMSE on the test set is 0.7302343904091481.
baselineRmse: 0.0 testRmse: 0.7302343904091481
The best model improves the baseline by -Infinity%.
Which is still a big error, I guess. Also I get strange baseline improvement where baseline model is simply mean (1).
You can specify the alpha confidence level. Default is 1.0: but try lower.
val alpha = 0.01
val model = ALS.trainImplicit(ratings, rank, numIterations, alpha)
Let us know how that goes.
'rating' in class Rating is a double value.
According to
http://apache-spark-user-list.1001560.n3.nabble.com/ALS-implicit-error-pyspark-td16595.html,
'rating' can be value > 1.
And according to
https://docs.prediction.io/templates/recommendation/training-with-implicit-preference/,
'rating' can be the number of observations for given user+item
It's even more strange since you are training with the whole set instead of using the "training" subset
How well distributed is the original data? do you have a large number of items without a preference or items that are preferred a lot?
In "Collaborative Filtering for Implicit Feedback Datasets" the alpha used is 40, you might want to experiment with different values but