I have followed the guide given in the link
http://ampcamp.berkeley.edu/big-data-mini-course/movie-recommendation-with-mllib.html
But this is outdated as it uses spark Mlib RDD approach. The New Spark 2.0 has DataFrame approach.
Now My problem is I have got the updated code
val ratings = spark.read.textFile("data/mllib/als/sample_movielens_ratings.txt")
.map(parseRating)
.toDF()
val Array(training, test) = ratings.randomSplit(Array(0.8, 0.2))
// Build the recommendation model using ALS on the training data
val als = new ALS()
.setMaxIter(5)
.setRegParam(0.01)
.setUserCol("userId")
.setItemCol("movieId")
.setRatingCol("rating")
val model = als.fit(training)
// Evaluate the model by computing the RMSE on the test data
val predictions = model.transform(test)
Now Here is the problem, In the old code the model that was obtained was a MatrixFactorizationModel, Now it has its own model(ALSModel)
In MatrixFactorizationModel you could directly do
val recommendations = bestModel.get
.predict(userID)
Which will give the list of products with highest probability of user liking them.
But Now there is no .predict method. Any Idea how to recommend a list of products given a user Id
Use transform method on model:
import spark.implicits._
val dataFrameToPredict = sparkContext.parallelize(Seq((111, 222)))
.toDF("userId", "productId")
val predictionsOfProducts = model.transform (dataFrameToPredict)
There's a jira ticket to implement recommend(User|Product) method, but it's not yet on default branch
Now you have DataFrame with score for user
You can simply use orderBy and limit to show N recommended products:
// where is for case when we have big DataFrame with many users
model.transform (dataFrameToPredict.where('userId === givenUserId))
.select ('productId, 'prediction)
.orderBy('prediction.desc)
.limit(N)
.map { case Row (productId: Int, prediction: Double) => (productId, prediction) }
.collect()
DataFrame dataFrameToPredict can be some large user-product DataFrame, for example all users x all products
The ALS Model in Spark contains the following helpful methods:
recommendForAllItems(int numUsers)
Returns top numUsers users recommended for each item, for all items.
recommendForAllUsers(int numItems)
Returns top numItems items recommended for each user, for all users.
recommendForItemSubset(Dataset<?> dataset, int numUsers)
Returns top numUsers users recommended for each item id in the input data set.
recommendForUserSubset(Dataset<?> dataset, int numItems)
Returns top numItems items recommended for each user id in the input data set.
e.g. Python
from pyspark.ml.recommendation import ALS
from pyspark.sql.functions import explode
alsEstimator = ALS()
(alsEstimator.setRank(1)
.setUserCol("user_id")
.setItemCol("product_id")
.setRatingCol("rating")
.setMaxIter(20)
.setColdStartStrategy("drop"))
alsModel = alsEstimator.fit(productRatings)
recommendForSubsetDF = alsModel.recommendForUserSubset(TargetUsers, 40)
recommendationsDF = (recommendForSubsetDF
.select("user_id", explode("recommendations")
.alias("recommendation"))
.select("user_id", "recommendation.*")
)
display(recommendationsDF)
e.g. Scala:
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.functions.explode
val alsEstimator = new ALS().setRank(1)
.setUserCol("user_id")
.setItemCol("product_id")
.setRatingCol("rating")
.setMaxIter(20)
.setColdStartStrategy("drop")
val alsModel = alsEstimator.fit(productRatings)
val recommendForSubsetDF = alsModel.recommendForUserSubset(sampleTargetUsers, 40)
val recommendationsDF = recommendForSubsetDF
.select($"user_id", explode($"recommendations").alias("recommendation"))
.select($"user_id", $"recommendation.*")
display(recommendationsDF)
Here is what I did to get recommendations for a specific user with spark.ml:
import com.github.fommil.netlib.BLAS.{getInstance => blas}
userFactors.lookup(userId).headOption.fold(Map.empty[String, Float]) { user =>
val ratings = itemFactors.map { case (id, features) =>
val rating = blas.sdot(features.length, user, 1, features, 1)
(id, rating)
}
ratings.sortBy(_._2).take(numResults).toMap
}
Both userFactors and itemFactors in my case are RDD[(String, Array[Float])] but you should be able to do something similar with DataFrames.
Related
Give following code
case class Contact(name: String, phone: String)
case class Person(name: String, ts:Long, contacts: Seq[Contact])
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import sqlContext.implicits._
val people = sqlContext.read.format("orc").load("people")
What is the best way to dedupe users by its timestamp
So the user with max ts will stay at collection?
In spark using RDD I would run something like this
rdd.reduceByKey(_ maxTS _)
and would add the maxTS method to Person or add implicits ...
def maxTS(that: Person):Person =
that.ts > ts match {
case true => that
case false => this
}
Is it possible to do the same at DataFrames? and will that be the similar performance?
We are using spark 1.6
You can use Window functions, I'm assuming that the key is name:
import org.apache.spark.sql.functions.{rowNumber, max, broadcast}
import org.apache.spark.sql.expressions.Window
val df = // convert to DataFrame
val win = Window.partitionBy('name).orderBy('ts.desc)
df.withColumn("personRank", rowNumber.over(win))
.where('personRank === 1).drop("personRank")
For each person it will create personRank - each person with given name will have unique number, person with the latest ts will have the lowest rank, equal to 1. The you drop temporary rank
You can do a groupBy and use your preferred aggregation method like sum, max etc.
df.groupBy($"name").agg(sum($"tx").alias("maxTS"))
I have the below data which needed to be sorted using spark(scala) in such a way that, I only need id of the person who visited "Walmart" but not "Bestbuy". store might be repetitive because a person can visit the store any number of times.
Input Data:
id, store
1, Walmart
1, Walmart
1, Bestbuy
2, Target
3, Walmart
4, Bestbuy
Output Expected:
3, Walmart
I have got the output using dataFrames and running SQL queries on spark context. But is there any way to do this using groupByKey/reduceByKey etc without dataFrames. Can someone help me with the code, After map-> groupByKey, a ShuffleRDD has been formed and I am facing difficulty in filtering the CompactBuffer!
The code with which I got it using sqlContext is below:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD
case class Person(id: Int, store: String)
val people = sc.textFile("examples/src/main/resources/people.txt")
.map(_.split(","))
.map(p => Person(p(1)trim.toInt, p(1)))
people.registerTempTable("people")
val result = sqlContext.sql("select id, store from people left semi join (select id from people where store in('Walmart','Bestbuy') group by id having count(distinct store)=1) sample on people.id=sample.id and people.url='Walmart'")
The code which I am trying now is this, but I am struck after the third step:
val data = sc.textFile("examples/src/main/resources/people.txt")
.map(x=> (x.split(",")(0),x.split(",")(1)))
.filter(!_.filter("id"))
val dataGroup = data.groupByKey()
val dataFiltered = dataGroup.map{case (x,y) =>
val url = y.flatMap(x=> x.split(",")).toList
if (!url.contains("Bestbuy") && url.contains("Walmart")){
x.map(x=> (x,y))}}
if I do dataFiltered.collect(), I am getting
Array[Any] = Array(Vector((3,Walmart)), (), ())
Please help me how to extract the output after this step
To filter an RDD, just use RDD.filter:
val dataGroup = data.groupByKey()
val dataFiltered = dataGroup.filter {
// keep only lists that contain Walmart but do not contain Bestbuy:
case (x, y) => val l = y.toList; l.contains("Walmart") && !l.contains("Bestbuy")
}
dataFiltered.foreach(println) // prints: (3,CompactBuffer(Walmart))
// if you want to flatten this back to tuples of (id, store):
val result = dataFiltered.flatMap { case (id, stores) => stores.map(store => (id, store)) }
result.foreach(println) // prints: (3, Walmart)
I also tried it another way and it worked out
val data = sc.textFile("examples/src/main/resources/people.txt")
.filter(!_.filter("id"))
.map(x=> (x.split(",")(0),x.split(",")(1)))
data.cache()
val dataWalmart = data.filter{case (x,y) => y.contains("Walmart")}.distinct()
val dataBestbuy = data.filter{case (x,y) => y.contains("Bestbuy")}.distinct()
val result = dataWalmart.subtractByKey(dataBestbuy)
data.uncache()
Not a duplicate of this because I'm asking about what the input is, not what function to call, see below
I followed this guide to create an LDA model in Spark 1.5. I saw in this question that to get the topic distribution of a new document I need to use the LocalLDAModel's topicDistributions function which takes an RDD[(Long, Vector)].
Should the new document vector be a term-count vector? That is the type of vector the LDA is trained with. My code compiles and runs but I'd like to know if this is the intended use of the topicDistributions function
import org.apache.spark.rdd._
import org.apache.spark.mllib.clustering.{LDA, DistributedLDAModel, LocalLDAModel}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import scala.collection.mutable
val input = Seq("this is a document","this could be another document","these are training, not tests", "here is the final file (document)")
val corpus: RDD[Array[String]] = sc.parallelize(input.map{
doc => doc.split("\\s")
})
val termCounts: Array[(String, Long)] = corpus.flatMap(_.map(_ -> 1L)).reduceByKey(_ + _).collect().sortBy(-_._2)
val vocabArray: Array[String] = termCounts.takeRight(termCounts.size).map(_._1)
val vocab: Map[String, Int] = vocabArray.zipWithIndex.toMap
// Convert documents into term count vectors
val documents: RDD[(Long, Vector)] =
corpus.zipWithIndex.map { case (tokens, id) =>
val counts = new mutable.HashMap[Int, Double]()
tokens.foreach { term =>
if (vocab.contains(term)) {
val idx = vocab(term)
counts(idx) = counts.getOrElse(idx, 0.0) + 1.0
}
}
(id, Vectors.sparse(vocab.size, counts.toSeq))
}
// Set LDA parameters
val numTopics = 10
val ldaModel: DistributedLDAModel = new LDA().setK(numTopics).setMaxIterations(20).run(documents).asInstanceOf[DistributedLDAModel]
//create test input, convert to term count, and get its topic distribution
val test_input = Seq("this is my test document")
val test_document:RDD[(Long,Vector)] = sc.parallelize(test_input.map(doc=>doc.split("\\s"))).zipWithIndex.map{ case (tokens, id) =>
val counts = new mutable.HashMap[Int, Double]()
tokens.foreach { term =>
if (vocab.contains(term)) {
val idx = vocab(term)
counts(idx) = counts.getOrElse(idx, 0.0) + 1.0
}
}
(id, Vectors.sparse(vocab.size, counts.toSeq))
}
println("test_document: "+test_document.first._2.toArray.mkString(", "))
val localLDAModel: LocalLDAModel = ldaModel.toLocal
val topicDistributions = localLDAModel.topicDistributions(documents)
println("first topic distribution:"+topicDistributions.first._2.toArray.mkString(", "))
According to the Spark src, I notice the following comment regarding the documents parameter:
* #param documents:
* RDD of documents, which are term (word) count vectors paired with IDs.
* The term count vectors are "bags of words" with a fixed-size vocabulary
* (where the vocabulary size is the length of the vector).
* This must use the same vocabulary (ordering of term counts) as in training.
* Document IDs must be unique and >= 0.
So the answer is yes, the new document vector should be a term count vector. Further, the vector ordering should be the same as was used in training.
I am trying to implement ALS on Spark. I have used ml class instead of mllib because the CSV file contains String in one column. Rating class in mllib do not accept String as a parameter.
I want to use predict function from org.apache.spark.mllib.recommendation.MatrixFactorizationModel class but while running it is searching in org.apache.spark.rdd.RDD.
This is the code I am using.
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.ml.recommendation.ALS.Rating
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
object LoadCsv{
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("Load CSV")
val sc = new SparkContext(conf)
println("READING FILE...............................");
val data = sc.textFile("file.csv")
val ratings = data.map(_.split(',') match { case Array(user, item, rate) =>
Rating[String](user, item, rate.toFloat)
})
//val (userFactors, itemFactors) = ALS.train(ratings)
//Build the recommendation model using ALS
val rank = 10
val numIterations = 10
val model = ALS.train(ratings, rank, numIterations)
// Evaluate the model on rating data
val usersProducts = ratings.map { case Rating(user, product, rate) =>
(user, product)
}
// GETTING ERROR OVER HERE.
val predictions =
model.predict(usersProducts).map { case Rating(user, product, rate) =>
((user, product), rate)
}
val ratesAndPreds = ratings.map { case Rating(user, product, rate) =>
((user, product), rate)
}.join(predictions)
val MSE = ratesAndPreds.map { case ((user, product), (r1, r2)) =>
val err = (r1 - r2)
err * err
}.mean()
println("Mean Squared Error = " + MSE)
// Save and load model
//model.save(sc, "/home/shishir/spark-Projects/op")
//val sameModel = MatrixFactorizationModel.load(sc, "target/tmp/myCollaborativeFilter")
// $example off$
}
}
On running the code, I am getting this error:
LoadCsv.scala:34: value predict is not a member of (org.apache.spark.rdd.RDD[(String, Array[Float])], org.apache.spark.rdd.RDD[(String, Array[Float])])
[error] model.predict(usersProducts).map { case Rating(user, product, rate) =>
Your imports are "incorrect", you are using this:
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.ml.recommendation.ALS.Rating
When you should be using this:
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.Rating
You can use the other package, it's just that the result won't be a model but (as the error says) and RDD.
You can read up online why there are two ML packages (from what I remember the mllib package is the older one and contains some design flaws so they reimplement in ml so they can use pipelines).
It looks like you're mixing MLLib and ML style approach. If your data uses supported ID types (it doesn't look like it is the case here) you can use MLLib implementation:
import org.apache.spark.mllib.recommendation.{ALS => OldALS}
import org.apache.spark.mllib.recommendation.{
MatrixFactorizationModel => OldModel}
import org.apache.spark.mllib.recommendation.{Rating => OldRating}
val ratings: RDD[OldRating] = ???
val model: OldModel = OldALS()
.setAlpha(0.01)
.setIterations(numIterations)
.setRank(rank)
.run(ratings)
If your data uses non-standard ID and you want an access to user friendly API it is better to use DataFrames:
val ratings: RDD[org.apache.spark.ml.recommendation.ALS.Rating[String]] = ???
val df = ratings.toDF
val als: org.apache.spark.ml.recommendation.ALS = new ALS()
.setAlpha(0.01)
.setMaxIter(numIterations)
.setRank(rank)
val model: org.apache.spark.ml.recommendation.ALSModel = als.fit(df)
Finally you can use your current approach but you'll have to operate on user factors and item factors directly without helpers like predict.
I am relatively new to both spark and scala.
I was trying to implement collaborative filtering using scala on spark.
Below is the code
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.Rating
val data = sc.textFile("/user/amohammed/CB/input-cb.txt")
val distinctUsers = data.map(x => x.split(",")(0)).distinct().map(x => x.toInt)
val distinctKeywords = data.map(x => x.split(",")(1)).distinct().map(x => x.toInt)
val ratings = data.map(_.split(',') match {
case Array(user, item, rate) => Rating(user.toInt,item.toInt, rate.toDouble)
})
val model = ALS.train(ratings, 1, 20, 0.01)
val keywords = distinctKeywords collect
distinctUsers.map(x => {(x, keywords.map(y => model.predict(x,y)))}).collect()
It throws a scala.MatchError: null
org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:571) at the last line
Thw code works fine if I collect the distinctUsers rdd into an array and execute the same code:
val users = distinctUsers collect
users.map(x => {(x, keywords.map(y => model.predict(x, y)))})
Where am I getting it wrong when dealing with RDDs?
Spark Version : 1.0.0
Scala Version : 2.10.4
Going one call further back in the stack trace, line 43 of the MatrixFactorizationModel source says:
val userVector = new DoubleMatrix(userFeatures.lookup(user).head)
Note that the userFeatures field of model is itself another RDD; I believe it isn't getting serialized properly when the anonymous function block closes over model, and thus the lookup method on it is failing. I also tried placing both model and keywords into broadcast variables, but that didn't work either.
Instead of falling back to Scala collections and losing the benefits of Spark, it's probably better to stick with RDDs and take advantage of other ways of transforming them.
I'd start with this:
val ratings = data.map(_.split(',') match {
case Array(user, keyword, rate) => Rating(user.toInt, keyword.toInt, rate.toDouble)
})
// instead of parsing the original RDD's strings three separate times,
// you can map the "user" and "product" fields of the Rating case class
val distinctUsers = ratings.map(_.user).distinct()
val distinctKeywords = ratings.map(_.product).distinct()
val model = ALS.train(ratings, 1, 20, 0.01)
Then, instead of calculating each prediction one by one, we can obtain the Cartesian product of all possible user-keyword pairs as an RDD and use the other predict method in MatrixFactorizationModel, which takes an RDD of such pairs as its argument.
val userKeywords = distinctUsers.cartesian(distinctKeywords)
val predictions = model.predict(userKeywords).map { case Rating(user, keyword, rate) =>
(user, Map(keyword -> rate))
}.reduceByKey { _ ++ _ }
Now predictions has an immutable map for each user that can be queried for the predicted rating of a particular keyword. If you specifically want arrays as in your original example, you can do:
val keywords = distinctKeywords.collect() // add .sorted if you want them in order
val predictionArrays = predictions.mapValues(keywords.map(_))
Caveat: I tested this with Spark 1.0.1 as it's what I had installed, but it should work with 1.0.0 as well.