Spark - Prediction.io - scala.MatchError: null

I'm working on a template for prediction.io and I'm running into trouble with Spark.
I keep getting a scala.MatchError (full gist here):
scala.MatchError: null
at org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:831)
at org.apache.spark.mllib.recommendation.MatrixFactorizationModel.predict(MatrixFactorizationModel.scala:66)
at org.template.prediction.ALSAlgorithm$$anonfun$predict$1$$anonfun$apply$1.apply(ALSAlgorithm.scala:86)
at org.template.prediction.ALSAlgorithm$$anonfun$predict$1$$anonfun$apply$1.apply(ALSAlgorithm.scala:79)
at scala.Option.map(Option.scala:145)
at org.template.prediction.ALSAlgorithm$$anonfun$predict$1.apply(ALSAlgorithm.scala:79)
at org.template.prediction.ALSAlgorithm$$anonfun$predict$1.apply(ALSAlgorithm.scala:78)
The code (GitHub source here):
val usersWithCounts =
  ratingsRDD
    .map(r => (r.user, (1, Seq[Rating](Rating(r.user, r.item, r.rating)))))
    .reduceByKey((v1, v2) => (v1._1 + v2._1, v1._2.union(v2._2)))
    .filter(_._2._1 >= evalK)

// create evalK folds of ratings
(0 until evalK).map { idx =>
  // start by getting this fold's ratings for each user
  val fold = usersWithCounts
    .map { userKV =>
      val userRatings = userKV._2._2.zipWithIndex
      val trainingRatings = userRatings.filter(_._2 % evalK != idx).map(_._1)
      val testingRatings = userRatings.filter(_._2 % evalK == idx).map(_._1)
      (trainingRatings, testingRatings) // split the user's ratings into a training set and a testing set
    }
    .reduce((l, r) => (l._1.union(r._1), l._2.union(r._2))) // merge all the testing and training sets into a single testing and training set

  val testingSet = fold._2.map {
    r => (new Query(r.user, r.item), new ActualResult(r.rating))
  }

  (
    new TrainingData(sc.parallelize(fold._1)),
    new EmptyEvaluationInfo(),
    sc.parallelize(testingSet)
  )
}
To run the evaluation I need to split the ratings into a training group and a testing group. To make sure each user is included in the training data, I group all of a user's ratings together, do the split per user, and then join the per-user splits together.
Maybe there's a better way to do this?

The error means that the userFeatures of the MLlib MatrixFactorizationModel doesn't contain the user id (for example, if the user is not in the training data). MLlib doesn't check for this case after the lookup (.head is used):
https://github.com/apache/spark/blob/v1.2.0/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala#L66
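For reference, the relevant part of MatrixFactorizationModel.predict in that Spark version looks roughly like this (note the unguarded .head after each lookup):
// Sketch of MLlib's predict(user, product) in Spark 1.x; if lookup returns an
// empty Seq (i.e. the user/product is not in the training data), .head throws.
def predict(user: Int, product: Int): Double = {
  val userVector = new DoubleMatrix(userFeatures.lookup(user).head)
  val productVector = new DoubleMatrix(productFeatures.lookup(product).head)
  userVector.dot(productVector)
}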
To debug whether this is the case, you can implement a modified version of model.predict() that checks if the userId/itemId exists in the model, instead of calling the default one:
val itemScore = model.predict(userInt, itemInt)
(https://github.com/nickpoorman/template-scala-parallel-prediction/blob/master/src/main/scala/ALSAlgorithm.scala#L80)
Change it to use .headOption:
val itemScore = model.userFeatures.lookup(userInt).headOption.map { userFeature =>
  model.productFeatures.lookup(itemInt).headOption.map { productFeature =>
    val userVector = new DoubleMatrix(userFeature)
    val productVector = new DoubleMatrix(productFeature)
    userVector.dot(productVector)
  }.getOrElse {
    logger.info(s"No itemFeature for item ${query.item}.")
    0.0 // return default score
  }
}.getOrElse {
  logger.info(s"No userFeature for user ${query.user}.")
  0.0 // return default score
}

Related

Spark: show and collect-println giving different outputs

I am using Spark 2.2
I feel like I have something odd going on here. The basic premise is that:
- I have a set of KIE/Drools rules running through a Dataset of profile objects
- I am then trying to show/collect-print the resulting output
- I then cast the output as a tuple to flatMap it later
Code below:
implicit val mapEncoder = Encoders.kryo[java.util.HashMap[String, Any]]
implicit val recommendationEncoder = Encoders.kryo[Recommendation]
val mapper = new ObjectMapper()

val kieOuts = uberDs.map(profile => {
  val map = mapper.convertValue(profile, classOf[java.util.HashMap[String, Any]])
  val profile = Profile(map)

  // setup the kie session
  val ks = KieServices.Factory.get
  val kContainer = ks.getKieClasspathContainer
  val kSession = kContainer.newKieSession() //TODO: stateful session, how to do stateless?

  // insert profile object into kie session
  val kCmds = ks.getCommands
  val cmds = new java.util.ArrayList[Command[_]]()
  cmds.add(kCmds.newInsert(profile))
  cmds.add(kCmds.newFireAllRules("outFired"))

  // fire kie rules
  val results = kSession.execute(kCmds.newBatchExecution(cmds))
  val fired = results.getValue("outFired").toString.toInt

  // collect the inserted recommendation objects and create uid string
  import scala.collection.JavaConversions._
  var gresults = kSession.getObjects
  gresults = gresults.drop(1) // drop the inserted profile object which also gets collected

  val recommendations = scala.collection.mutable.ListBuffer[Recommendation]()
  gresults.toList.foreach(reco => {
    val recommendation = reco.asInstanceOf[Recommendation]
    recommendations += recommendation
  })
  kSession.dispose

  val uIds = StringBuilder.newBuilder
  if (recommendations.size > 0) {
    recommendations.foreach(recommendation => {
      uIds.append(recommendation.getOfferId + "_" + recommendation.getScore)
      uIds.append(";")
    })
    uIds.deleteCharAt(uIds.size - 1)
  }

  new ORecommendation(profile.getAttributes().get("cId").toString.toLong, fired, uIds.toString)
})

println("======================Output#1======================")
kieOuts.show(1000, false)

println("======================Output#2======================")
kieOuts.collect.foreach(println)

// separating cId and each uid into individual rows
val kieOutsDs = kieOuts.as[(Long, Int, String)]
println("======================Output#3======================")
kieOutsDs.show(1000, false)
(I have sanitized/shortened the IDs below; the real ones are much bigger but have a similar format.)
What I am seeing as outputs:
Output#1 has a set of uIds (as String) come up:
+----+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|cId |rulesFired | eligibleUIds |
|842 | 17|123-25_2.0;12345678-48_9.0;28a-ad_5.0;123-56_10.0;123-27_2.0;123-32_3.0;c6d-e5_5.0;123-26_2.0;123-51_10.0;8e8-c1_5.0;123-24_2.0;df8-ad_5.0;123-36_5.0;123-16_2.0;123-34_3.0|
+----+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Output#2 has mostly the same set of uIds show up (usually off by one element):
ORecommendation(842,17,123-36_5.0;123-24_2.0;8e8-c1_5.0;df8-ad_5.0;28a-ad_5.0;660-73_5.0;123-34_3.0;123-48_9.0;123-16_2.0;123-51_10.0;123-26_2.0;c6d-e5_5.0;123-25_2.0;123-56_10.0;123-32_3.0)
Output#3 is the same as Output#1:
+----+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|842 | 17 |123-32_3.0;8e8-c1_5.0;123-51_10.0;123-48_9.0;28a-ad_5.0;c6d-e5_5.0;123-27_2.0;123-16_2.0;123-24_2.0;123-56_10.0;123-34_3.0;123-36_5.0;123-6_2.0;123-25_2.0;660-73_5.0|
Every time I run it, the difference between Output#1 and Output#2 is one element, but never the same element (in the above example, Output#1 has 123-27_2.0 but Output#2 has 660-73_5.0).
Should they not be the same? I am still new to Scala/Spark and feel like I am missing something very fundamental.
I think I figured this out: adding cache to kieOuts at least got me identical outputs between show and collect (sketch below).
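A minimal sketch of that fix, reusing the kieOuts Dataset from above:
// Materialize the mapped Dataset once; show() and collect() then reuse the
// cached rows instead of re-running the (non-deterministic) rule evaluation
// for each action.
kieOuts.cache()
kieOuts.show(1000, false)
kieOuts.collect.foreach(println)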
I will be looking at why KIE gives me different output for every run of the same input, but that is a different issue.

Suddenly throwing "This RDD lacks a SparkContext"; it was working before and every piece of code is in the main method

It was a working piece of code, but suddenly it's not working after I tried creating the SparkSession from a different Scala object.
val b = a.filter { x => (!x._2._1.isEmpty) && (!x._2._2.isEmpty) }
val primary_key_distinct = b.map(rec => (rec._1.split(",")(0))).distinct
for (i <- primary_key_distinct) {
  b.foreach(println)
}
Error:
ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 5)
org.apache.spark.SparkException: This RDD lacks a SparkContext. It could happen in the following cases:
(1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
(2) When a Spark Streaming job recovers from checkpoint, this exception will be hit if a reference to an RDD not defined by the streaming job is used in DStream operations. For more information, See SPARK-13758.
It is still not working even after I reverted that change, and I'm not using any other objects.
Updated code:
object `try` {
  def main(args: Array[String]) {
    val spark = SparkSession.builder().master("local").appName("50columns3nodes").getOrCreate()
    var s = spark.read.csv("/home/hadoopuser/Desktop/input/source.csv").rdd.map(_.mkString(","))
    var k = spark.read.csv("/home/hadoopuser/Desktop/input/destination.csv").rdd.map(_.mkString(","))

    val source_primary_key = s.map(rec => (rec.split(",")(0), rec))
    val destination_primary_key = k.map(rec => (rec.split(",")(0), rec))

    val a = source_primary_key.cogroup(destination_primary_key).filter { x => ((x._2._1) != (x._2._2)) }
    val b = a.filter { x => (!x._2._1.isEmpty) && (!x._2._2.isEmpty) }

    var extra_In_Dest = a.filter(x => x._2._1.isEmpty && !x._2._2.isEmpty).map(rec => (rec._2._2.mkString("")))
    var extra_In_Src = a.filter(x => !x._2._1.isEmpty && x._2._2.isEmpty).map(rec => (rec._2._1.mkString("")))

    val primary_key_distinct = b.map(rec => (rec._1.split(",")(0))).distinct

    for (i <- primary_key_distinct) {
      var lengthofarray = 0
      println(i)
      b.foreach(println)
    }
  }
}
Input data follows
s=1,david
2,ajay
3,jijo
4,abi
5,surendhar
k=1,david
2,ajay
3,jijoaa
4,abisdsdd
5,surendhar
val a contains {(3,(jijo,jijoaa)), (4,(abi,abisdsdd))}
If you read the first message carefully:
(1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
It clearly states that actions and transformations cannot be performed inside a transformation.
primary_key_distinct is a transformation done on b, and b itself is a transformation done on a. And b.foreach(println) is an action performed inside the loop over primary_key_distinct, i.e. nested inside another RDD operation.
So if you collect b or primary_key_distinct on the driver, then the code should run properly:
val b = a.filter { x => (!x._2._1.isEmpty) && (!x._2._2.isEmpty) }.collect
or
val primary_key_distinct = b.map(rec => (rec._1.split(",")(0))).distinct.collect
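With the keys collected, the loop then runs entirely on the driver and the nested call becomes legal, for example (a sketch using the second variant):
// primary_key_distinct is now a local Array[String]; the loop runs on the
// driver, and b.foreach(println) is an ordinary action submitted from the
// driver rather than an action nested inside another RDD operation.
for (i <- primary_key_distinct) {
  println(i)
  b.foreach(println)
}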
or, if you don't use an action inside another transformation, the code should run properly too, as in:
for (i <- 1 to 2) {
  var lengthofarray = 0
  println(i)
  b.foreach(println)
}
I hope the explanation is clear.

Task not serializable Flink

I am trying to do the basic PageRank example in Flink with a little bit of modification (only in reading the input file; everything else is the same). I am getting the error "Task not serializable", and below is part of the output error:
at org.apache.flink.api.scala.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:179)
at org.apache.flink.api.scala.ClosureCleaner$.clean(ClosureCleaner.scala:171)
Below is my code
object hpdb {
  def main(args: Array[String]) {
    val env = ExecutionEnvironment.getExecutionEnvironment

    val maxIterations = 10000
    val DAMPENING_FACTOR: Double = 0.85
    val EPSILON: Double = 0.0001
    val outpath = "/home/vinoth/bigdata/assign10/pagerank.csv"

    val links = env.readCsvFile[Tuple2[Long, Long]]("/home/vinoth/bigdata/assign10/ppi.csv",
      fieldDelimiter = "\t", includedFields = Array(1, 4)).as('sourceId, 'targetId).toDataSet[Link] // source and target

    val pages = env.readCsvFile[Tuple1[Long]]("/home/vinoth/bigdata/assign10/ppi.csv",
      fieldDelimiter = "\t", includedFields = Array(1)).as('pageId).toDataSet[Id] // pageId

    val noOfPages = pages.count()

    val pagesWithRanks = pages.map(p => Page(p.pageId, 1.0 / noOfPages))

    val adjacencyLists = links
      // initialize lists: ._1 is the source id and ._2 is the target id
      .map(e => AdjacencyList(e.sourceId, Array(e.targetId)))
      // concatenate lists
      .groupBy("sourceId").reduce {
        (l1, l2) => AdjacencyList(l1.sourceId, l1.targetIds ++ l2.targetIds)
      }

    // start iteration
    val finalRanks = pagesWithRanks.iterateWithTermination(maxIterations) {
      // **the output shows error here**
      currentRanks =>
        val newRanks = currentRanks
          // distribute ranks to target pages
          .join(adjacencyLists).where("pageId").equalTo("sourceId") {
            (page, adjacent, out: Collector[Page]) =>
              for (targetId <- adjacent.targetIds) {
                out.collect(Page(targetId, page.rank / adjacent.targetIds.length))
              }
          }
          // collect ranks and sum them up
          .groupBy("pageId").aggregate(SUM, "rank")
          // apply dampening factor
          // **the output shows error here**
          .map { p =>
            Page(p.pageId, (p.rank * DAMPENING_FACTOR) + ((1 - DAMPENING_FACTOR) / pages.count()))
          }

        // terminate if no rank update was significant
        val termination = currentRanks.join(newRanks).where("pageId").equalTo("pageId") {
          (current, next, out: Collector[Int]) =>
            // check for significant update
            if (math.abs(current.rank - next.rank) > EPSILON) out.collect(1)
        }

        (newRanks, termination)
    }

    val result = finalRanks

    // emit result
    result.writeAsCsv(outpath, "\n", " ")

    env.execute()
  }
}
Any help in the right direction is highly appreciated. Thank you.
The problem is that you reference the DataSet pages from within a MapFunction. This is not possible, since a DataSet is only the logical representation of a data flow and cannot be accessed at runtime.
What you have to do to solve this problem is to assign the value of pages.count to a variable, e.g. val pagesCount = pages.count, and refer to this variable in your MapFunction.
What pages.count actually does is trigger the execution of the data flow graph so that the number of elements in pages can be counted. The result is then returned to your program.
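Applied to the code above, that means computing the count on the driver before the iteration is built and closing over the resulting value in the dampening step; a rough sketch (summedRanks is just a placeholder name for the result of the aggregate step):
// Computed on the driver, outside of any MapFunction
val noOfPages = pages.count()
// Inside the iteration, close over the plain Long value instead of the DataSet pages
val dampenedRanks = summedRanks.map { p =>
  Page(p.pageId, (p.rank * DAMPENING_FACTOR) + ((1 - DAMPENING_FACTOR) / noOfPages))
}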

scala.MatchError: null on spark RDDs

I am relatively new to both Spark and Scala.
I was trying to implement collaborative filtering using Scala on Spark.
Below is the code
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.Rating
val data = sc.textFile("/user/amohammed/CB/input-cb.txt")
val distinctUsers = data.map(x => x.split(",")(0)).distinct().map(x => x.toInt)
val distinctKeywords = data.map(x => x.split(",")(1)).distinct().map(x => x.toInt)
val ratings = data.map(_.split(',') match {
  case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble)
})
val model = ALS.train(ratings, 1, 20, 0.01)
val keywords = distinctKeywords collect
distinctUsers.map(x => {(x, keywords.map(y => model.predict(x,y)))}).collect()
It throws a scala.MatchError: null at
org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:571) on the last line.
The code works fine if I collect the distinctUsers RDD into an array and execute the same code:
val users = distinctUsers collect
users.map(x => {(x, keywords.map(y => model.predict(x, y)))})
Where am I going wrong when dealing with RDDs?
Spark Version : 1.0.0
Scala Version : 2.10.4
Going one call further back in the stack trace, line 43 of the MatrixFactorizationModel source says:
val userVector = new DoubleMatrix(userFeatures.lookup(user).head)
Note that the userFeatures field of model is itself another RDD; I believe it isn't getting serialized properly when the anonymous function block closes over model, and thus the lookup method on it is failing. I also tried placing both model and keywords into broadcast variables, but that didn't work either.
Instead of falling back to Scala collections and losing the benefits of Spark, it's probably better to stick with RDDs and take advantage of other ways of transforming them.
I'd start with this:
val ratings = data.map(_.split(',') match {
  case Array(user, keyword, rate) => Rating(user.toInt, keyword.toInt, rate.toDouble)
})
// instead of parsing the original RDD's strings three separate times,
// you can map the "user" and "product" fields of the Rating case class
val distinctUsers = ratings.map(_.user).distinct()
val distinctKeywords = ratings.map(_.product).distinct()
val model = ALS.train(ratings, 1, 20, 0.01)
Then, instead of calculating each prediction one by one, we can obtain the Cartesian product of all possible user-keyword pairs as an RDD and use the other predict method in MatrixFactorizationModel, which takes an RDD of such pairs as its argument.
val userKeywords = distinctUsers.cartesian(distinctKeywords)
val predictions = model.predict(userKeywords).map { case Rating(user, keyword, rate) =>
  (user, Map(keyword -> rate))
}.reduceByKey { _ ++ _ }
Now predictions has an immutable map for each user that can be queried for the predicted rating of a particular keyword. If you specifically want arrays as in your original example, you can do:
val keywords = distinctKeywords.collect() // add .sorted if you want them in order
val predictionArrays = predictions.mapValues(keywords.map(_))
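For instance, one user's predicted score for a single keyword could then be looked up like this (the ids 7 and 42 are just hypothetical examples):
// lookup returns all values for the key; headOption guards against a missing user
val score: Option[Double] = predictions.lookup(7).headOption.flatMap(_.get(42))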
Caveat: I tested this with Spark 1.0.1 as it's what I had installed, but it should work with 1.0.0 as well.

spark scala get uncommon map elements

I am trying to split my data set into train and test data sets. I first read the file into memory as shown here:
val ratings = sc.textFile(movieLensdataHome + "/ratings.csv").map { line =>
  val fields = line.split(",")
  Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble)
}
Then I select 80% of those for my training set:
val train = ratings.sample(false,.8,1)
Is there an easy way to get the test set in a distributed way?
I am trying this, but it fails:
val test = ratings.filter(!_.equals(train.map(_)))
val test = ratings.subtract(train)
Take a look here: http://markmail.org/message/qi6srcyka6lcxe7o
Here is the code:
def split[T: ClassManifest](data: RDD[T], p: Double,
                            seed: Long = System.currentTimeMillis): (RDD[T], RDD[T]) = {
  val rand = new java.util.Random(seed)
  val partitionSeeds = data.partitions.map(partition => rand.nextLong)
  val temp = data.mapPartitionsWithIndex((index, iter) => {
    val partitionRand = new java.util.Random(partitionSeeds(index))
    iter.map(x => (x, partitionRand.nextDouble))
  })
  (temp.filter(_._2 <= p).map(_._1), temp.filter(_._2 > p).map(_._1))
}
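Assuming the ratings RDD from the question, the helper would then be used like this (a sketch):
// roughly 80% of the ratings end up in train and the remainder in test
val (train, test) = split(ratings, 0.8)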
Instead of using an exclusion method (like filter or subtract), I'd partition the set "by hand" for more efficient execution:
import scala.util.Random

val probabilisticSegment: (RDD[(Double, Rating)], Double => Boolean) => RDD[Rating] =
  (rdd, prob) => rdd.filter { case (k, v) => prob(k) }.map { case (k, v) => v }

val ranRating = ratings.map(x => (Random.nextDouble(), x)).cache
val train = probabilisticSegment(ranRating, _ < 0.8)
val test = probabilisticSegment(ranRating, _ >= 0.8)
cache saves the intermediate RDD so that the next two operations can be performed from that point on without incurring the execution of the complete lineage.
Note the use of val to define the function instead of def: vals are serialization-friendly.
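A small illustration of that last point (a sketch, not from the original answer):
// A val holds an already-constructed function object, which Spark can serialize
// and ship to the executors on its own.
val isTrain: Double => Boolean = _ < 0.8
// A def is a method on the enclosing object/class; a closure that calls it can
// pull the whole enclosing instance into the serialized task, which fails if
// that instance is not serializable.
def isTrainDef(k: Double): Boolean = k < 0.8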