How to get the actual SSSP path with Apache Spark GraphX? - scala

I have run the single-source shortest path (SSSP) example from the Spark site as follows:
graphx-SSSP pregel example
Code (Scala):
object Pregel_SSSP {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "Allen Pregel Test", System.getenv("SPARK_HOME"), SparkContext.jarOfClass(this.getClass))
    // A graph with edge attributes containing distances
    val graph: Graph[Int, Double] =
      GraphGenerators.logNormalGraph(sc, numVertices = 5).mapEdges(e => e.attr.toDouble)
    graph.edges.foreach(println)
    val sourceId: VertexId = 0 // The ultimate source
    // Initialize the graph such that all vertices except the root have distance infinity.
    val initialGraph = graph.mapVertices((id, _) => if (id == sourceId) 0.0 else Double.PositiveInfinity)
    val sssp = initialGraph.pregel(Double.PositiveInfinity, Int.MaxValue, EdgeDirection.Out)(
      // Vertex Program
      (id, dist, newDist) => math.min(dist, newDist),
      // Send Message
      triplet => {
        if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
          Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
        } else {
          Iterator.empty
        }
      },
      // Merge Message
      (a, b) => math.min(a, b))
    println(sssp.vertices.collect.mkString("\n"))
  }
}
With sourceId = 0, I get the result:
(0,0.0)
(4,2.0)
(2,1.0)
(3,1.0)
(1,2.0)
But I need the actual paths, like this:
0 -> 0,0
0 -> 2,1
0 -> 3,1
0 -> 2 -> 4,2
0 -> 3 -> 1,2
How can I get the actual SSSP paths with Spark GraphX?
Can anybody give me a hint?
Thanks for your help!

You have to modify the algorithm so that it stores not only the shortest path length but also the actual path.
Instead of storing a Double as the vertex property, store a tuple: (Double, List[VertexId]).
Maybe this code can be useful for you:
object Pregel_SSSP {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "Allen Pregel Test", System.getenv("SPARK_HOME"), SparkContext.jarOfClass(this.getClass))
    // A graph with edge attributes containing distances
    val graph: Graph[Int, Double] =
      GraphGenerators.logNormalGraph(sc, numVertices = 5).mapEdges(e => e.attr.toDouble)
    graph.edges.foreach(println)
    val sourceId: VertexId = 0 // The ultimate source
    // Initialize the graph such that all vertices except the root have distance infinity.
    val initialGraph: Graph[(Double, List[VertexId]), Double] =
      graph.mapVertices((id, _) =>
        if (id == sourceId) (0.0, List[VertexId](sourceId))
        else (Double.PositiveInfinity, List[VertexId]()))
    val sssp = initialGraph.pregel((Double.PositiveInfinity, List[VertexId]()), Int.MaxValue, EdgeDirection.Out)(
      // Vertex Program
      (id, dist, newDist) => if (dist._1 < newDist._1) dist else newDist,
      // Send Message
      triplet => {
        if (triplet.srcAttr._1 < triplet.dstAttr._1 - triplet.attr) {
          Iterator((triplet.dstId, (triplet.srcAttr._1 + triplet.attr, triplet.srcAttr._2 :+ triplet.dstId)))
        } else {
          Iterator.empty
        }
      },
      // Merge Message
      (a, b) => if (a._1 < b._1) a else b)
    println(sssp.vertices.collect.mkString("\n"))
  }
}
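To print the output in roughly the format asked for in the question, you can format each vertex's stored path (a small sketch based on the code above; vertices not reached from the source keep an empty path and infinite distance):
// Each vertex attribute is (distance, path from the source), e.g. "0 -> 2 -> 4,2.0"
sssp.vertices.collect.foreach { case (_, (dist, path)) =>
  println(path.mkString(" -> ") + "," + dist)
}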

Maybe it is an outdated answer, but take a look at this solution: Find all paths in graph using Apache Spark

Related

Evaluating multiple filters in a single pass

I have the RDD below and need to perform a series of filters on the same dataset to derive different counters and aggregates.
Is there a way to apply these filters and compute the aggregates in a single pass, so that Spark does not have to go over the same dataset multiple times?
val res = df.rdd.map(row => {
// ............... Generate data here for each row.......
})
res.persist(StorageLevel.MEMORY_AND_DISK)
val all = res.count()
val stats1 = res.filter(row => row.getInt(1) > 0)
val stats1Count = stats1.count()
val stats1Agg = stats1.map(r => r.getInt(1)).mean()
val stats2 = res.filter(row => row.getInt(2) > 0)
val stats2Count = stats2.count()
val stats2Agg = stats2.map(r => r.getInt(2)).mean()
You can use aggregate:
case class Stats(count: Int = 0, sum: Int = 0) {
  def mean: Double = sum.toDouble / count
  def +(s: Stats): Stats = Stats(count + s.count, sum + s.sum)
  // fold one value in; `<-` is a reserved word in Scala, so the operator is named `<<`
  def <<(n: Int): Stats = if (n > 0) copy(count + 1, sum + n) else this
}
val (stats1, stats2) = res.aggregate(Stats() -> Stats())(
  // seqOp: fold each row's two columns into the respective Stats
  { (s, row) => (s._1 << row.getInt(1), s._2 << row.getInt(2)) },
  // combOp: merge the per-partition (Stats, Stats) pairs component-wise
  { case ((a1, b1), (a2, b2)) => (a1 + a2, b1 + b2) }
)
val (stats1Count, stats1Agg, stats2Count, stats2Agg) = (stats1.count, stats1.mean, stats2.count, stats2.mean)
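A minimal usage sketch (assuming a SparkContext named sc; the toy (Int, Int) pairs stand in for row.getInt(1) and row.getInt(2)):
// One pass over the data computes both Stats values.
val toy = sc.parallelize(Seq((3, 0), (0, 5), (2, 4)))
val (s1, s2) = toy.aggregate(Stats() -> Stats())(
  { case ((a, b), (n1, n2)) => (a << n1, b << n2) },  // fold each record into the pair
  { case ((a1, b1), (a2, b2)) => (a1 + a2, b1 + b2) } // merge per-partition results
)
println(s"stats1: count=${s1.count}, mean=${s1.mean}; stats2: count=${s2.count}, mean=${s2.mean}")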

Generating recommendation model for a large dataset using spark

I'm trying to generate a simple ALS model using the Spark documentation here.
My first file (ratings.csv) has 20 million UserID,MovID,Rating rows and can be downloaded here.
The testing data is a subset of ratings.csv and can be downloaded here; the test file has just the UserID and MovieID columns.
To create the training data we have to filter ratings.csv.
The following code works fine for the smaller case of 100,000 UserID,MovID ratings, but I am not able to generate the model for the big case.
Please help with a pointer.
/**
* Created by echoesofconc on 3/8/17.
*/
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.mllib.recommendation.Rating
import java.io._
import scala.collection.mutable.ListBuffer
object Prateek_Agrawal_task1 {
def dropheader(data: RDD[String]): RDD[String] = {
data.mapPartitionsWithIndex((idx, lines) => {
if (idx == 0) {
lines.drop(1)
}
lines
})
}
def create_training(ratings_split: RDD[Array[String]], ratings_testing: Array[Array[String]]) = {
ratings_split.filter(x => {
ratings_testing.exists(y =>
(x(0) == y(0) && x(1) == y(1))
) == false
})
}
def create_testing(ratings_split: RDD[Array[String]], ratings_testing: Array[Array[String]]) = {
ratings_split.filter(x => {
ratings_testing.exists(y =>
(x(0) == y(0) && x(1) == y(1))
) == true
})
}
def create_model(ratings_train:RDD[Array[String]],rank:Int,numIterations:Int ):org.apache.spark.mllib.recommendation.MatrixFactorizationModel={
val ratings = ratings_train.map(_ match { case Array(user,item,rate,temp) =>
Rating(user.toInt, item.toInt, rate.toDouble)
})
val model = ALS.train(ratings, rank, numIterations, 0.01)
return model
}
def print_results(final_predictions_adjusted:RDD[((Int, Int), Double)])={
val rating_range=final_predictions_adjusted.map(x=>(x._2.toInt,1)).reduceByKey(_+_).sortByKey()
val rating_range_till_4=rating_range.map{x=>
var temp=x
if (x._1==5){temp=(4,x._2)}
temp
}.reduceByKey(_+_)
rating_range_till_4.sortByKey().foreach { x =>
if(x._1==0)
printf(">=0 and <1: " + x._2+"\n")
if(x._1==1)
printf(">=1 and <2: " + x._2+"\n")
if(x._1==2)
printf(">=2 and <3: " + x._2+"\n")
if(x._1==3)
printf(">=3 and <4: " + x._2+"\n")
if(x._1==4)
printf(">=4 " + x._2+"\n")
if(x._1==5)
printf("=5 " + x._2+"\n")
}
}
case class User_mov_rat(UserID: Int, MovieID:Int, Pred_rating: Double)
def print_outputfile(final_predictions_adjusted:RDD[((Int, Int), Double)])={
val writer = new FileWriter(new File("./output.txt" ))
writer.write("UserID,MovieID,Pred_rating\n")
final_predictions_adjusted.collect().foreach(x=>{writer.write(x._1._1+","+x._1._2+","+x._2+"\n")})
writer.close()
}
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("Prateek_Agrawal_task1").setMaster("local[2]")
val sc = new SparkContext(conf)
val file = "/Users/echoesofconc/Documents/USC_courses/INF553/ml-20m/ratings.csv"
val test = "/Users/echoesofconc/Documents/USC_courses/INF553/Prateek_Agrawal_hw3/testing_20m.csv"
val data = sc.textFile(file, 2).cache()
val data_test = sc.textFile(test, 2).cache()
// Drop Header
val data_wo_header=dropheader(data).persist()
val data_test_wo_header=dropheader(data_test).persist()
// Create Training and testing data of the format (User ID, MovID, Rating, Time)
val ratings_split = data_wo_header.map(line => line.split(",")).persist()
data_wo_header.unpersist()
data.unpersist()
val ratings_testing = data_test_wo_header.map(line => line.split(",")).collect()
data_test_wo_header.unpersist()
data_test.unpersist()
val ratings_train = create_training(ratings_split, ratings_testing).persist()
val ratings_test=create_testing(ratings_split, ratings_testing)
ratings_split.unpersist()
ratings_test.unpersist()
// Create the model using rating_train the training data
val rank = 1
val numIterations = 10
val model=create_model(ratings_train,rank,numIterations)
ratings_train.unpersist()
// Average user,Rating from training this is for cases which are there in test but not rated by any user in training
val user_avgrat=ratings_test.map(_ match { case Array(user, mov, rate, temp) =>(user.toInt, (rate.toDouble,1.0))}).reduceByKey((x,y)=>(x._1 + y._1, x._2 + y._2)).mapValues{ case (sum, count) => (1.0 * sum) / count }
// Predict user_mov ratings
val user_mov = data_test_wo_header.map(_.split(',') match { case Array(user, mov) =>
(user.toInt,mov.toInt)
})
val predictions =
model.predict(user_mov).map { case Rating(user, mov, rate) =>
((user, mov), rate)
}
// Combine Predictions and unpredicted user,Movies due to them being individual. Going forward we need to improve the accuracy for these predictions
val user_mov_rat=user_mov.map(x=>(x,0.0))
val predictions_unpredicted_combined= predictions.union(user_mov_rat).reduceByKey(_+_).map(x=>(x._1._1,(x._1._2,x._2)))
// Combine average rating and predictions+unpredicted values
val avg_rating_predictions_unpredicted_combined=predictions_unpredicted_combined.join(user_avgrat)
// Generate final predictions RDD
val final_predictions=avg_rating_predictions_unpredicted_combined.map{x=>
var temp=((x._1,x._2._1._1),x._2._2)
if(x._2._1._2==0.0){temp=((x._1,x._2._1._1),x._2._2)}
if(x._2._1._2!=0.0){temp=((x._1,x._2._1._1),x._2._1._2)}
temp
}
// Adjust for ratings above 5.0 and below 0.0
val final_predictions_adjusted=final_predictions.map{x=>
var temp=x
if (x._2>5.0){temp=(x._1,5.0)}
if (x._2<0.0){temp=(x._1,0.0)}
temp
}
val ratesAndPreds = ratings_test.map(_ match { case Array(user, mov, rate, temp) => ((user.toInt,mov.toInt),rate.toDouble)}).join(final_predictions_adjusted)
val MSE = ratesAndPreds.map { case ((user, product), (r1, r2)) =>
val err = (r1 - r2)
err * err
}.mean()
val RMSE=math.sqrt(MSE)
// Print output.txt
print_outputfile(final_predictions_adjusted)
// Print the predictionresults
print_results(final_predictions_adjusted.sortByKey())
print(RMSE+"\n")
}
}
In case someone thinks I should be doing a regex match: I have tried that approach, but it doesn't seem to be a bottleneck.
I only need to complete the create-model part, on which I am stuck for the big dataset. Can somebody help?
EDIT:
Another approach I tried, using broadcast variables, is much faster. But it has been running for 12 hours with no sign of progress. On the Spark UI, somehow the whole RDD (ratings.csv, ~500 MB) is not cached; only around 64 MB with 2.5 million lines is processed initially. I am using --executor-memory -8g. I have modified the create_training and create_testing functions:
/**
* Created by echoesofconc on 3/8/17.
*/
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.Rating
import java.io._
object Prateek_Agrawal_task2 {
def dropheader(data: RDD[String]): RDD[String] = {
data.mapPartitionsWithIndex((idx, lines) => {
if (idx == 0) {
lines.drop(1)
}
lines
})
}
def create_training(data_wo_header: RDD[String], data_test_wo_header: RDD[String],sc:SparkContext): RDD[String] = {
val rdd2array = sc.broadcast(data_test_wo_header.collect())
val training_set = data_wo_header.filter{
case(x) => rdd2array.value.filter(y => x.indexOf(y.toString())==0).length == 0
}
return training_set
}
def create_test(data_wo_header: RDD[String], data_test_wo_header: RDD[String],sc:SparkContext): RDD[String] = {
val rdd2array = sc.broadcast(data_test_wo_header.collect())
val training_set = data_wo_header.filter{
case(x) => rdd2array.value.filter(y => x.indexOf(y.toString())==0).length != 0
}
return training_set
}
def create_model(ratings_train:RDD[String],rank:Int,numIterations:Int ):org.apache.spark.mllib.recommendation.MatrixFactorizationModel={
val ratings = ratings_train.map(_.split(',') match { case Array(user, item, rate, timestamp) =>
Rating(user.toInt, item.toInt, rate.toDouble)
})
val model = ALS.train(ratings, rank, numIterations, 0.01)
return model
}
def print_results(final_predictions_adjusted:RDD[((Int, Int), Double)])={
val rating_range=final_predictions_adjusted.map(x=>(x._2.toInt,1)).reduceByKey(_+_).sortByKey()
val rating_range_till_4=rating_range.map{x=>
var temp=x
if (x._1==5){temp=(4,x._2)}
temp
}.reduceByKey(_+_)
rating_range_till_4.sortByKey().foreach { x =>
if(x._1==0)
printf(">=0 and <1: " + x._2+"\n")
if(x._1==1)
printf(">=1 and <2: " + x._2+"\n")
if(x._1==2)
printf(">=2 and <3: " + x._2+"\n")
if(x._1==3)
printf(">=3 and <4: " + x._2+"\n")
if(x._1==4)
printf(">=4 " + x._2+"\n")
if(x._1==5)
printf("=5 " + x._2+"\n")
}
}
case class User_mov_rat(UserID: Int, MovieID:Int, Pred_rating: Double)
def print_outputfile(final_predictions_adjusted:RDD[((Int, Int), Double)])={
val writer = new FileWriter(new File("./output.txt" ))
writer.write("UserID,MovieID,Pred_rating\n")
final_predictions_adjusted.collect().foreach(x=>{writer.write(x._1._1+","+x._1._2+","+x._2+"\n")})
writer.close()
}
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("Prateek_Agrawal_task1").setMaster("local[2]")
val sc = new SparkContext(conf)
val file = "/Users/echoesofconc/Documents/USC_courses/INF553/ml-latest-small/ratings.csv"
val test = "/Users/echoesofconc/Documents/USC_courses/INF553/Prateek_Agrawal_hw3/testing_small.csv"
// val file = "/Users/echoesofconc/Documents/USC_courses/INF553/ml-20m/ratings.csv"
// val test = "/Users/echoesofconc/Documents/USC_courses/INF553/Prateek_Agrawal_hw3/testing_20m.csv"
val data = sc.textFile(file, 2).persist()
val data_test = sc.textFile(test, 2).persist()
// Drop Header
val data_wo_header=dropheader(data)
val data_test_wo_header=dropheader(data_test)
// Create Training and testing data of the format (User ID, MovID, Rating, Time)
val ratings_train=create_training(data_wo_header,data_test_wo_header,sc).persist()
val ratings_test=create_test(data_wo_header,data_test_wo_header,sc)
// val ratings_test=create_test(data_wo_header,data_test_wo_header,sc)
// data_test_wo_header.unpersist()
// data_test.unpersist()
//// data.unpersist()
//// data_test.unpersist()
// Create the model using rating_train the training data
val rank = 1
val numIterations = 10
val model=create_model(ratings_train,rank,numIterations)
// ratings_train.unpersist()
// model.save(sc, "target/tmp/myCollaborativeFilter")
// val Model = MatrixFactorizationModel.load(sc, "/Users/echoesofconc/myCollaborativeFilter")
// Average user,Rating from training
val user_avgrat=ratings_test.map(_.split(",") match { case Array(user, mov, rate, temp) =>(user.toInt, (rate.toDouble,1.0))}).reduceByKey((x,y)=>(x._1 + y._1, x._2 + y._2)).mapValues{ case (sum, count) => (1.0 * sum) / count }
//data
// Predict user_mov ratings
val user_mov = data_test_wo_header.map(_.split(',') match { case Array(user, mov) =>
(user.toInt,mov.toInt)
})
val predictions =
model.predict(user_mov).map { case Rating(user, mov, rate) =>
((user, mov), rate)
}
// Combine Predictions and unpredicted user,Movies due to them being individual. Going forward we need to improve the accuracy for these predictions
val user_mov_rat=user_mov.map(x=>(x,0.0))
val predictions_unpredicted_combined= predictions.union(user_mov_rat).reduceByKey(_+_).map(x=>(x._1._1,(x._1._2,x._2)))
// Combine average rating and predictions+unpredicted values
val avg_rating_predictions_unpredicted_combined=predictions_unpredicted_combined.join(user_avgrat)
// Generate final predictions RDD
val final_predictions=avg_rating_predictions_unpredicted_combined.map{x=>
var temp=((x._1,x._2._1._1),x._2._2)
if(x._2._1._2==0.0){temp=((x._1,x._2._1._1),x._2._2)}
if(x._2._1._2!=0.0){temp=((x._1,x._2._1._1),x._2._1._2)}
temp
}
// Adjust for ratings above 5.0 and below 0.0
val final_predictions_adjusted=final_predictions.map{x=>
var temp=x
if (x._2>5.0){temp=(x._1,5.0)}
if (x._2<0.0){temp=(x._1,0.0)}
temp
}
val ratesAndPreds = ratings_test.map(_.split(",") match { case Array(user, mov, rate, temp) => ((user.toInt,mov.toInt),rate.toDouble)}).join(final_predictions_adjusted)
val MSE = ratesAndPreds.map { case ((user, product), (r1, r2)) =>
val err = (r1 - r2)
err * err
}.mean()
val RMSE=math.sqrt(MSE)
// Print output.txt
print_outputfile(final_predictions_adjusted)
// Print the predictionresults
print_results(final_predictions_adjusted.sortByKey())
print(RMSE+"\n")
}
}
This worked out fine. It splits the data into testing and training sets with subtractByKey on (user, movie) keys:
/**
* Created by echoesofconc on 3/8/17.
*/
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import java.io._
object Prateek_Agrawal_task1 {
def dropheader(data: RDD[String]): RDD[String] = {
data.mapPartitionsWithIndex((idx, lines) => {
if (idx == 0) {
lines.drop(1)
}
lines
})
}
def create_training(ratings_split: RDD[Array[String]], ratings_testing: Array[Array[String]]) = {
ratings_split.filter(x => {
ratings_testing.exists(y =>
(x(0) == y(0) && x(1) == y(1))
) == false
})
}
def create_testing(ratings_split: RDD[Array[String]], ratings_testing: Array[Array[String]]) = {
ratings_split.filter(x => {
ratings_testing.exists(y =>
(x(0) == y(0) && x(1) == y(1))
) == true
})
}
def create_model(ratings_train:RDD[((String, String), (String, String))],rank:Int,numIterations:Int ):org.apache.spark.mllib.recommendation.MatrixFactorizationModel={
val ratings = ratings_train.map(_ match { case ((user,item),(rate,temp)) =>
Rating(user.toInt, item.toInt, rate.toDouble)
})
val model = ALS.train(ratings, rank, numIterations, 0.01)
return model
}
def print_results(final_predictions_adjusted:RDD[((Int, Int), Double)])={
val rating_range=final_predictions_adjusted.map(x=>(x._2.toInt,1)).reduceByKey(_+_).sortByKey()
val rating_range_till_4=rating_range.map{x=>
var temp=x
if (x._1==5){temp=(4,x._2)}
temp
}.reduceByKey(_+_)
rating_range_till_4.sortByKey().foreach { x =>
if(x._1==0)
printf(">=0 and <1: " + x._2+"\n")
if(x._1==1)
printf(">=1 and <2: " + x._2+"\n")
if(x._1==2)
printf(">=2 and <3: " + x._2+"\n")
if(x._1==3)
printf(">=3 and <4: " + x._2+"\n")
if(x._1==4)
printf(">=4 " + x._2+"\n")
if(x._1==5)
printf("=5 " + x._2+"\n")
}
}
case class User_mov_rat(UserID: Int, MovieID:Int, Pred_rating: Double)
def print_outputfile(final_predictions_adjusted:RDD[((Int, Int), Double)])={
val writer = new FileWriter(new File("./output.txt" ))
writer.write("UserID,MovieID,Pred_rating\n")
final_predictions_adjusted.collect().foreach(x=>{writer.write(x._1._1+","+x._1._2+","+x._2+"\n")})
writer.close()
}
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("Prateek_Agrawal_task1").setMaster("local[2]")
val sc = new SparkContext(conf)
// val file = "/Users/echoesofconc/Documents/USC_courses/INF553/ml-latest-small/ratings.csv"
// val test = "/Users/echoesofconc/Documents/USC_courses/INF553/Prateek_Agrawal_hw3/testing_small.csv"
val file = "/Users/echoesofconc/Documents/USC_courses/INF553/ml-20m/ratings.csv"
val test = "/Users/echoesofconc/Documents/USC_courses/INF553/Prateek_Agrawal_hw3/testing_20m.csv"
val data = sc.textFile(file, 2).cache()
val data_test = sc.textFile(test, 2).cache()
// Drop Header
// val data_wo_header=dropheader(data).persist()
// val data_test_wo_header=dropheader(data_test).persist()
// Create Training and testing data of the format (User ID, MovID, Rating, Time)
val data_wo_header=dropheader(data).map(_.split(",")).map(x=>((x(0),x(1)),(x(2),x(3))))
val data_test_wo_header=dropheader(data_test).map(_.split(",")).map(x=>((x(0),x(1)),1))
val ratings_train=data_wo_header.subtractByKey(data_test_wo_header)
val ratings_test=data_wo_header.subtractByKey(ratings_train)
data_test_wo_header.unpersist()
data_wo_header.unpersist()
data.unpersist()
data_test.unpersist()
// val ratings_split = data_wo_header.map(line => line.split(",")).persist()
// data_wo_header.unpersist()
// data.unpersist()
// val ratings_testing = data_test_wo_header.map(line => line.split(",")).collect()
// data_test_wo_header.unpersist()
// data_test.unpersist()
//
// val ratings_train = create_training(ratings_split, ratings_testing).persist()
// val ratings_test=create_testing(ratings_split, ratings_testing)
// ratings_split.unpersist()
// ratings_test.unpersist()
// Create the model using rating_train the training data
val rank = 1
val numIterations = 10
// val model=create_model(ratings_train,rank,numIterations)
//
// model.save(sc, "/Users/echoesofconc/Documents/USC_courses/INF553/Prateek_Agrawal_hw3/myCollaborativeFilter")
val model = MatrixFactorizationModel.load(sc, "/Users/echoesofconc/Documents/USC_courses/INF553/Prateek_Agrawal_hw3/myCollaborativeFilter")
// Average user,Rating from training
val user_avgrat=ratings_train.map(_ match { case ((user, mov), (rate, temp)) =>(user.toInt, (rate.toDouble,1.0))}).reduceByKey((x,y)=>(x._1 + y._1, x._2 + y._2)).mapValues{ case (sum, count) => (1.0 * sum) / count }
ratings_train.unpersist()
// Predict user_mov ratings
val user_mov = data_test_wo_header.map(_ match { case ((user, mov),temp) =>
(user.toInt,mov.toInt)
})
val predictions =
model.predict(user_mov).map { case Rating(user, mov, rate) =>
((user, mov), rate)
}
// Combine Predictions and unpredicted user,Movies due to them being individual. Going forward we need to improve the accuracy for these predictions
val user_mov_rat=user_mov.map(x=>(x,0.0))
val predictions_unpredicted_combined= predictions.union(user_mov_rat).reduceByKey(_+_).map(x=>(x._1._1,(x._1._2,x._2)))
// Combine average rating and predictions+unpredicted values
val avg_rating_predictions_unpredicted_combined=predictions_unpredicted_combined.join(user_avgrat)
// Generate final predictions RDD
val final_predictions=avg_rating_predictions_unpredicted_combined.map{x=>
var temp=((x._1,x._2._1._1),x._2._2)
if(x._2._1._2==0.0){temp=((x._1,x._2._1._1),x._2._2)}
if(x._2._1._2!=0.0){temp=((x._1,x._2._1._1),x._2._1._2)}
temp
}
// Adjust for ratings above 5.0 and below 0.0
val final_predictions_adjusted=final_predictions.map{x=>
var temp=x
if (x._2>5.0){temp=(x._1,5.0)}
if (x._2<0.0){temp=(x._1,0.0)}
temp
}
// final_predictions_adjusted.count()
val ratesAndPreds_map = ratings_test.map(_ match { case ((user, mov), (rate, temp)) => ((user.toInt,mov.toInt),rate.toDouble)})
val ratesAndPreds=ratesAndPreds_map.join(final_predictions_adjusted)
val MSE = ratesAndPreds.map { case ((user, product), (r1, r2)) =>
val err = (r1 - r2)
err * err
}.mean()
val RMSE=math.sqrt(MSE)
// Print output.txt
print_outputfile(final_predictions_adjusted)
// Print the predictionresults
print_results(final_predictions_adjusted.sortByKey())
print(RMSE+"\n")
}
}
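In short, the part that makes the 20M-row case tractable is the key-based split (a condensed sketch of the code above):
// Key both datasets by (userId, movieId) and split with subtractByKey,
// avoiding the O(|train| * |test|) exists() scan of the earlier versions.
val ratingsKV = dropheader(data).map(_.split(",")).map(x => ((x(0), x(1)), (x(2), x(3)))) // ((user, movie), (rating, timestamp))
val testKeys = dropheader(data_test).map(_.split(",")).map(x => ((x(0), x(1)), 1))
val ratings_train = ratingsKV.subtractByKey(testKeys)       // rows whose key is not in the test set
val ratings_test = ratingsKV.subtractByKey(ratings_train)   // the remaining rows, i.e. the test set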

Failure to add keys to map in parallel

I have the following code:
var res: GenMap[Point, GenSeq[Point]] = points.par.groupBy(point => findClosest(point, means))
means.par.foreach(mean => if (!res.contains(mean)) {
  println("Map doesn't contain mean: " + mean)
  res += mean -> GenSeq.empty[Point]
  println("Map contains?: " + res.contains(mean))
})
That uses this case class:
case class Point(val x: Double, val y: Double, val z: Double)
Basically, the code groups the Point elements in points around the Point elements in means. The algorithm itself is not very important though.
My problem is that I am getting the following output:
Map doesn't contain mean: (0.44, 0.59, 0.73)
Map doesn't contain mean: (0.44, 0.59, 0.73)
Map doesn't contain mean: (0.1, 0.11, 0.11)
Map doesn't contain mean: (0.1, 0.11, 0.11)
Map contains?: true
Map contains?: true
Map contains?: false
Map contains?: true
Why would I ever get this?
Map contains?: false
I am checking if a key is in the res map. If it is not, then I'm adding it.
So how can it not be present in the map?
Is there an issue with parallelism?
Your code has a race condition in the line
res += mean -> GenSeq.empty[Point]
More than one thread is reassigning res concurrently, so some entries can be missed.
This code solves the problem:
val closest = points.par.groupBy(point => findClosest(point, means))
val res = means.foldLeft(closest) {
  case (map, mean) =>
    if (map.contains(mean))
      map
    else
      map + (mean -> GenSeq.empty[Point])
}
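An equivalent immutable alternative (a sketch, not from the original answer) builds the missing entries once and concatenates them, again avoiding any shared mutable state across parallel tasks:
// Add an empty group for every mean that groupBy did not produce.
val closest = points.par.groupBy(point => findClosest(point, means))
val res = closest ++ means.filterNot(closest.contains).map(m => m -> GenSeq.empty[Point])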
Processing a point changes means and the result is sensitive to processing order, so the algorithm doesn't lend itself to parallel execution. If parallel execution is important enough to allow a change of algorithm, then it might be possible to find an algorithm that can be applied in parallel.
Using a known set of grouping points, such as grid square centres, means that the points can be allocated to their grouping points in parallel and grouped by their grouping points in parallel:
import scala.annotation.tailrec
import scala.collection.parallel.ParMap
import scala.collection.{GenMap, GenSeq, Map}
import scala.math._
import scala.util.Random
class ParallelPoint {
val rng = new Random(0)
val groups: Map[Point, Point] = (for {
i <- 0 to 100
j <- 0 to 100
k <- 0 to 100
}
yield {
val p = Point(10.0 * i, 10.0 * j, 10.0 * k)
p -> p
}
).toMap
val points: Array[Point] = (1 to 10000000).map(aaa => Point(rng.nextDouble() * 1000.0, rng.nextDouble() * 1000.0, rng.nextDouble() * 1000.0)).toArray
def findClosest(point: Point, groups: GenMap[Point, Point]): (Point, Point) = {
val x: Double = rint(point.x / 10.0) * 10.0
val y: Double = rint(point.y / 10.0) * 10.0
val z: Double = rint(point.z / 10.0) * 10.0
val mean: Point = groups(Point(x, y, z)) //.getOrElse(throw new Exception(s"$point out of range of mean ($x, $y, $z).") )
(mean, point)
}
@tailrec
private def total(points: GenSeq[Point]): Option[Point] = {
points.size match {
case 0 => None
case 1 => Some(points(0))
case _ => total((points(0) + points(1)) +: points.drop(2))
}
}
def mean(points: GenSeq[Point]): Option[Point] = {
total(points) match {
case None => None
case Some(p) => Some(p / points.size)
}
}
val startTime = System.currentTimeMillis()
println("starting test ...")
val res: ParMap[Point, GenSeq[Point]] = points.par.map(p => findClosest(p, groups)).groupBy(pp => pp._1).map(kv => kv._1 -> kv._2.map(v => v._2))
val groupTime = System.currentTimeMillis()
println(s"... grouped result after ${groupTime - startTime}ms ...")
points.par.foreach(p => if (! res(findClosest(p, groups)._1).exists(_ == p)) println(s"point $p not found"))
val checkTime = System.currentTimeMillis()
println(s"... checked grouped result after ${checkTime - startTime}ms ...")
val means: ParMap[Point, GenSeq[Point]] = res.map{ kv => mean(kv._2).get -> kv._2 }
val meansTime = System.currentTimeMillis()
println(s"... means calculated after ${meansTime - startTime}ms.")
}
object ParallelPoint {
def main(args: Array[String]): Unit = new ParallelPoint()
}
case class Point(x: Double, y: Double, z: Double) {
def +(that: Point): Point = {
Point(this.x + that.x, this.y + that.y, this.z + that.z)
}
def /(scale: Double): Point = Point(x/ scale, y / scale, z / scale)
}
The last step replaces the grouping point with the calculated mean of the grouped points as the map key. This processes 10 million points in about 30 seconds on my 2011 MBP.

Converting a For Loop to a Fold

I have to analyze an email corpus to see how many individual sentences are dominated by leet speak (i.e. lol, brb, etc.).
For each sentence I am doing the following:
val words = sentence.split(" ")
for (word <- words) {
  if (validWords.contains(word)) {
    score += 1
  } else if (leetWords.contains(word)) {
    score -= 1
  }
}
Is there a better way to calculate the scores using Fold?
Not a great deal different, but another option.
val words = List("one", "two", "three")
val valid = List("one", "two")
val leet = List("three")

def check(valid: List[String], invalid: List[String])(words: List[String]): Int = words.foldLeft(0) {
  case (x, word) if valid.contains(word) => x + 1
  case (x, word) if invalid.contains(word) => x - 1
  case (x, _) => x
}

val checkValidOrLeet = check(valid, leet)(_)
val count = checkValidOrLeet(words)
If not limited to fold, using sum would be more concise.
sentence.split(" ")
  .iterator
  .map(word =>
    if (validWords.contains(word)) 1
    else if (leetWords.contains(word)) -1
    else 0
  ).sum
Here's a way to do it with fold and partial application. Could still be more elegant, I'll continue to think on it.
val sentence = // ...your data....
val validWords = // ... your valid words...
val leetWords = // ... your leet words...
def checkWord(goodList: List[String], badList: List[String])(c: Int, w: String): Int = {
  if (goodList.contains(w)) c + 1
  else if (badList.contains(w)) c - 1
  else c
}
val count = sentence.split(" ").foldLeft(0)(checkWord(validWords, leetWords))
print(count)
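Applied to the original question (a sketch; sentences is an assumed collection of sentence strings, and validWords/leetWords are the word lists from the question):
// Score every sentence with the fold and count how many are dominated by leet speak
// (here taken to mean a negative score).
val leetDominated = sentences.count { s =>
  s.split(" ").foldLeft(0)(checkWord(validWords, leetWords)) < 0
}
println(leetDominated)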

Using Spark to run KMeans clustering, the program blocks?

When I use the Apache Spark Scala API to run KMeans clustering, my program is as follows:
object KMeans {
def closestPoint(p: Vector, centers: Array[Vector]) = {
var index = 0
var bestIndex = 0
var closest = Double.PositiveInfinity
for(i <- 0 until centers.length) {
var tempDist = p.squaredDist(centers(i))
if(tempDist < closest) {
closest = tempDist
bestIndex = i
}
}
bestIndex
}
def parseVector(line: String): Vector = {
new Vector(line.split("\\s+").map(s => s.toDouble))
}
def main(args: Array[String]): Unit = {
System.setProperty("hadoop.home.dir", "F:/OpenSoft/hadoop-2.2.0")
val sc = new SparkContext("local", "kmeans cluster",
"G:/spark-0.9.0-incubating-bin-hadoop2",
SparkContext.jarOfClass(this.getClass()))
val lines = sc.textFile("G:/testData/synthetic_control.data.txt") // RDD[String]
val count = lines.count
val data = lines.map(parseVector _) // RDD[Vector]
data.foreach(println)
val K = 6
val convergeDist = 0.1
val kPoint = data.takeSample(withReplacement = false, K, 42) // Array[Vector]
kPoint.foreach(println)
var tempDist = 1.0
while(tempDist > convergeDist) {
val closest = data.map(p => (closestPoint(p, kPoint), (p, 1)))
val pointStat = closest.reduceByKey{case ((x1, y1), (x2, y2)) =>
(x1+x2, y1+y2)}
val newKPoint = pointStat.map{pair => (
pair._1,pair._2._1/pair._2._2)}.collectAsMap()
tempDist = 0.0
for(i <- 0 until K) {
tempDist += kPoint(i).squaredDist(newKPoint(i))
}
for(newP <- newKPoint) {
kPoint(newP._1) = newP._2
}
println("Finish iteration (delta=" + tempDist + ")")
}
println("Finish centers: ")
kPoint.foreach(println)
System.exit(0)
}
}
When I run it in local mode, the log info is as follows:
..................
14/03/31 11:29:15 INFO HadoopRDD: Input split: hdfs://hadoop-01:9000/data/synthetic_control.data:0+288374
The program then blocks and does not continue.
Can anyone help me?