I am training an org.apache.spark.mllib.recommendation.ALS model on a quite big RDD rdd. I'd like to select a decent regularization hyperparameter so that my model doesn't over- (or under-) fit. To do so, I split rdd (using randomSplit) into a train set and a test set and perform cross-validation with a defined set of hyperparameters on these.
As I'm using the train and test RDDs several times in the cross-validation, it seems natural to cache() the data at some point for faster computation. However, my Spark knowledge is quite limited and I'm wondering which of these two options is better (and why):
Cache the initial RDD rdd before splitting it, that is:
val train_proportion = 0.75
val seed = 42
rdd.cache()
val split = rdd.randomSplit(Array(train_proportion, 1 - train_proportion), seed)
val train_set = split(0)
val test_set = split(1)
Cache the train and test RDDs after splitting the initial RDD:
val train_proportion = 0.75
val seed = 42
val split = rdd.randomSplit(Array(train_proportion, 1 - train_proportion), seed)
val train_set = split(0).cache()
val test_set = split(1).cache()
My speculation is that option 1 is better because randomSplit would also benefit from the fact that rdd is cached, but I'm not sure whether it would negatively impact the (multiple) future accesses to train_set and test_set compared to option 2.
This answer seems to confirm my intuition, but it received no feedback, so I'd like to be sure by asking here.
What do you think? And more importantly: Why?
Please note that I have run the experiment on a Spark cluster, but it is often busy these days, so my conclusions may be wrong. I also checked the Spark documentation and found no answer to my question.
If the calculations on the RDD are made before the split, then it is better to cache it first, as (in my experience) all the transformations will run only once, triggered by the first action after cache().
I initially supposed that split() cache() cache() were 3 actions vs. cache() split() 2. EDIT: cache is not an action; it only marks the RDD for persistence, which happens when the next action runs.
And indeed I found confirmation in other similar questions around the web.
Edit: to clarify my first sentence: the DAG will perform all the transformations on the RDD and then cache it, so everything computed from it afterwards needs no recomputation of the parent, although the split parts themselves will be recalculated on each access unless they are cached too.
In conclusion, should you run heavier transformations on the split parts than on the original RDD itself, you would want to cache them instead. (I hope someone will back me up here.)
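In practice you can combine both options: cache the parent once so randomSplit reads it from memory, and cache the two halves because the cross-validation loop reuses them. A minimal sketch (when exactly to unpersist the parent is a judgment call; it only makes sense once both splits have been materialized):

rdd.cache() // randomSplit below will read the parent from memory

val split = rdd.randomSplit(Array(train_proportion, 1 - train_proportion), seed)
val train_set = split(0).cache()
val test_set = split(1).cache()

// After the first action has materialized both splits (e.g. after the first
// cross-validation iteration), the parent's cached blocks can be freed.
rdd.unpersist()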
I work with graphs in GraphX. Using the code below, I created a variable that stores the neighbors of each node in an RDD:
val all_neighbors: VertexRDD[Array[VertexId]] = graph.collectNeighborIds(EdgeDirection.Either)
I used a broadcast variable to send the neighbors to all workers with the following code:
val broadcastVar = all_neighbors.collect().toMap
val nvalues = sc.broadcast(broadcastVar)
I want to compute the intersection between the neighbor sets of two nodes, for example between the neighbors of node 1 and node 2.
At first I used this code for computing the intersection, which uses the broadcast variable nvalues:
val common_neighbors=nvalues.value(1).intersect(nvalues.value(2))
and then I used the code below for computing the intersection of two nodes:
val common_neighbors2=(all_neighbors.filter(x=>x._1==1)).intersection(all_neighbors.filter(x=>x._1==2))
My question is this: which of the above methods is more efficient, distributed and parallel? Using the broadcast variable nvalues for computing the intersection, or filtering the RDD?
I think it depends on the situation.
In the case where your nvalues is small enough to fit into each executor and the driver node, the broadcast approach is optimal: the data is cached on the executors and is not recomputed over and over again, which saves Spark a huge communication and compute burden. In such cases the other approach is not optimal, since the all_neighbors RDD may be recalculated every time, and the many recomputations will decrease performance and increase the computation cost.
In the case where nvalues cannot fit into each executor and the driver node, broadcasting will not work, as it will throw an error. Then there is no option left but to use the second approach; it might still cause performance issues, but at least the code will work!
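As a middle ground, if you stay with the RDD approach, caching all_neighbors once avoids the repeated recomputation described above. A sketch (the vertex ids 1 and 2 are just the example ids from the question, and neighborsOf is a hypothetical helper):

all_neighbors.cache() // computed once, reused by every lookup below

// Hypothetical helper: neighbors of a single vertex, collected to the driver.
def neighborsOf(v: VertexId): Array[VertexId] =
  all_neighbors.lookup(v).headOption.getOrElse(Array.empty[VertexId])

val common_neighbors = neighborsOf(1L).intersect(neighborsOf(2L))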
Let me know if this helps!
I am working on pair RDDs. My aim is to calculate the Jaccard similarity
between the sets of RDD values and cluster them according to a Jaccard similarity threshold value. The structure of my RDD is:
val a: RDD[(String, Set[String])] // pair RDD of (Key, Set(String))
For example:
India,[Country,Place,....]
USA,[Country,State,..]
Berlin,[City,PopulatedPlace,..]
After finding the Jaccard similarity, I will cluster similar entities into one cluster. In the above example, India and USA will be clustered together based on some threshold value, whereas Berlin will be in another cluster.
So I took the Cartesian product of RDD a:
val filterOnjoin = a.cartesian(a).filter(f =>
  !f._1._1.toString().contentEquals(f._2._1.toString()))
// Cartesian product of rdd a, filtering out rows with the same key at both
// positions.
// e.g. ((India,Set[Country,Place,....]),(USA,Set[Country,State,..]))
and compared the sets of values with the help of the Jaccard similarity:
val Jsim = filterOnjoin.map(f => (f._1._1, (f._2._1,
  Similarity.sim(f._1._2, f._2._2)))) // calculating Jaccard similarity
// e.g. (India,(USA,0.8))
The code runs fine on a smaller dataset. As the size of the dataset increases, the Cartesian product takes too much time. For 100 MB of data (the size of RDD a), the shuffle read is around 25 GB. For 3.5 GB of data, it is in the TB range.
I have gone through various links, like Spark tuning methods and some on Stack Overflow, but in most of the posts it is suggested to broadcast the smaller RDD. Here both RDDs are the same size, and big.
Links which I followed :-
Spark: produce RDD[(X, X)] of all possible combinations from RDD[X]
Spark repartition is slow and shuffles too much data
Map key, value pair based on similarity of their value in Spark
I am new to Spark and Scala, and I am unable to think beyond the Cartesian product, which is the bottleneck here. Is it possible to solve this problem without the Cartesian product?
As the Cartesian product is an expensive operation on RDDs, I tried to solve the above problem using the HashingTF and MinHashLSH classes present in Spark MLlib for finding the Jaccard similarity. Steps to find the Jaccard similarity in RDD a mentioned in the question:
Convert the rdd into a DataFrame:
import sparkSession.implicits._
val dfA = a.toDF("id", "values")
Create the feature vector with the help of HashingTF
val hashingTF = new HashingTF()
.setInputCol("values").setOutputCol("features").setNumFeatures(1048576)
Feature transformation
val featurizedData = hashingTF.transform(dfA) //Feature Transformation
Create the MinHash table. The higher the number of hash tables, the more accurate
the results will be, but the higher the communication cost and run time:
val mh = new MinHashLSH()
.setNumHashTables(3)
.setInputCol("features")
.setOutputCol("hashes")
Approximate similarity join takes two datasets and approximately returns pairs of rows in the datasets whose distance is smaller than a user-defined threshold. Approximate similarity join supports both joining two different datasets and self-joining. Self-joining will produce some duplicate pairs.
val model = mh.fit(featurizedData)
//Approximately joining featurizedData with Jaccard distance smaller
//than 0.45
val dffilter = model.approxSimilarityJoin(featurizedData, featurizedData,
0.45)
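The join result can then be flattened back to (id, id, distance) triples; approxSimilarityJoin returns struct columns datasetA and datasetB plus a distCol. A sketch of the post-processing (uses the sparkSession.implicits._ import from step 1):

val pairs = dffilter
  .select($"datasetA.id".alias("idA"), $"datasetB.id".alias("idB"), $"distCol")
  .where($"idA" < $"idB") // drop self-matches and mirrored duplicates from the self-join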
Since in Spark we have to do manual optimization in our code, like setting the number of partitions and the persist level, I configured these parameters as well.
Changing the storage level from persist() to persist(StorageLevel.MEMORY_AND_DISK)
helped me to remove OOM errors.
Also, while doing the join operation, I re-partitioned the data according to the RDD size. On a 16.6 GB dataset, a simple join operation used 200 partitions; increasing this to 600 also solved my OOM problems.
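For reference, a sketch of the two settings described above (the partition count 600 is just what worked on the 16.6 GB run; tune it for your data):

import org.apache.spark.storage.StorageLevel

// Spill cached blocks to disk instead of failing with OOM.
featurizedData.persist(StorageLevel.MEMORY_AND_DISK)

// More, smaller partitions reduce per-task memory pressure during the join.
val repartitionedData = featurizedData.repartition(600)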
PS: the constant parameters setNumFeatures(1048576) and setNumHashTables(3) were configured while experimenting on the 16.6 GB dataset. You can increase or decrease these values according to your dataset. Also, the number of partitions depends on your dataset size. With these optimizations, I got the desired results.
Useful links:
https://spark.apache.org/docs/2.2.0/ml-features.html#locality-sensitive-hashing
https://eng.uber.com/lsh/
https://data-flair.training/blogs/limitations-of-apache-spark/
I have a DataFrame of user ratings (from 1 to 5) for movies. In order to get a DataFrame where the first column is the movie id and the remaining columns are the ratings for that movie by each user, I do the following:
val ratingsPerMovieDF = imdbRatingsDF
.groupBy("imdbId")
.pivot("userId")
.max("rating")
Now, here I get a DataFrame where most of the values are null, due to the fact that most users have rated only a few movies.
I'm interested in calculating similarities between those movies (item-based collaborative filtering).
I was trying to assemble a RowMatrix (for further similarity calculations using mllib) from the rating column values. However, I don't know how to deal with the null values.
The following code, where I try to get a Vector for each row:
val assembler = new VectorAssembler()
  .setInputCols(ratingsPerMovieDF.columns.filter(_ != "imdbId"))
  .setOutputCol("ratings")

val ratingsDF = assembler.transform(ratingsPerMovieDF).select("imdbId", "ratings")
Gives me an error:
Caused by: org.apache.spark.SparkException: Values to assemble cannot be null.
I could substitute the nulls with 0s using .na.fill(0), but that would produce incorrect correlation results, since almost all the Vectors would become very similar.
Can anyone suggest what to do in this case? The end goal here is to calculate correlations between rows. I was thinking of using SparseVectors somehow (to ignore the null values), but I don't know how.
I'm new to Spark and Scala, so some of this might make little sense. I'm trying to understand things better.
I believe you are approaching this the wrong way. Dealing with the nuances of the Spark API is secondary to a proper problem definition - what exactly do you mean by correlation in the case of sparse data?
Filling the data with zeros in the case of explicit feedback (ratings) is problematic, not because all the Vectors would become very similar (the variation of the metric will be driven by the existing ratings, and results can always be rescaled using a min-max scaler), but because it introduces information which is not present in the original dataset. There is a significant difference between an item which hasn't been rated and an item which has the lowest possible rating.
Overall you can approach this problem in two ways:
You can compute pairwise similarity using only entries where both items have non-missing values. This should work reasonably well if the dataset is reasonably dense. It can be expressed as a self-join on the input dataset, in pseudocode:
imdbRatingsDF.alias("left")
  .join(imdbRatingsDF.alias("right"), Seq("userId"))
  .where($"left.imdbId" =!= $"right.imdbId")
  .groupBy($"left.imdbId", $"right.imdbId")
  .agg(similarity($"left.rating", $"right.rating"))
where similarity implements the required similarity metric.
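For example, if Pearson correlation over co-rated users is an acceptable metric, Spark's built-in corr aggregate can play the role of similarity directly. A sketch under that assumption:

import org.apache.spark.sql.functions.corr

val similarities = imdbRatingsDF.alias("left")
  .join(imdbRatingsDF.alias("right"), Seq("userId"))
  .where($"left.imdbId" =!= $"right.imdbId")
  .groupBy($"left.imdbId", $"right.imdbId")
  .agg(corr($"left.rating", $"right.rating").alias("similarity"))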
You can impute the missing ratings, for example using some measure of central tendency. Using the average (Replace missing values with mean - Spark Dataframe) is probably the most natural choice.
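A sketch of mean imputation on the pivoted DataFrame (ratingsPerMovieDF and its column layout are taken from the question; columns that are entirely null are skipped here and would need special handling):

import org.apache.spark.sql.functions.{avg, col}

val ratingCols = ratingsPerMovieDF.columns.filter(_ != "imdbId")

// Per-column means, collected to the driver as a Map(column -> mean).
val meanRow = ratingsPerMovieDF
  .select(ratingCols.map(c => avg(col(c)).alias(c)): _*)
  .first()
val meanMap = ratingCols.zipWithIndex
  .filter { case (_, i) => !meanRow.isNullAt(i) }
  .map { case (c, i) => c -> meanRow.getDouble(i) }
  .toMap

val imputedDF = ratingsPerMovieDF.na.fill(meanMap)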
More advanced imputation techniques might provide more reliable results, but likely won't scale very well in a distributed system.
Note
Using SparseVectors is essentially equivalent to na.fill(0), since the unset entries of a SparseVector are treated as zeros.
I'm using MLlib's matrix factorization to recommend items to users. I have a big implicit-feedback interaction matrix of about M = 20 million users and N = 50k items. After training the model, I want to get a short list (e.g. 200) of recommendations for each user. I tried recommendProductsForUsers in MatrixFactorizationModel, but it's very, very slow (it ran for 9 hours and was still far from finished; I'm testing with 50 executors, each with 8g memory). This might be expected, since recommendProductsForUsers needs to calculate all M*N user-item interactions and take the top k for each user.
I'll try using more executors, but from what I saw in the application detail on the Spark UI, I doubt it can finish in hours or even a day with 1000 executors (after 9 hours it was still in the flatMap here: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala#L279-L289 - 10000 total tasks and only ~200 finished).
Are there any other things that I can tune to speed up the recommendation process, besides increasing the number of executors?
Here is sample code:
val ratings = input.map(r => Rating(r.getString(0).toInt, r.getString(1).toInt, r.getLong(2))).cache()
val rank = 20
val alpha = 40
val maxIter = 10
val lambda = 0.05
val checkpointIterval = 5
val als = new ALS()
.setImplicitPrefs(true)
.setCheckpointInterval(checkpointIterval)
.setRank(rank)
.setAlpha(alpha)
.setIterations(maxIter)
.setLambda(lambda)
val model = als.run(ratings)
val recommendations = model.recommendProductsForUsers(200)
recommendations.saveAsTextFile(outdir)
@Jack Lei: Did you find the answer to this? I myself tried a few things, but they only helped a little.
For example, I tried:
javaSparkContext.setCheckpointDir("checkpoint/");
This helps because it avoids repeated computation in between.
I also tried adding more memory per executor and more Spark memory overhead:
--conf spark.driver.maxResultSize=5g --conf spark.yarn.executor.memoryOverhead=4000
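One more option, if upgrading is possible: the DataFrame-based spark.ml ALS (Spark 2.2+) has recommendForAllUsers, which computes the top-k in a blocked, BLAS-backed way and is usually considerably faster than the mllib equivalent. A sketch; ratingsDF and the column names here are assumptions:

import org.apache.spark.ml.recommendation.ALS

val als = new ALS()
  .setImplicitPrefs(true)
  .setRank(20)
  .setAlpha(40)
  .setMaxIter(10)
  .setRegParam(0.05)
  .setUserCol("user")
  .setItemCol("item")
  .setRatingCol("rating")

val model = als.fit(ratingsDF)             // ratingsDF: DataFrame(user, item, rating)
val topK = model.recommendForAllUsers(200) // 200 items per user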
In general, I want to compare the computing time between one large dataset and several split datasets in Spark with the same learning algorithm. The other reason is that I want to get the per-partition model results.
However, the result shows that the original way is faster than the "parallel" method. I expected the parallel run on the split datasets to be faster, but I do not know how to set it up.
How can I adjust the parameters to get what I want? Or can I stop using partitions, as in the original method in Spark?
The original:
val lr = new LogisticRegression()
val lrModel = lr.fit(training)
The parallel:
val lr = new LogisticRegression()
val split = training.randomSplit(Array(1,1,.....,1), 11L)
val lrModels = for (tran <- split) yield lr.fit(tran)
The first snippet, the "original", is also parallelized. To understand this, please look at the Spark execution model.
In the first example, Spark has one large dataset. Spark splits it into partitions and computes each partition in a separate task. In the second example, you split your data manually (of course, internally the data is still split into partitions as well). Then you invoke fit - however, in a loop, so one model is calculated, then the next one, and so on. So the "parallel" example is no more parallel than the first one, and I'm not surprised that the first code runs faster.
In the first example you build one model; in the second you build several models. Each model build is itself executed on many threads, but each fit() in the second example is invoked only after the previous calculation has finished.
You can stop parallelism via the repartition method with parameter value 1, but stopping parallelism is not a solution for the first example. You have just shown that the iterative approach is slower than the parallel one :)
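If the goal really is to fit the independent models at the same time, one possible sketch is to launch the fit calls concurrently from the driver using Scala parallel collections, so Spark receives all the jobs at once (this assumes the cluster has spare resources; the 10 splits below are an arbitrary example):

val lr = new LogisticRegression()
val split = training.randomSplit(Array.fill(10)(1.0), 11L)

// Each fit runs in its own driver thread, so the jobs are submitted
// concurrently instead of one after another.
val lrModels = split.par.map(part => lr.fit(part)).seq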