In general, I want to compare the computing time of training on one large dataset versus training on several split datasets in Spark, using the same learning algorithm. I also want to obtain a separate model for each partition.
However, the result shows that the original way is faster than my "parallel" method. I expected the parallel run over the split datasets to be faster, but I do not know how to set it up.
How can I adjust the parameters to get what I want?
Or can I stop Spark from using partitions in the original method?
The original:
val lr = new LogisticRegression()
val lrModel = lr.fit(training)
The parallel:
val lr = new LogisticRegression()
val split = training.randomSplit(Array(1,1,.....,1), 11L)
for (train <- split)
  lrModels = lr.fit(train)
The first snippet, the "original" one, is also parallelized. To understand why, look at the Spark execution model.
In the first example, Spark has one large dataset. Spark splits it into partitions and computes each partition in a separate task. In the second example, you split your data manually (and of course each piece is internally split into partitions as well). Then you invoke fit - but inside a loop, so one model is computed, then the next one, and so on. So the "parallel" example is not more parallel than the first one, and I'm not surprised that the first code runs faster.
In the first example you build one model; in the second you build several models. Each model build runs on many threads, but each fit() in the second example starts only after the previous one has finished.
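If the goal really is to train the sub-models concurrently, a minimal sketch would be to submit the fits from parallel driver threads (this is an assumption on my part, not something from the question; it relies on Scala parallel collections, available without an extra module on Scala 2.12 and earlier, and on your cluster having spare capacity for several concurrent jobs):
val lrModels = split.par.map(part => lr.fit(part)).toArray  // each fit submits its own Spark jobs; the driver threads run them concurrently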
You can stop parallelism via the repartition method with a parameter value of 1; however, stopping parallelism in the first example is not a solution to your problem. You have just shown that the iterative approach is slower than the parallel one :)
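For completeness, a minimal sketch of removing parallelism from the first example (same training DataFrame as in the question) - expect it to be slower, not faster:
val lrModelSingle = new LogisticRegression().fit(training.repartition(1))  // all data lands in one partition, so a single task does the work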
I work on graphs in GraphX. Using the code below, I created a variable that stores each node's neighbor IDs in an RDD:
val all_neighbors: VertexRDD[Array[VertexId]] = graph.collectNeighborIds(EdgeDirection.Either)
I used a broadcast variable to broadcast the neighbors to all workers, using the code below:
val broadcastVar = all_neighbors.collect().toMap
val nvalues = sc.broadcast(broadcastVar)
I want to compute the intersection of two nodes' neighbor sets, for example the intersection between the neighbors of node 1 and node 2.
At first I used this code, which computes the intersection via the broadcast variable nvalues:
val common_neighbors=nvalues.value(1).intersect(nvalues.value(2))
and then I used the code below, which computes the intersection of the two nodes directly on the RDD:
val common_neighbors2 = all_neighbors.filter(x => x._1 == 1).intersection(all_neighbors.filter(x => x._1 == 2))
My question is this: which of the above methods is more efficient and more distributed and parallel - using the broadcast variable nvalues to compute the intersection, or filtering the RDD?
I think it depends on the situation.
If nvalues is small and can fit into each executor and the driver node, the broadcasting approach is optimal: the data is cached on the executors and is not recomputed over and over again, which saves Spark a huge communication and compute burden. In such cases the other approach is suboptimal, because the all_neighbors RDD may be recomputed on every access; the repeated recomputation decreases performance and increases computation cost.
If nvalues cannot fit into each executor and the driver node, broadcasting will not work and will throw an error. In that case there is no option left but to use the second approach; it might still cause performance issues, but at least the code will work!
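If you do end up with the RDD approach, here is a sketch of one mitigation (an assumption on my part, not part of the original answer): cache all_neighbors so it is not recomputed for every lookup, and flatten the neighbor arrays before intersecting:
val cachedNeighbors = all_neighbors.cache()  // avoid recomputing the VertexRDD on every filter
val n1 = cachedNeighbors.filter { case (id, _) => id == 1L }.flatMap { case (_, nbrs) => nbrs }
val n2 = cachedNeighbors.filter { case (id, _) => id == 2L }.flatMap { case (_, nbrs) => nbrs }
val common_neighbors2 = n1.intersection(n2)  // RDD of the shared neighbor ids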
Let me know if it helps!!
I am currently trying to use Python's LinearRegression() model to describe the relationship between two variables X and Y. Given a dataset with 8 columns and 1000 rows, I want to split this dataset into training and test sets using train_test_split.
My question: what is the difference between train_test_split(dataset, test_size, random_state=int) and train_test_split(dataset, test_size)? Does the second one (without setting random_state) give me a different test set and training set each time I re-run my program? Does the first one give me the same test set and training set every time I re-run my program? And what is the difference between setting random_state=42 versus random_state=43, for example?
In scikit-learn, train_test_split splits your input data into two sets: i) train and ii) test. Its random_state argument controls the randomness of the split.
If the argument is not set, a different random split is produced on every run, so you will generally get different train and test sets each time. If you pass a fixed random_state, you get the same split for the same dataset on every run.
Suppose you want to randomly split the data so that you can measure the performance of your regression on the same data with different splits: you can use random_state to achieve that. Each random state gives you a different pseudo-random split of your initial data, and to keep track of a result and reproduce it later on the same data, you reuse the random_state value you used before.
This is useful for cross-validation in machine learning.
So, I understand that in general one should use coalesce() when:
the number of partitions decreases due to a filter or some other operation that may result in reducing the original dataset (RDD, DF). coalesce() is useful for running operations more efficiently after filtering down a large dataset.
I also understand that it is less expensive than repartition as it reduces shuffling by moving data only if necessary. My problem is how to define the parameter that coalesce takes (idealPartionionNo). I am working on a project that was passed to me by another engineer, and he was using the calculation below to compute the value of that parameter.
// DEFINE OPTIMAL PARTITION NUMBER
implicit val NO_OF_EXECUTOR_INSTANCES = sc.getConf.getInt("spark.executor.instances", 5)
implicit val NO_OF_EXECUTOR_CORES = sc.getConf.getInt("spark.executor.cores", 2)
val idealPartionionNo = NO_OF_EXECUTOR_INSTANCES * NO_OF_EXECUTOR_CORES * REPARTITION_FACTOR
This is then used with a partitioner object:
val partitioner = new HashPartitioner(idealPartionionNo)
but also used with:
RDD.filter(x => x._3 < 30).coalesce(idealPartionionNo)
Is this the right approach? What is the main idea behind the idealPartionionNo value computation? What is the REPARTITION_FACTOR? How do I generally work to define that?
Also, since YARN is responsible for identifying the available executors on the fly is there a way of getting that number (AVAILABLE_EXECUTOR_INSTANCES) on the fly and use that for computing idealPartionionNo (i.e. replace NO_OF_EXECUTOR_INSTANCES with AVAILABLE_EXECUTOR_INSTANCES)?
Ideally, some actual examples of the form:
Here's a dataset (size);
Here's a number of transformations and possible reuses of an RDD/DF.
Here is where you should repartition/coalesce.
Assume you have n executors with m cores and a partition factor equal to k, then:
The ideal number of partitions would be ==> ???
Also, if you can refer me to a nice blog that explains these I would really appreciate it.
In practice, the optimal number of partitions depends more on the data you have, the transformations you use, and the overall configuration than on the available resources.
If the number of partitions is too low you'll experience long GC pauses, different types of memory issues, and lastly suboptimal resource utilization.
If the number of partitions is too high then maintenance cost can easily exceed processing cost. Moreover, if you use non-distributed reducing operations (like reduce in contrast to treeReduce), a large number of partitions results in a higher load on the driver.
You can find a number of rules which suggest oversubscribing partitions compared to the number of cores (a factor of 2 or 3 seems to be common) or keeping partitions at a certain size, but this doesn't take your own code into account:
If you allocate a lot you can expect long GC pauses and it is probably better to go with smaller partitions.
If a certain piece of code is expensive then your shuffle cost can be amortized by a higher concurrency.
If you have a filter you can adjust the number of partitions based on the discriminative power of the predicate (you make different decisions if you expect to retain 5% of the data versus 99% of the data), as in the sketch just after this list.
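As a sketch of that last rule (the RDD name, the predicate, and the 5% selectivity estimate are all illustrative assumptions):
val expectedRetained = 0.05  // we expect the predicate to keep roughly 5% of the rows
val shrunkPartitions = math.max(1, (rdd.getNumPartitions * expectedRetained).toInt)
val compacted = rdd.filter(predicate).coalesce(shrunkPartitions)  // fewer, reasonably sized partitions after the filter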
In my opinion:
With one-off jobs, keep a higher number of partitions to stay on the safe side (slower is better than failing).
With reusable jobs, start with a conservative configuration, then execute - monitor - adjust configuration - repeat.
Don't try to use a fixed number of partitions based on the number of executors or cores. First understand your data and code, then adjust the configuration to reflect your understanding.
Usually it is relatively easy to determine the amount of raw data per partition for which your cluster exhibits stable behavior (in my experience it is somewhere in the range of a few hundred megabytes, depending on the format, the data structure you use to load the data, and the configuration). This is the "magic number" you're looking for.
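One way to put that magic number to work (all sizes here are made-up placeholders, not measurements):
val targetPartitionBytes = 256L * 1024 * 1024                               // assume the stable size is ~256 MB per partition
val inputBytes = 500L * 1024 * 1024 * 1024                                  // assume ~500 GB of raw input
val numPartitions = math.max(1, (inputBytes / targetPartitionBytes).toInt)  // 2000 partitions in this example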
Some things you have to remember in general:
The number of partitions doesn't necessarily reflect the data distribution. Any operation that requires a shuffle (*byKey, join, RDD.partitionBy, Dataset.repartition) can result in a non-uniform data distribution. Always monitor your jobs for symptoms of significant data skew.
Number of partitions in general is not constant. Any operation with multiple dependencies (union, coGroup, join) can affect the number of partitions.
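A tiny illustration of that last point (the partition counts are made up for the example):
val a = sc.parallelize(1 to 100, 4)
val b = sc.parallelize(1 to 100, 6)
a.union(b).getNumPartitions                                   // 10 - union concatenates its parents' partitions
a.map(x => (x, x)).join(b.map(x => (x, x))).getNumPartitions  // chosen by the default partitioner (typically the larger parent, unless spark.default.parallelism is set)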
Your question is a valid one, but Spark partitioning optimization depends entirely on the computation you're running. You need to have a good reason to repartition/coalesce; if you're just counting an RDD (even if it has a huge number of sparsely populated partitions), then any repartition/coalesce step is just going to slow you down.
Repartition vs coalesce
The difference between repartition(n) (which is the same as coalesce(n, shuffle = true)) and coalesce(n, shuffle = false) has to do with the execution model. The shuffle model takes each partition in the original RDD, randomly sends its data around to all executors, and results in an RDD with the new (smaller or greater) number of partitions. The no-shuffle model creates a new RDD in which a single task loads multiple parent partitions.
Let's consider this computation:
sc.textFile("massive_file.txt")
.filter(sparseFilterFunction) // leaves only 0.1% of the lines
.coalesce(numPartitions, shuffle = shuffle)
If shuffle is true, then the text file / filter computations happen in a number of tasks given by the defaults in textFile, and the tiny filtered results are shuffled. If shuffle is false, then the number of total tasks is at most numPartitions.
If numPartitions is 1, then the difference is quite stark. The shuffle model will process and filter the data in parallel, then send the 0.1% of filtered results to one executor for downstream DAG operations. The no-shuffle model will process and filter the data all on one core from the beginning.
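To make that concrete, here are the two variants side by side, reusing the names from the snippet above:
val shuffled = sc.textFile("massive_file.txt")
  .filter(sparseFilterFunction)
  .coalesce(1, shuffle = true)   // reading and filtering stay parallel; only the tiny result is shuffled onto one partition
val collapsed = sc.textFile("massive_file.txt")
  .filter(sparseFilterFunction)
  .coalesce(1, shuffle = false)  // the whole read-and-filter pipeline runs as a single task on one core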
Steps to take
Consider your downstream operations. If you're just using this dataset once, then you probably don't need to repartition at all. If you are saving the filtered RDD for later use (to disk, for example), then consider the tradeoffs above. It takes experience to become familiar with these models and when one performs better, so try both out and see how they perform!
As others have answered, there is no formula that calculates what you ask for. That said, you can make an educated guess for the first part and then fine-tune it over time.
The first step is to make sure you have enough partitions. If you have NO_OF_EXECUTOR_INSTANCES executors and NO_OF_EXECUTOR_CORES cores per executor then you can process NO_OF_EXECUTOR_INSTANCES*NO_OF_EXECUTOR_CORES partitions at the same time (each would go to a specific core of a specific instance).
That said, this assumes everything is divided equally between the cores and everything takes exactly the same time to process. This is rarely the case. There is a good chance that some tasks finish before others, either because of locality (e.g. the data needs to come from a different node) or simply because the partitions are not balanced (e.g. if you have data partitioned by root domain, then partitions including google would probably be quite big). This is where the REPARTITION_FACTOR comes into play. The idea is that we "overbook" each core, so that if one task finishes very quickly and another finishes slowly, we have the option of dividing the remaining work between them. A factor of 2-3 is generally a good idea.
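As a sketch, here is the question's formula with the factor filled in (the factor value itself is the only assumption):
val NO_OF_EXECUTOR_INSTANCES = sc.getConf.getInt("spark.executor.instances", 5)
val NO_OF_EXECUTOR_CORES = sc.getConf.getInt("spark.executor.cores", 2)
val REPARTITION_FACTOR = 3   // the "overbooking" factor; 2-3 is generally a good starting point
val startingPartitions = NO_OF_EXECUTOR_INSTANCES * NO_OF_EXECUTOR_CORES * REPARTITION_FACTOR  // 5 * 2 * 3 = 30 with the defaults above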
Now let's take a look at the size of a single partition. Let's say your entire dataset is X MB in size and you have N partitions. Each partition would then be X/N MB on average. If N is large relative to X you might end up with a very small average partition size (e.g. a few KB). In this case it is usually a good idea to lower N, because the overhead of managing each partition becomes too high. On the other hand, if the size is very large (e.g. a few GB), then you need to hold a lot of data at the same time, which causes issues such as garbage collection pressure, high memory usage, etc.
The optimal size is a good question; generally people seem to prefer partitions of 100-1000 MB, though in truth tens of MB would probably also be fine.
Another thing you should note is how your partitions change as the calculation progresses. For example, say you start with 1000 partitions of 100 MB each but then filter the data so that each partition becomes 1 KB; you should probably coalesce. Similar issues can happen when you do a groupBy or join: in such cases both the size of the partitions and the number of partitions change and may reach an undesirable size.
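A sketch of that filter example (the RDD name and predicate are placeholders; the sizes are the ones from the paragraph above):
val filtered = data.filter(verySelectivePredicate)  // 1000 partitions of ~100 MB shrink to ~1 KB each
val compacted = filtered.coalesce(1)                // with ~1 MB of data left in total, one partition (or a handful) is plenty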
I am building a recommendation system for retail purposes, using Python and Spark.
I am trying to subtract from my predictions all user-product combinations that also occur in the ratings (so that I only predict values for products a user has never bought before).
Those two RDDs are pretty large and give me memory issues with 28 GB per worker node (3 nodes) when I do
filter_predictions = predictions.subtractByKey(user_boughtproduct)
According to the Spark documentation, subtractByKey is optimal when using one large and one small RDD.
I cannot make user_boughtproduct smaller (unless I loop over it), but I could instead compute:
filter_predictions = predictions.join(user_nonBoughtProduct)
Any thoughts on which of these is faster or best practice? Or is there another, cleaner solution?
subtractByKey pushes the filter after the co-grouping and doesn't have to touch the right-side values, so it should be slightly more efficient than using an outer join and a filter after flattening.
If you use Spark 2.0+ and your records can be encoded using Dataset encoders, you can consider a left anti join, but depending on the rest of your code the cost of moving the data can negate the benefits of the optimized execution.
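For reference, a minimal Scala sketch of that left anti join (predictionsDF and boughtDF are assumed DataFrame versions of the two RDDs, keyed by user and product columns; the same join type is available from Python as "left_anti"):
val filteredPredictions = predictionsDF.join(boughtDF, Seq("user", "product"), "left_anti")  // keeps only predictions whose key has no match in the purchases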
Finally, if you can accept potential data loss, then building a Bloom filter on the right RDD and using it to filter the left one can give really good results without any shuffling.
I am training an org.apache.spark.mllib.recommendation.ALS model on a quite big RDD rdd. I'd like to select a decent regularization hyperparameter so that my model doesn't over- (or under-) fit. To do so, I split rdd (using randomSplit) into a train set and a test set and perform cross-validation with a defined set of hyperparameters on these.
As I'm using the train and test RDDs several times in the cross-validation it seems natural to cache() the data at some point for faster computation. However, my Spark knowledge is quite limited and I'm wondering which of these two options is better (and why):
Cache the initial RDD rdd before splitting it, that is:
val train_proportion = 0.75
val seed = 42
rdd.cache()
val split = rdd.randomSplit(Array(train_proportion, 1 - train_proportion), seed)
val train_set = split(0)
val test_set = split(1)
Cache the train and test RDDs after splitting the initial RDD:
val train_proportion = 0.75
val seed = 42
val split = rdd.randomSplit(Array(train_proportion, 1 - train_proportion), seed)
val train_set = split(0).cache()
val test_set = split(1).cache()
My speculation is that option 1 is better because the randomSplit would also benefit from the fact that rdd is cached, but I'm not sure whether it would negatively impact the (multiple) future accesses to train_set and test_set with respect to option 2.
This answer seems to confirm my intuition, but it received no feedback, so I'd like to be sure by asking here.
What do you think? And more importantly: Why?
Please note that I have run the experiment on a Spark cluster, but it is often busy these days so my conclusions may be wrong. I also checked the Spark documentation and found no answer to my question.
If the calculations on the RDD are made before the split, then it is better to cache it beforehand, as (in my experience) all the transformations up to that point will run only once, the first time the cached RDD is materialized by an action. Note that cache() itself is not an action - it only marks the RDD for persistence - so neither option adds extra jobs by itself.
And indeed I found confirmation in other similar questions around the web.
Edit: to clarify my first sentence: the DAG will perform all the transformations on the RDD and then cache it, so everything done with it afterwards will not need to recompute those transformations, although the split parts themselves will still be recalculated each time they are used.
In conclusion, if you run heavier transformations on the split parts than on the original RDD itself, you would want to cache them instead. (I hope someone will back me up here.)
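For what it's worth, a sketch of combining the two options when memory allows (an assumption, not something stated in the question): cache the parent so both sampled outputs read it cheaply, and cache the splits because cross-validation reuses them many times:
val train_proportion = 0.75
val seed = 42
rdd.cache()  // each split re-reads the parent when it is materialized, so this avoids recomputing rdd twice
val split = rdd.randomSplit(Array(train_proportion, 1 - train_proportion), seed)
val train_set = split(0).cache()  // reused on every cross-validation iteration
val test_set = split(1).cache()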