How to distribute data across various worker nodes in SPARK in DATABRICKS? - pyspark

Like, how can we distribute a table's data across different worker nodes in Spark and ensure that the operations we run on that data execute in parallel? And shouldn't .parallelize() be called .distribute(), since it splits the dataset across the worker nodes in the Spark cluster?
I looked at the Spark and Databricks documentation and I am confused between parallelization and distribution of data.

In Apache Spark, the process of dividing data into smaller partitions and processing each partition in parallel is called parallelization. The .parallelize() method is used to convert a collection in the driver program to an RDD (Resilient Distributed Dataset) that can be distributed across multiple nodes in a Spark cluster for parallel processing.
Distribution, on the other hand, refers to spreading the data across multiple nodes in the Spark cluster. This is done automatically by Spark when you perform operations on an RDD, such as filtering, mapping, or reducing. Spark takes care of the distribution of data so that each node can work on a separate partition in parallel.
The terms parallelization and distribution are often used interchangeably, but they are slightly different concepts in the context of Spark. To summarize, parallelization is about dividing a single dataset into smaller partitions for parallel processing, while distribution is about distributing these partitions across multiple nodes in the cluster for further processing.
Example to help illustrate the difference between parallelization and distribution in Apache Spark:
Suppose you have a large dataset that you want to process using Spark. To start, you would create an RDD (Resilient Distributed Dataset) from your dataset using the .parallelize() method. This will divide your dataset into smaller partitions, each of which can be processed in parallel. This is parallelization.
Next, Spark will automatically distribute the partitions of your RDD across multiple nodes in the Spark cluster. Each node will receive one or more partitions and will process the data in those partitions in parallel with the other nodes. This is distribution.
In other words, parallelization is about dividing the data into smaller units for processing, while distribution is about spreading these units across multiple nodes for processing in parallel. This way, Spark can process large datasets much faster than if you processed the data on a single node.
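A minimal PySpark sketch of both ideas (assuming a Databricks notebook, where spark and sc already exist; the sizes and partition counts are illustrative):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()   # already defined in a Databricks notebook
sc = spark.sparkContext

# Parallelization: split a driver-side collection into 8 partitions.
rdd = sc.parallelize(range(1_000_000), numSlices=8)
print(rdd.getNumPartitions())   # 8

# Distribution: Spark assigns those partitions to executors automatically,
# so the map and sum below run on every partition in parallel.
total = rdd.map(lambda x: x * x).sum()

# The same applies to DataFrames: repartition() controls how many partitions
# (and therefore how many parallel tasks) downstream operations use.
df = spark.range(1_000_000).repartition(8)
print(df.rdd.getNumPartitions())   # 8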

Related

PySpark: Efficient strategy of splitting my dataframe when writing to a delta table

I would like to know if there is an efficient strategy for writing my Spark dataframe to a Delta table in the data lake.
As a rule of thumb I partition the dataframe by a column that has between 70 and 300 distinct values.
The 'trick' I use to see which column is the candidate to use in the "partitionBy" is the following.
I transform my dataframe into a temporary table and look at the cardinality.
df.createOrReplaceTempView("my_table")
%sql
select
count(distinct(column1)) as column1,
count(distinct(column2)) as column2,
...
from my_table
Then I pick a column with a cardinality between 70 and 300, depending on the size of the table (mentally calculating table_size / 128 MB). Is this correct?
(df.write
    .partitionBy("column_candidate")
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .save(outputpath))
The method I use does not seem very scientific, and I would like to know if there is a better way to estimate it. I have also seen that there is something called "repartition", but I don't know how to use it or whether it would help.
How can I calculate the partitions in a more scientific way?
The number of partitions in Spark should be decided thoughtfully, based on the cluster configuration and the requirements of the application. Increasing the number of partitions makes each partition hold less data, or no data at all. Apache Spark can run one concurrent task for every partition of an RDD, up to the total number of cores in the cluster. If a cluster has 30 cores, you want your RDDs to have at least 30 partitions, or maybe 2 or 3 times that.
Some commonly cited guidelines for the number of partitions in Spark are as follows:
The number of partitions usually falls between 100 and 10K; the exact lower and upper bounds should be determined from the size of the cluster and the data.
The lower bound for Spark partitions is 2 x the number of cores in the cluster available to the application.
For the upper bound, each task should take at least 100 ms to execute. If tasks take less time than that, the partitions are probably too small and the application is spending a disproportionate amount of time scheduling tasks.
For more information, refer to this article.
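As a rough, hedged sketch of those guidelines in PySpark (the table size, the executor defaults and the 128 MB target are assumptions to replace with your own; df is a placeholder DataFrame):
total_cores = (int(sc.getConf().get("spark.executor.instances", "8"))
               * int(sc.getConf().get("spark.executor.cores", "4")))

table_size_bytes = 100 * 1024**3            # assumed ~100 GB table
partition_target_bytes = 128 * 1024**2      # the 128 MB rule of thumb from the question

lower_bound = 2 * total_cores                               # 2 x cores available to the app
size_based = table_size_bytes // partition_target_bytes     # ~800 partitions in this example
num_partitions = max(lower_bound, size_based)

df = df.repartition(int(num_partitions))    # then write with partitionBy() as above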

How to perform large computations on Spark

I have 2 tables in Hive, user and item, and I am trying to calculate the cosine similarity between two features of each table over the Cartesian product of the two tables, i.e. a cross join.
There are around 20000 users and 5000 items resulting in 100 million rows of calculation. I am running the compute using Scala Spark on Hive Cluster with 12 cores.
The code goes a little something like this:
val pairs = userDf.crossJoin(itemDf).repartition(100)
val results = pairs.mapPartitions(computeScore) // computeScore is a function to compute the similarity scores I need
The Spark job will always fail due to memory issues (GC Allocation Failure) on the Hadoop cluster. If I reduce the computation to around 10 million, it will definitely work - under 15 minutes.
How do I compute the whole set without increasing the hardware specifications? I am fine if the job takes longer to run and does not fail halfway.
If you take a look at the Spark documentation you will see that Spark offers different storage strategies for data management. These policies are enabled by the user via settings in the Spark configuration files or directly in the code or script.
Regarding the documented storage levels:
The "MEMORY_AND_DISK" level would be good for you, because if the data (RDD) does not fit in RAM, the remaining partitions are stored on the local disk. This strategy can be slow, though, if you have to access the disk often (see the sketch after the steps below).
There are a few steps to doing that (a rough sketch follows the list):
1. Check the expected data volume after the cross join and divide it by 200, since spark.sql.shuffle.partitions defaults to 200. If that works out to more than about 1 GB of raw data per partition, the default is too low.
2. Estimate the rough volume by calculating the size of each row and multiplying it by the other table's row count. This works much better with Parquet than with CSV files.
3. Set spark.sql.shuffle.partitions based on total data volume / 500 MB.
4. Set spark.shuffle.minNumPartitionsToHighlyCompress a little lower than the shuffle partition count.
5. Bucket the source Parquet data on the joining column for both files/tables.
6. Give the Spark executors generous memory, and manage the Java heap with the available heap space in mind.
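Putting the storage-level suggestion and steps 1-5 into a hedged PySpark sketch (every number and column name below is an assumption to be replaced with your own estimates; userDf and itemDf stand in for the two Hive tables, and join_col is hypothetical):
from pyspark import StorageLevel

# Steps 1-3: estimate the cross-join volume and size the shuffle from it.
row_bytes = 200                               # assumed average row size
total_bytes = row_bytes * 20_000 * 5_000      # row size x user rows x item rows
shuffle_partitions = max(200, total_bytes // (500 * 1024**2))   # ~500 MB per partition
spark.conf.set("spark.sql.shuffle.partitions", int(shuffle_partitions))

# Step 5: bucket both sources on the joining column;
# bucketBy() only works together with saveAsTable().
(userDf.write
    .bucketBy(64, "join_col")
    .sortBy("join_col")
    .format("parquet")
    .mode("overwrite")
    .saveAsTable("user_bucketed"))

# MEMORY_AND_DISK: spill partitions that do not fit in RAM to local disk
# instead of failing with memory errors.
pairs = userDf.crossJoin(itemDf).repartition(int(shuffle_partitions))
pairs.persist(StorageLevel.MEMORY_AND_DISK)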

How to calculate the best numberOfPartitions for coalesce?

So, I understand that in general one should use coalesce() when:
the number of partitions decreases due to a filter or some other operation that may result in reducing the original dataset (RDD, DF). coalesce() is useful for running operations more efficiently after filtering down a large dataset.
I also understand that it is less expensive than repartition as it reduces shuffling by moving data only if necessary. My problem is how to define the parameter that coalesce takes (idealPartionionNo). I am working on a project which was passed to me from another engineer and he was using the below calculation to compute the value of that parameter.
// DEFINE OPTIMAL PARTITION NUMBER
implicit val NO_OF_EXECUTOR_INSTANCES = sc.getConf.getInt("spark.executor.instances", 5)
implicit val NO_OF_EXECUTOR_CORES = sc.getConf.getInt("spark.executor.cores", 2)
val idealPartionionNo = NO_OF_EXECUTOR_INSTANCES * NO_OF_EXECUTOR_CORES * REPARTITION_FACTOR
This is then used with a partitioner object:
val partitioner = new HashPartitioner(idealPartionionNo)
but also used with:
RDD.filter(x=>x._3<30).coalesce(idealPartionionNo)
Is this the right approach? What is the main idea behind the idealPartionionNo value computation? What is the REPARTITION_FACTOR? How do I generally work to define that?
Also, since YARN is responsible for identifying the available executors on the fly is there a way of getting that number (AVAILABLE_EXECUTOR_INSTANCES) on the fly and use that for computing idealPartionionNo (i.e. replace NO_OF_EXECUTOR_INSTANCES with AVAILABLE_EXECUTOR_INSTANCES)?
Ideally, some actual examples of the form:
Here's a dataset (size);
Here's a number of transformations and possible reuses of an RDD/DF.
Here is where you should repartition/coalesce.
Assume you have n executors with m cores and a partition factor equal to k
then:
The ideal number of partitions would be ==> ???
Also, if you can refer me to a nice blog that explains these I would really appreciate it.
In practice, the optimal number of partitions depends more on the data you have, the transformations you use, and the overall configuration than on the available resources.
If the number of partitions is too low you'll experience long GC pauses, different types of memory issues, and lastly suboptimal resource utilization.
If the number of partitions is too high then maintenance cost can easily exceed processing cost. Moreover, if you use non-distributed reducing operations (like reduce in contrast to treeReduce), a large number of partitions results in a higher load on the driver.
You can find a number of rules which suggest oversubscribing partitions compared to the number of cores (factor 2 or 3 seems to be common) or keeping partitions at a certain size but this doesn't take into account your own code:
If you allocate a lot you can expect long GC pauses and it is probably better to go with smaller partitions.
If a certain piece of code is expensive then your shuffle cost can be amortized by a higher concurrency.
If you have a filter you can adjust the number of partitions based on a discriminative power of the predicate (you make different decisions if you expect to retain 5% of the data and 99% of the data).
In my opinion:
With one-off jobs keep a higher number of partitions to stay on the safe side (slower is better than failing).
With reusable jobs start with conservative configuration then execute - monitor - adjust configuration - repeat.
Don't try to use fixed number of partitions based on the number of executors or cores. First understand your data and code, then adjust configuration to reflect your understanding.
Usually, it is relatively easy to determine the amount of raw data per partition for which your cluster exhibits stable behavior (in my experience it is somewhere in the range of few hundred megabytes, depending on the format, data structure you use to load data, and configuration). This is the "magic number" you're looking for.
Some things you have to remember in general:
Number of partitions doesn't necessarily reflect data distribution. Any operation that requires a shuffle (*byKey, join, RDD.partitionBy, Dataset.repartition) can result in non-uniform data distribution. Always monitor your jobs for symptoms of significant data skew.
Number of partitions in general is not constant. Any operation with multiple dependencies (union, coGroup, join) can affect the number of partitions.
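One low-tech way to check for the skew mentioned above (illustrative only; df is a placeholder DataFrame, and counting every partition is itself a full pass over the data):
sizes = df.rdd.mapPartitions(lambda part: [sum(1 for _ in part)]).collect()
# A large spread between min and max record counts suggests skew.
print(f"partitions={len(sizes)}, min={min(sizes)}, max={max(sizes)}, "
      f"mean={sum(sizes) / len(sizes):.0f}")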
Your question is a valid one, but Spark partitioning optimization depends entirely on the computation you're running. You need to have a good reason to repartition/coalesce; if you're just counting an RDD (even if it has a huge number of sparsely populated partitions), then any repartition/coalesce step is just going to slow you down.
Repartition vs coalesce
The difference between repartition(n) (which is the same as coalesce(n, shuffle = true)) and coalesce(n, shuffle = false) has to do with the execution model. The shuffle model takes each partition in the original RDD, randomly sends its data around to all executors, and results in an RDD with the new (smaller or greater) number of partitions. The no-shuffle model creates a new RDD in which each partition is built by loading several of the original partitions in a single task.
Let's consider this computation:
sc.textFile("massive_file.txt")
.filter(sparseFilterFunction) // leaves only 0.1% of the lines
.coalesce(numPartitions, shuffle = shuffle)
If shuffle is true, then the text file / filter computations happen in a number of tasks given by the defaults in textFile, and the tiny filtered results are shuffled. If shuffle is false, then the number of total tasks is at most numPartitions.
If numPartitions is 1, then the difference is quite stark. The shuffle model will process and filter the data in parallel, then send the 0.1% of filtered results to one executor for downstream DAG operations. The no-shuffle model will process and filter the data all on one core from the beginning.
Steps to take
Consider your downstream operations. If you're just using this dataset once, then you probably don't need to repartition at all. If you are saving the filtered RDD for later use (to disk, for example), then consider the tradeoffs above. It takes experience to become familiar with these models and when one performs better, so try both out and see how they perform!
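The same computation sketched in PySpark terms (the file path, the predicate and the partition count are placeholders), showing where the shuffle flag goes:
num_partitions = 10                                    # placeholder target
sparse_filter = lambda line: "needle" in line          # stands in for sparseFilterFunction

lines = sc.textFile("massive_file.txt").filter(sparse_filter)

shuffled = lines.coalesce(num_partitions, shuffle=True)    # filter runs in parallel, the tiny result is shuffled
narrow = lines.coalesce(num_partitions, shuffle=False)     # at most num_partitions tasks do read + filter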
As others have answered, there is no formula that calculates what you ask for. That said, you can make an educated guess for the first part and then fine-tune it over time.
The first step is to make sure you have enough partitions. If you have NO_OF_EXECUTOR_INSTANCES executors and NO_OF_EXECUTOR_CORES cores per executor then you can process NO_OF_EXECUTOR_INSTANCES*NO_OF_EXECUTOR_CORES partitions at the same time (each would go to a specific core of a specific instance).
That said, this assumes everything is divided equally between the cores and everything takes exactly the same time to process. This is rarely the case. There is a good chance that some tasks finish before others, either because of locality (e.g. the data needs to come from a different node) or simply because the partitions are not balanced (e.g. if you have data partitioned by root domain, then partitions including google would probably be quite big). This is where the REPARTITION_FACTOR comes into play. The idea is that we "overbook" each core, so if one task finishes very quickly and another finishes slowly we have the option of dividing the remaining work between them. A factor of 2-3 is generally a good idea.
Now let's take a look at the size of a single partition. Let's say your entire data is X MB in size and you have N partitions. Each partition would then be X/N MB on average. If N is large relative to X you might have a very small average partition size (e.g. a few KB). In this case it is usually a good idea to lower N, because the overhead of managing each partition becomes too high. On the other hand, if the size is very large (e.g. a few GB) then you need to hold a lot of data at the same time, which causes issues such as garbage collection and high memory usage.
The optimal size is a good question, but generally people seem to prefer partitions of 100-1000 MB; in truth, tens of MB would probably also be fine.
Another thing to note is how your partitions change as the calculation progresses. For example, let's say you start with 1000 partitions of 100 MB each, but then filter the data so that each partition shrinks to about 1 KB; you should probably coalesce. Similar issues can happen when you do a groupBy or join: both the size and the number of partitions change and might reach an undesirable size.
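A PySpark version of the snippet from the question, with the overbooking factor discussed above (the config defaults and the factor of 2 are assumptions; rdd is a placeholder RDD of 3-element tuples):
num_executors = int(sc.getConf().get("spark.executor.instances", "5"))
cores_per_executor = int(sc.getConf().get("spark.executor.cores", "2"))
repartition_factor = 2        # "overbooking" factor; 2-3 is the commonly quoted range

ideal_partition_no = num_executors * cores_per_executor * repartition_factor

# Mirrors the question's RDD.filter(x => x._3 < 30).coalesce(idealPartionionNo).
filtered = rdd.filter(lambda x: x[2] < 30).coalesce(ideal_partition_no)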

Degrading performance when increasing number of slaves [duplicate]

I am doing a simple scaling test on Spark using a sort benchmark, from 1 core up to 8 cores. I notice that 8 cores is slower than 1 core.
//run spark using 1 core
spark-submit --master local[1] --class john.sort sort.jar data_800MB.txt data_800MB_output
//run spark using 8 cores
spark-submit --master local[8] --class john.sort sort.jar data_800MB.txt data_800MB_output
The input and output directories in each case are in HDFS.
1 core: 80 secs
8 cores: 160 secs
I would expect the 8-core run to show some amount of speedup.
Theoretical limitations
I assume you are familiar with Amdahl's law, but here is a quick reminder. The theoretical speedup is defined as:
S(s) = 1 / ((1 - p) + p / s)
where:
s is the speedup of the parallel part,
p is the fraction of the program that can be parallelized.
In practice the theoretical speedup is always limited by the part that cannot be parallelized, and even if p is relatively high (0.95) the theoretical limit is quite low (at most 1 / (1 - p) = 20x):
(Plot of Amdahl's law speedup curves for different values of p; image by Daniels220 at English Wikipedia, CC BY-SA 3.0.)
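For a concrete feel for those limits, a small Python sketch evaluating the formula above:
def amdahl_speedup(p, s):
    """Theoretical speedup for parallel fraction p and parallel-part speedup s."""
    return 1.0 / ((1.0 - p) + p / s)

for s in (1, 2, 4, 8, 16, 1024):
    print(f"s={s:>4}: speedup = {amdahl_speedup(0.95, s):.2f}")
# Even with p = 0.95 the speedup never exceeds 1 / (1 - 0.95) = 20x.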
Effectively this sets a theoretical bound on how fast you can get. You can expect p to be relatively high for embarrassingly parallel jobs, but I wouldn't dream of anything close to 0.95 or higher. This is because
Spark is a high cost abstraction
Spark is designed to work on commodity hardware at datacenter scale. Its core design focuses on making the whole system robust and immune to hardware failures. That is a great feature when you work with hundreds of nodes and execute long-running jobs, but it doesn't scale down very well.
Spark is not focused on parallel computing
In practice Spark and similar systems are focused on two problems:
Reducing overall IO latency by distributing IO operations between multiple nodes.
Increasing amount of available memory without increasing the cost per unit.
which are fundamental problems for large scale, data intensive systems.
Parallel processing is more a side effect of the particular solution than the main goal. Spark is distributed first, parallel second. The main point is to keep processing time constant with increasing amount of data by scaling out, not speeding up existing computations.
With modern coprocessors and GPGPUs you can achieve much higher parallelism on a single machine than a typical Spark cluster, but it doesn't necessarily help in data-intensive jobs due to IO and memory limitations. The problem is how to load data fast enough, not how to process it.
Practical implications
Spark is not a replacement for multiprocessing or multithreading on a single machine.
Increasing parallelism on a single machine is unlikely to bring any improvements and typically will decrease performance due to overhead of the components.
In this context:
Assuming that the class and jar are meaningful and this is indeed a sort, it is simply cheaper to read the data (single partition in, single partition out) and sort it in memory on a single partition than to run the whole Spark sorting machinery, with its shuffle files and data exchange.

Should I use Trident to compute the global mean of tuples in Storm?

I want to compute with Storm the mean of incoming tuples made of [int id, int value].
As you can see, I can't partition the data by using a fields grouping. I need a topology architecture to distribute this computation, and the only way I can think of is doing mini-batches within each bolt instance and then aggregating.
I kind of understood that Trident was the appropriate solution for doing mini-batch processing within Storm.
What is the best practice for computing global analytics with Storm, such as means, global counts and standard deviations, when you can't partition the data by an attribute? Any topology example?
You can easily compute stream statistics such as mean, standard deviation and count using Trident-ML. There's a section in the README which explains how to compute these stats within a Trident topology.
Hope it helps.