What are the advantages of increasing the partition size and decreasing partitions number in spark? - scala

I have 1 master and 3 slaves(4 cores each)
By Default the min partition size in my spark cluster is 32MB and my file size is 41 Gb.
So i am trying to reduce the number of partitions by changing the minsize to 64Mb
sc.hadoopConfiguration.setLong("mapreduce.input.fileinputformat.split.minsize", 64*1024*1024)
val data =sc.textFile("/home/ubuntu/BigDataSamples/Posts.xml",800)
data.partitions.size = 657
So what are the advantages of increasing the partition size and reducing the number of partitions.
Because when my partitions are around 1314 it took around 2-3min appx and even after reducing the partition count, it is still taking same amount of time.

The more partitions the more overhead, but to some extend it helps with performance as you can run all of them in parallel.
So, on one hand it makes sense to keep number of partitions equal to number of cores. On the other it might happen specific partition size lead to specific amount of garbage in the JVM, which might overhead the limit. In this case you'd like to increase number of partitions to reduce memory footprint of each of them.
It might also depend on the workflow. Consider groupByKey vs reduceByKey. In the latter case you can compute a lot locally and send just a little to remote node. Shuffles happen to be written to disk before being sent to remote, thus having more partitions might reduce performance.
It is also true that there is some overhead come along with each partition.
In case you'd like to share cluster with several people, then you might consider approach to take somewhat less number of partitions to process everything, so that all of the users would have some processing time.
Smth like this.

Related

PySpark: Efficient strategy of splitting my dataframe when writing to a delta table

I would like to know if there is an efficient strategy to write my Spark dataframe in a delta Table in Datalake.
As a rule of thumb I am splitting the dataframe into some column that has between 70 and 300 different values.
The 'trick' I use to see which column is the candidate to use in the "partitionBy" is the following.
I transform my dataframe into a temporary table and look at the cardinality.
df.createOrReplaceTempView("my_table")
%sql
select
count(distinct(column1)) as column1,
count(distinct(column2)) as column2,
...
from my_table
Then I pick the column with a cardinality between 70 - 300, depending on the size of the table
mentally calculating table_size / 128 MB -->is this correct ?
df.write.partitionBy("column_candidate")
.format("delta")
.mode("overwrite")
.option("overwriteSchema", "true")
.save(outputpaht)
This method I use does not seem very scientific, and I would like to know if there is a better way to estimate it.I have also seen that there is something called "repartition" but I don't know how to use it or if it is interesting.
How can I calculate the partitions in a more scientific way?
The number of partitions in spark should be decided thoughtfully based on the cluster configuration and requirements of the application. Increasing the number of partitions will make each partition have less data or no data at all. Apache Spark can run a single concurrent task for every partition of an RDD, up to the total number of cores in the cluster. If a cluster has 30 cores then programmers want their RDDs to have 30 cores at the very least or maybe 2 or 3 times of that.
Some acclaimed guidelines for the number of partitions in Spark are as follows-
When the number of partitions is between 100 and 10K partitions based on the size of the cluster and data, the lower and upper bound should be determined.
o The lower bound for spark partitions is determined by 2 X number of cores in the cluster available to application.
o Determining the upper bound for partitions in Spark, the task should take 100+ ms time to execute. If it takes less time, then the partitioned data might be too small or the application might be spending extra time in scheduling tasks.
For more information Refer this article

Is it advisable to add more brokers to kafka cluster although the load is still low

although we do not have any perfomance issues yet, and the nodes are pretty much idle, is it advisable to increase the number of kafka brokers (and zookeepers) from 3 to 5 immediately to improve cluster high availability? The intention is then of course to increase the replication factor from 3 to 5 as a default config for critical topics.
If high level of data replication is essential for your business, it is advisable to increase the count of brokers. To attain this, on top of extra nodes, you are creating a technical debt on network load also. Obviously if you increase the number of brokers in cluster, you are decreasing the risk related to loosing high availability.
Depending of your needs. If you do not have to ensure a very high availability(example a bank), the increase of replication factor in your cluster will reducer the overall performance because when you write a message on a topic/partition, that message will be replicated in 5 nodes instead of 3. You can increase the number of nodes for high availability and distribute less partitions on every node, but without a increase of replication factor.

Scaling Kafka for Throughput

I have setup a sample Kafka cluster on AWS and am trying to identify maximum throughput possible with the given configurations. I am currently following post provided here for this analysis.
https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
I would appreciate it if you could clarify the following issues.
I observed a throughput of 40MB/s for messages of size 512 bytes ( single producer - single consumer ) with given hardware. Assume I need to achieve a throughput of 80MB/s.
As I understand one way to do this to increase the number of partitions per topic and increase the number of threads in producer and consumer. ( Assuming I do not change the default values for batch size, compression ratio etc. )
How to find the maximum throughput possible with given hardware? The point after which we are required to improve our hardware resources if we are to further improve the throughput?
( In other words how to make the decision "With X GB RAM and Y GB disk space this is the maximum throughput I can achieve. If I need to further improve the throughput I have to upgrade RAM to XX GB and disk space to YY GB" )
2.Should we scale the cluster vertically or horizontally? What is the recommended approach?
Thank you.
If we define throughput as the volume of data transmitted over the network per second, the maximum throughput should not exceed #machine number * bandwidth. Given a single machine whose NIC is configured with 1Gbps, the max TPS on single machine cannot be larger than 1Gbps. In your case, TPS is 40MB/s, namely 320Mbps,which is quite less than 1Gbps, meaning there is still room for improvement. However, if your target is far larger than 1Gbps, you definitely need more machines.
AFAIK, bandwidth is the most likely cause for the system bottleneck. Unlike CPU and RAM, it's not easy to scale vertically, so a horizontally scaling might be an option.
You could do some maths before scaling. Say the throughput target is "produce 2 billion of records with 512Bytes in 1 hour". That's to say, the TPS has to achieve 2,000,000,000 * 8 * 512 / 3600 / 1024 / 1024 = 2170mbps. Assuming available bandwidth for single machine is 700mbps(Over 70% usage normally brings 'packet loss'), at least 4 machines should be planned for the producer application.

How to calculate the best numberOfPartitions for coalesce?

So, I understand that in general one should use coalesce() when:
the number of partitions decreases due to a filter or some other operation that may result in reducing the original dataset (RDD, DF). coalesce() is useful for running operations more efficiently after filtering down a large dataset.
I also understand that it is less expensive than repartition as it reduces shuffling by moving data only if necessary. My problem is how to define the parameter that coalesce takes (idealPartionionNo). I am working on a project which was passed to me from another engineer and he was using the below calculation to compute the value of that parameter.
// DEFINE OPTIMAL PARTITION NUMBER
implicit val NO_OF_EXECUTOR_INSTANCES = sc.getConf.getInt("spark.executor.instances", 5)
implicit val NO_OF_EXECUTOR_CORES = sc.getConf.getInt("spark.executor.cores", 2)
val idealPartionionNo = NO_OF_EXECUTOR_INSTANCES * NO_OF_EXECUTOR_CORES * REPARTITION_FACTOR
This is then used with a partitioner object:
val partitioner = new HashPartitioner(idealPartionionNo)
but also used with:
RDD.filter(x=>x._3<30).coalesce(idealPartionionNo)
Is this the right approach? What is the main idea behind the idealPartionionNo value computation? What is the REPARTITION_FACTOR? How do I generally work to define that?
Also, since YARN is responsible for identifying the available executors on the fly is there a way of getting that number (AVAILABLE_EXECUTOR_INSTANCES) on the fly and use that for computing idealPartionionNo (i.e. replace NO_OF_EXECUTOR_INSTANCES with AVAILABLE_EXECUTOR_INSTANCES)?
Ideally, some actual examples of the form:
Here 's a dataset (size);
Here's a number of transformations and possible reuses of an RDD/DF.
Here is where you should repartition/coalesce.
Assume you have n executors with m cores and a partition factor equal to k
then:
The ideal number of partitions would be ==> ???
Also, if you can refer me to a nice blog that explains these I would really appreciate it.
In practice optimal number of partitions depends more on the data you have, transformations you use and overall configuration than the available resources.
If the number of partitions is too low you'll experience long GC pauses, different types of memory issues, and lastly suboptimal resource utilization.
If the number of partitions is too high then maintenance cost can easily exceed processing cost. Moreover, if you use non-distributed reducing operations (like reduce in contrast to treeReduce), a large number of partitions results in a higher load on the driver.
You can find a number of rules which suggest oversubscribing partitions compared to the number of cores (factor 2 or 3 seems to be common) or keeping partitions at a certain size but this doesn't take into account your own code:
If you allocate a lot you can expect long GC pauses and it is probably better to go with smaller partitions.
If a certain piece of code is expensive then your shuffle cost can be amortized by a higher concurrency.
If you have a filter you can adjust the number of partitions based on a discriminative power of the predicate (you make different decisions if you expect to retain 5% of the data and 99% of the data).
In my opinion:
With one-off jobs keep higher number partitions to stay on the safe side (slower is better than failing).
With reusable jobs start with conservative configuration then execute - monitor - adjust configuration - repeat.
Don't try to use fixed number of partitions based on the number of executors or cores. First understand your data and code, then adjust configuration to reflect your understanding.
Usually, it is relatively easy to determine the amount of raw data per partition for which your cluster exhibits stable behavior (in my experience it is somewhere in the range of few hundred megabytes, depending on the format, data structure you use to load data, and configuration). This is the "magic number" you're looking for.
Some things you have to remember in general:
Number of partitions doesn't necessarily reflect
data distribution. Any operation that requires shuffle (*byKey, join, RDD.partitionBy, Dataset.repartition) can result in non-uniform data distribution. Always monitor your jobs for symptoms of a significant data skew.
Number of partitions in general is not constant. Any operation with multiple dependencies (union, coGroup, join) can affect the number of partitions.
Your question is a valid one, but Spark partitioning optimization depends entirely on the computation you're running. You need to have a good reason to repartition/coalesce; if you're just counting an RDD (even if it has a huge number of sparsely populated partitions), then any repartition/coalesce step is just going to slow you down.
Repartition vs coalesce
The difference between repartition(n) (which is the same as coalesce(n, shuffle = true) and coalesce(n, shuffle = false) has to do with execution model. The shuffle model takes each partition in the original RDD, randomly sends its data around to all executors, and results in an RDD with the new (smaller or greater) number of partitions. The no-shuffle model creates a new RDD which loads multiple partitions as one task.
Let's consider this computation:
sc.textFile("massive_file.txt")
.filter(sparseFilterFunction) // leaves only 0.1% of the lines
.coalesce(numPartitions, shuffle = shuffle)
If shuffle is true, then the text file / filter computations happen in a number of tasks given by the defaults in textFile, and the tiny filtered results are shuffled. If shuffle is false, then the number of total tasks is at most numPartitions.
If numPartitions is 1, then the difference is quite stark. The shuffle model will process and filter the data in parallel, then send the 0.1% of filtered results to one executor for downstream DAG operations. The no-shuffle model will process and filter the data all on one core from the beginning.
Steps to take
Consider your downstream operations. If you're just using this dataset once, then you probably don't need to repartition at all. If you are saving the filtered RDD for later use (to disk, for example), then consider the tradeoffs above. It takes experience to become familiar with these models and when one performs better, so try both out and see how they perform!
As others have answered, there is no formula which calculates what you ask for. That said, You can make an educated guess on the first part and then fine tune it over time.
The first step is to make sure you have enough partitions. If you have NO_OF_EXECUTOR_INSTANCES executors and NO_OF_EXECUTOR_CORES cores per executor then you can process NO_OF_EXECUTOR_INSTANCES*NO_OF_EXECUTOR_CORES partitions at the same time (each would go to a specific core of a specific instance).
That said this assumes everything is divided equally between the cores and everything takes exactly the same time to process. This is rarely the case. There is a good chance that some of them would be finished before others either because of locallity (e.g. the data needs to come from a different node) or simply because they are not balanced (e.g. if you have data partitioned by root domain then partitions including google would probably be quite big). This is where the REPARTITION_FACTOR comes into play. The idea is that we "overbook" each core and therefore if one finishes very quickly and one finishes slowly we have the option of dividing the tasks between them. A factor of 2-3 is generally a good idea.
Now lets take a look at the size of a single partition. Lets say your entire data is X MB in size and you have N partitions. Each partition would be on average X/N MBs. If N is large relative to X then you might have very small average partition size (e.g. a few KB). In this case it is usually a good idea to lower N because the overhead of managing each partition becomes too high. On the other hand if the size is very large (e.g. a few GB) then you need to hold a lot of data at the same time which would cause issues such as garbage collection, high memory usage etc.
The optimal size is a good question but generally people seem to prefer partitions of 100-1000MB but in truth tens of MB probably would also be good.
Another thing you should note is when you do the calculation how your partitions change. For example, lets say you start with 1000 partitions of 100MB each but then filter the data so each partition becomes 1K then you should probably coalesce. Similar issues can happen when you do a groupby or join. In such cases both the size of the partition and the number of partitions change and might reach an undesirable size.

How much data per node in Cassandra cluster?

Where are the boundaries of SSTables compaction (major and minor) and when it becomes ineffective?
If I have major compaction couple of 500G SSTables and my final SSTable will be over 1TB - will this be effective for one node to "rewrite" this big dataset?
This can take about day for HDD and need double size space, so are there best practices for this?
1 TB is a reasonable limit on how much data a single node can handle, but in reality, a node is not at all limited by the size of the data, only the rate of operations.
A node might have only 80 GB of data on it, but if you absolutely pound it with random reads and it doesn't have a lot of RAM, it might not even be able to handle that number of requests at a reasonable rate. Similarly, a node might have 10 TB of data, but if you rarely read from it, or you have a small portion of your data that is hot (so that it can be effectively cached), it will do just fine.
Compaction certainly is an issue to be aware of when you have a large amount of data on one node, but there are a few things to keep in mind:
First, the "biggest" compactions, ones where the result is a single huge SSTable, happen rarely, even more so as the amount of data on your node increases. (The number of minor compactions that must occur before a top-level compaction occurs grows exponentially by the number of top-level compactions you've already performed.)
Second, your node will still be able to handle requests, reads will just be slower.
Third, if your replication factor is above 1 and you aren't reading at consistency level ALL, other replicas will be able to respond quickly to read requests, so you shouldn't see a large difference in latency from a client perspective.
Last, there are plans to improve the compaction strategy that may help with some larger data sets.