I have some problems interpreting Storm UI stats.
I've deployed a simple topology with a Kafka spout and a custom bolt that does nothing but acknowledge tuples. The idea was to measure the performance of the Kafka spout.
So, looking at these stats, I've got some questions.
1) In the last 10 minutes the topology acked 1,619,220 tuples and the complete latency was 14.125 ms. But if you do the math:
(1619220 * 14.125) / (1000 * 60) = 381
1,619,220 tuples with a complete latency of 14.125 ms each would need 381 minutes to pass through the topology, yet the stats cover only the last 10 minutes. Is the complete latency showing a wrong number, or am I interpreting it wrong?
2) Bolt capacity is around 0.5. Does that prove the bottleneck is the Kafka spout?
I would appreciate any information about improving Storm topology performance, since it's not obvious to me right now.
Because processing overlaps for the many tuples that are in flight at the same time (due to parallelism), your calculation is not correct. (It would only be correct if Storm processed a single tuple completely before starting the next one.)
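To make the overlap concrete, here is a back-of-the-envelope sketch (Little's Law: average in-flight tuples = throughput × latency), using only the figures from the question:
// Rough sketch, using the figures from the question
val ackedTuples   = 1619220L                  // tuples acked in the 10-minute window
val windowMs      = 10L * 60 * 1000           // 600,000 ms
val completeLatMs = 14.125                    // average complete latency per tuple (ms)
val throughputPerMs = ackedTuples.toDouble / windowMs   // ≈ 2.7 tuples per ms
val avgInFlight     = throughputPerMs * completeLatMs   // ≈ 38 tuples in flight on average
So an average of only about 38 tuples in flight at any moment is enough to reconcile 1,619,220 acks in 10 minutes with a 14.125 ms complete latency.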
A capacity of 0.5 is not really an overload. Overload would be more than 1. But I would suggest trying to keep it below 0.9, or maybe 0.8, to have some headroom for spikes in your input rate.
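For reference, the Storm UI computes bolt capacity roughly as (number of tuples executed × average execute latency) / measurement window, i.e. the fraction of the window the bolt's executors were busy. A sketch of that calculation (the 0.19 ms execute latency is purely hypothetical):
// Hedged sketch of the capacity figure shown in the Storm UI
val executed     = 1619220L              // tuples executed in the 10-minute window
val executeLatMs = 0.19                  // hypothetical average execute latency (ms)
val capacity     = executed * executeLatMs / (10L * 60 * 1000)   // ≈ 0.5
So a capacity around 0.5 simply means the bolt was busy about half of the time, not that it is overloaded.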
We have a Spark job (HDInsight) and its run time has been increasing slowly over the last couple of months. Based on the Spark UI, are there any indicators that data skew is the reason its performance is degrading? Below are the stage details; please see the difference between the median and the 75th percentile. How should I go about optimizing this job? I'd appreciate any guidance.
Also, what is the optimal value for spark.sql.shuffle.partitions, given that the input dataset size is around 12 GB and the cluster has 128 cores?
Normally we use the following approach to identify possible stragglers, and you have caught on to that:
When the max duration among completed tasks is significantly higher than the median or 75th percentile value, this indicates the possibility of stragglers.
The median in your case is telling enough.
You need to salt the key to distribute the data better (which can be complicated), or look at the current partitioning approach; a sketch of salting follows below.
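For illustration only, a minimal salting sketch (assumptions: skewedRdd is a pair RDD of numeric values keyed by the skewed key, saltBuckets is tuned to the degree of skew, and the aggregation is a simple sum):
import scala.util.Random
val saltBuckets = 16                                   // assumption: tune to the skew
val salted = skewedRdd.map { case (k, v) =>            // skewedRdd: assumed RDD[(K, Long)]
  ((k, Random.nextInt(saltBuckets)), v)                // spread each hot key over N sub-keys
}
val partial = salted.reduceByKey(_ + _)                // first-stage aggregation on salted keys
val result  = partial
  .map { case ((k, _), v) => (k, v) }                  // drop the salt again
  .reduceByKey(_ + _)                                  // final aggregation per original key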
This article https://medium.com/swlh/troubleshooting-stragglers-in-your-spark-application-47f2568663ec provides good guidance.
I've read an article benchmarking the performance of stream processing engines like Spark Streaming, Storm, and Flink. In the evaluation part, the criteria were 99th percentile latency and throughput. For example, Apache Kafka sent data at around 100,000 events per second, the three engines acted as stream processors, and their performance was described using 99th percentile latency and throughput.
Can anyone clarify these two criteria for me?
99th percentile latency of X milliseconds in stream jobs means that 99% of the items arrived at the end of the pipeline in less than X milliseconds. Read this reference for more details.
When application developers expect a certain latency, they often need a latency bound. We measure several latency bounds for the stream record grouping job which shuffles data over the network. The following figure shows the median latency observed, as well as the 90th, 95th, and 99th percentiles (a 99th percentile latency of 50 milliseconds, for example, means that 99% of the elements arrive at the end of the pipeline in less than 50 milliseconds).
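As a small illustration of how such a percentile is computed (a sketch only; the latency values are made up, and in a real benchmark they would be measured per record):
// Toy example: per-record end-to-end latencies in ms (made-up values)
val recordedLatenciesMs = Seq(12.0, 18.5, 22.0, 35.0, 47.0, 51.0, 120.0)
val sorted = recordedLatenciesMs.sorted
val idx    = math.ceil(0.99 * sorted.size).toInt - 1    // nearest-rank method
val p99    = sorted(math.max(0, idx))                   // here: 120.0
// i.e. 99% of the records reached the end of the pipeline in at most p99 ms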
So, I understand that in general one should use coalesce() when:
the number of partitions decreases due to a filter or some other operation that may result in reducing the original dataset (RDD, DF). coalesce() is useful for running operations more efficiently after filtering down a large dataset.
I also understand that it is less expensive than repartition, as it reduces shuffling by moving data only if necessary. My problem is how to define the parameter that coalesce takes (idealPartionionNo). I am working on a project that was passed to me from another engineer, and he was using the calculation below to compute the value of that parameter.
// DEFINE OPTIMAL PARTITION NUMBER
implicit val NO_OF_EXECUTOR_INSTANCES = sc.getConf.getInt("spark.executor.instances", 5)
implicit val NO_OF_EXECUTOR_CORES = sc.getConf.getInt("spark.executor.cores", 2)
val idealPartionionNo = NO_OF_EXECUTOR_INSTANCES * NO_OF_EXECUTOR_CORES * REPARTITION_FACTOR
This is then used with a partitioner object:
val partitioner = new HashPartitioner(idealPartionionNo)
but also used with:
RDD.filter(x=>x._3<30).coalesce(idealPartionionNo)
Is this the right approach? What is the main idea behind the idealPartionionNo value computation? What is the REPARTITION_FACTOR? How do I generally go about defining it?
Also, since YARN is responsible for identifying the available executors on the fly, is there a way to get that number (AVAILABLE_EXECUTOR_INSTANCES) at runtime and use it to compute idealPartionionNo (i.e. replace NO_OF_EXECUTOR_INSTANCES with AVAILABLE_EXECUTOR_INSTANCES)?
Ideally, some actual examples of the form:
Here's a dataset (size);
Here's a number of transformations and possible reuses of an RDD/DF.
Here is where you should repartition/coalesce.
Assume you have n executors with m cores and a partition factor equal to k, then:
The ideal number of partitions would be ==> ???
Also, if you can refer me to a nice blog that explains these I would really appreciate it.
In practice, the optimal number of partitions depends more on the data you have, the transformations you use, and the overall configuration than on the available resources.
If the number of partitions is too low, you'll experience long GC pauses, different types of memory issues, and, lastly, suboptimal resource utilization.
If the number of partitions is too high, then the maintenance cost can easily exceed the processing cost. Moreover, if you use non-distributed reducing operations (like reduce in contrast to treeReduce), a large number of partitions results in a higher load on the driver.
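As a side note, here is a minimal sketch of the reduce vs treeReduce distinction mentioned above (numbers is an assumed RDD[Long]):
// reduce sends every partition's result straight to the driver:
val total1 = numbers.reduce(_ + _)
// treeReduce first combines partial results in intermediate tasks,
// so the driver only has to merge a handful of partial sums:
val total2 = numbers.treeReduce(_ + _, depth = 2)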
You can find a number of rules which suggest oversubscribing partitions compared to the number of cores (a factor of 2 or 3 seems to be common) or keeping partitions at a certain size, but this doesn't take into account your own code:
If you allocate a lot, you can expect long GC pauses, and it is probably better to go with smaller partitions.
If a certain piece of code is expensive, then the shuffle cost can be amortized by higher concurrency.
If you have a filter, you can adjust the number of partitions based on the discriminative power of the predicate (you make different decisions if you expect to retain 5% of the data versus 99% of it); see the sketch after this list.
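For illustration only, a sketch of shrinking the partition count after a selective filter (the 5% retention estimate and the predicate are assumptions):
// Assume roughly 5% of the rows survive the filter (estimated from sampling or domain knowledge).
val expectedRetention = 0.05
val targetPartitions  = math.max(1, (rdd.getNumPartitions * expectedRetention).toInt)
val filtered = rdd
  .filter(row => isInteresting(row))   // hypothetical predicate
  .coalesce(targetPartitions)          // avoid thousands of nearly empty partitions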
In my opinion:
With one-off jobs, keep a higher number of partitions to stay on the safe side (slower is better than failing).
With reusable jobs, start with a conservative configuration, then execute - monitor - adjust configuration - repeat.
Don't try to use a fixed number of partitions based on the number of executors or cores. First understand your data and code, then adjust the configuration to reflect your understanding.
Usually, it is relatively easy to determine the amount of raw data per partition for which your cluster exhibits stable behavior (in my experience it is somewhere in the range of a few hundred megabytes, depending on the format, the data structure you use to load the data, and the configuration). This is the "magic number" you're looking for.
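Once you have that magic number, a rough sketch of turning it into a partition count could look like this (the 12 GB input and the 256 MB target are assumptions, not recommendations):
val inputBytes           = 12L * 1024 * 1024 * 1024      // e.g. a 12 GB input (assumed)
val targetPartitionBytes = 256L * 1024 * 1024            // empirically found "magic number"
val numPartitions = math.max(1, math.ceil(inputBytes.toDouble / targetPartitionBytes).toInt)
// => 48 partitions of roughly 256 MB of raw input each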
Some things you have to remember in general:
The number of partitions doesn't necessarily reflect the data distribution. Any operation that requires a shuffle (*byKey, join, RDD.partitionBy, Dataset.repartition) can result in non-uniform data distribution. Always monitor your jobs for symptoms of significant data skew.
The number of partitions is in general not constant. Any operation with multiple dependencies (union, coGroup, join) can affect the number of partitions.
Your question is a valid one, but Spark partitioning optimization depends entirely on the computation you're running. You need to have a good reason to repartition/coalesce; if you're just counting an RDD (even if it has a huge number of sparsely populated partitions), then any repartition/coalesce step is just going to slow you down.
Repartition vs coalesce
The difference between repartition(n) (which is the same as coalesce(n, shuffle = true)) and coalesce(n, shuffle = false) has to do with the execution model. The shuffle model takes each partition in the original RDD, randomly sends its data around to all executors, and results in an RDD with the new (smaller or greater) number of partitions. The no-shuffle model creates a new RDD which loads multiple partitions as one task.
Let's consider this computation:
sc.textFile("massive_file.txt")
.filter(sparseFilterFunction) // leaves only 0.1% of the lines
.coalesce(numPartitions, shuffle = shuffle)
If shuffle is true, then the text file / filter computations happen in a number of tasks given by the defaults in textFile, and the tiny filtered results are shuffled. If shuffle is false, then the number of total tasks is at most numPartitions.
If numPartitions is 1, then the difference is quite stark. The shuffle model will process and filter the data in parallel, then send the 0.1% of filtered results to one executor for downstream DAG operations. The no-shuffle model will process and filter the data all on one core from the beginning.
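To spell that out, the two variants of the earlier example with a single output partition would be something like this (a sketch; sparseFilterFunction is still the hypothetical predicate from above):
// Filter in parallel over textFile's default tasks, then shuffle the survivors to 1 partition:
val withShuffle = sc.textFile("massive_file.txt")
  .filter(sparseFilterFunction)
  .coalesce(1, shuffle = true)        // equivalent to .repartition(1)
// Collapse the lineage so reading and filtering run as a single task on a single core:
val withoutShuffle = sc.textFile("massive_file.txt")
  .filter(sparseFilterFunction)
  .coalesce(1, shuffle = false)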
Steps to take
Consider your downstream operations. If you're just using this dataset once, then you probably don't need to repartition at all. If you are saving the filtered RDD for later use (to disk, for example), then consider the tradeoffs above. It takes experience to become familiar with these models and when one performs better, so try both out and see how they perform!
As others have answered, there is no formula which calculates what you ask for. That said, you can make an educated guess at the first part and then fine-tune it over time.
The first step is to make sure you have enough partitions. If you have NO_OF_EXECUTOR_INSTANCES executors and NO_OF_EXECUTOR_CORES cores per executor, then you can process NO_OF_EXECUTOR_INSTANCES*NO_OF_EXECUTOR_CORES partitions at the same time (each would go to a specific core of a specific instance).
That said, this assumes everything is divided equally between the cores and everything takes exactly the same time to process. This is rarely the case. There is a good chance that some of them would finish before others, either because of locality (e.g. the data needs to come from a different node) or simply because they are not balanced (e.g. if you have data partitioned by root domain, then partitions including google would probably be quite big). This is where the REPARTITION_FACTOR comes into play. The idea is that we "overbook" each core, and therefore if one finishes very quickly and one finishes slowly, we have the option of dividing the tasks between them. A factor of 2-3 is generally a good idea.
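Plugging in the defaults from the snippet in the question, a factor of 3 would give the following (just a worked example of the formula, not a recommendation):
val NO_OF_EXECUTOR_INSTANCES = 5     // default from the snippet in the question
val NO_OF_EXECUTOR_CORES     = 2     // default from the snippet in the question
val REPARTITION_FACTOR       = 3     // assumed overbooking factor (2-3 is typical)
val idealPartionionNo = NO_OF_EXECUTOR_INSTANCES * NO_OF_EXECUTOR_CORES * REPARTITION_FACTOR
// => 30 partitions, i.e. roughly 3 waves of tasks per available core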
Now let's take a look at the size of a single partition. Let's say your entire data is X MB in size and you have N partitions. Each partition would be, on average, X/N MB. If N is large relative to X, then you might have a very small average partition size (e.g. a few KB). In this case it is usually a good idea to lower N, because the overhead of managing each partition becomes too high. On the other hand, if the size is very large (e.g. a few GB), then you need to hold a lot of data at the same time, which would cause issues such as garbage collection, high memory usage, etc.
The optimal size is a good question, but generally people seem to prefer partitions of 100-1000 MB; in truth, tens of MB would probably also be fine.
Another thing you should note when you do the calculation is how your partitions change. For example, let's say you start with 1000 partitions of 100 MB each, but then filter the data so that each partition becomes 1 KB; then you should probably coalesce. Similar issues can happen when you do a groupBy or join. In such cases both the size of the partitions and the number of partitions change and might reach an undesirable size.
I have 1 master and 3 slaves (4 cores each).
By default, the minimum partition size in my Spark cluster is 32 MB and my file size is 41 GB.
So I am trying to reduce the number of partitions by changing the min size to 64 MB:
sc.hadoopConfiguration.setLong("mapreduce.input.fileinputformat.split.minsize", 64*1024*1024)
val data = sc.textFile("/home/ubuntu/BigDataSamples/Posts.xml", 800)
data.partitions.size   // 657
So what are the advantages of increasing the partition size and reducing the number of partitions?
Because when my partitions were around 1314 it took around 2-3 min approx., and even after reducing the partition count it is still taking the same amount of time.
The more partitions, the more overhead, but to some extent it helps with performance, as you can run all of them in parallel.
So, on one hand, it makes sense to keep the number of partitions equal to the number of cores. On the other hand, a specific partition size may lead to a specific amount of garbage in the JVM, which might exceed the limit. In this case you'd like to increase the number of partitions to reduce the memory footprint of each of them.
It might also depend on the workflow. Consider groupByKey vs reduceByKey. In the latter case you can compute a lot locally and send just a little to the remote node. Shuffle data is written to disk before being sent to the remote node, so having more partitions might reduce performance.
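A minimal sketch of that groupByKey vs reduceByKey difference for a word count (pairs is an assumed RDD[(String, Int)] of (word, 1) tuples):
// groupByKey ships every single value across the network before summing:
val countsGrouped = pairs.groupByKey().mapValues(_.sum)
// reduceByKey pre-aggregates within each partition (map-side combine),
// so only one partial sum per key and partition is shuffled:
val countsReduced = pairs.reduceByKey(_ + _)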
It is also true that there is some overhead that comes along with each partition.
In case you'd like to share the cluster with several people, you might consider using a somewhat lower number of partitions to process everything, so that all of the users get some processing time.
Something like this.
What are the limits of SSTable compaction (major and minor), and when does it become ineffective?
If a major compaction merges a couple of 500 GB SSTables and my final SSTable will be over 1 TB, will it be effective for one node to "rewrite" this big dataset?
This can take about a day on an HDD and needs double the disk space, so are there best practices for this?
1 TB is a reasonable limit on how much data a single node can handle, but in reality, a node is not at all limited by the size of the data, only the rate of operations.
A node might have only 80 GB of data on it, but if you absolutely pound it with random reads and it doesn't have a lot of RAM, it might not even be able to handle that number of requests at a reasonable rate. Similarly, a node might have 10 TB of data, but if you rarely read from it, or you have a small portion of your data that is hot (so that it can be effectively cached), it will do just fine.
Compaction certainly is an issue to be aware of when you have a large amount of data on one node, but there are a few things to keep in mind:
First, the "biggest" compactions, ones where the result is a single huge SSTable, happen rarely, even more so as the amount of data on your node increases. (The number of minor compactions that must occur before a top-level compaction occurs grows exponentially with the number of top-level compactions you've already performed.)
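As a rough illustration (assuming the default size-tiered strategy with a compaction threshold of 4, and ignoring overlaps, deletions, and throttling): 4 flushed SSTables trigger a first-tier compaction, 4 first-tier results (roughly 16 flushes) trigger a second-tier one, the third tier needs roughly 4^3 = 64 flushes, and so on, so each additional tier is reached about four times less often than the one before it.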
Second, your node will still be able to handle requests; reads will just be slower.
Third, if your replication factor is above 1 and you aren't reading at consistency level ALL, other replicas will be able to respond quickly to read requests, so you shouldn't see a large difference in latency from a client perspective.
Last, there are plans to improve the compaction strategy that may help with some larger data sets.