We are working in Spark Streaming.
Our DataFrame contains the following columns:
[unitID, source, avrobyte, schemeType]
The unitID values are [10, 76, 510, 269, 7, 0, 508, 509, 511, 507].
We execute the following commands:
val dfGrouped: KeyValueGroupedDataset[Int, Car] = dfSource.groupByKey(car => car.unitID)
val afterLogic: Dataset[CarLogic] = dfGrouped.flatMapGroups {
  case (unitID: Int, messages: Iterator[Car]) => performeLogic(...)
}
We allocate 8 Spark executors.
In our Dataset we have 10 different units, so we have 10 different unitID values,
and we expected the processing to be split across all the executors more or less equally. But when we look at executor utilization in the UI, we see that only 2 executors are working and all the others are idle for the whole job...
What are we doing wrong? Or how can we divide the work over all the executors more or less equally?
What you are seeing can be explained by the low cardinality of your key space. Spark uses a HashPartitioner (by default) to assign keys to partitions (200 partitions by default). On a low-cardinality key space this is rather problematic and requires careful attention, as each collision has a massive impact. Furthermore, these partitions then have to be assigned to executors. At the end of this process it's not surprising to end up with a rather sub-optimal distribution of data.
You have a few options:
If applicable, attempt to increase the cardinality of your keys, e.g. by salting them (temporarily appending some randomness). That has the advantage that you can also better handle skew in the data (when the amount of data per key is not equally distributed). In a following step you then remove the random part again and combine the partial results (see the sketch after this list).
If you absolutely require a partition per key (and the key space is static and well known), you should configure spark.sql.shuffle.partitions to match the cardinality n of your key space and assign each key a partition id in [0, n) ahead of time (to avoid collisions when hashing). You can then use this partition id in your groupBy.
Just for completeness: using the RDD API you could provide your own custom partitioner that does the same as described above: rdd.partitionBy(n, customPartitioner). A minimal sketch of such a partitioner appears at the end of this answer.
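Here is a minimal salting sketch in Scala against a hypothetical (unitID, value) DataFrame; the column names, the saltBuckets value and the sum/count aggregation are assumptions made for the example. It only applies when the per-group logic can be decomposed into a partial step and a merge step.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Hypothetical input: one row per message, keyed by unitID.
val dfSource = Seq((10, 1.0), (10, 2.0), (76, 3.0), (510, 4.0)).toDF("unitID", "value")

val saltBuckets = 8 // e.g. one salt bucket per executor

// 1. Append a temporary random salt so each unitID spreads over several partitions.
val salted = dfSource.withColumn("salt", (rand() * saltBuckets).cast("int"))

// 2. Partial aggregation per (unitID, salt) -- this is the step that now parallelizes.
val partial = salted
  .groupBy($"unitID", $"salt")
  .agg(sum($"value").as("partialSum"), count(lit(1)).as("partialCount"))

// 3. Remove the salt again and combine the partial results per unitID.
val combined = partial
  .groupBy($"unitID")
  .agg(sum($"partialSum").as("total"), sum($"partialCount").as("rows"))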
Though, one final word: Even following one of the latter two options above, using 8 executors for 10 keys (equals 10 non-empty partitions) is a poor choice. If your data is equally distributed, you will still end up with 2 executors doing double the work. If your data is skewed things might even be worse (or you are accidentally lucky) - in any case, it's out of your control.
So it's best to make sure that the number of partitions can be equally distributed among your executors.
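And, for the RDD option mentioned above, a minimal sketch of a custom partitioner, assuming the key space is the static unitID list from the question (the UnitIdPartitioner name and the String payload are placeholders):

import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

// Map each known unitID to its own partition id ahead of time.
val knownUnitIds = Seq(10, 76, 510, 269, 7, 0, 508, 509, 511, 507)
val partitionOf: Map[Int, Int] = knownUnitIds.zipWithIndex.toMap

class UnitIdPartitioner(ids: Map[Int, Int]) extends Partitioner {
  override def numPartitions: Int = ids.size
  override def getPartition(key: Any): Int = key match {
    case k: Int => ids.getOrElse(k, 0) // unknown keys fall back to partition 0
    case _      => 0
  }
}

// rdd is keyed by unitID; every unitID now gets exactly one partition.
def spreadByUnit(rdd: RDD[(Int, String)]): RDD[(Int, String)] =
  rdd.partitionBy(new UnitIdPartitioner(partitionOf))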
Related
Can anyone explain how data skew is handled in production for Apache Spark?
Scenario:
We submitted the Spark job using spark-submit, and in the Spark UI we observed that a few tasks were taking a long time, which indicates the presence of skew.
Questions:
(1) What steps should we take (re-partitioning, coalesce, etc.)?
(2) Do we need to kill the job, include the skew solutions in the jar, and re-submit the job?
(3) Can we solve this issue by running commands like coalesce directly from the shell without killing the job?
Data skew is primarily a problem when applying non-reducing by-key (shuffling) operations. The two most common examples are:
Non-reducing groupByKey (RDD.groupByKey, Dataset.groupBy(Key).mapGroups, Dataset.groupBy.agg(collect_list)).
RDD and Dataset joins.
Rarely, the problem is related to the properties of the partitioning key and the partitioning function, with no pre-existing issue in the data distribution itself.
// All keys are unique - no obvious data skew
val rdd = sc.parallelize(Seq(0, 3, 6, 9, 12)).map((_, None))
// Drastic data skew
rdd.partitionBy(new org.apache.spark.HashPartitioner(3)).glom.map(_.size).collect
// Array[Int] = Array(5, 0, 0)
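// (Every key above is a multiple of 3, so key.hashCode % 3 == 0 for all of them
// and the whole dataset lands in the first partition.)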
What steps shall we take (re-partitioning, coalesce, etc.)?
Repartitioning (never coalesce) can help you with the latter case by
Changing the partitioner.
Adjusting the number of partitions to minimize the possible impact of the data (here you can use the same rules as for associative arrays - prime numbers and powers of two should be preferred, although they might not resolve the problem fully, like 3 in the example used above).
The former cases typically won't benefit much from repartitioning, because the skew is induced naturally by the operation itself. Values with the same key cannot be spread across multiple partitions, and due to the non-reducing character of the process, the outcome is minimally affected by the initial data distribution.
These cases have to be handled by adjusting the logic of your application. It could mean a number of things in practice, depending on the data or problem:
Removing the operation completely.
Replacing the exact result with an approximation.
Using different workarounds (typically with joins), for example a frequent-infrequent split, an iterative broadcast join, or prefiltering with a probabilistic filter (like a Bloom filter). A sketch of the frequent-infrequent split follows below.
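As an illustration of the frequent-infrequent split, here is a hedged Scala sketch. The table and column names (facts, dim, id) and the topN cut-off are assumptions; the idea is to broadcast-join only the handful of heavy-hitter keys and shuffle-join the long tail.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

def skewAwareJoin(facts: DataFrame, dim: DataFrame, topN: Int = 10): DataFrame = {
  // 1. Identify the heavy-hitter keys that cause the skew.
  val frequentKeys = facts.groupBy("id").count().orderBy(desc("count")).limit(topN).select("id")

  // 2. Split the fact table into a frequent and an infrequent part.
  val frequentFacts   = facts.join(broadcast(frequentKeys), Seq("id"), "left_semi")
  val infrequentFacts = facts.join(broadcast(frequentKeys), Seq("id"), "left_anti")

  // 3. Broadcast-join the few frequent keys, shuffle-join the long tail, and union the results.
  val frequentDim      = dim.join(broadcast(frequentKeys), Seq("id"), "left_semi")
  val frequentJoined   = frequentFacts.join(broadcast(frequentDim), Seq("id"))
  val infrequentJoined = infrequentFacts.join(dim, Seq("id"))

  frequentJoined.unionByName(infrequentJoined)
}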
Do we need to kill the job and then include the skew solutions in the jar and re-submit the job?
Normally you have to at least resubmit the job with adjusted parameters.
In some cases (mostly RDD batch jobs) you can design your application to monitor task execution and to kill and resubmit a particular job in case of possible skew, but that might be hard to implement correctly in practice.
In general, if data skew is possible, you should design your application to be immune to data skews.
Can we solve this issue by running commands like coalesce directly from the shell without killing the job?
I believe this is already answered by the points above, but just to be clear - there is no such option in Spark. You can of course include these in your application.
We can fine-tune the query to reduce its complexity.
We can try a salting mechanism:
Salt the skewed column with a random number to create a better distribution of data across the partitions.
Spark 3 enables the Adaptive Query Execution (AQE) mechanism to avoid such scenarios in production.
Below are a couple of Spark properties that we can fine-tune accordingly:
spark.sql.adaptive.enabled=true
spark.databricks.adaptive.autoBroadcastJoinThreshold    # changes sort-merge join to broadcast join dynamically; default size = 30 MB
spark.sql.adaptive.coalescePartitions.enabled=true      # dynamically coalesces shuffle partitions
spark.sql.adaptive.advisoryPartitionSizeInBytes=64MB    # default
spark.sql.adaptive.coalescePartitions.minPartitionSize  # minimum size of a coalesced partition
spark.sql.adaptive.coalescePartitions.minPartitionNum   # default: 2x the number of cores
spark.sql.adaptive.skewJoin.enabled=true
spark.sql.adaptive.skewJoin.skewedPartitionFactor=5     # default
spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes=256MB  # default
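A hedged example of setting these at the session level in Scala; the same values can also be passed with --conf on spark-submit, and the concrete numbers are illustrative rather than recommendations.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("aqe-skew-tuning").getOrCreate()

// Adaptive Query Execution: coalesce small shuffle partitions and split skewed ones.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64MB")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")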
I have set up Spark 2.0 and Cassandra 3.0 on a local machine (8 cores, 16 GB RAM) for testing purposes and edited spark-defaults.conf as follows:
spark.python.worker.memory 1g
spark.executor.cores 4
spark.executor.instances 4
spark.sql.shuffle.partitions 4
Next I imported 1.5 million rows into Cassandra:
CREATE TABLE test (
    tid int,
    cid int,
    pid int,
    ev list<double>,
    PRIMARY KEY (tid)
)
test.ev is a list containing numeric values i.e. [2240,2081,159,304,1189,1125,1779,693,2187,1738,546,496,382,1761,680]
Now in the code, to test the whole thing I just created a SparkSession, connected to Cassandra and ran a simple select count:
cassandra = spark.read.format("org.apache.spark.sql.cassandra")
df = cassandra.load(keyspace="testks",table="test")
df.select().count()
At this point, Spark outputs the count and takes about 28 seconds to finish the job, distributed over 13 tasks (in the Spark UI, the total input for the tasks is 331.6 MB).
Questions:
Is that the expected performance? If not, what am I missing?
Theory says the number of partitions of a DataFrame determines the number of tasks Spark will distribute the job into. If I am setting spark.sql.shuffle.partitions to 4, why is it creating 13 tasks? (I also verified the number of partitions by calling rdd.getNumPartitions() on my DataFrame.)
Update
A common operation I would like to test over this data:
Query a large data set, say, from 100,000 ~ N rows grouped by pid
Select ev, a list<double>
Perform an average on each member, assuming by now each list has the same length i.e df.groupBy('pid').agg(avg(df['ev'][1]))
As #zero323 suggested, I deployed an external machine (2 GB RAM, 4 cores, SSD) with Cassandra just for this test, and loaded the same data set. The result of df.select().count() was, as expected, higher latency and overall poorer performance compared with my previous test (it took about 70 seconds to finish the job).
Edit: I misunderstood his suggestion. #zero323 meant to let Cassandra perform the count instead of using Spark SQL, as explained here.
Also, I want to point out that I am aware of the inherent anti-pattern of using a list<double> instead of a wide row for this type of data, but my concern at this moment is more the time spent retrieving a large dataset than the actual average computation time.
Is that the expected performance? If not, what am I missing?
It looks slowish, but it is not exactly unexpected. In general, count is expressed as
SELECT 1 FROM table
followed by a Spark-side summation. So while it is optimized, it is still rather inefficient, because you have to fetch N long integers from the external source just to sum them locally.
As explained in the docs, Cassandra-backed RDDs (not Datasets) provide an optimized cassandraCount method which performs server-side counting.
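A minimal sketch with the spark-cassandra-connector RDD API, assuming the keyspace and table names from the question and an existing SparkContext sc:

import com.datastax.spark.connector._

// Pushes the count down to Cassandra instead of fetching rows and summing in Spark.
val serverSideCount: Long = sc.cassandraTable("testks", "test").cassandraCount()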
Theory says the number of partitions of a DataFrame determines the number of tasks Spark will distribute the job into. If I am setting spark.sql.shuffle.partitions to (...), why is it creating (...) tasks?
Because spark.sql.shuffle.partitions is not used here. This property determines the number of partitions for shuffles (when data is aggregated by some set of keys), not for Dataset creation or global aggregations like count(*) (which always use 1 partition for the final aggregation).
If you are interested in controlling the number of initial partitions, you should take a look at spark.cassandra.input.split.size_in_mb, which defines:
Approx amount of data to be fetched into a Spark partition. Minimum number of resulting Spark partitions is 1 + 2 * SparkContext.defaultParallelism
As you can see, another factor here is spark.default.parallelism, but it is not exactly a subtle configuration, so depending on it is in general not an optimal choice.
I see that this is a very old question, but maybe someone needs it now.
When running Spark on a local machine it is very important to set the master in SparkConf to "local[*]", which according to the documentation runs Spark with as many worker threads as there are logical cores on your machine.
It helped me increase the performance of the count() operation by 100% on a local machine compared to master "local".
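For example (Scala; the app name is arbitrary):

import org.apache.spark.sql.SparkSession

// "local[*]" runs one worker thread per logical core instead of a single thread.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("local-count-test")
  .getOrCreate()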
There is a very small but very powerful detail in the Kafka org.apache.kafka.clients.producer.internals.DefaultPartitioner implementation that bugs me a lot.
It is this line of code:
return DefaultPartitioner.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
to be more precise, the last % numPartitions. I keep asking myself what the reason is for introducing such a huge constraint by making the partition ID a function of the number of existing partitions. Is it just for the convenience of having small numbers (human readable/traceable?!) compared to the total number of partitions? Does anyone here have broader insight into the issue?
I'm asking this because in our implementation, the key we use to store data in Kafka is domain-sensitive and we use it to retrieve information from Kafka based on that key. For instance, we have consumers that need to subscribe ONLY to partitions that are of interest to them, and the way we make that link is by using such keys.
Would it be safe to use a custom partitioner that doesn't do that modulo operation? Would we notice any performance degradation? Does this have any implications on the producer and/or consumer side?
Any ideas and comments are welcome.
Partitions in a Kafka topic are numbered 0...N-1. Thus, if a key is hashed to determine a partition, the resulting hash value must be in the interval [0, N) -- it must be a valid partition number.
Using a modulo operation is a standard technique in hashing.
Normally you take the hash modulo the table size to make sure that the entry fits into the hash range.
Say you have a hash range of 5:
---------------------
| 0 | 1 | 2 | 3 | 4 |
---------------------
If the hash code of an entry happens to be 6, you have to take it modulo the number of available
buckets so that it fits into the range, which means bucket 1 in this case.
An even more important case is when you decide to add or remove a bucket from the range.
Say you decrease the size of the hash map to 4 buckets; then the last bucket becomes inactive and
you have to rehash the values in bucket #4 to the next bucket in the clockwise direction (I'm talking
about consistent hashing here).
Also, newly arriving hashes need to be distributed within the 4 active buckets, because the 5th one will go away; this is taken care of by the modulo.
The same concept is used in distributed systems for rehashing, which happens when you add or remove a node from your cluster.
Kafka's DefaultPartitioner uses modulo for the same purpose. Adding or removing partitions is a very common case, if you ask me; for example, during a high volume of incoming messages I might want to add more partitions so that I achieve higher write throughput and also higher read throughput, since I can consume the partitions in parallel.
You can override the partitioning algorithm based on your business logic by choosing some key in your message which makes sure the messages are distributed uniformly within the range [0...N-1].
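For example, a hedged Scala sketch of such a custom partitioner. The DomainPartitioner name and the assumption that the key is an Int domain id are made up for the example; note that the result still has to land in [0, numPartitions).

import java.util.{Map => JMap}
import org.apache.kafka.clients.producer.Partitioner
import org.apache.kafka.common.Cluster

class DomainPartitioner extends Partitioner {
  override def configure(configs: JMap[String, _]): Unit = ()

  override def partition(topic: String, key: AnyRef, keyBytes: Array[Byte],
                         value: AnyRef, valueBytes: Array[Byte], cluster: Cluster): Int = {
    val numPartitions = cluster.partitionsForTopic(topic).size()
    val domainId = key.asInstanceOf[Int]  // assumed: the key encodes an integer domain id
    math.abs(domainId) % numPartitions    // any valid scheme must still map into [0, numPartitions)
  }

  override def close(): Unit = ()
}

// Registered on the producer via:
// props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, classOf[DomainPartitioner].getName)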
The performance impact of using a custom partitioner entirely depends on your implementation of it.
I'm not entirely sure what you're trying to accomplish, though. If I understand your question correctly, you want to use the value of the message key as the partition number directly, without doing any modulo operation on it to determine the partition?
In that case all you need to do is use the overloaded constructor ProducerRecord(java.lang.String topic, java.lang.Integer partition, K key, V value) when producing a message to a Kafka topic, passing in the desired partition number.
This way all the default partitioning logic is bypassed entirely and the message goes to the specified partition.
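A minimal Scala sketch; the topic name, partition number, key/value and broker address are placeholders.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)

// The partition is given explicitly, so the configured partitioner is never consulted.
val targetPartition: Integer = 3
producer.send(new ProducerRecord[String, String]("my-topic", targetPartition, "my-key", "my-value"))
producer.close()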
I am running an iterative algorithm in which, during each iteration, each value in a list is assigned a set of keys (1 to N). Over time, the distribution of files over keys becomes skewed. I noticed that after a few iterations, in the coalesce phase, things seem to start running really slowly on the last few partitions of my RDD.
My transformation is as follows:
dataRDD_of_20000_partitions.aggregateByKey(zeroOp)(seqOp, mergeOp)
.mapValues(...)
.coalesce(1000, true)
.collect()
Here, aggregateByKey aggregates over the keys I assigned earlier (1 to N). I am coalescing partitions because I know the number of partitions I need, and I set the coalesce shuffle flag to true in order to balance out the partitions.
Could anyone point to some reasons why these transformations may cause the last few partitions of the RDD to process slowly? I am wondering whether part of this has to do with data skew.
I have some observations.
You should have the right number of partitions to avoid data skew. I suspect that you have fewer partitions than the required number. Have a look at this blog.
The collect() call fetches the entire RDD into a single driver node. It may cause OutOfMemory errors sometimes.
Transformations like aggregateByKey() may cause performance issues due to shuffling.
Have a look at this SE question for more details: Spark : Tackle performance intensive commands like collect(), groupByKey(), reduceByKey()
Imagine I have an RDD with 100 records and I partitioned it into 10 partitions, so each partition now has 10 records. I am just converting the RDD to a key-value pair RDD and saving it to a file. Now my output data is divided into 10 partitions, which is OK for me, but is it best practice to use the coalesce function before saving the output data to a file? For example, rdd.coalesce(1) gives just one file as output; doesn't it shuffle data between nodes? I want to know where coalesce should be used.
Thanks
Avoid coalesce if you don't need it. Only use it to reduce the number of files generated.
As with anything, it depends on your use case; coalesce() can be used to either increase or decrease the number of partitions, but there is a cost associated with it.
If you are attempting to increase the number of partitions (in which case the shuffle parameter must be set to true), you will incur the cost of redistributing data through a HashPartitioner. If you are attempting to decrease the number of partitions, the shuffle parameter can be set to false, but then the number of nodes actively pulling from the current set of partitions will be the number of partitions you are coalescing to. For example, if you are coalescing to 1 partition, only 1 node will be active in pulling data from the parent partitions (this can be dangerous if you are coalescing a large amount of data).
Coalescing can be useful, though, as sometimes you can make your job run more efficiently by decreasing your partition set size (e.g. after a filter or a sparse inner join); a sketch of that case follows.
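A small Scala sketch of that case; the paths, the filter predicate and the target partition count are made up.

import org.apache.spark.SparkContext

def writeCompacted(sc: SparkContext): Unit = {
  val bigRdd = sc.textFile("/data/events")          // hypothetical input path
  val errors = bigRdd.filter(_.contains("ERROR"))   // sparse filter: keeps only a small fraction of rows

  // Without coalesce, every original partition would still emit an (almost empty) output file;
  // coalesce(100) merges them with a narrow dependency, i.e. no shuffle.
  errors.coalesce(100).saveAsTextFile("/data/errors")  // hypothetical output path
}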
You can simply use it like this:
rdd.coalesce(numberOfPartitions)
It doesn't shuffle data when you decrease the number of partitions, but it does shuffle data when you increase them (and in that case you must pass shuffle = true, otherwise the partition count simply stays the same). Which one you need depends on your use case. Be careful with it, though: if you decrease the number of partitions to fewer than the number of cores in your cluster, you can no longer use the full resources of your cluster. On the other hand, less shuffled data and network IO - for example, decreasing the RDD's partition count so that it matches the number of cores - can sometimes increase the performance of your job.
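A short Scala sketch of the difference; the partition counts are arbitrary.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val rdd = spark.sparkContext.parallelize(1 to 1000, numSlices = 100)

val fewer  = rdd.coalesce(10)                  // narrow dependency: merges partitions, no shuffle
val more   = rdd.coalesce(200, shuffle = true) // growing the partition count requires shuffle = true
val evenly = rdd.repartition(200)              // shorthand for coalesce(200, shuffle = true)

println((fewer.getNumPartitions, more.getNumPartitions, evenly.getNumPartitions))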