Is there a way to reduce by partition in Apache Beam? - apache-beam

Does Apache Beam allow for a reduce operation per partition?
For more context, I am looking to understand whether it is possible in Apache Beam to aggregate the data within each partition before shuffling it to one node for the final merge of the aggregates.

With some guessing, if I understand your question correctly, doing this means 1) first doing a limited-scope (i.e. per-partition/shard) shuffle and reduce, and then 2) shuffling across partitions and reducing again.
In most cases this won't help unless the step 1) reduce significantly cuts the amount of data that has to be transmitted for the step 2) shuffle. If that is the case, consider using a "combine". Under the hood, a combine does (almost) the same as what you propose.
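Since the rest of this thread is about Spark, here is the same idea sketched with Spark's RDD API rather than Beam's (an illustrative analogy, not Beam code): a keyed combine such as reduceByKey pre-aggregates inside each partition before the shuffle, whereas a plain groupByKey ships every value across the network first.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("combine-demo").setMaster("local[*]"))
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3), ("b", 4)), numSlices = 4)

// groupByKey: every value is shuffled to the reducer before anything is summed.
val viaGroup = pairs.groupByKey().mapValues(_.sum)

// reduceByKey (a "combine"): partial sums are computed inside each partition,
// so only one record per key per partition crosses the network.
val viaCombine = pairs.reduceByKey(_ + _)

viaCombine.collect()  // Array((a,3), (b,7))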

Related

Handling Skew data in apache spark production scenario

Can anyone explain how skewed data is handled in production for Apache Spark?
Scenario:
We submitted the Spark job using "spark-submit", and in the Spark UI we observed that a few tasks are taking a long time, which indicates the presence of skew.
Questions:
(1) What steps should we take (re-partitioning, coalesce, etc.)?
(2) Do we need to kill the job, include the skew fixes in the jar, and re-submit the job?
(3) Can we solve this issue by running commands (like coalesce) directly from the shell without killing the job?
Data skew is primarily a problem when applying non-reducing by-key (shuffling) operations. The two most common examples are:
Non-reducing groupByKey (RDD.groupByKey, Dataset.groupBy(Key).mapGroups, Dataset.groupBy.agg(collect_list)).
RDD and Dataset joins.
Less often, the problem is related to the properties of the partitioning key and the partitioning function, with no pre-existing issue in the data distribution.
// All keys are unique - no obvious data skew
val rdd = sc.parallelize(Seq(0, 3, 6, 9, 12)).map((_, None))
// Drastic data skew
rdd.partitionBy(new org.apache.spark.HashPartitioner(3)).glom.map(_.size).collect
// Array[Int] = Array(5, 0, 0)
What steps shall we take(re-partitioning,coalesce,etc.)?
Repartitioning (never coalesce) can help you with the latter case by:
Changing partitioner.
Adjusting the number of partitions to minimize the possible impact of skew (here you can use the same rules as for associative arrays - prime numbers and powers of two should be preferred, although this might not resolve the problem fully, as with 3 in the example above; see the sketch below).
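Continuing the toy example above, a minimal sketch of the second fix: the keys are spaced 3 apart, so a partition count of 3 collapses them all onto one partition, while 5 spreads them evenly.

// Same RDD as above: keys 0, 3, 6, 9, 12. With HashPartitioner(3) every key
// lands in partition 0; with 5 partitions each key gets its own partition.
rdd.partitionBy(new org.apache.spark.HashPartitioner(5)).glom.map(_.size).collect
// Array[Int] = Array(1, 1, 1, 1, 1)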
The former cases typically won't benefit much from repartitioning, because the skew is induced by the operation itself: values with the same key cannot be spread across multiple partitions, and the non-reducing character of the process means it is minimally affected by the initial data distribution.
These cases have to be handled by adjusting the logic of your application. It could mean a number of things in practice, depending on the data or problem:
Removing operation completely.
Replacing exact result with an approximation.
Using different workarounds (typically with joins), for example a frequent-infrequent split, an iterative broadcast join, or prefiltering with a probabilistic filter (like a Bloom filter); see the sketch below.
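As a rough illustration of the frequent-infrequent split, the sketch below (the names left and right and the threshold 100000 are hypothetical) broadcast-joins the handful of heavy-hitter keys and shuffle-joins the rest:

// left: RDD[(K, V)], right: RDD[(K, W)] - illustrative inputs.
// 1. Find the heavy-hitter keys on the skewed side.
val hotKeys = left.mapValues(_ => 1L).reduceByKey(_ + _)
  .filter { case (_, count) => count > 100000L }   // assumed skew threshold
  .keys.collect().toSet
val hotKeysB = sc.broadcast(hotKeys)

// 2. Broadcast join for the hot keys (assumes the matching right-side rows fit in memory).
val rightHotB = sc.broadcast(
  right.filter { case (k, _) => hotKeysB.value.contains(k) }.collect())
val joinedHot = left
  .filter { case (k, _) => hotKeysB.value.contains(k) }
  .flatMap { case (k, v) =>
    rightHotB.value.collect { case (rk, w) if rk == k => (k, (v, w)) }
  }

// 3. Ordinary shuffle join for everything else.
val joinedCold = left.filter { case (k, _) => !hotKeysB.value.contains(k) }
  .join(right.filter { case (k, _) => !hotKeysB.value.contains(k) })

val joined = joinedHot ++ joinedCold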
Do we need to kill the job and then include the skew solutions in the jar and re-submit the job?
Normally you have to at least resubmit the job with adjusted parameters.
In some cases (mostly RDD batch jobs) you can design your application to monitor task execution and to kill and resubmit a particular job if skew is likely, but that can be hard to implement correctly in practice.
In general, if data skew is possible, you should design your application to be immune to data skews.
Can we solve this issue by running commands (like coalesce) directly from the shell without killing the job?
I believe this is already answered by the points above, but just to say: there is no such option in Spark. You can of course include these in your application.
We can fine-tune the query to reduce its complexity.
We can try a salting mechanism:
Salt the skewed column with a random number to create a better distribution of data across partitions (see the sketch below).
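A minimal sketch of salting for a skewed RDD join (the names bigRdd and smallRdd and the bucket count 8 are illustrative): the big, skewed side gets a random salt appended to its key, and the smaller side is replicated once per salt value so every salted key still finds its match.

import scala.util.Random

val saltBuckets = 8  // illustrative; choose based on the observed skew

// Large, skewed side: tag each record with a random salt.
val saltedBig = bigRdd.map { case (k, v) => ((k, Random.nextInt(saltBuckets)), v) }

// Smaller side: replicate each record once per salt value.
val saltedSmall = smallRdd.flatMap { case (k, w) =>
  (0 until saltBuckets).map(salt => ((k, salt), w))
}

// Each hot key is now spread over up to `saltBuckets` partitions.
val joined = saltedBig.join(saltedSmall)
  .map { case ((k, _), vw) => (k, vw) }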
Spark 3 provides the Adaptive Query Execution (AQE) mechanism to avoid such scenarios in production.
Below are a couple of Spark properties which we can fine-tune accordingly.
spark.sql.adaptive.enabled=true
spark.databricks.adaptive.autoBroadcastJoinThreshold=30MB          # dynamically changes sort-merge join to broadcast join; default size = 30 MB (Databricks)
spark.sql.adaptive.coalescePartitions.enabled=true                 # dynamically coalesces shuffle partitions
spark.sql.adaptive.advisoryPartitionSizeInBytes=64MB               # default 64 MB
spark.sql.adaptive.coalescePartitions.minPartitionSize=1MB         # minimum size of a coalesced partition
spark.sql.adaptive.coalescePartitions.minPartitionNum=<number>     # default: 2x the number of cores
spark.sql.adaptive.skewJoin.enabled=true
spark.sql.adaptive.skewJoin.skewedPartitionFactor=5                # default 5
spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes=256MB  # default 256 MB
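For reference, a minimal sketch of setting these on an existing SparkSession (assuming the usual spark handle from spark-shell; the values are the defaults quoted above):

// Adaptive Query Execution switches (Spark 3.x).
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64MB")

// Skew-join handling: split partitions that are both 5x larger than the median
// partition size and above the byte threshold.
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")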

Understanding Spark partitioning

I'm trying to understand how Spark partitions data. Suppose I have an execution DAG like the one in the picture (orange boxes are the stages). The two groupBy and the join operations are supposed to be very heavy if the RDDs are not partitioned.
Is it wise then to use .partitionBy(new HashPartitioner(properValue)) on P1, P2, P3 and P4 to avoid shuffle? What's the cost of partitioning an existing RDD? When isn't it appropriate to partition an existing RDD? Doesn't Spark partition my data automatically if I don't specify a partitioner?
Thank you
tl;dr The answers to your questions respectively: Better to partition at the outset if you can; Probably less than not partitioning; Your RDD is partitioned one way or another anyway; Yes.
This is a pretty broad question. It takes up a good portion of our course! But let's try to address as much about partitioning as possible without writing a novel.
As you know, the primary reason to use a tool like Spark is because you have too much data to analyze on one machine without having the fan sound like a jet engine. The data get distributed among all the cores on all the machines in your cluster, so yes, there is a default partitioning, according to the data. Remember that the data are already distributed at rest (in HDFS, HBase, etc.), so by default Spark just partitions according to the same strategy, to keep the data on the machines where they already are, with the default number of partitions equal to the number of cores on the cluster. You can override this default number by configuring spark.default.parallelism, and you want this number to be about 2-3 partitions per core on each machine.
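For instance, a minimal sketch of sizing that setting up front (the cluster dimensions here are made up for illustration):

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical cluster: 10 workers x 16 cores, aiming for ~3 partitions per core.
val conf = new SparkConf()
  .setAppName("parallelism-demo")
  .set("spark.default.parallelism", (10 * 16 * 3).toString)
val sc = new SparkContext(conf)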
However, typically you want data that belong together (for example, data with the same key, where HashPartitioner would apply) to be in the same partition, regardless of where they are to start, for the sake of your analytics and to minimize shuffle later. Spark also offers a RangePartitioner, or you can roll your own for your needs fairly easily. But you are right that there is an upfront shuffle cost to go from default partitioning to custom partitioning; it's almost always worth it.
It is generally wise to partition at the outset (rather than delay the inevitable with partitionBy) and then repartition if needed later. Later on you may even choose to coalesce, which reduces the number of partitions (with an intermediate shuffle if you request one) and potentially leaves some machines and cores idle, because the gain in network IO (after that upfront cost) is greater than the loss of CPU power.
(The only situation I can think of where you don't partition at the outset, because you can't, is when your data source is a non-splittable compressed file.)
Note also that you can preserve partitioning during a map-like transformation with mapPartitions and mapPartitionsWithIndex (pass preservesPartitioning = true when the keys are unchanged).
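A minimal sketch (pairRdd is an assumed RDD[(String, Int)]):

// The existing partitioner is kept because the keys are not modified.
val doubled = pairRdd.mapPartitions(
  iter => iter.map { case (k, v) => (k, v * 2) },
  preservesPartitioning = true)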
Finally, keep in mind that as you experiment with your analytics while you work your way up to scale, there are diagnostic capabilities you can use:
toDebugString to see the lineage of RDDs
getNumPartitions to, shockingly, get the number of partitions
glom to see clearly how your data are partitioned
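For example, on any RDD (rdd here stands for whatever you are inspecting):

println(rdd.toDebugString)    // lineage: one indented line per upstream RDD
println(rdd.getNumPartitions) // current number of partitions

// One array per partition - a quick way to eyeball the distribution.
rdd.glom().collect().foreach(part => println(part.length))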
And if you pardon the shameless plug, these are the kinds of things we discuss in Analytics with Apache Spark. We hope to have an online version soon.
By applying partitionBy preemptively you don't avoid the shuffle. You just push it to another place. This can be a good idea if the partitioned RDD is reused multiple times, but you gain nothing for a one-off join.
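A minimal sketch of the "reused multiple times" case (usersRaw and eventsRaw are hypothetical pair RDDs): both sides are partitioned once with the same partitioner and persisted, so the join and groupByKey that follow reuse that layout instead of shuffling again.

import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(200)   // illustrative partition count

// Pay the shuffle once, keep the result around.
val users  = usersRaw.partitionBy(partitioner).persist()
val events = eventsRaw.partitionBy(partitioner).persist()

// Co-partitioned inputs: these operations reuse the existing partitioning
// instead of shuffling either side again.
val joined  = users.join(events)
val grouped = events.groupByKey()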
Doesn't Spark partition my data automatically if I don't specify a partitioner?
It will partition (a.k.a. shuffle) your data as part of the join and the subsequent groupBy (unless you keep the same key and use a transformation which preserves partitioning).

How can I make more partitions in Spark without causing a shuffle

Basically my use case is such that in the first stage I can only have a few partitions, since each task runs a C program which takes as much as 10 GB of memory. However, I use a RangePartitioner later on. But with few partitions in the previous stage, the RangePartitioner throws out-of-memory errors while performing the shuffle. It is a known fact that when you have too few partitions, Spark can throw out-of-memory errors during a shuffle.
Now, what I want is to simply divide the already existing partitions into more partitions - basically the opposite of what coalesce does in Spark. If I use a partitioner, such as the HashPartitioner, it would obviously cause a shuffle, which I want to avoid. So, how can I achieve this?
Not at this time. You can track the related JIRA ticket: https://issues.apache.org/jira/browse/SPARK-5997

Spark transformation on last partitions extremely slow

I am running an iterative algorithm in which, during each iteration, a list of values are each assigned a set of keys (1 to N). Over time, the distribution of values over keys becomes skewed. I noticed that after a few iterations, during the coalesce phase, things seem to start running really slowly on the last few partitions of my RDD.
My transformation is as follows:
dataRDD_of_20000_partitions
  .aggregateByKey(zeroOp)(seqOp, mergeOp)  // aggregate per key (keys 1 to N)
  .mapValues(...)
  .coalesce(1000, shuffle = true)          // shuffle = true to rebalance the new partitions
  .collect()                               // pulls all results back to the driver
Here, aggregateByKey aggregates upon the keys I assigned earlier (1 to N). I coalesce the partitions because I know the number of partitions I need, and I set the coalesce shuffle flag to true in order to balance out the partitions.
Could anyone point to some reasons that these transformations may cause the last few partitions of the RDD to process slow? I am wondering if part of this has to do with data skewness.
I have some observations.
You should have the right number of partitions to avoid data skew. I suspect that you have fewer partitions than required. Have a look at this blog.
The collect() call fetches the entire RDD onto a single driver node. It may sometimes cause OutOfMemory errors.
Transformations like aggregateByKey() may cause performance issues due to shuffling.
Have a look at this SE question for more details: Spark: Tackle performance intensive commands like collect(), groupByKey(), reduceByKey()
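On the collect() point, a minimal sketch of alternatives that keep the result off the driver (result stands for the RDD produced by the pipeline above, before the final collect(); the output path is hypothetical):

// Write the aggregates out from the executors instead of collecting them.
result.saveAsTextFile("hdfs:///tmp/aggregates")   // hypothetical path

// Or bring back only what you actually need to inspect on the driver.
result.take(10)
result.count()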

Why Apache Spark take function not parallel?

Reading the Apache Spark guide at http://spark.apache.org/docs/latest/programming-guide.html, it states that take is not executed in parallel.
Why is the take function not run in parallel? What are the difficulties in implementing this type of function in parallel? Is it something to do with the fact that in order to take the first n elements of an RDD it is required to traverse the entire RDD?
Actually, while take is not entirely parallel, it's not entirely sequential either.
For example let's say you take(200), and each partition has 10 elements. take will first fetch partition 0 and see that it has 10 elements. It assumes that it would need 20 such partitions to get 200 elements. But it's better to ask for a bit more in a parallel request. So it wants 30 partitions, and it already has 1. So it fetches partitions 1 to 29 next, in parallel. This will likely be the last step. If it's very unlucky, and does not find a total of 200 elements, it will again make an estimate and request another batch in parallel.
Check out the code, it's well documented:
https://github.com/apache/spark/blob/v1.2.0/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1049
I think the documentation is wrong. Local calculation only happens when a single partition is required. This is the case in the first pass (fetching partition 0), but typically not the case in later passes.
How would you implement it in parallel? Let's say you have 4 partitions and want to take the first 5 elements. If you knew in advance the size of each partition, it would be easy: for example, if each partition has 3 elements, the driver asks partition 0 for all of its elements and asks partition 1 for 2 elements. So the problem is that it isn't known how many elements each partition has.
Now, you could first calculate partition sizes, but this requires limiting the set of RDD transformations supported, calculating elements more than once, or some other tradeoff, and will generally need more communication overhead.