For the following join between two DataFrames in Spark 1.6.0:
val df0Rep = df0.repartition(32, col("a")).cache
val df1Rep = df1.repartition(32, col("a")).cache
val dfJoin = df0Rep.join(df1Rep, "a")
println(dfJoin.count)
Is this join not only co-partitioned but also co-located? I know that for RDDs, if both use the same partitioner and are shuffled in the same operation, the join is co-located. But what about DataFrames? Thank you.
https://medium.com/@achilleus/https-medium-com-joins-in-apache-spark-part-3-1d40c1e51e1c
According to the article linked above, Sort-Merge join is the default join; I would like to add an important point:
For ideal performance of a Sort-Merge join, it is important that all rows having the same value for the join key are available in the same partition. This warrants the infamous partition exchange (shuffle) between executors. Co-located partitions can avoid unnecessary data shuffling. The data needs to be evenly distributed on the join keys, and the join keys need to be unique enough that they can be equally distributed across the cluster, to achieve the maximum parallelism from the available partitions.
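For the DataFrame join at the top, one quick way to check whether an exchange actually happens is to print the physical plan; this is a diagnostic sketch, and Exchange operators in the output indicate a shuffle:
// Print the logical and physical plans of the join; if no Exchange appears
// above the cached, repartitioned inputs, the existing partitioning is reused.
dfJoin.explain(true)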
We want to use a GlobalKTable in a Kafka Streams application. The input topics (KTable/KStream) have N partitions, and the GlobalKTable will be used as a dictionary in the stream application.
Must the input topic for the GlobalKTable have the same number of partitions as the other input topics (which are the sources of the KTable/KStream)?
As I understand it, the answer is NO (it is not limited, and the topic may also have M partitions where N > M), because the GlobalKTable is fully loaded in each instance of the stream application and co-partitioning is not required during a KStream join operation. But I need confirmation from the experts!
Thank you!
No. The number of partitions for the KStream and GlobalKTable topics (that are joined) can differ.
From the Kafka Streams developer guide:
At a high level, KStream-GlobalKTable joins are very similar to KStream-KTable joins. However, global tables provide you with much more flexibility at some expense when compared to partitioned tables:
They do not require data co-partitioning.
More details can be found here:
Global Table join
Join co-partitioning requirements
More accurately:
Why is data co-partitioning required? Because KStream-KStream, KTable-KTable, and KStream-KTable joins are performed based on the keys of records (e.g., leftRecord.key == rightRecord.key), it is required that the input streams/tables of a join are co-partitioned by key.
The only exception are KStream-GlobalKTable joins. Here, co-partitioning is not required because all partitions of the GlobalKTable's underlying changelog stream are made available to each KafkaStreams instance, i.e. each instance has a full copy of the changelog stream. Further, a KeyValueMapper allows for non-key based joins from the KStream to the GlobalKTable.
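A minimal sketch of such a join in the Kafka Streams Scala DSL (assuming a recent Kafka Streams release; the topic names, types, and join logic are illustrative): the KeyValueMapper in the first argument list extracts the lookup key, so the join does not even have to be on the stream's own key.
import org.apache.kafka.streams.kstream.GlobalKTable
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.serialization.Serdes._
import org.apache.kafka.streams.scala.kstream.KStream

val builder = new StreamsBuilder()

// Regular input topic with N partitions.
val orders: KStream[String, String] = builder.stream[String, String]("orders")

// Dictionary topic; its partition count may differ, because every
// application instance materializes the full table locally.
val dict: GlobalKTable[String, String] =
  builder.globalTable[String, String]("dictionary")

val enriched: KStream[String, String] =
  orders.join(dict)(
    (key, value) => value.split(",")(0),                  // extract the lookup key
    (orderValue, dictValue) => s"$orderValue|$dictValue"  // combine the values
  )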
How can streams with unequal numbers of partitions be joined in KSQL, apart from increasing the partition count?
For example, Stream-1 has 3 partitions and Stream-2 has 2 partitions. In that case, of course, we can repartition so that both streams have 3 partitions and then join. But I want to know whether there is any other method to join unequally partitioned streams through KSQL.
No, unfortunately KStreams/KSQL doesn't support joins between unequally partitioned topics.
It's a prerequisite that both topics have the same number of partitions before the join operation is called; otherwise the join will fail.
You can read more about the co-partitioning requirement here:
https://docs.confluent.io/current/ksql/docs/developer-guide/partition-data.html#partition-data-to-enable-joins
To ensure co-partitioning, you can use the PARTITION BY clause to create a new stream:
CREATE STREAM topic_rekeyed WITH (PARTITIONS=6) AS SELECT * FROM topic PARTITION BY topic_key;
I have a Kafka Streams application which takes data from a few topics, joins the data, and puts it in another topic.
Kafka configuration:
5 Kafka brokers
Kafka topics: 15 partitions, replication factor 3
A few million records are consumed/produced every hour.
I am making a KStream-KStream join, which creates 2 internal topics, while a KStream-KTable join would create 1 internal topic + 1 table.
Which is better in terms of performance and other factors?
The choice is not a question of performance but a question of semantics: what should the join result be? The two joins compute quite different results, so you should pick the semantics that meet your application's needs.
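To make the difference concrete, here is a sketch in the Kafka Streams Scala DSL (assuming a recent Kafka Streams release; the topic names, types, and 5-minute window are illustrative). A KStream-KStream join is windowed and pairs records whose timestamps fall within the window, while a KStream-KTable join looks up the table's latest value for the record's key.
import java.time.Duration
import org.apache.kafka.streams.kstream.JoinWindows
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.serialization.Serdes._

val builder = new StreamsBuilder()
val left  = builder.stream[String, String]("left-topic")
val right = builder.stream[String, String]("right-topic")
val table = builder.table[String, String]("table-topic")

// KStream-KStream: windowed join; both sides are buffered in state stores,
// which is why the two internal changelog topics appear.
val streamStream = left.join(right)(
  (l, r) => s"$l|$r",
  JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofMinutes(5))
)

// KStream-KTable: each stream record is joined against the table's current
// value for its key; only the table needs one internal changelog topic.
val streamTable = left.join(table)((l, t) => s"$l|$t")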
The different semantics are documented in CP docs and AK wiki:
https://docs.confluent.io/current/streams/developer-guide.html#joining
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Streams+Join+Semantics
I'm using Spark for processing large files, and I have 12 partitions.
I have rdd1 and rdd2; I make a join between them, then a select (rdd3).
My problem is that I found the last partition is much bigger than the other partitions: partitions 1 to 11 hold about 45,000 records each, but partition 12 holds about 9,100,000 records.
So I divided 9,100,000 / 45,000 ≈ 203 and repartitioned my rdd3 into 214 partitions (203 + 11),
but the last partition is still too big.
How can I balance the size of my partitions?
Should I write my own custom partitioner?
I have rdd1 and rdd2; I make a join between them
join is the most expensive operation in Spark. To be able to join by key, you have to shuffle values, and if the keys are not uniformly distributed, you get the behavior described. A custom partitioner won't help you in that case.
I'd consider adjusting the logic so that it doesn't require a full join.
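To confirm that the imbalance comes from key skew rather than from the partition count, here is a quick diagnostic sketch (rdd3 stands for the joined pair RDD from the question):
// Count the records in each partition to see the skew directly.
rdd3.mapPartitionsWithIndex { (idx, iter) => Iterator((idx, iter.size)) }
  .collect()
  .foreach { case (idx, n) => println(s"partition $idx: $n records") }

// Count the records per key; a single dominant key means repartitioning
// cannot split the big partition, because all values of one key must land
// in the same partition.
rdd3.map { case (k, _) => (k, 1L) }
  .reduceByKey(_ + _)
  .sortBy(_._2, ascending = false)
  .take(5)
  .foreach(println)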
I want to do a join operation between two very big key-value pair RDDs. The keys of these two RDDs come from the same set. To reduce the data shuffle, I wish I could add a pre-distribute phase so that partitions with the same key will be distributed on the same machine. Hopefully this could reduce some shuffle time.
I want to know: is Spark smart enough to do that for me, or do I have to implement this logic myself?
I know that when I join two RDDs and one is preprocessed with partitionBy, Spark is smart enough to use this information and only shuffle the other RDD. But I don't know what will happen if I use partitionBy on both RDDs and then do the join.
If you use the same partitioner for both RDDs, you achieve co-partitioning of your data sets. That does not necessarily mean that your RDDs are co-located - that is, that the partitioned data is located on the same node.
Nevertheless, the performance should be better than if the two RDDs had different partitioners.
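A minimal sketch of that setup (the RDD names and partition count are illustrative): because both sides end up with the same partitioner, the join plans no additional shuffle stage for either input.
import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(32)

// Partition both pair RDDs with the same partitioner and persist them,
// so the shuffled layout is kept and reused by the join.
val left  = rdd1.partitionBy(partitioner).persist()
val right = rdd2.partitionBy(partitioner).persist()

// Both inputs report the same partitioner, so the join becomes a narrow
// dependency: output partition i reads partition i from each side.
val joined = left.join(right)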
I have seen "Speeding Up Joins by Assigning a Known Partitioner", which is helpful to understand the effect of using the same partitioner for both RDDs:
If you have to do an operation before the join that requires a shuffle, such as aggregateByKey or reduceByKey, you can prevent the shuffle by adding a hash partitioner with the same number of partitions as an explicit argument to the first operation and persisting the RDD before the join.
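A hedged sketch of that recipe (names are illustrative, and rdd1 is assumed to hold numeric values to aggregate): the pre-join aggregation takes the partitioner explicitly, so its output already carries a known partitioner that the join can reuse.
import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(32)

// reduceByKey accepts the partitioner as an explicit argument; its output
// then carries that partitioner, and persisting keeps the layout around.
val aggregated = rdd1.reduceByKey(partitioner, _ + _).persist()

// Co-partition the other side with the same partitioner.
val other = rdd2.partitionBy(partitioner).persist()

// Matching partitioners on both sides let the join avoid re-shuffling
// either input.
val joined = aggregated.join(other)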