Specify partition sizes with Spark - Scala

I'm using Spark to process large files, and I have 12 partitions.
I have rdd1 and rdd2, I make a join between them, then a select (rdd3).
My problem is that I noticed the last partition is much bigger than the other partitions: partitions 1 to 11 hold about 45000 records each, but partition 12 holds about 9100000 records.
So I divided 9100000 / 45000 =~ 203 and repartitioned my rdd3 into 214 (203 + 11) partitions,
but the last partition is still too big.
How can I balance the size of my partitions?
Should I write my own custom partitioner?

I have rdd1 and rdd2, I make a join between them
join is the most expensive operation in Spark. To be able to join by key, you have to shuffle values, and if the keys are not uniformly distributed, you get the behavior you describe. A custom partitioner won't help you in that case.
I'd consider adjusting the logic so that it doesn't require a full join.
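For example, if one side of the join is small enough to fit in memory, a map-side join with a broadcast variable avoids shuffling the skewed keys at all. This is only a sketch under that assumption, reusing the question's rdd1 and rdd2 together with a SparkContext sc:

// Collect the small side to the driver and broadcast it to every executor.
// This assumes rdd2 comfortably fits in memory.
val smallMap = sc.broadcast(rdd2.collectAsMap())

// Join on the large, skewed side without shuffling it: each partition
// simply looks the key up in the broadcast map locally.
val rdd3 = rdd1.mapPartitions { iter =>
  iter.flatMap { case (k, v) =>
    smallMap.value.get(k).map(w => (k, (v, w)))
  }
}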

Related

PySpark - Does Coalesce(1) Retain the Order of Range Partitioning?

Looking into the Spark UI and the physical plan, I found that orderBy is accomplished by Exchange rangepartitioning(col#0000 ASC NULLS FIRST, 200) and then Sort [col#0000 ASC NULLS FIRST], true, 0.
From what I understand, rangepartitioning defines minimum and maximum values for each partition and puts the rows whose column value falls within a partition's min and max into that partition, so as to achieve global ordering.
But now I have 200 partitions and I want to output to a single CSV file. If I do repartition(1), Spark will trigger a shuffle and the ordering will be gone. However, I tried coalesce(1) and it retained the global ordering. Yet I don't know if it was merely pure luck, since coalesce is not guaranteed to keep the ordering of the partitions it merges. Does anyone know how to repartition to a single partition while keeping the ordering after rangepartitioning? Thanks a lot.
As you state yourself, maintaining order is not part of the coalesce API contract, so you have to choose:
collect the ordered DataFrame as a list of Row instances and write the CSV outside Spark (a minimal sketch follows below), or
write the partitions to individual CSV files with Spark and concatenate the partitions with some other tool, e.g. "hadoop fs -getmerge" on the command line.
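Here is a rough sketch of the first option, assuming the sorted result is small enough to collect to the driver; df and the column name "col" are placeholders, and the CSV writing is deliberately naive (no quoting or escaping):

import java.io.PrintWriter

val sorted = df.orderBy("col")      // the range-partitioned, globally ordered DataFrame
val rows = sorted.collect()         // brings the ordered rows to the driver in order

// Write the header and the rows outside of Spark, so no partitioning
// step can disturb the global ordering.
val out = new PrintWriter("result.csv")
out.println(sorted.columns.mkString(","))
rows.foreach(row => out.println(row.toSeq.mkString(",")))
out.close()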

KSQL - Join streams with unequal partitions

How can I join streams with unequal numbers of partitions in KSQL, apart from increasing the partitions?
For example, Stream-1 has 3 partitions and Stream-2 has 2 partitions. In that case, of course, we can repartition one of the streams so the partition counts match and then join. But I want to know: is there any other method to join unequally partitioned streams through KSQL?
No, unfortunately KStream/KSQL doesn't support joins between unequally partitioned topics.
It's a prerequisite that both topics have the same number of partitions before calling a join operation; otherwise it will fail.
You can read more about the co-partitioning requirement here:
https://docs.confluent.io/current/ksql/docs/developer-guide/partition-data.html#partition-data-to-enable-joins
To ensure co-partitioning, you can use the PARTITION BY clause to create a new stream:
CREATE STREAM topic_rekeyed WITH (PARTITIONS=6) AS SELECT * FROM topic PARTITION BY topic_key;

How to repartition in Spark with one (or several) very big partitions?

I have a DataFrame with 10 partitions, but 90% of the data belongs to 1 or 2 partitions. If I invoke dataFrame.coalesce(10), this splits each partition into 10 parts, which is not necessary for 8 of the partitions. Is there a way to split only the partitions that hold the data into more parts than the others?

Spark colocated join between two partitioned dataframes

For the following join between two DataFrames in Spark 1.6.0
val df0Rep = df0.repartition(32, col("a")).cache
val df1Rep = df1.repartition(32, col("a")).cache
val dfJoin = df0Rep.join(df1Rep, "a")
println(dfJoin.count)
Is this join not only co-partitioned but also co-located? I know that for RDDs, if they use the same partitioner and are shuffled in the same operation, the join will be co-located. But what about DataFrames? Thank you.
https://medium.com/@achilleus/https-medium-com-joins-in-apache-spark-part-3-1d40c1e51e1c
According to the article linked above, sort-merge join is the default join; I would like to add an important point:
For ideal performance of a sort-merge join, it is important that all rows having the same value for the join key are available in the same partition. This warrants the infamous partition exchange (shuffle) between executors. Collocated partitions can avoid unnecessary data shuffles. The data needs to be evenly distributed on the join keys, and the join keys need to be unique enough that they can be distributed evenly across the cluster, to achieve the maximum parallelism from the available partitions.
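One rough way to check this in practice, reusing the question's df0 and df1, is to repartition both inputs on the join key and inspect the physical plan; whether the cached partitioning is reused depends on the Spark version, so treat this only as a way to verify, not as a guarantee:

import org.apache.spark.sql.functions.col

val df0Rep = df0.repartition(32, col("a")).cache()
val df1Rep = df1.repartition(32, col("a")).cache()
val dfJoin = df0Rep.join(df1Rep, "a")

// Look for "Exchange hashpartitioning(a, ...)" directly above the join's
// inputs: if it is absent there, the join reused the existing partitioning
// and did not shuffle the cached DataFrames again.
dfJoin.explain()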

When creating two different Spark pair RDDs with the same key set, will Spark distribute partitions with the same key to the same machine?

I want to do a join operation between two very big key-value pair RDDs. The keys of these two RDDs come from the same set. To reduce data shuffling, I wish I could add a pre-distribution phase so that partitions with the same keys are placed on the same machine. Hopefully this could reduce some shuffle time.
I want to know: is Spark smart enough to do that for me, or do I have to implement this logic myself?
I know that when I join two RDDs and one has been pre-processed with partitionBy, Spark is smart enough to use this information and only shuffle the other RDD. But I don't know what will happen if I use partitionBy on both RDDs and then do the join.
If you use the same partitioner for both RDDs, you achieve co-partitioning of your data sets. That does not necessarily mean that your RDDs are co-located - that is, that the partitioned data is located on the same node.
Nevertheless, the performance should be better than if both RDDs had different partitioners.
I have seen this passage, "Speeding Up Joins by Assigning a Known Partitioner", which is helpful for understanding the effect of using the same partitioner for both RDDs:
Speeding Up Joins by Assigning a Known Partitioner
If you have to do an operation before the join that requires a shuffle, such as aggregateByKey or reduceByKey, you can prevent the shuffle by adding a hash partitioner with the same number of partitions as an explicit argument to the first operation and persisting the RDD before the join.
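A small sketch of that advice, where the placeholder data, the 12-partition count, and the SparkContext sc are assumptions rather than part of the question: give both RDDs the same HashPartitioner and persist them before the join, so the join can reuse the existing partitioning.

import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(12)

// Pre-distribute both RDDs with the same partitioner and keep them in memory.
val left  = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(partitioner).persist()
val right = sc.parallelize(Seq((1, "x"), (2, "y"))).partitionBy(partitioner).persist()

// Both sides now carry the same partitioner, so the join can reuse the
// existing partitioning instead of shuffling both inputs again.
val joined = left.join(right)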