How to repartition Spark with one (or several) very big partitions? - scala

I have a DataFrame with 10 partitions, but 90% of the data belongs to 1 or 2 partitions. If I invoke dataFrame.coalesce(10), this splits each partition into 10 parts, while this is not necessary for 8 of the partitions. Is there a way to split only the partitions with data into more parts than the others?
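For reference, coalesce can only merge partitions, it never splits them; spreading out the heavy partitions requires a full shuffle. A minimal sketch against the question's dataFrame (the target count of 40 is illustrative):

import org.apache.spark.sql.functions.spark_partition_id

// repartition(n) performs a full shuffle that redistributes rows roughly evenly,
// which removes the skew at the cost of moving data in every partition.
val balanced = dataFrame.repartition(40)

// Inspect the resulting distribution: row count per partition.
balanced.groupBy(spark_partition_id().as("partition")).count().show()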

Related

Number of Parallel Tasks in Spark Streaming and Kafka Integration

I am very new to Spark Streaming and I have some basic doubts. Can someone please help me to clarify this:
My message size is standard, 1 KB each message.
The number of topic partitions is 30, and I am using the DStream approach to consume messages from Kafka.
The number of cores given to the Spark job is:
spark.max.cores=6 | spark.executor.cores=2
As I understand it, the number of Kafka partitions = the number of RDD partitions. In the DStream approach:
dstream.foreachRDD(rdd -> {
    rdd.foreachPartition(partition -> {
        // per-partition processing
    });
});
Question: Will this foreachPartition loop execute 30 times, as there are 30 Kafka partitions?
Also, since I have given 6 cores, how many partitions will be consumed in parallel from Kafka?
Is it 6 partitions at a time, or 30/6 = 5 partitions at a time?
Can someone please give a little detail on how exactly this works in the DStream approach?
"Is it 6 partitions at a time or
30/6 =5 partitions at a time?"
As you said already, the resulting RDDs within the Direct Stream will match the number of partitions of the Kafka topic.
On each micro-batch Spark will create 30 tasks to read each partition. As you have set the maximum number of cores to 6 the job is able to read 6 partitions in parallel. As soon as one of the tasks finishes a new partition can be consumed.
Remember, even if you have no new data in on of the partitions, the resulting RDD still get 30 partitions so, yes, the loop forEachPartiton will iterate 30 times within each micro-batch.
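For concreteness, a minimal Scala sketch of a direct-stream job like the one described, assuming the spark-streaming-kafka-0-10 integration; the broker address, topic name, group id and batch interval are placeholders:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

val conf = new SparkConf().setAppName("kafka-parallelism-demo").set("spark.cores.max", "6")
val ssc  = new StreamingContext(conf, Seconds(5))

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "demo-group"
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("my-topic"), kafkaParams)
)

stream.foreachRDD { rdd =>
  // One RDD partition per Kafka partition: with 30 topic partitions this is 30 tasks
  // per micro-batch, of which at most 6 run concurrently under the 6-core cap.
  rdd.foreachPartition { records =>
    // 'records' iterates over this partition's ConsumerRecords
    records.foreach(record => ())
  }
}

ssc.start()
ssc.awaitTermination()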

How to do topic level sorting/counting with Kafka Streams running with multiple instances

I am new to Kafka Streams and looking for a way to order the streaming data across partitions. My sales data topic has 10 partitions and is partitioned based on the sold items. For example, groceries go to one partition and beverages go to another. The requirement is to find the top 5 sold items every 15 minutes. Now if I run 10 instances on 10 nodes, each partition will be served by one dedicated consumer. In this case, how can we find the top 5 sold items across all partitions?
You will need to use a single-partition topic.
Kafka Streams inherits its scaling model from the brokers and consumers, and thus only with a single-partition input topic can a single instance process all the data.
Cf: https://docs.confluent.io/current/streams/architecture.html#parallelism-model
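In practice that means funnelling the sale events through a topic created with exactly one partition before aggregating. A rough sketch using the Java Streams DSL from Scala; "sales", "sales-single-partition" and the broker address are placeholders, records are assumed to be keyed by item, and selecting the top 5 per 15-minute window is left to downstream processing of the resulting counts:

import java.time.Duration
import java.util.Properties
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}
import org.apache.kafka.streams.kstream.TimeWindows

val props = new Properties()
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "top-sellers")
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, classOf[Serdes.StringSerde].getName)
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, classOf[Serdes.StringSerde].getName)

val builder = new StreamsBuilder()

// Route everything through a 1-partition topic so a single task sees every sale,
// then count sales per item in 15-minute windows; the top-5 selection per window
// happens on these counts downstream.
val countsPerItem = builder.stream[String, String]("sales")   // 10 partitions upstream
  .through("sales-single-partition")                          // pre-created with 1 partition
  .groupByKey()
  .windowedBy(TimeWindows.of(Duration.ofMinutes(15)))
  .count()

new KafkaStreams(builder.build(), props).start()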

Split 1 topic/partitions into multiple topics

I am only starting to learn about Kafka topics/partitions, and I have a case with 1 topic and possibly 10,000 partitions, or even more.
I'm assuming that 10,000 partitions is a very large number and that this is discouraged.
So what I am thinking is to split the 1 topic into logical topic buckets and thus spread the 10,000 partitions among these topics.
So instead of :
1 topic + 10,000+ partitions
I will have:
10 topics + 1,000 partitions each
Is this a viable approach?
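If you do go this way, the proposed layout could be created with the Kafka AdminClient; a sketch where the topic names, the replication factor of 3 and the broker address are made-up values:

import java.util.{Arrays, Properties}
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}

val props = new Properties()
props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
val admin = AdminClient.create(props)

// Ten "bucket" topics with 1,000 partitions each, i.e. the same 10,000 total
// partitions spread over 10 topics.
val buckets = (0 until 10).map(i => new NewTopic(s"events-bucket-$i", 1000, 3.toShort))
admin.createTopics(Arrays.asList(buckets: _*)).all().get()
admin.close()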

Is there a way to further parallelize kstreams aside from partitions?

I understand that the fundamental approach to parallelization with Kafka is to utilize partitioning. However, I am in a special situation: I have to leverage an existing infrastructure that only has 6 partitions, and I need to process millions and millions of records per second.
Is there a way to further optimize so that multiple KStream consumers could read from a single partition at the same time and share the load equally?
The simplest way is to create a "helper" topic with the desired number of partitions. This topic can be configured with a very short retention time, because the original data is safely stored in the actual input topic. You use this helper topic to route all data through it and thus allow for more parallelism downstream:
builder.stream("input-topic")
.through("helper-topic-with-many-partitions")
... // actual processing
Partitions are the unit of parallelization. With 6 partitions you can have at most 6 instances (of the KStream application) consuming data. If each instance is on a separate machine, i.e. with a 1 Gbps network each, you could be reading roughly 600 MB/s in total.
If that's not enough, you'd need to repartition the data.
Now, to distribute your processing, you would need to run each KStream instance (with the same consumer group) on a different machine.
Here's a short video that demonstrates how Kafka Streams (via Kafka SQL) is parallelized to 5 processes: https://www.youtube.com/watch?v=denwxORF3pU
It all depends on partitions & executors. With 6 partitions I can usually achieve 500K+ messages/second, depending on the complexity of the processing, of course.
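For completeness, a sketch of how the helper topic from the first answer might be created, with many more partitions than the 6-partition input and a short retention, since it only buffers data that already lives in the real input topic; the name matches the earlier snippet, while the partition count, replication factor, retention and broker address are illustrative assumptions:

import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}

val props = new Properties()
props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
val admin = AdminClient.create(props)

// 60 partitions instead of 6, so up to 60 downstream tasks; 10-minute retention
// because the helper topic is just a re-routing buffer.
val helper = new NewTopic("helper-topic-with-many-partitions", 60, 3.toShort)
  .configs(Collections.singletonMap("retention.ms", "600000"))

admin.createTopics(Collections.singletonList(helper)).all().get()
admin.close()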

Specify partition size with Spark

I'm using Spark for processing large files, and I have 12 partitions.
I have rdd1 and rdd2; I make a join between them, then a select (rdd3).
My problem is that the last partition is much bigger than the other partitions: partitions 1 to 11 have about 45,000 records each, but partition 12 has about 9,100,000 records.
So I divided 9,100,000 / 45,000 ≈ 203 and repartitioned my rdd3 into 214 (203 + 11) partitions,
but the last partition is still too big.
How can I balance the size of my partitions?
Should I write my own custom partitioner?
"I have rdd1 and rdd2 and I make a join between them"
join is the most expensive operation in Spark. To be able to join by key, you have to shuffle values, and if the keys are not uniformly distributed, you get the behavior you describe. A custom partitioner won't help you in that case.
I'd consider adjusting the logic so that it doesn't require a full join.
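If the join genuinely cannot be avoided, one common mitigation (not mentioned in the answer above) is to salt the skewed key so a hot key's rows are spread across several tasks. A rough RDD-level sketch, assuming rdd1 and rdd2 are pair RDDs of (key, value) with the skew coming from rdd1's keys, and with numSalts chosen to match the observed skew:

import scala.util.Random

val numSalts = 32 // illustrative; tune to the degree of skew

// Attach a random salt to the skewed side's keys...
val salted1 = rdd1.map { case (k, v) => ((k, Random.nextInt(numSalts)), v) }

// ...and replicate the other side once per salt bucket so every salted key finds its match.
val salted2 = rdd2.flatMap { case (k, v) => (0 until numSalts).map(salt => ((k, salt), v)) }

// The join key now includes the salt, so a single hot key is spread over up to numSalts tasks.
val rdd3 = salted1.join(salted2).map { case ((k, _), (v1, v2)) => (k, (v1, v2)) }

The trade-off is that rdd2 is replicated numSalts times, so this fits best when rdd2 is the smaller side of the join.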