Spark Streaming uses fewer executors than available - Scala

I am using Spark Streaming to process some events. It is deployed in standalone mode with 1 master and 3 workers. I have set the number of cores per executor to 4 and the total number of cores to 24, so 6 executors are spawned in total. I have set spread-out to true, so each worker machine gets 2 executors. My batch interval is 1 second, and I repartition each batch to 21 partitions; the remaining 3 cores are for the receivers. What I observe from the event timeline while it runs is that only 3 of the executors are being used; the other 3 stay idle. As far as I know, there is no parameter in Spark standalone mode to specify the number of executors. How do I make Spark use all the available executors?

Your stream probably does not have enough partitions to fill all executors on every 1-second mini-batch. Try repartition(24) as the first streaming transformation to use the full power of the Spark cluster.
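For illustration, a minimal sketch of that suggestion, assuming a receiver-based source (the source, host name, and processing logic are placeholders, not the asker's actual code):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SpreadWork {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("spread-work"), Seconds(1))

    // Placeholder receiver-based stream; in the question this is whatever the 3 receivers read.
    val events = ssc.socketTextStream("some-host", 9999)

    // Repartition each 1-second micro-batch so its tasks can be scheduled on every
    // executor, not only on the executors that host the receivers.
    events.repartition(24)
      .foreachRDD { rdd =>
        rdd.foreachPartition(_.foreach(event => ())) // process the events here
      }

    ssc.start()
    ssc.awaitTermination()
  }
}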

Related

Spark DataFrame cache keeps growing

How does Spark decide how many times to replicate a cached partition?
The storage level shown in the Storage tab of the Spark UI says “Disk Serialized 1x Replicated”, but it looks like partitions get replicated onto multiple executors. We have noticed this happening with the DISK_ONLY storage level on Spark 2.3. We are caching a dataset with 101 partitions (size on disk is 468.4 GB). The data is initially distributed across 101 executors (we have 600 executors in total). As we run queries on this dataset, the size on disk grows, and so does the number of executors the data is distributed across. We also noticed that a single block/partition is commonly replicated on multiple executors on the same node; if it is stored on disk, why is it not shared between the executors on that node?
persistedDs = dataset.repartition(101).persist(StorageLevel.DISK_ONLY)
[Screenshot: initial load]
[Screenshot: after running queries on the cached dataset]
One executor can have 2 partitions cached on it. Also note that the RDD is cached multiple times in the attached screenshots.
[Screenshot: data distribution on 101 executors]
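For reference, a minimal sketch of the setup being described (Spark 2.x in spark-shell; the source path and session name are placeholders). storageLevel and getNumPartitions only confirm what was requested; the actual block placement and replication is what shows up in the UI's Storage tab:

import org.apache.spark.storage.StorageLevel

// spark is an existing SparkSession (e.g. in spark-shell); the path is a placeholder.
val dataset = spark.read.parquet("hdfs:///data/big-table")

val persistedDs = dataset.repartition(101).persist(StorageLevel.DISK_ONLY)
persistedDs.count()                        // materialise the cache

println(persistedDs.storageLevel)          // Disk Serialized 1x Replicated
println(persistedDs.rdd.getNumPartitions)  // 101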

Apache Flink Kafka Integration Partition Separation

I need to implement the data flow below. I have one Kafka topic with 9 partitions, which I can read with a parallelism of 9. I also have a 3-node Flink cluster, and each node of this cluster has 24 task slots.
First of all, I want to spread the Kafka partitions so that each server reads 3 partitions, as shown below. Order does not matter; I only transform the Kafka messages and send them to the DB.
Secondly, I want to increase the degree of parallelism when saving to the NoSQL DB. If I increase the parallelism to 48, then, since writing to the DB is an I/O operation that does not consume CPU, I want to be sure that when Flink rebalances my messages, each message stays on the same server.
Is there any advice for me?
If you want to spread your Kafka readers across all 3 nodes, I would recommend starting the TaskManagers with 3 slots each and setting the parallelism of the Kafka source to 9.
The problem is that at the moment it is not possible to control how tasks are placed if there are more slots available than the required parallelism. This means if you have fewer sources than slots, then it might happen that all sources will be deployed to one machine, leaving the other machines empty (source-wise).
Being able to spread out tasks across all available machines is a feature which the community is currently working on.
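A minimal sketch of that recommendation, assuming Flink 1.4+ with the Kafka 0.10 connector and the Scala DataStream API (topic, broker, and sink are placeholders). It also assumes the TaskManagers are started with taskmanager.numberOfTaskSlots: 3 in flink-conf.yaml, so that 3 nodes x 3 slots = 9 slots and a source parallelism of 9 fills every node:

import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010

object KafkaToDbJob {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val props = new Properties()
    props.setProperty("bootstrap.servers", "broker1:9092") // placeholder broker
    props.setProperty("group.id", "flink-consumer")

    // One source subtask per Kafka partition: parallelism 9 matches the 9 partitions.
    val events = env
      .addSource(new FlinkKafkaConsumer010[String]("events", new SimpleStringSchema(), props))
      .setParallelism(9)

    // The I/O-bound DB write can run with a higher parallelism than the source.
    val writeToDb: String => Unit = record => () // placeholder for the NoSQL write
    events
      .map(msg => msg)                           // transform the Kafka message here
      .addSink(writeToDb)
      .setParallelism(48)

    env.execute("kafka-to-db")
  }
}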

Strange delays in Spark Streaming

I have recently been using Spark Streaming to process data from Kafka.
After the application is started and a few batches have finished, there is a continuous delay.
Most of the time, data processing is completed within 1-5 seconds.
However, after several batches it consistently took 41-45 seconds, and most of the delay occurred in stage 0, where data is fetched from Kafka.
I happened to find that the Kafka request.timeout.ms setting is 40 seconds by default and changed it to 10 seconds.
I then restarted the application and observed that batches completed in 11 to 15 seconds.
The actual processing time is 1-5 seconds, so I cannot understand this delay.
What is wrong?
My environment is as follows.
Spark Streaming: 2.1.0 (createDirectStream)
Kafka: 0.10.1
Batch interval: 20s
request.timeout.ms: 10s
The following capture shows the graph when request.timeout.ms is set to 8 seconds.
I found the problem and solution:
Basically, when your executors read the Kafka partitions, Spark Streaming caches the content read from each partition in memory to improve the performance of reading and processing.
If the topic is very large, the cache can overflow, and when the Kafka connector fetches from Kafka while the cache is full, it hits the timeout.
Solution: if you are on Spark 2.2.0 or higher, this is the solution (from the Spark documentation); it is a bug known to Spark and Cloudera:
The cache for consumers has a default maximum size of 64. If you expect to be handling more than (64 * number of executors) Kafka partitions, you can change this setting via spark.streaming.kafka.consumer.cache.maxCapacity.
If you would like to disable the caching for Kafka consumers, you can set spark.streaming.kafka.consumer.cache.enabled to false. Disabling the cache may be needed to workaround the problem described in SPARK-19185. This property may be removed in later versions of Spark, once SPARK-19185 is resolved.
The cache is keyed by topicpartition and group.id, so use a separate group.id for each call to createDirectStream.
Set spark.streaming.kafka.consumer.cache.enabled to false as a parameter in your spark-submit, and your mini-batch performance will be like a supersonic aeroplane.
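For example, the two cache settings quoted above can be passed either on the spark-submit command line or set in the application's SparkConf; a sketch (app name and capacity value are illustrative):

// Equivalent spark-submit form:
//   --conf spark.streaming.kafka.consumer.cache.enabled=false
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("kafka-direct-stream")
  // Disable the consumer cache as a workaround (see SPARK-19185) ...
  .set("spark.streaming.kafka.consumer.cache.enabled", "false")
  // ... or, alternatively, raise its capacity if you expect many partitions per executor.
  .set("spark.streaming.kafka.consumer.cache.maxCapacity", "128")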
We faced the same issue, and after a lot of analysis we found that it is due to a Kafka bug, as described in KAFKA-4303.
For Spark applications, we can avoid this issue by setting reconnect.backoff.ms = 0 in the consumer config.
I may describe more details when I have time.
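A sketch of where that consumer setting would go, assuming the Kafka 0.10 direct-stream API (spark-streaming-kafka-0-10); broker, group, and topic names are placeholders:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val ssc = new StreamingContext(new SparkConf().setAppName("kafka-direct"), Seconds(20))

val kafkaParams = Map[String, Object](
  "bootstrap.servers"    -> "broker1:9092",            // placeholder broker
  "key.deserializer"     -> classOf[StringDeserializer],
  "value.deserializer"   -> classOf[StringDeserializer],
  "group.id"             -> "my-streaming-group",
  "auto.offset.reset"    -> "latest",
  // Workaround for KAFKA-4303: do not back off before reconnecting.
  "reconnect.backoff.ms" -> "0"
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams)
)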

How to determine the number of partitions of an RDD in Spark given the number of cores and executors?

What will be the number of partitions for a 10-node cluster with 20 executors and code reading a folder with 100 files?
It differs depending on the mode you are running in, and you can tune it using the spark.default.parallelism setting. From the Spark documentation:
For operations like parallelize with no parent RDDs, it depends on the cluster manager:
Local mode: number of cores on the local machine
Mesos fine grained mode: 8
Others: total number of cores on all executor nodes or 2, whichever is larger
Link to related Documentation:
http://spark.apache.org/docs/latest/configuration.html#execution-behavior
You can also change the number of partitions yourself, depending upon the data that you are reading. Some of the Spark APIs provide an additional setting for the number of partitions.
Further, to check how many partitions are being created, do as @Sandeep Purohit says:
rdd.getNumPartitions
This will return the number of partitions that are being created.
You can also change the number of partitions after the RDD is created by using two APIs, namely coalesce and repartition.
Link to coalesce and repartition: Spark - repartition() vs coalesce()
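A quick sketch of the difference (run e.g. in spark-shell, where sc is the SparkContext; the path and partition counts are placeholders):

val rdd = sc.textFile("hdfs:///data/input")  // partition count driven by the input splits

println(rdd.getNumPartitions)                // inspect the current number of partitions

val fewer = rdd.coalesce(10)                 // merges partitions, avoids a full shuffle
val more  = rdd.repartition(100)             // full shuffle, can increase the count

println(fewer.getNumPartitions)              // 10
println(more.getNumPartitions)               // 100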
From Spark doc:
By default, Spark creates one partition for each block of the file (blocks being 64MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks.
Number of partitions also depends upon the size of the file. If the file size is too big, you may choose to have more partitions.
The number of partitions for an RDD of Scala/Java objects depends on the cores of the machines, while if you are creating the RDD from Hadoop input files it depends on the HDFS block size (which is version dependent). You can find the number of partitions of an RDD as follows:
rdd.getNumPartitions
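To make that concrete, a small sketch (again in spark-shell; the path is a placeholder) showing both cases: a file-backed RDD, where you can only request more partitions than blocks, and a parallelized collection, which falls back to spark.default.parallelism:

val fromFile = sc.textFile("hdfs:///data/input", 200)  // ask for at least 200 partitions
println(fromFile.getNumPartitions)

val fromCollection = sc.parallelize(1 to 1000000)      // defaults to spark.default.parallelism
println(fromCollection.getNumPartitions)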

Read more Kafka topics than number of CPU cores

From Spark Streaming Programming Guide:
Extending the logic to running on a cluster, the number of cores allocated to the Spark Streaming application must be more than the number of receivers. Otherwise the system will receive data, but not be able to process it.
Does this mean that if I have 16 CPU cores in the whole Spark cluster I cannot read data from more than 15 Kafka topics?
Only if you use the consumer/receiver based API. This does not apply to the Direct Stream one.
Have a look here for the differences between the two.
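For context, a sketch of the receiver-based API that the quoted rule applies to (spark-streaming-kafka-0-8; the ZooKeeper address, group, and topic names are placeholders). Each createStream call starts a long-running receiver that permanently occupies one core, which is where the "cores > number of receivers" constraint comes from; the direct stream has no receivers, so it is not affected:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(new SparkConf().setAppName("receiver-based"), Seconds(10))

// One receiver per topic here: with 16 topics this would need 16 cores for receivers
// alone, plus at least one more core to actually process the batches.
val topics = Seq("topic-a", "topic-b", "topic-c")
val streams = topics.map(t => KafkaUtils.createStream(ssc, "zk1:2181", "my-group", Map(t -> 1)))
val union = ssc.union(streams)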