Why is Spark breaking my stage into 3 different stages with the same description and DAG? - scala

I have a 5 worker node cluster with 1 executor each and 4 cores per executor.
I have an RDD spread over 20 partitions, which I check with the rdd.isEmpty method. In the Spark history server, I can see three different "jobs" with the same "description":
JobId Description Tasks
3 isEmpty at myFile.scala:42 1/1
4 isEmpty at myFile.scala:42 4/4
5 isEmpty at myFile.scala:42 15/15
When I click into these jobs/stages, they all have the same DAG. What might be causing the isEmpty stage to get broken into 3 different stages?
Additionally, when I change the RDD from 20 to 8 partitions, the History server shows:
JobId Description Tasks
3 isEmpty at myFile.scala:42 1/1
4 isEmpty at myFile.scala:42 4/4
5 isEmpty at myFile.scala:42 3/3
In both cases the sum of tasks from these 3 stages equals the total number of partitions of the RDD. Why doesn't it just put it all in one stage, like:
JobId Description Tasks
3 isEmpty at myFile.scala:42 20/20
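For context, rdd.isEmpty is implemented on top of take(1), and take runs a sequence of jobs that scan more and more partitions until it finds a result (the first job scans 1 partition, and each retry scans roughly 4x as many, governed by spark.rdd.limit.scaleUpFactor). The sketch below mirrors that ramp-up logic rather than Spark's exact source, just to show where the 1/4/15 and 1/4/3 splits come from:
// Sketch only: why take(1) / isEmpty can spawn several single-stage jobs.
def partitionsPerJob(totalPartitions: Int, scaleUpFactor: Int = 4): List[Int] = {
  val jobs = scala.collection.mutable.ListBuffer[Int]()
  var scanned = 0
  while (scanned < totalPartitions) {
    // first job scans 1 partition; each later job scans roughly scaleUpFactor
    // times what has been scanned so far, clamped to the partitions left
    val toScan = if (scanned == 0) 1 else math.min(scanned * scaleUpFactor, totalPartitions - scanned)
    jobs += toScan
    scanned += toScan
  }
  jobs.toList
}
println(partitionsPerJob(20))  // List(1, 4, 15) -- matches jobs 3, 4, 5 in the first table
println(partitionsPerJob(8))   // List(1, 4, 3)  -- matches the 8-partition run
Under that model each row in the tables above is a separate take(1) attempt, each a single-stage job, which is why the task counts sum to the partition count.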

Related

Lag to read message from kafka topic by storm spout

While ingesting messages on a Kafka topic, the Storm spout is not picking them up immediately. There is a lag of more than 1 hour.
There is one spout and 3 bolts in the topology.
Spout: ddl
Bolts: kafkabolt, deletebolt, deletemapperbolt
Storm Config:
ddl.spout.executors: 3
topology.spout.executors: 10
topology.acker.executors: 3
topology.bolt.executors.kafkabolt: 2
topology.bolt.executors.deletebolt: 3
topology.bolt.tasks.deletebolt: 3
topology.max.spout.pending: 1
topology.bolt.executors.deletemapperbolt: 3
topology.bolt.tasks.deletemapperbolt: 3
topology.message.timeout.secs: 300
topology.max.task.parallelism: 100
topology.workers: 1
topology.debug: false
topology.executor.receive.buffer.size: 65536
topology.executor.send.buffer.size: 65536
topology.receiver.buffer.size: 64
topology.transfer.buffer.size: 64
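For reference, a sketch of how the standard topology-level knobs in that list map onto Storm's Config API (package org.apache.storm on Storm 1.x+, backtype.storm on older versions). The per-spout/bolt *.executors and *.tasks keys appear to be application-specific and would be read by the topology builder code, so they are left out here:
import org.apache.storm.Config

val conf = new Config()
conf.setNumWorkers(1)            // topology.workers
conf.setNumAckers(3)             // topology.acker.executors
conf.setMaxSpoutPending(1)       // topology.max.spout.pending: only 1 unacked tuple per spout task
conf.setMessageTimeoutSecs(300)  // topology.message.timeout.secs
conf.setMaxTaskParallelism(100)  // topology.max.task.parallelism
conf.setDebug(false)             // topology.debug
conf.put(Config.TOPOLOGY_EXECUTOR_RECEIVE_BUFFER_SIZE, Integer.valueOf(65536))
conf.put(Config.TOPOLOGY_EXECUTOR_SEND_BUFFER_SIZE, Integer.valueOf(65536))
conf.put(Config.TOPOLOGY_TRANSFER_BUFFER_SIZE, Integer.valueOf(64))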

SPARK Join strategy in Cloud Datafusion

In Cloud Data Fusion I am using a Joiner transform to join two tables.
One of them is a large table with about 87M records, while the other is a smaller table with only ~250 records. I am using 200 partitions in the joiner.
This causes the following failure:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 50 in stage 7.0 failed 4 times, most recent failure: Lost task 50.3 in stage 7.0 (TID xxx, cluster_workerx.c.project.internal, executor 6): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 133355 ms
java.util.concurrent.ExecutionException: java.lang.RuntimeException: org.apache.spark.SparkException: Application application_xxxxx finished with failed status
On a closer look at the Spark UI, among the 200 tasks for the join, nearly 80% of the 87M records go to a single task's output, which fails with the heartbeat error, while the tasks that succeed have very small outputs of roughly <10k records each.
It seems like Spark performs a shuffle hash join. Is there a way in Data Fusion/CDAP to force a broadcast join, since one of my tables is very small? Or can I make some configuration changes to the cluster config to make this join work?
What performance tuning can I do in the Data Fusion pipeline? I didn't find any reference to configuration or tuning in the Data Fusion documentation.
You can use org.apache.spark.sql.functions.broadcast(Dataset[T]) to mark a dataframe/dataset to be broadcast while being joined. A broadcast is not always guaranteed, but for 250 records it will work. If the dataframe with 87M rows is evenly partitioned, then it should improve the performance.
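As a minimal sketch of what that looks like in Spark code (the paths, table names, and join_key column below are placeholders, not taken from the pipeline above):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("broadcast-join-sketch").getOrCreate()
val large = spark.read.parquet("/path/to/large_table")   // ~87M records
val small = spark.read.parquet("/path/to/small_table")   // ~250 records

// Wrapping the small side in broadcast() hints the planner to ship it to every
// executor, so the large side is never shuffled onto one skewed task.
val joined = large.join(broadcast(small), Seq("join_key"))
If the explicit hint cannot be expressed in the pipeline, raising spark.sql.autoBroadcastJoinThreshold (10 MB by default) is the other common lever, provided Data Fusion lets you pass Spark engine properties.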

Job aborted due to stage failure: ShuffleMapStage 20 (repartition at data_prep.scala:87) has failed the maximum allowable number of times: 4

I am submitting a Spark job with the following specification (the same program has been used to run different data sizes, ranging from 50GB to 400GB):
/usr/hdp/2.6.0.3-8/spark2/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--driver-memory 5G \
--executor-memory 10G \
--num-executors 60 \
--conf spark.yarn.executor.memoryOverhead=4096 \
--conf spark.shuffle.registration.timeout=1500 \
--executor-cores 3 \
--class classname /home//target/scala-2.11/test_2.11-0.13.5.jar
I have tried repartitioning the data while reading, and have also applied a repartition before doing any count-by-key operation on the RDD:
// keep distinct pairs and spread them over 300 partitions
val rdd1 = rdd.map(x => (x._2._2, x._2._1)).distinct.repartition(300)
// count the distinct receivers (the second element of each pair)
val receiver_count = rdd1.map(x => x._2).distinct.count
User class threw exception:
org.apache.spark.SparkException: Job aborted due to stage failure: ShuffleMapStage 20 (repartition at data_prep.scala:87) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 9
In my case I gave my executors a little more memory and the job went through fine.
You should definitely look at what stage your job is failing at and accordingly determine if increasing/decreasing the executors' memory would help.
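For illustration only (the numbers are not tuned for this job, and passing the flags on spark-submit is the more common route), the same bump could be expressed in the application code before the SparkContext is created:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Illustrative values: more executor heap plus a larger off-heap overhead than
// the 10G / 4096 used in the submit command above.
val conf = new SparkConf()
  .set("spark.executor.memory", "14g")
  .set("spark.yarn.executor.memoryOverhead", "6144")
val spark = SparkSession.builder().config(conf).getOrCreate()
On the command line the equivalent is --executor-memory 14G and --conf spark.yarn.executor.memoryOverhead=6144; whether the heap or the overhead is the one to raise depends on whether the JVM itself runs out of memory or YARN kills the container.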

Spark Application - High "Executor Computing Time"

I have a Spark application that has now been running for 46 hours. While the majority of its jobs complete within 25 seconds, specific jobs take hours. Some details are provided below:
Task Time Shuffle Read Shuffle Write
7.5 h 2.2 MB / 257402 2.9 MB / 128601
There are other similar task times, of course, with values of 11.3 h, 10.6 h, 9.4 h, etc., each of them spending the bulk of the activity time on "rdd at DataFrameFunctions.scala:42". Details for the stage reveal that the time is spent by the executor on "Executor Computing Time". This executor runs on DataNode 1, where the CPU utilization is quite normal, about 13%. The other boxes (4 more worker nodes) have very nominal CPU utilization.
When the Shuffle Read is within 5000 records, the job is extremely fast and completes within 25 seconds, as stated previously. Nothing is appended to the logs (Spark/Hadoop/HBase), nor is anything noticed at the /tmp or /var/tmp locations that would indicate some disk-related activity is in progress.
I am clueless about what is going wrong and have been struggling with this for quite some time now. The versions of the software used are as follows:
Hadoop : 2.7.2
Zookeeper : 3.4.9
Kafka : 2.11-0.10.1.1
Spark : 2.1.0
HBase : 1.2.6
Phoenix : 4.10.0
Some configurations from the spark-defaults file:
spark.eventLog.enabled true
spark.eventLog.dir hdfs://SDCHDPMAST1:8111/data1/spark-event
spark.history.fs.logDirectory hdfs://SDCHDPMAST1:8111/data1/spark-event
spark.yarn.jars hdfs://SDCHDPMAST1:8111/user/appuser/spark/share/lib/*.jar
spark.driver.maxResultSize 5G
spark.deploy.zookeeper.url SDCZKPSRV01
spark.executor.memory 12G
spark.driver.memory 10G
spark.executor.heartbeatInterval 60s
spark.network.timeout 300s
Is there any way I can reduce the time spent on "Executor Computing time"?
The job is operating on a skewed dataset in this specific case. Because of the skew, the jobs are taking longer than expected.
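A hedged way to confirm that from code, assuming df is the DataFrame feeding the slow stage and keyColumn is a placeholder for whatever key the join/aggregation uses:
import org.apache.spark.sql.functions.desc

// Count rows per key and inspect the largest groups; one key holding most of
// the rows is the classic signature of a skewed shuffle.
val keyCounts = df.groupBy("keyColumn").count().orderBy(desc("count"))
keyCounts.show(20, truncate = false)
If one key dominates, the usual mitigations are salting that key or handling it in a separate job, rather than simply adding more memory.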

Block missing exception while processing the data from hdfs in spark standalone cluster

I was running Spark on Hadoop with 2 workers and 2 datanodes.
The first machine contains: sparkmaster, namenode, worker-1, datanode-1.
The second machine contains: worker-2, datanode-2.
In the Hadoop cluster there are 2 files under the /usr directory: Notice.txt on datanode-1 and README.txt on datanode-2.
I want to create an RDD from these two files and count the lines.
On the first machine I ran the Spark shell with master spark://masterIP:7077 [Standalone mode].
Then on the Scala command line I created the RDD with
val rdd = sc.textFile("/usr/")
but when I ran the count operation rdd.count() it threw the following error:
(TID 2, masterIP, executor 1): org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1313298757-masterIP-1499412323227:blk_1073741827_1003 file=/usr/README.txt
worker-1 is picking up NOTICE.txt but worker-2 is not picking up README.txt.
I am not able to figure out the problem; any help will be appreciated. Thanks.
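One hedged way to narrow this down (the namenode host and port below are placeholders for the actual fs.defaultFS value) is to read the two files separately from the same spark-shell, so the failing block is isolated:
// Placeholders: replace masterIP:9000 with the cluster's real fs.defaultFS.
val notice = sc.textFile("hdfs://masterIP:9000/usr/Notice.txt")
val readme = sc.textFile("hdfs://masterIP:9000/usr/README.txt")
println(s"Notice.txt lines: ${notice.count()}")  // block reported on datanode-1
println(s"README.txt lines: ${readme.count()}")  // block behind the BlockMissingException above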