Monitoring a task in Apache Spark - Scala

I start the Spark master using: ./sbin/start-master.sh
as described at:
http://spark.apache.org/docs/latest/spark-standalone.html
I then submit the Spark job:
sh ./bin/spark-submit \
  --class simplespark.Driver \
  --master spark://localhost:7077 \
  C:\\Users\\Adrian\\workspace\\simplespark\\target\\simplespark-0.0.1-SNAPSHOT.jar
How can I run a simple app that demonstrates a parallel task running?
When I view http://localhost:4040/executors/ and http://localhost:8080/ there are no tasks running.
The .jar I'm running (simplespark-0.0.1-SNAPSHOT.jar) just contains a single Scala object:
package simplespark

import org.apache.spark.SparkContext

object Driver {
  def main(args: Array[String]) {
    val conf = new org.apache.spark.SparkConf()
      .setMaster("local")
      .setAppName("knn")
      .setSparkHome("C:\\spark-1.1.0-bin-hadoop2.4\\spark-1.1.0-bin-hadoop2.4")
      .set("spark.executor.memory", "2g")
    val sc = new SparkContext(conf)
    val l = List(1)
    sc.parallelize(l)
    while (true) {}
  }
}
Update: When I change --master spark://localhost:7077 \ to --master spark://Adrian-PC:7077 \
I can see updates on the Spark UI.
I have also updated Driver.scala to use the default configuration, as I'm not sure whether I was setting it correctly for submitted Spark jobs:
package simplespark

import org.apache.spark.SparkContext

object Driver {
  def main(args: Array[String]) {
    System.setProperty("spark.executor.memory", "2g")
    val sc = new SparkContext()
    val l = List(1)
    val c = sc.parallelize(List(2, 3, 5, 7)).count()
    println(c)
    sc.stop()
  }
}
On the Spark console I repeatedly receive the same message:
14/12/26 20:08:32 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
So it appears that the Spark job is not reaching the master?
Update 2: After I start the worker (thanks to Lomig Mégard's comment below) using:
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://Adrian-PC:7077
I receive the error:
14/12/27 21:23:52 INFO SparkDeploySchedulerBackend: Executor app-20141227212351-0003/8 removed: java.io.IOException: Cannot run program "C:\cygdrive\c\spark-1.1.0-bin-hadoop2.4\spark-1.1.0-bin-hadoop2.4/bin/compute-classpath.cmd" (in directory "."): CreateProcess error=2, The system cannot find the file specified
14/12/27 21:23:52 INFO AppClient$ClientActor: Executor added: app-20141227212351-0003/9 on worker-20141227211411-Adrian-PC-58199 (Adrian-PC:58199) with 4 cores
14/12/27 21:23:52 INFO SparkDeploySchedulerBackend: Granted executor ID app-20141227212351-0003/9 on hostPort Adrian-PC:58199 with 4 cores, 2.0 GB RAM
14/12/27 21:23:52 INFO AppClient$ClientActor: Executor updated: app-20141227212351-0003/9 is now RUNNING
14/12/27 21:23:52 INFO AppClient$ClientActor: Executor updated: app-20141227212351-0003/9 is now FAILED (java.io.IOException: Cannot run program "C:\cygdrive\c\spark-1.1.0-bin-hadoop2.4\spark-1.1.0-bin-hadoop2.4/bin/compute-classpath.cmd" (in directory "."): CreateProcess error=2, The system cannot find the file specified)
14/12/27 21:23:52 INFO SparkDeploySchedulerBackend: Executor app-20141227212351-0003/9 removed: java.io.IOException: Cannot run program "C:\cygdrive\c\spark-1.1.0-bin-hadoop2.4\spark-1.1.0-bin-hadoop2.4/bin/compute-classpath.cmd" (in directory "."): CreateProcess error=2, The system cannot find the file specified
14/12/27 21:23:52 ERROR SparkDeploySchedulerBackend: Application has been killed. Reason: Master removed our application: FAILED
14/12/27 21:23:52 ERROR TaskSchedulerImpl: Exiting due to error from cluster scheduler: Master removed our application: FAILED
14/12/27 21:23:52 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (ParallelCollectionRDD[0] at parallelize at Driver.scala:14)
14/12/27 21:23:52 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
Java HotSpot(TM) Client VM warning: ignoring option MaxPermSize=128m; support was removed in 8.0
I'm running the scripts on Windows using Cygwin. To fix this error I copied the Spark installation to the Cygwin C:\ drive. But then I receive a new error:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Master removed our application: FAILED
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Java HotSpot(TM) Client VM warning: ignoring option MaxPermSize=128m; support was removed in 8.0

You have to start the actual computation to see the job.
val c = sc.parallelize(List(2, 3, 5, 7)).count()
println(c)
Here count is called an action; you need at least one action to trigger a job. You can find the list of available actions in the Spark documentation.
The other methods are called transformations. They are executed lazily.
Don't forget to stop the context at the end with sc.stop(), instead of your infinite loop.
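To see the laziness concretely, here is a minimal sketch (the extra map and toDebugString are only illustrative, not from the original program):
val rdd = sc.parallelize(List(2, 3, 5, 7)).map(_ * 2)  // transformation: no job is started yet
println(rdd.toDebugString)                              // still no job, only the lineage is printed
val n = rdd.count()                                     // action: a job now shows up in the UI at :4040
println(n)                                              // prints 4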
Edit: for the updated question, you allocate more memory to the executor than is available in the worker. The defaults should be fine for simple tests.
You also need a running worker linked to your master. See this doc to start it:
./sbin/start-master.sh
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT
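Once the master and a worker are both up, the job can be resubmitted with an executor size the worker can actually satisfy. A sketch, reusing the paths from the question (512m is just an arbitrary small value; anything below the worker's advertised memory works):
sh ./bin/spark-submit \
  --class simplespark.Driver \
  --master spark://Adrian-PC:7077 \
  --executor-memory 512m \
  C:\\Users\\Adrian\\workspace\\simplespark\\target\\simplespark-0.0.1-SNAPSHOT.jar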

Related

Spark application erroring out with driver OutOfMemory even though a lot of free memory is still available

Can someone please help me figure out why my simple Spark application requires such huge driver memory? Even though I allocated about 112 GB, my application fails at about 67 GB.
Thanks in advance.
The Spark driver is using huge memory for running a simple application.
I allocated about 112 GB of memory for running my application in spark-submit.
At the start of the job, I see the message below in the logs:
1019 [main] INFO org.apache.spark.storage.memory.MemoryStore - MemoryStore started with capacity 67.0 GiB
My application fails with this error message
java.lang.IllegalStateException: dag-scheduler-event-loop has already been stopped accidentally.
at org.apache.spark.util.EventLoop.post(EventLoop.scala:107)
at org.apache.spark.scheduler.DAGScheduler.taskStarted(DAGScheduler.scala:283)
at org.apache.spark.scheduler.TaskSetManager.prepareLaunchingTask(TaskSetManager.scala:539)
at org.apache.spark.scheduler.TaskSetManager.$anonfun$resourceOffer$2(TaskSetManager.scala:478)
at scala.Option.map(Option.scala:230)
at org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:455)
at org.apache.spark.scheduler.TaskSchedulerImpl.$anonfun$resourceOfferSingleTaskSet$2(TaskSchedulerImpl.scala:395)
at org.apache.spark.scheduler.TaskSchedulerImpl.$anonfun$resourceOfferSingleTaskSet$2$adapted(TaskSchedulerImpl.scala:390)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.TaskSchedulerImpl.$anonfun$resourceOfferSingleTaskSet$1(TaskSchedulerImpl.scala:390)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
at org.apache.spark.scheduler.TaskSchedulerImpl.resourceOfferSingleTaskSet(TaskSchedulerImpl.scala:381)
at org.apache.spark.scheduler.TaskSchedulerImpl.$anonfun$resourceOffers$20(TaskSchedulerImpl.scala:587)
at org.apache.spark.scheduler.TaskSchedulerImpl.$anonfun$resourceOffers$20$adapted(TaskSchedulerImpl.scala:582)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
at org.apache.spark.scheduler.TaskSchedulerImpl.$anonfun$resourceOffers$16(TaskSchedulerImpl.scala:582)
at org.apache.spark.scheduler.TaskSchedulerImpl.$anonfun$resourceOffers$16$adapted(TaskSchedulerImpl.scala:555)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:555)
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.$anonfun$makeOffers$5(CoarseGrainedSchedulerBackend.scala:359)
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.org$apache$spark$scheduler$cluster$CoarseGrainedSchedulerBackend$$withLock(CoarseGrainedSchedulerBackend.scala:955)
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.org$apache$spark$scheduler$cluster$CoarseGrainedSchedulerBackend$DriverEndpoint$$makeOffers(CoarseGrainedSchedulerBackend.scala:351)
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:162)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
59376410 [dispatcher-CoarseGrainedScheduler] INFO org.apache.spark.scheduler.TaskSchedulerImpl - Cancelling stage 1
59376410 [dispatcher-CoarseGrainedScheduler] INFO org.apache.spark.scheduler.TaskSchedulerImpl - Killing all running tasks in stage 1: Stage cancelled
59376415 [dispatcher-CoarseGrainedScheduler] ERROR org.apache.spark.scheduler.DAGSchedulerEventProcessLoop - DAGSchedulerEventProcessLoop failed; shutting down SparkContext
Scala code snippet:
val df = spark.read.parquet(data_path)
df.rdd.foreachPartition(p => {
  // code to process the partition...
})
Job submit command:
spark-submit --master "spark://x.x.x.x:7077" --driver-cores=4 --driver-memory=112G --conf spark.driver.maxResultSize=0 --conf spark.rpc.message.maxSize=2047 --conf spark.driver.host=x.x.x.x --class myclass.processor --packages "..,org.apache.hadoop:hadoop-azure:3.3.1" --deploy-mode client
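For reference, the 67.0 GiB reported by the MemoryStore is close to what Spark's default unified-memory settings would carve out of a 112 GB heap (assuming spark.memory.fraction is left at its default of 0.6, which the command above does not change):
(112 GB heap - 300 MB reserved) * 0.6 ≈ 67 GiB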

Spark RDD method "saveAsTextFile" throwing org.apache.hadoop.mapred.FileAlreadyExistsException even after deleting the output directory

I am calling this method on an RDD[String], with the destination in the arguments (Scala).
Even after deleting the directory before starting, the process gives this error.
I am running this process on an EMR cluster with the output location on AWS S3.
Below is the command used:
spark-submit --deploy-mode cluster --class com.hotwire.hda.spark.prd.pricingengine.PRDPricingEngine --conf spark.yarn.submit.waitAppCompletion=true --num-executors 21 --executor-cores 4 --executor-memory 20g --driver-memory 8g --driver-cores 4 s3://bi-aws-users/sbatheja/hotel-shopper-0.0.1-SNAPSHOT-jar-with-dependencies.jar -d 3 -p 100 --search-bucket s3a://hda-prod-business.hotwire.hotel.search --prd-output-path s3a://bi-aws-users/sbatheja/PRD/PriceEngineOutput/
Log:
16/07/07 11:27:47 INFO BlockManagerMaster: BlockManagerMaster stopped
16/07/07 11:27:47 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/07/07 11:27:47 INFO SparkContext: Successfully stopped SparkContext
16/07/07 11:27:47 INFO ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: User class threw exception: **org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory s3a://bi-aws-users/sbatheja/PRD/PriceEngineOutput already exists)**
16/07/07 11:27:47 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/07/07 11:27:47 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
16/07/07 11:27:47 INFO AMRMClientImpl: Waiting for application to be successfully unregistered.
16/07/07 11:27:47 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
16/07/07 11:27:47 INFO ApplicationMaster: Deleting staging directory .sparkStaging/application_1467889642439_0001
16/07/07 11:27:47 INFO ShutdownHookManager: Shutdown hook called
16/07/07 11:27:47 INFO ShutdownHookManager: Deleting directory /mnt/yarn/usercache/hadoop/appcache/application_1467889642439_0001/spark-7f836950-a040-4216-9308-2bb4565c5649
It creates a "_temporary" directory in the output location, which contains empty part files.
In short: make sure the Scala versions of spark-core and scala-library are consistent.
I encountered the same problem.
When saving the file to HDFS, it threw the exception org.apache.hadoop.mapred.FileAlreadyExistsException.
Then I checked the HDFS directory; there was an empty temporary folder: TARGET_DIR/_temporary/0.
You can submit the job with detailed configuration output: ./spark-submit --verbose.
Then look at the full context and log; there must be another error causing this one.
With my job in the RUNNING state, the first error thrown was:
17/04/23 11:47:02 ERROR executor.Executor: Exception in task 1.0 in stage 0.0 (TID 1)
java.lang.NoSuchMethodError: scala.Predef$.refArrayOps([Ljava/lang/Object;)[Ljava/lang/Object;
The job is then retried and re-executed. On re-execution it finds that the output directory has already been created, so it also throws the "directory already exists" exception.
This confirms that the first error is a version compatibility issue.
The Spark version is 2.1.0, so the corresponding spark-core Scala version is 2.11, but the scala-library dependency was at Scala version 2.12.xx.
Once the two Scala versions are made consistent (usually by changing the scala-library version), the first exception goes away and the job finishes normally.
pom.xml example:
<!-- Spark -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.1.0</version>
</dependency>
<!-- Scala -->
<dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>2.11.7</version>
</dependency>
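For comparison, a minimal sbt equivalent (a sketch, assuming the project were built with sbt rather than Maven) keeps the two versions in sync automatically, because %% appends the Scala binary version to the artifact name:
// build.sbt (sketch)
scalaVersion := "2.11.7"

// %% resolves to spark-core_2.11, so spark-core and scala-library cannot drift apart
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0"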

Lost executor trying to load a graph using Spark/GraphX on a YARN/HDFS cluster

I am trying to run a Spark/GraphX program written in Scala on a YARN cluster with HDFS. The cluster has 16 nodes with 16 GB RAM and a 2 TB HDD each. All I want is to load a 3.29 GB undirected graph (called orkutUndirected.txt) using the edgeListFile function provided by the GraphX library:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.SparkConf
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
import java.io._
import java.util.Date
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.StreamingContext
import scala.util.control.Breaks._
import scala.math._

object MyApp {
  def main(args: Array[String]): Unit = {
    // Create Spark configuration and Spark context
    val conf = new SparkConf().setAppName("My App")
    val sc = new SparkContext(conf)

    val edgeFile = "hdfs://master-bigdata:8020/user/sparklab/orkutUndirected.txt"

    // Load the edges as a graph
    val graph = GraphLoader.edgeListFile(sc, edgeFile, false, 1,
      StorageLevel.MEMORY_AND_DISK, StorageLevel.MEMORY_AND_DISK)
  }
}
I start the run using the following spark-submit command:
nohup spark-submit --master yarn --executor-memory 7g --num-executors 4 --executor-cores 2 ./target/scala-2.10/myapp_2.10-1.0.jar &
I tried different --executor-memory sizes, but no luck.
After a few minutes I see the following in nohup.out:
16/02/24 23:45:25 ERROR YarnScheduler: Lost executor 1 on node12-bigdata: Executor heartbeat timed out after 160351 ms
16/02/24 23:45:29 ERROR YarnScheduler: Lost executor 1 on node12-bigdata: remote Rpc client disassociated
16/02/25 00:04:08 ERROR YarnScheduler: Lost executor 3 on node13-bigdata: remote Rpc client disassociated
16/02/25 00:18:05 ERROR YarnScheduler: Lost executor 4 on node06-bigdata: Executor heartbeat timed out after 129723 ms
16/02/25 00:18:07 ERROR YarnScheduler: Lost executor 4 on node06-bigdata: remote Rpc client disassociated
16/02/25 00:21:52 ERROR YarnScheduler: Lost executor 4 on node16-bigdata: remote Rpc client disassociated
16/02/25 00:41:29 ERROR YarnScheduler: Lost executor 1 on node03-bigdata: remote Rpc client disassociated
16/02/25 00:44:52 ERROR YarnScheduler: Lost executor 5 on node16-bigdata: remote Rpc client disassociated
16/02/25 00:44:52 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, node16-bigdata):
ExecutorLostFailure (executor 5 lost)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at scala.Option.foreach(Option.scala:236)
...
...
Do you have any idea what could be wrong?
Depending on what types of objects you create from the raw text file, 3.29 GB of raw data can easily exceed the working memory of this cluster, causing long GC pauses and lost executors. In addition to the overhead of wrapping data in Java objects, GraphX adds its own overhead on top of RDDs. VisualVM and Ganglia are good tools for debugging these memory-related issues. Also, see Tuning Spark for tips on how to keep your graph lean.
Another possibility is that the data was not partitioned optimally, causing some tasks to stall. Check the stage information in the Spark UI and make sure each task works on an evenly distributed share of the data. If it is not evenly distributed, repartition your data, as in the sketch below. I found the Cloudera blog on this subject useful.
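One concrete lever in the code above: the call passes 1 as numEdgePartitions, so the whole 3.29 GB edge list ends up in a single partition. A sketch of spreading the load (64 is only an illustrative value to tune against the cluster):
// Ask GraphLoader for more edge partitions so the file is split across many tasks.
val graph = GraphLoader.edgeListFile(
  sc,
  edgeFile,
  canonicalOrientation = false,
  numEdgePartitions = 64,
  edgeStorageLevel = StorageLevel.MEMORY_AND_DISK,
  vertexStorageLevel = StorageLevel.MEMORY_AND_DISK)

// Alternatively, repartition the edges after loading.
val moreParallelEdges = graph.edges.repartition(64)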

Spark worker throwing ERROR SendingConnection: Exception while reading SendingConnection to ConnectionManagerId

I am trying to execute a simple example app with Spark, submitting the job with spark-submit:
spark-submit --class "SimpleJob" --master spark://:7077 target/scala-2.10/simple-project_2.10-1.0.jar
15/03/08 23:21:53 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/03/08 23:21:53 WARN LoadSnappy: Snappy native library not loaded
15/03/08 23:22:09 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
Lines with a: 21, Lines with b: 21
The job gives the correct result, but prints the following errors after it:
15/03/08 23:22:28 ERROR SendingConnection: Exception while reading SendingConnection to ConnectionManagerId(<worker-host.domain.com>,53628)
java.nio.channels.ClosedChannelException
at sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:252)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:295)
at org.apache.spark.network.SendingConnection.read(Connection.scala:390)
at org.apache.spark.network.ConnectionManager$$anon$6.run(ConnectionManager.scala:205)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/03/08 23:22:28 ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId(<worker-host.domain.com>,53628) not found
15/03/08 23:22:28 WARN ConnectionManager: All connections not cleaned up
Following is the spark-defaults.conf
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 5g
spark.master spark://<master-ip>:7077
spark.eventLog.enabled true
spark.executor.extraClassPath $SPARK-HOME/spark-cassandra-connector/spark-cassandra-connector/target/scala-2.10/spark-cassandra-connector-assembly-1.2.0-SNAPSHOT.jar
spark.cassandra.connection.conf.factory com.datastax.spark.connector.cql.DefaultConnectionFactory
spark.cassandra.auth.conf.factory com.datastax.spark.connector.cql.DefaultAuthConfFactory
spark.cassandra.query.retry.count 10
Following is the spark-env.sh
SPARK_LOCAL_IP=<master-ip in master worker-ip in workers>
SPARK_MASTER_HOST='<master-hostname>'
SPARK_MASTER_IP=<master-ip>
SPARK_MASTER_PORT=7077
SPARK_WORKER_CORES=2
SPARK_WORKER_MEMORY=2g
SPARK_WORKER_INSTANCES=4
I got an answer to this.
Even though I was adding the Cassandra connector to the classpath with that command, I was not shipping the same jar to all nodes of the cluster.
Now I am using the command sequence below to do it properly:
spark-shell --driver-class-path ~/Installers/spark-cassandra-connector-1.1.1/spark-cassandra-connector/target/scala-2.10/spark-cassandra-connector-assembly-1.1.1.jar
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import com.datastax.spark.connector._
sc.addJar("~/Installers/spark-cassandra-connector-1.1.1/spark-cassandra-connector/target/scala-2.10/spark-cassandra-connector-assembly-1.1.1.jar")
After these commands I am able to read from and write to my Cassandra cluster properly using Spark RDDs.
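An alternative to adding the jar by hand with sc.addJar is to let spark-submit distribute it to every node with --jars (a sketch, reusing the connector path and the placeholder master from above):
spark-submit --class "SimpleJob" --master spark://<master-ip>:7077 \
  --jars ~/Installers/spark-cassandra-connector-1.1.1/spark-cassandra-connector/target/scala-2.10/spark-cassandra-connector-assembly-1.1.1.jar \
  target/scala-2.10/simple-project_2.10-1.0.jar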

Why are all the tasks assigned to one machine (executor) in Spark after some stages?

I encountered this problem: when I run a machine learning task on Spark, after some stages all the tasks are assigned to one machine (executor), and the stage execution gets slower and slower.
[the spark conf setting]
val conf = new SparkConf().setMaster(sparkMaster).setAppName("ModelTraining").setSparkHome(sparkHome).setJars(List(jarFile))
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.set("spark.kryo.registrator", "LRRegistrator")
conf.set("spark.storage.memoryFraction", "0.7")
conf.set("spark.executor.memory", "8g")
conf.set("spark.cores.max", "150")
conf.set("spark.speculation", "true")
conf.set("spark.storage.blockManagerHeartBeatMs", "300000")
val sc = new SparkContext(conf)
val lines = sc.textFile("hdfs://xxx:52310"+inputPath , 3)
val trainset = lines.map(parseWeightedPoint).repartition(50).persist(StorageLevel.MEMORY_ONLY)
[the warning log from Spark]
14/09/19 10:26:23 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(45, TS-BH109, 48384, 0)
14/09/19 10:27:18 WARN TaskSetManager: Lost TID 726 (task 14.0:9)
14/09/19 10:29:03 WARN SparkDeploySchedulerBackend: Ignored task status update (737 state FAILED) from unknown executor Actor[akka.tcp://sparkExecutor#TS-BH96:33178/user/Executor#-913985102] with ID 39
14/09/19 10:29:03 WARN TaskSetManager: Loss was due to fetch failure from BlockManagerId(30, TS-BH136, 28518, 0)
14/09/19 11:01:22 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(47, TS-BH136, 31644, 0) with no recent heart beats: 47765ms exceeds 45000ms
Any suggestions?
Can you post the executor log? Is there anything suspect in it? In particular, you have Lost TID 726 (task 14.0:9). Further up in the driver log you should see which executor TID 726 was assigned to; I'd check the error log on that machine and see if anything of interest shows up there.
My guess (and it's only a guess) is that your executor is crashing, at which point the system will try to launch a new executor, but that is generally slow. In the meantime the current task might get resent to an existing executor, hosing your system further.