Spark hangs during RDD reading - scala

I have an Apache Spark master node. When I try to iterate through RDDs, Spark hangs.
Here is an example of my code:
val conf = new SparkConf()
.setAppName("Demo")
.setMaster("spark://localhost:7077")
.set("spark.executor.memory", "1g")
val sc = new SparkContext(conf)
val records = sc.textFile("file:///Users/barbara/projects/spark/src/main/resources/videos.csv")
println("Start")
records.collect().foreach(println)
println("Finish")
Spark log says:
Start
16/04/05 17:32:23 INFO FileInputFormat: Total input paths to process : 1
16/04/05 17:32:23 INFO SparkContext: Starting job: collect at Application.scala:23
16/04/05 17:32:23 INFO DAGScheduler: Got job 0 (collect at Application.scala:23) with 2 output partitions
16/04/05 17:32:23 INFO DAGScheduler: Final stage: ResultStage 0 (collect at Application.scala:23)
16/04/05 17:32:23 INFO DAGScheduler: Parents of final stage: List()
16/04/05 17:32:23 INFO DAGScheduler: Missing parents: List()
16/04/05 17:32:23 INFO DAGScheduler: Submitting ResultStage 0 (file:///Users/barbara/projects/spark/src/main/resources/videos.csv MapPartitionsRDD[1] at textFile at Application.scala:19), which has no missing parents
16/04/05 17:32:23 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.0 KB, free 120.5 KB)
16/04/05 17:32:23 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1811.0 B, free 122.3 KB)
16/04/05 17:32:23 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 172.18.199.187:55983 (size: 1811.0 B, free: 2.4 GB)
16/04/05 17:32:23 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
16/04/05 17:32:23 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (file:///Users/barbara/projects/spark/src/main/resources/videos.csv MapPartitionsRDD[1] at textFile at Application.scala:19)
16/04/05 17:32:23 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
I see only the "Start" message. It seems Spark does nothing to read the RDD. Any ideas how to fix it?
UPD
The data I want to read:
123v4n312bv4nb12,Action,Comedy
2n4vhj2gvrh24gvr,Action,Drama
sjfu326gjrw6g374,Drama,Horror

If Spark hangs on such a small dataset, I would first ask:
Am I trying to connect to a cluster that doesn't respond or doesn't exist? If I am trying to connect to a running cluster, I would first try to run the same code locally with setMaster("local[*]"). If this works, I would know that the problem is with the "master" I am trying to connect to.
Am I asking for more resources than the cluster has to offer? For example, if the cluster manages 2 GB and I ask for a 3 GB executor, my application will never get scheduled and will sit in the job queue forever.
Specific to the comments above: if you started your cluster with sbin/start-master.sh, you will NOT get a running cluster. At the very minimum you need a master and a worker (for standalone mode). You should use the start-all.sh script. I recommend doing a bit more homework and following a tutorial.
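The first check in this answer (is anything listening at the master address?) can be tried without Spark at all. The following is a minimal sketch, assuming the master URL from the question (spark://localhost:7077). MasterCheck and isListening are hypothetical names. A successful TCP connect only shows that some process accepts connections on that port, not that a worker has registered with the master.

```scala
import java.net.{InetSocketAddress, Socket}
import scala.util.Try

object MasterCheck {
  // Returns true if a TCP connection to host:port succeeds within timeoutMs.
  def isListening(host: String, port: Int, timeoutMs: Int = 2000): Boolean = {
    val socket = new Socket()
    try {
      Try(socket.connect(new InetSocketAddress(host, port), timeoutMs)).isSuccess
    } finally {
      socket.close()
    }
  }

  def main(args: Array[String]): Unit = {
    // Host and port taken from spark://localhost:7077 in the question.
    if (isListening("localhost", 7077))
      println("Something is listening on localhost:7077; check the master UI for registered workers.")
    else
      println("Nothing is listening on localhost:7077; try setMaster(\"local[*]\") to confirm the code itself works.")
  }
}
```

If the port is open but the job still hangs after "Adding task set", the second and third points of the answer (requested resources and missing workers) are the next things to look at.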

Use this instead:
val bufferedSource = io.Source.fromFile("/path/filename.csv")
for (line <- bufferedSource.getLines) {
println(line)
}

Related

Spark-Submit: Failed to open native connection to Cassandra at {10.0.0.5, 10.0.0.4}:9042

I am trying to submit the job using the command
"spark-submit --class it.polimi.dice.spark.WordCount --master yarn-master --conf spark.cassandra.connection.host=10.0.0.5 --num-executors 1 --deploy-mode client --driver-memory 512m --executor-memory 512m /home/useruser/temp/spark-cassandra-example/target/scala-2.10/spark-cassandra-exmaple-assembly-1.0.jar"
But I get an error, although the same thing works in spark-shell, which suggests that the spark-Cassandra connector version 1.6.0-M1 (Spark 1.6.0, Scala 2.10.5, Cassandra 3.3.0) and the other configuration settings are fine. Here is the output I get from the spark-submit command:
16/11/21 09:39:03 INFO Client: Application report for application_1479668866076_0014 (state: ACCEPTED)
16/11/21 09:39:04 INFO Client: Application report for application_1479668866076_0014 (state: ACCEPTED)
16/11/21 09:39:05 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as NettyRpcEndpointRef(null)
16/11/21 09:39:05 INFO YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> sandbox.hortonworks.com, PROXY_URI_BASES -> http://sandbox.hortonworks.com:8088/proxy/application_1479668866076_0014), /proxy/application_1479668866076_0014
16/11/21 09:39:05 INFO JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
16/11/21 09:39:05 INFO Client: Application report for application_1479668866076_0014 (state: RUNNING)
16/11/23 10:10:38 INFO NettyUtil: Found Netty's native epoll transport in the classpath, using it
16/11/23 10:10:38 INFO Cluster: New Cassandra host /10.0.0.4:9042 added
16/11/23 10:10:38 INFO Cluster: New Cassandra host /10.0.0.5:9042 added
16/11/23 10:10:38 INFO CassandraConnector: Connected to Cassandra cluster: Test Cluster
16/11/23 10:10:39 INFO SparkContext: Starting job: fold at WordCount.scala:17
16/11/23 10:10:39 INFO DAGScheduler: Got job 0 (fold at WordCount.scala:17) with 2 output partitions
16/11/23 10:10:39 INFO DAGScheduler: Final stage: ResultStage 0 (fold at WordCount.scala:17)
16/11/23 10:10:39 INFO DAGScheduler: Parents of final stage: List()
16/11/23 10:10:39 INFO DAGScheduler: Missing parents: List()
16/11/23 10:10:39 INFO DAGScheduler: Submitting ResultStage 0 (CassandraTableScanRDD[1] at RDD at CassandraRDD.scala:15), which has no missing parents
16/11/23 10:10:39 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 7.1 KB, free 7.1 KB)
16/11/23 10:10:39 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 3.7 KB, free 10.9 KB)
16/11/23 10:10:39 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.0.0.6:58236 (size: 3.7 KB, free: 143.6 MB)
16/11/23 10:10:39 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1006
16/11/23 10:10:39 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (CassandraTableScanRDD[1] at RDD at CassandraRDD.scala:15)
16/11/23 10:10:39 INFO YarnScheduler: Adding task set 0.0 with 2 tasks
16/11/23 10:10:39 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, sandbox.hortonworks.com, partition 0,RACK_LOCAL, 29218 bytes)
16/11/23 10:10:40 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on sandbox.hortonworks.com:51013 (size: 3.7 KB, free: 143.6 MB)
16/11/23 10:10:43 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, sandbox.hortonworks.com, partition 1,RACK_LOCAL, 29156 bytes)
16/11/23 10:10:43 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, sandbox.hortonworks.com): java.io.IOException: Failed to open native connection to Cassandra at {10.0.0.4, 10.0.0.5}:9042
16/11/23 10:10:45 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID 2, sandbox.hortonworks.com, partition 0,RACK_LOCAL, 29218 bytes)
16/11/23 10:10:45 INFO TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1) on executor sandbox.hortonworks.com: java.io.IOException (Failed to open native connection to Cassandra at {10.0.0.4, 10.0.0.5}:9042) [duplicate 1]
Can anyone kindly help me fix this issue?

Can't write to cluster if replication_factor is greater than 1

I'm using Spark 1.6.1, Cassandra 2.2.3 and the Cassandra-Spark connector 1.6.
I have already written to a multi-node cluster, but with replication_factor: 1.
Now I'm trying to write to a 6-node cluster with one seed node and a keyspace that has replication_factor > 1, but Spark stops responding and the write never happens.
As I mentioned, it works when I write to the coordinator with the keyspace's replication_factor set to 1.
This is the log I get. It always stops here, or after half an hour it starts cleaning accumulators and stops again at the fourth one.
16/08/16 17:07:03 INFO NettyUtil: Found Netty's native epoll transport in the classpath, using it
16/08/16 17:07:04 INFO Cluster: New Cassandra host /127.0.0.1:9042 added
16/08/16 17:07:04 INFO LocalNodeFirstLoadBalancingPolicy: Added host 127.0.0.1 (datacenter1)
16/08/16 17:07:04 INFO Cluster: New Cassandra host /127.0.0.2:9042 added
16/08/16 17:07:04 INFO LocalNodeFirstLoadBalancingPolicy: Added host 127.0.0.2 (datacenter1)
16/08/16 17:07:04 INFO Cluster: New Cassandra host /127.0.0.3:9042 added
16/08/16 17:07:04 INFO LocalNodeFirstLoadBalancingPolicy: Added host 127.0.0.3 (datacenter1)
16/08/16 17:07:04 INFO Cluster: New Cassandra host /127.0.0.4:9042 added
16/08/16 17:07:04 INFO LocalNodeFirstLoadBalancingPolicy: Added host 127.0.0.4 (datacenter1)
16/08/16 17:07:04 INFO Cluster: New Cassandra host /127.0.0.5:9042 added
16/08/16 17:07:04 INFO LocalNodeFirstLoadBalancingPolicy: Added host 127.0.0.5 (datacenter1)
16/08/16 17:07:04 INFO Cluster: New Cassandra host /127.0.0.6:9042 added
16/08/16 17:07:04 INFO CassandraConnector: Connected to Cassandra cluster: Test Cluster
16/08/16 17:07:05 INFO SparkContext: Starting job: take at CassandraRDD.scala:121
16/08/16 17:07:05 INFO DAGScheduler: Got job 3 (take at CassandraRDD.scala:121) with 1 output partitions
16/08/16 17:07:05 INFO DAGScheduler: Final stage: ResultStage 4 (take at CassandraRDD.scala:121)
16/08/16 17:07:05 INFO DAGScheduler: Parents of final stage: List()
16/08/16 17:07:05 INFO DAGScheduler: Missing parents: List()
16/08/16 17:07:05 INFO DAGScheduler: Submitting ResultStage 4 (CassandraTableScanRDD[17] at RDD at CassandraRDD.scala:18), which has no missing parents
16/08/16 17:07:05 INFO MemoryStore: Block broadcast_7 stored as values in memory (estimated size 8.3 KB, free 170.5 KB)
16/08/16 17:07:05 INFO MemoryStore: Block broadcast_7_piece0 stored as bytes in memory (estimated size 4.2 KB, free 174.7 KB)
16/08/16 17:07:05 INFO BlockManagerInfo: Added broadcast_7_piece0 in memory on localhost:43680 (size: 4.2 KB, free: 756.4 MB)
16/08/16 17:07:05 INFO SparkContext: Created broadcast 7 from broadcast at DAGScheduler.scala:1006
16/08/16 17:07:05 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 4 (CassandraTableScanRDD[17] at RDD at CassandraRDD.scala:18)
16/08/16 17:07:05 INFO TaskSchedulerImpl: Adding task set 4.0 with 1 tasks
16/08/16 17:07:05 INFO TaskSetManager: Starting task 0.0 in stage 4.0 (TID 204, localhost, partition 0,NODE_LOCAL, 22553 bytes)
16/08/16 17:07:05 INFO Executor: Running task 0.0 in stage 4.0 (TID 204)
16/08/16 17:07:06 INFO Executor: Finished task 0.0 in stage 4.0 (TID 204). 2074 bytes result sent to driver
16/08/16 17:07:06 INFO TaskSetManager: Finished task 0.0 in stage 4.0 (TID 204) in 1267 ms on localhost (1/1)
16/08/16 17:07:06 INFO DAGScheduler: ResultStage 4 (take at CassandraRDD.scala:121) finished in 1.276 s
16/08/16 17:07:06 INFO TaskSchedulerImpl: Removed TaskSet 4.0, whose tasks have all completed, from pool
16/08/16 17:07:06 INFO DAGScheduler: Job 3 finished: take at CassandraRDD.scala:121, took 1.310929 s
16/08/16 17:07:06 INFO SparkContext: Starting job: take at CassandraRDD.scala:121
16/08/16 17:07:06 INFO DAGScheduler: Got job 4 (take at CassandraRDD.scala:121) with 4 output partitions
16/08/16 17:07:06 INFO DAGScheduler: Final stage: ResultStage 5 (take at CassandraRDD.scala:121)
16/08/16 17:07:06 INFO DAGScheduler: Parents of final stage: List()
16/08/16 17:07:06 INFO DAGScheduler: Missing parents: List()
16/08/16 17:07:06 INFO DAGScheduler: Submitting ResultStage 5 (CassandraTableScanRDD[17] at RDD at CassandraRDD.scala:18), which has no missing parents
16/08/16 17:07:06 INFO MemoryStore: Block broadcast_8 stored as values in memory (estimated size 8.4 KB, free 183.1 KB)
16/08/16 17:07:06 INFO MemoryStore: Block broadcast_8_piece0 stored as bytes in memory (estimated size 4.2 KB, free 187.3 KB)
16/08/16 17:07:06 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on localhost:43680 (size: 4.2 KB, free: 756.3 MB)
16/08/16 17:07:06 INFO SparkContext: Created broadcast 8 from broadcast at DAGScheduler.scala:1006
16/08/16 17:07:06 INFO DAGScheduler: Submitting 4 missing tasks from ResultStage 5 (CassandraTableScanRDD[17] at RDD at CassandraRDD.scala:18)
16/08/16 17:07:06 INFO TaskSchedulerImpl: Adding task set 5.0 with 4 tasks
16/08/16 17:07:06 INFO TaskSetManager: Starting task 0.0 in stage 5.0 (TID 205, localhost, partition 1,NODE_LOCAL, 22553 bytes)
16/08/16 17:07:06 INFO Executor: Running task 0.0 in stage 5.0 (TID 205)
16/08/16 17:07:07 INFO Executor: Finished task 0.0 in stage 5.0 (TID 205). 2074 bytes result sent to driver
16/08/16 17:07:07 INFO TaskSetManager: Finished task 0.0 in stage 5.0 (TID 205) in 706 ms on localhost (1/4)
16/08/16 17:07:14 INFO CassandraConnector: Disconnected from Cassandra cluster: Test Cluster
16/08/16 17:32:40 INFO BlockManagerInfo: Removed broadcast_7_piece0 on localhost:43680 in memory (size: 4.2 KB, free: 756.4 MB)
16/08/16 17:32:40 INFO ContextCleaner: Cleaned accumulator 14
16/08/16 17:32:40 INFO ContextCleaner: Cleaned accumulator 13
16/08/16 17:32:40 INFO BlockManagerInfo: Removed broadcast_5_piece0 on localhost:43680 in memory (size: 7.1 KB, free: 756.4 MB)
16/08/16 17:32:40 INFO ContextCleaner: Cleaned accumulator 12
16/08/16 17:32:40 INFO ContextCleaner: Cleaned shuffle 0
16/08/16 17:32:40 INFO ContextCleaner: Cleaned accumulator 11
16/08/16 17:32:40 INFO ContextCleaner: Cleaned accumulator 10
16/08/16 17:32:40 INFO ContextCleaner: Cleaned accumulator 9
16/08/16 17:32:40 INFO ContextCleaner: Cleaned accumulator 8
16/08/16 17:32:40 INFO ContextCleaner: Cleaned accumulator 7
16/08/16 17:32:40 INFO ContextCleaner: Cleaned accumulator 6
16/08/16 17:32:40 INFO ContextCleaner: Cleaned accumulator 5
16/08/16 17:32:40 INFO ContextCleaner: Cleaned accumulator 4
16/08/16 17:32:40 INFO BlockManagerInfo: Removed broadcast_4_piece0 on localhost:43680 in memory (size: 13.8 KB, free: 756.4 MB)
16/08/16 20:45:06 INFO SparkContext: Invoking stop() from shutdown hook
EDIT
This is a snippet of the code showing exactly what I am doing:
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark
import org.apache.spark.storage.StorageLevel
import org.apache.spark.sql.types.{StructType, StructField, DateType, IntegerType};
object ff {
def main(string: Array[String]) {
val conf = new SparkConf()
.set("spark.cassandra.connection.host", "127.0.0.1")
.setMaster("local[4]")
.setAppName("ff")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true")
.load("test.csv")
df.registerTempTable("ff_table")
//df.printSchema()
df.count
time {
df.write
.format("org.apache.spark.sql.cassandra")
.options(Map("table" -> "ff_table", "keyspace" -> "traffic"))
.save()
}
def time[A](f: => A) = {
val s = System.nanoTime
val ret = f
println("time: " + (System.nanoTime - s) / 1e6 + "ms")
ret
}
}
}
Also, if I run nodetool describecluster I get these results:
Cluster Information:
Name: Test Cluster
Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch
Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
Schema versions:
bf6c3ae7-5c8b-3e5d-9794-8e34bee9278f: [127.0.0.1, 127.0.0.2, 127.0.0.3, 127.0.0.4, 127.0.0.5, 127.0.0.6]
My keyspace configuration:
CREATE KEYSPACE traffic WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '3'} AND durable_writes = true;
I tried inserting one row from the CLI with replication_factor: 3 and it works, so every node can see the others.
Why can't Spark insert anything then? Any ideas?

Scala UDF runs fine on Spark shell but gives NPE when using it in sparkSQL

I have created a Spark UDF. When I run it in spark-shell it runs perfectly fine. But when I register it and use it in my Spark SQL query it throws a NullPointerException.
scala> test_proc("1605","(#supp In (-1,118)")
16/03/07 10:35:04 INFO TaskSetManager: Finished task 0.0 in stage 21.0 (TID 220) in 62 ms on cdts1hdpdn01d.rxcorp.com (1/1)
16/03/07 10:35:04 INFO YarnScheduler: Removed TaskSet 21.0, whose tasks have all completed, from pool
16/03/07 10:35:04 INFO DAGScheduler: ResultStage 21 (first at :45) finished in 0.062 s
16/03/07 10:35:04 INFO DAGScheduler: Job 16 finished: first at :45, took 2.406408 s
res14: Int = 1
scala>
But when I register it and use it in my sparkSQL query, it gives NPE.
scala> sqlContext.udf.register("store_proc", test_proc _)
scala> hiveContext.sql("select store_proc('1605' , '(#supp In (-1,118)')").first.getInt(0)
16/03/07 10:37:58 INFO ParseDriver: Parsing command: select store_proc('1605' , '(#supp In (-1,118)')
16/03/07 10:37:58 INFO ParseDriver: Parse Completed
16/03/07 10:37:58 INFO SparkContext: Starting job: first at :24
16/03/07 10:37:58 INFO DAGScheduler: Got job 17 (first at :24) with 1 output partitions
16/03/07 10:37:58 INFO DAGScheduler: Final stage: ResultStage 22(first at :24)
16/03/07 10:37:58 INFO DAGScheduler: Parents of final stage: List()
16/03/07 10:37:58 INFO DAGScheduler: Missing parents: List()
16/03/07 10:37:58 INFO DAGScheduler: Submitting ResultStage 22 (MapPartitionsRDD[86] at first at :24), which has no missing parents
16/03/07 10:37:58 INFO MemoryStore: ensureFreeSpace(10520) called with curMem=1472899, maxMem=2222739947
16/03/07 10:37:58 INFO MemoryStore: Block broadcast_30 stored as values in memory (estimated size 10.3 KB, free 2.1 GB)
16/03/07 10:37:58 INFO MemoryStore: ensureFreeSpace(4774) called with curMem=1483419, maxMem=2222739947
16/03/07 10:37:58 INFO MemoryStore: Block broadcast_30_piece0 stored as bytes in memory (estimated size 4.7 KB, free 2.1 GB)
16/03/07 10:37:58 INFO BlockManagerInfo: Added broadcast_30_piece0 in memory on 162.44.214.87:47564 (size: 4.7 KB, free: 2.1 GB)
16/03/07 10:37:58 INFO SparkContext: Created broadcast 30 from broadcast at DAGScheduler.scala:861
16/03/07 10:37:58 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 22 (MapPartitionsRDD[86] at first at :24)
16/03/07 10:37:58 INFO YarnScheduler: Adding task set 22.0 with 1 tasks
16/03/07 10:37:58 INFO TaskSetManager: Starting task 0.0 in stage 22.0 (TID 221, cdts1hdpdn02d.rxcorp.com, partition 0,PROCESS_LOCAL, 2155 bytes)
16/03/07 10:37:58 INFO BlockManagerInfo: Added broadcast_30_piece0 in memory on cdts1hdpdn02d.rxcorp.com:33678 (size: 4.7 KB, free: 6.7 GB)
16/03/07 10:37:58 WARN TaskSetManager: Lost task 0.0 in stage 22.0 (TID 221, cdts1hdpdn02d.rxcorp.com): java.lang.NullPointerException
at org.apache.spark.sql.hive.HiveContext.parseSql(HiveContext.scala:291)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:725)
at $line20.$read$iwC$iwC$iwC$iwC$iwC$iwC$iwC$iwC.test_proc(:41)
This is a sample of my 'test_proc':
def test_proc(x:String, y:String):Int = {
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val z:Int = hiveContext.sql("select 7").first.getInt(0)
return z
}
Based on the output of the standalone call, it looks like test_proc executes some kind of Spark action. This cannot work inside a UDF, because Spark doesn't support nested operations on distributed data structures. If test_proc uses SQLContext, this will result in an NPE, since Spark contexts exist only on the driver.
If that's the case, you'll have to restructure your code to achieve the desired effect, using either local (most likely broadcast) variables or joins.
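One way to read this advice, as a minimal sketch: run the query once on the driver, keep the result in a plain local value, and let the UDF close over that value instead of calling hiveContext.sql itself. makeStoreProc and driverSideValue are hypothetical names, and the registration lines in the comments assume the hiveContext from the question's spark-shell session.

```scala
object UdfSketch {
  // Builds the UDF body from a value computed beforehand on the driver.
  // The returned function touches no SparkContext or HiveContext, so it can
  // be serialized and run on executors.
  def makeStoreProc(driverSideValue: Int): (String, String) => Int =
    (x: String, y: String) => driverSideValue

  def main(args: Array[String]): Unit = {
    // In spark-shell this value would come from a driver-side query, e.g.
    //   val z = hiveContext.sql("select 7").first.getInt(0)
    val z = 7
    val storeProc = makeStoreProc(z)
    // Registration would then look like:
    //   hiveContext.udf.register("store_proc", storeProc)
    println(storeProc("1605", "(#supp In (-1,118)"))
  }
}
```

If the value depends on the UDF arguments, the same idea applies with a driver-side lookup table (broadcast if it is large) or a join in the SQL query itself.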

Using addFile with pipe on a yarn cluster

I've been using pyspark with my YARN cluster with success. The work I'm
doing involves using the RDD's pipe command to send data through a binary
I've made. I can do this easily in pyspark like so (assuming 'sc' is
already defined):
sc.addFile("./dumb_prog")
t= sc.parallelize(range(10))
t.pipe("dumb_prog")
t.take(10) # Gives expected result
However, if I do the same thing in Scala, the pipe command gets a 'Cannot
run program "dumb_prog": error=2, No such file or directory' error. Here's
the code in the Scala shell:
sc.addFile("./dumb_prog")
val t = sc.parallelize(0 until 10)
val u = t.pipe("dumb_prog")
u.take(10)
Why does this only work in Python and not in Scala? Is there a way I can
get it to work in Scala?
Here is the full error message from the scala side:
14/09/29 13:07:47 INFO SparkContext: Starting job: take at <console>:17
14/09/29 13:07:47 INFO DAGScheduler: Got job 3 (take at <console>:17) with 1
output partitions (allowLocal=true)
14/09/29 13:07:47 INFO DAGScheduler: Final stage: Stage 3(take at
<console>:17)
14/09/29 13:07:47 INFO DAGScheduler: Parents of final stage: List()
14/09/29 13:07:47 INFO DAGScheduler: Missing parents: List()
14/09/29 13:07:47 INFO DAGScheduler: Submitting Stage 3 (PipedRDD[3] at pipe
at <console>:14), which has no missing parents
14/09/29 13:07:47 INFO MemoryStore: ensureFreeSpace(2136) called with
curMem=7453, maxMem=278302556
14/09/29 13:07:47 INFO MemoryStore: Block broadcast_3 stored as values in
memory (estimated size 2.1 KB, free 265.4 MB)
14/09/29 13:07:47 INFO MemoryStore: ensureFreeSpace(1389) called with
curMem=9589, maxMem=278302556
14/09/29 13:07:47 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes
in memory (estimated size 1389.0 B, free 265.4 MB)
14/09/29 13:07:47 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory
on 10.10.0.20:37574 (size: 1389.0 B, free: 265.4 MB)
14/09/29 13:07:47 INFO BlockManagerMaster: Updated info of block
broadcast_3_piece0
14/09/29 13:07:47 INFO DAGScheduler: Submitting 1 missing tasks from Stage 3
(PipedRDD[3] at pipe at <console>:14)
14/09/29 13:07:47 INFO YarnClientClusterScheduler: Adding task set 3.0 with
1 tasks
14/09/29 13:07:47 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID
6, SERVERNAME, PROCESS_LOCAL, 1201 bytes)
14/09/29 13:07:47 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory
on SERVERNAME:57118 (size: 1389.0 B, free: 530.3 MB)
14/09/29 13:07:47 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 6,
SERVERNAME): java.io.IOException: Cannot run program "dumb_prog": error=2,
No such file or directory
java.lang.ProcessBuilder.start(ProcessBuilder.java:1041)
org.apache.spark.rdd.PipedRDD.compute(PipedRDD.scala:119)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
I ran into a similar issue in Spark 1.3.0 in YARN client mode. When I looked in the app cache directory, the file was never pushed to the executors, even when using --files. But when I added the lines below, it was pushed to each executor:
sc.addFile("dumb_prog",true)
t.pipe("./dumb_prog")
I think it is a bug, but the above got me past the issue.

Apache spark message understanding

Please help me understand this message:
INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 2 is **2202921** bytes
What does 2202921 mean here?
My job performs a shuffle, and while reading shuffle files from the previous stage it prints this message first and then, after some time, fails with the error below:
14/11/12 11:09:46 WARN scheduler.TaskSetManager: Lost task 224.0 in stage 4.0 (TID 13938, ip-xx-xxx-xxx-xx.ec2.internal): FetchFailed(BlockManagerId(11, ip-xx-xxx-xxx-xx.ec2.internal, 48073, 0), shuffleId=2, mapId=7468, reduceId=224)
14/11/12 11:09:46 INFO scheduler.DAGScheduler: Marking Stage 4 (coalesce at <console>:49) as failed due to a fetch failure from Stage 3 (map at <console>:42)
14/11/12 11:09:46 INFO scheduler.DAGScheduler: Stage 4 (coalesce at <console>:49) failed in 213.446 s
14/11/12 11:09:46 INFO scheduler.DAGScheduler: Resubmitting Stage 3 (map at <console>:42) and Stage 4 (coalesce at <console>:49) due to fetch failure
14/11/12 11:09:46 INFO scheduler.DAGScheduler: Executor lost: 11 (epoch 2)
14/11/12 11:09:46 INFO storage.BlockManagerMasterActor: Trying to remove executor 11 from BlockManagerMaster.
14/11/12 11:09:46 INFO storage.BlockManagerMaster: Removed 11 successfully in removeExecutor
14/11/12 11:09:46 INFO scheduler.Stage: Stage 3 is now unavailable on executor 11 (11893/12836, false)
14/11/12 11:09:46 INFO scheduler.DAGScheduler: Resubmitting failed stages
14/11/12 11:09:46 INFO scheduler.DAGScheduler: Submitting Stage 3 (MappedRDD[13] at map at <console>:42), which has no missing parents
14/11/12 11:09:46 INFO storage.MemoryStore: ensureFreeSpace(25472) called with curMem=474762, maxMem=11113699737
14/11/12 11:09:46 INFO storage.MemoryStore: Block broadcast_6 stored as values in memory (estimated size 24.9 KB, free 10.3 GB)
14/11/12 11:09:46 INFO storage.MemoryStore: ensureFreeSpace(5160) called with curMem=500234, maxMem=11113699737
14/11/12 11:09:46 INFO storage.MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 5.0 KB, free 10.3 GB)
14/11/12 11:09:46 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in memory on ip-xx.ec2.internal:35571 (size: 5.0 KB, free: 10.4 GB)
14/11/12 11:09:46 INFO storage.BlockManagerMaster: Updated info of block broadcast_6_piece0
14/11/12 11:09:46 INFO scheduler.DAGScheduler: Submitting 943 missing tasks from Stage 3 (MappedRDD[13] at map at <console>:42)
14/11/12 11:09:46 INFO cluster.YarnClientClusterScheduler: Adding task set 3.1 with 943 tasks
My code looks like this:
(rdd1 ++ rdd2).map { t => ((t.id), t) }.groupByKey(1280).map {
case ((id), sequence) =>
val newrecord = sequence.maxBy {
case Fact(id, key, type, day, group, c_key, s_key, plan_id,size,
is_mom, customer_shipment_id, customer_shipment_item_id, asin, company_key, product_line_key, dw_last_updated, measures) => dw_last_updated.toLong
}
((PARTITION_KEY + "=" + newrecord.day.toString + "/part"), (newrecord))
}.coalesce(2048,true).saveAsTextFile("s3://myfolder/PT/test20nodes/")
I chose 1280 because I have 20 nodes with 32 cores each, so 2*32*20.
For a shuffle stage, Spark creates ShuffleMapTasks that write their intermediate results to disk. The location information is stored in MapStatuses and sent to the MapOutputTrackerMaster (on the driver).
When the next stage starts to run, it needs these location statuses, so the executors ask the MapOutputTrackerMaster for them. The MapOutputTrackerMaster serializes the statuses to bytes and sends them to the executors. The number in the log message is the size of these serialized statuses in bytes.
The statuses are sent via Akka, which limits the maximum message size. You can set the limit via spark.akka.frameSize:
Maximum message size to allow in "control plane" communication (for serialized tasks and task results), in MB. Increase this if your tasks need to send back large results to the driver (e.g. using collect() on a large dataset).
If the size is greater than spark.akka.frameSize, Akka will refuse to deliver the message and your job will fail. The log message can therefore help you choose a suitable value for spark.akka.frameSize.
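To make the comparison concrete, here is a minimal sketch that converts the byte count from the log line into MB and checks it against a frame size. FrameSizeCheck, toMb and fitsInFrame are hypothetical names, the 10 MB value is only an assumed setting (check your own configuration), and the check is simplified: it ignores any overhead Spark may reserve within the frame.

```scala
object FrameSizeCheck {
  private val BytesPerMb = 1024L * 1024L

  // Converts a byte count from the log line into megabytes.
  def toMb(bytes: Long): Double = bytes.toDouble / BytesPerMb

  // True if a message of `bytes` fits in a frame of `frameSizeMb` megabytes.
  // Simplification: ignores any overhead Spark reserves within the frame.
  def fitsInFrame(bytes: Long, frameSizeMb: Int): Boolean =
    bytes <= frameSizeMb * BytesPerMb

  def main(args: Array[String]): Unit = {
    val statusBytes = 2202921L // value from the log line in the question
    val frameSizeMb = 10       // assumed setting; check your own spark.akka.frameSize
    println(f"Map output statuses: ${toMb(statusBytes)}%.2f MB")
    println(s"Fits in a $frameSizeMb MB frame: ${fitsInFrame(statusBytes, frameSizeMb)}")
  }
}
```

The setting itself is given in MB, so a larger limit would be configured with something like new SparkConf().set("spark.akka.frameSize", "64"), where 64 is purely illustrative.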