I am receiving an UnknownHostException when running a custom jar with Spark on Mesos. The issue does not happen when running spark-shell.
My spark-env.sh contains the following:
export MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.so
export HADOOP_CONF_DIR=/hadoop-2.7.1/etc/hadoop/
My spark-defaults.conf contains the following:
spark.master mesos://zk://172.31.0.81:2181,172.31.16.81:2181,172.31.32.81:2181/mesos
spark.mesos.executor.home /spark-1.5.0-bin-hadoop2.6/
These settings are on all masters and slaves.
Starting spark-shell as follows and running the following line works correctly:
/spark-1.5.0-bin-hadoop2.6/bin/spark-shell
sc.textFile("/tmp/Input").collect.foreach(println)
Log for spark-shell:
15/09/28 20:04:49 INFO storage.MemoryStore: ensureFreeSpace(88528) called with curMem=0, maxMem=556038881
15/09/28 20:04:49 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 86.5 KB, free 530.2 MB)
15/09/28 20:04:49 INFO storage.MemoryStore: ensureFreeSpace(20236) called with curMem=88528, maxMem=556038881
15/09/28 20:04:49 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 19.8 KB, free 530.2 MB)
15/09/28 20:04:49 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 172.31.21.104:49048 (size: 19.8 KB, free: 530.3 MB)
15/09/28 20:04:49 INFO spark.SparkContext: Created broadcast 0 from textFile at <console>:22
15/09/28 20:04:49 INFO mapred.FileInputFormat: Total input paths to process : 1
15/09/28 20:04:49 INFO spark.SparkContext: Starting job: collect at <console>:22
15/09/28 20:04:49 INFO scheduler.DAGScheduler: Got job 0 (collect at <console>:22) with 3 output partitions
15/09/28 20:04:49 INFO scheduler.DAGScheduler: Final stage: ResultStage 0(collect at <console>:22)
15/09/28 20:04:49 INFO scheduler.DAGScheduler: Parents of final stage: List()
15/09/28 20:04:49 INFO scheduler.DAGScheduler: Missing parents: List()
15/09/28 20:04:49 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at textFile at <console>:22), which has no missing parents
15/09/28 20:04:49 INFO storage.MemoryStore: ensureFreeSpace(3120) called with curMem=108764, maxMem=556038881
15/09/28 20:04:49 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.0 KB, free 530.2 MB)
15/09/28 20:04:49 INFO storage.MemoryStore: ensureFreeSpace(1784) called with curMem=111884, maxMem=556038881
15/09/28 20:04:49 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1784.0 B, free 530.2 MB)
15/09/28 20:04:49 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 172.31.21.104:49048 (size: 1784.0 B, free: 530.3 MB)
15/09/28 20:04:49 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:861
15/09/28 20:04:49 INFO scheduler.DAGScheduler: Submitting 3 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at textFile at <console>:22)
15/09/28 20:04:49 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 3 tasks
15/09/28 20:04:49 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, ip-172-31-37-82.us-west-2.compute.internal, NODE_LOCAL, 2142 bytes)
15/09/28 20:04:49 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, ip-172-31-21-104.us-west-2.compute.internal, NODE_LOCAL, 2142 bytes)
15/09/28 20:04:49 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, ip-172-31-4-4.us-west-2.compute.internal, NODE_LOCAL, 2142 bytes)
15/09/28 20:04:52 INFO storage.BlockManagerMasterEndpoint: Registering block manager ip-172-31-4-4.us-west-2.compute.internal:50648 with 530.3 MB RAM, BlockManagerId(20150928-190245-1358962604-5050-11297-S2, ip-172-31-4-4.us-west-2.compute.internal, 50648)
15/09/28 20:04:52 INFO storage.BlockManagerMasterEndpoint: Registering block manager ip-172-31-37-82.us-west-2.compute.internal:52624 with 530.3 MB RAM, BlockManagerId(20150928-190245-1358962604-5050-11297-S1, ip-172-31-37-82.us-west-2.compute.internal, 52624)
15/09/28 20:04:52 INFO storage.BlockManagerMasterEndpoint: Registering block manager ip-172-31-21-104.us-west-2.compute.internal:56628 with 530.3 MB RAM, BlockManagerId(20150928-190245-1358962604-5050-11297-S0, ip-172-31-21-104.us-west-2.compute.internal, 56628)
15/09/28 20:04:52 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on ip-172-31-37-82.us-west-2.compute.internal:52624 (size: 1784.0 B, free: 530.3 MB)
15/09/28 20:04:52 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on ip-172-31-21-104.us-west-2.compute.internal:56628 (size: 1784.0 B, free: 530.3 MB)
15/09/28 20:04:52 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on ip-172-31-4-4.us-west-2.compute.internal:50648 (size: 1784.0 B, free: 530.3 MB)
15/09/28 20:04:52 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on ip-172-31-37-82.us-west-2.compute.internal:52624 (size: 19.8 KB, free: 530.3 MB)
15/09/28 20:04:52 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on ip-172-31-21-104.us-west-2.compute.internal:56628 (size: 19.8 KB, free: 530.3 MB)
15/09/28 20:04:52 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on ip-172-31-4-4.us-west-2.compute.internal:50648 (size: 19.8 KB, free: 530.3 MB)
15/09/28 20:04:53 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 3907 ms on ip-172-31-37-82.us-west-2.compute.internal (1/3)
15/09/28 20:04:53 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 3884 ms on ip-172-31-4-4.us-west-2.compute.internal (2/3)
15/09/28 20:04:53 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 3907 ms on ip-172-31-21-104.us-west-2.compute.internal (3/3)
15/09/28 20:04:53 INFO scheduler.DAGScheduler: ResultStage 0 (collect at <console>:22) finished in 3.940 s
15/09/28 20:04:53 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/09/28 20:04:53 INFO scheduler.DAGScheduler: Job 0 finished: collect at <console>:22, took 4.019454 s
pepsi
cocacola
The following sample code, compiled into a jar, fails.
Sample code:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    sc.textFile("/tmp/Input").collect.foreach(println)
  }
}
Run via:
/spark-1.5.0-bin-hadoop2.6/bin/spark-submit --class "SimpleApp" /home/hdfs/test_2.10-0.1.jar
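As an aside, a minimal hedged debugging sketch (not part of the original app, the object name ConfCheck is made up): printing what the driver resolved for fs.defaultFS shows immediately whether core-site.xml was picked up from HADOOP_CONF_DIR.
import org.apache.spark.{SparkConf, SparkContext}

object ConfCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ConfCheck"))
    // Expect "hdfs://affinio" if core-site.xml was loaded; Hadoop's built-in
    // default "file:///" would mean HADOOP_CONF_DIR was not honored.
    println(sc.hadoopConfiguration.get("fs.defaultFS"))
    sc.stop()
  }
}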
Log for spark-submit:
java.lang.IllegalArgumentException: java.net.UnknownHostException: affinio
at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:374)
at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:312)
at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:178)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:665)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:601)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:148)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:169)
at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:656)
at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:436)
at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:409)
at org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:1007)
at org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:1007)
at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
at scala.Option.map(Option.scala:145)
at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:220)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.UnknownHostException: affinio
... 35 more
hdfs-site.xml
<configuration>
<property>
<name>dfs.nameservices</name>
<value>affinio</value>
</property>
<property>
<name>dfs.ha.namenodes.affinio</name>
<value>nn1,nn2</value>
</property>
<property>
<name>dfs.namenode.rpc-address.affinio.nn1</name>
<value>172.31.16.81:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.affinio.nn2</name>
<value>172.31.32.81:8020</value>
</property>
<property>
<name>dfs.namenode.http-address.affinio.nn1</name>
<value>172.31.16.81:50070</value>
</property>
<property>
<name>dfs.namenode.http-address.affinio.nn2</name>
<value>172.31.32.81:50070</value>
</property>
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>file:///nfs/dfs/ha-name-dir-shared</value>
</property>
<property>
<name>dfs.client.failover.proxy.provider.affinio</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/home/hdfs/.ssh/id_rsa</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///data/namenode</value>
</property>
<property>
<name>dfs.blocksize</name>
<value>268435456</value>
</property>
<property>
<name>dfs.namenode.handler.count</name>
<value>100</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///data/hdfs</value>
</property>
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
<property>
<name>ha.zookeeper.quorum</name>
<value>172.31.16.81:2181,172.31.32.81:2181,172.31.0.81:2181</value>
</property>
</configuration>
core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://affinio</value>
</property>
<property>
<name>io.file.buffer.size</name>
<value>131072</value>
</property>
</configuration>
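As a hedged alternative workaround (a sketch, not something tried in this post): the same HA properties can be set programmatically on the driver's Hadoop configuration, which should reach the executors with the broadcast JobConf that textFile builds from sc.hadoopConfiguration, sidestepping a missing hdfs-site.xml on the executor classpath.
import org.apache.spark.{SparkConf, SparkContext}

object SimpleAppWithHaConf {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("Simple Application"))
    // Mirror the relevant hdfs-site.xml / core-site.xml entries so executors
    // can resolve the logical nameservice "affinio".
    val hc = sc.hadoopConfiguration
    hc.set("fs.defaultFS", "hdfs://affinio")
    hc.set("dfs.nameservices", "affinio")
    hc.set("dfs.ha.namenodes.affinio", "nn1,nn2")
    hc.set("dfs.namenode.rpc-address.affinio.nn1", "172.31.16.81:8020")
    hc.set("dfs.namenode.rpc-address.affinio.nn2", "172.31.32.81:8020")
    hc.set("dfs.client.failover.proxy.provider.affinio",
      "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")
    sc.textFile("/tmp/Input").collect.foreach(println)
  }
}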
spark-shell conf.toDebugString
spark.app.id=20150929-173220-1361059756-5050-16026-0005
spark.app.name=Spark shell
spark.driver.host=172.31.25.67
spark.driver.port=37613
spark.executor.id=driver
spark.externalBlockStore.folderName=spark-d4bf255f-f1f3-4026-83bf-b377a24f5f2c
spark.fileserver.uri=http://172.31.25.67:54526
spark.jars=
spark.master=mesos://zk://172.31.0.81:2181,172.31.16.81:2181,172.31.32.81:2181/mesos
spark.mesos.executor.home=/spark-1.5.0-bin-hadoop2.6/
spark.repl.class.uri=http://172.31.25.67:45553
spark.submit.deployMode=client
spark-submit conf.toDebugString
spark.app.id=20150929-173220-1361059756-5050-16026-0004
spark.app.name=Simple Application
spark.driver.host=172.31.25.67
spark.driver.port=47968
spark.executor.id=driver
spark.externalBlockStore.folderName=spark-846de0d9-8bb1-414b-8b81-f2d6646a58d3
spark.fileserver.uri=http://172.31.25.67:45283
spark.jars=file:/home/hdfs/./test_2.10-0.1.jar
spark.master=mesos://zk://172.31.0.81:2181,172.31.16.81:2181,172.31.32.81:2181/mesos
spark.mesos.executor.home=/spark-1.5.0-bin-hadoop2.6/
spark.submit.deployMode=client
I am able to make it work if I run it as follows:
spark-submit --files /hadoop-2.7.1/etc/hadoop/hdfs-site.xml,/hadoop-2.7.1/etc/hadoop/core-site.xml ./test_2.10-0.1.jar
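The same workaround expressed in code, as a hedged sketch: spark.files is the property behind --files, as the --verbose output below confirms.
import org.apache.spark.{SparkConf, SparkContext}

object SimpleAppWithFiles {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("Simple Application")
      // Equivalent of spark-submit --files: ship the Hadoop config to executors.
      .set("spark.files",
        "file:///hadoop-2.7.1/etc/hadoop/hdfs-site.xml," +
        "file:///hadoop-2.7.1/etc/hadoop/core-site.xml")
    val sc = new SparkContext(conf)
    sc.textFile("/tmp/Input").collect.foreach(println)
  }
}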
So, the configurations are not being loaded by default, even though I set HADOOP_CONF_DIR on all machines to /hadoop-2.7.1/etc/hadoop/ in the /spark-1.5.0-bin-hadoop2.6/conf/spark-env.sh as well as in the user profile settings:
cat /etc/profile.d/hadoop.sh
# Set path for hadoop
export HADOOP_CONF_DIR=/hadoop-2.7.1/etc/hadoop/
export PATH=$PATH:/hadoop-2.7.1/bin
Output of the --verbose switch
System properties:
spark.local.dir -> /data/spark/
SPARK_SUBMIT -> true
spark.files -> file:///hadoop-2.7.1/etc/hadoop/hdfs-site.xml,file:///hadoop-2.7.1/etc/hadoop/core-site.xml
spark.app.name -> SimpleApp
spark.jars -> file:/home/hdfs/./test_2.10-0.1.jar
spark.submit.deployMode -> client
spark.mesos.executor.home -> /spark-1.5.0-bin-hadoop2.6
spark.master -> mesos://zk://172.31.0.81:2181,172.31.16.81:2181,172.31.32.81:2181/mesos
Classpath elements:
file:/home/hdfs/./test_2.10-0.1.jar
I also made the app print the environment variables from the executors (note: this needs import scala.collection.JavaConversions._ so the Java map returned by System.getenv can be flattened):
import scala.collection.JavaConversions._
sc.parallelize(Array(1)).flatMap(v => System.getenv).collect.foreach(v => println(s"${v._1}=${v._2}"))
Output:
LIBPROCESS_PORT=0
MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.so
SPARK_EXECUTOR_MEMORY=1024m
SHLVL=1
MESOS_EXECUTOR_ID=20150930-115952-1361059756-5050-15990-S1
XFILESEARCHPATH=/usr/dt/app-defaults/%L/Dt
MESOS_DIRECTORY=/data/slaves/20150930-115952-1361059756-5050-15990-S1/frameworks/20150930-115952-1361059756-5050-15990-0008/executors/20150930-115952-1361059756-5050-15990-S1/runs/2baa786a-be89-4823-a248-bb35034bb2fa
MESOS_SLAVE_PID=slave(1)#172.31.32.118:5051
_SPARK_ASSEMBLY=/spark-1.5.0-bin-hadoop2.6/lib/spark-assembly-1.5.0-hadoop2.6.0.jar
SPARK_HOME=/spark-1.5.0-bin-hadoop2.6
MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos-0.24.0.so
SPARK_SCALA_VERSION=2.10
SPARK_USER=hdfs
PWD=/data/slaves/20150930-115952-1361059756-5050-15990-S1/frameworks/20150930-115952-1361059756-5050-15990-0008/executors/20150930-115952-1361059756-5050-15990-S1/runs/2baa786a-be89-4823-a248-bb35034bb2fa
SPARK_ENV_LOADED=1
MESOS_FRAMEWORK_ID=20150930-115952-1361059756-5050-15990-0008
MESOS_SLAVE_ID=20150930-115952-1361059756-5050-15990-S1
MESOS_CHECKPOINT=0
HADOOP_CONF_DIR=/hadoop-2.7.1/etc/hadoop/
SPARK_EXECUTOR_OPTS=
NLSPATH=/usr/dt/lib/nls/msg/%L/%N.cat
So we can see the executors have HADOOP_CONF_DIR in their environment, but the configuration is still not picked up unless the files are shipped explicitly via spark.files.
UPDATE:
After downgrading to spark-1.3.1, the issue goes away. Something in spark-1.5 breaks the classpath: note that the 1.3.1 executor CLASSPATH below includes /hadoop-2.7.1/etc/hadoop, which the 1.5.0 executor environment lacked.
Spark-1.3.1 outputs:
System properties:
SPARK_SUBMIT -> true
spark.app.name -> SimpleApp
spark.jars -> file:/home/hdfs/./test_2.10-0.1.jar
spark.mesos.executor.home -> /spark-1.3.1-bin-hadoop2.6
spark.master -> mesos://zk://172.31.0.81:2181,172.31.16.81:2181,172.31.32.81:2181/mesos
Classpath elements:
file:/home/hdfs/./test_2.10-0.1.jar
Executor Environment:
LIBPROCESS_PORT=0
MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.so
SPARK_EXECUTOR_MEMORY=512m
SHLVL=1
MESOS_EXECUTOR_ID=20150930-115952-1361059756-5050-15990-S2
CLASSPATH=/spark-1.3.1-bin-hadoop2.6/conf:/spark-1.3.1-bin-hadoop2.6/lib/spark-assembly-1.3.1-hadoop2.6.0.jar:/spark-1.3.1-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/spark-1.3.1-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/spark-1.3.1-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/hadoop-2.7.1/etc/hadoop
XFILESEARCHPATH=/usr/dt/app-defaults/%L/Dt
MESOS_DIRECTORY=/data/slaves/20150930-115952-1361059756-5050-15990-S2/frameworks/20150930-115952-1361059756-5050-15990-0013/executors/20150930-115952-1361059756-5050-15990-S2/runs/23c38710-14d7-4550-b3f7-2879576ce1d2
MESOS_SLAVE_PID=slave(1)#172.31.18.189:5051
PYTHONPATH=/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip:/spark-1.3.1-bin-hadoop2.6/python:
SPARK_HOME=/spark-1.3.1-bin-hadoop2.6
SPARK_CONF_DIR=/spark-1.3.1-bin-hadoop2.6/conf
MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos-0.24.0.so
SPARK_SCALA_VERSION=2.10
SPARK_USER=hdfs
PWD=/data/slaves/20150930-115952-1361059756-5050-15990-S2/frameworks/20150930-115952-1361059756-5050-15990-0013/executors/20150930-115952-1361059756-5050-15990-S2/runs/23c38710-14d7-4550-b3f7-2879576ce1d2
SPARK_ENV_LOADED=1
MESOS_FRAMEWORK_ID=20150930-115952-1361059756-5050-15990-0013
MESOS_SLAVE_ID=20150930-115952-1361059756-5050-15990-S2
MESOS_CHECKPOINT=0
HADOOP_CONF_DIR=/hadoop-2.7.1/etc/hadoop
SPARK_EXECUTOR_OPTS=
NLSPATH=/usr/dt/lib/nls/msg/%L/%N.cat
Related
I am trying to submit the job using the command:
spark-submit --class it.polimi.dice.spark.WordCount --master yarn-master --conf spark.cassandra.connection.host=10.0.0.5 --num-executors 1 --deploy-mode client --driver-memory 512m --executor-memory 512m /home/useruser/temp/spark-cassandra-example/target/scala-2.10/spark-cassandra-exmaple-assembly-1.0.jar
But I am getting an error, although the same thing works from spark-shell, which suggests that the Spark-Cassandra connector version (1.6.0-M1; Spark 1.6.0, Scala 2.10.5, Cassandra 3.3.0) and the other configuration are fine. Here is the result I get from the spark-submit command.
16/11/21 09:39:03 INFO Client: Application report for application_1479668866076_0014 (state: ACCEPTED)
16/11/21 09:39:04 INFO Client: Application report for application_1479668866076_0014 (state: ACCEPTED)
16/11/21 09:39:05 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as NettyRpcEndpointRef(null)
16/11/21 09:39:05 INFO YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> sandbox.hortonworks.com, PROXY_URI_BASES -> http://sandbox.hortonworks.com:8088/proxy/application_1479668866076_0014), /proxy/application_1479668866076_0014
16/11/21 09:39:05 INFO JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
16/11/21 09:39:05 INFO Client: Application report for application_1479668866076_0014 (state: RUNNING)
16/11/23 10:10:38 INFO NettyUtil: Found Netty's native epoll transport in the classpath, using it
16/11/23 10:10:38 INFO Cluster: New Cassandra host /10.0.0.4:9042 added
16/11/23 10:10:38 INFO Cluster: New Cassandra host /10.0.0.5:9042 added
16/11/23 10:10:38 INFO CassandraConnector: Connected to Cassandra cluster: Test Cluster
16/11/23 10:10:39 INFO SparkContext: Starting job: fold at WordCount.scala:17
16/11/23 10:10:39 INFO DAGScheduler: Got job 0 (fold at WordCount.scala:17) with 2 output partitions
16/11/23 10:10:39 INFO DAGScheduler: Final stage: ResultStage 0 (fold at WordCount.scala:17)
16/11/23 10:10:39 INFO DAGScheduler: Parents of final stage: List()
16/11/23 10:10:39 INFO DAGScheduler: Missing parents: List()
16/11/23 10:10:39 INFO DAGScheduler: Submitting ResultStage 0 (CassandraTableScanRDD[1] at RDD at CassandraRDD.scala:15), which has no missing parents
16/11/23 10:10:39 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 7.1 KB, free 7.1 KB)
16/11/23 10:10:39 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 3.7 KB, free 10.9 KB)
16/11/23 10:10:39 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.0.0.6:58236 (size: 3.7 KB, free: 143.6 MB)
16/11/23 10:10:39 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1006
16/11/23 10:10:39 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (CassandraTableScanRDD[1] at RDD at CassandraRDD.scala:15)
16/11/23 10:10:39 INFO YarnScheduler: Adding task set 0.0 with 2 tasks
16/11/23 10:10:39 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, sandbox.hortonworks.com, partition 0,RACK_LOCAL, 29218 bytes)
16/11/23 10:10:40 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on sandbox.hortonworks.com:51013 (size: 3.7 KB, free: 143.6 MB)
16/11/23 10:10:43 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, sandbox.hortonworks.com, partition 1,RACK_LOCAL, 29156 bytes)
16/11/23 10:10:43 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, sandbox.hortonworks.com): java.io.IOException: Failed to open native connection to Cassandra at {10.0.0.4, 10.0.0.5}:9042
16/11/23 10:10:45 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID 2, sandbox.hortonworks.com, partition 0,RACK_LOCAL, 29218 bytes)
16/11/23 10:10:45 INFO TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1) on executor sandbox.hortonworks.com: java.io.IOException (Failed to open native connection to Cassandra at {10.0.0.4, 10.0.0.5}:9042) [duplicate 1]
Can anyone kindly help me figure out how I can fix this issue?
This question already has answers here: Failed to locate the winutils binary in the hadoop binary path.
I am running a Spark program on a Windows 10 machine.
I am trying to run the Spark program below:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql._
import org.apache.spark.sql.SQLImplicits
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.TypedColumn
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.Encoders
import com.databricks.spark.csv
object json1 {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local[2]").set("spark.executor.memory", "1g")
    val sc = new org.apache.spark.SparkContext(conf)
    val sqlc = new org.apache.spark.sql.SQLContext(sc)
    val NyseDF = sqlc.load("com.databricks.spark.csv", Map("path" -> args(0), "header" -> "true"))
    NyseDF.registerTempTable("NYSE")
    NyseDF.printSchema()
  }
}
When I run the program in Eclipse via Run Application mode, passing the argument src/test/resources/demo.text, it fails with the error below.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/10/10 11:02:18 INFO SparkContext: Running Spark version 1.6.0
16/10/10 11:02:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/10/10 11:02:18 INFO SecurityManager: Changing view acls to: subho
16/10/10 11:02:18 INFO SecurityManager: Changing modify acls to: subho
16/10/10 11:02:18 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(subho); users with modify permissions: Set(subho)
16/10/10 11:02:19 INFO Utils: Successfully started service 'sparkDriver' on port 61108.
16/10/10 11:02:20 INFO Slf4jLogger: Slf4jLogger started
16/10/10 11:02:20 INFO Remoting: Starting remoting
16/10/10 11:02:20 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem#192.168.1.116:61121]
16/10/10 11:02:20 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 61121.
16/10/10 11:02:20 INFO SparkEnv: Registering MapOutputTracker
16/10/10 11:02:20 INFO SparkEnv: Registering BlockManagerMaster
16/10/10 11:02:21 INFO DiskBlockManager: Created local directory at C:\Users\subho\AppData\Local\Temp\blockmgr-69afda02-ccd1-41d1-aa25-830ba366a75c
16/10/10 11:02:21 INFO MemoryStore: MemoryStore started with capacity 1128.4 MB
16/10/10 11:02:21 INFO SparkEnv: Registering OutputCommitCoordinator
16/10/10 11:02:21 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/10/10 11:02:21 INFO SparkUI: Started SparkUI at http://192.168.1.116:4040
16/10/10 11:02:21 INFO Executor: Starting executor ID driver on host localhost
16/10/10 11:02:21 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 61132.
16/10/10 11:02:21 INFO NettyBlockTransferService: Server created on 61132
16/10/10 11:02:21 INFO BlockManagerMaster: Trying to register BlockManager
16/10/10 11:02:21 INFO BlockManagerMasterEndpoint: Registering block manager localhost:61132 with 1128.4 MB RAM, BlockManagerId(driver, localhost, 61132)
16/10/10 11:02:21 INFO BlockManagerMaster: Registered BlockManager
16/10/10 11:02:23 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 107.7 KB, free 107.7 KB)
16/10/10 11:02:23 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 9.8 KB, free 117.5 KB)
16/10/10 11:02:23 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:61132 (size: 9.8 KB, free: 1128.4 MB)
16/10/10 11:02:23 INFO SparkContext: Created broadcast 0 from textFile at TextFile.scala:30
16/10/10 11:02:23 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:278)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:300)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:293)
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:362)
at org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$33.apply(SparkContext.scala:1015)
at org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$33.apply(SparkContext.scala:1015)
at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
at scala.Option.map(Option.scala:146)
at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:195)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1293)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.take(RDD.scala:1288)
at com.databricks.spark.csv.CsvRelation.firstLine$lzycompute(CsvRelation.scala:174)
at com.databricks.spark.csv.CsvRelation.firstLine(CsvRelation.scala:169)
at com.databricks.spark.csv.CsvRelation.inferSchema(CsvRelation.scala:147)
at com.databricks.spark.csv.CsvRelation.<init>(CsvRelation.scala:70)
at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:138)
at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:40)
at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:28)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:158)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
at org.apache.spark.sql.SQLContext.load(SQLContext.scala:1153)
at json1$.main(json1.scala:22)
at json1.main(json1.scala)
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/C:/Users/subho/Desktop/code-master/simple-spark-project/src/test/resources/demo.text
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1293)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.take(RDD.scala:1288)
at com.databricks.spark.csv.CsvRelation.firstLine$lzycompute(CsvRelation.scala:174)
at com.databricks.spark.csv.CsvRelation.firstLine(CsvRelation.scala:169)
at com.databricks.spark.csv.CsvRelation.inferSchema(CsvRelation.scala:147)
at com.databricks.spark.csv.CsvRelation.<init>(CsvRelation.scala:70)
at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:138)
at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:40)
at com.databricks.spark.csv.DefaultSource.createRelation(DefaultSource.scala:28)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:158)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
at org.apache.spark.sql.SQLContext.load(SQLContext.scala:1153)
at json1$.main(json1.scala:22)
at json1.main(json1.scala)
16/10/10 11:02:23 INFO SparkContext: Invoking stop() from shutdown hook
16/10/10 11:02:23 INFO SparkUI: Stopped Spark web UI at http://192.168.1.116:4040
16/10/10 11:02:23 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/10/10 11:02:23 INFO MemoryStore: MemoryStore cleared
16/10/10 11:02:23 INFO BlockManager: BlockManager stopped
16/10/10 11:02:24 INFO BlockManagerMaster: BlockManagerMaster stopped
16/10/10 11:02:24 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/10/10 11:02:24 INFO SparkContext: Successfully stopped SparkContext
16/10/10 11:02:24 INFO ShutdownHookManager: Shutdown hook called
16/10/10 11:02:24 INFO ShutdownHookManager: Deleting directory C:\Users\subho\AppData\Local\Temp\spark-7f53ea20-a38c-46d5-8476-a1ae040736ac
Below is the main error message:
Input path does not exist: file:/C:/Users/subho/Desktop/code-master/simple-spark-project/src/test/resources/demo.text
I do have the file at that location (screenshot omitted).
When I ran the program below, it ran successfully:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql._
import org.apache.spark.sql.SQLImplicits
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.TypedColumn
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.Encoders
import com.databricks.spark.csv
object json1 {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local[2]").set("spark.executor.memory", "1g")
    val sc = new org.apache.spark.SparkContext(conf)
    val sqlc = new org.apache.spark.sql.SQLContext(sc)

    /* val NyseDF = sqlc.load("com.databricks.spark.csv", Map("path" -> args(0), "header" -> "true"))
    NyseDF.registerTempTable("NYSE")
    NyseDF.printSchema()
    print(sqlc.sql("select distinct(symbol) from NYSE").collect().toList) */

    val PersonDF = sqlc.jsonFile("src/test/resources/Person.json")
    // PersonDF.printSchema()
    PersonDF.registerTempTable("Person")
    sqlc.sql("select * from Person where age < 60").collect().foreach(print)
  }
}
Below is the log file.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/10/10 11:54:12 INFO SparkContext: Running Spark version 1.6.0
16/10/10 11:54:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/10/10 11:54:13 INFO SecurityManager: Changing view acls to: subho
16/10/10 11:54:13 INFO SecurityManager: Changing modify acls to: subho
16/10/10 11:54:13 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(subho); users with modify permissions: Set(subho)
16/10/10 11:54:14 INFO Utils: Successfully started service 'sparkDriver' on port 51113.
16/10/10 11:54:14 INFO Slf4jLogger: Slf4jLogger started
16/10/10 11:54:14 INFO Remoting: Starting remoting
16/10/10 11:54:15 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem#192.168.1.116:51126]
16/10/10 11:54:15 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 51126.
16/10/10 11:54:15 INFO SparkEnv: Registering MapOutputTracker
16/10/10 11:54:15 INFO SparkEnv: Registering BlockManagerMaster
16/10/10 11:54:15 INFO DiskBlockManager: Created local directory at C:\Users\subho\AppData\Local\Temp\blockmgr-a52a5d5a-075b-4859-8434-935fdaba8538
16/10/10 11:54:15 INFO MemoryStore: MemoryStore started with capacity 1128.4 MB
16/10/10 11:54:15 INFO SparkEnv: Registering OutputCommitCoordinator
16/10/10 11:54:15 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/10/10 11:54:15 INFO SparkUI: Started SparkUI at http://192.168.1.116:4040
16/10/10 11:54:15 INFO Executor: Starting executor ID driver on host localhost
16/10/10 11:54:15 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 51137.
16/10/10 11:54:15 INFO NettyBlockTransferService: Server created on 51137
16/10/10 11:54:15 INFO BlockManagerMaster: Trying to register BlockManager
16/10/10 11:54:15 INFO BlockManagerMasterEndpoint: Registering block manager localhost:51137 with 1128.4 MB RAM, BlockManagerId(driver, localhost, 51137)
16/10/10 11:54:15 INFO BlockManagerMaster: Registered BlockManager
16/10/10 11:54:17 INFO JSONRelation: Listing file:/C:/Users/subho/Desktop/code-master/simple-spark-project/src/test/resources/Person.json on driver
16/10/10 11:54:17 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:278)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:300)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:293)
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:447)
at org.apache.spark.sql.execution.datasources.json.JSONRelation.org$apache$spark$sql$execution$datasources$json$JSONRelation$$createBaseRdd(JSONRelation.scala:98)
at org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$4$$anonfun$apply$1.apply(JSONRelation.scala:115)
at org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$4$$anonfun$apply$1.apply(JSONRelation.scala:115)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$4.apply(JSONRelation.scala:115)
at org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$4.apply(JSONRelation.scala:109)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema$lzycompute(JSONRelation.scala:109)
at org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema(JSONRelation.scala:108)
at org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:636)
at org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:635)
at org.apache.spark.sql.execution.datasources.LogicalRelation.<init>(LogicalRelation.scala:37)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:244)
at org.apache.spark.sql.SQLContext.jsonFile(SQLContext.scala:1011)
at json1$.main(json1.scala:28)
at json1.main(json1.scala)
16/10/10 11:54:18 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 128.0 KB, free 128.0 KB)
16/10/10 11:54:18 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 14.1 KB, free 142.1 KB)
16/10/10 11:54:18 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:51137 (size: 14.1 KB, free: 1128.4 MB)
16/10/10 11:54:18 INFO SparkContext: Created broadcast 0 from jsonFile at json1.scala:28
16/10/10 11:54:18 INFO FileInputFormat: Total input paths to process : 1
16/10/10 11:54:18 INFO SparkContext: Starting job: jsonFile at json1.scala:28
16/10/10 11:54:18 INFO DAGScheduler: Got job 0 (jsonFile at json1.scala:28) with 2 output partitions
16/10/10 11:54:18 INFO DAGScheduler: Final stage: ResultStage 0 (jsonFile at json1.scala:28)
16/10/10 11:54:18 INFO DAGScheduler: Parents of final stage: List()
16/10/10 11:54:18 INFO DAGScheduler: Missing parents: List()
16/10/10 11:54:18 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[3] at jsonFile at json1.scala:28), which has no missing parents
16/10/10 11:54:18 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.2 KB, free 146.3 KB)
16/10/10 11:54:18 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.4 KB, free 148.6 KB)
16/10/10 11:54:18 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:51137 (size: 2.4 KB, free: 1128.4 MB)
16/10/10 11:54:18 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
16/10/10 11:54:18 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[3] at jsonFile at json1.scala:28)
16/10/10 11:54:18 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
16/10/10 11:54:18 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2113 bytes)
16/10/10 11:54:18 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, partition 1,PROCESS_LOCAL, 2113 bytes)
16/10/10 11:54:18 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
16/10/10 11:54:18 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
16/10/10 11:54:18 INFO HadoopRDD: Input split: file:/C:/Users/subho/Desktop/code-master/simple-spark-project/src/test/resources/Person.json:0+92
16/10/10 11:54:18 INFO HadoopRDD: Input split: file:/C:/Users/subho/Desktop/code-master/simple-spark-project/src/test/resources/Person.json:92+93
16/10/10 11:54:18 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
16/10/10 11:54:18 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
16/10/10 11:54:18 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
16/10/10 11:54:18 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
16/10/10 11:54:18 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
16/10/10 11:54:18 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
16/10/10 11:54:19 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2886 bytes result sent to driver
16/10/10 11:54:19 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 2886 bytes result sent to driver
16/10/10 11:54:19 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1287 ms on localhost (1/2)
16/10/10 11:54:19 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 1264 ms on localhost (2/2)
16/10/10 11:54:19 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/10/10 11:54:19 INFO DAGScheduler: ResultStage 0 (jsonFile at json1.scala:28) finished in 1.314 s
16/10/10 11:54:19 INFO DAGScheduler: Job 0 finished: jsonFile at json1.scala:28, took 1.413653 s
16/10/10 11:54:20 INFO BlockManagerInfo: Removed broadcast_1_piece0 on localhost:51137 in memory (size: 2.4 KB, free: 1128.4 MB)
16/10/10 11:54:20 INFO ContextCleaner: Cleaned accumulator 1
16/10/10 11:54:20 INFO BlockManagerInfo: Removed broadcast_0_piece0 on localhost:51137 in memory (size: 14.1 KB, free: 1128.4 MB)
16/10/10 11:54:21 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 59.6 KB, free 59.6 KB)
16/10/10 11:54:21 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 13.8 KB, free 73.3 KB)
16/10/10 11:54:21 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:51137 (size: 13.8 KB, free: 1128.4 MB)
16/10/10 11:54:21 INFO SparkContext: Created broadcast 2 from collect at json1.scala:34
16/10/10 11:54:21 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 128.0 KB, free 201.3 KB)
16/10/10 11:54:21 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 14.1 KB, free 215.4 KB)
16/10/10 11:54:21 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:51137 (size: 14.1 KB, free: 1128.3 MB)
16/10/10 11:54:21 INFO SparkContext: Created broadcast 3 from collect at json1.scala:34
16/10/10 11:54:21 INFO FileInputFormat: Total input paths to process : 1
16/10/10 11:54:21 INFO SparkContext: Starting job: collect at json1.scala:34
16/10/10 11:54:21 INFO DAGScheduler: Got job 1 (collect at json1.scala:34) with 2 output partitions
16/10/10 11:54:21 INFO DAGScheduler: Final stage: ResultStage 1 (collect at json1.scala:34)
16/10/10 11:54:21 INFO DAGScheduler: Parents of final stage: List()
16/10/10 11:54:21 INFO DAGScheduler: Missing parents: List()
16/10/10 11:54:21 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[9] at collect at json1.scala:34), which has no missing parents
16/10/10 11:54:21 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 7.6 KB, free 223.0 KB)
16/10/10 11:54:21 INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 4.1 KB, free 227.1 KB)
16/10/10 11:54:21 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on localhost:51137 (size: 4.1 KB, free: 1128.3 MB)
16/10/10 11:54:21 INFO SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:1006
16/10/10 11:54:21 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 1 (MapPartitionsRDD[9] at collect at json1.scala:34)
16/10/10 11:54:21 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
16/10/10 11:54:21 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, localhost, partition 0,PROCESS_LOCAL, 2113 bytes)
16/10/10 11:54:21 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, localhost, partition 1,PROCESS_LOCAL, 2113 bytes)
16/10/10 11:54:21 INFO Executor: Running task 0.0 in stage 1.0 (TID 2)
16/10/10 11:54:21 INFO Executor: Running task 1.0 in stage 1.0 (TID 3)
16/10/10 11:54:21 INFO HadoopRDD: Input split: file:/C:/Users/subho/Desktop/code-master/simple-spark-project/src/test/resources/Person.json:92+93
16/10/10 11:54:21 INFO HadoopRDD: Input split: file:/C:/Users/subho/Desktop/code-master/simple-spark-project/src/test/resources/Person.json:0+92
16/10/10 11:54:22 INFO BlockManagerInfo: Removed broadcast_2_piece0 on localhost:51137 in memory (size: 13.8 KB, free: 1128.4 MB)
16/10/10 11:54:22 INFO GenerateUnsafeProjection: Code generated in 548.352258 ms
16/10/10 11:54:22 INFO GeneratePredicate: Code generated in 5.245214 ms
16/10/10 11:54:22 INFO Executor: Finished task 1.0 in stage 1.0 (TID 3). 2283 bytes result sent to driver
16/10/10 11:54:22 INFO Executor: Finished task 0.0 in stage 1.0 (TID 2). 2536 bytes result sent to driver
16/10/10 11:54:22 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 755 ms on localhost (1/2)
16/10/10 11:54:22 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 759 ms on localhost (2/2)
16/10/10 11:54:22 INFO DAGScheduler: ResultStage 1 (collect at json1.scala:34) finished in 0.760 s
16/10/10 11:54:22 INFO DAGScheduler: Job 1 finished: collect at json1.scala:34, took 0.779652 s
16/10/10 11:54:22 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
[53,Barack,Obama]16/10/10 11:54:22 INFO SparkContext: Invoking stop() from shutdown hook
16/10/10 11:54:22 INFO SparkUI: Stopped Spark web UI at http://192.168.1.116:4040
16/10/10 11:54:22 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/10/10 11:54:22 INFO MemoryStore: MemoryStore cleared
16/10/10 11:54:22 INFO BlockManager: BlockManager stopped
16/10/10 11:54:22 INFO BlockManagerMaster: BlockManagerMaster stopped
16/10/10 11:54:22 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/10/10 11:54:22 INFO SparkContext: Successfully stopped SparkContext
16/10/10 11:54:22 INFO ShutdownHookManager: Shutdown hook called
16/10/10 11:54:22 INFO ShutdownHookManager: Deleting directory C:\Users\subho\AppData\Local\Temp\spark-6cab6329-83f1-4af4-b64c-c869550405a4
16/10/10 11:54:22 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
Thanks and Regards,
The important section of the stacktrace is here:
ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
One possibility is to download winutils.exe (e.g. from here), put it in a folder called bin (as a subdirectory of your home directory, e.g. C:\Users\XXX\bin\winutils.exe) and then add this line at the beginning of your code:
System.setProperty("hadoop.home.dir", "C:\\Users\\XXX\\")
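(The original answer used a raw interpolator, but a raw string cannot end in a backslash because the scanner treats \" as an escaped quote, so escaped backslashes are used here instead.) In context, a hedged sketch of where that line would go in the question's code; C:\Users\XXX stands in for your home directory:
import org.apache.spark.{SparkConf, SparkContext}

object json1 {
  def main(args: Array[String]): Unit = {
    // Must run before the SparkContext is created, i.e. before Hadoop's Shell
    // class loads; assumes winutils.exe sits at C:\Users\XXX\bin\winutils.exe.
    System.setProperty("hadoop.home.dir", "C:\\Users\\XXX\\")
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local[2]")
    val sc = new SparkContext(conf)
    // ... rest of the job as in the question
    sc.stop()
  }
}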
I'm using Spark 1.6.1, Cassandra 2.2.3 and Cassandra-Spark connector 1.6.
I have already tried writing to a multi-node cluster, but only with replication_factor: 1.
Now I'm trying to write to a 6-node cluster (one seed node) with a keyspace that has replication_factor > 1, but Spark is not responding and refuses to do it.
As I mentioned, it works when I write to the coordinator with the keyspace's replication factor set to 1.
This is the log I'm getting. It always stops here, or after half an hour it starts cleaning accumulators and then stalls on the fourth task again.
16/08/16 17:07:03 INFO NettyUtil: Found Netty's native epoll transport in the classpath, using it
16/08/16 17:07:04 INFO Cluster: New Cassandra host /127.0.0.1:9042 added
16/08/16 17:07:04 INFO LocalNodeFirstLoadBalancingPolicy: Added host 127.0.0.1 (datacenter1)
16/08/16 17:07:04 INFO Cluster: New Cassandra host /127.0.0.2:9042 added
16/08/16 17:07:04 INFO LocalNodeFirstLoadBalancingPolicy: Added host 127.0.0.2 (datacenter1)
16/08/16 17:07:04 INFO Cluster: New Cassandra host /127.0.0.3:9042 added
16/08/16 17:07:04 INFO LocalNodeFirstLoadBalancingPolicy: Added host 127.0.0.3 (datacenter1)
16/08/16 17:07:04 INFO Cluster: New Cassandra host /127.0.0.4:9042 added
16/08/16 17:07:04 INFO LocalNodeFirstLoadBalancingPolicy: Added host 127.0.0.4 (datacenter1)
16/08/16 17:07:04 INFO Cluster: New Cassandra host /127.0.0.5:9042 added
16/08/16 17:07:04 INFO LocalNodeFirstLoadBalancingPolicy: Added host 127.0.0.5 (datacenter1)
16/08/16 17:07:04 INFO Cluster: New Cassandra host /127.0.0.6:9042 added
16/08/16 17:07:04 INFO CassandraConnector: Connected to Cassandra cluster: Test Cluster
16/08/16 17:07:05 INFO SparkContext: Starting job: take at CassandraRDD.scala:121
16/08/16 17:07:05 INFO DAGScheduler: Got job 3 (take at CassandraRDD.scala:121) with 1 output partitions
16/08/16 17:07:05 INFO DAGScheduler: Final stage: ResultStage 4 (take at CassandraRDD.scala:121)
16/08/16 17:07:05 INFO DAGScheduler: Parents of final stage: List()
16/08/16 17:07:05 INFO DAGScheduler: Missing parents: List()
16/08/16 17:07:05 INFO DAGScheduler: Submitting ResultStage 4 (CassandraTableScanRDD[17] at RDD at CassandraRDD.scala:18), which has no missing parents
16/08/16 17:07:05 INFO MemoryStore: Block broadcast_7 stored as values in memory (estimated size 8.3 KB, free 170.5 KB)
16/08/16 17:07:05 INFO MemoryStore: Block broadcast_7_piece0 stored as bytes in memory (estimated size 4.2 KB, free 174.7 KB)
16/08/16 17:07:05 INFO BlockManagerInfo: Added broadcast_7_piece0 in memory on localhost:43680 (size: 4.2 KB, free: 756.4 MB)
16/08/16 17:07:05 INFO SparkContext: Created broadcast 7 from broadcast at DAGScheduler.scala:1006
16/08/16 17:07:05 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 4 (CassandraTableScanRDD[17] at RDD at CassandraRDD.scala:18)
16/08/16 17:07:05 INFO TaskSchedulerImpl: Adding task set 4.0 with 1 tasks
16/08/16 17:07:05 INFO TaskSetManager: Starting task 0.0 in stage 4.0 (TID 204, localhost, partition 0,NODE_LOCAL, 22553 bytes)
16/08/16 17:07:05 INFO Executor: Running task 0.0 in stage 4.0 (TID 204)
16/08/16 17:07:06 INFO Executor: Finished task 0.0 in stage 4.0 (TID 204). 2074 bytes result sent to driver
16/08/16 17:07:06 INFO TaskSetManager: Finished task 0.0 in stage 4.0 (TID 204) in 1267 ms on localhost (1/1)
16/08/16 17:07:06 INFO DAGScheduler: ResultStage 4 (take at CassandraRDD.scala:121) finished in 1.276 s
16/08/16 17:07:06 INFO TaskSchedulerImpl: Removed TaskSet 4.0, whose tasks have all completed, from pool
16/08/16 17:07:06 INFO DAGScheduler: Job 3 finished: take at CassandraRDD.scala:121, took 1.310929 s
16/08/16 17:07:06 INFO SparkContext: Starting job: take at CassandraRDD.scala:121
16/08/16 17:07:06 INFO DAGScheduler: Got job 4 (take at CassandraRDD.scala:121) with 4 output partitions
16/08/16 17:07:06 INFO DAGScheduler: Final stage: ResultStage 5 (take at CassandraRDD.scala:121)
16/08/16 17:07:06 INFO DAGScheduler: Parents of final stage: List()
16/08/16 17:07:06 INFO DAGScheduler: Missing parents: List()
16/08/16 17:07:06 INFO DAGScheduler: Submitting ResultStage 5 (CassandraTableScanRDD[17] at RDD at CassandraRDD.scala:18), which has no missing parents
16/08/16 17:07:06 INFO MemoryStore: Block broadcast_8 stored as values in memory (estimated size 8.4 KB, free 183.1 KB)
16/08/16 17:07:06 INFO MemoryStore: Block broadcast_8_piece0 stored as bytes in memory (estimated size 4.2 KB, free 187.3 KB)
16/08/16 17:07:06 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on localhost:43680 (size: 4.2 KB, free: 756.3 MB)
16/08/16 17:07:06 INFO SparkContext: Created broadcast 8 from broadcast at DAGScheduler.scala:1006
16/08/16 17:07:06 INFO DAGScheduler: Submitting 4 missing tasks from ResultStage 5 (CassandraTableScanRDD[17] at RDD at CassandraRDD.scala:18)
16/08/16 17:07:06 INFO TaskSchedulerImpl: Adding task set 5.0 with 4 tasks
16/08/16 17:07:06 INFO TaskSetManager: Starting task 0.0 in stage 5.0 (TID 205, localhost, partition 1,NODE_LOCAL, 22553 bytes)
16/08/16 17:07:06 INFO Executor: Running task 0.0 in stage 5.0 (TID 205)
16/08/16 17:07:07 INFO Executor: Finished task 0.0 in stage 5.0 (TID 205). 2074 bytes result sent to driver
16/08/16 17:07:07 INFO TaskSetManager: Finished task 0.0 in stage 5.0 (TID 205) in 706 ms on localhost (1/4)
16/08/16 17:07:14 INFO CassandraConnector: Disconnected from Cassandra cluster: Test Cluster
16/08/16 17:32:40 INFO BlockManagerInfo: Removed broadcast_7_piece0 on localhost:43680 in memory (size: 4.2 KB, free: 756.4 MB)
16/08/16 17:32:40 INFO ContextCleaner: Cleaned accumulator 14
16/08/16 17:32:40 INFO ContextCleaner: Cleaned accumulator 13
16/08/16 17:32:40 INFO BlockManagerInfo: Removed broadcast_5_piece0 on localhost:43680 in memory (size: 7.1 KB, free: 756.4 MB)
16/08/16 17:32:40 INFO ContextCleaner: Cleaned accumulator 12
16/08/16 17:32:40 INFO ContextCleaner: Cleaned shuffle 0
16/08/16 17:32:40 INFO ContextCleaner: Cleaned accumulator 11
16/08/16 17:32:40 INFO ContextCleaner: Cleaned accumulator 10
16/08/16 17:32:40 INFO ContextCleaner: Cleaned accumulator 9
16/08/16 17:32:40 INFO ContextCleaner: Cleaned accumulator 8
16/08/16 17:32:40 INFO ContextCleaner: Cleaned accumulator 7
16/08/16 17:32:40 INFO ContextCleaner: Cleaned accumulator 6
16/08/16 17:32:40 INFO ContextCleaner: Cleaned accumulator 5
16/08/16 17:32:40 INFO ContextCleaner: Cleaned accumulator 4
16/08/16 17:32:40 INFO BlockManagerInfo: Removed broadcast_4_piece0 on localhost:43680 in memory (size: 13.8 KB, free: 756.4 MB)
16/08/16 20:45:06 INFO SparkContext: Invoking stop() from shutdown hook
EDIT
This is a snippet of exactly what I am doing:
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark
import org.apache.spark.storage.StorageLevel
import org.apache.spark.sql.types.{StructType, StructField, DateType, IntegerType}

object ff {
  def main(string: Array[String]) {
    val conf = new SparkConf()
      .set("spark.cassandra.connection.host", "127.0.0.1")
      .setMaster("local[4]")
      .setAppName("ff")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true") // Use first line of all files as header
      .option("inferSchema", "true")
      .load("test.csv")

    df.registerTempTable("ff_table")
    //df.printSchema()
    df.count

    time {
      df.write
        .format("org.apache.spark.sql.cassandra")
        .options(Map("table" -> "ff_table", "keyspace" -> "traffic"))
        .save()
    }

    def time[A](f: => A) = {
      val s = System.nanoTime
      val ret = f
      println("time: " + (System.nanoTime - s) / 1e6 + "ms")
      ret
    }
  }
}
Also, if I run nodetool describecluster, I get these results:
Cluster Information:
    Name: Test Cluster
    Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch
    Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
    Schema versions:
        bf6c3ae7-5c8b-3e5d-9794-8e34bee9278f: [127.0.0.1, 127.0.0.2, 127.0.0.3, 127.0.0.4, 127.0.0.5, 127.0.0.6]
My keyspace configuration:
CREATE KEYSPACE traffic WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '3'} AND durable_writes = true;
I tried inserting a row via the CLI with replication_factor: 3 and it works, so every node can see the others.
Why can't Spark insert anything then? Does anyone have an idea?
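One hedged thing worth checking (an assumption, not something established in this post): the connector's write consistency level. With replication_factor > 1, a quorum-level write consistency can hang if some replicas are unreachable from the driver's network; the spark-cassandra-connector exposes it as a configuration property:
import org.apache.spark.SparkConf

// Sketch only: spark.cassandra.output.consistency.level is a documented
// spark-cassandra-connector property; "ONE" here is an experiment to isolate
// the hang, not a recommendation for production.
val conf = new SparkConf()
  .set("spark.cassandra.connection.host", "127.0.0.1")
  .set("spark.cassandra.output.consistency.level", "ONE")
  .setMaster("local[4]")
  .setAppName("ff")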
I am trying to run this simple Spark job in Scala using IntelliJ IDEA. However, the Spark UI stops completely after the object finishes executing. Is there something I am missing, or am I looking in the wrong place? Scala version: 2.10.4, Spark: 1.6.0.
import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "C:/spark-1.6.0-bin-hadoop2.6/spark-1.6.0-bin-hadoop2.6/README.md" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}
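For what it's worth, a hedged note (an assumption, not from the question): the web UI on port 4040 lives only as long as the SparkContext, so it disappears the moment main returns. Blocking before exit keeps it inspectable:
import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "C:/spark-1.6.0-bin-hadoop2.6/spark-1.6.0-bin-hadoop2.6/README.md"
    val sc = new SparkContext(new SparkConf().setAppName("Simple Application").setMaster("local[*]"))
    val logData = sc.textFile(logFile, 2).cache()
    println("Lines with a: %s, Lines with b: %s".format(
      logData.filter(_.contains("a")).count(), logData.filter(_.contains("b")).count()))
    Console.readLine() // keep the driver, and the UI at :4040, alive until Enter is pressed
    sc.stop()
  }
}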
16/02/24 01:24:39 INFO SparkContext: Running Spark version 1.6.0
16/02/24 01:24:40 INFO SecurityManager: Changing view acls to: Sivaram Konanki
16/02/24 01:24:40 INFO SecurityManager: Changing modify acls to: Sivaram Konanki
16/02/24 01:24:40 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(Sivaram Konanki); users with modify permissions: Set(Sivaram Konanki)
16/02/24 01:24:41 INFO Utils: Successfully started service 'sparkDriver' on port 54881.
16/02/24 01:24:41 INFO Slf4jLogger: Slf4jLogger started
16/02/24 01:24:42 INFO Remoting: Starting remoting
16/02/24 01:24:42 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem#192.168.1.15:54894]
16/02/24 01:24:42 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 54894.
16/02/24 01:24:42 INFO SparkEnv: Registering MapOutputTracker
16/02/24 01:24:42 INFO SparkEnv: Registering BlockManagerMaster
16/02/24 01:24:42 INFO DiskBlockManager: Created local directory at C:\Users\Sivaram Konanki\AppData\Local\Temp\blockmgr-dad99e77-f3a6-4a1d-88d8-3b030be0bd0a
16/02/24 01:24:42 INFO MemoryStore: MemoryStore started with capacity 2.4 GB
16/02/24 01:24:42 INFO SparkEnv: Registering OutputCommitCoordinator
16/02/24 01:24:42 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/02/24 01:24:42 INFO SparkUI: Started SparkUI at http://192.168.1.15:4040
16/02/24 01:24:42 INFO Executor: Starting executor ID driver on host localhost
16/02/24 01:24:43 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 54913.
16/02/24 01:24:43 INFO NettyBlockTransferService: Server created on 54913
16/02/24 01:24:43 INFO BlockManagerMaster: Trying to register BlockManager
16/02/24 01:24:43 INFO BlockManagerMasterEndpoint: Registering block manager localhost:54913 with 2.4 GB RAM, BlockManagerId(driver, localhost, 54913)
16/02/24 01:24:43 INFO BlockManagerMaster: Registered BlockManager
16/02/24 01:24:44 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 127.4 KB, free 127.4 KB)
16/02/24 01:24:44 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 13.9 KB, free 141.3 KB)
16/02/24 01:24:44 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:54913 (size: 13.9 KB, free: 2.4 GB)
16/02/24 01:24:44 INFO SparkContext: Created broadcast 0 from textFile at SimpleApp.scala:11
16/02/24 01:24:45 WARN : Your hostname, OSG-E5450-42 resolves to a loopback/non-reachable address: fe80:0:0:0:d9ff:4f93:5643:703d%wlan3, but we couldn't find any external IP address!
16/02/24 01:24:46 INFO FileInputFormat: Total input paths to process : 1
16/02/24 01:24:46 INFO SparkContext: Starting job: count at SimpleApp.scala:12
16/02/24 01:24:46 INFO DAGScheduler: Got job 0 (count at SimpleApp.scala:12) with 2 output partitions
16/02/24 01:24:46 INFO DAGScheduler: Final stage: ResultStage 0 (count at SimpleApp.scala:12)
16/02/24 01:24:46 INFO DAGScheduler: Parents of final stage: List()
16/02/24 01:24:46 INFO DAGScheduler: Missing parents: List()
16/02/24 01:24:46 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[2] at filter at SimpleApp.scala:12), which has no missing parents
16/02/24 01:24:46 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.1 KB, free 144.5 KB)
16/02/24 01:24:46 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1886.0 B, free 146.3 KB)
16/02/24 01:24:46 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:54913 (size: 1886.0 B, free: 2.4 GB)
16/02/24 01:24:46 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
16/02/24 01:24:46 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[2] at filter at SimpleApp.scala:12)
16/02/24 01:24:46 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
16/02/24 01:24:46 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2172 bytes)
16/02/24 01:24:46 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, partition 1,PROCESS_LOCAL, 2172 bytes)
16/02/24 01:24:46 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
16/02/24 01:24:46 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
16/02/24 01:24:46 INFO CacheManager: Partition rdd_1_1 not found, computing it
16/02/24 01:24:46 INFO CacheManager: Partition rdd_1_0 not found, computing it
16/02/24 01:24:46 INFO HadoopRDD: Input split: file:/C:/spark-1.6.0-bin-hadoop2.6/spark-1.6.0-bin-hadoop2.6/README.md:1679+1680
16/02/24 01:24:46 INFO HadoopRDD: Input split: file:/C:/spark-1.6.0-bin-hadoop2.6/spark-1.6.0-bin-hadoop2.6/README.md:0+1679
16/02/24 01:24:46 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
16/02/24 01:24:46 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
16/02/24 01:24:46 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
16/02/24 01:24:46 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
16/02/24 01:24:46 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
16/02/24 01:24:46 INFO MemoryStore: Block rdd_1_1 stored as values in memory (estimated size 4.7 KB, free 151.0 KB)
16/02/24 01:24:46 INFO BlockManagerInfo: Added rdd_1_1 in memory on localhost:54913 (size: 4.7 KB, free: 2.4 GB)
16/02/24 01:24:46 INFO MemoryStore: Block rdd_1_0 stored as values in memory (estimated size 5.4 KB, free 156.5 KB)
16/02/24 01:24:46 INFO BlockManagerInfo: Added rdd_1_0 in memory on localhost:54913 (size: 5.4 KB, free: 2.4 GB)
16/02/24 01:24:46 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2662 bytes result sent to driver
16/02/24 01:24:46 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 2662 bytes result sent to driver
16/02/24 01:24:46 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 170 ms on localhost (1/2)
16/02/24 01:24:46 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 143 ms on localhost (2/2)
16/02/24 01:24:46 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/02/24 01:24:46 INFO DAGScheduler: ResultStage 0 (count at SimpleApp.scala:12) finished in 0.187 s
16/02/24 01:24:46 INFO DAGScheduler: Job 0 finished: count at SimpleApp.scala:12, took 0.303861 s
16/02/24 01:24:46 INFO SparkContext: Starting job: count at SimpleApp.scala:13
16/02/24 01:24:46 INFO DAGScheduler: Got job 1 (count at SimpleApp.scala:13) with 2 output partitions
16/02/24 01:24:46 INFO DAGScheduler: Final stage: ResultStage 1 (count at SimpleApp.scala:13)
16/02/24 01:24:46 INFO DAGScheduler: Parents of final stage: List()
16/02/24 01:24:46 INFO DAGScheduler: Missing parents: List()
16/02/24 01:24:46 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[3] at filter at SimpleApp.scala:13), which has no missing parents
16/02/24 01:24:46 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 3.1 KB, free 159.6 KB)
16/02/24 01:24:46 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1888.0 B, free 161.5 KB)
16/02/24 01:24:46 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:54913 (size: 1888.0 B, free: 2.4 GB)
16/02/24 01:24:46 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1006
16/02/24 01:24:46 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 1 (MapPartitionsRDD[3] at filter at SimpleApp.scala:13)
16/02/24 01:24:46 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
16/02/24 01:24:46 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, localhost, partition 0,PROCESS_LOCAL, 2172 bytes)
16/02/24 01:24:46 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, localhost, partition 1,PROCESS_LOCAL, 2172 bytes)
16/02/24 01:24:46 INFO Executor: Running task 0.0 in stage 1.0 (TID 2)
16/02/24 01:24:46 INFO Executor: Running task 1.0 in stage 1.0 (TID 3)
16/02/24 01:24:46 INFO BlockManager: Found block rdd_1_0 locally
16/02/24 01:24:46 INFO BlockManager: Found block rdd_1_1 locally
16/02/24 01:24:46 INFO Executor: Finished task 0.0 in stage 1.0 (TID 2). 2082 bytes result sent to driver
16/02/24 01:24:46 INFO Executor: Finished task 1.0 in stage 1.0 (TID 3). 2082 bytes result sent to driver
16/02/24 01:24:46 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 34 ms on localhost (1/2)
16/02/24 01:24:46 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 37 ms on localhost (2/2)
Lines with a: 58, Lines with b: 26
16/02/24 01:24:46 INFO DAGScheduler: ResultStage 1 (count at SimpleApp.scala:13) finished in 0.040 s
16/02/24 01:24:46 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
16/02/24 01:24:46 INFO DAGScheduler: Job 1 finished: count at SimpleApp.scala:13, took 0.068350 s
16/02/24 01:24:46 INFO SparkContext: Invoking stop() from shutdown hook
16/02/24 01:24:46 INFO SparkUI: Stopped Spark web UI at http://192.168.1.15:4040
16/02/24 01:24:46 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/02/24 01:24:46 INFO MemoryStore: MemoryStore cleared
16/02/24 01:24:46 INFO BlockManager: BlockManager stopped
16/02/24 01:24:46 INFO BlockManagerMaster: BlockManagerMaster stopped
16/02/24 01:24:46 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/02/24 01:24:46 INFO SparkContext: Successfully stopped SparkContext
16/02/24 01:24:46 INFO ShutdownHookManager: Shutdown hook called
16/02/24 01:24:46 INFO ShutdownHookManager: Deleting directory C:\Users\Sivaram Konanki\AppData\Local\Temp\spark-861b5aef-6732-45e4-a4f4-6769370c555e
You can add a
Thread.sleep(1000000) // for 1000 seconds or more
at the bottom of your Spark job; this will let you inspect the web UI in IDEs like IntelliJ while the job is still running.
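For illustration, a minimal sketch of where the sleep would sit in the SimpleApp from the question (the duration is arbitrary):
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
    Thread.sleep(1000000) // keep the driver, and therefore the UI on port 4040, alive
    sc.stop()            // stop the context explicitly once you are done inspecting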
This is expected behavior. The Spark UI is maintained by the SparkContext, so it cannot stay up after the application has finished and the context has been destroyed.
In standalone mode the information is preserved by the cluster web UI; on Mesos or YARN you can use the history server, but in local mode the only option I am aware of is to keep the application running.
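For completeness, a sketch of the history-server route mentioned above; the event-log directory is an assumption and must exist before the job runs:
spark.eventLog.enabled true
spark.eventLog.dir file:/tmp/spark-events
spark.history.fs.logDirectory file:/tmp/spark-events
With these lines in spark-defaults.conf, start the server with sbin/start-history-server.sh and browse completed applications on port 18080 (the default).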
I've been using pyspark with my YARN cluster with success. The work I'm doing involves using the RDD's pipe command to send data through a binary I've made. I can do this easily in pyspark like so (assuming 'sc' is already defined):
sc.addFile("./dumb_prog")
t = sc.parallelize(range(10))
t.pipe("dumb_prog")
t.take(10) # Gives expected result
However, if I do the same thing in Scala, the pipe command gets a 'Cannot run program "dumb_prog": error=2, No such file or directory' error. Here's the code in the Scala shell:
sc.addFile("./dumb_prog")
val t = sc.parallelize(0 until 10)
val u = t.pipe("dumb_prog")
u.take(10)
Why does this only work in Python and not in Scala? Is there a way I can get it to work in Scala?
Here is the full error message from the Scala side:
14/09/29 13:07:47 INFO SparkContext: Starting job: take at <console>:17
14/09/29 13:07:47 INFO DAGScheduler: Got job 3 (take at <console>:17) with 1 output partitions (allowLocal=true)
14/09/29 13:07:47 INFO DAGScheduler: Final stage: Stage 3(take at <console>:17)
14/09/29 13:07:47 INFO DAGScheduler: Parents of final stage: List()
14/09/29 13:07:47 INFO DAGScheduler: Missing parents: List()
14/09/29 13:07:47 INFO DAGScheduler: Submitting Stage 3 (PipedRDD[3] at pipe at <console>:14), which has no missing parents
14/09/29 13:07:47 INFO MemoryStore: ensureFreeSpace(2136) called with curMem=7453, maxMem=278302556
14/09/29 13:07:47 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 2.1 KB, free 265.4 MB)
14/09/29 13:07:47 INFO MemoryStore: ensureFreeSpace(1389) called with curMem=9589, maxMem=278302556
14/09/29 13:07:47 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 1389.0 B, free 265.4 MB)
14/09/29 13:07:47 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on 10.10.0.20:37574 (size: 1389.0 B, free: 265.4 MB)
14/09/29 13:07:47 INFO BlockManagerMaster: Updated info of block broadcast_3_piece0
14/09/29 13:07:47 INFO DAGScheduler: Submitting 1 missing tasks from Stage 3 (PipedRDD[3] at pipe at <console>:14)
14/09/29 13:07:47 INFO YarnClientClusterScheduler: Adding task set 3.0 with 1 tasks
14/09/29 13:07:47 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID 6, SERVERNAME, PROCESS_LOCAL, 1201 bytes)
14/09/29 13:07:47 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on SERVERNAME:57118 (size: 1389.0 B, free: 530.3 MB)
14/09/29 13:07:47 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 6, SERVERNAME): java.io.IOException: Cannot run program "dumb_prog": error=2, No such file or directory
java.lang.ProcessBuilder.start(ProcessBuilder.java:1041)
org.apache.spark.rdd.PipedRDD.compute(PipedRDD.scala:119)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
I ran into a similar issue in Spark 1.3.0 in YARN client mode. When I looked in the app cache directory, the file was never pushed to the executors, even when using --files. But when I added the below, it did get pushed to each executor:
sc.addFile("dumb_prog", true) // the second argument is the recursive flag
t.pipe("./dumb_prog") // note the explicit ./ prefix
I think it is a bug, but the above got me past the issue.
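Putting the workaround together, a minimal end-to-end sketch for the Scala shell (same hypothetical binary name as in the question):
sc.addFile("dumb_prog", true)  // pushes the file out to each executor
val t = sc.parallelize(0 until 10)
val u = t.pipe("./dumb_prog")  // ./ makes the shell look in the executor's working directory, not on the PATH
u.take(10)                     // should now return the binary's output instead of error=2
The key difference from the failing version is the combination of the recursive flag on addFile and the explicit ./ in the pipe command.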