spark-notebook “Bad substitution” when submitting spark job to yarn-cluster

This is similar to the question "Bad substitution" when submitting spark job to yarn-cluster.
I get the following when submitting a job to a YARN cluster:
2016-02-25 19:49:11,029 INFO [Remote-akka.actor.default-dispatcher-4] (org.apache.spark.deploy.yarn.Client) - Application report for application_1456408114938_0007 (state: ACCEPTED)
2016-02-25 19:49:12,034 INFO [Remote-akka.actor.default-dispatcher-4] (org.apache.spark.deploy.yarn.Client) - Application report for application_1456408114938_0007 (state: ACCEPTED)
2016-02-25 19:49:13,039 INFO [Remote-akka.actor.default-dispatcher-4] (org.apache.spark.deploy.yarn.Client) - Application report for application_1456408114938_0007 (state: FAILED)
2016-02-25 19:49:13,040 INFO [Remote-akka.actor.default-dispatcher-4] (org.apache.spark.deploy.yarn.Client) -
client token: N/A
diagnostics: Application application_1456408114938_0007 failed 2 times due to AM Container for appattempt_1456408114938_0007_000002 exited with exitCode: 1
For more detailed output, check application tracking page:http://m:8088/cluster/app/application_1456408114938_0007Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_e03_1456408114938_0007_02_000001
Exit code: 1
Exception message: /hadoop/yarn/local/usercache/spark-notebook/appcache/application_1456408114938_0007/container_e03_1456408114938_0007_02_000001/launch_container.sh: line 24: $PWD:$PWD/__spark_conf__:$PWD/__spark__.jar:$HADOOP_CONF_DIR:/usr/hdp/current/hadoop-client/*:/usr/hdp/current/hadoop-client/lib/*:/usr/hdp/current/hadoop-hdfs-client/*:/usr/hdp/current/hadoop-hdfs-client/lib/*:/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/hadoop-yarn-client/lib/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:$PWD/mr-framework/hadoop/share/hadoop/tools/lib/*:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure: bad substitution
Stack trace: ExitCodeException exitCode=1: /hadoop/yarn/local/usercache/spark-notebook/appcache/application_1456408114938_0007/container_e03_1456408114938_0007_02_000001/launch_container.sh: line 24: $PWD:$PWD/__spark_conf__:$PWD/__spark__.jar:$HADOOP_CONF_DIR:/usr/hdp/current/hadoop-client/*:/usr/hdp/current/hadoop-client/lib/*:/usr/hdp/current/hadoop-hdfs-client/*:/usr/hdp/current/hadoop-hdfs-client/lib/*:/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/hadoop-yarn-client/lib/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:$PWD/mr-framework/hadoop/share/hadoop/tools/lib/*:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure: bad substitution
The following do work:
- Zeppelin
- the SparkPi example:
set MASTER=yarn-client
. /etc/spark/conf/spark-env.sh
./run-example SparkPi
16/02/25 19:54:38 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:36, took 1.458580 s
Pi is roughly 3.14232
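The "bad substitution" text is bash's own error message: a parameter name cannot contain a dot, so the `${hdp.version}` placeholders left in the classpath on line 24 of launch_container.sh are rejected when the script runs. A minimal sketch that reproduces this in isolation (the path fragment is shortened from the log above, and nothing here touches YARN):

```shell
#!/usr/bin/env bash
# Reproduce the error in isolation: bash parameter names may not contain a dot,
# so an unexpanded ${hdp.version} placeholder is rejected when the line is evaluated.
# The single quotes keep the outer shell from touching the placeholder.
out=$(bash -c 'echo "/usr/hdp/${hdp.version}/hadoop/lib"' 2>&1)
status=$?

echo "exit status: $status"
echo "message: $out"
```

If that is what is happening here, the placeholder was never filled in before the launch script was generated. A workaround often reported for HDP clusters is to pass the version as a Java system property, for example `-Dhdp.version=<your HDP build>` in `spark.driver.extraJavaOptions` and `spark.yarn.am.extraJavaOptions`. Treat that as an assumption to verify against your distribution, since the post itself does not confirm it.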

JobManager doesn't automatically redirect all requests to the remaining / running TaskManager

Problem Description
I have two machines (203 and 204) on which I created a standalone-mode HA Flink v1.6.1 cluster. Each machine runs a JobManager and a TaskManager (2 task slots each).
After I start a job (the SocketWindowWordCount example: ./flink run ../examples/streaming/SocketWindowWordCount.jar --hostname 10.1.2.9 --port 9000) on the JobManager node, I kill the working TaskManager instance.
In the Web Dashboard I can see the job being cancelled and then failing.
flink-conf.yaml
state.backend: filesystem
state.checkpoints.dir: hdfs://10.1.2.109:8020/wulin/flink-checkpoints
rest.port: 9081
blob.server.port: 6124
query.server.port: 6125
web.tmpdir: /home/flink/deploy/webTmp
web.log.path: /home/flink/deploy/log
io.tmp.dirs: /home/flink/deploy/taskManagerTmp
high-availability: zookeeper
high-availability.zookeeper.quorum: 10.0.1.79:2181
high-availability.zookeeper.path.root: /flink
high-availability.cluster-id: flink
high-availability.storageDir: hdfs://10.1.2.109:8020/wulin
security.kerberos.login.principal: xxxx
security.kerberos.login.keytab: /home/ctu/flink/flink-1.6/conf/user.keytab
full logs
log-standalonesession-203
log-taskexecutor-203
log-standalonesession-204
exception
After killing the working TM, I get an exception like this:
2018-12-28 11:04:27,877 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@hz203:42861] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@hz203:42861]] Caused by: [Connection refused: hz203/10.0.0.203:42861]
2018-12-28 11:04:28,660 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: hz203/10.0.0.203:42861
2018-12-28 11:04:28,660 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@hz203:42861] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@hz203:42861]] Caused by: [Connection refused: hz203/10.0.0.203:42861]
2018-12-28 11:04:28,678 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - The heartbeat of TaskManager with id 0f41bca09600cd25000e19801076fa1f timed out.
2018-12-28 11:04:28,678 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Closing TaskExecutor connection 0f41bca09600cd25000e19801076fa1f because: The heartbeat of TaskManager with id 0f41bca09600cd25000e19801076fa1f timed out.
2018-12-28 11:04:28,678 INFO org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - Unregister TaskManager dcf3bb5b7ed2208cf45b658d212fd8d2 from the SlotManager.
2018-12-28 11:04:28,678 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Socket Stream -> Flat Map (1/1) (88aa62ad152f4df6b39a969dd32c0249) switched from RUNNING to FAILED.
org.apache.flink.util.FlinkException: The assigned slot 0f41bca09600cd25000e19801076fa1f_0 was removed.
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:786)
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:756)
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:948)
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:372)
at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:803)
at org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener$1.run(ResourceManager.java:1116)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
at akka.actor.ActorCell.invoke(ActorCell.scala:495)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2018-12-28 11:04:28,680 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job Socket Window WordCount (61f55876e79934d515c163d095d706a6) switched from state RUNNING to FAILING.
submit job
Running ./bin/flink run -d ./examples/streaming/SocketWindowWordCount.jar --port 9000 --hostname 10.1.2.9 gives JM logs like this:
2018-12-28 19:20:01,354 INFO org.apache.flink.runtime.jobmaster.JobMaster - Starting execution of job Socket Window WordCount (5cdb91c15ee12ec6e74256eed10b5291)
2018-12-28 19:20:01,354 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job Socket Window WordCount (5cdb91c15ee12ec6e74256eed10b5291) switched from state CREATED to RUNNING.
2018-12-28 19:20:01,356 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Socket Stream -> Flat Map (1/1) (e30439b9f548c6013d8b8689e30d0dd7) switched from CREATED to SCHEDULED.
2018-12-28 19:20:01,359 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Window(TumblingProcessingTimeWindows(5000), ProcessingTimeTrigger, ReduceFunction$1, PassThroughWindowFunction) -> Sink: Print to Std. Out (1/1) (102d04f5aa6fc50cfe5088e20902c72e) switched from CREATED to SCHEDULED.
2018-12-28 19:20:01,364 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{e33a40832a3922897470fb76bcf76b29}]
2018-12-28 19:20:01,367 INFO org.apache.flink.runtime.jobmaster.JobMaster - Connecting to ResourceManager akka.tcp://flink@hz203:46596/user/resourcemanager(b22f96303e74df23645fe4567f884b9e)
2018-12-28 19:20:01,370 INFO org.apache.flink.runtime.jobmaster.JobMaster - Resolved ResourceManager address, beginning registration
2018-12-28 19:20:01,370 INFO org.apache.flink.runtime.jobmaster.JobMaster - Registration at ResourceManager attempt 1 (timeout=100ms)
2018-12-28 19:20:01,371 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService /leader/5cdb91c15ee12ec6e74256eed10b5291/job_manager_lock.
2018-12-28 19:20:01,371 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Registering job manager 9a31e8b4e8dfbf7b31d6ed3d227648b6@akka.tcp://flink@hz203:46596/user/jobmanager_0 for job 5cdb91c15ee12ec6e74256eed10b5291.
2018-12-28 19:20:01,431 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Registered job manager 9a31e8b4e8dfbf7b31d6ed3d227648b6@akka.tcp://flink@hz203:46596/user/jobmanager_0 for job 5cdb91c15ee12ec6e74256eed10b5291.
2018-12-28 19:20:01,432 INFO org.apache.flink.runtime.jobmaster.JobMaster - JobManager successfully registered at ResourceManager, leader id: b22f96303e74df23645fe4567f884b9e.
2018-12-28 19:20:01,433 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Requesting new slot [SlotRequestId{e33a40832a3922897470fb76bcf76b29}] and profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} from resource manager.
2018-12-28 19:20:01,434 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Request slot with profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} for job 5cdb91c15ee12ec6e74256eed10b5291 with allocation id AllocationID{f7a24e609e2ec618ccb456076049fa3b}.
2018-12-28 19:20:01,510 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Socket Stream -> Flat Map (1/1) (e30439b9f548c6013d8b8689e30d0dd7) switched from SCHEDULED to DEPLOYING.
2018-12-28 19:20:01,511 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Deploying Source: Socket Stream -> Flat Map (1/1) (attempt #0) to hz203
2018-12-28 19:20:01,515 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Window(TumblingProcessingTimeWindows(5000), ProcessingTimeTrigger, ReduceFunction$1, PassThroughWindowFunction) -> Sink: Print to Std. Out (1/1) (102d04f5aa6fc50cfe5088e20902c72e) switched from SCHEDULED to DEPLOYING.
2018-12-28 19:20:01,515 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Deploying Window(TumblingProcessingTimeWindows(5000), ProcessingTimeTrigger, ReduceFunction$1, PassThroughWindowFunction) -> Sink: Print to Std. Out (1/1) (attempt #0) to hz203
2018-12-28 19:20:01,674 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Window(TumblingProcessingTimeWindows(5000), ProcessingTimeTrigger, ReduceFunction$1, PassThroughWindowFunction) -> Sink: Print to Std. Out (1/1) (102d04f5aa6fc50cfe5088e20902c72e) switched from DEPLOYING to RUNNING.
2018-12-28 19:20:01,708 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Socket Stream -> Flat Map (1/1) (e30439b9f548c6013d8b8689e30d0dd7) switched from DEPLOYING to RUNNING.
2018-12-28 19:20:43,267 INFO org.apache.flink.runtime.blob.BlobClient - Downloading null/t-61808afb630553305c73a0a23f9231ffd6b2b448-513fbe1e6ddf69d10689eccf4c65da97 from hz203/10.0.0.203:6124
2018-12-28 19:20:48,339 INFO org.apache.flink.runtime.blob.BlobClient - Downloading null/t-dd915bb9821ff6ced34dd5e489966b674de5a48f-7ea2600930e5fc5a4fbb7d47ee198789 from hz203/10.0.0.203:6124
2018-12-28 19:20:52,623 INFO org.apache.flink.runtime.blob.BlobClient - Downloading null/t-61808afb630553305c73a0a23f9231ffd6b2b448-0bd1ab86fa4cc54daeb472079bfbea8c from hz203/10.0.0.203:6124
kill TM
The body is limited to 30000 characters, so please see the JM logs from when the TM was killed.
The logs indicate that your RestartStrategy has depleted its restart attempts or that no RestartStrategy has been configured. Please check whether you specified a RestartStrategy in your program via env.setRestartStrategy(RestartStrategies.fixedDelayRestart(10, 0L)) or in flink-conf.yaml via restart-strategy: fixed-delay. If you want to learn more about Flink's restart strategies check out the documentation.
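The flink-conf.yaml variant mentioned above can be sketched as follows. This is a minimal example, assuming the fixed-delay keys documented for Flink 1.6 (`restart-strategy.fixed-delay.attempts` and `restart-strategy.fixed-delay.delay`). The attempt count and delay are illustrative values, and `FLINK_CONF_FILE` is a hypothetical variable used here so the sketch writes to a scratch file unless it is pointed at a real configuration.

```shell
#!/usr/bin/env bash
# Target file: a scratch file by default, so nothing real is modified by accident.
# FLINK_CONF_FILE is a hypothetical variable introduced for this sketch.
conf_file="${FLINK_CONF_FILE:-$(mktemp)}"

# Append a fixed-delay restart strategy: up to 10 restarts, 10 seconds apart
# (illustrative values, not taken from the original post).
cat >> "$conf_file" <<'EOF'
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 10
restart-strategy.fixed-delay.delay: 10 s
EOF

echo "restart strategy written to $conf_file"
```

The Flink processes read flink-conf.yaml at startup, so the cluster would need to be restarted for the change to take effect.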

How to run Spark Scala code on Amazon EMR

I am trying to run the following piece of Spark code written in Scala on Amazon EMR:
import org.apache.spark.{SparkConf, SparkContext}

object TestRunner {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Hello World")
    val sc = new SparkContext(conf)
    val words = sc.parallelize(Seq("a", "b", "c", "d", "e"))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    println(wordCounts)
  }
}
This is the script I am using to deploy the above code into EMR:
#!/usr/bin/env bash
set -euxo pipefail
cluster_id='j-XXXXXXXXXX'
app_name="HelloWorld"
main_class="TestRunner"
jar_name="HelloWorld-assembly-0.0.1-SNAPSHOT.jar"
jar_path="target/scala-2.11/${jar_name}"
s3_jar_dir="s3://jars/"
s3_jar_path="${s3_jar_dir}${jar_name}"
###################################################
sbt assembly
aws s3 cp ${jar_path} ${s3_jar_dir}
aws emr add-steps --cluster-id ${cluster_id} --steps Type=spark,Name=${app_name},Args=[--deploy-mode,cluster,--master,yarn-cluster,--class,${main_class},${s3_jar_path}],ActionOnFailure=CONTINUE
But this exits after a few minutes without producing any output in AWS!
Here's my controller's output:
2016-10-20T21:03:17.043Z INFO Ensure step 3 jar file command-runner.jar
2016-10-20T21:03:17.043Z INFO StepRunner: Created Runner for step 3
INFO startExec 'hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar spark-submit --deploy-mode cluster --class TestRunner s3://jars/mscheiber/HelloWorld-assembly-0.0.1-SNAPSHOT.jar'
INFO Environment:
PATH=/sbin:/usr/sbin:/bin:/usr/bin:/usr/local/sbin:/opt/aws/bin
LESS_TERMCAP_md=[01;38;5;208m
LESS_TERMCAP_me=[0m
HISTCONTROL=ignoredups
LESS_TERMCAP_mb=[01;31m
AWS_AUTO_SCALING_HOME=/opt/aws/apitools/as
UPSTART_JOB=rc
LESS_TERMCAP_se=[0m
HISTSIZE=1000
HADOOP_ROOT_LOGGER=INFO,DRFA
JAVA_HOME=/etc/alternatives/jre
AWS_DEFAULT_REGION=us-east-1
AWS_ELB_HOME=/opt/aws/apitools/elb
LESS_TERMCAP_us=[04;38;5;111m
EC2_HOME=/opt/aws/apitools/ec2
TERM=linux
XFILESEARCHPATH=/usr/dt/app-defaults/%L/Dt
runlevel=3
LANG=en_US.UTF-8
AWS_CLOUDWATCH_HOME=/opt/aws/apitools/mon
MAIL=/var/spool/mail/hadoop
LESS_TERMCAP_ue=[0m
LOGNAME=hadoop
PWD=/
LANGSH_SOURCED=1
HADOOP_CLIENT_OPTS=-Djava.io.tmpdir=/mnt/var/lib/hadoop/steps/s-3UAS8JQ0KEOV3/tmp
_=/etc/alternatives/jre/bin/java
CONSOLETYPE=serial
RUNLEVEL=3
LESSOPEN=||/usr/bin/lesspipe.sh %s
previous=N
UPSTART_EVENTS=runlevel
AWS_PATH=/opt/aws
USER=hadoop
UPSTART_INSTANCE=
PREVLEVEL=N
HADOOP_LOGFILE=syslog
HOSTNAME=ip-10-17-186-102
NLSPATH=/usr/dt/lib/nls/msg/%L/%N.cat
HADOOP_LOG_DIR=/mnt/var/log/hadoop/steps/s-3UAS8JQ0KEOV3
EC2_AMITOOL_HOME=/opt/aws/amitools/ec2
SHLVL=5
HOME=/home/hadoop
HADOOP_IDENT_STRING=hadoop
INFO redirectOutput to /mnt/var/log/hadoop/steps/s-3UAS8JQ0KEOV3/stdout
INFO redirectError to /mnt/var/log/hadoop/steps/s-3UAS8JQ0KEOV3/stderr
INFO Working dir /mnt/var/lib/hadoop/steps/s-3UAS8JQ0KEOV3
INFO ProcessRunner started child process 24549 :
hadoop 24549 4780 0 21:03 ? 00:00:00 bash /usr/lib/hadoop/bin/hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar spark-submit --deploy-mode cluster --class TestRunner s3://jars/TestRunner-assembly-0.0.1-SNAPSHOT.jar
2016-10-20T21:03:21.050Z INFO HadoopJarStepRunner.Runner: startRun() called for s-3UAS8JQ0KEOV3 Child Pid: 24549
INFO Synchronously wait child process to complete : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO waitProcessCompletion ended with exit code 0 : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO total process run time: 44 seconds
2016-10-20T21:04:03.102Z INFO Step created jobs:
2016-10-20T21:04:03.103Z INFO Step succeeded with exitCode 0 and took 44 seconds
The syslog and stdout are empty, and this is in my stderr:
16/10/20 21:03:20 INFO RMProxy: Connecting to ResourceManager at ip-10-17-186-102.ec2.internal/10.17.186.102:8032
16/10/20 21:03:21 INFO Client: Requesting a new application from cluster with 2 NodeManagers
16/10/20 21:03:21 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (53248 MB per container)
16/10/20 21:03:21 INFO Client: Will allocate AM container, with 53247 MB memory including 4840 MB overhead
16/10/20 21:03:21 INFO Client: Setting up container launch context for our AM
16/10/20 21:03:21 INFO Client: Setting up the launch environment for our AM container
16/10/20 21:03:21 INFO Client: Preparing resources for our AM container
16/10/20 21:03:21 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
16/10/20 21:03:22 INFO Client: Uploading resource file:/mnt/tmp/spark-6fceeedf-0ad5-4df1-a63e-c1d7eb1b95b4/__spark_libs__5484581201997889110.zip -> hdfs://ip-10-17-186-102.ec2.internal:8020/user/hadoop/.sparkStaging/application_1476995377469_0002/__spark_libs__5484581201997889110.zip
16/10/20 21:03:24 INFO Client: Uploading resource s3://jars/HelloWorld-assembly-0.0.1-SNAPSHOT.jar -> hdfs://ip-10-17-186-102.ec2.internal:8020/user/hadoop/.sparkStaging/application_1476995377469_0002/DataScience-assembly-0.0.1-SNAPSHOT.jar
16/10/20 21:03:24 INFO S3NativeFileSystem: Opening 's3://jars/HelloWorld-assembly-0.0.1-SNAPSHOT.jar' for reading
16/10/20 21:03:26 INFO Client: Uploading resource file:/mnt/tmp/spark-6fceeedf-0ad5-4df1-a63e-c1d7eb1b95b4/__spark_conf__5724047842379101980.zip -> hdfs://ip-10-17-186-102.ec2.internal:8020/user/hadoop/.sparkStaging/application_1476995377469_0002/__spark_conf__.zip
16/10/20 21:03:26 INFO SecurityManager: Changing view acls to: hadoop
16/10/20 21:03:26 INFO SecurityManager: Changing modify acls to: hadoop
16/10/20 21:03:26 INFO SecurityManager: Changing view acls groups to:
16/10/20 21:03:26 INFO SecurityManager: Changing modify acls groups to:
16/10/20 21:03:26 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); groups with view permissions: Set(); users with modify permissions: Set(hadoop); groups with modify permissions: Set()
16/10/20 21:03:26 INFO Client: Submitting application application_1476995377469_0002 to ResourceManager
16/10/20 21:03:26 INFO YarnClientImpl: Submitted application application_1476995377469_0002
16/10/20 21:03:27 INFO Client: Application report for application_1476995377469_0002 (state: ACCEPTED)
16/10/20 21:03:27 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1476997406896
final status: UNDEFINED
tracking URL: http://ip-10-17-186-102.ec2.internal:20888/proxy/application_1476995377469_0002/
user: hadoop
16/10/20 21:03:28 INFO Client: Application report for application_1476995377469_0002 (state: ACCEPTED)
16/10/20 21:03:29 INFO Client: Application report for application_1476995377469_0002 (state: ACCEPTED)
16/10/20 21:03:30 INFO Client: Application report for application_1476995377469_0002 (state: ACCEPTED)
16/10/20 21:03:31 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:31 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: 10.17.181.184
ApplicationMaster RPC port: 0
queue: default
start time: 1476997406896
final status: UNDEFINED
tracking URL: http://ip-10-17-186-102.ec2.internal:20888/proxy/application_1476995377469_0002/
user: hadoop
16/10/20 21:03:32 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:33 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:34 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:35 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:36 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:37 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:38 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:39 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:40 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:41 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:42 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:43 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:44 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:45 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:46 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:47 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:48 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:49 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:50 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:51 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:52 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:53 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:54 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:55 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:56 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:57 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:58 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:59 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:04:00 INFO Client: Application report for application_1476995377469_0002 (state: FINISHED)
16/10/20 21:04:00 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: 10.17.181.184
ApplicationMaster RPC port: 0
queue: default
start time: 1476997406896
final status: SUCCEEDED
tracking URL: http://ip-10-17-186-102.ec2.internal:20888/proxy/application_1476995377469_0002/
user: hadoop
16/10/20 21:04:00 INFO Client: Deleting staging directory hdfs://ip-10-17-186-102.ec2.internal:8020/user/hadoop/.sparkStaging/application_1476995377469_0002
16/10/20 21:04:00 INFO ShutdownHookManager: Shutdown hook called
16/10/20 21:04:00 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-6fceeedf-0ad5-4df1-a63e-c1d7eb1b95b4
Command exiting with ret '0'
What am I missing?
Looks like your application succeeded just fine. However, there are two reasons why you don't see any output in the step's stdout logs.
1) You ran the application in yarn-cluster mode, which means that the driver runs on a random cluster node rather than on the master node. If you specified an S3 log URI when creating the cluster, you should see the logs for this application in the containers directory of your S3 bucket. The logs for the driver will be in container #0's logs.
2) You did not call anything like "collect()" to bring data from the Spark executors back to the driver, so your println() at the end is not printing the data anyway but rather a toString() representation of the RDD. You probably want to do something like .collect().foreach(println) instead.
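For the first point, the driver output can also be read from the cluster itself. A minimal sketch, assuming YARN log aggregation is enabled and the command is run on a node where the `yarn` CLI is installed (for example the EMR master node). The application ID is the one from the stderr log above, and `driver_log_command` is a hypothetical helper that only prints the command so it can be inspected first:

```shell
#!/usr/bin/env bash
# Application ID taken from the "Submitted application ..." line in the stderr log.
app_id="application_1476995377469_0002"

# Hypothetical helper: print the command that fetches the aggregated container logs,
# which include the driver's stdout when running in cluster deploy mode.
driver_log_command() {
  printf 'yarn logs -applicationId %s\n' "$1"
}

cmd=$(driver_log_command "$app_id")
echo "$cmd"

# On a cluster node, run it for real, e.g.:
#   yarn logs -applicationId "$app_id" | less
```

Even with the logs in hand, the println output will only show the collected word counts once the code change from the second point is made.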

Unable to run a jar or sparkApplication on aws EMR

I have a very simple app that I'm trying to run on AWS EMR. The jar was built using assembly, with Spark as a provided dependency. It resides on S3 along with a text file that I want to use for testing.
In the EMR UI I chose to add a step and entered the details: the location of the jar and the location of the argument file.
It runs but always fails with an error. As a sanity check I then set up a new cluster and ran it again, only to get the same result. Any help is appreciated.
Thank you
The error from the log:
16/03/18 11:40:56 INFO client.RMProxy: Connecting to ResourceManager at ip-10-1-1-234.ec2.internal/10.1.1.234:8032
16/03/18 11:40:56 INFO yarn.Client: Requesting a new application from cluster with 1 NodeManagers
16/03/18 11:40:56 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (11520 MB per container)
16/03/18 11:40:56 INFO yarn.Client: Will allocate AM container, with 1408 MB memory including 384 MB overhead
16/03/18 11:40:56 INFO yarn.Client: Setting up container launch context for our AM
16/03/18 11:40:56 INFO yarn.Client: Setting up the launch environment for our AM container
16/03/18 11:40:56 INFO yarn.Client: Preparing resources for our AM container
16/03/18 11:40:57 INFO yarn.Client: Uploading resource file:/usr/lib/spark/lib/spark-assembly-1.6.0-hadoop2.7.1-amzn-1.jar -> hdfs://ip-10-1-1-234.ec2.internal:8020/user/hadoop/.sparkStaging/application_1458297951763_0003/spark-assembly-1.6.0-hadoop2.7.1-amzn-1.jar
16/03/18 11:40:57 INFO metrics.MetricsSaver: MetricsConfigRecord disabledInCluster: false instanceEngineCycleSec: 60 clusterEngineCycleSec: 60 disableClusterEngine: false maxMemoryMb: 3072 maxInstanceCount: 500 lastModified: 1458297958626
16/03/18 11:40:57 INFO metrics.MetricsSaver: Created MetricsSaver j-DKMA93DFZ456:i-91bff215:SparkSubmit:20036 period:60 /mnt/var/em/raw/i-91bff215_20160318_SparkSubmit_20036_raw.bin
16/03/18 11:40:58 INFO metrics.MetricsSaver: 1 aggregated HDFSWriteDelay 590 raw values into 1 aggregated values, total 1
16/03/18 11:40:59 INFO fs.EmrFileSystem: Consistency disabled, using com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem as filesystem implementation
16/03/18 11:41:00 INFO metrics.MetricsSaver: Thread 1 created MetricsLockFreeSaver 1
16/03/18 11:41:00 INFO yarn.Client: Uploading resource file:/mnt/tmp/spark-030f9d29-f7ca-42fa-9caf-64ea103a2bb1/__spark_conf__7615049662154628286.zip -> hdfs://ip-10-1-1-234.ec2.internal:8020/user/hadoop/.sparkStaging/application_1458297951763_0003/__spark_conf__7615049662154628286.zip
16/03/18 11:41:00 INFO spark.SecurityManager: Changing view acls to: hadoop
16/03/18 11:41:00 INFO spark.SecurityManager: Changing modify acls to: hadoop
16/03/18 11:41:00 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
16/03/18 11:41:01 INFO yarn.Client: Submitting application 3 to ResourceManager
16/03/18 11:41:01 INFO impl.YarnClientImpl: Submitted application application_1458297951763_0003
16/03/18 11:41:02 INFO yarn.Client: Application report for application_1458297951763_0003 (state: ACCEPTED)
16/03/18 11:41:02 INFO yarn.Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1458301261052
final status: UNDEFINED
tracking URL: http://ip-10-1-1-234.ec2.internal:20888/proxy/application_1458297951763_0003/
user: hadoop
16/03/18 11:41:03 INFO yarn.Client: Application report for application_1458297951763_0003 (state: ACCEPTED)
16/03/18 11:41:04 INFO yarn.Client: Application report for application_1458297951763_0003 (state: ACCEPTED)
16/03/18 11:41:05 INFO yarn.Client: Application report for application_1458297951763_0003 (state: ACCEPTED)
16/03/18 11:41:06 INFO yarn.Client: Application report for application_1458297951763_0003 (state: ACCEPTED)
16/03/18 11:41:07 INFO yarn.Client: Application report for application_1458297951763_0003 (state: ACCEPTED)
16/03/18 11:41:08 INFO yarn.Client: Application report for application_1458297951763_0003 (state: ACCEPTED)
16/03/18 11:41:09 INFO yarn.Client: Application report for application_1458297951763_0003 (state: ACCEPTED)
16/03/18 11:41:10 INFO yarn.Client: Application report for application_1458297951763_0003 (state: ACCEPTED)
16/03/18 11:41:11 INFO yarn.Client: Application report for application_1458297951763_0003 (state: ACCEPTED)
16/03/18 11:41:12 INFO yarn.Client: Application report for application_1458297951763_0003 (state: ACCEPTED)
16/03/18 11:41:13 INFO yarn.Client: Application report for application_1458297951763_0003 (state: ACCEPTED)
16/03/18 11:41:14 INFO yarn.Client: Application report for application_1458297951763_0003 (state: ACCEPTED)
16/03/18 11:41:15 INFO yarn.Client: Application report for application_1458297951763_0003 (state: ACCEPTED)
16/03/18 11:41:16 INFO yarn.Client: Application report for application_1458297951763_0003 (state: ACCEPTED)
16/03/18 11:41:17 INFO yarn.Client: Application report for application_1458297951763_0003 (state: ACCEPTED)
16/03/18 11:41:18 INFO yarn.Client: Application report for application_1458297951763_0003 (state: ACCEPTED)
16/03/18 11:41:19 INFO yarn.Client: Application report for application_1458297951763_0003 (state: ACCEPTED)
16/03/18 11:41:20 INFO yarn.Client: Application report for application_1458297951763_0003 (state: ACCEPTED)
16/03/18 11:41:21 INFO yarn.Client: Application report for application_1458297951763_0003 (state: ACCEPTED)
16/03/18 11:41:22 INFO yarn.Client: Application report for application_1458297951763_0003 (state: ACCEPTED)
16/03/18 11:41:23 INFO yarn.Client: Application report for application_1458297951763_0003 (state: ACCEPTED)
16/03/18 11:41:24 INFO yarn.Client: Application report for application_1458297951763_0003 (state: ACCEPTED)
16/03/18 11:41:25 INFO yarn.Client: Application report for application_1458297951763_0003 (state: ACCEPTED)
16/03/18 11:41:26 INFO yarn.Client: Application report for application_1458297951763_0003 (state: ACCEPTED)
16/03/18 11:41:27 INFO yarn.Client: Application report for application_1458297951763_0003 (state: ACCEPTED)
16/03/18 11:41:28 INFO yarn.Client: Application report for application_1458297951763_0003 (state: ACCEPTED)
16/03/18 11:41:29 INFO yarn.Client: Application report for application_1458297951763_0003 (state: ACCEPTED)
16/03/18 11:41:30 INFO yarn.Client: Application report for application_1458297951763_0003 (state: ACCEPTED)
16/03/18 11:41:31 INFO yarn.Client: Application report for application_1458297951763_0003 (state: ACCEPTED)
16/03/18 11:41:32 INFO yarn.Client: Application report for application_1458297951763_0003 (state: FAILED)
16/03/18 11:41:32 INFO yarn.Client:
client token: N/A
diagnostics: Application application_1458297951763_0003 failed 2 times due to AM Container for appattempt_1458297951763_0003_000002 exited with exitCode: 15
For more detailed output, check application tracking page:http://ip-10-1-1-234.ec2.internal:8088/cluster/app/application_1458297951763_0003Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1458297951763_0003_02_000001
Exit code: 15
Stack trace: ExitCodeException exitCode=15:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 15
Failing this attempt. Failing the application.
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1458301261052
final status: FAILED
tracking URL: http://ip-10-1-1-234.ec2.internal:8088/cluster/app/application_1458297951763_0003
user: hadoop
Exception in thread "main" org.apache.spark.SparkException: Application application_1458297951763_0003 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1029)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1076)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
16/03/18 11:41:32 INFO util.ShutdownHookManager: Shutdown hook called
16/03/18 11:41:32 INFO util.ShutdownHookManager: Deleting directory /mnt/tmp/spark-030f9d29-f7ca-42fa-9caf-64ea103a2bb1
Command exiting with ret '1'
Referring to issue Running Spark Job on Yarn Cluster
This error can have many causes. In our case, we got a similar error message because of an unsupported Java class version, and we fixed the problem by deleting the referenced Java class from our project.
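One way to check for a class-version mismatch is to read the major version stored in the header of a compiled class file (bytes 7 and 8, right after the `CAFEBABE` magic number and the minor version). The sketch below is only an illustration: the function name and the jar and class paths in the usage comment are hypothetical. Major version 51 corresponds to Java 7 and 52 to Java 8.

```shell
# Print the major version stored in bytes 7-8 of a compiled .class file.
# Layout: 4 bytes magic, 2 bytes minor version, 2 bytes major version.
class_major_version() {
  od -An -j6 -N2 -tu1 "$1" | awk '{ print $1 * 256 + $2 }'
}

# Usage (hypothetical jar and class names):
#   unzip -p my-app.jar com/example/Foo.class > /tmp/Foo.class
#   class_major_version /tmp/Foo.class
```

If the number printed is higher than what the cluster's JVM supports, the class was compiled for a newer Java release than the one running the containers.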
Use this command to see the detailed error message:
yarn logs -applicationId application_1458297951763_0003
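The aggregated logs can be long, so it may help to filter them for lines that look like errors. This is a rough sketch, assuming the YARN CLI is on the path and log aggregation is enabled on the cluster; the helper name and the pattern list are assumptions, not part of any tool.

```shell
# Print the first error-looking lines (with line numbers) from a log stream on stdin.
# The pattern list is an assumption; adjust it to match your application's logs.
first_errors() {
  grep -n -E 'Exception|ERROR|Caused by|OutOfMemoryError|UnsupportedClassVersionError' \
    | head -n "${1:-20}"
}

# Usage (needs the YARN CLI and aggregated logs):
#   yarn logs -applicationId application_1458297951763_0003 | first_errors 20
```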

Spark submit exit in yarn-cluster with -1000

When I submit the jars in local mode everything works fine, but in yarn-cluster mode the job fails with the error below.
spark-submit --class com.ffx.events.logevents.LogParser --master yarn-cluster \
  --jars hdfs://EC22.internal:8020/user/hadoop/dir/src/guava-18.0.jar \
  --driver-class-path logevents-1.0.0-SNAPSHOT-jar-with-dependencies.jar,hdfs://EC22.internal:8020/user/hadoop/dir/src/netty-3.6.2.Final.jar \
  hdfs://ec2ipaddress.internal:8020/user/hadoop/dir/src/logevents-1.0.0-SNAPSHOT-jar-with-dependencies.jar.filepart \
  -input /from-s3-1year/ffx-data/20151001/* \
  -output hdfs://EC22.internal:8020/user/hadoop/dir/output/20151120
Part of the error trace:
15/11/20 14:00:00 INFO Client: Application report for application_1447788091680_0045 (state: ACCEPTED)
15/11/20 14:00:01 INFO Client: Application report for application_1447788091680_0045 (state: ACCEPTED)
15/11/20 14:00:02 INFO Client: Application report for application_1447788091680_0045 (state: ACCEPTED)
15/11/20 14:00:03 INFO Client: Application report for application_1447788091680_0045 (state: ACCEPTED)
15/11/20 14:00:04 INFO Client: Application report for application_1447788091680_0045 (state: FAILED)
15/11/20 14:00:04 INFO Client:
client token: N/A
diagnostics: Application application_1447788091680_0045 failed 2 times due to AM Container for appattempt_1447788091680_0045_000002 exited with exitCode: -1000
For more detailed output, check application tracking page:http://EC22.internal:20888/proxy/application_1447788091680_0045/Then, click on links to logs of each attempt.
Diagnostics: File does not exist: hdfs://EC22.internal:8020/user/hadoop/.sparkStaging/application_1447788091680_0045/__spark_conf__7360399142592952913.zip
java.io.FileNotFoundException: File does not exist: hdfs://EC22.internal:8020/user/hadoop/.sparkStaging/application_1447788091680_0045/__spark_conf__7360399142592952913.zip
I'm not familiar with Spark and would appreciate some help.

'java.lang.OutOfMemoryError: Java heap space' error in spark application while trying to read the avro file and performing Actions

The Avro file is around 44 MB.
Below is the error from the YARN logs:
20/03/30 06:55:04 INFO spark.ExecutorAllocationManager: Existing executor 18 has been removed (new total is 0)
20/03/30 06:55:04 INFO cluster.YarnClusterScheduler: Cancelling stage 5
20/03/30 06:55:04 INFO scheduler.DAGScheduler: ResultStage 5 (head at IrdsFIInstrumentEnricher.scala:15) failed in 213.391 s due to Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 134, fratlhadooappd30.de.db.com, executor 18): ExecutorLostFailure (executor 18 exited caused by one of the running tasks) Reason: Container marked as failed: container_1585337469684_0037_02_000029 on host: fratlhadooappd30.de.db.com. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Killed by external signal
Driver stacktrace:
20/03/30 06:55:04 INFO scheduler.DAGScheduler: Job 3 failed: head at IrdsFIInstrumentEnricher.scala:15, took 213.427308 s
20/03/30 06:55:04 ERROR CCOIrdsEnrichmentService: Unexpected error
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 134, fratlhadooappd30.de.db.com, executor 18): ExecutorLostFailure (executor 18 exited caused by one of the running tasks) Reason: Container marked as failed: container_1585337469684_0037_02_000029 on host: fratlhadooappd30.de.db.com. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Killed by external signal
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
...
20/03/30 06:48:19 INFO storage.DiskBlockManager: Shutdown hook called
20/03/30 06:48:19 INFO util.ShutdownHookManager: Shutdown hook called
LogType:stdout
Log Upload Time:Mon Mar 30 06:55:10 +0200 2020
LogLength:124
Log Contents:
java.lang.OutOfMemoryError: Java heap space
-XX:OnOutOfMemoryError="kill %p"
Executing /bin/sh -c "kill 62191"...
LogType:container-localizer-syslog
Log Upload Time:Mon Mar 30 06:55:10 +0200 2020
LogLength:0
Log Contents:
Below is the code I am using:
// Read the Avro file and keep only the first row.
val fiDF = spark.read
  .format("com.databricks.spark.avro")
  .load("C:\\Users\\kativikb\\Downloads\\Temp\\cco-irds\\rds_db_global_rds_fi-instrument_20200328000000_v1_block3_snapshot-inc.avro")
  .limit(1)
// Select a single nested field from the payload.
val tempDF = fiDF.select("payload.identifier.id")
tempDF.show(10) // ******* Error at this line *******
This happened because the Avro schema was too large and I was using Spark 2.1.0, which appears to have a bug with large schemas.
This has been fixed in 2.4.0.
I solved the error by supplying my own custom schema that contains only the required fields.
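As a side note on the YARN log in the question, exit status 143 follows the common convention of 128 plus a signal number, so it corresponds to signal 15 (SIGTERM). That is consistent with the `kill` command run by the `-XX:OnOutOfMemoryError` handler shown in the stdout log. A minimal sketch for decoding such statuses, with a hypothetical function name:

```shell
# Decode a "128 + N" exit status into the name of signal N.
signal_from_exit_status() {
  status=$1
  if [ "$status" -gt 128 ]; then
    kill -l "$((status - 128))"
  else
    echo "not a signal exit: $status"
  fi
}

signal_from_exit_status 143   # prints TERM
```

Seeing 143 on an executor container therefore says only that the process was terminated by a signal; the underlying cause, here the heap exhaustion, has to be read from the container's own stdout or stderr log.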