Output results of spark-submit - scala

I'm a beginner in Spark and Scala programming. I tried running an example with spark-submit in local mode. It runs to completion without any error or other message, but I can't see any output in the console or in the Spark history web UI. Where and how can I see the results of a program run with spark-submit?
This is the command that I run:
spark-submit --master local[*] --conf spark.history.fs.logDirectory=/tmp/spark-events --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=/tmp/spark-events --conf spark.history.ui.port=18080 --class com.intel.analytics.bigdl.models.autoencoder.Train dist/lib/bigdl-0.5.0-SNAPSHOT-jar-with-dependencies.jar -f /opt/work/mnist -b 8
And this is a screenshot from the end of the run:

You can also set this up once in spark-defaults.conf instead of on every command line:
Locate your spark-defaults.conf (or copy spark-defaults.conf.template to spark-defaults.conf)
Create a logging directory (for example /tmp/spark-events/)
Add these 2 lines:
spark.eventLog.enabled true
spark.eventLog.dir file:///tmp/spark-events/
And run sbin/start-history-server.sh
Every job run by spark-submit will then log to the event directory, and its overview stays available in the History Server web UI (http://localhost:18080/) without having to keep your Spark job running.
More info: https://spark.apache.org/docs/latest/monitoring.html
PS: On a Mac with Homebrew, all of this is in the subdirectories of /usr/local/Cellar/apache-spark/[version]/libexec/
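The steps above can be sketched as a short shell session. This is only a sketch under assumptions: SPARK_CONF_DIR and EVENT_DIR are variables introduced here for illustration, SPARK_CONF_DIR defaults to a scratch directory so the script can be tried without touching a real installation, and the History Server start is left commented out because it needs an actual Spark install with SPARK_HOME set.

```shell
#!/bin/sh
# Assumption: SPARK_CONF_DIR should point at your Spark conf directory.
# It defaults to a scratch directory here so the sketch is safe to try.
SPARK_CONF_DIR="${SPARK_CONF_DIR:-$(mktemp -d)}"
# Assumption: event logs go to /tmp/spark-events, as in the answer above.
EVENT_DIR="${EVENT_DIR:-/tmp/spark-events}"

# 1. Create the directory the event logs will be written to.
mkdir -p "$EVENT_DIR"

# 2. Append the two settings to spark-defaults.conf.
cat >> "$SPARK_CONF_DIR/spark-defaults.conf" <<EOF
spark.eventLog.enabled true
spark.eventLog.dir file://$EVENT_DIR
EOF

# 3. Start the History Server (uncomment on a machine with Spark installed;
#    SPARK_HOME is assumed to be set).
# "$SPARK_HOME/sbin/start-history-server.sh"

echo "wrote $SPARK_CONF_DIR/spark-defaults.conf"
```

If the event directory is anything other than /tmp/spark-events, the History Server also has to be told where to read from, via spark.history.fs.logDirectory as in the question's command.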

Try adding while(true) Thread.sleep(1000) at the end of your code to keep the driver and its web UI running, then check the Spark tasks in the browser. Normally you should see your application running.

Thanks a lot for your answer. I already made these settings in the spark-submit command using "--conf", and I can see the history web UI with "spark-class org.apache.spark.deploy.history.HistoryServer", but I don't have access to "start-history-server.sh". I see the completed tasks and jobs in the history web UI. I checked all the tabs (Jobs, Stages, Storage, Executors) and did not find the output results anywhere.
Can you explain where the results are in the history web UI, or even in the console? (My goal is the numerical results produced from the dataset passed to the spark-submit command.)
Screenshot from the history web UI
Regards

To get the output from spark-submit, you can add the line below to your code.scala file, which is created and saved under src/main/scala before running the sbt package command.
code.scala contents ->
........
........
result.saveAsTextFile("file:///home/centos/project")
Now run the "sbt package" command followed by "spark-submit". It will create the project folder at the given location. This folder will contain two files: part-00000 and _SUCCESS. You can check your output in the file part-00000.
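To check the result from a terminal, something like the following sketch could be used. It assumes the output directory from the answer above (/home/centos/project); show_output is a hypothetical helper name, not part of Spark. saveAsTextFile writes one part file per partition, so there may be more than part-00000, which is why the sketch reads part-* rather than a single file.

```shell
#!/bin/sh
# show_output DIR
# Prints every part file written by saveAsTextFile, in name order, but only
# if the _SUCCESS marker shows that the job finished writing.
show_output() {
  dir="$1"
  if [ ! -f "$dir/_SUCCESS" ]; then
    echo "no _SUCCESS marker in $dir; the job may not have finished" >&2
    return 1
  fi
  cat "$dir"/part-*
}

# Usage (path taken from the answer above):
#   show_output /home/centos/project
```

Note that saveAsTextFile refuses to write into a directory that already exists, so the folder has to be removed or renamed before the job is run again.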

Related

How to use spark-submit's --properties-file option to launch Spark application in IntelliJ IDEA?

I'm starting a project using Spark, developed with Scala and the IntelliJ IDE.
I was wondering how to set --properties-file with a specific Spark configuration in the IntelliJ run configuration.
I'm reading the configuration like this: "param1" -> sc.getConf.get("param1")
When I execute the Spark job from the command line, it works like a charm:
/opt/spark/bin/spark-submit --class "com.class.main" --master local --properties-file properties.conf ./target/scala-2.11/main.jar arg1 arg2 arg3 arg4
The problem is when I execute the job using an IntelliJ Run Configuration with VM Options:
I succeeded with the --master param as -Dspark.master=local
I succeeded with --conf params as -Dspark.param1=value1
I failed with --properties-file
Can anyone point me to the right way to set this up?
I don't think it's possible to use --properties-file to launch a Spark application from within IntelliJ IDEA.
spark-submit is the shell script that submits a Spark application for execution, and it does a few extra things before it creates a proper submission environment for the Spark application.
You can, however, mimic the behaviour of --properties-file by leveraging conf/spark-defaults.conf, which a Spark application loads by default.
You could create a conf/spark-defaults.conf under src/test/resources (or src/main/resources) with the content of properties.conf. That is supposed to work.
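Since the question reports that -D options in VM Options do work, another route is to translate properties.conf into -D options and paste them into the run configuration. The following is only a sketch: props_to_vm_options is a hypothetical helper, and it assumes one key/value pair per line, separated by "=" or whitespace, with no line continuations and no spaces inside values. It only emits keys that start with spark., matching the -Dspark.* options shown in the question.

```shell
#!/bin/sh
# props_to_vm_options FILE
# Prints one -Dkey=value option per spark.* entry in FILE.
props_to_vm_options() {
  awk '
    /^[ \t]*#/ || /^[ \t]*$/ { next }   # skip comments and blank lines
    {
      line = $0
      sub(/^[ \t]+/, "", line)          # trim leading whitespace
      sub(/[ \t]+$/, "", line)          # trim trailing whitespace
      # Split at the first "=" or run of whitespace.
      if (match(line, /[ \t]*[= \t][ \t]*/)) {
        key = substr(line, 1, RSTART - 1)
        val = substr(line, RSTART + RLENGTH)
      } else {
        key = line
        val = ""
      }
      # Keep only Spark settings.
      if (key ~ /^spark\./) printf "-D%s=%s\n", key, val
    }
  ' "$1"
}

# Usage:
#   props_to_vm_options properties.conf
# then paste the printed options into the VM Options field.
```

Keys without the spark. prefix are dropped on purpose; the question only confirms that the -Dspark.* form reaches the application.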

DStream.print() not outputting to console

I am running a simple HDFS stream Spark job which counts the words in text files located in an HDFS directory.
The code is taken from this example.
I notice that there is no output from the wordCounts.print() line, even though the log says that the new file was detected.
Looking at the source for the function, it seems I am definitely not getting output, as the 'time' heading isn't being printed either.
I am running on AWS EMR.
Any idea what could be going wrong?
./spark/bin/spark-submit --class com.rory.sparktest.WordCountStream --deploy-mode client --executor-memory 2G /home/hadoop/SparkStreamingTest2-1.0-bar.jar file:///home/hadoop/streamtest/

Where are the Spark logs on EMR?

I'm not able to locate error logs or messages from println calls in Scala while running jobs on Spark in EMR.
Where can I access these?
I'm submitting the Spark job, written in Scala, to EMR using script-runner.jar with the arguments --deploy-mode set to cluster and --master set to yarn. It runs the job fine.
However, I do not see my println statements in the Amazon EMR UI where it lists "stderr", "stdout", etc. Furthermore, if my job fails I don't see why it had an error. All I see is this in the stderr:
15/05/27 20:24:44 INFO yarn.Client: Application report from ResourceManager:
application identifier: application_1432754139536_0002
appId: 2
clientToAMToken: null
appDiagnostics:
appMasterHost: ip-10-185-87-217.ec2.internal
appQueue: default
appMasterRpcPort: 0
appStartTime: 1432758272973
yarnAppState: FINISHED
distributedFinalState: FAILED
appTrackingUrl: http://10.150.67.62:9046/proxy/application_1432754139536_0002/A
appUser: hadoop
With deploy mode cluster on YARN, the Spark driver, and hence the user code it executes, runs inside the Application Master container. It sounds like you had EMR debugging enabled on the cluster, so the logs should also have been pushed to S3. In the S3 location, look at task-attempts/<applicationid>/<firstcontainer>/*.
If you SSH into the master node of your cluster, you should be able to find the stdout, stderr, syslog and controller logs under:
/mnt/var/log/hadoop/steps/<stepname>
I also spent a lot of time figuring this out. I found the logs in the following location:
EMR UI Console -> Summary -> Log URI -> Containers -> application_xxx_xxx -> container_yyy_yy_yy -> stdout.gz.
The event logs, the ones required by the Spark history server, can be found at:
hdfs:///var/log/spark/apps
If you submit your job with emr-bootstrap, you can specify the log directory as an S3 bucket with --log-uri

How to run a map reduce job using Java -jar command

I wrote a MapReduce job in Java.
Configuration settings:
Configuration configuration = new Configuration();
configuration.set("fs.defaultFS", "hdfs://127.0.0.1:9000");
configuration.set("mapreduce.job.tracker", "localhost:54311");
configuration.set("mapreduce.framework.name", "yarn");
configuration.set("yarn.resourcemanager.address", "localhost:8032");
I ran it in different ways:
case 1: "Using the hadoop and yarn commands": works fine
case 2: "Using Eclipse": works fine
case 3: "Using java -jar after removing all configuration.set() calls":
Configuration configuration = new Configuration();
Runs successfully, but the job status is not displayed on YARN (default port 8088)
case 4: "Using java -jar": error
Stack trace:
Exception in thread "main" java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.
at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:120)
at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:82)
at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:75)
at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1255)
at org.apache.hadoop.mapreduce.Job$9.run(Job.java:1251)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
at org.apache.hadoop.mapreduce.Job.connect(Job.java:1250)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1279)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1303)
at com.my.cache.run.MyTool.run(MyTool.java:38)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at com.my.main.Main.main(Main.java:45)
Please tell me how to run a MapReduce job using the "java -jar" command while still being able to check its status and logs on YARN (default port 8088).
Why I need this: I want to create a web service that submits a MapReduce job (without using the Java runtime library to execute yarn or hadoop commands).
In my opinion, it's quite difficult to run a Hadoop application without the hadoop command. You are better off using hadoop jar than java -jar.
I think you don't have a Hadoop environment on your machine. First, make sure Hadoop is running well on your machine.
Personally, I prefer to set the configuration in mapred-site.xml, core-site.xml, yarn-site.xml and hdfs-site.xml. I know a clear tutorial for installing a Hadoop cluster here
At this step, you can monitor HDFS on port 50070, the YARN cluster on port 8088, and the MapReduce job history on port 19888.
Then, verify that your HDFS and YARN environments are running well. For HDFS, try simple commands like mkdir, copyToLocal, copyFromLocal, etc.; for YARN, try the sample wordcount project.
Once you have a Hadoop environment, you can create your own MapReduce application (you can use any IDE). You will probably need this tutorial. Compile it and package it as a jar.
Open your terminal and run this command:
hadoop jar <path to jar> <arg1> <arg2> ... <arg n>
Hope this is helpful.

Spark and YARN. How does the SparkPi example use the first argument as the master url?

I'm just starting out learning Spark, and I'm trying to replicate the SparkPi example by copying the code into a new project and building a jar. The source for SparkPi is: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/SparkPi.scala
I have a working YARN cluster (running CDH 5.0.1), and I've uploaded the Spark assembly jar and set its HDFS location in SPARK_JAR.
If I run this command, the example works:
$ SPARK_CLASSPATH=/usr/lib/spark/examples/lib/spark-examples_2.10-0.9.0-cdh5.0.1.jar /usr/lib/spark/bin/spark-class org.apache.spark.examples.SparkPi yarn-client 10
However, if I copy the source into a new project and build a jar and run the same command (with a different jar and classname), I get the following error:
$ SPARK_CLASSPATH=Spark.jar /usr/lib/spark/bin/spark-class spark.SparkPi yarn-client 10
Exception in thread "main" org.apache.spark.SparkException: A master URL must be set in your configuration
at org.apache.spark.SparkContext.<init>(SparkContext.scala:113)
at spark.SparkPi$.main(SparkPi.scala:9)
at spark.SparkPi.main(SparkPi.scala)
Somehow, the first argument isn't being passed as the master in the SparkContext in my version, but it works fine in the example.
Looking at the SparkPi code, it seems to only expect a single numeric argument.
So is there something about the Spark examples jar file which intercepts the first argument and somehow sets the spark.master property to be that?
This is a recent change: you are running old code in the first case and new code in the second.
Here is the change: https://github.com/apache/spark/commit/44dd57fb66bb676d753ad8d9757f9f4c03364113
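I think this would be the right command now (spark-submit options have to come before the application jar; everything after the jar is passed to the application as arguments):
/usr/lib/spark/bin/spark-submit --class spark.SparkPi --master yarn-client Spark.jar 10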