DStream.print() not outputting to console - scala

I am running a simple HDFS streaming Spark job which counts the words in text files located in an HDFS directory.
The code is taken from this example.
I notice that there is no output from the wordCounts.print() line, even though the log says that the new file was detected.
Looking at the source for the function, it seems I am definitely not getting output, as the 'time' heading isn't being printed either.
I am running on AWS EMR.
Any idea what could be going wrong?
./spark/bin/spark-submit --class com.rory.sparktest.WordCountStream --deploy-mode client --executor-memory 2G /home/hadoop/SparkStreamingTest2-1.0-bar.jar file:///home/hadoop/streamtest/
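For reference, here is a minimal sketch of the kind of textFileStream word count the question describes (the object name, batch interval, and field separator are placeholders, not the asker's actual code); note that textFileStream only picks up files that are newly created or moved into the monitored directory after the stream starts.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WordCountStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCountStream")
    val ssc = new StreamingContext(conf, Seconds(10))
    // Monitor the directory passed on the command line (e.g. file:///home/hadoop/streamtest/)
    val lines = ssc.textFileStream(args(0))
    val wordCounts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    // Prints a "Time: ..." header plus the first elements of each batch on the driver
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}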

Related

Apache Spark submit --files java.io.FileNotFoundException

I have a Spark script that reads the contents of a file with spark.read.textFile(filePath).
I run it from the master's container itself and try to pass this file with the --files param, like so:
./spark-submit --class nameOfClass --files local/path/to/file.csv --master spark://master_ip generated_executable.jar local/path/to/file.csv
But then I get the error
java.io.FileNotFoundException: File file:/local/path/to/file.csv does not exist
I tried changing the line to:
spark.read.textFile(SparkFiles.get(fileName))
but the error persists, now it says
java.io.FileNotFoundException: File file:/mnt/mesos/sandbox/spark-946bbaef-a258-4951-9b15-bec77b78bf5d/userFiles-3f9dcf85-4114-4968-b625-6bb1498f568d/file.csv does not exist
If I manually add the file to each worker, it works, but I don't want to do that. Is there a way to pass the file from the host where the job is submitted?
One good alternative would be to put the CSV file on HDFS; then you don't have to worry about the file residing on every executor. I don't think it's a common pattern to pass a file to a spark-submit invocation just in order to read it.
Assuming it's located at hdfs://tmp/file.csv, you can read it with
spark.read.textFile("/tmp/file.csv")
which lets you dispense with passing the file via --files entirely.
Alternatively, you could run the job in local mode. This is an easy workaround that avoids placing the file on every node manually; Spark should then find the file just as you want under
file:/local/path/to/file.csv
I hope that helps.
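For illustration, a minimal sketch of the HDFS approach described above (paths and app name are placeholders):
import org.apache.spark.sql.SparkSession

// File uploaded once beforehand, e.g. with: hdfs dfs -put local/path/to/file.csv /tmp/file.csv
val spark = SparkSession.builder().appName("ReadCsvFromHdfs").getOrCreate()
val lines = spark.read.textFile("/tmp/file.csv")   // resolved against the default (HDFS) file system
println(lines.count())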

Output results of spark-submit

I'm a beginner in Spark and Scala programming. I tried running an example with spark-submit in local mode; it runs to completion without any error or other message, but I can't see any output in the console or the Spark history web UI. Where and how can I see the results of my program run with spark-submit?
This is the command that I run on Spark:
spark-submit --master local[*] --conf spark.history.fs.logDirectory=/tmp/spark-events --conf spark.eventLog.enabled=true --conf spark.eventLog.dir=/tmp/spark-events --conf spark.history.ui.port=18080 --class com.intel.analytics.bigdl.models.autoencoder.Train dist/lib/bigdl-0.5.0-SNAPSHOT-jar-with-dependencies.jar -f /opt/work/mnist -b 8
and this is a screenshot from the end of the program run
You can also locate your spark-defaults.conf (or spark-defaults.conf.template, and copy it to spark-defaults.conf)
Create a logging dir (like /tmp/spark-events/)
Add these 2 lines:
spark.eventLog.enabled true
spark.eventLog.dir file:///tmp/spark-events/
And run sbin/start-history-server.sh
This makes all jobs run by spark-submit log to the event dir, with overviews available in the History Server web UI (http://localhost:18080/), without keeping your Spark job running.
More info: https://spark.apache.org/docs/latest/monitoring.html
PS: On mac via homebrew this is all in the subdirs /usr/local/Cellar/apache-spark/[version]/libexec/
Try adding while(true) Thread.sleep(1000) in your code to keep the application running, then check the Spark tasks in the browser. Normally you should see your application running.
Thanks a lot for your answer. I already made these settings in the spark-submit command using "--conf", and I can see the history web UI with "spark-class org.apache.spark.deploy.history.HistoryServer", but I don't have access to "start-history-server.sh". I see tasks and jobs completed in the history web UI and checked all the tabs (jobs, stages, storage, executors), but found the output results nowhere.
Can you explain to me where the results are in the history web UI, or even in the console? (My goal is the numerical results output from the dataset given in the spark-submit command.)
screenshot from web UI history
Regards
To get the output from spark-submit, you can add the command below to your code.scala file, which you create and save under src/main/scala before running the sbt package command.
code.scala contents ->
........
........
result.saveAsTextFile("file:///home/centos/project")
Now run the "sbt package" command followed by "spark-submit". It will create the project folder at the given location. This folder will contain two files: part-00000 and _SUCCESS. You can check your output in the part-00000 file.
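For context, here is a minimal self-contained sketch of that pattern (the object name and example data are placeholders, not the contents of the answerer's code.scala):
import org.apache.spark.{SparkConf, SparkContext}

object Code {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SaveResults"))
    // Single partition so the output lands in one part-00000 file
    val result = sc.parallelize(Seq(1, 2, 3), 1).map(_ * 2)
    // In local mode this creates the directory with part-00000 and _SUCCESS inside it
    result.saveAsTextFile("file:///home/centos/project")
    sc.stop()
  }
}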

Spark-Submit execution time

I have developed a Scala program on Spark which connects to a MySQL database to pull about 250K records and process them. When I execute the application from the IDE itself (IntelliJ) it takes about 1 minute to complete the job, whereas if I submit it through spark-submit from my terminal it takes 4 minutes.
Scala Code
val sparkSession = SparkSession.builder()
  .appName("credithistory")
  .master("local[*]")
  .getOrCreate()
From Terminal
spark-submit --master local[*] .....
Should I make any changes, or is this normal behaviour? I have local[*] in the code and I am also supplying it from the terminal.
local[K] Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).
This is from the reference on the Spark web page (link).
You can adjust the number K, for example "local[4]" or "local[8]", according to your CPU.
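If you want the value passed on the command line to take effect, note that settings set explicitly in code take precedence over spark-submit flags, so a sketch like the following (app name kept from the question, thread count illustrative) leaves the master to be supplied at submit time:
import org.apache.spark.sql.SparkSession

// No .master() here: run with e.g.  spark-submit --master local[8] ...
val sparkSession = SparkSession.builder()
  .appName("credithistory")
  .getOrCreate()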

Spark (Scala) Writing (and reading) to local file system from driver

1st Question :
I have a 2 node virtual cluster with hadoop.
I have a jar that runs a spark job.
This jar accepts as a CLI argument a path to a commands.txt file, which tells the jar which commands to run.
I run the job with spark-submit, and I noticed that my slave node wasn't working because it couldn't find the commands.txt file, which was local on the master.
This is the command I used to run it:
./spark-1.6.1-bin-hadoop2.6/bin/spark-submit \
  --class univ.bigdata.course.MainRunner \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 1g \
  --num-executors 4 \
  final-project-1.0-SNAPSHOT.jar commands commands.txt
Do I need to upload commands.txt to HDFS and give the HDFS path instead, as follows?
hdfs://master:9000/user/vagrant/commands.txt
2nd Question :
How do I write to a file in the cwd on the driver machine?
I used a normal Scala FileWriter to write output to queries_out.txt, and it worked fine when using spark-submit with
--master local[*]
But when running with
--master yarn
I can't find the file. No exceptions are thrown, but I just can't locate the file; it doesn't exist, as if it was never written. Is there a way to write the results to a file locally on the driver machine, or should I only write results to HDFS?
Thanks.
Question 1: Yes, uploading it to HDFS or any other network-accessible file system is how you solve your problem.
Question 2:
This is a bit tricky. Assuming your results are in an RDD, you could call collect(), which will aggregate all the data in your driver process. Then you have a standard collection in your hands, which you can simply write to disk. Note that you should give your driver process enough memory to hold all the results, and do not forget to also increase the maximum result size. The parameters are:
--driver-memory 16G
--conf "spark.driver.maxResultSize=15g"
This has absolutely poor scaling behaviour in both communication complexity and memory (both growing with the size of the result RDD). It is the easiest way and perfectly fine for a toy project, or when the data set is always small, but in all other cases it will certainly blow up at some point.
The better way, as you may have mentioned, is to use the built-in "saveAs" methods to write to, for example, HDFS (or another storage format). You can check the documentation for that: http://spark.apache.org/docs/latest/programming-guide.html#actions
Note that if you only want to persist the RDD because you are reusing it in several computations (like cache, but holding it on disk instead of in memory), there is also a persist method on RDDs.
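A minimal sketch of the collect-then-write approach (the RDD here is a stand-in for the real result; the output file lands in the driver's current working directory):
import java.io.PrintWriter
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("WriteOnDriver"))
val results = sc.parallelize(Seq("a", "b", "c"))   // stand-in for the real result RDD
val collected = results.collect()                  // pulls everything into driver memory
val out = new PrintWriter("queries_out.txt")       // written in the driver's cwd
try collected.foreach(line => out.println(line)) finally out.close()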
The solution was very simple: I changed --deploy-mode cluster to --deploy-mode client, and then the file writes were done correctly on the machine where I ran the driver.
Answer to Question 1:
Submitting a Spark job with the --files flag followed by the path to a local file distributes the file from the driver node to the cwd of all the worker nodes, so it can then be accessed just by using its name.
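A hedged sketch of that pattern for the commands.txt case (assuming the --files mechanism described just above; the reading code is illustrative, not the asker's):
./spark-1.6.1-bin-hadoop2.6/bin/spark-submit \
  --class univ.bigdata.course.MainRunner \
  --master yarn --deploy-mode cluster \
  --files commands.txt \
  final-project-1.0-SNAPSHOT.jar commands commands.txt
and inside the job the distributed copy can be resolved by name:
import org.apache.spark.SparkFiles
import scala.io.Source

// SparkFiles.get returns the local path of a file shipped with --files / SparkContext.addFile
val commands = Source.fromFile(SparkFiles.get("commands.txt")).getLines().toList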

Spark and YARN. How does the SparkPi example use the first argument as the master url?

I'm just starting out learning Spark, and I'm trying to replicate the SparkPi example by copying the code into a new project and building a jar. The source for SparkPi is: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/SparkPi.scala
I have a working YARN cluster (running CDH 5.0.1), and I've uploaded the spark assembly jar and set its HDFS location in SPARK_JAR.
If I run this command, the example works:
$ SPARK_CLASSPATH=/usr/lib/spark/examples/lib/spark-examples_2.10-0.9.0-cdh5.0.1.jar /usr/lib/spark/bin/spark-class org.apache.spark.examples.SparkPi yarn-client 10
However, if I copy the source into a new project and build a jar and run the same command (with a different jar and classname), I get the following error:
$ SPARK_CLASSPATH=Spark.jar /usr/lib/spark/bin/spark-class spark.SparkPi yarn-client 10
Exception in thread "main" org.apache.spark.SparkException: A master URL must be set in your configuration
at org.apache.spark.SparkContext.<init>(SparkContext.scala:113)
at spark.SparkPi$.main(SparkPi.scala:9)
at spark.SparkPi.main(SparkPi.scala)
Somehow, the first argument isn't being passed as the master in the SparkContext in my version, but it works fine in the example.
Looking at the SparkPi code, it seems to only expect a single numeric argument.
So is there something about the Spark examples jar file which intercepts the first argument and somehow sets the spark.master property to be that?
This is a recent change — you are running old code in the first case and running new code in the second.
Here is the change: https://github.com/apache/spark/commit/44dd57fb66bb676d753ad8d9757f9f4c03364113
I think this would be the right command now:
/usr/lib/spark/bin/spark-submit Spark.jar --class spark.SparkPi yarn-client 10
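For reference, a paraphrased sketch of the newer SparkPi style after that change (not the exact upstream source): the master is no longer read from args(0) but comes from the submission command's configuration, and the only application argument is the number of slices.
import scala.math.random
import org.apache.spark.{SparkConf, SparkContext}

object SparkPi {
  def main(args: Array[String]): Unit = {
    // No master here; it comes from the submission command (e.g. --master yarn-client)
    val sc = new SparkContext(new SparkConf().setAppName("Spark Pi"))
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = 100000 * slices
    val count = sc.parallelize(1 to n, slices).map { _ =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x * x + y * y < 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / n)
    sc.stop()
  }
}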