Finding the location of my spark job output file - pyspark

I am testing pyspark jobs in an EMR cluster on AWS. The goal is to use a Lambda function to fire the spark job, but for now I am manually running the spark job. So, I SSH to the master node and then run the spark job as below:
spark-submit /home/hadoop/testspark.py mybucket
Here mybucket is a parameter passed to the spark job.
The line that saves the RDD is
rddFiltered.repartition(1).saveAsTextFile("/home/hadoop/output.txt")
The spark job seems to run, but it reports the output location as: Output directory hdfs://ip-xxx-xx-xx-xx.ec2.internal:8020/home/hadoop/output.txt.
Where is this exactly located and how can I view the contents? Forgive my ignorance on HDFS and Hadoop.
Eventually, I want to rename output.txt to something meaningful and then transfer to S3, just haven't gotten there yet.
If I re-run the spark job it says "Output directory hdfs://ip-xxx-xx-xx-xx.ec2.internal:8020/home/hadoop/output.txt already exists". How do I prevent this or at least overwrite the file?
Thanks

Based on the EMR documentation:
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-file-systems.html
if you do not specify a prefix, Spark will write data to HDFS by default. You can check EMR HDFS with this command:
hadoop fs -ls /home/hadoop/
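Note that saveAsTextFile creates a directory at that path (here /home/hadoop/output.txt on HDFS) containing one part file per partition, so to view the contents you can do something like the following (the exact part-file name may differ):
hadoop fs -ls /home/hadoop/output.txt
hadoop fs -cat /home/hadoop/output.txt/part-00000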
You can also transfer from HDFS to S3 with S3DistCp:
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/UsingEMR_s3distcp.html
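For example (a sketch only; the bucket name is taken from the question, and the destination prefix is a placeholder):
s3-dist-cp --src hdfs:///home/hadoop/output.txt --dest s3://mybucket/output/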
Unfortunately, you cannot overwrite an existing output path using saveAsTextFile:
https://spark-project.atlassian.net/browse/SPARK-1100
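A common workaround is to delete the target path before writing again. A minimal pyspark sketch, assuming the rddFiltered RDD and the HDFS path from the question, and that the hadoop CLI is available on the driver node:
import subprocess
# remove the previous output directory if it exists (-f suppresses the error when it is missing)
subprocess.call(["hadoop", "fs", "-rm", "-r", "-f", "/home/hadoop/output.txt"])
rddFiltered.repartition(1).saveAsTextFile("/home/hadoop/output.txt")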
Since you already repartitioned the RDD into a single partition, you can also write it to the local file system:
rddFiltered.repartition(1).saveAsTextFile("file:///home/hadoop/output.txt")
Note that on a distributed cluster each executor writes its partitions to its own local disk, so if you need the data on the driver machine you have to collect() it back to the driver first and write it out with ordinary Python file I/O.
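For example, a minimal pyspark sketch of collecting to the driver and writing a single local file (the path and the rddFiltered name come from the question; the whole data set must fit in driver memory):
lines = rddFiltered.collect()  # pulls all records into the driver process
with open("/home/hadoop/output.txt", "w") as f:
    for line in lines:
        f.write(str(line) + "\n")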

Related

aws EMR cluster via cloud formation: not seeing hdfs on one

Okay, I have an EMR cluster which writes to HDFS, and I am able to view the directory and see the files via:
hadoop fs -ls /user/hadoop/jobs
I need to get into the spark shell and perform sparql, so I created an identical cluster with the same VPC, security groups, and subnet ID, but on it I am not seeing /user/hive or the jobs directory in Hadoop, even though it's supposed to be there.
Why this is happening I am not sure. Could this be something to do with a stale rule? Any suggestions?

Spark (Scala) Writing (and reading) to local file system from driver

1st Question:
I have a 2-node virtual cluster with Hadoop.
I have a jar that runs a spark job.
This jar accepts as a CLI argument a path to a commands.txt file, which tells the jar which commands to run.
I run the job with spark-submit, and I noticed that my slave node wasn't running the job because it couldn't find the commands.txt file, which was local on the master.
This is the command I used to run it:
./spark-1.6.1-bin-hadoop2.6/bin/spark-submit \
--class univ.bigdata.course.MainRunner --master yarn \
--deploy-mode cluster --executor-memory 1g \
--num-executors 4 \
final-project-1.0-SNAPSHOT.jar commands commands.txt
Do I need to upload commands.txt to HDFS and give the HDFS path instead, as follows?
hdfs://master:9000/user/vagrant/commands.txt
2nd Question:
How do I write to a file on the driver machine in the cwd?
I used a normal Scala FileWriter to write output to queries_out.txt, and it worked fine when using spark-submit with
--master local[*]
But when running with
--master yarn
I can't find the file. No exceptions are thrown; the file just doesn't exist, as if it was never written. Is there a way to write the results to a file on the driver machine locally, or should I only write results to HDFS?
Thanks.
Question 1: Yes, uploading it to HDFS or any network-accessible file system is how you solve your problem.
Question 2:
This is a bit tricky. Assuming your results are in an RDD, you could call collect(), which will aggregate all the data in your driver process. Then you have a standard collection in your hands, which you can simply write to disk. Note that you should give your driver process enough memory to hold all the results, and do not forget to also increase the maximum result size. The parameters are:
--driver-memory 16G
--conf "spark.driver.maxResultSize=15g"
This has very poor scaling behaviour in both communication and memory (both grow with the size of the result RDD). It is the easiest way and perfectly fine for a toy project or when the data set is always small; in all other cases it will certainly blow up at some point.
The better way, as you may have mentioned, is to use the built-in "saveAs" methods to write to, e.g., HDFS (or another storage format). You can check the documentation for that: http://spark.apache.org/docs/latest/programming-guide.html#actions
Note that if you only want to persist the RDD because you are reusing it in several computations (like cache, but holding it on disk instead of in memory), there is also a persist method on RDDs.
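To illustrate that last point, here is a tiny sketch in pyspark syntax (the question is about Scala, but the API has the same shape; my_rdd is a hypothetical RDD):
from pyspark import StorageLevel
# keep the RDD on disk between actions instead of recomputing it
my_rdd.persist(StorageLevel.DISK_ONLY)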
The solution was very simple: I changed --deploy-mode cluster to --deploy-mode client, and then the file writes were done correctly on the machine where I ran the driver.
Answer to Question 1:
Submitting the spark job with the --files flag followed by the path to a local file ships the file from the driver node to the cwd of all the worker nodes, so it can then be accessed just by using its name.
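A sketch of what that looks like with the spark-submit line from the question (only the --files flag is added; the local path to commands.txt is a placeholder):
./spark-1.6.1-bin-hadoop2.6/bin/spark-submit \
--class univ.bigdata.course.MainRunner --master yarn \
--deploy-mode cluster --executor-memory 1g \
--num-executors 4 \
--files /home/vagrant/commands.txt \
final-project-1.0-SNAPSHOT.jar commands commands.txt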

Pointing a file to the hadoop cluster

I have a file stored on a server. I want the file to be available to the Hadoop cluster when running Spark. I can point the Spark context at the Hadoop cluster, but then the data cannot be accessed in Spark, because it is stored locally. In order to access the data I have to point Spark at the local file system, but that causes a lot of memory errors. What I hope to do is run Spark against the cluster while still accessing my data stored locally. Please suggest some ways I can do this.
Spark (on Hadoop) cannot read a file stored locally. Remember that Spark is a distributed system running on multiple machines, so it cannot directly read data that lives on only one of the nodes (other than localhost).
You should put the file on HDFS and have Spark read it from there.
To access it locally, use the hadoop fs -get <hdfs filepath> or hadoop fs -cat <hdfs filepath> commands.
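A sketch of that workflow (the paths here are placeholders, not from the question):
hadoop fs -mkdir -p /user/hadoop/data
hadoop fs -put /local/path/to/myfile.csv /user/hadoop/data/
After that, the job can read the file with sc.textFile("hdfs:///user/hadoop/data/myfile.csv").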

Data access Spark EC2

After following the instructions to install a cluster via the EC2 script, I am not able to launch my .jar correctly because it does not find the data file, which I put in /root/persistent-hdfs/ on the master and slave nodes.
I read in another post that I need to prefix the file location with file://, but that doesn't change anything... I get this error:
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file://root/persistent-hdfs/data/ds_1.csv
To launch the job I used ./bin/spark-submit on the master node; is that correct?
Thank you in advance for your support.
There are a few things you need to do:
The default configuration uses the ephemeral hdfs so you need to turn that off $ /root/ephemeral-hdfs/bin/stop-all.sh and turn persistent on $ /root/persistent-hdfs/bin/start-all.sh.
Put your file into the persistent hdfs root directory for simplicity $ /root/persistent-hdfs/bin/hadoop fs -put /root/ds_1.csv /ds_1.csv. Now check to see it is actually there $ /root/persistent-hdfs/bin/hadoop fs -ls.
Finally, edit Spark's configuration files in /root/spark/conf/spark-defaults.conf and /root/spark/conf/spark-env.sh and change everything that says ephemeral to persistent.
Assuming you put your csv in the root directory of the persistent hdfs (as we did in step 2) you can access it in spark using val rawData = sc.textFile("/ds_1.csv").
Have fun!
Seeing the code of your job would provide more details.
So far it looks like the workers cannot access the file on the driver's local file system.
You need to use the hadoop fs -put or -cp command to upload your file to HDFS, so the workers will be able to access the file with an hdfs:// URI.
Since you are running your cluster on EC2, I would suggest putting the file in an S3 bucket and using an s3://... file URI.

spark on yarn; how to send metrics to graphite sink?

I am new to Spark and we are running Spark on YARN. I can run my test applications just fine. I am trying to collect the Spark metrics in Graphite. I know what changes to make to the metrics.properties file, but how will my spark application see this conf file?
/xxx/spark/spark-0.9.0-incubating-bin-hadoop2/bin/spark-class org.apache.spark.deploy.yarn.Client --jar /xxx/spark/spark-0.9.0-incubating-bin-hadoop2/examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.0-incubating.jar --addJars "hdfs://host:port/spark/lib/spark-assembly_2.10-0.9.0-incubating-hadoop2.2.0.jar" --class org.apache.spark.examples.Test --args yarn-standalone --num-workers 50 --master-memory 1024m --worker-memory 1024m --args "xx"
Where should I be specifying the metrics.properties file?
I made these changes to it:
*.sink.Graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.Graphite.host=machine.domain.com
*.sink.Graphite.port=2003
master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
I have found a different solution to the same problem. It looks like Spark can also take these metric settings from its config properties. For example, the following line from metrics.properties:
*.sink.Graphite.class=org.apache.spark.metrics.sink.GraphiteSink
Can also be specified as a Spark property with key spark.metrics.conf.*.sink.graphite.class and value org.apache.spark.metrics.sink.GraphiteSink. You just need to prepend spark.metrics.conf. to each key.
I have ended up putting all these settings in the code like this:
import org.apache.spark  // needed so that spark.SparkConf / spark.SparkContext resolve

val sparkConf = new spark.SparkConf()
  .set("spark.metrics.conf.*.sink.graphite.class", "org.apache.spark.metrics.sink.GraphiteSink")
  .set("spark.metrics.conf.*.sink.graphite.host", graphiteHostName)
  // etc.
val sc = new spark.SparkContext(sparkConf)
This way I've got the metrics sink set up for both the driver and the executors. I was using Spark 1.6.0.
I struggled with the same thing. I have it working using these flags:
--files=/path/to/metrics.properties --conf spark.metrics.conf=metrics.properties
It's tricky because the --files flag makes your /path/to/metrics.properties file end up in every executor's local disk space as metrics.properties; AFAIK there's no way to specify a more complex directory structure there, or to have two files with the same basename.
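Putting it together, a sketch of a full spark-submit invocation using this approach (the application class and jar are placeholders, not from the question):
spark-submit --master yarn --deploy-mode cluster \
--files /path/to/metrics.properties \
--conf spark.metrics.conf=metrics.properties \
--class com.example.MyApp my-app.jar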
Related, I filed SPARK-5152 about letting the spark.metrics.conf file be read from HDFS, but that seems like it would require a fairly invasive change, so I'm not holding my breath on that one.