I have a spark script that reads the content of a file spark.read.textFile(filePath)
I run it from the container of the master itself and try to pass this file with the --files param such as
./spark-submit --class nameOfClass --files local/path/to/file.csv --master spark://master_ip generated_executable.jar local/path/to/file.csv`
But then I get the error
java.io.FileNotFoundException: File file:/local/path/to/file.csv does not exist
I tried changing the line to:
spark.read.textFile(SparkFiles.get(fileName))
but the error persists, now it says
java.io.FileNotFoundException: File file:/mnt/mesos/sandbox/spark-946bbaef-a258-4951-9b15-bec77b78bf5d/userFiles-3f9dcf85-4114-4968-b625-6bb1498f568d/file.csv does not exist
If I manually add the file to each worker, it works. But I don't want to do that. Is there a way to pass the file from the context where it is submitting the job?
one good alternative would be to put the csv file to hdfs. then you don't have to take care of the file residing on every executor. I don't think it's a common pattern to pass a file to a spark-submit invocation in order to read it.
assuming it's located at hdfs://tmp/file.csv you can read it by coding
spark.read.textFile("/tmp/file.csv")
which leads to the opportunity to dispense passing the file via --files fully
but what you could do is running the job in local mode. this is a easy workaround avoiding to place the file on every node manually. then spark should find the file just as you want under
file:/local/path/to/file.csv
i hope that helps
Related
I am testing pyspark jobs in an EMR cluster on AWS. The goal is to use a Lambda function to fire the spark job, but for now I am manually running the spark job. So, I SSH to the master node and then run the spark job as below:
spark-submit /home/hadoop/testspark.py mybucket
mybucket - parameter passed to the spark job.
The line that saves the RDD is
rddFiltered.repartition(1).saveAsTextFile("/home/hadoop/output.txt")
The spark job seems to run but it puts the output file in some location - Output directory hdfs://ip-xxx-xx-xx-xx.ec2.internal:8020/home/hadoop/output.txt.
Where is this exactly located and how can I view the contents? Forgive my ignorance on HDFS and Hadoop.
Eventually, I want to rename output.txt to something meaningful and then transfer to S3, just haven't gotten there yet.
If I re-run the spark job it says "Output directory hdfs://ip-xxx-xx-xx-xx.ec2.internal:8020/home/hadoop/output.txt already exists". How do I prevent this or at least overwrite the file?
Thanks
Based on the EMR documentation:
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-file-systems.html
if you do not specify prefix, spark will write data to HDFS by default. You can check EMR HDFS with this command:
hadoop fs -ls /home/hadoop/
You can also transfer from HDFS to S3 with S3DistCp:
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/UsingEMR_s3distcp.html
Unfortunately you cannot overwrite the existing file using saveAsTextFile:
https://spark-project.atlassian.net/browse/SPARK-1100
As I can see you re-partitioned the file into one partition, so you can write it into the local file-system as well:
rddFiltered.repartition(1).collect().saveAsTextFile("file:///home/hadoop/output.txt")
Note, if you are using distributed cluster you have to collect() back to driver first!
I'm trying to write file to local FileSystem using FileSystem library of org.apache.hadoop.fs. Below is my one liner code inside the big scala code that should be doing this, but it's not.
fs.copyToLocalFile(false, hdfsSourcePath, new Path(newFile.getAbsolutePath), true)
The value of newFile is:
val newFile = new File(s"${localPath}/fileName.dat")
localPath is just a variable containing the full path on local disk.
hdfsSourcePath is the full path on HDFS location.
The job executes properly but I don't see the files created on local. I'm running it through Spark engine in cluster mode, that's why I used the copyToLocalFile method which overloads the 4th argument of useRawLocalFileSystem and set it to true. Using this, we can avoid getting the files being written on the executor node.
Any ideas?
I used the copyToLocalFile method which overloads the 4th argument of useRawLocalFileSystem and set it to true. Using this, we can avoid getting the files being written on the executor node.
I think you got this point wrong. Cluster mode makes driver run on executor node and local file system is that executor's file system. useRawLocalFileSystem only prevents writing checksum files (->info), it does not make the files appear on machine that is submitting the job, which is probably what you expected.
The best you can do is to save files to HDFS and retrieve them explicitly after the job finishes.
1st Question :
I have a 2 node virtual cluster with hadoop.
I have a jar that runs a spark job.
This jar accepts as a cli argument : a path to a commands.txt file which tells the jar which commands to run.
I run the job with spark-submit, and i have noticed that my slave node wasn't running because it couldn't find the commands.txt file which was local on the master.
This is the command i used to run it :
./spark-1.6.1-bin-hadoop2.6/bin/spark-submit --class
univ.bigdata.course.MainRunner --master yarn\
--deploy-mode cluster --executor-memory 1g \
--num-executors 4 \
final-project-1.0-SNAPSHOT.jar commands commands.txt
Do i need to upload commands.txt to the hdfs and give the hdfs path instead as follows ? :
hdfs://master:9000/user/vagrant/commands.txt
2nd Question :
How do i write to a file on the driver machine in the cwd ?
I used a normal scala filewriter to write output to queries_out.txt and it worked fine when using spark submit with
-master local[]
But, when running in
-master yarn
I cant find the file, No exceptions are thrown but i just cant locate the file. It doesn't exist as if it was never written. Is there a way to write the results to a file on the driver machine locally ? Or should i only write results to HDFS ?
Thanks.
Question 1: Yes, uploading it to hdfs or any network accessible file system is how you solve your problem.
Question 2:
This is a bit tricky. Assuming your results are in a RDD you could call collect(), that will aggregate all the data on your driver process. Then, you have a standard collection in your hands which you could simply write on disk. Note that you should give your driver's process enough memory to be able to hold all results in memory, do not forget to also increase the maximum result size. The parameters are:
--driver-memory 16G
--conf "spark.driver.maxResultSize=15g"
This is has absolutely poor scaling behaviour in both communication complexity and memory (both in the size of the result RDD). This is the easiest way and perfectly fine for a toy project or when the data set is always small. In all other cases it will certainly blow up at some point.
The better way, as you may have mentioned, is to use the build-in "saveAs" methods to write to i.e. hdfs (or another storage format). You can check the documentation for that: http://spark.apache.org/docs/latest/programming-guide.html#actions
Note that if you only want to persist the RDD, because you are reusing it in several computations (like cache, but instead of holding it in memory hold it in disk) there is also a persist method on RDDs.
Solution was very simple, i changed --deploy-mode cluster to --deploy-mode client and then the file writes were done correctly on the machine where i ran the driver.
Answer to Question 1:
Submitting spark job with the --files tag followed by path to a local file downloads the file from the driver node to the cwd of all the worker nodes and thus be accessed just by using its name.
After following instruction to install cluster via ec2 script, i'm not able to correctly launch my .jar because they don't find the data file which i put on /root/persistent-hdfs/ on the master and slave nodes.
I read on an other post that i need to prefix the file location with file:// but it doesn't change anything... I have this error :
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file://root/persistent-hdfs/data/ds_1.csv
To launch the job i used the ./bin/spark-submit on the master node, am i correct ?
Thank you in advance for your support.
There are a few things you need to do:
The default configuration uses the ephemeral hdfs so you need to turn that off $ /root/ephemeral-hdfs/bin/stop-all.sh and turn persistent on $ /root/persistent-hdfs/bin/start-all.sh.
Put your file into the persistent hdfs root directory for simplicity $ /root/persistent-hdfs/bin/hadoop fs -put /root/ds_1.csv /ds_1.csv. Now check to see it is actually there $ /root/persistent-hdfs/bin/hadoop fs -ls.
Finally, edit Spark's configuration files in /root/spark/conf/spark-defaults.conf and /root/spark/conf/spark-env.sh and change everything that says ephemeral to persistent.
Assuming you put your csv in the root directory of the persistent hdfs (as we did in step 2) you can access it in spark using val rawData = sc.textFile("/ds_1.csv").
Have fun!
Seeing the code of your job would provide more details.
So far looks like workers cannot access the file on the local file system of the driver.
You need to use hadoop fs -put or -cp command to upload your file to HDFS. So workers will be able access the file with hdfs:// uri.
Since you are running your cluster on EC2 I would suggest to put the file to s3 bucket and use s3://... file uri.
I am new to spark and we are running spark on yarn. I can run my test applications just fine. I am trying to collect the spark metrics in Graphite. I know what changes to make to metrics.properties file. But how will my spark application see this conf file?
/xxx/spark/spark-0.9.0-incubating-bin-hadoop2/bin/spark-class org.apache.spark.deploy.yarn.Client --jar /xxx/spark/spark-0.9.0-incubating-bin-hadoop2/examples/target/scala-2.10/spark-examples_2.10-assembly-0.9.0-incubating.jar --addJars "hdfs://host:port/spark/lib/spark-assembly_2.10-0.9.0-incubating-hadoop2.2.0.jar" --class org.apache.spark.examples.Test --args yarn-standalone --num-workers 50 --master-memory 1024m --worker-memory 1024m --args "xx"
Where should I be specifying the metrics.properties file?
I made these changes to it:
*.sink.Graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.Graphite.host=machine.domain.com
*.sink.Graphite.port=2003
master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
I have found a different solution to the same problem. It looks like that Spark can also take these metric settings from its config properties. For example the following line from metrics.properties:
*.sink.Graphite.class=org.apache.spark.metrics.sink.GraphiteSink
Can also be specified as a Spark property with key spark.metrics.conf.*.sink.graphite.class and value org.apache.spark.metrics.sink.GraphiteSink. You just need to prepend spark.metrics.conf. to each key.
I have ended up putting all these settings in the code like this:
val sparkConf = new spark.SparkConf()
.set("spark.metrics.conf.*.sink.graphite.class", "org.apache.spark.metrics.sink.GraphiteSink")
.set("spark.metrics.conf.*.sink.graphite.host", graphiteHostName)
// etc.
val sc = new spark.SparkContext(sparkConf)
This way I've got the metrics sink set up for both the driver and the executors. I was using Spark 1.6.0.
I struggled with the same thing. I have it working using these flags:
--files=/path/to/metrics.properties --conf spark.metrics.conf=metrics.properties
It's tricky because the --files flag makes it so your /path/to/metrics.properties file ends up in every executor's local disk space as metrics.properties; AFAIK there's no way to specify more complex directory structure there, or have two files with the same basename.
Related, I filed SPARK-5152 about letting the spark.metrics.conf file be read from HDFS, but that seems like it would require a fairly invasive change, so I'm not holding my breath on that one.