Pointing a file to the Hadoop cluster - Scala

I have a file stored on a server. I want that file to be available to the Hadoop cluster when I run Spark. What I have so far is that I can point the Spark context to the Hadoop cluster, but once it points to the cluster the data can no longer be accessed, because the data is stored locally. If I point Spark at the local data instead I can access it, but then I run into a lot of memory errors. What I hope to do is run Spark against the cluster while still accessing my locally stored data. Please suggest some ways I can do this.

Spark (on Hadoop) cannot read a file that is stored locally on a single machine. Remember that Spark is a distributed system running on multiple machines, so it cannot directly read data that lives on only one of the nodes (other than localhost).
You should put the file on HDFS and have Spark read it from there.
To access it locally afterwards you can use the hadoop fs -get <hdfs filepath> or hadoop fs -cat <hdfs filepath> command.
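For example, a minimal sketch in Scala, assuming a local file /data/myFile.csv and a target HDFS path /user/hadoop/myFile.csv (both paths are hypothetical):
// Copy the local file into HDFS first, from a machine that has a Hadoop client:
//   hadoop fs -put /data/myFile.csv /user/hadoop/myFile.csv
// Then, in spark-shell (where sc is the predefined SparkContext),
// every executor can reach the HDFS path:
val lines = sc.textFile("hdfs:///user/hadoop/myFile.csv")
println(lines.count())   // run an action to verify the file is readable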

Related

Reading a file from local file system after reading it from hadoop file system

I am trying to read a file from my local EMR file system. It exists as a file at the path /emr/myFile.csv. However, I keep getting a FileNotFoundException. Here is the line of code that I use to read it:
val myObj: File = new File("/emr/myFile.csv")
I added a file:// prefix to the file path as well, because I have seen that work for others, but that still did not work. So I also tried to read directly from the Hadoop file system, where it is stored at /emr/CNSMR_ACCNT_BAL/myFile.csv, because I thought it was maybe checking HDFS by default. However, that also results in a FileNotFoundException. Here is the code for that:
val myObj: File = new File("/emr/CNSMR_ACCNT_BAL/myFile.csv")
How can I read this file into a File?
For your 1st problem:
When you submit a Hadoop job, the application master can get created on any of your worker nodes, including the master node (depending on your configuration).
If you are using EMR, your application master by default gets created on one of your worker (CORE) nodes, not on the master.
When you say file:///emr/myFile.csv, that file exists on your local file system (I'm assuming that means on the master node), but your program will search for it on the node where the application master is running, and that is evidently not your master node, because otherwise you wouldn't be getting any error.
2nd problem:
When you try to access a file in HDFS using the java.io.File class, it won't be able to reach that file.
You need to use the Hadoop FileSystem API (org.apache.hadoop.fs.FileSystem) to interact with an HDFS file.
Also use an HDFS URI such as hdfs://<namenode>:<port>/emr/CNSMR_ACCNT_BAL/myFile.csv.
If your core-site.xml contains a value for fs.defaultFS, then you don't need to put the namenode and port info; simply use hdfs:///emr/CNSMR_ACCNT_BAL/myFile.csv.
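As an illustration, a minimal Scala sketch using the Hadoop FileSystem API; it reuses the path from the question and assumes it runs on a node where the cluster's Hadoop configuration (core-site.xml, hdfs-site.xml) is on the classpath:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

val conf = new Configuration()                       // picks up core-site.xml / hdfs-site.xml
val fs = FileSystem.get(conf)                        // HDFS if fs.defaultFS points there
val in = fs.open(new Path("/emr/CNSMR_ACCNT_BAL/myFile.csv"))
val contents = Source.fromInputStream(in).mkString   // read the whole file as a String
in.close()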
So what's the better option here when accessing a file on a Hadoop cluster?
That depends on your use case, but in most cases putting it in HDFS is much better, because you don't have to worry about where your application master is; each and every node has access to HDFS.
Hope that resolves your problem.

Finding the location of my spark job output file

I am testing pyspark jobs in an EMR cluster on AWS. The goal is to use a Lambda function to fire the spark job, but for now I am manually running the spark job. So, I SSH to the master node and then run the spark job as below:
spark-submit /home/hadoop/testspark.py mybucket
mybucket is a parameter passed to the spark job.
The line that saves the RDD is
rddFiltered.repartition(1).saveAsTextFile("/home/hadoop/output.txt")
The spark job seems to run, but it puts the output in some location reported as Output directory hdfs://ip-xxx-xx-xx-xx.ec2.internal:8020/home/hadoop/output.txt.
Where is this exactly located and how can I view the contents? Forgive my ignorance on HDFS and Hadoop.
Eventually, I want to rename output.txt to something meaningful and then transfer to S3, just haven't gotten there yet.
If I re-run the spark job it says "Output directory hdfs://ip-xxx-xx-xx-xx.ec2.internal:8020/home/hadoop/output.txt already exists". How do I prevent this or at least overwrite the file?
Thanks
Based on the EMR documentation:
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-file-systems.html
if you do not specify a prefix, Spark will write the data to HDFS by default. You can check the EMR HDFS with this command:
hadoop fs -ls /home/hadoop/
You can also transfer from HDFS to S3 with S3DistCp:
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/UsingEMR_s3distcp.html
Unfortunately you cannot overwrite an existing output directory using saveAsTextFile:
https://spark-project.atlassian.net/browse/SPARK-1100
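A common workaround, sketched here in Scala (the question itself uses PySpark), is to delete the old output directory through the Hadoop FileSystem API before saving again; the output path and rddFiltered are taken from the question:
import org.apache.hadoop.fs.{FileSystem, Path}

val outputPath = new Path("/home/hadoop/output.txt")      // same path as in the question
val fs = FileSystem.get(sc.hadoopConfiguration)           // sc is the SparkContext
if (fs.exists(outputPath)) fs.delete(outputPath, true)    // true = delete recursively
rddFiltered.repartition(1).saveAsTextFile(outputPath.toString)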
As I can see you re-partition the result down to a single partition, so the data should be small enough to bring back to the driver and write to the local file system there. Note that on a distributed cluster you have to collect() it back to the driver first; collect() returns a plain Python list (not an RDD), so you then write it out with ordinary file I/O rather than saveAsTextFile:
with open("/home/hadoop/output.txt", "w") as f:
    f.write("\n".join(str(x) for x in rddFiltered.collect()))

AWS EMR cluster via CloudFormation: not seeing HDFS on one

Okay, I have an EMR cluster which writes to HDFS, and I am able to view the directory and see the files via
hadoop fs -ls /user/hadoop/jobs
I need to get into the spark shell and perform sparql, so I created an identical cluster with the same VPC, security groups, and subnet ID. On that cluster I am not seeing the /user/hive or jobs directory in Hadoop, but it is supposed to be there.
What I am supposed to see:
Why this is happening I am not sure, but I think this might be it? Any suggestions?
Could this be something to do with a stale rule?

Where does Spark store data when storage level is set to disk?

I was wondering in which directory Spark stores data when the storage level is set to DISK_ONLY or MEMORY_AND_DISK (in the latter case, the data that doesn't fit into memory), because I see that it makes no difference which level I set it to: if the program crashes with the MEMORY_ONLY level, it also crashes with all the other levels.
In the cluster I'm using, the /tmp directory is a RAM disk and therefore limited in size. Is Spark trying to store the disk-level data on that drive? Maybe that is why I'm not seeing the difference. If that is indeed the case, how can I change this default behavior? If I'm using a YARN cluster that comes with Hadoop, do I need to change the /tmp folder in the Hadoop configuration files, or would just changing spark.local.dir in Spark do?
Yes, Spark is trying to store the disk-level data on that drive.
In yarn-cluster mode, the local directories used by the Spark executors and the Spark driver will be the local directories configured for YARN (Hadoop YARN config yarn.nodemanager.local-dirs). If the user specifies spark.local.dir, it will be ignored.
Reference: https://spark.apache.org/docs/latest/running-on-yarn.html#important-notes
So, to change the Spark local directory, change yarn.nodemanager.local-dirs in your YARN config.
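For reference, the entry in yarn-site.xml on each node looks like this (the directories below are only examples; choose paths on disks with enough space, and restart the NodeManagers after changing it):
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/mnt/yarn/local,/mnt1/yarn/local</value>
</property>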

Data access Spark EC2

After following the instructions to install a cluster via the ec2 script, I'm not able to launch my .jar correctly because it doesn't find the data file, which I put in /root/persistent-hdfs/ on the master and slave nodes.
I read in another post that I need to prefix the file location with file://, but it doesn't change anything... I get this error:
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file://root/persistent-hdfs/data/ds_1.csv
To launch the job I used ./bin/spark-submit on the master node; is that correct?
Thank you in advance for your support.
There are a few things you need to do:
1. The default configuration uses the ephemeral HDFS, so you need to turn that off with $ /root/ephemeral-hdfs/bin/stop-all.sh and turn the persistent one on with $ /root/persistent-hdfs/bin/start-all.sh.
2. Put your file into the persistent HDFS root directory for simplicity: $ /root/persistent-hdfs/bin/hadoop fs -put /root/ds_1.csv /ds_1.csv. Then check that it is actually there: $ /root/persistent-hdfs/bin/hadoop fs -ls.
3. Finally, edit Spark's configuration files in /root/spark/conf/spark-defaults.conf and /root/spark/conf/spark-env.sh and change everything that says ephemeral to persistent.
4. Assuming you put your csv in the root directory of the persistent HDFS (as we did in step 2), you can access it in Spark using val rawData = sc.textFile("/ds_1.csv").
Have fun!
Seeing the code of your job would provide more details.
So far it looks like the workers cannot access the file on the driver's local file system.
You need to use the hadoop fs -put or -cp command to upload your file to HDFS, so the workers will be able to access it with an hdfs:// URI.
Since you are running your cluster on EC2, I would suggest putting the file in an S3 bucket and using an s3://... file URI.
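For example, a minimal Scala sketch; the bucket name and key are placeholders, the right URI scheme depends on your Hadoop build (for instance s3a:// or s3n:// on a spark-ec2 cluster, s3:// on EMR), and it assumes S3 credentials are already configured for the cluster:
val rawData = sc.textFile("s3a://my-bucket/data/ds_1.csv")
println(rawData.count())   // run an action to check the file is readable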