How to store the text file on the Master? - scala

I am running the ALS algorithm on a Spark Standalone cluster. The predictions are saved to a text file using:
saveAsTextFile(path)
But the text file is being stored on the cluster nodes. I want to store the text file on the master.

That is expected behavior. path is resolved on the machine where the
write is executed, i.e. the slaves. I'd recommend either using a cluster FS
(e.g. HDFS) or calling .collect() on your data so you can save it locally on
the master. Beware of OOM if your data is large.
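
To illustrate the second option, here is a minimal sketch of a driver-side write. The object LocalSave, the RDD name predictions, and the output path are assumptions for illustration, not part of the original question. The helper itself uses only the Java standard library; the Spark call appears in the usage comment.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

object LocalSave {
  // Writes each element as one line to a file on the machine running this code
  // (the driver, when called from a Spark application's main method).
  def writeLines(lines: Iterator[String], target: Path): Unit = {
    val writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)
    try lines.foreach { line => writer.write(line); writer.newLine() }
    finally writer.close()
  }
}

// Hypothetical usage with an RDD named `predictions`:
//   LocalSave.writeLines(
//     predictions.map(_.toString).toLocalIterator,
//     java.nio.file.Paths.get("/tmp/predictions.txt"))
```

toLocalIterator fetches one partition at a time, so the driver only needs enough memory for the largest partition, whereas .collect() materializes the whole dataset at once. Either way the file ends up on whichever machine runs the driver.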

Related

Jmeter csv data split

I am load testing using JMeter containers inside a k8s cluster. Right now the jmx and the csv files are copied to all the containers. Is there a way to split the data file so that each JMeter instance in a container gets its own subset of the original file?
Are you looking for the split command or what? The number of lines in the file and the number of pods in the cluster can both be obtained using the wc command.
Also, there might be better solutions, like using HTTP Simple Table Server or Redis Data Set, so the test data would be stored in a "central" location and you won't have to bother with splitting it and copying the parts to the slaves.
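
If you would rather do the splitting programmatically, the idea can be sketched as follows. This is only an illustration; the object name CsvSplit and the function signature are hypothetical. It cuts the lines into at most `parts` contiguous chunks, with the last chunk possibly shorter, and says nothing about how the chunks then reach the pods.

```scala
object CsvSplit {
  // Splits `lines` into at most `parts` contiguous chunks.
  // Every chunk has the same size except possibly the last one.
  def split(lines: Seq[String], parts: Int): Seq[Seq[String]] = {
    require(parts > 0, "parts must be positive")
    val chunkSize = math.max(1, math.ceil(lines.size.toDouble / parts).toInt)
    lines.grouped(chunkSize).toList
  }
}
```

If the csv has a header row, it would need to be removed before splitting and prepended to each chunk afterwards.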

Reading a file from local file system after reading it from hadoop file system

I am trying to read a file from my local EMR file system. It is there as a file under the folder /emr/myFile.csv. However, I keep getting a FileNotFoundException. Here is the line of code that I use to read it:
val myObj: File = new File("/emr/myFile.csv")
I added a file://// prefix to the file path as well because I have seen that work for others, but that still did not work. So I also tried to read directly from the Hadoop file system, where it is stored in the folder /emr/CNSMR_ACCNT_BAL/myFile.csv, because I thought it was maybe checking HDFS by default. However, that also results in a FileNotFoundException. Here is the code for that:
val myObj: File = new File("/emr/CNSMR_ACCNT_BAL/myFile.csv")
How can I read this file into a File?
For your 1st problem:
When you submit a Hadoop job, the application master can be created on any of your worker nodes, including the master node (depending on your configuration).
If you are using EMR, your application master by default gets created on one of your worker nodes (CORE nodes) but not on the master.
When you say file:///emr/myFile.csv, that file exists on your local file system (I'm assuming that means on the master node), but your program searches for it on the node where the application master is running. That node is definitely not your master node, because otherwise you wouldn't get any error.
2nd problem:
When you try to access a file in HDFS using Java's File class, it won't be able to access that file.
You need to use the Hadoop FileSystem API (org.apache.hadoop.fs.FileSystem) to interact with an HDFS file.
Also use the HDFS URI form hdfs://<namenode>:<port>/emr/CNSMR_ACCNT_BAL/myFile.csv.
If your core-site.xml contains a value for fs.defaultFS, then you don't need to put the namenode and port info; simply use hdfs:///emr/CNSMR_ACCNT_BAL/myFile.csv
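A minimal sketch of reading the file through the Hadoop FileSystem API, assuming hadoop-common is on the classpath. The object name HdfsRead and the helper readLines are made up for illustration. The URI scheme decides which FileSystem implementation is used, so the same helper works for hdfs:// and file:// paths.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import scala.io.Source

object HdfsRead {
  // Reads all lines of the file at `uri`. The URI scheme (hdfs://, file://, ...)
  // determines which Hadoop FileSystem implementation handles the path.
  def readLines(uri: String, conf: Configuration): List[String] = {
    val path = new Path(uri)
    val fs = path.getFileSystem(conf)
    val in = fs.open(path)
    // toList forces the read to finish before the stream is closed
    try Source.fromInputStream(in, "UTF-8").getLines().toList
    finally in.close()
  }
}

// Example call, using the path from the question:
//   HdfsRead.readLines("hdfs:///emr/CNSMR_ACCNT_BAL/myFile.csv", new Configuration())
```

Note that this returns the file's contents rather than a java.io.File, since an HDFS path has no java.io.File representation.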
So what's the better option for accessing a file in a Hadoop cluster?
The answer depends on your use case, but in most cases putting it in HDFS is much better, because you don't have to worry about where your application master is. Each and every node has access to HDFS.
Hope that resolves your problem.

Hadoop FileUtils not able to write files on local(Unix) filesystem from Scala

I'm trying to write a file to the local file system using the FileSystem class of org.apache.hadoop.fs. Below is the one-liner inside a larger Scala program that should be doing this, but it's not.
fs.copyToLocalFile(false, hdfsSourcePath, new Path(newFile.getAbsolutePath), true)
The value of newFile is:
val newFile = new File(s"${localPath}/fileName.dat")
localPath is just a variable containing the full path on local disk.
hdfsSourcePath is the full path on HDFS location.
The job executes properly but I don't see the files created on local. I'm running it through Spark engine in cluster mode, that's why I used the copyToLocalFile method which overloads the 4th argument of useRawLocalFileSystem and set it to true. Using this, we can avoid getting the files being written on the executor node.
Any ideas?
I used the copyToLocalFile method which overloads the 4th argument of useRawLocalFileSystem and set it to true. Using this, we can avoid getting the files being written on the executor node.
I think you got this point wrong. Cluster mode makes the driver run on a node inside the cluster, and the local file system is that node's file system. useRawLocalFileSystem only prevents writing checksum files; it does not make the files appear on the machine that is submitting the job, which is probably what you expected.
The best you can do is to save the files to HDFS and retrieve them explicitly after the job finishes.
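
The retrieval step could look roughly like this, run as a separate program on the machine where you want the file, after the Spark job has finished writing to HDFS. The object name FetchResult and its fetch helper are hypothetical; it assumes hadoop-common on the classpath and a Configuration that can reach the cluster.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

object FetchResult {
  // Copies `src` (any Hadoop-supported URI) to `dst` on the machine running this code.
  def fetch(src: String, dst: String, conf: Configuration): Unit = {
    val srcPath = new Path(src)
    val fs = srcPath.getFileSystem(conf)
    // delSrc = false keeps the source file;
    // useRawLocalFileSystem = true skips writing a local .crc checksum file
    fs.copyToLocalFile(false, srcPath, new Path(dst), true)
  }
}
```

The same call as in the question then behaves as expected, because "local" now means the machine you are actually sitting at.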

Pointing a file to the hadoop cluster

I have a file stored on a server. I want Spark to be able to read it while running on the Hadoop cluster. I can point the Spark context at the Hadoop cluster, but then the data can no longer be accessed from Spark. The data is stored locally, so in order to access it I have to point Spark at the local machine. However, this causes a lot of memory errors. What I hope to do is point Spark at the cluster while still accessing my locally stored data. Please suggest some ways I can do this.
Spark (on Hadoop) cannot read a file stored locally. Remember that Spark is a distributed system running on multiple machines, so it cannot directly read data that sits on only one of the nodes (other than localhost).
You should put the file on HDFS and have Spark read it from there.
To access it locally you should use the hadoop fs -get <hdfs filepath> or hadoop fs -cat <hdfs filepath> command.

Where does Spark store data when storage level is set to disk?

I was wondering in which directory Spark stores data when Storage level is set to DISK_ONLY or MEMORY_AND_DISK (The data which doesn't fit into memory in that case). Because I see that it makes no difference which level I set to. If the program crashes with MEMORY_ONLY level, it also crashes with all other levels.
In the cluster I'm using, /tmp directory is a RAM disk, and therefore limited in size. Is Spark trying to store the disk level data to that drive? Maybe, that is why I'm not seeing the difference. If that is indeed the case, how can I change this default behavior? If I'm using a yarn cluster that comes with Hadoop, do I need to change the /tmp folder in the hadoop configuration files, or just changing the spark.local.dir with Spark would do?
Yes, Spark is trying to store the disk level data on that drive.
In yarn-cluster mode, the local directories used by the Spark executors and the Spark driver will be the local directories configured for YARN (Hadoop YARN config yarn.nodemanager.local-dirs). If the user specifies spark.local.dir, it will be ignored.
Reference: https://spark.apache.org/docs/latest/running-on-yarn.html#important-notes
So, to change the Spark local directory, change yarn.nodemanager.local-dirs in your YARN config.
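
For illustration, the corresponding entry in yarn-site.xml would look something like this. The directory paths are placeholders, not recommendations; pick directories on real disks with enough space rather than the RAM-backed /tmp.

```xml
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <!-- placeholder paths; a comma-separated list is allowed -->
  <value>/data/1/yarn/local,/data/2/yarn/local</value>
</property>
```

The NodeManagers typically need a restart for a change to this property to take effect.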