Deleting temp file in a Spark datasource - scala

I am trying to write a Spark datasource package.
My datasource job is simple:
1. Get the content from a remote system and store it in a temp file.
2. Construct a dataframe using the content of the temp file.
I was able to do the above, but I want to delete the temp file once the dataframe has been constructed.
Since Spark constructs the dataframe lazily, the file can't be deleted from my DatasetRelation, so the option of deleting it from DefaultSource or DatasetRelation is ruled out.
Another option is to add my temp folder to ShutdownHookManager, which would take care of deleting it during Spark shutdown. Unfortunately, ShutdownHookManager is private.
Another option is to find the temp directory that Spark itself uses and deletes during shutdown, and put my file there. Spark does create such temp directories, but I can't get their names: Spark puts a UUID in each directory name, and there is no environment variable that exposes the path. So this option is out as well.
Is there any other option to delete a temp file used to construct a dataframe in Spark?

Get the content from remote system and store the content in a temp file
You should probably not do this in Spark. If you fetch the file in an external script, you can pass its path to Spark; Spark will then copy it to the cluster and delete its copies afterwards.
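If the fetch really has to happen inside your application, a related idea (a sketch, not a confirmed solution; path and app name are hypothetical) is to hand the staged file to SparkContext.addFile and let Spark manage the distributed copies:
import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

val sc = new SparkContext(new SparkConf().setAppName("tempfile-demo"))

// Assume an external step already staged the remote content at this local path.
sc.addFile("/path/to/staged/content.dat") // Spark copies the file to the cluster

// On the executors, resolve the distributed copy by its file name.
val localCopy = SparkFiles.get("content.dat")

// Spark stores these copies in its own managed temp directories, which are
// typically removed when the application shuts down.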

Related

Reading a file from local file system after reading it from hadoop file system

I am trying to read a file from my local EMR file system. It exists as a file at /emr/myFile.csv. However, I keep getting a FileNotFoundException. Here is the line of code I use to read it:
val myObj: File = new File("/emr/myFile.csv")
I also tried adding a file://// prefix to the path, because I have seen that work for others, but it still did not work. I then tried reading directly from the Hadoop file system, where the file is stored under /emr/CNSMR_ACCNT_BAL/myFile.csv, because I thought it was maybe checking HDFS by default. However, that also results in a FileNotFoundException. Here is the code for that:
val myObj: File = new File("/emr/CNSMR_ACCNT_BAL/myFile.csv")
How can I read this file into a File?
For your 1st problem:
When you submit a Hadoop job, the application master can be created on any of your worker nodes, including the master node (depending on your configuration).
If you are using EMR, the application master is by default created on one of your worker (CORE) nodes, not on the master.
When you say file:///emr/myFile.csv, that file exists on the local file system of one node (I'm assuming the master node). Your program will look for it on the node where the application master is running, and that is evidently not the node that has the file; otherwise you wouldn't get any error.
2nd problem:
You cannot access a file in HDFS with java.io.File; it only works with the local file system.
You need to use the Hadoop FileSystem API (org.apache.hadoop.fs.FileSystem) to interact with an HDFS file.
Also use an HDFS URI: hdfs://<namenode>:<port>/emr/CNSMR_ACCNT_BAL/myFile.csv.
If your core-site.xml sets fs.defaultFS, then you don't need the namenode and port info; simply hdfs:///emr/CNSMR_ACCNT_BAL/myFile.csv will do.
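As a minimal sketch, assuming fs.defaultFS is set in core-site.xml (the path is the one from the question), reading the file through the FileSystem API could look like this:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

val path = new Path("hdfs:///emr/CNSMR_ACCNT_BAL/myFile.csv")
val fs = FileSystem.get(new Configuration())

// Open an input stream on the HDFS file instead of using java.io.File.
val in = fs.open(path)
try {
  Source.fromInputStream(in).getLines().foreach(println)
} finally {
  in.close()
}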
So what's the better option when accessing a file in a Hadoop cluster?
The answer depends on your use case, but in most cases putting the file in HDFS is better, because you don't have to worry about where your application master is: each and every node has access to HDFS.
Hope that resolves your problem.

Pyspark dataframe write parquet without deleting /_temporary folder

df.write.mode("append").parquet(path)
I'm using this to write parquet files to an S3 location. It seems that in order to write the files, Spark also creates a /_temporary directory and deletes it after use, so I got access denied: the admin on our AWS account doesn't want to grant the code delete permission on that folder.
I proposed writing the files to another folder where delete permission can be granted and then copying them over, but the admin still wants the files written directly to the destination folder.
Is there a configuration I can set to tell Pyspark not to delete the temporary directory?
I don't think there is such an option for the _temporary folder.
But if you're running your Spark job on an EMR cluster, you can write first to the local HDFS of your cluster and then copy the data to S3 using the Hadoop FileUtil.copy function.
In Pyspark, you can access this function via the JVM gateway like this:
sc._gateway.jvm.org.apache.hadoop.fs.FileUtil
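For reference, here is a hedged Scala sketch of the underlying FileUtil.copy call (in Pyspark you would invoke the same method through the gateway object above); the bucket name and paths are placeholders:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileUtil, Path}

val conf = new Configuration()
val src = new Path("hdfs:///tmp/staging/output")   // local HDFS staging dir
val dst = new Path("s3a://my-bucket/output")       // final S3 destination (assumed bucket)

// deleteSource = false leaves the HDFS copy in place; pass true to move instead.
FileUtil.copy(src.getFileSystem(conf), src, dst.getFileSystem(conf), dst, false, conf)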

Hadoop FileUtils not able to write files on local(Unix) filesystem from Scala

I'm trying to write a file to the local file system using the FileSystem library of org.apache.hadoop.fs. Below is the one-liner inside my larger Scala program that should be doing this, but isn't:
fs.copyToLocalFile(false, hdfsSourcePath, new Path(newFile.getAbsolutePath), true)
The value of newFile is:
val newFile = new File(s"${localPath}/fileName.dat")
localPath is just a variable containing the full path on local disk.
hdfsSourcePath is the full path on HDFS location.
The job executes properly, but I don't see the files created locally. I'm running it through Spark in cluster mode, which is why I used the copyToLocalFile overload whose 4th argument is useRawLocalFileSystem, setting it to true. I expected this to avoid the files being written on the executor node.
Any ideas?
I used the copyToLocalFile overload whose 4th argument is useRawLocalFileSystem, setting it to true. I expected this to avoid the files being written on the executor node.
I think you got this point wrong. In cluster mode the driver runs on an executor node, and the local file system is that executor's file system. useRawLocalFileSystem only prevents the writing of checksum files; it does not make the files appear on the machine that submitted the job, which is probably what you expected.
The best you can do is to save files to HDFS and retrieve them explicitly after the job finishes.
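For example, after the job has written its output to HDFS, a small standalone program run on the machine that needs the file could fetch it; both paths below are hypothetical:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object FetchResult extends App {
  val fs = FileSystem.get(new Configuration())
  // Copy the finished output from HDFS to the local disk of this machine.
  fs.copyToLocalFile(new Path("hdfs:///user/me/results/fileName.dat"),
                     new Path("/local/target/dir/fileName.dat"))
}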

Spark - Writing to csv results in _temporary file [duplicate]

After a Spark program completes, 3 temporary directories remain in the temp directory.
The directory names are like this: spark-2e389487-40cc-4a82-a5c7-353c0feefbb7
The directories are empty.
And when the Spark program runs on Windows, a snappy DLL file also remains in the temp directory.
The file name is like this: snappy-1.0.4.1-6e117df4-97b6-4d69-bf9d-71c4a627940c-snappyjava
They are created every time the Spark program runs, so the number of files and directories keeps growing.
How can I get them deleted?
Spark version is 1.3.1 with Hadoop 2.6.
UPDATE
I've traced the Spark source code.
The methods that create the 3 'temp' directories are as follows:
DiskBlockManager.createLocalDirs
HttpFileServer.initialize
SparkEnv.sparkFilesDir
They (eventually) call Utils.getOrCreateLocalRootDirs and then Utils.createDirectory, which intentionally does NOT mark the directory for automatic deletion.
The comment on the createDirectory method says: "The directory is guaranteed to be newly created, and is not marked for automatic deletion."
I don't know why they are not marked. Is this really intentional?
Three SPARK_WORKER_OPTS properties exist to support worker application folder cleanup; they are copied here for further reference from the Spark docs (an example setting follows the list):
spark.worker.cleanup.enabled, default value false: enables periodic cleanup of worker/application directories. Note that this only affects standalone mode, as YARN works differently. Only the directories of stopped applications are cleaned up.
spark.worker.cleanup.interval, default 1800 (i.e. 30 minutes): controls the interval, in seconds, at which the worker cleans up old application work dirs on the local machine.
spark.worker.cleanup.appDataTtl, default 7*24*3600 (7 days): the number of seconds to retain application work directories on each worker. This is a time-to-live and should depend on the amount of available disk space you have. Application logs and jars are downloaded to each application work dir; over time, the work dirs can quickly fill up disk space, especially if you run jobs very frequently.
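For example, in spark-env.sh on each worker, the three properties might be set like this (the values are just the documented defaults with cleanup switched on, not recommendations):
SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=1800 -Dspark.worker.cleanup.appDataTtl=604800"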
I assume you are using "local" mode only for testing purposes. I solved this issue by creating a custom temp folder before running a test and then deleting it manually (in my case I use local mode in JUnit, so the temp folder is deleted automatically).
You can change the path of Spark's temp folder with the spark.local.dir property:
SparkConf conf = new SparkConf().setMaster("local")
.setAppName("test")
.set("spark.local.dir", "/tmp/spark-temp");
After the test completes, I delete the /tmp/spark-temp folder manually.
I don't know how to make Spark clean up those temporary directories, but I was able to prevent the creation of the snappy-XXX files. This can be done in two ways:
Disable compression. Properties: spark.broadcast.compress, spark.shuffle.compress, spark.shuffle.spill.compress. See http://spark.apache.org/docs/1.3.1/configuration.html#compression-and-serialization
Use LZF as the compression codec. Spark uses native libraries for Snappy and LZ4, and because of the way JNI works, Spark has to unpack these libraries before using them. LZF appears to be implemented in pure Java.
I'm doing this during development (see the sketch below), but for production it is probably better to use compression and keep a script that cleans up the temp directories.
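A minimal sketch of both variants, assuming you build the SparkConf yourself:
import org.apache.spark.SparkConf

// Variant 1: disable compression entirely (development only).
val noCompression = new SparkConf()
  .set("spark.broadcast.compress", "false")
  .set("spark.shuffle.compress", "false")
  .set("spark.shuffle.spill.compress", "false")

// Variant 2: keep compression but use the pure-Java LZF codec, which
// avoids unpacking native Snappy/LZ4 libraries into the temp directory.
val lzfCodec = new SparkConf().set("spark.io.compression.codec", "lzf")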
I do not think cleanup is supported for all scenarios. I would suggest writing a simple Windows scheduled task to clean up nightly.
You need to call close() on the SparkContext that you created, at the end of the program.
spark.local.dir will only move Spark's own temp files; the snappy-xxx file will still be created in the /tmp dir.
Though I didn't find a way to make Spark clean it up automatically, you can set a JVM option:
JVM_EXTRA_OPTS=" -Dorg.xerial.snappy.tempdir=~/some-other-tmp-dir"
to make it go to another dir, as most systems have a small /tmp.

How to move all files in a directory to another in hadoop

I am trying to move all files from one directory to another directory within HDFS, using Spark with Scala. In short, I need a programmatic way to do what the hadoop fs -mv command does. As I am new to Hadoop, it would be great if someone could help.
In order to perform operations on HDFS, you just need to get a FileSystem instance, via
val fs = FileSystem.get(sparkContext.hadoopConfiguration)
then you can use its methods to do what you want on HDFS. If you want to move a file, look at the rename method.
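A minimal sketch of moving every file from one HDFS directory to another (the equivalent of hadoop fs -mv src/* dst; the directory paths are placeholders):
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sparkContext.hadoopConfiguration)
val srcDir = new Path("/data/incoming")
val dstDir = new Path("/data/archive")

// rename is a metadata-only move when source and target are on the same HDFS.
fs.listStatus(srcDir).foreach { status =>
  fs.rename(status.getPath, new Path(dstDir, status.getPath.getName))
}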