Does Spark handle resource management? - scala

I'm new to Apache Spark and I started learning Scala along with it. In the code snippet below, does Spark handle closing the text file once the program is done with it?
val rdd = context.textFile(filePath)
I know that in Java, when you open a file you have to close it with a try-catch-finally or try-with-resources.
In this example I am using a text file, but I want to know whether Spark handles closing resources in general once it is done with them, since RDDs can be built from many different kinds of data sets.

context.textFile() doesn't actually open the file; it just creates an RDD object. You can verify this experimentally by creating a textFile RDD for a file that doesn't exist: no error is thrown at that point. The file referenced by the RDD is only opened, read, and closed when you call an action, which is what causes Spark to run the IO and the transformations needed to produce the result you asked for.
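A minimal sketch of that lazy behaviour, assuming a SparkContext named context as in the question (the path is made up for illustration):
val lazyRdd = context.textFile("/path/that/does/not/exist.txt") // no error here: only an RDD object is created
// lazyRdd.count()  // only this action would trigger the read, and for a missing file it fails with an "Input path does not exist" error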

Related

NullPointerException when writing parquet

I am trying to measure how long it takes to read and write parquet files in Amazon S3 (under a specific partition).
For that I wrote a script that simply reads the files and then writes them back:
val df = sqlContext.read.parquet(path + "p1.parquet/partitionBy=partition1")
df.write.mode("overwrite").parquet(path + "p1.parquet/partitionBy=partition1")
However, I get a NullPointerException. I tried to add df.count in between, but got the same error.
The reason for the error is that Spark only reads the data when it is actually used. Here that means Spark is still reading from the files at the same time it is trying to overwrite them, and the data cannot be overwritten while it is being read.
I'd recommend saving to a temporary location, since this is for timing purposes. An alternative is to call .cache() on the data when reading, perform an action to force the read (and actually populate the cache), and then overwrite the files.
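A hedged sketch of the cache-then-overwrite variant, using the path and partition from the question:
val df = sqlContext.read.parquet(path + "p1.parquet/partitionBy=partition1").cache()
df.count()  // action forces the read and populates the cache
df.write.mode("overwrite").parquet(path + "p1.parquet/partitionBy=partition1")  // the write can now be served from the cached data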

How to write data as single (normal) csv file in Spark? [duplicate]

This question already has answers here: Write single CSV file using spark-csv.
I am trying to save a DataFrame as a CSV file on my local drive. But when I do so, a folder is generated and partition files are written inside it. Is there any suggestion to overcome this?
My Requirement:
To get a normal csv file with actual name given in the code.
Code Snippet:
dataframe.coalesce(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("E:/dataframe.csv")
TL;DR You are trying to enforce sequential, in-core concepts on a distributed environment. That cannot end well.
Spark doesn't provide a utility like this. To build one in a semi-distributed fashion, you'd have to implement a multistep, source-dependent protocol where:
You write the header.
You write data files for each partition.
You merge the files and give the result a new name.
Since this has limited applications, is useful only for smallish files, and can be very expensive with some sources (like object stores), nothing like this is implemented in Spark.
You can of course collect the data, use a standard CSV writer (Univocity, Apache Commons CSV), and then write to the storage of your choice, as sketched below. This is sequential and requires multiple data transfers.
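A minimal sketch of the collect-and-write route with plain java.io instead of a CSV library; quoting and escaping are ignored for brevity, and it only makes sense for data that fits on the driver:
import java.io.PrintWriter

val header = dataframe.columns.mkString(",")
val rows = dataframe.collect().map(_.mkString(","))  // pulls the whole data set to the driver

val out = new PrintWriter("E:/dataframe.csv")
try {
  out.println(header)
  rows.foreach(out.println)
} finally {
  out.close()  // unlike Spark's own IO, this handle must be closed explicitly
}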
There is no automatic way to do this. I see two solutions:
If the local directory is mounted on all the executors: write the file as you did, but then move/rename the part-*.csv file to the desired name.
Or, if the directory is not available on all executors: collect the DataFrame to the driver and then create the file using plain Scala, as in the collect-and-write sketch above.
But both solutions somewhat destroy parallelism and thus the point of using Spark.
It is not possible directly, but you can do something like this:
// Write a single partition into a directory, then rename the part file Spark produces.
dataframe.coalesce(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("E:/data/")

import org.apache.hadoop.fs._

val fs = FileSystem.get(sc.hadoopConfiguration)
val filePath = "E:/data/"
// Locate the single part-* file that was written and rename it to the desired name.
val fileName = fs.globStatus(new Path(filePath + "part*"))(0).getPath.getName
fs.rename(new Path(filePath + fileName), new Path(filePath + "dataframe.csv"))

Load DataFrame as Text File into HDFS and S3 [duplicate]

This question already has answers here: Write single CSV file using spark-csv.
I am trying to load a DataFrame into HDFS and S3 as a text file using the code below. The DataFrame's name is finalData.
import java.util.Calendar
import java.text.SimpleDateFormat

val targetPath = "/user/test/File"
val now = Calendar.getInstance().getTime()
val formatter = new SimpleDateFormat("yyyyMMddHHmmss")
val timeStampAfterFormatting = formatter.format(now)
val targetFile = s"""$targetPath/test_$timeStampAfterFormatting.txt"""
finalData.repartition(1).rdd.saveAsTextFile(targetFile)
Using the code above I can save the data successfully. But the file name is not the same as the one I provided, and it is not a plain text file either: a directory is created with the name I specified.
Directory Name - /user/test/File/test_20170918055206.txt
-bash-4.2$ hdfs dfs -ls /user/test/File/test_20170918055206.txt
Found 2 items
/user/test/File/test_20170918055206.txt/_SUCCESS
/user/test/File/test_20170918055206.txt/part-00000
I want to create the file with the name I mentioned instead of a directory. Can anyone please assist me?
In my opinion, this is working as designed.
You have a repartition operation just before you save your RDD data; this triggers a shuffle across the whole data set and eventually produces a new RDD with only one partition.
So only that one partition is stored in HDFS by your saveAsTextFile operation.
The method is designed this way so that an arbitrary number of partitions can be written in a uniform way.
For example, if your RDD has 100 partitions and you do no coalesce or repartition before writing to HDFS, you will get a directory containing a _SUCCESS flag and 100 part files.
If the method were not designed this way, how could an RDD with multiple partitions be stored in a concise, uniform and elegant way? Would the user have to supply all the file names? That would be rather tedious.
I hope this explanation helps you.
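A small illustration of that behaviour (the /tmp paths are placeholders; the exact part-file names come from Hadoop's output committer):
val demo = sc.parallelize(1 to 1000, 4)              // an RDD with 4 partitions
demo.saveAsTextFile("/tmp/four_parts")               // directory containing _SUCCESS and part-00000 .. part-00003
demo.repartition(1).saveAsTextFile("/tmp/one_part")  // directory containing _SUCCESS and a single part-00000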
If you then need a single complete file on your local file system, just try the hadoop client command:
hadoop fs -getmerge [src] [des]
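If you'd rather merge from inside the application, a hedged alternative is Hadoop's FileUtil.copyMerge, assuming a Hadoop 2.x client (the method was removed in Hadoop 3):
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val conf = sc.hadoopConfiguration
val fs = FileSystem.get(conf)
// Concatenate every part file under the output directory into one target file.
FileUtil.copyMerge(fs, new Path("/user/test/File/test_20170918055206.txt"),
                   fs, new Path("/user/test/test_20170918055206_merged.txt"),
                   false, conf, null)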

Saving files in Spark

There are two operations on an RDD for saving: one is saveAsTextFile and the other is saveAsObjectFile. I understand saveAsTextFile, but not saveAsObjectFile. I am new to Spark and Scala, hence I am curious about saveAsObjectFile.
1) Is it a sequence file from Hadoop or something different?
2) Can I read the files generated by saveAsObjectFile using MapReduce? If yes, how?
saveAsTextFile() - Saves the RDD as a text file, using string representations of the elements. It leverages Hadoop's TextOutputFormat. To get compression, use the overloaded method that accepts a CompressionCodec as the second argument. Refer to the RDD API.
saveAsObjectFile() - Saves the RDD's elements as a SequenceFile of serialized objects.
To read those sequence files back you can use SparkContext.objectFile("path of file"), which internally leverages Hadoop's SequenceFileInputFormat.
Alternatively you can use SparkContext.newAPIHadoopFile(...), which accepts a Hadoop InputFormat and a path as parameters.
rdd.saveAsObjectFile saves the RDD as a sequence file. To read those files back, use sparkContext.objectFile("fileName").
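A quick round-trip sketch of the calls both answers mention (the /tmp paths are placeholders):
// Write: elements are Java-serialized and stored in a Hadoop SequenceFile.
val pairs = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
pairs.saveAsObjectFile("/tmp/pairs_obj")

// Read back: supply the element type explicitly.
val restored = sc.objectFile[(Int, String)]("/tmp/pairs_obj")
restored.collect().foreach(println)

// For comparison, the compressed-text overload mentioned above:
pairs.saveAsTextFile("/tmp/pairs_txt", classOf[org.apache.hadoop.io.compress.GzipCodec])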

How to cache data in Apache Spark so that it can be used by other Spark jobs

I have simple Spark code in which I read a file using SparkContext.textFile(), then do some operations on that data, and I am using spark-jobserver for getting the output.
In the code I am caching the data, but after the job ends and I execute that Spark job again it does not pick up the file that is already in the cache. So the file is loaded every time, which takes more time.
Sample Code is as:
val sc = new SparkContext("local", "test")
val data=sc.textFile("path/to/file.txt").cache()
val lines=data.count()
println(lines)
Here, since I am reading the same file, I would expect the second execution to take the data from the cache, but it does not.
Is there any way using which I can share the cached data among multiple spark jobs?
Yes - by calling persist/cache on the RDD you get, and by submitting additional jobs on the same context. The cache lives inside the executors of a single SparkContext, so a new context (for example a fresh spark-jobserver job running with its own context) starts with an empty cache.
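A minimal sketch of how that looks within one context; with spark-jobserver you would configure a persistent context so successive jobs share it:
import org.apache.spark.SparkContext

val sc = new SparkContext("local[*]", "cache-demo")
val data = sc.textFile("path/to/file.txt").cache()

println(data.count())                      // first job: reads the file and populates the cache
println(data.filter(_.nonEmpty).count())   // second job on the same context: served from the cached partitions

sc.stop()                                  // stopping the context discards the cache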