Saving files in Spark - scala

There are two operations on an RDD for saving: one is saveAsTextFile and the other is saveAsObjectFile. I understand saveAsTextFile, but not saveAsObjectFile. I am new to Spark and Scala, so I am curious about saveAsObjectFile.
1) Is it a Hadoop SequenceFile or something different?
2) Can files generated with saveAsObjectFile be read using MapReduce? If yes, how?

saveAsTextFile() - Persists the RDD as a text file, using string representations of the elements. It leverages Hadoop's TextOutputFormat. To get compressed output, use the overloaded method that takes a CompressionCodec as its second argument. Refer to the RDD API.
saveAsObjectFile() - Persists the elements of the RDD as a SequenceFile of serialized objects.
To read those sequence files back, you can use SparkContext.objectFile("path of file"), which internally leverages Hadoop's SequenceFileInputFormat.
Alternatively, you can use SparkContext.newAPIHadoopFile(...), which accepts a Hadoop InputFormat and a path as parameters.
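To make both calls concrete, here is a minimal round-trip sketch; the paths, app name, and sample data are illustrative, not from the question:

import org.apache.hadoop.io.compress.GzipCodec
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("object-file-demo"))
val words = sc.parallelize(Seq("spark", "hadoop", "scala"))

// SequenceFile of serialized objects
words.saveAsObjectFile("hdfs:///tmp/words-obj")

// text output, compressed via the overload that takes a codec
words.saveAsTextFile("hdfs:///tmp/words-txt", classOf[GzipCodec])

// read the object file back; the element type is supplied explicitly
val restored = sc.objectFile[String]("hdfs:///tmp/words-obj")
restored.collect().foreach(println)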

rdd.saveAsObjectFile saves the RDD as a SequenceFile. To read those files back, use sparkContext.objectFile("fileName").

Related

How to make a BSON document splittable in Spark so it can be processed in parallel?

I am querying MongoDB from a Spark program, and it returns a BSON document. Since BSON is not a splittable format, Spark processes it sequentially with only one executor, even though multiple are available.
One solution I came across is to split the file using the BSONSplitter API and then compress it to LZO, since that is a splittable format.
I don't want to save the file to disk (i.e. write it out in .lzo format), as I still have to do some processing on it.
So my question is: is it possible to compress/split the BSON file in memory, create a DataFrame/RDD out of it, and then process it in a distributed manner?
or
Is an altogether different approach possible?
I am using Spark 1.6.
Please let me know if additional information is required.

Aggregate S3 data for Spark operation

I have data in S3 that's being written there with a directory structure as follows:
YYYY/MM/DD/HH
I am trying to write a program which will take a start and end date, aggregate the data between those dates, and then convert it to a Spark RDD to perform operations on.
Is there a clean way to do this aggregation without iterating over every combination of YYYY/MM/DD/HH and without having to walk a tree to find where it's safe to use wildcards?
I.e. if the input is start=2014/01/01/00 and end=2016/02/01/00, do I have to access each bucket individually (i.e. 2014/*/*/*, 2015/*/*/*, 2016/1/*/*)?
Are you using DataFrames? If so, you can leverage the API for loading data from multiple sources. Take a look at this documentation: https://spark.apache.org/docs/1.6.2/api/scala/index.html#org.apache.spark.sql.DataFrameReader
def load(paths: String*): DataFrame
The above method supports multiple source paths.
The discussion in this Stack Overflow thread might also be helpful: Reading multiple files from S3 in Spark by date period
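Building on that load(paths: String*) overload, one possible sketch (assuming Spark 1.6 with a SQLContext named sqlContext, Java 8's java.time, and a made-up bucket name and JSON format) is to expand the date range into explicit hourly prefixes and pass them all in one call:

import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

val fmt   = DateTimeFormatter.ofPattern("yyyy/MM/dd/HH")
val start = LocalDateTime.of(2014, 1, 1, 0, 0)
val end   = LocalDateTime.of(2016, 2, 1, 0, 0)

// every hourly prefix from start (inclusive) up to end (exclusive)
val paths = Iterator.iterate(start)(_.plusHours(1))
  .takeWhile(_.isBefore(end))
  .map(ts => s"s3n://my-bucket/${ts.format(fmt)}")
  .toSeq

val df = sqlContext.read.format("json").load(paths: _*)

Generating coarser prefixes (per day or per month, with wildcards where a whole unit is covered) works the same way; the point is that load accepts however many paths you build.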

Spark save partitions with custom filename and gzip

I would like to save my generated RDD partitions with custom filenames, such as chunk0.gz, chunk1.gz, etc., and I want them to be gzipped as well.
Using saveAsTextFile would result in a directory being created, with standard filenames part-00000.gz, etc.
fqPart.saveAsTextFile(outputFolder, classOf[GzipCodec])
How do I specify my own filenames? Would I have to iterate through the RDD partitions manually, write each one to a file, and then compress the resulting files myself?
Thanks in advance.
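One approach that is commonly suggested for this kind of renaming (a sketch under assumptions, not a definitive answer: it uses the old mapred API, and every name apart from fqPart and outputFolder is made up) is to subclass Hadoop's MultipleTextOutputFormat so the part number is rewritten to chunkN, and let the GzipCodec append the .gz extension:

import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

class ChunkOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    // "name" arrives as the default part name, e.g. "part-00000"; ".gz" is appended by the codec
    "chunk" + name.split("-").last.toInt
}

fqPart
  .map(line => (line, NullWritable.get()))   // saveAsHadoopFile needs a pair RDD
  .saveAsHadoopFile(outputFolder, classOf[String], classOf[NullWritable],
    classOf[ChunkOutputFormat], classOf[GzipCodec])

TextOutputFormat writes only the key when the value is a NullWritable, so each line comes out unchanged, gzipped, under the name returned by generateFileNameForKeyValue.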

Spark Streaming - processing binary data file

I'm using pyspark 1.6.0.
I have existing pyspark code that reads binary data files from an AWS S3 bucket. Other Spark/Python code parses the bits in the data to convert them into ints, strings, booleans, and so on. Each binary file holds one record of data.
In PYSPARK I read the binary file using:
sc.binaryFiles("s3n://.......")
This works great, as it gives a tuple of (filename, data), but I'm trying to find an equivalent PySpark Streaming API to read binary files as a stream (and hopefully get the filename too, if possible).
I tried:
binaryRecordsStream(directory, recordLength)
but I couldn't get this working...
Can anyone shed some light on how PySpark Streaming can read binary data files?
In Spark Streaming, the relevant concept is the fileStream API, which is available in Scala and Java, but not in Python - noted here in the documentation: http://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources. If the files you are reading can be read as text, you can use the textFileStream API.
I had a similar question for Java Spark, where I wanted to stream updates from S3. There was no trivial solution, since the binaryRecordsStream(<path>, <record length>) API is only for fixed-byte-length records, and I couldn't find an obvious equivalent to JavaSparkContext.binaryFiles(<path>). The solution, after reading what binaryFiles() does under the covers, was to do this:
JavaPairInputDStream<String, PortableDataStream> rawAuctions =
        sc.fileStream("s3n://<bucket>/<folder>",
                String.class, PortableDataStream.class, StreamInputFormat.class);
Then parse the individual byte messages from the PortableDataStream objects. I apologize for the Java context, but perhaps there is something similar you can do with PySpark.

Does Spark handle resource management?

I'm new to Apache Spark and I started learning Scala along with Spark. In this code snippet, does Spark handle closing the text file when the program is done with it?
val rdd = context.textFile(filePath)
I know that in Java, when you open a file, you have to close it with try-catch-finally or try-with-resources.
In this example I mention a text file, but I want to know whether Spark handles closing resources once it is done with them, since RDDs can be built from many different types of data sets.
context.textFile() doesn't actually open the file; it just creates an RDD object. You can verify this experimentally by creating a textFile RDD for a file which doesn't exist - no error will be thrown. The file referenced by the RDD is only opened, read, and closed when you call an action, which causes Spark to run the IO and the data transformations that produce the result you asked for.
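A tiny illustration of that laziness (the path here is made up):

val rdd = context.textFile("/path/that/does/not/exist")   // no error: nothing is opened yet
// rdd.count()   // only an action like this triggers the read (and, here, a failure)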