Aggregate S3 data for Spark operation - scala

I have data in S3 that's being written there with a directory structure as follows:
YYYY/MM/DD/HH
I am trying to write a program that will take in a start date and an end date, aggregate the data between those dates, and then convert it to a Spark RDD to perform operations on.
Is there a clean way to do this aggregation without iterating over every combination of YYYY/MM/DD/HH and without having to build a tree to find where it is safe to use wildcards?
I.e. if the input is start=2014/01/01/00 and end=2016/02/01/00, do I have to access each prefix individually (i.e. 2014/*/*/*, 2015/*/*/*, 2016/01/*/*)?

Are you using DataFrames? If that is the case, then you can leverage the API for loading data from multiple sources. Take a look at this documentation: https://spark.apache.org/docs/1.6.2/api/scala/index.html#org.apache.spark.sql.DataFrameReader
def load(paths: String*): DataFrame
The above method supports multiple sources.
The discussion in this Stack Overflow thread might also be helpful for you: Reading multiple files from S3 in Spark by date period
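For example, a minimal sketch of the idea (the bucket name, file format and helper name below are hypothetical, and it assumes Java 8's java.time is available): expand the start/end hours into explicit prefixes and pass them all to a single load call.

import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// Hypothetical helper: expand start/end (inclusive, both "yyyy/MM/dd/HH") into hourly S3 prefixes
def hourlyPaths(bucket: String, start: String, end: String): Seq[String] = {
  def parse(s: String) = {
    val Array(y, m, d, h) = s.split("/").map(_.toInt)
    LocalDateTime.of(y, m, d, h, 0)
  }
  val fmt  = DateTimeFormatter.ofPattern("yyyy/MM/dd/HH")
  val last = parse(end)
  Iterator.iterate(parse(start))(_.plusHours(1))
    .takeWhile(!_.isAfter(last))
    .map(t => s"s3n://$bucket/${t.format(fmt)}")   // scheme (s3n/s3a) depends on your setup
    .toSeq
}

val paths = hourlyPaths("my-bucket", "2014/01/01/00", "2016/02/01/00")
val df = sqlContext.read.format("json").load(paths: _*)   // format("json") is just an example

The same path list can also be joined with commas and handed to sc.textFile if you are working with plain RDDs.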

Related

pyspark dataframe writing results

I am working on a project where I need to read 12 files; the average file size is 3 GB. I read them with RDDs and create a DataFrame with spark.createDataFrame. Now I need to run 30 SQL queries on the DataFrame; most of them depend on the output of the previous one, so I save all my intermediate state in DataFrames and create a temp view for each of them.
The program takes only 2 minutes for the execution part, but writing the results to a CSV file, showing them, or calling count() takes too much time. I have tried repartitioning, but it still takes too much time.
1. What could be the solution?
2. Why does writing take so much time when all the processing takes only a small amount of time?
I solved the above problem with persist and cache in PySpark.
Spark is lazily evaluated. There are two types of Spark RDD operations: transformations and actions. A transformation is a function that produces a new RDD from existing RDDs; only when we want to work with the actual dataset is an action performed. When an action is triggered, no new RDD is formed, unlike with a transformation.
Every operation I performed was just a transformation, so each time I referenced that particular DataFrame, Spark re-ran its parent query because of lazy evaluation. Adding persist stopped the parent query from being re-run multiple times, which saved a lot of processing time.
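In Scala the same pattern looks roughly like this (paths and column names are invented for illustration; the PySpark calls are identical apart from syntax):

val base  = sqlContext.read.parquet("hdfs:///data/input")   // hypothetical input
val step1 = base.filter("amount > 0")
val step2 = step1.groupBy("customer").count()

// Cache the intermediate result that later queries depend on,
// so each action below reuses it instead of recomputing step1/step2.
step2.cache()                       // cache() == persist(StorageLevel.MEMORY_ONLY)
step2.registerTempTable("step2")    // Spark 1.x API; createOrReplaceTempView in 2.x

step2.count()   // first action materializes and caches the result
step2.show()    // reuses the cached data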

How to make a BSON document splittable in Spark so it can be processed in parallel?

I am querying MongoDB from a Spark program, and it returns a BSON document. Since BSON is not a splittable format, it is processed sequentially by a single executor, even though multiple executors are available.
One solution I came across is to split the file using the BSONSplitter API and then compress it to LZO, as that is a splittable format.
I don't want to save the file to disk (i.e. write it out in .lzo format), as I have to do some processing on it.
So my question is: is it possible to compress/split the BSON file in memory, create a DataFrame/RDD out of it, and then process it in a distributed manner?
Or is there a different approach possible altogether?
I am using Spark 1.6.
Please let me know if additional information is required.
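One direction worth considering (not from the original thread): the mongo-hadoop connector can split the collection on the server side instead of handing Spark one large BSON stream, so each split becomes a separate task. A rough sketch, assuming the mongo-hadoop core jar is on the classpath and using a made-up connection URI:

import org.apache.hadoop.conf.Configuration
import org.bson.BSONObject
import com.mongodb.hadoop.MongoInputFormat

val mongoConf = new Configuration()
mongoConf.set("mongo.input.uri", "mongodb://host:27017/mydb.mycollection")  // hypothetical URI

// MongoInputFormat reports multiple input splits, so the resulting RDD
// is read and processed by several executors in parallel.
val docs = sc.newAPIHadoopRDD(
  mongoConf,
  classOf[MongoInputFormat],
  classOf[Object],
  classOf[BSONObject])

docs.map { case (_, doc) => doc.get("someField") }.take(10)   // "someField" is a placeholder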

How can I obtain the DAG of an Apache Spark job without running it?

I have some Scala code that I can run with Spark using spark-submit. From what I understand, Spark creates a DAG in order to schedule the operations.
Is there a way to retrieve this DAG without actually performing the heavy operations, e.g. just by analyzing the code?
I would like a useful representation such as a data structure, or at least a written representation, not the DAG visualization.
If you are using DataFrames (Spark SQL) you can use df.explain(true) to get the plan and all operations (before and after optimization).
If you are using RDDs you can use rdd.toDebugString to get a string representation and rdd.dependencies to get the tree itself.
If you use these without an actual action, you get a representation of what is going to happen without actually doing the heavy lifting.
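For instance (the input paths and column names here are invented for illustration):

// DataFrame: print the parsed, analyzed, optimized and physical plans without running the job
val df = sqlContext.read.json("hdfs:///logs/events.json")
df.filter("status = 'ok'").groupBy("user").count().explain(true)

// RDD: print the lineage without triggering an action
val rdd = sc.textFile("hdfs:///logs/events.txt")
  .map(_.split(","))
  .filter(_.length > 1)
println(rdd.toDebugString)
println(rdd.dependencies)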

Saving files in Spark

There are two operations on an RDD for saving: one is saveAsTextFile and the other is saveAsObjectFile. I understand saveAsTextFile, but not saveAsObjectFile. I am new to Spark and Scala, hence I am curious about saveAsObjectFile.
1) Is it a sequence file from Hadoop or something different?
2) Can I read the files generated by saveAsObjectFile using MapReduce? If yes, how?
saveAsTextFile() - Persists the RDD as a text file, using string representations of the elements. It leverages Hadoop's TextOutputFormat. To get compression, use the overloaded method that accepts a CompressionCodec as its second argument. Refer to the RDD API.
saveAsObjectFile() - Persists the elements of the RDD as a SequenceFile of serialized objects.
When reading those sequence files back, you can use SparkContext.objectFile("path of file"), which internally leverages Hadoop's SequenceFileInputFormat to read the files.
Alternatively, you can also use SparkContext.newAPIHadoopFile(...), which accepts Hadoop's InputFormat and a path as parameters.
rdd.saveAsObjectFile saves the RDD as a sequence file. To read those files, use sparkContext.objectFile("fileName").
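A small round-trip example (the paths are made up):

val events = sc.parallelize(Seq((1, "open"), (2, "close")))

// Writes a SequenceFile whose values are Java-serialized objects
events.saveAsObjectFile("hdfs:///tmp/events_obj")

// Reading it back requires stating the element type explicitly
val restored = sc.objectFile[(Int, String)]("hdfs:///tmp/events_obj")

// For comparison, plain text output readable by Hadoop's TextInputFormat
events.saveAsTextFile("hdfs:///tmp/events_txt")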

Custom scalding tap (or Spark equivalent)

I am trying to dump some data that I have on a Hadoop cluster, usually in HBase, with a custom file format.
What I would like to do is more or less the following:
start from a distributed list of records, such as a Scalding pipe or similar
group items by some computed function
make it so that items belonging to the same group reside on the same server
on each group, apply a transformation - that involves sorting - and write the result to disk. In fact I need to write a bunch of MapFiles - which are essentially sorted SequenceFiles, plus an index.
I would like to implement the above with Scalding, but I am not sure how to do the last step.
While of course one cannot write sorted data in a distributed fashion, it should still be doable to split data into chunks and then write each chunk sorted locally. Still, I cannot find any implementation of MapFile output for map-reduce jobs.
I recognize it is a bad idea to sort very large data, and this is the reason why, even on a single server, I plan to split the data into chunks.
Is there any way to do something like this with Scalding? Possibly I would be OK with using Cascading directly, or really another pipeline framework, such as Spark.
Using Scalding (and the underlying Map/Reduce) you will need to use the TotalOrderPartitioner, which pre-samples the input data to create appropriate buckets/splits.
Using Spark will speed things up thanks to faster access paths to the disk data, but it will still require shuffles to disk/HDFS, so it will not be orders of magnitude better.
In Spark you would use a RangePartitioner, which takes the number of partitions and an RDD:
val allData = sc.hadoopRDD(...)   // load the input paths as a key/value (pair) RDD
val partitionedRdd = allData.partitionBy(new RangePartitioner(numPartitions, allData))
val groupedRdd = partitionedRdd.groupByKey()
// apply further transforms..
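A possible Spark sketch of the last step, writing each range-partition sorted locally as a MapFile (the names, sample data and output path are hypothetical, and it assumes the keys/values can be converted to Hadoop Writables):

import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.lib.output.MapFileOutputFormat
import org.apache.spark.RangePartitioner

val numPartitions = 2
val keyed = sc.parallelize(Seq("b" -> "2", "a" -> "1", "c" -> "3"))   // stand-in for the real records

// One shuffle: range-partition by key and sort inside each partition,
// so no single node ever performs a global sort.
val sorted = keyed.repartitionAndSortWithinPartitions(
  new RangePartitioner(numPartitions, keyed))

// MapFileOutputFormat requires the keys of each output file to be in order,
// which the step above guarantees per partition.
sorted
  .map { case (k, v) => (new Text(k), new Text(v)) }
  .saveAsNewAPIHadoopFile(
    "hdfs:///tmp/mapfiles",
    classOf[Text], classOf[Text],
    classOf[MapFileOutputFormat])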