Spark - Efficient way of reading multiple versions of S3 object into DataFrame - scala

I want to read the N last versions of an S3 object and put them all into a Map[version, DataFrame] structure. Each S3 object is a JSON-lines file of approximately 2 GB. The S3A client does not support passing a versionId as far as I can see, so I cannot use that approach. Can anyone suggest an efficient alternative? The only thing I can think of is to create a normal AmazonS3 client and get the S3 object using the SDK. However, I'm not too experienced in Spark/Scala and I'm not sure how to then convert it into a DataFrame.
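A hedged sketch of that SDK-based route (not tested; it assumes the AWS SDK v1 on the classpath and Spark 2.2+ for spark.read.json over a Dataset[String]; the bucket, key and N are placeholders):

import scala.collection.JavaConverters._
import scala.io.Source
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.GetObjectRequest
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

val s3 = AmazonS3ClientBuilder.defaultClient()
val bucket = "my-bucket"            // placeholder
val key = "path/to/object.jsonl"    // placeholder
val n = 3                           // number of versions to read

// list the N most recently modified versions of this key
val versionIds = s3.listVersions(bucket, key).getVersionSummaries.asScala
  .filter(_.getKey == key)
  .sortBy(-_.getLastModified.getTime)
  .take(n)
  .map(_.getVersionId)

// for each version, pull the JSON-lines content to the driver and let Spark parse it;
// note this pulls ~2 GB per version through the driver, so it only works if the driver
// has enough memory (copying each version to HDFS/S3 first would scale better)
val byVersion: Map[String, DataFrame] = versionIds.map { vid =>
  val obj = s3.getObject(new GetObjectRequest(bucket, key, vid))
  val lines = Source.fromInputStream(obj.getObjectContent, "UTF-8").getLines().toList
  obj.close()
  vid -> spark.read.json(lines.toDS())
}.toMap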

Related

EMR-Spark is slow writing a DataFrame with an Array of Strings to S3

I'm trying to write a dataframe to S3 from EMR-Spark and I'm seeing some really slow write times, where the writing comes to dominate the total runtime (~80%) of the script. For what it's worth, I've tried both .csv and .parquet formats; it doesn't seem to make a difference.
My data can be formatted in two ways, here's the preferred format:
ID : StringType | ArrayOfIDs : ArrayType
(The number of unique IDs in the first column numbers in the low millions. ArrayOfIDs contains GUID formatted strings, and can contain anywhere from ~100 - 100,000 elements)
Writing the first form to S3 is incredibly slow. For what it's worth, I've tried setting the mapreduce.fileoutputcommitter.algorithm.version to 2 as described here: https://issues.apache.org/jira/browse/SPARK-20107 to no real effect.
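(For reference, that committer setting is usually applied either as a spark-submit conf or directly on the Hadoop configuration; a hedged example, not specific to this job:)

// via spark-submit:
//   spark-submit --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 ...
// or programmatically on the SparkContext's Hadoop configuration:
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2")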
However my data can also be formatted as an adjacency list, like this:
ID1 : StringType | ID2 : StringType
This appears to be much faster for writing to S3, but I am at a loss for why. Here are my specific questions:
Ultimately I'm trying to get my data into an Aurora RDS Postgres cluster (I was told firmly by those before me that the Spark JDBC connector is too slow for the job, which is why I'm currently trying to dump the data in S3 before loading it into Postgres with a COPY command). I'm not married to using S3 as an intermediate store if there are better alternatives for getting these data frames into RDS Postgres.
I don't know why the first schema, with the Array of Strings, is so much slower on write. The total data written is actually far less than with the second schema, on account of eliminating ID duplication from the first column. It would also be nice to understand this behavior.
Well, I still don't know why writing arrays directly from Spark is so much slower than the adjacency list format. But best practice seems to dictate that I avoid writing to S3 directly from Spark.
Here's what I'm doing now (a rough sketch follows the steps):
Write the data to HDFS (anecdotally, the write speed of the adjacency list vs the array now falls in line with my expectations).
From HDFS, use EMR's s3-dist-cp utility to wholesale write the data to S3 (this also seems reasonably performant with array typed data).
Bring the data into Aurora Postgres with the aws_s3.table_import_from_s3 extension.
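A hedged sketch of those three steps (df stands for the DataFrame being written; paths, bucket, table and column names are placeholders, and the s3-dist-cp and SQL commands are shown as comments):

// 1. write from Spark to HDFS instead of S3
df.write.mode("overwrite").option("header", "false").csv("hdfs:///tmp/id_pairs")

// 2. copy the output from HDFS to S3 with EMR's s3-dist-cp (run on the master node):
//      s3-dist-cp --src hdfs:///tmp/id_pairs --dest s3://my-bucket/id_pairs/

// 3. load into Aurora Postgres with the aws_s3 extension (one call per file):
//      SELECT aws_s3.table_import_from_s3(
//        'my_table', 'id1, id2', '(format csv)',
//        aws_commons.create_s3_uri('my-bucket', 'id_pairs/<part-file>.csv', 'us-east-1'));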

How to write data as single (normal) csv file in Spark? [duplicate]

This question already has answers here:
Write single CSV file using spark-csv
(16 answers)
Closed 5 years ago.
I am trying to save a data frame as a CSV file on my local drive. But when I do so, a folder is generated and the partition files are written inside it. Is there any suggestion to overcome this?
My Requirement:
To get a normal CSV file with the actual name given in the code.
Code Snippet:
dataframe.coalesce(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("E:/dataframe.csv")
TL;DR You are trying to enforce sequential, in-core concepts on a distributed environment. It cannot end well.
Spark doesn't provide a utility like this one. To be able to create one in a semi-distributed fashion, you'd have to implement a multistep, source-dependent protocol where:
You write header.
You write data files for each partition.
You merge the files, and give a new name.
Since this has limited applications, is useful only for smallish files, and can be very expensive with some sources (like object stores), nothing like this is implemented in Spark.
You can of course collect the data, use a standard CSV library (Univocity, Apache Commons CSV) and then put the result on the storage of your choice. This is sequential and requires multiple data transfers.
There is no automatic way to do this. I see two solutions:
If the local directory is mounted on all the executors: write the file as you did, but then move/rename the part-*.csv file to the desired name.
Or, if the directory is not available on all executors: collect the dataframe to the driver and then create the file using plain Scala, as sketched below.
But both solutions kind of destroy parallelism and thus the point of Spark.
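A hedged sketch of the second option, collecting to the driver and writing one file with plain Scala (naive quoting, no escaping of commas or quotes; only sensible when the data fits in driver memory):

import java.io.PrintWriter

val rows = dataframe.collect()                      // brings all data to the driver
val writer = new PrintWriter("E:/dataframe.csv")
try {
  writer.println(dataframe.columns.mkString(","))   // header line
  rows.foreach(r => writer.println(r.toSeq.mkString(",")))
} finally {
  writer.close()
}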
It is not possible, but you can do something like this:
// write to a directory first
dataframe.coalesce(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv("E:/data/")
// then locate the single part-* file and rename it
import org.apache.hadoop.fs._
val fs = FileSystem.get(sc.hadoopConfiguration)
val filePath = "E:/data/"
val fileName = fs.globStatus(new Path(filePath + "part*"))(0).getPath.getName
fs.rename(new Path(filePath + fileName), new Path(filePath + "dataframe.csv"))

"sqlContext.read.json" takes very long time to read 30,000 small JSON files (400 Kb) from S3

I am stuck with the following problem. I have around 30,000 JSON files stored in S3 inside a particular bucket. These files are very small; each one is only 400-500 KB, but their quantity is not so small.
I want to create DataFrame based on all these files. I am reading JSON files using wildcard as follows:
var df = sqlContext.read.json("s3n://path_to_bucket/*.json")
I also tried this approach since json(...) is deprecated:
var df = sqlContext.read.format("json").load("s3n://path_to_bucket/*.json")
The problem is that it takes a very long time to create df. I was waiting 4 hours and the Spark job was still running.
Is there any more efficient approach to collect all these JSON files and create a DataFrame based on them?
UPDATE:
Or at least, is it possible to read the last 1000 files instead of reading all of them? I found out that one can pass options as follows: sqlContext.read.format("json").options, however I cannot figure out how to read only the N newest files.
If you can get the names of the 1000 most recently modified files into a simple list, you can simply call:
sqlContext.read.format("json").json(filePathsList: _*)
Please note that the .option call(s) are usually used to configure schema options.
Unfortunately, I haven't used S3 before, but I think you can use the same logic in the answer to this question to get the last modified file names:
How do I find the last modified file in a directory in Java?
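For the S3 case, a hedged sketch (untested) of the same idea with the AWS SDK: page through the object listing, keep the 1000 most recently modified keys, and hand them to the JSON reader. The bucket name is a placeholder, and the varargs json(...) overload assumes a Spark version that supports it, as in the snippet above.

import scala.collection.JavaConverters._
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.{ListObjectsV2Request, S3ObjectSummary}

val s3 = AmazonS3ClientBuilder.defaultClient()
val bucket = "path_to_bucket"   // placeholder

// page through the listing (ListObjectsV2 returns at most 1000 keys per call,
// so ~30,000 files need several pages via continuation tokens)
var summaries = List.empty[S3ObjectSummary]
var req = new ListObjectsV2Request().withBucketName(bucket)
var listing = s3.listObjectsV2(req)
summaries ++= listing.getObjectSummaries.asScala
while (listing.isTruncated) {
  req = req.withContinuationToken(listing.getNextContinuationToken)
  listing = s3.listObjectsV2(req)
  summaries ++= listing.getObjectSummaries.asScala
}

// keep the 1000 most recently modified JSON files and read only those
val newestPaths = summaries
  .filter(_.getKey.endsWith(".json"))
  .sortBy(-_.getLastModified.getTime)
  .take(1000)
  .map(s => s"s3n://$bucket/${s.getKey}")

val df = sqlContext.read.json(newestPaths: _*)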
You are loading something like 13 GB of information. Are you sure it takes that long just to create the DF? Maybe it's spending the time in the rest of the application, but the UI shows it under that stage.
Try just to load and print the first row of the DF.
Anyway, what is the configuration of the cluster?

Spark Streaming - processing binary data file

I'm using pyspark 1.6.0.
I have existing pyspark code to read binary data files from an AWS S3 bucket. Other Spark/Python code will parse the bits in the data to convert them into int, string, boolean, etc. Each binary file has one record of data.
In PYSPARK I read the binary file using:
sc.binaryFiles("s3n://.......")
This is working great, as it gives a tuple of (filename, data), but I'm trying to find an equivalent PySpark streaming API to read binary files as a stream (hopefully with the filename too, if possible).
I tried:
binaryRecordsStream(directory, recordLength)
but I couldn't get this working...
Can anyone shed some light on how PySpark streaming can read a binary data file?
In Spark Streaming, the relevant concept is the fileStream API, which is available in Scala and Java, but not in Python, as noted here in the documentation: http://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources. If the file you are reading can be read as a text file, you can use the textFileStream API.
I had a similar question for Java Spark where I wanted to stream updates from S3, and there was no trivial solution, since the binaryRecordsStream(<path>,<record length>) API was only for fixed-byte-length records, and I couldn't find an obvious equivalent to JavaSparkContext.binaryFiles(<path>). The solution, after reading what binaryFiles() does under the covers, was to do this:
JavaPairInputDStream<String, PortableDataStream> rawAuctions =
sc.fileStream("s3n://<bucket>/<folder>",
String.class, PortableDataStream.class, StreamInputFormat.class);
Then parse the individual byte messages from the PortableDataStream objects. I apologize for the Java context, but perhaps there is something similar you can do with PySpark.

Apache spark streaming - cache dataset for joining

I'm considering using Apache Spark streaming for some real-time work but I'm not sure how to cache a dataset for use in a join/lookup.
The main input will be JSON records coming from Kafka that contain an Id, and I want to translate that id into a name using a lookup dataset. The lookup dataset resides in MongoDB, but I want to be able to cache it inside the Spark process, as the dataset changes very rarely (once every couple of hours). I don't want to hit Mongo for every input record or reload all the records in every Spark batch, but I need to be able to update the data held in Spark periodically (e.g. every 2 hours).
What is the best way to do this?
Thanks.
I've thought long and hard about this myself. In particular I've wondered whether it is possible to actually implement a database of sorts in Spark.
Well, the answer is kind of yes. You want a program that first caches the main data set into memory, then every couple of hours does an optimized join-with-tiny to update the main data set. Now apparently Spark will have a method that does a join-with-tiny (maybe it's already out in 1.0.0 - my stack is stuck on 0.9.0 until CDH 5.1.0 is out).
Anyway, you can manually implement a join-with-tiny, by taking the periodic bi-hourly dataset and turning it into a HashMap then broadcasting it as a broadcast variable. What this means is that the HashMap will be copied, but only once per node (compare this with just referencing the Map - it would be copied once per task - a much greater cost). Then you take your main dataset and add on the new records using the broadcasted map. You can then periodically (nightly) save to hdfs or something.
So here is some scruffy pseudo code to elucidate:
// parseJsonAndGetTheKey, update and everyTwoHoursDo are still placeholders
var mainDataSet: RDD[(KeyType, DataType)] = sc.textFile("/path/to/main/dataset")
  .map(parseJsonAndGetTheKey)
  .cache()

everyTwoHoursDo {
  // the small, recently-changed dataset, collected to the driver as a Map
  val newData: Map[KeyType, DataType] = sc.textFile("/path/to/last/two/hours")
    .map(parseJsonAndGetTheKey)
    .collect().toMap

  // broadcast it so each node receives exactly one copy
  val newDataBr = sc.broadcast(newData)

  // fold the broadcast map into the main dataset
  val mainDataSetNew = mainDataSet.map { case (key, oldValue) =>
    (key, newDataBr.value.get(key)
      .map(newDataValue => update(oldValue, newDataValue))
      .getOrElse(oldValue))
  }.cache()

  mainDataSetNew.count() // any action, just to force execution
  mainDataSet.unpersist()
  mainDataSet = mainDataSetNew
}
I've also thought that you could be very clever and use a custom partitioner with your own custom index, and then use a custom way of updating the partitions so that each partition itself holds a submap. Then you can skip updating partitions that you know won't hold any keys that occur in the newData, and also optimize the updating process.
I personally think this is a really cool idea, and the nice thing is your dataset is already in memory, ready for some analysis / machine learning. The downside is you're kind of reinventing the wheel a bit. It might be a better idea to look at using Cassandra, as Datastax is partnering with Databricks (the people who make Spark) and might end up supporting some kind of thing like this out of the box.
Further reading:
http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables
http://www.datastax.com/2014/06/datastax-unveils-dse-45-the-future-of-the-distributed-database-management-system
Here is a fairly simple work-flow:
For each batch of data:
Convert the batch of JSON data to a DataFrame (b_df).
Read the lookup dataset from MongoDB as a DataFrame (m_df). Then cache it: m_df.cache()
Join the data using b_df.join(m_df, "join_field")
Perform your required aggregation and then write to a data source.
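A hedged Scala sketch of that flow (the Kafka DStream setup, the "mongo" source of the MongoDB Spark connector, "join_field", the "name" column and the output path are all assumptions, and spark.read.json over a Dataset[String] assumes Spark 2.2+):

// lookup dataset from MongoDB, cached in the Spark process
var m_df = spark.read.format("mongo").load().cache()

// messages is assumed to be a DStream[String] of JSON records coming from Kafka
messages.foreachRDD { rdd =>
  import spark.implicits._
  val b_df = spark.read.json(rdd.toDS())         // batch of JSON -> DataFrame
  val joined = b_df.join(m_df, "join_field")     // id -> name lookup
  joined.groupBy("name").count()                 // required aggregation
    .write.mode("append").parquet("/path/to/output")
}

// every couple of hours, refresh the cached lookup data:
//   m_df.unpersist(); m_df = spark.read.format("mongo").load().cache()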