I'm using pyspark 1.6.0.
I have existing PySpark code that reads binary data files from an AWS S3 bucket. Other Spark/Python code parses the bits in the data to convert them into ints, strings, booleans, etc. Each binary file holds one record of data.
In PySpark I read the binary file using:
sc.binaryFiles("s3n://.......")
This works great, as it gives a tuple of (filename, data), but I'm trying to find an equivalent PySpark Streaming API to read a binary file as a stream (and hopefully get the filename too, if possible).
I tried:
binaryRecordsStream(directory, recordLength)
but I couldn't get this working...
Can anyone shed some light on how PySpark Streaming can read a binary data file?
In Spark Streaming, the relevant concept is the fileStream API, which is available in Scala and Java but not in Python, as noted in the documentation: http://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources. If the file you are reading can be read as a text file, you can use the textFileStream API.
I had a similar question for Java Spark where I wanted to stream updates from S3, and there was no trivial solution, since the binaryRecordsStream(<path>,<record length>) API is only for fixed-length byte records, and I couldn't find an obvious equivalent to JavaSparkContext.binaryFiles(<path>). The solution, after reading what binaryFiles() does under the covers, was to do this:
JavaPairInputDStream<String, PortableDataStream> rawAuctions =
    sc.fileStream("s3n://<bucket>/<folder>",
        String.class, PortableDataStream.class, StreamInputFormat.class);
Then parse the individual byte messages from the PortableDataStream objects (note that sc here is a JavaStreamingContext, since fileStream is a streaming API). I apologize for the Java context, but perhaps there is something similar you can do with PySpark.
Related
I want to read the N last versions of an S3 object and put them all into a Map[version, DataFrame] structure. Each S3 object is a JSON Lines file of approximately 2 GB. The S3A client does not support passing a versionId as far as I can see, so I cannot use that approach. Can anyone suggest an efficient alternative? The only thing I can think of is to create a normal AmazonS3 client and fetch the S3 object using the SDK. However, I'm not very experienced in Spark/Scala and not sure how to then convert it into a DataFrame.
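One possible approach, as a rough sketch (assuming the AWS SDK for Java v1 and Spark 2.2+; the bucket, key and version ids below are placeholders): fetch each version through the SDK, read its JSON lines on the driver, and let Spark infer the schema. Note that this pulls every ~2 GB object through the driver, so it needs generous driver memory or a variant that spools to temporary storage first.
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.GetObjectRequest
import org.apache.spark.sql.{DataFrame, SparkSession}
import scala.io.Source

val spark = SparkSession.builder().appName("versioned-s3").getOrCreate()
import spark.implicits._

val s3 = AmazonS3ClientBuilder.defaultClient()

def readVersion(bucket: String, key: String, versionId: String): DataFrame = {
  // GetObjectRequest accepts the versionId directly, which s3a:// paths cannot express
  val obj = s3.getObject(new GetObjectRequest(bucket, key, versionId))
  val lines = Source.fromInputStream(obj.getObjectContent).getLines().toList
  // each line is one JSON document; Spark infers the schema
  spark.read.json(lines.toDS())
}

// hypothetical bucket, key and version ids
val versionIds = Seq("v1", "v2", "v3")
val byVersion: Map[String, DataFrame] =
  versionIds.map(v => v -> readVersion("my-bucket", "my-key", v)).toMap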
I have a file AA.zip which in turn contains multiple files, for example aa.tar.gz, bb.tar.gz, etc.
I need to read these files in Spark with Scala. How can I achieve that?
The only problem here is extracting the contents of the zip file.
ZIPs on HDFS are going to be a bit tricky because they don't split well, so you'll have to process one or more whole zip files per executor. This is also one of the few cases where you probably have to fall back to SparkContext, because for some reason binary file support in Spark is not that good.
https://spark.apache.org/docs/2.4.0/api/scala/index.html#org.apache.spark.SparkContext
There's a binaryFiles method there which gives you access to the zip file's binary data, which you can then process with the usual ZIP handling from Java or Scala.
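A rough sketch of that approach (the path is a placeholder; each zip is read in full by a single task, and in your case every entry is itself a .tar.gz that still needs further unpacking):
import java.util.zip.ZipInputStream
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable.ArrayBuffer

val sc = new SparkContext(new SparkConf().setAppName("read-zips"))

// one (path, PortableDataStream) pair per zip file; zips don't split, so each is handled by one task
val entries = sc.binaryFiles("hdfs:///data/AA.zip").flatMap { case (zipPath, stream) =>
  val zis = new ZipInputStream(stream.open())
  val out = ArrayBuffer.empty[(String, Array[Byte])]
  var entry = zis.getNextEntry
  while (entry != null) {
    if (!entry.isDirectory) {
      // read the current entry (e.g. aa.tar.gz) fully into memory as raw bytes
      val bytes = Stream.continually(zis.read()).takeWhile(_ != -1).map(_.toByte).toArray
      out += ((entry.getName, bytes))
    }
    entry = zis.getNextEntry
  }
  zis.close()
  out
}

entries.keys.collect().foreach(println)  // list the file names found inside the zip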
Apache Beam's TextIO can be used to read JSON files from some filesystems, but how can I create a PCollection out of a large JSON payload (an InputStream) resulting from an HTTP response in the Java SDK?
I don't think there's a generic built-in solution in Beam to do this at the moment; see the list of supported IOs.
I can think of multiple approaches; whichever works for you may depend on your requirements:
I would probably first try to build another layer (probably not in Beam) that saves the HTTP output into a GCS bucket (maybe splitting it into multiple files in the process) and then use Beam's TextIO to read from the GCS bucket.
Depending on the properties of the HTTP source, you could also consider:
writing your own ParDo that reads the whole response in a single step, splits it, and outputs the split elements separately (see the sketch after these suggestions); further transforms would then parse the JSON or do other processing;
implementing your own source, which will be more complicated but will probably work better for very large (unbounded) responses.
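For the ParDo option, here is a rough sketch of what the splitting step could look like. It uses Beam's Java SDK from Scala to match the other examples on this page, and it assumes the response body is newline-delimited JSON arriving as a single string element:
import org.apache.beam.sdk.transforms.DoFn
import org.apache.beam.sdk.transforms.DoFn.ProcessElement

// input: the whole HTTP response body as one large string element
// output: one element per JSON line, ready for downstream parsing transforms
class SplitResponseFn extends DoFn[String, String] {
  @ProcessElement
  def processElement(c: DoFn[String, String]#ProcessContext): Unit = {
    c.element().split("\n").filter(_.nonEmpty).foreach(line => c.output(line))
  }
}

// usage (the pipeline and the step that fetches the HTTP response are placeholders):
// responseBody.apply(ParDo.of(new SplitResponseFn()))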
I have data in S3 that's being written there with a directory structure as follows:
YYYY/MM/DD/HH
I am trying to write a program which will take in a start and end date, aggregate the data between those dates, and then convert it to a Spark RDD to perform operations on.
Is there a clean way to do this aggregation without iterating over every combination of YYYY/MM/DD/HH and without having to build a tree to figure out where it's safe to use wildcards?
I.e. if the input is start=2014/01/01/00 and end=2016/02/01/00, do I have to access each bucket individually (i.e. 2014/*/*/*, 2015/*/*/*, 2016/1/*/*)?
Are you using DataFrames? If so, you can leverage the API for loading data from multiple sources. Take a look at this documentation: https://spark.apache.org/docs/1.6.2/api/scala/index.html#org.apache.spark.sql.DataFrameReader
def load(paths: String*): DataFrame
The above method supports multiple source paths, so you can build every hourly path in your date range and pass them all at once (see the sketch below).
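A rough sketch of how that could look (bucket name and data format are placeholders; this assumes Spark 1.6's SQLContext, matching the linked docs, and Java 8 for java.time). With two years of hourly prefixes that is roughly 18,000 paths, so the initial S3 listing can take a while.
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("s3-date-range"))
val sqlContext = new SQLContext(sc)

val fmt = DateTimeFormatter.ofPattern("yyyy/MM/dd/HH")
val start = LocalDateTime.of(2014, 1, 1, 0, 0)
val end   = LocalDateTime.of(2016, 2, 1, 0, 0)

// one S3 prefix per hour in [start, end]
val paths = Iterator.iterate(start)(_.plusHours(1))
  .takeWhile(!_.isAfter(end))
  .map(t => s"s3n://my-bucket/${t.format(fmt)}/")
  .toSeq

// load(paths: _*) reads all of the prefixes into a single DataFrame
val df = sqlContext.read.format("json").load(paths: _*)
val rdd = df.rdd  // if you need an RDD of Rows to run your own operations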
Also discussion from this stack overflow thread might be helpful for you: Reading multiple files from S3 in Spark by date period
There are two operations on an RDD to save it: saveAsTextFile and saveAsObjectFile. I understand saveAsTextFile, but not saveAsObjectFile. I am new to Spark and Scala, and hence I am curious about saveAsObjectFile.
1) Is it a SequenceFile from Hadoop or something different?
2) Can I read the files generated by saveAsObjectFile using MapReduce? If yes, how?
saveAsTextFile() - Persists the RDD as a text file, using string representations of the elements. It leverages Hadoop's TextOutputFormat. To get compressed output, use the overloaded method that accepts a CompressionCodec as the second argument. Refer to the RDD API.
saveAsObjectFile() - Persists the elements of the RDD as a SequenceFile of serialized objects.
To read the sequence files back, you can use SparkContext.objectFile("path of file"), which internally leverages Hadoop's SequenceFileInputFormat to read the files.
Alternatively, you can use SparkContext.newAPIHadoopFile(...), which accepts Hadoop's InputFormat and a path as parameters. A minimal round trip with objectFile is sketched below.
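Here is a minimal round-trip sketch (the output path is a placeholder):
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("object-file-demo"))

// write: the elements are Java-serialized and stored inside a Hadoop SequenceFile
val nums = sc.parallelize(1 to 100)
nums.saveAsObjectFile("hdfs:///tmp/nums-objfile")

// read back: the type parameter tells Spark what to deserialize each record into
val restored = sc.objectFile[Int]("hdfs:///tmp/nums-objfile")
println(restored.sum())  // 5050.0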
rdd.saveAsObjectFile saves the RDD as a SequenceFile. To read those files back, use sparkContext.objectFile("fileName").