Read Avro file one line at a time (Python - pyspark)

Context:
I want to read an Avro file into Spark as an RDD. I want to know whether it is possible to parse the Avro file one line at a time if I have access to the Avro data schema.
I am using pyspark to write my Spark jobs. I am thinking about using sc.textFile to read in this huge file and do a parallel parse if I can parse a line at a time. Any pointers towards parsing an Avro file one line at a time would be greatly appreciated.

Spark is designed for big data processing, with files read as multiple partitions in parallel, so reading a single line at a time is not really a Spark use case.
You can add your business logic with the help of row transformations (applied to each row), and Spark will execute them lazily and in parallel.
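As a minimal sketch (shown in Scala to match the other snippets here; the pyspark DataFrameReader API has the same shape), you could load the Avro file as a DataFrame and push the per-row logic into transformations. This assumes the spark-avro module is on the classpath (Spark 2.4+ ships an official module under the format name "avro"; older releases use the third-party com.databricks.spark.avro package); the path and column name below are hypothetical.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("AvroRead").getOrCreate()

// Read the whole Avro file as a DataFrame; the schema comes from the file itself.
val df = spark.read.format("avro").load("hdfs:///path/to/file.avro")   // hypothetical path

// Per-row business logic goes into transformations; Spark evaluates them
// lazily and runs them in parallel, one partition per task.
val parsed = df.rdd.map(row => row.getAs[String]("some_field"))        // hypothetical column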

Related

How to do df.rdd or df.collect().foreach on streaming dataset?

This is the exception I get whenever I try to convert it:
val df_col = df.select("ts.user.friends_count").collect.map(_.toSeq)
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
All I am trying to do is replicate the following sql.DataFrame operations in Structured Streaming:
df.collect().foreach(row => droolsCaseClass(row.getLong(0), row.getString(1)))
which runs fine on DataFrames but not in Structured Streaming.
collect is a big no-no even in Spark Core's RDD world due to the size of the data you may transfer back to the driver's single JVM. It just sets the boundary of the benefits of Spark as after collect you are in a single JVM.
With that said, think about unbounded data, i.e. a data stream, that will never terminate. That's Spark Structured Streaming.
A streaming Dataset is one that is never complete and the data inside varies every time you ask for the content, i.e. the result of executing the structured query over a stream of data.
You simply cannot say "Hey, give me the data that is the content of a streaming Dataset". That does not even make sense.
That's why you cannot collect on a streaming dataset. It is not possible up to Spark 2.2.1 (the latest version at the time of this writing).
If you want to receive the data that is inside a streaming dataset for a period of time (aka batch interval in Spark Streaming or trigger in Spark Structured Streaming) you write the result to a streaming sink, e.g. console.
You can also write a custom streaming sink that does collect.map(_.toSeq) inside addBatch, which is the main and only method of a streaming sink. As a matter of fact, the console sink does exactly that.
All I am trying to do is replicate the following sql.DataFrame operations in Structured Streaming:
df.collect().foreach(row => droolsCaseClass(row.getLong(0), row.getString(1)))
which runs fine on DataFrames but not in Structured Streaming.
The very first solution that comes to my mind is to use foreach sink:
The foreach operation allows arbitrary operations to be computed on the output data.
That, of course, does not mean that this is the best solution. Just one that comes to my mind immediately.
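As a rough sketch of that idea (not necessarily the best solution either), a foreach sink for the case in the question could look like the following, assuming df is the streaming Dataset and droolsCaseClass is the function from the question, with a Long in column 0 and a String in column 1.
import org.apache.spark.sql.{ForeachWriter, Row}

val query = df.writeStream
  .foreach(new ForeachWriter[Row] {
    def open(partitionId: Long, version: Long): Boolean = true
    def process(row: Row): Unit = {
      // per-row logic runs on the executors, not on the driver
      droolsCaseClass(row.getLong(0), row.getString(1))
    }
    def close(errorOrNull: Throwable): Unit = ()
  })
  .start()

query.awaitTermination()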

Spark - Running a Batch Job at a 15-minute interval

I am using Scala.
I tried Spark Streaming, but if by any chance my streaming job crashes for more than 15 minutes, this will cause data loss.
So I just want to know: how do I manually keep checkpoints in a batch job?
The directory layout of the input data looks like the following:
Data --> 20170818 --> (timestamp) --> (many .json files)
The data are uploaded every 5 minutes.
Thanks!
You can use the readStream feature of Structured Streaming to monitor a directory and pick up new files; Spark handles checkpointing and file tracking for you.
val ds = spark.readStream
.format("text")
.option("maxFilesPerTrigger", 1)
.load(logDirectory)
Here is a link to additional material on the topic: https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-FileStreamSource.html
I personally used format("text"), but you should be able to change it to format("json"); here are more details on the json format: https://databricks.com/blog/2017/02/23/working-complex-data-formats-structured-streaming-apache-spark-2-1.html
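For completeness, the write side is where the checkpoint directory and the 15-minute cadence go. A sketch with hypothetical output and checkpoint paths, using ds from the snippet above:
import org.apache.spark.sql.streaming.Trigger

val query = ds.writeStream
  .format("json")                                       // or "parquet", "console", ...
  .option("path", "hdfs:///output/path")                // hypothetical output location
  .option("checkpointLocation", "hdfs:///checkpoints")  // lets Spark remember processed files across restarts
  .trigger(Trigger.ProcessingTime("15 minutes"))        // fire a micro-batch every 15 minutes
  .start()

query.awaitTermination()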

Unbounded table in Spark Structured Streaming

I'm starting to learn Spark and am having a difficult time understanding the rationale behind Structured Streaming in Spark. Structured Streaming treats all the data arriving as an unbounded input table, wherein every new item in the data stream is treated as a new row in the table. I have the following piece of code to read incoming files from the csvFolder.
val spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
val csvSchema = new StructType().add("street", "string").add("city", "string")
.add("zip", "string").add("state", "string").add("beds", "string")
.add("baths", "string").add("sq__ft", "string").add("type", "string")
.add("sale_date", "string").add("price", "string").add("latitude", "string")
.add("longitude", "string")
val streamingDF = spark.readStream.schema(csvSchema).csv("./csvFolder/")
val query = streamingDF.writeStream
.format("console")
.start()
What happens if I dump a 1 GB file into the folder? As per the specs, the streaming job is triggered every few milliseconds. If Spark encounters such a huge file in the next instant, won't it run out of memory while trying to load the file? Or does it automatically batch it? If yes, is this batching parameter configurable?
See the example from the Structured Streaming programming guide:
The key idea is to treat any data stream as an unbounded table: new records added to the stream are like rows being appended to the table.
This allows us to treat both batch and streaming data as tables. Since tables and DataFrames/Datasets are semantically synonymous, the same batch-like DataFrame/Dataset queries can be applied to both batch and streaming data.
This is how the execution of this query is performed in the Structured Streaming model.
Question: If Spark encounters such a huge file in the next instant, won't it run out of memory while trying to load the file? Or does it automatically batch it? If yes, is this batching parameter configurable?
Answer: There is no risk of OOM at this point, since the RDD (DF/DS) is lazily evaluated and the file is not materialised in memory when the stream is defined. Of course, you may need to repartition before processing to ensure an adequate number of partitions and to spread the data uniformly across the executors.
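Regarding the configurable batching: for file sources, the number of new files consumed per micro-batch can be capped with the maxFilesPerTrigger option. A sketch on top of the code from the question (the value shown is just an example):
val streamingDF = spark.readStream
  .schema(csvSchema)
  .option("maxFilesPerTrigger", 1)   // at most one new file per micro-batch
  .csv("./csvFolder/")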

How to process two RDDs serially in Spark?

Because I was hitting a resource limit in my Spark program, I want to divide the processing into iterations and upload the results of each iteration to HDFS, as shown below.
do something using first rdd
upload the output to hdfs
do something using second rdd
upload the output to hdfs
But as far as I know, Spark will try to run those two in parallel. Is there a way to wait for the processing of the first RDD to finish before processing the second RDD?
I think I understand where you're confused. Within a single RDD, the partitions will run in parallel with each other. However, jobs on two different RDDs will run sequentially, one after the other, unless you explicitly code otherwise (for example, by submitting them from separate threads).
Is there a way to wait for the processing of the first RDD to finish before processing the second RDD?
You have the RDD, so why do you need to wait and read from disk again?
Do some transformations on the RDD, write to disk in the first action, and continue with that same RDD to perform a second action.
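As a minimal sketch (with hypothetical RDDs, functions and output paths): actions block the driver until their job finishes, so the second job does not start until the first one has completed.
// First job: transform and write; saveAsTextFile is an action and blocks
// until the whole job has finished.
val firstOutput = firstRdd.map(processFirst)              // hypothetical transformation
firstOutput.saveAsTextFile("hdfs:///output/iteration-1")  // hypothetical path

// Second job: only starts after the line above has returned.
val secondOutput = secondRdd.map(processSecond)           // hypothetical transformation
secondOutput.saveAsTextFile("hdfs:///output/iteration-2")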

Parse NiFi Data Packet using spark

I'm doing a small project for university using Apache NiFi and Apache Spark. I want to create a workflow with NiFi that reads TSV files from HDFS, and with Spark Streaming I can process the files and store the information I need in MySQL. I've already created my workflow in NiFi and the storage part is already working. The problem is that I can't parse the NiFi data packets so that I can use them.
The files contain rows like this:
linea1File1 TheReceptionist 653 Entertainment 424 13021 4.34 1305 744 DjdA-5oKYFQ NxTDlnOuybo c-8VuICzXtU
Where each space is a tab ("\t")
This is my code in Spark using Scala:
val ssc = new StreamingContext(config, Seconds(10))
val packet = ssc.receiverStream(new NiFiReceiver(conf, StorageLevel.MEMORY_ONLY))
val file = packet.map(dataPacket => new String(dataPacket.getContent, StandardCharsets.UTF_8))
Up to here I can obtain my entire file (7000+ rows) in a single string... unfortunately I can't split that string into rows. I need to get the entire file as rows, so I can parse each one into an object, apply some operations on it, and store what I want.
Can anyone help me with this?
Each data packet is going to be the content of one flow file from NiFi, so if NiFi picks up one TSV file from HDFS that has a lot of rows, all those rows will be in one data packet.
It is hard to say without seeing your NiFi flow, but you could probably use SplitText with a line count of 1 to split your TSV file in NiFi before it gets to Spark Streaming.
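Alternatively, if you prefer to split on the Spark side, a sketch building on the code from the question: flatMap the packet content on newlines first, then split each row on tabs to get the fields.
val rows = packet.flatMap { dataPacket =>
  new String(dataPacket.getContent, StandardCharsets.UTF_8).split("\n")
}
// Each element is now one TSV row; split on tabs to get the columns.
val fields = rows.map(_.split("\t"))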