Parse a NiFi data packet using Spark and Scala

I'm doing a small project for university using Apache NiFi and Apache Spark. I want to create a workflow in NiFi that reads TSV files from HDFS, processes them with Spark Streaming, and stores the information I need in MySQL. I've already created my workflow in NiFi, and the storage part is already working. The problem is that I can't parse the NiFi data packets so that I can use them.
The files contain rows like this:
linea1File1 TheReceptionist 653 Entertainment 424 13021 4.34 1305 744 DjdA-5oKYFQ NxTDlnOuybo c-8VuICzXtU
Where each space is a tab ("\t")
This is my code in Spark using Scala:
// Streaming context with a 10-second batch interval
val ssc = new StreamingContext(config, Seconds(10))
// Receive flow files from NiFi through the site-to-site configuration `conf`
val packet = ssc.receiverStream(new NiFiReceiver(conf, StorageLevel.MEMORY_ONLY))
// Each data packet's content is the full flow file; decode it as UTF-8 text
val file = packet.map(dataPacket => new String(dataPacket.getContent, StandardCharsets.UTF_8))
Up to this point I can obtain my entire file (7000+ rows) as a single string... unfortunately I can't split that string into rows. I need to get the entire file as rows, so I can parse each one into an object, apply some operations on it, and store what I want.
Can anyone help me with this?

Each data packet is going to be the content of one flow file from NiFi, so if NiFi picks up one TSV file from HDFS that has a lot of rows, all of those rows will be in one data packet.
It is hard to say without seeing your NiFi flow, but you could probably use SplitText with a line count of 1 to split your TSV in NiFi before it reaches Spark Streaming.
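Alternatively, if the whole TSV stays in a single flow file, the content can be split into rows on the Spark side. A minimal sketch, assuming the content is newline-delimited UTF-8 text with tab-separated fields (the variable names reuse the question's code):

// Split each flow file's content into lines, then each line into tab-separated fields
val rows = packet
  .map(dataPacket => new String(dataPacket.getContent, StandardCharsets.UTF_8))
  .flatMap(_.split("\n"))   // one element per row of the TSV
  .map(_.split("\t"))       // Array[String] with the row's columns
// Peek at a few parsed rows from each micro-batch
rows.foreachRDD(rdd => rdd.take(5).foreach(cols => println(cols.mkString(" | "))))

From here each Array[String] can be mapped into whatever object you need before writing to MySQL.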

Related

Consuming exactly one file per micro-batch in Spark Streaming

I am new to Spark streaming and am doing a POC to migrate one of our ETL applications from a commercial ETL tool to Spark. We get many small files throughout the day as input and perform the ETL operations on them.
As part of the POC, we are trying out Spark streaming. Currently, we are stuck on the logic to generate surrogate key (SK) values for each record. We identified ROW_NUMBER as the way to generate the SK, but it doesn't work for a Spark stream, as it has to be a windowed operation.
I am trying to use the record number column (the sequence number of each record inside the files) to generate the SK:
1. Consume exactly one file at each micro-batch.
2. Calculate SK = previous batch's max record number + record number.
3. In the write-stream phase, calculate and store the max record number value using the "foreachBatch" option.
I have used "maxFilesPerTrigger" option to fetch one file, but it doesn't guarantee exactly one file per micro batch.
val smsfDF_stream = spark.readStream
  .option("sep", ",")
  .schema(schema)
  .option("maxFilesPerTrigger", 1)
  .option("cleanSource", "archive")
  .option("sourceArchiveDir", "/edw_ab_data/spark_poc/archive")
  .csv("/edw_ab_data/spark_poc/input")
Also, when I check the max value for each batch DataFrame, I see '9999' as the value.
val fin_query = resinterdf2.writeStream
  .format("csv")
  .foreachBatch( (batchDF: DataFrame, batchId: Long) => {
    batchDF.createOrReplaceTempView("batchdf_tab")
    val file_name = "max_file_" + batchId + ".dat"
    val max_rec_id = batchDF.sparkSession.sql("""select max(rec_number) from batchdf_tab""")
    max_rec_id.write.format("csv").mode("overwrite").save("/edw_ab_data/spark_poc/max_val.dat")
  })
  .option("checkpointLocation", "/edw_ab_data/spark_poc")
  .outputMode("append")
  .option("path", "/edw_ab_data/spark_poc/output")
  .start()
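One way to implement the "previous batch max + record number" idea from the steps above is to persist the running maximum between micro-batches and read it back inside foreachBatch. A rough sketch, reusing the question's rec_number column and paths, but with a purely illustrative helper rather than a tested implementation:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, max}

// Illustrative helper: add a surrogate key based on the previous batch's maximum.
def addSurrogateKey(batchDF: DataFrame, spark: SparkSession): DataFrame = {
  val maxPath = "/edw_ab_data/spark_poc/max_val"   // assumed location of the running max
  // Read the maximum stored by the previous batch; fall back to 0 for the first batch.
  val prevMax: Long =
    try {
      spark.read.schema("max_rec BIGINT").csv(maxPath)
        .collect().headOption.map(_.getLong(0)).getOrElse(0L)
    } catch { case _: Exception => 0L }

  // SK = previous batch's max record number + this record's number
  val withSk = batchDF.withColumn("sk", col("rec_number") + lit(prevMax))

  // Persist the new running maximum for the next batch to pick up.
  withSk.agg(max("sk").as("max_rec")).write.mode("overwrite").csv(maxPath)

  withSk
}

Inside foreachBatch you would call addSurrogateKey(batchDF, batchDF.sparkSession) and write the result to the output path, instead of writing only the max value.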

Data streamed from Kafka to Postgres goes missing seconds later

I am trying to save data from a local Kafka instance to a local Postgres with Spark Streaming. I have configured all connections and parameters, and data actually gets to the database. However, it is there only for a couple of seconds. After that, the table simply becomes empty. If I stop the app as soon as there is some data in Postgres, the data persists, so I suppose I have missed some parameter for streaming in Spark or something in the Kafka configuration files. The code is in Java, not Scala, so there is Dataset instead of DataFrame.
I tried setting spark.driver.allowMultipleContexts to true, but that has nothing to do with the context. When I run a count on the database while the complete data set is streaming in the background, there are always about 1700 records, which makes me think there might be some parameter for batch size.
censusRecordJavaDStream.map(e -> {
    // Convert each census record into a Spark SQL Row
    Row row = RowFactory.create(e.getAllValues());
    return row;
}).foreachRDD(rdd -> {
    // Build a Dataset<Row> for this micro-batch and write it to Postgres
    Dataset<Row> censusDataSet = spark.createDataFrame(rdd, CensusRecord.getStructType());
    censusDataSet
        .write()
        .mode(SaveMode.Overwrite)   // Overwrite drops and recreates the target table on every batch
        .jdbc("jdbc:postgresql:postgres", "census.census", connectionProperties);
});
My goal is to stream data from Kafka and save it to Postgres. Each record has a unique ID, which is used as the key in Kafka, so there should be no conflicts regarding primary keys or duplicate entries. For current testing purposes, I am using a small subset of about 100 records; the complete dataset is over 300 MB.
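One detail worth checking in the snippet above is the save mode: with the JDBC writer, SaveMode.Overwrite drops and recreates the target table on every micro-batch, so a later batch that happens to contain no records leaves the table empty. A minimal sketch of the same write using Append instead (shown in Scala for consistency with the rest of the thread; the Java call chain is identical apart from syntax, and CensusRecord / connectionProperties come from the question):

import org.apache.spark.sql.SaveMode

// `rowDStream` stands for the mapped DStream of Rows from the question.
rowDStream.foreachRDD { rdd =>
  val censusDF = spark.createDataFrame(rdd, CensusRecord.getStructType())
  censusDF.write
    .mode(SaveMode.Append)   // keep rows written by earlier batches
    .jdbc("jdbc:postgresql:postgres", "census.census", connectionProperties)
}

With Append, duplicate handling then relies on the unique ID mentioned above (or on upsert logic in the database).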

Spark - Running a batch job at a 15-minute interval

I am using Scala.
I tried Spark Streaming, but if by any chance my streaming job stays down for more than 15 minutes after a crash, this will cause data loss.
So I just want to know: how can I manually keep checkpoints in a batch job?
The directories of the input data look like the following:
Data --> 20170818 --> (timestamp) --> (many .json files)
The data are uploaded every 5 minutes.
Thanks!
You may use the readStream feature in Structured Streaming to monitor a directory and pick up new files. Spark automatically handles checkpointing and file tracking for you.
val ds = spark.readStream
  .format("text")
  .option("maxFilesPerTrigger", 1)   // at most one new file per micro-batch
  .load(logDirectory)
Here is a link to additional material on the topic: https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-FileStreamSource.html
I personally used format("text"), but you should be able to change it to format("json"); here are more details on the JSON format: https://databricks.com/blog/2017/02/23/working-complex-data-formats-structured-streaming-apache-spark-2-1.html
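To make the checkpointing explicit, a minimal end-to-end sketch could look like the following; the schema fields, directory paths, and 15-minute trigger are assumptions based on the question, not a verified setup:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.{StringType, StructType}

val spark = SparkSession.builder.appName("JsonDirStream").getOrCreate()

// Streaming JSON sources need an explicit schema; these fields are placeholders.
val jsonSchema = new StructType().add("id", StringType).add("payload", StringType)

val ds = spark.readStream
  .format("json")
  .schema(jsonSchema)
  .option("maxFilesPerTrigger", 10)            // bound the size of each micro-batch
  .load("/Data/20170818")                      // directory from the question's layout

val query = ds.writeStream
  .format("parquet")
  .option("path", "/Data/output")              // illustrative output location
  .option("checkpointLocation", "/Data/checkpoints")   // Spark tracks processed files here
  .trigger(Trigger.ProcessingTime("15 minutes"))
  .start()

query.awaitTermination()

The checkpoint directory is what lets the job resume after a crash without losing or reprocessing files, which addresses the 15-minute outage concern in the question.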

Unbounded table in Spark Structured Streaming

I'm starting to learn Spark and am having a difficult time understanding the rationale behind Structured Streaming in Spark. Structured Streaming treats all the data arriving as an unbounded input table, wherein every new item in the data stream is treated as a new row in the table. I have the following piece of code to read incoming files from the csvFolder directory.
val spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
val csvSchema = new StructType().add("street", "string").add("city", "string")
.add("zip", "string").add("state", "string").add("beds", "string")
.add("baths", "string").add("sq__ft", "string").add("type", "string")
.add("sale_date", "string").add("price", "string").add("latitude", "string")
.add("longitude", "string")
val streamingDF = spark.readStream.schema(csvSchema).csv("./csvFolder/")
val query = streamingDF.writeStream
.format("console")
.start()
What happens if I dump a 1 GB file into the folder? As per the specs, the streaming job is triggered every few milliseconds. If Spark encounters such a huge file in the next instant, won't it run out of memory while trying to load the file? Or does it automatically batch it? If yes, is this batching parameter configurable?
The key idea is to treat any data stream as an unbounded table: new records added to the stream are like rows being appended to the table.
This allows us to treat both batch and streaming data as tables. Since tables and DataFrames/Datasets are semantically synonymous, the same batch-like DataFrame/Dataset queries can be applied to both batch and streaming data.
In the Structured Streaming model, the query is executed incrementally as new rows are appended to the unbounded input table.
Question: If Spark encounters such a huge file in the next instant, won't it run out of memory while trying to load the file? Or does it automatically batch it? If yes, is this batching parameter configurable?
Answer: There is no risk of OOM at load time, because the RDD (DataFrame/Dataset) is lazily evaluated. Of course, you need to repartition before processing to ensure an equal number of partitions and that the data is spread uniformly across the executors.
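Regarding whether the batching is configurable: for a file source you can cap how many new files each trigger picks up with maxFilesPerTrigger (it limits the file count, not bytes), and repartition before processing as suggested above. A small sketch along those lines, with illustrative option values, reusing the question's csvSchema:

// Bound how much new data each trigger picks up, then spread it across executors.
val boundedDF = spark.readStream
  .schema(csvSchema)
  .option("maxFilesPerTrigger", 1)   // at most one new file per micro-batch
  .csv("./csvFolder/")
  .repartition(8)                    // illustrative partition count

val boundedQuery = boundedDF.writeStream
  .format("console")
  .start()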

Read an Avro file one line at a time - Python

Context:
I want to read an Avro file into Spark as an RDD. I want to know whether it is possible to parse the Avro file one line at a time if I have access to the Avro data schema.
I am using PySpark for writing my Spark jobs. I am thinking about using sc.textFile to read in this huge file and do a parallel parse if I can parse a line at a time. Any pointers toward parsing an Avro file one line at a time would be greatly appreciated.
Spark is meant for big-data processing, with multiple partitions of a file read in parallel; reading a single line at a time is not really a Spark use case.
You can add your business logic with the help of row transformations (applied to each row), and Spark will execute them lazily.
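As an illustration of the per-row approach, here is a sketch in Scala (used here for consistency with the rest of this thread; the PySpark DataFrame API is analogous). It assumes the external spark-avro package is on the classpath, and the path and field name are purely illustrative:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("AvroRead").getOrCreate()

// Requires the spark-avro package, e.g. --packages org.apache.spark:spark-avro_2.12:<matching Spark version>
val df = spark.read.format("avro").load("/data/events.avro")   // illustrative path

// A row-level transformation: Spark applies it to each record in parallel and lazily.
val transformed = df.selectExpr("upper(name) AS name_upper")    // `name` is an assumed field
transformed.show(5)

The Avro schema is picked up from the file itself, so each record comes back as a structured row rather than a raw line of text.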