NiFi and Spark Integration - Scala

I want to create a SparkSession within a NiFi custom processor written in Scala. So far I can create my Spark session in a standalone Scala project, but when I add that code inside the onTrigger method of the NiFi custom processor, the Spark session is never created. Is there any way to achieve this? So far I have imported the spark-core and spark-sql libraries.
Any feedback is appreciated.

Not possible with flow files, period.
You need Kafka in between NiFi and Spark Streaming or Spark Structured Streaming. Here is a good read, by the way: https://community.cloudera.com/t5/Community-Articles/Spark-Structured-Streaming-with-NiFi-and-Kafka-using-PySpark/ta-p/245068
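To sketch the suggested setup: NiFi publishes the flow file content to Kafka (e.g. with PublishKafka), and a separate Spark Structured Streaming application consumes the topic. The following Scala sketch uses placeholder values for the bootstrap servers, topic name, and checkpoint path; it is illustrative only, not the article's exact code.

import org.apache.spark.sql.SparkSession

object NifiKafkaSparkJob {
  def main(args: Array[String]): Unit = {
    // Run Spark as its own application instead of inside the NiFi processor.
    val spark = SparkSession.builder()
      .appName("nifi-kafka-structured-streaming")
      .getOrCreate()

    // Read the records that NiFi pushed to Kafka.
    // "localhost:9092" and "nifi-output" are placeholder values.
    val input = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "nifi-output")
      .load()
      .selectExpr("CAST(value AS STRING) AS value")

    // Write to the console just to show the pipeline end to end; replace with a real sink.
    val query = input.writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/nifi-kafka-checkpoint")
      .start()

    query.awaitTermination()
  }
}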

Related

How do I implement a variable similar to Spark's Accumulators in Apache Beam?

I am currently using Apache Beam 2.29.0 on Spark. My pipeline consumes data from Kafka, for which I have a custom KafkaConsumer that Beam creates through a call to a ConsumerFactoryFn. I need to share a piece of persistent data among the custom Kafka consumers for the duration of the run. This would have been very simple in Spark: I would have created an Accumulator variable that all the executors, as well as the driver, have access to.
Since Beam is designed to run on multiple platforms (Spark, Flink, Google Dataflow), it does not provide this functionality. Does anybody know a way to implement this?
I believe Side Inputs should work. You can read about Side Inputs here. A side input is an additional input that your DoFn can access each time it processes an element in the input PCollection.
Here is an example of how to use it.
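As a rough sketch of the idea (written in Scala against the Beam Java SDK; the pipeline contents and names are made up for illustration), you can publish a single shared value as a singleton side input and read it from inside a DoFn:

import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.options.PipelineOptionsFactory
import org.apache.beam.sdk.transforms.{Create, DoFn, ParDo, View}
import org.apache.beam.sdk.transforms.DoFn.ProcessElement
import org.apache.beam.sdk.values.PCollectionView

object SideInputExample {
  def main(args: Array[String]): Unit = {
    val pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args: _*).create())

    // The shared, read-only value every worker can see (stands in for the accumulator-like state).
    val configView: PCollectionView[String] =
      pipeline.apply("Config", Create.of("shared-config")).apply(View.asSingleton[String]())

    // Placeholder main input; in the real pipeline this would be the Kafka read.
    val records = pipeline.apply("Records", Create.of("a", "b", "c"))

    records.apply("Tag", ParDo.of(new DoFn[String, String] {
      @ProcessElement
      def process(c: DoFn[String, String]#ProcessContext): Unit = {
        val shared = c.sideInput(configView) // read the side input inside the DoFn
        c.output(s"$shared:${c.element()}")
      }
    }).withSideInputs(configView))

    pipeline.run().waitUntilFinish()
  }
}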

Are spark.streaming.backpressure.* properties applicable to Spark Structured Streaming?

My understanding is that Spark Structured Streaming is built on top of Spark SQL and not Spark Streaming. Hence the question: do the properties that apply to Spark Streaming also apply to Spark Structured Streaming, such as:
spark.streaming.backpressure.initialRate
spark.streaming.backpressure.enabled
spark.streaming.receiver.maxRate
No, these settings are applicable only to the DStream API.
Spark Structured Streaming does not have a backpressure mechanism. You can find more details in this discussion: How Spark Structured Streaming handles backpressure?
No.
Spark Structured Streaming processes data as soon as possible by default, starting a new batch after finishing the current one. You can control the rate of processing for various source types, e.g. maxFilesPerTrigger for files and maxOffsetsPerTrigger for Kafka.
This link http://javaagile.blogspot.com/2019/03/everything-you-needed-to-know-about.html explains why backpressure is not relevant.
It quotes: "Structured Streaming cannot do real backpressure, because, such as, Spark cannot tell other applications to slow down the speed of pushing data into Kafka."
I am not sure this aspect is relevant, as Kafka buffers the data; nonetheless, the article has good merit, imho.
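To make the rate-limit options concrete, here is a minimal Scala sketch that caps how much a Kafka-backed Structured Streaming query reads per micro-batch using maxOffsetsPerTrigger; the broker address, topic, limit, and checkpoint path are placeholder values.

import org.apache.spark.sql.SparkSession

object RateLimitedKafkaStream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rate-limited-structured-streaming")
      .getOrCreate()

    // maxOffsetsPerTrigger limits how many Kafka offsets each micro-batch may consume;
    // it is a throttle, not true backpressure.
    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // placeholder broker
      .option("subscribe", "events")                       // placeholder topic
      .option("maxOffsetsPerTrigger", "10000")             // at most 10k offsets per batch
      .load()

    stream.writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/rate-limit-checkpoint")
      .start()
      .awaitTermination()
  }
}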

Write to AWS SQS queue using Spark

Is there any way to stream or write data to Amazon SQS queue from Spark using a library?
There is nothing listed on Spark Packages.
What things can I try?
One idea is to use Alpakka's SQS connector, which is built on Akka Streams.
I wrote a small library for writing a DataFrame to SQS:
https://github.com/fabiogouw/spark-aws-messaging
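Neither library's API is shown here, but as a rough sketch of the underlying pattern (assuming the AWS SDK v2 for Java, and a placeholder queue URL and input path), you can also push a DataFrame to SQS yourself with foreachPartition:

import org.apache.spark.sql.SparkSession
import software.amazon.awssdk.services.sqs.SqsClient
import software.amazon.awssdk.services.sqs.model.SendMessageRequest

object DataFrameToSqs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dataframe-to-sqs").getOrCreate()

    val queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue" // placeholder
    val df = spark.read.json("/tmp/input.json")                                // placeholder input

    // Send each row as one SQS message; creating one client per partition keeps the closure serializable.
    df.toJSON.rdd.foreachPartition { rows =>
      val sqs = SqsClient.create()
      try {
        rows.foreach { body =>
          sqs.sendMessage(SendMessageRequest.builder()
            .queueUrl(queueUrl)
            .messageBody(body)
            .build())
        }
      } finally {
        sqs.close()
      }
    }
  }
}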

How to use a Kafka consumer in Spark

I am using Spark 2.1 and Kafka 0.10.1.
I want to process the data by reading the entire contents of specific Kafka topics on a daily basis.
For Spark Streaming, I know that createDirectStream only needs a list of topics and some configuration information as arguments.
However, I realized that createRDD has to include all of the topic, partition, and offset information.
I want to make batch processing as convenient as streaming in Spark.
Is that possible?
I suggest you read this text from Cloudera.
The example shows you how to read the data from Kafka exactly once, persisting the offsets in Postgres and relying on its ACID guarantees.
I hope that will solve your problem.
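To tie that back to the question, here is a hedged Scala sketch of a daily batch job built on KafkaUtils.createRDD with explicit offset ranges. The broker, topic, offsets, and output path are placeholders; in a real job the offset ranges would come from wherever you persist them (e.g. Postgres, as in the Cloudera example).

import java.{util => ju}

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.kafka010.{KafkaUtils, LocationStrategies, OffsetRange}

object DailyKafkaBatch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("daily-kafka-batch"))

    val kafkaParams = new ju.HashMap[String, Object]()
    kafkaParams.put("bootstrap.servers", "localhost:9092") // placeholder broker
    kafkaParams.put("key.deserializer", classOf[StringDeserializer].getName)
    kafkaParams.put("value.deserializer", classOf[StringDeserializer].getName)
    kafkaParams.put("group.id", "daily-batch")

    // One OffsetRange per partition: (topic, partition, fromOffset, untilOffset).
    // These constants stand in for offsets loaded from your offset store.
    val offsetRanges = Array(
      OffsetRange("my-topic", 0, 0L, 100000L),
      OffsetRange("my-topic", 1, 0L, 100000L)
    )

    val rdd = KafkaUtils.createRDD[String, String](
      sc, kafkaParams, offsetRanges, LocationStrategies.PreferConsistent)

    // Extract the message values and dump them somewhere for the daily batch.
    rdd.map(record => record.value()).saveAsTextFile("/tmp/daily-kafka-dump")

    sc.stop()
  }
}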

Multiple Streams support in Apache Flink Job

My question is regarding the Apache Flink framework.
Is there any way to support more than one streaming source, such as Kafka and Twitter, in a single Flink job? Is there any workaround? Can we process more than one streaming source at a time in a single Flink job?
I am currently working with Spark Streaming, and this is a limitation there.
Is this achievable with other streaming frameworks like Apache Samza, Storm, or NiFi?
A response is much awaited.
Yes, this is possible in Flink and Storm (no clue about Samza or NiFi...).
You can add as many source operators as you want and each can consume from a different source.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
Properties properties = ... // see Flink webpage for more details
DataStream<String> stream1 = env.addSource(new FlinkKafkaConsumer08<>("topic", new SimpleStringSchema(), properties));
DataStream<String> stream2 = env.readTextFile("/tmp/myFile.txt");
DataStream<String> allStreams = stream1.union(stream2);
For Storm, using the low-level API, the pattern is similar. See: An Apache Storm bolt receive multiple input tuples from different spout/bolt
Some solutions have already been covered; I just want to add that in a NiFi flow you can ingest many different sources and process them either separately or together.
It is also possible to ingest a source once and have multiple teams build flows on it without needing to ingest the data multiple times.