Spark Streaming with NiFi - apache-kafka

I am looking for a way to make use of Spark streaming in NiFi. I have seen a couple of posts where a SiteToSite TCP connection is used for the Spark streaming application, but I think it would be better if I could launch Spark streaming from a NiFi custom processor.
PublishKafka will publish messages to Kafka, and then a NiFi Spark streaming processor will read from the Kafka topic.
I can launch a Spark streaming application from a custom NiFi processor using the Spark launcher API, but the biggest challenge is that it will create a Spark streaming context for each flow file, which can be a costly operation.
Does anyone suggest storing the Spark streaming context in a controller service? Or is there a better approach for running a Spark streaming application with NiFi?

You can use ExecuteSparkInteractive to run the Spark code you are trying to include in your Spark streaming application.
You need a few things set up for the Spark code to run from within NiFi:
Set up a Livy server
Add NiFi controller services to start Spark Livy sessions:
LivySessionController
StandardSSLContextService (may be required)
Once you enable the LivySessionController within NiFi, it will start Spark sessions, and you can check the Spark UI to see whether those Livy sessions are up and running.
Now that the Livy Spark sessions are running, whenever a flow file moves through the NiFi flow, the Spark code within ExecuteSparkInteractive will run.
This is similar to a Spark streaming application running outside NiFi. For me this approach works very well and is easy to maintain compared to having a separate Spark streaming application.
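For illustration, here is a minimal sketch of the kind of Spark code that could go into the Code property of ExecuteSparkInteractive. It assumes the Livy session exposes a SparkSession as `spark`, that the spark-sql-kafka package is available to the session, and that the topic name and broker address are placeholders for your own setup:

```scala
// Minimal sketch (assumed setup): the Livy session provides `spark`, and the
// topic/broker names below are placeholders for whatever PublishKafka writes to.
val df = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "nifi-events")
  .option("startingOffsets", "earliest")
  .load()

// Kafka delivers binary key/value columns; cast them to strings before use.
val messages = df.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

// Replace this with whatever your streaming logic needs to do.
messages.groupBy("key").count().show()
```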
Hope this helps!

I can launch a Spark streaming application from a custom NiFi processor using the Spark launcher API, but the biggest challenge is that it will create a Spark streaming context for each flow file, which can be a costly operation.
You'd be launching a standalone application in each case, which is not what you want. If you are going to integrate with Spark Streaming or Flink, you should be using something like Kafka to pub-sub between them.
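For example, here is a rough sketch of a standalone Structured Streaming job that subscribes to the topic PublishKafka writes to; the broker address, topic name, and checkpoint path are placeholders, and the console sink stands in for a real one:

```scala
import org.apache.spark.sql.SparkSession

object NifiTopicConsumer {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("nifi-topic-consumer")
      .getOrCreate()

    // Subscribe to the topic that the NiFi PublishKafka processor feeds.
    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "nifi-events")
      .load()
      .selectExpr("CAST(value AS STRING) AS value")

    // The console sink is only for demonstration; point this at a real sink.
    stream.writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/nifi-topic-checkpoint")
      .start()
      .awaitTermination()
  }
}
```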

Related

How to import and use kafka-connect-datagen in a Spark application

We need to perform unit testing for our real-time streaming application written in Scala/Spark.
One option is to use embedded-kafka for Kafka test-case simulation.
The other option is to use kafka-connect-datagen - https://github.com/confluentinc/kafka-connect-datagen
The examples found on various blogs use the CLI.
What I'm looking for is an example of importing kafka-connect-datagen within a Scala application.
I'd appreciate any good resource on kafka-connect-datagen, or on simulating a streaming application within a Scala application.
Kafka Connect is meant to run standalone.
You can use the Testcontainers project to start a broker and a Connect worker, then run the datagen connector from there.
Otherwise, for more rigorous testing, write your own KafkaProducer.send calls with data you control.
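As a sketch of that last suggestion, here is a small Scala producer that publishes hand-crafted test data; the bootstrap server, topic name, and record payloads are all placeholders:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

object TestDataProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // e.g. a Testcontainers broker
    props.put("key.serializer", classOf[StringSerializer].getName)
    props.put("value.serializer", classOf[StringSerializer].getName)

    val producer = new KafkaProducer[String, String](props)
    try {
      // Send a handful of records you control instead of random datagen output.
      (1 to 10).foreach { i =>
        val record = new ProducerRecord[String, String]("test-topic", s"key-$i", s"""{"id": $i, "amount": ${i * 10}}""")
        producer.send(record)
      }
      producer.flush()
    } finally {
      producer.close()
    }
  }
}
```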

Is it possible to work with Spark Structured Streaming without HDFS?

I have been working with HDFS and Kafka for some time, and I have noticed that Kafka is more reliable than HDFS.
Now, working with Spark Structured Streaming, I am surprised that checkpointing only works with HDFS.
Checkpointing with Kafka would be faster and more reliable.
So is it possible to work with Spark Structured Streaming without HDFS?
It seems strange that we have to use HDFS just to stream data that is already in Kafka.
Or is it possible to tell Spark to forget about checkpointing and manage it in the program instead?
Spark 2.4.7
Thank you
You are not restricted to using an HDFS path as the checkpoint location.
According to the section Recovering from Failures with Checkpointing in the Spark Structured Streaming Guide, the path has to be "an HDFS compatible file system", so other file systems will also work. However, all executors must have access to that file system. For example, choosing the local file system on the edge node of your cluster might work in local mode, but in cluster mode it can cause issues.
Also, it is not possible to have Kafka itself handle the offset position with Spark Structured Streaming. I have explained this in more depth in my answer on How to manually set group.id and commit kafka offsets in spark structured streaming?.
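To illustrate, here is a minimal sketch of where the checkpoint location is set, assuming an existing SparkSession `spark`; the Kafka options and the `s3a://` path are placeholders, and any HDFS-compatible location reachable by all executors (HDFS, S3, a shared mount) would do:

```scala
// Any HDFS-compatible path that all executors can reach works as a checkpoint
// location, e.g. "hdfs://...", "s3a://...", or a shared network mount.
val query = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "input-topic")
  .load()
  .writeStream
  .format("parquet")
  .option("path", "/data/output")                                        // placeholder
  .option("checkpointLocation", "s3a://my-bucket/checkpoints/my-query")  // placeholder
  .start()

query.awaitTermination()
```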

How does Apache Beam manage Kinesis checkpointing?

I have a streaming pipeline developed in Apache Beam (using the Spark Runner) which reads from a Kinesis stream.
I am looking for options in Apache Beam to manage Kinesis checkpointing (i.e. periodically store the current position in the Kinesis stream) so that the system can recover from failures and continue processing where the stream left off.
Does Apache Beam provide support for Kinesis checkpointing similar to Spark Streaming (reference: https://spark.apache.org/docs/2.2.0/streaming-kinesis-integration.html)?
Since KinesisIO is based on UnboundedSource.CheckpointMark, it uses the standard checkpoint mechanism provided by Beam's UnboundedSource.UnboundedReader.
Once a KinesisRecord has been read (actually, pulled from a record queue that is fed separately by fetching the records from the Kinesis shard), the shard checkpoint is updated using the record's SequenceNumber and then saved, depending on how the runner implements UnboundedSource and checkpoint processing.
AFAIK, the Beam Spark Runner uses Spark's state mechanism for this purpose.
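In other words, the pipeline code itself does not deal with checkpoint marks. Here is a rough sketch of a KinesisIO read, written in Scala against the Beam Java API; the stream name, credentials, and region are placeholders, and the exact KinesisIO methods depend on the Beam version you use:

```scala
import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.options.PipelineOptionsFactory
import org.apache.beam.sdk.io.kinesis.KinesisIO
import com.amazonaws.regions.Regions
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream

object KinesisPipeline {
  def main(args: Array[String]): Unit = {
    // The runner (e.g. --runner=SparkRunner) is chosen through pipeline options.
    val options = PipelineOptionsFactory.fromArgs(args: _*).create()
    val pipeline = Pipeline.create(options)

    // KinesisIO tracks its position per shard via UnboundedSource.CheckpointMark;
    // the runner persists those marks, so no explicit checkpoint code is needed here.
    pipeline.apply(
      KinesisIO.read()
        .withStreamName("my-stream")                                          // placeholder
        .withInitialPositionInStream(InitialPositionInStream.LATEST)
        .withAWSClientsProvider("ACCESS_KEY", "SECRET_KEY", Regions.US_EAST_1) // placeholders
    )

    pipeline.run().waitUntilFinish()
  }
}
```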

Apache Beam over Apache Kafka Stream processing

What are the differences between Apache Beam and Apache Kafka with respect to Stream processing?
I am trying to grasp the technical and programmatic differences as well.
Please help me understand by reporting from your experience.
Beam is an API that uses an underlying stream processing engine like Flink, Storm, etc... in one unified way.
Kafka is mainly an integration platform that offers a messaging system based on topics that standalone applications use to communicate with each other.
On top of this messaging system (and the Producer/Consumer API), Kafka offers an API to perform stream processing, using messages as data and topics as input or output. Kafka Streams applications are standalone Java applications that act as regular Kafka consumers and producers (this is important to understand how these applications are managed and how the workload is shared among stream processing application instances).
In short, Kafka Streams applications are standalone Java applications that run outside the Kafka cluster, consume from the Kafka cluster, and write results back to the Kafka cluster. With other stream processing platforms, the stream processing applications run inside the cluster engine (and are managed by it), consume from somewhere else, and export results to somewhere else.
One big difference between the Kafka and Beam stream APIs is that Beam distinguishes between bounded and unbounded data inside the data stream, whereas Kafka does not make that distinction. As a result, handling bounded data with the Kafka API has to be done manually, using timed/sessionized windows to gather the data.
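To make the "standalone application" point concrete, here is a minimal Kafka Streams sketch using the kafka-streams-scala DSL; the application id, bootstrap server, and topic names are placeholders. It runs as an ordinary JVM process outside the Kafka cluster, consuming one topic and producing to another:

```scala
import java.util.Properties
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.serialization.Serdes._

object UppercaseApp {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-app")     // placeholder
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // placeholder

    // Read from one topic, transform, and write to another: the app itself is
    // just a regular Kafka consumer + producer running outside the cluster.
    val builder = new StreamsBuilder()
    builder.stream[String, String]("input-topic")
      .mapValues(_.toUpperCase)
      .to("output-topic")

    val streams = new KafkaStreams(builder.build(), props)
    streams.start()
    sys.addShutdownHook(streams.close())
  }
}
```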
Beam is a programming API but not a system or library you can use. There are multiple Beam runners available that implement the Beam API.
Kafka is a stream processing platform and ships with Kafka Streams (aka the Streams API), a Java stream processing library that is built to read data from Kafka topics and write results back to Kafka topics.

Akka Persistence: migrating from JDBC (Postgres) to Cassandra

I have a running project using the akka-persistence-jdbc plugin with PostgreSQL as the backend.
Now I want to migrate to akka-persistence-cassandra.
But how can I convert the existing events (more than 4 GB in Postgres) to Cassandra?
Should I write a manual migration program that reads from Postgres and writes in the right format to Cassandra?
This is a classic migration problem. There are multiple solutions for this.
Spark SQL and the Spark Cassandra Connector: The Spark JDBC (a.k.a. Spark DataFrame / Spark SQL) API allows you to read from any JDBC source. You can read it in chunks by partitioning it; otherwise you will run out of memory. Partitioning also makes the migration parallel. Then write the data into Cassandra with the Spark Cassandra Connector. This is by far the simplest and most efficient way I have used in my own tasks (see the sketch at the end of this answer).
Java agents: A Java agent can be written with plain JDBC or other libraries and then write to Cassandra with the DataStax driver. A Spark program runs across multiple machines and threads and recovers automatically if something goes wrong, but if you write an agent like this manually, it only runs on a single machine and the multi-threading also needs to be coded.
Kafka connectors: Kafka is a message broker. It can be used indirectly for the migration. Kafka has connectors that can read from and write to different databases. You can use the JDBC source connector to read from Postgres and the Cassandra sink connector to write to Cassandra. It is not that easy to set up, but it has the advantage of involving no coding.
ETL systems: Some ETL systems have support for Cassandra, but I have not personally tried any.
I saw some advantages in using the Spark Cassandra Connector and Spark SQL for the migration; some of them are:
The code was concise, hardly 40 lines.
Multi-machine (and multi-threaded on each machine).
Job progress and statistics in the Spark master UI.
Fault tolerance: if a Spark node goes down or a thread/worker fails, the job is automatically restarted on another node, which is good for very long-running jobs.
If you don't know Spark, then writing an agent is okay for 4 GB of data.
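For reference, here is a rough sketch of the Spark SQL + Spark Cassandra Connector approach. All connection details, table names, bounds, and the partition column are placeholders, it assumes the spark-cassandra-connector and PostgreSQL JDBC driver are on the classpath, and a real migration would additionally transform the rows into the schema that akka-persistence-cassandra expects:

```scala
import org.apache.spark.sql.SparkSession

object JournalMigration {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("postgres-to-cassandra-migration")
      .config("spark.cassandra.connection.host", "cassandra-host")  // placeholder
      .getOrCreate()

    // Read the journal table in parallel chunks; the partition column and bounds
    // are placeholders that keep each chunk small enough to fit in memory.
    val journal = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://pg-host:5432/akka")         // placeholder
      .option("dbtable", "journal")                                 // placeholder
      .option("user", "user")
      .option("password", "password")
      .option("partitionColumn", "ordering")                        // placeholder
      .option("lowerBound", "1")
      .option("upperBound", "10000000")
      .option("numPartitions", "32")
      .load()

    // Write the partitions to Cassandra via the Spark Cassandra Connector.
    journal.write
      .format("org.apache.spark.sql.cassandra")
      .option("keyspace", "akka")                                   // placeholder
      .option("table", "messages")                                  // placeholder
      .mode("append")
      .save()

    spark.stop()
  }
}
```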