Is it possible to work with Spark Structured Streaming without HDFS? - spark-structured-streaming

I have been working with HDFS and Kafka for some time, and I have noticed that Kafka is more reliable than HDFS.
So, now that I am working with Spark Structured Streaming, I'm surprised that checkpointing only works with HDFS.
Checkpointing with Kafka would be faster and more reliable.
So is it possible to work with Spark Structured Streaming without HDFS?
It seems strange that we need HDFS just to stream data that lives in Kafka.
Or is it possible to tell Spark to skip checkpointing and manage it in the program instead?
Spark 2.4.7
Thank you

You are not restricted to using an HDFS path as the checkpoint location.
According to the section Recovering from Failures with Checkpointing in the Spark Structured Streaming Guide, the path has to be "an HDFS compatible file system", so other file systems will work as well. However, all executors must have access to that file system. For example, choosing the local file system on the edge node of your cluster might work in local mode, but in cluster mode it can cause issues.
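As a minimal sketch of the point above, the checkpoint location only needs to be a URI on a file system that the driver and all executors can reach. The helper names, broker address, and topic below are hypothetical, the list of schemes is illustrative rather than exhaustive, and each scheme assumes the matching Hadoop connector is on the classpath.

```python
from urllib.parse import urlparse

# Schemes that usually point at storage shared by the whole cluster
# (illustrative assumption, not an exhaustive or official list).
SHARED_SCHEMES = {"hdfs", "s3a", "abfss", "gs"}


def is_shared_checkpoint(uri: str) -> bool:
    """Return True if the URI scheme suggests storage reachable from every executor."""
    return urlparse(uri).scheme in SHARED_SCHEMES


def start_query(spark, checkpoint_uri: str):
    """Start a Kafka-to-console query that checkpoints to `checkpoint_uri`.

    `spark` is an existing SparkSession; broker and topic are placeholders.
    """
    if not is_shared_checkpoint(checkpoint_uri):
        raise ValueError(f"{checkpoint_uri} is not on a shared file system")
    events = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
        .option("subscribe", "events")                      # hypothetical topic
        .load()
    )
    return (
        events.writeStream.format("console")
        .option("checkpointLocation", checkpoint_uri)
        .start()
    )
```

The check is deliberately strict for cluster mode. In local mode a `file://` path is fine, and the guard can simply be dropped.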
Also, it is not possible to have Kafka itself handle the offset position with Spark Structured Streaming. I have explained this in more depth in my answer on How to manually set group.id and commit kafka offsets in spark structured streaming?.
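Because the offsets are tracked in the checkpoint rather than committed to Kafka, the current position can be inspected by reading the files under the checkpoint's `offsets` directory. The sketch below assumes the on-disk layout I have seen with Spark 2.x (a version line, a metadata JSON line, then one line per source). That layout is internal to Spark and not a public API, and `parse_offset_log` is a hypothetical helper.

```python
import json


def parse_offset_log(text: str) -> list:
    """Parse the content of one `<checkpoint>/offsets/<batchId>` file.

    Assumed layout (internal to Spark, may change between versions):
      line 1: version marker such as "v1"
      line 2: batch metadata as JSON
      line 3+: one entry per source, e.g. {"topic": {"0": 42}}, or "-" if empty
    Returns one item per source: a dict of topic -> partition -> offset, or None.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) < 2 or not lines[0].startswith("v"):
        raise ValueError("unexpected offset log format")
    return [None if ln.strip() == "-" else json.loads(ln) for ln in lines[2:]]


# Hand-written sample content, not taken from a real checkpoint.
sample = "\n".join([
    "v1",
    '{"batchWatermarkMs":0,"batchTimestampMs":1600000000000}',
    '{"events":{"0":120,"1":98}}',
])
offsets = parse_offset_log(sample)
```

This is only for inspection. Editing these files by hand to move the position is not something the Spark documentation describes as supported.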

Related

Azure Event Hubs Streaming: Does Checkpointing override setStartingPosition?

If we specify the starting position in EventHub conf like so:
EventHubsConf(ConnectionStringBuilder(eventHubConnectionString).build)
.setStartingPosition(EventPosition.fromStartOfStream)
or
.setStartingPosition(EventPosition.fromEndOfStream)
And also specify the checkpoint location in the StreamWriter
streamingInputDF
.writeStream
.option("checkpointLocation", checkpointLocation)
...
After a restart, does the setStartingPosition become irrelevant because the checkpoint is always used as the point from where to begin reading?
Thanks.
The offset information stored in the checkpoint files will be used when restarting the streaming query.
Interestingly, this is not specifically mentioned in the Structured Streaming Event Hubs integration guide, but it is in the DStreams guide:
"The connector fully integrates with the Structured Streaming checkpointing mechanism. You can recover the progress and state of you query on failures by setting a checkpoint location in your query. This checkpoint location has to be a path in an HDFS compatible file system, and can be set as an option in the DataStreamWriter when starting a query."
Make sure to follow the general guidance on checkpoint recovery.
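The precedence described in this answer can be summarized as a small decision function. This is a conceptual model only, not the connector's code, and `resolve_start` and its arguments are hypothetical names.

```python
from typing import Optional


def resolve_start(checkpoint_position: Optional[int], configured_start: str) -> str:
    """Conceptual model: a checkpointed position, when present, takes precedence."""
    if checkpoint_position is not None:
        # Restart with an existing checkpoint: the configured start is ignored.
        return f"resume at checkpointed position {checkpoint_position}"
    # First run (or empty checkpoint location): the configured start applies.
    return f"start at configured position: {configured_start}"
```

In other words, setStartingPosition only matters for the first run against a given checkpoint location.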

How does Apache Beam manage Kinesis checkpointing?

I have a streaming pipeline developed in Apache Beam (using the Spark Runner) which reads from a Kinesis stream.
I am looking for options in Apache Beam to manage Kinesis checkpointing (i.e. periodically storing the current position in the Kinesis stream) so that the system can recover from failures and continue processing where the stream left off.
Does Apache Beam support Kinesis checkpointing in a way similar to Spark Streaming (reference: https://spark.apache.org/docs/2.2.0/streaming-kinesis-integration.html)?
Since KinesisIO is based on UnboundedSource.CheckpointMark, it uses the standard checkpoint mechanism provided by Beam's UnboundedSource.UnboundedReader.
Once a KinesisRecord has been read (actually, pulled from a records queue that is fed separately by fetching the records from the Kinesis shard), the shard checkpoint is updated using the record's SequenceNumber and then saved, depending on how the runner implements UnboundedSource and checkpoint processing.
As far as I know, the Beam Spark Runner uses Spark's state mechanism for this purpose.
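The bookkeeping described above can be pictured with a simplified model: each record pulled from the queue advances the position of its shard, and the runner decides when that mark is persisted. The classes and functions below are hypothetical stand-ins written in Python, not Beam's actual KinesisIO classes.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Iterator, Tuple


@dataclass
class CheckpointMark:
    """Simplified stand-in for a per-shard checkpoint (not Beam's actual class)."""
    positions: Dict[str, str] = field(default_factory=dict)

    def advance(self, shard_id: str, sequence_number: str) -> None:
        # Remember the sequence number of the last record taken from this shard.
        self.positions[shard_id] = sequence_number


def read_all(records: Iterable[Tuple[str, str, bytes]],
             mark: CheckpointMark) -> Iterator[bytes]:
    """Yield payloads, updating the mark as each record is pulled."""
    for shard_id, sequence_number, payload in records:
        mark.advance(shard_id, sequence_number)
        yield payload


# Made-up records: (shard id, sequence number, payload).
mark = CheckpointMark()
payloads = list(read_all(
    [("shard-0", "101", b"a"), ("shard-1", "7", b"b"), ("shard-0", "102", b"c")],
    mark,
))
```

After the loop, `mark.positions` holds the latest sequence number per shard, which is the information a runner would need to save in order to resume.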

Kafka Connect: can multiple standalone connectors write to the same HDFS directory?

For our pipeline, we have about 40 topics (10-25 partitions each) that we want to write into the same HDFS directory using HDFS 3 Sink Connectors in standalone mode (distributed doesn't work for our current setup). We have tried running all the topics on one connector but encounter problems recovering offsets if it needs to be restarted.
If we divide the topics among different standalone connectors, can they all write into the same HDFS directory? Since the connectors then organize all files in HDFS by topic, I don't think this should be an issue but I'm wondering if anyone has experience with this setup.
Basic example:
Connector-1 config
name=connect-1
connector.class=io.confluent.connect.hdfs3.Hdfs3SinkConnector
topics=topic1
hdfs.url=hdfs://kafkaOutput
Connector-2 config
name=connect-2
connector.class=io.confluent.connect.hdfs3.Hdfs3SinkConnector
topics=topic2
hdfs.url=hdfs://kafkaOutput
"distributed doesn't work for our current setup"
You should be able to run connect-distributed on the exact same nodes where connect-standalone runs.
"We have tried running all the topics on one connector but encounter problems recovering offsets if it needs to be restarted"
Yeah, I would suggest not bundling all topics into one connector.
"If we divide the topics among different standalone connectors, can they all write into the same HDFS directory?"
That is my personal recommendation, and yes, they can, because the HDFS path is named after the topic and further split by the partitioning scheme.
Note: This also applies to the other storage connectors (S3 & GCS).
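To illustrate why connectors with disjoint topic lists do not collide, here is a sketch of how the committed file path is derived. It assumes the connector defaults as I understand them (`topics.dir=topics`, the default partitioner, ten-digit zero-padded offsets). `committed_file_path` is a hypothetical helper, not part of Kafka Connect.

```python
def committed_file_path(topic: str, partition: int, start: int, end: int,
                        topics_dir: str = "topics", ext: str = "avro") -> str:
    """Build the committed file path under the assumed default layout.

    Assumed layout: /<topics.dir>/<topic>/partition=<n>/<topic>+<n>+<start>+<end>.<ext>
    """
    name = f"{topic}+{partition}+{start:010d}+{end:010d}.{ext}"
    return f"/{topics_dir}/{topic}/partition={partition}/{name}"


# Two connectors handling different topics write under different subdirectories.
path_1 = committed_file_path("topic1", 0, 0, 2)
path_2 = committed_file_path("topic2", 0, 0, 2)
```

Since the topic name is part of both the directory and the file name, the two connectors in the question would only overlap if they were configured with the same topic.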

Kafka connect to load streaming log data into Kafka

Is the Kafka Spool Directory connector suitable for loading streaming (log) data into Kafka in production? Can it be run in distributed mode? Is there any other connector that can be used, since the FileStream source connector is not suitable for production?
Does this match your requirements?
"provides the capability to watch a directory for files and read the data as new files are written to the input directory."
Do you have CSV or JSON files?
If so, then you can use the Spooldir connector.
It can be argued that something like Flume, Logstash, Filebeat, FluentD, Syslog, GELF, or another log solution is better suited to collecting logs into Kafka.

Spark Streaming with Nifi

I am looking for a way to make use of Spark Streaming in NiFi. I have seen a couple of posts where a Site-to-Site TCP connection is used for a Spark Streaming application, but I think it would be good if I could launch Spark Streaming from a custom NiFi processor.
PublishKafka would publish messages into Kafka, and then a NiFi Spark Streaming processor would read from the Kafka topic.
I can launch a Spark Streaming application from a custom NiFi processor using the Spark Streaming launcher API, but the biggest challenge is that it will create a Spark streaming context for each flow file, which can be a costly operation.
Would anyone suggest storing the Spark streaming context in a controller service? Or is there a better approach for running a Spark Streaming application with NiFi?
You can use ExecuteSparkInteractive to run the Spark code that you are trying to include in your Spark Streaming application.
You need a few things set up for Spark code to run from within NiFi:
Set up a Livy server.
Add NiFi controller services to start Spark Livy sessions:
LivySessionController
StandardSSLContextService (may be required)
Once you enable LivySessionController in NiFi, it will start Spark sessions, and you can check in the Spark UI whether those Livy sessions are up and running.
Now that the Livy Spark sessions are running, whenever a flow file moves through the NiFi flow, the Spark code in ExecuteSparkInteractive will run.
This is similar to a Spark Streaming application running outside NiFi. For me this approach works very well and is easier to maintain than a separate Spark Streaming application.
Hope this helps!
"I can launch a Spark Streaming application from a custom NiFi processor using the Spark Streaming launcher API, but the biggest challenge is that it will create a Spark streaming context for each flow file, which can be a costly operation."
You'd be launching a standalone application in each case, which is not what you want. If you are going to integrate with Spark Streaming or Flink, you should be using something like Kafka to pub-sub between them.