Connecting Spark Streaming to StreamSets input

I was wondering whether it is possible to provide input to Spark Streaming from StreamSets. I noticed that Spark Streaming is not listed among the StreamSets connector destinations at https://streamsets.com/connectors/ .
I am exploring whether there are other ways to connect them for a sample POC.

The best way to process data coming in from StreamSets Data Collector (SDC) in Apache Spark Streaming would be to write the data out to a Kafka topic and read the data from there. This allows you to decouple Spark Streaming from SDC, so each can proceed at its own rate of processing.
SDC microbatches are defined by record count, while Spark Streaming microbatches are dictated by time. This means that an SDC batch may not (and probably will not) correspond to a Spark Streaming batch; most likely, a Spark Streaming batch will contain data from several SDC batches. SDC "commits" each batch once it is sent to the destination, so writing batches directly to Spark Streaming would require each SDC batch to correspond to a Spark Streaming batch to avoid data loss.
It is also possible that Spark Streaming "re-processes" already committed batches due to processing or node failures. SDC cannot re-process committed batches, so to recover from a situation like this you would have to write to something like Kafka, which allows batches to be re-processed. A direct connector that writes from SDC to Spark Streaming would therefore be complex and would likely have data loss issues.
In short, your best option would be SDC -> Kafka -> Spark Streaming.
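To illustrate why count-based and time-based microbatches rarely line up, here is a minimal pure-Python sketch. It does not use SDC or Spark; the arrival times, batch size, and interval are made-up values chosen only for illustration.

```python
from collections import defaultdict

def count_batches(arrival_times, batch_size):
    """Group record indices into batches of a fixed record count (SDC-style)."""
    n = len(arrival_times)
    return [list(range(i, min(i + batch_size, n))) for i in range(0, n, batch_size)]

def time_batches(arrival_times, interval):
    """Group record indices into batches by time window (Spark Streaming-style)."""
    windows = defaultdict(list)
    for idx, t in enumerate(arrival_times):
        windows[int(t // interval)].append(idx)
    return [windows[w] for w in sorted(windows)]

# Hypothetical arrival times, in seconds, for eight records.
arrivals = [0.1, 0.4, 0.9, 1.2, 2.5, 2.6, 2.7, 5.0]

sdc_batches = count_batches(arrivals, batch_size=3)
spark_batches = time_batches(arrivals, interval=2.0)

print(sdc_batches)    # [[0, 1, 2], [3, 4, 5], [6, 7]]
print(spark_batches)  # [[0, 1, 2, 3], [4, 5, 6], [7]]
```

In this toy run the second count-based batch (records 3 to 5) is split across two time windows, which is the mismatch described above.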

Related

How to do batch processing on kafka connect generated datasets?

Suppose we have batch jobs producing records into Kafka, and a Kafka Connect cluster consuming those records and moving them to HDFS. We want to be able to run batch jobs later on the same data, but we want to ensure that those batch jobs see all of the records generated by the producers. What is a good design for this?
You can run any MapReduce, Spark, Hive, etc. query on the data, and you will get all records that have so far been written to HDFS. The query will not see data that the sink has not yet consumed from the producers, but this has nothing to do with Connect or HDFS; it is purely a Kafka limitation.
Worth pointing out that Apache Pinot is a better place to combine Kafka streaming data and have batch query support.

Are spark.streaming.backpressure.* properties applicable to Spark Structured Streaming?

My understanding is that Spark Structured Streaming is built on top of Spark SQL and not Spark Streaming. Hence the following question: do the properties that apply to Spark Streaming also apply to Spark Structured Streaming, such as:
spark.streaming.backpressure.initialRate
spark.streaming.backpressure.enabled
spark.streaming.receiver.maxRate
No, these settings are applicable only to the DStream API.
Spark Structured Streaming does not have a backpressure mechanism. You can find more details in this discussion: How Spark Structured Streaming handles backpressure?
No.
Spark Structured Streaming processes data as soon as possible by default, starting the next batch right after finishing the current one. You can control the rate of processing for various source types, e.g. maxFilesPerTrigger for files and maxOffsetsPerTrigger for Kafka.
This link http://javaagile.blogspot.com/2019/03/everything-you-needed-to-know-about.html explains that back pressure is not relevant.
It quotes: "Structured Streaming cannot do real backpressure, because, such as, Spark cannot tell other applications to slow down the speed of pushing data into Kafka."
I am not sure this aspect is relevant, as Kafka buffers the data. Nonetheless, the article has good merit, imho.
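As a rough sketch of the maxOffsetsPerTrigger option mentioned above, the snippet below builds the option map for the Structured Streaming Kafka source and applies it to a reader. The broker address "localhost:9092", the topic "events", and the limit of 10000 are placeholder values, and the helper names are hypothetical. Calling rate_limited_kafka_stream assumes an existing SparkSession with the spark-sql-kafka package on the classpath.

```python
def kafka_source_options(bootstrap_servers, topic, max_offsets_per_trigger):
    """Build options for Spark's Kafka source (values are passed as strings)."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        # Upper bound on the total number of offsets read per micro-batch.
        "maxOffsetsPerTrigger": str(max_offsets_per_trigger),
    }

def rate_limited_kafka_stream(spark, options):
    """Return a streaming DataFrame; `spark` is an existing SparkSession."""
    return spark.readStream.format("kafka").options(**options).load()

# Placeholder values, for illustration only.
opts = kafka_source_options("localhost:9092", "events", 10000)
print(opts["maxOffsetsPerTrigger"])
```

This caps how much each micro-batch reads; it does not slow down the producers, which is the point the quoted article makes.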

How to use Kafka consumer in spark

I am using Spark 2.1 and Kafka 0.10.1.
I want to process data by reading the entire contents of specific Kafka topics on a daily basis.
For Spark Streaming, I know that createDirectStream only needs a list of topics and some configuration information as arguments.
However, I realized that createRDD requires the topic, partition, and offset information for everything it reads.
I want to make batch processing as convenient as streaming in Spark.
Is it possible?
I suggest reading this article from Cloudera.
The example shows how to read the data from Kafka just once, by persisting the offsets in Postgres and relying on its ACID guarantees.
I hope that solves your problem.
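The offset bookkeeping behind that approach can be sketched in plain Python. This is only an illustration under assumptions, not the code from the Cloudera article: the function name, the dictionaries, and the topic "events" are hypothetical, and starting new partitions at offset 0 is a simplification. The resulting tuples mirror the topic, partition, from-offset, and until-offset fields that a createRDD call needs for each range.

```python
def daily_offset_ranges(topic, last_read, latest):
    """Return (topic, partition, from_offset, until_offset) tuples to read.

    last_read: {partition: next offset to read}, e.g. loaded from a database.
    latest:    {partition: current end offset}, e.g. queried from Kafka.
    """
    ranges = []
    for partition in sorted(latest):
        # Assumption: a partition never seen before is read from offset 0.
        start = last_read.get(partition, 0)
        end = latest[partition]
        if end > start:  # skip partitions with no new data
            ranges.append((topic, partition, start, end))
    return ranges

# Hypothetical state: offsets stored after yesterday's run, and today's end offsets.
stored = {0: 100, 1: 250}
ends = {0: 180, 1: 250, 2: 40}

ranges = daily_offset_ranges("events", stored, ends)
print(ranges)  # [('events', 0, 100, 180), ('events', 2, 0, 40)]
```

After a successful run, the until-offsets would be written back to the store, ideally in the same transaction as the job's results, so that a failed run can simply be repeated.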

How to make Spark Streaming read from Kafka asynchronously

I have one Kafka partition and one Spark Streaming application, running on one server with 10 cores. When Spark Streaming gets a message from Kafka, the subsequent processing takes 5 seconds (this is my code). I found that Spark Streaming reads Kafka messages very slowly; my guess is that after Spark reads a message it waits until that message has been processed, so reading and processing are synchronous.
Can I make Spark read asynchronously, so that reading from Kafka is not held up by the subsequent processing? Spark would then consume data from Kafka very quickly, and I could focus on the slow data processing inside Spark. By the way, I'm using the foreachRDD function.
You can increase the number of partitions in Kafka, which should improve parallelism. You can also try the "Direct Kafka receiver", which can noticeably improve performance when your app is reading from Kafka.
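A back-of-the-envelope estimate shows why the partition count matters here. The sketch below assumes that each Kafka partition is handled by one task and that a task processes its messages one after another; the function name is hypothetical and the numbers come from the question.

```python
def max_throughput(kafka_partitions, cores, seconds_per_message):
    """Upper bound on messages per second under the assumptions above."""
    # Parallelism is limited by whichever is smaller: partitions or cores.
    parallel_tasks = min(kafka_partitions, cores)
    return parallel_tasks / seconds_per_message

one_partition = max_throughput(1, 10, 5.0)
ten_partitions = max_throughput(10, 10, 5.0)

print(one_partition)   # 0.2
print(ten_partitions)  # 2.0
```

With a single partition, nine of the ten cores stay idle regardless of how quickly messages could be fetched from Kafka.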

Spark/Spark Streaming in production without HDFS

I have been developing applications using Spark/Spark Streaming, but so far I have always used HDFS for file storage. I have now reached a stage where I am exploring whether it can be done (in production, running 24/7) without HDFS. I tried sifting through the Spark user group but have not found any concrete answer so far. Note that I do use checkpoints and stateful stream processing using updateStateByKey.
Depending on the streaming source (I've been using Kafka), you do not need to use checkpoints etc.
Since Spark 1.3 there has been a direct approach with several benefits:
Simplified Parallelism: No need to create multiple input Kafka streams and union-ing them. With directStream, Spark Streaming will create as many RDD partitions as there are Kafka partitions to consume, which will all read data from Kafka in parallel. So there is one-to-one mapping between Kafka and RDD partitions, which is easier to understand and tune.
Efficiency: Achieving zero-data loss in the first approach required the data to be stored in a Write Ahead Log, which further replicated the data. This is actually inefficient as the data effectively gets replicated twice - once by Kafka, and a second time by the Write Ahead Log. This second approach eliminates the problem as there is no receiver, and hence no need for Write Ahead Logs.
Exactly-once semantics: The first approach uses Kafka’s high level API to store consumed offsets in Zookeeper. This is traditionally the way to consume data from Kafka. While this approach (in combination with write ahead logs) can ensure zero data loss (i.e. at-least once semantics), there is a small chance some records may get consumed twice under some failures. This occurs because of inconsistencies between data reliably received by Spark Streaming and offsets tracked by Zookeeper. Hence, in this second approach, we use simple Kafka API that does not use Zookeeper and offsets tracked only by Spark Streaming within its checkpoints. This eliminates inconsistencies between Spark Streaming and Zookeeper/Kafka, and so each record is received by Spark Streaming effectively exactly once despite failures.
If you are using Kafka, you can find out more here (see Approach 2):
https://spark.apache.org/docs/1.3.0/streaming-kafka-integration.html