Are spark.streaming.backpressure.* properties applicable to Spark Structured Streaming? - scala

My understanding is that Spark Structured Streaming is built on top of Spark SQL and not Spark Streaming. Hence the question: do the properties that apply to Spark Streaming also apply to Spark Structured Streaming, such as:
spark.streaming.backpressure.initialRate
spark.streaming.backpressure.enabled
spark.streaming.receiver.maxRate

No, these settings apply only to the DStream API.
Spark Structured Streaming does not have a backpressure mechanism. You can find more details in this discussion: How Spark Structured Streaming handles backpressure?

No.
Spark Structured Streaming processes data as soon as possible by default, i.e. right after finishing the current batch. You can control the rate of processing via source-specific options, e.g. maxFilesPerTrigger for file sources and maxOffsetsPerTrigger for Kafka.
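For example (a minimal sketch, assuming a spark-sql-kafka dependency on the classpath; the broker address, topic name and offset cap are placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rate-limited-read").getOrCreate()

// Rate-limit a Kafka source: cap how many offsets are consumed per micro-batch.
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder broker
  .option("subscribe", "events")                       // placeholder topic
  .option("maxOffsetsPerTrigger", 10000L)              // max offsets per trigger
  .load()

// For a file source, the analogous option is maxFilesPerTrigger:
// spark.readStream.format("json").option("maxFilesPerTrigger", 100).load("/path")
```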
This link http://javaagile.blogspot.com/2019/03/everything-you-needed-to-know-about.html explains that back pressure is not relevant.
It states: "Structured Streaming cannot do real backpressure, because, such as, Spark cannot tell other applications to slow down the speed of pushing data into Kafka."
I am not sure this aspect is relevant, as Kafka buffers the data. Nonetheless, the article has good merit imho.

Related

Stream CDC change with Kafka and Spark still processes it in batches, whereas we wish to process each record

I'm still new to Spark and I want to learn more about it. I want to build a data pipeline architecture with Kafka and Spark. Here is my proposed architecture, where PostgreSQL provides data for Kafka. The condition is that the PostgreSQL database is not empty, and I want to catch any CDC change in the database. In the end, I want to grab the Kafka messages and process them in a stream with Spark, so I can get analysis of what is happening at the moment the CDC event occurs.
However, when I try to run a simple stream, it seems Spark receives the data as a stream but processes it in batches, which is not my goal. I have seen some articles where the source of data comes from an API that we want to monitor, and there are few examples of database-to-database stream processing. I have done the process before from Kafka to another database, but I need to transform and aggregate the data (I'm not using Confluent and rely on the generic Kafka + Debezium + JDBC connectors).
Given my case, can Spark and Kafka meet the requirement? Thank you
I have designed such pipelines, and if you use Structured Streaming with Kafka in continuous or non-continuous mode, you will always get a microbatch. You can still process the individual records, so I'm not sure what the issue is.
If you want to consume strictly per record, then use the Spring Boot Kafka setup for consuming Kafka messages; it can work in various ways and fulfill your need. Spring Boot offers various modes of consumption.
Of course, Spark Structured Streaming can be done using Scala and has a lot of support, obviating extra work elsewhere.
https://medium.com/@contactsunny/simple-apache-kafka-producer-and-consumer-using-spring-boot-41be672f4e2b This article discusses the single-message processing approach.
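To illustrate the per-record point above, here is a rough, untested Scala sketch of a Structured Streaming job that handles each Kafka record individually via a ForeachWriter; the broker, topic and processing logic are placeholders:

```scala
import org.apache.spark.sql.{ForeachWriter, Row, SparkSession}

object PerRecordCdc {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("cdc-per-record").getOrCreate()

    val cdc = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // placeholder broker
      .option("subscribe", "pg.cdc.topic")                 // placeholder CDC topic
      .load()
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    // Even though Spark pulls data in micro-batches, the ForeachWriter below is
    // invoked once per record, so per-record handling is still possible.
    val query = cdc.writeStream
      .foreach(new ForeachWriter[Row] {
        def open(partitionId: Long, epochId: Long): Boolean = true
        def process(record: Row): Unit = {
          // placeholder: react to a single CDC event here
          println(s"key=${record.getString(0)} value=${record.getString(1)}")
        }
        def close(errorOrException: Throwable): Unit = ()
      })
      .start()

    query.awaitTermination()
  }
}
```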

How to do the transformations in Kafka (PostgreSQL -> Redshift)

I'm new to Kafka/AWS. My requirement is to load data from several sources into a DW (Redshift).
One of my sources is PostgreSQL. I found a good article on using Kafka to sync data into Redshift.
The article is good enough to sync the data from PostgreSQL to Redshift, but my requirement is to transform the data before loading it into Redshift.
Can somebody help me with how to transform the data in Kafka (PostgreSQL -> Redshift)?
Thanks in Advance
Jay
Here's an article I just published on exactly this pattern, describing how to use Apache Kafka's Connect API, and KSQL (which is built on Kafka's Streams API) to do streaming ETL: https://www.confluent.io/ksql-in-action-real-time-streaming-etl-from-oracle-transactional-data
You should check out Debezium for streaming events from Postgres into Kafka.
For this, you can use any streaming framework, be it Storm, Spark, or Kafka Streams. These applications will consume data from different sources, and the data transformation can be done on the fly. All three have their own advantages and complexity.
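If you go the Spark route, the transform step could look roughly like the sketch below (illustrative only; requires Spark 2.4+ for foreachBatch, and the topic, JSON fields, JDBC URL, table and credentials are placeholders; writing over JDBC assumes the Redshift JDBC driver is on the classpath):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object KafkaToRedshiftSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-to-redshift").getOrCreate()
    import spark.implicits._

    // Placeholder broker and topic; the topic would carry the JSON/CDC payload.
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "postgres.public.orders")
      .load()

    // Example transformation: pull a couple of fields out of the JSON value
    // and add a load timestamp before the data ever reaches Redshift.
    val transformed = raw
      .selectExpr("CAST(value AS STRING) AS json")
      .select(
        get_json_object($"json", "$.id").as("id"),
        get_json_object($"json", "$.amount").cast("double").as("amount"))
      .withColumn("loaded_at", current_timestamp())

    // Write each micro-batch over JDBC (placeholder URL, table and credentials).
    val writeBatch: (DataFrame, Long) => Unit = (batch, _) =>
      batch.write
        .format("jdbc")
        .option("url", "jdbc:redshift://example-cluster:5439/dev")
        .option("dbtable", "public.orders_stage")
        .option("user", "etl_user")
        .option("password", "change-me")
        .mode("append")
        .save()

    transformed.writeStream.foreachBatch(writeBatch).start().awaitTermination()
  }
}
```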

How to use a Kafka consumer in Spark

I am using Spark 2.1 and Kafka 0.10.1.
I want to process the data by reading the entire data of specific Kafka topics on a daily basis.
For Spark Streaming, I know that createDirectStream only needs a list of topics and some configuration information as arguments.
However, I realized that createRDD requires all of the topic, partition, and offset information.
I want to make batch processing as convenient as streaming in Spark.
Is it possible?
I suggest you read this text from Cloudera.
The example shows how to read the data from Kafka exactly once, persisting the offsets in Postgres and relying on its ACID guarantees.
I hope that solves your problem.
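A bare-bones sketch of that approach with spark-streaming-kafka-0-10 (untested; the broker, topic and hard-coded offsets are placeholders you would replace with values read from your own offset store):

```scala
import scala.collection.JavaConverters._

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.kafka010.{KafkaUtils, LocationStrategies, OffsetRange}

object DailyKafkaBatch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("daily-kafka-batch"))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",              // placeholder broker
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "daily-batch"
    ).asJava

    // createRDD needs explicit offsets: one OffsetRange per topic-partition.
    // The values below are hard-coded for illustration; in practice you would
    // look them up from wherever you persisted them (e.g. the Postgres table).
    val offsetRanges = Array(
      OffsetRange("my-topic", 0, 0L, 1000L), // partition 0: offsets [0, 1000)
      OffsetRange("my-topic", 1, 0L, 1000L)  // partition 1: offsets [0, 1000)
    )

    val rdd = KafkaUtils.createRDD[String, String](
      sc, kafkaParams, offsetRanges, LocationStrategies.PreferConsistent)

    rdd.map(_.value).take(10).foreach(println)
    sc.stop()
  }
}
```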

Read Spark Streaming checkpoint data

I'm writing a Spark Streaming application that reads from Kafka. In order to have exactly-once semantics, I'd like to use the direct Kafka stream together with Spark Streaming's native checkpointing.
The problem is that checkpointing makes it practically impossible to maintain the code: if you change anything, you lose the checkpointed data, and so you are almost compelled to read some messages from Kafka twice. I'd like to avoid that.
Thus, I was trying to read the data in the checkpointing directory myself, but so far I haven't been able to. Can someone tell me how to read the information about the last processed Kafka offsets from the checkpointing folder?
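For reference, my setup looks roughly like this (a simplified sketch; the checkpoint path, broker and topic are placeholders):

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object CheckpointedKafkaApp {
  val checkpointDir = "hdfs:///checkpoints/my-app" // placeholder path

  def createContext(): StreamingContext = {
    val ssc = new StreamingContext(new SparkConf().setAppName("my-app"), Seconds(30))
    ssc.checkpoint(checkpointDir)

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",             // placeholder broker
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "my-app",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("my-topic"), kafkaParams))

    stream.map(_.value).print() // placeholder processing
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Restores from the checkpoint if it exists, otherwise builds a new context;
    // changing the application code usually invalidates the checkpointed data.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```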
Thank you,
Marco

Connecting Spark streaming to streamsets input

I was wondering if it would be possible to provide input to Spark Streaming from StreamSets. I noticed that Spark Streaming is not supported among the StreamSets connector destinations https://streamsets.com/connectors/ .
I am exploring whether there are other ways to connect them for a sample POC.
The best way to process data coming in from StreamSets Data Collector (SDC) in Apache Spark Streaming is to write the data out to a Kafka topic and read it from there. This allows you to separate Spark Streaming from SDC, so both can proceed at their own rate of processing.
SDC microbatches are defined by record count, while Spark Streaming microbatches are dictated by time. This means that each SDC batch may not (and probably will not) correspond to a Spark Streaming batch (most likely a Spark Streaming batch will contain data from several SDC batches). SDC "commits" each batch once it is sent to the destination - writing a batch directly to Spark Streaming would mean that each SDC batch has to correspond to a Spark Streaming batch to avoid data loss.
It is also possible that Spark Streaming "re-processes" already committed batches due to processing or node failures. SDC cannot re-process committed batches - so to recover from a situation like this, you'd really have to write to something like Kafka, which allows you to re-process the batches. A direct connector that writes from SDC to Spark Streaming would therefore be complex and would likely have data-loss issues.
In short, your best option would be SDC -> Kafka -> Spark Streaming.