Streaming CDC changes with Kafka and Spark still processes them in batches, but we want to process each record - postgresql

I'm still new to Spark and want to learn more about it. I want to build a data pipeline architecture with Kafka and Spark. Here is my proposed architecture, where PostgreSQL provides data for Kafka. The PostgreSQL database is not empty, and I want to capture every CDC change in it. In the end, I want to take the Kafka messages and process them as a stream with Spark, so I can analyze what happens at the same time the CDC event occurs.
However, when I try to run a simple stream, it seems Spark receives the data as a stream but processes it in batches, which is not my goal. I have seen some articles where the data source for this case is an API being monitored, and there are few examples of database-to-database stream processing. I have done this before from Kafka to another database, but now I need to transform and aggregate the data (I'm not using Confluent and rely on generic Kafka + Debezium + JDBC connectors).
Given my case, can Spark and Kafka meet this requirement? Thank you.

I have designed such pipelines. With Structured Streaming on Kafka you normally get micro-batches; the experimental continuous processing mode is the exception, but it supports only map-like operations, not aggregations. Within a micro-batch you can still process the individual records, so I'm not sure what the issue is.
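To illustrate the point that a micro-batch does not prevent per-record processing, here is a minimal sketch in plain Python. It does not use Spark; `handle_change` and `process_micro_batch` are hypothetical names, and the sample rows are made up. The fields `op`, `before` and `after` follow the change event envelope that Debezium emits.

```python
import json


def handle_change(raw_value):
    """Hypothetical per-record handler for one Debezium-style change event."""
    event = json.loads(raw_value)
    # Events may or may not be wrapped in a schema/payload envelope.
    payload = event.get("payload", event)
    op = payload["op"]  # "c" insert, "u" update, "d" delete, "r" snapshot read
    # A delete carries the old row in "before"; other operations use "after".
    row = payload["before"] if op == "d" else payload["after"]
    return {"op": op, "id": row["id"]}


def process_micro_batch(batch):
    """The engine picks the batch boundaries; the handler still sees one record at a time."""
    return [handle_change(value) for value in batch]


# Made-up sample events for illustration.
batch = [
    json.dumps({"op": "c", "before": None, "after": {"id": 1, "amount": 10}}),
    json.dumps({"op": "u", "before": {"id": 1, "amount": 10}, "after": {"id": 1, "amount": 15}}),
    json.dumps({"op": "d", "before": {"id": 1, "amount": 15}, "after": None}),
]
results = process_micro_batch(batch)
for result in results:
    print(result)
```

In a real job the same handler would be applied to the rows of each micro-batch; the batching mainly affects latency, not whether each record can be handled individually.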
If you want to process record by record, then use the Spring Boot Kafka setup to consume the Kafka messages. That can work in various ways and fulfill your need; Spring Boot offers various modes of consumption.
Of course, Spark Structured Streaming can be written in Scala and has a lot of built-in support, which saves extra work elsewhere.
https://medium.com/#contactsunny/simple-apache-kafka-producer-and-consumer-using-spring-boot-41be672f4e2b This article discusses the single-message processing approach.

Related

What is the gain of using kafka-connect over traditional approach?

I have a use case where I need to send the data changes in a relational database to a Kafka topic.
I'm able to write a simple JDBC program which executes a set of queries for the changes in a certain time period and writes the data to a Kafka topic using KafkaTemplate (a wrapper provided by the Spring framework).
If I do the same using Kafka Connect, that is, by writing a source connector, what benefits or overheads (if any) will I get?
The first thing is that you have "... to write a simple JDBC program ..." and take care of the logic of writing to both the database and the Kafka topic.
Kafka Connect does that for you, and your business application only has to write to the database. With Kafka Connect you get more than that: fail-over handling, parallelism, scaling and so on all come out of the box. Otherwise you have to take care of them yourself, for example when a write to the database succeeds but something fails and you cannot write to the Kafka topic.
Today you want to ingest from a database using a set of queries from one database to a Kafka topic, and write some bespoke code to do that.
Tomorrow you want to use a second database, or you want to change the serialisation format of your data in Kafka, or you want to scale out your ingest or you want to have high availability. Or you want to add in the ability to stream data from Kafka to another target, to ingest data also from other places. And, manage it all centrally using a standardised configuration pattern expressed just in JSON. Oh, and you want it to be easily maintainable by someone else who doesn't have to read through code but can just use a common API of Apache Kafka (which is what Kafka Connect is).
If you manage to do all of this yourself—you've just reinvented Kafka Connect :)
I talk extensively about this in my Kafka Summit session: "From Zero to Hero with Kafka Connect" which you can find online here
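As a rough illustration of "a standardised configuration pattern expressed just in JSON", the sketch below builds a connector definition in the shape the Kafka Connect REST API accepts (a name plus a flat map of string settings). It assumes the Confluent JDBC source connector; the connector name, host, credentials, table and column names are all hypothetical placeholders.

```python
import json

# Hypothetical connector definition; every value below is a placeholder.
connector = {
    "name": "orders-jdbc-source",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:postgresql://db.example.internal:5432/shop",
        "connection.user": "connect_user",
        "connection.password": "change-me",
        # Pick up new and changed rows by watching a timestamp column.
        "mode": "timestamp",
        "timestamp.column.name": "updated_at",
        "table.whitelist": "orders",
        # Rows from table "orders" go to topic "shop-orders".
        "topic.prefix": "shop-",
        "tasks.max": "1",
    },
}

connector_json = json.dumps(connector, indent=2)
print(connector_json)
```

Adding a second database or changing the polling mode then becomes a change to this document rather than to bespoke code.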

How to get data from Kafka into a store without Kafka Connect sink?

When reading about Kafka and how to get data from Kafka to a queryable database suited for some specific task, there is usually mention of Kafka Connect sinks.
This sounds like the way to go if I need to get data from Kafka into search indexing like Elasticsearch or analytics like Hadoop or Spark, where a Kafka Connect sink is available.
But my question is: what is the best way to handle a store that isn't as popular, say MyImaginaryDB, where the only way I can get to it is through some API, and the data needs to be handled securely and reliably, as well as decently transformed before inserting? Is it recommended to:
1. Just have the API consume from Kafka and use the MyImaginaryDB driver to write
2. Figure out how to build a custom Kafka Connect sink (assuming it can handle schemas, authentication/authorization, retries, fault-tolerance, transforms and post-processing needed before landing in MyImaginaryDB)
I have also been reading about Kafka KSQL and Streams and am wondering if that helps with transforming the data before it is sent to the end store.
Option 2, definitely. Just because there isn't an existing source connector, doesn't mean Kafka Connect isn't for you. If you're going to be writing some code anyway, it still makes sense to hook into the Kafka Connect framework. Kafka Connect handles all the common stuff (schemas, serialisation, restarts, offset tracking, scale out, parallelism etc etc), and leaves you just to implement the bit of getting the data to MyImaginaryDB.
As regards transformations, the standard pattern is either:
Use a Single Message Transform for lightweight stuff
Use Kafka Streams/KSQL and write back to another topic, which is then routed through Kafka Connect to the target
If you try to build your own app doing (transformation + data sink) then you're munging together responsibilities, and you're reinventing a chunk of wheel that already exists (integration with an external system in a reliable, scalable way).
You might find this talk useful for background about what Kafka Connect can do: http://rmoff.dev/ksldn19-kafka-connect

Kafka streams vs Kafka connect for Kafka HBase ETL pipeline

I have a straightforward scenario for the ETL job: take data from a Kafka topic and put it into an HBase table. In the future I'm going to add support for some logic after reading data from a topic.
I am considering two options:
Use Kafka Streams to read data from a topic and then write each record via the native HBase driver
Use a Kafka -> HBase connector
I have the following concerns about my options:
Is it a good idea to write data each time it arrives in a Kafka Streams window? I suspect it will degrade performance.
The Kafka HBase connector is supported only by a third-party developer. I'm not sure about the code quality of this solution, or about the option to add custom aggregation logic over data from a topic.
I have been searching for ETL options for Kafka to HBase myself. So far my research tells me that it's not a good idea to interact with an external system from within a Kafka Streams application (check the answer here and here). Kafka Streams is super powerful and great if you have a Kafka -> transform message -> Kafka kind of use case, and eventually you can have Kafka Connect take your data from the Kafka topic and write it to a sink.
Since you do not want to use the third-party Kafka Connect connector for HBase, one option is to write something yourself using the Connect API. The other option is to use the Kafka consumer/producer API and write the app the traditional way: poll the messages, write to the sink, commit the batch and move on.
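The "poll the messages, write to sink, commit the batch and move on" loop can be sketched as follows. This is a minimal sketch with an in-memory stand-in instead of a real Kafka client; `InMemoryConsumer` and `run` are hypothetical names, and a list plays the role of the HBase table.

```python
class InMemoryConsumer:
    """Stand-in for a Kafka consumer; hypothetical, for illustration only."""

    def __init__(self, records):
        self._records = records
        self._position = 0   # next offset to hand out
        self.committed = 0   # everything before this offset is committed

    def poll(self, max_records):
        batch = self._records[self._position:self._position + max_records]
        self._position += len(batch)
        return batch

    def commit(self):
        self.committed = self._position


def run(consumer, sink, max_records=2):
    while True:
        batch = consumer.poll(max_records)
        if not batch:
            break            # a real consumer would keep polling
        sink.extend(batch)   # write the whole batch to the sink first
        consumer.commit()    # commit only after the write succeeded


consumer = InMemoryConsumer(["r1", "r2", "r3", "r4", "r5"])
sink = []
run(consumer, sink)
print(sink, consumer.committed)
```

Committing after the write means a crash between the two steps replays the batch rather than losing it, so the sink should tolerate duplicates. Writing per batch rather than per record also addresses the performance concern raised in the question.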

Spark/Spark Streaming in production without HDFS

I have been developing applications using Spark/Spark Streaming but so far have always used HDFS for file storage. However, I have reached a stage where I am exploring whether it can be done (in production, running 24/7) without HDFS. I tried sifting through the Spark user group but have not found any concrete answer so far. Note that I do use checkpoints and stateful stream processing using updateStateByKey.
Depending on the streaming source (I've been using Kafka), you do not need to use checkpoints etc.
Since Spark 1.3 there is a direct approach with several benefits:
Simplified Parallelism: No need to create multiple input Kafka streams and union-ing them. With directStream, Spark Streaming will create as many RDD partitions as there is Kafka partitions to consume, which will all read data from Kafka in parallel. So there is one-to-one mapping between Kafka and RDD partitions, which is easier to understand and tune.
Efficiency: Achieving zero-data loss in the first approach required the data to be stored in a Write Ahead Log, which further replicated the data. This is actually inefficient as the data effectively gets replicated twice - once by Kafka, and a second time by the Write Ahead Log. This second approach eliminate the problem as there is no receiver, and hence no need for Write Ahead Logs.
Exactly-once semantics: The first approach uses Kafka’s high level API to store consumed offsets in Zookeeper. This is traditionally the way to consume data from Kafka. While this approach (in combination with write ahead logs) can ensure zero data loss (i.e. at-least once semantics), there is a small chance some records may get consumed twice under some failures. This occurs because of inconsistencies between data reliably received by Spark Streaming and offsets tracked by Zookeeper. Hence, in this second approach, we use simple Kafka API that does not use Zookeeper and offsets tracked only by Spark Streaming within its checkpoints. This eliminates inconsistencies between Spark Streaming and Zookeeper/Kafka, and so each record is received by Spark Streaming effectively exactly once despite failures.
If you are using Kafka, you can find out more here (see Approach 2):
https://spark.apache.org/docs/1.3.0/streaming-kafka-integration.html
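The one-to-one mapping and the offset tracking described in the quote can be pictured with a small sketch in plain Python. It does not use Spark; `OffsetRange` here is a local namedtuple that only mirrors the idea of an offset range, `plan_batch` is a hypothetical helper, and the topic name and offsets are made up.

```python
from collections import namedtuple

# One slice of one Kafka partition; each range would back one RDD partition.
OffsetRange = namedtuple(
    "OffsetRange", ["topic", "partition", "from_offset", "until_offset"]
)


def plan_batch(topic, checkpointed, latest):
    """Build one range per Kafka partition, from the saved offset up to the latest."""
    return [
        OffsetRange(topic, partition, checkpointed.get(partition, 0), latest[partition])
        for partition in sorted(latest)
    ]


checkpointed = {0: 100, 1: 250}    # offsets saved with the previous batch (made up)
latest = {0: 130, 1: 250, 2: 40}   # newest offsets per partition (made up)

ranges = plan_batch("events", checkpointed, latest)
for r in ranges:
    print(r, "records:", r.until_offset - r.from_offset)

# Once the batch has succeeded, the until-offsets become the next starting point.
checkpointed = {r.partition: r.until_offset for r in ranges}
```

Because the same component both reads the ranges and records where they ended, there is no second offset store that could disagree with it, which is the inconsistency the quoted passage describes.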

Does it make sense to build a data processing pipeline using only Kafka?

I am building a data processing pipeline using Kafka.
The pipeline is linear with 4 stages.
The data volume is medium (will need more than one machine but not hundreds or thousands; data volume is a few tens of gigabytes)
My question: can I use only Kafka, having a pipeline stage consume from a topic and produce on another topic? Should I be using Spark or Storm and why? Of course, I prefer the simplest possible architecture. If I can do it all with Kafka, I'd prefer that. In the future I may need some additional machine learning stages and that may affect the answer. I have no strict exactly-once requirements; I can accept some message loss and some duplication with no problem.
My question: can I use only Kafka, having a pipeline stage consume from a topic and produce on another topic? Should I be using Spark or Storm and why?
Technically, yes, you can, if you are ready to handle the whole distributed architecture on your own: writing your own multi-threaded producers, managing the consumers and so on. You also need to think about scalability, performance, durability etc. This is where a computation engine like Storm or Spark shines: you can simply concentrate on the core logic and leave the infrastructure to the engine.
For example, using a combination of Kafka and Storm for your architecture, you can store terabytes of data in Kafka and feed them to Storm for processing. If you are familiar with Storm, a sample topology could look something like this:
(kafka-spout consuming messages from topic) --> (Bolt-A for processing the data received through the spout & feeding it to Bolt-B) --> (Bolt-B for pushing the processed data back into another Kafka topic)
Such an architecture offers a great deal in terms of scalability, throughput, performance etc. With some easy configuration changes you will be able to tune your application to your requirements.
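For comparison, the Kafka-only variant from the question (each stage consumes from one topic and produces to the next) can be sketched as below. This is a minimal sketch in plain Python with in-memory lists standing in for topics; `run_stage`, the topic names and the transforms are hypothetical, and the partitioning, scaling and failure handling discussed above are exactly what it leaves out.

```python
from collections import defaultdict

# Topic name -> list of messages; an in-memory stand-in for Kafka topics.
topics = defaultdict(list)


def run_stage(source, target, transform):
    """Consume everything from `source`, apply `transform`, produce to `target`."""
    for message in topics[source]:
        result = transform(message)
        if result is not None:   # a stage may drop a message
            topics[target].append(result)


topics["raw"] = [" 3", "x", "4 ", "10"]

# A linear pipeline with four stages, each reading the previous stage's topic.
run_stage("raw", "cleaned", lambda m: m.strip())
run_stage("cleaned", "parsed", lambda m: int(m) if m.isdigit() else None)
run_stage("parsed", "squared", lambda m: m * m)
run_stage("squared", "labelled", lambda m: f"value={m}")

print(topics["labelled"])
```

Each stage only knows its input and output topic, so a later machine learning stage could be added as another consumer of an existing topic without touching the earlier ones.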