Build a solution for Kafka+Spark for RDBMS data - apache-kafka

My current project is in MainFrames with DB2 as its database. We have 70 databases with nearly 60 tables in each of them. Our architect proposed a plan of using Kafka with Spark streaming for processing data. How good is Kafka in reading the RDBMS tables for data ? Do we directly read the data from the tables using Kafka or is there any other way to get the data from RDBMS into Kafka ?
If there is any better solution, your suggestions can help a lot.

Do not directly read from database, it will create additional load. I would suggest two approaches.
Send new data both to databases and to Kafka, or send it to Kafka and then consume for processing.
Read data from database write ahead log (I know it is possible for MySQL with Maxwell but I am not sure for DB2) and send it to Kafka for further processing.
You can use Spark Streaming or Kafka Streams depending on your needs.

Related

RabbitMq and KStreams for Data Aggregation

I'm trying to solve the problem of data denormalization before indexing to the Elasticsearch. Right now, my Postgres 11 database is configured with pgoutput plugin and Debezium with Postgresql Connector is streaming the log changes to RabbitMq which are then aggregated by doing a reverse lookup on the db and feeding to the Elasticsearch.
Although, this works okay, the lookup at the App layer to aggregate the data is expensive and taking a lot of execution time (the query is already refined but it has about 10 joins making it sloppy).
The other alternative I explored was to use KStreams for data aggregation. My knowledge on Apache Kafka is minimal and thus I'm here. My question here is it a requirement to have Apache Kafka as the broker to be able to utilize the Java KStreams API or can it be leveraged with any broker such as RabbitMq? I'm unsure about this because all the articles talk about Kafka Topics and Key Value pairs which are specific to Apache Kafka.
If there is a better way to solve the data denormalization problem, I'm open to it too.
Thanks
Kafka Steams is only for Kafka. You're more than welcome to use Kafka Streams between Debezium and the process that consumes any topic (the Postgres connector that writes to RabbitMQ?)
You can use Spark, Flink, or Beam for stream processing on other messaging queues, but Debezium requires Kafka so start with tools around that.
Spark, for example, has an Elasticsearch writer library; not sure about the others.

How does connector works in KSQLDB and Kafka?

I don't have enough information about how source connector works in KSQLDb and Kafka altogether.
How much fast the data is populated to Kafka topics?
And what if KsqlDb stream needs data from source to join data, but data is loaded still?
Does source connector send updated/inserted data to topic, it happens instantly?
Could you help me with these issues or avice a good tutorial, where I can learn more.
How much fast the data is populated to Kafka topics?
Depends on the connector. Some connectors are event driven and some use a polling mechanism. The event driven connectors are generally going to be more real-time, but often require more db-side setup. Where as the polling based connectors generally don't require any db-side changes. With the polling based connectors you can increase the polling frequency, trading lower latencies for high db load.
Look more into the documentation of the connectors for more info.
And what if KsqlDb stream needs data from source to join data, but data is loaded still?
ksqlDB generally processes your data in time order. When joining two topics, ksqlDB will process the side with the oldest data. This will generally mean the stream data is not processed until the table is bootstrapped.
Does source connector send updated/inserted data to topic, it happens instantly?
Not sure how this question differs from question #1.

What is the gain of using kafka-connect over traditional approach?

I have a use case where I need to send the data changes in relational database into a kafka-topic.
I'm able to write a simple JDBC program which executes set of queries for the changes in certain time period and write data into kafka-topic using KafkaTemplate (a wrapper provided by spring framework).
If I do the same using kafka-connect, which is to write a source connector. what benefits or overheads (if in case any) will I get?
The first thing is that you have "... to write a simple JDBC program ..." and take care of the logic of writing on both database and Kafka topic.
Kafka Connect does that for you and your business application has to write to the database only. With Kafka Connect you have more than that like fail-over handling, parallelism, scaling, ... it's all out of box for you while you should take care of them when for example you write on the database but something fails and you are not able to write to Kafka topic and so on.
Today you want to ingest from a database using a set of queries from one database to a Kafka topic, and write some bespoke code to do that.
Tomorrow you want to use a second database, or you want to change the serialisation format of your data in Kafka, or you want to scale out your ingest or you want to have high availability. Or you want to add in the ability to stream data from Kafka to another target, to ingest data also from other places. And, manage it all centrally using a standardised configuration pattern expressed just in JSON. Oh, and you want it to be easily maintainable by someone else who doesn't have to read through code but can just use a common API of Apache Kafka (which is what Kafka Connect is).
If you manage to do all of this yourself—you've just reinvented Kafka Connect :)
I talk extensively about this in my Kafka Summit session: "From Zero to Hero with Kafka Connect" which you can find online here

reading data from Kafka and storing it in dynamo db

I need to read data from multiple topics in Kafka broker and store the data in Dynamo DB.
Any reference code or any specific method i can go ahead with.
I Tried using https://github.com/shikhar/kafka-connect-dynamodb but i couldn't get much help as am new to this.
One of the options to read from Kafka and write to Dynamo is Nifi.
Use ConsumeKafka Nifi Processor as consumer:
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-kafka-0-9-nar/1.5.0/org.apache.nifi.processors.kafka.pubsub.ConsumeKafka/
and PutDynamoDB to write:
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-aws-nar/1.5.0/org.apache.nifi.processors.aws.dynamodb.PutDynamoDB/
This also facilitates to do any quick transformation, forking etc.

How to do the transformations in Kafka (PostgreSQL-> Red shift )

I'm new to Kafka/AWS.My requirement to load data's from several sources into DW(Redshift).
One of my sources is PostgreSQL. I found a good article using Kafka to Sync data into Redshift.
This article is more good enough to sync the data between the PostgreSQL to redshift.But my requirement is to transform the data's before loading into Redshift.
Can somebody help me to how to transform the data's in Kafka (PostgreSQL->Redhsift)?
Thanks in Advance
Jay
Here's an article I just published on exactly this pattern, describing how to use Apache Kafka's Connect API, and KSQL (which is built on Kafka's Streams API) to do streaming ETL: https://www.confluent.io/ksql-in-action-real-time-streaming-etl-from-oracle-transactional-data
You should check out Debezium for streaming events from Postgres into Kafka.
For this, you can use any streaming application be it storm/spark/kafka streaming. These application will consume data from diff sources and the data transformation can be done on the fly. All three have their own advantages and complexity.