Is there a way to automatically tell Kafka to send all events of a specific topic to a specific table of a database?
I'd like to avoid creating a new consumer that needs to read from that topic and perform the copy explicitly.
You have two options here:
Kafka Connect - this is the standard way to connect your Kafka to a database. There are a lot of connectors. In order to choose one:
Your best bet is to use the specific one for your database that is maintained by Confluent.
If you don't have a specific one, the second best option is to use the JDBC connector.
Direct ingestion from the database, if your database supports it (for instance, ClickHouse and MemSQL are able to load data coming from a Kafka topic). The difference between this and Kafka Connect is that this way it is fully supported and tested by the database vendor, and you need to maintain fewer pieces of infrastructure.
Which one is better? It depends on:
your data volume
how much you can (and need to!) parallelize the load
and how much you can tolerate downtime or latencies.
Direct ingestion by the database usually means a single node (acting as the consumer) pulling from Kafka.
It is good for low-to-mid volume traffic. If it fails (or throttles), you might run into latency issues.
Kafka Connect allows you to insert data into the database in parallel using several workers. If one of the workers fails, the load is redistributed among the others. If you have a lot of data, this is probably the best way to load it into the database, but you'll need to take care of the Kafka Connect infrastructure unless you're using a managed cloud offering.
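For comparison, here is a rough sketch (in plain Java, with made-up topic, table, and column names) of the kind of consumer the original question is trying to avoid writing: a hand-rolled loop that reads a topic and copies each record into a table over JDBC, with batching, retries, and scaling all left as your problem.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class TopicToTableCopier {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "topic-to-table-copier");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/mydb");
             PreparedStatement insert = conn.prepareStatement(
                     "INSERT INTO events (event_key, event_value) VALUES (?, ?)")) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                // Poll the topic and copy each record into the target table.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    insert.setString(1, record.key());
                    insert.setString(2, record.value());
                    insert.executeUpdate();
                }
            }
        }
    }
}
```

A JDBC sink connector or the database's own Kafka ingestion replaces exactly this kind of code, along with the error handling, offset management, and parallelism it glosses over.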
Here is an example of how Kafka should run for a social network site.
But it is hard for me to understand the point of Kafka here. We would not want to store posts and likes in Kafka, as they will be destroyed after some time. So Kafka should be an intermediate store between the View and the DB.
But why would we need it? Wouldn't it be better to use the DB straight away?
I guess we could use Kafka as some kind of cache, so the data accumulates in Kafka and then we can insert it into the DB in one big batch query. But I am pretty sure that is not the reason Kafka is here.
What's not shown in the diagram are the processes querying the database (RocksDB, in this case). Without using Kafka Streams, you'd need to write some external service to run GROUP BY / SUM on the database. The "website" box on the left is doing some sort of front-end JavaScript, and it is unclear how the Kafka backend consumer sends data to it (perhaps WebSockets?).
With Kafka Streams Interactive Queries, that logic can be moved closer to the actual event source, and it is performed in near real time rather than as a polling batch. In a streaming framework, you could also send out individual event hooks (WebSockets, for example) to dynamically update "likes per post", "shares per post", "trends", etc. without needing the user to refresh the page, or having the page make AJAX calls with large API responses for those details for every rendered item.
More specifically, each Kafka Streams instance serves a specific query, rather than the API hitting one database for all queries. Therefore, load is more distributed and fault-tolerant.
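As a rough illustration of that idea (not the setup from the diagram), here is a minimal Kafka Streams sketch that counts likes per post into a local state store and looks a value up via Interactive Queries; the "likes" topic, the "likes-per-post" store name, and keying by post id are all assumptions.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class LikesPerPost {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "likes-per-post-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();
        // Assumes a "likes" topic keyed by post id; each record is one like event.
        builder.stream("likes", Consumed.with(Serdes.String(), Serdes.String()))
               .groupByKey()
               .count(Materialized.as("likes-per-post"));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();

        // Wait until the instance is RUNNING so the state store is queryable.
        while (streams.state() != KafkaStreams.State.RUNNING) {
            Thread.sleep(100);
        }

        // Interactive Query: read the current count for a post straight from the
        // local state store instead of polling an external database.
        ReadOnlyKeyValueStore<String, Long> store = streams.store(
                StoreQueryParameters.fromNameAndType("likes-per-post",
                        QueryableStoreTypes.keyValueStore()));
        System.out.println("likes for post-42: " + store.get("post-42"));
    }
}
```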
Worth pointing out that Apache Pinot loaded from Kafka is better suited for such real-time analytical queries than Kafka Streams.
Also, as you pointed out, Kafka (or any message queue) would act as a buffer ahead of any database (not a cache, although Redis could be added as a cache, just like the search service mentioned later). And there's nothing preventing you from adding another database connected via a Kafka Connect sink. For instance, a popular design is to write data to an RDBMS as well as Elasticsearch for text-based search indexing. The producer code only cares about one Kafka topic, not every downstream system where the data is needed.
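To make that last point concrete, here is a tiny producer sketch (topic name and payload are made up): the application only publishes to one topic, and fanning the data out to the RDBMS, Elasticsearch, or anything else is handled by Kafka Connect sinks attached to that topic.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LikeEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The producer only knows about the "likes" topic; downstream sinks
            // (RDBMS, Elasticsearch, ...) are attached to the topic, not to this code.
            producer.send(new ProducerRecord<>("likes", "post-42", "{\"user\":\"alice\"}"));
        }
    }
}
```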
I am considering using Kafka Connect Replicator for event enrichment inside the same cluster.
The idea is to have an SMT that will enrich the events; after that, the events need to be sent to MongoDB and an S3 bucket.
I understand that Kafka Streams / Flink are alternatives.
My question is: does this design make sense, or am I missing something here?
Thanks
Replicator is intended to be used between clusters, not within the same one. (It's also a paid feature, and you could just use MirrorMaker 2 instead to do the same, if that were a viable solution.)
Kafka Streams / ksqlDB are meant for transferring data within a cluster and seem to be the best option here.
Flink, Spark, or other stream processing tools would work, but they require an external scheduler, and they can themselves write to Mongo, S3, etc. without the need for Kafka Connect, so it really depends on how flexible you need the solution to be.
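For a sense of what the Kafka Streams option could look like here, this is a minimal sketch that reads raw events, applies a placeholder enrichment, and writes to a second topic that the MongoDB and S3 sink connectors would then consume; the topic names and the enrichment step itself are assumptions.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

public class EnrichmentApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "event-enrichment");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();
        // Read raw events, enrich each value, and write to an output topic.
        // The MongoDB and S3 sink connectors would be configured against "events-enriched".
        builder.stream("events-raw", Consumed.with(Serdes.String(), Serdes.String()))
               .mapValues(value -> enrich(value))
               .to("events-enriched", Produced.with(Serdes.String(), Serdes.String()));

        new KafkaStreams(builder.build(), props).start();
    }

    // Placeholder for whatever lookup or transformation the real enrichment needs.
    private static String enrich(String value) {
        return value + " [enriched]";
    }
}
```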
In a microservices-based architecture, who writes to Kafka: the services themselves, or the microservices' databases? I've been thinking about this and see pros and cons to both approaches, but I'm leaning towards having the database write to Kafka topics because:
The database and the data in the Kafka topic won't go out of sync if the write to Kafka fails for whatever reason
Application teams won't have one more step to worry about
Applications can keep focusing on the core function rather than worrying about Kafka.
Thanks for your inputs
As cricket_007 has been saying, databases typically cannot write to Apache Kafka themselves; instead, you'd need a change data capture service such as Debezium in order to stream data changes from the database into Kafka (disclaimer: I'm the lead of Debezium).
Such an approach allows you to ensure (eventual) consistency between a service's own database and the Kafka messages sent to other services. One specific CDC application I'd recommend looking into is the outbox pattern. The idea there is not to capture changes to the service's actual business tables, but instead to work with a separate "outbox table", into which the service writes the specific messages meant for consumption by other services. CDC would then be used to send these events from that table to Kafka.
This approach avoids exposing internal data structures to outside consumers, while also avoiding the issues of "dual writes" that a service would suffer from when writing directly to both its database and Kafka. Debezium has built-in support for the outbox pattern via a message transformation that helps route the events from the outbox table into event-type-specific Kafka topics.
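As a sketch of the outbox pattern from the service's side (the table and column names here are illustrative, not the exact layout Debezium's outbox event router expects), the key point is that the business row and the outbox row are written in the same database transaction:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class OrderService {

    public void placeOrder(String orderId, String payload) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/orders_db")) {
            conn.setAutoCommit(false);
            // 1. Write to the service's own business table.
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO orders (id, details) VALUES (?, ?)")) {
                ps.setString(1, orderId);
                ps.setString(2, payload);
                ps.executeUpdate();
            }
            // 2. Write the outgoing event to the outbox table in the SAME transaction.
            //    Debezium captures this insert and routes it to a Kafka topic.
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO outbox (aggregate_type, aggregate_id, event_type, payload) "
                            + "VALUES (?, ?, ?, ?)")) {
                ps.setString(1, "order");
                ps.setString(2, orderId);
                ps.setString(3, "OrderPlaced");
                ps.setString(4, payload);
                ps.executeUpdate();
            }
            conn.commit();
        }
    }
}
```

Debezium then streams inserts into the outbox table on to Kafka, so the event is published exactly when the business transaction commits, and never without it.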
Not all services need a database; some just emit data (logs, metrics, sensors, etc.).
So, the answer would be: either.
Plus, I'm not sure which databases can directly export to Kafka, so you'd have to deploy some other service like Debezium, which would be polling those CDC records off the database.
Application developers still have to "worry" about how to deserialize their data, how many partitions are in the topic so they can scale out consumption, how to manage offsets, among other things.
I have a use case where I need to send the data changes in a relational database to a Kafka topic.
I'm able to write a simple JDBC program which executes a set of queries for the changes in a certain time period and writes the data into a Kafka topic using KafkaTemplate (a wrapper provided by the Spring framework).
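For reference, here is a minimal sketch (with made-up table, column, and topic names) of the kind of bespoke program I mean, using Spring's KafkaTemplate and a scheduled polling query; it glosses over deletes, duplicate timestamps, and losing the last-polled marker on restart.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class ChangePoller {

    private final KafkaTemplate<String, String> kafkaTemplate;
    private Timestamp lastPoll = new Timestamp(0);

    public ChangePoller(KafkaTemplate<String, String> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    // Requires @EnableScheduling in the application configuration.
    @Scheduled(fixedDelay = 60_000)
    public void pollChanges() throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/app_db");
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT id, payload, updated_at FROM orders WHERE updated_at > ? ORDER BY updated_at")) {
            ps.setTimestamp(1, lastPoll);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // Publish each changed row to the topic, keyed by its primary key.
                    kafkaTemplate.send("orders-changes", rs.getString("id"), rs.getString("payload"));
                    lastPoll = rs.getTimestamp("updated_at");
                }
            }
        }
    }
}
```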
If I do the same using Kafka Connect, which is to write a source connector, what benefits or overheads (if any) will I get?
The first thing is that you have "... to write a simple JDBC program ..." and take care of the logic of writing to both the database and the Kafka topic.
Kafka Connect does that for you, and your business application has to write to the database only. With Kafka Connect you get more than that, too, like failover handling, parallelism, scaling, ... it's all out of the box for you, while you'd have to take care of those things yourself when, for example, you write to the database but something fails and you're not able to write to the Kafka topic, and so on.
Today you want to ingest from one database into a Kafka topic using a set of queries, and you write some bespoke code to do that.
Tomorrow you want to use a second database, or you want to change the serialisation format of your data in Kafka, or you want to scale out your ingest, or you want high availability. Or you want to add the ability to stream data from Kafka to another target, or to ingest data from other places as well. And you want to manage it all centrally using a standardised configuration pattern expressed just in JSON. Oh, and you want it to be easily maintainable by someone else who doesn't have to read through code but can just use a common API of Apache Kafka (which is what Kafka Connect is).
If you manage to do all of this yourself—you've just reinvented Kafka Connect :)
I talk extensively about this in my Kafka Summit session: "From Zero to Hero with Kafka Connect" which you can find online here
When reading about Kafka and how to get data from Kafka to a queryable database suited for some specific task, there is usually mention of Kafka Connect sinks.
This sounds like the way to go if I needed Kafka to feed search indexing like Elasticsearch or analytics like Hadoop or Spark, where there's a Kafka Connect sink available.
But my question is: what is the best way to handle a store that isn't as popular, say MyImaginaryDB, where the only way I can get to it is through some API, and the data needs to be handled securely and reliably, as well as decently transformed before inserting? Is it recommended to:
Just have the API consume from Kafka and use the MyImaginaryDB driver to write
Figure out how to build a custom Kafka Connect sink (assuming it can handle schemas, authentication/authorization, retries, fault-tolerance, transforms and post-processing needed before landing in MyImaginaryDB)
I have also been reading about KSQL and Kafka Streams and am wondering if that helps with transforming the data before it is sent to the end store.
Option 2, definitely. Just because there isn't an existing sink connector doesn't mean Kafka Connect isn't for you. If you're going to be writing some code anyway, it still makes sense to hook into the Kafka Connect framework. Kafka Connect handles all the common stuff (schemas, serialisation, restarts, offset tracking, scale-out, parallelism, etc.) and leaves you just to implement the bit of getting the data to MyImaginaryDB.
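To give a feel for how small "the bit of getting the data to MyImaginaryDB" can be, here is a hedged sketch of a custom SinkTask; the MyImaginaryDbClient interface is a stand-in for whatever API the (imaginary) database exposes, and a real connector would also need a SinkConnector class and a config definition.

```java
import java.util.Collection;
import java.util.Map;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

public class MyImaginaryDbSinkTask extends SinkTask {

    // Stand-in for whatever client library or HTTP API MyImaginaryDB provides.
    interface MyImaginaryDbClient extends AutoCloseable {
        void insert(String topic, Object key, Object value);
        @Override
        void close();
    }

    private MyImaginaryDbClient client;

    @Override
    public void start(Map<String, String> props) {
        // In a real task you'd open a connection using the connector config,
        // e.g. props.get("myimaginarydb.url") -- the config key is hypothetical.
        client = new MyImaginaryDbClient() {
            @Override
            public void insert(String topic, Object key, Object value) {
                System.out.printf("insert from %s: %s -> %s%n", topic, key, value);
            }

            @Override
            public void close() {
            }
        };
    }

    @Override
    public void put(Collection<SinkRecord> records) {
        // Kafka Connect has already handled deserialisation, offset tracking,
        // rebalancing and scale-out; all that's left is writing the records.
        for (SinkRecord record : records) {
            client.insert(record.topic(), record.key(), record.value());
        }
    }

    @Override
    public void stop() {
        if (client != null) {
            client.close();
        }
    }

    @Override
    public String version() {
        return "0.1";
    }
}
```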
As regards transformations, standard pattern is either:
Use Single Message Transform for lightweight stuff
Use Kafka Streams/KSQL and write back to another topic, which is then routed through Kafka Connect to the target
If you try to build your own app doing (transformation + data sink), then you're munging together responsibilities, and you're reinventing a chunk of a wheel that already exists (integration with an external system in a reliable, scalable way).
You might find this talk useful for background about what Kafka Connect can do: http://rmoff.dev/ksldn19-kafka-connect