Whether or not to use Kafka Streams and/or KSQL to denormalize data streams from a database into NoSQL

After doing a lot of reading on the web, I am finally reaching out to this forum. My challenge is to denormalize transactional data from a database, sourced via CDC into Kafka, before writing it out into a NoSQL database, in this case Cassandra. What is the best way to join the transactional data with lookups from master tables? The issue I have is that there are maybe 5 to 10 lookup tables per transactional table.
Trying to do this in a proof of concept using KSQL taught me that I need to A) load the lookup tables as KTables, B) repartition the transactional stream, and finally C) perform the join and write into a new topic.
Following this approach with 5 or 10 lookup tables will generate a lot of data being shuffled around the cluster. I know the Streams DSL has the concept of a GlobalKTable, but that only works when the lookup tables are relatively small, and in addition I prefer a higher-level language like KSQL. Is there a better approach?

What you need is for ksqlDB to support non-key joins, so you should up-vote the issue that tracks that feature: https://github.com/confluentinc/ksql/issues/4424
Until then, your approach of repartitioning the transactional stream to match the key of the lookup tables is the only viable solution.
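For reference, here is a rough sketch of what that repartition-and-join plan looks like in the Kafka Streams DSL (which KSQL is built on). Topic names, plain String serdes and the extractCustomerId helper are all placeholders; with 5 to 10 lookup tables you would chain one such re-key/join pair per lookup, unless several lookups happen to share a key.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class DenormalizeTopology {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Lookup table sourced via CDC, keyed by its primary key (hypothetical topic name).
        KTable<String, String> customers =
                builder.table("cdc.customers", Consumed.with(Serdes.String(), Serdes.String()));

        // Transactional stream; its key is the transaction id, not the lookup key.
        KStream<String, String> orders =
                builder.stream("cdc.orders", Consumed.with(Serdes.String(), Serdes.String()));

        // Re-key the stream on the foreign key so it gets repartitioned and is
        // co-partitioned with the lookup KTable before the join.
        KStream<String, String> ordersByCustomer =
                orders.selectKey((txnId, order) -> extractCustomerId(order));

        // Stream-table join, then write the enriched record to a new topic.
        ordersByCustomer
                .join(customers, (order, customer) -> order + "|" + customer)
                .to("orders_enriched", Produced.with(Serdes.String(), Serdes.String()));
    }

    private static String extractCustomerId(String orderPayload) {
        // Placeholder: parse the foreign key out of the order payload.
        return orderPayload;
    }
}
```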

Related

Same Kafka topic for multiple table records

I have several order databases (e.g. OrderUSA, OrderUK, OrderIndia, ...) in Postgres; all the databases have the same schema and tables. I want to merge the tbl_orders tables from these databases into one. I am writing a Debezium connector for each database. Is it possible to use the same topic for all these Debezium connectors? It would be easier for me to have one topic containing the consolidated data, which I could then use for the sink connector. Please advise so that I can move this model to production.
Yes, it is possible to have one topic containing data for all databases.
Whether you should do that, well, it depends. This Reddit post may be helpful to you: https://www.reddit.com/r/apachekafka/comments/q8a7sj/topic_strategies_when_to_split_into_multiple/
Ultimately, it depends on your business requirements and whether it is logical for all order data to be in one topic.
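If you do decide to consolidate, one hedged sketch of the mechanics (connector property names can differ slightly between Debezium versions, and all host, database and topic names here are hypothetical) is to give every per-database connector the same RegexRouter transform, so that each database's tbl_orders change topic is rewritten to one shared topic:

```java
import java.util.HashMap;
import java.util.Map;

public class OrdersConnectorConfig {
    // Builds the config for one per-database Debezium connector (credentials omitted).
    // Each of the OrderUSA/OrderUK/... connectors carries the same RegexRouter transform,
    // so every "<prefix>.public.tbl_orders" change topic is routed to "all_orders".
    static Map<String, String> configFor(String dbName) {
        Map<String, String> cfg = new HashMap<>();
        cfg.put("connector.class", "io.debezium.connector.postgresql.PostgresConnector");
        cfg.put("database.hostname", dbName + ".example.internal");   // placeholder
        cfg.put("database.dbname", dbName);
        cfg.put("topic.prefix", dbName.toLowerCase());
        cfg.put("table.include.list", "public.tbl_orders");
        // Rewrite e.g. "orderusa.public.tbl_orders" to the shared topic name.
        cfg.put("transforms", "route");
        cfg.put("transforms.route.type", "org.apache.kafka.connect.transforms.RegexRouter");
        cfg.put("transforms.route.regex", ".*\\.tbl_orders");
        cfg.put("transforms.route.replacement", "all_orders");
        return cfg;
    }
}
```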

SQL Server Data to Kafka in real time

I would like to stream real-time data from SQL Server to Kafka directly, and I found there is a SQL Server connector provided by Debezium: https://debezium.io/docs/connectors/sqlserver/
The documentation says that it will create one topic for each table. I am trying to understand the architecture, because I have 500 clients, which means I have 500 databases, each of which has 500 tables. Does that mean it will create 250,000 topics, or do I need a separate Kafka cluster for each client, where each cluster has 500 topics based on the number of tables in the database?
Is this the best way to send SQL Server data to Kafka, or should we send an event to Kafka from application code whenever there is an insert/update/delete on a table?
With Debezium you are stuck with a one-table-to-one-topic mapping. However, there are creative ways to get around it.
Based on the description, it looks like you have some sort of product with a SQL Server backend that has 500 tables. This product is used by 500 or more clients, and each client has its own instance of the database.
You can create a connector for one client, read all 500 tables, and publish them to Kafka. At this point you will have 500 Kafka topics. You can route the data from all other database instances to the same 500 topics by creating separate connectors for each client / database instance. I am assuming that since this is the backend database for a product, the table names, schema names, etc. are all the same, and the Debezium connectors will generate the same topic names for the tables. If that is not the case, you can use a topic routing SMT.
You can differentiate the data in Kafka by adding a few metadata fields to the records in each topic. This can easily be done in the connector by adding SMTs. The metadata fields could be client_id, client_name, or something else.
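As a hedged illustration, with hypothetical transform and field names, an InsertField SMT in each client's connector configuration could stamp a static client identifier onto every record value:

```java
import java.util.HashMap;
import java.util.Map;

public class ClientTaggingSmt {
    // SMT fragment for one client's connector configuration (values are placeholders).
    // InsertField adds a static client_id field to every record's value so that data
    // from different client databases can be told apart in the shared topics.
    static Map<String, String> smtConfigFor(String clientId) {
        Map<String, String> cfg = new HashMap<>();
        cfg.put("transforms", "tagClient");
        cfg.put("transforms.tagClient.type",
                "org.apache.kafka.connect.transforms.InsertField$Value");
        cfg.put("transforms.tagClient.static.field", "client_id");
        cfg.put("transforms.tagClient.static.value", clientId);
        return cfg;
    }
}
```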
As for your other question,
Is this the best way to send SQL Server data to Kafka, or should we send an event to Kafka from application code whenever there is an insert/update/delete on a table?
The answer is "it depends!".
If it is a simple transactional application, I would simply write the data to the database and not worry about anything else.
The answer also depends on why you want to deliver data to Kafka. If you are looking to deliver data or business events to Kafka to perform downstream business processing that requires transactional integrity and strict SLAs, writing the data from the application may make sense. However, if you are publishing data to Kafka to make it available to others for analytical or other purposes, the Kafka Connect approach makes sense.
There is a licensed alternative, Qlik Replicate, which is capable of something very similar.

Design questions considering Kafka Streams and Spring Cloud Stream

I need to maintain records from external systems (KTables) and track any changes to those records (KStreams).
The KTables will be requested by KSQL queries, while the KStreams will be handled by an event monitor.
Questions:
I need the KTables to work like mirrors of the external systems. Will I have any problems with this design regarding data storage, such as data loss or expiration?
Using Spring, what is the best approach for the data type? Avro with a schema registry?
The source of everything is a topic, right? So I will need to send messages to topics, and my KTables and KStreams would be derived from them as needed. Is that right?
The KTable definitions are known, but I may have a group of KStreams created dynamically; what is the best way to achieve this?
I appreciate any comment that could help better design it.
Here are my suggestions/opinions on the questions; you might want to do further research into some of the core Kafka Streams-related questions.
It is not entirely clear what use case/design you are proposing. The way I understood it, you have an external system (such as a database) and you want to extract that data as key/value pairs which could be translated into a KTable. In Kafka Streams, as you indicated in your question #3, the source of truth is the Kafka topic. Therefore, you need to bring the data from the external system into a Kafka topic first, and then materialize it as a KTable in Kafka Streams. There are established patterns such as Change Data Capture (CDC) for exporting data from external systems to a Kafka topic in near real time. A KTable can be materialized into a state store, which is by default backed by RocksDB. The same information is also replicated to a Kafka changelog topic and therefore carries the guarantees provided by data in a Kafka topic. I hope that someone from the Kafka Streams team can chime in on this specific topic if more information is needed.
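For illustration only, here is a minimal Kafka Streams sketch (topic name, store name and serde choices are assumptions) that materializes a CDC-fed topic as a RocksDB-backed KTable with a changelog topic behind it:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;

public class MirrorTableTopology {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // "external-system-records" is a placeholder for the CDC-fed topic.
        // Materializing it as a named store gives a RocksDB-backed KTable whose
        // contents are also backed up in a Kafka changelog topic.
        KTable<String, String> mirror = builder.table(
                "external-system-records",
                Consumed.with(Serdes.String(), Serdes.String()),
                Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("external-records-store"));
    }
}
```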
Spring Cloud Stream provides a binder for Kafka Streams with which you can establish bindings to Kafka topics through the various Kafka Streams types such as KStream, KTable and GlobalKTable. See the reference docs for more details. The binder provides several convenient options for data types, with Serde inference for common data types. The question about Avro data types really depends on your use cases and how you want to manage the schema structure for the data. If centralized schema management is a concern, then Avro is a good choice. You can use Confluent's Schema Registry for Avro with Spring Cloud Stream. Spring provides a schema registry, but for Kafka Streams workloads that require Avro, we recommend using the Confluent Schema Registry as it has more features. Either way, it should work, and we provide a number of sample applications demonstrating schema evolution here.
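As a rough sketch of the binder's functional programming model, assuming the spring-cloud-stream-binder-kafka-streams dependency and illustrative binding/topic names (mapped via spring.cloud.stream.bindings.enrich-in-0/enrich-in-1/enrich-out-0 properties):

```java
import java.util.function.BiFunction;

import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class StreamBindings {

    // Functional-style binding: the binder maps the two inputs and the output to
    // Kafka topics via configuration. The KTable mirrors the reference data; the
    // KStream carries the change records that should be enriched against it.
    @Bean
    public BiFunction<KStream<String, String>, KTable<String, String>, KStream<String, String>> enrich() {
        return (changes, reference) ->
                changes.join(reference, (change, ref) -> change + "|" + ref);
    }
}
```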
As I mentioned in the answer to #1, yes, the source of truth is the Kafka topics, and the Spring Cloud Stream binder provides binding mechanisms for connecting to Kafka topics and translating the data into a KStream or KTable.
Here again, I am not following the actual use case. However, Kafka Streams provides many different API methods which allow you to transform the incoming data so that other KStreams can be created dynamically. For instance, you can apply a map or flatMap operation on the incoming KStream and thus create a new KStream from it. Not sure if that is what you meant. If that is the case, then it really becomes a business-logic concern. This is certainly possible.
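A tiny, hedged example of deriving new KStreams from an incoming one (the topic name and transformations are arbitrary placeholders):

```java
import java.util.Arrays;

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;

public class DerivedStreams {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> source = builder.stream("source-events"); // placeholder topic

        // Each operator returns a new KStream derived from the incoming one.
        KStream<String, String> upperCased =
                source.mapValues(value -> value.toUpperCase());
        KStream<String, String> exploded =
                source.flatMapValues(value -> Arrays.asList(value.split(",")));
    }
}
```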
Hope this helps. Once again, these are just my thoughts, and for some of these questions there is no right or wrong answer. You need to consider the use case and design options carefully and choose the path that fits your needs.

How to join multiple Kafka topics?

So I have...
1st topic has general application logs (log4j). It stores things like HTTP API requests/responses, warnings, exceptions, etc. There can be multiple logs associated with one logical business request. (These logs happen within seconds of each other.)
2nd topic contains commands from the above business request which other services take action on. (The commands also happen within seconds of each other, but maybe a couple of minutes after the original request.)
3rd topic contains events generated from actions of those other services. (Most events complete within seconds, but some can take up to 3-5 days to be received)
So a single logical business request can have multiple logs, commands and events associated with it by a UUID which the microservices pass to each other.
So what are some of the technologies/patterns that can be used to read the 3 topics and join them all together into a single JSON document and then dump it into, let's say, Elasticsearch?
Streaming?
You can use Kafka Streams, or KSQL, to achieve this. Which one depends on your preference/experience with Java, and also the specifics of the joins you want to do.
KSQL is the SQL streaming engine for Apache Kafka, and with SQL alone you can declare stream processing applications against Kafka topics. You can filter, enrich, and aggregate topics. Currently only stream-table joins are supported. You can see an example in this article here
The Kafka Streams API is part of Apache Kafka and is a Java library that you can use to do stream processing of data in Apache Kafka. It is actually what KSQL is built on, and it supports greater flexibility of processing, including stream-stream joins.
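To make that concrete, here is a hedged Kafka Streams sketch correlating two of the topics by the shared UUID with a windowed stream-stream join. The topic names, the extractUuid helper and the 5-day window are assumptions based on the timings described in the question; the logs topic could be chained in with a further join, or all three could be aggregated into a KTable instead.

```java
import java.time.Duration;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.StreamJoined;

public class CorrelateByUuid {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Re-key both streams on the business-request UUID so they can be joined.
        KStream<String, String> commands = builder
                .stream("commands", Consumed.with(Serdes.String(), Serdes.String()))
                .selectKey((k, v) -> extractUuid(v));
        KStream<String, String> events = builder
                .stream("events", Consumed.with(Serdes.String(), Serdes.String()))
                .selectKey((k, v) -> extractUuid(v));

        // Stream-stream joins are windowed; the window must be wide enough to cover
        // the 3-5 day lag of the slowest events.
        KStream<String, String> correlated = commands.join(
                events,
                (cmd, evt) -> "{\"command\":" + cmd + ",\"event\":" + evt + "}",
                JoinWindows.of(Duration.ofDays(5)),
                StreamJoined.with(Serdes.String(), Serdes.String(), Serdes.String()));

        correlated.to("correlated-business-requests");
    }

    private static String extractUuid(String payload) {
        return payload; // placeholder for parsing the UUID out of the message
    }
}
```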
You can use KSQL to join the streams.
There are two constructs in KSQL: Table and Stream.
Currently, joins are supported between a Stream and a Table, so you need to identify which construct is a good fit for each of your topics.
You don't need windowing for stream-table joins.
Benefits of using KSQL:
KSQL is easy to set up.
KSQL uses a SQL-like language, which helps you query your data quickly.
Drawbacks:
It is not production-ready yet, but a release is coming up in April 2018.
It is a little buggy right now but will certainly improve over the coming months.
Please have a look.
https://github.com/confluentinc/ksql
Same as the question "Is it possible to use multiple left joins in a Confluent KSQL query?": I tried to join a stream with more than one table; if that is not possible, what is the solution?
It also seems like you cannot have multiple join keys within the same query.

Build a solution for Kafka+Spark for RDBMS data

My current project is on mainframes with DB2 as its database. We have 70 databases with nearly 60 tables in each of them. Our architect proposed a plan to use Kafka with Spark Streaming for processing the data. How good is Kafka at reading RDBMS tables for data? Do we read the data directly from the tables using Kafka, or is there another way to get the data from the RDBMS into Kafka?
If there is any better solution, your suggestions can help a lot.
Do not read directly from the database; it will create additional load. I would suggest two approaches.
Send new data both to the database and to Kafka, or send it to Kafka first and then consume it for processing.
Read data from the database write-ahead log (I know this is possible for MySQL with Maxwell, but I am not sure about DB2) and send it to Kafka for further processing.
You can use Spark Streaming or Kafka Streams depending on your needs.
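For the Spark side, here is a minimal, hedged sketch of reading a CDC-fed Kafka topic with Structured Streaming (it requires the spark-sql-kafka integration package; broker and topic names are placeholders):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class Db2ChangesFromKafka {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("db2-cdc-processing")
                .getOrCreate();

        // The topic is a placeholder for wherever your CDC tool publishes DB2 changes.
        Dataset<Row> changes = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "broker:9092")
                .option("subscribe", "db2.cdc.changes")
                .load()
                .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)");

        // Minimal sink for illustration; real transformations/sinks would go here.
        StreamingQuery query = changes.writeStream()
                .format("console")
                .start();
        query.awaitTermination();
    }
}
```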