JDBC Confluent Kafka Connector and Topic per schema - apache-kafka

We recently started using the Confluent Kafka JDBC connector to import RDBMS data.
With the default configuration settings, it seems that one topic is created for every table in the schema.
I would like to know if there is any way to:
Create a topic per schema rather than per table. And if a topic per schema is enabled, can schema evolution (with Schema Registry) still be supported on a per-table basis?
If a topic per schema is not possible, are there any guidelines on how to manage hundreds or thousands of topics, considering that there will be a one-to-one mapping between the number of tables and the number of topics?
Thanks in advance,

Create a topic per schema rather than per table.
No - it's either n tables -> n topics, or 1 query -> 1 topic.
any guidelines on how to manage hundreds or thousands of topics?
Adopt a standard naming pattern for them, and apply topic-specific configuration as required.
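For illustration, here is a minimal sketch of a JDBC source connector configuration that bakes a naming pattern into the table-derived topics via topic.prefix (property names follow the Confluent JDBC source connector; the connection details, table list, and prefix are placeholders):

```json
{
  "name": "jdbc-orders-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:postgresql://db-host:5432/orders",
    "connection.user": "connect",
    "connection.password": "********",
    "mode": "incrementing",
    "incrementing.column.name": "id",
    "table.whitelist": "tbl_orders,tbl_order_items",
    "topic.prefix": "orders-db."
  }
}
```

With that prefix the topics come out as orders-db.tbl_orders, orders-db.tbl_order_items, and so on, which makes it easy to target them later with regex-based tooling or per-topic settings (retention, cleanup policy, etc.).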

Related

Same Kafka topic for multiple table records

I have several orders databases (e.g. OrderUSA, OrderUK, OrderIndia, ...) in Postgres; all the databases have the same schema and tables. I want to merge the tbl_orders tables from these databases into one. I am writing Debezium connectors for each database. Is it possible to use the same topic for all these Debezium connectors? It would be easy for me to have one topic with the consolidated data, which I could then use for the sink connector. Please advise so that I can move this model to production.
Yes, it is possible to have one topic containing data for all databases.
Whether you should do that, well it depends. This reddit post may be helpful to you: https://www.reddit.com/r/apachekafka/comments/q8a7sj/topic_strategies_when_to_split_into_multiple/
Ultimately, it depends on your business requirements and whether it is logical for all order data to be in one topic.
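If you do decide to consolidate, one way to route every connector's tbl_orders changes into a single topic is the stock RegexRouter transform that ships with Kafka Connect. A minimal sketch to add to each connector's configuration, assuming the connectors emit topics named like <server>.public.tbl_orders (the regex and target topic name are placeholders):

```json
{
  "transforms": "mergeOrders",
  "transforms.mergeOrders.type": "org.apache.kafka.connect.transforms.RegexRouter",
  "transforms.mergeOrders.regex": ".*\\.public\\.tbl_orders",
  "transforms.mergeOrders.replacement": "orders.all"
}
```

All connectors then produce to orders.all, which the sink connector can consume directly; you may still want a keying or partitioning strategy that preserves per-database ordering (see the next question).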

Set Kafka message key to source database name in Debezium PostgreSQL

We are trying to collect changes from a number of Postgresql databases using Debezium.
The idea is to create a single topic with a number of partitions equal to the number of databases - each database gets its own partition, because order of events matters.
We managed to reroute events to a single topic using topic routing, but to be able to partition events by database I need to set the message key properly.
Question: Is there a way to set the Kafka message key to be equal to the source database name?
My thoughts:
Maybe there is a way to set the message key globally per connector configuration?
The database name can be found in the message, but it's a nested property, payload.source.name, and I didn't find a way to extract a value from a nested property.
Any thoughts?
Thank you in advance!
You'd need to write or find a Connect transform (SMT) that can extract nested fields and set the message key, or, if you don't mind duplicating data within Kafka topics, you can use Kafka Streams / ksqlDB, etc. to do the same (see the sketch below).
Overall, I don't think one topic + one partition per database is a good design for scalability of consumers. Sure, it'll keep order, but it's not much overhead to simply create one topic per database with only one partition. Then make consumers read all topics using a regex pattern rather than needing to assign to specific/all partitions in one topic.
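To illustrate the Kafka Streams option, here is a minimal re-keying sketch. It assumes the Debezium events arrive as plain JSON strings on a hypothetical merged topic and that the database name sits at payload.source.name as described in the question; the topic names and application id are placeholders:

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

import java.util.Properties;

public class RekeyByDatabase {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "rekey-by-database"); // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> events = builder.stream(
                "orders.all",                                  // hypothetical merged input topic
                Consumed.with(Serdes.String(), Serdes.String()));

        // Re-key every record by the nested source database name; records with the
        // same key are then hashed to the same partition of the output topic.
        events.selectKey((oldKey, json) -> {
                    try {
                        return MAPPER.readTree(json).at("/payload/source/name").asText();
                    } catch (Exception e) {
                        return "unknown"; // real code would route parse failures elsewhere
                    }
                })
              .to("orders.keyed-by-database", Produced.with(Serdes.String(), Serdes.String()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```

Note that this duplicates the data into a second topic, which is exactly the trade-off mentioned above.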

Flink Table and Hive Catalog storage

I have a Kafka topic and a Hive Metastore. I want to join the incoming events from the Kafka topic with records from the metastore. I saw that Flink offers the possibility of using a catalog to query the Hive Metastore.
So I see two ways to handle this:
using the DataStream API to consume the Kafka topic and query the Hive catalog one way or another in a processFunction or something similar
using the Table API, where I would create a table from the Kafka topic and join it with the Hive catalog
My biggest concerns are storage related.
In both cases, what is stored in memory and what is not? Does the Hive catalog store anything on the Flink cluster's side?
In the second case, how is the table handled? Does Flink create a copy?
Which solution seems the best? (Maybe both or neither are good choices.)
Different methods are suitable for different scenarios; it sometimes depends on whether your Hive table is a static table or a dynamic table.
If your Hive table is only a dimension table, you can try the approach described in this chapter:
joins-in-continuous-queries
It will automatically join against the latest partition of the Hive table, and it is suitable for scenarios where the dimension data is updated slowly.
But note that this feature is not supported by the legacy planner.
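For a concrete picture of the Table API route, here is a minimal sketch in Java. It registers the Hive Metastore as a catalog, declares a Kafka-backed table with a processing-time attribute, and runs a temporal join against a Hive dimension table. The catalog name, Hive conf directory, topic, and table/column names are placeholders, and the Hive-side options that tune how Flink caches and refreshes the dimension data are omitted; broadly, the catalog itself holds only metadata, and the dimension rows Flink reads from Hive storage are cached on the Flink side for the join rather than copied into the catalog.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.catalog.hive.HiveCatalog;

public class KafkaHiveDimensionJoin {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // The Hive Metastore becomes a Flink catalog: Flink reads table metadata and
        // file locations from it, while the data itself stays in Hive's storage.
        tEnv.registerCatalog("hive", new HiveCatalog("hive", "default", "/etc/hive/conf"));

        // Kafka-backed table with a processing-time attribute for the temporal join.
        tEnv.executeSql(
            "CREATE TEMPORARY TABLE orders (" +
            "  order_id STRING," +
            "  customer_id STRING," +
            "  proc_time AS PROCTIME()" +
            ") WITH (" +
            "  'connector' = 'kafka'," +
            "  'topic' = 'orders'," +
            "  'properties.bootstrap.servers' = 'localhost:9092'," +
            "  'scan.startup.mode' = 'earliest-offset'," +
            "  'format' = 'json'" +
            ")");

        // Each incoming Kafka event is enriched with the Hive dimension row as of its
        // processing time (with the Hive table configured to track its latest partition).
        tEnv.executeSql(
            "SELECT o.order_id, c.customer_name " +
            "FROM orders AS o " +
            "JOIN hive.`default`.customers FOR SYSTEM_TIME AS OF o.proc_time AS c " +
            "ON o.customer_id = c.customer_id").print();
    }
}
```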

One or more schemas per topic when using Schema Registry with Kafka, and Avro...?

There is something I'm trying to understand about how Avro-serialized messages are treated by Kafka and Schema Registry. From this post I've understood that the schema ID is stored in a predictable place in each message, so it seems that we can have messages of various schemas in the same topic and still be able to find the right schema and deserialize them successfully based on just that. On the other hand, I see many people using the expression "a schema attached to a topic", which implies one schema per topic.
So which is right? Can I take advantage of the Schema Registry (e.g. with ksqlDB) and have messages of various types (or schemas) in the same topic?
Typically you have 1:1 topic/schema relationship, but it is possible (and valid) to have multiple schemas per topic in some situations. For more information, see https://www.confluent.io/blog/put-several-event-types-kafka-topic/
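Concretely, which subject a schema is registered under is controlled by the serializer's subject name strategy: the default TopicNameStrategy gives the usual <topic>-value 1:1 mapping, while TopicRecordNameStrategy lets several record types share one topic, each evolving against its own subject. A minimal producer-side sketch (broker and Schema Registry addresses are placeholders):

```java
import io.confluent.kafka.serializers.KafkaAvroSerializer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class MultiSchemaTopicConfig {
    public static Properties producerProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class.getName());
        props.put("schema.registry.url", "http://localhost:8081");
        // Register value schemas under "<topic>-<record name>" instead of "<topic>-value",
        // so different Avro record types can share one topic and each evolves
        // independently against its own subject.
        props.put("value.subject.name.strategy",
                  "io.confluent.kafka.serializers.subject.TopicRecordNameStrategy");
        return props;
    }
}
```

Every message still carries its own schema ID, so consumers resolve each record against the schema it was actually written with.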

Kafka - Can you create schema before topic exist and what is the relation?

Is there any order that must be followed - e.g. should one create the topic first and then the schema in Schema Registry, or vice versa?
Can two topics use the same schema from Schema Registry?
Does every topic need to have a key and a value (and therefore need two schemas to exist for each topic)?
What is the relation and possible combinations?
Thanks.
is there any order that must be followed
Nope. If you have auto topic creation enabled, you could even start producing Avro immediately to a non-existent topic. The Confluent serializers automatically register the schema, and the broker will create the topic with the default number of partitions and replicas.
Can two topics use the same schema
Yes, the Avro schema ID used by two distinct topics can be the same. For example, an Avro string key shared over more than one topic will cause two subjects to be entered into the registry; however, only one schema ID will back them.
Does every topic need to have a key and a value?
Yes. That's part of the Kafka record protocol. The key can be null, however. If you're not using the Avro serializer for the key or the value, no registry entry will be made for that part. You're not required to use Avro for both the key and the value if only one of them needs it.
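To make the first and last points concrete, here is a minimal sketch that produces an Avro value with a null key to a topic that may not exist yet; the addresses, topic name, and record schema are placeholders, and the behaviour assumes the broker allows auto topic creation and the serializer's default auto.register.schemas=true:

```java
import io.confluent.kafka.serializers.KafkaAvroSerializer;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class AvroFirstProduce {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", KafkaAvroSerializer.class.getName());
        props.put("schema.registry.url", "http://localhost:8081");

        // Placeholder record schema for the example.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Greeting\"," +
            "\"fields\":[{\"name\":\"text\",\"type\":\"string\"}]}");
        GenericRecord value = new GenericData.Record(schema);
        value.put("text", "hello");

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            // The topic "greetings" need not exist yet (given auto topic creation on the
            // broker), the value schema is registered under the "greetings-value" subject
            // by the serializer, and the key is simply null.
            producer.send(new ProducerRecord<>("greetings", null, value));
        }
    }
}
```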