How to monitor 'bad' messages written to kafka topic with no schema - apache-kafka

I use Kafka Connect to take data from RabbitMQ into a Kafka topic. The data arrives without a schema, so to associate a schema with it I use a ksql stream. On top of the stream I create a new topic that now has a defined schema, and at the end I take the data into a BigQuery database. My question is: how do I monitor messages that have not passed the stream stage? Does this approach support schema evolution, and if not, how can I use the Schema Registry functionality?
Thanks

use Kafka Connect to take data ... data comes without schema
I'm not familiar with the RabbitMQ connector specifically, but if you use the Confluent converter classes that do use schemas, then the data would have one, although perhaps only a string or bytes schema.
If ksql is consuming the schemaless topic, then there's a consumer group associated with that process. You can monitor its lag to know how many messages have not yet been processed by ksql. If ksql is unable to parse a message because it's "bad", then I assume it's either skipped or the stream stops consuming completely; this is likely configurable.
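A minimal sketch of checking that lag from the command line; the bootstrap server and group id below are placeholders, since ksqlDB persistent queries use generated group ids that you would list first:

    # List the consumer groups, then describe the one belonging to the ksql query.
    kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list
    kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe \
      --group _confluent-ksql-default_query_CSAS_EVENTS_AVRO_1
    # The LAG column shows how many messages per partition have not yet been processed.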
If you've set the output topic format to Avro, for example, then the schema will automatically be registered with the Registry. There will be no evolution until you modify the fields of the stream.
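As an illustration, a hedged sketch of such a stream (all names here are hypothetical); declaring the source columns by hand and re-serializing as Avro in the derived stream is what registers the schema under the output topic's subject:

    -- The source stream declares its columns explicitly because the RabbitMQ topic
    -- carries no schema; the derived stream writes Avro, which registers a schema
    -- for the events_avro topic in Schema Registry.
    CREATE STREAM rabbit_raw (id VARCHAR, payload VARCHAR)
      WITH (KAFKA_TOPIC='rabbit_topic', VALUE_FORMAT='JSON');

    CREATE STREAM events_avro
      WITH (KAFKA_TOPIC='events_avro', VALUE_FORMAT='AVRO') AS
      SELECT * FROM rabbit_raw;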

Related

How does a kafka connect connector know which schema to use?

Let's say I have a bunch of different topics, each with their own JSON schema. In Schema Registry, I indicated which schemas exist within the different topics, without directly referring to which topic a schema applies to. Then, in my sink connector, I only refer to the endpoint (URL) of the Schema Registry. So, to my knowledge, I never indicated which registered schema a Kafka connector (e.g., the JDBC sink) should use in order to deserialize a message from a certain topic.
Asking here as I can't seem to find anything online.
I am trying to decrease my Kafka message size by removing the overhead of specifying the schema in each message and using the Schema Registry instead. However, I cannot seem to understand how this could work.
Your producer serializes the schema ID directly into the bytes of the record. Connect (or consumers using the registry-aware JSON deserializer) uses the schema referenced by each record, not one configured per topic.
https://docs.confluent.io/platform/current/schema-registry/serdes-develop/index.html#wire-format
If you're trying to decrease message size, don't use JSON; use a binary format instead, and enable topic compression such as ZSTD.
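As an illustration, a hedged sketch of the sink side under that wire format (the registry URL is a placeholder); the registry-aware converter reads the magic byte and 4-byte schema ID at the front of every record and fetches exactly that schema, so no topic-to-schema mapping is ever configured on the connector:

    # Value converter for a sink connector; the schema is looked up by the ID
    # embedded in each record, not by the topic name.
    value.converter=io.confluent.connect.json.JsonSchemaConverter
    value.converter.schema.registry.url=http://schema-registry:8081
    # For smaller payloads, a binary format's converter can be used instead:
    # value.converter=io.confluent.connect.avro.AvroConverter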

Is it a good practice to use the existing topic for multiple connectors?

I am using the Debezium PostgreSQL connector to get the users table into a Kafka Topic.
I have a JDBC Sink Connector that then reads the data from the topic and pushes it into its own database.
Now, I need a subset of the data for another Microservice Database. So I am planning to write another JDBC Sink Connector.
The question: is it a good practice to use the existing users table topic? If yes, then how can I make sure that the new JDBC connector gets a snapshot of the entire users table?
 
If Debezium snapshotted the table and data hasn't been lost in the topic due to retention, then that's what any sink or other consumer will read.
Any unique sink connector name will read unique offsets from its topic. Nothing bad will happen with multiple consumers reading the same topic; this is how Kafka is intended to be used.
You may need to ensure consumer.auto.offset.reset=earliest for connect to read from the start of the topic
To keep only a subset of fields, you can use the ReplaceField transform - https://docs.confluent.io/platform/current/connect/transforms/replacefield.html#replacefield
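A rough sketch of what that second sink could look like; the connector name, topic, connection URL, and field list are all hypothetical:

    # A new connector name means a new consumer group, so this sink re-reads the
    # users topic from the beginning and keeps only the fields it needs.
    name=users-subset-jdbc-sink
    connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
    topics=dbserver1.public.users
    connection.url=jdbc:postgresql://micro-db:5432/app
    # Depending on the worker version, this override may require
    # connector.client.config.override.policy=All on the worker.
    consumer.override.auto.offset.reset=earliest
    # ReplaceField keeps a subset of fields (the property is named "whitelist"
    # on older Connect versions).
    transforms=subset
    transforms.subset.type=org.apache.kafka.connect.transforms.ReplaceField$Value
    transforms.subset.include=id,email,created_at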

Kafka Connect: a sink connector for writing multiple tables from one topic

I'd like to use Kafka Connect to detect changes on a Postgres DB via CDC for a group of tables, and push them as messages into one single topic, with the key being the logical key of the main table.
This will allow the consumer to consume the data changes in the right order and apply them to a destination DB.
Are there Source and Sink connectors allowing me to achieve this goal?
I'm using Debezium CDC Source Connector for Postgres... which I can configure to route all the messages for all the tables into one single topic.
But then I'm not able to find a Sink connector capable of consuming the messages and writing to the right table depending on the schema of the message.
There's no single property on a sink connector that does this, if that's what you're looking for.
You can use Single Message Transforms to extract parts of the record and set the outgoing topic name, so a single sink connector can write to multiple tables from one topic.
Example: https://docs.confluent.io/platform/current/connect/transforms/extracttopic.html
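A hedged sketch of how that could look on a single JDBC sink; the field name is hypothetical and assumes your routing puts the source table name into each record:

    # ExtractTopic overwrites the record's topic with a value taken from the message,
    # and the sink then maps that rewritten topic to a table. The transform ships with
    # Confluent's connect-transforms package, not plain Apache Kafka.
    transforms=routeToTable
    transforms.routeToTable.type=io.confluent.connect.transforms.ExtractTopic$Value
    transforms.routeToTable.field=table_name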

Sink events of a same Kafka topic into multiple paths in GCS

I am using the Schema Registry with the RecordNameStrategy naming policy, so I have events with totally different Avro schemas in the same Kafka topic.
I am doing that because I want to group logically related events that may have different data structures under the same topic, to keep ordering for these data.
For instance:
user_created and user_mail_confirmed events might have different schemas, but it's important to keep them in the same topic partition to guarantee ordering for consumers.
I am trying to sink these data, coming from a single topic, into GCS under multiple paths (one path per schema).
Does someone know if the Confluent Kafka Connect GCS Sink connector (or any other connector) provides this feature?
I haven't used the GCS connector, but I suspect this is not possible with Confluent connectors in general.
You should probably copy your source topic with its different data structures into a new set of topics, where each topic has a common data structure. This is possible with ksqlDB (check an example) or a Kafka Streams application. Then you can create connectors for these topics.
Alternatively, you can use RegexRouter transformation with a set of predicates based on the message headers.
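A rough sketch of that second approach for one event type; the header name is an assumption (the producer would have to set it), and the GCS sink derives its output path from the topic name, so rewriting the topic per event type gives one path per schema:

    # Rewrite the topic to "user_created" only for records carrying that header;
    # repeat the transform/predicate pair for each event type.
    transforms=routeUserCreated
    transforms.routeUserCreated.type=org.apache.kafka.connect.transforms.RegexRouter
    transforms.routeUserCreated.regex=.*
    transforms.routeUserCreated.replacement=user_created
    transforms.routeUserCreated.predicate=isUserCreated
    predicates=isUserCreated
    predicates.isUserCreated.type=org.apache.kafka.connect.transforms.predicates.HasHeaderKey
    predicates.isUserCreated.name=user_created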

What should I use: Kafka Stream or Kafka consumer api or Kafka connect

I would like to know what would be best for me: Kafka Streams, the Kafka consumer API, or Kafka Connect?
I want to read data from a topic, then do some processing and write it to a database. I have written consumers, but I feel I could write a Kafka Streams application instead and use its stateful processors to perform any changes and write to the database, which would eliminate my consumer code and leave only the DB code to write.
Databases I want to insert my records are:
HDFS - (insert raw JSON)
MSSQL - (processed json)
Another option is Kafka Connect, but I have found there is no JSON support as of now for the HDFS sink and JDBC sink connectors (I don't want to write Avro), and creating a schema is also a pain for complex nested messages.
Or should I write a custom Kafka Connect connector to do this?
So I need your opinion on whether I should write a Kafka consumer, a Kafka Streams application, or use Kafka Connect.
And which will be better in terms of performance with less overhead?
You can use a combination of them all
I have tried the HDFS sink for JSON but was not able to use org.apache.kafka.connect.json.JsonConverter
Not clear why not, but I would assume you forgot to set schemas.enable=false.
When I set org.apache.kafka.connect.storage.StringConverter it works, but it writes the JSON object in string-escaped format. For example, {"name":"hello"} is written into HDFS as "{\"name\":\"hello\"}"
Yes, it will string-escape the JSON
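For reference, a minimal sketch of the schemaless-JSON setup being described, assuming the sink can accept records without a Connect schema:

    # Pass the JSON through as structured data without requiring an embedded schema.
    value.converter=org.apache.kafka.connect.json.JsonConverter
    value.converter.schemas.enable=false
    # With StringConverter instead, the payload is treated as one opaque string,
    # which is why it ends up escaped inside quotes in HDFS.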
The processing I want to do is basic validation and a few field-value transformations
Kafka Streams or the Consumer API is capable of validation. Connect is capable of Single Message Transforms (SMTs)
In some use cases, you need to "duplicate" data onto Kafka: process your "raw" topic by reading it with a consumer, then produce it back into a "cleaned" topic, from which you can use Kafka Connect to write to a database or filesystem.
Welcome to Stack Overflow! Please take the tour: https://stackoverflow.com/tour
Please make posts with a precise question rather than asking for opinions - this keeps the site clearer, and opinions are not answers (they are subject to each person's preferences). Asking "How to use Kafka Connect with JSON" or similar would fit this site.
Also, please show some research.
Less overhead would be the Kafka consumer: Kafka Streams and Kafka Connect use the Kafka consumer underneath, so you will always be able to achieve less overhead, but you will also lose all their benefits (fault tolerance, ease of use, support, etc.)
First, it depends on what your processing is. Aggregation? Counting? Validation? Then, you can use Kafka Streams to do the processing and write the result to a new topic, in the format you want.
Then, you can use Kafka Connect to send the data to your database. You are not forced to use Avro; you can use other formats for the key/value, see:
Kafka Connect HDFS Sink for JSON format using JsonConverter
Kafka Connect not outputting JSON