Sink events of a same Kafka topic into multiple paths in GCS - apache-kafka

I am using the Schema Registry with RecordNameStrategy naming policy so I have events with totally different avro schemas into the same Kafka topic.
I am doing that as I want to group logically related events that may have different data structures under a same topic to keep order for these data.
For instance:
user_created event and user_mail_confirmed event might have different schemas but it's important to keep them into a same topic partition to guarantee order for consumers.
I am trying to sink these data, coming from a single topic, into GCS in multiple paths (one path for each schema)
Does someone know if the Confluent Kafka connect GCS Sink connector (or any other connector) provide us with that feature please ?

I haven't used GCS connector, but I suppose that this is not possible with Confluent connectors in general.
You should probably copy your source topic with different data structures to a new set of topics, where data have common data structure. This is possible with ksqlDB (check an example) or Kafka Streams application. Then, you can create connectors for these topics.
Alternatively, you can use RegexRouter transformation with a set of predicates based on the message headers.

Related

Is it a good practice to use the exiting topic for multiple connectors?

I am using the Debezium PostgreSQL connector to get the users table into a Kafka Topic.
I have a JDBC Sink Connector connector that then reads the data from the topic and pushes it into it's own Database.
Now, I need a subset of the data for another Microservice Database. So I am planning to write another JDBC Sink Connector.
The Question: is it a good practice to use the existing users table topic? If yes, then how I can make sure that new JDBC connector get's a snapshot of entire users table
 
If Debezium snapshotted the table and data hasn't been lost in the topic due to retention, then that's what any sink or other consumer will read.
Any unique sink connector name will read unique offsets from its topic. Nothing bad will happen with multiple consumers reading the same topic; this is how Kafka is intended to be used.
You may need to ensure consumer.auto.offset.reset=earliest for connect to read from the start of the topic
To get a subset of fields, you'll need to "replace" them - https://docs.confluent.io/platform/current/connect/transforms/replacefield.html#replacefield

How to monitor 'bad' messages written to kafka topic with no schema

I use Kafka Connect to take data from RabbitMQ into kafka topic. The data comes without schema so in order to associate schema I use ksql stream. On top of the stream I create a new topic that now has a defined schema. At the end I take the data to BQ database. My question is how do I monitor messages that have not passed the stream stage? in this way, do i support schema evolution? and if not, how can use the schema registry functionality?
Thanks
use Kafka Connect to take data ... data comes without schema
I'm not familiar specifically with Rabbitmq connector, but if you use the Confluent converter classes that do use schemas, then it would have one, although maybe only a string or bytes schema
If ksql is consuming the non-schema topic, then there's a consumer group associated with that process. You can monitor its lag to know how many messages have not yet been processed by ksql. If ksql is unable to parse a message because it's "bad", then I assume it's either skipped or the stream stops consuming completely; this is likely configurable
If you've set the output topic format to Avro, for example, then the schema will automatically be registered to the Registry. There will be no evolution until you modify the fields of the stream

Design questions considering Kafka Streams and Spring Cloud Stream

I need to maintain external systems records (KTables) and track any change on those records (KStreams).
The KTables will be requested by KSQL queries, while the KStreams will be handled by an event monitor.
Questions:
I need the KTable working like mirrors from the external systems. Will I have any problem if I decide to use this design regarding data storage? Data loss, expiration?
Using Spring, what is the best approach for the data type? Avro with a schema registry?
The source of everything is a Topic, right? So I will need to send messages to Topics, and my KTable and KStream would translate as needed. Is that right?
The KTable definitions are known, but I may have a group KStreams being created dynamic; what is the best way to achieve this?
I appreciate any comment that could help better design it.
here are my suggestions/opinions on the questions, you might want to do further research into some of the core Kafka Streams related questions.
Not entirely clear what use-case/design you are proposing. The way I understood it, you have an external system (such as a database) and you want to extract that data as a key/value pair which could be translated into a KTable. In Kafka Streams, as you indicated in your question #3, the source of truth is the Kafka topic. Therefore, you need to bring the data from the external system into a Kafka topic first, and then materialize that as a KTable in Kafka Streams. There are established patterns such as the Change Data Capture (CDC) for exporting data from external systems to a Kafka topic in almost real-time. KTable can be materialized into state storage which is by default backed up RocksDB. The same information is also replicated by Kafka changelog topics and therefore applies the guarantees provided by data in a Kafka topic. I hope that someone from the Kafka Streams team can chime in on this specific topic for more information needed.
Spring Cloud Stream provides a binder for Kafka Streams using which you can establish bindings to Kafka topics through various Kafka Streams types such as KStream, KTable and GlobalKTable. See the reference docs for more details. The binder provides several convenient options for data types with Serde inference in the case of common data types. The question about Avro data types is really dependent on your use cases and how you want to manage the schema structure for the data. If centralized schema management is a concern, then avro is a good choice. You can use Confluent's schema registry for Avro with Spring Cloud Stream. Spring provides a schema registry, but for Kafka Streams workloads that require avro, we recommend using the Confluent schema registry as it has more features. Either way, it should work and we provide a number of sample applications demonstrating schema evolution here.
As I mentioned in the answer for #1, yes, the source of truth is Kafka topics and the Spring Cloud Stream binder provides binding mechanisms for connecting to Kafka topics and translate the data as KStream or KTable.
Here again, I am not following the actual use-case. However, Kafka Streams provides many different API methods which allow you to transform the incoming data so that other KStream types can be created dynamically. For instance, you apply a map or flatMap operation on the incoming KStream and thus create a new KStream from it. Not sure, if that is what you meant. If that is the case, then it really becomes a business logic concern. This is certainly possible.
Hope this helps, once again, these are my thoughts around these, and for some of these questions, there is no right or wrong answer. You need to consider the use case and design options carefully and choose the right path that fits your needs.

MQTT topics and kafka topics mapping

I have started to learn about MQTT as I have a use case in telematics in my current organisation. I would like to integrate MQTT broker ( mosquitto ) messages to my kafka.
Since every vehicle is sending the data in its own topic in MQTT broker within a single organisation, I would like to push all this data in kafka. Now I know it is not advisable to create so many topics in kafka ( more than a million ). Also I would like not like to save all the vehicles data in one kafka topic as I would like to later put all this data in S3, differentiated via vehicle id.
How can I achieve this without making so many topics in kafka. One way is the consumer of kafka segregate the events and put in s3 but I believe there will be a lot of small files in S3.
Generally, if you have the same logical entity you would use the same topic.
You can use the MQTT plugin for Kafka Connect to stream the data from MQTT into Kafka, and Kafka Connect's Single Message Transform RegexRouter to modify the topic name to which messages are written, and other SMT to modify the message key. That way you get all the messages in one topic, partitioned based on the vehicle id. That's probably the best way to store it.
From there, you can use the data however you want. When it comes to stream it to S3, you can use Kafka Connect S3 sink and as cricket_007 mentioned partition the data by time if it's just volume you're worried about. If you want to route the messages to different buckets or areas of the same bucket you could use a stream processing (e.g. Kafka Streams / ksqlDB) to pre-process the topic to populate others.
See here for an example of the MQTT connector.

kafka connect hdfs flatten array records

We are experimenting with Kafka Connect HDFS and have source data which contains an array of avro records (referred to as container record below) which are pushed to topics.
The plan was to use kafka-connect hdfs to flatten the array in the container record into individual records and write to hdfs as avro and then commit the actual container record.
However it seems that kafka-connect-hdfs does not support this out of the box.
In attempt to navigate the source we found that there is a tight mapping of the topic names and offset management and attempting to transform 1 sinkRecod to multiple sinkRecords deviates from expected behavior.
so wanted to verify if 1 to n transformation possible ? if so how ?
Kafka Connect supports Single Message Transforms (SMT) which is the area of functionality, if any, in Kafka Connect that would do what you want. However IIRC SMT are only for 1:1 input/output record processing, not 1:n - that is, I don't think (but am willing to be corrected) that you can generate additional output records from a single input record.
You probably want to look at Kafka Streams which would give you the full capabilities of doing this. You'd write a streams app which would subscribe to the source topic, do the necessary array processing, and write it back to a new Kafka topic. The new Kafka topic would then be the source topic for your Kafka Connect HDFS sink.