Currently I have a sink connector which gets data from topic A and sends its to an external service.
Now I have a use case when based on some logic I should send it to topic B instead of the service.
And this logic based on the response of the target service,that will return response based on the data.
So because the data should be sent to the target system every time I couldnt use the stream api.
Is that feasible somehow?
Or should I add a kafka producer manually to my sink? If so is there any drawback?
The first option, is to create a custom Kafka Connect Single Message Transform that will implement the desired logic and possibly use ExtractTopic as well (depending on how your custom smt looks like).
The second option is to build your own consumer. For example:
Step 1: Create one more topic on top of topic A
Create one more topic, say topic_a_to_target_system
Step 2: Implement your custom consumer
Implement a Kafka Consumer that consumes all the messages from topic topic_a.
At this point, you need to instantiate a Kafka Producer and based on the logic, decide whether the topic needs to be forwarded to topic_B or to the target system (topic_a_to_target_system).
Step 3: Start Sink connector on topic_a_to_target_system
Finally start your sink connector so that it sinks the data from topic topic_a_to_target_system to your target system.
Related
Can I have the consumer act as a producer(publisher) as well? I have a user case where a consumer (C1) polls a topic and pulls messages. after processing the message and performing a commit, it needs to notify another process to carry on remaining work. Given this use case is it a valid design for Consumer (C1) to publish a message to a different topic? i.e. C1 is also acting as a producer
Yes. This is a valid use case. We have many production applications does the same, consuming events from a source topic, perform data enrichment/transformation and publish the output into another topic for further processing.
Again, the implementation pattern depends on which tech stack you are using. But if you after Spring Boot application, you can have look at https://medium.com/geekculture/implementing-a-kafka-consumer-and-kafka-producer-with-spring-boot-60aca7ef7551
Totally valid scenario, for example you can have connector source or a producer which simple push raw data to a topic.
The receiver is loosely coupled to your publisher so they cannot communicate each other directly.
Then you need C1 (Mediator) to consume message from the source, transform the data and publish the new data format to a different topic.
If you're using a JVM based client, this is precisely the use case for using Kafka Streams rather than the base Consumer/Producer API.
Kafka Streams applications must consume from an initial topic, then can convert(map), filter, aggregate, split, etc into other topics.
https://kafka.apache.org/documentation/streams/
I have started to learn about MQTT as I have a use case in telematics in my current organisation. I would like to integrate MQTT broker ( mosquitto ) messages to my kafka.
Since every vehicle is sending the data in its own topic in MQTT broker within a single organisation, I would like to push all this data in kafka. Now I know it is not advisable to create so many topics in kafka ( more than a million ). Also I would like not like to save all the vehicles data in one kafka topic as I would like to later put all this data in S3, differentiated via vehicle id.
How can I achieve this without making so many topics in kafka. One way is the consumer of kafka segregate the events and put in s3 but I believe there will be a lot of small files in S3.
Generally, if you have the same logical entity you would use the same topic.
You can use the MQTT plugin for Kafka Connect to stream the data from MQTT into Kafka, and Kafka Connect's Single Message Transform RegexRouter to modify the topic name to which messages are written, and other SMT to modify the message key. That way you get all the messages in one topic, partitioned based on the vehicle id. That's probably the best way to store it.
From there, you can use the data however you want. When it comes to stream it to S3, you can use Kafka Connect S3 sink and as cricket_007 mentioned partition the data by time if it's just volume you're worried about. If you want to route the messages to different buckets or areas of the same bucket you could use a stream processing (e.g. Kafka Streams / ksqlDB) to pre-process the topic to populate others.
See here for an example of the MQTT connector.
I'm implementing a streaming pipeline that resembles the illustration below:
*K-topic1* ---> processor1 ---> *K-topic2* ---> processor2 -->
*K-topic3* ---> processor3 --> *K-topic4*
The K-topic components represent Kafka topics and the processor components code (Python/Java).
For the processor component, the intention is to read/consume data from the topic, perform some processing/ETL on it, and persist the results to the next topic in the chain as well as persistent store such as S3.
I have a question regarding the design approach.
The way I see it, each processor component should encapsulate both consumer and producer functionality.
Would the best approach be to have a Processor module/class that could contain KafkaConsumer and KafkaProducer classes ? To date, most examples I've seen have separate consumer and producer components which are run separately and would entail running double the number of components
as opposed to encapsulating producers & consumers within each Processor object.
Any suggestions/references are welcome.
This question is different from
Designing a component both producer and consumer in Kafka
as that question specifically mentions using Samza which is not the case here.
the intention is to read/consume data from the topic, perform some processing/ETL on it, and persist the results to the next topic in the chain
This is exactly the strength of Kafka Streams and/or KSQL. You could use the Processor API, but from what you describe, I think you'll only need the Streams DSL API
persist the results to the next topic in the chain as well as persistent store such as S3.
From the above topic, you can use a Kafka Connect Sink for getting the topic data into these other external systems. There is no need to write a consumer to do this for you.
I am trying to setup a data pipeline using Kafka.
Data go in (with producers), get processed, enriched and cleaned and move out to different databases or storage (with consumers or Kafka connect).
But where do you run the actual pipeline processing code to enrich and clean the data? Should it be part of the producers or the consumers? I think I missed something.
In the use case of a data pipeline the Kafka clients could serve both as a consumer and producer.
For example, if you have raw data being streamed into ClientA where it is being cleaned before being passed to ClientB for enrichment then ClientA is serving as a consumer (listening to a topic for raw data) and a producer (publishing cleaned data to a topic).
Where you draw those boundaries is a separate question.
It can be part of either producer or consumer.
Or you could setup an environment dedicated to something like Kafka Streams processes or a KSQL cluster
It is possible either ways.Consider all possible options , choose an option which suits you best. Lets assume you have a source, raw data in csv or some DB(Oracle) and you want to do your ETL stuff and load it back to some different datastores
1) Use kafka connect to produce your data to kafka topics.
Have a consumer which would consume off of these topics(could Kstreams, Ksql or Akka, Spark).
Produce back to a kafka topic for further use or some datastore, any sink basically
This has the benefit of ingesting your data with little or no code using kafka connect as it is easy to set up kafka connect source producers.
2) Write custom producers, do your transformations in producers before
writing to kafka topic or directly to a sink unless you want to reuse this produced data
for some further processing.
Read from kafka topic and do some further processing and write it back to persistent store.
It all boils down to your design choice, the thoughput you need from the system, how complicated your data structure is.
I'm using Flink to read and write data from different Kafka topics.
Specifically, I'm using the FlinkKafkaConsumer and FlinkKafkaProducer.
I'd like to know if it is possible to change the Kafka topics I'm reading from and writing to 'on the fly' based on either logic within my program, or the contents of the records themselves.
For example, if a record with a new field is read, I'd like to create a new topic and start diverting records with that field to the new topic.
Thanks.
If you have your topics following a generic naming pattern, for example, "topic-n*", your Flink Kafka consumer can automatically reads from "topic-n1", "topic-n2", ... and so on as they are added to Kafka.
Flink 1.5 (FlinkKafkaConsumer09) added support for dynamic partition discovery & topic discovery based on regex. This means that the Flink-Kafka consumer can pick up new Kafka partitions without needing to restart the job and while maintaining exactly-once guarantees.
Consumer constructor that accepts subscriptionPattern: link.
Thinking more about the requirement,
1st step is - You will start from one topic (for simplicity) and will spawn more topic during runtime based on the data provided and direct respective messages to these topics. It's entirely possible and will not be a complicated code. Use ZkClient API to check if topic-name exists, if does not exist create a model topic with new name and start pushing messages into it through a new producer tied to this new topic. You don't need to restart job to produce messages to a specific topic.
Your initial consumer become producer(for new topics) + consumer(old topic)
2nd step is - You want to consume messages for new topic. One way could be to spawn a new job entirely. You can do this be creating a thread pool initially and supplying arguments to them.
Again be more careful with this, more automation can lead to overload of cluster in case of a looping bug. Think about the possibility of too many topics created after some time if input data is not controlled or is simply dirty. There could be better architectural approaches as mentioned above in comments.