Filter Kafka message from topic - apache-kafka

I have a Spring Batch job which writes messages into a Kafka topic. Now I need to test whether specific messages are present in the Kafka topic. How do I achieve this using Spring Boot?
I am anticipating using Kafka Streams; will this be useful?

Searching a Kafka topic requires consuming the whole topic. Ideally, you'd dump the topic into an indexed database instead, and search that.
You can use Kafka Streams, but it doesn't return true/false and stop after a single message.
Start a regular consumer. You can use spring-kafka with @KafkaListener, and then write an if statement to check the message content.
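For example, a minimal sketch with spring-kafka; the topic name, group id, and expected payload below are placeholders for illustration:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

@Component
public class MessageChecker {

    // Listens to the topic the batch job writes to (topic and group id are placeholders)
    @KafkaListener(topics = "batch-output-topic", groupId = "message-checker")
    public void listen(ConsumerRecord<String, String> record) {
        // Replace this check with whatever "specific message" means in your case
        if ("expected-payload".equals(record.value())) {
            System.out.println("Found the message at offset " + record.offset());
        }
    }
}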

Related

Separate Dead letter queue for each topic in kafka connect

I'm implementing streaming using Kafka Connect in one of my projects. I have created an S3 sink connector to consume messages from different topics (using a regex) and then write a file to S3. Consuming messages from the different topics is done using the below property.
"topics.regex": "my-topic-sample\\.(.+)",
I have 3 different topics, as below. Using the above property, the S3 sink connector will consume messages from these 3 topics and write a separate file (for each topic) to S3.
my-topic-sample.test1
my-topic-sample.test2
my-topic-sample.test3
For now, I am ignoring all the invalid messages. However, I want to implement a dead letter queue.
We can achieve that using the below properties
'errors.tolerance'='all',
'errors.deadletterqueue.topic.name' = 'error_topic'
With the above properties, we can move all the invalid messages to a DLQ. But the issue is that we will have only 1 DLQ even though there are 3 different topics from which the S3 sink connector is consuming messages. Invalid messages from all 3 topics will be pushed to the same DLQ.
Is there a way that we can have multiple DLQs and write each message to a different DLQ based on the topic it was consumed from?
Thanks
You can only configure a single DLQ topic per connector; it's not possible to use a different DLQ topic for each source topic.
Today, if you want to split records your connector fails to process into different DLQ topics, you need to start multiple connector instances, each consuming from a different source topic.
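For illustration, a rough sketch of what the per-topic connector instances could look like; the connector and DLQ topic names are assumptions, and the rest of the S3 sink settings are omitted:

Connector 1 (e.g. "s3-sink-test1"):
"topics": "my-topic-sample.test1",
"errors.tolerance": "all",
"errors.deadletterqueue.topic.name": "error_topic_test1"

Connector 2 (e.g. "s3-sink-test2"):
"topics": "my-topic-sample.test2",
"errors.tolerance": "all",
"errors.deadletterqueue.topic.name": "error_topic_test2"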
Apache Kafka is an open source project so if you are up for the challenge, you can propose implementing such a feature if you want to. See https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals for the overall process.

How to store all topics present in the kafka cluster to another topic using KSQL

I'm new to KSQL. I want to store the names of all topics present in a Kafka cluster in another topic using a KSQL query.
SHOW TOPICS; from the KSQL CLI gives me a list of topics. I want to store all this topic information in another topic by creating a stream.
I will be polling this new topic (using a consumer), and whenever a new topic gets created in the cluster, my consumer will receive a message.
I want a KSQL query to accomplish this.
Thanks in advance.
You can't currently achieve what you want using ksqlDB. The SHOW TOPICS command is a system command, not a SQL statement, so its output can't be piped into a stream.
ksqlDB allows you to process the data within the topics in the Kafka cluster. It doesn't (yet) allow you to process the metadata of the Kafka cluster, e.g. the list of topics, or consumer groups, etc.
It may be worth raising a feature request on GitHub: https://github.com/confluentinc/ksql/issues/new/choose

Fetch message with specified key using Kafka Listener vs Kafka Consumer

I use a Spring Boot app to produce and consume/listen to Kafka messages.
I produce a message to the topic and consume/listen for the specific message by comparing the message key, then send the consumed message for further processing.
I am stuck on which approach is better suited to my requirement of fetching a specific message: Kafka Listener or Kafka Consumer?
KafkaListener is a Spring-specific concept that wraps the Kafka Consumer API.
There is no way to look up a message's offset given only its key. You must calculate the partition from the key, then scan that entire partition.
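For example, a rough sketch using the plain consumer API, assuming string keys and the default (murmur2-based) partitioner; the topic name, bootstrap servers, and target key are placeholders:

import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.utils.Utils;

public class KeyScanner {
    public static void main(String[] args) {
        String topic = "my-topic";        // placeholder topic name
        String targetKey = "order-123";   // placeholder key to look for

        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Default partitioner for keyed records: positive murmur2 hash of the
            // serialized key, modulo the partition count
            int numPartitions = consumer.partitionsFor(topic).size();
            byte[] keyBytes = targetKey.getBytes(StandardCharsets.UTF_8);
            int partition = Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;

            TopicPartition tp = new TopicPartition(topic, partition);
            consumer.assign(Collections.singletonList(tp));
            consumer.seekToBeginning(Collections.singletonList(tp));
            long end = consumer.endOffsets(Collections.singletonList(tp)).get(tp);

            // Scan the whole partition and pick out records with the matching key
            while (consumer.position(tp) < end) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    if (targetKey.equals(record.key())) {
                        System.out.println("Found at offset " + record.offset() + ": " + record.value());
                    }
                }
            }
        }
    }
}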

Where to run the processing code in Kafka?

I am trying to setup a data pipeline using Kafka.
Data goes in (via producers), gets processed, enriched, and cleaned, and moves out to different databases or storage (via consumers or Kafka Connect).
But where do you run the actual pipeline processing code to enrich and clean the data? Should it be part of the producers or the consumers? I think I missed something.
In the use case of a data pipeline, a Kafka client can serve as both a consumer and a producer.
For example, if you have raw data being streamed into ClientA, where it is cleaned before being passed to ClientB for enrichment, then ClientA is serving as a consumer (listening to a topic for raw data) and a producer (publishing cleaned data to a topic).
Where you draw those boundaries is a separate question.
It can be part of either the producer or the consumer.
Or you could set up an environment dedicated to something like Kafka Streams processes or a KSQL cluster.
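As a rough sketch of that dedicated-processing option, a minimal Kafka Streams application that reads a raw topic, applies a cleaning/normalizing step, and writes to an output topic; the topic names, application id, and the transformation itself are placeholders:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class CleaningPipeline {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "cleaning-pipeline");   // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> raw = builder.stream("raw-data");              // placeholder input topic
        raw.filter((key, value) -> value != null && !value.trim().isEmpty())   // "clean": drop empty records
           .mapValues(value -> value.trim().toLowerCase())                     // "enrich/normalize": placeholder transform
           .to("clean-data");                                                  // placeholder output topic

        new KafkaStreams(builder.build(), props).start();
    }
}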
It is possible either way. Consider all the options and choose the one that suits you best. Let's assume you have a source of raw data, in CSV or some DB (Oracle), and you want to do your ETL and load the result into some different datastores.
1) Use Kafka Connect to produce your data to Kafka topics.
Have a consumer that consumes off of these topics (it could be Kafka Streams, KSQL, Akka, or Spark).
Produce back to a Kafka topic for further use, or to some datastore; any sink, basically.
This has the benefit of ingesting your data with little or no code, as Kafka Connect source connectors are easy to set up.
2) Write custom producers and do your transformations in the producers before writing to the Kafka topic, or write directly to a sink unless you want to reuse the produced data for further processing.
Read from the Kafka topic, do some further processing, and write it back to a persistent store.
It all boils down to your design choice, the throughput you need from the system, and how complicated your data structure is.

User topic management using Kafka Stream Processor API

I have just started getting my hands dirty with Kafka. I have gone through this. It only covers data/topic management for the Kafka Streams DSL. Can anyone share a link to the same sort of data management for the Processor API of Kafka Streams? I am especially interested in user and internal topic management for the Processor API.
TopologyBuilder builder = new TopologyBuilder();
// add the source processor node that takes Kafka topic "source-topic" as input
builder.addSource("Source", "source-topic")
From where do I populate this source topic with input data before the stream processor starts consuming it?
In short, can we write to the Kafka "Source" topic using Streams, the way a producer writes to a topic? Or are Streams only for parallel consumption of a topic?
I believe we should be able to, as "Kafka's Streams API is built on top of Kafka's producer and consumer clients".
Yes, you have to use a KafkaProducer to generate the inputs for the source topics which feed the KStream.
But the intermediate topics can be populated via
KStream#to
KStream#through
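Coming back to the first point, a minimal sketch of seeding the "source-topic" with a plain KafkaProducer before the topology starts; the bootstrap servers and the record contents are placeholders:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SourceTopicSeeder {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Write a record into the topic the processor topology reads from
            producer.send(new ProducerRecord<>("source-topic", "key-1", "hello processor"));
            producer.flush();
        }
    }
}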
You can use JXL (Java Excel API) to write a producer that writes to a Kafka topic from an Excel file.
Then create a Kafka Streams application to consume that topic and produce to another topic.
You can use context.topic() to get the topic from which the processor is receiving records.
Then use if statements inside the process() function to call the processing logic for that topic.
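As a rough illustration, a processor (using the old Processor API, matching the TopologyBuilder snippet above) that branches on the source topic; the topic names and the per-topic logic are placeholders:

import org.apache.kafka.streams.processor.AbstractProcessor;

public class TopicAwareProcessor extends AbstractProcessor<String, String> {

    @Override
    public void process(String key, String value) {
        // context() is provided by AbstractProcessor; topic() returns the record's source topic
        if ("source-topic-a".equals(context().topic())) {
            // placeholder: logic for records from topic A
            context().forward(key, value.toUpperCase());
        } else if ("source-topic-b".equals(context().topic())) {
            // placeholder: logic for records from topic B
            context().forward(key, value.toLowerCase());
        }
    }
}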