configure and publish data to NifiKafkaConsumer [duplicate] - apache-kafka

I have started a NiFi processor (ConsumeKafka) and connected it to a topic. It is running, but I don't know where I can view the messages.

The ConsumeKafka processor runs and generates a flowfile for each message. Only when you connect the processor to other components, such as another processor or an output port, will you be able to see the data moving through.
For starters you can try this:
1. Connect ConsumeKafka with LogAttribute or any other processor, for that matter.
2. Stop or disable the LogAttribute processor.
3. Now when you start ConsumeKafka, all the messages received from the configured Kafka topic will be queued up in the form of flowfiles.
4. Right-click the relationship where the flowfiles are queued up and click List Queue to access the queue.
5. Click any item in the queue; a context menu will come up. Click the View button and you can see the data.
This whole explanation of "viewing" the Kafka message is just to help you with debugging and getting started with NiFi. Ideally you would be using other NiFi processors to work out your use case.
Example
You receive messages from Kafka and want to write them to MongoDB, so you can have a flow like ConsumeKafka -> PutMongo.
Note:
There are record-based processors like ConsumeKafkaRecord and PutMongoRecord, but they basically do the same thing with more enhancements. Since you're new to this, I have suggested a simple flow. You can find details about the record-based processors here and try them.

You might need to consume messages --from-beginning if those messages have been consumed before (and therefore offsets have been committed).
On the GetKafka processor there is a property, Auto Offset Reset, which should be set to smallest; this is the equivalent of --from-beginning in the Kafka console consumer.
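If you want to double-check what is actually sitting in the topic outside NiFi, a plain Java consumer behaves the same way. A minimal sketch, assuming a local broker and a fresh group id so that no previously committed offsets override auto.offset.reset (the topic name is a placeholder):

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class FromBeginningConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption: local broker
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "debug-viewer");            // fresh group id, no committed offsets
            props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");       // equivalent of --from-beginning
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("my-topic")); // placeholder topic name
                while (true) {
                    for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                        System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                    }
                }
            }
        }
    }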

Related

Kafka - redriving events in an error topic

We've implemented some resilience in our Kafka consumer by having a main topic, a retry topic and an error topic as outlined in this blog.
I'm wondering what patterns teams are using out there to redrive events in the error topic back into the retry topic for reprocessing. Do you use some kind of GUI to help do this redrive? I foresee a need to potentially append all events from the error topic into the retry topic, but also to selectively skip certain events in the error topic if they can't be reprocessed.
Two patterns I've seen:
1. Redeploy the app with a new topic config (via environment variables or other external config).
2. Use a scheduled task within the code that checks the upstream DLQ topic(s) and redrives what it finds (a sketch of this follows below).
If you want to use a GUI, that's fine, but it seems like more work for little gain, as there's no tooling already built around that.
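For the scheduled-task approach, the redrive itself can be a small consume-and-republish loop. A rough sketch, where the topic names ("orders.error", "orders.retry") and the shouldSkip() check are placeholders for whatever selection logic you need:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class ErrorTopicRedriver {
        public static void main(String[] args) {
            Properties consumerProps = new Properties();
            consumerProps.put("bootstrap.servers", "localhost:9092");
            consumerProps.put("group.id", "error-redriver");
            consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
            consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
            consumerProps.put("enable.auto.commit", "false");

            Properties producerProps = new Properties();
            producerProps.put("bootstrap.servers", "localhost:9092");
            producerProps.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
            producerProps.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");

            try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(consumerProps);
                 KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(producerProps)) {
                consumer.subscribe(List.of("orders.error"));
                while (true) {
                    for (ConsumerRecord<byte[], byte[]> rec : consumer.poll(Duration.ofSeconds(1))) {
                        if (shouldSkip(rec)) {
                            continue; // selectively drop events that can never be reprocessed
                        }
                        producer.send(new ProducerRecord<>("orders.retry", rec.key(), rec.value()));
                    }
                    producer.flush();
                    consumer.commitSync(); // mark error events as handled only once they are republished
                }
            }
        }

        // placeholder: inspect headers or payload to decide whether to skip this event
        private static boolean shouldSkip(ConsumerRecord<byte[], byte[]> rec) {
            return false;
        }
    }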

Are Kafka and Kafka Streams the right tools for our case?

I'm new to Kafka and will be grateful for any advice.
We are updating a legacy application and moving it from IBM MQ to something different.
The application currently does the following:
- Reads batch XML messages (up to 5 MB)
- Parses them into something meaningful
- Processes the data, parallelizing this procedure manually for parts of the batch; this involves some external legacy API calls resulting in DB changes
- Sends several kinds of email notifications
- Sends a reply to some other queue
- Input messages are profiled to disk
We are considering using Kafka with Kafka Streams because it lets us:
- Scale processing easily
- Have messages persistently stored out of the box
- Get built-in partitioning, replication, and fault tolerance
- Use Confluent Schema Registry to move to schema-on-write
- Use it for service-to-service communication for other applications as well
But I have some concerns.
We are thinking about splitting those huge messages logically and putting them into Kafka that way, since, as I understand it, Kafka is not a huge fan of big messages. It would also let us parallelize processing on a per-partition basis.
After that, we would use Kafka Streams for the actual processing and, further on, for aggregating batch responses back together using a state store, and also to push some messages to other topics (e.g. for sending emails).
But I wonder if it is a good idea to do the actual processing in Kafka Streams at all, as it involves external API calls.
Also, I'm not sure what the best way is to handle the cases when this external API is down for any reason. It means a temporary failure for the current message and all subsequent ones. Is there any way to stop Kafka Streams processing for some time? I can see that there are Pause and Resume methods on the Consumer API; can they be utilized somehow in Streams?
Is it better to use a regular Kafka consumer here, possibly adding Streams as a next step to merge those batch messages together? That sounds like an overcomplication.
Is Kafka a good tool for these purposes at all?
Overall I think you would be fine using Kafka, and probably Kafka Streams as well. I would recommend using Streams for any logic you need to do, i.e. filtering or mapping; where you need to write out to an external system, use a connector or a standard producer.
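A rough illustration of the kind of filter/map logic that fits nicely in Streams (the topic names and the parse step are placeholders):

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.Produced;

    public class BatchPartProcessor {
        public static void main(String[] args) {
            Properties config = new Properties();
            config.put(StreamsConfig.APPLICATION_ID_CONFIG, "batch-part-processor");
            config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            StreamsBuilder builder = new StreamsBuilder();
            builder.stream("batch-parts", Consumed.with(Serdes.String(), Serdes.String()))
                   .filter((key, xmlPart) -> xmlPart != null && !xmlPart.isEmpty()) // drop empty parts
                   .mapValues(xmlPart -> parse(xmlPart))                            // stand-in for the real XML parsing
                   .to("parsed-parts", Produced.with(Serdes.String(), Serdes.String()));

            KafkaStreams streams = new KafkaStreams(builder.build(), config);
            streams.start();
            Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        }

        // placeholder for turning the raw XML fragment into "something meaningful"
        private static String parse(String xmlPart) {
            return xmlPart.trim();
        }
    }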
While it is ideal to have smaller messages, I have seen Streams users with messages in the GBs.
You can make remote calls, such as sending an email, from a Kafka Streams Processor, but that is not recommended. It would probably be better to write the "send an email" event to an output topic and use a normal consumer to read it and send the messages. This would also take care of your concern about the API being down, as you can always remember the last offset and restart from there, or use the Pause and Resume methods.
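A rough sketch of that consumer side; trySendEmail() stands in for the external API call and the topic name is a placeholder. pause() stops fetching while you keep polling so the consumer stays in the group, and offsets are committed only after a batch succeeds:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class EmailSender {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "email-sender");
            props.put("enable.auto.commit", "false");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("email-events")); // placeholder topic name
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> rec : records) {
                        while (!trySendEmail(rec.value())) {
                            consumer.pause(consumer.assignment()); // stop fetching while the API is down
                            consumer.poll(Duration.ofSeconds(5));  // keep polling so we stay in the group
                        }
                        consumer.resume(consumer.assignment());
                    }
                    consumer.commitSync(); // commit only after the whole batch was sent successfully
                }
            }
        }

        // placeholder for the call to the external email API; returns false if the API is unavailable
        private static boolean trySendEmail(String event) {
            return true;
        }
    }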

Does it make sense to use Apache Kafka for this Scenario?

There are several applications that have to be integrated and that have to exchange issues. One of them will get an issue, do something with it, and later on change the status of this issue. The other applications involved with this issue should then get the new information. This continues until the issue reaches the final status, Closed. The problem is that the issues have to be mapped, because these applications do not all support the same data format.
I'm not sure whether to always send the whole issue or just the new status as an event.
How does Kafka support data transformation?
What if my issue has an attachment (>5 MB)?
Thanks for your advice
Yes, it does make sense.
Kafka can do transformations through both the Kafka Streams API and KSQL, which is a streaming SQL engine built on top of Kafka Streams.
Typically Kafka is used for smaller messages; one pattern to consider for larger content is to store it in an object store (e.g. S3, or similar, depending on your chosen architecture) and put a pointer to it in your Kafka message.
I'm not sure whether to always send the whole issue or just the new status as an event.
You can do this either way. If you send the whole issue and then publish all subsequent updates to the same issue as Kafka messages that contain a common message key (perhaps a unique issue ID number), then you can configure your topic as a compacted topic, and the brokers will automatically delete any older copies of the data to save disk space.
If you choose to only send deltas (changes), then you need to be careful to have a retention period long enough that the initial complete record never expires while the issue is still open and publishing updates. The default retention period is 7 days.
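As a concrete illustration of the compacted-topic option, this is roughly how such a topic could be created programmatically; the topic name, partition and replica counts, and broker address are placeholders:

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;
    import org.apache.kafka.common.config.TopicConfig;

    public class CreateIssueTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                NewTopic issues = new NewTopic("issues", 3, (short) 3)
                        // keep only the latest message per key (issue ID); older copies are cleaned up
                        .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT));
                // if you send deltas instead, you would keep cleanup.policy=delete and raise
                // retention.ms (TopicConfig.RETENTION_MS_CONFIG) well beyond the 7-day default
                admin.createTopics(List.of(issues)).all().get();
            }
        }
    }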
How does Kafka support data transformation?
It can do this in Kafka Connect via Single Message Transforms (SMT), or in Kafka Streams using native Streams code (in Java).
What if my issue has an attachment (>5 MB)?
You can configure Kafka for large messages, but if they are much larger than 5 or 10 MB then it's usually better to follow a claim-check pattern: store them external to Kafka and publish just a reference link back to the externally stored data, so the consumer can retrieve the attachment out of band.
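A minimal sketch of that claim-check idea; the ObjectStore client, its URL handling and the topic name are all assumptions, the point being that the Kafka message carries only a small pointer rather than the attachment itself:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class IssuePublisher {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            String issueId = "ISSUE-42";
            byte[] attachment = new byte[0]; // the large (>5 MB) attachment, loaded from wherever it lives

            // hypothetical object-store client: upload the attachment and get back a URL
            String attachmentUrl = ObjectStore.upload("issue-attachments", issueId, attachment);

            // the Kafka message only carries the status change and a pointer to the attachment
            String value = "{\"issueId\":\"" + issueId + "\",\"status\":\"OPEN\","
                    + "\"attachmentUrl\":\"" + attachmentUrl + "\"}";

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("issues", issueId, value));
            }
        }
    }

    // stand-in for S3 or whichever object store you use
    class ObjectStore {
        static String upload(String bucket, String key, byte[] data) {
            return "https://example-store/" + bucket + "/" + key;
        }
    }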

JBoss Messaging: sending one message at a time

We are using JBoss 5.1.0 and a topic for storing our messages, and our client makes a durable subscription to get those messages.
Everything is working fine, but there is one issue: we receive data from a TCP client, process it and put it in the topic at around 10 messages per second, while our client reads one message at a time. There is a huge gap between those two rates, and after some time the JBoss topic holds many messages and crashes with an out-of-memory error.
Is there any workaround for this?
Basically the producer is producing 10x more messages than the consumer can handle. If this situation is constant (not only during peaks), this will never work.
If you limit the producer to send only one message per second (which is of course possible, e.g. check out RateLimiter), what will you do with the extra messages on the producer side? If they are not queuing up in the topic, they will queue up on the producer side.
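For reference, a rough sketch of that producer-side throttling with Guava's RateLimiter and plain JMS; the JNDI names ("ConnectionFactory", "topic/ourTopic") and the source of the TCP data are assumptions:

    import com.google.common.util.concurrent.RateLimiter;
    import javax.jms.DeliveryMode;
    import javax.jms.Session;
    import javax.jms.Topic;
    import javax.jms.TopicConnection;
    import javax.jms.TopicConnectionFactory;
    import javax.jms.TopicPublisher;
    import javax.jms.TopicSession;
    import javax.naming.InitialContext;

    public class ThrottledPublisher {
        public static void main(String[] args) throws Exception {
            InitialContext ctx = new InitialContext();
            TopicConnectionFactory cf = (TopicConnectionFactory) ctx.lookup("ConnectionFactory");
            Topic topic = (Topic) ctx.lookup("topic/ourTopic");

            TopicConnection conn = cf.createTopicConnection();
            TopicSession session = conn.createTopicSession(false, Session.AUTO_ACKNOWLEDGE);
            TopicPublisher publisher = session.createPublisher(topic);
            publisher.setDeliveryMode(DeliveryMode.PERSISTENT); // durable messages survive a broker restart

            RateLimiter limiter = RateLimiter.create(1.0); // at most one message per second
            for (String payload : readFromTcpClient()) {   // placeholder for the incoming TCP data
                limiter.acquire();                         // blocks until the next permit is available
                publisher.publish(session.createTextMessage(payload));
            }
            conn.close();
        }

        // placeholder: in the real application this is the data arriving from the TCP client
        private static Iterable<String> readFromTcpClient() {
            return java.util.Collections.emptyList();
        }
    }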
You have a few choices:
1. Somehow tune your consumer to process messages faster, so the topic never fills up.
2. Tune the topic to use persistent storage. This is much better: not only will the topic not store everything in memory, but you might also get transactional behaviour (messages are durable).
3. Put a queue in front of the topic holding the messages you want to send to it, and process one message per second. That queue must be persistent and must be able to hold more messages than the topic currently can.