I have started a NiFi processor (ConsumeKafka) and connected it to a topic. It is running, but I don't know where I can view the messages.
The ConsumeKafka processor runs and generates a flowfile for each message. You can only see the data moving through once you connect the processor to another component, such as another processor or an output port.
For starters you can try this:
Connect ConsumeKafka to LogAttribute, or to any other processor for that matter.
Stop or disable the LogAttribute processor.
Now when you start ConsumeKafka, all messages received from the configured Kafka topic will be queued up as flowfiles.
Right-click the connection (relationship) where the flowfiles are queued up and click List Queue to access the queue.
Click any item in the queue and a context menu will come up. Click the View button to see the data.
This whole explanation of "viewing" the Kafka messages is just to help you debug and get started with NiFi. Ideally you would use other NiFi processors to work out your use case.
Example
You receive messages from Kafka and want to write them to MongoDB, so you can have a flow such as ConsumeKafka connected to PutMongo.
Note:
There are record-based processors like ConsumeKafkaRecord and PutMongoRecord, but they basically do the same thing with more enhancements. Since you're new to this, I have suggested a simple flow. You can find details about the record-based processors here and try them.
You might need to consume messages --from-beginning if those messages have been consumed before (and therefore offsets have been committed).
On the GetKafka processor, there is a property, Auto Offset Reset, which should be set to smallest; this is the equivalent of --from-beginning in the Kafka console consumer.
Messages are disappearing from a topic when using the Confluent client. The only ones I can see (while not reloading the page) are messages which I create using the "Produce" option on the same page. The Kafka configuration is OK (I think), but I still don't understand what is wrong.
Looks like you are producing and consuming messages through a web browser.
Consumers typically subscribe to a topic and commit the offsets which have been consumed. The subsequent polls do not return the older messages (unless you do a seek operation) but only the newly produced messages.
The term disappearing may be applicable in two contexts:
As said above, the consumer has already consumed that message and doesn't consume it again (because it has already polled it).
Your topic retention policy could be deleting older messages. You can check this by using built-in tools like kafka-console-consumer or kafka-avro-console-consumer with the --from-beginning flag. If the messages are there, the issue is with your consumer.
If you are calling consumer.poll() on every reload, then you will only get the messages produced after the previous call to poll (i.e. after the last reload). If you want all messages present in the topic, from the beginning or from some point in time, you need to seek to the beginning or to a given timestamp or offset. See seek in KafkaConsumer.
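The poll-versus-seek behaviour described above can be sketched with a toy in-memory model. This is not the Kafka client API: ToyTopic and ToyConsumer are hypothetical names, and the model has a single partition with no committed offsets or retention, only a read position.

```python
class ToyTopic:
    """In-memory stand-in for a single topic partition (not a real Kafka client)."""

    def __init__(self):
        self.log = []

    def produce(self, message):
        self.log.append(message)


class ToyConsumer:
    """Tracks a read position, loosely mimicking how a consumer advances its offset."""

    def __init__(self, topic):
        self.topic = topic
        self.position = 0  # offset of the next message to read

    def poll(self):
        # Return only messages at or after the current position, then advance.
        records = self.topic.log[self.position:]
        self.position = len(self.topic.log)
        return records

    def seek(self, offset):
        # Move the read position so that older messages can be read again.
        self.position = offset


topic = ToyTopic()
consumer = ToyConsumer(topic)

topic.produce("a")
topic.produce("b")
first = consumer.poll()    # everything produced so far

topic.produce("c")
second = consumer.poll()   # only what arrived since the previous poll

consumer.seek(0)           # rewind to the beginning
replay = consumer.poll()   # the full log again

print(first, second, replay)
```

The second poll looks as if "a" and "b" have disappeared, although they are still in the log; only the consumer's position has moved past them.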
We've implemented some resilience in our Kafka consumer by having a main topic, a retry topic and an error topic, as outlined in this blog.
I'm wondering what patterns teams are using out there to redrive events in the error topic back into the retry topic for reprocessing. Do you use some kind of GUI to help do this redrive? I foresee a need to potentially append all events from the error topic into the retry topic, but also to selectively skip certain events in the error topic if they can't be reprocessed.
Two patterns I've seen:
Redeploy the app with a new topic config (via environment variables or other external config).
Or use a scheduled task within the code that checks the upstream DLQ topic(s).
If you want to use a GUI, that's fine, but it seems like more work for little gain, as there's no tooling already built around that.
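As a rough illustration of the selective redrive the question asks about, here is a minimal sketch in plain Python. It does not touch Kafka: events are plain dictionaries, and the function name redrive, the "reason" field and the is_retryable predicate are all assumptions made for the example.

```python
def redrive(error_events, is_retryable):
    """Split events read from the error topic into a retry batch and a skipped list."""
    retry_batch, skipped = [], []
    for event in error_events:
        if is_retryable(event):
            retry_batch.append(event)
        else:
            skipped.append(event)
    return retry_batch, skipped


# Hypothetical events as they might be read from the error topic.
error_events = [
    {"id": 1, "reason": "timeout"},
    {"id": 2, "reason": "malformed payload"},
    {"id": 3, "reason": "timeout"},
]

# Assumption for this example: only timeouts are worth retrying.
retry_batch, skipped = redrive(error_events, lambda e: e["reason"] == "timeout")

print([e["id"] for e in retry_batch])
print([e["id"] for e in skipped])
```

In a real setup, the scheduled task (or the redeployed app) would produce retry_batch to the retry topic and leave or archive the skipped events.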
I'm new to Kafka and will be grateful for any advice
We are updating a legacy application together with moving it from IBM MQ to something different.
Application currently does the following:
Reads batch XML messages (up to 5 MB)
Parses it to something meaningful
Processes data parallelizing this procedure somehow manually for parts of the batch. Involves some external legacy API calls resulting in DB changes
Sends several kinds of email notifications
Sends some reply to some other queue
Input messages are profiled to disk
We are considering using Kafka with Kafka Streams as it is nice to:
Scale processing easily
Have messages persistently stored out of the box
Built-in partitioning, replication, and fault-tolerance
Confluent Schema Registry to let us move to schema-on-write
Can be used for service-to-service communication for other applications as well
But I have some concerns.
We are thinking about splitting those huge messages logically and putting them into Kafka that way, since, as I understand it, Kafka is not a huge fan of big messages. It will also let us parallelize processing on a per-partition basis.
After that use Kafka Streams for actual processing and further on for aggregating some batch responses back using state store. Also to push some messages to some other topics (e.g. for sending emails)
But I wonder if it is a good idea to do actual processing in Kafka Streams at all, as it involves some external API calls?
Also I'm not sure what is the best way to handle the cases when this external API is down for any reason. It means temporary failure for current and all the subsequent messages. Is there any way to stop Kafka Stream processing for some time? I can see that there are Pause and Resume methods on the Consumer API, can they be utilized somehow in Streams?
Is it better to use a regular Kafka consumer here, possibly adding Streams as a next step to merge those batch messages together? Sounds like an overcomplication
Is Kafka a good tool for these purposes at all?
Overall I think you would be fine using Kafka and probably Kafka Streams as well. I would recommend using Streams for any logic you need to do, e.g. filtering or mapping. For writing to other systems, you would want to use a connector or a standard producer.
While it is ideal to have smaller messages, I have seen Streams users with messages in the GBs.
You can make remote calls, such as sending an email, from a Kafka Streams Processor, but that is not recommended. It would probably be better to write the "send an email" event to an output topic and use a normal consumer to read it and send the messages. This would also take care of your concern about the API being down, as you can always remember the last offset and restart from there, or use the pause and resume methods.
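The "remember the last offset and restart from there" idea can be sketched without any Kafka dependency. In this toy model the records are a Python list, the offset is a list index, and process_from and make_flaky_api are hypothetical helpers; a real consumer would store its position through Kafka's offset commits instead.

```python
def process_from(records, start_offset, call_api):
    """Process records from start_offset; return the offset to resume from."""
    offset = start_offset
    while offset < len(records):
        try:
            call_api(records[offset])
        except ConnectionError:
            # The external API is down: stop here without advancing,
            # so the failed record is retried on the next run.
            break
        offset += 1
    return offset


def make_flaky_api(fail_on, sent):
    """Build a fake API that fails once on the given record, then recovers."""
    state = {"failed": False}

    def call_api(record):
        if record == fail_on and not state["failed"]:
            state["failed"] = True
            raise ConnectionError("external API is down")
        sent.append(record)

    return call_api


records = ["e1", "e2", "e3", "e4"]
sent = []
api = make_flaky_api("e3", sent)

resume_at = process_from(records, 0, api)             # stops at the failing record
finished_at = process_from(records, resume_at, api)   # later run, API is back

print(resume_at, finished_at, sent)
```

Because the offset only advances after a successful call, nothing is lost while the API is down; the trade-off is that every later record waits behind the failing one.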
We are using JBoss 5.1.0 with a topic for storing our messages. Our client makes a durable subscription to get those messages.
Everything is working fine, but there is one issue: we receive data from a TCP client, process it and put it on the topic. It sends around 10 messages per second, while our client reads one message at a time. There is a huge gap between the two, and after some time the JBoss topic holds so many messages that it crashes with an out-of-memory error.
Is there any workaround for this?
Basically the producer is producing 10x more messages than the consumer can handle. If this is the steady state (not only during peaks), this will never work.
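To put numbers on that mismatch, here is a small back-of-the-envelope calculation. It assumes constant rates and an initially empty topic, and it takes the consumer rate to be 1 message per second, which is an assumption read from the "10x" above rather than a measured figure.

```python
def backlog_after(seconds, produce_rate, consume_rate):
    """Messages left unconsumed after `seconds`, starting from an empty topic."""
    growth_per_second = produce_rate - consume_rate
    return max(0, growth_per_second * seconds)


one_minute = backlog_after(60, produce_rate=10, consume_rate=1)
one_hour = backlog_after(3600, produce_rate=10, consume_rate=1)

print(one_minute, one_hour)
```

With these rates the backlog grows by 9 messages every second, so any in-memory limit is reached eventually, however large it is.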
If you limit the producer to send only one message per second (which is of course possible, e.g. check out RateLimiter), what will you do with extra messages on the producer side? If they are not queueing up in the topic, they will queue up on the producer side.
You have a few choices:
Somehow tune your consumer to process messages faster, so the topic never fills up.
Tune the topic to use persistent storage. This is much better: not only will the topic not store everything in memory, but you might also get transactional behaviour (messages are durable).
Put the messages that you want to send to the topic into a queue and process one message per second. That queue must be persistent and must be able to hold more messages than the topic currently can.
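The last option, a persistent queue in front of the topic, could look roughly like the sketch below. It uses SQLite from the Python standard library purely as an illustration of a disk-backed FIFO; the class name PersistentQueue, the table layout and the file name are assumptions, and this is not a JBoss or JMS API.

```python
import os
import sqlite3
import tempfile


class PersistentQueue:
    """A minimal disk-backed FIFO queue (illustrative only)."""

    def __init__(self, path):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS queue ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, body TEXT NOT NULL)"
        )
        self.db.commit()

    def put(self, body):
        with self.db:  # commits on success, rolls back on error
            self.db.execute("INSERT INTO queue (body) VALUES (?)", (body,))

    def get(self):
        """Remove and return the oldest message, or None if the queue is empty."""
        with self.db:
            row = self.db.execute(
                "SELECT id, body FROM queue ORDER BY id LIMIT 1"
            ).fetchone()
            if row is None:
                return None
            self.db.execute("DELETE FROM queue WHERE id = ?", (row[0],))
            return row[1]

    def __len__(self):
        return self.db.execute("SELECT COUNT(*) FROM queue").fetchone()[0]


path = os.path.join(tempfile.mkdtemp(), "outbox.db")

queue = PersistentQueue(path)
for i in range(5):
    queue.put(f"msg-{i}")
queue.db.close()  # simulate a restart of the producer process

reopened = PersistentQueue(path)
first = reopened.get()
remaining = len(reopened)

print(first, remaining)
```

A forwarding loop would then call get() at the rate the consumer can sustain (for example once per second) and send each message to the topic. As noted above, this only moves the backlog from memory to disk, so the disk must be able to hold it.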