NiFi - JMS acknowledgement after processing - queue

Is it possible to acknowledge a message from JMS using NiFi after processing it?
More specifically, I want to use the ConsumeJMS processor to read a message from a queue, process it (using multiple NiFi processors), send it to Kafka using the PublishKafka processor, and only then send an acknowledgement to JMS. In other words, I do not want to remove the message from the queue before it is committed to Kafka.
I know that there are multiple acknowledgement modes, but none of them seems to do what I need. I aim to achieve a similar effect as with the HandleHttpRequest and HandleHttpResponse processors, where I am able to send a response after the message is processed.

For most sources, including JMS and Kafka, the NiFi processor that retrieves the data writes it to NiFi's internal repositories, commits to those repositories, and only then sends the acknowledgement to the source system. So the ack is sent in the source processor once the data is guaranteed to be in NiFi.
There is an effort called NiFi Stateless which is more geared to what you described: it can run a linear series of processors and defer any commit/ack until the very end. This effort is fairly new and still under development.

Related

Is it possible to have a DeadLetter Queue topic on Kafka Source Connector side?
We have a challenge with the events processed by the IBM MQ Source connector: it processes N messages but only sends N-100 of them, where the remaining 100 are poison messages.
However, from the blog post below by Robin Moffatt, I can see that a dead letter queue is not supported on the source connector side.
https://www.confluent.io/blog/kafka-connect-deep-dive-error-handling-dead-letter-queues/
The article includes the following note:
Note that there is no dead letter queue for source connectors.
1Q) Please confirm whether anyone has used a dead letter queue with the IBM MQ Source Connector (documentation below):
https://github.com/ibm-messaging/kafka-connect-mq-source
2Q) Has anyone used a DLQ with any other source connector?
3Q) Why is there no DLQ support on the source connector side?
Thanks.
errors.tolerance is available for source connectors too; refer to the docs.
However, if you compare that to sinks, no, DLQ options are not available. You would instead need to parse the Connector logs with the event details, then pipe that to a topic on your own.
Overall, how would the source connectors decide what events are bad? A network connection exception means that no messages would be read at all, so there's nothing to produce. If messages fail to serialize to Kafka events, then they also would fail to be produced... Your options are either to fail-fast, or skip and log.
If you just want to send binary data through as-is, then nothing would be "poisonous"; that can be done with the ByteArrayConverter class. That is not really a good use case for Kafka Connect, though, since it is primarily designed around structured types with parsable schemas. But at least with that option the data gets into Kafka, and you can use Kafka Streams to branch/filter the good messages from the bad ones, as in the sketch below.
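For that last option, here is a minimal sketch, assuming hypothetical topic names ("mq.raw" fed by the connector via ByteArrayConverter, plus "mq.good" and "mq.bad") and a placeholder validity check; it uses Kafka Streams to route the raw bytes into separate good/bad topics, which gives you a hand-rolled dead letter queue downstream of the connector:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class SplitGoodAndBad {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "split-good-bad");      // assumed application id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed broker address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.ByteArray().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.ByteArray().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // "mq.raw" is a hypothetical topic written by the source connector as raw bytes.
        KStream<byte[], byte[]> raw = builder.stream("mq.raw");

        // Route records that pass the (placeholder) validity check to "mq.good",
        // everything else to "mq.bad".
        raw.filter((key, value) -> isParsable(value)).to("mq.good");
        raw.filterNot((key, value) -> isParsable(value)).to("mq.bad");

        new KafkaStreams(builder.build(), props).start();
    }

    // Placeholder check: replace with real parsing/validation logic for your payload format.
    private static boolean isParsable(byte[] value) {
        return value != null && value.length > 0;
    }
}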

Is it better to keep a Kafka Producer open or to create a new one for each message?

I have data coming in through RabbitMQ. The data is coming in constantly, multiple messages per second.
I need to forward that data to Kafka.
In my RabbitMQ delivery callback, where I am getting the data from RabbitMQ, I have a Kafka producer that immediately sends the received messages to Kafka.
My question is very simple. Is it better to create a Kafka producer outside of the callback method and use that one producer for all messages or should I create the producer inside the callback method and close it after the message is sent, which means that I am creating a new producer for each message?
It might be a naive question but I am new to Kafka and so far I did not find a definitive answer on the internet.
EDIT : I am using a Java Kafka client.
Creating a Kafka producer is an expensive operation, so reusing a single producer instance is good practice for both performance and resource utilization.
For Java clients, this is from the docs:
The producer is thread safe and should generally be shared among all threads for best performance.
For librdkafka-based clients (confluent-dotnet, confluent-python, etc.), I can link this related issue, with the following quote from it:
Yes, creating a singleton service like that is a good pattern. you definitely should not create a producer each time you want to produce a message - it is approximately 500,000 times less efficient.
The Kafka producer is stateful: it holds metadata (periodically synced from the brokers), a message send buffer, and so on. Creating a new producer for each message is therefore impractical.
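A minimal sketch of the shared-producer pattern with the standard Java clients (RabbitMQ amqp-client and kafka-clients); the broker addresses, queue name, and topic name are placeholders, not from the original question:

import java.nio.charset.StandardCharsets;
import java.util.Properties;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DeliverCallback;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class RabbitToKafkaBridge {
    public static void main(String[] args) throws Exception {
        // Create the producer once and reuse it for every message (it is thread safe).
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);

        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");                                            // placeholder RabbitMQ host
        Connection connection = factory.newConnection();
        Channel channel = connection.createChannel();

        // The delivery callback only reuses the existing producer; it never creates one.
        DeliverCallback onDelivery = (consumerTag, delivery) -> {
            String body = new String(delivery.getBody(), StandardCharsets.UTF_8);
            producer.send(new ProducerRecord<>("forwarded-events", body));       // placeholder topic
        };
        channel.basicConsume("incoming-queue", true, onDelivery, consumerTag -> { });

        // Close the producer only on shutdown, not per message.
        Runtime.getRuntime().addShutdownHook(new Thread(producer::close));
    }
}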

Making Kafka producer and Consumer synchronous

I have one Kafka producer and one consumer. The producer publishes to topic 1, where the data is picked up and some processing is done on it. The consumer reads from topic 2, which indicates whether the processing of the data from topic 1 was successful or not, i.e. topic 2 carries success or failure messages. Right now I start my consumer and then publish the data to topic 1. I want to make the producer and consumer synchronous: once the producer publishes the data, the consumer should read the success or failure message for that data, and only then should the producer proceed with the next set of data.
Apache Kafka, and publish/subscribe messaging in general, seeks to de-couple producers and consumers through the use of streaming async events. What you are describing is more like a batch job or a synchronous Remote Procedure Call (RPC) where the producer and consumer are explicitly coupled together. The standard Apache Kafka producer/consumer APIs do not support this message exchange pattern, but you can always write your own simple wrapper on top of the Kafka APIs that uses correlation IDs, consumption ACKs, and request/response messages to make your own interface that behaves as you wish.
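A rough, hypothetical sketch of such a wrapper with the Java clients; the topic names ("topic-1" for requests, "topic-2" for replies), the "correlationId" header, and the group id are assumptions, and a real implementation would need timeouts and error handling:

import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.UUID;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.header.Header;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class RequestReplyOverKafka {

    private final KafkaProducer<String, String> producer;
    private final KafkaConsumer<String, String> consumer;

    public RequestReplyOverKafka(String bootstrap) {
        Properties p = new Properties();
        p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
        p.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        p.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        producer = new KafkaProducer<>(p);

        Properties c = new Properties();
        c.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
        c.put(ConsumerConfig.GROUP_ID_CONFIG, "request-reply-wrapper");          // placeholder group id
        c.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        c.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        consumer = new KafkaConsumer<>(c);
        consumer.subscribe(List.of("topic-2"));                                  // reply topic
    }

    // Sends one record to topic-1 and blocks until the matching reply arrives on topic-2.
    public String sendAndAwaitReply(String payload) throws Exception {
        String correlationId = UUID.randomUUID().toString();
        ProducerRecord<String, String> request = new ProducerRecord<>("topic-1", null, payload);
        request.headers().add("correlationId", correlationId.getBytes(StandardCharsets.UTF_8));
        producer.send(request).get();                                            // wait for the broker ack

        while (true) {                                                           // sketch only: no timeout
            for (ConsumerRecord<String, String> reply : consumer.poll(Duration.ofSeconds(1))) {
                Header h = reply.headers().lastHeader("correlationId");
                if (h != null && correlationId.equals(new String(h.value(), StandardCharsets.UTF_8))) {
                    return reply.value();                                        // e.g. "success" or "failure"
                }
            }
        }
    }
}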
Short answer: you can't do that; Kafka doesn't provide that support.
Long answer: as Hans explained, the publish/subscribe messaging model keeps publishers and subscribers completely unaware of each other, and I believe that is where the power of this model lies. A producer can produce without worrying about whether there is any consumer, and a consumer can consume without worrying about how many producers there are.
The closest you can get is making your producer synchronous, which means waiting until your message is received and acknowledged by the broker.
If you want to do that, flush after every send, as in the sketch below.
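A minimal sketch of a synchronous send with the Java producer, assuming a placeholder broker address and topic; either blocking on the Future returned by send() or calling flush() makes the producer wait for the broker's acknowledgement before moving on:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class SynchronousSend {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder broker
        props.put(ProducerConfig.ACKS_CONFIG, "all");                          // wait for full acknowledgement
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Option 1: block on the Future returned by send().
            RecordMetadata meta = producer
                    .send(new ProducerRecord<>("topic-1", "key", "payload"))   // placeholder topic/record
                    .get();
            System.out.println("Acknowledged at offset " + meta.offset());

            // Option 2: flush after every send, forcing all buffered records out.
            producer.send(new ProducerRecord<>("topic-1", "key", "next payload"));
            producer.flush();
        }
    }
}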

Email notification when Kafka producer and consumer goes down

I have developed a data pipeline using Kafka. Right now I have one type of producer and two types of consumers set up in the cluster.
Producer: gets the messages from a Windows server
Consumer: Consumer A uses Spark Streaming to transform the data and present a real-time view. Consumer B stores the raw data, which might be useful for building the schema at a later stage.
For various reasons, starting with network issues, the consumers may not receive any data, and it is also possible that the consumer process dies in the event of a system failure.
I would be interested in knowing if there is a way to send an email notification when the consumer stops receiving messages or the consumer thread dies altogether. Do Kafka or ZooKeeper provide a way of doing this?
Right now I am thinking of checking whether the target system is receiving messages or not, but if the number of targets grows in the future, it will be really complex to write email notification systems for individual targets.

Data ingestion with Apache Storm

I have been reading a lot of articles that explain implementations of Apache Storm for ingesting data from either Apache Flume or Apache Kafka. My main question remains unanswered after reading several articles. What is the main benefit of using Apache Kafka or Apache Flume? Why not collect data from a source directly into Apache Storm?
To understand this I looked into these frameworks. Correct me if I am wrong.
Apache Flume is about collecting data from a source and pushing data to a sink. The sink being in this case Apache Storm.
Apache Kafka is about collecting data from a source and storing it in a message queue until Apache Storm processes it.
I am assuming you are dealing with the use case of Continuous Computation Algorithms or Real Time Analytics.
Given below is what you will have to go through if you DO NOT use Kafka or any message queue:
(1) You will have to implement functionality like data consistency yourself.
(2) You will have to implement replication on your own.
(3) You will have to tackle a variety of failures and build a fault-tolerant system.
(4) You will need to create a good design so that your producer and consumer are completely decoupled.
(5) You will have to implement persistence. What happens if your consumer fails?
(6) What happens to fault resilience? Do you want to take the entire system down when your consumer fails?
(7) You will have to implement delivery guarantees as well as ordering guarantees.
All of the above are inherent features of a message queue (Kafka etc.), and you will of course not want to re-invent the wheel here.
I think the reason for having different configurations could be a matter of how the source data is obtained. Storm spouts (the first elements in Storm topologies) are meant to synchronously poll for the data, while Flume agents (agent = source + channel + sink) are meant to asynchronously receive the data at the source. Thus, if you have a system that notifies certain events, then a Flume agent is required; this agent would be in charge of receiving the data and putting it into a queue management system (ActiveMQ, RabbitMQ...) in order to be consumed by Storm. The same would apply to Kafka.