Apache NiFi & Kafka Integration - apache-kafka

I am not sure whether this question has already been addressed somewhere, but I couldn't find a helpful answer anywhere on the internet.
I am trying to integrate Apache NiFi with Kafka - consuming data from Kafka using Apache NiFi. Below are a few questions that come to my mind before proceeding with this.
Q-1) Our use case is to read data from Kafka in real time, parse the data, do some basic validations on it, and later push it to HBase. I know Apache NiFi is the right candidate for this kind of processing, but how easy is it to build the workflow if the JSON we are processing is complex? We were initially thinking of doing this in Java code, but later realised it can be done with minimal effort in NiFi. Please note that 80% of the data we process from Kafka will be simple JSON, but 20% will be complex (involving arrays).
Q-2) The trickiest part of writing a Kafka consumer is handling offsets properly. How will Apache NiFi handle offsets while consuming from Kafka topics? How will offsets be committed correctly if a rebalance is triggered in the middle of processing? Frameworks like Spring-Kafka provide options to commit offsets (to some extent) when a rebalance is triggered mid-processing. How does NiFi handle this?

I have deployed a number of pipelines in a 3-node NiFi cluster in production, one of which is similar to your use case.
Q-1) It's very simple and easy to build a pipeline for your use case. Since you didn't mention the types of tasks involved in processing a JSON, I'm assuming generic tasks. Generic tasks involving JSON include schema validation, which can be achieved with the ValidateRecord processor; transformation with the JoltTransformRecord processor; extraction of attribute values with EvaluateJsonPath; conversion of JSON to some other format, say Avro, with the ConvertJSONToAvro processor; etc.
NiFi gives you the flexibility to scale each stage/processor in the pipeline independently. For example, if transformation with JoltTransformRecord is time consuming, you can scale it to run N concurrent tasks on each node by configuring Concurrent Tasks under the Scheduling tab.
Q-2) As far as the ConsumeKafka_2_0 processor is concerned, offset management is handled by committing the NiFi processor session first and then the Kafka offsets, which means we have an at-least-once guarantee by default.
When Kafka triggers a rebalance of consumers for a given partition, the processor quickly commits (processor session and Kafka offsets) whatever it has got and returns the consumer to the pool for reuse.
ConsumeKafka_2_0 handles committing offsets when the members of the consumer group change or the subscription of the members changes. This can occur when processes die, new process instances are added, or old instances come back to life after failure. It also takes care of cases where the number of partitions of a subscribed topic is administratively adjusted.
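NiFi's internals aside, the ordering described above (make the results durable first, commit the Kafka offsets second) is the standard at-least-once pattern. Here is a minimal sketch with a plain Java KafkaConsumer - not NiFi code - where writeOut() stands in for transferring records to the processor session:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class AtLeastOnceConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address
            props.put("group.id", "nifi-analogy");
            props.put("enable.auto.commit", "false");            // commit manually, after processing
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("my-topic")); // placeholder topic
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        writeOut(record); // analogue of transferring the record to the NiFi session
                    }
                    // The "session" work above is done first; only then are offsets committed.
                    // A crash between the two steps replays the batch => at-least-once.
                    consumer.commitSync();
                }
            }
        }

        static void writeOut(ConsumerRecord<String, String> record) {
            System.out.printf("%s-%d@%d: %s%n",
                    record.topic(), record.partition(), record.offset(), record.value());
        }
    }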

Related

How do I test that exactly-once semantics work in my Kafka Streams application

I have a Kafka Streams DSL application and we have a requirement for exactly-once processing, for which I have added the configuration
streamConfig.put("processing.guarantee", "exactly_once");
I am using Kafka 2.7.
I have two queries:
What's the difference between exactly_once and exactly_once_beta?
How do I test this functionality to be sure my messages are getting processed only once?
Thanks!
exactly_once_beta is an improvement over exactly_once. While exactly_once uses a transactional producer for each stream task (a combination of sub-topology and input partition), exactly_once_beta uses a transactional producer for each stream thread of a Kafka Streams client.
Every producer comes with separate memory buffers, a separate thread, and separate network connections, which might limit scaling the number of input partitions (i.e. the number of tasks). A high number of producers might also cause more load on the brokers. Hence, exactly_once_beta has better scaling characteristics. You can find more details in KIP-447.
Note that exactly_once will be deprecated and exactly_once_beta will be renamed to exactly_once_v2 in Apache Kafka 3.0. See KIP-732 for more details.
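For reference, a minimal sketch of how the setting is spelled using the StreamsConfig constants (the broker-side requirement of 2.5+ for exactly_once_beta comes from KIP-447):

    import java.util.Properties;
    import org.apache.kafka.streams.StreamsConfig;

    public class EosConfig {
        static Properties streamConfig() {
            Properties streamConfig = new Properties();
            // Kafka Streams 2.6 - 2.8 (brokers must be on 2.5+, per KIP-447):
            streamConfig.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_BETA);
            // From Apache Kafka 3.0 on, the same guarantee is spelled:
            // streamConfig.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);
            return streamConfig;
        }
    }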
For tests you can get inspiration from the tests in the Apache Kafka repo:
https://github.com/apache/kafka/blob/trunk/streams/src/test/java/org/apache/kafka/streams/integration/EosIntegrationTest.java
https://github.com/apache/kafka/blob/trunk/streams/src/test/java/org/apache/kafka/streams/integration/EOSUncleanShutdownIntegrationTest.java
https://github.com/apache/kafka/blob/trunk/tests/kafkatest/tests/streams/streams_eos_test.py
Basically, you need to create a failover scenario and verify that messages are not produced multiple times to the output topics. Note that messages may be processed multiple times, but the results in the output topics must appear as if they were only processed once. You can find a pretty good talk about exactly-once semantics that also explains the failover scenarios here: https://www.confluent.io/kafka-summit-london18/dont-repeat-yourself-introducing-exactly-once-semantics-in-apache-kafka/
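One concrete check you can build into such a failover test, sketched below with assumed broker and topic names: read the output topic with isolation.level=read_committed, so records from aborted transactions are invisible, and assert that each expected key shows up exactly once. This assumes your test data uses unique keys and the topology emits one output record per input.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class EosOutputVerifier {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "eos-verifier");
            props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
            props.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed"); // skip aborted transactions
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            Map<String, Integer> counts = new HashMap<>();
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("output-topic")); // assumed topic name
                for (int i = 0; i < 20; i++) { // poll for roughly 10 seconds
                    for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                        counts.merge(record.key(), 1, Integer::sum);
                    }
                }
            }
            // Even after killing and restarting the Streams instance mid-run,
            // every key should still appear exactly once in the committed output.
            counts.forEach((key, count) -> {
                if (count != 1) {
                    throw new AssertionError("Key " + key + " was produced " + count + " times");
                }
            });
        }
    }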

Processing Unprocessed Records in Kafka on Recovery/Rebalance

I'm using Spring Kafka to interface with my Kafka instance. Assume that I have a single topic with, say, 2+ partitions.
In the instances where, for example, my Spring Kafka-based application crashes (or even rebalances) and then comes back online with messages waiting in the topic, I'm currently using this strategy: the latest committed offset for each partition is stored in an external store; on a consumer's assignment to a partition I look up that offset and seek to it to resume processing.
(This is based on a strategy I'd read about in an O'Reilly book.)
Is there a better way of handling this situation in order to implement "exactly once" semantics and not to miss any waiting messages? Or is there a better/more idiomatic way with Spring Kafka to handle this situation?
Thanks in advance.
Is there a reason you don't checkpoint your offsets to Kafka itself?
Generally, your options for "exactly once" processing are:
1. Store your offsets and your side effects together transactionally. This is only possible if your side effects go into a transaction-capable system (say, a database); see the sketch after this list.
2. Use Kafka transactions. This is a simplified variant of 1, as long as your side effects go to the same Kafka cluster you read from.
3. Come up with a scheme that allows you to detect and disregard duplicates downstream of your Kafka pipeline (a.k.a. idempotence).
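If you stay with option 1, Spring Kafka's ConsumerSeekAware interface gives you the seek-on-assignment hook you described. A rough sketch, where OffsetStore is a hypothetical DAO that writes results and offsets in the same database transaction:

    import java.util.Map;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.common.TopicPartition;
    import org.springframework.kafka.annotation.KafkaListener;
    import org.springframework.kafka.listener.ConsumerSeekAware;
    import org.springframework.stereotype.Component;
    import org.springframework.transaction.annotation.Transactional;

    @Component
    public class ExternallyCheckpointedListener implements ConsumerSeekAware {

        /** Hypothetical store that persists results and offsets in the same transaction. */
        public interface OffsetStore {
            long lastProcessedOffset(TopicPartition partition);
            void saveResultAndOffset(ConsumerRecord<String, String> record);
        }

        private final OffsetStore offsetStore;

        public ExternallyCheckpointedListener(OffsetStore offsetStore) {
            this.offsetStore = offsetStore;
        }

        @Override
        public void onPartitionsAssigned(Map<TopicPartition, Long> assignments, ConsumerSeekCallback callback) {
            // On (re)assignment, resume each partition right after the last offset
            // that was stored together with its side effect.
            assignments.keySet().forEach(tp ->
                    callback.seek(tp.topic(), tp.partition(), offsetStore.lastProcessedOffset(tp) + 1));
        }

        @KafkaListener(topics = "my-topic") // assumed topic name
        @Transactional
        public void onMessage(ConsumerRecord<String, String> record) {
            // The side effect and the record's offset commit (or roll back) together,
            // so a crash never leaves them out of sync.
            offsetStore.saveResultAndOffset(record);
        }
    }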

Multiple Flink pipelines for the same Kafka topic

Background
We have a Kafka topic with a steady stream of data. To process it we have a stateless Flink pipeline that consumes that topic and writes to another topic.
From time to time we have bursts of data that our Flink is not configured to handle. We don't want to configure our Flink pipeline and cluster to always support the maximum load we could have; we want to scale dynamically according to the load (budget reasons $$$).
Solutions we thought of
One way to do so is to add/remove nodes to the Flink cluster and change the parallelism of the Flink pipeline operators. This will require stopping the Flink job with a snapshot, reconfiguring the parallelism and restarting with new parallelism.
This would be great but we cannot allow ourselves the downtime it produces. We have to scale up/down without downtime.
If we used regular Kafka consumers, it would be as simple as adding a consumer (assuming we have enough Kafka partitions), and Kafka would redistribute the topic partitions among all the consumers.
The Flink Kafka consumer manages the partition assignment and the offsets on its own, which enables exactly-once semantics (which we don't need). The drawback is that a single Flink job always uses all the topic partitions.
We thought we could create another instance of Flink that subscribes to the same topic with the same group and let Kafka distribute the partitions between them. But for that we would need the Flink Kafka consumer to let Kafka manage which partitions are assigned to which consumer.
What are we looking for
We couldn't find a library that contains such a consumer or a configuration in the existing consumer. We could write it on our own (not so difficult) but if there is an existing solution we'd rather use it.
Are we missing something? Are we misunderstanding something? Is there a better solution?
Thanks!
The most straightforward approach, since you said that at worst you'll need double the capacity, would be to modify your topology to be able to write Kafka messages you can't process quickly enough to a second overflow Kafka topic. Both input and output Kafka topic names would be configurable. Maybe you would have a threshold backlog delay that automatically triggers this writing or maybe you would have a flag in the topology that you can externally set while the topology is running. That's a design detail you can work through that has operational implications.
This gives you a Flink topology that can handle some maximum number of messages in a timely fashion while writing the rest of the messages that can't be handled to a second Kafka topic. You can then run a second instance of the same Flink topology that reads from that secondary topic and writes, if necessary, to a third topic. If the writing to the overflow topic happens very early in the topology processing, you could chain several of these instances together via Kafka with minimal latency and without having to reconfigure and restart any topologies.
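A rough sketch of that routing step in the Flink DataStream API; the threshold check, sink wiring, and names are assumptions, not your actual topology:

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
    import org.apache.flink.streaming.api.functions.ProcessFunction;
    import org.apache.flink.util.Collector;
    import org.apache.flink.util.OutputTag;

    public class OverflowRouting {

        // Side output that feeds the overflow Kafka topic.
        static final OutputTag<String> OVERFLOW = new OutputTag<String>("overflow") {};

        static SingleOutputStreamOperator<String> route(DataStream<String> input) {
            return input.process(new ProcessFunction<String, String>() {
                @Override
                public void processElement(String value, Context ctx, Collector<String> out) {
                    if (overloaded()) {
                        ctx.output(OVERFLOW, value); // divert to the overflow topic's sink
                    } else {
                        out.collect(value);          // normal processing path
                    }
                }
            });
        }

        // Placeholder for the trigger: a measured backlog delay or an externally set flag.
        static boolean overloaded() {
            return false;
        }
    }

    // Wiring (the sinks are Kafka sinks for the primary and overflow output topics):
    //   SingleOutputStreamOperator<String> routed = OverflowRouting.route(input);
    //   routed.sinkTo(primaryTopicSink);
    //   routed.getSideOutput(OverflowRouting.OVERFLOW).sinkTo(overflowTopicSink);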

Spark/Spark Streaming in production without HDFS

I have been developing applications using Spark/Spark Streaming, but so far I have always used HDFS for file storage. However, I have reached a stage where I am exploring whether it can be done (in production, running 24/7) without HDFS. I tried sifting through the Spark user group but have not found any concrete answer so far. Note that I do use checkpoints and stateful stream processing using updateStateByKey.
Depending on the streaming source (I've been using Kafka), you do not need to use checkpoints, etc.
Since Spark 1.3 there is a direct approach with many benefits:
Simplified Parallelism: No need to create multiple input Kafka streams and union them. With directStream, Spark Streaming will create as many RDD partitions as there are Kafka partitions to consume, which will all read data from Kafka in parallel. So there is a one-to-one mapping between Kafka and RDD partitions, which is easier to understand and tune.

Efficiency: Achieving zero data loss in the first approach required the data to be stored in a Write-Ahead Log, which further replicated the data. This is actually inefficient, as the data effectively gets replicated twice: once by Kafka, and a second time by the Write-Ahead Log. This second approach eliminates the problem as there is no receiver, and hence no need for Write-Ahead Logs.

Exactly-once semantics: The first approach uses Kafka's high-level API to store consumed offsets in ZooKeeper. This is traditionally the way to consume data from Kafka. While this approach (in combination with write-ahead logs) can ensure zero data loss (i.e. at-least-once semantics), there is a small chance some records may get consumed twice under some failures. This occurs because of inconsistencies between data reliably received by Spark Streaming and offsets tracked by ZooKeeper. Hence, in this second approach, we use a simple Kafka API that does not use ZooKeeper; offsets are tracked only by Spark Streaming within its checkpoints. This eliminates inconsistencies between Spark Streaming and ZooKeeper/Kafka, and so each record is received by Spark Streaming effectively exactly once despite failures.
If you are using Kafka, you can find out more here:
https://spark.apache.org/docs/1.3.0/streaming-kafka-integration.html
(see Approach 2: the direct approach).
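The direct stream described under Approach 2 there boils down to a single call. A sketch in Java against the Spark 1.3 / Kafka 0.8 integration, with placeholder broker and topic names:

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;
    import kafka.serializer.StringDecoder;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaPairInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka.KafkaUtils;

    public class DirectStreamExample {
        public static void main(String[] args) throws Exception {
            SparkConf conf = new SparkConf().setAppName("direct-kafka");
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

            Map<String, String> kafkaParams = new HashMap<>();
            kafkaParams.put("metadata.broker.list", "localhost:9092"); // placeholder broker list

            Set<String> topics = new HashSet<>();
            topics.add("my-topic"); // placeholder topic

            // One RDD partition per Kafka partition; offsets are tracked by Spark itself,
            // with no receiver and no write-ahead log.
            JavaPairInputDStream<String, String> stream = KafkaUtils.createDirectStream(
                    jssc,
                    String.class, String.class,
                    StringDecoder.class, StringDecoder.class,
                    kafkaParams, topics);

            stream.print(); // replace with your own processing

            jssc.start();
            jssc.awaitTermination();
        }
    }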

Does it make sense to build a data processing pipeline using only Kafka?

I am building a data processing pipeline using Kafka.
The pipeline is linear with 4 stages.
The data volume is medium (will need more than one machine but not hundreds or thousands; data volume is a few tens of gigabytes)
My question: can I use only Kafka, having each pipeline stage consume from one topic and produce to another topic? Should I be using Spark or Storm, and why? Of course, I prefer the simplest possible architecture; if I can do it all with Kafka, I'd prefer that. In the future I may need some additional machine learning stages, and that may affect the answer. I have no strong exactly-once requirement: I can accept some message loss and some duplication with no problem.
My question: can I use only Kafka, having each pipeline stage consume from one topic and produce to another topic? Should I be using Spark or Storm, and why?
Technically, yes, you can, if you are ready to handle the whole distributed architecture on your own: writing your own multi-threaded producers, managing those consumers, and so on. You also need to think about scalability, performance, durability, etc. And here comes the beauty of using a computation engine like Storm or Spark: you can simply concentrate on the core logic and leave the infrastructure to be maintained by the framework.
For example, using a combination of Kafka and Storm for your architecture, you can store terabytes of data in Kafka and feed it to Storm for processing. If you are familiar with Storm, a sample topology can look something like this:
(Kafka spout consuming messages from a topic) --> (Bolt A for processing the data received through the spout and feeding it to Bolt B) --> (Bolt B for pushing the processed data back into another Kafka topic)
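Wiring that chain with Storm's TopologyBuilder looks roughly like the sketch below; the broker address, topic name, parallelism numbers, and the two bolt classes (ProcessingBoltA, PublishingBoltB) are placeholders for your own:

    import org.apache.storm.Config;
    import org.apache.storm.StormSubmitter;
    import org.apache.storm.kafka.spout.KafkaSpout;
    import org.apache.storm.kafka.spout.KafkaSpoutConfig;
    import org.apache.storm.topology.TopologyBuilder;

    public class PipelineTopology {
        public static void main(String[] args) throws Exception {
            // Spout reading the raw messages from the input topic (placeholder broker/topic).
            KafkaSpoutConfig<String, String> spoutConfig =
                    KafkaSpoutConfig.builder("localhost:9092", "input-topic").build();

            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("kafka-spout", new KafkaSpout<>(spoutConfig), 2);
            // Bolt A: hypothetical bolt with your parse/validate logic.
            builder.setBolt("bolt-a", new ProcessingBoltA(), 4).shuffleGrouping("kafka-spout");
            // Bolt B: hypothetical bolt publishing results to the output topic (e.g. via KafkaBolt).
            builder.setBolt("bolt-b", new PublishingBoltB(), 2).shuffleGrouping("bolt-a");

            StormSubmitter.submitTopology("kafka-pipeline", new Config(), builder.createTopology());
        }
    }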
Such an architecture offers a great deal of scalability, throughput, and performance. With some easy configuration changes you will be able to tune your application based on your requirements.