Kafka streams threads and count of records being processed - apache-kafka

Say we have a Topic with 2 partitions and there are 'n' no of producers which are producing the data to this Topic. Now, Millions of the MessageRecords are being spread over 2 partitions.
Say, we have 2 threads (i.e. 2 separate Instances) powering to the Streams Processor. Now, In this setup, say Thread-1(i.e. Streaming Task-1) got Partition P-1 and say Thread-2(i.e. Streaming Task-2) got Partition P-2 for processing !!
ASK is :- Say, we want to know, how many MessageRecords have been processed by Streaming-Task-1 so far OR say for 28th September, 2KK ?? How do I do that ?
And, even the bigger the question is : "Streaming-Task-1" would never know about the TOTAL count of MessageRecords being processed, it shall only know about the count processed by itself !!
Can it ever know it know about the count processed by another Task-2 ??

There are several ways to accomplish what you are asking. If you are using the DSL I suggest you take a look at the word count example (https://docs.confluent.io/current/streams/quickstart.html). With a map operation you can make all the counts you want relatively simply.
If you are not using the dsl you can still do the same with a couple processors and state stores.

Related

Minimizing failure without impacting recovery when building processes on top of Kafka

I am working with a microservice that consumes messages from Kafka. It does some processing on the message and then inserts the result in a database. Only then am I acknowledging the message with Kafka.
It is required that I keep data loss to an absolute minimum but recovery rate is quick (avoid reprocessing message because it is expensive).
I realized that if there was to be some kind of failure, like my microservice would crash, my messages would be reprocessed. So I thought to add some kind of 'checkpoint' to my process by writing the state of the transformed message to the file and reading from it after a failure. I thought this would mean that I could move my Kafka commit to an earlier stage, only after writing to the file is successful.
But then, upon further thinking, I realized that if there was to be a failure on the file system, I might not find my files e.g. using a cloud file service might still have a chance of failure even if the marketed rate is that of >99% availability. I might end up in an inconsistent state where I have data in my Kafka topic (which is unaccessible because the Kafka offset has been committed) but I have lost my file on the file system. This made me realize that I should send the Kafka commit at a later stage.
So now, considering the above two design decisions, it feels like there is a tradeoff between not missing data and minimizing time to recover from failure. Am I being unrealistic in my concerns? Is there some design pattern that I can follow to minimize the tradeoffs? How do I reason about this situation? Here I thought that maybe the Saga pattern is appropriate, but am I overcomplicating things?
If you are that concerned of data reprocess, you could always follow the paradigm of sending the offsets out of kafka.
For example, in your consumer-worker reading loop:
(pseudocode)
while(...)
{
MessageAndOffset = getMsg();
//do your things
saveOffsetInQueueToDB(offset);
}
saveOffsetInQueueToDB is responsible of adding the offset to a Queue/List, or whatever. This operation is only done one the message has been correctly processed.
Periodically, when a certain number of offsets are stored, or when shutdown is captured, you could implement another function that stores the offsets for each topic/partition in:
An external database.
An external SLA backed storing system, such as S3 or Azure Blobs.
Internal (disk) and remote loggers.
If you are concerned about failures, you could use a combination of two of those three options (or even use all three).
Storing these in a "memory buffer" allows the operation to be async, so there's no need for a new transfer/connection to the database/datalake/log for each processed message.
If there's a crash, you could read all messages from the beginning (easiest way is just changing the group.id and setting from beginning) but discarding those whose offset is included in the database, avoiding the reprocess. For example by adding a condition in your loop (yep pseudocode again):
while(...)
{
MessageAndOffset = getMsg();
if (offset.notIncluded(offsetListFromDB))
{
//do your things
saveOffsetInQueueToDB(offset);
}
}
You could implement better performant algorithms instead a "non-included" type one, just storing the last read offsets for each partition in a HashMap and then just checking if the partition that belongs to each consumer is bigger or not than the stored one. For example, partition 0's last offset was 558 and partitions 1's 600:
//offsetMap = {[0,558],[1,600]}
while(...)
{
MessageAndOffset = getMsg();
//get partition => 0
if (offset > offsetMap.get(partition))
{
//do your things
saveOffsetInQueueToDB(offset);
}
}
This way, you guarantee that only the non-processed messages from each partition will be processed.
Regarding file system failures, that's why Kafka comes as a cluster: Fault tolerance in Kafka is done by copying the partition data to other brokers which are known as replicas.
So if you have 5 brokers, for example, you must experience a total of 5 different system failures at the same time (I guess brokers are in separate hosts) in order to lose any data. Even 4 different brokers could fail at the same time without losing any data.
All brokers save the same amount of data, same partitions. If a filesystem error occurs in one of the brokers, the others will still hold all the information:

Kafka - Sequential I/O

I am going through some of the articles to better understand what could be the reasons why Kafka is faster compared with other Messaging systems
One of the reasons apart from Zero On Copy , is Sequential IO - this sounds exciting. However, some follow up question wrt it.
Is the Sequential IO in all cases like seeking offset? In case of seeking offset, doesnt it involve some kind of random seek access ?
In Kafka, we generally seek offset during the consumer startup and every poll after that reads messages one after the other sequentially.
So, if seeking to an offset is random, it happens only during the startup and not afterwards i.e. it is only once. Subsequent polling is always sequential. So, it is only sequential access afterwards.
You may for example call seek() in your program multiple times but that is not what is used atleast in production. Because for getting records, you may any way have to poll() which always reads messages sequentially from the given offset.

Kafka - different configuration settings

I am going through the documentation, and there seems to be there are lot of moving with respect to message processing like exactly once processing , at least once processing . And, the settings scattered here and there. There doesnt seem a single place that documents the properties need to be configured rougly for exactly once processing and atleast once processing.
I know there are many moving parts involved and it always depends . However, like i was mentioning before , what are the settings to be configured atleast to provide exactly once processing and at most once and atleast once ...
You might be interested in the first part of Kafka FAQ that describes some approaches on how to avoid duplication on data production (i.e. on producer side):
Exactly once semantics has two parts: avoiding duplication during data
production and avoiding duplicates during data consumption.
There are two approaches to getting exactly once semantics during data
production:
Use a single-writer per partition and every time you get a network
error check the last message in that partition to see if your last
write succeeded
Include a primary key (UUID or something) in the
message and deduplicate on the consumer.
If you do one of these things, the log that Kafka hosts will be
duplicate-free. However, reading without duplicates depends on some
co-operation from the consumer too. If the consumer is periodically
checkpointing its position then if it fails and restarts it will
restart from the checkpointed position. Thus if the data output and
the checkpoint are not written atomically it will be possible to get
duplicates here as well. This problem is particular to your storage
system. For example, if you are using a database you could commit
these together in a transaction. The HDFS loader Camus that LinkedIn
wrote does something like this for Hadoop loads. The other alternative
that doesn't require a transaction is to store the offset with the
data loaded and deduplicate using the topic/partition/offset
combination.

Kafka - Difference between Events with batch data and Streams

What is the fundamental difference between an event with a batch of data attached and a kafka stream that occasionally sends data ? Can they be used interchangeably ? When should I use the first and when the latter ? Could you provide some simple use cases ?
Note: There is some info in the comments of this question but I would ask for a more well rounded answer.
I assume that with "difference" between streams and events with batched data you are thinking of:
Stream: Every event of interest is sent immediately to the stream. Those individual events are therefore fine-grained, small(er) in size.
Events with data batch: Multiple individual events get aggregated into a larger batch, and when the batch reaches a certain size, a certain time has passed, or a business transaction has completed, the batch event is sent to the stream. Those batch events are therefore more coarse-grained and large(r) in size.
Here is a list of characteristics that I can think of:
Realtime/latency: End-to-end processing time will typically be smaller for individual events, and longer for batch events, because the publisher may wait with sending batch events until enough individual events have accumulated.
Throughput: Message brokers differ in performance characteristics regarding max. # of in/out events / sec at comparable in/out amounts of data. For example, comparing Kinesis vs. Kafka, Kinesis has a lower max. # of in/out events / sec it can handle than a finely tuned Kafka cluster. So if you were to use Kinesis, batch events may make more sense to achieve the desired throughput in terms of # of individual events. Note: From what I know, the Kinesis client library has a feature to transparently batch individual events if desired/possible to increase throughput.
Order and correlation: If multiple individual events belong to one business transaction and need to be processed by consumers together and/or possibly in order, then batch events may make this task easier because all related data becomes available to consumers at once. With individual events, you would have to put appropriate measures in place like selecting appropriate partition keys to guarantee that individual events get processed in order and possibly by the same consumer worker instance.
Failure case: If batch events contain independent individual events, then it may happen that a subset of individual events in a batch fails to process (irrelevant whether temporary or permanent failure). In such a case, consumers may not be able to simply retry the entire event because parts of the batch event has already caused state changes. Explicit logic (=additional effort) may be necessary to handle partial processing failure of batch events.
To answer your question whether the two can be used interchangeably, I would say in theory yes, but depending on the specific use case, one of the two approaches will likely result better performance or result in less complex design/code/configuration.
I'll edit my answer if I can think of more differentiating characteristics.

Storm+Kafka not parallelizing as expected

We are having an issue regarding parallelism of tasks inside a single topology. We cannot manage to get a good, fluent processing rate.
We are using Kafka and Storm to build a system with different topologies, where data is processed following a chain of topologies connected using Kafka topics.
We are using Kafka 1.0.0 and Storm 1.2.1.
The load is small in number of messages, about 2000 per day, however each task can take quite some time. One topology in particular can take a variable amount of time to process each task, usually between 1 and 20 minutes. If processed sequentially, the throughput is not enough to process all incoming messages. All topologies and Kafka system are installed in a single machine (16 cores, 16 GB of RAM).
As messages are independent and can be processed in parallel, we are trying to use Storm concurrent capabilities to improve the throughput.
For that the topology has been configured as follows:
4 workers
parallelism hint set to 10
Message size when reading from Kafka large enough to read about 8 tasks in each message.
Kafka topics use replication-factor = 1 and partitions = 10.
With this configuration, we observe the following behavior in this topology.
About 7-8 tasks are read in a single batch from Kafka by the Storm topology (task size is not fixed), max message size of 128 kB.
About 4-5 task are computed concurrently. Work is more-or-less evenly distributed among workers. Some workers take 1 task, others take 2 and process them concurrently.
As tasks are being finished, the remaining tasks start.
A starvation problem is reached when only 1-2 tasks remain to be processed. Other workers wait idle until all tasks are finished.
When all tasks are finished, the message is confirmed and sent to the next topology.
A new batch is read from Kafka and the process starts again.
We have two main issues. First, even with 4 workers and 10 parallelism hint, only 4-5 tasks are started. Second, no more batches are started while there is work pending, even if it is just 1 task.
It is not a problem of not having enough work to do, as we tried inserting 2000 tasks at the beginning, so there is plenty of work to do.
We have tried to increase the parameter "maxSpoutsPending", expecting that the topology would read more batches and queue them at the same time, but it seems they are being pipelined internally, and not processed concurrently.
Topology is created using the following code:
private static StormTopology buildTopologyOD() {
//This is the marker interface BrokerHosts.
BrokerHosts hosts = new ZkHosts(configuration.getProperty(ZKHOSTS));
TridentKafkaConfig tridentConfigCorrelation = new TridentKafkaConfig(hosts, configuration.getProperty(TOPIC_FROM_CORRELATOR_NAME));
tridentConfigCorrelation.scheme = new RawMultiScheme();
tridentConfigCorrelation.fetchSizeBytes = Integer.parseInt(configuration.getProperty(MAX_SIZE_BYTES_CORRELATED_STREAM));
OpaqueTridentKafkaSpout spoutCorrelator = new OpaqueTridentKafkaSpout(tridentConfigCorrelation);
TridentTopology topology = new TridentTopology();
Stream existingObject = topology.newStream("kafka_spout_od1", spoutCorrelator)
.shuffle()
.each(new Fields("bytes"), new ProcessTask(), new Fields(RESULT_FIELD, OBJECT_FIELD))
.parallelismHint(Integer.parseInt(configuration.getProperty(PARALLELISM_HINT)));
//Create a state Factory to produce outputs to kafka topics.
TridentKafkaStateFactory stateFactory = new TridentKafkaStateFactory()
.withProducerProperties(kafkaProperties)
.withKafkaTopicSelector(new ODTopicSelector())
.withTridentTupleToKafkaMapper(new ODTupleToKafkaMapper());
existingObject.partitionPersist(stateFactory, new Fields(RESULT_FIELD, OBJECT_FIELD), new TridentKafkaUpdater(), new Fields(OBJECT_FIELD));
return topology.build();
}
and config created as:
private static Config createConfig(boolean local) {
Config conf = new Config();
conf.setMaxSpoutPending(1); // Also tried 2..6
conf.setNumWorkers(4);
return conf;
}
Is there anything we can do to improve the performance, either by increasing the number of parallel tasks or/and avoiding starvation while finishing to process a batch?
I found an old post on storm-users by Nathan Marz regarding setting parallelism for Trident:
I recommend using the "name" function to name portions of your stream
so that the UI shows you what bolts correspond to what sections.
Trident packs operations into as few bolts as possible. In addition,
it never repartitions your stream unless you've done an operation
that explicitly involves a repartitioning (e.g. shuffle, groupBy,
partitionBy, global aggregation, etc). This property of Trident
ensures that you can control the ordering/semi-ordering of how things
are processed. So in this case, everything before the groupBy has to
have the same parallelism or else Trident would have to repartition
the stream. And since you didn't say you wanted the stream
repartitioned, it can't do that. You can get a different parallelism
for the spout vs. the each's following by introducing a repartitioning
operation, like so:
stream.parallelismHint(1).shuffle().each(…).each(…).parallelismHint(3).groupBy(…);
I think you might want to set parallelismHint for your spout as well as your .each.
Regarding processing multiple batches concurrently, you are right that that is what maxSpoutPending is for in Trident. Try checking in Storm UI that your max spout pending value is actually picked up. Also try enabling debug logging for the MasterBatchCoordinator. You should be able to tell from that logging whether multiple batches are in flight at the same time or not.
When you say that multiple batches are not processed concurrently, do you mean by ProcessTask? Keep in mind that one of the properties of Trident is that state updates are ordered with regard to batches. If you have e.g. maxSpoutPending=3 and batch 1, 2 and 3 in flight, Trident won't emit more batches for processing until batch 1 is written, at which point it'll emit one more. So slow batches can block emitting more, even if 2 and 3 are fully processed, they have to wait for 1 to finish and get written.
If you don't need the batching and ordering behavior of Trident, you could try regular Storm instead.
More of a side note, but you might want to consider migrating off storm-kafka to storm-kafka-client. It's not important for this question, but you won't be able to upgrade to Kafka 2.x without doing it, and it's easier before you get a bunch of state to migrate.