I am going through some articles to better understand why Kafka is faster compared with other messaging systems.
One of the reasons, apart from zero-copy, is sequential I/O, which sounds exciting. However, I have a follow-up question about it.
Is the I/O sequential in all cases, such as when seeking to an offset? Doesn't seeking to an offset involve some kind of random access?
In Kafka, we generally seek to an offset during consumer startup, and every poll after that reads messages one after the other, sequentially.
So even if seeking to an offset is a random access, it happens only once, during startup. Subsequent polling is always sequential, so it is only sequential access afterwards.
You may, for example, call seek() multiple times in your program, but that is uncommon, at least in production. To get records you have to call poll() anyway, which always reads messages sequentially from the given offset.
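For illustration, here is a minimal sketch of that pattern with the Java consumer (the topic name, partition, starting offset, and bootstrap address are placeholders): seek() is called once at startup, and every subsequent poll() reads records sequentially from that point.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class SeekOnceThenPoll {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "my-group");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition partition = new TopicPartition("my-topic", 0);
            consumer.assign(Collections.singletonList(partition));

            // the only "random" access: jump to the desired starting offset once
            consumer.seek(partition, 42L);

            while (true) {
                // from here on, reads are sequential within the partition's log
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
            }
        }
    }
}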
I am working with a microservice that consumes messages from Kafka. It does some processing on the message and then inserts the result in a database. Only then am I acknowledging the message with Kafka.
It is required that I keep data loss to an absolute minimum while also keeping recovery quick (i.e., avoiding reprocessing messages, because it is expensive).
I realized that if there was to be some kind of failure, like my microservice crashing, my messages would be reprocessed. So I thought to add some kind of 'checkpoint' to my process by writing the state of the transformed message to a file and reading from it after a failure. I thought this would mean that I could move my Kafka commit to an earlier stage, right after the write to the file succeeds.
But then, upon further thinking, I realized that if there was to be a failure on the file system, I might not find my files; e.g., using a cloud file service might still fail even if the advertised availability is >99%. I might end up in an inconsistent state where I have data in my Kafka topic (which is inaccessible because the Kafka offset has been committed) but I have lost my file on the file system. This made me realize that I should send the Kafka commit at a later stage.
So now, considering the above two design decisions, it feels like there is a tradeoff between not missing data and minimizing time to recover from failure. Am I being unrealistic in my concerns? Is there some design pattern that I can follow to minimize the tradeoffs? How do I reason about this situation? Here I thought that maybe the Saga pattern is appropriate, but am I overcomplicating things?
If you are that concerned about reprocessing data, you could always follow the paradigm of storing the offsets outside of Kafka.
For example, in your consumer-worker reading loop:
(pseudocode)
while (...)
{
    messageAndOffset = getMsg();
    // do your things with messageAndOffset.message
    saveOffsetInQueueToDB(messageAndOffset.offset);
}
saveOffsetInQueueToDB is responsible for adding the offset to a queue/list, or whatever structure you prefer. This operation is only done once the message has been correctly processed.
Periodically, when a certain number of offsets have accumulated, or when a shutdown is detected, you could implement another function that stores the offsets for each topic/partition in:
An external database.
An external SLA-backed storage system, such as S3 or Azure Blobs.
Internal (disk) and remote loggers.
If you are concerned about failures, you could use a combination of two of those three options (or even use all three).
Storing these in a "memory buffer" allows the operation to be async, so there's no need for a new transfer/connection to the database/datalake/log for each processed message.
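As a rough sketch of such a buffer (the class, the flushToStorage call, and the flush threshold are hypothetical, not an existing API): offsets are collected in memory per topic/partition and written out in one batch, either when enough messages have been processed or at shutdown.

import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.common.TopicPartition;

public class OffsetBuffer {
    private final Map<TopicPartition, Long> pending = new HashMap<>();
    private final int flushThreshold = 1000; // flush after this many processed messages
    private int processedSinceFlush = 0;

    // called after each successfully processed message
    public synchronized void saveOffsetInQueueToDB(TopicPartition tp, long offset) {
        pending.merge(tp, offset, Math::max); // keep the highest offset per partition
        if (++processedSinceFlush >= flushThreshold) {
            flush();
        }
    }

    // also call this from a shutdown hook so no offsets are lost on a clean stop
    public synchronized void flush() {
        if (pending.isEmpty()) {
            return;
        }
        flushToStorage(new HashMap<>(pending)); // placeholder: DB, S3/Azure Blob, or a log
        pending.clear();
        processedSinceFlush = 0;
    }

    private void flushToStorage(Map<TopicPartition, Long> offsets) {
        // hypothetical persistence call; implement against your external store(s)
    }
}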
If there's a crash, you could read all messages from the beginning (the easiest way is just changing the group.id and setting auto.offset.reset to earliest), but discard those whose offset is already stored in the database, avoiding the reprocessing. For example, by adding a condition in your loop (yep, pseudocode again):
while (...)
{
    messageAndOffset = getMsg();
    offset = messageAndOffset.offset;
    if (offset.notIncluded(offsetListFromDB))
    {
        // do your things
        saveOffsetInQueueToDB(offset);
    }
}
You could implement a more performant algorithm than a "not-included" lookup, for example by storing the last read offset for each partition in a HashMap and then checking whether the offset of each incoming message is bigger than the stored one for its partition. For example, say partition 0's last offset was 558 and partition 1's was 600:
// offsetMap = {0 -> 558, 1 -> 600}
while (...)
{
    messageAndOffset = getMsg();
    partition = messageAndOffset.partition;   // e.g. 0
    offset = messageAndOffset.offset;
    if (offset > offsetMap.get(partition))
    {
        // do your things
        saveOffsetInQueueToDB(offset);
    }
}
This way, you guarantee that only the non-processed messages from each partition will be processed.
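Putting the pieces together in Java rather than pseudocode (loadOffsetsFromDB, the topic name, and the bootstrap address are placeholders), a recovery run could look roughly like this: the map of last processed offsets is loaded from the external store, a fresh group id reads the topic from the beginning, and anything at or below the stored offset for its partition is skipped.

import java.time.Duration;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SkipAlreadyProcessed {
    public static void main(String[] args) {
        Map<Integer, Long> offsetMap = loadOffsetsFromDB(); // e.g. {0=558, 1=600}

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "recovery-" + System.currentTimeMillis()); // fresh group, read from the beginning
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    long lastProcessed = offsetMap.getOrDefault(record.partition(), -1L);
                    if (record.offset() > lastProcessed) {
                        // do your things, then remember/persist the new offset
                        offsetMap.put(record.partition(), record.offset());
                    }
                }
            }
        }
    }

    private static Map<Integer, Long> loadOffsetsFromDB() {
        // hypothetical: read the last processed offset per partition from your external store
        Map<Integer, Long> offsets = new HashMap<>();
        offsets.put(0, 558L);
        offsets.put(1, 600L);
        return offsets;
    }
}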
Regarding file system failures, that's why Kafka comes as a cluster: fault tolerance in Kafka is achieved by copying the partition data to other brokers; these copies are known as replicas.
So if you have 5 brokers and a replication factor of 5, for example, you would have to experience 5 different system failures at the same time (assuming the brokers are on separate hosts) in order to lose any data. Even if 4 different brokers failed at the same time, no data would be lost.
With a replication factor equal to the number of brokers, all brokers hold the same partitions and therefore the same data. If a filesystem error occurs on one of the brokers, the others will still hold all the information.
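For illustration, replication is configured per topic. Here is a minimal sketch using Kafka's AdminClient (the topic name, partition count, and bootstrap address are placeholders) that creates a topic whose partitions are each copied to three brokers:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions, each copied to 3 brokers (replication factor 3)
            NewTopic topic = new NewTopic("my-topic", 6, (short) 3);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}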
I am trying to understand how Kafka Streams works under the hood (to know it a little better), and came across a Confluent link, and it is really wonderful.
It mentions two terms: StreamThreads and StreamTasks.
I am not able to understand what exactly a StreamTask is.
Is it executed by a StreamThread?
As per the docs, a StreamThread can have multiple StreamTasks, so won't there be data sharing, and won't this thread run slower? How does a StreamThread "run" multiple StreamTasks?
Any explanation in simple words would be of great help.
"Tasks" are a logical abstraction of work that can be done in parallel (i.e., stuff that can be processed independently of each other). Kafka Streams basically creates a task for each input topic partition, because data in different partitions can be processed independently of each other (it's a simplification, but it holds if you have a single input topic; for joins it's a little bit different).
A StreamThread is basically a JVM thread. Tasks are assigned to StreamThreads for execution. In the current implementation, a StreamThread basically loops over all its tasks and processes some amount of input data for each task. In between, the StreamThread (which uses a KafkaConsumer) polls the broker for new data for all its assigned tasks.
Because tasks are independent of each other, you can run as many threads as there are tasks. In that case, each thread executes only a single task.
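A minimal sketch of how the number of StreamThreads is configured (the application id, bootstrap address, topic name, serdes, and thread count are assumptions for illustration): with, say, 8 input partitions and num.stream.threads = 8, each thread ends up executing a single task; with fewer threads, several tasks are simply assigned to each thread.

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class ThreadsAndTasks {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "threads-and-tasks-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 8); // one thread per input partition

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input-topic").foreach((key, value) -> { /* process */ });

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}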
What is the fundamental difference between an event with a batch of data attached and a Kafka stream that occasionally sends data? Can they be used interchangeably? When should I use the first and when the latter? Could you provide some simple use cases?
Note: There is some info in the comments of this question, but I would like a more well-rounded answer.
I assume that with "difference" between streams and events with batched data you are thinking of:
Stream: Every event of interest is sent immediately to the stream. Those individual events are therefore fine-grained, small(er) in size.
Events with data batch: Multiple individual events get aggregated into a larger batch, and when the batch reaches a certain size, a certain time has passed, or a business transaction has completed, the batch event is sent to the stream. Those batch events are therefore more coarse-grained and large(r) in size.
Here is a list of characteristics that I can think of:
Realtime/latency: End-to-end processing time will typically be smaller for individual events and longer for batch events, because the publisher may wait to send a batch event until enough individual events have accumulated.
Throughput: Message brokers differ in performance characteristics regarding the maximum number of in/out events per second at comparable in/out amounts of data. For example, comparing Kinesis vs. Kafka, Kinesis can handle a lower maximum number of in/out events per second than a finely tuned Kafka cluster. So if you were to use Kinesis, batch events may make more sense to achieve the desired throughput in terms of the number of individual events. Note: From what I know, the Kinesis client library has a feature to transparently batch individual events if desired/possible to increase throughput.
Order and correlation: If multiple individual events belong to one business transaction and need to be processed by consumers together and/or possibly in order, then batch events may make this task easier because all related data becomes available to consumers at once. With individual events, you would have to put appropriate measures in place, like selecting appropriate partition keys, to guarantee that individual events get processed in order and possibly by the same consumer worker instance (see the sketch after this list).
Failure case: If batch events contain independent individual events, then it may happen that a subset of the individual events in a batch fails to process (irrelevant whether the failure is temporary or permanent). In such a case, consumers may not be able to simply retry the entire event because parts of the batch event have already caused state changes. Explicit logic (= additional effort) may be necessary to handle partial processing failure of batch events.
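To illustrate the partition-key measure mentioned in the order/correlation point, here is a minimal sketch (topic name, key, and bootstrap address are placeholders): giving all events of one business transaction the same key routes them to the same partition, so Kafka preserves their relative order for a single consumer.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyedEvents {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String transactionId = "order-4711"; // all events of one business transaction share this key
            producer.send(new ProducerRecord<>("events", transactionId, "order-created"));
            producer.send(new ProducerRecord<>("events", transactionId, "payment-received"));
            producer.send(new ProducerRecord<>("events", transactionId, "order-shipped"));
        }
    }
}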
To answer your question whether the two can be used interchangeably, I would say in theory yes, but depending on the specific use case, one of the two approaches will likely result in better performance or in a less complex design/code/configuration.
I'll edit my answer if I can think of more differentiating characteristics.
According to this blog post:
If the source of the stream polls an external entity for new messages and the downstream processing is non-uniform, inserting a buffer can be crucial to realizing good throughput. For example, a large buffer inserted after the Kafka Consumer from the Reactive Streams Kafka library can improve performance by an order of magnitude in some situations. Otherwise, the source may not poll Kafka fast enough to keep the downstream saturated with work, with the source oscillating between backpressuring and polling Kafka.
The documentation for the Alpakka Kafka connector doesn't mention that, so I was wondering if it makes sense to use a buffer in this case. Also, does the same thing apply to Kafka sinks (should I add a buffer before them)?
...I was wondering if it makes sense to use a buffer in this case
Consider the following segment from the blog post you quoted:
...the downstream processing is non-uniform....
One of the points of that section of the post is to illustrate the similar effects that a user-defined buffer and an asynchronous boundary can have on a stream. The default behavior, in which there are no buffers or asynchronous boundaries, is to enable operator fusion, which runs a stream in a single actor. This essentially means that for every Kafka message that is consumed, the message must go through the entire pipeline of the stream, from source to sink, before the next message goes through the pipeline. In other words, a message m2 will not go through the pipeline until the preceding message m1 is finished processing.
If the processing that occurs downstream from a Kafka connector source is "non-uniform" (i.e., it can take varying amounts of time: sometimes the processing happens quickly, sometimes it takes a while), then introducing a buffer or an asynchronous boundary could improve the overall throughput. This is because a buffer or asynchronous boundary can allow the source to continue consuming Kafka messages even if the downstream processing happens to take a long time. That is, if m1 takes a long time to process, the source could consume messages m2, m3, etc. (until the buffer is full), without waiting for m1 to finish. As Colin Breck states in his post:
The buffer improves performance by decoupling stages, allowing the upstream or downstream to continue to process elements, on average, even if one of them is busy processing a relatively expensive workload.
This potential performance boost doesn't apply for all situations. Again quoting Breck:
Similar to the async method discussed in the previous section, it should be noted that inserting buffers indiscriminately will not improve performance and simply consume additional resources. If adjacent workloads are relatively uniform, the addition of a buffer will not change the performance, as the overall performance of the stream will simply be dominated by the slowest processing stage.
One obvious way to determine whether using a buffer (i.e., .buffer) makes sense in your case is to try it. You might also try adding an asynchronous boundary (i.e., .async) instead. Compare the following three approaches--(1) the default fused behavior without buffering, (2) .buffer, and (3) .async--and see which one results in the best performance.
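To make the comparison concrete, here is a minimal sketch in the Akka Streams Java DSL (assuming Akka 2.6+; the element counts, buffer size, and the artificially slow process method are placeholders, and Source.range stands in for the Alpakka Kafka source so the example stays self-contained):

import akka.actor.ActorSystem;
import akka.stream.OverflowStrategy;
import akka.stream.javadsl.Sink;
import akka.stream.javadsl.Source;

public class BufferComparison {
    public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("buffer-comparison");

        // (1) default: the whole stream is fused into a single actor,
        //     so one element flows from source to sink before the next one starts
        Source.range(1, 1000)
                .map(BufferComparison::process)
                .runWith(Sink.foreach(i -> { }), system);

        // (2) explicit buffer: the source can run ahead of slow processing
        Source.range(1, 1000)
                .buffer(256, OverflowStrategy.backpressure())
                .map(BufferComparison::process)
                .runWith(Sink.foreach(i -> { }), system);

        // (3) asynchronous boundary: source and processing run in separate actors
        Source.range(1, 1000)
                .async()
                .map(BufferComparison::process)
                .runWith(Sink.foreach(i -> { }), system);

        // in a real benchmark you would time each variant and terminate the system afterwards
    }

    // stand-in for non-uniform downstream work: occasionally slow
    private static int process(int i) throws InterruptedException {
        if (i % 100 == 0) {
            Thread.sleep(50);
        }
        return i;
    }
}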
How do I exactly get an acknowledgement from Kafka once a message is consumed or processed? It might sound stupid, but is there any way to know the start and end offset of the message for which the ack has been received?
What I have found so far is that in 0.8 they introduced the following way to choose the offset to start reading from:
kafka.api.OffsetRequest.EarliestTime() finds the beginning of the data in the logs and starts streaming from there, kafka.api.OffsetRequest.LatestTime() will only stream new messages.
example code
https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example
I am still not sure about the acknowledgement part.
Kafka isn't really structured to do this. To understand why, review the design documentation here.
In order to provide an exactly-once acknowledgement, you would need to create some external tracking system for your application, where you explicitly write acknowledgements and implement locks over the transaction IDs in order to ensure things are only ever processed once. The computational cost of implementing such a system is extraordinarily high, and it is one of the main reasons that large transactional systems require comparatively exotic hardware and have arguably lower scalability than systems such as Kafka.
If you do not require strong durability semantics, you can use the groups API to keep rough track of when the last message was read. This ensures that every message is read at least once. Note that since the groups API does not give you the ability to explicitly track your application's own processing logic, your actual processing guarantees are fairly weak in this scenario. Schemes that rely on idempotent processing are common in this environment.
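For illustration, with the KafkaConsumer API that later replaced the old high-level consumer, this at-least-once tracking looks roughly like the following sketch (topic name, group id, and bootstrap address are placeholders); offsets are committed only after your own processing has finished, so the commit is the closest thing to an acknowledgement:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AtLeastOnceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "my-group");
        props.put("enable.auto.commit", "false"); // commit only after processing
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record); // your own logic; a crash here means this batch is re-read
                }
                consumer.commitSync(); // at-least-once: offsets are stored only after processing
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.printf("partition=%d offset=%d value=%s%n",
                record.partition(), record.offset(), record.value());
    }
}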
Alternatively, you may use the poorly named SimpleConsumer API (it is quite complex to use), which enables you to explicitly track timestamps within your application. This is the highest level of processing guarantee that can be achieved through the native Kafka APIs, since it enables you to track your application's own processing of the data that is read from the queue.