KafkaIO uneven partition consumption after a while - apache-kafka

I have a simple dataflow pipeline (job id 2018-05-15_06_17_40-8591349846083543299) with 1 min worker and 7 max workers that does the following:
Consume from 4 Kafka topics using KafkaIO. Each topic is represented differently and is a separate PCollection
Perform transformation on each PCollection to output a standard representation PCollection.
Merge the 4 PCollections using Flatten.pCollections
Window into hourly with the following trigger:
Repeatedly
    .forever(
        AfterFirst.of(
            AfterPane.elementCountAtLeast(40000),
            AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(5))
        )
    )
    .orFinally(AfterWatermark.pastEndOfWindow())
Write these events to GCS using AvroIO windowed writes with 14 shards.
When the pipeline is first launched everything is fine, but after several hours the System Lag increases dramatically in the AvroIO:GroupIntoShards step.
Upon further investigation, one of the topics is lagging many hours behind (this topic has the highest events per second of the 4). Looking at the logs I see Closing idle reader for S12-000000000000000a, which is understandable. However, the topic's consumer group offsets for the 36 partitions are in a state where the offset is very low for some partitions and very high for others. The log-end-offset is more or less evenly distributed across partitions, and the records we are producing are around the same size.
Questions:
If the System Lag is high in a certain step, does that prevent the Kafka consumers from consuming?
Any possible reason for the uneven distribution in Kafka offsets?
The PCollections that are merged have different traffic patterns, some low and some high. Would adding the AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(5)) trigger effectively start writing to GCS for each (window, shard) 5 minutes after an event is first seen in a window?
Updating the pipeline using the same code / configuration brings it back into a normal state where the consumed rate is much higher (due to the lag before the restart) than the produced rate.

Addressing the 3 questions raised (I left a comment about the specific job):
No, system lag does not prevent Kafka from consuming.
In general if there is lots of work for downstream stages ready to be processed, that can delay upstream work from starting. But that is not KafkaIO specific.
Does not seem to be the case here. In general, assuming there is no skew among the Kafka partitions themselves, heavy skew in Beam processing can cause readers assigned to workers that are doing more work than others to fall behind.
I think yes. I think pastFirstElementInPane() applies to elements from any of the sources.
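To make the answer to the third question concrete, here is a minimal plain-Java sketch of when a pane would fire under the AfterFirst trigger from the question. It is not Beam code: the class and method names are hypothetical, and the model ignores bundling and runner behaviour, so a real runner may fire somewhat later than this computes.

```java
import java.util.List;

/** Simplified model of AfterFirst.of(elementCountAtLeast, processing-time delay). Not Beam code. */
final class PaneFiringSketch {

    /**
     * Returns the processing time (ms) at which the pane would fire.
     * arrivalsMillis holds the processing-time arrival of each element in the pane,
     * in ascending order, regardless of which source (topic) the element came from.
     */
    static long firingTimeMillis(List<Long> arrivalsMillis, int countThreshold, long delayMillis) {
        if (arrivalsMillis.isEmpty()) {
            throw new IllegalArgumentException("a pane starts with its first element");
        }
        // Timer branch: armed when the first element of the pane arrives.
        long timerFires = arrivalsMillis.get(0) + delayMillis;
        // Count branch: satisfied when the countThreshold-th element arrives, if it ever does.
        long countFires = arrivalsMillis.size() >= countThreshold
                ? arrivalsMillis.get(countThreshold - 1)
                : Long.MAX_VALUE;
        // AfterFirst fires as soon as either branch is ready.
        return Math.min(timerFires, countFires);
    }
}
```

Under this model a low-traffic pane fires 5 minutes after its first element, while a high-traffic pane fires earlier, once the element count is reached.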

Related

Kafka Streams: event-time skew when processing messages from different partitions

Let's consider a topic with multiple partitions and messages written in event-time order without any particular partitioning scheme. Kafka Streams application does some transformations on these messages, then groups by some key, and then aggregates messages by an event-time window with the given grace period.
Each task could process incoming messages at a different speed (e.g., because the tasks run on servers with different performance characteristics). This means that after the groupBy shuffle, event-time ordering will not be preserved between messages in the same partition of the internal topic when they originate from different tasks. After a while, this event-time skew could become larger than the grace period, which would lead to dropping messages originating from the lagging task.
Increasing the grace period doesn't seem like a valid option because it would delay emitting the final aggregation result. Apache Flink handles this by emitting the lowest watermark on partitions merge.
Is this a real concern, especially when processing large amounts of historical data, or am I missing something? Does Kafka Streams offer a solution to deal with this scenario?
UPDATE My question is not about KStream-KStream joins but about single KStream event-time aggregation preceded by a stream shuffle.
Consider this code snippet:
stream
    .mapValues(...)
    .groupBy(...)
    .windowedBy(TimeWindows.of(Duration.ofSeconds(60)).grace(Duration.ofSeconds(10)))
    .aggregate(...)
I assume mapValues() operation could be slow for some tasks for whatever reason, and because of that tasks do process messages at a different pace. When a shuffle happens at the aggregate() operator, task 0 could have processed messages up to time t while task 1 is still at t-skew, but messages from both tasks end up interleaved in a single partition of the internal topic (corresponding to the grouping key).
My concern is that when skew is large enough (more than 10 seconds in my example), messages from the lagging task 1 will be dropped.
Basically, a task/processor maintains a stream-time, which is defined as the highest timestamp of any record already polled. This stream-time is then used for different purposes in Kafka Streams (e.g., Punctuator, Windowed Aggregation, etc.).
[Windowed Aggregation]
As you mentioned, the stream-time is used to determine if a record should be accepted, i.e record_accepted = end_window_time(current record) + grace_period > observed stream_time.
As you described it, if several tasks run in parallel to shuffle messages based on a grouping key, and some tasks are slower than others (or some partitions are offline) this will create out-of-order messages. Unfortunately, I'm afraid that the only way to deal with that is to increase the grace_period.
This is actually the eternal trade-off between Availability and Consistency.
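The acceptance rule above can be sketched in plain Java. This is a simplified model, not Kafka Streams code: the class name is hypothetical, windows are assumed to be tumbling and aligned to zero, and timestamps are assumed to be non-negative.

```java
/** Simplified model of the windowed-aggregation acceptance check. Not Kafka Streams code. */
final class GracePeriodSketch {
    private final long windowSizeMs;
    private final long graceMs;
    private long streamTimeMs = Long.MIN_VALUE; // highest timestamp observed so far

    GracePeriodSketch(long windowSizeMs, long graceMs) {
        this.windowSizeMs = windowSizeMs;
        this.graceMs = graceMs;
    }

    /** Returns true if the record is accepted, false if it would be dropped as late. */
    boolean offer(long timestampMs) {
        // Stream-time only moves forward; a record from a lagging task does not pull it back.
        streamTimeMs = Math.max(streamTimeMs, timestampMs);
        long windowStart = timestampMs - (timestampMs % windowSizeMs);
        long windowEnd = windowStart + windowSizeMs;
        // record_accepted = end_window_time(record) + grace_period > observed stream_time
        return windowEnd + graceMs > streamTimeMs;
    }
}
```

With a 60 second window and a 10 second grace period, a record at 30 s that arrives after a record at 75 s is dropped, because 60 s + 10 s is not greater than the stream-time of 75 s. This is the interleaving the question describes.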
[Behaviour for KafkaStream and KafkaStream/KTable Join]
When you are performing a join operation with Kafka Streams, an internal Task is assigned to the "same" partition over multiple co-partitioned Topics. For example, Task 0 will be assigned to TopicA-Partition0 and TopicB-Partition0.
The fetched records are buffered per partition into internal queues that are managed by Tasks. So, each queue contains all records for a single partition waiting for processing.
Then, records are polled one by one from the queues and processed by the topology instance. The record returned by each poll is the one with the lowest timestamp among the heads of the non-empty queues.
In addition, if a queue is empty, the task may become idle for a period of time so that no more records are polled from the queues. The maximum amount of time a Task will stay idle can be defined with the stream config max.task.idle.ms.
This mechanism allows synchronizing co-partitioned partitions. By default, max.task.idle.ms is set to 0. This means a Task will never wait for more data from a partition, which may lead to records being dropped because the stream-time can advance more quickly.
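The selection step described above can be sketched in plain Java. This is a simplified model with hypothetical names, not the Kafka Streams implementation: each queue stands for one input partition and holds record timestamps only. Empty queues are simply skipped, which corresponds to never idling (max.task.idle.ms set to 0).

```java
import java.util.Deque;
import java.util.List;
import java.util.OptionalLong;

/** Simplified model of choosing the next record across per-partition queues. */
final class PartitionQueueSketch {

    /** Removes and returns the smallest head timestamp among the non-empty queues. */
    static OptionalLong pollLowestTimestamp(List<Deque<Long>> queues) {
        Deque<Long> best = null;
        for (Deque<Long> q : queues) {
            // Only queue heads are compared; records deeper in a queue are not inspected.
            if (!q.isEmpty() && (best == null || q.peekFirst() < best.peekFirst())) {
                best = q;
            }
        }
        return best == null ? OptionalLong.empty() : OptionalLong.of(best.pollFirst());
    }
}
```

If one queue is empty while its partition still has older records in flight, this sketch moves ahead with the other queue, and stream-time can jump past the records that arrive later.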

Kafka Streams: does NUM_STREAM_THREADS_CONFIG > 1 break partition's total ordering?

Here we go: I have a quite complicated topology of various joins, aggregations, filters, maps, etc. By default the NUM_STREAM_THREADS_CONFIG parameter equals 1, and that's completely deterministic by definition; thus, the partition's total ordering (which is guaranteed by Kafka itself) is preserved.
Will total ordering be preserved once I set NUM_STREAM_THREADS_CONFIG to 2 or more?
Does it depend on the specific topology? I've checked the docs and went through the threading model section, yet did not find an answer.
Data is always processed in per-partition offset order, even if you set num.stream.threads to a larger value.
In Kafka Streams, sub-topologies are translated into tasks (based on input topic partitions) and tasks process records of their partitions in offset order. The number of tasks limits the number of threads you can keep busy (similar to the maximum number of consumers in a consumer group). If you configure more threads than available tasks, some threads just stay idle.
If a task processes data from multiple topics/partitions, there is no strict ordering guarantee for data of different partitions. Kafka Streams will take the record timestamps into account though, and process records with smaller timestamps first.
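As a rough illustration of the relationship between threads and tasks, the sketch below spreads tasks across threads and counts the threads left without work. The names are hypothetical, and the round-robin spread is an assumption made for simplicity; the real Kafka Streams assignor is more involved.

```java
/** Simplified model of how many stream threads can be kept busy by a given number of tasks. */
final class ThreadTaskSketch {

    /** Threads left with no task once every task has an owner. */
    static int idleThreads(int numStreamThreads, int numTasks) {
        return Math.max(0, numStreamThreads - numTasks);
    }

    /** Tasks per thread under a plain round-robin spread (an assumption, not the real assignor). */
    static int[] tasksPerThread(int numStreamThreads, int numTasks) {
        int[] counts = new int[numStreamThreads];
        for (int task = 0; task < numTasks; task++) {
            counts[task % numStreamThreads]++;
        }
        return counts;
    }
}
```

Whatever the spread, each task is owned by exactly one thread at a time, which is why per-partition offset order does not depend on the thread count.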

Handling a Large Kafka topic

I have a very large (by message count) Kafka topic. It might receive more than 20M messages per second, but the messages are small: just plain text, each less than 1KB. I can use several partitions per topic, and I can also use several servers to work on one topic, each consuming one of the topic's partitions.
What if I need 100+ servers for a huge topic?
Is it logical to create 100+ partitions on a single topic?
You should define "large" when mentioning Kafka topics:
Does large mean a huge volume of data?
Is the message size so large that it takes time to send a message from the queue to the client for processing?
Are writes to that topic intensive? In that case, do you need to process reads as fast as possible? (i.e., can processing be delayed by about 1 hour?)
...
In either case, you should think about the consumer side first to arrive at a better topic and partition design. For instance:
Processing each message is slow, but messages need to be processed quickly overall: in that case, you should create many partitions. It is like the relationship between a load balancer and its servers: you create many workers to do the job.
If processing is slow only for some message types, you should consider moving those to a new topic. There is a nice article, "Should you put several event types in the same Kafka topic", that explains this decision.
Is the order of messages important? For example, if message A happens before message B, message A should be processed first. In this case, you should send all messages of the same type to the same partition (only a single partition can maintain message order), or move them to a separate topic (with a single partition).
...
Once you have a proper design for topics and partitions, the question becomes: how many partitions should each topic have? Increasing the number of partitions will increase your throughput, but at the same time it will affect availability and latency. There are some good articles here and here that carefully explain how the number of partitions per topic affects performance. In my opinion, you should benchmark directly on your system to choose the correct value. It depends on many factors of your system: processing power of the server machines, network capacity, memory ...
And lastly, you don't need 100 servers for 100 partitions. Kafka will try to balance all partitions between servers, but this is best-effort. For example, if you have 1 topic with 7 partitions running on 3 servers, 2 servers will store 2 partitions each and 1 server will store 3 partitions (2*2 + 3*1 = 7). In ZooKeeper-based versions of Kafka, the mapping between partitions and servers is stored in ZooKeeper.
You will get better help if you are more specific and provide some numbers, such as your expected load per second and the size of each message.
In general Kafka is pretty powerful: behind the scenes it writes the data to a buffer and periodically flushes the data to disk. According to a benchmark done by Confluent a while back, a Kafka cluster with 6 nodes supports around 0.8 million messages per second.
Our friends were right. I refer you to this book:
Kafka, The Definitive Guide
by Neha Narkhede, Gwen Shapira & Todd Palino
You can find the answer on page 47
How to Choose the Number of Partitions
There are several factors to consider when choosing the number of
partitions:
What is the throughput you expect to achieve for the topic?
For example, do you expect to write 100 KB per second or 1 GB per
second?
What is the maximum throughput you expect to achieve when consuming from a single partition? You will always have, at most, one consumer
reading from a partition, so if you know that your slower consumer
writes the data to a database and this database never handles more
than 50 MB per second from each thread writing to it, then you know
you are limited to 50 MB per second throughput when consuming from a partition.
You can go through the same exercise to estimate the maximum throughput per producer for a single partition, but since producers
are typically much faster than consumers, it is usually safe to skip
this.
If you are sending messages to partitions based on keys, adding partitions later can be very challenging, so calculate throughput
based on your expected future usage, not the current usage.
Consider the number of partitions you will place on each broker and available diskspace and network bandwidth per broker.
Avoid overestimating, as each partition uses memory and other resources on the broker and will increase the time for leader elections.
With all this in mind, it's clear that you want many partitions but not too many. If you have some estimate regarding the target throughput of the topic and the expected throughput of the consumers, you can divide the target throughput by the expected consumer throughput and derive the number of partitions this way. So if I want to be able to write and read 1 GB/sec from a topic, and I know each consumer can only process 50 MB/s, then I know I need at least 20 partitions. This way, I can have 20 consumers reading from the topic and achieve 1 GB/sec. If you don't have this detailed information, our experience suggests that limiting the size of the partition on the disk to less than 6 GB per day of retention often gives satisfactory results.
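The division described in the quote can be written as a small helper. This is a minimal sketch with hypothetical names; it assumes throughput is expressed in whole MB per second and that 1 GB is 1000 MB.

```java
/** Ceiling division of target topic throughput by per-consumer throughput. */
final class PartitionCountSketch {

    /** Smallest partition count such that count * perConsumerMBPerSec >= targetMBPerSec. */
    static long partitionsNeeded(long targetMBPerSec, long perConsumerMBPerSec) {
        if (targetMBPerSec < 0 || perConsumerMBPerSec <= 0) {
            throw new IllegalArgumentException("throughputs must be positive");
        }
        // Round up so that a partial consumer's worth of load still gets its own partition.
        return (targetMBPerSec + perConsumerMBPerSec - 1) / perConsumerMBPerSec;
    }
}
```

For the book's figures, partitionsNeeded(1000, 50) gives 20. For the question's upper bound of 20M messages per second at up to 1KB each (roughly 20,000 MB/s), and assuming the same 50 MB/s per consumer purely for illustration, the result would be 400; the real per-consumer figure has to come from a benchmark.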

Certain partitions seem to take precedence when a consumer is reading from multiple partitions

I have a service which reads from a Kafka topic using librdkafka. I've noticed that if the consumer shuts down for a while, some log entries build up in Kafka (this is perfectly fine and expected).
What's weird, is that sometimes when I start the consumer back up and look at the pending log entries by partition, partitions assigned to the same consumer seem to be recovered at a different rate.
For example, say I have a consumer X and it claims partitions 30 through 50. When the consumer starts there are 10,000 entries pending on each.
What I see is the pending entries for 30-40 trend downward while the pending entries for 41-50 grow. When 30-40 finally hits zero (or gets close enough to zero) 41-50 starts trending downward.
Why is this happening? Is it a client feature or a server feature?
The way Kafka works is that the consumer keeps switching between partitions to fetch data. However, it only handles as many partitions at a time as the capacity of your consumer allows; had your consumer been more powerful (in terms of server performance), it would take on a few more partitions at once. Either way, it picks up the remaining partitions in a second pass after it is done with the first ones.
In summary: if you create X partitions, you might expect the consumer to go through all of them one by one before revisiting the first one, but that would hurt performance because of the extra effort spent switching.
In your case, I understand that since the other partitions also carry business data you don't want to delay them heavily, so I suggest reducing the number of partitions.

Merging ordered Kafka topics into a single ordered topic

I have N topics as input, each with messages added in ascending delivery date order. Topics can vary widely in message count, date range, partitioning strategy. But I know that all partitions for every topic will independently be in date order.
I want to merge all N topics priority-queue style into a new single topic T. T also has whatever partition count and strategy it wants since the only requirement is that each individual partition of T is still in date order on its own. I then feed T to partition-aware consumers which will consume them and idle between due dates since I want each message to be delivered on or closely thereafter its delivery date. This whole pipeline can stream forever.
I expect tuning issues with exactly how partitions amongst all the N input topics and the single T output topic are distributed, and advice which affects that specifically is welcome but right now I'm mainly interested in the overall viability of doing this at all using only Kafka topics, not a RDB or Key-value store. So some extra I/O moving messages between non-optimal topic partitions is okay.
Is this doable with the 0.9 consumer where I can control knowing which partitions are assigned to each consumer, so I can let auto-rebalancing occur while endlessly peek/merge-to-T/commit-offset the oldest message on each actual partition? I must have partition awareness to have a chance of this working.
Due to needing shared merge state (the last date added to T), is it better to stick with multiple partition-aware consumers in a single process, parallel processes or multiple servers given where that state will need to be? I favor keeping the state onboard in shared memory not networked in ZK or whatever. On a restart I can get it once and maintain it while running if on a single machine.
Am I overlooking any Kafka features that would make what I describe easier or more efficient, like some atomic message move between topics? I know I am going against the grain of its design and this scenario is similar to TS.
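The priority-queue style merge described in this question can be sketched in plain Java over bounded inputs. This is only an illustration with hypothetical names: each inner list stands for one date-ordered input partition, and delivery dates are represented as long timestamps. No Kafka client calls are involved.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** K-way merge of individually ordered inputs into one ordered output. */
final class OrderedMergeSketch {

    /** Cursor into one date-ordered input (stand-in for one topic partition). */
    private static final class Cursor {
        final List<Long> source;
        int next = 0;
        Cursor(List<Long> source) { this.source = source; }
        long head() { return source.get(next); }
        boolean hasNext() { return next < source.size(); }
    }

    static List<Long> merge(List<List<Long>> partitions) {
        // The heap is ordered by each cursor's current head, i.e. its oldest unmerged message.
        PriorityQueue<Cursor> heap = new PriorityQueue<>(Comparator.comparingLong(Cursor::head));
        for (List<Long> p : partitions) {
            if (!p.isEmpty()) {
                heap.add(new Cursor(p));
            }
        }
        List<Long> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();
            out.add(c.head());
            c.next++;
            if (c.hasNext()) {
                heap.add(c); // re-insert keyed on its new head
            }
        }
        return out;
    }
}
```

The bounded version hides the hard part of the streaming case: a partition that is currently empty cannot simply be skipped, because a message with an earlier date may still arrive on it, so a live merger has to decide how long to wait before emitting.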