Kafka Streams: Stream tasks moving across app instances

Consider a streams application with an input topic of 6 partitions and a state store. Assume there is a constant inflow of over 5 million records each hour. If the application runs on a single node, the state for all incoming records stays on that node. Now, if we add another instance on a different node, I assume the partitions would be balanced equally between the two instances (assume we set the max threads to 3 in each instance).
My question is: when a rebalance occurs and a partition moves from one instance to another (and vice versa), the state store for those partitions has to be restored on their respective instances, and that takes time. Wouldn't frequent shuffling of partitions between instances (especially at significant volume) due to rebalancing be a major overhead and hurt streaming performance? I am not sure it is possible to completely prevent rebalancing (which I understand exists for load balancing), but wouldn't this prevent scaling out to multiple instances for a topic that uses the store?

Kafka Streams uses its own implementation of PartitionAssignor (not the default one used by KafkaConsumer) and implements a sticky assignment strategy. During a rebalance, it is known which partitions were assigned to which KafkaStreams instance, and partitions are reassigned to the same instance if possible to avoid state movement. Load balancing also plays a role, of course, to allow for scaling scenarios.

Related

Uneven partition assignment in kafka streams

I am experiencing strange assignment behavior with Kafka Streams. I have a 3-node Kafka Streams cluster. My stream is pretty straightforward: one source topic (24 partitions; all Kafka brokers run on machines other than the Kafka Streams nodes), and our stream graph only takes messages, groups them by key, performs some filtering and stores everything to a sink topic. Everything runs with 2 stream threads on each node.
However, whenever I do a rolling update of my Kafka Streams app (shutting down only one instance at a time, so the other two nodes keep running), it ends up with an uneven number of partitions per "node" (usually 16-9-0). Only after I restart node01, and sometimes node02, does the cluster get back to a more even state.
Can somebody advise how I can achieve a more even distribution without additional restarts?
I assume all nodes running the Kafka Streams app have identical group IDs for consumption.
I suggest you check to see if the partition assignment strategy your consumers are using isn't org.apache.kafka.clients.consumer.RangeAssignor.
If this is the case, configure it to be org.apache.kafka.clients.consumer.RoundRobinAssignor. This way, when the group coordinator receives a JoinGroup request and hands the partitions over to the group leader, the group leader will ensure the spread between the nodes isn't uneven by more than 1.
Unless you're using an older version of Kafka Streams, the default is the RangeAssignor, which does not guarantee an even spread across consumers.
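For illustration, here is a minimal sketch of that setting for a plain KafkaConsumer; the group ID, topic and bootstrap address are made-up placeholders. (As the first answer above notes, Kafka Streams wires in its own assignor, so this setting applies to regular consumer-group applications.)

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.RoundRobinAssignor;
import org.apache.kafka.common.serialization.StringDeserializer;

public class RoundRobinConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Spread partitions as evenly as possible (difference of at most 1 between members).
        props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG, RoundRobinAssignor.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("source-topic"));
            // poll loop omitted
        }
    }
}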
Is your Kafka Streams application stateful? If so, you can possibly thank this well-intentioned KIP: https://cwiki.apache.org/confluence/display/KAFKA/KIP-441%3A+Smooth+Scaling+Out+for+Kafka+Streams
If you want to override this behaviour, you can set acceptable.recovery.lag=9223372036854775807 (Long.MAX_VALUE).
The definition of that config, from https://docs.confluent.io/platform/current/streams/developer-guide/config-streams.html#acceptable-recovery-lag:
The maximum acceptable lag (total number of offsets to catch up from the changelog) for an instance to be considered caught-up and able to receive an active task. Streams only assigns stateful active tasks to instances whose state stores are within the acceptable recovery lag, if any exist, and assigns warmup replicas to restore state in the background for instances that are not yet caught up. Should correspond to a recovery time of well under a minute for a given workload. Must be at least 0.
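For illustration, a rough sketch of how that override could look in the Streams configuration; the application ID and bootstrap address are placeholder assumptions, and only the last property matters here.

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class StickyStateConfig {
    public static Properties streamsProps() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");     // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder
        // Long.MAX_VALUE effectively disables the KIP-441 "caught-up" check,
        // so active stateful tasks are not moved onto warmed-up instances.
        props.put("acceptable.recovery.lag", String.valueOf(Long.MAX_VALUE));
        return props;
    }
}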

What should be an appropriate value for Kafka consumer concurrency (with regard to scaling)?

I'm creating a new service which will be a consumer of a Kafka topic. It's a Spring app, so I'm using spring-kafka.
The topic has 20 partitions. In the beginning, there are two instances in Kubernetes. In the future, depending on load, we want to scale out and run additional instances. What should be the appropriate value of kafka.consumer.concurrency in my case? I bet it's 10, but am I right?
When there are only two service instances, each one runs 10 threads and each thread reads from one partition. But what if I want to scale the service? What will happen if I run two additional instances? As far as I know, when a new consumer joins a consumer group, the set of consumers attempts to "rebalance" the load to assign partitions to each consumer.
Does it mean that the two existing instances will reduce their number of threads to 5 and listen on only 5 partitions (i.e. each instance will handle 5 partitions)?
Is my understanding correct?
If not, what should be the appropriate value in my case?
Documentation says:
if you have more partitions than you have threads, some threads will receive data from multiple partitions
Just to make sure: if I set concurrency to e.g. 5, each thread will read from two partitions. Will it affect service performance?
When a new consumer is added to the same group, Kafka will perform a rebalance; if there are more consumers than partitions, there is no guarantee that each instance will get 5 partitions - Kafka just sees 40 consumers, and the 20 partitions will be distributed among them. However, it probably depends on the configured assignor - the default RangeAssignor seems to do it that way.
However, when you exceed the number of partitions, the containers will have idle threads (assigned no partitions).
Generally, the best practice is to over-provision the number of partitions and let each consumer handle multiple partitions; that way, when you scale out, you won't end up with idle consumers.
If not, what should be the appropriate value in my case?
It depends entirely on your application.
Bottom line: if you start with 2x10 consumers, and you expect you might eventually require 10x10, you should start out with 100 partitions.
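For illustration, a rough spring-kafka sketch of setting the container concurrency; the bean wiring is an assumption, and the value 10 simply matches the 2-instance / 20-partition example above.

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;

@Configuration
public class KafkaConsumerConfig {

    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory(
            ConsumerFactory<String, String> consumerFactory) {
        ConcurrentKafkaListenerContainerFactory<String, String> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory);
        // 10 listener threads per instance: with 2 instances and 20 partitions,
        // each thread gets one partition; consumers beyond the partition count sit idle.
        factory.setConcurrency(10);
        return factory;
    }
}

With @KafkaListener methods using this factory, scaling from 2 to 4 instances means 40 consumers competing for 20 partitions, so roughly half of the threads end up idle, which is exactly the over-provisioning trade-off described above.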

kafka consumer rebalancing in case of manual/assigned partitioning

I have some doubts regarding rebalancing. Right now, I am manually assigning partitions to consumers. As per the docs, there will be no rebalancing if a consumer leaves or crashes in a consumer group.
Let's say there are 3 partitions and 3 consumers in the same group, and each partition is manually assigned to one consumer. After some time, the 3rd consumer goes down. Since there is no rebalancing, what measures can I take to ensure minimum downtime?
Do I need to change the config of either of the first two consumers to start consuming from the 3rd partition, or do something else?
Well, I don't know why you would assign partitions to consumers manually.
I think you need to write a ConsumerRebalanceListener: https://kafka.apache.org/0100/javadoc/org/apache/kafka/clients/consumer/ConsumerRebalanceListener.html
My advice: just let Kafka decide which consumer listens to which partition, and you won't have to worry about this.
Although there might be context that would make the approach valid, as written, I question your approach a little bit.
The best way to ensure minimum downtime is to let the kafka brokers and zookeeper do what they're good at, managing your workload (partitions) among your consumers, which includes reassigning partitions when a consumer goes down.
Your best path is likely to use the onPartitionsRevoked and onPartitionsAssigned callbacks to handle whatever logic you need in order to take over a new partition (see JR's link for more detailed information on these events).
I'll describe a recent use-case I've had, in the hope it is relevant to your use-case.
I recently had 5 consumers that required an in-memory cache of 50 million objects. Without partitioning, each consumer had its own cache, resulting in 250 million objects in total.
To reduce that number back to the original 50 million, we could use the onPartitionsRevoked event to clear the cache and the onPartitionsAssigned event to repopulate it with the entries relevant to the assigned partitions.
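For illustration, a rough sketch of such a listener; the per-partition cache and its loading logic are hypothetical placeholders, not part of the Kafka API.

import java.util.Collection;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.common.TopicPartition;

public class CacheRebalanceListener implements ConsumerRebalanceListener {

    // Hypothetical per-partition cache; the real application would hold its domain objects here.
    private final Map<TopicPartition, Object> cache = new ConcurrentHashMap<>();

    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // Evict entries for partitions this consumer no longer owns.
        partitions.forEach(cache::remove);
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // Repopulate the cache for the newly assigned partitions.
        partitions.forEach(tp -> cache.put(tp, loadEntriesFor(tp)));
    }

    private Object loadEntriesFor(TopicPartition tp) {
        // Hypothetical: rebuild the in-memory state for this partition
        // (e.g. by reading a compacted topic or an external store).
        return new Object();
    }
}

A listener like this is passed as the second argument to consumer.subscribe(topics, listener), so Kafka invokes it whenever the group rebalances.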
Short of using those two handlers, if you really want to manually assign your partitions, you're going to have to do all of the orchestration yourself:
Something to monitor if one of the other consumers is down
Something to pick up the dead consumer's partition and process it
Orchestrate communication between the consumers so that, when the dead consumer comes back up, it can start working again.
As you can probably tell from the list, you're in for a real world of hurt if you force yourself down that path, and you probably won't do a better job than the Kafka brokers - there's an entire business whose sole focus is developing and maintaining Kafka so you don't have to handle all of that complexity.

Kafka Topology Best Practice

I have 4 machines where a Kafka cluster is configured with a topology in which
each machine has one ZooKeeper node and two brokers.
With this configuration, what do you advise as the maximum number of topics and partitions for the best performance?
Replication factor: 3
Using Kafka 0.10.XX
Thanks!
Each topic is restricted to 100,000 partitions, no matter how many nodes you have (as of July 2017).
As for the number of topics, that depends on how much RAM the smallest machine has. This is because ZooKeeper keeps everything in memory for quick access (it also doesn't shard the znodes, it just replicates them across the ZK nodes upon write). This effectively means that once you exhaust one machine's memory, ZooKeeper will fail to add more topics. You will most likely run out of file handles before reaching this limit on the Kafka broker nodes.
To quote the KAFKA docs on their site (6.1 Basic Kafka Operations https://kafka.apache.org/documentation/#basic_ops_add_topic):
Each sharded partition log is placed into its own folder under the Kafka log directory. The name of such folders consists of the topic name, appended by a dash (-) and the partition id. Since a typical folder name can not be over 255 characters long, there will be a limitation on the length of topic names. We assume the number of partitions will not ever be above 100,000. Therefore, topic names cannot be longer than 249 characters. This leaves just enough room in the folder name for a dash and a potentially 5 digit long partition id.
To quote the Zookeeper docs (https://zookeeper.apache.org/doc/trunk/zookeeperOver.html):
The replicated database is an in-memory database containing the entire data tree. Updates are logged to disk for recoverability, and writes are serialized to disk before they are applied to the in-memory database.
Performance:
Depending on your publishing and consumption semantics, the appropriate number of topics and partitions will change. The following is a set of questions you should ask yourself to gain insight into a potential solution (your question is very open ended):
Is the data I am publishing mission critical (i.e. cannot lose it, must be sure I published it, must have exactly once consumption)?
Should I make the producer.send() call as synchronous as possible or continue to use the asynchronous method with batching (do I trade-off publishing guarantees for speed)?
Are the messages I am publishing dependent on one another? Does message A have to be consumed before message B (implies A published before B)?
How do I choose which partition to send my message to?
Should I assign the message to a partition explicitly (extra producer logic), let the cluster decide in a round-robin fashion, or assign a key which will hash to one of the topic's partitions (you need an evenly distributed hash to get good load balancing across partitions)? See the producer sketch after this list.
How many topics should you have? How is this connected to the semantics of your data? Will auto-creating topics for many distinct logical data domains be efficient (think of the effect on Zookeeper and administrative pain to delete stale topics)?
Partitions provide parallelism (more consumers possible) and possibly increased positive load balancing effects (if producer publishes correctly). Would you want to assign parts of your problem domain elements to specific partitions (when publishing send data for client A to partition 1)? What side-effects does this have (think refactorability and maintainability)?
Will you want to make more partitions than you need so you can scale up if needed with more brokers/consumers? How realistic is automatic scaling of a KAFKA cluster given your expertise? Will this be done manually? Is manual scaling viable for your problem domain (are you building KAFKA around a fixed system with well known characteristics or are you required to be able to handle severe spikes in messages)?
How will my consumers subscribe to topics? Will they use pre-configured configurations or use a regex to consume many topics? Are the messages between topics dependent or prioritized (need extra logic on consumer to implement priority)?
Should you use different network interfaces for replication between brokers (i.e. port 9092 for producers/consumers and 9093 for replication traffic)?
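To illustrate the partition-selection options from the list above, here is a rough producer sketch; the topic name, keys and broker address are made-up placeholders.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class PartitionChoiceExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // 1. Explicit partition (extra producer logic): always goes to partition 1.
            producer.send(new ProducerRecord<>("orders", 1, "client-A", "payload"));

            // 2. Keyed record: the default partitioner hashes the key, so all
            //    records for "client-A" land on the same partition.
            producer.send(new ProducerRecord<>("orders", "client-A", "payload"));

            // 3. No key: the producer spreads records across partitions
            //    (round robin in older clients, sticky batching in newer ones).
            producer.send(new ProducerRecord<>("orders", null, "payload"));
        }
    }
}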
Good Links:
http://cloudurable.com/ppt/4-kafka-detailed-architecture.pdf
https://www.slideshare.net/ToddPalino/putting-kafka-into-overdrive
https://www.slideshare.net/JiangjieQin/no-data-loss-pipeline-with-apache-kafka-49753844
https://kafka.apache.org/documentation/

Kafka Streams - all instances local store pointing to the same topic

We have the following problem:
We want to listen on a certain Kafka topic and build its "history": for a specified key, extract some data, add it to the already existing list for that key (or create a new one if it does not exist), and put it to another topic, which has only a single partition and is highly compacted. Another app can then just listen on that topic and update its own history list.
I'm wondering how this fits with the Kafka Streams library. We can certainly use aggregation:
// build a per-key "history" string by concatenating incoming values
msgReceived
    .map((key, word) -> new KeyValue<>(key, word))        // identity map, kept from the original example
    .groupBy((k, v) -> k, stringSerde, stringSerde)       // group records by key
    .aggregate(String::new,                                // initializer: empty history
        (k, v, stockTransactionCollector) -> stockTransactionCollector + "|" + v,  // append each value
        stringSerde, "summaries2")                         // name of the backing state store
    .to(stringSerde, stringSerde, "transaction-summary50"); // write the resulting KTable to the output topic
which creates a local store backed by Kafka and uses it as a history table.
My concern is that if we decide to scale such an app, each running instance will create a new backing topic ${applicationId}-${storeName}-changelog (I assume each app has a different applicationId). Each instance starts consuming the input topic, gets a different set of keys and builds a different subset of the state. If Kafka decides to rebalance, some instances will start to miss some historic state in their local stores, as they get a completely new set of partitions to consume from.
The question is: if I just set the same applicationId for each running instance, will each instance eventually replay all data from the very same Kafka topic, so that every running instance has the same local state?
Why would you create multiple apps with different IDs to perform the same job? The way Kafka Streams achieves parallelism is through tasks:
An application’s processor topology is scaled by breaking it into multiple tasks.
More specifically, Kafka Streams creates a fixed number of tasks based on the input stream partitions for the application, with each task assigned a list of partitions from the input streams (i.e., Kafka topics). The assignment of partitions to tasks never changes so that each task is a fixed unit of parallelism of the application.
Tasks can then instantiate their own processor topology based on the assigned partitions; they also maintain a buffer for each of its assigned partitions and process messages one-at-a-time from these record buffers. As a result stream tasks can be processed independently and in parallel without manual intervention.
If you need to scale your app, you can start new instances running the same app (same application ID), and some of the already assigned tasks will be reassigned to the new instance. The migration of the local state stores is handled automatically by the library:
When the re-assignment occurs, some partitions – and hence their corresponding tasks including any local state stores – will be “migrated” from the existing threads to the newly added threads. As a result, Kafka Streams has effectively rebalanced the workload among instances of the application at the granularity of Kafka topic partitions.
I recommend you have a look at this guide.
My concern is that if we decide to scale such an app, each running instance will create a new backing topic ${applicationId}-${storeName}-changelog (I assume each app has a different applicationId). Each instance starts consuming the input topic, gets a different set of keys and builds a different subset of the state. If Kafka decides to rebalance, some instances will start to miss some historic state in their local stores, as they get a completely new set of partitions to consume from.
Some assumptions are not correct:
if you run multiple instances of your application to scale it out, all of them must have the same application ID (cf. Kafka's consumer group management protocol) -- otherwise, the load will not be shared, because each instance will be considered its own application and each instance will get all partitions assigned.
Thus, if all instances use the same application ID, all running application instances will use the same changelog topic names, and what you intend to do should work out of the box.
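For illustration, a minimal sketch of one such instance using the current Streams API (not the older 0.10-era API shown in the question); the application ID, topic names and bootstrap address are placeholders, and the point is only that every instance ships the exact same application.id.

import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class HistoryApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Identical on every instance, so all instances join the same consumer group
        // and share the same ${applicationId}-${storeName}-changelog topics.
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "history-builder");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input-topic").to("output-topic");   // stand-in for the real aggregation topology

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}

Starting a second copy of this exact program on another node makes the library redistribute the tasks (and migrate their state stores) between the two instances automatically.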