Kafka High Level Vs Low level consumer - apache-kafka

I have the following questions regarding topics and partitions
1) What is the difference between n-topics with m-partitions and nm topics ?
Would there be a difference when accessing m-partitions through m threads and nm topics using n*m different processes
2)A perfect use case differentiating high level and low level consumer
3)In case of a failure (i.e) message not delivered where can i find the error logs in Kafka.

1) What is the difference between n-topics with m-partitions and nm topics ?
There has to be at least one partition for every topic. Topic is just a named group of partitions and partitions are really streams of data. The code that uses Kafka producer normally is not concerned with partitions, it just sends a message to a topic.
By default producer uses round robin approach to select a partiton to store a message but you can create a custom one if needed and select a partition based on message's content.
If there is only one partition, only one broker processes messages for the topic and appends them to a file.
On the other hand, if there are as many partitions as brokers, message processing is parallelized and there is up to m times (minus overhead) speedup. That assumes that each broker is running on its own box and kafka data storage is not shared among brokers.
If there are more partitions for a topic than brokers, Kafka tries to distribute them evenly among all of brokers.
The same goes to reading from Kafka. If there is only one partition, the kafka consumer speed is limited by max read speed of a single disk. If there are multiple partitions, messages from all partitions (on different brokers) are retrieved in parallel.
1a) Would there be a difference when accessing m-partitions through m threads and nm topics using n*m different processes
You're mixing partitions and topics here, see my answer above.
2)A perfect use case differentiating high level and low level consumer
High level consumer :
I just want to use Kafka as extermely fast persistent FIFO buffer and not worry much about details.
Low level consumer :
I want to have a custom partition data consuming logic, e.g. start reading data from newly created topics without a need of consumer reconnection to brokers.
3)In case of a failure (i.e) message not delivered where can i find the error logs in Kafka.
Kafka uses log4j for logging. It depends on its configuration where the log is stored (in case of producer and consumer).
Kafka broker logs are normally stored in /var/log/kafka/.

Related

Advice on how I can decrease Kafka Lag

I'm relatively new to working with Kafka, below is a sample of what my current set up is.
Kafka Setup
Multiple topics that all have one partition each. 2 Consumer Groups with each group containing one consumer.
The issue I am seeing is that the Lag is enormous, sometimes upwards of 8-10 hours waiting for consuming, the load is about 100-200 million messages a day
What steps should I look at in order to address this? Is it as simple as reassigning partitions or creating new partitions for the 3 topics that are being consumed by the two consumers? - I've also looked at compressing the contents of the producer with gzip but it doesn't really help in terms of the lag. I've looked at network connections and don't feel that it is anything got to do with this. If anyone could point me in the direction of Kafka and Low Latency documents that would be good also.
Generally the flow is to parallelize your consumption through the increase on the number of partitions and consumers in consumer groups that subscribe to those topics with increased partitions (Nconsumers <= Npartitions).
And distribute your topics with increase on the number of brokers in your cluster.
So from topic considerations:
Less partition per topic result:
in producer and/or consumer lag
starved or overloaded brokers and consumers.
(But take into account) More partition per topic result in:
More broker resources – file handlers and memory.
There is an overhead with each additional partition and a number of partitions a broker can handle is limited.
Overhead of replication load
Then increase the number of consumers in that consumer groups.
Try increasing partition per topic, but by itself it should not help! You also will need to increase the number of consumers in your consumer group. Is that single consumers or consumer groups on your diagram? How many consumers in your consumer group vs partitions on the topic that they are subscibed to.
From this in your message:
I've also looked at compressing the contents of the producer with gzip but it doesn't really help in terms of the lag.
I get an idead that your messages may be huge! Is it so? In case yes, try to keep messages small (for example by excluding BLOBs and keep external links to them)
Still the issue may be somewhere else like bad configs, consumer commit messages (acknowledgment handling), etc.
So, I highly advice you to read article Fine-tune Kafka performance with the Kafka optimization theorem
I also advise you to go through Apache Kafka courses on Confluent web-page
This should be added as a comment, but I haven't had permissions to do so. The provided info is very limited with incorrect diagram, which limits the ability to provide an adequate helpfull answer. If possible please correct your diagram and add more details about your set-up like:
broker configuration, file attached;
consumer set-up (Consumer commit messages);
producer set-up;
topic set-up;
kafka version (the defaults differ with major/minor versions)
The provided diagram is not correct in the notion of topic - partition relationship, so I assume it is a mistype and Partition 0 must be substituded with Broker 0, right?
Kafka's topics are divided into several partitions. While the topic is a logical concept in Kafka, a partition is the smallest storage unit that holds a subset of records owned by a topic...
Then there is an open question on the number of partiotions in each topic and the number of topics in each broker, as well as the number of brokers in your cluster!

What is the correlation in kafka stream/table, globalktable, borkers and partition?

I am studying kafka streams, table, globalktable etc. Now I am confusing about that.
What exactly is GlobalKTable?
But overall if I have a topic with N-partitions, and one kafka stream, after I send some data on the topic how much stream (partition?) will I have?
I made some tries and I notice that the match is 1:1. But what if I make topic replicated over different brokers?
Thank you all
I'll try to answer your questions as you have them listed here.
A GlobalKTable has all partitions available in each instance of your Kafka Streams application. But a KTable is partitioned over all of the instances of your application. In other words, all instances of your Kafka Streams application have access to all records in the GlobalKTable; hence it used for more static data and is used more for lookup records in joins.
As for a topic with N-partitions, if you have one Kafka Streams application, it will consume and work with all records from the input topic. If you were to spin up another instance of your streams application, then each application would process half of the number of partitions, giving you higher throughput due to the parallelization of the work.
For example, if you have input topic A with four partitions and one Kafka Streams application, then the single application processes all records. But if you were to launch two instances of the same Kafka Streams application, then each instance will process records from 2 partitions, the workload is split across all running instances with the same application-id.
Topics are replicated across different brokers by default in Kafka, with 3 being the default level of replication. A replication level of 3 means the records for a given partition are stored on the lead broker for that partition and two other follower brokers (assuming a three-node broker cluster).
Hope this clears things up some.
-Bill

Maximum subscription limit of Kafka Topics Per Consumer

What is maximum limit of topics can a consumer subscribe to in Kafka. Am not able to find this value documented anywhere.
If consumer subscribes 500000 or more topics, will there be downgrade in performance.
500,000 or more topics in a single Kafka cluster would be a bad design from the broker point of view. You typically want to keep the number of topic partitions down to the low tens of thousands.
If you find yourself thinking you need that many topics in Kafka you might instead want to consider creating a smaller number of topics and having 500,000 or more keys instead. The number of keys in Kafka is unlimited.
To be technical the "maximum" number of topics you could be subscribed to would be constrained by the available memory space for your consumer process (if your topics are listed explicitly then a very large portion of the Java String pool will be your topics). This seems the less likely limiting factor (listing that many topics explicitly is prohibitive).
Another consideration is how the Topic assignment data structures are setup at Group Coordinator Brokers. They could run out of space to record the topic assignment depending on how they do it.
Lastly, which is the most plausible, is the available memory on your Apache Zookeeper node. ZK keeps ALL data in memory for fast retrieval. ZK is also not sharded, meaning all data MUST fit onto one node. This means there is a limit to the number of topics you can create, which is constrained by the available memory on a ZK node.
Consumption is initiated by the consumers. The act of subscribing to a topic does not mean the consumer will start receiving messages for that topic. So as long as the consumer can poll and process data for that many topics, Kafka should be fine as well.
Consumer is fairly independent entity than Kafka cluster, unless you are talking about build in command line consumer that is shipped with Kafka
That said logic of subscribing to a kafka topic, how many to subscribe to and how to handle that data is upto the consumer. So scalability issue here lies with consumer logic
Last but not the least, I am not sure it is a good idea to consumer too many topics within a single consumer. The vary purpose of pub sub mechanism that Kafka provides through the segregation of messages into various topics is to facilitate the handling of specific category of messages using separate consumers. So I think if you want to consume many topics like few 1000s of them using a single consumer, why divide the data into separate topics first using Kafka.

Apache Kafka Scaling Topics using partitions

We started to use Apache Kafka to persist Timeseries data into a Timeseries database. What we started with was to just have a single topic, a producer writing to this topic and a single consumer reading from this topic and dumping the data to the Timeseries database.
We had 3 broker instances and what we noticed in the first try was that the producer was pretty fast in writing messages to the topic. Within a matter of 30 minutes, we had around 1.5 million messages. The consumer was just doing 300 messages per second.
Our next approach was to partition the topic and have more consumer instances (equal to the number of partitions). This definitely improved on the consumer write speed. Now my questions are:
What happens if I set my topic partition to 6, but I have only 3 broker instances. Which broker instance would be the leader for partition 1 to 6?
Is there a formula to determine how many partitions would I be needing? Since this was our test environment, we could play with it and scale it. We might not be able to do the same on our production environment. So how to determine the partition size?
The partitions get distributed amongst your brokers. It's impossible to know which broker will be elected leader of a given partition -- and it can change over time. Depending on which version of Kafka and which Consumer API you use, your consumer may or may not discover partition leaders on its own. With the SimpleConsumer you have to find partition leaders on your own, and respond to new leader election in your code (instead of having it handled by the API automatically).
As to the number of partitions -- there's no real "formula" other than this: you can have no more parallelism than you have partitions. If you have 4 partitions and 5 consumers, one of the consumers will starve. I usually use numbers like 12 or 60 or multiples thereof for the number of partitions for large topics. Something that divides easily and cleanly among variable numbers of consumers.
Also, note that you can later on change the number of partitions, with some caveats. See this answer for how and what the caveats are.

Kafka how to consume one topic parallel

I read kafka document, still don't know how consume one topic parallel?
Suppose:
I have one topic like "something happened" (don't split this topic), and I have many customers that want to consume it.
So what should I do, so that multiple customers can consume it parallel? Should I use partitioning and customer groups?
I have one idea about this, but I'm not sure whether is it right.
Make many partitions about the same topic, and make one partition to one customer, so one producer must produce the same to these partitions, and every customer in the different customer group, is it right?
Using partitions is the way of being able to parallelize the consumption of a topic. Let´s say you have 10 partitions for your topic, then you can have 10 consumers in the same consumer group reading one partition each. If you have less consumers than partitions, then they would be responsible for more than one partition each. If you have more consumers than partitions, then there would be consumers who would not get any partition assigned to them and have nothing to do except being available to replace another consumer who has died.
Each topic in Kafka can be organized into many partitions. Partition allows for parallel consumption increasing throughput.
Producer publishes the message to a topic using the Kafka producer client library which balances the messages across the available partitions using a Partitioner. The broker to which the producer connects to takes care of sending the message to the broker which is the leader of that partition using the partition owner information in zookeeper. Consumers use Kafka’s High-level consumer library (which handles broker leader changes, managing offset info in zookeeper and figuring out partition owner info etc implicitly) to consume messages from partitions in streams; each stream may be mapped to a few partitions depending on how the consumer chooses to create the message streams.
For example, if there are 10 partitions for a topic and 3 consumer instances (C1,C2,C3 started in that order) all belonging to the same Consumer Group, we can have different consumption models that allow read parallelism as below
Each consumer uses a single stream.
In this model, when C1 starts all 10 partitions of the topic are mapped to the same stream and C1 starts consuming from that stream. When C2 starts, Kafka rebalances the partitions between the two streams. So, each stream will be assigned to 5 partitions(depending on the rebalance algorithm it might also be 4 vs 6) and each consumer consumes from its stream. Similarly, when C3 starts, the partitions are again rebalanced between the 3 streams. Note that in this model, when consuming from a stream assigned to more than one partition, the order of messages will be jumbled between partitions.
Each consumer uses more than one stream
(say C1 uses 3, C2 uses 3 and C3 uses 4). In this model, when C1 starts, all the 10 partitions are assigned to the 3 streams and C1 can consume from the 3 streams concurrently using multiple threads. When C2 starts, the partitions are rebalanced between the 6 streams and similarly when C3 starts, the partitions are rebalanced between the 10 streams. Each consumer can consume concurrently from multiple streams. Note that the number of streams and partitions here are equal. In case the number of streams exceed the partitions, some streams will not get any messages as they will not be assigned any partitions.
Just to add the list of answers, Confluent has a library to do this for you, like Rapids. The project is here:
https://github.com/confluentinc/parallel-consumer
It's open source.
Note: I'm the author.
#Lundahl did all the didactic, I'll give you a pratical sample.
Create a topic for some meaning, e.g. news_events with the parallelism your consumers will need (partitions), you can calc that using the time to process one message, the number of messages you will have and the time you want to have all the messages processed.
Let's create consumers for that topic, you wan't to read the news and your sister or brother also, each one on your time, then every one needs a consumer group id, this way kafka will know that threads a,b,c are for one consumer group and the d,e,c are for the second consumer group, every consumer group will receive the same messages, process it at their time and won't affect each other.
A message will come at one or other partition, never at two, by default Kafka makes round robin to choose the partition, remember, all consumers groups can connect and read data from all the same partitions
I would suggest you to use rapids-kafka-client, a library which do that parallelism stuff for you, choose the number of threads equal the number of partitions you have, choose a consumer group, and see the magic happen.
public static void main(String[] args){
ConsumerConfig.<String, String>builder()
.prop(KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName())
.prop(VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName())
.prop(GROUP_ID_CONFIG, "news-app")
.topics("news_events")
.consumers(7)
.callback((ctx, record) -> {
System.out.printf("status=consumed, value=%s%n", record.value());
})
.build()
.consume()
.waitFor();
}
You can read more about consumer groups, topics and partitions here
I assume what you want is parallel consumption between customers in a publish/subscribe fashion.
Beside that, you can also have parallel consumption within a single customer in order to scale the consumer application.
Parallel consumption between customers
If by "customers" you mean different organizations which are interested in consuming topic's messages independently, all you need is consumer groups.
This is a simple publish/subscribe pattern where each customer runs its own application and read all topic's messages without interfering with others.
Each customer application can be seen as a consumer group, made up by one or more Kafka consumers (whether running on a single node or spread across a cluster), all of them sharing the consumer group's identifier.
You achieve this goal regardless of partitions. In case topic is partitioned, you don't need to worry about writing the same message to all partitions. Remember that in Kafka messages are durable, a message read by a Kafka consumer is not deleted and is available to be read by other Kafka consumers from a different consumer group (until it expires). Furthermore, partitions are not meant to work like this, they help scale storage of data (at a certain point all topic's data wouldn't fit into just one node) and scale consumer applications as you can see below.
Parallel consumption within single customer
You can further parallelize, or better to say, scale the consumption of messages within a consumer group with, in fact, Kafka consumers.
Imagine topic is huge, producers write into it with an high rate, and consumer group has only one consumer: this poor consumer may struggle to keep up with the message arrival rate, especially if message processing is time-consuming too.
That's the case where you need partitions and more consumers in your consumer group, so that Kafka will assign partitions to consumers to distribute reading load among them.
How partition assignment works has been already explained in other answers here, but basically for a given consumer group:
each topic's partition is assigned exclusively to one consumer,
a consumer might get assigned more partitions
if consumers are more than topic's partitions, some of them will stay idle as they won't get assigned any partition to consume from.
Remember that message ordering in Kafka is guaranteed only at partition level, so if you have many partitions and ordering matters, you need to choose the right message key to partition data according to your requirements.
For example if you want messages be ordered by device, a device_id would be your key that guarantees messages of the same device will be written to the same partition.