I know Kafka can handle tons of traffic. However, how well does it scale for a large number of concurrent clients?
Each client would have its own unique group_id (and, as a consequence, Kafka would keep track of each one's offsets).
Would that be an issue for Kafka 0.9+ with offsets stored internally?
Would that be an issue for Kafka 0.8 with offsets stored in Zookeeper?
Some Kafka users such as LinkedIn have reported in the past that a single Kafka broker can support ~10K client connections. This number may vary depending on hardware, configuration, etc.
As long as the request rate is not too high, the limiting factor is probably just the open-file-descriptors limit as configured in the operating system, see e.g. http://docs.confluent.io/current/kafka/deployment.html#file-descriptors-and-mmap for more information.
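For completeness, here is a minimal sketch of the pattern in question, assuming a modern (2.x+) Java client; the bootstrap address and topic name are placeholders. Each client uses its own unique group.id, so the broker tracks its committed offsets independently of every other client:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.UUID;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class UniqueGroupClient {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");        // placeholder address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "client-" + UUID.randomUUID()); // unique group per client
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("some-topic"));          // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
                // Offsets are committed per group.id, so each client keeps its own position.
            }
        }
    }
}
```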
I have a Kafka cluster (using Aiven on AWS):
Kafka Hardware
Startup-2 (2 CPU, 2 GB RAM, 90 GB storage, no backups) 3-node high availability set
Ping between my consumers and the Kafka Broker is 0.7ms.
Background
I have a topic such that:
It contains data about 3000 entities.
Entity lifetime is a week.
Each week there will be a different ~3,000 entities (on average).
Each entity may have between 15k and 50k messages in total.
There can be at most 500 messages per second.
Architecture
My team built an architecture with a group of consumers. They parse this data, perform some transformations (without any filtering!!) and then send the final messages back to Kafka, to topic=<entity-id>.
In other words, I upload the data back to Kafka, to a topic that contains only the data of a specific entity.
Questions
At any given time, there can be up to 3-4k topics in Kafka (1 topic for each unique entity).
Can my Kafka handle it well? If not, what do I need to change?
Do I need to delete topics, or is it fine to have (a lot of!!) unused topics accumulate over time?
Each consumer that consumes the final messages will consume 100 topics at the same time. I know Kafka clients can consume multiple topics concurrently, but I'm not sure what the best practices are for that (see the sketch after the requirements below).
Please share your concerns.
Requirements
Please focus on the potential problems of this architecture and try not to talk about alternative architectures (fewer topics, more consumers, etc.).
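For the multi-topic consumption question above, here is a minimal sketch of what I have in mind, assuming the Java client and placeholder broker/topic/group names: one consumer subscribed to ~100 entity topics with a single subscribe call, with the group protocol spreading the partitions across the consumers in the group.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class MultiTopicConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");   // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "final-message-consumers"); // placeholder group name
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");

        // Build the list of per-entity topics this consumer should read (IDs are placeholders).
        List<String> entityTopics = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            entityTopics.add("entity-" + i);
        }

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // One subscribe call covers all topics; the group protocol spreads
            // their partitions across the consumers in the group.
            consumer.subscribe(entityTopics);
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("%s[%d]@%d%n", record.topic(), record.partition(), record.offset());
                }
            }
        }
    }
}
```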
The number of topics is not so important in itself, but each Kafka topic is partitioned and the total number of partitions could impact performance.
The general recommendation from the Apache Kafka community is to have no more than 4,000 partitions per broker (this includes replicas). The KIP that discusses this limit explains some of the possible issues you may face if it is exceeded, and with 3,000 topics it would be easy to do so unless you choose a low partition count and/or replication factor for each topic.
Choosing a low partition count for a topic is sometimes not a good idea, because it limits the parallelism of reads and writes, leading to performance bottlenecks for your clients.
Choosing a low replication factor for a topic is also sometimes not a good idea, because it increases the chance of data loss upon failure.
Generally it's fine to have unused topics on the cluster, but be aware that there is still a performance impact for the cluster to manage the metadata for all these partitions, and some operations will still take longer than if the topics were not there at all.
There is also a per-cluster limit, but it is much higher (200,000 partitions). So your architecture might be better served simply by increasing the node count of your cluster.
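As a sanity check, here is a hedged sketch that counts how many partition replicas each broker currently hosts, for comparison against the ~4,000-per-broker guideline. It assumes a reasonably recent Java AdminClient and a placeholder bootstrap address; method names vary slightly across client versions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.Set;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.Node;
import org.apache.kafka.common.TopicPartitionInfo;

public class PartitionCounter {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder

        try (Admin admin = Admin.create(props)) {
            Set<String> topics = admin.listTopics().names().get();
            Map<String, TopicDescription> descriptions = admin.describeTopics(topics).all().get();

            // Count partition replicas per broker (replicas count toward the guideline too).
            Map<Integer, Integer> replicasPerBroker = new HashMap<>();
            for (TopicDescription description : descriptions.values()) {
                for (TopicPartitionInfo partition : description.partitions()) {
                    for (Node replica : partition.replicas()) {
                        replicasPerBroker.merge(replica.id(), 1, Integer::sum);
                    }
                }
            }
            replicasPerBroker.forEach((broker, count) ->
                    System.out.printf("broker %d hosts %d partition replicas%n", broker, count));
        }
    }
}
```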
I am trying to implement a way to randomly access messages from Kafka, using KafkaConsumer.assign(partition) and KafkaConsumer.seek(partition, offset),
and then poll for a single message.
Yet I can't get past 500 messages per second in this case. In comparison, if I "subscribe" to the partition I get 100,000+ msg/sec (with a 1000-byte message size).
I've tried:
Broker, ZooKeeper, and consumer on the same host and on different hosts (no replication is used)
1 and 15 partitions
The default thread configuration in server.properties, and increasing the I/O and network threads to 20
A single consumer assigned to a different partition each time, and one consumer per partition
A single thread to consume, and multiple threads to consume (calling multiple different consumers)
Adding two brokers and a new topic with its partitions on both brokers
Starting multiple Kafka consumer processes
Changing message sizes: 5k, 50k, 100k
In all cases the minimum I get is ~200 msg/sec, and the maximum is 500 if I use 2-3 threads. But going above that makes the .poll() call take longer and longer (starting from 3-4 ms on a single thread up to 40-50 ms with 10 threads).
My naive understanding of Kafka is that the consumer opens a connection to the broker and sends a request to retrieve a small portion of its log. While all of this involves some latency, and retrieving a batch of messages is obviously much better, I would imagine it would scale with the number of receivers involved, at the expense of increased server usage on both the VM running the consumers and the VM running the broker. But both of them are idling.
So apparently there is some synchronization happening on the broker side, but I can't figure out whether it is due to my usage of Kafka or some inherent limitation of using .seek.
I would appreciate some hints on whether I should try something else, or whether this is all I can get.
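For reference, this is roughly the pattern I am measuring (a minimal sketch; the bootstrap address, topic, deserializers and max.poll.records=1 are placeholders/assumptions, not my exact code):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class RandomAccessReader {
    private final KafkaConsumer<byte[], byte[]> consumer;
    private final TopicPartition partition;

    public RandomAccessReader(String bootstrap, String topic, int partitionId) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap);
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 1);   // one message per poll
        consumer = new KafkaConsumer<>(props);
        partition = new TopicPartition(topic, partitionId);
        consumer.assign(Collections.singletonList(partition));  // no group coordination
    }

    /** Fetch the single message stored at the given offset, or null if none arrives in time. */
    public byte[] readAt(long offset) {
        consumer.seek(partition, offset);                       // every seek discards the fetcher's read-ahead
        ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<byte[], byte[]> record : records) {
            return record.value();
        }
        return null;
    }
}
```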
Kafka is a streaming platform by design. That means many, many things have been developed to accelerate sequential access; storing messages in batches is just one of them. When you use poll() on a subscription you utilize Kafka in that way, and Kafka does its best. Random access is something Kafka was not designed for.
If you want fast random access to distributed big data, you want something else, for example a distributed database like Cassandra or an in-memory system like Hazelcast.
You could also transform the Kafka stream into another one that lets you consume it sequentially.
We are using Kafka 0.10.x. I am looking for a way to stop a Kafka publisher from sending messages after a certain message count/limit is reached in an hour. The goal here is to restrict a user to sending only a certain number of messages per hour/day.
If anyone has come across a similar use case, please share your findings.
Thanks in advance.
Kafka has a few throttling and quota mechanisms but none of them exactly match your requirement to strictly limit a producer based on message count on a daily basis.
From the Apache Kafka 0.11.0.0 documentation at https://kafka.apache.org/documentation/#design_quotas
Kafka cluster has the ability to enforce quotas on requests to control the broker resources used by clients. Two types of client quotas can be enforced by Kafka brokers for each group of clients sharing a quota:
Network bandwidth quotas define byte-rate thresholds (since 0.9)
Request rate quotas define CPU utilization thresholds as a percentage of network and I/O threads (since 0.11)
Client quotas were first introduced in Kafka 0.9.0.0. Rate limits on producers and consumers are enforced to prevent clients saturating the network or monopolizing broker resources.
See KIP-13 for details: https://cwiki.apache.org/confluence/display/KAFKA/KIP-13+-+Quotas
The quota mechanism introduced in 0.9 was based on the client.id set in the client configuration, which can be changed easily. Ideally, quotas should be set on the authenticated user name so they are not easy to circumvent, so in 0.10.1.0 an additional authenticated-user quota feature was added.
See KIP-55 for details: https://cwiki.apache.org/confluence/display/KAFKA/KIP-55%3A+Secure+Quotas+for+Authenticated+Users
Both of the quota mechanisms described above work on data volume (i.e. bandwidth throttling) and not on the number of messages or the number of requests. If a client sends lots of small messages, or makes lots of requests that return no messages (e.g., a consumer with fetch.min.bytes set to 0), it can still overwhelm the broker. To address this issue, 0.11.0.0 additionally added support for throttling by request rate.
See KIP-124 for details: https://cwiki.apache.org/confluence/display/KAFKA/KIP-124+-+Request+rate+quotas
With all that as background: if you know that your producer always publishes messages of a certain size, then you can compute a daily limit expressed in MB and a rate limit expressed in MB/sec, which you can configure as a quota. That's not a perfect fit for your need, because a producer might send nothing for 12 hours and then try to send at a faster rate for a short time, and the quota would still limit it to the lower publish rate, because the limit is enforced per second and not per day.
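As a rough illustration of that calculation (the message volume and size below are made-up numbers; producer_byte_rate is the byte-rate quota property you would configure for the user or client):

```java
public class QuotaMath {
    public static void main(String[] args) {
        // Hypothetical numbers: adjust to your own workload.
        long messagesPerDay = 1_000_000L;    // the daily message budget you want to allow
        long avgMessageBytes = 1_000L;       // assumed (roughly constant) message size

        long bytesPerDay = messagesPerDay * avgMessageBytes;
        long secondsPerDay = 24L * 60 * 60;

        // Spread the daily byte budget evenly over the day; this is the value you
        // would configure as the producer_byte_rate quota (bytes/second).
        long producerByteRate = bytesPerDay / secondsPerDay;

        System.out.printf("producer_byte_rate ~ %d bytes/sec (%.3f MB/sec)%n",
                producerByteRate, producerByteRate / (1024.0 * 1024.0));
    }
}
```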
If you don't know the message size, or it varies a lot, then since messages are published using a produce request, you could use request rate throttling to somewhat control the rate at which an authenticated user is allowed to publish messages. But again, it would not be a messages/day limit, nor even a bandwidth limit, but rather a "CPU utilization threshold as a percentage of network and I/O threads". This helps more with avoiding DoS problems than with limiting message counts.
If you would like to see message count quotas or message storage quotas added to Kafka then clearly the Kafka Improvement Proposal (KIP) process works and you are encouraged to submit improvement proposals in this or any other area.
See KIP process for details: https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals
You can make use of these broker configs:
message.max.bytes (default: 1000000) – the maximum size of a message the broker will accept. This has to be smaller than the consumer's fetch.message.max.bytes, or the broker will have messages that can't be consumed, causing consumers to hang.
log.segment.bytes (default: 1 GB) – the size of a Kafka data file. Make sure it's larger than one message. The default should be fine (i.e. large messages probably shouldn't exceed 1 GB in any case; it's a messaging system, not a file system).
replica.fetch.max.bytes (default: 1 MB) – the maximum size of data that a broker can replicate. This has to be larger than message.max.bytes, or a broker will accept messages and fail to replicate them, leading to potential data loss.
I think you can tweak the config to do what you want
Most articles depict Kafka as having better read/write throughput than other message brokers (MBs) like ActiveMQ. Per my understanding, reading/writing with the help of offsets makes it faster, but I am not clear on how the offset makes it faster.
After reading about Kafka's architecture, I have some understanding, but it's not clear what makes Kafka scalable and high-throughput, based on the points below:
Probably with the offset, the client knows exactly which message it needs to read, which may be one of the factors making it high-performance.
In the case of other MBs, the broker needs to coordinate among consumers so that a message is delivered to only one consumer. But this is the case for queues only, not for topics. So what makes a Kafka topic faster than other MBs' topics?
Kafka provides partitioning for scalability, but other message brokers like ActiveMQ also provide clustering. So how is Kafka better for big data/high loads?
In other MBs we can have listeners, so as soon as a message arrives the broker delivers it; in the case of Kafka we need to poll, which means more load on both the broker and client side?
Lots of details on what makes Kafka different and faster than other messaging systems are in Jay Kreps' blog post here:
https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
There are actually a lot of differences that make Kafka perform well including but not limited to:
Maximized use of sequential disk reads and writes
Zero-copy processing of messages
Use of Linux OS page cache rather than Java heap for caching
Partitioning of topics across multiple brokers in a cluster
Smart client libraries that offload certain functions from the brokers
Batching of multiple published messages to yield less frequent network round trips to the broker
Support for multiple in-flight messages
Prefetching data into client buffers for faster subsequent requests.
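To make the batching point concrete, here is a small sketch (assuming the Java producer; the broker address, topic name and exact values are placeholders) of the settings that trade a few milliseconds of latency for fewer, larger network round trips:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class BatchingProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");  // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");

        // Batching knobs: wait up to 10 ms to fill a batch, and allow batches up to 64 KB.
        props.put(ProducerConfig.LINGER_MS_CONFIG, 10);
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024);
        // Compress whole batches to further reduce bytes on the wire.
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 100_000; i++) {
                // Sends are asynchronous; records accumulate into per-partition batches.
                producer.send(new ProducerRecord<>("some-topic", Integer.toString(i), "payload-" + i));
            }
            producer.flush(); // push out any partially filled batches
        }
    }
}
```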
It's largely marketing that Kafka is fast for a message broker. For example, IBM MessageSight appliances did 13M msgs/sec with microsecond latency in 2013. On one machine. A year before Kreps even started the GitHub repo:
https://www.zdnet.com/article/ibm-launches-messagesight-appliance-aimed-at-m2m/
Kafka is good for a lot of things. True low-latency messaging is not one of them. You flatly can't use batch delivery (e.g. a range of offsets) in any purely latency-centric environment. When an event arrives, delivery must be attempted immediately if you want the lowest latency. That doesn't mean waiting around for a couple of seconds to batch-read a block of events, or enduring the overhead of requesting every message. Try using Kafka with an offset range of 1 (so: 1 message) if you want to compare it to a normal push-based broker and you'll see what I mean.
Instead, I recommend focusing on the thing pull-based stream buffering does give you:
Replayability!!!
Personally, I think this makes downstream data engineering systems a bit easier to build in the face of failure, particularly since you don't have to rely on their built-in replication models (if they even have one). For example, it's very easy for me to consume messages, lose the disks, restore the machine, and replay the lost data. The data streams become the single source of truth against which other systems can synchronize and this is exceptionally useful!!!
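As a hedged illustration of that replay pattern (assuming the Java client and a placeholder topic and broker address), the consumer assigns the topic's partitions directly and rewinds them to the beginning, so downstream state can be rebuilt from the stream:

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;

public class ReplayFromBeginning {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");      // replay should not move committed offsets

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Assign every partition of the topic explicitly and rewind to the earliest offset.
            List<TopicPartition> partitions = new ArrayList<>();
            for (PartitionInfo info : consumer.partitionsFor("some-topic")) {   // placeholder topic
                partitions.add(new TopicPartition(info.topic(), info.partition()));
            }
            consumer.assign(partitions);
            consumer.seekToBeginning(partitions);

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Re-apply each event to rebuild the downstream state that was lost.
                    System.out.printf("replaying offset %d: %s%n", record.offset(), record.value());
                }
            }
        }
    }
}
```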
There's no free lunch in messaging, pull and push each have their advantages and disadvantages vs. each other. It might not surprise you that people have also tried push-pull messaging and it's no free lunch either :).
I was wondering if Kafka has any limitations, or starts slowing down (due to GC or other reasons), if we have a large number of channels. We have a heavy volume of data that we will be sending through Kafka (over 2B data points). We were thinking of starting with about 1600 channels.
Has anyone come across issues with such a large number of channels in Kafka? Similarly, do you see issues with local DC replication with this many channels, and lastly, are there any foreseeable issues if we are using MirrorMaker for cross-DC replication with such a large number of channels?
Any pointers are highly appreciated.
Thanks
I believe there is no hard limit on the number of topics in Kafka itself. However, since Kafka stores topic info in ZooKeeper (under /brokers/topics/), and ZooKeeper has a 1 MB limit on the maximum znode size, there can only be a finite number of topics. Also, Kafka brokers store data for different topics under /var/kafka/data/. Performance may suffer if there are too many subdirectories in /var/kafka/data/.
I haven't tried thousands of topics, but Kafka with a few hundred topics works OK for my purposes. The only area where I had problems was dynamic topic creation while using the high-level consumer. It required client reconnection to pick up the new topics on all consumer boxes, which caused time-consuming consumer rebalancing (which sometimes failed, preventing reading from some topics). As a result I had to switch to the simple consumer and take care of read coordination in my code.
I'd recommend creating a simple test app that generates some random data for the number of topics you expect going forward, and verifying that performance is acceptable.
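A minimal sketch of such a test app, assuming the Java producer, a placeholder broker address, and that the topics either already exist or topic auto-creation is enabled; the message counts and sizes are made-up numbers for a quick smoke test:

```java
import java.util.Properties;
import java.util.Random;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ManyTopicsLoadTest {
    public static void main(String[] args) {
        int topicCount = 1600;            // the number of "channels" you expect
        int messagesPerTopic = 1000;      // made-up volume for a quick smoke test

        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");  // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.ByteArraySerializer");

        Random random = new Random();
        long start = System.currentTimeMillis();

        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
            for (int m = 0; m < messagesPerTopic; m++) {
                for (int t = 0; t < topicCount; t++) {
                    byte[] payload = new byte[1000];            // ~1 KB of random data per message
                    random.nextBytes(payload);
                    producer.send(new ProducerRecord<>("channel-" + t,   // one topic per channel
                            Integer.toString(m), payload));
                }
            }
            producer.flush();
        }

        long elapsed = System.currentTimeMillis() - start;
        System.out.printf("Sent %d messages across %d topics in %d ms%n",
                (long) topicCount * messagesPerTopic, topicCount, elapsed);
    }
}
```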