I understand that the fundamental approach to parallelization with Kafka is to use partitioning. However, I have a special situation: I have to leverage an existing infrastructure that only has 6 partitions, and I need to process millions and millions of records per second.
Is there a way to optimize further, so that multiple Kafka Streams consumers could read from a single partition at the same time and share the load equally?
The simplest way is to create a "helper" topic with the desired number of partitions. This topic can be configured with a very short retention time, because the original data is safely stored in the actual input topic. Routing all data through this helper topic allows for more parallelism downstream:
builder.stream("input-topic")
       .through("helper-topic-with-many-partitions")
       ... // actual processing
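Note that in Kafka Streams 2.6+, through() is deprecated in favor of repartition(), which creates and manages the intermediate repartition topic for you. A minimal sketch along the same lines (the topic name and partition count are illustrative):

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Repartitioned;

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> widened = builder.<String, String>stream("input-topic")
        // Streams creates an internal repartition topic with 50 partitions
        .repartition(Repartitioned.numberOfPartitions(50));
// ... actual processing on `widened`, now with up to 50-way parallelism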
Partitions are the unit of parallelization. With 6 partitions you can have at most 6 instances (of your Kafka Streams application) consuming data. If each instance runs on a separate machine with a 1 Gbps network link (roughly 100 MB/s of usable bandwidth), you could be reading about 600 MB/s in total.
If that's not enough, you'd need to repartition the data.
Now, to distribute your processing, you would need to run each Kafka Streams instance (with the same consumer group) on a different machine.
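As a sketch of what that deployment looks like (the application id and broker address are placeholders, and `builder` is the StreamsBuilder from the snippet above): every machine runs the same binary, and instances sharing an application.id form one consumer group that splits the 6 partitions among themselves.

import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
// application.id doubles as the consumer group id; keep it identical on all instances.
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");

KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start(); // start the same application on up to 6 machines (one per partition)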
Here's a short video that demonstrates how Kafka Streams (via KSQL) is parallelized across 5 processes: https://www.youtube.com/watch?v=denwxORF3pU
It all depends on partitions and executors. With 6 partitions I can usually achieve 500K+ messages per second, depending of course on the complexity of the processing.
Related
I have the following use case:
50 students write their own code which consumes a preloaded dataset, and they will repeat it many times.
They all need to do the same task: read the data in order, and process it.
The dataset is a time series containing 600 million messages, each message is about 1.3KB.
Processing will probably be in Spark, but not mandatory.
The dataset is fixed and ReadOnly.
The data should be read at "reasonable speed" > 30MB/sec for each consumer.
I was thinking of setting up a Kafka cluster with 3+ brokers, 1 topic, and 50 partitions.
My issue with the above plan is that each student (== consumer) must read all the data, regardless of what other consumers do.
Is Kafka a good fit for this? If so, how?
What if I relax the requirement of reading the dataset in order? i.e. a consumer can read the 600M messages in any order.
Is it correct that in this case each consumer will simply pull the full topic (starting from "earliest")?
An alternative is to set up HDFS storage (we use Azure, so it's called a Storage Account) and simply supply a mount point. However, I do not have control over the throughput in this case.
Throughput calculation:
Let's say 25 consumers run concurrently, each reading at 30 MB/s -> 750 MB/s.
Assuming data is read from disk, and the disk rate is 50 MB/s, I need to read concurrently from 750/50 = 15 disks.
Does that mean I need 15 brokers? I did not see how one broker can spread partitions across several disks attached to it.
Similar posts:
Kafka topic partitions to Spark streaming
How does one Kafka consumer read from more than one partition?
(Spring) Kafka appears to consume newly produced messages out of order
Kafka architecture many partitions or many topics?
Is it possible to read from multiple partitions using Kafka Simple Consumer?
Processing will probably be in Spark, but not mandatory
An alternative is to set up HDFS storage (we use Azure)
Spark can read from Azure Blob Storage, so I suggest you start with that first. You can easily scale up Spark executors in parallel for throughput.
If you want to use Kafka, don't base the consumption rate on disk speed alone, especially since Kafka can do zero-copy transfers. Use the kafka-consumer-perf-test script to test how fast your consumers can go against one partition. Or, better, if your data has some key other than the timestamp that you can order by, then use that.
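For example, an invocation of that script might look like the following (the broker address and topic name are placeholders; on older Kafka versions the flag is --broker-list instead of --bootstrap-server):

kafka-consumer-perf-test --bootstrap-server localhost:9092 --topic input-topic --messages 1000000

It reports MB/s and messages/s for the run, which gives you a realistic per-partition consumption rate.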
It's not really clear whether all 50 students do the same processing on the data set or whether some pre-computation can be shared. If it can, Kafka Streams KTables can be set up to aggregate some static statistics of the data as it is all streamed through a topic; that way you can distribute the load for those queries and not need 50 parallel consumers, as sketched below.
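As a minimal sketch of that idea (the topic name, serdes, and the count-per-key statistic are all assumptions for illustration), a Kafka Streams app could compute a shared statistic once instead of 50 students each re-reading the raw data:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;

StreamsBuilder builder = new StreamsBuilder();
// Aggregate one static statistic (here: record count per key) into a KTable
// backed by a queryable state store.
KTable<String, Long> countsPerKey = builder
        .stream("dataset-topic", Consumed.with(Serdes.String(), Serdes.String()))
        .groupByKey()
        .count(Materialized.as("counts-store"));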
Otherwise, my first thought would be to simply use a TSDB like OpenTSDB, TimescaleDB, or InfluxDB, or maybe Druid, which could also be used with Spark or queried directly.
If you are using Apache Spark 3.0+, there are ways around the one-consumer-per-partition bound, as Spark can use more executor tasks than there are partitions, so it's mostly a question of how fast your network and disks are.
Kafka serves the most recently written data from the OS page cache, so for your use case most reads will probably come from memory.
From the Structured Streaming + Kafka integration guide, describing the minPartitions option:
Desired minimum number of partitions to read from Kafka. By default, Spark has a 1-1 mapping of topicPartitions to Spark partitions consuming from Kafka. If you set this option to a value greater than your topicPartitions, Spark will divvy up large Kafka partitions to smaller pieces. Please note that this configuration is like a hint: the number of Spark tasks will be approximately minPartitions. It can be less or more depending on rounding errors or Kafka partitions that didn't receive any new data.
https://spark.apache.org/docs/3.0.1/structured-streaming-kafka-integration.html
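In Java, minPartitions is just another option on the Kafka source. A minimal sketch (the broker address and topic name are placeholders, and it assumes the spark-sql-kafka connector is on the classpath):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("kafka-read").getOrCreate();

Dataset<Row> df = spark.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")
        .option("subscribe", "input-topic")
        // Hint: split the 6 Kafka partitions into roughly 48 Spark tasks.
        .option("minPartitions", "48")
        .load();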
I am currently working on a setup which has 6 Kafka brokers. Data is being pushed into my topic from two producers at a rate of about 4,000 messages per second, and I have 5 consumers for this topic working as a group. What should be the ideal number of partitions for my Kafka topic?
Please feel free to tell me if any change is required in brokers/consumers/producers as well.
In general, the more partitions, the more throughput. However, there are other considerations too, like the limits of the hardware you are running on, whether you are using compression, etc. There is good information from Confluent here which gives you insight into the rough calculation you can use to arrive at the number of partitions:
A rough formula for picking the number of partitions is based on throughput. You measure the throughput that you can achieve on a single partition for production (call it p) and consumption (call it c). Let's say your target throughput is t. Then you need to have at least max(t/p, t/c) partitions. The per-partition throughput that one can achieve on the producer depends on configurations such as the batching size, compression codec, type of acknowledgement, replication factor, etc.
Moreover, for the consumer:
The consumer throughput is often application-dependent, since it corresponds to how fast the consumer logic can process each message.
So the best way is to measure and benchmark for your own use case.
I have a standalone Kafka setup with a single disk, and I am planning to stream over a million records. How do I decide the number of partitions for my topic for better throughput? Does it have to be 1 partition?
Is it recommended to have multiple partitions for a topic on a standalone Kafka server?
Yes, you need multiple partitions even for a single-node Kafka cluster. That is because you can only have as many consumers (within one consumer group) as you have partitions. If you have a single partition, then you can only have a single consumer, and that will limit throughput, especially if you want to stream millions of records (although the time period for those is not specified).
The only real downside is that messages are consumed in order only within the same partition, so you give up total ordering across the topic. Other than that, you should go with multiple partitions. You will need to estimate the throughput of a single consumer in order to calculate the number of partitions, then maybe add one or two on top of that.
You can still add partitions later, but it's probably better to start with the right amount and change it later as you learn more or as your volume increases/decreases.
There are two main factors to consider:
Number of producers and consumers
Within a consumer group, each partition is read by at most one consumer, so the partition count caps how many consumers can work in parallel (producers, by contrast, can write to any number of partitions). For this reason, the number of partitions must be at least the number of consumers you want to run concurrently.
Throughput
You must determine the throughput to calculate how many consumers should be in the consumer group. The combined reading capacity of the consumers should be at least as high as the combined writing capacity of the producers.
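To make the consumer-group mechanics concrete, here is a minimal consumer sketch (the broker address, group id, and topic are placeholders). Run several copies of this process with the same group.id, and Kafka spreads the partitions across them:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

Properties props = new Properties();
props.put("bootstrap.servers", "broker1:9092");
props.put("group.id", "reader-group"); // same value on every instance
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(List.of("input-topic"));
    while (true) {
        // Each instance only receives records from the partitions assigned to it.
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        records.forEach(r -> System.out.println(r.partition() + ": " + r.value()));
    }
}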
I have a Kafka topic with a very, very large message count; it might receive more than 20M messages per second. But the message size is small, just some plain text, each less than 1 KB. I can use several partitions per topic, and I can also use several servers to work on one topic, each consuming one of the partitions in the topic...
What if I need 100+ servers for a huge topic?
Is it logical to create 100 partitions or more on a single topic?
You should define "large" when talking about Kafka topics:
Large in terms of total data volume?
Large message size, so that it takes time to send a message from the queue to a client for processing?
Intensive writes to the topic? In that case, do you need to process reads as fast as possible? (i.e., could processing be delayed by about an hour?)
...
In any case, you should think about the consumer side to design your topics and partitions better. For instance:
If processing each message is slow and you want messages handled concurrently, create many partitions. It is like a load balancer and its servers: you create many workers to do the job.
If the processing time is slow for only some message types, consider moving those to a new topic. There is a nice article, "Should you put several event types in the same Kafka topic?", which explains this decision.
Is the order of messages important? For example, if message A happens before message B and A must be processed first, make all messages of the same type go to the same partition (only within a single partition is message order maintained), or move them to a separate topic with a single partition; see the sketch after this list.
...
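As a sketch of the ordering point above (the topic, key, and values are made up): with the default partitioner, records that share a key always land in the same partition, so their relative order is preserved for consumers.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

Properties props = new Properties();
props.put("bootstrap.servers", "broker1:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
    // Same key "order-42" -> same partition -> "created" is seen before "paid".
    producer.send(new ProducerRecord<>("events", "order-42", "created"));
    producer.send(new ProducerRecord<>("events", "order-42", "paid"));
}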
Once you have a proper design for topics and partitions, the question becomes: how many partitions should each topic have? Increasing the total number of partitions will increase your throughput, but at the same time it will affect availability and latency. There are some good articles here and here that explain carefully how the total number of partitions per topic affects performance. In my opinion, you should benchmark directly on your own system to choose the correct value, since it depends on many factors: the processing power of the server machines, network capacity, memory, ...
And the last part: you don't need 100 servers for 100 partitions. Kafka will try to balance all partitions between servers. For example, if you have 1 topic with 7 partitions running on 3 servers, 2 servers will store 2 partitions each and 1 server will store 3 partitions (so 2*2 + 3*1 = 7). The mapping between partitions and brokers is kept in the cluster metadata (ZooKeeper in older Kafka versions, the KRaft quorum in newer ones).
You will get better help if you are more specific and provide some numbers, like your expected load per second and the size of each message.
In general Kafka is pretty powerful: behind the scenes it writes data to a buffer and periodically flushes it to disk. As per a benchmark done by Confluent a while back, a Kafka cluster with 6 nodes supports around 0.8 million messages per second, as shown in their benchmarking chart.
Our friends were right; I refer you to this book:
Kafka: The Definitive Guide
by Neha Narkhede, Gwen Shapira & Todd Palino
You can find the answer on page 47:
How to Choose the Number of Partitions
There are several factors to consider when choosing the number of partitions:

What is the throughput you expect to achieve for the topic? For example, do you expect to write 100 KB per second or 1 GB per second?

What is the maximum throughput you expect to achieve when consuming from a single partition? You will always have, at most, one consumer reading from a partition, so if you know that your slower consumer writes the data to a database and this database never handles more than 50 MB per second from each thread writing to it, then you know you are limited to 50 MB/s throughput when consuming from a partition.

You can go through the same exercise to estimate the maximum throughput per producer for a single partition, but since producers are typically much faster than consumers, it is usually safe to skip this.

If you are sending messages to partitions based on keys, adding partitions later can be very challenging, so calculate throughput based on your expected future usage, not the current usage.

Consider the number of partitions you will place on each broker and the available disk space and network bandwidth per broker.

Avoid overestimating, as each partition uses memory and other resources on the broker and will increase the time for leader elections. With all this in mind, it's clear that you want many partitions but not too many. If you have some estimate regarding the target throughput of the topic and the expected throughput of the consumers, you can divide the target throughput by the expected consumer throughput and derive the number of partitions this way. So if I want to be able to write and read 1 GB/sec from a topic, and I know each consumer can only process 50 MB/s, then I know I need at least 20 partitions. This way, I can have 20 consumers reading from the topic and achieve 1 GB/sec. If you don't have this detailed information, our experience suggests that limiting the size of the partition on the disk to less than 6 GB per day of retention often gives satisfactory results.
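The quoted rule of thumb is just max(t/p, t/c). As a quick sanity check of the book's numbers in code (the 200 MB/s producer rate below is an assumption for illustration; the book's example only fixes t and c):

// t = target throughput, c = per-partition consume rate, p = per-partition produce rate
double t = 1000; // MB/s target: the book's 1 GB/sec example
double c = 50;   // MB/s each consumer can process (from the book)
double p = 200;  // MB/s per producer per partition -- assumed; producers are usually faster
int partitions = (int) Math.ceil(Math.max(t / p, t / c));
System.out.println(partitions); // prints 20, matching the book's answer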
I am working on an application/Kafka cluster which will be producing/consuming messages (around 100k a second) to a topic. The message format is identical, so my initial thought was to have a single topic for all messages.
However, are there any benefits in Kafka to splitting the messages into multiple topics? There is a logical separation which could be applied to split the topic into multiple (10-ish) topics.
Apart from the producer/consumer side of things, does Kafka itself have any preference around performance, redundancy, stability, management, etc. for 1 large topic versus multiple smaller topics?
Topic partitions are the usual means of parallelizing Kafka, though you could opt to split into multiple topics as well if you wanted. But I would first look into the partitioning side of things. Here is a good Confluent article on how to pick the right number of partitions. Especially note that if you are partitioning on keys, then adding partitions after the fact can result in split data, so think it through properly up front as best as you can.
Parallelism in Kafka depends on the number of partitions in a topic. Throughput increases as long as the number of partitions is optimal (an unnecessarily large number of partitions creates overhead). By increasing the number of consumers, you can stream messages from multiple partitions simultaneously.