Bucketizing Kafka Data with Partitions - apache-kafka

I have a situation where I’m loading data into Kafka and would like to process the records in discrete 10-minute buckets. But bear in mind that the record timestamps come from the producers, so they may not arrive in perfect order. That means I can’t simply use the standard Kafka consumer approach, since it would give me records outside my discrete bucket.
Is it possible to use partitions for this? I could look at the timestamp of each record before placing it in the topic and use that to select the appropriate partition. But I don’t know whether Kafka supports ad hoc named partitions.

Partitions aren't "named". Sure, you could define a topic with 6 partitions (one per 10-minute "bucket", ignoring hours and days) and a Partitioner subclass that computes the target partition from the record timestamp with simple arithmetic. However, this really only helps with ordering; it doesn't address the fact that any window not aligned to an exact 10-minute boundary spans two partitions. E.g., a window reaching minute 11 (partition 1) would also need to consume the records from minutes 1-9 (partition 0).
Overall, it sounds like you want the sliding/hopping windowing features of Kafka Streams, not the plain Consumer API. That works with any number of partitions, without writing custom Producer Partitioners.
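For illustration, the bucket arithmetic such a custom Partitioner would use can be sketched in a few lines (a hypothetical helper assuming epoch-millisecond timestamps, not a real Kafka Partitioner implementation):

```python
# Hypothetical sketch of the 10-minute bucket math a custom Partitioner
# could apply; assumes record timestamps are epoch milliseconds.

def bucket_partition(timestamp_ms: int) -> int:
    """Map a timestamp to one of six 10-minute buckets within the hour."""
    minute_of_hour = (timestamp_ms // 60_000) % 60
    return minute_of_hour // 10

# Minute 11 lands in partition 1, minute 3 in partition 0 -- so a window
# covering minutes 5-15 would have to read from both partitions.
print(bucket_partition(11 * 60_000))  # 1
print(bucket_partition(3 * 60_000))   # 0
```

This makes the boundary problem concrete: only windows aligned exactly to the 10-minute grid stay within one partition.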

Related

Too much load on kafka partition (extra load on single partition)

Suppose there is a trade-capture application that consumes messages via Kafka, and the partition key is the stock ID (e.g. GOOGLE, APPLE, TESLA). This works fine on normal days. Now suppose there is bad news about company X and people are selling its stock. In that case, all the messages land on a single partition during that trading session or day. How do I handle this efficiently? Can we apply multiple consumers to the same partition?
It's due to an overloaded partition on a random day. We have more than a dozen partitions along with dozens of consumers, and the partitions/consumers are distributed evenly throughout the year. The problem is a sudden spike of data in a single partition, which happens once a month or once a quarter.
Can we apply multiple consumers to same partition
Not in the same consumer group, no.
The only way to reasonably handle this is to increase max.poll.records and other consumer properties to consume faster from that partition, and/or all partitions. Unfortunately, you won't know ahead of time which partition will get "overloaded".
The other alternative is to redesign your topic(s) so that stock tickers are not your partition key, and so that whatever partitioning strategy you do choose is not driven by end-user behavior that is out of your control (or otherwise define your own Partitioner class).
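One common way to decouple partition assignment from a hot key is to salt the key, at the cost of losing strict per-ticker ordering. A hedged sketch (crc32 stands in for Kafka's actual murmur2 hash; the names and numbers are illustrative):

```python
import zlib

NUM_PARTITIONS = 12

def salted_partition(ticker: str, salt: int) -> int:
    """Route ticker plus a rotating salt to a partition, so a burst of
    messages for one ticker fans out across several partitions instead
    of hitting just one. Note this sacrifices per-ticker ordering."""
    key = f"{ticker}-{salt}".encode()
    return zlib.crc32(key) % NUM_PARTITIONS

# e.g. round-robin the salt on the producer side:
partitions = [salted_partition("TSLA", s % 4) for s in range(8)]
```

Whether this trade-off is acceptable depends on whether consumers need the trades for one ticker in order.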

Kafka KSQL equivalent of VIEW for consumers that need a subset of data

We are implementing an ETL in Kafka to load data from a single source into different target systems, each with its own consumer.
Every consumer needs a subset of the data, so we have the following topics:
topicA --> infinite retention, stores all the data from the source
topicB --> finite retention, populated by a KSQL statement with a WHERE clause
Example:
CREATE STREAM streamA WITH (KAFKA_TOPIC='topicA')
CREATE STREAM streamB WITH (KAFKA_TOPIC='topicB') AS SELECT * FROM streamA WHERE gender='MALE'
After that we have a sink connector or a consumer attached to topicB, consuming only the rows where gender is male, possibly with some columns renamed.
Since we are running an initial import of a substantial amount of data, I would like to understand whether there is any way to reduce the storage required for streamB, since its data is just a replica of topicA.
In SQL I would implement this as a VIEW; how can I do that in KSQL?
My idea is to set a lower retention period for topicB, but this doesn't solve the initial-load issue (e.g. if I have to load 10TB of data at the beginning, even with a 1-day retention period, for that one day I would need 10TB + 5TB). Is there any other solution?
I see the following options if you want to minimise the space that topicB takes up in your cluster:
Reduce your time based retention setting for the topic, e.g. to 6 hours, or 1 hour, or 30 minutes, etc.
Use a size based retention setting for the topic, e.g. 100MB per partition.
However, note that in each case it is up to you to ensure your consumer can read the data before the retention policy kicks in and deletes it. If data is deleted before the consumer reads it, the consumer will log a warning.
Reduce the replication factor of the topic. You're hopefully running with a replication factor of at least 3 for your main 'golden truth' topic, so that it's resilient to machine failures. But you may be able to run topicB with a lower factor, e.g. 2 or 1, which would halve or third the storage cost. Of course, if you lost a machine/disk during the process while you only had one replica, you'd lose data and need to recover from that.
Expand your Kafka cluster!
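To put rough numbers on the replication-factor option (the figures are illustrative, loosely based on the 10TB initial load in the question):

```python
def topic_footprint_tb(retained_data_tb: float, replication_factor: int) -> float:
    """Total disk consumed across the cluster for one topic:
    retained data multiplied by the number of replicas."""
    return retained_data_tb * replication_factor

# A 10 TB initial load at different replication factors:
print(topic_footprint_tb(10.0, 3))  # 30.0
print(topic_footprint_tb(10.0, 2))  # 20.0
print(topic_footprint_tb(10.0, 1))  # 10.0
```

The same multiplication explains why shrinking retention helps only after the initial bulk load has aged out.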

What is the ideal number of partitions in kafka topic?

I am learning Kafka and trying to create a topic for my recent-search application. The volume of data pushed to the topic is expected to be high.
My Kafka cluster has 3 brokers, and there are already topics created for other requirements.
Now, how many partitions should I choose for my recent-search topic? What happens if I do not specify the partition count explicitly? And what should be considered when choosing the partition count?
This will depend on the throughput of your consumers. If you are producing 100 messages a second and each consumer can process 10 messages a second, then you'll want at least 10 partitions (produce rate / consume rate) with 10 instances of your consumer. If you want this topic to handle future growth, increase the partition count even higher so that you can add more consumer instances to handle the new volume.
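That sizing rule is just a ceiling division; a small sketch (rates in messages per second, numbers illustrative):

```python
import math

def min_partitions(produce_rate: float, consume_rate_per_consumer: float) -> int:
    """Lower bound on partition count: one consumer instance per
    partition, each keeping up with its share of the produce rate."""
    return math.ceil(produce_rate / consume_rate_per_consumer)

print(min_partitions(100, 10))  # 10
print(min_partitions(105, 10))  # 11 -- round up, a partial consumer can't exist
```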
Another piece of advice would be to make your partition count a highly divisible number so that you can scale up/down consumers while keeping their load balanced. For example, if you choose 10 partitions then you would have to have 1, 2, 5, or 10 instances of your consumer to keep them each processing from the same number of partitions. If you choose 12 partitions instead then you could be balanced with either 1, 2, 3, 4, 6, or 12 instances of your consumer.
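The divisibility argument is easy to check mechanically (a small illustrative helper):

```python
def balanced_group_sizes(partitions: int) -> list[int]:
    """Consumer-group sizes that divide the partition count evenly,
    so every consumer owns the same number of partitions."""
    return [n for n in range(1, partitions + 1) if partitions % n == 0]

print(balanced_group_sizes(10))  # [1, 2, 5, 10]
print(balanced_group_sizes(12))  # [1, 2, 3, 4, 6, 12]
```

With any other group size, at least one consumer owns more partitions than its peers and becomes the bottleneck.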
I would consider evaluating two main things before deciding on the no of partitions.
The first point is how the partitions and the consumers of a consumer group act together. In simple words: one consumer can consume messages from more than one partition, but one partition can't be consumed by more than one consumer within the same group. That means it makes sense to have no. of partitions >= no. of consumers in a consumer group; otherwise you will end up with consumers that have no partition assigned.
The second point is your requirement from a latency vs. throughput point of view.
In simple words,
Latency is the time required to perform some action or to produce some result. Latency is measured in units of time -- hours, minutes, seconds, nanoseconds or clock periods.
Throughput is the number of such actions executed or results produced per unit of time
Now, coming back to the comparison from a Kafka standpoint: in general, more partitions in a Kafka cluster leads to higher throughput. But you should be careful with this number if you are really looking for low latency.

Use Kafka offsets to calculate written messages statistics

I want to get some statistics from a Kafka topic:
total written messages
total written messages in the last 12 hours, last hour, ...
Can I safely assume that reading the offsets for each partition in a topic for a given timestamp (using getOffsetsByTimes) should give me the number of messages written in that specific time?
I can sum the offsets across all partitions and then take the difference between timestamp 1 and timestamp 2. With these data I should be able to calculate a lot of statistics.
Are there situations in which these data can give me wrong results? I don't need 100% precision, but I expect a reliable solution. Of course, I'm assuming the topic is not deleted/reset.
Are there other alternatives that don't use third-party tools? (I cannot install other tools easily, and I need the data inside my app.)
(using getOffsetsByTimes) should give me the number of messages written in that specific time?
In Kafka: The Definitive Guide it is mentioned that getOffsetsByTimes is not message-based but segment-file based: the time-index lookup won't read into a segment file; rather, it returns the first segment containing the time you are interested in. (This may have changed in newer Kafka releases since the book was published.)
If you don't need accuracy, this should be fine. Do note that compacted topics don't have sequentially ordered offsets one after the other, so a simple abs(offset#time2 - offset#time1) won't quite work for "total existing messages in a topic".
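Assuming you have already resolved the first offset at-or-after each timestamp per partition (e.g. via the consumer's offset-by-time lookup), the estimate itself is just a sum of per-partition deltas. A sketch with made-up numbers:

```python
def messages_between(offsets_t1: dict[int, int],
                     offsets_t2: dict[int, int]) -> int:
    """Approximate records written between two timestamps, given the
    per-partition offsets resolved for each timestamp. Only an estimate:
    segment granularity, compaction, and transaction markers skew it."""
    return sum(offsets_t2[p] - offsets_t1[p] for p in offsets_t1)

# Illustrative per-partition offsets at t1 and t2:
t1 = {0: 100, 1: 250, 2: 80}
t2 = {0: 180, 1: 400, 2: 120}
print(messages_between(t1, t2))  # 270
```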
Otherwise, plenty of JMX metrics are exposed by the brokers like bytes-in and message rates, which you can aggregate and plot over time using Grafana, for example.

Apache Kafka order of messages with multiple partitions

As per the Apache Kafka documentation, message ordering can only be achieved within a partition of a topic. In that case, what parallelism benefit are we getting? Isn't it equivalent to traditional MQs?
In Kafka the parallelism is equal to the number of partitions for a topic.
For example, assume that your messages are partitioned by user_id, and consider 4 messages having user_ids 1, 2, 3 and 4. Assume you have a "users" topic with 4 partitions.
Since partitioning is based on user_id, assume the message with user_id 1 goes to partition 1, the message with user_id 2 goes to partition 2, and so on.
Also assume that you have 4 consumers for the topic. Since you have 4 consumers, Kafka will assign each consumer to one partition. So in this case, as soon as 4 messages are pushed, they are immediately consumed.
If you had 2 consumers for the topic instead of 4, each consumer would handle 2 partitions and the consuming throughput would be roughly half.
To completely answer your question,
Kafka only provides a total order over messages within a partition, not between different partitions in a topic.
i.e., if consumption is very slow in partition 2 and very fast in partition 4, then the message with user_id 4 will be consumed before the message with user_id 2. This is how Kafka is designed.
I decided to move my comment to a separate answer as I think it makes sense to do so.
While John is 100% right about what he wrote, you may consider rethinking your problem. Do you really need ALL messages to stay in order? Or do you need all messages for specific user_id (or whatever) to stay in order?
If the first, then there's not much you can do: you should use a single partition and lose all the parallelism.
But in the second case, you might consider partitioning your messages by some key, so that all messages for that key arrive at one partition (they might actually move to another partition if you resize the topic, but that's a different case), which guarantees that all messages for that key stay in order.
In Kafka, messages with the same key, from the same producer, are delivered to the consumer in order.
On top of that, data within a partition is stored in the order in which it is written; therefore, data read from a partition is read in order for that partition.
So if you want your messages ordered across multiple partitions, you need to group them by a key, so that messages with the same key go to the same partition, and within that partition the messages are ordered.
In a nutshell, you need to design a two-level solution like the above to get messages ordered across multiple partitions.
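The "same key, same partition" property can be illustrated like this (crc32 stands in for Kafka's actual murmur2 default partitioner; the behavior that matters here is determinism, not the specific hash):

```python
import zlib

NUM_PARTITIONS = 4

def partition_for(key: str) -> int:
    """Deterministic key-to-partition routing: every message with this
    key always lands on the same partition, preserving its order there."""
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

# All of user-42's messages go to one partition, in produce order;
# other users' messages may interleave on other partitions.
assert partition_for("user-42") == partition_for("user-42")
```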
You may consider having a field that carries the timestamp/date set when the record was created at the source.
Once the data is consumed, you can load it into a database. The data then needs to be sorted at the database level before the dataset is used for any use case. Well, this is an attempt to help you think in multiple ways.
Let's say the message key is a timestamp generated when the data was created, and the value is the actual message string.
As and when a message is picked up by the consumer, it is written into HBase with the RowKey as the Kafka key and the value as the Kafka value.
Since HBase is a sorted map, having the timestamp as the key automatically sorts the data in order. Then you can serve the data from HBase to the downstream apps.
In this way you are not losing the parallelism of Kafka. You also have the option of sorting and performing multiple processing steps on the data at the database level.
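A toy version of that pattern, with an in-memory dict standing in for the HBase table (not real HBase client code):

```python
# Messages arrive out of order across partitions; writing them into a
# map keyed by timestamp and reading the keys in sorted order restores
# the original order, as HBase's sorted row keys would.
consumed = [(1700000300, "sell"), (1700000100, "buy"), (1700000200, "hold")]

store = {}                     # stand-in for the HBase table
for ts, value in consumed:
    store[ts] = value          # RowKey = timestamp, value = message

ordered = [store[ts] for ts in sorted(store)]
print(ordered)  # ['buy', 'hold', 'sell']
```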
Note: no distributed message broker guarantees overall ordering. If you insist on that, you may need to rethink your choice of message broker, or use a single partition in Kafka, which is not a good idea. Kafka is all about parallelism, achieved by increasing partitions or consumer instances.
Traditional MQs work in such a way that once a message has been processed, it is removed from the queue. A message queue allows a bunch of subscribers to pull a message, or a batch of messages, from the end of the queue. Queues usually allow for some level of transaction when pulling a message off, to ensure the desired action is executed before the message is removed.
With Kafka on the other hand, you publish messages/events to topics, and they get persisted. They don’t get removed when consumers receive them. This allows you to replay messages, but more importantly, it allows a multitude of consumers to process logic based on the same messages/events.
You can still scale out to get parallel processing in the same domain, but more importantly, you can add different types of consumers that execute different logic based on the same event. In other words, with Kafka, you can adopt a reactive pub/sub architecture.
ref: https://hackernoon.com/a-super-quick-comparison-between-kafka-and-message-queues-e69742d855a8
Well, this is an old thread, but still relevant, so I decided to share my view.
I think this question is a bit confusing.
If you need strict ordering of messages, then the same strict ordering must be maintained while consuming them; there is no point in ordering messages in the queue but not while consuming them. Kafka allows the best of both worlds: it orders messages within a partition from production through consumption, while allowing parallelism across partitions. Hence, if you need:
Absolute ordering of all events published on a topic: use a single partition. You will not have parallelism, nor do you need it (parallelism and strict ordering don't go together).
Otherwise: go for multiple partitions and consumers, and use consistent hashing to ensure that all messages which need to follow a relative order go to a single partition.