This is probably a best-practice question. Could someone who has worked on this clarify it with examples, so that all of us can benefit?
For event-driven architectures with Kafka / Redis, what best practices should be followed when we create topics/streams for events?
Let's consider an online order processing workflow.
I read some blogs that say to create topics/streams like order-created-events, order-deleted-events, etc. But my question is: how is the order of the messages guaranteed when we split them across multiple topics?
For example:
order-created-events could have thousands of events being slowly processed by a consumer, while order-deleted-events could have only a few records in the queue, assuming only 5-10% of customers cancel their order.
Now let's assume a user first places an order and then immediately cancels it. The order-deleted event could be processed first, since its topic/stream has far fewer messages queued, before some consumer processes the order-created event for the same order. That would cause data inconsistency.
Hopefully my question is clear. So, how should one come up with a topic/stream design?
Kafka ensures sequencing for a particular partition only.
So, to take advantage of Kafka's partitioning and partition-based load balancing, you should create multiple partitions for a single topic (like order).
Then use a partitioner (or a consistent message key) so that every message for a given entity always maps to the same partition.
That way, irrespective of whether Order A is created, updated, or deleted, all of its events always land in the same partition.
To properly achieve sequencing, this should be the basis for deciding your topics, instead of two different topics for different activities.
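For illustration, a minimal producer sketch (the topic name order-events, the order ID, and the payloads are invented for the example): by using the order ID as the message key, Kafka's default partitioner hashes the key, so created/updated/cancelled events for the same order always land in the same partition and are therefore consumed in order.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String orderId = "order-42"; // hypothetical order ID used as the message key
            // Same key => same partition => per-order ordering is preserved,
            // even though ORDER_CREATED and ORDER_CANCELLED are different event types.
            producer.send(new ProducerRecord<>("order-events", orderId, "ORDER_CREATED"));
            producer.send(new ProducerRecord<>("order-events", orderId, "ORDER_CANCELLED"));
        }
    }
}
```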
Related
I have a usecase where I want to have thousands of producers writing messages which will be consumed by thousands of corresponding consumers. Each producer's message is meant for exactly one consumer.
Going through the core concepts here and here, it seems like each consumer-producer pair should have its own topic. Is this a correct understanding? I also looked into consumer groups, but it seems they are more for parallelizing consumption.
Right now I have multiple producer-consumer pairs sharing very few topics, but because of that (I think) I am having to read a lot of messages in the consumer and filter out the specific producer's messages by key. As my system scales this might take a lot of time. Also, in the event I have to delete the checkpoint, this will be even more problematic, as consumption starts again from the very beginning.
Is creating thousands of topics the solution for this? Or is there any other way to use concepts like partitions, consumer groups etc? Both producers and consumers are spark streaming/batch applications. Thanks.
Each producer's message is meant for exactly one consumer
Assuming you commit the offsets, and don't allow retries, this is the expected behavior of all Kafka consumers (or rather, consumer groups)
seems like each consumer-producer pair should have its own topic
Not really. As you said, you have many-to-many relationship of clients. You do not need to have a known pair ahead of time; a producer could send data with no expected consumer, then any consumer application(s) in the future should be able to subscribe to that topic for the data they are interested in.
sharing very few topics, but because of that (i think) I am having to read a lot of messages in the consumer and filter them out for the specific producer's messages by the key. As my system scales this might take a lot of time
The consumption would take linearly more time on a higher production rate, yes, and partitions are the way to solve for that. Beyond that, you need faster network and processing. You still need to consume and deserialize in order to filter, so the filter is not the bottleneck here.
Is creating thousands of topics the solution for this?
Ultimately depends on your data, but I'm guessing not.
Is creating thousands of topics the solution for this? Or is there any other way to use concepts like partitions, consumer groups etc? Both producers and consumers are spark streaming/batch applications.
What's the reason you want to have thousands of consumers, or to have an explicit 1-to-1 relationship? As mentioned earlier, only one consumer within a consumer group will process a given message. This is normal.
If, however, you are trying to make your record processing extremely concurrent, then instead of using very high partition counts or very large consumer groups, you should use something like Parallel Consumer (PC).
By using PC, you can process all your keys in parallel, regardless of how long each one takes, and be as concurrent as you wish.
PC directly solves for this by sub-partitioning the input partitions by key and processing each key in parallel.
It also tracks per-record acknowledgement. Check out Parallel Consumer on GitHub (it's open source, BTW, and I'm the author).
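A rough sketch of what that looks like, based on the examples in the Parallel Consumer README; the class and method names here follow that README and may differ between library versions, and the topic name is invented, so treat this as an illustration and check the project's documentation before using it.

```java
// Sketch only: API names follow the Parallel Consumer README; verify against the version you use.
import io.confluent.parallelconsumer.ParallelConsumerOptions;
import io.confluent.parallelconsumer.ParallelConsumerOptions.ProcessingOrder;
import io.confluent.parallelconsumer.ParallelStreamProcessor;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.util.List;

public class KeyOrderedProcessing {
    static void run(KafkaConsumer<String, String> kafkaConsumer) {
        ParallelConsumerOptions<String, String> options = ParallelConsumerOptions.<String, String>builder()
                .ordering(ProcessingOrder.KEY)   // keys are processed in parallel, order is kept per key
                .maxConcurrency(1000)            // far beyond what the partition count alone would allow
                .consumer(kafkaConsumer)         // an ordinary, pre-configured KafkaConsumer
                .build();

        ParallelStreamProcessor<String, String> processor =
                ParallelStreamProcessor.createEosStreamProcessor(options);

        processor.subscribe(List.of("input-topic")); // hypothetical topic name
        processor.poll(context ->
                System.out.println("Processing " + context)); // per-record acknowledgement is handled by the library
    }
}
```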
I'm currently evaluating options for designing/implementing an Event Sourcing + CQRS architectural approach to system design. Since we want to use Apache Kafka for other aspects (normal pub-sub messaging + stream processing), the next logical question is: "Can we use Apache Kafka's store as an event store for CQRS?" Or, more importantly, would that be a smart decision?
Right now I'm unsure about this.
This source seems to support it: https://www.confluent.io/blog/okay-store-data-apache-kafka/
This other source recommends against that: https://medium.com/serialized-io/apache-kafka-is-not-for-event-sourcing-81735c3cf5c
In my current tests/experiments, I'm having problems similar to those described by the 2nd source, those are:
- Recomposing an entity: Kafka doesn't seem to support fast retrieval/search of specific events within a topic. For example, fetching all commands related to an order's history (necessary for reconstructing the entity instance) seems to require scanning all the topic's events and filtering only those matching some entity identifier, which is a no-go. (This other person seems to have arrived at a similar conclusion: Query Kafka topic for specific record -- that is, it is just not possible without relying on some hacky trick.)
- Write consistency: Kafka doesn't support transactional atomicity on its store, so it seems a common practice is to put a DB with some locking approach (usually optimistic locking) in front and asynchronously export the events to the Kafka queue (I can live with this, though; the first problem is much more crucial to me).
- The partition problem: the Kafka documentation says that the "order guarantee" exists only within a "topic's partition". At the same time, it also says that the partition is the basic unit of parallelism; in other words, if you want to parallelize work, spread the messages across partitions (and brokers, of course). But this is a problem, because an event store in an event-sourced system needs the order guarantee, so it seems I'm forced to use only 1 partition if I absolutely need that guarantee. Is this correct?
Even though this question is a bit open-ended, it really comes down to this: have you used Kafka as your main event store in an event-sourced system? How did you deal with the problem of recomposing entity instances from their command history (given that the topic has millions of entries, scanning the whole set is not an option)? Did you use only 1 partition, sacrificing potential concurrent consumers (given that the order guarantee is restricted to a specific topic partition)?
Any specific or general feedback would be greatly appreciated, as this is a complex topic with several considerations.
Thanks in advance.
EDIT
There was a similar discussion 6 years ago here:
Using Kafka as a (CQRS) Eventstore. Good idea?
Consensus back then was also divided, and a lot of the people who suggest this approach is convenient mention how Kafka deals natively with huge amounts of real-time data. Nevertheless, the problem (for me at least) isn't related to that, but to how inconvenient Kafka's capabilities are for rebuilding an entity's state: either by modeling topics as entity instances (where the exponential explosion in the number of topics is undesirable), or by modeling topics as entity types (where the number of events within the topic makes reconstruction very slow/impractical).
Your understanding is mostly correct:
- Kafka has no search, and definitely not by key. There is a seek-to-timestamp, but it's imperfect and not good for what you're trying to do.
- Kafka actually supports a limited form of transactions these days (see exactly-once), although if you interact with any system outside of Kafka they will be of no use.
- The unit of everything in Kafka (event ordering, availability, replication) is a partition. There are no guarantees across partitions of the same topic.
None of this stops applications from using Kafka as the source of truth for their state, so long as:
- your problem can be "sharded" into topic partitions, so you don't care about the order of events across partitions;
- you're willing to "replay" an entire partition to bootstrap if/when you lose your local state;
- you use log-compacted topics to keep a bound on their size (because you will need to replay them to bootstrap, see the point above).
Both Samza and (IIUC) Kafka Streams back their state stores with log-compacted Kafka topics. Internally to Kafka, offset and consumer group management is stored as a log-compacted topic, with brokers holding a "materialized view" in memory: when ownership of a partition of __consumer_offsets moves between brokers, the new leader replays the partition to rebuild this view.
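To make the "replay to bootstrap" idea concrete, here is a minimal sketch (the broker address, the compacted topic name entity-changelog, and the string key/value handling are assumptions for the example): assign the partition manually, seek to the beginning, and fold every record into an in-memory map, treating null values as tombstones, exactly as a compacted changelog is meant to be used.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class StateBootstrap {
    public static Map<String, String> rebuild() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        Map<String, String> state = new HashMap<>();
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("entity-changelog", 0); // hypothetical compacted topic
            consumer.assign(List.of(tp));
            consumer.seekToBeginning(List.of(tp));
            long end = consumer.endOffsets(List.of(tp)).get(tp); // replay up to the current end of the log

            while (consumer.position(tp) < end) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    if (record.value() == null) {
                        state.remove(record.key());              // tombstone: entity deleted
                    } else {
                        state.put(record.key(), record.value()); // latest value per key wins
                    }
                }
            }
        }
        return state;
    }
}
```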
I have been on several projects that use Kafka as long-term storage, and Kafka has no problem with it, especially since the latest versions introduced tiered storage, which, in a cloud environment, gives you the option of moving older data to slower/cheaper storage.
And you should not worry that much about transactions; in today's IT there are other concepts to deal with that, like Event Sourcing and Bounded Contexts. Yes, you have to design your applications differently; how is explained in this video.
But you are right that your options for querying this data will be limited. The easiest way is to use Kafka Streams and a KTable, but that is a key/value store, so you can only query your data by primary key.
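A small sketch of that KTable approach (the topic name orders, the store name, the application id, and the looked-up key are all invented for the example): materialize the topic into a key/value state store and query it by primary key via interactive queries.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class OrderLookup {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-lookup");      // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Materialize the topic as a key/value store: latest value per key.
        builder.table("orders", Materialized.as("orders-store"));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        // In a real application, wait until the instance reaches the RUNNING state before querying.

        // Interactive query: lookups work by primary key only, as noted above.
        ReadOnlyKeyValueStore<Object, Object> store = streams.store(
                StoreQueryParameters.fromNameAndType("orders-store", QueryableStoreTypes.keyValueStore()));
        System.out.println(store.get("order-42")); // hypothetical key
    }
}
```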
Your next best choice is to implement the query part of CQRS with the help of frameworks like Akka Projection. I wrote a blog about how you can use Akka Projection with Elasticsearch, which you can find here and here.
I am writing a Kafka producer and need help designing partitions.
I have a group table and a user table. A group contains different users, and at any given time a user can be part of only one group.
There are two types of events which I will receive as input, and based on that I will add them to Kafka:
The events related to users.
The events related to groups.
Whenever an event related to a group happens, all the users in that group must be updated in bulk at consumer end.
Whenever an event related to a user happens, it must be executed as such at the consumer end.
Also, I want to maintain ordering on basis of time.
If I create user level partitioning, then the bulk update won't be possible at consumer end.
If I create group level partitioning, then the parallel update of user events won't happen.
I am trying to figure out the possibilities I can try here.
Also, I want to maintain ordering on basis of time.
This means that your topics, no matter how many, cannot have more than one partition each; otherwise you could receive messages out of order.
Obviously, unless you implement something like sequence ids in your messages (and can share that sequence across possibly multiple producers).
If I create user level partitioning, then the bulk update won't be possible at consumer end.
If I create group level partitioning, then the parallel update of user events won't happen.
It sounds like a very simple messaging design, where you have a single queue (actually backed by a single topic with a single partition) that's consumed by multiple consumers. Actually, any pub-sub messaging technology would be sufficient here (e.g. RabbitMQ's fanout exchanges).
The messages on the queue contain the information whether they are group updates or user updates - the consumers then filter the input depending on what they are interested in.
To discuss an alternative: a single queue for group updates and another for user updates. I understand that would not be enough due to the ordering requirements: it's possible to get a group update independently of a user update, breaking the ordering.
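As a sketch of the filtering approach described above (the topic name updates, the header name event-type, the group id, and the payload handling are assumptions for the example): every interested application consumes the whole single-partition topic under its own group.id, reads the event type from a header, and skips what it doesn't care about.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.header.Header;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GroupUpdateConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // Each interested application uses its own group.id, so all of them see every message (pub-sub fanout).
        props.put("group.id", "group-update-handler");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("updates")); // hypothetical single-partition topic holding both event kinds
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    // The event type travels in a header; this consumer only acts on group updates.
                    Header typeHeader = record.headers().lastHeader("event-type"); // hypothetical header name
                    String type = typeHeader == null ? "" : new String(typeHeader.value());
                    if ("GROUP_UPDATE".equals(type)) {
                        System.out.println("Bulk-updating all users in group: " + record.value());
                    }
                    // USER_UPDATE records are ignored here and handled by the user-update application.
                }
            }
        }
    }
}
```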
From the Kafka documentation:
https://kafka.apache.org/documentation/#intro_consumers
"Kafka only provides a total order over records within a partition, not between different partitions in a topic. Per-partition ordering combined with the ability to partition data by key is sufficient for most applications. However, if you require a total order over records this can be achieved with a topic that has only one partition, though this will mean only one consumer process per consumer group."
So the best you can do is a single topic with a single partition.
I know that Kafka cannot guarantee ordering of data when a topic has multiple partitions. But my problem is this: I need multiple partitions for an events topic (user activities generating events), since I want multiple consumer groups to consume data from the topic.
But there are times when I need to bootstrap all the data, i.e. read the complete data from beginning to end and rebuild my graph of events from the historical messages in Kafka, and then I lose the ordering, which creates a problem.
One approach might be to process it in a Map-Reduce paradigm, where I map the data based on time, order it, and then consume it.
Has anybody faced a similar situation/problem who could help me out with the right approach/solution?
Thanks in advance.
As per the Kafka documentation, global ordering across partitions is not guaranteed, so you can create N partitions with N consumers. Create partitions based on the type of data, i.e. all data of category A should go into one partition; since message order is maintained within a partition, you can consume those messages with a separate consumer and process the data.
I have gone through some blogs that say to buffer the messages and apply sorting logic to them, but this does not seem to be a good practice, as one of the partitions may be slow and a message may arrive late in some cases, so you would need to re-sort your messages whenever a new message arrives.
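If you go the explicit "category A goes to one partition" route, one way to sketch it is a custom Partitioner (the class name, and the assumption that the message key is the category name, are invented here); in practice, simply keying messages by category and relying on the default partitioner achieves the same grouping.

```java
import java.util.Map;
import java.util.Objects;

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

// Sketch of a category -> partition mapping. Register it on the producer with:
// props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, CategoryPartitioner.class.getName());
public class CategoryPartitioner implements Partitioner {

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int partitionCount = cluster.partitionsForTopic(topic).size();
        // Assume the key is the category name (e.g. "A", "B", ...): every message of one
        // category lands in one partition, so ordering holds within that category.
        return Math.floorMod(Objects.hashCode(key), partitionCount);
    }

    @Override
    public void close() {}

    @Override
    public void configure(Map<String, ?> configs) {}
}
```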
Considering a stream of different events, would the recommended way be:
- one big topic containing all events, or
- multiple topics for different types of events?
Which option would be better?
I understand that if messages are not in the same partition of a topic there is no ordering guarantee, but are there any other factors to consider when making this decision?
A topic is a logical abstraction and should contain messages of the same type. Let's say you monitor a website and capture click-stream events, and on the other hand you have a database that publishes its changes into a changelog topic. You should have two different topics because click-stream events are not related to your database changelog.
This has multiple advantages:
- your data will have different formats and you will need different (de)serializers to write and read the data (using a single topic, you would need a hybrid serializer and you would not get type safety when reading data)
- you will have different consumer applications: one application might be interested in click-stream events only, while a second is only interested in the database changelog and a third is interested in both. If you have multiple topics, applications one and two only subscribe to the topics they are interested in; if you have a single topic, applications one and two need to read everything and filter out the stuff they are not interested in, increasing broker, network, and client load
As @Matthias J. Sax said before, there is no silver bullet here, but there are several factors we have to take into account.
The conditioning factor: ordered delivery
If your application needs guaranteed ordered delivery, you need to work with only one topic, plus the same key for those messages whose order must be guaranteed.
If ordering is not mandatory, the game starts...
Is the schema the same for all messages?
Would consumers be interested in the same types of events, or in different ones?
What is going to happen on the consumer side? Are we reducing or increasing complexity in terms of implementation, maintainability, error handling...?
Is horizontal scalability important for us? More topics often means more partitions available, which means more horizontal scaling capacity. It also allows more precise scalability configuration on the broker side, because we can choose how many partitions to add per event type, or on the consumer side, how many consumers to stand up per event type.
Does it make sense to parallelise consumption per message type?
...
Technically speaking, if we allow consumers to fine-tune which types of events they consume, we potentially reduce the network bandwidth spent sending undesired messages from the broker to the consumer, plus the number of deserialisations for all of them (CPU used, which over time means more free resources, energy cost reduction...).
It's also worth remembering that splitting different types of messages into different topics doesn't mean you have to consume them with different Kafka consumers, because a single consumer can subscribe to several topics at the same time, as sketched below.
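For example, a single consumer can subscribe to both topics at once (the topic names, group id, and broker address are invented for the example):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class MultiTopicConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "analytics");                    // hypothetical consumer group
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // One consumer, two topics: the application interested in both event types
            // subscribes to both; other applications subscribe only to what they need.
            consumer.subscribe(List.of("click-events", "db-changelog"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    System.out.printf("%s -> %s%n", record.topic(), record.value());
                }
            }
        }
    }
}
```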
Well, there's no clear-cut answer to this question, but my feeling is that with Kafka, given the features above, if ordered delivery is not needed we should split our messages into different topics per type.