Sharing partitioning logic across polyglot producers with Kafka

We are building an event sourced system at my company, relying on Kafka.
In order to be GDPR compliant, we need to be able to update the events.
Our idea is to use the compaction and tombstone capabilities.
This means that we cannot use the default partitioning strategy, as we want each message to have a unique key (in order to overwrite a specific message), but we still want events occurring on the same aggregate to end up on the same partition.
This brings us to the creation of a custom partitioner (basically copying the "hash modulo" logic of the default partitioner, but computing the hash from a value other than the message key).
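For illustration, here is a minimal sketch of such a partitioner in Java. It assumes (our choice for this example, not an established convention) that the record key is built as "<aggregateId>#<eventId>", so the full key stays unique for compaction and tombstones while only the aggregate id drives the partition; it reuses Kafka's own murmur2 helpers to mirror the default "hash modulo" logic:

    // A minimal sketch, not production code: assumes the record key is
    // "<aggregateId>#<eventId>" so the aggregate id can be recovered for partitioning
    // while the full key stays unique for compaction/tombstones.
    import java.nio.charset.StandardCharsets;
    import java.util.Map;
    import org.apache.kafka.clients.producer.Partitioner;
    import org.apache.kafka.common.Cluster;
    import org.apache.kafka.common.utils.Utils;

    public class AggregateIdPartitioner implements Partitioner {

        @Override
        public int partition(String topic, Object key, byte[] keyBytes,
                             Object value, byte[] valueBytes, Cluster cluster) {
            int numPartitions = cluster.partitionCountForTopic(topic);
            // assumes non-null String keys of the form "<aggregateId>#<eventId>"
            String aggregateId = ((String) key).split("#", 2)[0];
            byte[] bytes = aggregateId.getBytes(StandardCharsets.UTF_8);
            // same "hash modulo" logic as the default partitioner, applied to the aggregate id
            return Utils.toPositive(Utils.murmur2(bytes)) % numPartitions;
        }

        @Override
        public void close() {}

        @Override
        public void configure(Map<String, ?> configs) {}
    }

It would then be registered on each Java producer via the partitioner.class configuration; the other languages would need an equivalent.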
The issue is that we're working in a polyglot environment (we have PHP, Python and Java/Kotlin services publishing and consuming events).
We want to ensure that all these services will produce messages to the same partition given a specific partition key (in case different services publish events to the same topic).
Our main idea was to use a common hashing algorithm, but it is hard to find one with both a strong distribution guarantee and good stability (i.e. not just part of an experimental library).
PHP natively supports a wide range of hashing algorithms, but the same support is hard to find in the other languages.
As Kafka's default partitioner relies on murmur2, we started looking in that direction as well. Unfortunately, it is not natively supported by PHP (although some implementations exist). Furthermore, this algorithm uses a seed, which means that we would need to use the exact same seed in all our publisher services, which is starting to make the approach look quite complex.
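For what it's worth, the seed is hard-coded inside Kafka's own murmur2 implementation (it is not configurable on the producer), so a PHP or Python port only has to reproduce that one function exactly. An alternative that keeps the cross-language surface small is to skip the custom partitioner entirely and have every producer compute the target partition itself and set it explicitly on the record. A minimal Java sketch of that idea (topic name, aggregate id and payload are placeholders):

    // A minimal sketch, assuming the aggregate id is used as the partition key and that
    // all producers agree on it. Utils.murmur2/toPositive are the helpers Kafka's default
    // partitioner uses, so a port in another language must match their output exactly.
    import java.nio.charset.StandardCharsets;
    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;
    import org.apache.kafka.common.utils.Utils;

    public class ExplicitPartitionProducer {

        static int partitionFor(String aggregateId, int numPartitions) {
            byte[] bytes = aggregateId.getBytes(StandardCharsets.UTF_8);
            return Utils.toPositive(Utils.murmur2(bytes)) % numPartitions;
        }

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                int numPartitions = producer.partitionsFor("events").size();   // "events" is a placeholder topic
                int partition = partitionFor("order-42", numPartitions);       // made-up aggregate id
                // unique key for compaction/tombstones, explicit partition for aggregate co-location
                producer.send(new ProducerRecord<>("events", partition, "order-42#event-001", "{\"payload\":\"...\"}"));
            }
        }
    }

Only the partitionFor function would have to be ported to PHP and Python, and its output is easy to verify against the Java side with a handful of test vectors.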
However, we could be looking at the design from the wrong angle. Sharing event store write capabilities across polyglot services might not be a good idea, and each service could have its own partitioning logic as long as it ensures the "one partition per aggregate" requirement. The thing is that we have to think about this ahead of time, because no technical safeguard will prevent a future service from publishing to a "shared" event stream (and not using the exact same partitioning logic will have a huge impact when that happens).
Does anyone have experience with building an event store on Kafka in a polyglot environment who could shed some light on this specific topic, please?

Related

How to Partition a Queue in a distributed system

This problem occurred to me a while ago; unfortunately, I could not find the answer I was looking for on the web. Here is the problem statement:
Consider a simple producer-consumer environment where we only have one producer writing to a queue and one consumer reading from it. Now since the objects written to the queue are quite large in size and our available resources are not much on our current machine, we decided to implement a distributed queue system where the data inside the queue is partitioned among multiple nodes. It is important to us that the total ordering is conserved while pushing and popping the data, meaning that from the point of view of a user this distributed queue acts just like a single unified queue.
Before giving a solution to this problem, we have to ask whether high availability or partition tolerance is more important to us. I believe both versions contain interesting challenges to tackle, and I thought that such a question must surely have been raised before; however, after searching for existing solutions I could not find a complete and well-thought-out answer from an algorithmic or scientific point of view. Most of what I found were engineering and high-level approaches, leveraging tools like Kafka, RabbitMQ, Redis, etc.
So the problem remains, and I would be thankful if you could share with me your designs, algorithms and thoughts on this problem, or point me to a scientific journal or article that has already tackled such a problem.
This is one way in which the above can be achieved. Here the partitioning is done in a round-robin fashion, with a route table keeping track of which node holds which part of the queue.
To achieve high availability, you can have partition replicas.
Pros:
By adding replicas, the system becomes highly available.
Multiple consumer groups can be implemented.
Cons:
The route table becomes a single point of failure; redundancy can be achieved by backing it with DynamoDB and consistent reads.
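Not an answer to the algorithmic question itself, but a toy Java sketch of the round-robin idea described above (all names are made up): the "nodes" are simulated as local queues, and total FIFO order is preserved because pushes and pops walk the nodes in the same fixed order. Replication and a persistent route table are left out.

    // Toy sketch: N "nodes" simulated as local queues; single producer, single consumer.
    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Queue;

    public class RoundRobinQueue<T> {
        private final List<Queue<T>> nodes = new ArrayList<>();
        private int pushCursor = 0;   // next node to push to (the "route table" for writes)
        private int popCursor = 0;    // next node to pop from (the "route table" for reads)

        public RoundRobinQueue(int nodeCount) {
            for (int i = 0; i < nodeCount; i++) {
                nodes.add(new ArrayDeque<>());
            }
        }

        public void push(T item) {
            nodes.get(pushCursor).add(item);
            pushCursor = (pushCursor + 1) % nodes.size();
        }

        public T pop() {
            T item = nodes.get(popCursor).poll();
            if (item != null) {
                popCursor = (popCursor + 1) % nodes.size();
            }
            return item;  // null if the queue is currently empty
        }
    }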

Real-time processing: Storm / Flink vs standard application (Java, C#...)

I am wondering about the choice of implementation for an application processing events coming from Kafka; I have two architecture patterns in mind:
an application developed using the Apache Storm or Apache Flink framework that would process events consumed from Kafka
a Java application (or python, C#...), deployed X times (scalable depending on traffic), which would process events coming from Kafka
I find it difficult to see which of these scenarios is the more interesting.
Could someone help me with this topic?
It's hard to give definitive advice with so little information available, so I'll keep my response general until you provide more specific information:
Choosing a processing framework over a native implementation gives you the following advantages (a short sketch follows the list):
Parallel processing with (in theory) infinite scalability: If you ever expect that you cannot process all events in a single thread in a timely manner, you first need to scale up (more threads) and eventually scale out (more machines). A framework takes care of all synchronization between threads and machines, so you just need to write sequential code glued together with some high-level primitives (similar to LINQ in C#).
Fault tolerance: What happens when your code screws up (some edge case not implemented)? When you run out of resources? When the network (to Kafka or other machines) temporarily breaks? A framework takes care of all these nasty little details.
In case of failure, when you restart the application, most frameworks give you some form of exactly-once processing: How do you avoid losing data? How do you avoid duplicates when reprocessing old data?
Managed state: If your application needs to remember things for a certain time (calculating sums/averages or joining data), how do you ensure that the state is kept in sync with the data in case of failure?
Advanced features: time triggers, complex event processing (= pattern matching on events), writing to different sinks (Kafka for low latency, S3 for batch processing).
Flexibility of storage: if you want to try out a different storage system, it's much easier to change the source/sink in an application written with a framework.
Integration with deployment platforms: If you want to scale to several machines, it's usually much easier to scale on a platform that already offers the related integration (at the time of writing that is mostly Kubernetes). But all frameworks also support simple local setups where you just scale up on one (bigger) machine.
Low-level optimizations: When using newer engines with higher abstractions, the framework may generate code that is much more efficient than what you could implement yourself (with a specific memory layout or serialized data processing).
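To make the "sequential code plus high-level primitives" point concrete, here is a rough Flink sketch (bootstrap server, topic and key extraction are placeholders, and it assumes the flink-connector-kafka dependency). The same logic in a plain consumer application would have to handle partition assignment, scaling and restarts itself:

    // Rough sketch of a Flink job reading from Kafka; names and the key field are made up.
    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.connector.kafka.source.KafkaSource;
    import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class EventProcessingSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            KafkaSource<String> source = KafkaSource.<String>builder()
                    .setBootstrapServers("localhost:9092")              // placeholder
                    .setTopics("events")                                // placeholder topic
                    .setGroupId("event-processor")
                    .setStartingOffsets(OffsetsInitializer.latest())
                    .setValueOnlyDeserializer(new SimpleStringSchema())
                    .build();

            DataStream<String> events =
                    env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-events");

            // Sequential-looking code; the framework handles partitioning, scaling and restarts.
            events.filter(value -> !value.isEmpty())
                  .keyBy(value -> value.split(",")[0])                  // shard by some key field
                  .map(value -> "processed: " + value)
                  .print();

            env.execute("event-processing-sketch");
        }
    }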
The big downsides are usually:
Complexity of the framework: you need to understand how the framework works from a user's perspective. However, you usually save time by not going into the details of writing a custom consumer/producer, so it's not as bad as it initially seems.
Flexibility in code: you cannot write arbitrary code anymore. Since the framework handles parallelism for you, you need to think in terms of chunks of data and adjust your algorithms accordingly. Standard SQL operations are usually directly supported though in one form or another.
Less control over resource usage: since the platform schedules tasks across machines, you may end up with unfortunate assignments, and the platform may give you too few options to fix them. Note, though, that in most applications poor resource utilization is caused more by data skew and suboptimal algorithms than by the platform's scheduling.

Questions about using Apache Kafka Streams to implement event sourcing microservices

Event sourcing means a 180-degree shift in the way many of us have been architecting and developing web applications, with lots of advantages but also many challenges.
Apache Kafka is an awesome platform that, through its Kafka Streams API, is advertised as a tool that allows us to implement this paradigm thanks to its many features (decoupling, fault tolerance, scalability...): https://www.confluent.io/blog/event-sourcing-cqrs-stream-processing-apache-kafka-whats-connection/
On the other hand there are some articles discouraging us from using it for event sourcing: https://medium.com/serialized-io/apache-kafka-is-not-for-event-sourcing-81735c3cf5c
These are my questions regarding the suitability of Kafka Streams as an event sourcing platform:
The article above comes from Jesper Hammarbäck (who works for serialized.io, an event sourcing platform). I would like to get an answer to the main problems he brings up:
Loading current state. In my view with log compaction and state stores it's not a problem. Am I right?
Consistent writes.
When moving certain pieces of functionality into Kafka Streams I'm not sure if they do fit naturally:
Authentication & Security: Imagine your customers are stored in a state store generated from a customer topic. Should we keep their passwords in the topic/store? It doesn't sound safe enough, does it? Then how are we supposed to manage this aspect of having customers in a state store and their passwords somewhere else? Any recommended good practice?
Queries: Interactive queries are a nice tool to generate queryable views of our data (by key). That's OK for getting an entity by id, but what about complex queries (joins)? Do we need to generate a state store per query? For instance, one store for customers by id, another one for customers by state, another store for customers who purchased a product last year... It doesn't sound manageable. Another point is the lack of pagination: how can we handle big sets of data when querying the state stores? One more point: we can't do dynamic queries (like the JPA criteria API) anymore. This leads to CQRS, maybe? Complexity keeps growing this way...
Data growth: with databases we are used to having thousands and thousands of rows per table. Kafka Streams applications keep a local state store that will grow and grow over time. How scalable is that? How is that local storage kept (local disk/RAM)? If it's disk, we should provision applications with enough space; if it's RAM, with enough memory.
Loading Current State: The mechanism described in the blog, about re-creating current state ad hoc for a single entity, would indeed be costly with Kafka. However, Kafka Streams follows the philosophy of keeping the current state for all objects in a KTable (which is distributed/sharded). Thus, it's never required to do this -- of course, it comes with certain memory costs.
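For illustration, a minimal Kafka Streams sketch of that KTable approach (topic, store and application ids are made up): the latest value per key is materialized into a sharded local store and read back with a point lookup instead of replaying the log per entity.

    // Sketch only: materialize latest state per key and query it via interactive queries.
    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.common.utils.Bytes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StoreQueryParameters;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.Materialized;
    import org.apache.kafka.streams.state.KeyValueStore;
    import org.apache.kafka.streams.state.QueryableStoreTypes;
    import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

    public class CustomerStateApp {
        public static void main(String[] args) {
            StreamsBuilder builder = new StreamsBuilder();

            // Materialize the latest value per customer id into a local (sharded) store.
            builder.table("customer-topic",
                    Consumed.with(Serdes.String(), Serdes.String()),
                    Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("customer-store"));

            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "customer-state-app");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();
            // (in a real application, wait until the instance reaches the RUNNING state)

            // Interactive query: point lookup of the current state, no per-entity log replay.
            ReadOnlyKeyValueStore<String, String> store = streams.store(
                    StoreQueryParameters.fromNameAndType("customer-store",
                            QueryableStoreTypes.keyValueStore()));
            System.out.println(store.get("customer-42")); // made-up key
        }
    }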
Kafka Streams parallelizes processing over different events. Thus, all interactions for a single event (processing, state updates) are performed by a single thread, and so I don't see why there should be inconsistent writes.
I am not sure what the exact requirement would be. In the current implementation, Kafka Streams does not offer any store-specific authentication or security features. There are several things one could do for security, though: (a) encrypt the local disk: this might be the simplest way to protect data; (b) encrypt messages within the business logic, before you put them into the store.
Interactive Queries offer limited support for many reasons (I don't want to go into details) and were never designed with the goal of supporting complex queries. The idea is the eager computation of results that can then be retrieved with simple lookups. As you pointed out, this is not very scalable (cost intensive) if you have a lot of different queries. To tackle this, it would make sense to load the data into a database and let the DB do what it is built for. Kafka Streams alone is not the right tool for this at the moment -- however, there is no reason not to combine both.
By default, Kafka Streams uses RocksDB to keep local state (you can switch to in-memory stores, too). Thus, it's possible to write to disk and to use very large state. Of course, you need to provision your instances accordingly (cf. https://docs.confluent.io/current/streams/sizing.html). Besides this, Kafka Streams scales horizontally and is fully elastic. Thus, you can add new instances at any point in time, allowing you to hold terabytes of state if you have large disks and enough instances. Note that the number of input topic partitions limits the number of instances you can use (internally, Kafka Streams is a consumer group, and you cannot have more instances than partitions). If this is a concern, it's recommended to over-partition the input topics in the first place.
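A small sketch of that choice (topic and store names are made up): the store backing a table can be switched from the default RocksDB-on-disk store to a purely in-memory one via the Stores factory, which is what decides whether you provision disk or RAM.

    // Sketch: choosing the state store backend explicitly; all names are made up.
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.Materialized;
    import org.apache.kafka.streams.state.KeyValueBytesStoreSupplier;
    import org.apache.kafka.streams.state.Stores;

    public class StoreChoice {
        public static void main(String[] args) {
            StreamsBuilder builder = new StreamsBuilder();

            // Disk-backed RocksDB store (the default) vs. a purely in-memory store:
            KeyValueBytesStoreSupplier onDisk   = Stores.persistentKeyValueStore("customer-store-disk");
            KeyValueBytesStoreSupplier inMemory = Stores.inMemoryKeyValueStore("customer-store-mem");

            // Pick one and pass it to the materialization; sizing (disk vs. RAM) follows from it.
            builder.table("customer-topic",
                    Consumed.with(Serdes.String(), Serdes.String()),
                    Materialized.<String, String>as(inMemory));
        }
    }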

Microservices & Kafka: To couple or not to couple

I'm having a problem wrapping my mind around a probably normal setup of microservices and Kafka that we are currently building.
We have one topic in Kafka and multiple consumers reading from this topic via separate consumer groups.
But somehow I think this could lead to coupling in terms of microservices, as we have two consumers reading the exact same data from the same topic. Additionally, we do not have any retention time for the messages, and therefore I'm treating Kafka as some kind of data store. So I would think we should rather replicate the messages into a topic of their own for the other service/consumer.
We have different opinions on whether this is coupling or decoupling, and I'd like to hear your opinions on what I'm getting wrong, because I feel like I do. Thank you for your support!
In my opinion, using a Kafka topic for multiple services or apps to consume is the right approach as long as your services don't rely on it repeatedly. Meaning: a service should read the topic once, translate the data into whatever it requires, and store it by itself if required. This way the topic doesn't become a permanent data store but rather a decoupled way to feed in data (as if you were to call the service directly with that raw data, but in a more decoupled fashion, by allowing the service to read the topic whenever it is ready for it, at whatever frequency is required). This increases the resilience of your overall system.
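A small sketch of what that looks like in code (service names, group id and topic are made up): each service subscribes with its own group.id, so every consumer group independently receives the full stream and can build its own local representation of the data.

    // Sketch of one of several services; another service would use a different group.id.
    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class BillingServiceConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
            // Every service picks its own group id; "billing-service" and "shipping-service"
            // would each receive the full stream of the shared topic.
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "billing-service");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("orders")); // shared topic, name made up
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        // translate the raw event into the service's own model and persist it locally
                        System.out.printf("billing view of %s -> %s%n", record.key(), record.value());
                    }
                }
            }
        }
    }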
And there is coupling, namely the raw data. But from my perspective it is totally OK for multiple services to understand the same data format (that of the topic), as long as the format is mostly stable. The assumption here is that this is raw data that each service has to transform into a form that is useful for itself. You just have to make sure the raw data format is versioned correctly whenever changes are necessary. And to allow services to continue to work, you will potentially have to deliver multiple versions concurrently until all services support the latest version (a sketch of what that can look like follows below). This type of architectural style is used by many large systems and works as long as you don't have a scenario where the raw data format needs to change very frequently in a way that makes it incompatible with your service designs. (If that were the case, you'd probably need another layer below: a stable meta-model that can describe the dynamic raw data.)
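As a purely hypothetical sketch of the versioning point (the "schema-version" header and the handler names are made up, not an established convention), a consumer can branch on a version marker while old and new producers coexist:

    // Hypothetical sketch of consuming two concurrently delivered versions of the raw format.
    import java.nio.charset.StandardCharsets;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.common.header.Header;

    public class VersionedEventHandler {

        public void handle(ConsumerRecord<String, String> record) {
            Header versionHeader = record.headers().lastHeader("schema-version"); // made-up header
            String version = versionHeader == null
                    ? "1"
                    : new String(versionHeader.value(), StandardCharsets.UTF_8);

            switch (version) {
                case "1" -> handleV1(record.value());
                case "2" -> handleV2(record.value());
                default -> throw new IllegalStateException("unsupported raw data version " + version);
            }
        }

        private void handleV1(String payload) { /* map the old field layout into the service model */ }
        private void handleV2(String payload) { /* map the new field layout into the service model */ }
    }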

Does adding an additional layer of Schema Registry outweigh its benefits?

Are there any benefits to adding an additional layer (aka point of failure) of Schema Registry when producing/consuming messages? If the service ever goes down, then messages won't be consumed or produced. Wouldn't the system using Kafka be less prone to errors by not using Schema Registry, which gives one less point of failure?
One key point of having a schema registry in your architecture is to ensure that your data pipelines are working end-to-end "even during normal operations".
That is, even when all systems are up and running ("all green, 100% uptime!"), a producer application managed by team A, for example, might get updated and now start to generate incompatible data that causes collateral damage to downstream consumers managed by teams B and C that weren't expecting this change.
So when you are making a decision whether or not to use a schema registry, you should not only ask yourself about the scenario "when things fail" (which will most probably happen at some point; that's why e.g. Confluent Schema Registry supports features like a high-availability setup), but also about the guarantees you need for your data pipelines to work in general.
If the service ever goes down, then messages won't be consumed or produced.
In general, yes. In practice, features such as high availability modes for the schema registry service, client-side caching of schemas, etc. all help to minimize any such damage.
Wouldn't the system using Kafka be less prone to errors by not using Schema Registry, which gives one less point of failure?
You are right that, in general, you'd want to avoid introducing a component that would be another point of failure in the chain.
That said, if you are running data pipelines in production -- particularly in a larger organization -- a schema registry also helps to remove "points of failure" by ensuring that data that is written can also always be read. One could argue that failures triggered by data changes can be at least as common as failures triggered by the unavailability of one or more systems.
The schema registry can be configured to be highly available so it is not a single point of failure.
That said, if you want the convenience and schema compatibility rules that come with the schema registry then you want to use it. Not all clients connecting to a Kafka cluster are required to use it, so you can try it without impacting other clients on the same cluster.
Your main alternative to using the schema registry for Avro messages is to add the schema to the message itself. Some users are OK with the larger message size and with not systematically evolving schemas. The schema registry is for those that are concerned with such things.
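For context, a rough sketch of the registry-backed option, assuming Confluent's Avro serializer is on the classpath (URLs, topic and schema are placeholders): the serializer registers and checks the schema against the registry on first use, which is where the compatibility rules mentioned above are enforced, and schema.registry.url accepts a comma-separated list of instances for availability.

    // Sketch of a producer using the Confluent Avro serializer; all names are placeholders.
    import java.util.Properties;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class RegistryAwareProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");          // placeholder
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            // The Avro serializer registers/validates the schema against the registry on send,
            // so an incompatible producer change fails fast instead of breaking consumers later.
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                    "io.confluent.kafka.serializers.KafkaAvroSerializer");
            props.put("schema.registry.url", "http://registry-1:8081,http://registry-2:8081"); // placeholder HA pair

            Schema schema = new Schema.Parser().parse(
                    "{\"type\":\"record\",\"name\":\"Customer\",\"fields\":[{\"name\":\"id\",\"type\":\"string\"}]}");
            GenericRecord customer = new GenericData.Record(schema);
            customer.put("id", "customer-42"); // made-up value

            try (KafkaProducer<String, Object> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("customers", "customer-42", customer));
            }
        }
    }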