Data sync between 2 instances of the same microservice using Kafka

We have a microservice that acts as a cache service, and we decided to run only two instances of it. This microservice receives data through a Kafka topic and stores it in an in-memory cache. But we are having a challenge keeping the data in sync between these two instances. We decided to use a different consumer group for each instance so that both receive the same data and stay in sync. Since both instances share the same codebase, how can each one subscribe to a different consumer group during startup? For example, if instance #1 subscribes to consumergrp1, instance #2 should be able to subscribe to consumergrp2. Please suggest how to achieve this.

You cannot keep in-memory data in sync across multiple instances of a microservice when the data arrives from a streaming system or arrives multiple times. If the data is loaded only once in the pod's lifetime, then you can keep the in-memory data in sync: for example, while the service is starting up, fetch the data from the source and persist it in memory. In that case both pods hold the same data.
A cleaner and neater approach is to use a distributed cache database like Redis or Couchbase.
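If you go the distributed-cache route, a minimal sketch with the Jedis client could look like the following (the host, port, and key names are placeholders, not anything from the original setup):

```java
import redis.clients.jedis.Jedis;

public class SharedCacheExample {
    public static void main(String[] args) {
        // Both microservice instances talk to the same Redis server,
        // so neither needs to keep its own in-memory copy in sync.
        try (Jedis jedis = new Jedis("redis-host", 6379)) {   // placeholder host/port
            // The Kafka consumer (in whichever instance received the record) writes the latest value...
            jedis.set("customer:42", "{\"name\":\"Alice\",\"tier\":\"gold\"}");

            // ...and any instance can read it back when serving a request.
            String cached = jedis.get("customer:42");
            System.out.println(cached);
        }
    }
}
```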

You haven't specified any details about the way you use Kafka (language, third-party libraries, etc.). So, speaking in general, you can:
specify a random (or partially random) consumer group id. It won't be as "clean" as "consumergrp1" and "consumergrp2", but it's a string after all, so you can generate it randomly. A variation of this idea is to embed an identifier of the process in the consumer group name: for example, if the microservice instances are supposed to run on different machines, you could include the machine name as part of the consumer group name (see the sketch at the end of this answer).
More complicated, but still: if you have some shared storage, you could use it for synchronization and store a monotonically increasing counter of the "current consumer group to create". Once the value is read, it has to be incremented. Of course, the implementation details depend on the shared storage you actually use (a database, something like Redis, whatever).
So there are many different possible solutions. Whichever one you take, do not rely on the fact that you have exactly two instances of the service; you may reconsider that in the future.
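As an illustration of the first option, here is a minimal sketch with the plain Java Kafka client that derives the group id from the machine name plus a random suffix (the bootstrap servers and topic name are placeholders):

```java
import java.net.InetAddress;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.UUID;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PerInstanceGroupConsumer {
    public static void main(String[] args) throws Exception {
        // Each instance computes its own group id, so every instance
        // receives every record from the topic.
        String groupId = "cache-" + InetAddress.getLocalHost().getHostName()
                + "-" + UUID.randomUUID();

        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"); // replay the topic to warm the cache

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("cache-topic")); // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Update the local in-memory cache here.
                    System.out.printf("key=%s value=%s%n", record.key(), record.value());
                }
            }
        }
    }
}
```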

Related

When is Kafka Streams GlobalKTable a good choice as a data store in the microservices world?

I'm new to the Kafka Streams world. I'm wondering when to use a Kafka Streams GlobalKTable (with a compacted topic under the hood) instead of a regular database for persisting data, and what the advantages and disadvantages of both solutions are. I guess both ensure data persistence at the same level.
Let's say there is a simple e-commerce app where users register and update their data. And there are two microservices: the first one (service-users) is responsible for registering users and the second one (service-orders) is responsible for placing orders. Now there are two options:
When a new user registers, service-users accepts the request, saves the newly registered user data in its own database (SQL or NoSQL, doesn't matter) and then sends an event to Kafka to propagate this to other services. service-orders receives such an event and stores the necessary user data in its own database. This is the most common pattern (in my experience).
And now the second approach, with GlobalKTable:
When a new user registers or updates their data, service-users accepts the request and sends an event with a snapshot of the user data to Kafka. service-users and service-orders use a GlobalKTable to read information about users.
When should I use which solution? Which solution is better in which cases? What are the advantages and disadvantages of both approaches? Doesn't the second approach break the rule that each microservice should maintain its own data in its own database?
I hope I explained my considerations well and that they make sense at all.
In general the advantages of GlobalKTable are:
You can do a foreign-key join to a GlobalKTable (see the sketch at the end of this answer).
The application has the full data set in memory; the data set is automatically loaded during application startup, and all data modifications are automatically synchronized across all instances. Compared to an architecture with an external database, you don't need to communicate (over the network) with any other resource (like a relational database) during message processing, so processing is much faster and, as a result, you can process large amounts of data quickly. To achieve similar processing performance otherwise, you would need to implement some kind of in-memory cache yourself (like Guava) and then solve all the issues of proper cache management: warming, refreshing, evicting.
And the main disadvantages are:
The application has the full data set in memory; this is an advantage, but it can also be a very big issue, depending on how big your data set is and how you model your data. Referring to your example, storing all user orders in a GlobalKTable sounds like a very bad idea: the data set will grow very fast and keep growing over time, so after a few months or years of running the application in production it can reach gigabytes and will continue to grow. If we still want to store orders in a GlobalKTable for efficient processing, we need to design our data model differently. Probably our entities (Orders, Documents, etc.) have some life cycle, like: new, paid, closed, etc. A few of those states are terminating - I mean there will be no further processing of an entity with a given id (for example a closed Order) - and if there will be no processing, there is no need to keep the data in memory; we can forward it to some other storage, like Elasticsearch, and remove it from the GlobalKTable. We could call the data set with orders still being processed the hot storage, and the data set with terminated orders the cold storage. Long story short: keeping only active/hot Orders in a GlobalKTable could be a good idea.
Querying a GlobalKTable is limited to iterating over the whole data set or a subset, or getting data by record key, or by a key composed with a timestamp.
Processing based on state in an external database has been broadly used for many years, so many developers know how to evolve and maintain that kind of application. We cannot say the same about storing state in Kafka compacted topics.
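To make the join advantage above more concrete, here is a minimal Kafka Streams DSL sketch; the topic names and the way the user id is extracted from the order value are made-up placeholders, not part of the original question:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.GlobalKTable;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

public class OrdersEnrichmentTopology {
    public static StreamsBuilder build() {
        StreamsBuilder builder = new StreamsBuilder();

        // Full snapshot of users, replicated to every application instance.
        GlobalKTable<String, String> users = builder.globalTable(
                "users-topic", Consumed.with(Serdes.String(), Serdes.String()));

        // Orders keyed by order id; the value is assumed to start with "<userId>|...".
        KStream<String, String> orders = builder.stream(
                "orders-topic", Consumed.with(Serdes.String(), Serdes.String()));

        // The key selector maps each order to the user it belongs to,
        // which is what makes this a foreign-key style lookup into the GlobalKTable.
        KStream<String, String> enriched = orders.join(
                users,
                (orderId, orderValue) -> orderValue.split("\\|")[0],
                (orderValue, userValue) -> orderValue + " / " + userValue);

        enriched.to("orders-enriched-topic", Produced.with(Serdes.String(), Serdes.String()));
        return builder;
    }
}
```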

How can I consume a message in Kafka in all the instances of a service

I have a use case where I need to consume a message in all the instances of a service. Let's say my service is running on 5 instances; then a message coming through Kafka needs to be processed on every instance. Since this data is used in many other APIs, we store it in local memory to serve those APIs.
Since this data is used very frequently, I don't want to store it in Redis or some other global cache, which would increase latency and the cost of network calls.
I want to create a pipeline where any change to the data by a third-party service is propagated to all the instances, so that the new data is served in the APIs by all the instances.
It isn't possible with Kafka.
It seems that Kafka isn't the right choice for this case.
I can suggest 3 solutions:
You can use Redis, as you mentioned above, trading off a little latency.
If the services are running on the same machine, you could use shared memory for all the processes to read from (and then you are agnostic to which process got the event).
You can hack something, but it is an anti-pattern and I won't suggest you do so, as you will probably affect the abilities of the consumer group. It's a total abuse of Kafka.
The hack you can do is to consume with a different consumer group in each instance (say, a random UUID generated when you start polling). A rough sketch of the configuration follows.
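For reference, a minimal configuration sketch of that hack, assuming the plain Java Kafka client (the bootstrap servers are a placeholder); because each instance generates its own group id, every instance receives every message:

```java
import java.util.Properties;
import java.util.UUID;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;

public class BroadcastConsumerConfig {
    public static Properties build() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");       // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "svc-" + UUID.randomUUID());  // unique per instance
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // "latest" means a freshly started instance only sees new changes;
        // use "earliest" if it should replay the topic to rebuild its local state.
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");
        return props;
    }
}
```

Note that every restart creates yet another consumer group on the broker, which is part of why the answer calls this an abuse of Kafka.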

How to manage sharded microservice local storage?

Let's assume there is a single consumer group (from the Kafka perspective). The consumer group consists of 20 replicas of a service. All work is balanced among those 20 instances based on some property (a UUID). Each instance manages its own storage/state/reads, which in turn contains only the data belonging to that shard. So there are 20 separate storages, one for each replica. But what happens when those services are scaled up or down? If we scale down, how would the remaining 10 services manage to get all the data previously belonging to the other instances? I assume that each service may emit a so-called "state event" (stream-table duality?) and another instance may take over responsibility for a new part of the overall data based on such a stream. But this is still a lot of work to do; such a stream may consist of millions of items (even if compacted). There must be a more efficient way to achieve this. And what if we scale up? The group leader must now somehow inform the respective instances to drop part of their data. I have read some books/posts about this matter, but I couldn't find any concrete information on how it is managed.
It's unclear why this is tagged apache-kafka, since sharding isn't a Kafka term. In Kafka Streams, the distribution of state stores across separate instances is handled for you via the KTable API. When instances are scaled up or down, the data becomes temporarily inaccessible while the state is rebuilt. Different instances can query each other with "Interactive Queries".
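As an illustration of Interactive Queries, here is a minimal sketch for reading a local state store and finding which instance hosts a given key; the store name and key type are assumptions, and the API shown is the one in recent Kafka Streams versions (roughly 2.7+):

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyQueryMetadata;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class InteractiveQueriesExample {

    // Look up a value in the state store shard held by *this* instance.
    static String localLookup(KafkaStreams streams, String key) {
        ReadOnlyKeyValueStore<String, String> store = streams.store(
                StoreQueryParameters.fromNameAndType("users-store", // assumed store name
                        QueryableStoreTypes.<String, String>keyValueStore()));
        return store.get(key);
    }

    // Find out which instance (host:port) currently hosts the shard for a key,
    // so the request can be forwarded there (e.g. over HTTP).
    static String hostFor(KafkaStreams streams, String key) {
        KeyQueryMetadata metadata = streams.queryMetadataForKey(
                "users-store", key, Serdes.String().serializer());
        return metadata.activeHost().host() + ":" + metadata.activeHost().port();
    }
}
```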

Questions about using Apache Kafka Streams to implement event sourcing microservices

Event sourcing means a 180-degree shift in the way many of us have been architecting and developing web applications, with lots of advantages but also many challenges.
Apache Kafka is an awesome platform that, through its Kafka Streams API, is advertised as a tool that allows us to implement this paradigm thanks to its many features (decoupling, fault tolerance, scalability...): https://www.confluent.io/blog/event-sourcing-cqrs-stream-processing-apache-kafka-whats-connection/
On the other hand there are some articles discouraging us from using it for event sourcing: https://medium.com/serialized-io/apache-kafka-is-not-for-event-sourcing-81735c3cf5c
These are my questions regarding Kafka Streams' suitability as an event sourcing platform:
The article above comes from Jesper Hammarbäck (who works for serialized.io, an event sourcing platform). I would like to get an answer to the main problems he brings up:
Loading current state. In my view with log compaction and state stores it's not a problem. Am I right?
Consistent writes.
When moving certain pieces of functionality into Kafka Streams I'm not sure if they do fit naturally:
Authentication & security: imagine your customers are stored in a state store generated from a customer topic. Should we keep their passwords in the topic/store? It doesn't sound safe enough, does it? Then how are we supposed to manage this aspect of having customers in a state store and their passwords somewhere else? Any recommended good practice?
Queries: interactive queries are a nice tool to generate queryable views of our data (by key). That's fine for getting an entity by id, but what about complex queries (joins)? Do we need to generate a state store per query? For instance, one store for customers by id, another one for customers by state, another store for customers who purchased a product last year... It doesn't sound manageable. Another point is the lack of pagination: how can we handle big data sets when querying the state stores? One more point: we can't do dynamic queries (like the JPA Criteria API) anymore. This leads to CQRS, maybe? Complexity keeps growing this way...
Data growth: with databases we are used to having thousands and thousands of rows per table. Kafka Streams applications keep a local state store that will grow and grow over time. How scalable is that? How is that local storage kept (local disk/RAM)? If it's disk, we should provision applications with enough space; if it's RAM, with enough memory.
Loading current state: the mechanism described in the blog, about re-creating current state ad hoc for a single entity, would indeed be costly with Kafka. However, Kafka Streams follows the philosophy of keeping the current state of all objects in a KTable (that is distributed/sharded). Thus, it's never required to do this -- of course, it comes with a certain memory cost.
Kafka Streams parallelizes based on different events. Thus, all interactions for a single event (processing, state updates) are performed by a single thread, so I don't see why there should be inconsistent writes.
I am not sure what the exact requirement would be. In the current implementation, Kafka Streams does not offer any store-specific authentication or security features. There are several things one could do for security though: (a) encrypt the local disk; this might be the simplest thing to do to protect data. (b) Encrypt messages within the business logic, before you put them into the store.
Interactive Queries offers limited support for many reasons (I don't want to go into details) and it was never designed with the goal of supporting complex queries. The idea is eager computation of results that can be retrieved with simple lookups. As you pointed out, this is not very scalable (it's cost intensive) if you have a lot of different queries. To tackle this, it would make sense to load the data into a database and let the DB do what it is built for. Kafka Streams alone is not the right tool for this atm -- however, there is no reason not to combine both.
By default, Kafka Streams uses RocksDB to keep local state (you can switch to in-memory stores, too). Thus, it's possible to write to disk and to use very large state. Of course, you need to provision your instances accordingly (cf. https://docs.confluent.io/current/streams/sizing.html). Besides this, Kafka Streams scales horizontally and is fully elastic. Thus, you can add new instances at any point in time, allowing you to hold terabytes of state if you have large disks and enough instances. Note that the number of input topic partitions limits the number of instances you can use (internally, Kafka Streams is a consumer group, and you cannot have more instances than partitions). If this is a concern, it's recommended to over-partition the input topics in the first place.
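To illustrate the default RocksDB store versus an explicitly in-memory store, a minimal sketch could look like this (the topic and store names are made up for the example):

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueBytesStoreSupplier;
import org.apache.kafka.streams.state.Stores;

public class StateStoreChoices {
    public static void define(StreamsBuilder builder) {
        // Default: a persistent RocksDB store spilled to local disk,
        // so the state can be much larger than the available RAM.
        KTable<String, String> onDisk = builder.table(
                "customers-topic",
                Consumed.with(Serdes.String(), Serdes.String()),
                Materialized.as("customers-store"));

        // Alternative: an in-memory store; faster lookups,
        // but the whole data set must fit in RAM on each instance.
        KeyValueBytesStoreSupplier inMemory = Stores.inMemoryKeyValueStore("customers-inmem");
        KTable<String, String> inRam = builder.table(
                "customers-changelog-topic",
                Consumed.with(Serdes.String(), Serdes.String()),
                Materialized.<String, String>as(inMemory)
                        .withKeySerde(Serdes.String())
                        .withValueSerde(Serdes.String()));
    }
}
```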

Simple approach to synchronizing data across an Akka cluster?

I've got some run-time data that I'd like to exist on a designated actor on every node in my Akka cluster, and that could be updated via an internal event or an API call to a single node. I could store this data in a shared database to make it permanent, but I'd rather just keep it in memory for speed, since it doesn't need to be persisted. Akka Cluster Singleton, Distributed Pub Sub, and possibly other built-in modules use gossip protocols to keep distributed state in sync.
Is there a ready-built way to get this kind of data synchronization for my own actors across the cluster?
I've thought about just publishing changes via Distributed Pub Sub, but it seems like this wouldn't be resilient to dropped messages. If I stored the data in a cluster singleton, it wouldn't survive that node going down. I don't need persistence if the entire cluster goes down, but I do want resilience if individual nodes do.
You should have a look at Akka Distributed Data, which should really be called "Akka Replicated Data", as it will replicate the data across all nodes.
It provides a simple key-value store, and any changes made on one node will be replicated to all others. As all data is kept on all nodes, it's best used for small data sets. Also, the values in your key-value pairs need to be CRDTs (conflict free replicated data types). The module comes with some pre-defined CRDTs that cover a lot of use cases.