Simple approach to synchronizing data across an Akka cluster? - scala

I've got some run-time data I'd like to exist on a designated actor on every node in my Akka cluster, which could be updated via internal event or API call to a single node. I could store this data in a shared database to make it permanent, but I'd rather just store it in memory for speed, since it doesn't need to be persisted. Akka Cluster Singleton, Distributed Pub Sub, and possibly other built-in modules use gossip protocols to keep distributed state in sync.
Is there a ready-built way to add this kind of data synchronization to my own actors across the cluster?
I've thought about just publishing changes via Distributed Pub Sub, but it seems like this wouldn't be resilient to dropped messages. If I stored the data in a Cluster Singleton, it wouldn't survive that node going down. I don't need persistence if the entire cluster goes down, but I do want resilience when individual nodes do.

You should have a look at Akka Distributed Data, which should really be called "Akka Replicated Data", as it will replicate the data across all nodes.
It provides a simple key-value store, and any changes made on one node will be replicated to all others. As all data is kept on all nodes, it's best used for small data sets. Also, the values in your key-value pairs need to be CRDTs (conflict free replicated data types). The module comes with some pre-defined CRDTs that cover a lot of use cases.
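For example, a minimal sketch of such a replicated cache actor against the classic Distributed Data API might look like the following; the key name, the message protocol, and the choice of WriteLocal (letting gossip spread the update) are illustrative assumptions, not something from the original question:

```scala
import akka.actor.Actor
import akka.cluster.ddata.{DistributedData, LWWMap, LWWMapKey, SelfUniqueAddress}
import akka.cluster.ddata.Replicator._

object SettingsCache {
  // Hypothetical protocol for the example.
  final case class Put(key: String, value: String)
  final case class Get(key: String)
}

// One instance of this actor runs on every node; Distributed Data keeps them in sync.
class SettingsCache extends Actor {
  import SettingsCache._

  private val replicator = DistributedData(context.system).replicator
  private implicit val node: SelfUniqueAddress =
    DistributedData(context.system).selfUniqueAddress

  // Replicated last-writer-wins map under an illustrative key name.
  private val DataKey = LWWMapKey[String, String]("runtime-settings")

  // Receive a Changed notification whenever any node updates the map.
  replicator ! Subscribe(DataKey, self)

  private var cache = Map.empty[String, String]

  def receive: Receive = {
    case Put(k, v) =>
      // WriteLocal: apply the change locally and let gossip replicate it to the other nodes.
      replicator ! Update(DataKey, LWWMap.empty[String, String], WriteLocal)(_ :+ (k -> v))

    case c @ Changed(DataKey) =>
      // Another node (or this one) changed the map; refresh the local view.
      cache = c.get(DataKey).entries

    case Get(k) =>
      sender() ! cache.get(k)

    case _: UpdateResponse[_] => // ignore replication acknowledgements in this sketch
  }
}
```

An LWWMap resolves concurrent updates with last-writer-wins semantics; if you need different merge behaviour, one of the other bundled CRDTs (ORMap, ORSet, GCounter, ...) may fit better.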

Related

How can I consume a message in Kafka in all the instances of a service

I have a use case where I need to consume a message in all the instances of a service. Let's say my service is running on 5 instances; then a message coming through Kafka needs to be processed on every instance. Since this data is used by many other APIs, we store it in local memory to serve those APIs.
Since this data is used very frequently, I don't want to keep it in Redis or some other global cache, which would add latency and the cost of network calls.
I want to create a pipeline where any change in the data made by the third-party service is propagated to all the instances, so that every instance serves the new data in its APIs.
This isn't really possible with Kafka alone; it doesn't seem like the right choice for this case.
I can suggest 3 solutions:
1. You can use Redis as you mentioned above, trading off a little latency.
2. If the services are running on the same machine, you could use shared memory that all the processes read from (then you are agnostic to which process got the event).
3. You can hack something, but it is an anti-pattern and I won't suggest you do so, as you will probably undermine the consumer-group semantics. It is a total abuse of Kafka. The hack is to consume with a different consumer group on each instance (say, a random UUID generated when you start polling), as sketched below.
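For completeness, here is roughly what that hack could look like with the plain Kafka consumer API in Scala. The broker address, topic name, and the updateLocalCache helper are placeholders; note that every restart leaves a dangling consumer group behind, which is part of why this is an anti-pattern:

```scala
import java.time.Duration
import java.util.{Collections, Properties, UUID}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.serialization.StringDeserializer
import scala.jdk.CollectionConverters._

// Hypothetical local-cache update; replace with your own logic.
def updateLocalCache(key: String, value: String): Unit = ()

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")                    // assumed broker address
props.put("group.id", s"my-service-${UUID.randomUUID()}")           // unique group per instance
props.put("key.deserializer", classOf[StringDeserializer].getName)
props.put("value.deserializer", classOf[StringDeserializer].getName)
props.put("auto.offset.reset", "latest")                            // only care about new updates

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("data-updates"))       // placeholder topic

// Because every instance is its own consumer group, every instance sees every record.
while (true) {
  val records = consumer.poll(Duration.ofMillis(500))
  records.asScala.foreach(r => updateLocalCache(r.key(), r.value()))
}
```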

How to manage sharded microservice local storage?

Let's assume there is a single consumer group (from the Kafka perspective). The consumer group consists of 20 replicas of a service, and all work is balanced among those 20 instances based on some property (a UUID). Each instance manages its own storage/state/reads, which contain only the data belonging to its shard. So there are 20 separate storages, one for each replica.
But what happens when those services are scaled up or down? Say we scale down to 10 instances: how would the remaining 10 get all the data that previously belonged to the other instances? I assume each service could emit a "state event" (stream-table duality?) and another instance could take over responsibility for part of the overall data based on such a stream. But that is still a lot of work; such a stream may contain millions of items (even if compacted). There must be a more efficient way to achieve this. And what if we scale up? The group leader must now somehow tell the respective instances to drop part of their data. I have read some books and posts on the matter, but I couldn't find any concrete information on how this is managed.
It's unclear why this is tagged apache-kafka, since sharding isn't a Kafka term. Kafka Streams, however, handles the distribution of state stores across separate instances through the KTable API. When instances are scaled up or down, the data becomes temporarily inaccessible while the state is rebuilt. Different instances can query each other's stores with "Interactive Queries".
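As a rough illustration (assuming a recent Kafka Streams with the Scala DSL, 2.5+ for StoreQueryParameters; the topic, store, and application names below are made up), materializing a topic as a sharded KTable and querying the locally held shard could look like this:

```scala
import java.util.Properties
import org.apache.kafka.streams.{KafkaStreams, StoreQueryParameters, StreamsConfig}
import org.apache.kafka.streams.scala.{ByteArrayKeyValueStore, StreamsBuilder}
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.serialization.Serdes._
import org.apache.kafka.streams.scala.kstream.Materialized
import org.apache.kafka.streams.state.QueryableStoreTypes

val builder = new StreamsBuilder()

// Materialize the topic into a local state store; each instance only holds
// the partitions (shards) assigned to it by the consumer group protocol.
val entities = builder.table[String, String](
  "entity-events",
  Materialized.as[String, String, ByteArrayKeyValueStore]("entity-store"))

val props = new Properties()
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "entity-view-service")
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

val streams = new KafkaStreams(builder.build(), props)
streams.start()

// Query the shard held by *this* instance.
val store = streams.store(
  StoreQueryParameters.fromNameAndType(
    "entity-store",
    QueryableStoreTypes.keyValueStore[String, String]()))
val value: Option[String] = Option(store.get("some-key"))
```

Keys owned by other instances have to be fetched from them, typically by exposing the store lookup over a small HTTP endpoint; KafkaStreams' queryMetadataForKey can tell you which instance owns a given key.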

Limitations of Kafka as a Distributed DB

I have an application which requires an interesting orchestration between states of instances distributed across geographic regions, in combination with the need for a scalable distributed database.
At the moment I think that Kafka with log compaction will fit my needs for state maintenance and message exchange between instances, and Cassandra will fit my needs for high volume distributed reads and writes of persisted data.
However, there is quite a lot of data duplicated that way: much of the data exchanged via Kafka would also need to be stored in Cassandra for distributed data access. So using Kafka for both messaging and distributed data querying and persistence seems tempting.
Therefore, I'm interested in the real-world pros and cons to expect when using, for example, the pull queries feature of ksqlDB to use Kafka as a distributed database [1].
Still, I'm a bit skeptical about what to expect in terms of performance and scalability, especially compared to Cassandra, as well as about unknown pitfalls.
What are the trade-offs when using Kafka as a distributed DB, and how does it compare performance-wise to "native" distributed systems like Cassandra?
[1] https://www.confluent.io/de-de/blog/pull-queries-in-preview-confluent-cloud-ksqdb/
If your access pattern is pure KV lookups, then Kafka StateStores / Interactive Queries can work, but with the caveat that if you use containers and an orchestrator, you need to keep the state of those stores on persistent volumes. Otherwise, when the containers move to a fresh host, the Streams changelog topic needs to be read from the very beginning, giving you a "cold-start" problem during which you will be unable to query.
Any database (with persistent storage) will not have this problem and will always be able to serve queries immediately.
I'm not sure I would suggest Cassandra for strictly KV data, though.
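As an illustration of the persistent-volume point above, the relevant Streams settings might look like this (the application id, broker address, and mount path are assumptions about the deployment):

```scala
import java.util.Properties
import org.apache.kafka.streams.StreamsConfig

// Keep the local state stores on a mounted persistent volume so a rescheduled
// container does not have to rebuild them from the changelog topic from scratch.
val props = new Properties()
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "kv-lookup-service")
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092")
props.put(StreamsConfig.STATE_DIR_CONFIG, "/mnt/streams-state")   // persistent volume mount
props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, "1")         // warm replica on another instance
```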

data sync between 2 instances of same microservice using kafka

We have a microservice that acts as a cache service, and we decided to run only 2 instances of it. This microservice receives data through a Kafka topic and stores it in an in-memory cache. But we are having trouble keeping the data in sync between these 2 instances. We decided to use a different consumer group for each instance so that both receive the same data and stay in sync. Since they share the same codebase, how can each instance subscribe to a different consumer group at startup? For example, if instance #1 subscribes to consumergrp1, instance #2 should subscribe to consumergrp2. Please suggest how to achieve this.
You cannot keep in-memory data in sync across multiple instances of a microservice when the data keeps arriving from a streaming system or arrives multiple times. If the data is loaded only once in the pod's lifetime, you can achieve it: for example, while the service is starting up, fetch the data from the source and keep it in memory. In that case both pods hold the same data.
Otherwise, you need a distributed cache such as Redis or Couchbase. That will be the cleaner and neater approach here.
You haven't specified any details about the way you use Kafka (language, third-party libraries), etc. So, speaking in general, you can:
1. Specify a random (or partially random) consumer group id. It won't be as "clean" as "consumergrp1" and "consumergrp2", but it's a string after all, so you can generate it randomly. A variation of this idea is to encode the identity of the process in the consumer group name: for example, if the microservice instances are supposed to run on different machines, you could include the machine name as part of the consumer group name (see the sketch below).
2. More complicated, but still: if you have some shared storage, you could use it as a synchronization point and store a monotonically increasing counter of the "next consumer group to create". Once the value is read, it has to be incremented. Of course, the implementation details depend on the shared storage you actually use (a DB, something like Redis, whatever).
So there are many possible solutions. Whichever you choose, do not rely on the fact that you have exactly two instances of the service; you may reconsider that in the future.
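A tiny sketch of the machine-name variant from the first suggestion (the group prefix is just an example; the rest of the consumer setup is the same as in the earlier sketch):

```scala
import java.net.InetAddress
import java.util.Properties
import org.apache.kafka.clients.consumer.ConsumerConfig

// Derive the consumer group from the host name instead of a random UUID, so a
// restarted instance rejoins the same group and keeps its committed offsets.
val hostName = InetAddress.getLocalHost.getHostName
val props = new Properties()
props.put(ConsumerConfig.GROUP_ID_CONFIG, s"cache-service-$hostName")
```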

Questions about using Apache Kafka Streams to implement event sourcing microservices

Event sourcing means a 180 degree shift in the way many of us have been architecting and developing web applications, with lots of advantages but also many challenges.
Apache Kafka is an awesome platform that, through its Apache Kafka Streams API, is advertised as a tool that allows us to implement this paradigm thanks to its many features (decoupling, fault tolerance, scalability...): https://www.confluent.io/blog/event-sourcing-cqrs-stream-processing-apache-kafka-whats-connection/
On the other hand there are some articles discouraging us from using it for event sourcing: https://medium.com/serialized-io/apache-kafka-is-not-for-event-sourcing-81735c3cf5c
These are my questions regarding Kafka Streams' suitability as an event sourcing platform:
The article above comes from Jesper Hammarbäck (who works for serialized.io, an event sourcing platform). I would like to get an answer to the main problems he brings up:
Loading current state. In my view with log compaction and state stores it's not a problem. Am I right?
Consistent writes.
When moving certain pieces of functionality into Kafka Streams I'm not sure if they do fit naturally:
Authentication & Security: Imagine your customers are stored in a state store generated from a customer-topic. Should we keep their passwords in the topic/store? It doesn't sound safe enough, does it? Then how are we supposed to manage this aspect of having customers on a state store and their passwords somewhere else? Any recommended good practice?
Queries: Interactive Queries are a nice tool to generate queryable views of our data (by key). That's fine for getting an entity by id, but what about complex queries (joins)? Do we need to generate a state store per query? For instance, one store for customers by id, another for customers by state, another for customers who purchased a product last year... It doesn't sound manageable. Another point is the lack of pagination: how can we handle big data sets when querying the state stores? One more point: we can't do dynamic queries (like the JPA criteria API) anymore. Does this lead to CQRS? Complexity keeps growing this way...
Data growth: with databases we are used to having thousands and thousands of rows per table. Kafka Streams applications keep a local state store that will grow and grow over time. How scalable is that? How is that local storage kept (local disk/RAM)? If it's disk, we should provision applications with enough space; if it's RAM, enough memory.
Loading current state: The mechanism described in the blog, about re-creating the current state ad hoc for a single entity, would indeed be costly with Kafka. However, Kafka Streams follows the philosophy of keeping the current state of all objects in a KTable (that is distributed/sharded). Thus, it's never required to do this -- of course, it comes with certain memory costs.
Consistent writes: Kafka Streams parallelizes work over different events. Thus, all interactions for a single event (processing, state updates) are performed by a single thread, so I don't see why there should be inconsistent writes.
Authentication & security: I am not sure what the exact requirement would be. In the current implementation, Kafka Streams does not offer any store-specific authentication or security features. There are several things one could do for security, though: (a) encrypt the local disk; this might be the simplest way to protect data. (b) Encrypt messages within the business logic, before you put them into the store.
Queries: Interactive Queries offers limited support, for many reasons (I don't want to go into details), and it was never designed with the goal of supporting complex queries. The idea is eager computation of results that can then be retrieved with simple lookups. As you pointed out, this is not very scalable (it's cost intensive) if you have a lot of different queries. To tackle this, it would make sense to load the data into a database and let the DB do what it is built for. Kafka Streams alone is not the right tool for this at the moment -- however, there is no reason not to combine both.
Data growth: By default Kafka Streams uses RocksDB to keep local state (you can switch to in-memory stores, too). Thus, it's possible to write to disk and to use very large state. Of course, you need to provision your instances accordingly (cf. https://docs.confluent.io/current/streams/sizing.html). Besides this, Kafka Streams scales horizontally and is fully elastic; you can add new instances at any point in time, allowing you to hold terabytes of state if you have large disks and enough instances. Note that the number of input topic partitions limits the number of instances you can use (internally, a Kafka Streams application is a consumer group, and you cannot have more instances than partitions). If this is a concern, it's recommended to over-partition the input topics in the first place.
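To illustrate the RocksDB vs. in-memory choice (using the Kafka Streams Scala DSL; the topic and store names below are placeholders):

```scala
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.serialization.Serdes._
import org.apache.kafka.streams.scala.kstream.Materialized
import org.apache.kafka.streams.state.Stores

val builder = new StreamsBuilder()

// Default: the KTable is backed by RocksDB on local disk (under state.dir),
// so the state can grow well beyond what fits in RAM.
val customers = builder.table[String, String]("customers")

// Alternative: an explicitly in-memory store, bounded by available memory.
val snapshots = builder.table[String, String](
  "customer-snapshots",
  Materialized.as[String, String](Stores.inMemoryKeyValueStore("customers-in-memory")))
```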