How can I consume a Kafka message in all the instances of a service - apache-kafka

I have a use case where I need to consume a message in all the instances of a service. Let's say my service is running on 5 instances; then each message coming through Kafka needs to be processed on every instance. Since this data is used by many other APIs, we store it in local memory to serve them.
Since this data is used very frequently, I don't want to store it in Redis or some other global cache, which would add latency and the cost of network calls.
I want to create a pipeline where any change to the data by a third-party service is propagated to all the instances, so that every instance serves the new data in its APIs.

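It isn't possible with Kafka within a single consumer group, because each message is delivered to only one consumer in the group.
It seems that Kafka isn't the right choice for this case.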
I can suggest 3 solutions:
You can use Redis as you mentioned above, trading off a little latency.
If the services are running on the same machine, you could use shared memory for all the processes to read from (and then it doesn't matter which process got the event).
You can hack something together, but it is an anti-pattern and I wouldn't recommend it, as you will probably lose the benefits of the consumer group. It's a total abuse of Kafka.
The hack is to consume with a different consumer group at each instance (let's say a random UUID chosen when you start polling).
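To make the last option concrete, here is a minimal sketch of building consumer settings with a group id that is unique per process. The function name, the "cache-service" prefix and the broker address are hypothetical; the keys are Kafka's standard consumer property names, and how they are passed to a consumer depends on the client library you use.

```python
import uuid


def build_consumer_config(bootstrap_servers: str, service_name: str) -> dict:
    """Return consumer settings whose group id is unique to this process."""
    # A fresh UUID per start means no two instances share a consumer group,
    # so every instance receives every message on the topic.
    group_id = f"{service_name}-{uuid.uuid4()}"
    return {
        "bootstrap.servers": bootstrap_servers,
        "group.id": group_id,
        # A brand-new group has no committed offsets: "latest" starts from
        # new messages only, "earliest" would replay the retained topic.
        "auto.offset.reset": "latest",
        # Offsets of a throwaway group are never reused after a restart.
        "enable.auto.commit": False,
    }


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(build_consumer_config("localhost:9092", "cache-service"))
```

Note that every restart creates a new group, so an instance only sees messages from the offset chosen by auto.offset.reset; whatever it held in memory before the restart has to be rebuilt some other way.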

Related

data sync between 2 instances of same microservice using kafka

We have a microservice that acts as a cache service, and we decided to have only 2 instances of it up and running. The microservice receives data through a Kafka topic and stores it in an in-memory cache. But we are having trouble keeping the data in sync between the 2 instances. We decided to use a different consumer group for each instance so that both receive the same data and stay in sync. Since both run the same codebase, how can each instance subscribe with a different consumer group at startup? For example, if instance 1 subscribes with consumergrp1, instance 2 should subscribe with consumergrp2. Please suggest how to achieve this.
You cannot keep in-memory data in sync across multiple instances of a microservice when the data comes from a streaming system or arrives repeatedly. If the data is loaded only once in the pod's lifetime, then keeping the in-memory data in sync is achievable. For example, while the service is starting up, you can fetch the data from the source and keep it in memory. In this case both pods hold the same data.
You need to use a distributed cache such as Redis or Couchbase. That will be the cleaner approach.
You haven't specified any details about the way you use Kafka (language, third-party libraries, etc.). So, speaking "in general", you can:
Specify a random (or partially random) consumer group id. It won't be as "clean" as "consumergrp1" and "consumergrp2", but it's a string after all, so you can generate it randomly. This idea includes putting an identifier of the process into the name of the consumer group. For example, if the microservice instances are supposed to run on different machines, you could include the machine name as part of the consumer group name.
More complicated, but still an option: if you have some shared storage, you could use it for "synchronization" and store a monotonically increasing counter of the "current consumer group to create". Once the value is read, it has to be incremented. Of course, the implementation details depend on the shared storage you actually use (a DB, something like Redis, whatever).
So there are many possible solutions. As a suggestion, whichever solution you pick, do not rely on having exactly two instances of the service; you may reconsider that in the future.
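Both ideas can be sketched in a few lines. Everything below is illustrative: the function and class names are hypothetical, and GroupIdAllocator is only an in-process stand-in for the shared counter. In a real deployment the increment would have to be an atomic operation on the shared storage itself (for example a database sequence or Redis INCR), since a lock inside one process does not coordinate separate instances.

```python
import socket
import threading
from typing import Optional


def group_id_from_host(prefix: str, hostname: Optional[str] = None) -> str:
    """Derive a consumer group id from the machine name."""
    # Falls back to this machine's hostname when none is passed in.
    host = hostname or socket.gethostname()
    return f"{prefix}-{host}"


class GroupIdAllocator:
    """In-memory stand-in for a counter kept in shared storage."""

    def __init__(self, prefix: str) -> None:
        self._prefix = prefix
        self._next = 1
        self._lock = threading.Lock()

    def allocate(self) -> str:
        # Read the current value and increment it as one atomic step,
        # so no two callers ever get the same number.
        with self._lock:
            value = self._next
            self._next += 1
        return f"{self._prefix}{value}"


if __name__ == "__main__":
    allocator = GroupIdAllocator("consumergrp")
    print(allocator.allocate(), allocator.allocate())
    print(group_id_from_host("cache-service"))
```

The hostname variant keeps the same group id across restarts of an instance, so committed offsets can be reused, provided hostnames are stable and unique per instance.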

What is the better way to have a statistical information among the events in Kafka?

I have a project where I need to provide statistical information via an API to external services. In this service I use only Kafka as "storage". When the application starts, it reads the last week of events from the cluster and computes some values. It then keeps listening for new events to update the information. An example of the information is "how many times item x was sold".
Startup of the application takes a lot of time and brings some other problems with it. It is a Kubernetes service, and the readiness probe fails from time to time when reading the last week's events takes too long.
Two alternatives came to my mind to replace the entire logic:
Kafka Streams or KSQL (I'm not sure if I will need same amount of memory and computation unit here)
Cache Database
I'm wondering which idea would be better here? Or is there any idea better than them?
First, I hope this is a compacted topic that you are reading, otherwise, your "x times" will be misleading as data is deleted from the topic.
Any option you choose will require reading from the beginning of the topic, so the solution will come down to starting a persistent consumer that either:
Stores data on disk in RocksDB (such as a Kafka Streams or KSQL KTable)
Stores data in some other database of your choice. Redis would be a good option, but so would Couchbase if you want to use Memcached
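As a rough illustration of the "persistent consumer" idea, the sketch below keeps the counts and the next offset to read in a local SQLite file, so a restart can resume where it left off instead of replaying a week of events. The class name, table layout and file name are hypothetical, SQLite stands in for RocksDB or whichever store you choose, and the one-week window from the question is left out for brevity.

```python
import sqlite3


class SalesCounter:
    """Counts sold items and remembers how far each partition was read."""

    def __init__(self, path: str) -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS counts "
            "(item TEXT PRIMARY KEY, n INTEGER NOT NULL)"
        )
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS progress "
            "(part INTEGER PRIMARY KEY, next_offset INTEGER NOT NULL)"
        )
        self._db.commit()

    def next_offset(self, partition: int) -> int:
        """Offset to resume from after a restart (0 if nothing was read)."""
        row = self._db.execute(
            "SELECT next_offset FROM progress WHERE part = ?", (partition,)
        ).fetchone()
        return row[0] if row else 0

    def apply(self, partition: int, offset: int, item: str) -> bool:
        """Count one sale event; returns False for already-applied offsets."""
        if offset < self.next_offset(partition):
            return False
        # The count and the offset are written in one transaction, so a
        # crash cannot leave them out of step with each other.
        with self._db:
            self._db.execute(
                "INSERT OR IGNORE INTO counts (item, n) VALUES (?, 0)", (item,)
            )
            self._db.execute(
                "UPDATE counts SET n = n + 1 WHERE item = ?", (item,)
            )
            self._db.execute(
                "INSERT OR REPLACE INTO progress (part, next_offset) "
                "VALUES (?, ?)",
                (partition, offset + 1),
            )
        return True

    def count(self, item: str) -> int:
        row = self._db.execute(
            "SELECT n FROM counts WHERE item = ?", (item,)
        ).fetchone()
        return row[0] if row else 0

    def close(self) -> None:
        self._db.close()
```

On startup the consumer would seek each partition to next_offset(partition) and call apply for every record it reads. On Kubernetes the database file has to live on a persistent volume to survive pod restarts.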

Questions about using Apache Kafka Streams to implement event sourcing microservices

Event sourcing means a 180 degree shift in the way many of us have been architecting and developing web applications, with lots of advantages but also many challenges.
Apache Kafka is an awesome platform that, through its Kafka Streams API, is advertised as a tool that lets us implement this paradigm thanks to its many features (decoupling, fault tolerance, scalability...): https://www.confluent.io/blog/event-sourcing-cqrs-stream-processing-apache-kafka-whats-connection/
On the other hand there are some articles discouraging us from using it for event sourcing: https://medium.com/serialized-io/apache-kafka-is-not-for-event-sourcing-81735c3cf5c
These are my questions regarding Kafka Streams' suitability as an event sourcing platform:
The article above comes from Jesper Hammarbäck (who works for serialized.io, an event sourcing platform). I would like to get an answer to the main problems he brings up:
Loading current state. In my view with log compaction and state stores it's not a problem. Am I right?
Consistent writes.
When moving certain pieces of functionality into Kafka Streams I'm not sure if they do fit naturally:
Authentication & Security: Imagine your customers are stored in a state store generated from a customer-topic. Should we keep their passwords in the topic/store? It doesn't sound safe enough, does it? Then how are we supposed to manage this aspect of having customers on a state store and their passwords somewhere else? Any recommended good practice?
Queries: Interactive queries are a nice tool to generate queryable views of our data (by key). That's OK for getting an entity by id, but what about complex queries (joins)? Do we need to generate a state store per query? For instance, one store for customers by id, another for customers by state, another for customers who purchased a product last year... It doesn't sound manageable. Another point is the lack of pagination: how can we handle big sets of data when querying the state stores? One more point: we can't do dynamic queries (like the JPA criteria API) anymore. This leads to CQRS maybe? Complexity keeps growing this way...
Data growth: with databases we are used to having thousands and thousands of rows per table. Kafka Streams applications keep a local state store that will grow and grow over time. How scalable is that? How is that local storage kept (local disk/RAM)? If it's on disk, we should provision applications with enough space; if it's in RAM, with enough memory.
Loading Current State: The mechanism described in the blog, re-creating the current state ad hoc for a single entity, would indeed be costly with Kafka. However, Kafka Streams follows the philosophy of keeping the current state for all objects in a KTable (which is distributed/sharded). Thus, it's never required to do this. Of course, it comes with certain memory costs.
Consistent Writes: Kafka Streams parallelizes processing across different events. Thus, all interactions for a single event (processing, state updates) are performed by a single thread, and I don't see why there should be inconsistent writes.
Authentication & Security: I am not sure what the exact requirement would be. In the current implementation, Kafka Streams does not offer any store-specific authentication or security features. There are several things one could do for security though: (a) encrypt the local disk, which might be the simplest way to protect the data; (b) encrypt messages within the business logic, before you put them into the store.
Queries: Interactive Queries offers limited support for many reasons (I don't want to go into details), and it was never designed with the goal of supporting complex queries. The idea is eager computation of results that can then be retrieved with simple lookups. As you pointed out, this is not very scalable (it is cost-intensive) if you have a lot of different queries. To tackle this, it would make sense to load the data into a database and let the DB do what it is built for. Kafka Streams alone is not the right tool for this at the moment. However, there is no reason not to combine both.
Data growth: By default, Kafka Streams uses RocksDB to keep local state (you can switch to in-memory stores, too). Thus, it's possible to write to disk and to use very large state. Of course, you need to provision your instances accordingly (cf. https://docs.confluent.io/current/streams/sizing.html). Besides this, Kafka Streams scales horizontally and is fully elastic. Thus, you can add new instances at any point in time, allowing you to hold terabytes of state if you have large disks and enough instances. Note that the number of input topic partitions limits the number of instances you can use (internally, Kafka Streams is a consumer group, and you cannot have more active instances than partitions). If this is a concern, it's recommended to over-partition the input topics in the first place.
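The partition limit can be illustrated with a toy assignment. This is only a sketch: the function name is hypothetical, and plain round-robin is a simplification rather than the assignor Kafka Streams actually uses. It does show why instances beyond the partition count receive no work.

```python
from typing import Dict, List


def assign_partitions(num_partitions: int, instances: List[str]) -> Dict[str, List[int]]:
    """Spread partitions over instances round-robin (simplified)."""
    if not instances:
        raise ValueError("at least one instance is required")
    assignment: Dict[str, List[int]] = {name: [] for name in instances}
    for partition in range(num_partitions):
        # Each partition goes to exactly one instance in the group.
        owner = instances[partition % len(instances)]
        assignment[owner].append(partition)
    return assignment


if __name__ == "__main__":
    # Hypothetical setup: 4 partitions shared by 6 instances.
    result = assign_partitions(4, ["a", "b", "c", "d", "e", "f"])
    idle = [name for name, parts in result.items() if not parts]
    print(result)
    print("idle instances:", idle)
```

With 4 partitions and 6 instances, two instances end up with an empty list, which is why over-partitioning the input topics up front leaves room to scale out later.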

Microservices & Kafka: To couple or not to couple

I'm having trouble wrapping my mind around a probably normal setup of microservices and Kafka that we are currently putting together.
We have one topic in Kafka and multiple consumers reading from this topic via separate consumer groups.
But somehow I think this could lead to coupling between the microservices, as we have two consumers reading exactly the same data from the same topic. Additionally, we do not have any retention time for the messages, and therefore I'm treating Kafka as some kind of data store. So I would think we should rather replicate the messages into a separate topic for each additional service/consumer.
We have different opinions on whether this is coupling or decoupling, and I'd like to hear your opinions on what I'm getting wrong, because I feel like I am. Thank you for your support!
In my opinion, using a Kafka topic for multiple services or apps to consume is the right approach as long as your services don't rely on it repeatedly. That is, a service should read the queue once, translate the data into whatever it requires, and store it by itself if required. This way the topic doesn't become a permanent data store but rather a decoupled way to input data (as if you were to call the service directly with that raw data, but in a more decoupled fashion, by allowing the service to read the topic whenever it is ready and at whatever frequency it requires). This increases the resilience of your overall system.
And there is a coupling, namely the raw data. But from my perspective it is totally OK for multiple services to understand the same data format (of the topic), as long as the format is mostly stable. The assumption here is that this is raw data that each service has to transform into a form that is useful for itself. You just have to make sure the raw data format is versioned correctly whenever changes are necessary. And to allow services to continue to work, you will potentially have to deliver multiple versions concurrently until all services support the latest version. This architectural style is used by many large systems and works, as long as you don't have a scenario where the raw data format has to change very frequently in a way that makes it incompatible with your service designs. (If that were the case, you'd probably need another layer of stable meta-model below that can describe the dynamic raw data.)
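One way a consuming service can cope with several versions of the raw format at once is to upgrade every message to the latest version before using it. The sketch below assumes JSON messages carrying a "version" field, and an invented change in which version 2 renames "customer" to "customer_id". None of these details come from the question; they only illustrate the idea.

```python
import json
from typing import Callable, Dict

CURRENT_VERSION = 2


def _v1_to_v2(event: dict) -> dict:
    # Hypothetical change: v2 renamed "customer" to "customer_id".
    upgraded = dict(event)
    upgraded["customer_id"] = upgraded.pop("customer")
    upgraded["version"] = 2
    return upgraded


# One upgrade step per old version; a chain of steps reaches the latest.
UPGRADERS: Dict[int, Callable[[dict], dict]] = {1: _v1_to_v2}


def decode(raw: bytes) -> dict:
    """Parse a message and upgrade it to CURRENT_VERSION."""
    event = json.loads(raw)
    # Messages without a version field are treated as version 1.
    version = event.get("version", 1)
    while version < CURRENT_VERSION:
        event = UPGRADERS[version](event)
        version = event["version"]
    if version != CURRENT_VERSION:
        raise ValueError(f"unsupported message version: {version}")
    return event


if __name__ == "__main__":
    print(decode(b'{"version": 1, "customer": "c-7", "item": "x"}'))
```

The rest of the service only ever sees the latest shape, so supporting a new version means adding one upgrade function rather than touching every consumer code path.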

Simple approach to synchronizing data across an Akka cluster?

I've got some run-time data I'd like to exist on a designated actor on every node in my Akka cluster, which could be updated via internal event or API call to a single node. I could store this data in a shared database to make it permanent, but I'd rather just store it in memory for speed, since it doesn't need to be persisted. Akka Cluster Singleton, Distributed Pub Sub, and possibly other built-in modules use gossip protocols to keep distributed state in sync.
Is there a ready-built way to adopt data synchronization of my own actors across my cluster?
I've thought about just publishing changes to Distributed Pub Sub, but it seems like this wouldn't be resilient to dropped messages. If I stored it in a cluster singleton, it wouldn't be survivable if that node went down. I don't need persistence if the entire cluster goes down, but I do want resilience if individual nodes do.
You should have a look at Akka Distributed Data, which should really be called "Akka Replicated Data", as it will replicate the data across all nodes.
It provides a simple key-value store, and any changes made on one node will be replicated to all others. As all data is kept on all nodes, it's best used for small data sets. Also, the values in your key-value pairs need to be CRDTs (conflict-free replicated data types). The module comes with some predefined CRDTs that cover a lot of use cases.
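To give a feel for what a CRDT is, here is a minimal grow-only counter (G-Counter) sketched in Python. It is not Akka's implementation and the class is written only for illustration. It shows the key property: replicas can be merged in any order, any number of times, and still converge to the same value.

```python
from typing import Dict, Optional


class GCounter:
    """Grow-only counter: one non-decreasing slot per node."""

    def __init__(self, counts: Optional[Dict[str, int]] = None) -> None:
        self._counts: Dict[str, int] = dict(counts or {})

    def increment(self, node: str, amount: int = 1) -> "GCounter":
        """Return a new counter with this node's own slot increased."""
        if amount < 0:
            raise ValueError("a grow-only counter cannot decrease")
        counts = dict(self._counts)
        counts[node] = counts.get(node, 0) + amount
        return GCounter(counts)

    def merge(self, other: "GCounter") -> "GCounter":
        """Combine two replicas by taking the per-node maximum."""
        nodes = self._counts.keys() | other._counts.keys()
        return GCounter(
            {n: max(self._counts.get(n, 0), other._counts.get(n, 0)) for n in nodes}
        )

    @property
    def value(self) -> int:
        return sum(self._counts.values())


if __name__ == "__main__":
    on_node_a = GCounter().increment("node-a", 2)
    on_node_b = GCounter().increment("node-b", 3)
    print(on_node_a.merge(on_node_b).value)
```

Because merge takes the per-node maximum, it is commutative, associative and idempotent, so duplicated or reordered gossip messages do no harm. That addresses the dropped-message worry in the question, as long as replicas eventually exchange state.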