Accessing distributed data in a Storm cluster, and how does MemoryMapState store distributed data? - distributed-computing

I'm trying to develop a topology that has some so-called "global variables", similar to what the question "Distributed caching in storm" describes. I know it's better to use some kind of distributed cache store, like Memcached or Redis, but I still have a few questions:
How can I access cached data more efficiently? It doesn't seem like a good idea to query the cache directly in the execute or nextTuple function; should I create a thread in the spout/bolt to fetch the data periodically?
MemoryMapState is a "state" used to store persistent results in Storm Trident. It is said that MemoryMapState uses a ConcurrentHashMap to store cached data, as this code from the MemoryMapState class shows:
static ConcurrentHashMap<String, Map<List<Object>, Object>> _dbs =
new ConcurrentHashMap<String, Map<List<Object>, Object>>();
So, how can a ConcurrentHashMap store and query data across different machines in a Storm cluster?

Related

When is a Kafka Streams GlobalKTable a good choice as a data store in the microservices world?

I'm new to the Kafka Streams world. I'm wondering when to use a Kafka Streams GlobalKTable (with a compacted topic under the hood) instead of a regular database for persisting data, and what the advantages and disadvantages of each solution are. I assume both ensure data persistence at the same level.
Let's say there is a simple e-commerce app in which users register and update their data. There are two microservices: the first one (service-users) is responsible for registering users and the second one (service-orders) is responsible for placing orders. Now there are two options:
When a new user registers, service-users accepts the request, saves the newly registered user's data in its database (SQL or NoSQL, it doesn't matter) and then sends an event to Kafka to propagate this to other services. service-orders receives the event and stores the necessary user data in its own database. This is the most common pattern (in my experience).
And now the second approach, with GlobalKTable:
When a user registers or updates their data, service-users accepts the request and sends an event with a snapshot of the user data to Kafka. service-users and service-orders use a GlobalKTable to read information about users.
When should I use which solution? Which solution is better in which cases? What are the advantages and disadvantages of both approaches? Doesn't the second approach break the rule "each microservice should maintain its own data in its own database"?
I hope I explained my considerations well and that they make sense.
In general, the advantages of a GlobalKTable are:
You can do a foreign-key join against a GlobalKTable.
The application holds the full data set locally; the data set is loaded automatically during application startup and all data modifications are automatically synchronized across all instances. Compared to an architecture with an external database, you don't need to communicate over the network with any other resource (such as a relational database) while processing messages, so processing is much faster and you can process large amounts of data quickly. To achieve similar processing performance with an external database, you would need to implement some kind of in-memory cache yourself (for example with Guava) and then solve all the issues of proper cache management: warming, refreshing, and evicting.
And the main disadvantages are:
The application holds the full data set locally. This is an advantage, but it can also be a very big issue, depending on how big your data set is and how you model your data. Referring to your example, storing all users' orders in a GlobalKTable sounds like a very bad idea: the data set grows quickly and keeps growing over time, so after a few months or years in production it can reach gigabytes and will continue to grow. If we still want to keep orders in a GlobalKTable for efficient processing, we need to design our data model differently. Our entities (orders, documents, etc.) probably have some life cycle, such as new, paid, closed, and some of those states are terminal, meaning there will be no further processing of the entity with that id (for example, a closed order). If there will be no further processing, there is no need to keep the data locally; we can forward it to some other storage, such as Elasticsearch, and remove it from the GlobalKTable. We can call the set of orders still being processed hot storage and the set of terminated orders cold storage. Long story short: keeping only active/hot orders in a GlobalKTable could be a good idea.
Querying a GlobalKTable is limited to iterating over the whole data set or a subset of it, or getting data by record key or by a key combined with a timestamp.
Processing based on state in an external database has been widely used for many years, so many developers know how to evolve and maintain that kind of application. We cannot say the same about storing state in Kafka compacted topics.
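The hot/cold split described above can be sketched in plain Java. This is only an illustration under stated assumptions: a HashMap stands in for the GlobalKTable (hot storage), a list stands in for an external archive such as Elasticsearch (cold storage), and the class name, method names, and Status values are hypothetical. It is single-threaded and not the Kafka Streams API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: plain collections stand in for the GlobalKTable ("hot")
// and for an external archive such as Elasticsearch ("cold").
final class HotColdOrders {
    // Hypothetical life-cycle states; CLOSED is treated as the terminal one.
    enum Status { NEW, PAID, CLOSED }

    private final Map<String, Status> hot = new HashMap<>();
    private final List<String> cold = new ArrayList<>();

    // Apply a status update; orders that reach a terminal state leave the hot set.
    void update(String orderId, Status status) {
        if (status == Status.CLOSED) {
            hot.remove(orderId); // on a compacted topic this would be a tombstone (null value)
            cold.add(orderId);   // forward the finished order to cold storage
        } else {
            hot.put(orderId, status);
        }
    }

    int hotSize() {
        return hot.size();
    }

    List<String> coldIds() {
        return new ArrayList<>(cold);
    }
}
```

With this shape the hot set is bounded by the number of orders currently in flight rather than by the total number of orders ever placed.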

Limitations of Kafka as a Distributed DB

I have an application which requires an interesting orchestration between states of instances distributed across geographic regions, in combination with the need for a scalable distributed database.
At the moment I think that Kafka with log compaction will fit my needs for state maintenance and message exchange between instances, and Cassandra will fit my needs for high volume distributed reads and writes of persisted data.
However, quite a lot of data is duplicated that way: much of the data exchanged via Kafka would also need to be stored in Cassandra for distributed data access. Using Kafka for both messaging and distributed data querying and persistence seems tempting.
Therefore, I'm interested to figure out the real-world pros and cons to be expected when using e.g. the pull queries feature of Kafka to use it as a distributed database [1].
Though, I'm a bit suspicious about what to expect of that in terms of performance and scalability, especially when compared to Cassandra, as well as unknown pitfalls.
What are the tradeoffs of using Kafka as a distributed DB, and how would it compare performance-wise to "native" distributed systems like Cassandra?
[1] https://www.confluent.io/de-de/blog/pull-queries-in-preview-confluent-cloud-ksqdb/
pure KV lookups
Then Kafka StateStores / Interactive Queries can work, but with the caveat that if you use containers and an orchestrator, you need to maintain the state of those stores somewhere on persistent volumes. Otherwise, when the containers move to a fresh host, the streams changelog topic needs to be read from the very beginning, giving you a "cold-start" problem, and you will be unable to query.
Using any database (with persistent storage) will not have this problem, and will always be able to query immediately.
I'm not sure I would suggest Cassandra for strictly KV data, though.

data sync between 2 instances of same microservice using kafka

We have a microservice that acts as a cache service, and we decided to have only 2 instances of it up and running. The microservice receives data through a Kafka topic and stores it in an in-memory cache. But we are having trouble keeping the data in sync between these 2 instances. We decided to use a different consumer group for each instance so that both receive the same data and stay in sync. Since both instances run the same codebase, how can each one subscribe with a different consumer group at startup? For example, if instance 1 subscribes as consumergrp1, instance 2 should subscribe as consumergrp2. Please suggest how to achieve this.
You cannot keep in-memory data in sync across multiple instances of a microservice when the data comes from a streaming system or arrives multiple times. If the data is loaded only once in a pod's lifetime, you can keep the in-memory data in sync. For example, each instance can fetch the data from the source while it starts up and hold it in memory; in that case both pods have the same data.
You need to use a distributed cache such as Redis or Couchbase. That will be the cleaner approach.
You haven't specified any details about the way you use Kafka (language, third-party libraries, etc.). So, speaking "in general", you can:
Specify a random (or partially random) consumer group id. It won't be as "clean" as "consumergrp1" and "consumergrp2", but it's a string after all, so you can generate it randomly. This idea includes embedding an identifier of the process in the consumer group name; for example, if the microservice instances are supposed to run on different machines, you could include the machine name as part of the consumer group name.
More complicated, but still possible: if you have some shared storage, you could use it for "synchronization" and store a monotonically increasing counter for the "current consumer group to create". Once the value is read, it has to be incremented. Of course, the implementation details depend on the shared storage you actually use (a DB, something like Redis, whatever).
So there are many possible solutions. As a suggestion, whichever solution you choose, do not rely on the fact that you have exactly two instances of the service; you may reconsider that in the future.
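The first suggestion (a partially random group id that includes the machine name) could look roughly like the sketch below. The class name, the "cache-service" prefix, and the "<prefix>-<host>-<random>" format are illustrative assumptions; only standard JDK calls are used, and the resulting string would be passed to the consumer as its "group.id" setting.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.UUID;

final class ConsumerGroupIds {
    // Builds "<prefix>-<host>-<random>"; the format is an assumption for illustration.
    static String forInstance(String prefix) {
        String host;
        try {
            host = InetAddress.getLocalHost().getHostName();
        } catch (UnknownHostException e) {
            host = "unknown-host"; // fall back; the random suffix still keeps ids distinct
        }
        // A short random suffix separates instances that share a host.
        String suffix = UUID.randomUUID().toString().substring(0, 8);
        return prefix + "-" + host + "-" + suffix;
    }

    public static void main(String[] args) {
        // The printed value would be used as the consumer's "group.id" property.
        System.out.println(forInstance("cache-service"));
    }
}
```

One consequence of this design: an id generated at startup is new on every restart, so the group has no committed offsets and the consumer starts from wherever its offset reset policy points.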

Questions about using Apache Kafka Streams to implement event sourcing microservices

Event sourcing means a 180 degree shift in the way many of us have been architecting and developing web applications, with lots of advantages but also many challenges.
Apache Kafka is an awesome platform that, through its Kafka Streams API, is advertised as a tool that allows us to implement this paradigm thanks to its many features (decoupling, fault tolerance, scalability...): https://www.confluent.io/blog/event-sourcing-cqrs-stream-processing-apache-kafka-whats-connection/
On the other hand there are some articles discouraging us from using it for event sourcing: https://medium.com/serialized-io/apache-kafka-is-not-for-event-sourcing-81735c3cf5c
These are my questions regarding Kafka Streams' suitability as an event sourcing platform:
The article above comes from Jesper Hammarbäck (who works for serialized.io, an event sourcing platform). I would like to get an answer to the main problems he brings up:
Loading current state. In my view with log compaction and state stores it's not a problem. Am I right?
Consistent writes.
When moving certain pieces of functionality into Kafka Streams I'm not sure if they do fit naturally:
Authentication & Security: Imagine your customers are stored in a state store generated from a customer-topic. Should we keep their passwords in the topic/store? It doesn't sound safe enough, does it? Then how are we supposed to manage this aspect of having customers on a state store and their passwords somewhere else? Any recommended good practice?
Queries: Interactive queries are a nice tool to generate queriable views of our data (by key). That's ok to get an entity by id but what about complex queries (joins)? Do we need to generate state stores per query? For instance one store for customers by id, another one for customers by state, another store for customers who purchased a product last year... It doesn't sound manageable. Another point is the lack of pagination: how can we handle big sets of data when querying the state stores? One more point, we can’t do dynamic queries (like JPA criteria API) anymore. This leads to CQRS maybe? Complexity keeps growing this way...
Data growth: with databases we are used to having thousands and thousands of rows per table. Kafka Streams applications keep a local state store that will grow and grow over time. How scalable is that? How is that local storage kept (local disk/RAM)? If it's disk, we should provision applications with enough space; if it's RAM, with enough memory.
Loading Current State: The mechanism described in the blog, re-creating the current state ad hoc for a single entity, would indeed be costly with Kafka. However, Kafka Streams follows the philosophy of keeping the current state for all objects in a KTable (that is distributed/sharded). Thus, it's never required to do this, although of course it comes with certain memory costs.
Kafka Streams parallelizes based on different events. Thus, all interactions for a single event (processing, state updates) are performed by a single thread, so I don't see why there should be inconsistent writes.
I am not sure what the exact requirement would be. In the current implementation, Kafka Streams does not offer any store-specific authentication or security features. There are several things one could do for security, though: (a) encrypt the local disk, which might be the simplest way to protect data; (b) encrypt messages within the business logic before you put them into the store.
Interactive Queries offer limited support for many reasons (I don't want to go into details) and were never designed with the goal of supporting complex queries. The idea is eager computation of results that can be retrieved with simple lookups. As you pointed out, this is not very scalable (it is cost intensive) if you have a lot of different queries. To tackle this, it would make sense to load the data into a database and let the DB do what it is built for. Kafka Streams alone is not the right tool for this at the moment; however, there is no reason not to combine both.
By default, Kafka Streams uses RocksDB to keep local state (you can switch to in-memory stores, too). Thus, it's possible to write to disk and to use very large state. Of course, you need to provision your instances accordingly (cf. https://docs.confluent.io/current/streams/sizing.html). Besides this, Kafka Streams scales horizontally and is fully elastic. Thus, you can add new instances at any point in time, allowing you to hold terabytes of state if you have large disks and enough instances. Note that the number of input topic partitions limits the number of instances you can use (internally, Kafka Streams is a consumer group, and you cannot have more instances than partitions). If this is a concern, it's recommended to over-partition the input topics in the first place.
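As a rough illustration of the second security option mentioned above (encrypting values in the business logic before they reach the store), here is a minimal sketch using the JDK's AES-GCM support. The class and method names are hypothetical, key management is deliberately left out (a freshly generated key is only good for a demo), and nothing here is a Kafka Streams API; the resulting byte array is simply what one would hand to the store instead of the plaintext.

```java
import java.nio.ByteBuffer;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

final class ValueCrypto {
    private static final int IV_BYTES = 12;  // commonly recommended IV size for GCM
    private static final int TAG_BITS = 128; // authentication tag length
    private static final SecureRandom RANDOM = new SecureRandom();

    // Returns IV || ciphertext so that the stored value carries its own IV.
    static byte[] encrypt(SecretKey key, byte[] plaintext) throws GeneralSecurityException {
        byte[] iv = new byte[IV_BYTES];
        RANDOM.nextBytes(iv); // a fresh IV per value; never reuse an IV with the same key
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
        byte[] ciphertext = cipher.doFinal(plaintext);
        return ByteBuffer.allocate(iv.length + ciphertext.length).put(iv).put(ciphertext).array();
    }

    // Splits the stored bytes back into IV and ciphertext, then decrypts and verifies the tag.
    static byte[] decrypt(SecretKey key, byte[] stored) throws GeneralSecurityException {
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, stored, 0, IV_BYTES));
        return cipher.doFinal(stored, IV_BYTES, stored.length - IV_BYTES);
    }

    // Demo-only key generation; a real deployment would load the key from a secret store.
    static SecretKey newKey() throws GeneralSecurityException {
        KeyGenerator generator = KeyGenerator.getInstance("AES");
        generator.init(128);
        return generator.generateKey();
    }
}
```

Because the values are opaque bytes once encrypted, lookups by key still work, but anything that needs to inspect the value has to decrypt it first.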

Simple approach to synchronizing data across an Akka cluster?

I've got some run-time data I'd like to exist on a designated actor on every node in my Akka cluster, which could be updated via internal event or API call to a single node. I could store this data in a shared database to make it permanent, but I'd rather just store it in memory for speed, since it doesn't need to be persisted. Akka Cluster Singleton, Distributed Pub Sub, and possibly other built-in modules use gossip protocols to keep distributed state in sync.
Is there a ready-built way to adopt data synchronization of my own actors across my cluster?
I've thought about just publishing changes to Distributed Pub Sub, but it seems like this wouldn't be resilient to dropped messages. If I stored it in a cluster singleton, it wouldn't be survivable if that node went down. I don't need persistence if the entire cluster goes down, but I do want resilience if individual nodes do.
You should have a look at Akka Distributed Data, which should really be called "Akka Replicated Data", as it will replicate the data across all nodes.
It provides a simple key-value store, and any changes made on one node will be replicated to all others. As all data is kept on all nodes, it's best used for small data sets. Also, the values in your key-value pairs need to be CRDTs (conflict-free replicated data types). The module comes with some predefined CRDTs that cover a lot of use cases.
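To give a feel for why the values have to be CRDTs, here is a minimal sketch of one of the simplest ones, a grow-only counter, in plain Java. It is not Akka's API; the class and the node ids are made up for illustration. The point is that replicas can be merged in any order, any number of times, and still converge to the same value.

```java
import java.util.HashMap;
import java.util.Map;

// Grow-only counter: each node increments only its own slot.
final class GCounter {
    private final Map<String, Long> counts = new HashMap<>();

    // Record one increment on behalf of the given node.
    void increment(String nodeId) {
        counts.merge(nodeId, 1L, Long::sum);
    }

    // Merging takes the per-node maximum, which makes the operation
    // commutative, associative, and idempotent.
    void merge(GCounter other) {
        other.counts.forEach((node, n) -> counts.merge(node, n, Long::max));
    }

    // The counter's value is the sum over all nodes' slots.
    long value() {
        return counts.values().stream().mapToLong(Long::longValue).sum();
    }
}
```

A dropped or duplicated replication message does no harm here: merging the same state twice changes nothing, and a later merge brings a lagging replica up to date.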