Kafka Streams - Define Custom Relational/Non_Key_Value StateStore With Fault Tolerance - apache-kafka

I am trying to implement event sourcing using kafka.
My vision for the stream processor application is a typical 3-layer Spring application in which:
The "presentation" layer is replaced by (implemented by?) Kafka streams API.
The business logic layer is utilized by the processor API in the topology.
The DB is a relational, in-memory H2 database accessed via Spring Data JPA repositories. The repositories also implement the interfaces needed to register them as Kafka state stores, so that they get the benefits (restoration & fault tolerance).
But I'm wondering how I should implement the custom state store part.
I have been searching, and:
There are some interfaces such as StateStore & StoreBuilder. StoreBuilder has a withLoggingEnabled() method, but if I enable it, when does the actual update & change logging happen? The examples are usually all key-value stores, even the custom ones. What if I don't want key-value? The example in the interactive queries section of the Kafka documentation just doesn't cut it.
I am aware of interactive queries, but they seem to be good for queries & not updates, as the name suggests.
In a key-value store, the records that are sent to the changelog are straightforward. But if I don't use key value; when & how do I inform kafka that my state has changed?

You will need to implement StateStore for the actual store engine you want to use. This interface does not dictate anything about the store, so you can do whatever you want.
You also need to implement a StoreBuilder that acts as a factory to create instances of your custom store.
class MyCustomStore implements StateStore {
    // define any interface you want to present to the user of the store
}
class MyCustomStoreBuilder implements StoreBuilder<MyCustomStore> {
    @Override
    public MyCustomStore build() {
        // create new instance of MyCustomStore and return it
    }
    // all other methods (except `name()`) are optional
    // eg, you can do a dummy implementation that only returns `this`
}
Compare: https://docs.confluent.io/current/streams/developer-guide/processor-api.html#implementing-custom-state-stores
But if I don't use key value; when & how do I inform kafka that my state has changed?
If you want to implement withLoggingEnabled() (similar for caching), you will need to implement this logging (or caching) as part of your store. Because Kafka Streams does not know how your store works, it cannot provide an implementation for this. Thus, it's your design decision whether your store supports logging into a changelog topic or not. If you want to support logging, you need to come up with a design that maps store updates to key-value pairs (you can also write multiple per update) that you can write into a changelog topic and that allows you to recreate the state when reading those records from the changelog topic.
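As a rough illustration of such a mapping, here is a minimal sketch in plain Java. It does not use any Kafka API: the `List<ChangeRecord>` merely stands in for the changelog topic, and `OrderStore`, `ChangeRecord`, and the `customer/` and `order/` key prefixes are hypothetical names chosen for this example. It assumes a null value is used as a delete marker.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// One changelog entry: a key-value pair; a null value marks a delete (tombstone).
final class ChangeRecord {
    final String key;
    final String value;
    ChangeRecord(String key, String value) { this.key = key; this.value = value; }
}

// Hypothetical store with two related "tables" instead of one key-value map.
class OrderStore {
    final Map<Integer, String> customers = new LinkedHashMap<>();  // customerId -> name
    final Map<Integer, Integer> orders = new LinkedHashMap<>();    // orderId -> customerId
    private final List<ChangeRecord> changelog;                    // stand-in for the changelog topic

    OrderStore(List<ChangeRecord> changelog) { this.changelog = changelog; }

    void addCustomer(int id, String name) {
        customers.put(id, name);
        changelog.add(new ChangeRecord("customer/" + id, name));
    }

    void addOrder(int orderId, int customerId) {
        orders.put(orderId, customerId);
        changelog.add(new ChangeRecord("order/" + orderId, Integer.toString(customerId)));
    }

    // One logical update can emit several records: the customer and each of its orders.
    void deleteCustomer(int id) {
        customers.remove(id);
        changelog.add(new ChangeRecord("customer/" + id, null));
        orders.entrySet().removeIf(e -> {
            if (e.getValue() != id) return false;
            changelog.add(new ChangeRecord("order/" + e.getKey(), null));
            return true;
        });
    }

    // Rebuild the tables by replaying records in order; replay does not log again.
    static OrderStore restore(List<ChangeRecord> records) {
        OrderStore store = new OrderStore(records);
        for (ChangeRecord r : records) {
            String[] parts = r.key.split("/");
            int id = Integer.parseInt(parts[1]);
            if (parts[0].equals("customer")) {
                if (r.value == null) store.customers.remove(id);
                else store.customers.put(id, r.value);
            } else {
                if (r.value == null) store.orders.remove(id);
                else store.orders.put(id, Integer.parseInt(r.value));
            }
        }
        return store;
    }
}
```

The point of the sketch is only that every mutation, including the cascading delete, leaves behind enough key-value records to rebuild both tables. In a real store the records would be serialized and written to the changelog topic rather than appended to a list.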
Change logging is not the only way to get a fault-tolerant store. For example, you could also plug in a remote store that does replication etc. internally, and thus rely on the store's fault-tolerance capabilities instead of using change logging. Of course, using a remote store implies other challenges compared to using a local store.
For the Kafka Streams default stores, logging and caching are implemented as wrappers around the actual store, making them easily pluggable. But you can implement this in any way that fits your store best. You might want to check out the following classes for the key-value store as a comparison:
https://github.com/apache/kafka/blob/2.0/streams/src/main/java/org/apache/kafka/streams/state/internals/RocksDBStore.java
https://github.com/apache/kafka/blob/2.0/streams/src/main/java/org/apache/kafka/streams/state/internals/ChangeLoggingKeyValueBytesStore.java
https://github.com/apache/kafka/blob/2.0/streams/src/main/java/org/apache/kafka/streams/state/internals/CachingKeyValueStore.java
For interactive queries, you implement a corresponding QueryableStoreType to integrate your custom store. Cf. https://docs.confluent.io/current/streams/developer-guide/interactive-queries.html#querying-local-custom-state-stores
You are right that Interactive Queries is a read-only interface for the existing stores, because the Processors should be responsible for maintaining the stores. However, nothing prevents you from opening up your custom store for writes, too. Be aware that this makes your application inherently non-deterministic: if you rewind an input topic and reprocess it, it might compute a different result, depending on what "external store writes" are performed. You should consider doing every write to the store via the input topics. But it's your decision. If you allow "external writes", you will need to make sure that they get logged, too, in case you want to implement logging.
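The determinism concern can be shown with a small sketch in plain Java. No Kafka API is used here: the `List<String>` stands in for the input topic, and `CounterApp`, `process`, `externalWrite`, and `reprocess` are hypothetical names made up for this illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CounterApp {
    final Map<String, Integer> store = new HashMap<>();

    // Normal path: the processor is the only writer, one input record per update.
    void process(String key) {
        store.merge(key, 1, Integer::sum);
    }

    // Hypothetical backdoor: a write that never went through the input topic.
    void externalWrite(String key, int value) {
        store.put(key, value);
    }

    // Rewind and reprocess: build a fresh store from the input records alone.
    static CounterApp reprocess(List<String> inputTopic) {
        CounterApp app = new CounterApp();
        inputTopic.forEach(app::process);
        return app;
    }
}
```

After `externalWrite("a", 100)`, a fresh `reprocess(...)` of the same input would again yield a count derived from the input only, so the two stores disagree. Routing the write through the input would keep them identical.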

Related

If many Kafka streams updates domain model (a.k.a materialized view)?

I have a materialized view that is updated from many streams. Each one enriches it partially. Order doesn't matter, and updates arrive at unspecified times. Is the following algorithm a good approach?
An update comes in and I check what is stored in the materialized view via get(); if this is the initial one, I enrich and save.
A second one comes in and get() shows that a partial update exists, so I add the next piece of information.
... and I continue in the same style.
If there is a query/join, the stored object has a method, isValid(), that shows whether the update is complete; it could be used in KStream#filter().
Could you please tell me whether this is a good plan? Is there any pattern in the Kafka Streams world that handles this case?
Please advise.
Your plan looks good and you have the general idea, but you'll have to use the lower-level Kafka Streams API: the Processor API.
There is a .transform() operator that allows you to access a KeyValueStore. Inside this operator's implementation, you are free to decide whether your current aggregated value is valid or not, and then either send it downstream or return null and wait for more information.
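The merge-then-emit logic can be sketched in plain Java as follows. This is not the Kafka Streams Transformer interface: the `HashMap` stands in for the KeyValueStore, and `CustomerView`, `EnrichingTransformer`, and the `name`/`address` fields are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical view object that is enriched piece by piece.
class CustomerView {
    String name;
    String address;

    // Complete only once every partial update has arrived.
    boolean isValid() {
        return name != null && address != null;
    }
}

class EnrichingTransformer {
    // Stand-in for the state store attached to the transformer.
    private final Map<String, CustomerView> store = new HashMap<>();

    // Merge one partial update; return the view once complete, otherwise null.
    CustomerView transform(String key, String field, String value) {
        CustomerView view = store.computeIfAbsent(key, k -> new CustomerView());
        switch (field) {
            case "name":
                view.name = value;
                break;
            case "address":
                view.address = value;
                break;
            default:
                throw new IllegalArgumentException("unknown field: " + field);
        }
        return view.isValid() ? view : null;
    }
}
```

Because each update only fills in its own field, the order in which the partial updates arrive does not affect the final view.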

Accessing per-key state store in Apache Flink that changes dynamically

I have a stream of messages with different keys. For each key, I want to create an event time session window and do some processing on it only if:
MIN_EVENTS number of events has been accumulated in the window (essentially a keyed state)
For each key, MIN_EVENTS is different and might change during runtime. I am having difficulty implementing this. In particular, I am implementing this logic like so:
inputStream.keyBy(key)
    .window(EventTimeSessionWindows.withGap(INACTIVITY_PERIOD))
    .trigger(new MyCustomCountTrigger())
    .apply(new MyProcessFn());
I am trying to create a custom MyCustomCountTrigger() that should be capable of reading from a state store such as MapState<String, Integer> stateStore that maps each key to its MIN_EVENTS parameter. I am aware that I can access a state store using the TriggerContext ctx object that is available to all Triggers.
How do I initialize this state store from outside the CountTrigger() class? I haven't been able to find examples to do so.
You can initialize the state based on parameters sent to the constructor of your Trigger class. But you can't access the state from outside that class.
If you need more flexibility, I suggest you use a process function instead of a window.
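The constructor-based initialization could look roughly like the sketch below. It is plain Java and does not extend Flink's Trigger class: `MinEventsTrigger`, `Decision`, and `defaultMinEvents` are hypothetical names, and the `counts` map only stands in for per-key trigger state. Thresholds are fixed at construction time, so this does not cover changing MIN_EVENTS at runtime.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for a trigger result.
enum Decision { CONTINUE, FIRE }

class MinEventsTrigger {
    private final Map<String, Integer> minEvents;   // per-key thresholds, fixed at construction
    private final int defaultMinEvents;             // used for keys without an explicit threshold
    private final Map<String, Integer> counts = new HashMap<>();  // stand-in for keyed state

    MinEventsTrigger(Map<String, Integer> minEvents, int defaultMinEvents) {
        this.minEvents = new HashMap<>(minEvents);  // defensive copy
        this.defaultMinEvents = defaultMinEvents;
    }

    // Count the element for its key and fire once the key's threshold is reached.
    Decision onElement(String key) {
        int seen = counts.merge(key, 1, Integer::sum);
        int needed = minEvents.getOrDefault(key, defaultMinEvents);
        return seen >= needed ? Decision.FIRE : Decision.CONTINUE;
    }
}
```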

hazelcast spring-data write-through

I am using Spring-Boot, Spring-Data/JPA with Hazelcast client/server topology. In parts of my test application, I am calculating time when performing CRUD operations on the client side (the server is the one interacting with a relational db). I configured the map(Store) to be write-behind by setting write-delay-seconds to 10.
Spring-Data's save() returns the persisted entity. In the client app, therefore, the application flow will be blocked until the (server) returns the persisted entity.
I would like to know if there is an alternative in which the client does NOT have to wait for the entity to persist. I was under the impression that once new data is stored in the Map, persisting to the backend happens asynchronously, so the client app would NOT have to wait.
Map config in hazelcast.xml:
<map name="com.foo.MyMap">
    <map-store enabled="true" initial-mode="EAGER">
        <class-name>com.foo.MyMapStore</class-name>
        <write-delay-seconds>10</write-delay-seconds>
    </map-store>
</map>
@NeilStevenson I don't find your response particularly helpful. I asked in an earlier post about where and how to generate the Map keys. You pointed me to the documentation, which fails to shed any light on this topic. The same goes for the Hazelcast (and other) examples.
The point of having the cache in the first place is to avoid hitting the database. When we add data (via save()), we also need to generate a unique key for the Map. This key also becomes the Entity.Id in the database table. Since, again, it's the Hazelcast client that generates these Ids, there is no need to wait for the record to be persisted in the backend.
The only reason to wait for save() to return the persisted object would be to catch any exceptions, NOT because of the ID.
That unfortunately is how it is meant to work, see https://docs.spring.io/spring-data/commons/docs/current/api/org/springframework/data/repository/CrudRepository.html#save-S-.
Potentially the external store mutates the saved entry in some way.
Although you know it won't do this, there isn't a variant on the save defined.
So the answer seems to be that this is not currently available in the general-purpose Spring repository definition. Why not raise a feature request with the Spring Data team?
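For reference, the write-behind behaviour the question expects can be sketched in plain Java as below. This is not Hazelcast or Spring Data code: `WriteBehindCache`, `dirtyKeys`, and `flush()` are hypothetical names, the `database` map stands in for the relational store, and an explicit `flush()` call takes the place of the background timer that a delay setting such as write-delay-seconds would imply.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

class WriteBehindCache {
    private final Map<String, String> cache = new HashMap<>();
    private final Queue<String> dirtyKeys = new ArrayDeque<>();
    final Map<String, String> database = new HashMap<>();  // stand-in for the relational store

    // Returns as soon as the in-memory map is updated; nothing is persisted yet.
    void put(String key, String value) {
        cache.put(key, value);
        dirtyKeys.add(key);
    }

    String get(String key) {
        return cache.get(key);
    }

    // Persist every pending key with its latest cached value; returns the number of writes.
    int flush() {
        int written = 0;
        String key;
        while ((key = dirtyKeys.poll()) != null) {
            database.put(key, cache.get(key));
            written++;
        }
        return written;
    }
}
```

The caller of `put()` never waits on the database, which also means a persistence failure surfaces only later, in `flush()`, and not to the caller.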

Service Fabric Actors - save state to database

I'm working on a sample Service Fabric project, where I have to maintain a shopping list. For this I have a ShoppingList actor, which is identifiable by a specific id. It stores the current list content in its state using StateManager. All works fine.
However, in parallel I'd like to maintain the shopping list content in a sql database. In particular:
store all add/remove item requests for future analysis (ML)
on first actor initialization load list content from db (e.g. after cluster has been re-created)
What is the best approach to achieve that? Create a custom StateProvider (how? can't find examples)?
Or maybe have another service/actor for handling all db operations (possibly using queues and reminders)?
All examples seem to completely rely on default StateManager, with no data persistence to external storage, so I'm not sure what's the best practice.
The best way is to have a separate entity responsible for storing data in the DB. The actor just sends an event (not meaning SF events) with some data about the performed operation, and another entity picks it up and does the rest of the work.
You can of course implement this in the actor itself, but that brings two possible issues:
The actor will not be able to process other requests if there are problems with the DB, with connectivity between the actor and the DB, or if the DB is under high load and processes requests slowly. The actor would have to wait until the transfer to the DB completes successfully.
The DB may be overloaded by many single connections from many actors, instead of one or a few connections from another entity doing batch insertion.
So your final solution will depend on the workload of your system. But you will definitely need a reliable queue to store data safely in the DB if the data is too valuable to lose.
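The separation described above might look like the following sketch. It is plain Java, not Service Fabric code: `ShoppingListActor`, `ListEvent`, and `BatchDbWriter` are hypothetical names, the in-memory `LinkedBlockingQueue` is only a placeholder for a reliable queue, and each collected batch stands in for one batch insert into the DB.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// One add/remove request, kept for later analysis.
final class ListEvent {
    final String listId;
    final String action;
    final String item;
    ListEvent(String listId, String action, String item) {
        this.listId = listId;
        this.action = action;
        this.item = item;
    }
}

class ShoppingListActor {
    private final String id;
    private final List<String> items = new ArrayList<>();  // stand-in for actor state
    private final BlockingQueue<ListEvent> outbox;

    ShoppingListActor(String id, BlockingQueue<ListEvent> outbox) {
        this.id = id;
        this.outbox = outbox;
    }

    // Update local state, then hand the event off without touching the DB.
    void addItem(String item) {
        items.add(item);
        outbox.offer(new ListEvent(id, "add", item));
    }

    void removeItem(String item) {
        if (items.remove(item)) {
            outbox.offer(new ListEvent(id, "remove", item));
        }
    }
}

class BatchDbWriter {
    final List<List<ListEvent>> batches = new ArrayList<>();  // each entry = one batch insert

    // Take up to maxBatch pending events and write them as a single batch.
    int drain(BlockingQueue<ListEvent> queue, int maxBatch) {
        List<ListEvent> batch = new ArrayList<>();
        queue.drainTo(batch, maxBatch);
        if (!batch.isEmpty()) {
            batches.add(batch);
        }
        return batch.size();
    }
}
```

With this split, a slow or unavailable DB only delays the writer, while the actor keeps serving requests.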
Also, I think you could use the default state manager to store logs and information about transactions before they are transferred to the DB, and remove them from the service's state after the transaction completes. There is no need for permanent storage of such data in services.
Another thing to take into consideration is reading from the DB. If you have a relational database and write new records to only one table, and a huge number of actors query that data on activation, you will probably see performance degradation, because the table will be locked for reading or writing unless you configure it to behave differently. So you will probably need a caching system to read data for actor activation, depending on your workload.
As for implementing your custom state manager: take a look at this example. Basically, all you need to do is implement the IReliableStateManagerReplica interface and pass it to the StatefulService constructor.

quickfixengine: possible to restrict logging?

In quickfixengine, is there a setting to specify the log level to restrict the number of messages logged? It seems that we are logging a lot of data, so we would like to restrict it a bit. I assume that logging too many messages affects performance (I don't have any hard data for or against).
You don't say which language you're using but I believe that this should work with both the C++ and Java APIs.
You will need to implement your own LogFactory and Log classes (the former is responsible for creating instances of the latter). Then you'll pass an instance of your custom LogFactory to your Initiator or Acceptor instance. Your Log class is where you will do the message filtering.
Understand that Log receives messages in string form, so you'll need to filter either with string matching operations or by converting the strings back to Messages and then filtering on tags, though this may end up slowing you down more than just allowing all messages to be logged.
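A string-matching filter along those lines could be sketched as follows. The `MessageLog` interface here is a hypothetical, cut-down stand-in and not the engine's actual Log interface, and `FilteringLog` with its in-memory `sink` is likewise made up for the example. It assumes the usual FIX conventions: fields are delimited by SOH (0x01) and MsgType (tag 35) is `0` for heartbeats.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal stand-in for a FIX engine's log callback interface.
interface MessageLog {
    void onIncoming(String message);
}

class FilteringLog implements MessageLog {
    private static final char SOH = '\u0001';             // standard FIX field delimiter
    private final List<String> sink = new ArrayList<>();  // stand-in for the real log destination

    // Drop heartbeats, keep everything else.
    @Override
    public void onIncoming(String message) {
        if (!isHeartbeat(message)) {
            sink.add(message);
        }
    }

    // Pure string matching: look for the field 35=0 bounded by delimiters on both sides,
    // so that a tag such as 135=0 does not match.
    static boolean isHeartbeat(String message) {
        return (SOH + message).contains(SOH + "35=0" + SOH);
    }

    List<String> logged() {
        return sink;
    }
}
```

Matching on the raw string avoids parsing each message back into an object, which is the cheaper of the two options mentioned above.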