I have a stream of messages with different keys. For each key, I want to create an event time session window and do some processing on it only if:
MIN_EVENTS number of events has been accumulated in the window (essentially a keyed state)
For each key, MIN_EVENTS is different and might change at runtime. I am having difficulty implementing this. In particular, my current attempt looks like this:
inputStream.keyBy(key)
    .window(EventTimeSessionWindows.withGap(INACTIVITY_PERIOD))
    .trigger(new MyCustomCountTrigger())
    .apply(new MyProcessFn());
I am trying to create a custom MyCustomCountTrigger() that should be capable of reading from a state store such as a MapState<String, Integer> stateStore that maps each key to its MIN_EVENTS parameter. I am aware that I can access a state store using the TriggerContext ctx object that is available to all Triggers.
How do I initialize this state store from outside the CountTrigger() class? I haven't been able to find examples to do so.
You can initialize the state based on parameters sent to the constructor of your Trigger class. But you can't access the state from outside that class.
If you need more flexibility, I suggest you use a process function instead of a window.
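For example, something along these lines (an untested sketch; Event and Result stand in for your own types) buffers events per key and applies a per-key threshold kept in keyed state, so it can be changed at runtime, e.g. by connecting a control stream. The session-gap behaviour of the window would have to be reproduced with event-time timers (ctx.timerService().registerEventTimeTimer(...)), which is omitted here.

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class MinEventsFunction extends KeyedProcessFunction<String, Event, Result> {

    private static final int DEFAULT_MIN_EVENTS = 10;

    private transient ListState<Event> buffer;        // events collected so far for this key
    private transient ValueState<Integer> count;      // number of buffered events
    private transient ValueState<Integer> minEvents;  // per-key threshold, updatable at runtime

    @Override
    public void open(Configuration parameters) {
        buffer = getRuntimeContext().getListState(
                new ListStateDescriptor<>("buffer", Event.class));
        count = getRuntimeContext().getState(
                new ValueStateDescriptor<>("count", Integer.class));
        minEvents = getRuntimeContext().getState(
                new ValueStateDescriptor<>("minEvents", Integer.class));
    }

    @Override
    public void processElement(Event event, Context ctx, Collector<Result> out) throws Exception {
        buffer.add(event);
        int newCount = (count.value() == null ? 0 : count.value()) + 1;
        count.update(newCount);

        int threshold = minEvents.value() == null ? DEFAULT_MIN_EVENTS : minEvents.value();
        if (newCount >= threshold) {
            out.collect(process(buffer.get()));  // apply the MyProcessFn-style logic here
            buffer.clear();
            count.update(0);
        }
    }

    private Result process(Iterable<Event> events) {
        // placeholder for the actual per-"window" processing
        return new Result();
    }
}

It would then replace the window/trigger/apply chain with inputStream.keyBy(key).process(new MinEventsFunction()).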
I have a materialized view that is updated from many streams. Each one enriches it partially. Order doesn't matter, and updates come in at unspecified times. Is the following algorithm a good approach?
An update comes in; I check via get() what is stored in the materialized view, see that this is the initial one, so I enrich and save it.
A second update comes in and get() shows that a partial record already exists, so I add the next piece of information.
... and I continue in the same fashion.
If there is a query/join, the stored object has a method isValid() that indicates whether the update is complete, which could be used in KafkaStreams#filter().
Could you please tell me whether this is a good plan? Is there a pattern in the Kafka Streams world that handles this case?
Please advise.
Your plan looks good; you have the general idea, but you'll have to use the lower-level Kafka Streams API: the Processor API.
There is a .transform operator that allows you to access a KeyValueStore; inside that operation's implementation you are free to decide whether your current aggregated value is valid or not.
You can then either send it downstream, or return null and keep waiting for more information.
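Roughly like this (a condensed, untested sketch; PartialUpdate, AggregatedView, their Serdes, merge() and the topic names are placeholders for your own types and enrichment logic, isValid() is the method from your question, and imports from org.apache.kafka.streams.* are omitted):

StreamsBuilder builder = new StreamsBuilder();

// state store that holds the partially enriched view per key
StoreBuilder<KeyValueStore<String, AggregatedView>> storeBuilder =
        Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore("view-store"),
                Serdes.String(),
                aggregatedViewSerde);
builder.addStateStore(storeBuilder);

KStream<String, PartialUpdate> updates =
        builder.stream("updates-topic", Consumed.with(Serdes.String(), partialUpdateSerde));

updates.transform(() -> new Transformer<String, PartialUpdate, KeyValue<String, AggregatedView>>() {
    private KeyValueStore<String, AggregatedView> store;

    @Override
    public void init(ProcessorContext context) {
        // the store must be attached to the transform() call by name (see below)
        store = (KeyValueStore<String, AggregatedView>) context.getStateStore("view-store");
    }

    @Override
    public KeyValue<String, AggregatedView> transform(String key, PartialUpdate update) {
        AggregatedView view = store.get(key);
        if (view == null) {
            view = new AggregatedView();
        }
        view.merge(update);        // your enrichment logic
        store.put(key, view);
        // returning null forwards nothing, i.e. we keep waiting for more information
        return view.isValid() ? KeyValue.pair(key, view) : null;
    }

    @Override
    public void close() { }
}, "view-store")
.to("materialized-view-topic", Produced.with(Serdes.String(), aggregatedViewSerde));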
I am trying to implement event sourcing using Kafka.
My vision for the stream processor application is a typical 3-layer Spring application in which:
The "presentation" layer is replaced by (implemented by?) Kafka streams API.
The business logic layer is utilized by the processor API in the topology.
Also, the DB is a relational H2, In-memory database which is accessed via Spring Data JPA Repositories. The repositories also implements necessary interfaces for them to be registered as Kafka state stores to use the benefits (restoration & fault tolerance)
But I'm wondering: how should I implement the custom state store part?
I have been searching, and:
There are some interfaces such as StateStore & StoreBuilder. StoreBuilder has a withLoggingEnabled() method; but if I enable it, when does the actual update & change logging happen? Usually the examples are all key-value stores, even the custom ones. What if I don't want a key-value store? The example in the interactive queries section of the Kafka documentation just doesn't cut it.
I am aware of interactive queries. But they seem to be good for queries and not updates, as the name suggests.
In a key-value store the records that are sent to the changelog are straightforward. But if I don't use key-value, when and how do I inform Kafka that my state has changed?
You will need to implement StateStore for the actual store engine you want to use. This interface does not dictate anything about the store, so you can do whatever you want.
You also need to implement a StoreBuilder that acts as a factory to create instances of your custom store.
class MyCustomStore implements StateStore {
    // define any interface you want to present to the user of the store
}

class MyCustomStoreBuilder implements StoreBuilder<MyCustomStore> {
    public MyCustomStore build() {
        // create a new instance of MyCustomStore and return it
    }
    // all other methods (except `name()`) are optional
    // eg, you can do a dummy implementation that only returns `this`
}
Compare: https://docs.confluent.io/current/streams/developer-guide/processor-api.html#implementing-custom-state-stores
But if I don't use key value; when & how do I inform kafka that my state has changed?
If you want to implement withLoggingEnabled() (similarly for caching), you will need to implement this logging (or caching) as part of your store. Because Kafka Streams does not know how your store works, it cannot provide an implementation for this. Thus, it's your design decision whether your store supports logging into a changelog topic or not. If you do want to support logging, you need to come up with a design that maps store updates to key-value pairs (you can also write multiple pairs per update) that you can write into a changelog topic and that allow you to recreate the state when reading those records back from the changelog topic.
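To make the restore side of this a bit more concrete, here is a rough, incomplete sketch (class and helper names are made up) of a non-key-value store that registers a restore callback in init(); the write path, i.e. serializing each update into key-value records and producing them to the changelog topic, is exactly the design decision described above and is not shown:

import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.StateRestoreCallback;
import org.apache.kafka.streams.processor.StateStore;

public class MyGraphStore implements StateStore {

    private final String name;
    private boolean open = false;

    public MyGraphStore(final String name) {
        this.name = name;
    }

    @Override
    public String name() {
        return name;
    }

    @Override
    public void init(final ProcessorContext context, final StateStore root) {
        // Registering the store tells Kafka Streams about it; on restart, the callback
        // is invoked with every key-value record read back from the changelog topic.
        context.register(root, new StateRestoreCallback() {
            @Override
            public void restore(final byte[] key, final byte[] value) {
                // Deserialize the pair back into an update and re-apply it to whatever
                // internal structure this store maintains (hypothetical helper).
                applySerializedUpdate(key, value);
            }
        });
        open = true;
    }

    @Override
    public void flush() { /* flush the underlying store engine, if it buffers writes */ }

    @Override
    public void close() { open = false; }

    @Override
    public boolean persistent() { return false; /* in-memory in this sketch */ }

    @Override
    public boolean isOpen() { return open; }

    private void applySerializedUpdate(final byte[] key, final byte[] value) {
        // store-engine specific
    }
}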
Getting a fault-tolerant store is not only possible via change logging. For example, you could also plug in a remote store that does replication etc. internally, and thus rely on the store's own fault-tolerance capabilities instead of using change logging. Of course, using a remote store implies other challenges compared to using a local store.
For the Kafka Streams default stores, logging and caching are implemented as wrappers around the actual store, making them easily pluggable. But you can implement this in any way that fits your store best. You might want to check out the following classes for the key-value store as a comparison:
https://github.com/apache/kafka/blob/2.0/streams/src/main/java/org/apache/kafka/streams/state/internals/RocksDBStore.java
https://github.com/apache/kafka/blob/2.0/streams/src/main/java/org/apache/kafka/streams/state/internals/ChangeLoggingKeyValueBytesStore.java
https://github.com/apache/kafka/blob/2.0/streams/src/main/java/org/apache/kafka/streams/state/internals/CachingKeyValueStore.java
For Interactive Queries, you implement a corresponding QueryableStoreType to integrate your custom store; cf. https://docs.confluent.io/current/streams/developer-guide/interactive-queries.html#querying-local-custom-state-stores You are right that Interactive Queries is a read-only interface for the existing stores, because the processors should be responsible for maintaining the stores. However, nothing prevents you from opening up your custom store for writes, too. Be aware that this will make your application inherently non-deterministic: if you rewind an input topic and reprocess it, it might compute a different result, depending on what "external store writes" were performed. You should consider doing any writes to the store via the input topics instead. But it's your decision. If you do allow "external writes", you will need to make sure that they get logged, too, in case you want to implement logging.
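Condensed, the pattern from the linked documentation looks roughly like this (MyCustomStore is the store type from above; MyReadableCustomStore and MyCustomStoreWrapper are placeholder names for your read-only interface and its wrapper):

import org.apache.kafka.streams.processor.StateStore;
import org.apache.kafka.streams.state.QueryableStoreType;
import org.apache.kafka.streams.state.internals.StateStoreProvider;

public class MyCustomStoreType implements QueryableStoreType<MyReadableCustomStore> {

    @Override
    public boolean accepts(final StateStore stateStore) {
        // only accept state stores of your custom type
        return stateStore instanceof MyCustomStore;
    }

    @Override
    public MyReadableCustomStore create(final StateStoreProvider storeProvider, final String storeName) {
        // return a read-only wrapper that delegates to the local instances of the store
        return new MyCustomStoreWrapper(storeProvider, storeName, this);
    }
}

It can then be queried via streams.store("my-store-name", new MyCustomStoreType()).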
I'm trying to use Kafka Streams with a state store distributed among two instances. Here's how the store and the associated KTable are defined:
KTable<String, Double> userBalancesTable = kStreamBuilder.table(
        "balances-table",
        Consumed.with(String(), Double()),
        Materialized.<String, Double, KeyValueStore<Bytes, byte[]>>as(BALANCES_STORE)
                .withKeySerde(String())
                .withValueSerde(Double())
);
Next, I have some stream processing logic which aggregates some data into this balances-table KTable:
transactionsStream
.leftJoin(...)
...
.aggregate(...)
.to("balances-table", Produced.with(String(), Double()));
And at some point I am, from a REST handler, querying the state store.
ReadOnlyKeyValueStore<String, Double> balances = streams.store(BALANCES_STORE, QueryableStoreTypes.<String, Double>keyValueStore());
return Optional.ofNullable(balances.get(userId)).orElse(0.0);
Which works perfectly - as long as I have a single stream processing instance.
Now I'm adding a second instance (note: my topics all have 2 partitions). As explained in the docs, the state store BALANCES_STORE is distributed among the instances based on the key of each record (in my case, the key is a user ID). Therefore, an instance must:
Make a call to KafkaStreams#metadataForKey to discover which instance is handling the part of the state store containing the key it wants to retrieve
Make an RPC (e.g. REST) call to that instance to retrieve it, as sketched below
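Simplified, the lookup looks roughly like this (thisInstanceHostInfo stands in for the HostInfo derived from this instance's application.server setting; StreamsMetadata is org.apache.kafka.streams.state.StreamsMetadata):

StreamsMetadata metadata = streams.metadataForKey(
        BALANCES_STORE, userId, Serdes.String().serializer());

if (metadata == null || metadata.equals(StreamsMetadata.NOT_AVAILABLE)) {
    // this is what I keep hitting: no metadata for the key
} else if (metadata.hostInfo().equals(thisInstanceHostInfo)) {
    // the key is handled locally: query the local store as shown above
} else {
    // the key is handled by the other instance: forward the request (e.g. via REST)
    // to metadata.host() + ":" + metadata.port()
}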
My problem is that the call to KafkaStreams#metadataForKey always returns a null metadata object. However, KafkaStreams#allMetadataForStore() does return a metadata object containing the two instances. So it behaves as if it doesn't know about the key I'm querying, although looking the key up directly in the state store works.
Additional note: my application.server property is correctly set.
Thank you!
If a Play controller retrieves values from the Request (e.g. the logged-in user and his role) and those values need to be passed down to all the layers below the controller (e.g. service layer, DAO layer, etc.), what's the best way to create a "thread-local" type of object which any class in the application can use to retrieve those "user" and "userRole" values for that particular request? I am trying to avoid adding implicit parameters to a bunch of methods, and Play Cache doesn't look like an appropriate fit here. Also, Play's different scopes (session, flash, etc.) wouldn't behave right given that all the code is asynchronous: controller methods are async, service methods return Futures, etc. What I want is that "thread-local" kind of effect in an asynchronous environment.
Alternatives that are not a good fit
These alternatives are probably not helpful, because they assume a global state accessible by all functions across the processing of a request:
Thread local storage is a technique that is helpful for applications that process the request in a single thread, and that block until a response is generated. Although it's possible to do this with Play Framework, it's usually not the optimal design, since Play's strengths are of more benefit for asynchronous, non-blocking applications.
Session and flash are meant to carry data across HTTP requests. They're not globally available to all classes in an application; it would be necessary to pass the modified request across function calls to retrieve them.
A cache could in theory be used to carry this information, but it would have to have a unique key for each request, and it would be necessary to pass this key in each function call. Additionally, it would be necessary to make sure the cache data is not at risk of being evicted while processing the request, not even when cache memory is full.
Alternatives that may be a good fit
Assuming the controller, possibly through the Action call, retrieves the security data (user, role, etc.), and that the controller only deals with validating the request and generating a response, delegating domain logic to a domain object (possibly a service object):
Using the call stack: Pass the security data to all functions that need it, through an implicit parameter. Although the question is about finding an alternative to doing that, this approach makes it explicit what is being sent to the called function, and which functions require this data, instead of resorting to state maintained elsewhere.
Using OOP: Pass the security data in the constructor of the domain object, and in the domain object's methods, retrieve the security data from the object's instance.
Using actors: Pass the security data in the message sent to the actor.
If a domain object's method calls a function that also needs the security data, the same pattern would be applied: either pass it as (a possibly implicit) parameter, through a constructor, or in a message.
I am trying to write a MATLAB class which accepts requests for financial data and later asynchronously provides the data by triggering events. The whole logic can be described as follows:
1) Get a request for data on a security (SecId) with a callback function handle (@func)
2) Add a listener with event name "evnt_SecId" and callback function @func.
3) Collect all the data, filter it by security and fire the event specific to a particular security.
Now everything seems to be easy and doable in Matlab except that I cannot dynamically define events. Currently, I must define events for each SecId in the { events ... end } block.
Does anyone know of a way to dynamically declare events as the requests arrive?
An alternate solution I thought of: I could have one update "event" with all listeners associated with it, while the filtering for SecId takes place in the callbacks. This solution is unacceptable for performance reasons.
How about this:
make SecId a subclass of dynamicprops
instead of adding a regular listener, add a PostSet propListener and dynamically add a new property
send the message by setting the value of the property.
I have no idea about the performance characteristics of that solution, but it might do what you need.
How about filtering for the SecId in the "master" event firing method? This way the filtering only happens once per fired event. The class must then associate the listeners it has with the SecId each one was registered for.