EventStore basics - what's the difference between Event Meta Data/MetaData and Event Data? - metadata

I'm very much at the beginning of using / understanding EventStore or get-event-store as it may be known here.
I've consumed the documentation regarding clients, projections and subscriptions and feel ready to start using on some internal projects.
One thing I can't quite get past - is there a guide / set of recommendations to describe the difference between event metadata and data ? I'm aware of the notional differences; Event data is 'Core' to the domain, Meta data for describing, but it is becoming quite philisophical.
I wonder if there are hard rules regarding implementation (querying etc).
Any guidance at all gratefully received!

Shamelessly copying (and paraphrasing) parts from Szymon Kulec's blog post "Enriching your events with important metadata" (emphases mine):
But what information can be useful to store in the metadata, which info is worth to store despite the fact that it was not captured in
the creation of the model?
1. Audit data
who? – simply store the user id of the action invoker
when? – the timestamp of the action and the event(s)
why? – the serialized intent/action of the actor
2. Event versioning
The event sourcing deals with the effect of the actions. An action
executed on a state results in an action according to the current
implementation. Wait. The current implementation? Yes, the
implementation of your aggregate can change and it will either because
of bug fixing or introducing new features. Wouldn’t it be nice if
the version, like a commit id (SHA1 for gitters) or a semantic version
could be stored with the event as well? Imagine that you published a
broken version and your business sold 100 tickets before fixing a bug.
It’d be nice to be able which events were created on the basis of the
broken implementation. Having this knowledge you can easily compensate
transactions performed by the broken implementation.
3. Document implementation details
It’s quite common to introduce canary releases, feature toggling and
A/B tests for users. With automated deployment and small code
enhancement all of the mentioned approaches are feasible to have on a
project board. If you consider the toggles or different implementation
coexisting in the very same moment, storing the version only may be
not enough. How about adding information which features were applied
for the action? Just create a simple set of features enabled, or map
feature-status and add it to the event as well. Having this and the
command, it’s easy to repeat the process. Additionally, it’s easy to
result in your A/B experiments. Just run the scan for events with A
enabled and another for the B ones.
4. Optimized combination of 2. and 3.
If you think that this is too much, create a lookup for sets of
versions x features. It’s not that big and is repeatable across many
users, hence you can easily optimize storing the set elsewhere, under
a reference key. You can serialize this map and calculate SHA1, put
the values in a map (a table will do as well) and use identifiers to
put them in the event. There’s plenty of options to shift the load
either to the query (lookups) or to the storage (store everything as
named metadata).
Summing up
If you create an event sourced architecture, consider adding the
temporal dimension (version) and a bit of configuration to the
metadata. Once you have it, it’s much easier to reason about the
sources of your events and introduce tooling like compensation.
There’s no such thing like too much data, is there?

I will share my experiences with you which may help. I have been playing with akka-persistence, akka-persistence-eventstore and eventstore. akka-persistence stores it's event wrapper, a PersistentRepr, in binary format. I wanted this data in JSON so that I could:
use projections
make these events easily available to any other technologies
You can implement your own serialization for akka-persistence-eventstore to do this, but it still ended up just storing the wrapper which had my event embedded in a payload attribute. The other attributes were all akka-persistence specific. The author of akka-persistence-eventstore gave me some good advice, get the serializer to store the payload as the Data, and the rest as MetaData. That way my event is now just the business data, and the metadata aids the technology that put it there in the first place. My projections now don't need to parse out the metadata to get at the payload.

Related

Firebase analytics - Unity - time spent on a level

is there any possibility to get exact time spent on a certain level in a game via firebase analytics? Thank you so much 🙏
I tried to use logEvents.
The best way to do so would be measuring the time on the level within your codebase, then have a very dedicated event for level completion, in which you would pass the time spent on the level.
Let's get to details. I will use Kotlin as an example, but it should be obvious what I'm doing here and you can see more language examples here.
firebaseAnalytics.setUserProperty("user_id", userId)
firebaseAnalytics.logEvent("level_completed") {
param("name", levelName)
param("difficulty", difficulty)
param("subscription_status", subscriptionStatus)
param("minutes", minutesSpentOnLevel)
param("score", score)
}
Now see how I have a bunch of parameters with the event? These parameters are important since they will allow you to conduct a more thorough and robust analysis later on, answer more questions. Like, Hey, what is the most difficult level? Do people still have troubles on it when the game difficulty is lower? How many times has this level been rage-quit or lost (for that you'd likely need a level_started event). What about our paid players, are they having similar troubles on this level as well? How many people have ragequit the game on this level and never played again? That would likely be easier answer with sql at this point, taking the latest value of the level name for the level_started, grouped by the user_id. Or, you could also have levelName as a UserProperty as well as the EventProperty, then it would be somewhat trivial to answer in the default analytics interface.
Note that you're limited in the number of event parameters you can send per event. The total number of unique parameter names is limited too. As well as the number of unique event names you're allowed to have. In our case, the event name would be level_completed. See the limits here.
Because of those limitations, it's important to name your event properties in somewhat generic way so that you would be able to efficiently reuse them elsewhere. For this reason, I named minutes and not something like minutes_spent_on_the_level. You could then reuse this property to send the minutes the player spent actively playing, minutes the player spent idling, minutes the player spent on any info page, minutes they spent choosing their upgrades, etc. Same idea about having name property rather than level_name. Could as well be id.
You need to carefully and thoughtfully stuff your event with event properties. I normally have a wrapper around the firebase sdk, in which I would enrich events with dimensions that I always want to be there, like the user_id or subscription_status to not have to add them manually every time I send an event. I also usually have some more adequate logging there Firebase Analytics default logging is completely awful. I also have some sanitizing there, lowercasing all values unless I'm passing something case-sensitive like base64 values, making sure I don't have double spaces (so replacing \s+ with " " (space)), maybe also adding the user's local timestamp as another parameter. The latter is very helpful to indicate time-cheating users, especially if your game is an idler.
Good. We're halfway there :) Bear with me.
Now You need to go to firebase and register your eps (event parameters) into cds (custom dimensions and metrics). If you don't register your eps, they won't be counted towards the global cd limit count (it's about 50 custom dimensions and 50 custom metrics). You register the cds in the Custom Definitions section of FB.
Now you need to know whether this is a dimension or a metric, as well as the scope of your dimension. It's much easier than it sounds. The rule of thumb is: if you want to be able to run mathematical aggregation functions on your dimension, then it's a metric. Otherwise - it's a dimension. So:
firebaseAnalytics.setUserProperty("user_id", userId) <-- dimension
param("name", levelName) <-- dimension
param("difficulty", difficulty) <-- dimension (or can be a metric, depends)
param("subscription_status", subscriptionStatus) <-- dimension (can be a metric too, but even less likely)
param("minutes", minutesSpentOnLevel) <-- metric
param("score", score) <-- metric
Now another important thing to understand is the scope. Because Firebase and GA4 are still, essentially just in Beta being actively worked on, you only have user or hit scope for the dimensions and only hit for the metrics. The scope basically just indicates how the value persists. In my example, we only need the user_id as a user-scoped cd. Because user_id is the user-level dimension, it is set separately form the logEvent function. Although I suspect you can do it there too. Haven't tried tho.
Now, we're almost there.
Finally, you don't want to use Firebase to look at your data. It's horrible at data presentation. It's good at debugging though. Cuz that's what it was intended for initially. Because of how horrible it is, it's always advised to link it to GA4. Now GA4 will allow you to look at the Firebase values much more efficiently. Note that you will likely need to re-register your custom dimensions from Firebase in GA4. Because GA4 is capable of getting multiple data streams, of which firebase would be just one data source. But GA4's CDs limits are very close to Firebase's. Ok, let's be frank. GA4's data model is almost exactly copied from that of Firebase's. But GA4 has a much better analytics capabilities.
Good, you've moved to GA4. Now, GA4 is a very raw not-officially-beta product as well as Firebase Analytics. Because of that, it's advised to first change your data retention to 12 months and only use the explorer for analysis, pretty much ignoring the pre-generated reports. They are just not very reliable at this point.
Finally, you may find it easier to just use SQL to get your analysis done. For that, you can easily copy your data from GA4 to a sandbox instance of BQ. It's very easy to do.This is the best, most reliable known method of using GA4 at this moment. I mean, advanced analysts do the export into BQ, then ETL the data from BQ into a proper storage like Snowflake or even s3, or Aurora, or whatever you prefer and then on top of that, use a proper BI tool like Looker, PowerBI, Tableau, etc. A lot of people just stay in BQ though, it's fine. Lots of BI tools have BQ connectors, it's just BQ gets expensive quickly if you do a lot of analysis.
Whew, I hope you'll enjoy analyzing your game's data. Data-driven decisions rock in games. Well... They rock everywhere, to be honest.

Axon or Kafka to support CQRS/ES

Consider the simple use case in which I want to store product ratings as events in an event store.
I could use two different approaches:
Using Axon: A Rating aggregate is responsible for handling the CreateRatingCommand and sending the RatingCreatedEvent. Sending the event would case the Rating to be stored in the event store. Other event handlers have the possibility to replay the event stream when connecting to the Axon server instance and doing whatever needed with the ratings. In this case, the event handler will be used as a stream processor.
Using Kafka: A KafkaProducer will be used to store a Rating POJO (after proper serialization) in a Kafka topic. Setting the topic's retention time to indefinite would cause no events to get lost in time. Kafka Streams would in this case be used to do the actual rating processing logic.
Some architectural questions appear to me for both approaches:
When using Axon:
Is there any added value to use Axon (or similar solutions) if there is no real state to be maintained or altered within the aggregate? The aggregate just serves as a "dumb" placeholder for the data, but does not provide any state changing logic.
How does Axon handle multiple event handlers of the same event type? Will they all handle the same event (same aggregate id) in parallel, or is the same event only handled once by one of the handlers?
Are events stored in the Axon event store kept until the end of time?
When using Kafka:
Kafka stores events/messages with the same key in the same partition. How does one select the best value for a key in the use case of user-product ratings? UserId, ProductId or a separate topic for both and publish each event in both topics.
Would it be wise to use a separate topic for each user and each product resulting in a massive amount of topics on the cluster? (Approximately <5k products and >10k users).
I don't know if SO is the preferred forum for this kind of questions... I was just wondering what you (would) recommend in this particular use case as the best practise. Looking forward to your feedback and feel free to point out other points of thought I missed in the previous questions.
EDIT#12/11/2020 : I just found a related discussion containing useful information related to my question.
As Jan Galinski already puts it, this hasn't got a fool proof answer to it really. This is worth a broader discussion on for example indeed AxonIQ's Discuss forum. Regardless, there are some questions in here I can definitely give an answer to, so let's get to it:
Axon Question 1 - Axon Framework is as you've noticed used a lot for DDD centric applications. Nothing however forces you to base yourself on that notion at all. You can strip the framework from Event Sourcing specifics, as well as modelling specifics entirely and purely go for the messaging idea of distinct commands, events and queries. It has been a conscious decision to segregate Axon Framework version 3 into these sub-part when version 4 (current) was released actually. Next to that, I think there is great value in not just basing yourself on event messages. Using distinct commands and queries only further decouples your components, making for a far richer and easier to extend application landscape.
Axon Question 2 - This depends on where the #EventHandler annotated methods are located actually. If they're in the same class only one will be invoked. If they're positioned into distinct classes, then both will receive the same event. Furthermore if they're segregated between distinct classes, it is important to note Axon uses an Event Processor as the technical solution to invoking your event handlers. If distinct classes are grouped under the same Event Processor, you can impose a certain ordering which handler is invoked first. Next to this if the event handling should occur in parallel, you will have to configure a so called TrackingEventProcessor (the default in Axon Framework), as it allows configuration of several threads to handle events concurrently. Well, to conclude this section, everything you're asking in question two is an option, neither a necessity. Just a matter of configuration really. Might be worth checking up on this documentation page of Axon Framework on the matter.
Axon Question 3 - As Axon Server serves the purpose of an Event Store, there is no retention period at all. So yes, they're by default kept until the end of time. There is nothing stopping your from dropping the events though, if you feel there's no value in storing the events to for example base all your models on (as you'd do when using Event Sourcing).
It's the Kafka question I'm personally less familiar with (figures as a contributor to Axon Framework I guess). I can give you my two cents on the matter here too though, although I'd recommend a second opinion here:
Kafka Question 1 - From my personal feeling of what such an application would require, I'd assume you'd want to be able to retrieve all data for a given product as efficient as possible. I'd wager it's important that all events are in the same partition to make this process as efficient as possible, is it wouldn't require any merging afterwards. With this in mind, I'd think using the ProductId will make most sense.
Kafka Question 2 - If you are anticipating only 5_000 products and 10_000 users, I'd guess it should be doable to have separate topics for these. Opinion incoming - It is here though were I personally feel that Kafka's intent to provide you direct power to decide on when to use topics over complicates from what you'd actually try to achieve, which business functionality. Giving the power to segregate streams feels more like an after thought from the perspective of application development. As soon as you'd require an enterprise grade/efficient message bus, that's when this option really shines I think, as then you can optimize for bulk.
Hoping all this helps you further #KDW!

understanding Lagoms persistent read side

I read through the Lagom documentation, and already wrote a few small services that interact with each other. But because this is my first foray into CQRS i still have a few conceptual issues about the persistent read side that i don't really understand.
For instance, i have a user-service that keeps a list of users (as aggregates) and their profile data like email addresses, names, addresses, etc.
The questions i have now are
if i want to retrieve the users profile given a certain email-address, should i query the read side for the users id, and then query the event-store using this id for the profile data? or should the read side already keep all profile information?
If the read side has all information, what is the reason for the event-store? If its truly write-only, it's not really useful is it?
Should i design my system that i can use the event-store as much as possible or should i have a read side for everything? what are the scalability implications?
if the user-model changes (for instance, the profile now includes a description of the profile) and i use a read-side that contains all profile data, how do i update this read side in lagom to now also contain this description?
Following that question, should i keep different read-side tables for different fields of the profile instead of one table containing the whole profile
if a different service needs access to the data, should it always ask the user-service, or should it keep its own read side as needed? In case of the latter, doesn't that violate the CQRS principle that the service that owns the data should be the only one reading and writing that data?
As you can see, this whole concept hasn't really 'clicked' yet, and i am thankful for answers and/or some pointers.
if i want to retrieve the users profile given a certain email-address, should i query the read side for the users id, and then query the event-store using this id for the profile data? or should the read side already keep all profile information?
You should use a specially designed ReadModel for searching profiles using the email address. You should query the Event-store only to rehydrate the Aggregates, and you rehydrate the Aggregates only to send them commands, not queries. In CQRS an Aggregate may not be queried.
If the read side has all information, what is the reason for the event-store? If its truly write-only, it's not really useful is it?
The Event-store is the source of truth for the write side (Aggregates). It is used to rehydrate the Aggregates (they rebuild their internal & private state based on the previous emitted events) before the process commands and to persist the new events. So the Event-store is append-only but also used to read the event-stream (the events emitted by an Aggregate instance). The Event-store ensures that an Aggregate instance (that is, identified by a type and an ID) processes only a command at a time.
if the user-model changes (for instance, the profile now includes a description of the profile) and i use a read-side that contains all profile data, how do i update this read side in lagom to now also contain this description?
I don't use any other framework but my own but I guess that you rewrite (to use the new added field on the events) and rebuild the ReadModel.
Following that question, should i keep different read-side tables for different fields of the profile instead of one table containing the whole profile
You should have a separate ReadModel (with its own table(s)) for each use case. The ReadModel should be blazing fast, this means it should be as small as possible, only with the fields needed for that particular use case. This is very important, it is one of the main benefits of using CQRS.
if a different service needs access to the data, should it always ask the user-service, or should it keep its own read side as needed? In case of the latter, doesn't that violate the CQRS principle that the service that owns the data should be the only one reading and writing that data?
Here depends on you, the architect. It is preferred that each ReadModel owns its data, that is, it should subscribe to the right events, it should not depend on other ReadModels. But this leads to a lot of code duplication. In my experience I've seen a desire to have some canonical ReadModels that own some data but also can share it on demand. For this, in CQRS, there is also the term query. Just like commands and events, queries can travel in your system, but only from ReadModel to ReadModel.
Queries should not be sent during a client's request. They should be sent only in the background, as an asynchronous synchronization mechanism. This is an important aspect that influences the resilience and responsiveness of your system.
I've use also live queries, that are pushed from the authoritative ReadModels to the subscribed ReadModels in real time, when the answer changes.
In case of the latter, doesn't that violate the CQRS principle that the service that owns the data should be the only one reading and writing that data?
No, it does not. CQRS does not specify how the R (Read side) is updated, only that the R should not process commands and C should not be queried.

J Oliver EventStore V2.0 questions

I am embarking upon an implementation of a project using CQRS and intend to use the J Oliver EventStore V2.0 as my persistence engine for events.
1) In the documentation, ExampleUsage.cs uses 3 serializers in "BuildSerializer". I presume this is just to show the flexibility of the deserialization process?
2) In the "Restart after failure" case where some events were not dispatched I believe I need startup code that invokes GetUndispatchedCommits() and then dispatch them, correct?
3) Again, in "ExampleUseage.cs" it would be useful if "TakeSnapshot" added the third event to the eventstore and then "LoadFromSnapShotForward" not only retrieve the most recent snapshot but also retrieved events that were post snapshot to simulate the rebuild of an aggregate.
4) I'm failing to see the use of retaining older snapshots. Can you give a use case where they would be useful?
5) If I have a service that is handling receipt of commands and generation of events what is a suggested strategy for keeping track of the number of events since the last snapshot for a given aggregate. I certainly don't want to invoke "GetStreamsToSnapshot" too often.
6) In the SqlPersistence.SqlDialects namespace the sql statement name is "GetStreamsRequiringSnaphots" rather than "GetStreamsRequiringSnapShots"
1) There are a few "base" serializers--such as the Binary, JSON, and BSON serializers. The other two in the example--GZip/Compression and Encryption serializers are wrapping serializers and are only meant to modify what's already been serialized into a byte stream. For the example, I'm just showing flexibility. You don't have to encrypt if you don't want to. In fact, I've got stuff running production that uses simple JSON which makes debugging very easy because everything is text.
2) The SynchronousDispatcher and AsychronousDispatcher implementations are both configured to query and find any undispatched commits. You shouldn't have to do anything special.
3) Greg Young talked about how he used to "inline" his snapshots with the main event stream, but there were a number of optimistic concurrency and race conditions in high-performance systems that came up. He therefore decided to move them "out of band". I have followed this decision for many of the same reasons.
In addition snapshots are really a performance consideration when you have extrememly low SLAs. If you have a stream with a few thousand events on it and you don't have low SLAs, why not just take the minimal performance hit instead of adding additional complexity into your system. In other words, snapshots are "ancillary" concepts. They're in the EventStore API, but they're an optional concept that should be considered for certain use cases.
4) Let's suppose you had an aggregate with tens of millions of events and you wanted to run a "what if" scenario from before your most recent snapshot. It's a lot cheaper to go from another snapshot forward. The really nice thing about snapshots being a secondary concept is that if you wanted to drop older snapshots you could and it wouldn't affect your system at all.
5) There is a method in each implementation of IPersistStreams called GetStreamsRequiringSnapshots. You provide a threshold of 50, for example which finds all streams having 50 or more events since their last snapshot. This can (and probably should) be done asynchronously from your normal processing.
6) "Snapshots" is the correct casing for that word. Much like "website" used to be "Web site" but because of common usage it became "website".

Recreate a graph that change in time

I have an entity in my domain that represent a city electrical network. Actually my model is an entity with a List that contains breakers, transformers, lines.
The network change every time a breaker is opened/closed, user can change connections etc...
In all examples of CQRS the EventStore is queried with Version and aggregateId.
Do you think I have to implement events only for the "network" aggregate or also for every "Connectable" item?
In this case when I have to replay all events to get the "actual" status (based on a date) I can have near 10000-20000 events to process.
An Event modify one property or I need an Event that modify an object (containing all properties of the object)?
Theres always an exception to the rule but I think you need to have an event for every command handled in your domain. You can get around the problem of processing so many events by making use of Snapshots.
http://thinkbeforecoding.com/post/2010/02/25/Event-Sourcing-and-CQRS-Snapshots
I assume you mean currently your "connectable items" are part of the "network" aggregate and you are asking if they should be their own aggregate? That really depends on the nature of your system and problem and is more of a DDD issue than simple a CQRS one. However if the nature of your changes is typically to operate on the items independently of one another then then should probably be aggregate roots themselves. Regardless in order to answer that question we would need to know much more about the system you are modeling.
As for the challenge of replaying thousands of events, you certainly do not have to replay all your events for each command. Sure snapshotting is an option, but even better is caching the aggregate root objects in memory after they are first loaded to ensure that you do not have to source from events with each command (unless the system crashes, in which case you can rely on snapshots for quicker recovery though you may not need them with caching since you only pay the penalty of loading once).
Now if you are distributing this system across multiple hosts or threads there are some other issues to consider but I think that discussion is best left for another question or the forums.
Finally you asked (I think) can an event modify more than one property of the state of an object? Yes if that is what makes sense based on what that event represents. The idea of an event is simply that it represents a state change in the aggregate, however these events should also represent concepts that make sense to the business.
I hope that helps.