How to "join" a frequently updating stream with an irregularly updating stream in Apache Beam? - apache-beam

I have a stream of measurements keyed by an ID PCollection<KV<ID,Measurement>> and something like a changelog stream of additional information for that ID PCollection<KV<ID,SomeIDInfo>>. New data is added to the measurement stream quite regularly, say once per second for every ID. The stream with additional information on the other hand is only updated when a user performs manual re-configuration. We can't tell often this happens and, in particular, the update frequency may vary among IDs.
My goal is now to enrich each entry in the measurements stream by the additional information for its ID. That is, the output should be something like PCollection<KV<ID,Pair<Measurement,SomeIDInfo>>>. Or, in other words, I would like to do a left join of the measurements stream with the additional information stream.
I would expect this to be a quite common use case. Coming from Kafka Streams, this can be quite easily implemented with a KStream-KTable-Join. With Beam, however, all my approaches so far seem not to work. I already thought about the following ideas.
Idea 1: CoGroupByKey with fixed time windows
Applying a window to the measurements stream would not be an issue. However, as the additional information stream is updating irregularly and also significantly less frequently than the measurements stream, there is no reasonable common window size such that there is at least one updated information for each ID.
Idea 2: CoGroupByKey with global window and as non-default trigger
Refining the previous idea, I thought about using a processing-time trigger, which fires e.g. every 5 seconds. The issue with this idea is that I need to use accumulatingFiredPanes() for the additional information as there might be no new data for a key between two firings, but I have to use discardingFiredPanes() for the measurements stream as otherwise my panes would quickly become too large. This simply does not work. When I configure my pipeline that way, also the additional information stream discards changes. Setting both trigger to accumulating it works, but, as I said, this is not scalable.
Idea 3: Side inputs
Another idea would be to use side inputs, but also this solution is not really scalable - at least if I don't miss something. With side inputs, I would create a PCollectionView from the additional information stream, which is a map of IDs to the (latest) additional information. The "join" can than be done in a DoFn with a side input of that view. However, the view seems to be shared by all instances that perform the side input. (It's a bit hard to find any information regarding this.) We would like to not make any assumptions regarding the amount of IDs and the size of additional info. Thus, using a side input seems also not to work here.

The side input option you discuss is currently the best option, although you are correct about the scalability concern due to the side input being broadcast to all workers.
Alternatively, you can store the infrequently-updated side in an external key-value store and just do lookups from a DoFn. If you go this route, it's generally useful to do a GroupByKey first on the main input with ID as a key, which lets you cache the lookups with a good cache-hit ratio.

Related

Firebase analytics - Unity - time spent on a level

is there any possibility to get exact time spent on a certain level in a game via firebase analytics? Thank you so much 🙏
I tried to use logEvents.
The best way to do so would be measuring the time on the level within your codebase, then have a very dedicated event for level completion, in which you would pass the time spent on the level.
Let's get to details. I will use Kotlin as an example, but it should be obvious what I'm doing here and you can see more language examples here.
firebaseAnalytics.setUserProperty("user_id", userId)
firebaseAnalytics.logEvent("level_completed") {
param("name", levelName)
param("difficulty", difficulty)
param("subscription_status", subscriptionStatus)
param("minutes", minutesSpentOnLevel)
param("score", score)
}
Now see how I have a bunch of parameters with the event? These parameters are important since they will allow you to conduct a more thorough and robust analysis later on, answer more questions. Like, Hey, what is the most difficult level? Do people still have troubles on it when the game difficulty is lower? How many times has this level been rage-quit or lost (for that you'd likely need a level_started event). What about our paid players, are they having similar troubles on this level as well? How many people have ragequit the game on this level and never played again? That would likely be easier answer with sql at this point, taking the latest value of the level name for the level_started, grouped by the user_id. Or, you could also have levelName as a UserProperty as well as the EventProperty, then it would be somewhat trivial to answer in the default analytics interface.
Note that you're limited in the number of event parameters you can send per event. The total number of unique parameter names is limited too. As well as the number of unique event names you're allowed to have. In our case, the event name would be level_completed. See the limits here.
Because of those limitations, it's important to name your event properties in somewhat generic way so that you would be able to efficiently reuse them elsewhere. For this reason, I named minutes and not something like minutes_spent_on_the_level. You could then reuse this property to send the minutes the player spent actively playing, minutes the player spent idling, minutes the player spent on any info page, minutes they spent choosing their upgrades, etc. Same idea about having name property rather than level_name. Could as well be id.
You need to carefully and thoughtfully stuff your event with event properties. I normally have a wrapper around the firebase sdk, in which I would enrich events with dimensions that I always want to be there, like the user_id or subscription_status to not have to add them manually every time I send an event. I also usually have some more adequate logging there Firebase Analytics default logging is completely awful. I also have some sanitizing there, lowercasing all values unless I'm passing something case-sensitive like base64 values, making sure I don't have double spaces (so replacing \s+ with " " (space)), maybe also adding the user's local timestamp as another parameter. The latter is very helpful to indicate time-cheating users, especially if your game is an idler.
Good. We're halfway there :) Bear with me.
Now You need to go to firebase and register your eps (event parameters) into cds (custom dimensions and metrics). If you don't register your eps, they won't be counted towards the global cd limit count (it's about 50 custom dimensions and 50 custom metrics). You register the cds in the Custom Definitions section of FB.
Now you need to know whether this is a dimension or a metric, as well as the scope of your dimension. It's much easier than it sounds. The rule of thumb is: if you want to be able to run mathematical aggregation functions on your dimension, then it's a metric. Otherwise - it's a dimension. So:
firebaseAnalytics.setUserProperty("user_id", userId) <-- dimension
param("name", levelName) <-- dimension
param("difficulty", difficulty) <-- dimension (or can be a metric, depends)
param("subscription_status", subscriptionStatus) <-- dimension (can be a metric too, but even less likely)
param("minutes", minutesSpentOnLevel) <-- metric
param("score", score) <-- metric
Now another important thing to understand is the scope. Because Firebase and GA4 are still, essentially just in Beta being actively worked on, you only have user or hit scope for the dimensions and only hit for the metrics. The scope basically just indicates how the value persists. In my example, we only need the user_id as a user-scoped cd. Because user_id is the user-level dimension, it is set separately form the logEvent function. Although I suspect you can do it there too. Haven't tried tho.
Now, we're almost there.
Finally, you don't want to use Firebase to look at your data. It's horrible at data presentation. It's good at debugging though. Cuz that's what it was intended for initially. Because of how horrible it is, it's always advised to link it to GA4. Now GA4 will allow you to look at the Firebase values much more efficiently. Note that you will likely need to re-register your custom dimensions from Firebase in GA4. Because GA4 is capable of getting multiple data streams, of which firebase would be just one data source. But GA4's CDs limits are very close to Firebase's. Ok, let's be frank. GA4's data model is almost exactly copied from that of Firebase's. But GA4 has a much better analytics capabilities.
Good, you've moved to GA4. Now, GA4 is a very raw not-officially-beta product as well as Firebase Analytics. Because of that, it's advised to first change your data retention to 12 months and only use the explorer for analysis, pretty much ignoring the pre-generated reports. They are just not very reliable at this point.
Finally, you may find it easier to just use SQL to get your analysis done. For that, you can easily copy your data from GA4 to a sandbox instance of BQ. It's very easy to do.This is the best, most reliable known method of using GA4 at this moment. I mean, advanced analysts do the export into BQ, then ETL the data from BQ into a proper storage like Snowflake or even s3, or Aurora, or whatever you prefer and then on top of that, use a proper BI tool like Looker, PowerBI, Tableau, etc. A lot of people just stay in BQ though, it's fine. Lots of BI tools have BQ connectors, it's just BQ gets expensive quickly if you do a lot of analysis.
Whew, I hope you'll enjoy analyzing your game's data. Data-driven decisions rock in games. Well... They rock everywhere, to be honest.

kdb - customized data streaming/ticker plant?

We've been using kdb to handle a number of calculations focused more on traditional desktop sources. We have deployed our web application and are looking to make the leap as to how best to pick up data changes and re-calculate them in kdb to render a "real-time" view of the data as it changes.
From what I've been reading, the use of data loaders(feed handlers) into our own equivalent of a "ticker plant" as a data store is the most documented ideal solution. So far, we have been "pushing" data into kdb directly and calculating as part of a script so we are trying to make the leap from calculation-on-demand to a "live" calculation as data inputs are edited by user.
I'm trying to understand how to manage the feed handlers and timing of updates. We really only want to move data when it changes (web-front end so trying to figure out how best to "trigger" when things change (such as save or lost focus on an editable data grid for example.) We are also thinking our database as the "ticker plant" itself which may minimize feedhandlers.
I found a reference below and it looks like its running a forever-loop which feels excessive but understand the original use case for kdb and streaming data.
Feedhandler - sending data to tickerplant
Does this sound like a solid workflow?
Many thanks in advance!
Resources we've referencing:
Official Manual -https://github.com/KxSystems/kdb/blob/master/d/tick.htm
kdb+ Tick overview: http://www.timestored.com/kdb-guides/kdb-tick-data-store
Source code: https://github.com/KxSystems/kdb-tick
There's a lot to parse here but some general thoughts/ideas:
Yes, most examples of feedhandlers are set up as forever loops but this is often just for convenience for demoing.
Ideally a live data flow should work based on event handling, aka on-event triggers. Kdb/q has this out of the box in the form of the .z handlers. Other languages should have similar concepts of event handling
Some more examples of python/java feeders are here: https://github.com/exxeleron
There's also some details on the official Kx site: https://code.kx.com/q/wp/capi/#publishing-to-a-kdb-tickerplant
It still might be a viable option to have a forever loop, or at least a short timer in the event you want to batch data.
Depending on the amount of dataflow a tickerplant might be overkill for your use-case, but a tickerplant is still useful for (a) separating your processing from the processing of dataflow (i.e. data can still flow through the tickerplant while another process is consuming/calculating) and (b) logging data for recovery purposes.

Event Sourcing - stream design

I am sitting here looking into CQRS and event sourcing, really interesting topics. When it comes to stream design, and and aggregate roots, i feel a bit left in the dark. How do you do it?
Lets imagine that i have an UI, where i can add stuff to a basket, generating a lines in a basket.
Would I have:
a stream pr basket (with basic info attached, like shipping details, name, email etc)
a stream pr basketline
So i would have many streams
streams/basket-[basketid]
streams/basketline-[basketid]
Basically i only send the minimal data over the wire.
or would i simply have one stream
stream/basket-[basketid]
And every time i add a line to my basket, i send the whole basket over the wire.
As i understand it, it is best to have one to many streams, and not one big streams/basket stream. Or am I mistaken here as well?
My focus here is streams. Any "best practices" on this kind of design: Links, books etc would be appriciated.
How do you do it?
Start by watching All Our Aggregates are Wrong (Mauro Servienti, 2019), which considers the question of how many different aggregates you might need to represent a digital shopping cart.
I tend to think of aggregates as graphs of information - if two pieces of information must change together (A changes, and therefore B must also change RIGHT NOW; or A can't change, because its range of allowed values is constrained by B), then they belong to the same aggregate. The boundary of the aggregate separates information that is tightly coupled together from everything else.
Because distributed transactions are hard, it follows that we want our aggregates stored in such a way that changing an aggregate only requires holding one single lock. For example, we won't normally spread a single instance of an aggregate across multiple databases, because ensuring that all of the databases change in exactly the right way at the "same" time is really hard.
We normally store all of the information that is tightly coupled together in a single event stream for exactly the same reason: there's only a single lock to manage.

Kafka Topics--should I have more or fewer of them?

We are new to Kafka, so I am looking for some high level guidance. We have data for a single entity (we can call it an "Order") that is essentially a number of different entities (we can call one a "Widget" and one a "Gizmo," but there are about 20 different entity types).
Obviously, there is benefit to thinking of Orders as a single topic because all the parts are related to one order. But design wise, does it make more sense for these to be separate topics (Orders, Widgets, Gizmos, etc.)?
There is no direct correlation between the Widgets and Gizmos--the benefit of keeping them together would be things like order of processing, etc. And suggestions or good resources to read would be very helpful. Thanks!
I would recommend initially recording the event as a single atomic message, and not splitting it up into several messages in several topics. It’s best to record events exactly as you receive them, in a form that is as raw as possible. You can always split up the compound event later, using a stream processor—but it’s much harder to reconstruct the original event if you split it up prematurely. Even better, you can give the initial event a unique ID (e.g. a UUID); that way later on when you split the original event into one event for each entity involved, you can carry that ID forward, making the provenance of each event traceable.

Recreate a graph that change in time

I have an entity in my domain that represent a city electrical network. Actually my model is an entity with a List that contains breakers, transformers, lines.
The network change every time a breaker is opened/closed, user can change connections etc...
In all examples of CQRS the EventStore is queried with Version and aggregateId.
Do you think I have to implement events only for the "network" aggregate or also for every "Connectable" item?
In this case when I have to replay all events to get the "actual" status (based on a date) I can have near 10000-20000 events to process.
An Event modify one property or I need an Event that modify an object (containing all properties of the object)?
Theres always an exception to the rule but I think you need to have an event for every command handled in your domain. You can get around the problem of processing so many events by making use of Snapshots.
http://thinkbeforecoding.com/post/2010/02/25/Event-Sourcing-and-CQRS-Snapshots
I assume you mean currently your "connectable items" are part of the "network" aggregate and you are asking if they should be their own aggregate? That really depends on the nature of your system and problem and is more of a DDD issue than simple a CQRS one. However if the nature of your changes is typically to operate on the items independently of one another then then should probably be aggregate roots themselves. Regardless in order to answer that question we would need to know much more about the system you are modeling.
As for the challenge of replaying thousands of events, you certainly do not have to replay all your events for each command. Sure snapshotting is an option, but even better is caching the aggregate root objects in memory after they are first loaded to ensure that you do not have to source from events with each command (unless the system crashes, in which case you can rely on snapshots for quicker recovery though you may not need them with caching since you only pay the penalty of loading once).
Now if you are distributing this system across multiple hosts or threads there are some other issues to consider but I think that discussion is best left for another question or the forums.
Finally you asked (I think) can an event modify more than one property of the state of an object? Yes if that is what makes sense based on what that event represents. The idea of an event is simply that it represents a state change in the aggregate, however these events should also represent concepts that make sense to the business.
I hope that helps.