Adaptive bitrate streaming protocol (for any data) - streaming

I'm looking for some ideas/hints for a streaming protocol (similar to video/audio streaming) to send arbitrary data in so-called real time.
In simple words:
I'm producing some data each second (let's say one array with 1 MB of data per second) and I'm sorting that data from most important to least important (e.g. putting it into priority queues or similar)
I would like to keep streaming that data via some protocol, and in the ideal case I would like to send all of it
If that is not possible (bandwidth, dropped packets etc.) I would like to send as much of each produced array as possible (the first n bytes), just to keep the data flowing (it is important to start sending each newly produced array every second).
And now - I'm looking for a protocol/library that will handle the adaptive bitrate part for arbitrary data. I would expect it to tell me how much data I can send (put into send buffers or a similar approach). The closest analogy is video/audio streaming, where in poor network conditions the (en)coder changes quality depending on the conditions.
It is also OK if I lose some of the sent data (so UDP underneath all of this is fine), but preferably I would like to send as much data as possible per second without losing anything from those first n bytes sent.
Do you have any ideas for protocols/libraries I could use for the client/server? (hopefully some libraries in Python, C or C++)
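To make it concrete, here is roughly the behaviour I'm after, sketched in Python over plain UDP (the names, the 1-second tick and the fixed budget are made up for illustration; a real solution would adjust the budget from receiver feedback):

import socket, time

MTU = 1200                      # payload bytes per datagram (assumption)
budget = 512 * 1024             # bytes we allow ourselves to send this second
dest = ("198.51.100.10", 9000)  # placeholder receiver address
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def produce_array():
    # stand-in for the real producer: 1 MB already sorted by importance
    return bytes(1024 * 1024)

seq = 0
while True:
    start = time.monotonic()
    data = produce_array()
    to_send = data[:budget]                     # only the most important prefix
    for off in range(0, len(to_send), MTU):
        header = seq.to_bytes(4, "big") + off.to_bytes(4, "big")
        sock.sendto(header + to_send[off:off + MTU], dest)
    seq += 1
    # the "adaptive" part is the open question: here the budget is static,
    # whereas the protocol/library I'm looking for would raise or lower it for me
    time.sleep(max(0.0, 1.0 - (time.monotonic() - start)))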

I think IPFIX (the generic NetFlow standard) has everything you need.
You can avoid a timestamp per sample by sending a samplingInterval update every time you change your rate; that change in sampling can also be announced asynchronously.
As for where to put your data: you can create a new field or just use an existing one that has the datatype you want. For example, if you are just sending uint64 sample values, it might be easier to use packetDeltaCount than to create your own field definition.
There are plenty of IPFIX libraries.
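For a rough feel of what that looks like on the wire, here is a hand-packed, single-field sketch in Python (illustration only: it reuses packetDeltaCount, information element 2, as an unsigned64 sample value; the collector address, template ID and everything else are assumptions, and a real exporter should use one of the existing IPFIX libraries instead):

import socket, struct, time

TEMPLATE_ID = 256                    # data templates start at 256
OBSERVATION_DOMAIN = 1
collector = ("203.0.113.5", 4739)    # placeholder collector, standard IPFIX port
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def ipfix_message(sets, seq):
    # message header: version=10, total length, export time, sequence, domain id
    return struct.pack("!HHIII", 10, 16 + len(sets),
                       int(time.time()), seq, OBSERVATION_DOMAIN) + sets

# template set (set id 2): one template with a single field,
# packetDeltaCount = information element 2, length 8 (unsigned64)
template_record = struct.pack("!HH", TEMPLATE_ID, 1) + struct.pack("!HH", 2, 8)
template_set = struct.pack("!HH", 2, 4 + len(template_record)) + template_record
sock.sendto(ipfix_message(template_set, 0), collector)

# data set (set id = template id): one unsigned64 sample value per record
sample_value = 123456
data_record = struct.pack("!Q", sample_value)
data_set = struct.pack("!HH", TEMPLATE_ID, 4 + len(data_record)) + data_record
sock.sendto(ipfix_message(data_set, 0), collector)   # sequence = data records sent so far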

Related

Firebase analytics - Unity - time spent on a level

Is there any possibility to get the exact time spent on a certain level in a game via Firebase Analytics? Thank you so much 🙏
I tried to use logEvents.
The best way to do so would be to measure the time on the level within your codebase, then have a dedicated event for level completion, in which you pass the time spent on the level.
Let's get into the details. I will use Kotlin as an example, but it should be obvious what I'm doing here and you can see more language examples here.
firebaseAnalytics.setUserProperty("user_id", userId)
firebaseAnalytics.logEvent("level_completed") {
param("name", levelName)
param("difficulty", difficulty)
param("subscription_status", subscriptionStatus)
param("minutes", minutesSpentOnLevel)
param("score", score)
}
Now see how I have a bunch of parameters with the event? These parameters are important since they will allow you to conduct a more thorough and robust analysis later on and answer more questions. Like: hey, what is the most difficult level? Do people still have trouble on it when the game difficulty is lower? How many times has this level been rage-quit or lost (for that you'd likely need a level_started event)? What about our paid players, are they having similar troubles on this level as well? How many people have rage-quit the game on this level and never played again? That would likely be easier to answer with SQL at this point, taking the latest value of the level name from level_started, grouped by the user_id. Or you could also have levelName as a UserProperty as well as an EventProperty; then it would be somewhat trivial to answer in the default analytics interface.
Note that you're limited in the number of event parameters you can send per event. The total number of unique parameter names is limited too, as is the number of unique event names you're allowed to have. In our case, the event name would be level_completed. See the limits here.
Because of those limitations, it's important to name your event properties in a somewhat generic way so that you can efficiently reuse them elsewhere. For this reason, I named the parameter minutes and not something like minutes_spent_on_the_level. You could then reuse this property to send the minutes the player spent actively playing, minutes the player spent idling, minutes the player spent on any info page, minutes they spent choosing their upgrades, etc. Same idea behind having a name property rather than level_name. It could just as well be id.
You need to carefully and thoughtfully stuff your event with event properties. I normally have a wrapper around the Firebase SDK, in which I enrich events with dimensions that I always want to be there, like the user_id or subscription_status, so I don't have to add them manually every time I send an event. I also usually have some more adequate logging there, since Firebase Analytics' default logging is completely awful. I also have some sanitizing there: lowercasing all values unless I'm passing something case-sensitive like base64 values, making sure I don't have double spaces (so replacing \s+ with " " (a single space)), maybe also adding the user's local timestamp as another parameter. The latter is very helpful for spotting time-cheating users, especially if your game is an idler.
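The wrapper idea is language-agnostic, so here is a tiny sketch of just the sanitizing/enrichment part (Python for brevity, all names made up; the actual logEvent call would go through your engine's Firebase SDK):

import re, time

DEFAULT_PARAMS = {"user_id": "u-123", "subscription_status": "free"}   # always-attached dimensions (example values)

def sanitize(value):
    if not isinstance(value, str):
        return value
    return re.sub(r"\s+", " ", value).strip().lower()   # collapse whitespace, lowercase

def log_event(name, params, case_sensitive_keys=()):
    event = dict(DEFAULT_PARAMS)
    for key, value in params.items():
        event[key] = value if key in case_sensitive_keys else sanitize(value)
    event["client_ts"] = int(time.time())   # user's local timestamp, helps spot time cheating
    # forward `name` and `event` to the real Firebase SDK call here
    print(name, event)

log_event("level_completed", {"name": "Dark  Forest", "minutes": 7, "score": 1234})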
Good. We're halfway there :) Bear with me.
Now you need to go to Firebase and register your eps (event parameters) as cds (custom dimensions and metrics). If you don't register your eps, they won't be counted towards the global cd limit (it's about 50 custom dimensions and 50 custom metrics). You register the cds in the Custom Definitions section of Firebase.
Now you need to know whether each one is a dimension or a metric, as well as the scope of your dimension. It's much easier than it sounds. The rule of thumb is: if you want to be able to run mathematical aggregation functions on the value, it's a metric. Otherwise, it's a dimension. So:
firebaseAnalytics.setUserProperty("user_id", userId) <-- dimension
param("name", levelName) <-- dimension
param("difficulty", difficulty) <-- dimension (or can be a metric, depends)
param("subscription_status", subscriptionStatus) <-- dimension (can be a metric too, but even less likely)
param("minutes", minutesSpentOnLevel) <-- metric
param("score", score) <-- metric
Now another important thing to understand is the scope. Because Firebase and GA4 are still essentially in beta and being actively worked on, you only have user or hit scope for the dimensions and only hit scope for the metrics. The scope basically just indicates how the value persists. In my example, we only need the user_id as a user-scoped cd. Because user_id is a user-level dimension, it is set separately from the logEvent function. Although I suspect you can do it there too. Haven't tried it, though.
Now, we're almost there.
Finally, you don't want to use Firebase to look at your data. It's horrible at data presentation. It's good at debugging, though, since that's what it was intended for initially. Because of how horrible it is, it's always advised to link it to GA4. GA4 will allow you to look at the Firebase values much more efficiently. Note that you will likely need to re-register your custom dimensions from Firebase in GA4, because GA4 is capable of receiving multiple data streams, of which Firebase would be just one data source. But GA4's cd limits are very close to Firebase's. OK, let's be frank: GA4's data model is almost exactly copied from Firebase's. But GA4 has much better analytics capabilities.
Good, you've moved to GA4. Now, GA4, like Firebase Analytics, is a very raw, not-officially-beta product. Because of that, it's advised to first change your data retention to 12 months and only use the explorer for analysis, pretty much ignoring the pre-generated reports. They are just not very reliable at this point.
Finally, you may find it easier to just use SQL to get your analysis done. For that, you can easily copy your data from GA4 to a sandbox instance of BQ. It's very easy to do, and this is the best, most reliable known method of using GA4 at this moment. I mean, advanced analysts do the export into BQ, then ETL the data from BQ into a proper storage like Snowflake or even S3, or Aurora, or whatever you prefer, and then on top of that use a proper BI tool like Looker, Power BI, Tableau, etc. A lot of people just stay in BQ though, which is fine. Lots of BI tools have BQ connectors; it's just that BQ gets expensive quickly if you do a lot of analysis.
Whew, I hope you'll enjoy analyzing your game's data. Data-driven decisions rock in games. Well... They rock everywhere, to be honest.

kdb - customized data streaming/ticker plant?

We've been using kdb to handle a number of calculations focused more on traditional desktop sources. We have deployed our web application and are looking to make the leap as to how best to pick up data changes and re-calculate them in kdb to render a "real-time" view of the data as it changes.
From what I've been reading, the use of data loaders (feed handlers) feeding our own equivalent of a "ticker plant" as a data store is the most documented, ideal solution. So far we have been "pushing" data into kdb directly and calculating as part of a script, so we are trying to make the leap from calculation-on-demand to a "live" calculation as data inputs are edited by the user.
I'm trying to understand how to manage the feed handlers and the timing of updates. We really only want to move data when it changes (web front end, so we're trying to figure out how best to "trigger" when things change, such as save or lost focus on an editable data grid, for example). We are also thinking of our database as the "ticker plant" itself, which may minimize feed handlers.
I found the reference below and it looks like it's running a forever loop, which feels excessive, but I understand the original use case for kdb and streaming data.
Feedhandler - sending data to tickerplant
Does this sound like a solid workflow?
Many thanks in advance!
Resources we've been referencing:
Official manual: https://github.com/KxSystems/kdb/blob/master/d/tick.htm
kdb+ Tick overview: http://www.timestored.com/kdb-guides/kdb-tick-data-store
Source code: https://github.com/KxSystems/kdb-tick
There's a lot to parse here but some general thoughts/ideas:
Yes, most examples of feedhandlers are set up as forever loops, but this is often just for convenience when demoing.
Ideally a live data flow should work based on event handling, aka on-event triggers. Kdb/q has this out of the box in the form of the .z handlers. Other languages should have similar concepts of event handling.
Some more examples of Python/Java feeders are here: https://github.com/exxeleron (see the sketch after these notes for a minimal Python example).
There's also some details on the official Kx site: https://code.kx.com/q/wp/capi/#publishing-to-a-kdb-tickerplant
It still might be a viable option to have a forever loop, or at least a short timer in the event you want to batch data.
Depending on the amount of dataflow, a tickerplant might be overkill for your use case, but a tickerplant is still useful for (a) separating your processing from the processing of dataflow (i.e. data can still flow through the tickerplant while another process is consuming/calculating) and (b) logging data for recovery purposes.
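To make the on-event feeding concrete, here is a minimal sketch using the qPython library (an assumed setup: a standard kdb+tick tickerplant with .u.upd listening on port 5010 and a trade table with sym/price/size columns; adapt the names and types to your own schema):

import numpy as np
from qpython import qconnection
from qpython.qcollection import qlist
from qpython.qtype import QSYMBOL_LIST, QDOUBLE_LIST, QLONG_LIST

q = qconnection.QConnection(host="localhost", port=5010)
q.open()

def on_user_edit(sym, price, size):
    # called from the web front end's save / lost-focus handler,
    # i.e. only when something actually changed
    q.sendSync(".u.upd", np.string_("trade"), [
        qlist(np.array([sym], dtype=np.string_), qtype=QSYMBOL_LIST),
        qlist(np.array([price]), qtype=QDOUBLE_LIST),
        qlist(np.array([size]), qtype=QLONG_LIST),
    ])

on_user_edit("ABC", 101.5, 200)
q.close()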

How to "join" a frequently updating stream with an irregularly updating stream in Apache Beam?

I have a stream of measurements keyed by an ID PCollection<KV<ID,Measurement>> and something like a changelog stream of additional information for that ID PCollection<KV<ID,SomeIDInfo>>. New data is added to the measurement stream quite regularly, say once per second for every ID. The stream with additional information, on the other hand, is only updated when a user performs manual re-configuration. We can't tell how often this happens and, in particular, the update frequency may vary among IDs.
My goal is now to enrich each entry in the measurements stream by the additional information for its ID. That is, the output should be something like PCollection<KV<ID,Pair<Measurement,SomeIDInfo>>>. Or, in other words, I would like to do a left join of the measurements stream with the additional information stream.
I would expect this to be a quite common use case. Coming from Kafka Streams, this can be implemented quite easily with a KStream-KTable join. With Beam, however, none of my approaches so far seem to work. I have already thought about the following ideas.
Idea 1: CoGroupByKey with fixed time windows
Applying a window to the measurements stream would not be an issue. However, as the additional information stream updates irregularly and also significantly less frequently than the measurements stream, there is no reasonable common window size such that there is at least one information update for each ID.
Idea 2: CoGroupByKey with global window and a non-default trigger
Refining the previous idea, I thought about using a processing-time trigger, which fires e.g. every 5 seconds. The issue with this idea is that I need to use accumulatingFiredPanes() for the additional information, as there might be no new data for a key between two firings, but I have to use discardingFiredPanes() for the measurements stream as otherwise my panes would quickly become too large. This simply does not work: when I configure my pipeline that way, the additional information stream also discards changes. Setting both triggers to accumulating works, but, as I said, this is not scalable.
Idea 3: Side inputs
Another idea would be to use side inputs, but this solution is also not really scalable, at least unless I'm missing something. With side inputs, I would create a PCollectionView from the additional information stream, which is a map of IDs to the (latest) additional information. The "join" can then be done in a DoFn with a side input of that view. However, the view seems to be shared by all instances that consume the side input. (It's a bit hard to find any information regarding this.) We would like to not make any assumptions regarding the number of IDs and the size of the additional info. Thus, using a side input also does not seem to work here.
The side input option you discuss is currently the best option, although you are correct about the scalability concern due to the side input being broadcast to all workers.
Alternatively, you can store the infrequently-updated side in an external key-value store and just do lookups from a DoFn. If you go this route, it's generally useful to do a GroupByKey first on the main input with ID as a key, which lets you cache the lookups with a good cache-hit ratio.
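For reference, a rough sketch of the side-input variant in the Beam Python SDK (assuming the info stream is re-windowed and periodically triggered so the map view keeps refreshing; the 5-second trigger and all names are placeholders, and the whole map is still broadcast to every worker):

import apache_beam as beam
from apache_beam.transforms import window, trigger

def enrich(kv, info_map):
    key, measurement = kv
    # left join: emit the measurement even if no info has arrived yet for this ID
    return key, (measurement, info_map.get(key))

def build(measurements, id_info):
    info_view = beam.pvalue.AsDict(
        id_info
        | "Refresh" >> beam.WindowInto(
            window.GlobalWindows(),
            trigger=trigger.Repeatedly(trigger.AfterProcessingTime(5)),
            accumulation_mode=trigger.AccumulationMode.ACCUMULATING)
        | "LatestPerKey" >> beam.combiners.Latest.PerKey())
    return measurements | "Enrich" >> beam.Map(enrich, info_map=info_view)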

How to store sagas’ data?

From what I read aggregates must only contain properties which are used to protect their invariants.
I also read sagas can be aggregates which makes sense to me.
Now I modeled a registration process using a saga: on a RegistrationStarted event it sends a ReserveEmail command, which will trigger an EmailReserved or EmailReservationFailed event depending on whether the email is free or not. A listener will then either send a validation link or a message telling the user an account already exists.
I would like to use data from the RegistrationStarted event in this listener (say the IP and user-agent). How should I do it?
Storing these data in the saga? But they’re not used to protect invariants.
Pushing them through ReserveEmail command and the resulting event? Sounds tedious.
Project the saga to the read model? What about eventual consistency?
Another way?
Rinat Abdullin wrote a good overview of sagas / process managers.
The usual answer is that the saga has copies of the events that it cares about, and uses the information in those events to compute the command messages to send.
List[Command] processManager(List[Event] events)
Pushing them through ReserveEmail command and the resulting event?
Yes, that's the usual approach; we get a list [RegistrationStarted], and we use that to calculate the result [ReserveEmail]. Later on, we'll get [RegistrationStarted, EmailReserved], and we can use that to compute the next set of commands (if any).
Sounds tedious.
The data has to travel between the two capabilities somehow. So you are either copying the data from one message to another, or you are copying a correlation identifier from one message to another and then allowing the consumer to decide how to use the correlation identifier to fetch a copy of the data.
Storing these data in the saga? But they’re not used to protect invariants.
You are typically going to be storing events in the sagas (to keep track of what has happened). That gives you a copy of the data provided in the event. You don't have an invariant to protect because you are just caching a copy of a decision made somewhere else. You won't usually have the process manager running queries to collect additional data.
What about eventual consistency?
By their nature, sagas are always going to be "eventually consistent"; the "state" of an instance of a saga is just cached copies of data controlled elsewhere. The data is probably nanoseconds old by the time the saga sees it, there's no point in pretending that the data is "now".
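To illustrate the List[Command] processManager(List[Event] events) shape above, here is a minimal sketch (Python, with made-up event/command types for this registration example):

from dataclasses import dataclass

@dataclass
class RegistrationStarted:
    email: str
    ip: str
    user_agent: str

@dataclass
class EmailReserved:
    email: str

@dataclass
class ReserveEmail:
    email: str

@dataclass
class SendValidationLink:
    email: str
    ip: str
    user_agent: str

def process_manager(events):
    # pure function: the saga's "state" is just the events it has seen so far
    commands = []
    started = next((e for e in events if isinstance(e, RegistrationStarted)), None)
    if started is None:
        return commands
    if not any(isinstance(e, EmailReserved) for e in events):
        commands.append(ReserveEmail(started.email))
    else:
        # data copied from RegistrationStarted (ip, user agent) is still available here
        commands.append(SendValidationLink(started.email, started.ip, started.user_agent))
    return commands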
If I understand correctly I could model my saga as a Registration aggregate storing all the events whose correlation identifier is its own identifier?
Udi Dahan, writing about CQRS:
Here’s the strongest indication I can give you to know that you’re doing CQRS correctly: Your aggregate roots are sagas.

ZeroMQ Subscriber filtering on multipart messages

With ZeroMQ PUB/SUB model, is it possible for a subscriber to filter based on contents of more than just the first frame?
For example, if we have a multi-frame message that contains three frames: 1) data type, 2) instrument, and then 3) the actual data, is it possible to subscribe to a specific data type / instrument pair?
(All of the examples I've seen only show filtering based on the first frame of a multipart message.)
Different modes of ZeroMQ PUB/SUB filtering
The initial ZeroMQ model used subscriber-side, subscription-based filtering.
That was easier for the publisher, as it did not need any subscriber-specific handling and simply fed all data to all SUBs (yes, at the cost of wasted network traffic and of the SUB-side workload of processing all incoming BLOBs, all the way up from the PHY media, even for topics it did not subscribe to. Yes, pretty expensive in low-latency designs).
PUB-side filtering was initially only proposed for later implementation;
still, this mode does not allow your idea to fly just by using the plain PUB/SUB Scalable Formal Communication Pattern.
The ZeroMQ protocol design strives to inspire users in how to achieve just-enough designed distributed-system behaviour.
As understood from your 1-2-3 intention, custom logic should be just enough to achieve the desired processing.
So, how to approach the solution?
Do not hesitate to set up several messaging/control relays in parallel in your application domain-specific solution; that works much better for tailor-made solutions and is a lot safer than trying to "bend" a library's primitive archetype (in this case the trivial PUB/SUB topic-based filtering) into something a bit different from what the original use case was designed for.
This is all the more true if your domain-specific use is FOREX/equity trading, where latency is your worst enemy, so a successful approach has to minimise stream decoding and alternative branching as much as possible.
nanoseconds do matter
If one reads into the details of multiframe composition (a sender-side BLOB assembly), there is no latency-wise advantage, as the whole BLOB gets onto the wire only after it has been completed, so there is no advantage for your SUB side if your idea was to handle the initial frame content as early signalling: the whole BLOB arrives "together".
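As one concrete example of such custom logic: ZeroMQ matches subscriptions against a byte prefix of the first frame only, so a common workaround is to encode the pair you care about into that first frame yourself. A pyzmq sketch (endpoints, topic layout and the "|" separator are my own assumptions; the usual slow-joiner caveats of PUB/SUB apply):

import zmq

ctx = zmq.Context.instance()

# publisher: put "<data type>|<instrument>" into frame 1 and the payload into
# frame 2, so plain prefix subscriptions can already select a (type, instrument) pair
pub = ctx.socket(zmq.PUB)
pub.bind("tcp://*:5556")
pub.send_multipart([b"quote|EURUSD", b"...payload bytes..."])

# subscriber: the compound prefix filters on both fields at once;
# subscribing to b"quote|" alone would still deliver every instrument of that type
sub = ctx.socket(zmq.SUB)
sub.connect("tcp://localhost:5556")
sub.setsockopt(zmq.SUBSCRIBE, b"quote|EURUSD")
topic, payload = sub.recv_multipart()

If the data type and the instrument really must stay in separate frames, the remaining option is what the answer above describes: subscribe to the coarser first-frame topic, receive everything for it, and branch on frame 2 in your own code (or split the traffic across several parallel PUB/SUB relays).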