What is the default watermark in fixed windowing?

I am reading the article the-world-beyond-batch-streaming-102 by Tyler Akidau. I am still a bit confused about watermarks, specifically about this code from the article:
PCollection<KV<String, Integer>> scores = input
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)))
                 .triggering(AtWatermark()))
    .apply(Sum.integersPerKey());
It simply tells the engine to trigger at the watermark, but how does the engine know the watermark? As I understand it, the watermark should be some kind of time delay that the user has to specify. Or is the engine smart enough to derive one for the user (according to some default strategy or configuration)?
Thanks very much.

Google Dataflow (which is what Tyler Akidau is describing in the article you cite) can use a heuristic to estimate watermarks -- see this answer for more details.
Flink, on the other hand, depends on explicit watermarks which are either emitted by the data source or by a watermark generator. The most common approach is to assume a bounded delay.
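To make the bounded-delay idea concrete, here is a minimal sketch in plain Java. It is a toy model, not the Flink or Beam API: the class name `BoundedDelayWatermark` and its methods are made up for illustration, and it assumes timestamps in milliseconds and a user-chosen maximum delay.

```java
// Toy model of a bounded-delay watermark. Hypothetical class, not a Flink or Beam API.
final class BoundedDelayWatermark {
    private final long maxDelayMillis;               // assumed upper bound on event lateness
    private long maxTimestampSeen = Long.MIN_VALUE;  // highest event time observed so far

    BoundedDelayWatermark(long maxDelayMillis) {
        if (maxDelayMillis < 0) {
            throw new IllegalArgumentException("maxDelayMillis must be >= 0");
        }
        this.maxDelayMillis = maxDelayMillis;
    }

    // Called for every incoming event; out-of-order events never move the maximum back.
    void onEvent(long eventTimestampMillis) {
        maxTimestampSeen = Math.max(maxTimestampSeen, eventTimestampMillis);
    }

    // The watermark trails the highest timestamp seen by the configured delay.
    long currentWatermark() {
        return maxTimestampSeen == Long.MIN_VALUE
                ? Long.MIN_VALUE
                : maxTimestampSeen - maxDelayMillis;
    }

    // Simplification: a fixed window [start, start + size) is treated as complete
    // once the watermark has reached the end of the window.
    boolean windowIsComplete(long windowStartMillis, long windowSizeMillis) {
        return currentWatermark() >= windowStartMillis + windowSizeMillis;
    }
}
```

With a 5-second delay, a 2-minute window starting at 0 would only be considered complete in this model after an event with a timestamp of at least 125,000 ms has been seen.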

Related

kdb - customized data streaming/ticker plant?

We've been using kdb to handle a number of calculations focused more on traditional desktop sources. We have deployed our web application and are looking to make the leap as to how best to pick up data changes and re-calculate them in kdb to render a "real-time" view of the data as it changes.
From what I've been reading, using data loaders (feed handlers) to feed our own equivalent of a "ticker plant" as a data store is the most commonly documented solution. So far, we have been "pushing" data into kdb directly and calculating as part of a script, so we are trying to make the leap from calculation-on-demand to a "live" calculation as data inputs are edited by the user.
I'm trying to understand how to manage the feed handlers and the timing of updates. We really only want to move data when it changes. With a web front end, we are trying to figure out how best to "trigger" when things change (such as on save or lost focus on an editable data grid). We are also thinking of using our database as the "ticker plant" itself, which may minimize the number of feed handlers.
I found the reference below, and it looks like it's running a forever loop, which feels excessive, but I understand that this matches the original use case for kdb and streaming data.
Feedhandler - sending data to tickerplant
Does this sound like a solid workflow?
Many thanks in advance!
Resources we've been referencing:
Official Manual: https://github.com/KxSystems/kdb/blob/master/d/tick.htm
kdb+ Tick overview: http://www.timestored.com/kdb-guides/kdb-tick-data-store
Source code: https://github.com/KxSystems/kdb-tick
There's a lot to parse here but some general thoughts/ideas:
Yes, most examples of feedhandlers are set up as forever loops but this is often just for convenience for demoing.
Ideally a live data flow should be based on event handling, i.e. on-event triggers. Kdb/q has this out of the box in the form of the .z handlers. Other languages should have similar event-handling concepts.
Some more examples of python/java feeders are here: https://github.com/exxeleron
There's also some details on the official Kx site: https://code.kx.com/q/wp/capi/#publishing-to-a-kdb-tickerplant
It still might be a viable option to have a forever loop, or at least a short timer in the event you want to batch data.
Depending on the amount of dataflow a tickerplant might be overkill for your use-case, but a tickerplant is still useful for (a) separating your processing from the processing of dataflow (i.e. data can still flow through the tickerplant while another process is consuming/calculating) and (b) logging data for recovery purposes.
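As a rough illustration of combining event-driven updates with a short batching timer, here is a sketch in plain Java. Everything in it is hypothetical: `ChangeBatcher`, `onChange` and `drain` are illustrative names, and the actual publish to kdb (through whichever client library you use) is deliberately left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: collects edits as they happen and hands them over in batches.
final class ChangeBatcher {
    // Latest value per edited field; a later edit to the same field overwrites the earlier one.
    private final Map<String, String> pending = new LinkedHashMap<>();

    // Call this from an event handler, e.g. on save or lost focus in the data grid.
    synchronized void onChange(String field, String value) {
        pending.put(field, value);
    }

    // Call this from a short timer; returns the accumulated changes and resets the buffer.
    // The caller would then publish the returned batch to the tickerplant.
    synchronized Map<String, String> drain() {
        Map<String, String> batch = new LinkedHashMap<>(pending);
        pending.clear();
        return batch;
    }
}
```

The point of the design is that nothing is sent while nothing changes: the timer only finds an empty batch, so there is no need for a busy forever loop.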

Kafka Window Stores clarification

In my application, I defined a global state store (backed by a topic "query-topic") in order to perform specific time-based operations such as "give me all events in the query-topic from yesterday 8PM until today 5AM" or "give me all events in the query-topic for the last 3 days". I created the store using a window store builder, as it seemed more efficient for time-ranged queries than a simple key-value store:
Stores.windowStoreBuilder(
    Stores.persistentWindowStore(name, retentionPeriod, windowSize, retainDuplicates),
    Serdes.String(),
    valueSerde);
Nevertheless, the explanations of exactly how these window stores work are quite thin. I couldn't find any relevant resources in the official Kafka documentation and therefore had to rely on the Javadocs, which are not really explicit either. Moreover, I saw that another implementation called persistentTimestampedWindowStore also exists, which is a bit confusing to me because I thought that the WindowStore was already relying on the Kafka event timestamp for the keys.
Could someone explain, or point me to resources showing, how such window stores work? I can see that we can specify a retention period and a window size, but how are these windows created? When a new record is received, do the windows move according to this new value, or are they based on the current time? How do range queries work when they span several windows? I'm a bit lost.
If you haven't done so yet, I recommend starting by reviewing the time and windowing concepts in the official Kafka documentation, as it goes straight to the point and has some nice graphics to illustrate the concepts:
Windowing
Time
From there you might move on to the Confluent Developer series, which has a lot of nice free resources such as videos, code samples, and articles. The official training is paid and expensive, but their free material and code samples are extensive.
Confluent Developer
They also have some free ebooks.

UE5: import csv for a data driven animation

I was wondering if UE5 can support 50k+ lines of a db/CSV, as they represent the parameters of the whole animation (coordinates [x, y, z], TimeDelta, Speed, Brake).
Any documentation is very much appreciated.
There is no existing functionality in the engine itself for this extremely specific use case. Of course, it can "support" it if you write a custom solution using the many available tools within the engine.
You can use IFileHandle to stream in a file (your csv): link
You can then parse the incoming data to create an FVector of your coordinates, a float of your TimeDelta, etc. For example, FVector::InitFromString may help: link
However, this depends very much on the format of your data. Parsing strings/text into values is not specific to UE4; you can find a lot of information on converting streams of binary/character data into the values you need.
Applying the animation as the data is read is a separate, quite big, task. Since you provide no details on what the animation data represents, or what you need to apply it to, I cannot really help.
In general though, it can help you a lot to break down your question into 3-4 separate, more specific, questions. In any case though, this is a task that will require a lot of research and work.
And even before that, it might be good to research alternative approaches and changing the pipeline, to avoid using such non-standard file structures for animation.

How to "join" a frequently updating stream with an irregularly updating stream in Apache Beam?

I have a stream of measurements keyed by an ID, PCollection<KV<ID,Measurement>>, and something like a changelog stream of additional information for that ID, PCollection<KV<ID,SomeIDInfo>>. New data is added to the measurement stream quite regularly, say once per second for every ID. The stream with additional information, on the other hand, is only updated when a user performs manual re-configuration. We can't tell how often this happens and, in particular, the update frequency may vary among IDs.
My goal is now to enrich each entry in the measurements stream by the additional information for its ID. That is, the output should be something like PCollection<KV<ID,Pair<Measurement,SomeIDInfo>>>. Or, in other words, I would like to do a left join of the measurements stream with the additional information stream.
I would expect this to be a quite common use case. Coming from Kafka Streams, this can be quite easily implemented with a KStream-KTable-Join. With Beam, however, all my approaches so far seem not to work. I already thought about the following ideas.
Idea 1: CoGroupByKey with fixed time windows
Applying a window to the measurements stream would not be an issue. However, as the additional information stream is updating irregularly and also significantly less frequently than the measurements stream, there is no reasonable common window size such that there is at least one updated information for each ID.
Idea 2: CoGroupByKey with global window and a non-default trigger
Refining the previous idea, I thought about using a processing-time trigger, which fires e.g. every 5 seconds. The issue with this idea is that I need to use accumulatingFiredPanes() for the additional information, as there might be no new data for a key between two firings, but I have to use discardingFiredPanes() for the measurements stream, as otherwise my panes would quickly become too large. This simply does not work. When I configure my pipeline that way, the additional information stream also discards changes. Setting both triggers to accumulating works, but, as I said, this is not scalable.
Idea 3: Side inputs
Another idea would be to use side inputs, but this solution is also not really scalable, unless I'm missing something. With side inputs, I would create a PCollectionView from the additional information stream, which is a map of IDs to the (latest) additional information. The "join" can then be done in a DoFn with a side input of that view. However, the view seems to be shared by all instances that perform the side input. (It's a bit hard to find any information regarding this.) We would like to not make any assumptions regarding the number of IDs and the size of the additional info. Thus, using a side input also seems not to work here.
The side input option you discuss is currently the best option, although you are correct about the scalability concern due to the side input being broadcast to all workers.
Alternatively, you can store the infrequently-updated side in an external key-value store and just do lookups from a DoFn. If you go this route, it's generally useful to do a GroupByKey first on the main input with ID as a key, which lets you cache the lookups with a good cache-hit ratio.
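The external-lookup approach with caching could be sketched roughly as follows. This is plain Java rather than Beam code: `CachingEnricher` is a hypothetical helper whose logic would sit inside a DoFn, and the `externalLookup` function stands in for whatever client your key-value store provides. Note that this sketch never invalidates entries, so a real version would need a TTL or similar to pick up re-configurations.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper: enriches IDs with additional info, caching lookups per instance.
final class CachingEnricher<K, I> {
    private final Function<K, I> externalLookup;   // assumed client call to the external store
    private final Map<K, I> cache = new HashMap<>();
    private int lookups = 0;                        // counts real round trips, for illustration

    CachingEnricher(Function<K, I> externalLookup) {
        this.externalLookup = externalLookup;
    }

    // Returns the additional info for an ID, hitting the external store only on a cache miss.
    I infoFor(K id) {
        I cached = cache.get(id);
        if (cached != null) {
            return cached;
        }
        lookups++;
        I info = externalLookup.apply(id);
        if (info != null) {
            cache.put(id, info);   // no invalidation here; add a TTL for changing data
        }
        return info;
    }

    int lookupCount() {
        return lookups;
    }
}
```

Grouping the main input by ID first means that all measurements for one ID reach the same instance, which is what makes the per-instance cache hit so often.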

EventStore basics - what's the difference between Event Meta Data/MetaData and Event Data?

I'm very much at the beginning of using / understanding EventStore or get-event-store as it may be known here.
I've consumed the documentation regarding clients, projections and subscriptions and feel ready to start using on some internal projects.
One thing I can't quite get past: is there a guide or set of recommendations describing the difference between event metadata and data? I'm aware of the notional differences (event data is 'core' to the domain, metadata is for describing), but it is becoming quite philosophical.
I wonder if there are hard rules regarding implementation (querying etc).
Any guidance at all gratefully received!
Shamelessly copying (and paraphrasing) parts from Szymon Kulec's blog post "Enriching your events with important metadata" (emphases mine):
But what information can be useful to store in the metadata, which info is worth to store despite the fact that it was not captured in the creation of the model?
1. Audit data
who? – simply store the user id of the action invoker
when? – the timestamp of the action and the event(s)
why? – the serialized intent/action of the actor
2. Event versioning
The event sourcing deals with the effect of the actions. An action
executed on a state results in an action according to the current
implementation. Wait. The current implementation? Yes, the
implementation of your aggregate can change and it will either because
of bug fixing or introducing new features. Wouldn’t it be nice if
the version, like a commit id (SHA1 for gitters) or a semantic version
could be stored with the event as well? Imagine that you published a
broken version and your business sold 100 tickets before fixing a bug.
It’d be nice to be able which events were created on the basis of the
broken implementation. Having this knowledge you can easily compensate
transactions performed by the broken implementation.
3. Document implementation details
It’s quite common to introduce canary releases, feature toggling and
A/B tests for users. With automated deployment and small code
enhancement all of the mentioned approaches are feasible to have on a
project board. If you consider the toggles or different implementation
coexisting in the very same moment, storing the version only may be
not enough. How about adding information which features were applied
for the action? Just create a simple set of features enabled, or map
feature-status and add it to the event as well. Having this and the
command, it’s easy to repeat the process. Additionally, it’s easy to
result in your A/B experiments. Just run the scan for events with A
enabled and another for the B ones.
4. Optimized combination of 2. and 3.
If you think that this is too much, create a lookup for sets of
versions x features. It’s not that big and is repeatable across many
users, hence you can easily optimize storing the set elsewhere, under
a reference key. You can serialize this map and calculate SHA1, put
the values in a map (a table will do as well) and use identifiers to
put them in the event. There’s plenty of options to shift the load
either to the query (lookups) or to the storage (store everything as
named metadata).
Summing up
If you create an event sourced architecture, consider adding the
temporal dimension (version) and a bit of configuration to the
metadata. Once you have it, it’s much easier to reason about the
sources of your events and introduce tooling like compensation.
There’s no such thing like too much data, is there?
I will share my experiences with you, which may help. I have been playing with akka-persistence, akka-persistence-eventstore and eventstore. akka-persistence stores its event wrapper, a PersistentRepr, in binary format. I wanted this data in JSON so that I could:
use projections
make these events easily available to any other technologies
You can implement your own serialization for akka-persistence-eventstore to do this, but it still ended up just storing the wrapper, which had my event embedded in a payload attribute. The other attributes were all akka-persistence specific. The author of akka-persistence-eventstore gave me some good advice: get the serializer to store the payload as the Data, and the rest as MetaData. That way my event is now just the business data, and the metadata aids the technology that put it there in the first place. My projections now don't need to parse out the metadata to get at the payload.
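The split described above can be sketched in a few lines of plain Java. This is only an illustration, not the akka-persistence-eventstore serializer API: `EventEnvelope` is a hypothetical class, and the wrapper is assumed to be a flat map in which one attribute (here passed as `payloadKey`) holds the business event.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical envelope: business payload goes to data, everything else to metadata.
final class EventEnvelope {
    final Object data;                    // the domain event itself
    final Map<String, Object> metadata;   // technical attributes of the wrapper

    private EventEnvelope(Object data, Map<String, Object> metadata) {
        this.data = data;
        this.metadata = metadata;
    }

    // Splits a wrapper into payload and remaining attributes without modifying the wrapper.
    static EventEnvelope split(Map<String, Object> wrapper, String payloadKey) {
        if (!wrapper.containsKey(payloadKey)) {
            throw new IllegalArgumentException("wrapper has no '" + payloadKey + "' attribute");
        }
        Map<String, Object> metadata = new LinkedHashMap<>(wrapper);
        Object data = metadata.remove(payloadKey);
        return new EventEnvelope(data, metadata);
    }
}
```

A serializer built along these lines would write `data` as the event Data and `metadata` as the event MetaData, so projections can read the payload directly.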