Why can't Flink support RichFunction on Reduce/Fold/Aggregate? - streaming

I searched for RichAggregateFunction in the GitHub repository and only found the following:
.aggregate() does not support [[RichAggregateFunction]], since the reduce function is used internally in a [[org.apache.flink.api.common.state.AggregatingState]].
Does that mean Flink can't merge elements' state in a group window?

Depending on what you mean by "merge" here, you would generally do that work in the ProcessWindowFunction when you call stream.aggregate. The ProcessWindowFunction would be your second parameter, and it receives the aggregated result, on which you can perform additional operations.
If you need to combine the aggregated elements in some other way, you can take the stream that comes out of the aggregate and apply additional operations to it (such as a ProcessFunction).
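For reference, here is a minimal sketch of the aggregate-plus-ProcessWindowFunction pattern in the Scala DataStream API. The Event type, the CountAgg/EmitWithKey names, and the one-minute window are made up for illustration, not taken from the question.

import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

case class Event(key: String, value: Long) // hypothetical event type

// Plain (non-rich) AggregateFunction: incrementally counts events per window.
class CountAgg extends AggregateFunction[Event, Long, Long] {
  override def createAccumulator(): Long = 0L
  override def add(e: Event, acc: Long): Long = acc + 1
  override def getResult(acc: Long): Long = acc
  override def merge(a: Long, b: Long): Long = a + b
}

// The ProcessWindowFunction is the second argument to aggregate(); it receives
// the pre-aggregated value and has access to the rich runtime context,
// window metadata, and per-window state.
class EmitWithKey extends ProcessWindowFunction[Long, (String, Long), String, TimeWindow] {
  override def process(key: String, context: Context,
                       elements: Iterable[Long], out: Collector[(String, Long)]): Unit =
    out.collect((key, elements.head)) // elements holds the single aggregated value
}

val env = StreamExecutionEnvironment.getExecutionEnvironment
val events: DataStream[Event] = env.fromElements(Event("a", 1L), Event("a", 2L), Event("b", 3L))

val counts: DataStream[(String, Long)] = events
  .keyBy(_.key)
  .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
  .aggregate(new CountAgg, new EmitWithKey)

The AggregateFunction itself stays non-rich (it is backed by AggregatingState internally), while anything that needs runtime context or extra per-window work goes into the ProcessWindowFunction.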


Dynamic "Fan-In" for artifact outputs in Argo?

I have an Argo workflow with dynamic fan-out tasks that do some map operation (in the Map-Reduce sense). I want to create a reducer that aggregates their results. This is possible when the outputs of each mapper are small and can be passed as an output parameter; see this SO question and answer for a description of how to do it.
But how can I aggregate output artifacts with Argo without writing custom logic to write them to some storage in each mapper and read them back in the reducer?
Artifacts are more difficult to aggregate than parameters.
Parameters are always text and are generally small. This makes it easy for Argo Workflows to aggregate them into a single JSON object which can then be consumed by a "reduce" step.
Artifacts, on the other hand, may be any type or size. So Argo Workflows is limited in how much it can help with aggregation.
The main relevant feature it provides is declarative repository write/read operations. You can specify, for example, an S3 prefix to write each mapper's output to. Then, in the reduce step, you can load everything from that prefix and perform your aggregation logic.
Argo Workflows provides a generic map/reduce example. But besides artifact writing/reading, you pretty much have to do the aggregation logic yourself.

DDD, Event Sourcing, and the shape of the Aggregate state

I'm having a hard time understanding the shape of the state that's derived by applying an entity's events vs. a projection of that entity's data.
Is an Aggregate's state ONLY used for determining whether or not a command can successfully be applied? Or should that state be usable in other ways?
An example: I have a Post entity for a standard blog post. I might have events like postCreated, postPublished, postUnpublished, etc. For my projections that I'll be persisting in my read tables, I need a projection for the base posts (which will include all posts, regardless of status, with lots of detail) as well as a published_posts projection (which will only represent posts that are currently published, with only the information necessary for rendering).
In the situation above, is my aggregate state ONLY supposed to be used to determine, for example, if a post can be published or unpublished, etc? If this is the case, is the shape of my state within the aggregate purely defined by what's required for these validations? For example, in my base post projection, I want to have a list of all users that have made a change to the post. In terms of validation for the aggregate/commands, I couldn't care less about the list of users that have made changes. Does that mean that this list should not be a part of my state within my aggregate?
TL;DR: yes - limit the "state" in the aggregate to that data that you choose to cache in support of data change.
In my aggregates, I distinguish two different ideas:
the history, aka the sequence of events that describes the changes in the lifetime of the aggregate
the cache, aka the data values we tuck away because querying the event history every time kind of sucks.
There's not a lot of value in caching results that we are never going to use.
One of the underlying lessons of CQRS is that we don't need aggregates everywhere:
An AGGREGATE is a cluster of associated objects that we treat as a unit for the purpose of data changes. -- Evans, 2003
If we aren't changing the data, then we can safely work directly with immutable copies of the data.
The only essential purpose of the aggregate is to determine what events, if any, need to be applied to bring the aggregate's state in line with a command (if the aggregate can be brought so in line). All state that's not needed for that purpose can be offloaded to a read-side, which can be thought of as a remix of the event stream (with each read-side only maintaining the state it needs).
That said, there are, in practice, reasons to use the aggregate state directly, the primary one being a desire for stronger consistency for the aggregate: CQRS is inherently eventually consistent. As with all questions of consistent updates, it's important to recognize that consistency isn't free and very often isn't even cheap; I tend to think of a project as having a consistency budget, and I'm pretty miserly about spending it.
In your case, there's probably no reason to include the list of users changing a post in the aggregate state, unless e.g. there's something like "no single user can modify a given post more than n times".
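As an illustrative sketch (the event names, state shape, and command handler below are invented, not taken from any particular framework), the aggregate's cached state can be as small as the flags the command handler actually checks:

// Hypothetical event and state types; the cached state carries only what
// command handling needs (existence and publication status), not titles,
// bodies, or the list of users who edited the post.
sealed trait PostEvent
case object PostCreated extends PostEvent
case object PostPublished extends PostEvent
case object PostUnpublished extends PostEvent

final case class PostState(created: Boolean = false, published: Boolean = false) {
  // Fold one event into the cached state.
  def update(event: PostEvent): PostState = event match {
    case PostCreated     => copy(created = true)
    case PostPublished   => copy(published = true)
    case PostUnpublished => copy(published = false)
  }
}

// Command handler: decide which events, if any, follow from a "publish" command.
def publish(state: PostState): Either[String, List[PostEvent]] =
  if (!state.created) Left("post does not exist")
  else if (state.published) Left("post is already published")
  else Right(List(PostPublished))

// Rehydrate the state from history, then handle the command.
val state = List[PostEvent](PostCreated).foldLeft(PostState())(_.update(_))
val outcome = publish(state) // Right(List(PostPublished))

Everything else (the full post body, the list of editors, and so on) lives in the read-side projections built from the same events.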

Reusing a PCollection from the output of a Transform in another Transform at a later stage of the pipeline

In Java or any other programming language, we can save the state of a variable and refer to its value later as needed. This does not seem to be possible with Apache Beam; can someone confirm? If it is possible, please point me to some samples or documentation.
I am trying to solve the problem below, which needs the output of a previous transform as context.
I am new to Apache Beam, so I am finding it hard to understand how to solve this.
Approach#1:
PCollection config = p.apply(ReadDBConfigFn(options.getPath()));
PCollection<Record> records = config.apply(FetchRecordsFn());
PCollection<Users> users = config.apply(FetchUsersFn());
// Now Process using both 'records' and 'users', How can this be done with beam?
Approach#2:
PCollection config = p.apply(ReadDBConfigFn(options.getPath()));
PCollection<Record> records = config.apply(FetchRecordsFn()).apply(FetchUsersAndProcessRecordsFn());
// The 'FetchUsersAndProcessRecordsFn' above needs 'config' so it can fetch Users, but there seems to be no way to pass it in?
If I understand correctly, you want to use elements from the two collections records and users in a processing step? There are two commonly used patterns in Beam to accomplish this:
If you are looking to join the two collections, you probably want to use a CoGroupByKey to group related records and users together for processing.
If one of the collections (records or users) is "small" and the entire set needs to be available during processing, you might want to send it as a side input to your processing step.
It isn't clear what might be in the PCollection config in your example, so I may have misinterpreted... Does this meet your use case?
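For the side-input pattern, here is a rough sketch using Scio (Beam's Scala API). The Record and User types, the process function, and the assumption that users fits in worker memory are all hypothetical; records and users stand for the collections from the question.

// Assumes `records: SCollection[Record]` and `users: SCollection[User]`
// already exist in a Scio pipeline, and that `users` is small.
val usersSide = users.asListSideInput

val processed = records
  .withSideInputs(usersSide)
  .map { (record, ctx) =>
    val allUsers: Seq[User] = ctx(usersSide) // the full users collection, visible to every element
    process(record, allUsers)                // hypothetical per-record processing
  }
  .toSCollection

For the join case, Scio's keyed join/cogroup transforms cover what CoGroupByKey does in the Java SDK.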

Parallel design of program working with Flink and scala

This is the context:
There is an input event stream.
There are several methods to apply to the stream, each applying different logic to evaluate an event and mark it as a "good" or "bad" event.
An event is a real "good" one only if it passes all the methods; otherwise it is a "bad" event.
There is an output event stream that carries each event's result and its eventID.
To solve this problem, I have two ideas:
We can apply each method sequentially to each event. But this is a kind of batch processing and doesn't take advantage of stream processing; at the same time it takes Time(M1) + Time(M2) + Time(M3) + ..., which may not be suitable for real-time processing.
We can pass the input stream to each method and run the methods in parallel; each method saves the bad events into permanent storage, and the Main method then queries the permanent storage to get the result for each event. But this has some problems to solve:
how to execute the methods in parallel in the programming language (e.g. Scala), and what the cost is in performance (network, CPU, memory)
how to solve the synchronization problem: the methods need some time to calculate and save their flags into permanent storage, while the Main method queries the flags much sooner, so a delay issue occurs
etc.
This is a kind of open tech-and-design question: I would like to hear your ideas for solving the problem. Looking forward to your opinions.
Parallel streams, each doing the full set of evaluations sequentially, is the more straightforward solution. But if that introduces too much latency, then you can fan out the evaluations to be done in parallel, and then bring the results back together again to make a decision.
To do the fan-out, look at the split operation on DataStream, or use side outputs. But before doing this n-way fan-out, make sure that each event has a unique ID. If necessary, add a field containing a random number to each event to use as the unique ID. Later we will use this unique ID as a key to gather back together all of the partial results for each event.
Once the event stream is split, each copy of the stream can use a MapFunction to compute one of the evaluation methods.
Gathering all of these separate evaluations of a given event back together is a bit more complex. One reasonable approach here is to union all of the result streams together, and then key the unioned stream by the unique ID described above. This will bring together all of the individual results for each event. Then you can use a RichFlatMapFunction (using Flink's keyed, managed state) to gather the results for the separate evaluations in one place. Once the full set of evaluations for a given event has arrived at this stateful flatmap operator, it can compute and emit the final result.
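A condensed sketch of that fan-out/fan-in shape in Flink's Scala DataStream API. The Event and Evaluation types, the three check functions, and the fixed count of three methods are all made up; the fan-out here simply attaches three map operators to the same stream rather than using split or side outputs.

import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

// Hypothetical types: each event carries a unique ID, each evaluation a verdict.
case class Event(id: String, payload: String)
case class Evaluation(eventId: String, passed: Boolean)

def checkM1(e: Event): Boolean = e.payload.nonEmpty          // placeholder logic
def checkM2(e: Event): Boolean = e.payload.length < 100      // placeholder logic
def checkM3(e: Event): Boolean = !e.payload.contains("bad")  // placeholder logic

val env = StreamExecutionEnvironment.getExecutionEnvironment
val events: DataStream[Event] = env.fromElements(Event("e1", "hello"), Event("e2", "bad data"))

// Fan out: evaluate each method on its own copy of the stream.
val eval1 = events.map(e => Evaluation(e.id, checkM1(e)))
val eval2 = events.map(e => Evaluation(e.id, checkM2(e)))
val eval3 = events.map(e => Evaluation(e.id, checkM3(e)))

// Fan in: union the partial results, key by event ID, and gather them in keyed state.
val numMethods = 3
val verdicts: DataStream[(String, Boolean)] = eval1.union(eval2, eval3)
  .keyBy(_.eventId)
  .flatMap(new RichFlatMapFunction[Evaluation, (String, Boolean)] {
    // (count of evaluations seen so far, running AND of their verdicts)
    private var acc: ValueState[(Int, Boolean)] = _

    override def open(parameters: Configuration): Unit =
      acc = getRuntimeContext.getState(
        new ValueStateDescriptor[(Int, Boolean)]("acc", createTypeInformation[(Int, Boolean)]))

    override def flatMap(in: Evaluation, out: Collector[(String, Boolean)]): Unit = {
      val (seen, allGood) = Option(acc.value()).getOrElse((0, true))
      val updated = (seen + 1, allGood && in.passed)
      if (updated._1 == numMethods) {   // all evaluations arrived: emit the final verdict and clean up
        out.collect((in.eventId, updated._2))
        acc.clear()
      } else {
        acc.update(updated)
      }
    }
  })

verdicts.print()
env.execute("fan-out / fan-in sketch")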

Scio: groupByKey doesn't work when using Pub/Sub as collection source

I changed the source of the WindowsWordCount example program from a text file to Cloud Pub/Sub as shown below. I published the Shakespeare file's data to Pub/Sub, and it is fetched properly, but none of the transformations after .groupByKey seem to work.
sc.pubsubSubscription[String](psSubscription)
  .withFixedWindows(windowSize) // apply windowing logic
  .flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty))
  .countByValue
  .withWindow[IntervalWindow]
  .swap
  .groupByKey
  .map { s =>
    println("\n\n\n\n\n\n\n This never prints \n\n\n\n\n")
    println(s)
  }
Changing the input from a text file to Pub/Sub makes the PCollection "unbounded". Grouping that by key requires you to define aggregation triggers; otherwise the grouper will wait forever. This is mentioned in the Dataflow documentation here:
https://cloud.google.com/dataflow/model/group-by-key
Note: Either non-global Windowing or an aggregation trigger is required in order to perform a GroupByKey on an unbounded PCollection. This is because a bounded GroupByKey must wait for all the data with a certain key to be collected; but with an unbounded collection, the data is unlimited. Windowing and/or Triggers allow grouping to operate on logical, finite bundles of data within the unbounded data stream.
If you apply GroupByKey to an unbounded PCollection without setting either a non-global windowing strategy, a trigger strategy, or both, Dataflow will generate an IllegalStateException error when your pipeline is constructed.
Unfortunately, the Python SDK of Apache Beam does not seem to support triggers (yet), so I'm not sure what the solution would be in Python.
(see https://beam.apache.org/documentation/programming-guide/#triggers)
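For a Scio pipeline (which runs on the JVM SDK, where triggers are available), one way to attach a trigger to the fixed windows is through WindowOptions. A rough sketch, reusing sc, psSubscription, and windowSize from the question; the specific trigger, firing delay, and accumulation mode are only examples.

import com.spotify.scio.values.WindowOptions
import org.apache.beam.sdk.transforms.windowing.{AfterProcessingTime, AfterWatermark}
import org.apache.beam.sdk.values.WindowingStrategy.AccumulationMode
import org.joda.time.Duration

sc.pubsubSubscription[String](psSubscription)
  .withFixedWindows(
    windowSize,
    options = WindowOptions(
      // fire at the watermark, with speculative early firings every 30s of processing time
      trigger = AfterWatermark.pastEndOfWindow().withEarlyFirings(
        AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardSeconds(30))),
      accumulationMode = AccumulationMode.DISCARDING_FIRED_PANES,
      allowedLateness = Duration.ZERO
    )
  )
  .flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty))
  .countByValue
  // ... rest of the pipeline as in the question (.withWindow, .swap, .groupByKey, ...)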
With regard to Franz's comment above (I would reply to his comment directly if StackOverflow let me!), I see that the docs say that triggering is not implemented... but they also say that Realtime Database functions are not available, while our current project is actively using them. They're just new.
See trigger functions here: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/trigger.py
Beware, the API is unfinished as this is not "release-ready" code. But it is available.