Dynamic "Fan-In" for artifact outputs in Argo? - kubernetes

I have an Argo workflow with dynamic fan-out tasks that perform a map operation (in the map-reduce sense). I want to create a reducer that aggregates their results. This is possible when the outputs of each mapper are small and can be passed as output parameters; see this SO question and answer for a description of how to do it.
But how can I aggregate output artifacts with Argo, without writing custom logic in each mapper to write them to some storage and then reading from that storage in the reducer?

Artifacts are more difficult to aggregate than parameters.
Parameters are always text and are generally small. This makes it easy for Argo Workflows to aggregate them into a single JSON object which can then be consumed by a "reduce" step.
Artifacts, on the other hand, may be any type or size. So Argo Workflows is limited in how much it can help with aggregation.
The main relevant feature it provides is declarative repository write/read operations. You can specify, for example, an S3 prefix to write each mapper's output artifact to. Then, in the reduce step, you can load everything under that prefix and perform your aggregation logic.
Argo Workflows provides a generic map/reduce example. But beyond the artifact writing/reading, you pretty much have to implement the aggregation logic yourself, as in the sketch below.
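A minimal sketch of what that reduce step's script might look like, assuming each mapper wrote a small JSON artifact under a shared S3 prefix and the reduce template declares that prefix as an input artifact, so Argo downloads everything into a local directory before the script runs (all paths and field names here are illustrative):

# Hypothetical reduce step: Argo has already fetched every object under
# the shared S3 prefix into INPUT_DIR via the template's input artifact.
import json
import os

INPUT_DIR = "/tmp/results"       # mount path of the input artifact (assumed)
OUTPUT_FILE = "/tmp/total.json"  # declared as the reducer's output artifact

total = 0
for name in os.listdir(INPUT_DIR):
    with open(os.path.join(INPUT_DIR, name)) as f:
        part = json.load(f)      # each mapper wrote, e.g., {"count": N}
        total += part["count"]

with open(OUTPUT_FILE, "w") as f:
    json.dump({"total": total}, f)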

Related

How to implement Featuretools into my ML Process?

I am exploring the possibility of implementing Featuretools in my pipeline, to be able to create new features from my DataFrame.
Currently I am using GridSearchCV with a Pipeline embedded inside it. Since Featuretools creates new features by aggregating over columns, like STD(column) etc., I feel it is susceptible to data leakage. Their FAQ gives an example approach to tackle this, but it is not suitable for the Pipeline structure I am using.
Idea 0: I would love to integrate it directly into my Pipeline, but it does not seem compatible with Pipelines. It would use each fold's train data to construct features and transform that fold's test data, K times. At the end, it would use the whole data to construct features, during the refit=True stage of GridSearchCV. If you have any example that contradicts this, you are very welcome to share it.
Idea 1: I can switch to a manual CV structure, not embedded in a Pipeline. Inside it, I can use the train data to construct new features and transform the test data with them, K times. At the end, all the data can be used to construct the ultimate model. (A sketch of this fold-wise step is below.)
It is the safest option, but with time and complexity disadvantages.
Idea 2: Use it with the whole data and ignore the possibility of leakage. I am not in favor of this, of course. But when I look at the project's GitHub page, all the examples combine train and test data, create the features from the whole data, and only then go on with the train-test split for modeling.
https://github.com/Featuretools/predict-taxi-trip-duration/blob/master/NYC%20Taxi%203%20-%20Simple%20Featuretools.ipynb
Actually, if the developers of the project think that way, I could give it a chance with the whole data.
What do you think? I would love to hear about your experiences with Featuretools.
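To illustrate, the fold-wise construct/transform step in Idea 1 might look roughly like this (a sketch, assuming the Featuretools 1.x API and a single-table EntitySet; X, y, and the column names are stand-ins for your own data):

# Sketch of Idea 1: feature definitions come from the training fold only,
# then the same definitions are computed for the test fold.
import featuretools as ft
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

X = pd.DataFrame({"id": np.arange(100),
                  "a": np.random.rand(100),
                  "b": np.random.rand(100)})
y = np.random.randint(0, 2, size=100)

for train_idx, test_idx in KFold(n_splits=5).split(X):
    train_df, test_df = X.iloc[train_idx], X.iloc[test_idx]

    # Build an EntitySet from the *training* rows only, so primitives
    # that aggregate over a column never see the test fold.
    es_train = ft.EntitySet("train").add_dataframe(
        dataframe_name="data", dataframe=train_df, index="id")

    # features_only=True returns feature definitions without computing them.
    features = ft.dfs(entityset=es_train, target_dataframe_name="data",
                      features_only=True)
    X_train_ft = ft.calculate_feature_matrix(features, entityset=es_train)

    # Recompute the same definitions on an EntitySet holding the test fold.
    es_test = ft.EntitySet("test").add_dataframe(
        dataframe_name="data", dataframe=test_df, index="id")
    X_test_ft = ft.calculate_feature_matrix(features, entityset=es_test)
    # ...fit the model on X_train_ft and score it on X_test_ft...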

How does a PCollection get created by a runner?

Code similar to the snippet below gets called internally from a Read or GroupBy transform during expand. In terms of Beam code, this results in the construction of a PCollection instance. It is not apparent what is actually being constructed by looking at the code, since it is limited to just a new operation. From the runner's point of view, what does calling new PCollection(...) mean?
PCollection.createPrimitiveOutputInternal(
    input.getPipeline(),
    WindowingStrategy.globalDefault(),
    IsBounded.BOUNDED,
    ByteArrayCoder.of())
From the Apache Beam programming guide:
A PCollection represents a distributed data set that your Beam pipeline operates on. The data set can be bounded, meaning it comes from a fixed source like a file, or unbounded, meaning it comes from a continuously updating source via a subscription or other mechanism.
PCollection implements PValue; from its documentation:
Dataflow users should not construct PValue objects directly in their pipelines.
Think of it this way: when you use the SDK to build a pipeline, you are constructing a directed acyclic graph whose nodes are PTransforms and whose edges are PCollections. In the DAG, a PCollection instance is abstract and represents an input/output of one or more PTransforms. When the DAG is executed on a runner, the data of each PCollection can reside on multiple machines/VMs/workers. You cannot view the data until you materialize it through some IO transform.
If you see new PCollection(...) internally in the SDK, it is building an edge of the DAG with the information that will later make sense to the runner when the DAG is executed. A PCollection itself is not a data structure that holds data in memory.
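The same deferred behavior is visible from user code. A small illustration with the Beam Python SDK (the element values are arbitrary):

# Building the pipeline only constructs the DAG; nothing runs yet.
import apache_beam as beam

with beam.Pipeline() as p:
    # `lines` is a PCollection: a handle to a DAG edge, not an in-memory list.
    lines = p | "Create" >> beam.Create(["a", "b", "a"])
    counts = (lines
              | "PairWithOne" >> beam.Map(lambda w: (w, 1))
              | "Count" >> beam.CombinePerKey(sum))
    # `counts` holds no data here; elements only exist on the runner's
    # workers once the pipeline executes, and become visible only when
    # materialized, e.g. written to a sink or printed.
    counts | "Print" >> beam.Map(print)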

Why can't Flink support RichFunction on Reduce/Fold/Aggregate now?

I've searched for RichAggregateFunction in the GitHub repository and only found the following:
.aggregate() does not support [[RichAggregateFunction]], since the reduce function is used internally in a [[org.apache.flink.api.common.state.AggregatingState]].
Does that mean Flink can't merge elements' state in a group window?
Depending on what you mean by "merge" here, you would generally do that work in the ProcessWindowFunction when you call stream.aggregate. The ProcessWindowFunction is the second parameter, and it receives the result of the aggregation, on which you can perform additional operations; see the sketch below.
If you need to combine the aggregated elements in some other way, you can take the stream that comes out of the aggregate and apply further operations to it (such as a ProcessFunction).
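A rough sketch of the two-argument aggregate, here using the PyFlink DataStream API (assuming a PyFlink 1.16-era API; check the exact signatures for your version, and note that the keys and timestamps are made up):

# The AggregateFunction pre-aggregates incrementally (no Rich* variant needed);
# the ProcessWindowFunction then receives the single aggregated value.
from pyflink.common import Types
from pyflink.common.time import Time
from pyflink.common.watermark_strategy import WatermarkStrategy, TimestampAssigner
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.functions import AggregateFunction, ProcessWindowFunction
from pyflink.datastream.window import TumblingEventTimeWindows

class CountAgg(AggregateFunction):
    def create_accumulator(self):
        return 0
    def add(self, value, accumulator):
        return accumulator + 1
    def get_result(self, accumulator):
        return accumulator
    def merge(self, a, b):
        return a + b

class AttachKey(ProcessWindowFunction):
    # Extra per-window logic on the aggregated value goes here.
    def process(self, key, context, counts):
        yield key, next(iter(counts))

class SecondFieldTimestamp(TimestampAssigner):
    def extract_timestamp(self, value, record_timestamp):
        return value[1]   # second tuple field is an epoch-millis timestamp

env = StreamExecutionEnvironment.get_execution_environment()
ds = env.from_collection(
    [("a", 1000), ("a", 2000), ("b", 1500)],
    type_info=Types.TUPLE([Types.STRING(), Types.LONG()]))
ds = ds.assign_timestamps_and_watermarks(
    WatermarkStrategy.for_monotonous_timestamps()
                     .with_timestamp_assigner(SecondFieldTimestamp()))

(ds.key_by(lambda t: t[0])
   .window(TumblingEventTimeWindows.of(Time.seconds(10)))
   .aggregate(CountAgg(), window_function=AttachKey())
   .print())

env.execute("aggregate-with-process-window-function")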

Any alternative to BPMN and DMN notations for describing business logic?

I am looking for a tool capable of describing a complex data-manipulation process in a way that can be more or less easily modified by people who do not write code.
For example, my task is:
1. fetch data from sourceA
2.1 if the data is complete, filter it by condition 45
2.2 if the data is not complete, fetch additional data from source B
3. if the result passes validation, return 1; otherwise return 0
This should be described in a readable manner; ideally, the process could be modified in some UI tool.
What are the requirements?
Each process consists of two parts: steps, and a way to arrange them in a sequence.
(1)
In each step, the process should be able to:
1. emit commands for fetching data from data sources and inserting it into the process context
2. filter, enrich, and transform the datasets obtained
Thus each step of this process should be describable with some more or less simple DSL.
(2)
The selection of the next step, i.e. the sequence of steps, should be described with some visual tool or again, as in (1), with some simple DSL.
Can you advise something for this task, which seems typical to me?
Meanwhile, here are my own ideas.
The first thing that comes to mind is BPMN combined with Drools.
For the steps I could use DRL rules: they can only do basic data manipulation themselves, but I can call Java functions from them if I need something more complicated.
For the step sequence I could use a standard BPMN diagram.
Maybe there is something better?
The combination of BPMN with DMN would indeed allow you to describe, with these visual standards, the execution of the process and the decision logic to be applied, in order to achieve what is in the "For example" paragraph.
To make it fully accessible to business people, the BPMN tasks for fetching the data or performing any interaction with an external system should be prepared in advance and made available during the composition of the BPMN/DMN diagrams.
As an alternative to the BPMN+DMN combination, you can look into Fuse or Fuse Online. It cannot describe all the semantics of BPMN+DMN, but with Fuse Online, for instance, you can implement the steps described in the "For example" paragraph in a fully visual way.

Scio: groupByKey doesn't work when using Pub/Sub as collection source

I changed the source of the WindowedWordCount example program from a text file to Cloud Pub/Sub, as shown below. I published the Shakespeare file's data to Pub/Sub, and it is fetched properly, but none of the transformations after .groupByKey seem to run.
sc.pubsubSubscription[String](psSubscription)
  .withFixedWindows(windowSize) // apply windowing logic
  .flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty))
  .countByValue
  .withWindow[IntervalWindow]
  .swap
  .groupByKey
  .map { s =>
    println("\n\n\n\n\n\n\n This never prints \n\n\n\n\n")
    println(s)
  }
Changing the input from a text file to Pub/Sub makes the PCollection "unbounded". Grouping it by key requires defining aggregation triggers; otherwise the grouper will wait forever. This is mentioned in the Dataflow documentation here:
https://cloud.google.com/dataflow/model/group-by-key
Note: Either non-global Windowing or an aggregation trigger is required in order to perform a GroupByKey on an unbounded PCollection. This is because a bounded GroupByKey must wait for all the data with a certain key to be collected; but with an unbounded collection, the data is unlimited. Windowing and/or Triggers allow grouping to operate on logical, finite bundles of data within the unbounded data stream.
If you apply GroupByKey to an unbounded PCollection without setting either a non-global windowing strategy, a trigger strategy, or both, Dataflow will generate an IllegalStateException error when your pipeline is constructed.
Unfortunately, the Python SDK of Apache Beam does not seem to support triggers (yet), so I'm not sure what the solution would be in Python.
(see https://beam.apache.org/documentation/programming-guide/#triggers)
Regarding Franz's comment above (I would reply to his comment directly if Stack Overflow let me!): I see that the docs say that triggering is not implemented... but they also say that Realtime Database functions are not available, while our current project is actively using them. They're just new.
See trigger functions here: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/trigger.py
Beware: the API is unfinished, as this is not "release-ready" code. But it is available; a sketch of what it looks like follows.
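For reference, in current Beam Python releases the windowing-plus-trigger setup that lets GroupByKey fire on an unbounded source looks roughly like this (a sketch; the subscription name is a placeholder):

# Apply fixed windows and a trigger before GroupByKey so grouping can
# emit results on logical, finite bundles of the unbounded stream.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import trigger, window

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (p
     | beam.io.ReadFromPubSub(
         subscription="projects/my-project/subscriptions/my-sub")
     | beam.Map(lambda b: b.decode("utf-8"))
     | beam.FlatMap(lambda line: (w for w in line.split() if w))
     | beam.WindowInto(
         window.FixedWindows(60),                      # 60-second windows
         trigger=trigger.AfterWatermark(
             early=trigger.AfterProcessingTime(30)),   # early firings
         accumulation_mode=trigger.AccumulationMode.DISCARDING)
     | beam.Map(lambda w: (w, 1))
     | beam.GroupByKey()   # now emits once per window/trigger firing
     | beam.Map(print))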