How can I apply different windows to one PCollection at once? - apache-beam

So my use case is that the elements in my PCollection should be put into windows of different lengths (the length is specified in the Row itself), but the subsequent operations, like the GroupBy, are the same for all of them, so I don't want to split up the PCollection at this point.
So what I'm trying to do is basically this:
windowed_items = (
    items
    | 'windowing' >> beam.WindowInto(
        window.SlidingWindows(lambda row: int(row.WINDOW_LENGTH), 60))
)
However, when building the pipeline I get the error TypeError: '<=' not supported between instances of 'function' and 'int'.
An alternative to applying different windows to one PCollection would be to split/branch the PCollection into multiple PCollections based on the defined window length and apply the respective window to each. However, this would mean hardcoding the windowing for every allowed value, and in my case that is potentially a huge number, which is why I want to avoid it.
From the error I'm getting (though I could not find this stated explicitly in the docs), I understand that the SlidingWindows parameters have to be provided when building the pipeline and cannot be determined at runtime. Is this correct? Is there some workaround to apply different windows to one PCollection at once, or is it simply not possible? If it is not possible, are there any other alternative approaches to the one I outlined above?

I believe that custom session windowing is what you are looking for. However, it's not supported in the Python SDK yet.
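That said, if the set of allowed window lengths is known up front, the branching approach from the question can be written generically instead of hard-coding one branch per value by hand. The following is only a rough sketch under that assumption; ALLOWED_LENGTHS, the KEY field and the GroupBy step are hypothetical placeholders for your own schema and downstream logic.

import apache_beam as beam
from apache_beam import window

# Hypothetical fixed set of allowed window lengths, in seconds.
ALLOWED_LENGTHS = [60, 300, 900]

def apply_shared_steps(pcoll, length):
    # The downstream logic that is identical for every branch.
    return pcoll | f'group_{length}' >> beam.GroupBy('KEY')

with beam.Pipeline() as p:
    items = p | beam.Create([
        beam.Row(KEY='a', WINDOW_LENGTH=60, value=1),
        beam.Row(KEY='a', WINDOW_LENGTH=300, value=2),
    ])

    # Route each row to the branch matching its own WINDOW_LENGTH
    # (rows with a length outside ALLOWED_LENGTHS would raise here).
    branches = items | beam.Partition(
        lambda row, n: ALLOWED_LENGTHS.index(int(row.WINDOW_LENGTH)),
        len(ALLOWED_LENGTHS))

    for length, branch in zip(ALLOWED_LENGTHS, branches):
        windowed = branch | f'window_{length}' >> beam.WindowInto(
            window.SlidingWindows(size=length, period=60))
        grouped = apply_shared_steps(windowed, length)

This still creates one branch per allowed length in the pipeline graph, so it only helps when that set is reasonably small.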

Related

Creating attributes for elements of a PCollection in Apache Beam

I'm fairly new to Apache Beam and was wondering if I can create my own attributes for elements in a PCollection.
I went through the docs but could not find anything.
Example 2 in this ParDo doc shows how to access the TimestampParam and the WindowParam, which are - from what I understand - attributes of each element in a PCollection:
class AnalyzeElement(beam.DoFn):
    def process(
            self,
            elem,
            timestamp=beam.DoFn.TimestampParam,
            window=beam.DoFn.WindowParam):
        yield [...]
So my question is whether it is possible to create such attributes (e.g. a TablenameParam) for the elements in a PCollection and, if not, whether there is some kind of workaround to achieve that.
What you are describing would simply be part of the element. For your example, the TablenameParam would be a field of the type you add to the PCollection.
The reason that WindowParam and TimestampParam are treated differently is that they are often propagated implicitly, and are a part of the Beam model for every element no matter the actual data. Most code that is only operating on the main data does not need to touch them.
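A minimal sketch of that idea, assuming a hypothetical Record type whose tablename field plays the role of the desired TablenameParam:

import typing
import apache_beam as beam

class Record(typing.NamedTuple):
    tablename: str  # the would-be TablenameParam is just a field of the element
    payload: str

class AnalyzeElement(beam.DoFn):
    def process(self, elem, timestamp=beam.DoFn.TimestampParam):
        # The custom attribute travels with the data itself,
        # while the timestamp is still supplied by the Beam model.
        yield (elem.tablename, elem.payload, timestamp)

with beam.Pipeline() as p:
    _ = (p
         | beam.Create([Record(tablename='events', payload='some data')])
         | beam.ParDo(AnalyzeElement()))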

Azure Data Flow: compare two strings with lookup

I'm using Azure Data Flow to do some transformations on the data, but I'm facing some challenges.
I have a use case with two streams that share some common data, and what I'm looking for is to output the data common to both streams.
I match the data on some common fields (product_name (string) and brand (string)); I don't have an ID.
To do the matching, I picked the Lookup activity and tried to compare the brand in the two streams, but the result is not correct because, for example:
left stream : the brand = Estēe Lauder
right stream : the brand = Estée Lauder
For me this is the same brand, but the text is formatted differently. I wanted to use the 'like' operator, but the Lookup activity does not support it, so I'm using the '==' operator to compare.
Is there a way to work around this problem?
If you use the Exists transformation instead of Lookup, you will have much more flexibility because you can use custom expressions, including regex matching. You can also look at using fuzzy matching functions in the Exists expression, such as soundex(), rlike(), etc.
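The mismatch in the example above comes down to accented characters. As a plain illustration of what that kind of normalization does (this is Python, not Data Flow expression syntax), stripping diacritics makes the two brand values compare as equal:

import unicodedata

def strip_accents(s: str) -> str:
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize('NFKD', s)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

# Both variants reduce to 'Estee Lauder', so they compare as equal.
print(strip_accents('Estēe Lauder') == strip_accents('Estée Lauder'))  # True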

Dataflow ValueProvider: how to create one from several options?

I successfully use NestedValueProvider when I need to perform some transformation on an input value before providing it to a step.
But what should I do if I need to combine two or more value inputs?
The documentation says:
Note: NestedValueProvider accepts only one value input. You can't use a NestedValueProvider to combine two different values.
NestedValueProvider is used to take another ValueProvider and transform it using a function. It currently does not support combining values from two or more ValueProviders. Any constant values can be provided as a part of the function definition.
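A minimal sketch of the single-input pattern, using a hypothetical StaticValueProvider in place of a real runtime option and folding a constant into the translator function:

from apache_beam.options.value_provider import (
    NestedValueProvider, StaticValueProvider)

# Hypothetical runtime option; in a template this would come from the pipeline options.
dataset = StaticValueProvider(str, 'my_dataset')

TABLE_SUFFIX = '_daily'  # constants can be folded into the translator function

table = NestedValueProvider(dataset, lambda value: value + TABLE_SUFFIX)
print(table.get())  # 'my_dataset_daily'

If you really need two runtime values, one commonly suggested workaround is to pass them as a single option (for example a delimited string) and split it inside the translator function.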

Why can't Flink currently support RichFunction on Reduce/Fold/Aggregate?

I've searched for RichAggregateFunction in the GitHub repository and only found the following:
.aggregate() does not support [[RichAggregateFunction]], since the reduce function is used internally in a [[org.apache.flink.api.common.state.AggregatingState]].
Does that mean Flink can't merge elements' state in a group window?
Depending on what you mean by "merge" here, you would generally do that work in the ProcessWindowFunction when you call stream.aggregate. The PWF would be your second parameter, and it will receive the aggregated result, on which you can perform additional operations.
If you need to combine the aggregated elements in some other way, you can take the stream that comes out of the aggregate and apply further operations to it (such as a ProcessFunction).

Spark: difference of semantics between reduce and reduceByKey

In Spark's documentation, it says that the RDD method reduce requires an associative AND commutative binary function.
However, the method reduceByKey ONLY requires an associative binary function.
I did some tests, and apparently this is the behavior I get. Why this difference? Why does reduceByKey ensure the binary function is always applied in a certain order (to accommodate the lack of commutativity) when reduce does not?
For example, if I load some (small) text with 4 partitions (minimum):
val r = sc.textFile("file4k", 4)
then:
r.reduce(_ + _)
returns a string where parts are not always in the same order, whereas:
r.map(x => (1,x)).reduceByKey(_ + _).first
always returns the same string (where everything is in the same order as in the original file).
(I checked with r.glom and the file content is indeed spread over 4 partitions, there is no empty partition).
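For reference, a rough PySpark equivalent of this experiment (the data, app name and partition count here are illustrative) could look like this:

from pyspark import SparkContext

sc = SparkContext("local[4]", "reduce-vs-reduceByKey")

# Illustrative data spread over 4 partitions.
rdd = sc.parallelize(["a", "b", "c", "d", "e", "f", "g", "h"], 4)

# String concatenation is associative but not commutative, so the result of
# reduce may depend on the order in which partition results are combined.
print(rdd.reduce(lambda x, y: x + y))

# The reduceByKey variant from the question.
print(rdd.map(lambda x: (1, x)).reduceByKey(lambda x, y: x + y).first())

As the answer below explains, any ordering you happen to observe here is incidental rather than guaranteed.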
As far as I am concerned this is an error in the documentation, and the results you see are simply incidental. Practice, other resources and a simple analysis of the code show that the function passed to reduceByKey should be not only associative but commutative as well.
practice - while it looks like the order is preserved in local mode, this is no longer true when you run Spark on a cluster, including standalone mode.
other resources - to quote Data Exploration Using Spark from AmpCamp 3:
There is a convenient method called reduceByKey in Spark for exactly this pattern. Note that the second argument to reduceByKey determines the number of reducers to use. By default, Spark assumes that the reduce function is commutative and associative and applies combiners on the mapper side.
code - reduceByKey is implemented using combineByKeyWithClassTag and creates a ShuffledRDD. Since Spark doesn't guarantee the order after shuffling, the only way to restore it would be to attach some metadata to the partially reduced records. As far as I can tell, nothing like this takes place.
On a side note, reduce as it is implemented in PySpark will work just fine with a function which is only commutative. That is of course just an implementation detail and not part of the contract.
According to the code documentation, which was recently updated/corrected (thanks @zero323):
reduceByKey merges the values for each key using an associative and commutative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce.
So it was in fact a documentation error, as @zero323 pointed out in his answer.
You can check the following links to the code to make sure:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L304
https://github.com/apache/spark/blob/master/python/pyspark/rdd.py#L1560