Dataflow ValueProvider: how to create one from several options? - apache-beam

I successfully use NestedValueProvider when I need to perform some transformation on an input value before feeding it into a step.
But how should I proceed if I need to combine two or more value inputs?
The documentation says:
Note: NestedValueProvider accepts only one value input. You can't use a NestedValueProvider to combine two different values.

NestedValueProvider is used to take another ValueProvider and transform it using a function. It currently does not support combining values from two or more ValueProviders. Any constant values can be provided as a part of the function definition.
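For illustration, a minimal Python sketch of that single-input pattern, assuming a hypothetical runtime option --input_prefix and a constant suffix baked into the translator function (it does not combine two providers, which remains unsupported):

from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.value_provider import NestedValueProvider

class MyOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # hypothetical runtime option
        parser.add_value_provider_argument('--input_prefix', type=str)

options = PipelineOptions().view_as(MyOptions)

# The translator function receives exactly one value; constants such as the
# '/data.csv' suffix here can only be baked into the function itself.
full_path = NestedValueProvider(
    options.input_prefix,
    lambda prefix: prefix + '/data.csv')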

Related

How can I apply different windows to one PCollection at once?

So my use case is that the elements in my PCollection should be put into windows of different lengths (which are specified in the Row itself), but the subsequent operations, like the GroupBy, are the same, so I don't want to split up the PCollection at this point.
So what I'm trying to do is basically this:
windowed_items = (
    items
    | 'windowing' >> beam.WindowInto(
        window.SlidingWindows(lambda row: int(row.WINDOW_LENGTH), 60))
)
However, when building the pipeline I get the error TypeError: '<=' not supported between instances of 'function' and 'int'.
An alternative to applying different windows to one PCollection would be to split/branch the PCollection into multiple PCollections based on the defined window and apply the respective window to each. However, this would mean hardcoding the windowing for every allowed value, and in my case that is potentially a huge number, which is why I want to avoid it.
So from the error I'm getting (though I can't find this stated explicitly in the docs), I understand that the SlidingWindows parameters have to be provided when building the pipeline and cannot be determined at runtime. Is this correct? Is there some workaround to apply different windows to one PCollection at once, or is it simply not possible? If that is the case, are there any other alternative approaches to the one I outlined above?
I believe that custom session windowing is what you are looking for. However, it's not supported in the Python SDK yet.
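For the split/branch workaround described in the question, a rough Python sketch, assuming a small known set of allowed lengths (the ALLOWED_LENGTHS values and the surrounding names are made up; WINDOW_LENGTH is the column from the question):

import apache_beam as beam
from apache_beam.transforms import window

ALLOWED_LENGTHS = [60, 300, 900]  # hypothetical set of allowed window lengths

def by_length(row, num_partitions):
    # route each row to the branch matching its own window length
    return ALLOWED_LENGTHS.index(int(row.WINDOW_LENGTH))

branches = items | 'split' >> beam.Partition(by_length, len(ALLOWED_LENGTHS))

windowed = {
    length: branch | f'window_{length}' >> beam.WindowInto(
        window.SlidingWindows(length, 60))
    for branch, length in zip(branches, ALLOWED_LENGTHS)
}
# The shared downstream steps (the GroupBy etc.) then have to be applied to each
# entry of `windowed` separately, which is exactly the duplication the question
# is trying to avoid when the set of allowed lengths is large.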

Azure Data Flow: compare two strings with a lookup

I'm using Azure Data Flow to do some transformations on the data, but I'm facing some challenges.
I have a use case with two streams that share some common data, and what I'm looking for is to output the data common to both streams.
I match the data on some common fields (product_name (string) and brand (string)); I don't have an ID.
To do the matching, I picked the Lookup transformation and tried to compare the brand in the two streams, but the result is not correct because, for example:
left stream: the brand = Estēe Lauder
right stream: the brand = Estée Lauder
For me this is the same brand, but the text format differs. I wanted to use a 'like' operator, but the Lookup transformation does not support it, so I'm using the '==' operator to compare.
Is there a way to work around this problem?
If you use the Exists transformation instead of Lookup, you will have much more flexibility because you can use custom expressions including regex matching. Also, you can look at using fuzzy matching functions in the Exists expression like soundex(), rlike(), etc.
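As an illustration only (the RightStream name and the stream-qualified column syntax are assumptions to verify against your own data flow), a custom Exists expression could look roughly like: soundex(brand) == soundex(RightStream@brand).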

Spark Dataset aggregation similar to RDD aggregate(zero)(accum, combiner)

RDD has a very useful method, aggregate, that allows accumulating with some zero value and combining the results across partitions. Is there any way to do that with Dataset[T]? As far as I can see in the Scaladoc, there is actually nothing capable of doing that. Even the reduce method only allows binary operations with T as both arguments. Any reason why? And is there anything capable of doing the same?
Thanks a lot!
VK
There are two different classes which can be used to achieve aggregate-like behavior in Dataset API:
UserDefinedAggregateFunction which uses SQL types and takes Columns as an input.
The initial value is defined using the initialize method, seqOp with the update method, and combOp with the merge method.
Example implementation: How to define a custom aggregation function to sum a column of Vectors?
Aggregator which uses standard Scala types with Encoders and takes records as an input.
The initial value is defined using the zero method, seqOp with the reduce method, and combOp with the merge method.
Example implementation: How to find mean of grouped Vector columns in Spark SQL?
Both provide an additional finalization method (evaluate and finish, respectively) used to generate the final result, and both can be used for global as well as by-key aggregations.
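For the Aggregator route, a minimal Scala sketch (the SumAndCount buffer and the averaging logic are illustrative, not from the question) showing how zero, reduce, merge and finish line up with aggregate's zero value, seqOp, combOp and a finalizer:

import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// intermediate buffer type carried across partitions
case class SumAndCount(sum: Double, count: Long)

object AvgAggregator extends Aggregator[Double, SumAndCount, Double] {
  def zero: SumAndCount = SumAndCount(0.0, 0L)                     // zero value
  def reduce(buf: SumAndCount, value: Double): SumAndCount =       // seqOp
    SumAndCount(buf.sum + value, buf.count + 1)
  def merge(a: SumAndCount, b: SumAndCount): SumAndCount =         // combOp
    SumAndCount(a.sum + b.sum, a.count + b.count)
  def finish(buf: SumAndCount): Double =                           // finalization
    if (buf.count == 0) 0.0 else buf.sum / buf.count
  def bufferEncoder: Encoder[SumAndCount] = Encoders.product[SumAndCount]
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

// usage (global aggregation; spark is an active SparkSession):
//   import spark.implicits._
//   Seq(1.0, 2.0, 3.0).toDS().select(AvgAggregator.toColumn).show()
// for by-key aggregation, pass it to groupByKey(...).agg(AvgAggregator.toColumn)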

XML input for PowerShell function

I have an XML document with multiple nodes and sub-nodes from which I'm consuming data as input for multiple functions called from the main function.
I have a basic question about optimized code here.
Is it good to pass the XML object as input to the multiple functions that each consume some data from it?
Is it good to pass the XML path to each function and instantiate the XML object inside each function?
Is there a way to pass just the node that is required by a particular function? (In case I have 10 nodes and 10 functions, where each function requires just one particular node for consuming data.)
Thanks
I would argue that it's better to pass only specific arguments to each function. The less broad your input, the simpler your input validation. Also, I'd strongly recommend avoiding repeatedly reading/parsing the same data; there's no benefit at all in doing that.
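A small PowerShell sketch of option 3 under those recommendations (the settings.xml path, node names and function are hypothetical): parse the document once in the caller and hand each function only the node it needs:

[xml]$config = Get-Content -Path '.\settings.xml' -Raw   # parse once

function Get-DatabaseConnection {
    param(
        [Parameter(Mandatory)]
        [System.Xml.XmlElement]$DatabaseNode   # only the node this function needs
    )
    # validation and consumption stay narrow and simple
    $DatabaseNode.ConnectionString
}

Get-DatabaseConnection -DatabaseNode $config.Settings.Database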

How to use records as parameters in a stored function

I want to use a record type as a parameter, but I got a message that functions cannot have record type parameters. I have a DAO function which performs various operations on an ArrayList passed as a parameter, and I need to implement it as a stored procedure. Any help will be greatly appreciated. Thanks!
The function I'm looking for is something like:
CREATE OR REPLACE FUNCTION est_fn_get_emp_report(rec record,...)
I am new to PostgreSQL; I have used stored functions before, but I never had to use record type parameters.
The simple issue is that you can't specify a record. You can specify some polymorphic types (ANYARRAY, ANYELEMENT) as function inputs, but the argument needs to have a structure known at planning time, and polymorphic input arguments can lead to issues even on a good day. The problem with a record is that PostgreSQL won't necessarily know what its internal structure is when it is passed in. ROW(1, 'test') is not useful in a functional context.
Instead, you want to specify composite types. You can actually take this very far in terms of relying on PostgreSQL. This allows you to specify a specific type of record when passing it in.
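A minimal sketch of the composite-type approach (the emp_row type and its fields are made up for illustration; est_fn_get_emp_report is the name from the question, and an array of the composite type stands in for the ArrayList):

CREATE TYPE emp_row AS (
    emp_id   integer,
    emp_name text
);

CREATE OR REPLACE FUNCTION est_fn_get_emp_report(emp_rows emp_row[])
RETURNS integer
LANGUAGE plpgsql
AS $$
BEGIN
    -- e.g. just report how many rows were passed in
    RETURN coalesce(array_length(emp_rows, 1), 0);
END;
$$;

-- call it by building values of the declared type:
SELECT est_fn_get_emp_report(ARRAY[(1, 'alice')::emp_row, (2, 'bob')::emp_row]);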