Create composite type in Flink table - Scala

I am trying to write a user-defined scalar function in Flink which takes in multiple expressions (an arbitrary number of expressions) and combines them into a single expression.
Coming from the Spark world, I could achieve this by using struct, which returns a Row type, and passing it to a UDF, like:
val structCol = org.apache.spark.sql.functions.struct(cols: _*)
vecUdf(structCol)
I am not able to find an equivalent in Flink. I am also trying to see if I can write a ScalarFunction that takes in an arbitrary number of expressions, but I am not able to find any examples.
Can anyone help guide me to either of the above two approaches? Thanks!
Note that I can't make it an Array, since each expression can be of a different type (actually, the same value type, but each one could be an array or a scalar).
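For illustration, here is a rough sketch of the second approach: a ScalarFunction whose eval accepts a variable number of arguments of any type. The class name and the combination logic are placeholders, and the annotation-based type hints assume a recent Flink version, so treat this as a sketch rather than a confirmed solution.
import org.apache.flink.table.annotation.{DataTypeHint, InputGroup}
import org.apache.flink.table.functions.ScalarFunction
import scala.annotation.varargs

// Placeholder UDF: accepts any number of arguments of any type and combines them.
class CombineColumns extends ScalarFunction {
  @varargs
  def eval(@DataTypeHint(inputGroup = InputGroup.ANY) args: AnyRef*): String =
    args.mkString("|") // replace with the real combination logic
}
If something like this works in your Flink version, it can be registered with tableEnv.createTemporarySystemFunction("combineColumns", classOf[CombineColumns]) and called with as many columns as needed.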

Related

Dataflow ValueProvider: how to create one from several options?

I successfully use NestedValueProvider when I need to perform some transformation on an input value before passing it into a step.
But what should I do if I need to combine two or more value inputs?
The documentation says:
Note: NestedValueProvider accepts only one value input. You can't use a NestedValueProvider to combine two different values.
NestedValueProvider is used to take another ValueProvider and transform it using a function. It currently does not support combining values from two or more ValueProviders. Any constant values can be provided as part of the function definition.
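As a minimal sketch of the single-input case the docs describe (the function name withBucketPrefix and the "gs://my-bucket/" prefix are made-up illustrations): the constant lives inside the function, and only one ValueProvider is wrapped.
import org.apache.beam.sdk.options.ValueProvider
import org.apache.beam.sdk.options.ValueProvider.NestedValueProvider
import org.apache.beam.sdk.transforms.SerializableFunction

// Wraps a single ValueProvider; the constant prefix is part of the function,
// which is the only way to mix fixed values with one runtime value.
def withBucketPrefix(inputPath: ValueProvider[String]): ValueProvider[String] =
  NestedValueProvider.of(
    inputPath,
    new SerializableFunction[String, String] {
      override def apply(path: String): String = "gs://my-bucket/" + path
    })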

Spark Dataset encoder for heterogeneous lists (not tuples)

I'd like to use a Spark Dataset to store a collection of data points where each row is a heterogeneous list combining primitive types and case classes. For example, one row might be
val row = List[Any](1, 2.0, 3L, CaseClass1(4, 5, 6), CaseClass2(7, 8))
I'm trying to figure out how to make an Encoder for the entire List[Any].
At runtime, I will have an Encoder and a TypeTag for each of the individual types, but unfortunately I won't know the exact length of the list at compile time (which I think prevents me from using a tuple to store each row).
Also, I've tried using a RowEncoder based on the schema of the data (which I can construct using the TypeTags and ScalaReflection.schemaFor). However, that approach doesn't seem to handle case classes appearing inside each row. (With the above example, it gives me an error saying that CaseClass1 (or CaseClass2) is not a supported data type.)
So, if I have rows of heterogeneous lists with a corresponding Encoder and a TypeTag for each position in the list (available at runtime), but I don't know the length of the list at compile time, how can I encode these rows as either a Dataset or DataFrame?
I suspect that a solution will involve either (1) transforming the data to turn any nested case classes into nested sql.Row instances, or (2) explicitly constructing an Encoder or ExpressionEncoder based on the available Encoders and TypeTags.
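A rough sketch of approach (1), assuming every nested case class can be flattened into a nested Row and the per-position schema is built from runtime type information (the field names, case classes, and SparkSession setup here are illustrative):
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.{StructField, StructType}

case class CaseClass1(a: Int, b: Int, c: Int)
case class CaseClass2(x: Int, y: Int)

val spark = SparkSession.builder().master("local[*]").appName("hetero-rows").getOrCreate()

// Build the schema position by position from the runtime type information.
val schema = StructType(Seq(
  StructField("f0", ScalaReflection.schemaFor[Int].dataType),
  StructField("f1", ScalaReflection.schemaFor[Double].dataType),
  StructField("f2", ScalaReflection.schemaFor[Long].dataType),
  StructField("f3", ScalaReflection.schemaFor[CaseClass1].dataType),
  StructField("f4", ScalaReflection.schemaFor[CaseClass2].dataType)
))

val row = List[Any](1, 2.0, 3L, CaseClass1(4, 5, 6), CaseClass2(7, 8))

// Approach (1): turn nested case classes into nested Rows so the values match the schema.
// Note: this simple match would also flatten tuples, since they are Products too.
val asRow = Row.fromSeq(row.map {
  case p: Product => Row.fromSeq(p.productIterator.toSeq)
  case other      => other
})

val df = spark.createDataFrame(spark.sparkContext.parallelize(Seq(asRow)), schema)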

Spark Dataset aggregation similar to RDD aggregate(zero)(accum, combiner)

RDD has a very useful method, aggregate, that allows accumulating with some zero value and combining that across partitions. Is there any way to do that with Dataset[T]? As far as I can see from the Scaladoc, there is actually nothing capable of doing that. Even the reduce method only allows binary operations with T as both arguments. Any reason why? And is there anything capable of doing the same?
Thanks a lot!
VK
There are two different classes which can be used to achieve aggregate-like behavior in the Dataset API:
UserDefinedAggregateFunction, which uses SQL types and takes Columns as input.
The initial value is defined with the initialize method, seqOp with the update method, and combOp with the merge method.
Example implementation: How to define a custom aggregation function to sum a column of Vectors?
Aggregator, which uses standard Scala types with Encoders and takes records as input.
The initial value is defined with the zero method, seqOp with the reduce method, and combOp with the merge method.
Example implementation: How to find mean of grouped Vector columns in Spark SQL?
Both provide an additional finalization method (evaluate and finish respectively) which is used to generate final results, and both can be used for global as well as by-key aggregations.
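To illustrate the second option, here is a minimal Aggregator sketch (the name SumOfSquares and its logic are just an example) whose methods line up with aggregate's zero, seqOp and combOp:
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

object SumOfSquares extends Aggregator[Double, Double, Double] {
  def zero: Double = 0.0                                     // initial value (zero)
  def reduce(acc: Double, x: Double): Double = acc + x * x   // seqOp
  def merge(a: Double, b: Double): Double = a + b            // combOp
  def finish(acc: Double): Double = acc                      // finalization
  def bufferEncoder: Encoder[Double] = Encoders.scalaDouble
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}
It can then be applied globally with ds.select(SumOfSquares.toColumn) or per key via groupByKey(...).agg(SumOfSquares.toColumn).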

Spark: difference of semantics between reduce and reduceByKey

In Spark's documentation, it says that the RDD method reduce requires an associative AND commutative binary function.
However, the method reduceByKey ONLY requires an associative binary function.
I did some tests, and apparently this is the behavior I get. Why this difference? Why does reduceByKey ensure the binary function is always applied in a certain order (to accommodate for the lack of commutativity) when reduce does not?
For example, if I load some (small) text with 4 partitions (minimum):
val r = sc.textFile("file4kB", 4)
then:
r.reduce(_ + _)
returns a string where parts are not always in the same order, whereas:
r.map(x => (1,x)).reduceByKey(_ + _).first
always returns the same string (where everything is in the same order as in the original file).
(I checked with r.glom that the file content is indeed spread over 4 partitions; there are no empty partitions.)
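Here is a rough, self-contained variant of the same test, using parallelize with an in-memory collection instead of the file (an adaptation of the original setup):
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[4]").setAppName("reduce-order"))

// String concatenation is associative but not commutative,
// so the order in which partition results are combined is observable.
val r = sc.parallelize(Seq("a", "b", "c", "d"), 4)

val viaReduce      = r.reduce(_ + _)                                  // partition order may vary
val viaReduceByKey = r.map(x => (1, x)).reduceByKey(_ + _).first()._2 // looked stable locally, but see the answer below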
As far as I am concerned, this is an error in the documentation, and the results you see are simply incidental. Practice, other resources, and a simple analysis of the code show that the function passed to reduceByKey should be not only associative but commutative as well.
practice - while it looks like the order is preserved in local mode, this is no longer true when you run Spark on a cluster, including standalone mode.
other resources - to quote Data Exploration Using Spark from AmpCamp 3:
There is a convenient method called reduceByKey in Spark for exactly this pattern. Note that the second argument to reduceByKey determines the number of reducers to use. By default, Spark assumes that the reduce function is commutative and associative and applies combiners on the mapper side.
code - reduceByKey is implemented using combineByKeyWithClassTag and creates a ShuffledRDD. Since Spark doesn't guarantee the order after shuffling, the only way to restore it would be to attach some metadata to the partially reduced records. As far as I can tell, nothing like this takes place.
On a side note, reduce as it is implemented in PySpark will work just fine with a function which is only commutative. It is, of course, just an implementation detail and not a part of the contract.
According to the code documentation, which was recently updated/corrected (thanks @zero323):
reduceByKey merges the values for each key using an associative and commutative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce.
So it was in fact a documentation error, as @zero323 pointed out in his answer.
You can check the following links to the code to make sure:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L304
https://github.com/apache/spark/blob/master/python/pyspark/rdd.py#L1560

How to use records as parameters in a stored function

I want to use the record type as a parameter, but I got a message that functions cannot have record type parameters. I have a DAO function which performs various operations on an ArrayList passed through a parameter, and I need to implement it in a stored procedure. Any help will be greatly appreciated. Thanks!
The function I'm looking for is something like:
CREATE OR REPLACE FUNCTION est_fn_get_emp_report(rec record,...)
I am new to PostgreSQL; I have used stored functions before, but I have never had to use record type parameters.
The simple issue is that you can't specify a record. You can specify some polymorphic types (ANYARRAY, ANYELEMENT) as inputs to a function, but the input needs to have a structure known at planning time, and polymorphic types as input arguments can lead to issues even on a good day. The problem with a record is that PostgreSQL won't necessarily know what the internal structure is when it is passed in. ROW(1, 'test') is not useful in a functional context.
Instead you want to specify composite types. You can actually take this very far in terms of relying on PostgreSQL. This allows you to specify a specific type of record when passing it in.
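A minimal sketch of that approach, assuming a composite type with made-up fields (emp_id, emp_name) and an illustrative function body:
CREATE TYPE emp_rec AS (
    emp_id   integer,
    emp_name text
);

CREATE OR REPLACE FUNCTION est_fn_get_emp_report(rec emp_rec)
RETURNS text AS $$
BEGIN
    -- the structure of rec is now known at planning time
    RETURN rec.emp_id::text || ': ' || rec.emp_name;
END;
$$ LANGUAGE plpgsql;

SELECT est_fn_get_emp_report(ROW(1, 'test')::emp_rec);
The key difference from a bare record is the cast to emp_rec, which gives the planner the structure it needs.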