Spark Dataset encoder for heterogeneous lists (not tuples) - scala

I'd like to use a Spark Dataset to store a collection of data points where each row is a heterogeneous list combining primitive types and case classes. For example, one row might be
val row = List[Any](1, 2.0, 3L, CaseClass1(4, 5, 6), CaseClass2(7, 8))
I'm trying to figure out how to make an Encoder for the entire List[Any].
At runtime, I will have an Encoder and a TypeTag for each of the individual types, but unfortunately I won't know the exact length of the list at compile time (which I think prevents me from using a tuple to store each row).
Also, I've tried using a RowEncoder based on the schema of the data (which I can construct using the TypeTags and ScalaReflection.schemaFor). However, that approach doesn't seem to handle case classes appearing inside each row. (With the above example, it gives me an error saying that CaseClass1 (or CaseClass2) is not a supported data type.)
So, if I have rows of heterogeneous lists with a corresponding Encoder and a TypeTag for each position in the list (available at runtime), but I don't know the length of the list at compile time, how can I encode these rows as either a Dataset or DataFrame?
I suspect that a solution will involve either (1) transforming the data to turn any nested case classes into nested sql.Row instances, or (2) explicitly constructing an Encoder or ExpressionEncoder based on the available Encoders and TypeTags.
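For what it's worth, here is a rough sketch of option (1), assuming every case class in a row can be flattened into a nested Row matching the struct derived for it; the column names and the hand-written schema are illustrative, since in practice the schema would come from the TypeTags via ScalaReflection.schemaFor:
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

case class CaseClass1(a: Int, b: Int, c: Int)
case class CaseClass2(x: Int, y: Int)

// Nested StructTypes stand in for the case classes.
val schema = StructType(Seq(
  StructField("c0", IntegerType),
  StructField("c1", DoubleType),
  StructField("c2", LongType),
  StructField("c3", StructType(Seq(
    StructField("a", IntegerType),
    StructField("b", IntegerType),
    StructField("c", IntegerType)))),
  StructField("c4", StructType(Seq(
    StructField("x", IntegerType),
    StructField("y", IntegerType))))
))

// Turn each heterogeneous List[Any] into a Row, converting case classes into nested Rows.
def toRow(xs: List[Any]): Row = Row(xs.map {
  case p: Product => Row(p.productIterator.toSeq: _*)
  case other      => other
}: _*)

val spark = SparkSession.builder.master("local[*]").getOrCreate()
val rows  = Seq(List[Any](1, 2.0, 3L, CaseClass1(4, 5, 6), CaseClass2(7, 8))).map(toRow)
spark.createDataFrame(spark.sparkContext.parallelize(rows), schema).show()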

Related

Is it possible to only evaluate the Key when reading a SequenceFile in Spark?

I'm trying to read a sequence file with custom Writable subclasses for both K and V of a SequenceFile input to a Spark job.
The vast majority of rows need to be filtered out by matching a broadcast variable ("candidateSet") against Kclass.getId. Unfortunately, with the standard approach the values V are deserialized for every record regardless, and according to a profile that is where the majority of the time is being spent.
Here is my code. Note my most recent attempt, which reads the value generically as a Writable and casts it back later; this worked functionally but still caused the full deserialization in the iterator.
val rdd = sc.sequenceFile(
path,
classOf[MyKeyClassWritable],
classOf[Writable]
).filter(a => candidateSet.value.contains(a._1.getId))
It turns out Twitter has a library that handles this case pretty well. Specifically, using this class lets you defer evaluating the serialized fields to a later step by reading them as DataInputBuffers:
https://github.com/twitter/elephant-bird/blob/master/core/src/main/java/com/twitter/elephantbird/mapreduce/input/RawSequenceFileRecordReader.java
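For illustration, a rough sketch of how that might be wired into the job above. This assumes elephant-bird exposes a RawSequenceFileInputFormat that pairs with the record reader linked above and returns both key and value as raw DataInputBuffers; treat the class name and type parameters as an assumption rather than verified API:
import org.apache.hadoop.io.DataInputBuffer
import com.twitter.elephantbird.mapreduce.input.RawSequenceFileInputFormat

// Assumption: the raw input format yields (DataInputBuffer, DataInputBuffer) pairs,
// so neither key nor value is deserialized until it is explicitly read.
val raw = sc.newAPIHadoopFile(
  path,
  classOf[RawSequenceFileInputFormat],
  classOf[DataInputBuffer],
  classOf[DataInputBuffer]
)

val filtered = raw.filter { case (rawKey, _) =>
  val key = new MyKeyClassWritable
  key.readFields(rawKey)                  // deserialize only the key
  candidateSet.value.contains(key.getId)  // the value stays as raw bytes
}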

Create composite type in flink table

I am trying to write a user-defined scalar function in Flink which takes in multiple expressions (an arbitrary number of them) and combines them into a single expression.
Coming from the Spark world, I could achieve this by using struct, which returns a Row type, and passing it to a UDF, like
val structCol = org.apache.spark.sql.functions.struct(cols: _*)
vecUdf(structCol)
I am not able to find an equivalent in Flink. I am also trying to see if I can write a ScalarFunction that takes in an arbitrary number of expressions, but I can't find any examples.
Can anyone help guide me to either of the above two approaches? Thanks!
Note, I can't make it an Array since each expression can be of a different type (actually, the same value type, but it could be arrays or scalars).
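For reference, the Spark pattern described above looks roughly like this; the DataFrame, column names, and UDF body are illustrative, not part of the question:
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{col, struct, udf}

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

// Columns of different types (a scalar, an array, a string) packed into one struct column.
val df = Seq((1.0, Array(2.0, 3.0), "a"), (4.0, Array(5.0), "b")).toDF("x", "xs", "label")

// The UDF receives the struct as a Row and pulls fields out by position.
val vecUdf = udf { row: Row =>
  row.getDouble(0) + row.getSeq[Double](1).sum
}

df.withColumn("combined", vecUdf(struct(col("x"), col("xs"), col("label")))).show()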

When should you use Array and when should you use ArrayBuffer in Scala?

I understand the basic concepts behind both, but I am wondering which is best to use for what type of structure for optimal complexity.
Is the general consensus that you simply use Array when you know the length of the structure you're creating and then ArrayBuffer when the length is unknown?
However, what confuses me is that it is stated Array is more efficient for built-in Scala types such as Int, String and so on. Does this also apply to the use case where you simply initialize an Array[Int] and then add values to it with :+=, or only when it has a fixed length?
I assume that in cases where the length is unknown and you are working with datatypes other than the built-in ones, the best solution would be ArrayBuffer, but I'm unsure about the case where built-in types are used.
And what if you have a matrix, let's say Array[Array[Int]]? Is that optimal if you will be adding rows to it with :+=? Or should it be ArrayBuffer[Array[Int]] in that case?
The difference between Array and ArrayBuffer boils down to the amortized cost of resizing the array storage.
For exact details you can read this post:
https://stackoverflow.com/a/31213983/1119997
If you can determine the storage requirements of your data ahead of time, then Array will be better than ArrayBuffer, since you don't need the book-keeping that ArrayBuffer does.
Generally Array is preferred when you need a fixed size collection and ArrayBuffer is much better when you need to add or remove elements from the end.
However, what confuses me is that it is stated Array is more efficient for built-in Scala types such as Int, String and so on
Array is much better when dealing with AnyVal types, so Int, Long, Float, Double, Boolean, Byte, Short, Char. This doesn't apply to String or any other AnyRef type. This is because, normally, when primitives are used as generic type parameters they need to be boxed into object wrappers, but with Array they don't.
This is more obvious in Java, where boxed and unboxed primitives have different types, but boxing happens in Scala too, and it can make a big difference.
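To make the trade-off concrete, a small sketch (sizes and values are arbitrary):
import scala.collection.mutable.ArrayBuffer

// Known size up front: Array[Int] is backed by a JVM int[], so elements are unboxed.
val fixed = new Array[Int](1000)
fixed(0) = 42

// Unknown size: ArrayBuffer appends in amortized constant time; convert once at the end.
val buf = ArrayBuffer.empty[Int]
for (i <- 0 until 1000) buf += i
val asArray = buf.toArray

// Growing an immutable Array with :+= copies the whole array on every append (O(n) each time),
// so repeated use in a loop is quadratic overall.
var grown = Array.empty[Int]
grown :+= 1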

Spark Dataset aggregation similar to RDD aggregate(zero)(accum, combiner)

RDD has a very useful method, aggregate, that allows you to accumulate with some zero value and combine that across partitions. Is there any way to do that with Dataset[T]? As far as I can see in the Scaladoc, there is nothing capable of doing that. Even the reduce method only handles binary operations with T as both arguments. Any reason why? And is there anything capable of doing the same?
Thanks a lot!
VK
There are two different classes which can be used to achieve aggregate-like behavior in the Dataset API:
UserDefinedAggregateFunction, which uses SQL types and takes Columns as input.
The initial value is defined with the initialize method, seqOp corresponds to the update method, and combOp to the merge method.
Example implementation: How to define a custom aggregation function to sum a column of Vectors?
Aggregator, which uses standard Scala types with Encoders and takes records as input.
The initial value is defined with the zero method, seqOp corresponds to the reduce method, and combOp to the merge method.
Example implementation: How to find mean of grouped Vector columns in Spark SQL?
Both provide an additional finalization method (evaluate and finish, respectively), which is used to generate the final result and can be used for both global and by-key aggregations.
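As an illustration of the second approach, here is a minimal Aggregator that mimics RDD.aggregate(0L)(_ + _, _ + _); the object name and usage are made up for the example:
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

object SumAgg extends Aggregator[Long, Long, Long] {
  def zero: Long = 0L                              // the "zero" value
  def reduce(acc: Long, x: Long): Long = acc + x   // seqOp within a partition
  def merge(a: Long, b: Long): Long = a + b        // combOp across partitions
  def finish(acc: Long): Long = acc                // finalization
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

// usage: ds.select(SumAgg.toColumn) where ds: Dataset[Long]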

Scala how to update values in immutable list

I have an immutable list and need a new copy of it with elements replaced at multiple index locations. List.updated is an O(n) operation and can only replace one element at a time. What is an efficient way of doing this? Thanks!
List is not a good fit if you need random element access/update. From the documentation:
This class is optimal for last-in-first-out (LIFO), stack-like access patterns. If you need another access pattern, for example, random access or FIFO, consider using a collection more suited to this than List.
More generally, what you need is an indexed sequence instead of a linear one (such as List). From the documentation of IndexedSeq:
Indexed sequences support constant-time or near constant-time element access and length computation. They are defined in terms of abstract methods apply for indexing and length.
Indexed sequences do not add any new methods to Seq, but promise efficient implementations of random access patterns.
The default concrete implementation of IndexedSeq is Vector, so you may consider using it.
Here's an extract from its documentation (emphasis added):
Vector is a general-purpose, immutable data structure. It provides random access and updates in effectively constant time, as well as very fast append and prepend. Because vectors strike a good balance between fast random selections and fast random functional updates, they are currently the default implementation of immutable indexed sequences.
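For illustration, a small sketch applying several replacements to a Vector (the values and the replacements map are made up):
val original     = Vector(10, 20, 30, 40, 50)
val replacements = Map(1 -> 99, 3 -> 77)   // index -> new value

// Each updated call on a Vector is effectively constant time.
val result = replacements.foldLeft(original) { case (vec, (i, v)) => vec.updated(i, v) }
// result: Vector(10, 99, 30, 77, 50)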
If you do need to stay with a List, you can still rebuild it in a single pass instead of calling updated repeatedly:
// zipWithIndex yields (element, index) pairs
list
  .iterator
  .zipWithIndex
  .map { case (_, index) => newElementFor(index) }
  .toList