I'm working with Datasets and trying to group by and then use map.
I manage to do it with RDDs, but with a Dataset, after a groupBy I don't have the option to use map.
Is there a way I can do it?
You can apply groupByKey:
def groupByKey[K](func: (T) ⇒ K)(implicit arg0: Encoder[K]): KeyValueGroupedDataset[K, T]
(Scala-specific) Returns a KeyValueGroupedDataset where the data is grouped by the given key func.
which returns KeyValueGroupedDataset and then mapGroups:
def mapGroups[U](f: (K, Iterator[V]) ⇒ U)(implicit arg0: Encoder[U]): Dataset[U]
(Scala-specific) Applies the given function to each group of data. For each unique group, the function will be passed the group key and an iterator that contains all of the elements in the group. The function can return an element of arbitrary type which will be returned as a new Dataset.
This function does not support partial aggregation, and as a result requires shuffling all the data in the Dataset. If an application intends to perform an aggregation over each key, it is best to use the reduce function or an org.apache.spark.sql.expressions#Aggregator.
Internally, the implementation will spill to disk if any given group is too large to fit into memory. However, users must take care to avoid materializing the whole iterator for a group (for example, by calling toList) unless they are sure that this is possible given the memory constraints of their cluster.
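For illustration, here is a minimal sketch of the groupByKey/mapGroups combination. The Purchase case class, the sample data, and the local SparkSession are assumptions made up for the example:
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record type; substitute your own schema.
case class Purchase(user: String, amount: Double)

val spark = SparkSession.builder().master("local[*]").appName("groupByKeyExample").getOrCreate()
import spark.implicits._

val purchases: Dataset[Purchase] = Seq(
  Purchase("alice", 10.0),
  Purchase("bob", 5.0),
  Purchase("alice", 2.5)
).toDS()

// Group by user, then map each group to a (user, total) pair.
val totals: Dataset[(String, Double)] = purchases
  .groupByKey(_.user)
  .mapGroups((user, rows) => (user, rows.map(_.amount).sum))
Each group's rows arrive as an Iterator, so the total can be computed without materializing the whole group.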
Related
I have a Spark Dataset, and I would like to group the data and process the groups, yielding zero or one element per group. Something like:
val resulDataset = inputDataset
.groupBy('x, 'y)
.flatMap(...)
I didn't find a way to apply a function after a groupBy, but it appears I can use groupByKey instead (is it a good idea? is there a better way?):
val resulDataset = inputDataset
.groupByKey(v => (v.x, v.y))
.flatMap(...)
This works, but here is the thing: I would like to process the groups as Datasets. The reason is that I already have convenient functions to use on Datasets and would like to reuse them when calculating the result for each group. But groupByKey.flatMap yields an Iterator over the grouped elements, not a Dataset.
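For concreteness, here is a minimal sketch of the Iterator-based approach that does work today, using flatMapGroups on the KeyValueGroupedDataset. The Record and Result case classes are hypothetical stand-ins for the real schema, and an existing SparkSession named spark is assumed:
import org.apache.spark.sql.Dataset

// Hypothetical types; substitute the real schema of inputDataset.
case class Record(x: Int, y: Int, value: Double)
case class Result(x: Int, y: Int, best: Double)

import spark.implicits._

def process(inputDataset: Dataset[Record]): Dataset[Result] =
  inputDataset
    .groupByKey(r => (r.x, r.y))
    .flatMapGroups { case ((x, y), rows) =>
      // Work on the plain Iterator and emit zero or one element per group.
      val values = rows.map(_.value).toSeq
      if (values.isEmpty) Iterator.empty else Iterator(Result(x, y, values.max))
    }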
The question: is there a way in Spark to group an input Dataset and map a custom function to each group, while treating the grouped elements as a Dataset? E.g.:
val inputDataset: Dataset[T] = ...
val resulDataset: Dataset[U] = inputDataset
.groupBy(...)
.flatMap(group: Dataset[T] => {
// using Dataset API to calculate resulting value, e.g.:
group.withColumn(row_number().over(...))....as[U]
})
Note that the grouped data is bounded, and it is OK to process it on a single node. But the number of groups can be very high, so the resulting Dataset needs to be distributed. The point of using the Dataset API to process a group is purely a matter of having a convenient API.
What I tried so far:
creating a Dataset from an Iterator in the mapped function - it fails with an NPE from a SparkSession (my understanding is that it boils down to the fact that one cannot create a Dataset within the functions which process a Dataset; see this and this)
to overcome the issue in the first attempt, I tried to create a new SparkSession in which to create the Dataset; this fails with an NPE from SparkSession.newSession
(ab)using repartition('x, 'y).mapPartitions(...), but this also yields an Iterator[T] for each partition, not a Dataset[T]
finally, (ab)using filter: I can collect all distinct values of the grouping criteria into an Array (select.distinct.collect) and iterate over this array to filter the source Dataset, yielding one Dataset per group (similar in spirit to the multiplexing idea from this article); although this works, my understanding is that it collects all the data on a single node, so it doesn't scale and will eventually run into memory issues
Suppose I have an RDD with the following type:
RDD[(Long, List[Integer])]
Can I assume that the entire list is located on the same worker? I want to know whether certain operations are acceptable at the RDD level or whether they should be calculated at the driver. For instance:
val data: RDD[(Long, List[Integer])] = someFunction() // creates a list for each timeslot
Please note that the List may be the result of an aggregation or any other operation, and is not necessarily created as one piece.
val diffFromMax = data.map(item => (item._1, findDiffFromMax(item._2)))
def findDiffFromMax(data: List[Integer]): List[Integer] = {
val maxItem = data.max
data.map(item => (maxItem - item))
}
The thing is that if the List is distributed, calculating maxItem may cause a lot of network traffic. This could be handled with an RDD of the following type:
RDD[(Long, Integer /* max item */, List[Integer])]
where the max item is calculated at the driver.
So the questions (there are actually two) are:
At what point can I assume that the data of an RDD record is located on one worker, if ever? (Answers with references to the docs, or your own evaluation, would be great.) What happens in the case of a tuple inside a tuple: ((Long, Integer), Double)?
What is the common practice when designing algorithms with tuples? Should I always treat the data as if it may appear on different workers? Should I always break it down to the minimal granularity in the first tuple field? For a case where there is data (Double) for a user (String) in a timeslot (Long), should the data be (Long, (String, Double)), ((Long, String), Double), or maybe (String, (Long, Double))? Or is none of these optimal and matrices are better?
The short answer is yes: your list will be located on a single worker.
Your tuple is a single record in the RDD. A single record is ALWAYS on a single partition (which would be on a single worker).
When you do your findDiffFromMax you are running it on the target worker (so the function is serialized to all the workers to run).
The thing you should note is that when you generate a tuple of (k, v), this generally means a key/value pair, so you can perform key-based operations on the RDD. The order ((Long, (String, Double)) vs. ((Long, String), Double), or any other arrangement) doesn't really matter, as it is all a single record. The only thing that matters is which part is the key, since that determines the key-based operations; so the question comes down to the logic of your calculation.
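To make that concrete, here is a small sketch of how the choice of key affects key-based operations. The data is made up, and sc is assumed to be an existing SparkContext (e.g. spark.sparkContext):
// Each record is (timeslot, (user, value)).
val records = sc.parallelize(Seq(
  (1L, ("alice", 2.0)),
  (1L, ("bob", 3.0)),
  (2L, ("alice", 5.0))
))

// Keyed by timeslot (Long): reduceByKey shuffles and combines per timeslot.
val maxPerTimeslot = records
  .mapValues { case (_, value) => value }
  .reduceByKey((a, b) => math.max(a, b))

// Keyed by (timeslot, user): the composite pair acts as a single key.
val sumPerTimeslotAndUser = records
  .map { case (timeslot, (user, value)) => ((timeslot, user), value) }
  .reduceByKey(_ + _)
Either layout keeps each element as a single record; the nesting only decides which part reduceByKey treats as the key.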
I'm trying to extend my RDD table by one column (with string values) using the answers to this question, but I cannot add a column name this way... I'm using Scala.
Is there any easy way to add a column to RDD?
Apache Spark takes a functional approach to processing data. Fundamentally, an RDD[T] is a sort of distributed collection of objects of type T (RDD stands for Resilient Distributed Dataset).
Following the functional approach, you process the objects inside the RDD using transformations. A transformation constructs a new RDD from an existing one.
One example of a transformation is the map method. Using map, you can transform each object of your RDD into any other type of object you need. So, if you have a data structure that represents a row, you can transform that structure into a new one with an added field, i.e. a new column.
For example, take the following piece of code.
import org.apache.spark.rdd.RDD

val rdd: RDD[(String, String)] = sc.parallelize(List(("Hello", "World"), ("Such", "Wow")))

// This new RDD will have one more "column",
// which is the concatenation of the previous two
val rddWithOneMoreColumn =
  rdd.map {
    case (a, b) =>
      (a, b, a + b)
  }
In this example an RDD of Tuple2 (i.e. a pair) is transformed into an RDD of Tuple3, simply by applying a function to each RDD element.
Clearly, you have to apply an action to the object rddWithOneMoreColumn to make the computation happen. In fact, Apache Spark computes the result of all of your transformations lazily.
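For example, a collect action triggers the evaluation (a minimal usage sketch; collect brings all results to the driver, which is fine for this tiny example):
rddWithOneMoreColumn.collect().foreach(println)
// (Hello,World,HelloWorld)
// (Such,Wow,SuchWow)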
In Spark's RDDs and DStreams we have the reduce function for collapsing an entire RDD into one element. However, the reduce function takes a (T, T) => T.
If we want to fold a List in Scala, on the other hand, we can use foldLeft or foldRight, which take (B)((B, A) => B). This is very useful because you start folding with a type other than what is in your list.
Is there a way in Spark to do something similar, where I can start with a value whose type is different from that of the elements in the RDD itself?
Use aggregate instead of reduce. It allows you also to specify a "zero" value of type B and a function like the one you want: (B,A) => B. Do note that you also need to merge separate aggregations done on separate executors, so a (B, B) => B function is also required.
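As a minimal sketch (the word data and the character-count aggregation are made up for illustration; sc is assumed to be an existing SparkContext):
val words = sc.parallelize(Seq("spark", "fold", "aggregate"))

// Start from a zero value of a different type (Int) than the elements (String):
// count the total number of characters across all words.
val totalChars: Int = words.aggregate(0)(
  (acc, word) => acc + word.length, // seqOp: fold each element into the accumulator
  (a, b) => a + b                   // combOp: merge partial results from executors
)
// totalChars == 18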
Alternatively, if you want this aggregation as a side effect, an option is to use an accumulator. In particular, the accumulable type allows for the result type to be of a different type than the accumulating type.
Also, if you ever need to do the same with a key-value RDD, use aggregateByKey.
If I am returning a Seq[T] from a function, when there may be a chance that it is empty, is it still a Seq or will it error?
In other words, do I need to wrap it in an Option or is that overkill?
It's generally overkill although it may convey some information depending on the context. Suppose you have a huge database of people, where some data could be missing. You could write queries like:
def getChildren( p: Person ): Seq[Person]
But if it returns an empty sequence, you cannot tell whether the data is missing or whether the data is available but there are no children. Contrast this with the definition:
def getChildren( p: Person ): Option[Seq[Person]]
You will obtain None when the data is missing, and Some(s), where s is an empty sequence, when the data is available but there are no children.
Seq is like a monoid: it has a zero element, the empty sequence.
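A minimal sketch of the distinction (the Person type and the childrenKnown flag are hypothetical, purely to illustrate the two return types):
// Hypothetical model: childrenKnown marks whether the data is available at all.
case class Person(name: String, childrenKnown: Boolean, children: Seq[Person] = Seq.empty)

// Plain Seq: an empty result is ambiguous (missing data vs. genuinely no children).
def getChildren(p: Person): Seq[Person] =
  if (p.childrenKnown) p.children else Seq.empty

// Option[Seq]: None means the data is missing; Some(Seq.empty) means no children.
def getChildrenOpt(p: Person): Option[Seq[Person]] =
  if (p.childrenKnown) Some(p.children) else None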