In Spark's RDDs and DStreams we have the 'reduce' function for transforming an entire RDD into one element. However, the reduce function takes (T, T) => T.
If we want to reduce a List in Scala, we can use foldLeft or foldRight, which takes (B)((B, A) => B). This is very useful because you can start folding with a type other than what is in your list.
Is there a way in Spark to do something similar, where I can start with a value whose type is different from the elements in the RDD itself?
Use aggregate instead of reduce. It also allows you to specify a "zero" value of type B and a function like the one you want: (B, A) => B. Note that you also need to merge the separate aggregations done on separate executors, so a (B, B) => B function is also required.
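For example, here is a minimal sketch (the RDD contents are made up for illustration) that folds an RDD[String] into a Long total of all string lengths:

val rdd = sc.parallelize(Seq("a", "bb", "ccc"))
val totalLength: Long = rdd.aggregate(0L)(
  (acc, s) => acc + s.length,  // (B, A) => B, applied within each partition
  (a, b) => a + b              // (B, B) => B, merges per-partition results
)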
Alternatively, if you want this aggregation as a side effect, an option is to use an accumulator. In particular, the Accumulable type allows the result type to be different from the type of the elements being accumulated.
Also, if you ever need to do the same with a key-value RDD, use aggregateByKey.
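A minimal sketch of aggregateByKey, again with made-up data, where the per-key result type (Set[Int]) differs from the value type (Int):

val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
val distinctValues = pairs.aggregateByKey(Set.empty[Int])(
  (set, v) => set + v,  // fold one value into the per-key set
  (s1, s2) => s1 ++ s2  // merge sets built on different executors
)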
Related
I'm working with Datasets and trying to group by and then use map.
I am managing to do it with RDDs, but with Datasets, after the group by I don't have the option to use map.
Is there a way I can do it?
You can apply groupByKey:
def groupByKey[K](func: (T) ⇒ K)(implicit arg0: Encoder[K]): KeyValueGroupedDataset[K, T]
(Scala-specific) Returns a KeyValueGroupedDataset where the data is grouped by the given key func.
which returns KeyValueGroupedDataset and then mapGroups:
def mapGroups[U](f: (K, Iterator[V]) ⇒ U)(implicit arg0: Encoder[U]): Dataset[U]
(Scala-specific) Applies the given function to each group of data. For each unique group, the function will be passed the group key and an iterator that contains all of the elements in the group. The function can return an element of arbitrary type which will be returned as a new Dataset.
This function does not support partial aggregation, and as a result requires shuffling all the data in the Dataset. If an application intends to perform an aggregation over each key, it is best to use the reduce function or an org.apache.spark.sql.expressions#Aggregator.
Internally, the implementation will spill to disk if any given group is too large to fit into memory. However, users must take care to avoid materializing the whole iterator for a group (for example, by calling toList) unless they are sure that this is possible given the memory constraints of their cluster.
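A minimal sketch putting the two together, assuming a SparkSession named spark and a Dataset of (word, count) tuples (the data is made up for illustration):

import spark.implicits._

val ds = Seq(("a", 1), ("a", 2), ("b", 3)).toDS()

val perKey = ds
  .groupByKey(_._1)                                     // KeyValueGroupedDataset[String, (String, Int)]
  .mapGroups((key, rows) => (key, rows.map(_._2).sum))  // Dataset[(String, Int)]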
I have a query that takes Seq[Int] as its argument (and performs filtering like WHERE x IN (...)), and I need to compile it since this query is fairly complex. However, when I try the naive approach:
Compiled((xs: Set[Int]) => someQuery.filter(_.x inSet xs))
It fails with the message:
Computation of type Set[Int] => Query[SomeTable, SomeValue, Seq] cannot be compiled (as type C)
Can Slick compile queries that take a set of integers as a parameter?
UPDATE: I use PostgreSQL as the database, so it might be possible to use arrays instead of the IN clause, but how?
As for the PostgreSQL database, the solution is much simpler than I expected.
First of all, you need a special Slick driver for PostgreSQL that supports arrays. It is usually already included in projects that rely on PostgreSQL-specific features, so there is no trouble at all. I use this driver.
The main idea is to replace the plain SQL IN (...) clause, which takes as many bind parameters as there are items in the list and thus cannot be statically compiled by Slick, with the PostgreSQL-specific array operator x = ANY(arr), which takes only one parameter for the array. It's easy to do with code like this:
val compiledQuery = Compiled((x: Rep[List[Int]]) => query.filter(_.id === x.any))
This code will generate a query like WHERE x = ANY(?), which uses only one bind parameter, so Slick will accept it for compilation.
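For completeness, a hedged usage sketch, assuming a standard Slick db object and that the any extension comes from the PostgreSQL array driver mentioned above:

val action = compiledQuery(List(1, 2, 3)).result  // SELECT ... WHERE id = ANY(?)
val rows = db.run(action)                         // Future[Seq[...]]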
I am new to Spark (using version 1.1) and Scala. I am converting my existing Hadoop MapReduce code to Spark using Scala and am a bit lost.
I want my mapped RDD to be grouped by key. When I read online, it's suggested that we should avoid groupByKey and use reduceByKey instead, but when I apply reduceByKey I am not getting a list of values for a given key as expected by my code. Example:
val rdd = sc.parallelize(List(("k1", "v11"), ("k1", "v21"), ("k2", "v21"), ("k2", "v22"), ("k3", "v31") ))
My "values" for actual task are huge, having 300 plus columns in key-values pair
And when I will do group by on common key it will result in shuffle which i want to avoid.
I want something like this as o/p (key, List OR Array of values) from my mapped RDD =>
rdd.groupByKey()
which gives me the following output:
(k3,ArrayBuffer(v31))
(k2,ArrayBuffer(v21, v22))
(k1,ArrayBuffer(v11, v21))
But when I use
rdd.reduceByKey((x,y) => x+y)
I get the values concatenated together, like the following. If a pipe ('|') or some other separator character were there, e.g. (k2,v21|v22), my problem would be partly solved, but having a list would still be better coding practice.
(k3,v31)
(k2,v21v22)
(k1,v11v21)
Please help
If you refer to the Spark documentation (http://spark.apache.org/docs/latest/programming-guide.html), for groupByKey it says:
“When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.”
The word Iterable is very important here: when you get the value as (v21, v22), it is iterable.
Further it says
“Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance.”
So what I understand from this is: if you want the returned RDD to have iterable values, use groupByKey, and if you want a single combined value such as a SUM, use reduceByKey.
Now, in your tuples, instead of (String, String) => (K1, V1), if you had (String, ListBuffer[String]) => (K1, ListBuffer("V1")), then maybe you could have done rdd.reduceByKey((x, y) => x ++= y).
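As a hedged sketch of how to get that grouped shape without wrapping every input record in its own ListBuffer, aggregateByKey can build the buffer per key (the sample data mirrors the question):

import scala.collection.mutable.ListBuffer

val rdd = sc.parallelize(List(("k1", "v11"), ("k1", "v21"), ("k2", "v21"), ("k2", "v22"), ("k3", "v31")))

val grouped = rdd.aggregateByKey(ListBuffer.empty[String])(
  (buf, v) => buf += v,  // add a value to the partition-local buffer
  (b1, b2) => b1 ++= b2  // merge buffers from different partitions
)
grouped.collect().foreach(println)  // e.g. (k1,ListBuffer(v11, v21))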
I have a collection containing three DataFrames of the same type (same parquet schema). They only differ in the content/values they contain:
I want to flatten the structure, so that the three DataFrames get merged into one single Parquet DataFrame containing all of the content/values.
I tried it with flatten and flatMap, but with that I always get the error:
Error: No implicit view available from org.apache.spark.sql.DataFrame => Traversable[U]. parquetsFiles.flatten
Error: not enough arguments for method flatten: (implicit asTrav: org.apache.spark.sql.DataFrame => Traversable[U], implicit m: scala.reflect.ClassTag[U]). Unspecified value parameters asTrav, m. parquetFiles.flatten
I also converted it to a List and then tried to flatten it, which produces the same error.
Do you have any idea how to solve this, or what the problem is here?
Thanks, Alex
The Scala compiler is looking for a way to convert the DataFrames to a Traversable so it can apply flatten. But a DataFrame is not Traversable, so it fails. Also, no ClassTag is available, because DataFrames are not statically typed.
The code you're looking for is
parquetFiles.reduce(_ unionAll _)
which can be optimized by the DataFrame execution engine.
So it seems like you want to combine these three DataFrames together; to do that, the unionAll function would work really well. You could do parquetFiles.reduce((x, y) => x.unionAll(y)) (note this will explode on an empty list, but if you might have that, just use one of the folds instead of reduce, as sketched below).
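A hedged sketch of that fold variant, assuming Spark 1.x (unionAll) and that the shared schema is available as schema so an empty starting DataFrame can be built:

import org.apache.spark.sql.Row

val emptyDF = sqlContext.createDataFrame(sc.emptyRDD[Row], schema)  // schema is assumed to be known
val merged = parquetFiles.foldLeft(emptyDF)(_ unionAll _)           // safe even if parquetFiles is empty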
I have a problem running GraphX
val adjGraph = adjGraph_CC.vertices
  .flatMap { case (id, (compID, adjSet)) => mapMsgGen(id, compID, adjSet) }
  // mapMsgGen will generate a list of msgs; each msg has the form K -> V
  .reduceByKey((fst, snd) => mapMsgMerg(fst, snd)).collect
  // mapMsgMerg will merge each two msgs passed to it
What I was expecting reduceByKey to do is group the whole output of flatMap by the key (K) and process the list of values (Vs) for each key (K) using the function provided.
What is happening instead is that each output of flatMap (using the function mapMsgGen), which is a list of K -> V pairs (usually not all with the same K), is processed immediately using the reduceByKey function mapMsgMerg, before the whole flatMap finishes.
I need some clarification please.
I don't understand what is going wrong, or is it that I understand flatMap and reduceByKey wrong?
Regards,
Maher
There's no need to produce the entire output of flatMap before starting reduceByKey. In fact, if you're not using the intermediate output of flatMap, it's better not to produce it and possibly save some memory.
If your flatMap outputs a list that contains 'k' -> v1 and 'k' -> v2, there's no reason to wait until the entire list has been produced to pass v1 and v2 to mapMsgMerg. As soon as those two tuples are output, v1 and v2 can be combined as mapMsgMerg(v1, v2), and v1 and v2 discarded if the intermediate list isn't used.
I don't know the details of the Spark scheduler well enough to say whether this is guaranteed behavior, but it does seem like an instance of what the original paper calls 'pipelining' of operations.
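A hedged illustration of the same idea on plain RDDs (the data is made up): reduceByKey starts combining values map-side as records are emitted, so the flatMap output is never fully materialized as a list.

val lines = sc.parallelize(Seq("a b a", "b c"))
val counts = lines
  .flatMap(_.split(" ").map(w => (w, 1)))  // pairs are emitted one at a time
  .reduceByKey(_ + _)                      // partial sums begin before flatMap "finishes"
counts.collect().foreach(println)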