How can you apply a filter to a RelationalGroupedDataset from org.apache.spark.sql using Scala?

I was looking for a filter function: one that takes a list-like object and a predicate (a function from the element type to Boolean) and returns only those elements for which the predicate is true.
When I try to apply filter, I get an error. Are there any ways to apply filter to a RelationalGroupedDataset? (I wasn't able to find any in the attached docs: https://spark.apache.org/docs/2.4.4/api/java/org/apache/spark/sql/RelationalGroupedDataset.html)
Also, is there a proper notation for accessing a specific column value on a RelationalGroupedDataset?
Thanks!

Here is an example:
import org.apache.spark.sql.functions.{avg, col, max, sum}

df.groupBy("department")
  .agg(
    sum("salary").as("sum_salary"),
    avg("salary").as("avg_salary"),
    sum("bonus").as("sum_bonus"),
    max("bonus").as("max_bonus"))
  .where(col("sum_bonus") >= 50000)
  .show(false)
It should give you guidance.

Try adding :_* to the columns passed into groupBy:
import org.apache.spark.sql.{Column, DataFrame}
import spark.implicits._  // for the $"count" syntax; assumes a SparkSession named spark is in scope

def showGroupByDesc(df: DataFrame, cols: Column*): Unit = {
  df.groupBy(cols: _*).count().sort($"count".desc).show()
}
It's the special syntax for passing a sequence of arguments to a varargs parameter in Scala.
Without :_*, the compiler looks for an overload of groupBy that accepts a single Seq[Column] argument, and it won't find one.
You can read more about functions with varargs here, for example.
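For illustration, here is a minimal standalone sketch of the varargs mechanics; sumAll is a made-up function, not part of Spark:
def sumAll(xs: Int*): Int = xs.sum   // varargs parameter

val nums = Seq(1, 2, 3)
sumAll(1, 2, 3)     // fine: individual arguments
// sumAll(nums)     // does not compile: Seq[Int] is not Int
sumAll(nums: _*)    // fine: :_* expands the Seq into the varargs slots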

Related

How to refer Spark RDD element multiple times using underscore notation?

For example, I need to convert an RDD[String] to an RDD[(String, Int)]. I can create an anonymous function with a named parameter, but I would like to do this using the underscore notation. How can I achieve this?
Sample code below:
val x = List("apple", "banana")
val rdd1 = sc.parallelize(x)
// Working
val rdd2 = rdd1.map(x => (x, x.length))
// Not working
val rdd3 = rdd1.map((_, _.length))
Why does the last line above not work?
The underscore in the placeholder syntax is a marker for a single input parameter. It is convenient for simple functions, but can get tricky to get right with two or more parameters.
You can find the definitive answer in the Scala language specification's Placeholder Syntax for Anonymous Functions:
An expression (of syntactic category Expr) may contain embedded underscore symbols _ at places where identifiers are legal. Such an expression represents an anonymous function where subsequent occurrences of underscores denote successive parameters.
Note that one underscore references one input parameter, two underscores are for two different input parameters and so on.
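For example (my own illustration, not from the question):
// Each underscore stands for a different, successive parameter:
val add = (_: Int) + (_: Int)   // same as (a: Int, b: Int) => a + b
List(1, 2, 3).reduce(_ + _)     // reduce's (a, b) => a + b, result 6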
With that said, you cannot use the placeholder twice and expect that they'll reference the same input parameter. That's not how it works in Scala and hence the compiler error.
// Not working
val rdd3 = rdd1.map((_, _.length))
The above is equivalent to the following:
// Not working
val rdd3 = rdd1.map { (a: String, b: String) => (a, b.length) }
which is clearly incorrect as map expects a function of one input parameter.
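If you need to refer to the element twice, you have to name the parameter, or use an API that keeps the element around for you. A couple of sketches, with keyBy/mapValues shown as one reasonable alternative:
// Name the parameter so it can be referenced twice:
val rdd3 = rdd1.map(s => (s, s.length))

// Or keep the element as the key via keyBy, then derive the value:
val rdd4 = rdd1.keyBy(identity).mapValues(_.length)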

Scala: need a for loop where the iterations return a growing list

I have a function, pairUp, that takes a value and returns a list of pairs, and a key set, flightPass.keys.
I want to write a for loop that runs pairUp for each value of flightPass.keys, and returns a big list of all these returned values.
val result:List[(Int, Int)] = pairUp(flightPass.keys.toSeq(0)).toList
for (flight<- flightPass.keys.toSeq.drop(1))
{val result:List[(Int, Int)] = result ++ pairUp(flight).toList}
I've tried a few different variations on this, always getting the error:
<console>:23: error: forward reference extends over definition of value result
for (flight<- flightPass.keys.toSeq.drop(1)) {val result:List[(Int, Int)] = result ++ pairUp(flight).toList}
^
I feel like this should work in other languages, so what am I doing wrong here?
First off, you've defined result as a val, which means it is immutable and can't be modified.
So if you want to apply "pairUp for each value of flightPass.keys", why not map()?
val result = flightPass.keys.map(pairUp) //add .toList if needed
The Scala method that maps each value of a List to a List and then flattens the results into a single List is called flatMap, which is short for "map then flatten". You would use it like this:
flightPass.keys.toSeq.flatMap(k => pairUp(k))
This will take each 'key' from flightPass.keys and pass it to pairUp (the mapping part), then take the resulting Lists from each call to pairUp and 'flatten' them, resulting in a single joined list.
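For example, a minimal sketch with a stand-in pairUp (the real pairUp and flightPass come from your code, so the shapes here are assumptions):
// Stand-in for the real pairUp: maps a key to some pairs
def pairUp(k: Int): List[(Int, Int)] = List((k, k + 1), (k, k + 2))

val keys = Seq(1, 2)
val result: Seq[(Int, Int)] = keys.flatMap(pairUp)
// Seq((1,2), (1,3), (2,3), (2,4)) -- all pairs joined into one collection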

Scala: missing arguments for method count in trait

In one line of code I'm attempting to take the first 10 lines of an RDD and count the records (which obviously should be 10). However, when I do so I get the error:
<console>:24: error: missing arguments for method count in trait
TraversableOnce;
follow this method with `_' if you want to treat it as a partially applied function
Here is the code:
logfiles.filter(line => line.contains("jpg")).take(10).count
After you take(10), you're no longer dealing with an RDD, but a Traversable (Scala collection type). You want to use size instead of count, since count takes a predicate to filter by:
val count = logfiles.filter(line => line.contains("jpg")).take(10).size
As you've stated, this will trivially always return 10 items as long as your RDD has at least that many items, and you most likely want to use RDD.count() instead.
val count = logfiles.filter(line => line.contains("jpg")).count()
As the RDD documentation shows, take has the signature
def take(num: Int): Array[T]
It returns an Array, not an RDD, so the parameterless RDD.count() is not available; the count that arrays inherit from the Scala collections requires a predicate, hence the error.
Also, an RDD has no native way of selecting exactly 10 elements as another RDD. If you really want to do that, you could convert the RDD to a DataFrame and use its limit function:
df.limit(10) will return a DataFrame of 10 rows,
on which you can then perform the count operation.
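A minimal sketch of that DataFrame route (assuming a SparkSession named spark, on top of the same logfiles RDD):
import spark.implicits._  // needed for .toDF on an RDD

val jpgDf = logfiles.filter(_.contains("jpg")).toDF("line")
val firstTenCount = jpgDf.limit(10).count()  // counts at most 10 rows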

Can we replace map with flatMap?

I was trying to find the line with the maximum number of words, and I wrote the following lines to run in spark-shell:
import java.lang.Math
val counts = textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
But map is one-to-one, while flatMap is one-to-zero-or-more, so I tried replacing map with flatMap in the above code. But it gives this error:
<console>:24: error: type mismatch;
found : Int
required: TraversableOnce[?]
val counts = F1.flatMap(s => s.split(" ").size).reduce((a,b)=> Math.max(a,b))
If anybody could make me understand the reason, it will really be helpful.
flatMap must return a collection (a TraversableOnce), which is clearly not what you want. You do want a map, because you want to map a line to its number of words, so you want a one-to-one function that takes a line and maps it to the number of words (though you could create a collection with one element, being the size, of course...).
flatMap is meant to map each input to a collection; for instance, if you wanted to map a line to all of its words you would do:
val words = textFile.flatMap(x => x.split(" "))
and that would return an RDD[String] containing all the words.
In the end, map transforms an RDD of size N into another RDD of size N (e.g. your lines to their length) whereas flatMap transforms an RDD of size N into an RDD of size P (actually an RDD of size N into an RDD of size N made of collections, all these collections are then flattened to produce the RDD of size P).
P.S.: one last word that has nothing to do with your problem: for a string s, it is more efficient to do
val nbWords = s.split(" ").length
than to call .size. Indeed, split returns an Array[String], and arrays do not have a size method, so calling .size goes through an implicit conversion from Array[String] to SeqLike[String], which creates new objects. Array[T] does have a length field, so calling length involves no conversion. (It's a detail, but a good habit.)
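A tiny check of the two calls (my own example):
val s = "one two three"
s.split(" ").length  // 3, reads the Array's length field directly
s.split(" ").size    // 3 as well, but goes through an implicit wrapper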
Any use of map can be replaced by flatMap, but the function argument has to be changed to return a single-element List: textFile.flatMap(line => List(line.split(" ").size)). This isn't a good idea: it just makes your code less understandable and less efficient.
After reading the part of "Tired of Null Pointer Exceptions? Consider Using Java SE 8's Optional!" about why to use flatMap() rather than map(), I realized that the real reason flatMap() cannot simply replace map() is that map() is not a special case of flatMap().
It's true that flatMap() means one-to-many, but that's not the only thing flatMap() does: put simply, it also strips the outer Stream.
See the definitions of map and flatMap:
Stream<R> map(Function<? super T, ? extends R> mapper)
Stream<R> flatMap(Function<? super T, ? extends Stream<? extends R>> mapper)
The only difference is the return type of the inner function. map() returns a Stream of whatever the inner function returned, while flatMap() returns exactly what the inner function returned.
So you could say that flatMap() strips the outer Stream away, but map() cannot. This is the main difference in my opinion, and also why map() is not just a special case of flatMap().
P.S.:
If you really want one-to-one behaviour with flatMap, you have to turn it into one-to-List(one): that is, add an outer Stream manually, which flatMap() will then strip away. After that you get the same effect as using map(). (Certainly, it's clumsy, so don't do that.)
Here are examples for Java8, but the same as Scala:
use map():
list.stream().map(line -> line.split(" ").length)
discouraged use of flatMap():
list.stream().flatMap(line -> Arrays.asList(line.split(" ").length).stream())
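For reference, a rough Scala counterpart of the two snippets above (a plain List[String] stands in for the stream):
val lines = List("one two", "three four five")

lines.map(line => line.split(" ").length)            // List(2, 3)
lines.flatMap(line => List(line.split(" ").length))  // List(2, 3), but clumsier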

Spark Group By Key to (Key,List) Pair

I am trying to group some data by key where the value would be a list:
Sample data:
A 1
A 2
B 1
B 2
Expected result:
(A,(1,2))
(B,(1,2))
I am able to do this with the following code:
data.groupByKey().mapValues(List(_))
The problem is that when I then try to do a Map operation like the following:
groupedData.map((k,v) => (k,v(0)))
It tells me I have the wrong number of parameters.
If I try:
groupedData.map(s => (s(0),s(1)))
It tells me that "(Any,List(Iterable(Any)) does not take parameters"
No clue what I am doing wrong. Is my grouping wrong? What would be a better way to do this?
Scala answers only please. Thanks!!
You're almost there. Just replace List(_) with _.toList
data.groupByKey.mapValues(_.toList)
When you write an anonymous inline function of the form
ARGS => OPERATION
the entire part before the arrow (=>) is taken as the argument list. So, in the case of
(k, v) => ...
the interpreter takes that to mean a function that takes two arguments. In your case, however, you have a single argument which happens to be a tuple (here, a Tuple2, or a pair; judging by the error, your elements are of type (Any, List[Iterable[Any]])). There are a couple of ways to get around this. First, you can write the anonymous function as a partial function that pattern-matches on the tuple (note that a case clause must be wrapped in braces, not parentheses):
groupedData.map { case (k, v) => (k, v(0)) }
Finally, you can simply go with a single specified argument, as per your last attempt, but - realising it is a tuple - reference the specific field(s) within the tuple that you need:
groupedData.map(s => (s._2(0),s._2(1))) // The key is s._1, and the value list is s._2
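Putting it together, a minimal sketch over the sample data (assuming a SparkContext named sc):
val data = sc.parallelize(Seq(("A", 1), ("A", 2), ("B", 1), ("B", 2)))

val groupedData = data.groupByKey().mapValues(_.toList)
// RDD[(String, List[Int])]: (A, List(1, 2)), (B, List(1, 2))

groupedData.map { case (k, v) => (k, v(0)) }.collect()
// Array((A,1), (B,1))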