Spark Group By Key to (Key,List) Pair - scala

I am trying to group some data by key where the value would be a list:
Sample data:
A 1
A 2
B 1
B 2
Expected result:
(A,(1,2))
(B,(1,2))
I am able to do this with the following code:
data.groupByKey().mapValues(List(_))
The problem is that when I then try to do a Map operation like the following:
groupedData.map((k,v) => (k,v(0)))
It tells me I have the wrong number of parameters.
If I try:
groupedData.map(s => (s(0),s(1)))
It tells me that "(Any,List(Iterable(Any)) does not take parameters"
No clue what I am doing wrong. Is my grouping wrong? What would be a better way to do this?
Scala answers only please. Thanks!!

You're almost there. Just replace List(_) with _.toList — List(_) wraps the entire Iterable in a one-element List, whereas _.toList converts the Iterable's elements into a List.
data.groupByKey.mapValues(_.toList)

When you write an anonymous inline function of the form
ARGS => OPERATION
the entire part before the arrow (=>) is taken as the argument list. So, in the case of
(k, v) => ...
the interpreter takes that to mean a function that takes two arguments. In your case, however, you have a single argument which happens to be a tuple (a Tuple2, or Pair; more precisely, you have a collection of (Any, List[Any]) pairs). One way around this is to write the anonymous function as a partial function that pattern-matches on the tuple; note that a case clause requires braces rather than parentheses:
groupedData.map { case (k, v) => (k, v(0)) }
Finally, you can simply go with a single specified argument, as per your last attempt, but - realising it is a tuple - reference the specific field(s) within the tuple that you need:
groupedData.map(s => (s._2(0),s._2(1))) // The key is s._1, and the value list is s._2
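Putting it together, here is a runnable sketch of the fix and the follow-up map. Since a Spark session isn't available here, plain Scala collections stand in for the RDD (this is an assumption: groupBy plus a projection of the values mimics groupByKey on a pair RDD).

```scala
// Plain-collection stand-in for the Spark RDD example.
val data = Seq(("A", 1), ("A", 2), ("B", 1), ("B", 2))

// Equivalent of data.groupByKey.mapValues(_.toList):
val groupedData: Map[String, List[Int]] =
  data.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2).toList) }

// Mapping over the (key, list) pairs with a pattern match:
val firsts = groupedData.map { case (k, v) => (k, v(0)) }
// ...and the equivalent using explicit tuple fields:
val firstsToo = groupedData.map(s => (s._1, s._2(0)))
```

Both map forms produce the same result; the pattern-match version is usually easier to read once the tuple has named parts.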

Related

Scala combination function issue

I have an input file like this:
The Works of Shakespeare, by William Shakespeare
Language: English
and I want to use flatMap with the combinations method to get the K-V pairs per line.
This is what I do:
var pairs = input.flatMap { line =>
  line.split("[\\s*$&#/\"'\\,.:;?!\\[\\(){}<>~\\-_]+")
    .filter(_.matches("[A-Za-z]+"))
    .combinations(2)
    .toSeq
    .map { case array => array(0) -> array(1) }
}
I got 17 pairs after this, but 2 of them are missing: (by,shakespeare) and (william,shakespeare). I think there might be something wrong with the last word of the first sentence, but I don't know how to solve it. Can anyone tell me?
The combinations method will not produce the same pair twice, even when the elements appear in the opposite order. So the pairs you are missing already appear in your result with the elements reversed, e.g. as (shakespeare,by).
This code will create all ordered pairs of words in the text.
for {
  line <- input
  t <- line.split("""\W+""").tails if t.length > 1
  a = t.head
  b <- t.tail
} yield a -> b
Here is the description of the tails method:
Iterates over the tails of this traversable collection. The first value will be this traversable collection and the final one will be an empty traversable collection, with the intervening values the results of successive applications of tail.
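For illustration, here is a minimal runnable version of the answer's for-comprehension on a one-line sample (a plain Seq[String] stands in for the question's input collection):

```scala
// One sample line; a Seq[String] stands in for the question's input.
val input = Seq("by William Shakespeare")

// All ordered pairs of words: for each tail of the word array,
// pair its head with every later word.
val pairs = for {
  line <- input
  t <- line.split("""\W+""").tails if t.length > 1
  a = t.head
  b <- t.tail
} yield a -> b
// pairs contains (by,William), (by,Shakespeare) and (William,Shakespeare),
// including the ordered pairs that combinations(2) collapses away.
```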

Scala. Need for loop where the iterations return a growing list

I have a function, pairUp, that takes a value and returns a list of pairs,
and a key set, flightPass.keys.
I want to write a for loop that runs pairUp for each value of flightPass.keys and returns one big list of all the returned values.
val result: List[(Int, Int)] = pairUp(flightPass.keys.toSeq(0)).toList
for (flight <- flightPass.keys.toSeq.drop(1)) {
  val result: List[(Int, Int)] = result ++ pairUp(flight).toList
}
I've tried a few different variations on this, always getting the error:
<console>:23: error: forward reference extends over definition of value result
for (flight<- flightPass.keys.toSeq.drop(1)) {val result:List[(Int, Int)] = result ++ pairUp(flight).toList}
^
I feel like this should work in other languages, so what am I doing wrong here?
First off, you've defined result as a val, which means it is immutable and can't be modified.
So if you want to apply "pairUp for each value of flightPass.keys", why not map()?
val result = flightPass.keys.map(pairUp) //add .toList if needed
The Scala method that maps each value of a List to a List and then flattens the results into a single List is called flatMap, which is short for map then flatten. You would use it like this:
flightPass.keys.toSeq.flatMap(k => pairUp(k))
This will take each 'key' from flightPass.keys and pass it to pairUp (the mapping part), then take the resulting Lists from each call to pairUp and 'flatten' them, resulting in a single joined list.
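A small runnable sketch of the same idea, with a stand-in pairUp (hypothetical — the question never shows its body) and plain Ints in place of flightPass.keys:

```scala
// Hypothetical pairUp, just for illustration: pairs a key with
// its two successors.
def pairUp(k: Int): List[(Int, Int)] = List((k, k + 1), (k, k + 2))

val keys = Seq(1, 2) // stands in for flightPass.keys

// map then flatten, in one step:
val result: List[(Int, Int)] = keys.toList.flatMap(k => pairUp(k))
// the same thing spelled out:
val same = keys.toList.map(k => pairUp(k)).flatten
```

No mutable accumulator is needed: flatMap builds the joined list in one pass.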

Scala - Use of .indexOf() and .indexWhere()

I have a tuple like the following:
(Age, List(19,17,11,3,2))
and I would like to get the position of the first element whose position in the list is greater than its value. To do this I tried to use .indexOf() and .indexWhere(), but I can't seem to find the right syntax, so I keep getting:
value indexWhere is not a member of org.apache.spark.rdd.RDD[(String,
Iterable[Int])]
My code so far is:
val test =("Age", List(19,17,11,3,2))
test.indexWhere(_.2(_)<=_.2(_).indexOf(_.2(_)) )
I also searched the documentation here with no result: http://www.scala-lang.org/api/current/index.html#scala.collection.immutable.List
If you want to perform this for each element in an RDD, you can use RDD's mapValues (which would only map the right-hand-side of the tuple) and pass a function that uses indexWhere:
rdd.mapValues(_.zipWithIndex.indexWhere { case (v, i) => i+1 > v} + 1)
Notes:
Your example seems wrong: if you wanted the last matching item, it would be 5 (the position of 2), not 4
You did not define what should be done when no item matches your condition, e.g. for List(1, 2, 3) - with this code the result would be 0, which may or may not be what you need
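Since an RDD isn't available here, this sketch applies the same indexWhere logic directly to the List inside one tuple:

```scala
val test = ("Age", List(19, 17, 11, 3, 2))

// 1-based position of the first element whose position exceeds its value:
// pair each value with its 0-based index, find the first (value, index)
// where index + 1 > value, then shift back to a 1-based position.
val pos = test._2.zipWithIndex.indexWhere { case (v, i) => i + 1 > v } + 1
// pos == 4: the element 3 sits at position 4, and 4 > 3
```

The trailing + 1 also makes the no-match case come out as 0, since indexWhere returns -1 when nothing matches.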

Spark: Dividing one array by elements in another

I am new to Apache Spark and Scala. I am trying to understand something here:
I have one array:
Companies= Array(
(Microsoft,478953),
(IBM,332042),
(JP Morgan,226003),
(Google,342033)
)
I wanted to divide this by another array, element by element:
Count = Array((Microsoft,4), (IBM,3), (JP Morgan,2), (Google,3))
I used this code :
val result: Array[(String, Double)] = wordMapCount
.zip(letterMapCount)
.map { case ((letter, wc), (_, lc)) => (letter, lc.toDouble / wc) }
From here: Divide Arrays
This works. However, I do not understand it. Why does zip require the second array and not the first one? Also, how is the case matching working here?
Why does zip require the second array and not the first one?
Because that's how zip works. It takes two separate RDD instances - the one it is called on and the one passed as an argument - and pairs their elements positionally, creating pairs of first and second elements:
def zip[U](other: RDD[U])(implicit arg0: ClassTag[U]): RDD[(T, U)]
How is the case matching working here?
You have two tuples:
(Microsoft, 478953), (Microsoft,4)
What this partial function does is decompose the tuple type via a call to Tuple2.unapply. This:
case ((letter, wc), (_, lc))
Means "extract the first argument (_1) of the first tuple into a fresh value named letter, and the second argument (_2) into a fresh value named wc". The same goes for the second tuple. It then creates a new tuple with letter as the first value and the division of lc by wc as the second value.
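The same zip and pattern match can be run on plain Scala Arrays (RDD.zip pairs elements positionally in just the same way), here on a two-element slice of the question's data:

```scala
// Plain Arrays stand in for the RDDs; zip pairs elements positionally.
val wordMapCount   = Array(("Microsoft", 478953), ("IBM", 332042))
val letterMapCount = Array(("Microsoft", 4), ("IBM", 3))

val result: Array[(String, Double)] = wordMapCount
  .zip(letterMapCount) // Array(((Microsoft,478953), (Microsoft,4)), ...)
  .map { case ((letter, wc), (_, lc)) => (letter, lc.toDouble / wc) }
```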

What do _._1 and _++_ mean in Scala (two separate operations)?

My interpretation of _._1 is:
_ = wildcard parameter
_1 = first parameter in method parameter list
But when used together with . what does it signify?
This is how it's used:
.toList.sortWith(_._1 < _._1)
For this statement:
_++_
I'm lost. Is it somehow concatenating two wildcard parameters?
This is how it's used:
.reduce(_++_)
I would be particularly interested if they above code could be made more verbose and remove any implicits, just so I can understand it better?
_._1 calls the method _1 on the wildcard parameter _, which gets the first element of a tuple. Thus, sortWith(_._1 < _._1) sorts the list of tuples by their first elements.
_++_ calls the method ++ on the first wildcard parameter with the second parameter as an argument. ++ does concatenation for sequences. Thus .reduce(_++_) concatenates a list of sequences together. Usually you can use flatten for that.
_1 is a method name. Specifically tuples have a method named _1, which returns the first element of the tuple. So _._1 < _._1 means "call the _1 method on both arguments and check whether the first is less than the second".
And yes, _++_ concatenates both arguments (assuming the first argument has a ++ method that performs concatenation).
.reduce(_++_)
is really just:
.reduce{ (acc, n) => acc ++ n }
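Both desugarings side by side, on small sample data:

```scala
val pairs = List((2, "b"), (1, "a"))

// sortWith(_._1 < _._1) desugars to an explicit two-argument function:
val sorted = pairs.sortWith((x, y) => x._1 < y._1)
// sorted == List((1, "a"), (2, "b"))

// reduce(_ ++ _) desugars the same way:
val chunks = List(Seq(1, 2), Seq(3), Seq(4, 5))
val joined = chunks.reduce((acc, n) => acc ++ n)
// joined == Seq(1, 2, 3, 4, 5)
```

Each underscore stands for one argument, in order, which is why _._1 < _._1 needs two underscores: one per comparison operand.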