How to pass a custom function to reduceByKey of an RDD in Scala

My requirement is to find the maximum of each group in an RDD.
I tried the following:
scala> val x = sc.parallelize(Array(Array("A",3), Array("B",5), Array("A",6)))
x: org.apache.spark.rdd.RDD[Array[Any]] = ParallelCollectionRDD[0] at parallelize at <console>:27
scala> x.collect
res0: Array[Array[Any]] = Array(Array(A, 3), Array(B, 5), Array(A, 6))
scala> x.filter(math.max(_,_))
<console>:30: error: wrong number of parameters; expected = 1
x.filter(math.max(_,_))
^
I also tried the following:
Option 1:
scala> x.filter((x: Int, y: Int) => { math.max(x,y)} )
<console>:30: error: type mismatch;
found : (Int, Int) => Int
required: Array[Any] => Boolean
x.filter((x: Int, y: Int) => { math.max(x,y)} )
Option 2:
scala> val myMaxFunc = (x: Int, y: Int) => { math.max(x,y)}
myMaxFunc: (Int, Int) => Int = <function2>
scala> myMaxFunc(56,12)
res10: Int = 56
scala> x.filter(myMaxFunc(_,_) )
<console>:32: error: wrong number of parameters; expected = 1
x.filter(myMaxFunc(_,_) )
How do I get this right?

I can only guess, but probably you want to do:
val rdd = sc.parallelize(Array(("A", 3), ("B", 5), ("A", 6)))
val max = rdd.reduceByKey(math.max)
println(max.collect().toList) // List((B,5), (A,6))
Instead of asking "How to get this right?", you should have explained what result you expect. I think you made a few mistakes:
- You used filter instead of reduceByKey (why?).
- reduceByKey only works on pair RDDs, so you need tuples instead of Array[Any] (which is a poor type anyway).
- You do not need to write your own wrapper function for math.max; you can pass it as-is.
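To see what reduceByKey(math.max) computes without a Spark cluster, here is a minimal sketch on a plain Scala collection. It assumes the same sample pairs as above; pairs and maxByKey are illustrative names, and groupBy followed by reduce only mimics the per-key semantics of reduceByKey, not how Spark actually executes it.

```scala
// Same sample data as the RDD example, but as a local Seq of pairs.
val pairs: Seq[(String, Int)] = Seq(("A", 3), ("B", 5), ("A", 6))

// Group by key, then combine each group's values with math.max.
// This mirrors what reduceByKey does per key (illustration only).
val maxByKey: Map[String, Int] =
  pairs.groupBy(_._1).map { case (key, group) =>
    key -> group.map(_._2).reduce((a, b) => math.max(a, b))
  }
```

The combining function has the shape (Int, Int) => Int, which is exactly what reduceByKey expects for values of type Int.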

Related

How to access a value of a Scala tuple

I have a sequence of tuples, each holding a value and its square:
val fields3: Seq[(Int, Int)] = Seq((3, 9), (5, 25))
What I want to know is whether there is a way to refer to a value of the same tuple directly when I create the object, without using a foreach:
val fields3: Seq[(Int, Int)] = Seq((3, 3 * 3 ), (5, 5 * 5))
My idea is something like:
val fields3: Seq[(Int, Int)] = Seq((3, _1 * _1 ), (5, _1 * _1)) // this doesn't compile
You can do something like this:
Seq(2,3,4).map(i => (i, i*i))
You could wrap the tuple in a case class potentially:
case class TupleInt(base: Int) {
  val tuple: (Int, Int) = (base, base*base)
}
Then you could create the sequence like this:
val fields3: Seq[(Int, Int)] = Seq(TupleInt(3), TupleInt(5)).map(_.tuple)
I would prefer the answer @geek94 gave; this one is too verbose for what you want to do.
An equally valid way to express this is:
val fields3: Seq[(Int, Int)] = Seq(3, 5).map(i => i -> i*i)
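If the same derivation is needed in several places, one option is to name it. The following is a small sketch; withSquare and squares are hypothetical names introduced only for illustration.

```scala
// Hypothetical helper: builds the (value, square) pair from a single input.
def withSquare(n: Int): (Int, Int) = (n, n * n)

// Each base value is written once; the second element is derived from it.
val squares: Seq[(Int, Int)] = Seq(3, 5).map(withSquare)
```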

How to use zip and map to create a zipWith function for Array in Scala

I tried to use zip and map to create a zipWith function like this:
def zipWithArray(f : (Int, Int) => Int)(xs : Array[Int], ys: Array[Int]) : Array[Int] = xs zip ys map f
But I got the following compile error:
type mismatch;
found : (Int, Int) => Int
required: ((Int, Int)) => ?
I know that zip is (Array[Int], Array[Int]) => Array[(Int, Int)], so f should be (Int, Int) => Int and the overall result should be Array[Int]. Could anyone explain what is going on here? Thanks a lot.
(Int, Int) => Int is a function that takes two Int arguments.
((Int, Int)) => ? is a function that takes one argument, a tuple of two Ints.
Since xs zip ys is an array of tuples, what you need is a function that takes a tuple as its argument and returns an Int.
So xs zip ys map f.tupled should work.
Reference: How to apply a function to a tuple?
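Put together, the question's function with the f.tupled fix could look like the sketch below. The signature is the one from the question; the sample arrays and the name sums are assumptions for illustration.

```scala
// f keeps its two-argument shape; .tupled adapts it to the
// single (Int, Int) argument that map receives from zip.
def zipWithArray(f: (Int, Int) => Int)(xs: Array[Int], ys: Array[Int]): Array[Int] =
  xs.zip(ys).map(f.tupled)

// Example call: element-wise addition of two arrays.
val sums: Array[Int] = zipWithArray(_ + _)(Array(1, 2, 3), Array(10, 20, 30))
```

As with zip itself, the result is as long as the shorter of the two inputs.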
It's pretty much as the error message states; change your function signature to:
def zipWithArray(f : ((Int, Int)) => Int)(xs : Array[Int], ys: Array[Int])
Without the extra parentheses, f looks like a function that takes two integers, rather than a function that takes a tuple.
Convert the function so that it accepts its arguments as a tuple; map can then call it directly.
For example:
scala> def add(a : Int, b: Int) : Int = a + b
add: (a: Int, b: Int)Int
scala> val addTuple = add _ tupled
<console>:12: warning: postfix operator tupled should be enabled
by making the implicit value scala.language.postfixOps visible.
This can be achieved by adding the import clause 'import scala.language.postfixOps'
or by setting the compiler option -language:postfixOps.
See the Scaladoc for value scala.language.postfixOps for a discussion
why the feature should be explicitly enabled.
val addTuple = add _ tupled
^
addTuple: ((Int, Int)) => Int = scala.Function2$$Lambda$224/1945604815@63f855b
scala> val array = Array((1, 2), (3, 4), (5, 6))
array: Array[(Int, Int)] = Array((1,2), (3,4), (5,6))
scala> val addArray = array.map(addTuple)
addArray: Array[Int] = Array(3, 7, 11)

map error when applied to a list of tuples in Scala

When I apply the map method to a list of tuples in Scala, it fails with the error below:
scala> val s = List((1,2), (3,4))
s: List[(Int, Int)] = List((1,2), (3,4))
scala> s.map((a,b) => a+b)
<console>:13: error: missing parameter type
Note: The expected type requires a one-argument function accepting a 2-Tuple.
Consider a pattern matching anonymous function, `{ case (a, b) => ... }`
s.map((a,b) => a+b)
^
<console>:13: error: missing parameter type
s.map((a,b) => a+b)
But if I apply a similar map call to a list of Int, it works fine:
scala> val t = List(1,2,3)
t: List[Int] = List(1, 2, 3)
scala> t.map(a => a+1)
res14: List[Int] = List(2, 3, 4)
Does anyone know why? Thanks.
Scala doesn't deconstruct tuples automatically. You'll need to either use curly braces with a pattern-matching anonymous function:
val s = List((1,2), (3,4))
val result = s.map { case (a, b) => a + b }
Or use a single parameter of tuple type:
val s = List((1,2), (3,4))
val result = s.map(x => x._1 + x._2)
Dotty (which became Scala 3) supports automatic parameter untupling, so s.map((a, b) => a + b) compiles there.
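For comparison, here are the options side by side in a small sketch that should compile on Scala 2 as well. The value names are illustrative; the third variant uses Function2's tupled method to adapt a two-argument function.

```scala
val s = List((1, 2), (3, 4))

// Pattern-matching anonymous function: destructures each tuple.
val viaCase: List[Int] = s.map { case (a, b) => a + b }

// Single tuple parameter, accessed through _1 and _2.
val viaAccessors: List[Int] = s.map(p => p._1 + p._2)

// Two-argument function adapted to take one tuple.
val viaTupled: List[Int] = s.map(((a: Int, b: Int) => a + b).tupled)
```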

Use 4 (or N) collections to yield only one value at a time (1xN) (i.e. zipped for tuple4+)

scala> val a = List(1,2)
a: List[Int] = List(1, 2)
scala> val b = List(3,4)
b: List[Int] = List(3, 4)
scala> val c = List(5,6)
c: List[Int] = List(5, 6)
scala> val d = List(7,8)
d: List[Int] = List(7, 8)
scala> (a,b,c).zipped.toList
res6: List[(Int, Int, Int)] = List((1,3,5), (2,4,6))
Now:
scala> (a,b,c,d).zipped.toList
<console>:12: error: value zipped is not a member of (List[Int], List[Int], List[Int], List[Int])
(a,b,c,d).zipped.toList
^
I've searched for this elsewhere, including this one and this one, but found no conclusive answer.
I want to do the following or similar:
for((itemA,itemB,itemC,itemD) <- (something)) yield itemA + itemB + itemC + itemD
Any suggestions?
Short answer:
for (List(w,x,y,z) <- List(a,b,c,d).transpose) yield (w,x,y,z)
// List[(Int, Int, Int, Int)] = List((1,3,5,7), (2,4,6,8))
Why you want them as tuples, I'm not sure, but a slightly more interesting case would be when your lists are of different types, and for example, you want to combine them into a list of objects:
case class Person(name: String, age: Int, height: Double, weight: Double)
val names = List("Alf", "Betty")
val ages = List(22, 33)
val heights = List(111.1, 122.2)
val weights = List(70.1, 80.2)
val persons: List[Person] = ???
Solution 1: using transpose, as above:
for {
  List(name: String, age: Int, height: Double, weight: Double) <-
    List(names, ages, heights, weights).transpose
} yield Person(name, age, height, weight)
Here, we need the type annotations in the List extractor, because transpose gives a List[List[Any]].
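Because the transpose version goes through List[List[Any]], the element types are only checked at runtime by the pattern. A possible alternative that keeps the static types is to chain zip and flatten the nested pairs. This is a sketch using the same sample data as above, not one of the numbered solutions.

```scala
case class Person(name: String, age: Int, height: Double, weight: Double)

val names = List("Alf", "Betty")
val ages = List(22, 33)
val heights = List(111.1, 122.2)
val weights = List(70.1, 80.2)

// Each zip nests the previous pair one level deeper,
// so the pattern unwraps (((name, age), height), weight).
val persons: List[Person] =
  names.zip(ages).zip(heights).zip(weights).map {
    case (((name, age), height), weight) => Person(name, age, height, weight)
  }
```

The nesting in the pattern grows with every additional list, which is the main readability cost of this approach.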
Solution 2: using iterators:
val namesIt = names.iterator
val agesIt = ages.iterator
val heightsIt = heights.iterator
val weightsIt = weights.iterator
for { _ <- names }
  yield Person(namesIt.next(), agesIt.next(), heightsIt.next(), weightsIt.next())
Some people would avoid iterators because they involve mutable state and so are not "functional". But they're easy to understand if you come from the Java world and might be suitable if what you actually have are already iterators (input streams etc).
Shameless plug: product-collections does something similar:
a flatZip b flatZip c flatZip d
res0: org.catch22.collections.immutable.CollSeq4[Int,Int,Int,Int] =
CollSeq((1,3,5,7),
(2,4,6,8))
scala> res0(0) //first row
res1: Product4[Int,Int,Int,Int] = (1,3,5,7)
scala> res0._1 //first column
res2: Seq[Int] = List(1, 2)
val g = List(a,b,c,d)
val result = ( g.map(x=>x(0)), g.map(x=>x(1) ) )
// result: (List(1, 3, 5, 7), List(2, 4, 6, 8))
Out of the box, zipped is only supported for Tuple2 and Tuple3:
http://www.scala-lang.org/api/current/index.html#scala.runtime.Tuple3Zipped
So if you want a "Tuple4Zipped", you have to write it yourself.
Good luck.
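A hand-written replacement does not have to be large. The following is a minimal sketch of a four-way zip; zip4 is a hypothetical helper name, not part of the standard library, and it stops at the shortest input just as zip does.

```scala
// Hypothetical helper: zips four collections into a list of 4-tuples.
def zip4[A, B, C, D](
    as: Iterable[A], bs: Iterable[B], cs: Iterable[C], ds: Iterable[D]
): List[(A, B, C, D)] = {
  val ia = as.iterator
  val ib = bs.iterator
  val ic = cs.iterator
  val id = ds.iterator
  val out = List.newBuilder[(A, B, C, D)]
  // Advance all four iterators in lockstep until any one is exhausted.
  while (ia.hasNext && ib.hasNext && ic.hasNext && id.hasNext)
    out += ((ia.next(), ib.next(), ic.next(), id.next()))
  out.result()
}

val rows = zip4(List(1, 2), List(3, 4), List(5, 6), List(7, 8))
val rowSums = for ((w, x, y, z) <- rows) yield w + x + y + z
```

Unlike the transpose approach, the element types may differ between the four inputs and the tuple components can be named in a for comprehension.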
I found a possible solution, although it's too imperative for my taste:
val a = List(1,2)
val b = List(3,4)
val c = List(5,6)
val d = List(7,8)
val g : List[Tuple4[Int,Int,Int,Int]] = {
  a.zipWithIndex.map { case (value,index) => (value, b(index), c(index), d(index))}
}
zipWithIndex lets me index into all the other collections. However, I'm sure there's a better way to do this. Any suggestions?
Previous attempts included Ryan LeCompte's zipMany and transpose.
However, the result is a List, not a Tuple4, which is less convenient to work with since I can't name the variables.
transpose is already built into the standard library and doesn't require the higher-kinds import, so it's preferable, but not ideal.
I also tried, incorrectly, the following example with Shapeless:
scala> import Traversables._
import Tuples._
import Traversables._
import Tuples._
import scala.language.postfixOps
scala> val a = List(1,2)
a: List[Int] = List(1, 2)
scala> val b = List(3,4)
b: List[Int] = List(3, 4)
scala> val c = List(5,6)
c: List[Int] = List(5, 6)
scala> val d = List(7,8)
d: List[Int] = List(7, 8)
scala> val x = List(a,b,c,d).toHList[Int :: Int :: Int :: Int :: HNil] map tupled
x: Option[(Int, Int, Int, Int)] = None

In Scala, how to use Ordering[T] with List.min or List.max and keep code readable

In Scala 2.8, I had a need to call List.min and provide my own compare function to get the value based on the second element of a Tuple2. I had to write this kind of code:
val list = ("a", 5) :: ("b", 3) :: ("c", 2) :: Nil
list.min( new Ordering[Tuple2[String,Int]] {
  def compare(x:Tuple2[String,Int],y:Tuple2[String,Int]): Int = x._2 compare y._2
} )
Is there a way to make this more readable or to create an Ordering out of an anonymous function like you can do with list.sortBy(_._2)?
In Scala 2.9, you can do list minBy { _._2 }.
C'mon guys, you made the poor questioner find "on" himself. Pretty shabby performance. You could shave a little further writing it like this:
list min Ordering[Int].on[(_,Int)](_._2)
Which is still far too noisy but that's where we are at the moment.
One thing you can do is use the more concise standard tuple type syntax instead of using Tuple2:
val min = list.min(new Ordering[(String, Int)] {
  def compare(x: (String, Int), y: (String, Int)): Int = x._2 compare y._2
})
Or use reduceLeft to have a more concise solution altogether:
val min = list.reduceLeft((a, b) => (if (a._2 < b._2) a else b))
Or you could sort the list by your criterion and take the first element (or the last for the max):
val min = list.sortWith( (a, b) => a._2 < b._2 ).head
Which can be further shortened using the placeholder syntax:
val min = list.sortWith( _._2 < _._2 ).head
Which, as you wrote yourself, can be shortened to:
val min = list.sortBy( _._2 ).head
But as you suggested sortBy yourself, I'm not sure if you are looking for something different here.
The function Ordering#on witnesses the fact that Ordering is a contravariant functor. Others include Comparator, Function1, Comparable and scalaz.Equal.
Scalaz provides a unified view on these types, so for any of them you can adapt the input with value contramap f or, in symbolic notation, value ∙ f:
scala> import scalaz._
import scalaz._
scala> import Scalaz._
import Scalaz._
scala> val ordering = implicitly[scala.Ordering[Int]] ∙ {x: (_, Int) => x._2}
ordering: scala.math.Ordering[Tuple2[_, Int]] = scala.math.Ordering$$anon$2@34df289d
scala> List(("1", 1), ("2", 2)) min ordering
res2: (java.lang.String, Int) = (1,1)
Here's the conversion from the Ordering[Int] to Ordering[(_, Int)] in more detail:
scala> scalaz.Scalaz.maContravariantImplicit[Ordering, Int](Ordering.Int).contramap { x: (_, Int) => x._2 }
res8: scala.math.Ordering[Tuple2[_, Int]] = scala.math.Ordering$$anon$2@4fa666bf
list.min(Ordering.fromLessThan[(String, Int)](_._2 < _._2))
Which is still too verbose, of course. I'd probably declare it as a val or object.
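Declaring the ordering as a val could look like the sketch below. It uses the standard library's Ordering.by rather than fromLessThan, which is a swapped-in technique; entries, bySecond, smallest and largest are illustrative names.

```scala
val entries = ("a", 5) :: ("b", 3) :: ("c", 2) :: Nil

// Named, reusable ordering that compares pairs by their second element.
val bySecond: Ordering[(String, Int)] = Ordering.by[(String, Int), Int](_._2)

// The same ordering serves both min and max.
val smallest = entries.min(bySecond)
val largest = entries.max(bySecond)
```

Once the ordering has a name, the call sites stay short and the comparison logic lives in one place.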
You could always define your own implicit conversion:
implicit def funToOrdering[T,R <% Ordered[R]](f: T => R) = new Ordering[T] {
  def compare(x: T, y: T) = f(x) compare f(y)
}
val list = ("a", 5) :: ("b", 3) :: ("c", 2) :: Nil
list.min { t: (String,Int) => t._2 } // (c, 2)
EDIT: Per @Dario's comments.
It might be more readable if the conversion weren't implicit and used an "on" function instead:
def on[T,R <% Ordered[R]](f: T => R) = new Ordering[T] {
  def compare(x: T, y: T) = f(x) compare f(y)
}
val list = ("a", 5) :: ("b", 3) :: ("c", 2) :: Nil
list.min( on { t: (String,Int) => t._2 } ) // (c, 2)
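View bounds (<%) are not available in Scala 3, so on newer compilers the same idea needs a different constraint. One possible variant, sketched here under that assumption, takes an implicit Ordering[R] instead; the helper keeps the name on from above, while items and minByOn are illustrative names.

```scala
// Variant of the "on" helper that asks for an Ordering[R]
// instead of a view bound to Ordered[R].
def on[T, R](f: T => R)(implicit ord: Ordering[R]): Ordering[T] =
  new Ordering[T] {
    def compare(x: T, y: T): Int = ord.compare(f(x), f(y))
  }

val items = ("a", 5) :: ("b", 3) :: ("c", 2) :: Nil

// Compare the pairs by their Int component only.
val minByOn = items.min(on((t: (String, Int)) => t._2))
```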