spark rdd, need to reduce over (key,(tuple)) - scala

A previous process gave me the accumulator and count of every group, like this:
val data: Array[(Int, (Double, Int))] = Array((2,(2.1463120403829962,7340)), (1,(1.4532644653720025,4280)))
The structure is (groupId, (acum, count)).
Now I want to reduce to get the sum over every tuple:
(k1,(a1,n1)),(k2,(a2,n2))
need:
(a1+a2),(n1+n2)
Sounds like a simple task, so I do:
val mainMean = groups.reduce((acc,v)=>(acc._1 + v._1,acc._2 + v._2))
And get:
<console>:33: error: type mismatch;
 found   : (Double, Int)
 required: String
val mainMean = groups.reduce((acc,v)=>(acc._1 + v._1,acc._2 + v._2))
I also tried:
val mainMean = groups.reduce((k,(acc,v))=>(acc._1 + v._1,acc._2 + v._2))
and it tells me:
Either create a single parameter accepting the Tuple2,
or consider a pattern matching anonymous function
So:
val mainMean = groups.reduce({case(k,(acc,n))=>(k,(acc._1+n._1,acc._1+n._2))})
and I get:
error: type mismatch;
 found   : (Int, (Double, Int))
 required: Int
I know it's a newbie question, but I'm stuck on it.

There can be some difficulties when working with tuples. (Incidentally, the puzzling required: String in your first error appears because tuples have no + method, so the compiler falls back to the any2stringadd implicit conversion, which expects a String on the right-hand side.) Below you can see working code, but let me explain.
val data = Array((2,(2.1463120403829962,7340)), (1,(1.4532644653720025,4280)))
def tupleSum(t1: (Int, (Double, Int)), t2: (Int, (Double, Int))): (Int, (Double, Int)) =
  (0, (t1._2._1 + t2._2._1, t1._2._2 + t2._2._2))
val mainMean = data.reduce(tupleSum)._2
We can write the reduce arguments explicitly, like
data.reduce((tuple1, tuple2) => tupleSum(tuple1, tuple2))
where tuple1 acts as the accumulator: on the first iteration it takes the first value of the array, and every subsequent value is added onto the accumulator.
So if you want to perform the reduce using pattern matching, it will look like this:
val mainMean = data.reduce((tuple1, tuple2) => {
  val t1 = tuple1 match { case (i, t) => t }
  val t2 = tuple2 match { case (i, t) => t }
  // now t1 and t2 represent the inner tuples of the input tuples
  (0, (t1._1 + t2._1, t1._2 + t2._2))
})
UPD.
I have rewritten the previous listing, adding type annotations and println statements; I hope it helps to get the point. There is some explanation after it.
val data = Array((3, (3.0, 3)), (2, (2.0, 2)), (1, (1.0, 1)))
val mainMean = data.reduce((tuple1: (Int, (Double, Int)),
                            tuple2: (Int, (Double, Int))) => {
  println("tuple1: " + tuple1)
  println("tuple2: " + tuple2)
  val t1: (Double, Int) = tuple1 match {
    case (i: Int, t: (Double, Int)) => t
  }
  val t2: (Double, Int) = tuple2 match {
    case (i: Int, t: (Double, Int)) => t
  }
  // now t1 and t2 represent the inner tuples of the input tuples
  (0, (t1._1 + t2._1, t1._2 + t2._2))
})
println("mainMean: " + mainMean)
And the output will be:
tuple1: (3,(3.0,3)) // 1st element of the array
tuple2: (2,(2.0,2)) // 2nd element of the array
tuple1: (0,(5.0,5)) // sum of 1st and 2nd elements
tuple2: (1,(1.0,1)) // 3rd element of the array
mainMean: (0,(6.0,6)) // result sum
tuple1 and tuple2 have type (Int, (Double, Int)). We know they will always have exactly this type, which is why we use only one case in the pattern match. We unpack tuple1 into i: Int and t: (Double, Int). Since we are not interested in the key, we return only t. Now t1 represents the inner tuple of tuple1. The same goes for tuple2 and t2.
You can find more information about fold functions here and here
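As an aside (a sketch, not part of the original answer): since the keys are irrelevant to the sum, you can also project them away first and reduce the inner tuples directly, avoiding the dummy key entirely:
val mainMean = data.map(_._2).reduce {
  case ((a1, n1), (a2, n2)) => (a1 + a2, n1 + n2)
}
// mainMean: (Double, Int) = (3.5995765057549987,11620)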

Related

toMap error in mapping a Data Set

I'm getting the error
error: Cannot prove that (Int, String, String, String, String, Double, String) <:< (T, U).
}.collect.toMap
when executing my application with the following code snippet:
val trains = sparkEnvironment.sc.textFile(dataDirectoryPath + "/trains.csv").map { line =>
  val fields = line.split(",")
  // format: (trainID,trainName,departure,arrival,cost,trainClass)
  (fields(0).toInt, fields(1), fields(2), fields(3), fields(4).toDouble, fields(5))
}.collect.toMap
What could be the cause, and can anyone please suggest a solution?
If you want to call toMap on a Seq, you should have a Seq of Tuple2. The ScalaDoc for toMap states:
This method is unavailable unless the elements are members of Tuple2,
each ((T, U)) becoming a key-value pair in the map
So you should do:
val trains = sparkEnvironment.sc.textFile(dataDirectoryPath + "/trains.csv").map { line =>
  val fields = line.split(",")
  // format: (trainID,trainName,departure,arrival,cost,trainClass)
  (fields(0).toInt, // first element of Tuple2 -> "key"
   (fields(1), fields(2), fields(3), fields(4).toDouble, fields(5)) // 2nd element of Tuple2 -> "value"
  )
}.collect.toMap
so that your map statement returns RDD[(Int, (String, String, String, Double, String))].
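To see the constraint in isolation (my sketch, plain Scala, no Spark required):
Seq((1, "a"), (2, "b")).toMap // Map(1 -> a, 2 -> b): elements are Tuple2, so toMap compiles
// Seq((1, "a", "x")).toMap // does not compile: a Tuple3 is not a (T, U)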

Understanding Currying in Scala

I'm having trouble understanding the currying concept, or at least the Scala currying notation.
Wikipedia says that currying is the technique of translating the evaluation of a function that takes multiple arguments into evaluating a sequence of functions, each with a single argument.
Following this explanation, are the following two lines equivalent in Scala?
def addCurr(a: String)(b: String): String = {a + " " + b}
def add(a:String): String => String = {b => a + " " + b}
I've run both lines with the same strings a and b and got the same result, but I don't know whether they are different under the hood.
My way of thinking about addCurr (and currying itself) is that it is a function that receives a string parameter a, and returns another function that also receives a string parameter b and returns the string a + " " + b. Is that right?
So if I'm getting it right, addCurr is only syntactic sugar for the function add, and both are curried functions?
Going by the previous example, are the following functions also equivalent in Scala?
def add(a: String)(b: String)(c: String):String = { a + " " + b + " " + c}
def add1(a: String)(b: String): String => String = {c => a + " " + b + " " + c}
def add2(a:String): (String => (String => String)) = {b => (c => a + " " + b + " " + c)}
They have slightly different semantics, but their use cases are mostly the same, both in practice and in how they look in code.
Currying
Currying a function in Scala, in the mathematical sense, is very straightforward:
val function = (x: Int, y: Int, z: Int) => 0
// function: (Int, Int, Int) => Int = <function3>
function.curried
// res0: Int => (Int => (Int => Int)) = <function1>
Functions & methods
You seem to be confused by the fact that in Scala, functions (=>) are not the same as methods (def). A method isn't a first-class object, while a function is (i.e. it has curried and tupled methods, and Function1 has even more goodness).
Methods, however, can be lifted to functions by an operation known as eta expansion. See this SO answer for some details. You can trigger it manually by writing methodName _, or it will be done implicitly when you pass a method where a function type is expected.
def sumAndAdd4(i: Int, j: Int) = i + j + 4
// sumAndAdd4.curried // <- won't compile
val asFunction = sumAndAdd4 _ // trigger eta expansion
// asFunction: (Int, Int) => Int = <function2>
val asFunction2: (Int, Int) => Int = sumAndAdd4
// asFunction2: (Int, Int) => Int = <function2>
val asFunction3 = sumAndAdd4: (Int, Int) => Int
// asFunction3: (Int, Int) => Int = <function2>
asFunction.curried
// res0: Int => (Int => Int) = <function1>
asFunction2.curried
// res1: Int => (Int => Int) = <function1>
asFunction3.curried
// res2: Int => (Int => Int) = <function1>
{sumAndAdd4 _}.curried // you can do it inline too
// res3: Int => (Int => Int) = <function1>
Eta expansion of multiple parameter lists
As you might expect, eta expansion turns every parameter list into its own function:
def singleArgumentList(x: Int, y: Int) = x + y
def twoArgumentLists(x: Int)(y: Int) = x + y
singleArgumentList _ // (Int, Int) => Int
twoArgumentLists _ // Int => (Int => Int) - curried!
val testSubject = List(1, 2, 3)
testSubject.reduce(singleArgumentList) // Int (6)
testSubject.map(twoArgumentLists) // List[Int => Int]
// testSubject.map(singleArgumentList) // does not compile, map needs Int => A
// testSubject.reduce(twoArgumentLists) // does not compile, reduce needs (Int, Int) => Int
But it's not quite currying in the mathematical sense:
def hmm(i: Int, j: Int)(s: String, t: String) = s"$i, $j; $s - $t"
{hmm _} // (Int, Int) => (String, String) => String
Here, we get a function of two arguments, returning another function of two arguments.
And it's not that straightforward to specify only some of its arguments:
val function = hmm(5, 6) _ // <- still need that underscore!
Whereas with functions, you get back a function without any fuss:
val alreadyFunction = (i: Int, j: Int) => (k: Int) => i + j + k
val f = alreadyFunction(4, 5) // Int => Int
Do it whichever way you like; Scala is fairly unopinionated about many things. Personally, I prefer multiple parameter lists, because more often than not I'll need to partially apply a function and then pass it somewhere where the remaining parameters will be given, so I don't need to do the eta expansion explicitly, and I get a terser syntax at the method definition site. For an example of that pattern, see the sketch below.
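A sketch of that pattern (hypothetical names, not from the original answer): a partially applied multi-parameter-list method handed straight to a higher-order function, where the expected function type triggers eta expansion implicitly.
def log(level: String)(msg: String): Unit = println(s"[$level] $msg")

List("starting", "done").foreach(log("INFO")) // eta expansion happens implicitly here
// [INFO] starting
// [INFO] done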
Curried methods are syntactic sugar; you were right about that part. But this syntactic sugar works a bit differently. Consider the following example:
def addCur(a: String)(b: String): String = { a + b }
def add(a: String): String => String = { b => a + b }
val functionFirst: String => String = addCur("34")
val functionFirst2 = addCur("34") _
val functionSecond: String => String = add("34")
Generally speaking, curried methods allow for partial application, and they are necessary for the Scala implicits mechanism to work. The lines above show example usages; as you can see, in the second one we have to use the underscore to let the compiler do the "trick". If it were not present, you would receive an error similar to the following:
Error:(75, 19) missing argument list for method curried in object XXX
Unapplied methods are only converted to functions when a function type
is expected. You can make this conversion explicit by writing curried _ or curried(_)(_) instead of curried.
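As for the implicits mechanism mentioned above, it relies on the same multiple-parameter-list machinery; a minimal sketch with hypothetical names:
def greet(name: String)(implicit greeting: String): String = greeting + ", " + name

implicit val defaultGreeting: String = "Hello"
greet("World") // the compiler fills in the implicit argument list: "Hello, World"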
Your question interested me, so I tried this out myself. They actually desugar to very different constructs. Using
def addCurr(a: String)(b: String): String = {a + " " + b}
This actually compiles to
def addCurr(a: String, b: String): String = {a + " " + b}
So it completely removes any currying effect, making it a regular arity-2 method. Eta expansion is used to allow you to curry it.
def add(a:String): String => String = {b => a + " " + b}
This one works as you would expect, compiling to a method that returns a Function1[String,String]
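A quick way to observe the difference at the use site (my sketch, not from the original answer):
val f1: String => String = addCurr("Hello") // eta expansion: a function type is expected
val f2 = addCurr("Hello") _                 // explicit eta expansion
val f3 = add("Hello")                       // add already returns a String => String
f1("world") // "Hello world"
Either way, once you have a function value, the call sites look identical.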

Type mismatch in Scala Map from getOrElse returning Equals

I have this Scala snippet from a custom mapper (for use in a Spark mapPartitions) I'm writing to compute histograms of multiple Int fields simultaneously.
def multiFeatureHistogramFunc(iter: Iterator[Row]): Iterator[(Int, (Int, Long))] = {
  var featureHistMap: Map[Int, (Int, Long)] = Map()
  while (iter.hasNext) {
    val cur = iter.next
    indices.foreach { index: Int =>
      val v: Int = if (cur.isNullAt(index)) -100 else cur.getInt(index)
      var featureHist: Map[Int, Long] = featureHistMap.getOrElse(index, Map())
      val newCount = featureHist.getOrElse(v, 0L) + 1L
      featureHist += (v -> newCount)
      featureHistMap += (index -> featureHist)
    }
  }
  featureHistMap.iterator
}
But the error I'm getting is this:
<console>:49: error: type mismatch;
found : Equals
required: Map[Int,Long]
var featureHist:Map[Int, Long] =
featureHistMap.getOrElse(index, Map())
^
I couldn't find the answer to this specific issue. It looks to me like the default parameter in featureHistMap.getOrElse has a different type than the value type of featureHistMap itself, and their common parent type is Equals, so this causes a type mismatch. I tried a number of different things, like changing the default parameter to a more specific type, but this just caused a different error.
Can someone explain what's going on here and how to fix it?
The problem is that you declared featureHistMap as Map[Int, (Int, Long)] - note that you are mapping an Int to a pair (Int, Long). Later, you try to retrieve a value from it as a Map[Int, Long] instead of a pair (Int, Long).
You either need to redeclare featureHistMap as Map[Int, Map[Int, Long]], or redeclare featureHist as (Int, Long).
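Assuming the nested-histogram reading is what was intended, the minimal fix is to change the declaration (a sketch showing only the relevant lines):
// one histogram (value -> count) per feature index
var featureHistMap: Map[Int, Map[Int, Long]] = Map()

// getOrElse's default Map() now unifies with the value type
var featureHist: Map[Int, Long] = featureHistMap.getOrElse(index, Map())
The method's return type then becomes Iterator[(Int, Map[Int, Long])] rather than Iterator[(Int, (Int, Long))].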

How to order tuples based on another, referal tuple

The following Iterable can be of size one, two, or (up to) three.
org.apache.spark.rdd.RDD[Iterable[(String, String, String, String, Long)]] = MappedRDD[17] at map at <console>:75
The second element of each tuple can have any of the following values: A, B, C. Each of these values can appear (at most) once.
What I would like to do is sort them based on the order (B, A, C), and then create a string by concatenating the third elements. If the corresponding tag is missing, concatenate an empty string in its place. For example:
this:
CompactBuffer((blah,A,val1,blah,blah), (blah,B,val2,blah,blah), (blah,C,val3,blah,blah))
should result in:
val2,val1,val3
this:
CompactBuffer((blah,A,val1,blah,blah), (blah,C,val3,blah,blah))
should result in:
,val1,val3
this:
CompactBuffer((blah,A,val1,blah,blah), (blah,B,val2,blah,blah))
should result in:
val2,val1,
this:
CompactBuffer((blah,B,val2,blah,blah))
should result in:
val2,,
and so on so forth.
In your case, where A, B, and C each appear at most once, you could add the corresponding values to a temporary map and then retrieve them from the map in the correct order.
If we use getOrElse to get the values from the map, we can specify the empty string as default value. This way we still get the correct result if our Iterable doesn't contain all the tuples with A, B and C.
type YourTuple = (String, String, String, String, Long)

def orderTuples(order: List[String])(iter: Iterable[YourTuple]) = {
  val orderMap = iter.map { case (_, key, value, _, _) => key -> value }.toMap
  order.map(s => orderMap.getOrElse(s, "")).mkString(",")
}
We can use this function as follows:
val a = ("blah","A","val1","blah",1L)
val b = ("blah","B","val2","blah",2L)
val c = ("blah","C","val3","blah",3L)
val order = List("B", "A", "C")
val bacOrder = orderTuples(order) _
bacOrder(Iterable(a, b, c)) // String = val2,val1,val3
bacOrder(Iterable(a, c)) // String = ,val1,val3
bacOrder(Iterable(a, b)) // String = val2,val1,
bacOrder(Iterable(b)) // String = val2,,
def orderTuples(xs: List[(String, String, String, String, Long)],
                order: (String, String, String) = ("B", "A", "C")) = {
  type T = List[(String, String, String, String, Long)]
  type KV = List[(String, String)]
  val ord = List(order._1, order._2, order._3)
  def loop(xs: T, acc: KV, vs: List[String] = ord): KV = xs match {
    case Nil if vs.isEmpty => acc
    case Nil => vs.map((_, "")) ++: acc // missing tags contribute empty strings
    case x :: rest => loop(rest, (x._2, x._3) +: acc, vs.filterNot(_ == x._2))
  }
  def comp(x: String) = ord.zipWithIndex.toMap.apply(x)
  loop(xs, Nil).sortBy(x => comp(x._1)).map(_._2).mkString(",")
}
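A usage sketch (my example, assuming the definitions above); it matches the map-based version:
orderTuples(List(("blah","A","val1","blah",1L), ("blah","B","val2","blah",2L)))
// String = val2,val1,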

How to apply function to each tuple of a multi-dimensional array in Scala?

I've got a two dimensional array and I want to apply a function to each value in the array.
Here's what I'm working with:
scala> val array = Array.tabulate(2,2)((x,y) => (0,0))
array: Array[Array[(Int, Int)]] = Array(Array((0,0), (0,0)), Array((0,0), (0,0)))
I'm using foreach to extract the tuples:
scala> array.foreach(i => i.foreach(j => println(i))) // note: this prints the inner array i, not the tuple j, hence the output below
[Lscala.Tuple2;#11d559a
[Lscala.Tuple2;#11d559a
[Lscala.Tuple2;#df11d5
[Lscala.Tuple2;#df11d5
Let's make a simple function:
//Takes two ints and return a Tuple2. Not sure this is the best approach.
scala> def foo(i: Int, j: Int):Tuple2[Int,Int] = (i+1,j+2)
foo: (i: Int,j: Int)(Int, Int)
This runs, but I need to apply it to the array (if mutable) or return a new array.
scala> array.foreach(i => i.foreach(j => foo(j._1, j._2)))
Shouldn't be too bad. I'm missing some basics, I think...
(UPDATE: removed the for comprehension code which was not correct - it returned a one dimensional array)
def foo(t: (Int, Int)): (Int, Int) = (t._1 + 1, t._2 + 1)
array.map{_.map{foo}}
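For example (my sketch), applied to the 2x2 array of (0,0) tuples from the question:
val result = array.map(_.map(foo))
// result: Array[Array[(Int, Int)]] = Array(Array((1,1), (1,1)), Array((1,1), (1,1)))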
To apply to a mutable array
import scala.collection.mutable.ArrayBuffer

val array = ArrayBuffer.tabulate(2, 2)((x, y) => (0, 0))

for (sub <- array;
     (cell, i) <- sub.zipWithIndex)
  sub(i) = foo(cell) // foo takes the tuple directly, per its definition above
array.map(_.map(((_: Int).+(1) -> (_: Int).+(1)).tupled))
e.g.
scala> List[List[(Int, Int)]](List((1,3))).map(_.map(((_: Int).+(1) -> (_: Int).+(1)).tupled))
res123: List[List[(Int, Int)]] = List(List((2,4)))
Dot notation is required here to force .+(1) to bind more tightly than ->; written as infix +, the two operators share the same precedence, so the expression would parse incorrectly.