Scala/Spark: Ordering on multiple dimensions with a reverse on a specific one

I would like to sort, for example, a 2D array or an RDD like the following:
val a = Array((1,1),(1,2),(1,3),(2,1),(2,2),(2,3))
To obtain an ascending sort on d1 and a descending one on d2:
val b = Array((1,3),(1,2),(1,1),(2,3),(2,2),(2,1))
Unfortunately, when I apply reverse to the ordering, it applies to all dimensions:
a.sortBy( x=> (x._1,x._2) )(Ordering[(Int,Int)].reverse.on(x=> (x._1,x._2)))
Array((2,3), (2,2), (2,1), (1,3), (1,2), (1,1))
So I would like to be able to sort on multiple dimensions, choosing which ones need a reverse sort.

This post contains the answer: Scala idiom for ordering by multiple criteria
val ord1 = Ordering.by{ x:Int => x }
val ord2 = Ordering.by{ x:Int => x }.reverse
val multOrd = Ordering.by{ x:(Int,Int) => x }(Ordering.Tuple2(ord1,ord2))
a.sortBy( identity )(multOrd)
Array((1,3),(1,2),(1,1),(2,3),(2,2),(2,1))
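As a side note (not from the linked post, just a sketch): if the descending field is numeric, you can also negate it in the sort key, which works both on the local Array and on an RDD's sortBy (beware of Int.MinValue, which cannot be negated):
a.sortBy(x => (x._1, -x._2))
// on an RDD[(Int, Int)], assuming it exists as rdd:
// rdd.sortBy { case (d1, d2) => (d1, -d2) }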
Hope it helps.

Related

Vector product using map and reduce in scala

I'm trying to calculate the vector product between two vectors using the map and reduce functions.
Let's see what happens in the REPL of Scala:
First of all I define two vectors of the same length:
scala> val v1 = Array(1,4,5,2)
v1: Array[Int] = Array(1, 4, 5, 2)
scala> val v2 = Array (3,5,1,5)
v2: Array[Int] = Array(3, 5, 1, 5)
Now I create a new array vecZip using the zip function
scala> val vecZip = v1 zip v2
vecZip: Array[(Int, Int)] = Array((1,3), (4,5), (5,1), (2,5))
Now I'd like to apply the reduce method
(to do the product of each tuple) for each element of this array.
I thought this:
val vecToSum = vecZip.map(x=>(List(x).reduce(_*_)))
I want to get a list (vecToSum) where the reduce method is applied to calculate the result. However, I get this error:
scala> val vecToSum = vecZip.map(x=>(List(x).reduce(_*_)))
<console>:10: error: value * is not a member of (Int, Int)
val vecToSum = vecZip.map(x=>(List(x).reduce(_*_)))
^
You just need to call map and multiply the tuple's values with each other, like this:
val vecToSum = vecZip.map(x => x._1 * x._2)
vecZip is an Array of tuples, so x is a tuple of type (Int, Int). Therefore, if you call List(x).reduce(...), you're creating a List whose only element is that tuple, which is not really what you want.
What your code actually does is create a list containing a single tuple and then try to reduce it. That can never work, because there is nothing to reduce: there is only a single element in the list, the tuple itself.
Instead you need to map your vecZip array elements (tuples) via multiplying their elements:
vecZip.map { case (x, y) => x * y }
You don't need to reduce here. Reducing an Array[(Int, Int)] would mean performing some associative binary operation on all the tuples inside the array. Note that it could perform the operation on the first couple of tuples, then on the result of that and the third tuple, then on the result of that and the fourth tuple, etc., but also, due to associativity, it could perform the operation on the first and second tuples and simultaneously on the third and fourth tuples, and then on their results, etc., which is nice for parallelization (frameworks such as Spark rely on it heavily).
For example you could sum all first elements and all second elements of each tuple:
val reduced = vecZip.reduce((pair1, pair2) => (pair1._1 + pair2._1, pair1._2 + pair2._2))
What you want, however, is to simply map each tuple into the product of its elements:
val vecToSum = vecZip.map { case (x, y) => x * y }
Note that I used a partial function (see the case keyword there) to perform pattern matching on the tuple; without the partial function it would look like this:
val vecToSum = vecZip.map(tuple => tuple._1 * tuple._2)
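If the goal is the full dot product (the "total result" mentioned in the question), the element-wise products can then be summed; a minimal sketch combining the zip, the map and a final reduce:
val dotProduct = (v1 zip v2).map { case (x, y) => x * y }.reduce(_ + _)
// dotProduct: Int = 38  (1*3 + 4*5 + 5*1 + 2*5)
// equivalently: (v1 zip v2).map { case (x, y) => x * y }.sum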

Spark Scala: Split each line between multiple RDDs

I have a file on HDFS in the form of:
61,139,75
63,140,77
64,129,82
68,128,56
71,140,47
73,141,38
75,128,59
64,129,61
64,129,80
64,129,99
I create an RDD from it and zip the elements with their index:
val data = sc.textFile("hdfs://localhost:54310/usrp/sample.txt")
val points = data.map(s => Vectors.dense(s.split(',').map(_.toDouble)))
val indexed = points.zipWithIndex()
val indexedData = indexed.map{case (value,index) => (index,value)}
Now I need to create rdd1 with the index and the first two elements of each line. Then I need to create rdd2 with the index and the third element of each line. I am new to Scala; could you please help me with how to do this?
This does not work, since y is not a Scala Vector but an org.apache.spark.mllib.linalg.Vector (which has no take method):
val rdd1 = indexedData.map{case (x,y) => (x,y.take(2))}
Basically, how do I get the first two elements of such a vector?
Thanks.
You can make use of DenseVector's unapply method to get the underlying Array[Double] in your pattern-matching, and then call take/drop on the Array, re-wrapping it with a Vector:
val rdd1 = indexedData.map { case (i, DenseVector(arr)) => (i, Vectors.dense(arr.take(2))) }
val rdd2 = indexedData.map { case (i, DenseVector(arr)) => (i, Vectors.dense(arr.drop(2))) }
As you can see, this means the original DenseVector you created isn't really that useful, so if you're not going to use indexedData anywhere else, it might be better to create indexedData as an RDD[(Long, Array[Double])] in the first place:
val points = data.map(s => s.split(',').map(_.toDouble))
val indexedData: RDD[(Long, Array[Double])] = points.zipWithIndex().map(_.swap)
val rdd1 = indexedData.mapValues(arr => Vectors.dense(arr.take(2)))
val rdd2 = indexedData.mapValues(arr => Vectors.dense(arr.drop(2)))
Last tip: you probably want to call .cache() on indexedData before scanning it twice to create rdd1 and rdd2 - otherwise the file will be loaded and parsed twice.
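For instance, a minimal sketch of that caching suggestion (same pipeline as above, just with cache() added):
val indexedData: RDD[(Long, Array[Double])] =
  points.zipWithIndex().map(_.swap).cache()  // parsed once, then reused by both rdd1 and rdd2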
You can achieve the above output by following the steps below:
Original Data:
indexedData.foreach(println)
(0,[61.0,139.0,75.0])
(1,[63.0,140.0,77.0])
(2,[64.0,129.0,82.0])
(3,[68.0,128.0,56.0])
(4,[71.0,140.0,47.0])
(5,[73.0,141.0,38.0])
(6,[75.0,128.0,59.0])
(7,[64.0,129.0,61.0])
(8,[64.0,129.0,80.0])
(9,[64.0,129.0,99.0])
RDD1 Data:
The index along with the first two elements of each line.
val rdd1 = indexedData.map{case (x,y) => (x, (y.toArray(0), y.toArray(1)))}
rdd1.foreach(println)
(0,(61.0,139.0))
(1,(63.0,140.0))
(2,(64.0,129.0))
(3,(68.0,128.0))
(4,(71.0,140.0))
(5,(73.0,141.0))
(6,(75.0,128.0))
(7,(64.0,129.0))
(8,(64.0,129.0))
(9,(64.0,129.0))
RDD2 Data:
The index along with the third element of each line.
val rdd2 = indexedData.map{case (x,y) => (x, y.toArray(2))}
rdd2.foreach(println)
(0,75.0)
(1,77.0)
(2,82.0)
(3,56.0)
(4,47.0)
(5,38.0)
(6,59.0)
(7,61.0)
(8,80.0)
(9,99.0)

RDD split and do aggregation on new RDDs

I have an RDD of (String, String, Int).
I want to reduce it based on the first two strings.
Then, based on the first String, I want to group the (String, Int) pairs and sort them.
After sorting I need to group them into small groups, each containing n elements.
I have done the code below. The problem is that the number of elements in step 2 is very large for a single key
and the reduceByKey(x ++ y) takes a lot of time.
// Input
val data = Array(
  ("c1","a1",1), ("c1","b1",1), ("c2","a1",1), ("c1","a2",1), ("c1","b2",1),
  ("c2","a2",1), ("c1","a1",1), ("c1","b1",1), ("c2","a1",1))
val rdd = sc.parallelize(data)
val r1 = rdd.map(x => ((x._1, x._2), (x._3)))
val r2 = r1.reduceByKey((x, y) => x + y).map(x => ((x._1._1), (x._1._2, x._2)))
// This is taking a long time.
val r3 = r2.mapValues(x => ArrayBuffer(x)).reduceByKey((x, y) => x ++ y)
// From the list I will be doing the grouping.
val r4 = r3.map(x => (x._1, x._2.toList.sorted.grouped(2).toList))
Problem is the "c1" has lot of unique entries like b1 ,b2....million and reduceByKey is killing time because all the values are going to single node.
Is there a way to achieve this more efficiently?
// output
Array((c1,List(List((a1,2), (a2,1)), List((b1,2), (b2,1)))), (c2,List(List((a1,2), (a2,1)))))
There are at least a few problems with the way you group your data. The first is introduced by
mapValues(x => ArrayBuffer(x))
It creates a large number of mutable objects which provide no additional value, since you cannot leverage their mutability in the subsequent reduceByKey
reduceByKey((x, y) => x ++ y)
where each ++ creates a new collection and neither argument can be safely mutated. Since reduceByKey applies map-side aggregation, the situation is even worse and pretty much creates GC hell.
Is there a way to achieve this more efficiently?
Unless you have some deeper knowledge about the data distribution which can be used to define a smarter partitioner, the simplest improvement is to replace mapValues + reduceByKey with a simple groupByKey:
val r3 = r2.groupByKey
It should also be possible to use a custom partitioner for the reduceByKey call, together with mapPartitions with preservesPartitioning instead of map:
class FirstElementPartitioner(partitions: Int)
    extends org.apache.spark.Partitioner {
  def numPartitions = partitions
  def getPartition(key: Any): Int = {
    // Use a non-negative modulo so negative hash codes don't produce invalid partition ids.
    val h = key.asInstanceOf[(Any, Any)]._1.##
    ((h % numPartitions) + numPartitions) % numPartitions
  }
}

val r2 = r1
  .reduceByKey(new FirstElementPartitioner(8), (x, y) => x + y)
  .mapPartitions(iter => iter.map(x => ((x._1._1), (x._1._2, x._2))), preservesPartitioning = true)

// No shuffle required here.
val r3 = r2.groupByKey
It requires only a single shuffle and groupByKey is simply a local operation:
r3.toDebugString
// (8) MapPartitionsRDD[41] at groupByKey at <console>:37 []
// | MapPartitionsRDD[40] at mapPartitions at <console>:35 []
// | ShuffledRDD[39] at reduceByKey at <console>:34 []
// +-(8) MapPartitionsRDD[1] at map at <console>:28 []
// | ParallelCollectionRDD[0] at parallelize at <console>:26 []
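To finish the original pipeline, the grouped values can then be sorted and chunked exactly as in the question's last step (a sketch reusing the question's group size of 2):
val r4 = r3.mapValues(_.toList.sorted.grouped(2).toList)
// r4.collect() matches the expected output shown in the question.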

Summing items within a Tuple

Below is a data structure, a List of tuples of type List[(String, String, Int)]:
val data3 = (List( ("id1" , "a", 1), ("id1" , "a", 1), ("id1" , "a", 1) , ("id2" , "a", 1)) )
//> data3 : List[(String, String, Int)] = List((id1,a,1), (id1,a,1), (id1,a,1),
//| (id2,a,1))
I'm attempting to count the occurrences (i.e. sum the Int values) associated with each id. So the above data structure should be converted to List((id1,a,3), (id2,a,1)).
This is what I have come up with, but I'm unsure how to group similar items within a tuple:
data3.map( { case (id,name,num) => (id , name , num + 1)})
//> res0: List[(String, String, Int)] = List((id1,a,2), (id1,a,2), (id1,a,2), (i
//| d2,a,2))
In practice data3 is a Spark RDD; I'm using a List in this example for local testing, but the same solution should be compatible with an RDD.
Update: based on the following code provided by maasg:
val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => (id1,id2,values.sum)}
I needed to amend it slightly to get it into the format I expect, which is of type
RDD[(String, Seq[(String, Int)])]
which corresponds to RDD[(id, Seq[(name, count-of-names)])]:
val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => ((id1),(id2,values.sum))}
val counted = result.groupByKey
In Spark, you would do something like this: (using Spark Shell to illustrate)
val l = List( ("id1" , "a", 1), ("id1" , "a", 1), ("id1" , "a", 1) , ("id2" , "a", 1))
val rdd = sc.parallelize(l)
val grouped = rdd.groupBy{case (id1,id2,v) => (id1,id2)}
val result = grouped.map{case ((id1,id2),values) => (id1,id2,values.foldLeft(0){case (cumm, tuple) => cumm + tuple._3})}
Another option would be to map the rdd into a PairRDD and use groupByKey:
val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => (id1,id2,values.sum)}
Option 2 is a slightly better option when handling large sets, as it does not replicate the ids in the accumulated value.
This seems to work when I use scala-ide:
data3
.groupBy(tupl => (tupl._1, tupl._2))
.mapValues(v =>(v.head._1,v.head._2, v.map(_._3).sum))
.values.toList
And the result is the same as required by the question
res0: List[(String, String, Int)] = List((id1,a,3), (id2,a,1))
You should look into List.groupBy.
You can use the id as the key, and then use the length of your values in the map (i.e. all the items sharing the same id) to know the count.
@vptheron has the right idea.
As can be seen in the docs:
def groupBy[K](f: (A) ⇒ K): Map[K, List[A]]
Partitions this list into a map of lists according to some discriminator function.
Note: this method is not re-implemented by views. This means when applied to a view it will always force the view and return a new list.
K the type of keys returned by the discriminator function.
f the discriminator function.
returns
A map from keys to lists such that the following invariant holds:
(xs partition f)(k) = xs filter (x => f(x) == k)
That is, every key k is bound to a list of those elements x for which f(x) equals k.
So something like the function below, when used with groupBy, will give you a Map whose keys are the ids.
(Sorry, I don't have access to a Scala compiler, so I can't test.)
def f(tuple: (String, String, Int)): String = tuple._1
Then you will have to iterate through the List for each id in the Map and sum up the number of integer occurrences. That is straightforward, but if you still need help, ask in the comments.
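For completeness, a minimal local-Scala sketch of that groupBy-then-sum idea (grouping on both id and name so the name is preserved, which slightly extends the suggestion above):
data3
  .groupBy { case (id, name, _) => (id, name) }
  .map { case ((id, name), items) => (id, name, items.map(_._3).sum) }
  .toList
// roughly List((id1,a,3), (id2,a,1)); a Map's iteration order is not guaranteed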
The following is the most readable, efficient and scalable approach:
data.map {
case (key1, key2, value) => ((key1, key2), value)
}
.reduceByKey(_ + _)
which will give an RDD[((String, String), Int)]. By using reduceByKey the summation will parallelize, i.e. for very large groups it will be distributed and the summation will happen on the map side. Think about the case where there are only 10 groups but billions of records: using .sum won't scale, as it will only be able to distribute to 10 cores.
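If the flat (String, String, Int) shape from the question is needed, the keyed result can be mapped back afterwards (a small additional step, not part of the snippet above):
data.map { case (key1, key2, value) => ((key1, key2), value) }
  .reduceByKey(_ + _)
  .map { case ((key1, key2), sum) => (key1, key2, sum) }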
A few more notes about the other answers:
Using head here is unnecessary: instead of .mapValues(v => (v.head._1, v.head._2, v.map(_._3).sum)) you can take both values from the key directly, e.g. .map { case ((id, name), v) => (id, name, v.map(_._3).sum) }
Using a foldLeft here is really convoluted when the above shows .map(_._3).sum will do: val result = grouped.map{case ((id1,id2),values) => (id1,id2,values.foldLeft(0){case (cumm, tuple) => cumm + tuple._3})}

Merge lists, group By and create a sub-list with all values from one parameter

I'm trying to figure out how to achieve the following.
I have a list of tuples
scala> List(t1,t2,t3)
res16: List[(Int, java.lang.String)] = List((1001,Test), (1002,Schnitzel), (1001,Käse))
What I want out of this list is something like this:
List[(Int, Seq[java.lang.String])] = List((1001, Seq(Test, Käse)), (1002, Seq(Schnitzel)))
There is already a function groupBy which achieves almost what you want. E.g.,
val xs = List(t1, t2, t3)
val m = xs.groupBy(_._1)
groups the entries of xs by their first components, resulting in the map
Map(1002 -> List((1002,Schnitzel)), 1001 -> List((1001,Test), (1001,Kaese)))
This does not have the type you want, and the keys are also still part of the entries. This can be resolved by, e.g.,
val ys : List[(Int, Seq[String])] = m.mapValues(_.map(_._2)).toList
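Putting both steps together (a minimal sketch, assuming t1, t2 and t3 are the pairs shown in the question):
val t1 = (1001, "Test")
val t2 = (1002, "Schnitzel")
val t3 = (1001, "Käse")
val ys: List[(Int, Seq[String])] =
  List(t1, t2, t3).groupBy(_._1).mapValues(_.map(_._2)).toList
// roughly List((1001, List(Test, Käse)), (1002, List(Schnitzel))); a Map's ordering is not guaranteed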