Flattening the key of a RDD - scala

I have a Spark RDD of type (Array[breeze.linalg.DenseVector[Double]], breeze.linalg.DenseVector[Double]). I wish to flatten its key to transform it into an RDD of type (breeze.linalg.DenseVector[Double], breeze.linalg.DenseVector[Double]). I am currently doing:
val newRDD = oldRDD.flatMap(ob => anonymousOrdering(ob))
The signature of anonymousOrdering() is String => (Array[DenseVector[Double]], DenseVector[Double]).
It returns type mismatch: required: TraversableOnce[?]. The Python code doing the same thing is:
newRDD = oldRDD.flatMap(lambda point: [(tile, point) for tile in anonymousOrdering(point)])
How can I do the same thing in Scala? I generally use flatMapValues, but here I need to flatten the key.

If I understand your question correctly, you can do:
// map (not flatMap), because anonymousOrdering returns a single tuple per input
val tupleRDD = oldRDD.map(ob => anonymousOrdering(ob))
// tupleRDD is RDD[(Array[DenseVector[Double]], DenseVector[Double])]
In that case, you can "flatten" the Array portion of the tuple using pattern matching and a for/yield expression:
val newRDD = tupleRDD.flatMap { case (a, b) => for (v <- a) yield (v, b) }
// newRDD is RDD[(DenseVector[Double], DenseVector[Double])]
That said, it's still not clear to me where or how you want to use groupByKey.
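To try the flattening step without a cluster, here is a minimal sketch on plain Scala collections. It assumes Vector[Double] as a stand-in for breeze.linalg.DenseVector[Double] and a Seq as a stand-in for the RDD; the sample values and the name FlattenKeySketch are made up for illustration. RDD.flatMap has the same shape.

```scala
object FlattenKeySketch {
  // Stand-in for breeze.linalg.DenseVector[Double] (assumption, avoids the Breeze dependency).
  type Vec = Vector[Double]

  // Stand-in for RDD[(Array[Vec], Vec)]: each record carries an array of tiles as its key.
  val records: Seq[(Array[Vec], Vec)] = Seq(
    (Array(Vector(1.0), Vector(2.0)), Vector(10.0)),
    (Array(Vector(3.0)), Vector(20.0))
  )

  // Emit one (tile, point) pair per tile found in the key array.
  val flattened: Seq[(Vec, Vec)] = records.flatMap {
    case (tiles, point) => tiles.toSeq.map(tile => (tile, point))
  }
}
```

Each input record produces as many output pairs as its key array has elements, so the two records above become three pairs.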

Change the code to use map instead of flatMap:
val newRDD = oldRDD.map(ob => anonymousOrdering(ob)).groupByKey()
You would only want to use flatMap here if anonymousOrdering returned a list of tuples and you wanted it flattened.

As anonymousOrdering() is a function that you have in your code, update it to return a Seq[(breeze.linalg.DenseVector[Double], breeze.linalg.DenseVector[Double])]. It is like doing [(tile, point) for tile in anonymousOrdering(point)], but directly at the end of the function. The flatMap will then take care of creating one record for each element of the sequences.
As a general rule, avoid having a collection as a key in a RDD.

Related

How to Reduce by key in "Scala" [Not In Spark]

I am trying to reduce by key in plain Scala. Is there any method to reduce values based on their keys? (I know we can use the reduceByKey method in Spark, but how do we do the same in Scala?)
The input data is:
val File = Source.fromFile("C:/Users/svk12/git/data/retail_db/order_items/part-00000")
.getLines()
.toList
val map = File.map(x => x.split(","))
.map(x => (x(1),x(4)))
map.take(10).foreach(println)
After the above step I get this result:
(2,250.0)
(2,129.99)
(4,49.98)
(4,299.95)
(4,150.0)
(4,199.92)
(5,299.98)
(5,299.95)
Expected Result :
(2,379.99)
(5,499.93)
.......
Starting with Scala 2.13, you can use the groupMapReduce method, which is (as its name suggests) the equivalent of a groupBy followed by a mapValues and a reduce step:
io.Source.fromFile("file.txt")
.getLines.to(LazyList)
.map(_.split(','))
.groupMapReduce(_(1))(_(4).toDouble)(_ + _)
The groupMapReduce stage:
groups the split arrays by their 2nd element (_(1)) (group part of groupMapReduce)
maps each array within each group to its 5th element and converts it to Double (_(4).toDouble) (map part of groupMapReduce)
reduces the values within each group (_ + _) by summing them (reduce part of groupMapReduce).
This is a one-pass version of what can be translated by:
seq.groupBy(_(1)).mapValues(_.map(_(4).toDouble).reduce(_ + _))
Also note the conversion from Iterator to LazyList in order to use a collection which provides groupMapReduce (we don't use a Stream since, starting with Scala 2.13, LazyList is the recommended replacement for Stream).
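The equivalence between the one-pass and the two-step form can be checked on in-memory data. This is a minimal sketch that requires Scala 2.13 or later; the sample lines are hypothetical and only follow the column layout implied by the question (index 1 is the key, index 4 is the amount).

```scala
object GroupMapReduceSketch {
  // Hypothetical sample lines: index 1 is the grouping key, index 4 is the amount.
  val lines: List[String] = List(
    "1,2,957,1,250.0",
    "2,2,1073,1,129.5",
    "3,4,502,5,50.0",
    "4,4,403,1,100.25"
  )

  // One pass: group by column 1, map to column 4 as Double, reduce by summing.
  val onePass: Map[String, Double] =
    lines.map(_.split(',')).groupMapReduce(_(1))(_(4).toDouble)(_ + _)

  // Two steps: groupBy builds the intermediate groups, then each group is summed.
  val twoStep: Map[String, Double] =
    lines.map(_.split(',')).groupBy(_(1)).view.mapValues(_.map(_(4).toDouble).sum).toMap
}
```

The amounts were chosen so that the sums are exact in floating point; with real decimal amounts, comparing Double totals for equality is not reliable.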
It looks like you want the sum of some values from a file. One problem is that the file's contents are read as Strings, so you have to convert each String to a numeric type before it can be summed.
These are the steps you might use.
io.Source.fromFile("so.txt") //open file
.getLines() //read line-by-line
.map(_.split(",")) //each line is Array[String]
.toSeq //to something that can groupBy()
.groupBy(_(1)) //now is Map[String,Seq[Array[String]]]
.mapValues(_.map(_(4).toDouble).sum) //now is Map[String,Double] (the amounts are decimals)
.toSeq //un-Map it to (String,Double) tuples
.sorted //presentation order
.take(10) //sample
.foreach(println) //report
This will, of course, throw if any file data is not in the required format.
There is nothing built-in, but you can write it like this:
def reduceByKey[A, B](items: Traversable[(A, B)])(f: (B, B) => B): Map[A, B] = {
var result = Map.empty[A, B]
items.foreach {
case (a, b) =>
result += (a -> result.get(a).map(b1 => f(b1, b)).getOrElse(b))
}
result
}
There is some space to optimize this (e.g. use mutable maps), but the general idea remains the same.
Another approach, more declarative but less efficient (it creates several intermediate collections; it can be rewritten, but with a loss of clarity):
def reduceByKey[A, B](items: Traversable[(A, B)])(f: (B, B) => B): Map[A, B] = {
items
.groupBy { case (a, _) => a }
.mapValues(_.map { case (_, b) => b }.reduce(f))
// mapValues returns a view, view.force changes it back to a realized map
.view.force
}
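For reference, here is a usage sketch of such a helper. It is a fold-based variant of the first version above (no var), written against Iterable rather than Traversable, which is deprecated since Scala 2.13; the sample pairs and the name ReduceByKeySketch are assumptions for illustration.

```scala
object ReduceByKeySketch {
  // Fold over the pairs, combining each value with the one already stored for its key.
  def reduceByKey[A, B](items: Iterable[(A, B)])(f: (B, B) => B): Map[A, B] =
    items.foldLeft(Map.empty[A, B]) { case (acc, (k, v)) =>
      acc.updated(k, acc.get(k).map(f(_, v)).getOrElse(v))
    }

  // Hypothetical (key, amount) pairs.
  val pairs: List[(String, Double)] =
    List(("2", 250.0), ("2", 129.5), ("4", 50.0), ("4", 100.25))

  // Any associative function can be used as the reducer.
  val sums: Map[String, Double] = reduceByKey(pairs)(_ + _)
  val maxes: Map[String, Double] = reduceByKey(pairs)((a, b) => math.max(a, b))
}
```

Because the reducer is a parameter, the same helper covers sums, maxima, string concatenation and so on, which is the main thing reduceByKey offers over a hard-coded sum.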
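First group the tuples by key (the first element here), then reduce the values within each group.
The following code will work (the values are still Strings at this point, so convert them to Double before summing):
val reducedList = map.groupBy(_._1).map(l => (l._1, l._2.map(_._2.toDouble).reduce(_ + _)))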
print(reducedList)
Here is another solution using foldLeft:
val File : List[String] = ???
File.map(x => x.split(","))
.map(x => (x(1), x(4).toDouble)) // the amounts are decimals, so parse them as Double
.foldLeft(Map.empty[String, Double]) { case (state, (key, value)) => state.updated(key, state.getOrElse(key, 0.0) + value) }
.toSeq
.sortBy(_._1)
.take(10)
.foreach(println)

How to convert (key,array(value)) to (key,value) in Spark

I have a RDD like below:
val rdd1 = sc.parallelize(Array((1,Array((3,4),(4,5))),(2,Array((4,2),(4,4),(3,9)))))
which is RDD[(Int,Array[(Int,Int)])]. I want to get a result of type RDD[(Int,(Int,Int))] using an operation such as flatMap. In this example, the result should be:
(1,(3,4))
(1,(4,5))
(2,(4,2))
(2,(4,4))
(2,(3,9))
I am quite new to Spark, so what could I do to achieve this?
Thanks a lot.
You can use flatMap in your case like this:
val newRDD: RDD[(Int, (Int, Int))] = rdd1
.flatMap { case (k, values) => values.map(v => (k, v))}
Assuming your RDD is named rdd1, use the code below to get the data in the shape you want:
rdd1.flatMap(x => x._2.map(y => (x._1,y)))
The inner map reads x._2, which is the array, and pairs each of its elements y with the key. flatMap then emits those pairs as separate records. x._1 is the key, i.e. the first element of each tuple in the RDD.
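The same transformation can be tried on a plain Seq, which behaves like the RDD for this purpose. This is a minimal sketch using the data from the question; the name KeyArraySketch is made up, and the for-comprehension is shown only as an equivalent spelling of the flatMap.

```scala
object KeyArraySketch {
  // Same data as rdd1 in the question, held in a local Seq instead of an RDD.
  val data: Seq[(Int, Array[(Int, Int)])] = Seq(
    (1, Array((3, 4), (4, 5))),
    (2, Array((4, 2), (4, 4), (3, 9)))
  )

  // Pair every array element with its key, then flatten the per-key sequences.
  val viaFlatMap: Seq[(Int, (Int, Int))] =
    data.flatMap { case (k, values) => values.toSeq.map(v => (k, v)) }

  // Equivalent for-comprehension: the compiler rewrites it to flatMap/map calls.
  val viaFor: Seq[(Int, (Int, Int))] =
    for {
      (k, values) <- data
      v <- values.toSeq
    } yield (k, v)
}
```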

Apply a sequence of functions to value and get the final result

I wish to apply a sequence of functions to an object (each of the functions may return the same or modified object) and get the ultimate result returned by the last function.
Is there an idiomatic Scala way to turn this (pseudocode):
val pipeline = ListMap(("a" -> obj1), ("b" -> obj2), ("c" -> obj3))
into this?
val initial_value = Something("foo", "bar")
val result = obj3.func(obj2.func(obj1.func(initial_value)))
The pipeline is initialized at runtime and contains an undetermined number of "manglers".
I tried with foreach but it requires an intermediate var to store the result, and foldLeft only works on types of ListMap, while the initial value and the result are of type Something.
Thanks
This should do it:
pipeline.foldLeft(initial_value){case (acc, (k,obj)) => obj.func(acc)}
No idea why pipeline contains pairs, though.
Assuming input and output types are the same, I'd go with a reduceLeft and composition by andThen:
def pipe[A](a: A, funcs: List[A => A]): A = funcs.reduceLeft(_ andThen _)(a)
I think foldLeft is the right choice:
val pipeline = List("a"-> func1, "b"-> func2, "c"-> func3)
...
val result = pipeline.foldLeft(initial_value) {case (acc,(key,func)) => func(acc)}
Get rid of your keys, first:
pipeline.values.foldLeft(initial_value)((a, f) => f.func(a))
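To make the fold concrete, here is a minimal self-contained sketch. The Mangler trait, the three sample manglers and the fields of Something are assumptions made for illustration, since the question only shows pseudocode. ListMap is kept because it iterates in insertion order, which is what makes the pipeline order well defined.

```scala
import scala.collection.immutable.ListMap

object PipelineSketch {
  final case class Something(a: String, b: String)

  // Hypothetical interface for the "manglers" stored in the pipeline.
  trait Mangler { def func(s: Something): Something }

  val upperA: Mangler = new Mangler { def func(s: Something) = s.copy(a = s.a.toUpperCase) }
  val swap: Mangler = new Mangler { def func(s: Something) = Something(s.b, s.a) }
  val exclaim: Mangler = new Mangler { def func(s: Something) = s.copy(b = s.b + "!") }

  val pipeline: ListMap[String, Mangler] =
    ListMap("a" -> upperA, "b" -> swap, "c" -> exclaim)

  val initial: Something = Something("foo", "bar")

  // Thread the value through each mangler in insertion order.
  val result: Something = pipeline.values.foldLeft(initial)((acc, m) => m.func(acc))

  // Alternative: compose all steps into a single function first.
  val composed: Something => Something =
    Function.chain(pipeline.values.toSeq.map(m => (s: Something) => m.func(s)))
}
```

Unlike reduceLeft, both foldLeft and Function.chain handle an empty pipeline by returning the initial value unchanged.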

Spark use reduceByKey on nested structure

Currently I have a structure like this:
Array[(Int, Array[(String, Int)])], and I want to use reduceByKey on the Array[(String, Int)], which is inside the Array of tuple. I tried code like
//data is in Array[(Int, Array[(String, Int)])] structure
val result = data.map(l => (l._1, l._2.reduceByKey(_ + _)))
The error says that Array[(String, Int)] does not have a method called reduceByKey, and I understand that this method can only be used on an RDD. So my question is: is there any way to get reduceByKey-like behavior (not necessarily with this exact method) on the nested structure?
Thanks guys.
You can simply use the collection's own methods here (foldLeft in the example below), since the inner value is a plain collection and not an RDD (assuming you really meant the outer wrapper to be an RDD):
val data = sc.parallelize(List((1,List(("foo", 1), ("foo", 1)))))
data.map(l=>(l._1, l._2.foldLeft(List[(String, Int)]())((accum, curr)=>{
val accumAsMap = accum.toMap
accumAsMap.get(curr._1) match {
case Some(value : Int) => (accumAsMap + (curr._1 -> (value + curr._2))).toList
case None => curr :: accum
}
}))).collect
Ultimately, it seems that you do not understand what an RDD is, so you might want to read some of the docs on them.
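A shorter way to get reduceByKey-like behavior on the inner array is groupBy followed by a per-group sum. The sketch below uses plain local arrays and made-up sample data; with an RDD as the outer wrapper, the function passed to map would be the same.

```scala
object NestedReduceSketch {
  // Hypothetical data in the Array[(Int, Array[(String, Int)])] shape from the question.
  val data: Array[(Int, Array[(String, Int)])] = Array(
    (1, Array(("foo", 1), ("foo", 1), ("bar", 3))),
    (2, Array(("baz", 2)))
  )

  // For each outer record, group the inner pairs by key and sum the values per group.
  val result: Array[(Int, Map[String, Int])] = data.map { case (id, pairs) =>
    (id, pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) })
  }
}
```

The inner result is a Map here; call toArray on it if the original Array[(String, Int)] shape is needed.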

Iterate Over a tuple

I need to implement a generic method that takes a tuple and returns a Map
Example :
val tuple=((1,2),(("A","B"),("C",3)),4)
I have been trying to break this tuple into a list :
val list=tuple.productIterator.toList
Scala>list: List[Any] = List((1,2), ((A,B),(C,3)), 4)
But this returns a List[Any].
I am now trying to find out how to iterate over the following tuple, for example:
((1,2),(("A","B"),("C",3)),4)
in order to loop over each element 1, 2, "A", "B", etc. How can I iterate over the tuple like this?
What about this?
def flatProduct(t: Product): Iterator[Any] = t.productIterator.flatMap {
case p: Product => flatProduct(p)
case x => Iterator(x)
}
val tuple = ((1,2),(("A","B"),("C",3)),4)
flatProduct(tuple).mkString(",") // 1,2,A,B,C,3,4
Ok, the Any problem remains. At least that's due to the return type of productIterator.
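One way to recover some typing from the flattened Iterator[Any] is to pick out elements by runtime type with collect. This is only a sketch under that assumption, not a general solution: it recurses into every Product, so case classes, Options and non-empty Lists nested in the tuple would be flattened as well.

```scala
object FlatProductSketch {
  // Recursively flatten nested tuples (any Product) into their leaf elements.
  def flatProduct(t: Product): Iterator[Any] = t.productIterator.flatMap {
    case p: Product => flatProduct(p)
    case x          => Iterator(x)
  }

  val tuple = ((1, 2), (("A", "B"), ("C", 3)), 4)

  // collect keeps only the leaves matching the pattern, giving back typed lists.
  val ints: List[Int] = flatProduct(tuple).collect { case i: Int => i }.toList
  val strings: List[String] = flatProduct(tuple).collect { case s: String => s }.toList
}
```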
Instead of tuples, use Shapeless data structures like HList. You can have generic processing, and also don't lose type information.
The only problem is that documentation isn't very comprehensive.
// foreach rather than map: map on an Iterator is lazy, so nothing would be printed
tuple.productIterator.foreach {
case (a, b) => println((a, b))
case a => println(a)
}
This works for me. transform is a tuple consisting of DataFrames:
def apply_function(a: DataFrame) = a.write.format("parquet").save("..." + a + ".parquet")
transform.productIterator.map(_.asInstanceOf[DataFrame]).foreach(a => apply_function(a))