Break a tuple in an RDD into multiple tuples - Scala

I have an RDD[(String, (Iterable[Int], Iterable[Coordinate]))]
What I would like to do is break up the Iterable[Int] so that each of its elements yields a tuple of the form (String, Int, Iterable[Coordinate]).
For example, I would like to transform:
('a',<1,2,3>,<(45.34,32.33),(45.36,32.34)>)
('b',<1>,<(46.64,32.66),(46.67,32.71)>)
to
('a',1,<(45.34,32.33),(45.36,32.34)>)
('a',2,<(45.34,32.33),(45.36,32.34)>)
('a',3,<(45.34,32.33),(45.36,32.34)>)
('b',1,<(46.64,32.66),(46.67,32.71)>)
How is this done in Scala?

Try using flatMap:
rdd.flatMap { case (v, (i1, i2)) => i1.map(i => (v, i, i2)) }
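For example, a minimal sketch (assuming sc is a SparkContext, and treating Coordinate as a (Double, Double) alias since the question doesn't define it):
type Coordinate = (Double, Double) // assumption: the question does not define Coordinate

val rdd = sc.parallelize(Seq(
  ("a", (Iterable(1, 2, 3), Iterable[Coordinate]((45.34, 32.33), (45.36, 32.34)))),
  ("b", (Iterable(1), Iterable[Coordinate]((46.64, 32.66), (46.67, 32.71))))
))

rdd
  .flatMap { case (v, (ints, coords)) => ints.map(i => (v, i, coords)) } // one output tuple per Int
  .collect()
  .foreach(println)
// (a,1,List((45.34,32.33), (45.36,32.34)))
// (a,2,List((45.34,32.33), (45.36,32.34)))
// (a,3,List((45.34,32.33), (45.36,32.34)))
// (b,1,List((46.64,32.66), (46.67,32.71)))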

Related

reduceByKey of List[Int]

Suppose I have an RDD[(String, List[Int])], e.g. ("David", List(60, 70, 80)), ("John", List(70, 80, 90)). How can I use reduceByKey in Scala to calculate the average of each List[Int]? In the end, I want another RDD like ("David", 70), ("John", 80).
Something based on reduceByKey alone doesn't look right because of its type signature:
def reduceByKey(func: (V, V) => V): RDD[(K, V)]
V in your case is List[Int], so you would get back an RDD[(String, List[Int])].
A workaround is to keep V = List[Int] through the reduce, concatenating the lists per key, and only compute the average afterwards:
val rddAvg: RDD[(String, Int)] =
  rdd1
    .reduceByKey(_ ++ _)                                // merge all the lists for a key
    .mapValues(numbers => numbers.sum / numbers.length) // then average once per key
You could also use aggregateByKey: that function can return a different result type and would do the trick in one step.
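A minimal sketch of that idea, assuming rdd1: RDD[(String, List[Int])] as above: accumulate a (sum, count) pair per key, then divide at the end.
val rddAvg2: RDD[(String, Int)] =
  rdd1
    .aggregateByKey((0, 0))(
      (acc, numbers) => (acc._1 + numbers.sum, acc._2 + numbers.length), // fold one List[Int] into the (sum, count) accumulator
      (a, b) => (a._1 + b._1, a._2 + b._2)                               // merge accumulators across partitions
    )
    .mapValues { case (sum, count) => sum / count }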
Later edit: I dropped the example with groupByKey, as it is performance-wise inferior to reduceByKey or aggregateByKey for a use case like computing an average.
val data1 = List(("David", List(60, 70, 80)), ("John", List(70, 80, 90)))
val rdd1 = sc.parallelize(data1)
rdd1.mapValues(values => values.sum.toDouble / values.size).collect().foreach(println)

Flatten values in a paired RDD in Spark

I have a paired RDD that looks like
(a1, (a2, a3))
(b1, (b2, b3))
...
I want to flatten the values to obtain
(a1, a2, a3)
(b1, b2, b3)
...
Currently I'm doing
rddData.map(x => (x._1, x._2._1, x._2._2))
Is there a better way of performing the conversion? The above solution gets ugly if the value contains many elements instead of just two.
When I'm trying to avoid all the ugly underscore-number stuff that comes with tuple manipulation, I like to use case notation:
rddData.map { case (a, (b, c)) => (a, b, c) }
You can also give your variables meaningful names to make your code self-documenting, and the use of curly braces means fewer nested parentheses.
EDIT:
The map { case ... } pattern is pretty compact and can be used for surprisingly deep nested tuples, as long as the structure is known at compile time. If you absolutely, positively cannot know the structure of the tuple at compile time, then here is some hacky, slow code that can probably flatten any arbitrarily nested tuple... as long as there are no more than 22 elements in total. It works by recursively converting each element of the tuple to a list, flatMap-ing it to a single list, then using scary reflection to convert the list back into a tuple, as seen here.
def flatten(b: Product): List[Any] =
  b.productIterator.toList.flatMap {
    case x: Product => flatten(x) // recurse into nested tuples
    case y          => List(y)    // leaf value
  }

def toTuple(as: List[Any]): Product = {
  // look up scala.TupleN reflectively and instantiate it with the flattened values
  val tupleClass = Class.forName("scala.Tuple" + as.size)
  tupleClass.getConstructors.apply(0)
    .newInstance(as.map(_.asInstanceOf[AnyRef]): _*)
    .asInstanceOf[Product]
}
rddData.map(t => toTuple(flatten(t)))
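A quick usage sketch (with made-up nested pairs; assumes sc is a SparkContext):
val rddData = sc.parallelize(Seq(("a1", ("a2", "a3")), ("b1", (("b2", "b3"), "b4"))))
rddData.map(t => toTuple(flatten(t))).collect().foreach(println)
// (a1,a2,a3)
// (b1,b2,b3,b4)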
There is no better way. The first answer is equivalent to:
val abc2 = xyz.map{ case (k, v) => (k, v._1, v._2) }
which is equivalent to your own example.

How to sum on a groupBy with an iterator?

Given an Iterator[(String, Int)],
I would like to group by the String, sum the Ints, and return the result as a Map[String, Int].
You can convert it to a list or other strict structure:
iter.toList.groupBy(_._1).mapValues(_.map(_._2).sum)
If you don't want to convert to a strict structure (which forces all of the entries into memory), you can foldLeft and build the map as you go:
iter.foldLeft(Map.empty[String, Int]) { case (acc, (k, v)) =>
  acc + (k -> acc.get(k).map(_ + v).getOrElse(v))
}
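For example, a quick check with made-up input (a def gives a fresh iterator each time, since iterators are single-use):
def entries = Iterator(("a", 1), ("b", 2), ("a", 3))

entries.toList.groupBy(_._1).mapValues(_.map(_._2).sum)
// Map(a -> 4, b -> 2)

entries.foldLeft(Map.empty[String, Int]) { case (acc, (k, v)) =>
  acc + (k -> acc.get(k).map(_ + v).getOrElse(v))
}
// Map(a -> 4, b -> 2)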

Adding Constant to RDD

I have a really stupid question. I know that an RDD is immutable, but is there any way to add a column of constants to an RDD?
More specifically, I have an RDD[(a: String, b: String)], and I wish to add a column of 1's after it, so that I have an RDD[(a: String, b: String, c: Int)].
The reason is that I want to use the reduceByKey function to process these strings, and an arbitrary Int (which will be updated along the way) will help the function during the reduce.
The solution in Scala is simply to use map:
rdd.map(t => (t._1, t._2, 1))
Or
rdd.map{ case (a, b) => (a, b, 1)}
You can easily do it with the map function; here's an example in Python:
rdd.map(lambda t: (t[0], t[1], 1))
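Since the question mentions feeding the result into reduceByKey, here is a minimal sketch of one way the constant column is typically used (an assumption about the intended use, since the question doesn't spell it out):
// Hypothetical follow-up: key by the string pair and let reduceByKey
// sum the constant 1's, e.g. to count how often each (a, b) occurs.
val counts = rdd
  .map { case (a, b) => ((a, b), 1) }
  .reduceByKey(_ + _)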

Scala List of tuples to flat list

I have a list of tuple pairs, List[(String, String)], and want to flatten it into a list of strings, List[String].
Some of the options might be:
concatenate:
list.map(t => t._1 + t._2)
one after the other interleaved (after your comment it seems you were asking for this):
list.flatMap(t => List(t._1, t._2))
split and append them:
list.map(_._1) ++ list.map(_._2)
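For instance, with List(("a", "b"), ("c", "d")) these yield List(ab, cd), List(a, b, c, d), and List(a, c, b, d) respectively.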
Well, you can always use flatMap as in:
list flatMap (x => List(x._1, x._2))
Although your question is a little vague.
Try:
val tt = List(("John","Paul"),("George","Ringo"))
tt.flatMap{ case (a,b) => List(a,b) }
This results in:
List(John, Paul, George, Ringo)
In general, for lists of tuples of any arity, consider this:
myTuplesList.map(_.productIterator.map(_.toString)).flatten
Note that productIterator returns the tuple's elements typed as Any, hence we convert each value back to a String here.
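For example (made-up input, mixing tuple arities):
val myTuplesList = List(("John", "Paul", "George"), ("Ringo", "Pete"))
myTuplesList.map(_.productIterator.map(_.toString)).flatten
// List(John, Paul, George, Ringo, Pete)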
See https://stackoverflow.com/a/43716004/4610065.
In this case, using shapeless:
import shapeless.syntax.std.tuple._
List(("John","Paul"),("George","Ringo")).flatMap(_.toList)