How to flatMap nested lists in Spark - Scala

I have an RDD in Spark which looks like this -
[Foo1, Bar[bar1,bar2]]
Each Bar object has a getList method, which returns the lists [bar11, bar12, bar13] and [bar21, bar22] respectively. I want the output to look like this -
[Foo1, [bar11, bar12, bar13, bar21, bar22]]
The approach that I am able to think of is something like this -
my_rdd.map(x => (x._1, x._2.getList))
      .flatMap {
        case (x, y) => y.map((x, _))
      }
The first map operation gives me Foo1 together with all of the lists, but I am not able to flatten them beyond that.

In your code, x._2.getList returns a list of lists. Use the flatten method as follows to get the expected result:
my_rdd.map(x => (x._1,x._2.getList.flatten))

You can do this with one line:
my_rdd.mapValues(_.flatMap(_.getList))
There is another answer which uses map instead of mapValues. While that produces the same RDD elements, I think it's important to get into the habit of using the "minimal" function necessary with Spark RDDs, because you can pay a surprisingly large performance cost for using map instead of mapValues without realizing it: the map function on an RDD strips the partitioner, if one exists, while mapValues preserves it.
If you have an RDD[(K, V)] and call rdd.groupByKey(), you'll end up with an RDD[(K, Iterable[V])] that is partitioned by K. If you want to join it with another RDD by K, you've already done most of the work.
If you add a map in between the groupByKey() and the join, Spark will re-shuffle that RDD. This is very painful! mapValues is safe.
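A quick way to see the difference (a sketch, assuming a SparkContext sc and a hypothetical keyed RDD):
import org.apache.spark.rdd.RDD
val pairRdd: RDD[(String, Int)] = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
val grouped = pairRdd.groupByKey()                         // hash-partitioned by key
grouped.partitioner                                        // Some(HashPartitioner)
grouped.map { case (k, vs) => (k, vs.size) }.partitioner   // None -- map drops the partitioner
grouped.mapValues(_.size).partitioner                      // still Some(HashPartitioner)
A later join on the same key can reuse the partitioning of the mapValues result, whereas the map result has to be shuffled again.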

Related

Why does nested flatMap - map in Scala give an RDD of type Object instead of a list of tuples?

I have an rdd that I want to group according to some key, but it just doesn't work. I am a Scala and Spark beginner. So I have the following RDD:
rdd: RDD[WikipediaArticle]
val meinVal = rdd.flatMap(article => langs.map(lang => { if (article.mentionsLanguage(lang)) { Tuple2(lang, article) } else { None } })).filter(_ != None)
meinVal.collect.foreach(println) gives:
(Scala,WikipediaArticle(2,Scala and Java run on the JVM))
(Java,WikipediaArticle(2,Scala and Java run on the JVM))
(Scala,WikipediaArticle(3,Scala is not purely functional))
I have two questions:
Why can I not apply the groupByKey function? It is an rdd that contains tuples, and the first tuple entry is the key.
I don't see how to apply groupBy either. I thought I could do meinVal.groupBy(x => x._1), but that throws an error.
I notice that when I use an IDE and hover over meinVal, it shows that it is RDD[Object], whereas it should be RDD[(String, WikipediaArticle)]. (I do not know how to get this information without the IDE.) So it seems that the rdd just contains one big Object type; I only don't see why that is.
Ok, so thanks to this post https://stackoverflow.com/a/29426336/909909 I figured it out. The problem was not the nested flatMap-map construct, but the condition in the map instruction. In my code I returned None if the condition was not met; since None is not a tuple, the inferred element type is their common supertype, so I get an RDD[Object] and therefore cannot use groupByKey.
To solve this I wrap the result in an Option and then flatten the inner list to get rid of the Options and their Nones again.
val meinVal = rdd.flatMap(article => langs.map(lang => { if (article.mentionsLanguage(lang)) { Some(Tuple2(lang, article)) } else { None } }).flatten)
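For what it's worth, an equivalent sketch (same rdd, langs and mentionsLanguage as above) that avoids the Option/flatten step is to use collect on the inner list with a guard:
val meinVal = rdd.flatMap(article =>
  langs.collect { case lang if article.mentionsLanguage(lang) => (lang, article) }
)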

Scala's collect inefficient in Spark?

I am currently starting to learn to use Spark with Scala. The problem I am working on needs me to read a file, split each line on a certain character, filter the lines where one of the columns matches a predicate, and finally remove a column. So the basic, naive implementation is a map, then a filter, then another map.
This means going through the collection three times, which seemed quite unreasonable to me. So I tried replacing them with a single collect (the collect that takes a partial function as an argument), and much to my surprise this made it run much slower. I tried locally on regular Scala collections; there, as expected, the collect version is much faster.
So why is that? My idea is that the map, filter and map are not applied sequentially, but rather mixed into one operation; in other words, when an action forces evaluation, every element of the list is checked and the pending operations are executed. Is that right? But even so, why does collect perform so badly?
EDIT: a code example to show what I want to do:
The naive way:
sc.textFile(...).map(l => {
  val s = l.split(" ")
  (s(0), s(1))
}).filter(_._2.contains("hello")).map(_._1)
The collect way:
sc.textFile(...).collect {
  case s if s.split(" ")(0).contains("hello") => s(0)
}
The answer lies in the implementation of collect:
/**
 * Return an RDD that contains all matching values by applying `f`.
 */
def collect[U: ClassTag](f: PartialFunction[T, U]): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  filter(cleanF.isDefinedAt).map(cleanF)
}
As you can see, it's the same filter -> map sequence, but less efficient in your case.
In Scala, both the isDefinedAt and the apply methods of a PartialFunction evaluate the if guard.
So in your collect example, split is performed twice for each input element: once when isDefinedAt checks the guard, and again when apply is invoked.
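If the goal is a single pass with a single split per line, one option (a sketch, keeping the column layout of the naive example) is to pattern match on the result of split inside a flatMap:
sc.textFile(...).flatMap { l =>
  l.split(" ") match {                                         // split happens once per line
    case Array(first, second, _*) if second.contains("hello") => Some(first)
    case _                                                     => None
  }
}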

Spark - Convert Tuples to Tab-Separated String

I want to create a function that takes an RDD of tuples and converts each tuple to a tab separated string. I want the function to be able to handle Tuples of any size.
If I already have this RDD created, I can get the desired output using:
rdd.map(line => (0 to (line.productArity-1)).map(line.productElement(_)).toList.mkString("\t"))
How can I convert this piece of code to work as a function that takes an RDD of tuples, or is there a good library that already does this?
Something like this should work:
import org.apache.spark.rdd.RDD
def toTab[T <: Product](rdd: RDD[T]) = rdd.map(_.productIterator.mkString("\t"))
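A quick usage sketch (hypothetical data):
val pairs   = sc.parallelize(Seq((1, "a"), (2, "b")))
val triples = sc.parallelize(Seq((1, "a", true)))
toTab(pairs).collect().foreach(println)     // prints "1\ta" and "2\tb"
toTab(triples).collect().foreach(println)   // prints "1\ta\ttrue"
This works for any TupleN (or any case class), since they all implement Product.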

Conversion from scala parallel collection to regular collection

I'm trying to convert back from a parallel collection to a regular Map. According to the API, if I call toMap on any appropriately defined parallel collection, it's supposed to return a standard Map, but it's returning a ParMap over the flattened collection of iterables.
I have a
val tasks: Stream[Future[Iterable[Tuple2[String, String]]]]
And from which I get:
val res: ParSeq[Iterable[Tuple2[String, String]]] = tasks.par.map(f => f.apply())
Finally:
val finalresult = res.flatten.toMap
Unfortunately, the type of finalresult is ParMap[String, String].
On the other hand, if I call it like:
tasks.par.map(f => f.apply()).reduce(_++_).toMap
then the return type is Map[String, String].
Can someone tell me why this is? And (out of curiosity) how I can force convert a ParMap to a Map when scala won't let me?
Just as you go explicitly from a sequential to a parallel collection via .par, you go back to a sequential one via .seq. Since sets and maps have parallel implementations, the toMap and toSet calls leave the collection in its current domain.
The reduce example works because it, well, reduces the collection: the outer ParSeq disappears, leaving you with the inner (sequential) Iterable[Tuple2[...]], and calling toMap on that sequential collection gives you a sequential Map.
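A minimal sketch of the round trip (hypothetical data; on Scala 2.13+ the parallel collections live in the scala-parallel-collections module):
import scala.collection.parallel.CollectionConverters._   // only needed on Scala 2.13+
val pairs  = Seq("a" -> "1", "b" -> "2")
val parMap = pairs.par.toMap   // ParMap[String, String] -- toMap stays in the parallel domain
val seqMap = parMap.seq        // back to a plain sequential Map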

summing a transformation of a list of numbers in scala

I frequently need to sum a transformation of a list of numbers in Scala. One way to do this, of course, is:
list.map(transform(_)).sum
However, this allocates an intermediate collection when no allocation is really needed. An alternative is to fold the list:
list.foldLeft(0.0) { (total, x) => total + transform(x) }
I find the first expression far easier to write than the second. Is there a method that has the ease of the first with the efficiency of the second, or am I better off writing my own implicit method?
list.mapSum(transform(_))
You can use a view to make the transformer methods (map, filter, ...) lazy. See the Scala collections documentation on views for more information.
So for example in your case of a method called transform, you would write
list.view.map(transform).sum
(note you can optionally omit the (_))
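For example, with a hypothetical transform:
def transform(x: Double): Double = x * x
List(1.0, 2.0, 3.0).view.map(transform).sum   // 14.0, no intermediate list is materialized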
This operation is called foldMap, and you can find it in Scalaz.
list foldMap transform
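A self-contained sketch, assuming Scalaz 7 is on the classpath (using Int here so the standard Int Monoid applies):
import scalaz._, Scalaz._
def transform(x: Int): Int = x * x
List(1, 2, 3).foldMap(transform)   // 14, summed via the Monoid, with no intermediate list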