I'm trying to convert back from a parallel collection to a regular map. According to the API, if I call toMap on any appropriately defined parallel collection, it's supposed to return a standard Map, but instead it returns a ParMap over a flattened collection of iterables.
I have a
val tasks: Stream[Future[Iterable[Tuple2[String, String]]]]
And from which I get:
val res: ParSeq[Iterable[Tuple2[String, String]]] = tasks.par.map(f => f.apply())
Finally:
val finalresult = res.flatten.toMap
Unfortunately, the type of finalresult is ParMap[String, String].
On the other hand, if I call it like:
tasks.par.map(f => f.apply()).reduce(_++_).toMap
then the return type is Map[String, String].
Can someone tell me why this is? And (out of curiosity) how I can force convert a ParMap to a Map when scala won't let me?
Just as you go explicitly from a sequential to a parallel collection via .par, you go back to sequential via .seq. Since sets and maps have parallel implementations, toMap and toSet leave the collection in whichever domain it is already in.
The example with reduce works because it, well, reduces the collection: the outer ParSeq disappears, leaving you with the inner (sequential) Iterable[Tuple2[...]], and toMap on a sequential collection gives a sequential Map.
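As a minimal sketch (assuming the tasks/res definitions above and the classic pre-2.13 parallel collections), you can force your way back to a sequential Map with .seq, either on the final ParMap or before building the map at all:

val finalresult: Map[String, String] = res.flatten.toMap.seq   // ParMap -> immutable Map
val alt: Map[String, String] = res.seq.flatten.toMap           // drop back to sequential first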
Related
I have a RDD in spark which looks like this -
[Foo1, Bar[bar1,bar2]]
The Bar object has a getList method which may return the lists [bar11,bar12,bar13] and [bar21, bar22] respectively. I want the output to look like this -
[Foo1, [bar11, bar12, bar13, bar21, bar22]]
The approach that I am able to think of is something like this -
my_rdd.map(x => (x._1, x._2.getList))
  .flatMap {
    case (x, y) => y.map((x, _))
  }
The first map operation is returning me Foo1 and all the lists. However I am not able to flatten them beyond that.
In your code, x._2.getList returns a list of lists. Use the flatten method as follows to get the expected result:
my_rdd.map(x => (x._1,x._2.getList.flatten))
You can do this with one line:
my_rdd.mapValues(_.flatMap(_.getList))
There is another answer which uses map instead of mapValues. While this produces the same RDD elements, I think it's important to get into the practice of using the "minimal" function necessary with Spark RDDs, because you can actually pay a pretty huge performance cost for using map instead of mapValues without realizing it: the map function on RDD strips the partitioner, if it exists, and mapValues does not.
If you have an RDD[(K, V)] and call rdd.groupByKey(), you'll end up with an RDD[(K, Iterable[V])] that is partitioned by K. If you want to join with another RDD by K, you've already done most of the work.
If you add a map in between the groupByKey() and join, Spark will re-shuffle that RDD. This is very painful! mapValues is safe.
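A small sketch (assuming a SparkContext named sc) of the behaviour described above:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val grouped = pairs.groupByKey()                        // RDD[(String, Iterable[Int])], partitioned by key
val kept = grouped.mapValues(_.sum)                     // kept.partitioner == grouped.partitioner
val lost = grouped.map { case (k, vs) => (k, vs.sum) }  // lost.partitioner == None, so a later join re-shuffles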
The title pretty much sums it up. Option as a singleton collection can sometimes be confusing, but sometimes it allows for an interesting application. I have one example on top of my head, and would like to learn more of such examples.
My only example is running a for comprehension on an Option[List[T]]. We can do the following:
val v = Some(List(1, 2, 3))
for {
  list <- v.toList
  elem <- list
} yield elem + 1
Without having Option.toList, it wouldn't be possible to stay in the same for comprehension, and I'd be forced to write something like this:
for {
  list <- v
} yield for {
  elem <- list
} yield elem + 1
The first example is cleaner, and it's an advantage of Option being a collection. Of course, the result type will be different in these 2 examples, but let's assume it doesn't matter for the sake of discussion.
Any other examples? I'd especially like to concentrate on collection-like usage, and not usage of Option's monadic properties - those are pretty much obvious. In other words, map and flatMap functions are out of scope of this question. They're definitely very useful, just coming from elsewhere.
I find that the main benefit of working with Option[T] as a collection is that you get to use the operations defined on collections, such as map, flatMap, filter, foreach, etc. This makes it easier to operate on a given option instead of using pattern matching or checking Option[T].isDefined to see if a value exists.
For example, let's take the user repository example from Daniel Westheide's blog post about Option[T]:
Say you have a UserRepository object which returns users based on their ID. The user may or may not exist, hence it returns an Option[Person]. Now let's say we want to look up a person by id and then get their age. We can do:
val age: Option[Int] = UserRepository.findById(1).map(_.age)
Now let's say that a Person also has a gender property of type Option[String]. If you wanted to extract that out, you could use map:
val gender: Option[Option[String]] = UserRepository.findById(1).map(_.gender)
But working with nested options isn't too convenient. For that, you have flatMap:
val gender: Option[String] = UserRepository.findById(1).flatMap(_.gender)
And if we want to print out the gender if it exists, we can use foreach:
gender.foreach(println)
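Since these are ordinary map/flatMap/foreach operations, the same chain can also be written as a for comprehension (using the same hypothetical UserRepository as above):

val gender: Option[String] = for {
  user <- UserRepository.findById(1)
  g    <- user.gender
} yield g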
You'll find yourself working with Scala types that have nested Option[T] fields, and it's really handy to have collection-like methods that help you remove boilerplate and noise when extracting the actual value.
A more real-life use case I encountered just the other day was working with the awscala SDK, where I wanted to retrieve an object from S3 storage:
val bucket: Option[Bucket] = s3.bucket(amazonConfig.bucketName)
val result: Option[S3Object] = bucket.flatMap(_.get(amazonConfig.offsetKey))
result.flatMap(s3Object =>
  Source.fromInputStream(s3Object.content).mkString.decodeOption[Array[KafkaOffset]])
So what happens here is that you query the S3 service for a bucket, which may or may not exist. Then you want to extract the S3Object that actually contains the data, but the API itself returns an Option[S3Object], so flatMap gets you a flat Option[S3Object] instead of an Option[Option[S3Object]]. Finally, I want to deserialize the S3Object, which contains JSON, and the Argonaut library returns an Option[MyObject], so flatMap once again comes to the rescue for extracting the inner option.
Edit:
As you pointed out, map and flatMap belong to the monadic property of Option[T]. I've written a blog post describing the reduction of two options where the final solution was:
def reduce[T](a: Option[T], b: Option[T], f: (T, T) => T): Option[T] = {
  (a ++ b).reduceLeftOption(f)
}
This takes advantage of the ++ operator defined on collections, which is available on Option[T] through its implicit conversion to Iterable, i.e. by treating it as a collection.
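For illustration, assuming the definition above:

reduce[Int](Some(1), Some(2), _ + _)  // Some(3)
reduce[Int](Some(1), None, _ + _)     // Some(1)
reduce[Int](None, None, _ + _)        // None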
I'd suggest to take a look at the corresponding chapter of The Neophyte's Guide to Scala.
In my experience, the most useful collection-like uses of Option are filtering an Option and flatMapping over a collection with a function that returns an Option, which implicitly drops the None values.
I am currently starting to learn to use spark with Scala. The problem I am working on needs me to read a file, split each line on a certain character, then filtering the lines where one of the columns matches a predicate and finally remove a column. So the basic, naive implementation is a map, then a filter then another map.
This meant going through the collection three times, and that seemed quite unreasonable to me. So I tried replacing them with one collect (the collect that takes a partial function as an argument). And much to my surprise, this made it run much slower. I tried it locally on regular Scala collections; as expected, the collect version is much faster there.
So why is that? My idea is that the map, filter and map are not applied sequentially, but rather mixed into one operation; in other words, when an action forces evaluation, every element of the list will be checked and the pending operations will be executed. Is that right? But even so, why does collect perform so badly?
EDIT: a code example to show what I want to do:
The naive way:
sc.textFile(...).map(l => {
  val s = l.split(" ")
  (s(0), s(1))
}).filter(_._2.contains("hello")).map(_._1)
The collect way:
sc.textFile(...).collect {
  case s if s.split(" ")(1).contains("hello") => s.split(" ")(0)
}
The answer lies in the implementation of collect:
/**
 * Return an RDD that contains all matching values by applying `f`.
 */
def collect[U: ClassTag](f: PartialFunction[T, U]): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  filter(cleanF.isDefinedAt).map(cleanF)
}
As you can see, it's the same sequence of filter->map, but less efficient in your case.
In Scala, both the isDefinedAt and apply methods of a PartialFunction evaluate the guard (the if part).
So in your collect example, split is evaluated once in isDefinedAt and then again in apply for every element that matches.
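One way to keep collect but avoid the repeated split is to split once in a map first, so only the cheap guard is re-evaluated (a sketch, matching the two-column layout assumed above):

sc.textFile(...)
  .map(_.split(" "))
  .collect { case cols if cols(1).contains("hello") => cols(0) }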
What is the right way to convert a variable of type java.util.HashMap<java.lang.String, java.util.HashMap<java.lang.String, java.util.List<java.lang.String>>> in Java to its Scala equivalent, Map[String, Map[String, List[String]]]? (with Scala Map, String and List)
I tried to use import scala.collection.JavaConverters._ and do JavaNestedMap.asScala but it failed. Is there a smart way of doing this (rather than having two maps)?
There's no single-call way that I know of.
This is succinct but likely inefficient in a hot loop. Profile if it ends up being too slow, and then you'd want to use builders directly. (JMap below stands for java.util.Map, with scala.collection.JavaConverters._ imported.)
val in: JMap[String, JMap[String, String]] = ???
val out = in.asScala.mapValues(_.asScala)   // collection.Map[String, mutable.Map[String, String]]
val again: JMap[String, JMap[String, String]] = out.mapValues(_.asJava).asJava
It's worth noting that .asScala gives you a mutable map (a wrapper backed by the original Java map) for consistency with the Java side. If you want an immutable Map back, you need to call .toMap afterwards.
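For illustration (with JavaConverters imported as above):

val wrapped = new java.util.HashMap[String, String]().asScala  // scala.collection.mutable.Map, backed by the HashMap
val copied: Map[String, String] = wrapped.toMap                // independent immutable Map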
Or how to avoid accidental removal of duplicates when mapping a Set?
This is a mistake I'm doing very often. Look at the following code:
def countSubelements[A](sl: Set[List[A]]): Int = sl.map(_.size).sum
The function is supposed to count the accumulated size of all the contained lists. The problem is that after mapping the lists to their lengths, the result is still a Set, so duplicate sizes collapse and, for example, all lists of size 1 are reduced to a single representative.
Is it just me having this problem? Is there something I can do to prevent this happening? I think I'd love to have two methods mapToSet and mapToSeq for Set. But there is no way to enforce this, and sometimes you don't locally notice that you are working with a Set.
Maybe it's even possible that you were writing code for a Seq and something changes in another class and the underlying object becomes a Set?
Is there maybe a best practice to keep this situation from arising at all?
Remote edits break my code
Imagine the following situation:
val totalEdges = graph.nodes.map(_.getEdges).map(_.size).sum / 2
You fetch a collection of Node objects from a graph, use them to get their adjacent edges and sum over them. This works if graph.nodes returns a Seq.
And it breaks if someone decides to make Graph return its nodes as a Set, without this code ever looking suspicious (at least not to me; do you expect that every collection could possibly end up being a Set?) and without touching it.
It seems there will be many possible gotchas if one expects a Seq and gets a Set. It's not a surprise that method implementations can depend on the type of the object and (with overloading) the arguments. With Scala implicits, a method can even depend on the expected return type.
A way to defend against surprises is to explicitly label types. For example, annotating methods with return types, even if it's not required. At least this way, if the type of graph.nodes is changed from Seq to Set, the programmer is aware that there's potential breakage.
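A sketch of that defence, reusing the Graph/Node names assumed from the question: with an explicit expected type, a change of graph.nodes from Seq to Set becomes a compile error instead of a silently wrong sum.

val edgeCounts: Seq[Int] = graph.nodes.map(_.getEdges).map(_.size)  // fails to compile if nodes becomes a Set
val totalEdges = edgeCounts.sum / 2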
For your specific issue, why not define your own mapToSeq method:
scala> def mapToSeq[A, B](t: Traversable[A])(f: A => B): Seq[B] =
t.map(f)(collection.breakOut)
mapToSeq: [A, B](t: Traversable[A])(f: A => B)Seq[B]
scala> mapToSeq(Set(Seq(1), Seq(1,2)))(_.sum)
res1: Seq[Int] = Vector(1, 3)
scala> mapToSeq(Seq(Seq(1), Seq(1,2)))(_.sum)
res2: Seq[Int] = Vector(1, 3)
The advantage of using breakOut (a CanBuildFrom) is that the conversion from a Set to a Seq has no additional overhead: the Seq is built directly, without an intermediate Set.
You can use the pimp-my-library pattern to make mapToSeq appear to be part of the Traversable trait, which is inherited by both Seq and Set.
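A sketch of that enrichment (pre-2.13, since breakOut was removed in Scala 2.13; the object and class names here are made up):

object SetSeqSyntax {
  implicit class TraversableMapToSeq[A](private val t: Traversable[A]) {
    def mapToSeq[B](f: A => B): Seq[B] = t.map(f)(collection.breakOut)
  }
}

import SetSeqSyntax._
Set(List(1), List(2)).mapToSeq(_.size)  // Seq(1, 1): the duplicate sizes are preserved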
Workarounds:
def countSubelements[A](sl: Set[List[A]]): Int = sl.toSeq.map(_.size).sum
def countSubelements[A](sl: Set[List[A]]): Int = sl.foldLeft(0)(_ + _.size)
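To illustrate the collapse and the first workaround:

val sl = Set(List(1), List(2), List(3, 4))
sl.map(_.size).sum        // 3 -- the two size-1 lists collapsed into one element
sl.toSeq.map(_.size).sum  // 4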