Flattening a Set of pairs of sets to one pair of sets - scala

I have a for-comprehension with a generator over a Set[MyType].
MyType has a lazy val called factsPair, which returns a pair of sets:
(Set[MyFact], Set[MyFact]).
I want to loop through all of them and merge the facts into one flattened pair (Set[MyFact], Set[MyFact]) as shown below. However, I am getting "No implicit view available ..." and "not enough arguments for flatten: implicit (asTraversable ..." errors. (I am fairly new to Scala, so I am still getting used to the error messages.)
lazy val allFacts =
  (for {
    mytype <- mytypeList
  } yield mytype.factsPair).flatten
What do I need to specify to flatten for this to work?

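Scala's flatten only works when the elements of the collection are themselves collections. You have a collection of (Set[MyFact], Set[MyFact]) tuples, and a tuple is not a collection, so it can't be flattened.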
I would recommend learning the foldLeft function, because it's very general and quite easy to use as soon as you get the hang of it:
lazy val allFacts = mytypeList.foldLeft((Set[MyFact](), Set[MyFact]())) {
  case (accumulator, next) =>
    val pairs1 = accumulator._1 ++ next.factsPair._1
    val pairs2 = accumulator._2 ++ next.factsPair._2
    (pairs1, pairs2)
}
The first parameter list takes the initial value that the other elements are merged into. We start with an empty Tuple2[Set[MyFact], Set[MyFact]], initialized like this: (Set[MyFact](), Set[MyFact]()).
Next we specify the function that takes the accumulator and the next element and returns a new accumulator that includes the next element. Because of all the tuple accessors it doesn't look nice, but it works.
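A self-contained sketch of the same fold, with the accumulator tuple destructured in the pattern so the _1/_2 accessors go away. MyFact and MyType here are stand-in definitions made up for illustration; the real ones are whatever the question's code uses.

```scala
// Hypothetical stand-ins for the question's types.
case class MyFact(name: String)
case class MyType(factsPair: (Set[MyFact], Set[MyFact]))

val mytypeList: Set[MyType] = Set(
  MyType((Set(MyFact("a")), Set(MyFact("x")))),
  MyType((Set(MyFact("b")), Set(MyFact("x"), MyFact("y"))))
)

// Start from a pair of empty sets and union each side as we go.
val allFacts: (Set[MyFact], Set[MyFact]) =
  mytypeList.foldLeft((Set.empty[MyFact], Set.empty[MyFact])) {
    case ((accLeft, accRight), next) =>
      val (left, right) = next.factsPair
      (accLeft ++ left, accRight ++ right)
  }
// allFacts == (Set(MyFact("a"), MyFact("b")), Set(MyFact("x"), MyFact("y")))
```

Since the sides are sets, the duplicate MyFact("x") collapses into a single element.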

You won't be able to use flatten for this, because flatten on a collection returns a collection, and a tuple is not a collection.
You can, of course, just split, flatten, and join again:
val pairs = for {
  mytype <- mytypeList
} yield mytype.factsPair
val (first, second) = pairs.unzip
val allFacts = (first.flatten, second.flatten)

A tuple isn't traversable, so you can't flatten over it. You need to return something that can be iterated over, like a List, for example:
List((1,2), (3,4)).flatten // bad: does not compile, a tuple is not a collection
List(List(1,2), List(3,4)).flatten // good: List(1, 2, 3, 4)

I'd like to offer a more algebraic view. What you have here can be nicely solved using monoids. For each monoid there is a zero element and an operation to combine two elements into one.
In this case, sets form a monoid: the zero element is the empty set and the operation is union. And if we have two monoids, their Cartesian product is also a monoid, where the operations are defined pairwise (see examples on Wikipedia).
Scalaz defines monoids for sets as well as tuples, so we don't need to do anything there. We'll just need a helper function that combines multiple monoid elements into one, which is implemented easily using folding:
def msum[A](ps: Iterable[A])(implicit m: Monoid[A]): A =
  ps.foldLeft(m.zero)(m.append(_, _))
(perhaps there already is such a function in Scala, I didn't find it). Using msum we can easily define
def pairs(ps: Iterable[MyType]): (Set[MyFact], Set[MyFact]) =
  msum(ps.map(_.factsPair))
using Scalaz's implicit monoids for tuples and sets.
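For readers without Scalaz on the classpath, the idea can be sketched with a small hand-rolled type class. The SimpleMonoid trait, its instances, and msumSimple below are illustrative definitions, not Scalaz's actual API; they only show that a set monoid plus a pairwise product monoid is enough to fold pairs of sets.

```scala
// A minimal, hand-rolled monoid type class (illustrative; not Scalaz's API).
trait SimpleMonoid[A] {
  def zero: A
  def append(x: A, y: A): A
}

// Sets form a monoid: empty set as zero, union as the operation.
implicit def setMonoid[T]: SimpleMonoid[Set[T]] = new SimpleMonoid[Set[T]] {
  def zero: Set[T] = Set.empty[T]
  def append(x: Set[T], y: Set[T]): Set[T] = x ++ y
}

// The product of two monoids is a monoid, combined component-wise.
implicit def pairMonoid[A, B](implicit ma: SimpleMonoid[A], mb: SimpleMonoid[B]): SimpleMonoid[(A, B)] =
  new SimpleMonoid[(A, B)] {
    def zero: (A, B) = (ma.zero, mb.zero)
    def append(x: (A, B), y: (A, B)): (A, B) =
      (ma.append(x._1, y._1), mb.append(x._2, y._2))
  }

// Fold any collection of monoid elements into one, starting from zero.
def msumSimple[A](ps: Iterable[A])(implicit m: SimpleMonoid[A]): A =
  ps.foldLeft(m.zero)(m.append)

val combined: (Set[Int], Set[String]) =
  msumSimple(List((Set(1), Set("a")), (Set(2), Set("a", "b"))))
// combined == (Set(1, 2), Set("a", "b"))
```

The compiler assembles the instance for (Set[Int], Set[String]) from pairMonoid and two setMonoid instances, which is the same mechanism the Scalaz version relies on.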

Related

scala add _2 s in a list of Tuple 2

I have the following mutable Hashmap in Scala:
HashMap((b,3), (c,4), (a,8), (a,2))
and I need to convert it to the following:
HashMap((b,3), (c,4), (a,10))
I need something like the logic of the reduceByKey function.
Here is my code:
import scala.collection.mutable

def main(args: Array[String]): Unit = {
  val m = new mutable.HashMap[String, Tuple2[String, Int]]()
  println("Hello, world")
  m.+=(("xx", ("a", 2)))
  m.+=(("uu", ("b", 3)))
  m.+=(("zz", ("a", 8)))
  m.+=(("yy", ("c", 4)))
  println(m.values)
}
For pre 2.13 Scala versions you can try using groupBy with map:
m.values
.groupBy(_._1)
.mapValues(_.map(_._2).sum)
It sounds like what you have is not a hashmap but m.values of type Iterable[Tuple2[String, Int]], which is more manageable. In that case, as hinted at in the comments, groupMapReduce does it all in one function. This function groups "matching" elements together, applies a transformation to each element, and then reduces the groups using a binary operation.
m.values.groupMapReduce(_._1)(_._2)(_ + _)
This says "Group the values by the first element of their tuple, then keep the second element (i.e. the number), and then add all of the numbers in each group". This produces a map from the first element of the tuple to the sum.
Map(a -> 10, b -> 3, c -> 4)
Note that this is a Map, not necessarily a HashMap. If you want a HashMap (e.g. for mutability), you'll need to convert it yourself.
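A self-contained sketch of the whole round trip, assuming Scala 2.13 or later (both groupMapReduce and HashMap.from were added in 2.13). The names summed and summedMutable are made up for this example.

```scala
import scala.collection.mutable

// Same data as in the question; the outer keys play no part in the sum.
val m = mutable.HashMap(
  "xx" -> ("a", 2),
  "uu" -> ("b", 3),
  "zz" -> ("a", 8),
  "yy" -> ("c", 4)
)

// Immutable Map from label to summed count.
val summed: Map[String, Int] = m.values.groupMapReduce(_._1)(_._2)(_ + _)
// summed == Map("a" -> 10, "b" -> 3, "c" -> 4)

// Copy into a mutable HashMap only if mutability is actually needed.
val summedMutable: mutable.HashMap[String, Int] = mutable.HashMap.from(summed)
```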

Scala. Need for loop where the iterations return a growing list

I have a function, pairUp, that takes a value and returns a list of pairs,
and a key set, flightPass.keys.
I want to write a for loop that runs pairUp for each value of flightPass.keys and returns one big list of all the returned values.
val result:List[(Int, Int)] = pairUp(flightPass.keys.toSeq(0)).toList
for (flight<- flightPass.keys.toSeq.drop(1))
{val result:List[(Int, Int)] = result ++ pairUp(flight).toList}
I've tried a few different variations on this, always getting the error:
<console>:23: error: forward reference extends over definition of value result
for (flight<- flightPass.keys.toSeq.drop(1)) {val result:List[(Int, Int)] = result ++ pairUp(flight).toList}
^
I feel like this should work in other languages, so what am I doing wrong here?
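First off, you've defined result as a val, which means it is immutable and can't be reassigned. On top of that, the val result inside the loop body declares a new value that shadows the outer one, so the result on its right-hand side refers to itself before it is defined. That is what the forward reference error is complaining about.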
So if you want to apply "pairUp for each value of flightPass.keys", why not map()?
val result = flightPass.keys.map(pairUp) //add .toList if needed
A Scala method which converts a List of values into a List of Lists and then reduces them to a single List is called flatMap which is short for map then flatten. You would use it like this:
flightPass.keys.toSeq.flatMap(k => pairUp(k))
This will take each 'key' from flightPass.keys and pass it to pairUp (the mapping part), then take the resulting Lists from each call to pairUp and 'flatten' them, resulting in a single joined list.
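A runnable sketch of the flatMap approach. The flightPass map and the body of pairUp are hypothetical stand-ins, since the question does not show them; only the shape (a key goes in, a sequence of pairs comes out) matters here.

```scala
// Hypothetical data and helper, standing in for the question's definitions.
val flightPass: Map[Int, String] = Map(1 -> "LHR", 2 -> "JFK", 3 -> "SFO")

// Pairs a flight id with each smaller positive id (illustrative logic only).
def pairUp(flight: Int): Seq[(Int, Int)] =
  (1 until flight).map(other => (other, flight))

// Sort the keys so the order of the result is deterministic.
val result: List[(Int, Int)] =
  flightPass.keys.toList.sorted.flatMap(pairUp)
// result == List((1, 2), (1, 3), (2, 3))
```

No var or explicit loop is needed: the list is built in one expression and bound to a val once.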

Spark Dataset equivalent for scala's "collect" taking a partial function

Regular scala collections have a nifty collect method which lets me do a filter-map operation in one pass using a partial function. Is there an equivalent operation on spark Datasets?
I'd like it for two reasons:
syntactic simplicity
it reduces filter-map style operations to a single pass (although in spark I am guessing there are optimizations which spot these things for you)
Here is an example to show what I mean. Suppose I have a sequence of options and I want to extract and double just the defined integers (those in a Some):
val input = Seq(Some(3), None, Some(-1), None, Some(4), Some(5))
Method 1 - collect
input.collect {
case Some(value) => value * 2
}
// List(6, -2, 8, 10)
The collect makes this quite neat syntactically and does one pass.
Method 2 - filter-map
input.filter(_.isDefined).map(_.get * 2)
I can carry this kind of pattern over to spark because datasets and data frames have analogous methods.
But I don't like this so much because isDefined and get seem like code smells to me. There's an implicit assumption that map is receiving only Somes. The compiler can't verify this. In a bigger example, that assumption would be harder for a developer to spot and the developer might swap the filter and map around for example without getting a syntax error.
Method 3 - fold* operations
input.foldRight[List[Int]](Nil) {
  case (nextOpt, acc) => nextOpt match {
    case Some(next) => next * 2 :: acc
    case None => acc
  }
}
I haven't used spark enough to know if fold has an equivalent so this might be a bit tangential.
Anyway, the pattern match, the fold boiler plate and the rebuilding of the list all get jumbled together and it's hard to read.
So overall I find the collect syntax the nicest and I'm hoping spark has something like this.
The answers here are incorrect, at least with the current version of Spark.
RDDs do in fact have a collect method that takes a partial function and applies a filter & map to the data. This is completely different from the parameterless .collect() method. See the Spark source code RDD.scala # line 955:
/**
 * Return an RDD that contains all matching values by applying `f`.
 */
def collect[U: ClassTag](f: PartialFunction[T, U]): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  filter(cleanF.isDefinedAt).map(cleanF)
}
This does not materialize the data from the RDD, as opposed to the parameterless .collect() method in RDD.scala # line 923:
/**
 * Return an array that contains all of the elements in this RDD.
 */
def collect(): Array[T] = withScope {
  val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
  Array.concat(results: _*)
}
In the documentation, notice how the
def collect[U](f: PartialFunction[T, U]): RDD[U]
method does not have a warning associated with it about the data being loaded into the driver's memory:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD#collect[U](f:PartialFunction[T,U])(implicitevidence$29:scala.reflect.ClassTag[U]):org.apache.spark.rdd.RDD[U]
It's very confusing for Spark to have these overloaded methods doing completely different things.
Edit: My mistake! I misread the question; we're talking about Datasets, not RDDs. Still, the accepted answer says that
"the Spark documentation points out, however, "this method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory."
Which is incorrect! The data is not loaded into the driver's memory when calling the partial function version of .collect() - only when calling the parameterless version. Calling .collect(partial_function) should have about the same performance as calling .filter() and .map() sequentially, as shown in the source code above.
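The equivalence can be checked on plain Scala collections, without Spark. This is only a sketch of the semantics (collect with a partial function behaves like a filter on isDefinedAt followed by a map); it says nothing about how Spark schedules or optimizes the work. The name doubleDefined is made up for this example.

```scala
val input: Seq[Option[Int]] = Seq(Some(3), None, Some(-1), None, Some(4), Some(5))

// A partial function defined only for Some values.
val doubleDefined: PartialFunction[Option[Int], Int] = {
  case Some(value) => value * 2
}

val viaCollect: Seq[Int] = input.collect(doubleDefined)

// The same shape as the RDD source above: filter on isDefinedAt, then map.
val viaFilterMap: Seq[Int] =
  input.filter(doubleDefined.isDefinedAt).map(doubleDefined)
// Both are Seq(6, -2, 8, 10)
```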
Just for the sake of completeness:
The RDD API does have such a method, so it's always an option to convert a given Dataset / DataFrame to RDD, perform the collect operation and convert back, e.g.:
val dataset = Seq(Some(1), None, Some(2)).toDS()
val dsResult = dataset.rdd.collect { case Some(i) => i * 2 }.toDS()
However, this will probably perform worse than using a map and filter on the Dataset (for the reason explained in @stefanobaghino's answer).
As for DataFrames, this particular example (using Option) is somewhat misleading, as the conversion into a DataFrame actually does the "flattening" of Options into their values (or null for None), so the equivalent expression would be:
val dataframe = Seq(Some(1), None, Some(2)).toDF("opt")
dataframe.withColumn("opt", $"opt".multiply(2)).filter(not(isnull($"opt")))
Which, I think, suffers less from your concerns of having the map operation "assume" anything about its input.
The collect method defined over RDDs and Datasets is used to materialize the data in the driver program.
Despite not having something akin to the Collections API collect method, your intuition is right: since both operations are evaluated lazily, the engine has the opportunity to optimize the operations and chain them so that they are performed with maximum locality.
For the use case you mentioned in particular I would suggest you take flatMap in consideration, which works on both RDDs and Datasets:
// Assumes the usual spark-shell environment
// sc: SparkContext, spark: SparkSession
val collection = Seq(Some(1), None, Some(2), None, Some(3))
val rdd = sc.parallelize(collection)
val dataset = spark.createDataset(rdd)
// Both operations will yield `Array(2, 4, 6)`
rdd.flatMap(_.map(_ * 2)).collect
dataset.flatMap(_.map(_ * 2)).collect
// You can also express the operation in terms of a for-comprehension
(for (option <- rdd; n <- option) yield n * 2).collect
(for (option <- dataset; n <- option) yield n * 2).collect
// The same approach is valid for traditional collections as well
collection.flatMap(_.map(_ * 2))
for (option <- collection; n <- option) yield n * 2
EDIT
As correctly pointed out in another question, RDDs actually have the collect method that transforms an RDD by applying a partial function just like it happens in normal collections. As the Spark documentation points out, however, "this method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory."
I just wanted to extend stefanobaghino's answer by including an example of a for comprehension with a case class as many use cases for this will probably involve case classes.
Also options are monads which makes the accepted answer very simple in this case as the for neatly drops out the None values, but that approach wouldn't extend to non-monads like case classes:
case class A(b: Boolean, i: Int, d: Double)
val collection = Seq(A(true, 3, 0.0), A(false, 10, 0.0), A(true, -1, 0.0)) // the d values are placeholders
val rdd = ...
val dataset = ...
// Select out and double all the 'i' values where 'b' is true:
for {
  A(b, i, _) <- dataset
  if b
} yield i * 2
You can always create your own extension method:
import org.apache.spark.sql.{Dataset, Encoder}

implicit class DatasetOps[T](ds: Dataset[T]) {
  def collectt[U](pf: PartialFunction[T, U])(implicit enc: Encoder[U]): Dataset[U] = {
    ds.flatMap(pf.lift(_))
  }
}
such that:
// val ds = Dataset(1, 2, 3)
ds.collectt { case x if x % 2 == 1 => x * 3 }
// Dataset(3, 9)
Note that I've unfortunately not been able to name it collect (thus the awful suffix t) as the signature would otherwise (I think) clash with the existing Dataset#collect method that transforms a Dataset into an Array.
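The trick behind collectt is that lift turns a partial function into a total function returning an Option, and flatMap then drops the None results. A sketch on a plain Seq rather than a Dataset, so it runs without Spark; tripleOdd and lifted are names made up for this example.

```scala
val tripleOdd: PartialFunction[Int, Int] = {
  case x if x % 2 == 1 => x * 3
}

// lift turns PartialFunction[Int, Int] into Int => Option[Int].
val lifted: Int => Option[Int] = tripleOdd.lift

val xs = Seq(1, 2, 3)

// flatMap drops the None results and unwraps the Some results.
val viaLift: Seq[Int] = xs.flatMap(x => lifted(x))
// viaLift == Seq(3, 9), the same as xs.collect(tripleOdd)
```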

How to create a nested ListBuffer within another ListBuffer n times in Scala?

I have an empty ListBuffer[ListBuffer[(String, Int)]]() initialized like so, and given a number n, I want to fill it with n ListBuffer[(String, Int)].
For example, if n=2 then I can initialize two ListBuffer[(String, Int)] within ListBuffer[ListBuffer[(String, Int)]](), if that makes any sense. I was trying to loop n times and use the insertAll function to insert an empty list, but it didn't work.
Use fill
fill is a standard Scala library function that fills a data structure with predefined elements. It's quite handy and saves a lot of typing.
ListBuffer.fill(100)(ListBuffer("Scala" -> 1))
Scala REPL
scala> import scala.collection.mutable._
import scala.collection.mutable._
scala> ListBuffer.fill(100)(ListBuffer("Scala" -> 1))
res4: scala.collection.mutable.ListBuffer[scala.collection.mutable.ListBuffer[(String, Int)]] = ListBuffer(ListBuffer((Scala,1)), ListBuffer((Scala,1)), ListBuffer((Scala,1)), ListBuffer((Scala,1)), ListBuffer((Scala,1)) ...
fill implementation in Standard library
def fill[A](n: Int)(elem: => A): CC[A] = {
  val b = newBuilder[A]
  b.sizeHint(n)
  var i = 0
  while (i < n) {
    b += elem
    i += 1
  }
  b.result()
}
The above implementation is for a one-dimensional data structure.
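Applied to the question as asked (n empty inner buffers), a sketch might look like the following. Because elem is a by-name parameter (elem: => A in the signature above), the expression is re-evaluated for every slot, so each inner buffer is a distinct instance. The names buffers, first, and sizes are made up for this example.

```scala
import scala.collection.mutable.ListBuffer

val n = 2

// elem is by-name, so each slot gets its own fresh, empty buffer.
val buffers: ListBuffer[ListBuffer[(String, Int)]] =
  ListBuffer.fill(n)(ListBuffer.empty[(String, Int)])

// Append to the first inner buffer only (+= returns the buffer it mutated).
val first: ListBuffer[(String, Int)] = buffers.head += ("Scala" -> 1)

// The second buffer is untouched, which shows the instances are not shared.
val sizes: List[Int] = buffers.map(_.size).toList
// sizes == List(1, 0)
```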
General suggestions
It looks like you are using Scala the Java way. This is not ideal; embrace the functional way of doing things for its benefits.
Use immutable collections like List and Vector instead of mutable collections. Do not use mutable collections unless you have a strong reason to.
The same thing can be done using an immutable List:
List.fill(100)(List("scala" -> 1))
"scala" -> 1 is the same as ("scala", 1)

Can we replace map with flatMap?

I was trying to find the line with the most words, and I wrote the following lines to run in spark-shell:
import java.lang.Math
val counts = textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
But map is one-to-one, while flatMap is one-to-zero-or-more, so I tried replacing map with flatMap in the code above. It gives this error:
<console>:24: error: type mismatch;
found : Int
required: TraversableOnce[?]
val counts = F1.flatMap(s => s.split(" ").size).reduce((a,b)=> Math.max(a,b))
If anybody could help me understand the reason, it would be really helpful.
The function you pass to flatMap must return a collection (a TraversableOnce, as the error message says), which is clearly not what you want here. You do want a map, because you want to map a line to its number of words: a one-to-one function that takes a line and returns a count (though you could, of course, create a collection with one element, the size...).
flatMap is meant to associate a collection with each input. For instance, if you wanted to map a line to all its words, you would do:
val words = textFile.flatMap(x => x.split(" "))
and that would return an RDD[String] containing all the words.
In the end, map transforms an RDD of size N into another RDD of size N (e.g. your lines to their length) whereas flatMap transforms an RDD of size N into an RDD of size P (actually an RDD of size N into an RDD of size N made of collections, all these collections are then flattened to produce the RDD of size P).
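The size relationship can be seen on a plain Scala Seq standing in for the RDD (an assumption made so the sketch runs without Spark; the sample lines are made up).

```scala
val lines: Seq[String] = Seq("to be or", "not to be", "that is")

// map: one output per input, so 3 lines give 3 word counts.
val wordCounts: Seq[Int] = lines.map(line => line.split(" ").length)
// wordCounts == Seq(3, 3, 2)

// flatMap: each line yields a collection of words, then all are flattened.
val words: Seq[String] = lines.flatMap(line => line.split(" ").toSeq)
// words.size == 8

// The question's original goal: the largest word count on any line.
val maxWords: Int = wordCounts.reduce((a, b) => math.max(a, b))
// maxWords == 3
```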
P.S.: one last word that has nothing to do with your problem: it is more efficient (for a string s) to write
val nbWords = s.split(" ").length
than to call .size. The split method returns an Array[String], and arrays do not have a size method of their own. So when you call .size, an implicit conversion wraps the Array[String] in a collection wrapper (ArrayOps), which may create a new object. Array[T] does have a length field, so calling length involves no conversion. (It's a detail, but I think it's a good habit.)
Any use of map can be replaced by flatMap, but the function argument has to be changed to return a single-element List: textFile.flatMap(line => List(line.split(" ").size)). This isn't a good idea: it just makes your code less understandable and less efficient.
After reading the part of "Tired of Null Pointer Exceptions? Consider Using Java SE 8's Optional!" about why to use flatMap() rather than map(), I realized that the true reason flatMap() cannot replace map() is that map() is not a special case of flatMap().
It's true that flatMap() means one-to-many, but that's not the only thing flatMap() does. Put simply, it can also strip the outer Stream.
See the definitions of map and flatMap:
<R> Stream<R> map(Function<? super T, ? extends R> mapper)
<R> Stream<R> flatMap(Function<? super T, ? extends Stream<? extends R>> mapper)
The only difference is the return type of the inner function. What map() returns is a Stream of whatever the inner function returned, while what flatMap() returns is, roughly, just what the inner function returned.
So you can say that flatMap() can strip the outer Stream away, but map() can't. This is the main difference in my opinion, and also why map() is not just a special case of flatMap().
P.S.:
If you really want a one-to-one mapping with flatMap, you have to change it into one-to-List(one). That is, you add an outer Stream manually, which flatMap() strips again later. After that you get the same effect as with map(). (It's clumsy, of course, so don't do that.)
Here are examples for Java 8, but the same applies in Scala:
Using map():
list.stream().map(line -> line.split(" ").length)
Using flatMap() (discouraged):
list.stream().flatMap(line -> Arrays.asList(line.split(" ").length).stream())