Scala. Need for loop where the iterations return a growing list - scala

I have a function that takes a value and returns a list of pairs, pairUp.
and a key set, flightPass.keys
I want to write a for loop that runs pairUp for each value of flightPass.keys, and returns a big list of all these returned values.
val result:List[(Int, Int)] = pairUp(flightPass.keys.toSeq(0)).toList
for (flight<- flightPass.keys.toSeq.drop(1))
{val result:List[(Int, Int)] = result ++ pairUp(flight).toList}
I've tried a few different variations on this, always getting the error:
<console>:23: error: forward reference extends over definition of value result
for (flight<- flightPass.keys.toSeq.drop(1)) {val result:List[(Int, Int)] = result ++ pairUp(flight).toList}
^
I feel like this should work in other languages, so what am I doing wrong here?

First off, you've defined result as a val, which means it is immutable and can't be modified.
So if you want to apply "pairUp for each value of flightPass.keys", why not map()?
val result = flightPass.keys.map(pairUp) //add .toList if needed

A Scala method which converts a List of values into a List of Lists and then reduces them to a single List is called flatMap which is short for map then flatten. You would use it like this:
flightPass.keys.toSeq.flatMap(k => pairUp(k))
This will take each 'key' from flightPass.keys and pass it to pairUp (the mapping part), then take the resulting Lists from each call to pairUp and 'flatten' them, resulting in a single joined list.

Related

Spark-Scala: Map the first element of list with every other element of list when lists are of varying length

I have dataset of the following type in a textile:
1004,bb5469c5|2021-09-19 01:25:30,4f0d-bb6f-43cf552b9bc6|2021-09-25 05:12:32,1954f0f|2021-09-19 01:27:45,4395766ae|2021-09-19 01:29:13,
1018,36ba7a7|2021-09-19 01:33:00,
1020,23fe40-4796-ad3d-6d5499b|2021-09-19 01:38:59,77a90a1c97b|2021-09-19 01:34:53,
1022,3623fe40|2021-09-19 01:33:00,
1028,6c77d26c-6fb86|2021-09-19 01:50:50,f0ac93b3df|2021-09-19 01:51:11,
1032,ac55-4be82f28d|2021-09-19 01:54:20,82229689e9da|2021-09-23 01:19:47,
I read the file using sc.textFile which returns an RDD of type Array[String] after which I perform the operations .map(x=>x.substring(1,x.length()-1)).map(x=>x.split(",").toList)
After split.toList I want to map the first element of each of the lists obtained to every other element of the list for which I use .map(x=>(x(0),x(1))).toDF("c1","c2")
This works fine for those lists which have only one value after split but skips on all other elements of the lists having more than one value for obvious reasons. For eg:
.map(x=>(x(0),x(1))) returns [1020,23fe40-4796-ad3d-6d5499b|2021-09-19 01:38:59] but skips out on the third element here 77a90a1c97b|2021-09-19 01:34:53
How can I write a map function which returns [1020,23fe40-4796-ad3d-6d5499b|2021-09-19 01:38:59], [1020,77a90a1c97b|2021-09-19 01:34:53] given that all the lists created using .map(x=>x.split(",").toList) are of varying lengths (have varying number of elements)?
I noted the ',' at the end of the file, but split ignores nulls.
The solution is as follows, just try it and you will see it works:
// x._n cannot work here initially.
val rdd = spark.sparkContext.textFile("/FileStore/tables/oddfile_01.txt")
val rdd2 = rdd.map(line => line.split(','))
val rdd3 = rdd2.map(x => (x(0), x.tail.toList))
val rdd4 = rdd3.flatMap{case (x, y) => y.map((x, _))}
rdd4.collect
Cardinality does change in this approach though.

Scala: missing arguments for method count in trait

In one line of code I'm attempting to take the first 10 lines of an RDD and count the records (which obviously should be 10). However, when I do some I get the error:
<console>:24: error: missing arguments for method count in trait
TraversableOnce;
follow this method with `_' if you want to treat it as a partially applied function
Here is the code:
logfiles.filter(line => line.contains("jpg")).take(10).count
After you take(10), you're no longer dealing with an RDD, but a Traversable (Scala collection type). You want to use size instead of count, since count takes a predicate to filter by:
val count = logfiles.filter(line => line.contains("jpg")).take(10).size
As you've stated, this will trivially always return 10 items as long as your RDD has at least that many items, and you most likely want to use RDD.count() instead.
val count = logfiles.filter(line => line.contains("jpg")).count()
As suggested by the documentation of RDD
def take(num: Int): Array[T]
Returns Array not an RDD hence the count function doesn't work.
Also in RDD there is no native way of selecting 10 elements. If you really want to do that you should probably convert the RDD to a dataframe and use limit function in dataframe
df.limit(10) will return a dataframe of 10 elements
Where you can perform the count operation

Can we replace map with flatMap?

I was trying to find line with maximum words, and i wrote the following lines, to run on spark-shell:
import java.lang.Math
val counts = textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
But since, map is one to one , and flatMap is one to either zero or anything. So i tried replacing map with flatMap, in above code. But its giving error as:
<console>:24: error: type mismatch;
found : Int
required: TraversableOnce[?]
val counts = F1.flatMap(s => s.split(" ").size).reduce((a,b)=> Math.max(a,b))
If anybody could make me understand the reason, it will really be helpful.
flatMap must return an Iterable which is clearly not what you want. You do want a map because you want to map a line to the number of words, so you want a one-to-one function that takes a line and maps it to the number of words (though you could create a collection with one element, being the size of course...).
FlatMap is meant to associate a collection to an input, for instance if you wanted to map a line to all its words you would do:
val words = textFile.flatMap(x => x.split(" "))
and that would return an RDD[String] containing all the words.
In the end, map transforms an RDD of size N into another RDD of size N (e.g. your lines to their length) whereas flatMap transforms an RDD of size N into an RDD of size P (actually an RDD of size N into an RDD of size N made of collections, all these collections are then flattened to produce the RDD of size P).
P.S.: one last word that has nothing to do with your problem, it is more efficient to do (for a string s)
val nbWords = s.split(" ").length
than call .size(). Indeed, the split method returns an array of String and arrays do not have a size method. So when you call .size() you have an implicit conversion from Array[String] to SeqLike[String] which creates new objects. But Array[T] do have a length field so there's no conversion calling length. (It's a detail but I think it's good habit though).
Any use of map can be replaced by flatMap, but the function argument has to be changed to return a single-element List: textFile.flatMap(line => List(line.split(" ").size)). This isn't a good idea: it just makes your code less understandable and less efficient.
After reading Tired of Null Pointer Exceptions? Consider Using Java SE 8's Optional!'s part about why use flatMap() rather than Map(), I have realized the truly reason why flatMap() can not replace map() is that map() is not a special case of flatMap().
It's true that flatMap() means one-to-many, but that's not the only thing flatMap() does. It can also strip outer Stream() if put it simply.
See the definations of map and flatMap:
Stream<R> map(Function<? super T, ? extends R> mapper)
Stream<R> flatMap(Function<? super T, ? extends Stream<? extends R>> mapper)
the only difference is the type of returned value in inner function. What map() returned is "Stream<'what inner function returned'>", while what flatMap() returned is just "what inner function returned".
So you can say that flatMap() can kick outer Stream() away, but map() can't. This is the most difference in my opinion, and also why map() is not just a special case of flatMap().
ps:
If you really want to make one-to-one with flatMap, then you should change it into one-to-List(one). That means you should add an outer Stream() manually which will be stripped by flatMap() later. After that you'll get the same effect as using map().(Certainly, it's clumsy. So don't do like that.)
Here are examples for Java8, but the same as Scala:
use map():
list.stream().map(line -> line.split(" ").length)
deprecated use flatMap():
list.stream().flatMap(line -> Arrays.asList(line.split(" ").length).stream())

Spark Group By Key to (Key,List) Pair

I am trying to group some data by key where the value would be a list:
Sample data:
A 1
A 2
B 1
B 2
Expected result:
(A,(1,2))
(B,(1,2))
I am able to do this with the following code:
data.groupByKey().mapValues(List(_))
The problem is that when I then try to do a Map operation like the following:
groupedData.map((k,v) => (k,v(0)))
It tells me I have the wrong number of parameters.
If I try:
groupedData.map(s => (s(0),s(1)))
It tells me that "(Any,List(Iterable(Any)) does not take parameters"
No clue what I am doing wrong. Is my grouping wrong? What would be a better way to do this?
Scala answers only please. Thanks!!
You're almost there. Just replace List(_) with _.toList
data.groupByKey.mapValues(_.toList)
When you write an anonymous inline function of the form
ARGS => OPERATION
the entire part before the arrow (=>) is taken as the argument list. So, in the case of
(k, v) => ...
the interpreter takes that to mean a function that takes two arguments. In your case, however, you have a single argument which happens to be a tuple (here, a Tuple2, or a Pair - more fully, you appear to have a list of Pair[Any,List[Any]]). There are a couple of ways to get around this. First, you can use the sugared form of representing a pair, wrapped in an extra set of parentheses to show that this is the single expected argument for the function:
((x, y)) => ...
or, you can write the anonymous function in the form of a partial function that matches on tuples:
groupedData.map( case (k,v) => (k,v(0)) )
Finally, you can simply go with a single specified argument, as per your last attempt, but - realising it is a tuple - reference the specific field(s) within the tuple that you need:
groupedData.map(s => (s._2(0),s._2(1))) // The key is s._1, and the value list is s._2

Flattening a Set of pairs of sets to one pair of sets

I have a for-comprehension with a generator from a Set[MyType]
This MyType has a lazy val variable called factsPair which returns a pair of sets:
(Set[MyFact], Set[MyFact]).
I wish to loop through all of them and unify the facts into one flattened pair (Set[MyFact], Set[MyFact]) as follows, however I am getting No implicit view available ... and not enough arguments for flatten: implicit (asTraversable ... errors. (I am a bit new to Scala so still trying to get used to the errors).
lazy val allFacts =
(for {
mytype <- mytypeList
} yield mytype.factsPair).flatten
What do I need to specify to flatten for this to work?
Scala flatten works on same types. You have a Seq[(Set[MyFact], Set[MyFact])], which can't be flattened.
I would recommend learning the foldLeft function, because it's very general and quite easy to use as soon as you get the hang of it:
lazy val allFacts = myTypeList.foldLeft((Set[MyFact](), Set[MyFact]())) {
case (accumulator, next) =>
val pairs1 = accumulator._1 ++ next.factsPair._1
val pairs2 = accumulator._2 ++ next.factsPair._2
(pairs1, pairs2)
}
The first parameter takes the initial element it will append the other elements to. We start with an empty Tuple[Set[MyFact], Set[MyFact]] initialized like this: (Set[MyFact](), Set[MyFact]()).
Next we have to specify the function that takes the accumulator and appends the next element to it and returns with the new accumulator that has the next element in it. Because of all the tuples, it doesn't look nice, but works.
You won't be able to use flatten for this, because flatten on a collection returns a collection, and a tuple is not a collection.
You can, of course, just split, flatten, and join again:
val pairs = for {
mytype <- mytypeList
} yield mytype.factsPair
val (first, second) = pairs.unzip
val allFacts = (first.flatten, second.flatten)
A tuple isn't traverable, so you can't flatten over it. You need to return something that can be iterated over, like a List, for example:
List((1,2), (3,4)).flatten // bad
List(List(1,2), List(3,4)).flatten // good
I'd like to offer a more algebraic view. What you have here can be nicely solved using monoids. For each monoid there is a zero element and an operation to combine two elements into one.
In this case, sets for a monoid: the zero element is an empty set and the operation is a union. And if we have two monoids, their Cartesian product is also a monoid, where the operations are defined pairwise (see examples on Wikipedia).
Scalaz defines monoids for sets as well as tuples, so we don't need to do anything there. We'll just need a helper function that combines multiple monoid elements into one, which is implemented easily using folding:
def msum[A](ps: Iterable[A])(implicit m: Monoid[A]): A =
ps.foldLeft(m.zero)(m.append(_, _))
(perhaps there already is such a function in Scala, I didn't find it). Using msum we can easily define
def pairs(ps: Iterable[MyType]): (Set[MyFact], Set[MyFact]) =
msum(ps.map(_.factsPair))
using Scalaz's implicit monoids for tuples and sets.