Loop with accumulator on an RDD - Scala

I want to loop n times, where n is an accumulator, over the same RDD.
Let's say n = 10, so I want the code below to loop 5 times (since the accumulator is increased by two on each pass):
val key = keyAcm.value.toInt
val rest = rdd.filter(_._1 > (key + 1))
val combined = rdd.filter(k => (k._1 == key) || (k._1 == key + 1))
  .map(x => (key, x._2))
  .reduceByKey { case (x, y) => x ++ y }
keyAcm.add(2)
combined.union(rest)
With this code I filter the RDD and keep keys 0 (the initial value of the accumulator) and 1. Then I merge their second elements and change the key, creating a new RDD with key 0 and a merged array. After that, I union this RDD with the original one, leaving out the filtered keys (0 and 1). Lastly, I increase the accumulator by two. How can I repeat these steps until the accumulator reaches 10?
Any ideas?

val rdd: RDD[(Int, String)] = ???
val res: RDD[(Int, Iterable[String])] = rdd.map(x => (x._1 / 2, x._2)).groupByKey()
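If I'm reading that answer right, the trick is that integer division of the key by 2 collapses key pairs (0 and 1, 2 and 3, ...) into one group in a single pass, which is what the accumulator loop was doing two keys at a time. A small sketch on made-up data (the names sampleRdd/grouped and the SparkContext sc are assumptions for illustration):
import org.apache.spark.rdd.RDD

// hypothetical toy data keyed 0..4
val sampleRdd: RDD[(Int, String)] =
  sc.parallelize(Seq((0, "a"), (1, "b"), (2, "c"), (3, "d"), (4, "e")))

// key / 2 maps keys 0 and 1 to 0, keys 2 and 3 to 1, and so on
val grouped: RDD[(Int, Iterable[String])] =
  sampleRdd.map(x => (x._1 / 2, x._2)).groupByKey()

grouped.collect().sortBy(_._1).foreach(println)
// (0,CompactBuffer(a, b))
// (1,CompactBuffer(c, d))
// (2,CompactBuffer(e))   -- value order within a group is not guaranteed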

Related

Scala: Remove duplicated integers from Vector( tuples(Int,Int) , ...)

I have a large Vector (about 2000 elements) consisting of many tuples, Tuple(Int, Int), e.g.
val myVectorEG = Vector((65,61), (29,49), (4,57), (12,49), (24,98), (21,52), (81,86), (91,23), (73,34), (97,41), ...)
I wish to remove the tuples whose integer at index 0 is repeated, i.e. if Tuple(65, xx) is repeated by another Tuple(65, yy) inside the vector, it should be removed.
I am able to access them and print them out with this method:
val (id1,id2) = ( allSource.foreach(i=>println(i._1)), allSource.foreach(i=>i._2))
How can I remove the duplicate integers? Or should I use another method, rather than using foreach to access the element at index 0?
To remove all duplicates, first group by the first element of each tuple (_._1) and collect only the groups that contain a single tuple for that particular key. Then flatten the result.
myVectorEG.groupBy(_._1).collect {
  case (k, v) if v.size == 1 => v
}.flatten
This returns a List, which you can call .toVector on if you need a Vector.
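For instance, with a made-up input (note that going through groupBy does not preserve the original ordering, which the next answer addresses):
val v = Vector((65, 61), (29, 49), (65, 99), (12, 49))

v.groupBy(_._1).collect {
  case (_, group) if group.size == 1 => group
}.flatten.toVector
// keeps only keys that occur once: Vector((29,49), (12,49)) -- order may vary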
This does the job and preserves order (unlike other solutions) but is O(n^2) so potentially slow for 2000 elements:
myVectorEG.filter(x => myVectorEG.count(_._1 == x._1) == 1)
This is more efficient for larger vectors but still preserves order:
val keep =
  myVectorEG.groupBy(_._1).collect {
    case (k, v) if v.size == 1 => k
  }.toSet

myVectorEG.filter(x => keep.contains(x._1))
You can use distinctBy (Scala 2.13+) to remove duplicates; note that it keeps only the first occurrence of each key.
In the case of Vector[(Int, Int)] it will look like this:
myVectorEG.distinctBy(_._1)
Updated: if you need to remove all the duplicates, you can use groupBy, but this will rearrange your order.
myVectorEG.groupBy(_._1).filter(_._2.size == 1).flatMap(_._2).toVector
Another option, taking advantage of the fact that you want the list sorted at the end:
def sortAndRemoveDuplicatesByFirst[A : Ordering, B](input: List[(A, B)]): List[(A, B)] = {
  import Ordering.Implicits._
  val sorted = input.sortBy(_._1)

  // `repeated` tracks whether `previous` shares its key with another element
  // and therefore has to be dropped rather than added to the accumulator.
  @annotation.tailrec
  def loop(remaining: List[(A, B)], previous: (A, B), repeated: Boolean, acc: List[(A, B)]): List[(A, B)] =
    remaining match {
      case x :: xs =>
        if (x._1 == previous._1)
          loop(remaining = xs, previous, repeated = true, acc)
        else if (!repeated)
          loop(remaining = xs, previous = x, repeated = false, previous :: acc)
        else
          loop(remaining = xs, previous = x, repeated = false, acc)
      case Nil =>
        if (repeated) acc.reverse // the last key was duplicated, drop it too
        else (previous :: acc).reverse
    }

  sorted match {
    case x :: xs =>
      loop(remaining = xs, previous = x, repeated = false, acc = List.empty)
    case Nil =>
      List.empty
  }
}
Which you can test like this:
val data = List(
  1 -> "A",
  3 -> "B",
  1 -> "C",
  4 -> "D",
  3 -> "E",
  5 -> "F",
  1 -> "G",
  0 -> "H"
)
sortAndRemoveDuplicatesByFirst(data)
// res: List[(Int, String)] = List((0,H), (4,D), (5,F))
(I used List instead of Vector to make it easy and performant to write the tail-rec algorithm)

Reduce/fold over scala sequence with grouping

In Scala, given an Iterable of pairs, say Iterable[(String, Int)],
is there a way to accumulate or fold over the ._2s based on the ._1s? For example, in the following, add up all the numbers that come after "A" and, separately, the numbers after "B":
List(("A", 2), ("B", 1), ("A", 3))
I could do this in two steps with groupBy:
val mapBy1 = list.groupBy(_._1)
for ((key, sublist) <- mapBy1) yield (key, sublist.foldLeft(0)(_ + _._2))
but then I would be allocating the sublists, which I would rather avoid.
You could build the Map as you go and convert it back to a List after the fact.
listOfPairs.foldLeft(Map[String, Int]().withDefaultValue(0)) {
  case (m, (k, v)) => m + (k -> (v + m(k)))
}.toList
You could do something like:
list.foldLeft(Map[String, Int]()) {
  case (map, (k, v)) => map + (k -> (map.getOrElse(k, 0) + v))
}
You could also use groupBy with mapValues:
list.groupBy(_._1).mapValues(_.map(_._2).sum).toList
res1: List[(String, Int)] = List((A,5), (B,1))
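One caveat on that last one-liner: it still allocates the intermediate sublists the question wanted to avoid, and on Scala 2.13 mapValues on a Map is deprecated in favour of going through a view; a sketch of the 2.13-friendly spelling:
// Scala 2.13+: use .view.mapValues instead of the deprecated Map#mapValues
list.groupBy(_._1)
  .view
  .mapValues(_.map(_._2).sum)
  .toList
// List((A,5), (B,1)) -- a Map's iteration order is not guaranteed in general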

Joining pairs of key-value with pairs of key-map

I have this dataset:
(apple,1)
(banana,4)
(orange,3)
(grape,2)
(watermelon,2)
and the other dataset is:
(apple,Map(Bob -> 1))
(banana,Map(Chris -> 1))
(orange,Map(John -> 1))
(grape,Map(Smith -> 1))
(watermelon,Map(Phil -> 1))
I am aiming to combine both sets to get:
(apple,1,Map(Bob -> 1))
(banana,4,Map(Chris -> 1))
(orange,3,Map(John -> 1))
(grape,2,Map(Smith -> 1))
(watermelon,2,Map(Phil -> 1))
The code I have:
...
val counts_firstDataset = words
  .map(word => (word.firstWord, 1))
  .reduceByKey { case (x, y) => x + y }
Second dataset:
...
val counts_secondDataset = secondSet
  .map(x => (x._1, x._2.toList.groupBy(identity).mapValues(_.size)))
I tried to use the join method, val joined_data = counts_firstDataset.join(counts_secondDataset), but it did not work because join takes a pair of [K, V]. How would I get around this issue?
The easiest way is just to convert to DataFrames and then join:
import spark.implicits._

val counts_firstDataset = words
  .map(word => (word.firstWord, 1))
  .reduceByKey { case (x, y) => x + y }
  .toDF("type", "value")

val counts_secondDataset = secondSet
  .map(x => (x._1, x._2.toList.groupBy(identity).mapValues(_.size)))
  .toDF("type_2", "map")

counts_firstDataset
  .join(counts_secondDataset, 'type === 'type_2)
  .drop('type_2)
As the first elements (the fruit names) of both datasets are in the same order, you can combine the two collections of tuples using zip and then use map to reshape each pair into a triple, in the following way:
counts_firstDataset.zip(counts_secondDataset)
  .map(vk => (vk._1._1, vk._1._2, vk._2._2))
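For what it's worth, the plain RDD join the question attempted should also work here, since both RDDs are already keyed by the fruit name; the join yields nested pairs, and one extra map flattens them into the desired triples. A sketch, assuming counts_firstDataset: RDD[(String, Int)] and counts_secondDataset: RDD[(String, Map[String, Int])]:
// join gives RDD[(String, (Int, Map[String, Int]))]
val joined = counts_firstDataset
  .join(counts_secondDataset)
  .map { case (fruit, (count, nameCounts)) => (fruit, count, nameCounts) }

joined.collect().foreach(println)
// e.g. (apple,1,Map(Bob -> 1)) -- RDD output order is not guaranteed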

Spark Scala Understanding reduceByKey(_ + _)

I can't understand reduceByKey(_ + _) in this first example of Spark with Scala:
object WordCount {
  def main(args: Array[String]): Unit = {
    val inputPath = args(0)
    val outputPath = args(1)
    val sc = new SparkContext()
    val lines = sc.textFile(inputPath)
    val wordCounts = lines.flatMap { line => line.split(" ") }
      .map(word => (word, 1))
      .reduceByKey(_ + _) // <-- I can't understand this line
    wordCounts.saveAsTextFile(outputPath)
  }
}
Reduce takes two elements and produces a third by applying a function to the two parameters.
The code you have shown is equivalent to the following:
reduceByKey((x, y) => x + y)
Instead of defining dummy variables and writing a lambda, Scala is smart enough to figure out that what you are trying to achieve is to apply a function (sum, in this case) to any two parameters it receives, hence the syntax
reduceByKey(_ + _)
reduceByKey takes a function of two parameters, applies it to the values of each key, and returns the result.
reduceByKey(_ + _) is equivalent to reduceByKey((x, y) => x + y)
Example:
val numbers = Array(1, 2, 3, 4, 5)
val sum = numbers.reduceLeft[Int](_+_)
println("The sum of the numbers one through five is " + sum)
Results:
The sum of the numbers one through five is 15
numbers: Array[Int] = Array(1, 2, 3, 4, 5)
sum: Int = 15
Similarly, reduceByKey(_ ++ _) is equivalent to reduceByKey((x, y) => x ++ y).
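To tie this back to Spark, here is a small sketch (assuming an existing SparkContext sc) showing that the underscore form and the explicit lambda produce the same counts:
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1), ("a", 1)))

val withUnderscores = pairs.reduceByKey(_ + _)            // shorthand
val withLambda      = pairs.reduceByKey((x, y) => x + y)  // same thing spelled out

withUnderscores.collect().sorted.foreach(println)
// (a,3)
// (b,1)   -- withLambda yields exactly the same pairs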

scala Stream transformation and evaluation model

Consider the following list transformation:
List(1,2,3,4) map (_ + 10) filter (_ % 2 == 0) map (_ * 3)
It is evaluated in the following way:
List(1, 2, 3, 4) map (_ + 10) filter (_ % 2 == 0) map (_ * 3)
List(11, 12, 13, 14) filter (_ % 2 == 0) map (_ * 3)
List(12, 14) map (_ * 3)
List(36, 42)
So there are three passes, and with each one a new list structure is created.
So, the first question: can Stream help to avoid this, and if yes, how? Can all evaluations be made in a single pass, without additional structures being created?
Isn't the following Stream evaluation model correct?
Stream(1, ?) map (_ + 10) filter (_ % 2 == 0) map (_ * 3)
Stream(11, ?) filter (_ % 2 == 0) map (_ * 3)
// filter condition fail, evaluate the next element
Stream(2, ?) map (_ + 10) filter (_ % 2 == 0) map (_ * 3)
Stream(12, ?) filter (_ % 2 == 0) map (_ * 3)
Stream(12, ?) map (_ * 3)
Stream(36, ?)
// finish
If it is, then there are the same number of passes and the same number of new Stream structures created as in the case of a List. If it is not, then the second question: what is the Stream evaluation model, particularly for this type of transformation chain?
One way to avoid intermediate collections is to use view.
List(1,2,3,4).view map (_ + 10) filter (_ % 2 == 0) map (_ * 3)
It doesn't avoid every intermediate collection, but it can be useful. The Scala collections documentation on views has lots of info and is well worth the time.
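A minimal sketch of the same chain through a view: the intermediate steps stay lazy, and the final .toList pulls each element through the whole pipeline in a single traversal of the source list:
val result = List(1, 2, 3, 4).view
  .map(_ + 10)         // not evaluated yet
  .filter(_ % 2 == 0)  // still lazy
  .map(_ * 3)
  .toList              // forces evaluation: List(36, 42)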
No, you can't avoid it by using Stream.
But you can avoid it by using the method collect, and keep in mind that every time you use a map after a filter, you may want a collect instead.
Here is the code:
scala> def time(n: Int)(call : => Unit): Long = {
| val start = System.currentTimeMillis
| var cnt = n
| while(cnt > 0) {
| cnt -= 1
| call
| }
| System.currentTimeMillis - start
| }
time: (n: Int)(call: => Unit)Long
scala> val xs = List.fill(10000)((math.random * 100).toInt)
xs: List[Int] = List(37, 86, 74, 1, ...)
scala> val ys = Stream(xs :_*)
ys: scala.collection.immutable.Stream[Int] = Stream(37, ?)
scala> time(10000){ xs map (_+10) filter (_%2 == 0) map (_*3) }
res0: Long = 7182
// Note: force is called to evaluate the whole stream.
scala> time(10000){ ys map (_+10) filter (_%2 == 0) map (_*3) force }
res1: Long = 17408
scala> time(10000){ xs.view map (_+10) filter (_%2 == 0) map (_*3) force }
res2: Long = 6322
scala> time(10000){ xs collect { case x if (x+10)%2 == 0 => (x+10)*3 } }
res3: Long = 2339
As far as I know, if you always iterate through the whole collection, Stream does not help you.
It will create the same number of new Streams as with the List.
Correct me if I am wrong, but I understand it as follows:
Stream is a lazy structure, so when you do:
val result = Stream(1, ?) map (_ + 10) filter (_ % 2 == 0) map (_ * 3)
the result is another stream that links to the results of the previous transformations. So if you force evaluation with a foreach (or e.g. mkString):
result.foreach(println)
for each iteration the above chain is evaluated to get the current item.
However, you can reduce the number of passes by one if you replace filter with withFilter. Then the filter is effectively applied together with the map function:
List(1,2,3,4) map (_ + 10) withFilter (_ % 2 == 0) map (_ * 3)
You can reduce it to one pass with flatMap:
List(1,2,3,4) flatMap { x =>
  val y = x + 10
  if (y % 2 == 0) Some(y * 3) else None
}
Scala can filter and transform a collection in a variety of ways.
First your example:
List(1,2,3,4) map (_ + 10) filter (_ % 2 == 0) map (_ * 3)
Could be optimized:
List(1,2,3,4) filter (_ % 2 == 0) map (v => (v+10)*3)
Or, folds could be used:
List(1,2,3,4).foldLeft(List[Int]()) {
  case (a, b) if b % 2 == 0 => a ++ List((b + 10) * 3)
  case (a, b) => a
}
Or, perhaps a for-expression:
for( v <- List(1,2,3,4); w=v+10 if w % 2 == 0 ) yield w*3
Or, maybe the clearest to understand, a collect:
List(1,2,3,4).collect{ case v if v % 2 == 0 => (v+10)*3 }
But to address your questions about Streams: yes, Streams can be used, and for large collections where what is wanted is often found early, a Stream is a good choice:
def myStream(s: Stream[Int]): Stream[Int] =
  ((s.head + 10) * 3) #:: myStream(s.tail.filter(_ % 2 == 0))
myStream(Stream.from(2)).take(2).toList // An infinitely long list yields
// 36 & 42 where the 3rd element
// has not been processed yet
With this Stream example the filter is only applied to the next element as it is needed, not to the entire list -- a good thing, or it would never stop :)