How to generate combinations of key-value pairs in Apache Spark (Scala)


I have an RDD like this (all numbers are Int):
(2,List(2,2,7))
(7,List(9,7,9))
(9,List(2,7,9))
How do I generate an RDD such that for each list we have all possibilities of keys:
(2,List(2,2,7))
(7,List(2,2,7))
(9,List(2,2,7))
(2,List(9,7,9))
(7,List(9,7,9))
(9,List(9,7,9))
(2,List(2,7,9))
(7,List(2,7,9))
(9,List(2,7,9))
As a follow-up, I need to count, for each row, how many times the key occurs in the list. For example:
(2,List(2,2,7)) results in (2, 2) since there are two 2s in the list
(7,List(2,2,7)) results in (7, 1) since there is one 7 in the list

To generate all possible key-list pairs, I would take the Cartesian product of the keys and the lists:
rdd.map(_._1).cartesian(rdd.map(_._2))
That gives exactly the nine pairs asked for (the order of the rows is not guaranteed):
(2,List(9, 7, 9))
(2,List(2, 7, 9))
(7,List(2, 2, 7))
(7,List(9, 7, 9))
(7,List(2, 7, 9))
(9,List(2, 2, 7))
(9,List(9, 7, 9))
(9,List(2, 7, 9))
(2,List(2, 2, 7))
For the final result you can use map:
rdd.map(_._1).cartesian(rdd.map(_._2)).map { case (k, v) => (k, v, v.count(_ == k)) }
(2,List(2, 2, 7),2)
(2,List(2, 7, 9),1)
(7,List(2, 7, 9),1)
(2,List(9, 7, 9),0)
(7,List(2, 2, 7),1)
(9,List(2, 2, 7),0)
(9,List(9, 7, 9),2)
(7,List(9, 7, 9),1)
(9,List(2, 7, 9),1)
You can drop the list itself from the final tuples; I included it only to check that the counts are right.
Depending on your data, it may also be worth checking whether a list is null and handling that case explicitly.
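A minimal sketch of that null check, written against plain Scala collections rather than an RDD so that it runs without Spark. The names rows, pairs and counted are illustrative, the row with a null list is made up for the example, and treating a null list as zero matches is an assumption. In Spark, the same case expression could be passed to map after cartesian.

```scala
// Sample rows mirroring the question, plus one hypothetical row with a null list.
val rows: List[(Int, List[Int])] = List(
  (2, List(2, 2, 7)),
  (7, List(9, 7, 9)),
  (9, null)
)

val keys = rows.map(_._1)
val lists = rows.map(_._2)

// Local stand-in for rdd.map(_._1).cartesian(rdd.map(_._2)).
val pairs = for (k <- keys; v <- lists) yield (k, v)

// Option(v) turns null into None, so a null list counts as 0 matches (an assumption).
val counted = pairs.map { case (k, v) =>
  (k, Option(v).map(_.count(_ == k)).getOrElse(0))
}
```

Whether a null list should count as zero, be filtered out, or raise an error depends on what null means in the data.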

Related

Get items from seq traits in Scala

I'm looking for an elegant way to pick elements from a Seq using a generated array of indices.
Create a seq:
val nums = Seq(1, 2, 3, 4, 5, 7, 9, 11, 14, 12, 16)
Create an array of indices:
val i = Array.range(2, 10, 3)
Array[Int] = Array(2, 5, 8)
How do I get 3, 7, 14 from the Seq?

If I understood correctly, you want to pick elements from nums corresponding to indexes from i.
The most elegant way I can think of:
i.map(nums.apply)
scastie: https://scastie.scala-lang.org/KacperFKorban/ckwx5KMKQMOss1L0ce661w
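As a sketch of how this behaves, and of one way to guard against indices that fall outside the Seq: nums.apply throws on an out-of-range index, while nums.lift returns an Option. The extra index 99 and the names picked and safe are only for illustration.

```scala
val nums = Seq(1, 2, 3, 4, 5, 7, 9, 11, 14, 12, 16)
val i = Array.range(2, 10, 3) // Array(2, 5, 8)

// Direct lookup: throws IndexOutOfBoundsException if an index is out of range.
val picked = i.map(nums.apply)

// Safe lookup: lift returns None for a missing index, and flatMap drops it.
// 99 is a deliberately invalid index added for the example.
val safe = (i.toList :+ 99).flatMap(idx => nums.lift(idx))
```

Both picked and safe contain 3, 7, 14 here; the invalid index is silently skipped in the second version.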

The Difference between using while and for loop with Set type in Scala

Why does the Set type work perfectly well with for loops, while a while loop running through a Set cannot index by position?

Short Answer:
There is no for loop in Scala. There is only the for comprehension, which is just syntactic sugar for specific method calls.
for (i <- Set(1, 2, 33)) println(i)
//is translated to
Set(1, 2, 33).foreach { i => println(i) }
val newSet = for (i <- Set(1, 2, 33)) yield i*2
//is translated to
val newSet = Set(1, 2, 33).map { i => i*2 }
// There are more such translations; see the documentation.
A while loop, on the other hand, is the ordinary loop known from other languages. It simply repeats as long as a condition holds, nothing fancy here.
So when you write a for loop in Scala, you are really calling the foreach or map method on a Set[T] instance, and neither needs an index.
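A small sketch of both points: a while loop can still walk a Set if it goes through an iterator instead of an index, and a for comprehension with a guard translates to withFilter followed by map. The variable names are illustrative.

```scala
val s = Set(1, 2, 33)

// A while loop over a Set: use an iterator, since there is no positional index.
val it = s.iterator
var sum = 0
while (it.hasNext) {
  sum += it.next()
}

// A for comprehension with a guard...
val viaFor = for (i <- s if i % 2 == 0) yield i * 2
// ...is sugar for withFilter followed by map.
val viaMethods = s.withFilter(_ % 2 == 0).map(_ * 2)
```

The sum is the same whatever order the iterator visits the elements in, which is why the loop does not need to know that order.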

Because the index of any particular element in a Set is a private implementation detail, unlike in an IndexedSeq.
Consider for example:
(1 to 19).map(n => (n * n) % 19)
Vector(1, 4, 9, 16, 6, 17, 11, 7, 5, 5, 7, 11, 17, 6, 16, 9, 4, 1, 0): IndexedSeq[Int]
Now convert that to a Set:
(1 to 19).map(n => (n * n) % 19).toSet
HashSet(0, 5, 1, 6, 9, 17, 7, 16, 11, 4): scala.collection.immutable.Set[Int]
Can you tell in which order the duplicate elements were removed?
Now let's remove the even integers:
(1 to 19).map(n => (n * n) % 19).toSet.filter(_ % 2 == 1)
HashSet(5, 1, 9, 17, 7, 11): scala.collection.immutable.Set[Int]
Here, filter probably went through one immutable Set in the order that it's stored in and copied the elements that fit the predicate to a new immutable Set in the same order.
But what if we use a mutable set instead?
import scala.collection.mutable.HashSet
val someSet: HashSet[Int] = new HashSet()
someSet ++= ((1 to 19).map(n => (n * n) % 19))
HashSet(0, 1, 4, 5, 6, 7, 9, 11, 16, 17): scala.collection.mutable.HashSet[Int]
someSet.filter(_ % 2 == 1)
HashSet(1, 17, 5, 7, 9, 11): scala.collection.mutable.HashSet[Int]
Here's what I think happened: since we're dealing with a mutable Set here, what filter does is that it goes through and removes the elements that don't fit the predicate. But instead of shifting all later elements to fill the hole (as would be the case in a Java ArrayList<>), it simply grabs the current last element and shifts it to the now vacant position.
So 0, which now happens to be at position 0, gets removed, 17 replaces it at position 0, and the size property is decremented by 1.
No, wait, that's not exactly what happened. You'd have to look at the scala.collection.mutable.HashSet source to figure out exactly what happened.
But it doesn't actually matter. The API does not indicate any obligation to reveal the order the elements are internally stored in. Our assumption that a REPL displays the elements in the order they're stored in might well turn out to be wrong altogether.
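If positional access is needed, one option is to copy the Set into an indexed, explicitly ordered collection first. This is a sketch under the assumption that sorted order is the order you want; the names residues, odd and ordered are illustrative.

```scala
val residues = (1 to 19).map(n => (n * n) % 19).toSet
val odd = residues.filter(_ % 2 == 1)

// Set equality ignores whatever order the elements are stored in.
val sameElements = odd == Set(17, 11, 9, 7, 5, 1)

// Copy into a Vector and sort to get a well-defined order and index access.
val ordered = odd.toVector.sorted
val first = ordered(0)
```

The Vector's order comes from sorted, not from the Set, so it does not depend on the Set implementation.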

How can I merge 2 observables in a custom fashion?

The custom fashion is:
obs1 = [1, 3, 5, 7, 9], obs2 = [2, 4, 6, 8, 10] -> mergedObs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
I was thinking about obs1.zipWith(obs2) with the bifunction (a, b) -> Observable.just(a, b), but then it's not trivial for me to flatten the resulting Observable<Observable<Integer>>.

That looks like an ordered merge: merge so that the smallest is picked from the sources when they all have items ready:
Flowables.orderedMerge():
Given a fixed number of input sources (which can be self-comparable or given a Comparator) merges them into a single stream by repeatedly picking the smallest one from each source until all of them completes.
Flowables.orderedMerge(Flowable.just(1, 3, 5), Flowable.just(2, 4, 6))
.test()
.assertResult(1, 2, 3, 4, 5, 6);
Edit
If the sources are guaranteed to be the same length, you can also zip them into a structure and then flatten that:
Observable.zip(source1, source2, (a, b) -> Arrays.asList(a, b))
.flatMapIterable(list -> list);
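The zip-then-flatten idea can be sketched with plain Scala lists standing in for the observables. This is not RxJava code, only an analogue of the same shape: pair the elements up, then flatten each pair back into the stream. The list names mirror the question.

```scala
// Plain lists standing in for the two observables (an analogy, not RxJava).
val obs1 = List(1, 3, 5, 7, 9)
val obs2 = List(2, 4, 6, 8, 10)

// zip pairs elements by position; flatMap emits each pair's members in order.
val merged = obs1.zip(obs2).flatMap { case (a, b) => List(a, b) }
```

As with the zip-based RxJava version, zip stops at the shorter input, so this only gives the full interleaving when both sources have the same length.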

Remove unique items from sequence [duplicate]

I find a lot about how to remove duplicates, but what is the most elegant way to remove the unique items first and then the remaining duplicates?
E.g. a sequence (1, 2, 5, 2, 3, 4, 4, 0, 2) should be converted into (2, 4).
I can think of using a for-loop to add a count to each distinct item, but I could imagine that Scala has a more elegant way to achieve this.

distinct and diff will work for you:
val a = List(1, 2, 5, 2, 3, 4, 4, 0, 2)
> a: List[Int] = List(1, 2, 5, 2, 3, 4, 4, 0, 2)
val b = a diff a.distinct
> b: List[Int] = List(2, 4, 2)
val c = (a diff a.distinct).distinct
> c: List[Int] = List(2, 4)
In place of distinct you can use toSet as well.

Also keep in mind that i => i can be replaced by identity and map(_._1) by keys, like this:
Seq(1, 2, 5, 2, 3, 4, 4, 0, 2).groupBy(identity).filter(_._2.size > 1).keys.toSeq
This is where a countByKey method, such as the one that can be found in Spark's API, would be useful.

Pretty straightforward:
Seq(1, 2, 5, 2, 3, 4, 4, 0, 2).groupBy(i => i).filter(_._2.size > 1).map(_._1).toSeq
Using the link from Ende Neu I think your code would become this:
Seq(1, 2, 5, 2, 3, 4, 4, 0, 2).groupBy(identity).collect { case (v, l) if l.length > 1 => v }.toSeq
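One caveat with the groupBy variants is that the result comes out of a Map, so its order is not guaranteed. A sketch that keeps the order of first occurrence, assuming that is the order wanted; counts and repeated are illustrative names.

```scala
val a = List(1, 2, 5, 2, 3, 4, 4, 0, 2)

// Count how often each value occurs.
val counts: Map[Int, Int] = a.groupBy(identity).map { case (v, occ) => v -> occ.size }

// Walk the distinct values in their original order and keep the repeated ones.
val repeated = a.distinct.filter(v => counts(v) > 1)
```

Here repeated is List(2, 4), the same values as the answers above, in the order in which they first appear in a.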

Filter condition using filtered value

I would like to filter a collection so that the distance between adjacent elements is at least 5.
So List(1, 2, 3, 4, 5, 6, 7, 11, 20) will become List(1, 6, 11, 20).
Is it possible to achieve this in one pass using filter? What would be the Scala way?

How about this one-liner:
scala> l.foldLeft(Vector(l.head)) { (acc, item) => if (item - acc.last >= 5) acc :+ item else acc }
res7: scala.collection.immutable.Vector[Int] = Vector(1, 6, 11, 20)

Try with foldLeft():
val input = List(1, 2, 3, 4, 5, 6, 7, 11, 20)
input.tail.foldLeft(List(input.head))((out, cur) =>
if(cur - out.head >= 5) cur :: out else out
).reverse
If it's not obvious:
The algorithm starts with the first element of the input in the output collection (you probably need to handle some edge cases, such as an empty input).
It iterates over all remaining elements of the input. If the difference between the current element (cur) and the first element of the output is greater than or equal to 5, cur is prepended to the output. Otherwise it is skipped.
The output is built by prepending and examining its head for better performance, so .reverse is needed at the end.
This is basically how you would implement it in an imperative way, but with more concise syntax.
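A sketch of the same fold with the empty-input edge case handled. The function name sparse and the minGap parameter are made up for this example.

```scala
// Keeps an element only if it is at least minGap larger than the last kept one.
// Assumes the input is sorted in ascending order, as in the question.
def sparse(xs: List[Int], minGap: Int): List[Int] = xs match {
  case Nil => Nil // empty input: nothing to keep, and no call to head
  case first :: rest =>
    rest.foldLeft(List(first)) { (out, cur) =>
      // out is built in reverse, so out.head is the most recently kept element
      if (cur - out.head >= minGap) cur :: out else out
    }.reverse
}

val result = sparse(List(1, 2, 3, 4, 5, 6, 7, 11, 20), 5)
```

For the list from the question this yields List(1, 6, 11, 20), and sparse(Nil, 5) returns Nil instead of throwing.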
