Best way to find min of element in tuple - scala

I have an Array of 3-tuples in Scala of the form (a: Int, b: Int, val: Double), and I need to return an Array which, for each pair (a, b), contains the minimum val. It is clear to me how to do this by going through a Map:
a.map(t => ((t._1, t._2), t._3)).groupBy(_._1).mapValues(_.map(_._2).min).toArray
but I would like to avoid maps for the purposes of memory optimization. Is there a clean way to do this without Map?

Try groupMapReduce; it does the same thing, but in a single pass:
tuples.groupMapReduce(t => (t._1, t._2))(_._3)(_ min _)
Runnable example:
val tuples = List(
  (0, 0, 58),
  (1, 1, 100),
  (0, 0, 1),
  (0, 1, 42),
  (1, 0, 123),
  (1, 0, 3),
  (0, 1, 2),
  (1, 1, 4)
)
tuples.groupMapReduce(t => (t._1, t._2))(_._3)(_ min _).foreach(println)
Output:
((0,0),1)
((1,1),4)
((1,0),3)
((0,1),2)
It should strictly decrease the load on the GC compared to your solution, because it doesn't generate any intermediate lists for the grouped values, nor for the mapped grouped values produced by the _.map(_._2) step in your original solution.
It does not completely eliminate the intermediate Map, but unless you can provide a more efficient structure for storing the values (such as a 2-D array, if the possible a and b values are known to be relatively small and non-negative), the occurrence of a Map seems more or less unavoidable.
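For illustration, a minimal sketch of that 2-D array idea against the tuples example above (the bounds maxA and maxB are my assumption; this only works when a and b are known to be small non-negative Ints):
// Sketch only: assumes 0 <= a < maxA and 0 <= b < maxB, which the question does not guarantee.
val maxA = 2
val maxB = 2
val mins = Array.fill(maxA, maxB)(Double.PositiveInfinity)
for ((a, b, v) <- tuples) mins(a)(b) = math.min(mins(a)(b), v)
// mins(a)(b) now holds the minimum for each seen pair; unseen cells remain +Infinity.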

Related

How to sort an array with a custom order in Scala

I have a collection of data like the following:
val data = Seq(
  ("M", 1),
  ("F", 2),
  ("F", 3),
  ("F/M", 4),
  ("M", 5),
  ("M", 6),
  ("F/M", 7),
  ("F", 8)
)
I would like to sort this array according to the first value of the tuple. But I don't want to sort them in alphabetical order; I want all Fs first, then all Ms, and finally all F/Ms (I don't care about the inner ordering for values with the same key).
I thought about extending the Ordering class, but that feels like overkill for such a simple problem. Any ideas?
EDIT: See @Eastsun's comment below for an even simpler solution.
I finally came up with a simple solution based on a map:
val sortingOrder = Map("F" -> 0, "M" -> 1, "F/M" -> 2)
data.sortWith((p1, p2) => sortingOrder(p1._1) < sortingOrder(p2._1))
This will of course fail if there is an unknown key in data, but it will be fine in my case.
In order to avoid an error when a new key is met, we can do the following:
val sortingOrder = Map("F" -> 0, "M" -> 1, "F/M" -> 2)
val nKeys = sortingOrder.size
data.sortWith((p1, p2) => sortingOrder.getOrElse(p1._1, nKeys) < sortingOrder.getOrElse(p2._1, nKeys))
This will push tuples with unknown keys to the end of the list.
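A more compact spelling of the same fallback idea (a sketch of my own, not necessarily the comment's exact code) uses sortBy with the same map:
val sortingOrder = Map("F" -> 0, "M" -> 1, "F/M" -> 2)
// Unknown keys map to sortingOrder.size and therefore sort last.
data.sortBy(p => sortingOrder.getOrElse(p._1, sortingOrder.size))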

Find the indices at which two arrays intersect

I'm wondering if in Scala there's a decent way to get the indices at which two arrays intersect.
So given arrays:
a1 = [0, 5, 10, 15, 20, 25, 30]
a2 = [10, 20, 30, 40, 50]
Ideally, taking advantage of the fact that both arrays are ordered and contain no duplicates.
These share the common elements a1.intersect(a2) = [10, 20, 30]. The index (position) at which these elements occur is different for each array.
I would like to produce a sequence of tuples with the positions from each list where they intersect:
intersectingIndices(a1, a2) = [(2, 0), (4, 1), (6, 2)]
While intersect gives the intersecting values, I need to know the original indices and would prefer not to have to do an O(N) scan to find each one - as these arrays get very long (millions of elements). I also suspect the complexity of intersect is unnecessarily large given both arrays will always be sorted in advance, so a single-pass option would be preferable.
If both lists are sorted, it seems fairly straightforward: just a slight variation of the "merge" phase of merge sort.
import scala.annotation.tailrec

@tailrec
def intersect(
  left: List[Int],
  right: List[Int],
  lidx: Int = 0,
  ridx: Int = 0,
  result: List[(Int, Int)] = Nil
): List[(Int, Int)] = (left, right) match {
  case (Nil, _) | (_, Nil) => result.reverse
  case (l :: tail, r :: _) if l < r => intersect(tail, right, lidx + 1, ridx, result)
  case (l :: _, r :: tail) if l > r => intersect(left, tail, lidx, ridx + 1, result)
  case (l :: ltail, r :: rtail) => intersect(ltail, rtail, lidx + 1, ridx + 1, (lidx, ridx) :: result)
}
Or just hash one of the lists, and then scan the other (it is still O(n), albeit somewhat more expensive, but much simpler):
val hashed = left.zipWithIndex.toMap   // value -> index in left
right.zipWithIndex.flatMap { case (x, idx) => hashed.get(x).map(_ -> idx) }   // (left index, right index)
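Since the question mentions arrays with millions of elements, here is a sketch (my own, not part of the answer) of the same merge idea written imperatively over Arrays, which avoids building intermediate Lists; the function name intersectingIndices is taken from the question:
def intersectingIndices(a1: Array[Int], a2: Array[Int]): Array[(Int, Int)] = {
  val out = Array.newBuilder[(Int, Int)]
  var i = 0
  var j = 0
  while (i < a1.length && j < a2.length) {
    if (a1(i) < a2(j)) i += 1
    else if (a1(i) > a2(j)) j += 1
    else { out += ((i, j)); i += 1; j += 1 }   // equal values: record both indices
  }
  out.result()
}
// intersectingIndices(Array(0, 5, 10, 15, 20, 25, 30), Array(10, 20, 30, 40, 50))
// => Array((2,0), (4,1), (6,2))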

How do I remove the proper subsets from a list of sets in Scala?

I have a list of sets of integers as followed: {(1, 0), (0, 1, 2), (1, 2), (1, 2, 3, 4, 5), (3, 4)}.
I want to write a program in Scala to remove the sets that are proper subset of some set in the given list, i.e. the final result would be: {(0,1,2), (1,2,3,4,5)}.
An O(n²) solution can be done by checking each set against the entire list, but that would be very expensive and does not scale well for ~100,000 sets. I also thought of generating edges from the sets, removing duplicate edges, and running a DFS, but I have no idea how to do it in Scala (in a more Scala-ish way, not translated one-to-one from Java code).
Individual elements (sets) need only be compared to other elements of the same size or larger.
val ss = List(Set(1, 0), Set(0, 1, 2), Set(1, 2), Set(1, 2, 3, 4, 5), Set(3, 4))
ss.sortBy(-_.size) match {
  case Nil => Nil
  case hd :: tl =>
    tl.foldLeft(List(hd)) { case (acc, s) =>
      if (acc.exists(s.forall(_))) acc
      else s :: acc
    }
}
//res0: List[Set[Int]] = List(Set(0, 1, 2), Set(5, 1, 2, 3, 4))
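The acc.exists(s.forall(_)) check works because a Set[Int] is itself a function Int => Boolean, so it asks whether every element of s belongs to some already-kept set. An equivalent, arguably more explicit spelling (my rewording, same behaviour):
ss.sortBy(-_.size) match {
  case Nil => Nil
  case hd :: tl =>
    tl.foldLeft(List(hd)) { (acc, s) =>
      // keep s only if it is not a subset of any already-kept (same-size-or-larger) set
      if (acc.exists(bigger => s.subsetOf(bigger))) acc else s :: acc
    }
}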

Getting unique values of pairs in an RDD when the order within the pair is irrelevant

I have an RDD with the values
a,b
a,c
a,d
b,a
c,d
d,c
d,e
What I need is an RDD that contains just one of each reciprocal pair. It would have to be:
a,b or b,a
c,d or d,c
I was thinking they could be added to a list and looped through to find the opposite pair; if one exists, filter the first value out and delete the reciprocal pair. I am thinking there must be a way to use Scala functions like join or case, but I am having difficulty understanding them.
If you don't mind the order within each pair changing (e.g., (a,b) becoming (b,a)), there is a simple and easy-to-parallelize solution. The examples below use numbers, but the pairs can be anything, as long as the values are comparable.
In vanilla Scala:
List(
  (2, 1),
  (3, 2),
  (1, 2),
  (2, 4),
  (4, 2)
).map { case (a, b) => if (a > b) (a, b) else (b, a) }.toSet
This will result in:
res1: Set[(Int, Int)] = Set((2, 1), (3, 2), (4, 2))
In Spark RDD the above can be expressed as:
sc.parallelize((2, 1)::(3, 2)::(2, 1)::(4, 2)::(4, 2)::Nil).map{ case(a,b) =>
if (a>b) (a,b) else (b,a) }.distinct()
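If only pairs that actually occur in both orders should be kept (as the question's expected output suggests), one possible extension of the same normalization idea, not part of the original answer, is to count the normalized pairs and keep those seen more than once (assumes a SparkContext named sc):
val pairs = sc.parallelize(Seq(
  ("a", "b"), ("a", "c"), ("a", "d"), ("b", "a"), ("c", "d"), ("d", "c"), ("d", "e")))
val reciprocalOnly = pairs
  .map { case (a, b) => (if (a > b) (b, a) else (a, b)) -> 1 }   // normalize order, count once
  .reduceByKey(_ + _)
  .filter { case (_, n) => n > 1 }                               // keep pairs seen in both orders
  .keys
// reciprocalOnly.collect() => Array((a,b), (c,d)), order may vary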

Reverse a word-frequency map in Scala

I have a word-frequency array like this:
[("hello", 1), ("world", 5), ("globle", 1)]
I have to reverse it such that I get frequency-to-wordCount map like this:
[(1, 2), (5, 1)]
Notice that since two words ("hello" and "globle") have the frequency 1, the value of the reversed mapping for 1 is 2. However, since there is only one word with frequency 5, the value of that entry is 1. How can I do this in Scala?
Update:
I happened to figure this out as well:
arr.groupBy(_._2).map(x => (x._1,x._2.toList.length))
You can first group by the count and then just get the size of each group:
val frequencies = List(("hello", 1), ("world", 5), ("globle", 1))
val reversed = frequencies.groupBy(_._2).mapValues(_.size).toList
res0: List[(Int, Int)] = List((5,1), (1,2))
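On Scala 2.13+, the same result can be produced in a single pass with groupMapReduce (a variant I am adding here, not part of the original answer):
frequencies.groupMapReduce(_._2)(_ => 1)(_ + _).toList
// List((5,1), (1,2)) -- element order of the Map-backed result is not guaranteed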