How to sort an array with a custom order in Scala

I have a collection of data like the following:
val data = Seq(
("M", 1),
("F", 2),
("F", 3),
("F/M", 4),
("M", 5),
("M", 6),
("F/M", 7),
("F", 8)
)
I would like to sort this array according to the first value of each tuple. But I don't want to sort them in alphabetical order; I want them sorted like this: all Fs first, then all Ms, and finally all F/Ms (I don't care about the inner ordering of values with the same key).
I thought about extending the Ordering class, but that feels like overkill for such a simple problem. Any ideas?

EDIT: See @Eastsun's comment below for an even simpler solution.
I finally came up with a simple solution based on a map:
val sortingOrder = Map("F" -> 0, "M" -> 1, "F/M" -> 2)
data.sortWith((p1, p2) => sortingOrder(p1._1) < sortingOrder(p2._1))
This will of course fail if there is an unknown key in data, but that's fine in my case.
In order to avoid an error when a new key is met, we can do the following:
val sortingOrder = Map("F" -> 0, "M" -> 1, "F/M" -> 2)
val nKeys = sortingOrder.size
data.sortWith((p1, p2) => sortingOrder.getOrElse(p1._1, nKeys) < sortingOrder.getOrElse(p2._1, nKeys))
This will push tuples with unknown keys to the end of the list.
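For reference, the same ordering can also be written with sortBy, which reads a little more directly (quite possibly the simpler solution the comment above refers to):
data.sortBy(p => sortingOrder.getOrElse(p._1, nKeys))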

Related

Best way to find min of element in tuple

I have an Array of 3-tuples in Scala, of the form (a: Int, b: Int, val: Double), and I need to return an Array which, for every pair (a, b), contains the minimum val. It is clear to me how to do this by going through a Map:
a.map(t => ((t._1, t._2), t._3)).groupBy(_._1).mapValues(_.map(_._2).min).toArray
but I would like to avoid Maps for memory-optimization purposes. Is there a clean way to do this without a Map?
Try groupMapReduce (available since Scala 2.13); it does the same thing, but in a single pass:
tuples.groupMapReduce(t => (t._1, t._2))(_._3)(_ min _)
Runnable example:
val tuples = List(
(0, 0, 58),
(1, 1, 100),
(0, 0, 1),
(0, 1, 42),
(1, 0, 123),
(1, 0, 3),
(0, 1, 2),
(1, 1, 4)
)
tuples.groupMapReduce(t => (t._1, t._2))(_._3)(_ min _).foreach(println)
Output:
((0,0),1)
((1,1),4)
((1,0),3)
((0,1),2)
It should strictly decrease the load on the GC compared to your solution, because it doesn't generate any intermediate lists for the grouped values, nor for the mapped grouped values in the _.map(_._2) step of your original solution.
It does not completely eliminate the intermediate Map, but unless you can provide a more efficient structure for storing the values (such as a 2-D array, obtained by limiting the possible as and bs to a relatively small, non-negative range), the occurrence of a Map seems more or less unavoidable.
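As a sketch of that idea, assuming the as and bs are known to be small and non-negative (the bounds below are hypothetical, to be adjusted to your data), a 2-D array of minima avoids the Map entirely:
val maxA = 2 // hypothetical bound: every a falls in 0 until maxA
val maxB = 2 // hypothetical bound: every b falls in 0 until maxB
val grid = Array.fill(maxA, maxB)(Int.MaxValue) // Int.MaxValue marks "no value seen yet"
for ((a, b, v) <- tuples) grid(a)(b) = grid(a)(b) min v
val result = for {
  a <- (0 until maxA).toArray
  b <- 0 until maxB
  if grid(a)(b) != Int.MaxValue // skip (a, b) pairs that never occurred
} yield ((a, b), grid(a)(b))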

Scala, how to use functional programming to do a kind of join operation between two complex sequences?

Suppose I have a Seq[Seq[String]] sequence, like
val seq = for (i <- (1 to 50).toSeq) yield Seq(s"foo$i", s"bar$i")
, which is
List(List("foo1", "bar1"), List("foo2", "bar2"), ..., List("foo50", "bar50"))
I also have a tuple of index ranges, say
((5, 10), (13, 17), (25, 33), (42, 49))
What I want is to filter seq, keeping only the elements whose (1-based) index falls within one of those ranges, and to append a new id (the position of the matching range) as the last element.
The result should be:
List(List("foo5", "bar5", "0"), List("foo6", "bar6", "0")..., List("foo10", "bar10", "0"), List("foo13", "bar13", "1"), List("foo14", "bar14", "1")..., List("foo17", "bar17", "1"), List("foo25", "bar25", "2"), List("foo26", "bar26", "2")... List("foo33", "bar33", "2"), List("foo42", "bar42", "3"), List("foo43", "bar43", "3")..., List("foo48", "bar48", "3"), List("foo49", "bar49", "3"))
I can achieve this using procedural programming, but the code smells ugly. How can this be done in a Scala-style functional way?
Generally speaking, join-type operations can easily be expressed using for comprehensions:
val ranges = Seq((5, 10), (13, 17), (25, 33), (42, 49))
for {
i <- (1 to 50).toSeq
((lower, upper), j) <- ranges.zipWithIndex
if i >= lower && i <= upper
} yield List(s"foo$i", s"bar$i", j.toString)
However, be aware that this requires time proportional to the product of the lengths of the two sequences. If these get sufficiently large, you'll need a more optimized solution.
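If the ranges are disjoint (as they are here), one alternative is to generate the rows directly from the ranges, which is linear in the size of the output. A sketch, assuming the indices stay within bounds and seq supports efficient indexed access (e.g. a Vector):
val ranges = Seq((5, 10), (13, 17), (25, 33), (42, 49))
val result = ranges.zipWithIndex.flatMap { case ((lower, upper), j) =>
  (lower to upper).map(i => seq(i - 1) :+ j.toString) // seq is 0-based, the ranges 1-based
}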

Reverse a word-frequency map in Scala

I have a word-frequency array like this:
[("hello", 1), ("world", 5), ("globle", 1)]
I have to reverse it so that I get a frequency-to-wordCount map like this:
[(1, 2), (5, 1)]
Notice that since two words ("hello" and "globle") have frequency 1, the value of the reversed mapping is 2. However, since there is only one word with frequency 5, the value of that entry is 1. How can I do this in Scala?
Update:
I happened to figure this out as well:
arr.groupBy(_._2).map(x => (x._1,x._2.toList.length))
You can first group by the count, and then just take the size of each group:
val frequencies = List(("hello", 1), ("world", 5), ("globle", 1))
val reversed = frequencies.groupBy(_._2).mapValues(_.size).toList
res0: List[(Int, Int)] = List((5,1), (1,2))
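As an aside, on Scala 2.13+ groupMapReduce can do the grouping and counting in a single pass:
val reversed = frequencies.groupMapReduce(_._2)(_ => 1)(_ + _).toList
// List((5,1), (1,2))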

How to transform RDD[(Key, Value)] into Map[Key, RDD[Value]]

I searched for a solution for a long time but didn't find any correct algorithm.
Using Spark RDDs in Scala, how could I transform an RDD[(Key, Value)] into a Map[Key, RDD[Value]], knowing that I can't use collect or other methods that may load the data into memory?
In fact, my final goal is to loop over the Map[Key, RDD[Value]] by key and call saveAsNewAPIHadoopFile for each RDD[Value].
For example, if I get :
RDD[("A", 1), ("A", 2), ("A", 3), ("B", 4), ("B", 5), ("C", 6)]
I'd like :
Map[("A" -> RDD[1, 2, 3]), ("B" -> RDD[4, 5]), ("C" -> RDD[6])]
I wonder whether it would cost too much to do it using filter on each key A, B, C of RDD[(Key, Value)], but I don't know if calling filter as many times as there are distinct keys would be efficient. (Of course not, but maybe caching would help?)
Thank you
You could use code like this (Python):
rdd = sc.parallelize([("A", 1), ("A", 2), ("A", 3), ("B", 4), ("B", 5), ("C", 6)]).cache()
keys = rdd.keys().distinct().collect()
for key in keys:
    out = rdd.filter(lambda x: x[0] == key).map(lambda kv: kv[1])
    out.saveAsNewAPIHadoopFile(...)
One RDD cannot be part of another RDD, and you have no option to just collect the keys and transform their related values into separate RDDs. In my example you would iterate over the cached RDD, which is fine and works fast.
It sounds like what you really want is to save your KV RDD to a separate file for each key. Rather than creating a Map[Key, RDD[Value]], consider using a MultipleTextOutputFormat, similar to the example here. The code is pretty much all there in the example.
The benefit of this approach is that you're guaranteed to take only one pass over the RDD after the shuffle, and you get the same result you wanted. If you did this by filtering and creating several RDDs as suggested in the other answer (unless your source supports pushdown filters), you would end up taking one pass over the dataset for each individual key, which would be way slower.
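A rough sketch of that pattern (the class and path names here are illustrative; note that MultipleTextOutputFormat belongs to the old mapred API, so this goes through saveAsHadoopFile rather than saveAsNewAPIHadoopFile):
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.HashPartitioner

// Routes each record to an output file named after its key.
class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
  // Use the key as the file name
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.toString
  // Write only the value; the key is already encoded in the file name
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()
}

val rdd = sc.parallelize(Seq(("A", 1), ("A", 2), ("A", 3), ("B", 4), ("B", 5), ("C", 6)))
rdd.partitionBy(new HashPartitioner(3))
  .saveAsHadoopFile("/tmp/by-key", classOf[String], classOf[Int], classOf[KeyBasedOutput])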
This is my simple test code.
val test_RDD = sc.parallelize(List(("A",1),("A",2), ("A",3),("B",4),("B",5),("C",6)))
val groupby_RDD = test_RDD.groupByKey()
val result_RDD = groupby_RDD.map { v =>
  var result_list: List[Int] = Nil
  for (i <- v._2) {
    result_list ::= i
  }
  (v._1, result_list)
}
The result is below
result_RDD.take(3)
>> res86: Array[(String, List[Int])] = Array((A,List(1, 3, 2)), (B,List(5, 4)), (C,List(6)))
Or you can do it like this
val test_RDD = sc.parallelize(List(("A",1),("A",2), ("A",3),("B",4),("B",5),("C",6)))
val nil_list:List[Int] = Nil
val result2 = test_RDD.aggregateByKey(nil_list)(
  (acc, value) => value :: acc,
  (acc1, acc2) => acc1 ::: acc2)
The result is this
result2.take(3)
>> res209: Array[(String, List[Int])] = Array((A,List(3, 2, 1)), (B,List(5, 4)), (C,List(6)))
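For what it's worth, the loop in the first snippet is just converting the grouped values to a List, so it can be written more compactly (element order within each group may differ from the prepend-based loop):
val result3 = test_RDD.groupByKey().mapValues(_.toList)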

Sort two lists by their first element and zip them in Scala

val descrList = cursorReal.interfaceInfo.interfaces.map { values =>
  (values.ifIndex, values.ifName, values.ifType)
}
val ipAddressList = cursorReal.interfaceIpAndIndex
  .filter(x => !x.ifIpAddress.equalsIgnoreCase("0"))
  .map(values => (values.ifIndex, values.ifIpAddress))
For instance,
val descrList =
List((12,"VoIP-Null0",1), (8,"FastEthernet6",6), (19,"Vlan11",53),
(4,"FastEthernet2",6), (15,"Vlan1",53), (11,"GigabitEthernet0",6),
(9,"FastEthernet7",6), (22,"Vlan20",53), (13,"Wlan-GigabitEthernet0",6),
(16,"Async1",1), (5,"FastEthernet3",6), (10,"FastEthernet8",6),
(21,"Vlan12",53), (6,"FastEthernet4",6), (1,"wlan-ap0",24),
(17,"Virtual-Template1",131), (14,"Null0",1), (20,"Vlan10",53),
(2,"FastEthernet0",6), (18,"NVI0",1), (7,"FastEthernet5",6),
(29,"Virtual-Access7",131), (3,"FastEthernet1",6), (28,"Virtual-Access6",131))
val ipAddressList = List((21,"192.168.12.1"), (19,"192.168.11.1"),
(11,"104.36.252.115"), (20,"192.168.10.1"),
(22,"192.168.20.1"))
In both lists the first element is an index, and I have to merge the two lists index-wise. That means
(21,"192.168.12.1") should merge with (21,"Vlan12",53) to form a new tuple like (21,"Vlan12",53,"192.168.12.1").
scala> descrList map {case (index, v1, v2) =>
(index, v1, v2, ipAddressList.toMap.getOrElse(index, "empty"))}
res0: List[(Int, String, Int, String)] = List(
  (12,VoIP-Null0,1,empty), (8,FastEthernet6,6,empty), (19,Vlan11,53,192.168.11.1),
  (4,FastEthernet2,6,empty), (15,Vlan1,53,empty), (11,GigabitEthernet0,6,104.36.252.115),
  (9,FastEthernet7,6,empty), (22,Vlan20,53,192.168.20.1), (13,Wlan-GigabitEthernet0,6,empty),
  (16,Async1,1,empty), (5,FastEthernet3,6,empty), (10,FastEthernet8,6,empty),
  (21,Vlan12,53,192.168.12.1), (6,FastEthernet4,6,empty), (1,wlan-ap0,24,empty),
  (17,Virtual-Template1,131,empty), (14,Null0,1,empty), (20,Vlan10,53,192.168.10.1),
  (2,FastEthernet0,6,empty), (18,NVI0,1,empty), (7,FastEthernet5,6,empty),
  (29,Virtual-Access7,131,empty), (3,FastEthernet1,6,empty), (28,Virtual-Access6,131,empty))
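One small note on this snippet: ipAddressList.toMap is rebuilt for every element of descrList. Hoisting it into a val builds the lookup table once:
val addrMap = ipAddressList.toMap // built once, reused for every lookup
descrList map { case (index, v1, v2) =>
  (index, v1, v2, addrMap.getOrElse(index, "empty"))
}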
First, I would suggest you produce a Map instead of a List. A Map by nature has an indexer, and in your case this would be the ifIndex value.
Once you have Maps in place, you can use something like the following (samples from this other SO question: Best way to merge two maps and sum the values of the same key?).
From Rex Kerr:
map1 ++ map2.map{ case (k,v) => k -> (v + map1.getOrElse(k,0)) }
Or like this from Matthew Farwell:
(map1.keySet ++ map2.keySet).map(i => (i, map1.getOrElse(i, 0) + map2.getOrElse(i, 0))).toMap
If you cannot use Maps for whatever reason, then look into your existing project libraries. If you have Scalaz, then you have some tools already available.
Scalaz: https://github.com/scalaz/scalaz
If you have Slick, you also have some nice tools to directly use.
Slick: http://slick.typesafe.com/docs/
Consider first converting descrList to a Map, like this,
val a = (for ( (k,v1,v2) <- descrList) yield k -> (v1,v2)).toMap
Then we can look up each key from ipAddressList and agglomerate the elements into a new tuple, as follows,
for ( (k,ip) <- ipAddressList ; v = a.getOrElse(k,("none","none")) ) yield (k,v._1,v._2,ip)
Hence, for ipAddressList,
res: List((21,Vlan12,53,192.168.12.1), (19,Vlan11,53,192.168.11.1),
(11,GigabitEthernet0,6,104.36.252.115), (20,Vlan10,53,192.168.10.1),
(22,Vlan20,53,192.168.20.1))
Given the data:
val descrList =
List((12, "VoIP-Null0", 1), (8, "FastEthernet6", 6), (19, "Vlan11", 53),
(4, "FastEthernet2", 6), (15, "Vlan1", 53), (11, "GigabitEthernet0", 6),
(9, "FastEthernet7", 6), (22, "Vlan20", 53), (13, "Wlan-GigabitEthernet0", 6),
(16, "Async1", 1), (5, "FastEthernet3", 6), (10, "FastEthernet8", 6),
(21, "Vlan12", 53), (6, "FastEthernet4", 6), (1, "wlan-ap0", 24),
(17, "Virtual-Template1", 131), (14, "Null0", 1), (20, "Vlan10", 53),
(2, "FastEthernet0", 6), (18, "NVI0", 1), (7, "FastEthernet5", 6),
(29, "Virtual-Access7", 131), (3, "FastEthernet1", 6), (28, "Virtual-Access6", 131))
val ipAddressList = List((21, "192.168.12.1"), (19, "192.168.11.1"),
(11, "104.36.252.115"), (20, "192.168.10.1"),
(22, "192.168.20.1"))
Merge and sort:
val addrMap = ipAddressList.toMap
val output = descrList
  .filter(x => addrMap.contains(x._1))
  .map { case (i, a, b) => (i, a, b, addrMap(i)) }
  .sortBy(_._1)
output foreach println
Output:
(11,GigabitEthernet0,6,104.36.252.115)
(19,Vlan11,53,192.168.11.1)
(20,Vlan10,53,192.168.10.1)
(21,Vlan12,53,192.168.12.1)
(22,Vlan20,53,192.168.20.1)