Reverse a word-frequency map in Scala - scala

I have a word-frequency array like this:
[("hello", 1), ("world", 5), ("globle", 1)]
I have to reverse it so that I get a frequency-to-word-count map like this:
[(1, 2), (5, 1)]
Notice that since two words ("hello" and "globle") have the frequency 1, the value of that entry in the reversed mapping is 2. Since there is only one word with a frequency of 5, the value of that entry is 1. How can I do this in Scala?
Update:
I happened to figure this out as well:
arr.groupBy(_._2).map(x => (x._1,x._2.toList.length))

You can first group by the count, and then take the size of each group:
val frequencies = List(("hello", 1), ("world", 5), ("globle", 1))
val reversed = frequencies.groupBy(_._2).mapValues(_.size).toList
reversed: List[(Int, Int)] = List((5,1), (1,2))
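On Scala 2.13 or later, where mapValues on a strict Map is deprecated in favor of a view, the same grouping can be sketched in a single pass with groupMapReduce. This is only an alternative under that version assumption; the variable names follow the answer above.

```scala
// Assumes Scala 2.13+ (groupMapReduce was added to the collections in 2.13).
val frequencies = List(("hello", 1), ("world", 5), ("globle", 1))

// Group by the frequency, map every word to 1, and sum the ones per group.
val reversed: Map[Int, Int] =
  frequencies.groupMapReduce(_._2)(_ => 1)(_ + _)
```

Calling reversed.toList gives the pairs (1,2) and (5,1); the iteration order of a Map is not guaranteed, so sort the list if the order matters.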

Related

How to sort an array with a custom order in Scala

I have a collection of data like the following:
val data = Seq(
("M", 1),
("F", 2),
("F", 3),
("F/M", 4),
("M", 5),
("M", 6),
("F/M", 7),
("F", 8)
)
I would like to sort this array according to the first value of the tuple. But I don't want to sort them in alphabetical order; I want them sorted like this: all Fs first, then all Ms, and finally all F/Ms (I don't care about the inner ordering of values with the same key).
I thought about extending the Ordering class, but it feels like overkill for such a simple problem. Any ideas?
EDIT: See @Eastsun's comment below for an even simpler solution.
I finally came up with a simple solution based on a map:
val sortingOrder = Map("F" -> 0, "M" -> 1, "F/M" -> 2)
data.sortWith((p1, p2) => sortingOrder(p1._1) < sortingOrder(p2._1))
This will of course fail if there is an unknown key in data, but it will be fine in my case.
To avoid an error when a new key is encountered, we can do the following:
val sortingOrder = Map("F" -> 0, "M" -> 1, "F/M" -> 2)
val nKeys = sortingOrder.size
data.sortWith((p1, p2) => sortingOrder.getOrElse(p1._1, nKeys) < sortingOrder.getOrElse(p2._1, nKeys))
This will push tuples with unknown keys to the end of the list.
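A shorter variant of the same idea, sketched here as one possibility, uses sortBy with the map lookup as the sort key instead of writing the comparison by hand. The name sorted is introduced only for this example.

```scala
val data = Seq(
  ("M", 1), ("F", 2), ("F", 3), ("F/M", 4),
  ("M", 5), ("M", 6), ("F/M", 7), ("F", 8))

// Rank of each known key; unknown keys get rank sortingOrder.size and end up last.
val sortingOrder = Map("F" -> 0, "M" -> 1, "F/M" -> 2)

// sortBy derives the ordering from the Int rank of each tuple's first element.
val sorted = data.sortBy { case (key, _) => sortingOrder.getOrElse(key, sortingOrder.size) }
```

Compared with sortWith, the key function is evaluated per element rather than spelled out twice per comparison, which keeps the call site short.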

Counting number of occurrences of Array element in a RDD

I have an RDD1 of key-value pairs of type [(String, Array[String])] (I will refer to them as (X, Y)), and an Array[String] Z.
For every element of Z, I'm trying to count how many instances of X have that element in Y. I want my output as ((X, Z(i)), #ofinstances).
RDD1 = ((A, (2, 3, 4)), (B, (4, 4, 4)), (A, (4, 5)))
Z = (1, 4)
then I want to get:
(((A, 4), 2), ((B, 4), 1))
Hope that made sense.
As you can see above, I only want an element if there is at least one occurrence.
I have tried this so far:
val newRDD = RDD1.map{case(x, y) => for(i <- 0 to (z.size-1)){if(y.contains(z(i))) {((x, z(i)), 1)}}}
My output here is an RDD[Unit].
I'm not sure if what I'm asking for is even possible, or if I have to do it another way.
So this is just another word count:
val rdd = sc.parallelize(Seq(
("A", Array("2", "3", "4")),
("B", Array("4", "4", "4")),
("A", Array("4", "5"))))
val z = Array("1", "4")
To make lookups efficient, convert z to a Set:
val zs = z.toSet
val result = rdd
  .flatMapValues(_.filter(zs contains _).distinct)
  .map((_, 1))
  .reduceByKey(_ + _)
where
_.filter(zs contains _).distinct
keeps only the values that occur in z and deduplicates them, so each (X, value) pair is counted at most once per record.
result.take(2).foreach(println)
// ((B,4),1)
// ((A,4),2)
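To see the logic without a Spark cluster, the same steps can be sketched on plain Scala collections. This is only a local illustration: records stands in for the RDD, and groupBy plus size stands in for map((_, 1)).reduceByKey(_ + _); the names records and counts are introduced just for this example.

```scala
// Local stand-in for the RDD of (X, Y) pairs from the question.
val records = Seq(
  ("A", Array("2", "3", "4")),
  ("B", Array("4", "4", "4")),
  ("A", Array("4", "5")))

// Set lookup instead of scanning the array z for every value.
val zs = Set("1", "4")

val counts: Map[(String, String), Int] =
  records
    // Keep only values present in zs, once per record, and pair them with the key.
    .flatMap { case (x, ys) => ys.toSeq.filter(zs.contains).distinct.map(y => (x, y)) }
    // Count how many records produced each (x, y) pair.
    .groupBy(identity)
    .map { case (pair, hits) => pair -> hits.size }
```

The element "1" never appears in any Y, so no entry is produced for it, which matches the requirement of reporting only pairs with at least one occurrence.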

Spark Scala - Split columns into multiple rows

Following up on the question that I posted here:
Spark Mllib - Scala
I have another question. Is it possible to transform a dataset like this:
2,1,3
1
3,6,8
Into this:
2,1
2,3
1,3
1
3,6
3,8
6,8
Basically, I want to discover all the relationships between the movies. Is it possible to do this?
My current code is:
val input = sc.textFile("PATH")
val raw = input.lines.map(_.split(",")).toArray
val twoElementArrays = raw.flatMap(_.combinations(2))
val result = twoElementArrays ++ raw.filter(_.length == 1)
Assuming that input is a multi-line string:
scala> val raw = input.lines.map(_.split(",")).toArray
raw: Array[Array[String]] = Array(Array(2, 1, 3), Array(1), Array(3, 6, 8))
The following approach discards one-element arrays (1 in your example):
scala> val twoElementArrays = raw.flatMap(_.combinations(2))
twoElementArrays: Array[Array[String]] = Array(Array(2, 1), Array(2, 3), Array(1, 3), Array(3, 6), Array(3, 8), Array(6, 8))
This can be fixed by appending the one-element arrays filtered from raw:
scala> val result = twoElementArrays ++ raw.filter(_.length == 1)
result: Array[Array[String]] = Array(Array(2, 1), Array(2, 3), Array(1, 3), Array(3, 6), Array(3, 8), Array(6, 8), Array(1))
I believe the order of the combinations is not relevant.
Update
SparkContext.textFile returns an RDD of lines, so it can be plugged in directly (rdd here is the result of sc.textFile):
val raw = rdd.map(_.split(","))
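The steps above can be collected into one local sketch on a plain multi-line string. It assumes the sample data from the question and uses linesIterator rather than lines, because on newer JDKs java.lang.String has its own lines method; Lists are used instead of Arrays only so that the results compare by value.

```scala
// Sample input from the question, one record per line.
val input = "2,1,3\n1\n3,6,8"

// Split every line into its comma-separated elements.
val raw: List[List[String]] =
  input.linesIterator.map(_.split(",").toList).toList

// All two-element combinations; lines with a single element produce none.
val twoElementLists: List[List[String]] = raw.flatMap(_.combinations(2))

// Re-append the single-element lines that combinations(2) dropped.
val result: List[List[String]] = twoElementLists ++ raw.filter(_.length == 1)
```

As in the answer, the single-element record ends up at the end of the result rather than in its original position.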

How to transform RDD[(Key, Value)] into Map[Key, RDD[Value]]

I searched for a solution for a long time but didn't find a working algorithm.
Using Spark RDDs in Scala, how could I transform an RDD[(Key, Value)] into a Map[Key, RDD[Value]], knowing that I can't use collect or other methods that may load the data into memory?
In fact, my final goal is to loop over the Map[Key, RDD[Value]] by key and call saveAsNewAPIHadoopFile for each RDD[Value].
For example, if I have:
RDD[("A", 1), ("A", 2), ("A", 3), ("B", 4), ("B", 5), ("C", 6)]
I'd like:
Map[("A" -> RDD[1, 2, 3]), ("B" -> RDD[4, 5]), ("C" -> RDD[6])]
I wonder whether it would be too costly to do this by calling filter on the RDD[(Key, Value)] for each key A, B, C. I don't know if calling filter as many times as there are distinct keys would be efficient (of course not, but maybe with cache?).
Thank you
You should use code like this (Python):
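rdd = sc.parallelize([("A", 1), ("A", 2), ("A", 3), ("B", 4), ("B", 5), ("C", 6)]).cache()
# Only the distinct keys are collected to the driver, not the values.
keys = rdd.keys().distinct().collect()
for key in keys:
    # kv[1] replaces the tuple-unpacking lambda, which is not valid in Python 3.
    out = rdd.filter(lambda kv: kv[0] == key).map(lambda kv: kv[1])
    out.saveAsNewAPIHadoopFile(...)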
An RDD cannot be nested inside another RDD, so you have no option but to collect the keys and turn the values of each key into a separate RDD. In my example you iterate over the cached RDD, which is fine and should be fast.
It sounds like what you really want is to save your KV RDD to a separate file for each key. Rather than creating a Map[Key, RDD[Value]] consider using a MultipleTextOutputFormat similar to the example here. The code is pretty much all there in the example.
The benefit of this approach is that you're guaranteed to take only one pass over the RDD after the shuffle, and you get the same result you wanted. If you did this by filtering and creating several RDDs as suggested in the other answer (unless your source supports pushdown filters), you would end up taking one pass over the dataset for each individual key, which would be much slower.
This is my simple test code.
val test_RDD = sc.parallelize(List(("A",1),("A",2), ("A",3),("B",4),("B",5),("C",6)))
val groupby_RDD = test_RDD.groupByKey()
val result_RDD = groupby_RDD.map { v =>
  var result_list: List[Int] = Nil
  for (i <- v._2) {
    result_list ::= i
  }
  (v._1, result_list)
}
The result is below
result_RDD.take(3)
>> res86: Array[(String, List[Int])] = Array((A,List(1, 3, 2)), (B,List(5, 4)), (C,List(6)))
Or you can do it like this
val test_RDD = sc.parallelize(List(("A",1),("A",2), ("A",3),("B",4),("B",5),("C",6)))
val nil_list:List[Int] = Nil
val result2 = test_RDD.aggregateByKey(nil_list)(
  (acc, value) => value :: acc,
  (acc1, acc2) => acc1 ::: acc2)
The result is this
result2.take(3)
>> res209: Array[(String, List[Int])] = Array((A,List(3, 2, 1)), (B,List(5, 4)), (C,List(6)))
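To make the roles of the two functions passed to aggregateByKey concrete, here is a local sketch that imitates its semantics on plain collections. aggregateByKeyLocal is a hypothetical helper written for this illustration, not a Spark API, and the split of the data into two partitions is an assumption.

```scala
// Hypothetical local stand-in for aggregateByKey: each inner Seq plays the role of one partition.
def aggregateByKeyLocal[K, V, U](partitions: Seq[Seq[(K, V)]], zero: U)(
    seqOp: (U, V) => U,
    combOp: (U, U) => U): Map[K, U] = {
  // Step 1: fold every partition on its own, starting each key from `zero`.
  val perPartition: Seq[Map[K, U]] = partitions.map { part =>
    part.foldLeft(Map.empty[K, U]) { case (acc, (k, v)) =>
      acc.updated(k, seqOp(acc.getOrElse(k, zero), v))
    }
  }
  // Step 2: merge the per-partition results key by key with combOp.
  perPartition.foldLeft(Map.empty[K, U]) { (merged, partial) =>
    partial.foldLeft(merged) { case (acc, (k, u)) =>
      acc.updated(k, acc.get(k).map(combOp(_, u)).getOrElse(u))
    }
  }
}

// Assumed partitioning of the test data, chosen only for the illustration.
val partitions = Seq(
  Seq(("A", 1), ("A", 2)),
  Seq(("A", 3), ("B", 4), ("B", 5), ("C", 6)))

val grouped: Map[String, List[Int]] =
  aggregateByKeyLocal(partitions, List.empty[Int])(
    (acc, value) => value :: acc,   // prepend within a partition
    (acc1, acc2) => acc1 ::: acc2)  // concatenate across partitions
```

Because the first function prepends, values within a partition come out in reverse order, and the final order also depends on how the data is partitioned. That is why the lists in the outputs above are not in insertion order.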

sort two lists by their first element and zip them in scala

val descrList = cursorReal.interfaceInfo.interfaces.map {
  case values => (values.ifIndex, values.ifName, values.ifType)
}
val ipAddressList = cursorReal.interfaceIpAndIndex.filter(x => !x.ifIpAddress.equalsIgnoreCase("0")).map {
  case values => (values.ifIndex, values.ifIpAddress)
}
For instance,
val descrList =
List((12,"VoIP-Null0",1), (8,"FastEthernet6",6), (19,"Vlan11",53),
(4,"FastEthernet2",6), (15,"Vlan1",53), (11,"GigabitEthernet0",6),
(9,"FastEthernet7",6), (22,"Vlan20",53), (13,"Wlan-GigabitEthernet0",6),
(16,"Async1",1), (5,"FastEthernet3",6), (10,"FastEthernet8",6),
(21,"Vlan12",53), (6,"FastEthernet4",6), (1,"wlan-ap0",24),
(17,"Virtual-Template1",131), (14,"Null0",1), (20,"Vlan10",53),
(2,"FastEthernet0",6), (18,"NVI0",1), (7,"FastEthernet5",6),
(29,"Virtual-Access7",131), (3,"FastEthernet1",6), (28,"Virtual-Access6",131))
val ipAddressList = List((21,"192.168.12.1"), (19,"192.168.11.1"),
(11,"104.36.252.115"), (20,"192.168.10.1"),
(22,"192.168.20.1"))
In both lists the first element is an index, and I have to merge the two lists by index. That is,
(21,"192.168.12.1") should be merged with (21,"Vlan12",53) to form a new list entry like (21,"Vlan12",53,"192.168.12.1").
scala> descrList map {case (index, v1, v2) =>
(index, v1, v2, ipAddressList.toMap.getOrElse(index, "empty"))}
res0: List[(Int, String, Int, String)] = List(
(12,VoIP-Null0,1,empty), (8,FastEthernet6,6,empty), (19,Vlan11,53,192.168.11.1),
(4,FastEthernet2,6,empty), (15,Vlan1,53,empty), (11,GigabitEthernet0,6,104.36.252.115),
(9,FastEthernet7,6,empty), (22,Vlan20,53,192.168.20.1), (13,Wlan-GigabitEthernet0,6,empty),
(16,Async1,1,empty), (5,FastEthernet3,6,empty), (10,FastEthernet8,6,empty),
(21,Vlan12,53,192.168.12.1), (6,FastEthernet4,6,empty), (1,wlan-ap0,24,empty),
(17,Virtual-Template1,131,empty), (14,Null0,1,empty), (20,Vlan10,53,192.168.10.1), (2,FastEthernet0,6,empty),
(18,NVI0,1,empty), (7,FastEthernet5,6,empty), (29,Virtual-Access7,131,empty),
(3,FastEthernet1,6,empty), (28,Virtual-Access6,131,empty))
First, I would suggest you produce a Map instead of a List. A Map is keyed by nature, and in your case the key would be the ifIndex value.
Once you have Maps in place, you can use something like this (samples from another SO question, "Best way to merge two maps and sum the values of same key?").
From Rex Kerr:
map1 ++ map2.map{ case (k,v) => k -> (v + map1.getOrElse(k,0)) }
Or like this from Matthew Farwell:
(map1.keySet ++ map2.keySet).map { i => (i, map1.getOrElse(i, 0) + map2.getOrElse(i, 0)) }.toMap
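As a small runnable sketch of the two merge snippets above, assume two Map[String, Int] values; the sample data and the names mergedA and mergedB are made up for the illustration.

```scala
// Made-up sample maps; "b" is the only key present in both.
val map1 = Map("a" -> 1, "b" -> 2)
val map2 = Map("b" -> 3, "c" -> 4)

// First variant: start from map1 and overwrite with map2's entries,
// each already increased by map1's value for the same key (0 if absent).
val mergedA: Map[String, Int] =
  map1 ++ map2.map { case (k, v) => k -> (v + map1.getOrElse(k, 0)) }

// Second variant: walk the union of the key sets and sum both lookups.
val mergedB: Map[String, Int] =
  (map1.keySet ++ map2.keySet).map { k =>
    (k, map1.getOrElse(k, 0) + map2.getOrElse(k, 0))
  }.toMap
```

Both variants sum Int values. For the interface data in the question the values are tuples and strings, so the combining step would build a new tuple rather than add numbers.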
If you cannot use Maps for whatever reason, then look into your existing project libraries. If you have Scalaz, then you have some tools already available.
Scalaz: https://github.com/scalaz/scalaz
If you have Slick, you also have some nice tools to directly use.
Slick: http://slick.typesafe.com/docs/
Consider first converting descrList to a Map, like this:
val a = (for ( (k,v1,v2) <- descrList) yield k -> (v1,v2)).toMap
Then we can look up the keys from ipAddressList and combine the elements into a new tuple, as follows:
for ( (k,ip) <- ipAddressList ; v = a.getOrElse(k,("none","none")) ) yield (k,v._1,v._2,ip)
Hence, for ipAddressList,
res: List((21,Vlan12,53,192.168.12.1), (19,Vlan11,53,192.168.11.1),
(11,GigabitEthernet0,6,104.36.252.115), (20,Vlan10,53,192.168.10.1),
(22,Vlan20,53,192.168.20.1))
Given the data:
val descrList =
List((12, "VoIP-Null0", 1), (8, "FastEthernet6", 6), (19, "Vlan11", 53),
(4, "FastEthernet2", 6), (15, "Vlan1", 53), (11, "GigabitEthernet0", 6),
(9, "FastEthernet7", 6), (22, "Vlan20", 53), (13, "Wlan-GigabitEthernet0", 6),
(16, "Async1", 1), (5, "FastEthernet3", 6), (10, "FastEthernet8", 6),
(21, "Vlan12", 53), (6, "FastEthernet4", 6), (1, "wlan-ap0", 24),
(17, "Virtual-Template1", 131), (14, "Null0", 1), (20, "Vlan10", 53),
(2, "FastEthernet0", 6), (18, "NVI0", 1), (7, "FastEthernet5", 6),
(29, "Virtual-Access7", 131), (3, "FastEthernet1", 6), (28, "Virtual-Access6", 131))
val ipAddressList = List((21, "192.168.12.1"), (19, "192.168.11.1"),
(11, "104.36.252.115"), (20, "192.168.10.1"),
(22, "192.168.20.1"))
Merge and sort:
val addrMap = ipAddressList.toMap
val output = descrList
  .filter(x => addrMap.contains(x._1))
  .map { case (i, a, b) => (i, a, b, addrMap(i)) }
  .sortBy(_._1)
output foreach println
Output:
(11,GigabitEthernet0,6,104.36.252.115)
(19,Vlan11,53,192.168.11.1)
(20,Vlan10,53,192.168.10.1)
(21,Vlan12,53,192.168.12.1)
(22,Vlan20,53,192.168.20.1)