How to use the cogroup method with a functional operation in Spark? - scala

I've been trying to solve this with cogroup, but I can't quite work it out.
If there are two RDDs with different keys, as in the example below, is it possible to use cogroup to extract only the valid data1 entries, i.e. those whose key starts with a key from data2?
val data1 = sc.parallelize(Seq(("aa", 1), ("ba", 2), ("bc", 2), ("b", 3), ("c", 1)))
val data2 = sc.parallelize(Seq(("a", 3), ("b", 5)))
val cogroupRdd: RDD[(String, (Iterable[Int], Iterable[Int]))] = data1.cogroup(data2)
/* List(
(ba,(CompactBuffer(2),CompactBuffer())),
(bc,(CompactBuffer(2),CompactBuffer())),
(a,(CompactBuffer(),CompactBuffer(3))),
(b,(CompactBuffer(3),CompactBuffer(5))),
(c,(CompactBuffer(1),CompactBuffer())),
(aa,(CompactBuffer(1),CompactBuffer()))
) */
The result should be Array(("aa", 1), ("ba", 2), ("bc", 2), ("b", 3))
I solved this problem by using broadcast(), as @mrsrinivas suggested, but broadcast() is not appropriate for large data:
val bcast = sc.broadcast(data2.map(_._1).collect())
val result = data1.filter(r => bcast.value.contains(myFuncOper(r._1)))
Is there a method to solve this problem using cogroup with the functional operation?

You can use cogroup after extracting a key that would match data2's keys, and then use filter and flatMap to remove values without matches and "restructure" the data:
val result: RDD[(String, Int)] = data1
.keyBy(_._1.substring(0, 1)) // key by first character
.cogroup(data2)
.filter { case (_, (_, data2Values)) => data2Values.nonEmpty }
.flatMap { case (_, (data1Values, _)) => data1Values }
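For reference, here is how the types line up at each step (a sketch, assuming the data1 and data2 RDDs from the question):
val keyed: RDD[(String, (String, Int))] =
  data1.keyBy(_._1.substring(0, 1)) // e.g. ("a", ("aa", 1)), ("b", ("ba", 2)), ...
val cogrouped: RDD[(String, (Iterable[(String, Int)], Iterable[Int]))] =
  keyed.cogroup(data2)
// after the filter and flatMap, result.collect() yields Array((aa,1), (ba,2), (bc,2), (b,3))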

Short:
val result = data1
.flatMap(x => x._1.split("").map(y => (y, x)))
.join(data2)
.map(x => x._2._1)
.distinct
Detailed:
flatMap(x => x._1.split("").map(y => (y, x))) yields
List(
(a, (aa, 1)),
(a, (aa, 1)),
(b, (ba, 2)),
(a, (ba, 2)),
(b, (bc, 2)),
(c, (bc, 2)),
(b, (b, 3)),
(c, (c, 1))
)
after join(data2)
List(
(a, ((aa, 1), 3)),
(a, ((aa, 1), 3)),
(a, ((ba, 2), 3)),
(b, ((ba, 2), 5)),
(b, ((bc, 2), 5)),
(b, ((b, 3), 5))
)
Now all we're interested in is the distinct original pairs, which map(x => x._2._1).distinct extracts.
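Collecting the result then gives the expected output (a sketch; element order after distinct is not guaranteed in Spark):
result.collect()
// e.g. Array((aa,1), (ba,2), (bc,2), (b,3))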

Related

spark rdd filter after groupbykey

//create RDD
val rdd = sc.makeRDD(List(("a", (1, "m")), ("b", (1, "m")),
("a", (1, "n")), ("b", (2, "n")), ("c", (1, "m")),
("c", (5, "m")), ("d", (1, "m")), ("d", (1, "n"))))
val groupRDD = rdd.groupByKey()
After groupByKey I want to filter out the keys whose values' first elements are all equal to 1, and get
("b", (1, "m")),("b", (2, "n")), ("c", (1, "m")), ("c", (5, "m"))
groupByKey() is necessary here. Could you help me? Thanks a lot.
Addendum:
What if the first element of the value tuple is a String instead, and I want to filter out the keys whose values' first elements all equal "x", like
("a",("x","m")), ("a",("x","n")), ("b",("x","m")), ("b",("y","n")), ("c",("x","m")), ("c",("z","m")), ("d",("x","m")), ("d",("x","n"))
and also get the same result ("b",("x","m")), ("b",("y","n")), ("c",("x","m")), ("c",("z","m"))
You could do:
val groupRDD = rdd
.groupByKey()
.filter(value => value._2.map(tuple => tuple._1).sum != value._2.size)
.flatMapValues(list => list) // to get the result as you like, because right now, they are, e.g. (b, Seq((1, m), (1, n)))
What this does: we first group by key through groupByKey, then filter by summing the first elements of the grouped values and checking whether the sum equals the group's size, which is only true when every first element is 1. For example:
(a, Seq((1, m), (1, n))) -> grouped by key
sum of first elements: 1 + 1 = 2; size of sequence: 2
2 == 2, so this row is filtered out
The final result:
(c,(1,m))
(b,(1,m))
(c,(5,m))
(b,(2,n))
Good luck!
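Incidentally, a single filter that covers both the numeric case and the string addendum (an assumption based on the two examples: you want to drop the groups whose values all share the same first element) could look like this sketch:
val groupRDD = rdd
  .groupByKey()
  // keep only keys whose values contain more than one distinct first element
  .filter { case (_, values) => values.map(_._1).toSet.size > 1 }
  .flatMapValues(identity)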
EDIT
Under the assumption that the first element of the value tuple can be any string, and assuming rdd is your data containing:
(a,(x,m))
(c,(x,m))
(c,(z,m))
(d,(x,m))
(b,(x,m))
(a,(x,n))
(d,(x,n))
(b,(y,n))
Then we can construct uniqueCount as:
val uniqueCount = rdd
  // we swap places: we want to count each (key, first element) combination, i.e. (a, x), (b, x), (b, y), (c, x), etc.
  .map(entry => ((entry._1, entry._2._1), entry._2._2))
  // we count the new keys: (a, x) gives us 2, (b, x) gives us 1, (b, y) gives us 1, etc.
  .countByKey()
  // we keep only counts of exactly 1; anything greater means duplicates
  .filter(a => a._2 == 1)
  // we get the very keys, so we can filter below
  .map(a => a._1._1)
  .toList
Then this:
val filteredRDD = rdd.filter(a => uniqueCount.contains(a._1))
Gives this output:
(b,(y,n))
(c,(x,m))
(c,(z,m))
(b,(x,m))
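Note that countByKey collects the counts to the driver, which may not scale to large data. A fully distributed sketch of the same idea (an assumption: same semantics as above, same rdd):
val uniqueKeys = rdd
  .map { case (k, (v1, _)) => ((k, v1), 1L) }
  .reduceByKey(_ + _) // count each (key, first element) combination
  .filter { case (_, n) => n == 1 } // keep only unique combinations
  .map { case ((k, _), _) => k }
  .distinct()
val filteredRDD = rdd
  .join(uniqueKeys.map(k => (k, ()))) // keep only rows whose key has a unique combination
  .mapValues { case (v, _) => v }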

scala combining multiple sequences

I have a couple of lists:
val aa = Seq(1,2,3,4)
val bb = Seq(Seq(2.0,3.0,4.0,5.0), Seq(1.0,2.0,3.0,4.0))
val cc = Seq("a", "B")
And want to combine them in the desired format of:
(1, 2.0, a), (2, 3.0, a), (3, 4.0, a), (4, 5.0, a), (1, 1.0, b), (2, 2.0, b), (3, 3.0, b), (4, 4.0, b)
but my combination of zip and flatMap
(aa, bb, cc).zipped.flatMap {
  case (a, b, c) =>
    b.map(b1 => (a, b1, c))
}
is only producing
(1,2.0,a), (1,3.0,a), (1,4.0,a), (1,5.0,a), (2,1.0,B), (2,2.0,B), (2,3.0,B), (2,4.0,B)
In Java I would just iterate over bb with a for loop and then iterate over the values in a nested loop.
What do I need to change to get the data in the desired format using neat functional scala?
How about this:
for {
(bs, c) <- bb zip cc
(a, b) <- aa zip bs
} yield (a, b, c)
Produces:
List(
(1,2.0,a), (2,3.0,a), (3,4.0,a), (4,5.0,a),
(1,1.0,b), (2,2.0,b), (3,3.0,b), (4,4.0,b)
)
I doubt this could be made any more neat & functional.
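For anyone unfamiliar with the desugaring, the for comprehension above is just nested flatMap/map (a sketch):
val result = (bb zip cc).flatMap { case (bs, c) =>
  (aa zip bs).map { case (a, b) => (a, b, c) }
}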
Not exactly pretty to read but here is an option:
bb
.map(b => aa.zip(b)) // List(List((1,2.0), (2,3.0), (3,4.0), (4,5.0)), List((1,1.0), (2,2.0), (3,3.0), (4,4.0)))
.zip(cc) // List((List((1,2.0), (2,3.0), (3,4.0), (4,5.0)),a), (List((1,1.0), (2,2.0), (3,3.0), (4,4.0)),B))
.flatMap{ case (l, c) => l.map(t => (t._1, t._2, c)) } // List((1,2.0,a), (2,3.0,a), (3,4.0,a), (4,5.0,a), (1,1.0,B), (2,2.0,B), (3,3.0,B), (4,4.0,B))
Another approach using collect and map
scala> val result = bb.zip(cc).collect{
case bc => (aa.zip(bc._1).map(e => (e._1,e._2, bc._2)))
}.flatten
result: Seq[(Int, Double, String)] = List((1,2.0,a), (2,3.0,a), (3,4.0,a), (4,5.0,a), (1,1.0,B), (2,2.0,B), (3,3.0,B), (4,4.0,B))

Getting first n distinct Key Tuples in Scala Spark

I have an RDD of tuples as follows:
(a, 1), (a, 2), (b,1)
How can I get the first two tuples with distinct keys? If I do a take(2), I will get (a, 1) and (a, 2).
What I need is (a, 1), (b,1) (Keys are distinct). Values are irrelevant.
Here's what I threw together in Scala.
sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 1)))
.reduceByKey((k1,k2) => k1)
.collect()
Outputs
Array[(String, Int)] = Array((a,1), (b,1))
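And since the question asks for the first n distinct-key tuples, you can follow reduceByKey with take (a sketch; note that "first" is only loosely defined across partitions in Spark):
val firstN = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 1)))
  .reduceByKey((v1, _) => v1) // keep one value per key
  .take(2)
// e.g. Array((a,1), (b,1))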
As you already have an RDD of pairs, your RDD has extra key-value functionality provided by org.apache.spark.rdd.PairRDDFunctions. Let's make use of that.
val pairRdd = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 1)))
// RDD[(String, Int)]
val groupedRdd = pairRdd.groupByKey()
// RDD[(String, Iterable[Int])]
val requiredRdd = groupedRdd.map { case (key, iter) => (key, iter.head) }
// RDD[(String, Int)]
Or in short
sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 1)))
.groupByKey()
.map { case (key, iter) => (key, iter.head) }
It is easy: you just need to use collectAsMap, like below:
val data = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 1)))
data.collectAsMap().foreach(println)
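Be aware, though, that collectAsMap brings the whole RDD to the driver and keeps only one value per key, so with duplicate keys it will not necessarily keep the first value; the printout would be something like:
(b,1)
(a,2)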

Scala - groupBy map values to list

I have the following input data
((A, 1, 4), (A, 2, 5), (A, 3, 6))
I would like to produce the following result
(A, (1, 2, 3), (4, 5, 6))
through grouping input by keys
What would be the correct way to do so in Scala?
((A, 1, 4), (A, 2, 5), (A, 3, 6))
If this represents a List[(String, Int, Int)], then try the following.
val l = List(("A", 1, 4), ("A", 2, 5), ("A", 3, 6), ("B", 1, 4), ("B", 2, 5), ("B", 3, 6))
l groupBy {_._1} map {
case (k, v) => (k, v map {
case (k, v1, v2) => (v1, v2)
} unzip)
}
This will result in a Map[String,(List[Int], List[Int])], i.e., a map with string keys mapped to tuples of two lists.
Map(A -> (List(1, 2, 3), List(4, 5, 6)), B -> (List(1, 2, 3), List(4, 5, 6)))
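If you want the result in the exact shape from the question, a list of triples rather than a Map, a sketch (groupBy does not guarantee key order):
val triples: List[(String, List[Int], List[Int])] =
  l.groupBy(_._1).toList.map { case (k, v) =>
    val (xs, ys) = v.map { case (_, a, b) => (a, b) }.unzip
    (k, xs, ys)
  }
// e.g. List((A,List(1, 2, 3),List(4, 5, 6)), (B,List(1, 2, 3),List(4, 5, 6)))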
Try something like this:
def takeHeads[A](lists:List[List[A]]): (List[A], List[List[A]]) =
(lists.map(_.head), lists.map(_.tail))
def translate[A](lists:List[List[A]]): List[List[A]] =
if (lists.flatten.isEmpty) Nil else {
val t = takeHeads(lists)
t._1 :: translate(t._2)
}
yourValue.groupBy(_.head).mapValues(v => translate(v.map(_.tail)))
This produces a Map[Any,Any] when used on your value... but it should get you going in the right direction.
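A sketch of using it on the question's data (an assumption: each tuple is first converted to a List so that head and tail work):
val input = List(("A", 1, 4), ("A", 2, 5), ("A", 3, 6))
val asLists: List[List[Any]] = input.map { case (k, a, b) => List(k, a, b) }
val grouped = asLists.groupBy(_.head).mapValues(v => translate(v.map(_.tail)))
// Map(A -> List(List(1, 2, 3), List(4, 5, 6)))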

sort two lists by their first element and zip them in scala

val descrList = cursorReal.interfaceInfo.interfaces.map {
case values => (values.ifIndex , values.ifName , values.ifType)
}
val ipAddressList = cursorReal.interfaceIpAndIndex.filter(x=> (!x.ifIpAddress.equalsIgnoreCase("0"))).map {
case values => (values.ifIndex,values.ifIpAddress)
}
For instance,
val descrList =
List((12,"VoIP-Null0",1), (8,"FastEthernet6",6), (19,"Vlan11",53),
(4,"FastEthernet2",6), (15,"Vlan1",53), (11,"GigabitEthernet0",6),
(9,"FastEthernet7",6), (22,"Vlan20",53), (13,"Wlan-GigabitEthernet0",6),
(16,"Async1",1), (5,"FastEthernet3",6), (10,"FastEthernet8",6),
(21,"Vlan12",53), (6,"FastEthernet4",6), (1,"wlan-ap0",24),
(17,"Virtual-Template1",131), (14,"Null0",1), (20,"Vlan10",53),
(2,"FastEthernet0",6), (18,"NVI0",1), (7,"FastEthernet5",6),
(29,"Virtual-Access7",131), (3,"FastEthernet1",6), (28,"Virtual-Access6",131))
val ipAddressList = List((21,"192.168.12.1"), (19,"192.168.11.1"),
(11,"104.36.252.115"), (20,"192.168.10.1"),
(22,"192.168.20.1"))
In both lists the first element is an index, and I have to merge the two lists index-wise. That means
(21,"192.168.12.1") should merge with (21,"Vlan12",53) to form a new entry like (21,"Vlan12",53,"192.168.12.1").
scala> descrList map {case (index, v1, v2) =>
(index, v1, v2, ipAddressList.toMap.getOrElse(index, "empty"))}
res0: List[(Int, String, Int, String)] = List(
(12,VoIP-Null0,1,empty), (8,FastEthernet6,6,empty), (19,Vlan11,53,192.168.11.1),
(4,FastEthernet2,6,empty), (15,Vlan1,53,empty), (11,GigabitEthernet0,6,104.36.252.115),
(9,FastEthernet7,6,empty), (22,Vlan20,53,192.168.20.1), (13,Wlan-GigabitEthernet0,6,empty),
(16,Async1,1,empty), (5,FastEthernet3,6,empty), (10,FastEthernet8,6,empty),
(21,Vlan12,53,192.168.12.1), (6,FastEthernet4,6,empty), (1,wlan-ap0,24,empty),
(17,Virtual-Template1,131,empty), (14,Null0,1,empty), (20,Vlan10,53,192.168.10.1), (2,FastEthernet0,6,empty),
(18,NVI0,1,empty), (7,FastEthernet5,6,empty), (29,Virtual-Access7,131,empty),
(3,FastEthernet1,6,empty), (28,Virtual-Access6,131,empty))
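One caveat with that snippet: ipAddressList.toMap is rebuilt for every element of descrList. A sketch that builds the Map once up front:
val ipMap = ipAddressList.toMap
val merged = descrList map { case (index, name, ifType) =>
  (index, name, ifType, ipMap.getOrElse(index, "empty"))
}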
First, I would suggest you produce a Map instead of a List. A Map by nature has an indexer, and in your case this would be the ifIndex value.
Once you have Maps in place, you can use something like this (sample from this other SO question: "Best way to merge two maps and sum the values of same key?")
From Rex Kerr:
map1 ++ map2.map{ case (k,v) => k -> (v + map1.getOrElse(k,0)) }
Or like this from Matthew Farwell:
(map1.keySet ++ map2.keySet).map { i => (i, map1.getOrElse(i,0) + map2.getOrElse(i,0)) }.toMap
If you cannot use Maps for whatever reason, then look into your existing project libraries. If you have Scalaz, then you have some tools already available.
Scalaz: https://github.com/scalaz/scalaz
If you have Slick, you also have some nice tools to directly use.
Slick: http://slick.typesafe.com/docs/
Consider first converting descrList to a Map, like this,
val a = (for ( (k,v1,v2) <- descrList) yield k -> (v1,v2)).toMap
Then we can look up keys for ipAddressList and agglomerate elements into a new tuple, as follows,
for ( (k,ip) <- ipAddressList ; v = a.getOrElse(k,("none","none")) ) yield (k,v._1,v._2,ip)
Hence, for ipAddressList,
res: List((21,Vlan12,53,192.168.12.1), (19,Vlan11,53,192.168.11.1),
(11,GigabitEthernet0,6,104.36.252.115), (20,Vlan10,53,192.168.10.1),
(22,Vlan20,53,192.168.20.1))
Given the data:
val descrList =
List((12, "VoIP-Null0", 1), (8, "FastEthernet6", 6), (19, "Vlan11", 53),
(4, "FastEthernet2", 6), (15, "Vlan1", 53), (11, "GigabitEthernet0", 6),
(9, "FastEthernet7", 6), (22, "Vlan20", 53), (13, "Wlan-GigabitEthernet0", 6),
(16, "Async1", 1), (5, "FastEthernet3", 6), (10, "FastEthernet8", 6),
(21, "Vlan12", 53), (6, "FastEthernet4", 6), (1, "wlan-ap0", 24),
(17, "Virtual-Template1", 131), (14, "Null0", 1), (20, "Vlan10", 53),
(2, "FastEthernet0", 6), (18, "NVI0", 1), (7, "FastEthernet5", 6),
(29, "Virtual-Access7", 131), (3, "FastEthernet1", 6), (28, "Virtual-Access6", 131))
val ipAddressList = List((21, "192.168.12.1"), (19, "192.168.11.1"),
(11, "104.36.252.115"), (20, "192.168.10.1"),
(22, "192.168.20.1"))
Merge and sort:
val addrMap = ipAddressList.toMap
val output = descrList
.filter(x => addrMap.contains(x._1))
.map { case (i, a, b) => (i, a, b, addrMap(i)) }
.sortBy(_._1)
output foreach println
Output:
(11,GigabitEthernet0,6,104.36.252.115)
(19,Vlan11,53,192.168.11.1)
(20,Vlan10,53,192.168.10.1)
(21,Vlan12,53,192.168.12.1)
(22,Vlan20,53,192.168.20.1)