spark rdd filter after groupbykey - scala

//create RDD
val rdd = sc.makeRDD(List(("a", (1, "m")), ("b", (1, "m")),
("a", (1, "n")), ("b", (2, "n")), ("c", (1, "m")),
("c", (5, "m")), ("d", (1, "m")), ("d", (1, "n"))))
val groupRDD = rdd.groupByKey()
After groupByKey I want to drop every key whose values all have 1 as the number in the value tuple, and get
("b", (1, "m")),("b", (2, "n")), ("c", (1, "m")), ("c", (5, "m"))
groupByKey() is required. Could you help me? Thanks a lot.
Added:
What if that element is a string? I then want to drop every key whose values all have "x" in that position. For example, given
("a",("x","m")), ("a",("x","n")), ("b",("x","m")), ("b",("y","n")), ("c",("x","m")), ("c",("z","m")), ("d",("x","m")), ("d",("x","n"))
I want to get the equivalent result: ("b",("x","m")), ("b",("y","n")), ("c",("x","m")), ("c",("z","m"))

You could do:
val groupRDD = rdd
  .groupByKey()
  // keep a group only if the sum of its first tuple elements differs from the group size
  .filter(value => value._2.map(tuple => tuple._1).sum != value._2.size)
  // flatten back to (key, value) pairs; without this the rows look like (b, Seq((1, m), (2, n)))
  .flatMapValues(list => list)
This first groups the values by key with groupByKey, then filters each group by summing the first elements of its value tuples and comparing that sum with the number of values in the group. If every first element is 1, the sum equals the size and the group is dropped. For example:
(a, Seq((1, m), (1, n))) -> grouped by key
sum = 1 + 1 = 2, size of the sequence = 2
2 == 2, so this group is filtered out
The final result:
(c,(1,m))
(b,(1,m))
(c,(5,m))
(b,(2,n))
Good luck!
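One caveat with the sum-based check: it relies on the first tuple elements never cancelling out. A group holding (0, m) and (2, n) also sums to its size and would be dropped. A minimal sketch of a more direct test follows. It uses a local Seq in place of the RDD so that it runs without a cluster, and the object name KeepMixedGroups is only illustrative. On an RDD the same predicate should fit inside .filter after groupByKey().

```scala
object KeepMixedGroups {
  // Same sample data as the question, held in a local Seq instead of an RDD.
  val data: Seq[(String, (Int, String))] = Seq(
    ("a", (1, "m")), ("b", (1, "m")), ("a", (1, "n")), ("b", (2, "n")),
    ("c", (1, "m")), ("c", (5, "m")), ("d", (1, "m")), ("d", (1, "n")))

  // Local stand-in for groupByKey: key -> all values seen for that key.
  val grouped: Map[String, Seq[(Int, String)]] =
    data.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }

  // Keep a group only if at least one value's first element differs from 1.
  val kept: Map[String, Seq[(Int, String)]] =
    grouped.filter { case (_, values) => values.exists(_._1 != 1) }

  // Flatten back to (key, value) pairs, sorted by key for a stable order.
  val result: Seq[(String, (Int, String))] =
    kept.toSeq.sortBy(_._1).flatMap { case (k, values) => values.map(v => (k, v)) }

  def main(args: Array[String]): Unit = result.foreach(println)
}
```

Using exists states the intent ("some value is not 1") directly and does not depend on the values being numeric.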
EDIT
Assuming the first element of the value tuple can be any string, and that rdd is your data and contains:
(a,(x,m))
(c,(x,m))
(c,(z,m))
(d,(x,m))
(b,(x,m))
(a,(x,n))
(d,(x,n))
(b,(y,n))
Then we can construct uniqueCount as:
val uniqueCount = rdd
  // re-key each row by (key, first value element), e.g. (a, x), (b, x), (b, y), (c, x)
  .map(entry => ((entry._1, entry._2._1), entry._2._2))
  // count rows per combination: (a, x) gives 2, (b, x) gives 1, (b, y) gives 1, etc.
  .countByKey()
  // keep the combinations that occur exactly once; a count of 2 means the key repeats the same value
  .filter(a => a._2 == 1)
  // extract the outer keys so we can filter the RDD below
  .map(a => a._1._1)
  .toList
Then this:
val filteredRDD = rdd.filter(a => uniqueCount.contains(a._1))
Gives this output:
(b,(y,n))
(c,(x,m))
(c,(z,m))
(b,(x,m))
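The count-based check above depends on the shape of the sample data: a key with a single row, or a key whose different values each occur twice, would be classified incorrectly. A hedged alternative is to count the distinct first elements per key and keep the keys that have more than one. The sketch below uses a local Seq as a stand-in for the RDD, and the names are illustrative. On an RDD, rdd.mapValues(_._1).distinct().countByKey() should give the same per-key counts.

```scala
object KeysWithMixedValues {
  val data: Seq[(String, (String, String))] = Seq(
    ("a", ("x", "m")), ("a", ("x", "n")), ("b", ("x", "m")), ("b", ("y", "n")),
    ("c", ("x", "m")), ("c", ("z", "m")), ("d", ("x", "m")), ("d", ("x", "n")))

  // Number of distinct first elements seen under each key.
  val distinctPerKey: Map[String, Int] =
    data.map { case (k, (first, _)) => (k, first) }
      .distinct
      .groupBy(_._1)
      .map { case (k, pairs) => (k, pairs.size) }

  // Keys whose values are not all the same.
  val mixedKeys: Set[String] =
    distinctPerKey.collect { case (k, n) if n > 1 => k }.toSet

  // Keep only the rows that belong to those keys, in input order.
  val result: Seq[(String, (String, String))] =
    data.filter { case (k, _) => mixedKeys(k) }

  def main(args: Array[String]): Unit = result.foreach(println)
}
```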

Related

Counting number of occurrences of Array element in a RDD

I have an RDD1 of key-value pairs of type [(String, Array[String])] (I will refer to them as (X, Y)), and an Array[String] called Z.
For every element of Z, I want to count how many entries with key X contain that element in Y. I want my output as ((X, Z(i)), #ofinstances).
RDD1 = ((A, (2, 3, 4)), (B, (4, 4, 4)), (A, (4, 5)))
Z = (1, 4)
Then I want to get:
(((A, 4), 2), ((B, 4), 1))
I hope that makes sense.
As you can see above, I only want an element if there is at least one occurrence.
I have tried this so far:
val newRDD = RDD1.map{case(x, y) => for(i <- 0 to (z.size-1)){if(y.contains(z(i))) {((x, z(i)), 1)}}}
My output here is an RDD[Unit].
I'm not sure whether what I'm asking for is even possible, or whether I have to do it another way.
So it is just another word count
val rdd = sc.parallelize(Seq(
("A", Array("2", "3", "4")),
("B", Array("4", "4", "4")),
("A", Array("4", "5"))))
val z = Array("1", "4")
To make lookups efficient convert z to Set:
val zs = z.toSet
val result = rdd
.flatMapValues(_.filter(zs contains _).distinct)
.map((_, 1))
.reduceByKey(_ + _)
where
_.filter(zs contains _).distinct
keeps only the values that occur in z and deduplicates them, so a row contributes at most one count per matching element.
result.take(2).foreach(println)
// ((B,4),1)
// ((A,4),2)
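As a side note, the attempt in the question yields RDD[Unit] because a for loop without yield returns Unit, and so does an if without an else branch. The sketch below shows a for/yield version on local collections, which stand in for the RDD here. CountMatches and its member names are illustrative only.

```scala
object CountMatches {
  val rows: Seq[(String, Array[String])] = Seq(
    ("A", Array("2", "3", "4")),
    ("B", Array("4", "4", "4")),
    ("A", Array("4", "5")))
  val z: Set[String] = Set("1", "4")

  // `for ... yield` with a guard returns the matching pairs instead of Unit.
  // `contains` is true once per row, so repeated values in a row count once.
  val pairs: Seq[(String, String)] = for {
    (x, ys) <- rows
    zi <- z.toSeq
    if ys.contains(zi)
  } yield (x, zi)

  // Count occurrences of each (x, z) pair; reduceByKey(_ + _) plays this role on an RDD.
  val counts: Map[(String, String), Int] =
    pairs.groupBy(identity).map { case (pair, hits) => (pair, hits.size) }

  def main(args: Array[String]): Unit = counts.foreach(println)
}
```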

Add random elements to keyed RDD from the same RDD

Imagine we have a keyed RDD RDD[(Int, List[String])] with thousands of keys and thousands to millions of values:
val rdd = sc.parallelize(Seq(
(1, List("a")),
(2, List("a", "b")),
(3, List("b", "c", "d")),
(4, List("f"))))
For each key I need to add random values from other keys. Number of elements to add varies and depends on the number of elements in the key. So that the output could look like:
val rdd2: RDD[(Int, List[String])] = sc.parallelize(Seq(
(1, List("a", "c")),
(2, List("a", "b", "b", "c")),
(3, List("b", "c", "d", "a", "a", "f")),
(4, List("f", "d"))))
I came up with the following solution, which is obviously not very efficient (note: flattening and aggregation are optional; flattened data is fine for me):
// flatten the input RDD
val rddFlat: RDD[(Int, String)] = rdd.flatMap(x => x._2.map(s => (x._1, s)))
// calculate number of elements for each key
val count = rddFlat.countByKey().toSeq
// foreach key take samples from the input RDD, change the original key and union all RDDs
val rddRandom: RDD[(Int, String)] = count.map { x =>
(x._1, rddFlat.sample(withReplacement = true, x._2.toDouble / count.map(_._2).sum, scala.util.Random.nextLong()))
}.map(x => x._2.map(t => (x._1, t._2))).reduce(_.union(_))
// union the input RDD with the random RDD and aggregate
val rddWithRandomData: RDD[(Int, List[String])] = rddFlat
.union(rddRandom)
.aggregateByKey(List[String]())(_ :+ _, _ ++ _)
What's the most efficient and elegant way to achieve that?
I use Spark 1.4.1.
Looking at the current approach, and to keep the solution scalable, the area to focus on is probably a sampling mechanism that works in a distributed fashion and removes the need to collect the keys back to the driver.
In a nutshell, we need a distributed method to take a weighted sample of all the values.
What I propose is to create a keys x values matrix where each cell holds the probability of that value being chosen for that key. Then we can assign a random score to each cell and pick the values whose score falls within the probability.
Let's write a Spark-based algorithm for that:
import scala.util.Random

// Sample data to guide us.
// Note that the values are distinguishable across keys, so we can see how the sampled data spreads over the keys.
val data = sc.parallelize(Seq(
  (1, List("A", "B")),
  (2, List("x", "y", "z")),
  (3, List("1", "2", "3", "4")),
  (4, List("foo", "bar")),
  (5, List("+")),
  (6, List())))
// one (key, value) row per list element
val flattenedData = data.flatMap{case (k,vlist) => vlist.map(v=> (k,v))}
// all values, regardless of key
val values = data.flatMap{case (k,list) => list}
val keysBySize = data.map{case (k, list) => (k,list.size)}
val totalElements = keysBySize.map{case (k,size) => size}.sum
// probability per key = share of all values that belong to that key
val keysByProb = keysBySize.mapValues{size => size.toDouble/totalElements}
// every (key, prob) paired with every value
val probMatrix = keysByProb.cartesian(values)
val scoredSamples = probMatrix.map{case ((key, prob),value) =>
  ((key,value),(prob, Random.nextDouble))}
scoredSamples looks like this:
((1,A),(0.16666666666666666,0.911900315814998))
((1,B),(0.16666666666666666,0.13615047422122906))
((1,x),(0.16666666666666666,0.6292430257377151))
((1,y),(0.16666666666666666,0.23839887096373114))
((1,z),(0.16666666666666666,0.9174808344986465))
...
val samples = scoredSamples.collect{case (entry, (prob,score)) if (score<prob) => entry}
samples looks like this:
(1,foo)
(1,bar)
(2,1)
(2,3)
(3,y)
...
Now, we union our sampled data with the original and have our final result.
val result = (flattenedData union samples).groupByKey.mapValues(_.toList)
result.collect()
(1,List(A, B, B))
(2,List(x, y, z, B))
(3,List(1, 2, 3, 4, z, 1))
(4,List(foo, bar, B, 2))
(5,List(+, z))
Because the whole algorithm is written as a sequence of transformations on the original data (see DAG below), with minimal shuffling (only the final groupByKey, which runs over a small result set), it should be scalable. The only limitation is the list of values per key built in the groupByKey stage, which is there only to match the representation used in the question.
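To see why this adds roughly as many values to a key as the key already has: a key holding s of the N values overall gets probability s/N, and each of the N values is scored against it independently, so the expected number of hits is N * (s/N) = s. The sketch below reproduces the scoring step on local collections, which stand in for the RDDs. The object name and the caller-supplied Random are assumptions made so the example can be run and seeded.

```scala
import scala.util.Random

object WeightedSampleSketch {
  val data: Seq[(Int, List[String])] = Seq(
    (1, List("A", "B")),
    (2, List("x", "y", "z")),
    (3, List("1", "2", "3", "4")),
    (4, List("foo", "bar")),
    (5, List("+")),
    (6, List()))

  val values: Seq[String] = data.flatMap(_._2)
  val totalElements: Int = values.size

  // Probability of any single value being added to a key: size / total.
  val keysByProb: Seq[(Int, Double)] =
    data.map { case (k, list) => (k, list.size.toDouble / totalElements) }

  // Expected number of added values per key: total * prob, i.e. the key's own size.
  val expectedAdded: Map[Int, Double] =
    keysByProb.map { case (k, p) => (k, p * totalElements) }.toMap

  // Score every (key, value) cell and keep those whose score falls below prob.
  def sample(rng: Random): Seq[(Int, String)] =
    for {
      (k, p) <- keysByProb
      v <- values
      if rng.nextDouble() < p
    } yield (k, v)

  def main(args: Array[String]): Unit = sample(new Random()).foreach(println)
}
```

Key 6 has an empty list, so its probability is 0 and it never receives a sampled value, which matches its absence from the result shown above.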

Getting first n distinct Key Tuples in Scala Spark

I have an RDD of tuples as follows:
(a, 1), (a, 2), (b,1)
How can I get the first two tuples with distinct keys? If I do a take(2), I will get (a, 1) and (a, 2).
What I need is (a, 1), (b,1) (the keys are distinct). The values are irrelevant.
Here's what I threw together in Scala.
sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 1)))
  // the two arguments are values of the same key; keep one of them
  .reduceByKey((v1, _) => v1)
  .collect()
Outputs
Array[(String, Int)] = Array((a,1), (b,1))
As you already have an RDD of pairs, your RDD has the extra key-value functionality provided by org.apache.spark.rdd.PairRDDFunctions. Let's make use of that.
val pairRdd = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 1)))
// RDD[(String, Int)]
val groupedRdd = pairRdd.groupByKey()
// RDD[(String, Iterable[Int])]
val requiredRdd = groupedRdd.map { case (key, iter) => (key, iter.head) }
// RDD[(String, Int)]
Or in short
sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 1)))
  .groupByKey()
  .map { case (key, iter) => (key, iter.head) }
A simpler option is collectAsMap, as below:
val data = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 1)))
// collectAsMap brings the data to the driver and keeps a single value per key
data.collectAsMap().foreach(println)
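None of the snippets above limits the result to n keys. The sketch below shows the full "first n distinct keys" idea on a local collection, which stands in for the RDD: keep the first pair seen for each key and ignore further keys once n are kept. FirstDistinctKeys and firstDistinct are illustrative names. On an RDD, reduceByKey((v, _) => v).take(n) should play the same role, although the keys and values it returns are then not guaranteed to follow input order.

```scala
object FirstDistinctKeys {
  // Keep the first pair seen for each key, in input order, up to n keys.
  def firstDistinct[K, V](pairs: Seq[(K, V)], n: Int): Seq[(K, V)] = {
    val (_, kept) = pairs.foldLeft((Set.empty[K], Vector.empty[(K, V)])) {
      case ((seen, acc), (k, v)) =>
        if (seen(k) || acc.size >= n) (seen, acc) // duplicate key, or already full
        else (seen + k, acc :+ ((k, v)))
    }
    kept
  }

  val result: Seq[(String, Int)] =
    firstDistinct(Seq(("a", 1), ("a", 2), ("b", 1)), 2)

  def main(args: Array[String]): Unit = result.foreach(println)
}
```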

How to transform RDD[(Key, Value)] into Map[Key, RDD[Value]]

I have searched for a solution for a long time but have not found a working approach.
Using Spark RDDs in Scala, how can I transform an RDD[(Key, Value)] into a Map[Key, RDD[Value]], given that I can't use collect or other methods that may load the data into memory?
My final goal is to loop over the Map[Key, RDD[Value]] by key and call saveAsNewAPIHadoopFile for each RDD[Value].
For example, if I have:
RDD[("A", 1), ("A", 2), ("A", 3), ("B", 4), ("B", 5), ("C", 6)]
I'd like:
Map[("A" -> RDD[1, 2, 3]), ("B" -> RDD[4, 5]), ("C" -> RDD[6])]
I wonder whether it would be too costly to do this with a filter on each key A, B, C of the RDD[(Key, Value)]. I don't know whether calling filter as many times as there are distinct keys would be efficient (probably not, but maybe with cache?).
Thank you
You could use code like this (Python):
rdd = sc.parallelize([("A", 1), ("A", 2), ("A", 3), ("B", 4), ("B", 5), ("C", 6)]).cache()
keys = rdd.keys().distinct().collect()
for key in keys:
    # keep only this key's values; the tuple-unpacking lambda form is Python 2 only
    out = rdd.filter(lambda kv: kv[0] == key).map(lambda kv: kv[1])
    out.saveAsNewAPIHadoopFile(...)
An RDD cannot be part of another RDD, so you have no option but to collect the keys and turn the values of each key into a separate RDD. In my example you iterate over the cached RDD, which is fine and should be fast.
It sounds like what you really want is to save your KV RDD to a separate file for each key. Rather than creating a Map[Key, RDD[Value]] consider using a MultipleTextOutputFormat similar to the example here. The code is pretty much all there in the example.
The benefit of this approach is that you are guaranteed to take only one pass over the RDD after the shuffle, and you get the same result you wanted. If you did this by filtering and creating several RDDs as suggested in the other answer (unless your source supported pushdown filters), you would end up taking one pass over the dataset for each individual key, which would be much slower.
This is my simple test code.
val test_RDD = sc.parallelize(List(("A",1),("A",2), ("A",3),("B",4),("B",5),("C",6)))
val groupby_RDD = test_RDD.groupByKey()
val result_RDD = groupby_RDD.map{v =>
  var result_list:List[Int] = Nil
  for (i <- v._2) {
    result_list ::= i
  }
  (v._1, result_list)
}
The result is below
result_RDD.take(3)
>> res86: Array[(String, List[Int])] = Array((A,List(1, 3, 2)), (B,List(5, 4)), (C,List(6)))
Or you can do it like this
val test_RDD = sc.parallelize(List(("A",1),("A",2), ("A",3),("B",4),("B",5),("C",6)))
val nil_list:List[Int] = Nil
val result2 = test_RDD.aggregateByKey(nil_list)(
  (acc, value) => value :: acc,
  (acc1, acc2) => acc1 ::: acc2 )
The result is this
result2.take(3)
>> res209: Array[(String, List[Int])] = Array((A,List(3, 2, 1)), (B,List(5, 4)), (C,List(6)))

sort two lists by their first element and zip them in scala

val descrList = cursorReal.interfaceInfo.interfaces.map {
case values => (values.ifIndex , values.ifName , values.ifType)
}
val ipAddressList = cursorReal.interfaceIpAndIndex.filter(x=> (!x.ifIpAddress.equalsIgnoreCase("0"))).map {
case values => (values.ifIndex,values.ifIpAddress)
}
For instance,
val descrList =
List((12,"VoIP-Null0",1), (8,"FastEthernet6",6), (19,"Vlan11",53),
(4,"FastEthernet2",6), (15,"Vlan1",53), (11,"GigabitEthernet0",6),
(9,"FastEthernet7",6), (22,"Vlan20",53), (13,"Wlan-GigabitEthernet0",6),
(16,"Async1",1), (5,"FastEthernet3",6), (10,"FastEthernet8",6),
(21,"Vlan12",53), (6,"FastEthernet4",6), (1,"wlan-ap0",24),
(17,"Virtual-Template1",131), (14,"Null0",1), (20,"Vlan10",53),
(2,"FastEthernet0",6), (18,"NVI0",1), (7,"FastEthernet5",6),
(29,"Virtual-Access7",131), (3,"FastEthernet1",6), (28,"Virtual-Access6",131))
val ipAddressList = List((21,"192.168.12.1"), (19,"192.168.11.1"),
(11,"104.36.252.115"), (20,"192.168.10.1"),
(22,"192.168.20.1"))
In both lists the first element is the index, and I have to merge the two lists by index. That is,
the address (21,"192.168.12.1") should be merged with (21,"Vlan12",53) to form a new entry like (21,"Vlan12",53,"192.168.12.1").
scala> descrList map {case (index, v1, v2) =>
(index, v1, v2, ipAddressList.toMap.getOrElse(index, "empty"))}
res0: List[(Int, String, Int, String)] = List(
(12,VoIP-Null0,1,empty), (8,FastEthernet6,6,empty), (19,Vlan11,53,192.168.11.1),
(4,FastEthernet2,6,empty), (15,Vlan1,53,empty), (11,GigabitEthernet0,6,104.36.252.115),
(9,FastEthernet7,6,empty), (22,Vlan20,53,192.168.20.1), (13,Wlan-GigabitEthernet0,6,empty),
(16,Async1,1,empty), (5,FastEthernet3,6,empty), (10,FastEthernet8,6,empty),
(21,Vlan12,53,192.168.12.1), (6,FastEthernet4,6,empty), (1,wlan-ap0,24,empty),
(17,Virtual-Template1,131,empty), (14,Null0,1,empty), (20,Vlan10,53,192.168.10.1), (2,FastEthernet0,6,empty),
(18,NVI0,1,empty), (7,FastEthernet5,6,empty), (29,Virtual-Access7,131,empty),
(3,FastEthernet1,6,empty), (28,Virtual-Access6,131,empty))
First, I would suggest you produce a Map instead of a List. A Map is indexed by key by nature, and in your case the key would be the ifIndex value.
Once you have Maps in place, you can use something like the following (taken from another Stack Overflow question, "Best way to merge two maps and sum the values of same key?").
From Rex Kerr:
map1 ++ map2.map{ case (k,v) => k -> (v + map1.getOrElse(k,0)) }
Or like this from Matthew Farwell:
(map1.keySet ++ map2.keySet).map { i => (i, map1.getOrElse(i,0) + map2.getOrElse(i,0)) }.toMap
If you cannot use Maps for whatever reason, then look into your existing project libraries. If you have Scalaz, then you have some tools already available.
Scalaz: https://github.com/scalaz/scalaz
If you have Slick, you also have some nice tools to directly use.
Slick: http://slick.typesafe.com/docs/
Consider first converting descrList to a Map, like this:
val a = (for ( (k,v1,v2) <- descrList) yield k -> (v1,v2)).toMap
Then we can look up the keys from ipAddressList and combine the elements into a new tuple, as follows:
for ( (k,ip) <- ipAddressList ; v = a.getOrElse(k,("none","none")) ) yield (k,v._1,v._2,ip)
Hence, for ipAddressList,
res: List((21,Vlan12,53,192.168.12.1), (19,Vlan11,53,192.168.11.1),
(11,GigabitEthernet0,6,104.36.252.115), (20,Vlan10,53,192.168.10.1),
(22,Vlan20,53,192.168.20.1))
Given the data:
val descrList =
List((12, "VoIP-Null0", 1), (8, "FastEthernet6", 6), (19, "Vlan11", 53),
(4, "FastEthernet2", 6), (15, "Vlan1", 53), (11, "GigabitEthernet0", 6),
(9, "FastEthernet7", 6), (22, "Vlan20", 53), (13, "Wlan-GigabitEthernet0", 6),
(16, "Async1", 1), (5, "FastEthernet3", 6), (10, "FastEthernet8", 6),
(21, "Vlan12", 53), (6, "FastEthernet4", 6), (1, "wlan-ap0", 24),
(17, "Virtual-Template1", 131), (14, "Null0", 1), (20, "Vlan10", 53),
(2, "FastEthernet0", 6), (18, "NVI0", 1), (7, "FastEthernet5", 6),
(29, "Virtual-Access7", 131), (3, "FastEthernet1", 6), (28, "Virtual-Access6", 131))
val ipAddressList = List((21, "192.168.12.1"), (19, "192.168.11.1"),
(11, "104.36.252.115"), (20, "192.168.10.1"),
(22, "192.168.20.1"))
Merge and sort:
val addrMap = ipAddressList.toMap
val output = descrList
.filter(x => addrMap.contains(x._1))
.map { case (i, a, b) => (i, a, b, addrMap(i)) }
.sortBy(_._1)
output foreach println
Output:
(11,GigabitEthernet0,6,104.36.252.115)
(19,Vlan11,53,192.168.11.1)
(20,Vlan10,53,192.168.10.1)
(21,Vlan12,53,192.168.12.1)
(22,Vlan20,53,192.168.20.1)
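The filter-then-map pair above can also be written as a single pass by iterating over the Option that Map.get returns. The following is a minimal sketch on a trimmed subset of the data; MergeByIndex and the field names in the pattern are illustrative, not from the original code.

```scala
object MergeByIndex {
  // Trimmed subset of the lists from the question.
  val descrList: List[(Int, String, Int)] = List(
    (21, "Vlan12", 53), (8, "FastEthernet6", 6),
    (19, "Vlan11", 53), (11, "GigabitEthernet0", 6))
  val ipAddressList: List[(Int, String)] = List(
    (21, "192.168.12.1"), (19, "192.168.11.1"), (11, "104.36.252.115"))

  // Build the lookup once, outside the traversal.
  val addrMap: Map[Int, String] = ipAddressList.toMap

  // Iterating over the Option from `get` drops rows that have no address.
  val merged: List[(Int, String, Int, String)] =
    (for {
      (i, name, ifType) <- descrList
      ip <- addrMap.get(i)
    } yield (i, name, ifType, ip)).sortBy(_._1)

  def main(args: Array[String]): Unit = merged.foreach(println)
}
```

Building addrMap once also avoids calling toMap for every element of descrList.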