Counting the number of occurrences of array elements in an RDD - Scala

I have an RDD RDD1 with key-value pairs of type (String, Array[String]) (I will refer to them as (X, Y)), and an array Z of type Array[String].
For every element Z(i) in Z, I'm trying to count how many (X, Y) pairs there are whose Y contains Z(i). I want my output as ((X, Z(i)), #ofInstances).
For example, if
RDD1 = ((A, (2, 3, 4)), (B, (4, 4, 4)), (A, (4, 5)))
Z = (1, 4)
then I want to get:
(((A, 4), 2), ((B, 4), 1))
Hope that made sense.
As you can see above, I only want an element if there is at least one occurrence.
I have tried this so far:
val newRDD = RDD1.map{case(x, y) => for(i <- 0 to (z.size-1)){if(y.contains(z(i))) {((x, z(i)), 1)}}}
My output here is an RDD[Unit] (the for loop without yield returns Unit).
I'm not sure if what I'm asking for is even possible, or if I have to do it another way.

So it is just another word count
val rdd = sc.parallelize(Seq(
("A", Array("2", "3", "4")),
("B", Array("4", "4", "4")),
("A", Array("4", "5"))))
val z = Array("1", "4")
To make lookups efficient, convert z to a Set:
val zs = z.toSet
val result = rdd
.flatMapValues(_.filter(zs contains _).distinct)
.map((_, 1))
.reduceByKey(_ + _)
where
_.filter(zs contains _).distinct
keeps only the values that also occur in z and deduplicates them (so the three "4"s for B are counted once).
result.take(2).foreach(println)
// ((B,4),1)
// ((A,4),2)

Spark RDD filter after groupByKey

//create RDD
val rdd = sc.makeRDD(List(("a", (1, "m")), ("b", (1, "m")),
("a", (1, "n")), ("b", (2, "n")), ("c", (1, "m")),
("c", (5, "m")), ("d", (1, "m")), ("d", (1, "n"))))
val groupRDD = rdd.groupByKey()
After groupByKey I want to drop the keys whose values all have 1 as the first element, and get
("b", (1, "m")), ("b", (2, "n")), ("c", (1, "m")), ("c", (5, "m"))
groupByKey() is necessary for my use case. Could you help me? Thanks a lot.
Addendum: if the first element of the value is a String instead, I want to drop the keys whose values all have "x" as the first element. For example, given
("a",("x","m")), ("a",("x","n")), ("b",("x","m")), ("b",("y","n")), ("c",("x","m")), ("c",("z","m")), ("d",("x","m")), ("d",("x","n"))
I should get the analogous result ("b",("x","m")), ("b",("y","n")), ("c",("x","m")), ("c",("z","m"))
You could do:
val groupRDD = rdd
.groupByKey()
.filter(value => value._2.map(tuple => tuple._1).sum != value._2.size)
.flatMapValues(list => list) // to get the result in the shape you want; right now the groups look like, e.g., (b, Seq((1, m), (2, n)))
What this does: we first group by key through groupByKey, then filter by summing the first elements of each group's values and checking whether that sum equals the group's size (which, for this data, happens exactly when every first element is 1). For example:
(a, Seq((1, m), (1, n))) -> grouped by key
sum of first elements = 1 + 1 = 2, size of the sequence = 2
2 == 2, so this row is filtered out
The final result:
(c,(1,m))
(b,(1,m))
(c,(5,m))
(b,(2,n))
Good luck!
EDIT
Under the assumption that the first element of the value tuple can be any string, and assuming rdd is your data containing:
(a,(x,m))
(c,(x,m))
(c,(z,m))
(d,(x,m))
(b,(x,m))
(a,(x,n))
(d,(x,n))
(b,(y,n))
Then we can construct uniqueCount as:
val uniqueCount = rdd
// swap places: we want to count each (key, first element) combination, i.e. (a, x), (b, x), (b, y), (c, x), (c, z), (d, x)
.map(entry => ((entry._1, entry._2._1), entry._2._2))
// count those combined keys (this brings a Map of counts back to the driver): (a, x) gives us 2, (b, x) gives us 1, (b, y) gives us 1, etc.
.countByKey()
// keep only the combinations that occur exactly once; counts greater than 1 are duplicates
.filter(a => a._2 == 1)
// extract just the original keys, so we can filter below
.map(a => a._1._1)
.toList
Then this:
val filteredRDD = rdd.filter(a => uniqueCount.contains(a._1))
Gives this output:
(b,(y,n))
(c,(x,m))
(c,(z,m))
(b,(x,m))

Apache Spark: RDD multiple passes with a simple operation

I've encountered this problem while learning the Apache Spark framework.
Consider the following simple RDD
scala> val rdd1 = sc.parallelize(List((1, Set("C3", "C2")),
(2, Set("C1", "C5", "C3")),
(3, Set("C2", "C7"))))
rdd1: RDD[(Int, Set[String])]
I want to intersect the Set of each element in rdd1 with the Sets of all elements of the same rdd1 (including itself), so that the results would be of the form:
newRDD: RDD[(Int, Int, Set[String])]
// and newRDD.collect will look like:
newRDD: Array[(Int, Int, Set[String])] = Array((1, 1, Set("C3", "C2")), (1, 2, Set("C3")), (1, 3, Set("C2")),
(2, 1, Set("C3")), (2, 2, Set("C1", "C5", "C3")), (2, 3, Set()),
(3, 1, Set("C2")), (3, 2, Set()), (3, 3, Set("C2", "C7")))
I tried nesting rdd1 like so
scala> val newRDD = rdd1 map (x => {rdd1 map (y => (x._1, y._1, x._2.intersect(y._2)))})
however, this will throw a 'Task not serializable' exception.
Now, if I wanted to avoid rdd1.collect() or any other action before performing
scala> val newRDD = rdd1 map (x => {rdd1 map (y => (x._1, y._1, x._2.intersect(y._2)))})
would it be possible to achieve the desired newRDD?
The reason you are getting the 'Task not serializable' exception is that you are referencing one RDD inside a map over another RDD, so Spark tries to serialize the second RDD. Normally you'd solve this kind of problem with a join, in this case a cartesian product:
val newRDD = rdd1.cartesian(rdd1).map { case ((a, aSet), (b, bSet)) =>
(a, b, aSet.intersect(bSet))
}
Here the cartesian product creates a new RDD containing every pair of elements, whose sets you can then intersect.
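As a quick sanity check (a sketch assuming the rdd1 from the question), collecting newRDD should give all nine (i, j, intersection) tuples, up to ordering; keep in mind that cartesian grows quadratically with the number of elements:
newRDD.collect().foreach(println)
// (1,1,Set(C3, C2))
// (1,2,Set(C3))
// (1,3,Set(C2))
// ... six more tuples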

Add random elements to keyed RDD from the same RDD

Imagine we have a keyed RDD RDD[(Int, List[String])] with thousands of keys and thousands to millions of values:
val rdd = sc.parallelize(Seq(
(1, List("a")),
(2, List("a", "b")),
(3, List("b", "c", "d")),
(4, List("f"))))
For each key I need to add random values from other keys. The number of elements to add varies and depends on the number of elements already under the key, so the output could look like:
val rdd2: RDD[(Int, List[String])] = sc.parallelize(Seq(
(1, List("a", "c")),
(2, List("a", "b", "b", "c")),
(3, List("b", "c", "d", "a", "a", "f")),
(4, List("f", "d"))))
I came up with the following solution, which is obviously not very efficient (note: the flattening and aggregation are optional, I'm fine with flattened data):
// flatten the input RDD
val rddFlat: RDD[(Int, String)] = rdd.flatMap(x => x._2.map(s => (x._1, s)))
// calculate number of elements for each key
val count = rddFlat.countByKey().toSeq
// for each key, take samples from the input RDD, change the original key and union all RDDs
val rddRandom: RDD[(Int, String)] = count.map { x =>
(x._1, rddFlat.sample(withReplacement = true, x._2.toDouble / count.map(_._2).sum, scala.util.Random.nextLong()))
}.map(x => x._2.map(t => (x._1, t._2))).reduce(_.union(_))
// union the input RDD with the random RDD and aggregate
val rddWithRandomData: RDD[(Int, List[String])] = rddFlat
.union(rddRandom)
.aggregateByKey(List[String]())(_ :+ _, _ ++ _)
What's the most efficient and elegant way to achieve that?
I use Spark 1.4.1.
Looking at the current approach, and to ensure the solution scales, the main area of focus should probably be a sampling mechanism that can run in a distributed fashion, removing the need to collect the keys back to the driver.
In a nutshell, we need a distributed way to take a weighted sample of all the values.
What I propose is to create a matrix of keys x values where each cell is the probability of that value being chosen for that key. Then we can randomly score that matrix and pick the values that fall within their probability.
Let's write a Spark-based algorithm for that:
import scala.util.Random

// sample data to guide us.
// Note that I'm using distinguishable data across keys to see how the sampled data distributes over the keys
val data = sc.parallelize(Seq(
(1, List("A", "B")),
(2, List("x", "y", "z")),
(3, List("1", "2", "3", "4")),
(4, List("foo", "bar")),
(5, List("+")),
(6, List())))
val flattenedData = data.flatMap{case (k,vlist) => vlist.map(v=> (k,v))}
val values = data.flatMap{case (k,list) => list}
val keysBySize = data.map{case (k, list) => (k,list.size)}
val totalElements = keysBySize.map{case (k,size) => size}.sum
val keysByProb = keysBySize.mapValues{size => size.toDouble/totalElements}
val probMatrix = keysByProb.cartesian(values)
val scoredSamples = probMatrix.map{case ((key, prob),value) =>
((key,value),(prob, Random.nextDouble))}
scoredSamples looks like this:
((1,A),(0.16666666666666666,0.911900315814998))
((1,B),(0.16666666666666666,0.13615047422122906))
((1,x),(0.16666666666666666,0.6292430257377151))
((1,y),(0.16666666666666666,0.23839887096373114))
((1,z),(0.16666666666666666,0.9174808344986465))
...
val samples = scoredSamples.collect{case (entry, (prob,score)) if (score<prob) => entry}
samples looks like this:
(1,foo)
(1,bar)
(2,1)
(2,3)
(3,y)
...
Now, we union our sampled data with the original and have our final result.
val result = (flattenedData union samples).groupByKey.mapValues(_.toList)
result.collect()
(1,List(A, B, B))
(2,List(x, y, z, B))
(3,List(1, 2, 3, 4, z, 1))
(4,List(foo, bar, B, 2))
(5,List(+, z))
Given that the whole algorithm is written as a sequence of transformations on the original data, with minimal shuffling (only the final groupByKey, which operates on a minimal result set), it should be scalable. The only limitation is the size of the list of values per key in the groupByKey stage, which is only there to match the representation used in the question.

How to transform RDD[(Key, Value)] into Map[Key, RDD[Value]]

I have searched for a solution for a long time but haven't found a correct algorithm.
Using Spark RDDs in Scala, how could I transform an RDD[(Key, Value)] into a Map[Key, RDD[Value]], knowing that I can't use collect or other methods which may load the data into memory?
In fact, my final goal is to loop over the Map[Key, RDD[Value]] by key and call saveAsNewAPIHadoopFile for each RDD[Value].
For example, if I get :
RDD[("A", 1), ("A", 2), ("A", 3), ("B", 4), ("B", 5), ("C", 6)]
I'd like :
Map[("A" -> RDD[1, 2, 3]), ("B" -> RDD[4, 5]), ("C" -> RDD[6])]
I wonder whether it would be too costly to use a filter for each key A, B, C of RDD[(Key, Value)], but I don't know if calling filter as many times as there are different keys would be efficient? (Of course it wouldn't be, but maybe with cache?)
Thank you
You could use code like this (Python):
rdd = sc.parallelize( [("A", 1), ("A", 2), ("A", 3), ("B", 4), ("B", 5), ("C", 6)] ).cache()
keys = rdd.keys().distinct().collect()
for key in keys:
out = rdd.filter(lambda x: x[0] == key).map(lambda kv: kv[1])
out.saveAsNewAPIHadoopFile (...)
One RDD cannot be part of another RDD, so your only option is to collect the keys and then carve out their related values into separate RDDs by filtering. In my example you iterate over the cached RDD, which is fine and works fast.
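For reference, a rough Scala sketch of the same filter-per-key idea (the output path and the use of saveAsTextFile here are placeholders for whatever output format you actually need):
val rdd = sc.parallelize(Seq(("A", 1), ("A", 2), ("A", 3), ("B", 4), ("B", 5), ("C", 6))).cache()
val keys = rdd.keys.distinct().collect()
keys.foreach { key =>
  // one filter pass over the cached RDD per distinct key
  val out = rdd.filter { case (k, _) => k == key }.values
  out.saveAsTextFile(s"/tmp/output/$key") // hypothetical output path
}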
It sounds like what you really want is to save your KV RDD to a separate file for each key. Rather than creating a Map[Key, RDD[Value]] consider using a MultipleTextOutputFormat similar to the example here. The code is pretty much all there in the example.
The benefit of this approach is that you're guaranteed to take only one pass over the RDD after the shuffle, and you get the result you wanted. If you did this by filtering and creating several RDDs as suggested in the other answer (unless your source supported pushdown filters), you would end up taking one pass over the dataset for each individual key, which would be much slower.
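A minimal sketch of that pattern (assuming String keys, the old mapred API, and a hypothetical output path):
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

// Writes each record into a file named after its key and drops the key from the output line.
class KeyBasedOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.asInstanceOf[String]
}

rdd.mapValues(_.toString)
  .saveAsHadoopFile("/tmp/output", classOf[String], classOf[String], classOf[KeyBasedOutputFormat])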
This is my simple test code.
val test_RDD = sc.parallelize(List(("A",1),("A",2), ("A",3),("B",4),("B",5),("C",6)))
val groupby_RDD = test_RDD.groupByKey()
val result_RDD = groupby_RDD.map{v =>
var result_list:List[Int] = Nil
for (i <- v._2) {
result_list ::= i
}
(v._1, result_list)
}
The result is below
result_RDD.take(3)
>> res86: Array[(String, List[Int])] = Array((A,List(1, 3, 2)), (B,List(5, 4)), (C,List(6)))
Or you can do it like this
val test_RDD = sc.parallelize(List(("A",1),("A",2), ("A",3),("B",4),("B",5),("C",6)))
val nil_list:List[Int] = Nil
val result2 = test_RDD.aggregateByKey(nil_list)(
(acc, value) => value :: acc,
(acc1, acc2) => acc1 ::: acc2 )
The result is this
result2.take(3)
>> res209: Array[(String, List[Int])] = Array((A,List(3, 2, 1)), (B,List(5, 4)), (C,List(6)))

Scala, converting multiple lists to list of tuples [duplicate]

This question already has answers here:
Can I zip more than two lists together in Scala?
(11 answers)
Closed 9 years ago.
I have 3 lists like
val a = List("a", "b", "c")
val b = List(1, 2, 3)
val c = List(4, 5, 6)
I want to convert them as follows:
List(("a", 1, 4), ("b", 2, 5), ("c", 3, 6))
Please let me know how to get this result
If you have two or three lists that you need zipped together, you can use zipped:
val a = List("a", "b", "c")
val b = List(1, 2, 3)
val c = List(4, 5, 6)
(a,b,c).zipped.toList
This results in: List((a,1,4), (b,2,5), (c,3,6))
Should be easy to achieve:
(a zip b) zip c map {
case ((x, y), z) => (x, y, z)
}
Use:
(a zip b) zip c map { case ((av,bv),cv) => (av,bv,cv) }
Note: this truncates the result list to the length of the shortest of a, b, c. If you'd rather have the result list padded with default values, use zipAll, as in the sketch below.
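For example, a quick sketch of the zipAll variant, with made-up padding defaults:
val b2 = List(1, 2) // shorter than a and c
val padded = a.zipAll(b2, "-", 0).zipAll(c, ("-", 0), -1).map {
  case ((av, bv), cv) => (av, bv, cv)
}
// padded: List((a,1,4), (b,2,5), (c,0,6))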