Match two RDDs [String] - scala

I'm trying to match two RDDs:
RDD1 contains a huge amount of words [String] and RDD2 contains city names [String].
I want to return a RDD with the elements from RDD1 that are in RDD2.
Something like the opposite to subtract.
Afterwards I want to count the occurrence of each remaining word, but that won't be a problem.
I want to return an RDD with the elements from RDD1 that are in RDD2
If I understood you correctly, a double subtract does what you want:
rdd1.subtract(rdd1.subtract(rdd2))
Note the difference between this code and intersection:
val rdd1 = sc.parallelize(Seq("a", "a", "b", "c"))
val rdd2 = sc.parallelize(Seq("a", "c", "d"))
val diff = rdd1.subtract(rdd2)
rdd1.subtract(diff).collect()
// res0: Array[String] = Array(a, a, c)
rdd1.intersection(rdd2).collect()
// res1: Array[String] = Array(a, c)
So, if your first RDD contains duplicates and you want to keep them in the result, you may prefer the double-subtract solution. Otherwise, intersection fits well.
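To see why the two approaches differ, the same semantics can be mimicked on plain Scala collections, without a cluster. This is only a sketch: `words` and `cities` are made-up stand-ins for RDD1 and RDD2, and `filter`/`distinct` on a `List` merely imitate what the double subtract and `intersection` do on an RDD.

```scala
object MatchSketch {
  // Hypothetical stand-ins for RDD1 (words) and RDD2 (city names).
  val words: List[String] = List("a", "a", "b", "c")
  val cities: Set[String] = Set("a", "c", "d")

  // Like the double subtract: keep every occurrence that also appears in cities.
  val keptWithDuplicates: List[String] = words.filter(cities.contains)

  // Like intersection: each common element appears only once.
  val keptDistinct: List[String] = keptWithDuplicates.distinct

  // Count the occurrences of each remaining word.
  val counts: Map[String, Int] =
    keptWithDuplicates.groupBy(identity).map { case (w, ws) => (w, ws.size) }

  def main(args: Array[String]): Unit = {
    println(keptWithDuplicates) // List(a, a, c)
    println(keptDistinct)       // List(a, c)
    println(counts)
  }
}
```

If the word counts are the end goal, the duplicate-preserving variant is the one to build on, since the distinct variant would report every word exactly once.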


creating pair RDD in spark using scala

I'm new to Spark, and I need to create an RDD with just two elements per value.
Array1 = ((1,1),(1,2),(1,3),(2,1),(2,2),(2,3))
When I execute groupByKey, the output is ((1,(1,2,3)),(2,(1,2,3)))
But I need the output to have just a 2-value pair with each key. I'm not sure how to get it.
Expected Output = ((1,(1,2)),(1,(1,3)),(1,(2,3)),(2,(1,2)),(2,(1,3)),(2,(2,3)))
Each pair of values should appear only once. There should only be (1,2) and not (2,1),
or like (2,3) not (3,4)
You can get the result you require as follows:
// Prior to doing the `groupBy`, you have an RDD[(Int, Int)], x, containing:
// (1,1),(1,2),(1,3),(2,1),(2,2),(2,3)
// Can simply map values as below. Result is a RDD[(Int, (Int, Int))].
val x: RDD[(Int, Int)] = sc.parallelize(Seq((1,1),(1,2),(1,3),(2,1),(2,2),(2,3)))
val y: RDD[(Int, (Int, Int))] = x.map(t => (t._1, t)) // Map first value in pair tuple to the tuple
y.collect // Get result as an array
// res0: Array[(Int, (Int, Int))] = Array((1,(1,1)), (1,(1,2)), (1,(1,3)), (2,(2,1)), (2,(2,2)), (2,(2,3)))
That is, the result is a pair RDD that relates the key (the first value of each pair) to the pair (as a tuple). Do not use groupBy, since in this case it will not give you what you want.
If I understand your requirement correctly, you can use groupByKey and flatMapValues to flatten the 2-combinations of the grouped values, as shown below:
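val rdd = sc.parallelize(Seq(
(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3)
))
// Group the values per key, then emit every 2-combination of each group
rdd.groupByKey.flatMapValues(_.toList.combinations(2)).
map{ case (k, v) => (k, (v(0), v(1))) }.
collect
// res1: Array[(Int, (Int, Int))] =
// Array((1,(1,2)), (1,(1,3)), (1,(2,3)), (2,(1,2)), (2,(1,3)), (2,(2,3)))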
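The key step in the groupByKey answer is `combinations(2)`, which comes from the Scala collections library rather than from Spark. A minimal sketch on a plain `List` shows what it produces for the values of one key; the object name `CombinationsSketch` and the value `grouped` are made up for illustration.

```scala
object CombinationsSketch {
  // The values grouped under a single key, as groupByKey would deliver them.
  val grouped: List[Int] = List(1, 2, 3)

  // combinations(2) yields each unordered pair exactly once,
  // so (1,2) appears but (2,1) does not.
  val pairs: List[(Int, Int)] =
    grouped.combinations(2).map(v => (v(0), v(1))).toList

  def main(args: Array[String]): Unit =
    println(pairs) // List((1,2), (1,3), (2,3))
}
```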

Add random elements to keyed RDD from the same RDD

Imagine we have a keyed RDD RDD[(Int, List[String])] with thousands of keys and thousands to millions of values:
val rdd = sc.parallelize(Seq(
(1, List("a")),
(2, List("a", "b")),
(3, List("b", "c", "d")),
(4, List("f"))))
For each key I need to add random values from other keys. Number of elements to add varies and depends on the number of elements in the key. So that the output could look like:
val rdd2: RDD[(Int, List[String])] = sc.parallelize(Seq(
(1, List("a", "c")),
(2, List("a", "b", "b", "c")),
(3, List("b", "c", "d", "a", "a", "f")),
(4, List("f", "d"))))
I came up with the following solution, which is obviously not very efficient (note: flattening and aggregation are optional; I'm fine with flattened data):
// flatten the input RDD
val rddFlat: RDD[(Int, String)] = rdd.flatMap(x => x._2.map(s => (x._1, s)))
// calculate number of elements for each key
val count = rddFlat.countByKey().toSeq
// foreach key take samples from the input RDD, change the original key and union all RDDs
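// assumption: the sampling fraction is this key's count divided by the total count (the divisor is missing in the post)
val rddRandom: RDD[(Int, String)] = count.map { x =>
(x._1, rddFlat.sample(withReplacement = true, x._2.toDouble / count.map(_._2).sum, scala.util.Random.nextLong()))
}.map(x => x._2.map(t => (x._1, t._2))).reduce(_.union(_))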
// union the input RDD with the random RDD and aggregate
val rddWithRandomData: RDD[(Int, List[String])] = rddFlat
.union(rddRandom)
.aggregateByKey(List[String]())(_ :+ _, _ ++ _)
What's the most efficient and elegant way to achieve that?
I use Spark 1.4.1.
Looking at the current approach, and in order to ensure that the solution scales, the focus should probably be on a sampling mechanism that can run in a distributed fashion, removing the need to collect the keys back to the driver.
In a nutshell, we need a distributed method to take a weighted sample of all the values.
What I propose is to create a keys x values matrix where each cell holds the probability of the value being chosen for that key. Then, we can randomly score that matrix and pick the values whose score falls within the probability.
Let's write a Spark-based algorithm for that:
// sample data to guide us.
//Note that I'm using distinguishable data across keys to see how the sample data distributes over the keys
val data = sc.parallelize(Seq(
(1, List("A", "B")),
(2, List("x", "y", "z")),
(3, List("1", "2", "3", "4")),
(4, List("foo", "bar")),
(5, List("+")),
(6, List())))
val flattenedData = data.flatMap{case (k,vlist) => vlist.map(v => (k,v))}
val values = data.flatMap{case (k,list) => list}
val keysBySize = data.map{case (k, list) => (k,list.size)}
val totalElements = keysBySize.map{case (k,size) => size}.sum
val keysByProb = keysBySize.mapValues{size => size.toDouble/totalElements}
val probMatrix = keysByProb.cartesian(values)
val scoredSamples = probMatrix.map{case ((key, prob),value) =>
((key,value),(prob, scala.util.Random.nextDouble))}
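scoredSamples pairs each (key, value) cell with its probability and a random score. We keep the cells whose score is below the probability: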
val samples = scoredSamples.collect{case (entry, (prob,score)) if (score<prob) => entry}
samples now holds the selected (key, value) entries.
Finally, we union the sampled data with the original to get the final result:
val result = (flattenedData union samples).groupByKey.mapValues(_.toList)
(1,List(A, B, B))
(2,List(x, y, z, B))
(3,List(1, 2, 3, 4, z, 1))
(4,List(foo, bar, B, 2))
(5,List(+, z))
Given that the whole algorithm is written as a sequence of transformations on the original data, with minimal shuffling (only the last groupByKey, which is done over a minimal result set), it should be scalable. The only limitation would be the list of values per key in the groupByKey stage, which is only there to comply with the representation used in the question.
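The scoring idea can also be tried on plain Scala collections before running it on a cluster. The following is a rough local sketch, not the Spark job itself: the object name, the reduced data set, and the fixed seed are all assumptions made for illustration, and a nested for comprehension stands in for `cartesian`.

```scala
import scala.util.Random

object WeightedSampleSketch {
  // Hypothetical local stand-in for the keyed RDD.
  val data: Seq[(Int, List[String])] = Seq(
    (1, List("A", "B")),
    (2, List("x", "y", "z")),
    (3, List.empty[String]))

  // Fixed seed: an assumption made here only so that runs are repeatable.
  val rng = new Random(42L)

  val values: Seq[String] = data.flatMap(_._2)
  val totalElements: Int = values.size

  // Probability attached to each key: its share of all elements.
  val keysByProb: Seq[(Int, Double)] =
    data.map { case (k, list) => (k, list.size.toDouble / totalElements) }

  // Keys x values "matrix": keep a cell when its random score is below prob.
  val samples: Seq[(Int, String)] = for {
    (key, prob) <- keysByProb
    value <- values
    if rng.nextDouble() < prob
  } yield (key, value)

  def main(args: Array[String]): Unit =
    samples.foreach(println)
}
```

A key with an empty list gets probability 0 and therefore never receives samples, and on average a key receives as many extra values as it already holds.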

Extract elements of lists in an RDD

What I want to achieve
I'm working with Spark and Scala. I have two Pair RDDs.
rdd1 : RDD[(String, List[String])]
rdd2 : RDD[(String, List[String])]
Both RDDs are joined on their first value.
val joinedRdd = rdd1.join(rdd2)
So the resulting RDD is of type RDD[(String, (List[String], List[String]))]. I want to map this RDD and extract the elements of both lists, so that the resulting RDD contains just these elements of the two lists.
rdd1 (id, List(a, b))
rdd2 (id, List(d, e, f))
wantedResult (a, b, d, e, f)
Naive approach
My naive approach would be to address each element directly with (i), like below:
val rdd = rdd1.join(rdd2)
.map({ case (id, lists) =>
(lists._1(0), lists._1(1), lists._2(0), lists._2(1), lists._2(2)) })
/* results in RDD[(String, String, String, String, String)] */
Is there a way to get the elements of each list without addressing each one individually? Something like "lists._1.extractAll". Is there a way to use flatMap to achieve this?
You can simply concatenate the two lists with the ++ operator:
val res: RDD[List[String]] = rdd1.join(rdd2)
.map { case (_, (list1, list2)) => list1 ++ list2 }
A better approach, which avoids carrying around List[String] values that may be very big, would probably be to explode each RDD into smaller (key, value) pairs, concatenate them, and then do a groupByKey:
val flatten1: RDD[(String, String)] = rdd1.flatMapValues(identity)
val flatten2: RDD[(String, String)] = rdd2.flatMapValues(identity)
val res: RDD[Iterable[String]] = (flatten1 ++ flatten2).groupByKey.values

Calculation on consecutive array elements

I have this:
val myInput:ArrayBuffer[(String,String)] = ArrayBuffer(
I would like to calculate the duration between each two consecutive timestamps from myInput and retrieve those like the following:
val myOutput = ArrayBuffer(
This is a paired evaluation, which led me to think something with foldLeft() might do the trick, but after giving this a little more thought, I could not come up with a solution.
I have already put something together with some for loops and .indices, but this does not seem as clean and concise as it could be. I would appreciate if somebody had a better option.
You can use zip and sliding to achieve what you want. For example, if you have a collection
scala> List(2,3,5,7,11)
res8: List[Int] = List(2, 3, 5, 7, 11)
The list of differences is res8.sliding(2).map{case List(fst,snd)=>snd-fst}.toList, which you can zip with the original list.
scala> res8.zip(res8.sliding(2).map{case List(fst,snd)=>snd-fst}.toList)
res13: List[(Int, Int)] = List((2,1), (3,2), (5,2), (7,4))
You can zip your array with itself, after dropping the first item - to match each item with the consecutive one - and then map to the calculated result:
val myInput:ArrayBuffer[(String,String)] = ArrayBuffer(
val result: ArrayBuffer[(String, Int)] = myInput.zip(myInput.drop(1)).map {
case ((k1, v1), (k2, v2)) => (k1, v2.toInt - v1.toInt)
}
result.foreach(println)
// prints:
// (a,500)
// (b,1000)
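The zip-with-drop approach can be tried end to end with a small self-contained sketch. The input values below are hypothetical, chosen only so that the differences come out as 500 and 1000; the original input is not shown in full.

```scala
import scala.collection.mutable.ArrayBuffer

object ConsecutiveDiffSketch {
  // Hypothetical input: (label, timestamp as a string).
  val myInput: ArrayBuffer[(String, String)] =
    ArrayBuffer(("a", "1000"), ("b", "1500"), ("c", "2500"))

  // Pair each element with its successor, then subtract the timestamps.
  val result: ArrayBuffer[(String, Int)] =
    myInput.zip(myInput.drop(1)).map {
      case ((k1, v1), (_, v2)) => (k1, v2.toInt - v1.toInt)
    }

  def main(args: Array[String]): Unit =
    result.foreach(println)
    // prints:
    // (a,500)
    // (b,1000)
}
```

Using `drop(1)` rather than `tail` keeps the code safe for an empty buffer: `zip` stops at the shorter collection, so empty and single-element inputs simply yield an empty result.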

Scala: Create a new list where each element is the element of old list repeated with different suffix

This seems like it should be really easy but I can't quite put it together. I want to take a list of strings and create a new list that contains two of each element from the first list but with a different suffix. So:
List("a", "b", "c") -> List("a_x", "a_y", "b_x", "b_y", "c_x", "c_y")
I tried
val list2 = list.map(i => i+"_x", i+"_y")
but Scala said I had too many arguments. This got close:
val list2 = list.map(i => (i+"_x", i+"_y"))
but it produced List(("a_x", "a_y"), ("b_x", "b_y"), ("c_x", "c_y")) which is not what I want. I'm sure I'm missing something obvious.
You want flatMap, to first map, then flatten the structure of the result into a flat list. Each individual result must itself be a collection (not a tuple):
scala> List("a", "b", "c").flatMap(i => List(i + "-x", i + "-y"))
res0: List[String] = List(a-x, a-y, b-x, b-y, c-x, c-y)
With a for comprehension:
scala> val prefixes = List("a", "b", "c")
prefixes: List[String] = List(a, b, c)
scala> val suffixes = List("x", "y")
suffixes: List[String] = List(x, y)
scala> for (prefix <- prefixes; suffix <- suffixes) yield prefix + "_" + suffix
res1: List[String] = List(a_x, a_y, b_x, b_y, c_x, c_y)
This is basically just syntactic sugar for Seth Tisue's answer.
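To make the "syntactic sugar" remark concrete, here is a small sketch that puts the for comprehension next to the nested flatMap/map it is roughly rewritten into. The object and value names are made up for illustration.

```scala
object SuffixSketch {
  val prefixes: List[String] = List("a", "b", "c")
  val suffixes: List[String] = List("x", "y")

  // The for comprehension from the answer.
  val viaFor: List[String] =
    for (prefix <- prefixes; suffix <- suffixes) yield prefix + "_" + suffix

  // Roughly what the compiler turns it into: the outer generator becomes
  // flatMap, and the last generator becomes map.
  val viaFlatMap: List[String] =
    prefixes.flatMap(prefix => suffixes.map(suffix => prefix + "_" + suffix))

  def main(args: Array[String]): Unit = {
    println(viaFor)     // List(a_x, a_y, b_x, b_y, c_x, c_y)
    println(viaFlatMap) // List(a_x, a_y, b_x, b_y, c_x, c_y)
  }
}
```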