How to get RDD values that exist in an array? - scala

I have an RDD[(Int, Double)] and an Array[Int], and I want to get a new RDD[(Int, Double)] containing only the pairs whose Int key also exists in the array.
E.g. if my array is [0, 1, 2] and my RDD is (1, 4.2), (5, 4.3), I want the output RDD to contain only (1, 4.2).
I am thinking about using filter with a function that iterates over the array, does the comparison and returns true/false, but I am not sure whether that is the Spark way of doing it.
Something like:
val newrdd = rdd.filter(x => f(x._1, array))
where
def f(x: Int, y: Array[Int]): Boolean = {
  // var, not val: z is reassigned when a match is found
  var z = false
  for (a <- 0 to y.length - 1) {
    if (x == y(a)) {
      z = true
    }
  }
  z
}

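//Input rdd (toDF() below needs the SQL implicits in scope, e.g. import spark.implicits._ with a SparkSession named spark; spark-shell imports them automatically)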
val rdd = sc.parallelize(Seq((1,4.2),(5,4.3)))
//array, convert to rdd
val arrRdd = sc.parallelize(Array(0,1,2))
//convert rdd and arrRdd to dataframe
val arrDF = arrRdd.toDF()
val df = rdd.toDF()
//do join and again convert it to rdd
df.join(arrDF,df.col("_1") === arrDF.col("value"),"leftsemi").rdd.collect
//output Array([1,4.2])

Try this:
rdd.filter(x => Array(0,1,2).contains(x._1)).collect.foreach(println)
Output:
(1,4.2)

val acceptableValues = array.toSet
rdd.filter { case (x, _) => acceptableValues(x) }
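
The Set-based predicate above can be tried on a plain Scala collection first. The sketch below is only an illustration on local data (the names pairs and kept are made up here, and a Seq stands in for the RDD); the same function literal is what you would pass to rdd.filter. A Set gives constant-time membership checks on average, whereas Array.contains scans the whole array for every record.

```scala
// Hypothetical local stand-ins for the RDD and the array from the question.
val pairs: Seq[(Int, Double)] = Seq((1, 4.2), (5, 4.3))
val array: Array[Int] = Array(0, 1, 2)

// A Set[Int] is also a function Int => Boolean, so it can serve as a predicate.
val acceptableValues: Set[Int] = array.toSet

// Keep only the pairs whose key is in the set.
val kept: Seq[(Int, Double)] = pairs.filter { case (x, _) => acceptableValues(x) }
```

With the example data, kept contains only (1, 4.2).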

Related

Accessing a specific element of an Array RDD in apache-spark scala

I have an RDD containing key-value pairs. I want to get the element with a specific key (say 4).
scala> val a = sc.parallelize(List("dog","tiger","lion","cat","spider","eagle"),2)
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:27
scala> val b = a.keyBy(_.length)
b: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[1] at keyBy at <console>:29
I tried to apply a filter to it, but I get an error.
scala> val c = b.filter(p => p(0) = 4);
<console>:31: error: value update is not a member of (Int, String)
val c = b.filter(p => p(0) = 4);
I want to print the key-value pair with a specific key (say 4) as Array((4,lion)).
The data always comes in the form of an array of key-value pairs.
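Use p._1 instead of p(0), and compare with == rather than =. Tuples are not indexed with parentheses, so p(0) = 4 is parsed as a call to p.update(0, 4), which is what the error message refers to.
import org.apache.spark.rdd.RDD
val rdd = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider", "eagle"), 1)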
val kvRdd: RDD[(Int, String)] = rdd.keyBy(_.length)
val filterRdd: RDD[(Int, String)] = kvRdd.filter(p => p._1 == 4)
//display rdd
println(filterRdd.collect().toList)
List((4,lion))
There's a lookup method applicable to RDDs of Key-Value pairs (RDDs of type RDD[(K,V)]) that directly offers this functionality.
b.lookup(4)
// res4: Seq[String] = WrappedArray(lion)
b.lookup(5)
// res6: Seq[String] = WrappedArray(tiger, eagle)

Replacing the values of an RDD with another

I have two data sets like below. Each data set has "," separated numbers in each line.
Dataset 1
1,2,0,8,0
2,0,9,0,3
Dataset 2
7,5,4,6,3
4,9,2,1,8
I have to replace the zeroes of the first data set with the corresponding values from the data set 2.
So the result would look like this
1,2,4,8,3
2,9,9,1,3
I replaced the values with the code below.
val rdd1 = sc.textFile(dataset1).flatMap(l => l.split(","))
val rdd2 = sc.textFile(dataset2).flatMap(l => l.split(","))
val result = rdd1.zip(rdd2).map( x => if(x._1 == "0") x._2 else x._1)
The output I got is of the format RDD[String]. But I need the output in the format RDD[Array[String]] as this format would be more suitable for my further transformations.
If you want an RDD[Array[String]], where each array corresponds to one line, don't flatMap the values after splitting, just map them.
scala> val rdd1 = sc.parallelize(List("1,2,0,8,0", "2,0,9,0,3")).map(l => l.split(","))
rdd1: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[1] at map at <console>:27
scala> val rdd2 = sc.parallelize(List("7,5,4,6,3", "4,9,2,1,8")).map(l => l.split(","))
rdd2: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[3] at map at <console>:27
scala> val result = rdd1.zip(rdd2).map{case(arr1, arr2) => arr1.zip(arr2).map{case(v1, v2) => if(v1 == "0") v2 else v1}}
result: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[5] at map at <console>:31
scala> result.collect
res0: Array[Array[String]] = Array(Array(1, 2, 4, 8, 3), Array(2, 9, 9, 1, 3))
Or, a bit less verbose:
val result = rdd1.zip(rdd2).map(t => t._1.zip(t._2).map(x => if(x._1 == "0") x._2 else x._1))
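
To check the per-line logic without a cluster, the same zip-and-replace step can be run on plain Scala collections. This is a minimal sketch, assuming two local Seq[String] values in place of the text files; fillZeros, lines1, lines2 and filled are names made up for the illustration.

```scala
// Hypothetical local stand-ins for the two data sets.
val lines1 = Seq("1,2,0,8,0", "2,0,9,0,3")
val lines2 = Seq("7,5,4,6,3", "4,9,2,1,8")

// Replace every "0" in the first row with the value at the same position in the second row.
def fillZeros(row1: Array[String], row2: Array[String]): Array[String] =
  row1.zip(row2).map { case (v1, v2) => if (v1 == "0") v2 else v1 }

// Split each line into columns (map, not flatMap), pair the rows up, then fix each pair.
val filled: Seq[Array[String]] =
  lines1.map(_.split(",")).zip(lines2.map(_.split(","))).map { case (a, b) => fillZeros(a, b) }
```

On an RDD the function body is unchanged: rdd1.zip(rdd2).map { case (a, b) => fillZeros(a, b) }. Keep in mind that RDD.zip expects both RDDs to have the same number of partitions and the same number of elements in each partition.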

SubtractByKey and keep rejected values

I was playing around with Spark and I am stuck on something that seems foolish.
Let's say we have two RDD:
rdd1 = {(1, 2), (3, 4), (3, 6)}
rdd2 = {(3, 9)}
If I do rdd1.subtractByKey(rdd2), I get {(1, 2)}, which is perfectly fine. But I also want to save the rejected values {(3,4),(3,6)} to another RDD. Is there a prebuilt function in Spark or an elegant way to do this?
Please keep in mind that I am new to Spark; any help will be appreciated, thanks.
As Rohan suggests, there is no (to the best of my knowledge) standard API call to do this. What you want to do can be expressed as Union - Intersection.
Here is how you can do this in Spark:
val r1 = sc.parallelize(Seq((1,2), (3,4), (3,6)))
val r2 = sc.parallelize(Seq((3,9)))
val intersection = r1.map(_._1).intersection(r2.map(_._1))
val union = r1.map(_._1).union(r2.map(_._1))
val diff = union.subtract(intersection)
diff.collect()
> Array[Int] = Array(1)
To get the actual pairs:
val d = diff.collect()
r1.union(r2).filter(x => d.contains(x._1)).collect
I think this is slightly more elegant:
val r1 = sc.parallelize(Seq((1,2), (3,4), (3,6)))
val r2 = sc.parallelize(Seq((3,9)))
val r3 = r1.leftOuterJoin(r2)
val subtracted = r3.filter(_._2._2.isEmpty).map(x=>(x._1, x._2._1))
val discarded = r3.filter(_._2._2.nonEmpty).map(x=>(x._1, x._2._1))
//subtracted: (1,2)
//discarded: (3,4)(3,6)
The insight is noticing that leftOuterJoin produces both the discarded (== records with a matching key in r2) and remaining (no matching key) in one go.
It's a pity Spark doesn't have RDD.partition (in the Scala collection sense of splitting a collection in two depending on a predicate), or we could calculate subtracted and discarded in one pass.
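
For readers who have not used it, this is what partition does on an ordinary Scala collection. The sketch is only an illustration: joined is a hand-written local copy of what r1.leftOuterJoin(r2) yields for the example data, not an RDD, and matched and unmatched are names introduced here.

```scala
// Hypothetical local copy of the leftOuterJoin result for the example data.
val joined: Seq[(Int, (Int, Option[Int]))] =
  Seq((1, (2, None)), (3, (4, Some(9))), (3, (6, Some(9))))

// partition splits a collection in two in a single traversal:
// elements satisfying the predicate go to the first result, the rest to the second.
val (matched, unmatched) = joined.partition { case (_, (_, right)) => right.nonEmpty }

// Drop the Option from the join to get back plain (key, value) pairs.
val discarded = matched.map { case (k, (v, _)) => (k, v) }
val subtracted = unmatched.map { case (k, (v, _)) => (k, v) }
```

On an RDD the two filter calls shown earlier remain necessary; caching r3 before filtering twice is one way to avoid recomputing the join.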
You can try
val rdd3 = rdd1.subtractByKey(rdd2)
val rdd4 = rdd1.subtractByKey(rdd3)
But you won't be keeping the values, just running another subtraction.
Unfortunately, I don't think there's an easy way to keep the rejected values using subtractByKey(). One way to get your desired result is through cogrouping and filtering. Something like:
val cogrouped = rdd1.cogroup(rdd2, numPartitions)
def flatFunc[A, B](key: A, values: Iterable[B]) : Iterable[(A, B)] = for {value <- values} yield (key, value)
val res1 = cogrouped.filter(_._2._2.isEmpty).flatMap { case (key, values) => flatFunc(key, values._1) }
val res2 = cogrouped.filter(_._2._2.nonEmpty).flatMap { case (key, values) => flatFunc(key, values._1) }
You might be able to borrow the work done here to make the last two lines look more elegant.
When I run this on your example, I see:
scala> val rdd1 = sc.parallelize(Array((1, 2), (3, 4), (3, 6)))
scala> val rdd2 = sc.parallelize(Array((3, 9)))
scala> val cogrouped = rdd1.cogroup(rdd2)
scala> def flatFunc[A, B](key: A, values: Iterable[B]) : Iterable[(A, B)] = for {value <- values} yield (key, value)
scala> val res1 = cogrouped.filter(_._2._2.isEmpty).flatMap { case (key, values) => flatFunc(key, values._1) }
scala> val res2 = cogrouped.filter(_._2._2.nonEmpty).flatMap { case (key, values) => flatFunc(key, values._1) }
scala> res1.collect()
...
res7: Array[(Int, Int)] = Array((1,2))
scala> res2.collect()
...
res8: Array[(Int, Int)] = Array((3,4), (3,6))
First use subtractByKey() and then subtract:
val rdd1 = spark.sparkContext.parallelize(Seq((1,2), (3,4), (3,5)))
val rdd2 = spark.sparkContext.parallelize(Seq((3,10)))
val result = rdd1.subtractByKey(rdd2)
result.foreach(print) // (1,2)
val rejected = rdd1.subtract(result)
rejected.foreach(print) // (3,5)(3,4)

Issue while creating a pair RDD in Spark from a text file and applying reduceByKey

To run some simple Spark transformations given in Learning Spark, I need to create a pair RDD
(example: {(1, 2), (3, 4), (3, 6)})
What is the best way to create this so I can use groupByKey() etc. on it? I tried putting this in a file and reading it with the code below, but somehow this doesn't work.
Text file content
1 2
3 4
3 6
Code
val lines = sc.textFile("path_to_file")
val pairs = lines.map(x => (x.split(" ")(0), x))
pairs.foreach(println)
It prints as below
scala> pairs.foreach(println)
(1,1 2)
(3,3 4)
(3,3 6)
While I want it as
1 2
3 4
3 6
Is there an easier way to do this in Scala?
Split each line and use index 0 as the key and index 1 as the value to generate a pair RDD.
val pairs = lines.map(x => (x.split(" ")(0), (x.split(" ")(1)))
Try this:
scala> val pairsRDD = lines.flatMap { x =>
x.split("""\s+""") match {
case Array(a,b) => Some((a,b))
case _ => None
}
}
pairsRDD: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[21] at flatMap at <console>:23
scala> val pairs = pairsRDD.collect
pairs: Array[(String, String)] = Array((1,2), (3,4), (3,6))
scala> pairs foreach println
(1,2)
(3,4)
(3,6)
NOTE: If you want the values as numeric types instead of String, just add a type conversion (.toInt, .toDouble, etc.).
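
The parsing step, with the numeric conversion from the note applied, can be checked on a plain Scala collection. This is a minimal sketch with made-up local data: the "malformed" line is added here only to show the fallback case, and the groupBy at the end merely imitates what reduceByKey(_ + _) would compute on an RDD.

```scala
// Hypothetical local stand-in for sc.textFile(...); the last line is deliberately malformed.
val lines = Seq("1 2", "3 4", "3 6", "malformed")

// Keep only lines with exactly two whitespace-separated fields and convert both to Int.
// Note: toInt throws NumberFormatException if a field is not a valid integer.
val pairs: Seq[(Int, Int)] = lines.flatMap { line =>
  line.trim.split("""\s+""") match {
    case Array(a, b) => Some((a.toInt, b.toInt))
    case _           => None
  }
}

// Local equivalent of pairs.reduceByKey(_ + _): sum the values for each key.
val sums: Map[Int, Int] = pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }
```

The flatMap body can be passed unchanged to lines.flatMap on an RDD[String], which then yields an RDD[(Int, Int)] that reduceByKey and groupByKey accept.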
Thanks all for the replies. Here is the solution that worked for me:
val lines = sc.textFile("path to file ")
val pairs = lines.keyBy( line => (line.split(" ")(0))).mapValues( line => line.split(" ") (1).trim.toInt)
pairs.reduceByKey((x,y) => x+y).foreach(println)
scala> pairs.reduceByKey((x,y) => x+y).foreach(println)
(3,10)
(1,2)
You can use the following
val pairs = lines.flatMap(x => x.split("\n") )
Good luck!

Selecting multiple arbitrary columns from Scala array using map()

I'm new to Scala (and Spark). I'm trying to read in a csv file and extract multiple arbitrary columns from the data. The following function does this, but with hard-coded column indices:
def readCSV(filename: String, sc: SparkContext): RDD[String] = {
val input = sc.textFile(filename).map(line => line.split(","))
val out = input.map(csv => csv(2)+","+csv(4)+","+csv(15))
return out
}
Is there a way to use map with an arbitrary number of column indices passed to the function in an array?
If you have a sequence of indices, you could map over it and return the values:
scala> val m = List(List(1,2,3), List(4,5,6))
m: List[List[Int]] = List(List(1, 2, 3), List(4, 5, 6))
scala> val indices = List(0,2)
indices: List[Int] = List(0, 2)
// For each inner sequence, get the relevant values
// indices.map(inner) is the same as indices.map(i => inner(i))
scala> m.map(inner => indices.map(inner))
res1: List[List[Int]] = List(List(1, 3), List(4, 6))
// If you want to join all of them use .mkString
scala> m.map(inner => indices.map(inner).mkString(","))
res2: List[String] = List(1,3, 4,6) // that's actually a List containing two Strings
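
Applied to the original readCSV problem, the idea could look like the sketch below. It works on a local Seq[String] rather than an RDD so that it runs without Spark; selectColumns, sample and picked are hypothetical names introduced for the illustration.

```scala
// Pick the given column indices from each comma-separated line and re-join them.
// Throws ArrayIndexOutOfBoundsException if a line has fewer columns than an index requires.
def selectColumns(lines: Seq[String], indices: Seq[Int]): Seq[String] =
  lines.map { line =>
    val csv = line.split(",")
    indices.map(i => csv(i)).mkString(",")
  }

// Made-up sample data: two lines with four columns each.
val sample = Seq("a,b,c,d", "e,f,g,h")
val picked = selectColumns(sample, Seq(0, 2))
```

Inside readCSV the same expression would replace the hard-coded line: input.map(csv => indices.map(i => csv(i)).mkString(",")), with indices passed in as an extra parameter.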