Replacing the values of an RDD with another - scala

I have two data sets like the ones below. Each data set has comma-separated numbers on each line.
Dataset 1
1,2,0,8,0
2,0,9,0,3
Dataset 2
7,5,4,6,3
4,9,2,1,8
I have to replace the zeroes in the first data set with the corresponding values from data set 2.
So the result would look like this:
1,2,4,8,3
2,9,9,1,3
I replaced the values with the code below.
val rdd1 = sc.textFile(dataset1).flatMap(l => l.split(","))
val rdd2 = sc.textFile(dataset2).flatMap(l => l.split(","))
val result = rdd1.zip(rdd2).map( x => if(x._1 == "0") x._2 else x._1)
The output I got is an RDD[String], but I need an RDD[Array[String]], as that format is more suitable for my further transformations.

If you want an RDD[Array[String]], where each array corresponds to one input line, don't flatMap the values after splitting, just map them.
scala> val rdd1 = sc.parallelize(List("1,2,0,8,0", "2,0,9,0,3")).map(l => l.split(","))
rdd1: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[1] at map at <console>:27
scala> val rdd2 = sc.parallelize(List("7,5,4,6,3", "4,9,2,1,8")).map(l => l.split(","))
rdd2: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[3] at map at <console>:27
scala> val result = rdd1.zip(rdd2).map{case(arr1, arr2) => arr1.zip(arr2).map{case(v1, v2) => if(v1 == "0") v2 else v1}}
result: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[5] at map at <console>:31
scala> result.collect
res0: Array[Array[String]] = Array(Array(1, 2, 4, 8, 3), Array(2, 9, 9, 1, 3))
or maybe less verbose:
val result = rdd1.zip(rdd2).map(t => t._1.zip(t._2).map(x => if(x._1 == "0") x._2 else x._1))
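If the data comes from files as in the question, the same approach applies. A minimal sketch, assuming dataset1 and dataset2 are the file paths from the question (note that zip requires both RDDs to have the same number of partitions and the same number of elements per partition):
val rdd1 = sc.textFile(dataset1).map(_.split(","))   // RDD[Array[String]]
val rdd2 = sc.textFile(dataset2).map(_.split(","))
val result = rdd1.zip(rdd2).map { case (arr1, arr2) =>
  arr1.zip(arr2).map { case (v1, v2) => if (v1 == "0") v2 else v1 }
}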

Related

How to get RDD values that exist in an array?

I have an RDD[(Int, Double)] and an Array[Int], and I want to get a new RDD[(Int, Double)] containing only those entries whose Int key also exists in the array.
E.g. if my array is [0, 1, 2] and my RDD is (1, 4.2), (5, 4.3), I want to get only (1, 4.2) as the output RDD.
I am thinking about using filter with a function that iterates over the array, does the comparison and returns true/false, but I am not sure whether that fits Spark's way of doing things.
Something like:
val newrdd = rdd.filter(x => f(x._1, array))
where
def f(x: Int, y: Array[Int]): Boolean = {
  var z = false
  for (a <- 0 until y.length) {
    if (x == y(a)) {
      z = true
    }
  }
  z
}
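For completeness, a minimal sketch of the filter approach the question describes, using the corrected f and the sample data above:
val array = Array(0, 1, 2)
val rdd = sc.parallelize(Seq((1, 4.2), (5, 4.3)))
val newRdd = rdd.filter(x => f(x._1, array))
newRdd.collect()   // Array((1,4.2))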
// Input rdd
val rdd = sc.parallelize(Seq((1, 4.2), (5, 4.3)))
// The array, converted to an RDD
val arrRdd = sc.parallelize(Array(0, 1, 2))
// Convert rdd and arrRdd to DataFrames (toDF needs the SparkSession implicits)
import spark.implicits._
val arrDF = arrRdd.toDF()
val df = rdd.toDF()
// Do a left-semi join and convert the result back to an RDD
df.join(arrDF, df.col("_1") === arrDF.col("value"), "leftsemi").rdd.collect
// Output: Array([1,4.2])
Try this:
rdd.filter(x => Array(0,1,2).contains(x._1)).collect.foreach(println)
Output:
(1,4.2)
Or convert the array to a Set first, which gives constant-time membership checks:
val acceptableValues = array.toSet
rdd.filter { case (x, _) => acceptableValues(x) }
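If the array is large, broadcasting the set avoids shipping it inside every task closure. A small sketch under that assumption:
val acceptable = sc.broadcast(array.toSet)
rdd.filter { case (x, _) => acceptable.value(x) }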

Accessing a specific element of an Array RDD in apache-spark scala

I have an RDD containing an array of key/value pairs. I want to get an element with a given key (say 4).
scala> val a = sc.parallelize(List("dog","tiger","lion","cat","spider","eagle"),2)
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:27
scala> val b = a.keyBy(_.length)
b: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[1] at keyBy at <console>:29
I have tried to apply a filter on it, but I am getting an error.
scala> val c = b.filter(p => p(0) = 4);
<console>:31: error: value update is not a member of (Int, String)
val c = b.filter(p => p(0) = 4);
I want to print the key/value pair with a specific key (say 4) as Array((4,lion)).
The data always comes in the form of an array of key/value pairs.
Use p._1 instead of p(0), and == instead of = for the comparison:
val rdd = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider", "eagle"), 1)
val kvRdd: RDD[(Int, String)] = rdd.keyBy(_.length)
val filterRdd: RDD[(Int, String)] = kvRdd.filter(p => p._1 == 4)
//display rdd
println(filterRdd.collect().toList)
List((4,lion))
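Pattern matching on the pair is an equivalent, slightly more readable alternative to the positional accessor:
val filterRdd = kvRdd.filter { case (k, _) => k == 4 }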
There's a lookup method applicable to RDDs of Key-Value pairs (RDDs of type RDD[(K,V)]) that directly offers this functionality.
b.lookup(4)
// res4: Seq[String] = WrappedArray(lion)
b.lookup(5)
// res6: Seq[String] = WrappedArray(tiger, eagle)

Specify subset of elements in Spark RDD (Scala)

My dataset is an RDD[Array[String]] with more than 140 columns. How can I select a subset of columns without hard-coding the column numbers (e.g. .map(x => (x(0), x(3), x(6), ...)))?
This is what I've tried so far (with success):
val peopleTups = people.map(x => x.split(",")).map(i => (i(0),i(1)))
However, I need more than a few columns, and would like to avoid hard-coding them.
This is what I've tried so far (that I think would be better, but has failed):
// Attempt 1
val colIndices = List(0, 3, 6, 10, 13)
val peopleTups = people.map(x => x.split(",")).map(i => i(colIndices))
// Error output from attempt 1:
<console>:28: error: type mismatch;
found : List[Int]
required: Int
val peopleTups = people.map(x => x.split(",")).map(i => i(colIndices))
// Attempt 2
colIndices map peopleTups.lift
// Attempt 3
colIndices map peopleTups
// Attempt 4
colIndices.map(index => peopleTups.apply(index))
I found this question and tried it, but because I'm looking at an RDD instead of an array, it didn't work: How can I select a non-sequential subset elements from an array using Scala and Spark?
You should map over the RDD instead of the indices.
val list = List.fill(2)(Array.range(1, 6))
// List(Array(1, 2, 3, 4, 5), Array(1, 2, 3, 4, 5))
val rdd = sc.parallelize(list) // RDD[Array[Int]]
val indices = Array(0, 2, 3)
val selectedColumns = rdd.map(array => indices.map(array)) // RDD[Array[Int]]
selectedColumns.collect()
// Array[Array[Int]] = Array(Array(1, 3, 4), Array(1, 3, 4))
What about this?
val data = sc.parallelize(List("a,b,c,d,e", "f,g,h,i,j"))
val indices = List(0,3,4)
data.map(_.split(",")).map(ss => indices.map(ss(_))).collect
This should give
res1: Array[List[String]] = Array(List(a, d, e), List(f, i, j))
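Tying this back to the question's own variables (a sketch assuming people is the raw CSV RDD[String] and colIndices is the index list from the first attempt):
val colIndices = List(0, 3, 6, 10, 13)
val peopleTups = people.map(_.split(",")).map(arr => colIndices.map(i => arr(i)))
// RDD[List[String]]: one list of selected columns per input line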

Issue while creating a pair RDD in Spark from a text file and applying reduceByKey

To run some of the simple Spark transformations given in Learning Spark, I need to create a pair RDD
(example: {(1, 2), (3, 4), (3, 6)})
What is the best way to create this so I can use groupByKey() etc. on it? I tried putting the data in a file and reading it with the code below, but somehow it doesn't work as expected.
Text file content
1 2
3 4
3 6
Code
val lines = sc.textFile("path_to_file")
val pairs = lines.map(x => (x.split(" ")(0), x))
pairs.foreach(println)
It prints as below
scala> pairs.foreach(println)
(1,1 2)
(3,3 4)
(3,3 6)
While I want it as
1 2
3 4
3 6
Is there an easier way to do this in Scala?
Split each line by index for both the key and the value to generate a pair RDD:
val pairs = lines.map(x => (x.split(" ")(0), x.split(" ")(1)))
Try this:
scala> val pairsRDD = lines.flatMap { x =>
x.split("""\s+""") match {
case Array(a,b) => Some((a,b))
case _ => None
}
}
pairsRDD: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[21] at flatMap at <console>:23
scala> val pairs = pairsRDD.collect
pairs: Array[(String, String)] = Array((1,2), (3,4), (3,6))
scala> pairs foreach println
(1,2)
(3,4)
(3,6)
NOTE: If you want the values as numbers instead of Strings, just add a type conversion (.toInt, .toDouble, etc.).
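For example, a sketch keeping the same pattern but converting both sides to Int:
val intPairs = lines.flatMap { x =>
  x.split("""\s+""") match {
    case Array(a, b) => Some((a.toInt, b.toInt))
    case _ => None
  }
}
// RDD[(Int, Int)], ready for reduceByKey or groupByKey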
Thanks all for the replies. Here is the solution that worked for me:
val lines = sc.textFile("path to file ")
val pairs = lines.keyBy(line => line.split(" ")(0)).mapValues(line => line.split(" ")(1).trim.toInt)
pairs.reduceByKey((x,y) => x+y).foreach(println)
scala> pairs.reduceByKey((x,y) => x+y).foreach(println)
(3,10)
(1,2)
You can use the following
val pairs = lines.flatMap(x => x.split("\n") )
Good luck!

Selecting multiple arbitrary columns from Scala array using map()

I'm new to Scala (and Spark). I'm trying to read in a CSV file and extract multiple arbitrary columns from the data. The following function does this, but with hard-coded column indices:
def readCSV(filename: String, sc: SparkContext): RDD[String] = {
val input = sc.textFile(filename).map(line => line.split(","))
val out = input.map(csv => csv(2)+","+csv(4)+","+csv(15))
return out
}
Is there a way to use map with an arbitrary number of column indices passed to the function in an array?
If you have a sequence of indices, you can map over it and return the corresponding values:
scala> val m = List(List(1,2,3), List(4,5,6))
m: List[List[Int]] = List(List(1, 2, 3), List(4, 5, 6))
scala> val indices = List(0,2)
indices: List[Int] = List(0, 2)
// For each inner sequence, get the relevant values
// indices.map(inner) is the same as indices.map(i => inner(i))
scala> m.map(inner => indices.map(inner))
res1: List[List[Int]] = List(List(1, 3), List(4, 6))
// If you want to join all of them use .mkString
scala> m.map(inner => indices.map(inner).mkString(","))
res2: List[String] = List(1,3, 4,6) // that's actually a List containing 2 Strings
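Applied to the original readCSV function, with the indices passed in as a parameter (a sketch; the cols argument is an assumption, not part of the original code):
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def readCSV(filename: String, cols: Seq[Int], sc: SparkContext): RDD[String] = {
  val input = sc.textFile(filename).map(line => line.split(","))
  // Pick the requested columns from each row and join them back into a CSV string
  input.map(csv => cols.map(i => csv(i)).mkString(","))
}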