Spark: scala - how to convert collection from RDD to another RDD - scala

How can I convert the collection returned by take(5) into another RDD, so that I can save only the first 5 records to an output file?
If I use saveAsTextFile, it does not let me use take and saveAsTextFile together (which is why that line is commented out below). It stores all records from the RDD in sorted order, so the first 5 records are the top 5 countries, but I want to store only those first 5 records. Is it possible to convert the collection returned by take(5) into an RDD?
val Strips = txtFileLines.map(_.split(","))
.map(line => (line(0) + "," + (line(7).toInt + line(8).toInt)))
.sortBy(x => x.split(",")(1).trim().toInt, ascending=false)
.take(5)
//.saveAsTextFile("output\\country\\byStripsBar")
Solution:
sc.parallelize(Strips, 1).saveAsTextFile("output\\country\\byStripsBar")

val rowsArray: Array[Row] = rdd.take(5)
val slicedRdd = sparkContext.parallelize(rowsArray, 1)
slicedRdd.saveAsTextFile("specify path here")

Unless you absolutely need the saveAsTextFile formatting, I would just write the take(5) output to a file using simple IO (e.g. java.io.File / PrintWriter).
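A minimal plain-IO sketch of that suggestion (the file name and data here are illustrative stand-ins for the Array returned by rdd.take(5)):

```scala
import java.io.PrintWriter

// Hypothetical stand-in for the Array[String] returned by rdd.take(5)
val top5 = Array("US,500", "IN,400", "CN,300", "BR,200", "DE,100")

// Write one record per line, mirroring saveAsTextFile's line-per-record output
val writer = new PrintWriter("top5.txt")
try top5.foreach(writer.println) finally writer.close()
```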
Otherwise, here is the wordier RDD-only solution:
scala> val rdd = sc.parallelize(5 to 1 by -1 map{x => (x, x*x)})
rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[71] at parallelize at <console>:27
scala> rdd.collect
res1: Array[(Int, Int)] = Array((5,25), (4,16), (3,9), (2,4), (1,1))
scala> val top2 = rdd.sortBy(_._1).zipWithIndex.collect{case x if (x._2 < 2) => x._1}
top2: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[79] at collect at <console>:29
scala> top2.collect
res2: Array[(Int, Int)] = Array((1,1), (2,4))
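The same sortBy / zipWithIndex / collect trick can be checked on a plain Scala collection, since Seq supports analogous combinators (a sketch outside Spark, not cluster code):

```scala
// Plain-collection analogue of the RDD-only top-N trick above
val data = Seq((5, 25), (4, 16), (3, 9), (2, 4), (1, 1))

// sort ascending, index each element, keep only the first two
val top2 = data.sortBy(_._1).zipWithIndex.collect { case (pair, idx) if idx < 2 => pair }
// top2 == Seq((1, 1), (2, 4))
```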

Related

length of each word in array by using scala

I have data like the sample below; the array contains the individual words:
scala> val x=rdd.flatMap(_.split(" "))
x: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[9] at flatMap at <console>:26
scala> x.collect
res41: Array[String] = Array(Roses, are, red, Violets, are, blue)
I want to find the length of each word in the array in Scala.
Spark lets you chain the functions defined on an RDD[T], which is RDD[String] in your case. You can add a map call after your flatMap to get the lengths.
val sentence: String = "Apache Spark is a cluster compute engine"
val sentenceRDD = sc.makeRDD(List(sentence))
val wordLength = sentenceRDD.flatMap(_.split(" ")).map(_.length)
wordLength.take(2)
For instance, using your value x for the demonstration, we can do something like this to find the length of each word in the array:
>x.map(s => s -> s.length)
Since x is an RDD here, add collect() to materialise and print the result:
>x.map(s => s -> s.length).collect()
This will print out the following:
Array[(String, Int)] = Array((Roses,5), (are,3), (red,3), (Violets,7), (are,3), (blue,4))
If you want only the length of each word then use this:
>x.map(_.length).collect()
Output:
Array(5,3,3,7,3,4)
With a plain Scala array you can simply write:
val a = Array("Roses", "are", "red", "Violets", "are", "blue")
var b = a.map(x => x.length)
This will give you Array[Int] = Array(5, 3, 3, 7, 3, 4)

Convert List of key-value pairs in each row of RDD to single key-value in each row

I have an RDD as
List((a,b),(b,c))
List((d,e))
How can I get it as
(a,b)
(b,c)
(d,e)
I have tried RDD.flatMap(x => x), but this doesn't work because each row is a list of key-value pairs and not just a list of values.
rdd.flatMap(identity) will convert RDD[List[(String, String)]] to RDD[(String, String)].
scala> val rdd = sc.parallelize(List(List(("a","b"),("b","c")), List(("d","e"))))
...
rdd: org.apache.spark.rdd.RDD[List[(String, String)]] = ParallelCollectionRDD[2] at parallelize at <console>:13
scala> rdd.flatMap(identity)
res2: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[3] at flatMap at <console>:14
scala> res2.collect()
...
res3: Array[(String, String)] = Array((a,b), (b,c), (d,e))
This works for any RDD[List[T]], regardless of the shape of T.
I could help more if you shared a bit more about what you are trying to do.
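The equivalence is easy to check on a plain Scala collection, where flatMap(identity) does the same thing as flatten (a sketch outside Spark):

```scala
// Nested pairs, mirroring RDD[List[(String, String)]]
val nested = List(List(("a", "b"), ("b", "c")), List(("d", "e")))

// flatMap(identity) flattens one level of nesting, same as nested.flatten
val flat = nested.flatMap(identity)
// flat == List(("a","b"), ("b","c"), ("d","e"))
```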

Accessing a specific element of an Array RDD in apache-spark scala

I have an RDD containing an array of key-value pairs. I want to get the element with a given key (say 4).
scala> val a = sc.parallelize(List("dog","tiger","lion","cat","spider","eagle"),2)
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:27
scala> val b = a.keyBy(_.length)
b: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[1] at keyBy at <console>:29
I have tried to apply filter on it, but I am getting an error.
scala> val c = b.filter(p => p(0) = 4);
<console>:31: error: value update is not a member of (Int, String)
val c = b.filter(p => p(0) = 4);
I want to print the key-value pair with a specific key (say 4) as Array((4,lion)).
The data always comes in the form of an array of key-value pairs.
Use p._1 instead of p(0): tuple fields are accessed with _1 and _2, and equality is tested with == (not =).
val rdd = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider", "eagle"), 1)
val kvRdd: RDD[(Int, String)] = rdd.keyBy(_.length)
val filterRdd: RDD[(Int, String)] = kvRdd.filter(p => p._1 == 4)
//display rdd
println(filterRdd.collect().toList)
List((4,lion))
There's a lookup method applicable to RDDs of Key-Value pairs (RDDs of type RDD[(K,V)]) that directly offers this functionality.
b.lookup(4)
// res4: Seq[String] = WrappedArray(lion)
b.lookup(5)
// res6: Seq[String] = WrappedArray(tiger, eagle)
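What lookup(key) does can be sketched on a plain Scala collection: collect the values of every pair whose key matches (the helper below is illustrative, not Spark API):

```scala
val animals = Seq("dog", "tiger", "lion", "cat", "spider", "eagle")
val keyed = animals.map(w => (w.length, w)) // mirrors keyBy(_.length)

// Hypothetical plain-collection version of RDD.lookup
def lookup[K, V](pairs: Seq[(K, V)], key: K): Seq[V] =
  pairs.collect { case (k, v) if k == key => v }

// lookup(keyed, 4) == Seq("lion")
// lookup(keyed, 5) == Seq("tiger", "eagle")
```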

SubtractByKey and keep rejected values

I was playing around with Spark and I am stuck on something that seems foolish.
Let's say we have two RDD:
rdd1 = {(1, 2), (3, 4), (3, 6)}
rdd2 = {(3, 9)}
If I do rdd1.subtractByKey(rdd2), I get {(1, 2)}, which is perfectly fine. But I also want to save the rejected values {(3,4),(3,6)} to another RDD. Is there a prebuilt function in Spark or an elegant way to do this?
Please keep in mind that I am new to Spark; any help will be appreciated, thanks.
As Rohan suggests, there is no (to the best of my knowledge) standard API call to do this. What you want to do can be expressed as Union - Intersection.
Here is how you can do this in Spark:
val r1 = sc.parallelize(Seq((1,2), (3,4), (3,6)))
val r2 = sc.parallelize(Seq((3,9)))
val intersection = r1.map(_._1).intersection(r2.map(_._1))
val union = r1.map(_._1).union(r2.map(_._1))
val diff = union.subtract(intersection)
diff.collect()
> Array[Int] = Array(1)
To get the actual pairs:
val d = diff.collect()
r1.union(r2).filter(x => d.contains(x._1)).collect
I think this is slightly more elegant:
val r1 = sc.parallelize(Seq((1,2), (3,4), (3,6)))
val r2 = sc.parallelize(Seq((3,9)))
val r3 = r1.leftOuterJoin(r2)
val subtracted = r3.filter(_._2._2.isEmpty).map(x=>(x._1, x._2._1))
val discarded = r3.filter(_._2._2.nonEmpty).map(x=>(x._1, x._2._1))
//subtracted: (1,2)
//discarded: (3,4)(3,6)
The insight is noticing that leftOuterJoin produces both the discarded (== records with a matching key in r2) and remaining (no matching key) in one go.
It's a pity Spark doesn't have RDD.partition (in the Scala-collections sense of splitting a collection into two depending on a predicate), or we could calculate subtracted and discarded in one pass.
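On plain Scala collections, that one-pass split does exist: partition traverses once and returns both halves. A sketch of the same key-based split outside Spark (not Spark API):

```scala
val r1 = Seq((1, 2), (3, 4), (3, 6))
val r2 = Seq((3, 9))
val r2Keys = r2.map(_._1).toSet

// partition splits one traversal into (predicate holds, predicate fails)
val (subtracted, discarded) = r1.partition { case (k, _) => !r2Keys(k) }
// subtracted == Seq((1, 2)); discarded == Seq((3, 4), (3, 6))
```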
You can try
val rdd3 = rdd1.subtractByKey(rdd2)
val rdd4 = rdd1.subtractByKey(rdd3)
But you won't be keeping the values, just running another subtraction.
Unfortunately, I don't think there's an easy way to keep the rejected values using subtractByKey(). I think one way you get your desired result is through cogrouping and filtering. Something like:
val cogrouped = rdd1.cogroup(rdd2, numPartitions)
def flatFunc[A, B](key: A, values: Iterable[B]) : Iterable[(A, B)] = for {value <- values} yield (key, value)
val res1 = cogrouped.filter(_._2._2.isEmpty).flatMap { case (key, values) => flatFunc(key, values._1) }
val res2 = cogrouped.filter(_._2._2.nonEmpty).flatMap { case (key, values) => flatFunc(key, values._1) }
You might be able to borrow the work done here to make the last two lines look more elegant.
When I run this on your example, I see:
scala> val rdd1 = sc.parallelize(Array((1, 2), (3, 4), (3, 6)))
scala> val rdd2 = sc.parallelize(Array((3, 9)))
scala> val cogrouped = rdd1.cogroup(rdd2)
scala> def flatFunc[A, B](key: A, values: Iterable[B]) : Iterable[(A, B)] = for {value <- values} yield (key, value)
scala> val res1 = cogrouped.filter(_._2._2.isEmpty).flatMap { case (key, values) => flatFunc(key, values._1) }
scala> val res2 = cogrouped.filter(_._2._2.nonEmpty).flatMap { case (key, values) => flatFunc(key, values._1) }
scala> res1.collect()
...
res7: Array[(Int, Int)] = Array((1,2))
scala> res2.collect()
...
res8: Array[(Int, Int)] = Array((3,4), (3,6))
First use subtractByKey() and then subtract:
val rdd1 = spark.sparkContext.parallelize(Seq((1,2), (3,4), (3,5)))
val rdd2 = spark.sparkContext.parallelize(Seq((3,10)))
val result = rdd1.subtractByKey(rdd2)
result.foreach(print) // (1,2)
val rejected = rdd1.subtract(result)
rejected.foreach(print) // (3,5)(3,4)

issue while creating a pair RDD in spark from text file and applying reduceByKey

To run some of the simple Spark transformations given in Learning Spark, I need to create a pair RDD
(example: {(1, 2), (3, 4), (3, 6)})
What is the best way to create this so I can use groupByKey() etc. on it? I tried putting the pairs in a file and reading it with the code below, but somehow this doesn't work.
Text file content
1 2
3 4
3 6
Code
val lines = sc.textFile("path_to_file")
val pairs = lines.map(x => (x.split(" ")(0), x))
pairs.foreach(println)
It prints as below
scala> pairs.foreach(println)
(1,1 2)
(3,3 4)
(3,3 6)
While I want it as
1 2
3 4
3 6
Is there an easier way to do this in Scala?
Split the text file content by index for both key and value to generate a pair RDD.
val pairs = lines.map(x => (x.split(" ")(0), x.split(" ")(1)))
Try this:
scala> val pairsRDD = lines.flatMap { x =>
x.split("""\s+""") match {
case Array(a,b) => Some((a,b))
case _ => None
}
}
pairsRDD: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[21] at flatMap at <console>:23
scala> val pairs = pairsRDD.collect
pairs: Array[(String, String)] = Array((1,2), (3,4), (3,6))
scala> pairs foreach println
(1,2)
(3,4)
(3,6)
NOTE: If you want the values to be numeric instead of String, just add a type conversion (.toInt, .toDouble, etc.).
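The split-and-match logic can be tried on plain strings before pointing it at an RDD (a sketch with the .toInt conversion added, as the note suggests):

```scala
val lines = Seq("1 2", "3 4", "3 6")

val pairs = lines.flatMap { x =>
  x.split("""\s+""") match {
    case Array(a, b) => Some((a.toInt, b.toInt)) // Some/None silently drops malformed lines
    case _           => None
  }
}
// pairs == Seq((1, 2), (3, 4), (3, 6))
```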
Thanks all for the replies. Here is the solution that worked for me:
val lines = sc.textFile("path to file ")
val pairs = lines.keyBy(line => line.split(" ")(0)).mapValues(line => line.split(" ")(1).trim.toInt)
pairs.reduceByKey((x,y) => x+y).foreach(println)
scala> pairs.reduceByKey((x,y) => x+y).foreach(println)
(3,10)
(1,2)
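For comparison, reduceByKey's summing behaviour can be mimicked on a plain Scala collection with groupBy (a sketch, not Spark API):

```scala
val pairs = Seq((1, 2), (3, 4), (3, 6))

// group by key, then sum the values in each group, mirroring reduceByKey((x, y) => x + y)
val sums = pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }
// sums == Map(1 -> 2, 3 -> 10)
```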
You can use the following
val pairs = lines.flatMap(x => x.split("\n") )
Good luck!