Spark RDD pipe value from tuple - scala

I have a Spark RDD where each element is a tuple in the form (key, input). I would like to use the pipe method to pass the inputs to an external executable and generate a new RDD of the form (key, output). I need the keys for correlation later.
Here is an example using the spark-shell:
val data = sc.parallelize(
Seq(
("file1", "one"),
("file2", "two two"),
("file3", "three three three")))
// Incorrectly processes the data (calls toString() on each tuple)
data.pipe("wc")
// Loses the keys, generates extraneous results
data.map( elem => elem._2 ).pipe("wc")
Thanks in advance.

The solution with map is not correct as map is not guarantee to preserve partitioning so using zip after will fail. You need to use mapValues to preserve the partition of the initial RDD.
data.zip(
data.mapValues{ _.toString }.pipe("my_executable")
).map { case ((key, input), output) =>
(key, output)
}

Considering you cannot pass label in/out of executable, this might work:
rdd
.map(x => x._1)
.zip(rdd
.map(x => x._2)
.pipe("my executable"))
Please, be aware, that this can be fragile, and will definitely break if your executable not produces exactly single line on each input record.

Related

read a file in scala and get key value pairs as Map[String, List[String]]

i am reading a file and getting the records as a Map[String, List[String]] in spark-scala. similar thing i want to achieve in pure scala form without any spark references(not reading an rdd). what should i change to make it work in a pure scala way
rdd
.filter(x => (x != null) && (x.length > 0))
.zipWithIndex()
.map {
case (line, index) =>
val array = line.split("~").map(_.trim)
(array(0), array(1), index)
}
.groupBy(_._1)
.mapValues(x => x.toList.sortBy(_._3).map(_._2))
.collect
.toMap
For the most part it will remain the same except for the groupBy part in rdd. Scala List also has the map, filter, reduce etc. methods. So they can be used in almost a similar fashion.
val lines = Source.fromFile('filename.txt').getLines.toList
Once the file is read and stored in List, the methods can be applied to it.
For the groupBy part, one simple approach can be to sort the tuples on the key. That will effectively cluster the tuples with same keys together.
val grouped = scala.util.Sorting.stablesort(arr, (e1: String, e2: String, e3: String)
=> e1._1 < e2._2)
There can be better solutions definitely, but this would effectively do the same task.
I came up with the below approach
Source.fromInputStream(
getClass.getResourceAsStream(filePath)).getLines.filter(
lines =>(lines != null) && (lines.length > 0)).map(_.split("~")).toList.groupBy(_(0)).map{ case (key, values) => (key, values.map(_(1))) }

How to convert (key,array(value)) to (key,value) in Spark

I have a RDD like below:
val rdd1 = sc.parallelize(Array((1,Array((3,4),(4,5))),(2,Array((4,2),(4,4),(3,9)))))
which is RDD[(Int,Array[(Int,Int)])] I want to get the result like RDD[(Int,(Int,Int)] by some operations such as flatMap or else. In this example, the result should be:
(1,(3,4))
(1,(4,5))
(2,(4,2))
(2,(4,4))
(2,(3,9))
I am quite new to spark, so what could I do to achieve this?
Thanks a lot.
you can use flatMap in your case like this :
val newRDD: RDD[(Int, (Int, Int))] = rdd1
.flatMap { case (k, values) => values.map(v => (k, v))}
Assume that as RDD as rd. Use below code to get the data as you want
rdd1.flatMap(x => x._2.map(y => (x._1,y)))
Internal map method in flatmap read x._2 which is array and read each value of array at a time as y. After that flat map will give them as separate items. x._1 is the first value in the RDD.

How to fill Scala Seq of Sets with unique values from Spark RDD?

I'm working with Spark and Scala. I have an RDD of Array[String] that I'm going to iterate through. The RDD contains values for attributes like (name, age, work, ...). I'm using a Sequence of mutable Sets of Strings (called attributes) to collect all unique values for each attribute.
Think of the RDD as something like this:
("name1","21","JobA")
("name2","21","JobB")
("name3","22","JobA")
In the end I want something like this:
attributes = (("name1","name2","name3"),("21","22"),("JobA","JobB"))
I have the following code:
val someLength = 10
val attributes = Seq.fill[mutable.Set[String]](someLength)(mutable.Set())
val splitLines = rdd.map(line => line.split("\t"))
lines.foreach(line => {
for {(value, index) <- line.zipWithIndex} {
attributes(index).add(value)
// #1
}
})
// #2
When I debug and stop at the line marked with #1, everything is fine, attributes is correctly filled with unique values.
But after the loop, at line #2, attributes is empty again. Looking into it shows, that attributes is a sequence of sets, that are all of size 0.
Seq()
Seq()
...
What am I doing wrong? Is there some kind of scoping going on, that I'm not aware of?
The answer lies in the fact that Spark is a distributed engine. I will give you a rough idea of the problem that you are facing. Here the elements in each RDD are bucketed into Partitions and each Partition potentially lives on a different node.
When you write rdd1.foreach(f) that f is wrapped inside a closure (Which gets copies of the corresponding objects). Now, this closure is serialized and then sent to each node where it is applied for each element in that Partition.
Here, your f will get a copy of attributes in its wrapped closure and hence when f is executed, it interacts with that copy of attributes and not with attributes that you want. This results in your attributes being left out without any changes.
I hope the problem is clear now.
val yourRdd = sc.parallelize(List(
("name1","21","JobA"),
("name2","21","JobB"),
("name3","22","JobA")
))
val yourNeededRdd = yourRdd
.flatMap({ case (name, age, work) => List(("name", name), ("age", age), ("work", work)) })
.groupBy({ case (attrName, attrVal) => attrName })
.map({ case (attrName, group) => (attrName, group.toList.map(_._2).distinct })
// RDD(
// ("name", List("name1", "name2", "name3")),
// ("age", List("21", "22")),
// ("work", List("JobA", "JobB"))
// )
// Or
val distinctNamesRdd = yourRdd.map(_._1).distinct
// RDD("name1", "name2", "name3")
val distinctAgesRdd = yourRdd.map(_._2).distinct
// RDD("21", "22")
val distinctWorksRdd = yourRdd.map(_._3).distinct
// RDD("JobA", "JobB")

Flattening the key of a RDD

I have a Spark RDD of type (Array[breeze.linalg.DenseVector[Double]], breeze.linalg.DenseVector[Double]). I wish to flatten its key to transform it into a RDD of type breeze.linalg.DenseVector[Double], breeze.linalg.DenseVector[Double]). I am currently doing:
val newRDD = oldRDD.flatMap(ob => anonymousOrdering(ob))
The signature of anonymousOrdering() is String => (Array[DenseVector[Double]], DenseVector[Double]).
It returns type mismatch: required: TraversableOnce[?]. The Python code doing the same thing is:
newRDD = oldRDD.flatMap(lambda point: [(tile, point) for tile in anonymousOrdering(point)])
How to do the same thing in Scala ? I generally use flatMapValuesbut here I need to flatten the key.
If I understand your question correctly, you can do:
val newRDD = oldRDD.flatMap(ob => anonymousOrdering(ob))
// newRDD is RDD[(Array[DenseVector], DenseVector)]
In that case, you can "flatten" the Array portion of the tuple using pattern matching and a for/yield statement:
newRDD = newRDD.flatMap{case (a: Array[DenseVector[Double]], b: DenseVector[Double]) => for (v <- a) yield (v, b)}
// newRDD is RDD[(DenseVector, DenseVector)]
Although it's still not clear to me where/how you want to use groupByKey
Change the code to use Map instead of FlatMap:
val newRDD = oldRDD.map(ob => anonymousOrdering(ob)).groupByKey()
You would only want to use flatmap here if anonymousOrdering returned a list of tuples and you wanted it flattened down.
As anonymousOrdering() is a function that you have in your code, update it in order to return a Seq[(breeze.linalg.DenseVector[Double], breeze.linalg.DenseVector[Double])]. It is like doing (tile, point) for tile in anonymousOrdering(point)] but directly at the end of the anonymous function. The flatMap will then take care to create one partition for each element of the sequences.
As a general rule, avoid having a collection as a key in a RDD.

How do I iterate RDD's in apache spark (scala)

I use the following command to fill an RDD with a bunch of arrays containing 2 strings ["filename", "content"].
Now I want to iterate over every of those occurrences to do something with every filename and content.
val someRDD = sc.wholeTextFiles("hdfs://localhost:8020/user/cloudera/*")
I can't seem to find any documentation on how to do this however.
So what I want is this:
foreach occurrence-in-the-rdd{
//do stuff with the array found on loccation n of the RDD
}
You call various methods on the RDD that accept functions as parameters.
// set up an example -- an RDD of arrays
val sparkConf = new SparkConf().setMaster("local").setAppName("Example")
val sc = new SparkContext(sparkConf)
val testData = Array(Array(1,2,3), Array(4,5,6,7,8))
val testRDD = sc.parallelize(testData, 2)
// Print the RDD of arrays.
testRDD.collect().foreach(a => println(a.size))
// Use map() to create an RDD with the array sizes.
val countRDD = testRDD.map(a => a.size)
// Print the elements of this new RDD.
countRDD.collect().foreach(a => println(a))
// Use filter() to create an RDD with just the longer arrays.
val bigRDD = testRDD.filter(a => a.size > 3)
// Print each remaining array.
bigRDD.collect().foreach(a => {
a.foreach(e => print(e + " "))
println()
})
}
Notice that the functions you write accept a single RDD element as input, and return data of some uniform type, so you create an RDD of the latter type. For example, countRDD is an RDD[Int], while bigRDD is still an RDD[Array[Int]].
It will probably be tempting at some point to write a foreach that modifies some other data, but you should resist for reasons described in this question and answer.
Edit: Don't try to print large RDDs
Several readers have asked about using collect() and println() to see their results, as in the example above. Of course, this only works if you're running in an interactive mode like the Spark REPL (read-eval-print-loop.) It's best to call collect() on the RDD to get a sequential array for orderly printing. But collect() may bring back too much data and in any case too much may be printed. Here are some alternative ways to get insight into your RDDs if they're large:
RDD.take(): This gives you fine control on the number of elements you get but not where they came from -- defined as the "first" ones which is a concept dealt with by various other questions and answers here.
// take() returns an Array so no need to collect()
myHugeRDD.take(20).foreach(a => println(a))
RDD.sample(): This lets you (roughly) control the fraction of results you get, whether sampling uses replacement, and even optionally the random number seed.
// sample() does return an RDD so you may still want to collect()
myHugeRDD.sample(true, 0.01).collect().foreach(a => println(a))
RDD.takeSample(): This is a hybrid: using random sampling that you can control, but both letting you specify the exact number of results and returning an Array.
// takeSample() returns an Array so no need to collect()
myHugeRDD.takeSample(true, 20).foreach(a => println(a))
RDD.count(): Sometimes the best insight comes from how many elements you ended up with -- I often do this first.
println(myHugeRDD.count())
The fundamental operations in Spark are map and filter.
val txtRDD = someRDD filter { case(id, content) => id.endsWith(".txt") }
the txtRDD will now only contain files that have the extension ".txt"
And if you want to word count those files you can say
//split the documents into words in one long list
val words = txtRDD flatMap { case (id,text) => text.split("\\s+") }
// give each word a count of 1
val wordT = words map (x => (x,1))
//sum up the counts for each word
val wordCount = wordsT reduceByKey((a, b) => a + b)
You want to use mapPartitions when you have some expensive initialization you need to perform -- for example, if you want to do Named Entity Recognition with a library like the Stanford coreNLP tools.
Master map, filter, flatMap, and reduce, and you are well on your way to mastering Spark.
I would try making use of a partition mapping function. The code below shows how an entire RDD dataset can be processed in a loop so that each input goes through the very same function. I am afraid I have no knowledge about Scala, so everything I have to offer is java code. However, it should not be very difficult to translate it into scala.
JavaRDD<String> res = file.mapPartitions(new FlatMapFunction <Iterator<String> ,String>(){
#Override
public Iterable<String> call(Iterator <String> t) throws Exception {
ArrayList<String[]> tmpRes = new ArrayList <>();
String[] fillData = new String[2];
fillData[0] = "filename";
fillData[1] = "content";
while(t.hasNext()){
tmpRes.add(fillData);
}
return Arrays.asList(tmpRes);
}
}).cache();
what the wholeTextFiles return is a Pair RDD:
def wholeTextFiles(path: String, minPartitions: Int): RDD[(String, String)]
Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
Here is an example of reading the files at a local path then printing every filename and content.
val conf = new SparkConf().setAppName("scala-test").setMaster("local")
val sc = new SparkContext(conf)
sc.wholeTextFiles("file:///Users/leon/Documents/test/")
.collect
.foreach(t => println(t._1 + ":" + t._2));
the result:
file:/Users/leon/Documents/test/1.txt:{"name":"tom","age":12}
file:/Users/leon/Documents/test/2.txt:{"name":"john","age":22}
file:/Users/leon/Documents/test/3.txt:{"name":"leon","age":18}
or converting the Pair RDD to a RDD first
sc.wholeTextFiles("file:///Users/leon/Documents/test/")
.map(t => t._2)
.collect
.foreach { x => println(x)}
the result:
{"name":"tom","age":12}
{"name":"john","age":22}
{"name":"leon","age":18}
And I think wholeTextFiles is more compliant for small files.
for (element <- YourRDD)
{
// do what you want with element in each iteration, and if you want the index of element, simply use a counter variable in this loop beginning from 0
println (element._1) // this will print all filenames
}