How do I iterate over RDDs in Apache Spark (Scala)?

I use the following command to fill an RDD with a bunch of arrays, each containing 2 strings: ["filename", "content"].
Now I want to iterate over each of those occurrences to do something with every filename and its content.
val someRDD = sc.wholeTextFiles("hdfs://localhost:8020/user/cloudera/*")
I can't seem to find any documentation on how to do this however.
So what I want is this:
foreach occurrence-in-the-rdd{
//do stuff with the array found at location n of the RDD
}

You call various methods on the RDD that accept functions as parameters.
// set up an example -- an RDD of arrays
val sparkConf = new SparkConf().setMaster("local").setAppName("Example")
val sc = new SparkContext(sparkConf)
val testData = Array(Array(1,2,3), Array(4,5,6,7,8))
val testRDD = sc.parallelize(testData, 2)
// Print the size of each array in the RDD.
testRDD.collect().foreach(a => println(a.size))
// Use map() to create an RDD with the array sizes.
val countRDD = testRDD.map(a => a.size)
// Print the elements of this new RDD.
countRDD.collect().foreach(a => println(a))
// Use filter() to create an RDD with just the longer arrays.
val bigRDD = testRDD.filter(a => a.size > 3)
// Print each remaining array.
bigRDD.collect().foreach(a => {
  a.foreach(e => print(e + " "))
  println()
})
Notice that the functions you write accept a single RDD element as input, and return data of some uniform type, so you create an RDD of the latter type. For example, countRDD is an RDD[Int], while bigRDD is still an RDD[Array[Int]].
It will probably be tempting at some point to write a foreach that modifies some other data, but you should resist for reasons described in this question and answer.
Edit: Don't try to print large RDDs
Several readers have asked about using collect() and println() to see their results, as in the example above. Of course, this only works if you're running in an interactive mode like the Spark REPL (read-eval-print-loop.) It's best to call collect() on the RDD to get a sequential array for orderly printing. But collect() may bring back too much data and in any case too much may be printed. Here are some alternative ways to get insight into your RDDs if they're large:
RDD.take(): This gives you fine control on the number of elements you get but not where they came from -- defined as the "first" ones which is a concept dealt with by various other questions and answers here.
// take() returns an Array so no need to collect()
myHugeRDD.take(20).foreach(a => println(a))
RDD.sample(): This lets you (roughly) control the fraction of results you get, whether sampling uses replacement, and even optionally the random number seed.
// sample() does return an RDD so you may still want to collect()
myHugeRDD.sample(true, 0.01).collect().foreach(a => println(a))
RDD.takeSample(): This is a hybrid: using random sampling that you can control, but both letting you specify the exact number of results and returning an Array.
// takeSample() returns an Array so no need to collect()
myHugeRDD.takeSample(true, 20).foreach(a => println(a))
RDD.count(): Sometimes the best insight comes from how many elements you ended up with -- I often do this first.
println(myHugeRDD.count())

The fundamental operations in Spark are map and filter.
val txtRDD = someRDD filter { case(id, content) => id.endsWith(".txt") }
txtRDD will now contain only the files that have the extension ".txt".
And if you want to word count those files you can say
//split the documents into words in one long list
val words = txtRDD flatMap { case (id,text) => text.split("\\s+") }
// give each word a count of 1
val wordsT = words map (x => (x,1))
//sum up the counts for each word
val wordCount = wordsT reduceByKey((a, b) => a + b)
You want to use mapPartitions when you have some expensive initialization you need to perform -- for example, if you want to do Named Entity Recognition with a library like the Stanford coreNLP tools.
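As a rough illustration of that point, here is a minimal sketch of the kind of function you could hand to mapPartitions. ExpensiveTagger is a hypothetical stand-in for a costly resource such as an NLP model, and processPartition is an invented name; neither comes from Spark or from the question. The function builds the resource once and then reuses it for every record in the partition.

```scala
// Hypothetical stand-in for an expensive resource (e.g. an NLP pipeline).
// Here it only counts whitespace-separated tokens.
final class ExpensiveTagger {
  def tag(text: String): Int = text.split("\\s+").count(_.nonEmpty)
}

object MapPartitionsSketch {
  // Has the shape mapPartitions expects: an Iterator in, an Iterator out.
  def processPartition(records: Iterator[(String, String)]): Iterator[(String, Int)] = {
    val tagger = new ExpensiveTagger // built once per partition, not once per record
    records.map { case (id, text) => (id, tagger.tag(text)) }
  }
}
```

Assuming txtRDD is the RDD[(String, String)] from the snippet above, it would be used as txtRDD.mapPartitions(MapPartitionsSketch.processPartition). Because the function only deals with iterators, it can also be tried out on a plain Scala Iterator without a cluster.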
Master map, filter, flatMap, and reduce, and you are well on your way to mastering Spark.

I would try making use of a partition mapping function. The code below shows how an entire RDD dataset can be processed in a loop so that each input goes through the very same function. I am afraid I have no knowledge about Scala, so everything I have to offer is java code. However, it should not be very difficult to translate it into scala.
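import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;

JavaRDD<String> res = file.mapPartitions(new FlatMapFunction<Iterator<String>, String>() {
    @Override
    public Iterable<String> call(Iterator<String> t) throws Exception {
        // results for this partition
        List<String> tmpRes = new ArrayList<>();
        while (t.hasNext()) {
            String record = t.next(); // advance the iterator, otherwise the loop never terminates
            tmpRes.add(record);       // replace with whatever processing each input needs
        }
        // Spark 1.x signature; in Spark 2.x call() returns an Iterator<String> instead
        return tmpRes;
    }
}).cache();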

What wholeTextFiles returns is a pair RDD:
def wholeTextFiles(path: String, minPartitions: Int): RDD[(String, String)]
Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
Here is an example of reading the files at a local path then printing every filename and content.
val conf = new SparkConf().setAppName("scala-test").setMaster("local")
val sc = new SparkContext(conf)
sc.wholeTextFiles("file:///Users/leon/Documents/test/")
.collect
.foreach(t => println(t._1 + ":" + t._2));
the result:
file:/Users/leon/Documents/test/1.txt:{"name":"tom","age":12}
file:/Users/leon/Documents/test/2.txt:{"name":"john","age":22}
file:/Users/leon/Documents/test/3.txt:{"name":"leon","age":18}
or map the pair RDD to an RDD of file contents first
sc.wholeTextFiles("file:///Users/leon/Documents/test/")
.map(t => t._2)
.collect
.foreach { x => println(x)}
the result:
{"name":"tom","age":12}
{"name":"john","age":22}
{"name":"leon","age":18}
And I think wholeTextFiles is better suited to small files, since each file is read as a single record.

for (element <- YourRDD)
{
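// do what you want with element in each iteration; this loop desugars to YourRDD.foreach, so the body runs on the executors. If you need the index of each element, prefer zipWithIndex over a counter variable, because a counter updated here is not reliably visible to the driver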
println (element._1) // this will print all filenames
}

Related

Spark: Transforming JSON files to correct format

I've 100+ million records stored in files with the following JSON structure (real data has way more columns, rows and is also nested)
{"id":"2-2-3","key":"value"}{"id":"2-2-3","key":"value"}{"id":"2-2-3","key":"value"}{"id":"2-2-3","key":"value"}{"id":"2-2-3","key":"value"}
The sqlContext.read.json function is unable to parse this since the records aren't on multiple lines but on one big line. The solution below solves this problem, but is a big performance killer. What would be the best way, performance wise, to handle this issue in Apache Spark?
val rdd = sc.wholeTextFiles("s3://some-bucket/**/*")
val validJSON = rdd.flatMap(_._2.replace("}{", "}\n{").split("\n"))
val df = sqlContext.read.json(validJSON)
df.count()
df.select("id").show()
This is a riff on Antot's answer, which should handle nested JSON
val (_, lastChars, objects) = input.toVector
  .foldLeft((false, Vector.empty[Char], Vector.empty[String])) {
    case ((true, charAccum, strAccum), '{') => (false, Vector('{'), strAccum :+ charAccum.mkString)
    case ((_, charAccum, strAccum), '}') => (true, charAccum :+ '}', strAccum)
    case ((_, charAccum, strAccum), char) => (false, charAccum :+ char, strAccum)
  }
// the last object is still sitting in the character accumulator, so append it
val result = if (lastChars.isEmpty) objects else objects :+ lastChars.mkString
Basically what it does is split the data into a Vector[Char], and use foldLeft to aggregate the input into substrings. The trick is to keep track of just enough information about the previous character to figure out if a { marks the start of a new object. The fold only emits an object when it sees the start of the next one, so the final object has to be appended after the fold.
I used this input to test it (basically the OP's sample input, with a nested object thrown in):
val input = """{"id":"2-2-3","key":{ "test": "value"}}{"id":"2-2-3","key":"value"}{"id":"2-2-3","key":"value"}{"id":"2-2-3","key":"value"}{"id":"2-2-3","key":"value"}"""
And got this result, with all five objects:
Vector({"id":"2-2-3","key":{ "test": "value"}},
{"id":"2-2-3","key":"value"},
{"id":"2-2-3","key":"value"},
{"id":"2-2-3","key":"value"},
{"id":"2-2-3","key":"value"})
The problem with the original approach is the call _._2.replace("}{", "}\n{"), which creates another huge string from the input one, with newline characters inserted, which is then split once again into an array.
An improvement is possible by minimizing the creation of intermediate strings and retrieving the target ones as soon as possible. For this, we can play a bit with substrings:
import scala.collection.mutable.ListBuffer
val validJson = rdd.flatMap(rawJson => {
// functions extracted to make it more readable.
def nextObjectStartIndex(fromIndex: Int):Int = rawJson._2.indexOf('{', fromIndex)
def currObjectEndIndex(fromIndex: Int): Int = rawJson._2.indexOf('}', fromIndex)
def extractObject(fromIndex: Int, toIndex: Int): String = rawJson._2.substring(fromIndex, toIndex + 1)
// the resulting strings are put in a local buffer
val buffer = new ListBuffer[String]()
// init the scanning of the input string
var posStartNextObject = nextObjectStartIndex(0)
// main loop terminates when there are no more '{' chars
while (posStartNextObject != -1) {
val posEndObject = currObjectEndIndex(posStartNextObject)
val extractedObject = extractObject(posStartNextObject, posEndObject)
posStartNextObject = nextObjectStartIndex(posEndObject)
buffer += extractedObject
}
buffer
})
Please note that this approach would work only if the objects in the input JSON are not nested, supposing that all curly braces separate objects of same level.
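If nesting does matter, one possible variation is to track brace depth instead of pairing each '{' with the next '}'. The sketch below is only an illustration under that assumption: JsonSplitter and splitObjects are invented names, and the code also skips braces that appear inside string values. It does not validate the JSON.

```scala
import scala.collection.mutable.ListBuffer

object JsonSplitter {
  /** Splits concatenated top-level JSON objects by tracking brace depth. */
  def splitObjects(raw: String): List[String] = {
    val out = ListBuffer.empty[String]
    var depth = 0         // current nesting level of braces
    var start = -1        // index where the current top-level object began
    var inString = false  // true while scanning inside a JSON string literal
    var escaped = false   // true right after a backslash inside a string
    var i = 0
    while (i < raw.length) {
      val c = raw.charAt(i)
      if (inString) {
        if (escaped) escaped = false
        else if (c == '\\') escaped = true
        else if (c == '"') inString = false
      } else {
        c match {
          case '"' => inString = true
          case '{' =>
            if (depth == 0) start = i
            depth += 1
          case '}' if depth > 0 =>
            depth -= 1
            if (depth == 0) {
              out += raw.substring(start, i + 1) // a complete top-level object
              start = -1
            }
          case _ => // any other character needs no bookkeeping
        }
      }
      i += 1
    }
    out.toList
  }
}
```

With the rdd from the question, this could be plugged in as rdd.flatMap { case (_, content) => JsonSplitter.splitObjects(content) }. It still scans each file as one string, so very large files remain a memory concern.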

Serialising temp collections created in Spark executors during task execution

I'm trying to find an effective way of writing collections created inside tasks to the output files of the job. For example, if we iterate over an RDD using foreach, we can create data structures that are local to the executor, e.g. the ListBuffer arr in the following code snippet. My problem is: how do I serialise arr and write it to a file?
(1) Should I use FileWriter api or Spark saveAsTextFile will work?
(2) What will be the advantages of using one over the other
(3) Is there a better way of achieving the same.
PS: The reason I am using foreach instead of map is because I might not be able to transform all my RDD rows and I want to avoid getting Null values in the output.
val dataSorted: RDD[(Int, Int)] = <Some Operation>
val arr = ListBuffer[(String, String)]()
dataSorted.foreach {
case (e, r) => {
if(e.id > 1000) {
arr += (("a", "b"))
}
}
}
Thanks,
Devj
You should not mutate the driver's variables from executors; use Accumulators instead. There are articles about them with code examples here and here, and this question may also be helpful, since it has a simplified code example of a custom AccumulatorParam.
Write your own accumulator that is able to add (String, String), or use the built-in CollectionAccumulator. It is an implementation of AccumulatorV2, the new version of accumulators introduced in Spark 2.
Another way is to use Spark's built-in filter and map functions. Thanks to @ImDarrenG for suggesting flatMap, but I think filter and map will be easier:
val result : Array[(String, String)] = someRDD
.filter(x => x._1 > 1000) // filter only good rows
.map (x => ("a", "b"))
.collect() // convert to array
The Spark API saves you some file handling code but essentially achieves the same thing.
The exception is if you are not using, say, HDFS and do not want your output file to be partitioned (spread across the executors file systems). In this case you will need to collect the data to the driver and use FileWriter to write to a single file, or files, and how you achieve that will depend on how much data you have. If you have more data than driver has memory you will need to handle it differently.
As mentioned in another answer, you're creating an array in your driver, while adding items from your executors, which will not work in a cluster environment. Something like this might be a better way to map your data and handle nulls:
val outputRDD = dataSorted.flatMap {
case (e, r) => {
if(e.id > 1000) {
Some(("a", "b"))
} else {
None
}
}
}
// save outputRDD to file/s here using the appropriate method...
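To see why returning an Option from flatMap avoids nulls, here is a small sketch on a plain Scala Seq rather than an RDD. The sample rows and the object name FilterAndMapSketch are made up for illustration. RDD offers flatMap and a collect overload that takes a partial function with the same shape, so the idea should carry over.

```scala
object FilterAndMapSketch {
  // Hypothetical sample standing in for the rows of dataSorted.
  val rows: Seq[(Int, Int)] = Seq((1500, 1), (20, 2), (3000, 3))

  // flatMap with Option: None results vanish, Some results are unwrapped.
  val viaFlatMap: Seq[(String, String)] = rows.flatMap {
    case (e, r) => if (e > 1000) Some(("a", r.toString)) else None
  }

  // collect with a guarded partial function: filter and map in one step.
  val viaCollect: Seq[(String, String)] = rows.collect {
    case (e, r) if e > 1000 => ("a", r.toString)
  }
}
```

Both values contain only the rows that passed the check, so there is nothing null to clean up before saving.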

Extracting data from RDD in Scala/Spark

So I have a large dataset that is a sample of a stackoverflow userbase. One line from this dataset is as follows:
<row Id="42" Reputation="11849" CreationDate="2008-08-01T13:00:11.640" DisplayName="Coincoin" LastAccessDate="2014-01-18T20:32:32.443" WebsiteUrl="" Location="Montreal, Canada" AboutMe="A guy with the attention span of a dead goldfish who has been having a blast in the industry for more than 10 years.
Mostly specialized in game and graphics programming, from custom software 3D renderers to accelerated hardware pipeline programming." Views="648" UpVotes="337" DownVotes="40" Age="35" AccountId="33" />
I would like to extract the number from Reputation, in this case "11849", and the number from Age, in this example "35". I would like to have them as floats.
The file is located in HDFS, so I read it as an RDD of lines.
val linesWithAge = lines.filter(line => line.contains("Age=")) //This is filtering data which doesnt have age
val repSplit = linesWithAge.flatMap(line => line.split("\"")) //Here I am trying to split the data where there is a "
When I split a line on quotation marks, the reputation is at index 3 and the age at index 23, but how do I assign these to a map or a variable so I can use them as floats?
Also I need it to do this for every line on the RDD.
EDIT:
val linesWithAge = lines.filter(line => line.contains("Age=")) //transformations from the original input data
val repSplit = linesWithAge.flatMap(line => line.split("\""))
val withIndex = repSplit.zipWithIndex
val indexKey = withIndex.map{case (k,v) => (v,k)}
val b = indexKey.lookup(3)
println(b)
So I added an index to the array and I've now managed to assign it to a variable, but I can only do it for one item in the RDD. Does anyone know how I could do it for all items?
What we want to do is to transform each element in the original dataset (represented as an RDD) into a tuple containing (Reputation, Age) as numeric values.
One possible approach is to transform each element of the RDD using String operations in order to extract the values of the elements "Age" and "Reputation", like this:
// define a function to extract the value of an element, given the name
def findElement(src: Array[String], name:String):Option[String] = {
for {
entry <- src.find(_.startsWith(name))
value <- entry.split("\"").lift(1)
} yield value
}
We then use that function to extract the interesting values from every record:
val reputationByAge = lines.flatMap{line =>
val elements = line.split(" ")
for {
age <- findElement(elements, "Age")
rep <- findElement(elements, "Reputation")
} yield (rep.toInt, age.toInt)
}
Note how we don't need to filter on "Age" before doing this. If we process a record that does not have "Age" or "Reputation", findElement will return None. Hence the result of the for-comprehension will be None and the record will be dropped by the flatMap operation.
A better way to approach this problem is by realizing that we are dealing with structured XML data. Scala supports XML through the scala.xml library (bundled with older Scala versions, a separate scala-xml module since Scala 2.11), so we can do this:
import scala.xml.XML
import scala.xml.XML._
// help function to map Strings to Option where empty strings become None
def emptyStrToNone(str:String):Option[String] = if (str.isEmpty) None else Some(str)
val xmlReputationByAge = lines.flatMap{line =>
val record = XML.loadString(line)
for {
rep <- emptyStrToNone((record \ "@Reputation").text)
age <- emptyStrToNone((record \ "@Age").text)
} yield (rep.toInt, age.toInt)
}
This method relies on the structure of the XML record to extract the right attributes. As before, we use the combination of Option values and flatMap to remove records that do not contain all the information we require.
First, you need a function which extracts the value for a given key of your line and converts it to the type you want (getValueForKeyAs[T]), then do:
val rdd = linesWithAge.map(line => (getValueForKeyAs[Float](line, "Age")(_.toFloat), getValueForKeyAs[Float](line, "Reputation")(_.toFloat)))
This should give you an rdd of type RDD[(Float,Float)]
getValueForKeyAs could be implemented like this:
// asInstanceOf cannot turn a String into a Float, so the caller passes an explicit conversion
def getValueForKeyAs[A](line: String, key: String)(convert: String => A): A = {
  val res = line.split(key + "=")
  if (res.size == 1) throw new RuntimeException(s"no value for key $key")
  val value = res(1).split("\"")(1)
  convert(value)
}
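Another option, sketched here under the assumption that every attribute has the form name="value" on a single line, is to parse all attributes with a regular expression and look up the two you need. RowAttributes, parse and reputationAndAge are invented names for this illustration, not part of Spark or of the answers above.

```scala
import scala.util.Try

object RowAttributes {
  // Matches name="value" pairs; assumes values contain no double quotes.
  private val Attr = "(\\w+)=\"([^\"]*)\"".r

  /** Parses every attribute of a <row ... /> line into a Map. */
  def parse(line: String): Map[String, String] =
    Attr.findAllMatchIn(line).map(m => m.group(1) -> m.group(2)).toMap

  /** Returns (Reputation, Age) as floats, or None if either is missing or not numeric. */
  def reputationAndAge(line: String): Option[(Float, Float)] = {
    val attrs = parse(line)
    for {
      rep <- attrs.get("Reputation").flatMap(s => Try(s.toFloat).toOption)
      age <- attrs.get("Age").flatMap(s => Try(s.toFloat).toOption)
    } yield (rep, age)
  }
}
```

On the RDD this would be lines.flatMap(line => RowAttributes.reputationAndAge(line)), which drops incomplete records in the same way as the Option-based answers above. Records whose AboutMe text spans several lines would need to be rejoined first, since textFile yields one element per line.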

How to add line number into each line?

suppose these are my data:
‘Maps‘ and ‘Reduces‘ are two phases of solving a query in HDFS.
‘Map’ is responsible to read data from input location.
it will generate a key value pair.
that is, an intermediate output in local machine.
’Reducer’ is responsible to process the intermediate.
output received from the mapper and generate the final output.
and I want to add a number to every line, like the output below:
1,‘Maps‘ and ‘Reduces‘ are two phases of solving a query in HDFS.
2,‘Map’ is responsible to read data from input location.
3,it will generate a key value pair.
4,that is, an intermediate output in local machine.
5,’Reducer’ is responsible to process the intermediate.
6,output received from the mapper and generate the final output.
and then save them to a file.
I've tried:
object DS_E5 {
def main(args: Array[String]): Unit = {
var i=0
val conf = new SparkConf().setAppName("prep").setMaster("local")
val sc = new SparkContext(conf)
val sample1 = sc.textFile("data.txt")
for(sample<-sample1){
i=i+1
val ss=sample.map(l=>(i,sample))
println(ss)
}
}
}
but its output looks like this:
Vector((1,‘Maps‘ and ‘Reduces‘ are two phases of solving a query in HDFS.))
...
How can I edit my code to generate the desired output?
zipWithIndex is what you need here. It maps from RDD[T] to RDD[(T, Long)] by adding an index in the second position of the pair. The index starts at 0, so add 1 to number the lines from 1.
sample1
  .zipWithIndex()
  .map { case (line, i) => (i + 1).toString + "," + line }
or using string interpolation (see a comment by @DanielC.Sobral)
sample1
  .zipWithIndex()
  .map { case (line, i) => s"${i + 1},$line" }
By calling val sample1 = sc.textFile("data.txt") you are creating a new RDD.
If you just need the output, you can try the following code:
sample1.zipWithIndex().foreach(f => println(f._2 + ", " + f._1))
Basically, by using this code, you will do this:
Using .zipWithIndex() returns a new RDD[(T, Long)], where (T, Long) is a tuple, T is the element type of the previous RDD (java.lang.String, I believe) and Long is the index of the element in the RDD.
You have performed a transformation; now you need an action. foreach suits this case well: it applies your statement to every element of the current RDD, so we just call a quickly formatted println.

Formatting the join rdd - Apache Spark

I have two key-value pair RDDs. I join them and save the result with saveAsTextFile; here is the code:
val enKeyValuePair1 = rows_filter6.map(line => (line(8) -> (line(0),line(4),line(10),line(5),line(6),line(14),line(1),line(9),line(12),line(13),line(3),line(15),line(7),line(16),line(2),line(14))))
val enKeyValuePair = DATA.map(line => (line(0) -> (line(2),line(3))))
val final_res = enKeyValuePair1.leftOuterJoin(enKeyValuePair)
val output = final_res.saveAsTextFile("C:/out")
my output is as follows:
(534309,((17999,5161,45005,00000,XYZ,,29.95,0.00),None))
How can I get rid of all the parentheses?
I want my output as follows:
534309,17999,5161,45005,00000,XYZ,,29.95,0.00,None
When outputting to a text file, Spark will just use the toString representation of each element in the RDD. If you want control over the format, you can do one last transform of the data to a String before the call to saveAsTextFile.
Luckily the tuples that arise from using the Spark API can be pulled apart using destructuring. In your example I'd do:
val final_res = enKeyValuePair1.leftOuterJoin(enKeyValuePair)
val formatted = final_res.map { tuple =>
val (f1,((f2,f3,f4,f5,f6,f7,f8,f9),f10)) = tuple
Seq(f1, f2, f3, f4, f5, f6, f7, f8, f9, f10).mkString(",")
}
formatted.saveAsTextFile("C:/out")
The first val line will take the tuple that is passed into the map function and assign the components to the values on the left. The second line creates a temporary Seq with the fields in the order you want displayed and then invokes mkString(",") to join the fields using a comma.
In cases with fewer fields, or when you're just hacking away at a problem in the REPL, a slight alternative to the above is to use pattern matching in the partial function passed to map.
simpleJoinedRdd.map { case (key,(left,right)) => s"$key,$left,$right" }
While that does allow you to make it a single-line expression, it can throw exceptions if the data in the RDD don't match the pattern provided, as opposed to the earlier example where the compiler will complain if the tuple parameter cannot be destructured into the expected form.
You can do something like this:
import scala.collection.JavaConversions._
val output = sc.parallelize(List((534309,((17999,5161,45005,1,"XYZ","",29.95,0.00),None))))
val result = output.map(p => p._1 +=: p._2._1.productIterator.toBuffer += p._2._2)
.map(p => com.google.common.base.Joiner.on(", ").join(p.iterator))
I used Guava to format the string, but there is probably a Scala way of doing this.
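A plain Scala way could look like the sketch below, which relies only on productIterator and mkString from the standard library. JoinFormatter, flatten and toCsv are invented names, and the choice to print a missing right side as the text None simply mirrors the output requested in the question.

```scala
object JoinFormatter {
  /** Recursively flattens nested tuples and Options into one flat list of fields. */
  def flatten(value: Any): List[Any] = value match {
    case Some(inner) => flatten(inner)   // unwrap a matched right-hand side
    case None        => List("None")     // keep a placeholder for a missing one
    case t: Product if t.productPrefix.startsWith("Tuple") =>
      t.productIterator.flatMap(flatten).toList
    case other       => List(other)
  }

  /** Joins the flattened fields with commas. */
  def toCsv(record: Any): String = flatten(record).mkString(",")
}
```

Assuming final_res is the joined RDD from the question, final_res.map(JoinFormatter.toCsv).saveAsTextFile("C:/out") would write one comma-separated line per record. Matching on Any gives up compile-time checking, so the destructuring approach above is safer when the tuple shape is fixed.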
Do a flatMap before saving. Or, you can write a simple format function and use it in map.
Adding a bit of code, just to show how it can be done. The function formatOnDemand can be anything.
def formatOnDemand(t):
    # flatten (key, ((fields...), right)) into a single list
    out = []
    out.append(t[0])
    for tok in t[1][0]:
        out.append(tok)
    out.append(t[1][1])
    return out

test = sc.parallelize([(534309,((17999,5161,45005,00000,"XYZ","",29.95,0.00),None))])
print(test.collect())
print(test.map(formatOnDemand).collect())
>>>
[(534309, ((17999, 5161, 45005, 0, 'XYZ', '', 29.95, 0.0), None))]
[[534309, 17999, 5161, 45005, 0, 'XYZ', '', 29.95, 0.0, None]]