Scala - how to get and print the contents of Either?

I'm processing data where some records may be corrupted, so I decided to explore the data and used Either to separate valid and invalid records.
I figured out how to count each kind of record, and I'm now getting the output for failedCount and successCount successfully.
But I have a problem with printing out each invalid (Left) sale record. What could be wrong with my approach?
I don't get any output when printing out failedSales:
def filterSales(rawSales: RDD[Sale]): RDD[(String, Sale)] = {
  val filteredSales = rawSales
    .map(sale => {
      val saleOption = Try(sale.id -> sale)
      saleOption match {
        case Success(successSale) => Right(successSale)
        case Failure(e) => Left(s"Corrupted sale: $sale;", e)
      }
    })

  val failedCount: Long = filteredSales.filter(_.isLeft).count()
  val successCount: Long = filteredSales.filter(_.isRight).count()

  println("FAILED SALES COUNT: " + failedCount)
  println("SUCCESS SALES COUNT: " + successCount)

  // Problem here
  val failedSales: RDD[Either.LeftProjection[(String, Throwable), (String, Sale)]] = filteredSales.map(_.left)

  println("FAILED SALES: ")
  // Doesn't produce any output
  failedSales.foreach(println)
}

When you call foreach(fn) on an RDD, the function fn (println in your case) is executed on the worker nodes where the RDD is distributed. So the printing is happening somewhere, just not in the driver program you're looking at.
If you have a small data set, you can collect() the RDD so the data is returned to your driver, and println it there.
If you have large data, you can saveAsTextFile() so it gets written to HDFS and you can download it from there.
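For example, a minimal sketch of both options, assuming filteredSales is the RDD[Either[...]] built above (the output path is only a placeholder):
// Small data: extract the Left values and bring them to the driver to print.
val failed = filteredSales.collect { case Left(err) => err }  // RDD of just the Left payloads
failed.collect().foreach(println)                             // collect() returns a local Array

// Large data: write them out instead and inspect the files afterwards.
failed.map(_.toString).saveAsTextFile("hdfs:///tmp/failed-sales")  // placeholder path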

Related

What part of this code will execute on Spark driver?

val formatter: DateTimeFormatter = DateTimeFormatter.ofPattern("yyyy/MM")

def getEventCountOnWeekdaysPerMonth(data: RDD[(LocalDateTime, Long)]): Array[(String, Long)] = {
  val result = data
    .filter(e => e._1.getDayOfWeek.getValue < DayOfWeek.SATURDAY.getValue)
    .map(mapDateTime2Date)
    .reduceByKey(_ + _)
    .collect()

  result
    .map(e => (e._1.format(formatter), e._2))
}

private def mapDateTime2Date(v: (LocalDateTime, Long)): (LocalDate, Long) = {
  (v._1.toLocalDate.withDayOfMonth(1), v._2)
}
In the above code piece, the data stored in "result" will be sent to the driver during execution because of collect.
Will the mapping on "result" take place on the driver, or will the executors also store "result", perform the mapping, and keep it until the next action is called?
Actually, nothing is executed here, because these are only method declarations (apart from the formatter). If you call getEventCountOnWeekdaysPerMonth, only these lines will be executed on the driver:
result
  .map(e => (e._1.format(formatter), e._2))
This is because result is a plain Scala array.
Execution happens on the driver, in which case the "next action" is not relevant. collect means the result set is on the driver, and further processing happens there. You would need makeRDD (or an equivalent such as parallelize) for the map to push processing back to the executors.
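As a sketch of the difference (sc here stands for the SparkContext, which is an assumption, and the re-distributed variant is only illustrative):
// result is a plain local Array[(LocalDate, Long)], so this map runs on the driver:
val onDriver = result.map { case (date, count) => (date.format(formatter), count) }

// To push the same map to the executors, the data would have to be re-distributed first:
val onExecutors = sc.parallelize(result).map { case (date, count) =>
  // note: DateTimeFormatter is not Serializable, so build it inside the closure
  (date.format(java.time.format.DateTimeFormatter.ofPattern("yyyy/MM")), count)
}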

Scala Spark not returning value outside loop [duplicate]

I am new to Scala and Spark and would like some help in understanding why the below code isn't producing my desired outcome.
I am comparing two tables
My desired output schema is:
case class DiscrepancyData(fieldKey:String, fieldName:String, val1:String, val2:String, valExpected:String)
When I run the below code step by step manually, I actually end up with my desired outcome, which is a List[DiscrepancyData] completely populated with my desired output. However, I must be missing something in the code below, because it returns an empty list (before this code gets called there is other code involved in reading tables from Hive, mapping, grouping, filtering, etc.):
val compareCols = Set(year, nominal, adjusted_for_inflation, average_private_nonsupervisory_wage)
val key = "year"

def compare(table: RDD[(String, Iterable[Row])]): List[DiscrepancyData] = {
  var discs: ListBuffer[DiscrepancyData] = ListBuffer()

  def compareFields(fieldOne: String, fieldTwo: String, colName: String, row1: Row, row2: Row): DiscrepancyData = {
    if (fieldOne != fieldTwo) {
      DiscrepancyData(
        row1.getAs(key).toString,     //fieldKey
        colName,                      //fieldName
        row1.getAs(colName).toString, //table1Value
        row2.getAs(colName).toString, //table2Value
        row2.getAs(colName).toString) //expectedValue
    }
    else null
  }

  def comparison() {
    for (row <- table) {
      var elem1 = row._2.head      //gets the first element in the iterable
      var elem2 = row._2.tail.head //gets the second element in the iterable

      for (col <- compareCols) {
        var value1 = elem1.getAs(col).toString
        var value2 = elem2.getAs(col).toString
        var disc = compareFields(value1, value2, col, elem1, elem2)
        if (disc != null) discs += disc
      }
    }
  }

  comparison()
  discs.toList
}
I'm calling the above function as such:
var outcome = compare(groupedFiltered)
Here is the data in groupedFiltered:
(1991,CompactBuffer([1991,7.14,5.72,39%], [1991,4.14,5.72,39%]))
(1997,CompactBuffer([1997,4.88,5.86,39%], [1997,3.88,5.86,39%]))
(1999,CompactBuffer([1999,5.15,5.96,39%], [1999,5.15,5.97,38%]))
(1947,CompactBuffer([1947,0.9,2.94,35%], [1947,0.4,2.94,35%]))
(1980,CompactBuffer([1980,3.1,6.88,45%], [1980,3.1,6.88,48%]))
(1981,CompactBuffer([1981,3.15,6.8,45%], [1981,3.35,6.8,45%]))
The table schema for groupedFiltered:
(year String,
nominal Double,
adjusted_for_inflation Double,
average_private_nonsupervisory_wage String)
Spark is a distributed computing engine. On top of the "what is the code doing" of classic single-node computing, with Spark we also need to consider "where is the code running".
Let's inspect a simplified version of the expression above:
val records: RDD[List[String]] = ??? // whatever data
val list: mutable.ListBuffer[String] = mutable.ListBuffer()

for { record <- records
      entry  <- record }
{ list += entry }
The Scala for-comprehension makes this expression look like a natural local computation, but in reality the RDD operations are serialized and "shipped" to the executors, where the inner operation is executed locally. We can rewrite the above like this:
records.foreach { record =>   // RDD.foreach => serializes the closure and executes it remotely
  record.foreach { entry =>   // record.foreach => local operation on the record collection
    list += entry             // this mutable list is updated in each executor but never sent back to the driver; all updates are lost
  }
}
Mutable objects are in general a no-go in distributed computing. Imagine that one executor adds a record and another one removes it: what's the correct result? Or that each executor ends up with a different value: which one is right?
To implement the operation above, we need to transform the data into our desired result.
I'd start by applying another best practice: do not use null as a return value. I also moved the row operations into the function. Let's rewrite the comparison operation with this in mind:
def compareFields(colName: String, row1: Row, row2: Row): Option[DiscrepancyData] = {
  val key = "year"
  val v1 = row1.getAs(colName).toString
  val v2 = row2.getAs(colName).toString
  if (v1 != v2) {
    Some(DiscrepancyData(
      row1.getAs(key).toString, //fieldKey
      colName,                  //fieldName
      v1,                       //table1Value
      v2,                       //table2Value
      v2)                       //expectedValue
    )
  } else None
}
Now, we can rewrite the computation of discrepancies as a transformation of the initial table data:
val discrepancies = table.flatMap { case (str, rows) =>
  val row1 = rows.head
  val row2 = rows.tail.head
  compareCols.flatMap(col => compareFields(col, row1, row2))
}
We can also use the for-comprehension notation, now that we understand where things are running:
val discrepancies = for {
  (str, rows) <- table
  col         <- compareCols
  dis         <- compareFields(col, rows.head, rows.tail.head)
} yield dis
Note that discrepancies is of type RDD[DiscrepancyData]. If we want to get the actual values to the driver we need to:
val materializedDiscrepancies = discrepancies.collect()
Iterating through an RDD and updating a mutable structure defined outside the loop is a Spark anti-pattern.
Imagine this RDD being spread over 200 machines. How can these machines be updating the same Buffer? They cannot. Each JVM will be seeing its own discs: ListBuffer[DiscrepancyData]. At the end, your result will not be what you expect.
To conclude, this is perfectly valid (though not idiomatic) Scala code, but it is not valid Spark code. If you replace the RDD with an Array it will work as expected.
Try to have a more functional implementation along these lines:
val finalRDD: RDD[DiscrepancyData] = table.map(???).filter(???)
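For instance, one way the ??? placeholders could be filled in, as a sketch that reuses compareCols, DiscrepancyData and the grouped table from the question (assuming the entries of compareCols are column-name strings); a flatMap takes the place of the map/filter pair here:
val finalRDD: RDD[DiscrepancyData] = table.flatMap { case (fieldKey, rows) =>
  val row1 = rows.head      // first row of the group
  val row2 = rows.tail.head // second row of the group
  compareCols.flatMap { col =>
    val v1 = row1.getAs[Any](col).toString
    val v2 = row2.getAs[Any](col).toString
    if (v1 != v2) Some(DiscrepancyData(fieldKey, col, v1, v2, v2)) else None
  }
}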

Stop processing large text files in Apache Spark after certain amount of errors

I'm very new to Spark. I'm working with 1.6.1.
Let's imagine I have a large file that I'm reading into an RDD[String] via textFile.
Then I want to validate each line in some function.
Because the file is huge, I want to stop processing when I reach a certain number of errors, let's say 1000 lines.
Something like
val rdd = sparkContext.textFile(fileName)
rdd.map(line => myValidator.validate(line))
here is validate function:
def validate(line: String): (String, String) = {
  // 1st element of the tuple for the resulting line, 2nd, say, for the validation error.
}
How can I count errors inside 'validate'? Is it actually executed in parallel on multiple nodes? Broadcast variables? Accumulators?
You can achieve this behavior using Spark's laziness by "splitting" the result of the parsing into successes and failures, calling take(n) on the failures, and only using the success data if there were fewer than n failures.
To achieve this more conveniently, I'd suggest changing the signature of validate to return some type that can easily distinguish success from failure, e.g. scala.util.Try:
def validate(line: String): Try[String] = {
  // returns Success[String] on success,
  // Failure (with details in the exception object) otherwise
}
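For illustration only, a concrete validate could look like the sketch below; the "at least 3 comma-separated fields" rule is invented, so substitute your real validation logic:
import scala.util.Try

def validate(line: String): Try[String] = Try {
  val fields = line.split(",")
  // hypothetical rule: a valid line has at least 3 fields
  require(fields.length >= 3, s"expected at least 3 fields, got ${fields.length}: $line")
  line
}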
And then, something like:
val maxFailures = 1000

val rdd = sparkContext.textFile(fileName)
val parsed: RDD[Try[String]] = rdd.map(line => myValidator.validate(line)).cache()

val failures: Array[Throwable] = parsed.collect { case Failure(e) => e }.take(maxFailures)

if (failures.size == maxFailures) {
  // report failures...
} else {
  val success: RDD[String] = parsed.collect { case Success(s) => s }
  // continue here...
}
Why would this work?
If there are fewer than 1000 failures, the entire dataset will be parsed when take(maxFailures) is called, and the successful data will be cached and ready to use.
If there are 1000 failures or more, the parsing will stop there, as the take operation won't require any more reads.

How to add line number into each line?

Suppose this is my data:
‘Maps‘ and ‘Reduces‘ are two phases of solving a query in HDFS.
‘Map’ is responsible to read data from input location.
it will generate a key value pair.
that is, an intermediate output in local machine.
’Reducer’ is responsible to process the intermediate.
output received from the mapper and generate the final output.
and I want to add a number to every line, like the output below:
1,‘Maps‘ and ‘Reduces‘ are two phases of solving a query in HDFS.
2,‘Map’ is responsible to read data from input location.
3,it will generate a key value pair.
4,that is, an intermediate output in local machine.
5,’Reducer’ is responsible to process the intermediate.
6,output received from the mapper and generate the final output.
and save them to a file.
I've tried:
object DS_E5 {
  def main(args: Array[String]): Unit = {
    var i = 0
    val conf = new SparkConf().setAppName("prep").setMaster("local")
    val sc = new SparkContext(conf)
    val sample1 = sc.textFile("data.txt")

    for (sample <- sample1) {
      i = i + 1
      val ss = sample.map(l => (i, sample))
      println(ss)
    }
  }
}
but its output looks like this:
Vector((1,‘Maps‘ and ‘Reduces‘ are two phases of solving a query in HDFS.))
...
How can I edit my code to generate my desired output?
zipWithIndex is what you need here. It maps from RDD[T] to RDD[(T, Long)] by adding an index on the second position of the pair.
sample1
  .zipWithIndex()
  .map { case (line, i) => i.toString + ", " + line }
or using string interpolation (see the comment by @DanielC.Sobral)
sample1
  .zipWithIndex()
  .map { case (line, i) => s"$i, $line" }
By calling val sample1 = sc.textFile("data.txt") you are creating a new RDD.
If you just need the output, you can try the following code:
sample1.zipWithIndex().foreach(f => println(f._2 + ", " + f._1))
Basically, by using this code, you will do the following:
Using .zipWithIndex() returns a new RDD[(T, Long)], where (T, Long) is a tuple, T is the previous RDD's element type (java.lang.String, I believe) and Long is the index of the element in the RDD.
You performed a transformation, so now you need an action. foreach, in this case, suits very well. What it basically does is apply your statement to every element in the current RDD, so here we just call a quickly formatted println.
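Since the question also asks to save the numbered lines to a file, here is a small sketch combining both answers; i + 1 gives the 1-based numbering shown in the desired output, and the output path is only an example:
sample1
  .zipWithIndex()
  .map { case (line, i) => s"${i + 1},$line" }  // 1-based index, no space after the comma
  .saveAsTextFile("numbered_output")            // example path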

How do I iterate RDD's in apache spark (scala)

I use the following command to fill an RDD with a bunch of arrays containing 2 strings ["filename", "content"]:
val someRDD = sc.wholeTextFiles("hdfs://localhost:8020/user/cloudera/*")
Now I want to iterate over each of those entries and do something with every filename and content.
I can't seem to find any documentation on how to do this, however.
So what I want is this:
foreach occurrence-in-the-rdd {
  // do stuff with the array found at location n of the RDD
}
You call various methods on the RDD that accept functions as parameters.
// set up an example -- an RDD of arrays
val sparkConf = new SparkConf().setMaster("local").setAppName("Example")
val sc = new SparkContext(sparkConf)
val testData = Array(Array(1, 2, 3), Array(4, 5, 6, 7, 8))
val testRDD = sc.parallelize(testData, 2)

// Print the RDD of arrays.
testRDD.collect().foreach(a => println(a.size))

// Use map() to create an RDD with the array sizes.
val countRDD = testRDD.map(a => a.size)

// Print the elements of this new RDD.
countRDD.collect().foreach(a => println(a))

// Use filter() to create an RDD with just the longer arrays.
val bigRDD = testRDD.filter(a => a.size > 3)

// Print each remaining array.
bigRDD.collect().foreach(a => {
  a.foreach(e => print(e + " "))
  println()
})
Notice that the functions you write accept a single RDD element as input, and return data of some uniform type, so you create an RDD of the latter type. For example, countRDD is an RDD[Int], while bigRDD is still an RDD[Array[Int]].
It will probably be tempting at some point to write a foreach that modifies some other data, but you should resist for reasons described in this question and answer.
Edit: Don't try to print large RDDs
Several readers have asked about using collect() and println() to see their results, as in the example above. Of course, this only works if you're running in an interactive mode like the Spark REPL (read-eval-print loop). It's best to call collect() on the RDD to get a sequential array for orderly printing. But collect() may bring back too much data, and in any case too much may be printed. Here are some alternative ways to get insight into your RDDs if they're large:
RDD.take(): This gives you fine control over the number of elements you get, but not where they come from -- they are defined as the "first" ones, a concept dealt with in various other questions and answers here.
// take() returns an Array so no need to collect()
myHugeRDD.take(20).foreach(a => println(a))
RDD.sample(): This lets you (roughly) control the fraction of results you get, whether sampling uses replacement, and even optionally the random number seed.
// sample() does return an RDD so you may still want to collect()
myHugeRDD.sample(true, 0.01).collect().foreach(a => println(a))
RDD.takeSample(): This is a hybrid: using random sampling that you can control, but both letting you specify the exact number of results and returning an Array.
// takeSample() returns an Array so no need to collect()
myHugeRDD.takeSample(true, 20).foreach(a => println(a))
RDD.count(): Sometimes the best insight comes from how many elements you ended up with -- I often do this first.
println(myHugeRDD.count())
The fundamental operations in Spark are map and filter.
val txtRDD = someRDD filter { case(id, content) => id.endsWith(".txt") }
txtRDD will now contain only the files that have the extension ".txt".
And if you want to word count those files you can say
//split the documents into words in one long list
val words = txtRDD flatMap { case (id, text) => text.split("\\s+") }

// give each word a count of 1
val wordsT = words map (x => (x, 1))

//sum up the counts for each word
val wordCount = wordsT reduceByKey((a, b) => a + b)
You want to use mapPartitions when you have some expensive initialization you need to perform -- for example, if you want to do Named Entity Recognition with a library like the Stanford coreNLP tools.
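As a rough sketch of that pattern, with a compiled regex standing in for a genuinely expensive resource such as an NLP pipeline (txtRDD is the (filename, content) pair RDD from above):
val tokenCounts = txtRDD.mapPartitions { iter =>
  val tokenPattern = java.util.regex.Pattern.compile("\\w+")  // built once per partition, not once per element
  iter.map { case (id, text) =>
    val m = tokenPattern.matcher(text)
    var count = 0
    while (m.find()) count += 1
    (id, count)  // (filename, token count)
  }
}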
Master map, filter, flatMap, and reduce, and you are well on your way to mastering Spark.
I would try making use of a partition mapping function. The code below shows how an entire RDD dataset can be processed in a loop so that each input goes through the very same function. I'm afraid I have no knowledge of Scala, so everything I have to offer is Java code. However, it should not be very difficult to translate it into Scala.
JavaRDD<String> res = file.mapPartitions(new FlatMapFunction<Iterator<String>, String>() {
    @Override
    public Iterable<String> call(Iterator<String> t) throws Exception {
        List<String> tmpRes = new ArrayList<>();
        while (t.hasNext()) {
            String input = t.next();            // consume the element, otherwise the loop never terminates
            tmpRes.add("processed: " + input);  // apply the same function to every input line
        }
        return tmpRes;
    }
}).cache();
What wholeTextFiles returns is a pair RDD:
def wholeTextFiles(path: String, minPartitions: Int): RDD[(String, String)]
Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
Here is an example of reading the files at a local path then printing every filename and content.
val conf = new SparkConf().setAppName("scala-test").setMaster("local")
val sc = new SparkContext(conf)
sc.wholeTextFiles("file:///Users/leon/Documents/test/")
.collect
.foreach(t => println(t._1 + ":" + t._2));
the result:
file:/Users/leon/Documents/test/1.txt:{"name":"tom","age":12}
file:/Users/leon/Documents/test/2.txt:{"name":"john","age":22}
file:/Users/leon/Documents/test/3.txt:{"name":"leon","age":18}
Or convert the pair RDD to a plain RDD first:
sc.wholeTextFiles("file:///Users/leon/Documents/test/")
  .map(t => t._2)
  .collect
  .foreach { x => println(x) }
the result:
{"name":"tom","age":12}
{"name":"john","age":22}
{"name":"leon","age":18}
And I think wholeTextFiles is better suited to small files.
for (element <- YourRDD) {
  // do what you want with element in each iteration;
  // if you want the index of the element, simply keep a counter variable in this loop, starting from 0
  println(element._1) // this will print all filenames
}