Efficiency of flatMap vs map followed by reduce in Spark - scala

I have a text file sherlock.txt containing multiple lines of text. I load it in spark-shell using:
val textFile = sc.textFile("sherlock.txt")
My purpose is to count the number of words in the file. I came across two alternative ways to do the job.
First using flatMap:
textFile.flatMap(line => line.split(" ")).count()
Second using map followed by reduce:
textFile.map(line => line.split(" ").size).reduce((a, b) => a + b)
Both correctly yield the same result. I want to know the differences in time and space complexity between these two implementations, if indeed there are any.
Does the Scala interpreter convert both into the most efficient form?

I will argue that the most idiomatic way to handle this would be to map and sum:
textFile.map(_.split(" ").size).sum
but at the end of the day the total cost will be dominated by line.split(" ").
You could probably do a little better by iterating over the string manually and counting runs of consecutive whitespace instead of building a new Array, but I doubt it is worth the fuss in general.
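As a rough illustration of that manual approach, here is an untested sketch (countWords is my own name) that scans characters instead of allocating an array per line. Note that, unlike split(" ").size, it does not count the empty tokens produced by consecutive spaces:
// Count words by scanning characters; a word starts where a non-space character
// follows a space (or the start of the line). Avoids building an Array per line.
def countWords(line: String): Int = {
  var count = 0
  var inWord = false
  var i = 0
  while (i < line.length) {
    val isSpace = line.charAt(i) == ' '
    if (!isSpace && !inWord) count += 1
    inWord = !isSpace
    i += 1
  }
  count
}

textFile.map(countWords).sum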
If you want a little deeper insight, count is defined as:
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
where Utils.getIteratorSize is pretty much a naive iteration over the Iterator, adding one per element, and sum is equivalent to:
_.fold(0.0)(_ + _)

Related

scala - error: ')' expected but '(' found

I'm new to Scala and I cannot find out what is causing this error. I have searched similar topics but unfortunately none of them worked for me. I've got a simple piece of code to find the line in a README.md file with the most words in it. The code I wrote is:
val readme = sc.textFile("/PATH/TO/README.md")
readme.map(lambda line :len(line.split())).reduce(lambda a, b: a if (a > b) else b)
and the error is:
Name: Compile Error
Message: <console>:1: error: ')' expected but '(' found.
readme.map(lambda line :len(line.split()) ).reduce( lambda a, b: a
if (a > b) else b ) ^
<console>:1: error: ';' expected but ')' found.
readme.map(lambda line :len(line.split()) ).reduce( lambda a, b: a
if (a > b) else b ) ^
Your code isn't valid Scala.
I think what you might be trying to do is to determine the largest number of words on a single line in a README file using Spark. Is that right? If so, then you likely want something like this:
val readme = sc.textFile("/PATH/TO/README.md")
readme.map(_.split(' ').length).reduce(Math.max)
That last line uses some of Scala's argument shorthands (placeholder syntax, and passing Math.max directly as a function). This alternative version is equivalent, but a little more explicit:
readme.map(line => line.split(' ').length).reduce((a, b) => Math.max(a, b))
The map function converts an RDD of Strings (each line in the file) into an RDD of Ints (the number of words on a single line, delimited - in this particular case - by spaces). The reduce function then returns the largest value of its two arguments - which will ultimately result in a single Int value representing the largest number of elements on a single line of the file.
After re-reading your question, it seems that you might want to know the line with the most words, rather than how many words are present. That's a little trickier, but this should do the trick:
readme.map(line => (line.split(' ').length, line)).reduce((a, b) => if(a._1 > b._1) a else b)._2
Now map creates an RDD of a tuple of (Int, String), where the first value is the number of words on the line, and the second is the line itself. reduce then retains whichever of its two tuple arguments has the larger integer value (._1 refers to the first element of the tuple). Since the result is a tuple, we then use ._2 to retrieve the corresponding line (the second element of the tuple).
I'd recommend you read a good book on Scala, such as Programming in Scala, 3rd Edition, by Odersky, Spoon & Venners. There are also some tutorials and an overview of the language on the main Scala language site. Coursera also has some free Scala training courses that you might want to sign up for.

Split file text by newline Scala

I want to read 100 numbers from a file which are stored in such a fashion:
Each number is on a different line. I am not sure which data structure should be used here, because later I will need to sum all these numbers and extract the first 10 digits of the sum.
I only managed to simply read the file, but I want to split all the text by newline separators and get each number as a list or array element:
val source = Source.fromFile("pathtothefile")
val lines = source.getLines.mkString
I would be grateful for any advice on a data structure to be used here!
Update on approach:
val lines = Source.fromFile("path").getLines.toList
You almost have it there; just map to BigInt, and then you have a list of BigInt:
val lines = Source.fromFile("path").getLines.map(BigInt(_)).toList
(and then you can use .sum to sum them all up, etc)
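Putting that together with the rest of the question (summing the numbers and taking the first 10 digits of the sum), a minimal sketch could look like this; the file path is a placeholder of mine:
import scala.io.Source

// One integer per line; "path/to/numbers.txt" is a placeholder path.
val numbers = Source.fromFile("path/to/numbers.txt").getLines.map(BigInt(_)).toList
val total = numbers.sum                       // BigInt sum of all the numbers
val firstTenDigits = total.toString.take(10)  // first 10 digits of the sum, as a String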

How to copy the key of the previous line, to the key field of the next line in a Key-Value Pair RDD

Sample Dataset:
$, Claw "OnCreativity" (2012) [Himself]
$, Homo Nykytaiteen museo (1986) [Himself] <25>
Suuri illusioni (1985) [Guests] <22>
$, Steve E.R. Sluts (2003) (V) <12>
$hort, Too 2012 AVN Awards Show (2012) (TV) [Himself - Musical Guest]
2012 AVN Red Carpet Show (2012) (TV) [Himself]
5th Annual VH1 Hip Hop Honors (2008) (TV) [Himself]
American Pimp (1999) [Too $hort]
I have created a Key-Value Pair RDD using the following code:
To split data: val actorTuple = actor.map(l => l.split("\t"))
To make KV pair: val actorKV = actorTuple.map(l => (l(0), l(l.length-1))).filter{case(x,y) => y != "" }
The Key-Value RDD output on console:
Array(($, Claw,"OnCreativity" (2012) [Himself]), ($, Homo,Nykytaiteen museo (1986) [Himself] <25>), ("",Suuri illusioni (1985) [Guests] <22>), ($, Steve,E.R. Sluts (2003) (V) <12>).......
But a lot of lines have "" as the key, i.e. blank (see the RDD output above), because of the nature of the dataset. So I want a function that copies the actor from the previous line to this line whenever the key is empty.
How can this be done?
I'm new to Spark and Scala, but perhaps it would be simpler to change how you parse the lines and first create a pair RDD with values of type list, e.g.:
($, Homo, (Nykytaiteen museo (1986) [Himself] <25>,Suuri illusioni (1985) [Guests] <22>) )
I don't know your data, but perhaps if a line doesn't begin with "$" you append onto the value list.
Then depending on what you want to do, perhaps you could use flatMapValues(func) on the pair RDD described above. This applies a function which returns an iterator to each value of a pair RDD, and for each element returned, produces a key-value entry with the old key.
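For instance, here is a minimal sketch of that flatMapValues idea (the sample pair is made up, not taken from the real file):
// A pair RDD whose value is a list of titles for one actor (made-up sample data).
val actorTitles = sc.parallelize(Seq(
  ("$, Homo", List("Nykytaiteen museo (1986) [Himself] <25>",
                   "Suuri illusioni (1985) [Guests] <22>"))
))
// flatMapValues applies the function to each value and emits one (key, element)
// pair per element returned, keeping the original key.
val perTitle = actorTitles.flatMapValues(titles => titles)
// perTitle: RDD[(String, String)] -- one (actor, title) pair per title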
ADDED:
What format is your input data ("Sample Dataset") in? Is it a text file or .tsv?
You probably want to load the whole file at once. That is, use .wholeTextFiles() rather than .textFile() to load your data. This is because your records are stored across more than one line in the file.
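For example, something along these lines (the path is a placeholder of mine); wholeTextFiles returns one (path, contents) pair per file:
// Load the whole file as a single string; "actors.list" is a placeholder path.
val whole = sc.wholeTextFiles("actors.list")  // RDD[(filePath, fileContents)]
val actorsFile = whole.values.first()         // the entire file contents as one String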
ADDED:
I'm not going to download the file, but it seems to me that each record you are interested in begins with "$".
Spark can work with any of the Hadoop Input formats, so check those to see if there is one which will work for your sample data.
If not, you could write your own Hadoop InputFormat implementation that parses files into records split on this character instead of the default for TextFiles, which is the '\n' character.
Continuing from the idea xyzzy gave, how about you try this after loading in the file as a string:
val actorFileSplit = actorsFile.split("\n\n")
val actorData = sc.parallelize(actorFileSplit)
val actorDataSplit = actorData.map(x => x.split("\t+", 2).toList).map(line => (line(0), line(1).split("\n\t+").toList))
To explain what I'm doing: I start by splitting the string every time we find two consecutive line breaks (an empty line). Next, I parallelize this through the SparkContext so the mapping functions can be applied. Then I split every entry into two parts, delimited by the first occurrence of one or more tabs. The first part should now be the actor and the second part should still be the string with the movie titles. The second part may once again be split at every newline followed by one or more tabs. This should create a list with all the titles for every actor. The final result is of the form:
actorDataSplit = [(String, [String])]
Good luck

Meaning of 2nd parameter in StringOps.split(String, Int)

I was trying to split a string and keep the empty strings. Fortunately I found a promising solution which gave me the expected results, as the following REPL session shows:
scala> val test = ";;".split(";",-1)
test: Array[String] = Array("", "", "")
I was curious what the second parameter actually does and dove into the Scala documentation, but found nothing except this:
Also, inside the REPL interpreter I only get the following information:
scala> "asdf".split
TAB
def split(String): Array[String]
def split(String, Int): Array[String]
Question
Does anybody have an alternate source of documentation for such badly documented parameters?
Or can someone explain what this 2nd parameter does for this specific function?
This is the same split as the one on java.lang.String, which, as it so happens, has better documentation:
The limit parameter controls the number of times the pattern is
applied and therefore affects the length of the resulting array. If
the limit n is greater than zero then the pattern will be applied at
most n - 1 times, the array's length will be no greater than n, and
the array's last entry will contain all input beyond the last matched
delimiter. If n is non-positive then the pattern will be applied as
many times as possible and the array can have any length. If n is zero
then the pattern will be applied as many times as possible, the array
can have any length, and trailing empty strings will be discarded.
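To make those three cases concrete, here are a few REPL-style examples with an input string of my own choosing:
scala> "a;b;;".split(";")        // limit omitted (0): trailing empty strings discarded
res0: Array[String] = Array(a, b)
scala> "a;b;;".split(";", -1)    // negative limit: trailing empty strings kept
res1: Array[String] = Array(a, b, "", "")
scala> "a;b;;".split(";", 2)     // limit 2: pattern applied at most once
res2: Array[String] = Array(a, b;;)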

A scala program much slower than the corresponding python script

I have written a short Scala program to read a large file, process it, and store the result in another file. The file contains about 60000 lines of numbers, and I need to extract from every third line only the first number. Eventually I save those numbers to a different file. Although they are numbers, I treat them as strings all the way through.
Here is the Scala code:
import scala.io.Source
import java.io.BufferedWriter
import java.io.FileWriter
object Analyze {
  def main(args: Array[String]) {
    val fname = "input.txt"
    val counters = Source.fromFile(fname).mkString.split("\\n").grouped(3)
      .map(_(2).split("\\s+")(0))
    val f = new BufferedWriter(new FileWriter("output1.txt"))
    f.write(counters.reduceLeft(_ + "\n" + _))
    f.close()
  }
}
I very much like Scala's capability for powerful one-liners. The one-liner in the above code reads the entire text from the file, splits it into lines, groups the lines into groups of 3, and then takes from each group the third line, splits it, and takes the first number.
Here is the equivalent Python script:
fname = 'input.txt'
with file(fname) as f:
    lines = f.read().splitlines()
linegroups = [lines[i:i+3] for i in range(0, len(lines), 3)]
nums = [linegroup[2].split()[0] for linegroup in linegroups]
with file('output2.txt', 'w') as f:
    f.write('\n'.join(nums))
Python is not capable of such one-liners. In the above script the first lines of code read the file into a list of lines, the next groups the lines into groups of 3, and the next creates a list consisting of the first number of the last line of each group. It is very similar to the Scala code, only it runs much, much faster.
The Python script runs in a fraction of a second on my laptop, while the Scala program runs for 15 seconds! I commented out the code that saves the result to the file, and the duration fell to 5 seconds, which is still way too slow. I also don't understand why it takes so long to save the numbers to the file. When I dealt with larger files, the Python script ran for a few seconds, while the Scala program's running time was on the order of minutes, which made it unusable for analyzing my files.
I'd appreciate your advice on this issue.
Thanks
I took the liberty of cleaning up the code. This should run more efficiently by avoiding the initial mkString, not needing a regex for the whitespace split, and not pre-aggregating the results before writing them out. I also used methods that are more self-documenting:
val fname = "input.txt"
val lines = (Source fromFile fname).getLines
val counters =
  (lines grouped 3 withPartial false) map { _.last takeWhile (!_.isWhitespace) }
val f = new BufferedWriter(new FileWriter("output1.txt"))
f.write(counters mkString "\n")
f.close()
Warning, untested code
This is largely irrelevant though, depending on how you're profiling. If you're including the JVM startup time in your metrics, then all bets are off - no amount of code optimisation could help you there.
I'd normally also suggest pre-warming the JVM by running the routine a few hundred times before you time it, but this isn't so practical in the face of file I/O.
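For what it's worth, a rough sketch of that warm-up-then-measure pattern might look like the following (timeIt and doWork are names I made up; doWork stands in for the routine under test):
// Measure elapsed wall-clock time of a block, in milliseconds.
def timeIt[A](body: => A): Long = {
  val start = System.nanoTime()
  body
  (System.nanoTime() - start) / 1000000
}

// Placeholder for the routine under test (e.g. the file-processing code above).
def doWork(): Unit = { }

(1 to 100).foreach(_ => doWork())  // warm up the JIT first
val elapsedMs = timeIt(doWork())   // then time a single run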
I timed the version provided by Kevin with minor edits (removed withPartial, since the Python version doesn't handle padding either):
import scala.io.Source
import java.io.BufferedWriter
import java.io.FileWriter
object A extends App {
  val fname = "input.txt"
  val lines = (Source fromFile fname).getLines
  val counters =
    (lines grouped 3) map { _.last takeWhile (!_.isWhitespace) }
  val f = new BufferedWriter(new FileWriter("output1.txt"))
  f.write(counters mkString "\n")
  f.close()
}
With 60,000 lines, here are the timings:
$ time scala -cp classes A
real 0m2.823s
$ time /usr/bin/python A.py
real 0m0.437s
With 900,000 lines:
$ time scala -cp classes A
real 0m5.226s
$ time /usr/bin/python A.py
real 0m3.319s
With 2,700,000 lines:
$ time scala -cp classes A
real 0m9.516s
$ time /usr/bin/python A.py
real 0m10.635s
The Scala version outperforms the Python version after that. So it seems some of the long running time is due to JVM initialization and JIT compilation time.
In addition to #notan3xit's answer, you could also write counters to the file without concatenating them first:
val f = new BufferedWriter(new FileWriter("output1.txt"))
f.write(counters.head.toString)
counters.tail.foreach(c => f.write("\n" + c.toString))
f.close()
Though you could do the same in Python...
Try this code for writing to the file:
val f = new java.io.PrintWriter(new java.io.File("output1.txt"))
f.write(counters.reduce(_ + "\n" + _))
f.close()
Much faster.