Functional Style reading large CSV file in Scala

I am new to functional-style programming and Scala, so my question might seem a bit primitive.
Is there a specific way to read a CSV file in Scala using a functional style? Also, how are inner joins performed to combine two CSV files in Scala using a functional style?
I know Spark and generally use DataFrames, but I don't have any experience with plain Scala, and I'm finding it hard to search for as well, since I don't know much about it. Also, if anyone knows good links for functional-style programming in Scala, that would be a great help.

The question is indeed too broad.
Is there a specific way to read a CSV file in Scala using a functional style?
So far I don't know of a royal road to parsing CSVs completely hassle-free.
CSV parsing involves:
going through the input line by line
deciding what to do with the (optional) header
accurately parsing each line according to the CSV specification
turning the line parts into a business object
I recommend that you:
turn your input into an Iterator[String]
split each line into parts using a library of your choice (e.g. opencsv)
manually create the desired domain object from the line parts
Here is a simple example (which ignores error handling and a potential header):
import java.io.FileInputStream
import com.opencsv.CSVParserBuilder
import scala.io.Source

case class Person(name: String, street: String)

// a parser for a single comma-separated line
val lineParser = new CSVParserBuilder().withSeparator(',').build()
val lines: Iterator[String] = Source.fromInputStream(new FileInputStream("file.csv")).getLines()
val parsedObjects: Iterator[Person] = lines.map { line =>
  val parts: Array[String] = lineParser.parseLine(line)
  Person(parts(0), parts(1))
}
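The bullet points above also mention the header and error handling, which the example skips. A minimal sketch of one way to cover both, assuming the header is a single first line and that malformed lines can simply be dropped:

import scala.util.Try

// skip the (assumed single-line) header, then keep only the lines that parse cleanly
val people: Iterator[Person] = lines
  .drop(1)
  .flatMap { line =>
    Try {
      val parts = lineParser.parseLine(line)
      Person(parts(0), parts(1))
    }.toOption
  }

As for the inner-join part of the question: there is no built-in CSV join in the standard library. The usual functional approach, once both files fit comfortably in memory, is to parse each file into a collection, index one of them by the join key (e.g. with groupBy into a Map), and flatMap over the other to emit the matching pairs.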

Related

Is it Scala style to use a for loop in Scala/Spark?

I have heard that it is good practice in Scala to eliminate for loops and do things "the Scala way". I even found a Scala style checker at http://www.scalastyle.org. Are for loops a no-no in Scala? In a course at https://www.udemy.com/course/apache-spark-with-scala-hands-on-with-big-data/learn/lecture/5363798#overview I found this example, which makes me think that for loops are okay to use, provided they follow Scala format and syntax, in a single line rather than the traditional multi-line Java for loop. See this example I found in that Udemy course:
val shipList = List("Enterprise", "Defiant", "Voyager", "Deep Space Nine")
for (ship <- shipList) {println(ship)}
That for loop prints this result, as expected:
Enterprise Defiant Voyager Deep Space Nine
I was wondering if using for as in the example above is acceptable Scala style code, or if it is a no-no, and why. Thank you!
There is no problem with this for loop, but you can use the functions on List to do the same work in a more functional way.
For example, instead of using
val shipList = List("Enterprise", "Defiant", "Voyager", "Deep Space Nine")
for (ship <- shipList) {println(ship)}
You can use
val shipList = List("Enterprise", "Defiant", "Voyager", "Deep Space Nine")
shipList.foreach(element => println(element) )
or
shipList.foreach(println)
You can use for loops in Scala; there is no problem with that. But note that a for loop like this one does not yield a useful value (it evaluates to Unit), so to get a result out of it you would have to mutate a variable. Scala gives preference to working with immutable values and with expressions that return values.
In your example you print messages to the console, so you rely on a side effect and break referential transparency: the result is only observable through the IO operation. The alternative would be to mutate a variable in the enclosing scope, which might be accessed by another thread or another concurrent task, so there is no guarantee that the value you collect will be what you expect. All of these concerns belong to concurrent/parallel programming, and that is exactly where Scala and the immutable style help.
To show the elements of a collection you can use a for loop, but if you want, say, to count the total number of characters, you would do it with an expression like:
val chars = shipList.foldLeft(0)((a, b) => a + b.length)
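The same count can also be written with a for-comprehension, which, unlike the plain for loop above, is an expression that returns a value:
val lengths = for (ship <- shipList) yield ship.length // List(10, 7, 7, 15)
val totalChars = lengths.sum // 39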
To sum up, most of the time the Scala code you will read uses an immutable style of programming, though not always, since Scala supports the other way of coding too; but it is unusual to find Scala written in a classic Java OOP style, mutating object instances and using getters and setters.

Training Spark's word2vec with an RDD[String]

I'm new to Spark and Scala, so I might have misunderstood some basic things here. I'm trying to train Spark's word2vec model on my own data. According to their documentation, one way to do this is:
import org.apache.spark.mllib.feature.Word2Vec

val input = sc.textFile("text8").map(line => line.split(" ").toSeq)
val word2vec = new Word2Vec()
val model = word2vec.fit(input)
The text8 dataset contains one line of many words, meaning that input will become an RDD[Seq[String]].
After massaging my own dataset, which has one word per line, using different maps etc., I'm left with an RDD[String], but I can't seem to be able to train the word2vec model on it. I tried doing input.map(v => Seq(v)), which does give an RDD[Seq[String]], but that creates one sequence per word, which I guess is totally wrong.
How can I wrap a sequence around my strings, or is there something else I have missed?
EDIT
So I kind of figured it out. Starting from my clean RDD[String], I do val input = sc.parallelize(Seq(clean.collect().toSeq)). This gives me the correct data structure (RDD[Seq[String]]) to fit the word2vec model. However, running collect on a large dataset gives me an out-of-memory error. I'm not quite sure how the fitting is intended to be done? Maybe it is not really parallelizable. Or maybe I'm supposed to have several semi-long sequences of strings inside an RDD, instead of one long sequence like I have now?
It seems that the documentation has been updated in another location (even though I was looking at the "latest" docs). The new docs are at: https://spark.apache.org/docs/latest/ml-features.html
The new example drops the text8 example file altogether. I doubt whether the original example ever worked as intended. The RDD input to word2vec should be a set of lists of strings, typically sentences or otherwise constructed n-grams.
Example included for other lost souls:
import org.apache.spark.ml.feature.Word2Vec

val documentDF = sqlContext.createDataFrame(Seq(
  "Hi I heard about Spark".split(" "),
  "I wish Java could use case classes".split(" "),
  "Logistic regression models are neat".split(" ")
).map(Tuple1.apply)).toDF("text")

// Learn a mapping from words to Vectors.
val word2Vec = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("result")
  .setVectorSize(3)
  .setMinCount(0)
val model = word2Vec.fit(documentDF)
Why not
input.map(v => v.split(" "))
or whatever would be an appropriate delimiter to split your words on. This will give you the desired sequence of strings - but with valid words.
As far as I can recall, Word2Vec in ml takes a DataFrame as its argument, while Word2Vec in mllib can take an RDD. The example you posted is for Word2Vec in ml. Here is the official guide: https://spark.apache.org/docs/latest/mllib-feature-extraction.html#word2vec
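To tie that back to the original problem, here is a minimal sketch of the mllib variant, assuming clean is the RDD[String] with one word per line from the question and that grouping words into fixed-size chunks (1000 words here, an arbitrary choice) is acceptable as "sentences"; this avoids the collect() that caused the out-of-memory error:

import org.apache.spark.mllib.feature.Word2Vec
import org.apache.spark.rdd.RDD

// group the one-word-per-line RDD into chunks of words, partition by partition,
// so nothing is ever collected to the driver
val sentences: RDD[Seq[String]] = clean.mapPartitions(_.grouped(1000).map(_.toSeq))

val word2vec = new Word2Vec()
val model = word2vec.fit(sentences)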

Scala Parser Combinators: Parsing in a stream

I'm using the native parser combinator library in Scala, and I'd like to use it to parse a number of large files. I have my combinators set up, but the file that I'm trying to parse is too large to be read into memory all at once. I'd like to be able to stream from an input file through my parser and write the results back to disk so that I don't need to hold it all in memory at once. My current system looks something like this:
import scala.io.Source

val f = Source.fromFile("myfile")
parser.parse(parser.document.+, f.reader).get.map{_.writeToFile}
f.close
This reads the whole file in as it parses, which I'd like to avoid.
There is no easy or built-in way to accomplish this using Scala's parser combinators, which provide a facility for implementing parsing expression grammars.
Operators such as ||| (longest match) are largely incompatible with a stream-parsing model, as they require extensive backtracking capabilities. In order to accomplish what you are trying to do, you would need to reformulate your grammar so that no backtracking is ever required. This is generally much harder than it sounds.
As mentioned by others, your best bet would be to look into a preliminary phase where you chunk your input (e.g. by line) so that you can handle a portion of the stream at a time.
One easy way of doing it is to grab an Iterator from the Source object and then walk through the lines like so:
import scala.io.Source

val source = Source.fromFile("myFile")
val lines = source.getLines()
for (line <- lines) {
  // do something with the line value
}
source.close() // close the file
But you will need to be able to use the lines one by one in your parser of course.
Source: https://groups.google.com/forum/#!topic/scala-user/LPzpXo3sUVE
You might try the StreamReader class that is part of the parsing package.
You would use it something like this:
import scala.io.Source.fromFile
import scala.util.parsing.input.StreamReader

val f = StreamReader(fromFile("myfile", "UTF-8").reader())
parseAll(parser, f)
The longest match (|||) mentioned above, combined with regex parsers that call source.subSequence(0, source.length), means even StreamReader doesn't help.
The best kludgy answer I have is to use getLines as others have mentioned, and chunk as the accepted answer suggests. My particular input required me to chunk two lines at a time. You could build an iterator out of the chunks you build to make it slightly less ugly.
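For illustration, a rough sketch of that chunking approach, assuming the parser from the question extends RegexParsers, that parser.document parses one chunk, and that two lines form one parseable unit:

import scala.io.Source

val source = Source.fromFile("myfile")
try {
  // feed the parser two lines at a time so the whole file never sits in memory
  source.getLines().grouped(2).foreach { chunk =>
    parser.parseAll(parser.document, chunk.mkString("\n")) match {
      case parser.Success(doc, _) => doc.writeToFile // same write-back step as in the question
      case failure                => println(failure) // or handle the error however you prefer
    }
  }
} finally {
  source.close()
}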

Reading a two-dimensional array using Scala

Suppose I have a txt file named "input.txt" and I want to use Scala to read it in. The dimensions of the file are not known in advance.
So, how do I construct such an Array[Array[Float]]? What I want is a simple and neat way, rather than writing Java-like code that iterates over the lines and parses each number. I think functional programming should be quite good at this, but I cannot come up with anything so far.
Best Regards
If your input is well-formed, you can do it this way:
val source = io.Source.fromFile("input.txt")
val data = source.getLines().map(line => line.split(" ").map(_.toFloat)).toArray
source.close()
Update: for additional information about using Source, check this thread.

Scala's compiler lexer

I'm trying to get a list of tokens (I'm most interested in keywords) and their positions for a given scala source file.
I think there is a lexer utility inside the Scala compiler, but I can't find it. Can you point me in the right direction?
A simple lexer for a Scala-like language is provided in the standard library.
A small utility program that tokenizes Scala source using the same lexer as the compiler does lives here.
Scalariform has an accurate Scala lexer you can use:
import scalariform.lexer._

val tokens = ScalaLexer.rawTokenise("class A", forgiveErrors = true)
val keywords = tokens.filter(_.tokenType.isKeyword)
val comments = tokens.filter(_.tokenType.isComment)
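Since the question also asks for positions: if I remember the Scalariform API correctly, each Token carries its character offset into the source, so something like the following should give keyword/position pairs (treat the field name as an assumption to verify against the version you use):

// pair each keyword with its character offset in the source
val keywordPositions = keywords.map(t => (t.text, t.offset))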
Parser combinators might help with what you are trying to achieve here, especially if later on you are interested in more than just keyword parsing.