Scala Parser Combinators: Parsing in a stream - scala

I'm using the native parser combinator library in Scala, and I'd like to use it to parse a number of large files. I have my combinators set up, but the file that I'm trying to parse is too large to be read into memory all at once. I'd like to be able to stream from an input file through my parser and write the results back to disk, so that I don't need to store everything in memory at once. My current system looks something like this:
val f = Source.fromFile("myfile")
parser.parse(parser.document.+, f.reader).get.map{_.writeToFile}
f.close
This reads the whole file in as it parses, which I'd like to avoid.

There is no easy or built-in way to accomplish this using scala's parser combinators, which provide a facility for implementing parsing expression grammars.
Operators such as ||| (longest match) are largely incompatible with a stream parsing model, as they require extensive backtracking capabilities. In order to accomplish what you are trying to do, you would need to re-formulate your grammar such that no backtracking is required, ever. This is generally much harder than it sounds.
As mentioned by others, your best bet would be to look into a preliminary phase where you chunk your input (e.g. by line) so that you can handle a portion of the stream at a time.

One easy way of doing it is to grab an Iterator from the Source object and then walk through the lines like so:
val source = Source.fromFile("myFile")
val lines = source.getLines
for (line <- lines) {
// Do magic with the line-value
}
source.close // Close the file
But you will need to be able to use the lines one by one in your parser of course.
Source: https://groups.google.com/forum/#!topic/scala-user/LPzpXo3sUVE

You might try the StreamReader class that is part of the parsing package.
You would use it something like:
val f = StreamReader( fromFile("myfile","UTF-8").reader() )
parseAll( parser, f )

As one poster above mentioned, longest match, combined with regexes that use source.subSequence(0, source.length), means that even StreamReader doesn't help.
The best kludgy answer I have is to use getLines, as others have mentioned, and to chunk the input, as the accepted answer suggests. My particular input required me to chunk 2 lines at a time. You could build an iterator out of the chunks to make it slightly less ugly.
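The chunking idea could be sketched roughly as follows. This is a minimal sketch rather than a drop-in solution: ChunkedParsing, parseChunk and parseInChunks are hypothetical names, parseChunk stands in for whatever call runs your combinators on one small chunk, and the chunk size is passed in because 2 lines only happened to suit my input.

```scala
object ChunkedParsing {
  // Hypothetical stand-in for running the real parser on one chunk;
  // here it only joins the lines of the chunk so the sketch is runnable.
  def parseChunk(chunk: Seq[String]): String = chunk.mkString(" | ")

  // Lazily groups the lines into chunks of `size` lines and parses each
  // chunk on demand, so only one chunk is held in memory at a time.
  // The last chunk may contain fewer than `size` lines.
  def parseInChunks(lines: Iterator[String], size: Int): Iterator[String] =
    lines.grouped(size).map(parseChunk)
}
```

With a file, you would pass source.getLines() as the first argument and write each result out as it is produced, for example with foreach.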

Related

Functional Style reading large csv file in Scala

I am new to functional-style programming and Scala, so my question might seem a bit primitive.
Is there a specific way to read a CSV file in Scala using functional style? Also, how are inner joins performed when combining 2 CSV files in Scala using functional style?
I know Spark and generally use data frames, but I don't have any idea how to do this in plain Scala, and I am finding it tough to search on Google since I don't know much about it. Also, if anyone knows good links on functional-style programming in Scala, that would be a great help.
The question is indeed too broad.
Is there a specific way to read csv file in scala using functional style?
So far I don't know of a king's road to parse CSVs completely hassle-free.
CSV parsing includes
going through the input line-by-line
understanding, what to do with the (optional) header
accurately parsing each line according to the CSV specification
turning line parts into a business object
I recommend that you
turn your input into Iterator[String]
split each line into parts using a library of your choice (e.g. opencsv)
manually create a desired domain object from the line parts
Here is a simple example (which ignores error handling and a potential header):
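import java.io.FileInputStream
import scala.io.Source
import com.opencsv.CSVParserBuilder

case class Person(name: String, street: String)
// opencsv parser that splits a single line on commas
val lineParser = new CSVParserBuilder().withSeparator(',').build()
val lines: Iterator[String] = Source.fromInputStream(new FileInputStream("file.csv")).getLines()
// Lazily turn each line into a Person (no header or error handling)
val parsedObjects: Iterator[Person] = lines.map(line => {
  val parts: Array[String] = lineParser.parseLine(line)
  Person(parts(0), parts(1))
})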
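For the inner-join part of the question, one possible approach is to index one file in a Map and stream the other file against it. The following is only a sketch under assumptions: the Person and Address shapes with an id column are hypothetical (they are not the classes from the example above), the key is assumed to be unique on the Person side, and the naive split on commas stands in for a real CSV parser.

```scala
object CsvJoin {
  // Hypothetical row types; both files are assumed to share an id column.
  case class Person(id: String, name: String)
  case class Address(personId: String, street: String)

  // Naive comma split for illustration only; it does not handle quoted fields.
  def parsePerson(line: String): Person = {
    val parts = line.split(",", -1)
    Person(parts(0), parts(1))
  }

  def parseAddress(line: String): Address = {
    val parts = line.split(",", -1)
    Address(parts(0), parts(1))
  }

  // Inner join on id: the people are loaded into a Map (so that side must
  // fit in memory), while the addresses are streamed lazily. Addresses
  // without a matching person are dropped.
  def innerJoin(people: Iterator[Person],
                addresses: Iterator[Address]): Iterator[(Person, Address)] = {
    val byId: Map[String, Person] = people.map(p => p.id -> p).toMap
    addresses.flatMap(a => byId.get(a.personId).map(p => (p, a)))
  }
}
```

Indexing the smaller file keeps memory use down; if neither file fits in memory, a different strategy (such as sorting both files by key first) would be needed.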

Scala Anorm - how use it properly

Scala's Play framework claims that Anorm, and writing your own SQL, is better than ORMs. One of the reasons is that most often you only want to transfer data between the database and the frontend as JSON anyway. However, most tutorials, and even the Play documentation, give examples of parsing the values returned by SQL into case classes, only to convert them again into JSON. We still have an object-relational mapping anyway, or am I missing the point?
In my database there is a table with 33 columns. Declaring a case class takes me 33 lines, and declaring a parser with the ~ operator takes another 33. Using a case statement to create an object takes another 66! Seriously, what am I doing wrong? Is there any shortcut? In Django the same thing takes only 33 lines.
If you're using Anorm within a Play application, then the mapping into a Json object of your case class (assuming it has fairly normal apply and unapply functions defined for it, which most do) should be pretty much as simple as defining an implicit which uses the >2.10 macro based Json-inception methods...so all you actually need is a definition like this:
implicit val myCaseFormats = Json.format[MyCaseClass]
where 'MyCaseClass' is the name of your case type. You could even bake this into the parser combinator you use for de-serialising row-sets back from the database...that would dramatically clean up your code and cut down the amount of code you have to write.
See here for details on the Json macros:
https://www.playframework.com/documentation/2.1.1/ScalaJsonInception
I use this quite extensively in a pretty large code-base and it does make things quite clean.
In terms of your parsers for Anorm, remember that you don't have to produce a case-class instance as a result of a parse...you can actually return anything you like, which could just be an indexed sequence of your column values (if you're using something like Shapeless to allow for mixed-type lists etc...) or some other structure.
You do have macro support in Anorm as well, so the parsers for your case classes can be one-liners like this:
import anorm.{Macro, RowParser}
val parser: RowParser[MyCaseClass] = Macro.namedParser[MyCaseClass]
If you want to do something custom, (such as parse direct to JsValue) then you have the flexibility to just hand-craft a more crafty parser.
HTH

Parboiled2: reference to position in source text from AST

I am writing a DSL, and learning parboiled2, at the same time. Once my AST is built, I would like to run some semantic checks and, if there are any errors, output error messages that reference the offending positions in the source text.
I am writing things like the following, which, so far, do work:
case class CtxElem[A](start:Int, end:Int, elem:A)
def Identifier = rule {
  push(cursor) ~
  capture(Alpha ~ zeroOrMore(AlphaNum)) ~
  push(cursor) ~
  WhiteSpace ~> ((start, identifier, finish) => CtxElem(start, finish, identifier))
}
Is there a better or simpler way?
Parboiled 2 (for now) doesn't support parser recovery strategies. This means that if the parser fails, it stops. As far as I remember, it should print the symbol where it failed, or at least you could get the cursor.
So if you're trying to build your own DSL and you need that kind of functionality, I would suggest using a different tool such as ANTLR. Parboiled1 supports parser recovery techniques, but for now it's dead and buried in favor of support for the second version. Parboiled 2 is good at parsing log files or configuration files (which should be correct by default), but it's not good for building DSLs.
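If you do stay with the cursor offsets captured in the question, turning them into a line and column for an error message could look roughly like this. It is a minimal sketch using only the standard library: SourcePosition and lineAndColumn are hypothetical names, not part of parboiled2, and it assumes the stored start and end values are 0-based character offsets into the input text.

```scala
object SourcePosition {
  // Converts a 0-based character offset into a 1-based (line, column) pair
  // by counting the newlines that occur before the offset.
  // Offsets outside the text are clamped to its bounds.
  def lineAndColumn(source: String, offset: Int): (Int, Int) = {
    val clamped = math.min(math.max(offset, 0), source.length)
    val before = source.substring(0, clamped)
    val line = before.count(_ == '\n') + 1
    // Column is the distance from the character after the last newline.
    val column = clamped - (before.lastIndexOf('\n') + 1) + 1
    (line, column)
  }
}
```

A semantic check could then call lineAndColumn(input, elem.start) on a CtxElem to build its message.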

Reading a two-dimensional array using scala

Suppose I have a txt file named "input.txt" and I want to use scala to read it in. The dimension of the file is not available in the beginning.
So, how can I construct such an Array[Array[Float]]? What I want is a simple and neat way, rather than writing Java-like code that iterates over the lines and parses each number. I think functional programming should be quite good at this, but I haven't been able to think of a way so far.
Best Regards
If your input is correct, you can do it in such way:
val source = io.Source.fromFile("input.txt")
val data = source.getLines().map(line => line.split(" ").map(_.toFloat)).toArray
source.close()
Update: for additional information about using Source check this thread
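If the input is less tidy (repeated spaces, tabs, or blank lines), a slightly more forgiving variant might look like the sketch below. MatrixReader and parseRows are hypothetical names, and the assumption is still that every remaining token is a valid number; toFloat will throw a NumberFormatException otherwise.

```scala
object MatrixReader {
  // Parses whitespace-separated numbers, one row per line.
  // Leading/trailing whitespace is trimmed and blank lines are skipped.
  def parseRows(lines: Iterator[String]): Array[Array[Float]] =
    lines
      .map(_.trim)
      .filter(_.nonEmpty)
      .map(_.split("\\s+").map(_.toFloat))
      .toArray
}
```

It takes the Iterator[String] returned by source.getLines(), so opening and closing the file stays the same as above.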

Variable in CDATA in Scala

Is there a way to put a variable to be expanded in a cdata section in scala
val reason = <reason><![CDATA[ {failedReason} ]]></reason>
It could be even simpler:
val reason = <reason>{scala.xml.PCData(failedReason)}</reason>
I am not sure if you can get that through native XML support, but you could do something like:
scala.xml.XML.loadString("<reason><![CDATA[%s]]></reason>".format(failedReason))
You lose some of the compile-time validations that way, but it should give you an XML element with the data you are looking for. Since it returns a scala.xml.Elem, you can also embed the result in a larger XML structure.
EDIT
After thinking about this a bit more, the following may be a better (and less fragile) way to do this. It restricts the free-text portion to only the CDATA, minimizing the potential for unbalanced expressions.
<reason>{ scala.xml.Unparsed("<![CDATA[%s]]>".format(failedReason)) }</reason>