Parsing very large xml lazily - scala

I have a huge XML file (40 GB). I would like to extract some fields from it without loading the entire file into memory. Any suggestions?

A quick example with XMLEventReader based on a tutorial for SAXParser here (as posted by Rinat Tainov).
I'm sure it can be done better but just to show basic usage:
import scala.io.Source
import scala.xml.pull._
object Main extends App {
  val xml = new XMLEventReader(Source.fromFile("test.xml"))
  // currNode lists the open elements, innermost first
  def printText(text: String, currNode: List[String]): Unit = {
    currNode match {
      case List("firstname", "staff", "company") => println("First Name: " + text)
      case List("lastname", "staff", "company") => println("Last Name: " + text)
      case List("nickname", "staff", "company") => println("Nick Name: " + text)
      case List("salary", "staff", "company") => println("Salary: " + text)
      case _ => ()
    }
  }
  def parse(xml: XMLEventReader): Unit = {
    // Pulls one event at a time and tracks the stack of open elements
    def loop(currNode: List[String]): Unit = {
      if (xml.hasNext) {
        xml.next match {
          case EvElemStart(_, label, _, _) =>
            println("Start element: " + label)
            loop(label :: currNode)
          case EvElemEnd(_, label) =>
            println("End element: " + label)
            loop(currNode.tail)
          case EvText(text) =>
            printText(text, currNode)
            loop(currNode)
          case _ => loop(currNode)
        }
      }
    }
    loop(List.empty)
  }
  parse(xml)
}

Use SAXParser; it does not load the entire XML into memory. Here is a good Java example that can easily be used in Scala.
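As a rough illustration of the SAX approach in Scala, here is a minimal sketch using the JDK's javax.xml.parsers API. The element name firstname is borrowed from the XMLEventReader example in this thread; the object name SaxSketch, the handler class and the inline sample document are made up for the demo. It collects the values into a list only so the result is easy to inspect; for a 40 GB file you would handle each value as it arrives instead of storing it.

```scala
import java.io.{ByteArrayInputStream, InputStream}
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.Attributes
import org.xml.sax.helpers.DefaultHandler
import scala.collection.mutable.ListBuffer

object SaxSketch {
  // Hypothetical handler: keeps only the text of <firstname> elements
  class FirstNameHandler extends DefaultHandler {
    val names = ListBuffer.empty[String]
    private val text = new java.lang.StringBuilder
    private var inFirstName = false

    override def startElement(uri: String, localName: String, qName: String, attrs: Attributes): Unit =
      if (qName == "firstname") { inFirstName = true; text.setLength(0) }

    // characters() may be called several times for one text node, so accumulate
    override def characters(ch: Array[Char], start: Int, length: Int): Unit =
      if (inFirstName) text.append(ch, start, length)

    override def endElement(uri: String, localName: String, qName: String): Unit =
      if (qName == "firstname") { names += text.toString; inFirstName = false }
  }

  // The parser streams through the input and calls the handler per event
  def firstNames(in: InputStream): List[String] = {
    val handler = new FirstNameHandler
    SAXParserFactory.newInstance().newSAXParser().parse(in, handler)
    handler.names.toList
  }

  // Small inline document standing in for the real file
  val sample = "<company><staff><firstname>Ann</firstname></staff><staff><firstname>Bob</firstname></staff></company>"
  val result: List[String] = firstNames(new ByteArrayInputStream(sample.getBytes("UTF-8")))
}
```

For a real file you would pass a java.io.FileInputStream (or a File) to parse instead of the in-memory sample.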

If you are happy looking at alternative xml libraries then Scales Xml provides three main pull parsing approaches:
Iterator based - simply use hasNext, next to get more items
iterate function - provides an Iterator but for trees identified by a simple path
Iteratee based - allows combinations of multiple paths
The focus of the upcoming 0.5 version is asynchronous parsing via aalto-xml, allowing for additional non-blocking control options.
In all cases you can control both memory usage and how the document is processed with Scales.

Related

Unpacking tuple directly into class in scala

Scala gives the ability to unpack a tuple into multiple local variables when performing various operations, for example if I have some data
val infos = Array(("Matt", "Awesome"), ("Matt's Brother", "Just OK"))
then instead of doing something ugly like
infos.map{ person_info => person_info._1 + " is " + person_info._2 }
I can choose the much more elegant
infos.map{ case (person, status) => person + " is " + status }
One thing I've often wondered about is how to directly unpack the tuple into, say, the arguments to be used in a class constructor. I'm imagining something like this:
case class PersonInfo(person: String, status: String)
infos.map{ case (p: PersonInfo) => p.person + " is " + p.status }
or even better if PersonInfo has methods:
infos.map{ case (p: PersonInfo) => p.verboseStatus() }
But of course this doesn't work. Apologies if this has already been asked -- I haven't been able to find a direct answer -- is there a way to do this?
I believe you can get to the methods, at least in Scala 2.11.x. Also, if you haven't heard of it, you should check out The Neophyte's Guide to Scala Part 1: Extractors.
The whole 16 part series is fantastic, but part 1 deals with case classes, pattern matching and extractors, which is what I think you are after.
Also, I get that java.lang.String complaint in IntelliJ as well, it defaults to that for reasons that are not entirely clear to me, I was able to work around it by explicitly setting the type in the typical "postfix style" i.e. _: String. There must be some way to work around that though.
object Demo {
case class Person(name: String, status: String) {
def verboseStatus() = s"The status of $name is $status"
}
val peeps = Array(("Matt", "Alive"), ("Soraya", "Dead"))
peeps.map {
case p @ (_: String, _: String) => Person.tupled(p).verboseStatus()
}
}
UPDATE:
So after seeing a few of the other answers, I was curious whether there were any performance differences between them. So I set up what I think is a reasonable test, using an Array of 1,000,000 random string tuples and running each implementation 100 times:
import scala.util.Random
object Demo extends App {
//Utility Code
def randomTuple(): (String, String) = {
val random = new Random
(random.nextString(5), random.nextString(5))
}
def timer[R](code: => R)(implicit runs: Int): Unit = {
var total = 0L
(1 to runs).foreach { i =>
val t0 = System.currentTimeMillis()
code
val t1 = System.currentTimeMillis()
total += (t1 - t0)
}
println(s"Time to perform code block ${total / runs}ms\n")
}
//Setup
case class Person(name: String, status: String) {
def verboseStatus() = s"$name is $status"
}
object PersonInfoU {
def unapply(x: (String, String)) = Some(Person(x._1, x._2))
}
val infos = Array.fill[(String, String)](1000000)(randomTuple)
//Timer
implicit val runs: Int = 100
println("Using two map operations")
timer {
infos.map(Person.tupled).map(_.verboseStatus)
}
println("Pattern matching and calling tupled")
timer {
infos.map {
case p @ (_: String, _: String) => Person.tupled(p).verboseStatus()
}
}
println("Another pattern matching without tupled")
timer {
infos.map {
case (name, status) => Person(name, status).verboseStatus()
}
}
println("Using unapply in a companion object that takes a tuple parameter")
timer {
infos.map { case PersonInfoU(p) => p.name + " is " + p.status }
}
}
/*Results
Using two map operations
Time to perform code block 208ms
Pattern matching and calling tupled
Time to perform code block 130ms
Another pattern matching without tupled
Time to perform code block 130ms
WINNER
Using unapply in a companion object that takes a tuple parameter
Time to perform code block 69ms
*/
Assuming my test is sound, it seems the unapply in a companion-ish object was ~2x faster than the pattern matching, and pattern matching another ~1.5x faster than two maps. Each implementation probably has its use cases/limitations.
I'd appreciate if anyone sees anything glaringly dumb in my testing strategy to let me know about it (and sorry about that var). Thanks!
The extractor for a case class takes an instance of the case class and returns a tuple of its fields. You can write an extractor which does the opposite:
object PersonInfoU {
def unapply(x: (String, String)) = Some(PersonInfo(x._1, x._2))
}
infos.map { case PersonInfoU(p) => p.person + " is " + p.status }
You can use tupled on the case class:
val infos = Array(("Matt", "Awesome"), ("Matt's Brother", "Just OK"))
infos.map(PersonInfo.tupled)
scala> infos: Array[(String, String)] = Array((Matt,Awesome), (Matt's Brother,Just OK))
scala> res1: Array[PersonInfo] = Array(PersonInfo(Matt,Awesome), PersonInfo(Matt's Brother,Just OK))
and then you can use PersonInfo however you need.
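To make the tupled approach concrete, here is a small self-contained sketch that goes from tuples to case class instances and then calls a method on them. The verboseStatus method follows the question; the object name TupledSketch is made up. The explicit (PersonInfo.apply _).tupled form is used here as an assumption-free spelling that does not depend on what the companion object extends.

```scala
case class PersonInfo(person: String, status: String) {
  def verboseStatus(): String = person + " is " + status
}

object TupledSketch {
  val infos = Array(("Matt", "Awesome"), ("Matt's Brother", "Just OK"))

  // Turns the two-argument constructor into a function that
  // accepts a single (String, String) tuple
  val toPerson: ((String, String)) => PersonInfo = (PersonInfo.apply _).tupled

  // First build the instances, then use any method defined on them
  val statuses: List[String] = infos.map(toPerson).map(_.verboseStatus()).toList
}
```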
You mean like this (scala 2.11.8):
scala> :paste
// Entering paste mode (ctrl-D to finish)
case class PersonInfo(p: String)
Seq(PersonInfo("foo")) map {
case p @ PersonInfo(info) => s"info=$info / ${p.p}"
}
// Exiting paste mode, now interpreting.
defined class PersonInfo
res4: Seq[String] = List(info=foo / foo)
Methods won't be possible by the way.
Several answers can be combined to produce a final, unified approach:
val infos = Array(("Matt", "Awesome"), ("Matt's Brother", "Just OK"))
object Person{
case class Info(name: String, status: String){
def verboseStatus() = name + " is " + status
}
def unapply(x: (String, String)) = Some(Info(x._1, x._2))
}
infos.map{ case Person(p) => p.verboseStatus }
Of course in this small case it's overkill, but for more complex use cases this is the basic skeleton.

How can I convert this non-functional Scala code with immutable members to an elegant solution?

How can I avoid the mutable index and make this more elegant? I know null has to be replaced with Option; I am just curious about the answers.
class Person(val name: String, val department: String)
var people = Array(new Person("Jones", "Marketing"), new Person("Smith", "Engineering"))
var engineer: Person = null
var index = 0
while (index < people.length) {
  if (people(index).department == "Engineering")
    engineer = people(index)
  index = index + 1
}
println(engineer.name + " is an engineer")
class Person(val name: String, val department: String)
val people = Array(new Person("Jones", "Marketing"), new Person("Smith", "Engineering"))
// Option[ Person ]... None if no Person satisfy this condition... Some( p ), if Person p is the first Person to satisfy the condition.
val personOption = people.find( p => p.department == "Engineering" )
personOption match {
case Some( p ) => println( "Found one engineer - " + p.name )
case None => println( "No engineer" )
}
If you want to find the last engineer in the array, you would probably use:
case class Person(name: String, department: String)
val people = Array(Person("Jones", "Marketing"), Person("Smith", "Engineering"))
// Folds over the argument l; each later engineer replaces the earlier one
def findLastEngineer(l: Seq[Person]): Option[Person] =
  l.foldLeft(Option.empty[Person]) {
    case (previousOpt, eng) => if (eng.department == "Engineering") Some(eng) else previousOpt
  }
println(findLastEngineer(people).map(_.name).getOrElse("Not found"))
I would do something like this:
case class Person(name: String, department: String)
val people = List(Person("Jones", "Marketing"), Person("Smith", "Engineering"))
val engineers = people.filter { person: Person => person.department == "Engineering" }
engineers.foreach { engineer: Person => println(engineer.name + " is an engineer") }
Try to use functions to transform your types into others. Usually we use map/reduce/filter functions to do this.
Here's how I would refactor:
// Refactored into a case class, since it's a simple data container
case class Person(name: String, department: String)
// Using the case class convenience apply method to drop `new`
val people = Array(Person("Jones", "Marketing"), Person("Smith", "Engineering"))
// Selects all the engineers. You could add `.headOption` to get the first.
val engineers = people.filter(_.department == "Engineering")
// Functional way of iterating the collection of engineers
// Also, using string interpolation to print
for (engineer <- engineers) println(s"${engineer.name} is an engineer.")
Alternatively, you could use collect to filter and pick the name:
// collect is kind of like a handy filter + map
val engineerNames = people.collect {
case Person(name, "Engineering") => name
}
for (name <- engineerNames) println(s"$name is an engineer.")
One last tip, if your departments are some finite set of fixed options, you should probably also consider making it a type:
sealed trait Department
case object Engineering extends Department
case object Marketing extends Department
// ... for each valid department
And then you can match on identity, rather than value. This lets you rely on the type system instead of constantly having to validate strings (known to some as stringly-typed programming). Best practice is to validate your data as early as possible into types, deal with it as typed data, and then only convert back to string for exporting data back out of your system (e.g. printing to screen, logging, serving via API).
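A minimal sketch of what validating early into types could look like, assuming the two departments above are the only valid ones. The Department.parse helper, the DepartmentSketch object and the sample rows are hypothetical names introduced for this illustration.

```scala
sealed trait Department
case object Engineering extends Department
case object Marketing extends Department

object Department {
  // Hypothetical helper: validate the raw string once, at the boundary
  def parse(raw: String): Option[Department] = raw.trim.toLowerCase match {
    case "engineering" => Some(Engineering)
    case "marketing"   => Some(Marketing)
    case _             => None
  }
}

case class Person(name: String, department: Department)

object DepartmentSketch {
  val raw = List(("Jones", "Marketing"), ("Smith", "Engineering"), ("Doe", "Unknown"))

  // Rows whose department fails validation are dropped here
  val people: List[Person] =
    raw.flatMap { case (name, dept) => Department.parse(dept).map(d => Person(name, d)) }

  // From here on the match is on the case object, not on a string
  val engineerNames: List[String] = people.collect { case Person(name, Engineering) => name }
}
```

Because Department is sealed, the compiler can also warn about a match that forgets one of the cases, which a string comparison cannot do.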
You can use find to find first:
people
.find { _.department == "Engineering" }
.foreach { engineer => println(engineer.name + " is an engineer") }
or filter to find all:
people
.filter { _.department == "Engineering" }
.foreach { engineer => println(engineer.name + " is an engineer") }
By the way you can fix your code just by moving increment operation outside the if block:
if (people(index).department == "Engineering") {
engineer = people(index)
// index = index + 1
}
index = index + 1
After that you should check engineer for null, because your array may not contain a Person matching your condition.
It looks like you want to find the last matching Person, so you can use:
people
.foldLeft(None: Option[Person])((r, p) =>
if (p.department == "Engineering") Some(p) else r)
.foreach { engineer => println(engineer.name + " is an engineer") }
After removing all the vars, you can also change your Array (which is a mutable structure) to a List (scala.collection.immutable.List by default).

Pattern matching syntax in Scala/Unfiltered

I'm new to Scala and trying to understand the syntax of the pattern matching constructs, specifically from examples in Unfiltered (http://unfiltered.databinder.net/Try+Unfiltered.html).
Here's a simple HTTP server that echoes back Hello World! and the 2 parts of the path if the path is 2 parts long:
package com.hello
import unfiltered.request.GET
import unfiltered.request.Path
import unfiltered.request.Seg
import unfiltered.response.ResponseString
object HelloWorld {
val sayhello = unfiltered.netty.cycle.Planify {
case GET(Path(Seg(p :: q :: Nil))) => {
ResponseString("Hello World! " + p + " " + q);
}
};
def main(args: Array[String]) {
unfiltered.netty.Http(10000).plan(sayhello).run();
}
}
Also for reference the source code for the Path, Seg, and GET/Method objects:
package unfiltered.request
object Path {
def unapply[T](req: HttpRequest[T]) = Some(req.uri.split('?')(0))
def apply[T](req: HttpRequest[T]) = req.uri.split('?')(0)
}
object Seg {
def unapply(path: String): Option[List[String]] = path.split("/").toList match {
case "" :: rest => Some(rest) // skip a leading slash
case all => Some(all)
}
}
class Method(method: String) {
def unapply[T](req: HttpRequest[T]) =
if (req.method.equalsIgnoreCase(method)) Some(req)
else None
}
object GET extends Method("GET")
I was able to break down how most of it works, but this line leaves me baffled:
case GET(Path(Seg(p :: q :: Nil))) => {
I understand the purpose of the code, but not how it gets applied. I'm very interested in learning the ins and outs of Scala rather than simply implementing an HTTP server with it, so I've been digging into this for a couple of hours. I understand that it has something to do with extractors and the unapply method on the GET, Path, and Seg objects. I also know that when I debug it, it hits unapply in GET before Path and Path before Seg.
I don't understand the following things:
Why can't I write GET.unapply(req), but I can write GET(req) or GET() and it will match any HTTP GET?
Why or how does the compiler know what values get passed to each extractor's unapply method? It seems that it will just chain them together unless one of them returns a None instead of an Some?
How does it bind the variables p and q? It knows they are Strings, it must infer that from the return type of Seg.unapply, but I don't understand the mechanism that assigns p the value of the first part of the list and q the value of the second part of the list.
Is there a way to rewrite it to make it more clear what's happening? When I first looked at this example, I was confused by the line val sayhello = unfiltered.netty.cycle.Planify {. I dug around, rewrote it, and found out that it was implicitly creating a PartialFunction and passing it to Planify.apply.
One way to understand it is to rewrite this expression the way that it gets rewritten by the Scala compiler.
unfiltered.netty.cycle.Planify expects a PartialFunction[HttpRequest[ReceivedMessage], ResponseFunction[NHttpResponse]], that is, a function that may or may not match the argument. If there's no match in any of the case clauses, the request gets ignored. If there is a match -- which also has to pass all of the extractors -- the response will be returned.
Each case clause receives an instance of HttpRequest[ReceivedMessage] and passes it through the unapply methods of the nested extractors, from the outermost (GET) inward:
// The request passed to us is HttpRequest[ReceivedMessage]
// GET.unapply only returns Some if the method is GET
GET.unapply(request) flatMap { getRequest =>
// this separates the path from the query
Path.unapply(getRequest) flatMap { path =>
// splits the path by "/"
Seg.unapply(path) flatMap { listOfParams =>
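// Calls to unapply don't end here. In a pattern, a :: b is the same
// as ::(a, b), so the list is taken apart by the :: (cons) class.
// Simplified: the real match also checks that the list is non-empty
// and that nothing follows q.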
::.unapply(::(listOfParams.head, listOfParams.tail)) flatMap { case (p, restOfP) =>
::.unapply(::(restOfP.head, Nil)) map { case (q, _) =>
ResponseString("Hello World! " + p + " " + q)
}
}
}
}
}
Hopefully, this gives you an idea of how the matching works behind the scenes. I'm not entirely sure if I got the :: bit right - comments are welcome.
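The chaining can also be tried outside Unfiltered. The sketch below is an approximation under stated assumptions: Req is a made-up stand-in for HttpRequest, and GET, Path and Seg are simplified copies of the extractors quoted in the question, not Unfiltered's real classes. It puts the nested pattern next to a hand-written version in which each unapply result feeds the next extractor and any None aborts the match.

```scala
// Stand-in for Unfiltered's HttpRequest; an assumption for illustration only
case class Req(method: String, uri: String)

class Method(method: String) {
  def unapply(req: Req): Option[Req] =
    if (req.method.equalsIgnoreCase(method)) Some(req) else None
}
object GET extends Method("GET")

object Path {
  def unapply(req: Req): Option[String] = Some(req.uri.split('?')(0))
}

object Seg {
  def unapply(path: String): Option[List[String]] = path.split("/").toList match {
    case "" :: rest => Some(rest) // skip a leading slash
    case all        => Some(all)
  }
}

object ExtractorSketch {
  // The nested pattern, as written in the question
  def nested(req: Req): Option[String] = req match {
    case GET(Path(Seg(p :: q :: Nil))) => Some(p + " " + q)
    case _                             => None
  }

  // Roughly the same logic by hand: outermost extractor first
  def byHand(req: Req): Option[String] =
    GET.unapply(req).flatMap { r =>
      Path.unapply(r).flatMap { path =>
        Seg.unapply(path).flatMap {
          case p :: q :: Nil => Some(p + " " + q) // exactly two segments
          case _             => None
        }
      }
    }
}
```

Note that GET.unapply(req) is an ordinary method call here; the pattern syntax GET(...) is just the compiler's way of invoking it and binding whatever it returns.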

How do I parse DBObject to case class object using subset2?

Does anyone know how to parse a DBObject into a case class object using subset2? The super concise documentation doesn't help me :(
Consider following case class
case class MenuItem(id : Int, name: String, desc: Option[String], prices: Option[Array[String]], subitems: Option[Array[MenuItem]])
object MenuItem {
implicit val asBson = BsonWritable[MenuItem](item =>
{
val buf: DBObjectBuffer = DBO("id" -> item.id, "name" -> item.name)
item.desc match { case Some(value) => buf.append("desc" -> value) case None => }
item.prices match { case Some(value) => buf.append("prices" -> value) case None => }
item.subitems match { case Some(value) => buf.append("subitems" -> value) case None => }
buf()
}
)
}
and I wrote this parser
val menuItemParser: DocParser[MenuItem] = int("id") ~ str("name") ~ str("desc").opt ~ get[Array[String]]("prices").opt ~ get[Array[MenuItem]]("subitems").opt map {
case id ~ name ~ desc_opt ~ prices_opt ~ subitems => {
MenuItem(id, name, desc_opt, prices_opt, subitems)
}
}
It works if I remove the last field, subitems. But the version shown above doesn't compile, because MenuItem has a field that references itself. It gives me the following error:
Cannot find Field for Array[com.borsch.model.MenuItem]
val menuItemParser: DocParser[MenuItem] = int("id") ~ str("name") ~ str("desc").opt ~ get[Array[String]]("prices").opt ~ get[Array[MenuItem]]("subitems").opt map {
^
It obviously doesn't compile because the last get wants an implicit Field[MenuItem]. But if I define one for MenuItem, wouldn't it be pretty much a copy-paste of DocParser[MenuItem]?
How would you do it elegantly?
I am an author of Subset (both 1.x and 2).
The README states that you need a Field[T] for every T you would like to read (see the "deserialization" section).
Just a side note: frankly, I don't find it very logical to name a deserializer for MenuItem jodaDateTime.
Anyway, Field[T] must translate from vanilla BSON types to your T. BSON cannot store MenuItem natively; see the native BSON types here.
But the main problem is that you have a recursive data structure, so your "serializer" (BsonWritable) and "deserializer" (Field) must be recursive as well. Subset has implicit serializers/deserializers for List[T], but they require you to provide those for MenuItem, hence the recursion.
To keep things short, I'll demonstrate how you would write something like that for a simpler case class.
Suppose we have
case class Rec(id: Int, children: Option[List[Rec]])
Then the writer may look like
object Rec {
implicit object asBson extends BsonWritable[Rec] {
override def apply(rec: Rec) =
Some( DBO("id" -> rec.id, "children" -> rec.children)() )
}
Here, when you write rec.children into the DBObject, BsonWritable[Rec] is used, and it in turn requires an implicit Field[Rec]. So this serializer is recursive.
As for the deserializer, the following will do
import DocParser._
implicit lazy val recField = Field({ case Doc(rec) => rec })
lazy val Doc: DocParser[Rec] =
get[Int]("id") ~ get[List[Rec]]("children").opt map {
case id ~ children => new Rec(id, children)
}
}
These are mutually recursive (remember to use lazy val!)
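The reason lazy val matters can be shown without Subset at all. The sketch below uses made-up stand-ins (a Decoder trait and plain Map documents), not Subset's API: two decoders refer to each other, and lazy val lets each one be initialized on first use regardless of declaration order.

```scala
case class Rec(id: Int, children: Option[List[Rec]])

// Hypothetical stand-in for a deserializer; not part of Subset
trait Decoder[A] { def decode(raw: Any): Option[A] }

object RecDecoders {
  // Decodes a list by decoding every element; fails if any element fails
  def listOf[A](elem: Decoder[A]): Decoder[List[A]] = new Decoder[List[A]] {
    def decode(raw: Any): Option[List[A]] = raw match {
      case xs: List[_] =>
        val decoded = xs.map(x => elem.decode(x))
        if (decoded.forall(_.isDefined)) Some(decoded.flatten) else None
      case _ => None
    }
  }

  // children is declared before rec. With plain vals it would capture a
  // not-yet-initialized (null) rec; lazy val forces rec on first use.
  lazy val children: Decoder[List[Rec]] = listOf(rec)

  lazy val rec: Decoder[Rec] = new Decoder[Rec] {
    def decode(raw: Any): Option[Rec] = raw match {
      case doc: Map[_, _] =>
        val m = doc.asInstanceOf[Map[String, Any]]
        // A missing or malformed "children" entry simply becomes None
        m.get("id").collect { case id: Int =>
          Rec(id, m.get("children").flatMap(c => children.decode(c)))
        }
      case _ => None
    }
  }
}
```

Decoding Map("id" -> 123, "children" -> List(Map("id" -> 234))) with RecDecoders.rec walks the nested document through both decoders.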
You would use them like so:
val dbo = DBO("y" -> Rec(123, Some(Rec(234, None) :: Nil))) ()
val Y = DocParser.get[Rec]("y")
dbo match {
case Y(doc) => doc
}

Pattern match in foreach and then do a final step

Is it possible to do anything after a pattern match in a foreach statement?
I want to do a post match step e.g. to set a variable. I also want to force a Unit return as my foreach is String => Unit, and by default Scala wants to return the last statement.
Here is some code:
Iteratee.foreach[String](_ match {
case "date" => out.push("Current date: " + new Date().toString + "<br/>")
case "since" => out.push("Last command executed: " + (ctm - last) + "ms before now<br/>")
case unknown => out.push("Command: " + unknown + " not recognized <br/>")
} // here I would like to set "last = ctm" (will be a Long)
)
UPDATED:
New code and context. Also new questions added :) They are embedded in the comments.
def socket = WebSocket.using[String] { request =>
// Comment from an answer below, but what are the side effects?
// By convention, methods with side effects take an empty argument list
def ctm(): Long = System.currentTimeMillis
var last: Long = ctm
// Command handlers
// Comment from an answer below, but what are the side effects?
// By convention, methods with side effects take an empty argument list
def date() = "Current date: " + new Date().toString + "<br/>"
def since(last: Long) = "Last command executed: " + (ctm - last) + "ms before now<br/>"
def unknown(cmd: String) = "Command: " + cmd + " not recognized <br/>"
val out = Enumerator.imperative[String] {}
// How to transform into the mapping strategy given in lpaul7's nice answer.
lazy val in = Iteratee.foreach[String](_ match {
case "date" => out.push(date)
case "since" => out.push(since(last))
case cmd => out.push(unknown(cmd))
} // Here I want to update the variable last to "last = ctm"
).mapDone { _ =>
println("Disconnected")
}
(in, out)
}
I don't know what your ctm is, but you could always do this:
val xs = List("date", "since", "other1", "other2")
xs.foreach { str =>
str match {
case "date" => println("Match Date")
case "since" => println("Match Since")
case unknown => println("Others")
}
println("Put your post step here")
}
Note that you should use {} instead of () when you want to use a block of code as the argument of foreach().
I will not answer your question, but I should note that reassigning variables in Scala is bad practice. I suggest you rewrite your code to avoid vars.
First, transform your strings to something else:
val strings = it map {
case "date" => "Current date: " + new Date().toString + "<br/>"
case "since" => "Last command executed: " + (ctm - last) + "ms before now<br/>"
case unknown => "Command: " + unknown + " not recognized <br/>"
}
Next, push them:
strings foreach { out.push(_) }
It looks like your implementation of push has side effects. That is bad for you, because such methods make your program unpredictable. You can easily avoid side effects by making push return a tuple:
def push(s: String) = {
...
(ctm, last)
}
And using it like:
val (ctm, last) = out.push(str)
Update:
Of course, side effects are needed to make programs useful. I only meant that methods depending on outer variables are less predictable than pure ones and harder to reason about. Methods without side effects are also easier to test.
Yes, you should prefer vals over vars, it makes your program more "functional" and stateless. Stateless algorithms are thread safe and very predictable.
It seems like your program is stateful by nature. At least, try to stay as "functional" and stateless as you can :)
My suggested solution to your problem is:
// By convention, methods with side effects take an empty argument list
def ctm(): Long = System.currentTimeMillis()
// Command handlers
def date() = "Current date: " + new Date().toString + "<br/>"
def since(last: Long) = "Last command executed: " + (ctm() - last) + "ms before now<br/>"
def unknown(cmd: String) = "Command: " + cmd + " not recognized <br/>"
// In your cmd processing loop
// First, map inputs to responses
val cmds = inps map {
case "date" => date()
case "since" => since(last)
case unk => unknown(unk)
}
// Then push responses and update state
cmds foreach { response =>
out.push(response)
// It is a good place to update your state
last = ctm()
}
It is hard to test this without the context of your code, so you will need to adapt it to your needs. I hope I've answered your question.
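If you want to drop the var entirely, one option is to thread the state through a fold. This is only a sketch under assumptions: the State class, step and run are hypothetical names, the timestamp is passed in with each command instead of being read from the clock (so the function stays pure and testable), and the date command is left out because its output is not deterministic.

```scala
object StatelessSketch {
  // State carried between commands: time of the previous command
  // and the responses produced so far
  case class State(last: Long, responses: Vector[String])

  // input pairs a command with the time at which it arrived
  def step(state: State, input: (String, Long)): State = {
    val (cmd, now) = input
    val response = cmd match {
      case "since" => "Last command executed: " + (now - state.last) + "ms before now<br/>"
      case other   => "Command: " + other + " not recognized <br/>"
    }
    // Returning a new State replaces the assignment last = ctm()
    State(last = now, responses = state.responses :+ response)
  }

  def run(start: Long, inputs: List[(String, Long)]): State =
    inputs.foldLeft(State(start, Vector.empty))(step)
}
```

In the WebSocket handler the responses would still have to be pushed to out at some point; the fold only isolates the part of the logic that decides what to push and what the next value of last is.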