scala-xml child method from an xml node gets trailing white space - scala

I am working on Windows and I have to parse XML from a file.
The issue is that when I parse the root element and get its children via the child method, I get whitespace-only children.
XML.load("my_path\\sof.xml").child
res0: Seq[scala.xml.Node] = List(
, <b/>,
)
This is my xml file
sof.xml
<a>
<b></b>
</a>
But when I remove every \n and \r of the file like this :
sof.xml
<a><b></b></a>
I get the following result, which is what I expect:
res0: Seq[scala.xml.Node] = List(<b/>)
My question is: is there an option to read it correctly from the indented form?

The issue is the newlines/whitespace are treated as Text nodes. The scala.xml.Utility.trim(x: Node) method will remove the unnecessary whitespace:
scala> val a = XML.loadString("""<a>
| <b></b>
| </a>""")
a: scala.xml.Elem =
<a>
<b/>
</a>
scala> scala.xml.Utility.trim(a)
res0: scala.xml.Node = <a><b/></a>
Note that this differs from the .collect method if you have actual Text nodes in between elements, e.g.:
scala> val a = XML.loadString("""<a>
| <b>Test </b> Foo
| </a>""")
a: scala.xml.Elem =
<a>
<b>Test </b> Foo
</a>
scala> scala.xml.Utility.trim(a).child
res0: Seq[scala.xml.Node] = List(<b>Test</b>, Foo)
scala> a.child.collect { case e: scala.xml.Elem => e }
res1: Seq[scala.xml.Elem] = List(<b>Test </b>)
With the .collect method, the "Foo" text node is excluded from the list of children.
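To make the difference concrete, here is a minimal sketch that applies both approaches to the same document. It assumes the scala-xml module is on the classpath; the helper name elementChildren is only illustrative and not part of the library.

```scala
import scala.xml.{Elem, Node, Utility, XML}

// Hypothetical helper: keep only element children, dropping every Text node.
def elementChildren(node: Node): Seq[Elem] =
  node.child.collect { case e: Elem => e }

val doc = XML.loadString("<a>\n  <b>Test </b> Foo\n</a>")

// trim removes whitespace-only text but keeps the meaningful "Foo" text node.
val trimmed = Utility.trim(doc).child

// collect keeps only the <b> element and leaves its inner text untouched.
val elems = elementChildren(doc)
```

Which one to pick depends on whether mixed content (text between elements) carries meaning in your documents.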

I checked that with this on Mac:
XML.loadString("""<a>
| <b></b>
|</a>""").child
This shows the same behavior, which I don't understand either.
However, this will fix it in your code:
XML.loadString("""<a>
| <b></b>
|</a>""").child
.collect { case e: scala.xml.Elem => e }
This drops the scala.xml.Text nodes.

Related

Reading empty files with scala

I wrote the following function in scala which reads a file into a list of strings. My aim is to make sure that if the file input is empty that the returned list is empty too. Any idea how to do this in an elegant way:
def linesFromFile(file: String): List[String] = {
def materialize(buffer: BufferedReader): List[String] = materializeReverse(buffer, Nil).reverse
def materializeReverse(buffer: BufferedReader, accumulator: List[String]): List[String] = {
buffer.readLine match {
case null => accumulator
case line => materializeReverse(buffer, line :: accumulator)
}
}
val buffer = new BufferedReader(new FileReader(file))
materialize(buffer)
}
Your code should work, but it's rather inefficient in memory usage: you read the entire file into memory, then spend more memory and processing putting the lines in the right order.
Using the Source.fromFile method in the standard library (which also supports various file encodings) is your best bet, as noted in other comments/answers.
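A minimal sketch of that approach, assuming Scala 2.13 or later (for scala.util.Using) and a UTF-8 encoded file; the function name simply mirrors the one in the question:

```scala
import scala.io.Source
import scala.util.Using

// Reads all lines eagerly and closes the file afterwards.
// An empty file yields an empty list.
def linesFromFile(file: String): List[String] =
  Using.resource(Source.fromFile(file, "UTF-8"))(_.getLines().toList)
```

Because the lines are materialized before the source is closed, the returned list is safe to use after the call.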
However, if you must roll your own, I think using a Stream (a lazy form of list) makes more sense than a List. You can return each line one at a time, and can terminate the stream when the end of the file is reached. This can be done as follows:
import java.io.{BufferedReader, FileReader}
def linesFromFile(file: String): Stream[String] = {
// The value of buffer is available to the following helper function. No need to pass as
// an argument.
val buffer = new BufferedReader(new FileReader(file))
// Helper: retrieve next line from file. Called only when next value requested.
def materialize: Stream[String] = {
// Uncomment to demonstrate non-recursive nature of this method.
//println("Materialize called!")
// Read the next line and wrap in an option. This avoids the hated null.
Option(buffer.readLine) match {
// If we've seen the end of the file, return an empty stream. We're done reading.
case None => {
buffer.close()
Stream.empty
}
// Otherwise, prepend the line read to another call to this helper.
case Some(line) => line #:: materialize
}
}
// Start the process.
materialize
}
Although it looks like materialize is recursive, in fact it is only called when another value needs to be retrieved, so you do not need to worry about stack overflows or recursion. You can verify this by uncommenting the println call.
For example (in a Scala REPL session):
$ scala
Welcome to Scala 2.12.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_171).
Type in expressions for evaluation. Or try :help.
scala> import java.io.{BufferedReader, FileReader}
import java.io.{BufferedReader, FileReader}
scala> def linesFromFile(file: String): Stream[String] = {
|
| // The value of buffer is available to the following helper function. No need to pass as
| // an argument.
| val buffer = new BufferedReader(new FileReader(file))
|
| // Helper: retrieve next line from file. Called only when next value requested.
| def materialize: Stream[String] = {
|
| // Uncomment to demonstrate non-recursive nature of this method.
| println("Materialize called!")
|
| // Read the next line and wrap in an option. This avoids the hated null.
| Option(buffer.readLine) match {
|
| // If we've seen the end of the file, return an empty stream. We're done reading.
| case None => {
| buffer.close()
| Stream.empty
| }
|
| // Otherwise, prepend the line read to another call to this helper.
| case Some(line) => line #:: materialize
| }
| }
|
| // Start the process.
| materialize
| }
linesFromFile: (file: String)Stream[String]
scala> val stream = linesFromFile("TestFile.txt")
Materialize called!
stream: Stream[String] = Stream(Line 1, ?)
scala> stream.head
res0: String = Line 1
scala> stream.tail.head
Materialize called!
res1: String = Line 2
scala> stream.tail.head
res2: String = Line 2
scala> stream.foreach(println)
Line 1
Line 2
Materialize called!
Line 3
Materialize called!
Line 4
Materialize called!
Note how materialize is only called when we attempt to read another line from the file. Furthermore, it is not called if we've already retrieved a line (for example, both Line 1 and Line 2 in the output are only preceded by Materialize called! when first referenced).
To your point about empty files, in this case, an empty stream is returned:
scala> val empty = linesFromFile("EmptyFile.txt")
Materialize called!
empty: Stream[String] = Stream()
scala> empty.isEmpty
res3: Boolean = true
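On Scala 2.13 and later, Stream is deprecated in favour of LazyList. The following is a sketch of the same idea under that assumption; the name lazyLinesFromFile is made up for illustration.

```scala
import java.io.{BufferedReader, FileReader}

// Same technique as above, using LazyList instead of the deprecated Stream.
def lazyLinesFromFile(file: String): LazyList[String] = {
  val buffer = new BufferedReader(new FileReader(file))

  // Reads one line per call; the tail is only evaluated on demand.
  def materialize: LazyList[String] =
    Option(buffer.readLine) match {
      case None =>
        buffer.close()
        LazyList.empty
      case Some(line) => line #:: materialize
    }

  materialize
}
```

As with the Stream version, the reader is only closed once the end of the file has been reached, so a partially consumed list leaves the file open.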
If by "empty" you mean that the file contains absolutely nothing, your aim is already attained.
If, however, you meant "contains only whitespace", you can filter out whitespace-only lines from your final list by modifying materializeReverse:
def materializeReverse(buffer: BufferedReader, accumulator: List[String]): List[String] = {
buffer.readLine match {
case null => accumulator
case line if line.trim.isEmpty => materializeReverse(buffer, accumulator)
case line => materializeReverse(buffer, line :: accumulator)
}
}
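If the lines are already in memory, the same filtering can be sketched as a one-liner on the list. The helper name nonBlank is hypothetical.

```scala
// Drops lines that are empty or contain only whitespace; order is preserved.
def nonBlank(lines: List[String]): List[String] =
  lines.filter(_.trim.nonEmpty)
```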
You can try this method
val result = Source.fromFile("C:\\Users\\1.txt").getLines.toList
Hope that helps. Please ask if you need any further clarification.

Functionally transform this String into a List of objects

I have a String in this csv format :
//> lines : String = a1 , 2 , 10
//| a2 , 2 , 5
//| a3 , 8 , 4
//| a4 , 5 , 8
//| a5 , 7 , 5
//| a6 , 6 , 4
//| a8 , 4 , 9
I would like to convert this String into a List of objects where each line of the String represents a new entry in the object List. I can think how to do this imperatively -
Divide the String into multiple lines and split every line into its csv tokens. Loop over each line and for each line create a new object and add it to a List. But I'm trying to think about this functionally and I'm not sure where to start. Any pointers please ?
Let's assume you're starting with an iterator producing one String for each line. The Source class can do this if you're loading from a file, or you can use val lines = input.split("\n") if you already have everything in a single String.
This also works with List, Seq, etc. Iterator isn't a pre-requisite.
So you map over the input to parse each line
val lines = input split "\n"
val output = lines map { line => parse(line) }
or (in point-free style)
val output = lines map parse
All you need is the parse method, and a type that lines should be parsed to. Case classes are a good bet here:
case class Line(id: String, num1: Int, num2: Int)
Now for parse. I'm going to wrap the results in a Try so you can capture errors:
import scala.util.Try
def parse(line: String): Try[Line] = Try {
  // split on commas and trim whitespace from each token
  line.split(",").map(_.trim) match {
    // String.split produces an Array, so we pattern-match on an Array of 3 elems
    case Array(id, n1, n2) =>
      // if toInt fails it'll throw an Exception to be wrapped in the Try
      Line(id, n1.toInt, n2.toInt)
    case _ => throw new RuntimeException("Invalid line: " + line)
  }
}
Put it all together and you end up with output being a CC[Try[Line]], where CC is the collection-type of lines (e.g. Iterator, Seq, etc.)
You can then isolate the errors:
val (goodLines, badLines) = output.partition(_.isSuccess)
Or if you simply want to strip out the intermediate Trys and discard the errors:
val goodLines: Seq[Line] = output collect { case Success(line) => line }
ALL TOGETHER
import scala.util.{Success, Try}
case class Line(id: String, num1: Int, num2: Int)
def parse(line: String): Try[Line] = Try {
  // trim each token so "a1 , 2 , 10" parses cleanly
  line.split(",").map(_.trim) match {
    case Array(id, n1, n2) => Line(id, n1.toInt, n2.toInt)
    case _ => throw new RuntimeException("Invalid line: " + line)
  }
}
val lines = input split "\n"
val output = lines map parse
val (successes, failures) = output.partition(_.isSuccess)
val goodLines = successes collect { case Success(line) => line }
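For a quick check, here is a self-contained sketch of the same pipeline run on made-up sample data. The input string and the errors value are assumptions added for illustration; the last two rows are deliberately malformed to show how failures are separated.

```scala
import scala.util.{Failure, Success, Try}

case class Line(id: String, num1: Int, num2: Int)

def parse(line: String): Try[Line] = Try {
  line.split(",").map(_.trim) match {
    case Array(id, n1, n2) => Line(id, n1.toInt, n2.toInt)
    case _ => throw new RuntimeException("Invalid line: " + line)
  }
}

// Hypothetical sample input: two valid rows, then two malformed ones.
val input = "a1 , 2 , 10\na2 , 2 , 5\nbroken\na3 , x , 4"

val output = input.split("\n").toList.map(parse)

// Keep the parsed rows and, separately, the error messages.
val goodLines = output.collect { case Success(line) => line }
val errors = output.collect { case Failure(e) => e.getMessage }
```

Collecting the failure messages instead of discarding them makes it easier to report which rows were rejected.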
Not sure if this is the exact output you want since there wasn't a sample output provided. Should be able to get what you want from this though.
scala> val lines: String = """a1,2,10
| a2,2,5
| a3,8,4
| a4,5,8
| a5,7,5
| a6,6,4
| a8,4,9"""
lines: String =
a1,2,10
a2,2,5
a3,8,4
a4,5,8
a5,7,5
a6,6,4
a8,4,9
scala> case class Line(s: String, s2: String, s3: String)
defined class Line
scala> lines.split("\n").map(line => line.split(",")).map(split => Line(split(0), split(1), split(2)))
res0: Array[Line] = Array(Line(a1,2,10), Line(a2,2,5), Line(a3,8,4), Line(a4,5,8), Line(a5,7,5), Line(a6,6,4), Line(a8,4,9))

How to handle nil in scala XML parsing?

I have an XML document representing my model that I need to parse and save in a database. Some fields may have NULL values, indicated by xsi:nil, like so:
<quantity xsi:nil="true"/>
For parsing I use scala.xml DSL. The problem is I can't find any way of determining if something is nil or not. This: (elem \ "quantity") just returns an empty string which then blows up when I try to convert it to number. Also wrapping that with Option doesn't help.
Is there any way to get None, Nil or even null from that XML piece?
In this case, you can pass the namespace URI to the attribute method to get the value of the "xsi:nil" attribute.
Here is a working example:
scala> val xml = <quantity xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:nil="true"/>
xml: scala.xml.Elem = <quantity xsi:nil="true" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"></quantity>
scala> xml.attribute("http://www.w3.org/2001/XMLSchema-instance", "nil")
res0: Option[Seq[scala.xml.Node]] = Some(true)
If you treat an empty node as None, you don't even need to look at the attribute. Just filter out nodes without any text inside them, and use headOption to get the value.
scala> val s1 = <quantity xsi:nil="true">12</quantity>
s1: scala.xml.Elem = <quantity xsi:nil="true">12</quantity>
scala> val s2 = <quantity xsi:nil="true"/>
s2: scala.xml.Elem = <quantity xsi:nil="true"></quantity>
scala> s1.filterNot(_.text.isEmpty).headOption.map(_.text.toInt)
res10: Option[Int] = Some(12)
scala> s2.filterNot(_.text.isEmpty).headOption.map(_.text.toInt)
res11: Option[Int] = None
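Both checks can be combined in one small helper. This is only a sketch, assuming scala-xml is available and that a quantity is always an integer when present; optionalInt and XsiNs are names invented here.

```scala
import scala.xml.Elem

val XsiNs = "http://www.w3.org/2001/XMLSchema-instance"

// Hypothetical helper: None when xsi:nil="true" or when the element has no text.
def optionalInt(elem: Elem): Option[Int] = {
  val isNil = elem.attribute(XsiNs, "nil").exists(_.map(_.text).mkString == "true")
  val text = elem.text.trim
  if (isNil || text.isEmpty) None else Some(text.toInt)
}

val nilQuantity = <quantity xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:nil="true"/>
val quantity = <quantity xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">12</quantity>
```

Note that toInt still throws on non-numeric text; wrap it in scala.util.Try if that can happen in your data.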
If you use xtract you can do this with a combination of filter and optional:
(__ \ "quantity").read[Node]
.filter(_.attribute("http://www.w3.org/2001/XMLSchema-instance", "nil").isEmpty)
.map(_.text.toDouble).optional
See https://www.lucidchart.com/techblog/2016/07/12/introducing-xtract-a-new-xml-deserialization-library-for-scala/
Disclaimer: I work for Lucid Software and am a contributor to xtract.

How should I match a pattern in Scala?

I need to match a pattern in Scala. This is my code:
object Wykonaj{
val doctype = DocType("html", PublicID("-//W3C//DTD XHTML 1.0 Strict//EN","http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"), Nil)
def main(args: Array[String]) {
val theUrl = "http://axv.pl/rss/waluty.php"
val xmlString = Source.fromURL(new URL(theUrl)).mkString
val xml = XML.loadString(xmlString)
val zawartosc= (xml \\ "description")
val pattern="""<descrition> </descrition>""".r
for(a <-zawartosc) yield a match{
case pattern=>println(pattern)
}
}
}
The problem is that I need to define val pattern so that, from
<description><![CDATA[ <img src="http://youbookmarks.com/waluty/pic/waluty/AUD.gif"> dolar australijski 1AUD | 2,7778 | 210/A/NBP/2010 ]]> </description>
I get only the text dolar australijski 1AUD | 2,7778 | 210/A/NBP/2010.
Try
import scala.util.matching.Regex
//...
val Pattern = new Regex(""".*; ([^<]*) </description>""")
//...
for(a <-zawartosc) yield a match {
case Pattern(p) => println(p)
}
It's a bit of a kludge (I don't use REs with Scala very often), but it seems to work. The CDATA is stringified with &gt; entities, so the RE tries to find text after a semicolon and before a closing description tag.
val zawartosc = (xml \\ "description")
val pattern = """.*(dolar australijski.*)""".r
val allMatches = (for (a <- zawartosc; text = a.text) yield {text}) collect {
case pattern(value) => value }
val result = allMatches.headOption // or .head
This is mostly a matter of using the right regular expression. In this case you want to match the string that contains dolar australijski. It has to allow for extra characters before dolar, so use .*. Then use parentheses to mark the start and end of the part you need. Refer to the Java API documentation for the full regex syntax.
With respect to the for comprehension, I convert the XML element into text before doing the match and then collect the ones that match the pattern by using the collect method. Then the desired result should be the first and only element.
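If you would rather not hard-code the currency name, a more general pattern can skip the leading tag and capture whatever follows. The sketch below uses only the standard library and works on a literal copy of the text from the question, which is assumed to be what a.text yields for that node; the name AfterTag is made up.

```scala
// Assumed value of a.text for the description node in the question.
val text = """ <img src="http://youbookmarks.com/waluty/pic/waluty/AUD.gif"> dolar australijski 1AUD | 2,7778 | 210/A/NBP/2010  """

// Skip one leading tag, then capture the rest without surrounding whitespace.
val AfterTag = """\s*<[^>]*>\s*(.*\S)\s*""".r

val rate: Option[String] = text match {
  case AfterTag(value) => Some(value)
  case _               => None
}
```

Returning an Option avoids the MatchError that a single-case match throws when a description does not fit the pattern.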

Scala Newb Question - about scoping and variables

I'm parsing XML, and keep finding myself writing code like:
val xml = <outertag>
<dog>val1</dog>
<cat>val2</cat>
</outertag>
var cat = ""
var dog = ""
for (inner <- xml \ "_") {
inner match {
case <dog>{ dg @ _* }</dog> => dog = dg(0).toString()
case <cat>{ ct @ _* }</cat> => cat = ct(0).toString()
}
}
/* do something with dog and cat */
It annoys me because I should be able to declare cat and dog as val (immutable), since I only need to set them once, but I have to make them mutable. And besides that it just seems like there must be a better way to do this in scala. Any ideas?
Here are two (now make it three) possible solutions. The first one is pretty quick and dirty. You can run the whole bit in the Scala interpreter.
val xmlData = <outertag>
<dog>val1</dog>
<cat>val2</cat>
</outertag>
// A very simple way to do this mapping.
def simpleGetNodeValue(x:scala.xml.NodeSeq, tag:String) = (x \\ tag).text
val cat = simpleGetNodeValue(xmlData, "cat")
val dog = simpleGetNodeValue(xmlData, "dog")
cat will be "val2", and dog will be "val1".
Note that if either node is not found, an empty string will be returned. You can work around this, or you could write it in a slightly more idiomatic way:
// A more idiomatic Scala way, even though Scala wouldn't give us nulls.
// This returns an Option[String].
def getNodeValue(x:scala.xml.NodeSeq, tag:String) = {
(x \\ tag).text match {
case "" => None
case x:String => Some(x)
}
}
val cat1 = getNodeValue(xmlData, "cat") getOrElse "No cat found."
val dog1 = getNodeValue(xmlData, "dog") getOrElse "No dog found."
val goat = getNodeValue(xmlData, "goat") getOrElse "No goat found."
cat1 will be "val2", dog1 will be "val1", and goat will be "No goat found."
UPDATE: Here's one more convenience method to take a list of tag names and return their matches as a Map[String, String].
// Searches for all tags in the List and returns a Map[String, String].
def getNodeValues(x:scala.xml.NodeSeq, tags:List[String]) = {
tags.foldLeft(Map[String, String]()) { (a, b) => a + (b -> simpleGetNodeValue(x, b)) }
}
val tagsToMatch = List("dog", "cat")
val matchedValues = getNodeValues(xmlData, tagsToMatch)
If you run that, matchedValues will be Map(dog -> val1, cat -> val2).
Hope that helps!
UPDATE 2: Per Daniel's suggestion, I'm using the double-backslash operator, which will descend into child elements, which may be better as your XML dataset evolves.
scala> val xml = <outertag><dog>val1</dog><cat>val2</cat></outertag>
xml: scala.xml.Elem = <outertag><dog>val1</dog><cat>val2</cat></outertag>
scala> val cat = (xml \\ "cat").text
cat: String = val2
scala> val dog = (xml \\ "dog").text
dog: String = val1
Consider wrapping up the XML inspection and pattern matching in a function that returns the multiple values you need as a tuple (Tuple2[String, String]). But stop and consider: it looks like it's possible to not match any dog and cat elements, which would leave you returning null for one or both of the tuple components. Perhaps you could return a tuple of Option[String], or throw if either of the element patterns fail to bind.
In any case, you can generally solve these initialization problems by wrapping up the constituent statements into a function to yield an expression. Once you have an expression in hand, you can initialize a constant with the result of its evaluation.
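A minimal sketch of that suggestion, assuming scala-xml is on the classpath and that dog and cat are direct children of the root; childText and dogAndCat are hypothetical names.

```scala
import scala.xml.Elem

// Hypothetical helper: text of the first direct child with the given tag, if any.
def childText(root: Elem, tag: String): Option[String] =
  (root \ tag).headOption.map(_.text)

// Wraps the inspection in one expression so the results can be bound to vals.
def dogAndCat(root: Elem): (Option[String], Option[String]) =
  (childText(root, "dog"), childText(root, "cat"))

val data = <outertag><dog>val1</dog><cat>val2</cat></outertag>
val (dog, cat) = dogAndCat(data)
```

Both results are immutable vals, and a missing element shows up as None rather than as an empty string or null.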