How to do regex pattern matching in Scala

I have a list of String in Scala, each String has a key/value format as follows:
<row Id="25780063" PostTypeId="2" ParentId="25774527" CreationDate="2014-09-11T05:56:29.900" />
Each String may contain some extra key/value pairs. I'd like to extract the values of a few keys from each string. Here are the patterns I've defined, but they are not working properly:
val idPattern = "Id=(.*)".r
val typePattern = "PostTypeId=(.*)".r
How can I correctly extract the value for 'Id' and 'PostTypeId'?

Making the regex unanchored means the pattern only has to be found somewhere in the input, instead of having to match the entire input.
scala> val id = """Id="([^"]*)"""".r.unanchored
id: scala.util.matching.UnanchoredRegex = Id="([^"]*)"
scala> """stuff Id="something" more""" match { case id(x) => x }
res7: String = something
scala> id.findFirstIn("""stuff Id="something" more""")
res8: Option[String] = Some(Id="something")

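First, define the regexes as vals that can be used as extractors.
val IdPattern = "Id=(.*)".r
val TypePattern = "PostTypeId=(.*)".r
Note the initial uppercase, which is the convention for extractors. It is only strictly required when the name is used as a bare stable identifier in a pattern (or use backquotes if you really want it lowercased); with an argument list, as in idPattern(group), a lowercase name also works. Also note that a Regex used as an extractor must match the entire input, and that (.*) is greedy and captures everything up to the end of the string.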
Then,
aString match {
case IdPattern(group) => println(s"id=$group")
case TypePattern(group) => println(s"type=$group")
}
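A minimal sketch of one way to pull both values out of a full row, assuming each value is wrapped in double quotes as in the sample line. The object name RowAttrs and the helper idAndType are hypothetical names chosen for illustration. It uses findFirstMatchIn, which searches anywhere in the input, so no anchoring is involved.

```scala
object RowAttrs {
  // \b keeps "Id" from matching inside "PostTypeId" or "ParentId".
  // [^"]* stops at the closing quote instead of running to the end of the line.
  private val IdPattern   = """\bId="([^"]*)"""".r
  private val TypePattern = """\bPostTypeId="([^"]*)"""".r

  // Returns (Id, PostTypeId) when both attributes are present, None otherwise.
  def idAndType(row: String): Option[(String, String)] =
    for {
      id  <- IdPattern.findFirstMatchIn(row).map(_.group(1))
      tpe <- TypePattern.findFirstMatchIn(row).map(_.group(1))
    } yield (id, tpe)
}
```

Applied to the list from the question, something like rows.flatMap(RowAttrs.idAndType) would keep only the rows that carry both keys.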

Related

How do I use a regex variable in filterNot in Scala?

Using Scala I am trying to remove URLs from data as per this question. And the following code works fine:
val removeRegexUDF = udf(
(input: Seq[String]) => input.filterNot(s => s.matches("(https?\\://)\\S+")))
filteredDF.withColumn("noURL", removeRegexUDF('filtered)).select("racist", "filtered","noURL").show(100, false)
Now I want to use a variable instead of the literal regular expression, so I try:
val urls = """(https?\\://)\\S+"""
val removeRegexUDF = udf(
(input: Seq[String]) => input.filterNot(s => s.matches(urls)))
but this seems to have no effect on the data. I try:
val urls = """(https?\\://)\\S+""".r
but this gives error:
urls: scala.util.matching.Regex = (https?\\://)\\S+
<console>:45: error: type mismatch;
found : scala.util.matching.Regex
required: String
(input: Seq[String]) => input.filterNot(s => s.matches(urls) )
Any guidance on how to achieve this is much appreciated.
I guess this has to do with ordinary double-quoted strings versus triple-quoted strings. In the first example you add extra backslashes to escape the characters, whereas inside triple quotes no escaping is needed, so the doubled backslashes end up in the regex literally.
println("(https?\\://)\\S+") // (https?\://)\S+
println("""(https?\\://)\\S+""") // (https?\\://)\\S+
println("""(https?\://)\S+""") // (https?\://)\S+
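As a sketch of the second half of the question (passing a Regex where String.matches expects a String), the compiled pattern can be used directly. The names UrlFilter and dropUrls are hypothetical, and the Spark udf wrapper is left out so the snippet only depends on the standard library.

```scala
object UrlFilter {
  // Triple-quoted, so single backslashes are enough.
  val Urls = """(https?\://)\S+""".r

  // Regex.pattern exposes the underlying java.util.regex.Pattern;
  // matcher(s).matches() requires the whole string to match, like String.matches.
  def dropUrls(input: Seq[String]): Seq[String] =
    input.filterNot(s => Urls.pattern.matcher(s).matches())
}
```

Inside the udf from the question, the body would then presumably become input.filterNot(s => Urls.pattern.matcher(s).matches()).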

Check if a value either equals a string or is an array which contains a string

A Scala map contains a key X.
The value can be either an array of Strings Array("Y")
or a simple String object, "Y".
I need to retrieve the value from the map and test
if the value is a string,
myMap("X")=="Y"
or, if the value is an array.
myMap("X").contains("Y")
I don't want to use an if statement to check the type of the value first. One option would be to write a function which checks the value: if it is an array, return the array; otherwise create an array with the single string element contained in the map. Then the call would be:
myToArrayFunction(myMap("X")).contains("Y")
That's what I actually do in Java.
But this is Scala. Is there a better idiom to do this in one line using pre-existing functions?
This should work:
myMap.get("X") match {
case None => println("oh snap!")
case Some(x) => x match {
case i: String => println(s"hooray for my String $i") // do something with your String here
case a: Array[String] => println(s"It's an Array $a") // do something with your Array here
}
}
case class Y(name: String)
//val a = Map[String, Any]("X" -> "Y")
val a = Map[String, Any]("X" -> Y("Peter"))
a.getOrElse("X", "Default") match {
case s: String => println(s)
case Y(name) => println(name)
}
you can also use something like this:
//val a = Map[String, Any]("X" -> "Y")
val a = Map[String, Any]("X" -> Y("Peter"))
a.foreach(v => println(v._2))
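For the exact check asked about, a small sketch along the same lines: one pattern match that answers "equals the string, or is an array containing it". The names ValueCheck and matchesTarget are hypothetical, and the map is assumed to be a Map[String, Any].

```scala
object ValueCheck {
  // True if the value is the target string, or an array that contains it.
  def matchesTarget(value: Any, target: String): Boolean = value match {
    case s: String   => s == target
    case a: Array[_] => a.exists(_ == target)
    case _           => false
  }
}
```

With that in place, the call site stays a one-liner, for example myMap.get("X").exists(ValueCheck.matchesTarget(_, "Y")), which also covers the case where the key is missing.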

How to pattern match on List('(', List[Char],')')?

I am struggling a bit with some pattern matching on a List[Char]. I would like to extract sub-lists that are enclosed by parentheses. So, I would like to extract "test" as a List[Char] when given "(test)". So basically a match on List('(', List[Char],')'). I am able to match on List('(',t,')') where t is a single character, but not a variable amount of characters.
How should this be declared?
val s = "(test)"
s match {
case List('(',t,')') => {
println("matches single character")
}
case '('::x::y => {
//x will be the first character in the List[Char] (after the '(') and y the tail
}
}
s match {
case '(' +: t :+ ')' => ...
}
Read about custom extractors in Scala and then see http://www.scala-lang.org/api/2.11.8/index.html#scala.collection.$colon$plus$ to understand how it works.
Note that it'll match any suitable Seq[Char], but a string isn't really one; it can only be converted (implicitly or explicitly). So you can use one of
val s: Seq[Char] = ...some String or List[Char]
val s = someString.toSeq
I expect that performance for String should be good enough (and if it's critical, don't use this); but for large List[Char] this will be quite slow.
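A runnable sketch of the +: and :+ pattern described above, wrapped in a hypothetical helper named inner that returns the characters between the parentheses when the sequence is enclosed by them.

```scala
object ParenExtract {
  // '(' +: t :+ ')' parses as ('(' +: t) :+ ')':
  // the last element must be ')', and the rest must start with '('.
  def inner(chars: Seq[Char]): Option[Seq[Char]] = chars match {
    case '(' +: t :+ ')' => Some(t)
    case _               => None
  }
}
```

A String has to be converted first, for example ParenExtract.inner("(test)".toList).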

Addition of numbers recursively in Scala

In this Scala code I'm trying to analyze a string that contains a sum (such as 12+3+5) and return the result (20). I'm using regex to extract the first digit and parse the trail to be added recursively. My issue is that since the regex returns a String, I cannot add up the numbers. Any ideas?
object TestRecursive extends App {
val plus = """(\w*)\+(\w*)""".r
println(parse("12+3+5"))
def parse(str: String) : String = str match {
// sum
case plus(head, trail) => parse(head) + parse(trail)
case _ => str
}
}
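You might want to use parser combinators for an application like this.
Note that """(\w*)\+(\w*)""".r also matches degenerate inputs such as "+" or "23+", because \w* can match the empty string. Since \w does not match '+', the pattern cannot match all of "12+3+5" when used as an extractor.
What you could do instead is: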
scala> val numbers = "[+-]?\\d+"
numbers: String = [+-]?\d+
scala> numbers.r.findAllIn("1+2-3+42").map(_.toInt).reduce(_ + _)
res4: Int = 42
scala> numbers.r.findAllIn("12+3+5").map(_.toInt).reduce(_ + _)
res5: Int = 20
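If the recursive structure from the question should be kept, a minimal sketch is to return Int instead of String and convert at the leaves. It assumes the input contains only non-negative integers separated by '+' with no whitespace; the object name SumParser is hypothetical.

```scala
object SumParser {
  // First number, a literal '+', then everything that follows.
  private val Plus = """(\d+)\+(.*)""".r

  def parse(str: String): Int = str match {
    case Plus(head, rest) => head.toInt + parse(rest) // recurse on the tail
    case _                => str.toInt                // a single number
  }
}
```

Any input outside those assumptions (such as "23+" or "4 +5") ends in a NumberFormatException from toInt rather than a silent wrong answer.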

How should I match a pattern in Scala?

I need to match a pattern in Scala. This is my code:
object Wykonaj{
val doctype = DocType("html", PublicID("-//W3C//DTD XHTML 1.0 Strict//EN","http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"), Nil)
def main(args: Array[String]) {
val theUrl = "http://axv.pl/rss/waluty.php"
val xmlString = Source.fromURL(new URL(theUrl)).mkString
val xml = XML.loadString(xmlString)
val zawartosc= (xml \\ "description")
val pattern="""<descrition> </descrition>""".r
for(a <-zawartosc) yield a match{
case pattern=>println(pattern)
}
}
}
The problem is that I need to define val pattern so that, from
<description><![CDATA[ <img src="http://youbookmarks.com/waluty/pic/waluty/AUD.gif"> dolar australijski 1AUD | 2,7778 | 210/A/NBP/2010 ]]> </description>
I get only the text dolar australijski 1AUD | 2,7778 | 210/A/NBP/2010.
Try
import scala.util.matching.Regex
//...
val Pattern = new Regex(""".*; ([^<]*) </description>""")
//...
for(a <-zawartosc) yield a match {
case Pattern(p) => println(p)
}
It's a bit of a kludge (I don't use REs with Scala very often), but it seems to work. The CDATA is stringified with &gt; entities, so the RE tries to find the text after a semicolon and before the closing description tag.
val zawartosc = (xml \\ "description")
val pattern = """.*(dolar australijski.*)""".r
val allMatches = (for (a <- zawartosc; text = a.text) yield {text}) collect {
case pattern(value) => value }
val result = allMatches.headOption // or .head
This is mostly a matter of using the right regular expression. In this case you want to match the string that contains dolar australijski. The pattern has to allow for extra characters before dolar, so it starts with .*. The parentheses then mark the start and end of the part you need. Refer to the Java API documentation for java.util.regex.Pattern for the full syntax.
With respect to the for comprehension, I convert the XML element into text before doing the match and then collect the ones that match the pattern by using the collect method. Then the desired result should be the first and only element.
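The collect step can be tried without the XML download. The sketch below assumes the element texts have already been extracted into plain strings; the names DescriptionExtract and firstMatch are hypothetical.

```scala
object DescriptionExtract {
  // .* skips whatever precedes the currency name; the group keeps the rest.
  private val Pattern = """.*(dolar australijski.*)""".r

  // Trim each text, keep the captured group of every match, return the first.
  def firstMatch(texts: Seq[String]): Option[String] =
    texts.map(_.trim).collect { case Pattern(value) => value }.headOption
}
```

Texts that do not match are simply skipped by collect, so no MatchError is thrown for the other description elements.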