extraction with scala regex - scala

I have the following string as input :
val input=" 12 BANANAPPLESTRAWBERRY LOC 8(05). "
And I want to parse this string using Regex and pattern matching
The wanted output shall look like:
val output=" BANANAPPLESTRAWBERRY Int(5)"
I wrote the following function:
val rega = """(.{6}[^*])(\s*)(\d*)(\s*)(\S*)(\s*)(\S{4})(8\(0*(\d+)\))(.*)""".r
def functionEx(s: String, regb: Regex): String = {
s match {
case regb(start,space, nb, space2, naame, space3, loc, vartype, length, end) => name + "(Int"+ length +")"
case _ => ""
}
}
When I call this function on my input : functionEx(input,rega) I got the empty string : ""
Any help with please
Thanks a lot

I don't know what you tried there with that regex. It captures all kind of groups that are seemingly mostly irrelevant (whitespace, full stop in the end). It also contains parts like \S{4} that don't seem to match anything, and magic number like 9 that doesn't appear in the input string. So, obviously, it doesn't match the input, so the second case in the pattern matching applies, and a "" is returned.
Try this instead:
val input=" 12 BANANAPPLESTRAWBERRY LOC 8(05). "
import scala.util.matching.Regex
val rega = """^(.{6})\s+(\d*)\s+(\w+)\s+(\w+)\s+8\(0*(\d+)\)\s*\.\s*""".r
def functionEx(s: String, regb: Regex): String = {
s match {
case regb(start, nb, name, loc, length) => name + " Int("+ length +")"
case _ => ""
}
}
println(functionEx(input ,rega))
println(functionEx(input ,rega))
It produces the following output:
BANANAPPLESTRAWBERRY Int(5)

Related

Extracting a Word after delimiter in scala

I want to extract a word from a String in Scala
val text = "This ball is from Rock. He is good boy"
I want to extract "Rock" from the string.
I tried:
val op = text.subString(4)
text is not a fixed length string. I just want to pick first word after "From".
This doesnt give the right word. can anyone suggest.
This does what you want:
text.drop(text.indexOfSlice("from ")+5).takeWhile(_.isLetter)
or more generally
val delim = "from "
text.drop(text.indexOfSlice(delim)+delim.length).takeWhile(_.isLetter)
The indexOfSlice finds the position of the delimiter and the drop removes everything up to the end of the delimiter. The takeWhile takes all the letters for the word and stops at the first non-letter character (in this case ".").
Note that this is case sensitive so it will not find "From ", it will only work with "from ". If more complex testing is required then use split to convert to separate words and check each word in turn.
This is because you are telling scala to print everything from 4th index to end of string.
In your case you first want to split the string into words which can be done using split function and then access the word you want.
Note: split gives you an array of string and array index begin from 0 in scala so rock would be at the 4th index
This piece of code should work for you. Basically I am using a function to the processing of index of the word that immediately follows a substring (in this case from)
val text = "This ball is from Rock. He is good boy"
val splitText = text.split("\\W+")
println(splitText(4))
Based on your comment below, I would create a code something like this
import scala.util.control.Breaks.{break, breakable}
object Question extends App {
val text = "This ball is from Rock. He is from good boy"
val splitText = text.split("\\W+")
val index = getIndex(splitText, "from")
println(index)
if (index != -1) {
println(splitText(index))
}
def getIndex(arrSplit: Array[String], subString: String): Int = {
var output = -1
var index = 0
breakable {
for(item <- arrSplit) {
if(item.equalsIgnoreCase(subString)) {
output = index + 1
break
}
index = index + 1
}
}
output
}
}
I hope this is what you are expecting:
object Test2 extends App {
val text = "This ball is from Rock. He is good boy"
private val words = text.split(" ")
private val wordToTrack = "from"
val indexOfFrom = words.indexOf(wordToTrack);
val expectedText = words.filterNot{
case word if indexOfFrom < words.size & words.contains(wordToTrack) =>
word == words(indexOfFrom + 1)
case _ => false
}.mkString(" ")
print(expectedText)
}
words.contains(wordToTrack) guards the scenario if the from word(i.e tracking word for this example) is missing in the input text string.
I have used the partial function along with the filter to get the desired result.
You probably want something more general so that you can extract a word from a sentence if that word is present in the input, without having to hard-code offsets:
def extractWordFromString(input: String, word: String): Option[String] = {
val wordLength = word.length
val start = input.indexOfSlice(word)
if (start == -1) None else Some(input.slice(start, start + wordLength))
}
Executing extractWordFromString(text, "Rock") will give you an option containing the target word from input if it was found, and an empty option otherwise. That way you can handle the case where the word you were searching for was not found.

finding a substring in a text column that start and end with a specific string

I'm trying to scan a text dataframe column and retrieve a string that starts with a specific string and ends with a specific string.I tried to use substring with instr but couldn't get it working.
what you could do is use regex and pattern matching to achieve this
# def getText(startsWith: String, endsWith: String)(text: String): (String, String) = {
val rr = s"($startsWith(.+?)$endsWith)".r
text match {
case rr(all, partial) => (all, partial)
case _ => ("", "")
}
}
defined function getText
# getText("1", "2")("1hdfjhsdf2")
res5: (String, String) = ("1hdfjhsdf2", "hdfjhsdf")
this should do what you want

Addition of numbers recursively in Scala

In this Scala code I'm trying to analyze a string that contains a sum (such as 12+3+5) and return the result (20). I'm using regex to extract the first digit and parse the trail to be added recursively. My issue is that since the regex returns a String, I cannot add up the numbers. Any ideas?
object TestRecursive extends App {
val plus = """(\w*)\+(\w*)""".r
println(parse("12+3+5"))
def parse(str: String) : String = str match {
// sum
case plus(head, trail) => parse(head) + parse(trail)
case _ => str
}
}
You might want to use the parser combinators for an application like this.
"""(\w*)\+(\w*)""".r also matches "+" or "23+" or "4 +5" // but captures it only in the first group.
what you could do might be
scala> val numbers = "[+-]?\\d+"
numbers: String = [+-]?\d+
^
scala> numbers.r.findAllIn("1+2-3+42").map(_.toInt).reduce(_ + _)
res4: Int = 42
scala> numbers.r.findAllIn("12+3+5").map(_.toInt).reduce(_ + _)
res5: Int = 20

How to distinguish triple quotes from single quotes in macros?

I am writing a macro m(expr: String), where expr is an expression in some language (not Scala):
m("SOME EXPRESSION")
m("""
SOME EXPRESSION
""")
When I am parsing the expression I would like to report error messages with proper locations in the source file. To achieve this I should know the location of the string literal itself and the number of quotes of the literal (3 or 1). Unfortunately, I did not find any method that returns the number of quotes of the literal:
import scala.language.experimental.macros
import scala.reflect.macros.blackbox.Context
object Temp {
def m(s: String) : String = macro mImpl
def mImpl(context: Context)(s: context.Expr[String]): context.universe.Tree = {
import context.universe._
s match {
case l # Literal(Constant(p: String)) =>
if (l.<TRIPLE QUOTES>) {
...
} else {
...
}
case _ =>
context.abort(context.enclosingPosition, "The argument of m must be a string literal")
}
}
}
What should I put instead of <TRIPLE QUOTES>?
The only way i can think of is to access the source file and check for triple quotes:
l.tree.pos.source.content.startsWith("\"\"\"",l.tree.pos.start)
You need also to edit your matching case:
case l # Expr(Literal(Constant(p: String))) =>
Here the version with some explanation:
val tree: context.universe.Tree = l.tree
val pos: scala.reflect.api.Position = tree.pos
val source: scala.reflect.internal.util.SourceFile = pos.source
val content: Array[Char] = source.content
val start: Int = pos.start
val isTriple: Boolean = content.startsWith("\"\"\"",start)

Accessing Scala Parser regular expression match data

I wondering if it's possible to get the MatchData generated from the matching regular expression in the grammar below.
object DateParser extends JavaTokenParsers {
....
val dateLiteral = """(\d{4}[-/])?(\d\d[-/])?(\d\d)""".r ^^ {
... get MatchData
}
}
One option of course is to perform the match again inside the block, but since the RegexParser has already performed the match I'm hoping that it passes the MatchData to the block, or stores it?
Here is the implicit definition that converts your Regex into a Parser:
/** A parser that matches a regex string */
implicit def regex(r: Regex): Parser[String] = new Parser[String] {
def apply(in: Input) = {
val source = in.source
val offset = in.offset
val start = handleWhiteSpace(source, offset)
(r findPrefixMatchOf (source.subSequence(start, source.length))) match {
case Some(matched) =>
Success(source.subSequence(start, start + matched.end).toString,
in.drop(start + matched.end - offset))
case None =>
Failure("string matching regex `"+r+"' expected but `"+in.first+"' found", in.drop(start - offset))
}
}
}
Just adapt it:
object X extends RegexParsers {
/** A parser that matches a regex string and returns the Match */
def regexMatch(r: Regex): Parser[Regex.Match] = new Parser[Regex.Match] {
def apply(in: Input) = {
val source = in.source
val offset = in.offset
val start = handleWhiteSpace(source, offset)
(r findPrefixMatchOf (source.subSequence(start, source.length))) match {
case Some(matched) =>
Success(matched,
in.drop(start + matched.end - offset))
case None =>
Failure("string matching regex `"+r+"' expected but `"+in.first+"' found", in.drop(start - offset))
}
}
}
val t = regexMatch("""(\d\d)/(\d\d)/(\d\d\d\d)""".r) ^^ { case m => (m.group(1), m.group(2), m.group(3)) }
}
Example:
scala> X.parseAll(X.t, "23/03/1971")
res8: X.ParseResult[(String, String, String)] = [1.11] parsed: (23,03,1971)
No, you can't do this. If you look at the definition of the Parser used when you convert a regex to a Parser, it throws away all context and just returns the full matched string:
http://lampsvn.epfl.ch/trac/scala/browser/scala/tags/R_2_7_7_final/src/library/scala/util/parsing/combinator/RegexParsers.scala?view=markup#L55
You have a couple of other options, though:
break up your parser into several smaller parsers (for the tokens you actually want to extract)
define a custom parser that extracts the values you want and returns a domain object instead of a string
The first would look like
val separator = "-" | "/"
val year = ("""\d{4}"""r) <~ separator
val month = ("""\d\d"""r) <~ separator
val day = """\d\d"""r
val date = ((year?) ~ (month?) ~ day) map {
case year ~ month ~ day =>
(year.getOrElse("2009"), month.getOrElse("11"), day)
}
The <~ means "require these two tokens together, but only give me the result of the first one.
The ~ means "require these two tokens together and tie them together in a pattern-matchable ~ object.
The ? means that the parser is optional and will return an Option.
The .getOrElse bit provides a default value for when the parser didn't define a value.
When a Regex is used in a RegexParsers instance, the implicit def regex(Regex): Parser[String] in RegexParsers is used to appoly that Regex to the input. The Match instance yielded upon successful application of the RE at the current input is used to construct a Success in the regex() method, but only its "end" value is used, so any captured sub-matches are discarded by the time that method returns.
As it stands (in the 2.7 source I looked at), you're out of luck, I believe.
I ran into a similar issue using scala 2.8.1 and trying to parse input of the form "name:value" using the RegexParsers class:
package scalucene.query
import scala.util.matching.Regex
import scala.util.parsing.combinator._
object QueryParser extends RegexParsers {
override def skipWhitespace = false
private def quoted = regex(new Regex("\"[^\"]+"))
private def colon = regex(new Regex(":"))
private def word = regex(new Regex("\\w+"))
private def fielded = (regex(new Regex("[^:]+")) <~ colon) ~ word
private def term = (fielded | word | quoted)
def parseItem(str: String) = parse(term, str)
}
It seems that you can grab the matched groups after parsing like this:
QueryParser.parseItem("nameExample:valueExample") match {
case QueryParser.Success(result:scala.util.parsing.combinator.Parsers$$tilde, _) => {
println("Name: " + result.productElement(0) + " value: " + result.productElement(1))
}
}