Parsing a delimited multiline string using Scala StandardTokenParsers

I have found a few similar questions but nothing that seems to directly address my needs here.
I am creating a DSL using Scala and have much of it already defined. However, part of the language needs to handle blocks of multi-line textual documentation that are collected and handled as individual entities by the parser. I would like to delimit these blocks in some way (say with something like {{ and }}) and just collect everything between the delimiters and return it as a DocString (a case class in my parser). These blocks will then be used to create additional end-user documentation along with the rest of the parsed file(s).
The parser is already structured as a StandardTokenParsers-derived class. I suppose I could convert it to a RegexParsers-derived class and just use regular expressions but that would be a major change and a lot of my grammar would have to be reworked. I am not sure if there would be any advantage to doing this (other than supporting the desired documentation blocks).
I have seen Using regex in StandardTokenParsers and found this. I am not sure either of those will actually handle what I need, however, or how to begin if they do.
If anyone has any ideas as to a viable way to proceed I would appreciate some pointers.
As an example, here is something I have tried (from Using regex in StandardTokenParsers):
object DModelParser extends StandardTokenParsers {
  ...
  def modelElement: Parser[ModelElement] =
    (other stuff, not important here) | docBlock

  import scala.util.matching.Regex
  import lexical.StringLit

  def regexBlockMatch(r: Regex): Parser[String] = acceptMatch(
    "string block matching regex " + r,
    { case StringLit(s) if r.unapplySeq(s).isDefined => s })

  val bmr = """\{\{((?s).*)\}\}""".r

  def docBlockStr: Parser[String] = regexBlockMatch(bmr)

  def docBlock: Parser[DocString] =
    docBlockStr ^^ { s => new DocString(s) }
  ...
}
However, when passing it even a single line like the following:
{{ A block of docs }}
it fails to match, causing the parser to stop parsing. I think the problem is the case StringLit(s) match, but I am not sure.
Edit
OK. StringLit was a problem. I forgot that this will only match strings in double quotes. So I tried replacing the string above with:
"{{ A block of docs }}"
and it works fine. However, the multi-line issue still remains. If I replace this with:
"{{ A block
of docs }}"
Then it still fails to parse. Again, I think it is the StringLit not working across line-feeds.
Edit
Another option occurred to me, but I am not sure how to make it work in the parser. If I could read and match a line that contains only the opening delimiter, then collect into a List[String] all the lines until a line that contains only the closing delimiter, that would be sufficient. Is there a way to do this?
Edit 6/22/2015
I went in a different direction, and this seems to work for the examples I have tried so far:
// https://stackoverflow.com/questions/24771341/scala-regex-multiline-match-with-negative-lookahead
def docBlockRE = regex("""(?s).*?(?=}})""".r)

def docBlock: Parser[DocString] =
  "{{" ~> docBlockRE <~ "}}" ^^ { str => new DocString(str) }

Related

Removing Data Type From Tuple When Printing In Scala

I currently have two maps:
mapBuffer = Map[String, ListBuffer[(Int, String, Float)]]
personalMapBuffer = Map[mapBuffer, String]
The idea of what I'm trying to do is create a list of something, and then allow a user to create a personalised list which includes a comment, so they'd have their own list of maps.
I am simply trying to print information; everything above works as expected.
To print the keys from mapBuffer, I use:
mapBuffer.foreach(line => println(line._1))
This returns:
Sample String 1
Sample String 2
To print the same thing from personalMapBuffer, I am using:
personalMapBuffer.foreach(line => println(line._1.map(_._1)))
However, this returns:
List(Sample String 1)
List(Sample String 2)
I would obviously like it to just return "Sample String" without the List() wrapper. I'm assuming this has something to do with the .map function, although that was the only way I could find to access a tuple within a tuple. Is there a simple way to remove the data type? I was hoping for something simple like:
line._1.map(_._1).removeDataType
But obviously no such predefined function exists. I'm very new to Scala, so this might be something extremely simple (which I hope it is haha) or it could be a bit more complex. Any help would be great.
Thanks.
What you see is the default List.toString behaviour. You can build your own string with the mkString operation:
val separator = ","
personalMapBuffer.foreach(line => println(line._1.map(_._1).mkString(separator)))
which will produce the desired result of Sample String 1, or Sample String 1,Sample String 2 if there are two strings.
Hope this helps!
I have found a way to get the result I was looking for; however, I'm not sure if it's the best way.
The .map() method just returns a collection. You can see more info on that here: https://www.geeksforgeeks.org/scala-map-method/
By using any sort of specific element accessor at the end, I'm able to return only the element and not the data type. For example:
line._1.map(_._1).head
As I was writing this, Ivan Kurchenko replied above suggesting I use .mkString. This also works and looks a little better than .head to my mind.
line._1.map(_._1).mkString("")
Again, I'm not 100% sure this is the most efficient way, but it has worked for me so far.
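For reference, here is a small self-contained sketch of the structures described above, with illustrative sample data, showing the difference between printing the raw collection and flattening it with mkString:

import scala.collection.mutable.ListBuffer

val mapBuffer: Map[String, ListBuffer[(Int, String, Float)]] =
  Map("Sample String 1" -> ListBuffer((1, "a", 1.0f)),
      "Sample String 2" -> ListBuffer((2, "b", 2.0f)))

val personalMapBuffer: Map[Map[String, ListBuffer[(Int, String, Float)]], String] =
  Map(mapBuffer -> "my comment")

// Keys of the inner map printed directly:
mapBuffer.foreach(line => println(line._1))
// Sample String 1
// Sample String 2

// The keys of personalMapBuffer are themselves maps, so mapping over them
// yields a collection; mkString turns it into a plain string:
personalMapBuffer.foreach(line => println(line._1.map(_._1).mkString(", ")))
// Sample String 1, Sample String 2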

Is there a way to filter a field not containing something in a spark dataframe using scala?

Hopefully I'm stupid and this will be easy.
I have a dataframe containing the columns 'url' and 'referrer'.
I want to extract all the referrers that contain the top level domain 'www.mydomain.com' and 'mydomain.co'.
I can use
val filteredDf = unfilteredDf.filter(($"referrer").contains("www.mydomain."))
However, this also pulls out www.google.co.uk search URLs that happen to contain my web domain. Is there a way, using Scala in Spark, to filter out anything with google in it while keeping the correct results?
Thanks
Dean
You can negate a predicate using either not or !, so all that's left is to add another condition:
import org.apache.spark.sql.functions.not
df.where($"referrer".contains("www.mydomain.") &&
  not($"referrer".contains("google")))
or as separate filters:
df
  .where($"referrer".contains("www.mydomain."))
  .where(!$"referrer".contains("google"))
You may use a Regex. Here you can find a reference for the usage of regex in Scala. And here you can find some hints about how to create a proper regex for URLs.
Thus in your case you will have something like:
import org.apache.spark.sql.functions.udf

val regex = "PUT_YOUR_REGEX_HERE".r // something like (https?|ftp)://www.mydomain.com?(/[^\s]*)? should work

// findFirstIn works on a String, not on a Column, so wrap the match in a UDF:
val matchesRegex = udf((referrer: String) =>
  regex.findFirstIn(referrer) match {
    case Some(_) => true
    case None    => false
  })

val filteredDf = unfilteredDf.filter(matchesRegex($"referrer"))
This solution requires a bit of work but is the safest one.
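If you would rather avoid a UDF, Spark's Column API also has a built-in rlike for regex matching on a column. A short sketch; the pattern is only illustrative:

// assumes the usual import spark.implicits._ for the $"..." syntax
val filteredDf = unfilteredDf.filter(
  $"referrer".rlike("""^https?://www\.mydomain\.(com|co)\b""") &&
  !$"referrer".contains("google")
)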

How to suppress printing of variable values in zeppelin

Given the following snippet:
val data = sc.parallelize(0 until 10000)
val local = data.collect
println(s"local.size")
Zeppelin prints out the entire value of local to the notebook screen. How may that behavior be changed?
You can also try adding curly brackets around your code; the block is then evaluated as a single expression, so only its final result is echoed rather than every intermediate value:
{
  val data = sc.parallelize(0 until 10000)
  val local = data.collect
  println(s"${local.size}")
}
Since 0.6.0, Zeppelin provides a boolean flag zeppelin.spark.printREPLOutput in spark's interpreter configuration (accessible via the GUI), which is set to true by default.
If you set its value to false, then only explicit print statements are output, which is the desired behaviour.
See also: https://issues.apache.org/jira/browse/ZEPPELIN-688
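For reference, in the Spark interpreter settings this is just a boolean property, roughly:

zeppelin.spark.printREPLOutput = false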
What I do to avoid this is define a top-level function, and then call it:
def run(): Unit = {
  val data = sc.parallelize(0 until 10000)
  val local = data.collect
  println(local.size)
}
run()
FWIW, this appears to be new behaviour.
Until recently we had been using Livy 0.4, which only output the content of the final statement (rather than echoing the output of the whole script).
When we upgraded to Livy 0.5, the behaviour changed to output the entire script.
While splitting the paragraph and hiding the output does work, it seems like an unnecessary overhead to the usability of Zeppelin.
For example, if you need to refresh your output, then you have to remember to run two paragraphs (i.e. the one that sets up your output and the one containing the actual println).
There are, IMHO, other usability issues with this approach that make, again IMHO, Zeppelin less intuitive to use.
Someone has logged this JIRA ticket to address "the problem"; please vote for it:
LIVY-507
Zeppelin, as well as spark-shell REPL, always prints the whole interpreter output.
If you really want to have only the local.size string printed, the best way is to put the println statement in a separate paragraph.
Then you can hide all output of the previous paragraph using the small "book" icon at the top right.
A simple trick I am using is to define
def !() = "_ __ ___ ___________________________________________________"
and use it as
$bang
above or close to the code I want to check, and it works:
res544: String = _ __ ___ ___________________________________________________
then I just leave it there commented out ;)
// hope it helps

Parsing options that take more than one value with scopt in scala

I am using scopt to parse command line arguments in scala. I want it to be able to parse options with more than one value. For instance, the range option, if specified, should take exactly two values.
--range 25 45
Coming from a Python background, I am basically looking for a way to do the following with scopt instead of Python's argparse:
parser.add_argument("--range", default=None, nargs=2, type=float,
                    metavar=('start', 'end'),
                    help=(" Foo bar start and stop "))
I don't think minOccurs and maxOccurs solve my problem exactly, nor does the key:value example in its help.
Looking at the source code, this is not possible. The Read type class used has a member tuplesToRead, but it doesn't seem to work when you force it to 2 instead of 1. You will have to make a feature request, I guess, or work around this by using --min 25 --max 45, or --range '25 45' with a custom Read instance that splits the string into two parts. As @roterl noted, this is not a standard way of parsing.
It should be OK as long as your values are delimited with something other than a space...
--range 25-45
... although you need to split them manually. Parse it with something like:
opt[String]('r', "range").action { (x, c) =>
  val rx = "([0-9]+)\\-([0-9]+)".r
  val rx(from, to) = x
  c.copy(from = from.toInt, to = to.toInt)
}
// ...
println(s" Got range ${parsedArgs.from}..${parsedArgs.to}")

What does $ mean in a Scala Play Template, and how to deal with Options

First off, what exactly does $ mean in the Scala Play Template engine?
Second off, I am trying to deal with a value of type Option in my Scala Play template, and it seems what I am doing should be rather simple. Here is a snippet of code from my template.
@c = { Some(complication) }
<div id="complication">
  @Html( (@c.name getOrElse "") )
</div>
Where complication is of type Option[T]. The name field is of type string.
I've tried extracting it into another variable and then referencing the name field from that, but that seems obtuse; there has to be a better solution.
With $ I think you may be referring to Scala's string interpolation.
Say you have val str: String = "hello", then s"$str world" is equivalent to str + " world"
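For example, a quick sketch you can paste into a REPL:

val str: String = "hello"
println(s"$str world")          // hello world
println(s"${str.length} chars") // braces are needed for full expressions: 5 chars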
You don't need the @ symbol inside @Html. And since complication is an Option, map over it to get at the name field:
@Html(complication.map(_.name).getOrElse(""))
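For completeness, a small sketch of how the whole template might look, assuming complication is declared as a template parameter and its element type (called Complication here, just for illustration) has a name: String field:

@(complication: Option[Complication])

<div id="complication">
  @Html(complication.map(_.name).getOrElse(""))
</div>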