I'm trying to define a DSL that will dictate how to parse CSVs. I would like to define simple functions to transform values as they are being extracted from the CSV. The DSL will be defined in a text file.
For example, if the CSV looks like this:
id,name,amt
1,John Smith,$10.00
2,Bob Uncle,$20.00
I would like to define the following function (and please note that I would like to be able to execute arbitrary code) on the amt column
(x: String) => x.replace("$", "")
Is there a way to evaluate the function above and execute it for each of the amt values?
First, please consider that there's probably a better way to do this. For one thing, it seems concerning that your external DSL contains Scala code. Does this really need to be a DSL?
That said, it's possible to evaluate arbitrary Scala using ToolBox:
import scala.reflect.runtime.universe._
import scala.tools.reflect.ToolBox
val code = """(x: String) => x.replace("$", "")"""
val toolbox = runtimeMirror(getClass.getClassLoader).mkToolBox()
val func = toolbox.eval(toolbox.parse(code)).asInstanceOf[String => String]
println(func("$10.50")) // prints "10.50"
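Applying the evaluated function to the amt column could then look roughly like the sketch below. The CSV is parsed by hand with split purely for illustration (a real CSV parser would be preferable), and the csv value and column index are assumptions, not part of the question:
// Minimal sketch: parse the sample CSV by hand and apply the evaluated
// function to the amt column (index 2).
val csv =
  """id,name,amt
    |1,John Smith,$10.00
    |2,Bob Uncle,$20.00""".stripMargin
val rows = csv.linesIterator.drop(1).map(_.split(",")).toList
val amounts = rows.map(cols => func(cols(2)))
println(amounts) // List(10.00, 20.00)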
Suppose I have types like these:
case class SomeType(id: String, x: Int, y: Int, payload: String)
case class Key(x: Int, y: Int)
Then suppose I did groupByKey on a Dataset[SomeType] like this:
val input: Dataset[SomeType] = ...
val grouped: KeyValueGroupedDataset[Key, SomeType] =
input.groupByKey(s => Key(s.x, s.y))
Then suppose I have a function which determines which field I want to use in an aggregation:
val chooseDistinguisher: SomeType => String = _.id
And now I would like to run an aggregation function over the grouped dataset, for example, functions.countDistinct, using the field obtained by the function:
grouped.agg(
countDistinct(<something which depends on chooseDistinguisher>).as[Long]
)
The problem is, I cannot create a UDF from chooseDistinguisher, because countDistinct accepts a Column, and to turn a UDF into a Column you need to specify the input column names, which I cannot do - I do not know which name to use for the "values" of a KeyValueGroupedDataset.
I think it should be possible, because KeyValueGroupedDataset itself does something similar:
def count(): Dataset[(K, Long)] = agg(functions.count("*").as(ExpressionEncoder[Long]()))
However, this method cheats a bit because it uses "*" as the column name, but I need to specify a particular column (i.e. the column of the "value" in a key-value grouped dataset). Also, when you use typed functions from the typed object, you also do not need to specify the column name, and it works somehow.
So, is it possible to do this, and if it is, how to do it?
As far as I know, it's not possible with the agg transformation, which expects a TypedColumn that is constructed from a Column using the as method, so you need to start from a non-type-safe expression. If somebody knows a solution I would be interested to see it...
If you need type-safe aggregation you can use one of the approaches below:
mapGroups - where you implement a Scala function responsible for aggregating an Iterator
implement your own custom Aggregator as suggested above (a rough sketch follows the mapGroups example below)
The first approach needs less code, so here is a quick example:
def countDistinct[T](values: Iterator[T])(chooseDistinguisher: T => String): Long =
  values.map(chooseDistinguisher).toSeq.distinct.size

ds
  .groupByKey(s => Key(s.x, s.y))
  .mapGroups((k, vs) => (k, countDistinct(vs)(_.id)))
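For completeness, the second approach could look roughly like this. It is only a sketch assuming the Spark 2.x typed Aggregator API; the DistinctCount name and the Set-based buffer (encoded with Kryo) are my own illustrative choices, not something from the question:
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Sketch of a custom Aggregator that counts distinct distinguisher values.
// Collecting a Set is simple but not the most memory-efficient buffer.
class DistinctCount[T](chooseDistinguisher: T => String) extends Aggregator[T, Set[String], Long] {
  def zero: Set[String] = Set.empty
  def reduce(buffer: Set[String], value: T): Set[String] = buffer + chooseDistinguisher(value)
  def merge(a: Set[String], b: Set[String]): Set[String] = a ++ b
  def finish(buffer: Set[String]): Long = buffer.size.toLong
  def bufferEncoder: Encoder[Set[String]] = Encoders.kryo[Set[String]]
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

// Usage with the grouped dataset from the question:
// grouped.agg(new DistinctCount[SomeType](chooseDistinguisher).toColumn)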
In my opinion, the type-safe Spark Dataset API is still much less mature than the non-type-safe DataFrame API. Some time ago I was thinking that it could be a good idea to implement a simple-to-use, type-safe aggregation API for Spark Datasets.
Currently, this use case is better handled with DataFrame, which you can later convert back into a Dataset[A].
// Code assumes SQLContext implicits are present
import org.apache.spark.sql.{functions => f}
val colName = "id"
ds.toDF
.withColumn("key", f.concat('x, f.lit(":"), 'y))
.groupBy('key)
.agg(f.countDistinct(f.col(colName)).as("cntd"))
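To then get back to a typed Dataset as mentioned above, something along these lines should work. KeyCount and aggregatedDf are hypothetical names, since the snippet above does not bind its result to a val:
// Hypothetical: assumes the aggregated DataFrame above is bound to a val named aggregatedDf.
case class KeyCount(key: String, cntd: Long)

val typedResult = aggregatedDf.as[KeyCount]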
As you know, if you use saveAsTextFile on an RDD[(String, Int)], the output looks like this:
(T0000036162,1747)
(T0000066859,1704)
(T0000043861,1650)
(T0000075501,1641)
(T0000071951,1638)
(T0000075623,1638)
(T0000070102,1635)
(T0000043868,1627)
(T0000094043,1626)
You may want to use this file in Spark again, so what is the best practice for reading and parsing it? Should it be something like the following, or is there a more elegant way?
val lines = sc.textFile("result/hebe")
case class Foo(id: String, count: Long)
val parsed = lines
  .map(l => l.stripPrefix("(").stripSuffix(")").split(","))
  .map(l => new Foo(id = l(0), count = l(1).toLong))
It depends what you're looking for.
If you want something pretty, I'd consider adding a factory apply method to Foo's companion object which takes a single string, so you could have something like
lines.map(Foo(_))
And Foo would look like
case class Foo(id: String, count: Long)

object Foo {
  def apply(l: String): Foo = {
    val split = l.stripPrefix("(").stripSuffix(")").split(",")
    Foo(split(0), split(1).toLong)
  }
}
If you have no requirement to output the data like that then I'd consider saving it as a sequence file.
If performance isn't an issue then it's fine. I'd just say the most important thing is to isolate the text parsing so that you can unit test it, come back to it later, and easily edit it.
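To make the sequence file option concrete, a minimal sketch might look like this, assuming an RDD[(String, Int)] named rdd; the output path is just a placeholder:
// Writing: (String, Int) pairs have implicit Writable converters, so this works directly.
rdd.saveAsSequenceFile("result/hebe-seq")

// Reading it back, already parsed into the original pair type:
val restored = sc.sequenceFile[String, Int]("result/hebe-seq")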
You should either save it as a DataFrame, which will use the case class as a schema (allowing you to easily parse it back into Spark), or you should map out the individual components of your RDD before saving (i.e. remove the brackets), since they only make the file larger:
yourRDD.toDF("id","count").saveAsParquetFile(path)
When you load the DF back in, you can map it back into an RDD if you want:
val rddInput = input.rdd.map(x => (x.getAs[String]("id"), x.getAs[Int]("count")))
If you prefer to store as a text file, you could consider mapping the elements without the brackets:
yourRDD.map(x => s"${x._1}, ${x._2}")
The best way is to write DataFrames directly to files instead of the RDD.
Code that writes the files:
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val df = rdd.toDF()
df.write.parquet("dir")
Code that reads the files:
val rdd = sqlContext.read.parquet("dir").rdd.map(row => (row.getString(0), row.getLong(1)))
Before calling saveAsTextFile, map each tuple to a comma-separated string with map(x => x.productIterator.mkString(",")):
rdd.map(x => x.productIterator.mkString(",")).saveAsTextFile(path)
The output will not have brackets; it will look like this:
T0000036162,1747
T0000066859,1704
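Reading that bracket-free output back in is then a simple split. A small sketch reusing the Foo case class from the question (path refers to the same output path used above):
// Sketch: parse the comma-separated lines back into the Foo case class from the question.
val parsedBack = sc.textFile(path)
  .map(_.split(","))
  .map(a => Foo(a(0), a(1).toLong))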
I am new to Scala and have read threads related to String Interpolation.
My requirement is that I want to find the type of an expression before actually evaluating the expression.
To make it clear:
val tesType = s"${{10 * Math.cos(0)+ 3}.getClass}"
This gives me the return type of the entire expression.
Is it possible to generalise this by replacing the actual expression by a variable containing the expression?
Something like:
val expression="10 * Math.cos(0)+ 3"
val tesType = s"${{expression}.getClass}"
Would something like this be possible, or am I totally wrong to be thinking in this direction?
Thanks
It's not possible to do this with string interpolation. What you actually want is to compile Scala code at runtime from a string (or a file, etc.).
For example, Twitter's Eval library can be used for this purpose:
https://eknet.org/main/dev/runtimecompilescala.html
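A minimal sketch with Twitter's util-eval, assuming the library is on the classpath (the exact artifact and API may differ between versions):
import com.twitter.util.Eval

// Compile and evaluate the expression held in a string, then inspect its runtime class.
val expression = "10 * Math.cos(0) + 3"
val eval = new Eval()
val value = eval[Double](expression)
println(value.getClass) // double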
If you want the expression to be in a string, then see mst's answer (you can also use scala-compiler itself as a library, but its API is harder to use).
If you have the expression as an actual expression (not a string), you can do
import scala.reflect.runtime.universe._
def typeOfExpr[T: TypeTag](t: => T) = typeOf[T]
typeOfExpr(10 * Math.cos(0)+ 3) // returns Double
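A couple more hypothetical calls, just to show that the type is computed statically and the expression itself is never evaluated (the parameter is by-name):
typeOfExpr("a" * 3)              // returns String
typeOfExpr(sys.error("not run")) // returns Nothing; the expression is not evaluated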
import net.liftweb.json._
import net.liftweb.json.JsonAST._
import net.liftweb.json.Extraction._
import net.liftweb.json.Printer._
implicit val formats = net.liftweb.json.DefaultFormats
val jV = JArray(List(JInt(10),JString("ahem"),JBool(false)))
I am dealing with a situation of mixed types and attempting to convert jV to a List[String] using
jV.extract[List[String]]
The extraction does not work.
Can someone tell me how I should go about doing this?
Lift JSON doesn't have a conversion between Strings and JBools defined in the serialisers.
Does the List inside of the Array always have the same shape? If so, then you can do something like:
case class Datum(id: BigInt, comment: String, bool: Boolean)
val data = jV.extract[List[Datum]]
If that won't work for you as there isn't a uniform shape but you still just want a list of Strings, then you can transform the JBools into JStrings before trying to do the extraction:
jV.map({
case JBool(bool) => if (bool) JString("true") else JString("false")
case x => x
}).extract[List[String]]
In general though, I'd encourage you to think about why you are discarding the type information here. Quite a lot of Scala's power comes from its type system, so it's better to use it than to lose it by string-typing things here.
I would like to implement an external DSL such as SQL in Scala using Macros. I have already seen papers on how to implement internal DSLs with Scala. Also, I've recently written an article about how this can be done in Java, myself.
Now, internal DSLs always feel a bit clumsy as they have to be implemented and used in the host language (e.g. Scala) and adhere to the host language's syntax constraints. That's why I'm hoping that Scala Macros will allow me to internalise an external DSL without any such constraints. However, I don't fully understand Scala Macros and how far I can go with them. I've seen that SLICK and also a much less-known library called sqltyped have started using Macros, but SLICK uses a "Scalaesque" syntax for querying, which isn't really SQL, whereas sqltyped uses Macros to parse SQL strings (which can be done without Macros, too). Also, the various examples given on the Scala website are too trivial for what I'm trying to do.
My question is:
Given an example external DSL defined as some BNF grammar like this:
MyGrammar ::= (
'SOME-KEYWORD' 'OPTION'?
(
( 'CHOICE-1' 'ARG-1'+ )
| ( 'CHOICE-2' 'ARG-2' )
)
)
Can I implement the above grammar using Scala Macros to allow for client programs like this? Or are Scala Macros not powerful enough to implement such a DSL?
// This function would take a Scala compile-checked argument and produce an AST
// of some sort, that I can further process
def evaluate(args: MyGrammar): MyGrammarEvaluated = ...
// These expressions produce a valid result, as the argument is valid according
// to my grammar
val result1 = evaluate(SOME-KEYWORD CHOICE-1 ARG-1 ARG-1)
val result2 = evaluate(SOME-KEYWORD CHOICE-2 ARG-2)
val result3 = evaluate(SOME-KEYWORD OPTION CHOICE-1 ARG-1 ARG-1)
val result4 = evaluate(SOME-KEYWORD OPTION CHOICE-2 ARG-2)
// These expressions produce a compilation error, as the argument is invalid
// according to my grammar
val result5 = evaluate(SOME-KEYWORD CHOICE-1)
val result6 = evaluate(SOME-KEYWORD CHOICE-2 ARG-2 ARG-2)
Note, I'm not interested in solutions that parse strings, the way sqltyped does.
It's been some time since this question was answered by paradigmatic, but I've just stumbled upon it and thought it's worth extending.
An internalized DSL must indeed be valid Scala code with all the names defined before macro expansion; however, one can overcome this restriction with carefully designed syntax and Dynamics.
Let's say we wanted to create a simple, silly DSL allowing us to introduce people in a classy way. It might look like this:
people {
introduce John please
introduce Frank and Lilly please
}
We would like to translate (as part of compilation) the above code to an object (of a class derived for example from class People) containing definitions of fields of type Person for every introduced person - something like this:
new People {
val john: Person = new Person("John")
val frank: Person = new Person("Frank")
val lilly: Person = new Person("Lilly")
}
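The example assumes that Person and People exist somewhere; their exact shape isn't spelled out in the original, but something minimal like this would do:
// Hypothetical minimal definitions assumed by the rest of the example.
class Person(val name: String)
class People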
To make it possible we need to define some artificial objects and classes having two purposes: defining grammar (somewhat...) and tricking the compiler into accepting undefined names (like John or Lilly).
import scala.language.dynamics
trait AllowedAfterName
object and extends Dynamic with AllowedAfterName {
def applyDynamic(personName: String)(arg: AllowedAfterName): AllowedAfterName = this
}
object please extends AllowedAfterName
object introduce extends Dynamic {
def applyDynamic(personName: String)(arg: AllowedAfterName): and.type = and
}
These dummy definitions make our DSL code legal - the compiler translates it to the below code before proceeding to macro expansion:
people {
introduce.applyDynamic("John")(please)
introduce.applyDynamic("Frank")(and).applyDynamic("Lilly")(please)
}
Do we need this ugly and seemingly redundant please? One could probably come up with a nicer syntax, for example using Scala's postfix operator notation (language.postfixOps), but that gets tricky due to semicolon inference (you can try it yourself in the REPL console or IntelliJ's Scala Worksheet). It's easiest to just interlace keywords with undefined names.
As we've made the syntax legal, we can process the block with a macro:
import scala.language.experimental.macros
import scala.reflect.macros.whitebox

def people[A](block: A): People = macro Macros.impl[A]

class Macros(val c: whitebox.Context) {
  import c.universe._

  def impl[A](block: c.Tree) = {
    val introductions = block.children

    // Each applyDynamic call carries one introduced name; recurse into the
    // receiver to collect them all. The trailing `please`/`and` arguments are
    // not needed for the extraction, so they are ignored.
    def getNames(t: c.Tree): List[String] = t match {
      case q"$prefix.applyDynamic(${name: String})($arg)" =>
        getNames(prefix) :+ name
      case _ =>
        Nil
    }

    val names = introductions flatMap getNames

    val defs = names map { n =>
      val varName = TermName(n.toLowerCase())
      q"val $varName: Person = new Person($n)"
    }

    c.Expr[People](q"new People { ..$defs }")
  }
}
The macro finds all the introduced names by pattern matching against the expanded dynamic calls and generates the desired output code. Notice that the macro must be whitebox in order to be allowed to return an expression of a type derived from the one declared in the signature.
I don't think so. The expression you pass to a macro must be a valid Scala expression, and its identifiers must already be defined.