What's the meaning of "$" in Dataset's operators (like select or filter)? - scala

I am a bit confused about using $ to reference columns in DataFrame operators like select or filter.
The following statements work:
df.select("app", "renders").show
df.select($"app", $"renders").show
But, only the first statement in the following works:
df.filter("renders = 265").show // <-- this works
df.filter($"renders" = 265).show // <-- this does not work (!) Why?!
However, this again works:
df.filter($"renders" > 265).show
Basically, what is this $ in DataFrame's operators and when/how should I use it?

Implicits are a major feature of the Scala language that take many different forms--like implicit classes, as we will see shortly. They have different purposes, and they all come with varying levels of debate regarding how useful or dangerous they are. Ultimately though, implicits generally come down to having the compiler convert one type to another when you bring them into scope.
Why does this matter? Because in Spark there is an implicit class called StringToColumn that endows a StringContext with additional functionality. As you can see below, StringToColumn adds the $ method to the Scala class StringContext. This method produces a ColumnName, which extends Column.
The end result of all this is that the $ method allows you to treat the name of a column, represented as a String, as if it were the Column itself. Implicits, when used wisely, can produce convenient conversions like this to make development easier.
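For reference, the definition in Spark's SQLImplicits looks essentially like this (slightly simplified from the Spark source):

implicit class StringToColumn(val sc: StringContext) {
  def $(args: Any*): ColumnName = {
    new ColumnName(sc.s(args: _*))
  }
}

Because StringContext is what backs Scala's string interpolators, writing $"renders" invokes this $ method and yields a ColumnName (and hence a Column) for the column named renders.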
So let's use this to understand what you found:
df.select("app","renders").show -- succeeds because select takes multiple Strings
df.select($"app",$"renders").show -- succeeds because select takes multiple Columnss that result after the implicit conversions are applied
df.filter("renders = 265").show -- succeeds because Spark supports SQL-like filters
df.filter($"renders" = 265).show -- fails because $"renders" is of type Column after implicit conversion, and Columns use the custom === operator for equality (unlike the case in SQL).
df.filter($"renders" > 265).show -- succeeds because you're using a Column after implicit conversion and > is a function on Column.

$ is a way to convert a string to the column with that name.
Both versions of select work because select can receive either columns or strings.
When you do the filter, $"renders" = 265 is an attempt at assigning a number to the column. >, on the other hand, is a comparison method. You should be using === instead of =.

Related

Escape string interpolation in anorm

I want to insert the literal '${a}' into a table using anorm 2.5.2, which means I want to execute the bare SQL query
INSERT INTO `db`.`table` (`a`) VALUES ('${a}');
without using any anorm / string interpolation. When I try to do the following
SQL("INSERT INTO `db`.`table` (`a`) VALUES ('${a}');").execute()
I get an anorm.Sql$MissingParameter: Missing parameter value exception because it tries to use anorm interpolation on ${a} but no value a is available in the scope.
How to escape the anorm / string interpolations $... and ${...}?
The approach from Escape a dollar sign in string interpolation doesn't seem to work here.
You can make ${a} the value of a parameter, i.e.
SQL("""INSERT INTO db.table (a) VALUES ({x})""").on("x" -> s"$${a}")
(s"$${a}" is the way to write "${a}" without getting a warning about possible missing interpolators).
The same can be written equivalently as
val lit = s"$${a}"
SQL"""INSERT INTO db.table (a) VALUES ($lit)"""
The below will probably work, but I am not sure:
SQL"INSERT INTO db.table (a) VALUES ('$${a}')"
It may also be worth asking whether this is intentional behavior or a bug: for parametrized SQL queries, it doesn't make sense to have a parameter inside single quotes.
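For reference, a quick illustration of why s"$${a}" works: in Scala's standard interpolators, a doubled $$ is the escape for a literal dollar sign.

val s = s"$${a}" // $$ escapes the dollar sign, so no interpolation happens
println(s)       // prints: ${a}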

Dataframe: Adding prefix to all columns in Scala

val prefix = "ABC"
val renamedColumns = df.columns.map(c=> df(c).as(s"$prefix$c"))
val dfNew = df.select(renamedColumns: _*)
Hi,
I am fairly new to Scala and the code above works perfectly to add a prefix to all columns. Can someone please explain the breakdown of how it works?
The second line above will return an array of columns aliased as col1 as ABCcol1, col2 as ABCcol2, ... etc.
I have trouble understanding what the third line is doing, especially the : _* at the end.
thanks for your help in advance.
The third line is an example of Scala's syntactic sugar. Essentially, Scala has ways to shorten exactly what you are typing, and you have discovered the dreaded :_*.
There are two portions to this small bit: the : and the _* serve two different purposes. The : is type ascription, which tells the compiler "this is the type that I need to use for this expression". The _* is the varargs ascription (good resource here): it tells the compiler to expand the sequence into a variable-length argument list, so you can pass a method a sequence whose number of elements you do not know in advance.
In your example, you are creating a variable called renamedColumns from the columns of your original dataframe, with the prefix prepended to each name. Although you may know just how many columns are in your df, Scala does not. When you create dfNew, you are running a select statement on the dataframe and passing in your new column names, of which there could be an arbitrary number.
Essentially, you do not know how many columns you may have, so you pass them in as varargs, allowing the number of arguments to be arbitrary.
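A minimal sketch of the same mechanism outside Spark (printAll and names are illustrative, not from any library):

def printAll(items: String*): Unit = items.foreach(println) // a varargs parameter

val names = Seq("one", "two", "three")
printAll(names: _*) // : _* expands the Seq into individual arguments

This is exactly what select(renamedColumns: _*) does: select is declared with Column* varargs, and : _* expands the array of renamed columns into that argument list.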

How to define a union type that works at runtime?

Following on from this excellent set of answers on how to define union types in Scala. I've been using the Miles Sabin definition of union types, but one question remains.
How do you work with these if the type isn't known until runtime? For example:
trait inv[-A] {}
type Or[A, B] = {
  type check[X] = (inv[A] with inv[B]) <:< inv[X]
}
case class Foo[A : (Int Or String)#check](a: A)
Foo(1) // Foo[Int] = Foo(1)
Foo("hi") // Foo[String] = Foo(hi)
Foo(2.0) // Error!
This example works since the parameter A is known at compile time, and calling Foo(1) is really calling Foo[Int](1). However, what do you do if parameter A isn't known until runtime? Maybe you're parsing a file that contains the data for Foos, in which case the type parameter of Foo isn't known until you read the data. There's no easy way to set parameter A in this case.
The best solutions I've been able to come up with are:
Pattern match on the data you've read and then create different Foos based on that type. In my case this isn't feasible because my case class actually contains dozens of union types, so there'd be hundreds of combinations of types to pattern match.
Cast the type you've just read to be (String or Int), so you have a single type to pass around, that passes the Type Class constraint when you create Foo with it. Then return Foo[_] instead. This puts the onus back on the Foo user to work out the type of each field (since they'll appear to be type Any), but at least it defers having to know the type until the field is actually used, in which case a pattern match seems more tractable.
The second solution looks like this:
def parseLine: Any // Parses a data point, which can be either a String or
                   // an Int, so returns Any.

def mkFoo: Foo[_] = {
  val a = parseLine.asInstanceOf[Int with String]
  Foo(a) // Passes type constraint now
}
In practice I've ended up using the second solution, but I'm wondering if there's something better I can do?
Another way to state the problem is: What does it mean to return a Union Type? Functions can only return a single type, and the trickery we use with Miles Sabin union types is only useful for the types you pass in, not for the types you return.
PS. For context, why this is a problem in my case is that I'm generating a set of case classes from a Json schema file. Json naturally supports union types, so I would like my case classes to reflect that too. This works great in one direction: users creating case classes to be serialized out to Json. But it gets sticky in the other direction: users parsing Json files to have a set of populated case classes returned to them.
The "standard" Scala solution to this problem is to use an ordinary discriminated-union type (ie, to forego true union types altogether):
sealed trait Foo
case class IntFoo(x: Int) extends Foo
case class StringFoo(x: String) extends Foo
This reflects the fact that, as you observe, the particular type of the member is a runtime value; the JVM type-tag of the Foo instance provides this runtime value.
Miles Sabin's implementation of union types is very clever, but I'm not sure if it provides any practical benefit, because it only restricts the type of thing that can go into a Foo, but provides the user of a Foo with no computable version of that restriction, in the way a match provides you with a computable version of the sealed trait. In general, for a restriction to be useful, it needs two sides: a check that only the right things are put in, and an extractor (aka an eliminator) that allows the same right things to come out the other end.
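To make the extractor side concrete, here is a minimal sketch against the sealed hierarchy above (describe is an illustrative name):

def describe(foo: Foo): String = foo match {
  case IntFoo(x)    => s"an Int: $x"
  case StringFoo(s) => s"a String: $s"
}

Because Foo is sealed, the compiler checks this match for exhaustiveness, which is exactly the computable version of the restriction mentioned above.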
Perhaps if you gave some explanation of why you're looking for a purer union type it would illuminate whether regular discriminated unions are sufficient or if you really need something more.
There's a reason every JSON parser for Scala requires well defined types into which the JSON will be converted, even if some fields have to be dropped: you cannot work with something you don't know the type of.
To give an example, say you have a, and maybe a is a String, maybe it's an Int, but you don't know what it is. What computation could you possibly make with a, not knowing its type? Why would your code compute the sum of all a's, for instance, if you didn't know in advance that it was a number?
Generally, the answer to that is that you want to perform user-driven data manipulation at runtime over data with unknown characteristics: the user sees that a field is a number and decides they want the sum of that field. Fine, but if so, you are going about it the wrong way.
There is a well defined way to represent JSON data in Scala (and, for that matter, any data that has the same characteristics as JSON): a hierarchy of classes. A JSON value may be a JSON object, an array, or one of a number of primitives. A JSON object contains a list of key/value pairs, whose keys are JSON strings and whose values are JSON values. And so on. This is easy to represent, and there are many libraries doing so already. In fact, there are so many that there's a project called Json4s, which presents a unified API that can be used with, and is implemented by, many of the aforementioned libraries.
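A minimal sketch of such a hierarchy (the names here are illustrative; Json4s and similar libraries define essentially this shape under their own names):

sealed trait JsonValue
case object JsonNull extends JsonValue
case class JsonBoolean(value: Boolean) extends JsonValue
case class JsonNumber(value: BigDecimal) extends JsonValue
case class JsonString(value: String) extends JsonValue
case class JsonArray(items: List[JsonValue]) extends JsonValue
case class JsonObject(fields: List[(String, JsonValue)]) extends JsonValue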
Things like the records which Miles Sabin's Shapeless library provides are intended to be used when the input doesn't have a well defined schema, but the program knows what it needs from that input. And, yes, the program might know what to do with a if it is an Int or a String, but not for every possible value.
The next Scala 3 (mid-2020), based on Dotty, will implement the union types proposal from September 2018.
You can see it in "A Tour of Scala 3" (June 2019):
Union types provide ad-hoc combinations of types:
Subsetting = subtyping
No boxing overhead
case class UserName(name: String)
case class Password(hash: Hash)
def help(id: UserName | Password) = {
  val user = id match {
    case UserName(name) => lookupName(name)
    case Password(hash) => lookupPassword(hash)
  }
  ...
}
Union types also work with singleton types:
Great for JS interop
type Command = "Click" | "Drag" | "KeyPressed"
def handleEvent(kind: Command) = kind match {
  case "Click"      => MouseClick()
  case "Drag"       => MoveTo()
  case "KeyPressed" => KeyPressed()
}

Anorm: WHERE condition, conditionally

Consider a repository/DAO method like this, which works great:
def countReports(customerId: Long, createdSince: ZonedDateTime) =
  DB.withConnection { implicit c =>
    SQL"""SELECT COUNT(*)
          FROM report
          WHERE customer_id = $customerId
            AND created >= $createdSince
       """.as(scalar[Int].single)
  }
But what if the method is defined with optional parameters:
def countReports(customerId: Option[Long], createdSince: Option[ZonedDateTime])
Point being, if either optional argument is present, use it in filtering the results (as shown above), and otherwise (in case it is None) simply leave out the corresponding WHERE condition.
What's the simplest way to write this method with optional WHERE conditions? As Anorm newbie I was struggling to find an example of this, but I suppose there must be some sensible way to do it (that is, without duplicating the SQL for each combination of present/missing arguments).
Note that the java.time.ZonedDateTime instance maps perfectly and automatically into a Postgres timestamptz when used inside the Anorm SQL call. (Trying to extract the WHERE condition as a string created with normal string interpolation, outside SQL, did not work: toString produces a representation not understood by the database.)
Play 2.4.4
One approach is to set up filter clauses such as
val customerClause =
  if (customerId.isEmpty) ""
  else " and customer_id = {customerId}"
then substitute these into your SQL:
SQL(s"""
select count(*)
from report
where true
$customerClause
$createdClause
""")
.on('customerId -> customerId,
'createdSince -> createdSince)
.as(scalar[Int].singleOpt).getOrElse(0)
Using {variable} as opposed to $variable is, I think, preferable, as it reduces the risk of SQL injection attacks where someone calls your method with a malicious string. Anorm doesn't mind if you pass additional symbols that aren't referenced in the SQL (i.e. if a clause string is empty). Lastly, depending on the database(?), a count might return no rows, so I use singleOpt rather than single.
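Putting those pieces together, the whole method might look like this (a sketch, untested, assuming the question's imports and Anorm's support for Option parameters, which the answer further below also relies on):

def countReports(customerId: Option[Long], createdSince: Option[ZonedDateTime]): Int = {
  val customerClause = if (customerId.isEmpty) "" else " and customer_id = {customerId}"
  val createdClause  = if (createdSince.isEmpty) "" else " and created >= {createdSince}"
  DB.withConnection { implicit c =>
    SQL(s"""
        select count(*)
        from report
        where true
        $customerClause
        $createdClause
        """)
      .on('customerId -> customerId, 'createdSince -> createdSince)
      .as(scalar[Int].singleOpt).getOrElse(0)
  }
}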
I'm curious as to what other answers you receive.
Edit: Anorm interpolation (i.e. SQL"...", an interpolation implementation beyond Scala's s"...", f"..." and raw"...") was introduced to allow the use of $variable as equivalent to {variable} with .on. And from Play 2.4, Scala and Anorm interpolation can be mixed, using $ for Anorm (SQL parameter/variable) and #$ for Scala (plain string). And indeed this works well, as long as the Scala interpolated string does not contain references to an SQL parameter. The only way, in 2.4.4, I could find to use a variable in a Scala interpolated string when using Anorm interpolation was:
val limitClause = if (nameFilter="") "" else s"where name>'$nameFilter'"
SQL"select * from tab #$limitClause order by name"
But this is vulnerable to SQL injection (e.g. a string like it's will cause a runtime syntax exception). So, in the case of variables inside interpolated strings, it seems it is necessary to use the "traditional" .on approach with only Scala interpolation:
val limitClause = if (nameFilter="") "" else "where name>{nameFilter}"
SQL(s"select * from tab $limitClause order by name").on('limitClause -> limitClause)
Perhaps in the future Anorm interpolation could be extended to parse the interpolated string for variables?
Edit2: I'm finding there are some tables where the number of attributes that might or might not be included in the query changes from time to time. For these cases I'm defining a context class, e.g. CustomerContext. In this case class there are lazy vals for the different clauses that affect the SQL. Callers of the SQL method must supply a CustomerContext, and the SQL will then have inclusions such as ${context.createdClause} and so on. This helps give consistency, as I end up using the context in other places (such as total record count for paging, etc.).
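A hypothetical sketch of that context-class pattern (the fields and clause names are illustrative):

case class CustomerContext(customerId: Option[Long], createdSince: Option[ZonedDateTime]) {
  lazy val customerClause: String =
    if (customerId.isEmpty) "" else " and customer_id = {customerId}"
  lazy val createdClause: String =
    if (createdSince.isEmpty) "" else " and created >= {createdSince}"
}

// At the call site, the SQL string then includes ${context.customerClause} and so on.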
Finally got this simpler approach posted by Joel Arnold to work in my example case, also with ZonedDateTime!
def countReports(customerId: Option[Long], createdSince: Option[ZonedDateTime]) =
  DB.withConnection { implicit c =>
    SQL("""
      SELECT count(*) FROM report
      WHERE ({customerId} is null or customer_id = {customerId})
        AND ({created}::timestamptz is null or created >= {created})
      """)
      .on('customerId -> customerId, 'created -> createdSince)
      .as(scalar[Int].singleOpt).getOrElse(0)
  }
The tricky part is having to use {created}::timestamptz in the null check. As Joel commented, this is needed to work around a PostgreSQL driver issue.
Apparently the cast is needed only for timestamp types, and the simpler way ({customerId} is null) works with everything else. Also, comment if you know whether other databases require something like this, or if this is a Postgres-only peculiarity.
(While wwkudu's approach also works fine, this definitely is cleaner, as you can see comparing them side by side in a full example.)

SQL DSL for Scala

I am struggling to create a SQL DSL for Scala. The DSL is an extension to Querydsl, which is a popular Query abstraction layer for Java.
I am struggling now with really simple expressions like the following
user.firstName == "Bob" || user.firstName == "Ann"
As Querydsl already supports an expression model which can be used here, I decided to provide conversions from proxy objects to Querydsl expressions. In order to use the proxies I create an instance like this
import com.mysema.query.alias.Alias._
var user = alias(classOf[User])
With the following implicit conversions I can convert proxy instances and proxy property call chains into Querydsl expressions
import com.mysema.query.alias.Alias._
import com.mysema.query.types.expr._
import com.mysema.query.types.path._
object Conversions {
  def not(b: EBoolean): EBoolean = b.not()
  implicit def booleanPath(b: Boolean): PBoolean = $(b)
  implicit def stringPath(s: String): PString = $(s)
  implicit def datePath(d: java.sql.Date): PDate[java.sql.Date] = $(d)
  implicit def dateTimePath(d: java.util.Date): PDateTime[java.util.Date] = $(d)
  implicit def timePath(t: java.sql.Time): PTime[java.sql.Time] = $(t)
  implicit def comparablePath(c: Comparable[_]): PComparable[_] = $(c)
  implicit def simplePath(s: Object): PSimple[_] = $(s)
}
Now I can construct expressions like this
import com.mysema.query.alias.Alias._
import com.mysema.query.scala.Conversions._
var user = alias(classOf[User])
var predicate = (user.firstName like "Bob") or (user.firstName like "Ann")
I am struggling with the following problem: eq and ne are already available as methods in Scala, so the conversions aren't triggered when they are used.
This problem can be generalized as follows: when using method names that are already available in Scala types, such as eq, ne, startsWith etc., one needs some kind of escaping to trigger the implicit conversions.
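To illustrate: the compiler only searches for an implicit conversion when a member is missing on the receiver's static type, so a sketch like the following silently compiles to reference equality instead of building an expression tree:

// Compiles without any conversion being applied: eq is already defined
// on AnyRef as reference equality, so the result is a plain Boolean,
// not a Querydsl EBoolean.
val b = user.firstName eq "Bob"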
I am considering the following
Uppercase
var predicate = (user.firstName LIKE "Bob") OR (user.firstName LIKE "Ann")
This is for example the approach in Circumflex ORM, a very powerful ORM framework for Scala with similar DSL aims. But this approach would be inconsistent with the query keywords (select, from, where etc), which are lowercase in Querydsl.
Some prefix
var predicate = (user.firstName :like "Bob") :or (user.firstName :like "Ann")
The context of the predicate usage is something like this
var user = alias(classOf[User])
query().from(user)
  .where(
    (user.firstName like "Bob") or (user.firstName like "Ann"))
  .orderBy(user.firstName asc)
  .list(user);
Do you see better options or a different approach for SQL DSL construction for Scala?
So the question basically boils down to two cases:
Is it possible to trigger an implicit type conversion when using a method that already exists in the superclass (e.g. eq)?
If it is not possible, what would be the most Scalaesque syntax to use for methods like eq and ne?
EDIT
We got Scala support in Querydsl working by using alias instances and a $-prefix based escape syntax. Here is a blog post on the results : http://blog.mysema.com/2010/09/querying-with-scala.html
There was a very good talk at Scala Days, Type-safe SQL embedded in Scala by Christoph Wulf; see the video of that talk online.
Mr Westkämper - I was pondering this problem, and I wondered if it would be possible to use 'tracer' objects, where the basic data types such as Int and String would be extended such that they contained source information, and the results of combining them would likewise hold within themselves their sources and the nature of the combination.
For example, your user.firstName method would return a TracerString, which extends String, but which also indicates that the String corresponds to a column in a relation. The == method would be overridden such that it returns an EqualityTracerBoolean, which extends Boolean. This would preserve the standard Scala semantics. However, the constructor for EqualityTracerBoolean would record the fact that the result of the expression was derived by comparing a column in a relation to a string constant. Your 'where' method could then analyse the EqualityTracerBoolean object returned by the conditional expression evaluated over a dummy argument in order to derive the expression used to create it.
There would have to be override defs for inequality operators, as well as plus and minus, for Ints, and whatever else you wished to represent from SQL, and corresponding tracer classes for each of these. It would be a bit of a project!
Anyway, I decided not to bother, and use squeryl instead.
I didn't have the exact same problem with jOOQ, as I'm using slightly more verbose operator names: equal, notEqual, etc. instead of eq, ne. On the other hand, there is a val operator in jOOQ for explicitly creating bind values, which I had to overload with value, as val is a keyword in Scala. Is overloading operators an option for you? I documented my attempts at running jOOQ in Scala here:
http://lukaseder.wordpress.com/2011/12/11/the-ultimate-sql-dsl-jooq-in-scala/
Just like you, I had also thought about capitalising all keywords in a major release (including SELECT, FROM, etc). But that will leave an open question about whether "compound" keywords should be split into two method calls or connected by an underscore: GROUP().BY() or GROUP_BY(). WHEN().MATCHED().THEN().UPDATE() or WHEN_MATCHED_THEN_UPDATE(). Since the result is not really satisfying, I guess it's not worth breaking backwards compatibility for such a fix, even if the two-method-call option would look very very nice in Scala, as . and () can be omitted. So maybe jOOQ and QueryDSL should both be "wrapped" (as opposed to "extended") by a dedicated Scala API?
What about decompiling the bytecode at runtime? I started to write such a tool:
http://h2database.com/html/jaqu.html#natural_syntax
I know it's a hack, so please don't vote -1 :-) I just wanted to mention it. It's a relatively novel approach. Instead of decompiling at runtime, it might be possible to do it at compile time using an annotation processor; not sure if that's possible using Scala (and not sure if it's really possible with Java, but Project Lombok seems to do something like that).
I know it's a hack, so please don't vote -1 :-) I just wanted to mentioned it. It's a relatively novel approach. Instead of decompiling at runtime, it might be possible to do it at compile time using an annotation processor, not sure if that's possible using Scala (and not sure if it's really possible with Java, but Project Lombok seems to do something like that).