Apache Spark: get elements of Row by name - scala

In a DataFrame object in Apache Spark (I'm using the Scala interface), if I'm iterating over its Row objects, is there any way to extract values by name? I can see how to do some really awkward stuff:
def foo(r: Row) = {
  val ix = (0 until r.schema.length).map(i => r.schema(i).name -> i).toMap
  val field1 = r.getString(ix("field1"))
  val field2 = r.getLong(ix("field2"))
  ...
}
dataframe.map(foo)
I figure there must be a better way - this is pretty verbose, it requires creating this extra structure, and it also requires knowing the types explicitly, which, if incorrect, will produce a runtime exception rather than a compile-time error.

You can use "getAs" from org.apache.spark.sql.Row
r.getAs("field1")
r.getAs("field2")
Know more about getAs(java.lang.String fieldName)
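For example, the helper from the question can be reduced to this (a sketch assuming field1 really holds a String and field2 a Long, as in the question; getAs still fails at runtime if the name or type is wrong):
import org.apache.spark.sql.Row

def foo(r: Row) = {
  val field1 = r.getAs[String]("field1")
  val field2 = r.getAs[Long]("field2")
  (field1, field2)
}
dataframe.map(foo)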

This is not supported at this time in the Scala API. The closest you have is this JIRA titled "Support converting DataFrames to typed RDDs"

Related

How can I dynamically (at runtime) generate a sorted collection in Scala using java.lang.reflect.Type?

Given an array of items I need to generate a sorted collection in Scala for a java.lang.reflect.Type but I'm unable to do so. The following snippet might explain better.
def buildList(paramType: Type): SortedSet[_] = {
  val collection = new Array[Any](5)
  for (i <- 0 until 5) {
    collection(i) = new EasyRandom().nextObject(paramType.asInstanceOf[Class[Any]])
  }
  SortedSet(collection: _*)
}
I'm unable to do so, as I get the error "No implicits found for parameter ord: Ordering[Any]". I'm able to work around this if I swap to an unsorted type such as Set.
def buildList(paramType: Type): Set[_] = {
  val collection = new Array[Any](5)
  for (i <- 0 until 5) {
    collection(i) = new EasyRandom().nextObject(paramType.asInstanceOf[Class[Any]])
  }
  Set(collection: _*)
}
How can I dynamically build a sorted set at runtime? I've been looking into how Jackson tries to achieve the same but I couldn't quite follow how to get T here: https://github.com/FasterXML/jackson-module-scala/blob/0e926622ea4e8cef16dd757fa85400a0b9dcd1d3/src/main/scala/com/fasterxml/jackson/module/scala/introspect/OrderingLocator.scala#L21
(Please excuse me if my question is unclear.)
This happens because SortedSet needs a contextual (implicit) Ordering type class instance for the element type A.
However, as Luis said in the comments, I'd strongly advise you against this approach, in favour of a safer, strongly typed one.
Generating random instances of your case classes (which I suppose you're using, since you're using Scala) should be easy with the help of ScalaCheck's Arbitrary type class, whose instances can be derived generically with libraries like Magnolia. That would turn your code into something like this:
import scala.collection.immutable.SortedSet
import org.scalacheck.Arbitrary

def randomList[A: Ordering: Arbitrary]: SortedSet[A] = {
  val arb: Arbitrary[A] = implicitly[Arbitrary[A]]
  // sample returns Option[A], so flatMap drops the occasional None
  val sampleData = (1 to 5).flatMap(_ => arb.arbitrary.sample)
  SortedSet(sampleData: _*)
}
This approach involves some heavy concepts like implicits and type classes, but is way safer.
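For example, assuming ScalaCheck's predefined Arbitrary[Int] and the standard Ordering[Int] are in scope, it could be used like this:
val randomInts: SortedSet[Int] = randomList[Int]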

Spark scala data frame udf returning rows

Say I have a dataframe which contains a column (called colA) which is a seq of row. I want to append a new field to each record of colA. (The new field is derived from the existing record, so I have to write a udf.)
How should I write this udf?
I have tried to write a udf which takes colA as input and outputs Seq[Row], where each record contains the new field. But the problem is that the udf cannot return Seq[Row]. The exception is 'Schema for type org.apache.spark.sql.Row is not supported'.
What should I do?
The udf that I wrote:
val convert = udf[Seq[Row], Seq[Row]](blablabla...)
And the exception is java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Row is not supported
Since Spark 2.0 you can create UDFs which return Row / Seq[Row], but you must provide the schema for the return type, e.g. if you work with an Array of Doubles:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types.{ArrayType, DoubleType}

val schema = ArrayType(DoubleType)

val myUDF = udf((s: Seq[Row]) => {
  s // just pass data without modification
}, schema)
But I can't really imagine where this is useful; I would rather return tuples or case classes (or a Seq thereof) from the UDFs.
EDIT: It could be useful if your row contains more than 22 fields (the limit of fields for tuples/case classes).
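For illustration, here is a sketch of building and applying such a row-returning UDF with an explicit struct schema; the column colA is taken from the question, while the field names value and newField and the computation are assumptions for the example. Note that, as described in the next answer, this untyped udf overload is deprecated in Spark 3.
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types._

// Each element of the returned array is a struct with the original string field
// plus a newly computed Long field.
val outSchema = ArrayType(StructType(Seq(
  StructField("value", StringType),
  StructField("newField", LongType)
)))

val addField = udf(
  (rows: Seq[Row]) => rows.map { r =>
    Row(r.getAs[String]("value"), r.getAs[String]("value").length.toLong)
  },
  outSchema
)

def withNewField(df: DataFrame): DataFrame =
  df.withColumn("colA", addField(col("colA")))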
This is an old question, I just wanted to update it according to the new version of Spark.
Since Spark 3.0.0, the method that @Raphael Roth has mentioned is deprecated. Hence, you might get an AnalysisException. The reason is that the input closure using this method doesn't have type checking and the behavior might be different from what we expect in SQL when it comes to null values.
If you really know what you're doing, you need to set spark.sql.legacy.allowUntypedScalaUDF configuration to true.
Another solution is to use case class instead of schema. For example,
case class Foo(field1: String, field2: String)

val convertFunction: Seq[Row] => Seq[Foo] = input => {
  input.map { x =>
    // do something with x and convert to Foo
  }
}

val myUdf = udf(convertFunction)
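For instance, a filled-in version of the sketch above, assuming each input Row carries two string fields named a and b and that the dataframe is called df with the colA column from the question (the field names and the conversion itself are illustrative):
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}

case class Foo(field1: String, field2: String)

// Illustrative conversion: read two string fields from each Row and build a Foo.
val convertFunction: Seq[Row] => Seq[Foo] = input =>
  input.map(x => Foo(x.getAs[String]("a"), x.getAs[String]("b")))

val myUdf = udf(convertFunction)
val converted = df.withColumn("colA", myUdf(col("colA")))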

Dataset.groupByKey + untyped aggregation functions

Suppose I have types like these:
case class SomeType(id: String, x: Int, y: Int, payload: String)
case class Key(x: Int, y: Int)
Then suppose I did groupByKey on a Dataset[SomeType] like this:
val input: Dataset[SomeType] = ...
val grouped: KeyValueGroupedDataset[Key, SomeType] =
input.groupByKey(s => Key(s.x, s.y))
Then suppose I have a function which determines which field I want to use in an aggregation:
val chooseDistinguisher: SomeType => String = _.id
And now I would like to run an aggregation function over the grouped dataset, for example, functions.countDistinct, using the field obtained by the function:
grouped.agg(
countDistinct(<something which depends on chooseDistinguisher>).as[Long]
)
The problem is, I cannot create a UDF from chooseDistinguisher, because countDistinct accepts a Column, and to turn a UDF into a Column you need to specify the input column names, which I cannot do - I do not know which name to use for the "values" of a KeyValueGroupedDataset.
I think it should be possible, because KeyValueGroupedDataset itself does something similar:
def count(): Dataset[(K, Long)] = agg(functions.count("*").as(ExpressionEncoder[Long]()))
However, this method cheats a bit because it uses "*" as the column name, but I need to specify a particular column (i.e. the column of the "value" in a key-value grouped dataset). Also, when you use typed functions from the typed object, you also do not need to specify the column name, and it works somehow.
So, is it possible to do this, and if it is, how to do it?
As far as I know it's not possible with the agg transformation, which expects a TypedColumn that is constructed from a Column using the as method, so you need to start from a non-type-safe expression. If somebody knows a solution I would be interested to see it...
If you need type-safe aggregation you can use one of the approaches below:
mapGroups - where you implement a Scala function responsible for aggregating the Iterator of values
implement your own custom Aggregator (a sketch is shown after the mapGroups example below)
The first approach needs less code, so below I'm showing a quick example:
def countDistinct[T](values: Iterator[T])(chooseDistinguisher: T => String): Long =
  values.map(chooseDistinguisher).toSeq.distinct.size

ds
  .groupByKey(s => Key(s.x, s.y))
  .mapGroups((k, vs) => (k, countDistinct(vs)(_.id)))
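For completeness, here is a minimal sketch of the second approach, a custom Aggregator parameterised by the distinguisher function (the SomeType, Key and grouped definitions are taken from the question):
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Collects the distinct distinguisher values per group and returns their count.
def distinctCount[T](chooseDistinguisher: T => String): Aggregator[T, Set[String], Long] =
  new Aggregator[T, Set[String], Long] {
    def zero: Set[String] = Set.empty
    def reduce(buffer: Set[String], value: T): Set[String] = buffer + chooseDistinguisher(value)
    def merge(b1: Set[String], b2: Set[String]): Set[String] = b1 ++ b2
    def finish(reduction: Set[String]): Long = reduction.size.toLong
    def bufferEncoder: Encoder[Set[String]] = Encoders.kryo[Set[String]]
    def outputEncoder: Encoder[Long] = Encoders.scalaLong
  }

val result = grouped.agg(distinctCount[SomeType](_.id).toColumn)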
In my opinion the type-safe Spark Dataset API is still much less mature than the non-type-safe DataFrame API. Some time ago I was thinking that it could be a good idea to implement a simple-to-use, type-safe aggregation API for Spark Dataset.
Currently, this use case is better handled with DataFrame, which you can later convert back into a Dataset[A].
// Code assumes SQLContext implicits are present
import org.apache.spark.sql.{functions => f}

val colName = "id"

ds.toDF
  .withColumn("key", f.concat('x, f.lit(":"), 'y))
  .groupBy('key)
  .agg(f.countDistinct(f.col(colName)).as("cntd"))
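And, as mentioned above, the result can be converted back into a typed Dataset; a small sketch reusing the snippet above (same implicits assumption):
import org.apache.spark.sql.Dataset

// key is the concatenated String, cntd the distinct count
val typed: Dataset[(String, Long)] = ds.toDF
  .withColumn("key", f.concat('x, f.lit(":"), 'y))
  .groupBy('key)
  .agg(f.countDistinct(f.col(colName)).as("cntd"))
  .as[(String, Long)]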

Passing parameters to scala slick query [duplicate]

There is a similar question here but it doesn't actually answer the question.
Is it possible to use IN clause in plain sql Slick?
Note that this is actually part of a larger and more complex query, so I do need to use plain sql instead of slick's lifted embedding. Something like the following will be good:
val ids = List(2,4,9)
sql"SELECT * FROM coffee WHERE id IN ($ids)"
The sql prefix unlocks a StringContext where you can set SQL parameters. There is no SQL parameter type for a list, so you can easily open yourself up to SQL injection here if you're not careful. There are some good (and some dangerous) suggestions about dealing with this problem with SQL Server on this question. You have a few options:
Your best bet is probably to use the #$ operator together with mkString to interpolate dynamic SQL:
val sql = sql"""SELECT * FROM coffee WHERE id IN (#${ids.mkString(",")})"""
This doesn't properly use parameters and therefore might be open to SQL-injection and other problems.
Another option is to use regular string interpolation and mkString to build the statement:
val query = s"""SELECT * FROM coffee WHERE id IN (${ids.mkString(",")})"""
StaticQuery.queryNA[Coffee](query)
This is essentially the same approach as using #$, but might be more flexible in the general case.
If SQL-injection vulnerability is a major concern (e.g. if the elements of ids are user provided), you can build a query with a parameter for each element of ids. Then you'll need to provide a custom SetParameter instance so that slick can turn the List into parameters:
implicit val setStringListParameter = new SetParameter[List[String]] {
  def apply(v1: List[String], v2: PositionedParameters): Unit = {
    v1.foreach(v2.setString)
  }
}

val idsInClause = List.fill(ids.length)("?").mkString("(", ",", ")")
val query = s"""SELECT * FROM coffee WHERE id IN ($idsInClause)"""
Q.query[List[String], String](query).apply(ids).list(s)
Since your ids are Ints, this is probably less of a concern, but if you prefer this method, you would just need to change the setStringListParameter to use Int instead of String:
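A minimal sketch of that variant, mirroring the List[String] version above:
implicit val setIntListParameter = new SetParameter[List[Int]] {
  def apply(v1: List[Int], v2: PositionedParameters): Unit = {
    v1.foreach(v2.setInt)
  }
}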
@Ben Reich is right.
Here is another code sample, tested on Slick 3.1.0; the key part is the ($ids#${",?" * (ids.size - 1)}) interpolation:
val ids = List(610113193610210035L, 220702198208189710L)

implicit object SetListLong extends SetParameter[List[Long]] {
  def apply(vList: List[Long], pp: PositionedParameters): Unit = {
    vList.foreach(pp.setLong)
  }
}

val select = sql"""
  select idnum from idnum_0
  where idnum in ($ids#${",?" * (ids.size - 1)})
""".as[Long]
Although this is not a universal answer and may not be what the author wanted, I still want to point it out to whoever views this question.
Some DB backends support array types, and there are extensions to Slick that allow setting these array types in the interpolations.
For example, Postgres has the syntax where column = any(array), and with slick-pg you can use this syntax like so:
def query(ids: Seq[Long]) = db.run(sql"select * from table where ids = any($ids)".as[Long])
This brings a much cleaner syntax, which is friendlier to the statement compiler cache and also safe from SQL injections and overall danger of creating a malformed SQL with the #$var interpolation syntax.
I have written a small extension to Slick that addresses exactly this problem: https://github.com/rtkaczyk/inslick
For the given example the solution would be:
import accode.inslick.syntax._
val ids = List(2,4,9)
sqli"SELECT * FROM coffee WHERE id IN *$ids"
Additionally InSlick works with iterables of tuples or case classes. It's available for all Slick 3.x versions and Scala versions 2.11 - 2.13. We've been using it in production for several months at the company I work for.
The interpolation is safe from SQL injection. It utilises a macro which rewrites the query in a way similar to trydofor's answer
Ran into essentially this same issue in Slick 3.3.3 when trying to use a Seq[Long] in an IN query for MySQL. Kept getting a compilation error from Slick of:
could not find implicit value for parameter e: slick.jdbc.SetParameter[Seq[Long]]
The original question would have been getting something like:
could not find implicit value for parameter e: slick.jdbc.SetParameter[List[Int]]
Slick 3.3.X+ can handle binding the parameters for the IN query, as long as we provide the implicit definition of how Slick should do so for the types we're using. This means adding the implicit val definition somewhere at the class level. So, like:
class MyClass {

  // THIS IS THE KEY LINE THAT ENABLES SLICK TO BIND THE PARAMS
  implicit val setListInt = SetParameter[List[Int]]((inputList, params) => inputList.foreach(params.setInt))

  def queryByHardcodedIds() = {
    val ids: List[Int] = List(2, 4, 9)
    sql"SELECT * FROM coffee WHERE id IN ($ids)" // SLICK CAN AUTO-HANDLE BINDING NOW
  }
}
Similar for the case of Seq[Long] & others. Just make sure your types/binding aligns to what you need Slick to handle:
implicit val setSeqLong = SetParameter[Seq[Long]]((inputList, params) => inputList.foreach(params.setLong))
// ^^Note the `SetParameter[Seq[Long]]` & `.setLong` for type alignment
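With that implicit in scope, the whole sequence can be bound directly, mirroring the List[Int] example above (the coffee table is reused from the question, and the usual Slick profile api import, e.g. slick.jdbc.MySQLProfile.api._, is assumed):
val ids: Seq[Long] = Seq(2L, 4L, 9L)
val action = sql"SELECT id FROM coffee WHERE id IN ($ids)".as[Long]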

scala: map-like structure that doesn't require casting when fetching a value?

I'm writing a data structure that converts the results of a database query. The raw structure is a java ResultSet and it would be converted to a map or class which permits accessing different fields on that data structure by either a named method call or passing a string into apply(). Clearly different values may have different types. In order to reduce burden on the clients of this data structure, my preference is that one not need to cast the values of the data structure but the value fetched still has the correct type.
For example, suppose I'm doing a query that fetches two column values, one an Int, the other a String, and the names of the columns are "a" and "b" respectively. Some ideal syntax might be the following:
val javaResultSet = dbQuery("select a, b from table limit 1")
// with ResultSet, particular values can be accessed like this:
val a = javaResultSet.getInt("a")
val b = javaResultSet.getString("b")
// but this syntax is undesirable.
// since I want to convert this to a single data structure,
// the preferred syntax might look something like this:
val newStructure = toDataStructure[Int, String](javaResultSet)("a", "b")
// that is, I'm willing to state the types during the instantiation
// of such a data structure.
// then,
val a: Int = newStructure("a") // OR
val a: Int = newStructure.a
// in both cases, "val a" does not require asInstanceOf[Int].
I've been trying to determine what sort of data structure might allow this and I could not figure out a way around the casting.
The other requirement is obviously that I would like to define a single data structure used for all db queries. I realize I could easily define a case class or similar per call and that solves the typing issue, but such a solution does not scale well when many db queries are being written. I suspect some people are going to propose using some sort of ORM, but let us assume for my case that it is preferred to maintain the query in the form of a string.
Anyone have any suggestions? Thanks!
To do this without casting, one needs more information about the query, and one needs that information at compile time.
I suspect some people are going to propose using some sort of ORM, but let us assume for my case that it is preferred to maintain the query in the form of a string.
Your suspicion is right and you will not get around this. If current ORMs or DSLs like squeryl don't suit your fancy, you can create your own one. But I doubt you will be able to use query strings.
The basic problem is that you don't know how many columns there will be in any given query, and so you don't know how many type parameters the data structure should have and it's not possible to abstract over the number of type parameters.
There is however, a data structure that exists in different variants for different numbers of type parameters: the tuple. (E.g. Tuple2, Tuple3 etc.) You could define parameterized mapping functions for different numbers of parameters that returns tuples like this:
def toDataStructure2[T1, T2](rs: ResultSet)(c1: String, c2: String) =
(rs.getObject(c1).asInstanceOf[T1],
rs.getObject(c2).asInstanceOf[T2])
def toDataStructure3[T1, T2, T3](rs: ResultSet)(c1: String, c2: String, c3: String) =
(rs.getObject(c1).asInstanceOf[T1],
rs.getObject(c2).asInstanceOf[T2],
rs.getObject(c3).asInstanceOf[T3])
You would have to define these for as many columns as you expect to have in your tables (max 22).
This of course depends on using getObject and casting it to the given type being safe.
In your example you could use the resulting tuple as follows:
val (a, b) = toDataStructure2[Int, String](javaResultSet)("a", "b")
If you decide to go the route of heterogeneous collections, there are some very interesting posts on heterogeneously typed lists:
One, for instance, is
http://jnordenberg.blogspot.com/2008/08/hlist-in-scala.html
http://jnordenberg.blogspot.com/2008/09/hlist-in-scala-revisited-or-scala.html
with an implementation at
http://www.assembla.com/wiki/show/metascala
A second great series of posts starts with
http://apocalisp.wordpress.com/2010/07/06/type-level-programming-in-scala-part-6a-heterogeneous-list%C2%A0basics/
and the series continues with parts "b, c, d" linked from part a.
Finally, there is a talk by Daniel Spiewak which touches on HOMaps:
http://vimeo.com/13518456
So all this is to say that perhaps you can build your solution from these ideas. Sorry that I don't have a specific example, but I admit I haven't tried these out yet myself!
Joshua Bloch has introduced a heterogeneous collection, which can be written in Java. I once adapted it a little. It now works as a value register. It is basically a wrapper around two maps. Here is the code and this is how you can use it. But this is just FYI, since you are interested in a Scala solution.
In Scala I would start by playing with tuples. Tuples are kind of heterogeneous collections. The values can be accessed, but don't have to be, through fields like _1, _2, _3 and so on. But you don't want that, you want names. This is how you can assign names to those:
scala> val tuple = (1, "word")
tuple: (Int, String) = (1, word)
scala> val (a, b) = tuple
a: Int = 1
b: String = word
So as mentioned before I would try to build a ResultSetWrapper around tuples.
If you want "extract the column value by name" on a plain bean instance, you can probably:
use reflects and CASTs, which you(and me) don't like.
use a ResultSetToJavaBeanMapper provided by most ORM libraries, which is a little heavy and coupled.
write a scala compiler plugin, which is too complex to control.
So, I guess a lightweight ORM with the following features may satisfy you:
support raw SQL
support a lightweight, declarative and adaptive ResultSetToJavaBeanMapper
nothing else.
I made an experimental project on that idea, but note it's still an ORM, and I just think it may be useful to you, or can give you some hints.
Usage:
declare the model:
// declare the DB schema
trait UserDef extends TableDef {
  var name = property[String]("name", title = Some("姓名"))
  var age1 = property[Int]("age", primary = true)
}

// declare the model; it mixes in properties as {var name = ""}
@BeanInfo class User extends Model with UserDef

// declare an object.
// it mixes in properties as {var name = Property[String]("name") }
// and object User is a Mapper[User], thus it can translate a ResultSet to a User instance.
object `package` {
  @BeanInfo implicit object User extends Table[User]("users") with UserDef
}
then call raw sql, the implicit Mapper[User] works for you:
val users = SQL("select name, age from users").all[User]
users.foreach{user => println(user.name)}
or even build a type safe query:
val users = User.q.where(User.age > 20).where(User.name like "%liu%").all[User]
for more, see unit test:
https://github.com/liusong1111/soupy-orm/blob/master/src/test/scala/mapper/SoupyMapperSpec.scala
project home:
https://github.com/liusong1111/soupy-orm
It uses "abstract Type" and "implicit" heavily to make the magic happen, and you can check source code of TableDef, Table, Model for detail.
Several million years ago I wrote an example showing how to use Scala's type system to push and pull values from a ResultSet. Check it out; it matches up with what you want to do fairly closely.
implicit val conn = connect("jdbc:h2:f2", "sa", "");
implicit val s: Statement = conn << setup;
val insertPerson = conn prepareStatement "insert into person(type, name) values(?, ?)";
for (name <- names)
  insertPerson<<rnd.nextInt(10)<<name<<!;
for (person <- query("select * from person", rs => Person(rs,rs,rs)))
  println(person.toXML);
for (person <- "select * from person" <<! (rs => Person(rs,rs,rs)))
  println(person.toXML);
Primitive types are used to guide the Scala compiler into selecting the right functions on the ResultSet.