Is there any way to specify a type in Scala dynamically?

I'm new to Spark and Scala, so sorry if this is a stupid question. I have a number of tables:
table_a, table_b, ...
and a number of corresponding types for these tables:
case class classA(...), case class classB(...), ...
Then I need to write methods that read data from these tables and create a Dataset:
def getDataFromSource: Dataset[classA] = {
val df: DataFrame = spark.sql("SELECT * FROM table_a")
df.as[classA]
}
The same goes for the other tables and types. Is there any way to avoid this routine code - I mean writing an individual function for each table - and get by with a single one? For example:
def getDataFromSource[T: Encoder](table_name: String): Dataset[T] = {
val df: DataFrame = spark.sql(s"SELECT * FROM $table_name")
df.as[T]
}
Then create a list of pairs (table_name, type_name):
val tableTypePairs = List(("table_a", classA), ("table_b", classB), ...)
Then call it using foreach:
tableTypePairs.foreach(tupl => getDataFromSource[what should I put here?](tupl._1))
Thanks in advance!

Something like this should work
def getDataFromSource[T](table_name: String, encoder: Encoder[T]): Dataset[T] =
spark.sql(s"SELECT * FROM $table_name").as(encoder)
val tableTypePairs = List(
"table_a" -> implicitly[Encoder[classA]],
"table_b" -> implicitly[Encoder[classB]]
)
tableTypePairs.foreach {
case (table, enc) =>
getDataFromSource(table, enc)
}
Note that this is a case of discarding a value, which is a bit of a code smell. Since Encoder is invariant, tableTypePairs isn't going to have a particularly useful type, and neither would something like
tableTypePairs.map {
case (table, enc) =>
getDataFromSource(table, enc)
}
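For reference, summoning the encoders with implicitly assumes the usual import spark.implicits._ is in scope. A minimal sketch of an equivalent, more explicit spelling (Encoders.product derives an Encoder for any case class):
import org.apache.spark.sql.{Encoder, Encoders}
// Encoders.product[T] derives an Encoder for a case class (Product) type;
// it is what importing spark.implicits._ would otherwise supply implicitly.
val tableTypePairs = List(
  "table_a" -> Encoders.product[classA],
  "table_b" -> Encoders.product[classB]
)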

One option is to pass the Class to the method; this way the generic type T will be inferred:
def getDataFromSource[T: Encoder](table_name: String, clazz: Class[T]): Dataset[T] = {
val df: DataFrame = spark.sql(s"SELECT * FROM $table_name")
df.as[T]
}
tableTypePairs.foreach { case (tableName, clazz) => getDataFromSource(tableName, clazz) }
But then I'm not sure how you'll be able to exploit this list of Datasets without .asInstanceOf.
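For reference, a hedged sketch of the call sites this enables (assuming import spark.implicits._ is in scope so the Encoder context bound resolves; the types and table names are the ones from the question):
import spark.implicits._
val dsA: Dataset[classA] = getDataFromSource("table_a", classOf[classA])
val dsB: Dataset[classB] = getDataFromSource("table_b", classOf[classB])
This keeps each Dataset at its precise type, at the cost of one explicit call per table instead of a single loop.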

Related

Scala iterator on pattern match

I need help iterating over this piece of code written in Spark/Scala with DataFrames. I'm new to Scala, so I apologize if my question seems trivial.
The function is very simple: given a DataFrame, it casts the column if there is a pattern match, otherwise it selects all fields as they are.
/* Load sources */
val df = sqlContext.sql("select id_vehicle, id_size, id_country, id_time from " + working_database + carPark);
val df2 = df.select(
  df.columns.map {
    case id_vehicle @ "id_vehicle" => df(id_vehicle).cast("Int").as(id_vehicle)
    case other => df(other)
  }: _*
)
This function, with pattern matching, works perfectly!
Now I have a question: is there any way to "iterate" this? In practice I need a function that, given a DataFrame, an Array[String] of columns (column_1, column_2, ...) and another Array[String] of types (int, double, float, ...), returns the same DataFrame with the right cast in the right position.
I need help :)
// Your supplied code fits nicely into this function
def castOnce(df: DataFrame, colName: String, typeName: String): DataFrame = {
  val colsCasted = df.columns.map {
    // backticks match the value of colName instead of binding a fresh variable that matches everything
    case `colName` => df(colName).cast(typeName).as(colName)
    case other => df(other)
  }
  df.select(colsCasted: _*)
}
def castMany(df: DataFrame, colNames: Array[String], typeNames: Array[String]): DataFrame = {
  assert(colNames.length == typeNames.length, "The lengths are different")
  val colsWithTypes: Array[(String, String)] = colNames.zip(typeNames)
  // foldLeft's accumulator (the DataFrame built so far) comes first, the next (column, type) pair second
  colsWithTypes.foldLeft(df) { case (curDf, (col, tpe)) => castOnce(curDf, col, tpe) }
}
When you have a function that you just need to apply many times to the same thing, a fold is often what you want.
The above code zips the two arrays together to combine them into one.
It then iterates through this list, applying your function to the DataFrame each time and passing the resulting DataFrame on to the next pair, and so on.
Based on your edit I filled in the function above. I don't have a compiler so I'm not 100% sure it's correct. Having written it out I am also left questioning my original approach. Below is a better way I believe, but I am leaving the previous one for reference.
def castMany(df: DataFrame, colNames: Array[String], typeNames: Array[String]): DataFrame = {
  assert(colNames.length == typeNames.length, "The lengths are different")
  val nameToType: Map[String, String] = colNames.zip(typeNames).toMap
  val newCols = df.columns.map { dfCol =>
    nameToType.get(dfCol).map { newType =>
      df(dfCol).cast(newType).as(dfCol)
    }.getOrElse(df(dfCol))
  }
  df.select(newCols: _*)
}
The above code creates a map from column name to the desired type.
Then, for each column in the DataFrame, it looks the type up in the map.
If a type exists, we cast the column to that new type. If the column does not appear in the map, we keep the column from the DataFrame as it is.
We then select these columns from the DataFrame.
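A hedged usage sketch of castMany (the column and type names are only examples, reusing the df from the question):
// cast id_vehicle to int and id_size to double, leave the other columns untouched
val dfCasted = castMany(df, Array("id_vehicle", "id_size"), Array("int", "double"))
dfCasted.printSchema()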

Strange "Task not serializable" with Spark

In my program, I have a method which returns some RDD; let's call it myMethod. It takes a non-serializable parameter, and let the RDD be of type Long (my real RDD is a Tuple type, but it contains only primitive types).
When I try something like this:
val x: NonSerializableThing = ...
val l: Long = ...
myMethod(x, l).map(res => res + l) // myMethod's RDD does NOT include the NonSerializableThing
I get Task not serializable.
When I replace res + l by res + 1L (i.e., some constant), it runs.
From the serialization trace, it tries to serialize the NonSerializableThing and chokes there, but I double-checked my method and this object never appears in an RDD.
When I try to collect the output of myMethod directly, i.e. with
myMethod(x, l).take(1) foreach println
I also get no problems.
The method uses the NonSerializableThing to get a (local) Seq of values on which multiple Cassandra queries are made (this is needed because I need to construct the partition keys to query for), like this:
def myMethod(x: NonSerializableThing, l: Long): RDD[Long] = {
  val someParam1: String = x.someProperty
  x.getSomeSeq.flatMap { y: OtherNonSerializableThing =>
    val someParam2: String = y.someOtherProperty
    y.someOtherSeq.map { someParam3: String =>
      sc.cassandraTable("fooKeyspace", "fooTable").
        select("foo").
        where("bar=? and quux=? and baz=? and l=?", someParam1, someParam2, someParam3, l).
        map(_.getLong(0))
    }
  }.reduce((a, b) => a.union(b))
}
getSomeSeq and someOtherSeq return plain, non-Spark Seqs.
What I want to achieve is to "union" multiple Cassandra queries.
What could be the problem here?
EDIT, Addendum, as requested by Jem Tucker:
What I have in my class is something like this:
implicit class MySparkExtension(sc: SparkContext) {
  def getThing(/* some parameters */): NonSerializableThing = { ... }
  def myMethod(x: NonSerializableThing, l: Long): RDD[Long] = {
    val someParam1: String = x.someProperty
    x.getSomeSeq.flatMap { y: OtherNonSerializableThing =>
      val someParam2: String = y.someOtherProperty
      y.someOtherSeq.map { someParam3: String =>
        sc.cassandraTable("fooKeyspace", "fooTable").
          select("foo").
          where("bar=? and quux=? and baz=? and l=?", someParam1, someParam2, someParam3, l).
          map(_.getLong(0))
      }
    }.reduce((a, b) => a.union(b))
  }
}
This is declared in a package object. The problem occurs here:
// SparkContext is already declared as sc
import my.pkg.with.extension._
val thing = sc.getThing(/* parameters */)
val l = 42L
val rdd = sc.myMethod(thing, l)
// until now, everything is OK.
// The following still works:
rdd.take(5) foreach println
// The following causes the exception:
rdd.map(x => x >= l).take(5) foreach println
// While the following works:
rdd.map(x => x >= 42L).take(5) foreach println
I tested this both entered "live" into a Spark shell and in an algorithm submitted via spark-submit.
What I now want to try (as per my last comment) is the following:
implicit class MySparkExtension(sc: SparkContext) {
  def getThing(/* some parameters */): NonSerializableThing = { ... }
  def myMethod(x: NonSerializableThing, l: Long): RDD[Long] = {
    val param1 = x.someProperty
    val partitionKeys =
      x.getSomeSeq.flatMap { y =>
        val param2 = y.someOtherProperty
        y.someOtherSeq.map(param3 => (param1, param2, param3, l))
      }
    queryTheDatabase(partitionKeys)
  }
  private def queryTheDatabase(partitionKeys: Seq[(String, String, String, Long)]): RDD[Long] = {
    partitionKeys.map(k =>
      sc.cassandraTable("fooKeyspace", "fooTable").
        select("foo").
        where("bar=? and quux=? and baz=? and l=?", k._1, k._2, k._3, k._4).
        map(_.getLong(0))
    ).reduce((a, b) => a.union(b))
  }
}
I believe this could work because the RDD is constructed in the method queryTheDatabase now, where no NonSerializableThing exists.
Another option might be: the NonSerializableThing would actually be serializable, except that I pass the SparkContext into it as an implicit constructor parameter. I think that if I make that field transient, the thing would (uselessly) get serialized but not cause any problems.
When you replace l with 1L, Spark no longer tries to serialize the class containing the method/variables, and so the error is not thrown.
You should be able to fix this by marking val x: NonSerializableThing = ... as transient, e.g.
@transient
val x: NonSerializableThing = ...
This means when the class is serialized this variable should be ignored.
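As an additional, hedged workaround (a common Spark pattern, not specific to this code): copy the value you actually need into a local val right before the closure, so the lambda captures only that val instead of the enclosing, non-serializable class:
// a plain Long is serializable on its own; the closure no longer references the outer scope
val localL = l
rdd.map(x => x >= localL).take(5) foreach println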

Scala Ordinal Method Call Aliasing

In Spark SQL we have Row objects which contain a list of records that make up a row (think Seq[Any]). A Row has ordinal accessors such as .getInt(0) or .getString(2).
Say ordinal 0 = ID and ordinal 1 = Name. It becomes hard to remember what ordinal is what, making the code confusing.
Say for example I have the following code
def doStuff(row: Row) = {
//extract some items from the row into a tuple;
(row.getInt(0), row.getString(1)) //tuple of ID, Name
}
The question becomes how could I create aliases for these fields in a Row object?
I was thinking I could create methods which take an implicit Row object:
def id(implicit row: Row) = row.getInt(0)
def name(implicit row: Row) = row.getString(1)
I could then rewrite the above as:
def doStuff(implicit row: Row) = {
//extract some items from the row into a tuple;
(id, name) //tuple of ID, Name
}
Is there a better/neater approach?
You could implicitly add those accessor methods to row:
implicit class AppRow(r: Row) extends AnyVal {
  def id: Int = r.getInt(0)
  def name: String = r.getString(1)
}
Then use it as:
def doStuff(row: Row) = {
val value = (row.id, row.name)
}
Another option is to convert Row into a domain-specific case class, which IMHO leads to more readable code:
case class Employee(id: Int, name: String)
val yourRDD: SchemaRDD = ???
val employees: RDD[Employee] = yourRDD.map { row =>
Employee(row.getInt(0), row.getString(1))
}
def doStuff(e: Employee) = {
(e.name, e.id)
}

Access database column names from a Table?

Let's say I have a table:
object Suppliers extends Table[(Int, String, String, String)]("SUPPLIERS") {
def id = column[Int]("SUP_ID", O.PrimaryKey)
def name = column[String]("SUP_NAME")
def state = column[String]("STATE")
def zip = column[String]("ZIP")
def * = id ~ name ~ state ~ zip
}
Table's database name
The table's database name can be accessed by going: Suppliers.tableName
This is supported by the Scaladoc on AbstractTable.
For example, the above table's database name is "SUPPLIERS".
Columns' database names
Looking through AbstractTable, getLinearizedNodes and indexes looked promising. No column names in their string representations though.
I assume that * means "all the columns I'm usually interested in." * is a MappedProjection, which has this signature:
final case class MappedProjection[T, P <: Product](
child: Node,
f: (P) ⇒ T,
g: (T) ⇒ Option[P])(proj: Projection[P])
extends ColumnBase[T] with UnaryNode with Product with Serializable
*.getLinearizedNodes contains a huge sequence of numbers, and I realized that at this point I'm just doing a brute-force inspection of everything in the API in the hope of finding the column names in the String.
Has anybody also encountered this problem before, or could anybody give me a better understanding of how MappedProjection works?
It requires you to rely on Slick internals, which may change between versions, but it is possible. Here is how it works for Slick 1.0.1: you have to go via the FieldSymbol. Then you can extract the information you want, similar to how columnInfo(driver: JdbcDriver, column: FieldSymbol): ColumnInfo does it.
To get a FieldSymbol from a Column you can use fieldSym(node: Node): Option[FieldSymbol] and fieldSym(column: Column[_]): FieldSymbol.
To get the (qualified) column names you can simply do the following:
Suppliers.id.toString
Suppliers.name.toString
Suppliers.state.toString
Suppliers.zip.toString
It's not explicitly stated anywhere that the toString will yield the column name, so your question is a valid one.
Now, if you want to programmatically get all the column names, then that's a bit harder. You could try using reflection to get all the methods that return a Column[_] and call toString on them, but it wouldn't be elegant. Or you could hack a bit and get a select * SQL statement from a query like this:
val selectStatement = DB withSession {
Query(Suppliers).selectStatement
}
And then parse out the column names.
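A rough sketch of that parsing step (illustrative only: the exact shape of the generated SQL and the identifier quoting depend on the driver and Slick version):
// selectStatement might look like: select x2."SUP_ID", x2."SUP_NAME", ... from "SUPPLIERS" x2
val lower = selectStatement.toLowerCase
val columnPart = selectStatement.substring(lower.indexOf("select") + "select".length, lower.indexOf(" from "))
val columnNames = columnPart.split(",").map(_.trim.split('.').last.replaceAll("\"|`", ""))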
This is the best I could do. If someone knows a better way then please share - I'm interested too ;)
Code is based on Lightbend Activator "slick-http-app".
slick version: 3.1.1
Added this method to the BaseDal:
def getColumns(): mutable.Map[String, Type] = {
  val columns = mutable.Map.empty[String, Type]
  def selectType(t: Any): Option[Any] = t match {
    case t: TableExpansion => Some(t.columns)
    case t: Select => Some(t.field)
    case _ => None
  }
  def selectArray(t: Any): Option[ConstArray[Node]] = t match {
    case t: TypeMapping => Some(t.child.children)
    case _ => None
  }
  def selectFieldSymbol(t: Any): Option[FieldSymbol] = t match {
    case t: FieldSymbol => Some(t)
    case _ => None
  }
  val t = selectType(tableQ.toNode)
  val c = selectArray(t.get)
  for (se <- c.get) {
    val col = selectType(se)
    val fs = selectFieldSymbol(col.get)
    columns += (fs.get.name -> fs.get.tpe)
  }
  columns
}
This method gets the column names (the real names in the DB) plus their types from the tableQ.
used imports are:
import scala.collection.mutable
import slick.ast._
import slick.util.ConstArray
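A small usage sketch (the DAL instance name is hypothetical):
// prints each database column name together with its Slick AST type
suppliersDal.getColumns().foreach { case (name, tpe) => println(s"$name: $tpe") }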

Allocation of Function Literals in Scala

I have a class that represents sales orders:
class SalesOrder(val f01:String, val f02:Int, ..., val f50:Date)
The fXX fields are of various types. I am faced with the problem of creating an audit trail of my orders. Given two instances of the class, I have to determine which fields have changed. I have come up with the following:
class SalesOrder(val f01:String, val f02:Int, ..., val f50:Date){
def auditDifferences(that:SalesOrder): List[String] = {
def diff[A](fieldName:String, getField: SalesOrder => A) =
if(getField(this) != getField(that)) Some(fieldName) else None
val diffList = diff("f01", _.f01) :: diff("f02", _.f02) :: ...
:: diff("f50", _.f50) :: Nil
diffList.flatten
}
}
I was wondering what the compiler does with all the _.fXX functions: are they instantiated just once (statically) and shared by all instances of my class, or will they be instantiated every time I create an instance of my class?
My worry is that, since I will use a lot of SalesOrder instances, it may create a lot of garbage. Should I use a different approach?
One clean way of solving this problem would be to use the standard library's Ordering type class. For example:
class SalesOrder(val f01: String, val f02: Int, val f03: Char) {
def diff(that: SalesOrder) = SalesOrder.fieldOrderings.collect {
case (name, ord) if !ord.equiv(this, that) => name
}
}
object SalesOrder {
val fieldOrderings: List[(String, Ordering[SalesOrder])] = List(
"f01" -> Ordering.by(_.f01),
"f02" -> Ordering.by(_.f02),
"f03" -> Ordering.by(_.f03)
)
}
And then:
scala> val orderA = new SalesOrder("a", 1, 'a')
orderA: SalesOrder = SalesOrder@5827384f
scala> val orderB = new SalesOrder("b", 1, 'b')
orderB: SalesOrder = SalesOrder@3bf2e1c7
scala> orderA diff orderB
res0: List[String] = List(f01, f03)
You almost certainly don't need to worry about the performance of your original formulation, but this version is (arguably) nicer for unrelated reasons.
Yes, that creates 50 short-lived functions. I don't think you should be worried unless you have manifest evidence that this causes a performance problem in your case.
But I would define a method that transforms a SalesOrder into a Map[String, Any]; then you would just have
trait SalesOrder {
def fields: Map[String, Any]
}
def diff(a: SalesOrder, b: SalesOrder): Iterable[String] = {
val af = a.fields
val bf = b.fields
af.collect { case (key, value) if bf(key) != value => key }
}
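A hypothetical concrete implementation of that trait, just to show the shape (field names reused from the question, class name made up):
class SalesOrderImpl(val f01: String, val f02: Int) extends SalesOrder {
  // expose each field once, keyed by its name; diff then works on any two orders
  def fields: Map[String, Any] = Map("f01" -> f01, "f02" -> f02)
}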
If the field names are indeed just incremental numbers, you could simplify
trait SalesOrder {
def fields: Iterable[Any]
}
def diff(a: SalesOrder, b: SalesOrder): Iterable[String] =
(a.fields zip b.fields).zipWithIndex.collect {
case ((av, bv), idx) if av != bv => f"f${idx + 1}%02d"
}