In Spark SQL we have Row objects which contain a list of records that make up a row (think Seq[Any]). A Row has ordinal accessors such as .getInt(0) or getString(2).
Say ordinal 0 = ID and ordinal 1 = Name. It becomes hard to remember which ordinal is which, making the code confusing.
Say for example I have the following code
def doStuff(row: Row) = {
  // extract some items from the row into a tuple
  (row.getInt(0), row.getString(1)) // tuple of (ID, Name)
}
The question becomes how could I create aliases for these fields in a Row object?
I was thinking I could create methods which take an implicit Row object:
def id(implicit row: Row) = row.getInt(0)
def name(implicit row: Row) = row.getString(1)
I could then rewrite the above as;
def doStuff(implicit row: Row) = {
  // extract some items from the row into a tuple
  (id, name) // tuple of (ID, Name)
}
Is there a better/neater approach?
You could implicitly add those accessor methods to Row:
implicit class AppRow(val r: Row) extends AnyVal {
  def id: Int = r.getInt(0)
  def name: String = r.getString(1)
}
Then use it as:
def doStuff(row: Row) = {
  val value = (row.id, row.name)
}
Another option is to convert Row into a domain-specific case class, which IMHO leads to more readable code:
case class Employee(id: Int, name: String)
val yourRDD: SchemaRDD = ???
val employees: RDD[Employee] = yourRDD.map { row =>
  Employee(row.getInt(0), row.getString(1))
}

def doStuff(e: Employee) = {
  (e.name, e.id)
}
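As an aside, on later Spark versions where DataFrame/Dataset replace SchemaRDD, the same idea is built in: with an implicit Encoder in scope you can convert rows into the case class directly. A minimal sketch, assuming a SparkSession named spark and a DataFrame whose columns line up with the case class fields:
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

case class Employee(id: Int, name: String)

val spark: SparkSession = SparkSession.builder().getOrCreate()
import spark.implicits._ // provides Encoder[Employee] for case classes

val df: DataFrame = ??? // assumed to have columns (id: Int, name: String)
val employees: Dataset[Employee] = df.as[Employee]

employees.map(e => (e.name, e.id)) // named fields instead of ordinals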
I'm new to Spark and Scala, so sorry for the stupid question. I have a number of tables:
table_a, table_b, ...
and a number of corresponding types for these tables:
case class classA(...), case class classB(...), ...
Then I need to write methods that read data from these tables and create datasets:
def getDataFromSource: Dataset[classA] = {
  val df: DataFrame = spark.sql("SELECT * FROM table_a")
  df.as[classA]
}
The same for other tables and types. Is there any way to avoid this routine code - I mean an individual function for each table - and get by with just one? For example:
def getDataFromSource[T: Encoder](table_name: String): Dataset[T] = {
  val df: DataFrame = spark.sql(s"SELECT * FROM $table_name")
  df.as[T]
}
Then create a list of pairs (table_name, type_name):
val tableTypePairs = List(("table_a", classA), ("table_b", classB), ...)
Then call it using foreach:
tableTypePairs.foreach(tupl => getDataFromSource[what should I put here?](tupl._1))
Thanks in advance!
Something like this should work:
def getDataFromSource[T](table_name: String, encoder: Encoder[T]): Dataset[T] =
  spark.sql(s"SELECT * FROM $table_name").as(encoder)
// an Encoder for a case class resolves after `import spark.implicits._`
val tableTypePairs = List(
  "table_a" -> implicitly[Encoder[classA]],
  "table_b" -> implicitly[Encoder[classB]]
)
tableTypePairs.foreach {
  case (table, enc) =>
    getDataFromSource(table, enc)
}
Note that this is a case of discarding a value, which is a bit of a code smell. And since Encoder is invariant, tableTypePairs isn't going to have a very useful element type, and neither would the result of something like
tableTypePairs.map {
  case (table, enc) =>
    getDataFromSource(table, enc)
}
One option is to pass the Class to the method; this way the generic type T will be inferred:
def getDataFromSource[T: Encoder](table_name: String, clazz: Class[T]): Dataset[T] = {
  val df: DataFrame = spark.sql(s"SELECT * FROM $table_name")
  df.as[T]
}
tableTypePairs.foreach { case (tableName, clazz) => getDataFromSource(tableName, clazz) }
But then I'm not sure how you'll be able to exploit this list of Datasets without .asInstanceOf.
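To make that caveat concrete, here is a hypothetical sketch (the Map and its names are illustrative): once the per-table results are stored together, the compiler can no longer prove what each entry holds, so getting a typed Dataset back requires an unchecked cast:
// assumes `import spark.implicits._` so Encoder[classA] / Encoder[classB] resolve
val datasets: Map[String, Dataset[_]] = Map(
  "table_a" -> getDataFromSource("table_a", classOf[classA]),
  "table_b" -> getDataFromSource("table_b", classOf[classB])
)

// the element type is Dataset[_], so recovering the concrete type is unchecked:
val as = datasets("table_a").asInstanceOf[Dataset[classA]]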
I have a basic enum type Currency that will include all major traded currencies, e.g. EUR, USD, JPY, etc. This code I can write or generate once. However, I'd also like to have a strong enum type for all currency pair combinations, e.g. EURCHF, USDCHF, etc. Is there any provision in Scala that would allow me to build such a derived enum type dynamically? I could also do it with some script generator from outside... but I wonder whether it would be possible.
object Ccy extends Enumeration {
  type Type = Value
  val USD = Value("USD")
  val CHF = Value("CHF")
  val EUR = Value("EUR")
  val GBP = Value("GBP")
  val JPY = Value("JPY")
}

object CcyPair extends Enumeration {
  type Type = Value
  // ??? Ccy.values.toSeq.combinations(2) ...
}
UPDATE: using the accepted answer as a reference, this was my solution implementation:
import scala.language.dynamics

object CcyPair extends Enumeration with Dynamic {
  type Type = Value

  /*
   * contains all currency combinations, including the symmetric AB and BA
   */
  private val byCcy: Map[(Ccy.Value, Ccy.Value), Value] =
    Ccy.values.toSeq.combinations(2).map { case Seq(c1, c2) =>
      Seq(
        (c1, c2) -> Value(c1.toString + c2.toString),
        (c2, c1) -> Value(c2.toString + c1.toString)
      )
    }.flatten.toMap

  /**
   * reverse lookup to find the currencies of a pair, needed to extract
   * the base and risk components (the key's first element is the base).
   */
  private val revByCcy = byCcy.toSeq.map { case ((ccyBase, ccyRisk), ccyPair) =>
    ccyPair -> (ccyBase, ccyRisk)
  }.toMap

  def apply(ccy1: Ccy.Value, ccy2: Ccy.Value): Value = {
    assert(ccy1 != ccy2, "currencies should be different")
    byCcy((ccy1, ccy2))
  }

  implicit class DecoratedCcyPair(ccyPair: CcyPair.Type) {
    def base: Ccy.Type = revByCcy(ccyPair)._1
    def risk: Ccy.Type = revByCcy(ccyPair)._2
    def name: String = ccyPair.toString()
  }

  def selectDynamic(ccyPair: String): Value = withName(ccyPair)
}
and then I can do things like:
val ccyPair = CcyPair.EURUSD
// or
val ccyPair = CcyPair(Ccy.EUR, Ccy.USD)
// and then do
println(ccyPair.name)
// and extract their parts like:
// print the base currency of the pair i.e. EUR
println(CcyPair.EURUSD.base)
// print the risk currency of the pair i.e. USD
println(CcyPair.EURUSD.risk)
There is no magic in Scala's Enumeration. The call to the Value function inside it simply modifies Enumeration's internal mutable structures, so you just have to call Value for each pair of currencies. The following code will work:
object CcyPair1 extends Enumeration {
  Ccy.values.toSeq.combinations(2).foreach { case Seq(c1, c2) =>
    Value(c1.toString + c2.toString)
  }
}
It's not very comfortable to work with, though: you can access the values only through the withName or values functions.
scala> CcyPair1.withName("USDEUR")
res20: CcyPair1.Value = USDEUR
But it's possible to extend this definition, for example to allow retrieving a CcyPair.Value by a pair of Ccy.Values, to allow access by object fields with Dynamic, or to provide other facilities you may need:
import scala.language.dynamics

object CcyPair2 extends Enumeration with Dynamic {
  val byCcy: Map[(Ccy.Value, Ccy.Value), Value] =
    Ccy.values.toSeq.combinations(2).map { case Seq(c1, c2) =>
      (c1, c2) -> Value(c1.toString + c2.toString)
    }.toMap

  def forCcy(ccy1: Ccy.Value, ccy2: Ccy.Value): Value = {
    assert(ccy1 != ccy2, "currencies should be different")
    if (ccy1 < ccy2) byCcy((ccy1, ccy2))
    else byCcy((ccy2, ccy1))
  }

  def selectDynamic(pairName: String): Value =
    withName(pairName)
}
This definition is a bit more useful:
scala> CcyPair2.forCcy(Ccy.USD, Ccy.EUR)
res2: CcyPair2.Value = USDEUR
scala> CcyPair2.forCcy(Ccy.EUR, Ccy.USD)
res3: CcyPair2.Value = USDEUR
scala> CcyPair2.USDCHF
res4: CcyPair2.Value = USDCHF
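As a quick sanity check (illustrative REPL output): with the five currencies above, combinations(2) produces C(5,2) = 10 unordered pairs, so the derived enumeration should contain exactly 10 values:
scala> CcyPair2.values.size
res5: Int = 10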
The scenario is similar to the question How to better parse the same table twice with Anorm?, however the solutions described there can no longer be used.
In a scenario where a Message has 2 users, I need to parse the from_user and to_user with SQL joins.
case class User(id: Long, name: String)
case class Message(id: Long, body: String, to: User, from: User)
def userParser(alias: String): RowParser[User] = {
  get[Long](alias + "_id") ~ get[String](alias + "_name") map {
    case id ~ name => User(id, name)
  }
}
val parser: RowParser[Message] = {
  userParser("from_user") ~
    userParser("to_user") ~
    get[Long]("messages.id") ~
    get[String]("messages.body") map {
      case from ~ to ~ id ~ body => Message(id, body, to, from)
    }
}
// Is more aliasing possible here?
val aliaser: ColumnAliaser = ColumnAliaser.withPattern((0 to 2).toSet, "from_user.")
SQL"""
SELECT from_user.* , to_user.*, message.* FROM MESSAGE
JOIN USER from_user on from_user.id = message_from_user_id
JOIN USER to_user on to_user.id = message.to_user
"""
.asTry(parser.*, aliaser)
If I'm right in thinking that you want to apply multiple ColumnAliasers with different aliasing policies to the same query, it's important to understand that ColumnAliaser is "just" a specific implementation of Function[(Int, ColumnName), Option[String]]. So it can be defined and composed like any function, and the factory functions in its companion object are simply conveniences.
import anorm.{ ColumnAliaser, ColumnName }

val aliaser = new ColumnAliaser {
  val as1 = ColumnAliaser.withPattern((0 to 2).toSet, "from_user.")
  val as2 = ColumnAliaser.withPattern((2 to 4).toSet, "to_user.")

  def apply(column: (Int, ColumnName)): Option[String] =
    as1(column).orElse(as2(column))
}
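For completeness, a usage sketch plugging the combined aliaser into the query from the question. The index ranges are illustrative and should match the actual column order of your SELECT; note that the two ranges above overlap at position 2, where as1 wins because of orElse:
SQL"""
  SELECT from_user.*, to_user.*, message.* FROM MESSAGE
  JOIN USER from_user ON from_user.id = message.from_user_id
  JOIN USER to_user ON to_user.id = message.to_user
""".asTry(parser.*, aliaser) // parser.* turns the RowParser into a ResultSetParser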
Every time I try to create a new table in Cassandra with a new TableDef I end up with a clustering order of ascending, but I'm trying to get descending.
I'm using Cassandra 2.1.10, Spark 1.5.1, and Datastax Spark Cassandra Connector 1.5.0-M2.
I'm creating a new TableDef
val table = TableDef("so", "example",
Seq(ColumnDef("parkey", PartitionKeyColumn, TextType)),
Seq(ColumnDef("ts", ClusteringColumn(0), TimestampType)),
Seq(ColumnDef("name", RegularColumn, TextType)))
rdd.saveAsCassandraTableEx(table, SomeColumns("key", "time", "name"))
What I'm expecting to see in Cassandra is
CREATE TABLE so.example (
  parkey text,
  ts timestamp,
  name text,
  PRIMARY KEY ((parkey), ts)
) WITH CLUSTERING ORDER BY (ts DESC);
What I end up with is
CREATE TABLE so.example (
  parkey text,
  ts timestamp,
  name text,
  PRIMARY KEY ((parkey), ts)
) WITH CLUSTERING ORDER BY (ts ASC);
How can I force it to set the clustering order to descending?
I was not able to find a direct way of doing this. Additionally, there are a lot of other options you may want to specify. I ended up extending ColumnDef and TableDef and overriding the cql method in TableDef. An example of the solution I came up with is below. If someone has a better way, or this becomes natively supported, I'd be happy to change the answer.
// Scala Enum
object ClusteringOrder {
  abstract sealed class Order(val ordinal: Int) extends Ordered[Order] with Serializable {
    def compare(that: Order) = that.ordinal compare this.ordinal
    def toInt: Int = this.ordinal
  }

  case object Ascending extends Order(0)
  case object Descending extends Order(1)

  def fromInt(i: Int): Order = values.find(_.ordinal == i).get
  val values = Set(Ascending, Descending)
}
// extend the ColumnDef case class to add enum support
class ColumnDefEx(columnName: String, columnRole: ColumnRole, columnType: ColumnType[_],
    indexed: Boolean = false,
    val clusteringOrder: ClusteringOrder.Order = ClusteringOrder.Ascending)
  extends ColumnDef(columnName, columnRole, columnType, indexed)

// mimic the ColumnDef object
object ColumnDefEx {
  def apply(columnName: String, columnRole: ColumnRole, columnType: ColumnType[_],
      indexed: Boolean, clusteringOrder: ClusteringOrder.Order): ColumnDef =
    new ColumnDefEx(columnName, columnRole, columnType, indexed, clusteringOrder)

  def apply(columnName: String, columnRole: ColumnRole, columnType: ColumnType[_],
      clusteringOrder: ClusteringOrder.Order = ClusteringOrder.Ascending): ColumnDef =
    new ColumnDefEx(columnName, columnRole, columnType, false, clusteringOrder)

  // copied from the ColumnDef object
  def apply(column: ColumnMetadata, columnRole: ColumnRole): ColumnDef = {
    val columnType = ColumnType.fromDriverType(column.getType)
    new ColumnDefEx(column.getName, columnRole, columnType, column.getIndex != null)
  }
}
// extend the TableDef case class to override the cql method
class TableDefEx(keyspaceName: String, tableName: String, partitionKey: Seq[ColumnDef],
    clusteringColumns: Seq[ColumnDef], regularColumns: Seq[ColumnDef], options: String)
  extends TableDef(keyspaceName, tableName, partitionKey, clusteringColumns, regularColumns) {

  override def cql = {
    val stmt = super.cql
    val ordered =
      if (clusteringColumns.nonEmpty)
        s"$stmt\r\nWITH CLUSTERING ORDER BY (${clusteringColumnOrder(clusteringColumns)})"
      else stmt
    appendOptions(ordered, options)
  }

  private[this] def clusteringColumnOrder(clusteringColumns: Seq[ColumnDef]): String =
    clusteringColumns.map {
      case c: ColumnDefEx if c.clusteringOrder == ClusteringOrder.Descending =>
        s"${c.columnName} DESC"
      case c: ColumnDef => s"${c.columnName} ASC"
    }.mkString(", ")

  private[this] def appendOptions(stmt: String, opts: String) =
    if (opts.isEmpty) stmt
    else if (stmt.contains("WITH") && opts.startsWith("WITH")) s"$stmt\r\nAND ${opts.substring(4)}"
    else if (!stmt.contains("WITH") && opts.startsWith("AND")) s"$stmt\r\nWITH ${opts.substring(3)}"
    else s"$stmt\r\n$opts"
}
// mimic the TableDef object but return the new TableDefEx
object TableDefEx {
  def apply(keyspaceName: String, tableName: String, partitionKey: Seq[ColumnDef],
      clusteringColumns: Seq[ColumnDef], regularColumns: Seq[ColumnDef],
      options: String = "") =
    new TableDefEx(keyspaceName, tableName, partitionKey, clusteringColumns, regularColumns,
      options)

  def fromType[T: ColumnMapper](keyspaceName: String, tableName: String): TableDef =
    implicitly[ColumnMapper[T]].newTable(keyspaceName, tableName)
}
This allowed me to create new tables in this manner:
val table = TableDefEx("so", "example",
  Seq(ColumnDef("parkey", PartitionKeyColumn, TextType)),
  Seq(ColumnDefEx("ts", ClusteringColumn(0), TimestampType, ClusteringOrder.Descending)),
  Seq(ColumnDef("name", RegularColumn, TextType)))

rdd.saveAsCassandraTableEx(table, SomeColumns("key", "time", "name"))
Using Slick's lifted embedding, I define a class extending AbstractTable, with some primary keys spanning multiple columns. For example:
class Foo(tag: Tag) extends AbstractTable[(some, tuple, type)](tag, name) {
  def col1 = ...
  def col2 = ...
  def col3 = ...
  def * = (col1, col2, col3)
  def pk = primaryKey(name, (col1, col2))
  ...
}
Somewhere in the code, I hold a PrimaryKey reference that corresponds to that primary key (the code in question is generic and must not depend on the knowledge of which specific tables and which columns are defined). I also hold a reference to a TableElementType tuple corresponding to a row in this table, as defined by the * projection.
How do I programmatically obtain the primary key projection of that element? That is, given the PrimaryKey and TableElementType references as arguments, I want to obtain the (val1, val2) tuple out of the (val1, val2, val3) TableElementType tuple, in this example. I didn't find readily available methods to achieve that in the Slick documentation.
AbstractTable has a different signature: it takes 3 parameters, not 2:
abstract class AbstractTable[T](val tableTag: Tag, val schemaName: Option[String], val tableName: String) extends ColumnBase[T]
Here I'm extending Table for convenience; what you can do is twist the shape in your table's type parameter:
class Foo(tag: Tag) extends Table[((String, String), String)](tag, "Foo") {
  def col1 = column[String]("c1")
  def col2 = column[String]("c2")
  def col3 = column[String]("c3")
  def * = ((col1, col2), col3)
  def pk = primaryKey("pk", (col1, col2))
}
Note that if you have (String, String, String) then the projection function must also match that shape:
def * = (col1, col2, col3)
// this below doesn't match the given shape:
// def * = ((col1, col2), col3)
If you don't want to, just define a custom method:
def someF(results: List[(String, String, String)]): List[((String, String), String)] =
  results.map { case (a, b, c) => ((a, b), c) }
The code is untested because at the moment I don't have a database available, but it compiles.
Edit:
You could use this:
def tupleKeys[T <: Table[S], S](table: TableQuery[T]): Tuple2[Option[PrimaryKey], Option[PrimaryKey]] = {
val keys: List[PrimaryKey] = table.baseTableRow.primaryKeys.toList
if(keys.length == 2) (Option(keys(0)), Option(keys(1)))
else (Option(keys(0)), None)
}
But as I said, you don't know how many keys you have before running the code, while a tuple has a fixed length. I wrapped the values in Option because you may or may not have 2 keys, and you can't return either a Tuple1 or a Tuple2, since they are different types. You could use this method in the projection function to twist the keys' shape, but it still won't be as dynamic as you want.
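A hypothetical usage sketch, assuming the Foo table defined above (the printed labels are illustrative; PrimaryKey carries the name passed to primaryKey):
val foos = TableQuery[Foo]
val (firstKey, secondKey) = tupleKeys(foos)

firstKey.foreach(pk => println(s"first primary key: ${pk.name}")) // e.g. "pk"
secondKey.foreach(pk => println(s"second primary key: ${pk.name}"))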